Pipelines: More Than Meets the Eye
Pipelines and Packs are the heart of Stream, where data transformations occur. Each Pipeline or Pack offers a ‘stream load’ of Functions you can apply to shape your data. Some big use cases include (the first is sketched in code just after this list):
- Redact or mask Personally Identifiable Information (PII)
- Reduce duplicate or empty fields and events
- Reorganize ‘dirty’ data into neat and indexable formats
- Enrich data by adding fields or editing existing ones with lookups
- Add or correct timestamps
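To make that first use case concrete, here is a minimal TypeScript sketch of the kind of masking a Pipeline Function performs. The event shape, the SSN regex, and the replacement token are illustrative assumptions, not Stream’s actual Mask Function implementation:

```typescript
// Illustrative sketch only: mimics what a Mask-style Function does to an event.
// The Event shape and the SSN pattern are assumptions for this example.
type Event = Record<string, unknown> & { _raw: string };

// Anything shaped like a US Social Security number gets replaced with a token.
const SSN_PATTERN = /\b\d{3}-\d{2}-\d{4}\b/g;

function maskPII(event: Event): Event {
  return { ...event, _raw: event._raw.replace(SSN_PATTERN, "XXX-XX-XXXX") };
}

// The SSN in _raw is masked before the event continues downstream.
const masked = maskPII({ _raw: "user=jdoe ssn=123-45-6789 action=login" });
console.log(masked._raw); // user=jdoe ssn=XXX-XX-XXXX action=login
```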
Remember: Each Data Route / QuickConnect connects a Source to a Destination using Passthru, a Pipeline, or a Pack.
Pipelines are where your data gets transformed and enriched via Functions. Stream ships with many Functions, empowering you to do almost anything you would like with your data.
- If you did not explore the Palo Alto Networks Pack on the last page, click into it using the cribl-palo-alto-networks ID
- Click the Pipelines submenu
- Click pan_traffic near the bottom of the left-hand list
Here we have a Pipeline labeled very descriptively: pan_traffic. It deals with Palo Alto firewall logs and will be a great example for a few Functions and the Preview Pane.
First, let’s cover some things we can do with Functions. Using Functions, you can:
- Redact or mask sensitive information, such as usernames in Windows events
- Drop empty fields to save space in a SIEM
- Reformat one vendor’s data structure into another vendor’s format
- Enrich data by applying GeoIP for better location awareness
- Adjust timestamps to the correct time zone
There are a lot more possibilities; these are just some good examples.
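For instance, here is a minimal TypeScript sketch of the second item, dropping empty fields before data lands in a SIEM. The event shape and the definition of “empty” are assumptions for illustration; in Stream you would do this with a Function rather than hand-written code:

```typescript
// Illustrative sketch only: strips empty fields so they never reach the SIEM.
// What counts as "empty" here (null, undefined, "") is an assumption.
type Event = Record<string, unknown>;

function dropEmptyFields(event: Event): Event {
  return Object.fromEntries(
    Object.entries(event).filter(
      ([, value]) => value !== null && value !== undefined && value !== ""
    )
  );
}

const slimmed = dropEmptyFields({ host: "fw01", user: "", bytes: 512, note: null });
console.log(slimmed); // { host: "fw01", bytes: 512 }
```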
The selected Pipeline showcases reducing data by using Drop and Sample, as well as removing specific incorrect timestamp fields (Eval). Let’s see how this Pipeline affects our data in real time by loading up a sample!
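Before we do, here is a rough TypeScript sketch of what Drop and Sample do conceptually. The subtype filter and the 1-in-N rate below are illustrative assumptions, not the Pack’s actual configuration:

```typescript
// Illustrative sketch only: approximates Drop and Sample behavior in plain code.
// The subtype filter and sampling rate are assumptions for this example.
type Event = { subtype?: string } & Record<string, unknown>;

// Drop: discard events matching a filter (here, logs with subtype "start").
const keepAfterDrop = (event: Event): boolean => event.subtype !== "start";

// Sample: keep roughly 1 out of every n matching events.
function makeSampler(n: number): (event: Event) => boolean {
  let seen = 0;
  return () => seen++ % n === 0;
}

const sampleOneInFive = makeSampler(5);
const events: Event[] = [
  { subtype: "start" },
  { subtype: "end" },
  { subtype: "end" },
];
const reduced = events.filter(keepAfterDrop).filter(sampleOneInFive);
console.log(reduced.length); // fewer events flow downstream
```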
- In the right pane, click Simple next to pan_traffic.log
- At the top left of the Preview, click OUT so that we can see the results of the Pipeline’s Functions
What we just did is load up sample data that was captured from a Source. Now we can experiment with Functions in the Pipeline to see how they affect the data. Here is a summary of what’s happening in the functions:
#1. A comment that explains everything in this list 😉
#2. Simple eval to set the host, sourcetype, source, and index, and clean up the _raw message to remove the syslog header
#3. The Parser function extracts all field values to the top level for event processing
#4. If the pan_device_name_as_host Global Variable is set to true, use the dvc_name field as the host value
#5-6. Use the Auto Timestamp function to set the event timestamp to the "generated time"
#7-8. Sample events (optional)
#9-10. Drop logs with subtype of start (optional)
#11-16. Reserialization of events back into CSV, dropping fields that are not relevant
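To make the parse-and-reserialize steps (#3 and #11-16) concrete, here is a minimal TypeScript sketch of that round trip. The field names and the keep-list are illustrative assumptions; in Stream, Functions like Parser handle this declaratively:

```typescript
// Illustrative sketch only: lift CSV values in _raw to top-level fields,
// then write back only the fields we care about. Field names are assumptions.
type Event = Record<string, string>;

// Parser-style step: extract delimited values to top-level fields.
function parseCSV(raw: string, fieldNames: string[]): Event {
  const values = raw.split(",");
  return Object.fromEntries(fieldNames.map((name, i) => [name, values[i] ?? ""]));
}

// Serialize-style step: rebuild _raw from a keep-list, dropping everything else.
function toCSV(event: Event, keepFields: string[]): string {
  return keepFields.map((name) => event[name] ?? "").join(",");
}

const fields = ["recv_time", "serial", "type", "subtype", "src", "dst"];
const event = parseCSV(
  "2024/01/01 00:00:00,0011223344,TRAFFIC,end,10.0.0.1,8.8.8.8",
  fields
);
const slimRaw = toCSV(event, ["recv_time", "type", "src", "dst"]); // serial and subtype dropped
console.log(slimRaw); // 2024/01/01 00:00:00,TRAFFIC,10.0.0.1,8.8.8.8
```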
Ultimately, this Pipeline is used to help reduce extraneous data in a company’s firewall logs. How much? Let’s see!
- Click to enable Functions 7-8
- Click to enable Functions 9-10
- Click Save
- Click the small bar graph icon (Pipeline Diagnostics) in the top right of the Preview Pane
Wow! These statistics show that this Pipeline was able to reduce the amount of data by 75%! That isn’t completely indicative of real-world numbers; in production, something closer to 30% is typical. But think about it: if your SIEM vendor charges you by how much data you store, reducing your volume by 30-75% is huge. Some enterprises send terabytes of data to their SIEM per day; at 1 TB/day, even a 30% reduction saves roughly 300 GB every day, around 9 TB per month!
Conclusion time!