Pipelines: More Than Meets the Eye
Pipelines and Packs are the heart of Stream, where data transformations occur. Each Pipeline/Pack offers a ‘stream load’ of Functions you can apply to shape your data. Some big use cases include:
- Redact/mask Personally Identifiable Information (PII)
- Reduce duplicate or empty fields and events
- Reorganize ‘dirty’ data into neat and indexable formats
- Enrich data by adding fields or editing existing ones with lookups
- Add or correct timestamps
Remember: Each Data Route connects a Source to a Destination using a Pipeline.
Pipelines are where your data gets transformed and enriched via Functions. Stream ships with many Functions, empowering you to do almost anything you would like with your data.
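To make that concrete, here’s a minimal TypeScript sketch of the kind of masking transformation a Function performs. This is a conceptual model only, not Stream’s actual Mask Function; the event shape and the regex are assumptions for illustration.

```typescript
// Hypothetical event shape for illustration; Stream events carry a raw
// payload plus extracted fields.
interface LogEvent {
  _raw: string;
  [field: string]: unknown;
}

// Mask anything that looks like a US Social Security Number, keeping the
// last four digits -- similar in spirit to a masking Function.
function maskSSN(event: LogEvent): LogEvent {
  return {
    ...event,
    _raw: event._raw.replace(/\b\d{3}-\d{2}-(\d{4})\b/g, "XXX-XX-$1"),
  };
}

// Example: the SSN is redacted before the event leaves the Pipeline.
console.log(maskSSN({ _raw: "user=jdoe ssn=123-45-6789 action=login" })._raw);
// => "user=jdoe ssn=XXX-XX-6789 action=login"
```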
- Select the `Processing` submenu and click `Pipelines`
- Click `palo_alto_traffic` near the bottom of the left-hand list
NOTE: You may need to expand the first column on the left-hand side to see the names of the Pipelines.
Here we have a Pipeline labeled very descriptively: `palo_alto_traffic`. It deals with Palo Alto firewall logs and will be a great example for a few Functions and the Preview Pane.
First, let’s cover some things we can do with Functions. Using Functions, you can:
- Redact or mask information, such as erroneous messages in Windows events
- Cut down empty fields to save space in a SIEM
- Reformat one vendor’s data structure into another vendor’s format
- Enrich data by applying GeoIP for better location awareness
- Adjust timestamps into the correct timezone
There are a lot more possibilities; these are just some good examples.
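As a quick illustration of that last one, here’s a TypeScript sketch of normalizing a device timestamp to UTC. The `offsetHours` parameter and the input format are assumptions for illustration, not how Stream actually implements it.

```typescript
// A minimal sketch of the "adjust timestamps" use case: a device reports
// local time with no timezone, and we normalize it to epoch seconds (UTC).
// The offsetHours parameter is an assumption for illustration.
function normalizeTimestamp(localTime: string, offsetHours: number): number {
  // Treat the incoming string as if it were UTC, then back out the
  // device's offset from UTC.
  const parsedMs = Date.parse(localTime + "Z");
  return Math.floor(parsedMs / 1000) - offsetHours * 3600;
}

// A device five hours ahead of UTC reported "2024-01-15T14:30:00":
console.log(normalizeTimestamp("2024-01-15T14:30:00", 5));
// => epoch seconds for 2024-01-15T09:30:00 UTC
```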
The selected Pipeline showcases reducing data by using `Drop` and `Sample`, as well as dropping specific incorrect timestamp fields (`Eval`). Let’s quickly see what it does. Actually, let’s see how this Pipeline affects our data in real time by loading up a sample!
- In the right pane, click `Simple` next to `pan_firewall_traffic.log`
- At the top left of the Preview, click `OUT` so that we can see the results of the Pipeline
- In the left pane, click the `On` switch to disable the first `Eval` Function
- Click `Save` at the bottom of the Pipeline
What we just did is load up sample data that was captured from a Source. Now we can experiment with Functions in the Pipeline to see how they affect the data. The third step, by the way, turned off a filter that makes sure only data from specific Sources gets into this Pipeline. Since we are using a sample, the data doesn’t trip the filter, so we need to turn it off.
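Stream filters are JavaScript expressions evaluated once per event. As a rough sketch of what we just disabled (the field name and value here are hypothetical, not the actual expression in this Pipeline):

```typescript
// Hypothetical sketch of a per-event filter: only events whose source
// metadata matches are let through. Field name and value are assumptions
// for illustration.
const allowEvent = (event: { __inputId?: string }): boolean =>
  event.__inputId?.startsWith("pan_firewall") ?? false;

// Sample data loaded in the Preview Pane carries no such Source metadata,
// so it would never match -- which is why we switched the Function off.
```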
Now we can see on the right side that the first two events are greyed out and the third has a new `sampled` field. We go into this in future courses, but we’ll summarize what’s happening in the Functions here:
- (Disabled) Filter for the Pipeline
- Extract some information from the event to use as filters later on
- Comment for the next Function
- Drop certain events with subtype `start`
- Comment for the next Function (you should be reading these, actually; they go into more detail than this summary!)
- Reduce events by keeping one sample event for every five `empty` events and one for every 10 `trusted` events (sketched after this list)
- Comment for the next Function
- Drop any date fields
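Here’s a rough TypeScript sketch of the drop-and-sample logic summarized above. It’s a conceptual model, not Stream’s actual Drop and Sampling Functions; the field names and counter scheme are assumptions based on the summary.

```typescript
interface FirewallEvent {
  subtype?: string;   // e.g. "start"
  flags?: string;     // "empty" or "trusted" in this sketch
  sampled?: number;   // rate stamped onto kept sample events
  [field: string]: unknown;
}

// Running counters per sample class, mimicking "keep 1 in N".
const counters: Record<string, number> = {};

function process(event: FirewallEvent): FirewallEvent | null {
  // Drop: discard session-start events outright.
  if (event.subtype === "start") return null;

  // Sample: keep 1 in 5 "empty" events and 1 in 10 "trusted" events.
  const rate = event.flags === "empty" ? 5
             : event.flags === "trusted" ? 10
             : 1;
  if (rate > 1) {
    const key = event.flags as string;
    counters[key] = (counters[key] ?? 0) + 1;
    if (counters[key] % rate !== 1) return null; // not the kept sample
    event.sampled = rate; // mark the survivor, like the field we saw
  }
  return event;
}
```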
Ultimately, this Pipeline is used to help reduce erroneous data in a company’s firewall logs. How much? Let’s see!
Click the small bar graph icon (`Pipeline Diagnostics`) in the top right of the Preview Pane.
Wow! These statistics show that this Pipeline was able to reduce the amount of data by 75%! Now, that isn’t completely indicative of real-world numbers; we would expect somewhere near 30%. But think about it: if your SIEM vendor charges you by how much data you store in your SIEM, reducing your storage by 30-75% is huge! Some enterprises send terabytes of data to their SIEM per day!
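For a back-of-the-envelope sense of the savings, here’s the math with assumed numbers (the ingest volume and price are illustrative only):

```typescript
// Back-of-the-envelope savings estimate. The ingest volume and per-GB
// price are assumptions for illustration only.
const dailyIngestGB = 2000; // assume 2 TB/day sent to the SIEM
const reduction = 0.3;      // the conservative 30% figure from above
const pricePerGB = 0.1;     // assumed SIEM ingest price, $/GB

const savedGBPerDay = dailyIngestGB * reduction;       // 600 GB/day
const savedPerYear = savedGBPerDay * pricePerGB * 365; // ~$21,900
console.log(`${savedGBPerDay} GB/day ~= $${savedPerYear.toFixed(0)}/year`);
```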
Alright, only three more stops on this flight, let’s go!