Pipelines: More Than Meets the Eye
Pipelines and Packs are the heart of Stream, where data transformations occur. Each Pipeline or Pack offers a ‘stream load’ of Functions you can apply to shape your data. Some big use cases include (the first is sketched in code just after this list):
- Redact or mask Personally Identifiable Information (PII)
- Reduce duplicate or empty fields and events
- Reorganize ‘dirty’ data into neat and indexable formats
- Enrich data by adding fields or editing existing ones with lookups
- Add or correct timestamps
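To make that first use case concrete, here is a minimal TypeScript sketch of the kind of masking a Pipeline Function performs. The event shape, the SSN regex, and the replacement token are illustrative assumptions, not Stream’s actual Mask Function implementation:

```typescript
// Illustrative sketch only: mimics what a Mask-style Function does to an event.
// The Event shape and the SSN pattern are assumptions for this example.
type Event = Record<string, unknown> & { _raw: string };

// Anything shaped like a US Social Security number gets replaced with a token.
const SSN_PATTERN = /\b\d{3}-\d{2}-\d{4}\b/g;

function maskPII(event: Event): Event {
  return { ...event, _raw: event._raw.replace(SSN_PATTERN, "XXX-XX-XXXX") };
}

// The SSN in _raw is masked before the event continues downstream.
const masked = maskPII({ _raw: "user=jdoe ssn=123-45-6789 action=login" });
console.log(masked._raw); // user=jdoe ssn=XXX-XX-XXXX action=login
```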
Remember: Each Data Route / QuickConnect connects a Source to a Destination using Passthru, a Pipeline, or a Pack.
Pipelines are where your data gets transformed and enriched via Functions. Stream ships with many Functions, empowering you to do almost anything you would like with your data.
- If you did not explore the Palo Alto Networks Pack on the last page, click into it using the cribl-palo-alto-networks ID
- Click the Pipelines submenu
- Click pan_traffic near the bottom of the left-hand list
Here we have a Pipeline labeled very descriptively: pan_traffic. It deals with Palo Alto firewall logs and will be a great example for a few Functions and the Preview Pane.
First, let’s cover some things we can do with Functions. Using Functions, you can:
- Redact or mask sensitive information, such as usernames in Windows events
- Drop empty fields to save space in a SIEM
- Reformat one vendor’s data structure into another vendor’s format
- Enrich data by applying GeoIP for better location awareness
- Adjust timestamps to the correct time zone
There are a lot more possibilities; these are just some good examples.
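For instance, here is a minimal TypeScript sketch of the second item, dropping empty fields before data lands in a SIEM. The event shape and the definition of “empty” are assumptions for illustration; in Stream you would do this with a Function rather than hand-written code:

```typescript
// Illustrative sketch only: strips empty fields so they never reach the SIEM.
// What counts as "empty" here (null, undefined, "") is an assumption.
type Event = Record<string, unknown>;

function dropEmptyFields(event: Event): Event {
  return Object.fromEntries(
    Object.entries(event).filter(
      ([, value]) => value !== null && value !== undefined && value !== ""
    )
  );
}

const slimmed = dropEmptyFields({ host: "fw01", user: "", bytes: 512, note: null });
console.log(slimmed); // { host: "fw01", bytes: 512 }
```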
The selected Pipeline showcases reducing data by using Drop and Sample, as well as removing specific incorrect timestamp fields (Eval). Let’s see how this Pipeline affects our data in real time by loading up a sample!
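Before we do, here is a rough TypeScript sketch of what Drop and Sample do conceptually. The subtype filter and the 1-in-N rate below are illustrative assumptions, not the Pack’s actual configuration:

```typescript
// Illustrative sketch only: approximates Drop and Sample behavior in plain code.
// The subtype filter and sampling rate are assumptions for this example.
type Event = { subtype?: string } & Record<string, unknown>;

// Drop: discard events matching a filter (here, logs with subtype "start").
const keepAfterDrop = (event: Event): boolean => event.subtype !== "start";

// Sample: keep roughly 1 out of every n matching events.
function makeSampler(n: number): (event: Event) => boolean {
  let seen = 0;
  return () => seen++ % n === 0;
}

const sampleOneInFive = makeSampler(5);
const events: Event[] = [
  { subtype: "start" },
  { subtype: "end" },
  { subtype: "end" },
];
const reduced = events.filter(keepAfterDrop).filter(sampleOneInFive);
console.log(reduced.length); // fewer events flow downstream
```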
- In the right pane, click Simple next to pan_traffic.log
- At the top left of the Preview, click OUT so that we can see the results of the Pipeline’s Functions
What we just did is load up sample data that was captured from a Source. Now we can experiment with Functions in the Pipeline to see how they affect the data. Here is a summary of what’s happening in the functions:
#1. A comment that explains everything in this list 😉
#2. Simple eval to set the host, sourcetype, source, and index, and clean up the _raw message to remove the syslog header
#3. The Parser function extracts all field values to the top level for event processing
#4. If the pan_device_name_as_host Global Variable is set to true, use the dvc_name field as the host value
#5-6. Use the Auto Timestamp function to set the event timestamp to the "generated time"
#7-8. Sample events (optional)
#9-10. Drop logs with subtype of start (optional)
#11-16. Reserialization of events back into CSV, dropping fields that are not relevant
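To make the parse-and-reserialize steps (#3 and #11-16) concrete, here is a minimal TypeScript sketch of that round trip. The field names and the keep-list are illustrative assumptions; in Stream, Functions like Parser handle this declaratively:

```typescript
// Illustrative sketch only: lift CSV values in _raw to top-level fields,
// then write back only the fields we care about. Field names are assumptions.
type Event = Record<string, string>;

// Parser-style step: extract delimited values to top-level fields.
function parseCSV(raw: string, fieldNames: string[]): Event {
  const values = raw.split(",");
  return Object.fromEntries(fieldNames.map((name, i) => [name, values[i] ?? ""]));
}

// Serialize-style step: rebuild _raw from a keep-list, dropping everything else.
function toCSV(event: Event, keepFields: string[]): string {
  return keepFields.map((name) => event[name] ?? "").join(",");
}

const fields = ["recv_time", "serial", "type", "subtype", "src", "dst"];
const event = parseCSV(
  "2024/01/01 00:00:00,0011223344,TRAFFIC,end,10.0.0.1,8.8.8.8",
  fields
);
const slimRaw = toCSV(event, ["recv_time", "type", "src", "dst"]); // serial and subtype dropped
console.log(slimRaw); // 2024/01/01 00:00:00,TRAFFIC,10.0.0.1,8.8.8.8
```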
Ultimately, this Pipeline is used to help reduce extraneous data in a company’s firewall logs. How much? Let’s see!
- Click to enable Functions 7-8
- Click to enable Functions 9-10
- Click Save
- Click the small bar graph icon (Pipeline Diagnostics) in the top right of the Preview Pane
Wow! These statistics show that this Pipeline was able to reduce the amount of data by 75%! That isn’t completely indicative of real-world numbers; in production, something closer to 30% is typical. But think about it: if your SIEM vendor charges you by how much data you store, reducing your volume by 30-75% is huge. Some enterprises send terabytes of data to their SIEM per day; at 1 TB/day, even a 30% reduction saves roughly 300 GB every day, around 9 TB per month!
Conclusion time!