
Pipelines: More Than Meets the Eye

TL;DR

Pipelines and Packs are the heart of Stream, where data transformations occur. Each Pipeline/Pack offers a ‘stream load’ of Functions you can apply to shape your data. Some big use cases include:

  • Redact/Mask Personally Identifiable Information (PII)
  • Reduce duplicate or empty fields and events
  • Reorganize ‘dirty’ data into neat and indexable formats
  • Enrich data by adding fields or editing existing ones with lookups
  • Add or correct timestamps

Remember: Each Data Route / QuickConnect connects a Source to a Destination using Passthru, a Pipeline, or a Pack.

Pipelines are where your data gets transformed and enriched via Functions. Stream ships with many Functions, empowering you to do almost anything you would like with your data.
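
Conceptually, a Pipeline is an ordered list of Functions that each event flows through, where any Function can transform the event or drop it entirely. Here is a minimal sketch of that idea, assuming simplified semantics; the type names and the two toy Functions are illustrative, not Cribl’s internal API.

```typescript
// A minimal sketch of the pipeline idea: an event flows through an
// ordered list of Functions, each of which may transform or drop it.
// These names are illustrative assumptions, not Cribl's internals.
type LogEvent = Record<string, unknown> | null;
type PipelineFn = (e: LogEvent) => LogEvent;

function runPipeline(fns: PipelineFn[], event: LogEvent): LogEvent {
  for (const fn of fns) {
    if (event === null) break; // a previous Function dropped the event
    event = fn(event);
  }
  return event;
}

// An "eval"-style Function that sets a field, and a "drop"-style
// Function that discards matching events.
const setIndex: PipelineFn = (e) => (e ? { ...e, index: "firewall" } : e);
const dropStart: PipelineFn = (e) =>
  e && e["subtype"] === "start" ? null : e;

console.log(runPipeline([setIndex, dropStart], { subtype: "end" }));
// => { subtype: "end", index: "firewall" }
```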

important
  1. If you did not explore the Palo Alto Networks Pack on the last page, click into it using the cribl-palo-alto-networks ID
  2. Click the Pipelines submenu
  3. Click pan_traffic near the bottom of the left-hand list

Here we have a very descriptively labeled Pipeline, pan_traffic. It deals with Palo Alto firewall logs and will be a great example for exploring a few Functions and the Preview Pane.

First, let’s cover some things we can do with Functions. Using Functions, you can:

  • Redact or mask sensitive information, such as PII in Windows events
  • Cut down empty fields to save space in a SIEM
  • Reformat one vendor’s data structure into another vendor’s format
  • Enrich data by applying GeoIP for better location awareness
  • Adjust timestamps into the correct timezone

There are a lot more possibilities; these are just some good examples.
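
To make the first two items above concrete, here are two hedged sketches: one masks a PII-shaped pattern, the other removes empty fields. The regex, field names, and placeholder text are assumptions for illustration, not the configuration an actual Stream Function ships with.

```typescript
// Mask anything shaped like a US Social Security number in a raw event.
// The pattern and placeholder are illustrative assumptions.
function maskSSN(raw: string): string {
  return raw.replace(/\b\d{3}-\d{2}-\d{4}\b/g, "XXX-XX-XXXX");
}

// Drop empty fields so they never reach (and never bill against) the SIEM.
function dropEmptyFields(e: Record<string, unknown>): Record<string, unknown> {
  return Object.fromEntries(
    Object.entries(e).filter(([, v]) => v !== null && v !== "")
  );
}

console.log(maskSSN("user=jdoe ssn=123-45-6789 action=login"));
// => "user=jdoe ssn=XXX-XX-XXXX action=login"
console.log(dropEmptyFields({ user: "jdoe", note: "", host: null }));
// => { user: "jdoe" }
```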

The selected Pipeline showcases reducing data using Drop and Sample, as well as dropping specific incorrect timestamp fields with Eval. Rather than just reading about what it does, let’s see how this Pipeline affects our data in real time by loading up a sample!
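
Before we do, a quick note on Sample: it keeps only a fraction of matching events. Here is a rough sketch of that idea, assuming a simple keep-1-in-N rule; Stream’s actual Sampling function is filter-driven and configurable, so the modulo rule and 4:1 ratio below are illustrative only.

```typescript
// Keep roughly 1 out of every n events and discard the rest.
// The keep-1-in-n modulo rule is an illustrative assumption.
function sample<T>(events: T[], n: number): T[] {
  return events.filter((_, i) => i % n === 0); // keep 1 in n
}

const batch = Array.from({ length: 8 }, (_, i) => ({ seq: i }));
console.log(sample(batch, 4)); // => [ { seq: 0 }, { seq: 4 } ]
```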

important
  1. In the right pane, click Simple next to pan_traffic.log
  2. At the top left of the Preview, click OUT so that we can see the results of the Pipeline’s Functions

What we just did is load up sample data that was captured from a Source. Now we can experiment with the Functions in the Pipeline to see how they affect the data. Here is a summary of what’s happening in the Functions:

  1. A comment that explains everything in this list 😉
  2. A simple Eval to set the host, sourcetype, source, and index, and to clean up the _raw message by removing the syslog header
  3. The Parser Function extracts all field values to the top level for event processing
  4. If the pan_device_name_as_host Global Variable is set to true, use the dvc_name field as the host value
  5-6. The Auto Timestamp Function sets the event timestamp to the "generated time"
  7-8. Sample events (optional)
  9-10. Drop logs with a subtype of start (optional)
  11-16. Reserialize events back into CSV, dropping fields that are not relevant
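
Putting the walkthrough together, here is a condensed sketch of the transformation. The syslog-header regex and the field positions are invented for illustration; the real Pack parses the full PAN-OS traffic-log CSV schema.

```typescript
// A condensed, hedged sketch of the walkthrough above. Field positions
// and the header regex are illustrative assumptions, not the Pack's
// actual parsing rules.
function processTrafficLog(line: string): string | null {
  // 2. Clean up _raw by stripping the syslog header before the CSV payload.
  const payload = line.replace(/^<\d+>[^,]*?:\s*/, "");
  // 3. Extract field values to the top level (positions are made up here).
  const cols = payload.split(",");
  const event = { generated_time: cols[0], subtype: cols[2], action: cols[3] };
  // 9-10. Drop logs with a subtype of start.
  if (event.subtype === "start") return null;
  // 11-16. Reserialize back into CSV, keeping only the relevant fields.
  return [event.generated_time, event.subtype, event.action].join(",");
}

console.log(processTrafficLog("<14>Jan 1 host: 2024/01/01,1,start,allow"));
// => null (the event is dropped)
```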

Ultimately, this Pipeline helps reduce extraneous data in a company’s firewall logs. How much? Let’s see!

important
  1. Click to enable Functions 7-8
  2. Click to enable Functions 9-10
  3. Click Save
  4. Click the small bar graph icon (Pipeline Diagnostics) in the top right of the Preview Pane

Wow! These statistics show that this Pipeline was able to reduce the amount of data by 75%! That isn’t completely indicative of real-world numbers; in practice, we would expect somewhere near 30%. But think about it: if your SIEM vendor charges you by how much data you store, reducing your volume by 30-75% is huge! Some enterprises send terabytes of data to their SIEM per day!
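
For reference, the headline figure is simple arithmetic: percent reduction is the difference between bytes in and bytes out, divided by bytes in. The volumes below are illustrative, not the sandbox’s actual measurements.

```typescript
// Percent reduction = (bytes in - bytes out) / bytes in * 100.
// Sample volumes are illustrative assumptions.
function percentReduction(bytesIn: number, bytesOut: number): number {
  return ((bytesIn - bytesOut) / bytesIn) * 100;
}

console.log(percentReduction(1_000, 250)); // => 75
```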

Conclusion time!