Pipelines: More Than Meets the Eye

TL;DR

Pipelines and Packs are the heart of Stream; they're where data transformations occur. Each Pipeline or Pack offers a ‘stream load’ of Functions you can apply to shape your data. Some big use cases include:

  • Redact/mask Personally Identifiable Information (PII)
  • Reduce duplicate or empty fields and events
  • Reorganize ‘dirty’ data into neat and indexable formats
  • Enrich data by adding fields or editing existing ones with lookups
  • Add or correct timestamps

Remember: Each Data Route connects a Source to a Destination using a Pipeline.

Pipelines are where your data gets transformed and enriched via Functions. Stream ships with many Functions, empowering you to do almost anything you would like with your data.
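
Before we look at a real one, here's a minimal sketch of the idea in TypeScript: a Pipeline is an ordered list of Functions, and each event flows through them one at a time. The names `Event`, `PipelineFn`, and `runPipeline` are illustrative, not Cribl's actual API:

```typescript
// An event is just a bag of fields; returning null drops the event.
type Event = Record<string, unknown>;
type PipelineFn = (event: Event) => Event | null;

// Apply each Function in order; a dropped event short-circuits the rest.
function runPipeline(functions: PipelineFn[], event: Event): Event | null {
  let current: Event | null = event;
  for (const fn of functions) {
    if (current === null) {
      return null; // a previous Function dropped the event
    }
    current = fn(current);
  }
  return current;
}
```

The `null` return is the key trick in this sketch: it lets any Function drop an event entirely, which is exactly the behavior Drop and Sample rely on later in this lesson.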

important
  1. Select the Processing submenu and click Pipelines
  2. Click palo_alto_traffic near the bottom of the left-hand list
    NOTE: You may need to expand the first column on the left-hand side to see the names of the Pipelines.

Here we have a Pipeline with a very descriptive name: palo_alto_traffic. It deals with Palo Alto firewall logs and will be a great example for exploring a few Functions and the Preview Pane.

First, let’s cover some things we can do with Functions. Using Functions, you can:

  • Redact or mask sensitive information in Windows events (see the sketch after this list)
  • Cut down empty fields to save space in a SIEM
  • Reformat one vendor’s data structure into another vendor’s format
  • Enrich data by applying GeoIP for better location awareness
  • Adjust timestamps into the correct timezone

There are a lot more possibilities; these are just some good examples.
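
To make the first bullet concrete: reusing the `Event` and `PipelineFn` types from the earlier sketch, a masking Function might look roughly like this. The `message` field name and the SSN regex are assumptions for illustration, not what Stream actually ships:

```typescript
// Illustrative masking Function: replace SSN-shaped substrings in `message`.
const maskSsn: PipelineFn = (event) => {
  const message = event["message"];
  if (typeof message !== "string") return event; // nothing to mask
  return {
    ...event,
    message: message.replace(/\b\d{3}-\d{2}-\d{4}\b/g, "REDACTED"),
  };
};
```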

The selected Pipeline showcases reducing data using the Drop and Sample Functions, as well as removing specific incorrect timestamp fields with Eval. Rather than just describing what it does, let’s see how this Pipeline affects our data in real time by loading up a sample!

important
  1. In the right pane, click Simple next to pan_firewall_traffic.log
  2. At the top left of the Preview, click OUT so that we can see the results of the Pipeline’s Functions
  3. In the left pane, click the On switch to disable the first Eval Function
  4. Click Save at the bottom of the Pipeline

What we just did is load sample data that was captured from a Source. Now we can experiment with the Pipeline’s Functions to see how they affect the data. The third step, by the way, turned off a filter that ensures only data from specific Sources enters this Pipeline. Since our sample doesn’t match the filter, we need to turn the filter off.
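
For a sense of what such a filter does: it’s an expression evaluated against each event, and only matching events are let through. Here’s a rough sketch in the same illustrative model; the `__inputId` field name and the value it checks are assumptions for illustration, not this Pipeline’s actual expression:

```typescript
// Illustrative filter: only pass events that came from a matching Source.
const fromPanFirewall = (event: Event): boolean =>
  typeof event["__inputId"] === "string" &&
  (event["__inputId"] as string).startsWith("pan");
```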

Now we can see on the right side that the first two events are greyed out and the third has a new sampled field. We go into this in future courses, but we’ll summarize what’s happening in the Functions here (steps 4, 6, and 8 are sketched in code after the list):

  1. (Disabled) Filter for the Pipeline
  2. Extract some information from the event to use in later filters
  3. Comment for the next Function
  4. Drop certain events with subtype start
  5. Comment for the next Function (you really should read these; they go into more detail than this summary!)
  6. Reduce events by keeping one sample event for every five empty events and one for every ten trusted events
  7. Comment for the next Function
  8. Drop any date fields
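
Reusing the illustrative `Event`/`PipelineFn` model from earlier, here is a rough sketch of how steps 4, 6, and 8 behave. Every field name (`subtype`, `bytes`, `zone`) and matching condition below is an assumption based on the summary above, not the Pipeline’s real configuration:

```typescript
// Step 4 (Drop): drop events whose subtype is "start" (field name assumed).
const dropStartEvents: PipelineFn = (event) =>
  event["subtype"] === "start" ? null : event;

// Step 6 (Sample): keep roughly 1 of every `rate` matching events,
// tagging survivors with a `sampled` field like the one in the Preview.
function makeSampler(matches: (e: Event) => boolean, rate: number): PipelineFn {
  let seen = 0;
  return (event) => {
    if (!matches(event)) return event; // non-matching events pass through
    seen += 1;
    if (seen % rate !== 0) return null; // drop the other rate - 1 events
    return { ...event, sampled: rate };
  };
}

// Step 8 (Eval): remove fields whose names look like dates (pattern assumed).
const dropDateFields: PipelineFn = (event) => {
  const kept: Event = {};
  for (const [key, value] of Object.entries(event)) {
    if (!/^date/i.test(key)) kept[key] = value;
  }
  return kept;
};

// Wire the sketch together in the order the Pipeline applies them:
const sketchPipeline: PipelineFn[] = [
  dropStartEvents,
  makeSampler((e) => e["bytes"] === 0, 5),         // "empty" events, 1 in 5 (assumed)
  makeSampler((e) => e["zone"] === "trusted", 10), // "trusted" events, 1 in 10 (assumed)
  dropDateFields,
];
```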

Ultimately, this Pipeline helps reduce extraneous data in a company’s firewall logs. How much? Let’s see!

important

Click the small bar graph icon (Pipeline Diagnostics) in the top right of the Preview Pane

Wow! These statistics show that this Pipeline was able to reduce the amount of data by 75%! That isn’t completely indicative of real-world numbers; we’d expect somewhere near 30%. But think about it: if your SIEM vendor charges you by how much data you store in your SIEM, reducing your storage by 30-75% is huge! Some enterprises send terabytes of data to their SIEM per day; at 1 TB a day, even a 30% reduction is roughly 300 GB you no longer pay to ingest and store.

Alright, only three more stops on this flight. Let’s go!