Suppression
In this section, we'll cover how to use the Suppress Function to deduplicate data streams.
Suppression aims to deduplicate a stream, and works well when a stream contains a lot of duplicative data. Examples include error storms, or status reported as a stream of repeated "still running!"–type log messages. In our data stream, we're going to pick an arbitrary low-cardinality field to show how much reduction can be achieved.
note
Open business_event Pipeline
If you're already in the business_event Pipeline, you can skip this.
- Select the Processing submenu, then click Pipelines, and click the business_event Pipeline.
For this example, we want a few more events in our capture to show off these features. So we're going to run a 100-second capture.
important
Run 100-Second Capture
- In the right pane, make sure Sample Data has focus.
- Click Capture New.
- In the Capture Sample Data dialog, click Capture.
- For Capture Time (sec), enter 100.
- For Capture Up to N Events, enter 100.
- Click Start.
- Go grab coffee and come back in 100 seconds. (A blue status bar shows the capture's progress until it's complete.)
- When the capture has completed, click Save as Sample File, bringing up the sample file settings dialog.
- In that dialog, set File Name to be_big.log.
- At the bottom right, click Save.
Now, let's add our Suppress Function. Suppress will emit Number to Allow events every Suppression Period seconds for each value returned by Key Expression.
Key Expression warrants some explanation. Like many areas in the product, we're giving you the full power of JavaScript here. Suppress will emit only Number to Allow events per Suppression Period (sec) for each unique value of Key Expression. Since it's an expression, we can combine multiple fields together, or manipulate fields, to determine uniqueness. For this example, we're going to pick a field (accountType) which has only two values in our dataset, to show how suppression works.
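To see how Key Expression, Number to Allow, and Suppression Period fit together, here's a minimal JavaScript sketch of key-based suppression. This is a hypothetical model, not Cribl's actual implementation, and the event fields (accountType, host) are just examples:

```javascript
// Hypothetical sketch of key-based suppression (not Cribl's implementation):
// allow up to N events per period for each unique key value.
function makeSuppressor(keyFn, numberToAllow, periodSec) {
  const windows = new Map(); // key -> { windowStart, allowed, suppressCount }
  return function (event, nowSec) {
    const key = keyFn(event);
    let w = windows.get(key);
    if (!w || nowSec - w.windowStart >= periodSec) {
      // Start a fresh suppression window for this key.
      w = { windowStart: nowSec, allowed: 0, suppressCount: 0 };
      windows.set(key, w);
    }
    if (w.allowed < numberToAllow) {
      w.allowed++;
      return true; // emit this event
    }
    w.suppressCount++;
    return false; // drop (suppress) this event
  };
}

// Because Key Expression is full JavaScript, keys can be a single field
// or a combination of fields:
const byAccount = makeSuppressor(e => e.accountType, 1, 30);
const byHostAndAccount = makeSuppressor(e => `${e.host}:${e.accountType}`, 1, 30);
```

With `byAccount`, the first event for each accountType in each 30-second window passes through, and the rest are dropped until the window resets.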
important
Add Suppress Function
- Make sure Manage > Processing > Pipelines is selected in the top nav, with the business_event Pipeline displayed.
- Click + Function at the top, search for Suppress, and click it.
- Scroll down and click into the new Suppress Function.
- For Filter, use the expression sourcetype=='business_event'.
- For Key Expression, enter accountType.
- Click Save.
Scroll through the Preview pane on the right. You should see that most of the events have been dropped. Let's disable Show Dropped Events to clean up the list.
important
Disable Show Dropped Events
- At the top of the Preview pane, click the gear icon next to Select Fields.
- Toggle Show Dropped Events to Off.
As you scroll through this cleaned list, you should see two events emitted every 30 seconds, one per accountType. If you click the Chart icon next to Select Fields, and look at the chart's rightmost column, you should see a ~92% reduction in event count.
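The ~92% figure follows from the suppression settings rather than from anything special in the data. As a rough sanity check (assuming 1 event allowed per key per 30-second period, and that our 100 captured events span roughly 100 seconds):

```javascript
// Back-of-the-envelope estimate of the reduction from suppression.
// Assumes: ~100 events over ~100 seconds, 2 accountType values,
// 1 event allowed per key per 30-second window.
const captureSec = 100;
const totalEvents = 100;
const keys = 2;                              // accountType has two values
const windows = Math.ceil(captureSec / 30);  // 4 suppression windows
const emitted = windows * keys;              // up to 8 events pass through
const reduction = 1 - emitted / totalEvents; // ~0.92, i.e. ~92% reduction
```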
If you leave Suppress running, you can also see changes in the real-time stats that Stream collects. Scroll down to later events, and you can see suppressCount set to the number of events we dropped for that accountType in that interval. With this information, you can estimate the amount of original data that would have been emitted.
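The estimate itself is simple arithmetic: each emitted event's suppressCount records how many events were dropped for its key in that interval, so the original count is roughly the emitted count plus the sum of the suppressCounts. A sketch, using hypothetical suppressCount values:

```javascript
// Estimate the original event count from emitted events' suppressCount.
// The suppressCount values below are illustrative, not from the dataset.
const emittedEvents = [
  { accountType: 'free', suppressCount: 46 },
  { accountType: 'paid', suppressCount: 44 },
];
const originalCount =
  emittedEvents.length +
  emittedEvents.reduce((sum, e) => sum + e.suppressCount, 0); // 2 + 90 = 92
```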
Now click Quick Stats at the upper right, then click Outputs, and you should see output that looks like this:
By comparing the output byte counts to the Events IN above, you can see that suppression is drastically reducing the output volume for this dataset.
Before moving on, disable the Suppress Function.
important
Disable Suppress
- In the Suppress Function's header row, toggle On to Off.
- Click Save.
Next, we're going to look at sampling.