Discovering Data

Because Data Collection jobs can filter the data they collect, it's useful to first find out what data a job will bring in. For this purpose, Stream provides Discovery jobs.

A Discovery job is effectively a dry run of a full job. It walks the data store, finding all data that matches the specified filter, and keeps track of what it would retrieve if a full job were run. This allows you to test out your filters before an actual full run, tuning your run to get just the data you want.
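To make the dry-run idea concrete, here is a minimal Python sketch. It is not Stream's implementation. It assumes a hypothetical object key layout of `<year>/<month>/<day>/<src_zone>/<dest_zone>/<file>`, and the `fields_from_key` and `discover` helpers are invented for illustration. The point is only that discovery inspects object paths and records what matches, without fetching any content.

```python
from typing import Callable, Dict, Iterable, List

# Hypothetical partition layout, assumed for illustration only:
#   <year>/<month>/<day>/<src_zone>/<dest_zone>/<file>
FIELD_NAMES = ["year", "month", "day", "src_zone", "dest_zone"]


def fields_from_key(key: str) -> Dict[str, str]:
    """Map the leading path segments of an object key to field names."""
    segments = key.split("/")
    return dict(zip(FIELD_NAMES, segments))


def discover(keys: Iterable[str],
             matches: Callable[[Dict[str, str]], bool]) -> List[str]:
    """Dry run: return the keys a full run would fetch, without reading them."""
    return [key for key in keys if matches(fields_from_key(key))]


# Made-up object keys standing in for a listing of the data store.
keys = [
    "2021/05/06/trusted/trusted/flow-0001.json.gz",
    "2021/05/06/trusted/untrusted/flow-0002.json.gz",
    "2021/05/07/trusted/trusted/flow-0003.json.gz",
]

# Example predicate: keep only objects partitioned under day "06".
would_fetch = discover(keys, lambda f: f["day"] == "06")
print(len(would_fetch), "objects would be collected")
```

Because only the keys are examined, a sketch like this stays cheap even when the objects themselves are large, which is the property that makes a dry run useful for tuning filters.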

Let's imagine that we're investigating a security incident that occurred between May 5, 2021 and May 10, 2021. In our imaginary incident, we need to look at east-west internal traffic. Going back to our partitioning scheme, we can use the fields we captured in the collector to filter the data.

important

If you're not already here: Click Cribl on the top row of tabs, then with Manage active in Stream's top nav, select Data and click Sources.
Under the Collectors section, click S3.
Click the Run button next to your configured collector, and the Run configuration modal will appear.
Set the Mode to Discovery.
Change the Time Range to Absolute.
In the Earliest field, use the date picker to select May 5, 2021 and hit Ok.
In the Latest field, use the date picker to select May 10, 2021 and hit Ok.

In the Filter field, enter:

src_zone=='trusted' && dest_zone=='trusted'

At this point, the Run configuration modal should show Discovery mode, the absolute time range, and the filter you entered.

Click Run.

Once the job starts, you'll be returned to the collector configuration page, and you'll notice a job ID appear in the Latest Ad Hoc Run column.

Clicking that job ID opens the Job Inspector. Click its Discover Results tab to see the results of the Discovery job.

As you can see, this shows approximately how many objects (files) a Full Run collection job will pull from the data store.
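The run configured in this walkthrough can be pictured as two checks applied to each discovered object: the absolute time range, then the filter expression. The Python sketch below illustrates that logic under stated assumptions. The object list is made up, the bounds are treated as inclusive (an assumption, not confirmed Stream behavior), and `in_time_range` and `passes_filter` are hypothetical helper names.

```python
from datetime import date

# Hypothetical discovered objects with fields already extracted from their
# paths. The field names mirror the filter used above; the values are made up.
objects = [
    {"day": date(2021, 5, 4), "src_zone": "trusted", "dest_zone": "trusted"},
    {"day": date(2021, 5, 6), "src_zone": "trusted", "dest_zone": "trusted"},
    {"day": date(2021, 5, 8), "src_zone": "trusted", "dest_zone": "untrusted"},
    {"day": date(2021, 5, 9), "src_zone": "trusted", "dest_zone": "trusted"},
]

EARLIEST = date(2021, 5, 5)
LATEST = date(2021, 5, 10)


def in_time_range(obj):
    """Absolute time range check (inclusive bounds assumed for this sketch)."""
    return EARLIEST <= obj["day"] <= LATEST


def passes_filter(obj):
    """Python equivalent of: src_zone=='trusted' && dest_zone=='trusted'"""
    return obj["src_zone"] == "trusted" and obj["dest_zone"] == "trusted"


# Only objects that pass both checks count toward the discovery result.
matched = [o for o in objects if in_time_range(o) and passes_filter(o)]
print(f"{len(matched)} of {len(objects)} objects would be collected")
```

In this made-up data set, the May 4 object falls outside the time range and the May 8 object fails the zone filter, so only two objects count toward the result.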

note

Try changing the filter and re-running the Discovery job to see how the results change. The filter is an effective place to zero in on the data you're looking for before you ingest it.