Dataset Processing

Now that we have seen HOW we are connecting to the Amazon S3 object store, we should go back to the cribl_search_sample dataset to review WHAT data we are searching (or expecting to search) as a part of the dataset.

Trust the Process
  1. Click X on the Dataset Provider modal.
  2. Click Datasets in the left navigation bar.
  3. Click cribl_search_sample.
  4. Click Processing in the left navigation bar.

Datatypes

Here we can see all the datatypes that are configured for the cribl_search_sample dataset. More specifically, these are the rulesets detailing how the different types of data within the Amazon S3 bucket should be broken into events, timestamped, and parsed. A dataset can be associated with one or more rulesets. Rulesets are evaluated top‑down and consist of an ordered list of rules, which are also evaluated top‑down. Any data that is not captured by one of the datatypes configured here will be processed using the System Default Rule, represented at the bottom of the list.

If you're not first, you're last

Because the first matching rule wins, it is beneficial to put the rulesets that match most of the data first. Most events are then matched after only a few checks, which speeds up processing.
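To make the ordering effect concrete, here is a minimal sketch of top-down, first-match evaluation. It is not Cribl Search's implementation: the `Rule` class, the `classify` function, the Python callables standing in for Filter expressions, and every datatype name other than aws_vpcflow are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Rule:
    name: str                      # the rule's Name
    filter: Callable[[str], bool]  # stand-in for the Filter expression
    datatype: str                  # value of the Datatype field


# Hypothetical fallback that matches everything, like the System Default Rule.
SYSTEM_DEFAULT = Rule("System Default Rule", lambda raw: True, "default")


def classify(raw: str, rulesets: list) -> tuple:
    """Return (datatype, rules_checked) for the first rule that matches."""
    checked = 0
    for ruleset in rulesets:       # rulesets are walked top-down
        for rule in ruleset:       # rules inside a ruleset are walked top-down
            checked += 1
            if rule.filter(raw):
                return rule.datatype, checked
    return SYSTEM_DEFAULT.datatype, checked


# Illustrative rulesets; the filters are deliberately simplistic.
aws = [
    Rule("VPC Flow", lambda raw: raw.startswith("2 "), "aws_vpcflow"),
    Rule("CloudTrail", lambda raw: '"eventSource"' in raw, "aws_cloudtrail"),
]
web = [Rule("Access log", lambda raw: '"GET ' in raw, "apache_access")]

flow_line = "2 123456789012 eni-0abc 10.0.0.1 10.0.0.2 443 49152 6 10 840"
print(classify(flow_line, [aws, web]))  # matching ruleset listed first
print(classify(flow_line, [web, aws]))  # same result, one extra check
```

Both calls return the same datatype, but the second one evaluates an extra rule before matching. Across many events that extra work adds up, which is the reason for putting the most frequently matching ruleset at the top.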

Mix n' Match

Because there are lots of different types of data that can be searched within an Amazon S3 bucket (or within any dataset provider), it is useful to note that datatypes are not explicitly tied to any dataset provider or dataset.

Ex. AWS Datatypes could be applied to a dataset that is searching events from an Azure Blobs dataset provider.

With respect to datasets, datatypes can have a one-to-one, one-to-many, many-to-one, or many-to-many relationship.
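One way to picture these relationships is as a plain mapping from each dataset to its ordered list of datatype rulesets. This is only a sketch: the dictionary layout, the `datasets_using` helper, the azure_blob_logs dataset, and the Web Datatypes ruleset are hypothetical names, not anything Cribl Search exposes.

```python
from collections import defaultdict

# Each dataset lists the datatype rulesets applied to it, in order.
dataset_datatypes = {
    "cribl_search_sample": ["AWS Datatypes", "Web Datatypes"],  # one dataset, many rulesets
    "azure_blob_logs": ["AWS Datatypes"],                       # ruleset reused by a second dataset
}


def datasets_using(mapping: dict) -> dict:
    """Invert the mapping: for each ruleset, list the datasets that use it."""
    usage = defaultdict(list)
    for dataset, rulesets in mapping.items():
        for ruleset in rulesets:
            usage[ruleset].append(dataset)
    return dict(usage)


print(datasets_using(dataset_datatypes))
```

Here AWS Datatypes is shared by two datasets while cribl_search_sample uses two rulesets, so the same small structure covers the one-to-many, many-to-one, and many-to-many cases.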

Datatype Configurations

Alright Alice, let's see how far the rabbit hole goes and take a closer look at the rules behind the datatypes listed here.

important
  1. Click Settings in the top navigation bar.
  2. Ensure Datatypes is highlighted in the left navigation bar.

This is where datatypes are created and configured. If you look with your special eyes you'll notice that in the Library column you can see which datatypes are custom-made and which are shipped with Cribl Search. If you can recall, AWS Datatypes was the first datatype listed for processing in our cribl_search_sample dataset, so it stands to reason we expect most of the data returned to match the rules represented there.

important

Click AWS Datatypes.

In the settings for this datatype you'll once again find an immutable ID field, followed by a description and a Tags input.

Tags are an optional mechanism for organizing configs. Below that you'll find the rules.

Rules are Rules

Rules ultimately determine what datatype is applied to an event. Cribl Search checks the data against the rules in order, and the first rule that matches is used to process the search results.

The name of the rule is captured in the Name field. Following the name is the Filter field, which contains a JavaScript filter expression that the event is matched against. If the event matches the Filter, it is given the datatype defined in the Datatype field. Unlike datasets and dataset providers, which can both be referred to by their IDs when searching, datatypes are referred to by the value in the Datatype field, not by the ID of the datatype or the name of the rule.

Ex.

dataset="web_logs" datatype="aws_vpcflow"

The Event Breaker section configures how Cribl Search converts data into discrete events. Cribl Search provides several different formats for event breakers.

The process of datatyping is as follows:

  1. Event Breaking – Breaks raw bytes into discrete events.
  2. Timestamping – Assigns timestamps to events.
  3. Parsing – Parses fields from events.
  4. Add Fields – Adds additional fields.
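The four steps above can be sketched as a small pipeline. This is a toy under stated assumptions, not how Cribl Search implements datatyping: it assumes newline-delimited events, a leading ISO 8601 UTC timestamp, and key=value pairs. The function names and the `_raw`, `_time`, and `datatype` field names are chosen for illustration.

```python
from datetime import datetime, timezone


def break_events(raw: bytes) -> list:
    """1. Event Breaking: split raw bytes into one event per non-empty line."""
    return [line for line in raw.decode("utf-8").splitlines() if line.strip()]


def timestamp(event: str) -> float:
    """2. Timestamping: read the leading token as a UTC time, return epoch seconds."""
    first = event.split(" ", 1)[0]
    parsed = datetime.strptime(first, "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(tzinfo=timezone.utc).timestamp()


def parse(event: str) -> dict:
    """3. Parsing: extract key=value pairs that follow the timestamp."""
    fields = {}
    for token in event.split()[1:]:
        if "=" in token:
            key, value = token.split("=", 1)
            fields[key] = value
    return fields


def add_fields(record: dict, extra: dict) -> dict:
    """4. Add Fields: attach additional fields to the parsed record."""
    return {**record, **extra}


def apply_datatype(raw: bytes, datatype: str) -> list:
    """Run the four steps in order for every event found in the raw bytes."""
    results = []
    for event in break_events(raw):
        record = {"_raw": event, "_time": timestamp(event)}
        record.update(parse(event))
        results.append(add_fields(record, {"datatype": datatype}))
    return results


sample = (
    b"2024-01-01T00:00:00Z action=ACCEPT bytes=42\n"
    b"2024-01-01T00:00:05Z action=REJECT bytes=7\n"
)
for record in apply_datatype(sample, "aws_vpcflow"):
    print(record)
```

The order matters: fields can only be parsed once event boundaries are known, and the added fields are applied last so they are present on every finished event.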