
Searching Amazon S3 (cont.)

OK, with that done, we should be able to connect to our Amazon S3 bucket. Now we need to tell Cribl what data to search, and we do that by creating a... (hint: check the next section's title).

Amazon S3 Dataset Creation

How full is your bucket?
  1. Ensure the Data tab is still selected in the top navigation bar.
  2. Click Datasets in the left navigation bar.
  3. Click Add Dataset.
  4. In the ID field enter sbx_weblogs_<your first name>.
  5. In the Description field enter Search Sandbox Web Logs.
  6. Ensure Dataset Provider is selected in the left navigation bar.
  7. Click the Select a Provider dropdown and select the sbx_s3provider_<your first name> provider that you just created.
  8. In the Bucket path field enter:
    everydayimsearching/data/${dataSource}/${_time:%Y}/${_time:%m}/${_time:%d}/${_time:%H}
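
To make the Bucket path tokens more concrete, here is a minimal Python sketch of how a template like the one above could expand into a real object prefix. It is only an illustration, not Cribl's implementation: the `expand` helper, the sample timestamp, and the assumption that the `_time` tokens behave like strftime format codes are all ours.

```python
import re
from datetime import datetime, timezone

# The Bucket path entered in step 8.
TEMPLATE = (
    "everydayimsearching/data/${dataSource}/"
    "${_time:%Y}/${_time:%m}/${_time:%d}/${_time:%H}"
)

# Matches ${name} and ${name:format}.
TOKEN = re.compile(r"\$\{(\w+)(?::([^}]+))?\}")


def expand(template, fields, when):
    """Hypothetical helper: fill ${field} tokens from `fields` and
    ${_time:<fmt>} tokens from `when`, treating <fmt> as strftime codes."""
    def substitute(match):
        name, fmt = match.group(1), match.group(2)
        if name == "_time" and fmt:
            return when.strftime(fmt)
        return str(fields[name])
    return TOKEN.sub(substitute, template)


# Assumed example values: an access_combined event from 09:00 UTC on 2024-03-07.
when = datetime(2024, 3, 7, 9, tzinfo=timezone.utc)
path = expand(TEMPLATE, {"dataSource": "access_combined"}, when)
print(path)
```

Read this way, the time range you pick for a search narrows which year/month/day/hour folders need to be read, and dataSource becomes a field you can filter on.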

Amazon S3 Dataset Processing

Now that we've told Cribl what data to grab, let's instruct it on how to handle the data.

important
  1. Click Processing in the left navigation bar.

  2. Click Add Datatype Ruleset.

  3. Select Apache as the ruleset.

  4. Drag Apache Datatypes to the number 1 position.

    First Things First

    Setting the Apache Datatypes ruleset first in the list will aid performance, since most of the data returned will match this ruleset. The Cribl Search datatype is there to handle the few events that aren't in the native Apache format.

  5. Click Save.
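
Why does the order matter? Here is a rough Python sketch of a generic "first matching rule wins" loop. It is not how Cribl Search is implemented internally; the two regex rules and the sample events are made up purely to show why putting the most common format first means fewer checks.

```python
import re

# Hypothetical stand-ins for two datatype rules (not Cribl's real rules).
APACHE_RULE = ("apache", re.compile(r'^\S+ \S+ \S+ \[[^\]]+\] "'))
JSON_RULE = ("json", re.compile(r"^\s*\{"))


def total_checks(events, rules):
    """Count how many rule checks are needed when each event stops at
    the first rule that matches it."""
    checks = 0
    for event in events:
        for _name, pattern in rules:
            checks += 1
            if pattern.search(event):
                break
    return checks


# Made-up sample: nine Apache-style lines for every JSON event.
apache_line = '203.0.113.7 - - [07/Mar/2024:09:15:02 +0000] "GET /index.html HTTP/1.1" 200 512'
json_line = '{"status": 200, "uri": "/index.html"}'
events = [apache_line] * 9 + [json_line]

apache_first = total_checks(events, [APACHE_RULE, JSON_RULE])
json_first = total_checks(events, [JSON_RULE, APACHE_RULE])
print(apache_first, json_first)  # 11 19
```

With the common format in position 1, most events are handled by a single check; flip the order and almost every event pays for a failed check first.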

Checking Our Work

Great! We've successfully created a dataset provider and a dataset, or have we? The only way to know for sure is to search the data! So let's see the fruits of our labor.

important

Click Home in the top navigation bar.

On the Home page, in the left section titled Available Datasets, you should now see the dataset that you created, titled sbx_weblogs_<your first name> (if you can't, then something has gone terribly wrong and you should restart the sandbox from the beginning lol).

important
  1. In the Available Datasets section, locate your dataset titled sbx_weblogs_<your first name>.

  2. Click Search Now.

  3. Click the Sampling dropdown left of the Time Picker.

  4. Click 1:1,000.

  5. Click Search.

    Free Samples

    There is quite a bit of data in the Amazon S3 bucket. When limiting data using the limit operator, only the first 'N' events are returned, which may not give a comprehensive view of the dataset.

    Sampling still lets us retrieve a limited number of events, but instead of getting the first 1,000 events, Cribl Search spreads out the results returned (according to the specified ratio) to achieve a more even distribution of data across all directories and events within the Amazon S3 bucket.

    The result set is a much better representation of the dataset overall.

  6. Once the search has completed, click dataSource in the Field Browser left of the events.

    note

    You should see that our dataset is receiving access_common, access_combined, and access_error logs.

  7. Click source in the Field Browser.

    note

    You should see that we are reading a mix of compressed gzip files (.gz) and uncompressed JSON files.
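
If the difference between limit and sampling (the Free Samples callout above) still feels fuzzy, this toy Python sketch may help. Everything in it is an assumption for illustration: three pretend directories of 1,000 events each, a 1:100 ratio to keep the numbers small, and a simple "every 100th event" rule standing in for whatever selection Cribl Search actually uses.

```python
from collections import Counter

# Pretend bucket: three data sources, 1,000 events each, in listing order.
sources = ("access_common", "access_combined", "access_error")
events = [(source, i) for source in sources for i in range(1000)]

# limit-style: take the first 30 events in listing order.
first_n = events[:30]

# sampling-style at 1:100: keep every 100th event across the whole bucket.
sampled = events[::100]

print(Counter(source for source, _ in first_n))
print(Counter(source for source, _ in sampled))
```

Both approaches return 30 events here, but the first 30 all come from access_common, while the sampled set contains 10 events from each data source.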

Nice! ICYMI, you just connected to, timestamped, parsed, and searched compressed (.gz) data and uncompressed JSON data in an Amazon S3 bucket in less than 5 mins, all without moving a single byte to centralized storage. Go on, take a moment to pat yourself on the back.