Searching Amazon S3 (cont.)
OK, with that in place we should be able to connect to our Amazon S3 bucket. Now we need to tell Cribl what data to search, and we do that by creating a... (hint: check the next section's title).
Amazon S3 Dataset Creation
- Ensure the Data tab is still selected in the left navigation bar.
- Click Datasets in the top navigation bar.
- Click Add Dataset.
- In the ID field enter sbx_weblogs_<your first name>.
- In the Description field enter Search Sandbox Web Logs.
- Ensure Dataset Provider is selected in the left navigation bar (of the modal).
- Click the Select a Provider dropdown and select the sbx_s3provider_<your first name> provider that you just created.
- In the Bucket path field enter: everydayimsearching/data/${dataSource}/${_time:%Y}/${_time:%m}/${_time:%d}/${_time:%H}
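To get a feel for what that bucket path describes, here is a minimal Python sketch of how a template like this could expand into a concrete S3 prefix. It is only an illustration, not Cribl's implementation: the expand_path helper, the "*" wildcard fallback, and the sample timestamp are all assumptions, and the time tokens are simply treated as strftime-style format codes.

```python
import re
from datetime import datetime, timezone

# The bucket path entered in the Dataset form above.
TEMPLATE = (
    "everydayimsearching/data/${dataSource}/"
    "${_time:%Y}/${_time:%m}/${_time:%d}/${_time:%H}"
)

# Matches ${name} and ${name:format} tokens.
TOKEN = re.compile(r"\$\{(\w+)(?::([^}]+))?\}")


def expand_path(template, when, fields):
    """Hypothetical helper: fill path tokens for one hour of data.

    ${_time:<fmt>} is formatted with strftime (an assumption about the
    token semantics); any other token is looked up in `fields`, falling
    back to "*" to mean "any value".
    """
    def substitute(match):
        name, fmt = match.group(1), match.group(2)
        if name == "_time" and fmt:
            return when.strftime(fmt)
        return fields.get(name, "*")

    return TOKEN.sub(substitute, template)


when = datetime(2024, 3, 7, 9, tzinfo=timezone.utc)  # made-up timestamp
print(expand_path(TEMPLATE, when, {"dataSource": "access_combined"}))
# everydayimsearching/data/access_combined/2024/03/07/09
```

The takeaway is that the directory names themselves carry the dataSource and the event hour, so a search limited to a time range only has to look at the matching directories.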
Amazon S3 Dataset Processing
Now that we've told Cribl what data to grab, let's instruct it on how to handle that data.
- Click Processing in the left navigation bar.
- Click Add Datatype Ruleset.
- Select Apache as the ruleset.
- Drag Apache Datatypes to the number 1 position.

  First Thing's First: Placing the Apache Datatypes ruleset first in the list will help performance, since most of the data returned will match this ruleset. The Cribl Search datatype is there to handle the few events that aren't in the native Apache format.

- Click Save.
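Why does the order matter? The Python sketch below shows the general idea behind first-match rule evaluation. It is not how Cribl evaluates datatype rulesets internally; the two regex rules, the sample log lines, and the classify helper are all hypothetical stand-ins.

```python
import re

# Hypothetical rules: one for Apache access-log lines, one for JSON events.
APACHE = ("apache", re.compile(r'^\S+ \S+ \S+ \[[^\]]+\] "[^"]*" \d{3} \S+'))
JSON = ("json", re.compile(r"^\s*\{"))


def classify(line, rules):
    """Return (rule name, rules tried) for the first rule that matches."""
    for tried, (name, pattern) in enumerate(rules, start=1):
        if pattern.match(line):
            return name, tried
    return None, len(rules)


def total_checks(lines, rules):
    """Count how many rule evaluations a whole batch of lines costs."""
    return sum(classify(line, rules)[1] for line in lines)


# Made-up batch: nine Apache lines for every one JSON event.
apache_line = '127.0.0.1 - - [10/Oct/2023:13:55:36 +0000] "GET / HTTP/1.1" 200 2326'
json_line = '{"message": "hello"}'
lines = [apache_line] * 9 + [json_line]

print(total_checks(lines, [APACHE, JSON]))  # 11 (Apache rule first)
print(total_checks(lines, [JSON, APACHE]))  # 19 (Apache rule second)
```

When the most common format is checked first, most events are classified on the first try, which is the same reasoning behind dragging Apache Datatypes to position 1.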
Checking Our Work
Great! We've successfully created a dataset provider and a dataset. Or have we? The only way to know for sure is to search the data! So let's see the fruits of our labor.
Click 🏠 Search Home in the left navigation bar.
On the Home page in the left section titled Available Datasets you should now be able to see the dataset that you created titled sbx_weblogs_<your first name> (if you can't then something has gone terribly wrong and you should restart the sandbox from the beginning lol).
- In the Available Datasets section, locate your dataset titled sbx_weblogs_<your first name>.
- Click Search Now.
- Click the Sampling dropdown left of the Time Picker.
- Click 1:1,000.
- Click Search.

  Free Samples: There is quite a bit of data in the Amazon S3 bucket. When limiting data using the limit operator, only the first N events are returned, which may not give a comprehensive view of the dataset. Sampling still returns a reduced set of events, but instead of taking the first 1,000 events, Cribl Search spreads the returned results out (according to the specified ratio) to achieve a more even distribution across all directories and events within the Amazon S3 bucket. The result set is a much better representation of the dataset overall.

- Once the search has completed, click dataSource in the Field Browser left of the events.

  Note: You should see that our dataset contains access_common, access_combined, and access_error logs.

- Click source in the Field Browser.

  Note: You should see that we are reading a mix of compressed gzip files (.gz) and uncompressed JSON files.
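If the difference between limit and sampling still feels abstract, this small Python sketch may help. It uses made-up events (three data sources, listed one directory after another) and assumes sampling keeps each event independently with probability 1/ratio; that is an assumption for illustration, not a description of Cribl Search internals.

```python
import random


def make_events(per_source=100_000):
    """Build hypothetical events, grouped by source as a bucket listing might be."""
    events = []
    for source in ("access_common", "access_combined", "access_error"):
        events.extend({"dataSource": source, "n": i} for i in range(per_source))
    return events


def first_n(events, n):
    """What a limit-style cutoff returns: just the first n events."""
    return events[:n]


def sample(events, ratio, seed=0):
    """Keep roughly 1 in `ratio` events, spread across the whole input."""
    rng = random.Random(seed)
    return [e for e in events if rng.random() < 1 / ratio]


def sources(events):
    """Distinct dataSource values present in a result set."""
    return sorted({e["dataSource"] for e in events})


events = make_events()
print(sources(first_n(events, 1000)))  # ['access_common']
print(sources(sample(events, 1000)))   # sources seen in the 1:1,000 sample
```

The first 1,000 events all come from a single source, while the 1:1,000 sample draws from the entire set, which is why the Field Browser shows every dataSource after a sampled search.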
Nice! ICYMI, you just connected to, timestamped, parsed, and searched compressed (.gz) data and uncompressed JSON data in an Amazon S3 bucket in less than 5 mins, all without moving a single byte to centralized storage. Go on, take a moment to pat yourself on the back.