Searching Amazon S3 (cont.)
Ok, with that we should be able to connect to our Amazon S3 bucket. Now we need to tell Cribl what data to search and we do that by creating a... (hint: check the next section's title.)
Amazon S3 Dataset Creation
- Ensure the Data tab is still selected in the top navigation bar.
- Click Datasets in the left navigation bar.
- Click Add Dataset.
- In the ID field enter sbx_weblogs_<your first name>.
- In the Description field enter Search Sandbox Web Logs.
- Ensure Dataset Provider is selected in the left navigation bar.
- Click the Select a Provider dropdown and select the sbx_s3provider_<your first name> provider that you just created.
- In the Bucket path field enter: everydayimsearching/data/${dataSource}/${_time:%Y}/${_time:%m}/${_time:%d}/${_time:%H}
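To make the Bucket path template more concrete, here is a minimal Python sketch of how such a path could expand for one hour of data. It is an illustration only, not Cribl's implementation: it assumes the ${_time:...} tokens behave like strftime directives and that ${dataSource} stands for a directory name. The helper expand_path, the example timestamp, and the example value access_combined are all hypothetical.

```python
import re
from datetime import datetime, timezone

# The Bucket path template entered in the step above.
TEMPLATE = (
    "everydayimsearching/data/${dataSource}/"
    "${_time:%Y}/${_time:%m}/${_time:%d}/${_time:%H}"
)

# Matches ${name} and ${name:format} tokens.
TOKEN = re.compile(r"\$\{(\w+)(?::([^}]+))?\}")


def expand_path(template: str, when: datetime, fields: dict) -> str:
    """Fill ${_time:fmt} tokens via strftime and other tokens from `fields`.

    Hypothetical helper for illustration; assumes strftime-style formats.
    """
    def fill(match: re.Match) -> str:
        name, fmt = match.group(1), match.group(2)
        if name == "_time" and fmt:
            return when.strftime(fmt)
        return str(fields[name])

    return TOKEN.sub(fill, template)


# Made-up timestamp and dataSource value, chosen only for the example.
example = expand_path(
    TEMPLATE,
    datetime(2024, 3, 7, 9, tzinfo=timezone.utc),
    {"dataSource": "access_combined"},
)
print(example)  # everydayimsearching/data/access_combined/2024/03/07/09
```

Read the other way around, a path that matches the template tells the search which dataSource and which hour an object belongs to, which is why a time-bounded search does not have to read the whole bucket.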
Amazon S3 Dataset Processing
Now that we've told Cribl what data to grab let's instruct it on how to handle the data.
- Click Processing in the left navigation bar.
- Click Add Datatype Ruleset.
- Select Apache as the ruleset.
- Drag Apache Datatypes to the number 1 position.
  First Thing's First: Placing the Apache Datatypes ruleset first in the list helps performance, since most of the data returned will match this ruleset. The Cribl Search datatype is there to handle the few events that aren't in the native Apache format.
- Click Save.
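The ordering advice in the First Thing's First note can be sketched in a few lines of Python. This is a toy model, not how Cribl Search is implemented: it assumes rulesets are tried top to bottom until one matches, and the two regular expressions, the sample lines, and the names parse and total_attempts are hypothetical stand-ins for real datatype rules.

```python
import re

# Hypothetical stand-ins for two datatype rules (not Cribl's actual rules).
APACHE = re.compile(r'^(\S+) \S+ \S+ \[([^\]]+)\] "([^"]*)" (\d{3}) (\S+)')
JSON_LIKE = re.compile(r"^\s*\{.*\}\s*$")


def parse(line, rulesets):
    """Return (ruleset_name, attempts) for the first ruleset that matches."""
    for attempts, (name, pattern) in enumerate(rulesets, start=1):
        if pattern.match(line):
            return name, attempts
    return None, len(rulesets)


def total_attempts(lines, rulesets):
    """Count how many pattern checks were needed across all lines."""
    return sum(parse(line, rulesets)[1] for line in lines)


# Made-up sample: mostly Apache-style lines, one JSON-style line.
lines = [
    '127.0.0.1 - - [10/Oct/2023:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326',
    '10.0.0.5 - bob [10/Oct/2023:13:55:37 +0000] "POST /login HTTP/1.1" 302 -',
    '{"message": "not an apache line"}',
]

apache_first = [("apache", APACHE), ("json", JSON_LIKE)]
json_first = [("json", JSON_LIKE), ("apache", APACHE)]

print(total_attempts(lines, apache_first))  # 4
print(total_attempts(lines, json_first))    # 5
```

With the most common format checked first, most lines match on the first attempt, so the total work drops as the share of Apache-style lines grows.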
Checking Our Work
Great! We've successfully created a dataset provider and a dataset, or have we? The only way to know for sure is to search the data! So let's see the fruits of our labor.
Click Home in the top navigation bar.
On the Home page in the left section titled Available Datasets you should now be able to see the dataset that you created titled sbx_weblogs_<your first name> (if you can't then something has gone terribly wrong and you should restart the sandbox from the beginning lol).
- In the Available Datasets section, locate your dataset titled sbx_weblogs_<your first name>.
- Click Search Now.
- Click the Sampling dropdown left of the Time Picker.
- Click 1:1,000.
- Click Search.
  Free Samples: There is quite a bit of data in the Amazon S3 bucket. When limiting data using the limit operator, only the first 'N' events are returned, which may not give a comprehensive view of the dataset. Sampling still lets us retrieve a set number of events, but instead of returning the first 1000 events, Cribl Search spreads the returned results out (according to the specified ratio) to achieve a more even distribution of data across all directories and events within the Amazon S3 bucket. The result set is a much better representation of the dataset overall.
- Once the search has completed, click dataSource in the Field Browser left of the events.
  Note: You should see that our dataset is receiving access_common, access_combined, and access_error logs.
- Click source in the Field Browser.
  Note: You should see that we are reading a mix of compressed gzip files (.gz) and uncompressed JSON files.
Nice! ICYMI, you just connected to, timestamped, parsed, and searched compressed (.gz) data and uncompressed JSON data in an Amazon S3 bucket in less than 5 mins, all without moving a single byte to centralized storage. Go on, take a moment to pat yourself on the back.
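For the curious, the difference between limit and sampling described in the Free Samples note can be sketched in Python. This is a simplified model under stated assumptions: events are read source by source, and sampling is modeled as keeping every Nth event. How Cribl Search actually selects sampled events is not described here, and the event counts and the names make_events, first_n, and sample_one_in are hypothetical.

```python
from collections import Counter


def make_events(per_source=10_000):
    """Build made-up events grouped by source, as if read directory by directory."""
    events = []
    for source in ("access_common", "access_combined", "access_error"):
        events.extend({"dataSource": source, "n": i} for i in range(per_source))
    return events


def first_n(events, n):
    """Model of a limit: keep only the first n events encountered."""
    return events[:n]


def sample_one_in(events, ratio):
    """Model of 1:ratio sampling: keep every ratio-th event (an assumption)."""
    return events[::ratio]


events = make_events()

limited = Counter(e["dataSource"] for e in first_n(events, 30))
sampled = Counter(e["dataSource"] for e in sample_one_in(events, 1000))

print(dict(limited))  # {'access_common': 30}
print(dict(sampled))  # {'access_common': 10, 'access_combined': 10, 'access_error': 10}
```

Both approaches return 30 events here, but the limited set only ever sees the first source, while the sampled set touches all three.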