Searching Amazon S3 (cont.)
Ok, with that we should be able to connect to our Amazon S3 bucket. Now we need to tell Cribl what data to search and we do that by creating a... (hint: check the next section's title.)
Amazon S3 Dataset Creation
- Ensure the Data tab is still selected in the top navigation bar.
- Click Datasets in the left navigation bar.
- Click Add Dataset.
- In the ID field enter sbx_weblogs_<your first name>.
- In the Description field enter Search Sandbox Web Logs.
- Ensure Dataset Provider is selected in the left navigation bar.
- Click the Select a Provider dropdown and select the sbx_s3provider_<your first name> provider that you just created.
- In the Bucket path field enter: everydayimsearching/data/${dataSource}/${_time:%Y}/${_time:%m}/${_time:%d}/${_time:%H}
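To get a feel for what the Bucket path template describes, here is a minimal Python sketch that fills in the tokens for one event. It is only a rough model for illustration, not how Cribl resolves paths internally: the `expand` helper and the example timestamp are hypothetical, and the sketch assumes every `${_time:...}` token carries a strftime-style format, as the ones in the path above do.

```python
import re
from datetime import datetime, timezone

# The Bucket path entered in the step above.
BUCKET_PATH = (
    "everydayimsearching/data/${dataSource}/"
    "${_time:%Y}/${_time:%m}/${_time:%d}/${_time:%H}"
)

# Matches ${name} and ${name:format}.
TOKEN = re.compile(r"\$\{(\w+)(?::([^}]+))?\}")


def expand(template, when, fields):
    """Fill ${_time:FORMAT} tokens from `when` and other ${name} tokens from `fields`."""
    def fill(match):
        name, fmt = match.group(1), match.group(2)
        if name == "_time":
            # Assumes a format is always present, as in BUCKET_PATH.
            return when.strftime(fmt)
        return fields[name]

    return TOKEN.sub(fill, template)


# Hypothetical event time, chosen only for the example.
when = datetime(2024, 3, 7, 14, 30, tzinfo=timezone.utc)
print(expand(BUCKET_PATH, when, {"dataSource": "access_combined"}))
# everydayimsearching/data/access_combined/2024/03/07/14
```

Read this way, the template describes one directory per data source and per hour, so the directory name alone tells you which data source and which hour an event belongs to.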
Amazon S3 Dataset Processing
Now that we've told Cribl what data to grab, let's instruct it on how to handle the data.
- Click Processing in the left navigation bar.
- Click Add Datatype Ruleset.
- Select Apache as the ruleset.
- Drag Apache Datatypes to the number 1 position.

  First Thing's First: Setting the Apache Datatypes ruleset first in the list will aid performance, since most of the data returned will match this ruleset. The Cribl Search datatype is there to handle the few events that aren't in the native Apache format.
- Click Save.
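Why would the order of rulesets matter for performance? The Python sketch below models the idea under one assumption: rulesets are tried top to bottom and the first match wins. The two regular expressions are hypothetical, simplified stand-ins (the real rules are not shown in this sandbox), and `classify` and `total_checks` are names invented for the example.

```python
import re

# Hypothetical stand-ins for the two rulesets; not the real Cribl rules.
APACHE = (
    "Apache Datatypes",
    re.compile(r'^\S+ \S+ \S+ \[[^\]]+\] "[A-Z]+ \S+ [^"]*" \d{3} \S+'),
)
OTHER = ("Cribl Search", re.compile(r"^\{"))  # here: anything that looks like JSON


def classify(line, rulesets):
    """Return (ruleset name, rulesets tried), assuming the first match wins."""
    for tried, (name, pattern) in enumerate(rulesets, start=1):
        if pattern.match(line):
            return name, tried
    return None, len(rulesets)


def total_checks(lines, rulesets):
    """Count how many pattern checks it takes to classify every line."""
    return sum(classify(line, rulesets)[1] for line in lines)


apache_line = '127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326'
json_line = '{"status": 500, "message": "upstream timed out"}'
lines = [apache_line] * 9 + [json_line]  # mostly Apache, like the sandbox data

print(total_checks(lines, [APACHE, OTHER]))  # 11
print(total_checks(lines, [OTHER, APACHE]))  # 19
```

With nine Apache lines out of ten, putting the Apache ruleset first means most events are classified on the first try, which is the intuition behind dragging it to position 1.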
Checking Our Work
Great! We've successfully created a dataset provider and a dataset, or have we? The only way to know for sure is to search the data! So let's see the fruits of our labor.
- Click Home in the top navigation bar.
- On the Home page, in the left section titled Available Datasets, you should now be able to see the dataset that you created, titled sbx_weblogs_<your first name> (if you can't, then something has gone terribly wrong and you should restart the sandbox from the beginning lol).
- In the Available Datasets section, locate your dataset titled sbx_weblogs_<your first name>.
- Click Search Now.
- Click the Sampling dropdown left of the Time Picker.
- Click 1:1,000.
- Click Search.

  Free Samples: There is quite a bit of data in the Amazon S3 bucket. When limiting data using the limit operator, only the first 'N' events are returned, which may not give a comprehensive view of the dataset. Sampling still retrieves a limited number of events, but instead of taking the first 1000 events, Cribl Search spreads the returned results out (according to the specified ratio) to achieve a more even distribution across all directories and events within the Amazon S3 bucket. The result set is a much better representation of the dataset overall.
- Once the search has completed, click dataSource in the Field Browser left of the events.

  Note: You should see that our dataset is receiving access_common, access_combined, and access_error logs.
- Click source in the Field Browser.

  Note: You should see that we are reading a mix of compressed gzip files (.gz) and uncompressed JSON files.
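To make the difference between limiting and sampling concrete, here is a small Python sketch. It is an illustration only: the fake event stream, the helper names, and the choice to keep every k-th event are all assumptions, since this sandbox doesn't describe how Cribl Search actually picks sampled events.

```python
from itertools import islice


def events(hours=24, per_hour=1000):
    """Yield (hour, index) pairs, mimicking events stored in hourly directories."""
    for hour in range(hours):
        for index in range(per_hour):
            yield hour, index


def first_n(stream, n):
    """Mimic a limit: keep only the first n events."""
    return list(islice(stream, n))


def sample_one_in(stream, k):
    """Keep every k-th event, one simple deterministic way to get a 1:k ratio."""
    return [event for position, event in enumerate(stream) if position % k == 0]


limited = first_n(events(), 24)
sampled = sample_one_in(events(), 1000)

# Both return 24 events, but they cover very different parts of the data.
print(len(limited), sorted({hour for hour, _ in limited}))  # 24 [0]
print(len(sampled), len({hour for hour, _ in sampled}))     # 24 24
```

The limited result never leaves the first hourly directory, while the sampled result touches all 24 of them with the same number of events returned.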
Nice! ICYMI, you just connected to, timestamped, parsed, and searched compressed (.gz) data and uncompressed JSON data in an Amazon S3 bucket in less than 5 mins, all without moving a single byte to centralized storage. Go on, take a moment to pat yourself on the back.
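If you're curious what reading a mix of compressed and uncompressed files involves, the Python sketch below does a tiny local version of it using only the standard library. The file names and contents are hypothetical, the files live in a temporary directory rather than in Amazon S3, and choosing the reader by file extension is just one simple approach, not a description of what Cribl Search does under the hood.

```python
import gzip
import json
import os
import tempfile


def read_lines(path):
    """Yield text lines from `path`, decompressing when the name ends in .gz."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            yield line.rstrip("\n")


# Build two tiny hypothetical files: one gzip-compressed log, one plain JSON file.
workdir = tempfile.mkdtemp()
gz_path = os.path.join(workdir, "access.log.gz")
json_path = os.path.join(workdir, "errors.json")

with gzip.open(gz_path, "wt", encoding="utf-8") as handle:
    handle.write('127.0.0.1 - - [10/Oct/2000:13:55:36 -0700] "GET / HTTP/1.0" 200 2326\n')
with open(json_path, "w", encoding="utf-8") as handle:
    handle.write(json.dumps({"status": 500, "message": "upstream timed out"}) + "\n")

# The same loop handles both files; only the opener differs.
for path in (gz_path, json_path):
    for line in read_lines(path):
        print(os.path.basename(path), "->", line)
```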