Configuring Scheduled Collection Jobs
In this section, you'll explore how to schedule jobs that collect data from a REST API endpoint. You can schedule jobs to recur on an interval that you define.
The `earliest` and `latest` Parameters
Cribl Stream uses two "magic" variables, `earliest` and `latest`, to specify a time range when collecting data. For additional details on configuring time ranges, take a moment to review this related documentation before you continue: Collector Sources > Scheduling and Running > Time Range.
To see how the `earliest` and `latest` variables work, let's configure a Collector that uses these parameters in a collection job.
- If necessary, navigate to the REST Collector Source page. From the top nav of your Cribl Stream Sandbox, select Manage > Data > Sources, then select Collectors > REST from the Data Sources page's tiles or left nav.
- Click Add Collector to open the REST > Add Collector modal.
- In the Collector ID field, enter `echo`.
- Copy/paste the following URL into the Collect URL field.
  'http://rest-server/echo'
- Configure two Collect parameters by clicking the + Add Parameter button twice. Copy/paste the parameters' settings from the table below.

  | Name | Value |
  | --- | --- |
  | earliest | `${earliest}` |
  | latest | `${latest}` |

- At the bottom left, click ► Save & Run. In the Run configuration modal, click Run again.
Observe the output from the REST API server. You'll see information related to the headers, body, and query string parameters returned to you.
{"headers":{"host":"rest-server","connection":"close"},"body":{},"query":{}}
Why are there no references to `earliest` or `latest` in the `query` section? Because we didn't specify an absolute or relative time range when running the Collector. These variables' values are `undefined`, meaning they are ignored during the collection job.
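The effect of undefined parameters can be pictured with a small sketch. This is only an illustration of the idea, not Cribl Stream's actual code: `buildQuery` is a hypothetical helper that drops any parameter whose value is `undefined` before the query string is assembled.

```javascript
// Hypothetical helper (not part of Cribl Stream): build a query string,
// skipping any parameter whose value is undefined.
function buildQuery(params) {
  const query = new URLSearchParams();
  for (const [name, value] of Object.entries(params)) {
    if (value !== undefined) {
      query.set(name, String(value));
    }
  }
  return query.toString();
}

// No time range supplied: both parameters are dropped.
const withoutRange = buildQuery({ earliest: undefined, latest: undefined });

// Time range supplied (illustrative epoch seconds): both parameters are sent.
const withRange = buildQuery({ earliest: 1654529520, latest: 1654529820 });

console.log(JSON.stringify(withoutRange)); // ""
console.log(withRange); // earliest=1654529520&latest=1654529820
```

An empty query string corresponds to the empty `query` object echoed back by the server when no time range is set.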
Now we'll configure Cribl Stream to send the `earliest` and `latest` parameters.
- If open, close the Preview modal from the previous step.
- Click the ► Run button on the `echo` Collector row.
- In the Earliest field, enter `-5m@m`. This means Cribl Stream will snap to :00 seconds, 5 minutes ago.
- In the Latest field, enter `@m`. This means Cribl Stream will snap to the last minute at :00 seconds.
- Click the Run button.
Observe that the output from the REST API now includes the `earliest` and `latest` parameters, in UNIX epoch time format (seconds granularity). There should be 5 minutes' difference between the earliest and latest timestamps, and both should be snapped to :00 seconds.
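To make the snapping arithmetic concrete, here is a minimal sketch of how `-5m@m` and `@m` could resolve to epoch seconds. It assumes a fixed "now" for reproducibility, and `snapToMinute` is a hypothetical helper, not Cribl Stream's implementation of relative time.

```javascript
// Hypothetical helper: shift "now" by a number of minutes, then snap
// down to :00 seconds and return UNIX epoch seconds.
function snapToMinute(nowMs, offsetMinutes = 0) {
  const minuteMs = 60 * 1000;
  const shifted = nowMs + offsetMinutes * minuteMs;
  const snapped = shifted - (shifted % minuteMs);
  return snapped / 1000;
}

// Assumed clock for this example: 2022-06-06 15:37:42 UTC.
const now = Date.UTC(2022, 5, 6, 15, 37, 42);

const earliest = snapToMinute(now, -5); // models -5m@m
const latest = snapToMinute(now, 0);    // models @m

console.log({ earliest, latest, difference: latest - earliest });
// { earliest: 1654529520, latest: 1654529820, difference: 300 }
```

The 300-second difference matches the 5 minutes you should see between the two timestamps in the echoed query.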
You can translate the timestamps to a human-readable date and time by running the following command in your terminal (replace the placeholder timestamp with your result). This is GNU date syntax; on macOS or BSD, use date -r 1654529520 instead:
date -d @1654529520
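If you prefer JavaScript over the terminal, the same conversion can be sketched as follows. JavaScript dates use milliseconds, so the epoch seconds value is multiplied by 1000; the timestamp is the placeholder from the command above.

```javascript
// Convert UNIX epoch seconds to an ISO 8601 string (UTC).
const epochSeconds = 1654529520; // placeholder; replace with your result
const iso = new Date(epochSeconds * 1000).toISOString();

console.log(iso); // 2022-06-06T15:32:00.000Z
```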
If you need to set a default time range when the collection job runs, you can use the JavaScript Logical OR (`||`) operator to set a default value.
For example, if the `earliest` value for this schedule should default to the start of the current 5-minute window whenever no time range is supplied, you can use this syntax:
earliest || new Date().setTime(new Date().getTime() - (new Date().getTime() % (5 * 60 * 1000))) / 1000
If you need to format the time into a string, you can use the `C.Time.strftime` function.
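To see what the default-value expression evaluates to, the sketch below restates it as a plain function with an injectable clock. `defaultEarliest` is a hypothetical name used only for this illustration; the arithmetic follows the expression shown above.

```javascript
// Hypothetical restatement of the default expression:
// use `earliest` if it is set, otherwise fall back to the start of the
// current 5-minute window, in UNIX epoch seconds.
function defaultEarliest(earliest, nowMs = Date.now()) {
  const fiveMinutesMs = 5 * 60 * 1000;
  return earliest || (nowMs - (nowMs % fiveMinutesMs)) / 1000;
}

// Assumed clock for this example: 2022-06-06 15:37:42 UTC.
const fixedNow = Date.UTC(2022, 5, 6, 15, 37, 42);

const fallback = defaultEarliest(undefined, fixedNow); // no range supplied
const explicit = defaultEarliest(1654529520, fixedNow); // range supplied

console.log(fallback); // 1654529700 (15:35:00 UTC)
console.log(explicit); // 1654529520
```

Because `||` treats every falsy value as unset, an `earliest` of `0` would also trigger the fallback.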
Scheduling
Now, let's configure the `echo` job to run on a schedule. The goal is to collect a 60-second snapshot of data every 60 seconds.
- If open, close the Preview modal from the previous step.
- Click the ⏱ Scheduled button on the `echo` Collector row.
- Set Enabled to Yes.
- Change the Cron schedule to `* * * * *` (meaning every minute).
- Set Skippable to No.
- Set Resume missed runs to Yes. (This setting appears after you disable Skippable.)
- In the Earliest field, enter `-1m@m`. This means Cribl Stream will snap to :00 seconds, 1 minute ago.
- In the Latest field, enter `@m`. This means Cribl Stream will snap to the last minute at :00 seconds.
- Click the Save button.
Your Schedule Collector window should look like the following:
In this Sandbox instance, we automatically apply all configuration changes when you save them. But if you are running a Cribl.Cloud or distributed deployment of Cribl Stream, you must next Commit and Deploy for your changes to take effect.
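The schedule configured above pairs a once-per-minute cron with a `-1m@m` to `@m` time range. The following sketch, which uses assumed run times and a hypothetical `windowForRun` helper rather than Cribl Stream's scheduler, shows why consecutive runs should produce back-to-back 60-second windows even when a run starts a few seconds late.

```javascript
// Hypothetical helper: the window a run would collect with
// Earliest = -1m@m and Latest = @m, in UNIX epoch seconds.
function windowForRun(runMs) {
  const minuteMs = 60 * 1000;
  const latest = runMs - (runMs % minuteMs); // @m
  const earliest = latest - minuteMs;        // -1m@m
  return { earliest: earliest / 1000, latest: latest / 1000 };
}

// Assumed run times: three consecutive minutes, each starting a few
// seconds after the top of the minute.
const runTimes = [
  Date.UTC(2022, 5, 6, 15, 33, 2),
  Date.UTC(2022, 5, 6, 15, 34, 1),
  Date.UTC(2022, 5, 6, 15, 35, 3),
];

const windows = runTimes.map(windowForRun);
console.log(windows);
// Each window ends exactly where the next one begins.
```

Snapping both ends to the minute is what keeps the windows aligned regardless of the exact second at which a run starts.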
Why Disable Skippable, and Enable Resume Missed Runs?
These settings are important for reliable data collection with any Collector!
Cribl Stream places concurrency limits on its number of running jobs and tasks. This is to ensure that system resources are not depleted during runtime. When Sources like Office 365, and Collectors like S3 run concurrent jobs, they can exceed concurrency limits – and Cribl Stream might then skip a REST Collector job. With Skippable disabled, if Cribl Stream reaches concurrency limits, it will queue the job run until the next available start time.
The Resume missed runs setting is important when the Leader Node restarts or is unavailable. If you enable this setting, Cribl Stream tracks the last successful run time for each job. Upon restart, it will automatically schedule any skipped collection jobs.
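As a rough mental model of resuming missed runs, the sketch below lists the every-minute run times that fall between a last successful run and the moment the Leader comes back. `missedRuns` is a hypothetical helper for illustration only; Cribl Stream's actual bookkeeping may differ.

```javascript
// Hypothetical helper: for an every-minute schedule, list the scheduled
// run times (epoch seconds) after the last successful run, up to "now".
function missedRuns(lastSuccessMs, nowMs) {
  const minuteMs = 60 * 1000;
  const runs = [];
  // First scheduled minute strictly after the last successful run.
  let next = lastSuccessMs - (lastSuccessMs % minuteMs) + minuteMs;
  while (next <= nowMs) {
    runs.push(next / 1000);
    next += minuteMs;
  }
  return runs;
}

// Assumed timeline: last success at 15:32:00 UTC, restart at 15:35:30 UTC.
const lastSuccess = Date.UTC(2022, 5, 6, 15, 32, 0);
const restartedAt = Date.UTC(2022, 5, 6, 15, 35, 30);

const toResume = missedRuns(lastSuccess, restartedAt);
console.log(toResume); // [ 1654529580, 1654529640, 1654529700 ]
```

In this example, the three runs at 15:33, 15:34, and 15:35 would be the candidates to reschedule after the restart.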
Read more about Job Limits on the Cribl Docs site.
Conclusion
Congratulations, you now know how to schedule collection jobs! In the next module, we'll explore how to troubleshoot – using logs – when a REST Collector is not working correctly.