GitLab Job Completion

This page is about monitoring & alerting on job completion (i.e., jobs that trigger but either never complete or take longer than expected to complete). For alerting on jobs that fail to trigger, see periodic job monitoring.

The main purpose of a job completion metric is to observe whether a given task or action has successfully run within the required interval. This is useful in scenarios where active monitoring is not applicable, such as cron jobs or scheduled pipeline executions. If such a job fails to check back in within the targeted interval, an alert fires to report the incident.

This is implemented via the Prometheus Pushgateway. To register and check in a successful execution, the cron job or pipeline publishes the required metrics to a Pushgateway (see below for details). The alert fires once the time elapsed since the last successful check-in exceeds the configured maximum age.

Creating a new job completion metric is the same process as checking in or updating an existing one; this is driven by convention over configuration. It is enough to publish a metric with an arbitrary resource label identifying the resource that the alert reports on. For a scheduled pipeline this could be the repository URL and the job name, but it must not include data that changes between invocations (e.g. pipeline or job IDs). In addition, the type and tier labels are required, as for all our alerts. These should correspond to the type and tier of the underlying service that the deadman switch is monitoring.
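
For example, in a scheduled pipeline a stable resource value could be derived from GitLab CI's predefined variables. This is only an illustration; the exact value is up to the job:

# Stable across invocations: project URL plus job name.
export RESOURCE="${CI_PROJECT_URL} ${CI_JOB_NAME}"
# Do NOT use values like ${CI_PIPELINE_ID} or ${CI_JOB_ID}; they change on every run.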

Four metrics are required:

| Metric | Description |
| ------ | ----------- |
| gitlab_job_max_age_seconds | The allowed age before the alert should fire, in seconds. |
| gitlab_job_start_timestamp_seconds | The Unix time in seconds when the job starts. |
| gitlab_job_success_timestamp_seconds | The Unix time in seconds when the job completes successfully. It should be set to 0 at job start. |
| gitlab_job_failed | Whether the job failed (1 on failure, 0 otherwise). |

The code below can be used within a bash script, after exporting the respective environment variables (described in the table further down):

report_start.sh:

# At job start: record the start time and the allowed maximum age, and reset
# the success timestamp and the failure flag.
cat <<PROM | curl -iv --data-binary @- "http://${PUSH_GATEWAY}:9091/metrics/job/${JOB}/tier/${TIER}/type/${TYPE}"
gitlab_job_start_timestamp_seconds{resource="${RESOURCE}"} $(date +%s)
gitlab_job_success_timestamp_seconds{resource="${RESOURCE}"} 0
gitlab_job_max_age_seconds{resource="${RESOURCE}"} ${MAX_AGE}
gitlab_job_failed{resource="${RESOURCE}"} 0
PROM

report_success.sh:

# On successful completion: record the current time as the success timestamp.
cat <<PROM | curl -iv --data-binary @- "http://${PUSH_GATEWAY}:9091/metrics/job/${JOB}/tier/${TIER}/type/${TYPE}"
gitlab_job_success_timestamp_seconds{resource="${RESOURCE}"} $(date +%s)
PROM

report_failed.sh:

# On failure: set the failure flag.
cat <<PROM | curl -iv --data-binary @- "http://${PUSH_GATEWAY}:9091/metrics/job/${JOB}/tier/${TIER}/type/${TYPE}"
gitlab_job_failed{resource="${RESOURCE}"} 1
PROM

| Variable | Description |
| -------- | ----------- |
| PUSH_GATEWAY | The hostname/IP of the Pushgateway to push to (check firewalls; stay within the environment if possible). |
| JOB | The job name used in the Pushgateway grouping key (the /job/ path segment above). |
| MAX_AGE | The SLO value for alerting, in seconds. |
| RESOURCE | The resource identifier to include in alerts (e.g. assign_weights). Do not include data that changes between invocations (such as pipeline or job IDs). |
| TIER | The tier of the monitored service (e.g. db). |
| TYPE | The type of the monitored service (e.g. postgres). |
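
For illustration, the scripts above could be tied together in a job wrapper along these lines. The variable values and the job command here are hypothetical placeholders:

run_job.sh:

# Hypothetical values; adjust to your environment and job.
export PUSH_GATEWAY=localhost
export JOB=assign_weights
export RESOURCE=assign_weights
export MAX_AGE=$(( 25 * 3600 ))   # allow up to 25 hours between successful runs
export TIER=db
export TYPE=postgres

# Assumes the report_*.sh scripts shown above are in the working directory.
./report_start.sh
if run_the_actual_job; then   # placeholder for the real job command
  ./report_success.sh
else
  ./report_failed.sh
fi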

For tracking a job that is expected to succeed on each node, use localhost as $PUSH_GATEWAY.

If you have a job that should only run on one random node in an environment each time (e.g. the wal-g backup job), use a central Pushgateway to avoid metrics labeled with differing fqdns, which would otherwise cause alerts whenever the job did not happen to run on the same node for a while. For gstg, gprd and ops you can use the blackbox nodes as the central Pushgateway.

To remove metrics from the Pushgateway, check how to delete metrics.
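
As a minimal sketch, assuming the same grouping key that the scripts above pushed to, a deletion request could look like this; the job/tier/type path must match exactly what was pushed:

delete_metrics.sh:

# Delete all metrics for this grouping key from the Pushgateway.
curl -X DELETE "http://${PUSH_GATEWAY}:9091/metrics/job/${JOB}/tier/${TIER}/type/${TYPE}"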

Any metric reporting created as shown above automatically has alerting enabled. These alerts will be sent out as severity s4.