# Snowplow Monitoring and Incident Response Runbook

## Overview

Snowplow is a pipeline of nodes and streams used to accept events from GitLab.com and other applications. This runbook provides guidance for responding to CloudWatch alarms and troubleshooting issues with the Snowplow infrastructure.
## Important Resources

- Design Document
- Terraform Configuration
- Cloudwatch Dashboard
- AWS GPRD account: 855262394183

## The Pipeline Diagram

## Response Procedures
### Alarm Classification

All alarms include P0/P1/P2 in the name. Here is what each priority represents:
| Priority | Description | Response Time | Impact |
|---|---|---|---|
| P0 | Critical issues requiring immediate attention | Immediate | Immediate data loss or service outage |
| P1 | Significant issues requiring prompt action | Within 24 hours | Potential data loss in 24-48 hours |
| P2 | Non-urgent issues requiring investigation | Within 1 week | Minimal immediate impact |
### P0 Alarms

P0 alarms indicate critical incidents requiring immediate attention. In the Snowplow infrastructure, this occurs when the Application Load Balancer cannot receive or route events properly, resulting in irrecoverable event loss.

#### Action Steps

- Create an incident in Slack
  - Follow the handbook instructions
  - Label the incident as P3 (internal-only classification)
  - In the incident.io Slack channel, tag @data-engineers @Ankit Panchal @Niko Belokolodov @Jonas Larsen @Ashwin
    - This tagging prevents duplicate incidents from being created
- Troubleshoot the issue yourself
  - Review the "What is Important" section below
  - Review logs and metrics in CloudWatch
### P1 Alarms

P1 alarms indicate significant issues requiring action within 24 hours.

#### Action Steps

- Begin troubleshooting
  - Review the "What is Important" section below
  - Review logs and metrics in CloudWatch
- If uncertain how to resolve
  - Tag @vedprakash @Justin Wong in Slack
### P2 Alarms

P2 alarms indicate potential issues that don't require immediate action but should be addressed.

#### Action Steps

- Create an issue
  - Log the issue in the analytics project
  - Include alarm details, timestamps, and any patterns observed
- Investigate when convenient
- If blocked on resolution
  - Tag @juwong in the issue
## What is Important?

If you are reading this, most likely one of two things has gone wrong: either the Snowplow pipeline has stopped accepting events, or it has stopped writing events to the S3 bucket.

- Not accepting requests is a big problem and should be fixed as soon as possible. Collecting events is a synchronous process, so any time spent not accepting requests means lost events.
- Processing events and writing them out is important, but not as time-sensitive. There is some slack in the queue to allow events to stack up before being written.
- The raw events Kinesis stream has a data retention period of 48 hours. In a dire situation this can be altered by modifying the `retention_period` argument in aws-snowplow-prd/main.tf (see the sketch after this list for an emergency alternative via the AWS CLI).
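The canonical change is the Terraform edit above. If waiting for a Terraform apply is not an option during an incident, retention can also be raised directly with the AWS CLI. This is a hedged sketch: the stream name is a placeholder, and Terraform will report drift until main.tf is updated to match.

```sh
# Emergency stopgap only: raise retention on the raw events stream (value is in hours).
# "<raw-events-stream-name>" is a placeholder; look up the real name in the Kinesis
# console or in aws-snowplow-prd/main.tf before running this.
aws kinesis increase-stream-retention-period \
  --stream-name "<raw-events-stream-name>" \
  --stream-retention-period-hours 72
```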
## Troubleshooting Guide

### Problem 1: Not accepting requests
- A quick curl check should give you a good response of `OK`. The same URL path is used to check the health of individual collector nodes against port 8000:

  ```sh
  curl https://snowplowprd.trx.gitlab.net/health
  ```

- Log into GPRD AWS and verify that there are collector nodes in the `SnowPlowNLBTargetGroup` EC2 auto-scaling target group. If not, something has gone wrong with the snowplow PRD collector Auto Scaling group. (A CLI sketch for this and the DNS check follows this list.)

- Check Cloudflare and verify that the DNS name is still pointing to the EC2 Snowplow load balancer DNS name. The record in Cloudflare should be a `CNAME`.
  - aws-snowplow-prd env, DNS name: `snowplowprd.trx.gitlab.net`

- If there are EC2 collectors running, you can SSH into an instance (see 'How to SSH into EC2 instances' section) and then check the logs by running:

  ```sh
  docker logs --tail 15 stream-collector
  ```

- Are the collectors writing events to the raw (good or bad) Kinesis streams?
  - Look at the Cloudwatch dashboard, or go to the Kinesis Data Streams service in AWS and look at the stream monitoring tabs.
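Some of the UI checks above can be approximated from a terminal. A hedged sketch: a Cloudflare-proxied record may not expose its `CNAME` target to `dig`, and the `collector`/`enricher` name filters are assumptions about the real Auto Scaling group names.

```sh
# End-to-end health through the load balancer (expects "OK")
curl -sf https://snowplowprd.trx.gitlab.net/health

# DNS sanity check; if the record is proxied, verify it in the Cloudflare dashboard instead
dig +short CNAME snowplowprd.trx.gitlab.net

# Confirm the collector/enricher Auto Scaling groups still have running instances
aws autoscaling describe-auto-scaling-groups \
  --query 'AutoScalingGroups[?contains(AutoScalingGroupName, `collector`) || contains(AutoScalingGroupName, `enricher`)].[AutoScalingGroupName, length(Instances)]' \
  --output table
```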
### Problem 2: Not writing events out

- First, make sure the collectors are working OK by looking over the steps above. If nothing is getting collected, nothing will be written out.

- In the aws-snowplow-prd Cloudwatch dashboard, look at the Stream Records Age graph to see if a Kinesis stream is backing up. This graph shows how many milliseconds records have been sitting in the streams, and it should be zero most of the time. If lots of records are backing up, the enrichers may not be picking up work, or Firehose is not writing records to S3. (A CLI sketch for pulling this metric directly follows this list.)

- Verify there are running enricher instances by checking the `SnowPlowEnricher` auto scaling group.

- There is currently no automated way to see whether the enricher processes are running on the nodes. To check the logs, SSH into one of the enricher instances (see 'How to SSH into EC2 instances' section) and then run:

  ```sh
  docker logs --tail 15 stream-enrich
  ```

- Are the enricher nodes picking up events and writing them into the enriched Kinesis streams? Look at the Kinesis stream monitoring tabs.

- Check that the Kinesis Firehose monitoring for the enriched (good and bad) streams shows events being processed. You may want to turn on CloudWatch logging if you are stuck and can't figure out what's wrong.

- Check the Lambda function that is used to process events in Firehose. There should be plenty of invocations at any time of day. A graph of invocations is also in Cloudwatch.
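If the dashboard is unavailable, the same signals can be pulled with the AWS CLI. A minimal sketch, assuming GNU `date`; the stream and function names are placeholders, not confirmed resource names.

```sh
# How long records have been waiting in a stream (should stay near zero)
aws cloudwatch get-metric-statistics \
  --namespace AWS/Kinesis \
  --metric-name GetRecords.IteratorAgeMilliseconds \
  --dimensions Name=StreamName,Value="<enriched-good-stream-name>" \
  --statistics Maximum --period 300 \
  --start-time "$(date -u -d '-3 hours' +%Y-%m-%dT%H:%M:%SZ)" \
  --end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)"

# Recent invocation counts for the Lambda that processes Firehose records
aws cloudwatch get-metric-statistics \
  --namespace AWS/Lambda \
  --metric-name Invocations \
  --dimensions Name=FunctionName,Value="<firehose-transform-lambda-name>" \
  --statistics Sum --period 300 \
  --start-time "$(date -u -d '-3 hours' +%Y-%m-%dT%H:%M:%SZ)" \
  --end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)"
```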
## Cloudwatch Dashboard

The Cloudwatch dashboard is useful for quickly understanding the state of the infrastructure when you're debugging a problem. It's organized by service, in the order an event passes through the pipeline (LB -> EC2 -> Kinesis, etc.).

In the past, some important widgets in the dashboard have been:

- Kinesis stream records age: the most important widget, because it measures how long events are sitting in Kinesis (which means they're not getting enriched); in the past we have had problems with it backing up.
- Auto-scaling group size: if we see collectors scaling up but not scaling back down, we may need to increase the number of collectors to make sure we're always ready to ingest bigger event traffic.
## Maintenance Procedures

### Updating enricher config

The Snowplow collector and enricher instances are started with launch configuration templates. These launch configuration templates include the Snowplow configs: collector-user-data.sh and enricher-user-data.sh.

The Snowplow configs control how the Snowplow collector/enricher and the Kinesis streams interact, and they may occasionally need to be updated. Here are the steps:

- Within the .sh file(s), update the Snowplow config values
- Create an MR to apply the changes, which should update the aws_launch_configuration resource (example MR)

Lastly, to check that your config has been updated, SSH into one of the instances (see 'How to SSH into EC2 instances' section) and run:

```sh
cat /snowplow/config/config.hocon
```
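If the instance still shows old values, it can help to confirm that the Auto Scaling group has picked up the new launch configuration and that its user data contains your change. A hedged sketch; the group and launch configuration names are placeholders.

```sh
# Which launch configuration is the group currently using?
aws autoscaling describe-auto-scaling-groups \
  --auto-scaling-group-names "<enricher-asg-name>" \
  --query 'AutoScalingGroups[0].LaunchConfigurationName' --output text

# Decode that launch configuration's user data and check your config values (GNU coreutils base64)
aws autoscaling describe-launch-configurations \
  --launch-configuration-names "<launch-configuration-name>" \
  --query 'LaunchConfigurations[0].UserData' --output text | base64 --decode | head -n 40
```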
### EC2 instance refresh

You may need to do an instance refresh manually, for example because:

- instances have become unresponsive

Here are the instructions (a CLI sketch for the same flow follows this list):

- The instances need to be terminated/recreated for them to use the updated config. To access the 'Instance refresh' tab in the UI, go to EC2 -> Auto Scaling groups -> click 'snowplow PRD enricher' or 'snowplow PRD collector' -> Instance refresh
- Once in the 'Instance refresh' tab, click 'Start instance refresh'
- For settings, use:
  - Terminate and launch (default already)
  - Set healthy percentage, Min=95%
  - the rest of the settings can be left as is
- Click 'Start instance refresh', and track its progress
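The same refresh can be started and tracked from the AWS CLI. A hedged sketch, assuming the group name shown in the UI ('snowplow PRD enricher' / 'snowplow PRD collector') is also the Auto Scaling group name; verify the exact name first.

```sh
# Start a refresh with the same settings as the UI flow (terminate and launch, min 95% healthy)
aws autoscaling start-instance-refresh \
  --auto-scaling-group-name "snowplow PRD enricher" \
  --preferences '{"MinHealthyPercentage": 95}'

# Track its progress
aws autoscaling describe-instance-refreshes \
  --auto-scaling-group-name "snowplow PRD enricher"
```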
## Important Notes

### A note on burstable machines

Currently, the EC2 collector/enricher instances both use the `t` machine types. These machine types are burstable:

> The T instance family provides a baseline CPU performance with the ability to burst above the baseline at any time for as long as required

When the instances are bursting, they consume CPU credits. Especially high CPU usage may not be apparent at first, because the machines are bursting. But once all CPU credits have been consumed, the machines can no longer burst, and this can degrade the system, as seen in the 2024-12-10 incident.

As such, it's important to do the following:

- be aware that we are using burstable instances
- keep an eye on the CPU credits, which can be found in the aforementioned Cloudwatch dashboard (a CLI sketch for checking the credit balance follows this list)
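If the dashboard widget is unavailable, the credit balance can also be pulled per instance with the AWS CLI. A hedged sketch; the instance ID is a placeholder and GNU `date` is assumed.

```sh
# CPU credit balance for one collector/enricher instance over the last 3 hours;
# a balance trending toward zero means the instance is about to lose its ability to burst
aws cloudwatch get-metric-statistics \
  --namespace AWS/EC2 \
  --metric-name CPUCreditBalance \
  --dimensions Name=InstanceId,Value="<instance-id>" \
  --statistics Minimum --period 300 \
  --start-time "$(date -u -d '-3 hours' +%Y-%m-%dT%H:%M:%SZ)" \
  --end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)"
```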
### How to SSH into EC2 instances

There are 2 ways to SSH into an EC2 instance:

- Using EC2 Instance Connect (AWS UI):
  - Log in to AWS and go to EC2 Instances
  - Click the `instance_id` that you want to enter, then click the 'Connect' tab
  - Select 'Connect using EC2 Instance Connect' (it should be selected by default), and then click 'Connect'
- From the bastion host:
  - You will need the `snowplow.pem` file from the 1Password Production Vault, and you will connect to the nodes as the `ec2-user`. Your command should look something like this (a sketch for finding the instance IP follows this list):

    ```sh
    ssh -i "snowplow.pem" ec2-user@<ec2-ip-address>
    ```
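To find an `<ec2-ip-address>` to plug into the command above, you can list the running instances from the CLI. A hedged sketch; the `tag:Name` filter value is an assumption about how the instances are tagged.

```sh
# List running Snowplow instances with their private IPs
aws ec2 describe-instances \
  --filters "Name=instance-state-name,Values=running" "Name=tag:Name,Values=*snowplow*" \
  --query 'Reservations[].Instances[].[InstanceId,PrivateIpAddress]' \
  --output table
```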
## Past Incidents

Incident list starting in December 2024. This list is not guaranteed to be complete, but could be useful to reference for future incidents:
- 2024-12-01: Investigate why snowplow good events backing up takes time
- 2024-12-10: Snowplow enriched events are not getting processed
## Prioritizing gitlab.com traffic

In this issue discussion, it was requested that we have some mechanism to prioritize gitlab.com traffic over Self-Managed (SM) traffic in an emergency situation.

If this scenario arises, here is the plan:

- Drop SM requests using ALB fixed-response actions (see the sketch after this list)
  - Prepared MR: ops.gitlab.net/10898
  - Additional context if needed: Issue #148
- Prepare a new environment in config-mgmt specifically for Self-Managed instance traffic, see HOWTO.md
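For reference, an ALB fixed-response rule looks roughly like the sketch below. This is only an illustration of the mechanism: the listener ARN, rule priority, and especially the condition used to identify SM traffic are placeholders, and the prepared MR (ops.gitlab.net/10898) is the source of truth for the real change.

```sh
# Add a high-priority listener rule that answers matching (SM) requests with a 503
# before they reach the collector target group. The path-pattern condition below is
# a stand-in; the real SM-vs-gitlab.com match lives in the prepared MR.
aws elbv2 create-rule \
  --listener-arn "<snowplow-alb-listener-arn>" \
  --priority 10 \
  --conditions '[{"Field":"path-pattern","PathPatternConfig":{"Values":["/example-sm-only-path/*"]}}]' \
  --actions '[{"Type":"fixed-response","FixedResponseConfig":{"StatusCode":"503","ContentType":"text/plain","MessageBody":"Temporarily shedding Self-Managed traffic"}}]'
```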