Snowplow Monitoring and Incident Response Runbook

Snowplow is a pipeline of nodes and streams used to accept events from GitLab.com and other applications. This runbook provides guidance for responding to CloudWatch alarms and troubleshooting issues with the Snowplow infrastructure.

Snowplow Diagram

All alarms include P0/P1/P2 in their names; here is what those priorities represent:

| Priority | Description | Response Time | Impact |
| --- | --- | --- | --- |
| P0 | Critical issues requiring immediate attention | Immediate | Immediate data loss or service outage |
| P1 | Significant issues requiring prompt action | Within 24 hours | Potential data loss within 24-48 hours |
| P2 | Non-urgent issues requiring investigation | Within 1 week | Minimal immediate impact |

P0 alarms indicate critical incidents requiring immediate attention. In the Snowplow infrastructure, this occurs when the Application Load Balancer cannot receive or route events properly, resulting in irrecoverable event loss.

  1. Create an incident in Slack
    • Follow the handbook instructions
    • Label the incident as P3 (internal-only classification)
    • In the incident.io Slack channel, tag @data-engineers @Ankit Panchal @Niko Belokolodov @Jonas Larsen @Ashwin
    • This tagging prevents duplicate incidents from being created
  2. Troubleshoot the issue yourself
    • Review the “What is Important” section below
    • Review logs and metrics in CloudWatch

P1 alarms indicate significant issues requiring action within 24 hours.

  1. Begin troubleshooting
    • Review the “What is Important” section below
    • Review logs and metrics in CloudWatch
  2. If uncertain how to resolve
    • Tag @vedprakash @Justin Wong in Slack

P2 alarms indicate potential issues that don’t require immediate action but should be addressed.

  1. Create an issue
    • Log the issue in the analytics project
    • Include alarm details, timestamps, and any patterns observed
  2. Investigate when convenient
  3. If blocked on resolution
    • Tag @juwong in the issue

If you are reading this, most likely one of two things has gone wrong: either the Snowplow pipeline has stopped accepting events, or it has stopped writing events to the S3 bucket.

  • Not accepting requests is the bigger problem and should be fixed as soon as possible. Event collection is synchronous, so events that cannot be accepted are lost immediately.
  • Processing events and writing them out is important, but not as time-sensitive. There is some slack in the queue to allow events to stack up before being written.
    • The raw events Kinesis stream has a data retention period of 48 hours. In a dire situation this can be raised by modifying the retention_period argument in aws-snowplow-prd/main.tf (an emergency CLI alternative is sketched below).
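
If retention ever has to be raised faster than a Terraform MR can land, the AWS CLI can do it directly; the change should still be codified in aws-snowplow-prd/main.tf afterwards so Terraform does not revert it. This is a minimal sketch with a placeholder stream name (confirm the real name in the Kinesis console first):

    # List streams in the aws-snowplow-prd account to confirm the raw stream's name
    aws kinesis list-streams

    # Raise retention from 48 to 72 hours (stream name below is a placeholder)
    aws kinesis increase-stream-retention-period \
      --stream-name <raw-good-stream-name> \
      --retention-period-hours 72
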
  1. A quick curl check should return a response of OK. The same URL is used to check the health of individual collector nodes against port 8000:

    curl https://snowplowprd.trx.gitlab.net/health
  2. Log into GPRD AWS and verify that there are collector nodes registered in the SnowPlowNLBTargetGroup target group (a CLI check is sketched after this list). If not, something has gone wrong with the snowplow PRD collector Auto Scaling group.

  3. Check Cloudflare and verify that the DNS name is still pointing to the EC2 Snowplow load balancer DNS name. The record in Cloudflare should be a CNAME.

    • aws-snowplow-prd env, DNS name: snowplowprd.trx.gitlab.net
  4. If there are EC2 collectors running, you can SSH (see ‘How to SSH into EC2 instances’ section) into the instance and then check the logs by running:

    docker logs --tail 15 stream-collector
  5. Are the collectors writing events to the raw (good or bad) Kinesis streams?

    • Look at the CloudWatch dashboard, or go to the Kinesis Data Streams service in AWS and look at the stream monitoring tabs.
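
Steps 2, 3, and 5 above can also be checked from a terminal. The commands below are a sketch only; the target group ARN and stream name are placeholders that need to be looked up in the aws-snowplow-prd account, and the date invocations assume GNU date syntax:

    # Step 3: check what the DNS name resolves to; the authoritative Cloudflare
    # record should be a CNAME pointing at the load balancer DNS name
    dig +short snowplowprd.trx.gitlab.net CNAME
    dig +short snowplowprd.trx.gitlab.net

    # Step 2: confirm there are healthy collector targets registered behind the load balancer
    # (find the real ARN with `aws elbv2 describe-target-groups` first)
    aws elbv2 describe-target-health \
      --target-group-arn <SnowPlowNLBTargetGroup-arn>

    # Step 5: confirm the collectors are writing records to the raw good stream
    # (stream name is a placeholder)
    aws cloudwatch get-metric-statistics \
      --namespace AWS/Kinesis \
      --metric-name IncomingRecords \
      --dimensions Name=StreamName,Value=<raw-good-stream-name> \
      --start-time "$(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%S)" \
      --end-time "$(date -u +%Y-%m-%dT%H:%M:%S)" \
      --period 300 \
      --statistics Sum
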
  1. First, make sure the collectors are working correctly by looking over the steps above. If nothing is being collected, nothing will be written out.

  2. In the aws-snowplow-prd CloudWatch dashboard, look at the Stream Records Age graph to see if a Kinesis stream is backing up. This graph shows how long (in milliseconds) records sit in the streams, and it should be near zero most of the time. If lots of records are backing up, the enrichers may not be picking up work, or Firehose is not writing records to S3.

  3. Verify there are running enricher instances by checking the SnowPlowEnricher auto scaling group.

  4. There is currently no automated check that the enricher processes are running on the nodes. To check the logs, SSH (see ‘How to SSH into EC2 instances’ section) into one of the enricher instances and then run:

    docker logs --tail 15 stream-enrich
  5. Are the enricher nodes picking up events and writing them into the enriched Kinesis streams? Check the Kinesis stream monitoring tabs.

  6. Check the Kinesis Firehose monitoring for the enriched (good and bad) streams to confirm they are processing events. You may want to turn on CloudWatch logging if you are stuck and can’t seem to figure out what’s wrong.

  7. Check the Lambda function that is used to process events in Firehose. There should be plenty of invocations at any time of day. A graph of invocations is also in CloudWatch (a CLI sketch of these checks follows this list).
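
Several of the checks above can also be pulled with the CLI instead of the console. This is a sketch; the stream, Auto Scaling group, and delivery stream names are placeholders, and the date invocations assume GNU date syntax:

    # Step 2: iterator age on the enriched stream; it should stay close to zero
    aws cloudwatch get-metric-statistics \
      --namespace AWS/Kinesis \
      --metric-name GetRecords.IteratorAgeMilliseconds \
      --dimensions Name=StreamName,Value=<enriched-good-stream-name> \
      --start-time "$(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%S)" \
      --end-time "$(date -u +%Y-%m-%dT%H:%M:%S)" \
      --period 300 \
      --statistics Maximum

    # Step 3: confirm the enricher Auto Scaling group has in-service instances
    # (use the exact group name shown in the console)
    aws autoscaling describe-auto-scaling-groups \
      --auto-scaling-group-names <enricher-asg-name> \
      --query 'AutoScalingGroups[0].Instances[].[InstanceId,LifecycleState]'

    # Step 6: check the enriched good Firehose delivery stream status
    aws firehose describe-delivery-stream \
      --delivery-stream-name <enriched-good-firehose-name>
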

The CloudWatch dashboard is useful for quickly understanding the state of the infrastructure when you’re debugging a problem. It’s organized by service, in the order an event passes through the pipeline (LB -> EC2 -> Kinesis, etc.).

In the past, some important widgets in the dashboard have been:

  1. Kinesis stream records age: the most important widget, because it measures how long events sit in Kinesis without being enriched; in the past we have had problems with this backing up
  2. Auto-scaling group size: if we see collectors scaling up but not scaling back down, we may need to increase the number of collectors to make sure we’re always ready to ingest heavier event traffic

The Snowplow collector and enricher instances are started with launch configuration templates. These launch configuration templates include the Snowplow configs: collector-user-data.sh and enricher-user-data.sh.

The Snowplow configs control how the Snowplow collector/enricher and the Kinesis streams interact, and they may occasionally need to be updated. Here are the steps:

  1. Within the .sh file(s), update the Snowplow config values
  2. Create an MR to apply the changes, which should update the aws_launch_configuration resource; see the example MR

Lastly, to check that your config has been updated, SSH into one of the instances (see ‘How to SSH into EC2 instances’ section) and run:

cat /snowplow/config/config.hocon
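
If you want to confirm the launch configuration itself carries the new user data before (or instead of) SSHing in, the AWS CLI can decode it. The launch configuration name below is a placeholder:

    # User data comes back base64-encoded; decode it to inspect the rendered script
    aws autoscaling describe-launch-configurations \
      --launch-configuration-names <snowplow-prd-collector-launch-config> \
      --query 'LaunchConfigurations[0].UserData' \
      --output text | base64 --decode
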

You may need to do an instance refresh manually, for example because:

  • instances have become unresponsive

Here are the instructions:

  1. The instances need to be terminated and recreated for them to use the updated config. To access the ‘Instance refresh’ tab in the UI:
    • go to EC2 -> Auto Scaling groups -> click ‘snowplow PRD enricher’ or ‘snowplow PRD collector’ -> Instance refresh
  2. Once in the ‘Instance refresh’ tab, click ‘Start instance refresh’
  3. For settings, use:
    • Terminate and launch (default already)
    • Set healthy percentage, Min=95%
    • Leave the rest of the settings as is
  4. Click ‘Start instance refresh’, and track its progress
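
If the console is unavailable, the same refresh can be started from the CLI. This is a sketch; substitute the exact Auto Scaling group name (e.g. the collector or enricher group named above):

    # Start a rolling refresh, keeping at least 95% of capacity healthy
    aws autoscaling start-instance-refresh \
      --auto-scaling-group-name "<collector-or-enricher-asg-name>" \
      --preferences '{"MinHealthyPercentage": 95}'

    # Track its progress
    aws autoscaling describe-instance-refreshes \
      --auto-scaling-group-name "<collector-or-enricher-asg-name>"
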

Currently, the EC2 collector and enricher instances both use T-family machine types.

These machine types are burstable:

The T instance family provides a baseline CPU performance with the ability to burst above the baseline at any time for as long as required

When the instances are bursting, they consume CPU credits.

High CPU usage may not be apparent at first because the machines are bursting, but once all CPU credits have been consumed the machines can no longer burst, which can degrade the system, as seen in the 2024-12-10 incident.

As such, it’s important to do the following:

  • be aware that we are using burstable instances
  • keep an eye on the CPU credit balance, which can be found in the aforementioned CloudWatch dashboard (and checked from the CLI, as sketched below)
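
The credit balance can also be pulled per instance with the CLI; the instance ID below is a placeholder, and the date invocations assume GNU date syntax:

    # CPUCreditBalance trends towards zero while the instance keeps bursting
    aws cloudwatch get-metric-statistics \
      --namespace AWS/EC2 \
      --metric-name CPUCreditBalance \
      --dimensions Name=InstanceId,Value=<collector-or-enricher-instance-id> \
      --start-time "$(date -u -d '6 hours ago' +%Y-%m-%dT%H:%M:%S)" \
      --end-time "$(date -u +%Y-%m-%dT%H:%M:%S)" \
      --period 300 \
      --statistics Average
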

There are two ways to SSH into an EC2 instance:

  1. Using EC2 Instance Connect (AWS UI):
    • Log in to AWS and go to EC2 Instances
    • Click the instance ID of the instance you want to enter, then click the ‘Connect’ tab
    • Select Connect using EC2 Instance Connect (it should be selected by default), and then click ‘Connect’
  2. From bastion host:
    • you will need the snowplow.pem file from 1Password Production Vault and you will connect to the nodes as the ec2-user. Your command should look something like this:

      ssh -i "snowplow.pem" ec2-user@<ec2-ip-address>

Incident list starting in December 2024. This list is not guaranteed to be complete, but it may be useful to reference for future incidents:

  1. 2024-12-01: Investigate why snowplow good events backing up takes time
  2. 2024-12-10: Snowplow enriched events are not getting processed

In this issue discussion, it was requested that we have a mechanism to prioritize GitLab.com traffic over Self-Managed (SM) traffic in an emergency situation.

If this scenario arises, here is the plan:

  1. Drop SM requests using ALB fixed-response actions (a sketch follows this list)
  2. Prepare a new environment in config-mgmt specifically for Self-Managed instance traffic; see HOWTO.md
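
For step 1, here is a heavily hedged sketch of what an ALB fixed-response rule could look like via the CLI. The listener ARN, rule priority, and above all the condition used to identify SM traffic are placeholders that would have to be worked out at the time:

    # Placeholder only: how SM requests are distinguished (header, path, etc.) must be decided first
    aws elbv2 create-rule \
      --listener-arn <snowplow-alb-listener-arn> \
      --priority 10 \
      --conditions '[{"Field": "http-header", "HttpHeaderConfig": {"HttpHeaderName": "<header-identifying-sm>", "Values": ["*"]}}]' \
      --actions '[{"Type": "fixed-response", "FixedResponseConfig": {"StatusCode": "503", "ContentType": "text/plain", "MessageBody": "Temporarily unavailable"}}]'

The rule only takes effect while it exists on the listener; it can be removed with aws elbv2 delete-rule once normal routing should resume.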