
Scaling CustomersDot VMs

This runbook documents the process for scaling CustomersDot VMs both horizontally (adding new VMs) and vertically (resizing existing VMs). This capability is critical for handling increased load, particularly for Usage Billing and other high-traffic scenarios.

  • SRE Access: Standard SRE permissions for config-mgmt repository
  • InfraSec Approval: Required for Teleport token creation (for horizontal scaling)
  • Fulfillment Team Access: Maintainer access to the customersdot-ansible and customers-gitlab-com repositories
  • CustomersDot VMs run on Ubuntu 20.04 LTS with a specific boot image
  • All VMs must be registered with Teleport for SSH access. This requirement covers both human administrative access and automated deployment and provisioning processes.
  • The provisioning and deployment process requires coordination between SRE and Fulfillment teams

Create a merge request in config-mgmt that:

  1. Adds new VM(s) to the node map in the appropriate environment (stgsub or prdsub)
  2. Creates Teleport provisioning tokens for the new VM(s)

Example MR: config-mgmt!12567

Key configuration points:

  • Use the node map structure to define individual VMs
  • Specify the correct os_boot_image_override (currently ubuntu-os-cloud/ubuntu-2004-focal-v20240830)
  • Include Teleport token creation in the same MR
  • Ensure proper zone distribution for high availability
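
Before choosing names and zones for the new VM(s), it can help to list the existing fleet to see the current naming and zone distribution. A minimal sketch, assuming gcloud access to the staging project (swap in the production project where appropriate):

Terminal window
# List existing CustomersDot VMs with their zone and machine type
gcloud compute instances list --project=gitlab-subscriptions-staging \
  --filter="name~customers" \
  --format="table(name,zone.basename(),machineType.basename(),status)"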

Approval Requirements:

  • SRE code owner approval
  • InfraSec approval (for Teleport tokens)

Once the MR is merged and Atlantis applies the changes:

  1. Monitor the VM startup via serial console:

    Terminal window
    gcloud compute --project=<PROJECT_ID> instances tail-serial-port-output <VM_NAME> --zone=<ZONE> --port=1
    # Example for staging:
    gcloud compute --project=gitlab-subscriptions-staging instances tail-serial-port-output customers-03-inf-stgsub --zone=us-east1-b --port=1
  2. Wait for the startup script to complete (typically 5-10 minutes)

  3. Look for the message: Startup finished in X.XXXs (kernel) + Xmin X.XXXs (userspace)

Note: With the Teleport token pre-created, Teleport should automatically work once the VM is up.
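
A quick way to confirm the registration, assuming tsh is already logged in and using the staging VM name from the example above:

Terminal window
# The new node should show up in the Teleport inventory
tsh ls | grep customers-03-inf-stgsub
# And an SSH session should work end to end
tsh ssh customers-03-inf-stgsub uptime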

Run the Ansible provisioning job from the customersdot-ansible repository.

The ability to run these CI pipelines is limited to maintainers. In case of emergency, the EOC can use their admin account to kick off the pipelines.

Manual provisioning steps:

  1. Navigate to the customersdot-ansible project
  2. Go to CI/CD → Pipelines → Run Pipeline
  3. Select the appropriate branch (usually master)
  4. Run the pipeline and wait for the provision job to complete
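
If the UI is unavailable, the same pipeline can in principle be created through the GitLab API. The project ID and token below are placeholders, so treat this as a sketch rather than the documented path:

Terminal window
# Hypothetical API equivalent of "Run Pipeline" on the master branch
curl --request POST \
  --header "PRIVATE-TOKEN: <your_access_token>" \
  "https://gitlab.com/api/v4/projects/<customersdot-ansible-project-id>/pipeline?ref=master"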

Reference: Manual provisioning documentation

Expected duration: 25-30 minutes

Run the deployment job from the customers-gitlab-com repository.

The ability to run these CI pipelines is limited to maintainers. In case of emergency, the EOC can use their admin account to kick off the pipelines.

Manual deployment steps:

  1. Navigate to the customers-gitlab-com project
  2. Go to CI/CD → Pipelines
  3. Find the latest pipeline for the staging branch
  4. Manually trigger the deploy-staging or deploy-production job
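
As with provisioning, a manual job can also be played via the API if the UI is unavailable; the job ID comes from the pipeline page and everything in angle brackets is a placeholder:

Terminal window
# Hypothetical API equivalent of pressing "play" on the manual deploy job
curl --request POST \
  --header "PRIVATE-TOKEN: <your_access_token>" \
  "https://gitlab.com/api/v4/projects/<customers-gitlab-com-project-id>/jobs/<deploy-job-id>/play"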

Reference: Manual deployment documentation

Expected duration: 5-10 minutes

Monitor the CustomersDot Overview Dashboard to confirm:

  1. The new VM appears in the metrics
  2. The VM is receiving traffic
  3. No errors are being reported
  4. Response times are normal
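
To double-check that the VM is actually behind the load balancer (and not just healthy in isolation), the target pools can be inspected directly. A sketch using the staging pool names from the vertical-scaling example later in this runbook; the region is assumed from the us-east1 zones used here:

Terminal window
# The new VM should appear in the instance list of both target pools
gcloud compute target-pools describe stgsub-tcp-lb-customers-http \
  --project=gitlab-subscriptions-staging --region=us-east1 \
  --format="value(instances)"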

Step 7: Remove the machine from Teleport tokens

  1. Create an MR to remove the machine name from environments/teleport-production/tokens.tf (added in step 1).

Vertical scaling involves changing the machine type of existing VMs to increase or decrease resources.

  • VMs must be stopped to change machine type
  • This causes downtime for the specific VM being resized
  • Resize VMs one at a time to maintain service availability
  • Total time per VM: approximately 2-5 minutes

For each VM you want to resize:

Terminal window
# Remove from target pools
gcloud compute target-pools remove-instances prdsub-tcp-lb-customers-http --instances=customers-XX-inf-prdsub --instances-zone=<ZONE> --project=<PROJECT_ID>
gcloud compute target-pools remove-instances prdsub-tcp-lb-customers-https --instances=customers-XX-inf-prdsub --instances-zone=<ZONE> --project=<PROJECT_ID>
# Wait ~5 minutes or monitor active connections before stopping (see the sketch after this block)
# Stop the VM
gcloud compute instances stop <VM_NAME> --zone=<ZONE> --project=<PROJECT_ID>
# Change the machine type
gcloud compute instances set-machine-type <VM_NAME> --machine-type=<NEW_MACHINE_TYPE> --zone=<ZONE> --project=<PROJECT_ID>
# Start the VM
gcloud compute instances start <VM_NAME> --zone=<ZONE> --project=<PROJECT_ID>
# Wait for the VM to fully boot
while ! gcloud compute ssh <VM_NAME> --zone=<ZONE> --project=<PROJECT_ID> --command="uptime" --ssh-flag="-o ConnectTimeout=10" --quiet >/dev/null 2>&1; do printf "."; sleep 5; done && echo " VM ready!"
# Add the machine back to target pools
gcloud compute target-pools add-instances prdsub-tcp-lb-customers-http --instances=customers-XX-inf-prdsub --instances-zone=<ZONE> --project=<PROJECT_ID>
gcloud compute target-pools add-instances prdsub-tcp-lb-customers-https --instances=customers-XX-inf-prdsub --instances-zone=<ZONE> --project=<PROJECT_ID>
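
For the "monitor active connections" step, one option is to count established connections on the VM before stopping it. A sketch assuming Teleport access and that traffic arrives on ports 80/443 (implied by the http/https target pool names):

Terminal window
# Repeat until the count is low enough to stop the VM safely
tsh ssh customers-XX-inf-prdsub "ss -Htn state established '( sport = :80 or sport = :443 )' | wc -l"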

Example (resizing to n1-standard-8):

Terminal window
gcloud compute target-pools remove-instances stgsub-tcp-lb-customers-http --instances=customers-03-inf-stgsub --instances-zone=us-east1-b --project=gitlab-subscriptions-staging
gcloud compute target-pools remove-instances stgsub-tcp-lb-customers-https --instances=customers-03-inf-stgsub --instances-zone=us-east1-b --project=gitlab-subscriptions-staging
gcloud --project gitlab-subscriptions-staging compute instances stop customers-03-inf-stgsub --zone=us-east1-b
gcloud --project gitlab-subscriptions-staging compute instances set-machine-type customers-03-inf-stgsub --machine-type=n1-standard-8 --zone=us-east1-b
gcloud --project gitlab-subscriptions-staging compute instances start customers-03-inf-stgsub --zone=us-east1-b
gcloud compute target-pools add-instances stgsub-tcp-lb-customers-http --instances=customers-03-inf-stgsub --instances-zone=us-east1-b --project=gitlab-subscriptions-staging
gcloud compute target-pools add-instances stgsub-tcp-lb-customers-https --instances=customers-03-inf-stgsub --instances-zone=us-east1-b --project=gitlab-subscriptions-staging
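
Before moving on to the next VM, it is worth confirming that the resize took effect:

Terminal window
# Should print the new machine type, e.g. n1-standard-8
gcloud --project gitlab-subscriptions-staging compute instances describe customers-03-inf-stgsub \
  --zone=us-east1-b --format="value(machineType.basename())"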

Once all VMs have been resized, update the Terraform configuration to match:

  1. Create an MR in config-mgmt
  2. Update the machine_type in the node map for each resized VM
  3. Run atlantis plan -- --refresh to verify it’s a no-op plan
  4. Get the MR approved and merged

Example MR: config-mgmt!12571

Important: The Terraform plan should show no changes if the VMs were resized correctly.

Symptoms: Unable to connect via tsh ssh to a newly created VM

Solutions:

  1. Verify Teleport token was created in the config-mgmt MR
  2. Check the serial console for Teleport registration errors (see the sketch after this list)
  3. Ensure the VM has fully booted (check for “Startup finished” message)
  4. Verify your Okta user ID matches your Chef user ID
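
For item 2, the serial console can be searched after the fact rather than tailed live; the VM name and zone below are the staging example used earlier in this runbook:

Terminal window
# Look for Teleport registration errors in the boot log
gcloud compute instances get-serial-port-output customers-03-inf-stgsub \
  --project=gitlab-subscriptions-staging --zone=us-east1-b --port=1 | grep -i teleport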

Break-glass procedure: If Teleport is completely unavailable, see the break-glass SSH access documentation

Symptoms: Chef run fails with error about esm-infra or Ubuntu Advantage

Background: This is a known issue with the Ubuntu Advantage cookbook on Ubuntu 20.04

Solutions:

  1. This has been addressed in recent Chef cookbook updates
  2. If it persists, see platform/runway/team#715
  3. The issue should not occur on Ubuntu 22.04 (planned upgrade)

Symptoms: Ansible provisioning fails with “Failed to import the required Python library (psycopg2)”

Solution: This should be fixed in the Ansible playbooks, but if it occurs:

Terminal window
# SSH to the VM
tsh ssh <VM_NAME>
# Install the package manually
sudo apt-get update
sudo apt-get install -y python3-psycopg2
# Retry the provisioning job

Symptoms: Ansible fails with “No such file or directory: ‘nginx’”

Solution: This should be fixed in the Ansible playbooks to use /usr/sbin/nginx, but if it occurs, ensure the Ansible playbooks are up to date.
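
A quick check of where the nginx binary lives on an affected VM (assumes Teleport access):

Terminal window
# If "command -v" finds nothing but /usr/sbin/nginx exists, the playbook needs the full path
tsh ssh <VM_NAME> "command -v nginx; ls -l /usr/sbin/nginx"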

Checklist:

  1. Verify the pet_name=customers label is set
  2. Check the CustomersDot dashboard
  3. Verify the VM is in the correct instance group
  4. Check nginx is running: sudo systemctl status nginx
  5. Check application logs: /home/customersdot/CustomersDot/current/log/production.log
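
Items 1, 4 and 5 can be spot-checked from the command line; the VM name below is the staging example used elsewhere in this runbook:

Terminal window
# Item 1: the pet_name label should come back as "customers"
gcloud compute instances describe customers-03-inf-stgsub --zone=us-east1-b \
  --project=gitlab-subscriptions-staging --format="value(labels.pet_name)"
# Items 4-5: nginx status and recent application log lines over Teleport
tsh ssh customers-03-inf-stgsub "sudo systemctl status nginx --no-pager && sudo tail -n 20 /home/customersdot/CustomersDot/current/log/production.log"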

Horizontal Scaling:

  • Infrastructure MR: Any SRE (requires SRE + InfraSec approval)
  • Provisioning: Maintainers on customersdot-ansible (currently limited SREs + Fulfillment team) or SRE with admin account.
  • Deployment: Maintainers on customers-gitlab-com (Fulfillment team + limited SREs) or SRE with admin account.

Vertical Scaling:

  • VM Resize: Any SRE with GCP access to the CustomersDot projects
  • Terraform Update: Any SRE (requires SRE approval)

Current SRE Maintainers on Fulfillment Repositories


As of November 2024:

  • Pierre Jambet
  • Cameron McFarland
  • Gonzalo Servat

Note: Additional SREs may need maintainer access for emergency scaling scenarios. File an Access Request if needed.

Direct SSH access to CustomersDot VMs is restricted due to audit requirements. Not all SREs have the GCP IAM permissions to use gcloud compute ssh with IAP. Teleport is the primary access method.

Horizontal scaling:

| Step | Duration | Notes |
| --- | --- | --- |
| Create and approve MR | 30-60 min | Depends on reviewer availability |
| VM creation and boot | 5-10 min | Automated via Terraform |
| Provisioning | 25-30 min | May need retry if transient failures |
| Deployment | 5-10 min | Usually succeeds first try |
| Total | 75-110 min | Assuming no issues |

Vertical scaling:

| Step | Duration | Notes |
| --- | --- | --- |
| Stop, resize, start VM | 2-5 min | Per VM |
| Update Terraform | 15-30 min | MR creation and approval |
| Total per VM | 17-35 min | Do VMs sequentially |