Scaling CustomersDot VMs
Overview
This runbook documents the process for scaling CustomersDot VMs both horizontally (adding new VMs) and vertically (resizing existing VMs). This capability is critical for handling increased load, particularly for Usage Billing and other high-traffic scenarios.
Prerequisites
Required Access
- SRE Access: Standard SRE permissions for the config-mgmt repository
- InfraSec Approval: Required for Teleport token creation (for horizontal scaling)
- Fulfillment Team Access: Maintainer access to the following repositories:
Key Repositories
- Infrastructure: config-mgmt
- Provisioning: customersdot-ansible
- Application: customers-gitlab-com
Important Notes
- CustomersDot VMs run on Ubuntu 20.04 LTS with a specific boot image
- All VMs must be registered with Teleport for SSH access. This requirement covers both human administrative access and automated deployment and provisioning processes.
- The provisioning and deployment process requires coordination between SRE and Fulfillment teams
Horizontal Scaling (Adding New VMs)
Step 1: Create Infrastructure MR
Create a merge request in config-mgmt that:
- Adds new VM(s) to the node map in the appropriate environment (stgsub or prdsub)
- Creates Teleport provisioning tokens for the new VM(s)
Example MR: config-mgmt!12567
Key configuration points:
- Use the node map structure to define individual VMs
- Specify the correct os_boot_image_override (currently ubuntu-os-cloud/ubuntu-2004-focal-v20240830)
- Include Teleport token creation in the same MR
- Ensure proper zone distribution for high availability
Approval Requirements:
- SRE code owner approval
- InfraSec approval (for Teleport tokens)
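To sanity-check the zone distribution, it can help to list the existing CustomersDot VMs with their zones before and after the change; a minimal sketch, assuming the staging project named elsewhere in this runbook and the customers-* naming convention:

```sh
# List CustomersDot VMs with zone, machine type, and status
# (swap in the production project ID for prdsub changes):
gcloud compute instances list \
  --project=gitlab-subscriptions-staging \
  --filter="name~'^customers-'" \
  --format="table(name,zone,machineType,status)"
```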
Step 2: Monitor VM Creation
Once the MR is merged and Atlantis applies the changes:
- Monitor the VM startup via serial console:

```sh
gcloud compute --project=<PROJECT_ID> instances tail-serial-port-output <VM_NAME> --zone=<ZONE> --port=1

# Example for staging:
gcloud compute --project=gitlab-subscriptions-staging instances tail-serial-port-output customers-03-inf-stgsub --zone=us-east1-b --port=1
```

- Wait for the startup script to complete (typically 5-10 minutes)
- Look for the message: Startup finished in X.XXXs (kernel) + Xmin X.XXXs (userspace)
Note: With the Teleport token pre-created, Teleport should automatically work once the VM is up.
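To confirm the registration from your workstation, a quick check, assuming you are already logged in to Teleport with tsh:

```sh
# The new VM should appear in the Teleport node inventory once registered:
tsh ls | grep <VM_NAME>
```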
Step 3: Provision the VM
Run the Ansible provisioning job from the customersdot-ansible repository.
The ability to run these CI pipelines is limited to maintainers. In case of emergency, the EOC can use their admin account to kick off the pipelines.
Manual provisioning steps:
- Navigate to the customersdot-ansible project
- Go to CI/CD → Pipelines → Run Pipeline
- Select the appropriate branch (usually master)
- Run the pipeline and wait for the provision job to complete
Reference: Manual provisioning documentation
Expected duration: 25-30 minutes
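If the UI is unavailable, pipelines can also be created through the standard GitLab API; a sketch, assuming a personal access token with api scope, with the numeric project ID left as a placeholder:

```sh
# Create a new pipeline on master for customersdot-ansible
# (<project_id> is the numeric ID shown on the project's overview page):
curl --request POST \
  --header "PRIVATE-TOKEN: <your_access_token>" \
  "https://gitlab.com/api/v4/projects/<project_id>/pipeline?ref=master"
```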
Step 4: Deploy the Application
Run the deployment job from the customers-gitlab-com repository.
The ability to run these CI pipelines is limited to maintainers. In case of emergency, the EOC can use their admin account to kick off the pipelines.
Manual deployment steps:
- Navigate to the customers-gitlab-com project
- Go to CI/CD → Pipelines
- Find the latest pipeline for the staging branch
- Manually trigger the deploy-staging or deploy-production job
Reference: Manual deployment documentation
Expected duration: 5-10 minutes
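As with provisioning, the manual job can be played through the GitLab API instead of the UI; a sketch with illustrative placeholders:

```sh
# Play a manual deploy job; find <job_id> on the pipeline page or via
# GET /projects/<project_id>/pipelines/<pipeline_id>/jobs:
curl --request POST \
  --header "PRIVATE-TOKEN: <your_access_token>" \
  "https://gitlab.com/api/v4/projects/<project_id>/jobs/<job_id>/play"
```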
Step 5: Verify the VM is Serving Traffic
Monitor the CustomersDot Overview Dashboard to confirm:
- The new VM appears in the metrics
- The VM is receiving traffic
- No errors are being reported
- Response times are normal
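The load balancer's view of the new VM can also be checked directly; a sketch, reusing the target pool names from the vertical-scaling section below, with project and region left as placeholders:

```sh
# Confirm the load balancer reports the new VM as healthy:
gcloud compute target-pools get-health prdsub-tcp-lb-customers-https \
  --project=<PROJECT_ID> --region=<REGION>
```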
Step 6: Remove the Machine from Teleport Tokens
- Create an MR to remove the machine name from environments/teleport-production/tokens.tf (added in Step 1).
Vertical Scaling (Resizing Existing VMs)
Vertical scaling involves changing the machine type of existing VMs to increase or decrease resources.
Important Notes
- VMs must be stopped to change machine type
- This causes downtime for the specific VM being resized
- Resize VMs one at a time to maintain service availability
- Total time per VM: approximately 2-5 minutes
Step 1: Resize the VM
For each VM you want to resize:
```sh
# Remove from target pools
gcloud compute target-pools remove-instances prdsub-tcp-lb-customers-http --instances=customers-XX-inf-prdsub --instances-zone=XXX
gcloud compute target-pools remove-instances prdsub-tcp-lb-customers-https --instances=customers-XX-inf-prdsub --instances-zone=XXX

# Wait 5 minutes or monitor active connections (see the sketch after the example below)

# Stop the VM
gcloud compute instances stop <VM_NAME> --zone=<ZONE> --project=<PROJECT_ID>

# Change the machine type
gcloud compute instances set-machine-type <VM_NAME> --machine-type=<NEW_MACHINE_TYPE> --zone=<ZONE> --project=<PROJECT_ID>

# Start the VM
gcloud compute instances start <VM_NAME> --zone=<ZONE> --project=<PROJECT_ID>

# Wait for the VM to fully boot
while ! gcloud compute ssh <VM_NAME> --zone=<ZONE> --command="uptime" --ssh-flag="-o ConnectTimeout=10" --quiet >/dev/null 2>&1; do printf "."; sleep 5; done && echo " VM ready!"

# Add the machine back to the target pools
gcloud compute target-pools add-instances prdsub-tcp-lb-customers-http --instances=customers-XX-inf-prdsub --instances-zone=XXX
gcloud compute target-pools add-instances prdsub-tcp-lb-customers-https --instances=customers-XX-inf-prdsub --instances-zone=XXX
```

Example (resizing to n1-standard-8):
```sh
gcloud compute target-pools remove-instances stgsub-tcp-lb-customers-http --instances=customers-03-inf-stgsub --instances-zone=us-east1-b
gcloud compute target-pools remove-instances stgsub-tcp-lb-customers-https --instances=customers-03-inf-stgsub --instances-zone=us-east1-b
gcloud --project gitlab-subscriptions-staging compute instances stop customers-03-inf-stgsub --zone=us-east1-b
gcloud --project gitlab-subscriptions-staging compute instances set-machine-type customers-03-inf-stgsub --machine-type=n1-standard-8 --zone=us-east1-b
gcloud --project gitlab-subscriptions-staging compute instances start customers-03-inf-stgsub --zone=us-east1-b
gcloud compute target-pools add-instances stgsub-tcp-lb-customers-http --instances=customers-03-inf-stgsub --instances-zone=us-east1-b
gcloud compute target-pools add-instances stgsub-tcp-lb-customers-https --instances=customers-03-inf-stgsub --instances-zone=us-east1-b
```
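The "monitor active connections" step can be made concrete; a sketch, assuming Teleport access to the VM and that the load balancer forwards ports 80 and 443:

```sh
# Count established connections on the LB-facing ports; proceed with the stop
# once this is at or near zero (or after the 5-minute drain):
tsh ssh <VM_NAME> "ss -Htn state established '( sport = :443 or sport = :80 )' | wc -l"
```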
Step 2: Update Terraform Configuration
Once all VMs have been resized, update the Terraform configuration to match:
- Create an MR in config-mgmt
- Update the machine_type in the node map for each resized VM
- Run atlantis plan -- --refresh to verify it's a no-op plan
- Get the MR approved and merged
Example MR: config-mgmt!12571
Important: The Terraform plan should show no changes if the VMs were resized correctly.
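Before running the plan, each VM's live machine type can be compared against the node map; a minimal sketch with placeholders:

```sh
# Print the current machine type; it should match the machine_type in the node map:
gcloud compute instances describe <VM_NAME> --zone=<ZONE> --project=<PROJECT_ID> \
  --format="value(machineType.basename())"
```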
Troubleshooting
Cannot SSH to New VM
Symptoms: Unable to connect via tsh ssh to a newly created VM
Solutions:
- Verify Teleport token was created in the config-mgmt MR
- Check the serial console for Teleport registration errors
- Ensure the VM has fully booted (check for “Startup finished” message)
- Verify your Okta user ID matches your Chef user ID
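The first two checks can be run from your workstation; a sketch, assuming an active tsh session and the same placeholders as in Step 2:

```sh
# Is the node registered with Teleport?
tsh ls | grep <VM_NAME>

# Scan the serial console output for Teleport registration errors:
gcloud compute --project=<PROJECT_ID> instances get-serial-port-output <VM_NAME> \
  --zone=<ZONE> --port=1 | grep -i teleport
```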
Break-glass procedure: If Teleport is completely unavailable, see the break-glass SSH access documentation
Chef Fails with Ubuntu Advantage Error
Symptoms: Chef run fails with an error about esm-infra or Ubuntu Advantage
Background: This is a known issue with the Ubuntu Advantage cookbook on Ubuntu 20.04
Solutions:
- This has been addressed in recent Chef cookbook updates
- If it persists, see platform/runway/team#715
- The issue should not occur on Ubuntu 22.04 (planned upgrade)
Provisioning Fails with psycopg2 Error
Symptoms: Ansible provisioning fails with “Failed to import the required Python library (psycopg2)”
Solution: This should be fixed in the Ansible playbooks, but if it occurs:
```sh
# SSH to the VM
tsh ssh <VM_NAME>

# Install the package manually
sudo apt-get update
sudo apt-get install -y python3-psycopg2

# Retry the provisioning job
```

Provisioning Fails with nginx Error
Symptoms: Ansible fails with “No such file or directory: ‘nginx’”
Solution: This should be fixed in the Ansible playbooks to use /usr/sbin/nginx, but if it occurs, ensure the Ansible playbooks are up to date.
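To confirm the binary location on an affected VM, a quick check, assuming Teleport access:

```sh
# Check whether nginx is on PATH and where the binary actually lives:
tsh ssh <VM_NAME> "command -v nginx; ls -l /usr/sbin/nginx"
```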
VM Not Receiving Traffic After Deployment
Checklist:
- Verify the pet_name=customers label is set
- Check the CustomersDot dashboard
- Verify the VM is in the correct instance group
- Check nginx is running: sudo systemctl status nginx
- Check application logs: /home/customersdot/CustomersDot/current/log/production.log
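Several of these checks can be run from one terminal; a sketch, assuming Teleport access and the usual gcloud placeholders:

```sh
# Check the pet_name label on the instance:
gcloud compute instances describe <VM_NAME> --zone=<ZONE> --project=<PROJECT_ID> \
  --format="value(labels.pet_name)"

# Confirm nginx is running and tail the application log on the VM:
tsh ssh <VM_NAME> "sudo systemctl status nginx --no-pager"
tsh ssh <VM_NAME> "tail -n 50 /home/customersdot/CustomersDot/current/log/production.log"
```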
Access and Permissions
Who Can Perform These Operations?
Horizontal Scaling:
- Infrastructure MR: Any SRE (requires SRE + InfraSec approval)
- Provisioning: Maintainers on customersdot-ansible (currently limited SREs + Fulfillment team) or SRE with admin account.
- Deployment: Maintainers on customers-gitlab-com (Fulfillment team + limited SREs) or SRE with admin account.
Vertical Scaling:
- VM Resize: Any SRE with GCP access to the CustomersDot projects
- Terraform Update: Any SRE (requires SRE approval)
Current SRE Maintainers on Fulfillment Repositories
As of November 2024:
- Pierre Jambet
- Cameron McFarland
- Gonzalo Servat
Note: Additional SREs may need maintainer access for emergency scaling scenarios. File an Access Request if needed.
Restricted Access Note
Direct SSH access to CustomersDot VMs is restricted due to audit requirements. Not all SREs have the GCP IAM permissions to use gcloud compute ssh with IAP. Teleport is the primary access method.
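In practice this means starting from Teleport rather than gcloud; a sketch, with the proxy address left as a placeholder since it is environment-specific:

```sh
# Log in to Teleport, then SSH to the VM through it:
tsh login --proxy=<teleport-proxy-address>
tsh ssh <VM_NAME>
```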
Timeline Expectations
Horizontal Scaling (Adding One VM)
| Step | Duration | Notes |
|---|---|---|
| Create and approve MR | 30-60 min | Depends on reviewer availability |
| VM creation and boot | 5-10 min | Automated via Terraform |
| Provisioning | 25-30 min | May need retry if transient failures |
| Deployment | 5-10 min | Usually succeeds first try |
| Total | 65-110 min | Assuming no issues |
Vertical Scaling (Resizing One VM)
| Step | Duration | Notes |
|---|---|---|
| Stop, resize, start VM | 2-5 min | Per VM |
| Update Terraform | 15-30 min | MR creation and approval |
| Total per VM | 17-35 min | Do VMs sequentially |
Related Documentation
- CustomersDot Overview
- CustomersDot Ansible Documentation
- Fulfillment Escalation Process
- Infrastructure Change Management
Reference Issues and MRs
- Original discovery and testing: production-engineering#27880
- Node map conversion (staging): config-mgmt!12504
- Node map conversion (production): config-mgmt!12530
- Example horizontal scaling MR: config-mgmt!12567
- Example vertical scaling MR: config-mgmt!12571