CustomersDot main troubleshoot documentation
Overview
Section titled “Overview”customers.gitlab.com is the site where GitLab customers can manage their subscription(s) for GitLab.com.
For all availability issues see the escalation process for incidents or outages.
Production and Staging
Section titled “Production and Staging”The production and staging environments reside in Google Cloud projects.
- Staging: gitlab-subscriptions-staging
- Production: gitlab-subscriptions-prod
SSH connectivity to customer-dot VMs is established through Teleport. SREs don’t require any additional teleport permissions to access these VMs, and can do so directly as follow:
$ tsh login --proxy=production.teleport.gitlab.net$ tsh ssh customers-01-inf-prdsub
NOTE: if your user ID in Okta differs from that in Chef, you would need to make sure they match in order for this to work.
Break-glass procedure for SSH access
Section titled “Break-glass procedure for SSH access”SSH connectivity is normally established through Teleport. In the event that Teleport is unavailable, SREs can connect using SSH to the VMs through the Identity-Aware Proxy (IAP). For this to work, you need to file an MR to add the necessary firewall rule:
- Staging
- Production
- You might need to refer to these instructions
NOTE: Ensure you revert the MR once you’re back to using Teleport.
Once the firewall rule has been added, you should be able to do the following:
gcloud --project <PROJECT_ID> compute ssh <USERNAME>@<VM_NAME> --tunnel-through-iap
gcloud --project gitlab-subscriptions-staging compute ssh gservat@customers-01-inf-stgsub --tunnel-through-iap
Both staging and production services are proxied through Cloudflare. Refer to the section for NGINX below for rate limit information.
The web server on the VM has a rate limit set that should return a 429 when the rate is exceeded. This rate limit is specific to API Seat Requests and is is managed in Ansible.
- Local Logs
- NGINX Logs:
/var/log/nginx
- PostgreSQL Logs:
/var/log/postgresql
- Application Logs (Rails, Puma):
/home/customersdot/CustomersDot/current/log
- Sidekiq Logs:
/home/customersdot/CustomersDot/current/log
- NGINX Logs:
- Stackdriver Logs
- Application Logs from the VM are shipped into Stackdriver in GCP.
- https://cloudlogging.app.goo.gl/Jew7kUFaW8SUgeew9
Metrics
Section titled “Metrics”Specifications for a Customersdot metric catalog is available at metrics-catalog/services/customersdot.jsonnet
.
This catalog references the following SLIs:
gitlab_sli_customers_dot_requests_total
gitlab_sli_customers_dot_requests_error_total
gitlab_sli_customers_dot_requests_error_apdex_total
gitlab_sli_customers_dot_requests_error_apdex_success_total
gitlab_sli_customers_dot_sidekiq_jobs_total
gitlab_sli_customers_dot_sidekiq_jobs_error_total
gitlab_sli_customers_dot_sidekiq_jobs_error_apdex_total
gitlab_sli_customers_dot_sidekiq_jobs_error_apdex_success_total
These SLIs were introduced as a Rails application SLI for CustomersDot (see the Collector class and the Metrics::Slis class for more details).
Two Prometheus instances have been set up for CustomersDot:
Here is the main Grafana page for CustomersDot
Database and Rails console access
Section titled “Database and Rails console access”Access to the CustomerDot DB and Rails console is via Teleport
SREs have SSH access to the CustomerDot VMs for emergency access to the database. Ensure all database changes outside of Teleport or application code are retroactively approved via a change request following the Infrastructure change request workflow
Once you have logged into the Staging or the Production VM, run one of these commands to open the database or the Rails console:
gitlab-db # opens the PSQL console.gitlab-rails-console # opens the Rails console.
These scripts can be found in /usr/local/bin/
.
Database access with cloudsqlsuperuser permissions
Section titled “Database access with cloudsqlsuperuser permissions”The Database Operations Team also has their own user DBO_team
on CustomersDot DB Instances with cloudsqlsuperuser permissions. This user is necessary for maintenance operations and similar that require the highest possible level of access. It should not be used as a substitute for the teleport connection process, only for when the teleport process does not give sufficent access.
Ensure all database changes outside of Teleport or application code are approved or retroactively approved via a change request following the Infrastructure change request workflow
This user can access CustomerDot DB via the gcli cmd line as follows
gitlab-subscriptions-prod gcloud sql connect customers-db-2780 --user=dbo_team --database=CustomersDot_production
gitlab-subscriptions-staging gcloud sql connect customers-db-4dc5 --user=dbo_team --database=CustomersDot_stg
gitlab-subscriptions-stg-ref gcloud sql connect customers-db-04f6 --user=dbo_team --database=CustomersDot_stg-ref
The passwords for this user are found in the Production repo of 1Password in the format {gcp project} {instance name} {username} for example gitlab-subscriptions-staging customers-db-4dc5 dbo_team
This cmd will open a psql process in gcli to allow the team member to interact with the database.
Application provisioning
Section titled “Application provisioning”The provisioning of the CustomersDot application stack is done through the
CustomersDot Ansible project. For staging and production, provisioning works through Teleport to reach the VMs. For stgsub-ref
, it is still direct SSH access through the bastion.
To provision CustomersDot in staging and production manually, please refer to this documentation.
Alerts related to provisioning are sent to the #s_fulfillment_status
Slack channel.
Deployments
Section titled “Deployments”When a pipeline is triggered on the staging
(default) branch of CustomersDot,
the application is first deployed to Staging then to Production after a delay of
2 hours.
That being said, it is possible to trigger a manual pipeline to deploy to production right away, should the need to do so arise. To do so, please refer to this documentation.
Alerts related to deployments are sent to the #s_fulfillment_status
Slack channel.
If there’s a need to restart services, please refer to this restart documentation.
Similar to application provisioning, deployments in staging and production environments work through Teleport (not direct SSH). For stgsub-ref
, it is still direct SSH via bastion.
Change Management
Section titled “Change Management”Terraform is used to provision the virtual infrastructure for staging and production:
Chef is used essentially to bootstrap user access for users and Ansible.
Ansible is used to deploy the latest code to staging.
Alerting
Section titled “Alerting”At the moment, we rely on this Uptime Kuma instance for CustomersDot production for stack monitoring. When an Uptime Kuma alert is created from this instance, it is sent to the #s_fulfillment_status
Slack channel.
Unschedule maintenance
Section titled “Unschedule maintenance”+When Zuora is down (reported to #s_fulfillment_status
) and CustomersDot is in auto-maintenance mode for longer than 5 minutes, we should create an S3 incident with the CMOC updating the status page based on updates from https://trust.zuora.com/.
Service degradation during SeatLink traffic hour
Section titled “Service degradation during SeatLink traffic hour”maintenance_mode_seat_link
is a feature flag for just blocking seat link traffic. This was implemented because we suspect SeatLink was the cause of several service degradations. If a similar degradation occurs during SeatLink traffic hour (typically around 4 AM UTC), we can enable this flag to block SeatLink traffic.
Customer Impacts of Outage
Section titled “Customer Impacts of Outage”Gitlab.com customers
Section titled “Gitlab.com customers”For Gitlab.com customers there are several ways customers interact with the Customers Portal:
- When they click
Buy storage
orBuy CI minutes
buttons on the Usage Quotas page. Both buttons redirect the customer to the Customers Portal. This will not be available during downtime. - When they attempt to validate themselves by entering their credit card information when a CI pipeline is blocked by the anti-abuse system. This will not be available during downtime.
Self-managed and Dedicated Customers
Section titled “Self-managed and Dedicated Customers”- When a customer purchases a license for a self-managed or dedicated instance, they receive an activation code, which they enter on their instance to activate the license. For the activation to work, the instance needs to communicate with the Customer Portal. You can read more about how that works. This will not be available during downtime.
- If a customer does try to activate a subscription during downtime it will fail with the
Cannot activate instance due to a connectivity issue
error message. - Customers will not be able to obtain Cloud Connector access credentials anymore, which are needed to enable Duo and other features. Failure to sync with the Customers Portal for longer than the TTL of these credentials means these features will become unavailable.