
Scale existing shards

This guide describes steps similar to those in the new-shards.md guide; scaling an existing shard, however, requires fewer steps.

reference issue

  1. Reach out to Google reps in #ext-google-cloud Slack channel.
    1. NOTE: Do this as early as possible in the process, as soon as we have an idea of how many resources we’ll need, even before the projects exist. Do not assume we will have auto-approval on anything.
  2. Collaborate with GCP reps on provisioning and due dates.
    1. Planning sheet
    2. Planning doc
  3. Submit standard quota increase requests in the GCP console.

:warning: If you need to increase the quota for Heavy-weight read requests per minute, you may also need to specifically increase Heavy-weight read requests per minute per region, as seen in this issue.

Typically, the following quotas will need to be increased:

  • N2D CPUs (us-east1)
  • Read requests per minute per region (us-east1)
  • Heavy-weight read requests per minute per region (us-east1)
  • Queries per minute per region (us-east1)
  • Concurrent regional operations per project per operation type (us-east1)

Compare the settings with other existing projects and request needed adjustments.

https://console.cloud.google.com/iam-admin/quotas?project=gitlab-r-saas-l-m-amd64-org-1&walkthrough_id=bigquery--bigquery_quota_request&pageState=(%22allQuotasTable%22:(%22f%22:%22%255B%257B_22k_22_3A_22Name_22_2C_22t_22_3A10_2C_22v_22_3A_22_5C_22N2D%2520CPUs_5C_22_22_2C_22s_22_3Atrue_2C_22i_22_3A_22displayName_22%257D_2C%257B_22k_22_3A_22_22_2C_22t_22_3A10_2C_22v_22_3A_22_5C_22OR_5C_22_22_2C_22o_22_3Atrue%257D_2C%257B_22k_22_3A_22Name_22_2C_22t_22_3A10_2C_22v_22_3A_22_5C_22Heavy-weight%2520read%2520requests%2520per%2520minute%2520per%2520region_5C_22_22_2C_22s_22_3Atrue_2C_22i_22_3A_22displayName_22%257D_2C%257B_22k_22_3A_22_22_2C_22t_22_3A10_2C_22v_22_3A_22_5C_22OR_5C_22_22_2C_22o_22_3Atrue%257D_2C%257B_22k_22_3A_22Name_22_2C_22t_22_3A10_2C_22v_22_3A_22_5C_22Read%2520requests%2520per%2520minute%2520per%2520region_5C_22_22_2C_22s_22_3Atrue_2C_22i_22_3A_22displayName_22%257D_2C%257B_22k_22_3A_22_22_2C_22t_22_3A10_2C_22v_22_3A_22_5C_22OR_5C_22_22_2C_22o_22_3Atrue%257D_2C%257B_22k_22_3A_22Name_22_2C_22t_22_3A10_2C_22v_22_3A_22_5C_22Queries%2520per%2520minute%2520per%2520region_5C_22_22_2C_22i_22_3A_22displayName_22%257D_2C%257B_22k_22_3A_22_22_2C_22t_22_3A10_2C_22v_22_3A_22_5C_22OR_5C_22_22_2C_22o_22_3Atrue%257D_2C%257B_22k_22_3A_22Name_22_2C_22t_22_3A10_2C_22v_22_3A_22_5C_22In-use%2520IP%2520addresses_5C_22_22_2C_22i_22_3A_22displayName_22%257D_2C%257B_22k_22_3A_22_22_2C_22t_22_3A10_2C_22v_22_3A_22_5C_22OR_5C_22_22_2C_22o_22_3Atrue%257D_2C%257B_22k_22_3A_22Name_22_2C_22t_22_3A10_2C_22v_22_3A_22_5C_22Concurrent%2520regional%2520operations%2520per%2520project%2520per%2520operation%2520type_5C_22_22_2C_22s_22_3Atrue_2C_22i_22_3A_22displayName_22%257D_2C%257B_22k_22_3A_22_22_2C_22t_22_3A10_2C_22v_22_3A_22_5C_22region_3Aus-east1_5C_22_22%257D%255D%22,%22s%22:%5B(%22i%22:%22displayName%22,%22s%22:%220%22),(%22i%22:%22currentPercent%22,%22s%22:%221%22),(%22i%22:%22sevenDayPeakPercent%22,%22s%22:%220%22),(%22i%22:%22currentUsage%22,%22s%22:%221%22),(%22i%22:%22sevenDayPeakUsage%22,%22s%22:%220%22),(%22i%22:%22serviceTitle%22,%22s%22:%220%22),(%22i%22:%22displayDimensions%22,%22s%22:%220%22)%5D))
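To compare current usage and limits against another shard project from the command line, a sketch like the following can help. The project IDs are placeholders, and this only covers Compute Engine regional quotas (e.g. N2D CPUs); the API rate quotas are easiest to review in the console link above.

```shell
# Sketch: list Compute Engine quota usage and limits for us-east1 in each project.
# The project IDs below are placeholders; --flatten/--format turn the repeated
# quotas field into a readable table.
for project in gitlab-r-saas-l-m-amd64-org-1 gitlab-r-saas-l-m-amd64-org-2; do
  echo "--- ${project} ---"
  gcloud compute regions describe us-east1 \
    --project "${project}" \
    --flatten="quotas[]" \
    --format="table(quotas.metric, quotas.usage, quotas.limit)"
done
```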

reference issue

  1. Register unique CIDRs for the new ephemeral runner projects in the Runbooks
  2. :warning: If you’re creating a new CIDR block, make sure you add it to the Global allow list.
    • In the Admin interface: Settings -> General -> Visibility and access controls.
    • If this is missed, we risk running into incidents like this one.

reference issue

Create the projects for the ephemeral VMs in the config-mgmt repo. Each runner manager will point to one of the projects created here.

  1. Locate the shard you’re scaling in environments/env-projects/saas-runners.tf
  2. Increase the count variable by the desired value.
  3. Submit and merge an MR with your changes.
  4. Confirm the projects are created successfully (a listing sketch follows this list).
    1. This may require an SRE with access to these projects, as most developers will not have permissions on them until they are granted in the configuration step below.
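If you do have access, a quick way to confirm the new projects exist is to list them by name prefix. The filter below is an assumption based on the shard’s naming scheme; adjust it to the projects created by your MR.

```shell
# Sketch: list projects matching the shard's naming scheme and confirm they are ACTIVE.
# The name filter is illustrative; use the prefix of the projects created by the MR.
gcloud projects list \
  --filter="name ~ ^gitlab-r-saas-l-m-amd64-org-" \
  --format="table(projectId, name, lifecycleState)"
```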

reference issue

  1. Locate the shard you’re scaling under ephemeral_project_networks in environments/ci/variables.tf
  2. Add a new entry under ephemeral_subnetworks and ephemeral_service_projects.
  3. Add the CIDRs under ci-gateway-allow-runners in environments/gprd/main.tf (using values set in the Document CIDRs step above)
  4. Locate the directory for the shard name you’re scaling under environments/
    1. Add a new entry for the new projects under the project variable in the variables.tf file.
  5. At the root of config-mgmt, run terraform fmt -recursive (a local validation sketch follows this list).
  6. Submit an MR with your changes.
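Before opening the MR, a local sanity check along these lines can catch formatting and syntax issues early. This is only a sketch; the authoritative plan runs in CI.

```shell
# Sketch: format and validate the changed environment locally before the MR.
# Run from the root of the config-mgmt checkout.
terraform fmt -recursive
cd environments/ci
terraform init -backend=false   # skip remote state; we only want validation
terraform validate
```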

reference issue

Add the Chef configs for the new runner-manager VMs. These roles drive the config.toml settings on the runner managers, as well as some other settings (secrets config, analytics, etc.).

  1. Copy (and update as needed) the green config from one of the existing runners-manager-<shard name>-green.json files under roles/.
  2. Copy (and update as needed) the blue config from one of the existing runners-manager-<shard name>-blue.json files under roles/.

Note: the initial concurrent setting should be 0 until we are ready to enable the runners.
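A quick way to confirm the new role files start out disabled is to check the concurrent attribute directly. This is a sketch: it assumes jq is available, and <shard name> is a placeholder for the shard you are scaling.

```shell
# Sketch: confirm concurrent is 0 in the new green and blue role files.
# Run from the root of the chef-repo checkout; <shard name> is a placeholder.
jq '.default_attributes."cookbook-gitlab-runner".global_config.concurrent' \
  roles/runners-manager-<shard name>-green.json \
  roles/runners-manager-<shard name>-blue.json
```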

reference issue

  1. Locate the secrets vault for the shard you’re scaling in this location.
  2. Add two entries for each new runner-manager you’re adding, one for the green and one for the blue secrets.
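If you have CLI access to the vault, listing the existing entries for the shard is an easy way to see the naming convention to follow. The mount and path below are placeholders; use the location linked above.

```shell
# Sketch: list the existing runner-manager secrets for the shard so the new
# green/blue entries can mirror their names. The path is a placeholder.
vault kv list <secrets-mount>/<shard-name>
```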

Run chef-client on each new runner manager


reference issue

Note: this requires the Chef server admin user and the secrets stored in Vault!

:warning: Ensure the concurrent setting for any new runners is set to 0.

SSH into each new runner-manager and initiate a chef-client run:

```shell
$ ssh runners-manager-private-green-9.c.gitlab-ci-155816.internal
$ sudo chef-client
```
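After the chef-client run completes, it is worth confirming that the rendered runner configuration still has the runners disabled. This is a sketch; /etc/gitlab-runner/config.toml is the default location for the runner manager’s config.

```shell
# Sketch: verify the rendered config still has concurrent = 0 before enabling runners.
sudo grep -E '^concurrent' /etc/gitlab-runner/config.toml
```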

reference issue

SSH into each new runner-manager and initiate a manual tls-certificate test:

NOTE: you’ll need to copy /tmp/create-machine.sh and /tmp/test-machine.sh from existing machines onto the new VMs.

```shell
$ ssh runners-manager-private-green-9.c.gitlab-ci-155816.internal
$ export VM_MACHINE=docker-machine-tls-test-vm-01
$ /tmp/create-machine.sh && /tmp/test-machine.sh
```

If the run fails, you’ll get an error hinting at the reason.

Don’t forget to remove any machines you manually created


```shell
$ docker-machine rm -f $VM_MACHINE
```

reference issue

In the infra-mgmt repo:

  1. Add the new project names to run.sh
  2. Add the new project names to data/gcp/impersonated-accounts.yaml
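A quick check that both files reference the new projects after editing. The grep pattern is an assumption; adjust it to the project names you added.

```shell
# Sketch: confirm the new project names appear in both files.
# Run from the root of the infra-mgmt checkout; the pattern is illustrative.
grep -n "gitlab-r-saas-l-m-amd64-org-" run.sh data/gcp/impersonated-accounts.yaml
```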

reference issue

reference issue

  1. In chef-repo, update the default_attributes.cookbook-gitlab-runner.global_config.concurrent value in the JSON role files so it matches the maximum capacity for the entire shard.
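If the updated roles are uploaded manually rather than by chef-repo CI, a sketch of the knife workflow looks like this. The file names are placeholders; skip this step if CI handles role uploads for you.

```shell
# Sketch: upload the updated role JSON to the Chef server after bumping concurrent.
# File names are placeholders; requires Chef server admin credentials.
knife role from file roles/runners-manager-<shard name>-green.json
knife role from file roles/runners-manager-<shard name>-blue.json
```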

Enable the new runners on the existing shard


reference issue

:warning: Ensure the concurrent settings for the runners are production-ready, usually set to 1200.

SSH into each new runner-manager, run chef-client, and start or stop the gitlab-runner process depending on which deployment will be active and which will be inactive. Assuming the blue deployment is inactive and the green is active, you’d run the following:

```shell
$ ssh runners-manager-private-blue-9.c.gitlab-ci-155816.internal
$ sudo chef-client-disable "Disabling until next deployment"
$ sudo gitlab-runner stop
$ ssh runners-manager-private-green-9.c.gitlab-ci-155816.internal
$ sudo chef-client-enable
$ sudo chef-client
$ sudo gitlab-runner start
```
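On the active (green) manager, the runner process and its registrations can be checked before moving on (a sketch):

```shell
# Sketch: confirm the gitlab-runner service is running and its registered runners
# can still authenticate against GitLab.
sudo gitlab-runner status
sudo gitlab-runner verify
```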

After ensuring the runner process is up, enable the new runner-manager VMs through a GitLab Admin account:

  1. Log in as an Admin.

  2. Go to the admin console

  3. Filter the runner list using the shard’s tag.

  4. Click the play button to enable each of the new machines.

reference issue

  1. In the GitLab admin account, unpause the runners (this only needs to be done once).
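If you prefer the API over the UI, a runner can also be unpaused with an admin personal access token. This is a sketch; RUNNER_ID and ADMIN_TOKEN are placeholders.

```shell
# Sketch: unpause a single runner through the GitLab API using an admin token.
# RUNNER_ID and ADMIN_TOKEN are placeholders.
curl --request PUT \
  --header "PRIVATE-TOKEN: ${ADMIN_TOKEN}" \
  --data "paused=false" \
  "https://gitlab.com/api/v4/runners/${RUNNER_ID}"
```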