Credential rotation
Rotating credentials in a high-availability database deployment without downtime can be challenging.
This runbook lists the tasks required to change the password of an important database role such as gitlab-superuser.
Change issue creation
Create a production change issue to track this work:
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/new?issuable_template=change_management
Label the issue for criticality level 2 (C2) and severity level 2 (S2) so that production deployments are not initiated during the procedure.
Make a comment in the issue with the following content to apply some required labels:
/label ~change ~C2 ~S2 ~Database ~"Service::Postgres" ~"Service::Patroni" ~"requires commendted manager approval" ~"required production access" ~"section::ops" ~"security" ~"change::scheduled"
If the operation is to commence immediately, use the ~"change::in-progress" label instead of ~"change::scheduled". This enables deployment automation to recognize the change issue as a blocker.
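The choice between the two change-state labels can be made explicit before composing the comment. The following is a minimal sketch, not part of the runbook tooling: `change_state_label` is a hypothetical helper name, and it only knows the two label values mentioned above.

```shell
# Hypothetical helper: print the change-state label for the quick action.
# Usage: change_state_label scheduled|in-progress
change_state_label() {
  case "$1" in
    scheduled|in-progress) printf '~"change::%s"\n' "$1" ;;
    *) echo "unknown change state: $1" >&2; return 1 ;;
  esac
}

# An operation that starts immediately uses the in-progress label.
change_state_label in-progress
```

Any other argument is rejected with a non-zero exit status, so a typo cannot silently produce a label that deployment automation would not recognize.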
Operator workstation setup
To support commands like bundle exec knife <action>, the operator is expected to change directory to their local workstation clone of the gitlab-com/runbooks project and install the required Ruby dependencies:
    rbenv install
    ruby -S gem install bundler
    bundle install --path=vendor/bundle
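Before starting, it may help to confirm that the command-line tools used throughout this runbook are available on the workstation. This is a minimal sketch under stated assumptions: `missing_tools` is a hypothetical helper, and the tool list is simply collected from the commands that appear in this document (pbcopy is macOS-specific).

```shell
# Hypothetical helper: print the name of every listed tool that is not on PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Tool names taken from the commands used in this runbook.
missing="$(missing_tools rbenv ruby bundle jq ssh scp curl openssl pbcopy)"
if [ -n "$missing" ]; then
  # $missing is left unquoted on purpose so the names are joined on one line.
  echo "Missing tools:" $missing >&2
else
  echo "All required tools were found."
fi
```

The check only reports what is missing; it does not install anything or verify tool versions.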
Procedure
Phase one
- Specify the environment in which to conduct operations:
    export GITLAB_ENVIRONMENT='gstg'
- Specify the link to this issue:
    export issue_link='https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/10961' # CHANGEME (as necessary)
- Specify the username whose password you wish to rotate:
    export GITLAB_USERNAME='gitlab-superuser' # CHANGEME (as necessary)
- Copy the current user password:
    bin/gkms-vault-cat gitlab-patroni "${GITLAB_ENVIRONMENT}" | jq --raw-output '."gitlab-patroni".patroni.users.superuser.password' | pbcopy
- Record the password in a field of type Password in a secure note entitled “gitlab-patroni ${GITLAB_ENVIRONMENT} ${GITLAB_USERNAME}” in 1Password for reference in case a roll-back is necessary.
- Deploy the scripts from https://gitlab.com/gitlab-com/runbooks/-/merge_requests/2197 on each patroni node:
    export mode=0700 install_dir='/root/scripts' repository='https://gitlab.com/gitlab-com/runbooks' branch='master' artifacts=$(echo "scripts/database/session-connection-terminate.sh scripts/database/user-role-create.sh scripts/database/user-role-password-update.sh scripts/database/user-role-delete.sh")
    for patroni_node in $(bundle exec knife search node "fqdn:patroni-*-db-${GITLAB_ENVIRONMENT}*" --format=json | jq --raw-output '.rows|=sort_by(.automatic.fqdn)|.rows|.[] .automatic.fqdn'); do
      echo "Deploying database utility scripts from ${repository}/-/raw/${branch} to ${patroni_node}:${install_dir}"
      for artifact in ${artifacts}; do
        # Install each artifact under its file name only, without the repository path.
        script=$(basename "${artifact}")
        ssh "${patroni_node}" "sudo mkdir -p ${install_dir} && curl --silent --show-error --location '${repository}/-/raw/${branch}/${artifact}' --output - | sudo tee ${install_dir}/${script} &>/dev/null && sudo chmod ${mode} ${install_dir}/${script}"
      done
    done
- Create a new password:
    export new_password=$(openssl rand -base64 4096 | tr -dc A-Za-z0-9 | head -c64)
    echo "${new_password}" | pbcopy
    echo "export NEW_PASSWORD=${new_password}" | tee ./new_password.sh &>/dev/null
- Record the new password in a field of type Password named “Temporary PostgreSQL superuser role password” in the “Postgres <username>” Password entry for the appropriate environment in 1Password.
- Select the first member node of the patroni cluster:
    export patroni_node=$(bundle exec knife search node "fqdn:patroni-*-db-${GITLAB_ENVIRONMENT}*" --format=json | jq --raw-output '.rows|=sort_by(.automatic.fqdn)|.rows[0]|.automatic.fqdn')
    echo "${patroni_node}"
- Ask the first patroni node to identify the leader patroni node:
    export leader_patroni_node=$(ssh "${patroni_node}" 'test -e /usr/bin/jq && sudo /usr/local/bin/gitlab-patronictl list --format json 2>/dev/null' | jq --raw-output '.[] | select(.Role=="Leader").Member')
    echo "${leader_patroni_node}"
- Copy the password to the patroni leader:
    scp ./new_password.sh "${leader_patroni_node}":/tmp/new_password.sh
    bundle exec knife ssh "fqdn:${leader_patroni_node}" "sudo mv /tmp/new_password.sh /root/scripts/.new_password.sh && sudo chmod 0700 /root/scripts/.new_password.sh && sudo chown root:root /root/scripts/.new_password.sh"
- Dry-run the script to create a new temporary database user role on the leader and record the output (the command is double-quoted so that ${GITLAB_USERNAME} is expanded on the workstation, where it is set):
    bundle exec knife ssh "fqdn:${leader_patroni_node}" "sudo /root/scripts/user-role-create.sh ${GITLAB_USERNAME} --dry-run"
- Confirm that there were no relevant errors in the dry-run invocation.
- Run the script to create a new temporary database user role on the patroni leader and record the output:
    bundle exec knife ssh "fqdn:${leader_patroni_node}" "sudo /root/scripts/user-role-create.sh ${GITLAB_USERNAME} --wet-run"
- Confirm that there were no relevant errors in the wet-run invocation.
- Record the verbatim character string of the new user role in a field of type Text named “Temporary PostgreSQL superuser role username” in the “Postgres <username>” Password entry for the appropriate environment in 1Password.
- Wait for replication to “catch up” to the changes in the database of the leader.
  - Optionally check each node in the patroni cluster to confirm that the new temporary user role exists in each database:
        bundle exec knife ssh --concurrency 1 "fqdn:patroni-*-db-${GITLAB_ENVIRONMENT}*" "sudo /usr/local/bin/gitlab-psql --command '\du' | grep '${GITLAB_USERNAME}-'"
- Create (but DO NOT yet merge) a chef MR to change the username defined in patroni.yml for the GITLAB_USERNAME user role to the name of the new temporary user in the gitlab-cookbooks/chef-repo/roles/${GITLAB_ENVIRONMENT}-base-db-patroni.json file, by committing changes to:
  - Set the default_attributes.gitlab-patroni.patroni.users.superuser.username field to the name of the new temporary user, and also
  - Set the default_attributes.gitlab_walg.backup_user field to the name of the new temporary user.
- Add a link to the MR here. For example: Configure the staging patroni fleet to use a temporary role with a time-stamped username
- Block/disable the chef-client service on all patroni hosts with an explanation that includes a link to the issue created to track this work:
    read -p "Operating in environment ${GITLAB_ENVIRONMENT}; press return to continue, CTRL-C to abort> " && bundle exec knife ssh --concurrency 1 "fqdn:patroni-*-db-${GITLAB_ENVIRONMENT}*" "sudo /usr/local/bin/chef-client-disable 'Configuring new superuser role in /var/opt/gitlab/patroni/patroni.yml, see issue ${issue_link}'"
- Confirm that the chef-client service has been stopped on all the patroni nodes:
    bundle exec knife ssh --concurrency 1 "fqdn:patroni-*-db-${GITLAB_ENVIRONMENT}*" 'sudo systemctl status chef-client --full --no-pager | tail --lines=1'
- Update the password in the GKMS vault at gitlab-patroni.patroni.users.superuser.password to be the temporary database role password (created above) instead of the original password for the original GITLAB_USERNAME user role:
    EDITOR=`which vim` bin/gkms-vault-edit gitlab-patroni "${GITLAB_ENVIRONMENT}"
- Notify relevant parties about a configuration change to the ${GITLAB_ENVIRONMENT} patroni fleet.
- Merge the MR and apply the changes if necessary. Because the chef-client service is disabled, this does not yet apply the changes on the patroni nodes.
- Undo the disablement of the chef-client service on all the patroni nodes:
    bundle exec knife ssh --concurrency 1 "fqdn:patroni-*-db-${GITLAB_ENVIRONMENT}*" 'sudo /usr/local/bin/chef-client-enable'
- Invoke chef-client on all the patroni nodes in order to apply the changes:
    bundle exec knife ssh --concurrency 1 "fqdn:patroni-*-db-${GITLAB_ENVIRONMENT}*" 'sudo chef-client'
- Verify and record in a comment that WAL-G replication push operations are still running successfully:
    bundle exec knife ssh "fqdn:${leader_patroni_node}" 'sudo tail --lines=6 /var/log/wal-g/wal-g.log'
- Verify and record in a comment that WAL-G backup write operations are still running successfully:
    for patroni_node in $(bundle exec knife search node "fqdn:patroni-*-db-${GITLAB_ENVIRONMENT}*" --format=json | jq --raw-output '.rows|=sort_by(.automatic.fqdn)|.rows|.[] .automatic.fqdn'); do
      # Set DATE as its own statement so that it is defined before the ssh arguments are expanded.
      DATE=$(date -u '+%Y/%m/%d')
      ssh "${patroni_node}" "sudo egrep '$DATE.*Wrote backup with name' /var/log/wal-g/wal-g_backup_push.log && hostname --fqdn"
    done
- Delete the temporary new password file:
    bundle exec knife ssh "fqdn:${leader_patroni_node}" 'sudo rm /root/scripts/.new_password.sh'
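The procedure asks the operator to wait for replication to catch up without saying how. One option is to poll a check until it succeeds. The following is a minimal sketch, not part of the runbook tooling: `wait_until` is a hypothetical helper, and the timeout and interval values are arbitrary.

```shell
# Hypothetical helper: wait_until TIMEOUT_SECONDS INTERVAL_SECONDS COMMAND [ARGS...]
# Re-runs COMMAND until it succeeds, or returns 1 once TIMEOUT_SECONDS have
# been spent sleeping. INTERVAL_SECONDS should be at least 1.
wait_until() {
  timeout_s=$1
  interval_s=$2
  shift 2
  waited=0
  until "$@"; do
    if [ "$waited" -ge "$timeout_s" ]; then
      return 1
    fi
    sleep "$interval_s"
    waited=$((waited + interval_s))
  done
}

# Demonstration with a condition that is already true.
wait_until 5 1 test -d / && echo "condition met"
```

For example, `wait_until 300 10 role_exists_everywhere` would poll every 10 seconds for up to 5 minutes, where `role_exists_everywhere` is a hypothetical function wrapping the optional per-node role check from the procedure.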
Phase two
Now that the original superuser role is not being used by the patroni cluster or the replication processes, update the password for the original superuser role, and revert the configurations to use the original role.
- Ask the first patroni node to identify the leader patroni node:
    export leader_patroni_node=$(ssh "${patroni_node}" 'test -e /usr/bin/jq && sudo /usr/local/bin/gitlab-patronictl list --format json 2>/dev/null' | jq --raw-output '.[] | select(.Role=="Leader").Member')
    echo "${leader_patroni_node}"
- Dry-run the script to update the original GITLAB_USERNAME role with the new password on the patroni leader and record the output (the command is double-quoted so that ${GITLAB_USERNAME} is expanded on the workstation, where it is set):
    bundle exec knife ssh "fqdn:${leader_patroni_node}" "sudo /root/scripts/user-role-password-update.sh ${GITLAB_USERNAME} --dry-run"
- Confirm that there were no relevant errors in the dry-run invocation.
- Run the script to set the password of the original GITLAB_USERNAME role in the database to the new password on the patroni leader and record the output:
    bundle exec knife ssh "fqdn:${leader_patroni_node}" "sudo /root/scripts/user-role-password-update.sh ${GITLAB_USERNAME} --wet-run"
- Confirm that there were no relevant errors in the wet-run invocation.
- Wait for replication to “catch up” to the changes in the database of the leader.
  - Optionally confirm that the change has replicated to each patroni node. You will be prompted repeatedly for the new password, so it is recommended that you turn off any screen-sharing or recording. If you paste the new password correctly but the credentials update has not yet replicated to all nodes in the patroni cluster, this error is displayed:
        psql: FATAL: password authentication failed for user "$GITLAB_USERNAME"
    The check itself:
        for patroni_node in $(bundle exec knife search node "fqdn:patroni-*-db-${GITLAB_ENVIRONMENT}*" --format=json | jq --raw-output '.rows|=sort_by(.automatic.fqdn)|.rows|.[] .automatic.fqdn'); do
          ssh "${patroni_node}" "sudo su --command \"psql --password --port=5432 --host=localhost --username=$GITLAB_USERNAME --dbname=gitlabhq_production --tuples-only --quiet --command 'SELECT 1;'\" root"
        done
- Create (but DO NOT yet merge) a chef MR to change the username defined in patroni.yml for the GITLAB_USERNAME user role from the name of the temporary user back to the name of the original user in the gitlab-cookbooks/chef-repo/roles/${GITLAB_ENVIRONMENT}-base-db-patroni.json file, by committing changes to:
  - Set the default_attributes.gitlab-patroni.patroni.users.superuser.username field back to the name of the original GITLAB_USERNAME user role, and also
  - Set the default_attributes.gitlab_wale.backup_user field back to the name of the original GITLAB_USERNAME user role, and also
  - Set the default_attributes.gitlab_walg.backup_user field back to the name of the original GITLAB_USERNAME user role.
- Add a link to the MR here. For example: Configure the staging patroni fleet to use the original superuser role
- Block/disable the chef-client service on all patroni hosts with an explanation:
    read -p "Operating in environment ${GITLAB_ENVIRONMENT}; press return to continue, CTRL-C to abort> " && bundle exec knife ssh --concurrency 1 "fqdn:patroni-*-db-${GITLAB_ENVIRONMENT}*" "sudo /usr/local/bin/chef-client-disable 'Manually updating /var/opt/gitlab/patroni/patroni.yml, see issue ${issue_link}'"
- Confirm that the chef-client service has been stopped on all the patroni nodes:
    bundle exec knife ssh --concurrency 1 "fqdn:patroni-*-db-${GITLAB_ENVIRONMENT}*" 'sudo systemctl status chef-client --full --no-pager | tail --lines=1'
- Notify relevant parties about a configuration change to the ${GITLAB_ENVIRONMENT} patroni fleet.
- Merge the MR and apply the changes if necessary.
- Undo the disablement of the chef-client service on all the patroni nodes:
    bundle exec knife ssh --concurrency 1 "fqdn:patroni-*-db-${GITLAB_ENVIRONMENT}*" 'sudo /usr/local/bin/chef-client-enable'
- Invoke chef-client on all the patroni nodes in order to apply the changes:
    bundle exec knife ssh --concurrency 1 "fqdn:patroni-*-db-${GITLAB_ENVIRONMENT}*" 'sudo chef-client'
- Verify and record in a comment that WAL-G replication push operations are still running successfully:
    bundle exec knife ssh "fqdn:${leader_patroni_node}" 'sudo tail --lines=6 /var/log/wal-g/wal-g.log'
- Verify and record in a comment that WAL-G backup write operations are still running successfully:
    for patroni_node in $(bundle exec knife search node "fqdn:patroni-*-db-${GITLAB_ENVIRONMENT}*" --format=json | jq --raw-output '.rows|=sort_by(.automatic.fqdn)|.rows|.[] .automatic.fqdn'); do
      # Set DATE as its own statement so that it is defined before the ssh arguments are expanded.
      DATE=$(date -u '+%Y/%m/%d')
      ssh "${patroni_node}" "sudo egrep '$DATE.*Wrote backup with name' /var/log/wal-g/wal-g_backup_push.log && hostname --fqdn"
    done
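The backup verification above amounts to searching a log for a dated "Wrote backup with name" line. The check can be tried locally against a sample file before running it across the fleet. This is a minimal sketch: `backup_written_on` is a hypothetical helper, and the sample log line is made up for illustration rather than copied from a real WAL-G log.

```shell
# Hypothetical helper: backup_written_on DATE LOGFILE
# Succeeds if LOGFILE contains a "Wrote backup with name" line for DATE (YYYY/MM/DD).
backup_written_on() {
  grep -E -q "$1.*Wrote backup with name" "$2"
}

# Demonstration against an illustrative, made-up log line.
log="$(mktemp)"
printf '%s\n' 'INFO: 2024/01/02 03:04:05 Wrote backup with name base_example' > "$log"
if backup_written_on '2024/01/02' "$log"; then
  echo "backup found"
else
  echo "no backup recorded for that date"
fi
rm -f "$log"
```

On a patroni node the second argument would be /var/log/wal-g/wal-g_backup_push.log, as in the procedure.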
Roll-back
- To undo these changes, repeat the procedure with the old and new credentials exchanged.
- Delete the temporary superuser role which is no longer being used by any patroni node or database operation.
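Because the temporary role and the original role have similar names, a guard before deletion can reduce the risk of removing the wrong one. The sketch below assumes that temporary role names consist of the original username followed by a hyphen and a suffix, which is what the grep for "$GITLAB_USERNAME-" in phase one suggests; `is_temporary_role` and the example role names are hypothetical.

```shell
# Hypothetical guard: is_temporary_role ROLE ORIGINAL
# Succeeds only if ROLE is ORIGINAL followed by "-" and a non-empty suffix.
is_temporary_role() {
  case "$1" in
    "$2"-?*) return 0 ;;
    *) return 1 ;;
  esac
}

# The original role itself must never pass the guard.
if is_temporary_role 'gitlab-superuser' 'gitlab-superuser'; then
  echo "would delete gitlab-superuser"
else
  echo "refusing to delete gitlab-superuser: not a temporary role"
fi
```

The guard checks only the name; it does not confirm that the role is unused, so the checks described in the procedure still apply.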