Consul Service
- Service Overview
- Alerts: https://alerts.gitlab.net/#/alerts?filter=%7Btype%3D%22consul%22%2C%20tier%3D%22inf%22%7D
- Label: gitlab-com/gl-infra/production~"Service::Consul"
Logging
Summary
Architecture
The Consul Server Cluster is a Kubernetes StatefulSet deployed in the regional GKE cluster (`$ENV-gitlab-gke`) with 3 or 5 pods. The StatefulSet is managed and deployed by the `consul-gl` Helm release.
The servers have one "leader", which serves as the primary server; all others are noted as "followers". We use either 3 (`pre`, `ops`, `db-benchmarking`) or 5 (`gstg`, `gprd`) nodes, as Consul uses a quorum to ensure that the data returned to clients is a state all members of the Consul cluster agree on. A cluster of 5 nodes also allows up to 2 followers to be down before our Consul cluster would be considered faulty.
Consul Server cluster ports are exposed by an internal LoadBalancer Service and can be reached by Consul clients from outside the Kubernetes cluster (`consul-internal.<env>.gke.gitlab.net`).
Consul DNS is also exposed by a Kubernetes Service and uses each local Consul client to provide DNS resolution to the Rails workloads so they can discover the Patroni primary and replica nodes.
Reference: Consul Architecture Overview
Consul clients run on nearly all servers (VMs and GKE nodes). These assist the Consul servers with service/client discovery. The clients also all talk to each other, which helps distribute information about new clients, the Consul servers, and changes to infrastructure topology.
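For illustration, this is roughly what that DNS-based discovery looks like from a node running a local Consul agent. This is a minimal sketch: the `master.patroni.service.consul` and `replica.patroni.service.consul` names are examples and may not match the service names actually registered in each environment.

```
# Query the local Consul agent's DNS interface directly on port 8600.
# Service names below are examples; adjust to the services actually registered.
dig @127.0.0.1 -p 8600 master.patroni.service.consul +short
dig @127.0.0.1 -p 8600 replica.patroni.service.consul +short
```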
Logical Architecture
Physical Architecture
Consul Server ports are exposed by a LoadBalancer Service and can be reached by Consul clients from outside the Kubernetes cluster.
All firewall ports are opened as necessary so that all components with Consul deployed can successfully talk to each other.
```plantuml
@startuml
package "GCP" {
  package "GKE <env>-gitlab-gke" {
    package "consul-server - StatefulSet" {
      [consul-server-01]
      [consul-server-n]
    }
    package "consul-client - DaemonSet" {
      [consul-client-01]
      [consul-client-n]
    }
  }
  package "all-other-networks" {
    package "VM Fleets" {
      [consul agent]
    }
    package "GKE <env>-us-east-1[b,c,d]" {
      [consul-client - DaemonSet 2]
    }
  }
}
[consul-server-01] <-> [consul-server-n]
[consul-client-01] <-> [consul-client-n]
[consul-server-01] <.> [consul-client-01]
[consul-server-n] <.> [consul-client-n]
[consul-server-n] <.> [consul-client - DaemonSet 2]
[consul agent] <.> [consul-client - DaemonSet 2]
[consul-server-01] <.> [consul agent]
[consul-server-n] <.> [consul agent]
@enduml
```
Configurations
- We use a single cookbook to manage our VM Consul agents: gitlab_consul
  - All agents will have `recipe[gitlab_consul::service]` in their `run_list`.
  - All agents will have a list of `services` that they participate in in the role's `gitlab_consul.services` list.
    - This provides some form of service discovery.
    - Note that this is not yet well formed and not all servers properly configure this item. For example, all of our `fe` servers are noted as `haproxy`, but we do not distinguish which type of `fe` node a server may be. Servers that partake in differing stages, for example `main` and `canary`, are not distinguished inside of Consul.
    - Only rely on Consul for service discovery for items where it is already well utilized, such as our database servers. Expect inconsistencies when comparing data with Chef for all other keys.
- All agents need to know how to reach the DNS endpoint. This is done by configuring `systemd-resolved` to forward DNS lookups for the `.consul` domain to the agent running on that node on port `8600`. This is configured via the recipe `gitlab-server::systemd-resolved_consul` (see the sketch after this list).
  - All agents that need to perform a DNS lookup for services will have this enabled. This consists mostly of anything requiring access to PostgreSQL.
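To confirm that the `systemd-resolved` forwarding described above is working on a VM, the following sketch can be used. It assumes a standard `systemd-resolved` setup as configured by the recipe; `consul.service.consul` is used as a test name because Consul registers its own servers under the `consul` service.

```
# The .consul routing domain should point at 127.0.0.1:8600 (the local agent)
resolvectl status

# Resolve a Consul service name through systemd-resolved itself
resolvectl query consul.service.consul
```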
Kubernetes
- Kubernetes deploys the Consul Server cluster as a StatefulSet with a replica count of 3 (`pre`, `ops`, `db-benchmarking`) or 5 (`gstg`, `gprd`).
- Kubernetes deploys the Consul client agent as a DaemonSet so that it gets deployed onto all nodes.
- The Consul Helm chart also provides a DNS service (`consul-gl-consul-dns.consul.svc.cluster.local:53`). This service is configured with `internalTrafficPolicy: Cluster` and configured as the resolver for the domain `.service.consul` in kube-dns. This means that DNS queries from any pods for this domain will use any Consul client or server node.
- Both Kubernetes clients and servers are configured via `k8s-workloads/gitlab-helmfiles` (see the sketch after this list).
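To see these objects in a given environment, a sketch assuming the `consul` namespace and the resource names used elsewhere in this document:

```
# Server StatefulSet, client DaemonSet, and the DNS/LoadBalancer Services
kubectl --namespace consul get statefulset,daemonset,service

# Replica count and update strategy of the server StatefulSet
kubectl --namespace consul get statefulset consul-gl-consul-server \
  --output jsonpath='{.spec.replicas}{"\n"}{.spec.updateStrategy}{"\n"}'
```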
Consul VM Agents
All VMs that have Consul installed keep all configuration files in `/etc/consul`. A general config file `consul.json` provides the necessary configuration for the service to operate. Anything in `conf.d` is the set of services in which Consul partakes, including the health checks Consul will execute to tell the Consul cluster whether that service is healthy on that particular node. The `ssl` directory contains secret items that are discussed later in this document.
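When changing any of these files, the agent itself can be used to sanity-check them before a restart. This is a sketch and assumes the agent loads both `consul.json` and the `conf.d` directory:

```
# Inspect the configuration layout
ls -l /etc/consul /etc/consul/conf.d

# Validate the configuration files before restarting the agent
sudo consul validate /etc/consul/consul.json /etc/consul/conf.d
```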
Environment Specific
In general, the configurations look nearly identical between production and staging. Items that differ include the certificates, keys, hostnames, and the environment metadata.
Performance
No performance testing of this service has ever been performed. The service was put into place before the need for such testing was recognized.
Scalability
The Consul cluster is currently managed using Helm. Additional nodes can be added by modifying the `server:replica` count for the specific environment. The replica count must always be an odd number to avoid a split-brain scenario.
Agents can come and go as they please. On Kubernetes, this is very important as our nodes auto-scale based on cluster demand.
Availability
Cluster Servers
With 5 Consul servers, we can lose up to 2 of them before we lose quorum.
Diagnosing service failures on the cluster servers requires observing logs and taking action based on the failure scenario.
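As a starting point for that diagnosis, a sketch assuming the pod naming used elsewhere in this document:

```
# Tail the logs of a specific server pod in the regional cluster
kubectl --namespace consul logs consul-gl-consul-server-0 --tail=200

# From any Consul member, check which servers are in the raft quorum and who leads
consul operator raft list-peers
```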
Failure Recovery
Consul operates very quickly in the face of failure. When a cluster is restored, it takes just a few seconds for quorum to be reached.
Consul has documented a set of common error messages.
Split Brain
Consul can end up in a split-brain state. This may happen when network connectivity between two availability zones is lost and later recovers, and the election terms differ between the cluster servers. We currently do not have appropriate monitoring for this, as the version of Consul we use does not provide the metrics required to detect the situation. Suffering a split brain may serve improper data from some servers, which may lead to the application speaking to the wrong database servers.
This can be detected by logging into each of the Consul servers and listing out the members, as described in our Useful Commands (see the sketch after this list). Recovery for this situation is documented by HashiCorp: Recovery from a split brain
A summary of that document is to perform the following:
- Identify which cluster is safe to utilize
  - This is subjective and therefore cannot be described in this document
- Stop the Consul service on the nodes we need to demote
- Move/Delete the data directory (defined in the `consul.json` config file) or Persistent Volume
- Start the Consul service one node at a time, validating that each one joins the cluster successfully
Dependencies
Consul has minimal dependencies to operate:
- A healthy networking connection
- Operating DNS
- A healthy server
Memory and disk usage of Consul is very minimal. Disk usage primarily consists of state files stored in `/consul/data`, and for our environments the largest file is the `raft.db`, which varies in size, usually growing as we use this service more. As of today we primarily use this to store which servers and services are healthy. Production appears to utilize approximately 120MiB of space. This database uses BoltDB underneath Consul and is subject to growth until the next compaction is run. All of this happens in the background of Consul itself and shouldn't be a concern for our engineers.
These are configurable via two options:
If we see excessive growth in disk usage, we should first validate whether or not it is in use by Consul. If yes, we then need to look for any behavioral changes in how we utilize Consul; an example may be adding a new service or a set of servers that make a lot of changes to Consul. This could signify that we need to expand Consul if the usage is determined to be normal, or that a service is misbehaving and putting undue load on the Consul cluster.
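To check whether Consul data is actually responsible for the growth, a sketch assuming the server pod naming used elsewhere in this document and that `du` is available in the container image:

```
# Total size of the Consul data directory (raft.db lives here)
kubectl --namespace consul exec consul-gl-consul-server-0 -- du -sh /consul/data

# Per-file breakdown to see whether raft.db or the snapshots are growing
kubectl --namespace consul exec consul-gl-consul-server-0 -- du -ah /consul/data
```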
If DNS is failing, Consul may fail to properly resolve the addresses of clients and other Consul servers. This will effectively bring down a given node and potentially the cluster. We currently do not provide any special DNS configuration on the Consul servers and are subject to the resolvers provided by our infrastructure.
Consul Agents
The loss of a Consul agent will prevent any service running on that node from properly performing DNS resolution and/or participating in the Consul cluster. The impact depends on the service:
- Patroni Consul agent failure == replicas will not be registered in the service, providing fewer replicas for GitLab services to talk to. This will lead to higher pressure on the remaining replicas. If on the primary Patroni node, a Patroni failover will be triggered.
- Agent failure on nodes running GitLab Services == GitLab services will be unable to properly query which database server to talk to. Due to this, the application will default to always opening connections to the primary.
Diagnosing agent failures requires observing logs and taking action based on the failure scenario.
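On a VM, that diagnosis typically starts with the agent's service state and logs; a minimal sketch:

```
# Is the local agent running, and does it see the rest of the cluster?
sudo systemctl status consul.service
consul members

# Recent agent logs (TLS, gossip, RPC, and DNS errors show up here)
sudo journalctl -u consul.service --since '1 hour ago' --no-pager
```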
Consul agents only depend on the ability to communicate to the Consul servers. If this is disrupted for any reason, we must determine what causes said interruption. The agents store very little data on disk, and their memory and CPU requirements are very low.
Recovery Time Objective
We do not have a defined RTO for this service. This is currently unobtainable due to the lack of frequent testing.
Durability
The data held within Consul is dynamic but strongly agreed upon, as Consul is designed to reach consensus on the data it has knowledge of.
Consul Raft snapshots are backed up to a GCS bucket (`gitlab-$ENV-consul-snapshots`) every hour by a Kubernetes CronJob.
Should a single node fail, the StatefulSet in Kubernetes will automatically bring it back into the cluster without needing to worry about data on disk. As soon as a Consul server rejoins, the Raft database will sync, enabling that server to begin participating in the cluster within a matter of seconds.
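To confirm the backups are running and usable, a sketch assuming the bucket name above and that the CronJob lives in the `consul` namespace:

```
# The CronJob and its most recent Jobs
kubectl --namespace consul get cronjobs,jobs

# Snapshots uploaded to the GCS bucket (gprd shown as an example)
gsutil ls -l gs://gitlab-gprd-consul-snapshots/ | tail -n 5

# Sanity-check a downloaded snapshot before considering a restore
consul snapshot inspect ./backup.snap
```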
Security / Compliance
Consul utilizes mutual TLS (mTLS) for authentication and traffic encryption to prevent intrusion by rogue clients.
We manage multiple certificates:
| Certificate | Renewal method | Validity time | Vault secret path | Kubernetes Secret | Location on disk |
|---|---|---|---|---|---|
| CA | Manual | 5 years | `k8s/env/$ENV/ns/consul/tls` (certificate and key), `chef/env/$ENV/cookbook/gitlab-consul/client` (certificate only) | `consul-tls-v*` | `/consul/tls/ca/` |
| Server | Automatic on every Helm apply | 2 years | n/a | `consul-gl-consul-server-cert` | `/consul/tls/server/` |
| Client (Kubernetes) | Automatic on pod init | 1 year | n/a | n/a | `/consul/tls/client/` |
| Client (VM) | Manual | 5 years | `chef/env/$ENV/cookbook/gitlab-consul/client` | n/a | `/etc/consul/certs/` |
CA TLS Certificate rotation
The CA TLS certificates must be renewed every 5 years.
The CA certificate expiration date can be monitored using the metric `x509_cert_not_after{secret_namespace="consul",secret_name=~"consul-tls-v.+"}`.
To renew the CA certificate while reusing the same key:
- Generate a new certificate based on the current one with a 5-year expiration date, and store it in Vault:

  ```
  vault kv get -field certificate k8s/env/gprd/ns/consul/tls > tls.crt
  vault kv get -field key k8s/env/gprd/ns/consul/tls > tls.key
  openssl x509 -in tls.crt -signkey tls.key -days 1825 | vault kv patch k8s/env/gprd/ns/consul/tls certificate=-
  ```

- Create a new Kubernetes external secret `consul-tls-vX` for the new Vault secret version in `values-secrets/gprd.yaml.gotmpl` (example MR)
- Update the Consul Helm deployment in `values-consul-gl/gprd.yaml.gotmpl` to use this new secret (example MR)
- The server certificate will be regenerated during the Helm apply. Follow these steps below to roll out the new CA and server certificates to all server pods.
- The Kubernetes client certificate will be regenerated during pod init. Follow these steps below to roll out the new CA certificate to all client pods.
- Follow these steps below to roll out the new CA certificate to all VMs and regenerate the VM client certificate.
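After the rollout, the new CA's expiration date can be checked directly from the synced Kubernetes secret. The secret name `consul-tls-v2` is only an example and should match the version created above:

```
# Verify the renewed CA certificate now expires roughly 5 years out
kubectl get secret consul-tls-v2 --namespace consul \
  --output jsonpath='{.data.tls\.crt}' | base64 -d | openssl x509 -noout -enddate
```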
Server TLS certificate rotation
The server TLS certificate is generated automatically during each Helm deployment and is valid for 2 years. Regular Consul updates should keep it current. If Consul hasn't been deployed within this time period, forcing a Helm deployment (e.g. by adding an annotation) will renew it.
The server TLS certificate expiration date can be verified with the following command:

```
kubectl get secret consul-gl-consul-server-cert --namespace consul --output jsonpath='{.data.tls\.crt}' | base64 -d | openssl x509 -noout -dates
```
Client TLS certificate rotation
The Kubernetes client TLS certificate is generated for each pod during initialisation and is valid for 1 year.
The VM client TLS certificate must be regenerated manually and stored in the Chef cookbook secret in Vault using the commands below.
To renew the client certificate, whether or not a new CA is also being rolled out:
- Pause Chef on all VMs of the environment:

  ```
  knife ssh 'chef_environment:$ENV AND recipes:gitlab_consul\:\:agent' 'sudo chef-client-disable <link>'
  ```

- Generate a new client certificate with a 2-year expiration date and store it in the Consul cookbook secret in Vault:

  ```
  vault kv get -field certificate k8s/env/$ENV/ns/consul/tls > tls.crt
  vault kv get -field key k8s/env/$ENV/ns/consul/tls > tls.key
  consul tls cert create -client -ca tls.crt -key tls.key -days 530 -dc east-us-2
  vault kv patch chef/env/$ENV/cookbook/gitlab-consul/client [email protected] [email protected] [email protected]
  ```

- Roll out and test the new certificate(s) on a single VM (or a few more to be sure):
  - Re-enable and run Chef:

    ```
    sudo chef-client-enable
    sudo chef-client
    ```

    The client certificate will be reloaded automatically.
  - If the CA certificate was renewed, the Consul client needs to be restarted:

    ```
    sudo systemctl restart consul.service
    ```

- Roll out the new certificate(s) to the Patroni nodes following the same steps above, starting with the replicas one at a time, and finishing with the leader.
  Before updating each Patroni replica node, put it in maintenance mode first:

  ```
  knife node run_list add $NODE 'role[$ENV-base-db-patroni-maintenance]'
  ```

  And disable the maintenance mode after:

  ```
  knife node run_list remove $NODE 'role[$ENV-base-db-patroni-maintenance]'
  ```

- Roll out the new certificate(s) to all remaining nodes following the same steps above again:

  ```
  knife ssh 'chef_environment:$ENV AND recipes:gitlab_consul\:\:agent AND NOT recipes:gitlab-patroni\:\:consul' 'sudo chef-client-enable && sudo chef-client && sudo systemctl restart consul.service'
  ```
Gossip
All gossip protocol traffic is encrypted. The encryption key is stored in Vault under the path `k8s/env/$ENV/ns/consul/gossip` and synced to the Kubernetes secret `consul-gossip-v*`.
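To confirm which gossip encryption keys are currently installed across the cluster (for example, after rotating the key), a sketch run from any Consul member:

```
# Lists the gossip keys known to the cluster and how many members recognise each
consul keyring -list
```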
Monitoring/Alerting
- Overview Dashboard
- Logs can be found in Kibana under the `pubsub-consul-inf-gprd*` index.
- We only appear to alert on specific services and saturation of our nodes.
Deployments in k8s
When we bump the `chart_version` for Consul (`consul_gl`) in the `bases/environments/$ENV.yaml` file, this bumps the Consul version for both servers and clients. In all non-production environments, this will trigger an upgrade of the servers and clients with no manual intervention required. Server pods will be rotated one by one, and client pods will be rotated at the same time. You might therefore have clients that are temporarily on a newer version than the servers, but this is usually fine as HashiCorp has a protocol compatibility promise.
Nevertheless, in production we want to be more cautious and thus we make use of two guard rails for the sake of reliability:
- For servers, we set `server.updatePartition` to the number of replicas minus 1 (i.e., currently in `gprd` this is set to `4`).
  This setting allows us to carefully control a rolling update of Consul server agents. With the default setting of `0`, k8s would simply rotate each pod and wait for the health check to pass before moving on to the next pod. This may be OK, but we could run into a situation where a pod passes its health check, then becomes unhealthy after k8s has already moved on to the next pod. This could potentially result in an unhealthy Consul cluster, which would impact critical components like Patroni. Given the importance of a healthy Consul cluster, we decided the inconvenience of occasional human intervention to upgrade Consul was justified to minimize the risk of an outage.
- For clients, we set `client.updateStrategy.type` to `OnDelete` so we can wait until we're done upgrading the server cluster before we upgrade the clients.
The following instructions describe the upgrade process for servers and clients in production only. As mentioned above, all other environments will upgrade servers and clients automatically once the `chart_version` is bumped.
Servers
The `server.updatePartition` setting controls how many instances of the server cluster are updated when the `.spec.template` is updated. Only instances with an index greater than or equal to `updatePartition` (zero-indexed) are updated. So by setting this value to `4`, we're effectively saying only recreate the last pod (`...-consul-server-4`) but leave all other pods untouched.
The upgrade process is as follows:
- Bump `chart_version` in the `bases/environments/$ENV.yaml` file(s)
- Create your MR, get it reviewed and merged
- Once Helm starts to apply the change, you should see the `...-consul-server-4` pod get recreated, but the rest will remain unchanged. No client pods will get rotated at this stage.
- Helm will hang waiting for all pods to be recreated, so this is where you need to take action.
- SSH to one of the Consul members and keep an eye on the servers:

  ```
  watch -n 2 consul operator raft list-peers
  ```

- Bring up your tunnel to talk to the `gprd` regional k8s cluster (e.g., `glsh kube use-cluster gprd`)
- Confirm that the Consul cluster looks healthy:

  ```
  kubectl --context gke_gitlab-production_us-east1_gprd-gitlab-gke -n consul get pods -o wide -l component=server
  ```

  The pod `consul-gl-consul-server-4` should only show minutes in the `AGE` column. All pods should be `Running`.
- Rotate 2 more pods:

  ```
  kubectl --context gke_gitlab-production_us-east1_gprd-gitlab-gke -n consul patch statefulset consul-gl-consul-server -p '{"spec":{"updateStrategy":{"rollingUpdate":{"partition":2}}}}'
  ```

- You should now see 2 more pods get recreated. Wait until the Consul cluster is healthy (this should only take a few seconds to a minute) and do the last 2 pods:

  ```
  kubectl --context gke_gitlab-production_us-east1_gprd-gitlab-gke -n consul patch statefulset consul-gl-consul-server -p '{"spec":{"updateStrategy":{"rollingUpdate":{"partition":0}}}}'
  ```

- You should now see the remaining 2 pods get recreated. You can now put the setting back to `4`:

  ```
  kubectl --context gke_gitlab-production_us-east1_gprd-gitlab-gke -n consul patch statefulset consul-gl-consul-server -p '{"spec":{"updateStrategy":{"rollingUpdate":{"partition":4}}}}'
  ```

- You should now see the Helm apply job complete successfully.
Clients
To upgrade the clients, you have two choices:
- Do nothing and let the clients get upgraded organically as nodes come and go due to autoscaling. This is an acceptable approach if there is no rush to upgrade the clients.
- Rotate the pods by doing the `updateStrategy` dance as follows, for each of the 4x production k8s clusters (regional + zonals):
  - Change `updateStrategy.type` to `RollingUpdate`:

    ```
    kubectl --context <cluster> -n consul patch daemonset consul-gl-consul-client -p '{"spec":{"updateStrategy":{"type":"RollingUpdate"}}}'
    ```

  - Wait until all client pods have been rotated.
  - Revert the change to `updateStrategy.type`:

    ```
    kubectl --context <cluster> -n consul patch daemonset consul-gl-consul-client -p '{"spec":{"updateStrategy":{"type":"OnDelete"}}}'
    ```