Consul Service
- Service Overview
- Alerts: https://alerts.gitlab.net/#/alerts?filter=%7Btype%3D%22consul%22%2C%20tier%3D%22inf%22%7D
- Label: gitlab-com/gl-infra/production~"Service::Consul"
Logging
Summary
Architecture
Consul Server Cluster is a k8s StatefulSet deployed in the regional GKE cluster (ENV-gitlab-gke) with at least 5 total Pods. The StatefulSet is managed and deployed by the consul-gl helm release.
The servers have one "leader" which serves as the primary server, and all others are noted as "followers". We utilize 5 nodes because Consul uses a quorum to ensure the data returned to clients is a state that all members of the Consul cluster agree on. This also allows up to 2 servers to be down before our Consul cluster would be considered faulty.
Consul Server cluster ports are exposed by an internal LoadBalancer Service and can be reached by Consul clients from outside the k8s cluster (consul-internal.<env>.gke.gitlab.net).
Consul DNS is also exposed by a k8s Service and uses each local Consul client to provide DNS resolution to the Rails workloads, so they can discover which nodes are the Patroni primary and replicas.
Reference: Consul Architecture Overview
Consul clients run on nearly all servers (VMs and GKE nodes). They assist the Consul servers with service/client discovery. The clients also all talk to each other, which helps distribute information about new clients, the Consul servers, and changes to infrastructure topology.
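As an illustration of the DNS flow described above, a host running a Consul agent (plus the dnsmasq forwarding described later in this document) can resolve Patroni endpoints through the agent's DNS interface on port 8600. This is a sketch; the service name used here is illustrative, so check the registered services in your environment for the exact names.

```sh
# Query the local Consul agent's DNS interface directly (port 8600).
# "master.patroni.service.consul" is an example name; the actual Patroni
# service names registered in our environments may differ.
dig @127.0.0.1 -p 8600 master.patroni.service.consul SRV +short

# With the dnsmasq forwarding rule in place, plain resolution also works:
dig master.patroni.service.consul +short
```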
Logical Architecture
Physical Architecture
Consul Server ports are exposed by a LoadBalancer Service and can be reached by Consul clients from outside the k8s cluster.
All necessary firewall ports are open so that every component running Consul can successfully talk to the others.
@startuml
package "GCP" {
  package "GKE <env>-gitlab-gke" {
    package "consul-server - StatefulSet" {
      [consul-server-01]
      [consul-server-n]
    }
    package "consul-client - DaemonSet" {
      [consul-client-01]
      [consul-client-n]
    }
  }
  package "all-other-networks" {
    package "VM Fleets" {
      [consul agent]
    }
    package "GKE <env>-us-east-1[b,c,d]" {
      [consul-client - DaemonSet 2]
    }
  }
}

[consul-server-01] <-> [consul-server-n]
[consul-client-01] <-> [consul-client-n]
[consul-server-01] <.> [consul-client-01]
[consul-server-n] <.> [consul-client-n]
[consul-server-n] <.> [consul-client - DaemonSet 2]
[consul agent] <.> [consul-client - DaemonSet 2]
[consul-server-01] <.> [consul agent]
[consul-server-n] <.> [consul agent]
@enduml
Configurations
- We use a single cookbook to manage our VM Consul agents: gitlab_consul
- All agents will have `recipe[gitlab_consul::service]` in their run_list
- All agents will have a list of services that they participate in via the role's `gitlab_consul.services` list
  - This provides some form of service discovery
  - Note that this is not yet well formed and not all servers properly configure this item. For example, all of our `fe` servers are registered as `haproxy`, but we do not distinguish which type of `fe` node a server may be. Servers that take part in different stages, for example `main` and `canary`, are not distinguished inside of Consul.
- Only rely on Consul for service discovery for items that are already well utilized, such as our database servers. Expect inconsistencies when comparing data with Chef for all other keys.
- All agents need to know how to reach the DNS endpoint. This is done via the running agent and dnsmasq, which is given a configuration enabling the host to perform DNS lookups against the running agent on that node on port `8600`. This is configured via the recipe gitlab-server::dnsmasq_consul
  - All agents that need to perform a DNS lookup for services will have this enabled. This consists mostly of anything that requires access to Postgres.
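For reference, the dnsmasq configuration dropped by that recipe boils down to a single forwarding rule. The file path below is illustrative (the recipe decides where it lands), but the rule itself is the standard dnsmasq-to-Consul setup:

```
# /etc/dnsmasq.d/10-consul (illustrative path)
# Forward any *.consul lookups to the local Consul agent's DNS interface.
server=/consul/127.0.0.1#8600
```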
Kubernetes
- Kubernetes deploys the Consul Server Cluster as a StatefulSet with a minimum replica count of 5.
- Kubernetes deploys the Consul Client agent as a DaemonSet so that it gets deployed onto all nodes.
- The Consul Helm chart also provides a DNS service (`consul-gl-consul-dns.consul.svc.cluster.local:53`), and this service is configured with `internalTrafficPolicy: local`. This means that any DNS queries from web/api/git/etc. pods that use the above endpoint should end up using their local Consul agent.
- Both k8s Clients and Servers are configured via `k8s-workloads/gitlab-helmfiles`.
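A quick way to sanity-check this DNS path from inside a cluster is to run a throwaway pod and query the service endpoint directly. This is a sketch: the pod name, image, and the queried service name are illustrative, not part of our tooling.

```sh
# Run a temporary pod with dig and query the Consul DNS Service.
# Because of internalTrafficPolicy: local, the query should land on the node-local client.
kubectl run consul-dns-test --rm -it --restart=Never --image=alpine:3.19 -- \
  sh -c 'apk add --no-cache bind-tools >/dev/null && dig @consul-gl-consul-dns.consul.svc.cluster.local master.patroni.service.consul SRV +short'
```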
Consul VM Agents
All VMs that have Consul installed keep their configuration files in `/etc/consul`. A general config file, `consul.json`, provides the necessary configuration for the service to operate. Anything in `conf.d` is a service that Consul takes part in, including the health checks Consul will execute to tell the Consul cluster whether that service is healthy on that particular node. The `ssl` directory contains secret items that are discussed later in this document.
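As a sketch of what one of those `conf.d` entries looks like, here is a minimal Consul service definition with a script health check. The service name, port, and check script path are purely illustrative; the real definitions are rendered by the gitlab_consul cookbook.

```json
{
  "service": {
    "name": "patroni-replica",
    "port": 5432,
    "checks": [
      {
        "args": ["/usr/local/bin/check-patroni-replica"],
        "interval": "10s",
        "timeout": "5s"
      }
    ]
  }
}
```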
Environment Specific
In general, the configurations look nearly identical between production and staging. Items that differ include the certificates, the keys, hostnames, and the environment metadata.
Performance
No performance testing of this service has ever been performed. The service was put into place before the need for such testing was recognized.
Scalability
The Consul cluster is currently managed using Helm. Additional nodes can be added by modifying the `server.replicas` count for the specific environment.
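A minimal sketch of what that change looks like in values form, assuming the upstream Consul chart's value names; the exact nesting inside our helmfiles may differ.

```yaml
# Illustrative values override for the consul-gl release.
server:
  replicas: 5   # raise this to add Consul server pods for the environment
```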
Agents can come and go as they please. On Kubernetes, this is very important as our nodes auto-scale based on cluster demand.
Availability
Cluster Servers
With 5 consul servers participating, we can lose up to 2 before we lose the ability to have quorum.
Diagnosing service failures on the cluster servers requires observing logs and taking action based on the failure scenario. Regional outages may impact Consul depending on how many servers have been affected. For example, if a single zone goes completely offline, between 1 and 2 servers may be negatively impacted, resulting in no harm as quorum will still be met by the remaining participants of the cluster. Load will become mildly higher during this period as fewer nodes can respond to queries, and the servers will continue to try to reach the failed nodes of the cluster.
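When assessing the blast radius of such an outage, the quickest check is to look at the raft peer set and member list from any healthy server. These are the same commands referenced later in the upgrade steps, shown here as a short sketch:

```sh
# List raft peers: shows the current leader and the voter status of each server.
consul operator raft list-peers

# List all known members; unreachable servers/agents show up as "failed" or "left".
consul members
```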
Failure Recovery
Consul operates very quickly in the face of failure. When a cluster is restored, it takes just a few seconds for the quorum to be reached.
Consul has documented a set of common error messages.
Split Brain
Consul can end up in a split-brain state. This may happen when network connectivity between two availability zones is lost and later recovers, and the election terms differ between the cluster servers. We currently do not have the appropriate monitoring, as the version of Consul we use does not provide the metrics required to detect this situation. Suffering a split brain may give some servers improper data, which may lead to the application speaking to the wrong database servers.
This can be found by logging into each of the consul servers and listing out the members as described in our Useful Commands. Recovery for this situation is documented by Hashicorp: Recovery from a split brain
A summary of the document would be to perform the following:
- Identify which cluster is safe to utilize
  - This is subjective and therefore cannot be prescribed in this document
- Stop the consul service on the nodes that need to be demoted
- Move/Delete the data directory (defined in the `consul.json` config file)
- Start the consul service one node at a time, validating that each one joins the cluster successfully
Dependencies
Consul has minimal dependencies to operate:
- A healthy networking connection
- Operating DNS
- A healthy server
Memory and disk usage of Consul is very minimal. Disk usage primarily consists of state files that are stored in `/var/lib/consul`. For our environments, the largest file is going to be the `raft.db`, which will vary in size, usually growing as we use this service more. As of today we primarily use this to store which servers and services are healthy. Production appears to utilize approximately 150MB of space. This database uses BoltDB underneath Consul and is subject to growth until the next compaction is run. All of this happens in the background of Consul itself and shouldn't be a concern for our Engineers.
This behavior is configurable via two Consul options.
If we see excessive growth in disk usage, we should first validate whether or not it is caused by Consul. If yes, we then need to look for any behavioral changes in how we utilize Consul, for example adding a new service or a set of servers that make a lot of changes to Consul. This could signify that we need to expand Consul if the usage is determined to be normal, or that a service is misbehaving and putting undue load on the Consul cluster.
If DNS is failing, Consul may fail to properly resolve the addresses of clients and other Consul servers. This will effectively bring down a given node and potentially the cluster. We currently do not provide any special DNS configuration on the Consul servers and are subject to the resolvers provided by our infrastructure.
Consul Agents
A loss of a Consul agent will prevent any service running on that node from properly resolving DNS and/or participating in the Consul cluster. How a service is impacted depends on the service.
- Postgres consul agent failure == replicas will not be registered in the service, providing fewer replicas for GitLab services to talk to. This will lead to higher pressure on the remaining replicas. If the primary database
- Agent failure on nodes running GitLab Services == GitLab services will be unable to properly query which database server to talk to. Due to this, the application will default to always opening connections to the primary.
Diagnosing agent failures requires observing logs and taking action based on the failure scenario.
Consul agents only depend on the ability to communicate with the consul servers. If this is disrupted for any reason, we must determine what caused the interruption. The agents store very little data on disk, and their memory and CPU requirements are very low.
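When diagnosing a suspect agent on a VM, the local agent exposes enough to tell whether it is running and still part of the cluster. A sketch follows; the systemd unit name and the default local HTTP port (8500) are assumptions, so adjust to however the gitlab_consul cookbook installs the service.

```sh
# Is the agent healthy and does it still see the rest of the cluster?
consul info
consul members

# Same information over the local HTTP API (assumes the default port 8500).
curl -s http://127.0.0.1:8500/v1/agent/self | head

# Recent agent logs (unit name is an assumption).
sudo journalctl -u consul --since "1 hour ago"
```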
Recovery Time Objective
We do not have a defined RTO for this service. This is currently unobtainable due to the lack of frequent testing, lack of monitoring, and use of outdated versions of this service.
Durability
The data held within Consul is dynamic but strongly agreed upon, as Consul is designed to reach consensus on the data it has knowledge of.
We do not perform any backup of the data stored in consul.
Should a single node fail, the StatefulSet in k8s will automatically bring it back into the cluster without needing to worry about data on disk. As soon as a consul server is brought back into participation in the cluster, the raft database will sync, enabling that server to begin participating in the cluster within a matter of a few seconds.
Security/Compliance
Secrets
Consul utilizes a secret key pair to prevent intrusion by rogue clients. This key is stored in Vault under the path `k8s/env/{{ .Environment.Name }}/ns/consul/tls`. The values of these are dropped on the servers in `/etc/consul/ssl/certs` and are populated in the Consul configuration file to enable TLS verification of clients.
All Gossip Protocol traffic is encrypted. This key is stored in Vault under the path `k8s/env/{{ .Environment.Name }}/ns/consul/gossip`.
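Tying those two secrets together, the relevant portion of an agent configuration looks roughly like the following sketch. The file names under `/etc/consul/ssl/certs` and the exact option set are illustrative; the real values are templated from Vault as described above.

```json
{
  "encrypt": "<gossip key from the Vault gossip path>",
  "verify_incoming": true,
  "verify_outgoing": true,
  "ca_file": "/etc/consul/ssl/certs/ca.pem",
  "cert_file": "/etc/consul/ssl/certs/consul.pem",
  "key_file": "/etc/consul/ssl/certs/consul-key.pem"
}
```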
Monitoring/Alerting
- Overview Dashboard
- Logs can be found on Kibana under the `pubsub-consul-inf-gprd*` index.
- We only appear to alert on specific services and saturation of our nodes. We do not appear to have any health metrics that are specific to the Consul service.
Deployments in k8s
When we bump the `chart_version` for Consul (`consul_gl`) in the `bases/environments.yaml` file, this actually bumps the Consul version for servers and clients. In all non-production environments, this will trigger an upgrade of the servers and clients with no manual intervention required. Server pods will be rotated one by one, and client pods will get rotated simultaneously. So you might have clients that are temporarily on a newer version than the server, but this is usually fine as Hashicorp has a protocol compatibility promise.
Nevertheless, in production we want to be more cautious and thus we make use of two guard rails for the sake of reliability:
- For servers, we set `server.updatePartition` to the number of replicas minus 1 (i.e., currently in `gprd` this is set to `4`). This setting allows us to carefully control a rolling update of Consul server agents. With the default setting of `0`, k8s would simply rotate each pod and wait for the health check to pass before moving on to the next pod. This may be OK, but we could run into a situation where the pod passes its health check, then becomes unhealthy after k8s has already moved on to the next pod. This could potentially result in an unhealthy Consul cluster, which would impact critical components like Patroni. Given the importance of a healthy Consul cluster, we decided the inconvenience of human intervention to occasionally upgrade Consul was justified to minimize the risk of an outage.
- For clients, we set `client.updateStrategy.type` to `OnDelete` so we can wait until we're done upgrading the server cluster before we upgrade clients.
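Roughly, these two guard rails correspond to the following values. This is a sketch using the upstream Consul chart's value names; the exact placement within `k8s-workloads/gitlab-helmfiles` may differ.

```yaml
server:
  replicas: 5
  # Only server pods with ordinal >= 4 roll automatically; the rest are rolled by hand.
  updatePartition: 4
client:
  updateStrategy: |
    type: OnDelete
```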
The following instructions describe the upgrade process for servers and clients for production only. As mentioned above, all other environments will upgrade servers and clients automatically once the `chart_version` is bumped.
Servers
The `server.updatePartition` setting controls how many instances of the server cluster are updated when the `.spec.template` is updated. Only instances with an index greater than or equal to `updatePartition` (zero-indexed) are updated. So by setting this value to `4`, we're effectively saying only recreate the last pod (`...-consul-server-4`) but leave all other pods untouched.
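In other words, with 5 replicas (ordinals 0 through 4), the partition value determines which pods roll. The sketch below maps the partition values used in this procedure to the pods that get recreated, and shows one way to watch each stage with `kubectl rollout status`:

```sh
# partition=4 -> only consul-gl-consul-server-4 is recreated
# partition=2 -> consul-gl-consul-server-{2,3,4}
# partition=0 -> all server pods
kubectl --context gke_gitlab-production_us-east1_gprd-gitlab-gke -n consul \
  rollout status statefulset consul-gl-consul-server
```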
The upgrade process is as follows:
- Bump `chart_version` in the `bases/environments/*.yaml` file(s)
- Create your MR, get it reviewed and merged
- Once Helm starts to apply the change, you should see the `...-consul-server-4` pod get recreated, but the rest will remain unchanged. No client pods will get rotated at this stage.
- Helm will hang waiting for all pods to be recreated, so this is where you need to take action.
- SSH to one of the Consul members and keep an eye on the servers:
  `watch -n 2 consul operator raft list-peers`
- Bring up your tunnel to talk to the `gprd` regional k8s cluster (e.g., `glsh kube use-cluster gprd`)
- Confirm that the Consul cluster looks healthy:
  `kubectl --context gke_gitlab-production_us-east1_gprd-gitlab-gke -n consul get pods -o wide -l component=server`
  The pod `consul-gl-consul-server-4` should only show minutes in the `AGE` column. All pods should be `Running`.
- Rotate 2 more pods:
  `kubectl --context gke_gitlab-production_us-east1_gprd-gitlab-gke -n consul patch statefulset consul-gl-consul-server -p '{"spec":{"updateStrategy":{"rollingUpdate":{"partition":2}}}}'`
- You should now see 2 more pods get recreated. Wait until the Consul cluster is healthy (should only take a few secs to a minute) and do the last 2 pods:
  `kubectl --context gke_gitlab-production_us-east1_gprd-gitlab-gke -n consul patch statefulset consul-gl-consul-server -p '{"spec":{"updateStrategy":{"rollingUpdate":{"partition":0}}}}'`
- You should now see the remaining 2 pods get recreated. You can now put the setting back to `4`:
  `kubectl --context gke_gitlab-production_us-east1_gprd-gitlab-gke -n consul patch statefulset consul-gl-consul-server -p '{"spec":{"updateStrategy":{"rollingUpdate":{"partition":4}}}}'`
- You should now see the Helm apply job complete successfully.
Clients
To upgrade the clients, you have two choices:
- Do nothing and let the clients get upgraded organically as nodes come and go due to autoscaling. This is an acceptable approach if there is no rush on upgrading clients.
- Rotate the pods by doing the `updateStrategy` dance as follows. For each of the 4x production k8s clusters (regional + zonals):
  - Change `updateStrategy.type` to `RollingUpdate`:
    `kubectl --context <cluster> -n consul patch daemonset consul-gl-consul-client -p '{"spec":{"updateStrategy":{"type":"RollingUpdate"}}}'`
  - Wait until all client pods have been rotated.
  - Revert the change to `updateStrategy.type`:
    `kubectl --context <cluster> -n consul patch daemonset consul-gl-consul-client -p '{"spec":{"updateStrategy":{"type":"OnDelete"}}}'`