Product Analytics SSL Troubleshooting
Note: To run the kubectl commands suggested in this document, make sure you are in the correct Kubernetes context and namespace. You can find out which context to use by checking the “Cluster” dropdown in the monitoring dashboard. All commands should be run in the default namespace.
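For example, to switch to that context and make default the active namespace (the context name here is a placeholder for whatever the dashboard shows):
kubectl config use-context <CONTEXT_NAME>
kubectl config set-context --current --namespace=default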
Services affected by SSL outages
All of our external endpoints use SSL certificates, as does the internal communication between various services, so an SSL outage could affect the entire stack if multiple certificates fail at the same time:
- Outages to the Snowplow collector will prevent events from being ingested by the stack.
- Outages to the Cube endpoint or Clickhouse will prevent events from being retrievable.
- Outages to the configurator will prevent new projects from being onboarded to Product Analytics.
Internally, SSL outages to the Snowplow enricher, Vector, Kafka, or Clickhouse will prevent events from being processed. Although we do have queuing systems in place to help retain events when they can’t be processed, they won’t keep the events in a holding pattern forever.
Examples of SSL errors
We send SSL metrics to our monitoring dashboard if you would like to observe the errors over time. You can also review the various logs through that dashboard.
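If you prefer to check a workload’s logs directly, a quick way to scan the last hour for SSL-related messages (the deployment name here is a placeholder) is:
kubectl logs deployment/<DEPLOYMENT_NAME> --since=1h | grep -iE "ssl|handshake|certificate"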
There are a few errors we’ve seen in the past which indicate that we’re having an SSL outage. These include:
Vector
"2023-12-14T08:29:31.003191Z WARN sink{component_kind="sink" component_id=clickhouse_enriched_events component_type=clickhouse component_name=clickhouse_enriched_events}:request{request_id=517}:http: vector::internal_events::http_client: HTTP error. error=connection error: Connection reset by peer (os error 104) error_type="request_failed" stage="processing" internal_log_rate_limit=true"
Snowplow enricher
Failed authentication with <REDACTED_CLUSTER_NAME>-kafka/<REDACTED_IP> (SSL handshake failed)
[pool-1-thread-2] INFO com.snowplowanalytics.snowplow.enrich.common.fs2.Environment - Enrich stopped
org.apache.kafka.common.errors.SslAuthenticationException: SSL handshake failed
Caused by: javax.net.ssl.SSLHandshakeException: PKIX path validation failed: java.security.cert.CertPathValidatorEx
Kafka
[2023-12-14 10:14:23,906] INFO [SocketServer listenerType=ZK_BROKER, nodeId=0] Failed authentication with /<REDACTED_IP> (channelId=<REDACTED_CHANNEL_ID>) (SSL handshake failed) (org.apache.kafka.common.network.Selector)
Kafka exporter
Cannot get current offset of topic __consumer_offsets partition 46: x509: certificate has expired or is not yet valid: current time 2024-09-09T11:12:31Z is after 2024-09-09T10:50:48Z
LoadBalancer
GKE events for the Ingress type can be found in this panel.
Alternatively, you can run:
kubectl get events --field-selector involvedObject.kind=Ingress
Error syncing to GCP: error running backend syncing routine: error ensuring health check: googleapi: Error 400: Invalid value for field 'resource.timeoutSec': '15'. TimeoutSec should be less than checkIntervalSec., invalid
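If the Ingress events point at a certificate problem on an external endpoint, you can also check which certificate is actually being served and when it expires (the hostname is a placeholder for one of our external endpoints):
echo | openssl s_client -connect <HOSTNAME>:443 -servername <HOSTNAME> 2>/dev/null | openssl x509 -noout -dates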
Fixing SSL errors
First, you should identify which certificates have failed. A good place to start is our monitoring dashboard: check for any telltale logs. If you can work out which services have been affected, that will help narrow down which certificates may be causing issues.
Once you know which services may be affected, you can use Kubernetes to read certificate details. You will need to follow the prerequisite steps for the Analytics Stack to be able to run these commands.
Start by getting all the certificates known by Kubernetes:
kubectl get certificates
This gives you a list of certificate names and how old they are. Oftentimes the certificate renewal will have failed, and the age of the certificate can give you an indication that this is the case.
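The output looks roughly like the following (certificate names and ages are illustrative):
NAME                             READY   SECRET                           AGE
<CLUSTER_NAME>-external-tls      True    <CLUSTER_NAME>-external-tls      120d
<CLUSTER_NAME>-kafka-internal    False   <CLUSTER_NAME>-kafka-internal    200d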
For more details about a specific certificate, you can use the certificate’s name from the above command:
kubectl describe certificate <CERT_NAME>
Within the output, you will find a Status subsection which may show any problems:
Status:
  Conditions:
    Last Transition Time:  2023-09-07T08:38:37Z
    Message:               Certificate is up to date and has not expired
    Observed Generation:   1
    Reason:                Ready
    Status:                True
    Type:                  Ready
  Not After:               2024-04-04T08:38:36Z
  Not Before:              2024-01-05T08:38:36Z
  Renewal Time:            2024-03-05T08:38:36Z
  Revision:                3
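If the Status reports a problem, it can also be worth checking the CertificateRequest resources that cert-manager creates during issuance, as these often carry the underlying error (the request name is whatever appears in the listing):
kubectl get certificaterequests
kubectl describe certificaterequest <REQUEST_NAME>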
You can also get a specific certificate’s expiry directly:
kubectl get secret <CLUSTER_NAME>-certificates-<SECRET_NAME> -o "jsonpath={.data['tls\\.crt']}" | base64 -D | openssl x509 -dates -noout
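The expiry dates are printed in the following form (values are illustrative). Note that -D is the macOS flag for base64 decoding; on Linux use base64 -d instead:
notBefore=Jan  5 08:38:36 2024 GMT
notAfter=Apr  4 08:38:36 2024 GMT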
We use the cert-manager Kubernetes plugin to manage SSL certificates, with Let’s Encrypt certificates for external endpoints and self-signed certificates for internal communications. This means we can use the cmctl CLI to manually renew certificates as required.
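For example, to manually renew a certificate identified above and then check its status (this assumes cmctl is installed locally and you are in the correct context and namespace):
cmctl renew <CERT_NAME>
cmctl status certificate <CERT_NAME>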
Once the certificates have been created or renewed, you may need to redeploy to Kubernetes. The easiest way to do this is to use the existing CI pipelines.
- Find the most recent release.
- Click the commit at the bottom of the release.
- Click the pipeline for the commit.
- Trigger the environment deployment for the affected environment.
In some cases, a certificate might have a Ready status while the certificate issuer is not ready. You can use the following command to get information on issuers:
kubectl get issuer
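Example output (issuer names are illustrative):
NAME                           READY   AGE
<CLUSTER_NAME>-letsencrypt     True    310d
<CLUSTER_NAME>-selfsigned      False   310d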
Confirm that all issuers have True in the READY column. If not, get more information on the specific issuer with
kubectl describe issuer <ISSUER_NAME>
which will give you detailed information on the issuer. Check the Status field and look for the Reason and Message entries; these should explain why the issuer is not ready.
Kafka exporter
The Kafka exporter uses a self-signed certificate which does not auto-renew. This can mean that the certificate expires if we haven’t made any changes for an extended period of time.
If you’re seeing Kafka exporter errors in Grafana, then the easy fix is to restart the deployment:
kubectl rollout restart deployment <CLUSTER_NAME>-kafka-exporter
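You can then confirm the restart has completed successfully with:
kubectl rollout status deployment <CLUSTER_NAME>-kafka-exporter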