Troubleshooting: Usage Billing - Data Ingestion
For Data Insights Platform (DIP) deployments, we use metrics-catalog to set up & manage our Ingester SLIs.
This setup automatically provides a few alert definitions, which can be routed to our on-call engineers. The following sections describe how to approach each of these alerts if & when you receive them.
No data received OR Increased error rates
- Begin with our main dashboard for Usage Billing.
- Look for key signals in the Throughput panel:
- Requests: Do we have non-zero requests being received?
- Errors: Do we have non-zero errors being generated?
- Ingestion latency: Is the trend abnormal compared to a longer time window?
While we do not alert on ingestion latency yet, an increase from the baseline can be symptomatic of other issues in the system, e.g. resource starvation on DIP pods, NATS being unavailable, etc.
- If we don’t see incoming requests, check with upstream sources of this traffic first.
- In the current iteration, all traffic comes from AI-gateway. Check AI-gateway dashboard(s).
- All requests are proxied via Cloudflare; ensure we’re not rejecting traffic at that level. Check Cloudflare dashboard(s).
- For details around configured Cloudflare zones/hosts, refer to the configuration overview.
- Our ingress endpoints have rate-limits on them; ensure they have not been updated recently and/or are causing traffic to be rate-limited. This is also done via Cloudflare and can be monitored on the aforementioned Cloudflare dashboards. These rate-limits are provisioned via the config-mgmt repository here. (A quick curl sketch for spotting rate-limited responses follows below.)
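If you suspect rate-limiting, one client-side signal is a run of HTTP 429 responses from the ingress endpoint. The following is a minimal sketch; the host below is a hypothetical placeholder, so substitute the real ingress host from the configuration overview:

```
# Hypothetical ingress host; substitute the real host from the Cloudflare zone/host configuration.
INGRESS_HOST="usage-billing.example.gitlab.com"

# Fire a handful of requests and tally the status codes; a run of 429s suggests
# traffic is being rejected by the Cloudflare rate-limits before reaching the ingesters.
for i in $(seq 1 20); do
  curl -s -o /dev/null -w "%{http_code}\n" "https://${INGRESS_HOST}/"
done | sort | uniq -c
```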
- If we see an uptick in errors being generated:
- Check logs for the concerned environment to ascertain the nature of these failures: Kibana.
- Since ingesters only ever write data into NATS, ensure NATS is healthy (a quick health-check sketch follows the restart example below).
- If you see connection-specific issues, it should be okay to do a rolling restart of the DIP StatefulSet:

```
➜ ~ kubectl -n data-insights-platform rollout restart sts data-insights-platform-single
statefulset.apps/data-insights-platform-single restarted
```

- Outside of NATS, ingested events can be malformed and/or not compliant with Usage Billing data schemas. This class of errors will show up on this graph. If this is what’s happening, involve folks from the Analytics Instrumentation or AI Gateway teams.
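Related to the NATS points above, here is a minimal sketch for confirming NATS pods are up and that the rolling restart converged. The NATS namespace and label selector are assumptions, and the log check assumes a single application container in the pod; adjust both to match the actual deployment:

```
# Assumed NATS namespace and label selector; adjust to where NATS actually runs.
kubectl -n nats get pods -l app.kubernetes.io/name=nats

# Block until the DIP StatefulSet rollout finishes after the restart above.
kubectl -n data-insights-platform rollout status sts data-insights-platform-single

# Skim recent logs for NATS connection/write errors (assumes a single app container in the pod).
kubectl -n data-insights-platform logs data-insights-platform-single-0 --since=15m | grep -i nats
```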
- If we see ingestion latency trending upwards and/or diverging from its normal baseline:
- Check resource consumption on the DIP pods, are they being starved?
- Data Insights Platform - Usage Billing Dashboard > check Consumption panel.
- Are these pods running out of memory consistently?
- Are we not able to write to NATS fast enough?
- This should be evident from the ingester pods, with requests timing out during writes to NATS. Check the aforementioned logs.
- Solution: See if vertically or horizontally scaling the StatefulSet helps (a kubectl sketch follows after this list).
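For the latency checks above, the sketch below can help spot resource starvation and scale out if needed. Note that `kubectl top` requires metrics-server in the cluster, and if the deployment is managed declaratively (e.g. via a config repository), any scaling change should ultimately be reflected there rather than left as an ad-hoc tweak:

```
# Live CPU/memory usage per pod (requires metrics-server).
kubectl -n data-insights-platform top pods

# Look for OOMKilled terminations and climbing restart counts on a suspect pod.
kubectl -n data-insights-platform describe pod data-insights-platform-single-0 | grep -iA3 "last state"

# Horizontal scaling sketch: bump the replica count; vertical scaling means raising the
# container resource requests/limits in the StatefulSet spec instead.
kubectl -n data-insights-platform scale sts data-insights-platform-single --replicas=4
```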
GKE setup on CustomerDot environments - stgsub & prdsub
- Ensure you have access to NordVPN, see this for more details.
- Ensure you can connect to the two environments as necessary, e.g. for prdsub, run the following when connected to a NordVPN gateway:
```
➜ ~ glsh kube use-cluster prdsub --no-proxy
Switched to context "gke_gitlab-subscriptions-prod_us-east1_prdsub-customers-gke".
```

- Data Insights Platform is set up in Single mode via a StatefulSet. Single mode is when we deploy all DIP components in a single pod, so ingesters, enricher, and billing-exporter all run as separate processes within the same pod (a quick way to verify this is sketched at the end of this section).
```
➜ ~ kubectl -n data-insights-platform get sts data-insights-platform-single
NAME                            READY   AGE
data-insights-platform-single   3/3     34d

➜ ~ kubectl -n data-insights-platform get pods
NAME                                                               READY   STATUS    RESTARTS       AGE
data-insights-platform-ingress-nginx-controller-6765bdc6964r2t7   1/1     Running   0              2d4h
data-insights-platform-single-0                                    1/1     Running   0              2d3h
data-insights-platform-single-1                                    1/1     Running   2 (2d3h ago)   2d3h
data-insights-platform-single-2                                    1/1     Running   1 (2d3h ago)   2d3h
```

- Check if the application container has enough resources allocated:
```
➜ ~ kubectl -n data-insights-platform get sts data-insights-platform-single -o json | jq -r '.spec.template.spec.containers[].resources'
{
  "limits": {
    "cpu": "2",
    "memory": "2Gi"
  },
  "requests": {
    "cpu": "250m",
    "memory": "1Gi"
  }
}
```
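To sanity-check the Single-mode setup described above, you can list the processes running inside one of the pods. This is a sketch only; it assumes `ps` is available in the application image:

```
# In Single mode, the ingester, enricher and billing-exporter processes should all
# show up side by side inside the same pod (assumes ps exists in the image).
kubectl -n data-insights-platform exec data-insights-platform-single-0 -- ps aux
```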