Diagnosis with Kibana
Background
- Logging pipeline architecture (complex diagram)
  - Elasticsearch / Kibana
    - Retention: 7 days
    - Logs flow: Application => Log file => FluentD => Pub/Sub => Pubsubbeat => Elasticsearch
  - GCS archive
    - Retention: 1 year
    - Logs flow: Application => Log file => FluentD => Stackdriver => Archive GCS bucket
    - Most useful for security-related RCAs.
    - Can be imported into BigQuery for analysis.
- Elasticsearch / Kibana
  - Logging cluster: https://log.gprd.gitlab.net
Exploration
Which indices correspond to which services?
Take a look at the list of indices in Kibana and try to find out which indices correspond to which service.
- pubsub-rails-inf-gprd*: web, git and api traffic
- pubsub-consul-inf-gprd*: service discovery, DB failover
- pubsub-gitaly-inf-gprd-*: Git repository storage
- pubsub-gke-inf-gprd*: meta Kubernetes logs
- pubsub-gcp-events-inf-gprd-*: GCP maintenance events
- pubsub-kas-inf-gprd*: server-side Kubernetes Agent Service
- pubsub-mailroom-inf-gprd-*: receiving emails
- pubsub-monitoring-inf-gprd-*: Prometheus & Thanos meta monitoring
- pubsub-pages-inf-gprd*: logs for GitLab Pages
- pubsub-postgres-inf-gprd-*: Patroni hosts
- pubsub-pubsubbeat-inf-gprd-*: meta log of the logging pipeline
- pubsub-puma-inf-gprd*: rails webservice logging (not requests)
- pubsub-pvs-inf-gprd*: Pipeline Validation Service
- pubsub-redis-inf-gprd*: Redis and Sentinel
- pubsub-registry-inf-gprd*: Registry traffic + monitoring
- pubsub-runner-inf-gprd*: all runner logs
- pubsub-shell-inf-gprd*: SSH traffic to Gitaly
- pubsub-sidekiq-inf-gprd*: background job queue logs
- pubsub-system-inf-gprd*: host-level syslog
- pubsub-workhorse-inf-gprd-*: proxy in front of rails, all traffic
- release_tools-*: Deployer (owned by Delivery)
Explore the schema of the most important indices
These tend to be high-volume indices for critical path services in our infrastructure.
- pubsub-rails-inf-gprd*
- pubsub-gitaly-inf-gprd-*
- pubsub-sidekiq-inf-gprd*
- pubsub-workhorse-inf-gprd-*
Finding things in logs, filtering on fields, looking at value distributions
- Find all 5xx errors on the API fleet over the last 6 hours
- What is the distribution of traffic served by api vs git vs web?
- How much traffic does workhorse absorb and not pass through to rails?
- Which endpoints are receiving the most traffic?
- Which users are sending the most requests?
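
As a starting point for the first exercise in this list, the same filter can be written as an Elasticsearch query body. This is a minimal sketch, not the actual schema of the indices: the field names `json.type` and `json.status` are assumptions to be checked against the index schema in Kibana, and `build_5xx_query` is a hypothetical helper.

```python
import json

# Assumed field names; verify them against the index schema in Kibana.
FLEET_FIELD = "json.type"
STATUS_FIELD = "json.status"


def build_5xx_query(fleet="api", hours=6):
    """Build a query body matching 5xx responses from one fleet in the last `hours` hours."""
    return {
        "query": {
            "bool": {
                "filter": [
                    # Restrict to one fleet (api, git or web).
                    {"term": {FLEET_FIELD: fleet}},
                    # HTTP status codes 500 through 599.
                    {"range": {STATUS_FIELD: {"gte": 500, "lte": 599}}},
                    # Elasticsearch date math: from `hours` ago until now.
                    {"range": {"@timestamp": {"gte": f"now-{hours}h", "lte": "now"}}},
                ]
            }
        }
    }


if __name__ == "__main__":
    print(json.dumps(build_5xx_query(), indent=2))
```

Under the same assumed field names, the equivalent search in the Kibana query bar would be roughly `json.type: "api" and json.status >= 500`, with the time picker set to the last 6 hours.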
Correlation across indices via correlation_id (tracing)
- Make a request to GitLab.com, find the correlation_id from the HTTP response header
- Go to the Correlation dashboard in Kibana
- Look at which services are being traversed in the process
- Where is most of the time being spent?
- Which services are being hit multiple times per request?
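
The first step of this exercise can be scripted. The sketch below assumes the correlation ID is returned in the `X-Request-Id` response header and that log documents store it in a field named `json.correlation_id`; both names are assumptions to verify against a real response and the index schema. The helper functions are hypothetical.

```python
import urllib.request

# Assumption: the correlation ID is returned in this response header.
CORRELATION_HEADER = "x-request-id"
# Assumption: log documents store the ID in this field.
CORRELATION_FIELD = "json.correlation_id"


def extract_correlation_id(headers):
    """Return the correlation ID from a mapping of response headers, or None if absent."""
    for name, value in headers.items():
        # Header names are case-insensitive, so compare in lowercase.
        if name.lower() == CORRELATION_HEADER:
            return value
    return None


def fetch_correlation_id(url):
    """Send a HEAD request to `url` and return the correlation ID of the response."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request) as response:
        return extract_correlation_id(response.headers)


def kibana_filter(correlation_id):
    """Build a Kibana query bar string that matches one correlation ID."""
    return f'{CORRELATION_FIELD}: "{correlation_id}"'
```

Calling `kibana_filter(fetch_correlation_id("https://gitlab.com/"))` would produce a search string that can be pasted into the Kibana query bar for each index of interest, as a manual alternative to the Correlation dashboard.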
Cross-links from Grafana service dashboards
- Go to the api service in Grafana and find cross-links on the right-hand side
- Try out failed requests, slow requests, and visualizations
- Find a sample slow or failed request and trace it via the correlation dashboard
Visualization and time-series top-k queries
- Modify existing visualizations to perform a top-k analysis
- Which users are using the most request time on rails processes?
- Which projects are using the most CPU time on gitaly?
- Which sidekiq job classes are processing the most jobs?
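
A time-series top-k question of this kind maps onto a `terms` aggregation ordered by a `sum` sub-aggregation, nested inside a `date_histogram`. The sketch below builds such a query body. It is an illustration under assumptions: `build_top_k_query` is a hypothetical helper, and the field names `json.username.keyword` and `json.duration_s` are placeholders to be replaced with the real fields from the index schema.

```python
def build_top_k_query(group_field, sum_field, k=10, interval="5m", hours=6):
    """Build a query body that, for each time bucket, returns the k values of
    `group_field` with the largest summed `sum_field`."""
    return {
        "size": 0,  # return only aggregation results, no individual log lines
        "query": {"range": {"@timestamp": {"gte": f"now-{hours}h", "lte": "now"}}},
        "aggs": {
            "over_time": {
                # Split the time window into fixed-width buckets.
                "date_histogram": {"field": "@timestamp", "fixed_interval": interval},
                "aggs": {
                    "top_k": {
                        # Keep the k terms with the highest "total" in each bucket.
                        "terms": {
                            "field": group_field,
                            "size": k,
                            "order": {"total": "desc"},
                        },
                        "aggs": {"total": {"sum": {"field": sum_field}}},
                    }
                },
            }
        },
    }


# Hypothetical field names for "which users use the most request time on rails".
rails_top_users = build_top_k_query("json.username.keyword", "json.duration_s")
```

Swapping the two field arguments covers the other questions, for example grouping by a project field and summing a CPU time field on the gitaly index. Replacing the `sum` sub-aggregation with the default document count ordering gives a top-k by number of requests or jobs instead.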