Diagnosis with Kibana
Background
- Logging pipeline architecture (complex diagram)
- Elasticsearch / Kibana
  - Retention: 7 days
  - Logs flow: Application => Log file => FluentD => Pub/Sub => Pubsubbeat => Elasticsearch
- GCS archive
  - Retention: 1 year
  - Logs flow: Application => Log file => FluentD => Stackdriver => Archive GCS bucket
  - Most useful for security-related RCAs.
  - Can be imported into BigQuery for analysis.
- Elasticsearch / Kibana
  - Logging cluster: https://log.gprd.gitlab.net
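The two retention windows above decide where to look for a given event. A minimal sketch of that decision follows; the function name and the returned labels are hypothetical, and "1 year" is approximated as 365 days.

```python
from datetime import timedelta

# Retention windows taken from the list above.
ES_RETENTION = timedelta(days=7)
GCS_RETENTION = timedelta(days=365)  # assumption: "1 year" treated as 365 days


def pick_log_store(event_age: timedelta) -> str:
    """Return a (hypothetical) label for the store that should still hold the logs."""
    if event_age < timedelta(0):
        raise ValueError("event_age must not be negative")
    if event_age <= ES_RETENTION:
        return "elasticsearch"  # searchable directly in Kibana
    if event_age <= GCS_RETENTION:
        return "gcs-archive"  # would need an import, e.g. into BigQuery
    return "expired"


print(pick_log_store(timedelta(hours=6)))
print(pick_log_store(timedelta(days=30)))
```

In practice this means an incident from last week is a Kibana search, while an RCA reaching back a month has to start from the GCS archive.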
Exploration
Which indices correspond to which services?
Take a look at the list of indices in Kibana and work out which index corresponds to which service.
- pubsub-rails-inf-gprd*: web, git and api traffic
- pubsub-consul-inf-gprd*: service discovery, DB failover
- pubsub-gitaly-inf-gprd-*: Git repository storage
- pubsub-gke-inf-gprd*: meta Kubernetes logs
- pubsub-gcp-events-inf-gprd-*: GCP maintenance events
- pubsub-kas-inf-gprd*: server-side Kubernetes Agent Service
- pubsub-mailroom-inf-gprd-*: receiving emails
- pubsub-monitoring-inf-gprd-*: Prometheus & Thanos meta monitoring
- pubsub-pages-inf-gprd*: logs for GitLab Pages
- pubsub-postgres-inf-gprd-*: Patroni hosts
- pubsub-pubsubbeat-inf-gprd-*: meta log of the logging pipeline
- pubsub-puma-inf-gprd*: Rails webservice logging (not requests)
- pubsub-pvs-inf-gprd*: Pipeline Validation Service
- pubsub-redis-inf-gprd*: Redis and Sentinel
- pubsub-registry-inf-gprd*: Registry traffic + monitoring
- pubsub-runner-inf-gprd*: logs from all runners
- pubsub-shell-inf-gprd*: SSH traffic to Gitaly
- pubsub-sidekiq-inf-gprd*: background job queue logs
- pubsub-system-inf-gprd*: host-level syslog
- pubsub-workhorse-inf-gprd-*: proxy in front of Rails; all traffic
- release_tools-*: Deployer (owned by Delivery)
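The index patterns above are plain wildcard globs, so a concrete index name can be mapped back to its service by glob matching. The sketch below uses a subset of the patterns; the function name and the concrete index names in the usage lines are hypothetical examples, not real index names.

```python
from fnmatch import fnmatchcase

# Subset of the index patterns listed above, mapped to their descriptions.
INDEX_PATTERNS = {
    "pubsub-rails-inf-gprd*": "web, git and api traffic",
    "pubsub-gitaly-inf-gprd-*": "Git repository storage",
    "pubsub-sidekiq-inf-gprd*": "background job queue logs",
    "pubsub-workhorse-inf-gprd-*": "proxy in front of Rails; all traffic",
    "release_tools-*": "Deployer (owned by Delivery)",
}


def describe_index(index_name: str):
    """Return the description of the first pattern matching index_name, or None."""
    for pattern, description in INDEX_PATTERNS.items():
        if fnmatchcase(index_name, pattern):
            return description
    return None


# Hypothetical concrete index names, for illustration only.
print(describe_index("pubsub-gitaly-inf-gprd-000042"))
print(describe_index("some-unrelated-index"))
```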
Explore the schema of the most important indices
These tend to be high-volume indices for critical path services in our infrastructure.
- pubsub-rails-inf-gprd*
- pubsub-gitaly-inf-gprd-*
- pubsub-sidekiq-inf-gprd*
- pubsub-workhorse-inf-gprd-*
Finding things in logs, filtering on fields, looking at value distributions
- Find all 5xx errors on the API fleet over the last 6 hours
- What is the distribution of traffic served by api vs git vs web?
- How much traffic does workhorse absorb and not pass through to rails?
- Which endpoints are receiving the most traffic?
- Which users are sending the most requests?
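The first exercise above can also be written as an Elasticsearch query body rather than clicked together in Kibana. This is a sketch only: the field names `json.status` and `json.type`, the `@timestamp` time field, and the function name are assumptions, so check the real schema of the index before using them.

```python
import json

# Assumed field names; verify against the actual index mapping.
TIME_FIELD = "@timestamp"
STATUS_FIELD = "json.status"  # hypothetical: HTTP status code
FLEET_FIELD = "json.type"     # hypothetical: fleet (api, git, web)


def five_xx_query(fleet: str = "api", since: str = "now-6h") -> dict:
    """Build a query body for 5xx responses on one fleet within a time window."""
    return {
        "size": 0,  # only the aggregation is needed, not the raw hits
        "query": {
            "bool": {
                "filter": [
                    {"range": {TIME_FIELD: {"gte": since, "lte": "now"}}},
                    {"range": {STATUS_FIELD: {"gte": 500, "lte": 599}}},
                    {"term": {FLEET_FIELD: fleet}},
                ]
            }
        },
        # Value distribution of the status codes that matched.
        "aggs": {"by_status": {"terms": {"field": STATUS_FIELD, "size": 10}}},
    }


print(json.dumps(five_xx_query(), indent=2))
```

Swapping the `terms` aggregation field for an endpoint or user field (again, names depend on the schema) answers the "most traffic" and "most requests" questions in the same way.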
Correlation across indices via correlation_id (tracing)
- Make a request to GitLab.com, find the correlation_id from the HTTP response header
- Go to the Correlation dashboard in Kibana
- Look at which services are being traversed in the process
- Where is most of the time being spent?
- Which services are being hit multiple times per request?
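The last two questions above amount to grouping all log lines that share one correlation_id by service. A minimal sketch, assuming each log line has already been reduced to a dict with hypothetical `service` and `duration_s` keys; the sample records are invented purely for illustration.

```python
from collections import defaultdict


def summarize_trace(hits):
    """Group log lines of one correlation_id by service.

    Returns (service, {"count", "total_s"}) pairs, slowest service first.
    """
    summary = defaultdict(lambda: {"count": 0, "total_s": 0.0})
    for hit in hits:
        entry = summary[hit["service"]]
        entry["count"] += 1
        entry["total_s"] += hit.get("duration_s", 0.0)
    return sorted(summary.items(), key=lambda kv: kv[1]["total_s"], reverse=True)


# Invented sample records, not real log data.
sample_hits = [
    {"service": "workhorse", "duration_s": 1.5},
    {"service": "rails", "duration_s": 1.0},
    {"service": "gitaly", "duration_s": 0.25},
    {"service": "gitaly", "duration_s": 0.25},
]

for service, stats in summarize_trace(sample_hits):
    print(service, stats["count"], stats["total_s"])
```

Durations of nested services overlap (a proxy's time includes the time of whatever it proxies), so the totals are best read per service rather than summed.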
Cross-links from Grafana service dashboards
- Go to the api service in Grafana and find cross-links on the right-hand side
- Try out failed requests, slow requests, and visualizations
- Find a sample slow or failed request and trace it via the correlation dashboard
Visualization and time-series top-k queries
- Modify existing visualizations to perform a top-k analysis
- Which users are using the most request time on Rails processes?
- Which projects are using the most CPU time on Gitaly?
- Which Sidekiq job classes are processing the most jobs?
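A time-series top-k query of the kind described above can be sketched as an Elasticsearch aggregation: a `date_histogram` for the time buckets, with a `terms` aggregation inside it ordered by a `sum` sub-aggregation. The function name and the field names in the usage line (`json.username.keyword`, `json.duration_s`) are assumptions; substitute the fields from the actual index schema.

```python
import json


def top_k_over_time(group_field: str, metric_field: str,
                    k: int = 10, interval: str = "5m") -> dict:
    """Build an aggregation: per time bucket, the top k groups by summed metric."""
    return {
        "size": 0,  # aggregation results only
        "aggs": {
            "over_time": {
                "date_histogram": {"field": "@timestamp", "fixed_interval": interval},
                "aggs": {
                    "top_k": {
                        # Order the groups by the summed metric, largest first.
                        "terms": {
                            "field": group_field,
                            "size": k,
                            "order": {"total": "desc"},
                        },
                        "aggs": {"total": {"sum": {"field": metric_field}}},
                    }
                },
            }
        },
    }


# Hypothetical field names: users ranked by total request duration.
body = top_k_over_time("json.username.keyword", "json.duration_s", k=5)
print(json.dumps(body, indent=2))
```

Ordering by the `sum` rather than by document count is what distinguishes "most request time" from "most requests"; for the job-count question, dropping the custom order and the sub-aggregation falls back to the default count-based ranking.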