PubSub Queuing Rate Increasing

  • Fluentd sends our log messages to PubSub; the PubSub topics are subscribed to by our pubsubbeat machines, which forward the logs to Elasticsearch.
  • Either the beats or Elasticsearch itself is having trouble ingesting / forwarding these logs.
  • This dashboard provides details on the status of our PubSub queues.
    • If the queues for the following are growing, continue to investigate (see the query sketch after this list):
      • Backlog size
      • Old unacknowledged message age
      • Unacknowledged messages
      • Rate of change of unacked messages (there is a panel for this). If it’s positive and flat, ingestion has likely stopped altogether.
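If the dashboard is unavailable, the same metrics can be spot-checked from the command line through the Cloud Monitoring API. This is only a sketch: `my-logging-project` is a placeholder project ID, and it assumes gcloud is authenticated with read access to the project's monitoring data.

```shell
# Spot-check PubSub backlog metrics via the Cloud Monitoring API (v3 timeSeries.list).
PROJECT="my-logging-project"          # assumption: replace with the real project ID
TOKEN="$(gcloud auth print-access-token)"
START="$(date -u -d '15 minutes ago' +%Y-%m-%dT%H:%M:%SZ)"   # GNU date
END="$(date -u +%Y-%m-%dT%H:%M:%SZ)"

# Number of undelivered (backlogged) messages per subscription.
curl -s -G -H "Authorization: Bearer ${TOKEN}" \
  "https://monitoring.googleapis.com/v3/projects/${PROJECT}/timeSeries" \
  --data-urlencode 'filter=metric.type="pubsub.googleapis.com/subscription/num_undelivered_messages"' \
  --data-urlencode "interval.startTime=${START}" \
  --data-urlencode "interval.endTime=${END}"

# Age of the oldest unacknowledged message per subscription.
curl -s -G -H "Authorization: Bearer ${TOKEN}" \
  "https://monitoring.googleapis.com/v3/projects/${PROJECT}/timeSeries" \
  --data-urlencode 'filter=metric.type="pubsub.googleapis.com/subscription/oldest_unacked_message_age"' \
  --data-urlencode "interval.startTime=${START}" \
  --data-urlencode "interval.endTime=${END}"
```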
  • If the Publish message operations metric spikes, make sure it comes back down.
    • Ask around for potential changes to any characteristics of our infrastructure or application logging.
  • If the queues continue to climb, check the health of the Elasticsearch cluster, for example:
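A quick health check can be run directly against the cluster's HTTP API; `ES_HOST` below is a placeholder for one of our Elasticsearch nodes (or the load balancer in front of them).

```shell
ES_HOST="http://localhost:9200"   # assumption: replace with the real cluster endpoint

# Overall cluster status (green / yellow / red) and number of unassigned shards.
curl -s "${ES_HOST}/_cat/health?v"

# Per-node resource pressure: heap, CPU, load.
curl -s "${ES_HOST}/_cat/nodes?v&h=name,heap.percent,cpu,load_1m"

# Rejected write requests are a common sign the cluster cannot keep up with ingestion
# (on older Elasticsearch versions the pool is named "bulk" instead of "write").
curl -s "${ES_HOST}/_cat/thread_pool/write?v&h=node_name,active,queue,rejected"
```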
  • As one resort, we can halt a particular queue in the hope that we will eventually catch up (see the sketch after this list):
    • For every topic there is a dedicated server.
    • knife node list | grep pubsub
    • SSH into the chosen server and stop the pubsubbeat service.
    • This only stops that one topic; messages will continue to accumulate in PubSub.
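Put together, the steps above look roughly like the following. The node name is a placeholder and systemctl is an assumption about how the pubsubbeat service is managed on these hosts; confirm before stopping anything in production.

```shell
# Find the pubsubbeat node that handles the affected topic.
knife node list | grep pubsub

# Stop the beat on that node only; messages keep accumulating in PubSub for this topic.
# "pubsub-beat-example-01" is a placeholder node name.
ssh pubsub-beat-example-01 'sudo systemctl stop pubsubbeat'

# Once the Elasticsearch cluster has recovered, start it again to drain the backlog.
ssh pubsub-beat-example-01 'sudo systemctl start pubsubbeat'
```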
  • You can try to lower the retention on ES (see the example below).
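How retention is lowered depends on how the cluster is managed (Curator, ILM, or manual cleanup). As a hedged sketch, the quickest manual way to free space is to delete the oldest time-based indices; the index name below is a placeholder.

```shell
ES_HOST="http://localhost:9200"   # assumption: replace with the real cluster endpoint

# List indices sorted by creation date to find the oldest candidates for deletion.
curl -s "${ES_HOST}/_cat/indices?v&s=creation.date&h=index,creation.date.string,store.size"

# Delete a specific old index. "pubsub-rails-2019.01.01" is a placeholder name;
# deleting an index permanently removes its data.
curl -s -XDELETE "${ES_HOST}/pubsub-rails-2019.01.01"
```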
  • Should there be unallocated shards, investigate why they are unassigned before expecting ingestion to recover (see the sketch below).
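The allocation APIs show which shards are unassigned and why; `ES_HOST` is again a placeholder for a node in our cluster.

```shell
ES_HOST="http://localhost:9200"   # assumption: replace with the real cluster endpoint

# List shards that are currently unassigned.
curl -s "${ES_HOST}/_cat/shards?v" | grep UNASSIGNED

# Ask the cluster to explain why an unassigned shard is not being allocated.
curl -s "${ES_HOST}/_cluster/allocation/explain?pretty"
```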
  • As yet another resort, we could consider acking all messages in PubSub.
    • This will cause data loss, so it is only recommended as a final resort, and only for queues where we are okay with losing that data.
    • Example command for execution:
      gcloud alpha pubsub subscriptions seek <subscription path> --time=yyyy-mm-ddThh:mm:ss
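As a concrete usage example: seeking a subscription to the current time marks every message published before that instant as acknowledged, which drains the backlog at the cost of permanently losing those messages. The subscription path below is a placeholder.

```shell
# List subscriptions to find the full path of the affected one.
gcloud pubsub subscriptions list

# Seek the subscription to "now": all messages published before this time are
# marked acknowledged and will not be delivered again. Data loss is permanent.
# The subscription path is a placeholder.
gcloud alpha pubsub subscriptions seek \
  projects/my-logging-project/subscriptions/pubsub-rails-inf-prod \
  --time="$(date -u +%Y-%m-%dT%H:%M:%SZ)"
```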
  • Check our Stackdriver chart and ensure we don’t have messages queuing.