Workhorse Image Scaler
This runbook covers operations of Workhorse’s built-in image scaler.
Workhorse cooperates with Rails to route and handle image scaling requests. Workhorse itself is responsible
for proxying traffic, serving image data from either remote or block storage, and handling conditional GETs.
Image scaling itself, however, is not handled by Workhorse, but by a companion tool called `gitlab-resize-image` (just “scaler” hereafter) that Workhorse shells out to for every such request.
Image scaler requests are ordinary web requests for images served via the `/uploads/` path that additionally carry a `width` parameter, e.g.:
- `/uploads/-/system/group/avatar/22/avatar_w300.png?width=16`
- `/uploads/-/system/user/avatar/1/avatar.png?width=64`
NOTE:
- we currently only rescale project, group, and user avatars
- we only rescale PNGs and JPEGs (see `SAFE_IMAGE_FOR_SCALING_EXT`)
- we only rescale images when requesting a width defined by `ALLOWED_IMAGE_SCALER_WIDTHS`
- we only rescale images that do not exceed a configured size in bytes (see `max_filesize`)
- we only rescale images if enough scaler processes are available (see `max_scaler_procs`)
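To make the request shape and these eligibility rules concrete, here is a minimal Go sketch of the URL-level checks (path prefix, extension, and width allow-list). It is illustrative only: the function name and allow-list values below are assumptions for the example, not Workhorse's actual identifiers, and it deliberately omits the file-size and scaler-process checks.

```go
package main

import (
	"fmt"
	"net/url"
	"path"
	"strconv"
	"strings"
)

// Example values only: the authoritative allow-lists are the Rails-side
// constants ALLOWED_IMAGE_SCALER_WIDTHS and SAFE_IMAGE_FOR_SCALING_EXT.
var (
	allowedWidths     = map[uint64]bool{16: true, 64: true}
	allowedExtensions = map[string]bool{".png": true, ".jpg": true, ".jpeg": true}
)

// eligibleForScaling applies the URL-level checks described in the note
// above: an /uploads/ path, a PNG/JPEG extension, and an allow-listed width.
func eligibleForScaling(rawURL string) (uint64, bool) {
	u, err := url.Parse(rawURL)
	if err != nil || !strings.HasPrefix(u.Path, "/uploads/") {
		return 0, false
	}
	if !allowedExtensions[strings.ToLower(path.Ext(u.Path))] {
		return 0, false
	}
	width, err := strconv.ParseUint(u.Query().Get("width"), 10, 16)
	if err != nil || !allowedWidths[width] {
		return 0, false
	}
	return width, true
}

func main() {
	for _, raw := range []string{
		"/uploads/-/system/user/avatar/1/avatar.png?width=64", // eligible
		"/uploads/-/system/user/avatar/1/avatar.png?width=63", // width not allow-listed
		"/uploads/-/system/user/avatar/1/avatar.gif?width=64", // extension not allow-listed
	} {
		width, ok := eligibleForScaling(raw)
		fmt.Println(raw, "->", ok, width)
	}
}
```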
NOTE:
If you are confident it is the scaler itself failing, and not an ancillary system such as GCS (where images are stored), you can quickly restore image serving functionality by toggling the `dynamic_image_resizing` feature flag off. This is an `ops` type toggle and should continue to exist even after this feature is fully rolled out.
Dashboards & Logs
In case of an incident, look at:
- The `imagescaler` component panels in the web overview dashboard, for general component health and performance
- The Thanos total requests metric, for a more direct breakdown of scaler status per environment, etc.
- Kibana logs, for detailed request logs
- The Kibana error breakdown, counting scaler errors by message
Failure modes
Generally, three outcomes are possible:
- we scaled and then served the rescaled image (good; this is a `200`)
- we failed running the scaler, but served the original (bad; this is still a `200`)
- we failed to serve anything, rescaled or otherwise (worse; this is a `500`)
The following sections describe how we might run into the last two types of failures.
Scaler failed, original was served
This can happen in two cases, outlined below. In both cases, we fail over to serving the original (usually much larger) image; this ensures we do not break functionality, but comes with client-side performance drag and higher egress traffic.
There are more scaling requests than available scalers
Context:
We currently cap the number of scalers that may execute concurrently via Workhorse’s `max_scaler_procs` config field. If we trip that threshold, we will start ignoring new scaler requests (a minimal sketch of this behavior follows at the end of this section).
Symptoms:
- Users may notice that image downloads take longer than usual
- The corresponding saturation metric panel will be degraded
- In Kibana, you will see error logs saying `too many running scaler processes (x / y)` (with x > y)
- In Thanos, `gitlab_workhorse_image_resize_requests_total` will have an elevated rate of `status="served-original"`
To understand where the extra traffic might originate from, look for request spikes on the relevant request-rate dashboards.
Actions:
We should assess whether we are merely dealing with a short burst of additional requests or whether we should consider raising the ceiling for `max_scaler_procs`. It might help to get a process listing from affected nodes and:
- See if scaler processes are getting stuck by looking at the process listing (look for `gitlab-resize-image` procs). To unclog the pipes, killing these processes might be the easiest remedy.
- See if scaler processes are finishing, but taking a long time to complete (anything above a few dozen to a hundred milliseconds is too slow). The most likely explanation is that either the node is CPU starved (image scaling is a CPU-bound operation) or that writing the scaled image back out to the client is taking a long time due to slow connection speed or other network bottlenecks.
The `max_scaler_procs` setting is configured in Workhorse’s `config.toml`. For example:

```toml
[image_resizer]
max_scaler_procs = XX
```
Note that we can always scale out to reduce pressure on this by running more Workhorse nodes.
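To illustrate how a concurrency cap like `max_scaler_procs` behaves, here is a minimal Go sketch, not Workhorse's actual implementation and with made-up identifiers: a buffered channel acts as a semaphore, and when all slots are taken the request is not scaled, so the caller falls back to serving the original image.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// ErrTooManyScalers mirrors the "too many running scaler processes" failure
// mode described above; the identifiers in this sketch are illustrative.
var ErrTooManyScalers = errors.New("too many running scaler processes")

// scalerPool caps how many scaler invocations may run concurrently,
// analogous to the max_scaler_procs setting.
type scalerPool struct {
	slots chan struct{}
}

func newScalerPool(maxProcs int) *scalerPool {
	return &scalerPool{slots: make(chan struct{}, maxProcs)}
}

// tryScale runs fn only if a slot is free; otherwise it reports saturation
// so the caller can fail over to serving the original image.
func (p *scalerPool) tryScale(fn func() error) error {
	select {
	case p.slots <- struct{}{}: // acquire a slot
		defer func() { <-p.slots }() // release it when done
		return fn()
	default:
		// All slots are busy: do not scale, serve the original instead.
		return ErrTooManyScalers
	}
}

func main() {
	pool := newScalerPool(1) // pretend max_scaler_procs = 1

	release := make(chan struct{})
	go pool.tryScale(func() error { // occupies the only slot...
		<-release // ...until we release it below
		return nil
	})
	time.Sleep(10 * time.Millisecond) // let the goroutine grab the slot

	// This request finds no free slot and fails over to the original image.
	if err := pool.tryScale(func() error { return nil }); errors.Is(err, ErrTooManyScalers) {
		fmt.Println("saturated: serving the original image instead")
	}
	close(release)
}
```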
The scaler process did not start
Context:
This means that we could not fork into `gitlab-resize-image`.
Symptoms:
- Users may notice that image downloads take longer than usual
- In Kibana, you will see error logs saying `fork into scaler process: <reason>`
- In Thanos, `gitlab_workhorse_image_resize_requests_total` will have an elevated rate of `status="served-original"`
Actions:
This is unlikely to “just happen” and your best bet is to look at logs in Kibana to understand why.
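Because the failure surface here is the process start itself, the sketch below may help when reading those logs. It is illustrative only (the binary path and function name are assumptions, and a real implementation would also wire up pipes to stream image data through the scaler): it shows how launching `gitlab-resize-image` with the standard library can fail before the scaler ever runs, which is the class of error logged as `fork into scaler process: <reason>`.

```go
package main

import (
	"context"
	"fmt"
	"os/exec"
	"time"
)

// startScaler launches the scaler binary and returns an error if the process
// could not be started at all (missing binary, permissions, resource limits).
func startScaler(ctx context.Context, binary string) (*exec.Cmd, error) {
	cmd := exec.CommandContext(ctx, binary)
	if err := cmd.Start(); err != nil {
		// Errors here never reach the scaler itself; they correspond to the
		// "fork into scaler process: <reason>" log lines mentioned above.
		return nil, fmt.Errorf("fork into scaler process: %w", err)
	}
	return cmd, nil
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()

	// Deliberately point at a binary that does not exist to trigger the failure.
	if _, err := startScaler(ctx, "/nonexistent/gitlab-resize-image"); err != nil {
		fmt.Println(err)
	}
}
```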
No image was served
Context:
This means we were entirely unable to serve an image to the client. This will always result in a `500` and is a user-facing error. Unfortunately, there are countless reasons why this might happen.
Symptoms:
- Users may see broken images or no images at all
- In Thanos, `gitlab_workhorse_image_resize_requests_total` will have an elevated rate of `status="request-failed"`
- In Kibana, request logs for the scaler will contain error messages
Actions:
This is likely to be highly contextual, but a few things to look out for:
- Are we failing to serve any data? This could be indicated by the `json.written_bytes` field in Kibana logs being 0 (a sketch of how such a byte counter works follows this list).
- Did we previously fail over to the original image and then fail to serve that, or did we scale successfully but fail to serve the rescaled image? A scaler failure should be indicated by additional error logs preceding the serving failure.
- Was there a problem accessing the image in object storage? Check if there is a problem with GCS credentials.
- Did `gitlab-resize-image` return with a non-zero exit code? Scan the logs for why this happened.
- Are clients timing out and closing the connection? This could indicate that we are taking too long to serve image data.
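As background for the `json.written_bytes` check in the first bullet, here is a small illustrative Go sketch, not Workhorse's actual code: a common way for a proxy to track how many body bytes reached the client, which is what lets you tell "failed before sending anything" apart from "failed mid-stream".

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// countingResponseWriter wraps an http.ResponseWriter and counts the body
// bytes actually handed to the client, analogous to a written_bytes log field.
type countingResponseWriter struct {
	http.ResponseWriter
	written int64
}

func (w *countingResponseWriter) Write(p []byte) (int, error) {
	n, err := w.ResponseWriter.Write(p)
	w.written += int64(n)
	return n, err
}

func handler(w http.ResponseWriter, r *http.Request) {
	cw := &countingResponseWriter{ResponseWriter: w}

	// ... serve the (rescaled or original) image through cw here ...
	fmt.Fprint(cw, "fake image bytes")

	// written == 0 would mean the client received nothing at all, i.e. the
	// "are we failing to serve any data?" case from the list above.
	fmt.Printf("written_bytes=%d for %s\n", cw.written, r.URL.Path)
}

func main() {
	// Exercise the handler without a real network listener.
	req := httptest.NewRequest(http.MethodGet, "/uploads/-/system/user/avatar/1/avatar.png?width=64", nil)
	handler(httptest.NewRecorder(), req)
}
```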