Recovering from deploy failures
In this section, "retry" means "use GitHub Action's retry button on the failed run".
Build failures
You might need to destroy the Fly builder app. It'll get auto-created again when you retry, which is what you should do after destroying the builder app.
Docker failures
Just retry. It's fine. :)
Release command failures
Just retry. It's fine. :)
Machine update failures
Start by surveying the scene, to see how many machines are on the new image vs the old one, or in replacing
vs failed
vs created
status.
Total machine update failure, i.e. the release command succeeded but no Machines were updated at all
If you're here, the app is probably online but no longer processing background jobs (because all the Sidekiq processes were instructed to enter quiet mode during the release command).
Handle this by rebooting one of the worker_autoscale machines. That should be enough to start bringing machines back online.
Once you've verified that the app is doing work again, wait for it to catch up on the run backlog, and then retry the deploy.
A minority of machines were successfully updated
Manually redo the deploy.
Do this using a CLI deploy, using the Docker image URI from the build step.
A majority of machines were successfully updated
Manually update the rest of the machines.
Start by examining fly m list -a $FLY_APP_NAME
, and build a list of machine IDs that are stuck on the old image.
For each one, do something like this:
Sometimes a machine will get stuck and you'll need to outright destroy it
fly m destroy MACHINE_ID
Add --force
if the machine is stubborn and wonβt stop.
and then use fly scale count
to scale back up to the desired machine count. Search fly scale count in the internal slack and you'll see example usage.
Last updated