ClojureVerse Report: August 22 downtime / Planned changes to infra

Hello fellow members!

Clojureverse.org was down yesterday for ~26 hours. Super sorry for the downtime! :pray:

TLDR: this seems to be a docker networking issue. We are not sure about the root cause at this point, but we got things working again with a workaround. We are going to migrate to better infra and add SMS alerts in the coming few days.

What happened?

As you might know - clojureverse.org is using Discourse; and we use the official docker containers provided by them.

Unfortunately it has been a while since we updated our machines. Due to a curl/network error the docker container ended up crashing and then was not able to recover properly.

We do have alerts set up on uptimerobot.com but did not have SMS alerts, and with yesterday being Sunday we completely missed them! Thanks to everyone who patiently informed us in the #clojuverse-ops channel in Clojurians :vulcan_salute:

The next logical step was to try and start a fresh new container. So we tried to rebuild the discourse app, which destroys the old container, bootstraps, and starts a fresh one.

During the rebuild process we got the following error:

cd /pups && git pull && git checkout v1.0.3 && /pups/bin/pups --stdin
fatal: unable to access 'https://github.com/discourse/pups.git/': Could not resolve host: github.com

Turns out github.com is reachable from the machine just fine, but seemingly not from inside the container.
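For anyone who wants to check for the same host-versus-container split on their own machine, here is a minimal sketch of one way to do it. It is not what we ran during the incident. It assumes the `docker` CLI is on the PATH and that the `busybox` image (an arbitrary choice for illustration) can be pulled or is already present; the function names are made up for this example.

```python
import socket
import subprocess


def host_can_resolve(hostname):
    """Return True if the host's own resolver can look up the name."""
    try:
        socket.getaddrinfo(hostname, 443)
        return True
    except socket.gaierror:
        return False


def container_check_command(hostname, image="busybox"):
    """Build a docker command that runs nslookup inside a throwaway container.

    The busybox image is an assumption; any image that ships nslookup works.
    """
    return ["docker", "run", "--rm", image, "nslookup", hostname]


def container_can_resolve(hostname, image="busybox", timeout=60):
    """Return True if the lookup inside the container exits with status 0.

    A missing docker binary or a timeout is reported as False, so a False
    result here is a hint to investigate, not proof of a DNS problem.
    """
    try:
        result = subprocess.run(
            container_check_command(hostname, image),
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except (FileNotFoundError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0


def diagnose(host_ok, container_ok):
    """Turn the two results into a short human-readable summary."""
    if not host_ok:
        return "DNS fails on the host itself; fix host networking first"
    if not container_ok:
        return "DNS works on the host but not inside the container; suspect docker networking"
    return "DNS works on both the host and the container"
```

Calling `diagnose(host_can_resolve("github.com"), container_can_resolve("github.com"))` on a machine in the state described above would be expected to report the host-only case, which is what we saw.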

The most obvious culprit seemed to be that we were on:

  1. An old docker version: docker did not get any upgrades because docker changed the domain name of their apt repositories
  2. Ubuntu 16.04 (which is EOL)

Then we upgraded to 18.04, and upgraded docker. But we still couldn’t reach github.com (or any other host, it seems) from inside the container.

There are several reports of the same issue, but we haven’t found a good explanation/solution yet: After upgrade, docker cannot communicate with the outside world - #20 by supermathie - support - Discourse Meta

We used a workaround to get it to work for the time being: Could not resolve host: github.com for SamSaffron/pups.git - #9 by rcauvin - support - Discourse Meta

Where do we go from here?

Fortunately we have had automatic postgres backups scheduled and running since the beginning, so data is safe in any event :grin: :sharkdance:

Clojurians-log and our other services are hosted at Exoscale, but clojureverse is hosted at DigitalOcean for historical reasons.

So we are going to use this opportunity to rebuild the server on a newer Ubuntu version (20.04), and possibly migrate from DigitalOcean to Exoscale.

We are also setting up SMS downtime alerts so that we can take faster action to help fix issues like these in the future!
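To give an idea of the alerting logic, here is a rough sketch of a check that only raises an alarm after several failed probes in a row, so a single blip does not wake anyone up. It is not our actual setup (we use UptimeRobot for monitoring). The threshold of 3 and the `send_sms` callback are hypothetical; in practice the callback would wrap whatever SMS provider is chosen.

```python
import urllib.error
import urllib.request


def site_is_up(url, timeout=10.0):
    """Return True if the URL answers with a 2xx or 3xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 400
    except (urllib.error.URLError, OSError):
        # Covers DNS failures, refused connections, timeouts and HTTP errors.
        return False


def should_alert(recent_checks, failures_needed=3):
    """Return True when the last `failures_needed` checks all failed."""
    if len(recent_checks) < failures_needed:
        return False
    return not any(recent_checks[-failures_needed:])


def check_and_alert(url, history, send_sms, failures_needed=3, check=site_is_up):
    """Run one probe, record it in `history`, and alert if the threshold is hit.

    `send_sms` is a hypothetical callback taking a message string.
    Returns True if an alert was sent on this call.
    """
    history.append(check(url))
    if should_alert(history, failures_needed):
        send_sms(f"{url} has failed {failures_needed} checks in a row")
        return True
    return False
```

Run from a scheduler every few minutes with a shared `history` list, this would keep alerting for as long as the site stays down; a real setup would likely add a cooldown so the same outage does not send a message on every probe.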

Thank you and apologies for the issue :pray:

UptimeRobot stats
