I would like to move our system in this direction:
If you only have one big server, how would you perform a release of a new version?
The most straightforward way would be to accept a few seconds of downtime, which would be okay for us. More problematic is that we have some long-running jobs (up to 15 minutes) which should not be interrupted. Our system is a multi-tenant SaaS system, so we could move customers over to the new version as soon as their long-running jobs have finished. However, running the old and the new version in parallel again feels a bit like a distributed system. At least you could use the local file system and file locks for coordination purposes.
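The file-lock idea could look roughly like the following. This is only a sketch in Python (used here for brevity, not because the system in question is written in it), assuming a Unix host and a hypothetical per-tenant lock file: a long-running job holds an exclusive lock while it runs, and the release step moves a tenant to the new version only once it can take that same lock. The names try_lock and unlock are made up for illustration.

```python
import fcntl
import os
import tempfile


def try_lock(path):
    """Try to take an exclusive, non-blocking lock on `path`.

    Returns the open file object (keep it open to keep the lock),
    or None if someone else currently holds the lock.
    """
    f = open(path, "a")
    try:
        fcntl.flock(f, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        # Another open file description already holds the lock.
        f.close()
        return None
    return f


def unlock(f):
    """Release the lock and close the file."""
    fcntl.flock(f, fcntl.LOCK_UN)
    f.close()


if __name__ == "__main__":
    # Hypothetical lock file for one tenant.
    lock_path = os.path.join(tempfile.mkdtemp(), "tenant-42.lock")

    job = try_lock(lock_path)          # the old version starts a long job
    if try_lock(lock_path) is None:    # the release step checks the tenant
        print("tenant busy, keep it on the old version for now")
    unlock(job)                        # the job finishes
```

The kernel drops the lock automatically if the holding process dies, so a crashed job would not block a release forever.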
A solution that has worked fine for me for many years (and many many releases) is running the processes in a way where downtime is minimized.
I start my main “application” before starting its “connectors”. That means starting the application without anything that would take up network ports, such as an HTTP server. Say I want to replace v1 with v2. I start v2 as normal. Once all its services are started and the process is ready to serve requests, it signals v1 to shut down its HTTP server part and “release” the TCP port. v2 takes over the port almost immediately, leaving only a minimal window where no requests are answered (a couple of ms at most). v1 is then no longer reachable, but unfinished tasks can finish gracefully. Once everything is done, the v1 process just shuts down.
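The takeover step can be sketched as a bind-and-retry loop. The following is a minimal Python illustration, not the actual Clojure implementation described here: bind_with_retry, its parameters, and the default values are all assumptions. The new process calls it right after signaling the old one, and the call returns as soon as the old listener has released the port.

```python
import socket
import time


def bind_with_retry(port, host="127.0.0.1", timeout=5.0, interval=0.01):
    """Keep trying to bind and listen on host:port until it works.

    Raises the last OSError if the port is still taken after `timeout`
    seconds. All names and defaults here are illustrative.
    """
    deadline = time.monotonic() + timeout
    while True:
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        # Lets us rebind even if old connections linger in TIME_WAIT.
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        try:
            sock.bind((host, port))
            sock.listen()
            return sock
        except OSError:
            # Most likely "address already in use": the old process
            # has not closed its listening socket yet.
            sock.close()
            if time.monotonic() >= deadline:
                raise
            time.sleep(interval)
```

With a retry interval of 10 ms, the window in which nobody accepts connections stays in the range of a few milliseconds, independent of how long the rest of the application took to start.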
I have done it this way since I didn’t want to rely on anything external to handle this. Arguably it would be better to use a load-balancer of some kind in front and have that just switch to the different process. Or use the OS firewall, but I liked a Clojure-only solution.
The actual thing looks a bit more complicated, but not much. In my case (signal-other-to-stop) just touches a tmp/shutdown.txt file, and (should-we-shutdown?) checks that file for modifications.
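A rough Python rendering of those two functions, for readers who want to see the mechanics. This is a sketch under assumptions, not the original Clojure code: the function names are translated from the ones mentioned above, and ShutdownWatcher and read_mtime are invented helpers. Each process remembers the file's modification time when it starts and treats any later change as a request to stop.

```python
import os
from pathlib import Path

# Path taken from the post; relative to the process working directory.
SHUTDOWN_FILE = Path("tmp/shutdown.txt")


def read_mtime(path):
    """Return the modification time in nanoseconds, or None if missing."""
    try:
        return os.stat(path).st_mtime_ns
    except FileNotFoundError:
        return None


def signal_other_to_stop(path=SHUTDOWN_FILE):
    """Touch the shutdown file: create it if needed, bump its mtime."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.touch()


class ShutdownWatcher:
    """Remembers the file's mtime at startup and reports later changes."""

    def __init__(self, path=SHUTDOWN_FILE):
        self.path = Path(path)
        self.baseline = read_mtime(self.path)

    def should_we_shutdown(self):
        # True once the file was created or touched after we started.
        return read_mtime(self.path) != self.baseline
```

The old process would create a ShutdownWatcher at startup and poll should_we_shutdown() every so often; the new process calls signal_other_to_stop() once it is ready. A touch that lands within the file system's timestamp granularity of the baseline read would go unnoticed, which should not matter when releases are minutes or days apart.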
I usually don’t have long-running HTTP requests, and websockets just get killed and clients reconnect, so the downtime is usually a couple of msecs at most. How long my application takes to start doesn’t matter, since startup doesn’t affect downtime. Starting the HTTP server itself is very quick.
Thanks a lot for your answer. This is a great approach. Besides the port binding, it also solves the cold-start challenges of the JVM and Clojure. I especially like that it uses the power of the local file system to signal the other process to shut down.
I also asked the same question on Reddit, where I provided a lot more details about our system in the answers. To avoid having to deal with all the complexities of a multi-tenant system, I will try to start one Kubernetes pod per customer. Sounds ridiculous, but it will allow us to get rid of many problems of a distributed and multi-tenant system. However, my dearest wish would still be to get rid of Kubernetes. But I need to accept the fact that Google Kubernetes Engine is a good and robust solution for resource management and provisioning (which our multi-tenant system needs), and I don’t want to implement a solution for this complex problem by myself.
Yeah, I’m way behind the times and haven’t touched a single “Cloud” or Cluster solution.
I just have multiple of these types of processes running on a single server, together with a Postgres DB and an nginx proxy. No cloud, container or other stuff in sight. Not gonna be breaking any throughput records, but for the most part the server (as in actual hardware) is pretty bored.
I would appreciate you sharing more details about what tools you are using for such a setup, such as the Linux distro you are using, the process supervisor, log file handling, database backup, cron jobs, etc.
I would even prefer to run without any container. Still, for our current system, there are many nitty-gritty details that you would really like to freeze in an immutable container image. For example, we take screenshots of designs made in our editor using a headless Chrome. We generate incorrect screenshots if some font types are missing or other Linux details are wrong.
I already tried Nix and Guix, but they are too complex for me when you need to create a package yourself. Therefore, creating a container with a Dockerfile is still my preferred solution to ensure that the dev environment, staging, and production are on par. We use Ubuntu as the Docker base image on all our development machines. I also started a collection of reusable install scripts for the Dockerfile.
FWIW if I were to set up a new project today, it would likely be the first candidate to try some kind of container setup. I have been a skeptic for a rather long time now, but do eventually want to give it a try. Not everything in that area seems terrible.
However, I’m a strong believer in “never change a running system”, so I won’t experiment with any of the existing setups.
The server is minimal Debian LTS, which is actually past its lifetime nowadays. But again … “never change a running system”, so dreading dist-upgrading that. Just using basic built-in OS tools for most things “maintenance”.
DB uses streaming backup using barman, which I would not do again. To be honest, I have no clue how I’d even recover a DB with that. It is still running and not complaining, but how the heck should I know it actually works? In addition, a cron job does regular full DB dumps, which are also copied offsite via scp. The DB is tiny though, less than 2GB in total. Losing a couple of hours is tolerable.
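For illustration, the dump-and-copy job could be as small as the script below. This is a hedged sketch and not the actual cron job: the database name, dump directory and scp target are placeholders, and dump_commands and run_backup are invented names. It relies only on pg_dump's custom archive format and a plain scp copy.

```python
import subprocess
from datetime import datetime, timezone
from pathlib import Path


def dump_commands(db_name, dump_dir, remote, now=None):
    """Build the pg_dump and scp command lines for one backup run.

    `remote` is an scp target such as "user@host:/some/dir/".
    All arguments are placeholders for this sketch.
    """
    now = now or datetime.now(timezone.utc)
    stamp = now.strftime("%Y%m%dT%H%M%SZ")
    dump_path = Path(dump_dir) / f"{db_name}-{stamp}.dump"
    # Custom format is compressed and can be restored with pg_restore.
    pg_dump = ["pg_dump", "--format=custom", f"--file={dump_path}", db_name]
    scp = ["scp", str(dump_path), remote]
    return pg_dump, scp


def run_backup(db_name, dump_dir, remote):
    """Run the dump, then the offsite copy; raise if either step fails."""
    for cmd in dump_commands(db_name, dump_dir, remote):
        subprocess.run(cmd, check=True)
```

Because check=True makes a failing step raise, cron would report an error instead of silently producing no backup. Whether a dump is actually usable only shows when it is restored somewhere, for example into a scratch database with pg_restore.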
Basically the entire server is frozen in time from when I set it up. I locked everything down and only do security updates. I’m at most beginner-level DevOps and there is no valuable data to steal. These are all basically CMS setups with all data being public anyways. No clue if the server has been hacked before, but it seems fine. I’d probably be a bit more paranoid if there was actual “valuable” data on this.
Thanks a lot for sharing the details. I’m also a strong believer in “never change a running system”, which was one of the reasons why I invested the time to learn the basics of Nix and Guix. Their main selling point is “reproducible builds”. When I empty my Docker build cache and rebuild the Dockerfile, I never know exactly what I will get, which is scary. With Nix and Guix, you can also build container images. Although Nix has the bigger ecosystem, I felt more at home with Guix since it uses Guile Scheme as its programming language. At the core, Guix is using parts of Nix anyway. Regrettably, I haven’t yet found enough time to improve my Guix or Nix skills. But as mentioned, Guix and Nix perfectly cater to the idea of “never change a running system”.