REPLs on production deployments


#1

I’ve heard of people using REPLs on production sites. Given how handy figwheel-style live updates would be, I can see the allure. Still, I’m not sure that updating so easily is always a good thing, as it may propagate sloppiness and mistakes, and I can’t help but worry about security issues. Is anyone here using REPLs on production sites, or has anyone done so in the past, and would you share your lessons learned?


#2

We regularly run bare socket REPLs in our production processes. The ports are not exposed on the servers, so you either have to be on the server itself or be connected over VPN and using an ssh tunnel. In terms of access, that makes the REPL as secure as any other service running on those servers.

Having a REPL means you have full and absolute control of the running JVM process so, of course, there’s a serious aspect of “caveat programmer” here – but it’s the usual tradeoff of power vs. responsibility.

We mostly use these REPLs for debugging: being able to execute code in the context of the production process can make it much easier to debug a problem. We exercise a lot of SQL queries this way when we’re looking at performance issues or when we need to “sanity check” the state of data in the production database (so, yes, we use Clojure and a REPL to do what most places would use their DB’s command line tooling for!).

We apply “data fixes” this way too, in the event that a bug introduced data problems – we will often fix it “on-the-fly” via the REPL rather than doing it as a data migration as part of the next build/deployment.

We occasionally apply a live patch to a running process if we feel it is low-risk and we don’t want to wait for the next build/deployment cycle. Most of our processes can be deployed automatically, and we can easily run as many production deployments as we want every day. However, we have a few legacy applications with much more complex, manual deployment processes, and for those it can sometimes be worth the (low) risk to apply a live patch, especially if we want zero downtime.
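To make the idea of a live patch concrete, here is a minimal sketch. The namespace `my.service.pricing`, the functions `total-with-shipping` and `invoice`, and the bug itself are all hypothetical stand-ins, not code from our applications. The point is only that re-evaluating a `defn` in the right namespace swaps the definition that callers see.

```clojure
;; Hypothetical deployed code: a namespace with a bug in one function.
(ns my.service.pricing)

(defn total-with-shipping
  "Deployed version. Bug: the shipping fee (in cents) is added twice."
  [amount]
  (+ amount 500 500))

(defn invoice
  "Caller that reaches total-with-shipping through its var."
  [amount]
  {:total (total-with-shipping amount)})

;; --- What you would type at the production REPL ---

;; Switch to the namespace that owns the function to patch.
(in-ns 'my.service.pricing)

;; Re-evaluate the corrected definition. This rebinds the var's root value,
;; so existing callers pick up the fix on their next call.
(defn total-with-shipping
  "Patched version: the shipping fee is added once."
  [amount]
  (+ amount 500))

(invoice 1000) ; => {:total 1500}
```

One caveat: if the code was compiled with direct linking enabled, callers may be bound to the old definition and will not see the redefinition, so this kind of patch assumes direct linking is off for the code in question. The patch also lives only in the running process, so the same fix still has to go into source control and the next build.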

Those legacy applications are non-Clojure, by the way, but include Clojure’s JARs on their classpath, which is how we start the REPLs.

Given that you only need JVM options at startup for a process to spawn a REPL on any given port, anyone with shell access to stop/start those processes can get a REPL going and have that level of access – so locking down shell access and port access is your line of defense.
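For reference, a rough sketch of the two ways to get a socket REPL going. The JVM option is shown as a comment, and the programmatic version uses `clojure.core.server/start-server`. The server name `"prod-repl"`, the helper names `start-repl!` and `stop-repl!`, and port 5555 are illustrative assumptions, not our actual configuration.

```clojure
(require '[clojure.core.server :as server])

;; Option 1: no code change, just a JVM option at process startup:
;;   -Dclojure.server.repl="{:port 5555 :accept clojure.core.server/repl}"

;; Option 2: start the same kind of server from code.
(defn start-repl!
  "Starts a socket REPL on the given port, bound to the loopback address
  so it is only reachable from the host itself (or through an ssh tunnel).
  Returns the underlying java.net.ServerSocket."
  [port]
  (server/start-server {:name    "prod-repl"   ; hypothetical server name
                        :address "127.0.0.1"   ; loopback only
                        :port    port
                        :accept  'clojure.core.server/repl}))

(defn stop-repl!
  "Stops the socket REPL started by start-repl!."
  []
  (server/stop-server "prod-repl"))
```

Binding to the loopback address is what keeps the port unexposed: anything beyond that (VPN, ssh, firewall rules) is handled outside the JVM.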


#3

We have a similar setup. REPLs can only be accessed when ssh-ed into the host, and we don’t even allow ssh tunneling in our case.

Our main services also don’t have REPLs on them. Instead, we have a launch script that runs a duplicate process of our service on the same host, in the same environment, bypassing our normal init and starting straight into a REPL.

This means we can’t accidentally break things in the active prod process, but we still have an environment as close to prod as possible that we can mess around in to help us debug.


#4

I’m currently running a socket REPL on a low-traffic, single-instance, live app. The app is under fairly heavy evolution, but remains usable and live at all times. It runs on a Linux virtual server with a tight firewall and is proxied behind Nginx on port 443. The REPL port, 5555, is only available via an SSH tunnel, which is plenty of security for this particular app.

I used to run it via a CI-built and CI-deployed uberjar in a Docker container, but due to the fairly fast development and the desire for instant updates, I now just run it using a systemd unit that runs clojure -m my.service.main in the code directory. This means I can enable or disable the socket REPL easily from the CLI args, can git pull and then immediately require the changed namespaces via the REPL, and can connect my IDE to the REPL and send code live straight from the IDE.
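Roughly, toggling the REPL from the CLI args can look like the sketch below. The `--repl-port` flag and the `parse-repl-port` helper are made up for illustration, and the real entry point does more than this. Only `clojure.core.server/start-server` is the actual library call.

```clojure
(ns my.service.main
  (:require [clojure.core.server :as server]))

(defn parse-repl-port
  "Returns the port given after a --repl-port flag (hypothetical flag name),
  or nil when the flag or its value is absent."
  [args]
  (when-let [port (second (drop-while #(not= "--repl-port" %) args))]
    (Long/parseLong port)))

(defn -main
  [& args]
  ;; Only open a socket REPL when a port was passed on the command line.
  (when-let [port (parse-repl-port args)]
    (server/start-server {:name   "repl"
                          :port   port
                          :accept 'clojure.core.server/repl}))
  ;; ... start the actual service here ...
  )
```

With this shape, leaving the flag out of the systemd unit's command line means the process comes up with no REPL port open at all. After a git pull, something like (require 'my.service.some-ns :reload) at the REPL picks up the new code, where my.service.some-ns is again a placeholder name.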

One of the things I do most often is run queries and transactions on the production Datomic DB to match changes happening in the dev DB. All of the changes made are already defined in the dev code branch, so it’s mainly a matter of having a more “realtime” workflow for pushing changes live. All of this makes for a much more enjoyable “progressive” situation. It works fine for a small team of two devs, but I can imagine it being inappropriate for a large team, especially one with a separate dev ops team and a load-balanced, multi-node, high-traffic system. However, even there it can be indispensable to have REPL access somewhere for debugging, troubleshooting, or running investigative queries.