Glad you enjoyed the talk!
Yes, we do run Socket REPLs in several apps in production. We have machinery in our service startup scripts that looks for a dot-file next to the uberjar and reads it to get JVM options to run the uberjar. That makes it easy to enable/disable a Socket REPL since we can just update the dot-file and restart the service. No code is needed in the apps.
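For reference, the JVM option in question usually has the form -Dclojure.server.repl="{:port 5555 :accept clojure.core.server/repl}". The sketch below shows roughly what that option does, written out as Clojure code. The port 5555, the server name "debug-repl", and the helper names start-debug-repl and stop-debug-repl are placeholders for illustration, not anything from our setup.

```clojure
(require '[clojure.core.server :as server])

(defn start-debug-repl
  "Starts a Socket REPL on the given port (hypothetical helper).
   With no :address supplied, the server binds to the loopback address."
  [port]
  (server/start-server {:name   "debug-repl"   ; placeholder server name
                        :port   port
                        :accept 'clojure.core.server/repl}))

(defn stop-debug-repl
  "Stops the Socket REPL started by start-debug-repl."
  []
  (server/stop-server "debug-repl"))

(comment
  ;; Roughly equivalent to the JVM option shown above:
  (start-debug-repl 5555)
  (stop-debug-repl))
```

Because the listener defaults to the loopback address, a setup like this is only reachable from the machine itself unless something forwards the port.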
We use a VPN to connect to the DMZ that contains our servers and then stand up an ssh tunnel for port forwarding so that we can connect to a port on localhost – which then ferries REPL commands/output back and forth.
We don’t have REPLs in every process – we have over a dozen services running on each server in our cluster – but we do have a few that are always running.
We generally only do “read” operations via such REPLs, for debugging, and nearly all our services are built with AOT and direct-linking so redefining individual functions is not feasible (because calls are direct-linked to the original function definition). We do occasionally patch things in the (MySQL/Percona) database via a REPL.
We talk about removing direct-linking from time to time because the benefit of being able to patch the running process (by sending one or more new defn forms over the REPL) might outweigh the downside of slower code (i.e., calls not direct-linked). But, so far, we haven’t decided to change from direct-linking.
We have a couple of processes that run Clojure from source, because the process is legacy code (not Clojure), and it’s “easier” to do that rather than building libraries and dealing with the legacy code’s deployment setup. For those processes, a full read/write REPL experience is possible and we do – occasionally – patch those legacy apps by sending new defn forms over the REPL.
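As a minimal sketch of what such a patch looks like when code is not direct-linked (the default when running from source), consider the following. The function names retry-limit and describe-retries are invented for illustration.

```clojure
;; Original definitions, as loaded from source at startup.
(defn retry-limit [] 3)

(defn describe-retries []
  ;; Without direct-linking, this call goes through the Var on every invocation.
  (str "retrying up to " (retry-limit) " times"))

(def before-patch (describe-retries))

;; The "patch": a new defn form sent over the REPL rebinds the Var...
(defn retry-limit [] 5)

;; ...so existing callers pick up the new definition without being recompiled.
(def after-patch (describe-retries))
```

With direct-linking enabled, the call inside describe-retries would be compiled as a direct call to the original function, which is why this kind of patching is not feasible for the AOT-compiled services.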
The downside of patching remote code via the REPL is that you lose those changes if the process is restarted. Our overall deployment process (for all services except the two legacy apps) is automated so we can get an updated version of an app – with extra instrumentation via logging or tap> – into our production cluster within 15-20 minutes.
The nice thing about using tap> for debugging is that a) you can leave it in your production code and b) you can add/remove a tap watcher function dynamically via the REPL as needed.
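A rough sketch of that workflow, assuming a made-up process-order function and a debug-tap watcher that collects values into an atom (none of these names come from our code):

```clojure
(defn process-order
  "Hypothetical production function with a tap> call left in place.
   When no tap is registered, the tapped value is simply discarded."
  [order]
  (tap> {:event :process-order :order order})
  (assoc order :status :processed))

;; Values received by the watcher while it is attached.
(def seen (atom []))

(defn debug-tap
  "Hypothetical tap watcher: records every tapped value."
  [x]
  (swap! seen conj x))

;; Evaluated over the REPL when you want to start watching.
(add-tap debug-tap)

(comment
  ;; Taps are delivered asynchronously, so @seen fills in shortly after the call.
  (process-order {:id 42})
  @seen
  ;; Evaluated over the REPL when you are done debugging.
  (remove-tap debug-tap))
```

Since add-tap and remove-tap take the function itself, it helps to give the watcher a name with defn so the same function can be removed later.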
There are no safeguards in any of this: we trust our developers not to screw up production but it can occasionally happen and the fix is just to restart the process (since it has the original code in the uberjar or, in the case of our legacy apps, on disk).
One of the things we might do in the future is to start Socket pREPLs instead of plain REPLs since that would allow us to connect Reveal in a daisy chain to production so that we can display results in Reveal running locally, even for remote servers. See Vlad’s post about remote pREPLs for more details. The only thing stopping me from doing this right now is that my editor can’t connect to a pREPL