Remote Debugging

Hola!

I was inspired by an excellent REPL Driven Development talk by Sean Corfield to dig deeper into my Clojure toolkit. I love my new toys, like Reveal, tap>, and comment (tools.deps yet to be explored).

One thing that has me scratching my head, however, is remote debugging. Sean mentions that he often interacts with a running server. The longer I think about it, the more scenarios I see where this would be extremely useful, e.g. poking at specific functions, looking into the system’s config, or even changing logging levels on the fly.

I’m really curious what your sample Remote Debugging session looks like, and what is the (safe) limit of such live system interactions.
Would you expose a REPL in your Prod env?
Would you add debug/tap>/log statements on the fly? If yes, how would you do it in a safe way?
What would be your safeguard in case you get too far with your changes?

Cheers,
Maciej

Glad you enjoyed the talk!

Yes, we do run Socket REPLs in several apps in production. We have machinery in our service startup scripts that looks for a dot-file next to the uberjar and reads it to get JVM options to run the uberjar. That makes it easy to enable/disable a Socket REPL since we can just update the dot-file and restart the service. No code is needed in the apps.
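For anyone who wants to try this locally, here is a minimal sketch. It assumes the JVM option in question is the standard `clojure.server.<name>` system property, which the post does not spell out. The server name "debug", port 5555, and the two helper functions are made up for illustration; the programmatic form is only there so it can be tried from an ordinary REPL.

```clojure
(require '[clojure.core.server :as server])

;; JVM-option form, the kind of line a dot-file of JVM options could hold
;; (name "debug" and port 5555 are illustrative, not from the post):
;;   -Dclojure.server.debug="{:port 5555 :accept clojure.core.server/repl}"

;; Programmatic equivalent, handy for experimenting locally.
(defn start-debug-repl!
  "Starts a Socket REPL bound to loopback only; returns the ServerSocket.
  Pass 0 as the port to let the OS pick a free one."
  [port]
  (server/start-server {:name    "debug"
                        :address "127.0.0.1"
                        :port    port
                        :accept  'clojure.core.server/repl}))

(defn stop-debug-repl!
  "Stops the server started by start-debug-repl!."
  []
  (server/stop-server "debug"))
```

Binding to 127.0.0.1 fits the tunnel setup described in the thread: nothing listens on a public interface, and access goes through port forwarding.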

We use a VPN to connect to the DMZ that contains our servers and then stand up an ssh tunnel for port forwarding so that we can connect to a port on localhost – which then ferries REPL commands/output back and forth.

We don’t have REPLs in every process – we have over a dozen services running on each server in our cluster – but we do have a few that are always running.

We generally only do “read” operations via such REPLs, for debugging, and nearly all our services are built with AOT and direct-linking so redefining individual functions is not feasible (because calls are direct-linked to the original function definition). We do occasionally patch things in the (MySQL/Percona) database via a REPL.

We talk about removing direct-linking from time to time because the benefit of being able to patch the running process (by sending one or more new defn forms over the REPL) might outweigh the downside of slower code (i.e., calls not direct-linked). But, so far, we haven’t decided to change from direct-linking.

We have a couple of processes that run Clojure from source, because the process is legacy code (not Clojure), and it’s “easier” to do that rather than building libraries and dealing with the legacy code’s deployment setup. For those processes, a full read/write REPL experience is possible and we do – occasionally – patch those legacy apps by sending new defn forms over the REPL.

The downside of patching remote code via the REPL is that you lose those changes if the process is restarted. Our overall deployment process (for all services except the two legacy apps) is automated so we can get an updated version of an app – with extra instrumentation via logging or tap> – into our production cluster within 15-20 minutes.

The nice thing about using tap> for debugging is that a) you can leave it in your production code and b) you can add/remove a tap watcher function dynamically via the REPL as needed.
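A small sketch of that workflow, using only clojure.core (`tap>`, `add-tap`, `remove-tap`). The `process-order` function, the `tapped` atom, and the event shape are hypothetical stand-ins for application code.

```clojure
;; Collects tapped values so they can be inspected from the REPL.
(defonce tapped (atom []))

(defn record-tap
  "Tap watcher: appends every tapped value to the `tapped` atom."
  [x]
  (swap! tapped conj x))

(defn process-order
  "Hypothetical production function. The tap> call stays in the code;
  it does nothing visible unless a watcher has been added."
  [order]
  (tap> {:event :order/processed :order order})
  (assoc order :status :done))

(comment
  ;; Sent over the remote REPL when debugging starts:
  (add-tap record-tap)
  ;; ...let traffic flow, or call the function by hand:
  (process-order {:id 1})
  ;; Taps are delivered asynchronously, so check shortly afterwards:
  @tapped
  ;; And when finished:
  (remove-tap record-tap))
```

Note that `remove-tap` needs the same function object that was passed to `add-tap`, which is why the watcher is a named function rather than an anonymous one.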

There are no safeguards in any of this: we trust our developers not to screw up production but it can occasionally happen and the fix is just to restart the process (since it has the original code in the uberjar or, in the case of our legacy apps, on disk).

One of the things we might do in the future is to start Socket pREPLs instead of plain REPLs since that would allow us to connect Reveal in a daisy chain to production so that we can display results in Reveal running locally, even for remote servers. See Vlad’s post about remote pREPLs for more details. The only thing stopping me from doing this right now is that my editor can’t connect to a pREPL :)


Wonderful response. Thank you!

“The downside of patching remote code via the REPL is that you lose those changes if the process is restarted.”

I got into the habit of keeping a reproducible patch.clj file or ns that consolidates any temporary patches I’ve made to a live or otherwise deployed system (e.g. where everything is bundled in a jar). I put the redefinitions in there, using in-ns to hop around and make changes inside each namespace as if I were at the REPL. This tends to suffice until I can consolidate these live changes upstream into a formal release. I don’t do it often, but it helps with reproducibility within these constraints. Direct linking limits this approach.
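Roughly what such a file could look like, as a sketch. The namespace `myapp.pricing` and its functions are invented for illustration, and the first half only stands in for code that would already be loaded from the jar.

```clojure
;; ---- stand-in for deployed code (hypothetical namespace) ----
(ns myapp.pricing)

(defn discount
  "Original behaviour: never gives a discount."
  [total]
  0)

(defn final-price
  "Calls discount through its var, so a later redefinition is picked up
  (this assumes the code was NOT compiled with direct linking)."
  [total]
  (- total (discount total)))

;; ---- patch.clj: sent over the REPL or loaded with load-file ----
(in-ns 'myapp.pricing)

;; Temporary fix, to be folded into the next formal release.
(defn discount
  [total]
  (if (> total 100) 10 0))

;; Hop back out so later forms are not evaluated in the patched namespace.
(in-ns 'user)
```

Because the patch is just a file of forms, it can be re-applied after a restart, which covers the "you lose those changes" problem until the real release goes out.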