Partially re-initializing a running system (with Integrant)

I’m using Integrant to manage my system, and I’m running into situations where a (stateful) component needs to be reinitialized (the issues I’ve had in particular are with Cassandra connections, but it could apply to many things), along with the components that depend on it (for instance, I have a Reitit interceptor that injects the Cassandra connections into requests, to be used later by the handlers).

During development, I can just do (reset), and everything is reinitialized, but I’m wondering what the best way would be to handle these kinds of situations while the service is running, so that it can attempt to re-create the broken connections and propagate them to where they need to go.

Any suggestions?

This sounds like the kind of thing that Erlang’s/OTP’s supervision trees are meant to solve: Supervision Principles

Coming at it from another angle: typically I’ve seen DB connections managed via a connection pool that can detect when connections are obviously broken. Components that require a connection get one by calling “get connection” on the connection manager, which returns one from the pool. If a connection is broken in a way the manager can detect, it removes that connection from the pool and tries to make a new one. I think the change in this case would be for your components, like the Reitit interceptor, to take a connection manager as a dependency rather than a connection.
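
Something like this sketch, say, where the manager is modeled as a plain function (`get-connection` here is a hypothetical zero-arg stand-in, not a real driver or pool API):

```clojure
;; The interceptor depends on a connection *manager* rather than a fixed
;; connection; `get-connection` returns a live connection from the pool,
;; recreating broken ones behind the scenes.
(defn conn-interceptor [get-connection]
  {:name ::inject-connection
   :enter (fn [ctx]
            (assoc-in ctx [:request :conn] (get-connection)))})
```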

Did you look at Integrant’s suspend/resume?

That sounds like a good option, I’ll have a look at the docs and see whether it’s possible to go that way.

I did, but from what I understood from the docs, they’re meant for dev-time use. Also (and maybe I misunderstood), I believe they wouldn’t handle the cascading reinitialization of the “downstream” components. I’ll have another look at that…

I’m wondering how you’d go about implementing that in Clojure… I could wrap the Cassandra connections (technically, Session objects) in a small API that would automatically recreate them if they break, but that would imply explicit state (say, keeping the connections in an atom, and swapping them out for fresh ones if they break), which is what I was trying to avoid in the first place… I’m probably missing an obvious angle to this.

I was looking into these concepts of supervision trees and robust distributed systems just this summer (I read Armstrong’s PhD thesis on Erlang), and I wanted to see how they work on the Clojure side.
Obviously we’re in a very different environment than Erlang: Erlang runs many processes, each of which has its own network connection out, while with Clojure I think we want to keep more things on a single JVM.

Integrant has a concept of lifecycles, but as the author says in his presentation:

“you’re turning a configuration into a system, and the only thing you can do with that system is either shut it down or throw it away”

So Integrant is not built for doing this. The right way is to re-initialize the full system. If you want separation, define multiple systems.

Also, looking at the tests for Integrant, it seems like it’s an all-or-nothing kind of thing.

Now for the hacky approach:

It’s perhaps a bit out of scope, but with vanilla Clojure I think you can do something very powerful, and very similar to Erlang supervisors, using agents with set-error-handler!/restart-agent. Similar ideas show up in Kubernetes with ReplicaSets.
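
Roughly something like this toy sketch, where `connect!` is a hypothetical constructor for the fragile resource:

```clojure
;; A supervisor-flavored agent: when an action throws, the error handler
;; re-creates the resource and restarts the agent with it.
(defn connect! [] {:session (rand-int 1000)})   ; stand-in connection

(def conn (agent (connect!) :error-mode :fail))

(set-error-handler!
 conn
 (fn [a _ex]
   ;; the agent isn't marked failed until after the handler returns,
   ;; so restart it from another thread
   (future
     (Thread/sleep 100)                         ; naive back-off
     (restart-agent a (connect!)))))

;; Any action that throws trips the handler and the agent gets a fresh state:
(send conn (fn [_] (throw (ex-info "broken connection" {}))))
```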

Doing the partial restart robustly would require that the Integrant dependencies/components only exchange messages with each other. What I’m trying to say is that there’s an architectural mismatch between raw Integrant and your problem; I think there needs to be another layer in between.

Sewing these two concepts together could be possible, but it would require significant code changes somewhere, especially at the interfaces between the pieces you’d want to isolate.

My 2 cents at least 🙂

I don’t know Integrant, but I’m assuming it defines a tree-like system? Wouldn’t that naturally fit with shutting down in reverse order up to a node and restarting from there?

But to be honest, I think the mistake here is treating the connection as a singleton; you probably want a connection per request, not a connection per component.

So on each request, simply acquire a working connection (a new one, or one from a pool), inject it down, and release it at the end of the request.
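As Ring-style middleware that could look something like this (`acquire!` and `release!` are hypothetical pool operations passed in, not a real driver API):

```clojure
;; Per-request injection: take a working connection at the start of the
;; request, expose it on the request map, and hand it back afterwards.
(defn wrap-connection [handler acquire! release!]
  (fn [request]
    (let [conn (acquire!)]                 ; a working conn, new or pooled
      (try
        (handler (assoc request :conn conn))
        (finally
          (release! conn))))))
```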

Edit: Well I see in the doc it mentions:

Both init and halt! can take a second argument of a collection of keys. If this is supplied, the functions will only initiate or halt the supplied keys (and any referenced keys).

So it seems you can selectively halt! and init specific parts of the system.

I think the challenge is how you handle in-flight and incoming requests while you’re resetting a part of the system.

Maybe Jetty depends on a handler, the handler depends on a Cassandra connection pool, and you’ve got ongoing requests being handled by Jetty. One of them throws an error on a stale connection, so you’d want to halt! the pool and init it again…

That means halt! would go in reverse order: halt Jetty, halt the handlers, halt the pool. Then init would init the pool, init the handler, init Jetty.

You’d need to make sure that your Jetty component, when halted, gracefully terminates all requests. There might be a weird deadlock against the request that called halt! to begin with, so you’d probably want to do that in a background task so the request can return to Jetty.

So the Jetty component would wait for all enqueued requests to terminate, then shut itself down; during that time you’ll probably time out all incoming requests, or something there might break, etc.

I’m thinking a supervisor chain like that might be too coarse, maybe?

It seems it would be simpler to have a connection provider, as someone said earlier, and have each request try to acquire a connection itself, retrying until it gets a good one or has failed too many times. Only in that latter “all retries failed” case might you want to do a full halt! and init to recover.
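
The retry part could be as simple as something like this (`get-conn` and `healthy?` are hypothetical stand-ins):

```clojure
;; Keep asking the provider for a connection until we get a healthy one,
;; or give up after max-tries so the caller can escalate to halt!/init.
(defn acquire-with-retries [get-conn healthy? max-tries]
  (loop [attempt 1]
    (let [conn (get-conn)]
      (cond
        (healthy? conn)       conn
        (< attempt max-tries) (recur (inc attempt))
        :else (throw (ex-info "no healthy connection" {:attempts attempt}))))))
```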

I’m not an expert in Cassandra, but from what I’ve seen so far, it handles things a bit differently, with Sessions that you get from a Cluster. These seem to be meant to be long-lived, and creating one is rather expensive, time-wise, so it’s not something I’d do on a per-request basis.

That is the conclusion I’ve been coming to, I guess. I think a full restart might be the easiest solution (or rather, the most “functional” one; the easiest would be to chuck these sessions into an atom and restart as needed…).
Thanks!

What platform are you writing code for?

For the JVM, I didn’t find any libraries that manage connection pooling with Cassandra out of the box, but there seem to be Java libraries that do, which you could leverage: DataStax Java Driver - Connection pooling

Unless you’re willing to create a brand-new connection on every invocation, the state of the connection/connection pool will have to be kept somewhere; it’s just a question of where. Calling a function to get a connection from some kind of connection manager or pool makes sense to me, and then just that function or two has to worry about the details of managing that state.
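
A minimal sketch of what I mean, assuming hypothetical `new-session` and `session-ok?` wrappers around the driver:

```clojure
;; All the state lives behind this one function: callers just ask for a
;; session and get a live one back.
(defn make-session-manager [new-session session-ok?]
  (let [state (atom (new-session))]
    (fn get-session []
      ;; swap in a fresh session if the current one is broken
      ;; (closing the old session and contention handling elided)
      (swap! state (fn [s] (if (session-ok? s) s (new-session)))))))
```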

We use Redis at work and our Redis Component maintains an internal pool, recreating broken connections as needed. We initially tried to use various libraries that offered pooling with Redis but ran into problems with them and ended up writing our own (using core.async as I recall). I know you’re trying to avoid that but perhaps you really can’t…?
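
Not our actual code, but the core.async idea was roughly along these lines, with `connect!` and `healthy?` standing in for the real Redis calls:

```clojure
(require '[clojure.core.async :as a])

;; Connections live on a buffered channel; taking one acquires it,
;; putting it back releases it, and broken ones are replaced on the way out.
(defn make-pool [connect! size]
  (let [ch (a/chan size)]
    (dotimes [_ size]
      (a/>!! ch (connect!)))
    ch))

(defn with-conn [pool connect! healthy? f]
  (let [conn (a/<!! pool)                           ; block until one is free
        conn (if (healthy? conn) conn (connect!))]  ; recreate broken conns
    (try
      (f conn)
      (finally
        (a/>!! pool conn)))))
```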

@ProfoundBoat , @seancorfield , thanks for your responses. I guess I’ll implement a stateful approach to managing these Session objects, and stick it behind a small API.

missionary implements process supervision, propagates dependency updates efficiently through a DAG, and lets you run fine-grained effects in response to updates or events. I would use missionary for this.

I’d never heard of missionary, looks interesting!

I’ve been trying to do this in Roll, but it seems Integrant wasn’t designed for that.

Relevant issue.

My workaround.

In the end, the best approach is to restart the whole system, and define suspend!/resume methods for keys that don’t need to restart if their arguments didn’t change.
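
For example, something along these lines for a hypothetical ::pool key (a sketch, not the exact Roll code):

```clojure
(require '[integrant.core :as ig])

;; Keep the pool alive across a reset: suspend! leaves it running, and
;; resume reuses it if the config is unchanged, rebuilding it otherwise.
(defmethod ig/suspend-key! ::pool [_ _pool]
  nil)

(defmethod ig/resume-key ::pool [key opts old-opts old-pool]
  (if (= opts old-opts)
    old-pool
    (do (ig/halt-key! key old-pool)
        (ig/init-key key opts))))
```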

Yes, it seems there’s not a “proper” way to do this within Integrant itself. I’m not sure why James thinks restarting a specific component and its dependencies would be less reliable, though…

Sorry, I’ve never used Integrant, but doesn’t this describe what you want:

Both init and halt! can take a second argument of a collection of keys. If this is supplied, the functions will only initiate or halt the supplied keys (and any referenced keys).

If you run init with a collection of keys, the result is a system containing only those keys plus their dependencies, leaving all the other (unspecified) keys missing.
So you either restart your whole system, or you have to manually keep track of the unspecified keys and merge them back (as in the workaround I posted).
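
As I read it, halt! with keys also halts their dependents, while init with keys only pulls in their dependencies, hence the merging back. A sketch with hypothetical keys, where ::server depends on ::handler, which depends on ::pool:

```clojure
(require '[integrant.core :as ig])

;; Restart the pool and everything above it, keeping unrelated keys as-is.
(defn restart-pool! [system config]
  (ig/halt! system [::pool])                   ; halts ::server, ::handler, ::pool
  (merge system (ig/init config [::server]))) ; fresh ::pool, ::handler, ::server
```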

Isn’t that what you would want?

Like you have this tree:

A -> B -> D
A -> C -> D

A depends on B
A depends on C
B depends on D
C depends on D

Now you’re saying that D is broken, so you’d like to swap every use of D for a new one in your running system, but you’re not willing to halt the entire system and restart it all.

So you need to halt and init only A, B, C and D, assuming there are a bunch of other independent pieces in the system other than just those.

Isn’t that what would happen if you, say, halt! D and then init D again?
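
As a toy Integrant config (with a :default init-key so it runs without any real resources; halt-key! is already a no-op by default):

```clojure
(require '[integrant.core :as ig])

(defmethod ig/init-key :default [_ opts] opts)

(def config
  {::a {:b (ig/ref ::b), :c (ig/ref ::c)}
   ::b {:d (ig/ref ::d)}
   ::c {:d (ig/ref ::d)}
   ::d {}})

(def system (ig/init config))

;; My understanding: halt! expands to dependents, init to dependencies, so
(ig/halt! system [::d])    ; halts A first, then B and C, then D itself
(def system'
  (merge system (ig/init config [::a])))  ; re-inits D, then B and C, then A
```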