How to replace DI in Clojure?

We’ve had a similar discussion regarding exceptions as control flow. I even experimented with creating a stackless Throwable in Clojure, and I seem to remember it worked. That avoids most of the creation overhead (capturing the stack trace), but from various reading I recall try/catch has another cost in that it prevents the more aggressive JIT optimizations, so it’s still not “free”.

I also want to thank you, Sean, for your detailed criticism of mount and comparison of different frameworks. It’s nice to know I was not alone in those thoughts and you make a good case for Component.

1 Like

That got me curious, so I researched it, and I especially like this article I found that does a thorough analysis of it: https://shipilev.net/blog/2014/exceptional-performance/#_conclusion

That said, unless I misread the linked article, I thought it talked about using error return values for error cases, not general happy-path control flow, where exceptions wouldn’t be an issue.

Now your IOException is interesting, because I guess you could ask: errors thrown that you can handle, are those a happy path? Say you try three different directories when looking for a config file: you attempt to read the file in all three places one after the other, and in each case you’d get a FileNotFound exception (or something to indicate the file didn’t exist in that directory), which you’d handle by trying the next possible directory. Is this a happy path? Is that bad design then? Should you return a simpler file-not-found indicator instead of throwing? That seems difficult to judge, because the IO read operation probably assumed that the file should be there if you asked to read it, and that you can’t handle the error of it being missing, which in a lot of programs would be true.
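For concreteness, here’s a small Clojure sketch of the two styles for that config-file lookup (all names are made up for illustration): one version treats FileNotFoundException as “not here, try the next directory”, the other checks existence first so no exception is ever thrown.

```clojure
(import '(java.io File FileNotFoundException))

;; Exception-driven: attempt the read and treat FileNotFoundException
;; as "not in this directory, try the next one".
(defn read-first-config-ex [dirs filename]
  (some (fn [dir]
          (try
            (slurp (File. ^String dir ^String filename))
            (catch FileNotFoundException _ nil)))
        dirs))

;; Flag-driven: check existence first, so a miss is an ordinary
;; nil return rather than a thrown exception.
(defn read-first-config-check [dirs filename]
  (some (fn [dir]
          (let [f (File. ^String dir ^String filename)]
            (when (.exists f) (slurp f))))
        dirs))
```

Both return the contents of the first config found, or nil; whether the first version counts as “exceptions for control flow” is exactly the judgment call above.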

There are also cases where you can’t handle the error, but the user or client needs to be aware of it because the fault is on their side, such as a validation error. Is this a happy path as well? Would it be considered control flow to throw a Validation exception instead of, say, having your validation functions be predicates that return true when valid and false otherwise?
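As a tiny sketch of the two designs (hypothetical names), the same check can be a predicate or a throwing validator built on top of the predicate:

```clojure
;; Predicate style: validity is an ordinary boolean return value.
(defn valid-age? [age]
  (and (int? age) (<= 0 age 150)))

;; Exception style: the same check, but failure is signaled via
;; ex-info, carrying the offending value as data for the caller.
(defn validate-age! [age]
  (if (valid-age? age)
    age
    (throw (ex-info "Invalid age" {:age age}))))
```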

I never thought too heavily about this before. I used to think: of course I don’t use exceptions for “expected control flow”. But now I’m reconsidering what counts as expected (and thus “happy path”) control flow and what counts as truly exceptional.

Unrelated to the design consideration, the blog I linked does provide an answer to this for the performance consideration side of things:

The optimistic rule-of-thumb seems to be 10^-4 frequency for exceptions is exceptional enough

This is compared against returning int error codes for error checking using an if condition, which will probably be faster than returning a Clojure keyword or any other more complex Object. It’s also pretty much the frequency at which you start to see slowdowns in overall execution speed, given a function that simply multiplies a number by two.

That means if you throw exceptions 0.01% of the time or less, you won’t see any performance degradation, but if you throw them more often than that, you will start to pay a slight performance cost. If you throw deep and catch at the top, you can go up to 0.1% before seeing performance degradation. That is, if you don’t catch at every level and throw a nested exception, but throw once at the bottom and catch only once at the top or somewhere else.

The performance cost is paid only when throwing an exception, so try itself is pretty much free; creating the exception is the most expensive part, followed by catching it.

It also seems that throwing and catching in the same function gets optimized away and performs as fast as conditional branches, due to inlining optimizations.

Edit: Also, this was on JDK 7. I saw someone doing similar tests, and it seems in JDK 8 the dynamic exception path got even faster, and I’d guess JDK 11 and the latest versions improved further on it.

My quick performance tests were on JDK 15, so even if things have improved across versions, creating an exception is still very expensive (due to building the stacktrace).


I like this type of approach. In the past I’ve used https://github.com/riverford/objection which is similar, but will probably try out context soon.

Component/integrant/etc all deal with turning flat config into a running system. But in my experience I usually want more dynamism; there are usually components I want to start and stop over time while the system is running, sometimes as sort of “child processes”, or “subsystems” of another component. I think either could do it[1][2] but somewhat hackily and against the recommendations of the authors of each.

Component and integrant each require a level of buy-in but then you still don’t have a great means to control “subsystems”, and you have to either manage them ad-hoc or via another separate, framework-ish setup you devise. I like how the “mutable registry of components” approach that context takes cleanly solves both at once.

[1] https://github.com/weavejester/integrant/issues/21
[2] https://matthewdowney.github.io/nesting-component-systems-not-considered-harmful.html

#1 If you “inject” into your business logic a function with return type IO, even calling that function will poison the callsite with return type IO, and the IO type will propagate through your biz logic like a virus.

#2 If you separate your “pure logic” from your “query logic”, then the business logic cannot make decisions that impact the query; in other words, the business logic cannot be very dynamic. Most likely, without a typechecker pointing out IO, we all accidentally end up in #1 without realizing it. (Or the business logic can be written in stored procedures, which colocate with the db and thus don’t do IO.)

The essence of this problem is database access and whether it is pure or not. (And the DB is the one component of the system that the application programmer cannot influence.) That’s why this thread is going in circles, like every thread about this since 1995.

3 Likes

Something like OTP, perhaps?

1 Like

@bsless That made me do some quick testing of times in the REPL:

;; "quick" version of ex-info that makes a data-carrying RuntimeException
;; that omits the stacktrace (and disables recording suppressed exceptions):
user=> (defn ex-info-q [msg data & [cause]]
         (proxy [RuntimeException clojure.lang.IExceptionInfo]
                [msg cause false false]
           (getData [] data)))
#'user/ex-info-q
;; create, throw, and catch this "quick" exception:
user=> (time (dotimes [n 10000] (try (throw (ex-info-q "Test" {})) (catch Exception e))))
"Elapsed time: 2.613088 msecs"
nil
user=> (time (dotimes [n 10000] (try (throw (ex-info-q "Test" {})) (catch Exception e))))
"Elapsed time: 2.563639 msecs"
nil
;; create, throw, and catch the regular clojure.lang.ExceptionInfo:
;; (this populates the stacktrace -- and trims some frames)
user=> (time (dotimes [n 10000] (try (throw (ex-info "Test" {})) (catch Exception e))))
"Elapsed time: 189.582417 msecs"
nil
user=> (time (dotimes [n 10000] (try (throw (ex-info "Test" {})) (catch Exception e))))
"Elapsed time: 185.192821 msecs"
nil
;; for comparison, create, throw, and catch a basic java.lang.Exception:
user=> (time (dotimes [n 10000] (try (throw (Exception. "Test")) (catch Exception e))))
"Elapsed time: 33.66002 msecs"
nil
user=> (time (dotimes [n 10000] (try (throw (Exception. "Test")) (catch Exception e))))
"Elapsed time: 31.446926 msecs"
nil
user=> 

So the “quick” data-carrying exception is about 10x faster than a basic Java exception.

This also makes clear just how expensive the ex-info function is: not only does it generate the full stacktrace, it then trims off any leading frames that match clojure.core$ex_info (using drop-while) and then it pours it back into an array of StackTraceElement (and puts that into the exception).

(I could have used Exception instead of RuntimeException, but I wanted to mimic what clojure.lang.ExceptionInfo extends and implements.)

1 Like

I’ve never felt this limitation, though I can understand how it could appear so. This is a problem of interleaved program flow, but you can still achieve it with a pure core. You just need the outer orchestrating function to, well, orchestrate the interleaved behavior. So maybe you query the DB, get some value, call into pure core logic which returns info about needing to query more data, so the outer function proceeds to query additional data and calls into another pure function, etc.

The trick is that you need to model the request for some additional dynamic side effect as a return value which is interpreted by the outer orchestrator, instead of having the function directly invoke the side effect inline.

A good way to think of it is: imagine the outer orchestrating function as a graph of states with labeled transitions between them. Each state can be either a pure operation or an IO operation. A pure state returns both a transformed input and the name of the next transition to take. The outer function is just the engine for the state machine: it moves the inputs/outputs from state to state and follows the transition arcs by matching on the ones returned from the states.
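Here is a minimal runnable sketch of that engine idea (all names invented, with an atom standing in for the database): pure states return `{:next … :value …}`, and the outer loop just follows the transitions.

```clojure
;; Fake "database" so the sketch is self-contained.
(def fake-db (atom {1 {:name "Ada" :admin? true}}))

(defn fetch-user [id] (get @fake-db id))  ; IO state (imagine a real query)

(defn check-admin [user]                  ; pure state
  (if (:admin? user)
    {:next :grant  :value user}
    {:next :reject :value user}))

;; The state graph: each state is a function of the current value
;; that returns the transformed value and the next transition name.
(def machine
  {:start  (fn [id]   {:next :check :value (fetch-user id)})
   :check  (fn [user] (check-admin user))
   :grant  (fn [user] {:next :done :value (assoc user :access :granted)})
   :reject (fn [user] {:next :done :value (assoc user :access :denied)})})

;; The engine: moves the value from state to state until :done.
(defn run-machine [machine input]
  (loop [state :start, value input]
    (let [{:keys [next value]} ((machine state) value)]
      (if (= next :done) value (recur next value)))))
```

Only `:start` touches IO here; everything else is pure and can be tested by feeding it plain maps.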

If you’ve ever used Clojure’s trampoline function, it gives you a little bit of a feel for this.
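For reference, the classic trampoline example: each function returns either an answer or a thunk naming the next step, and trampoline is the tiny engine that keeps calling until it gets a non-function back.

```clojure
(declare my-odd?)

;; Instead of calling each other directly (and growing the stack),
;; each function returns a thunk describing the next step.
(defn my-even? [n]
  (if (zero? n) true #(my-odd? (dec n))))

(defn my-odd? [n]
  (if (zero? n) false #(my-even? (dec n))))

;; (trampoline my-even? 1000000) runs in constant stack space.
```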

Ya, I should have clarified: the tests I saw also show that it is still much slower, just a lot faster than it used to be. That means the frequency threshold from the article I linked may have gotten even better; say, we might be able to benefit from using exceptions 1% of the time without having it impact performance, whereas in the current article on JDK 7, 0.01% seems to be the threshold.

I’ve never really stopped and asked myself how frequently my exceptions are thrown, though, but I might start paying more attention to that, as it seems anything more than 0.01% will start to become slower than using flag checks for success/failure.

The important aspect of the analysis, though, is that try is free, and thus the happy path is faster than having an if(success) check.

2 Likes

It is common enough to inject dependencies into another module/component without depending on the concrete implementation. I investigated the options mentioned above about a year ago, including Component, Integrant, and Mount; IMO these solutions target lifecycle management more. We actually used Component and Integrant in our production applications, while still not being very satisfied, because:

  1. I have to use it everywhere, even in tests, where most of the time I just want to replace my database data with some mock plain data.
  2. Like @seancorfield mentioned, Component is simpler, but it still forces you to implement its lifecycle protocol just for integration purposes, and other parts of the system do not care about it (it is used only internally).

I ended up developing fun-map. It acts just like a plain map, so you can replace it with a common map anywhere, such as in unit testing. Under the hood, the entries of a fun-map can depend on other entries (by using the fnk or fw macros as the value of a map entry, which act just like normal functions), so the dependencies are taken care of simply by fetching values with a common get. Because it is a common data structure, you can put it in any other data structure or nest it, so the subsystems @jjttjj mentioned can also be handled; actually, in our production system using fun-map, this pattern is used widely.

Fun-map can also take care of lifecycle as a by-product: you just use life-cycle-map, which is also a fun-map but supports the Closeable interface. It records the invocation order and will call the corresponding shutdown functions in reverse order.

4 Likes

Re: 1. Mock data – if you write your code in such a way that the data any function needs is passed in, then you can easily pass in mock data when testing. We do this at work, and use generative testing or clojure.spec/exercise for that mock data if we can.
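A trivial sketch of what that looks like in practice (hypothetical domain): because the function takes its data as an argument, the “mock” is just a literal map.

```clojure
;; Pure business logic: all the data it needs is passed in.
(defn order-total [order]
  (reduce + (map (fn [{:keys [price qty]}] (* price qty))
                 (:items order))))

;; In a test, "mock data" is just a plain map literal --
;; no database and no component system involved.
(def mock-order {:items [{:price 5 :qty 2} {:price 3 :qty 1}]})
```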

Re: 2. Lifecycle – if you have no setup/teardown, you don’t need to implement the protocol, and you can use plain hash maps instead of “component” records.

A lot of people see the basic examples in Component’s docs and assume that’s how it all has to be done and that it “smells” of OOP too much, but the reality is that the lifecycle is optional and, if a component has no dependencies, it doesn’t even need to be a hash map.

Component definitely isn’t perfect – I think it would be interesting to see it using metadata for dependencies instead of plain hash map keys, so you could have dependencies on non-associative data (and I’ve opened an issue on the repo for that) – but it is about as simple as you can get for a system that has both optional dependencies and optional start/stop lifecycle.

Re 1: Most data-processing functions are pure, where the source of data normally comes from the database and user inputs. Unit tests for pure functions are easy, but to integrate these functions I always end up with a hard-coded function invocation path, and that is not always desirable.
Re 2: The problem I encountered while using Component/Integrant is that some dependencies come from lifecycle components and some do not. We still need to treat them differently and are not able to switch between them easily.

Fun-map makes function invocation interchangeable with a map’s get operation, turning this code into data. If used correctly, data access is generalized: you can always replace plain data with function calls anytime you need to.

The super-short answer is that the functional patterns look like CQRS.

The medium answer is that you do not want to pass impure functions around. The dirty little secret of impure functions is that they should return pure data and they should be obscenely simple. You want them to be so simple that there’s no point to unit tests, because you’d only be testing Clojure or the API itself—things that already (presumably) have good test coverage. You only want those jerks to be at the edges of a pipeline. So, you get pure data out of your function and pass it along to your pure functions in your pipeline. Or, you get pure data, pass it through some pure functions in a pipeline, and plop it out into an impure function at the end.

If you need to do something like paging, the page is a property of the state of your system (or a tiny self-contained sub-system). You’re making an impure call for each page. Those pure page results get passed into your pure functions.
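A small sketch of that paging shape (fake-pages stands in for the real API; all names hypothetical): the only impure function is the page fetch, and everything downstream consumes plain data.

```clojure
(def fake-pages {0 [1 2 3], 1 [4 5], 2 []}) ; stand-in for a paged API

(defn fetch-page! [n]          ; the only impure call (imagine HTTP here)
  (get fake-pages n []))

(defn all-items []             ; orchestrator: loops until an empty page
  (loop [n 0, acc []]
    (let [page (fetch-page! n)]
      (if (seq page)
        (recur (inc n) (into acc page))
        acc))))

(defn total [items]            ; pure processing, never sees the API
  (reduce + items))
```

The page number is the state of this tiny sub-system; the pure `total` never knows paging existed.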

The super-long answer is that I think you should give the Functional Design in Clojure podcast a listen. It does an excellent job of stepping through the thought process of how to reach elegant solutions to problems like this. In particular, there’s a series about implementing a bot that posts prepared content to Twitter on a schedule. They start small and simple and layer in more intricate features, refactoring for simplicity along the way. They deal with exactly this pattern with a REST API, but the fact that it’s a REST API vs a database makes not a whit of difference to the solution.

In particular, they deal with this problem by pushing the impure functions out to the very edges and keep the impure functions extremely simple. So simple that if you were to write tests you’d only be testing the API, not any of your own logic, so there’s no point to writing tests for those functions.

Those impure functions tend to appear at the beginning or ending of a pipeline. Previous posters mentioned the railway pattern, which is about general control flow in such a pipeline.

One “pattern” for keeping your functions pure is to implement a “decider” function. The “decider” is also a pure function. It takes some information about the (pure) state of the system and returns a map that describes what to do about that state. That map is just data, so you can easily test it. Similar to the CQRS Pattern, these maps “work” better as task-oriented descriptions rather than imperative descriptions. E.g., “Book Hotel Room”, not “Set Hotel Room’s Status to ‘reserved’”.

A thin shim function around your impure functions (a “do-er”) can take that “decision” and dispatch it to your impure functions. Critically, the “do-er” function is dirt-simple, little more than a case statement (or a multimethod) that stuffs your decider’s payload into the impure function and returns the result.
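A minimal sketch of that decider/do-er split (the hotel-room names are hypothetical, and the atom stands in for a real booking API):

```clojure
;; Pure decider: looks at state and returns a data description of
;; what to do -- task-oriented, and trivially testable.
(defn decide [room]
  (if (= :free (:status room))
    {:action :book-hotel-room    :room (:id room)}
    {:action :notify-unavailable :room (:id room)}))

;; Impure do-er: little more than a case statement that dispatches
;; the decision to the side-effecting code.
(def bookings (atom #{}))

(defn execute! [{:keys [action room]}]
  (case action
    :book-hotel-room    (swap! bookings conj room)
    :notify-unavailable (println "Room" room "is unavailable")))
```

Everything up to `execute!` is a pure, data-driven pipeline you can test with plain maps.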

Note that this pattern means that you can have an entirely-pure data-driven pipeline that takes some pure data and returns a “decision” object.

Others have mentioned Component and Integrant, which are two ways of directly representing dependency injection in data-oriented or function-oriented fashions, which is where you actually get the “pure” “database” that you pass to your impure function. Neither library is really necessary, but I think they both provide good guide-rails for how to organize your code “down-wind.” That is, you’re unlikely to “go wrong” if you use them, but people who don’t use them usually end up writing their code in ways that are isomorphically compatible with them in the first place. However, “dependency injection” should be about standing up the system, not making impure parts of it “mockable”. You should prefer a design that allows you to “mock” with real data, rather than fake side-effects.

8 Likes

It has a simple lifecycle – just start and stop – and some dependency graph logic to figure out the order to start/stop the components

It’s actually even more general. Component also provides the fundamental tooling to apply arbitrary functions to the system components in dependency order (or reverse). You can use that for more, e.g. to determine the health of the system or to collect “configuration”. That’s what I like about it so much: it’s nothing more than the essence — dependency specification, topological sort, and some “icing” in the form of the Lifecycle protocol.
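That essence is small enough to sketch in a few lines: a map of dependencies plus a topological sort gives you the order in which to apply any function to the components (reversed for shutdown). This naive version (hypothetical system names) assumes the graph is acyclic.

```clojure
;; Hypothetical system: each component lists the components it needs.
(def deps {:db [], :cache [:db], :handler [:db :cache], :server [:handler]})

(defn topo-order
  "Order components so every component comes after its dependencies.
  Assumes an acyclic graph."
  [deps]
  (loop [order [], remaining (set (keys deps))]
    (if (empty? remaining)
      order
      (let [ready (filter (fn [k] (not-any? remaining (deps k))) remaining)]
        (recur (into order ready) (reduce disj remaining ready))))))

;; start order: (topo-order deps)
;; stop order:  (reverse (topo-order deps))
```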

2 Likes

I’m struggling with applying this style of programming to code I’m writing for work.

I’ll give you a simplified description of what I’m trying to do. I have a bunch of articles in the form of HTML coming from a database which I transform by replacing certain strings with anchor tags. Think cross linking of content. In total we’re talking about maybe 10 million HTML nodes with let’s just say 100 words of content on average.

We already have a service that accepts a list of strings and returns you possible destinations to which these strings can be linked (all in the form of a map from original string to target).

The typical OOP/imperative cookie-cutter approach is something like this:

type LinkStore interface {
  GetLinks([]string) []Destination
}

func NewArticleTransformer(getLinks LinkStore) Transformer {
  return Transformer{getLinks}
}

func (t Transformer) TransformArticle(article string) (string, error) {
  // Do something and use t.GetLinks() instead of
  // making actual HTTP requests right here
}

This is dead simple to write and easy for everyone to understand. You could call grep as part of your CI to check that your business logic code has no imports from application level packages and thus even in a team full of juniors you can stick to your layers. Business logic in one place, implementation specific stuff in another.

Another advantage here is that you could make the implementation of LinkStore stateful to implement caching. You can save significant network traffic with some really simple caching because the same string will always return the same Destination within a given time frame of let’s say a day or so. So a simple in-memory cache for the duration of the transformation pipeline would be totally fine.

How would I model something like this with the interleaved effect style? I brainstormed this a bit but couldn’t figure it out. I can see having a getTextFromArticle function which traverses the HTML and just returns a list of text chunks. You’re then essentially looking at a fold (~ reduce) over the text chunks, where for each step you request destination data and can then eliminate all found strings from the remaining chunks. This could be written quite nicely with fold/reduce and sets. The further you progress, the less work you need to do. At the end of this pipeline you’re left with all link data for the given text chunks. You can now pass that link data to another pure function which does the string -> string transformation with a given input text and all the link data it needs.

But the problem here is that fold includes both pure code and network requests. Do I now keep the accumulator in my outer layer and just have some small pure functions that I call from there? To me this sounds like a really, really awkward split of what should be one logical unit.

You could also turn all of these state transitions into some abstraction, but I can’t honestly see myself suggesting to my coworkers that we add integration-layer code that calls into a state-transition engine handling transitions that are super specific to each use case, when all of this could also just be done by passing potentially impure functions to pure code. They’d ask what we’re getting out of it, especially since that code would be much harder to reason about than the plain OOP/imperative approach.

Another classic case is logging. If you want to keep logging out of your pure code at all costs, you have to keep changing your return types so that the caller can do the logging. Quite a lot of churn.

Interestingly, in Haskell and PureScript passing impure functions to pure code is perfectly fine. You write your functions in a way that the concrete effects are left undefined so that in your tests you can run everything in some pure environment whereas for production you use IO (https://thomashoneyman.com/guides/real-world-halogen/push-effects-to-the-edges/)

So long story short: I think that this comment really cuts to the essence of the whole thread, and there’s definitely a lack of concrete, non-trivial examples of code that separates pure and impure code by interleaving effects without making a mess of things.

1 Like

Ok, I’m not totally following your use case, and I don’t really understand your example either, because I don’t think I know what language that is.

But if I understood: you have a bunch of HTML strings, and you want to loop over each one, parse its content for, say, keywords, then call getLinks with the list of keywords from the HTML, and it will return the things they link to, which you then use to replace each keyword in the HTML with an anchor to the returned link. Is that it?

And so, you first need to parse the HTML to find the strings to call getLinks with, but getLinks is an IO call to some other service, so you’re wondering how to push that impure logic to the edge?

Well, first of all, I’d like you to think of the pros/cons of why you want to do that. You mentioned Haskell, but without going over the details, Haskell basically says that:

(defn replace-with-links
  [html links-getter]
  (->> (get-strings-from-html html)
    (links-getter)
    (replace-anchors html)))

Is a pure function, because if the function you pass in as links-getter is pure, then the above is pure as well. And you can imagine that Haskell says, unless you run this in prod, links-getter will be pure, but if run in prod, then Haskell will have links-getter be the real impure IO call to getLinks.

And so this is true in Clojure as well. The above function is pure in a lot of scenarios, maybe those that matter, like your unit tests and what not.

I like to think Clojure is pragmatic like that, and so this could be totally fine.

But alternatively, what you want to do is break down things even more, you only need to push the impure things to the edge, not get rid of them from your code. So it could be:

(defn main ; (or some API entry point)
  [htmls]
  (for [html htmls]
    (->> (get-strings-from-html html)
      (getLinks)
      (replace-anchors html))))

Now you’ve pushed it to the edge. Main and getLinks are the only impure functions, main is your workflow definition for defining the series of steps needed to fulfill your program (or request if an API, or command if a user input event). And it’s at the very edge doing all the IO, and getLinks is the IO itself. The other two functions are pure.

Now I’ll say that sometimes that main function could get pretty damn long; a complicated use case might have a lot of steps. But you only need to break it down into a pure-chunk -> impure-chunk -> pure-chunk -> impure-chunk kind of thing, so it’s not like it needs to spell out every single small step, just the breaks between sections you can do all pure until you need something impure. So it’s manageable most of the time. And for the rare case where it might be unruly, well, I’d be pragmatic about it and break the rule of having it all at the very edge, allowing it as close to the edge as I can, but maybe not the very edge.

And so, you first need to parse the HTML to find the strings to call getLinks with

This part I should have explained better. I don’t know which words can be linked. I need to send every single string to that service and it will return the substrings within that string that can be linked. You could imagine it as getLinks :: [String] -> [Destination].

And that’s the part that makes this difficult; otherwise it would have been a straightforward pipeline, as you said.

The language I used to demonstrate is Go by the way.

The Haskell equivalent to the Go code would be:


class Monad m => GetLinks m where
  getLinks :: [String] -> m [Link]

transformArticle :: GetLinks m => String -> m String
transformArticle article = do
  links <- getLinks (words article)
  -- and so on

instance GetLinks App where
  getLinks = someImplementationForProduction

instance GetLinks Test where
  getLinks = someImplementationForTesting

This comes down to the same “passing impure functions to my pure functions”.

Ok, well I don’t think that matters much that you split the html into words or parse it for some subset of strings; my answer still applies, I believe. Either you do the same and pass in a getLinks implementation as an argument (you could even have getLinks dispatch to a different implementation based on the environment, which wouldn’t be DI anymore but achieves a similar result). Or you dissect your logic into pure chunks and impure chunks, and flatten your call stack so that all the impure chunks sit at the top in your entry function, which then becomes a workflow definition for how to weave your pure and impure chunks together: in what order, and connecting their inputs and outputs to one another.
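A quick sketch of that “dispatch on environment” alternative (names hypothetical): the callsite stays a plain get-links call, and the environment picks the implementation.

```clojure
;; Which environment we're running in; rebind or alter for prod.
(def ^:dynamic *env* :test)

(defmulti get-links (fn [_strings] *env*))

;; Test implementation: deterministic fake links, no network.
(defmethod get-links :test [strings]
  (zipmap strings (map #(str "https://example.com/" %) strings)))

;; Prod implementation would make the real service call here.
(defmethod get-links :prod [strings]
  (throw (ex-info "real link-service call goes here" {:strings strings})))
```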

Ok, well I don’t think that matters much that you split the html into words or parse it for some subset of strings

Maybe I’m still not articulating myself well.

(defn replace-with-links
  [html links-getter]
  (->> (get-strings-from-html html)
    (links-getter)
    (replace-anchors html)))

This code does not do what it should for the above example, since the strings from get-strings-from-html need to be sent to another service to actually get the links. If I just do this for all my strings, I’ll make a lot of unnecessary network requests, since the response from every previous request can be used to eliminate the already-seen links from the next one. That’s what I meant when I was talking about interleaving impure and pure code.

(reduce
  (fn [links blob]
    (let [remaining (remove-seen links blob)
          new-links (get-links remaining)]
      (concat links new-links)))
  []
  articles)

This would have to be the first step, followed by just replace-anchors html as the second step. This first step could be extracted into a function, but it would take an impure function, get-links, as input. And that’s totally fine; it’s what I currently do in every language I write. But I was wondering how someone would do this with the mentioned approach of really separating pure and impure, so that as little logic as possible is left in the impure parts.

Sorry for the confusing descriptions.

1 Like

Maybe this?

(defn main ; (or some API entry point)
  [htmls]
  (let [get-links-memo (memoize getLinks)]
    (for [html htmls]
      (->> (get-strings-from-html html)
        (get-links-memo)
        (replace-anchors html)))))

2 Likes

That’s a nice idea! Maybe many more cases can be solved by thinking outside of the box instead of being mentally stuck in the DI approach. Thanks for the insightful replies.

This is also mentioned in https://www.parsonsmatt.org/2017/07/27/inverted_mocking.html under “Decomposition: Conduit-style”. Different language, but the idea of separating code along the lines of pure logic vs. impure everything else (rather than the usual writing of logic against interfaces of impure functions) is the same.

I’ll have to practice this more.