Architecture of big applications

NoahTheDuke · November 29, 2022, 4:59pm

There are many example/demo repos for Clojure demonstrating a small slice of how a web app might be structured. Most notable is @seancorfield’s usermanager-example, which has many forks and variations (using reitit, integrant, etc). All of these are good for demonstrating how everything fits together for a small crud app, but they don’t do a good job of:

demonstrating how it works when there are 50+ routes and many interlocking Components etc
showing how to separate out the “pure” business logic from the “stateful” stuff, aka Eric Norman’s “Actions” vs “Calculations”

There are lots of hints about how to do this work, and folks talk in vague aphorisms or pithy suggestions, but even books like “Clojure Applied” end up staying higher level than I hoped. The codebase at my current job is pretty good, but it’s not been built with such ideas in mind, so frequently I find network calls or database queries in a call stack that’s 10 or 15 functions deep, or there’s no clean separation between what a given Component should own so they overlap in their responsibilities.

My question is: Are there public examples of “big” applications that are built with best practices?

seancorfield · November 29, 2022, 8:48pm

While there are a lot of open-source Clojure libraries, I get the impression that nearly all the substantial applications are closed-source.

When this question comes up, Metabase (metabase/metabase on GitHub: “The simplest, fastest way to get business intelligence and analytics to everyone in your company”) is the project that I hear mentioned most often as a “large open source application” but I’ve never looked at the source code so I can’t comment on its architecture or idioms.

At work, we’ve tried a number of different approaches in our various applications over the years and we still haven’t really settled on an overall approach that we really like – although adopting Polylith (and slowly migrating our 130Kloc codebase to it) has addressed a lot of organizational issues for us, and the separation of application code, reusable code, and the build/test/artifact concerns.

Linus_Ericsson · November 30, 2022, 10:09am

Somewhat old but a document app (haven’t tried): GitHub - bevuta/pepa: A document management system

A Finnish agency application (in Finnish). GitHub - finnishtransportagency/harja: Väylän Harja-projekti

zcaudate · November 30, 2022, 12:08pm

@NoahTheDuke, do you actually have 50+ routes that you need to expose? From experience, 50 of anything is hard to manage, so you may have to write some custom ‘framework-like’ code that is specific to your needs, but it’s always going to be different depending on your app.

One thing that is really frustrating about backend development is the calls to the database and other external components: Kafka, Redis etc. I really recommend mulog by @BrunoBonacci and a lot of his other libraries. I came across his libraries about 2 or 3 years ago, and they are all built for large-scale systems. At such scales, tracing and logging become super important. There are examples in the repo for using mulog in various distributed systems and configurations.

maxweber · November 30, 2022, 6:19pm

Penpot is also a big Clojure and ClojureScript open source app:

NoahTheDuke · November 30, 2022, 8:35pm

Currently, my company’s app exposes 251 unique routes. From my experience over the last year, we need every single one of them. The app itself is 403 clj files and roughly 65k lines of code, with 207 files and 75k lines of tests. Apps get big over time, which is one of the reasons I’m asking.

Tracing is nice but it’s different from what I’m wondering about.

Yeah, that’s been my experience as well. (I’ve wished to read the World Singles codebase after reading your blog, hah.) Polylith is cool, but given that I don’t fully understand it and I’m still struggling to get us moved to deps.edn from Leiningen, it’s a no-go right now.

seancorfield · November 30, 2022, 10:35pm

50 seems small to me. We have an API that has ~100 routes and our (internal-facing) administration app has over 400 routes.

seancorfield · November 30, 2022, 10:37pm

Have you watched the Los Angeles Clojure Meetup video exploring Polylith? I ended up showing quite a bit of the World Singles code structure as part of that. Meetup: Collaborative Learning - Polylith - YouTube

zcaudate · December 1, 2022, 2:10am

Sweet, I was just making sure. Are the 403 files all backend, or is there cljs as well?

I think I know what you mean. One way is AOP or code generation of the routing layer that can potentially lighten the load. There is GitHub - sunng87/slacker: Transparent, non-incursive RPC by clojure and for clojure, which turns a namespace into routes. Your application may need customisation (mainly for authentication), but you can modify the repo for your own purposes

@seancorfield
I’m not going to get into a dick measuring contest here. I was reaching breaking point right at the very beginning when statstrade got to around 20 routes. The routing was already unwieldy enough with 20 links, not to mention the actual ui code on the js side so we had to solve it in a pretty novel way.

mvarela · December 5, 2022, 6:42am

Penpot is open source, and probably “big” enough. I haven’t had a proper look at the code, though.

Anthony_Leonard · December 7, 2022, 9:54pm

I have no concrete suggestions but am interested in the answers given. Your experience is perhaps similar to mine: large codebases growing with namespaces defined like Java packages, arranged broadly in layers whose responsibilities are not clearly defined, where a “vertical” addition of functionality touches every namespace, and functions call other functions anywhere without limit.

I put a lot of store behind breaking up the codebase along clean architecture lines - where the (unit tested) functional core namespaces do not require the (integration tested) stateful pieces doing I/O. That can be enforced in tests with an equivalent of the archunit approach for clojure using ns-refers, ns-aliases etc. I believe Polylith seeks to similarly reduce the surface area exposed by “pieces” of your app to other pieces, and therefore the interdependence between them, and seems to be winning mindshare for large apps just now.
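As a rough illustration of the kind of check meant here, the sketch below uses ns-aliases to flag a “core” namespace that aliases an I/O namespace. The demo.* namespaces and the forbidden-aliases helper are hypothetical, and the approach only sees requires that use :as, so treat it as a starting point rather than a complete archunit equivalent.

```clojure
(require '[clojure.string :as str])

(defn forbidden-aliases
  "Returns the set of namespace names aliased inside `ns-sym` whose
  names start with `forbidden-prefix`. Hypothetical helper."
  [ns-sym forbidden-prefix]
  (->> (vals (ns-aliases ns-sym))
       (map (comp name ns-name))
       (filter #(str/starts-with? % forbidden-prefix))
       set))

;; Demo set-up: made-up namespaces standing in for a real app.
(create-ns 'demo.io.db)
(create-ns 'demo.core.pricing)
(create-ns 'demo.core.tax)

;; demo.core.pricing aliases an I/O namespace (a violation) ...
(binding [*ns* (the-ns 'demo.core.pricing)]
  (alias 'db 'demo.io.db))

;; ... while demo.core.tax only aliases a pure library namespace.
(binding [*ns* (the-ns 'demo.core.tax)]
  (alias 'str 'clojure.string))

(def pricing-violations (forbidden-aliases 'demo.core.pricing "demo.io."))
(def tax-violations     (forbidden-aliases 'demo.core.tax "demo.io."))
```

In a real project the two def forms would presumably become assertions in a test that loops over all core namespaces.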

But much as these techniques help call out separate pieces, and reduce their fine grained interdependencies, what responsibilities pieces should limit themselves to is still often hard to agree even in a small team. There is often debate about what should be in a “controller” or a “repo”. Just talking and defining these early and sticking to them would probably help any team. But yes it sure would help if there were industry conventions for naming these types of pieces that folks could learn and stick to, or even good canonical examples such as you’re looking for here.

pfernandez · December 12, 2022, 11:03pm

I’m in the process of refactoring a microservice that’s starting to become not-so-micro with 35 endpoints built by several developers at different times. At that size, it’s already starting to feel like a mess and it’s time to clean it up.

Below are the basic techniques I plan to use that I believe will allow our code to grow indefinitely. I realize this isn’t an example I can point you to, but in reality large codebases can’t really follow templates anyway. Templates are just a starting point. I developed a lot of these ideas while wrestling with 18-year legacy code at Tumblr, eventually realized that they’re just basic functional programming techniques, then quit to become a Clojure developer.

Lift side effects like API calls to the entrypoints of the app. Remove all possible logic from this section of code so that writing tests around it won’t be necessary.
Gather the data you need into a map (often called a “context” map) that can be passed down through pure functions.
Store example data in EDN files, and write your tests around these files. After a while you’ll find that instead of adding tests, you’ll often simply be adding more test data.
Write tests to cover the behaviors of only the top-level primary (pure) functions.
Use pure functions for everything except required side effects.
Always be refactoring into a tree structure that mirrors the natural shape of nested functions.
Break up logic into services, directories, and functions that arise naturally from your flow of logic, data, and use cases.
Wait to destructure data until it’s really necessary, near the leaves of your code tree.
Move shared code into utility files/directories as needed. Bubble these up the tree as needed throughout more of the codebase.
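A minimal sketch of how a few of these points might fit together, under stated assumptions: order-summary, handle-order!, the fetch-items! argument and the data shapes are all invented for illustration, and the EDN is inlined as a string where a real project would read it from a file.

```clojure
(require '[clojure.edn :as edn])

;; Pure function: everything it needs arrives in the context map.
(defn order-summary
  [{:keys [items discount]}]
  (let [subtotal (reduce + (map #(* (:price %) (:qty %)) items))]
    {:subtotal subtotal
     :discount discount
     :total    (- subtotal discount)}))

;; Entrypoint: performs the side effect (fetch-items! is passed in and
;; stands in for an API or database call), builds the context map, and
;; hands off to the pure function. No logic worth testing lives here.
(defn handle-order!
  [fetch-items! request]
  (order-summary {:items    (fetch-items! (:order-id request))
                  :discount (get-in request [:params :discount] 0)}))

;; Example data as EDN. Adding a case means adding data, not a new test.
(def example-cases
  (edn/read-string
   "[{:context  {:items [{:price 10 :qty 2} {:price 5 :qty 1}] :discount 5}
      :expected {:subtotal 25 :discount 5 :total 20}}
     {:context  {:items [] :discount 0}
      :expected {:subtotal 0 :discount 0 :total 0}}]"))

(defn failing-cases
  "Returns the example cases whose expected value does not match."
  []
  (remove (fn [{:keys [context expected]}]
            (= expected (order-summary context)))
          example-cases))
```

Only order-summary is exercised by the data-driven check; handle-order! stays thin enough that a stubbed fetch function is all it takes to try it out.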

Things to avoid

Shared state. Passing data through functions instead preserves purity, making reasoning and testing much easier. If your code is becoming too deep, think about how you can flatten it naturally with pure functions.
Abstractions meant to reduce the number of lines of code at the expense of increased cognitive overhead. Think about the poor soul who’ll come in two years later and just needs to fix a bug. They should be able to drop into any function at any point in the app and understand everything they need to based solely on the function’s inputs and output.
defprotocol. Programmers trained in object-oriented design tend to go for this, resulting in a lot of needless abstraction that breaks the natural flow of data through an app, making it hard to trace and test.
Fancy techniques like currying, chaining, partial application, and the like, which have their places, but tend to make code confusing for developers new to the application.

I hope the pattern here is clear. Pass data through a tree of pure functions, refactor as it grows, and push all side effects to the root and leaves. This is basically what I consider to be true functional programming. And even though it may not be possible to refactor a huge app into a functional structure all at once, having the overall vision in mind can give you something to work toward.

Harleqin · December 12, 2022, 11:42pm

I think it’s good to lay out such goals, but be aware that this can only be a guideline, not commandments.

One point in particular stands out to me: context maps. In my view/experience, these tend to become giant wool balls, coupling everything to everything through the used keys. They become especially cumbersome when you try to shoehorn the flow of the program through a linear threading macro. Program structure is not linear in general.

Instead, I believe it is more useful to remember the top-down-bottom-up dance: identify top-down what you need, then build the language you need bottom-up, then use it in the upper part. Make sure that one function only talks at one abstraction level. Most function composition should just be function application in function bodies.

pfernandez · December 13, 2022, 12:55am

it’s good to lay out such goals, but be aware that this can only be a guideline, not commandments.

Amen. Remember though that lambda calculus has shown that all logic can (in theory) be written purely in the form a(b(c(x))). Everything else is basically a shortcut. My argument is that leaving the world of pure function composition is what makes code hard to test and understand from a “local” perspective, i.e. for the person debugging the code months and years later.

context maps… tend to become giant wool balls… Program structure is not linear in general.

They do, you’re right. But I’ve never seen a way to pass shared data around that doesn’t involve either a context map passed as an argument, or else something more opaque and/or side-effecty like a global state object.

I did mention one way to mitigate the “wool ball” problem: Think about how you can flatten your code by refactoring using only pure functions. You can take advantage of that nonlinear nature of code, and use multiple context maps depending on the entrypoint. An API, for example, isn’t really a tree but multiple interwoven trees. So strive to make each tree as shallow as possible and pass each only the context it needs.

didibus · December 13, 2022, 7:57pm

I support @Harleqin’s statement. While the shared immutable context map pattern is better than the shared mutable one, I believe it still creates data coupling and also fails to make your smaller units modular by having them depend on the top-level structure.

What I recommend doing instead is to have the modules design an input structure that they want, with only the data that they need, and have that be injected into them on every call.

That means it’s up to the top layer to fetch the data the module wants, and to transform the data in the representation the module wants, prior to calling the module.

I say “module” here, which is ambiguous, but this is because there is a continuum that exists here. At the smallest, you would do what I’m saying for every single function. But sometimes there are a set of functions that all work together tightly to deliver some more application relevant chunk of behavior. I call these a module. How coarse or granular your modules are is up for you to decide what’s best in your case.

What you can do is use the “context map” pattern inside those modules, but it isn’t a global shared context map anymore, shared throughout every function of your application. Instead it is shared only within one module, meaning only across a limited set of functions that logically form a module.

I would still recommend leaning toward keeping modules small, and when you start, I’d even suggest you apply this to every function, because it’s easier to refactor a set of functions into a module with shared immutable input than the other way around.

The most important trick here is to keep the call stack shallow.

Most people will be tempted to do something like:

A -> B -> C -> D

Now if they need something in D, they pass a context map to A which is passed down all the way to D, and they just keep adding to that map as D needs more data.

Instead you can do:

A -> B
A -> C
A -> D

If D needs output from B, it is A which will take it and give it to D. B, C and D can therefore each have their own input/output, with only the data they need as input and only what they produce as output. They no longer worry about what comes before or after, or where to get the data from.

A will have to find the data that D needs, by calling B to get it for example, it will then need to transform the return of B into the input of D, D might need the data from C as well, and some other data from A’s own input, A can combine all that into the input structure of D.
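As a small sketch of that shape (the bodies of B, C and D and all the keys are made up for illustration), A is the only function that knows how the pieces connect:

```clojure
;; Hypothetical steps. Each takes only what it needs and returns only
;; what it produces; none of them knows about the others.
(defn B [user-id]
  {:id user-id :tier :gold})

(defn C [cart]
  (reduce + (map :price cart)))

(defn D [{:keys [tier subtotal]}]
  (if (= tier :gold)
    (- subtotal 10)   ; assumed flat discount for gold-tier users
    subtotal))

;; A fetches, adapts and combines: it calls B and C, then reshapes
;; their results (plus nothing else) into the input D asks for.
(defn A [{:keys [user-id cart]}]
  (let [user     (B user-id)
        subtotal (C cart)]
    (D {:tier (:tier user) :subtotal subtotal})))

(A {:user-id 1 :cart [{:price 60} {:price 40}]})
;; => 90
```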

Now if you realize that B, C and D are never used anywhere else but inside A, you’ve identified a logical “module” of your application. It means that your application benefits from the more coarse behavior of A, and doesn’t need to use the more granular behavior of B, C and D.

Once you’ve identified that, you can refactor A, B, C and D to all use the same input structure, and have them return their output as additional data on their input.

It still is going to be:

A -> B
A -> C
A -> D

But if you look at the input/output of B it would have changed from:

(defn B [arg1 arg2]
  b-output-value)

To:

(defn B [{:keys [arg1 arg2] :as a-map}]
  (assoc a-map :b-output-value b-output-value))

As you see, when we do this, we’ve now coupled A, B, C and D in favor of convenience, because A can now just thread through B, C and D and doesn’t need to transform the data in/out.

But if you look at D, it’s now coupled to B, because if B changes the name of the key, the shape of the value, or where on the map it puts its output, D is broken.

Whereas before, only A would be broken, the breakage wouldn’t cascade, any change to B, C or D only would require fixing the direct caller A.

This is the problem you have with this pattern, so imagine using it across your entire app with just one giant global shared map.

Here we’re limiting it to modules we have identified: smaller independent sections of code where you are okay taking convenience over coupling because you find the functions are all inherently meant to work together very tightly anyway, with knowledge of each other. In which case it’s okay to do this.
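The cascade described above can be reproduced in a few lines. This is only a sketch with invented keys (:prices, :shipping, :subtotal), and B-renamed stands in for “B after someone changed its output key”:

```clojure
;; Module style: every step reads from and writes to one shared map.
(defn B [ctx] (assoc ctx :subtotal (reduce + (:prices ctx))))
(defn D [ctx] (assoc ctx :total (+ (:subtotal ctx) (:shipping ctx))))
(defn A [ctx] (-> ctx B D :total))

(A {:prices [10 20] :shipping 5})
;; => 35

;; B changes the name of its output key. D was never touched, but it
;; now finds nil under :subtotal and blows up when it tries to add.
(defn B-renamed [ctx] (assoc ctx :sub-total (reduce + (:prices ctx))))
(defn A-after-rename [ctx] (-> ctx B-renamed D :total))

(defn throws?
  "True when calling the zero-argument function f raises an exception."
  [f]
  (try (f) false (catch Exception _ true)))

(throws? #(A-after-rename {:prices [10 20] :shipping 5}))
;; => true
```

With explicit inputs and outputs, the same rename would only require a change in A, the direct caller.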

pfernandez · December 13, 2022, 10:45pm

@didibus Great reply, thanks! I was just sharing this thread in a meeting with the team when your post appeared.

If I understand you, I think the solution to the “same input structure” problem is simply to use the shallow call stack:

A -> B
A -> C
...

but have the second-tier functions require only a subset of the context. They would be coupled, but only in the sense that the same data must have the same shape. In our case A would be an endpoint handler whose job is to act as a kind of “data bus”:

(defn A [{:keys [param] :as request}]
  (let [context        {:request request}
        external-data  @(post "data.com" {:param param})
        b-input-value  (assoc context :data external-data)
        b-output-value (B b-input-value)
        c-input-value  (assoc context :b-data b-output-value)]
    (C c-input-value)))

B and C contain all the business logic and get unit tested, while A does not. You could pass B and C the full context if it’s convenient (it’s just a reference) and you can even spec the context with :opt-un to help ensure that a consistent pattern is followed throughout the app. Most of your unit tests can leverage a single context.edn file, which really just follows the shape of your request, responses from other services, config, and commonly used internal data.
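A spec along those lines might look like the following sketch. The key names mirror the handler above, but the predicates (plain map?) are placeholders, not a recommendation:

```clojure
(require '[clojure.spec.alpha :as s])

;; Placeholder specs for the individual context keys.
(s/def ::request map?)
(s/def ::data map?)
(s/def ::b-data map?)

;; The request is always present; the other keys are optional, but when
;; they do appear they must conform to their own spec.
(s/def ::context
  (s/keys :req-un [::request]
          :opt-un [::data ::b-data]))

(s/valid? ::context {:request {:param 1}})
;; => true
(s/valid? ::context {:request {:param 1} :data "not a map"})
;; => false
```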

There are a lot more app-specific details to unpack of course, like parallelization, deciding what common API call sequences can be moved into helpers, what should be moved into middleware, etc., but the general idea is still a shallow function tree with side effects at the root.

didibus · December 13, 2022, 11:20pm

Ya, exactly. Lots of benefits derive from the shallow stack.

And you can start to grow these reusable “workflows” as well where say one shallow orchestrating workflow method uses another when a big chunk of it can be used in multiple routes.

This grows the call stack a bit, but benefits reuse, and it still keeps it shallower:

A -> B
A -> C  -> D
        C  -> E
        C  -> F
A -> G

K -> L
K -> C ;; C is reused here, like a child workflow 
K -> M

But within each of these “workflow”, you design it as if you had no knowledge of anything outside of them. This includes even the input/output.

So “C” isn’t passed the same input A is using. If it’s all maps, you can merge them obviously and ignore things, but I actually prefer to not have more than necessary, because you can easily become lazy and just start using stuff inside C that wasn’t explicitly passed to it, just because it’s there. select-keys is pretty good for that.
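For instance, in this sketch (the keys and the :saw-keys field are invented purely to make the effect visible), A narrows its own input before handing it to C:

```clojure
;; Hypothetical child workflow. It reports which keys it received so
;; the narrowing done by the caller can be observed.
(defn C [c-input]
  {:charged  (:amount c-input)
   :saw-keys (set (keys c-input))})

;; A passes C only the keys C is meant to use, not its whole input.
(defn A [input]
  (C (select-keys input [:order-id :amount])))

(A {:order-id 7 :amount 30 :session "secret" :db-conn :conn})
;; => {:charged 30, :saw-keys #{:order-id :amount}}
```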

Another benefit is it’s trivial now to create X:

X -> B
X -> E
X -> M
X -> D
X -> L

You get so much more reuse here than if you’d had:

A -> B -> C -> D  -> E  -> F -> G

How do you create K or X out of this?

By extracting the control and data flow to a parent supervisor (or whatever you want to call it, parent workflow, parent orchestrator, wtv), you’ve gained a lot of reuse and limited the breakage at a distance.

That supervisor can also query/extract/transform and apply side effects in between steps. It can act as an adapter between every step, so if one step changes what it returns, the supervisor can just adapt it back to what the next step expects. Or if a later step now needs one more piece of input, the supervisor can just go get it before calling the step; the other steps don’t care.

Edit:

And on the shape of data. What I do normally is I have a strong domain model, basically all entities that make sense in my app, that I operate on, things like User, Player, Car, Balance, Contact, Transaction, Damage, things that tend to have meaning even to stakeholders and users of the app.

These shapes are well specced; you could even use a Record if you wanted, or have a constructor function that returns a map and does spec validation.
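A constructor of that kind could be sketched like this. The Player entity, its two attributes and the make-player name are assumptions chosen for the example:

```clojure
(require '[clojure.spec.alpha :as s])

;; Hypothetical domain entity: a Player with an id and a name.
(s/def :player/id pos-int?)
(s/def :player/name (s/and string? seq))   ; non-empty string
(s/def :domain/player (s/keys :req [:player/id :player/name]))

(defn make-player
  "Returns a Player map, or throws ex-info when the shape is invalid."
  [id player-name]
  (let [player {:player/id id :player/name player-name}]
    (if (s/valid? :domain/player player)
      player
      (throw (ex-info "Invalid Player"
                      {:explain (s/explain-data :domain/player player)})))))

(make-player 1 "Ada")
;; => {:player/id 1, :player/name "Ada"}
```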

I think more about those, and I commit to their shape, so if I break the shape I’m willing to refactor all functions that operated on them. Because of that, I also try to limit how often I’d make breaking changes to them, and I spend a bit more time upfront thinking about what shape they should have.

I also then consistently use those names everywhere to refer exclusively to these.

And it helps to club all functions operating on those shapes together in the same namespace, so if you change the shape everything you need to refactor is in the same namespace.

But for every other kind of input/output, I limit the shape to just the function and its direct callers. So I’d expect the caller to transform the shape it has into whatever the function input is, and the output back to whatever the next step wants.

didibus · December 14, 2022, 12:16am

So maybe with all that said I would apply the following refactor:

(defn A [{:keys [request], {:keys [param]} :param, :as input}]
  (log input) ;; this is the only reason I have `:as input`, because otherwise I explicitly state what keys from input A actually uses
  (let [external-data @(post "data.com" param)
        b-output-value (B {:data external-data
                           :some-value (:some-value request)})
        c-output-value (C {:data b-output-value,
                           :some-other-value (:some-other-value request)})]
    {:a-output c-output-value}))

The difference here is I’ve also decoupled the shape between all steps. The name of the keys for each step’s input is managed by A directly, and not implicitly sharing the name that A’s input was using. Also in theory the shape of the input is also managed by A, so maybe C doesn’t take a map, no problem:

(defn A [{:keys [request], {:keys [param]} :param, :as input}]
  (let [external-data @(post "data.com" param)
        b-output-value (B {:data external-data
                           :some-value (:some-value request)})
        c-output-value (C b-output-value (:some-other-value request))]
    {:a-output c-output-value}))

The benefit is that you can implement C independent of A or B, and then A can still use C and give C the data it wants in the shape it wants it.

If you reuse C in other places as well, they might not all have the same exact shape of map that C would have magically also supported, so this adapts C easily to all those places.

And the best benefit in my opinion is if B returns a map, C doesn’t need to care:

(defn B [{:keys [b]}]
  {:b (inc b)})

(defn A [{:keys [request], {:keys [param]} :param, :as input}]
  (let [external-data @(post "data.com" param)
        b-output-value (B {:data external-data
                           :some-value (:some-value request)})
        ;; A unwraps B's map, so C never sees B's output shape
        c-output-value (C {:data (:b b-output-value)
                           :some-other-value (:some-other-value request)})]
    {:a-output c-output-value}))

In my opinion, it’s very little extra effort, but helps reduce breakage at a distance.

Now what I was saying was, if A is a very useful piece of functionality that is also all tightly coupled conceptually, you could instead choose to do:

(defn A [{:keys [request], {:keys [param]} :param, :as input}]
  ;; This is our context map with the initial data all steps will need from A's input
  (-> {:external-data @(post "data.com" param)
       :some-value (:some-value request)
       :some-other-value (:some-other-value request)}
      (B) ; Will grab what it needs from the context, and assoc its result to it
      (C) ; Will grab what it needs from the context, including what it needed from B and assoc its result to it
       ;; Finally we return C's result in this case, or whatever else we'd want
      (:c-result)))

This makes B and C harder to reuse, and C depends on B associng the right key in the right shape, but it’s easy on the eyes, and you can go faster implementing A this way, but you have to keep all the steps and what they do together in your head to be sure the previous steps put the right key/values for the following steps. But when you know you won’t need to reuse C or B, and that the semantic logic itself is pretty tight between all these, I think it’s ok, but doing the whole app this way is too much.

Harleqin · December 14, 2022, 10:35am

I don’t think that throwing things together just because they appear together is useful. I would say that the single responsibility principle should also be held up for data structures.

If you have to do multi-level destructuring, or if you can’t find a better name for your new thing than »data« or »context«, or if you can never use trace for debugging because its output is clogged by »context«, then this should maybe be a hint that you’re going astray somewhere.

The abstract call pattern that you write as:

A -> B
A -> C
A -> D

should in most cases translate to something like

(defn A [foo bar baz]
  (let [b (B foo bar)
        c (C b baz)]
    (D foo b c)))

This way, only actual dependencies create coupling.

I would even go so far to say that creating an aggregate data structure just for the purpose of being able to use a threading arrow (or other point-free-style composition) is an anti-pattern.

Anthony_Leonard · December 28, 2022, 1:37am

This very recent post - Structuring Clojure Applications - describes (I think) a great way of approaching complex apps:

it clearly separates side effects from pure functions using clean architecture ideas
it allows functionality to be clearly extended, new actions need only define new multimethods that do not affect existing code
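To make the second point concrete, here is a generic multimethod sketch. It is not taken from the linked post; handle-action and the :greet and :farewell actions are invented to show that a new action can be added without editing existing code.

```clojure
;; Dispatch on the :action key of the incoming command map.
(defmulti handle-action :action)

;; Fallback for actions nobody has implemented.
(defmethod handle-action :default [{:keys [action]}]
  {:error (str "unknown action " action)})

(defmethod handle-action :greet [{:keys [name]}]
  {:message (str "Hello, " name)})

;; Added later, possibly in another namespace: nothing above changes.
(defmethod handle-action :farewell [{:keys [name]}]
  {:message (str "Goodbye, " name)})

(handle-action {:action :farewell :name "Noah"})
;; => {:message "Goodbye, Noah"}
```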

The OP asked for rich codebase examples using “best practices” and I don’t know of any using the above - but perhaps the author @Yogthos could point to any available?

I have been looking for a mini-framework such as the above for some time (see below *). It is a framework (the extension points are multimethods and protocols), and frameworks are not fashionable, I think, in Clojure circles, which prefer libraries for good reasons we all know. Web development in other languages has always been dominated by its prescriptive frameworks which eventually bloat and frustrate developers attempting even the simplest things, particularly those wizened, experienced devs in small code shops that get drawn to Clojure in the first place for the freedom it offers. Alternatively your world may be like mine in that those around you are in large, high-churn, enterprise-y teams of non Clojure enthusiasts or even particularly experienced devs, that just want to be able to maintain and extend an already huge app without understanding the rest - which is how our business thinks too. In that case I think a mini framework limiting options and guiding naming conventions and code structure really would make us more productive. I can’t be sure, it’s just I think I see the converse every day, where the lack of a consistent code structure or framework in our Clojure services makes devs have to know everything at once, which hurts productivity, and makes persuading others about the wonders of Clojure harder and not easier.

I also wonder, if Clojure is to spread to more “boring” large workplaces with less skilled teams like this, perhaps it needs more of these prescriptive mini frameworks to emerge, to give a helping hand to those curious about Clojure but ultimately give a better deal to their own paymasters. That folks are still posting novel new ways of approaching what are common problems for all of us surely shows that the “best practices” are far from being fully established.

*FWIW the version in my head centred around a recursive central “runner” loop, where a pure function would take a command and any gathered information (including existing events/state) and return a new event or “missing information”. The loop would run multimethods to try and resolve the missing information if any, and if found rerun the whole (pure) business function now using the extra added information returning more events etc that the runner would know how to store and publish etc.
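A very rough sketch of what such a runner might look like, with everything in it assumed for illustration: place-order is the pure business function, resolve-info is the multimethod that goes and finds missing information (here a hard-coded map stands in for a database read), and run is the loop. Storing and publishing the events is left out.

```clojure
;; Pure business function: given a command and the information gathered
;; so far, it returns either {:missing key} or {:events [...]}.
(defn place-order [command info]
  (cond
    (not (contains? info :stock))
    {:missing :stock}

    (< (:stock info) (:qty command))
    {:events [{:type :order-rejected :reason :out-of-stock}]}

    :else
    {:events [{:type :order-placed :qty (:qty command)}]}))

;; Side-effecting resolvers, one per kind of missing information.
(defmulti resolve-info (fn [missing-key _command] missing-key))

(defmethod resolve-info :stock [_ command]
  ;; Hard-coded stock levels standing in for a real lookup.
  (get {"widget" 5} (:sku command) 0))

;; The runner: call the pure function, resolve whatever it reports as
;; missing, and rerun it until it returns events.
(defn run [business-fn command]
  (loop [info {}]
    (let [{:keys [missing events]} (business-fn command info)]
      (if missing
        (recur (assoc info missing (resolve-info missing command)))
        events))))

(run place-order {:sku "widget" :qty 2})
;; => [{:type :order-placed, :qty 2}]
```

A real version would presumably need a guard against a business function that keeps asking for the same key.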