Managing large codebases in Clojure


#1

One of the biggest advantages of static typing is how IDEs can leverage it to manage large codebases. And I’ve heard it said frequently that Clojure is best suited to small teams.

How can large teams and/or large codebases in Clojure be managed?


#2

I’d submit that the perspective of the Clojure community is that there’s an important difference between large codebases and complex codebases. Static typing and IDEs offer tools that help with some aspects of complexity, but Clojure offers mechanisms to avoid complexity, even in a larger codebase.

To put a finer point on that, a codebase becomes complex when it has lots of interwoven components mutating over time. Clojure encourages us to unweave these components where possible, use immutable data so things don’t change out from under us, use standard data structures so everything is inspectable, and invoke simple, consistent concurrency semantics when things truly must change. It doesn’t matter so much that a codebase has 300 types of records when you have immutable data and referential transparency—you need to understand the records that are in lexical scope, and you can forget about the others because they aren’t involved.
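
A minimal sketch of the immutability point, using made-up names (`order`, `add-item`) that don't come from any codebase discussed here: "updating" a plain map returns a new value and leaves the original untouched, so code still holding the old value can't have it change out from under it.

```clojure
;; Hypothetical example data: a plain map, printable and inspectable as-is.
(def order {:id 42 :items ["tea"]})

(defn add-item
  "Returns a new order with `item` appended; the argument is not modified."
  [order item]
  (update order :items conj item))

(def bigger-order (add-item order "milk"))

(:items bigger-order) ; the new value has both items
(:items order)        ; the original still has only "tea"
```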

In terms of pragmatic suggestions, I think it’s mostly pretty standard stuff: code reviews, automated testing, continuous integration, and well-defined interfaces (microservices, spec, Schema, good namespacing, queues, etc.).

EDIT: a couple others that are a bit more Clojure-specific:

  • REPL-driven development. There’s no better way to understand a system than to run pieces of it and see the results with your own eyes. And this is much easier when the code is written in a functional (i.e. non-mutating) fashion.
  • An extension of the above: remote REPL sessions. Connecting your editor to a running production server can be a lifesaver when trying to resolve bugs that occur only rarely or only in production. I always feel like I’m applying a stethoscope to a patient when I connect to a running server and eval here and there to inspect its state. Then when you find the bug you can dynamically eval a corrected function into the server, ascertain that it’s working properly, then disconnect and commit. It’s awesome, if a bit cowboy-esque.
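
To illustrate the "eval a corrected function" step, here is a small sketch. The functions `display-name` and `greeting` are hypothetical stand-ins for code already loaded in a running process, and the sketch assumes the code was not compiled with direct linking (which would keep callers bound to the old definition).

```clojure
;; Hypothetical functions standing in for code already running on the server.
(defn display-name
  "Buggy version: puts the family name first."
  [{:keys [given family]}]
  (str family " " given))

(defn greeting [user]
  (str "Hello, " (display-name user) "!"))

(def before-fix (greeting {:given "Ada" :family "Lovelace"}))

;; Evaluating a corrected definition at the REPL rebinds the var.
;; `greeting` is not re-evaluated, yet it picks up the fix on its next call.
(defn display-name
  "Fixed version: given name first."
  [{:keys [given family]}]
  (str given " " family))

(def after-fix (greeting {:given "Ada" :family "Lovelace"}))
```

Here `before-fix` holds the wrong ordering and `after-fix` the corrected one, even though only `display-name` was re-evaluated.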

#3

In no particular order (all are important):

  • Have unit and integration tests for your interfacing functions.
  • Use spec to model your domain entities and values.
  • Spec your pure functions and set up generative testing on them.
  • Use eastwood as a linter on your code base.
  • Set up a code formatter like cljfmt.
  • Spec the data you serialize and validate it before persisting it or passing it over to another system.
  • Keep your function call stack as flat as possible.
  • Separate your logic into pure functions that transform data, IO functions that only do the IO, and flow functions that orchestrate the order of operations between the pure and IO functions. Restrict global data access to the flow functions.
  • Use namespaces to group common components and their functionality together.
  • Have a convention to separate interfacing functions and vars from implementation-detail ones, such as using public/private or an impl namespace.
  • Teach everyone REPL-driven development, and make sure they have a proper editor setup for it: they should know how to use auto-complete, go to definition, show docs, print the full stack trace of errors, and evaluate code in the REPL.
  • Have good code review practices: all code should be reviewed by at least two people, one who is already familiar with Clojure and another who can be anyone else.
  • Document your functions and vars appropriately.
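
A rough sketch of how a few of these points can fit together: a spec'd domain entity, validation before persisting, and the pure / IO / flow split. Everything here is an assumption made for illustration: the `::user` entity, the function names, and the atom standing in for a real database.

```clojure
(require '[clojure.spec.alpha :as s])

;; Hypothetical domain entity: a user with a non-empty name and a positive age.
(s/def ::name (s/and string? not-empty))
(s/def ::age pos-int?)
(s/def ::user (s/keys :req-un [::name ::age]))

;; Pure function: data in, data out.
(defn birthday [user]
  (update user :age inc))

;; IO function: only does the IO. An atom stands in for a real database here.
(defonce db (atom []))

(defn- save-user! [user]
  (swap! db conj user)
  user)

;; Flow function: orchestrates the pure and IO steps, and validates
;; the data before it is persisted.
(defn celebrate-birthday! [user]
  (let [updated (birthday user)]
    (if (s/valid? ::user updated)
      (save-user! updated)
      (throw (ex-info "Invalid user" (s/explain-data ::user updated))))))
```

Note that `save-user!` is private (`defn-`), which is one way to mark it as an implementation detail, and that only the flow function touches the global `db` indirectly through the IO function.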

#4

I agree with people here. As long as you limit side effects to the edges of your program and use immutable data structures everywhere possible (which is almost everywhere), immutability will protect you. Then it’s a matter of good software engineering practices, like @didibus said: good code naming/organization, separation of concerns, etc.

And the REPL (with the immutable data structures) really is the secret weapon. On my team I’m unfortunately the only one who uses a REPL-connected editor and knows some tooling to inspect/debug code, and I really feel I’m way faster than my workmates when it comes to debugging. Hopefully this will change, as I’ll be teaching them some REPL-fu soon. (I (over)use @vvvvalvalval’s scope-capture and sometimes datascope; their combination is just… futuristic for the rest of the world.)

Note that the good engineering practices are the same as in any other language; the REPL just puts Clojure above most of them.


#5

I don’t have much to add beyond what people here have noted, but I wrote a blog post on some of the issues I’ve come across in large codebases. TL;DR: the developer disciplines mentioned here (REPL-based dev, SRP, judicious use of Spec/Schema, etc.) are very important. http://devcycle.co.uk/clojure-is-the-devil/


#6

Best practices from the largest personal Clojure project (Lin Pengcheng Financial Analyser):

1. IDE: Notepad++ (ClojureBoxNpp).

2. Version control: 7z.exe.

3. Programming ideas (PurefunctionPipelineDataflow) modeled on the following:

    Imaginative programming: everything is an algorithm, at your fingertips.
    The most valuable chapter of “Code Complete”: Chapter 2, Metaphors for a Richer Understanding of Software Development
    Business management thinking
    Pipeline technology for large industrial production
    Business process reengineering
    Enterprise organization, system, process design thinking
    Accounting
    Integrated circuit diagram
    Urban water network
    Boeing aircraft pulse production line technology
    Confluence technology of rivers from the source to the sea

4. Data-centric dataflow: design a data model that is simple and fluent to manipulate. The shortest path between two points is a straight line, so manipulate the data directly from its initial state to its final state.

5. Pure Clojure.

6. Don’t use OO, FP, AOP. They are overly complex hand-workshop-level technologies.

7. Don’t write middleware, macros, or loops. They are hard to read and difficult to debug and observe.

8. REPL-driven development.

9. Try to design each function as a pure function (pipe function) of a single hash-map parameter.

10. Minimize front-end code.

11. Side effects can only appear at the end of the pipe.

12. Try to use threading macros.

13. Code linearization, schematization, simplification. What You See Is What You Get.

14. Use namespaces to achieve good code structure.

15. Normalize data.

16. Data verification only appears at the beginning of the pipeline.

17. Use the clojure.core API to manipulate data and strengthen your data model design skills. Don’t use libraries like Specter.

18. Use and design “simple DSLs”, like hiccup and honeysql. Using a DSL is code conversion, and a data-style representation is better than a function-style representation. A series of pipeline functions is chained together to form a compiler that converts DSL data into target code and then evaluates it.

19. The best abstraction is: data and logic are strictly separated, data flow is current flow, a function is a chip, a threading macro (->>, -> etc.) is a wire, and the entire system is an energized integrated circuit.
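
To make the "simple DSL" item a bit more concrete, here is a minimal sketch of a made-up, hiccup-like data DSL and a tiny renderer for it. This is not hiccup's actual implementation or API; `render`, `render-attrs`, and `todo-list` are hypothetical names, and the renderer does no HTML escaping.

```clojure
(require '[clojure.string :as string])

;; A made-up, hiccup-like DSL: [:tag {attrs} & children]. Not hiccup itself.
(defn render-attrs
  "Turns an attribute map into a string such as ` class=\"todo\"`."
  [attrs]
  (string/join (for [[k v] attrs]
                 (str " " (name k) "=\"" v "\""))))

(defn render
  "Converts DSL data into an HTML string (no escaping, for brevity)."
  [node]
  (if (vector? node)
    (let [[tag & more]     node
          [attrs children] (if (map? (first more))
                             [(first more) (rest more)]
                             [{} more])]
      (str "<" (name tag) (render-attrs attrs) ">"
           (string/join (map render children))
           "</" (name tag) ">"))
    (str node)))

;; Because the DSL is plain data, a pipeline of clojure.core functions
;; can build it before the final conversion step.
(defn todo-list [items]
  (->> items
       (map (fn [item] [:li item]))
       (into [:ul {:class "todo"}])
       render))
```

For example, `(todo-list ["tea" "milk"])` builds the vector `[:ul {:class "todo"} [:li "tea"] [:li "milk"]]` and then converts it to a string in the last step of the pipe.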


#7

Another yes to everything that @camdez and @didibus said and I’ll particularly call out good namespace naming and organization (something that we weren’t very good about when we started in 2011 but are increasingly getting better at now). That latter area is where we could all do with a lot more guidance and written articles, I think. Many of the other bullet points mentioned are much more straightforward.

I guess there’s also the question of what counts as a “large” codebase in Clojure. There was a talk at Conj last year (I think) about a “large” codebase that was in the 30-40K range. Here are the stats on our codebase (we run this every week and track the output so we can see code growth – or shrinkage – over time):

Clojure build/config 47 files 2532 total loc
Clojure source 260 files 61555 total loc,
    3278 fns, 673 of which are private,
    383 vars, 42 macros, 60 atoms,
    468 specs, 19 function specs.
Clojure tests 147 files 19176 total loc,
    23 specs, 1 function specs.

The build/config total includes both our (large) build.boot file and all our EDN files (both for configuration and for dependencies – we manage those external to Boot in a precursor to deps.edn).


#8

What did you use to produce those numbers? It doesn’t look like cloc.


#9

It’s just a shell script that finds certain types of files and uses wc and fgrep :) I knocked it up originally to track aspects of our legacy codebase nearly ten years ago and added tracking for Clojure as we started to use that as well :)
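
The actual script is shell and isn't shown here. As a rough Clojure equivalent of the same idea (count lines, then count the lines containing certain fixed strings), something like the following could work. The search strings and the names `lines-containing` and `source-stats` are guesses for illustration, not what the real script uses.

```clojure
(require '[clojure.string :as string])

(defn lines-containing
  "Counts the lines that contain the fixed string `needle`, like `fgrep -c`."
  [lines needle]
  (count (filter #(string/includes? % needle) lines)))

(defn source-stats
  "Returns rough counts for one chunk of source text.
  `:fns` includes private fns, since \"(defn\" also matches \"(defn-\"."
  [source]
  (let [lines (string/split-lines source)]
    {:loc         (count lines)
     :fns         (lines-containing lines "(defn")
     :private-fns (lines-containing lines "(defn-")
     :macros      (lines-containing lines "(defmacro")
     :atoms       (lines-containing lines "(atom")}))

;; Hypothetical sample input standing in for the contents of a source file.
(def sample
  (string/join "\n"
               ["(ns demo.core)"
                "(def state (atom {}))"
                "(defn- helper [x] (inc x))"
                "(defn run [x] (helper x))"
                "(defmacro unless [test & body] nil)"]))

(source-stats sample)
```

Being plain string matching, this over- and under-counts in the same ways a grep-based script would (for instance, it would count a `(defn` inside a comment).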


#10

“Largest” according to what metric, out of curiosity? LoC / number of contributors / scale of deployment / … ?


#11

“Largest” according to LoC. I have only written one formal project; this one is a personal amateur project.

It is not based on OO or FP. It uses its own pure function pipeline dataflow programming technique, which is based on large-scale industrial ideas. I think it is a better technique than OO and FP, which are just hand-workshop-level technologies.

Version in development (rewrite, based on Luminus, pure Clojure(Script)):
clojure: 34k+ lines (includes a bit of code for REPL testing)
clojurescript: 5k+ lines (includes a bit of code for REPL testing)

Last version (pure clojure-clr, .NET WinForms app):
clojure-clr: 25k+ lines (doesn’t include test code)


#12

There will be a talk about a large Clojure codebase at the ClojuTRE conference.

Now, after 6 years, 120 000 end-users, 10 000 permits applied monthly, well over 40k commits made by 25 programmers over the time resulting over 125k lines of Clojure code, it’s interesting to take a look how this controversial and risky language has served us.

https://clojutre.org/2018/#jarppe. Will be recorded.


#13

My project is now a 7-year-old Clojure project. I have too little spare time available for development, so for now it can only be regarded as the largest personal project.

In the future, perhaps it can strive to be the project with the most end users.


#14

Can you elaborate on that, please?


#15

Of course. It’s a little hard to explain, but I’ll try my best.

What I mean by this is that you want to try and split up your logical operations and their control flow.

Say A, B, and C are the operations you need performed. Let’s assume none of them needs any input data and each simply prints out its name.

(defn A [] (println "A"))
(defn B [] (println "B"))
(defn C [] (println "C"))

Now say you want to print A, B, and C in order. It’s often tempting to do:

;; B and C are called before they are defined, so declare them first.
(declare B C)

(defn A [] (println "A") (B))
(defn B [] (println "B") (C))
(defn C [] (println "C"))

(A)

This creates a deep/nested call stack, where you’ve coupled your operations and their flow together.

Instead favor a shallow/flat call stack:

(defn A [] (println "A"))
(defn B [] (println "B"))
(defn C [] (println "C"))

(do (A) (B) (C))

With top level orchestration.

Hope that clarifies it.
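
The same shape carries over when the operations take and return data. A minimal sketch, with hypothetical function names invented for illustration: each step knows nothing about the others, and a single top-level function decides the order.

```clojure
;; Hypothetical operations: each takes data and returns data,
;; and none of them calls the others.
(defn parse-amounts [line]
  (map #(Long/parseLong %) (re-seq #"\d+" line)))

(defn total [amounts]
  (reduce + amounts))

(defn report [sum]
  (str "Total: " sum))

;; Top-level orchestration: the only place that knows the order of the steps.
(defn summarize [line]
  (-> line
      parse-amounts
      total
      report))
```

Because `total` never calls `report`, each step can be run and tested at the REPL on its own, e.g. `(total [1 2 3])`.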


#16

Keep the pipeline as flat as possible.
Let the reader see the flow of the code at a glance, like a company’s business process.

Data verification only appears at the beginning of the pipeline.
Side effects can only appear at the end of the pipeline.
Normalize data.

(some->>  data-map
          opt-valid-pure-fn
          pipe-normalize-data-pure-fn
          (map    pipe-transform-data-pure-fn1 ,)
          pipe-transform-data-pure-fn2
          (reduce pipe-transform-data-pure-fn3 {} ,)
          opt-side-effects-fn)

(defn f [{:keys [x y z] :as m}]
  (->  (> x 1)
       (and , (< y 10))
       (and , (> z 99))
       (if , :t :f)))

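(require '[clojure.string :as string])

;; Combines a base path s1 with a path s2, using "/" as the separator.
(defn path-combine [s1 s2]
  (cond
    ;; s2 is absolute: it wins outright.
    (string/starts-with? s2 "/")
      s2
    ;; s1 names a file: drop its last segment, then retry with the directory.
    (not (string/ends-with? s1 "/"))
      (-> (string/split s1 #"[\\/]")
          butlast
          (#(string/join "/" %))
          (str , "/")
          (path-combine , s2))
    ;; s1 is a directory: join and collapse repeated separators.
    :else
      (-> (string/join "/" [s1 s2])
          (string/replace ,  #"[\\/]+" "/"))))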
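
The `some->>` skeleton above uses placeholder function names. As a runnable sketch of the same layout (validation first, normalization next, pure transforms in the middle, the side effect last), here is one possible filling-in. All names, the order map shape, and the atom standing in for storage are assumptions made for this example.

```clojure
(require '[clojure.string :as string])

;; All names below are hypothetical stand-ins for the placeholders above.
(defn valid-order
  "Validation at the start: returns the order, or nil to stop the pipeline."
  [order]
  (when (and (string? (:customer order))
             (seq (:prices order)))
    order))

(defn normalize
  "Normalizes the customer name: trimmed and lower-cased."
  [order]
  (update order :customer (comp string/lower-case string/trim)))

(defn add-total
  "Pure transform: adds the sum of the prices under :total."
  [order]
  (assoc order :total (reduce + (:prices order))))

;; An atom stands in for real storage.
(defonce saved (atom []))

(defn save!
  "Side effect at the very end of the pipe."
  [order]
  (swap! saved conj order)
  order)

(defn process [order]
  (some->> order
           valid-order
           normalize
           add-total
           save!))
```

Since `some->>` stops as soon as a step returns nil, an order that fails `valid-order` never reaches `save!`, so no side effect happens for invalid input.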