How to structure the codebase for data shape discoverability

I personally use malli for data spec, mostly on the API part.

However, inside application domain functions, I am often at a loss as to the data shape of the function parameters.

To alleviate that, I use map destructuring on the parameters. However, it still has its shortcomings:

  1. For nested data structures, destructuring further in the parameter list is verbose.
  2. In most app logic, the domain entities contain tens of fields, so destructuring makes the code unmaintainable.

Some people may argue that dynamic languages have this shortcoming built in, but in Python we have type annotations (and dataclasses), and that pretty much solves the problem. I understand we also have core.typed, but it does not seem to be widely adopted.

So, in Clojure land, how do we solve this problem? I feel like this issue faces everyone writing production code that serves moderate traffic.

If you find that dataclasses in Python solve the problem, you have two similar solutions in Clojure:

  1. Use records
  2. Use spec (or some other similar lib)

At work we use clojure.spec heavily and it covers this use case, as well as many others. I wrote about all the different ways we use it in this blog post from 2019: An Architect’s View: How do you use clojure.spec (corfield.org)


To be honest, it is the type annotation on the dataclass object that solves the problem.

Could you show a sample? Spec or malli is the big picture, but it doesn’t smoothly solve the problem of:
At some random function in the codebase, you ask, what’s the data shape of the function parameters?

How do you address this? Destructuring, spec-ing, or more documentation? Do you spec every function? When do you spec and when do you stop? Do you stop too often?

These are some practical problems that cannot be simply solved by the spec-ing concept.

Can you elaborate? Type annotations on dataclasses are not enforced at runtime; they’re basically documentation. Spec, on the other hand, has type restrictions and can be enforced at runtime if you want.

Or did you mean that Mypy solved the problem?

Do you spec on every function?

No, only where I feel it matters.

I recently worked on some legacy code that I did not author – the developer who wrote it moved on a few years back – and it performed some fairly complex data manipulation on a structure that I was not familiar with.

I started to write specs for some of the functions in the pipeline, turned instrumentation on, and started to explore how the code behaved. It took a while for my understanding to coalesce, so the specs gradually evolved (and became both more specific and more complex) while I worked on the code.

After about a day spent working on specs, instrumentation, and RCFs (Rich Comment Forms), I had a solid enough understanding of that part of the system that I was comfortable making the changes that had kicked off this exploration.

Stu Halloway talks about using Spec to gain understanding of code in one of his talks – sorry, I don’t remember which one – and I think it’s a very valuable use of Spec.

In general, good function and argument names, and good docstrings, should tell the reader all they need to know – see Zach Tellman’s excellent book “Elements of Clojure” – but using Spec as an additional form of documentation for data structure can also help. Destructuring can also be a helpful tool. There’s no One True Way(™) here.

It’s important not to overspecify things tho’ – Spec isn’t a “type system” and Clojure’s design is inherently “open for extension” so you should only specify what a function requires rather than trying to completely specify the entire data structure.


I use truss to check all my assumptions about data shape inside functions.

Let me add some details to my answer to make it more useful.

What I generally do is use Spec (but you can substitute whatever else you like):

;;;;;;;;;;;;;
;;;; Car ;;;;

(require '[clojure.spec.alpha :as s])

(s/def :car/name string?)
(s/def :car/model #{:subaru :ford :nissan})

(s/def :car/car
  (s/keys :req-un
    [:car/name :car/model]))

That’s like my “class”, except it’s not a class at all, just a generic description of the shape of a map that represents a logical Car in my application.

Then I’ll add a constructor function, which I normally put right under it:

(defn make-car
  ([car] (s/assert :car/car car))
  ([name model]
   (make-car
     {:name name
      :model model})))

So all together now you have:

;;;;;;;;;;;;;
;;;; Car ;;;;

(s/def :car/name string?)
(s/def :car/model #{:subaru :ford :nissan})

(s/def :car/car
  (s/keys :req-un
    [:car/name :car/model]))

(defn make-car
  ([car] (s/assert :car/car car))
  ([name model]
   (make-car
     {:name name
      :model model})))

;;;; Car ;;;;
;;;;;;;;;;;;;

Now anywhere in my app where I use a car, I’ll name the variable or parameter car, and if I destructure it I’ll write {:keys [model] :as _car} to indicate that I expect a car to be passed, even though I only use its :model key.
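As a small sketch of that convention (the function name ford? is made up for illustration, it isn’t from my codebase), a function that only needs the :model might look like this:

```clojure
(defn ford?
  "Takes a car and returns true when its :model is :ford."
  ;; :as _car documents that a whole car is expected, even though
  ;; only :model is used; the underscore marks the binding as unused.
  [{:keys [model] :as _car}]
  (= :ford model))

(ford? {:name "Focus" :model :ford}) ; => true
```

The parameter vector alone tells the reader which entity is expected and which keys the function actually depends on.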

And when I need to update a car I’ll just add a call to make-car after I’ve made my changes to it, and the doc-string will mention I return a car.

(defn change-name
  "Takes a car and a new-name for it, and returns a car
   whose :name is the new-name."
  [car new-name]
  (-> car (assoc :name new-name) (make-car)))

Generally I find that’s enough: I use an entity’s name only when I expect the value to be valid against that entity’s spec. Anywhere you see something called car, you know it’s supposed to conform to that spec.

When something isn’t a top-level entity, say :car/model when I also have :truck/model, I don’t call the variable or parameter model but instead call it truck-model.

My top level entities are normally specced as :car/car, :truck/truck.

Sometimes there’s a deeper hierarchy, say each model is itself a map, and those conflict too. That’s pretty rare, but it happens.

Say a user has a credit-card that has a number, which is the credit card number. But in another context a bank has a credit-card that has a number, and that number refers to the type of card (Sapphire gold, Sapphire silver), not the actual credit card number.

Like:

;; User
{:user-id 123
 :credit-card {:number "7467-7364-8283-2234"}}

;; Bank
{:bank-name "Chase"
 :credit-card {:number 482684}}

I’ll use the following names:

(s/def :bank.credit-card/number ...)

(s/def :user.credit-card/number ...)

And in my code I’ll call these user-credit-card-number and bank-credit-card-number.

It’s kind of rare though that a function takes that directly. I wouldn’t name locals like that, because with locals the context is clear:

(defn do-wtv
  [bank]
  (let [number (-> bank :credit-card :number)]
    ...))

And keep in mind that often these are not hierarchical. If I use the same credit-card everywhere, I’d have:

(s/def :credit-card/credit-card
  (s/keys :req-un
    [:credit-card/name :credit-card/number :credit-card/expiry]))

(s/def :bank/bank
  (s/keys :req-un
    [:credit-card/credit-card]))

It’s only if, under some entity, you’ve got something called credit-card that is logically a different kind of credit-card, specific only to that entity, that I’d give it a hierarchical name like bank/credit-card or bank.credit-card/number.

And if you need derived entities, I use multi-spec and a :type key on my maps.

And when I make a backward-incompatible change, I also turn the spec into a multi-spec and add a :version key.
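Here is a minimal sketch of the :type flavour. The :vehicle/... and :truck/payload-kg names are invented for the example; only s/multi-spec, defmulti and defmethod are the real API. A :version key would work the same way, just dispatching on :version instead.

```clojure
(require '[clojure.spec.alpha :as s])

;; Hypothetical entity: a vehicle that is either a car or a truck.
(s/def :vehicle/type #{:car :truck})
(s/def :vehicle/name string?)
(s/def :truck/payload-kg pos-int?)

;; The multimethod picks a spec based on the map's :type key.
(defmulti vehicle-spec :type)

(defmethod vehicle-spec :car [_]
  (s/keys :req-un [:vehicle/type :vehicle/name]))

(defmethod vehicle-spec :truck [_]
  (s/keys :req-un [:vehicle/type :vehicle/name :truck/payload-kg]))

;; The second argument is the retag, used when generating values.
(s/def :vehicle/vehicle (s/multi-spec vehicle-spec :type))

(s/valid? :vehicle/vehicle {:type :car :name "Outback"})                   ; => true
(s/valid? :vehicle/vehicle {:type :truck :name "F-150"})                   ; => false, no :payload-kg
(s/valid? :vehicle/vehicle {:type :truck :name "F-150" :payload-kg 900})   ; => true
```

Adding a new derived entity is then just another defmethod, without touching the existing specs.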

Finally, at the app boundary, so wherever you do I/O, I explicitly call s/valid? to make sure I’m producing or receiving a valid entity.
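Roughly, a boundary check could look like the sketch below. read-car is a hypothetical name, and throwing an ex-info is just one way to react to bad input; the car specs are repeated so the snippet runs on its own.

```clojure
(require '[clojure.spec.alpha :as s])

(s/def :car/name string?)
(s/def :car/model #{:subaru :ford :nissan})
(s/def :car/car (s/keys :req-un [:car/name :car/model]))

(defn read-car
  "Hypothetical boundary function: raw is whatever the I/O layer decoded.
   Returns raw unchanged when it is a valid car, otherwise throws."
  [raw]
  (if (s/valid? :car/car raw)
    raw
    (throw (ex-info "Invalid car received at the boundary"
                    ;; explain-data describes which parts failed and why
                    {:explain (s/explain-data :car/car raw)}))))
```

Unlike s/assert, this check is always on, which is what you want for data you don’t control.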

Then in my code, this naming convention is generally good enough. But as I showed, I sprinkle in some make-car calls whenever I return an updated car, which runs the assert on it, and I might add a few s/fdef specs and function instrumentation in places where I want to be sure my function is called with an actual car. The other benefit is that I can then easily add generative tests for those functions. Both asserts and instrumentation are turned on only in dev/test/staging, so as not to incur their runtime cost in production.
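To make the dev-only part concrete, here is a sketch under a few assumptions: enable-dev-checks! is a made-up helper you would call from a dev or test entry point, and change-name is simplified to a plain assoc. Note that s/assert does nothing until s/check-asserts is switched on, and that instrumentation only checks the :args part of an fdef.

```clojure
(require '[clojure.spec.alpha :as s]
         '[clojure.spec.test.alpha :as stest])

;; Specs repeated from the Car example so the snippet stands alone.
(s/def :car/name string?)
(s/def :car/model #{:subaru :ford :nissan})
(s/def :car/car (s/keys :req-un [:car/name :car/model]))

(defn change-name
  "Takes a car and a new-name for it, and returns a car
   whose :name is the new-name."
  [car new-name]
  (assoc car :name new-name))

;; Describe what change-name requires and what it returns.
(s/fdef change-name
  :args (s/cat :car :car/car :new-name :car/name)
  :ret :car/car)

(defn enable-dev-checks!
  "Hypothetical helper: call from a dev/test entry point, not in production."
  []
  (s/check-asserts true)            ; s/assert now validates instead of passing through
  (stest/instrument `change-name))  ; every call has its :args checked

(enable-dev-checks!)

;; With instrumentation on, calling (change-name {:name "a"} "b") throws,
;; because the first argument is missing :model.
```

The same fdef is what a generative test via stest/check would use, though that additionally needs test.check on the classpath.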
