What is the 2021 recommendation for Specs?

No problem.

I’d also like to add some details. The reason the producer should validate before sending a payload or storing data to a data store is that you want to fail fast. Once the reader on the other side receives the payload and fails, you’re already in a state that’s much harder to recover from. Imagine storing a corrupted entity to disk: when you read it back later, you can’t just throw an error, undo the write, go back to the writer, have it catch the exception and recover, etc., because of the distributed nature of things. So you want to fail fast on the producer side, so you don’t corrupt things in ways that are harder to recover from. Ideally your REPL/tests catch these problems, and worst case you fail as soon as you hit prod and can quickly roll back with no data cleanup involved, instead of having left the DB with a bunch of broken entities, or your Kafka stream with a bunch of broken messages.

Secondly, the reason you want your reader to conform is that conformance is the act of figuring out what type of data you received and handling it appropriately. Any system will evolve its payloads and entities over time, and in a distributed setting you can’t always force the sender to migrate to your new payload format. Even if they do, there’s always a deployment overlap if you want zero downtime: there will be in-transit messages in the old payload format while the new code is deployed and starts sending the new format. So as a consumer of data, you must always be able to process both the old data and the new. That’s what conform lets you do: when you call conform on some data, the result tells you which of the many kinds and versions of data you just received, which lets you branch to the correct logic for it. It will also tell you if the data doesn’t match any of the types of data you support.
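
As a rough sketch of what that looks like with spec (the spec names and the “v1”/“v2” shapes here are made up just for illustration): an s/or spec tags each alternative, and conform tells you which one matched:

(require '[clojure.spec.alpha :as s])

;; Hypothetical payload shapes: an older v1 and a newer v2.
(s/def :user/name string?)
(s/def :user/full-name string?)
(s/def :user/email string?)

(s/def ::user-v1 (s/keys :req-un [:user/name]))
(s/def ::user-v2 (s/keys :req-un [:user/full-name :user/email]))

;; s/or tags each branch, so conform reports which version was received.
(s/def ::user-payload (s/or :v1 ::user-v1 :v2 ::user-v2))

(s/conform ::user-payload {:name "Ada"})
;; => [:v1 {:name "Ada"}]

(s/conform ::user-payload {:full-name "Ada Lovelace" :email "ada@example.com"})
;; => [:v2 {:full-name "Ada Lovelace", :email "ada@example.com"}]

(s/conform ::user-payload {:oops true})
;; => :clojure.spec.alpha/invalid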

Now, what neither validate nor conform does is serialize the data for transport. I consider that orthogonal, and that’s why I don’t like schema libs that also do data conversion. Conform is not the same as convert.

So what you do is produce the payload as Clojure data and validate that with the spec for it. Then you send that data to a layer that converts it into the transport format, where maybe it gets serialized to JSON. And then on the consumer side, you first get that transport-encoded data, and you deserialize it back into Clojure data, which you then conform.
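
Here’s a rough sketch of that flow, reusing the hypothetical ::user-payload spec from above, with clojure.data.json standing in as the (interchangeable) transport layer:

(require '[clojure.spec.alpha :as s]
         '[clojure.data.json :as json])

;; Producer side: validate the Clojure data first, then hand it to the transport layer.
(defn produce! [payload]
  (when-not (s/valid? ::user-payload payload)
    (throw (ex-info "Refusing to send an invalid payload"
                    (s/explain-data ::user-payload payload))))
  (json/write-str payload)) ;; serialization is a separate, dumb step

;; Consumer side: deserialize back to Clojure data, then conform it.
(defn consume [json-str]
  (let [payload   (json/read-str json-str :key-fn keyword)
        conformed (s/conform ::user-payload payload)]
    (if (s/invalid? conformed)
      (throw (ex-info "Unrecognized payload" (s/explain-data ::user-payload payload)))
      conformed))) ;; e.g. [:v2 {...}], so you can branch on the version tag

(consume (produce! {:full-name "Ada Lovelace" :email "ada@example.com"}))
;; => [:v2 {:full-name "Ada Lovelace", :email "ada@example.com"}]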

Now if your transport is EDN compatible, you don’t need to do much conversion, but the idea is the same: converting to/from the transport format is something I consider beyond the boundary of my app.

I prefer that to something like JSON Schema, because what I want is to work with my own domain representation; JSON is an implementation detail of my transport. With this approach you can support multiple transports and use the same schema validation/conformance on all of them.

And then for conversion you are free to use whatever lib you prefer or hand roll it.

It does mean there’s a chance your conversion has a bug that creates a bunch of broken payloads, though, so I tend to heavily test my conversion logic specifically for that.

5 Likes

The added details are also helpful, @didibus. I’m going to address some of my comments to you, because you wrote those two very helpful posts, but most of my comments are not specifically directed to you:

I understand now why some people recommend using maps over records. This is great–I appreciate the illumination. But I also think that while the use case that you’ve described is quite broad–lots of people have to serialize data to some storage, etc., and check for changed data formats, etc.–it’s not entirely general. If Clojure were only used for business, that kind of use case would be quite common, and some contexts outside of business have similar use cases as well. I can easily think of contexts involving scientific research, for example, in which exactly the sorts of concerns that your remarks identify would be applicable. What I love about those posts is they lay out so clearly the rationales for the practices you describe, didibus, so that others can evaluate whether and when they apply to their use cases.

However, not all contexts involve that sort of relationship to stored data, streams of data, etc. I think that the Clojure community has been largely business oriented–something that I like in many respects–but I wonder whether the “avoid records by default” recommendation is really a recommendation for common business applications, and shouldn’t be considered a general recommendation. I’m not taking a position on this. It’s a question I have, and maybe not a very important question.

(I’m very pleased about the recent Clojure data science community that’s developed, and I’m grateful for all of the dedicated work people are doing on developing tools for that. Despite my appreciation of the business orientation of most of the Clojure community, I’m happy to see a different kind of subcommunity develop–one that’s closer to my own interests. And fwiw, I do recognize that the kind of data serialization, etc. that you’re talking about, didibus, would be very important in some data science applications.)

1 Like

Fantastic write-up, @didibus! Thank you.

Could you please tell me: is serialization with records only a problem when communicating with external systems (storing data in a DB, two microservices communicating with each other, client and server sending data to each other)?

Or can you have serialization difficulties inside the same program (when using threads, or when using records from a library in the main app)?

The simple decision process might be: If you think that some data structure might need to be stored in DB or sent over the wire => Map. For internal use only => Map/Record.

One thing regarding records: if you use plain JSON serialization you will lose the record’s type information. You can however use libraries such as Transit, which lets you register ser/de per type and has utility functions for creating serializers and deserializers from records. I think Fressian supports it as well, and so does Nippy.
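
For example, with transit-clj something along these lines should round-trip a record (a rough sketch; record-write-handlers / record-read-handlers are the record utility functions mentioned above, but double-check the transit-clj README for the exact option shapes):

(ns example.transit-records
  (:require [cognitect.transit :as transit])
  (:import (java.io ByteArrayInputStream ByteArrayOutputStream)))

(defrecord Point [x y])

;; Writing: register a write handler for the record class so the type tag is preserved.
(defn point->transit [p]
  (let [out    (ByteArrayOutputStream. 4096)
        writer (transit/writer out :json
                               {:handlers (transit/record-write-handlers Point)})]
    (transit/write writer p)
    (.toString out "UTF-8")))

;; Reading: register the matching read handler so you get a Point back, not a plain map.
(defn transit->point [s]
  (let [in     (ByteArrayInputStream. (.getBytes s "UTF-8"))
        reader (transit/reader in :json
                               {:handlers (transit/record-read-handlers Point)})]
    (transit/read reader)))

(transit->point (point->transit (->Point 1 2)))
;; => #example.transit_records.Point{:x 1, :y 2}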

2 Likes

No recommendation is ever general though; they’re all contextual. Even goto is sometimes the right tool. I’d say this one is definitely related to information systems: if your application isn’t modeling information, but instead doing something else, like graphics rendering, signal processing, and other such things, it’s possible it doesn’t apply as much, since you might not care to serialize the data anyway, or performance might be a bigger concern.

It’s also possible, if you’re just writing some small scripts, that specs are too powerful, too expressive, and maybe the reduced scope of records as schemas is what works best.

Basically: use maps unless you know better

When using records across libraries in the same app you can have the same issues yes. When using threads they shouldn’t be a problem.

The reason is that each namespace that needs to work with an instance of the record needs to depend on its schema definition, the defrecord. So you need to take an explicit dependency from the code that uses the record to the code that defines it. That can also lead you into issues with circular dependencies, and things like that. It’s not as bad as full-on Java static classes; it mostly matters if you need to manipulate the type or the constructor.
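
To make that concrete, here’s a rough sketch with hypothetical namespaces, just to show the dependency direction:

;; src/zoo/animals.clj - owns the record definition (the "schema").
(ns zoo.animals)

(defrecord Dog [name breed])

;; src/zoo/kennel.clj - any namespace that constructs the record or references
;; its class has to depend on the defining namespace.
(ns zoo.kennel
  (:require [zoo.animals :as animals])
  (:import (zoo.animals Dog)))

(defn adopt [name breed]
  (animals/->Dog name breed)) ;; the positional constructor lives in zoo.animals

(defn dog? [x]
  (instance? Dog x))          ;; the class reference needs the :import

;; Plain keyword access, on the other hand, needs no dependency at all:
;; (:name some-dog) works from anywhere.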

Like before sending a payload, validate it meets the spec, and as soon as you receive a payload, conform it. Or before writing data to the DB, validate it, and after reading data from the DB, conform it.

Love this. A really useful distinction of the two. This is Postel’s law in a nutshell no?

I have been saying the same for years - albeit less eloquently, as I was working mostly in Java at the time. A lot of people find this overkill, but I totally agree with your more detailed explanation. The cost of saving or publishing wrong or weirdly formatted information is very high compared with the effort/risk (and indeed benefits) of coping with weirdly formatted incoming information. But most people feel comfort in strictly rejecting inputs that don’t fit their original expectations, and apply no such rigour to their own outputs, which they are convinced must be correct so long as the inputs were ok. Because no one ever had to migrate bad data after a bug made it to the wild… :grinning:

1 Like

Not “use maps for information systems unless you know better”? Is information systems the default?

Java is used for everything (except low-level stuff). C, C++, too. Python isn’t used for everything, but it would be difficult to delimit what it is and isn’t used for.

I don’t care that much about what advice is given, but Clojure is not a business language, and it’s not an information systems language. It’s a general-purpose language. I personally like it for science. I would like it to replace Python and Java for scientific research programming. I don’t see a rationale for “use maps by default” outside of the kind of context that you have delineated. It’s an important context, and maybe it’s the case that at present, it’s a pervasive context among the Clojure community. I want the community to be broader than that–and Clojure as a language has the resources to play that role.

There are some ways in which it’s important to get new users into thinking idiomatically, so that they don’t try to put round pegs into square holes. This doesn’t seem like one of them, to me.

Well, I guess if you wanted to be more precise, and want my advice, it would be to use maps for information modeling, and use records for type abstractions. Simply because I find maps have benefits over records for modeling information, and records have benefits over maps for creating type abstractions, even though extend-via-metadata tried to close that gap.

And lastly, if you have special performance needs not met by maps, but met by records, go for it. And similarly if records don’t meet your performance needs either, try arrays, try a data frame library, use some specialized data-structure, maybe even a Java mutable one, whatever gives you the boost you need.

That’s just my advice; the rules around what defines a best practice aren’t something I understand very well, and I’ve never been a big fan of best practices anyway, they’re too dogmatic for my liking. So don’t take what I say here too strongly. Records are great too; they’re not a “bad part” of the language to avoid. That said, if you are a beginner I’d suggest trying maps out first and getting comfortable with them, because I know beginners are biased towards records, as they tend to be more familiar coming from other languages, and they give you a nice static schema which grants comfort.

And finally, if I came to a code base using records where I’d have used maps, I wouldn’t think “oh god, what a monstrosity”; I’d be totally fine with it. Maps vs records: their difference is subtle, the effect of choosing one over the other is not a deal breaker, and Clojure makes it quite easy to change between the two.

6 Likes

Clojure newbie here. Thanks for your helpful replies.

That’s why I say start with maps, use records if you need the performance boost and/or want to actually create a type to use with protocols for type polymorphism, though now you can do so with maps as well.

Can you explain how one can do type based polymorphism with maps, in clojure? That sounds useful. Thx!

Thanks for your explanation and for the link to the decision tree flowchart.

I’m a clojure newbie. I stumbled across this topic trying to figure out “why would I ever use protocols?” The answer seems to be “when you need Java interop, or for certain low level JVM optimizations.” Let me know if I got the gist wrong there.

I’ve moved from a heavy OO background in Java, to “oo-lite” solutions (a la protocols), to “we don’t need no stinkin’ OO!” I spend most of my time writing modern, functional React nowadays, with nary a class in sight. With hooks you just don’t need them. I also spend (too much) time hacking emacs lisp, which again doesn’t use OO. (EIEIO was a fad for a minute but faded.)

I appreciate that type based polymorphism can be useful, and so are platform specific optimizations. But I’m glad to hear that maps are considered more idiomatic.

1 Like

@didibus thanks for the great answer.
I was a bit confused (maybe still am) about what a “producer” is and what a “consumer” is:

have the producer of the data validate, and the reader conform

like before writing data to the DB, validate it, and after reading data from the DB, conform it.

I could look at the code writing data to the DB as a consumer of data, perhaps received via an HTTP API call - I’d validate the data coming from the client, transform it to the shape expected by the domain layer persistence logic and save it in the db (possibly again transforming to another form expected by the DB but usually not doing any extra validation just before saving the data).

Is that how you think about it or what you suggest is a different approach?

The other example still confuses me:

before sending a payload, validate it meets the spec, and as soon as you receive a payload, conform it.

Especially if these are different (distributed) processes, I’d validate the payload in the receiver.
Perhaps the confusion is that a piece of code in such situations is really both a receiver and a producer? Like I’m receiving data from somewhere else (maybe an HTTP API handler) but, at the same time, validating that data and _producing_ a transformed payload for something else (a DB, a Kafka topic, etc.).

It looks like you haven’t yet gotten a response to this so I’ll try to answer:

Clojure 1.10 introduced a new feature for defprotocol that lets you declare that a protocol can be extended via metadata – see clojure/changes.md at master · clojure/clojure (github.com) – and this lets you use protocols for polymorphism on any value that can carry metadata, such as a plain old hash map, if the protocol is declared to be extended that way.

As an example of this in the wild, here’s how next.jdbc provides support for Stuart Sierra’s Component lifecycle (start/stop) via metadata: next-jdbc/connection.clj at develop · seancorfield/next-jdbc (github.com). In this case, next.jdbc.connection/component returns an empty hash map that satisfies the start portion of the Lifecycle protocol from the Component library via metadata. Once start is called, the Component returned is a function (not even a hash map) which can be invoked to get the underlying connection-pooled DataSource, and it also satisfies the stop portion of the Lifecycle protocol via metadata. When stop is called on it, you get back an empty hash map that satisfies the start portion again.

I could make it idempotent on the missing calls by completing the protocol implementation via metadata so that if you called stop on an unstarted Component, you just got the same thing back (including the metadata), and similarly if you called start on a started Component, but it seemed better to have those be an error (by omitting the protocol implementation) for calls that I consider to be logic errors.

[At work, we’ve adopted this idea of an “invocable component” as an idiom for getting at the underlying resource that a Component wraps/manages and it works very nicely for us]
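
For illustration, here’s a toy version of that “invocable component” idiom (hypothetical names, not next.jdbc’s actual code): start returns a function that hands back the wrapped resource when invoked and satisfies stop via metadata, and stop returns the unstarted map again:

(require '[com.stuartsierra.component :as component])

(defn wrap-resource
  "Toy sketch: open-resource and close-resource are hypothetical functions
  that acquire and release whatever resource this component manages."
  [open-resource close-resource]
  (with-meta {}
    {`component/start
     (fn [_]
       (let [resource (open-resource)]
         ;; The started component is an invocable function carrying stop in metadata.
         (with-meta (fn [] resource)
           {`component/stop
            (fn [_]
              (close-resource resource)
              (wrap-resource open-resource close-resource))})))}))

;; Usage, assuming open-conn! and close-conn! exist:
;; (def c (component/start (wrap-resource open-conn! close-conn!)))
;; (c)                ;; => the underlying resource
;; (component/stop c) ;; => back to the unstarted map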

5 Likes

Ah I misunderstood slightly - thanks for the clear explanation!

Ya, with Clojure 1.10 you can use metadata on maps, and the implementation of the protocol will be chosen based on the metadata on the map.

But what I was referring to was that you can lift your maps to become records without needing to refactor the code that uses the map, because records offer a compatible interface to maps. Once you lift your map into a record, it becomes a real type, and you can then use protocols on that type for type-based polymorphism.

Here’s an example:

(defn make-dog
  [name breed]
  {:name name
   :breed breed})

(defn make-cat
  [name breed]
  {:name name
   :breed breed})

(defn bark
  [dog]
  (println "Woof I'm " (:name dog)))

(defn meow
  [cat]
  (println "Meow I'm " (:name cat)))

(bark (make-dog "Bob" :golden-doodle))
(meow (make-cat "Marcy" :persian))

Now say you wanted to have type polymorphism here, so that you could use one function speak that chooses its implementation based on the type of entity it is given.

So what you can do is first define a protocol for the polymorphic functions you want:

(defprotocol Speakable
  (speak [animal] "Prints a greeting from the given animal"))

And now you can refactor your map constructors to return a record instead which will provide the implementation for the polymorphic protocol functions you want:

(defn bark
  [dog]
  (println "Woof I'm " (:name dog)))

(defrecord Dog
  [name breed]
  Speakable
  (speak [dog] (bark dog)))

(defn make-dog
  [name breed]
  (map->Dog
    {:name name
     :breed breed}))

(defn meow
  [cat]
  (println "Meow I'm " (:name cat)))

(defrecord Cat
  [name breed]
  Speakable
  (speak [cat] (meow cat)))

(defn make-cat
  [name breed]
  (map->Cat
    {:name name
     :breed breed}))

(speak (make-dog "Bob" :golden-doodle))
(speak (make-cat "Marcy" :persian))

And as you see, the code that was using the maps does not need to change, it will work just as well with the record version of the maps.

This is what I meant by you can start with maps, and then if you need performance or type polymorphism you can just easily refactor to records and protocols.

Records will have better key lookup performance than maps, and protocol type polymorphism over records will have really fast dispatch.

This isn’t the only way you can add polymorphism to maps though. Like Sean said, using metadata is another way:

(defprotocol Speakable
  :extend-via-metadata true
  (speak [animal] "Prints a greeting from the given animal"))

(defn bark
  [dog]
  (println "Woof I'm " (:name dog)))

(defn meow
  [cat]
  (println "Meow I'm " (:name cat)))

(defn make-dog
  [name breed]
  (with-meta
    {:name name
     :breed breed}
    {`speak bark}))

(defn make-cat
  [name breed]
  (with-meta
    {:name name
     :breed breed}
    {`speak meow}))

(speak (make-dog "Bob" :golden-doodle))
(speak (make-cat "Marcy" :persian))

Here we’re still using maps, but by attaching metadata to them, the polymorphic protocol function speak will dispatch to the implementation defined on their metadata. This isn’t type polymorphism, you could call it metadata polymorphism, because the runtime type of the entity is still PersistentMap, but each instance of the map can have its own dispatch.

It won’t be as efficient to dispatch, and since we’re still using maps, key lookup doesn’t get a performance boost, but this is quite flexible and even less intrusive.

The last way to make maps polymorphic is to use multimethods, and have application level types (instead of relying on the runtime types) modeled on your entities themselves. Here’s how:

(defn bark
  [dog]
  (println "Woof I'm " (:name dog)))

(defn meow
  [cat]
  (println "Meow I'm " (:name cat)))

(defmulti speak
  "Prints a greeting from the given animal"
  :animal/type)

(defmethod speak :dog
  [dog]
  (bark dog))

(defmethod speak :cat
  [cat]
  (meow cat))

(defn make-dog
  [name breed]
  {:name name
   :breed breed
   :animal/type :dog})

(defn make-cat
  [name breed]
  {:name name
   :breed breed
   :animal/type :cat})

(speak (make-dog "Bob" :golden-doodle))
(speak (make-cat "Marcy" :persian))

You can use multimethods with records as well, which would let you use the runtime types for dispatch if you preferred. I think multimethods will have the slowest dispatch of them all, but they’re also very flexible, and they can dispatch on an arbitrary function of the value.
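
For example, here’s a small sketch reusing the Dog and Cat records and constructors from the record example above (assuming the protocol version of speak isn’t also defined in the same namespace, since the names would clash):

;; Dispatch on the runtime class of the record instance.
(defmulti speak class)

(defmethod speak Dog
  [dog]
  (bark dog))

(defmethod speak Cat
  [cat]
  (meow cat))

(speak (make-dog "Bob" :golden-doodle))
(speak (make-cat "Marcy" :persian))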

It is possible I missed some other way, but these are the most common ones.

The last thing to point out here is that you could also use a generic function instead of polymorphism in this case:

(defn make-dog
  [name breed]
  {:name name
   :breed breed
   :greeting "Woof"})

(defn make-cat
  [name breed]
  {:name name
   :breed breed
   :greeting "Meow"})

(defn speak
  [animal]
  (println (:greeting animal) " I'm " (:name animal)))

(speak (make-dog "Bob" :golden-doodle))
(speak (make-cat "Marcy" :persian))

Though I showed that just as an interesting alternative to polymorphism.

6 Likes

Code can be both consumer and producer, but if you go more granular you’ll have a chunk of code inside it that does only consumption and another only production.

So the way I think of it is:

Conform the data coming from the client. You’d do this by first deserializing the transport payload into Clojure data, so if you have JSON, decode it back into the Clojure representation your service uses. Now conform the Clojure representation of it against your Spec for it. This will both validate that the data you now have is valid to your expectations, and it will also tell you what type of data you have if multiple types of data are possible.

Now you are ready to consume the payload you received from the client. If anything failed up to this point, you’d return an error to the client and fail-fast.

Assuming it all succeeded, you can now process the valid and conformed Clojure representation of the payload however you want.

As part of that processing, you might have transformed or modified the data, and now want to persist this transformed data to the DB. That means you want to prepare for producing data to send to the DB which will consume it.

What I would do is thus take the data I want to persist in its Clojure representation again, for which I have Specs defined, which could have been modified from what we got from the payload or not, and I will validate it against the Spec to make sure it is valid before I send it to be persisted.

Now to prepare it to be sent to the DB, I will again need to serialize it for transport and consumption by the DB, this happens after I validate it against the Spec.

So if the validation fails, I once again fail-fast and return an error to the client (or find a way to recover), but I don’t even attempt to persist it.

Assuming the validation succeeded, I’d now convert it to whatever I need for the DB I’m using: maybe I convert it to a bunch of SQL statements and send that to some JDBC driver for persisting to MySQL, or maybe I convert it to JSON and send it to MongoDB, or I convert it to an EDN string and write it to a file on disk, etc.
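
Putting that together as a rough sketch (the specs ::client-payload and ::db-entity, and the process and persist! functions, are all hypothetical stand-ins; the point is only where conform, validate and the transport conversions sit):

(require '[clojure.spec.alpha :as s]
         '[clojure.data.json :as json])

(s/def ::client-payload map?) ;; stand-in spec for the incoming payload shape
(s/def ::db-entity map?)      ;; stand-in spec for the entity we persist

(defn process [payload]       ;; stand-in for the real domain logic
  (assoc payload :processed-at (System/currentTimeMillis)))

(defn persist! [serialized]   ;; stand-in for the real DB call
  (println "persisting:" serialized))

(defn handle-request [json-body]
  ;; 1. Consume: deserialize the transport format, then conform against the spec.
  (let [incoming  (json/read-str json-body :key-fn keyword)
        conformed (s/conform ::client-payload incoming)]
    (if (s/invalid? conformed)
      {:status 400 :body "Unrecognized payload"}              ;; fail fast at the edge
      ;; 2. Process: work with the conformed Clojure data, possibly transforming it.
      (let [entity (process conformed)]
        ;; 3. Produce: validate before handing anything to the persistence layer.
        (if-not (s/valid? ::db-entity entity)
          {:status 500 :body "We produced an invalid entity"} ;; a bug on our side
          (do
            ;; 4. Transport: only now convert to whatever the DB needs (SQL, JSON, EDN, ...).
            (persist! (json/write-str entity))
            {:status 200 :body "ok"}))))))

(handle-request "{\"full-name\": \"Ada Lovelace\"}")
;; => {:status 200, :body "ok"}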

The reason I say “consumer” and “producer” and not “receiver” and “sender” is that some parts of a system will just act as transport, they just move things around, without needing to read the data, process it or create it. Kafka is a good example of that, it receives data on a message and just pipes it to a stream for another system to consume. Kafka itself doesn’t consume or produce anything relevant to your system, it is pure transport.

Let’s assume Kafka had a way to validate what it received; in my opinion, that would already be too late. You want the producer of the data to validate it, and that will be some system upstream of Kafka. If Kafka needed to process the data (which it doesn’t), but say it did, you’d want it to conform it, because then it would be a consumer as opposed to just a receiver.

But if you process the data, as in you need to understand its content, you are a consumer, and should conform it. And if you modify or create data, you are a producer, and should validate what you’ve created or modified.

If you don’t process, don’t modify, don’t transform and don’t create data, you are neither a consumer nor producer, and don’t need to validate or conform.

A good way to know if you are a consumer or not is to ask yourself, can you still do what you do if the data came encrypted and you couldn’t decrypt it? If so, you’re not a consumer, even if you receive the data and move it around or wrap it/unwrap it.

And a producer is anything that modifies, transforms or creates data. So changing the value of a key, adding a new key, creating a map, etc.

2 Likes

Someone pointed out recently that Component includes idempotent implementations of start and stop on java.lang.Object, which means the code in next.jdbc.connection that I was discussing here is already idempotent on the “missing” lifecycle methods, so I don’t need to do anything, and calling those lifecycle methods is not an error (I would have to explicitly override them to make them logic errors).
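
Roughly, that default amounts to this (a sketch of the idea, not Component’s literal source):

(require '[com.stuartsierra.component :as component])

;; Component extends Lifecycle to java.lang.Object with identity implementations,
;; so start/stop are no-ops on anything that doesn't supply its own implementation.
(extend-protocol component/Lifecycle
  Object
  (start [this] this)
  (stop  [this] this))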

With hindsight, this is nice because the default behavior is idempotency which is generally what you actually want when constructing system component graphs.
