What is the 2021 recommendation for Specs?

This makes sense to me if one is exploring the ideas/data/algorithms, etc. Otherwise my reasoning still makes sense to me, and at early stages of a project it might not matter, since it’s easy to change definitions. It’s so easy to move back and forth between maps and records that maybe it should be considered a matter of personal preference.

I’m thinking that maybe the reason for the advice to use maps is that people coming from Java will overuse fixed types such as records, and think that everything has to be done that way. So it’s good advice for them to start with maps and then use records as needed. I can see that. That was never my orientation with Clojure, though. (I was a Java programmer a long time ago, but came to Clojure by way of Common Lisp.)

Irrelevant to this discussion, but fwiw I spent a lot of time studying that flowchart at one time, and in my experience it is not always right for decisions about interop data structures. It presents good rules of thumb for many cases. I doubt any flowchart could capture all of the factors that could matter for Clojure interop data structure decisions. (I definitely have less overall experience with Clojure than many people, but I think I may have gotten deeper into interop at one time than most Clojure programmers. It wasn’t fun. :slight_smile: Well, OK, some of it was fun. And now I have the problems worked out to my satisfaction.)

No worries, I just wanted to ask about that. Thank you for the correction ;-).

Could you please tell me what you do if you know the keys but do not have values for some of them? Do you change them or add them later?

Would you for example use nil?

(defrecord Flight [flight aircraft departed arrived])

(map->Flight {:flight "BA5", :aircraft "Boeing 747", :departed "2021-05-01 15:30:00"})

#user.Flight{:flight "BA5", :aircraft "Boeing 747", :departed "2021-05-01 15:30:00", :arrived nil}

Or something else?

(map->Flight {:flight "BA5", :aircraft "Boeing 747", :departed "2021-05-01 15:30:00", :arrived :has-not-landed})

Or rather not include it at all?

(defrecord Flight [flight aircraft departed])

and later add it?

(assoc myflight :arrived "2021-05-01 19:30:00")

I would probably use nil in most cases, but more deeply experienced people may have a better idea. You do have to be careful in that case to make sure that getting an unexpected nil doesn’t cause a bug, but that’s a normal thing to have to watch out for. For your example, :has-not-landed seems like a good option too, and avoids an accidental nil-pun. I wouldn’t leave the field out of the definition, though.

If there are often many unfilled fields, i.e. keys without values, maybe that would be a case where maps are better–I don’t know.

“Optional” fields can be tricky to handle with records because if you accidentally dissoc a declared field out of a record, it quietly becomes a hash map and it won’t become a record again. And then there’s Rich’s whole thing about nil being a bad thing in a hash map – see the Maybe Not talk – because so much code assumes nil == “not there” / “no value”, so having nil as a deliberate value can easily trip you up.
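
A quick REPL sketch of that gotcha, reusing the Flight record from earlier in the thread (the :gate key is made up for illustration):

(defrecord Flight [flight aircraft departed arrived])

(def f (->Flight "BA5" "Boeing 747" "2021-05-01 15:30:00" nil))

(type f)                   ;=> user.Flight
(type (dissoc f :arrived)) ;=> a plain Clojure map, no longer user.Flight
;; dissoc'ing an extra, undeclared key keeps the record type:
(type (dissoc (assoc f :gate "B12") :gate)) ;=> user.Flight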

You get this problem when dealing with SQL/JDBC because NULL is a perfectly reasonable value in a database (although it doesn’t just have “regular value” semantics). You need nil in your hash map for INSERT / UPDATE operations and the main Clojure JDBC libraries will give you hash maps back with nil for NULL. next.jdbc.optional provides alternative builders that omit nil values that align with NULL values in the database. I don’t know how widely used it is. All I can say is that nearly all of the JDBC-related code I’ve ever written assumes nil-punning and therefore treats nil and “not there” as identical rather than trying to treat nil as an actual value.
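
A hedged sketch of what using one of those next.jdbc.optional builders looks like; the datasource config and table name are assumptions for illustration:

(require '[next.jdbc :as jdbc]
         '[next.jdbc.optional :as opt])

(def ds (jdbc/get-datasource {:dbtype "h2:mem" :dbname "example"}))

;; With the default builder, a NULL column comes back as :arrived nil;
;; with opt/as-maps, the :arrived key is omitted from the row entirely:
(jdbc/execute! ds
               ["select * from flight where flight_no = ?" "BA5"]
               {:builder-fn opt/as-maps})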

4 Likes

That’s my thinking, since :has-not-landed is information in itself. In Elixir there is an unwritten rule to use :unfetched, so you don’t have to think about naming.

The downside of records is that they are no longer pure data, so serialization is a problem, which makes information modeled with them harder to move to other processes, or to store and retrieve.

Most formats that have a schema suffer from this: you need to have the schema definition for the correct version of the serialized data, and to know implicitly which one it maps to, whereas schemaless formats evolve better over time, as they are more flexible.

That’s why I say start with maps; use records if you need the performance boost and/or want to actually create a type to use with protocols for type polymorphism, though now you can do that with maps as well.

That’s also where I’d recommend the use of Specs over records. Specs are much better at describing data than records, and much more flexible in how they can evolve along with the data.

Just to give an example, if you have a map, you would model type as data (if you cared about type):

{:type :dog
 :name "Bib"}

{:type :cat
 :name "Kitty"}

But when using records, the type is implicit: it isn’t part of the data it models; instead it’s tracked by the runtime on the in-memory instances of your data.
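
To contrast, a small sketch of how the type lives in the class rather than in the data when using records:

(defrecord Dog [name])

(type (->Dog "Bib"))    ;=> user.Dog, tracked by the runtime
(into {} (->Dog "Bib")) ;=> {:name "Bib"}, the type info is gone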

By having the type as data, your type info serializes along with the rest of the data automatically. It is also more flexible and can evolve to be more or less refined as need be. The downside is that polymorphic dispatch won’t be as performant.

And now, if you want a schema to help you know what the data invariants for a certain entity are, you can use spec instead of a record, which is even more precise.
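
For example, a minimal sketch of speccing those animal maps (the ::animal spec name is an assumption for illustration):

(require '[clojure.spec.alpha :as s])

(s/def ::type #{:dog :cat})
(s/def ::name string?)
(s/def ::animal (s/keys :req-un [::type ::name]))

(s/valid? ::animal {:type :dog :name "Bib"})     ;=> true
(s/valid? ::animal {:type :bird :name "Tweety"}) ;=> false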

So I feel maps + spec are just superior to records, unless, like I said, you have some very special performance consideration.

You can absolutely use it in production; we have at my work since it launched, with great success. The code works, and does what it does well. The reason it is alpha is that the team isn’t sure whether that’s the final ergonomics and feature set for it in the language forever. They wanted to see how people would use it, whether it would deliver on all they wanted, get feedback, etc., before committing to Spec fully for the language. And that’s where Spec 2 comes in: they’re reworking some aspects based on what they learned from the alpha.

It isn’t alpha because it is buggy or anything like that, so it is safe to use in production.

As for best practices, I’d say you can spec your domain model and then validate explicitly using s/valid? or s/conform (not instrumentation) at specific places in your app. My recommendation is to have the producer of the data validate, and the reader conform, and to do so at the boundary. Like before sending a payload, validate it meets the spec, and as soon as you receive a payload, conform it. Or before writing data to the DB, validate it, and after reading data from the DB, conform it.
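
A hedged sketch of that discipline; the ::flight spec and the save-to-db!/load-from-db functions are hypothetical names for illustration:

(require '[clojure.spec.alpha :as s])

(s/def ::flight-no string?)
(s/def ::departed string?)
(s/def ::flight (s/keys :req-un [::flight-no ::departed]))

(defn write-flight! [db flight]
  ;; Producer side: validate before writing, so we fail fast.
  (when-not (s/valid? ::flight flight)
    (throw (ex-info "Invalid flight" (s/explain-data ::flight flight))))
  (save-to-db! db flight))

(defn read-flight [db id]
  ;; Consumer side: conform as soon as the data comes back.
  (let [data (s/conform ::flight (load-from-db db id))]
    (if (s/invalid? data)
      (throw (ex-info "Unrecognized flight data" {:id id}))
      data)))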

On top of that, it’s good to spec pure functions you want to thoroughly test, and then set up a generative test for them.

Finally, you can spec a few other functions as documentation for what entities they take as input/output, when it helps readability, and set up instrumentation at the REPL and when your tests run. But don’t use instrument in prod.
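
Putting those last two together, a small sketch of speccing a pure function, generatively testing it, and instrumenting it at the REPL (the function itself is a toy example):

(require '[clojure.spec.alpha :as s]
         '[clojure.spec.test.alpha :as stest])

(defn layover-minutes [arrive depart]
  (- depart arrive))

(s/fdef layover-minutes
  :args (s/and (s/cat :arrive nat-int? :depart nat-int?)
               #(<= (:arrive %) (:depart %)))
  :ret nat-int?)

(stest/check `layover-minutes)      ;; generative test against the spec
(stest/instrument `layover-minutes) ;; REPL/test-time arg checking; not for prod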

9 Likes

Wow. Incredibly helpful, @didibus.

No problem.

I’d also like to add some details. The reason the producer should validate before sending a payload or storing data to a data store is that you want to fail fast. Once the reader on the other side receives the payload and fails, you’re already in a state that’s much harder to recover from. Imagine storing a corrupted entity to disk: when you read it back later, you can’t just throw an error, undo the storing of it, go back to the writer, have it catch the exception and recover, etc., because of the distributed nature of things. So you want to fail fast on the producer side, so you don’t corrupt things in ways that get harder to recover from. Ideally your REPL/tests catch these things; worst case, you fail as soon as you go to prod and can quickly roll back with no data cleanup involved, instead of having left the DB with a bunch of broken entities, or your Kafka stream with a bunch of broken messages.

Secondly, the reason you want your reader to conform is that conformance is the act of figuring out what type of data you received and handling it appropriately based on that. Any system will evolve its payloads and entities over time. And in a distributed setting, you can’t always force the sender to migrate to your new payload format. Even if they do, there’s always a deployment overlap if you want zero downtime: there will be in-transit messages in the old payload format while the new code is deployed and starts to send the new format, so as a consumer of data you must always be able to process both the old data and the new. That’s what conform lets you do: when you call conform on some data, the result tells you which of the many kinds and versions of data you just received, which allows you to branch to the correct logic for it. It will also tell you if the data doesn’t match any of the types of data you support.
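
A hedged sketch of branching on payload version via conform; since s/keys accepts extra keys, the more specific :v2 branch is listed first, and all the spec and handler names here are assumptions:

(require '[clojure.spec.alpha :as s])

(s/def ::flight string?)
(s/def ::departed string?)
(s/def ::aircraft string?)
(s/def ::v1 (s/keys :req-un [::flight ::departed]))
(s/def ::v2 (s/keys :req-un [::flight ::departed ::aircraft]))
(s/def ::payload (s/or :v2 ::v2, :v1 ::v1))

(defn handle [incoming]
  (let [conformed (s/conform ::payload incoming)]
    (if (s/invalid? conformed)
      (throw (ex-info "Unsupported payload" {:payload incoming}))
      (let [[version data] conformed]
        (case version
          ;; hypothetical handlers for the old and new formats:
          :v1 (handle-v1 data)
          :v2 (handle-v2 data))))))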

Now, what neither validate nor conform does is serialize the data for transport. I consider that orthogonal, and that’s why I don’t like schema libs that also do data conversion. Conform is not the same as convert.

So what you do is produce the payload as Clojure data and validate it with the spec for it. Then you send that data to a layer that converts it into the transport format, where maybe it gets serialized to JSON. And then on the consumer side, you first get that transport-encoded data and deserialize it back to Clojure data, which you then conform.
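
A sketch of that layering, using Cheshire for the JSON transport; the library choice and the ::payload spec are assumptions:

(require '[clojure.spec.alpha :as s]
         '[cheshire.core :as json])

(defn ->wire [payload]
  ;; Producer: validate the Clojure data, then hand it to the transport layer.
  (when-not (s/valid? ::payload payload)
    (throw (ex-info "Invalid payload" (s/explain-data ::payload payload))))
  (json/generate-string payload))

(defn <-wire [s]
  ;; Consumer: decode the transport format back to Clojure data, then conform.
  (s/conform ::payload (json/parse-string s true)))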

Now if your transport is EDN-compatible, you don’t need to do much conversion, but the idea is the same: I consider converting to/from the transport format to be beyond the boundary of my app.

I prefer that to something like JSON Schema, because what I want to do is work with my own domain representation; JSON is an implementation detail of my transport. With this approach you can support multiple transports, and use the same schema validation/conformance on all of them.

And then for conversion you are free to use whatever lib you prefer or hand roll it.

It does mean there’s a chance your conversion has a bug and creates a bunch of broken payloads in the process, though, but I tend to heavily test my conversion logic specifically for that.

5 Likes

The added details are also helpful, @didibus. I’m going to address some of my comments to you, because you wrote those two very helpful posts, but most of my comments are not specifically directed to you:

I understand now why some people recommend using maps over records. This is great–I appreciate the illumination. But I also think that while the use case that you’ve described is quite broad–lots of people have to serialize data to some storage, etc., and check for changed data formats, etc.–it’s not entirely general. If Clojure were only used for business, that kind of use case would be quite common, and some contexts outside of business have similar use cases as well. I can easily think of contexts involving scientific research, for example, in which exactly the sorts of concerns that your remarks identify would be applicable. What I love about those posts is they lay out so clearly the rationales for the practices you describe, didibus, so that others can evaluate whether and when they apply to their use cases.

However, not all contexts involve that sort of relationship to stored data, streams of data, etc. I think that the Clojure community has been largely business oriented–something that I like in many respects–but I wonder whether the “avoid records by default” recommendation is really a recommendation for common business applications, and shouldn’t be considered a general recommendation. I’m not taking a position on this. It’s a question I have, and maybe not a very important question.

(I’m very pleased about the recent Clojure data science community that’s developed, and I’m grateful for all of the dedicated work people are doing on developing tools for that. Despite my appreciation of the business orientation of most of the Clojure community, I’m happy to see a different kind of subcommunity develop–one that’s closer to my own interests. And fwiw, I do recognize that the kind of data serialization, etc. that you’re talking about, didibus, would be very important in some data science applications.)

1 Like

Fantastic write-up, @didibus! Thank you.

Could you please tell us: is serialization with records only a problem when communicating with external systems (storing data in a DB, two microservices communicating with each other, client and server sending data to each other)?

Or can you have serialization difficulties inside the same program (when using threads, or when using records from a library in the main app)?

The simple decision process might be: if you think some data structure might need to be stored in a DB or sent over the wire => map. For internal use only => map/record.

One thing regarding records – if you use plain JSON serialization you will lose the record type. You can however use libraries such as transit, which lets you register ser/de per type and has utility functions for creating serializers and deserializers from records. I think fressian supports this as well. So does nippy.
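
For example, a hedged sketch of round-tripping a record through transit using its record-handler helpers:

(require '[cognitect.transit :as transit])
(import '(java.io ByteArrayOutputStream ByteArrayInputStream))

(defrecord Flight [flight aircraft departed arrived])

(def out (ByteArrayOutputStream.))
(def w (transit/writer out :json
                       {:handlers (transit/record-write-handlers Flight)}))
(transit/write w (->Flight "BA5" "Boeing 747" "2021-05-01 15:30:00" nil))

(def in (ByteArrayInputStream. (.toByteArray out)))
(def r (transit/reader in :json
                       {:handlers (transit/record-read-handlers Flight)}))
(transit/read r) ;=> #user.Flight{:flight "BA5", ...} with the type preserved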

2 Likes

No recommendation is ever general though; all are contextual. Even a goto might sometimes be the right tool. I’d say this one is definitely related to information systems: if your application isn’t modeling information, but instead doing something else, like say graphics rendering, signal processing, and other such things, it’s possible it doesn’t apply as much, since you might not care to serialize the data anyway, or performance might be a bigger concern.

It’s also possible that if you’re just writing some small scripts, specs are too powerful, too expressive, and maybe the reduced scope of records as schemas is what works best.

Basically: use maps unless you know better

When using records across libraries in the same app you can have the same issues, yes. When using threads they shouldn’t be a problem.

The reason is that each namespace that needs to work with an instance of the record needs to depend on its schema definition, the defrecord. So you need to take an explicit dependency from the code that uses the record to the code that defines it. That can lead you into issues with circular dependencies, and things like that. It’s not as bad as full-on Java static classes; it mostly matters if you care to manipulate the type or the constructor.
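
A sketch of that explicit dependency; the namespace names are assumptions:

(ns myapp.consumer
  ;; Loading the defining namespace compiles the record class...
  (:require [myapp.model])
  ;; ...and you must import the class to refer to it directly:
  (:import (myapp.model Flight)))

(defn flight? [x]
  (instance? Flight x))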

Like before sending a payload, validate it meets the spec, and as soon as you receive a payload, conform it. Or before writing data to the DB, validate it, and after reading data from the DB, conform it.

Love this. A really useful distinction between the two. This is Postel’s law in a nutshell, no?

I have been saying the same for years – albeit less eloquently, as I was working mostly in Java at the time. A lot of people find this overkill, but I totally agree with your more detailed explanation. The cost of saving or publishing wrong or weirdly formatted information is very high compared with the effort/risk (and indeed benefits) of coping with weirdly formatted incoming information. But most feel comfort in strictly rejecting inputs that don’t fit their original expectations, and apply no such rigour to their own outputs, which they are convinced must be correct so long as the inputs were OK. Because no one ever had to migrate bad data after a bug made it into the wild… :grinning:

1 Like

Not “use maps for information systems unless you know better”? Is information systems the default?

Java is used for everything (except low-level stuff). C, C++, too. Python isn’t used for everything, but it would be difficult to delimit what it is and isn’t used for.

I don’t care that much about what advice is given, but Clojure is not a business language, and it’s not an information systems language. It’s a general-purpose language. I personally like it for science. I would like it to replace Python and Java for scientific research programming. I don’t see a rationale for “use maps by default” outside of the kind of context that you have delineated. It’s an important context, and maybe it’s the case that at present, it’s a pervasive context among the Clojure community. I want the community to be broader than that–and Clojure as a language has the resources to play that role.

There are some ways in which it’s important to get new users into thinking idiomatically, so that they don’t try to put round pegs into square holes. This doesn’t seem like one of them, to me.

Well, I guess if you wanted to be more precise, and want my advice, it would be to use maps for information modeling, and use records for type abstractions. Simply because I find maps have benefits over records for modeling information, and records have benefits over maps for creating type abstractions, even though extend-via-metadata tried to close that gap.

And lastly, if you have special performance needs not met by maps, but met by records, go for it. And similarly if records don’t meet your performance needs either, try arrays, try a data frame library, use some specialized data-structure, maybe even a Java mutable one, whatever gives you the boost you need.

That’s just my advice; the rules around what defines a best practice aren’t something I understand very well, and I’ve never been a big fan of best practices anyway, they’re too dogmatic for my liking. So don’t take what I say here too strongly: records are great too, they’re not a “bad part” of the language to avoid. That said, if you are a beginner I’d suggest trying maps out first and getting comfortable with them, because I know beginners are biased towards records, as they tend to be more familiar coming from other languages, and they give you a nice static schema which grants comfort.

And finally, if I came to a code base using records where I’d have used maps, I wouldn’t think “oh god, what a monstrosity”; I’d be totally fine with it. The difference between maps and records is subtle, the effects of choosing one over the other are not a deal breaker, and Clojure makes it quite easy to change between the two.

6 Likes

Clojure newbie here. Thanks for your helpful replies.

That’s why I say start with maps, use records if you need the performance boost and/or want to actually create a type to use with protocols for type polymorphism, though now you can do so with maps as well.

Can you explain how one can do type-based polymorphism with maps in Clojure? That sounds useful. Thx!

Thanks for your explanation and for the link to the decision tree flowchart.

I’m a Clojure newbie. I stumbled across this topic trying to figure out “why would I ever use protocols?” The answer seems to be “when you need Java interop, or for certain low-level JVM optimizations.” Let me know if I got the gist wrong there.

I’ve moved from a heavy OO background in Java, to “oo-lite” solutions (a la protocols), to “we don’t need no stinkin’ OO!” I spend most of my time writing modern, functional React nowadays, with nary a class in sight. With hooks you just don’t need them. I also spend (too much) time hacking emacs lisp, which again doesn’t use OO. (EIEIO was a fad for a minute but faded.)

I appreciate that type based polymorphism can be useful, and so are platform specific optimizations. But I’m glad to hear that maps are considered more idiomatic.

1 Like

@didibus thanks for the great answer.
I was a bit confused (maybe still am) about what a “producer” is and what a “consumer” is:

have the producer of the data validate, and the reader conform

like before writing data to the DB, validate it, and after reading data from the DB, conform it.

I could look at the code writing data to the DB as a consumer of data, perhaps received via an HTTP API call: I’d validate the data coming from the client, transform it to the shape expected by the domain-layer persistence logic, and save it in the DB (possibly transforming again to another form expected by the DB, but usually not doing any extra validation just before saving the data).

Is that how you think about it or what you suggest is a different approach?

The other example still confuses me:

before sending a payload, validate it meets the spec, and as soon as you receive a payload, conform it.

Especially if these are different (distributed) processes, I’d validate the payload in the receiver.
Perhaps the confusion is that a piece of code in such situations is really both a receiver and a producer? Like I’m receiving data from somewhere else (maybe an HTTP API handler) but, at the same time, validating that data and producing a transformed payload for something else (a DB, a Kafka topic, etc.).

It looks like you haven’t yet gotten a response to this so I’ll try to answer:

Clojure 1.10 introduced a new feature for defprotocol that lets you declare that a protocol can be extended via metadata – see clojure/changes.md at master · clojure/clojure (github.com) – and this lets you use protocols for polymorphism on any value that can carry metadata, such as a plain old hash map, if the protocol is declared to be extended that way.
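
A minimal sketch of that feature; the protocol and names are made up:

(defprotocol Describable
  :extend-via-metadata true
  (describe [this]))

(def dog
  (with-meta {:type :dog :name "Bib"}
    {`describe (fn [m] (str (:name m) " is a " (name (:type m))))}))

(describe dog) ;=> "Bib is a dog"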

As an example of this in the wild, here’s how next.jdbc provides support for Stuart Sierra’s Component lifecycle (start/stop) via metadata: next-jdbc/connection.clj at develop · seancorfield/next-jdbc (github.com). In this case, next.jdbc.connection/component returns an empty hash map that satisfies the start portion of the Lifecycle protocol from the Component library via metadata. Once start is called, the Component returned is a function (not even a hash map) which can be invoked to get the underlying connection-pooled DataSource, and it also satisfies the stop portion of the Lifecycle protocol via metadata. When stop is called on it, you get back an empty hash map that satisfies the start portion again.

I could make it idempotent on the missing calls by completing the protocol implementation via metadata so that if you called stop on an unstarted Component, you just got the same thing back (including the metadata), and similarly if you called start on a started Component, but it seemed better to have those be an error (by omitting the protocol implementation) for calls that I consider to be logic errors.

[At work, we’ve adopted this idea of an “invocable component” as an idiom for getting at the underlying resource that a Component wraps/manages and it works very nicely for us]
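
A generic sketch of that “invocable component” idiom – not next.jdbc’s actual code; open-pool! is a hypothetical zero-arg pool constructor and the pool is assumed to be Closeable:

(require '[com.stuartsierra.component :as component])

(defn pool-component [open-pool!]
  (with-meta {}
    {`component/start
     (fn [_]
       (let [pool (open-pool!)]
         ;; The started component is a function: invoke it to get the pool.
         (with-meta (fn [] pool)
           {`component/stop
            (fn [_]
              ;; Close the pool (assumed Closeable) and return a stopped
              ;; component that can be started again:
              (.close pool)
              (pool-component open-pool!))})))}))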

5 Likes

Ah I misunderstood slightly - thanks for the clear explanation!