Spec: Best practices for raw vs cleaned input data validation?

I would appreciate your input on the following problem: We use Spec to 1) Document what data flows through our code (annotating key transformation functions with it, not just the input and output ones), 2) Auto-generate tests, 3) Verify that the input, coming from an external REST service, is something we can handle. Now the problem is that the raw input is “dirty” and contains data that we want to remove/clean before sending it into the rest of the system (keys with nil values, users with missing IDs, …).

So it would seem that I need two specs: a more permissive one that can accept the real data the service is throwing at us (use case #3) and a stricter one for the cleaner, simpler data inside our app (use cases #1, #2). Using the permissive spec everywhere is suboptimal because I would then need to handle the bad data in multiple places (because it will be produced by our generative Spec tests).

How do you people deal with this? Thank you!

It’s my understanding that having two specs is totally idiomatic and intended. You permissively validate the data coming in from outside, then you coerce it into your desired internal format, which you then validate with the stricter spec.
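A minimal sketch of that flow, assuming hypothetical key names (`:raw/user`, `:clean/user`) and a made-up `clean-user` function; it is only meant to show the permissive check, the cleanup step and the strict check in sequence:

```clojure
(require '[clojure.spec.alpha :as s])

;; Hypothetical keys, for illustration only.
(s/def :raw.user/id (s/nilable string?))   ; the service may send nil ids
(s/def :raw/user (s/keys :opt-un [:raw.user/id]))

(s/def :clean.user/id string?)             ; internally an id is always a string
(s/def :clean/user (s/keys :req-un [:clean.user/id]))

(defn clean-user
  "Validates raw input permissively, drops a nil :id, then validates strictly.
  Throws ex-info carrying explain-data when either check fails."
  [raw]
  (when-not (s/valid? :raw/user raw)
    (throw (ex-info "Unexpected raw input" (s/explain-data :raw/user raw))))
  (let [cleaned (if (nil? (:id raw)) (dissoc raw :id) raw)]
    (when-not (s/valid? :clean/user cleaned)
      (throw (ex-info "Input cannot be cleaned" (s/explain-data :clean/user cleaned))))
    cleaned))
```

With these definitions, a user with a nil id passes the raw check but is rejected by the strict one, so the rest of the system only ever sees users that satisfy `:clean/user`.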

Sean Corfield described something like this approach in this thread (though his solution solves the problem in the small rather than at the macro level):

in our REST API, for long, double, Boolean, date, etc – we have two specs for each: one that is a spec for the target type in the domain model (which in these cases is just defined as the appropriate built-in predicate), and one that is a spec for the API level (which accepts either the target type or a string that can be coerced to the target type). Then we use the appropriate spec at the appropriate “level” in our application.
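As a rough illustration of the two levels described in the quote (not the actual code from that thread), here is a sketch for a single long-valued field. The names `:domain/age`, `:api/age`, `parse-long-str` and `->domain-age` are assumptions made up for this example:

```clojure
(require '[clojure.spec.alpha :as s])

(defn parse-long-str
  "Returns the long encoded by the string, or nil when it is not a valid long."
  [^String string]
  (try (Long/parseLong string)
       (catch NumberFormatException _ nil)))

(defn long-string?
  "True for strings that can be coerced to a long."
  [x]
  (and (string? x) (some? (parse-long-str x))))

;; Domain level: just the built-in predicate.
(s/def :domain/age int?)

;; API level: either the target type or a string that can be coerced to it.
(s/def :api/age (s/or :long int? :string long-string?))

(defn ->domain-age
  "Coerces an API-level age to the domain type; throws on input the API spec rejects."
  [x]
  (when-not (s/valid? :api/age x)
    (throw (ex-info "Invalid age" {:value x})))
  (if (string? x) (parse-long-str x) x))
```

The API spec is used at the boundary, and everything past `->domain-age` can be annotated with the stricter `:domain/age`.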

You can also use a multi-spec, and have the “level” be explicit on the data itself.

{:type :clean
 ...}

{:type :raw 
 ...} 

That way, the spec describes the relationship of the data at various stages of its life.

It ends up pretty similar to having two specs, but I find having it as one spec that is multi-specced makes the relationship between them more clear.
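A small sketch of what such a multi-spec could look like, dispatching on the `:type` key from the maps above. The `:user/...` spec names and the `user-stage` multimethod are hypothetical and only serve the illustration:

```clojure
(require '[clojure.spec.alpha :as s])

;; Hypothetical attribute specs, for illustration only.
(s/def :user/type #{:raw :clean})
(s/def :user/id string?)
(s/def :user/name string?)

;; Dispatch on the stage that the data itself declares.
(defmulti user-stage :type)

;; Raw data: the id may be missing.
(defmethod user-stage :raw [_]
  (s/keys :req-un [:user/type] :opt-un [:user/id :user/name]))

;; Clean data: the id is required.
(defmethod user-stage :clean [_]
  (s/keys :req-un [:user/type :user/id] :opt-un [:user/name]))

(s/def :user/user (s/multi-spec user-stage :type))
```

A map tagged `:raw` is valid without an `:id`, while the same map tagged `:clean` is not, so the difference between the stages lives in one place.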

Thank you. It makes sense to me but I really struggle with the practical implementation. The problem for me is that Spec is key-based. I do not see a way to tell it that :event/user is nillable when it comes in (i.e. when “raw”) but not in the internal transformation functions (i.e. after being cleaned up). The only solution I see here is to rename the key, perhaps as a part of my cleanup code:

;; Spec
(require '[clojure.spec.alpha :as s])
(s/def :clean-event/user (s/keys ...))
(s/def :raw-event/user (s/nilable :clean-event/user))
(s/def :event/cleaned? boolean?)

(defmulti event-cleaned? :event/cleaned?)
(defmethod event-cleaned? false [_]
  (s/keys :req [:event/cleaned? :raw-event/user]))
(defmethod event-cleaned? true [_]
  (s/keys :req [:event/cleaned?] :opt [:clean-event/user]))

(s/def :event/event (s/multi-spec event-cleaned? :event/cleaned?))

;; Code cleanup
(defn cleanup [{:raw-event/keys [user] :as event}]
  (cond-> (-> event
              (dissoc :raw-event/user)
              (assoc :event/cleaned? true))
    ;; only add the clean key when the raw user was non-nil
    (some? user) (assoc :clean-event/user user)))

Is there a better way?

(BTW I think I might prefer multi-spec to having two completely separate specs as this is a relatively large, nested data structure where most things are the same between “raw” and “cleaned” and I guess the cost of the extra complexity is less than the cost of the complete duplication.)

My feeling is that spec affords the ability to define the semantics of the value associated with a given key in any context in a way that can be stable forever, or at least loosened only. (If you need to tighten it you do need a new key).

For example - a “surname” should be characters (and not a boolean) always. That’s intrinsic to your semantic meaning of “surname” in any context. But whether a surname is there - nil or not - is context specific and not intrinsic to the meaning of “surname”. It’s there in some forms, but not in other sparse reports with legacy data, etc etc.

So spec should not be used to close down the possibility of a value being present or not. That’s complecting two separate things - the key meaning and the expectations of this particular transfer or protocol.
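One way to read this in spec terms, as a sketch with made-up names (`:person/surname`, `:report/sparse-person`, `:billing/person`): the attribute spec fixes what a surname is, and each context decides separately, via its own `s/keys`, whether the key has to be there.

```clojure
(require '[clojure.spec.alpha :as s])

;; The meaning of the key: a surname is a string in every context.
(s/def :person/surname string?)

;; Presence is decided per context by the enclosing s/keys spec.
(s/def :report/sparse-person (s/keys :opt [:person/surname])) ; may be absent
(s/def :billing/person (s/keys :req [:person/surname]))       ; must be present
```

Note that this covers an absent key, not a key holding nil: `{:person/surname nil}` fails both map specs, because nil is not a string.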

Not sure how best to check the protocol side though - beyond writing separate non-spec based checks. I feel like an fdef spec might work for this … ? Or perhaps new tricks in spec.alpha2 to do with “checks” which are separate to specs ??

But I would be interested to hear from others if I’m on the right lines too. I already learned something here that a 2 spec solution for raw vs coerced values is becoming a pattern. My thoughts above are mainly based on parsing Rich Hickey’s “Speculation”, “Effective Programs” and “Maybe not” talks.

So, one question I have is if you even need a spec for the raw data? You say that data is dirty, so does it have a formal specification? Like do you still expect it to be dirty in only well specified ways?

If not, you could have only a clean spec, and your cleaning functions are assumed to take in data and either successfully clean it to the clean spec or fail. Which means you only need a clean spec.
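A sketch of that idea, assuming a hypothetical clean spec `:app/user` and two made-up helpers, `clean!` and `drop-nil-vals`. The raw data is never specced; the cleaned result either satisfies the clean spec or the call fails:

```clojure
(require '[clojure.spec.alpha :as s])

;; Hypothetical clean spec, for illustration only.
(s/def :app.user/id string?)
(s/def :app/user (s/keys :req [:app.user/id]))

(defn drop-nil-vals
  "Removes top-level entries whose value is nil."
  [m]
  (into {} (remove (comp nil? val)) m))

(defn clean!
  "Applies clean-fn to the raw data and returns the result only if it satisfies
  spec; otherwise throws ex-info carrying the spec's explain-data."
  [spec clean-fn raw]
  (let [cleaned (clean-fn raw)]
    (if (s/valid? spec cleaned)
      cleaned
      (throw (ex-info "Cleaned data does not satisfy spec"
                      (s/explain-data spec cleaned))))))
```

For example, `(clean! :app/user drop-nil-vals raw-user)` returns the cleaned map or throws, and the explain-data in the exception shows which unexpected kind of dirt got through.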

If you do actually have a spec, it helps to think about what the point of the spec is. Is it to allow you to easily generate random valid input data for your cleaning functions? Is it to help document to some client the format they should export their data in? Is it to perform validation?

So if you find that having a separate spec for any of these uses is too much work, but you do have that use, the question becomes: is it more work with spec or without?

This can help justify having both a raw and a clean spec, but it doesn’t answer whether there is a way to leverage spec that is even less work and still covers both the raw and the clean use case.

For that, I don’t think there’s a better way than having two specs or using multi-spec as of now.

My preference is to avoid declaring anything as s/nilable, similar to how Datomic does not allow a datom with a nil value.

One approach would be to call some function prune, that recursively removes entries where the value is nil, before checking against the spec.

Then your spec can be:
(s/def :event/user (s/keys ...))
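A possible `prune`, sketched with `clojure.walk/postwalk`; this is just one way to write it, not a function from any library:

```clojure
(require '[clojure.walk :as walk])

(defn prune
  "Recursively removes map entries whose value is nil.
  Maps that become empty are kept, and nils inside vectors are left alone."
  [data]
  (walk/postwalk
    (fn [x]
      (if (map? x)
        (into {} (remove (comp nil? val)) x)
        x))
    data))
```

Calling `(prune raw-event)` before `s/valid?` means the spec never has to mention nil at all.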

I think that is a very good idea: having a fail-safe cleaning function and only checking the (maybe) cleaned data. There are some ways I know of in which the raw data is “dirty”, but there may be other, unexpected problems that I want to discover. This approach would do that without much additional work/complexity.

Best would be to be able to derive the prune functions from the clean specs. Not a best practice, but there is some machinery for this in the spec-tools library: https://cljdoc.org/d/metosin/spec-tools/0.10.0/doc/spec-coercion
