Data modelling with vectors

Clojure vectors are so much more than a convenient way to encode sequential data to the repl without having to quote lists. Still it’s not aparent how to use a vector for storing your application data. Clojure doesn’t even have an apparent “look for this specific single object in this sequence or vector” built in.

Also shuffling around data in a vector is not very easy using otherwise conventient sequence operations - constructions like (vec (concat (take ...) new-item (drop ...))) are obviously… not what the creator of Clojure had in mind.

Let’s see what vectors have speaking for them:

  • We don’t have to invent unique keys for storing things in the vector.
  • The vector as has a reproducible ordering
  • Anything can reference to a specific position in the vector storing only an integer index (a 64-bit long).

Things vectors are not suitable for:

  • inserting things anywhere in the collection
  • using the position indexing for domain-modelling

I will try to convince you that it is often better to use a “bare” vector than to have a map with various collections of items. I will also show you how to pick various specific collections out of the vector without having to use sequence functions like filter with various custom predicate functions.

Vectors as functions

Vectors behave similar to maps in some aspects, the position (index) of the item works as the key. Vectors can work as functions like hash-maps, but throws IndexOutOfBoundsException when the argument is not in the index (like the speedy nth do). Using get, nil is returned when the key is not present. Both behaviours have their use cases.

([1 2] 0) => 1
(nth [1 2] 3) => throws!
(nth [1 2] 3 :default) => :default
(get [1 2] 3 :default) => :default
(get [1 2] 3) => nil

A vector like [:a :b] is in some ways like map {0 :a 1 :b}.

There are some caveats:

assoc only works for either replacing something in the vector:

(assoc [:a :b] 0 :A)
[:A :b]

or add an item the very end of the vector.

(assoc [:a :b] 2 :c)
;=> [:a :b :c]

This conj-like assoc was new to me, but it’s reasonable and convenient in many use cases.

Trying to assoc something beyond the end of the vector throws an IndexOutOfBoundsException:

(assoc [:a :b] 44 :c)
;=> Execution error (IndexOutOfBoundsException)

Destructuring
Vectors can be destructured as

(let [[a b _ d :as the-vector] [1 2 3 4 5]] [a b d the-vector])
;=> [1 2 4 [1 2 3 4 5]]

It is also possible to destructure a vector with the syntax for hash-maps, :as returns the original vector, :or (defaults) works as well:

(let [{a 0 b 1 c 2 d 3 
       :or {c :c}
       :as the-vec} [:a :b]] [a b c d the-vec])
[:a :b :c nil [:a :b]]

I would not use this type of destructuring in regular code, but it’s certainly possible.

reduce-kv is also highly usable with vectors. The key is the vector-index, and the value is the value at the position. A contrived example using the assoc conjy behaviour:

(reduce-kv
  (fn push [agg idx value] 
     (assoc agg (inc idx) value))
  [:hello] ;; aggregate
  [5 4 3 1])
[:hello 5 4 3 1]

Using collections of indexes to work on vectors
Vectors of items (or pointers in the form of integers) can also be used as the to copy/move things around in various ways.

An example: we want to move the item in the third position to the first position in the vector, and the item from the first position to the the third position:

(mapv [:llama :camel :snake :capybara] [2 1 0 3]) 
[:snake :camel :llama :capybara]

The vector [2 1 0 3] is used as a sequence in this case.

It’s of course also possible to express a partial select of the vector, let’s say we only want the llama and the capybara:

(mapv [:llama :camel :snake :capybara] [0 3]) 
[:llama :capybara]

Of course we can return copies of objects in any order as well:

(mapv [:llama :camel :snake :capybara] [0 3 3 0]) 
[:llama :capybara :capybara :llama]

If you have a complex data structure for your application code you can index this data structure to be able to access the data in various ways. The clojure.data.int-map-library gives you sorted compact int-sets which seems like a match made in heaven for indexing vectorized data.

Lets say we have this small data set with two cats, a dog and a snake in various colors.

(def animals
  [{:animal :cat :color :brown :name "Bob"}
   {:animal :dog :color :brown :name "Dido"}
   {:animal :snake :color :yellow :name "Alex"}
   {:animal :cat :color :yellow :name "Garfield"}])

It’s possible to create an index like this:

(def index
  {:animal {:cat #{0 3}
            :dog #{1}
            :snake #{2}}
   :color {:brown #{0 1}
           :yellow #{2 3}}})

If you want to get a vector of all cats, do:

(mapv animals (get-in index [:animal :cat]) => 
  [{:animal :cat :color :brown :name "Bob"}
   {:animal :cat :color :yellow :name "Garfield"}]

Adding a back-reference
We really want to be able to have some pointer back to the original data to know where it came from. The keep-indexed is perfect for this use-case:

(def selected #{0 3})
(vec (keep-indexed (fn [idx v] (if (contains? selected idx) (assoc v :idx idx))) animals))
  [{:animal :cat :color :brown :name "Bob" :idx 0}
   {:animal :cat :color :yellow :name "Garfield" :idx 3}]

More specific selections using set operations on indexes
If you want to select (get the index to) all yellow cats, you can get the set intersection of the sets representing animal type and color using either clojure.set/intersection (for regular sets), or the clojure.data.int-map:s own super fast i/intersection:

(i/intersection (get-in index [:animal :cat]) (get-index [:color :yellow])) ;;=>
(i/intersection #{0 3} #{2 3})
;;=> #{3}

Which can be selected just as before:

(def selected #{3})
(vec (keep-indexed (fn [idx v] (if (contains? selected idx) (assoc v :idx idx))) animals))
=>[{:animal :cat :color :yellow :name "Garfield" :idx 3}]

You just made yourself a indexed database out of a vector!

It is possible to add references (as a single index integer or sets of integer index values) between various entities in the vector, and with more indexing work there can be relational queries and walks in the database. That’s for next time.

13 Likes

Clojure doesn’t even have an apparent “look for this specific single object in this sequence or vector” built in.

Filter and take will do effectively do that right?

  (->> [{:type "a"} {:type "b"} {:type "c"}]
       (filter #(-> % :type (= "a")))
       (take 1))
  ;; => ({:type "a"})

find works on vectors: find - clojure.core | ClojureDocs - Community-Powered Clojure Documentation and Examples

Of course it is easy and possible to construct such a function. What I was thinking of is something like:

(defn find-first [pred coll]
    (first (filter pred coll)))

which, functionally, is trivial. Computationally it is a O(n) - linear - operation which quickly may lead to O(n^2) - quadratic things or worse, in practical operations, which kills performance for larger data sets.

I don’t remember exactly where I heard it, but I’m quite sure I’ve heard Rich Hickey mention that he did not want any collection functions which had O(n) behaviour in the core of the system. (Apart from the functions applying to sequences, which obviously has a linear term).

I think avoiding computationally O(n) functions it is a great decision. If one knows why, then fine, use a linear search, but the language/library clearly won’t make that use case very easy.

On the contrary: cases where you use hash lookups or other almost O(1)/O(log(n))-function is promoted almost everywhere: in destructuring, hash-maps, hash-sets and vectors as functions etc).

There is also tons of functions using transforms which promotes use hash-lookups efficiently (group-by, frequencies, the namespace clojure.set, assoc, update, get …) often where linear search usage is not really possible.

So in a sense you are right, yes, of course one can look for an certain element linearly in a collection by some predicate, but idiomatic clojure begs you to index your collection to avoid linear lookups.

1 Like

Yes, you can use find on a vector, it will return a map-entry with the index-key and the value found in place. There is no built in function to find a certain element based on the value you are looking for in a vector, like “find the position of :a in this vector”.

On the other hand it is very easy to construct some suitable index for finding values this way.

1 Like

This is cool! When I’m doing something simple I often reach for a vector as a way of storing values, and then invariably I find myself converting it to a map as soon as I reach any significant level of complexity. But I didn’t know about some of these tricks, so you certainly made me think a bit differently about how a vector could be useful.

The vector sort [:snake :camel :llama :capybara] blew my mind, and I can’t count the number of times I’ve wanted a back-reference just like the one you demonstrated with keep-indexed. Thank you!

1 Like

This topic was automatically closed 182 days after the last reply. New replies are no longer allowed.