Persistent, Git friendly local database.... any suggestions?


#1

As you know, I love writing scripts that automate various tasks (I’m a one-trick pony).
I usually have a repo that contains something - say automated emails - and its scripts.
Often I find myself needing to store some state - for example which automated emails we sent.
At the moment, what i do is that I store everything in a SQlite database - works like a charm, but it’s a binary file, so no inspection of changes and no merging. After I do something, I just commit. This state
is not especially large and will never be - thousands of records maximum. I know I could write hooks to dump the database as text and build it back, but it’s a PITA.

So I was considering having such state kept as EDN - I read an EDN file, do my changes, save it back. I get immediate visibility of changes that were made, and I can always repair/merge with a text editor.
Anybody did something similar?


#2

I do this sort of thing all the time. Go for it!


#3

Just with a plain EDN file?


#4

Yeah, for keeping a file with a little bit of data some EDN works fine. The only real concern is a crash while overwriting the old data resulting in data loss (some would copy the current state to a backup before writing the new state to protect against this), but if you’ve got the previous versions in git even that isn’t much of a problem.


#5

You could also go for an append-only EDN file backed by git.
Or use MapDB, a Java collection that synchronises it’s changes on disk. I heard it’s mature, and I think it would be possible to wrap it with a nice Clojure-like interface, so that conj and the likes writes to disk automatically. A colleague of mine started this project, but not sure it’s released some where yet. GitHub search is your friend :wink:


#6

I’ve been thinking the same thing! EDN is just so nice. Simple to handle. Some comments:

Datascript is a nice way to structure data when I don’t want to handle a database. You can store that database as EDN, so you can diff it with Git as you normally would.

Pros vs plain EDN: not a strict hierarchy. You get a more “open” data model from the beginning. No need to decide on what the top level keys have to be from the start. You also get relations. Cons vs EDN: slightly more complex, and needs a schema.

For getting diffable EDN files, I serialize data with clojure.core/pprint. That gives me reasonable line widths, which makes git diff readable.


#7

I was thinking something along the lines of Datascript myself - consider I already use a database and the schema is a no-brainer. About serialization, does pprint use the same sort order for keys - or you get unintended diffs because the order of a map key changed?


#8

You got me going!

  • I didn’t find any way to pretty print while sorting keys
  • (array-map) ordering seems to be unstable
  • (hash-map) ordering seems to be stable
  • Clojure seems to use array maps for small maps (<= 8 items) and hash maps for larger maps (>= 9 items)

My conclusions:

  • You can choose not to care – you might mostly get hash maps anyway.
  • If you do care, consider converting to sorted maps before starting. You can use clojure.walk/postwalk to traverse an arbitrary nested structure and do conversion.

Converting to sorted maps in an arbitrary nested data structure:

(require '[clojure.walk])

(defn maps->sorted-maps [data]
  (clojure.walk/postwalk
   (fn [item]
     (if (instance? clojure.lang.IPersistentMap item)
       (into (sorted-map) item)
       item))
   data))

(-> (zipmap [1 2 3] (repeat 'n))
    maps->sorted-maps
    prn)
;; => {1 n, 2 n, 3 n}

(-> (zipmap [3 2 1] (repeat 'n))
    maps->sorted-maps
    prn)
;; => {1 n, 2 n, 3 n}

Following is what I tried at the REPL. If you want better backing for the claims, some test.check properties for this may be interesting.

REPL explorations
(require '[clojure.pprint :refer [pprint]])

;; Is the ordering stable?
;; I think it may depend on whether we get an arraymap or a hashmap.

;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
;; Test 1: is (keys) ordering the same as (pprint) ordering?

(for [sample-len [3 6 9 12]]
  (let [m (zipmap (range sample-len)
                  (repeat 'n))]
    (pprint m)
    (pprint (keys m))))
;; =>
;;    {0 n, 1 n, 2 n}
;;    (0 1 2)
;;    {0 n, 1 n, 2 n, 3 n, 4 n, 5 n}
;;    (0 1 2 3 4 5)
;;    {0 n, 7 n, 1 n, 4 n, 6 n, 3 n, 2 n, 5 n, 8 n}
;;    (0 7 1 4 6 3 2 5 8)
;;    {0 n, 7 n, 1 n, 4 n, 6 n, 3 n, 2 n, 11 n, 9 n, 5 n, 10 n, 8 n}
;;    (0 7 1 4 6 3 2 11 9 5 10 8)

;; Conclusion: Looks equal as far as I can tell.

;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
;; Test 2: do we get the same ordering as we add more keys?

(for [sample-len [3 6 9 12 15 18]]
  (let [m (zipmap (range sample-len)
                  (repeat 0))]
    (prn `(~'range ~sample-len))
    (pprint (keys m))
    (println (type m))
    (println)
    ))
;; =>
;;    (range 3)
;;    (0 1 2)
;;    clojure.lang.PersistentArrayMap
;;
;;    (range 6)
;;    (0 1 2 3 4 5)
;;    clojure.lang.PersistentArrayMap
;;
;;    (range 9)
;;    (0 7 1 4 6 3 2 5 8)
;;    clojure.lang.PersistentHashMap
;;
;;    (range 12)
;;    (0 7 1 4 6 3 2 11 9 5 10 8)
;;    clojure.lang.PersistentHashMap
;;
;;    (range 15)
;;    (0 7 1 4 13 6 3 12 2 11 9 5 14 10 8)
;;    clojure.lang.PersistentHashMap
;;
;;    (range 18)
;;    (0 7 1 4 15 13 6 17 3 12 2 11 9 5 14 16 10 8)
;;    clojure.lang.PersistentHashMap

;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
;; Test 3: is intmap ordering stable?

(prn (zipmap [1 2 3 4 5 6] (repeat 'n)))
;; => {1 n, 2 n, 3 n, 4 n, 5 n, 6 n}
(prn (zipmap [6 5 4 3 2 1] (repeat 'n)))
;; => {6 n, 5 n, 4 n, 3 n, 2 n, 1 n}

;; We seem to get the same order as we put the items in. So not a stable
;; ordering!

;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
;; Test 4: is hashmap ordering stable?

(prn (zipmap [1 2 3 4 5 6 7 8 9] (repeat 'n)))
;; => {7 n, 1 n, 4 n, 6 n, 3 n, 2 n, 9 n, 5 n, 8 n}
(prn (zipmap [9 8 7 6 5 4 3 2 1] (repeat 'n)))
;; => {7 n, 1 n, 4 n, 6 n, 3 n, 2 n, 9 n, 5 n, 8 n}

;; Conclusion: looks like they key ordering is stable as long as we have
;; hashmaps.

;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
;; Test 5: when do we flip into hash maps?

(defn map-of-length [n]
  (zipmap (range n) (repeat 'n)))

(pprint (for [n (range 12)]
          [n (type (map-of-length n))]))
;; ([0 clojure.lang.PersistentArrayMap]
;;  [1 clojure.lang.PersistentArrayMap]
;;  [2 clojure.lang.PersistentArrayMap]
;;  [3 clojure.lang.PersistentArrayMap]
;;  [4 clojure.lang.PersistentArrayMap]
;;  [5 clojure.lang.PersistentArrayMap]
;;  [6 clojure.lang.PersistentArrayMap]
;;  [7 clojure.lang.PersistentArrayMap]
;;  [8 clojure.lang.PersistentArrayMap]
;;  [9 clojure.lang.PersistentHashMap]
;;  [10 clojure.lang.PersistentHashMap]
;;  [11 clojure.lang.PersistentHashMap])

#9

I can recommend this wrapper for MapDB, which I made awhile back. However, the OP was looking for something that would work nicely with git, which this will not.


#10

I thought MapDB could read data from disk, and that a Clojure wrapper could write this data as EDN? Wouldn’t this make it git firendly?


#11

I’m not sure what you’re saying here. MapDB is a database to which one would write and and later read it back. The on-disk representation of that data is a binary format.


#12

Have you heard of duratom ?

A durable atom type for Clojure. Duratom implements the same interfaces as the core Clojure atom (IAtom, IRef, IDeref). In order to provide durability duratom will persist its state to some durable-backend on each mutation. The built-in backends are:

  1. A file on the local file-system
  2. A postgres DB table row
  3. An AWS-S3 bucket key

#13

There is also Durable-ref which implements various reference types with durable storage, one of which is a file.


#14

I was talking out of ignorance :sweat_smile: Didn’t know it was sorted as a binary.


#15

What about Microsoft access? I think there is a jdbc for it?


#16

I am using Konserve for this use case. It is git friendly. I wrote a layer on top of it that formalizes things related to persisting state, such as validation. That layer is called Kampbell, it is opinionated, it embraces specs, and it relies on the tests to demonstrate usage.


#17

I recently decided to use Fossil as a file format for my application. It’s a full distributed SCM similar to Git, stored in a single SQLite database. It’s open source and well documented, but there isn’t much api-wise besides the cli binaries.

Also, if you’re planning on storing a data file in source control, a structured diff tool is probably better than a text based one. Has anyone ever tried to make an EDN diff format?


#18

Speaking of EDN diff format, there’s this spec about JSON patch that just describes the changes to a data structure. Given the similarity between EDN and JSON, I guess this could be done somehow?