I’m trying to wrap my SQL-indoctrinated mind around Crux as a document database. I have some questions and I don’t even know if they are relevant in document stores. Namely, what about database dumps (backups) and database migrations? My intuition so far is that database backup belongs to the host DB system (for Crux that seems to be my Postgres, or something through Kafka, which I don’t know), and that migrations are really a nil operation on document databases because there is no schema to be updated. Does anyone have confirmation or a deeper explanation of these? And what about connection pooling (I use HikariCP for my pooling)? Is pooling a thing in something like Crux?
I have been wondering about this for some time as well, especially about the migration process. My feeling is that, even though you don’t define a schema up-front, your use of the documents stored in the database constitutes an ad-hoc schema. This means that schema migrations are replaced by data migrations: your application defines how data is stored and read, and if that use changes you have to migrate the data already in the database to the new ad-hoc schema. Is that indeed how it is?
I guess different strategies can be employed. I can only speak to what I do. Also, not all NoSQL stores are exactly the same, so some stores may require different strategies. I haven’t used Crux, for example, so I can’t speak to it specifically.
In a semi-structured DB, I don’t do data migrations. You can always just add new fields or stop updating existing fields. What I do instead is have my code forever handle all possible schemas. So if documents had shape A five years ago and have since had shapes B, C, and D, my code knows how to read/write all of them.
I use Spec for this. I’d have a multi-spec over the type of document.
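To make that concrete, here is a minimal sketch of that multi-spec idea; the `:doc/type` attribute and the user shapes are made-up examples, not anything from a real code base:

```clojure
(require '[clojure.spec.alpha :as s])

;; Dispatch on a (hypothetical) :doc/type attribute carried by every document.
(defmulti doc-type :doc/type)

;; Shape A: the original form, with a single name attribute.
(defmethod doc-type :user.v1 [_]
  (s/keys :req [:doc/type :user/name]))

;; Shape B: a later form that split the name into first/last.
(defmethod doc-type :user.v2 [_]
  (s/keys :req [:doc/type :user/first-name :user/last-name]))

(s/def :doc/type keyword?)
(s/def :user/name string?)
(s/def :user/first-name string?)
(s/def :user/last-name string?)

;; One spec that covers every shape the code has ever written.
(s/def ::document (s/multi-spec doc-type :doc/type))

(s/valid? ::document {:doc/type :user.v1 :user/name "Ada"})
;; => true
(s/valid? ::document {:doc/type :user.v2 :user/first-name "Ada"})
;; => false (missing :user/last-name)
```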
In the code base, I can normalize at the boundaries: when I read, no matter if it’s A, B, C, or D, I can convert it to a normal form X, and my code can use X. On writing back, I could choose to upgrade; say I write normal form X back, I could write it back in the shape of D.
But to be honest, I often don’t normalize like this. I tend to just have code everywhere dispatch on the document type with a multimethod. I find that better, because with normalizing you end up having to make fields that are now mandatory optional, since they would be missing for older doc types. So I’d rather not do that and just handle each case on its own.
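Roughly like this, reusing the hypothetical `:doc/type` attribute and user shapes from before; each shape keeps its own reading logic instead of being normalized:

```clojure
;; One multimethod per operation, dispatching on the document's type.
(defmulti display-name :doc/type)

;; Shape A stored the full name in a single attribute.
(defmethod display-name :user.v1 [doc]
  (:user/name doc))

;; Shape B split it into first/last, so this shape joins them.
(defmethod display-name :user.v2 [doc]
  (str (:user/first-name doc) " " (:user/last-name doc)))

(display-name {:doc/type :user.v1 :user/name "Ada Lovelace"})
;; => "Ada Lovelace"
(display-name {:doc/type :user.v2
               :user/first-name "Ada" :user/last-name "Lovelace"})
;; => "Ada Lovelace"
```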
Sometimes I backfill data if it’s possible. Say I want to introduce a new field and I can actually backfill the data for that field over all existing documents; I’d do that as a one-time job.
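In Crux terms, such a one-time backfill job might look roughly like the sketch below. The attribute names, the `:user.v1` type, and the already-started node bound to `node` are all assumptions on my part:

```clojure
(require '[crux.api :as crux])

;; One-time job: add a :user/status attribute to every existing
;; document of the (hypothetical) type :user.v1.
(let [db  (crux/db node)
      ids (map first
               (crux/q db '{:find  [e]
                            :where [[e :doc/type :user.v1]]}))]
  (doseq [id ids]
    (let [doc (crux/entity db id)]
      ;; :crux.tx/put replaces the whole document, so we re-submit
      ;; the existing doc with the new attribute added.
      (crux/submit-tx node
        [[:crux.tx/put (assoc doc :user/status :active)]]))))
```

Note this queries one fixed snapshot (`crux/db`) and then writes, so documents created while the job runs would be missed; for a one-off migration that is usually acceptable.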
For backups, well, I’ve only used managed document stores, so they generally come with their own replication and backup features. So I’d say that would be Crux-specific.
Oh, and for connection pooling: no, normally these NoSQL DBs have REST endpoints and don’t need an active connection beyond the underlying HTTP connection. So I’ve never had to pool them.
Hi there, I would like to add some more Crux-specific points, since it can be rather confusing with all the ways of running Crux.
I would see three broad setups to run Crux in at the moment:
- Crux runs in standalone mode (RocksDB, SQLite via JDBC, …) embedded in a JVM process
- Crux runs with Kafka or JDBC (Postgres, MySQL, MS SQL Server) as storage and there are one or multiple client nodes
- Crux runs in one of the two above modes, but you also run crux-http-server and your application logic runs separately from Crux itself
Crux also has different data storages:
- The transaction log
- The document store
- The indexes inside the client nodes
The indexes directly in the nodes (can be memory only, but typically persisted in RocksDB) can be rebuilt from the log and don’t need backups.
The document storage is mostly in the same place as the transaction log, but it’s also possible to store it in a place like S3 now.
In general only the transaction log and document storage require backups.
With Kafka or JDBC you can rely on the storage-native backup mechanism.
In standalone mode the backup utilities can be used, or with something like SQLite or H2 you can also use their own backup mechanisms.
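For orientation, a standalone RocksDB node looked roughly like this in the 1.9-era betas; the topology keys and module names have changed across releases, so treat this purely as a sketch and check the docs for your version:

```clojure
(require '[crux.api :as crux])

;; Standalone node: transaction log, document store and indexes all
;; live locally, so backing up means backing up these directories.
;; Config keys are from the 1.9-era betas and may differ in your version.
(def node
  (crux/start-node
    {:crux.node/topology '[crux.standalone/topology
                           crux.kv.rocksdb/kv-store]
     :crux.standalone/event-log-dir "data/event-log"
     :crux.kv/db-dir                "data/db"}))
```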
crux-jdbc uses HikariCP for connection pooling under the hood.
Since there is no explicit schema, no explicit schema has to be migrated.
When changing the shape of documents the implicit schemas still have to be considered for the different usage scenarios:
There is an implicit schema for writing new documents to Crux, which might change over time. It often helps with data integrity to also enforce this schema with a tool like spec.
For the schema on read you have to consider that with Crux you retain all of history, which means your code must always also be able to read old data.
You can get pretty far by avoiding breaking changes in documents. Having unique attributes is pretty nice to do with namespaced keys and adding attributes or making attributes optional can be done without migrating the data.
With a document store like Crux you can decide to model your different data types simply with unique attribute names, or you can introduce a separate attribute that maps a document to a certain version. If you are explicit about mapping data to types, a breaking change would require mapping to a new type or a new version of a type.
Many use cases work only with the latest version of documents while history is only relevant for a few specific features.
It might be helpful to migrate data to the latest schema to simplify the majority of the code base.
In Crux you can even migrate historic data to the latest version by writing these historic documents with the appropriate valid-time. Then only for features where you also need to consider all transaction-times do you actually have to handle all past shapes of a document.
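A sketch of what that could look like with the Crux API; the started node `node`, the entity id, and the `upgrade-doc` function mapping old shapes to the latest one are all hypothetical:

```clojure
(require '[crux.api :as crux])

;; Rewrite one historic version of a document into the latest shape,
;; keeping its original valid-time.
(let [valid-time #inst "2019-06-01"         ;; the historic point to rewrite
      db         (crux/db node valid-time)  ;; DB view as of that valid-time
      doc        (crux/entity db :some-doc-id)]
  ;; Submitting with an explicit valid-time makes valid-time history
  ;; carry the latest shape; transaction-time history still records
  ;; the original write.
  (crux/submit-tx node
    [[:crux.tx/put (upgrade-doc doc) valid-time]]))
```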
The most appropriate strategy is highly specific to each use case.
I hope all of this is somewhat understandable. And of course there might also be other aspects to this which I have not considered. Happy to learn more!
I would really love to see some code-snippets of how the rubber meets the road in terms of your using Spec to guide the reading/writing of your documents. Thinking about it, this feels “right” to me in terms of utilizing the potential of Clojure (spec, multimethods, etc).
A related question for which I’d love to hear document-database folks chime in: I’ve just started using UUIDs (instead of serial integers) as identifiers in my Postgres databases. Postgres has a nice way to auto-generate UUIDs if you install the built-in extension. However, the examples I’ve seen so far don’t ever bring up UUIDs in connection with NoSQL DBs. Are they equally important in DBs of both types and just unmentioned to keep things simple, or is there some reason they are less prevalent in NoSQL? For example, I could see a Datalog argument for describing the data with your query rather than using UUIDs. Am I off-track here?
I see that my UUID question is resolved at https://opencrux.com/docs#transactions-valid-ids . Short: UUIDs are supported natively in Crux.
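For anyone else landing here, using a UUID as the document id then just looks like this (assuming a started node bound to `node`):

```clojure
(require '[crux.api :as crux])

;; Crux accepts UUIDs (among other valid id types) directly as :crux.db/id.
(crux/submit-tx node
  [[:crux.tx/put {:crux.db/id (java.util.UUID/randomUUID)
                  :user/name  "Ada"}]])
```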
I’ve got to say, thanks for this Crux detail! I am excited to put Crux to work.
@Webdev_Tory, have you made any progress using Crux? How has it gone so far?
I’m afraid this research project went to the backburner and so hasn’t received any attention for a while; but so far everything has gone lovely with Crux. Note that it turned out to be much preferable to fire up the RocksDB version, which was far more performant and super easy (just include
[juxt/crux-rocksdb "20.06-1.9.1-beta"] in my project.clj).