A. I really like the separation in this setup. “XTDB … uses Apache Kafka for the primary storage of transactions and documents, and RocksDB or LMDB to host indexes for rich query support.”
Can / does XTDB operate on top of existing Kafka topics? I.e., can a separate tool traverse an XTDB Kafka topic and read the messages? This presumes the client can deserialize the messages (Avro, Protobuf, Transit, et al.).
B.i. Can XTDB be the base for its own Stream Processing topology?
B.ii. Or, can another service (Riemann, Kafka Streams) write to a Kafka topic, with XTDB used on top of the same topics? I’m assuming not, but figured I’d ask.
Looking at the implementation, it seems like the topic is “crux-docs”.
As in, being both a source and a sink? I don’t see why not.
Not too sure about this… but it might work if you put the same data on the topic for ingestion.
I’m not entirely sure what you’re actually trying to achieve… The XTDB topics (xtdb-transaction-log) can be read, but I suspect that if you write directly to them, you’re going to cause yourself some headaches. I’m using XTDB with Kafka Streams, and use a “submit-only” client for sending stuff into XTDB. The indexing and transaction-function processing doesn’t happen on those Kafka topics as such, so you don’t really have the full DB there “at rest”.
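A rough mental model of that split, sketched in Python (the names and operations here are illustrative, not XTDB’s actual API): the Kafka topic is only a durable, ordered log of submitted operations, and each node replays that log into its own local index. That’s why the topic alone isn’t the “DB at rest”.

```python
# Illustrative sketch (NOT XTDB's API): the tx log is a dumb ordered
# record of submitted operations; the queryable state lives in a local
# index that each node builds by replaying the log.

tx_log = []  # stands in for the Kafka transaction-log topic

def submit_tx(ops):
    """A 'submit-only' client: append to the log, do no indexing."""
    tx_log.append(ops)
    return len(tx_log) - 1  # offset / tx-id

def build_local_index(log):
    """What a node does locally: replay the log into a queryable index."""
    index = {}
    for ops in log:
        for op, doc_id, doc in ops:
            if op == "put":
                index[doc_id] = doc
            elif op == "delete":
                index.pop(doc_id, None)
    return index

submit_tx([("put", "alice", {"name": "Alice"})])
submit_tx([("put", "bob", {"name": "Bob"}), ("delete", "alice", None)])

index = build_local_index(tx_log)
print(index)  # {'bob': {'name': 'Bob'}} -- the raw log alone isn't the DB
```

Note that another process writing arbitrary records straight into `tx_log` would be replayed as-is by every node, which is the “headaches” part: the log format is an internal contract, not a public ingestion interface.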
Heeey Chris, what’s going on
A look at the source of course makes a lot of sense, ha ha. My bad for not checking there first.
As far as XTDB interoperating with other stream-processing platforms, the more I thought about it, the less plausible it sounded. I.e., XTDB needs to index the documents it’s storing, and I’m not confident that XTDB would index (Kafka topic) documents coming from an external producer. Any wrapper you put around a producer immediately pulls in an XTDB client.
I’m not entirely sure what you’re actually trying to achieve…
When I’d previously used a stream-processing platform (Kafka Streams, Samza), setting up and operating your topology worked well. But with both libraries, there wasn’t a way to do real-time searching on messages in a topic, or joining across topics. In order to query/join across topics, you’d need to manually add a stream processor to your topology (with all of the subject topics as inputs).
Now, from a reporting standpoint, or even in your stream processors, it’s very valuable to be able to query/join across topics. Very much like what Presto does.
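The kind of manual join processor described above can be sketched as a toy in-memory exercise (this is not Kafka Streams’ API; topic names and shapes are invented): materialize each input topic into a keyed table, and emit a joined record whenever both sides have seen the key.

```python
# Toy sketch of a stream-stream join across two "topics" (invented
# names, not the real Kafka Streams API): materialize each side into a
# keyed table and emit a joined record once both sides have the key.

def join_streams(events):
    left, right = {}, {}   # per-topic keyed state stores
    joined = []
    for topic, key, value in events:  # one merged, ordered event stream
        table, other = (left, right) if topic == "orders" else (right, left)
        table[key] = value
        if key in other:
            joined.append((key, left[key], right[key]))
    return joined

events = [
    ("orders",   "o1", {"item": "book"}),
    ("payments", "o2", {"amount": 5}),
    ("payments", "o1", {"amount": 12}),  # join fires here for o1
    ("orders",   "o2", {"item": "pen"}), # and here for o2
]
print(join_streams(events))
```

The point of the sketch is the operational cost: you hand-roll state stores and wiring per join, which is exactly the gap tools like Presto and ksqlDB fill.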
Since I’d first started using Kafka Streams, Confluent has added KSQL and later ksqlDB.
Now, it’d be nice to use Datalog as the query engine. But after I’d thought about it, I’m almost sure that’s not possible with its current setup.
The indexing and transaction-function processing doesn’t happen on those Kafka topics as such, so you don’t really have the full DB there “at rest”.
Exactly, yes. Lol, I thought about my question about 5 minutes after I posted, and it occurred to me that that arrangement would prevent any commingling of Kafka Topics between XTDB and other libs.
I didn’t think it was a good idea, but yeah, I see what you’re looking to do (querying streams). I think the difficulty will be specific to the kind of query you want to do (i.e. full-text, fuzzy). I did mess with XTDB and Kafka a couple of years back, but dropped them for Redis and Postgres instead.
A lightweight way to get full relational queries might be to have a SQLite node running in the pipeline and send information there. Provided that you can live with queries that may be out of sync (I’m pretty sure that ksqlDB streaming queries are delayed as well).
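The SQLite idea is cheap to prototype, since it ships in Python’s standard library. A minimal sketch (the schema and message fields are invented for illustration) of a node that lands each consumed message in a table and answers relational queries against it, accepting that results lag the stream:

```python
import sqlite3

# Sketch of a "SQLite node" in a pipeline (schema invented for
# illustration): land each consumed message in a table, then run full
# relational queries against it -- accepting some staleness vs the stream.

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (topic TEXT, key TEXT, amount REAL)")

def on_message(topic, key, amount):
    """Called per consumed message (e.g. from a Kafka consumer loop)."""
    conn.execute("INSERT INTO events VALUES (?, ?, ?)", (topic, key, amount))
    conn.commit()

on_message("payments", "o1", 12.0)
on_message("payments", "o2", 5.0)
on_message("payments", "o1", 3.0)

# The kind of relational query that's awkward to do topic-side:
rows = conn.execute(
    "SELECT key, SUM(amount) FROM events GROUP BY key ORDER BY key"
).fetchall()
print(rows)  # [('o1', 15.0), ('o2', 5.0)]
```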
You may also want to check out Tarantool. It’s basically a supercharged Redis (cache, streams) with an SQL engine, and scripting is via LuaJIT.
Indeed, yeah, Presto and ksqlDB are already doing query / joins across topics. Any such tool just has to be coordinated with Kafka, and will be delayed somewhat. It’d just be nice to query using Datalog, lol!
Thanks for the tip on Tarantool. It’s the first time I’m hearing of it - I’ll check it out.