Discussion: an idiomatic data format to represent Org-mode documents?

Hello!

I’m using Org-mode to keep track of the textual information I produce, and I’m guessing that there are other emacsen around here who do the same.

I would like to be able to use Org-mode for more programmatic use cases, such as semantic searches through the documents I’m writing. Ideally, I would like to work with the data in an Org-mode document from Clojure.

To kick things off, here’s a sample document and a possible representation:

* An idiomatic data format for Org-mode documents
By using Org-mode for input manipulation, we have a best-in-class input system.
But we should be able to work with it programmatically, from the REPL! What if I
want to perform some conversion?
** Possible uses
Use org-mode for Clojure data structure manipulation when we want to /author/
some content, put that org file in the project and ~io/resource~ it out.
** Data or database?
Should we go for pure data (like Hiccup and Pandoc) or a database that we can
traverse and query? I've wanted to use Datascript for something serious for a
while.

[:section
 [:header "An idiomatic data format for Org-mode documents"]
 [:body
  [:p "By using Org-mode for input manipulation ..."]
  [:section
   [:header "Possible uses"]
   [:p "Use org-mode for Clojure data structure ..."]]
  [:section
   [:header "Data or database?"]
   [:p "Should we go for pure data (like Hiccup and Pandoc) or a ..."]]]]

;; Personal comments: This doesn't really represent header levels
;; right. And it feels verbose. Improvement suggestions?

Two quick points:

  • Let’s not go for a full spec at once. Org mode is massive. My proposal: represent the structure of the document as headers containing headers and text, and leave out italics, code blocks and all the little bits for now. Improve if it’s good.
  • Should we treat a document as pure data (like Hiccup) or as a database (Datascript?)?
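
One way to make header levels explicit, as a rough sketch and not a proposal for the final format: give every headline a map with :level, :title, :text and :children. The names parse-line, build-sections and parse-org are made up for this example, and only headlines and plain text lines are handled, in line with the "structure first" point above.

```clojure
(require '[clojure.string :as str])

(defn parse-line
  "Classifies one line as a headline (with its level) or as plain text."
  [line]
  (if-let [[_ stars title] (re-matches #"(\*+)\s+(.*)" line)]
    {:type :headline :level (count stars) :title title}
    {:type :text :text line}))

(defn build-sections
  "Consumes parsed lines while they are headlines deeper than `parent-level`.
  Returns [sections remaining-lines]."
  [lines parent-level]
  (loop [lines lines, sections []]
    (let [{:keys [type level title]} (first lines)]
      (if (and (= type :headline) (> level parent-level))
        (let [[text more]          (split-with #(= :text (:type %)) (rest lines))
              [children remaining] (build-sections more level)]
          (recur remaining
                 (conj sections {:level    level
                                 :title    title
                                 :text     (str/join "\n" (map :text text))
                                 :children children})))
        [sections lines]))))

(defn parse-org
  "Parses an Org string into a vector of nested section maps.
  Text before the first headline is dropped in this sketch."
  [s]
  (let [lines (->> (str/split-lines s)
                   (map parse-line)
                   (drop-while #(= :text (:type %))))]
    (first (build-sections lines 0))))

(def doc (parse-org "* A\nintro\n** B\nb text\n** C\n* D"))
```

Here `doc` is a vector of two top-level sections, and the section titled "A" carries "B" and "C" in its :children. Because the result is plain maps and vectors, it stays on the "pure data" side of the question; loading it into Datascript would be a separate step.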

I’m interested in hearing your opinions!

Teodor

Interesting!

Just putting here links to two projects that seem to do something of this kind:

  • Organum – originally by Greg Hawkins, currently maintained by @seylerius – see its tests for examples of the data format
  • clj-org – by John Jacobsen

After getting hooked on org-mode last year, I started trying to build out a parser this year. A lot of the spec seemed to lend itself to regexes + possibly a PEG or Packrat parser; for example, the standard headline format has a series of optional elements that must appear in a certain order. I made a tiny parser that seemed to handle this correctly for that element, at least.
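
To illustrate the "optional elements in a fixed order" idea, here is a minimal regex sketch for a headline line. It is not the parser mentioned above; `headline-re` and `parse-headline` are hypothetical names, and the keyword set is hard-coded to TODO and DONE even though Org lets users configure it.

```clojure
(require '[clojure.string :as str])

;; Order assumed here: stars, optional keyword, optional priority cookie,
;; title, optional tags at the end of the line.
(def headline-re
  #"(\*+)\s+(?:(TODO|DONE)\s+)?(?:\[#([A-Z])\]\s+)?(.*?)(?:\s+(:[\w@#%:]+:))?\s*")

(defn parse-headline
  "Returns a map describing the headline, or nil if `line` is not a headline."
  [line]
  (when-let [[_ stars todo priority title tags] (re-matches headline-re line)]
    {:level    (count stars)
     :todo     todo
     :priority priority
     :title    title
     :tags     (if tags
                 (vec (remove str/blank? (str/split tags #":")))
                 [])}))

(def example (parse-headline "** TODO [#A] Write parser :clojure:org:"))
```

The title group is lazy so that a trailing tag block is claimed by the tags group instead of being swallowed by the title. A PEG or Packrat grammar would probably scale better once more elements are added.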

If there were a formal model for Org itself that isn’t tied to HTML the way clj-org is, I’d be more inclined to pick that up and keep running with it.

I’ve been using org-mode as a database for my personal task management.

What I do is parse org-mode files into a tree of maps, and then use hickory-like selectors and clojure.zip zippers to query the tree structure.

I’ve found this really useful, as having the org data structure in clojure makes it easy and fast to query and manipulate.
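
A minimal sketch of what querying such a tree with clojure.zip could look like. The map shape (:title, :todo, :children) and the helpers `org-zip`, `select` and `update-where` are assumptions made for illustration, with plain predicates standing in for hickory-like selectors.

```clojure
(require '[clojure.zip :as zip])

;; Hypothetical task tree; the real parsed structure may differ.
(def tasks
  {:title "root"
   :children [{:title "Work"
               :children [{:title "Write report" :todo "TODO" :children []}
                          {:title "Send invoice" :todo "DONE" :children []}]}
              {:title "Home"
               :children [{:title "Fix bike" :todo "TODO" :children []}]}]})

(defn org-zip
  "Builds a zipper over a tree of maps that keep their children under :children."
  [root]
  (zip/zipper map?
              (comp seq :children) ; seq turns an empty vector into nil, so leaves have no children
              (fn [node children] (assoc node :children (vec children)))
              root))

(defn select
  "Returns every node, in depth-first order, for which `pred` is truthy."
  [root pred]
  (->> (org-zip root)
       (iterate zip/next)
       (take-while (complement zip/end?))
       (map zip/node)
       (filter pred)))

(defn update-where
  "Applies `f` to every node matching `pred` and returns the rebuilt tree."
  [root pred f]
  (loop [loc (org-zip root)]
    (cond
      (zip/end? loc)        (zip/root loc)
      (pred (zip/node loc)) (recur (zip/next (zip/edit loc f)))
      :else                 (recur (zip/next loc)))))

(def open-tasks (map :title (select tasks #(= "TODO" (:todo %)))))
(def all-done   (update-where tasks #(= "TODO" (:todo %)) #(assoc % :todo "DONE")))
```

`select` covers read-only queries, while `update-where` shows the part zippers are good at: editing a node deep in the tree and getting a rebuilt root back.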

Would you mind giving an example of the data structure and a query?

For my case, I’m wondering where to put the heading text, and if I’m able to query subheadings. Example: give me a simple way to map over level 3 headings and give me access to the related level 1 and 2 headers.
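
To make the question concrete, here is a sketch of the kind of query I have in mind, assuming a hypothetical tree where every heading is a map with :level, :title and :children. The function `headings-with-ancestors` is made up for this example.

```clojure
;; Hypothetical document tree, not the output of any existing parser.
(def doc
  [{:level 1 :title "Projects"
    :children [{:level 2 :title "Org parser"
                :children [{:level 3 :title "Headlines" :children []}
                           {:level 3 :title "Tags" :children []}]}
               {:level 2 :title "Blog"
                :children [{:level 3 :title "Drafts" :children []}]}]}])

(defn headings-with-ancestors
  "Flattens the tree depth-first. Each heading gets an :ancestors vector
  holding the titles of the headings above it, outermost first."
  ([sections] (headings-with-ancestors sections []))
  ([sections ancestors]
   (mapcat (fn [{:keys [title children] :as section}]
             (cons (assoc section :ancestors ancestors)
                   (headings-with-ancestors children (conj ancestors title))))
           sections)))

;; Pairs of [ancestor-titles title] for every level 3 heading.
(def level-3
  (->> (headings-with-ancestors doc)
       (filter #(= 3 (:level %)))
       (map (juxt :ancestors :title))))
```

With this shape the heading text lives under :title, and mapping over level 3 headings is an ordinary `filter` plus `map` over the flattened sequence.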