Split a large org-mode file into smaller files

Hi, I’m used to imperative programming, and now I’m learning Clojure.

As a test project I want to split the following org-mode file into multiple files:

* Main title
Some intro text
** First blog
PATH:first_blog
Some text
** Second large blog
PATH:second_blog
This second blog is about xxx.
*** Part 1 
PATH:part1
explain part 1
**** Part 1A
Some info about Part 1A
**** Part 1B
Explain Part 1B
*** Part 2
PATH:part2
explain part 2
** Third blog
PATH:third_blog
This is the third blog

I don’t want to slurp the whole file into memory, because the source file could be huge.
This is what I have so far:

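(import '(java.io BufferedReader FileReader))

(defn check-if-header
  "Prints the regex match when line starts with one or more stars."
  [state line]
  (when-let [matches (re-matches #"(\*+)\s*(.*)" line)]
    (prn "HEADERS" matches)))

(defn parse-line
  [state line]
  (println "The line is " line)
  (check-if-header state line))

(defn parse-file
  [file-name]
  (with-open [rdr (BufferedReader. (FileReader. file-name))]
    ;; the state is a fresh [] for every line, so nothing is remembered yet
    (doseq [line (line-seq rdr)]
      (parse-line [] line))))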

First question: how can I pass the state from one call of parse-line to the next? I could also def an atom, but isn’t that the imperative way?
The state will contain, for instance, a vector that works like a stack to remember the paths, and it will remember which header level we are currently dealing with. I think it should also contain a reference to the current writer, where all output goes until we bump into a header that is followed by a PATH line.

Expected output files:
/index.org

* Main title
Some intro text

/first_blog/index.org

** First blog
Some text

/second_blog/index.org

** Second large blog
This second blog is about xxx.

/second_blog/part1/index.org

*** Part 1 
This second blog is about xxx.
explain part 1
**** Part 1A
Some info about Part 1A
**** Part 1B
Explain Part 1B

/second_blog/part2/index.org

*** Part 2 
explain part 2

/third_blog/index.org

** Third blog
This is the third blog

Thanks for any info !

There’s bnbeckwith/orgmode on GitHub, an Org-mode parser in Clojure.

you can check out their parser here:

But generally, you should put the parsing in a (loop [...] ...) block and recur with your updated state.
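A minimal sketch of that loop/recur shape, for illustration only. The names header-level and count-headers and the keys in the state map are made up here, and the regex requires whitespace after the stars, which is slightly stricter than the one in the question. It only tracks the current header level and a header count, but the state map can carry anything.

```clojure
(defn header-level
  "Returns the number of leading stars if line is an org header, else nil.
  Hypothetical helper, not part of any library."
  [line]
  (when-let [[_ stars] (re-matches #"(\*+)\s+.*" line)]
    (count stars)))

(defn count-headers
  "Walks the lines with loop/recur, threading a state map through each step."
  [lines]
  (loop [remaining lines
         state     {:level 0 :headers 0}]
    (if-let [[line & more] (seq remaining)]
      ;; more lines to read: compute the next state and go around again
      (recur more
             (if-let [level (header-level line)]
               (-> state (assoc :level level) (update :headers inc))
               state))
      ;; no lines left: the final state is the result
      state)))

(count-headers ["* Main title" "Some intro text" "** First blog" "PATH:first_blog"])
;; => {:level 2, :headers 2}
```

Nothing is mutated: each pass through the loop receives the state produced by the previous one, which is the functional replacement for an atom here.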


Depending on where your input is coming from, it may not be as simple as going through the lines and parsing each one, as org files may contain BEGIN and END blocks. So it could be worthwhile to use an existing library if the files are more complicated.


FYI, you can use clojure.java.io/reader to open a buffered reader on the file instead of interop.

https://clojuredocs.org/clojure.java.io/reader

You would use reduce instead of doseq.

(require '[clojure.java.io :as io])

(with-open [rdr (io/reader filename)]
  (reduce parse-line {} (line-seq rdr)))

The {} is your initial state; you can put whatever keys and values in it that you want to start with. parse-line should return the new state, and reduce will automatically call it again with the next line and the state returned by the previous call.
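To make that concrete, here is one possible parse-line written as a reducing function. It is a sketch under assumptions of mine, not a finished splitter: the state keys (:stack, :pending, :pending-level, :files) and the helpers target-file, emit, flush-pending and split-lines are invented for this example, a PATH line only counts when it directly follows a header, and the output is collected in a map instead of being written to files so that the function stays pure.

```clojure
(require '[clojure.string :as str])

(def header-re #"(\*+)\s+.*")
(def path-re   #"PATH:(\S+)")

(defn target-file
  "Builds the output file name from the stack of [level path] pairs."
  [stack]
  (str "/" (str/join "/" (concat (map second stack) ["index.org"]))))

(defn emit
  "Appends line to the file the current stack points at."
  [state line]
  (update-in state [:files (target-file (:stack state))] (fnil conj []) line))

(defn flush-pending
  "Emits a header that was waiting to see whether a PATH line follows it."
  [{:keys [pending] :as state}]
  (if pending
    (-> state (emit pending) (assoc :pending nil))
    state))

(defn parse-line
  "Reducing function: takes the old state and a line, returns the new state."
  [state line]
  (let [[_ stars] (re-matches header-re line)
        [_ path]  (re-matches path-re line)]
    (cond
      ;; header: hold it back until we know whether a PATH line follows
      stars (let [level (count stars)]
              (-> (flush-pending state)
                  ;; we are leaving every section at this depth or deeper
                  (update :stack #(vec (remove (fn [[l _]] (>= l level)) %)))
                  (assoc :pending line :pending-level level)))
      ;; PATH right after a header: push it, then emit the header to the new file
      (and path (:pending state))
            (-> state
                (update :stack conj [(:pending-level state) path])
                flush-pending)
      ;; anything else is content for the current file
      :else (-> state flush-pending (emit line)))))

(defn split-lines
  "Returns a map of output file name to the vector of lines that belong in it."
  [lines]
  (:files (flush-pending
           (reduce parse-line {:stack [] :pending nil :files {}} lines))))

(split-lines ["* Main title" "Some intro text" "** First blog" "PATH:first_blog" "Some text"])
;; => {"/index.org" ["* Main title" "Some intro text"],
;;     "/first_blog/index.org" ["** First blog" "Some text"]}
```

Because this version accumulates every line in :files, it does not yet address the memory concern. For a huge input, emit would be the place to write to an open writer instead, at the cost of making parse-line side-effecting.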


Thanks for your link!
It is indeed not my intention to create a full-blown org parser. The idea is to split a large org file into multiple smaller files. These files will serve as the base for generating a static site with magnars/stasis (a set of Clojure functions for creating static websites, on GitHub). And for parsing these files I’ll use a real org parser :-)

Thanks didibus,

very well explained.

Okay, I think I understand what you are doing.

The best way to go is not to worry about memory usage right now, as you can optimise later. I would split the parsing into 2 or 3 passes instead of doing everything you mentioned at once. The format you are working with is tricky, so I’d probably do something like:

  1. get a flat, tagged representation of the lines (i.e. headers, path indicators, content)
  2. get a tree representation of the sections (nested headers)
  3. go over the sections in the tree and split them according to the PATH indicators
  4. emit the split files
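Roughly, the four passes could look like the sketch below. All function names (tag-line, add-to-section, build-tree, split-tree, split-org, write-files!) and the map keys are placeholders I picked for illustration, and it assumes a PATH line only counts when it directly follows a header. Everything is held in memory, which is the point of not optimising yet.

```clojure
(require '[clojure.java.io :as io]
         '[clojure.string :as str])

;; Pass 1: tag each line as :header, :path or :content.
(defn tag-line [line]
  (if-let [[_ stars] (re-matches #"(\*+)\s+.*" line)]
    {:type :header :level (count stars) :text line}
    (if-let [[_ path] (re-matches #"PATH:(\S+)" line)]
      {:type :path :path path :text line}
      {:type :content :text line})))

;; Pass 2a: group tagged lines into flat sections, one per header.
(defn add-to-section [acc t]
  (let [sec (peek acc)]
    (cond
      (= :header (:type t)) (conj acc {:level (:level t) :lines [(:text t)]})
      ;; text before the first header becomes a level-0 section
      (nil? sec)            (conj acc {:level 0 :lines [(:text t)]})
      ;; a PATH line directly after the header names the section's directory
      (and (= :path (:type t)) (pos? (:level sec))
           (= 1 (count (:lines sec))) (not (:path sec)))
                            (conj (pop acc) (assoc sec :path (:path t)))
      :else                 (conj (pop acc) (update sec :lines conj (:text t))))))

;; Pass 2b: nest the flat sections by header level.
(defn build-tree
  "Returns [nodes remaining]; nodes are the leading sections deeper than parent-level."
  [parent-level sections]
  (loop [nodes [] secs sections]
    (let [sec (first secs)]
      (if (or (nil? sec) (<= (:level sec) parent-level))
        [nodes secs]
        (let [[children more] (build-tree (:level sec) (rest secs))]
          (recur (conj nodes (assoc sec :children children)) more))))))

;; Pass 3: walk the tree and pair every section with its target file.
(defn split-tree [dirs nodes]
  (mapcat (fn [{:keys [path lines children]}]
            (let [dirs (if path (conj dirs path) dirs)
                  file (str "/" (str/join "/" (conj dirs "index.org")))]
              (cons [file lines] (split-tree dirs children))))
          nodes))

(defn split-org
  "Runs passes 1 to 3 and returns a map of file name to lines."
  [lines]
  (let [[tree _] (build-tree -1 (reduce add-to-section [] (map tag-line lines)))]
    (reduce (fn [m [file ls]] (update m file (fnil into []) ls))
            {}
            (split-tree [] tree))))

;; Pass 4: emit the files below a root directory.
(defn write-files! [root files]
  (doseq [[file lines] files]
    (let [f (io/file root (subs file 1))] ; drop the leading "/" to stay under root
      (io/make-parents f)
      (spit f (str (str/join "\n" lines) "\n")))))
```

Calling (write-files! "out" (split-org lines)) would then create out/index.org, out/first_blog/index.org and so on. Keeping the first three passes pure means each one can be inspected at the REPL on a few sample lines before any file is written.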

Once the passes are in place and you want to optimize, I’m sure people here will be able to help with upgrading it to an async, transducerized and parallelizable version.


A better solution, though, would be to change your input into something a bit more straightforward to parse.