Split a large org-mode file into smaller files

pasdut · September 25, 2022, 10:43pm

Hi, I’m used to imperative programming, and now I’m learning Clojure.

As a test project I want to split the following (org-mode) file into multiple files:

* Main title
Some intro text
** First blog
PATH:first_blog
Some text
** Second large blog
PATH:second_blog
This second blog is about xxx.
*** Part 1 
PATH:part1
explain part 1
**** Part 1A
Some info about Part 1A
**** Part 1B
Explain Part 1B
*** Part 2
PATH:part2
explain part 2
** Third blog
PATH:third_blog
This is the third blog

I don’t want to slurp the whole file into memory, because the source file could be huge.
This is what I have so far:

(import '(java.io BufferedReader FileReader))

(defn check-if-header
  "Prints the regex match when `line` is an org header; returns nil."
  [state line]
  (when-let [matches (re-matches #"(\*+)\s*(.*)" line)]
    (prn "HEADERS" matches)))

(defn parse-line
  [state line]
  (println "The line is " line)
  (check-if-header state line))

(defn parse-file
  [file-name]
  (with-open [rdr (BufferedReader. (FileReader. file-name))]
    (doseq [line (line-seq rdr)]
      (parse-line [] line))))

First question: how can I pass the state along to parse-line from one line to the next? I could also def an atom, but isn’t that the imperative way?
The state will contain, for instance, a vector that works like a stack to remember the paths and the header level we are currently dealing with. I think it should also contain a reference to the current writer, where all output goes until we bump into a header that is followed by a PATH line.

Expected output files:
/index.org

* Main title
Some intro text

/first_blog/index.org

** First blog
Some text

/second_blog/index.org

** Second large blog
This second blog is about xxx.

/second_blog/part1/index.org

*** Part 1
explain part 1
**** Part 1A
Some info about Part 1A
**** Part 1B
Explain Part 1B

/second_blog/part2/index.org

*** Part 2 
explain part 2

/third_blog/index.org

** Third blog
This is the third blog

Thanks for any info!

zcaudate · September 26, 2022, 2:31am

There’s bnbeckwith/orgmode, an Org-mode parser in Clojure.

You can check out their parser here:

github.com/bnbeckwith/orgmode/blob/master/src/orgmode/block.clj

;; ## Block Element Formatting
;;
;; These functions parse an Org-mode file and generate the necessary
;; heirarchy of list and block elements.

(ns orgmode.block
  (:require [clojure.string :as s]
            [clojure.zip :as zip])
  (:use [orgmode.inline]))


;; ### Regular Expressions for Block Elements
;;
;; The following set of regular expressions match start and end
;; elements of blocks or block elements themselves.  Note that some
;; items are line items, but I considered them blocks of the
;; smallest size.

(def attrib-re
  "Attribute Regular Expression that captures the attribute name and

(The excerpt is truncated here.)

But generally, you should put it in a (loop [] ...) block and recur with your state in there.
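A minimal sketch of that loop/recur shape, assuming the state is a plain map. The names header-level and count-headers and the keys :headers and :max-level are made up for illustration; they are not from the original post.

```clojure
;; Hypothetical helper: number of leading stars if `line` is an org header,
;; otherwise nil. Assumes a header is stars followed by whitespace.
(defn header-level [line]
  (when-let [[_ stars] (re-matches #"(\*+)\s+.*" line)]
    (count stars)))

;; Walks the lines with loop/recur, passing the state map along explicitly
;; instead of mutating an atom.
(defn count-headers [lines]
  (loop [state     {:headers 0 :max-level 0}
         remaining (seq lines)]
    (if-let [[line & more] remaining]
      (recur (if-let [level (header-level line)]
               (-> state
                   (update :headers inc)
                   (update :max-level max level))
               state)
             more)
      ;; no lines left: the final state is the result
      state)))
```

Each recur call hands the next iteration a new state value, so nothing is mutated in place.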

Depending on where your input is coming from, it may not be as simple as going through the lines and parsing each one, as org files may have BEGIN and END blocks. So it could be worthwhile to use an existing library if the files are more complicated.
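To illustrate the BEGIN/END concern, here is a rough sketch that treats star-prefixed lines inside a #+BEGIN_... / #+END_... block as plain content. It assumes blocks are not nested and that the markers sit at the start of a line; tag-line, tag-lines and the :in-block? key are hypothetical names.

```clojure
(defn block-start? [line]
  (boolean (re-find #"(?i)^\s*#\+begin_" line)))

(defn block-end? [line]
  (boolean (re-find #"(?i)^\s*#\+end_" line)))

;; Reducing step: appends a [tag line] pair to :tagged and tracks whether
;; we are currently inside a block.
(defn tag-line [{:keys [in-block?] :as state} line]
  (cond
    ;; inside a block everything is content, even lines starting with "*"
    in-block?
    (-> state
        (update :tagged conj [:content line])
        (assoc :in-block? (not (block-end? line))))

    (block-start? line)
    (-> state
        (update :tagged conj [:content line])
        (assoc :in-block? true))

    (re-matches #"\*+\s+.*" line)
    (update state :tagged conj [:header line])

    :else
    (update state :tagged conj [:content line])))

(defn tag-lines [lines]
  (:tagged (reduce tag-line {:in-block? false :tagged []} lines)))
```

A real org parser handles many more cases (drawers, property blocks, comma-escaped stars), which is why an existing library may be the safer route for complicated files.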

didibus · September 26, 2022, 8:46pm

FYI, you can use clojure.java.io/reader to open a buffered reader on the file instead of interop.

https://clojuredocs.org/clojure.java.io/reader

You would use reduce instead of doseq.

(require '[clojure.java.io :as io])

(with-open [rdr (io/reader filename)]
  (reduce parse-line {} (line-seq rdr)))

The {} is your initial state; you can put whatever keys and values in it that you want to start with. parse-line should return the new state, and reduce will automatically call it again with the next line and the state returned by the previous call.
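As a sketch of what such a parse-line could look like for the file in the question: the state below holds a stack of {:level :path} entries, a :pending header (held back until we know whether a PATH line follows it), and an :out map from file name to lines. All of these keys and the helper names are assumptions for illustration. Output is collected in memory here to keep the example small.

```clojure
(require '[clojure.string :as str])

(defn header-level [line]
  (when-let [[_ stars] (re-matches #"(\*+)\s+.*" line)]
    (count stars)))

(defn path-of [line]
  (second (re-matches #"PATH:(\S+)\s*" line)))

;; File name for the current stack of paths, e.g. "/second_blog/part1/index.org".
(defn file-for [stack]
  (str "/" (str/join "/" (conj (mapv :path stack) "index.org"))))

(defn emit [state line]
  (update-in state [:out (file-for (:stack state))] (fnil conj []) line))

;; Writes the pending header, if any. `path` is the PATH value found right
;; after that header, or nil when the header has no PATH.
(defn settle-header [{:keys [pending stack] :as state} path]
  (if-not pending
    state
    (let [level (header-level pending)
          ;; drop stack entries at the same or a deeper level
          stack (filterv #(< (:level %) level) stack)
          stack (if path (conj stack {:level level :path path}) stack)]
      (-> state
          (assoc :stack stack :pending nil)
          (emit pending)))))

;; The reducing function: takes the state and a line, returns the new state.
(defn parse-line [state line]
  (let [path (path-of line)]
    (cond
      (and (:pending state) path) (settle-header state path)
      (header-level line)         (assoc (settle-header state nil) :pending line)
      :else                       (emit (settle-header state nil) line))))

(defn split-lines [lines]
  (-> (reduce parse-line {:stack [] :pending nil :out {}} lines)
      (settle-header nil) ; flush a header on the very last line
      :out))
```

To avoid holding the output in memory, emit could instead append each line to its file, for example with spit and :append true after clojure.java.io/make-parents, while the state keeps only the stack and the pending header.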

pasdut · September 26, 2022, 9:04pm

Thanks for your link!
It is indeed not my intention to create a full-blown org parser. The idea is to split a large org file into multiple smaller files. These files will serve as the base for generating a static site with magnars/stasis (some Clojure functions for creating static websites). And for parsing these files I’ll use a real org parser :-)

pasdut · September 26, 2022, 9:18pm

Thanks didibus,

very well explained.

zcaudate · September 27, 2022, 4:06am

Okay, I think I understand what you are doing.

The best way to go is to not worry about memory usage right now, as you can optimise later. I would split the parsing into maybe 2 or 3 passes instead of doing everything you mentioned all at once. The format you are working with is tricky. So I’d probably do something like:

1. get a flat, tagged representation of the lines (i.e. headers, path indicators, content)
2. get a tree representation of the sections (nested headers)
3. go over the tree representation sections to split according to a PATH indicator
4. emit the split files
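The passes above could be sketched roughly as follows. This is one possible reading, not a definitive implementation: the function names (tag, sections, nest, split-tree, split-org) and map keys are invented for the example, any PATH line is assumed to belong to the most recent header, and the last pass returns a map of file name to lines rather than writing to disk.

```clojure
(require '[clojure.string :as str])

;; Pass 1: tag each line as a header, a path indicator, or content.
(defn tag [line]
  (if-let [[_ stars] (re-matches #"(\*+)\s+.*" line)]
    {:type :header :level (count stars) :text line}
    (if-let [[_ p] (re-matches #"PATH:(\S+)\s*" line)]
      {:type :path :value p}
      {:type :content :text line})))

;; Pass 2a: group tagged lines into flat sections. A level-0 root section
;; catches anything that appears before the first header.
(defn sections [tagged]
  (reduce (fn [acc {:keys [type level text value]}]
            (let [i (dec (count acc))]
              (case type
                :header  (conj acc {:level level :path nil :lines [text]})
                :path    (assoc-in acc [i :path] value)
                :content (update-in acc [i :lines] conj text))))
          [{:level 0 :path nil :lines []}]
          tagged))

;; Pass 2b: nest the flat sections into a tree by header level.
(defn nest [secs]
  (when-let [[s & more] (seq secs)]
    (let [[kids siblings] (split-with #(> (:level %) (:level s)) more)]
      (cons (assoc s :children (vec (nest kids)))
            (nest siblings)))))

;; Pass 3: walk the tree. A section with a :path starts a new file below
;; its parent's directory; a section without one stays in the parent's file.
(defn split-tree [out dir {:keys [path lines children]}]
  (let [dir  (if path (conj dir path) dir)
        file (str "/" (str/join "/" (conj dir "index.org")))
        out  (update out file (fnil into []) lines)]
    (reduce #(split-tree %1 dir %2) out children)))

(defn split-org [lines]
  (->> lines (map tag) sections nest first (split-tree {} [])))
```

The final "emit" pass would then iterate over the returned map and write each entry, for instance with clojure.java.io/make-parents followed by spit.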

Once the passes are in place and you want to optimise, I’m sure people here will be able to help with upgrading it to an async, transducerised and parallelisable version.

A better solution, though, would be to change your input into something a bit more straightforward to parse.
