String operations without losing line numbers

For a long time I have had one problem when writing parsers.

If you create a substring, you lose track of the original string. But it is often necessary to know where in a file you are, for example if you want to produce error messages with line numbers or use your parser for syntax highlighting.

So I use something like {:line_no 23 :text "This is line 23"} instead of strings. But this makes it hard to directly use any of the typical string operations like ends-with?, split, or the re-* functions. I have to write alternative implementations for all of them or write all my code differently, which makes the whole code too complicated to read. And once this is done, my code no longer works with normal strings. That is not nice: testing is harder because the code cannot be tested with simple strings.

Does anybody have an idea for a better solution to this problem? Is it possible (and easy) to implement a custom String class that carries this extra information and maintains it when copying and building substrings with the standard string operations?

I would most definitely rely on maps and have a higher order function that takes a string function and returns a new function that has the same arity but unwraps the argument at a specific position.

It’s a typical trade-off between simple and easy, and this approach is the most direct and simple one, AFAICT.
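A minimal sketch of such a higher-order function, assuming the maps keep their string under :text. The name unwrap-at is hypothetical and not part of clojure.string; the wrapper accepts the same arguments as the wrapped function and leaves plain strings untouched:

```clojure
(require '[clojure.string :as str])

;; Hypothetical helper: returns a function that takes the same arguments as f
;; but replaces the argument at position pos with its :text value when that
;; argument is a map. Plain strings are passed through unchanged.
(defn unwrap-at
  [f pos]
  (fn [& args]
    (apply f (update (vec args) pos #(if (map? %) (:text %) %)))))

(def ends-with? (unwrap-at str/ends-with? 0))
(def split      (unwrap-at str/split 0))

(ends-with? {:line 23 :text "This is line 23"} "23") ; works on a map
(ends-with? "This is line 23" "23")                  ; and on a plain string
(split {:line 23 :text "a,b,c"} #",")                ; result is a vector of plain strings
```

Note that the results are ordinary values (booleans, strings, vectors of strings), so the line information has to be reattached by the caller where it is still needed.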

On clojure-jvm, you can’t implement your own String type or derive from String since java.lang.String is final. You could implement CharSequence though, and have that work as a pseudo string, since a lot of the “String” functions are actually built around CharSequence…

;; A record that wraps a string and implements CharSequence by delegation.
(defrecord mystring [line ^String contents]
  java.lang.CharSequence
  (charAt [this index]
    (.charAt contents index))
  (chars [this] (.chars contents))
  (codePoints [this] (.codePoints contents))
  (length [this] (.length contents))
  ;; Note: this returns a plain subsequence, so the line is not carried over.
  (subSequence [this start end]
    (.subSequence contents start end))
  (toString [this] (.toString contents)))


user=> (def the-string (->mystring 22 "hello"))
#'user/the-string
user=> (require '[clojure.string :as s])
nil
user=> (s/includes? the-string "hel")
true
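If the line number should survive taking a subsequence, one possible variant is sketched below. It assumes a deftype instead of a record, and the name LineString is made up for illustration; subSequence builds a new LineString rather than returning a plain string:

```clojure
(require '[clojure.string :as str])

;; Hypothetical type for illustration. Unlike the record above, subSequence
;; returns another LineString, so the line number is carried over.
(deftype LineString [line ^String contents]
  CharSequence
  (charAt [_ index] (.charAt contents index))
  (length [_] (.length contents))
  (subSequence [_ start end]
    (LineString. line (.substring contents start end)))
  (toString [_] contents))

(def ls (LineString. 22 "hello world"))
(def sub (.subSequence ^CharSequence ls 0 5))

(str sub)                 ; the text of the subsequence
(.-line ^LineString sub)  ; the line number is still available
(str/includes? sub "ell") ; functions that accept a CharSequence still work
```

This only helps for code that goes through subSequence. Functions such as clojure.string/upper-case call toString and return a plain String, so the line is still lost there.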

Another option is to just define your own string namespace with some protocol functionality to delegate to clojure.string, and then wrap the functions. Something like this:

(ns my.clojure.string
  (:require [clojure.string :as s]))

(defprotocol IStringLike
  (as-string ^String [this]))

(extend-protocol IStringLike
  java.lang.String
  (as-string [this] this)
  java.lang.CharSequence
  (as-string [this] (.toString this))
  clojure.lang.IPersistentMap
  (as-string [this] (this :text)))

;; blank? and ends-with? return booleans, so only capitalize gets a ^String hint.
(defn blank? [s] (s/blank? (as-string s)))
(defn capitalize ^String [s] (s/capitalize (as-string s)))
(defn ends-with? [s substr] (s/ends-with? (as-string s) substr))
;;.... implement the other clojure.string functions the same way

(def the-string "hello")
(def map-string {:line 23 :text "hello"})
(ends-with? the-string "o")
true
(ends-with? map-string "o")
true
(capitalize the-string)
"Hello"
(capitalize map-string)
"Hello"

Something like that could work. Another option is to define a protocol that declares the string operations from clojure.string and use it for polymorphic operations. You could then get custom behavior for string-like types, e.g. lifting string operations into maps:

(defprotocol IString
  (blank? [this])
  (capitalize [this])
  (ends-with? [this substr])
  ;;...define the other fns from clojure.string.
  )

(extend-protocol IString
  String
  (blank? [this] (s/blank? this))
  (capitalize [this] (s/capitalize this))
  (ends-with? [this substr] (s/ends-with? this substr))
  clojure.lang.IPersistentMap
  (blank?     [this] (s/blank? (this :text)))
  (capitalize [this] (update this :text s/capitalize))
  (ends-with? [this substr] (s/ends-with? (this :text) substr)))

my.clojure.string=> (capitalize map-string)
{:line 23, :text "Hello"}
my.clojure.string=> (capitalize the-string)
"Hello"
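The same lifting idea can be extended to operations that return several strings, such as split. A rough sketch, assuming the {:line … :text …} maps used above; split-keeping-meta is a hypothetical name, not a clojure.string function:

```clojure
(require '[clojure.string :as str])

;; Hypothetical helper: split the :text of a map and return one map per part,
;; each keeping the other keys (such as :line) of the original map.
(defn split-keeping-meta
  [m re]
  (mapv #(assoc m :text %) (str/split (:text m) re)))

(split-keeping-meta {:line 23 :text "foo bar"} #" ")
;; => [{:line 23, :text "foo"} {:line 23, :text "bar"}]
```

Every part inherits the line number of the string it came from, which is usually what an error message needs.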