Thought experiment about strings

here is the doc about strings in clj:

if i understand that correctly a str in clj is very much like a java-str, correct?

so for example in java str literals are pooled, right?

(identical? "clj" "clj")

true

(identical? "clj" (str "c" "l" "j"))

false
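the same pooling can be seen from the java side too… a little sketch ( `InternDemo` is just a made-up name for illustration ):

```java
public class InternDemo {
    public static void main(String[] args) {
        String a = "clj";
        String b = "clj";                     // both refer to the same pooled literal
        System.out.println(a == b);           // true

        String c = new String("clj");         // freshly allocated, bypasses the pool
        System.out.println(a == c);           // false
        System.out.println(a == c.intern());  // true: intern() hands back the pooled instance
    }
}
```

( note: a java expression like `"c" + "l" + "j"` would be folded into a pooled constant at compile time, so the runtime `(str "c" "l" "j")` above is closer to concatenating non-constant values )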
also if you look at something like:

(source str/upper-case)

you get:

(defn ^String upper-case
  "Converts string to all upper-case."
  {:added "1.2"}
  [^CharSequence s]
  (.. s toString toUpperCase))
nil

now strings in java are immutable and also up to ( and including ) jdk8 they were backed by a char[]… since jdk9 by a byte[], right? ( so i suppose you would also notice that clj programs became more memory efficient ( in many cases ) because of that internal change, correct? )

now because of how strings are implemented in java they can sometimes perform very poorly: ( this is from “effective java 3rd edition - item 63 - beware of the performance of string concatenation” )

public String statement() {
    String result = "";
    for (int i = 0; i < numItems(); i++)
        result += lineForItem(i);  // quadratic: each += copies the whole string built so far
    return result;
}

now it seems to me that the main problem is that this is nothing like the immutable / persistent data types for sequences you get in clj. ( obviously a str being a seq of chars )

so here is a thought experiment for you. what would happen if you backed strings by such a clj data structure instead? ( obviously in practice this seems like a stupid idea… but just as a thought experiment… i suppose there would be some pros and cons to that, right?.. - as with everything -… so… hmm… idk… any thoughts? )

p.s. a few days ago i posted about some thoughts regarding enthusiasm and its role in programming… now i wasn’t able to finish that post, but still i published it for some reason, then decided to delete it in order to finish it and post it again… turns out… once i had spelled out my thoughts on the matter in more detail… it became clear that it was basically just nonsense :slight_smile:… so i might post about that topic again… ( repost ) but that might take me a while… ( - just in case you were wondering / waiting - )

The recommended approach to doing a lot of work with Java strings (and Clojure’s strings are exactly Java’s String type) is to use a StringBuilder, and in many ways it mirrors what we do in Clojure when we use transient and persistent!:

Inside your “string operation function” that would otherwise glom together lots of (immutable) strings, you create an empty StringBuilder (like we’d say (transient [])) and then operations work efficiently on that sb object (which is mutable, just like our transient collections), and finally you’d do sb.toString() (like we’d say (persistent! coll)).
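A minimal sketch of that parallel in Java (the `numItems`/`lineForItem` stand-ins here are made up, mirroring the Effective Java example above):

```java
public class Statement {
    // Hypothetical stand-ins for the book's numItems()/lineForItem(i):
    static int numItems() { return 3; }
    static String lineForItem(int i) { return "item " + i + "\n"; }

    public static String statement() {
        StringBuilder sb = new StringBuilder();  // like (transient [])
        for (int i = 0; i < numItems(); i++) {
            sb.append(lineForItem(i));           // cheap in-place appends on the mutable builder
        }
        return sb.toString();                    // like (persistent! coll): freeze into an immutable String
    }

    public static void main(String[] args) {
        System.out.print(statement());
    }
}
```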

2 Likes

you are absolutely right about StringBuilder! ( in the book it is the next thing he explains )

but really what i was wondering about is this…

so, a clj string is a java str… and so if java changes how string is internally implemented ( say you move from char[] to byte[] - as it happened - ) that change will affect clj as well. also the cache for string literals seems to work the same way…

okay, now suppose someone changed the internals of the java string class in such a way that it would move from an array of some sort to a persistent char sequence… à la clj…

what would one have to suspect would happen? ( for example it seems to me that that example with the concatenation would not be much of a performance issue any more at all… even keeping the immutability property intact… )

does that make sense?

If the implementation of the Java java.lang.String class changes in future versions of Java, then running Clojure on that version of Java will “inherit” whatever the properties are of that new java.lang.String implementation, yes.

1 Like

Yes, with JDK9, Java strings in which every character has a code point in the Latin-1 range [0, 255] will be stored more memory-efficiently as an array of byte, so one byte per character in the string instead of two. If there is a Java string that has even 1 character in it whose code point falls outside of that range, then the whole string will be stored as an array of char, as older versions of the JVM did for all strings.

1 Like

In case a reference to the docs would be helpful: “Clojure strings are Java Strings.” – https://clojure.org/reference/data_structures#Strings

2 Likes

oh had not seen that! thx.

so,… what does that mean for text blocks?

does that also affect clj then?

p.s. i feel like the main ( the more interesting? ) question about the implementation of the immutable char sequence is still very much left unanswered. :slight_smile: any and all opinions would be much appreciated!

For what it’s worth, I think it is an interesting thought experiment. I guess one trade-off would be increased memory requirements for strings, but I am not well-versed enough in the internals of Clojure’s data structures to know if that holds in the case of a constant string as well (e.g. “foobar”, as opposed to a composite string “goo” + “bar”).

1 Like

Doesn’t that page say “It is not a goal to define a new reference type, distinct from java.lang.String, for the strings expressed by any new construct”?

Part of the “art” in Clojure is the recognition of persistent data structures. But another part of the “art” is host interop, and a third part is a low tax on performance of everyday tasks. Therefore, it is a rather difficult “thought experiment” to imagine using anything other than a Java/JavaScript string. It would tax common cases to optimize a rare case that is well served by StringBuilder.

There is a historical precedent that might have some relevance - even if we waived the host interop complications. Clojure has long used distinct optimized concrete types for small and large maps. (This does not harm interop, as Map is an interface, unlike String.) A few years ago there was an effort to extend the practice to vectors: https://github.com/ztellman/clj-tuple. But, as I recall, while the benefits were real in purpose-built benchmarks, there was also an unexpected and bigger cost in actual programs, because the JVM does not optimize call sites that experience more than a tiny bit of polymorphism. This may be why Java itself provides just one String type, which Text Blocks will not change.

3 Likes

Are you asking what the performance and memory characteristics would be if java.lang.String were implemented as a Clojure PersistentVector of bytes?

I think it would be slower and use more memory, because vectors still need to box their elements. So using an array of bytes seems much faster.

1 Like

well, i think it makes sense to formulate the question as open as possible, but yes.

i mean, i keep finding people mentioning O(log32 n)… for example here:

but obviously i am all but a data-structure expert… so my naive thinking goes something like this…

clj has really really cool persistent data structures for sequences… so working with immutable seqs is very hassle free and leads to very good performance ( no need to copy etc. etc. )… then i was thinking, well, java also has strings as immutable seqs of chars… so how does one implement such an immutable seq? do they use something similar to what you find in clj? no they do not! not at all!

so i am just naively asking about the things that one would want to consider when thinking about implementing an immutable seq of chars… like you find with java strings…

also… what would one expect would happen to many / most programs if one was to change to something like the persistent vector type in clj? ( doesn’t have to be exactly that structure of course )… would that lead to substantially slower programs and increased memory usage? and if so, why is that?

… but again… really i am just curious… and this may in part be a silly question… from the perspective of an expert… but as they like to say in german… if you don’t dare to ask… you’ll always remain ignorant! :smile:

… so… idk… that does make some sense, right?

1 Like

It makes sense. But I think the use cases are a bit different. Persistent vectors are O(log32 n) for getting an element, which you could say is pseudo O(1). But at the same time, that is much slower than an array access, which is about the fastest O(1) you can get.

Since strings are so pervasive, I feel they get looped over really really often, for displaying or for transformation. So PersistentVector would be slower there, even if still reasonably fast.

Now, the structural sharing of PersistentVector comes into play when you add to or remove from it. So here it’s true: if the string contained repeated characters, they’d get shared. Now here’s the thing, though: a character is a pretty small thing. When the JVM was still storing every character as two bytes, maybe a pointer to the char, structurally shared wherever it is used, could have been good. But now that they are bytes, the representation is smaller than the pointer itself. On a 64-bit machine, pointers are 64 bits, or 8 bytes. All Unicode characters can be represented with only 32 bits, so you could pack two Unicode characters into a single pointer! So you’d actually be wasting memory by adding structural sharing, as you’d need a 64-bit pointer to point to at most 32 bits of character. That’s now 96 bits’ worth of memory, where you could have used only 32 bits.

Also, the sharing won’t save much, since each pointer will take up 64 bits.

So in effect, the pointer is more overhead than what we save by sharing anything when it comes to Strings.

On top of that, the sharing requires two memory lookups: you have to look up the pointer and then the memory location it points to. So again, slower access than what an array of bytes gives you.

All in all, I think for Strings it’s hard to beat an array of bytes. If you need to store a ton of additional metadata along with your characters, maybe you could start looking at using PersistentVector. Or I think other structures are popular for that, like Ropes, which have super fast insertion and modification in the middle, which is popular in text editors, etc.

2 Likes

i think you were able to pinpoint “the poodle’s core” ( the heart of the matter ) for us. thank you.

now i hope you don’t mind me paraphrasing a bit, but what you are drawing our attention to, above all else, is the matter of memory layout, right? in other words one has to be careful to distinguish between reference and primitive types here. in the sense that with an array of a primitive type we are talking basically one starting + one end address, with n chunks of data of uniform size.

but now when we are talking about balanced, ordered trees ( or something like that ) we incur a large overhead because the data will not be linearly aligned in memory… so… in other words… we would not have one starting address + one large chunk of ( all the ) data but we would have n starting addresses and n individual small ( smaller than the individual address / register size itself ) pieces of data. ( - the elements ( char / byte ) - )

so even if one thinks about the copy that needs to happen for something like substring we would find sitting at the heart of that operation:

    @HotSpotIntrinsicCandidate
    public static native void arraycopy(Object src,  int  srcPos,
                                        Object dest, int destPos,
                                        int length);

and if one looks at that annotation…

  The {@code @HotSpotIntrinsicCandidate} annotation is specific to the
  HotSpot Virtual Machine. It indicates that an annotated method
  may be (but is not guaranteed to be) intrinsified by the HotSpot VM. A method
  is intrinsified if the HotSpot VM replaces the annotated method with hand-written
  assembly and/or hand-written compiler IR – a compiler intrinsic – to improve
  performance. The {@code @HotSpotIntrinsicCandidate} annotation is internal to the
  Java libraries and is therefore not supposed to have any relevance for application
  code.

so one can imagine that to copy a single block of memory could be reasonably cheap, especially when compared to an array of objects… where one would have to look up all the data separately and also copy every single element separately.
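just to make that concrete to myself… a tiny sketch ( class / method names made up ) of a substring-style copy sitting on top of `System.arraycopy`:

```java
public class CopyDemo {
    // Copy len chars starting at from, in one contiguous block copy:
    static String copyRange(String s, int from, int len) {
        char[] src = s.toCharArray();
        char[] dest = new char[len];
        System.arraycopy(src, from, dest, 0, len);  // single linear, memcpy-style copy
        return new String(dest);
    }

    public static void main(String[] args) {
        System.out.println(copyRange("hello world", 6, 5));  // world
    }
}
```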

all in all, because of your explanation, i feel comfortable now to conclude that, when it comes to persistent sequence data structures, the size of the data in general, and having equally sized data elements in particular, appear to be crucially important, in terms of instruction cycles and memory requirements.

in any case, i feel sufficiently satisfied now with this explanation. :smile:

thank you so much for sharing your thoughts and insight! i felt it was very helpful and it truly is much appreciated!!!

I have been thinking for a while along these lines, and I’m not sure that “for Strings it’s hard to beat an array of bytes”. A lot of the strings I create are to become part of other strings, and so I think it would sometimes be better to have a vector of (possibly vectors of) things, and then do a flatten before creating the actual string.

Like: ["hello" " " ["world"] "\n"].

In a sense this is a more abstract version of a String, as a String is a more abstract version of a file containing bytes encoded as, say, UTF-8.

You could even do better: say you have a function:

(defn enc [x]
  (fn [] (urlencode x)))

You could do

["http://.../" "?" ["a=" (enc "x")] ]

And have it solved just when it is actually needed for output.

1 Like

I think you just described a rope data-structure to some extent.

Persistent ropes are ropes that only perform non-destructive operations, and those are sometimes used by text editors, because they make undo/redo easy.

I wonder what the differences are of using PersistentVector of Strings and a non-destructive rope data-structure.

In any case, I think that for this use case it could make sense, since inserts and updates should be faster than with an array of strings.

In a way, a rope is a hybrid: if your rope were at the character level, it’d be too granular and would introduce lots of pointer and bookkeeping overhead over an array, while big immutable string arrays start to incur performance and memory costs when you modify or insert into them. A rope sits in between: an immutable tree of smaller chunks of immutable string arrays.

That said, when you’re flattening it into a string, it’s still much faster to use StringBuilder for that. It’ll avoid unnecessary creation of intermediate string objects.
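A rough sketch of that flattening step (the `Flatten` class and the nested-list representation are just illustrative, standing in for the ["hello" " " ["world"] "\n"] idea above):

```java
import java.util.List;

public class Flatten {
    // Recursively append nested lists of parts into one builder,
    // avoiding intermediate String allocations along the way.
    static void flattenInto(Object part, StringBuilder sb) {
        if (part instanceof List) {
            for (Object p : (List<?>) part) {
                flattenInto(p, sb);
            }
        } else {
            sb.append(part);
        }
    }

    static String flatten(Object tree) {
        StringBuilder sb = new StringBuilder();
        flattenInto(tree, sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        Object tree = List.of("hello", " ", List.of("world"), "\n");
        System.out.print(flatten(tree));  // hello world
    }
}
```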

1 Like

the topic of compact strings has already come up. still this link could perhaps be of interest:

hmm… so… possibly worth pointing out:

Data gathered from many different applications indicates that strings are a major component of heap usage

Optimizing character storage for memory may well come with a trade-off in terms of run-time performance. We expect that this will be offset by reduced GC activity and that we will be able to maintain the throughput of typical server benchmarks. If not, we will investigate optimizations that can strike an acceptable balance between memory saving and run-time performance.

the catchwords being heap usage and trade-off, i suppose.

I don’t claim to be addressing all of the earlier questions here, but in general, CPU instruction and data cache misses are one of the biggest performance reducing factors with current CPUs. Individual CPU cores are getting 0 instructions per cycle of performance while waiting for cache lines to be read from DRAM (modulo things like hyperthreads from other threads that happen to be running on the same CPU core). So:

  • data structures that take small amounts of memory are faster to access than ones that take larger amount of memory. So 1-byte-per-ASCII-char is better than 2-bytes-per-ASCII-char. Less memory gets pulled into a CPU’s cache per character accessed, or conversely, more characters are brought into cache per cache line.

  • fewer random memory accesses is better than more random memory accesses. All Clojure persistent vectors are based on 32-way branching trees, which is better than 2-way branching trees, certainly, but still requires several random memory accesses to get to an arbitrary index within the vector. Contiguous arrays do not require that.

  • Many modern CPUs actually have hardware to detect when the software is advancing through addresses sequentially, at least in “increasing address” order (I have not checked, but perhaps also in “decreasing address” order), so that they can fetch cache lines from DRAM before any instructions even access the data, if your code is advancing through the data in an order that this hardware logic detects. Thus if your code is advancing through a large contiguous array by increasing index, CPUs with this hardware can perform significantly better than if they are advancing through elements of a rope or Clojure PersistentVector, where each “segment” of the string is stored at an arbitrary address relative to the others.
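As an illustration of the access-pattern point (class and method names are made up, and this only shows the pattern, not a rigorous benchmark):

```java
import java.util.Random;

public class Traversal {
    // Sum in index order: sequential accesses the prefetcher can predict.
    static long sumSequential(byte[] data) {
        long sum = 0;
        for (int i = 0; i < data.length; i++) sum += data[i];
        return sum;
    }

    // Sum following an arbitrary index order: same arithmetic, but the
    // random access pattern defeats prefetching on large arrays.
    static long sumPermuted(byte[] data, int[] order) {
        long sum = 0;
        for (int i : order) sum += data[i];
        return sum;
    }

    public static void main(String[] args) {
        byte[] data = new byte[1 << 20];
        new Random(42).nextBytes(data);

        int[] order = new int[data.length];
        for (int i = 0; i < order.length; i++) order[i] = i;
        Random r = new Random(7);  // Fisher-Yates shuffle of the index order
        for (int i = order.length - 1; i > 0; i--) {
            int j = r.nextInt(i + 1);
            int tmp = order[i]; order[i] = order[j]; order[j] = tmp;
        }

        // Both traversals compute the same sum; only the memory access
        // pattern (and hence the cache behavior) differs.
        System.out.println(sumSequential(data) == sumPermuted(data, order));  // true
    }
}
```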

1 Like