I was trying to optimize my own toy language calcit-js, which compiles to JavaScript. And it’s using persistent data structure as well. For details there was a post.
So it took me weeks trying to deal with bugs and to optimize the performance for persistent list. I found it quite interesting that an array-based solution can be used to simulate persistent data structure in many occasions. And it was quite funny it made my language a lot faster since my own implementation for data-sharing tree is relatively slow.
So here’s the trick, use two modes to represent a persistent list, first is an array, then the real tree list. The array is optimized by JavaScript itself, and can be used to simple cases like reading, and after that, a tree list is used when it’s actually being modified. So when a list in calcit-js is first created, it’s in an array, and only being converted to a tree list when it has to.
The other part of the trick is to use startIndex
and endIndex
to track how this array is used to represent a persistent list. Suppose a JavaScript array of [1,2,3]
, with two indexing parameters it can represent [1]
[1,2]
[1,2,3]
[2,3]
[2]
[3]
or even []
, meanwhile the memory is reused. This trick was learnt from Go slices, but later I noticed since it’s immutable data in Clojure and in calcit-js, the shared array can be used in more scenarios.
Let’s see how it will work in these common scenarios:
access by index
Accessing index of x
of array is simple enough, just in this case we need to add x
with an extra startIndex
. Simple math~
rest
As mentioned above, in immutable data, we are safe to reuse the same piece of array, and meanwhile we mark a new index: startIndex + 1
. For example, [1 2 3 4]
can be represented with:
arrayData: [1,2,3,4]
startIndex: 0
endIndex: 4
And [2 3 4]
may use same piece of array but with a different start index:
arrayData: [1,2,3,4]
startIndex: 1
endIndex: 4
In this case, array memory is shared, calling rest
is quite cheap. Although not cheap enough like a linked list, but it supports fast accessing.
butlast, slice(subvec)
So similar idea with rest
, if we want to slice [2 3]
out of [1 2 3 4]
, it’s just changing indexes:
arrayData: [1,2,3,4]
startIndex: 1
endIndex: 3
And now you know how to implement butlast
here… And this model will always work even if you slice and keep slicing a list. The cost for each slicing step is just an object of 3 fields.
append(conj)
You may find it cheap for accessing, but how about calling conj
like we normally do in Clojure? Well, it depends. I noticed that in one case, the memory can still be reused. Say we have a list (def a0 [1 2])
, and we want (def a1 (conj a0 3))
. At first a0
has a structure like:
arrayData: [1,2]
startIndex: 0
endIndex: 2
tricky part is that we can just put 3 at the end the same array, then we get a1
with structure:
arrayData: [1,2,3]
startIndex: 0,
endIndex: 3
right? But the dirty part is we changed previous data of a0
. So, does it break our code? Nope, now we have a0
with a structure:
arrayData: [1,2,3]
startIndex: 0
endIndex: 2
even though the list has 3 items, being a list, its size is still 2 based on the indexes, and its values 1
and 2
are not influenced. So in this way, we can reuse quite some memory, it would even be cheaper than using a tree structure.
Well, we all know it’s a very special case. If our persistent list [1 2]
actually has a structure of:
arrayData: [1,2,4]
startIndex: 0,
endIndex: 2
then we can’t use this trick since the extra 4
at tail, because it would break previous value. Unless we have a very smart compiler and it happens to find out that previous value is no longer used. Well… it’s unrealistic for a dynamic language.
Judging by real-world usages, I suppose in quite some cases we create a list and we will keep filling it with data and at last to share it to others. For example, in reducing, we don’t share the list of each internal state, we only give out the result to other functions. So, that trick may help in quite some cases.
When we can’t reuse previous part of an array, personally, I prefer fallback mode that converting to tree list and bla…bla…bla…
prepend(cons)
The special trick for appending can not be used in prepending directly. However, in theory, it’s possible. Let’s look at an extended structure of [1 2 3 4]
arrayData: [1,2,3,4]
negativeArrayData: []
startIndex: 0,
endIndex: 4
When we prepend it with -1
, it turns into:
arrayData: [1,2,3,4]
negativeArrayData: [-1]
startIndex: -1
endIndex: 4
And then we prepend it with -2
, it turns into:
arrayData: [1,2,3,4]
negativeArrayData: [-1, -2]
startIndex: -2
endIndex: 4
so… this method is viable. Just we are not sure if this will really make code faster, since there are 2 arrays to maintain now.
Also, it’s only available for several specific cases. We may still turn to tree list for general cases.
insertion and deletion
No simple trick for modifications. Tree list is required now.
After the data is turned into a tree, in my case of calcit-js, it’s not turning back to array implicitly. Creating arrays requires new memory.
There might be some possible tricks, if we modify an element in the middle of an array, maybe, we can reuse multiple slices to represent the final tree. But in this case, the tree can be really a chaos, it would highly likely to be unbalanced and thus makes accessing unpredictably slow. Meanwhile balancing the tree can be heavy operations as well. So chaos…
I noticed V8 has some tricks for optimizing arrays, by choosing from multiple internal structures to make some of the arrays really fast. Maybe not bad idea for us in persistent list too.
I haven’t heard about such tricks in Clojure community yet. I glanced source code pf PersistentVector
and didn’t notice about this. I’m not sure. Maybe the trick can be helpful. In my own case, with an array representation of persistent list, in many cases, it makes arguments spreading really cheap on js side. So it did bring benefits to calcit-js.