A trick for cheaper persistent list in JavaScript

jiyinyiyong · February 12, 2021, 11:31am

I was trying to optimize my own toy language calcit-js, which compiles to JavaScript. And it’s using persistent data structure as well. For details there was a post.

So it took me weeks trying to deal with bugs and to optimize the performance for persistent list. I found it quite interesting that an array-based solution can be used to simulate persistent data structure in many occasions. And it was quite funny it made my language a lot faster since my own implementation for data-sharing tree is relatively slow.

So here’s the trick, use two modes to represent a persistent list, first is an array, then the real tree list. The array is optimized by JavaScript itself, and can be used to simple cases like reading, and after that, a tree list is used when it’s actually being modified. So when a list in calcit-js is first created, it’s in an array, and only being converted to a tree list when it has to.

The other part of the trick is to use startIndex and endIndex to track how this array is used to represent a persistent list. Suppose a JavaScript array of [1,2,3], with two indexing parameters it can represent [1] [1,2] [1,2,3] [2,3] [2] [3] or even [], meanwhile the memory is reused. This trick was learnt from Go slices, but later I noticed since it’s immutable data in Clojure and in calcit-js, the shared array can be used in more scenarios.

Let’s see how it will work in these common scenarios:

access by index

Accessing index of x of array is simple enough, just in this case we need to add x with an extra startIndex. Simple math~

rest

As mentioned above, in immutable data, we are safe to reuse the same piece of array, and meanwhile we mark a new index: startIndex + 1. For example, [1 2 3 4] can be represented with:

arrayData: [1,2,3,4]
startIndex: 0
endIndex: 4

And [2 3 4] may use same piece of array but with a different start index:

arrayData: [1,2,3,4]
startIndex: 1
endIndex: 4

In this case, array memory is shared, calling rest is quite cheap. Although not cheap enough like a linked list, but it supports fast accessing.

butlast, slice(subvec)

So similar idea with rest, if we want to slice [2 3] out of [1 2 3 4], it’s just changing indexes:

arrayData: [1,2,3,4]
startIndex: 1
endIndex: 3

And now you know how to implement butlast here… And this model will always work even if you slice and keep slicing a list. The cost for each slicing step is just an object of 3 fields.

append(conj)

You may find it cheap for accessing, but how about calling conj like we normally do in Clojure? Well, it depends. I noticed that in one case, the memory can still be reused. Say we have a list (def a0 [1 2]), and we want (def a1 (conj a0 3)). At first a0 has a structure like:

arrayData: [1,2]
startIndex: 0
endIndex: 2

tricky part is that we can just put 3 at the end the same array, then we get a1 with structure:

arrayData: [1,2,3]
startIndex: 0,
endIndex: 3

right? But the dirty part is we changed previous data of a0. So, does it break our code? Nope, now we have a0 with a structure:

arrayData: [1,2,3]
startIndex: 0
endIndex: 2

even though the list has 3 items, being a list, its size is still 2 based on the indexes, and its values 1 and 2 are not influenced. So in this way, we can reuse quite some memory, it would even be cheaper than using a tree structure.

Well, we all know it’s a very special case. If our persistent list [1 2] actually has a structure of:

arrayData: [1,2,4]
startIndex: 0,
endIndex: 2

then we can’t use this trick since the extra 4 at tail, because it would break previous value. Unless we have a very smart compiler and it happens to find out that previous value is no longer used. Well… it’s unrealistic for a dynamic language.

Judging by real-world usages, I suppose in quite some cases we create a list and we will keep filling it with data and at last to share it to others. For example, in reducing, we don’t share the list of each internal state, we only give out the result to other functions. So, that trick may help in quite some cases.

When we can’t reuse previous part of an array, personally, I prefer fallback mode that converting to tree list and bla…bla…bla…

prepend(cons)

The special trick for appending can not be used in prepending directly. However, in theory, it’s possible. Let’s look at an extended structure of [1 2 3 4]

arrayData: [1,2,3,4]
negativeArrayData: []
startIndex: 0,
endIndex: 4

When we prepend it with -1, it turns into:

arrayData: [1,2,3,4]
negativeArrayData: [-1]
startIndex: -1
endIndex: 4

And then we prepend it with -2, it turns into:

arrayData: [1,2,3,4]
negativeArrayData: [-1, -2]
startIndex: -2
endIndex: 4

so… this method is viable. Just we are not sure if this will really make code faster, since there are 2 arrays to maintain now.

Also, it’s only available for several specific cases. We may still turn to tree list for general cases.

insertion and deletion

No simple trick for modifications. Tree list is required now.

After the data is turned into a tree, in my case of calcit-js, it’s not turning back to array implicitly. Creating arrays requires new memory.

There might be some possible tricks, if we modify an element in the middle of an array, maybe, we can reuse multiple slices to represent the final tree. But in this case, the tree can be really a chaos, it would highly likely to be unbalanced and thus makes accessing unpredictably slow. Meanwhile balancing the tree can be heavy operations as well. So chaos…

I noticed V8 has some tricks for optimizing arrays, by choosing from multiple internal structures to make some of the arrays really fast. Maybe not bad idea for us in persistent list too.

I haven’t heard about such tricks in Clojure community yet. I glanced source code pf PersistentVector and didn’t notice about this. I’m not sure. Maybe the trick can be helpful. In my own case, with an array representation of persistent list, in many cases, it makes arguments spreading really cheap on js side. So it did bring benefits to calcit-js.

jiyinyiyong · June 8, 2021, 12:38pm

adding some notes here…

I found heavy memory usage in an app(refreshes every 30ms) that is using calcit-js, which uses this trick to optimize its internal List. Unfortunately I can’t locate the actual problem. But one thing may happen hypostatically, since the internal array is alway reused, there’s some chance GC does not know whether the memory of former part of the array can be released, so maybe I should add some code to detect and make sure the array is recreated when not many elements are actually reused.

Meanwhile I noticed some similar trick for a HashMap should be using a plain array as well, just implementing interfaces of a very naive HashMap by iterating by every 2 elements. Also copy before assoc/dissoc to simulate immutable HashMaps. It might work for a very very tiny HashMap(using an array internally) in performance of O(n) while n is really tiny. Thus to reduce the costs of building an extra trie. I don’t have benchmarks, just hypothesis and trying.

system · December 8, 2021, 12:38am

This topic was automatically closed 182 days after the last reply. New replies are no longer allowed.