Subvec of Java Arrays

jgomo3 · March 25, 2019, 2:18am

Is there anything similar to subvec for Java Arrays?

Crrently I’m using take and nthnext (Update: This is just an example on how I’m using take and nthnext):

(defn sub-seq [from to coll] ;; Update: Name was subseq before, but core already has that name.
  (let [size (- to from)]
    (take size (nthnext coll from))))

But I feel there should be an already core function for this or something.

Beside: Why do you think subvec sets the vector as the first argument and not as the last one? What I’m learning about Functional programming points me to think that functions in this paradigm prefer to put the subject of processing as the last argument.

didibus · March 25, 2019, 4:36am

Yes, just use: java.util.Arrays/copyOfRange

(import 'java.util.Arrays)
(into [] (Arrays/copyOfRange (into-array [1 2 3 4]) 1 3))
;; => [2 3]

In Clojure, you have collections such as Vector, Map, Set, and you have sequences.

Functions which operate over sequences are normally (by convention) designed to take the sequence last. This include functions which convert collections to sequences, so anything that is seqable?.

Whereas functions that are designed to work on collections normally take the collection first. Basically, functions which can not do what they do using only the sequence abstraction. Which is the case for subvec, since subvec is specific to persistent vectors, as it is able to subvec in O(1). Your subseq implementation is not O(1). But in your case, you support any seqable, so it makes sense to take the collection last.

By the way, if you wanted to keep it around, I wouldn’t call it subseq, since there’s already a subseq function in core, it is confusing.

jgomo3 · March 25, 2019, 10:12am

Thank you very much for such a great explanation!

And yes, I will change my custom subvec function name, in the comment to avoid confusion. I’ll probably will remove it from my code.

linpengcheng · March 25, 2019, 10:26am


; Update: add Core API
;http://clojuredocs.org/clojure.core/subvec

(subvec [1 2 3 4 5 6 7] 2)
;=> [3 4 5 6 7]

(subvec [1 2 3 4 5 6 7] 2 4)
;=>[3 4]


;http://clojuredocs.org/clojure.core/subs

(subs "Clojure" 1)    
;=> "lojure"
 (subs "Clojure" 1 3)
;=> "lo"

;http://clojuredocs.org/clojure.core/subseq

(subseq (sorted-set 1 2 3 4) > 2)
;=> (3 4)

;=========

;just another implementation.

(def xs [10 11 12 13 14 15 16 17 18 19])
(->> (range 3 6) 
     (select-keys xs ,) 
     vals)
;return
;(13 14 15)

jgomo3 · March 25, 2019, 10:28am

That is a MUCH better solution of my “sub-seq” problem than my take/nthnext one .

Thank you for sharing it.

Update: Just because curiosity, I tried it with a Java Array, and it cast an exception:

find not supported on type: [Ljava.lang.Long;

I suppose select-keys use find which in turn would not be supported by [Ljava.lang.long?

linpengcheng · March 25, 2019, 11:24am

I’m sorry, I was wrong. The title requirement is Java array.
I’ve been using pure clojure programming and not paying attention to these differences.

jgomo3 · March 25, 2019, 11:51am

Illuminating anyway! I would not reach that solution based on select-keys. It is a new approach for me. Thank you!

plexus · March 25, 2019, 2:13pm

There is one important difference with subvec, copyOfRange will make a new copy, whereas subvec just delegates lookups to the underlying vector. So subvec has almost no allocation overhead.

I don’t think you can do the same thing with an array, but you could return a vector-like object that delegates its lookups to the array.

jgomo3 · March 25, 2019, 3:08pm

Good! Yes, you are right. For what I want, I should not be doing copies. Using the idea from @linpengcheng I could do something like:

(defn sub-array [array from upto]
  (->> (range from upto)
       (map #(aget array %))))

;; previous erroneous version
;; (defn sub-array [array from upto]
;;  (apply aget array (range from upto))

linpengcheng · March 25, 2019, 3:24pm

In the R language, this is a common method of vector programming (dataflow programming).

didibus · March 25, 2019, 3:48pm

This also doesn’t work on sequences, but only vector (associative collections to be more precise). It’s a good example of my explanation above as well. You can know select-keys doesn’t work on any arbitrary sequence or seqable, because it takes the collection first, and not last.

But ya, it’s a nice approach for vectors. I also hadn’t thought of it.

didibus · March 25, 2019, 4:56pm

aget doesn’t work like that. If you pass it additional indices it assumes they’re for nested arrays lookup. Basically, it’s to get from 2d or nd arrays. Not to get multiple values out of a 1d array.

Also, even if it did work as you thought, it’s not different to copeOfRange, in that it would still just copy the elements over to a new data structure.

What @plexus is saying is more complicated. But basically, subvec only returns a view over the vector. Because of that, no new vector is created, primitive values are not copied, and nothing is looked up yet. So it’s O(1) and takes O(1) extra memory.

copyOfRange is O(n) and takes O(n) memory.

To do what @plexus suggests over arrays, you would need to create a new ADT, which would just wrap a pointer to the array and keep track of start and end.

Since arrays are not interfaces in Java, you can’t create an ADT that can pass for an array. But you could implement some other collection interface, where under the hood it just delegates to your existing array.

The risk of doing that with arrays, is that unlike persistent vectors, they’re not immutable. So someone modifying the array elsewhere could change the elements of your subarray view without you knowing. That’s the advantage and reason why I’d suggest copyOfRange for arrays.

That said, I think Java actually has such a thing. It’s called Spliterator. It is used by Java Streams (the sequences of Java).

By using the steam API, you could get back such a view as a stream. But Clojure doesn’t support inerop with Java streams (unless I’m wrong).

All in all, better not bother. Also, I would rethink why you have an array in the first place ideally, especially if you are concerned with the cost of copying.

linpengcheng · March 26, 2019, 12:36am

For memory efficiency, we don’t need sub-java-array, but we can use reduce-subarray. Efficiency is equivalent to manipulating data on the Subarray view.

(defn reduce-subarray [f xs start end]
  (->> (range start end)
       (reduce #(->> (aget xs %2) f (conj %1 ,)) [] ,)
       doall))

(def xs (int-array [0 1 2 3 4 5 6 7 8 9]))

(reduce-subarray str xs 2 7)
;=> ["2" "3" "4" "5" "6"]

didibus · March 26, 2019, 1:19am

Hey, ya that’s true. And actually, you made me remember that Java Arrays are seqable. So in theory, what OP did was all lazy over the array, and is still O(1) time. Though when he processes the lazy-seq, it will become O(n) space, since the elements will be copied and cached in a seq.

Basically, lazy-seq already wraps a Java Array in an ADT which can keep track of start and end. And Sequence becomes our new data structure interface for our array.

In that way, it’s pretty similar to subvec. The only difference is related to space. When you read the elements from subvec, they don’t get copied. They would for lazy-seq. But honestly, primitives copy pretty quickly.

Which I guess would come into play sometimes. Say you had an array of 10 000 elements. And wanted to generate all sub array of that expanding. So like, a subset from 0 to 1, then 0 to 2, then 0 to 3, etc. You’d end up with 9999 additional sub arrays. With lazy-seq, creating that would be fast, since its all lazy. Now, you printed them all out. At that point, with lazy-seq, they’d all copy. So now you’d have 10 000 sequences all with a copy of the numbers in them. With subvec you wouldn’t.

At least, I think I got that right.

linpengcheng · March 26, 2019, 1:35am

Because I’m not engaged in big data development, in order to simplify the code, I don’t use lazy computing by default, and it’s easy to design another lazy version if you need to.

If it is for concurrency security, you should consider converting to the appropriate data structure, example: atom-vector .

reduce-subarray return a new vector.

jgomo3 · March 26, 2019, 12:09pm

You’re right. I edited the definition of sub-array.

You mean that for the example you gave, where you get subarrays expanding, when you process the subarray of 10,000 elemente, lazy-seq would create a copy of all those elements?

linpengcheng · March 26, 2019, 1:00pm

it’s lazy.

(defn map-subarray [f xs start end]
  (->> (range start end)
       (map #(f (aget xs %))  ,)))

joinr · March 27, 2019, 5:41am

sub-vec produces another data type that implements the persistent vector interfaces, but it actually indexes into the parent vector (maintaining a reference to it *preventing GC if you’re not careful), and the indices of the view. It then just delegates calls to nth and friends through this offset to the parent vector. So, there’s not copying involved. If you coerce it to a seq

user> (def the-vec [1 2 3 4])
#'user/the-vec
user> (def sv (subvec the-vec 2))
#'user/sv
user> (.v sv)
[1 2 3 4]
user> (identical? (.v sv) the-vec)
true
user> sv
[3 4]
user>

Even more interesting, conjing onto the subvector (at least in this case) uses the parent referenced vector as its basis for constructing a new wrapped parent, which does enforce copying semantics per persistent vector conj:

user> (conj sv :a)
[3 4 :a]
user> (type (conj sv :a))
clojure.lang.APersistentVector$SubVector
user> (.v (conj sv :a))
[1 2 3 4 :a]

lazy-seq “should” create a wrapper around the array and not copy. fun fact: since arrays are mutable, the projected sequence is actually…mutable. The values are not cached in this case.

user> (def the-array (long-array [1 2 3 4]))
#'user/the-array
user> the-array
#object[“[J” 0x7044e68 “[J@7044e68”]
user> (seq the-array)
(1 2 3 4)
user> (def xs (seq the-array))
#'user/xs
user> xs
(1 2 3 4)
user> (aset the-array 2 -99)
-99
user> xs
(1 2 -99 4)

The arrayseq definition shows that it’s just wrapping the underlying (mutable) array an an index, and calls to (first) just return the current index.

If we coerce the arrayseq using seq and realize it, then the values will be cached in the resulting seq So…something like this:

user> (def ys (map inc xs))
#'user/ys
user> ys
(2 3 -98 5)
user> (aset the-array 2 33)
33
user> xs
(1 2 33 4)
user> ys
(2 3 -98 5)

We see the mutation on xs, from 99 → 33, propagated into the arrayseq on xs, but not the lazy sequence ys, which was defined over xs as a map, then realized (and cached) when we printed it.

Bottom line: if you’re projecting arrays onto sequences, beware of side effects; it’s best to treat them as read-only, copy/clone them, or realize the sequences…If you’re messing with subvecs, be cognizant of the fact that you’re retaining a reference - potentially a very long chain of ancestors - to a parent vector the view is derived from. If this are long-lived, and you never actually disjoin elements down to the empty vector, you can introduce a subtle garbage leak (although the severity of the leak is highly dependent on the use case).

didibus · March 27, 2019, 7:19am

Hum, interesting. I’m surprised arrayseq doesn’t cache. Especially when wrapped in a lazy-seq (I mean like (lazy-seq xs)). I would have thought that would have cached.

I’m even more surprised subvec has a reference to the head node of the parent vector. Now, I always forget the details of how Clojure vectors are implemented, but couldn’t it just have .v point to the the first node in the subvec range, to allow the nodes before it to be GCed when other references to the head node are removed? With my superficial knowledge that persistent vectors are implemented as a Trie, it seems to me it could, but, I don’t fully grasp their implementation details.

Yes. Even with @joinr additional insights, it’s my understanding this would be the behavior. Because the lazy-seq wrapping the arrayseq would cache the elements inside itself when realized.

Now, when I say copy, it’s a copy of the pointer to the element. But for primitives, like int, bool, float, etc. It’s a copy. Since the value doesn’t actually take more space then a pointer would. So it’s not like a ton of additional data. Like say your array contained big objects, or large data-structures, those wouldn’t get copied over. Just to pointer to them would. Which would take 64bit each on a x86 architecture I believe.

system · September 25, 2019, 7:19pm

This topic was automatically closed 182 days after the last reply. New replies are no longer allowed.