Clojure byte array literal

Chris_Hapgood · August 24, 2020, 4:52pm

I’ve been using Clojure for ten years and would like to think I’m familiar enough with the language that I would not miss something obvious to solve this problem:

I want to read, write and operate on sequences of bytes. This is a common problem in diverse fields such as image and audio processing, cryptography and low-level I/O.

The underlying JVM platform provides an efficient type for operating on byte arrays (the primitive Java byte array, whose class prints as “[B” in Clojure). Clojure also has support for a more Clojure-idiomatic (and immutable) byte array via the (vector-of :byte ...) construct backed by a clojure.core.Vec. But neither of these types has a literal representation.
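
As a rough illustration (a REPL sketch added for clarity, not part of the original post), the two types behave as follows. The names arr and v are placeholders.

```clojure
;; Illustrative sketch: the two byte containers mentioned above.
(def arr (byte-array [0 -1 127]))      ; mutable primitive byte[]
(def v   (vector-of :byte 0 -1 127))   ; immutable clojure.core.Vec

(.getName (class arr)) ;=> "[B"
(.getName (class v))   ;=> "clojure.core.Vec"

;; The Vec prints like an ordinary vector, so reading the text back
;; yields a plain persistent vector of longs, not a Vec of bytes.
(pr-str v)                        ;=> "[0 -1 127]"
(class (read-string (pr-str v)))  ;=> clojure.lang.PersistentVector
```

The byte array fares worse: by default it prints as an #object[...] form that the reader cannot turn back into data.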

BigInt would be efficient with a compact literal representation but for the suppression of leading zeros both in storage and in printing.
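
A small sketch of the leading-zero problem (added for illustration; the hex value is arbitrary):

```clojure
;; Illustrative sketch: a big integer does not remember leading zero bytes.
(def n (BigInteger. "0000abcd" 16))

;; Printing in base 16 drops the two leading zero bytes, so the
;; original four-byte sequence cannot be recovered from the number.
(.toString n 16) ;=> "abcd"
```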

Overall, the lack of an intuitive literal representation for a byte array seems to be a problem in Clojure. I have stumbled upon rumblings of native support for representing sequences of bytes as a hex string, but nothing has come of it as far as I can tell. Hex strings are a common literal representation and I think Clojure would be well served by having a variation of a hex string as the literal representation of a sequence/vector/array of bytes. Reader support for a literal like #x"00ffabcd" for a clojure.core.Vec of bytes would do very nicely.

I’ve hacked around with other approaches, the most promising of which was a deftype backed by a clojure.core.Vec of :byte. But even with a custom print-dup and tagged literal, Clojure does not seem to tolerate reading a tagged literal into a deftype backed by a type without its own literal representation (a limitation I can’t quite understand).

I rule out a simple Clojure persistent vector of bytes. While Clojure neatly handles (de-)serialization, and operations on the byte elements via the built-in bitwise functions are reasonable, Clojure vectors are heterogeneous, which is a severe semantic mismatch. They also have no compact literal representation when filled with bytes (compare the print-dup of a vector of bytes with a hex string).

So, cutting to the chase, how do you print, read and operate on byte sequences?

didibus · August 24, 2020, 8:31pm

Hmm, I find that bytes themselves aren’t really human-readable. So I’m not too sure of the value of printing them, and storing them as strings seems counterintuitive given why they are bytes in the first place.

That said, I think you can implement a tagged literal for printing and reading them however you want. I’d probably keep them Java byte arrays when in memory personally.

Chris_Hapgood · August 24, 2020, 9:03pm

The goal is to not store them as characters or strings. Instead, a byte array (probably) or clojure.core.Vec (possibly) seems preferable… not sure how I conveyed that wrong.

The crux of the issue is the literal representation… Clojure doesn’t really have one. I’ve managed to hack one together using a hex string tagged literal with a reader that converts it to a byte array, but it seems dirty to modify the print-dup of native byte arrays (needed for round-trips). For an app it’s probably OK, but for a lib that would be a bigger problem.

didibus · August 24, 2020, 9:26pm

I see, so you’re thinking: you have a library that makes use of byte sequences (as byte arrays for now), and you’d like the byte sequences created by your lib to print in a readable way, without forcing all byte arrays in the program to print that way?

FYI, from an old mailing-list question of mine, the consensus was to extend print-method and have it switch on *print-readably*: when true, it prints a readable EDN literal (#x/y … for example), otherwise it prints as standard. And to leave print-dup alone. Here’s the thread: https://groups.google.com/g/clojure/c/R-9Pwk3HcFk?pli=1

Basically, it seems print-dup predates EDN. The goal of print-dup is to print in a way that the Clojure reader can read back, but not necessarily in a way that is valid EDN. So for example, you could have print-dup emit a byte array as (byte-array [12 35 167 210]), but for EDN that wouldn’t really make sense. In this case it would be valid, but not exactly what you’d want for EDN, where you’d probably prefer a tag like: #mylib/bytes [12 35 167 210]

Anyway, back to your point. I think the best option is for you to do just that: extend the printer for byte arrays, but put the extension code in a separate namespace, which the user needs to explicitly require if they want that behavior. That way they can choose how they want byte arrays printed.
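
A minimal sketch of that idea, assuming a hypothetical #mylib/bytes tag with a hex-string payload and a hypothetical helper bytes->hex (neither is an established convention). It falls back to the stock Object printer when *print-readably* is false:

```clojure
;; Illustrative sketch; in a real library this would live in its own
;; opt-in namespace so that users choose whether to load it.
(def byte-array-class (Class/forName "[B"))

(defn bytes->hex
  "Hypothetical helper: lower-case hex string for a byte array."
  [^bytes bs]
  (apply str (map #(format "%02x" (bit-and % 0xff)) bs)))

(defmethod print-method byte-array-class
  [^bytes bs ^java.io.Writer w]
  (if *print-readably*
    ;; readable form: a tagged literal
    (.write w (str "#mylib/bytes \"" (bytes->hex bs) "\""))
    ;; otherwise defer to the default Object printer
    ((get-method print-method Object) bs w)))

(pr-str (byte-array [0 -1]))    ;=> "#mylib/bytes \"00ff\""
(print-str (byte-array [0 -1])) ; default #object[...] form
```

pr leaves *print-readably* at its default of true, while print binds it to nil, which is why the two calls differ.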

Alternatively, I feel what you were doing should have worked: wrap the byte arrays you expose in some higher-level structure, either a deftype, a specially typed map, or a typed variant vector:

;; Typed map
{:mylib/type :byte-array
 :bytes <the native byte array here>}

;; Variant tuple
[:mylib/byte-array <the native byte array here>]

With those, you can just have a print-method that recognizes them as special, and prints the byte-array in them however you want. And make your lib work on those structures.
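
One possible way to make the printer recognize such a wrapper (an assumption on my part, not something the thread settles on) is :type metadata, which print-method consults before falling back to the class. The names wrap-bytes and ::byte-array are hypothetical:

```clojure
;; Illustrative sketch: a map wrapper whose printing is selected by
;; :type metadata, leaving the printing of raw byte arrays untouched.
(defn wrap-bytes
  "Hypothetical constructor for the wrapped representation."
  [^bytes bs]
  (with-meta {:bytes bs} {:type ::byte-array}))

(defmethod print-method ::byte-array
  [m ^java.io.Writer w]
  (.write w "#mylib/bytes \"")
  (doseq [b (:bytes m)]
    (.write w (format "%02x" (bit-and b 0xff))))
  (.write w "\""))

(pr-str (wrap-bytes (byte-array [0 -1 10]))) ;=> "#mylib/bytes \"00ff0a\""
```

The trade-off is that metadata is easy to lose: any operation that rebuilds the map without preserving metadata also loses the custom printing.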

I also think the deftype should work, but my guess is that the type doesn’t exist yet, because the data-readers are going to run before the deftype declaration. I don’t know if you need to require the namespace with the deftype from within your data-reader maybe?

bsless · August 25, 2020, 6:23am

How about the following approach?

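;; Note: javax.xml.bind was removed from the JDK in Java 11; on newer JVMs
;; the JAXB API has to be added to the classpath as a separate dependency.
(import '[javax.xml.bind DatatypeConverter])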
(defn hex->bin [x] (DatatypeConverter/parseHexBinary x))
(defn bin->hex [x] (DatatypeConverter/printHexBinary x))

(def bin (type (hex->bin "a1")))

(defmethod print-method bin
  [v ^java.io.Writer w]
  (.write w "#x \"")
  (.write w (bin->hex v))
  (.write w "\""))

(set! *data-readers* (conj *data-readers* {'x #'hex->bin}))

;;; then
user=> (def x #x "a1")
#'user/x
user=> x
#x "A1"
user=> (type x)
[B
user=>
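
For JVMs without javax.xml.bind, the two helpers can be approximated in plain Clojure. This is a sketch under the assumption that the input is well-formed hex of even length; it reuses the names hex->bin and bin->hex only for continuity with the snippet above.

```clojure
;; Illustrative stand-ins for the JAXB-based helpers.
(defn bin->hex
  "Upper-case hex string for a byte array."
  [^bytes bs]
  (apply str (map #(format "%02X" (bit-and % 0xff)) bs)))

(defn hex->bin
  "Byte array for a hex string. Assumes an even number of valid hex
  digits; a trailing odd digit is silently dropped by partition."
  [^String s]
  (byte-array
    (map (fn [[hi lo]]
           (unchecked-byte (Integer/parseInt (str hi lo) 16)))
         (partition 2 s))))

(bin->hex (hex->bin "a1"))   ;=> "A1"
(vec (hex->bin "00ffabcd"))  ;=> [0 -1 -85 -51]
```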

Chris_Hapgood · August 25, 2020, 1:28pm

@bsless, I think you have landed in the same place that I did, right down to leveraging printHexBinary and parseHexBinary. It’s a pragmatic solution that ticks a lot of boxes, but it has three drawbacks that I can identify:

1. You’ve hijacked the printing of native byte arrays. Probably safe, but clearly only one means of rendering is possible and all the players in an ecosystem need to agree.
2. Native byte arrays are mutable whereas clojure.core.Vec instances are not. But byte arrays play nicely out-of-the-box with Java libs.
3. Native byte arrays cannot be extended with protocols so (in my use case) it’s less convenient to define some crypto operations that would otherwise be nicely accommodated with Clojure protocols.

Despite these drawbacks, I think you have the optimal solution and it confirms my conclusions. Thank you.

It’s worth noting that were Clojure core to define the printing and (default?) literal reader for a hex string, my first concern above would not apply.

Chris_Hapgood · August 25, 2020, 1:41pm

@didibus, thanks for your insight. It seems like the consensus is to modify the printer and provide a reader. And I too concluded that I needed a higher level type ByteArray in order to accommodate extension of some application-specific protocols to the byte array.

Note that my immediate goal is round-trip support through the Clojure reader. The EDN reader seems like a slightly easier task and it’s not as critical for my use case. The thread you reference provides lots of good input, thanks.
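
To make the two reader paths concrete, here is a sketch that reads the same tagged text through both. The tag #mylib/bytes and the helpers bytes->hex and hex->bytes are hypothetical, and the readers are supplied explicitly rather than through data_readers.clj.

```clojure
;; Illustrative sketch: one hex-string tagged literal, two readers.
(require '[clojure.edn :as edn])

(defn bytes->hex [^bytes bs]
  (apply str (map #(format "%02x" (bit-and % 0xff)) bs)))

(defn hex->bytes [^String s]
  (byte-array
    (map (fn [[hi lo]]
           (unchecked-byte (Integer/parseInt (str hi lo) 16)))
         (partition 2 s))))

(def readers {'mylib/bytes hex->bytes})

;; The printed form of the bytes 00 ff ab cd.
(def text
  (str "#mylib/bytes \"" (bytes->hex (byte-array [0 -1 -85 -51])) "\""))

;; Clojure reader: tag handlers come from *data-readers*.
(def via-clj (binding [*data-readers* readers] (read-string text)))

;; EDN reader: tag handlers are passed in the :readers option.
(def via-edn (edn/read-string {:readers readers} text))

(vec via-clj) ;=> [0 -1 -85 -51]
(vec via-edn) ;=> [0 -1 -85 -51]
```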

As for requiring the deftype namespace from within my data_readers.clj… I don’t think that solves the issue. Critically, if the deftype basis is a byte array (or other non-round-trip-capable type), reading seems to fail even if the deftype itself attempts to manage all the necessary printing. The printing of the basis for the deftype seems to be hard coded in some way that I can’t yet crack.

It would be really convenient if this concept of round-trip print/read for the Clojure reader (and, in parallel, the EDN reader) were clearly documented somewhere, covering issues like print-ctor, tagged literals, basis printing and perhaps even some clear recommendations for *print-readably* and *print-dup*. Without it, your advice is even more valuable.

bsless · August 25, 2020, 3:33pm

Convergent evolution
Regarding your points:

1. I don’t mind hijacking the printing of byte arrays. Something like #object["[B" 0x5ce392e9 "[B@5ce392e9"] isn’t very meaningful to me.
2. You can implement this solution for both types.
3. They can, if you cheat:

(def B (type (hex->bin "a1")))
(defprotocol IFoo (-foo [this]))
(eval `(extend-protocol IFoo ~B (-foo [~'this] (.toString ~'this))))

Edit:
you can do it without a hack by using extend directly:

(extend B IFoo {:-foo (fn [this] (.toString this))})
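
A self-contained version of the same technique, for illustration. The protocol Hex and its method to-hex are hypothetical names; the array class is looked up with Class/forName instead of taking the type of a sample value.

```clojure
;; Illustrative sketch: extending a protocol to primitive byte arrays.
(defprotocol Hex
  (to-hex [this] "Lower-case hex string for this value."))

(extend (Class/forName "[B")
  Hex
  {:to-hex (fn [^bytes this]
             (apply str (map #(format "%02x" (bit-and % 0xff)) this)))})

(to-hex (byte-array [0 -1]))          ;=> "00ff"
(satisfies? Hex (byte-array [1 2 3])) ;=> true
```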

Chris_Hapgood · August 27, 2020, 4:12pm

Agreed that hijacking the printing of byte arrays is a small sin and, given the nasty default printing, it’s “justifiable”.

I forgot about the bare extend function. Turns out that you don’t even need to def an “alias” type. This works fine:

(extend (class #x "") IFoo {:-foo (fn [this] ...)})

I’m pretty close to satisfactory Clojure support for the byte array “type”. My only outstanding issue is equality. I don’t think there is any hope of satisfying Clojure’s =, so I’ll have to drop back to a custom my= or similar.

It’s worth noting that a clojure.core.Vec would give me the equality semantics I want. But there are serious roadblocks with round-trip reader/printer support for this type. Maybe in another post I’ll solicit the help of this forum to tackle that issue.
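
The difference in equality semantics can be seen in a short sketch. Here bytes= is a hypothetical stand-in for the custom my= mentioned above, built on java.util.Arrays/equals.

```clojure
;; Illustrative sketch: equality of byte arrays versus a Vec of bytes.
(def a (byte-array [1 2 3]))
(def b (byte-array [1 2 3]))

;; Clojure's = falls back to Object.equals for arrays, i.e. identity.
(= a b) ;=> false

(defn bytes=
  "Hypothetical custom equality: compares array contents."
  [^bytes x ^bytes y]
  (java.util.Arrays/equals x y))

(bytes= a b) ;=> true

;; A Vec of bytes compares by value out of the box.
(= (into (vector-of :byte) a)
   (into (vector-of :byte) b)) ;=> true
```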

bsless · August 27, 2020, 5:51pm

You can always petition to add support for array equality in Util/equiv.

Chris_Hapgood · August 27, 2020, 6:05pm

Because arrays are mutable, I don’t think that will be appealing to the language stewards. And I don’t blame them: in a value-based world, it’s disconcerting that something that was equal might no longer be equal.
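
A tiny sketch of that concern, using content comparison via java.util.Arrays/equals (bytes= is a hypothetical helper name):

```clojure
;; Illustrative sketch: content equality of arrays is not stable.
(defn bytes= [^bytes x ^bytes y]
  (java.util.Arrays/equals x y))

(def a (byte-array [1 2 3]))
(def b (byte-array [1 2 3]))

(def equal-before (bytes= a b)) ;=> true

(aset-byte a 0 9)               ; mutate a in place

(def equal-after (bytes= a b))  ;=> false
```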

Alan_Thompson · August 28, 2020, 1:07am

Chris,

Partially OT, but if you need some Java interop to do I/O on primitive types for arbitrary length sequences, you may want to look at the tupelo.io namespace for some convenience functions.
