Basic CSV reading with Numbers

fridgemagnet · May 9, 2024, 2:40pm

I’m trying to read a single small delimited list of numbers from a CSV file. When using slurp, if I’m reading the problem below correctly, the data is being imported as java.lang.String, so I cannot use +, -, apply, or map in any way on the collection. I also find it odd that the collection remains in quotation marks, ["1,2,3,4"], and not [1,2,3,4]. My best guess was to find a typecasting function, and I thought cast or to-array would work, but I run into the same error message.

For a solution, the Clojure data structure / type the data is held in is unimportant to me, as long as I can use apply and reduce.

I’ve looked online and in the docs for a solution and would be grateful for any help in this matter.

I considered external libraries for CSV and changing the data format, but I seem to run into the same problem. I also considered using SQLite, but I think this just adds complexity I don’t want or need.

In lein repl, Clojure 1.11.1, JDK 8 (same issue on another system with JDK 11):

=> (def x (-> (slurp "nums.csv") (clojure.string/split #"\n")))
["1,2,3,4"]
=> (type x)
clojure.lang.PersistentVector
=> (apply + x)
Execution error (ClassCastException) at java.lang.Class/cast (Class.java:3369).
Cannot cast java.lang.String to java.lang.Number

love-your-parens · May 9, 2024, 4:28pm

Look closely at your data. It’s not “weirdly quoted” - it’s a single-element vector and that element is a string. You split your input into rows, but not into columns.
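
A quick way to confirm this at the REPL is sketched below. The vector is typed in by hand and assumed to mirror the value of x from the question, and the name fields is only illustrative. Note that splitting on commas still leaves strings, so a parsing step is needed before any arithmetic.

```clojure
(require '[clojure.string :as str])

;; Assumed to mirror the value of x from the question.
(def x ["1,2,3,4"])

(count x)            ;; => 1     (one element, i.e. one row)
(string? (first x))  ;; => true  (and that element is a string)

;; Splitting the row on commas gives the columns, still as strings.
(def fields (str/split (first x) #","))
fields               ;; => ["1" "2" "3" "4"]
```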

joinr · May 9, 2024, 9:51pm

I cannot use +, -, apply, or map in any way on the collection.

Technically, strings are seqs of characters, so you can map/filter/reduce them in that context.
It’s not what you want here, but good to know that the seq abstraction can be extended there.
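
A small sketch of what that looks like, using a hand-typed string rather than your file (the name s is just for illustration):

```clojure
(def s "1,2,3,4")

;; A string can be viewed as a seq of characters.
(seq s)                       ;; => (\1 \, \2 \, \3 \, \4)

;; So the usual sequence functions work on it, character by character.
(count (filter #{\,} s))      ;; => 3
(apply str (remove #{\,} s))  ;; => "1234"
```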

(def x (-> (slurp "nums.csv") (clojure.string/split #"\n")))

I don’t have your data, so I’ll synthesize some:

(def data (->> (range 20)
               (partition 4)
               (map (fn [xs] (clojure.string/join \, xs)))
               (clojure.string/join \newline)))

(spit "blah.csv" data)

So the contents of the file are:

0,1,2,3
4,5,6,7
8,9,10,11
12,13,14,15
16,17,18,19

We can read it in one pass into memory (this is not always the best option, but it’s trivial enough for a vast number of use cases). As you did, I leverage slurp, then dissect the string into rows using clojure.string/split-lines (a built-in), then split each row into a vector of comma-delimited number strings using clojure.string/split. This yields a sequence of rows, where each row is a vector of strings. You stopped one step short of this, with each row still a single string.

The missing piece is to parse the row entries into numbers. I am assuming longs here, so I map into a vector (using mapv for convenience) with the function parse-long. In older code you may see #(Long/parseLong %) instead, but since Clojure 1.11 we have parsing functions provided in clojure.core.

(defn read-csv [path]
  (->> path
       slurp
       clojure.string/split-lines
       (map (fn [row]
              (->> (clojure.string/split row #",")
                   (mapv parse-long))))))
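
One caveat worth knowing: parse-long returns nil for anything that is not a clean integer string, including entries with surrounding spaces. Below is a minimal sketch of a more forgiving row parser, assuming Clojure 1.11+ and that stray whitespace around entries should be ignored. The name parse-row is hypothetical and not part of any library.

```clojure
(require '[clojure.string :as str])

;; Hypothetical helper: split a row on commas, trim each entry, parse as long.
(defn parse-row [row]
  (->> (str/split row #",")
       (map str/trim)
       (mapv parse-long)))

(parse-long " 2")     ;; => nil (parse-long does not trim for you)
(parse-row "1, 2 ,3") ;; => [1 2 3]
(parse-row "1,x,3")   ;; => [1 nil 3] (unparseable entries become nil)
```

Whether nil entries should be kept, dropped, or treated as an error depends on the data, so that decision is left out of the sketch.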

So now we can read our csv of numbers (in this narrow context) and work with the resulting nested collection:

(->> "blah.csv"
     read-csv
     (apply concat)
     (reduce +))
;;190
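
Once the data is a sequence of number vectors, other reductions follow the same pattern. A short sketch, with the rows typed in by hand (assumed to match what read-csv returns for the file above) so it runs without the file:

```clojure
;; Assumed to match the result of (read-csv "blah.csv") above.
(def rows [[0 1 2 3] [4 5 6 7] [8 9 10 11] [12 13 14 15] [16 17 18 19]])

;; Sum of each row.
(map #(reduce + %) rows)  ;; => (6 22 38 54 70)

;; Sum of each column: map + across all rows at once.
(apply map + rows)        ;; => (40 45 50 55)

;; Grand total without building the concatenated seq first.
(transduce cat + rows)    ;; => 190
```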

This is serviceable for small problems (things like Advent of Code puzzles fit this kind of pattern quite a bit). For actual munging of “real” csv/tsv or other encodings with more complex types, I would recommend a stronger library. charred is very efficient but, like data.csv, is only concerned with the csv syntax and not with inferring values. semantic-csv can handle parsing/coercing values. tech.ml.dataset and its derivatives can handle a lot too (and use charred under the hood for csv parsing):

user=> (require '[tech.v3.dataset :as ds])  
nil
user=> (def nums (ds/->dataset "blah.csv" {:header-row? false}))
#'user/nums
user=> nums
blah.csv [5 4]:

| column-0 | column-1 | column-2 | column-3 |
|---------:|---------:|---------:|---------:|
|        0 |        1 |        2 |        3 |
|        4 |        5 |        6 |        7 |
|        8 |        9 |       10 |       11 |
|       12 |       13 |       14 |       15 |
|       16 |       17 |       18 |       19 |

While there is a dedicated query and transformation API, datasets implement Clojure’s persistent collection APIs. At their base, they are mappings of column names (any object, typically strings, keywords, or numbers) to columns (a custom type). This means they conform to the semantics of persistent maps when faced with seq, count, keys, vals, reduce, etc. The column type also has plenty of optimized operations, and it participates in the APIs for our indexed collections (like a vector). So we can treat this column-major format as if it were a map of vectors, like:

{"column-0" [0 4 8 12 16], 
 "column-1" [1 5 9 13 17], 
 "column-2" [2 6 10 14 18], 
 "column-3" [3 7 11 15 19]}

user=> (first nums)
["column-0"
 #tech.v3.dataset.column<int16>[5]
 column-0
 [0, 4, 8, 12, 16]]
user=> (->> nums vals (apply concat))
(0 4 8 12 16 1 5 9 13 17 2 6 10 14 18 3 7 11 15 19)
user=> (->> nums vals (apply concat) (reduce +))
190

You can also traverse the values as a seq-of-maps or a seq-of-vectors, if you need record or row-major access.

user=> (ds/rows nums)
[{"column-0" 0, "column-1" 1, "column-2" 2, "column-3" 3} 
 {"column-0" 4, "column-1" 5, "column-2" 6, "column-3" 7} 
 {"column-0" 8, "column-1" 9, "column-2" 10, "column-3" 11}
 {"column-0" 12, "column-1" 13, "column-2" 14, "column-3" 15} 
 {"column-0" 16, "column-1" 17, "column-2" 18, "column-3" 19}]

user=> (ds/rowvecs nums)
[[0 1 2 3] [4 5 6 7] [8 9 10 11] [12 13 14 15] [16 17 18 19]]
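
To make the column-major versus row-major relationship concrete without the library, here is a sketch using only plain Clojure data. It is an analogy for the shapes above, not how tech.ml.dataset is implemented, and the names cols, rowvecs, and rowmaps are illustrative.

```clojure
;; A plain map of vectors standing in for the dataset (sorted to fix column order).
(def cols (sorted-map "column-0" [0 4 8 12 16]
                      "column-1" [1 5 9 13 17]
                      "column-2" [2 6 10 14 18]
                      "column-3" [3 7 11 15 19]))

;; Transpose the columns to get row vectors.
(def rowvecs (apply mapv vector (vals cols)))
rowvecs          ;; => [[0 1 2 3] [4 5 6 7] [8 9 10 11] [12 13 14 15] [16 17 18 19]]

;; Pair each row with the column names to get row maps.
(def rowmaps (mapv #(zipmap (keys cols) %) rowvecs))
(first rowmaps)  ;; => {"column-0" 0, "column-1" 1, "column-2" 2, "column-3" 3}
```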

fridgemagnet · May 15, 2024, 7:11pm

Thanks very much for replying.

I stopped learning Clojure for a while and am trying to get back to it. It’s a neat language. I’m a bit surprised to run into the ‘casting’ errors in my original post. I don’t see a lot of references to typecasting.

I looked into using tech.ml but it felt too complicated.

I’ll continue to evaluate some use cases for Clojure but I’m happy to get this working. Thanks!

glchapman · May 17, 2024, 10:53pm

Here’s a little utility function I find convenient for things like Advent of Code:

(require '[clojure.edn :as edn])

(defn read-string-vec
  "Surround text with brackets in order to read it as a vector.
    E.g., '1,2,3' is read as [1 2 3]"
  [text]
  (edn/read-string (str \[ text \])))

For simple CSV, you could take advantage of Clojure treating ‘,’ as white space and use something like:

(->> (slurp source)
  (clojure.string/split-lines)
  (map read-string-vec))
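
A self-contained sketch of the same idea, using an in-memory string in place of a file (the name text is a stand-in for the slurped contents):

```clojure
(require '[clojure.edn :as edn]
         '[clojure.string :as str])

(defn read-string-vec [text]
  (edn/read-string (str \[ text \])))

;; Stand-in for (slurp source).
(def text "0,1,2,3\n4,5,6,7")

(->> text str/split-lines (mapv read-string-vec))
;; => [[0 1 2 3] [4 5 6 7]]

(->> text str/split-lines (mapcat read-string-vec) (reduce +))
;; => 28

;; Entries are read as EDN values, not only as integers:
(read-string-vec "1, 2.5, abc")
;; => [1 2.5 abc]  (a long, a double, and a symbol)
```

Because each entry goes through the EDN reader, this suits input you trust to contain plain numbers. Anything else may come back as a symbol, keyword, or nested collection rather than failing.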