How to search XML in cljs?

Webdev_Tory · July 31, 2020, 5:35pm

I’m having trouble searching xml in a pure client-side fashion. I’ve been vacillating between using browser-based DOM functionality for this, trying to leverage Closure, and trying to leverage clojure.data.xml. I can get and read the XML in each of these ways, but I’m struggling to search it. In my example, I want to find every <title> element and obtain the string of what the element is titled. Even this seems difficult, though. Here’s what I’ve scratched up so far, with limited success:

;; this is all cljs
;; with clojure.data.xml, but is non-trivially nested without search capabilities (css/hiccup style would be best, or at least xpath)
(let [x (xml/parse-str "<title>Tech.ToryAnderson.com</title>")]
  (-> x :content) ; ("Tech.ToryAnderson.com")
  #_(js/console.log x))
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;

;; With raw javascript
(let [s "<title>Tech.ToryAnderson.com</title>"
      p (js/DOMParser.)
      doc (.parseFromString p s "text/xml")]
  (-> (.getElementsByTagName doc "title")
      ;; vec ; Can't get to a place to use cljs (map).
      ;; repl/invoke error Error: [object HTMLCollection] is not ISeqable
      #_((aget 0)
         .-innerHTML) ; "Tech.ToryAnderson.com" ;; works for just one
      ))
;; but how to do this for a large collection with nested data?

vvvvalvalval · July 31, 2020, 5:42pm

Yeah, navigating XML is tedious… Feel free to contribute a CLJS or CLJC engine to xml-pull

Webdev_Tory · July 31, 2020, 5:43pm

I’m thinking of trying to port Enlive to CLJS; a real shame it isn’t here already

jan · July 31, 2020, 5:51pm

Have you considered an XPath library?

Webdev_Tory · July 31, 2020, 8:16pm

Ooh – that’s lovely

tobyloxy · July 31, 2020, 10:55pm

You can convert an HTMLCollection (or the NodeList that querySelectorAll returns) to a Clojure sequence with array-seq, and then map over it like you normally would.

(comment
  (let [s "<title>one</title> <title>two</title> <title>three</title>"
        p (js/DOMParser.)
        doc (.parseFromString p s "text/html")
        ;; querySelectorAll returns a NodeList; array-seq accepts it
        ;; the same way it accepts an HTMLCollection
        node-list (.querySelectorAll doc "title")]
    (map #(.-innerHTML %) (array-seq node-list))))
; => ("one" "two" "three")

See

Alternatively, consider using Hickory and its selectors:

tobyloxy · July 31, 2020, 11:28pm

Hickory example:

(ns demo.scratch
  (:require [hickory.core :as h]
            [hickory.select :as s]))

(comment
  (let [s "<title>one</title> <title>two</title> <div><title>three</title></div>"
        tree (-> s h/parse h/as-hickory)
        title-elements (s/select (s/tag :title) tree)]
    (map #(first (get % :content)) title-elements)))
; => ("one" "two" "three")

xfthhxk · August 1, 2020, 12:02pm

You can also use zippers. Here’s an example from Stack Overflow. You probably don’t need io/reader, and instead of x/parse you might have to use x/parse-str. Caveat: I haven’t tried this out myself yet.
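For reference, here is a minimal sketch of what a zipper walk could look like. It is not the Stack Overflow example mentioned above. It assumes elements are maps with :tag, :attrs and :content keys (the shape clojure.data.xml elements expose), and it uses hand-written maps in place of parser output so it runs in both CLJ and CLJS. The names feed and title-texts are made up for illustration.

```clojure
(ns demo.zipper
  (:require [clojure.zip :as zip]))

;; Hand-written element maps standing in for parser output (an assumption).
(def feed
  {:tag :rss :attrs {}
   :content [{:tag :channel :attrs {}
              :content [{:tag :title :attrs {} :content ["Channel title"]}
                        {:tag :item :attrs {}
                         :content [{:tag :title :attrs {} :content ["First item"]}]}
                        {:tag :item :attrs {}
                         :content [{:tag :title :attrs {} :content ["Second item"]}]}]}]})

(defn title-texts
  "Hypothetical helper: walks the tree depth-first with a zipper and
  collects the text of every :title whose parent element is an :item."
  [root]
  (loop [loc (zip/xml-zip root)
         acc []]
    (if (zip/end? loc)
      acc
      (let [node   (zip/node loc)
            ;; zip/up returns nil at the root, hence some->
            parent (some-> (zip/up loc) zip/node)]
        (recur (zip/next loc)
               (if (and (= :title (:tag node))
                        (= :item (:tag parent)))
                 (into acc (:content node))
                 acc))))))

(title-texts feed)
```

The point of the zipper here is the zip/up call: at any node you can look at the parent (or siblings) without threading that context through the traversal yourself.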

Phill · August 1, 2020, 9:55pm

In ClojureScript, clojure.data.xml uses the browser’s DOMParser.

If the run-time cost of conversion to Clojure data is OK, then zippers are the Cadillac of next steps. Traversing the zipper, you can “see” up and down and all directions from the current node, which can be convenient. At the other extreme is the standard library’s best-kept secret: xml-seq! Demonstrated here on an “RSS” feed which has its own title, in addition to items with titles. We select only the items’ titles:

user> (let [x (xml/parse-str "<rss><channel><title>Channel title</title><item><title>Tech.ToryAnderson.com</title></item><item><title>Second item</title><guid>foo</guid></item></channel></rss>")]
         (->> (xml-seq x)
              (filter #(= (:tag %) :item))
              (mapcat :content)
              (filter #(= (:tag %) :title))
              (mapcat :content)))

("Tech.ToryAnderson.com" "Second item")

Webdev_Tory · August 6, 2020, 8:40pm

+1 for showing me xml-seq. This works as a Clojure-native search method, but it still lacks the advanced querying of something like XPath; for example, it’s non-trivial to express a query like “all title nodes that are under doc.type=movie”. Or maybe I just need to embrace a more Clojure way of thinking here.
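One possible "Clojure way" to express that query is sketched below. It reads "doc.type=movie" as a doc element whose type attribute is "movie", which is an assumption about the data. The sample tree is hand-written in the :tag / :attrs / :content shape, and elements and titles-under-movies are hypothetical helper names. Only tree-seq and for are used, so it should work in CLJ and CLJS alike.

```clojure
;; Hand-written sample data (an assumption about the document shape).
(def library
  {:tag :library :attrs {}
   :content [{:tag :doc :attrs {:type "movie"}
              :content [{:tag :title :attrs {} :content ["Alien"]}
                        {:tag :cast :attrs {}
                         :content [{:tag :title :attrs {} :content ["Ripley"]}]}]}
             {:tag :doc :attrs {:type "book"}
              :content [{:tag :title :attrs {} :content ["Dune"]}]}]})

(defn elements
  "Hypothetical helper: every element at or below root, depth-first.
  Strings are leaves and are filtered out."
  [root]
  (filter map? (tree-seq map? :content root)))

(defn titles-under-movies
  "Text of every :title that is a descendant of a :doc with type \"movie\"."
  [root]
  (for [doc   (elements root)
        :when (and (= :doc (:tag doc))
                   (= "movie" (get-in doc [:attrs :type])))
        el    (elements doc)
        :when (= :title (:tag el))
        text  (:content el)]
    text))

(titles-under-movies library)
;; => ("Alien" "Ripley")
```

This is roughly the descendant query //doc[@type='movie']//title written as nested sequence comprehensions. If matching doc elements can be nested inside each other, the same title would be returned more than once, so a real version might need distinct.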

Phill · August 6, 2020, 11:07pm

Working in ClojureScript, in a browser, there’s no dishonor in using the browser’s built-in XPath. In “pure Clojure”, Enlive accomplished something more flexible than XPath with zippers, but Enlive’s notation will seem abstruse unless it’s obvious that XPath would have been harder. (The zippery part of Enlive is here: https://github.com/cgrand/enlive/blob/master/src/net/cgrand/enlive_html.clj)
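As a rough illustration of the built-in XPath route, the sketch below calls the DOM's document.evaluate from ClojureScript. It only runs in a browser (it needs js/DOMParser and js/XPathResult). The helper name xpath-texts and the sample XML, including the doc element with a type attribute, are assumptions made up for the example.

```clojure
;; ClojureScript, browser only.
(defn xpath-texts
  "Hypothetical helper: text content of every node in doc matching xpath."
  [doc xpath]
  (let [result (.evaluate doc xpath doc nil
                          js/XPathResult.ORDERED_NODE_SNAPSHOT_TYPE nil)]
    ;; A snapshot result is indexed, so walk it by position.
    (mapv #(.-textContent (.snapshotItem result %))
          (range (.-snapshotLength result)))))

(defn movie-titles
  "Parses an XML string and returns the titles under doc elements of type movie."
  [xml-string]
  (let [doc (.parseFromString (js/DOMParser.) xml-string "text/xml")]
    (xpath-texts doc "//doc[@type='movie']//title")))

(movie-titles
 (str "<library>"
      "<doc type=\"movie\"><title>Alien</title></doc>"
      "<doc type=\"book\"><title>Dune</title></doc>"
      "</library>"))
```

For this input the call should return only the movie's title. The snapshot result type is used because it can be indexed after the fact; an iterator result type would also work but is invalidated if the document changes during iteration.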

Webdev_Tory · August 8, 2020, 9:43pm

Sadly, xml-seq appears to be CLJ-only, not CLJS.

Phill · August 9, 2020, 9:20pm

How strange! This deficiency of ClojureScript is not mentioned on “Differences from Clojure” https://clojurescript.org/about/differences.

On the bright side, xml-seq is a one-liner, an application of tree-seq, which appears to be in ClojureScript.
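A sketch of that one-liner follows, mirroring how Clojure's own xml-seq is defined in terms of tree-seq. The name xml-seq* is chosen here only to avoid clashing with the core function on the JVM. The sample data is a hand-written stand-in for parsed output, in the :tag / :attrs / :content shape, and item-titles is a hypothetical helper.

```clojure
(defn xml-seq*
  "Depth-first seq of every node in an XML tree: anything that is not a
  string is a branch, and its children are whatever is under :content."
  [root]
  (tree-seq (complement string?) (comp seq :content) root))

;; Hand-written stand-in for a parsed RSS feed (an assumption).
(def feed
  {:tag :rss :attrs {}
   :content [{:tag :channel :attrs {}
              :content [{:tag :title :attrs {} :content ["Channel title"]}
                        {:tag :item :attrs {}
                         :content [{:tag :title :attrs {} :content ["Tech.ToryAnderson.com"]}]}
                        {:tag :item :attrs {}
                         :content [{:tag :title :attrs {} :content ["Second item"]}
                                   {:tag :guid :attrs {} :content ["foo"]}]}]}]})

(defn item-titles
  "Titles of :item elements only, skipping the channel's own title."
  [root]
  (->> (xml-seq* root)
       (filter #(= :item (:tag %)))
       (mapcat :content)
       (filter #(= :title (:tag %)))
       (mapcat :content)))

(item-titles feed)
;; => ("Tech.ToryAnderson.com" "Second item")
```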

Webdev_Tory · August 12, 2020, 10:44pm

Big thanks for the comments and suggestions here. I learned much and gained some strong opinions/appreciations. Blog link forthcoming.
