This is an area where I still feel I’m missing something essential in Clojure: a concise and idiomatic way to query an HTML document. Testing and any kind of scraping come to mind as use cases.
The main thing I need is quickly selecting elements based on CSS and/or XPath.
Sparkledriver can do this, but if you already have HTML from some other source then maybe it’s not the most elegant solution.
There’s Hickory, which has its own “css style selectors”, but I already know CSS selectors and I don’t want to learn another DSL.
I guess Enlive comes closest to what I’m looking for. It uses Clojure vectors as selectors, similar to how Garden does it, which I’m ok with. It could stand to be better documented though.
I don’t care too much about the internal representation, although really it should be either clojure.xml style or Hiccup.
I’ve been happy with the Enlive experience, though I agree with your assessment of the state of documentation. It relies on jsoup, mentioned by @thheller.
I went with Enlive, which I’ve used in the past and had forgotten how much I liked it. I’m also already used to the selector syntax from Garden. Here’s how you would do the above in Enlive
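A minimal sketch of what an Enlive query looks like, assuming a made-up HTML fragment rather than the example referred to above. The namespace name, the `page` fragment and the `link-data` helper are illustrative only; `html-snippet`, `select` and `text` are Enlive functions.

```clojure
(ns example.enlive-query
  (:require [net.cljs.enlive-html :as html]))

;; Hypothetical input: a small fragment parsed into Enlive's node maps.
(def page
  (html/html-snippet
   "<ul class=\"links\"><li><a href=\"/a\">First</a></li><li><a href=\"/b\">Second</a></li></ul>"))

(defn link-data
  "Returns a seq of {:href .. :text ..} maps for every link inside ul.links."
  [nodes]
  ;; The vector reads like the CSS selector `ul.links li a`.
  (for [a (html/select nodes [:ul.links :li :a])]
    {:href (get-in a [:attrs :href])
     :text (html/text a)}))
```

Calling `(link-data page)` walks the parsed nodes and pulls the `href` attribute and text out of each matching link.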
Cheerio, a module from npm, might be useful in ClojureScript:
Cheerio’s selector implementation is nearly identical to jQuery’s, so the API is very similar.
$( selector, [context], [root] )
selector searches within the context scope which searches within the root scope. selector and context can be a string expression, DOM Element, array of DOM elements, or cheerio object. root is typically the HTML document string.
This selector method is the starting point for traversing and manipulating the document. Like jQuery, it’s the primary method for selecting elements in the document, but unlike jQuery it’s built on top of the CSSSelect library, which implements most of the Sizzle selectors.
Seconding this. It’s significantly more verbose than Enlive, but its selectors are very powerful and compose in a way that Enlive’s can’t.
I preferred Enlive for a long time because of the familiar CSS selector syntax, but Hickory really grew on me during a recent project where I had to do a lot of HTML querying…
I think Hickory looks a little more complicated for your simple example, but I find it clearer and easier to compose than Enlive for more complex selector expressions. Some other things I like about it:
works in both Clojure and ClojureScript
supports selectors and zippers
offers conversion to hiccup format for output
… but Enlive is also nice, so if it’s your thing, why not?
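To make the point about composition concrete, here is a small sketch of Hickory selectors built from smaller ones. The HTML string, the namespace name and the var names are made up for illustration; `parse`, `as-hickory`, `select`, `descendant`, `and`, `tag` and `class` come from Hickory itself.

```clojure
(ns example.hickory-query
  (:require [hickory.core :as h]
            [hickory.select :as s]))

;; Hypothetical input, parsed into Hickory's map representation.
(def tree
  (-> "<div class=\"post\"><h2>Title</h2><a class=\"more\" href=\"/x\">Read</a></div>"
      h/parse
      h/as-hickory))

;; Selectors are plain functions, so they can be named and recombined.
(def more-link (s/and (s/tag :a) (s/class "more")))
(def post-more-link (s/descendant (s/class "post") more-link))

(def hrefs
  ;; s/select returns the matching nodes as maps with :tag, :attrs, :content.
  (mapv #(get-in % [:attrs :href])
        (s/select post-more-link tree)))
```

This is more typing than the roughly equivalent CSS `.post a.more`, but each piece can be reused or swapped out on its own.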
I always use JSoup. It’s a straightforward Java library and I never feel the need to wrap it in Clojure. Interop is fine. It’s also very fast and the syntax is about 95% compatible with jQuery. We actually had something that would run the same selectors in the browser and in JSoup. JSoup parses very similarly to how browsers parse.
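For anyone who hasn’t tried the interop route, a rough sketch of what it looks like. The HTML string and the `hrefs` helper are invented for the example; `Jsoup/parse`, `.select` and `.attr` are JSoup’s own methods.

```clojure
(ns example.jsoup-query
  (:import [org.jsoup Jsoup]
           [org.jsoup.nodes Document Element]))

;; Hypothetical input document.
(def ^Document doc
  (Jsoup/parse
   "<div id=\"nav\"><a href=\"/a\">A</a><a href=\"/b\" class=\"ext\">B</a></div>"))

(defn hrefs
  "Returns the href of every element matching the CSS selector string."
  [^Document d ^String css]
  ;; .select returns an Elements collection, which Clojure can map over directly.
  (mapv (fn [^Element el] (.attr el "href"))
        (.select d css)))
```

`(hrefs doc "#nav a.ext")` uses the same selector string you would hand to jQuery in the browser.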
We originally started with Enlive but found it too slow for what we wanted to do (parsing millions of pages). The Enlive queries loop through a giant tree of immutable data structures each time. For small stuff (like template snippets), it’s fine. But big pages with complex queries would slow it down.
I’m currently refactoring a metric ton of parsing code and I’m gonna spike out converting from Enlive to Hickory. My gut says I’ll be able to craft a function that can take my Enlive selectors, and for the most part compose a Hickory select from that.
My biggest challenge is actually refactoring the parsing config (all data structures) into a format that is suitable for clojure.spec, which means I need to express it differently than what Hickory expects… No selectors are hard-coded anywhere in my parsing code.
I preprocessed some HTML with http://jtidy.sourceforge.net at some point. It can transform (most) HTML into well-formed XML, which opens up the landscape for querying a bit (XPath, zippers, etc.).
It took some trial and error to get jtidy to consume HTML that isn’t well-formed. I ended up doing something like this (with [net.sf.jtidy/jtidy "r938"] as a dependency):
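A rough sketch of one way to drive JTidy from Clojure, offered as an assumption about a workable configuration and not as the settings the poster actually used. The namespace and the `html->xml` name are hypothetical; the setters and `parse` are methods on `org.w3c.tidy.Tidy`.

```clojure
(ns example.tidy
  (:import [org.w3c.tidy Tidy]
           [java.io ByteArrayInputStream ByteArrayOutputStream
            InputStream OutputStream]))

(defn html->xml
  "Runs an HTML string through JTidy and returns the cleaned-up markup as a string."
  [^String html]
  (let [tidy (doto (Tidy.)
               (.setXmlOut true)          ; emit well-formed XML
               (.setQuiet true)           ; no summary banner
               (.setShowWarnings false)   ; don't print a warning per sloppy tag
               (.setInputEncoding "UTF-8")
               (.setOutputEncoding "UTF-8"))
        ^InputStream in   (ByteArrayInputStream. (.getBytes html "UTF-8"))
        ^OutputStream out (ByteArrayOutputStream.)]
    (.parse tidy in out)
    (.toString ^ByteArrayOutputStream out "UTF-8")))
```

The resulting string can then be handed to an XML parser for XPath or zipper-based querying.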
So I came out of the rabbit hole a few days ago and I must say that just using plain old JSoup, as @ericnormand suggested, won hands down. It is fast, familiar, and supports a wide range of selectors that I believe people are used to using for expressing these kinds of requirements.
Hickory got awkward very quickly. I already had a ton of (poorly) defined rules, so I needed to write a ton of functions for converting from my CSS-esque syntax to something I could combine for Hickory. Those rules had worked well with Enlive, and I had plenty of custom predicates that I used with Enlive.
With JSoup, though, most of my predicates also just fell away because it supports them directly.
I’m really happy with the result, and the facade over JSoup is still super thin. I ended up making a little protocol that helps a lot too:
(ns foo
  (:import [org.jsoup Jsoup]
           [org.jsoup.nodes Element Node]
           [org.jsoup.select Elements]))

(defprotocol Selectable
  "Protocol for selecting data from DOM-like data structures"
  (attr [_ k] "Return the value of the provided HTML attribute")
  (text [_] "Return the text for this element and all child elements")
  (own-text [_] "Return the text for just this element"))

;; Elements is the collection returned by a JSoup select call.
(extend-type Elements
  Selectable
  (attr [this k]
    (.attr this (name k)))
  (text [this]
    (.text this))
  ;; Elements has no .ownText method, so this falls back to the combined text.
  (own-text [this]
    (.text this)))

;; Element is a single node in the parsed document.
(extend-type Element
  Selectable
  (attr [this k]
    (.attr this (name k)))
  (text [this]
    (.text this))
  (own-text [this]
    (.ownText this)))
That basically solved the entire puzzle for me, at this stage of the match.
As for the rest, all my parsing/extracting code is now defined with clojure.spec, the functions are instrumented, and it is dead easy to catch a bug in the data structures when one appears. I feel like I’ve gone from complete parsing bankruptcy to having huge leverage in the system now.
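To give a flavour of what a spec’d parsing config can look like, here is a small sketch. The shape of the config (`:selector`, `:extract`, `:attr-name`) and the `rule-names` function are invented for illustration and are not the poster’s actual data model; the `clojure.spec.alpha` and `clojure.spec.test.alpha` calls are the standard ones.

```clojure
(ns example.parse-config
  (:require [clojure.spec.alpha :as s]
            [clojure.spec.test.alpha :as stest]
            [clojure.string :as str]))

;; Hypothetical rule shape: a CSS selector string plus what to pull out of the match.
(s/def ::selector (s/and string? (complement str/blank?)))
(s/def ::extract #{:text :own-text :attr})
(s/def ::attr-name keyword?)
(s/def ::rule (s/keys :req-un [::selector ::extract]
                      :opt-un [::attr-name]))
(s/def ::config (s/map-of keyword? ::rule))

(defn rule-names
  "Returns the sorted names of the rules in a parsing config."
  [config]
  (sort (keys config)))

(s/fdef rule-names
  :args (s/cat :config ::config))

;; With instrumentation on, a malformed config throws at the call site.
(stest/instrument `rule-names)

(def sample-config
  {:title {:selector "h1.title" :extract :text}
   :link  {:selector "a.more" :extract :attr :attr-name :href}})
```

Because the selectors live in data, a typo such as a blank selector or an unknown `:extract` value is reported by spec before any HTML is touched.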
Amusingly, not long after this thread got started I ran into a show-stopping bug in JSoup’s XML parsing (which I was using so that the codebase, which also deals with HTML, would be standardized on Hickory everywhere). Fortunately, the XML was well formed, so it was easy to switch to clojure.xml and clojure.walk.
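For well-formed XML the built-in tools go a long way. A minimal sketch, assuming a made-up feed document; it uses `xml-seq` from clojure.core for the traversal instead of clojure.walk, purely for brevity. The `parse-str` and `elements-by-tag` helpers are hypothetical names.

```clojure
(ns example.xml-query
  (:require [clojure.xml :as xml])
  (:import [java.io ByteArrayInputStream]))

(defn parse-str
  "Parses an XML string into clojure.xml's {:tag :attrs :content} maps."
  [^String s]
  (xml/parse (ByteArrayInputStream. (.getBytes s "UTF-8"))))

(defn elements-by-tag
  "Returns every element in the tree (depth-first) whose tag equals `tag`."
  [root tag]
  ;; xml-seq also yields the string content nodes; (:tag "a string") is nil,
  ;; so they are filtered out here.
  (filter #(= tag (:tag %)) (xml-seq root)))

;; Hypothetical input document.
(def feed
  (parse-str
   "<feed><entry id=\"1\"><title>One</title></entry><entry id=\"2\"><title>Two</title></entry></feed>"))

(def entry-ids (mapv #(get-in % [:attrs :id]) (elements-by-tag feed :entry)))
(def titles (vec (mapcat :content (elements-by-tag feed :title))))
```

No selector language here, but for documents with a known, regular shape a tag filter over `xml-seq` is often enough.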