Best library for querying HTML?

I really like Hickory. It’s my standard library for all sorts of scraping-related tasks.

I went with Enlive, which I’ve used in the past; I had forgotten how much I liked it. I’m also already used to the selector syntax from Garden. Here’s how you would do the above in Enlive:

(require '[net.cgrand.enlive-html :as enlive])

(let [doc (enlive/html-snippet "<div class='hello'>Hello, <em>world<em></div>")
      em  (enlive/select doc [:.hello :em])]
  (enlive/texts em))

Cheerio, a module from npm, might be useful in ClojureScript:

Cheerio’s selector implementation is nearly identical to jQuery’s, so the API is very similar.

$( selector, [context], [root] )

selector searches within the context scope which searches within the root scope. selector and context can be a string expression, DOM Element, array of DOM elements, or cheerio object. root is typically the HTML document string.

This selector method is the starting point for traversing and manipulating the document. Like jQuery, it’s the primary method for selecting elements in the document, but unlike jQuery it’s built on top of the CSSSelect library, which implements most of the Sizzle selectors.

$('.apple', '#fruits').text()
//=> Apple 
 
$('ul .pear').attr('class')
//=> pear 
 
$('li[class=orange]').html()
//=> Orange 

I’ve found both cheerio and enlive particularly enjoyable and painless.

For those put off by the lack of documentation, I’d recommend starting with David Nolen’s Enlive tutorial to learn the ropes.

Seconding the Hickory recommendation. It’s significantly more verbose than Enlive, but its selectors are very powerful and compose in a way that Enlive’s can’t.

I preferred Enlive for a long time because of its familiar CSS selector syntax, but Hickory really grew on me during a recent project where I had to do a lot of HTML querying…

@martinklepsch or @jackrusher, mind showing what the example would look like in Hickory, for comparison?

@plexus Sure! :slight_smile:

;; boot repl
(set-env! :dependencies '[[hickory "0.7.1"]]) 

(require '[hickory.select :as s]
         '[hickory.core :as hc])

(def my-doc
  (-> "<div class='hello'>Hello, <em>world</em></div>"
      hc/parse
      hc/as-hickory))

(s/select 
  (s/child (s/class "hello")
           (s/tag :em))
  my-doc)

;; [{:type :element, :attrs nil, :tag :em, :content ["world"]}]

Note that I modified your HTML string slightly; I think the two opening <em> tags were probably unintended?

I think hickory looks a little more complicated for your simple example, but I find it clearer and easier to compose than enlive for more complex selector expressions. Some other things I like about it:

  • clojure/clojurescript
  • supports selectors and zippers
  • offers conversion to hiccup format for output

… but enlive is also nice, so if it’s your thing, why not? :slight_smile:
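For the hiccup conversion mentioned in the list above, here is a minimal sketch along the lines of the Hickory README. It assumes the same hickory dependency as the earlier example, and it uses `parse-fragment` rather than `parse` so the result is not wrapped in html/head/body nodes. The var names are only for illustration.

```clojure
(require '[hickory.core :as hc])

;; parse-fragment returns a seq of parsed nodes, one per top-level node
(def fragment
  (hc/parse-fragment "<div class='hello'>Hello, <em>world</em></div>"))

;; Convert each parsed node into a hiccup vector such as [:div {...} ...]
(def hiccup-forms
  (map hc/as-hiccup fragment))

(first hiccup-forms)
```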

Hickory does look nice. I might have to look into it more if I ever need to do more heavy lifting.

Enlive has zippers btw: it’s just clojure.zip/xml-zip, since Enlive uses the same data structure as clojure.xml.
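A small sketch of what that looks like, using only `clojure.zip` from core. The `node` map is hand-written in the `{:tag :attrs :content}` shape rather than produced by Enlive, so treat it as an illustration of the data structure, not of Enlive's exact output.

```clojure
(require '[clojure.zip :as zip])

;; A node in the {:tag :attrs :content} shape shared by clojure.xml and Enlive
(def node
  {:tag :div
   :attrs {:class "hello"}
   :content ["Hello, " {:tag :em :attrs nil :content ["world"]}]})

;; Move from the root to its first child, then to the next sibling,
;; and read that node's tag
(def em-tag
  (-> (zip/xml-zip node)
      zip/down
      zip/right
      zip/node
      :tag))
```

Here `em-tag` ends up as `:em`, because the first child is the text node and its right sibling is the nested element.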

I think it can convert to hiccup as well, or maybe that’s a helper function I have in my code base, I’d have to check.

Just throwing in my 2 cents:

I always use JSoup. It’s a straightforward Java library and I never feel the need to wrap it in Clojure. Interop is fine. It’s also very fast and the syntax is about 95% compatible with jQuery. We actually had something that would run the same selectors in the browser and in JSoup. JSoup parses very similarly to how browsers parse.
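To give an idea of what the unwrapped interop looks like, here is a minimal sketch. It assumes jsoup is on the classpath; `select-text` is just an illustrative helper name, not something from a library.

```clojure
(import '[org.jsoup Jsoup])

(defn select-text
  "Parse html and return the combined text of all elements matching css."
  [^String html ^String css]
  (-> (Jsoup/parse html)
      (.select css)
      .text))

(select-text "<div class='hello'>Hello, <em>world</em></div>" ".hello em")
```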

We originally started with Enlive but found it too slow for what we wanted to do (parsing millions of pages). The Enlive queries loop through a giant tree of immutable data structures each time. For small stuff (like template snippets), it’s fine. But big pages with complex queries would slow it down.

Probably worth mentioning here that Hickory builds on top of JSoup. :+1:

Poor performance was one of the things that put me off Enlive as well.

I’m currently refactoring a metric ton of parsing code and I’m gonna spike out converting from Enlive to Hickory. My gut says I’ll be able to craft a function that can take my Enlive selectors, and for the most part compose a Hickory select from that.
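A rough sketch of the kind of translation function that idea implies. Both helpers are hypothetical and only cover the simplest Enlive steps (`:.class`, `:#id`, `:tag`) chained as descendants; compound steps like `:div.hello` and custom predicates are not handled.

```clojure
(require '[clojure.string :as str]
         '[hickory.core :as hc]
         '[hickory.select :as s])

(defn step->selector
  "Hypothetical helper: translate one simple Enlive-style step
   (:.class, :#id, or :tag) into a Hickory selector."
  [step]
  (let [n (name step)]
    (cond
      (str/starts-with? n ".") (s/class (subs n 1))
      (str/starts-with? n "#") (s/id (subs n 1))
      :else                    (s/tag n))))

(defn enlive->hickory
  "Hypothetical helper: treat the vector as a chain of descendant steps."
  [steps]
  (apply s/descendant (map step->selector steps)))

(def doc
  (-> "<div class='hello'>Hello, <em>world</em></div>"
      hc/parse
      hc/as-hickory))

(s/select (enlive->hickory [:.hello :em]) doc)
```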

My biggest challenge is actually refactoring the parsing config (all data structures) into a format that is suitable for clojure.spec, which means I need to express it differently than what Hickory expects… No selectors are hard-coded anywhere in my parsing code.

Anyhoo, will report back :slight_smile:

I preprocessed some HTML with http://jtidy.sourceforge.net at some point. It can transform (most) HTML into well-formed XML, which opens up the landscape for querying a bit (xpath, zippers etc).

It took some trial and error to get jtidy to consume HTML that isn’t well-formed. I ended up doing something like this (with [net.sf.jtidy/jtidy "r938"] as a dependency):

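    (defn html->xml
      "Tidy possibly malformed HTML into well-formed XML using JTidy."
      [html]
      (let [os (java.io.ByteArrayOutputStream.)]
        (doto (org.w3c.tidy.Tidy.)
          (.setShowWarnings false)
          (.setXmlOut true)
          ;; emit output even when the input contains errors
          (.setForceOutput true)
          ;; use one explicit encoding end to end instead of the platform default
          (.setInputEncoding "UTF-8")
          (.setOutputEncoding "UTF-8")
          (.parse (java.io.ByteArrayInputStream. (.getBytes html "UTF-8")) os))
        (.toString os "UTF-8")))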
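Once the markup is well-formed XML, the JDK's own XPath support can query it. A minimal sketch, using only `javax.xml` classes; `xpath-string` is an illustrative name, and the example assumes XML without a DOCTYPE or namespaces, which real tidied output may well have.

```clojure
(import '[java.io ByteArrayInputStream]
        '[javax.xml.parsers DocumentBuilderFactory]
        '[javax.xml.xpath XPathFactory])

(defn xpath-string
  "Evaluate an XPath expression against an XML string and return
   the string value of the result."
  [^String xml ^String expr]
  (let [doc   (-> (DocumentBuilderFactory/newInstance)
                  .newDocumentBuilder
                  (.parse (ByteArrayInputStream. (.getBytes xml "UTF-8"))))
        xpath (.newXPath (XPathFactory/newInstance))]
    (.evaluate xpath expr doc)))

(xpath-string "<div class=\"hello\">Hello, <em>world</em></div>" "//em")
```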

So I came out of the rabbit hole a few days ago, and I must say that just using plain old JSoup, as @ericnormand suggested, won hands down. It is fast, familiar, and supports a wide range of selectors that I believe people are used to using for expressing these kinds of requirements.

Hickory got awkward very quickly, and I already had a ton of (poorly) defined rules so I needed to write a ton of functions for converting from my CSS-esque syntax to something I could combine for Hickory. These worked well with Enlive, and I had plenty of custom predicates used with Enlive.

With JSoup, though, most of my predicates also just fell away because it supports them directly.

I’m really happy with the result, and the facade over JSoup is still super thin. I ended up making a little protocol that helps a lot too:

(ns foo
  (:import [org.jsoup Jsoup]
           [org.jsoup.nodes Element Node]
           [org.jsoup.select Elements]))

(defprotocol Selectable
  "Protocol for selecting data from DOM-like data structures"
  (attr [_ k] "Return the value of the provided HTML attribute")
  (text [_] "Return the text for this element and all child elements")
  (own-text [_] "Return the text for just this element"))

(extend-type Elements
  Selectable
  (attr [this k]
    (.attr this (name k)))

  (text [this]
    (.text this))

  (own-text [this]
    (.text this)))

(extend-type Element
  Selectable
  (attr [this k]
    (.attr this (name k)))

  (text [this]
    (.text this))

  (own-text [this]
    (.ownText this)))

That basically solved the entire puzzle for me, at this stage of the match.

As for the rest, all my parsing/extracting code is now defined with Clojure spec, the functions are instrumented and it is dead easy to catch a bug in the data structures when they appear. I feel like I’ve gone from complete parsing bankruptcy to having huge leverage in the system now :slight_smile:
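To illustrate the spec-plus-instrumentation setup, here is a minimal sketch. The rule shape (`::selector`, `::attr`) and `describe-rule` are made up for the example and are not my real config; it also uses the `clojure.spec.alpha` namespaces from Clojure 1.9+, which is an assumption about your Clojure version.

```clojure
(require '[clojure.spec.alpha :as spec]
         '[clojure.spec.test.alpha :as stest])

;; Hypothetical shape for one extraction rule
(spec/def ::selector string?)
(spec/def ::attr keyword?)
(spec/def ::rule (spec/keys :req-un [::selector] :opt-un [::attr]))

(defn describe-rule
  "Return a human-readable summary of a rule."
  [{:keys [selector attr]}]
  (if attr
    (str "attribute " (name attr) " of " selector)
    (str "text of " selector)))

(spec/fdef describe-rule
  :args (spec/cat :rule ::rule)
  :ret string?)

;; With instrumentation on, a malformed rule throws at the call site
(stest/instrument `describe-rule)

(describe-rule {:selector ".hello em"})
```

Calling `describe-rule` with a map that lacks `:selector` then fails immediately with a spec error instead of producing a bad result further down the pipeline.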

I :heart_eyes: jsoup. Been using it in several projects.

Amusingly, not long after this thread got started I ran into a show-stopping bug in JSoup’s XML parsing (which I was using so that the codebase, which also deals with HTML, would be standardized on Hickory everywhere). Fortunately, the XML was well formed, so it was easy to switch to clojure.xml and clojure.walk.
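Roughly what the clojure.xml side of that looks like, as a sketch with made-up markup. It uses `tree-seq` for the traversal instead of clojure.walk to keep the example short, and `parse-xml-string` is just an illustrative helper.

```clojure
(require '[clojure.xml :as xml])

(defn parse-xml-string
  "Parse a well-formed XML string into clojure.xml's nested maps."
  [^String s]
  (xml/parse (java.io.ByteArrayInputStream. (.getBytes s "UTF-8"))))

(def tree
  (parse-xml-string
   "<doc><title><phrase id=\"p1\">Some text</phrase></title></doc>"))

;; Collect the tag of every element, in document order
(def tags
  (->> (tree-seq map? :content tree)
       (filter map?)
       (map :tag)))
```

With this input `tags` contains `:doc`, `:title`, and `:phrase`, so the element nested under the title tag survives as an element.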

Now I’m curious: what well-formed XML did it choke on?

It was nothing weird. Here’s a sanitized version of the offending markup:

<?xml version="1.0" encoding="UTF-8"?>
<secret-tag FIRST-ATTR="en-US00001" attr2="XYY-Y000" attr3="XY1234567" xml:lang="en-US">
  <title>
    <other-tag category="" id="an ID" phrase-urn="urn:secret-stuff">Some text about some things</other-tag>
  </title>
</secret-tag>

It turns out that JSoup has some HTML-specific hacks around title tags that cause all tags nested under any title tag to be converted into a single HTML-escaped text node, which was… unexpected.
