Best library for querying HTML?

kennethkalmer · January 16, 2018, 7:36pm

So I came out of the rabbit hole a few days ago, and I must say that just using plain old JSoup, as @ericnormand suggested, won hands down. It is fast, familiar, and supports a wide range of selectors that I believe people are already used to for expressing these kinds of requirements.

Hickory got awkward very quickly. I already had a ton of (poorly) defined rules, so I needed to write a ton of functions for converting from my CSS-esque syntax to something I could combine for Hickory. Those rules had worked well with Enlive, where I also had plenty of custom predicates.

With JSoup, though, most of my predicates also just fell away because it supports them directly.
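To illustrate what "supports them directly" can look like, here is a minimal sketch using plain interop. The HTML snippet, the namespace, and the function names are made up for the example and are not from my actual code; the selectors (`[attr^=prefix]`, `:not(...)`, `:contains(...)`) are standard jsoup selector syntax.

```clojure
(ns selector-sketch
  (:import [org.jsoup Jsoup]
           [org.jsoup.nodes Document Element]))

;; Hypothetical markup, only here to have something to query.
(def html
  (str "<ul>"
       "<li class=\"item\"><a href=\"https://example.com/a\">First</a></li>"
       "<li class=\"item sold\"><a href=\"/b\">Second</a></li>"
       "<li class=\"item\"><a href=\"https://example.com/c\">Third, new</a></li>"
       "</ul>"))

(def ^Document doc (Jsoup/parse html))

(defn absolute-links
  "hrefs of all anchors whose href starts with https (attribute-prefix selector)."
  [^Document d]
  (mapv (fn [^Element e] (.attr e "href"))
        (.select d "a[href^=https]")))

(defn available-titles
  "Link text inside items that do not carry the `sold` class (:not pseudo-selector)."
  [^Document d]
  (mapv (fn [^Element e] (.text e))
        (.select d "li.item:not(.sold) a")))

(defn titles-mentioning
  "Link text of anchors whose text contains `word` (:contains is case-insensitive)."
  [^Document d word]
  (mapv (fn [^Element e] (.text e))
        (.select d (str "a:contains(" word ")"))))
```

With Enlive, each of these would have been a custom predicate on my side; here the selector string carries the whole rule.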

I’m really happy with the result, and the facade over JSoup is still super thin. I ended up making a little protocol that helps a lot too:

(ns foo
  (:import [org.jsoup Jsoup]
           [org.jsoup.nodes Element Node]
           [org.jsoup.select Elements]))

(defprotocol Selectable
  "Protocol for selecting data from DOM-like data structures"
  (attr [_ k] "Return the value of the provided HTML attribute")
  (text [_] "Return the text for this element and all child elements")
  (own-text [_] "Return the text for just this element"))

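(extend-type Elements
  Selectable
  ;; Elements.attr returns the value from the first matched element that has the attribute
  (attr [this k]
    (.attr this (name k)))

  ;; Elements.text returns the combined text of all matched elements
  (text [this]
    (.text this))

  ;; Elements has no .ownText method, so this falls back to the combined text
  (own-text [this]
    (.text this)))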

(extend-type Element
  Selectable
  (attr [this k]
    (.attr this (name k)))

  (text [this]
    (.text this))

  (own-text [this]
    (.ownText this)))

That basically solved the entire puzzle for me, at this stage of the match.

As for the rest, all my parsing/extracting code is now defined with Clojure spec, the functions are instrumented, and it is dead easy to catch a bug in the data structures when one appears. I feel like I’ve gone from complete parsing bankruptcy to having huge leverage in the system now.
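
For anyone curious what the spec side can look like, here is a rough sketch, assuming Clojure 1.9+ with `clojure.spec.alpha`. The specs (`::title`, `::url`, `::link`) and the `link-summary` function are hypothetical stand-ins, not my real extraction code.

```clojure
(ns spec-sketch
  (:require [clojure.spec.alpha :as s]
            [clojure.spec.test.alpha :as stest]
            [clojure.string :as str]))

;; Hypothetical shape of one extracted record.
(s/def ::title (s/and string? (complement str/blank?)))
(s/def ::url (s/and string? #(str/starts-with? % "http")))
(s/def ::link (s/keys :req-un [::title ::url]))

(defn link-summary
  "Formats an extracted link; stands in for any function that consumes parsed data."
  [{:keys [title url]}]
  (str title " <" url ">"))

;; Describe the function's arguments and return value.
(s/fdef link-summary
  :args (s/cat :link ::link)
  :ret string?)

;; Instrumentation checks the :args spec on every call.
(stest/instrument `link-summary)

;; A well-formed record passes straight through.
(link-summary {:title "First" :url "https://example.com/a"})

;; A malformed record (missing :url) throws an ExceptionInfo that
;; explains which part of the spec failed, instead of failing later
;; somewhere downstream.
(try
  (link-summary {:title "First"})
  (catch clojure.lang.ExceptionInfo e
    (println (ex-message e))))
```

Note that `instrument` only validates arguments, not return values, so it is mainly a guard against bad data flowing into a function.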