Best library for querying HTML?

grav · December 17, 2017, 3:00pm

At some point I preprocessed some HTML with JTidy (http://jtidy.sourceforge.net). It can transform (most) HTML into well-formed XML, which opens up more options for querying (XPath, zippers, etc.).

It took some trial and error to get jtidy to consume HTML that isn’t well-formed. I ended up doing something like this (with [net.sf.jtidy/jtidy "r938"] as a dependency):

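    (defn html->xml
      "Runs possibly malformed HTML through JTidy and returns it as an XML string."
      [html]
      (let [os (java.io.ByteArrayOutputStream.)]
        (doto (org.w3c.tidy.Tidy.)
          (.setShowWarnings false)
          (.setXmlOut true)        ; emit well-formed XML rather than HTML
          (.setForceOutput true)   ; produce output even if Tidy reports errors
          ;; Use UTF-8 on both sides so it matches .getBytes and .toString below.
          (.setInputEncoding "UTF-8")
          (.setOutputEncoding "UTF-8")
          (.parse (java.io.ByteArrayInputStream. (.getBytes html "UTF-8")) os))
        (.toString os "UTF-8")))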
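
Once you have well-formed XML, the querying side can be quite small. Here is a rough sketch (not part of my original code) that uses only `clojure.xml` and `xml-seq` from core to pull every link target out of a document. The names `parse-xml-str` and `hrefs` are made up for illustration, and the input is a hand-written string standing in for whatever `html->xml` returns, so treat it as a starting point rather than a tested pipeline:

```clojure
(require '[clojure.xml :as xml])

(defn parse-xml-str
  "Hypothetical helper: parses a well-formed XML string into clojure.xml's
   nested map format ({:tag ... :attrs ... :content ...})."
  [s]
  (xml/parse (java.io.ByteArrayInputStream. (.getBytes s "UTF-8"))))

(defn hrefs
  "Hypothetical helper: returns the href of every <a> element, in document order."
  [xml-str]
  (->> (parse-xml-str xml-str)
       xml-seq                                ; depth-first walk over all nodes
       (filter #(= :a (:tag %)))              ; text nodes have no :tag, so they drop out
       (keep #(get-in % [:attrs :href]))))    ; skip anchors without an href

(hrefs "<html><body><a href=\"/one\">1</a><p><a href=\"/two\">2</a></p></body></html>")
;; => ("/one" "/two")
```

For anything more involved than a flat filter, the same parsed tree can be handed to `clojure.zip/xml-zip` and navigated as a zipper.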