I once preprocessed some HTML with JTidy (http://jtidy.sourceforge.net). It can transform most HTML into well-formed XML, which opens up more options for querying (XPath, zippers, etc.).
It took some trial and error to get JTidy to consume HTML that isn’t well-formed. I ended up doing something like this (with [net.sf.jtidy/jtidy "r938"] as a dependency):
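(defn html->xml
  "Runs html (a string, possibly malformed) through JTidy and returns XML."
  [html]
  (let [os (java.io.ByteArrayOutputStream.)]
    (doto (org.w3c.tidy.Tidy.)
      (.setShowWarnings false)   ; don't report warnings
      (.setXmlOut true)          ; emit XML rather than HTML
      (.setForceOutput true)     ; produce output even when errors are found
      ;; Encode explicitly so the bytes match the UTF-8 decoding below.
      (.setInputEncoding "UTF-8")
      (.setOutputEncoding "UTF-8")
      (.parse (java.io.ByteArrayInputStream. (.getBytes html "UTF-8")) os))
    (.toString os "UTF-8")))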
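Once the markup is well-formed, the querying side can be sketched roughly like this. It is only an illustration: `sample-xml` is a hand-written stand-in for whatever `html->xml` returns (not actual JTidy output), and `parse-xml-string` and `link-targets` are hypothetical helper names. It uses only `clojure.xml` and `xml-seq` from Clojure itself.

```clojure
(require '[clojure.xml :as xml])

;; Hypothetical stand-in for the string html->xml would return.
(def sample-xml
  "<html><body><a href=\"/one\">One</a><p><a href=\"/two\">Two</a></p></body></html>")

(defn parse-xml-string
  "Parses an XML string into clojure.xml's nested {:tag :attrs :content} maps."
  [s]
  (xml/parse (java.io.ByteArrayInputStream. (.getBytes s "UTF-8"))))

(defn link-targets
  "Returns the href of every <a> element, in document order."
  [root]
  (->> (xml-seq root)                      ; depth-first walk over all nodes
       (filter #(= :a (:tag %)))           ; keep only <a> elements
       (map #(get-in % [:attrs :href]))))  ; pull out the href attribute

(link-targets (parse-xml-string sample-xml))
;; => ("/one" "/two")
```

The same parsed tree can be handed to `clojure.zip/xml-zip` if you would rather navigate it with zippers.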