Best library for querying HTML?


Now I’m curious: what well-formed XML did it choke on?


It was nothing weird. Here’s a sanitized version of the offending markup:

<?xml version="1.0" encoding="UTF-8"?>
<secret-tag FIRST-ATTR="en-US00001" attr2="XYY-Y000" attr3="XY1234567" xml:lang="en-US">
    <other-tag category="" id="an ID" phrase-urn="urn:secret-stuff">Some text about some things</other-tag>

It turns out that JSoup has some HTML-specific hacks around title tags that cause everything nested under a title tag to be collapsed into a single HTML-escaped text node, which was… unexpected.
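
For anyone who hits the same thing, here is a rough sketch of the kind of check that makes the difference visible. It does not use JSoup at all: it parses with the JDK's built-in DOM parser, which has no HTML-specific rules, so child elements under a title tag stay elements. The sample document, the class name and the method name are all made up for illustration and are not the real markup. JSoup also has an XML mode (`Parser.xmlParser()`) that may avoid the problem, but that is not shown or verified here.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class TitleChildrenDemo {
    // Hypothetical sample (not the original markup): a <title> with nested child elements.
    static final String SAMPLE =
        "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
        + "<doc><title>"
        + "<phrase id=\"a\">Some text</phrase>"
        + "<phrase id=\"b\">about some things</phrase>"
        + "</title></doc>";

    // Counts the element children of the first <title> element in the document.
    static int countElementChildrenOfTitle(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            Element title = (Element) doc.getElementsByTagName("title").item(0);
            NodeList children = title.getChildNodes();
            int count = 0;
            for (int i = 0; i < children.getLength(); i++) {
                // Only count real elements, not text or comment nodes.
                if (children.item(i).getNodeType() == Node.ELEMENT_NODE) {
                    count++;
                }
            }
            return count;
        } catch (Exception e) {
            // Wrap parser and I/O errors so callers do not need checked exceptions.
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(countElementChildrenOfTitle(SAMPLE));
    }
}
```

With an XML-aware parser the count is the number of nested elements. If a parser reports zero element children and a single text node instead, it is probably applying HTML title rules to the input.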