Best library for querying HTML?

Now I’m curious, what well-formed XML did it choke on?

It was nothing weird. Here’s a sanitized version of the offending markup:

<?xml version="1.0" encoding="UTF-8"?>
<secret-tag FIRST-ATTR="en-US00001" attr2="XYY-Y000" attr3="XY1234567" xml:lang="en-US">
  <title>
    <other-tag category="" id="an ID" phrase-urn="urn:secret-stuff">Some text about some things</other-tag>
  </title>
</secret-tag>

It turns out that JSoup has some HTML-specific hacks around title tags that cause every tag nested under a title tag to be converted into a single HTML-escaped text node, which was… unexpected.
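
For comparison, here is a rough sketch of what I expected to get back. It does not use JSoup at all: it feeds the same sanitized markup to the JDK's built-in DOM parser, which has no HTML-specific rules for title, and checks that the nested element comes back as a real element. The class name TitleChildCheck and the helper describeTitleChild are made up for illustration, and the markup assumes the inner tag is closed with its own name.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

// Hypothetical class name, used only for this illustration.
public class TitleChildCheck {

    // The sanitized markup from above (assumes the inner tag closes as </other-tag>).
    static final String XML =
        "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
        + "<secret-tag FIRST-ATTR=\"en-US00001\" attr2=\"XYY-Y000\" attr3=\"XY1234567\" xml:lang=\"en-US\">\n"
        + "  <title>\n"
        + "    <other-tag category=\"\" id=\"an ID\" phrase-urn=\"urn:secret-stuff\">Some text about some things</other-tag>\n"
        + "  </title>\n"
        + "</secret-tag>\n";

    // Parses the string as plain XML and describes the first <other-tag> found under <title>.
    static String describeTitleChild(String xml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xml)));

        // A generic XML parser treats <title> like any other element.
        Element title = (Element) doc.getElementsByTagName("title").item(0);
        Element child = (Element) title.getElementsByTagName("other-tag").item(0);

        return child.getTagName() + ": " + child.getTextContent();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(describeTitleChild(XML));
    }
}
```

With a parser that follows XML rules, other-tag stays an element with its attributes intact, rather than being collapsed into escaped text inside title.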
