Now I’m curious: what well-formed XML did it choke on?
It was nothing weird. Here’s a sanitized version of the offending markup:
<?xml version="1.0" encoding="UTF-8"?>
<secret-tag FIRST-ATTR="en-US00001" attr2="XYY-Y000" attr3="XY1234567" xml:lang="en-US">
<title>
<other-tag category="" id="an ID" phrase-urn="urn:secret-stuff">Some text about some things</other-tag>
</title>
</secret-tag>
It turns out that JSoup has some HTML-specific hacks around title tags that cause every tag nested under a title tag to be converted into a single HTML-escaped text node, which was… unexpected.
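For anyone who wants to confirm that the markup itself is fine, here is a minimal sketch that feeds a trimmed copy of it to the JDK's built-in XML parser and counts the element children of title. The class name `TitleChildrenCheck` and the helper `countElementChildren` are made up for illustration, and this uses the standard `javax.xml.parsers` API rather than JSoup. JSoup does have an XML parser mode (`Parser.xmlParser()`), which I would expect to avoid the title special-casing, but I have not verified that here.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical helper class, not part of JSoup or of the original code.
class TitleChildrenCheck {
    // Trimmed copy of the sanitized markup from the post.
    static final String XML =
        "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
        + "<secret-tag FIRST-ATTR=\"en-US00001\" xml:lang=\"en-US\">\n"
        + "<title>\n"
        + "<other-tag id=\"an ID\">Some text about some things</other-tag>\n"
        + "</title>\n"
        + "</secret-tag>";

    // Parses the XML and counts the element (not text) children of the
    // first element named tagName.
    static int countElementChildren(String xml, String tagName) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            Element parent = (Element) doc.getElementsByTagName(tagName).item(0);
            NodeList kids = parent.getChildNodes();
            int count = 0;
            for (int i = 0; i < kids.getLength(); i++) {
                // Skip the whitespace text nodes around the nested tag.
                if (kids.item(i).getNodeType() == Node.ELEMENT_NODE) {
                    count++;
                }
            }
            return count;
        } catch (Exception e) {
            throw new IllegalStateException("XML did not parse", e);
        }
    }

    public static void main(String[] args) {
        // A real XML parser keeps other-tag as an element under title.
        System.out.println(countElementChildren(XML, "title")); // prints 1
    }
}
```

If an HTML-oriented parser reports zero element children for the same input, the nested tags have most likely been folded into the title text, as described above.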