How do you scrape pages without a valid content-type?

Java itself seems to be having issues here, in ways that aren't a problem with curl, JavaScript, or other languages. Here's the example:

scraper.npr> (defn comic-titles [n]
  (let [dom (html/html-resource
             (java.net.URL. ""))
        title-nodes (html/select dom [:#middleContainer :a])
        titles (map html/text title-nodes)]
    (take n titles)))
scraper.npr> (comic-titles 5)
UnknownServiceException no content-type (...)

This is the documented example from enlive. As you'll find, the problem is that the target URL's response doesn't include a Content-Type header, which breaks java.net.URL.getContent(). You'll note that this works on compliant pages that do send the header.
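To confirm the failure sits in the JDK rather than in enlive, you can call getContent directly. A sketch (the URL here is a hypothetical stand-in for a server that omits the header):

```clojure
;; Sketch: reproduce the failure without enlive involved.
;; URLConnection picks a content handler from the Content-Type
;; header; when the header is missing, it has nothing to dispatch
;; on and throws UnknownServiceException "no content-type".
(.getContent (java.net.URL. "http://some-host-without-content-type/"))
```

If this line alone throws the same exception, enlive is off the hook.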

This is a really stupid error and doesn't seem to be a problem in any non-Java language I know of; I just want the HTML! How have folks gotten around this?
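One workaround is to open the stream yourself, so Java never has to guess a content handler from the missing header. This is a sketch, assuming enlive's html-resource accepts a java.io.InputStream (it dispatches on the input type, and streams are among the supported kinds):

```clojure
(require '[net.cgrand.enlive-html :as html])

;; Open the connection ourselves and hand enlive the raw stream,
;; bypassing java.net.URL.getContent() and its Content-Type check.
(defn fetch-dom [url-str]
  (with-open [in (.openStream (java.net.URL. url-str))]
    (html/html-resource in)))
```

html-resource parses eagerly, so the stream can safely be closed by with-open once it returns.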


It seems Java has difficulty following redirects from http to https; if you change the URL to https, it works.
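For reference, HttpURLConnection will follow redirects within a protocol but refuses to cross from http to https, so the http URL hands back the 30x response itself (which may carry no Content-Type). A hypothetical sketch that chases the redirect chain by hand to find the final URL:

```clojure
;; Sketch: resolve cross-protocol redirects manually, since
;; HttpURLConnection won't hop from http to https on its own.
(defn resolve-url [url-str]
  (let [conn (doto (.openConnection (java.net.URL. url-str))
               (.setInstanceFollowRedirects false))
        code (.getResponseCode conn)]
    (if (#{301 302 303 307 308} code)
      (recur (.getHeaderField conn "Location"))
      url-str)))
```

You'd then feed the resolved URL to html-resource as usual.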


Whoa. Nice find! I wish it had occurred to me to check that scenario a few hours earlier.

It’s a bit of an odd one; it looks like a Java bug to me. I kind of accidentally stumbled upon this workaround.
