How do you scrape pages without a valid content-type?


Java itself seems to be having issues here, in ways that aren’t a problem with curl, JavaScript, or other languages. Here’s the example:

scraper.npr> (defn comic-titles [n]
  (let [dom (html/html-resource
             (java.net.URL. ""))
        title-nodes (html/select dom [:#middleContainer :a])
        titles (map html/text title-nodes)]
    (take n titles)))
scraper.npr> (comic-titles 5)
UnknownServiceException no content-type (

This is the documented example from Enlive. As you’ll find, the problem is that the response for the target URL doesn’t include a Content-Type header, which breaks `java.net.URL.getContent()`. You’ll note that this works on compliant pages like

This is a really stupid error and doesn’t seem to be a problem in any non-Java languages I know of; I just want the HTML! How have folks got around this?
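One workaround, sketched here in plain Java since that’s where the failure lives (the `fetch` helper name is mine): read the body from `URLConnection.getInputStream()` yourself instead of going through `getContent()`. It’s `getContent()` that consults the Content-Type header and throws `UnknownServiceException` when the header is absent; `getInputStream()` doesn’t care. You can then hand the resulting string (via a reader) to Enlive’s `html-resource`.

```java
import java.io.InputStream;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;

public class FetchHtml {
    // Read the raw response body. Unlike URL.getContent(), this never
    // inspects the Content-Type header, so a missing header is harmless.
    static String fetch(String address) throws Exception {
        URLConnection conn = new URL(address).openConnection();
        try (InputStream in = conn.getInputStream()) {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        }
    }
}
```

From Clojure the equivalent is roughly `(slurp url)` piped into a `java.io.StringReader`, which `html-resource` also accepts.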


It seems Java has difficulty following redirects from http to https; if you change the URL to https, it works.
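That matches documented behaviour: `HttpURLConnection` deliberately refuses to follow a redirect that changes protocol (http to https), so you end up reading the redirect response itself, which has no body worth parsing. If you can’t simply rewrite the URL, one common workaround is to follow `Location` headers by hand; a rough sketch (the `open` and `resolve` names are mine, and the redirect cap is arbitrary):

```java
import java.net.HttpURLConnection;
import java.net.URL;

public class FollowRedirects {
    // Resolve a Location header (absolute or relative) against the
    // URL we just requested.
    static String resolve(String base, String location) throws Exception {
        return new URL(new URL(base), location).toString();
    }

    // Follow redirects manually, including http -> https hops that
    // HttpURLConnection's built-in redirect handling refuses to take.
    static HttpURLConnection open(String address) throws Exception {
        String current = address;
        for (int i = 0; i < 5; i++) {           // cap the redirect chain
            HttpURLConnection conn =
                (HttpURLConnection) new URL(current).openConnection();
            conn.setInstanceFollowRedirects(false);
            int code = conn.getResponseCode();
            if (code < 300 || code > 399) return conn;
            current = resolve(current, conn.getHeaderField("Location"));
        }
        throw new IllegalStateException("too many redirects");
    }
}
```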


Whoa. Nice find! I wish it had occurred to me to check that scenario a few hours earlier.


It’s a bit of an odd one; it looks like a Java bug to me. I kind of accidentally stumbled upon this workaround.