How do you scrape pages without a valid content-type?


#1

Java itself seems to be having issues here, in ways that aren’t a problem with curl, JavaScript, or other languages. Here’s the example:

scraper.npr> (defn comic-titles
  [n]
  (let [dom (html/html-resource
             (java.net.URL. "http://xkcd.com/archive"))
        title-nodes (html/select dom [:#middleContainer :a])
        titles (map html/text title-nodes)]
    (take n titles)))
#'scraper.npr/comic-titles
scraper.npr> (comic-titles 5)
UnknownServiceException no content-type  java.net.URLConnection.getContentHandler (URLConnection.java:1241)

This is the documented Enlive example from the Clojure Cookbook, https://github.com/clojure-cookbook/clojure-cookbook/blob/master/07_webapps/7-11_enlive.asciidoc. As you’ll see from the exception, the problem is that the response for the target URL doesn’t include a Content-Type header, which breaks java.net.URL.getContent(). Note that this works on compliant pages like google.com.

This is a really stupid error and doesn’t seem to be a problem in any non-Java language I know of; I just want the HTML! How have folks gotten around this?


#2

It seems Java has difficulty following the redirect from http to https; if you change the URL to https, it works.
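For anyone who wants the workaround spelled out, here is a minimal sketch. It assumes the site still redirects http to https, and that the 301 response is what arrives without a Content-Type. As far as I know, java.net.HttpURLConnection does not follow redirects that switch protocols, which would explain the error. `force-https` is a hypothetical helper name of my own, not part of Enlive or the cookbook.

```clojure
(require '[clojure.string :as str])

;; Hypothetical helper (not part of Enlive): rewrite a leading http://
;; scheme to https:// so that no cross-protocol redirect is needed.
;; URLs that already use https are returned unchanged.
(defn force-https
  [url]
  (str/replace-first url #"(?i)^http://" "https://"))

;; Usage with the function from the question. This is wrapped in `comment`
;; because it needs Enlive on the classpath (aliased as `html`) and
;; network access.
(comment
  (defn comic-titles
    [n]
    (let [dom (html/html-resource
               (java.net.URL. (force-https "http://xkcd.com/archive")))
          title-nodes (html/select dom [:#middleContainer :a])
          titles (map html/text title-nodes)]
      (take n titles))))
```

Writing the https URL directly in the source does the same thing; the helper is only worth having if the URLs come from somewhere else.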


#3

Whoa. Nice find! I wish it had occurred to me to check that scenario a few hours earlier.


#4

It’s a bit of an odd one, and it looks like a Java bug to me. I kind of stumbled upon this workaround by accident.


#5