Java itself seems to be the problem here, in a way that doesn't come up with curl, JavaScript, or other languages. Here's the example:
scraper.npr> (defn comic-titles
               [n]
               (let [dom (html/html-resource
                           (java.net.URL. "http://xkcd.com/archive"))
                     title-nodes (html/select dom [:#middleContainer :a])
                     titles (map html/text title-nodes)]
                 (take n titles)))
#'scraper.npr/comic-titles
scraper.npr> (comic-titles 5)
UnknownServiceException no content-type java.net.URLConnection.getContentHandler (URLConnection.java:1241)
This is the documented enlive example from the Clojure Cookbook: https://github.com/clojure-cookbook/clojure-cookbook/blob/master/07_webapps/7-11_enlive.asciidoc. As far as I can tell, the problem is that the response for the target URL doesn't include a Content-Type header, which breaks java.net.URL.getContent(). The same code works on pages that do send that header, like google.com.
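To rule out enlive itself, here is a minimal sketch that tries to reproduce the failure with plain Java interop. The namespace `scraper.repro` and the helpers `start-bare-server` and `fetch-both-ways` are names I made up for illustration. It assumes the JDK's built-in `com.sun.net.httpserver.HttpServer` is available and that it does not add a Content-Type header on its own. The idea is to serve a page locally with no Content-Type, then compare `.getContent` against reading the raw stream with `.openStream`.

```clojure
(ns scraper.repro
  (:import (com.sun.net.httpserver HttpServer HttpHandler)
           (java.net InetSocketAddress URL UnknownServiceException)))

(defn start-bare-server
  "Hypothetical helper: starts a throwaway local HTTP server on a free port.
  The handler never sets a Content-Type header (assumption: the JDK server
  does not add one by default)."
  ^HttpServer []
  (let [server (HttpServer/create (InetSocketAddress. "127.0.0.1" 0) 0)
        body   (.getBytes "<html><body>hello</body></html>" "UTF-8")]
    (.createContext server "/"
                    (reify HttpHandler
                      (handle [_ exchange]
                        ;; Send status and length only, then the body.
                        (.sendResponseHeaders exchange 200 (alength body))
                        (with-open [out (.getResponseBody exchange)]
                          (.write out body)))))
    (.start server)
    server))

(defn fetch-both-ways
  "Hypothetical helper: returns a two-element vector.
  First element: the result of .getContent, or the exception message if it
  throws UnknownServiceException.
  Second element: the response body read via .openStream."
  [^URL url]
  [(try
     (.getContent url)
     (catch UnknownServiceException e
       (.getMessage e)))
   (with-open [in (.openStream url)]
     (slurp in))])

;; Usage: start the server, fetch the same URL both ways, then shut down.
(let [^HttpServer server (start-bare-server)
      port (.getPort (.getAddress server))
      url  (URL. (str "http://127.0.0.1:" port "/"))]
  (try
    (println (fetch-both-ways url))
    (finally
      (.stop server 0))))
```

If the assumptions hold, the first element should be the same "no content-type" message as in the stack trace above, while the second is the HTML body, which would suggest the body is perfectly readable and only the content-handler lookup fails.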
This is a really stupid error, and it doesn't seem to be a problem in any non-Java language I know of; I just want the HTML! How have folks got around this?