How to scrape a page that suffers from a circular redirect?

If I try to (client/get …) this page with clj-http I get a circular redirect error:
http://www.coloring-book.info/coloring/coloring_page.php?id=29

Likewise, if I just (slurp …) it, I get the error “Server redirected too many times (20)”.

But if I visit the page in my browser (Chrome or Safari), it just works. What’s going on, and how can I slurp the version of the page that I see in my browser?

Have you tried https://github.com/nathell/skyscraper ?

I guess you could configure it properly to follow / not follow redirects, or keep a registry of visited links.

It looks like the page redirects to itself with an added random query parameter and sets a cookie to match, so it probably requires you to send the cookie back with the second request for the (redirected) URL. You may need to tell clj-http not to follow redirects and make two requests manually: the first to get the redirect URL and the cookie, and the second to send the cookie along to that new URL.
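
Something like this might work as a starting point. It is an untested sketch that assumes clj-http 3.x, where `:redirect-strategy :none` turns off automatic redirects (older versions used `:follow-redirects false`), and it assumes the site only needs the cookie from the first response. The namespace and the names `resolve-location` and `fetch-with-cookie` are made up for the example.

```clojure
(ns scrape.redirect
  (:require [clj-http.client :as client])
  (:import (java.net URI)))

(defn resolve-location
  "Resolves a possibly relative Location header against the URL that was requested."
  [base-url location]
  (str (.resolve (URI. ^String base-url) ^String location)))

(defn fetch-with-cookie
  "Makes the two requests by hand: the first collects the redirect target and
  the cookies, the second sends those cookies to the redirect target."
  [url]
  (let [;; Assumption: clj-http 3.x; :none means redirects are not followed.
        first-resp (client/get url {:redirect-strategy :none})
        location   (get-in first-resp [:headers "location"])]
    (if (and location (<= 300 (:status first-resp) 399))
      ;; clj-http parses Set-Cookie headers into the :cookies map of the
      ;; response; the same map can be passed back as a request option.
      (:body (client/get (resolve-location url location)
                         {:cookies (:cookies first-resp)}))
      ;; No redirect happened, so the first body is already the page.
      (:body first-resp))))
```

Then `(fetch-with-cookie "http://www.coloring-book.info/coloring/coloring_page.php?id=29")` should return the HTML as a string if my guess about the cookie is right. If the second request redirects again, you may need to repeat the same step in a loop.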

it probably requires you to send the cookie back with the second request for the (redirected) URL

I see, thanks!
