How to scrape a page that suffers from a circular redirect?

tobyloxy · August 27, 2020, 9:42am

If I try to (client/get …) this page with clj-http I get a circular redirect error:
http://www.coloring-book.info/coloring/coloring_page.php?id=29

Likewise if I try to just (slurp …) it then I get an error “Server redirected too many times (20)”.

But if I visit the page in my browser (Chrome or Safari), it just works. What’s going on, and how can I slurp the version of the page that I see in my browser?

tomekw · August 27, 2020, 10:35am

Have you tried https://github.com/nathell/skyscraper ?

I guess you could configure it properly to follow / not follow redirects, or keep a registry of visited links.

seancorfield · August 27, 2020, 5:12pm

It looks like the page redirects to itself with an added random query parameter and sets a matching cookie, so it probably requires you to send the cookie back with the second request for the (redirected) URL. You may need to tell clj-http not to follow redirects and make two requests manually: the first to get the redirect URL and the cookie, and the second to send the cookie along to that new URL.
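
A minimal sketch of that two-request approach, assuming clj-http 3.x (where `:redirect-strategy :none` turns off automatic redirects; older versions use `:follow-redirects false` instead). The namespace and the helper names `resolve-location` and `fetch-with-cookie` are made up for illustration, and it is untested against this particular site, so the cookie handling may need adjusting:

```clojure
(ns scrape.redirect
  (:require [clj-http.client :as client])
  (:import (java.net URI)))

(def start-url
  "http://www.coloring-book.info/coloring/coloring_page.php?id=29")

(defn resolve-location
  "Resolve a Location header (absolute or relative) against the requested URL."
  [base location]
  (str (.resolve (URI. ^String base) ^String location)))

(defn fetch-with-cookie
  "Request `url` without following redirects. If the server answers with a
  redirect, request the redirect target once more, sending back whatever
  cookies the first response set. Returns the last response received."
  [url]
  (let [first-resp (client/get url {:redirect-strategy :none})
        location   (get-in first-resp [:headers "location"])]
    (if location
      (client/get (resolve-location url location)
                  {:redirect-strategy :none
                   ;; Assumption: the server only wants its own cookie echoed back.
                   :cookies (:cookies first-resp)})
      first-resp)))
```

Calling `(:body (fetch-with-cookie start-url))` should then give the page HTML if the guess about the cookie is right. If the second response is still a redirect, check its `:status` and `:headers` to see what else the server expects.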

tobyloxy · August 28, 2020, 1:31am

> it probably requires you to send the cookie back with the second request for the (redirected) URL

I see, thanks!
