Does Clojure have a Scrapy-like framework for web scraping?

stardiviner · March 10, 2019, 11:14am

Does Clojure or Java have a Scrapy-like framework for web scraping? (Scrapy itself is written in Python.) I’m currently learning web scraping and can already write a basic crawler, but I want something better and more functional. I know Scrapy is great, but I want to do this in Clojure, hence the question.

If you have any suggestions, please let me know. Thanks in advance.

I would like the framework to be powerful and extensible, so that I can add new middleware or similar concepts, and easy to run in a distributed and parallel fashion.
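
To make the "middleware" idea concrete, here is a minimal sketch of a Scrapy-style downloader middleware chain in plain Java (16+, for records). It only illustrates the concept and is not the API of Scrapy or of any library mentioned in this thread. Every name in it (`MiddlewareSketch`, `Request`, `Response`, `userAgent`, `retry`, `wrap`) is hypothetical, and the downloader is a fake that fails once and then succeeds.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.function.UnaryOperator;

// Hypothetical sketch: a handler maps a Request to a Response, and a
// middleware wraps one handler to produce another.
final class MiddlewareSketch {
    record Request(String url, Map<String, String> headers) {}
    record Response(String url, int status, String body) {}

    // Adds a User-Agent header before the request reaches the inner handler.
    static UnaryOperator<Function<Request, Response>> userAgent(String agent) {
        return next -> req -> {
            Map<String, String> headers = new HashMap<>(req.headers());
            headers.put("User-Agent", agent);
            return next.apply(new Request(req.url(), headers));
        };
    }

    // Calls the inner handler again while the status is 5xx,
    // making at most maxAttempts calls in total.
    static UnaryOperator<Function<Request, Response>> retry(int maxAttempts) {
        return next -> req -> {
            Response res = next.apply(req);
            for (int attempt = 1; attempt < maxAttempts && res.status() >= 500; attempt++) {
                res = next.apply(req);
            }
            return res;
        };
    }

    // Wraps the handler so that the first middleware in the list is outermost.
    static Function<Request, Response> wrap(
            Function<Request, Response> handler,
            List<UnaryOperator<Function<Request, Response>>> middleware) {
        Function<Request, Response> wrapped = handler;
        for (int i = middleware.size() - 1; i >= 0; i--) {
            wrapped = middleware.get(i).apply(wrapped);
        }
        return wrapped;
    }

    public static void main(String[] args) {
        // Fake downloader (an assumption for the demo): 503 on the first call, 200 afterwards.
        int[] calls = {0};
        Function<Request, Response> fake = req -> new Response(
                req.url(),
                calls[0]++ == 0 ? 503 : 200,
                "UA=" + req.headers().get("User-Agent"));

        Function<Request, Response> client =
                wrap(fake, List.of(userAgent("sketch-bot"), retry(3)));
        Response res = client.apply(new Request("https://example.com/", Map.of()));
        System.out.println(res.status() + " " + res.body()); // prints "200 UA=sketch-bot"
    }
}
```

Because each middleware is just a function from handler to handler, adding a new concern (throttling, caching, logging) means adding one more element to the list. The same shape is what clj-http uses for its own request middleware on the Clojure side.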

dimovich · March 10, 2019, 11:41am

I’ve been using JSoup (see this post) combined with clj-http, mixed with core.async + transducers. It works wonders.

borkdude · March 10, 2019, 1:49pm

I’ve been using the same, except that instead of clj-http + core.async I use aleph + manifold.
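
Both setups above share one shape: fetch pages concurrently, then parse each body as it arrives. Below is a minimal sketch of that shape using only the Java standard library. `CompletableFuture` stands in for manifold deferreds or core.async channels, and a crude regex stands in for JSoup, so this is not how either poster's code actually looks. The names `CrawlPipelineSketch`, `extractLinks`, and `crawl` are hypothetical, and the fetch function is passed in so the example runs without network access.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class CrawlPipelineSketch {
    // Crude href extractor for the demo; a real crawler would use an
    // HTML parser such as JSoup instead of a regex.
    private static final Pattern HREF = Pattern.compile("href=\"([^\"]+)\"");

    static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            links.add(m.group(1));
        }
        return links;
    }

    // Fetches every URL on a fixed-size thread pool, parses each body once its
    // fetch completes, and returns the extracted links in input order.
    static List<List<String>> crawl(List<String> urls,
                                    Function<String, String> fetch,
                                    int parallelism) {
        ExecutorService pool = Executors.newFixedThreadPool(parallelism);
        try {
            List<CompletableFuture<List<String>>> pending = new ArrayList<>();
            for (String url : urls) {
                pending.add(CompletableFuture
                        .supplyAsync(() -> fetch.apply(url), pool)       // fetch stage
                        .thenApply(CrawlPipelineSketch::extractLinks));  // parse stage
            }
            List<List<String>> results = new ArrayList<>();
            for (CompletableFuture<List<String>> f : pending) {
                results.add(f.join()); // blocks until that page has been parsed
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        // Fake fetcher (an assumption for the demo): returns a page with one link.
        Function<String, String> fakeFetch =
                url -> "<a href=\"" + url + "/next\">next</a>";
        System.out.println(crawl(
                List.of("https://example.com/a", "https://example.com/b"),
                fakeFetch, 2));
        // prints [[https://example.com/a/next], [https://example.com/b/next]]
    }
}
```

The pool size caps how many fetches run at once, which is roughly the role the parallelism argument plays in core.async's `pipeline-blocking`.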

stardiviner · March 10, 2019, 3:05pm

Thanks, I used clj-http in my previous simple crawler too. I tried to use core.async in my code, but I’m not good at Clojure yet. I have checked out manifold; it’s great and simple. I like the aleph + manifold solution. Once I get better at Clojure, I will check out core.async + transducers. Thank you both.

What if I want a medium-scale, distributed solution? What do you suggest: Redis, MongoDB, or something else? And what about support on the Clojure side?

I looked at Java projects and found Nutch, but I don’t know Java, and after reading its wiki and documentation I found Nutch hard to get started with. Does anybody know how to use Nutch from Clojure? A simple example would be great. Installation is also hard for me: I don’t know which way to install it. Maven? Docker? Something else?

didibus · March 11, 2019, 1:37am

There’s also crawler4j and Heritrix3 (used by the Internet Archive). But JSoup is very good as well.

stardiviner · March 11, 2019, 5:10am

Thanks didibus.

After a closer look at Nutch and Heritrix3, I found that both are mainly web indexers, used to scrape web pages for a search engine. Of course they can be used as crawlers too, but they are not a very good fit for that.

didibus · March 11, 2019, 5:43am

Hum, can you be more specific about what you mean? I’m confused about the difference you’re drawing between scraping and crawling.

What’s your end goal/use case?

stardiviner · March 11, 2019, 9:49am

Sorry for my confusing words.

My end goal is to find a Clojure/Java library like Python’s Scrapy.

Scraping and crawling are almost the same, I think. The real difference is between a web indexer and a web crawler.

After a closer look at Nutch and Heritrix3, I found that both are mainly web indexers, used to scrape web pages for a search engine. Of course they can be used as crawlers too, but they are not a very good fit for that.

You can check the Nutch documentation: it feeds scraped web pages to a search engine such as Apache Solr. I don’t need that. I want to save data parsed from web pages to a database and do some data science with it.

Maybe I can use part of Nutch. I will take a closer look at how it works.

madbonkey · March 11, 2019, 2:01pm

It depends on what you’re using Redis/MongoDB for, of course, but in general my experience with both at medium scale has been great. I’ve never even needed to shard Mongo, but it’s great to have the option. If you use it as intended, it scales very well, and so does Redis (though I have no experience whatsoever running Redis in a distributed way). I don’t know whether either is a good tool if you just want a place to store HTML content. Parsing that content and putting it into Mongo might give you a lot more query power than Redis, but if you’re more interested in the linked nature of the web, maybe a graph database would suit your needs (ansami/Neo4j/Datomic/…).

Also, I imagine there are loads of Java libraries for this kind of thing, and many of them have probably considered lots of edge cases. I find that an hour spent trying out some Java libraries almost always pays off.

didibus · March 11, 2019, 3:56pm

Ah I see.

Dunno how good it is, but I found this: https://github.com/maithilish/scoopi

And it looks right up your alley.

stardiviner · March 13, 2019, 6:03am

I agree with that. MongoDB and Redis are enough for a medium-scale distributed setup. I will try MongoDB out. Thanks.

stardiviner · March 13, 2019, 6:08am

@didibus, thanks. This one looks really interesting. I have considered Dockerizing the crawler too. One thing I don’t know is how to develop Clojure code against the Scoopi Java library inside Docker. If I solve that, I can set up a similar environment for my own development. I personally use Emacs with CIDER to write Clojure code, and Emacs has docker.el and docker-tramp.el for working with a Docker container’s filesystem. I will try Googling “setup Clojure development in Docker container”.

jackrusher · March 13, 2019, 3:29pm

In case you need to deal with pages that require a JS runtime to render, there’s also my library Sparkledriver.
