Does Clojure have a Scrapy-like framework for web scraping?

clojure

#1

Does Clojure or Java have a Scrapy-like framework for web scraping? (Scrapy is written in Python.) I’m currently learning web scraping and can already write a basic crawler, but I want something better and more functional. I know Scrapy is great, but I want to do this in Clojure, hence the question.

If you have any suggestions, please tell me. Thanks in advance.

Ideally the framework would be powerful and extensible, so that new middleware or similar concepts can be added, and easy to distribute and run in parallel.


#2

I’ve been using JSoup (see this post here) combined with clj-http and mixed with core.async + transducers. It works wonders 🙂
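For readers wondering what that combination might look like, here is a minimal sketch, not the poster's actual code. It assumes `clj-http`, `org.clojure/core.async` (a recent version that provides `to-chan!`) and `org.jsoup/jsoup` are on the classpath. The namespace and the function names `extract-links`, `fetch` and `scrape-all` are hypothetical, chosen only for illustration.

```clojure
(ns scraper.sketch ; hypothetical namespace
  (:require [clj-http.client :as http]
            [clojure.core.async :as a]
            [clojure.string :as str])
  (:import (org.jsoup Jsoup)
           (org.jsoup.nodes Element)))

(defn extract-links
  "Parse an HTML string with JSoup and return the distinct absolute URLs
  of all <a href> links, resolved against base-url."
  [html base-url]
  (let [doc (Jsoup/parse ^String html ^String base-url)]
    (->> (.select doc "a[href]")
         (map (fn [^Element el] (.attr el "abs:href")))
         (remove str/blank?)
         distinct
         vec)))

(defn fetch
  "Blocking GET with clj-http. Returns {:url .. :links [..]} on success
  or {:url .. :error ..} on any failure (clj-http throws on non-2xx)."
  [url]
  (try
    (let [{:keys [body]} (http/get url {:socket-timeout 5000
                                        :connection-timeout 5000})]
      {:url url :links (extract-links body url)})
    (catch Exception e
      {:url url :error (.getMessage e)})))

(defn scrape-all
  "Fetch urls with `parallelism` worker threads. The (map fetch)
  transducer is applied inside the pipeline; results keep input order."
  [urls parallelism]
  (let [in  (a/to-chan! urls) ; closes once all urls are delivered
        out (a/chan 16)]
    (a/pipeline-blocking parallelism out (map fetch) in)
    (a/<!! (a/into [] out))))
```

Something like `(scrape-all ["https://clojure.org"] 4)` would then return one result map per URL. `pipeline-blocking` is used rather than `pipeline` because the HTTP call blocks; swapping `(map fetch)` for a composed transducer such as `(comp (map fetch) (remove :error))` is one way to add further processing steps.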


#3

I’ve been using the same, except that instead of clj-http + core.async I have been using aleph + manifold.


#4

Thanks, I used clj-http in my previous simple crawler too. I tried to use core.async in my code, but I’m not good at Clojure yet. I have checked out manifold; it’s great and simple. I like the aleph + manifold solution. Once I get better at Clojure, I will check out core.async + transducers. Thank you both.

What if I want a medium-scale, distributed solution? What do you suggest: Redis, MongoDB, or something else? And how is the support on the Clojure side?

I looked at projects on the Java side and found Nutch, but I don’t know Java, and after reading its wiki and documentation I found Nutch hard to get started with. Does anybody know how to use Nutch from Clojure? A simple example would be great. The installation is also hard for me: I don’t know which way to install it. Maven? Docker? Something else?


#5

There are also crawler4j and Heritrix3 (used by the Internet Archive). But JSoup is very good as well.


#6

Thanks didibus.

After a closer look at Nutch and Heritrix3, I found that both are mainly web indexers: they scrape web pages for use by a search engine. Of course they can be used as crawlers too, but they are not a great fit for that.


#7

Hmm, can you be more specific about what you mean? I’m confused about the difference you’re drawing between scraping and crawling.

What’s your end goal/use case?


#8

Sorry for my confusing wording.

My end goal is to find a Clojure/Java library like Python’s Scrapy.

Scraping and crawling are, I think, almost the same. The real difference is between a web indexer and a web crawler.

After a closer look at Nutch and Heritrix3, I found that both are mainly web indexers: they scrape web pages for use by a search engine. Of course they can be used as crawlers too, but they are not a great fit for that.

You can check the Nutch documentation: it feeds scraped web pages to a search engine such as Apache Solr. I don’t need that. I want to save the data parsed from web pages to a database and do some data science with it.

Maybe I can use part of Nutch. I will take a closer look at how it works.


#9

Depends on what you’re using Redis/MongoDB for, of course, but generally my experience with both of those at medium scale has been great. I’ve never even needed to shard Mongo, but it’s great to have the option. If you use it as intended, it scales very well, same as Redis (though I have no experience whatsoever running Redis in a distributed way). I don’t know if either is a good tool if you just want a place to store HTML content. Parsing that and putting it into Mongo might give you a lot of query power (compared to Redis), but if you’re more interested in the linked nature of the web, maybe a graph database would suit your needs (ansami/Neo4j/Datomic/…).

Also, I imagine there are loads of Java libraries for this kind of stuff, and many of those have probably considered lots of edge cases. I find an hour of trying out some Java libraries almost always pays off ✌️


#10

Ah I see.

Dunno how good it is, but I found this: https://github.com/maithilish/scoopi

And it looks right up your alley.


#11

I agree with that. MongoDB and Redis are enough for a medium-scale distributed setup. I will try MongoDB out. Thanks.


#12

@didibus, thanks. This one looks really interesting. I have considered Dockerizing the crawler too. One thing I have no idea about is how to develop Clojure against this Scoopi Java library in Docker. If I solve that, I can set up a similar environment for my own development. I personally use Emacs with CIDER to write Clojure code. Emacs has docker.el and docker-tramp.el for working with the filesystem of a Docker container. I will try Googling “setup Clojure development in Docker container”.


#13

In case you need to deal with pages that require a JS runtime to render, there’s also my library Sparkledriver.