How to get the URLs out of the HTML [Brave book chapter 9]

RoelofWobben · January 11, 2021, 8:42pm

Hello,

I’m still working through the Brave book and am near the end,
but I do not see how I could ever do this:

(defn search-clojure-docs-with-search-engine [search-term search-engine]
  (let [data (future (slurp (str "https://" search-engine "/search?q=" search-term)))]
    @data))

(search-clojure-docs-with-search-engine "slurp" "clojuredocs.org")

; Create a new function that takes a search term and search engines as arguments,
; and returns a vector of the URLs from the first page of search results from
; each search engine

I could reuse `search-clojure-docs-with-search-engine`, but the book never explains how to get the URLs out of the HTML.

Any hints, please?

mvarela · January 11, 2021, 9:16pm

What does search-clojure-docs-with-search-engine return?
I’d first look at that value in order to get started.

RoelofWobben · January 11, 2021, 9:26pm

It returns the whole HTML of a page.

mvarela · January 11, 2021, 10:31pm

Maybe something like this would help: https://clojuredocs.org/clojure.core/re-matches
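
A small sketch of how the built-in regex functions differ, since it matters for this exercise: `re-matches` only succeeds when the entire string matches the pattern, while `re-find` returns the first match and `re-seq` returns all of them. The `html` string and the pattern below are made up for illustration.

```clojure
;; Illustrative input, not taken from a real results page.
(def html "<a href=\"https://clojure.org/\">Clojure</a>")

;; re-matches needs the WHOLE string to match, so a page of HTML yields nil.
(re-matches #"https?://[^\"<>]+" html)
;; => nil

;; re-find returns the first match found anywhere in the string.
(re-find #"https?://[^\"<>]+" html)
;; => "https://clojure.org/"

;; re-seq returns a lazy sequence of every match.
(re-seq #"https?://[^\"<>]+" (str html html))
;; => ("https://clojure.org/" "https://clojure.org/")
```

For pulling many URLs out of one page, `re-seq` is probably the function to reach for.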

joinr · January 11, 2021, 11:04pm

fyi, the naive url-based search will work with bing, not google these days. google throws a 403 error; it wants a more properly formatted request than the simple string url being passed (some stuff in the headers, ends up being a bit more complicated). I think back when CFBAT was written this wasn’t an issue.

I think this is kind of a substantial exercise, since you are now left with the problem of getting results, and then parsing those results (which are encoded in HTML) into URLs. So you have to understand the format of the search engine’s output to parse the HTML (pedagogically useful but perhaps a bit intense for a beginner exercise). That or use an API to get better formatted results (no idea). There are libraries that make parsing and dissecting and searching the resulting html (or xhtml) really easy though, so it’s not terrible. It seems like the book dangles this out as something trivial (just looking at CH9) when in fact there is “a lot” of stuff left up to the reader.

The simplest way would be to search the HTML for urls using a regular expression. If you sift through the HTML, e.g. with the firefox or chrome inspector, you will get a sense for where the search results are and where the URL is stored. It looks like an href, so something along the lines of scraping the text looking for matches for a regex that conforms to this pattern could work. You could also see if there’s a common regex for URLs (or define your own) and just use that and hope that the only URLs in the search results will be ones the engine presents for clicking.
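
As a rough sketch of the href idea, assuming the links you want are absolute URLs in double-quoted `href` attributes (real result pages may not be that tidy). The function name `href-urls` and the `sample-html` string are hypothetical, made up for this example.

```clojure
(defn href-urls
  "Returns the absolute http(s) URLs found in double-quoted href attributes."
  [html]
  (->> html
       ;; With a capture group, re-seq yields [whole-match group-1] vectors.
       (re-seq #"href=\"(https?://[^\"]+)\"")
       (map second)))

;; Hand-written sample standing in for a fetched results page.
(def sample-html
  (str "<html xmlns=\"http://www.w3.org/1999/xhtml\">"
       "<a href=\"https://clojure.org/\">Clojure</a>"
       "<a href=\"/images\">Images</a>"
       "<a href=\"https://clojuredocs.org/\">ClojureDocs</a></html>"))

(href-urls sample-html)
;; => ("https://clojure.org/" "https://clojuredocs.org/")
```

Anchoring on `href=` skips things like the `xmlns` namespace URL and relative links, at the cost of missing single-quoted or unquoted attributes.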

Look at the section under “Strings” for “Regex” at the cheatsheet for some built-in functions.

mvarela · January 11, 2021, 11:20pm

I think this is a common problem with this book. Implementing proper solutions to some of the problems would be quite cumbersome.

joinr · January 12, 2021, 7:42am

I still recommend it (or at least portions), since most of the content is decent. The extra sugary examples are the exception: bending over backwards to be approachable with goofy names and data ends up being distracting, IMO, but the technical discussions are useful.

As I recall, the entire IDE/editor setup is focused on emacs and some stuff that is likely dated (e.g. not using cider, not mentioning spacemacs and its excellent clojure layer or other stuff - some of which the book predated). When I get folks started with a more-or-less minimal setup, I just help them install spacemacs and the clojure layer and that covers about 99% of everything. OTOH, I don’t know that I’d start off a complete new person with the dual task of learning clojure and emacs, with the exception of a very minimal familiar text-editor-like experience (e.g. I encourage folks to use CUA shortcuts if they’ve come from that and just ditch emacs’s defaults).

I think that book could use a 2nd edition by now covering current practices for tooling (e.g. discussing tools.deps, although I still prefer lein for projects) and revising some of the exercises or providing worked examples.

mvarela · January 12, 2021, 8:58am

Agreed 100%. It’s what I used to get started. I was also using spacemacs at the time (nowadays I’m using Doom, I got tired of spacemacs’ general sluggishness). The Doom clojure setup is almost as good (I had to define some extra bindings for some CIDER stuff).

My impression wrt the exercises is that they’re meant to have “sketched” solutions. The URL example here would probably be served by figuring out how to use re-matches or similar, rather than something like hickory or enlive and walking over the parsed HTML, which is what a proper solution would entail.

RoelofWobben · January 12, 2021, 9:21am

Oops, I never imagined that such a “simple” question would cause such a big discussion.
I have looked at enlive but could not make that work.

mvarela · January 12, 2021, 9:35am

Well, as we mentioned, the issue is that those exercises in the brave clojure book are kinda complex if you want to do them “right”. If you’re just starting out, enlive is not easy to use (heck, I’ve been doing clojure for a few years, and enlive is still not easy to use for me).

I’d suggest you look into using regexes, or maybe skip this one, tbh.

RoelofWobben · January 12, 2021, 11:50am

I think I’ll skip this one; I’m very, very bad at regexes.

joinr · January 12, 2021, 11:57am

Another option is to use hickory: parse the query response, then search the result recursively. Hickory will turn the response into a nested collection of maps, with keywords corresponding to tags, attributes, and content.

Or just skip it if you’re not interested in the problem.
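
A sketch of the recursive search, with no library needed to try it. The `parsed` map below is hand-written in the shape I understand hickory’s parsed output to take (`:tag`, `:attrs`, `:content`); treat that shape as an assumption and check it against what hickory actually returns. `link-hrefs` is a hypothetical name.

```clojure
;; Hand-written stand-in for a parsed page (assumed hickory-like shape).
(def parsed
  {:type :element, :tag :html, :attrs nil
   :content
   [{:type :element, :tag :body, :attrs nil
     :content
     [{:type :element, :tag :a, :attrs {:href "https://clojure.org/"}
       :content ["Clojure"]}
      {:type :element, :tag :p, :attrs nil
       :content ["See also "
                 {:type :element, :tag :a
                  :attrs {:href "https://clojuredocs.org/"}
                  :content ["ClojureDocs"]}]}]}]})

(defn link-hrefs
  "Walks the nested maps depth-first and returns the href of every :a node."
  [node]
  (->> (tree-seq map? :content node)   ; maps are branches, strings are leaves
       (filter #(= :a (:tag %)))
       (keep #(get-in % [:attrs :href]))))

(link-hrefs parsed)
;; => ("https://clojure.org/" "https://clojuredocs.org/")
```

`tree-seq` does the recursion, so there is no need to write the walk by hand.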

mvarela · January 12, 2021, 12:45pm

This is by no means a proper solution, but I suspect it’s more or less what the exercise expected you to do:

(->>  "https://www.bing.com/search?q=clojure"
      slurp
      (re-seq #"http[s]?://[^<>\"']+"))
;; => ("http://www.w3.org/1999/xhtml"
;;     "http://schemas.live.com/Web/"
;;     "https://business.bing.com/api/v3/search/person/photo?id={0}"
;;     "https://clojure.org/"
;;     "http://www.microsofttranslator.com/bv.aspx?ref=SERP&amp;br=ro&amp;mkt=fi-FI&amp;dl=fi&amp;lp=EN_FI&amp;a=https%3a%2f%2fclojure.org%2f"
;;     "https://clojure.org/about/rationale"
;;     "https://clojure.org/reference/reader"
;;     "https://clojure.org/api/api"
;;     "https://clojure.org/releases/downloads"
;;     "https://clojure.org/guides/getting_started"
;;     "https://clojure.org/community/success_stories"
;;     "https://clojure.org/dev/dev"
;;     "https://clojure.org/news/news"
;;     "https://en.wikipedia.org/wiki/Clojure"
;;     "http://www.microsofttranslator.com/bv.aspx?ref=SERP&amp;br=ro&amp;mkt=fi-FI&amp;dl=fi&amp;lp=EN_FI&amp;a=https%3a%2f%2fen.wikipedia.org%2fwiki%2fClojure"
;;     "https://en.wikipedia.org/wiki/"
;;     "http://www.w3.org/2000/svg"
;;     "http://www.w3.org/2000/svg"
;;     "https://en.wikipedia.org/wiki/Clojure"
;;     "http://creativecommons.org/licenses/by-sa/3.0/"
;;     "https://www.youtube.com/watch?v=xqGmE4KyhzQ"
;;     "https://www.youtube.com/watch?v=mrXDc4e0e6s"
;;     "https://www.youtube.com/watch?v=ciGyHkDuPAE"
;;     "https://www.youtube.com/watch?v=zznwKCifC1A"
;;     "https://www.youtube.com/watch?v=WTzzUSw6iaI"
;;     "http://www.w3.org/2000/svg"
;;     "http://www.w3.org/2000/svg"
;;     "https://clojure.org/"
;;     "http://www.w3.org/2000/svg"
;;     "http://www.w3.org/2000/svg"
;;     "https://www.braveclojure.com/"
;;     "http://www.w3.org/2000/svg"
;;     "http://www.w3.org/2000/svg"
;;     "https://clojure.org/"
;;     "http://www.w3.org/2000/svg"
;;     "http://www.w3.org/2000/svg"
;;     "https://en.wikipedia.org/wiki/Clojure"
;;     "https://fi.wikipedia.org/wiki/Clojure"
;;     "https://fi.wikipedia.org/wiki/"
;;     "https://www.tutorialspoint.com/clojure/clojure_basic_syntax.htm"
;;     "http://www.microsofttranslator.com/bv.aspx?ref=SERP&amp;br=ro&amp;mkt=fi-FI&amp;dl=fi&amp;lp=EN_FI&amp;a=https%3a%2f%2fwww.tutorialspoint.com%2fclojure%2fclojure_basic_syntax.htm"
;;     "https://www.tutorialspoint.com/"
;;     "http://www.w3.org/2000/svg"
;;     "http://www.w3.org/2000/svg"
;;     "https://www.tutorialspoint.com/clojure/clojure_basic_syntax.htm"
;;     "https://www.tutorialspoint.com/clojure/clojure_basic_syntax.htm"
;;     "https://clojuredocs.org/"
;;     "http://www.microsofttranslator.com/bv.aspx?ref=SERP&amp;br=ro&amp;mkt=fi-FI&amp;dl=fi&amp;lp=EN_FI&amp;a=https%3a%2f%2fclojuredocs.org%2f"
;;     "https://www.braveclojure.com/"
;;     "http://www.microsofttranslator.com/bv.aspx?ref=SERP&amp;br=ro&amp;mkt=fi-FI&amp;dl=fi&amp;lp=EN_FI&amp;a=https%3a%2f%2fwww.braveclojure.com%2f"
;;     "https://www.brave"
;;     "https://www.tutorialspoint.com/clojure/index.htm"
;;     "http://www.microsofttranslator.com/bv.aspx?ref=SERP&amp;br=ro&amp;mkt=fi-FI&amp;dl=fi&amp;lp=EN_FI&amp;a=https%3a%2f%2fwww.tutorialspoint.com%2fclojure%2findex.htm"
;;     "https://www.tutorialspoint.com/"
;;     "https://www.reddit.com/r/Clojure/"
;;     "http://www.microsofttranslator.com/bv.aspx?ref=SERP&amp;br=ro&amp;mkt=fi-FI&amp;dl=fi&amp;lp=EN_FI&amp;a=https%3a%2f%2fwww.reddit.com%2fr%2fClojure%2f"
;;     "https://www.reddit.com/r/"
;;     "https://clojuredocs.org/clojure.core/subs"
;;     "http://www.microsofttranslator.com/bv.aspx?ref=SERP&amp;br=ro&amp;mkt=fi-FI&amp;dl=fi&amp;lp=EN_FI&amp;a=https%3a%2f%2fclojuredocs.org%2fclojure.core%2fsubs"
;;     "http://fi.wikipedia.org/wiki/Clojure"
;;     "http://fi.wikipedia.org/wiki/Clojure"
;;     "https://fi.wikipedia.org/wiki/Automaattinen_roskienkeräys"
;;     "http://fi.wikipedia.org/wiki/Clojure"
;;     "http://creativecommons.org/licenses/by-sa/3.0/"
;;     "https://www.bing.com:443/shared/mcp"
;;     "http://go.microsoft.com/fwlink/?LinkId=521839"
;;     "http://go.microsoft.com/fwlink/?LinkID=246338"
;;     "https://go.microsoft.com/fwlink/?linkid=868922"
;;     "http://go.microsoft.com/fwlink/?LinkID=286759"
;;     "https://go.microsoft.com/fwlink/?LinkID=617297"
;;     "http://help.bing.microsoft.com/#apex/18/FI/10013/-1/FI"
;;     "https://go.microsoft.com/fwlink/?LinkId=521839"
;;     "https://storage.live.com/users/0x{0}/myprofile/expressionprofile/profilephoto:UserTileStatic/p?ck=1\\u0026ex=720\\u0026sid=39AEC427CB6B6A903335CB9ACA806BF0\\u0026fofoff=1"
;;     "https://storage.live.com/users/0x{0}/myprofile/expressionprofile/profilephoto:UserTileMedium/p?ck=1\\u0026ex=720\\u0026sid=39AEC427CB6B6A903335CB9ACA806BF0\\u0026fofoff=1"
;;     "https://login.live.com/login.srf?wa=wsignin1.0\\u0026rpsnv=11\\u0026ct=1610455445\\u0026rver=6.0.5286.0\\u0026wp=MBI_SSL\\u0026wreply=https:%2F%2fwww.bing.com%2Fsecure%2FPassport.aspx%3Fpopup%3D1%26ssl%3D1\\u0026lc=1035\\u0026id=264960"
;;     "http://cc.bingj.com/cache.aspx?q=clojure\\u0026d={0}\\u0026mkt=fi-FI\\u0026setlang=fi-FI\\u0026w={1}"
;;     "http://www.w3.org/2000/svg")
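
The raw list has plenty of duplicates and non-result URLs (XML namespaces, Microsoft links), so here is one possible sketch of the full exercise with some cleanup. Everything in it is an assumption rather than the book’s solution: the names `extract-urls` and `search-urls`, the ignore list, and the optional `fetch` argument, which exists only so the function can be tried without hitting the network.

```clojure
(require '[clojure.string :as string])

(def url-pattern #"https?://[^<>\"']+")

(defn extract-urls
  "Returns a vector of the distinct URLs in html, dropping any URL that
  contains one of the substrings in ignored."
  [html ignored]
  (->> (re-seq url-pattern html)
       (remove (fn [url] (some #(string/includes? url %) ignored)))
       distinct
       vec))

(defn search-urls
  "Fetches the first results page from each engine in parallel (one future
  per engine) and returns one flat vector of URLs. fetch defaults to slurp."
  ([search-term engines] (search-urls search-term engines slurp))
  ([search-term engines fetch]
   (let [ignored ["w3.org" "microsoft" "live.com"]
         ;; mapv is eager, so every future starts before any is dereferenced.
         pages   (mapv #(future (fetch (str "https://" % "/search?q=" search-term)))
                       engines)]
     (into [] (mapcat #(extract-urls (deref %) ignored)) pages))))

;; Offline check: a map from URL to canned HTML plays the role of slurp.
(def fake-pages
  {"https://a.example/search?q=clojure"
   (str "<svg xmlns=\"http://www.w3.org/2000/svg\"></svg>"
        "<a href=\"https://clojure.org/\">Clojure</a>"
        "<a href=\"https://clojure.org/\">again</a>")
   "https://b.example/search?q=clojure"
   "<a href='https://clojuredocs.org/'>ClojureDocs</a>"})

(search-urls "clojure" ["a.example" "b.example"] fake-pages)
;; => ["https://clojure.org/" "https://clojuredocs.org/"]
```

Against a real engine you would call `(search-urls "clojure" ["www.bing.com"])`; the ignore list would likely need tuning per engine.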

RoelofWobben · January 12, 2021, 7:42pm

Thanks. What does ^<> do exactly?

mvarela · January 12, 2021, 10:23pm

A caret right after the opening square bracket negates the character class: it matches any character except the ones listed. So the match stops at tags and quotes, basically (that’s a regex thing, nothing Clojure-specific).
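
A few REPL lines that show the effect; the input strings are made up for illustration.

```clojure
;; [^<>]+ matches runs of characters that are neither < nor >.
(re-seq #"[^<>]+" "<b>bold</b> text")
;; => ("b" "bold" "/b" " text")

;; With quotes excluded as well, a URL match ends at the closing quote.
(re-find #"https?://[^<>\"']+" "<a href='https://clojure.org/'>Clojure</a>")
;; => "https://clojure.org/"

;; Outside square brackets the caret means something else: start of input.
(re-find #"^http" "see http://example.com")
;; => nil
```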

First result off google explaining this: https://regular-expressions.mobi/charclass.html?wlr=1

RoelofWobben · January 13, 2021, 10:31am

Thanks a lot all for the help

mvarela · January 13, 2021, 12:18pm

You’re welcome!
