Data science: How to cluster articles by their tags?

Hi! I have a list of 20-30 links, each with zero to a few tags (such as ["clojure", "library"] or ["javascript", "library", "data visualization"]). Now I would like to group these articles by topic so that I have something nicer to read than a flat, chronological list. Thus I would like to write a function that I can feed all these lists of tags, and that will give me back a few main topics plus an "and all the rest" topic, under which I can put the articles.

How can I do that? I assume I could use Smile’s k-means, transforming the tags into numbers with mm/categorical->number. Would that work? A disadvantage of k-means, though, is that I need to provide it with the number of clusters I want, while in fact I do not know how many “main topics” there are in the list of articles. Ideally, the algorithm would:

  • Find at most 5 “main topics” that group the articles, plus one “Other” topic for those that do not fit into any of the previous ones.
  • Not require me to set the number of topics (= clusters) myself. Though I can give in on this and set the number, e.g., to 3+1.
  • Assign no article to multiple clusters (I do not want both a “Clojure” cluster and a “Library” cluster with the same article in both).

Thank you for any advice!

I wouldn’t personally use k-means for this, at least not at first. Given a tag->integer mapping — which you can generate any number of ways, but which needs to stay constant — you can represent each link (or document) as a vector with a 1 at the integer index of each of its tags. Then you can use cosine, Euclidean, or any number of distance functions to find the ‘nearest’ vectors. k-means in this context is really a dimensionality-reduction algorithm that may or may not help you; often the first step is simply to lift each doc or link into a vector space and then use a pure distance function.

For the sparse math I recommend ojAlgo.

Once the number of vectors gets truly large, you may want to progress to a spatial data structure that preserves distance characteristics and allows rapid queries.

Here is one possible pathway -

  1. Convert each link/doc to a vector where each tag gets a 1 in a particular index.
  2. Then you can find nearest vectors using cosine or euclidean distance.
  3. People often use sparse vectors for this; ojAlgo has a good implementation of sparse math.
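Steps 1 and 2 above can be sketched quite briefly. This is a language-neutral illustration in Python (the thread itself is about Clojure/JVM libraries); the sample links and tag names are made up:

```python
import math

# Hypothetical sample data: each link is just a list of tags.
links = [
    ["clojure", "library"],
    ["javascript", "library", "data visualization"],
    ["clojure", "tool"],
]

# 1. Assign each tag a fixed integer index, then one-hot encode each link:
#    a vector with a 1 at the index of every tag the link carries.
tags = sorted({t for link in links for t in link})
index = {t: i for i, t in enumerate(tags)}

def to_vector(link):
    v = [0.0] * len(tags)
    for t in link:
        v[index[t]] = 1.0
    return v

vectors = [to_vector(link) for link in links]

# 2. Cosine distance: 1 - (a.b)/(|a||b|). 0 means same direction (identical
#    tag profile), 1 means no tags in common.
def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (na * nb)
```

With this, `cosine_distance(vectors[0], vectors[1])` is small when two links share tags and exactly 1.0 when they share none — the “pure distance function” the reply describes.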

This gets into an abstract definition of ‘nearness’, which implies a metric space. If we arbitrarily assign integers to tags, our space will mostly work, but it will not know the relationships between tags: some tags are semantically more related to others — which is one reason this is only a first step :-). Words have meaning, so a more faithful representation of the semantic space needs more than one dimension per tag, and then off we go to something quite a bit more complex.


Thank you for the very insightful response! I have managed to turn the data into vectors, with each tag a column that is 1 if the tag is present in the given article (= row). Computing distances makes sense, and I see how I can do that, e.g., with Smile’s EuclideanDistance. So I can compute the distance between any two vectors (= articles) — but is that enough? I need some smart algorithm that applies the distance to give me the groups I am interested in. (BTW, thank you for pointing out the relatedness of some tags, and thus the need to eventually get smarter about the definition of the distance!)
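To make the “distance between any two vectors” part concrete: once every article is a vector, a pairwise distance matrix is the only input a distance-based clusterer needs. A minimal sketch in Python, with a hypothetical three-article, three-tag example:

```python
import math

# Hypothetical one-hot vectors for three articles
# (columns: clojure, library, tool).
articles = {
    "a": [1, 1, 0],   # clojure, library
    "b": [1, 0, 1],   # clojure, tool
    "c": [0, 1, 0],   # library
}

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# The full pairwise distance matrix, keyed by article-name pairs.
names = sorted(articles)
dist = {(p, q): euclidean(articles[p], articles[q])
        for p in names for q in names}
```

A clustering algorithm then works purely on `dist` (or an equivalent on-demand distance callback), never on the raw tags.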

I’d expect that applying a transformation like singular value decomposition to the feature vectors would make it easier to detect “topics” among them. Naturally it depends on the specific data, but density-based clustering (like DBSCAN, which does not need to be given the number of clusters) might already work well enough under such a transform.

(I cannot recommend specific JVM libraries here – a quick search at least suggests that Apache Commons Math implements both of the above.)
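For intuition about why DBSCAN fits the wishlist, here is a minimal sketch of the algorithm in Python (not any of the JVM libraries mentioned; the one-hot points are made up). It has two knobs — a neighborhood radius `eps` and a density threshold `min_pts` — and anything not reachable from a dense neighborhood is labeled -1 (noise), which maps nicely onto the asker’s “Other” topic:

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def dbscan(points, eps, min_pts):
    """Minimal DBSCAN sketch: returns one cluster id per point; -1 = noise."""
    n = len(points)
    labels = [None] * n                      # None = not yet visited
    cid = -1

    def region(i):
        return [j for j in range(n) if euclidean(points[i], points[j]) <= eps]

    for i in range(n):
        if labels[i] is not None:
            continue
        neigh = region(i)
        if len(neigh) < min_pts:
            labels[i] = -1                   # noise -> the "Other" bucket
            continue
        cid += 1                             # i is a core point: new cluster
        labels[i] = cid
        seeds = [j for j in neigh if j != i]
        while seeds:
            j = seeds.pop()
            if labels[j] == -1:
                labels[j] = cid              # border point, reclaimed from noise
            if labels[j] is not None:
                continue
            labels[j] = cid
            nj = region(j)
            if len(nj) >= min_pts:           # j is also a core point: expand
                seeds.extend(nj)
    return labels

# Three mutually close one-hot vectors plus one outlier.
points = [[1, 1, 0, 0], [1, 1, 1, 0], [1, 0, 0, 0], [0, 0, 0, 1]]
labels = dbscan(points, eps=1.2, min_pts=2)
```

The first three points end up in one cluster and the outlier gets -1; no cluster count was supplied. Real libraries expose the same two knobs — presumably what the `[3 5]` parameters in the fastmath/Smile call later in this thread correspond to.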


Thank you! I see that fastmath has DBSCAN (via Smile) as well.

Thank you, Carsten!

I am trying to wrap my head around how to use it. Currently I have

;; Assumed aliases (the requires were not shown in the original post):
;;   [tablecloth.api :as ds]
;;   [scicloj.metamorph.core :as mm]
;;   [scicloj.ml.smile.clustering :as clustering]

(def ds (-> [{:tags ["app" "mobile" "travel"]}
             {:tags ["business"]}
             {:tags ["clojure" "library" "data processing"]}
             {:tags ["clojure" "tool" "docker" "devops"]}
             {:tags ["tool" "automation" "macos"]}]
            (ds/separate-column :tags :infer #(zipmap % (repeat 1)))
            (ds/replace-missing :all :value 0)))

(def pipe-fn
  (mm/pipeline
   {:metamorph/id :tag-cluster}
   (clustering/cluster :dbscan [3 5] :cluster-id)))

(pipe-fn {:metamorph/data ds
          :metamorph/mode :fit})

but that fails with

NullPointerException at fastmath.core/seq->double-array (core.clj:1310)
Cannot invoke “java.lang.Number.doubleValue()” because the return value of “clojure.lang.ISeq.first()” is null

Any tips? :pray:

The above code works for me.
I do not get an NPE.

Which version do you use?

Sorry, my bad, you are right — it works. I guess I forgot to re-eval the fn after I added the (ds/replace-missing :all :value 0) (without it, it fails in that way).