I’m working on a small social networking site. I need to match people to other people, based on their similarities. I’ve initially taken a naive approach: a background process creates a match-preliminary between every pair of users, then other background processes assess the two people in each match, raising or lowering the overall score, until in some cases the score becomes high enough that the match-preliminary is promoted to a match. That is, if we find, after multiple types of analysis, that two users have a lot in common, then we create a match between them.
I currently have 30 background processes going, each of which assesses the two users on a different aspect of the match, especially what they’ve written and what their careers have been about.
I’m aware that if I did all the analysis in a single function, this would be more efficient, but I thought different processes for different aspects of the match would be more flexible, as a programming model. With this style of programming I can potentially take any one aspect of the analysis and make it a separate app.
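To make the model concrete, each background process looks roughly like this. This is only a sketch of the shape of one assessor: world/update, assess-writing-overlap, and the promotion threshold of 100 are placeholders, not my actual code.

(defn assess-writing-similarity
  "One aspect of the analysis, run as its own background task: compares what
   the two users have written and nudges the preliminary's score up or down."
  []
  (doseq [preliminary (world/query {:item-type "user_matches_preliminary"})]
    (let [delta     (assess-writing-overlap preliminary) ; placeholder scoring fn
          new-score (+ (:match-score preliminary) delta)]
      (world/update (assoc preliminary :match-score new-score)) ; placeholder update fn
      ;; once enough aspects have raised the score, promote to a real match
      (when (>= new-score 100)
        (world/create (assoc preliminary :item-type "user_match"))))))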
Even given that my approach has been inefficient, I am still surprised by the strain on the server. I’m running an EC2 instance that has 96 gigs of RAM and 36 CPUs. A powerful machine. And yet, with only a few hundred users, my app uses up 14 gigs of RAM, and when I run htop I see that the server load has risen to 8 or 9.
This app talks to MongoDB. I have done nothing, so far, to optimize the connection to MongoDB, and I have not added any indexes. I plan to do that soon, though it would be helpful to have some intuition about what doing so would mean, and how it would help. In other words, should I assume that the server load is high because my functions are slow, and the functions are slow because the queries to the database are slow? And that therefore adding indexes to MongoDB would bring down the server load? Or is the issue only the wastefulness of the original algorithm, that is, creating a match-preliminary between every pair of users?
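For what it’s worth, the index I have in mind would cover the exact lookup my code does against the preliminary records. Something like this, assuming the Monger library and a hypothetical collection name "items" (my real access goes through the world/ namespace, so the names here are illustrative):

(ns myapp.indexes
  (:require [monger.core :as mg]
            [monger.collection :as mc]))

(defn create-match-indexes
  []
  (let [conn (mg/connect)                ; default localhost:27017
        db   (mg/get-db conn "myapp")]   ; hypothetical database name
    ;; Compound index matching the query in user-matches-preliminary-create:
    ;; {:item-type ... :user-id ... :other-user-id ...}
    (mc/create-index db "items"
                     (array-map :item-type 1 :user-id 1 :other-user-id 1))))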
I’ve also added this line:
(Thread/sleep 100)
to try to spread out the load. But I assume having the functions pause like this, rather than race through as fast as possible, contributes to the memory usage, as each function is going slowly and holding onto memory while it does so. Do you think this ultimately makes server load worse?
The people in this network are not equal, and therefore the relationship of user1 to user2 needs to be scored differently from the relationship of user2 to user1, which doubles the number of relationships that need to be built and scored.
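To put numbers on it, with N users there are N × (N − 1) ordered pairs. Assuming 300 users, roughly where I am now, the creation pass alone has to consider about 90,000 preliminaries, and at 100 ms of sleep per pair that is several hours of wall-clock time:

;; back-of-the-envelope figures, assuming 300 users
(let [n             300
      ordered-pairs (* n (dec n))]                        ; => 89700
  {:ordered-pairs  ordered-pairs
   :hours-of-sleep (/ (* ordered-pairs 0.1) 3600.0)})     ; 100 ms each => ~2.5 hours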
As a programming model, I like the flexibility of doing each assessment in its own background task, but the strain on the server surprises me.
I’m curious whether there are any tweaks that might reduce the server load. Are there tweaks to the garbage collection that might reclaim memory more quickly?
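For instance, is it worth setting JVM options along these lines? I’m assuming Leiningen’s :jvm-opts here, and I haven’t actually tried these flags; they’re just the ones I keep seeing recommended.

;; in project.clj (sketch, untested on my setup)
:jvm-opts ["-Xmx8g"                     ; cap the heap well below the machine's 96 GB
           "-XX:+UseG1GC"               ; use the G1 collector (the default on Java 9+)
           "-XX:MaxGCPauseMillis=200"]  ; ask G1 to target shorter pauses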
Here is the code I use to create the match-preliminary:
(defn user-matches-preliminary-create
  "Creates a match-preliminary for every ordered pair of searchable users
   that does not already have one."
  []
  ;; Outer loop: every user who has searching enabled.
  (doseq [user1 (world/find-many {:item-type        "user"
                                  :searchingEnabled true})]
    ;; Inner loop: the same query again, re-run once per outer user,
    ;; so every user1/user2 ordering is visited (including user1 with itself).
    (doseq [user2 (world/find-many {:item-type        "user"
                                    :searchingEnabled true})]
      (Thread/sleep 100)                          ; throttle: 100 ms per pair
      ;; Look up an existing preliminary for this ordered pair.
      (let [preliminary (first (world/query {:item-type     "user_matches_preliminary"
                                             :user-id       (:user-id user1)
                                             :other-user-id (:user-id user2)}))]
        ;; Only create one if it doesn't already exist.
        (when (nil? preliminary)
          (world/create {:item-type     "user_matches_preliminary"
                         :user-id       (:user-id user1)
                         :other-user-id (:user-id user2)
                         :match-score   0
                         :not-allowed   []}))))))