nREPL creates thousands of threads - app crashes

I have an application that is built partly in Clojure and partly in PHP. On each load of the PHP script, it connects to the Clojure application through nREPL, creates a new nREPL session, sends a few commands to retrieve some information, and then closes the session. So several new nREPL sessions are created and closed each second.
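Roughly, each PHP request does something like the following, sketched here with the nrepl.core client; the port number and the evaluated form are placeholders, not the real values:

(require '[nrepl.core :as nrepl])

;; one request: connect, clone a fresh session, evaluate, close the session
(with-open [conn (nrepl/connect :port 7000)]            ; placeholder port
  (let [client  (nrepl/client conn 1000)                ; 1000 ms response timeout
        session (nrepl/new-session client)]             ; sends the "clone" op
    ;; retrieve some information from the running app
    (println (nrepl/response-values
              (nrepl/message client {:op "eval" :code "(+ 1 2)" :session session})))
    ;; explicitly close the session, as the PHP side does
    (doall (nrepl/message client {:op "close" :session session}))))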

On my production server (Redhat Linux), the Clojure app has started to crash after running for a few days. The crash file gives this message.

# There is insufficient memory for the Java Runtime Environment to continue.
# Native memory allocation (mmap) failed to map 12288 bytes for committing reserved memory.

Under running processes, there are about 32240 nREPL threads that look like this:

0x00007f0b681f2800 JavaThread "nRepl-session-c4e18c8b-26f7-437b-85eb-5b6b16145b0f" daemon [_thread_blocked, id=129841, stack(0x00007f01a4b2b000,0x00007f01a4c2c000)]

They are all blocked. I have just added a monitor of (.getThreadCount (ManagementFactory/getThreadMXBean)) to the app, and the count seems to increase with each PHP request. It is at 527 now, but I have not yet waited until the next crash to see whether it grows to >32000.
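For reference, a minimal sketch of such a monitor (the nRepl-session- prefix matches the thread names in the crash dump above):

(import '(java.lang.management ManagementFactory))

;; total number of live JVM threads, as reported by the thread MXBean
(defn thread-count []
  (.getThreadCount (ManagementFactory/getThreadMXBean)))

;; names of live threads that look like nREPL session threads
(defn nrepl-session-threads []
  (->> (Thread/getAllStackTraces)
       .keySet
       (map (fn [^Thread t] (.getName t)))
       (filter (fn [^String n] (.startsWith n "nRepl-session-")))))

(comment
  (thread-count)                    ; 527 at the time of writing
  (count (nrepl-session-threads)))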

I don’t see the same pattern in my dev environment (MacOS): PHP requests do not increase (.getThreadCount (ManagementFactory/getThreadMXBean)); the thread count stays pretty stable at around 40.

I’m pretty clueless here, since I know nothing about Java or threads. Is it possible that nREPL leaves thousands of threads lying around even though each session is closed? I have confirmed that the number of live session IDs in nREPL is never more than a few, so it seems that nREPL discards the sessions themselves properly.
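A minimal sketch of one way to check the live sessions, using the nrepl.core client and the built-in "ls-sessions" op (the port is again a placeholder):

(require '[nrepl.core :as nrepl])

;; ask the nREPL server which sessions it still considers live
(with-open [conn (nrepl/connect :port 7000)]   ; placeholder port
  (-> (nrepl/client conn 1000)
      (nrepl/message {:op "ls-sessions"})
      first
      :sessions))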

I would be very grateful for ideas!

Do you see the problem in your “dev” MacOS if you start your server without using an nREPL (e.g., “lein run…” instead of an editor nREPL connection)?

Do you see the problem in your “production” Linux if you also connect a long-running nREPL client, in addition to the short-lived PHP connections?

Thanks @Phill,

I always start the app with lein run on my Mac and then connect to nREPL through Cursive. I followed your suggestion and started the app without connecting to nREPL with Cursive, instead letting PHP connect to the nREPL in multiple successive calls. The nREPL threads appear and disappear; no threads linger.

I don’t have any application on my production server for connecting to the nREPL from the terminal. I’ve asked the IT folks to install Leiningen so I can run https://github.com/trptcolin/reply/. I started the app about 4 hours ago and .getThreadCount already reports 1404 nREPL threads. The difference between the prod and dev environments confuses me. Unless…

Now I tried running the uberjar on my dev machine instead. BAM! The nREPL threads are piling up, one for each PHP request. What is going on? Why does nREPL not dispose of its threads when it’s in an uberjar?

So the question might be, is something unexpectedly good about Leiningen or is something unexpectedly bad about the uberjar?

You could try running your dev environment without Leiningen’s plugins. Grab the classpath from “lein classpath” and try using it with “java -cp … clojure.main -m …”

By the way, Emacs + CIDER can connect to a running nREPL server by port number; perhaps Cursive can do likewise. I suppose you could set up an ssh port tunnel from the Mac to the Linux box to bring the nREPL port within reach if it is normally available only to (the Linux) localhost.

Just guessing here, but could it be that macOS and Redhat Linux handle network socket connections from terminating applications differently, and the connections are for some reason left open on the server after the PHP application is done with the request and, I guess, terminates?

If the JVM (on Redhat) never closes the open network sockets, at some point the JVM will run out of memory.

Thanks @Linus_Ericsson and @Phill for your input.

It turns out that this is probably due to a bug in nREPL that was fixed in version 0.7.0. However, I was including nREPL through Luminus nREPL, which has not bumped its nREPL version in a couple of years.

I included nREPL 0.8.1 directly and now the problem has disappeared. Weird that it only happened in the uberjar, though, which made it more difficult to detect.
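A minimal sketch of one way to depend on nREPL directly and start an embedded server (not necessarily exactly what I did; the project name, namespace, Clojure version and port are placeholders):

;; project.clj – pin nREPL explicitly instead of relying on the wrapper's transitive version
(defproject my-app "0.1.0"                        ; placeholder project
  :dependencies [[org.clojure/clojure "1.10.1"]   ; assumed Clojure version
                 [nrepl "0.8.1"]])

;; somewhere in application startup
(ns my-app.core                                   ; placeholder namespace
  (:require [nrepl.server :as server]))

(defonce nrepl-server
  ;; bind to localhost only; port 7000 is a placeholder
  (server/start-server :bind "127.0.0.1" :port 7000))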

Thanks again!


Oh, that was subtle. Thank you @DrLjotsson for the update! I took the liberty of filing an issue in the Luminus nREPL project.


It is fixed now.


Great, thanks!
