I haven’t tried this myself, but here are a couple of possibilities:
https://github.com/jtablesaw/tablesaw - It’s a Java dataframe library that bills itself as pandas-like.
Quoting their README:

> You can load a 500,000,000 row, 4 column CSV file (35 GB on disk) entirely into about 10 GB of memory. If it’s in Tablesaw’s .saw format, you can load it in 22 seconds. You can query that table in 1-2 ms: fast enough to use as a cache for a web app. BTW, those numbers were achieved on a laptop.
I haven’t tried it, but it certainly sounds promising.
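To give a feel for the API, here’s a minimal sketch of loading and filtering a CSV with Tablesaw (the file contents and the `distance` column are made up for illustration; you’d point `Table.read().csv(...)` at your real file):

```java
import java.nio.file.Files;
import java.nio.file.Path;
import tech.tablesaw.api.Table;

public class TablesawSketch {
    public static void main(String[] args) throws Exception {
        // Tiny stand-in CSV so the sketch is self-contained; the real use
        // case would point at the big file on disk instead.
        Path csv = Files.createTempFile("trips", ".csv");
        Files.writeString(csv, "id,distance\n1,3.2\n2,12.5\n3,18.0\n");

        Table trips = Table.read().csv(csv.toFile()); // loads the whole file into memory
        System.out.println(trips.structure());        // column names and inferred types

        // A row filter: this is the style of query quoted as taking 1-2 ms
        // on the big table once it's loaded.
        Table longTrips = trips.where(
                trips.doubleColumn("distance").isGreaterThan(10.0));
        System.out.println(longTrips.rowCount());
    }
}
```

This needs the `tablesaw-core` dependency on the classpath; everything else is plain JDK.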
Another option would be to use Datomic instead of SQL. You might find it easier to explore the datasets with Datalog queries.
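As a taste of what that looks like, here’s a hedged sketch of a Datomic Datalog query (the `:trip/distance` attribute is hypothetical; `d/q` is `datomic.api/q` and `db` is a database value you’d get from a connection):

```clojure
;; Find every entity with a :trip/distance over 10, and the distance itself.
;; Clauses compose declaratively, which makes interactive exploration easy:
;; add or remove :where clauses at the REPL and re-run.
(d/q '[:find ?e ?dist
       :where [?e :trip/distance ?dist]
              [(> ?dist 10.0)]]
     db)
```

Because queries are plain data structures, building them up incrementally while exploring a dataset tends to be more pleasant than string-assembling SQL.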