
I often run into this problem with Python, although usually it's the parsing code - the code that loads all the data up - that actually sets the RAM high-water-mark.
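If you want to see where your own high-water-mark is, a rough sketch using the stdlib resource module (Linux/macOS only; note ru_maxrss is kilobytes on Linux but bytes on macOS):

    import json
    import resource
    import sys

    def peak_rss():
        # Peak resident set size of this process so far,
        # as reported by the OS via getrusage().
        return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss

    print("peak before load:", peak_rss())
    with open(sys.argv[1]) as f:
        data = json.load(f)
    print("peak after load:", peak_rss())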

For example, I had a 100MB JSON file that I tried to load with the stdlib json library. It quickly used >8GB (my machine's RAM) and started paging, dragging everything to a halt. This is partly because the stdlib JSON parser is written in Python.

Now, if you switch to a small, clever implementation called cjson[1], it can load the whole thing without ever exceeding 300-400MB of RAM, and the high-water-mark is just the loaded data itself. Much better!
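The swap itself is tiny - cjson decodes from a string rather than a file object, so (roughly, from memory) it looks like this, with "big.json" standing in for whatever file you're loading:

    import cjson

    with open("big.json") as f:
        raw = f.read()           # the ~100MB of JSON as one string
    data = cjson.decode(raw)     # drop-in for json.loads(raw)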

So, in summary, make sure the part of your code that uses all the RAM is the part that actually deserves to - and that it's not some "hello world"-quality stdlib code that's killing you. If it is, and there isn't a cjson for the job, I've found wrapping C/C++ libraries with Cython[2] a simple way to solve the problem without too much hassle (generally only a couple of days' work at a time if you keep it tight and only wrap the functions you actually need to use yourself).
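For a flavour of what a minimal wrapper looks like, here's a sketch - "fastlib.h" and sum_values() are made-up names purely for illustration, not any real library:

    # fastparse.pyx - wrap only the one C function we actually need.
    # "fastlib.h" and sum_values() are hypothetical, for illustration.
    cdef extern from "fastlib.h":
        double sum_values(const char *path)

    def sum_from_file(path):
        # encode to bytes so the C side gets a plain char*
        return sum_values(path.encode("utf-8"))

Build it with a small setup.py (Cython.Build.cythonize plus the usual Extension entry to link against the library) and you get a normal importable module with the hot path entirely in C.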

[1] https://pypi.python.org/pypi/python-cjson - although there's a 1.5.1 out there somewhere with a fix for a bug that loses precision on floats, which is the only version I use personally. It's so hard to find that I keep a copy of the source in my Dropbox for when I need it!

[2] http://cython.org/ - although of course actually using Cython means you can't take advantage of PyPy, IronPython, and other "faster" implementations, because you're tied to the CPython C interface forever.


