TCMalloc and MySQL (github.com/blog)
94 points by craigkerstiens on Feb 21, 2013 | 30 comments



Warning: tcmalloc does not release memory back to the OS, ever:

"TCMalloc currently does not return any memory to the system."[1]

That means if you have many long-running processes, then each of them will consume the maximum amount of memory that it ever has. Not good for a multi-tenant setup.

If it's a dedicated server running one multi-threaded application, maybe that's OK, although I'd be a little bit wary anyway.

I should note that, even if the application doesn't let the memory go, the OS could page out the inactive regions. Not really something that I would like to rely on, though. There are some other caveats also, like it would make memory accounting a little trickier ("Wow, that process is huge! Oh, never mind, it's mostly paged out.").

For what it's worth, I just spent considerable effort to get rid of tcmalloc due (in part) to problems like this. [2]

[1] http://goog-perftools.sourceforge.net/doc/tcmalloc.html

[2] You wouldn't think it would be a lot of effort, but we were using dynamic libraries that were linking against tcmalloc, which is outright dangerous if the main executable isn't linked against tcmalloc (you don't want to replace the allocator in a running executable). And some of those libraries were actually using the tcmalloc-specific features/symbols, so I had to get away from that first.


About [1] Sorry about that: the document you linked to is amazingly stale. tcmalloc has been releasing memory to the system for many years. See for example the IncrementalScavenge routine in a version of page_heap.cc from Dec 2008:

https://code.google.com/p/gperftools/source/browse/trunk/src...

One caveat: physical memory and swap space are released, but the process's virtual size will not decrease, since tcmalloc uses madvise(MADV_DONTNEED) to release memory.
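
One way to observe the caveat above on Linux is to compare a process's virtual size with its resident set size. This is only a minimal sketch, assuming a Linux /proc filesystem; `mem_summary` is a hypothetical helper name, not part of tcmalloc or any standard tool.

```shell
#!/bin/sh
# mem_summary PID: print virtual size (VmSize) and resident size (VmRSS).
# Hypothetical helper; relies on the Linux-specific /proc/PID/status file.
mem_summary() {
    grep -E '^Vm(Size|RSS):' "/proc/$1/status"
}

# Example: inspect the current shell itself.
mem_summary "$$"
```

If pages have been handed back with madvise, one would expect VmRSS to drop while VmSize stays roughly where it was.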

About [2], code using tcmalloc-specific features/symbols is definitely a problem. I would strongly advise against doing that, and recommend sticking to the libc interfaces instead, for the reason you pointed out.


Strange, the page showed up first when I googled "tcmalloc" and the problem was also present in the version that I was using (at least I think it was). My apologies.

Yeah, regarding [2], that was definitely not my idea.


Not your fault. We just plain forgot to update the documentation, so the freshest available document is a few years out of date.


wow i didn't realize madvise could actually modify the memory contents and return pages, but it makes sense that it can because that is a useful feature. very cool!


i don't know of any malloc implementations that return memory to the system. the only way to do that is to pass a negative value to sbrk, which requires all the memory being returned to be at the end of the data segment. even if you free 99% of the memory you were using, if one byte is still in use at the end of the data segment no memory can be returned. this is almost always the case in practice, so no malloc implementations bother to return memory in this way. on the other hand, unmapping an anonymous mmap does return the memory to the system. most mallocs handle large allocations by delegating to anonymous mmap for this reason. you probably could've tweaked the mmap threshold for tcmalloc to get the same result.

edit: apparently tcmalloc is using mmap to allocate more of its memory than i realized, not sure why i thought it was using sbrk for everything


OpenBSD's malloc implementation does: "On a call to free, memory is released and unmapped from the process address space using munmap."

http://en.wikipedia.org/wiki/C_dynamic_memory_allocation#Ope...


You mean address space. The OS can and will reclaim all the memory it likes, when it likes.

Pages that are mapped but are left untouched for a long time aren't problematic in modern systems. There is a small cost for the PTE but nothing like an entire page of physical memory.


I was talking about memory that was used in the past, but is no longer used.

The OS can either keep it resident or swap it out, but it has no way of knowing that it is no longer in use (short of something like madvise()). In the realm of sanity, the OS can't just arbitrarily throw away memory that a process has written to (unless it also kills the process).


Good allocators are good for different things, but what the glibc allocator is good for is yet to be discovered: it fragments like a glass dropped on the floor and has contention issues.


Stability. Give it credit for being - probably - the most widely used implementation of malloc in the world.

I say this as someone who has implemented a lock-free memory allocator for multithreaded applications. I cared about performance, and I was willing to sacrifice nice things like detecting double-frees. I moved away from the project largely because I didn't want to be in a performance race with TCMalloc. (At the end, TCMalloc outperformed my allocator in some benchmarks, but not in others. But, surprisingly, there were also some places where glibc outperformed both.)


It could be that the MSVCRT implementation is the most used one (or actually maybe the one behind HeapAlloc and so on). How would one know for sure :)

It's probably also used in all Xboxes too...


jemalloc is the new hotness for MySQL. We are using it at Facebook (and I know percona/oracle use it for benchmarks and testing as well).

Good benchmark showing the impact of the different options:

http://www.mysqlperformanceblog.com/2012/07/05/impact-of-mem...


That blog makes it look like it and tcmalloc are roughly on-par. Do you think there's any possibility that FB leans toward jemalloc chiefly because the author works there, and the author of tcmalloc works at Google?


Firefox uses jemalloc, too.


Redis as well.


Apparently once you start having more threads than cores, tcmalloc really shines:

http://i.imgur.com/4RzmQD6.png

Looks like those on centos can install it easily via

  yum install gperftools-libs --enablerepo=epel

which installs

  /usr/lib64/libtcmalloc.so.4
  /usr/lib64/libtcmalloc_minimal.so.4

then you just need to edit your mysql init script?

  test -e /usr/lib64/libtcmalloc_minimal.so.4 && export LD_PRELOAD="/usr/lib64/libtcmalloc_minimal.so.4"

You can also try jemalloc which supposedly is close to as good as tcmalloc but uses less memory

  yum install jemalloc --enablerepo=epel

which installs

  /usr/lib64/libjemalloc.so.1

and for your init.d

  test -e /usr/lib64/libjemalloc.so.1 && export LD_PRELOAD="/usr/lib64/libjemalloc.so.1"
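
If an init script should cope with either library being installed, the two `test -e` lines above could be folded into one fallback. A rough sketch only: `pick_malloc_lib` is a hypothetical helper, and the paths are simply the ones listed above, which may differ on other systems.

```shell
#!/bin/sh
# pick_malloc_lib PATH...: print the first candidate library that exists.
# Hypothetical helper for an init script; returns non-zero if none is found.
pick_malloc_lib() {
    for candidate in "$@"; do
        if [ -e "$candidate" ]; then
            printf '%s\n' "$candidate"
            return 0
        fi
    done
    return 1
}

# Prefer jemalloc, fall back to tcmalloc, otherwise leave LD_PRELOAD unset.
if lib=$(pick_malloc_lib /usr/lib64/libjemalloc.so.1 \
                         /usr/lib64/libtcmalloc_minimal.so.4); then
    export LD_PRELOAD="$lib"
fi
```

When neither file exists, LD_PRELOAD is left alone and mysqld starts with the default allocator.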


oh and apparently mysql 5.5 users (not 5.1) can just set this directly in my.cnf

  [mysqld_safe]
  malloc-lib=/usr/lib64/libtcmalloc_minimal.so.4

or

  malloc-lib=/usr/lib64/libjemalloc.so.1

http://dev.mysql.com/doc/refman/5.5/en//mysqld-safe.html#opt...

no export or script editing required


This is what I do at $dayjob.

We have been using tcmalloc for a while on our databases, as well as disabling the transparent huge pages and transparent huge page defrag (centos6). It made a big difference for us.
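
For anyone wanting to check the transparent huge page settings mentioned above, here is a sketch that reports the active value. It assumes that CentOS/RHEL 6 exposes the knobs under a `redhat_`-prefixed sysfs directory while other kernels use the unprefixed one; `thp_active` is a hypothetical helper name.

```shell
#!/bin/sh
# thp_active: read a THP sysfs line on stdin (e.g. "always madvise [never]")
# and print the bracketed, currently active value.
thp_active() {
    sed -n 's/.*\[\([^]]*\)].*/\1/p'
}

# Report the current setting from whichever sysfs directory exists.
# Assumption: CentOS/RHEL 6 uses the redhat_ prefixed directory.
for dir in /sys/kernel/mm/redhat_transparent_hugepage \
           /sys/kernel/mm/transparent_hugepage; do
    for knob in enabled defrag; do
        if [ -r "$dir/$knob" ]; then
            printf '%s: %s\n' "$dir/$knob" "$(thp_active < "$dir/$knob")"
        fi
    done
done

# To disable (as root), one would write "never" to the same files, e.g.:
#   echo never > /sys/kernel/mm/redhat_transparent_hugepage/enabled
#   echo never > /sys/kernel/mm/redhat_transparent_hugepage/defrag
```

Writes to sysfs do not survive a reboot, so the setting would also have to be applied at boot time.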


Can I ask you a dumb question: I think I just turned it on properly, but I have no idea how to proactively confirm that mysql is actually using jemalloc, rather than just waiting for better performance numbers?

Because it's an external environment variable, it doesn't actually show inside any of mysql's settings. No startup errors or runtime problems is always nice but I really am curious to know for a fact it worked.

Will probably have to ask this on stackexchange if you don't know.


With tcmalloc I get a few messages in the log about a large allocation on startup, but you can probably find it with this.

    # as root or sudo
    pmap -x $(pidof mysqld)|grep malloc
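
A similar check can be made against /proc/PID/maps directly. The following is a sketch under the assumption that the preloaded library shows up by file name in the maps file; `allocator_in_maps` is a hypothetical helper, and `grep -o` is a GNU/BSD extension rather than strict POSIX.

```shell
#!/bin/sh
# allocator_in_maps FILE: list distinct allocator libraries named in a
# /proc/PID/maps-style file. Hypothetical helper; matches on file name only.
allocator_in_maps() {
    grep -oE '[^ ]*lib(tcmalloc|jemalloc)[^ ]*\.so[^ ]*' "$1" | sort -u
}

# Example against a running server (needs permission to read its maps):
#   allocator_in_maps "/proc/$(pidof mysqld)/maps"
```

Empty output would suggest that neither tcmalloc nor jemalloc is mapped into the process.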



nedmalloc [0] is my absolute favorite and pretty much owns everything else in terms of performance (esp. multi-threaded memory allocations), though I would not use it at the scale Facebook and GitHub are running on. It has subtle bugs that creep in and get fixed down the road. jemalloc and tcmalloc are very heavily tested and vetted, though and are great options. Basically, anything other than the default allocator on Windows/Mac/Linux is fine :)

The author of nedmalloc is working on a very exciting C++ API (actually, I think it's API-complete now) to make it a drop-in STL allocator. I personally use the C API in my C++ applications without a problem, mainly as a pool allocator. For me, the Windows allocators (both the old default and the new "low-fragmentation" default) are absolutely abysmal at deallocation. Pool allocators in general make that go away.

0: http://www.nedprod.com/programs/portable/nedmalloc/


Shopify runs TCMalloc for mysql as well.


I don't work with MySQL or Rails, but I read this all the way through, mostly because the story was well told.

Strikes me as a perfect example of a culture that works hard and enjoys the hell out of it too.


They found a proverbial "silver bullet" in performance land. This almost never happens, but props to them for finding it. Now time to try this out!


It's fairly common to see double-digit percent changes when swapping out lower-level component implementations or versions (compiler, JVM, OS, kernel, etc). That can be a good thing, as in the case here where they found a win, or an awful thing.


This is interesting and from a black box view of MySQL, this is a good solution. For the MySQL developers, it seems like an opportunity for improvement. When you get bottlenecked on malloc() it usually means you are frequently allocating many small objects. To me this sounds like a good opportunity to use a memory pool allocator (or find a way in the code to do fewer allocations).


I've had mixed feelings about tcmalloc on Windows - that was 4-5 years ago, so things might be better. It was doing some hooking, looking for places to replace standard malloc/free/etc. throughout the whole address space, and again whenever new DLLs were loaded. Other than that, except when it was crashing for no reason (on some Windows 2003 servers for example), it was pretty good.


Try, as well, setting the value of tcmalloc.max_total_thread_cache_bytes to something larger than 16MB (the default). Reasonable values might range all the way to 1GB or more. Best to experiment and get data.
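
One way to experiment with this from an init script is through the environment. A sketch only: it assumes gperftools picks up an environment variable named TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES at startup and that mysqld inherits the environment; `mb_to_bytes` is a hypothetical helper, and 256 MB is an arbitrary trial value.

```shell
#!/bin/sh
# mb_to_bytes N: convert mebibytes to bytes using shell arithmetic.
mb_to_bytes() {
    echo $(( $1 * 1024 * 1024 ))
}

# Assumption: gperftools reads this environment variable at startup;
# 256 MB is an arbitrary starting point, not a recommendation.
TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES=$(mb_to_bytes 256)
export TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES
echo "$TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES"
```

The value should be judged against measurements from the actual workload, as the comment above suggests.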



