Writing a simple pool allocator in C
This code is very beautifully written, thanks for sharing.
However, you should consult the book "C: Interfaces and Implementations" by Dave Hanson, which has a library called "Arena" that is very similar to your "Pool"s, and it shows a few more tricks to make the code safer and easier to read (Chapters 5+6, 6 in particular).
D. Hanson's book (written in the 'literate programming' style invented by Donald Knuth) also demonstrates having debugging and production implementations for the same C API to catch memory errors, and his book is from 1997:
> Copyright © 1997 by David R. Hanson
(He used to be a SE professor at Princeton before getting hired by Microsoft, I believe.)
If you write your own allocator in C, do yourself a favor and use the valgrind API inside it. Its use can be conditional so you can "compile it out". Or should I say it has to be, so you can build the code in an environment where there's no valgrind API.
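A minimal sketch of what "compiling it out" could look like. The build flag POOL_USE_VALGRIND, the POOL_* wrapper macros, and the TinyPool type are hypothetical names made up for this illustration; only the VALGRIND_MAKE_MEM_* client requests come from valgrind's memcheck.h.

```c
#include <assert.h>
#include <stddef.h>

/* POOL_USE_VALGRIND is a hypothetical build flag: define it only where
 * the valgrind headers are installed. */
#ifdef POOL_USE_VALGRIND
#include <valgrind/memcheck.h>
#define POOL_NOACCESS(p, n)  VALGRIND_MAKE_MEM_NOACCESS((p), (n))
#define POOL_UNDEFINED(p, n) VALGRIND_MAKE_MEM_UNDEFINED((p), (n))
#define POOL_DEFINED(p, n)   VALGRIND_MAKE_MEM_DEFINED((p), (n))
#else
/* Compiled out: the wrappers expand to no-ops. */
#define POOL_NOACCESS(p, n)  ((void)0)
#define POOL_UNDEFINED(p, n) ((void)0)
#define POOL_DEFINED(p, n)   ((void)0)
#endif

#define TINY_CHUNKS 8

typedef union Chunk {
    union Chunk *next;   /* valid only while the chunk is free */
    char payload[32];    /* user data while the chunk is in use */
} Chunk;

typedef struct {
    Chunk  chunks[TINY_CHUNKS];
    Chunk *free_list;
} TinyPool;

void tiny_pool_init(TinyPool *p) {
    p->free_list = NULL;
    for (size_t i = 0; i < TINY_CHUNKS; i++) {
        p->chunks[i].next = p->free_list;
        p->free_list = &p->chunks[i];
    }
    /* Under memcheck, touching a free chunk is now reported. */
    POOL_NOACCESS(p->chunks, sizeof p->chunks);
}

void *tiny_pool_alloc(TinyPool *p) {
    Chunk *c = p->free_list;
    if (c == NULL)
        return NULL;
    POOL_DEFINED(c, sizeof *c);    /* let the allocator read its own link */
    p->free_list = c->next;
    POOL_UNDEFINED(c, sizeof *c);  /* addressable, contents uninitialized */
    return c;
}

void tiny_pool_free(TinyPool *p, void *ptr) {
    assert(ptr != NULL);
    Chunk *c = ptr;
    c->next = p->free_list;
    p->free_list = c;
    POOL_NOACCESS(c, sizeof *c);   /* use-after-free is now reported */
}
```

Built without the flag, this behaves like a plain free-list pool; built with it and run under valgrind, reads of freed chunks should show up as invalid accesses.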
One aspect people don't tend to mention (and TFA does not because it has "general purpose" as a scope) about doing your own allocation arenas is that because a "pointer" is just the name (really number) of an allocated region, this also affords flexibility in the width of those numbers aka "pointer size" that decides your "address" space.
So, for example, if you are willing to limit yourself to 64 KiNodes of equal sized objects, a pool like https://github.com/c-blake/bst/blob/main/POOL.c can use an unsigned short 2-byte pointer. This small address space specialization can give a full 4x space overhead optimization over modern 64-bit pointers.
You could even use 8x less with 1-byte pointers if 256 nodes is enough or use bit fields or equivalents for even more fine-grained address space size selection. Admittedly, linked structures are usually less interesting for very small address space sizes, but this is sometimes useful. People may complain about arbitrary limits that relate to this kind of optimization, but the linked part could be "small by design" without compromising overall scale { such as for structure within a B-tree node or other multi-level indexing setups }.
In any event, I think it is useful to highlight pointer size arbitrariness to junior engineers. It is often learned once & immediately forgotten behind an onslaught of general purpose-ness (& pointer type systems/syntax). Relocation ideas also relate. E.g., if all the pointers are implicitly indices into a (small?) array, and all the code needs a "base address" of that array to get a VMem address, then you can relocate the whole bundle of objects pretty easily changing only the base address.
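To make the index-as-pointer idea concrete, here is a small sketch (not the code from the linked POOL.c). The names Node, IndexPool, NIL and the index_pool_* functions are invented for this example. Links are 16-bit indices into one array, and relocation only changes the base address.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define NIL UINT16_MAX   /* index value reserved as the "null pointer" */

typedef struct {
    int      value;
    uint16_t next;       /* 2-byte "pointer": an index into nodes[] */
} Node;

typedef struct {
    Node    *nodes;      /* base address: the only full-width pointer */
    uint16_t count;
    uint16_t free_head;
} IndexPool;

void index_pool_init(IndexPool *p, Node *storage, uint16_t count) {
    assert(count > 0 && count < NIL);
    p->nodes = storage;
    p->count = count;
    /* Thread every node onto the free list: 0 -> 1 -> ... -> NIL. */
    for (uint16_t i = 0; i < count; i++)
        storage[i].next = (uint16_t)(i + 1 < count ? i + 1 : NIL);
    p->free_head = 0;
}

/* Returns the index of a free node, or NIL when the pool is exhausted. */
uint16_t index_pool_alloc(IndexPool *p) {
    uint16_t i = p->free_head;
    if (i != NIL)
        p->free_head = p->nodes[i].next;
    return i;
}

void index_pool_free(IndexPool *p, uint16_t i) {
    assert(i < p->count);
    p->nodes[i].next = p->free_head;
    p->free_head = i;
}

/* Relocation: copy the array, then change only the base address.
 * Every stored index stays valid because it is relative to the base. */
void index_pool_relocate(IndexPool *p, Node *new_storage) {
    memcpy(new_storage, p->nodes, (size_t)p->count * sizeof(Node));
    p->nodes = new_storage;
}
```

Code that follows a link writes p->nodes[i].next instead of node->next, so the base address has to be available wherever links are dereferenced.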
Hello, I am the author. Thank you all for the instructive comments, I made some changes to the article since I first posted it:
- Added a link to this HN post.
- Renamed some variables and types in the code for readability.
- Mention (in a footnote) that we could allocate the 'Pool' structure and the chunk arrays with a single call, or even return the 'Pool' on the stack. Suggested by 'liontwist' and 'unwind'.
- Mention (in a footnote) that we don't always need to store the full address when building the linked list of free chunks. Suggested by 'cb321'.
- Added valgrind support (also to my 'libpool' project). Suggested by 'kazinator'.
A long time ago, in a galaxy far far away, I wrote something similar for an embedded platform. I was implementing a WAP Browser at the time (yes, WAP, https://en.wikipedia.org/wiki/Wireless_Application_Protocol), and we needed dynamic memory allocation in an environment where everything was static due to realtime constraints.
So I ended up with this: https://github.com/jinie/memmgr
You can avoid resizing and moving the pools if you allocate massive pools up front. Most OSs let you overcommit memory, and most programs using memory pools only have a handful of them.
(On Windows, you'd need to handle this with VirtualAlloc. OS X and Linux let you overcommit with malloc, and handle it with a copy-on-write page fault handler.)
A sometimes useful thing is to treat the pool as a stack and provide a call to free all recently allocated items up to a certain index. So you make a bunch of deep subroutine calls that use the pool allocator, and then on return free them all. It's like a secondary automatic storage class.
Also sometimes useful to provide another call to promote an automatic item to a more permanent one so that it is excluded from being freed like this.
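A rough sketch of the stack-style usage described above, covering only the mark/release part and not the "promote" call. StackPool and the stack_pool_* names are made up for illustration. The buffer passed to stack_pool_init is assumed to be suitably aligned.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    unsigned char *base;
    size_t cap;
    size_t top;      /* offset of the next free byte */
} StackPool;

void stack_pool_init(StackPool *s, void *buf, size_t cap) {
    s->base = buf;
    s->cap = cap;
    s->top = 0;
}

/* Bump allocation; each block starts at a max_align_t-aligned offset. */
void *stack_pool_alloc(StackPool *s, size_t n) {
    size_t align = _Alignof(max_align_t);
    size_t start = (s->top + align - 1) & ~(align - 1);
    if (start > s->cap || n > s->cap - start)
        return NULL;
    s->top = start + n;
    return s->base + start;
}

/* Remember the current top so a later release can return to it. */
size_t stack_pool_mark(const StackPool *s) {
    return s->top;
}

/* Free everything allocated since the matching stack_pool_mark(). */
void stack_pool_release(StackPool *s, size_t mark) {
    assert(mark <= s->top);
    s->top = mark;
}
```

A caller would take a mark before a batch of deep calls and release it on return, which frees all of their allocations in one step.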
The parent article is quite interesting and slightly different being cpp [1]. There is also a post about writing a memory allocator [2].
[1]: http://dmitrysoshnikov.com/compilers/writing-a-pool-allocato...
[2]: http://dmitrysoshnikov.com/compilers/writing-a-memory-alloca...
Here's another interesting O(1) memory allocator but with arbitrary sized allocations and low fragmentation. Negative side is relatively high memory overhead (a few dozen bytes per allocation).
This kind of allocator is often used to suballocate GPU memory in game and graphics applications.
I'm using a variant of this algorithm with added support for shrinkable and aligned allocations and flexible bin sizing.
You can also extend this idea to two dimensions to create texture atlases, which is possible in O(1) for power of two sized allocations.
Original: https://github.com/sebbbi/OffsetAllocator Rust port: https://crates.io/crates/offset-allocator
There is a small series of posts on the BitSquid blog about memory allocation which is worth reading! http://bitsquid.blogspot.com/2015/08/allocation-adventures-3...
Also "Arena allocator tips and tricks (nullprogram.com)" [0]
Note that using this will likely violate strict aliasing due to using the void pointer returned as one type, freeing it, and then getting that same chunk again and using it as another type. You'll probably want to compile with -fno-strict-aliasing to be safe.
A good reference on strict aliasing: https://gist.github.com/shafik/848ae25ee209f698763cffee272a5...
This brings back old memories. I implemented something similar on a Motorola 68360 ages ago. Since the size of the buffer I used was not huge, I skipped the pointers and simply enumerated each chunk. It was remarkably efficient and stable.
The pool struct is two pointers, why are you allocating it with malloc?
Another (admittedly less portable) way to solve the resizing problem is to reserve a large virtual memory space using Linux mmap with the PROT_NONE flag and then commit pages as needed using mprotect with PROT_READ|PROT_WRITE flags. I believe Windows' VirtualAlloc/VirtualProtect also allows this.
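The reserve-then-commit approach could look roughly like this on Linux. VmRegion and the vm_region_* functions are hypothetical names; mmap, mprotect, munmap and sysconf are the real POSIX/Linux calls. The sketch assumes the reserved size is a multiple of the page size.

```c
#define _DEFAULT_SOURCE   /* exposes MAP_ANONYMOUS on glibc */
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

typedef struct {
    unsigned char *base;
    size_t reserved;    /* bytes of address space reserved */
    size_t committed;   /* bytes currently readable and writable */
} VmRegion;

/* Reserve address space only: PROT_NONE pages cannot be touched yet. */
int vm_region_reserve(VmRegion *r, size_t reserve_bytes) {
    void *p = mmap(NULL, reserve_bytes, PROT_NONE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;
    r->base = p;
    r->reserved = reserve_bytes;
    r->committed = 0;
    return 0;
}

/* Make the first new_bytes (rounded up to whole pages) usable.
 * The base address never changes, so existing pointers stay valid. */
int vm_region_grow(VmRegion *r, size_t new_bytes) {
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    size_t want = (new_bytes + page - 1) / page * page;
    if (want > r->reserved)
        return -1;
    if (want <= r->committed)
        return 0;
    if (mprotect(r->base + r->committed, want - r->committed,
                 PROT_READ | PROT_WRITE) != 0)
        return -1;
    r->committed = want;
    return 0;
}

void vm_region_release(VmRegion *r) {
    assert(r->base != NULL);
    munmap(r->base, r->reserved);
    r->base = NULL;
    r->reserved = r->committed = 0;
}
```

A pool built on top of this would call vm_region_grow when it runs out of chunks instead of calling realloc, so chunk addresses handed to users never move.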
Some things that would be interesting to add to this:
- thread safety (not too hard to add on top of your linked list)
- variable sized allocations. Ideally something more akin to malloc. Could be done wastefully by rounding up to the nearest chunk size
- (as mentioned in the article) variable chunk sizes
- zeroing of memory before handing it back to users
- double free detection or even better full asan support
I usually default to slab allocators if I have to write my own:
It seems like almost all of the complexity of this allocator comes from having to manage the fact that it's being implemented on top of the system allocator. I thought the whole point of using a custom allocator was to avoid the system allocator for exactly the problems they mention e.g. calling the system-provided realloc() might move your stuff around, having to track separate pool starts so you can call the system-provided free() on them, etc.
Like, yeah you have to do that, but I thought the whole point of writing a custom allocator was to not use the system allocator beyond statically allocating a huge block of memory upfront and then managing that yourself, is it not?
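For comparison, a pool that never touches the system allocator might look like the sketch below: one statically sized block reserved up front and a free list threaded through it. CHUNK_SIZE, CHUNK_COUNT and the static_pool_* names are invented for this example and are not taken from the article.

```c
#include <assert.h>
#include <stddef.h>

#define CHUNK_SIZE  64
#define CHUNK_COUNT 128

typedef union Chunk {
    union Chunk  *next;               /* link while the chunk is free */
    max_align_t   align;              /* forces worst-case alignment */
    unsigned char bytes[CHUNK_SIZE];  /* user data while in use */
} Chunk;

/* The whole pool is reserved at link time: no malloc, realloc or free. */
static Chunk  storage[CHUNK_COUNT];
static Chunk *free_list;   /* chunks that were handed out and returned */
static size_t fresh;       /* number of chunks handed out at least once */

void *static_pool_alloc(void) {
    if (free_list != NULL) {
        Chunk *c = free_list;
        free_list = c->next;
        return c;
    }
    if (fresh < CHUNK_COUNT)
        return &storage[fresh++];   /* lazily carve a never-used chunk */
    return NULL;                    /* fixed capacity: cannot grow */
}

void static_pool_free(void *ptr) {
    assert(ptr != NULL);
    Chunk *c = ptr;
    c->next = free_list;
    free_list = c;
}
```

The price is the hard capacity limit: there is nothing to resize, so there is also no array of pool starts to track and no pointer that can move.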
ThreadX has pool allocators: https://github.com/eclipse-threadx/threadx/blob/master/commo...
The Latin Modern font on this website does not look so good in Chrome. Is it because of the font size, or is it just me?
It all depends on the load profile using the allocator. You never know, that said you cannot beat, in theory, a semantically specialized allocator for a specific load... in other words, the right way(TM). This means applications should bundle their own allocators for the various load types they have. The "generic" allocator is sort of a heresy for the lazy, the short-termists, or those in a hurry. Don't worry, I still use such a generic allocator sometimes, but often I mmap the memory myself and deal with it directly.
This is really informative. I wonder if there is a better solution for the reallocation problem, one that preserves pointers.
This is a very nice article! The diagrams in particular are very clear. (They seem to have been done in draw.io: https://github.com/8dcc/8dcc.github.io/blob/main/img/pool-al... which seems to be a pretty decent Apache-2.0 free software licensed diagram editor: https://github.com/jgraph/drawio/)
I think it would be improved further if it began with the calling interface callers use to invoke the allocator, because it's easier to understand the implementation of some service such as pool allocation when you begin by knowing what service, exactly, is required. I began with the assumption that the author was describing an arena allocator under another name, rather than a fixed-size block allocator, both because the APR's influential arena allocator is called "apr_pool_t" and because one of the main headings in the table of contents (a greatly appreciated feature, like all the typography of the article) was "Closing the pool". It took me quite a while to figure out that I was mistaken. It's plausible that I only had this problem because I am an unusually stupid, ignorant, or pigheaded person (some would say all three), but, if not, many other people will have the same difficulty I had of taking several minutes to figure out what the topic of the article is. (I could have saved myself by clicking either of the first two links in the Introduction, but I didn't because I thought I already knew what it was.)
The current top-voted comment on this thread is by someone who had the same reading comprehension problem I did, but never realized their error: https://news.ycombinator.com/item?id=42643951 and nobody had pointed out their mistake until I did just now. So I think we can say with some confidence that many HN readers will have the same difficulty.
I think that the article would also be improved by a few other small additions:
- It says, "The pool allocator, however, is much faster than malloc", but doesn't provide any benchmarks or even an order of magnitude.
- Probably when one benchmarks the implementation given, one will find that it is only slightly faster than jemalloc or even gnumalloc, because they both dispatch small allocations to a fixed-block allocator similar to this one, and the dispatch isn't very expensive. This can be fixed by inlining the allocation fast path and the deallocation function, which will then compile down to a few instructions each. A simple demonstration of how to do this is in my arena allocator kmregion in http://canonical.org/~kragen/sw/dev3/kmregion.h and http://canonical.org/~kragen/sw/dev3/kmregion.c (GPLv3+). The current implementation of 8dcc libpool is guaranteed to be unable to do this unless you're using link-time optimization or a so-called "unity" or "amalgamation" build process.
- It's worth distinguishing between average-case performance (in which this allocator is slightly better than malloc) and worst-case performance (in which case malloc, generally speaking, isn't even in the running).
- It may be worth mentioning that often such pool allocators are used in conjunction with initializing the objects in the pool to a valid state when the pool is created; that way you don't have to reinitialize objects to a valid state every time you deallocate and reallocate them, which is usually more work than malloc does to find the right free list. This does require not overwriting the user data with your freelist pointer, so it's a space/time tradeoff.
- Maybe you should mention alignment, especially now that even on amd64 GCC has taken to using new instructions that require alignment.
Even as it is, though, it's highly educational, visually pleasant, and very understandable (once I got over my initial misconception of having a totally wrong idea of what the article was about).
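Two of the suggestions above (an inlinable fast path, and objects that stay initialized across free and reallocation) can be sketched together. This is not how 8dcc's libpool or kmregion are written. Object, ObjectPool and the object_pool_* names are assumptions for illustration. The free list lives in a side array of indices so it never overwrites object contents.

```c
#include <assert.h>
#include <stddef.h>

#define OBJ_COUNT 16

typedef struct {
    int  id;
    char name[24];
    int  ready;       /* set once, when the pool is created */
} Object;

typedef struct {
    Object objects[OBJ_COUNT];
    size_t free_slots[OBJ_COUNT];  /* stack of free indices, outside the objects */
    size_t free_top;               /* number of entries in free_slots */
} ObjectPool;

/* Initialize every object to a valid state exactly once. */
void object_pool_init(ObjectPool *p) {
    for (size_t i = 0; i < OBJ_COUNT; i++) {
        p->objects[i].id = (int)i;
        p->objects[i].name[0] = '\0';
        p->objects[i].ready = 1;
        p->free_slots[i] = OBJ_COUNT - 1 - i;  /* slot 0 ends up on top */
    }
    p->free_top = OBJ_COUNT;
}

/* static inline in a header lets callers inline the fast path
 * without link-time optimization. */
static inline Object *object_pool_alloc(ObjectPool *p) {
    if (p->free_top == 0)
        return NULL;
    return &p->objects[p->free_slots[--p->free_top]];
}

static inline void object_pool_free(ObjectPool *p, Object *o) {
    size_t i = (size_t)(o - p->objects);
    assert(i < OBJ_COUNT && p->free_top < OBJ_COUNT);
    p->free_slots[p->free_top++] = i;   /* object contents are left untouched */
}
```

The side array costs one extra index per object, which is the space/time tradeoff mentioned in the list above.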
I recommend against using linked lists for bookkeeping the free blocks. It seems to be the data structure that every malloc/free implementation reaches for, and I don't know why - the slowness of pointer-chasing makes it terrible for almost any real-world use case. A balanced tree would be a much better idea, given that all the essential operations would take O(log n) time instead of O(n). Even if one insists on a linear search, a bitset is much more cache friendly than pointer-chasing and it can trivially benefit from SIMD optimizations.
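As an illustration of the bitset variant only (not a benchmark, and not a claim about which structure is faster), here is a sketch for fixed-size blocks. BitsetPool and the bitset_pool_* names are hypothetical. __builtin_ctzll is a GCC/Clang builtin, so this assumes one of those compilers. The scan is linear in the number of 64-bit words.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BLOCKS 256
#define WORDS  (BLOCKS / 64)

typedef struct {
    uint64_t free_bits[WORDS];   /* bit set = block is free */
} BitsetPool;

void bitset_pool_init(BitsetPool *p) {
    for (size_t w = 0; w < WORDS; w++)
        p->free_bits[w] = ~(uint64_t)0;
}

/* Returns the index of a free block, or -1 if none is left. */
int bitset_pool_alloc(BitsetPool *p) {
    for (size_t w = 0; w < WORDS; w++) {
        uint64_t bits = p->free_bits[w];
        if (bits != 0) {
            int b = __builtin_ctzll(bits);        /* lowest set bit */
            p->free_bits[w] = bits & (bits - 1);  /* clear that bit */
            return (int)(w * 64) + b;
        }
    }
    return -1;
}

/* Returns 0 on success, -1 if the block was already free (double free). */
int bitset_pool_free(BitsetPool *p, int idx) {
    assert(idx >= 0 && idx < BLOCKS);
    uint64_t mask = (uint64_t)1 << (idx % 64);
    if (p->free_bits[idx / 64] & mask)
        return -1;
    p->free_bits[idx / 64] |= mask;
    return 0;
}
```

A side effect of keeping the state in a bitset is that double frees are detectable with a single bit test.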
(related) LLFree
Paper: LLFree: Scalable and Optionally-Persistent Page-Frame Allocation https://www.usenix.org/system/files/atc23-wrenger.pdf
Video presentation: https://www.youtube.com/watch?v=yvd3D5VOHc8
Implementation and benchmarks are well documented at the repos:
Rust repo https://github.com/luhsra/llfree-rs C repo https://github.com/luhsra/llfree-c