Writing a simple pool allocator in C
This code is very beautifully written, thanks for sharing.
However, you should consult the book "C: Interfaces and Implementations" by Dave Hanson, which has a library called "Arena" that is very similar to your "Pool", and which shows a few more tricks to make the code safer and easier to read (chapters 5 and 6, chapter 6 in particular).
D. Hanson's book (written in the 'literate programming' style invented by Donald Knuth) also demonstrates having debugging and production implementations of the same C API to catch memory errors, and his book is from 1997:
> Copyright © 1997 by David R. Hanson
(He used to be a CS professor at Princeton before getting hired by Microsoft, I believe.)

If you write your own allocator in C, do yourself a favor and use the valgrind API inside it. Its use can be conditional so you can "compile it out". Or rather, it has to be, so you can build the code in an environment where there's no valgrind API.
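For example, a rough sketch of what conditional valgrind hooks can look like (the HAVE_VALGRIND switch and the pool_track_* helper names are illustrative, not from any particular library):

    #include <stddef.h>

    #ifdef HAVE_VALGRIND                /* build-time switch; compiles out otherwise */
    #  include <valgrind/valgrind.h>    /* VALGRIND_MALLOCLIKE_BLOCK / FREELIKE_BLOCK */
    #  include <valgrind/memcheck.h>    /* VALGRIND_MAKE_MEM_* requests */
    #else
    #  define VALGRIND_MALLOCLIKE_BLOCK(addr, size, rz, zeroed) ((void)0)
    #  define VALGRIND_FREELIKE_BLOCK(addr, rz)                 ((void)0)
    #endif

    /* Call right after the pool hands a chunk to the caller... */
    static void pool_track_alloc(void *chunk, size_t size)
    {
        VALGRIND_MALLOCLIKE_BLOCK(chunk, size, /*redzone*/ 0, /*zeroed*/ 0);
    }

    /* ...and right when the caller gives it back, so memcheck can flag
       use-after-free. Note: the pool's own free-list write into the chunk
       afterwards must be wrapped with VALGRIND_MAKE_MEM_UNDEFINED /
       VALGRIND_MAKE_MEM_NOACCESS to avoid false positives. */
    static void pool_track_free(void *chunk)
    {
        VALGRIND_FREELIKE_BLOCK(chunk, /*redzone*/ 0);
    }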
One aspect people don't tend to mention (and TFA does not, because it has "general purpose" as a scope) about doing your own allocation arenas is that, because a "pointer" is just the name (really, a number) of an allocated region, you also get flexibility in the width of those numbers, a.k.a. the "pointer size" that determines your "address" space.
So, for example, if you are willing to limit yourself to 64 KiNodes of equal-sized objects, a pool like https://github.com/c-blake/bst/blob/main/POOL.c can use an unsigned short as a 2-byte pointer. This small-address-space specialization gives a full 4x saving in link overhead compared with modern 64-bit pointers.
You could even use 8x less with 1-byte pointers if 256 nodes is enough or use bit fields or equivalents for even more fine-grained address space size selection. Admittedly, linked structures are usually less interesting for very small address space sizes, but this is sometimes useful. People may complain about arbitrary limits that relate to this kind of optimization, but the linked part could be "small by design" without compromising overall scale { such as for structure within a B-tree node or other multi-level indexing setups }.
In any event, I think it is useful to highlight pointer size arbitrariness to junior engineers. It is often learned once & immediately forgotten behind an onslaught of general purpose-ness (& pointer type systems/syntax). Relocation ideas also relate. E.g., if all the pointers are implicitly indices into a (small?) array, and all the code needs a "base address" of that array to get a VMem address, then you can relocate the whole bundle of objects pretty easily changing only the base address.
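A tiny sketch of the idea (not the linked POOL.c, just an illustration with made-up names): links are 16-bit indices rather than machine pointers, and relocating the whole pool only means changing the base address.

    #include <stdint.h>

    #define NIL 0xFFFF               /* sentinel index; limits us to 65535 nodes */

    typedef struct {
        int      value;
        uint16_t next;               /* index of the next node, not a VM address */
    } Node;

    typedef struct {
        Node    *base;               /* the only VM address stored anywhere       */
        uint16_t free_head;          /* head of the free list, as an index        */
        uint16_t cap;
    } Pool16;

    static void pool16_init(Pool16 *p, Node *storage, uint16_t cap)
    {
        p->base = storage;
        p->cap = cap;
        p->free_head = (cap > 0) ? 0 : NIL;
        for (uint16_t i = 0; i < cap; i++)   /* thread all nodes onto the free list */
            storage[i].next = (uint16_t)(i + 1 < cap ? i + 1 : NIL);
    }

    static uint16_t pool16_alloc(Pool16 *p)  /* returns an index, not a pointer */
    {
        uint16_t i = p->free_head;
        if (i != NIL)
            p->free_head = p->base[i].next;
        return i;
    }

    static void pool16_free(Pool16 *p, uint16_t i)
    {
        p->base[i].next = p->free_head;
        p->free_head = i;
    }

    /* Relocation: copy the node array elsewhere and update only p->base;
       every stored link (an index) stays valid as-is. */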
Hello, I am the author. Thank you all for the instructive comments; I made some changes to the article since I first posted it:
- Added a link to this HN post.
- Renamed some variables and types in the code for readability.
- Mention (in a footnote) that we could allocate the 'Pool' structure and the chunk arrays with a single call, or even return the 'Pool' on the stack. Suggested by 'liontwist' and 'unwind'.
- Mention (in a footnote) that we don't always need to store the full address when building the linked list of free chunks. Suggested by 'cb321'.
- Added valgrind support (also to my 'libpool' project). Suggested by 'kazinator'.
A long time ago, in a galaxy far, far away, I wrote something similar for an embedded platform. I was implementing a WAP browser at the time (yes, WAP: https://en.wikipedia.org/wiki/Wireless_Application_Protocol), and we needed dynamic memory allocation in an environment where everything was static due to realtime constraints.
So I ended up with this: https://github.com/jinie/memmgr
You can avoid resizing and moving the pools if you allocate massive pools up front. Most OSes let you overcommit memory, and most programs using memory pools only have a handful of them.
(On Windows you'd need to handle this with VirtualAlloc; macOS and Linux let you overcommit with malloc and handle it with a copy-on-write page-fault handler.)
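For instance, a minimal Linux-flavoured sketch of that kind of up-front reservation (names are illustrative; on Windows you'd reserve with VirtualAlloc and MEM_RESERVE instead):

    #include <stddef.h>
    #include <sys/mman.h>

    /* Reserve a huge anonymous mapping up front; untouched pages cost only
       address space. MAP_NORESERVE asks Linux not to account swap for it. */
    static void *reserve_pool(size_t max_bytes)
    {
        void *p = mmap(NULL, max_bytes, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
        return p == MAP_FAILED ? NULL : p;
    }

    /* e.g.: void *pool = reserve_pool((size_t)64 << 30);  // 64 GiB of address space */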
A sometimes useful thing is to treat the pool as a stack and provide a call to free all recently allocated items up to a certain index. So you make a bunch of deep subroutine calls that use the pool allocator, and then on return free them all. It's like a secondary automatic storage class.
Also sometimes useful to provide another call to promote an automatic item to a more permanent one so that it is excluded from being freed like this.
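A minimal sketch of that mark/release style, assuming a simple bump-style pool of fixed-size chunks (the names here are made up); "promotion" would then mean copying an item below the saved mark, or into a longer-lived pool:

    #include <stddef.h>

    typedef struct {
        unsigned char *chunks;     /* backing storage: cap * chunk_size bytes */
        size_t chunk_size;
        size_t cap;
        size_t top;                /* number of chunks currently handed out   */
    } StackPool;

    typedef size_t PoolMark;

    static void *spool_alloc(StackPool *p)
    {
        if (p->top == p->cap)
            return NULL;
        return p->chunks + (p->top++) * p->chunk_size;
    }

    /* Record the current top before a deep call chain... */
    static PoolMark spool_mark(const StackPool *p) { return p->top; }

    /* ...and free everything allocated since that mark in one call on return. */
    static void spool_release(StackPool *p, PoolMark m) { p->top = m; }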
The parent article is quite interesting and slightly different, being C++ [1]. There is also a post about writing a memory allocator [2].
[1]: http://dmitrysoshnikov.com/compilers/writing-a-pool-allocato...
[2]: http://dmitrysoshnikov.com/compilers/writing-a-memory-alloca...
Here's another interesting O(1) memory allocator, but with arbitrary-sized allocations and low fragmentation. The negative side is relatively high memory overhead (a few dozen bytes per allocation).
This kind of allocator is often used to suballocate GPU memory in game and graphics applications.
I'm using a variant of this algorithm with added support for shrinkable and aligned allocations and flexible bin sizing.
You can also extend this idea to two dimensions to create texture atlases, which is possible in O(1) for power of two sized allocations.
Original: https://github.com/sebbbi/OffsetAllocator
Rust port: https://crates.io/crates/offset-allocator
There is a small series of posts on the BitSquid blog about memory allocation which is worth reading! http://bitsquid.blogspot.com/2015/08/allocation-adventures-3...
Also "Arena allocator tips and tricks (nullprogram.com)" [0]
Note that using this will likely violate strict aliasing due to using the void pointer returned as one type, freeing it, and then getting that same chunk again and using it as another type. You'll probably want to compile with -fno-strict-aliasing to be safe.
A good reference on strict aliasing: https://gist.github.com/shafik/848ae25ee209f698763cffee272a5...
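If you'd rather not rely on that flag, one common way to sidestep the type-punning question is to move the free-list link in and out of a chunk with memcpy instead of dereferencing the chunk as a node type. A rough sketch (not how the article's code does it):

    #include <string.h>

    /* Write the "next free chunk" link into the first bytes of a free chunk. */
    static void chunk_set_next(void *chunk, void *next)
    {
        memcpy(chunk, &next, sizeof next);
    }

    /* Read the link back when popping the chunk off the free list; compilers
       optimize these memcpy calls into plain loads and stores. */
    static void *chunk_get_next(const void *chunk)
    {
        void *next;
        memcpy(&next, chunk, sizeof next);
        return next;
    }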
This brings back old memories; I implemented something similar on a Motorola 68360 ages ago. Since the buffer I used was not huge, I skipped the pointers and simply enumerated each chunk; it was remarkably efficient and stable.
The pool struct is two pointers; why are you allocating it with malloc?
Another (admittedly less portable) way to solve the resizing problem is to reserve a large virtual memory space using Linux mmap with the PROT_NONE flag and then commit pages as needed using mprotect with the PROT_READ|PROT_WRITE flags. I believe Windows' VirtualAlloc/VirtualProtect also allows this.
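Roughly, and assuming Linux (the constants and helper names here are illustrative):

    #include <stddef.h>
    #include <sys/mman.h>

    #define POOL_RESERVE (1u << 30)          /* reserve 1 GiB of address space */

    /* Reserve address space only; no page is readable or writable yet. */
    static void *pool_reserve(void)
    {
        void *base = mmap(NULL, POOL_RESERVE, PROT_NONE,
                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        return base == MAP_FAILED ? NULL : base;
    }

    /* Commit `len` more bytes at `base + committed` (both page-aligned); the
       already-handed-out pointers never move when the pool grows this way. */
    static int pool_commit(void *base, size_t committed, size_t len)
    {
        return mprotect((char *)base + committed, len, PROT_READ | PROT_WRITE);
    }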
Some things that would be interesting to add to this:
- Thread safety (not too hard to add on top of your linked list)
- Variable-sized allocations, ideally something more akin to malloc; could be done wastefully by rounding up to the nearest chunk size
- (As mentioned in the article) variable chunk sizes
- Zeroing of memory before handing it back to users
- Double-free detection, or even better, full ASan support
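For the first item, a minimal sketch of the simplest route, a mutex around the existing calls (the Pool/pool_alloc/pool_free names are assumed stand-ins for the article's single-threaded API; a lock-free free list would be the fancier option):

    #include <pthread.h>

    /* Stand-ins for the article's API; exact names are assumed here. */
    typedef struct Pool Pool;
    void *pool_alloc(Pool *pool);
    void  pool_free(Pool *pool, void *ptr);

    typedef struct {
        Pool           *pool;
        pthread_mutex_t lock;
    } LockedPool;

    static void *locked_pool_alloc(LockedPool *lp)
    {
        pthread_mutex_lock(&lp->lock);
        void *p = pool_alloc(lp->pool);
        pthread_mutex_unlock(&lp->lock);
        return p;
    }

    static void locked_pool_free(LockedPool *lp, void *ptr)
    {
        pthread_mutex_lock(&lp->lock);
        pool_free(lp->pool, ptr);
        pthread_mutex_unlock(&lp->lock);
    }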
I usually default to slab allocators if I have to write my own:
It seems like almost all of the complexity of this allocator comes from having to manage the fact that it's being implemented on top of the system allocator. I thought the whole point of using a custom allocator was to avoid the system allocator for exactly the problems they mention e.g. calling the system-provided realloc() might move your stuff around, having to track separate pool starts so you can call the system-provided free() on them, etc.
Like, yeah you have to do that, but I thought the whole point of writing a custom allocator was to not use the system allocator beyond statically allocating a huge block of memory upfront and then managing that yourself, is it not?
ThreadX has pool allocators: https://github.com/eclipse-threadx/threadx/blob/master/commo...
The Latin Modern font on this website does not look so good in Chrome. Is it because of the font size, or is it just me?
It all depends on the load profile using the allocator. You never know; that said, in theory you cannot beat an allocator specialized for the semantics of a specific load... in other words, the right way(TM). This means applications should bundle their own allocators for the various load types they have. The "generic" allocator is something of a heresy for the lazy, the short-termists, or those in a hurry. Don't worry, I still use such a generic allocator sometimes, but often I mmap the memory myself and deal with it directly.
This is really informative. I wonder if there is a better solution for the reallocation problem, one that preserves pointers.
This is a very nice article! The diagrams in particular are very clear. (They seem to have been done in draw.io: https://github.com/8dcc/8dcc.github.io/blob/main/img/pool-al... which seems to be a pretty decent Apache-2.0 free software licensed diagram editor: https://github.com/jgraph/drawio/)
I think it would be improved further if it began with the calling interface callers use to invoke the allocator, because it's easier to understand the implementation of some service such as pool allocation when you begin by knowing what service, exactly, is required. I began with the assumption that the author was describing an arena allocator under another name, rather than a fixed-size block allocator, both because the APR's influential arena allocator is called "apr_pool_t" and because one of the main headings in the table of contents (a greatly appreciated feature, like all the typography of the article) was "Closing the pool". It took me quite a while to figure out that I was mistaken. It's plausible that I only had this problem because I am an unusually stupid, ignorant, or pigheaded person (some would say all three), but, if not, many other people will have the same difficulty I had of taking several minutes to figure out what the topic of the article is. (I could have saved myself by clicking either of the first two links in the Introduction, but I didn't because I thought I already knew what it was.)
The current top-voted comment on this thread is by someone who had the same reading comprehension problem I did, but never realized their error: https://news.ycombinator.com/item?id=42643951 and nobody had pointed out their mistake until I did just now. So I think we can say with some confidence that many HN readers will have the same difficulty.
I think that the article would also be improved by a few other small additions:
- It says, "The pool allocator, however, is much faster than malloc", but doesn't provide any benchmarks or even an order of magnitude.
- Probably when one benchmarks the implementation given, one will find that it is only slightly faster than jemalloc or even gnumalloc, because they both dispatch small allocations to a fixed-block allocator similar to this one, and the dispatch isn't very expensive. This can be fixed by inlining the allocation fast path and the deallocation function, which will then compile down to a few instructions each. A simple demonstration of how to do this is in my arena allocator kmregion in http://canonical.org/~kragen/sw/dev3/kmregion.h and http://canonical.org/~kragen/sw/dev3/kmregion.c (GPLv3+). The current implementation of 8dcc libpool is guaranteed to be unable to do this unless you're using link-time optimization or a so-called "unity" or "amalgamation" build process. (A sketch of such an inlined fast path follows this list.)
- It's worth distinguishing between average-case performance (in which this allocator is slightly better than malloc) and worst-case performance (in which case malloc, generally speaking, isn't even in the running).
- It may be worth mentioning that often such pool allocators are used in conjunction with initializing the objects in the pool to a valid state when the pool is created; that way you don't have to reinitialize objects to a valid state every time you deallocate and reallocate them, which is usually more work than malloc does to find the right free list. This does require not overwriting the user data with your freelist pointer, so it's a space/time tradeoff.
- Maybe you should mention alignment, especially now that even on amd64 GCC has taken to using new instructions that require alignment.
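Following up on the inlining point a couple of bullets up, a rough sketch of what a header-only fast path could look like (the Chunk/Pool field and function names here are assumed, not 8dcc libpool's actual ones):

    /* pool_inline.h -- the hot path compiles down to a handful of instructions
       at each call site; only the empty-pool refill needs an out-of-line call. */
    #ifndef POOL_INLINE_H
    #define POOL_INLINE_H

    typedef struct Chunk { struct Chunk *next; } Chunk;

    typedef struct {
        Chunk *free_head;
    } Pool;

    void *pool_alloc_slow(Pool *p);       /* out of line: grow the pool, retry */

    static inline void *pool_alloc(Pool *p)
    {
        Chunk *c = p->free_head;
        if (c == NULL)
            return pool_alloc_slow(p);    /* cold path */
        p->free_head = c->next;           /* fast path: pop the free list */
        return c;
    }

    static inline void pool_free(Pool *p, void *ptr)
    {
        Chunk *c = ptr;
        c->next = p->free_head;           /* push back onto the free list */
        p->free_head = c;
    }

    #endif /* POOL_INLINE_H */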
Even as it is, though, it's highly educational, visually pleasant, and very understandable (once I got over my initial misconception of having a totally wrong idea of what the article was about).
I recommend against using linked lists for bookkeeping the free blocks. It seems to be the data structure that every malloc/free implementation reaches for, and I don't know why - the slowness of pointer-chasing makes it terrible for almost any real-world use case. A balanced tree would be a much better idea, given that all the essential operations would take O(log n) time instead of O(n). Even if one insists on a linear search, a bitset is much more cache friendly than pointer-chasing and it can trivially benefit from SIMD optimizations.
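For illustration, a minimal sketch of the bitset approach (one bit per slot, GCC/Clang builtins assumed):

    #include <stddef.h>
    #include <stdint.h>

    #define SLOTS 1024
    static uint64_t free_bits[SLOTS / 64];       /* bit set = slot is free */

    static void bitset_init(void)
    {
        for (size_t i = 0; i < SLOTS / 64; i++)
            free_bits[i] = ~UINT64_C(0);
    }

    /* Returns a free slot index, or -1 if the pool is full. Worst case is
       O(n/64), but each step touches one cache-resident word instead of
       chasing a pointer through memory. */
    static ptrdiff_t bitset_alloc(void)
    {
        for (size_t i = 0; i < SLOTS / 64; i++) {
            if (free_bits[i]) {
                int b = __builtin_ctzll(free_bits[i]);   /* lowest set bit */
                free_bits[i] &= free_bits[i] - 1;        /* clear it */
                return (ptrdiff_t)(i * 64 + b);
            }
        }
        return -1;
    }

    static void bitset_free(size_t slot)
    {
        free_bits[slot / 64] |= UINT64_C(1) << (slot % 64);
    }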
(related) LLFree
Paper: LLFree: Scalable and Optionally-Persistent Page-Frame Allocation https://www.usenix.org/system/files/atc23-wrenger.pdf
Video presentation: https://www.youtube.com/watch?v=yvd3D5VOHc8
Implementation and benchmarks are well documented in the repos:
Rust repo: https://github.com/luhsra/llfree-rs
C repo: https://github.com/luhsra/llfree-c