Exploring the Deque, Maintaining Order in the Heap
Motivation:
Cache locality is my favorite runtime factor to optimize for. It often leads to easy-to-visualize data structures, imposing a sort of order on your containers. Linked lists of all types suffer from using the cache inefficiently compared to array-based data structures. For added context, here is a quick snippet on the basics of cache locality and how different data structures perform because of it:
The goal of this post is more for me to understand the scientific process behind justifying assertions related to my coding practices. As a result, this is more of a high-level overview of the deque and the cache.
Topics covered will be:
- How malloc() practically allocates to the heap in C.
- Maximizing the deque's ability to take advantage of the cache, and understanding how that can break.
- Profiling the deque by looking at the assembly and seeing the cache misses.
Current Deque Implementations:
Here I'm going to implement a classic deque in C, in a style similar to the C++ std::list: a doubly linked list.
C Deque with Heap Allocation
typedef struct deque_node
{
int payload;
struct deque_node* next;
struct deque_node* prev;
}
deque_node;
typedef struct deque
{
struct deque_node* first;
struct deque_node* last;
int length;
}
deque;
void deque_push_front(deque *input, int val)
{
    deque_node* new = malloc(sizeof(deque_node));
    new->payload = val;
    new->next = NULL;
    new->prev = NULL;
    input->length++;
    // if the deque is empty, the new node becomes both first and last;
    // the links below then make it point to itself, keeping the list circular
    if (!input->first || !input->last) {
        input->first = new;
        input->last = new;
    }
    // splice the new node in between last and first
    input->first->prev = new;
    input->last->next = new;
    new->prev = input->last;
    new->next = input->first;
    input->first = new;
}
The defining feature of this operation is the malloc call for each new node: the allocator finds the first free block in the heap large enough for a node, then marks that memory as in use. Because every pushed node calls malloc separately, each allocation independently takes the first available block that fits.
Here you might assume (correctly), that pushing consecutively to a deque multiple times still creates a cache-friendly arrangement of nodes, as malloc will allocate nodes one after another on the heap. This allows the deque to still be reasonably fast when certain conditions are met (we'll get to those later).
For thoroughness, here is the popping operation for reference.
int deque_pop_front (deque* input)
{
// if the deque is empty, error
if (!input || !input->first) {
fprintf(stderr, "Popped empty deque\n");
exit(-1);
}
// grab first item and update deque
deque_node* popped = input->first;
int ret = popped->payload;
input->length--;
// if the deque had one element return it
if (input->length == 0) {
free(popped);
input->first = NULL;
input->last = NULL;
return ret;
}
// otherwise update pointer
input->first->next->prev = input->last;
input->last->next = popped->next;
input->first = popped->next;
free(popped);
return ret;
}
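The benches below also call an init_deque helper that the post doesn't show. A minimal sketch of what it presumably does (allocate the header and start empty; the name matches the calls below, but this exact definition is my assumption):

```c
#include <stdio.h>
#include <stdlib.h>

typedef struct deque_node {
    int payload;
    struct deque_node *next;
    struct deque_node *prev;
} deque_node;

typedef struct deque {
    deque_node *first;
    deque_node *last;
    int length;
} deque;

// Assumed definition of init_deque: one heap allocation for the header,
// with an empty (NULL first/last) list and zero length.
deque *init_deque(void)
{
    deque *d = malloc(sizeof(deque));
    if (!d) {
        fprintf(stderr, "init_deque: out of memory\n");
        exit(-1);
    }
    d->first = NULL;
    d->last = NULL;
    d->length = 0;
    return d;
}
```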
Lessons on Malloc
Let us look at the following test bench:

deque* main_deque = init_deque();
for (int i = 0; i < TEST_SIZE; i++)
{
    // push i onto the deque TEST_SIZE times
    deque_push_back(main_deque, i);
}
for (int i = 0; i < MOVING_SIZE; i++)
{
    // pop the front and push it onto the back
    int popped = deque_pop_front(main_deque);
    deque_push_back(main_deque, popped);
}
This pushes TEST_SIZE elements onto the deque, then pops the front and pushes onto the back MOVING_SIZE times. When we malloc each of the pushed nodes, where do we expect the nodes to be in memory?
One might expect the deque to roll through the heap like a wheel, with each node popped from the front being pushed to the memory location after the last node. We can see below that this is not the case.
On the first pass, where we build the deque, the nodes are allocated in a contiguous block of memory. This is exactly what we expect, since malloc finds the first free block that is large enough.
Here are the pointer locations of the 10 allocated nodes, truncated to the last 4 digits:

0x92c0 0x92e0 0x9300 0x9320 0x9340 0x9360 0x9380 0x93a0
0x93c0 0x93e0

When we pop the front and push the back, the new node ends up exactly where the last one was popped, resulting in the following pointer locations:

0x92c0 0x92e0 0x9300 0x9320 0x9340 0x9360 0x9380 0x93a0
0x93c0 0x93e0

The important detail here is that as we pop nodes, we free their memory. The freed block is re-inserted into the list of free blocks usable by malloc, and handed right back on the next allocation. This is promising for deque usage in a vacuum, since references stay close in memory, but it comes with many caveats.
Fragmenting A Deque
To measure the effect of this cache speedup, we want to build a fragmented deque: when the deque's nodes are physically further apart, the cache won't prefetch them. We can fragment a deque while building it by allocating other memory between consecutive push operations.
The following code block shows the test bench with the added fragmentation; the timing covers only the iteration:
deque* main_deque = init_deque();
deque* frag_deque = init_deque();
for (int i = 0; i < TEST_SIZE; i++)
{
    deque_push_front(main_deque, i);
    // fragment main_deque by allocating two dummy frag_deque nodes
    // in between each of its pushes
    deque_push_front(frag_deque, i);
    deque_push_front(frag_deque, i);
    deque_push_back(main_deque, i);
    deque_push_front(frag_deque, i);
    deque_push_front(frag_deque, i);
}
timer_start();
// Then iterate through the deque _many_ times while accessing node->payload
timer_end();
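The traversal elided above could be sketched as follows. This assumes the circular layout built by deque_push_front, and traverse_sum is my name for the loop, not a function from the original bench; it touches node->payload on every node so the memory access can't be optimized away:

```c
typedef struct deque_node {
    int payload;
    struct deque_node *next;
    struct deque_node *prev;
} deque_node;

typedef struct deque {
    deque_node *first;
    deque_node *last;
    int length;
} deque;

// Walk the circular list once, front to back, summing payloads.
// The sum += cur->payload load is the access that hits or misses in cache.
long traverse_sum(const deque *d)
{
    long sum = 0;
    const deque_node *cur = d->first;
    for (int i = 0; i < d->length; i++) {
        sum += cur->payload;
        cur = cur->next;
    }
    return sum;
}
```

A reverse traversal is identical except it starts at d->last and follows ->prev.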
Thus, I claim the deque with fragmentation will have a higher cache miss rate, and will be slower than the deque without fragmentation.
The heap is fragmented by gaps of 2*sizeof(deque_node), or 48 bytes. This places consecutive deque_nodes more than 64 bytes (theoretically, 72 bytes) away from each other, which is greater than the size of a cache line.
Focusing on the cache line
To build intuition on why this causes a dramatic slowdown, consider one of the cache's superpowers: prefetching. Whenever a reference is accessed in DRAM, a surrounding "cache line", normally 64 bytes, is pulled into the cache along with it. We are likely to reference memory locations near ones we have already accessed (the principle of locality of reference), so memory accesses get dramatically faster when the data we look at has already been prefetched into cache. When our 24-byte struct is fetched here, though, only the one node comes along for the ride; the rest of the cache line is taken up by dummy nodes. Accessing each new node therefore requires going out to DRAM.

How big is the effect in practice? Breaking the deque.
When we run this bench with and without fragmentation, we can see the effect that cache locality has on the deque. (Averages over 5 runs)
- Without Fragmentation, Forward Traversal: 1.92s
- Without Fragmentation, Reverse Traversal: 1.92s
- With Fragmentation Forward Traversal: 18.18s
- With Fragmentation Reverse Traversal: 18.34s
This slowdown resulted simply from another deque being pushed to at the same time. It's also important to note that a deque can become fragmented by popping from within it as well.
Profiling!
X86 Assembly
When viewing the x86 assembly for this bench at -O2 optimization in GCC, we get this result for the traversal (the benched part):

.L16:
mov rcx, rdx                ; rcx now holds the address of the current node
mov rdx, QWORD PTR [rdx+8]  ; rdx now holds the pointer to the next node
cmp rdx, rax                ; check if curr->next is the end of the list
jne .L16                    ; if not, redo the loop
mov edx, DWORD PTR [rcx]    ; after the loop: load the last node's payload
mov DWORD PTR [rsi], edx    ; and store it out
Side Note: I wrote a very small snippet on how important this step is for benching code here.
Luckily, we don't just have to assume what the cache is doing; we can use a tool called perf stat to directly measure the number of cache misses! As expected, here are the benches:
- Without Fragmentation: 0.33% of all cache references are misses
- With Fragmentation: 1.2% of all cache references are misses
How Does This Factor Into Best Coding Practices?
The deque's circular, double-sided nature makes it a suitable alternative to both linked lists and stacks, and many software systems rely on its efficient implementation. While working with a deque, it is important to consider where memory is allocated. If a deque is being created or changed while other threads are malloc'ing from the same heap, it can be useful to rebuild it atomically (in relation to other heap allocations) to regain the cache speedup.
You could also consider forking a new process if the deque calculations don't rely on the state of the rest of the heap. Process isolation means the child gets a heap separate from the parent's, but a fork itself takes on the order of 1 ms, which is often much too long for higher-performance applications. Threads share the heap, so spawning a thread is faster, but ineffective for this use case.
More likely, for performance-critical applications the array and the array list are your best friends. Often even a hashmap can use the cache and prefetcher more efficiently than a linked list, provided it uses open addressing, since the table is then one contiguous allocation in memory.
This does rely on a good hash function for efficiency, though, and a high load factor can cause dramatic slowdown depending on the implementation of the hashtable.
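To make the open-addressing point concrete, here is a toy linear-probing table (my own illustration, not from any particular library; integer keys, with 0 reserved as the "empty" marker, and no resizing or full-table handling):

```c
#include <string.h>

#define TABLE_SIZE 16

typedef struct { int key; int value; } slot;

// Toy multiplicative hash; any reasonable hash works here.
static unsigned hash(int key) { return (unsigned)key * 2654435761u; }

// Insert: probe forward from the hashed slot. Because the whole table is
// one flat array, each probe step touches the next contiguous slot, so
// neighboring probes usually share a cache line -- the locality win
// described above.
void table_put(slot *table, int key, int value)
{
    unsigned i = hash(key) % TABLE_SIZE;
    while (table[i].key != 0 && table[i].key != key)
        i = (i + 1) % TABLE_SIZE;   // linear probe: next contiguous slot
    table[i].key = key;
    table[i].value = value;
}

// Lookup: same contiguous probe sequence; returns `missing` if not found.
int table_get(const slot *table, int key, int missing)
{
    unsigned i = hash(key) % TABLE_SIZE;
    while (table[i].key != 0) {
        if (table[i].key == key)
            return table[i].value;
        i = (i + 1) % TABLE_SIZE;
    }
    return missing;
}
```

Contrast this with a chained hashtable, where each bucket is a linked list and every probe can be a fresh cache miss.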
I typically avoid linked lists unless I'm optimizing a small problem, where how easy they are to work with wins out. That's probably why I have a soft spot for them, and why I wanted to see about making them as efficient as possible.