Once upon a time, in the age of C, low-level programming languages (those that hewed closer to the hardware) required something of developers that virtually no modern high-level language demands: memory management. We had to tell the computer ahead of time how much memory we would need, and it was up to us to make sure we didn’t exceed that allocation. We also had to tell the computer what to do with objects we weren’t using anymore. It was a dirty job, but someone had to do it. Luckily, newer languages have us covered. They outsource the job to a Garbage Collector.
What is a Garbage Collector?
Garbage Collection, at its most basic, is a form of automatic memory management. The Garbage Collector (GC) attempts to reclaim memory occupied by objects the program no longer uses. The concept was first conceived sometime around 1959 by John McCarthy, who was trying to simplify manual memory management in Lisp.
As mentioned above, most high-level languages, like JavaScript (as in Node.js), Python, and Ruby, come with a garbage collector. Low-level languages like C and C++ can add the feature in practice through libraries such as the Boehm GC. Each language has its own implementation for GCs, so the details will vary. But they all aim to solve some common problems with memory management.
GCs work such that, when we create an object, we can come back later to check whether that object is still being used anywhere. If it has no references (and therefore isn’t being used anywhere and isn’t reachable) then we can collect it.
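In CPython, for example, we can watch this happen with a weak reference, which observes an object without keeping it alive. A minimal sketch using only the standard `weakref` and `gc` modules (the `Widget` class is just an illustrative placeholder):

```python
import gc
import weakref

class Widget:
    pass

obj = Widget()
observer = weakref.ref(obj)  # a weak reference does not keep obj alive

del obj       # drop the only remaining (strong) reference
gc.collect()  # ask the collector to run; CPython actually frees it immediately

print(observer() is None)  # True: the object had no references and was collected
```

While `obj` still existed, `observer()` would have returned the live object; once the last reference is gone, the collector reclaims it.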
Its name may be a bit misleading, as, in most cases, the GC is also involved in the initial allocation of memory. Unlike traditional allocators, which track free memory with linked-list structures, the managed heap keeps objects contiguous and consecutive. When using a GC, developers don’t manipulate physical memory directly. We only work with the “virtual address space” (VAS) allocated by the GC.
Processes have their own VAS. On 32-bit Windows, for example, each process gets a 2 GB user-mode VAS by default. When virtual memory is requested, the memory manager has to find a contiguous region of that address space for the “managed heap.” Each time an object is created, the GC allocates memory in the VAS immediately following the previous object. Because the runtime allocates memory for an object simply by adding a value to a pointer, it’s almost as fast as allocating memory from the call stack.
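That “adding a value to a pointer” scheme is often called bump-pointer allocation. Here is a toy model of it (the `BumpAllocator` name is made up for illustration; real runtimes do this in native code):

```python
class BumpAllocator:
    """Toy model of bump-pointer allocation over a contiguous managed heap."""

    def __init__(self, size):
        self.size = size  # total heap size in bytes
        self.top = 0      # pointer to the next free address

    def alloc(self, nbytes):
        if self.top + nbytes > self.size:
            # A real runtime would trigger a collection here instead
            raise MemoryError("managed heap exhausted")
        addr = self.top
        self.top += nbytes  # allocation is just a pointer bump: O(1)
        return addr

heap = BumpAllocator(1024)
a = heap.alloc(16)
b = heap.alloc(32)
print(a, b)  # 0 16: each object lands immediately after the previous one
```

There is no free list to search, which is why this is nearly as cheap as pushing onto the call stack.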
There are three common methods for implementing a GC: Mark and Sweep, Copying Collection, and Reference Counting. Each has pros and cons and looks very different from its cousins.
#1: Mark and Sweep
Mark and Sweep was the first known method for garbage collection. It works by finding live heap links (object references) and marking reachable objects by setting a “being used” bit. Generally, it starts at the root set, a set of objects easily identified in the program, and tries to access other objects from there. It then makes a second pass, locating objects in memory that haven’t been marked (and therefore have no references, i.e., can’t be accessed) and returning their addresses in memory to the free pool, making them available for reuse, like putting a house back on the market.
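The two passes can be sketched in a few lines of Python, modeling the heap as a set of object names and references as an adjacency dict (all names here are illustrative):

```python
def mark_and_sweep(heap, roots, edges):
    """heap: set of objects; roots: the root set; edges: obj -> referenced objs."""
    marked = set()
    stack = list(roots)
    # Pass 1 (mark): set the "being used" bit on everything reachable from the roots
    while stack:
        obj = stack.pop()
        if obj not in marked:
            marked.add(obj)
            stack.extend(edges.get(obj, []))
    # Pass 2 (sweep): anything unmarked goes back to the free pool
    return heap - marked

heap = {"a", "b", "c", "d"}
edges = {"a": ["b"], "c": ["d"], "d": ["c"]}
print(sorted(mark_and_sweep(heap, {"a"}, edges)))  # ['c', 'd']
```

Note that `c` and `d` reference each other, yet both are swept: reachability from the root set, not reference counts, decides what lives.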
FUN FACT: Ruby uses a special type of mark and sweep called tricolor. In this implementation, the GC starts by marking every object in memory white. This indicates that it is a possible candidate for GC, but currently unreviewed.
Next, every object reachable from the root set is marked gray. A gray flag rules the object out as a collection candidate but indicates its connections haven’t been reviewed yet. Each gray object’s connections are then reviewed: any connected object still flagged white is changed to gray, indicating that its own connections also need review.
Once all of a gray object’s connections have been reviewed, that object changes from gray to black. A black flag indicates the object is still in use (has working references) and shouldn’t be garbage collected. The process repeats with the remaining gray objects until only white and black objects are left. The memory where white objects reside is finally returned to the heap for reuse.
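The white/gray/black bookkeeping might look like this in Python (a sketch only; Ruby’s real implementation lives in its C runtime):

```python
WHITE, GRAY, BLACK = "white", "gray", "black"

def tricolor(objects, roots, edges):
    color = {obj: WHITE for obj in objects}  # everyone starts as a GC candidate
    for root in roots:
        color[root] = GRAY                   # reachable, connections unreviewed
    gray = list(roots)
    while gray:
        obj = gray.pop()
        for child in edges.get(obj, []):
            if color[child] == WHITE:        # newly reached: needs review too
                color[child] = GRAY
                gray.append(child)
        color[obj] = BLACK                   # connections reviewed: keep it
    # Whatever stayed white was never reached, so it's garbage
    return {obj for obj in objects if color[obj] == WHITE}

objects = {"root", "kept", "junk"}
edges = {"root": ["kept"]}
print(tricolor(objects, {"root"}, edges))  # {'junk'}
```

The gray set is what makes the scheme incremental in practice: the collector can pause with work in progress and pick up where it left off, because gray records exactly which objects still need review.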
Mark and Sweep has one major disadvantage: the entire system must be suspended during collection, which makes it impractical for real-time and time-critical applications.
#2: Copying Collection
In Copying Collection, memory is divided into two halves. One half is used until it is full. Then, blocks still in use are moved (copied) to the other half, and the first half is erased. The big advantage of this implementation is that, if our memory burden is low, we may never need to copy at all: if we use less than half the heap, no copying will be necessary.
However, this method still stops the program’s execution to move objects. It also changes the addresses that program objects point to, which means updating the values of variables in the program. All of this costs time.
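A sketch of the copy step, using a forwarding table to record each live object’s new location so those pointers can be rewritten (names are illustrative):

```python
def copy_live(from_space, roots, edges):
    """Copy every object reachable from roots into an empty to-space."""
    to_space = []
    forwarding = {}           # old object -> its new slot in to_space
    work = list(roots)
    while work:
        obj = work.pop()
        if obj in forwarding:  # already copied; skip
            continue
        forwarding[obj] = len(to_space)
        to_space.append(obj)
        work.extend(edges.get(obj, []))
    # from_space can now be erased wholesale; program pointers are
    # updated using the forwarding table
    return to_space, forwarding

from_space = ["a", "b", "garbage"]
to_space, fwd = copy_live(from_space, roots=["a"], edges={"a": ["b"]})
print(to_space)  # ['a', 'b']: live objects, now packed together
```

A side benefit of copying is visible here: survivors end up compacted at the start of the new half, so future bump-pointer allocation stays trivial.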
#3: Reference Counting
This method is used in CPython’s implementation. In Reference Counting, each block of memory has a counter of heap links to it. That counter is incremented when a heap link is copied and decremented when the link is discarded. When a counter hits zero, the block is freed. This guarantees objects are destroyed as soon as they become unreachable, which is great. However, the counters require additional space, and keeping them up to date influences performance, as it can take up a significant portion of total processing time. We can also end up in a vicious cycle whereby, if multiple objects reference each other, their mutual references prevent their reference counts from ever hitting zero, and they will consequently never be collected.
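CPython exposes both behaviors directly: `sys.getrefcount` reads an object’s counter, and the standard `gc` module exists precisely to break the reference cycles that pure counting can’t collect:

```python
import gc
import sys

class Node:
    def __init__(self):
        self.partner = None

a, b = Node(), Node()
a.partner = b
b.partner = a  # a cycle: each object holds a reference to the other

# The count includes 'a', 'b.partner', and getrefcount's own argument
print(sys.getrefcount(a) >= 3)  # True

gc.collect()             # clear any pre-existing garbage first
del a, b                 # pure reference counting would leak the cycle...
print(gc.collect() > 0)  # True: CPython's cycle detector frees it anyway
```

Without the supplementary cycle detector, the two `Node` objects would keep each other alive forever even though nothing in the program can reach them.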
Stop-the-world collection
Garbage Collection can be a huge time saver. Without it, developers would have to write their own code for memory management tasks. Because it is already implemented for us, it also eliminates several once-common problems: memory leaks caused by forgetting to free objects, attempts to access objects that have already been freed, and memory-safety violations in which one object reads the contents of another. It also allocates memory efficiently using the managed heap.
However, these benefits come at a cost to performance. Collection can take up a large portion of processing time. GCs use “stop-the-world” collection: they stop the execution of your application to prevent it from allocating new memory while the collector is still working on the current heap. On the one hand, this is a necessary evil. Imagine you’re folding laundry and your roommate comes over and dumps a whole new load on top of your nice, neat stack. On the other hand, when this happens can be unpredictable, leading to undesirable slowdowns. All this means is that GCs aren’t a cure-all. It’s important to understand the costs and benefits of managing your own memory allocation and choose the right approach for the job.