GC FAQ -- harder stuff

Harder stuff

Thread support

Makes life interesting, since a garbage collection may now occur "at any time". In addition, few thread packages make it easy to stop other threads and retrieve their state, an operation that's required by most garbage collectors.

Atomic operations may require additional code to ensure that they are really atomic (two examples are reference count update and object copying), and some use of memory barriers may be needed to ensure that writes are seen in the same order by all observers. For example, reference counts cannot inconsistently fall to zero (if any observer sees a non-zero reference count, then the memory ought not be reclaimed).

Typical solutions here include working at a machine-dependent level, and mostly removing the threading issues by caching information in a single-threaded context. For example, a heap can be carved up into per-thread pieces, and each thread allocates inpedendently and unsynchronized, except when it needs a new piece. Reference counting can remove much of its dependence on fine-grained synchronization by caching updates in per-thread buffers, and processing the buffers in batch form.

Threads can also be tamed with "safepoints". Before a GC can be started, all threads must be parked at safepoints, where their pointers are guaranteed to be in stable, well-known places. Safepoints have some of their own problems; if, for instance, a low-priority thread must reach a safepoint before GC can begin, GC for the entire process may be delayed.

Parallel collection

Parallel collection makes use of multiple processors to reduce the elapsed time needed by GC. This is especially important on multiprocessors, because if garbage collection does not scale as well as the garbage generator, it becomes the bottleneck. In a stopped-mutator collector, parallelism is not too hairy; it is easy to sweep in parallel, and parallel marking is not too difficult, though it does require some form of load balancing. Compaction can be parallelized by dividing the heap into as many pieces as processors, or the forwarding pointers can be installed using compare-and-swap (on architectures that support this).

Concurrent and incremental collection

"Concurrent" collection means concurrent with the mutator; that is, without pauses. A close approximation of concurrent collection is incremental collection, which is more tractable because it does not have to deal with mutation during the garbage collection algorithm.

One way to add concurrency is to perform non-interfering GC tasks concurrent with mutation. These can include processing card marks and some versions of snapshot (e.g., mark-and-sweep on a copy-on-write mapping) mark-and-sweep collection. See, for example:

"Concurrent Remembered Set Refinement in Generational Garbage Collection" by David Detlefs, Ross Knippel, William D. Clinger, Matthias Jacob, in the Proceedings of 2002 USENIX Java VM Research and Technology Symposium.

Available at http://research.sun.com/jtech/pubs/.

Other concurrent collection algorithms require some amount of mutator cooperation, usually to obtain the equivalent of a snapshot or a write barrier. Another advantage of incremental collection over concurrent collection is that it can adjust the rate of GC if it falls behind; a concurrent collector's rate is affected by changes in OS thread scheduling.

Distributed objects

A reference (e.g. ports, sockets, URLs, etc.) is the distributed equivalent of a pointer. Few existing reference mechanisms support GC. If you consider objects that are widely shared, in a large distributed system, manual collection is a nightmare, and automatic garbage collection is essential.

Lots of people think it's a very hard problem. In fact, complete GC is unfeasible, but simple solutions can collect most garbage.

The problem is that the property "object X is garbage" is a global property, i.e. conceptually you need to examine the whole system (in a consistent state) just to know the status of X. Examining the whole system is expensive, and actually unfeasible in the presence of failures; getting a consistent state is hard. However it is always safe not to collect; so any conservative approximation is OK. Furthermore, garbage objects will not revert, so algorithms that work "eventually" are ok if there is enough slack in the system.

What is different about distributed systems?

To know the status of a remote site you must send and receive messages; messages are costly.
Distributed systems are inherently asynchronous. A message only tells you about the past of the sender. There is no single global order of events.
Failures do occur. Machines crash, processes get into infinite loops. Messages get lost and delivered out-of-order (even with TCP/IP, in the presence of crashes). In the Internet remote sites often appear unreachable. The killer is: you can't reliably detect failures, so some questions about remote state are undecidable (in distributed systems jargon: "consensus is impossible in an asynchronous system with failures").
As a consequence of the above: data tends to be inconsistent. If the data you are copying is the GC status, then you may be deciding to reclaim an object based on incorrect information.

You will find general algorithms in the distributed systems literature that appear to solve the above problems, called distributed snapshots, distributed transactions, etc. They will work but they are very costly and they don't scale. Specialized solutions to the specific problem of GC can be much simpler and easier to implement.

Distributed reference counting and reference listing

The easiest GC algorithm to distribute is reference counting. A simple-minded approach is to attach a counter to each object; duplicating or killing a reference sends an increment or decrement message to the target object. This is very expensive (one message per pointer operation) but more importantly doesn't work. Since there is no guaranteed delivery order for messages, an increment might be overrun by a decrement, and the target object unsafely reclaimed.

A general fix is to provide "Causal delivery" but that's overkill. A simpler fix is Weighted Reference Count (or some variation thereof). When an object X is created, it is allocated an initial weight. The first reference to X carries a weight equal to the object's. Duplicating any reference to X divides the weight between the two duplicates. Killing a reference sends a decrement message to the object's weight. At all times, the sum of the reference weights is equal to the object's weight.

In "reference listing" the target object keeps a "scion set", i.e. a list of backpointers to the processes that hold pointers to it. This is made possible by the following observation: if process P owns object X, process Q may acquire a reference to X only if P sends a message to Q containing the reference. At message sending time, record the name Q in the reference list of X. When Q kills its (last) reference to X, it sends a decrement message which removes Q from X's scion list. When the scion list is empty, X can be reclaimed.

Reference listing is less sensitive to failures than counting. If process Q crashes, P can unilaterally remove its scion from all its reference lists (but beware: it is impossible to reliably detect crashes!). If a decrement message is lost, P can query Q. Reference listing is heavily used in Partitioned GC.

Reference counting and listing scale well. They cannot collect cycles of garbage. Another drawback in practice is that they do not provide "root identity", i.e. if there are multiple roots (e.g. process roots and persistent roots), there is no way to know from which a specific object is reachable.

Distributed Mark and Sweep

For concreteness, I will concentrate on Mark and Sweep; but all tracing methods (whether mark and sweep, copying, or other) have a synchronization point and hence will not scale well. On the other hand, tracing does collect cycles and supports root identity. It is also fault tolerant (in a perverse way): if anything goes wrong, you can abort a trace and start again from scratch.

The simple-minded approach, where each process marks and sweeps in parallel, sending "mark" messages down remote references, doesn't work. Suppose P has a reference to object X in process Q, and Q has no local reference to X. Then Q shouldn't start sweeping until it is certain that it will receive no more mark messages from P. Of course, P must do the same with respect to Q. Thus, every process must wait for all others between the mark and the sweep phases (this is called a "termination protocol").

The Hughes algorithm overcomes this problem somewhat by using timestamps to mark reachable objects (rather than a single mark bit). The idea is that the timestamp of a reachable object will always increase; after an object becomes unreachable its timestamp ceases to increase; objects whose timestamp is below some global minimum are garbage. The catch is that computing the global minimum is a termination protocol. The good thing is that it can be done in the background.

Ladin and Liskov take an alternative approach. The idea here is that each process sends (an approximation of) its local graph to some central service. The central service combines the partial graphs into a global graph, which it marks and sweeps. There are a few problems with this approach. The first is that the central service is a bottleneck and a single point of failure (Ladin and Liskov address this problem by replication). The second is that the partial graphs are mutually inconsistent (Ladin and Liskov address this by timestamping and being conservative). The third is that getting the partial graph information is harder than it seems (Ladin and Liskov got it wrong the first time around).

Partitioned or Hybrid GC

Partitioned GC combines the advantages of Distributed Reference Counting or Listing and of Distributed Mark and Sweep. The global memory is partitioned; GC is hybrid: tracing within a subset, counting (or listing) across subsets.

The idea is that when some subset of the global memory (e.g. a process) sends a pointer to another subset, that pointer is recorded in a "scion set" (akin to a remembered set). You keep a count of such inter-subset references. Thus, each subset can independently and locally perform tracing of any kind. Inter-subset references are reference-counted.

The difference between a remembered set and a scion set is that in the former case the memory subsets are ordered, so no cycles of garbage can form. In the latter case there is no natural ordering and inter-subset garbage cycles cannot be collected.

Variations on this theme have to do with doing the whole job more efficiently and tolerating faults ("SSP Chains": Shapiro, Dickman and Plainfossè), collecting cycles of garbage that span subsets (Hughes, or "Garbage Collecting the World": Lang Queinnec and Piquer), avoiding expensive memory synchronization and I/O operations ("Larchant": Ferreira and Shapiro), using the counting information to choose the best subset to trace (Cook Wolf and Zorn, or "Sprite LFS": Rosenblum and Ousterhout), or avoiding problems with transaction aborts (Amsaleg Gruber and Franklin).

DSM, cached distributed stores

Databases, persistent stores, and file systems

What is different:

lots of data on disk, I/O is costly
inconsistent data
transaction rollback
clustering

Reference counting and listing, mark and sweep: see distributed systems

Partitioned GC

see distributed systems
selecting the best partition to trace
Naughton
Sprite LFS

Tolerating transaction rollback and inconsistent data

note the identity between the two problems
Amsaleg
Larchant

Clustering

Uncooperative environments

Garbage collectors designed to work with C and C++ cannot rely on compiler support for garbage collection, because there is none. Such garbage collectors cannot know exactly where pointers are stored on the stack, or in which variables, nor can they necessarily know the exact layout of allocated objects. The usual way to solve this problem is with a conservative garbage collector. Relocation (compaction) of objects referenced by these uncertain pointers is not possible.

In cases where the C is machine-generated (in compilation of some other language, such as Sather, Eiffel, Scheme, or Modula-3), code to record active pointers can be generated.

There is a theoretical problem posed by sufficiently good optimizing compilers. In principle, there is no reason why the generated code should reference objects by a pointer to the start of the object, or even to its interior. In practice this is not a problem.

Hardware support

Good VM hardware support for read or write barriers.
Cache sub-block allocate on write.
Memory pre-fetch (for scanning/scavenging).
Control of caching/non-caching at different cache levels.

Recent work

Kelvin Nilsen's approach to hardware-assisted real-time garbage collection is to insert special hardware between the cache and the memory subsystem. The role of this hardware is to:

Enforce read barriers by snooping on memory fetch operations and inserting pipeline stalls and/or requiring fetch retries whenever special handling is required.
Maintain a log of fetch and store operations for purposes of updating cross-generational link tables for generational garbage collectors.
Maintain a larger-than-normal write buffer and provide programmable support for implementation of custom write barriers.
Provide a very small programmable coprocessor to assist with garbage collection efforts by performing such routine tasks as scanning regions of memory for pointers and copying blocks of memory from one region to another. This coprocessor resides next to memory so its activities do not generally interfere with the contents of the main CPU's cache.

The garbage collection hardware can be configured to support a variety of different garbage collection techniques including two-space copying, incremental mark and sweep, generational, and various hybrid techniques. For more information, refer to http:www.newmonics.com/WebRoot/technologies/gc.html

OS support

Where, oh where, is my root set? (Stacks and shared libraries can be hard to find).
Zeroing out memory
Some people take it for granted, it need not be the case, especially on PCs. Junk appearing in register windows (e.g., Sparc) can also cause trouble, whether it is left over from other processes, the OS, or simply the same process's former calls. For Sparc, on Solaris, there is a way to request that register windows be zeroed, and compilers can also zero out unused registers upon procedure entry.
Access to VM hardware support for read or write barriers.
If you have an incremental or concurrent GC that can choose which pages it's going to scan, and it can determine what's paged in and what's not, you can minimize paging penalties by deferred scanning of pages that are not in physical memory.
Pre-fetching without blocking a little while before you scan a page can be useful for minimizing paging costs of GC.
Fast traps for read or write barrier violations.
Backing store pre-fetch.
Pre-fetching without blocking one cache-line ahead while scanning can be a real win. (I know this has been written about, perhaps in the Tarditi, et al, cache/GC measurements paper, which, btw, should be in the list of references.)
Support for dropping dirty in-core condemned page frames.
If you know a page is garbage, just discard it rather than paging something out. (Obviously, this doesn't only apply to GCs -- a typical example would be memory past the top of the stack. Copying GCs permit discarding all of new space after a flip, that is, pages between the allocation pointer and the end of new space can be paged out by discarding them.)
If the GC doesn't require zero-filled memory (perhaps because everything always gets initialized explicitly -- typical for ML, etc) and you can get garbage memory from the OS, you can save the cost of bzeroing.

Compiler support

Lack of compiler support usually requires a conservative collector.
Not hiding pointers (see Chase, Chase and Boehm)
Nulling out dead pointer variables (Cedar Mesa folk wisdom)
Identifying pointers (Moss et al.)
Avoiding heap allocation (SETL, Lisp, Ruggieri, Chase)
Optimizations of reference counting (Barth CACM paper, SETL work).

GC FAQ table of contents
Techniques and algorithms
Language interfaces
Difficult topics
Silly stuff