Atomic operations may require additional code to ensure that they are really atomic (two examples are reference count update and object copying), and some use of memory barriers may be needed to ensure that writes are seen in the same order by all observers. For example, reference counts cannot inconsistently fall to zero (if any observer sees a non-zero reference count, then the memory ought not be reclaimed).
Typical solutions here include working at a machine-dependent level, and mostly removing the threading issues by caching information in a single-threaded context. For example, a heap can be carved up into per-thread pieces, and each thread allocates inpedendently and unsynchronized, except when it needs a new piece. Reference counting can remove much of its dependence on fine-grained synchronization by caching updates in per-thread buffers, and processing the buffers in batch form.
Threads can also be tamed with "safepoints". Before a GC can be started, all threads must be parked at safepoints, where their pointers are guaranteed to be in stable, well-known places. Safepoints have some of their own problems; if, for instance, a low-priority thread must reach a safepoint before GC can begin, GC for the entire process may be delayed.
One way to add concurrency is to perform non-interfering GC tasks concurrent with mutation. These can include processing card marks and some versions of snapshot (e.g., mark-and-sweep on a copy-on-write mapping) mark-and-sweep collection. See, for example:
"Concurrent Remembered Set Refinement in Generational Garbage Collection" by David Detlefs, Ross Knippel, William D. Clinger, Matthias Jacob, in the Proceedings of 2002 USENIX Java VM Research and Technology Symposium.Available at http://research.sun.com/jtech/pubs/.
Other concurrent collection algorithms require some amount of mutator cooperation, usually to obtain the equivalent of a snapshot or a write barrier. Another advantage of incremental collection over concurrent collection is that it can adjust the rate of GC if it falls behind; a concurrent collector's rate is affected by changes in OS thread scheduling.
Lots of people think it's a very hard problem. In fact, complete GC is unfeasible, but simple solutions can collect most garbage.
The problem is that the property "object X is garbage" is a global property, i.e. conceptually you need to examine the whole system (in a consistent state) just to know the status of X. Examining the whole system is expensive, and actually unfeasible in the presence of failures; getting a consistent state is hard. However it is always safe not to collect; so any conservative approximation is OK. Furthermore, garbage objects will not revert, so algorithms that work "eventually" are ok if there is enough slack in the system.
The easiest GC algorithm to distribute is reference counting. A simple-minded approach is to attach a counter to each object; duplicating or killing a reference sends an increment or decrement message to the target object. This is very expensive (one message per pointer operation) but more importantly doesn't work. Since there is no guaranteed delivery order for messages, an increment might be overrun by a decrement, and the target object unsafely reclaimed.
A general fix is to provide "Causal delivery" but that's overkill. A simpler fix is Weighted Reference Count (or some variation thereof). When an object X is created, it is allocated an initial weight. The first reference to X carries a weight equal to the object's. Duplicating any reference to X divides the weight between the two duplicates. Killing a reference sends a decrement message to the object's weight. At all times, the sum of the reference weights is equal to the object's weight.
In "reference listing" the target object keeps a "scion set", i.e. a list of backpointers to the processes that hold pointers to it. This is made possible by the following observation: if process P owns object X, process Q may acquire a reference to X only if P sends a message to Q containing the reference. At message sending time, record the name Q in the reference list of X. When Q kills its (last) reference to X, it sends a decrement message which removes Q from X's scion list. When the scion list is empty, X can be reclaimed.
Reference listing is less sensitive to failures than counting. If process Q crashes, P can unilaterally remove its scion from all its reference lists (but beware: it is impossible to reliably detect crashes!). If a decrement message is lost, P can query Q. Reference listing is heavily used in Partitioned GC.
Reference counting and listing scale well. They cannot collect cycles of garbage. Another drawback in practice is that they do not provide "root identity", i.e. if there are multiple roots (e.g. process roots and persistent roots), there is no way to know from which a specific object is reachable.
For concreteness, I will concentrate on Mark and Sweep; but all tracing methods (whether mark and sweep, copying, or other) have a synchronization point and hence will not scale well. On the other hand, tracing does collect cycles and supports root identity. It is also fault tolerant (in a perverse way): if anything goes wrong, you can abort a trace and start again from scratch.
The simple-minded approach, where each process marks and sweeps in parallel, sending "mark" messages down remote references, doesn't work. Suppose P has a reference to object X in process Q, and Q has no local reference to X. Then Q shouldn't start sweeping until it is certain that it will receive no more mark messages from P. Of course, P must do the same with respect to Q. Thus, every process must wait for all others between the mark and the sweep phases (this is called a "termination protocol").
The Hughes algorithm overcomes this problem somewhat by using timestamps to mark reachable objects (rather than a single mark bit). The idea is that the timestamp of a reachable object will always increase; after an object becomes unreachable its timestamp ceases to increase; objects whose timestamp is below some global minimum are garbage. The catch is that computing the global minimum is a termination protocol. The good thing is that it can be done in the background.
Ladin and Liskov take an alternative approach. The idea here is that each process sends (an approximation of) its local graph to some central service. The central service combines the partial graphs into a global graph, which it marks and sweeps. There are a few problems with this approach. The first is that the central service is a bottleneck and a single point of failure (Ladin and Liskov address this problem by replication). The second is that the partial graphs are mutually inconsistent (Ladin and Liskov address this by timestamping and being conservative). The third is that getting the partial graph information is harder than it seems (Ladin and Liskov got it wrong the first time around).
The idea is that when some subset of the global memory (e.g. a process) sends a pointer to another subset, that pointer is recorded in a "scion set" (akin to a remembered set). You keep a count of such inter-subset references. Thus, each subset can independently and locally perform tracing of any kind. Inter-subset references are reference-counted.
The difference between a remembered set and a scion set is that in the former case the memory subsets are ordered, so no cycles of garbage can form. In the latter case there is no natural ordering and inter-subset garbage cycles cannot be collected.
Variations on this theme have to do with doing the whole job more efficiently and tolerating faults ("SSP Chains": Shapiro, Dickman and Plainfossè), collecting cycles of garbage that span subsets (Hughes, or "Garbage Collecting the World": Lang Queinnec and Piquer), avoiding expensive memory synchronization and I/O operations ("Larchant": Ferreira and Shapiro), using the counting information to choose the best subset to trace (Cook Wolf and Zorn, or "Sprite LFS": Rosenblum and Ousterhout), or avoiding problems with transaction aborts (Amsaleg Gruber and Franklin).
Partitioned GC
In cases where the C is machine-generated (in compilation of some other language, such as Sather, Eiffel, Scheme, or Modula-3), code to record active pointers can be generated.
There is a theoretical problem posed by sufficiently good optimizing compilers. In principle, there is no reason why the generated code should reference objects by a pointer to the start of the object, or even to its interior. In practice this is not a problem.
Some people take it for granted, it need not be the case, especially on PCs. Junk appearing in register windows (e.g., Sparc) can also cause trouble, whether it is left over from other processes, the OS, or simply the same process's former calls. For Sparc, on Solaris, there is a way to request that register windows be zeroed, and compilers can also zero out unused registers upon procedure entry.
If you have an incremental or concurrent GC that can choose which pages it's going to scan, and it can determine what's paged in and what's not, you can minimize paging penalties by deferred scanning of pages that are not in physical memory.
Pre-fetching without blocking a little while before you scan a page can be useful for minimizing paging costs of GC.
Pre-fetching without blocking one cache-line ahead while scanning can be a real win. (I know this has been written about, perhaps in the Tarditi, et al, cache/GC measurements paper, which, btw, should be in the list of references.)
If you know a page is garbage, just discard it rather than paging something out. (Obviously, this doesn't only apply to GCs -- a typical example would be memory past the top of the stack. Copying GCs permit discarding all of new space after a flip, that is, pages between the allocation pointer and the end of new space can be paged out by discarding them.)
If the GC doesn't require zero-filled memory (perhaps because everything always gets initialized explicitly -- typical for ML, etc) and you can get garbage memory from the OS, you can save the cost of bzeroing.