diff --git a/README.md b/README.md
index 8d6d955aa..185761704 100644
--- a/README.md
+++ b/README.md
@@ -5,7 +5,7 @@ compiler for Standard ML which implements support for
nested (fork-join) parallelism. MPL generates executables with
excellent multicore performance, utilizing a novel approach to
memory management based on the theory of disentanglement
-[[1](#rmab16),[2](#gwraf18),[3](#wyfa20),[4](#awa21),[5](#waa22)].
+[[1](#rmab16),[2](#gwraf18),[3](#wyfa20),[4](#awa21),[5](#waa22),[6](#awa23)].
MPL is research software and is being actively developed.
@@ -25,7 +25,7 @@ $ docker run -it shwestrick/mpl /bin/bash
...# examples/bin/primes @mpl procs 4 --
```
-If you want to try out MPL by writing and compiling your own code, we recommend
+To write and compile your own code, we recommend
mounting a local directory inside the container. For example, here's how you
can use MPL to compile and run your own `main.mlb` in the current directory.
(To mount some other directory, replace `$(pwd -P)` with a different path.)
@@ -38,46 +38,36 @@ $ docker run -it -v $(pwd -P):/root/mycode shwestrick/mpl /bin/bash
...# ./main @mpl procs 4 --
```
+## Benchmark Suite
-## Build and Install (from source)
+The [Parallel ML benchmark suite](https://github.com/MPLLang/parallel-ml-bench)
+provides many examples of sophisticated parallel algorithms and
+applications in MPL, as well as cross-language performance comparisons with
+C++, Go, Java,
+and multicore OCaml.
-### Requirements
+## Libraries
-MPL has only been tested on Linux with x86-64. The following software is
-required.
- * [GCC](http://gcc.gnu.org)
- * [GMP](http://gmplib.org) (GNU Multiple Precision arithmetic library)
- * [GNU Make](http://savannah.gnu.org/projects/make), [GNU Bash](http://www.gnu.org/software/bash/)
- * binutils (`ar`, `ranlib`, `strip`, ...)
- * miscellaneous Unix utilities (`diff`, `find`, `grep`, `gzip`, `patch`, `sed`, `tar`, `xargs`, ...)
- * Standard ML compiler and tools:
- - Recommended: [MLton](http://mlton.org) (`mlton`, `mllex`, and `mlyacc`). Pre-built binary packages for MLton can be installed via an OS package manager or (for select platforms) obtained from http://mlton.org.
- - Supported but not recommended: [SML/NJ](http://www.smlnj.org) (`sml`, `ml-lex`, `ml-yacc`).
+We recommend using the [smlpkg](https://github.com/diku-dk/smlpkg) package
+manager. MPL supports the full SML language, so existing libraries for
+SML can be used.
-### Instructions
+In addition, here are a few libraries that make use of MPL for parallelism:
+ * [`github.com/MPLLang/mpllib`](https://github.com/MPLLang/mpllib): implements
+ a variety of data structures (sequences, sets, dictionaries, graphs, matrices, meshes,
+ images, etc.) and parallel algorithms (map, reduce, scan, filter, sorting,
+ search, tokenization, graph processing, computational geometry, etc.). Also
+ includes basic utilities (e.g. parsing command-line arguments) and
+ benchmarking infrastructure.
+ * [`github.com/shwestrick/sml-audio`](https://github.com/shwestrick/sml-audio):
+ a library for audio processing with I/O support for `.wav` files.
-The following builds the compiler at `build/bin/mpl`.
-```
-$ make all
-```
-
-After building, MPL can then be installed to `/usr/local`:
-```
-$ make install
-```
-or to a custom directory with the `PREFIX` option:
-```
-$ make PREFIX=/opt/mpl install
-```
## Parallel and Concurrent Extensions
MPL extends SML with a number of primitives for parallelism and concurrency.
Take a look at `examples/` to see these primitives in action.
-**Note**: Before writing any of your own code, make sure to read the section
-"Disentanglement" below.
-
### The `ForkJoin` Structure
```
val par: (unit -> 'a) * (unit -> 'b) -> 'a * 'b
@@ -163,8 +153,6 @@ by default.
* `-debug true -debug-runtime true -keep g` For debugging, keeps the generated
C files and uses the debug version of the runtime (with assertions enabled).
The resulting executable is somewhat peruse-able with tools like `gdb`.
-* `-detect-entanglement true` enables the dynamic entanglement detector.
-See below for more information.
For example:
```
@@ -198,60 +186,13 @@ argument `bar` using 4 pinned processors.
$ foo @mpl procs 4 set-affinity -- bar
```
-## Disentanglement
-
-Currently, MPL only supports programs that are **disentangled**, which
-(roughly speaking) is the property that concurrent threads remain oblivious
-to each other's allocations [[3](#wyfa20)].
-
-Here are a number of different ways to guarantee that your code is
-disentangled.
-- (Option 1) Use only purely functional data (no `ref`s or `array`s). This is
-the simplest but most restrictive approach.
-- (Option 2) If using mutable data, use only non-pointer data. MPL guarantees
-that simple types (`int`, `word`, `char`, `real`, etc.) are never
-indirected through a
-pointer, so for example it is safe to use `int array`. Other types such as
-`int list array` and `int array array` should be avoided. This approach
-is very easy to check and is surprisingly general. Data races are fine!
-- (Option 3) Make sure that your program is race-free. This can be
-tricky to check but allows you to use any type of data. Many of our example
-programs are race-free.
-
-## Entanglement Detection
-
-Whenever a thread acquires a reference
-to an object allocated concurrently by some other thread, then we say that
-the two threads are **entangled**. This is a violation of disentanglement,
-which MPL currently does not allow.
-
-MPL has a built-in dynamic entanglement detector which is enabled by default.
-The entanglement detector monitors individual reads and writes during execution;
-if entanglement is found, the program will terminate with an error message.
-
-The entanglement detector is both "sound" and "complete": there are neither
-false negatives nor false positives. In other words, the detector always raises
-an alarm when entanglement occurs, and never raises an alarm otherwise. Note
-however that entanglement (and therefore also entanglement detection) can
-be execution-dependent: if your program is non-deterministic (e.g. racy),
-then entanglement may or may not occur depending on the outcome of a race
-condition. Similarly, entanglement could be input-dependent.
-
-Entanglement detection is highly optimized, and typically has negligible
-overhead (see [[5](#waa22)]). It can be disabled at compile-time by passing
-`-detect-entanglement false`; however, we recommend against doing so. MPL
-relies on entanglement detection to ensure memory safety. We recommend leaving
-entanglement detection enabled at all times.
## Bugs and Known Issues
### Basis Library
-In general, the basis library has not yet been thoroughly scrubbed, and many
-functions may not be safe for parallelism
+The basis library is inherited from (sequential) SML. It has not yet been
+thoroughly scrubbed, and some functions may not be safe for parallelism
([#41](https://github.com/MPLLang/mpl/issues/41)).
-Some known issues:
-* `Int.toString` is racy when called in parallel.
-* `Real.fromString` may throw an error when called in parallel.
### Garbage Collection
* ([#115](https://github.com/MPLLang/mpl/issues/115)) The GC is currently
@@ -274,6 +215,61 @@ unsupported, including (but not limited to):
* `Weak`
* `World`
+
+## Build and Install (from source)
+
+### Requirements
+
+MPL has only been tested on Linux with x86-64. The following software is
+required.
+ * [GCC](http://gcc.gnu.org)
+ * [GMP](http://gmplib.org) (GNU Multiple Precision arithmetic library)
+ * [GNU Make](http://savannah.gnu.org/projects/make), [GNU Bash](http://www.gnu.org/software/bash/)
+ * binutils (`ar`, `ranlib`, `strip`, ...)
+ * miscellaneous Unix utilities (`diff`, `find`, `grep`, `gzip`, `patch`, `sed`, `tar`, `xargs`, ...)
+ * Standard ML compiler and tools:
+ - Recommended: [MLton](http://mlton.org) (`mlton`, `mllex`, and `mlyacc`). Pre-built binary packages for MLton can be installed via an OS package manager or (for select platforms) obtained from http://mlton.org.
+ - Supported but not recommended: [SML/NJ](http://www.smlnj.org) (`sml`, `ml-lex`, `ml-yacc`).
+ * (If using [`mpl-switch`](https://github.com/mpllang/mpl-switch)): Python 3 and `git`.
+
+### Installation with `mpl-switch`
+
+The [`mpl-switch`](https://github.com/mpllang/mpl-switch) utility makes it
+easy to install multiple versions of MPL on the same system and switch
+between them. After setting up `mpl-switch`, you can install MPL as follows:
+```
+$ mpl-switch install v0.4
+$ mpl-switch select v0.4
+```
+
+You can use any commit hash or tag name from the MPL repo to pick a
+particular version of MPL. Installed versions are stored in `~/.mpl/`; this
+folder is safe to delete at any moment, as it can always be regenerated. To
+see what versions of MPL are currently installed, do:
+```
+$ mpl-switch list
+```
+
+### Manual Instructions
+
+Alternatively, you can manually build `mpl` by cloning this repo and then
+performing the following steps.
+
+**Build the executable**. This produces an executable at `build/bin/mpl`:
+```
+$ make
+```
+
+**Put it where you want it**. After building, MPL can then be installed to
+`/usr/local`:
+```
+$ make install
+```
+or to a custom directory with the `PREFIX` option:
+```
+$ make PREFIX=/opt/mpl install
+```
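+
+After installing, you can compile a program by passing its `.mlb` file to
+`mpl`, and then run the resulting executable. For example, assuming `mpl` is
+on your `PATH` and you have a `main.mlb` (as in the Docker example above):
+```
+$ mpl main.mlb
+$ ./main @mpl procs 4 --
+```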
+
## References
[1]
@@ -300,3 +296,8 @@ POPL 2021.
[Entanglement Detection with Near-Zero Cost](http://www.cs.cmu.edu/~swestric/22/icfp-detect.pdf).
Sam Westrick, Jatin Arora, and Umut A. Acar.
ICFP 2022.
+
+[6]
+[Efficient Parallel Functional Programming with Effects](https://www.cs.cmu.edu/~swestric/23/epfpe.pdf).
+Jatin Arora, Sam Westrick, and Umut A. Acar.
+PLDI 2023.
diff --git a/basis-library/mlton/thread.sig b/basis-library/mlton/thread.sig
index 90b9caaaa..9b25da5b9 100644
--- a/basis-library/mlton/thread.sig
+++ b/basis-library/mlton/thread.sig
@@ -42,6 +42,8 @@ signature MLTON_THREAD =
structure HierarchicalHeap :
sig
type thread = Basic.t
+ type clear_set
+ type finished_clear_set_grain
(* The level (depth) of a thread's heap in the hierarchy. *)
val getDepth : thread -> int
@@ -69,6 +71,16 @@ signature MLTON_THREAD =
(* Move all chunks at the current depth up one level. *)
val promoteChunks : thread -> unit
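+      (* Entanglement-suspect bookkeeping: a clear set is a batch of suspect
+       * entries taken from a thread at a given depth. Its chunks can be
+       * processed in grains (possibly in parallel), and each finished grain
+       * is committed back, after which the clear set is deleted. *)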
+ val clearSuspectsAtDepth: thread * int -> unit
+ val numSuspectsAtDepth: thread * int -> int
+ val takeClearSetAtDepth: thread * int -> clear_set
+ val numChunksInClearSet: clear_set -> int
+ val processClearSetGrain: clear_set * int * int -> finished_clear_set_grain
+ val commitFinishedClearSetGrain: thread * finished_clear_set_grain -> unit
+ val deleteClearSet: clear_set -> unit
+
+ val updateBytesPinnedEntangledWatermark: unit -> unit
+
(* "put a new thread in the hierarchy *)
val moveNewThreadToDepth : thread * int -> unit
diff --git a/basis-library/mlton/thread.sml b/basis-library/mlton/thread.sml
index 4b3b98659..1e7bf8c5c 100644
--- a/basis-library/mlton/thread.sml
+++ b/basis-library/mlton/thread.sml
@@ -73,6 +73,9 @@ struct
type thread = Basic.t
type t = MLtonPointer.t
+ type clear_set = MLtonPointer.t
+ type finished_clear_set_grain = MLtonPointer.t
+
fun forceLeftHeap (myId, t) = Prim.forceLeftHeap(Word32.fromInt myId, t)
fun forceNewChunk () = Prim.forceNewChunk (gcState ())
fun registerCont (kl, kr, k, t) = Prim.registerCont(kl, kr, k, t)
@@ -90,6 +93,30 @@ struct
Prim.moveNewThreadToDepth (t, Word32.fromInt d)
fun checkFinishedCCReadyToJoin () =
Prim.checkFinishedCCReadyToJoin (gcState ())
+
+ fun clearSuspectsAtDepth (t, d) =
+ Prim.clearSuspectsAtDepth (gcState (), t, Word32.fromInt d)
+
+ fun numSuspectsAtDepth (t, d) =
+ Word64.toInt (Prim.numSuspectsAtDepth (gcState (), t, Word32.fromInt d))
+
+ fun takeClearSetAtDepth (t, d) =
+ Prim.takeClearSetAtDepth (gcState (), t, Word32.fromInt d)
+
+ fun numChunksInClearSet c =
+ Word64.toInt (Prim.numChunksInClearSet (gcState (), c))
+
+ fun processClearSetGrain (c, start, stop) =
+ Prim.processClearSetGrain (gcState (), c, Word64.fromInt start, Word64.fromInt stop)
+
+ fun commitFinishedClearSetGrain (t, fcsg) =
+ Prim.commitFinishedClearSetGrain (gcState (), t, fcsg)
+
+ fun deleteClearSet c =
+ Prim.deleteClearSet (gcState (), c)
+
+ fun updateBytesPinnedEntangledWatermark () =
+ Prim.updateBytesPinnedEntangledWatermark (gcState ())
end
structure Disentanglement =
diff --git a/basis-library/mpl/gc.sig b/basis-library/mpl/gc.sig
index c1c596aa4..1b4c5cabb 100644
--- a/basis-library/mpl/gc.sig
+++ b/basis-library/mpl/gc.sig
@@ -17,12 +17,15 @@ sig
*)
val numberDisentanglementChecks: unit -> IntInf.int
- (* How many times entanglement has been detected at a read barrier.
- *)
- val numberEntanglementsDetected: unit -> IntInf.int
+  (* How many times entanglement has been detected. *)
+ val numberEntanglements: unit -> IntInf.int
+
+ val approxRaceFactor: unit -> Real32.real
val numberSuspectsMarked: unit -> IntInf.int
val numberSuspectsCleared: unit -> IntInf.int
+ val bytesPinnedEntangled: unit -> IntInf.int
+ val bytesPinnedEntangledWatermark: unit -> IntInf.int
val getControlMaxCCDepth: unit -> int
@@ -43,6 +46,8 @@ sig
val localBytesReclaimed: unit -> IntInf.int
val localBytesReclaimedOfProc: int -> IntInf.int
+ val bytesInScopeForLocal: unit -> IntInf.int
+
val numLocalGCs: unit -> IntInf.int
val numLocalGCsOfProc: int -> IntInf.int
@@ -52,21 +57,28 @@ sig
val promoTime: unit -> Time.time
val promoTimeOfProc: int -> Time.time
+  val numCCs: unit -> IntInf.int
+  val numCCsOfProc: int -> IntInf.int
+
+ val ccBytesReclaimed: unit -> IntInf.int
+ val ccBytesReclaimedOfProc: int -> IntInf.int
+
+ val bytesInScopeForCC: unit -> IntInf.int
+
+ val ccTime: unit -> Time.time
+ val ccTimeOfProc: int -> Time.time
+
+ (* DEPRECATED *)
val rootBytesReclaimed: unit -> IntInf.int
val rootBytesReclaimedOfProc: int -> IntInf.int
-
val internalBytesReclaimed: unit -> IntInf.int
val internalBytesReclaimedOfProc: int -> IntInf.int
-
val numRootCCs: unit -> IntInf.int
val numRootCCsOfProc: int -> IntInf.int
-
val numInternalCCs: unit -> IntInf.int
val numInternalCCsOfProc: int -> IntInf.int
-
val rootCCTime: unit -> Time.time
val rootCCTimeOfProc: int -> Time.time
-
val internalCCTime: unit -> Time.time
val internalCCTimeOfProc: int -> Time.time
end
diff --git a/basis-library/mpl/gc.sml b/basis-library/mpl/gc.sml
index 83f0a9d48..5f2417e04 100644
--- a/basis-library/mpl/gc.sml
+++ b/basis-library/mpl/gc.sml
@@ -24,24 +24,27 @@ struct
GC.getCumulativeStatisticsBytesAllocatedOfProc (gcState (), Word32.fromInt p)
fun getCumulativeStatisticsLocalBytesReclaimedOfProc p =
GC.getCumulativeStatisticsLocalBytesReclaimedOfProc (gcState (), Word32.fromInt p)
- fun getNumRootCCsOfProc p =
- GC.getNumRootCCsOfProc (gcState (), Word32.fromInt p)
- fun getNumInternalCCsOfProc p =
- GC.getNumInternalCCsOfProc (gcState (), Word32.fromInt p)
- fun getRootCCMillisecondsOfProc p =
- GC.getRootCCMillisecondsOfProc (gcState (), Word32.fromInt p)
- fun getInternalCCMillisecondsOfProc p =
- GC.getInternalCCMillisecondsOfProc (gcState (), Word32.fromInt p)
- fun getRootCCBytesReclaimedOfProc p =
- GC.getRootCCBytesReclaimedOfProc (gcState (), Word32.fromInt p)
- fun getInternalCCBytesReclaimedOfProc p =
- GC.getInternalCCBytesReclaimedOfProc (gcState (), Word32.fromInt p)
+ fun getNumCCsOfProc p =
+ GC.getNumCCsOfProc (gcState (), Word32.fromInt p)
+ fun getCCMillisecondsOfProc p =
+ GC.getCCMillisecondsOfProc (gcState (), Word32.fromInt p)
+ fun getCCBytesReclaimedOfProc p =
+ GC.getCCBytesReclaimedOfProc (gcState (), Word32.fromInt p)
+
+ fun bytesInScopeForLocal () =
+ C_UIntmax.toLargeInt (GC.bytesInScopeForLocal (gcState ()))
+
+ fun bytesInScopeForCC () =
+ C_UIntmax.toLargeInt (GC.bytesInScopeForCC (gcState ()))
fun numberDisentanglementChecks () =
C_UIntmax.toLargeInt (GC.numberDisentanglementChecks (gcState ()))
- fun numberEntanglementsDetected () =
- C_UIntmax.toLargeInt (GC.numberEntanglementsDetected (gcState ()))
+ fun numberEntanglements () =
+ C_UIntmax.toLargeInt (GC.numberEntanglements (gcState ()))
+
+ fun approxRaceFactor () =
+ (GC.approxRaceFactor (gcState ()))
fun getControlMaxCCDepth () =
Word32.toInt (GC.getControlMaxCCDepth (gcState ()))
@@ -51,6 +54,12 @@ struct
fun numberSuspectsCleared () =
C_UIntmax.toLargeInt (GC.numberSuspectsCleared (gcState ()))
+
+ fun bytesPinnedEntangled () =
+ C_UIntmax.toLargeInt (GC.bytesPinnedEntangled (gcState ()))
+
+ fun bytesPinnedEntangledWatermark () =
+ C_UIntmax.toLargeInt (GC.bytesPinnedEntangledWatermark (gcState ()))
end
exception NotYetImplemented of string
@@ -92,34 +101,19 @@ struct
; millisecondsToTime (getPromoMillisecondsOfProc p)
)
- fun numRootCCsOfProc p =
- ( checkProcNum p
- ; C_UIntmax.toLargeInt (getNumRootCCsOfProc p)
- )
-
- fun numInternalCCsOfProc p =
- ( checkProcNum p
- ; C_UIntmax.toLargeInt (getNumInternalCCsOfProc p)
- )
-
- fun rootCCTimeOfProc p =
+ fun numCCsOfProc p =
( checkProcNum p
- ; millisecondsToTime (getRootCCMillisecondsOfProc p)
+ ; C_UIntmax.toLargeInt (getNumCCsOfProc p)
)
- fun internalCCTimeOfProc p =
+ fun ccTimeOfProc p =
( checkProcNum p
- ; millisecondsToTime (getInternalCCMillisecondsOfProc p)
+ ; millisecondsToTime (getCCMillisecondsOfProc p)
)
- fun rootBytesReclaimedOfProc p =
+ fun ccBytesReclaimedOfProc p =
( checkProcNum p
- ; C_UIntmax.toLargeInt (getRootCCBytesReclaimedOfProc p)
- )
-
- fun internalBytesReclaimedOfProc p =
- ( checkProcNum p
- ; C_UIntmax.toLargeInt (getInternalCCBytesReclaimedOfProc p)
+ ; C_UIntmax.toLargeInt (getCCBytesReclaimedOfProc p)
)
fun sumAllProcs (f: 'a * 'a -> 'a) (perProc: int -> 'a) =
@@ -148,28 +142,39 @@ struct
fun promoTime () =
millisecondsToTime (sumAllProcs C_UIntmax.+ getPromoMillisecondsOfProc)
- fun numRootCCs () =
- C_UIntmax.toLargeInt
- (sumAllProcs C_UIntmax.+ getNumRootCCsOfProc)
-
- fun numInternalCCs () =
+ fun numCCs () =
C_UIntmax.toLargeInt
- (sumAllProcs C_UIntmax.+ getNumInternalCCsOfProc)
-
- fun rootCCTime () =
- millisecondsToTime
- (sumAllProcs C_UIntmax.+ getRootCCMillisecondsOfProc)
+ (sumAllProcs C_UIntmax.+ getNumCCsOfProc)
- fun internalCCTime () =
+ fun ccTime () =
millisecondsToTime
- (sumAllProcs C_UIntmax.+ getInternalCCMillisecondsOfProc)
-
- fun rootBytesReclaimed () =
- C_UIntmax.toLargeInt
- (sumAllProcs C_UIntmax.+ getRootCCBytesReclaimedOfProc)
+ (sumAllProcs C_UIntmax.+ getCCMillisecondsOfProc)
- fun internalBytesReclaimed () =
+ fun ccBytesReclaimed () =
C_UIntmax.toLargeInt
- (sumAllProcs C_UIntmax.+ getInternalCCBytesReclaimedOfProc)
+        (sumAllProcs C_UIntmax.+ getCCBytesReclaimedOfProc)
+
+
+ (* ======================================================================
+ * DEPRECATED
+ *)
+
+ exception Deprecated of string
+
+ fun d name (_: 'a) : 'b =
+ raise Deprecated ("MPL.GC." ^ name)
+
+ val rootBytesReclaimed = d "rootBytesReclaimed"
+ val rootBytesReclaimedOfProc = d "rootBytesReclaimedOfProc"
+ val internalBytesReclaimed = d "internalBytesReclaimed"
+ val internalBytesReclaimedOfProc = d "internalBytesReclaimedOfProc"
+ val numRootCCs = d "numRootCCs"
+ val numRootCCsOfProc = d "numRootCCsOfProc"
+ val numInternalCCs = d "numInternalCCs"
+ val numInternalCCsOfProc = d "numInternalCCsOfProc"
+ val rootCCTime = d "rootCCTime"
+ val rootCCTimeOfProc = d "rootCCTimeOfProc"
+ val internalCCTime = d "internalCCTime"
+ val internalCCTimeOfProc = d "internalCCTimeOfProc"
end
diff --git a/basis-library/primitive/prim-mlton.sml b/basis-library/primitive/prim-mlton.sml
index e6d7021d6..54376834c 100644
--- a/basis-library/primitive/prim-mlton.sml
+++ b/basis-library/primitive/prim-mlton.sml
@@ -159,20 +159,29 @@ structure GC =
val getCumulativeStatisticsLocalBytesReclaimedOfProc = _import
"GC_getCumulativeStatisticsLocalBytesReclaimedOfProc" runtime private: GCState.t * Word32.word -> C_UIntmax.t;
- val getNumRootCCsOfProc = _import "GC_getNumRootCCsOfProc" runtime private: GCState.t * Word32.word -> C_UIntmax.t;
- val getNumInternalCCsOfProc = _import "GC_getNumInternalCCsOfProc" runtime private: GCState.t * Word32.word -> C_UIntmax.t;
- val getRootCCMillisecondsOfProc = _import "GC_getRootCCMillisecondsOfProc" runtime private: GCState.t * Word32.word -> C_UIntmax.t;
- val getInternalCCMillisecondsOfProc = _import "GC_getInternalCCMillisecondsOfProc" runtime private: GCState.t * Word32.word -> C_UIntmax.t;
- val getRootCCBytesReclaimedOfProc = _import "GC_getRootCCBytesReclaimedOfProc" runtime private: GCState.t * Word32.word -> C_UIntmax.t;
- val getInternalCCBytesReclaimedOfProc = _import "GC_getInternalCCBytesReclaimedOfProc" runtime private: GCState.t * Word32.word -> C_UIntmax.t;
+ val bytesInScopeForLocal =
+ _import "GC_bytesInScopeForLocal" runtime private:
+ GCState.t -> C_UIntmax.t;
+
+ val bytesInScopeForCC =
+ _import "GC_bytesInScopeForCC" runtime private:
+ GCState.t -> C_UIntmax.t;
+
+ val getNumCCsOfProc = _import "GC_getNumCCsOfProc" runtime private: GCState.t * Word32.word -> C_UIntmax.t;
+ val getCCMillisecondsOfProc = _import "GC_getCCMillisecondsOfProc" runtime private: GCState.t * Word32.word -> C_UIntmax.t;
+ val getCCBytesReclaimedOfProc = _import "GC_getCCBytesReclaimedOfProc" runtime private: GCState.t * Word32.word -> C_UIntmax.t;
val numberDisentanglementChecks = _import "GC_numDisentanglementChecks" runtime private: GCState.t -> C_UIntmax.t;
+ val numberEntanglements = _import "GC_numEntanglements" runtime private: GCState.t -> C_UIntmax.t;
- val numberEntanglementsDetected = _import "GC_numEntanglementsDetected" runtime private: GCState.t -> C_UIntmax.t;
+ val approxRaceFactor = _import "GC_approxRaceFactor" runtime private: GCState.t -> Real32.real;
val numberSuspectsMarked = _import "GC_numSuspectsMarked" runtime private: GCState.t -> C_UIntmax.t;
val numberSuspectsCleared = _import "GC_numSuspectsCleared" runtime private: GCState.t -> C_UIntmax.t;
+
+ val bytesPinnedEntangled = _import "GC_bytesPinnedEntangled" runtime private: GCState.t -> C_UIntmax.t;
+ val bytesPinnedEntangledWatermark = _import "GC_bytesPinnedEntangledWatermark" runtime private: GCState.t -> C_UIntmax.t;
end
structure HM =
@@ -368,6 +377,24 @@ structure Thread =
val setMinLocalCollectionDepth = _import "GC_HH_setMinLocalCollectionDepth" runtime private: thread * Word32.word -> unit;
val mergeThreads = _import "GC_HH_mergeThreads" runtime private: thread * thread -> unit;
val promoteChunks = _import "GC_HH_promoteChunks" runtime private: thread -> unit;
+ val clearSuspectsAtDepth = _import "GC_HH_clearSuspectsAtDepth" runtime private:
+ GCState.t * thread * Word32.word -> unit;
+ val numSuspectsAtDepth = _import "GC_HH_numSuspectsAtDepth" runtime private:
+ GCState.t * thread * Word32.word -> Word64.word;
+ val takeClearSetAtDepth = _import "GC_HH_takeClearSetAtDepth" runtime private:
+ GCState.t * thread * Word32.word -> Pointer.t;
+ val numChunksInClearSet = _import "GC_HH_numChunksInClearSet" runtime private:
+ GCState.t * Pointer.t -> Word64.word;
+ val processClearSetGrain = _import "GC_HH_processClearSetGrain" runtime private:
+ GCState.t * Pointer.t * Word64.word * Word64.word -> Pointer.t;
+ val commitFinishedClearSetGrain = _import "GC_HH_commitFinishedClearSetGrain" runtime private:
+ GCState.t * thread * Pointer.t -> unit;
+ val deleteClearSet = _import "GC_HH_deleteClearSet" runtime private:
+ GCState.t * Pointer.t -> unit;
+
+ val updateBytesPinnedEntangledWatermark =
+ _import "GC_updateBytesPinnedEntangledWatermark" runtime private:
+ GCState.t -> unit;
val decheckFork = _import "GC_HH_decheckFork" runtime private:
GCState.t * Word64.word ref * Word64.word ref -> unit;
diff --git a/basis-library/schedulers/shh/CumulativePerProcTimer.sml b/basis-library/schedulers/shh/CumulativePerProcTimer.sml
new file mode 100644
index 000000000..4f0202981
--- /dev/null
+++ b/basis-library/schedulers/shh/CumulativePerProcTimer.sml
@@ -0,0 +1,73 @@
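+(* A cumulative per-processor timer: each processor starts, ticks, and stops
+ * its own timer independently, and `cumulative` returns the sum of all
+ * per-processor totals. *)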
+functor CumulativePerProcTimer(val timerName: string):
+sig
+ val start: unit -> unit
+ val tick: unit -> unit (* Essentially the same as (stop();start()) *)
+ val stop: unit -> unit
+
+ val isStarted: unit -> bool
+
+ val cumulative: unit -> Time.time
+end =
+struct
+
+ val numP = MLton.Parallel.numberOfProcessors
+ fun myId () = MLton.Parallel.processorNumber ()
+
+ fun die strfn =
+ ( print (Int.toString (myId ()) ^ ": " ^ strfn ())
+ ; OS.Process.exit OS.Process.failure
+ )
+
+ val totals = Array.array (numP, Time.zeroTime)
+ val starts = Array.array (numP, Time.zeroTime)
+ val isRunning = Array.array (numP, false)
+
+ fun isStarted () =
+ Array.sub (isRunning, myId())
+
+ fun start () =
+ let
+ val p = myId()
+ in
+ if Array.sub (isRunning, p) then
+ die (fn _ => "timer \"" ^ timerName ^ "\": start after start")
+ else
+ ( Array.update (isRunning, p, true)
+ ; Array.update (starts, p, Time.now ())
+ )
+ end
+
+ fun tick () =
+ let
+ val p = myId()
+ val tnow = Time.now ()
+ val delta = Time.- (tnow, Array.sub (starts, p))
+ in
+ if not (Array.sub (isRunning, p)) then
+ die (fn _ => "timer \"" ^ timerName ^ "\": tick while stopped")
+ else
+ ( Array.update (totals, p, Time.+ (Array.sub (totals, p), delta))
+ ; Array.update (starts, p, tnow)
+ )
+ end
+
+ fun stop () =
+ let
+ val p = myId()
+ val tnow = Time.now ()
+ val delta = Time.- (tnow, Array.sub (starts, p))
+ in
+ if not (Array.sub (isRunning, p)) then
+ die (fn _ => "timer \"" ^ timerName ^ "\": stop while stopped")
+ else
+ ( Array.update (isRunning, p, false)
+ ; Array.update (totals, p, Time.+ (Array.sub (totals, p), delta))
+ )
+ end
+
+ fun cumulative () =
+ ( if isStarted () then tick () else ()
+ ; Array.foldl Time.+ Time.zeroTime totals
+ )
+
+end
\ No newline at end of file
diff --git a/basis-library/schedulers/shh/DummyTimer.sml b/basis-library/schedulers/shh/DummyTimer.sml
new file mode 100644
index 000000000..ff2f17b7c
--- /dev/null
+++ b/basis-library/schedulers/shh/DummyTimer.sml
@@ -0,0 +1,19 @@
+functor DummyTimer(val timerName: string):
+sig
+ val start: unit -> unit
+ val tick: unit -> unit (* Essentially the same as (stop();start()) *)
+ val stop: unit -> unit
+
+ val isStarted: unit -> bool
+
+ val cumulative: unit -> Time.time
+end =
+struct
+
+ fun isStarted () = false
+ fun start () = ()
+ fun tick () = ()
+ fun stop () = ()
+ fun cumulative () = Time.zeroTime
+
+end
\ No newline at end of file
diff --git a/basis-library/schedulers/shh/FORK_JOIN.sig b/basis-library/schedulers/shh/FORK_JOIN.sig
index ddc867801..c733a0fa7 100644
--- a/basis-library/schedulers/shh/FORK_JOIN.sig
+++ b/basis-library/schedulers/shh/FORK_JOIN.sig
@@ -8,9 +8,7 @@ sig
(* synonym for par *)
val fork: (unit -> 'a) * (unit -> 'b) -> 'a * 'b
- (* other scheduler hooks *)
- val communicate: unit -> unit
- val getIdleTime: int -> Time.time
-
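+  (* Cumulative idle/work time, summed over all worker processors. *)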
+ val idleTimeSoFar: unit -> Time.time
+ val workTimeSoFar: unit -> Time.time
val maxForkDepthSoFar: unit -> int
end
diff --git a/basis-library/schedulers/shh/Scheduler.sml b/basis-library/schedulers/shh/Scheduler.sml
index 5c3179d05..57517c441 100644
--- a/basis-library/schedulers/shh/Scheduler.sml
+++ b/basis-library/schedulers/shh/Scheduler.sml
@@ -100,33 +100,11 @@ struct
fun dbgmsg' _ = ()
(* ========================================================================
- * IDLENESS TRACKING
+ * TIMERS
*)
- val idleTotals = Array.array (P, Time.zeroTime)
- fun getIdleTime p = arraySub (idleTotals, p)
- fun updateIdleTime (p, deltaTime) =
- arrayUpdate (idleTotals, p, Time.+ (getIdleTime p, deltaTime))
-
-(*
- val timerGrain = 256
- fun startTimer myId = (myId, 0, Time.now ())
- fun tickTimer (p, count, t) =
- if count < timerGrain then (p, count+1, t) else
- let
- val t' = Time.now ()
- val diff = Time.- (t', t)
- val _ = updateIdleTime (p, diff)
- in
- (p, 0, t')
- end
- fun stopTimer (p, _, t) =
- (tickTimer (p, timerGrain, t); ())
-*)
-
- fun startTimer _ = ()
- fun tickTimer _ = ()
- fun stopTimer _ = ()
+ structure IdleTimer = CumulativePerProcTimer(val timerName = "idle")
+ structure WorkTimer = CumulativePerProcTimer(val timerName = "work")
(** ========================================================================
* MAXIMUM FORK DEPTHS
@@ -218,8 +196,6 @@ struct
Queue.tryPopTop queue
end
- fun communicate () = ()
-
fun push x =
let
val myId = myWorkerId ()
@@ -273,9 +249,6 @@ struct
Finished x => x
| Raised e => raise e
- val communicate = communicate
- val getIdleTime = getIdleTime
-
(* Must be called from a "user" thread, which has an associated HH *)
fun parfork thread depth (f : unit -> 'a, g : unit -> 'b) =
let
@@ -339,6 +312,8 @@ struct
( HH.promoteChunks thread
; HH.setDepth (thread, depth)
; DE.decheckJoin (tidLeft, tidRight)
+ ; maybeParClearSuspectsAtDepth (thread, depth)
+ ; if depth <> 1 then () else HH.updateBytesPinnedEntangledWatermark ()
(* ; dbgmsg' (fn _ => "join fast at depth " ^ Int.toString depth) *)
(* ; HH.forceNewChunk () *)
; let
@@ -362,6 +337,8 @@ struct
HH.setDepth (thread, depth);
DE.decheckJoin (tidLeft, tidRight);
setQueueDepth (myWorkerId ()) depth;
+ maybeParClearSuspectsAtDepth (thread, depth);
+ if depth <> 1 then () else HH.updateBytesPinnedEntangledWatermark ();
(* dbgmsg' (fn _ => "join slow at depth " ^ Int.toString depth); *)
case HM.refDerefNoBarrier rightSideResult of
NONE => die (fn _ => "scheduler bug: join failed: missing result")
@@ -374,8 +351,85 @@ struct
(extractResult fr, extractResult gr)
end
+
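+  (* simpleParFork is a pared-down copy of parfork, specialized to tasks
+   * that return unit. It is used (via simpleFork below) for the scheduler's
+   * own internal parallelism, e.g. clearing suspect sets, and never forks
+   * a CC at its depth. *)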
+ and simpleParFork thread depth (f: unit -> unit, g: unit -> unit) : unit =
+ let
+ val rightSideThread = ref (NONE: Thread.t option)
+ val rightSideResult = ref (NONE: unit result option)
+ val incounter = ref 2
+
+ val (tidLeft, tidRight) = DE.decheckFork ()
+
+ fun g' () =
+ let
+ val () = DE.copySyncDepthsFromThread (thread, depth+1)
+ val () = DE.decheckSetTid tidRight
+ val gr = result g
+ val t = Thread.current ()
+ in
+ rightSideThread := SOME t;
+ rightSideResult := SOME gr;
+ if decrementHitsZero incounter then
+ ( setQueueDepth (myWorkerId ()) (depth+1)
+ ; threadSwitch thread
+ )
+ else
+ returnToSched ()
+ end
+ val _ = push (NormalTask g')
+ val _ = HH.setDepth (thread, depth + 1)
+ (* NOTE: off-by-one on purpose. Runtime depths start at 1. *)
+ val _ = recordForkDepth depth
+
+ val _ = DE.decheckSetTid tidLeft
+ val fr = result f
+ val tidLeft = DE.decheckGetTid thread
- fun forkGC thread depth (f : unit -> 'a, g : unit -> 'b) =
+ val gr =
+ if popDiscard () then
+ ( HH.promoteChunks thread
+ ; HH.setDepth (thread, depth)
+ ; DE.decheckJoin (tidLeft, tidRight)
+ ; maybeParClearSuspectsAtDepth (thread, depth)
+ ; if depth <> 1 then () else HH.updateBytesPinnedEntangledWatermark ()
+ (* ; dbgmsg' (fn _ => "join fast at depth " ^ Int.toString depth) *)
+ (* ; HH.forceNewChunk () *)
+ ; let
+ val gr = result g
+ in
+ (* (gr, DE.decheckGetTid thread) *)
+ gr
+ end
+ )
+ else
+ ( clear () (* this should be safe after popDiscard fails? *)
+ ; if decrementHitsZero incounter then () else returnToSched ()
+ ; case HM.refDerefNoBarrier rightSideThread of
+ NONE => die (fn _ => "scheduler bug: join failed")
+ | SOME t =>
+ let
+ val tidRight = DE.decheckGetTid t
+ in
+ HH.mergeThreads (thread, t);
+ HH.promoteChunks thread;
+ HH.setDepth (thread, depth);
+ DE.decheckJoin (tidLeft, tidRight);
+ setQueueDepth (myWorkerId ()) depth;
+ maybeParClearSuspectsAtDepth (thread, depth);
+ if depth <> 1 then () else HH.updateBytesPinnedEntangledWatermark ();
+ (* dbgmsg' (fn _ => "join slow at depth " ^ Int.toString depth); *)
+ case HM.refDerefNoBarrier rightSideResult of
+ NONE => die (fn _ => "scheduler bug: join failed: missing result")
+ | SOME gr => gr
+ end
+ )
+ in
+ (extractResult fr, extractResult gr);
+ ()
+ end
+
+
+ and forkGC thread depth (f : unit -> 'a, g : unit -> 'b) =
let
val heapId = ref (HH.getRoot thread)
val gcTaskTuple = (thread, heapId)
@@ -398,7 +452,8 @@ struct
val _ = HH.setDepth (thread, depth + 1)
val _ = HH.forceLeftHeap(myWorkerId(), thread)
(* val _ = dbgmsg' (fn _ => "fork CC at depth " ^ Int.toString depth) *)
- val result = fork' {ccOkayAtThisDepth=false} (f, g)
+ val res =
+ result (fn () => fork' {ccOkayAtThisDepth=false} (f, g))
val _ =
if popDiscard() then
@@ -416,9 +471,11 @@ struct
val _ = HH.promoteChunks thread
val _ = HH.setDepth (thread, depth)
+ val _ = maybeParClearSuspectsAtDepth (thread, depth)
+ val _ = if depth <> 1 then () else HH.updateBytesPinnedEntangledWatermark ()
(* val _ = dbgmsg' (fn _ => "join CC at depth " ^ Int.toString depth) *)
in
- result
+ extractResult res
end
end
@@ -437,7 +494,55 @@ struct
(f (), g ())
end
- fun fork (f, g) = fork' {ccOkayAtThisDepth=true} (f, g)
+ and fork (f, g) = fork' {ccOkayAtThisDepth=true} (f, g)
+
+ and simpleFork (f, g) =
+ let
+ val thread = Thread.current ()
+ val depth = HH.getDepth thread
+ in
+ (* if ccOkayAtThisDepth andalso depth = 1 then *)
+ if depth < Queue.capacity andalso depthOkayForDECheck depth then
+ simpleParFork thread depth (f, g)
+ else
+ (* don't let us hit an error, just sequentialize instead *)
+ (f (); g ())
+ end
+
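+  (* Clear the entanglement suspects recorded for thread t at depth d. Small
+   * suspect sets are cleared directly; larger ones are converted to a clear
+   * set, processed in fixed-size grains in parallel via simpleFork, then
+   * committed and deleted. *)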
+ and maybeParClearSuspectsAtDepth (t, d) =
+ if HH.numSuspectsAtDepth (t, d) <= 10000 then
+ HH.clearSuspectsAtDepth (t, d)
+ else
+ let
+ val cs = HH.takeClearSetAtDepth (t, d)
+ val count = HH.numChunksInClearSet cs
+ val grainSize = 20
+ val numGrains = 1 + (count-1) div grainSize
+ val results = ArrayExtra.alloc numGrains
+ fun start i = i*grainSize
+ fun stop i = Int.min (grainSize + start i, count)
+
+ fun processLoop i j =
+ if j-i = 1 then
+ Array.update (results, i, HH.processClearSetGrain (cs, start i, stop i))
+ else
+ let
+ val mid = i + (j-i) div 2
+ in
+ simpleFork (fn _ => processLoop i mid, fn _ => processLoop mid j)
+ end
+
+ fun commitLoop i =
+ if i >= numGrains then () else
+ ( HH.commitFinishedClearSetGrain (t, Array.sub (results, i))
+ ; commitLoop (i+1)
+ )
+ in
+ processLoop 0 numGrains;
+ commitLoop 0;
+ HH.deleteClearSet cs;
+ maybeParClearSuspectsAtDepth (t, d) (* need to go again, just in case *)
+ end
end
(* ========================================================================
@@ -473,22 +578,25 @@ struct
in if other < myId then other else other+1
end
- fun request idleTimer =
+ fun stealLoop () =
let
- fun loop tries it =
+ fun loop tries =
if tries = P * 100 then
- (OS.Process.sleep (Time.fromNanoseconds (LargeInt.fromInt (P * 100)));
- loop 0 (tickTimer idleTimer))
+ ( IdleTimer.tick ()
+ ; OS.Process.sleep (Time.fromNanoseconds (LargeInt.fromInt (P * 100)))
+ ; loop 0 )
else
let
val friend = randomOtherId ()
in
case trySteal friend of
- NONE => loop (tries+1) (tickTimer idleTimer)
- | SOME (task, depth) => (task, depth, tickTimer idleTimer)
+ NONE => loop (tries+1)
+ | SOME (task, depth) => (task, depth)
end
+
+ val result = loop 0
in
- loop 0 idleTimer
+ result
end
(* ------------------------------------------------------------------- *)
@@ -499,33 +607,44 @@ struct
| SOME (thread, hh) =>
( (*dbgmsg' (fn _ => "back in sched; found GC task")
;*) setGCTask myId NONE
+ ; IdleTimer.stop ()
+ ; WorkTimer.start ()
; HH.collectThreadRoot (thread, !hh)
; if popDiscard () then
- ( (*dbgmsg' (fn _ => "resume task thread")
- ;*) threadSwitch thread
+ ( threadSwitch thread
+ ; WorkTimer.stop ()
+ ; IdleTimer.start ()
; afterReturnToSched ()
)
else
- ()
+ ( WorkTimer.stop ()
+ ; IdleTimer.start ()
+ )
)
fun acquireWork () : unit =
let
- val idleTimer = startTimer myId
- val (task, depth, idleTimer') = request idleTimer
- val _ = stopTimer idleTimer'
+ val (task, depth) = stealLoop ()
in
case task of
GCTask (thread, hh) =>
- ( HH.collectThreadRoot (thread, !hh)
+ ( IdleTimer.stop ()
+ ; WorkTimer.start ()
+ ; HH.collectThreadRoot (thread, !hh)
+ ; WorkTimer.stop ()
+ ; IdleTimer.start ()
; acquireWork ()
)
| Continuation (thread, depth) =>
( (*dbgmsg' (fn _ => "stole continuation (" ^ Int.toString depth ^ ")")
; dbgmsg' (fn _ => "resume task thread")
;*) Queue.setDepth myQueue depth
+ ; IdleTimer.stop ()
+ ; WorkTimer.start ()
; threadSwitch thread
+ ; WorkTimer.stop ()
+ ; IdleTimer.start ()
; afterReturnToSched ()
; Queue.setDepth myQueue 1
; acquireWork ()
@@ -541,7 +660,11 @@ struct
HH.setDepth (taskThread, depth+1);
setTaskBox myId t;
(* dbgmsg' (fn _ => "switch to new task thread"); *)
+ IdleTimer.stop ();
+ WorkTimer.start ();
threadSwitch taskThread;
+ WorkTimer.stop ();
+ IdleTimer.start ();
afterReturnToSched ();
Queue.setDepth myQueue 1;
acquireWork ()
@@ -560,6 +683,7 @@ struct
let
val (_, acquireWork) = setupSchedLoop ()
in
+ IdleTimer.start ();
acquireWork ();
die (fn _ => "scheduler bug: scheduler exited acquire-work loop")
end
@@ -570,7 +694,7 @@ struct
if HH.getDepth originalThread = 0 then ()
else die (fn _ => "scheduler bug: root depth <> 0")
val _ = HH.setDepth (originalThread, 1)
- val _ = HH.forceLeftHeap (myWorkerId (), originalThread)
+ val _ = HH.forceLeftHeap (myWorkerId(), originalThread)
(* implicitly attaches worker child heaps *)
val _ = MLton.Parallel.initializeProcessors ()
@@ -596,7 +720,10 @@ struct
let
val (afterReturnToSched, acquireWork) = setupSchedLoop ()
in
+ WorkTimer.start ();
threadSwitch originalThread;
+ WorkTimer.stop ();
+ IdleTimer.start ();
afterReturnToSched ();
setQueueDepth (myWorkerId ()) 1;
acquireWork ();
@@ -635,5 +762,7 @@ struct
ArrayExtra.Raw.unsafeToArray a
end
+ val idleTimeSoFar = Scheduler.IdleTimer.cumulative
+ val workTimeSoFar = Scheduler.WorkTimer.cumulative
val maxForkDepthSoFar = Scheduler.maxForkDepthSoFar
-end
+end
\ No newline at end of file
diff --git a/basis-library/schedulers/shh/sources.mlb b/basis-library/schedulers/shh/sources.mlb
index 100700dcd..8fb08583e 100644
--- a/basis-library/schedulers/shh/sources.mlb
+++ b/basis-library/schedulers/shh/sources.mlb
@@ -22,6 +22,8 @@ local
FORK_JOIN.sig
SimpleRandom.sml
queue/DequeABP.sml
+ DummyTimer.sml
+ CumulativePerProcTimer.sml
Scheduler.sml
in
structure ForkJoin
diff --git a/basis-library/util/one.sml b/basis-library/util/one.sml
index a4b11da1b..f04b82dba 100644
--- a/basis-library/util/one.sml
+++ b/basis-library/util/one.sml
@@ -1,5 +1,6 @@
(* Copyright (C) 2006-2006 Henry Cejtin, Matthew Fluet, Suresh
* Jagannathan, and Stephen Weeks.
+ * Copyright (C) 2023 Sam Westrick.
*
* MLton is released under a HPND-style license.
* See the file MLton-LICENSE for details.
@@ -13,28 +14,44 @@ structure One:
val use: 'a t * ('a -> 'b) -> 'b
end =
struct
+
+ (* SAM_NOTE: using Word8 instead of bool here to work around the
+ * compilation bug with primitive compareAndSwap... (The compilation
+ * passes splitTypes1 and splitTypes2 cause a crash when compareAndSwap
+ * is used on certain data types, including bool.)
+ *
+ * Here I use 0w0 for false, and 0w1 for true
+ *
+ * When we fix compilation for compareAndSwap, we can switch back
+ * to using bool.
+ *)
+
datatype 'a t = T of {more: unit -> 'a,
static: 'a,
- staticIsInUse: bool ref}
+ staticIsInUse: Primitive.Word8.word ref}
fun make f = T {more = f,
static = f (),
- staticIsInUse = ref false}
+ staticIsInUse = ref 0w0}
+
+ val cas = Primitive.MLton.Parallel.compareAndSwap
fun use (T {more, static, staticIsInUse}, f) =
let
val () = Primitive.MLton.Thread.atomicBegin ()
- val b = ! staticIsInUse
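+      (* claimed is true iff this call atomically flipped staticIsInUse from
+       * 0w0 (free) to 0w1 (in use); otherwise we fall back to more (). *)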
+ val claimed =
+ (!staticIsInUse) = 0w0
+ andalso
+ 0w0 = cas (staticIsInUse, 0w0, 0w1)
val d =
- if b then
+ if not claimed then
(Primitive.MLton.Thread.atomicEnd ();
more ())
else
- (staticIsInUse := true;
- Primitive.MLton.Thread.atomicEnd ();
+ (Primitive.MLton.Thread.atomicEnd ();
static)
in
DynamicWind.wind (fn () => f d,
- fn () => if b then () else staticIsInUse := false)
+ fn () => if claimed then staticIsInUse := 0w0 else ())
end
end
diff --git a/include/c-chunk.h b/include/c-chunk.h
index 9f7df8cec..27641c1b8 100644
--- a/include/c-chunk.h
+++ b/include/c-chunk.h
@@ -56,7 +56,7 @@
extern void Assignable_writeBarrier(CPointer, Objptr, Objptr*, Objptr);
extern Objptr Assignable_readBarrier(CPointer, Objptr, Objptr*);
-extern void Assignable_decheckObjptr(Objptr);
+extern Objptr Assignable_decheckObjptr(Objptr, Objptr);
static inline
Real64 ArrayR64_cas(Real64* a, Word64 i, Real64 x, Real64 y) {
@@ -65,6 +65,13 @@ Real64 ArrayR64_cas(Real64* a, Word64 i, Real64 x, Real64 y) {
return *((Real64*)&result);
}
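+/* CAS on a Real32 array element: the bits are reinterpreted as Word32 so
+ * that __sync_val_compare_and_swap can be used (mirroring ArrayR64_cas above). */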
+static inline
+Real32 ArrayR32_cas(Real32* a, Word64 i, Real32 x, Real32 y) {
+ Word32 result =
+ __sync_val_compare_and_swap(((Word32*)a) + i, *((Word32*)&x), *((Word32*)&y));
+ return *((Real32*)&result);
+}
+
#define RefW8_cas(r, x, y) __sync_val_compare_and_swap((Word8*)(r), (x), (y))
#define RefW16_cas(r, x, y) __sync_val_compare_and_swap((Word16*)(r), (x), (y))
#define RefW32_cas(r, x, y) __sync_val_compare_and_swap((Word32*)(r), (x), (y))
@@ -78,9 +85,8 @@ Real64 ArrayR64_cas(Real64* a, Word64 i, Real64 x, Real64 y) {
static inline
Objptr RefP_cas(Objptr* r, Objptr x, Objptr y) {
- Objptr result = __sync_val_compare_and_swap(r, x, y);
- Assignable_decheckObjptr(result);
- return result;
+ Objptr res = __sync_val_compare_and_swap(r, x, y);
+ return Assignable_decheckObjptr(r, res);
}
#define ArrayW8_cas(a, i, x, y) __sync_val_compare_and_swap(((Word8*)(a)) + (i), (x), (y))
@@ -88,7 +94,7 @@ Objptr RefP_cas(Objptr* r, Objptr x, Objptr y) {
#define ArrayW32_cas(a, i, x, y) __sync_val_compare_and_swap(((Word32*)(a)) + (i), (x), (y))
#define ArrayW64_cas(a, i, x, y) __sync_val_compare_and_swap(((Word64*)(a)) + (i), (x), (y))
-#define ArrayR32_cas(a, i, x, y) __sync_val_compare_and_swap(((Real32*)(a)) + (i), (x), (y))
+// #define ArrayR32_cas(a, i, x, y) __sync_val_compare_and_swap(((Real32*)(a)) + (i), (x), (y))
// #define ArrayR64_cas(a, i, x, y) __sync_val_compare_and_swap(((Real64*)(a)) + (i), (x), (y))
// #define ArrayP_cas(a, i, x, y) __sync_val_compare_and_swap(((Objptr*)(a)) + (i), (x), (y))
@@ -96,9 +102,8 @@ Objptr RefP_cas(Objptr* r, Objptr x, Objptr y) {
static inline
Objptr ArrayP_cas(Objptr* a, Word64 i, Objptr x, Objptr y) {
- Objptr result = __sync_val_compare_and_swap(a + i, x, y);
- Assignable_decheckObjptr(result);
- return result;
+ Objptr res = __sync_val_compare_and_swap(a + i, x, y);
+ return Assignable_decheckObjptr(a, res);
}
static inline void GC_writeBarrier(CPointer s, Objptr obj, CPointer dst, Objptr src) {
diff --git a/mlton/ssa/simplify.fun b/mlton/ssa/simplify.fun
index 4646565b3..7ba3d4710 100644
--- a/mlton/ssa/simplify.fun
+++ b/mlton/ssa/simplify.fun
@@ -55,7 +55,10 @@ val ssaPassesDefault =
{name = "localFlatten1", doit = LocalFlatten.transform, execute = true} ::
{name = "constantPropagation", doit = ConstantPropagation.transform, execute = true} ::
{name = "duplicateGlobals1", doit = DuplicateGlobals.transform, execute = false} ::
- {name = "splitTypes1", doit = SplitTypes.transform, execute = true} ::
+ (* SAM_NOTE: disabling splitTypes1 because it does not yet support primitive
+ * polymorphic CAS. We should update the pass and then re-enable.
+ *)
+ {name = "splitTypes1", doit = SplitTypes.transform, execute = false} ::
(* useless should run
* - after constant propagation because constant propagation makes
* slots of tuples that are constant useless
@@ -67,7 +70,10 @@ val ssaPassesDefault =
{name = "loopUnroll1", doit = LoopUnroll.transform, execute = false} ::
{name = "removeUnused2", doit = RemoveUnused.transform, execute = true} ::
{name = "duplicateGlobals2", doit = DuplicateGlobals.transform, execute = true} ::
- {name = "splitTypes2", doit = SplitTypes.transform, execute = true} ::
+ (* SAM_NOTE: disabling splitTypes2 because it does not yet support primitive
+ * polymorphic CAS. We should update the pass and then re-enable.
+ *)
+ {name = "splitTypes2", doit = SplitTypes.transform, execute = false} ::
{name = "simplifyTypes", doit = SimplifyTypes.transform, execute = true} ::
(* polyEqual should run
* - after types are simplified so that many equals are turned into eqs
diff --git a/runtime/gc.c b/runtime/gc.c
index e4508c1bd..df3fe1100 100644
--- a/runtime/gc.c
+++ b/runtime/gc.c
@@ -75,6 +75,8 @@ extern C_Pthread_Key_t gcstate_key;
#include "gc/heap.c"
#include "gc/hierarchical-heap.c"
#include "gc/hierarchical-heap-collection.c"
+#include "gc/ebr.c"
+#include "gc/entangled-ebr.c"
#include "gc/hierarchical-heap-ebr.c"
#include "gc/init-world.c"
#include "gc/init.c"
@@ -92,8 +94,10 @@ extern C_Pthread_Key_t gcstate_key;
#include "gc/pin.c"
#include "gc/pointer.c"
#include "gc/profiling.c"
+#include "gc/concurrent-list.c"
#include "gc/remembered-set.c"
#include "gc/rusage.c"
+#include "gc/sampler.c"
#include "gc/sequence-allocate.c"
#include "gc/sequence.c"
#include "gc/share.c"
diff --git a/runtime/gc.h b/runtime/gc.h
index a5ca9aa5c..47a23a8a5 100644
--- a/runtime/gc.h
+++ b/runtime/gc.h
@@ -28,6 +28,7 @@ typedef GC_state GCState_t;
#include "gc/debug.h"
#include "gc/logger.h"
+#include "gc/sampler.h"
#include "gc/block-allocator.h"
#include "gc/tls-objects.h"
@@ -84,12 +85,15 @@ typedef GC_state GCState_t;
#include "gc/processor.h"
#include "gc/pin.h"
#include "gc/hierarchical-heap.h"
+#include "gc/ebr.h"
#include "gc/hierarchical-heap-ebr.h"
+#include "gc/entangled-ebr.h"
#include "gc/hierarchical-heap-collection.h"
#include "gc/entanglement-suspects.h"
#include "gc/local-scope.h"
#include "gc/local-heap.h"
#include "gc/assign.h"
+#include "gc/concurrent-list.h"
#include "gc/remembered-set.h"
#include "gc/gap.h"
// #include "gc/deferred-promote.h"
diff --git a/runtime/gc/assign.c b/runtime/gc/assign.c
index 0a118ba26..1d5560a59 100644
--- a/runtime/gc/assign.c
+++ b/runtime/gc/assign.c
@@ -6,59 +6,95 @@
* MLton is released under a HPND-style license.
* See the file MLton-LICENSE for details.
*/
-void Assignable_decheckObjptr(objptr op)
+#ifdef DETECT_ENTANGLEMENT
+
+objptr Assignable_decheckObjptr(objptr dst, objptr src)
{
GC_state s = pthread_getspecific(gcstate_key);
s->cumulativeStatistics->numDisentanglementChecks++;
- decheckRead(s, op);
+ objptr new_src = src;
+ pointer dstp = objptrToPointer(dst, NULL);
+ HM_HierarchicalHeap dstHH = HM_getLevelHead(HM_getChunkOf(dstp));
+
+ if (!isObjptr(src) || HM_HH_getDepth(dstHH) == 0 || !ES_contains(NULL, dst))
+ {
+ return src;
+ }
+
+ // HM_EBR_leaveQuiescentState(s);
+ if (!decheck(s, src))
+ {
+ assert (isMutable(s, dstp));
+ s->cumulativeStatistics->numEntanglements++;
+ new_src = manage_entangled(s, src, getThreadCurrent(s)->decheckState);
+ assert (isPinned(new_src));
+ }
+ // HM_EBR_enterQuiescentState(s);
+ assert (!hasFwdPtr(objptrToPointer(new_src, NULL)));
+ return new_src;
}
objptr Assignable_readBarrier(
GC_state s,
- ARG_USED_FOR_ASSERT objptr obj,
+ objptr obj,
objptr *field)
{
+// can't rely on the obj header because it may be forwarded.
-#if ASSERT
- assert(isObjptr(obj));
- // check that field is actually inside this object
+ s->cumulativeStatistics->numDisentanglementChecks++;
+ objptr ptr = *field;
pointer objp = objptrToPointer(obj, NULL);
- GC_header header = getHeader(objp);
- GC_objectTypeTag tag;
- uint16_t bytesNonObjptrs;
- uint16_t numObjptrs;
- bool hasIdentity;
- splitHeader(s, header, &tag, &hasIdentity, &bytesNonObjptrs, &numObjptrs);
- pointer objend = objp;
- if (!hasIdentity) {
- DIE("read barrier: attempting to read immutable object "FMTOBJPTR, obj);
- }
- if (NORMAL_TAG == tag) {
- objend += bytesNonObjptrs + (numObjptrs * OBJPTR_SIZE);
- }
- else if (SEQUENCE_TAG == tag) {
- size_t dataBytes = getSequenceLength(objp) * (bytesNonObjptrs + (numObjptrs * OBJPTR_SIZE));
- objend += alignWithExtra (s, dataBytes, GC_SEQUENCE_METADATA_SIZE);
+ HM_HierarchicalHeap objHH = HM_getLevelHead(HM_getChunkOf(objp));
+ if (!isObjptr(ptr) || HM_HH_getDepth(objHH) == 0 || !ES_contains(NULL, obj))
+ {
+ return ptr;
}
- else {
- DIE("read barrier: cannot handle tag %u", tag);
+ // HM_EBR_leaveQuiescentState(s);
+ if (!decheck(s, ptr))
+ {
+ assert (ES_contains(NULL, obj));
+ // assert (isMutable(s, obj));
+ // if (!ES_contains(NULL, obj))
+ // {
+ // if (!decheck(s, obj)) {
+ // assert (false);
+ // }
+ // assert(isPinned(ptr));
+ // assert(!hasFwdPtr(ptr));
+ // assert(pinType(getHeader(ptr)) == PIN_ANY);
+ // }
+ s->cumulativeStatistics->numEntanglements++;
+ ptr = manage_entangled(s, ptr, getThreadCurrent(s)->decheckState);
}
- pointer fieldp = (pointer)field;
- ASSERTPRINT(
- objp <= fieldp && fieldp + OBJPTR_SIZE <= objend,
- "read barrier: objptr field %p outside object "FMTOBJPTR" of size %zu",
- (void*)field,
- obj,
- (size_t)(objend - objp));
-#endif
- assert(ES_contains(NULL, obj));
- s->cumulativeStatistics->numDisentanglementChecks++;
- objptr ptr = *field;
- decheckRead(s, ptr);
+ // HM_EBR_enterQuiescentState(s);
+ assert (!hasFwdPtr(objptrToPointer(ptr, NULL)));
return ptr;
}
+#else
+
+objptr Assignable_decheckObjptr(objptr dst, objptr src) {
+ (void) dst;
+ return src;
+}
+
+objptr Assignable_readBarrier(
+ GC_state s,
+ objptr obj,
+ objptr *field) {
+ (void)s;
+ (void)obj;
+ return *field;
+}
+#endif
+
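+/* Quick sufficient check, used before falling back to the full decheck:
+ * objects at depth <= 1, or in the current thread's own hierarchical heap,
+ * can be accessed without further checking. */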
+static inline bool decheck_opt_fast (GC_state s, pointer p) {
+ HM_HierarchicalHeap hh = HM_getLevelHead(HM_getChunkOf(p));
+ return (hh->depth <= 1) || hh == getThreadCurrent(s)->hierarchicalHeap;
+}
+
+
void Assignable_writeBarrier(
GC_state s,
objptr dst,
@@ -67,39 +103,43 @@ void Assignable_writeBarrier(
{
assert(isObjptr(dst));
pointer dstp = objptrToPointer(dst, NULL);
+ pointer srcp = objptrToPointer(src, NULL);
-#if ASSERT
- // check that field is actually inside this object
- GC_header header = getHeader(dstp);
- GC_objectTypeTag tag;
- uint16_t bytesNonObjptrs;
- uint16_t numObjptrs;
- bool hasIdentity;
- splitHeader(s, header, &tag, &hasIdentity, &bytesNonObjptrs, &numObjptrs);
- pointer objend = dstp;
- if (!hasIdentity) {
- DIE("write barrier: attempting to modify immutable object "FMTOBJPTR, dst);
- }
- if (NORMAL_TAG == tag) {
- objend += bytesNonObjptrs + (numObjptrs * OBJPTR_SIZE);
- }
- else if (SEQUENCE_TAG == tag) {
- size_t dataBytes = getSequenceLength(dstp) * (bytesNonObjptrs + (numObjptrs * OBJPTR_SIZE));
- objend += alignWithExtra (s, dataBytes, GC_SEQUENCE_METADATA_SIZE);
- }
- else {
- DIE("write barrier: cannot handle tag %u", tag);
- }
- pointer fieldp = (pointer)field;
- ASSERTPRINT(
- dstp <= fieldp && fieldp + OBJPTR_SIZE <= objend,
- "write barrier: objptr field %p outside object "FMTOBJPTR" of size %zu",
- (void*)field,
- dst,
- (size_t)(objend - dstp));
-#endif
-
- HM_HierarchicalHeap dstHH = HM_getLevelHeadPathCompress(HM_getChunkOf(dstp));
+ assert (!hasFwdPtr(dstp));
+ assert (!isObjptr(src) || !hasFwdPtr(srcp));
+
+// #if ASSERT
+// // check that field is actually inside this object
+// GC_header header = getHeader(dstp);
+// GC_objectTypeTag tag;
+// uint16_t bytesNonObjptrs;
+// uint16_t numObjptrs;
+// bool hasIdentity;
+// splitHeader(s, header, &tag, &hasIdentity, &bytesNonObjptrs, &numObjptrs);
+// pointer objend = dstp;
+// if (!hasIdentity) {
+// DIE("write barrier: attempting to modify immutable object "FMTOBJPTR, dst);
+// }
+// if (NORMAL_TAG == tag) {
+// objend += bytesNonObjptrs + (numObjptrs * OBJPTR_SIZE);
+// }
+// else if (SEQUENCE_TAG == tag) {
+// size_t dataBytes = getSequenceLength(dstp) * (bytesNonObjptrs + (numObjptrs * OBJPTR_SIZE));
+// objend += alignWithExtra (s, dataBytes, GC_SEQUENCE_METADATA_SIZE);
+// }
+// else {
+// DIE("write barrier: cannot handle tag %u", tag);
+// }
+// pointer fieldp = (pointer)field;
+// ASSERTPRINT(
+// dstp <= fieldp && fieldp + OBJPTR_SIZE <= objend,
+// "write barrier: objptr field %p outside object "FMTOBJPTR" of size %zu",
+// (void*)field,
+// dst,
+// (size_t)(objend - dstp));
+// #endif
+
+ HM_HierarchicalHeap dstHH = HM_getLevelHead(HM_getChunkOf(dstp));
objptr readVal = *field;
if (dstHH->depth >= 1 && isObjptr(readVal) && s->wsQueueTop!=BOGUS_OBJPTR) {
@@ -120,13 +160,126 @@ void Assignable_writeBarrier(
}
/* deque down-pointers are handled separately during collection. */
- if (dst == s->wsQueue)
+ if (dst == s->wsQueue) {
return;
+ }
- struct HM_remembered remElem_ = {.object = src, .from = dst};
- HM_remembered remElem = &remElem_;
+ HM_HierarchicalHeap srcHH = HM_getLevelHead(HM_getChunkOf(srcp));
+ if (srcHH == dstHH) {
+ /* internal pointers are always traced */
+ return;
+ }
+
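+  /* src_de / dst_de record whether src / dst pass the disentanglement check
+   * (fast path first, then the full decheck). */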
+ uint32_t dd = dstHH->depth;
+ bool src_de = decheck_opt_fast(s, srcp) || decheck(s, src);
+ if (src_de) {
+ bool dst_de = decheck_opt_fast(s, dstp) || decheck(s, dst);
+ if (dst_de) {
+ uint32_t sd = srcHH->depth;
+ /* up pointer (snapshotted by the closure)
+ * or internal (within a chain) pointer to a snapshotted heap
+ */
+ if(dd > sd ||
+ ((HM_HH_getConcurrentPack(srcHH)->ccstate != CC_UNREG) &&
+ dd == sd))
+ {
+ return;
+ }
+
+ uint32_t unpinDepth = dd;
+ bool success = pinObject(s, src, unpinDepth, PIN_DOWN);
+
+ if (success || dd == unpinDepthOf(src))
+ {
+ struct HM_remembered remElem_ = {.object = src, .from = dst};
+ HM_remembered remElem = &remElem_;
+
+ HM_HierarchicalHeap shh = HM_HH_getHeapAtDepth(s, getThreadCurrent(s), sd);
+ assert(NULL != shh);
+ assert(HM_HH_getConcurrentPack(shh)->ccstate == CC_UNREG);
+
+ HM_HH_rememberAtLevel(shh, remElem, false);
+ LOG(LM_HH_PROMOTION, LL_INFO,
+ "remembered downptr %" PRIu32 "->%" PRIu32 " from " FMTOBJPTR " to " FMTOBJPTR,
+ dstHH->depth, srcHH->depth,
+ dst, src);
+ }
+
+ if (dd > 0 && !ES_contains(NULL, dst)) {
+ /*if dst is not a suspect, it must be disentangled*/
+ // if (!dst_de) {
+ // printf("problematix: %p \n", dst);
+ // DIE("done");
+ // }
+ // assert (dst_de);
+ HM_HierarchicalHeap dhh = HM_HH_getHeapAtDepth(s, getThreadCurrent(s), dd);
+ ES_add(s, HM_HH_getSuspects(dhh), dst);
+ }
+ }
+ else if(dstHH->depth != 0) {
+ // traverseAndCheck(s, &dst, dst, NULL);
+
+ // SAM_NOTE: TODO: do we count this one??
+ s->cumulativeStatistics->numEntanglements++;
+ manage_entangled (s, src, HM_getChunkOf(dstp)->decheckState);
+ }
+
+
+ // if (!dst_de) {
+ // assert (ES_contains(NULL, dst));
+ // }
+
+ /* Depth comparisons make sense only when src && dst are on the same root-to-leaf path,
+   * checking this may be expensive, so we approximate here.
+   * If both dst_de && src_de hold, they are on the same path;
+   * otherwise, we assume they are on different paths.
+ */
+
+
+
+
+ // /* otherwise pin*/
+ // // bool primary_down_ptr = dst_de && dd < sd && (HM_HH_getConcurrentPack(dstHH)->ccstate == CC_UNREG);
+ // = dst_de ? dd :
+ // (uint32_t)lcaHeapDepth(HM_getChunkOf(srcp)->decheckState,
+ // HM_getChunkOf(dstp)->decheckState);
+ // enum PinType pt = dst_de ? PIN_DOWN : PIN_ANY;
+
+ // bool success = pinTemp(s, src, unpinDepth, pt);
+ // if (success || (dst_de && dd == unpinDepthOf (src))) {
+ // objptr fromObj = pt == PIN_DOWN ? dst : BOGUS_OBJPTR;
+ // struct HM_remembered remElem_ = {.object = src, .from = fromObj};
+ // HM_remembered remElem = &remElem_;
+
+ // HM_HierarchicalHeap shh = HM_HH_getHeapAtDepth(s, getThreadCurrent(s), sd);
+ // assert(NULL != shh);
+ // assert(HM_HH_getConcurrentPack(shh)->ccstate == CC_UNREG);
+
+ // HM_HH_rememberAtLevel(shh, remElem, false);
+ // LOG(LM_HH_PROMOTION, LL_INFO,
+ // "remembered downptr %"PRIu32"->%"PRIu32" from "FMTOBJPTR" to "FMTOBJPTR,
+ // dstHH->depth, srcHH->depth,
+ // dst, src);
+ // }
+
+ // /*add dst to the suspect set*/
+ // if (dd > 0 && dst_de && !ES_contains(NULL, dst)) {
+ // /*if dst is not a suspect, it must be disentangled*/
+ // // if (!dst_de) {
+ // // printf("problematix: %p \n", dst);
+ // // DIE("done");
+ // // }
+ // // assert (dst_de);
+ // HM_HierarchicalHeap dhh = HM_HH_getHeapAtDepth(s, getThreadCurrent(s), dd);
+ // ES_add(s, HM_HH_getSuspects(dhh), dst);
+ // }
+ } else {
+ // assert (isPinned(src));
+ // assert (!hasFwdPtr(srcp));
+ // assert (pinType(getHeader(srcp)) == PIN_ANY);
+ traverseAndCheck(s, &src, src, NULL);
+ }
- pointer srcp = objptrToPointer(src, NULL);
#if 0
/** This is disabled for now. In the future we will come back to
@@ -157,36 +310,35 @@ void Assignable_writeBarrier(
if (pinObject(src, (uint32_t)unpinDepth)) {
/** Just remember it at some arbitrary place... */
- HM_rememberAtLevel(getThreadCurrent(s)->hierarchicalHeap, remElem);
+ HM_HH_rememberAtLevel(getThreadCurrent(s)->hierarchicalHeap, remElem);
}
return;
}
#endif
- HM_HierarchicalHeap srcHH = HM_getLevelHeadPathCompress(HM_getChunkOf(srcp));
/* Up-pointer. */
- if (dstHH->depth > srcHH->depth)
- return;
+ // if (dstHH->depth > srcHH->depth)
+ // return;
/* Internal pointer. It's safe to ignore an internal pointer if:
* 1. it's contained entirely within one subheap, or
* 2. the pointed-to object (src) lives in an already snapshotted subregion
*/
- if ( (dstHH == srcHH) ||
- (dstHH->depth == srcHH->depth &&
- HM_HH_getConcurrentPack(srcHH)->ccstate != CC_UNREG) ) {
- // assert(...);
- // if (dstHH != srcHH) {
- // printf(
- // "ignore internal pointer "FMTPTR" --> "FMTPTR". dstHH == srcHH? %d\n",
- // (uintptr_t)dstp,
- // (uintptr_t)srcp,
- // srcHH == dstHH);
- // }
- return;
- }
+ // if ( (dstHH == srcHH) ||
+ // (dstHH->depth == srcHH->depth &&
+ // HM_HH_getConcurrentPack(srcHH)->ccstate != CC_UNREG) ) {
+ // // assert(...);
+ // // if (dstHH != srcHH) {
+ // // printf(
+ // // "ignore internal pointer "FMTPTR" --> "FMTPTR". dstHH == srcHH? %d\n",
+ // // (uintptr_t)dstp,
+ // // (uintptr_t)srcp,
+ // // srcHH == dstHH);
+ // // }
+ // return;
+ // }
/** Otherwise, its a down-pointer, so
* (i) make dst a suspect for entanglement, i.e., mark the suspect bit of dst's header
* (see pin.h for header-layout).
@@ -197,42 +349,46 @@ void Assignable_writeBarrier(
*/
/* make dst a suspect for entanglement */
- uint32_t dd = dstHH->depth;
- GC_thread thread = getThreadCurrent(s);
- if (dd > 0 && !ES_contains(NULL, dst)) {
- HM_HierarchicalHeap dhh = HM_HH_getHeapAtDepth(s, thread, dd);
- ES_add(s, HM_HH_getSuspects(dhh), dst);
- }
+ // uint32_t dd = dstHH->depth;
+ // if (dd > 0 && !ES_contains(NULL, dst)) {
+ // HM_HierarchicalHeap dhh = HM_HH_getHeapAtDepth(s, thread, dd);
+ // ES_add(s, HM_HH_getSuspects(dhh), dst);
+ // }
+
+ // if (decheck(s, src)) {
+ // uint32_t sd = srcHH->depth;
+ // bool dst_de = decheck(s, dst);
+ // assert (dd <= sd);
+ // bool true_down_ptr = dd < sd && (HM_HH_getConcurrentPack(dstHH)->ccstate == CC_UNREG) && dst_de;
+ // // bool unpinDepth = dst_de ? dd : lcaDepth(srcHH->tid, dstHH->tid);
+ // /* treat a pointer from a chained heap as a cross pointer */
+ // bool success = pinObject(src, dd, true_down_ptr ? PIN_DOWN : PIN_ANY);
+ // if (success)
+ // {
+ // HM_HierarchicalHeap shh = HM_HH_getHeapAtDepth(s, thread, sd);
+ // assert(NULL != shh);
+ // assert(HM_HH_getConcurrentPack(shh)->ccstate == CC_UNREG);
+ // HM_HH_rememberAtLevel(shh, remElem, false);
+
+ // LOG(LM_HH_PROMOTION, LL_INFO,
+ // "remembered downptr %"PRIu32"->%"PRIu32" from "FMTOBJPTR" to "FMTOBJPTR,
+ // dstHH->depth, srcHH->depth,
+ // dst, src);
+ // }
+ // if (!dst_de)
+ // {
+ // DIE("HAVE TO HANDLE ENTANGLED WRITES SEPARATELY");
+ // // HM_HierarchicalHeap lcaHeap = HM_HH_getHeapAtDepth(s, thread, unpinDepth);
+ // // ES_add(s, HM_HH_getSuspects(lcaHeap), ptr);
+ // }
+ // }
+
+
+ // // any concurrent pin can only decrease unpinDepth
+ // assert(unpinDepth <= dd);
+
+ // bool maybe_across_chain = write_de && (dd == unpinDepth) && (dd == sd);
- bool success = pinObject(src, dd);
-
- // any concurrent pin can only decrease unpinDepth
- uint32_t unpinDepth = unpinDepthOf(src);
- assert(unpinDepth <= dd);
-
- if (success || dd == unpinDepth)
- {
- uint32_t sd = srcHH->depth;
-#if 0
- /** Fix a silly issue where, when we are dealing with entanglement, the
- * lower object is actually deeper than the current thread (which is
- * possible because of entanglement! the thread is peeking inside of
- * some other thread's heaps, and the other thread might be deeper).
- */
- if (d > thread->currentDepth && s->controls->manageEntanglement)
- d = thread->currentDepth;
-#endif
-
- HM_HierarchicalHeap shh = HM_HH_getHeapAtDepth(s, thread, sd);
- assert(NULL != shh);
- assert(HM_HH_getConcurrentPack(shh)->ccstate == CC_UNREG);
- HM_rememberAtLevel(shh, remElem);
-
- LOG(LM_HH_PROMOTION, LL_INFO,
- "remembered downptr %"PRIu32"->%"PRIu32" from "FMTOBJPTR" to "FMTOBJPTR,
- dstHH->depth, srcHH->depth,
- dst, src);
- }
/* SAM_NOTE: TODO: track bytes allocated here in
* thread->bytesAllocatedSinceLast...? */
@@ -242,8 +398,9 @@ void Assignable_writeBarrier(
// Assignable_writeBarrier(s, dst, field, src, false);
// // *field = src;
// }
-// void Assignable_casBarrier (GC_state s, objptr dst, objptr* field, objptr src) {
-// Assignable_writeBarrier(s, dst, field, src, true);
+// void Assignable_casBarrier (objptr dst, objptr* field, objptr src) {
+// GC_state s = pthread_getspecific(gcstate_key);
+// Assignable_writeBarrier(s, dst, field, src);
// // cas(field, (*field), dst); //return?
// }
diff --git a/runtime/gc/assign.h b/runtime/gc/assign.h
index 7807e10a3..73fc5805a 100644
--- a/runtime/gc/assign.h
+++ b/runtime/gc/assign.h
@@ -22,7 +22,7 @@ PRIVATE objptr Assignable_readBarrier(
GC_state s, objptr dst, objptr* field
);
-PRIVATE void Assignable_decheckObjptr(objptr op);
+PRIVATE objptr Assignable_decheckObjptr(objptr dst, objptr src);
#endif /* MLTON_GC_INTERNAL_BASIS */
diff --git a/runtime/gc/block-allocator.c b/runtime/gc/block-allocator.c
index 20821cdce..abc97d9a4 100644
--- a/runtime/gc/block-allocator.c
+++ b/runtime/gc/block-allocator.c
@@ -134,9 +134,13 @@ static void initBlockAllocator(GC_state s, BlockAllocator ball) {
ball->completelyEmptyGroup.firstSuperBlock = NULL;
- ball->numBlocks = 0;
- ball->numBlocksInUse = 0;
ball->firstFreedByOther = NULL;
+ ball->numBlocksMapped = 0;
+ ball->numBlocksReleased = 0;
+ for (enum BlockPurpose p = 0; p < NUM_BLOCK_PURPOSES; p++) {
+ ball->numBlocksAllocated[p] = 0;
+ ball->numBlocksFreed[p] = 0;
+ }
for (size_t i = 0; i < numMegaBlockSizeClasses; i++) {
ball->megaBlockSizeClass[i].firstMegaBlock = NULL;
@@ -239,14 +243,15 @@ static void mmapNewSuperBlocks(
prependSuperBlock(getFullnessGroup(s, ball, 0, COMPLETELY_EMPTY), sb);
}
- ball->numBlocks += count*(SUPERBLOCK_SIZE(s));
+ ball->numBlocksMapped += count*(SUPERBLOCK_SIZE(s));
}
static Blocks allocateInSuperBlock(
GC_state s,
SuperBlock sb,
- int sizeClass)
+ int sizeClass,
+ enum BlockPurpose purpose)
{
if ((size_t)sb->numBlocksFree == SUPERBLOCK_SIZE(s)) {
// It's completely empty! We can reuse.
@@ -262,10 +267,11 @@ static Blocks allocateInSuperBlock(
sb->numBlocksFree -= (1 << sb->sizeClass);
assert(sb->owner != NULL);
- sb->owner->numBlocksInUse += (1 << sb->sizeClass);
+ sb->owner->numBlocksAllocated[purpose] += (1 << sb->sizeClass);
result->container = sb;
result->numBlocks = 1 << sb->sizeClass;
+ result->purpose = purpose;
return result;
}
@@ -276,11 +282,12 @@ static Blocks allocateInSuperBlock(
sb->numBlocksFree -= (1 << sb->sizeClass);
assert(sb->owner != NULL);
- sb->owner->numBlocksInUse += (1 << sb->sizeClass);
+ sb->owner->numBlocksAllocated[purpose] += (1 << sb->sizeClass);
Blocks bs = (Blocks)result;
bs->container = sb;
bs->numBlocks = (1 << sb->sizeClass);
+ bs->purpose = purpose;
return bs;
}
@@ -300,16 +307,19 @@ static void deallocateInSuperBlock(
sb->firstFree = block;
sb->numBlocksFree += (1 << sb->sizeClass);
+ enum BlockPurpose purpose = block->purpose;
+ assert( purpose < NUM_BLOCK_PURPOSES );
+
assert(sb->owner != NULL);
- assert(sb->owner->numBlocksInUse >= ((size_t)1 << sb->sizeClass));
- sb->owner->numBlocksInUse -= (1 << sb->sizeClass);
+ sb->owner->numBlocksFreed[purpose] += (1 << sb->sizeClass);
}
static Blocks tryAllocateAndAdjustSuperBlocks(
GC_state s,
BlockAllocator ball,
- int class)
+ int class,
+ enum BlockPurpose purpose)
{
SuperBlockList targetList = NULL;
@@ -337,7 +347,7 @@ static Blocks tryAllocateAndAdjustSuperBlocks(
SuperBlock sb = targetList->firstSuperBlock;
assert( sb != NULL );
enum FullnessGroup fg = fullness(s, sb);
- Blocks result = allocateInSuperBlock(s, sb, class);
+ Blocks result = allocateInSuperBlock(s, sb, class, purpose);
enum FullnessGroup newfg = fullness(s, sb);
if (fg != newfg) {
@@ -389,7 +399,7 @@ static void clearOutOtherFrees(GC_state s) {
}
if (numFreed > 400) {
- LOG(LM_CHUNK_POOL, LL_INFO,
+ LOG(LM_CHUNK_POOL, LL_DEBUG,
"number of freed blocks: %zu",
numFreed);
}
@@ -398,14 +408,22 @@ static void clearOutOtherFrees(GC_state s) {
static void freeMegaBlock(GC_state s, MegaBlock mb, size_t sizeClass) {
BlockAllocator global = s->blockAllocatorGlobal;
+ size_t nb = mb->numBlocks;
+ enum BlockPurpose purpose = mb->purpose;
+
+ LOG(LM_CHUNK_POOL, LL_INFO,
+ "Freeing megablock of %zu blocks",
+ nb);
if (sizeClass >= s->controls->megablockThreshold) {
- size_t nb = mb->numBlocks;
GC_release((pointer)mb, s->controls->blockSize * mb->numBlocks);
LOG(LM_CHUNK_POOL, LL_INFO,
"Released large allocation of %zu blocks (unmap threshold: %zu)",
nb,
(size_t)1 << (s->controls->megablockThreshold - 1));
+
+ __sync_fetch_and_add(&(global->numBlocksFreed[purpose]), nb);
+ __sync_fetch_and_add(&(global->numBlocksReleased), nb);
return;
}
@@ -415,6 +433,8 @@ static void freeMegaBlock(GC_state s, MegaBlock mb, size_t sizeClass) {
mb->nextMegaBlock = global->megaBlockSizeClass[mbClass].firstMegaBlock;
global->megaBlockSizeClass[mbClass].firstMegaBlock = mb;
pthread_mutex_unlock(&(global->megaBlockLock));
+
+ __sync_fetch_and_add(&(global->numBlocksFreed[purpose]), nb);
return;
}
@@ -422,7 +442,8 @@ static void freeMegaBlock(GC_state s, MegaBlock mb, size_t sizeClass) {
static MegaBlock tryFindMegaBlock(
GC_state s,
size_t numBlocksNeeded,
- size_t sizeClass)
+ size_t sizeClass,
+ enum BlockPurpose purpose)
{
BlockAllocator global = s->blockAllocatorGlobal;
assert(sizeClass >= s->controls->superblockThreshold);
@@ -452,6 +473,8 @@ static MegaBlock tryFindMegaBlock(
mb->nextMegaBlock = NULL;
pthread_mutex_unlock(&(global->megaBlockLock));
+ __sync_fetch_and_add(&(global->numBlocksAllocated[purpose]), mb->numBlocks);
+
LOG(LM_CHUNK_POOL, LL_INFO,
"inspected %zu, satisfied large alloc of %zu blocks using megablock of %zu",
count,
@@ -468,7 +491,7 @@ static MegaBlock tryFindMegaBlock(
}
-static MegaBlock mmapNewMegaBlock(GC_state s, size_t numBlocks)
+static MegaBlock mmapNewMegaBlock(GC_state s, size_t numBlocks, enum BlockPurpose purpose)
{
pointer start = GC_mmapAnon(NULL, s->controls->blockSize * numBlocks);
if (MAP_FAILED == start) {
@@ -478,14 +501,24 @@ static MegaBlock mmapNewMegaBlock(GC_state s, size_t numBlocks)
DIE("whoops, mmap didn't align by the block-size.");
}
+ BlockAllocator global = s->blockAllocatorGlobal;
+ __sync_fetch_and_add(&(global->numBlocksMapped), numBlocks);
+ __sync_fetch_and_add(&(global->numBlocksAllocated[purpose]), numBlocks);
+
MegaBlock mb = (MegaBlock)start;
mb->numBlocks = numBlocks;
mb->nextMegaBlock = NULL;
+ mb->purpose = purpose;
+
+ LOG(LM_CHUNK_POOL, LL_INFO,
+ "mmap'ed new megablock of size %zu",
+ numBlocks);
+
return mb;
}
-Blocks allocateBlocks(GC_state s, size_t numBlocks) {
+Blocks allocateBlocksWithPurpose(GC_state s, size_t numBlocks, enum BlockPurpose purpose) {
BlockAllocator local = s->blockAllocatorLocal;
assertBlockAllocatorOkay(s, local);
@@ -497,10 +530,10 @@ Blocks allocateBlocks(GC_state s, size_t numBlocks) {
* fails, we're a bit screwed.
*/
- MegaBlock mb = tryFindMegaBlock(s, numBlocks, class);
+ MegaBlock mb = tryFindMegaBlock(s, numBlocks, class, purpose);
if (NULL == mb)
- mb = mmapNewMegaBlock(s, numBlocks);
+ mb = mmapNewMegaBlock(s, numBlocks, purpose);
if (NULL == mb)
DIE("ran out of space!");
@@ -510,6 +543,7 @@ Blocks allocateBlocks(GC_state s, size_t numBlocks) {
Blocks bs = (Blocks)mb;
bs->container = NULL;
bs->numBlocks = actualNumBlocks;
+ bs->purpose = purpose;
return bs;
}
@@ -517,31 +551,39 @@ Blocks allocateBlocks(GC_state s, size_t numBlocks) {
assertBlockAllocatorOkay(s, local);
/** Look in local first. */
- Blocks result = tryAllocateAndAdjustSuperBlocks(s, local, class);
+ Blocks result = tryAllocateAndAdjustSuperBlocks(s, local, class, purpose);
if (result != NULL) {
assertBlockAllocatorOkay(s, local);
+ assert( result->purpose == purpose );
return result;
}
/** If both local fails, we need to mmap new superchunks. */
mmapNewSuperBlocks(s, local);
- result = tryAllocateAndAdjustSuperBlocks(s, local, class);
+ result = tryAllocateAndAdjustSuperBlocks(s, local, class, purpose);
if (result == NULL) {
DIE("Ran out of space for new superblocks!");
}
assertBlockAllocatorOkay(s, local);
+ assert( result->purpose == purpose );
return result;
}
+Blocks allocateBlocks(GC_state s, size_t numBlocks) {
+ return allocateBlocksWithPurpose(s, numBlocks, BLOCK_FOR_UNKNOWN_PURPOSE);
+}
+
+
void freeBlocks(GC_state s, Blocks bs, writeFreedBlockInfoFnClosure f) {
BlockAllocator local = s->blockAllocatorLocal;
assertBlockAllocatorOkay(s, local);
size_t numBlocks = bs->numBlocks;
SuperBlock sb = bs->container;
+ enum BlockPurpose purpose = bs->purpose;
pointer blockStart = (pointer)bs;
#if ASSERT
@@ -586,6 +628,7 @@ void freeBlocks(GC_state s, Blocks bs, writeFreedBlockInfoFnClosure f) {
MegaBlock mb = (MegaBlock)blockStart;
mb->numBlocks = numBlocks;
mb->nextMegaBlock = NULL;
+ mb->purpose = purpose;
freeMegaBlock(s, mb, sizeClass);
return;
}
@@ -594,6 +637,7 @@ void freeBlocks(GC_state s, Blocks bs, writeFreedBlockInfoFnClosure f) {
FreeBlock elem = (FreeBlock)blockStart;
elem->container = sb;
+ elem->purpose = purpose;
BlockAllocator owner = sb->owner;
assert(owner != NULL);
assert(sb->sizeClass == computeSizeClass(numBlocks));
@@ -613,4 +657,166 @@ void freeBlocks(GC_state s, Blocks bs, writeFreedBlockInfoFnClosure f) {
}
}
+
+void queryCurrentBlockUsage(
+ GC_state s,
+ size_t *numBlocksMapped,
+ size_t *numGlobalBlocksMapped,
+ size_t *numBlocksReleased,
+ size_t *numGlobalBlocksReleased,
+ size_t *numBlocksAllocated,
+ size_t *numBlocksFreed)
+{
+ *numBlocksMapped = 0;
+ *numGlobalBlocksMapped = 0;
+ *numBlocksReleased = 0;
+ *numGlobalBlocksReleased = 0;
+ for (enum BlockPurpose p = 0; p < NUM_BLOCK_PURPOSES; p++) {
+ numBlocksAllocated[p] = 0;
+ numBlocksFreed[p] = 0;
+ }
+
+ // query local allocators
+ for (uint32_t i = 0; i < s->numberOfProcs; i++) {
+ BlockAllocator ball = s->procStates[i].blockAllocatorLocal;
+ *numBlocksMapped += ball->numBlocksMapped;
+ *numBlocksReleased += ball->numBlocksReleased;
+ for (enum BlockPurpose p = 0; p < NUM_BLOCK_PURPOSES; p++) {
+ numBlocksAllocated[p] += ball->numBlocksAllocated[p];
+ numBlocksFreed[p] += ball->numBlocksFreed[p];
+ }
+ }
+
+ // query global allocator
+ BlockAllocator global = s->blockAllocatorGlobal;
+ *numBlocksMapped += global->numBlocksMapped;
+ *numBlocksReleased += global->numBlocksReleased;
+ for (enum BlockPurpose p = 0; p < NUM_BLOCK_PURPOSES; p++) {
+ numBlocksAllocated[p] += global->numBlocksAllocated[p];
+ numBlocksFreed[p] += global->numBlocksFreed[p];
+ }
+
+ *numGlobalBlocksMapped += global->numBlocksMapped;
+ *numGlobalBlocksReleased += global->numBlocksReleased;
+}
+
+
+void logCurrentBlockUsage(
+ GC_state s,
+ struct timespec *now,
+ __attribute__((unused)) void *env)
+{
+ size_t mapped;
+ size_t globalMapped;
+ size_t released;
+ size_t globalReleased;
+ size_t allocated[NUM_BLOCK_PURPOSES];
+ size_t freed[NUM_BLOCK_PURPOSES];
+ queryCurrentBlockUsage(
+ s,
+ &mapped,
+ &globalMapped,
+ &released,
+ &globalReleased,
+ (size_t*)allocated,
+ (size_t*)freed
+ );
+
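+  // Counters are maintained per-allocator and sampled here without
+  // synchronization, so a difference may be transiently negative; clamp to zero.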
+ size_t inUse[NUM_BLOCK_PURPOSES];
+ for (enum BlockPurpose p = 0; p < NUM_BLOCK_PURPOSES; p++) {
+ if (freed[p] > allocated[p]) {
+ inUse[p] = 0;
+ }
+ else {
+ inUse[p] = allocated[p] - freed[p];
+ }
+ }
+
+ size_t count = mapped-released;
+ if (released > mapped) count = 0;
+
+ size_t globalCount = globalMapped - globalReleased;
+ if (globalReleased > globalMapped) globalCount = 0;
+
+ LOG(LM_BLOCK_ALLOCATOR, LL_INFO,
+ "block-allocator(%zu.%.9zu)\n"
+ " currently mapped %zu (= %zu - %zu)\n"
+ " currently mapped (global) %zu (= %zu - %zu)\n"
+ " BLOCK_FOR_HEAP_CHUNK %zu (%zu%%) (= %zu - %zu)\n"
+ " BLOCK_FOR_REMEMBERED_SET %zu (%zu%%) (= %zu - %zu)\n"
+ " BLOCK_FOR_FORGOTTEN_SET %zu (%zu%%) (= %zu - %zu)\n"
+ " BLOCK_FOR_HH_ALLOCATOR %zu (%zu%%) (= %zu - %zu)\n"
+ " BLOCK_FOR_UF_ALLOCATOR %zu (%zu%%) (= %zu - %zu)\n"
+ " BLOCK_FOR_GC_WORKLIST %zu (%zu%%) (= %zu - %zu)\n"
+ " BLOCK_FOR_SUSPECTS %zu (%zu%%) (= %zu - %zu)\n"
+ " BLOCK_FOR_EBR %zu (%zu%%) (= %zu - %zu)\n"
+ " BLOCK_FOR_UNKNOWN_PURPOSE %zu (%zu%%) (= %zu - %zu)\n",
+ now->tv_sec,
+ now->tv_nsec,
+ count,
+ mapped,
+ released,
+
+ globalCount,
+ globalMapped,
+ globalReleased,
+
+ inUse[BLOCK_FOR_HEAP_CHUNK],
+ (size_t)(100.0 * (double)inUse[BLOCK_FOR_HEAP_CHUNK] / (double)count),
+ allocated[BLOCK_FOR_HEAP_CHUNK],
+ freed[BLOCK_FOR_HEAP_CHUNK],
+
+ inUse[BLOCK_FOR_REMEMBERED_SET],
+ (size_t)(100.0 * (double)inUse[BLOCK_FOR_REMEMBERED_SET] / (double)count),
+ allocated[BLOCK_FOR_REMEMBERED_SET],
+ freed[BLOCK_FOR_REMEMBERED_SET],
+
+ inUse[BLOCK_FOR_FORGOTTEN_SET],
+ (size_t)(100.0 * (double)inUse[BLOCK_FOR_FORGOTTEN_SET] / (double)count),
+ allocated[BLOCK_FOR_FORGOTTEN_SET],
+ freed[BLOCK_FOR_FORGOTTEN_SET],
+
+ inUse[BLOCK_FOR_HH_ALLOCATOR],
+ (size_t)(100.0 * (double)inUse[BLOCK_FOR_HH_ALLOCATOR] / (double)count),
+ allocated[BLOCK_FOR_HH_ALLOCATOR],
+ freed[BLOCK_FOR_HH_ALLOCATOR],
+
+ inUse[BLOCK_FOR_UF_ALLOCATOR],
+ (size_t)(100.0 * (double)inUse[BLOCK_FOR_UF_ALLOCATOR] / (double)count),
+ allocated[BLOCK_FOR_UF_ALLOCATOR],
+ freed[BLOCK_FOR_UF_ALLOCATOR],
+
+ inUse[BLOCK_FOR_GC_WORKLIST],
+ (size_t)(100.0 * (double)inUse[BLOCK_FOR_GC_WORKLIST] / (double)count),
+ allocated[BLOCK_FOR_GC_WORKLIST],
+ freed[BLOCK_FOR_GC_WORKLIST],
+
+ inUse[BLOCK_FOR_SUSPECTS],
+ (size_t)(100.0 * (double)inUse[BLOCK_FOR_SUSPECTS] / (double)count),
+ allocated[BLOCK_FOR_SUSPECTS],
+ freed[BLOCK_FOR_SUSPECTS],
+
+ inUse[BLOCK_FOR_EBR],
+ (size_t)(100.0 * (double)inUse[BLOCK_FOR_EBR] / (double)count),
+ allocated[BLOCK_FOR_EBR],
+ freed[BLOCK_FOR_EBR],
+
+ inUse[BLOCK_FOR_UNKNOWN_PURPOSE],
+ (size_t)(100.0 * (double)inUse[BLOCK_FOR_UNKNOWN_PURPOSE] / (double)count),
+ allocated[BLOCK_FOR_UNKNOWN_PURPOSE],
+ freed[BLOCK_FOR_UNKNOWN_PURPOSE]);
+}
+
+Sampler newBlockUsageSampler(GC_state s) {
+ struct SamplerClosure func;
+ func.fun = logCurrentBlockUsage;
+ func.env = NULL;
+
+ struct timespec desiredInterval = s->controls->blockUsageSampleInterval;
+ Sampler result = malloc(sizeof(struct Sampler));
+ initSampler(s, result, &func, &desiredInterval);
+
+ return result;
+}
+
#endif
diff --git a/runtime/gc/block-allocator.h b/runtime/gc/block-allocator.h
index 76d277ebd..47b55a2dc 100644
--- a/runtime/gc/block-allocator.h
+++ b/runtime/gc/block-allocator.h
@@ -16,6 +16,20 @@
#if (defined (MLTON_GC_INTERNAL_TYPES))
+enum BlockPurpose {
+ BLOCK_FOR_HEAP_CHUNK,
+ BLOCK_FOR_REMEMBERED_SET,
+ BLOCK_FOR_FORGOTTEN_SET,
+ BLOCK_FOR_HH_ALLOCATOR,
+ BLOCK_FOR_UF_ALLOCATOR,
+ BLOCK_FOR_GC_WORKLIST,
+ BLOCK_FOR_SUSPECTS,
+ BLOCK_FOR_EBR,
+ BLOCK_FOR_UNKNOWN_PURPOSE,
+ NUM_BLOCK_PURPOSES /** Hack to know statically how many there are. Make sure
+ * this comes last in the list. */
+};
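+
+/* Each block group is tagged with the purpose it was allocated for, and each
+ * allocator keeps per-purpose allocated/freed counters; the block-usage log
+ * (see logCurrentBlockUsage) reports usage broken down by purpose. */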
+
/** This is used for debugging, to write info about freed blocks when
* s->controls->debugKeepFreedBlocks is enabled.
*
@@ -48,6 +62,7 @@ typedef struct DebugKeptFreeBlock {
typedef struct FreeBlock {
struct FreeBlock *nextFree;
struct SuperBlock *container;
+ enum BlockPurpose purpose;
} *FreeBlock;
@@ -103,6 +118,7 @@ typedef struct SuperBlockList {
typedef struct MegaBlock {
struct MegaBlock *nextMegaBlock;
size_t numBlocks;
+ enum BlockPurpose purpose;
} *MegaBlock;
@@ -124,11 +140,10 @@ enum FullnessGroup {
typedef struct BlockAllocator {
- /** These are used for local to decide to move things to global, but
- * are ignored in global.
- */
- size_t numBlocks;
- size_t numBlocksInUse;
+ size_t numBlocksMapped;
+ size_t numBlocksReleased;
+ size_t numBlocksAllocated[NUM_BLOCK_PURPOSES];
+ size_t numBlocksFreed[NUM_BLOCK_PURPOSES];
/** There are 3 fullness groups in each size class:
* 0 is completely full, i.e. no free blocks available
@@ -160,6 +175,7 @@ typedef struct BlockAllocator {
typedef struct Blocks {
SuperBlock container;
size_t numBlocks;
+ enum BlockPurpose purpose;
} *Blocks;
#else
@@ -179,9 +195,30 @@ void initLocalBlockAllocator(GC_state s, BlockAllocator globalAllocator);
/** Get a pointer to the start of some number of free contiguous blocks. */
Blocks allocateBlocks(GC_state s, size_t numBlocks);
+Blocks allocateBlocksWithPurpose(GC_state s, size_t numBlocks, enum BlockPurpose purpose);
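+/* Usage sketch (assuming `s` is the current GC_state): request blocks tagged
+ * for a specific purpose, e.g.
+ *
+ *   Blocks bs = allocateBlocksWithPurpose(s, numBlocks, BLOCK_FOR_HEAP_CHUNK);
+ *
+ * The purpose travels with the block group (bs->purpose) and is used to
+ * account the blocks as freed when they are returned. */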
+
/** Free a group of contiguous blocks. */
void freeBlocks(GC_state s, Blocks bs, writeFreedBlockInfoFnClosure f);
+
+/** Populate:
+ *   *numBlocksMapped := cumulative number of blocks mmap'ed (all allocators)
+ *   *numGlobalBlocksMapped := cumulative number of blocks mmap'ed by the global allocator
+ *   *numBlocksReleased := cumulative number of blocks released back to the OS
+ *   *numGlobalBlocksReleased := cumulative number of blocks released by the global allocator
+ *   blocksAllocated[p] := cumulative number of blocks allocated for purpose `p`
+ *   blocksFreed[p] := cumulative number of blocks freed for purpose `p`
+ *
+ * The `blocksAllocated` and `blocksFreed` arrays must have length `NUM_BLOCK_PURPOSES`.
+ */
+void queryCurrentBlockUsage(
+ GC_state s,
+ size_t *numBlocksMapped,
+ size_t *numGlobalBlocksMapped,
+ size_t *numBlocksReleased,
+ size_t *numGlobalBlocksReleased,
+ size_t *blocksAllocated,
+ size_t *blocksFreed);
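+
+/* Example (sketch) of taking a usage snapshot, as logCurrentBlockUsage does:
+ *
+ *   size_t mapped, gMapped, released, gReleased;
+ *   size_t allocated[NUM_BLOCK_PURPOSES];
+ *   size_t freed[NUM_BLOCK_PURPOSES];
+ *   queryCurrentBlockUsage(s, &mapped, &gMapped, &released, &gReleased,
+ *                          allocated, freed);
+ *   // blocks currently in use for purpose p: allocated[p] - freed[p]
+ */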
+
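+/* Creates a sampler that periodically logs block usage via
+ * logCurrentBlockUsage; the interval comes from
+ * s->controls->blockUsageSampleInterval. */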
+Sampler newBlockUsageSampler(GC_state s);
+
#endif
#endif // BLOCK_ALLOCATOR_H_
diff --git a/runtime/gc/cc-work-list.c b/runtime/gc/cc-work-list.c
index 1724fa8b9..18705b6be 100644
--- a/runtime/gc/cc-work-list.c
+++ b/runtime/gc/cc-work-list.c
@@ -10,16 +10,18 @@ void CC_workList_init(
HM_chunkList c = &(w->storage);
HM_initChunkList(c);
// arbitrary, just need an initial chunk
- w->currentChunk = HM_allocateChunk(c, sizeof(struct CC_workList_elem));
+ w->currentChunk = HM_allocateChunkWithPurpose(
+ c,
+ sizeof(struct CC_workList_elem),
+ BLOCK_FOR_GC_WORKLIST);
}
void CC_workList_free(
__attribute__((unused)) GC_state s,
CC_workList w)
{
-
HM_chunkList c = &(w->storage);
- HM_freeChunksInListWithInfo(s, c, NULL);
+ HM_freeChunksInListWithInfo(s, c, NULL, BLOCK_FOR_GC_WORKLIST);
w->currentChunk = NULL;
}
@@ -137,7 +139,11 @@ void CC_workList_push(
if (chunk->nextChunk != NULL) {
chunk = chunk->nextChunk; // this will be an empty chunk
} else {
- chunk = HM_allocateChunk(list, elemSize);
+ // chunk = HM_allocateChunk(list, elemSize);
+ chunk = HM_allocateChunkWithPurpose(
+ list,
+ elemSize,
+ BLOCK_FOR_GC_WORKLIST);
}
w->currentChunk = chunk;
}
@@ -298,7 +304,7 @@ objptr* CC_workList_pop(
if (NULL != chunk->nextChunk) {
HM_chunk nextChunk = chunk->nextChunk;
HM_unlinkChunk(list, nextChunk);
- HM_freeChunk(s, nextChunk);
+ HM_freeChunkWithInfo(s, nextChunk, NULL, BLOCK_FOR_GC_WORKLIST);
}
assert(NULL == chunk->nextChunk);
diff --git a/runtime/gc/cc-work-list.h b/runtime/gc/cc-work-list.h
index c30eff34a..0af8209ce 100644
--- a/runtime/gc/cc-work-list.h
+++ b/runtime/gc/cc-work-list.h
@@ -51,6 +51,8 @@ void CC_workList_push(GC_state s, CC_workList w, objptr op);
* Returns NULL if work list is empty */
objptr* CC_workList_pop(GC_state s, CC_workList w);
+void CC_workList_free(GC_state s, CC_workList w);
+
#endif /* MLTON_GC_INTERNAL_FUNCS */
#endif /* CC_WORK_LIST_H */
diff --git a/runtime/gc/chunk.c b/runtime/gc/chunk.c
index 1d0589831..09c078bc4 100644
--- a/runtime/gc/chunk.c
+++ b/runtime/gc/chunk.c
@@ -107,7 +107,7 @@ HM_chunk HM_initializeChunk(pointer start, pointer end) {
chunk->mightContainMultipleObjects = TRUE;
chunk->tmpHeap = NULL;
chunk->decheckState = DECHECK_BOGUS_TID;
- chunk->disentangledDepth = INT32_MAX;
+ chunk->retireChunk = FALSE;
chunk->magic = CHUNK_MAGIC;
#if ASSERT
@@ -119,11 +119,11 @@ HM_chunk HM_initializeChunk(pointer start, pointer end) {
}
-HM_chunk HM_getFreeChunk(GC_state s, size_t bytesRequested) {
+HM_chunk HM_getFreeChunkWithPurpose(GC_state s, size_t bytesRequested, enum BlockPurpose purpose) {
size_t chunkWidth =
align(bytesRequested + sizeof(struct HM_chunk), HM_BLOCK_SIZE);
size_t numBlocks = chunkWidth / HM_BLOCK_SIZE;
- Blocks start = allocateBlocks(s, numBlocks);
+ Blocks start = allocateBlocksWithPurpose(s, numBlocks, purpose);
SuperBlock container = start->container;
numBlocks = start->numBlocks;
HM_chunk result =
@@ -134,6 +134,11 @@ HM_chunk HM_getFreeChunk(GC_state s, size_t bytesRequested) {
}
+HM_chunk HM_getFreeChunk(GC_state s, size_t bytesRequested) {
+ return HM_getFreeChunkWithPurpose(s, bytesRequested, BLOCK_FOR_UNKNOWN_PURPOSE);
+}
+
+
struct writeChunkInfoArgs {
writeFreedBlockInfoFn fun;
void* env;
@@ -174,7 +179,8 @@ void writeChunkInfo(
void HM_freeChunkWithInfo(
GC_state s,
HM_chunk chunk,
- writeFreedBlockInfoFnClosure f)
+ writeFreedBlockInfoFnClosure f,
+ enum BlockPurpose purpose)
{
struct writeChunkInfoArgs args;
@@ -199,34 +205,40 @@ void HM_freeChunkWithInfo(
Blocks bs = (Blocks)chunk;
bs->numBlocks = numBlocks;
bs->container = container;
+ bs->purpose = purpose;
freeBlocks(s, bs, &c);
}
void HM_freeChunk(GC_state s, HM_chunk chunk) {
- HM_freeChunkWithInfo(s, chunk, NULL);
+ HM_freeChunkWithInfo(s, chunk, NULL, BLOCK_FOR_UNKNOWN_PURPOSE);
}
void HM_freeChunksInListWithInfo(
GC_state s,
HM_chunkList list,
- writeFreedBlockInfoFnClosure f)
+ writeFreedBlockInfoFnClosure f,
+ enum BlockPurpose purpose)
{
HM_chunk chunk = list->firstChunk;
while (chunk != NULL) {
HM_chunk next = chunk->nextChunk;
- HM_freeChunkWithInfo(s, chunk, f);
+ HM_freeChunkWithInfo(s, chunk, f, purpose);
chunk = next;
}
HM_initChunkList(list);
}
void HM_freeChunksInList(GC_state s, HM_chunkList list) {
- HM_freeChunksInListWithInfo(s, list, NULL);
+ HM_freeChunksInListWithInfo(s, list, NULL, BLOCK_FOR_UNKNOWN_PURPOSE);
}
-HM_chunk HM_allocateChunk(HM_chunkList list, size_t bytesRequested) {
+HM_chunk HM_allocateChunkWithPurpose(
+ HM_chunkList list,
+ size_t bytesRequested,
+ enum BlockPurpose purpose)
+{
GC_state s = pthread_getspecific(gcstate_key);
- HM_chunk chunk = HM_getFreeChunk(s, bytesRequested);
+ HM_chunk chunk = HM_getFreeChunkWithPurpose(s, bytesRequested, purpose);
if (NULL == chunk) {
DIE("Out of memory. Unable to allocate chunk of size %zu.",
@@ -245,6 +257,12 @@ HM_chunk HM_allocateChunk(HM_chunkList list, size_t bytesRequested) {
return chunk;
}
+
+HM_chunk HM_allocateChunk(HM_chunkList list, size_t bytesRequested) {
+ return HM_allocateChunkWithPurpose(list, bytesRequested, BLOCK_FOR_UNKNOWN_PURPOSE);
+}
+
+
void HM_initChunkList(HM_chunkList list) {
list->firstChunk = NULL;
list->lastChunk = NULL;
@@ -475,10 +493,10 @@ size_t HM_getChunkListUsedSize(HM_chunkList list) {
return list->usedSize;
}
-pointer HM_storeInchunkList(HM_chunkList chunkList, void* p, size_t objSize) {
+pointer HM_storeInChunkListWithPurpose(HM_chunkList chunkList, void* p, size_t objSize, enum BlockPurpose purpose) {
HM_chunk chunk = HM_getChunkListLastChunk(chunkList);
if (NULL == chunk || HM_getChunkSizePastFrontier(chunk) < objSize) {
- chunk = HM_allocateChunk(chunkList, objSize);
+ chunk = HM_allocateChunkWithPurpose(chunkList, objSize, purpose);
}
assert(NULL != chunk);
@@ -493,6 +511,10 @@ pointer HM_storeInchunkList(HM_chunkList chunkList, void* p, size_t objSize) {
return frontier;
}
+pointer HM_storeInChunkList(HM_chunkList chunkList, void* p, size_t objSize) {
+ return HM_storeInChunkListWithPurpose(chunkList, p, objSize, BLOCK_FOR_UNKNOWN_PURPOSE);
+}
+
HM_HierarchicalHeap HM_getLevelHead(HM_chunk chunk) {
assert(chunk != NULL);
assert(chunk->levelHead != NULL);
@@ -624,8 +646,8 @@ void HM_assertChunkListInvariants(HM_chunkList chunkList) {
chunk = chunk->nextChunk;
}
assert(chunkList->lastChunk == chunk);
- assert(chunkList->size == size);
- assert(chunkList->usedSize == usedSize);
+ // assert(chunkList->size == size);
+ // assert(chunkList->usedSize == usedSize);
}
#else
void HM_assertChunkListInvariants(HM_chunkList chunkList) {
diff --git a/runtime/gc/chunk.h b/runtime/gc/chunk.h
index ee3cfbf62..6335b45ea 100644
--- a/runtime/gc/chunk.h
+++ b/runtime/gc/chunk.h
@@ -69,7 +69,7 @@ struct HM_chunk {
/** set during entanglement when in "safe" mode, to help temporarily disable
* local GCs while the entanglement persists.
*/
- int32_t disentangledDepth;
+ bool retireChunk;
bool mightContainMultipleObjects;
void* tmpHeap;
@@ -156,14 +156,15 @@ HM_chunk HM_getFreeChunk(GC_state s, size_t bytesRequested);
* chunk->limit - chunk->frontier <= bytesRequested
* Returns NULL if unable to find space for such a chunk. */
HM_chunk HM_allocateChunk(HM_chunkList list, size_t bytesRequested);
+HM_chunk HM_allocateChunkWithPurpose(HM_chunkList list, size_t bytesRequested, enum BlockPurpose purpose);
void HM_initChunkList(HM_chunkList list);
void HM_freeChunk(GC_state s, HM_chunk chunk);
void HM_freeChunksInList(GC_state s, HM_chunkList list);
-void HM_freeChunkWithInfo(GC_state s, HM_chunk chunk, writeFreedBlockInfoFnClosure f);
-void HM_freeChunksInListWithInfo(GC_state s, HM_chunkList list, writeFreedBlockInfoFnClosure f);
+void HM_freeChunkWithInfo(GC_state s, HM_chunk chunk, writeFreedBlockInfoFnClosure f, enum BlockPurpose purpose);
+void HM_freeChunksInListWithInfo(GC_state s, HM_chunkList list, writeFreedBlockInfoFnClosure f, enum BlockPurpose purpose);
// void HM_deleteChunks(GC_state s, HM_chunkList deleteList);
void HM_appendChunkList(HM_chunkList destinationChunkList, HM_chunkList chunkList);
@@ -265,7 +266,8 @@ pointer HM_shiftChunkStart(HM_chunk chunk, size_t bytes);
pointer HM_getChunkStartGap(HM_chunk chunk);
/* store the object pointed by p at the end of list and return the address */
-pointer HM_storeInchunkList(HM_chunkList chunkList, void* p, size_t objSize);
+pointer HM_storeInChunkList(HM_chunkList chunkList, void* p, size_t objSize);
+pointer HM_storeInChunkListWithPurpose(HM_chunkList chunkList, void* p, size_t objSize, enum BlockPurpose purpose);
/**
diff --git a/runtime/gc/concurrent-collection.c b/runtime/gc/concurrent-collection.c
index 13987d1c7..f7133f87e 100644
--- a/runtime/gc/concurrent-collection.c
+++ b/runtime/gc/concurrent-collection.c
@@ -325,6 +325,7 @@ void tryMarkAndAddToWorkList(
if (!CC_isPointerMarked(p)) {
markObj(p);
args->bytesSaved += sizeofObject(s, p);
+ args->numObjectsMarked++;
assert(CC_isPointerMarked(p));
CC_workList_push(s, &(args->worklist), op);
}
@@ -416,7 +417,9 @@ void forwardPtrChunk (GC_state s, objptr *opp, void* rawArgs) {
void forwardPinned(GC_state s, HM_remembered remElem, void* rawArgs) {
objptr src = remElem->object;
tryMarkAndMarkLoop(s, &src, src, rawArgs);
- tryMarkAndMarkLoop(s, &(remElem->from), remElem->from, rawArgs);
+ if (remElem->from != BOGUS_OBJPTR) {
+ tryMarkAndMarkLoop(s, &(remElem->from), remElem->from, rawArgs);
+ }
#if 0
#if ASSERT
@@ -496,10 +499,12 @@ void unmarkPinned(
{
objptr src = remElem->object;
assert(!(HM_getChunkOf(objptrToPointer(src, NULL))->pinnedDuringCollection));
+ tryUnmarkAndUnmarkLoop(s, &src, src, rawArgs);
+ if (remElem->from != BOGUS_OBJPTR) {
+ tryUnmarkAndUnmarkLoop(s, &(remElem->from), remElem->from, rawArgs);
+ }
// unmarkPtrChunk(s, &src, rawArgs);
// unmarkPtrChunk(s, &(remElem->from), rawArgs);
- tryUnmarkAndUnmarkLoop(s, &src, src, rawArgs);
- tryUnmarkAndUnmarkLoop(s, &(remElem->from), remElem->from, rawArgs);
#if 0
#if ASSERT
@@ -536,6 +541,7 @@ void forceForward(GC_state s, objptr *opp, void* rawArgs) {
markObj(p);
assert(CC_isPointerMarked(p));
args->bytesSaved += sizeofObject(s, p);
+ args->numObjectsMarked++;
}
CC_workList_push(s, &(args->worklist), op);
@@ -648,19 +654,22 @@ void CC_collectAtRoot(pointer threadp, pointer hhp) {
#endif
size_t beforeSize = HM_getChunkListSize(HM_HH_getChunkList(heap));
- size_t live = CC_collectWithRoots(s, heap, thread);
+ size_t live = 0;
+ size_t numObjectsMarked = 0;
+ CC_collectWithRoots(s, heap, thread, &live, &numObjectsMarked);
size_t afterSize = HM_getChunkListSize(HM_HH_getChunkList(heap));
size_t diff = beforeSize > afterSize ? beforeSize - afterSize : 0;
LOG(LM_CC_COLLECTION, LL_INFO,
- "finished at depth %u. before: %zu after: %zu (-%.01lf%%) live: %zu (%.01lf%% fragmented)",
+ "finished at depth %u. before: %zu after: %zu (-%.01lf%%) live: %zu (%.01lf%% fragmented) objects: %zu",
heap->depth,
beforeSize,
afterSize,
100.0 * ((double)diff / (double)beforeSize),
live,
- 100.0 * (1.0 - (double)live / (double)afterSize));
+ 100.0 * (1.0 - (double)live / (double)afterSize),
+ numObjectsMarked);
// HM_HH_getConcurrentPack(heap)->ccstate = CC_UNREG;
__atomic_store_n(&(HM_HH_getConcurrentPack(heap)->ccstate), CC_DONE, __ATOMIC_SEQ_CST);
@@ -694,7 +703,7 @@ void CC_collectAtPublicLevel(GC_state s, GC_thread thread, uint32_t depth) {
// collect only if the heap is above a threshold size
if (HM_getChunkListSize(&(heap->chunkList)) >= 2 * HM_BLOCK_SIZE) {
assert(getThreadCurrent(s) == thread);
- CC_collectWithRoots(s, heap, thread);
+ CC_collectWithRoots(s, heap, thread, NULL, NULL);
}
// Mark that collection is complete
@@ -705,18 +714,17 @@ void CC_collectAtPublicLevel(GC_state s, GC_thread thread, uint32_t depth) {
/* ========================================================================= */
struct CC_tryUnpinOrKeepPinnedArgs {
- HM_chunkList newRemSet;
+ HM_remSet newRemSet;
HM_HierarchicalHeap tgtHeap;
void* fromSpaceMarker;
void* toSpaceMarker;
};
-
void CC_tryUnpinOrKeepPinned(
- __attribute__((unused)) GC_state s,
- HM_remembered remElem,
- void* rawArgs)
+ __attribute__((unused)) GC_state s,
+ HM_remembered remElem,
+ void *rawArgs)
{
struct CC_tryUnpinOrKeepPinnedArgs* args =
(struct CC_tryUnpinOrKeepPinnedArgs *)rawArgs;
@@ -751,7 +759,7 @@ void CC_tryUnpinOrKeepPinned(
* entry. It will be merged and handled properly later.
*/
- HM_remember(args->newRemSet, remElem);
+ HM_remember(args->newRemSet, remElem, false);
return;
}
@@ -759,7 +767,40 @@ void CC_tryUnpinOrKeepPinned(
assert(chunk->tmpHeap == args->fromSpaceMarker);
assert(HM_getLevelHead(chunk) == args->tgtHeap);
- HM_chunk fromChunk = HM_getChunkOf(objptrToPointer(remElem->from, NULL));
+ // pointer p = objptrToPointer(remElem->object, NULL);
+ // GC_header header = getHeader(p);
+ // enum PinType pt = pinType(header);
+ // uint32_t unpinDepth = unpinDepthOf(remElem->object);
+
+ // assert (pt != PIN_NONE);
+
+ if (remElem->from != BOGUS_OBJPTR)
+ {
+
+ // uint32_t opDepth = HM_HH_getDepth(args->tgtHeap);
+ // if (unpinDepth > opDepth && pt == PIN_ANY) {
+ // tryPinDec(remElem->object, opDepth);
+ // }
+
+ HM_chunk fromChunk = HM_getChunkOf(objptrToPointer(remElem->from, NULL));
+ assert(fromChunk->tmpHeap != args->toSpaceMarker);
+
+ if (fromChunk->tmpHeap == args->fromSpaceMarker) {
+ assert(isChunkInList(fromChunk, HM_HH_getChunkList(args->tgtHeap)));
+ return;
+ }
+
+ /* otherwise, object stays pinned, and we have to keep this remembered
+ * entry into the toSpace. */
+ } else {
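+    /* No `from` object recorded. If the object is still PIN_ANY and its unpin
+     * depth is at least as deep as this heap, the entry can be dropped;
+     * otherwise fall through and keep it in the new remembered set below. */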
+ GC_header header = getHeader(objptrToPointer(remElem->object, NULL));
+ if (pinType(header) == PIN_ANY &&
+ unpinDepthOfH(header) >= HM_HH_getDepth(args->tgtHeap)) {
+ return;
+ }
+ }
+
+ HM_remember(args->newRemSet, remElem, false);
/** SAM_NOTE: The goal of the following was to filter remset entries
* to only keep the "shallowest" entries. But this is really tricky,
@@ -772,7 +813,6 @@ void CC_tryUnpinOrKeepPinned(
*/
#if 0
uint32_t unpinDepth = unpinDepthOf(op);
- uint32_t opDepth = HM_HH_getDepth(args->tgtHeap);
uint32_t fromDepth = HM_HH_getDepth(HM_getLevelHead(fromChunk));
if (fromDepth > unpinDepth) {
/** Can forget any down-pointer that came from shallower than the
@@ -783,20 +823,6 @@ void CC_tryUnpinOrKeepPinned(
assert(opDepth < unpinDepth || fromDepth == unpinDepth);
#endif
- assert(fromChunk->tmpHeap != args->toSpaceMarker);
-
- if (fromChunk->tmpHeap == args->fromSpaceMarker)
- {
- // fromChunk is in-scope of CC. Don't need to keep this remembered entry.
- assert(isChunkInList(fromChunk, HM_HH_getChunkList(args->tgtHeap)));
- return;
- }
-
- /* otherwise, object stays pinned, and we have to keep this remembered
- * entry into the toSpace. */
-
- HM_remember(args->newRemSet, remElem);
-
assert(isChunkInList(chunk, HM_HH_getChunkList(args->tgtHeap)));
assert(HM_getLevelHead(chunk) == args->tgtHeap);
}
@@ -809,9 +835,9 @@ void CC_filterPinned(
void* fromSpaceMarker,
void* toSpaceMarker)
{
- HM_chunkList oldRemSet = HM_HH_getRemSet(hh);
- struct HM_chunkList newRemSet;
- HM_initChunkList(&newRemSet);
+ HM_remSet oldRemSet = HM_HH_getRemSet(hh);
+ struct HM_remSet newRemSet;
+ HM_initRemSet(&newRemSet);
LOG(LM_CC_COLLECTION, LL_INFO,
"num pinned initially: %zu",
@@ -832,7 +858,7 @@ void CC_filterPinned(
/** Save "valid" entries to newRemSet, throw away old entries, and store
* valid entries back into the main remembered set.
*/
- HM_foreachRemembered(s, oldRemSet, &closure);
+ HM_foreachRemembered(s, oldRemSet, &closure, false);
struct CC_chunkInfo info =
{.initialDepth = initialDepth,
@@ -842,13 +868,17 @@ void CC_filterPinned(
struct writeFreedBlockInfoFnClosure infoc =
{.fun = CC_writeFreeChunkInfo, .env = &info};
- HM_freeChunksInListWithInfo(s, oldRemSet, &infoc);
- *oldRemSet = newRemSet; // this moves all data into remset of hh
+ // HM_freeRemSetWithInfo(s, oldRemSet, &infoc);
+  // this reinitializes the private remset
+ HM_freeChunksInListWithInfo(s, &(oldRemSet->private), &infoc, BLOCK_FOR_REMEMBERED_SET);
+ assert (newRemSet.public.firstChunk == NULL);
+ // this moves all data into remset of hh
+ HM_appendRemSet(oldRemSet, &newRemSet);
- assert(HM_HH_getRemSet(hh)->firstChunk == newRemSet.firstChunk);
- assert(HM_HH_getRemSet(hh)->lastChunk == newRemSet.lastChunk);
- assert(HM_HH_getRemSet(hh)->size == newRemSet.size);
- assert(HM_HH_getRemSet(hh)->usedSize == newRemSet.usedSize);
+ // assert(HM_HH_getRemSet(hh)->firstChunk == newRemSet.firstChunk);
+ // assert(HM_HH_getRemSet(hh)->lastChunk == newRemSet.lastChunk);
+ // assert(HM_HH_getRemSet(hh)->size == newRemSet.size);
+ // assert(HM_HH_getRemSet(hh)->usedSize == newRemSet.usedSize);
LOG(LM_CC_COLLECTION, LL_INFO,
"num pinned after filter: %zu",
@@ -883,10 +913,12 @@ void CC_filterDownPointers(GC_state s, HM_chunkList x, HM_HierarchicalHeap hh){
#endif
-size_t CC_collectWithRoots(
+void CC_collectWithRoots(
GC_state s,
HM_HierarchicalHeap targetHH,
- __attribute__((unused)) GC_thread thread)
+ __attribute__((unused)) GC_thread thread,
+ size_t *outputBytesSaved,
+ size_t *outputNumObjectsMarked)
{
getStackCurrent(s)->used = sizeofGCStateCurrentStackUsed(s);
getThreadCurrent(s)->exnStack = s->exnStack;
@@ -910,9 +942,6 @@ size_t CC_collectWithRoots(
// chunks in which all objects are garbage. Before exiting, chunks in
// origList are added to the free list.
- bool isConcurrent = (HM_HH_getDepth(targetHH) == 1);
- // assert(isConcurrent);
-
uint32_t initialDepth = HM_HH_getDepth(targetHH);
struct HM_chunkList _repList;
@@ -927,7 +956,8 @@ size_t CC_collectWithRoots(
.repList = repList,
.toHead = (void*)repList,
.fromHead = (void*) &(origList),
- .bytesSaved = 0
+ .bytesSaved = 0,
+ .numObjectsMarked = 0
};
CC_workList_init(s, &(lists.worklist));
@@ -965,7 +995,7 @@ size_t CC_collectWithRoots(
struct HM_foreachDownptrClosure forwardPinnedClosure =
{.fun = forwardPinned, .env = (void*)&lists};
- HM_foreachRemembered(s, HM_HH_getRemSet(targetHH), &forwardPinnedClosure);
+ HM_foreachRemembered(s, HM_HH_getRemSet(targetHH), &forwardPinnedClosure, false);
// forward closures, stack and deque?
forceForward(s, &(cp->snapLeft), &lists);
@@ -1025,7 +1055,7 @@ size_t CC_collectWithRoots(
struct HM_foreachDownptrClosure unmarkPinnedClosure =
{.fun = unmarkPinned, .env = &lists};
- HM_foreachRemembered(s, HM_HH_getRemSet(targetHH), &unmarkPinnedClosure);
+ HM_foreachRemembered(s, HM_HH_getRemSet(targetHH), &unmarkPinnedClosure, false);
forceUnmark(s, &(cp->snapLeft), &lists);
forceUnmark(s, &(cp->snapRight), &lists);
@@ -1039,7 +1069,7 @@ size_t CC_collectWithRoots(
forEachObjptrInCCStackBag(s, removedFromCCBag, tryUnmarkAndUnmarkLoop, &lists);
unmarkLoop(s, &lists);
- HM_freeChunksInList(s, removedFromCCBag);
+ HM_freeChunksInListWithInfo(s, removedFromCCBag, NULL, BLOCK_FOR_FORGOTTEN_SET);
assert(CC_workList_isEmpty(s, &(lists.worklist)));
CC_workList_free(s, &(lists.worklist));
@@ -1126,8 +1156,8 @@ size_t CC_collectWithRoots(
/** SAM_NOTE: TODO: deleteList no longer needed, because
* block allocator handles that.
*/
- HM_freeChunksInListWithInfo(s, origList, &infoc);
- HM_freeChunksInListWithInfo(s, deleteList, &infoc);
+ HM_freeChunksInListWithInfo(s, origList, &infoc, BLOCK_FOR_HEAP_CHUNK);
+ HM_freeChunksInListWithInfo(s, deleteList, &infoc, BLOCK_FOR_HEAP_CHUNK);
for(HM_chunk chunk = repList->firstChunk;
chunk!=NULL; chunk = chunk->nextChunk) {
@@ -1141,9 +1171,11 @@ size_t CC_collectWithRoots(
assert(!(stackChunk->mightContainMultipleObjects));
assert(HM_HH_getChunkList(HM_getLevelHead(stackChunk)) == origList);
assert(isChunkInList(stackChunk, origList));
+ assert(bytesSaved >= HM_getChunkUsedSize(stackChunk));
+ bytesSaved -= HM_getChunkUsedSize(stackChunk);
HM_unlinkChunk(origList, stackChunk);
info.freedType = CC_FREED_STACK_CHUNK;
- HM_freeChunkWithInfo(s, stackChunk, &infoc);
+ HM_freeChunkWithInfo(s, stackChunk, &infoc, BLOCK_FOR_HEAP_CHUNK);
info.freedType = CC_FREED_NORMAL_CHUNK;
cp->stack = BOGUS_OBJPTR;
@@ -1164,19 +1196,22 @@ size_t CC_collectWithRoots(
timespec_now(&stopTime);
timespec_sub(&stopTime, &startTime);
-
- if (isConcurrent) {
- timespec_add(&(s->cumulativeStatistics->timeRootCC), &stopTime);
- s->cumulativeStatistics->numRootCCs++;
- s->cumulativeStatistics->bytesReclaimedByRootCC += bytesScanned-bytesSaved;
- } else {
- timespec_add(&(s->cumulativeStatistics->timeInternalCC), &stopTime);
- s->cumulativeStatistics->numInternalCCs++;
- s->cumulativeStatistics->bytesReclaimedByInternalCC += bytesScanned-bytesSaved;
+ timespec_add(&(s->cumulativeStatistics->timeCC), &stopTime);
+ s->cumulativeStatistics->numCCs++;
+ assert(bytesScanned >= bytesSaved);
+ uintmax_t bytesReclaimed = bytesScanned-bytesSaved;
+ s->cumulativeStatistics->bytesInScopeForCC += bytesScanned;
+ s->cumulativeStatistics->bytesReclaimedByCC += bytesReclaimed;
+
+ if (outputBytesSaved != NULL) {
+ *outputBytesSaved = lists.bytesSaved;
}
- return lists.bytesSaved;
-
+ if (outputNumObjectsMarked != NULL) {
+ *outputNumObjectsMarked = lists.numObjectsMarked;
+ }
+
+ return;
}
#endif
diff --git a/runtime/gc/concurrent-collection.h b/runtime/gc/concurrent-collection.h
index a52dd757b..db0bfc561 100644
--- a/runtime/gc/concurrent-collection.h
+++ b/runtime/gc/concurrent-collection.h
@@ -28,6 +28,7 @@ typedef struct ConcurrentCollectArgs {
void* toHead;
void* fromHead;
size_t bytesSaved;
+ size_t numObjectsMarked;
} ConcurrentCollectArgs;
@@ -89,8 +90,13 @@ PRIVATE void GC_updateObjectHeader(GC_state s, pointer p, GC_header newHeader);
// in the chunk is live then the whole chunk is. However, tracing is at the granularity of objects.
// Objects in chunks that are preserved may point to chunks that are not. But such objects aren't
// reachable.
-size_t CC_collectWithRoots(GC_state s, struct HM_HierarchicalHeap * targetHH, GC_thread thread);
-
+void CC_collectWithRoots(
+ GC_state s,
+ struct HM_HierarchicalHeap * targetHH,
+ GC_thread thread,
+ size_t *bytesSaved,
+ size_t *numObjectsMarked);
+
void CC_collectAtPublicLevel(GC_state s, GC_thread thread, uint32_t depth);
void CC_addToStack(GC_state s, ConcurrentPackage cp, pointer p);
void CC_initStack(GC_state s, ConcurrentPackage cp);
diff --git a/runtime/gc/concurrent-list.c b/runtime/gc/concurrent-list.c
new file mode 100644
index 000000000..e34c165ad
--- /dev/null
+++ b/runtime/gc/concurrent-list.c
@@ -0,0 +1,224 @@
+void CC_initConcList(CC_concList concList) {
+ concList->firstChunk = NULL;
+ concList->lastChunk = NULL;
+ pthread_mutex_init(&(concList->mutex), NULL);
+}
+
+
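+/* Append a fresh chunk to the concurrent list, but only if `lastChunk` is
+ * still the tail. The tail is re-checked under the mutex; if another thread
+ * already appended a chunk, the newly allocated one is returned to the block
+ * allocator and the caller retries. */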
+void allocateChunkInConcList(
+ CC_concList concList,
+ size_t objSize,
+ HM_chunk lastChunk,
+ enum BlockPurpose purpose)
+{
+ GC_state s = pthread_getspecific(gcstate_key);
+
+ // pthread_mutex_lock(&concList->mutex);
+ if(concList->lastChunk != lastChunk) {
+ // pthread_mutex_unlock(&concList->mutex);
+ return;
+ }
+
+ HM_chunk chunk = HM_getFreeChunkWithPurpose(s, objSize, purpose);
+
+ if (NULL == chunk)
+ {
+ DIE("Out of memory. Unable to allocate chunk of size %zu.",
+ objSize);
+ }
+
+ assert(chunk->frontier == HM_getChunkStart(chunk));
+ assert(chunk->mightContainMultipleObjects);
+ assert((size_t)(chunk->limit - chunk->frontier) >= objSize);
+ assert(chunk != NULL);
+
+ chunk->prevChunk = lastChunk;
+ // if (lastChunk != NULL)
+ // {
+ // lastChunk->nextChunk = chunk;
+ // }
+ // concList->lastChunk = chunk;
+
+ memset((void *)HM_getChunkStart(chunk), '\0', HM_getChunkLimit(chunk) - HM_getChunkStart(chunk));
+
+ bool success = false;
+ if(concList->lastChunk == lastChunk) {
+ pthread_mutex_lock(&concList->mutex);
+ if (concList->lastChunk == lastChunk) {
+ if (concList->firstChunk == NULL) {
+ concList->firstChunk = chunk;
+ }
+ if(lastChunk != NULL) {
+ lastChunk->nextChunk = chunk;
+ }
+ concList->lastChunk = chunk;
+ success = true;
+ }
+ pthread_mutex_unlock(&concList->mutex);
+ }
+ if (!success) {
+ HM_freeChunkWithInfo(s, chunk, NULL, purpose);
+ }
+
+ // if (!__sync_bool_compare_and_swap(&(concList->lastChunk), lastChunk, chunk)) {
+ // HM_freeChunk(s, chunk);
+ // } else if (lastChunk != NULL) {
+ // lastChunk->nextChunk = chunk;
+ // }
+}
+
+
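+/* Lock-free store into the concurrent list: read the tail chunk, reserve
+ * `objSize` bytes by CAS-bumping its frontier, then copy the object in.
+ * If the tail is missing or full, a new chunk is appended via
+ * allocateChunkInConcList; if the CAS loses a race, the loop simply retries.
+ * Returns the address of the stored copy. */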
+pointer CC_storeInConcListWithPurpose(CC_concList concList, void* p, size_t objSize, enum BlockPurpose purpose){
+ assert(concList != NULL);
+ // pthread_mutex_lock(&concList->mutex);
+ while(TRUE) {
+ HM_chunk chunk = concList->lastChunk;
+ if (NULL == chunk) {
+ allocateChunkInConcList(concList, objSize, chunk, purpose);
+ continue;
+ }
+ else {
+ pointer frontier = HM_getChunkFrontier(chunk);
+ size_t sizePast = (size_t) (chunk->limit - frontier);
+ if (sizePast < objSize) {
+ allocateChunkInConcList(concList, objSize, chunk, purpose);
+ continue;
+ }
+
+ pointer new_frontier = frontier + objSize;
+ bool success = __sync_bool_compare_and_swap(&(chunk->frontier), frontier, new_frontier);
+ if (success)
+ {
+ memcpy(frontier, p, objSize);
+ // pthread_mutex_unlock(&concList->mutex);
+ return frontier;
+ }
+ }
+ }
+ // pthread_mutex_unlock(&concList->mutex);
+  DIE("should never reach here");
+ return NULL;
+}
+
+
+// void CC_foreachObjInList(CC_concList concList, size_t objSize, HM_foreachObjClosure f) {
+
+// struct HM_chunkList _chunkList;
+// HM_chunkList chunkList = &(_chunkList);
+// chunkList->firstChunk = concList->firstChunk;
+// pthread_mutex_lock(&concList->mutex);
+// chunkList->lastChunk = concList->lastChunk;
+// concList->firstChunk = NULL;
+// concList->lastChunk = NULL;
+// pthread_mutex_unlock(&concList->mutex);
+// HM_foreachObjInChunkList(chunkList, objSize, f);
+// }
+
+// void CC_foreachRemInConc(GC_state s, CC_concList concList, struct HM_foreachDownptrClosure* f) {
+// struct HM_chunkList _store;
+// HM_chunkList store = &(_store);
+// HM_initChunkList(store);
+
+// while (TRUE) {
+// HM_chunk firstChunk = concList->firstChunk;
+// if (firstChunk == NULL) {
+// break;
+// }
+
+// pthread_mutex_lock(&concList->mutex);
+// HM_chunk lastChunk = concList->lastChunk;
+// concList->firstChunk = NULL;
+// concList->lastChunk = NULL;
+// pthread_mutex_unlock(&concList->mutex);
+
+// assert(firstChunk != NULL);
+// assert(lastChunk != NULL);
+// HM_chunk chunk = firstChunk;
+// while (chunk != NULL) {
+// pointer p = HM_getChunkStart(chunk);
+// pointer frontier = HM_getChunkFrontier(chunk);
+// while (p < frontier)
+// {
+// f->fun(s, (HM_remembered)p, f->env);
+// p += sizeof(struct HM_remembered);
+// }
+// chunk = chunk->nextChunk;
+// }
+
+// if (store->firstChunk != NULL) {
+// store->lastChunk->nextChunk = firstChunk;
+// firstChunk->prevChunk = store->lastChunk;
+// store->lastChunk = lastChunk;
+// }
+// else {
+// store->firstChunk = firstChunk;
+// store->lastChunk = lastChunk;
+// }
+// }
+
+// /*add the chunks back to the list*/
+// pthread_mutex_lock(&concList->mutex);
+// if (concList->firstChunk != NULL) {
+// concList->firstChunk->prevChunk = store->lastChunk;
+// store->lastChunk->nextChunk = concList->firstChunk;
+// concList->firstChunk = store->firstChunk;
+// }
+// else {
+// concList->firstChunk = store->firstChunk;
+// concList->lastChunk = store->lastChunk;
+// }
+// pthread_mutex_unlock(&concList->mutex);
+
+// }
+
+void CC_popAsChunkList(CC_concList concList, HM_chunkList chunkList) {
+ pthread_mutex_lock(&concList->mutex);
+ chunkList->firstChunk = concList->firstChunk;
+ concList->firstChunk = NULL;
+ chunkList->lastChunk = concList->lastChunk;
+ concList->lastChunk = NULL;
+ pthread_mutex_unlock(&concList->mutex);
+}
+
+HM_chunk CC_getLastChunk (CC_concList concList) {
+ HM_chunk c;
+ pthread_mutex_lock(&concList->mutex);
+ c = concList->lastChunk;
+ pthread_mutex_unlock(&concList->mutex);
+ return c;
+}
+
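+/* Splice the chunks of concList2 onto the end of concList1: concList2 is
+ * detached under its own mutex, then linked onto concList1's tail under
+ * concList1's mutex. The previous tail of concList1 is flagged `retireChunk`. */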
+void CC_appendConcList(CC_concList concList1, CC_concList concList2) {
+
+ HM_chunk firstChunk, lastChunk;
+ pthread_mutex_lock(&concList2->mutex);
+ firstChunk = concList2->firstChunk;
+ lastChunk = concList2->lastChunk;
+ concList2->firstChunk = NULL;
+ concList2->lastChunk = NULL;
+ pthread_mutex_unlock(&concList2->mutex);
+
+ if (firstChunk == NULL || lastChunk == NULL) {
+ return;
+ }
+
+
+ pthread_mutex_lock(&concList1->mutex);
+ if (concList1->lastChunk == NULL) {
+ concList1->firstChunk = firstChunk;
+ concList1->lastChunk = lastChunk;
+ }
+ else {
+ concList1->lastChunk->nextChunk = firstChunk;
+ concList1->lastChunk->retireChunk = true;
+ firstChunk->prevChunk = concList1->lastChunk;
+ concList1->lastChunk = lastChunk;
+ }
+ pthread_mutex_unlock(&concList1->mutex);
+}
+
+void CC_freeChunksInConcListWithInfo(GC_state s, CC_concList concList, void *info, enum BlockPurpose purpose) {
+ struct HM_chunkList _chunkList;
+ CC_popAsChunkList(concList, &(_chunkList));
+ HM_freeChunksInListWithInfo(s, &(_chunkList), info, purpose);
+}
\ No newline at end of file
diff --git a/runtime/gc/concurrent-list.h b/runtime/gc/concurrent-list.h
new file mode 100644
index 000000000..dbe042abd
--- /dev/null
+++ b/runtime/gc/concurrent-list.h
@@ -0,0 +1,40 @@
+/* Copyright (C) 2018-2019 Sam Westrick
+ * Copyright (C) 2015 Ram Raghunathan.
+ *
+ * MLton is released under a HPND-style license.
+ * See the file MLton-LICENSE for details.
+ */
+
+
+#ifndef CC_LIST_H
+#define CC_LIST_H
+
+struct CC_concList;
+typedef struct CC_concList * CC_concList;
+
+#if (defined (MLTON_GC_INTERNAL_TYPES))
+
+struct CC_concList {
+ HM_chunk firstChunk;
+ HM_chunk lastChunk;
+ pthread_mutex_t mutex;
+};
+
+#endif /* MLTON_GC_INTERNAL_TYPES */
+
+#if (defined (MLTON_GC_INTERNAL_FUNCS))
+
+void CC_initConcList(CC_concList concList);
+pointer CC_storeInConcListWithPurpose(CC_concList concList, void* p, size_t objSize, enum BlockPurpose purpose);
+
+// void CC_foreachObjInList(CC_concList concList, size_t objSize, HM_foreachObjClosure f);
+// void CC_foreachRemInConc(GC_state s, CC_concList concList, struct HM_foreachDownptrClosure* f);
+void CC_popAsChunkList(CC_concList concList, HM_chunkList chunkList);
+
+HM_chunk CC_getLastChunk (CC_concList concList);
+void CC_freeChunksInConcListWithInfo(GC_state s, CC_concList concList, void *info, enum BlockPurpose purpose);
+void CC_appendConcList(CC_concList concList1, CC_concList concList2);
+
+#endif /* MLTON_GC_INTERNAL_FUNCS */
+
+#endif /* CC_LIST_H */
diff --git a/runtime/gc/concurrent-stack.c b/runtime/gc/concurrent-stack.c
index 325ed4025..15c4535c5 100644
--- a/runtime/gc/concurrent-stack.c
+++ b/runtime/gc/concurrent-stack.c
@@ -71,7 +71,7 @@ bool CC_stack_data_push(CC_stack_data* stack, void* datum){
stack->storage[stack->size++] = datum;
#endif
- HM_storeInchunkList(&(stack->storage), &(datum), sizeof(datum));
+ HM_storeInChunkListWithPurpose(&(stack->storage), &(datum), sizeof(datum), BLOCK_FOR_FORGOTTEN_SET);
pthread_mutex_unlock(&stack->mutex);
return TRUE;
}
@@ -141,7 +141,7 @@ void CC_stack_data_free(CC_stack_data* stack){
#endif
void CC_stack_data_free(GC_state s, CC_stack_data* stack) {
- HM_freeChunksInList(s, &(stack->storage));
+ HM_freeChunksInListWithInfo(s, &(stack->storage), NULL, BLOCK_FOR_FORGOTTEN_SET);
}
void CC_stack_free(GC_state s, CC_stack* stack) {
@@ -154,7 +154,7 @@ void CC_stack_free(GC_state s, CC_stack* stack) {
void CC_stack_data_clear(GC_state s, CC_stack_data* stack){
pthread_mutex_lock(&stack->mutex);
- HM_freeChunksInList(s, &(stack->storage));
+ HM_freeChunksInListWithInfo(s, &(stack->storage), NULL, BLOCK_FOR_FORGOTTEN_SET);
pthread_mutex_unlock(&stack->mutex);
}
diff --git a/runtime/gc/controls.h b/runtime/gc/controls.h
index e9ff7f394..75af66944 100644
--- a/runtime/gc/controls.h
+++ b/runtime/gc/controls.h
@@ -64,6 +64,7 @@ struct GC_controls {
size_t allocBlocksMinSize;
size_t superblockThreshold; // upper bound on size-class of a superblock
size_t megablockThreshold; // upper bound on size-class of a megablock (unmap above this threshold)
+ struct timespec blockUsageSampleInterval;
float emptinessFraction;
bool debugKeepFreeBlocks;
bool manageEntanglement;
diff --git a/runtime/gc/decheck.c b/runtime/gc/decheck.c
index 288ee4220..14556e204 100644
--- a/runtime/gc/decheck.c
+++ b/runtime/gc/decheck.c
@@ -34,18 +34,18 @@ bool GC_HH_decheckMaxDepth(ARG_USED_FOR_DETECT_ENTANGLEMENT objptr resultRef) {
#ifdef DETECT_ENTANGLEMENT
void decheckInit(GC_state s) {
#if ASSERT
- if (mmap(SYNCH_DEPTHS_BASE, SYNCH_DEPTHS_LEN, PROT_WRITE,
- MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, 0, 0) == MAP_FAILED) {
- perror("mmap error");
- exit(-1);
- }
- memset(synch_depths, 0, MAX_PATHS * sizeof(uint32_t));
- synch_depths[1] = 0;
+ if (mmap(SYNCH_DEPTHS_BASE, SYNCH_DEPTHS_LEN, PROT_WRITE,
+ MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, 0, 0) == MAP_FAILED) {
+ perror("mmap error");
+ exit(-1);
+ }
+ memset(synch_depths, 0, MAX_PATHS * sizeof(uint32_t));
+ synch_depths[1] = 0;
#endif
- GC_thread thread = getThreadCurrent(s);
- thread->decheckState.internal.path = 1;
- thread->decheckState.internal.depth = 0;
+ GC_thread thread = getThreadCurrent(s);
+ thread->decheckState.internal.path = 1;
+ thread->decheckState.internal.depth = 0;
}
#else
inline void decheckInit(GC_state s) {
@@ -55,16 +55,16 @@ inline void decheckInit(GC_state s) {
#endif
static inline unsigned int tree_depth(decheck_tid_t tid) {
- return tid.internal.depth & 0x1f;
+ return tid.internal.depth & 0x1f;
}
static inline unsigned int dag_depth(decheck_tid_t tid) {
- return tid.internal.depth >> 5;
+ return tid.internal.depth >> 5;
}
static inline uint32_t norm_path(decheck_tid_t tid) {
- unsigned int td = tree_depth(tid);
- return tid.internal.path & ((1 << (td+1)) - 1);
+ unsigned int td = tree_depth(tid);
+ return tid.internal.path & ((1 << (td+1)) - 1);
}
@@ -89,31 +89,31 @@ static inline uint32_t decheckGetSyncDepth(GC_thread thread, uint32_t pathLen) {
* refs just to pass values by destination-passing through the FFI.
*/
void GC_HH_decheckFork(GC_state s, uint64_t *left, uint64_t *right) {
- GC_thread thread = getThreadCurrent(s);
- decheck_tid_t tid = thread->decheckState;
- assert(tid.bits != DECHECK_BOGUS_BITS);
- unsigned int h = tree_depth(tid);
- assert(h < MAX_FORK_DEPTH);
-
- decheck_tid_t t1;
- t1.internal.path = (tid.internal.path & ~(1 << h)) | (1 << (h+1));
- t1.internal.depth = tid.internal.depth + (1 << 5) + 1;
- *left = t1.bits;
-
- decheck_tid_t t2;
- t2.internal.path = (tid.internal.path | (1 << h)) | (1 << (h+1));
- t2.internal.depth = tid.internal.depth + (1 << 5) + 1;
- *right = t2.bits;
-
- assert(tree_depth(t1) == tree_depth(tid)+1);
- assert(tree_depth(t2) == tree_depth(tid)+1);
- assert(dag_depth(t1) == dag_depth(tid)+1);
- assert(dag_depth(t2) == dag_depth(tid)+1);
- assert((norm_path(t1) ^ norm_path(t2)) == (uint32_t)(1 << h));
+ GC_thread thread = getThreadCurrent(s);
+ decheck_tid_t tid = thread->decheckState;
+ assert(tid.bits != DECHECK_BOGUS_BITS);
+ unsigned int h = tree_depth(tid);
+ assert(h < MAX_FORK_DEPTH);
+
+ decheck_tid_t t1;
+ t1.internal.path = (tid.internal.path & ~(1 << h)) | (1 << (h+1));
+ t1.internal.depth = tid.internal.depth + (1 << 5) + 1;
+ *left = t1.bits;
+
+ decheck_tid_t t2;
+ t2.internal.path = (tid.internal.path | (1 << h)) | (1 << (h+1));
+ t2.internal.depth = tid.internal.depth + (1 << 5) + 1;
+ *right = t2.bits;
+
+ assert(tree_depth(t1) == tree_depth(tid)+1);
+ assert(tree_depth(t2) == tree_depth(tid)+1);
+ assert(dag_depth(t1) == dag_depth(tid)+1);
+ assert(dag_depth(t2) == dag_depth(tid)+1);
+ assert((norm_path(t1) ^ norm_path(t2)) == (uint32_t)(1 << h));
#if ASSERT
- synch_depths[norm_path(t1)] = dag_depth(t1);
- synch_depths[norm_path(t2)] = dag_depth(t2);
+ synch_depths[norm_path(t1)] = dag_depth(t1);
+ synch_depths[norm_path(t2)] = dag_depth(t2);
#endif
}
#else
@@ -127,19 +127,27 @@ void GC_HH_decheckFork(GC_state s, uint64_t *left, uint64_t *right) {
#ifdef DETECT_ENTANGLEMENT
+
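+/* If the chunk has no decheck state yet (bogus tid), stamp it with `tid`.
+ * Used below so the chunks holding the current thread object and its stack
+ * pick up the thread's tid when it is set. */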
+void setStateIfBogus(HM_chunk chunk, decheck_tid_t tid) {
+ if ((chunk->decheckState).bits == DECHECK_BOGUS_BITS)
+ {
+ chunk->decheckState = tid;
+ }
+}
+
void GC_HH_decheckSetTid(GC_state s, uint64_t bits) {
- decheck_tid_t tid;
- tid.bits = bits;
+ decheck_tid_t tid;
+ tid.bits = bits;
+
+ GC_thread thread = getThreadCurrent(s);
+ thread->decheckState = tid;
- GC_thread thread = getThreadCurrent(s);
- thread->decheckState = tid;
+ setStateIfBogus(HM_getChunkOf((pointer)thread), tid);
+ setStateIfBogus(HM_getChunkOf((pointer)thread->stack), tid);
-// #if ASSERT
-// synch_depths[norm_path(tid)] = dag_depth(tid);
-// #endif
- decheckSetSyncDepth(thread, tree_depth(tid), dag_depth(tid));
+ decheckSetSyncDepth(thread, tree_depth(tid), dag_depth(tid));
- assert(decheckGetSyncDepth(thread, tree_depth(tid)) == synch_depths[norm_path(tid)]);
+ assert(decheckGetSyncDepth(thread, tree_depth(tid)) == synch_depths[norm_path(tid)]);
}
#else
void GC_HH_decheckSetTid(GC_state s, uint64_t bits) {
@@ -165,32 +173,32 @@ uint64_t GC_HH_decheckGetTid(GC_state s, objptr threadp) {
#ifdef DETECT_ENTANGLEMENT
void GC_HH_decheckJoin(GC_state s, uint64_t left, uint64_t right) {
- decheck_tid_t t1;
- t1.bits = left;
- decheck_tid_t t2;
- t2.bits = right;
-
- assert(tree_depth(t1) == tree_depth(t2));
- assert(tree_depth(t1) >= 1);
-
- GC_thread thread = getThreadCurrent(s);
- unsigned int td = tree_depth(t1) - 1;
- unsigned int dd = MAX(dag_depth(t1), dag_depth(t2)) + 1;
- assert(dag_depth(t1) == synch_depths[norm_path(t1)]);
- assert(dag_depth(t2) == synch_depths[norm_path(t2)]);
- decheck_tid_t tid;
- tid.internal.path = t1.internal.path | (1 << td);
- tid.internal.depth = (dd << 5) + td;
- thread->decheckState = tid;
-
- assert(tree_depth(tid) == tree_depth(t1)-1);
+ decheck_tid_t t1;
+ t1.bits = left;
+ decheck_tid_t t2;
+ t2.bits = right;
+
+ assert(tree_depth(t1) == tree_depth(t2));
+ assert(tree_depth(t1) >= 1);
+
+ GC_thread thread = getThreadCurrent(s);
+ unsigned int td = tree_depth(t1) - 1;
+ unsigned int dd = MAX(dag_depth(t1), dag_depth(t2)) + 1;
+ assert(dag_depth(t1) == synch_depths[norm_path(t1)]);
+ assert(dag_depth(t2) == synch_depths[norm_path(t2)]);
+ decheck_tid_t tid;
+ tid.internal.path = t1.internal.path | (1 << td);
+ tid.internal.depth = (dd << 5) + td;
+ thread->decheckState = tid;
+
+ assert(tree_depth(tid) == tree_depth(t1)-1);
#if ASSERT
- synch_depths[norm_path(tid)] = dd;
+ synch_depths[norm_path(tid)] = dd;
#endif
- decheckSetSyncDepth(thread, tree_depth(tid), dd);
+ decheckSetSyncDepth(thread, tree_depth(tid), dd);
- assert(decheckGetSyncDepth(thread, tree_depth(tid)) == synch_depths[norm_path(tid)]);
+ assert(decheckGetSyncDepth(thread, tree_depth(tid)) == synch_depths[norm_path(tid)]);
}
#else
void GC_HH_decheckJoin(GC_state s, uint64_t left, uint64_t right) {
@@ -262,13 +270,15 @@ int lcaHeapDepth(decheck_tid_t t1, decheck_tid_t t2)
uint32_t p1mask = (1 << tree_depth(t1)) - 1;
uint32_t p2 = norm_path(t2);
uint32_t p2mask = (1 << tree_depth(t2)) - 1;
- assert(p1 != p2);
uint32_t shared_mask = p1mask & p2mask;
uint32_t shared_upper_bit = shared_mask+1;
uint32_t x = ((p1 ^ p2) & shared_mask) | shared_upper_bit;
uint32_t lca_bit = x & -x;
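+  // x & -x isolates the lowest set bit of x (two's-complement trick), e.g.
+  // x = 0b0110100 gives lca_bit = 0b0000100; its index is the length of the
+  // shared path prefix (cf. the lcaLen assertion below).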
// uint32_t lca_mask = lca_bit-1;
int llen = bitIndex(lca_bit);
+ if (p1 == p2) {
+ return tree_depth(t1) + 1;
+ }
assert(llen == lcaLen(p1, p2));
return llen+1;
}
@@ -307,69 +317,284 @@ bool decheckIsOrdered(GC_thread thread, decheck_tid_t t1) {
}
#endif
-
#ifdef DETECT_ENTANGLEMENT
-void decheckRead(GC_state s, objptr ptr) {
- GC_thread thread = getThreadCurrent(s);
- if (thread == NULL)
- return;
- decheck_tid_t tid = thread->decheckState;
- if (tid.bits == DECHECK_BOGUS_BITS)
- return;
- if (!isObjptr(ptr))
- return;
- HM_chunk chunk = HM_getChunkOf(objptrToPointer(ptr, NULL));
- if (chunk == NULL)
- return;
- decheck_tid_t allocator = chunk->decheckState;
- if (allocator.bits == DECHECK_BOGUS_BITS)
- return;
- if (decheckIsOrdered(thread, allocator))
- return;
-
- /** If we get here, there is entanglement. Next is how to handle it. */
-
- assert(!s->controls->manageEntanglement);
- if (!s->controls->manageEntanglement) {
- printf("Entanglement detected: object at %p\n", (void *) ptr);
- printf("Allocator tree depth: %d\n", tree_depth(allocator));
- printf("Allocator path: 0x%x\n", allocator.internal.path);
- printf("Allocator dag depth: %d\n", dag_depth(allocator));
- printf("Reader tree depth: %d\n", tree_depth(tid));
- printf("Reader path: 0x%x\n", tid.internal.path);
- printf("Reader dag depth: %d\n", dag_depth(tid));
- exit(-1);
+
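+/* traverseAndCheck (debug builds only): assert that every object reachable
+ * from op is pinned with PIN_ANY, carries no forwarding pointer, and, if
+ * mutable, is present in the entanglement-suspect set. */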
+#if ASSERT
+void traverseAndCheck(
+ GC_state s,
+ __attribute__((unused)) objptr *opp,
+ objptr op,
+ __attribute__((unused)) void *rawArgs)
+{
+ GC_header header = getHeader(objptrToPointer(op, NULL));
+ pointer p = objptrToPointer (op, NULL);
+ assert (pinType(header) == PIN_ANY);
+ assert (!isFwdHeader(header));
+ if (isMutableH(s, header)) {
+ assert (ES_contains(NULL, op));
+ }
+ else {
+ struct GC_foreachObjptrClosure echeckClosure =
+ {.fun = traverseAndCheck, .env = NULL};
+ foreachObjptrInObject(s, p, &trueObjptrPredicateClosure, &echeckClosure, FALSE);
+ }
+}
+#else
+inline void traverseAndCheck(
+ __attribute__((unused)) GC_state s,
+ __attribute__((unused)) objptr *opp,
+ __attribute__((unused)) objptr op,
+ __attribute__((unused)) void *rawArgs)
+{
+ return;
+}
+#endif
+
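+/* Follow the chain of forwarding pointers to the object's current copy;
+ * "racy" presumably because a concurrent collection may be installing these
+ * forwarding pointers while we read them. */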
+static inline objptr getRacyFwdPtr(pointer p) {
+ while (isFwdHeader(getHeader(p))) {
+ p = objptrToPointer(getFwdPtr(p), NULL);
+ }
+ return pointerToObjptr(p, NULL);
+}
+
+void make_entangled(
+ GC_state s,
+ objptr *opp,
+ objptr ptr,
+ void *rawArgs)
+{
+
+ struct ManageEntangledArgs* mea = (struct ManageEntangledArgs*) rawArgs;
+
+ HM_chunk chunk = HM_getChunkOf(objptrToPointer(ptr, NULL));
+ // if (!decheckIsOrdered(mea->root, allocator)) {
+ // // while managing entanglement, we stay ordered wrt the root of the entanglement
+ // return;
+ // }
+
+ pointer p_ptr = objptrToPointer(ptr, NULL);
+ GC_header header = getRacyHeader(p_ptr);
+ assert(!isFwdHeader(header));
+ bool mutable = isMutableH(s, header);
+ bool headerChange = false, pinChange = false;
+ // unpin depth according to the caller
+ uint32_t unpinDepth = mea->unpinDepth;
+
+ objptr new_ptr;
+ if (pinType(header) != PIN_ANY || unpinDepthOfH(header) > unpinDepth)
+ {
+ bool addToRemSet = mea->firstCall;
+ if (mutable) {
+ new_ptr = pinObjectInfo(s, ptr, unpinDepth, PIN_ANY, &headerChange, &pinChange);
+ }
+ else
+ {
+ mea->firstCall = false;
+ struct GC_foreachObjptrClosure emanageClosure =
+ {.fun = make_entangled, .env = rawArgs};
+      // the unpinDepth of reachable objects may be smaller.
+ mea->unpinDepth = pinType(header) == PIN_NONE ? unpinDepth : min(unpinDepth, unpinDepthOfH(header));
+ foreachObjptrInObject(s, p_ptr, &trueObjptrPredicateClosure, &emanageClosure, FALSE);
+ new_ptr = pinObjectInfo(s, ptr, unpinDepth, PIN_ANY, &headerChange, &pinChange);
+ assert(pinType(getHeader(objptrToPointer(new_ptr, NULL))) == PIN_ANY);
+ }
+ if (pinChange && addToRemSet)
+ {
+ struct HM_remembered remElem_ = {.object = new_ptr, .from = BOGUS_OBJPTR};
+ HM_HH_rememberAtLevel(HM_getLevelHead(chunk), &(remElem_), true);
+ assert (HM_HH_getDepth(HM_getLevelHead(chunk)) != 1);
}
+ }
+ else {
+ new_ptr = getRacyFwdPtr(p_ptr);
+ }
+ mea->unpinDepth = unpinDepth;
+
+ assert(!hasFwdPtr(objptrToPointer(new_ptr, NULL)));
+ assert(isPinned(new_ptr));
+
+ if (ptr != new_ptr) {
+ // Help LGC move along--because this reader might traverse this pointer
+ // and it shouldn't see the forwarded one
+ assert(hasFwdPtr(objptrToPointer(ptr, NULL)));
+ *opp = new_ptr;
+ }
+
+ if (mutable && !ES_contains(NULL, new_ptr)) {
+ HM_HierarchicalHeap lcaHeap = HM_HH_getHeapAtDepth(s, getThreadCurrent(s), unpinDepth);
+ ES_add(s, HM_HH_getSuspects(lcaHeap), new_ptr);
+ assert(ES_contains(NULL, new_ptr));
+ }
+
+ traverseAndCheck(s, &new_ptr, new_ptr, NULL);
+
+ assert (!mutable || ES_contains(NULL, new_ptr));
+}
+
+objptr manage_entangled(
+ GC_state s,
+ objptr ptr,
+ decheck_tid_t reader)
+{
+
+ // GC_thread thread = getThreadCurrent(s);
+ // decheck_tid_t tid = thread->decheckState;
+ HM_chunk chunk = HM_getChunkOf(objptrToPointer(ptr, NULL));
+ decheck_tid_t allocator = chunk->decheckState;
+
+ if (!s->controls->manageEntanglement && false)
+ {
+ printf("Entanglement detected: object at %p\n", (void *)ptr);
+ printf("Allocator tree depth: %d\n", tree_depth(allocator));
+ printf("Allocator path: 0x%x\n", allocator.internal.path);
+ printf("Allocator dag depth: %d\n", dag_depth(allocator));
+ printf("Reader tree depth: %d\n", tree_depth(allocator));
+ printf("Reader path: 0x%x\n", allocator.internal.path);
+ printf("Reader dag depth: %d\n", dag_depth(allocator));
+ exit(-1);
+ }
+
+ uint32_t unpinDepth = lcaHeapDepth(reader, allocator);
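+  /* The unpin depth is the depth of the least-common-ancestor heap of the
+   * reader and the allocator (editorial gloss: pinning at that depth keeps
+   * the object reachable from both sides of the entanglement). */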
+ GC_header header = getHeader(objptrToPointer (ptr, NULL));
+
+
+ uint32_t current_ud = unpinDepthOfH(header);
+ enum PinType current_pt = pinType(header);
+ bool manage = isFwdHeader(header) ||
+ current_pt != PIN_ANY ||
+ current_ud > unpinDepth;
+
+ if (current_pt != PIN_NONE && current_ud == 0)
+ {
+ return ptr;
+ }
+
+ if (manage) {
+ uint32_t newUnpinDepth = current_pt == PIN_NONE ? unpinDepth : min(current_ud, unpinDepth);
+ struct ManageEntangledArgs mea = {
+ .reader = reader,
+ .root = allocator,
+ .unpinDepth = newUnpinDepth,
+ .firstCall = !(current_pt == PIN_DOWN && current_ud == 1)
+ };
+ make_entangled(s, &ptr, ptr, (void*) &mea);
+ }
+ else {
+ if (isMutableH(s, header) && !ES_contains(NULL, ptr)) {
+ HM_HierarchicalHeap lcaHeap = HM_HH_getHeapAtDepth(s, getThreadCurrent(s), unpinDepth);
+ ES_add(s, HM_HH_getSuspects(lcaHeap), ptr);
+ assert(ES_contains(NULL, ptr));
+ }
+ traverseAndCheck(s, &ptr, ptr, NULL);
+ }
+
+
+ traverseAndCheck(s, &ptr, ptr, NULL);
+ return ptr;
+ // GC_header header = getRacyHeader(objptrToPointer(ptr, NULL));
+ // bool mutable = isMutableH(s, header);
+ // bool headerChange = false, pinChange = false;
+ // objptr new_ptr = ptr;
+ // if (pinType(header) != PIN_ANY || unpinDepthOfH(header) > unpinDepth)
+ // {
+ // if (mutable)
+ // {
+ // new_ptr = pinObjectInfo(ptr, unpinDepth, PIN_ANY, &headerChange, &pinChange);
+ // if (!ES_contains(NULL, new_ptr)) {
+ // HM_HierarchicalHeap lcaHeap = HM_HH_getHeapAtDepth(s, thread, unpinDepth);
+ // ES_add(s, HM_HH_getSuspects(lcaHeap), new_ptr);
+ // }
+ // }
+ // else
+ // {
+ // struct GC_foreachObjptrClosure emanageClosure =
+ // {.fun = manage_entangled, .env = NULL};
+ // foreachObjptrInObject(s, ptr, &trueObjptrPredicateClosure, &emanageClosure, FALSE);
+ // new_ptr = pinObjectInfo(ptr, unpinDepth, PIN_ANY, &headerChange, &pinChange);
+ // }
+ // if (pinChange)
+ // {
+ // struct HM_remembered remElem_ = {.object = new_ptr, .from = BOGUS_OBJPTR};
+ // HM_HH_rememberAtLevel(HM_getLevelHeadPathCompress(chunk), &(remElem_), true);
+ // }
+ // }
+ // else
+ // {
+ // if (!mutable)
+ // {
+ // traverseAndCheck(s, &new_ptr, new_ptr, NULL);
+ // }
+ // }
+
+ // traverseAndCheck(s, &new_ptr, new_ptr, NULL);
+ // return new_ptr;
+}
+
+#else
+objptr manage_entangled(GC_state s, objptr ptr, decheck_tid_t reader) {
+ (void)s;
+ (void)ptr;
+ (void)reader;
+ return ptr;
+}
+#endif
+
+#ifdef DETECT_ENTANGLEMENT
+
+bool decheck(GC_state s, objptr ptr) {
+ if (!s->controls->manageEntanglement) {
+ return true;
+ }
+ GC_thread thread = getThreadCurrent(s);
+ if (thread == NULL)
+ return true;
+ decheck_tid_t tid = thread->decheckState;
+ if (tid.bits == DECHECK_BOGUS_BITS)
+ return true;
+ if (!isObjptr(ptr))
+ return true;
+ HM_chunk chunk = HM_getChunkOf(objptrToPointer(ptr, NULL));
+ if (chunk == NULL)
+ return true;
+ decheck_tid_t allocator = chunk->decheckState;
+ if (allocator.bits == DECHECK_BOGUS_BITS) {
+ // assert (false);
+ return true;
+ }
+ if (decheckIsOrdered(thread, allocator))
+ return true;
+
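+  /* Reaching here means the access is entangled: the allocating task is not
+   * an ancestor of (ordered before) the reading task. */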
+ return false;
#if 0
- s->cumulativeStatistics->numEntanglementsDetected++;
-
- /** set the chunk's disentangled depth. This synchronizes with GC, if there
- * is GC happening by the owner of this chunk.
- */
- int32_t newDD = lcaHeapDepth(thread->decheckState, allocator);
- assert(newDD >= 1);
- while (TRUE) {
- int32_t oldDD = atomicLoadS32(&(chunk->disentangledDepth));
-
- /** Negative means it's frozen for GC. Wait until it's unfrozen... */
- while (oldDD < 0) {
- pthread_yield();
- oldDD = atomicLoadS32(&(chunk->disentangledDepth));
- }
-
- /** And then attempt to update. */
- if (newDD >= oldDD ||
- __sync_bool_compare_and_swap(&(chunk->disentangledDepth), oldDD, newDD))
- break;
+
+ /** set the chunk's disentangled depth. This synchronizes with GC, if there
+ * is GC happening by the owner of this chunk.
+ */
+ int32_t newDD = lcaHeapDepth(thread->decheckState, allocator);
+ assert(newDD >= 1);
+ while (TRUE) {
+ int32_t oldDD = atomicLoadS32(&(chunk->disentangledDepth));
+
+ /** Negative means it's frozen for GC. Wait until it's unfrozen... */
+ while (oldDD < 0) {
+ pthread_yield();
+ oldDD = atomicLoadS32(&(chunk->disentangledDepth));
}
+
+ /** And then attempt to update. */
+ if (newDD >= oldDD ||
+ __sync_bool_compare_and_swap(&(chunk->disentangledDepth), oldDD, newDD))
+ break;
+ }
#endif
}
#else
-void decheckRead(GC_state s, objptr ptr) {
+bool decheck(GC_state s, objptr ptr)
+{
(void)s;
(void)ptr;
- return;
+ return true;
}
#endif
@@ -400,11 +625,27 @@ void GC_HH_copySyncDepthsFromThread(GC_state s, objptr victimThread, uint32_t st
*/
memcpy(to, from, DECHECK_DEPTHS_LEN * sizeof(uint32_t));
}
+
#else
-void GC_HH_copySyncDepthsFromThread(GC_state s, objptr victimThread, uint32_t stealDepth) {
+void GC_HH_copySyncDepthsFromThread(GC_state s, objptr victimThread, uint32_t stealDepth)
+{
(void)s;
(void)victimThread;
(void)stealDepth;
return;
}
#endif
+
+// returns true if the object is unpinned.
+bool disentangleObject(GC_state s, objptr op, uint32_t opDepth) {
+ if (isPinned(op) && unpinDepthOf(op) >= opDepth) {
+ bool success = tryUnpinWithDepth(op, opDepth);
+ if (success && ES_contains(NULL, op)) {
+ ES_unmark(s, op);
+ return true;
+ }
+ return false;
+ }
+ return true;
+}
+
diff --git a/runtime/gc/decheck.h b/runtime/gc/decheck.h
index 92e26a3d1..406b9c207 100644
--- a/runtime/gc/decheck.h
+++ b/runtime/gc/decheck.h
@@ -20,6 +20,14 @@ typedef union {
uint64_t bits;
} decheck_tid_t;
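+
+/* Editorial summary of how decheck.c uses these fields: 'reader' is the tid
+ * of the task performing the entangled access, 'root' the tid recorded in the
+ * chunk of the root entangled object, 'unpinDepth' the depth at which reached
+ * objects are pinned, and 'firstCall' marks the root object (the one that is
+ * added to the remembered set). */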
+struct ManageEntangledArgs
+{
+ decheck_tid_t reader;
+ decheck_tid_t root;
+ uint32_t unpinDepth;
+ bool firstCall;
+};
+
#define DECHECK_BOGUS_BITS ((uint64_t)0)
#define DECHECK_BOGUS_TID ((decheck_tid_t){ .bits = DECHECK_BOGUS_BITS })
@@ -40,10 +48,12 @@ PRIVATE bool GC_HH_decheckMaxDepth(objptr resultRef);
#if (defined (MLTON_GC_INTERNAL_FUNCS))
void decheckInit(GC_state s);
-void decheckRead(GC_state s, objptr ptr);
+bool decheck(GC_state s, objptr ptr);
bool decheckIsOrdered(GC_thread thread, decheck_tid_t t1);
int lcaHeapDepth(decheck_tid_t t1, decheck_tid_t t2);
-
+bool disentangleObject(GC_state s, objptr op, uint32_t opDepth);
+objptr manage_entangled(GC_state s, objptr ptr, decheck_tid_t reader);
+void traverseAndCheck(GC_state s, objptr *opp ,objptr op, void *rawArgs);
#endif /* (defined (MLTON_GC_INTERNAL_FUNCS)) */
#endif /* _DECHECK_H_ */
diff --git a/runtime/gc/ebr.c b/runtime/gc/ebr.c
new file mode 100644
index 000000000..ec7847371
--- /dev/null
+++ b/runtime/gc/ebr.c
@@ -0,0 +1,170 @@
+/* Copyright (C) 2021 Sam Westrick
+ * Copyright (C) 2022 Jatin Arora
+ *
+ * MLton is released under a HPND-style license.
+ * See the file MLton-LICENSE for details.
+ */
+
+#if (defined(MLTON_GC_INTERNAL_FUNCS))
+
+/** Helpers for packing/unpacking announcements. DEBRA packs epochs with a
+ * "quiescent" bit, the idea being that processors should set the bit during
+ * quiescent periods (between operations) and have it unset otherwise (i.e.
+ * during an operation). Being precise about quiescent periods in this way
+ * is helpful for reclamation, because in order to advance the epoch, all we
+ * need to know is that every processor has been in a quiescent period since
+ * the beginning of the last epoch.
+ *
+ * But note that updating the quiescent bits is only efficient if we can
+ * amortize the cost of the setting/unsetting the bit with other nearby
+ * operations. If we assumed that the typical state for each processor
+ * is quiescent and then paid for non-quiescent periods, this would
+ * be WAY too expensive. In our case, processors are USUALLY NON-QUIESCENT,
+ * due to depth queries at the write-barrier.
+ *
+ * So
+ */
+#define PACK(epoch, qbit) ((((size_t)(epoch)) << 1) | ((qbit)&1))
+#define UNPACK_EPOCH(announcement) ((announcement) >> 1)
+#define UNPACK_QBIT(announcement) ((announcement)&1)
+#define SET_Q_TRUE(announcement) ((announcement) | (size_t)1)
+#define SET_Q_FALSE(announcement) ((announcement) & (~(size_t)1))
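+
+/* Example: PACK(5, 1) == (5 << 1) | 1 == 11; UNPACK_EPOCH(11) == 5;
+ * UNPACK_QBIT(11) == 1; SET_Q_TRUE(10) == 11; SET_Q_FALSE(11) == 10. */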
+
+#define ANNOUNCEMENT_PADDING 16
+
+static inline size_t getAnnouncement(EBR_shared ebr, uint32_t pid)
+{
+ return ebr->announce[ANNOUNCEMENT_PADDING * pid];
+}
+
+static inline void setAnnouncement(EBR_shared ebr, uint32_t pid, size_t ann)
+{
+ ebr->announce[ANNOUNCEMENT_PADDING * pid] = ann;
+}
+
+void EBR_enterQuiescentState(GC_state s, EBR_shared ebr)
+{
+ uint32_t mypid = s->procNumber;
+ setAnnouncement(ebr, mypid, SET_Q_TRUE(getAnnouncement(ebr, mypid)));
+}
+
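+/* Rotate to the next of the three limbo bags and reclaim its contents.
+ * (Editorial note: a bag is only reused after two epoch advances, and an
+ * epoch only advances once every processor has either announced the current
+ * epoch or is quiescent, so nothing in the reclaimed bag can still be
+ * referenced; this is the standard epoch-based-reclamation argument.) */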
+static void rotateAndReclaim(GC_state s, EBR_shared ebr)
+{
+ uint32_t mypid = s->procNumber;
+
+ int limboIdx = (ebr->local[mypid].limboIdx + 1) % 3;
+ ebr->local[mypid].limboIdx = limboIdx;
+ HM_chunkList limboBag = &(ebr->local[mypid].limboBags[limboIdx]);
+
+ // Free all HH records in the limbo bag.
+ for (HM_chunk chunk = HM_getChunkListFirstChunk(limboBag);
+ NULL != chunk;
+ chunk = chunk->nextChunk)
+ {
+ for (pointer p = HM_getChunkStart(chunk);
+ p < HM_getChunkFrontier(chunk);
+ p += sizeof(void *))
+ {
+ ebr->freeFun(s, *(void **)p);
+ // HM_UnionFindNode hufp = *(HM_UnionFindNode *)p;
+ // assert(hufp->payload != NULL);
+ // freeFixedSize(getHHAllocator(s), hufp->payload);
+ // freeFixedSize(getUFAllocator(s), hufp);
+ }
+ }
+
+ HM_freeChunksInListWithInfo(s, limboBag, NULL, BLOCK_FOR_EBR);
+ HM_initChunkList(limboBag); // clear it out
+}
+
+EBR_shared EBR_new(GC_state s, EBR_freeRetiredObj freeFun)
+{
+ EBR_shared ebr = malloc(sizeof(struct EBR_shared));
+
+ ebr->epoch = 0;
+ ebr->announce =
+ malloc(s->numberOfProcs * ANNOUNCEMENT_PADDING * sizeof(size_t));
+ ebr->local =
+ malloc(s->numberOfProcs * sizeof(struct EBR_local));
+ ebr->freeFun = freeFun;
+
+ for (uint32_t i = 0; i < s->numberOfProcs; i++)
+ {
+ // Everyone starts by announcing epoch = 0 and is non-quiescent
+ setAnnouncement(ebr, i, PACK(0, 0));
+ ebr->local[i].limboIdx = 0;
+ ebr->local[i].checkNext = 0;
+ for (int j = 0; j < 3; j++)
+ HM_initChunkList(&(ebr->local[i].limboBags[j]));
+ }
+ return ebr;
+}
+
+void EBR_leaveQuiescentState(GC_state s, EBR_shared ebr)
+{
+ uint32_t mypid = s->procNumber;
+ uint32_t numProcs = s->numberOfProcs;
+
+ size_t globalEpoch = ebr->epoch;
+ size_t myann = getAnnouncement(ebr, mypid);
+ size_t myEpoch = UNPACK_EPOCH(myann);
+ assert(globalEpoch >= myEpoch);
+
+ if (myEpoch != globalEpoch)
+ {
+ ebr->local[mypid].checkNext = 0;
+ /** Advance into the current epoch. To do so, we need to clear the limbo
+ * bag of the epoch we're moving into.
+ */
+ rotateAndReclaim(s, ebr);
+ }
+  // TODO: factor this out into a helper that takes the number of otherann
+  // reads to perform per call.
+ uint32_t otherpid = (ebr->local[mypid].checkNext) % numProcs;
+ size_t otherann = getAnnouncement(ebr, otherpid);
+ if (UNPACK_EPOCH(otherann) == globalEpoch || UNPACK_QBIT(otherann))
+ {
+ uint32_t c = ++ebr->local[mypid].checkNext;
+ if (c >= numProcs)
+ {
+ __sync_val_compare_and_swap(&(ebr->epoch), globalEpoch, globalEpoch + 1);
+ }
+ }
+
+ setAnnouncement(ebr, mypid, PACK(globalEpoch, 0));
+}
+
+void EBR_retire(GC_state s, EBR_shared ebr, void *ptr)
+{
+ uint32_t mypid = s->procNumber;
+ int limboIdx = ebr->local[mypid].limboIdx;
+ HM_chunkList limboBag = &(ebr->local[mypid].limboBags[limboIdx]);
+ HM_chunk chunk = HM_getChunkListLastChunk(limboBag);
+
+ // fast path: bump frontier in chunk
+
+ if (NULL != chunk &&
+ HM_getChunkSizePastFrontier(chunk) >= sizeof(void *))
+ {
+ pointer p = HM_getChunkFrontier(chunk);
+ *(void **) p = ptr;
+ HM_updateChunkFrontierInList(limboBag, chunk, p + sizeof(void *));
+ return;
+ }
+
+ // slow path: allocate new chunk
+
+ chunk = HM_allocateChunkWithPurpose(
+ limboBag,
+ sizeof(void *),
+ BLOCK_FOR_EBR);
+
+ assert(NULL != chunk &&
+ HM_getChunkSizePastFrontier(chunk) >= sizeof(void *));
+
+ pointer p = HM_getChunkFrontier(chunk);
+ *(void **) p = ptr;
+ HM_updateChunkFrontierInList(limboBag, chunk, p + sizeof(void *));
+ return;
+}
+
+#endif // MLTON_GC_INTERNAL_FUNCS
diff --git a/runtime/gc/ebr.h b/runtime/gc/ebr.h
new file mode 100644
index 000000000..dbf2be156
--- /dev/null
+++ b/runtime/gc/ebr.h
@@ -0,0 +1,57 @@
+/* Copyright (C) 2021 Sam Westrick
+ * Copyright (C) 2022 Jatin Arora
+ *
+ * MLton is released under a HPND-style license.
+ * See the file MLton-LICENSE for details.
+ */
+
+/** Epoch-based reclamation (EBR) of hierarchical heap records.
+ */
+
+#ifndef EBR_H_
+#define EBR_H_
+
+#if (defined(MLTON_GC_INTERNAL_TYPES))
+
+struct EBR_local
+{
+ struct HM_chunkList limboBags[3];
+ int limboIdx;
+ uint32_t checkNext;
+} __attribute__((aligned(128)));
+
+typedef void (*EBR_freeRetiredObj) (GC_state s, void *ptr);
+
+// There is exactly one of these! Everyone shares a reference to it.
+typedef struct EBR_shared
+{
+ size_t epoch;
+
+ // announcement array, length = num procs
+ // each announcement is packed: 63 bits for epoch, 1 bit for quiescent bit
+ size_t *announce;
+
+ // processor-local data, length = num procs
+ struct EBR_local *local;
+
+ EBR_freeRetiredObj freeFun;
+} * EBR_shared;
+
+#else
+
+struct EBR_local;
+struct EBR_shared;
+typedef struct EBR_shared *EBR_shared;
+
+#endif // MLTON_GC_INTERNAL_TYPES
+
+#if (defined(MLTON_GC_INTERNAL_FUNCS))
+
+EBR_shared EBR_new(GC_state s, EBR_freeRetiredObj freeFun);
+void EBR_enterQuiescentState(GC_state s, EBR_shared ebr);
+void EBR_leaveQuiescentState(GC_state s, EBR_shared ebr);
+void EBR_retire(GC_state s, EBR_shared ebr, void *ptr);
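+
+/* Typical usage (editorial sketch; the real call sites live elsewhere in the
+ * runtime and may differ):
+ *
+ *   EBR_leaveQuiescentState(s, ebr);   // begin operating on shared state
+ *   ...                                // unlink nodes; EBR_retire(s, ebr, node)
+ *   EBR_enterQuiescentState(s, ebr);   // done; safe point for reclamation
+ *
+ * Retired objects are eventually passed to freeFun once the epoch has
+ * advanced far enough that no processor can still hold a reference. */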
+
+#endif // MLTON_GC_INTERNAL_FUNCS
+
+#endif // EBR_H_
diff --git a/runtime/gc/entangled-ebr.c b/runtime/gc/entangled-ebr.c
new file mode 100644
index 000000000..8e934762c
--- /dev/null
+++ b/runtime/gc/entangled-ebr.c
@@ -0,0 +1,29 @@
+/* Copyright (C) 2022 Jatin Arora
+ *
+ * MLton is released under a HPND-style license.
+ * See the file MLton-LICENSE for details.
+ */
+
+#if (defined (MLTON_GC_INTERNAL_FUNCS))
+
+void freeChunk(GC_state s, void *ptr) {
+ HM_freeChunkWithInfo(s, (HM_chunk)ptr, NULL, BLOCK_FOR_HEAP_CHUNK);
+}
+
+void HM_EBR_init(GC_state s) {
+ s->hmEBR = EBR_new(s, &freeChunk);
+}
+
+void HM_EBR_enterQuiescentState (GC_state s) {
+ EBR_enterQuiescentState(s, s->hmEBR);
+}
+
+void HM_EBR_leaveQuiescentState(GC_state s) {
+ EBR_leaveQuiescentState(s, s->hmEBR);
+}
+
+void HM_EBR_retire(GC_state s, HM_chunk chunk) {
+ EBR_retire(s, s->hmEBR, (void *)chunk);
+}
+
+#endif // MLTON_GC_INTERNAL_FUNCS
diff --git a/runtime/gc/entangled-ebr.h b/runtime/gc/entangled-ebr.h
new file mode 100644
index 000000000..85cdf59fe
--- /dev/null
+++ b/runtime/gc/entangled-ebr.h
@@ -0,0 +1,17 @@
+/** Epoch-based reclamation (EBR) of hierarchical heap records.
+ */
+
+#ifndef ENTANGLED_EBR_H_
+#define ENTANGLED_EBR_H_
+
+#if (defined(MLTON_GC_INTERNAL_FUNCS))
+
+
+void HM_EBR_init(GC_state s);
+void HM_EBR_enterQuiescentState(GC_state s);
+void HM_EBR_leaveQuiescentState(GC_state s);
+void HM_EBR_retire(GC_state s, HM_chunk chunk);
+
+#endif // MLTON_GC_INTERNAL_FUNCS
+
+#endif // ENTANGLED_EBR_H_
diff --git a/runtime/gc/entanglement-suspects.c b/runtime/gc/entanglement-suspects.c
index 434231b76..30220f9b8 100644
--- a/runtime/gc/entanglement-suspects.c
+++ b/runtime/gc/entanglement-suspects.c
@@ -6,7 +6,30 @@ static inline bool mark_suspect(objptr op)
{
pointer p = objptrToPointer(op, NULL);
GC_header header = __sync_fetch_and_or(getHeaderp(p), SUSPECT_MASK);
- assert (1 == (header & GC_VALID_HEADER_MASK));
+ assert(1 == (header & GC_VALID_HEADER_MASK));
+ // while (TRUE)
+ // {
+ // GC_header header = getHeader(p);
+ // GC_header newHeader = header | SUSPECT_MASK;
+ // if (header == newHeader)
+ // {
+ // /*
+ // just return because the suspect bit is already set
+ // */
+ // return false;
+ // }
+ // else
+ // {
+ // /*
+ // otherwise, install the new header with the bit set. this might fail
+ // if a concurrent thread changes the header first.
+ // */
+ // if (__sync_bool_compare_and_swap(getHeaderp(p), header, newHeader))
+ // return true;
+ // }
+ // }
+ // DIE("should be impossible to reach here");
+  /* return true if this call marked the header, false if someone else did */
return !suspicious_header(header);
}
@@ -21,14 +44,36 @@ static inline bool is_suspect(objptr op)
}
void clear_suspect(
- __attribute__((unused)) GC_state s,
- __attribute__((unused)) objptr *opp,
+ GC_state s,
+ objptr *opp,
objptr op,
- __attribute__((unused)) void *rawArgs)
+ void *rawArgs)
{
pointer p = objptrToPointer(op, NULL);
- assert(isObjptr(op) && is_suspect(op));
- __sync_fetch_and_and(getHeaderp(p), ~(SUSPECT_MASK));
+ ES_clearArgs eargs = (ES_clearArgs) rawArgs;
+
+ GC_header header = getHeader(p);
+ uint32_t unpinDepth = (header & UNPIN_DEPTH_MASK) >> UNPIN_DEPTH_SHIFT;
+
+ if (pinType(header) == PIN_ANY && unpinDepth < eargs->heapDepth) {
+ /* Not ready to be cleared */
+ HM_HierarchicalHeap unpinHeap = HM_HH_getHeapAtDepth(s, eargs->thread, unpinDepth);
+ HM_storeInChunkListWithPurpose(HM_HH_getSuspects(unpinHeap), opp, sizeof(objptr), BLOCK_FOR_SUSPECTS);
+ eargs->numMoved++;
+ return;
+ }
+
+ GC_header newHeader = header & ~(SUSPECT_MASK);
+ if (__sync_bool_compare_and_swap(getHeaderp(p), header, newHeader)) {
+ /* clearing successful */
+ eargs->numCleared++;
+ return;
+ }
+ else {
+    /* the header changed in between; try again at the next join */
+ HM_storeInChunkListWithPurpose(eargs->newList, opp, sizeof(objptr), BLOCK_FOR_SUSPECTS);
+ eargs->numFailed++;
+ }
}
bool ES_contains(__attribute__((unused)) HM_chunkList es, objptr op)
@@ -36,6 +81,14 @@ bool ES_contains(__attribute__((unused)) HM_chunkList es, objptr op)
return is_suspect(op);
}
+bool ES_mark(__attribute__((unused)) GC_state s, objptr op) {
+ return mark_suspect(op);
+}
+
+void ES_unmark(GC_state s, objptr op) {
+ clear_suspect(s, &op, op, NULL);
+}
+
void ES_add(__attribute__((unused)) GC_state s, HM_chunkList es, objptr op)
{
@@ -45,7 +98,7 @@ void ES_add(__attribute__((unused)) GC_state s, HM_chunkList es, objptr op)
return;
}
s->cumulativeStatistics->numSuspectsMarked++;
- HM_storeInchunkList(es, &op, sizeof(objptr));
+ HM_storeInChunkListWithPurpose(es, &op, sizeof(objptr), BLOCK_FOR_SUSPECTS);
}
int ES_foreachSuspect(
@@ -88,12 +141,241 @@ void ES_move(HM_chunkList list1, HM_chunkList list2) {
HM_initChunkList(list2);
}
-void ES_clear(GC_state s, HM_chunkList es)
+static size_t SUSPECTS_THRESHOLD = 10000;
+
+
+void ES_clear(GC_state s, HM_HierarchicalHeap hh)
{
+ struct timespec startTime;
+ struct timespec stopTime;
+
+ HM_chunkList es = HM_HH_getSuspects(hh);
+ uint32_t heapDepth = HM_HH_getDepth(hh);
+ struct HM_chunkList oldList = *(es);
+ HM_initChunkList(HM_HH_getSuspects(hh));
+
+ size_t numSuspects = HM_getChunkListUsedSize(&oldList) / sizeof(objptr);
+ if (numSuspects >= SUSPECTS_THRESHOLD) {
+ timespec_now(&startTime);
+ }
+
+ struct ES_clearArgs eargs = {
+ .newList = HM_HH_getSuspects(hh),
+ .heapDepth = heapDepth,
+ .thread = getThreadCurrent(s),
+ .numMoved = 0,
+ .numCleared = 0,
+ .numFailed = 0
+ };
+
struct GC_foreachObjptrClosure fObjptrClosure =
- {.fun = clear_suspect, .env = NULL};
- int numSuspects = ES_foreachSuspect(s, es, &fObjptrClosure);
- s->cumulativeStatistics->numSuspectsCleared+=numSuspects;
+ {.fun = clear_suspect, .env = &(eargs)};
+#if ASSERT
+ int ns = ES_foreachSuspect(s, &oldList, &fObjptrClosure);
+ assert(numSuspects == (size_t)ns);
+#else
+ ES_foreachSuspect(s, &oldList, &fObjptrClosure);
+#endif
+ s->cumulativeStatistics->numSuspectsCleared += numSuspects;
+
+ HM_freeChunksInListWithInfo(s, &(oldList), NULL, BLOCK_FOR_SUSPECTS);
+
+ if (eargs.numFailed > 0) {
+ LOG(LM_HIERARCHICAL_HEAP, LL_INFO,
+ "WARNING: %zu failed suspect clear(s)",
+ eargs.numFailed);
+ }
- HM_freeChunksInList(s, es);
+ if (numSuspects >= SUSPECTS_THRESHOLD) {
+ timespec_now(&stopTime);
+ timespec_sub(&stopTime, &startTime);
+ LOG(LM_HIERARCHICAL_HEAP, LL_FORCE,
+ "time to process %zu suspects (%zu cleared, %zu moved) at depth %u: %ld.%09ld",
+ numSuspects,
+ eargs.numCleared,
+ eargs.numMoved,
+ HM_HH_getDepth(hh),
+ (long)stopTime.tv_sec,
+ stopTime.tv_nsec
+ );
+ }
+}
+
+
+size_t ES_numSuspects(
+ __attribute__((unused)) GC_state s,
+ HM_HierarchicalHeap hh)
+{
+ return HM_getChunkListUsedSize(HM_HH_getSuspects(hh)) / sizeof(objptr);
+}
+
+
+ES_clearSet ES_takeClearSet(
+ __attribute__((unused)) GC_state s,
+ HM_HierarchicalHeap hh)
+{
+ struct timespec startTime;
+ timespec_now(&startTime);
+
+ size_t numSuspects = ES_numSuspects(s, hh);
+
+ ES_clearSet result = malloc(sizeof(struct ES_clearSet));
+ HM_chunkList es = HM_HH_getSuspects(hh);
+ struct HM_chunkList oldList = *es;
+ HM_initChunkList(es);
+
+ size_t numChunks = 0;
+ for (HM_chunk cursor = HM_getChunkListFirstChunk(&oldList);
+ cursor != NULL;
+ cursor = cursor->nextChunk)
+ {
+ numChunks++;
+ }
+
+ HM_chunk *chunkArray = malloc(numChunks * sizeof(HM_chunk));
+ result->chunkArray = chunkArray;
+ result->lenChunkArray = numChunks;
+ result->depth = HM_HH_getDepth(hh);
+ result->numSuspects = numSuspects;
+ result->startTime = startTime;
+
+ size_t i = 0;
+ for (HM_chunk cursor = HM_getChunkListFirstChunk(&oldList);
+ cursor != NULL;
+ cursor = cursor->nextChunk)
+ {
+ chunkArray[i] = cursor;
+ i++;
+ }
+
+ return result;
}
+
+
+size_t ES_numChunksInClearSet(
+ __attribute__((unused)) GC_state s,
+ ES_clearSet es)
+{
+ return es->lenChunkArray;
+}
+
+
+void clear_suspect_par_safe(
+ __attribute__((unused)) GC_state s,
+ objptr *opp,
+ objptr op,
+ struct HM_chunkList *output,
+ size_t lenOutput)
+{
+ pointer p = objptrToPointer(op, NULL);
+ while (TRUE) {
+ GC_header header = getHeader(p);
+ uint32_t unpinDepth = (header & UNPIN_DEPTH_MASK) >> UNPIN_DEPTH_SHIFT;
+
+ // Note: lenOutput == depth of heap whose suspects we are clearing
+ if (pinType(header) == PIN_ANY && unpinDepth < lenOutput) {
+ /* Not ready to be cleared; move it instead */
+ HM_storeInChunkListWithPurpose(&(output[unpinDepth]), opp, sizeof(objptr), BLOCK_FOR_SUSPECTS);
+ return;
+ }
+
+ GC_header newHeader = header & ~(SUSPECT_MASK);
+ if (__sync_bool_compare_and_swap(getHeaderp(p), header, newHeader)) {
+ /* clearing successful */
+ return;
+ }
+ else {
+ // oops, something changed in between
+ // Is this possible?
+ // - Seems like it could be, if there is a CGC happening
+ // simultaneously at the same level (and it's marking/unmarking objects)?
+ // - If this is the only possibility, then we should be able to just
+ // do the CAS with newHeader in a loop?
+ LOG(LM_HIERARCHICAL_HEAP, LL_INFO,
+ "WARNING: failed suspect clear; trying again");
+ }
+ }
+}
+
+
+ES_finishedClearSetGrain ES_processClearSetGrain(
+ GC_state s,
+ ES_clearSet es,
+ size_t start,
+ size_t stop)
+{
+ ES_finishedClearSetGrain result = malloc(sizeof(struct ES_finishedClearSetGrain));
+ struct HM_chunkList *output = malloc(es->depth * sizeof(struct HM_chunkList));
+ result->output = output;
+ result->lenOutput = es->depth;
+
+ // initialize output
+ for (uint32_t i = 0; i < es->depth; i++) {
+ HM_initChunkList(&(output[i]));
+ }
+
+ // process each input chunk
+ for (size_t i = start; i < stop; i++) {
+ HM_chunk chunk = es->chunkArray[i];
+ pointer p = HM_getChunkStart(chunk);
+ pointer frontier = HM_getChunkFrontier(chunk);
+ while (p < frontier)
+ {
+ objptr* opp = (objptr*)p;
+ objptr op = *opp;
+ if (isObjptr(op)) {
+ clear_suspect_par_safe(s, opp, op, output, es->depth);
+ }
+ p += sizeof(objptr);
+ }
+ }
+
+ return result;
+}
+
+
+void ES_commitFinishedClearSetGrain(
+ GC_state s,
+ GC_thread thread,
+ ES_finishedClearSetGrain es)
+{
+ for (size_t i = 0; i < es->lenOutput; i++) {
+ HM_chunkList list = &(es->output[i]);
+ if (HM_getChunkListSize(list) == 0)
+ continue;
+
+ HM_HierarchicalHeap dest = HM_HH_getHeapAtDepth(s, thread, i);
+ HM_appendChunkList(HM_HH_getSuspects(dest), list);
+ HM_initChunkList(list);
+ }
+
+ free(es->output);
+ free(es);
+}
+
+
+void ES_deleteClearSet(GC_state s, ES_clearSet es) {
+ s->cumulativeStatistics->numSuspectsCleared += es->numSuspects;
+
+ for (size_t i = 0; i < es->lenChunkArray; i++) {
+ HM_freeChunkWithInfo(s, es->chunkArray[i], NULL, BLOCK_FOR_SUSPECTS);
+ }
+
+ size_t numSuspects = es->numSuspects;
+ uint32_t depth = es->depth;
+ struct timespec startTime = es->startTime;
+ free(es->chunkArray);
+ free(es);
+
+ struct timespec stopTime;
+ timespec_now(&stopTime);
+ timespec_sub(&stopTime, &startTime);
+ LOG(LM_HIERARCHICAL_HEAP, LL_INFO,
+ "time to process %zu suspects at depth %u: %ld.%09ld",
+ numSuspects,
+ depth,
+ (long)stopTime.tv_sec,
+ stopTime.tv_nsec
+ );
+
+}
\ No newline at end of file
diff --git a/runtime/gc/entanglement-suspects.h b/runtime/gc/entanglement-suspects.h
index 235e987e6..9afabb960 100644
--- a/runtime/gc/entanglement-suspects.h
+++ b/runtime/gc/entanglement-suspects.h
@@ -6,17 +6,53 @@
#define SUSPECT_MASK ((GC_header)0x40000000)
#define SUSPECT_SHIFT 30
+typedef struct ES_clearArgs {
+ HM_chunkList newList;
+ uint32_t heapDepth;
+ GC_thread thread;
+ size_t numMoved;
+ size_t numFailed;
+ size_t numCleared;
+} * ES_clearArgs;
+
+
+typedef struct ES_clearSet {
+ HM_chunk *chunkArray; // array of chunks that need to be processed
+ size_t lenChunkArray; // len(chunkArray)
+ uint32_t depth;
+ size_t numSuspects;
+ struct timespec startTime;
+} * ES_clearSet;
+
+typedef struct ES_finishedClearSetGrain {
+ struct HM_chunkList *output; // output[d]: unsuccessful clears that were moved to depth d
+ size_t lenOutput; // len(output array)
+} * ES_finishedClearSetGrain;
+
+
+bool ES_mark(__attribute__((unused)) GC_state s, objptr op);
+void ES_unmark(GC_state s, objptr op);
+
void ES_add(GC_state s, HM_chunkList es, objptr op);
bool ES_contains(HM_chunkList es, objptr op);
-HM_chunkList ES_append (GC_state s, HM_chunkList es1, HM_chunkList es2);
+HM_chunkList ES_append(GC_state s, HM_chunkList es1, HM_chunkList es2);
+
+void ES_clear(GC_state s, HM_HierarchicalHeap hh);
-void ES_clear(GC_state s, HM_chunkList es);
+// These functions allow us to clear a suspect set in parallel, by integrating
+// with the scheduler. The idea is: ES_takeClearSet detaches the suspect set of
+// a heap, ES_processClearSetGrain processes disjoint chunk ranges (possibly in
+// parallel), ES_commitFinishedClearSetGrain merges each grain's leftover
+// suspects back into the heaps at the appropriate depths, and
+// ES_deleteClearSet frees the detached chunks and records statistics.
+size_t ES_numSuspects(GC_state s, HM_HierarchicalHeap hh);
+ES_clearSet ES_takeClearSet(GC_state s, HM_HierarchicalHeap hh);
+size_t ES_numChunksInClearSet(GC_state s, ES_clearSet es);
+ES_finishedClearSetGrain ES_processClearSetGrain(GC_state s, ES_clearSet es, size_t start, size_t stop);
+void ES_commitFinishedClearSetGrain(GC_state s, GC_thread thread, ES_finishedClearSetGrain es);
+void ES_deleteClearSet(GC_state s, ES_clearSet es);
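+
+/* Editorial sketch of the intended call order (the parallel driver is
+ * hypothetical; only the ES_* calls are real):
+ *
+ *   ES_clearSet cs = ES_takeClearSet(s, hh);
+ *   size_t n = ES_numChunksInClearSet(s, cs);
+ *   // for disjoint grains [start, stop) covering 0..n, possibly in parallel:
+ *   ES_finishedClearSetGrain g = ES_processClearSetGrain(s, cs, start, stop);
+ *   ES_commitFinishedClearSetGrain(s, thread, g);
+ *   // after all grains have been committed:
+ *   ES_deleteClearSet(s, cs);
+ */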
void ES_move(HM_chunkList list1, HM_chunkList list2);
-int ES_foreachSuspect(GC_state s, HM_chunkList storage, struct GC_foreachObjptrClosure* fObjptrClosure);
+int ES_foreachSuspect(GC_state s, HM_chunkList storage, struct GC_foreachObjptrClosure * fObjptrClosure);
#endif
#endif
\ No newline at end of file
diff --git a/runtime/gc/fixed-size-allocator.c b/runtime/gc/fixed-size-allocator.c
index 9eec57607..8a6757d08 100644
--- a/runtime/gc/fixed-size-allocator.c
+++ b/runtime/gc/fixed-size-allocator.c
@@ -6,7 +6,11 @@
#if (defined (MLTON_GC_INTERNAL_FUNCS))
-void initFixedSizeAllocator(FixedSizeAllocator fsa, size_t fixedSize) {
+void initFixedSizeAllocator(
+ FixedSizeAllocator fsa,
+ size_t fixedSize,
+ enum BlockPurpose purpose)
+{
size_t minSize = sizeof(struct FixedSizeElement);
fsa->fixedSize = align(fixedSize < minSize ? minSize : fixedSize, 8);
HM_initChunkList(&(fsa->buffer));
@@ -16,6 +20,7 @@ void initFixedSizeAllocator(FixedSizeAllocator fsa, size_t fixedSize) {
fsa->numAllocated = 0;
fsa->numLocalFreed = 0;
fsa->numSharedFreed = 0;
+ fsa->purpose = purpose;
return;
}
@@ -86,7 +91,7 @@ void* allocateFixedSize(FixedSizeAllocator fsa) {
* to their original buffer, by looking up the chunk header.
*/
- chunk = HM_allocateChunk(buffer, fsa->fixedSize + sizeof(void*));
+ chunk = HM_allocateChunkWithPurpose(buffer, fsa->fixedSize + sizeof(void*), fsa->purpose);
pointer gap = HM_shiftChunkStart(chunk, sizeof(void*));
*(FixedSizeAllocator *)gap = fsa;
diff --git a/runtime/gc/fixed-size-allocator.h b/runtime/gc/fixed-size-allocator.h
index 95cad510d..30061c696 100644
--- a/runtime/gc/fixed-size-allocator.h
+++ b/runtime/gc/fixed-size-allocator.h
@@ -29,6 +29,7 @@ typedef struct FixedSizeAllocator {
size_t numAllocated;
size_t numLocalFreed;
size_t numSharedFreed;
+ enum BlockPurpose purpose;
/** A bit of a hack. I just want quick access to pages to store elements.
* I'll reuse the frontier mechanism inherent to chunks to remember which
@@ -66,7 +67,10 @@ typedef struct FixedSizeAllocator *FixedSizeAllocator;
/** Initialize [fsa] to be able to allocate objects of size [fixedSize].
* You should never re-initialize an allocator.
*/
-void initFixedSizeAllocator(FixedSizeAllocator fsa, size_t fixedSize);
+void initFixedSizeAllocator(
+ FixedSizeAllocator fsa,
+ size_t fixedSize,
+ enum BlockPurpose purpose);
/** Allocate an object of the size specified when the allocator was initialized.
diff --git a/runtime/gc/foreach.c b/runtime/gc/foreach.c
index 8c16ab249..da5812e41 100644
--- a/runtime/gc/foreach.c
+++ b/runtime/gc/foreach.c
@@ -125,7 +125,7 @@ pointer foreachObjptrInObject (GC_state s, pointer p,
bool skip = !pred->fun(s, p, pred->env);
- header = getHeader (p);
+ header = getRacyHeader (p);
splitHeader(s, header, &tag, NULL, &bytesNonObjptrs, &numObjptrs);
if (DEBUG_DETAILED)
fprintf (stderr,
diff --git a/runtime/gc/forward.c b/runtime/gc/forward.c
index 664d94ec7..f94777f5a 100644
--- a/runtime/gc/forward.c
+++ b/runtime/gc/forward.c
@@ -25,10 +25,14 @@ objptr getFwdPtr (pointer p) {
return *(getFwdPtrp(p));
}
+bool isFwdHeader (GC_header h) {
+ return (not (GC_VALID_HEADER_MASK & h));
+}
+
/* hasFwdPtr (p)
*
* Returns true if the object pointed to by p has a valid forwarding pointer.
*/
bool hasFwdPtr (pointer p) {
- return (not (GC_VALID_HEADER_MASK & getHeader(p)));
+ return isFwdHeader (getHeader(p));
}
diff --git a/runtime/gc/forward.h b/runtime/gc/forward.h
index 154381d81..364b49cef 100644
--- a/runtime/gc/forward.h
+++ b/runtime/gc/forward.h
@@ -15,5 +15,6 @@
static inline objptr* getFwdPtrp (pointer p);
static inline objptr getFwdPtr (pointer p);
static inline bool hasFwdPtr (pointer p);
+static inline bool isFwdHeader (GC_header h);
#endif /* (defined (MLTON_GC_INTERNAL_FUNCS)) */
diff --git a/runtime/gc/garbage-collection.c b/runtime/gc/garbage-collection.c
index 5168cb3e7..ff92abe05 100644
--- a/runtime/gc/garbage-collection.c
+++ b/runtime/gc/garbage-collection.c
@@ -53,7 +53,11 @@ void growStackCurrent(GC_state s) {
/* in this case, the new stack needs more space, so allocate a new chunk,
* copy the stack, and throw away the old chunk. */
- HM_chunk newChunk = HM_allocateChunk(HM_HH_getChunkList(newhh), stackSize);
+ HM_chunk newChunk = HM_allocateChunkWithPurpose(
+ HM_HH_getChunkList(newhh),
+ stackSize,
+ BLOCK_FOR_HEAP_CHUNK);
+
if (NULL == newChunk) {
DIE("Ran out of space to grow stack!");
}
@@ -95,8 +99,15 @@ void GC_collect (GC_state s, size_t bytesRequested, bool force) {
getThreadCurrent(s)->exnStack = s->exnStack;
HM_HH_updateValues(getThreadCurrent(s), s->frontier);
beginAtomic(s);
+ // ebr for hh nodes
HH_EBR_leaveQuiescentState(s);
+ // ebr for chunks
+ HM_EBR_leaveQuiescentState(s);
+ // HM_EBR_enterQuiescentState(s);
+
+ maybeSample(s, s->blockUsageSampler);
+
// HM_HierarchicalHeap h = getThreadCurrent(s)->hierarchicalHeap;
// while (h->nextAncestor != NULL) h = h->nextAncestor;
// if (HM_HH_getDepth(h) == 0 && HM_getChunkListSize(HM_HH_getChunkList(h)) > 8192) {
diff --git a/runtime/gc/gc_state.c b/runtime/gc/gc_state.c
index 3b7008442..c821e9741 100644
--- a/runtime/gc/gc_state.c
+++ b/runtime/gc/gc_state.c
@@ -98,6 +98,22 @@ uintmax_t GC_getCumulativeStatisticsLocalBytesReclaimedOfProc(GC_state s, uint32
return s->procStates[proc].cumulativeStatistics->bytesReclaimedByLocal;
}
+uintmax_t GC_bytesInScopeForLocal(GC_state s) {
+ uintmax_t retVal = 0;
+ for (size_t i = 0; i < s->numberOfProcs; i++) {
+ retVal += s->procStates[i].cumulativeStatistics->bytesInScopeForLocal;
+ }
+ return retVal;
+}
+
+uintmax_t GC_bytesInScopeForCC(GC_state s) {
+ uintmax_t retVal = 0;
+ for (size_t i = 0; i < s->numberOfProcs; i++) {
+ retVal += s->procStates[i].cumulativeStatistics->bytesInScopeForCC;
+ }
+ return retVal;
+}
+
uintmax_t GC_getCumulativeStatisticsBytesAllocated (GC_state s) {
/* return sum across all processors */
size_t retVal = 0;
@@ -165,30 +181,17 @@ uintmax_t GC_getCumulativeStatisticsNumLocalGCsOfProc(GC_state s, uint32_t proc)
return s->procStates[proc].cumulativeStatistics->numHHLocalGCs;
}
-uintmax_t GC_getNumRootCCsOfProc(GC_state s, uint32_t proc) {
- return s->procStates[proc].cumulativeStatistics->numRootCCs;
+uintmax_t GC_getNumCCsOfProc(GC_state s, uint32_t proc) {
+ return s->procStates[proc].cumulativeStatistics->numCCs;
}
-uintmax_t GC_getNumInternalCCsOfProc(GC_state s, uint32_t proc) {
- return s->procStates[proc].cumulativeStatistics->numInternalCCs;
-}
-
-uintmax_t GC_getRootCCMillisecondsOfProc(GC_state s, uint32_t proc) {
- struct timespec *t = &(s->procStates[proc].cumulativeStatistics->timeRootCC);
- return (uintmax_t)t->tv_sec * 1000 + (uintmax_t)t->tv_nsec / 1000000;
-}
-
-uintmax_t GC_getInternalCCMillisecondsOfProc(GC_state s, uint32_t proc) {
- struct timespec *t = &(s->procStates[proc].cumulativeStatistics->timeInternalCC);
+uintmax_t GC_getCCMillisecondsOfProc(GC_state s, uint32_t proc) {
+ struct timespec *t = &(s->procStates[proc].cumulativeStatistics->timeCC);
return (uintmax_t)t->tv_sec * 1000 + (uintmax_t)t->tv_nsec / 1000000;
}
-uintmax_t GC_getRootCCBytesReclaimedOfProc(GC_state s, uint32_t proc) {
- return s->procStates[proc].cumulativeStatistics->bytesReclaimedByRootCC;
-}
-
-uintmax_t GC_getInternalCCBytesReclaimedOfProc(GC_state s, uint32_t proc) {
- return s->procStates[proc].cumulativeStatistics->bytesReclaimedByInternalCC;
+uintmax_t GC_getCCBytesReclaimedOfProc(GC_state s, uint32_t proc) {
+ return s->procStates[proc].cumulativeStatistics->bytesReclaimedByCC;
}
uintmax_t GC_getLocalGCMillisecondsOfProc(GC_state s, uint32_t proc) {
@@ -209,6 +212,14 @@ uintmax_t GC_numDisentanglementChecks(GC_state s) {
return count;
}
+uintmax_t GC_numEntanglements(GC_state s) {
+ uintmax_t count = 0;
+ for (uint32_t p = 0; p < s->numberOfProcs; p++) {
+ count += s->procStates[p].cumulativeStatistics->numEntanglements;
+ }
+ return count;
+}
+
uintmax_t GC_numChecksSkipped(GC_state s)
{
uintmax_t count = 0;
@@ -239,10 +250,56 @@ uintmax_t GC_numSuspectsCleared(GC_state s)
return count;
}
-uintmax_t GC_numEntanglementsDetected(GC_state s) {
+uintmax_t GC_bytesPinnedEntangled(GC_state s)
+{
uintmax_t count = 0;
+ for (uint32_t p = 0; p < s->numberOfProcs; p++)
+ {
+ count += s->procStates[p].cumulativeStatistics->bytesPinnedEntangled;
+ }
+ return count;
+}
+
+uintmax_t GC_bytesPinnedEntangledWatermark(GC_state s)
+{
+ uintmax_t mark = 0;
+ for (uint32_t p = 0; p < s->numberOfProcs; p++)
+ {
+ mark = max(mark,
+ s->procStates[p].cumulativeStatistics->bytesPinnedEntangledWatermark);
+ }
+ return mark;
+}
+
+// must only be called immediately after join at root depth
+void GC_updateBytesPinnedEntangledWatermark(GC_state s)
+{
+ uintmax_t total = 0;
+ for (uint32_t p = 0; p < s->numberOfProcs; p++)
+ {
+ uintmax_t *currp =
+ &(s->procStates[p].cumulativeStatistics->currentPhaseBytesPinnedEntangled);
+ uintmax_t curr = __atomic_load_n(currp, __ATOMIC_SEQ_CST);
+ __atomic_store_n(currp, 0, __ATOMIC_SEQ_CST);
+ total += curr;
+ }
+
+ // if (total > 0) {
+ // LOG(LM_HIERARCHICAL_HEAP, LL_FORCE, "hello %zu", total);
+ // }
+
+ s->cumulativeStatistics->bytesPinnedEntangledWatermark =
+ max(
+ s->cumulativeStatistics->bytesPinnedEntangledWatermark,
+ total
+ );
+}
+
+float GC_approxRaceFactor(GC_state s)
+{
+ float count = 0;
for (uint32_t p = 0; p < s->numberOfProcs; p++) {
- count += s->procStates[p].cumulativeStatistics->numEntanglementsDetected;
+ count = max(count, s->procStates[p].cumulativeStatistics->approxRaceFactor);
}
return count;
}
diff --git a/runtime/gc/gc_state.h b/runtime/gc/gc_state.h
index 68bd4e31b..ff8246d7f 100644
--- a/runtime/gc/gc_state.h
+++ b/runtime/gc/gc_state.h
@@ -32,6 +32,7 @@ struct GC_state {
volatile uint32_t atomicState;
struct BlockAllocator *blockAllocatorGlobal;
struct BlockAllocator *blockAllocatorLocal;
+ struct Sampler *blockUsageSampler;
objptr callFromCHandlerThread; /* Handler for exported C calls (in heap). */
pointer callFromCOpArgsResPtr; /* Pass op, args, and res from exported C call */
struct GC_controls *controls;
@@ -51,7 +52,8 @@ struct GC_state {
uint32_t globalsLength;
struct FixedSizeAllocator hhAllocator;
struct FixedSizeAllocator hhUnionFindAllocator;
- struct HH_EBR_shared * hhEBR;
+ struct EBR_shared * hhEBR;
+ struct EBR_shared * hmEBR;
struct GC_lastMajorStatistics *lastMajorStatistics;
pointer limitPlusSlop; /* limit + GC_HEAP_LIMIT_SLOP */
int (*loadGlobals)(FILE *f); /* loads the globals from the file. */
@@ -131,17 +133,20 @@ PRIVATE uintmax_t GC_getPromoMillisecondsOfProc(GC_state s, uint32_t proc);
PRIVATE uintmax_t GC_getCumulativeStatisticsNumLocalGCsOfProc(GC_state s, uint32_t proc);
-PRIVATE uintmax_t GC_getNumRootCCsOfProc(GC_state s, uint32_t proc);
-PRIVATE uintmax_t GC_getNumInternalCCsOfProc(GC_state s, uint32_t proc);
-PRIVATE uintmax_t GC_getRootCCMillisecondsOfProc(GC_state s, uint32_t proc);
-PRIVATE uintmax_t GC_getInternalCCMillisecondsOfProc(GC_state s, uint32_t proc);
-PRIVATE uintmax_t GC_getRootCCBytesReclaimedOfProc(GC_state s, uint32_t proc);
-PRIVATE uintmax_t GC_getInternalCCBytesReclaimedOfProc(GC_state s, uint32_t proc);
+PRIVATE uintmax_t GC_getNumCCsOfProc(GC_state s, uint32_t proc);
+PRIVATE uintmax_t GC_getCCMillisecondsOfProc(GC_state s, uint32_t proc);
+PRIVATE uintmax_t GC_getCCBytesReclaimedOfProc(GC_state s, uint32_t proc);
+PRIVATE uintmax_t GC_bytesInScopeForLocal(GC_state s);
+PRIVATE uintmax_t GC_bytesInScopeForCC(GC_state s);
PRIVATE uintmax_t GC_numDisentanglementChecks(GC_state s);
-PRIVATE uintmax_t GC_numEntanglementsDetected(GC_state s);
+PRIVATE uintmax_t GC_numEntanglements(GC_state s);
+PRIVATE float GC_approxRaceFactor(GC_state s);
PRIVATE uintmax_t GC_numChecksSkipped(GC_state s);
PRIVATE uintmax_t GC_numSuspectsMarked(GC_state s);
PRIVATE uintmax_t GC_numSuspectsCleared(GC_state s);
+PRIVATE uintmax_t GC_bytesPinnedEntangled(GC_state s);
+PRIVATE uintmax_t GC_bytesPinnedEntangledWatermark(GC_state s);
+PRIVATE void GC_updateBytesPinnedEntangledWatermark(GC_state s);
PRIVATE uint32_t GC_getControlMaxCCDepth(GC_state s);
diff --git a/runtime/gc/hierarchical-heap-collection.c b/runtime/gc/hierarchical-heap-collection.c
index dc6a7191f..444f8744f 100644
--- a/runtime/gc/hierarchical-heap-collection.c
+++ b/runtime/gc/hierarchical-heap-collection.c
@@ -40,21 +40,38 @@ void unfreezeDisentangledDepthAfter(
#endif
void tryUnpinOrKeepPinned(
- GC_state s,
- HM_remembered remElem,
- void* rawArgs);
+ GC_state s,
+ HM_remembered remElem,
+ void *rawArgs);
+
+void LGC_markAndScan(GC_state s, HM_remembered remElem, void *rawArgs);
+void unmark(GC_state s, objptr *opp, objptr op, void *rawArgs);
void copySuspect(GC_state s, objptr *opp, objptr op, void *rawArghh);
-void forwardObjptrsOfRemembered(
- GC_state s,
+void forwardFromObjsOfRemembered(
+ GC_state s,
+ HM_remembered remElem,
+ void *rawArgs);
+
+void unmarkWrapper(
+ __attribute__((unused)) GC_state s,
HM_remembered remElem,
- void* rawArgs);
+ __attribute__((unused)) void *rawArgs);
+void addEntangledToRemSet(GC_state s, objptr op, uint32_t opDepth, struct ForwardHHObjptrArgs *args);
+static inline HM_HierarchicalHeap toSpaceHH (GC_state s, struct ForwardHHObjptrArgs *args, uint32_t depth) {
+ if (args->toSpace[depth] == NULL)
+ {
+ /* Level does not exist, so create it */
+ args->toSpace[depth] = HM_HH_new(s, depth);
+ }
+ return args->toSpace[depth];
+}
// void scavengeChunkOfPinnedObject(GC_state s, objptr op, void* rawArgs);
#if ASSERT
-void checkRememberedEntry(GC_state s, HM_remembered remElem, void* args);
+void checkRememberedEntry(GC_state s, HM_remembered remElem, void *args);
bool hhContainsChunk(HM_HierarchicalHeap hh, HM_chunk theChunk);
#endif
@@ -70,7 +87,8 @@ bool hhContainsChunk(HM_HierarchicalHeap hh, HM_chunk theChunk);
*
* @return the tag of the object
*/
-GC_objectTypeTag computeObjectCopyParameters(GC_state s, pointer p,
+GC_objectTypeTag computeObjectCopyParameters(GC_state s, GC_header header,
+ pointer p,
size_t *objectSize,
size_t *copySize,
size_t *metaDataSize);
@@ -80,6 +98,8 @@ pointer copyObject(pointer p,
size_t copySize,
HM_HierarchicalHeap tgtHeap);
+void delLastObj(objptr op, size_t objectSize, HM_HierarchicalHeap tgtHeap);
+
/**
* ObjptrPredicateFunction for skipping stacks and threads in the hierarchical
* heap.
@@ -87,68 +107,69 @@ pointer copyObject(pointer p,
* @note This function takes as additional arguments the
* struct SSATOPredicateArgs
*/
-struct SSATOPredicateArgs {
+struct SSATOPredicateArgs
+{
pointer expectedStackPointer;
pointer expectedThreadPointer;
};
bool skipStackAndThreadObjptrPredicate(GC_state s,
pointer p,
- void* rawArgs);
+ void *rawArgs);
/************************/
/* Function Definitions */
/************************/
-#if (defined (MLTON_GC_INTERNAL_BASIS))
+#if (defined(MLTON_GC_INTERNAL_BASIS))
#endif /* MLTON_GC_INTERNAL_BASIS */
-#if (defined (MLTON_GC_INTERNAL_FUNCS))
+#if (defined(MLTON_GC_INTERNAL_FUNCS))
-enum LGC_freedChunkType {
+enum LGC_freedChunkType
+{
LGC_FREED_REMSET_CHUNK,
LGC_FREED_STACK_CHUNK,
LGC_FREED_NORMAL_CHUNK,
LGC_FREED_SUSPECT_CHUNK
};
-static const char* LGC_freedChunkTypeToString[] = {
- "LGC_FREED_REMSET_CHUNK",
- "LGC_FREED_STACK_CHUNK",
- "LGC_FREED_NORMAL_CHUNK",
- "LGC_FREED_SUSPECT_CHUNK"
-};
+static const char *LGC_freedChunkTypeToString[] = {
+ "LGC_FREED_REMSET_CHUNK",
+ "LGC_FREED_STACK_CHUNK",
+ "LGC_FREED_NORMAL_CHUNK",
+ "LGC_FREED_SUSPECT_CHUNK"};
-struct LGC_chunkInfo {
+struct LGC_chunkInfo
+{
uint32_t depth;
int32_t procNum;
uintmax_t collectionNumber;
enum LGC_freedChunkType freedType;
};
-
void LGC_writeFreeChunkInfo(
- __attribute__((unused)) GC_state s,
- char* infoBuffer,
- size_t bufferLen,
- void* env)
+ __attribute__((unused)) GC_state s,
+ char *infoBuffer,
+ size_t bufferLen,
+ void *env)
{
struct LGC_chunkInfo *info = env;
snprintf(infoBuffer, bufferLen,
- "freed %s at depth %u by LGC %d:%zu",
- LGC_freedChunkTypeToString[info->freedType],
- info->depth,
- info->procNum,
- info->collectionNumber);
+ "freed %s at depth %u by LGC %d:%zu",
+ LGC_freedChunkTypeToString[info->freedType],
+ info->depth,
+ info->procNum,
+ info->collectionNumber);
}
-
-uint32_t minDepthWithoutCC(GC_thread thread) {
+uint32_t minDepthWithoutCC(GC_thread thread)
+{
assert(thread != NULL);
assert(thread->hierarchicalHeap != NULL);
HM_HierarchicalHeap cursor = thread->hierarchicalHeap;
if (cursor->subHeapForCC != NULL)
- return thread->currentDepth+1;
+ return thread->currentDepth + 1;
while (cursor->nextAncestor != NULL &&
cursor->nextAncestor->subHeapForCC == NULL)
@@ -166,17 +187,19 @@ uint32_t minDepthWithoutCC(GC_thread thread) {
return HM_HH_getDepth(cursor);
}
-void HM_HHC_collectLocal(uint32_t desiredScope) {
+void HM_HHC_collectLocal(uint32_t desiredScope)
+{
GC_state s = pthread_getspecific(gcstate_key);
GC_thread thread = getThreadCurrent(s);
- struct HM_HierarchicalHeap* hh = thread->hierarchicalHeap;
+ struct HM_HierarchicalHeap *hh = thread->hierarchicalHeap;
struct rusage ru_start;
struct timespec startTime;
struct timespec stopTime;
uint64_t oldObjectCopied;
- if (NONE == s->controls->collectionType) {
+ if (NONE == s->controls->collectionType)
+ {
/* collection disabled */
return;
}
@@ -188,62 +211,65 @@ void HM_HHC_collectLocal(uint32_t desiredScope) {
// return;
// }
- if (s->wsQueueTop == BOGUS_OBJPTR || s->wsQueueBot == BOGUS_OBJPTR) {
+ if (s->wsQueueTop == BOGUS_OBJPTR || s->wsQueueBot == BOGUS_OBJPTR)
+ {
LOG(LM_HH_COLLECTION, LL_DEBUG, "Skipping collection, deque not registered yet");
return;
}
- uint64_t topval = *(uint64_t*)objptrToPointer(s->wsQueueTop, NULL);
+ uint64_t topval = *(uint64_t *)objptrToPointer(s->wsQueueTop, NULL);
uint32_t potentialLocalScope = UNPACK_IDX(topval);
uint32_t originalLocalScope = pollCurrentLocalScope(s);
- if (thread->currentDepth != originalLocalScope) {
+ if (thread->currentDepth != originalLocalScope)
+ {
LOG(LM_HH_COLLECTION, LL_DEBUG,
- "Skipping collection:\n"
- " currentDepth %u\n"
- " originalLocalScope %u\n"
- " potentialLocalScope %u\n",
- thread->currentDepth,
- originalLocalScope,
- potentialLocalScope);
+ "Skipping collection:\n"
+ " currentDepth %u\n"
+ " originalLocalScope %u\n"
+ " potentialLocalScope %u\n",
+ thread->currentDepth,
+ originalLocalScope,
+ potentialLocalScope);
return;
}
/** Compute the min depth for local collection. We claim as many levels
- * as we can without interfering with CC, but only so far as desired.
- *
- * Note that we could permit local collection at the same level as a
- * registered (but not yet stolen) CC, as long as we update the rootsets
- * stored for the CC. But this is tricky. Much simpler to just avoid CC'ed
- * levels entirely.
- */
+ * as we can without interfering with CC, but only so far as desired.
+ *
+ * Note that we could permit local collection at the same level as a
+ * registered (but not yet stolen) CC, as long as we update the rootsets
+ * stored for the CC. But this is tricky. Much simpler to just avoid CC'ed
+ * levels entirely.
+ */
uint32_t minNoCC = minDepthWithoutCC(thread);
uint32_t minOkay = desiredScope;
minOkay = max(minOkay, thread->minLocalCollectionDepth);
minOkay = max(minOkay, minNoCC);
uint32_t minDepth = originalLocalScope;
- while (minDepth > minOkay && tryClaimLocalScope(s)) {
+ while (minDepth > minOkay && tryClaimLocalScope(s))
+ {
minDepth--;
assert(minDepth == pollCurrentLocalScope(s));
}
assert(minDepth == pollCurrentLocalScope(s));
- if ( minDepth == 0 ||
- minOkay > minDepth ||
- minDepth > thread->currentDepth )
+ if (minDepth == 0 ||
+ minOkay > minDepth ||
+ minDepth > thread->currentDepth)
{
LOG(LM_HH_COLLECTION, LL_DEBUG,
- "Skipping collection:\n"
- " minDepth %u\n"
- " currentDepth %u\n"
- " minNoCC %u\n"
- " desiredScope %u\n"
- " potentialLocalScope %u\n",
- minDepth,
- thread->currentDepth,
- minNoCC,
- desiredScope,
- potentialLocalScope);
+ "Skipping collection:\n"
+ " minDepth %u\n"
+ " currentDepth %u\n"
+ " minNoCC %u\n"
+ " desiredScope %u\n"
+ " potentialLocalScope %u\n",
+ minDepth,
+ thread->currentDepth,
+ minNoCC,
+ desiredScope,
+ potentialLocalScope);
releaseLocalScope(s, originalLocalScope);
return;
@@ -260,33 +286,38 @@ void HM_HHC_collectLocal(uint32_t desiredScope) {
s->cumulativeStatistics->numHHLocalGCs++;
/* used needs to be set because the mutator has changed s->stackTop. */
- getStackCurrent(s)->used = sizeofGCStateCurrentStackUsed (s);
+ getStackCurrent(s)->used = sizeofGCStateCurrentStackUsed(s);
getThreadCurrent(s)->exnStack = s->exnStack;
assertInvariants(thread);
- if (SUPERLOCAL == s->controls->collectionType) {
+ if (SUPERLOCAL == s->controls->collectionType)
+ {
minDepth = maxDepth;
}
/* copy roots */
struct ForwardHHObjptrArgs forwardHHObjptrArgs = {
- .hh = hh,
- .minDepth = minDepth,
- .maxDepth = maxDepth,
- .toDepth = HM_HH_INVALID_DEPTH,
- .fromSpace = NULL,
- .toSpace = NULL,
- .pinned = NULL,
- .containingObject = BOGUS_OBJPTR,
- .bytesCopied = 0,
- .objectsCopied = 0,
- .stacksCopied = 0,
- .bytesMoved = 0,
- .objectsMoved = 0
- };
+ .hh = hh,
+ .minDepth = minDepth,
+ .maxDepth = maxDepth,
+ .toDepth = HM_HH_INVALID_DEPTH,
+ .fromSpace = NULL,
+ .toSpace = NULL,
+ .toSpaceStart = NULL,
+ .toSpaceStartChunk = NULL,
+ .pinned = NULL,
+ .containingObject = BOGUS_OBJPTR,
+ .bytesCopied = 0,
+ .entangledBytes = 0,
+ .objectsCopied = 0,
+ .stacksCopied = 0,
+ .bytesMoved = 0,
+ .objectsMoved = 0,
+ .concurrent = false};
+ CC_workList_init(s, &(forwardHHObjptrArgs.worklist));
struct GC_foreachObjptrClosure forwardHHObjptrClosure =
- {.fun = forwardHHObjptr, .env = &forwardHHObjptrArgs};
+ {.fun = forwardHHObjptr, .env = &forwardHHObjptrArgs};
LOG(LM_HH_COLLECTION, LL_INFO,
"collecting hh %p (L: %u):\n"
@@ -298,7 +329,7 @@ void HM_HHC_collectLocal(uint32_t desiredScope) {
" potential local scope is %u -> %u\n"
" collection scope is %u -> %u\n",
// " lchs %"PRIu64" lcs %"PRIu64,
- ((void*)(hh)),
+ ((void *)(hh)),
thread->currentDepth,
s->procNumber,
s->cumulativeStatistics->numHHLocalGCs,
@@ -311,26 +342,38 @@ void HM_HHC_collectLocal(uint32_t desiredScope) {
forwardHHObjptrArgs.minDepth,
forwardHHObjptrArgs.maxDepth);
- struct HM_chunkList pinned[maxDepth+1];
+ struct HM_chunkList pinned[maxDepth + 1];
forwardHHObjptrArgs.pinned = &(pinned[0]);
- for (uint32_t i = 0; i <= maxDepth; i++) HM_initChunkList(&(pinned[i]));
+ for (uint32_t i = 0; i <= maxDepth; i++)
+ HM_initChunkList(&(pinned[i]));
- HM_HierarchicalHeap toSpace[maxDepth+1];
+ HM_HierarchicalHeap toSpace[maxDepth + 1];
forwardHHObjptrArgs.toSpace = &(toSpace[0]);
- for (uint32_t i = 0; i <= maxDepth; i++) toSpace[i] = NULL;
+ pointer toSpaceStart[maxDepth + 1];
+ forwardHHObjptrArgs.toSpaceStart = &(toSpaceStart[0]);
+ HM_chunk toSpaceStartChunk[maxDepth + 1];
+ forwardHHObjptrArgs.toSpaceStartChunk = &(toSpaceStartChunk[0]);
+ for (uint32_t i = 0; i <= maxDepth; i++)
+ {
+ toSpace[i] = NULL;
+ toSpaceStart[i] = NULL;
+ toSpaceStartChunk[i] = NULL;
+ }
- HM_HierarchicalHeap fromSpace[maxDepth+1];
+ HM_HierarchicalHeap fromSpace[maxDepth + 1];
forwardHHObjptrArgs.fromSpace = &(fromSpace[0]);
- for (uint32_t i = 0; i <= maxDepth; i++) fromSpace[i] = NULL;
+ for (uint32_t i = 0; i <= maxDepth; i++)
+ fromSpace[i] = NULL;
for (HM_HierarchicalHeap cursor = hh;
NULL != cursor;
- cursor = cursor->nextAncestor) {
+ cursor = cursor->nextAncestor)
+ {
fromSpace[HM_HH_getDepth(cursor)] = cursor;
}
/* =====================================================================
* logging */
- size_t sizesBefore[maxDepth+1];
+ size_t sizesBefore[maxDepth + 1];
for (uint32_t i = 0; i <= maxDepth; i++)
sizesBefore[i] = 0;
size_t totalSizeBefore = 0;
@@ -406,6 +449,7 @@ void HM_HHC_collectLocal(uint32_t desiredScope) {
Trace0(EVENT_PROMOTION_ENTER);
timespec_now(&startTime);
+ forwardHHObjptrArgs.concurrent = true;
/* For each remembered entry, if possible, unpin and discard the entry.
* otherwise, copy the remembered entry to the toSpace remembered set. */
for (HM_HierarchicalHeap cursor = hh;
@@ -415,11 +459,26 @@ void HM_HHC_collectLocal(uint32_t desiredScope) {
forwardHHObjptrArgs.toDepth = HM_HH_getDepth(cursor);
struct HM_foreachDownptrClosure closure =
- {.fun = tryUnpinOrKeepPinned, .env = (void*)&forwardHHObjptrArgs};
- HM_foreachRemembered(s, HM_HH_getRemSet(cursor), &closure);
+ {.fun = tryUnpinOrKeepPinned, .env = (void *)&forwardHHObjptrArgs};
+ HM_foreachRemembered(s, HM_HH_getRemSet(cursor), &closure, true);
}
+ forwardHHObjptrArgs.concurrent = false;
forwardHHObjptrArgs.toDepth = HM_HH_INVALID_DEPTH;
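+  /* Snapshot the current frontier of each toSpace level. Everything copied so
+   * far (during the remembered-set scavenge above) has already been traced via
+   * the worklist in LGC_markAndScan, so the later pass over the toSpace chunk
+   * lists can start from these positions instead of from the beginning. */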
+ for (uint32_t i = 0; i <= maxDepth; i++)
+ {
+ if (toSpace[i] != NULL)
+ {
+ HM_chunkList toSpaceList = HM_HH_getChunkList(toSpace[i]);
+ if (toSpaceList->firstChunk != NULL)
+ {
+ toSpaceStart[i] = HM_getChunkFrontier(toSpaceList->lastChunk);
+ toSpaceStartChunk[i] = toSpaceList->lastChunk;
+ // assert(HM_getChunkOf(toSpaceStart[i]) == toSpaceList->lastChunk);
+ }
+ }
+ }
+
// assertInvariants(thread);
#if ASSERT
@@ -438,8 +497,9 @@ void HM_HHC_collectLocal(uint32_t desiredScope) {
/* ===================================================================== */
- if (needGCTime(s)) {
- startTiming (RUSAGE_THREAD, &ru_start);
+ if (needGCTime(s))
+ {
+ startTiming(RUSAGE_THREAD, &ru_start);
}
timespec_now(&startTime);
@@ -460,12 +520,12 @@ void HM_HHC_collectLocal(uint32_t desiredScope) {
&forwardHHObjptrClosure,
FALSE);
LOG(LM_HH_COLLECTION, LL_DEBUG,
- "Copied %"PRIu64" objects from stack",
+ "Copied %" PRIu64 " objects from stack",
forwardHHObjptrArgs.objectsCopied - oldObjectCopied);
Trace3(EVENT_COPY,
- forwardHHObjptrArgs.bytesCopied,
- forwardHHObjptrArgs.objectsCopied,
- forwardHHObjptrArgs.stacksCopied);
+ forwardHHObjptrArgs.bytesCopied,
+ forwardHHObjptrArgs.objectsCopied,
+ forwardHHObjptrArgs.stacksCopied);
/* forward contents of thread (hence including stack) */
oldObjectCopied = forwardHHObjptrArgs.objectsCopied;
@@ -477,26 +537,25 @@ void HM_HHC_collectLocal(uint32_t desiredScope) {
&forwardHHObjptrClosure,
FALSE);
LOG(LM_HH_COLLECTION, LL_DEBUG,
- "Copied %"PRIu64" objects from thread",
+ "Copied %" PRIu64 " objects from thread",
forwardHHObjptrArgs.objectsCopied - oldObjectCopied);
Trace3(EVENT_COPY,
- forwardHHObjptrArgs.bytesCopied,
- forwardHHObjptrArgs.objectsCopied,
- forwardHHObjptrArgs.stacksCopied);
+ forwardHHObjptrArgs.bytesCopied,
+ forwardHHObjptrArgs.objectsCopied,
+ forwardHHObjptrArgs.stacksCopied);
/* forward thread itself */
LOG(LM_HH_COLLECTION, LL_DEBUG,
- "Trying to forward current thread %p",
- (void*)s->currentThread);
+ "Trying to forward current thread %p",
+ (void *)s->currentThread);
oldObjectCopied = forwardHHObjptrArgs.objectsCopied;
forwardHHObjptr(s, &(s->currentThread), s->currentThread, &forwardHHObjptrArgs);
LOG(LM_HH_COLLECTION, LL_DEBUG,
- (1 == (forwardHHObjptrArgs.objectsCopied - oldObjectCopied)) ?
- "Copied thread from GC_state" : "Did not copy thread from GC_state");
+ (1 == (forwardHHObjptrArgs.objectsCopied - oldObjectCopied)) ? "Copied thread from GC_state" : "Did not copy thread from GC_state");
Trace3(EVENT_COPY,
- forwardHHObjptrArgs.bytesCopied,
- forwardHHObjptrArgs.objectsCopied,
- forwardHHObjptrArgs.stacksCopied);
+ forwardHHObjptrArgs.bytesCopied,
+ forwardHHObjptrArgs.objectsCopied,
+ forwardHHObjptrArgs.stacksCopied);
/* forward contents of deque */
oldObjectCopied = forwardHHObjptrArgs.objectsCopied;
@@ -507,12 +566,12 @@ void HM_HHC_collectLocal(uint32_t desiredScope) {
&forwardHHObjptrClosure,
FALSE);
LOG(LM_HH_COLLECTION, LL_DEBUG,
- "Copied %"PRIu64" objects from deque",
+ "Copied %" PRIu64 " objects from deque",
forwardHHObjptrArgs.objectsCopied - oldObjectCopied);
Trace3(EVENT_COPY,
- forwardHHObjptrArgs.bytesCopied,
- forwardHHObjptrArgs.objectsCopied,
- forwardHHObjptrArgs.stacksCopied);
+ forwardHHObjptrArgs.bytesCopied,
+ forwardHHObjptrArgs.objectsCopied,
+ forwardHHObjptrArgs.stacksCopied);
LOG(LM_HH_COLLECTION, LL_DEBUG, "END root copy");
@@ -530,107 +589,82 @@ void HM_HHC_collectLocal(uint32_t desiredScope) {
// };
/* off-by-one to prevent underflow */
- uint32_t depth = thread->currentDepth+1;
- while (depth > forwardHHObjptrArgs.minDepth) {
+ uint32_t depth = thread->currentDepth + 1;
+ while (depth > forwardHHObjptrArgs.minDepth)
+ {
depth--;
HM_HierarchicalHeap toSpaceLevel = toSpace[depth];
- if (NULL == toSpaceLevel) {
+ if (NULL == toSpaceLevel)
+ {
continue;
}
LOG(LM_HH_COLLECTION, LL_INFO,
- "level %"PRIu32": num pinned: %zu",
- depth,
- HM_numRemembered(HM_HH_getRemSet(toSpaceLevel)));
+ "level %" PRIu32 ": num pinned: %zu",
+ depth,
+ HM_numRemembered(HM_HH_getRemSet(toSpaceLevel)));
- /* use the remembered (pinned) entries at this level as extra roots */
+ /* forward the from-elements of the down-ptrs */
struct HM_foreachDownptrClosure closure =
- {.fun = forwardObjptrsOfRemembered, .env = (void*)&forwardHHObjptrArgs};
- HM_foreachRemembered(s, HM_HH_getRemSet(toSpaceLevel), &closure);
+ {.fun = forwardFromObjsOfRemembered, .env = (void *)&forwardHHObjptrArgs};
+      // HM_foreachRemembered pops the public remSet into the private one, which
+      // would interfere with the unmarking phase of GC, so use HM_foreachPrivate instead.
+ HM_foreachPrivate(s, &(HM_HH_getRemSet(toSpaceLevel)->private), &closure);
if (NULL != HM_HH_getChunkList(toSpaceLevel)->firstChunk)
{
HM_chunkList toSpaceList = HM_HH_getChunkList(toSpaceLevel);
+ pointer start = toSpaceStart[depth] != NULL ? toSpaceStart[depth] : HM_getChunkStart(toSpaceList->firstChunk);
+ HM_chunk startChunk = toSpaceStartChunk[depth] != NULL ? toSpaceStartChunk[depth] : toSpaceList->firstChunk;
HM_forwardHHObjptrsInChunkList(
- s,
- toSpaceList->firstChunk,
- HM_getChunkStart(toSpaceList->firstChunk),
- // &skipStackAndThreadObjptrPredicate,
- // &ssatoPredicateArgs,
- &trueObjptrPredicate,
- NULL,
- &forwardHHObjptr,
- &forwardHHObjptrArgs);
+ s,
+ startChunk,
+ start,
+ // &skipStackAndThreadObjptrPredicate,
+ // &ssatoPredicateArgs,
+ &trueObjptrPredicate,
+ NULL,
+ &forwardHHObjptr,
+ &forwardHHObjptrArgs);
}
}
- /* after everything has been scavenged, we have to move the pinned chunks */
- depth = thread->currentDepth+1;
- while (depth > forwardHHObjptrArgs.minDepth) {
- depth--;
- HM_HierarchicalHeap toSpaceLevel = toSpace[depth];
- if (NULL == toSpaceLevel) {
- /* check that there are also no pinned chunks at this level
- * (if there was pinned chunk, then we would have also created a
- * toSpace HH at this depth, because we would have scavenged the
- * remembered entry) */
- assert(pinned[depth].firstChunk == NULL);
- continue;
- }
-
-#if ASSERT
- // SAM_NOTE: safe to check here, because pinned chunks are separate.
- traverseEachObjInChunkList(s, HM_HH_getChunkList(toSpaceLevel));
-#endif
-
- /* unset the flags on pinned chunks and update their HH pointer */
- for (HM_chunk chunkCursor = pinned[depth].firstChunk;
- chunkCursor != NULL;
- chunkCursor = chunkCursor->nextChunk)
- {
- assert(chunkCursor->pinnedDuringCollection);
- chunkCursor->pinnedDuringCollection = FALSE;
- chunkCursor->levelHead = HM_HH_getUFNode(toSpaceLevel);
- }
-
- /* put the pinned chunks into the toSpace */
- HM_appendChunkList(HM_HH_getChunkList(toSpaceLevel), &(pinned[depth]));
- }
-
LOG(LM_HH_COLLECTION, LL_DEBUG,
- "Copied %"PRIu64" objects in copy-collection",
+ "Copied %" PRIu64 " objects in copy-collection",
forwardHHObjptrArgs.objectsCopied - oldObjectCopied);
LOG(LM_HH_COLLECTION, LL_DEBUG,
- "Copied %"PRIu64" stacks in copy-collection",
+ "Copied %" PRIu64 " stacks in copy-collection",
forwardHHObjptrArgs.stacksCopied);
Trace3(EVENT_COPY,
- forwardHHObjptrArgs.bytesCopied,
- forwardHHObjptrArgs.objectsCopied,
- forwardHHObjptrArgs.stacksCopied);
+ forwardHHObjptrArgs.bytesCopied,
+ forwardHHObjptrArgs.objectsCopied,
+ forwardHHObjptrArgs.stacksCopied);
/* ===================================================================== */
struct LGC_chunkInfo info =
- {.depth = 0,
- .procNum = s->procNumber,
- .collectionNumber = s->cumulativeStatistics->numHHLocalGCs,
- .freedType = LGC_FREED_NORMAL_CHUNK};
+ {.depth = 0,
+ .procNum = s->procNumber,
+ .collectionNumber = s->cumulativeStatistics->numHHLocalGCs,
+ .freedType = LGC_FREED_NORMAL_CHUNK};
struct writeFreedBlockInfoFnClosure infoc =
- {.fun = LGC_writeFreeChunkInfo, .env = &info};
+ {.fun = LGC_writeFreeChunkInfo, .env = &info};
for (HM_HierarchicalHeap cursor = hh;
NULL != cursor && HM_HH_getDepth(cursor) >= minDepth;
- cursor = cursor->nextAncestor) {
+ cursor = cursor->nextAncestor)
+ {
HM_chunkList suspects = HM_HH_getSuspects(cursor);
- if (suspects->size != 0) {
+ if (suspects->size != 0)
+ {
uint32_t depth = HM_HH_getDepth(cursor);
forwardHHObjptrArgs.toDepth = depth;
struct GC_foreachObjptrClosure fObjptrClosure =
- {.fun = copySuspect, .env = &forwardHHObjptrArgs};
+ {.fun = copySuspect, .env = &forwardHHObjptrArgs};
ES_foreachSuspect(s, suspects, &fObjptrClosure);
info.depth = depth;
info.freedType = LGC_FREED_SUSPECT_CHUNK;
- HM_freeChunksInListWithInfo(s, suspects, &infoc);
+ HM_freeChunksInListWithInfo(s, suspects, &infoc, BLOCK_FOR_SUSPECTS);
}
}
@@ -644,8 +678,9 @@ void HM_HHC_collectLocal(uint32_t desiredScope) {
HM_HierarchicalHeap nextAncestor = hhTail->nextAncestor;
HM_chunkList level = HM_HH_getChunkList(hhTail);
- HM_chunkList remset = HM_HH_getRemSet(hhTail);
- if (NULL != remset) {
+ HM_remSet remset = HM_HH_getRemSet(hhTail);
+ if (NULL != remset)
+ {
#if ASSERT
/* clear out memory to quickly catch some memory safety errors */
// HM_chunk chunkCursor = remset->firstChunk;
@@ -658,12 +693,13 @@ void HM_HHC_collectLocal(uint32_t desiredScope) {
#endif
info.depth = HM_HH_getDepth(hhTail);
info.freedType = LGC_FREED_REMSET_CHUNK;
- HM_freeChunksInListWithInfo(s, remset, &infoc);
+ HM_freeChunksInListWithInfo(s, &(remset->private), &infoc, BLOCK_FOR_REMEMBERED_SET);
}
#if ASSERT
HM_chunk chunkCursor = level->firstChunk;
- while (chunkCursor != NULL) {
+ while (chunkCursor != NULL)
+ {
assert(!chunkCursor->pinnedDuringCollection);
chunkCursor = chunkCursor->nextChunk;
}
@@ -671,14 +707,82 @@ void HM_HHC_collectLocal(uint32_t desiredScope) {
info.depth = HM_HH_getDepth(hhTail);
info.freedType = LGC_FREED_NORMAL_CHUNK;
- HM_freeChunksInListWithInfo(s, level, &infoc);
- HM_HH_freeAllDependants(s, hhTail, FALSE);
- freeFixedSize(getUFAllocator(s), HM_HH_getUFNode(hhTail));
- freeFixedSize(getHHAllocator(s), hhTail);
+ // HM_freeChunksInListWithInfo(s, level, &infoc);
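+  /* Free the from-space chunks at this level. Chunks flagged with retireChunk
+   * (they held objects relocated during the concurrent phase) are handed to
+   * EBR so they are reclaimed only once concurrent readers have quiesced; the
+   * rest are freed immediately and the list is reset. */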
+ HM_chunk chunk = level->firstChunk;
+ while (chunk != NULL) {
+ HM_chunk next = chunk->nextChunk;
+ if (chunk->retireChunk) {
+ HM_EBR_retire(s, chunk);
+ chunk->retireChunk = false;
+ }
+ else
+ {
+ HM_freeChunkWithInfo(s, chunk, &infoc, BLOCK_FOR_HEAP_CHUNK);
+ }
+ chunk = next;
+ }
+ HM_initChunkList(level);
+ HM_HH_freeAllDependants(s, hhTail, TRUE);
+ // freeFixedSize(getUFAllocator(s), HM_HH_getUFNode(hhTail));
+ // freeFixedSize(getHHAllocator(s), hhTail);
hhTail = nextAncestor;
}
+ HM_EBR_leaveQuiescentState(s);
+ // HM_EBR_enterQuiescentState(s);
+
+ /* after everything has been scavenged, we have to move the pinned chunks */
+ depth = thread->currentDepth + 1;
+ while (depth > forwardHHObjptrArgs.minDepth)
+ {
+ depth--;
+ HM_HierarchicalHeap fromSpaceLevel = fromSpace[depth];
+ if (NULL == fromSpaceLevel)
+ {
+ /* check that there are also no pinned chunks at this level
+       * (if there was a pinned chunk, then there must also have been a
+ * fromSpace HH at this depth which originally stored the chunk)
+ */
+ assert(pinned[depth].firstChunk == NULL);
+ assert(NULL == toSpace[depth] || (HM_HH_getRemSet(toSpace[depth])->private).firstChunk == NULL);
+ continue;
+ }
+
+ HM_HierarchicalHeap toSpaceLevel = toSpace[depth];
+ // if (fromSpaceLevel != NULL) {
+ // struct HM_foreachDownptrClosure closure =
+ // {.fun = tryUnpinOrKeepPinned, .env = (void *)&forwardHHObjptrArgs};
+ // // HM_foreachRemembered(s, HM_HH_getRemSet(toSpaceLevel), &closure);
+ // /*go through the public of fromSpaceLevel, they will be joined later anyway*/
+ // // assert((HM_HH_getRemSet(fromSpaceLevel)->private).firstChunk == NULL);
+ // forwardHHObjptrArgs.toDepth = depth;
+ // HM_foreachRemembered(s, HM_HH_getRemSet(fromSpaceLevel), &closure);
+ // }
+
+ if (toSpaceLevel != NULL) {
+ struct HM_foreachDownptrClosure unmarkClosure =
+ {.fun = unmarkWrapper, .env = NULL};
+ HM_foreachPublic(s, HM_HH_getRemSet(toSpaceLevel), &unmarkClosure, true);
+ }
+
+ /* unset the flags on pinned chunks and update their HH pointer */
+ for (HM_chunk chunkCursor = pinned[depth].firstChunk;
+ chunkCursor != NULL;
+ chunkCursor = chunkCursor->nextChunk)
+ {
+ assert(chunkCursor->levelHead == HM_HH_getUFNode(fromSpaceLevel));
+ assert(chunkCursor->pinnedDuringCollection);
+ chunkCursor->pinnedDuringCollection = FALSE;
+ chunkCursor->retireChunk = FALSE;
+ }
+
+ /* put the pinned chunks into the toSpace */
+ HM_appendChunkList(HM_HH_getChunkList(fromSpaceLevel), &(pinned[depth]));
+ }
+
+ CC_workList_free(s, &(forwardHHObjptrArgs.worklist));
+
/* Build the toSpace hh */
HM_HierarchicalHeap hhToSpace = NULL;
for (uint32_t i = 0; i <= maxDepth; i++)
@@ -691,15 +795,17 @@ void HM_HHC_collectLocal(uint32_t desiredScope) {
}
/* merge in toSpace */
- if (NULL == hhTail && NULL == hhToSpace) {
+ if (NULL == hh && NULL == hhToSpace)
+ {
/** SAM_NOTE: If we collected everything, I suppose this is possible.
- * But shouldn't the stack and thread at least be in the root-to-leaf
- * path? Should look into this...
- */
+ * But shouldn't the stack and thread at least be in the root-to-leaf
+ * path? Should look into this...
+ */
hh = HM_HH_new(s, thread->currentDepth);
}
- else {
- hh = HM_HH_zip(s, hhTail, hhToSpace);
+ else
+ {
+ hh = HM_HH_zip(s, hh, hhToSpace);
}
thread->hierarchicalHeap = hh;
@@ -716,15 +822,18 @@ void HM_HHC_collectLocal(uint32_t desiredScope) {
NULL != cursor;
cursor = cursor->nextAncestor)
{
- if (HM_getChunkListLastChunk(HM_HH_getChunkList(cursor)) != NULL) {
+ if (HM_getChunkListLastChunk(HM_HH_getChunkList(cursor)) != NULL)
+ {
lastChunk = HM_getChunkListLastChunk(HM_HH_getChunkList(cursor));
break;
}
}
thread->currentChunk = lastChunk;
- if (lastChunk != NULL && !lastChunk->mightContainMultipleObjects) {
- if (!HM_HH_extend(s, thread, GC_HEAP_LIMIT_SLOP)) {
+ if (lastChunk != NULL && !lastChunk->mightContainMultipleObjects)
+ {
+ if (!HM_HH_extend(s, thread, GC_HEAP_LIMIT_SLOP))
+ {
DIE("Ran out of space for hierarchical heap!\n");
}
}
@@ -738,7 +847,6 @@ void HM_HHC_collectLocal(uint32_t desiredScope) {
* assert(lastChunk->frontier < (pointer)lastChunk + HM_BLOCK_SIZE);
*/
-
#if 0
/** Finally, unfreeze chunks if we need to. */
if (s->controls->manageEntanglement) {
@@ -769,8 +877,19 @@ void HM_HHC_collectLocal(uint32_t desiredScope) {
cursor = cursor->nextAncestor)
{
struct HM_foreachDownptrClosure closure =
- {.fun = checkRememberedEntry, .env = (void*)cursor};
- HM_foreachRemembered(s, HM_HH_getRemSet(cursor), &closure);
+ {.fun = checkRememberedEntry, .env = (void *)cursor};
+ HM_foreachRemembered(s, HM_HH_getRemSet(cursor), &closure, false);
+ }
+
+ // make sure that original representatives haven't been messed up
+ for (HM_HierarchicalHeap cursor = hh;
+ NULL != cursor;
+ cursor = cursor->nextAncestor)
+ {
+ if (NULL != fromSpace[HM_HH_getDepth(cursor)])
+ {
+ assert(fromSpace[HM_HH_getDepth(cursor)] == cursor);
+ }
}
#endif
@@ -783,7 +902,11 @@ void HM_HHC_collectLocal(uint32_t desiredScope) {
* TODO: IS THIS A PROBLEM?
*/
thread->bytesSurvivedLastCollection =
- forwardHHObjptrArgs.bytesMoved + forwardHHObjptrArgs.bytesCopied;
+ forwardHHObjptrArgs.bytesMoved + forwardHHObjptrArgs.bytesCopied;
+
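+  /* Use the number of entangled bytes encountered during this collection as an
+   * approximate race factor, keeping the maximum observed so far. */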
+ float new_rf = forwardHHObjptrArgs.entangledBytes;
+
+ s->cumulativeStatistics->approxRaceFactor = max(s->cumulativeStatistics->approxRaceFactor, new_rf);
thread->bytesAllocatedSinceLastCollection = 0;
@@ -813,10 +936,13 @@ void HM_HHC_collectLocal(uint32_t desiredScope) {
size_t sizeBefore = sizesBefore[i];
const char *sign;
size_t diff;
- if (sizeBefore > sizeAfter) {
+ if (sizeBefore > sizeAfter)
+ {
sign = "-";
diff = sizeBefore - sizeAfter;
- } else {
+ }
+ else
+ {
sign = "+";
diff = sizeAfter - sizeBefore;
}
@@ -831,11 +957,16 @@ void HM_HHC_collectLocal(uint32_t desiredScope) {
}
}
- if (totalSizeAfter > totalSizeBefore) {
+ s->cumulativeStatistics->bytesInScopeForLocal += totalSizeBefore;
+
+ if (totalSizeAfter > totalSizeBefore)
+ {
// whoops?
- } else {
+ }
+ else
+ {
s->cumulativeStatistics->bytesReclaimedByLocal +=
- (totalSizeBefore - totalSizeAfter);
+ (totalSizeBefore - totalSizeAfter);
}
/* enter statistics if necessary */
@@ -853,8 +984,10 @@ void HM_HHC_collectLocal(uint32_t desiredScope) {
// (int)thread->minLocalCollectionDepth);
// }
- if (needGCTime(s)) {
- if (detailedGCTime(s)) {
+ if (needGCTime(s))
+ {
+ if (detailedGCTime(s))
+ {
stopTiming(RUSAGE_THREAD, &ru_start, &s->cumulativeStatistics->ru_gcHHLocal);
}
/*
@@ -881,7 +1014,7 @@ bool isObjptrInToSpace(objptr op, struct ForwardHHObjptrArgs *args)
HM_chunk c = HM_getChunkOf(objptrToPointer(op, NULL));
HM_HierarchicalHeap levelHead = HM_getLevelHeadPathCompress(c);
uint32_t depth = HM_HH_getDepth(levelHead);
- assert(depth <= args->maxDepth);
+ // assert(depth <= args->maxDepth);
assert(NULL != levelHead);
return args->toSpace[depth] == levelHead;
@@ -890,15 +1023,27 @@ bool isObjptrInToSpace(objptr op, struct ForwardHHObjptrArgs *args)
/* ========================================================================= */
objptr relocateObject(
- GC_state s,
- objptr op,
- HM_HierarchicalHeap tgtHeap,
- struct ForwardHHObjptrArgs *args)
+ GC_state s,
+ objptr op,
+ HM_HierarchicalHeap tgtHeap,
+ struct ForwardHHObjptrArgs *args,
+ bool *relocSuccess)
{
+ *relocSuccess = true;
pointer p = objptrToPointer(op, NULL);
-
assert(!hasFwdPtr(p));
assert(HM_HH_isLevelHead(tgtHeap));
+ GC_header header = getHeader(p);
+ assert (!isFwdHeader(header));
+
+ if (pinType(header) != PIN_NONE)
+ {
+    // object is pinned, so it can't be relocated;
+    // this case can only arise when tracing from a down pointer, or when the
+    // object is itself the target of one.
+ *relocSuccess = false;
+ assert(args->concurrent);
+ return op;
+ }
HM_chunkList tgtChunkList = HM_HH_getChunkList(tgtHeap);
@@ -908,37 +1053,57 @@ objptr relocateObject(
/* compute object size and bytes to be copied */
computeObjectCopyParameters(s,
+ header,
p,
&objectBytes,
©Bytes,
&metaDataBytes);
- if (!HM_getChunkOf(p)->mightContainMultipleObjects) {
+ if (!HM_getChunkOf(p)->mightContainMultipleObjects)
+ {
/* This chunk contains *only* this object, so no need to copy. Instead,
* just move the chunk. Don't forget to update the levelHead, too! */
HM_chunk chunk = HM_getChunkOf(p);
- HM_unlinkChunk(HM_HH_getChunkList(HM_getLevelHead(chunk)), chunk);
+ HM_unlinkChunkPreserveLevelHead(HM_HH_getChunkList(HM_getLevelHead(chunk)), chunk);
HM_appendChunk(tgtChunkList, chunk);
chunk->levelHead = HM_HH_getUFNode(tgtHeap);
LOG(LM_HH_COLLECTION, LL_DEBUGMORE,
- "Moved single-object chunk %p of size %zu",
- (void*)chunk,
- HM_getChunkSize(chunk));
+ "Moved single-object chunk %p of size %zu",
+ (void *)chunk,
+ HM_getChunkSize(chunk));
args->bytesMoved += copyBytes;
args->objectsMoved++;
return op;
}
+ /* Otherwise try copying the object */
pointer copyPointer = copyObject(p - metaDataBytes,
objectBytes,
copyBytes,
tgtHeap);
/* Store the forwarding pointer in the old object metadata. */
- *(getFwdPtrp(p)) = pointerToObjptr (copyPointer + metaDataBytes,
- NULL);
- assert (hasFwdPtr(p));
+ objptr newPointer = pointerToObjptr(copyPointer + metaDataBytes, NULL);
+ if (!args->concurrent)
+ {
+ assert(!isPinned(op));
+ assert (__sync_bool_compare_and_swap(getFwdPtrp(p), header, newPointer));
+ *(getFwdPtrp(p)) = newPointer;
+ }
+ else
+ {
+ bool success = __sync_bool_compare_and_swap(getFwdPtrp(p), header, newPointer);
+ if (!success)
+ {
+ delLastObj(newPointer, objectBytes, tgtHeap);
+ assert(isPinned(op));
+ *relocSuccess = false;
+ return op;
+ }
+ }
+ assert (getFwdPtr(p) == newPointer);
+ assert(hasFwdPtr(p));
args->bytesCopied += copyBytes;
args->objectsCopied++;
@@ -1065,41 +1230,369 @@ void copySuspect(
assert(isObjptr(op));
pointer p = objptrToPointer(op, NULL);
objptr new_ptr = op;
- if (hasFwdPtr(p)) {
+ if (hasFwdPtr(p))
+ {
new_ptr = getFwdPtr(p);
}
- else if (!isPinned(op)) {
+ else if (!isPinned(op))
+ {
/* the suspect does not have a fwd-ptr and is not pinned
* ==> its garbage, so skip it
*/
return;
}
uint32_t opDepth = args->toDepth;
- if (NULL == args->toSpace[opDepth])
+ HM_storeInChunkListWithPurpose(
+ HM_HH_getSuspects(toSpaceHH(s, args, opDepth)),
+ &new_ptr,
+ sizeof(objptr),
+ BLOCK_FOR_SUSPECTS);
+}
+
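+/* A header whose valid bit (GC_VALID_HEADER_MASK) is clear has been overwritten
+ * by a forwarding pointer (see the CAS on the header word in relocateObject). */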
+bool headerForwarded(GC_header h)
+{
+ return (!(GC_VALID_HEADER_MASK & h));
+}
+
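+/* Tracing step used while scavenging remembered (pinned) entries: skips objects
+ * outside the collection scope or already in the toSpace, follows forwarding
+ * pointers, and otherwise tries to relocate the object into the toSpace
+ * (PIN_DOWN objects are left to the remembered set). If relocation fails
+ * because the object is concurrently pinned PIN_ANY, it stays in place: it is
+ * recorded as entangled, its chunk is pinned, and it is retraced via the
+ * worklist. */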
+void markAndAdd(
+ GC_state s,
+ objptr *opp,
+ objptr op,
+ void *rawArgs)
+{
+ struct ForwardHHObjptrArgs *args = (struct ForwardHHObjptrArgs *)rawArgs;
+ pointer p = objptrToPointer(op, NULL);
+ HM_chunk chunk = HM_getChunkOf(p);
+ uint32_t opDepth = HM_HH_getDepth(HM_getLevelHead(chunk));
+ bool isInToSpace = isObjptrInToSpace(op, args);
+ if ((opDepth > args->maxDepth) || (opDepth < args->minDepth))
+ {
+ /*object is outside the scope of collection*/
+ return;
+ }
+ else if (isInToSpace) {
+ assert(!hasFwdPtr(p));
+ return;
+ }
+ else if (args->fromSpace[opDepth] != HM_getLevelHead(chunk))
+ {
+ /*object is outside the scope of collection*/
+ return;
+ }
+
+ if (hasFwdPtr(p))
+ {
+ objptr fop = getFwdPtr(p);
+ assert(!hasFwdPtr(objptrToPointer(fop, NULL)));
+ assert(isObjptrInToSpace(fop, args));
+ assert(HM_getObjptrDepth(fop) == opDepth);
+    *opp = fop; // SAM_UNSAFE :: potential bug here because of a race with readers
+ return;
+ }
+ else if (CC_isPointerMarked(p)) {
+ assert (pinType(getHeader(p)) == PIN_ANY);
+ return;
+ }
+
+ disentangleObject(s, op, opDepth);
+ enum PinType pt = pinType(getHeader(p));
+
+ if (pt == PIN_DOWN)
+ {
+    // It is okay not to trace PIN_DOWN objects here: the remembered set will have
+    // them and they will definitely be traced; this relies on unpinning failing for
+    // them in disentangleObject. It is dangerous, however, to skip PIN_ANY objects,
+    // because their remSet entries might be created concurrently with LGC, which
+    // could then miss them.
+ return;
+ }
+ else
+ {
+ assert (!CC_isPointerMarked(p));
+ assert(!hasFwdPtr(p));
+ assert(args->concurrent);
+ HM_HierarchicalHeap tgtHeap = toSpaceHH(s, args, opDepth);
+ assert(p == objptrToPointer(op, NULL));
+ bool relocateSuccess;
+ objptr op_new = relocateObject(s, op, tgtHeap, args, &relocateSuccess);
+ if (relocateSuccess)
+ {
+ chunk->retireChunk = true;
+ *opp = op_new;
+ assert(!hasFwdPtr(objptrToPointer(op_new, NULL)));
+ CC_workList_push(s, &(args->worklist), op_new);
+ }
+ else
+ {
+ assert (isPinned(op));
+ assert (pinType(getHeader(p)) == PIN_ANY);
+      // This is purely an optimization to prevent retracing of PIN_ANY objects,
+      // so it is okay if this header read is racy; in the worst case the object is retraced.
+ addEntangledToRemSet(s, op, opDepth, args);
+
+ if (!chunk->pinnedDuringCollection)
+ {
+ chunk->pinnedDuringCollection = TRUE;
+
+ if (chunk->levelHead != HM_HH_getUFNode(args->fromSpace[opDepth]))
+ {
+ chunk->levelHead = HM_HH_getUFNode(args->fromSpace[opDepth]);
+ }
+ HM_unlinkChunkPreserveLevelHead(
+ HM_HH_getChunkList(args->fromSpace[opDepth]),
+ chunk);
+ HM_appendChunk(&(args->pinned[opDepth]), chunk);
+ }
+ CC_workList_push(s, &(args->worklist), op);
+ }
+ }
+ return;
+}
+
+void unmarkAndAdd(
+ GC_state s,
+ __attribute__((unused)) objptr *opp,
+ objptr op,
+ void *rawArgs)
+{
+ struct ForwardHHObjptrArgs *args = (struct ForwardHHObjptrArgs *)rawArgs;
+ pointer p = objptrToPointer(op, NULL);
+ HM_chunk chunk = HM_getChunkOf(p);
+ uint32_t opDepth = HM_HH_getDepth(HM_getLevelHead(chunk));
+ assert(!hasFwdPtr(p));
+ if ((opDepth > args->maxDepth) || (opDepth < args->minDepth))
+ {
+ return;
+ }
+ else if (args->fromSpace[opDepth] != HM_getLevelHead(chunk) && !isObjptrInToSpace(op, args))
+ {
+ return;
+ }
+ else if (CC_isPointerMarked(p))
+ {
+ markObj(p);
+ CC_workList_push(s, &(args->worklist), op);
+ }
+}
+
+void unmark(
+ GC_state s,
+ __attribute__((unused)) objptr *opp,
+ objptr op,
+ void *rawArgs)
+{
+ struct ForwardHHObjptrArgs *args = (struct ForwardHHObjptrArgs *)rawArgs;
+ pointer p = objptrToPointer(op, NULL);
+ HM_chunk chunk = HM_getChunkOf(p);
+ uint32_t opDepth = HM_HH_getDepth(HM_getLevelHead(chunk));
+ assert(!hasFwdPtr(p));
+ if ((opDepth > args->maxDepth) || (opDepth < args->minDepth))
+ {
+ return;
+ }
+ else if (args->fromSpace[opDepth] != HM_getLevelHead(chunk))
+ {
+ return;
+ }
+ else if (CC_isPointerMarked(p))
+ {
+ markObj(p);
+ struct GC_foreachObjptrClosure unmarkClosure = {.fun = unmark, .env = rawArgs};
+ foreachObjptrInObject(s, p, &trueObjptrPredicateClosure, &unmarkClosure, FALSE);
+ }
+}
+
+void phaseLoop(GC_state s, void *rawArgs, GC_foreachObjptrClosure fClosure)
+{
+ struct ForwardHHObjptrArgs *args = (struct ForwardHHObjptrArgs *)rawArgs;
+
+ CC_workList worklist = &(args->worklist);
+ objptr *current = CC_workList_pop(s, worklist);
+ while (NULL != current)
{
- args->toSpace[opDepth] = HM_HH_new(s, opDepth);
+ callIfIsObjptr(s, fClosure, current);
+ current = CC_workList_pop(s, worklist);
}
- HM_storeInchunkList(HM_HH_getSuspects(args->toSpace[opDepth]), &new_ptr, sizeof(objptr));
+ assert(CC_workList_isEmpty(s, worklist));
}
-void tryUnpinOrKeepPinned(GC_state s, HM_remembered remElem, void* rawArgs) {
- struct ForwardHHObjptrArgs* args = (struct ForwardHHObjptrArgs*)rawArgs;
+void addEntangledToRemSet(
+ GC_state s,
+ objptr op,
+ uint32_t opDepth,
+ struct ForwardHHObjptrArgs *args) {
+ pointer p = objptrToPointer(op, NULL);
+ GC_header header = getHeader(p);
+
+ if (pinType(header) == PIN_ANY && !CC_isPointerMarked(p))
+ {
+ markObj(p);
+ struct HM_remembered remElem_ = {.object = op, .from = BOGUS_OBJPTR};
+ HM_remember (HM_HH_getRemSet(toSpaceHH(s, args, opDepth)), &remElem_, true);
+
+ size_t metaDataBytes;
+ size_t objectBytes;
+ size_t copyBytes;
+
+ /* compute object size and bytes to be copied */
+ computeObjectCopyParameters(s,
+ header,
+ p,
+ &objectBytes,
+ ©Bytes,
+ &metaDataBytes);
+ args->entangledBytes += copyBytes;
+ }
+}
+
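+/* Scavenge a single remembered (pinned) entry: record it as entangled if
+ * needed, pin its chunk in place (unless it already lives in the toSpace), and
+ * trace everything reachable from it with markAndAdd, draining the worklist
+ * before returning. */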
+void LGC_markAndScan(
+ GC_state s,
+ HM_remembered remElem,
+ void *rawArgs)
+{
objptr op = remElem->object;
+ pointer p = objptrToPointer(op, NULL);
+ HM_chunk chunk = HM_getChunkOf(p);
+ struct ForwardHHObjptrArgs *args = (struct ForwardHHObjptrArgs *)rawArgs;
+ uint32_t opDepth = HM_HH_getDepth(HM_getLevelHead(chunk));
+ assert(!hasFwdPtr(p));
+ assert(isPinned(op));
+ addEntangledToRemSet(s, op, opDepth, args);
-#if ASSERT
- HM_chunk fromChunk = HM_getChunkOf(objptrToPointer(remElem->from, NULL));
- HM_HierarchicalHeap fromHH = HM_getLevelHead(fromChunk);
- assert(HM_HH_getDepth(fromHH) <= args->toDepth);
-#endif
+ if (!isObjptrInToSpace(op, args) && !chunk->pinnedDuringCollection)
+ {
+ chunk->pinnedDuringCollection = TRUE;
+
+ if (chunk->levelHead != HM_HH_getUFNode(args->fromSpace[opDepth]))
+ {
+ chunk->levelHead = HM_HH_getUFNode(args->fromSpace[opDepth]);
+ }
+ HM_unlinkChunkPreserveLevelHead(
+ HM_HH_getChunkList(args->fromSpace[opDepth]),
+ chunk);
+ HM_appendChunk(&(args->pinned[opDepth]), chunk);
+ }
+
+ CC_workList_push(s, &(args->worklist), op);
+ struct GC_foreachObjptrClosure markClosure =
+ {.fun = markAndAdd, .env = (void *)args};
+ phaseLoop(s, rawArgs, &markClosure);
+ assert(CC_workList_isEmpty(s, &(args->worklist)));
+}
+// void LGC_markAndScan(
+// GC_state s,
+// __attribute__((unused)) objptr *opp,
+// objptr op,
+// void *rawArgs)
+// {
+// struct ForwardHHObjptrArgs *args = (struct ForwardHHObjptrArgs *)rawArgs;
+// pointer p = objptrToPointer(op, NULL);
+// HM_chunk chunk = HM_getChunkOf(p);
+// uint32_t opDepth = HM_HH_getDepth(HM_getLevelHead(chunk));
+// if ((opDepth > args->maxDepth) || (opDepth < args->minDepth))
+// {
+// // DOUBLE CHECK
+// return;
+// }
+// else if (args->fromSpace[opDepth] != HM_getLevelHead(chunk)) {
+// return;
+// }
+// else if (!CC_isPointerMarked(p))
+// {
+// assert(args->fromSpace[opDepth] == HM_getLevelHead(chunk));
+// markObj(p);
+// if (!chunk->pinnedDuringCollection)
+// {
+// chunk->pinnedDuringCollection = TRUE;
+
+// if (chunk->levelHead != HM_HH_getUFNode(args->fromSpace[opDepth])) {
+// chunk->levelHead = HM_HH_getUFNode(args->fromSpace[opDepth]);
+// }
+
+// // HM_unlinkChunkPreserveLevelHead(
+// // HM_HH_getChunkList(args->fromSpace[opDepth]),
+// // chunk);
+// // HM_appendChunk(&(args->pinned[opDepth]), chunk);
+
+// HM_unlinkChunkPreserveLevelHead(
+// HM_HH_getChunkList(args->fromSpace[opDepth]),
+// chunk);
+// HM_appendChunk(&(args->pinned[opDepth]), chunk);
+// }
+// struct GC_foreachObjptrClosure msClosure =
+// {.fun = LGC_markAndScan, .env = rawArgs};
+// foreachObjptrInObject(s, p, &trueObjptrPredicateClosure, &msClosure, FALSE);
+// }
+// else
+// {
+// assert(args->fromSpace[opDepth] == HM_getLevelHead(chunk));
+// assert(chunk->pinnedDuringCollection);
+// }
+// }
+
+// void unmarkLoop(
+// __attribute__((unused)) GC_state s,
+// __attribute__((unused)) objptr *opp,
+// objptr op,
+// void *rawArgs)
+// {
+// struct ForwardHHObjptrArgs *args = (struct ForwardHHObjptrArgs *)rawArgs;
+// pointer p = objptrToPointer(op, NULL);
+// HM_chunk chunk = HM_getChunkOf(p);
+// uint32_t opDepth = HM_HH_getDepth(HM_getLevelHead(chunk));
+// assert(!hasFwdPtr(p));
+// if ((opDepth > args->maxDepth) || (opDepth < args->minDepth))
+// {
+// return;
+// }
+// if (CC_isPointerMarked(p))
+// {
+// markObj(p);
+// struct GC_foreachObjptrClosure unmarkClosure = {.fun = unmark, .env = rawArgs};
+// foreachObjptrInObject(s, p, &trueObjptrPredicateClosure, &unmarkClosure, FALSE);
+// }
+// }
+
+void unmarkWrapper(
+ __attribute__((unused)) GC_state s,
+ HM_remembered remElem,
+ __attribute__((unused)) void *rawArgs)
+{
+ objptr op = remElem->object;
+ pointer p = objptrToPointer (op, NULL);
+ assert (pinType(getHeader(p)) == PIN_ANY);
+ if (CC_isPointerMarked(p)) {markObj(p);}
+ // struct ForwardHHObjptrArgs *args = (struct ForwardHHObjptrArgs *)rawArgs;
+ // struct GC_foreachObjptrClosure unmarkClosure =
+ // {.fun = unmarkAndAdd, .env = args};
+
+ // unmarkAndAdd(s, &(remElem->object), remElem->object, rawArgs);
+ // CC_workList_push(s, &(args->worklist), remElem->object);
+ // phaseLoop(s, rawArgs, &unmarkClosure);
+ // unmark(s, &(remElem->object), remElem->object, rawArgs);
+}
- if (!isPinned(op)) {
+void tryUnpinOrKeepPinned(GC_state s, HM_remembered remElem, void *rawArgs)
+{
+ struct ForwardHHObjptrArgs *args = (struct ForwardHHObjptrArgs *)rawArgs;
+ objptr op = remElem->object;
+
+ // #if ASSERT
+ // HM_chunk fromChunk = HM_getChunkOf(objptrToPointer(remElem->from, NULL));
+ // HM_HierarchicalHeap fromHH = HM_getLevelHead(fromChunk);
+ // assert(HM_HH_getDepth(fromHH) <= args->toDepth);
+ // #endif
+
+ if (!isPinned(op))
+ {
// If previously unpinned, then no need to remember this object.
- assert(HM_getLevelHead(fromChunk) == args->fromSpace[args->toDepth]);
+ // assert(HM_getLevelHead(fromChunk) == args->fromSpace[args->toDepth]);
LOG(LM_HH_PROMOTION, LL_INFO,
- "forgetting remset entry from "FMTOBJPTR" to "FMTOBJPTR,
- remElem->from, op);
+ "forgetting remset entry from " FMTOBJPTR " to " FMTOBJPTR,
+ remElem->from, op);
+ return;
+ } else if ((isObjptrInToSpace(op, args))) {
return;
}
@@ -1109,13 +1602,8 @@ void tryUnpinOrKeepPinned(GC_state s, HM_remembered remElem, void* rawArgs) {
* (getLevelHead, etc.), but this should be faster. The toDepth field
* is set by the loop that calls this function */
uint32_t opDepth = args->toDepth;
- HM_chunk chunk = HM_getChunkOf(objptrToPointer(op, NULL));
-
- if (NULL == args->toSpace[opDepth]) {
- args->toSpace[opDepth] = HM_HH_new(s, opDepth);
- }
-
#if ASSERT
+ HM_chunk chunk = HM_getChunkOf(objptrToPointer(op, NULL));
assert(opDepth <= args->maxDepth);
HM_HierarchicalHeap hh = HM_getLevelHead(chunk);
assert(args->fromSpace[opDepth] == hh);
@@ -1123,136 +1611,152 @@ void tryUnpinOrKeepPinned(GC_state s, HM_remembered remElem, void* rawArgs) {
assert(listContainsChunk(&(args->pinned[opDepth]), chunk));
else
assert(hhContainsChunk(args->fromSpace[opDepth], chunk));
+ assert(HM_getObjptrDepth(op) == opDepth);
+ assert(HM_getLevelHead(chunk) == args->fromSpace[opDepth]);
#endif
-#if 0
- /** If it's not in our from-space, then it's entangled.
- * KEEP THE ENTRY but don't do any of the other nasty stuff.
- *
- * SAM_NOTE: pinned chunks still have their HH set to the from-space,
- * despite living in separate chunklists.
- */
- if (opDepth > args->maxDepth || args->fromSpace[opDepth] != hh) {
- assert(s->controls->manageEntanglement);
- /** TODO: assert entangled here */
+ bool unpin = tryUnpinWithDepth(op, opDepth);
- HM_remember(HM_HH_getRemSet(args->toSpace[opDepth]), remElem);
+ if (unpin)
+ {
return;
}
-#endif
-
- assert(HM_getObjptrDepth(op) == opDepth);
- assert(HM_getLevelHead(chunk) == args->fromSpace[opDepth]);
uint32_t unpinDepth = unpinDepthOf(op);
- uint32_t fromDepth = HM_getObjptrDepth(remElem->from);
-
- assert(fromDepth <= opDepth);
- if (opDepth <= unpinDepth) {
- unpinObject(op);
- assert(fromDepth == opDepth);
-
- LOG(LM_HH_PROMOTION, LL_INFO,
- "forgetting remset entry from "FMTOBJPTR" to "FMTOBJPTR,
- remElem->from, op);
-
- return;
- }
-
- if (fromDepth > unpinDepth) {
- /** If this particular remembered entry came from deeper than some other
- * down-pointer, then we don't need to keep it around. There will be some
- * other remembered entry coming from the unpinDepth level.
- *
- * But note that it is very important that the condition is a strict
- * inequality: we need to keep all remembered entries that came from the
- * same shallowest level. (CC-chaining depends on this.)
- */
+ if (remElem->from != BOGUS_OBJPTR)
+ {
+ uint32_t fromDepth = HM_getObjptrDepth(remElem->from);
+ assert(fromDepth <= opDepth);
+ if (fromDepth > unpinDepth)
+ {
+ /** If this particular remembered entry came from deeper than some other
+ * down-pointer, then we don't need to keep it around. There will be some
+ * other remembered entry coming from the unpinDepth level.
+ *
+ * But note that it is very important that the condition is a strict
+ * inequality: we need to keep all remembered entries that came from the
+ * same shallowest level. (CC-chaining depends on this.)
+ */
- LOG(LM_HH_PROMOTION, LL_INFO,
- "forgetting remset entry from "FMTOBJPTR" to "FMTOBJPTR,
- remElem->from, op);
+ LOG(LM_HH_PROMOTION, LL_INFO,
+ "forgetting remset entry from " FMTOBJPTR " to " FMTOBJPTR,
+ remElem->from, op);
- return;
+ return;
+ }
}
- assert(fromDepth == unpinDepth);
-
/* otherwise, object stays pinned, and we have to scavenge this remembered
- * entry into the toSpace. */
-
- HM_remember(HM_HH_getRemSet(args->toSpace[opDepth]), remElem);
-
- if (chunk->pinnedDuringCollection) {
- return;
+ * entry into the toSpace.
+   * Entangled entries are added later, because we mark & sweep them in place
+   * and reuse the remembered set afterwards for unmarking.
+ */
+ if (remElem->from != BOGUS_OBJPTR) {
+ HM_remember(HM_HH_getRemSet(toSpaceHH(s, args, opDepth)), remElem, false);
}
+ // if (remElem->from != BOGUS_OBJPTR) {
+ // uint32_t fromDepth = HM_getObjptrDepth(remElem->from);
+ // if ((fromDepth <= args->maxDepth) && (fromDepth >= args->minDepth)) {
+ // HM_chunk chunk = HM_getChunkOf(objptrToPointer(op, NULL));
+ // uint32_t opDepth = HM_HH_getDepth(HM_getLevelHead(chunk));
+ // /* if this is a down-ptr completely inside the scope,
+ // * no need to in-place things reachable from it
+ // */
+ // if (!chunk->pinnedDuringCollection)
+ // {
+ // chunk->pinnedDuringCollection = TRUE;
+ // HM_unlinkChunkPreserveLevelHead(
+ // HM_HH_getChunkList(args->fromSpace[opDepth]),
+ // chunk);
+ // HM_appendChunk(&(args->pinned[opDepth]), chunk);
+ // }
+ // return;
+ // }
+ // }
+ LGC_markAndScan(s, remElem, rawArgs);
- chunk->pinnedDuringCollection = TRUE;
- assert(hhContainsChunk(args->fromSpace[opDepth], chunk));
- assert(HM_getLevelHead(chunk) == args->fromSpace[opDepth]);
+ // LGC_markAndScan(s, &(remElem->from), rawArgs);
- HM_unlinkChunkPreserveLevelHead(
- HM_HH_getChunkList(args->fromSpace[opDepth]),
- chunk);
- HM_appendChunk(&(args->pinned[opDepth]), chunk);
+ // if (chunk->pinnedDuringCollection)
+ // {
+ // return;
+ // }
+ // if (chunk->levelHead != HM_HH_getUFNode(args->fromSpace[opDepth]))
+ // {
+ // chunk->levelHead = HM_HH_getUFNode(args->fromSpace[opDepth]);
+ // }
- assert(HM_getLevelHead(chunk) == args->fromSpace[opDepth]);
+ // HM_unlinkChunkPreserveLevelHead(
+ // HM_HH_getChunkList(args->fromSpace[opDepth]),
+ // chunk);
+ // HM_appendChunk(&(args->pinned[opDepth]), chunk);
+ // chunk->pinnedDuringCollection = TRUE;
+ // assert(HM_getLevelHead(chunk) == args->fromSpace[opDepth]);
+ // /** stronger version of previous assertion, needed for safe freeing of
+ // * hh dependants after LGC completes
+ // */
+ // assert(chunk->levelHead == HM_HH_getUFNode(args->fromSpace[opDepth]));
}
/* ========================================================================= */
-void forwardObjptrsOfRemembered(GC_state s, HM_remembered remElem, void* rawArgs) {
+// JATIN_TODO: CHANGE NAME TO FORWARD FROM
+// COULD BE HELPFUL FOR DEBUGGING TO FORWARD THE OBJECTS ANYWAY
+void forwardFromObjsOfRemembered(GC_state s, HM_remembered remElem, void *rawArgs)
+{
+#if ASSERT
objptr op = remElem->object;
-
assert(isPinned(op));
+#endif
- struct GC_foreachObjptrClosure closure =
- {.fun = forwardHHObjptr, .env = rawArgs};
-
- foreachObjptrInObject(
- s,
- objptrToPointer(op, NULL),
- &trueObjptrPredicateClosure,
- &closure,
- FALSE
- );
+ // struct GC_foreachObjptrClosure closure =
+ // {.fun = forwardHHObjptr, .env = rawArgs};
+ // foreachObjptrInObject(
+ // s,
+ // objptrToPointer(op, NULL),
+ // &trueObjptrPredicateClosure,
+ // &closure,
+ // FALSE);
+ assert (remElem->from != BOGUS_OBJPTR);
forwardHHObjptr(s, &(remElem->from), remElem->from, rawArgs);
}
/* ========================================================================= */
void forwardHHObjptr(
- GC_state s,
- objptr* opp,
- objptr op,
- void* rawArgs)
+ GC_state s,
+ objptr *opp,
+ objptr op,
+ void *rawArgs)
{
- struct ForwardHHObjptrArgs* args = ((struct ForwardHHObjptrArgs*)(rawArgs));
- pointer p = objptrToPointer (op, NULL);
+ struct ForwardHHObjptrArgs *args = ((struct ForwardHHObjptrArgs *)(rawArgs));
+ pointer p = objptrToPointer(op, NULL);
assert(args->toDepth == HM_HH_INVALID_DEPTH);
- if (DEBUG_DETAILED) {
- fprintf (stderr,
- "forwardHHObjptr opp = "FMTPTR" op = "FMTOBJPTR" p = "
- ""FMTPTR"\n",
- (uintptr_t)opp,
- op,
- (uintptr_t)p);
+ if (DEBUG_DETAILED)
+ {
+ fprintf(stderr,
+ "forwardHHObjptr opp = " FMTPTR " op = " FMTOBJPTR " p = "
+ "" FMTPTR "\n",
+ (uintptr_t)opp,
+ op,
+ (uintptr_t)p);
}
LOG(LM_HH_COLLECTION, LL_DEBUGMORE,
- "opp = "FMTPTR" op = "FMTOBJPTR" p = "FMTPTR,
+ "opp = " FMTPTR " op = " FMTOBJPTR " p = " FMTPTR,
(uintptr_t)opp,
op,
(uintptr_t)p);
- if (!isObjptr(op) || isObjptrInRootHeap(s, op)) {
+ if (!isObjptr(op) || isObjptrInRootHeap(s, op))
+ {
/* does not point to an HH objptr, so not in scope for collection */
LOG(LM_HH_COLLECTION, LL_DEBUGMORE,
- "skipping opp = "FMTPTR" op = "FMTOBJPTR" p = "FMTPTR": not in HH.",
+ "skipping opp = " FMTPTR " op = " FMTOBJPTR " p = " FMTPTR ": not in HH.",
(uintptr_t)opp,
op,
(uintptr_t)p);
@@ -1261,71 +1765,84 @@ void forwardHHObjptr(
uint32_t opDepth = HM_getObjptrDepthPathCompress(op);
- if (opDepth > args->maxDepth) {
- DIE("entanglement detected during collection: %p is at depth %u, below %u",
- (void *)p,
- opDepth,
- args->maxDepth);
- }
+ // if (opDepth > args->maxDepth)
+ // {
+ // DIE("entanglement detected during collection: %p is at depth %u, below %u",
+ // (void *)p,
+ // opDepth,
+ // args->maxDepth);
+ // }
/* RAM_NOTE: This is more nuanced with non-local collection */
if ((opDepth > args->maxDepth) ||
/* cannot forward any object below 'args->minDepth' */
- (opDepth < args->minDepth)) {
- LOG(LM_HH_COLLECTION, LL_DEBUGMORE,
- "skipping opp = "FMTPTR" op = "FMTOBJPTR" p = "FMTPTR
- ": depth %d not in [minDepth %d, maxDepth %d].",
- (uintptr_t)opp,
- op,
- (uintptr_t)p,
- opDepth,
- args->minDepth,
- args->maxDepth);
- return;
+ (opDepth < args->minDepth))
+ {
+ LOG(LM_HH_COLLECTION, LL_DEBUGMORE,
+ "skipping opp = " FMTPTR " op = " FMTOBJPTR " p = " FMTPTR
+ ": depth %d not in [minDepth %d, maxDepth %d].",
+ (uintptr_t)opp,
+ op,
+ (uintptr_t)p,
+ opDepth,
+ args->minDepth,
+ args->maxDepth);
+ return;
}
assert(HM_getObjptrDepth(op) >= args->minDepth);
- if (isObjptrInToSpace(op, args)) {
+ if (isObjptrInToSpace(op, args))
+ {
assert(!hasFwdPtr(objptrToPointer(op, NULL)));
- assert(!isPinned(op));
+ // to space objects may be pinned now.
+ // assert(!isPinned(op));
+ return;
+ }
+ else if (HM_getLevelHead(HM_getChunkOf(objptrToPointer(op, NULL))) !=
+ args->fromSpace[HM_getObjptrDepth(op)])
+ {
+ // assert (!decheck(s, op));
return;
}
- /* Assert is in from space. This holds for pinned objects, too, because
- * their levelHead is still set to the fromSpace HH. (Pinned objects are
- * stored in a different chunklist during collection through.) */
- assert( HM_getLevelHead(HM_getChunkOf(objptrToPointer(op, NULL)))
- ==
- args->fromSpace[HM_getObjptrDepth(op)] );
-
- if (hasFwdPtr(p)) {
+ if (hasFwdPtr(p))
+ {
objptr fop = getFwdPtr(p);
assert(!hasFwdPtr(objptrToPointer(fop, NULL)));
assert(isObjptrInToSpace(fop, args));
assert(HM_getObjptrDepth(fop) == opDepth);
- assert(!isPinned(fop));
+ // assert(!isPinned(fop));
+ // assert(!CC_isPointerMarked(fop));
*opp = fop;
return;
}
assert(!hasFwdPtr(p));
+ if (CC_isPointerMarked(p))
+ {
+ // this object is collected in-place.
+ return;
+ }
+
/** REALLY SUBTLE. CC clears out remset entries, but can't safely perform
- * unpinning. So, there could be objects that (for the purposes of LC) are
- * semantically unpinned, but just haven't been marked as such yet. Here,
- * we are lazily checking to see if this object should have been unpinned.
- */
- if (isPinned(op) && unpinDepthOf(op) < opDepth) {
+ * unpinning. So, there could be objects that (for the purposes of LC) are
+ * semantically unpinned, but just haven't been marked as such yet. Here,
+ * we are lazily checking to see if this object should have been unpinned.
+ */
+ if (isPinned(op) && unpinDepthOf(op) < opDepth)
+ {
// This is a truly pinned object
- assert(listContainsChunk( &(args->pinned[opDepth]),
- HM_getChunkOf(objptrToPointer(op, NULL))
- ));
+ assert(listContainsChunk(&(args->pinned[opDepth]),
+ HM_getChunkOf(objptrToPointer(op, NULL))));
return;
}
- else {
+ else
+ {
+ disentangleObject(s, op, opDepth);
// This object should have been previously unpinned
- unpinObject(op);
+ // unpinObject(op);
}
/* ========================================================================
@@ -1348,38 +1865,38 @@ void forwardHHObjptr(
/* compute object size and bytes to be copied */
tag = computeObjectCopyParameters(s,
+ getHeader(p),
p,
&objectBytes,
©Bytes,
&metaDataBytes);
- switch (tag) {
+ switch (tag)
+ {
case STACK_TAG:
- args->stacksCopied++;
- break;
+ args->stacksCopied++;
+ break;
case WEAK_TAG:
- die(__FILE__ ":%d: "
- "forwardHHObjptr() does not support WEAK_TAG objects!",
- __LINE__);
- break;
+ die(__FILE__ ":%d: "
+ "forwardHHObjptr() does not support WEAK_TAG objects!",
+ __LINE__);
+ break;
default:
- break;
+ break;
}
- HM_HierarchicalHeap tgtHeap = args->toSpace[opDepth];
- if (tgtHeap == NULL) {
- /* Level does not exist, so create it */
- tgtHeap = HM_HH_new(s, opDepth);
- args->toSpace[opDepth] = tgtHeap;
- }
+ HM_HierarchicalHeap tgtHeap = toSpaceHH(s, args, opDepth);
assert(p == objptrToPointer(op, NULL));
/* use the forwarding pointer */
- *opp = relocateObject(s, op, tgtHeap, args);
+ bool relocateSuccess;
+ assert(!args->concurrent);
+ *opp = relocateObject(s, op, tgtHeap, args, &relocateSuccess);
+ assert(relocateSuccess);
}
LOG(LM_HH_COLLECTION, LL_DEBUGMORE,
- "opp "FMTPTR" set to "FMTOBJPTR,
+ "opp " FMTPTR " set to " FMTOBJPTR,
((uintptr_t)(opp)),
*opp);
}
@@ -1387,10 +1904,11 @@ void forwardHHObjptr(
pointer copyObject(pointer p,
size_t objectSize,
size_t copySize,
- HM_HierarchicalHeap tgtHeap) {
+ HM_HierarchicalHeap tgtHeap)
+{
-// check if you can add to existing chunk --> mightContain + size
-// If not, allocate new chunk and copy.
+ // check if you can add to existing chunk --> mightContain + size
+ // If not, allocate new chunk and copy.
assert(HM_HH_isLevelHead(tgtHeap));
assert(copySize <= objectSize);
@@ -1402,24 +1920,31 @@ pointer copyObject(pointer p,
bool mustExtend = false;
HM_chunk chunk = HM_getChunkListLastChunk(tgtChunkList);
- if(chunk == NULL || !chunk->mightContainMultipleObjects){
+ if (chunk == NULL || !chunk->mightContainMultipleObjects)
+ {
mustExtend = true;
}
- else {
+ else
+ {
pointer frontier = HM_getChunkFrontier(chunk);
pointer limit = HM_getChunkLimit(chunk);
assert(frontier <= limit);
mustExtend = ((size_t)(limit - frontier) < objectSize) ||
- (frontier + GC_SEQUENCE_METADATA_SIZE
- >= (pointer)chunk + HM_BLOCK_SIZE);
+ (frontier + GC_SEQUENCE_METADATA_SIZE >= (pointer)chunk + HM_BLOCK_SIZE);
}
- if (mustExtend) {
+ if (mustExtend)
+ {
/* Need to allocate a new chunk. Safe to use the dechecker state of where
* the object came from, as all objects in the same heap can be safely
* reassigned to any dechecker state of that heap. */
- chunk = HM_allocateChunk(tgtChunkList, objectSize);
- if (NULL == chunk) {
+ chunk = HM_allocateChunkWithPurpose(
+ tgtChunkList,
+ objectSize,
+ BLOCK_FOR_HEAP_CHUNK);
+
+ if (NULL == chunk)
+ {
DIE("Ran out of space for Hierarchical Heap!");
}
chunk->decheckState = HM_getChunkOf(p)->decheckState;
@@ -1442,50 +1967,66 @@ pointer copyObject(pointer p,
return frontier;
}
+
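+/* Roll back a speculative copy: when installing the forwarding pointer fails in
+ * relocateObject (the object was pinned concurrently), retract the target
+ * chunk's frontier so the just-copied bytes are reclaimed. */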
+void delLastObj(objptr op, size_t objectSize, HM_HierarchicalHeap tgtHeap)
+{
+ HM_chunkList tgtChunkList = HM_HH_getChunkList(tgtHeap);
+ HM_chunk chunk = HM_getChunkOf(objptrToPointer(op, NULL));
+ assert(listContainsChunk(tgtChunkList, chunk));
+ HM_updateChunkFrontierInList(tgtChunkList, chunk, HM_getChunkFrontier(chunk) - objectSize);
+}
+
#endif /* MLTON_GC_INTERNAL_FUNCS */
-GC_objectTypeTag computeObjectCopyParameters(GC_state s, pointer p,
+GC_objectTypeTag computeObjectCopyParameters(GC_state s,
+ GC_header header,
+ pointer p,
size_t *objectSize,
size_t *copySize,
- size_t *metaDataSize) {
- GC_header header;
- GC_objectTypeTag tag;
- uint16_t bytesNonObjptrs;
- uint16_t numObjptrs;
- header = getHeader(p);
- splitHeader(s, header, &tag, NULL, &bytesNonObjptrs, &numObjptrs);
-
- /* Compute the space taken by the metadata and object body. */
- if ((NORMAL_TAG == tag) or (WEAK_TAG == tag)) { /* Fixed size object. */
- if (WEAK_TAG == tag) {
- die(__FILE__ ":%d: "
- "computeObjectSizeAndCopySize() #define does not support"
- " WEAK_TAG objects!",
- __LINE__);
- }
- *metaDataSize = GC_NORMAL_METADATA_SIZE;
- *objectSize = bytesNonObjptrs + (numObjptrs * OBJPTR_SIZE);
- *copySize = *objectSize;
- } else if (SEQUENCE_TAG == tag) {
- *metaDataSize = GC_SEQUENCE_METADATA_SIZE;
- *objectSize = sizeofSequenceNoMetaData (s, getSequenceLength (p),
- bytesNonObjptrs, numObjptrs);
- *copySize = *objectSize;
- } else {
- /* Stack. */
- // bool current;
- // size_t reservedNew;
- GC_stack stack;
-
- assert (STACK_TAG == tag);
- *metaDataSize = GC_STACK_METADATA_SIZE;
- stack = (GC_stack)p;
-
- /* SAM_NOTE:
- * I am disabling shrinking here because it assumes that
- * the stack is going to be copied, which doesn't work with the
- * "stacks-in-their-own-chunks" strategy.
- */
+ size_t *metaDataSize)
+{
+ GC_objectTypeTag tag;
+ uint16_t bytesNonObjptrs;
+ uint16_t numObjptrs;
+ splitHeader(s, header, &tag, NULL, &bytesNonObjptrs, &numObjptrs);
+
+ /* Compute the space taken by the metadata and object body. */
+ if ((NORMAL_TAG == tag) or (WEAK_TAG == tag))
+ { /* Fixed size object. */
+ if (WEAK_TAG == tag)
+ {
+ die(__FILE__ ":%d: "
+ "computeObjectSizeAndCopySize() #define does not support"
+ " WEAK_TAG objects!",
+ __LINE__);
+ }
+ *metaDataSize = GC_NORMAL_METADATA_SIZE;
+ *objectSize = bytesNonObjptrs + (numObjptrs * OBJPTR_SIZE);
+ *copySize = *objectSize;
+ }
+ else if (SEQUENCE_TAG == tag)
+ {
+ *metaDataSize = GC_SEQUENCE_METADATA_SIZE;
+ *objectSize = sizeofSequenceNoMetaData(s, getSequenceLength(p),
+ bytesNonObjptrs, numObjptrs);
+ *copySize = *objectSize;
+ }
+ else
+ {
+ /* Stack. */
+ // bool current;
+ // size_t reservedNew;
+ GC_stack stack;
+
+ assert(STACK_TAG == tag);
+ *metaDataSize = GC_STACK_METADATA_SIZE;
+ stack = (GC_stack)p;
+
+ /* SAM_NOTE:
+ * I am disabling shrinking here because it assumes that
+ * the stack is going to be copied, which doesn't work with the
+ * "stacks-in-their-own-chunks" strategy.
+ */
#if 0
/* RAM_NOTE: This changes with non-local collection */
/* Check if the pointer is the current stack of my processor. */
@@ -1502,34 +2043,37 @@ GC_objectTypeTag computeObjectCopyParameters(GC_state s, pointer p,
stack->reserved = reservedNew;
}
#endif
- *objectSize = sizeof (struct GC_stack) + stack->reserved;
- *copySize = sizeof (struct GC_stack) + stack->used;
- }
+ *objectSize = sizeof(struct GC_stack) + stack->reserved;
+ *copySize = sizeof(struct GC_stack) + stack->used;
+ }
- *objectSize += *metaDataSize;
- *copySize += *metaDataSize;
+ *objectSize += *metaDataSize;
+ *copySize += *metaDataSize;
- return tag;
+ return tag;
}
-
bool skipStackAndThreadObjptrPredicate(GC_state s,
pointer p,
- void* rawArgs) {
+ void *rawArgs)
+{
/* silence compliler */
((void)(s));
/* extract expected stack */
- LOCAL_USED_FOR_ASSERT const struct SSATOPredicateArgs* args =
- ((struct SSATOPredicateArgs*)(rawArgs));
+ LOCAL_USED_FOR_ASSERT const struct SSATOPredicateArgs *args =
+ ((struct SSATOPredicateArgs *)(rawArgs));
/* run through FALSE cases */
GC_header header;
header = getHeader(p);
- if (header == GC_STACK_HEADER) {
+ if (header == GC_STACK_HEADER)
+ {
assert(args->expectedStackPointer == p);
return FALSE;
- } else if (header == GC_THREAD_HEADER) {
+ }
+ else if (header == GC_THREAD_HEADER)
+ {
assert(args->expectedThreadPointer == p);
return FALSE;
}
@@ -1540,10 +2084,11 @@ bool skipStackAndThreadObjptrPredicate(GC_state s,
#if ASSERT
void checkRememberedEntry(
- __attribute__((unused)) GC_state s,
- HM_remembered remElem,
- void* args)
+ __attribute__((unused)) GC_state s,
+ HM_remembered remElem,
+ void *args)
{
+ return;
objptr object = remElem->object;
HM_HierarchicalHeap hh = (HM_HierarchicalHeap)args;
@@ -1557,11 +2102,14 @@ void checkRememberedEntry(
assert(HM_getLevelHead(theChunk) == hh);
assert(!hasFwdPtr(objptrToPointer(object, NULL)));
- assert(!hasFwdPtr(objptrToPointer(remElem->from, NULL)));
+ if (remElem->from != BOGUS_OBJPTR)
+ {
+ assert(!hasFwdPtr(objptrToPointer(remElem->from, NULL)));
- HM_chunk fromChunk = HM_getChunkOf(objptrToPointer(remElem->from, NULL));
- HM_HierarchicalHeap fromHH = HM_getLevelHead(fromChunk);
- assert(HM_HH_getDepth(fromHH) <= HM_HH_getDepth(hh));
+ HM_chunk fromChunk = HM_getChunkOf(objptrToPointer(remElem->from, NULL));
+ HM_HierarchicalHeap fromHH = HM_getLevelHead(fromChunk);
+ assert(HM_HH_getDepth(fromHH) <= HM_HH_getDepth(hh));
+ }
}
bool hhContainsChunk(HM_HierarchicalHeap hh, HM_chunk theChunk)
diff --git a/runtime/gc/hierarchical-heap-collection.h b/runtime/gc/hierarchical-heap-collection.h
index 8999c5031..a5014f559 100644
--- a/runtime/gc/hierarchical-heap-collection.h
+++ b/runtime/gc/hierarchical-heap-collection.h
@@ -14,22 +14,25 @@
* Definition of the HierarchicalHeap collection interface
*/
-
#ifndef HIERARCHICAL_HEAP_COLLECTION_H_
#define HIERARCHICAL_HEAP_COLLECTION_H_
#include "chunk.h"
+#include "cc-work-list.h"
-#if (defined (MLTON_GC_INTERNAL_TYPES))
-struct ForwardHHObjptrArgs {
- struct HM_HierarchicalHeap* hh;
+#if (defined(MLTON_GC_INTERNAL_TYPES))
+struct ForwardHHObjptrArgs
+{
+ struct HM_HierarchicalHeap *hh;
uint32_t minDepth;
uint32_t maxDepth;
uint32_t toDepth; /* if == HM_HH_INVALID_DEPTH, preserve level of the forwarded object */
/* arrays of HH objects, e.g. HM_HH_getDepth(toSpace[i]) == i */
- HM_HierarchicalHeap* fromSpace;
- HM_HierarchicalHeap* toSpace;
+ HM_HierarchicalHeap *fromSpace;
+ HM_HierarchicalHeap *toSpace;
+ pointer *toSpaceStart;
+ HM_chunk *toSpaceStartChunk;
/* an array of pinned chunklists */
struct HM_chunkList *pinned;
@@ -37,18 +40,24 @@ struct ForwardHHObjptrArgs {
objptr containingObject;
size_t bytesCopied;
+ size_t entangledBytes;
uint64_t objectsCopied;
uint64_t stacksCopied;
/* large objects are "moved" (rather than copied). */
size_t bytesMoved;
uint64_t objectsMoved;
+
+  /* worklist for mark and scan */
+ struct CC_workList worklist;
+ bool concurrent;
};
-struct checkDEDepthsArgs {
+struct checkDEDepthsArgs
+{
int32_t minDisentangledDepth;
- HM_HierarchicalHeap* fromSpace;
- HM_HierarchicalHeap* toSpace;
+ HM_HierarchicalHeap *fromSpace;
+ HM_HierarchicalHeap *toSpace;
uint32_t maxDepth;
};
@@ -56,10 +65,10 @@ struct checkDEDepthsArgs {
#endif /* MLTON_GC_INTERNAL_TYPES */
-#if (defined (MLTON_GC_INTERNAL_BASIS))
+#if (defined(MLTON_GC_INTERNAL_BASIS))
#endif /* MLTON_GC_INTERNAL_BASIS */
-#if (defined (MLTON_GC_INTERNAL_FUNCS))
+#if (defined(MLTON_GC_INTERNAL_FUNCS))
/**
* This function performs a local collection on the current hierarchical heap
*/
@@ -73,12 +82,12 @@ void HM_HHC_collectLocal(uint32_t desiredScope);
* @param opp The objptr to forward
* @param args The struct ForwardHHObjptrArgs* for this call, cast as a void*
*/
-void forwardHHObjptr(GC_state s, objptr* opp, objptr op, void* rawArgs);
+void forwardHHObjptr(GC_state s, objptr *opp, objptr op, void *rawArgs);
/* check if `op` is in args->toSpace[depth(op)] */
bool isObjptrInToSpace(objptr op, struct ForwardHHObjptrArgs *args);
-objptr relocateObject(GC_state s, objptr obj, HM_HierarchicalHeap tgtHeap, struct ForwardHHObjptrArgs *args);
+objptr relocateObject(GC_state s, objptr obj, HM_HierarchicalHeap tgtHeap, struct ForwardHHObjptrArgs *args, bool *relocSuccess);
pointer copyObject(pointer p, size_t objectSize, size_t copySize, HM_HierarchicalHeap tgtHeap);
#endif /* MLTON_GC_INTERNAL_FUNCS */
diff --git a/runtime/gc/hierarchical-heap-ebr.c b/runtime/gc/hierarchical-heap-ebr.c
index 8aa1bb666..ff96120ba 100644
--- a/runtime/gc/hierarchical-heap-ebr.c
+++ b/runtime/gc/hierarchical-heap-ebr.c
@@ -1,4 +1,5 @@
/* Copyright (C) 2021 Sam Westrick
+ * Copyright (C) 2022 Jatin Arora
*
* MLton is released under a HPND-style license.
* See the file MLton-LICENSE for details.
@@ -6,153 +7,28 @@
#if (defined (MLTON_GC_INTERNAL_FUNCS))
-/** Helpers for packing/unpacking announcements. DEBRA packs epochs with a
- * "quiescent" bit, the idea being that processors should set the bit during
- * quiescent periods (between operations) and have it unset otherwise (i.e.
- * during an operation). Being precise about quiescent periods in this way
- * is helpful for reclamation, because in order to advance the epoch, all we
- * need to know is that every processor has been in a quiescent period since
- * the beginning of the last epoch.
- *
- * But note that updating the quiescent bits is only efficient if we can
- * amortize the cost of the setting/unsetting the bit with other nearby
- * operations. If we assumed that the typical state for each processor
- * is quiescent and then paid for non-quiescent periods, this would
- * be WAY too expensive. In our case, processors are USUALLY NON-QUIESCENT,
- * due to depth queries at the write-barrier.
- *
- * So
- */
-#define PACK(epoch, qbit) ((((size_t)(epoch)) << 1) | ((qbit) & 1))
-#define UNPACK_EPOCH(announcement) ((announcement) >> 1)
-#define UNPACK_QBIT(announcement) ((announcement) & 1)
-#define SET_Q_TRUE(announcement) ((announcement) | (size_t)1)
-#define SET_Q_FALSE(announcement) ((announcement) & (~(size_t)1))
-
-#define ANNOUNCEMENT_PADDING 16
-
-static inline size_t getAnnouncement(GC_state s, uint32_t pid) {
- return s->hhEBR->announce[ANNOUNCEMENT_PADDING*pid];
-}
-
-static inline void setAnnouncement(GC_state s, uint32_t pid, size_t ann) {
- s->hhEBR->announce[ANNOUNCEMENT_PADDING*pid] = ann;
-}
-
void HH_EBR_enterQuiescentState(GC_state s) {
- uint32_t mypid = s->procNumber;
- setAnnouncement(s, mypid, SET_Q_TRUE(getAnnouncement(s, mypid)));
+ EBR_enterQuiescentState(s, s->hhEBR);
}
-static void rotateAndReclaim(GC_state s) {
- HH_EBR_shared ebr = s->hhEBR;
- uint32_t mypid = s->procNumber;
-
- int limboIdx = (ebr->local[mypid].limboIdx + 1) % 3;
- ebr->local[mypid].limboIdx = limboIdx;
- HM_chunkList limboBag = &(ebr->local[mypid].limboBags[limboIdx]);
-
- // Free all HH records in the limbo bag.
- for (HM_chunk chunk = HM_getChunkListFirstChunk(limboBag);
- NULL != chunk;
- chunk = chunk->nextChunk)
- {
- for (pointer p = HM_getChunkStart(chunk);
- p < HM_getChunkFrontier(chunk);
- p += sizeof(HM_UnionFindNode *))
- {
- freeFixedSize(getUFAllocator(s), *(HM_UnionFindNode*)p);
- }
- }
-
- HM_freeChunksInList(s, limboBag);
- HM_initChunkList(limboBag); // clear it out
+void freeUnionFind (GC_state s, void *ptr) {
+ HM_UnionFindNode hufp = (HM_UnionFindNode)ptr;
+ assert(hufp->payload != NULL);
+ freeFixedSize(getHHAllocator(s), hufp->payload);
+ freeFixedSize(getUFAllocator(s), hufp);
}
-
void HH_EBR_init(GC_state s) {
- HH_EBR_shared ebr = malloc(sizeof(struct HH_EBR_shared));
- s->hhEBR = ebr;
-
- ebr->epoch = 0;
- ebr->announce =
- malloc(s->numberOfProcs * ANNOUNCEMENT_PADDING * sizeof(size_t));
- ebr->local =
- malloc(s->numberOfProcs * sizeof(struct HH_EBR_local));
-
- for (uint32_t i = 0; i < s->numberOfProcs; i++) {
- // Everyone starts by announcing epoch = 0 and is non-quiescent
- setAnnouncement(s, i, PACK(0,0));
- ebr->local[i].limboIdx = 0;
- ebr->local[i].checkNext = 0;
- for (int j = 0; j < 3; j++)
- HM_initChunkList(&(ebr->local[i].limboBags[j]));
- }
+ s->hhEBR = EBR_new(s, &freeUnionFind);
}
void HH_EBR_leaveQuiescentState(GC_state s) {
- HH_EBR_shared ebr = s->hhEBR;
- uint32_t mypid = s->procNumber;
- uint32_t numProcs = s->numberOfProcs;
-
- size_t globalEpoch = ebr->epoch;
- size_t myann = getAnnouncement(s, mypid);
- size_t myEpoch = UNPACK_EPOCH(myann);
- assert(globalEpoch >= myEpoch);
-
- if (myEpoch != globalEpoch) {
- ebr->local[mypid].checkNext = 0;
- /** Advance into the current epoch. To do so, we need to clear the limbo
- * bag of the epoch we're moving into.
- */
- rotateAndReclaim(s);
- }
-
- uint32_t otherpid = (ebr->local[mypid].checkNext) % numProcs;
- size_t otherann = getAnnouncement(s, otherpid);
- if ( UNPACK_EPOCH(otherann) == globalEpoch || UNPACK_QBIT(otherann) ) {
- uint32_t c = ++ebr->local[mypid].checkNext;
- if (c >= numProcs) {
- __sync_val_compare_and_swap(&(ebr->epoch), globalEpoch, globalEpoch+1);
- }
- }
-
- setAnnouncement(s, mypid, PACK(globalEpoch, 0));
+ EBR_leaveQuiescentState(s, s->hhEBR);
}
-
void HH_EBR_retire(GC_state s, HM_UnionFindNode hhuf) {
- HH_EBR_shared ebr = s->hhEBR;
- uint32_t mypid = s->procNumber;
- int limboIdx = ebr->local[mypid].limboIdx;
- HM_chunkList limboBag = &(ebr->local[mypid].limboBags[limboIdx]);
- HM_chunk chunk = HM_getChunkListLastChunk(limboBag);
-
- // fast path: bump frontier in chunk
-
- if (NULL != chunk &&
- HM_getChunkSizePastFrontier(chunk) >= sizeof(HM_UnionFindNode *))
- {
- pointer p = HM_getChunkFrontier(chunk);
- *(HM_UnionFindNode *)p = hhuf;
- HM_updateChunkFrontierInList(limboBag, chunk, p + sizeof(HM_UnionFindNode *));
- return;
- }
-
- // slow path: allocate new chunk
-
- chunk = HM_allocateChunk(limboBag, sizeof(HM_UnionFindNode *));
-
- assert(NULL != chunk &&
- HM_getChunkSizePastFrontier(chunk) >= sizeof(HM_UnionFindNode *));
-
- pointer p = HM_getChunkFrontier(chunk);
- *(HM_UnionFindNode *)p = hhuf;
- HM_updateChunkFrontierInList(limboBag, chunk, p + sizeof(HM_UnionFindNode *));
- return;
+ EBR_retire(s, s->hhEBR, (void *)hhuf);
}
-
-
#endif // MLTON_GC_INTERNAL_FUNCS
diff --git a/runtime/gc/hierarchical-heap-ebr.h b/runtime/gc/hierarchical-heap-ebr.h
index 541528684..9e8792d45 100644
--- a/runtime/gc/hierarchical-heap-ebr.h
+++ b/runtime/gc/hierarchical-heap-ebr.h
@@ -10,34 +10,6 @@
#ifndef HIERARCHICAL_HEAP_EBR_H_
#define HIERARCHICAL_HEAP_EBR_H_
-#if (defined (MLTON_GC_INTERNAL_TYPES))
-
-struct HH_EBR_local {
- struct HM_chunkList limboBags[3];
- int limboIdx;
- uint32_t checkNext;
-} __attribute__((aligned(128)));
-
-// There is exactly one of these! Everyone shares a reference to it.
-typedef struct HH_EBR_shared {
- size_t epoch;
-
- // announcement array, length = num procs
- // each announcement is packed: 63 bits for epoch, 1 bit for quiescent bit
- size_t *announce;
-
- // processor-local data, length = num procs
- struct HH_EBR_local *local;
-} * HH_EBR_shared;
-
-#else
-
-struct HH_EBR_local;
-struct HH_EBR_shared;
-typedef struct HH_EBR_shared * HH_EBR_shared;
-
-#endif // MLTON_GC_INTERNAL_TYPES
-
#if (defined (MLTON_GC_INTERNAL_FUNCS))
void HH_EBR_init(GC_state s);
diff --git a/runtime/gc/hierarchical-heap.c b/runtime/gc/hierarchical-heap.c
index 492c69276..987cb26f1 100644
--- a/runtime/gc/hierarchical-heap.c
+++ b/runtime/gc/hierarchical-heap.c
@@ -230,16 +230,17 @@ HM_HierarchicalHeap HM_HH_zip(
if (depth1 == depth2)
{
- HM_appendChunkList(HM_HH_getChunkList(hh1), HM_HH_getChunkList(hh2));
- HM_appendChunkList(HM_HH_getRemSet(hh1), HM_HH_getRemSet(hh2));
- ES_move(HM_HH_getSuspects(hh1), HM_HH_getSuspects(hh2));
- linkCCChains(s, hh1, hh2);
-
// This has to happen before linkInto (which frees hh2)
HM_HierarchicalHeap hh2anc = hh2->nextAncestor;
CC_freeStack(s, HM_HH_getConcurrentPack(hh2));
+ linkCCChains(s, hh1, hh2);
linkInto(s, hh1, hh2);
+ HM_appendChunkList(HM_HH_getChunkList(hh1), HM_HH_getChunkList(hh2));
+ ES_move(HM_HH_getSuspects(hh1), HM_HH_getSuspects(hh2));
+ HM_appendRemSet(HM_HH_getRemSet(hh1), HM_HH_getRemSet(hh2));
+
*cursor = hh1;
cursor = &(hh1->nextAncestor);
@@ -339,6 +340,26 @@ void HM_HH_merge(
assertInvariants(parentThread);
}
+
+void HM_HH_clearSuspectsAtDepth(
+ GC_state s,
+ GC_thread thread,
+ uint32_t targetDepth)
+{
+ // walk to find heap; only clear suspects at the target depth
+ for (HM_HierarchicalHeap cursor = thread->hierarchicalHeap;
+ NULL != cursor;
+ cursor = cursor->nextAncestor)
+ {
+ uint32_t d = HM_HH_getDepth(cursor);
+ if (d <= targetDepth) {
+ if (d == targetDepth) ES_clear(s, cursor);
+ return;
+ }
+ }
+}
+
+
void HM_HH_promoteChunks(
GC_state s,
GC_thread thread)
@@ -349,7 +370,6 @@ void HM_HH_promoteChunks(
{
/* no need to do anything; this function only guarantees that the
* current depth has been completely evacuated. */
- ES_clear(s, HM_HH_getSuspects(thread->hierarchicalHeap));
return;
}
@@ -378,18 +398,20 @@ void HM_HH_promoteChunks(
if (NULL == hh->subHeapForCC) {
assert(NULL == hh->subHeapCompletedCC);
+ /* don't need the snapshot for this heap now. */
+ CC_freeStack(s, HM_HH_getConcurrentPack(hh));
+ linkCCChains(s, parent, hh);
+ linkInto(s, parent, hh);
+
HM_appendChunkList(HM_HH_getChunkList(parent), HM_HH_getChunkList(hh));
- HM_appendChunkList(HM_HH_getRemSet(parent), HM_HH_getRemSet(hh));
ES_move(HM_HH_getSuspects(parent), HM_HH_getSuspects(hh));
- linkCCChains(s, parent, hh);
+ HM_appendRemSet(HM_HH_getRemSet(parent), HM_HH_getRemSet(hh));
/* shortcut. */
thread->hierarchicalHeap = parent;
- /* don't need the snapshot for this heap now. */
- CC_freeStack(s, HM_HH_getConcurrentPack(hh));
- linkInto(s, parent, hh);
hh = parent;
}
- else {
+ else
+ {
assert(HM_getLevelHead(thread->currentChunk) == hh);
#if ASSERT
@@ -453,8 +475,7 @@ void HM_HH_promoteChunks(
assert(HM_HH_getDepth(hh) == currentDepth-1);
}
-
- ES_clear(s, HM_HH_getSuspects(thread->hierarchicalHeap));
+ assert(hh == thread->hierarchicalHeap);
#if ASSERT
assert(hh == thread->hierarchicalHeap);
@@ -465,6 +486,7 @@ void HM_HH_promoteChunks(
#endif
}
+
bool HM_HH_isLevelHead(HM_HierarchicalHeap hh)
{
return (NULL != hh)
@@ -503,7 +525,7 @@ HM_HierarchicalHeap HM_HH_new(GC_state s, uint32_t depth)
hh->heightDependants = 0;
HM_initChunkList(HM_HH_getChunkList(hh));
- HM_initChunkList(HM_HH_getRemSet(hh));
+ HM_initRemSet(HM_HH_getRemSet(hh));
HM_initChunkList(HM_HH_getSuspects(hh));
return hh;
@@ -583,7 +605,10 @@ bool HM_HH_extend(GC_state s, GC_thread thread, size_t bytesRequested)
hh = newhh;
}
- chunk = HM_allocateChunk(HM_HH_getChunkList(hh), bytesRequested);
+ chunk = HM_allocateChunkWithPurpose(
+ HM_HH_getChunkList(hh),
+ bytesRequested,
+ BLOCK_FOR_HEAP_CHUNK);
if (NULL == chunk) {
return FALSE;
@@ -596,6 +621,17 @@ bool HM_HH_extend(GC_state s, GC_thread thread, size_t bytesRequested)
#endif
chunk->levelHead = HM_HH_getUFNode(hh);
thread->currentChunk = chunk;
HM_HH_addRecentBytesAllocated(thread, HM_getChunkSize(chunk));
@@ -674,8 +710,11 @@ void splitHeapForCC(GC_state s, GC_thread thread) {
HM_HierarchicalHeap newHH = HM_HH_new(s, HM_HH_getDepth(hh));
thread->hierarchicalHeap = newHH;
- HM_chunk chunk =
- HM_allocateChunk(HM_HH_getChunkList(newHH), GC_HEAP_LIMIT_SLOP);
+ HM_chunk chunk = HM_allocateChunkWithPurpose(
+ HM_HH_getChunkList(newHH),
+ GC_HEAP_LIMIT_SLOP,
+ BLOCK_FOR_HEAP_CHUNK);
+
chunk->levelHead = HM_HH_getUFNode(newHH);
#ifdef DETECT_ENTANGLEMENT
@@ -755,12 +794,21 @@ void mergeCompletedCCs(GC_state s, HM_HierarchicalHeap hh) {
HM_HierarchicalHeap completed = hh->subHeapCompletedCC;
while (completed != NULL) {
HM_HierarchicalHeap next = completed->subHeapCompletedCC;
+
+ /* consider using max instead of addition */
HM_HH_getConcurrentPack(hh)->bytesSurvivedLastCollection +=
HM_HH_getConcurrentPack(completed)->bytesSurvivedLastCollection;
- HM_appendChunkList(HM_HH_getChunkList(hh), HM_HH_getChunkList(completed));
- HM_appendChunkList(HM_HH_getRemSet(hh), HM_HH_getRemSet(completed));
+
+ /*
+ HM_HH_getConcurrentPack(hh)->bytesSurvivedLastCollection =
+ max(HM_HH_getConcurrentPack(hh)->bytesSurvivedLastCollection,
+ HM_HH_getConcurrentPack(completed)->bytesSurvivedLastCollection);
+ */
+
CC_freeStack(s, HM_HH_getConcurrentPack(completed));
linkInto(s, hh, completed);
+ HM_appendChunkList(HM_HH_getChunkList(hh), HM_HH_getChunkList(completed));
+ HM_appendRemSet(HM_HH_getRemSet(hh), HM_HH_getRemSet(completed));
completed = next;
}
@@ -809,6 +857,8 @@ bool checkPolicyforRoot(
}
size_t bytesSurvived = HM_HH_getConcurrentPack(hh)->bytesSurvivedLastCollection;
+
+ /* consider removing this: */
for (HM_HierarchicalHeap cursor = hh->subHeapCompletedCC;
NULL != cursor;
cursor = cursor->subHeapCompletedCC)
@@ -858,7 +908,11 @@ objptr copyCurrentStack(GC_state s, GC_thread thread) {
assert(isStackReservedAligned(s, reserved));
size_t stackSize = sizeofStackWithMetaData(s, reserved);
- HM_chunk newChunk = HM_allocateChunk(HM_HH_getChunkList(hh), stackSize);
+ HM_chunk newChunk = HM_allocateChunkWithPurpose(
+ HM_HH_getChunkList(hh),
+ stackSize,
+ BLOCK_FOR_HEAP_CHUNK);
+
if (NULL == newChunk) {
DIE("Ran out of space to copy stack!");
}
@@ -940,7 +994,7 @@ void HM_HH_cancelCC(GC_state s, pointer threadp, pointer hhp) {
mainhh->subHeapForCC = heap->subHeapForCC;
HM_appendChunkList(HM_HH_getChunkList(mainhh), HM_HH_getChunkList(heap));
- HM_appendChunkList(HM_HH_getRemSet(mainhh), HM_HH_getRemSet(heap));
+ HM_appendRemSet(HM_HH_getRemSet(mainhh), HM_HH_getRemSet(heap));
linkInto(s, mainhh, heap);
@@ -953,7 +1007,7 @@ void HM_HH_cancelCC(GC_state s, pointer threadp, pointer hhp) {
HM_HH_getConcurrentPack(mainhh)->bytesSurvivedLastCollection +=
HM_HH_getConcurrentPack(completed)->bytesSurvivedLastCollection;
HM_appendChunkList(HM_HH_getChunkList(mainhh), HM_HH_getChunkList(completed));
- HM_appendChunkList(HM_HH_getRemSet(mainhh), HM_HH_getRemSet(completed));
+ HM_appendRemSet(HM_HH_getRemSet(mainhh), HM_HH_getRemSet(completed));
linkInto(s, mainhh, completed);
completed = next;
}
@@ -1168,6 +1222,32 @@ void HM_HH_addRootForCollector(GC_state s, HM_HierarchicalHeap hh, pointer p) {
}
}
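+
+/* Record remElem in the remembered set of hh. When conc is set, hh may be
+ * merged concurrently, so chase the union-find representative and retry
+ * until the element has been recorded in the current level head. */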
+void HM_HH_rememberAtLevel(HM_HierarchicalHeap hh, HM_remembered remElem, bool conc) {
+  assert(hh != NULL);
+  if (!conc) {
+    HM_remember(HM_HH_getRemSet(hh), remElem, conc);
+  } else {
+    HM_UnionFindNode cursor = HM_HH_getUFNode(hh);
+    while (true) {
+      while (NULL != cursor->representative) {
+        cursor = cursor->representative;
+      }
+      hh = cursor->payload;
+      if (hh == NULL) {
+        /* Raced with a join that changed the representative; retry. This
+         * should not happen as long as heaps are retired just like the
+         * union-find nodes. */
+        assert(false);
+        continue;
+      }
+      HM_remember(HM_HH_getRemSet(hh), remElem, conc);
+      if (NULL == cursor->representative) {
+        return;
+      }
+    }
+  }
+}
+
void HM_HH_freeAllDependants(
GC_state s,
@@ -1259,7 +1339,7 @@ void HM_HH_freeAllDependants(
/*******************************/
static inline void linkInto(
- GC_state s,
+ __attribute__((unused)) GC_state s,
HM_HierarchicalHeap left,
HM_HierarchicalHeap right)
{
@@ -1279,8 +1359,9 @@ static inline void linkInto(
assert(NULL == HM_HH_getUFNode(left)->dependant2);
- HM_HH_getUFNode(right)->payload = NULL;
- freeFixedSize(getHHAllocator(s), right);
+ // HM_HH_getUFNode(right)->payload = NULL;
+ // freeFixedSize(getHHAllocator(s), right);
+ // HH_EBR_retire(s, HM_HH_getUFNode(right));
assert(HM_HH_isLevelHead(left));
}
@@ -1307,9 +1388,7 @@ void assertInvariants(GC_thread thread)
NULL != chunk;
chunk = chunk->nextChunk)
{
- assert(HM_getLevelHead(chunk) == cursor);
- assert(chunk->disentangledDepth >= 1);
- }
+      assert(HM_getLevelHead(chunk) == cursor);
+    }
}
/* check sorted by depth */
diff --git a/runtime/gc/hierarchical-heap.h b/runtime/gc/hierarchical-heap.h
index 147377811..7add64a41 100644
--- a/runtime/gc/hierarchical-heap.h
+++ b/runtime/gc/hierarchical-heap.h
@@ -10,6 +10,7 @@
#include "chunk.h"
#include "concurrent-collection.h"
+#include "remembered-set.h"
#if (defined (MLTON_GC_INTERNAL_TYPES))
@@ -62,7 +63,7 @@ typedef struct HM_HierarchicalHeap {
struct HM_HierarchicalHeap *subHeapForCC;
struct HM_HierarchicalHeap *subHeapCompletedCC;
- struct HM_chunkList rememberedSet;
+ struct HM_remSet rememberedSet;
struct ConcurrentPackage concurrentPack;
struct HM_chunkList entanglementSuspects;
@@ -104,7 +105,7 @@ static inline HM_chunkList HM_HH_getChunkList(HM_HierarchicalHeap hh)
return &(hh->chunkList);
}
-static inline HM_chunkList HM_HH_getRemSet(HM_HierarchicalHeap hh)
+static inline HM_remSet HM_HH_getRemSet(HM_HierarchicalHeap hh)
{
return &(hh->rememberedSet);
}
@@ -122,6 +123,7 @@ bool HM_HH_isLevelHead(HM_HierarchicalHeap hh);
bool HM_HH_isCCollecting(HM_HierarchicalHeap hh);
void HM_HH_addRootForCollector(GC_state s, HM_HierarchicalHeap hh, pointer p);
+void HM_HH_rememberAtLevel(HM_HierarchicalHeap hh, HM_remembered remElem, bool conc);
void HM_HH_merge(GC_state s, GC_thread parent, GC_thread child);
void HM_HH_promoteChunks(GC_state s, GC_thread thread);
@@ -158,6 +160,11 @@ void HM_HH_resetList(pointer threadp);
void mergeCompletedCCs(GC_state s, HM_HierarchicalHeap hh);
+void HM_HH_clearSuspectsAtDepth(
+ GC_state s,
+ GC_thread thread,
+ uint32_t targetDepth);
+
/** Very fancy (constant-space) loop that frees each dependant union-find
* node of hh. Specifically, calls this on each dependant ufnode:
diff --git a/runtime/gc/init.c b/runtime/gc/init.c
index 96d87a826..a39c793d8 100644
--- a/runtime/gc/init.c
+++ b/runtime/gc/init.c
@@ -80,6 +80,49 @@ static size_t stringToBytes(const char *s) {
die ("Invalid @MLton/@mpl memory amount: %s.", s);
}
+
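+/* Parse a time spec of the form <number><unit>, where the unit is one of
+ * 's' (seconds), 'm' (milliseconds), 'u' (microseconds), or 'n' (nanoseconds),
+ * e.g. "10m" for ten milliseconds. */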
+static void stringToTime(const char *s, struct timespec *t) {
+ double d;
+ char *endptr;
+ size_t factor;
+
+ d = strtod (s, &endptr);
+ if (s == endptr)
+ goto bad;
+
+ switch (*endptr++) {
+ case 's':
+ factor = 1;
+ break;
+ case 'm':
+ factor = 1000;
+ break;
+ case 'u':
+ factor = 1000 * 1000;
+ break;
+ case 'n':
+ factor = 1000 * 1000 * 1000;
+ break;
+ default:
+ goto bad;
+ }
+
+ d /= (double)factor;
+ size_t sec = (size_t)d;
+ size_t nsec = (size_t)((d - (double)sec) * 1000000000.0);
+
+ unless (*endptr == '\0'
+ and 0.0 <= d)
+ goto bad;
+
+ t->tv_sec = sec;
+ t->tv_nsec = nsec;
+ return;
+
+bad:
+ die ("Invalid @MLton/@mpl time spec: %s.", s);
+}
+
/* ---------------------------------------------------------------- */
/* GC_init */
/* ---------------------------------------------------------------- */
@@ -327,6 +370,14 @@ int processAtMLton (GC_state s, int start, int argc, char **argv,
die("%s megablock-threshold must be at least 1", atName);
}
s->controls->megablockThreshold = xx;
+ } else if (0 == strcmp(arg, "block-usage-sample-interval")) {
+ i++;
+ if (i == argc || (0 == strcmp (argv[i], "--"))) {
+ die ("%s block-usage-sample-interval missing argument.", atName);
+ }
+ struct timespec tm;
+ stringToTime(argv[i++], &tm);
+ s->controls->blockUsageSampleInterval = tm;
} else if (0 == strcmp (arg, "collection-type")) {
i++;
if (i == argc || (0 == strcmp (argv[i], "--"))) {
@@ -479,7 +530,11 @@ int GC_init (GC_state s, int argc, char **argv) {
s->controls->emptinessFraction = 0.25;
s->controls->superblockThreshold = 7; // superblocks of 128 blocks
s->controls->megablockThreshold = 18;
- s->controls->manageEntanglement = FALSE;
+ s->controls->manageEntanglement = TRUE;
+
+ // default: sample block usage once a second
+ s->controls->blockUsageSampleInterval.tv_sec = 1;
+ s->controls->blockUsageSampleInterval.tv_nsec = 0;
/* Not arbitrary; should be at least the page size and must also respect the
* limit check coalescing amount in the compiler. */
@@ -509,8 +564,8 @@ int GC_init (GC_state s, int argc, char **argv) {
s->rootsLength = 0;
s->savedThread = BOGUS_OBJPTR;
- initFixedSizeAllocator(getHHAllocator(s), sizeof(struct HM_HierarchicalHeap));
- initFixedSizeAllocator(getUFAllocator(s), sizeof(struct HM_UnionFindNode));
+ initFixedSizeAllocator(getHHAllocator(s), sizeof(struct HM_HierarchicalHeap), BLOCK_FOR_HH_ALLOCATOR);
+ initFixedSizeAllocator(getUFAllocator(s), sizeof(struct HM_UnionFindNode), BLOCK_FOR_UF_ALLOCATOR);
s->numberDisentanglementChecks = 0;
s->signalHandlerThread = BOGUS_OBJPTR;
@@ -605,8 +660,10 @@ void GC_lateInit(GC_state s) {
HM_configChunks(s);
HH_EBR_init(s);
+ HM_EBR_init(s);
initLocalBlockAllocator(s, initGlobalBlockAllocator(s));
+ s->blockUsageSampler = newBlockUsageSampler(s);
s->nextChunkAllocSize = s->controls->allocChunkSize;
@@ -642,9 +699,11 @@ void GC_duplicate (GC_state d, GC_state s) {
d->wsQueueTop = BOGUS_OBJPTR;
d->wsQueueBot = BOGUS_OBJPTR;
initLocalBlockAllocator(d, s->blockAllocatorGlobal);
- initFixedSizeAllocator(getHHAllocator(d), sizeof(struct HM_HierarchicalHeap));
- initFixedSizeAllocator(getUFAllocator(d), sizeof(struct HM_UnionFindNode));
+ d->blockUsageSampler = s->blockUsageSampler;
+ initFixedSizeAllocator(getHHAllocator(d), sizeof(struct HM_HierarchicalHeap), BLOCK_FOR_HH_ALLOCATOR);
+ initFixedSizeAllocator(getUFAllocator(d), sizeof(struct HM_UnionFindNode), BLOCK_FOR_UF_ALLOCATOR);
d->hhEBR = s->hhEBR;
+ d->hmEBR = s->hmEBR;
d->nextChunkAllocSize = s->nextChunkAllocSize;
d->lastMajorStatistics = newLastMajorStatistics();
d->numberOfProcs = s->numberOfProcs;
diff --git a/runtime/gc/logger.c b/runtime/gc/logger.c
index d48a245ab..432ae0254 100644
--- a/runtime/gc/logger.c
+++ b/runtime/gc/logger.c
@@ -205,6 +205,7 @@ bool stringToLogModule(enum LogModule* module, const char* moduleString) {
struct Conversion conversions[] =
{{.string = "allocation", .module = LM_ALLOCATION},
+ {.string = "block-allocator", .module = LM_BLOCK_ALLOCATOR},
{.string = "chunk", .module = LM_CHUNK},
{.string = "chunk-pool", .module = LM_CHUNK_POOL},
{.string = "dfs-mark", .module = LM_DFS_MARK},
diff --git a/runtime/gc/logger.h b/runtime/gc/logger.h
index 5d7680a32..5a5cce8a7 100644
--- a/runtime/gc/logger.h
+++ b/runtime/gc/logger.h
@@ -18,6 +18,7 @@
enum LogModule {
LM_ALLOCATION,
+ LM_BLOCK_ALLOCATOR,
LM_CHUNK,
LM_CHUNK_POOL,
LM_DFS_MARK,
diff --git a/runtime/gc/new-object.c b/runtime/gc/new-object.c
index ed9e0d500..9083b7593 100644
--- a/runtime/gc/new-object.c
+++ b/runtime/gc/new-object.c
@@ -147,8 +147,16 @@ GC_thread newThreadWithHeap(
* yet. */
HM_HierarchicalHeap hh = HM_HH_new(s, depth);
- HM_chunk tChunk = HM_allocateChunk(HM_HH_getChunkList(hh), threadSize);
- HM_chunk sChunk = HM_allocateChunk(HM_HH_getChunkList(hh), stackSize);
+ HM_chunk tChunk = HM_allocateChunkWithPurpose(
+ HM_HH_getChunkList(hh),
+ threadSize,
+ BLOCK_FOR_HEAP_CHUNK);
+
+ HM_chunk sChunk = HM_allocateChunkWithPurpose(
+ HM_HH_getChunkList(hh),
+ stackSize,
+ BLOCK_FOR_HEAP_CHUNK);
+
if (NULL == sChunk || NULL == tChunk) {
DIE("Ran out of space for thread+stack allocation!");
}
diff --git a/runtime/gc/object.c b/runtime/gc/object.c
index 4dc652e89..91d6ccf91 100644
--- a/runtime/gc/object.c
+++ b/runtime/gc/object.c
@@ -36,7 +36,17 @@ GC_header* getHeaderp (pointer p) {
* Returns the header for the object pointed to by p.
*/
GC_header getHeader (pointer p) {
- return *(getHeaderp(p));
+ GC_header h = *(getHeaderp(p));
+ return h;
+}
+
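+/* Like getHeader, but if the header is (a chain of) forwarding pointers,
+ * follow it and return the header of the final object. */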
+GC_header getRacyHeader (pointer ptr) {
+ GC_header header = getHeader(ptr);
+ while (isFwdHeader(header)) {
+ ptr = (pointer) header;
+ header = getHeader(ptr);
+ }
+ return header;
}
/*
@@ -90,6 +100,20 @@ void splitHeader(GC_state s, GC_header header,
*numObjptrsRet = numObjptrs;
}
+static inline bool isMutableH(GC_state s, GC_header header) {
+ GC_objectTypeTag tag;
+ uint16_t bytesNonObjptrs;
+ uint16_t numObjptrs;
+ bool hasIdentity;
+ splitHeader(s, header, &tag, &hasIdentity, &bytesNonObjptrs, &numObjptrs);
+ return hasIdentity;
+}
+
+static inline bool isMutable(GC_state s, pointer p) {
+ GC_header header = getHeader(p);
+ return isMutableH(s, header);
+}
+
/* advanceToObjectData (s, p)
*
* If p points at the beginning of an object, then advanceToObjectData
diff --git a/runtime/gc/object.h b/runtime/gc/object.h
index 0df5ce702..1250b2abe 100644
--- a/runtime/gc/object.h
+++ b/runtime/gc/object.h
@@ -82,7 +82,8 @@ COMPILE_TIME_ASSERT(sizeof_objptr__eq__sizeof_header,
static inline GC_header* getHeaderp (pointer p);
static inline GC_header getHeader (pointer p);
-static inline GC_header buildHeaderFromTypeIndex (uint32_t t);
+static inline GC_header getRacyHeader (pointer p);
+static inline GC_header buildHeaderFromTypeIndex(uint32_t t);
#endif /* (defined (MLTON_GC_INTERNAL_FUNCS)) */
@@ -180,6 +181,8 @@ enum {
static inline void splitHeader (GC_state s, GC_header header,
GC_objectTypeTag *tagRet, bool *hasIdentityRet,
uint16_t *bytesNonObjptrsRet, uint16_t *numObjptrsRet);
+static inline bool isMutable(GC_state s, pointer p);
+static inline bool isMutableH(GC_state s, GC_header h);
static inline pointer advanceToObjectData (GC_state s, pointer p);
static inline size_t objectSize(GC_state s, pointer p);
diff --git a/runtime/gc/pin.c b/runtime/gc/pin.c
index b564d5c0c..fca2c138f 100644
--- a/runtime/gc/pin.c
+++ b/runtime/gc/pin.c
@@ -6,51 +6,133 @@
#if (defined (MLTON_GC_INTERNAL_FUNCS))
-bool pinObject(objptr op, uint32_t unpinDepth)
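+/* Decode the pin bits of a header. A header without the valid-header bit
+ * (e.g. a forwarding pointer) is treated as PIN_NONE. */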
+enum PinType pinType(GC_header h) {
+ if (0 == (h & GC_VALID_HEADER_MASK))
+ return PIN_NONE;
+
+ int t = ((h & PIN_MASK) >> PIN_SHIFT);
+ if (t == 0)
+ return PIN_NONE;
+ else if (t == 2)
+ return PIN_DOWN;
+ else if (t == 3)
+ return PIN_ANY;
+ else
+ DIE("NOT supposed to reach here!");
+}
+
+enum PinType maxPT(enum PinType pt1, enum PinType pt2) {
+ if (pt1 == PIN_NONE)
+ return pt2;
+ else if (pt1 == PIN_ANY || pt2 == PIN_ANY)
+ return PIN_ANY;
+ else
+ return PIN_DOWN;
+}
+
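+/* Inverse of pinType: the header bits that encode the given pin type
+ * (0x2 or 0x3 shifted into the PIN_MASK position). */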
+static inline GC_header getRep(enum PinType pt)
+{
+ GC_header h;
+ if (pt == PIN_NONE)
+ h = 0;
+ else if (pt == PIN_DOWN)
+ h = 0x20000000;
+ else
+ h = 0x30000000;
+ return h;
+}
+
+uint32_t unpinDepthOfH(GC_header h)
+{
+ return (h & UNPIN_DEPTH_MASK) >> UNPIN_DEPTH_SHIFT;
+}
+
+bool pinObject(GC_state s, objptr op, uint32_t unpinDepth, enum PinType pt)
+{
+ bool a, b;
+ pinObjectInfo(s, op, unpinDepth, pt, &a, &b);
+ return a;
+}
+
+objptr pinObjectInfo(
+ GC_state s,
+ objptr op,
+ uint32_t unpinDepth,
+ enum PinType pt,
+ bool *headerChange,
+ bool *pinChange)
{
pointer p = objptrToPointer(op, NULL);
+ assert(pt != PIN_NONE);
+
+  /* initialize with false */
+ *headerChange = false;
+ *pinChange = false;
uint32_t maxUnpinDepth = TWOPOWER(UNPIN_DEPTH_BITS) - 1;
- if (unpinDepth > maxUnpinDepth) {
- DIE("unpinDepth %"PRIu32" exceeds max possible value %"PRIu32,
+ if (unpinDepth > maxUnpinDepth)
+ {
+ DIE("unpinDepth %" PRIu32 " exceeds max possible value %" PRIu32,
unpinDepth,
maxUnpinDepth);
- return FALSE;
+ return op;
}
assert(
- ((GC_header)unpinDepth) << UNPIN_DEPTH_SHIFT
- == (UNPIN_DEPTH_MASK & ((GC_header)unpinDepth) << UNPIN_DEPTH_SHIFT)
- );
-
- while (TRUE) {
+    ((GC_header)unpinDepth) << UNPIN_DEPTH_SHIFT
+      == (UNPIN_DEPTH_MASK & ((GC_header)unpinDepth) << UNPIN_DEPTH_SHIFT));
+ while (true)
+ {
GC_header header = getHeader(p);
+ uint32_t newUnpinDepth;
+ if (isFwdHeader(header))
+ {
+ assert(pt != PIN_DOWN);
+ op = getFwdPtr(p);
+ p = objptrToPointer(op, NULL);
+ continue;
+ }
+ else if (pinType(header) != PIN_NONE)
+ {
+ uint32_t previousUnpinDepth = unpinDepthOfH(header);
+ newUnpinDepth = min(previousUnpinDepth, unpinDepth);
+ } else {
+ newUnpinDepth = unpinDepth;
+ }
- bool notPinned = (0 == (header & PIN_MASK) >> PIN_SHIFT);
- uint32_t previousUnpinDepth =
- (header & UNPIN_DEPTH_MASK) >> UNPIN_DEPTH_SHIFT;
+    /* If we are lowering the unpinDepth, then the new pinType (nt) is the
+     * function argument pt. Otherwise it's the max of the two. */
+ enum PinType nt = newUnpinDepth < unpinDepthOfH(header) ? pt : maxPT(pt, pinType(header));
+ GC_header unpinnedHeader = header & (~UNPIN_DEPTH_MASK) & (~PIN_MASK);
GC_header newHeader =
- (header & (~UNPIN_DEPTH_MASK)) // clear unpin bits
- | ((GC_header)unpinDepth << UNPIN_DEPTH_SHIFT) // put in new unpinDepth
- | PIN_MASK; // set pin bit
-
- if (notPinned) {
- /* first, handle case where this object was not already pinned */
- if (__sync_bool_compare_and_swap(getHeaderp(p), header, newHeader))
- return TRUE;
+ unpinnedHeader
+ | ((GC_header)newUnpinDepth << UNPIN_DEPTH_SHIFT) // put in new unpinDepth
+ | getRep(nt); // setup the pin type
+
+    if (newHeader == header) {
+ assert (!hasFwdPtr(p));
+ return op;
}
else {
- /* if the object was previously pinned, we still need to do a writeMin */
- if (previousUnpinDepth <= unpinDepth)
- return FALSE;
-
- if (__sync_bool_compare_and_swap(getHeaderp(p), header, newHeader))
- return TRUE;
+ if (__sync_bool_compare_and_swap(getHeaderp(p), header, newHeader)) {
+ *headerChange = true;
+ bool didPinChange = (nt != pinType(header));
+ *pinChange = didPinChange;
+ if (nt == PIN_ANY && didPinChange) {
+ size_t sz = objectSize(s, p);
+ s->cumulativeStatistics->bytesPinnedEntangled += sz;
+ __sync_fetch_and_add(
+ &(s->cumulativeStatistics->currentPhaseBytesPinnedEntangled),
+ (uintmax_t)sz
+ );
+ }
+ assert (!hasFwdPtr(p));
+ assert(pinType(newHeader) == nt);
+ return op;
+ }
}
}
-
DIE("should be impossible to reach here");
- return FALSE;
+ return op;
}
void unpinObject(objptr op) {
@@ -72,14 +154,51 @@ bool isPinned(objptr op) {
* (otherwise, there could be a forward pointer in this spot)
* ...and then check the mark
*/
- return (1 == (h & GC_VALID_HEADER_MASK)) &&
- (1 == ((h & PIN_MASK) >> PIN_SHIFT));
+ bool result = (1 == (h & GC_VALID_HEADER_MASK)) &&
+ (((h & PIN_MASK) >> PIN_SHIFT) > 0);
+ assert (result == (pinType(h) != PIN_NONE));
+ return result;
}
uint32_t unpinDepthOf(objptr op) {
pointer p = objptrToPointer(op, NULL);
- uint32_t d = (getHeader(p) & UNPIN_DEPTH_MASK) >> UNPIN_DEPTH_SHIFT;
+ uint32_t d = unpinDepthOfH(getHeader(p));
return d;
}
+bool tryUnpinWithDepth(objptr op, uint32_t opDepth) {
+
+ pointer p = objptrToPointer(op, NULL);
+ GC_header header = getHeader(p);
+ uint32_t d = unpinDepthOfH(header);
+
+ if (d >= opDepth) {
+ GC_header newHeader =
+ getHeader(p)
+ & (~UNPIN_DEPTH_MASK) // clear counter bits
+ & (~PIN_MASK); // clear mark bit
+
+ return __sync_bool_compare_and_swap(getHeaderp(p), header, newHeader);
+ }
+ return false;
+}
+
+
+// bool tryPinDec(objptr op, uint32_t opDepth) {
+// pointer p = objptrToPointer(op, NULL);
+// GC_header header = getHeader(p);
+// uint32_t d = (header & UNPIN_DEPTH_MASK) >> UNPIN_DEPTH_SHIFT;
+
+// if (d >= opDepth && pinType(header) == PIN_ANY) {
+// GC_header newHeader =
+// getHeader(p)
+// & (~UNPIN_DEPTH_MASK) // clear counter bits
+// & (~PIN_MASK); // clear mark bit
+
+// return __sync_bool_compare_and_swap(getHeaderp(p), header, newHeader));
+// }
+
+// return false;
+// }
+
#endif
diff --git a/runtime/gc/pin.h b/runtime/gc/pin.h
index cc69a86e2..7591d3044 100644
--- a/runtime/gc/pin.h
+++ b/runtime/gc/pin.h
@@ -11,19 +11,25 @@
*
* +------+-------------------------+----------+--------------+
* header fields | mark | counter | type-tag | valid-header |
- * +------+-----------+-------------+----------+--------------+
- * sub-fields | | sus | pin | unpin-depth | | |
- * +------+-----+-----+-------------+----------+--------------+
- * ^ ^ ^ ^ ^ ^ ^
- * offsets 32 31 30 29 20 1 0
+ * +------+------------+-------------+----------+--------------+
+ * sub-fields | | sus | pin | unpin-depth | | |
+ * +------+-----+------+-------------+----------+--------------+
+ * ^ ^ ^ ^ ^ ^ ^
+ * offsets 32 31 30 28 20 1 0
*
*/
-#define UNPIN_DEPTH_BITS 9
-#define UNPIN_DEPTH_MASK ((GC_header)0x1FF00000)
+#define UNPIN_DEPTH_BITS 8
+#define UNPIN_DEPTH_MASK ((GC_header)0xFF00000)
#define UNPIN_DEPTH_SHIFT 20
-#define PIN_BITS 1
-#define PIN_MASK ((GC_header)0x20000000)
-#define PIN_SHIFT 29
+#define PIN_MASK ((GC_header)0x30000000)
+#define PIN_SHIFT 28
+
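+/* The three pin states stored in an object header (see pinType/getRep for
+ * the encoding). Objects pinned with PIN_ANY are counted towards
+ * bytesPinnedEntangled (see pinObjectInfo). */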
+enum PinType
+{
+ PIN_NONE,
+ PIN_DOWN,
+ PIN_ANY
+};
/* Pin this object, making it immovable (by GC) until it reaches
* unpinDepth (or shallower). Returns TRUE if the object was
@@ -32,11 +38,22 @@
* Note that regardless of whether or not the object was previously
* pinned, this does a writeMin on the unpinDepth of the object.
*/
-bool pinObject(objptr op, uint32_t unpinDepth);
+bool pinObject(GC_state s, objptr op, uint32_t unpinDepth, enum PinType pt);
+
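+/* Like pinObject, but follows forwarding pointers (returning the possibly
+ * forwarded objptr) and reports via headerChange/pinChange whether the
+ * header, respectively the pin type, was changed. */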
+objptr pinObjectInfo(
+ GC_state s,
+ objptr op,
+ uint32_t unpinDepth,
+ enum PinType pt,
+ bool* headerChange,
+ bool* pinChange);
/* check if an object is pinned */
bool isPinned(objptr op);
+/* Decode the pin state stored in an object header. */
+enum PinType pinType(GC_header header);
+
/* Unpin an object by clearing the mark and counter bits in its header.
* This is only safe if the object is not being concurrently pinned.
* As long as we only call this on objects that are local, it's safe.
@@ -45,5 +62,14 @@ void unpinObject(objptr op);
/* read the current unpin-depth of an object */
uint32_t unpinDepthOf(objptr op);
+uint32_t unpinDepthOfH(GC_header header);
+
+
+/* Unpin an object if its current unpinDepth allows it (i.e., the recorded
+ * depth is at least opDepth). The unpinDepth can change concurrently, so the
+ * check and the header update are done together via compare-and-swap.
+ * Returns true if the unpin succeeded, and false otherwise.
+ */
+bool tryUnpinWithDepth(objptr op, uint32_t opDepth);
+// bool tryPinDec(objptr op, uint32_t opDepth);
#endif
diff --git a/runtime/gc/remembered-set.c b/runtime/gc/remembered-set.c
index 3b5bb2932..f6ef64414 100644
--- a/runtime/gc/remembered-set.c
+++ b/runtime/gc/remembered-set.c
@@ -4,26 +4,32 @@
* See the file MLton-LICENSE for details.
*/
-void HM_remember(HM_chunkList remSet, HM_remembered remElem) {
- HM_storeInchunkList(remSet, (void*)remElem, sizeof(struct HM_remembered));
+void HM_initRemSet(HM_remSet remSet) {
+ HM_initChunkList(&(remSet->private));
+ CC_initConcList(&(remSet->public));
}
-void HM_rememberAtLevel(HM_HierarchicalHeap hh, HM_remembered remElem) {
- assert(hh != NULL);
- HM_remember(HM_HH_getRemSet(hh), remElem);
+void HM_remember(HM_remSet remSet, HM_remembered remElem, bool conc) {
+ if (!conc) {
+ HM_storeInChunkListWithPurpose(&(remSet->private), (void*)remElem, sizeof(struct HM_remembered), BLOCK_FOR_REMEMBERED_SET);
+ }
+ else {
+ CC_storeInConcListWithPurpose(&(remSet->public), (void *)remElem, sizeof(struct HM_remembered), BLOCK_FOR_REMEMBERED_SET);
+ }
}
-void HM_foreachRemembered(
+void HM_foreachPrivate(
GC_state s,
- HM_chunkList remSet,
+ HM_chunkList chunkList,
HM_foreachDownptrClosure f)
{
- assert(remSet != NULL);
- HM_chunk chunk = HM_getChunkListFirstChunk(remSet);
- while (chunk != NULL) {
+ HM_chunk chunk = HM_getChunkListFirstChunk(chunkList);
+ while (chunk != NULL)
+ {
pointer p = HM_getChunkStart(chunk);
pointer frontier = HM_getChunkFrontier(chunk);
- while (p < frontier) {
+ while (p < frontier && ((HM_remembered)p)->object != 0)
+ {
f->fun(s, (HM_remembered)p, f->env);
p += sizeof(struct HM_remembered);
}
@@ -31,10 +37,142 @@ void HM_foreachRemembered(
}
}
-size_t HM_numRemembered(HM_chunkList remSet) {
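+/* A "fishy" chunk is one that may still be growing when we scan it (it was
+ * the last chunk of the concurrent list, or is marked retireChunk). We record
+ * how far we scanned so that checkFishyChunks can re-scan it until the
+ * frontier stops moving. */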
+typedef struct FishyChunk
+{
+ HM_chunk chunk;
+ pointer scanned;
+} FishyChunk;
+
+void makeChunkFishy(FishyChunk * fc, HM_chunk chunk, pointer frontier, int* numFishyChunks) {
+ (fc[*numFishyChunks]).chunk = chunk;
+ (fc[*numFishyChunks]).scanned = frontier;
+ *numFishyChunks = *numFishyChunks + 1;
+ return;
+}
+
+FishyChunk * resizeFishyArray (FishyChunk * fishyChunks, int * currentSize) {
+ int cs = *currentSize;
+ int new_size = 2 * cs;
+ FishyChunk * fc = malloc(sizeof(struct FishyChunk) * new_size);
+ memcpy(fc, fishyChunks, sizeof(struct FishyChunk) * cs);
+ *currentSize = new_size;
+ free(fishyChunks);
+ return fc;
+}
+
+void checkFishyChunks(GC_state s,
+ FishyChunk * fishyChunks,
+ int numFishyChunks,
+ HM_foreachDownptrClosure f)
+{
+ if (fishyChunks == NULL) {
+ return;
+ }
+ bool changed = true;
+ while (changed) {
+ int i = numFishyChunks - 1;
+ changed = false;
+ while (i >= 0)
+ {
+ HM_chunk chunk = fishyChunks[i].chunk;
+ pointer p = fishyChunks[i].scanned;
+ pointer frontier = HM_getChunkFrontier(chunk);
+ while (TRUE)
+ {
+ while (p < frontier && ((HM_remembered)p)->object != 0)
+ {
+ f->fun(s, (HM_remembered)p, f->env);
+ p += sizeof(struct HM_remembered);
+ }
+ frontier = HM_getChunkFrontier(chunk);
+ if (p >= frontier) {
+ break;
+ }
+ }
+ if (p != fishyChunks[i].scanned) {
+ fishyChunks[i].scanned = p;
+ changed = true;
+ }
+ i --;
+ }
+ }
+}
+
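+/* Apply f to every element of the public (concurrent) part of the remembered
+ * set. With trackFishyChunks set, chunks that may be growing concurrently are
+ * re-scanned until they stabilize. Afterwards, the scanned chunks are moved
+ * into the private part. */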
+void HM_foreachPublic (
+ GC_state s,
+ HM_remSet remSet,
+ HM_foreachDownptrClosure f,
+ bool trackFishyChunks)
+{
+
+ if ((remSet->public).firstChunk == NULL) {
+ return;
+ }
+
+ if (!trackFishyChunks) {
+ struct HM_chunkList _chunkList;
+ HM_chunkList chunkList = &(_chunkList);
+ CC_popAsChunkList(&(remSet->public), chunkList);
+ HM_foreachPrivate(s, chunkList, f);
+ HM_appendChunkList(&(remSet->private), chunkList);
+ return;
+ }
+
+ HM_chunk chunk = (remSet->public).firstChunk;
+ HM_chunk lastChunk = CC_getLastChunk (&(remSet->public));
+ int array_size = 2 * s->numberOfProcs;
+ FishyChunk* fishyChunks = malloc(sizeof(struct FishyChunk) * array_size);
+ int numFishyChunks = 0;
+ while (chunk != NULL)
+ {
+ while (chunk != NULL)
+ {
+ pointer p = HM_getChunkStart(chunk);
+ pointer frontier = HM_getChunkFrontier(chunk);
+ while (p < frontier && ((HM_remembered)p)->object != 0)
+ {
+ f->fun(s, (HM_remembered)p, f->env);
+ p += sizeof(struct HM_remembered);
+ }
+ if ((chunk->retireChunk || chunk->nextChunk == NULL))
+ {
+ if (numFishyChunks >= array_size) {
+ fishyChunks = resizeFishyArray(fishyChunks, &array_size);
+ }
+ makeChunkFishy(fishyChunks, chunk, p, &numFishyChunks);
+ }
+ chunk = chunk->nextChunk;
+ }
+ checkFishyChunks(s, fishyChunks, numFishyChunks, f);
+    lastChunk = CC_getLastChunk(&(remSet->public));
+    HM_chunk lastFishy = fishyChunks[numFishyChunks - 1].chunk;
+    if (lastChunk != lastFishy) {
+      /* New chunks were appended concurrently; resume scanning at the chunk
+       * after the last one we have seen (note that `chunk` is NULL here). */
+      assert(lastFishy->nextChunk != NULL);
+      chunk = lastFishy->nextChunk;
+    }
+ }
+ free(fishyChunks);
+ struct HM_chunkList _chunkList;
+ HM_chunkList chunkList = &(_chunkList);
+ CC_popAsChunkList(&(remSet->public), chunkList);
+ HM_appendChunkList(&(remSet->private), chunkList);
+}
+
+void HM_foreachRemembered(
+ GC_state s,
+ HM_remSet remSet,
+ HM_foreachDownptrClosure f,
+ bool trackFishyChunks)
+{
+ assert(remSet != NULL);
+ HM_foreachPrivate(s, &(remSet->private), f);
+ HM_foreachPublic(s, remSet, f, trackFishyChunks);
+}
+
+
+size_t HM_numRemembered(HM_remSet remSet) {
assert(remSet != NULL);
size_t count = 0;
- HM_chunk chunk = HM_getChunkListFirstChunk(remSet);
+ HM_chunk chunk = HM_getChunkListFirstChunk(&(remSet->private)); // ignore public for now.
while (chunk != NULL) {
pointer p = HM_getChunkStart(chunk);
pointer frontier = HM_getChunkFrontier(chunk);
@@ -44,3 +182,13 @@ size_t HM_numRemembered(HM_chunkList remSet) {
return count;
}
+
+void HM_appendRemSet(HM_remSet r1, HM_remSet r2) {
+ HM_appendChunkList(&(r1->private), &(r2->private));
+ CC_appendConcList(&(r1->public), &(r2->public));
+}
+
+void HM_freeRemSetWithInfo(GC_state s, HM_remSet remSet, void* info) {
+ HM_freeChunksInListWithInfo(s, &(remSet->private), info, BLOCK_FOR_REMEMBERED_SET);
+ CC_freeChunksInConcListWithInfo(s, &(remSet->public), info, BLOCK_FOR_REMEMBERED_SET);
+}
diff --git a/runtime/gc/remembered-set.h b/runtime/gc/remembered-set.h
index c1859d3f6..ce0cfad40 100644
--- a/runtime/gc/remembered-set.h
+++ b/runtime/gc/remembered-set.h
@@ -9,6 +9,7 @@
#if (defined (MLTON_GC_INTERNAL_TYPES))
+#include "gc/concurrent-list.h"
/* Remembering that there exists a downpointer to this object. The unpin
* depth of the object will be stored in the object header. */
@@ -17,6 +18,24 @@ typedef struct HM_remembered {
objptr object;
} * HM_remembered;
+
+/* A remembered set has two parts: a private chunk list, accessed only by the
+ * owning processor, and a public concurrent list, which other processors may
+ * append to (see the `conc` flag of HM_remember). */
+typedef struct HM_remSet {
+ struct HM_chunkList private;
+ struct CC_concList public;
+} * HM_remSet;
+
typedef void (*HM_foreachDownptrFun)(GC_state s, HM_remembered remElem, void* args);
typedef struct HM_foreachDownptrClosure {
@@ -29,10 +48,14 @@ typedef struct HM_foreachDownptrClosure {
#if (defined (MLTON_GC_INTERNAL_BASIS))
-void HM_remember(HM_chunkList remSet, HM_remembered remElem);
-void HM_rememberAtLevel(HM_HierarchicalHeap hh, HM_remembered remElem);
-void HM_foreachRemembered(GC_state s, HM_chunkList remSet, HM_foreachDownptrClosure f);
-size_t HM_numRemembered(HM_chunkList remSet);
+void HM_initRemSet(HM_remSet remSet);
+void HM_freeRemSetWithInfo(GC_state s, HM_remSet remSet, void* info);
+void HM_remember(HM_remSet remSet, HM_remembered remElem, bool conc);
+void HM_appendRemSet(HM_remSet r1, HM_remSet r2);
+void HM_foreachRemembered(GC_state s, HM_remSet remSet, HM_foreachDownptrClosure f, bool trackFishyChunks);
+size_t HM_numRemembered(HM_remSet remSet);
+void HM_foreachPublic(GC_state s, HM_remSet remSet, HM_foreachDownptrClosure f, bool trackFishyChunks);
+void HM_foreachPrivate(GC_state s, HM_chunkList list, HM_foreachDownptrClosure f);
#endif /* defined (MLTON_GC_INTERNAL_BASIS) */
diff --git a/runtime/gc/sampler.c b/runtime/gc/sampler.c
new file mode 100644
index 000000000..bc7b7ac5f
--- /dev/null
+++ b/runtime/gc/sampler.c
@@ -0,0 +1,63 @@
+/* Copyright (C) 2022 Sam Westrick
+ *
+ * MLton is released under a HPND-style license.
+ * See the file MLton-LICENSE for details.
+ */
+
+
+void initSampler(
+ __attribute__((unused)) GC_state s,
+ Sampler samp,
+ SamplerClosure func,
+ struct timespec *desiredInterval)
+{
+ samp->func = *func;
+ samp->desiredInterval = *desiredInterval;
+ samp->currentEpoch = 0;
+ timespec_now(&(samp->absoluteStart));
+}
+
+
+static void timespec_mul(struct timespec *dst, size_t multiplier) {
+ size_t sec = dst->tv_sec;
+ size_t nsec = dst->tv_nsec;
+
+ size_t nps = 1000L * 1000 * 1000;
+
+ size_t new_nsec = (nsec * multiplier) % nps;
+ size_t add_sec = (nsec * multiplier) / nps;
+
+ dst->tv_sec = (sec * multiplier) + add_sec;
+ dst->tv_nsec = new_nsec;
+}
+
+
+static inline double timespec_to_seconds(struct timespec *tm) {
+ return (double)tm->tv_sec + ((double)tm->tv_nsec * 0.000000001);
+}
+
+
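+/* Check whether at least one sampling interval has elapsed since the last
+ * sample; if so, advance the epoch with a CAS (so that only one processor
+ * wins per epoch) and run the sampling function. */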
+void maybeSample(GC_state s, Sampler samp) {
+ size_t oldEpoch = samp->currentEpoch;
+
+ // compute the time of the last successful sample (relative to start)
+ struct timespec lastSample;
+ lastSample = samp->desiredInterval;
+ timespec_mul(&lastSample, oldEpoch);
+
+ // compare against current time by computing epoch diff
+ struct timespec now;
+ timespec_now(&now);
+ timespec_sub(&now, &(samp->absoluteStart));
+ double diff = timespec_to_seconds(&now) - timespec_to_seconds(&lastSample);
+ long epochDiff = (long)(diff / timespec_to_seconds(&samp->desiredInterval));
+
+ if (epochDiff < 1)
+ return;
+
+ size_t newEpoch = oldEpoch + epochDiff;
+
+ if (__sync_bool_compare_and_swap(&samp->currentEpoch, oldEpoch, newEpoch)) {
+ samp->func.fun(s, &now, samp->func.env);
+ }
+}
\ No newline at end of file
diff --git a/runtime/gc/sampler.h b/runtime/gc/sampler.h
new file mode 100644
index 000000000..c7f9b88c2
--- /dev/null
+++ b/runtime/gc/sampler.h
@@ -0,0 +1,34 @@
+/* Copyright (C) 2022 Sam Westrick
+ *
+ * MLton is released under a HPND-style license.
+ * See the file MLton-LICENSE for details.
+ */
+
+#ifndef SAMPLER_H_
+#define SAMPLER_H_
+
+#if (defined (MLTON_GC_INTERNAL_FUNCS))
+
+typedef void (*SamplerFun) (GC_state s, struct timespec *tm, void *env);
+
+typedef struct SamplerClosure {
+ SamplerFun fun;
+ void *env;
+} *SamplerClosure;
+
+typedef struct Sampler {
+ struct SamplerClosure func;
+ struct timespec desiredInterval;
+ struct timespec absoluteStart;
+ size_t currentEpoch;
+} * Sampler;
+
+
+void initSampler(GC_state s, Sampler samp, SamplerClosure func, struct timespec *desiredInterval);
+
+void maybeSample(GC_state s, Sampler samp);
+
+#endif /* MLTON_GC_INTERNAL_FUNCS */
+
+
+#endif /* SAMPLER_H_ */
\ No newline at end of file
diff --git a/runtime/gc/statistics.c b/runtime/gc/statistics.c
index 176908d4d..fc3fc2880 100644
--- a/runtime/gc/statistics.c
+++ b/runtime/gc/statistics.c
@@ -47,8 +47,9 @@ struct GC_cumulativeStatistics *newCumulativeStatistics(void) {
cumulativeStatistics->bytesScannedMinor = 0;
cumulativeStatistics->bytesHHLocaled = 0;
cumulativeStatistics->bytesReclaimedByLocal = 0;
- cumulativeStatistics->bytesReclaimedByRootCC = 0;
- cumulativeStatistics->bytesReclaimedByInternalCC = 0;
+ cumulativeStatistics->bytesReclaimedByCC = 0;
+ cumulativeStatistics->bytesInScopeForLocal = 0;
+ cumulativeStatistics->bytesInScopeForCC = 0;
cumulativeStatistics->maxBytesLive = 0;
cumulativeStatistics->maxBytesLiveSinceReset = 0;
cumulativeStatistics->maxHeapSize = 0;
@@ -67,22 +68,23 @@ struct GC_cumulativeStatistics *newCumulativeStatistics(void) {
cumulativeStatistics->numMarkCompactGCs = 0;
cumulativeStatistics->numMinorGCs = 0;
cumulativeStatistics->numHHLocalGCs = 0;
- cumulativeStatistics->numRootCCs = 0;
- cumulativeStatistics->numInternalCCs = 0;
+ cumulativeStatistics->numCCs = 0;
cumulativeStatistics->numDisentanglementChecks = 0;
+ cumulativeStatistics->numEntanglements = 0;
cumulativeStatistics->numChecksSkipped = 0;
cumulativeStatistics->numSuspectsMarked = 0;
cumulativeStatistics->numSuspectsCleared = 0;
- cumulativeStatistics->numEntanglementsDetected = 0;
+ cumulativeStatistics->bytesPinnedEntangled = 0;
+ cumulativeStatistics->currentPhaseBytesPinnedEntangled = 0;
+ cumulativeStatistics->bytesPinnedEntangledWatermark = 0;
+ cumulativeStatistics->approxRaceFactor = 0;
cumulativeStatistics->timeLocalGC.tv_sec = 0;
cumulativeStatistics->timeLocalGC.tv_nsec = 0;
cumulativeStatistics->timeLocalPromo.tv_sec = 0;
cumulativeStatistics->timeLocalPromo.tv_nsec = 0;
- cumulativeStatistics->timeRootCC.tv_sec = 0;
- cumulativeStatistics->timeRootCC.tv_nsec = 0;
- cumulativeStatistics->timeInternalCC.tv_sec = 0;
- cumulativeStatistics->timeInternalCC.tv_nsec = 0;
+ cumulativeStatistics->timeCC.tv_sec = 0;
+ cumulativeStatistics->timeCC.tv_nsec = 0;
rusageZero (&cumulativeStatistics->ru_gc);
rusageZero (&cumulativeStatistics->ru_gcCopying);
diff --git a/runtime/gc/statistics.h b/runtime/gc/statistics.h
index 349aaf7f7..391d7f647 100644
--- a/runtime/gc/statistics.h
+++ b/runtime/gc/statistics.h
@@ -36,8 +36,9 @@ struct GC_cumulativeStatistics {
uintmax_t bytesScannedMinor;
uintmax_t bytesHHLocaled;
uintmax_t bytesReclaimedByLocal;
- uintmax_t bytesReclaimedByRootCC;
- uintmax_t bytesReclaimedByInternalCC;
+ uintmax_t bytesReclaimedByCC;
+ uintmax_t bytesInScopeForLocal;
+ uintmax_t bytesInScopeForCC;
size_t maxBytesLive;
size_t maxBytesLiveSinceReset;
@@ -63,19 +64,21 @@ struct GC_cumulativeStatistics {
uintmax_t numMarkCompactGCs;
uintmax_t numMinorGCs;
uintmax_t numHHLocalGCs;
- uintmax_t numRootCCs;
- uintmax_t numInternalCCs;
- uintmax_t numDisentanglementChecks;
+ uintmax_t numCCs;
+  uintmax_t numDisentanglementChecks; // counts full read barriers
+  uintmax_t numEntanglements; // counts instances where entanglement is detected
uintmax_t numChecksSkipped;
uintmax_t numSuspectsMarked;
uintmax_t numSuspectsCleared;
- uintmax_t numEntanglementsDetected;
+ uintmax_t bytesPinnedEntangled;
+ uintmax_t currentPhaseBytesPinnedEntangled;
+ uintmax_t bytesPinnedEntangledWatermark;
+ float approxRaceFactor;
struct timespec timeLocalGC;
struct timespec timeLocalPromo;
- struct timespec timeRootCC;
- struct timespec timeInternalCC;
+ struct timespec timeCC;
struct rusage ru_gc; /* total resource usage in gc. */
struct rusage ru_gcCopying; /* resource usage in major copying gcs. */
diff --git a/runtime/gc/thread.c b/runtime/gc/thread.c
index 495843cde..4cbc6b209 100644
--- a/runtime/gc/thread.c
+++ b/runtime/gc/thread.c
@@ -164,6 +164,75 @@ void GC_HH_promoteChunks(pointer threadp) {
HM_HH_promoteChunks(s, thread);
}
+void GC_HH_clearSuspectsAtDepth(GC_state s, pointer threadp, uint32_t depth) {
+ getStackCurrent(s)->used = sizeofGCStateCurrentStackUsed(s);
+ getThreadCurrent(s)->exnStack = s->exnStack;
+ HM_HH_updateValues(getThreadCurrent(s), s->frontier);
+ assert(threadAndHeapOkay(s));
+
+ GC_thread thread = threadObjptrToStruct(s, pointerToObjptr(threadp, NULL));
+ assert(thread != NULL);
+ assert(thread->hierarchicalHeap != NULL);
+ HM_HH_clearSuspectsAtDepth(s, thread, depth);
+}
+
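+/* Return the number of entanglement suspects in the heap at targetDepth,
+ * or 0 if the thread has no heap at that depth. */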
+Word64 GC_HH_numSuspectsAtDepth(GC_state s, pointer threadp, uint32_t targetDepth) {
+ getStackCurrent(s)->used = sizeofGCStateCurrentStackUsed(s);
+ getThreadCurrent(s)->exnStack = s->exnStack;
+ HM_HH_updateValues(getThreadCurrent(s), s->frontier);
+ assert(threadAndHeapOkay(s));
+ GC_thread thread = threadObjptrToStruct(s, pointerToObjptr(threadp, NULL));
+ assert(thread != NULL);
+ assert(thread->hierarchicalHeap != NULL);
+
+ for (HM_HierarchicalHeap cursor = thread->hierarchicalHeap;
+ NULL != cursor;
+ cursor = cursor->nextAncestor)
+ {
+ uint32_t d = HM_HH_getDepth(cursor);
+ if (d <= targetDepth) {
+ if (d == targetDepth) return (Word64)ES_numSuspects(s, cursor);
+ return 0;
+ }
+ }
+
+ return 0;
+}
+
+Pointer /*ES_clearSet*/
+GC_HH_takeClearSetAtDepth(GC_state s, pointer threadp, uint32_t targetDepth) {
+ getStackCurrent(s)->used = sizeofGCStateCurrentStackUsed(s);
+ getThreadCurrent(s)->exnStack = s->exnStack;
+ HM_HH_updateValues(getThreadCurrent(s), s->frontier);
+ assert(threadAndHeapOkay(s));
+ GC_thread thread = threadObjptrToStruct(s, pointerToObjptr(threadp, NULL));
+ assert(thread != NULL);
+ assert(thread->hierarchicalHeap != NULL);
+ return (pointer)ES_takeClearSet(s, HM_HH_getHeapAtDepth(s, thread, targetDepth));
+}
+
+Word64 GC_HH_numChunksInClearSet(GC_state s, pointer clearSet) {
+ return (Word64)ES_numChunksInClearSet(s, (ES_clearSet)clearSet);
+}
+
+Pointer /*ES_finishedClearSetGrain*/
+GC_HH_processClearSetGrain(GC_state s, pointer clearSet, Word64 start, Word64 stop) {
+ return (pointer)ES_processClearSetGrain(s, (ES_clearSet)clearSet, (size_t)start, (size_t)stop);
+}
+
+void GC_HH_commitFinishedClearSetGrain(GC_state s, pointer threadp, pointer finClearSetGrain) {
+ getStackCurrent(s)->used = sizeofGCStateCurrentStackUsed(s);
+ getThreadCurrent(s)->exnStack = s->exnStack;
+ HM_HH_updateValues(getThreadCurrent(s), s->frontier);
+ assert(threadAndHeapOkay(s));
+ GC_thread thread = threadObjptrToStruct(s, pointerToObjptr(threadp, NULL));
+ ES_commitFinishedClearSetGrain(s, thread, (ES_finishedClearSetGrain)finClearSetGrain);
+}
+
+void GC_HH_deleteClearSet(GC_state s, pointer clearSet) {
+ ES_deleteClearSet(s, (ES_clearSet)clearSet);
+}
+
void GC_HH_moveNewThreadToDepth(pointer threadp, uint32_t depth) {
GC_state s = pthread_getspecific(gcstate_key);
GC_thread thread = threadObjptrToStruct(s, pointerToObjptr(threadp, NULL));
diff --git a/runtime/gc/thread.h b/runtime/gc/thread.h
index 0c7c03c23..68047b5ad 100644
--- a/runtime/gc/thread.h
+++ b/runtime/gc/thread.h
@@ -129,6 +129,15 @@ PRIVATE void GC_HH_setMinLocalCollectionDepth(pointer thread, Word32 depth);
*/
PRIVATE void GC_HH_moveNewThreadToDepth(pointer thread, Word32 depth);
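+
+/* Management of entanglement suspects: query and clear the suspects at a
+ * given depth, and split the clearing work into "clear set" grains that can
+ * be processed independently and then committed. */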
+PRIVATE void GC_HH_clearSuspectsAtDepth(GC_state s, pointer threadp, uint32_t depth);
+
+PRIVATE Word64 GC_HH_numSuspectsAtDepth(GC_state s, pointer threadp, uint32_t depth);
+PRIVATE Pointer /*ES_clearSet*/ GC_HH_takeClearSetAtDepth(GC_state s, pointer threadp, uint32_t depth);
+PRIVATE Word64 GC_HH_numChunksInClearSet(GC_state s, pointer clearSet);
+PRIVATE Pointer /*ES_finishedClearSetGrain*/ GC_HH_processClearSetGrain(GC_state s, pointer clearSet, Word64 start, Word64 stop);
+PRIVATE void GC_HH_commitFinishedClearSetGrain(GC_state s, pointer threadp, pointer finClearSetGrain);
+PRIVATE void GC_HH_deleteClearSet(GC_state s, pointer clearSet);
+
PRIVATE Bool GC_HH_checkFinishedCCReadyToJoin(GC_state s);
#endif /* MLTON_GC_INTERNAL_BASIS */