[CF 2015][An architecture for near-data processing systems]
[MemSys 2017][The sparse data reduction engine: chopping sparse data one byte at a time]
[MemSys 2017][Identifying the potential of Near Data Processing for Apache Spark]
[ICPP 2017][Boosting the efficiency of HPCG and Graph500 with near-data processing]
[ISCA 2008][3D-Stacked Memory Architectures for Multi-Core Processors]
[IEEE Micro 2013][Centip3De: A 64-Core, 3D stacked near threshold system]
[Springer Chapter 2013][3D-MAPS: 3D Massively Parallel Processor with Stacked Memory]
[3DIC 2013][A 3D-stacked logic-in-memory accelerator for application-specific data intensive computing]
[WoNDP 2014][Thermal Feasibility of Die-Stacked Processing in Memory]
Thermal Analysis
[MemSys 2015][Near Data Processing: Impact and Optimization of 3D Memory System Architecture on the Uncore]
This work optimizes the 3D-stacked DRAM architecture for PIM
[MemSys 2016][Integrated Thermal Analysis for Processing In Die-Stacking Memory]
Thermal Analysis
[HPCA 2018][PM3: Power Modeling and Power Management for Processing-in-Memory]
Power Analysis
[IPDPS 2018][CoolPIM: Thermal-Aware Source Throttling for Efficient PIM Instruction Offloading]
Thermal Analysis
[MSPC 2013][A new perspective on processing-in-memory architecture design]
[WoNDP 2014][3D-Stacked Memory-Side Acceleration: Accelerator and System Design]
[ISPASS 2014][NDC: Analyzing the impact of 3D-stacked memory + logic devices on MapReduce workloads]
[HPDC 2014][TOP-PIM: Throughput-Oriented Programmable Processing in Memory]
This design restricts the PIM processing logic to operating only on non-cacheable data, which forces the CPU cores to read PIM data directly from DRAM (see the sketch below).
This paper assumes streaming-multiprocessor-style PIM cores to exploit the internal bandwidth of the HMC.
This paper assumes whole-application offloading.
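Several entries in this list share this non-cacheable-data coherence model; a minimal sketch of what it implies for host software, assuming a hypothetical pim_alloc_uncacheable() allocator (not a real API):

```c
/* Sketch of the "PIM data is non-cacheable" coherence model shared by
 * several designs in this list. pim_alloc_uncacheable() is a
 * hypothetical stand-in for whatever OS/driver mechanism marks the
 * pages uncacheable; it is not a real API. */
#include <stddef.h>

void *pim_alloc_uncacheable(size_t bytes);  /* hypothetical */

void host_side(size_t n) {
    /* Data the PIM cores will touch lives in uncacheable pages, so
     * there is never a stale copy in the host caches ... */
    long *pim_data = pim_alloc_uncacheable(n * sizeof(long));

    /* ... but every host access now goes straight to DRAM, which is
     * exactly the cost these designs accept to avoid a coherence
     * protocol between host and PIM. */
    long sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += pim_data[i];   /* each read is a DRAM round trip */
    (void)sum;
}
```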
[ASPLOS 2014][Integrated 3D-stacked server designs for increasing physical density of key-value stores]
[IBM 2015][Active Memory Cube: A processing-in-memory architecture for exascale systems]
[ISCA 2015][A scalable processing-in-memory accelerator for parallel graph processing]
Granularity: entire application
Programming model: a new set of APIs for writing programs.
Host Interface: Tesseract acts as an accelerator that is memory-mapped to part of a non-cacheable memory region of the host processors.
Coherence: restricts the PIM processing logic to operating only on non-cacheable data, which forces the CPU cores to read PIM data directly from DRAM.
No Virtual Memory Support: since in-memory big-data workloads usually do not require many of the features provided by virtual memory, Tesseract omits virtual memory support to avoid the need for address translation inside memory.
Computation movement: implemented as a remote function call (sketched below).
Synchronization: guaranteed by the barrier() API.
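A hedged sketch of what this programming model looks like for one PageRank-style step; pim_call(), pim_barrier(), and owner_cube() are illustrative stand-ins for Tesseract's remote-function-call and barrier() APIs, not the paper's exact signatures:

```c
/* Illustrative Tesseract-style kernel: ship the computation to the
 * cube that owns the destination vertex instead of pulling its data
 * across the off-chip link. All APIs below are hypothetical. */
typedef struct { double rank, next_rank; int out_degree; } vertex_t;

typedef struct {
    int num_vertices;
    int *row, *col;      /* CSR adjacency */
    vertex_t *v;
} graph_t;

void pim_call(int cube, void (*fn)(vertex_t *, double),
              vertex_t *arg, double delta);      /* hypothetical */
void pim_barrier(void);                          /* hypothetical */
int owner_cube(int vertex_id);                   /* hypothetical */

static void add_rank(vertex_t *w, double delta) { w->next_rank += delta; }

void pagerank_step(graph_t *g) {
    for (int s = 0; s < g->num_vertices; s++) {
        double share = g->v[s].rank / g->v[s].out_degree;
        for (int e = g->row[s]; e < g->row[s + 1]; e++) {
            int d = g->col[e];
            /* Remote function call executes at d's home cube. */
            pim_call(owner_cube(d), add_rank, &g->v[d], share);
        }
    }
    pim_barrier();   /* all remote calls retire before the next step */
}
```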
[ISCA 2015][Data Reorganization in Memory Using 3D-stacked DRAM]
[HPCA 2015][NDA: Near-DRAM acceleration architecture leveraging commodity DRAM devices and standard memory modules]
Coherence: restricts the PIM processing logic to operating only on non-cacheable data, which forces the CPU cores to read PIM data directly from DRAM.
[MemSys 2015][Instruction Offloading with HMC 2.0 Standard: A Case Study for Graph Traversals]
Granularity: single instruction
[MemSys 2015][Understanding Energy Aspect of Processing Near Memory for HPC Workloads]
[MemSys 2015][Near Memory Data Structure Rearrangement]
[ASBD 2015][Sort vs. Hash Join Revisited for Near-Memory Execution]
[ISCA 2016][Neurocube: A Programmable Digital Neuromorphic Architecture with High-Density 3D Memory]
[HPCA 2016][HRL: Efficient and Flexible Reconfigurable Logic for Near-Data Processing]
PIM core style: FPGA / CGRA
Granularity: entire application
Coherence: restricts the PIM processing logic to operating only on non-cacheable data, which forces the CPU cores to read PIM data directly from DRAM.
[PACT 2016][Scheduling techniques for GPU architectures with processing-in-memory capabilities]
GPU-HMC
Granularity: kernel offloading.
This paper assumes GPU style PIM cores.
This paper's offloading strategy considers not only memory/computation characteristics (mapping compute-bound kernels to GPU-PIC and memory-bound kernels to GPU-PIM) but also how to maximize concurrency so that overall execution time is reduced.
It identifies independent kernels and schedules them concurrently (toy sketch below).
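A toy version of that placement-plus-concurrency idea; the metric, the 0.25 cutoff, and the structures are assumptions for illustration, not the paper's model:

```c
/* Toy kernel placement in the spirit of this scheduler: memory-bound
 * kernels go to the PIM SMs (GPU-PIM), compute-bound ones stay on the
 * main GPU (GPU-PIC), and two independent kernels with opposite
 * affinities may overlap. All numbers are illustrative. */
typedef enum { GPU_PIC, GPU_PIM } engine_t;

typedef struct {
    double bytes_per_flop;   /* proxy for memory intensity */
    int depends_on;          /* index of a predecessor kernel, or -1 */
} kernel_t;

engine_t place(const kernel_t *k) {
    return k->bytes_per_flop > 0.25 ? GPU_PIM : GPU_PIC;  /* assumed cutoff */
}

/* Two kernels may run concurrently if neither depends on the other
 * and they were placed on different engines. */
int can_run_concurrently(const kernel_t *a, int ia,
                         const kernel_t *b, int ib) {
    return a->depends_on != ib && b->depends_on != ia
        && place(a) != place(b);
}
```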
[MemSys 2016][Analyzing Consistency Issues in HMC Atomics]
[IEEE Micro 2016][Near-DRAM Acceleration with Single-ISA Heterogeneous Processing in Standard Memory Modules]
[IEEE Micro 2016][HAMLeT Architecture for Parallel Data Reorganization in Memory]
[DATE 2016][Large vector extensions inside the HMC]
[PACT 2016][Accelerating Linked-list Traversal Through Near-Data Processing]
Pointer traversal
[ICCD 2016][Accelerating pointer chasing in 3D-stacked memory: Challenges, mechanisms, evaluation]
Pointer traversal
This work introduces in-memory support for address translation and pointer chasing (the targeted access pattern is sketched below).
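The access pattern both of these works target, shown as a plain C traversal: each hop is a dependent load, so the host serializes one full DRAM round trip per node, while an in-memory engine pays only the much shorter internal latency:

```c
/* Linked-list traversal: the address of each load depends on the
 * previous load's value, so cache misses cannot overlap. Executing
 * this loop next to DRAM shortens every hop. */
typedef struct node { int key; struct node *next; } node_t;

node_t *find(node_t *head, int key) {
    for (node_t *n = head; n; n = n->next)   /* one dependent miss per hop */
        if (n->key == key)
            return n;
    return 0;
}
```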
[ICS 2016][Prefetching Techniques for Near-memory Throughput Processors]
[ARCS 2016][Design and Evaluation of a Processing-in-Memory Architecture for the Smart Memory Cube]
[thesis 2016][RowCore: A Processing-Near-Memory Architecture for Big Data Machine Learning]
[ICPPW 2016][Performance Implications of Processing-in-Memory Designs on Data-Intensive Applications]
[IPDPSW 2016][HMC-Sim-2.0: A Simulation Platform for Exploring Custom Memory Cube Operations]
[ISCA 2017][The Mondrian Data Engine]
[HPCA 2017][GraphPIM: Enabling Instruction-Level PIM Offloading in Graph Computing Frameworks]
Granularity: single instruction
Previous work: PIM-enabled instructions [ISCA 2015] require programmers to explicitly invoke PIM operations using new host (native) instructions. GraphPIM avoids adding an extra burden on application programmers by leveraging existing host instructions.
Key idea: map host atomic instructions directly into PIM atomics using uncacheable memory support in modern architectures, without any changes in user applications or ISA.
Offloading Target: chooses the atomic operations on graph properties as the PIM offloading targets; all host atomic instructions accessing the PMR (PIM memory region) are offloaded as PIMAtomic requests.
Takeaway: offloading via HMC-atomic instructions on the CPU is beneficial for graph-computing applications. GraphPIM does not explore how candidate properties are identified and selected.
Coherence Issue: GraphPIM requires the framework to allocate the graph property in the PIM memory region (PMR), a contiguous block of an uncacheable region in the virtual memory space (see the sketch below).
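A minimal sketch of this usage model, assuming a hypothetical pmr_alloc() that places the property array in the uncacheable PMR:

```c
/* GraphPIM usage-model sketch. pmr_alloc() is a hypothetical
 * allocator for the uncacheable PIM memory region (PMR); the atomic
 * itself is an ordinary host instruction. */
#include <stdatomic.h>
#include <stddef.h>

_Atomic unsigned *pmr_alloc(size_t n);   /* hypothetical */

_Atomic unsigned *prop;   /* per-vertex property, resident in the PMR */

void init(size_t num_vertices) { prop = pmr_alloc(num_vertices); }

void visit(unsigned v) {
    /* Compiles to a normal host atomic. Because &prop[v] falls inside
     * the uncacheable PMR, the hardware forwards it to the HMC as a
     * PIMAtomic request instead of acquiring the line in cache. */
    atomic_fetch_add(&prop[v], 1u);
}
```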
[HPCA 2017][Processing-in-Memory Enabled Graphics Processors for 3D Rendering]
GPU-HMC for graphics
[ASPLOS 2017][TETRIS: Scalable and Efficient Neural Network Acceleration with 3D Memory]
[TVLSI 2017][Logic-Base Interconnect Design for Near Memory Computing in the Smart Memory Cube]
[TPDS 2017][Neurostream: Scalable and Energy Efficient Deep Learning with Smart Memory Cubes]
[MemSys 2017][Near memory key/value lookup acceleration]
[MemSys 2017][Lightweight SIMT Core Designs for Intelligent 3D Stacked DRAM]
[SC 2017][Toward Standardized Near-Data Processing with Unrestricted Data Placement for GPUs]
GPU-HMC
This enables the distribution of PIM data across multiple memory stacks.
[PC 2017][HMC-Sim-2.0: A co-design infrastructure for exploring custom memory cube operations]
[arXiv 2017][NeuroTrainer: An Intelligent Memory Module for Deep Learning Training]
[TPDS 2018][Near-Memory Acceleration for Radio Astronomy]
[arXiv 2018][Memory Slices: A Modular Building Block for Scalable, Intelligent Memory Systems]
[BMC Genomics 2018][GRIM-Filter: Fast Seed Location Filtering in DNA Read Mapping Using Processing-in-Memory Technologies]
[IEEE Trans on Computers 2018][A Scalable Near-Memory Architecture for Training Deep Neural Networks on Large In-Memory Data Sets]
[TCAD 2018][DeepTrain: A Programmable Embedded Platform for Training Deep Neural Networks]
[arXiv 2017][Application-Driven Near-Data Processing for Similarity Search]
[IPDPS 2018][Application Codesign of Near-Data Processing for Similarity Search]
[MICRO 2018][Multi-dimensional Parallel Training of Winograd Layer on Memory-Centric Architecture]
[IEEE Micro 2014][Comparing Implementations of Near-Data Computing with In-Memory MapReduce Workloads]
[MICRO 2016][Chameleon: Versatile and practical near-DRAM acceleration architecture for large memory systems]
This paper proposes integrating logic within the DRAM module but outside of the DRAM chips to reduce manufacturing costs.
[IEEE Micro 2016][Heterogeneous Computing Meets Near-Memory Acceleration and High-Level Synthesis in the Post-Moore Era]
[ISLPED 2017][XNOR-POP: A processing-in-memory architecture for binary Convolutional Neural Networks in Wide-IO2 DRAMs]
"These works (1) are often unable to take advantage of the high internal bandwidth of 3D-stacked DRAM, which reduces the efficiency of PIM execution, and (2) may still suffer from many of the same challenges faced by architectures that embed logic within the DRAM chip." - Ghose, arXiv 2018.
[MICRO 2015][Gather-Scatter DRAM: In-DRAM Address Translation to Improve the Spatial Locality of Non-unit Strided Accesses]
This design embeds logic within the memory controller to remap a single memory request across multiple rows and columns within DRAM (toy model below).
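A toy model of the remapping idea; this is illustrative only, not the paper's actual shuffling/translation logic:

```c
/* Toy model: a cache line is striped across 8 DRAM chips, and each
 * chip computes its own column address so that one access returns
 * 8 useful words of a strided gather (illustrative only). */
#define CHIPS 8

void gather_columns(unsigned base_col, unsigned stride,
                    unsigned col_for_chip[CHIPS]) {
    for (unsigned chip = 0; chip < CHIPS; chip++)
        col_for_chip[chip] = base_col + chip * stride;  /* per-chip address */
}
```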
[MICRO 2015][Decoupled Direct Memory Access: Isolating CPU and IO Traffic by Leveraging a Dual-Data-Port DRAM]
[MICRO 2016][Continuous Runahead: Transparent Hardware Acceleration for Memory Intensive Workloads]
PIM kernel identification: programmer / compiler
This design embeds logic in the memory controller that accelerates dependent cache misses and performs runahead execution
[ISCA 2016][Accelerating Dependent Cache Misses with an Enhanced Memory Controller]
This design embeds logic in the memory controller that accelerates dependent cache misses and performs runahead execution (the targeted access pattern is sketched below).
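Both of these entries target the same pattern; a plain-C example of the dependent-miss chain such an enhanced memory controller accelerates:

```c
/* Dependent cache misses: the second load's address comes from the
 * first load's data, so the misses serialize on the host. An engine
 * in the memory controller can chase idx[i] -> table[idx[i]] without
 * a round trip back to the core. */
long indirect_sum(const int *idx, const long *table, int n) {
    long sum = 0;
    for (int i = 0; i < n; i++)
        sum += table[idx[i]];   /* table's miss depends on idx's value */
    return sum;
}
```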
[DAC 2018][On-chip deep neural network storage with multi-level eNVM]
[ISCA 2018][RANA: Towards Efficient Neural Acceleration with Refresh-Optimized Embedded DRAM]
[WoNDP 2013][High-level Programming Model Abstractions for Processing in Memory]
[WoNDP 2013][Data-triggered Multithreading for Near-Data Processing]
[ISCA 2015][PIM-enabled instructions: a low-overhead, locality-aware processing-in-memory architecture]
Granularity: single instruction
Idea: a hardware-based locality monitor chooses a processing unit, either in the host processor or in the memory hierarchy, for the execution of each custom PIM instruction written by the programmer (toy monitor sketched below).
PIM kernel identification: programmer / compiler
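A toy stand-in for that locality monitor; the real design tracks cache-block locality in hardware, and this direct-mapped tag table with illustrative sizes is only an assumption-laden sketch:

```c
/* Toy locality monitor: a direct-mapped table of recently touched
 * cache-block tags. A hit predicts the block is in the host caches,
 * so the PIM-enabled instruction executes at the host; a miss sends
 * it to memory. Table size and block size are illustrative. */
#define MON_ENTRIES 256

typedef enum { AT_HOST, AT_MEMORY } site_t;

static unsigned long tag_table[MON_ENTRIES];

site_t dispatch(unsigned long addr) {
    unsigned long blk = addr >> 6;            /* 64-byte cache blocks */
    unsigned idx = (unsigned)(blk % MON_ENTRIES);
    site_t where = (tag_table[idx] == blk) ? AT_HOST : AT_MEMORY;
    tag_table[idx] = blk;                     /* update on every access */
    return where;
}
```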
[MICRO 2017][Data movement aware computation partitioning]
GPU in manycore system
[TACO 2017][CAIRO: A Compiler-Assisted Technique for Enabling Instruction-Level Offloading of Processing-in-Memory]
Granularity: single instruction
Idea: this paper is a follow-up to GraphPIM. It extends GraphPIM to GPU workloads and introduces a compiler-assisted technique that enables instruction-level offloading on both CPU and GPU platforms, in the context of HMC-atomic instructions for graph-computing applications.
Cache Coherence: because HMC-atomic instructions directly modify data within the HMC, this design maintains a cache-bypassing policy that ensures a coherent view of the offloading targets. In other words, marking the memory accesses of HMC-atomic instructions as uncacheable causes them to bypass the cache hierarchy and ensures that a single copy of each offloading target exists.
The cost metric (bandwidth savings) used to identify offloading candidates is similar to that of Transparent Offloading and Mapping (TOM) (ISCA 2016).
[TACO 2017][An Architecture for Integrated Near-Data Processors]
[SPAA 2017][Concurrent Data Structures for Near-Memory Computing]
This work designs PIM-specific concurrent data structures to improve PIM performance.
[ASPLOS 2018][In-Memory Data Parallel Processor]
[arXiv 2018][ORIGAMI: A Heterogeneous Split Architecture for In-Memory Acceleration of Learning]
[PACT 2015][Practical Near-Data Processing for In-memory Analytics Frameworks]
[CF 2015][Data Access Optimization in a Processing-in-Memory System]
This work optimizes how programs access PIM data
[ISCA 2016][Transparent offloading and mapping (TOM): Enabling programmer-transparent near-data processing in GPU systems]
Granularity: instruction blocks (GPU warp).
This paper assumes that memory accessed by offloaded instructions is marked uncacheable. In other words, if a warp accesses memory that would cause data sharing between PIM and the host, the warp should not be an offloading candidate.
This paper also assumes no memory barriers, synchronization, or atomic instructions in candidate blocks, as it does not support synchronization primitives between the main GPU and the logic-layer SMs.
This paper assumes streaming-multiprocessor-style PIM cores to exploit the internal bandwidth of the HMC.
This paper uses a compiler-based technique to identify candidate offloading code blocks, with bandwidth savings as the metric (sketched after this list).
This paper also considers the problem of mapping data onto different HMCs (locality between data and computation): it maps memory pages accessed by offloaded code to the stack where the code will execute, by exploiting common memory access patterns.
This paper also describes a runtime component that dynamically determines whether an offloading candidate block should actually be offloaded.
Virtual memory mechanism: this paper assumes the memory-stack SMs are equipped with TLBs and MMUs similar to the host's and are capable of performing virtual address translation; it assumes the per-SM MMU and TLB are fairly small: 1-2K flip-flops and a small amount of logic.
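A back-of-envelope version of the bandwidth-savings test mentioned above; the byte counts and the exact form of the decision rule are assumptions, not the paper's formula:

```c
/* Offload a candidate block only if the off-chip traffic it removes
 * (its loads and stores now served inside the stack) outweighs the
 * traffic it adds (shipping live-in/live-out registers to and from
 * the memory-stack SM). Constants are illustrative. */
int should_offload(long loads, long stores,
                   long live_in_regs, long live_out_regs) {
    long bytes_saved = (loads + stores) * 32;              /* assumed 32B accesses */
    long bytes_added = (live_in_regs + live_out_regs) * 4; /* 4B registers */
    return bytes_saved > bytes_added;
}
```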
[IA 2017][Highly Scalable Near Memory Processing with Migrating Threads on the Emu System Architecture]
[DAC 2017][Exploiting Parallelism for Convolutional Connections in Processing-In-Memory Architecture]
[TACO 2017][Triple Engine Processor (TEP): A Heterogeneous Near-Memory Processor for Diverse Kernel Operations]
[HotOS 2017][It's Time to Think About an Operating System for Near Data Processing Architectures]
[IEEE Trans on Computers 2017][StaleLearn: Learning Acceleration with Asynchronous Synchronization between Model Replicas on PIM]
[CF 2017][Selective off-loading to Memory: Task Partitioning and Mapping for PIM-enabled Heterogeneous Systems]
[MASCOTS 2017][Quantifying the Potential Benefits of On-chip Near-Data Computing in Manycore Processors]
[ASPLOS 2018][Google Workloads for Consumer Devices: Mitigating Data Movement Bottlenecks]
[TPDS 2018][Towards Memory-Efficient Allocation of CNNs on Processing-in-Memory Architecture]
[MICRO 2018][Processing-in-Memory for Energy-efficient Neural Network Training: A Heterogeneous Approach]
[MICRO 2018][Application-Transparent Near-Memory Processing Architecture with Memory Channel Network]
[CAL 2018][Beyond the Memory Wall: A Case for Memory-centric HPC System for Deep Learning]
[HPEC 2018][Designing Algorithms for the EMU Migrating-threads-based Architecture]
[PACT 2015][BSSync: Processing Near Memory for Machine Learning Workloads with Bounded Staleness Consistency Models]
Granularity: single instruction
[arXiv 2018][Enabling the Adoption of Processing-in-Memory: Challenges, Mechanisms, Future Research Directions]
[MICRO 2015][Enabling Portable Energy Efficiency with Memory Accelerated Library]