diff --git a/demos/iiswc-20/diagram.png b/demos/iiswc-20/diagram.png new file mode 120000 index 0000000..cba57f3 --- /dev/null +++ b/demos/iiswc-20/diagram.png @@ -0,0 +1 @@ +../../example/model/diagram.png \ No newline at end of file diff --git a/demos/iiswc-20/tutorial.ipynb b/demos/iiswc-20/tutorial.ipynb new file mode 100644 index 0000000..3379d2c --- /dev/null +++ b/demos/iiswc-20/tutorial.ipynb @@ -0,0 +1,1300 @@ +{ + "cells": [ + { + "cell_type": "code", + "execution_count": 1, + "metadata": { + "slideshow": { + "slide_type": "skip" + } + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "env: PATH=/home/subh/anaconda3/bin:/home/subh/tools/llvm-10/build/bin:/home/subh/tools/llvm-10/build/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games\n", + "/home/subh/research/hetsim-rel\n" + ] + } + ], + "source": [ + "%set_env PATH=/home/subh/anaconda3/bin:/home/subh/tools/llvm-10/build/bin:/home/subh/tools/llvm-10/build/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games\n", + "%cd ../.." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "slideshow": { + "slide_type": "slide" + } + }, + "source": [ + "# HetSim Demo\n", + "We will now demonstrate the use of HetSim for an example scenarios." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "source": [ + "* We will first change an existing target model\n", + "* We will then write an application for it and run it through a) detailed simulation, and b) HetSim" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "slideshow": { + "slide_type": "subslide" + } + }, + "source": [ + "## Install Dependencies\n", + "* Install LLVM (version > 10.0) by following instructions from https://llvm.org/docs/GettingStarted.html\n", + "* Install any cross-compilers required for the target\n", + " * This example uses an Arm gcc that you can install by running `sudo apt install g++-arm-linux-gnueabihf`" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "source": [ + "## Build gem5\n", + "We will build gem5 by running a convenience script inside `scripts/`." + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "/home/subh/research/hetsim-rel/scripts\n", + "\u001b[32m\u001b[1m[0]: Starting gem5 build for TimingSimpleCPU\u001b[m\n", + "\u001b[32m\u001b[1m[1]: gem5 build succeeded\u001b[m\n", + "\u001b[32m\u001b[1m[2]: Compiling m5threads library\u001b[m\n", + "Makefile:50: warning: overriding recipe for target 'test_omp'\n", + "Makefile:42: warning: ignoring old recipe for target 'test_omp'\n", + "make: '../pthread.o' is up to date.\n", + "\u001b[32m\u001b[1m[3]: build-gem5.sh successfully exiting\u001b[m\n", + "/home/subh/research/hetsim-rel\n" + ] + } + ], + "source": [ + "%cd scripts\n", + "!VERBOSE=0 CC=/usr/bin/gcc CXX=/usr/bin/g++ bash build-gem5.sh\n", + "%cd .." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "slideshow": { + "slide_type": "slide" + } + }, + "source": [ + "## Construct a gem5 Model for the Target\n", + "Consider a programmable target composed of two types of PEs - **worker** and **manager**." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "source": [ + "
\n", + " \"example\n", + "
" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "source": [ + "* All PEs share a D-Cache, a DSPM, and the main memory\n", + "* Each PE has a private instruction cache\n", + "* The manager distributes work to the workers via _FIFO queues_" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "slideshow": { + "slide_type": "subslide" + } + }, + "source": [ + "### Tweak the Existing Model\n", + "We will make two changes to the existing model.\n", + "* Change number of workers from 4 → 16\n", + "* Change the depth of each work queue from 4 → 6" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "source": [ + "#### Step 1: Change Macros and Python Bindings" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "/home/subh/research/hetsim-rel/example/model\n", + "swig -python -module params params.h\n", + "gcc -c -fpic params_wrap.c -I/usr/include/python2.7 -o params_wrap.o\n", + "gcc -shared params_wrap.o -o _params.so\n", + "/home/subh/research/hetsim-rel\n" + ] + } + ], + "source": [ + "%cd example/model\n", + "# change the relevant define in params.h\n", + "!sed -i 's/#define NUM_WORKER.*/#define NUM_WORKER 8/g' params.h\n", + "!sed -i 's/#define WQ_DEPTH.*/#define WQ_DEPTH 6/g' params.h\n", + "# generate Python bindings\n", + "!make\n", + "%cd ../.." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "slideshow": { + "slide_type": "subslide" + } + }, + "source": [ + "#### Step 2: Reflect Changes in User Spec and Libraries" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "source": [ + "* Assign PE IDs to the new worker PEs and queue IDs for the queues corresponding the new workers" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [], + "source": [ + "# this step should be done manually!\n", + "!sed -i 's/\\[1, 2, 3, 4\\]/[1, 2, 3, 4, 5, 6, 7, 8]/g' spec/spec.json" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "source": [ + "* Generate code to connect the queues in the `gem5` model and the emulation and TRE libraries" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "/home/subh/research/hetsim-rel/scripts\n", + "\u001b[32m\u001b[1m[0]: Starting gem5 build for TimingSimpleCPU\u001b[m\n", + "\u001b[32m\u001b[1m[1]: gem5 build succeeded\u001b[m\n", + "\u001b[32m\u001b[1m[2]: Compiling m5threads library\u001b[m\n", + "Makefile:50: warning: overriding recipe for target 'test_omp'\n", + "Makefile:42: warning: ignoring old recipe for target 'test_omp'\n", + "make: '../pthread.o' is up to date.\n", + "\u001b[32m\u001b[1m[3]: build-gem5.sh successfully exiting\u001b[m\n", + "/home/subh/research/hetsim-rel\n" + ] + } + ], + "source": [ + "%cd scripts\n", + "!python2 populate_init_queues.py\n", + "!VERBOSE=0 bash build-gem5.sh\n", + "%cd .." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "slideshow": { + "slide_type": "subslide" + } + }, + "source": [ + "#### Step 3: Regenerate Compiler Plugin\n", + "Finally, we regenerate the compiler plugin and build the tracing library." + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "/home/subh/research/hetsim-rel/scripts\n", + "/home/subh/research/hetsim-rel/tracer\n", + "/home/subh/research/hetsim-rel/tracer/build\n", + "/usr/bin/ar: creating t.a\n", + "\u001b[35m\u001b[1mScanning dependencies of target LLVMHetsim\u001b[0m\n", + "[ 20%] \u001b[32mBuilding CXX object compiler-pass/CMakeFiles/LLVMHetsim.dir/hetsim-analysis.cpp.o\u001b[0m\n", + "[ 40%] \u001b[32mBuilding CXX object compiler-pass/CMakeFiles/LLVMHetsim.dir/hetsim-codegen.cpp.o\u001b[0m\n", + "[ 60%] \u001b[32m\u001b[1mLinking CXX shared module LLVMHetsim.so\u001b[0m\n", + "[ 60%] Built target LLVMHetsim\n", + "\u001b[35m\u001b[1mScanning dependencies of target hetsim_default_rt\u001b[0m\n", + "[ 80%] \u001b[32mBuilding CXX object runtime/default/CMakeFiles/hetsim_default_rt.dir/hetsim_default_rt.cpp.o\u001b[0m\n", + "[100%] \u001b[32m\u001b[1mLinking CXX shared library libhetsim_default_rt.so\u001b[0m\n", + "[100%] Built target hetsim_default_rt\n", + "/home/subh/research/hetsim-rel\n" + ] + } + ], + "source": [ + "%cd scripts\n", + "!python generate_model.py ../spec/spec.json > /dev/null\n", + "%cd ../tracer\n", + "%mkdir -p build\n", + "%cd build\n", + "!cmake .. > /dev/null && make\n", + "%cd ../.." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "slideshow": { + "slide_type": "slide" + } + }, + "source": [ + "## Write Application for the Target\n", + "We will write a program to do **vector addition** on the target hardware." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "slideshow": { + "slide_type": "subslide" + } + }, + "source": [ + "### Header Files\n", + "```Cpp\n", + "#include \"params.h\" // import parameters of target hardware as macros\n", + "#include \"util.h\" // import primitive definitions\n", + "#include \n", + "#include \n", + "```" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "slideshow": { + "slide_type": "subslide" + } + }, + "source": [ + "### Boilerplate Initialization\n", + "```Cpp\n", + "\n", + "void *work(void *arg) { // manager \"spawns\" worker threads with tid=1,2,3...\n", + " unsigned tid = *(unsigned *)(arg);\n", + " __register_core_id(tid);\n", + " ...\n", + "}\n", + "int main() {\n", + " __init_queues(WQ_DEPTH);\n", + " __register_core_id(0); // manager is assigned core-id 0\n", + " ...\n", + " __teardown_queues();\n", + " return 0;\n", + "}\n", + "```\n", + "\n", + " \n", + " " + ] + }, + { + "cell_type": "markdown", + "metadata": { + "slideshow": { + "slide_type": "subslide" + } + }, + "source": [ + "#### Note\n", + "The `core_id` must be the same across the application, `spec.json` and `target.py`." + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + " \"mgr\": {\r\n", + " \"id\": [0],\r\n", + " \"__push(unsigned int, unsigned long)\" : {\r\n", + "--\r\n", + " \"wrkr\": {\r\n", + " \"id\": [1, 2, 3, 4, 5, 6, 7, 8],\r\n", + " \"__pop(unsigned int)\": {\r\n" + ] + } + ], + "source": [ + "# spec.json\n", + "!grep -A1 -B1 \"id\" spec/spec.json" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + " system.mgr = TRE(\r\n", + " id=0, # This must correspond to the ID assigned in the user spec file.\r\n", + " queue_depth=WQ_DEPTH,\r\n", + "--\r\n", + " wrkr.append(TRE(\r\n", + " id=i+1, # This must correspond to the ID assigned in the user spec file.\r\n", + " max_outstanding_addrs=MAX_OUTSTANDING_REQS\r\n" + ] + } + ], + "source": [ + "# target.py\n", + "!grep -A1 -B1 \"id=\" example/model/target.py" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "slideshow": { + "slide_type": "subslide" + } + }, + "source": [ + "### Manager PE code: `main()`\n", + "Memory allocation is done at the beginning part of `main()`.\n", + "\n", + "```Cpp\n", + " // main memory allocation\n", + " // in this example, we are working with 3 float arrays each of size N\n", + " size_t RAM_SIZE_BYTES = 3 * N * sizeof(float);\n", + " char *ram = (char *)mmap((void *)(RAM_BASE_ADDR), RAM_SIZE_BYTES,\n", + " PROT_READ | PROT_WRITE | PROT_EXEC, \n", + " MAP_ANON | MAP_PRIVATE, 0, 0);\n", + "```" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "source": [ + "```Cpp\n", + " // scratchpad memory allocation\n", + "#ifdef EMULATION\n", + " // for emulation\n", + " char *dspm = (char *)mmap((void *)(SPM_BASE_ADDR), SPM_SIZE_BYTES,\n", + " PROT_READ | PROT_WRITE | PROT_EXEC, \n", + " MAP_ANON | MAP_PRIVATE, 0, 0);\n", + "#else // !EMULATION\n", + " // the model uses physically-addressed scratchpad that does not need explicit allocation\n", + " char *dspm = (char *)SPM_BASE_ADDR;\n", + "#endif // EMULATION\n", + "```" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "slideshow": { + "slide_type": "subslide" + } + }, + "source": [ + "We \"allocate\" the input and output vectors at the pre-allocated main memory and initialize them.\n", + "```Cpp\n", + " // allocate the vectors and populate them\n", + " float *a = (float *)(ram);\n", + " float *b = (float *)(ram + N * sizeof(float));\n", + " float *c = (float *)(ram + 2 * N * sizeof(float));\n", + " for (int i = 0; i < N; ++i) {\n", + " a[i] = float(i + 1);\n", + " b[i] = float(i + 1);\n", + " c[i] = 0.0;\n", + " }\n", + "```" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "source": [ + "For illustration, we allocate a barrier in the shared DSPM.\n", + "```Cpp\n", + " // allocate barrier object for synchronization\n", + " pthread_barrier_t *bar = (pthread_barrier_t *)(dspm);\n", + " // initialize barrier with participants = 1 manager + NUM_WORKER workers\n", + " __barrier_init(bar, NUM_WORKER + 1); \n", + "```" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "slideshow": { + "slide_type": "subslide" + } + }, + "source": [ + "Next, we allocate and spawn the worker threads.\n", + "```Cpp\n", + " // allocate thread objects for each \"worker\" PE\n", + " pthread_t *workers = new pthread_t[NUM_WORKER];\n", + "```" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "source": [ + "```Cpp\n", + " // create vector of core IDs to send to each thread\n", + " unsigned *tids = new unsigned[NUM_WORKER];\n", + " for (int i = 0; i < NUM_WORKER; ++i) {\n", + " tids[i] = i + 1;\n", + " // spawn worker thread\n", + " pthread_create(workers + i, NULL, work, &tids[i]);\n", + " }\n", + "```" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "slideshow": { + "slide_type": "subslide" + } + }, + "source": [ + "The most important part of the code for the manager -- distribute and push work to the workers!\n", + "```Cpp\n", + " // partition the work and push work \"packets\"\n", + " for (int i = 0; i < NUM_WORKER; ++i) {\n", + " // each worker is assigned floor(N / NUM_WORKER) elements\n", + " int n = N / NUM_WORKER;\n", + " int start_idx = i * n;\n", + " int end_idx = (i + 1) * n - 1;\n", + "\n", + " // handle trailing elements by assigning to final worker\n", + " if (i == NUM_WORKER - 1) {\n", + " end_idx = N - 1;\n", + " }\n", + "```" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "source": [ + "```Cpp\n", + " // push through work queues\n", + " __push(i + 1, (uintptr_t)(a));\n", + " __push(i + 1, (uintptr_t)(b));\n", + " __push(i + 1, (uintptr_t)(c));\n", + " __push(i + 1, (unsigned)(start_idx));\n", + " __push(i + 1, (unsigned)(end_idx));\n", + " __push(i + 1, (uintptr_t)(bar));\n", + " }\n", + "```" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "slideshow": { + "slide_type": "subslide" + } + }, + "source": [ + "```Cpp\n", + "// ----- ROI begin -----\n", + " __reset_stats(); // begin recording time here\n", + " for (int i = 0; i < NUM_WORKER; ++i) {\n", + " __push(i + 1, 0); // start signal, value is ignored\n", + " }\n", + " __barrier_wait(bar); // synchronize with worker threads\n", + "\n", + " __dump_reset_stats(); // end recording time here\n", + "// ----- ROI end -----\n", + "```" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "source": [ + "```Cpp\n", + " // join with all threads\n", + " for (int tid = 0; tid < NUM_WORKER; ++tid) {\n", + " pthread_join(workers[tid], NULL);\n", + " }\n", + "```" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "source": [ + "```Cpp\n", + " // clean up\n", + "#ifdef EMULATION\n", + " munmap(dspm, SPM_SIZE_BYTES);\n", + "#endif // EMULATION\n", + " munmap(ram, RAM_SIZE_BYTES);\n", + " delete[] workers;\n", + " delete[] tids;\n", + " __teardown_queues();\n", + "} // end of main()\n", + "```" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "slideshow": { + "slide_type": "subslide" + } + }, + "source": [ + "### Worker PE Code: `work()`" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "source": [ + "```Cpp\n", + "void *work(void *arg) {\n", + " unsigned tid = *(unsigned *)(arg);\n", + " __register_core_id(tid);\n", + "\n", + " // retrieve variables from work queue\n", + " volatile float *a = (volatile float *)__pop(0);\n", + " volatile float *b = (volatile float *)__pop(0);\n", + " volatile float *c = (volatile float *)__pop(0);\n", + " int start_idx = (int)__pop(0);\n", + " int end_idx = (int)__pop(0);\n", + " pthread_barrier_t *bar = (pthread_barrier_t *)__pop(0);\n", + "```" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "source": [ + "```Cpp\n", + " // receive start signal\n", + " __pop(0);\n", + "\n", + " // perform actual computation\n", + " for (int i = start_idx; i <= end_idx; ++i) {\n", + " c[i] += a[i] + b[i];\n", + " }\n", + "\n", + " // synchronize with manager\n", + " __barrier_wait(bar);\n", + "\n", + " return NULL;\n", + "} // end of work()\n", + "```" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "slideshow": { + "slide_type": "slide" + } + }, + "source": [ + "## Verify Functionality of Emulated Code" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "/home/subh/research/hetsim-rel/emu\n", + "/home/subh/research/hetsim-rel/emu/build\n", + "\u001b[35m\u001b[1mScanning dependencies of target hetsim_prim\u001b[0m\n", + "[ 50%] \u001b[32mBuilding CXX object CMakeFiles/hetsim_prim.dir/src/util.cpp.o\u001b[0m\n", + "[100%] \u001b[32m\u001b[1mLinking CXX shared library libhetsim_prim.so\u001b[0m\n", + "[100%] Built target hetsim_prim\n", + "/home/subh/research/hetsim-rel\n" + ] + } + ], + "source": [ + "# build emulator library\n", + "%cd emu\n", + "!mkdir -p build\n", + "%cd build\n", + "!cmake .. > /dev/null && make\n", + "%cd ../.." + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "metadata": { + "slideshow": { + "slide_type": "subslide" + } + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "/home/subh/research/hetsim-rel/example/app\n", + "/home/subh/research/hetsim-rel/example/app/build\n", + "MODE set to EMU\n", + "Processing application: serial_factorial\n", + "Processing application: vector_add\n", + "Processing application: workq_mutex\n", + "\u001b[35m\u001b[1mScanning dependencies of target workq_mutex\u001b[0m\n", + "[ 16%] \u001b[32mBuilding CXX object CMakeFiles/workq_mutex.dir/src/workq_mutex.cpp.o\u001b[0m\n", + "[ 33%] \u001b[32m\u001b[1mLinking CXX executable workq_mutex\u001b[0m\n", + "[ 33%] Built target workq_mutex\n", + "\u001b[35m\u001b[1mScanning dependencies of target serial_factorial\u001b[0m\n", + "[ 50%] \u001b[32mBuilding CXX object CMakeFiles/serial_factorial.dir/src/serial_factorial.cpp.o\u001b[0m\n", + "[ 66%] \u001b[32m\u001b[1mLinking CXX executable serial_factorial\u001b[0m\n", + "[ 66%] Built target serial_factorial\n", + "\u001b[35m\u001b[1mScanning dependencies of target vector_add\u001b[0m\n", + "[ 83%] \u001b[32mBuilding CXX object CMakeFiles/vector_add.dir/src/vector_add.cpp.o\u001b[0m\n", + "[100%] \u001b[32m\u001b[1mLinking CXX executable vector_add\u001b[0m\n", + "[100%] Built target vector_add\n", + "/home/subh/research/hetsim-rel\n" + ] + } + ], + "source": [ + "# build application with emulation library\n", + "%cd example/app\n", + "%rm -rf build\n", + "%mkdir -p build\n", + "%cd build\n", + "!CC=/usr/bin/gcc CXX=/usr/bin/g++ MODE=EMU cmake .. > /dev/null && make\n", + "%cd ../../.." + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "metadata": { + "slideshow": { + "slide_type": "subslide" + } + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "/home/subh/research/hetsim-rel/example/app/build\n", + "== Vector Add Test with N = 100000, NUM_WORKER = 8\n", + "== Test Passed ==\n", + "/home/subh/research/hetsim-rel\n" + ] + } + ], + "source": [ + "# run emulated application for functional verification\n", + "%cd example/app/build\n", + "!./vector_add\n", + "%cd ../../.." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "slideshow": { + "slide_type": "slide" + } + }, + "source": [ + "## Simulation on Detailed Model" + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "metadata": { + "slideshow": { + "slide_type": "skip" + } + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "env: CMAKE_C_COMPILER=/usr/bin/arm-linux-gnueabihf-gcc\n", + "env: CMAKE_CXX_COMPILER=/usr/bin/arm-linux-gnueabihf-g++\n" + ] + } + ], + "source": [ + "%set_env CMAKE_C_COMPILER=/usr/bin/arm-linux-gnueabihf-gcc\n", + "%set_env CMAKE_CXX_COMPILER=/usr/bin/arm-linux-gnueabihf-g++" + ] + }, + { + "cell_type": "code", + "execution_count": 25, + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "/home/subh/research/hetsim-rel/example/app\n", + "/home/subh/research/hetsim-rel/example/app/build\n", + "MODE set to SIM\n", + "Processing application: serial_factorial\n", + "Processing application: vector_add\n", + "Processing application: workq_mutex\n", + "\u001b[35m\u001b[1mScanning dependencies of target m5threads\u001b[0m\n", + "[ 12%] \u001b[32mBuilding C object CMakeFiles/m5threads.dir/home/subh/research/hetsim-rel/m5threads/pthread.c.o\u001b[0m\n", + "\u001b[01m\u001b[K/home/subh/research/hetsim-rel/m5threads/pthread.c:\u001b[m\u001b[K In function ‘\u001b[01m\u001b[Kpthread_create\u001b[m\u001b[K’:\n", + "\u001b[01m\u001b[K/home/subh/research/hetsim-rel/m5threads/pthread.c:256:3:\u001b[m\u001b[K \u001b[01;35m\u001b[Kwarning: \u001b[m\u001b[Kimplicit declaration of function ‘\u001b[01m\u001b[Kclone\u001b[m\u001b[K’ [-Wimplicit-function-declaration]\n", + " clone(__pthread_trampoline, tcb->stack_start_addr, CLONE_VM|CLONE_FS|CLONE_FI\n", + "\u001b[01;32m\u001b[K ^\u001b[m\u001b[K\n", + "[ 25%] \u001b[32m\u001b[1mLinking C static library libm5threads.a\u001b[0m\n", + "[ 25%] Built target m5threads\n", + "\u001b[35m\u001b[1mScanning dependencies of target serial_factorial\u001b[0m\n", + "[ 37%] \u001b[32mBuilding CXX object CMakeFiles/serial_factorial.dir/src/serial_factorial.cpp.o\u001b[0m\n", + "[ 50%] \u001b[32m\u001b[1mLinking CXX executable serial_factorial\u001b[0m\n", + "[ 50%] Built target serial_factorial\n", + "\u001b[35m\u001b[1mScanning dependencies of target workq_mutex\u001b[0m\n", + "[ 62%] \u001b[32mBuilding CXX object CMakeFiles/workq_mutex.dir/src/workq_mutex.cpp.o\u001b[0m\n", + "[ 75%] \u001b[32m\u001b[1mLinking CXX executable workq_mutex\u001b[0m\n", + "[ 75%] Built target workq_mutex\n", + "\u001b[35m\u001b[1mScanning dependencies of target vector_add\u001b[0m\n", + "[ 87%] \u001b[32mBuilding CXX object CMakeFiles/vector_add.dir/src/vector_add.cpp.o\u001b[0m\n", + "[100%] \u001b[32m\u001b[1mLinking CXX executable vector_add\u001b[0m\n", + "[100%] Built target vector_add\n", + "/home/subh/research/hetsim-rel\n" + ] + } + ], + "source": [ + "# build application with emulation library\n", + "%cd example/app\n", + "%rm -rf build\n", + "%mkdir -p build\n", + "%cd build\n", + "!MODE=SIM cmake .. > /dev/null && make\n", + "%cd ../../../" + ] + }, + { + "cell_type": "code", + "execution_count": 27, + "metadata": { + "slideshow": { + "slide_type": "subslide" + } + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "/home/subh/research/hetsim-rel/scripts\n", + "...\n", + "== Vector Add Test with N = 100000, NUM_WORKER = 8\n", + "warn: PowerState: Already in the requested power state, request ignored\n", + "warn: User mode does not have SPSR\n", + "warn: User mode does not have SPSR\n", + "warn: User mode does not have SPSR\n", + "warn: User mode does not have SPSR\n", + "warn: User mode does not have SPSR\n", + "warn: User mode does not have SPSR\n", + "warn: User mode does not have SPSR\n", + "warn: User mode does not have SPSR\n", + "warn: User mode does not have SPSR\n", + "warn: User mode does not have SPSR\n", + "warn: User mode does not have SPSR\n", + "warn: User mode does not have SPSR\n", + "warn: User mode does not have SPSR\n", + "warn: User mode does not have SPSR\n", + "warn: User mode does not have SPSR\n", + "warn: User mode does not have SPSR\n", + "== Test Passed ==\n", + "Exiting @ tick 9534907382 because exiting with last active thread context\n", + "/home/subh/research/hetsim-rel\n" + ] + } + ], + "source": [ + "# run gem5 simulation\n", + "%cd scripts\n", + "!MODE=SIM APP=vector_add bash run-gem5.sh > /dev/null\n", + "!echo \"...\" && tail -n20 ../gem5/m5out/run.log\n", + "%cd ../\n", + "### Gather Runtime\n", + "%mkdir -p res\n", + "%cp gem5/m5out/stats.txt res/stats.det.txt\n", + "!grep sim_ticks res/stats.det.txt | head -n1 | tr -s ' ' | cut -d' ' -f2 > ticks.det.txt" + ] + }, + { + "cell_type": "code", + "execution_count": 16, + "metadata": { + "slideshow": { + "slide_type": "skip" + } + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "env: CMAKE_C_COMPILER=/usr/bin/gcc\n", + "env: CMAKE_CXX_COMPILER=/usr/bin/g++\n" + ] + } + ], + "source": [ + "%set_env CMAKE_C_COMPILER=/usr/bin/gcc\n", + "%set_env CMAKE_CXX_COMPILER=/usr/bin/g++" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "slideshow": { + "slide_type": "slide" + } + }, + "source": [ + "## Run Trace Generation\n", + "We will first make the required changes to enable tracing in the CPP program that we wrote and then run the trace generation step." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "source": [ + "```Cpp\n", + "#if defined(AUTO_TRACING) || defined(MANUAL_TRACING)\n", + "#include \"hetsim_default_rt.h\"\n", + "#endif\n", + "\n", + "void *work(void *arg) {\n", + " unsigned tid = *(unsigned *)(arg);\n", + " __register_core_id(tid);\n", + "#if defined(AUTO_TRACING) || defined(MANUAL_TRACING)\n", + " __open_trace_log(tid);\n", + "#endif // AUTO_TRACING || MANUAL_TRACING\n", + " // ROI begin\n", + " ...\n", + " // ROI end\n", + "#if defined(AUTO_TRACING) || defined(MANUAL_TRACING)\n", + " __close_trace_log(tid);\n", + "#endif // AUTO_TRACING || MANUAL_TRACING\n", + "}\n", + "\n", + "```" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "slideshow": { + "slide_type": "subslide" + } + }, + "source": [ + "```Cpp\n", + "int main() {\n", + " ...\n", + "#if defined(AUTO_TRACING) || defined(MANUAL_TRACING)\n", + " __open_trace_log(0); // use core-id as argument\n", + "#endif\n", + " ...\n", + " __barrier_init(bar, NUM_WORKER + 1); // set number of participants to 1 manager + NUM_WORKER worker PEs\n", + " ... \n", + " __dump_reset_stats(); // end recording time here\n", + " // ----- ROI end -----\n", + "\n", + "#if defined(AUTO_TRACING) || defined(MANUAL_TRACING)\n", + " __close_trace_log(0);\n", + "#endif // AUTO_TRACING || MANUAL_TRACING\n", + " ...\n", + "} // end of main()\n", + "```" + ] + }, + { + "cell_type": "code", + "execution_count": 17, + "metadata": { + "slideshow": { + "slide_type": "subslide" + } + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "/home/subh/research/hetsim-rel/example/app/build\n", + "MODE set to EMU_AUTO_TRACE\n", + "Processing application: serial_factorial\n", + "Processing application: vector_add\n", + "Processing application: workq_mutex\n", + "\u001b[35m\u001b[1mScanning dependencies of target workq_mutex\u001b[0m\n", + "[ 16%] \u001b[32mBuilding CXX object CMakeFiles/workq_mutex.dir/src/workq_mutex.cpp.o\u001b[0m\n", + "\u001b[1m/home/subh/research/hetsim-rel/example/app/src/workq_mutex.cpp:36:9: \u001b[0m\u001b[0;1;35mwarning: \u001b[0m\u001b[1mTracing enabled for this run [-W#pragma-messages]\u001b[0m\n", + "#pragma message(\"Tracing enabled for this run\")\n", + "\u001b[0;1;32m ^\n", + "\u001b[0m1 warning generated.\n", + "[ 33%] \u001b[32m\u001b[1mLinking CXX executable workq_mutex\u001b[0m\n", + "[ 33%] Built target workq_mutex\n", + "\u001b[35m\u001b[1mScanning dependencies of target serial_factorial\u001b[0m\n", + "[ 50%] \u001b[32mBuilding CXX object CMakeFiles/serial_factorial.dir/src/serial_factorial.cpp.o\u001b[0m\n", + "\u001b[1m/home/subh/research/hetsim-rel/example/app/src/serial_factorial.cpp:35:9: \u001b[0m\u001b[0;1;35mwarning: \u001b[0m\u001b[1mTracing enabled for this run [-W#pragma-messages]\u001b[0m\n", + "#pragma message(\"Tracing enabled for this run\")\n", + "\u001b[0;1;32m ^\n", + "\u001b[0m1 warning generated.\n", + "[ 66%] \u001b[32m\u001b[1mLinking CXX executable serial_factorial\u001b[0m\n", + "[ 66%] Built target serial_factorial\n", + "\u001b[35m\u001b[1mScanning dependencies of target vector_add\u001b[0m\n", + "[ 83%] \u001b[32mBuilding CXX object CMakeFiles/vector_add.dir/src/vector_add.cpp.o\u001b[0m\n", + "[100%] \u001b[32m\u001b[1mLinking CXX executable vector_add\u001b[0m\n", + "[100%] Built target vector_add\n", + "== Vector Add Test with N = 100000, NUM_WORKER = 8\n", + "== Test Passed ==\n" + ] + } + ], + "source": [ + "# run trace generation\n", + "%cd example/app/build\n", + "!mkdir -p traces\n", + "%rm -f CMakeCache.txt\n", + "!MODE=EMU_AUTO_TRACE cmake .. > /dev/null && make\n", + "!./vector_add" + ] + }, + { + "cell_type": "code", + "execution_count": 18, + "metadata": { + "slideshow": { + "slide_type": "subslide" + } + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "==> traces/pe_0.trace <==\n", + "BARINIT 0xe0101000 9 10\n", + "ST @11 0x1dfdf60 ( )\n", + "ST @12 0x1dfdf64 ( )\n", + "ST @13 0x1dfdf68 ( )\n", + "ST @14 0x1dfdf6c ( )\n", + "ST @15 0x1dfdf70 ( )\n", + "ST @16 0x1dfdf74 ( )\n", + "ST @17 0x1dfdf78 ( )\n", + "ST @18 0x1dfdf7c ( )\n", + "PUSH 1 1\n", + "\n", + "==> traces/pe_1.trace <==\n", + "POP 0 1\n", + "POP 0 1\n", + "POP 0 1\n", + "POP 0 1\n", + "POP 0 1\n", + "POP 0 1\n", + "POP 0 1\n", + "STALL 3 ( )\n", + "LD @1 0x40000000 ( )\n", + "LD @2 0x40061a80 ( 0x40000000 )\n", + "/home/subh/research/hetsim-rel\n" + ] + } + ], + "source": [ + "### Trace Format\n", + "!head -n10 traces/pe_[01].trace \n", + "%cd ../../../" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "slideshow": { + "slide_type": "slide" + } + }, + "source": [ + "## Run Trace Replay\n", + "We will now run the generated traces through the TRE-enabled `gem5` model." + ] + }, + { + "cell_type": "code", + "execution_count": 24, + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "/home/subh/research/hetsim-rel/scripts\n", + "...\n", + "TRE[6]: halted @694256563 after completing 100005 trace entries\n", + "Number of TREs IDLE now: 1/9\n", + "TRE[8]: halted @694258565 after completing 100005 trace entries\n", + "Number of TREs IDLE now: 2/9\n", + "TRE[1]: halted @694260567 after completing 100005 trace entries\n", + "Number of TREs IDLE now: 3/9\n", + "TRE[3]: halted @694262569 after completing 100005 trace entries\n", + "Number of TREs IDLE now: 4/9\n", + "TRE[2]: halted @694264571 after completing 100005 trace entries\n", + "Number of TREs IDLE now: 5/9\n", + "TRE[7]: halted @694266573 after completing 100005 trace entries\n", + "Number of TREs IDLE now: 6/9\n", + "TRE[5]: halted @694268575 after completing 100005 trace entries\n", + "Number of TREs IDLE now: 7/9\n", + "TRE[4]: halted @694270577 after completing 100005 trace entries\n", + "Number of TREs IDLE now: 8/9\n", + "TRE[0]: triggered DMPRST\n", + "TRE[0]: halted @694273580 after completing 0 trace entries\n", + "Number of TREs IDLE now: 9/9\n", + "Exiting @ tick 694273580 because all TREs are done\n", + "/home/subh/research/hetsim-rel\n" + ] + } + ], + "source": [ + "%cd scripts\n", + "!MODE=EMU_TRACE APP=vector_add bash run-gem5.sh > /dev/null\n", + "!echo \"...\" && tail -n20 ../gem5/m5out/run.log\n", + "%cd .." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "slideshow": { + "slide_type": "subslide" + } + }, + "source": [ + "#### Comparison between Detailed and HetSim Runs\n", + "As the final step, we will compare the runtime between the detailed `gem5` run and the TRE-enabled `gem5` run." + ] + }, + { + "cell_type": "code", + "execution_count": 20, + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "752808056\n", + "694044351\n" + ] + } + ], + "source": [ + "%cp gem5/m5out/stats.txt res/stats.tre.txt\n", + "!grep sim_ticks res/stats.tre.txt | head -n1 | tr -s ' ' | cut -d' ' -f2 > ticks.tre.txt\n", + "!cat ticks.det.txt\n", + "!cat ticks.tre.txt" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "slideshow": { + "slide_type": "slide" + } + }, + "source": [ + "> This Jupyter Notebook is available in the public repository: https://github.com/umich-cadre/HetSim-gem5/demos/iiswc-20/tutorial.ipynb" + ] + } + ], + "metadata": { + "celltoolbar": "Slideshow", + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.7.3" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/demos/iiswc-20/tutorial.slides.html b/demos/iiswc-20/tutorial.slides.html new file mode 100644 index 0000000..59a4e7c --- /dev/null +++ b/demos/iiswc-20/tutorial.slides.html @@ -0,0 +1,14542 @@ + + + + + + + + + + + + +tutorial slides + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+
+
+
+
+
+

HetSim Demo

We will now demonstrate the use of HetSim for an example scenarios.

+ +
+
+
+
+
+
+
    +
  • We will first change an existing target model
  • +
  • We will then write an application for it and run it through a) detailed simulation, and b) HetSim
  • +
+ +
+
+
+
+
+
+

Install Dependencies

    +
  • Install LLVM (version > 10.0) by following instructions from https://llvm.org/docs/GettingStarted.html
  • +
  • Install any cross-compilers required for the target
      +
    • This example uses an Arm gcc that you can install by running sudo apt install g++-arm-linux-gnueabihf
    • +
    +
  • +
+ +
+
+
+
+
+
+

Build gem5

We will build gem5 by running a convenience script inside scripts/.

+ +
+
+
+
+
+
In [2]:
+
+
+
%cd scripts
+!VERBOSE=0 CC=/usr/bin/gcc CXX=/usr/bin/g++ bash build-gem5.sh
+%cd ..
+
+ +
+
+
+ +
+
+ + +
+ +
+ + +
+
/home/subh/research/hetsim-rel/scripts
+[0]: Starting gem5 build for TimingSimpleCPU
+[1]: gem5 build succeeded
+[2]: Compiling m5threads library
+Makefile:50: warning: overriding recipe for target 'test_omp'
+Makefile:42: warning: ignoring old recipe for target 'test_omp'
+make: '../pthread.o' is up to date.
+[3]: build-gem5.sh successfully exiting
+/home/subh/research/hetsim-rel
+
+
+
+ +
+
+ +
+
+
+
+

Construct a gem5 Model for the Target

Consider a programmable target composed of two types of PEs - worker and manager.

+ +
+
+
+
+
+
+
+ example target +
+
+
+
+
+
+
+
    +
  • All PEs share a D-Cache, a DSPM, and the main memory
  • +
  • Each PE has a private instruction cache
  • +
  • The manager distributes work to the workers via FIFO queues
  • +
+ +
+
+
+
+
+
+

Tweak the Existing Model

We will make two changes to the existing model.

+
    +
  • Change number of workers from 4 → 16
  • +
  • Change the depth of each work queue from 4 → 6
  • +
+ +
+
+
+
+
+
+

Step 1: Change Macros and Python Bindings

+
+
+
+
+
+
In [3]:
+
+
+
%cd example/model
+# change the relevant define in params.h
+!sed -i 's/#define NUM_WORKER.*/#define NUM_WORKER             8/g' params.h
+!sed -i 's/#define WQ_DEPTH.*/#define WQ_DEPTH               6/g' params.h
+# generate Python bindings
+!make
+%cd ../..
+
+ +
+
+
+ +
+
+ + +
+ +
+ + +
+
/home/subh/research/hetsim-rel/example/model
+swig -python -module params params.h
+gcc -c -fpic params_wrap.c -I/usr/include/python2.7 -o params_wrap.o
+gcc -shared params_wrap.o -o _params.so
+/home/subh/research/hetsim-rel
+
+
+
+ +
+
+ +
+
+
+
+

Step 2: Reflect Changes in User Spec and Libraries

+
+
+
+
+
+
+
    +
  • Assign PE IDs to the new worker PEs and queue IDs for the queues corresponding the new workers
  • +
+ +
+
+
+
+
+
In [4]:
+
+
+
# this step should be done manually!
+!sed -i 's/\[1, 2, 3, 4\]/[1, 2, 3, 4, 5, 6, 7, 8]/g' spec/spec.json
+
+ +
+
+
+ +
+
+
+
+
    +
  • Generate code to connect the queues in the gem5 model and the emulation and TRE libraries
  • +
+ +
+
+
+
+
+
In [5]:
+
+
+
%cd scripts
+!python2 populate_init_queues.py
+!VERBOSE=0 bash build-gem5.sh
+%cd ..
+
+ +
+
+
+ +
+
+ + +
+ +
+ + +
+
/home/subh/research/hetsim-rel/scripts
+[0]: Starting gem5 build for TimingSimpleCPU
+[1]: gem5 build succeeded
+[2]: Compiling m5threads library
+Makefile:50: warning: overriding recipe for target 'test_omp'
+Makefile:42: warning: ignoring old recipe for target 'test_omp'
+make: '../pthread.o' is up to date.
+[3]: build-gem5.sh successfully exiting
+/home/subh/research/hetsim-rel
+
+
+
+ +
+
+ +
+
+
+
+

Step 3: Regenerate Compiler Plugin

Finally, we regenerate the compiler plugin and build the tracing library.

+ +
+
+
+
+
+
In [6]:
+
+
+
%cd scripts
+!python generate_model.py ../spec/spec.json > /dev/null
+%cd ../tracer
+%mkdir -p build
+%cd build
+!cmake .. > /dev/null && make
+%cd ../..
+
+ +
+
+
+ +
+
+ + +
+ +
+ + +
+
/home/subh/research/hetsim-rel/scripts
+/home/subh/research/hetsim-rel/tracer
+/home/subh/research/hetsim-rel/tracer/build
+/usr/bin/ar: creating t.a
+Scanning dependencies of target LLVMHetsim
+[ 20%] Building CXX object compiler-pass/CMakeFiles/LLVMHetsim.dir/hetsim-analysis.cpp.o
+[ 40%] Building CXX object compiler-pass/CMakeFiles/LLVMHetsim.dir/hetsim-codegen.cpp.o
+[ 60%] Linking CXX shared module LLVMHetsim.so
+[ 60%] Built target LLVMHetsim
+Scanning dependencies of target hetsim_default_rt
+[ 80%] Building CXX object runtime/default/CMakeFiles/hetsim_default_rt.dir/hetsim_default_rt.cpp.o
+[100%] Linking CXX shared library libhetsim_default_rt.so
+[100%] Built target hetsim_default_rt
+/home/subh/research/hetsim-rel
+
+
+
+ +
+
+ +
+
+
+
+

Write Application for the Target

We will write a program to do vector addition on the target hardware.

+ +
+
+
+
+
+
+

Header Files

#include "params.h"  // import parameters of target hardware as macros
+#include "util.h"    // import primitive definitions
+#include <pthread.h>
+#include <sys/mman.h>
+
+ +
+
+
+
+
+
+

Boilerplate Initialization

void *work(void *arg) { // manager "spawns" worker threads with tid=1,2,3...
+    unsigned tid = *(unsigned *)(arg);
+    __register_core_id(tid);
+    ...
+}
+int main() {
+    __init_queues(WQ_DEPTH);
+    __register_core_id(0); // manager is assigned core-id 0
+    ...
+    __teardown_queues();
+    return 0;
+}
+
+ +
+
+
+
+
+
+

Note

The core_id must be the same across the application, spec.json and target.py.

+ +
+
+
+
+
+
In [7]:
+
+
+
# spec.json
+!grep -A1 -B1 "id" spec/spec.json
+
+ +
+
+
+ +
+
+ + +
+ +
+ + +
+
        "mgr": {
+            "id": [0],
+            "__push(unsigned int, unsigned long)" : {
+--
+        "wrkr": {
+            "id": [1, 2, 3, 4, 5, 6, 7, 8],
+            "__pop(unsigned int)": {
+
+
+
+ +
+
+ +
+
+
+
In [8]:
+
+
+
# target.py
+!grep -A1 -B1 "id=" example/model/target.py
+
+ +
+
+
+ +
+
+ + +
+ +
+ + +
+
    system.mgr = TRE(
+        id=0,   # This must correspond to the ID assigned in the user spec file.
+        queue_depth=WQ_DEPTH,
+--
+        wrkr.append(TRE(
+            id=i+1, # This must correspond to the ID assigned in the user spec file.
+            max_outstanding_addrs=MAX_OUTSTANDING_REQS
+
+
+
+ +
+
+ +
+
+
+
+

Manager PE code: main()

Memory allocation is done at the beginning part of main().

+
// main memory allocation
+    // in this example, we are working with 3 float arrays each of size N
+    size_t RAM_SIZE_BYTES = 3 * N * sizeof(float);
+    char *ram = (char *)mmap((void *)(RAM_BASE_ADDR), RAM_SIZE_BYTES,
+                             PROT_READ | PROT_WRITE | PROT_EXEC, 
+                             MAP_ANON | MAP_PRIVATE, 0, 0);
+
+ +
+
+
+
+
+
+
// scratchpad memory allocation
+#ifdef EMULATION
+    // for emulation
+    char *dspm = (char *)mmap((void *)(SPM_BASE_ADDR), SPM_SIZE_BYTES,
+                              PROT_READ | PROT_WRITE | PROT_EXEC, 
+                              MAP_ANON | MAP_PRIVATE, 0, 0);
+#else  // !EMULATION
+    // the model uses physically-addressed scratchpad that does not need explicit allocation
+    char *dspm = (char *)SPM_BASE_ADDR;
+#endif // EMULATION
+
+ +
+
+
+
+
+
+

We "allocate" the input and output vectors at the pre-allocated main memory and initialize them.

+
// allocate the vectors and populate them
+    float *a = (float *)(ram);
+    float *b = (float *)(ram + N * sizeof(float));
+    float *c = (float *)(ram + 2 * N * sizeof(float));
+    for (int i = 0; i < N; ++i) {
+        a[i] = float(i + 1);
+        b[i] = float(i + 1);
+        c[i] = 0.0;
+    }
+
+ +
+
+
+
+
+
+

For illustration, we allocate a barrier in the shared DSPM.

+
// allocate barrier object for synchronization
+    pthread_barrier_t *bar = (pthread_barrier_t *)(dspm);
+    // initialize barrier with participants = 1 manager + NUM_WORKER workers
+    __barrier_init(bar, NUM_WORKER + 1);
+
+ +
+
+
+
+
+
+

Next, we allocate and spawn the worker threads.

+
// allocate thread objects for each "worker" PE
+    pthread_t *workers = new pthread_t[NUM_WORKER];
+
+ +
+
+
+
+
+
+
// create vector of core IDs to send to each thread
+    unsigned *tids = new unsigned[NUM_WORKER];
+    for (int i = 0; i < NUM_WORKER; ++i) {
+        tids[i] = i + 1;
+        // spawn worker thread
+        pthread_create(workers + i, NULL, work, &tids[i]);
+    }
+
+ +
+
+
+
+
+
+

The most important part of the code for the manager -- distribute and push work to the workers!

+
// partition the work and push work "packets"
+    for (int i = 0; i < NUM_WORKER; ++i) {
+    // each worker is assigned floor(N / NUM_WORKER) elements
+        int n = N / NUM_WORKER;
+        int start_idx = i * n;
+        int end_idx = (i + 1) * n - 1;
+
+        // handle trailing elements by assigning to final worker
+        if (i == NUM_WORKER - 1) {
+            end_idx = N - 1;
+        }
+
+ +
+
+
+
+
+
+
// push through work queues
+        __push(i + 1, (uintptr_t)(a));
+        __push(i + 1, (uintptr_t)(b));
+        __push(i + 1, (uintptr_t)(c));
+        __push(i + 1, (unsigned)(start_idx));
+        __push(i + 1, (unsigned)(end_idx));
+        __push(i + 1, (uintptr_t)(bar));
+    }
+
+ +
+
+
+
+
+
+
// ----- ROI begin -----
+    __reset_stats(); // begin recording time here
+    for (int i = 0; i < NUM_WORKER; ++i) {
+        __push(i + 1, 0); // start signal, value is ignored
+    }
+    __barrier_wait(bar); // synchronize with worker threads
+
+    __dump_reset_stats(); // end recording time here
+// ----- ROI end -----
+
+ +
+
+
+
+
+
+
// join with all threads
+    for (int tid = 0; tid < NUM_WORKER; ++tid) {
+        pthread_join(workers[tid], NULL);
+    }
+
+ +
+
+
+
+
+
+
// clean up
+#ifdef EMULATION
+    munmap(dspm, SPM_SIZE_BYTES);
+#endif // EMULATION
+    munmap(ram, RAM_SIZE_BYTES);
+    delete[] workers;
+    delete[] tids;
+    __teardown_queues();
+} // end of main()
+
+ +
+
+
+
+
+
+

Worker PE Code: work()

+
+
+
+
+
+
+
void *work(void *arg) {
+    unsigned tid = *(unsigned *)(arg);
+    __register_core_id(tid);
+
+    // retrieve variables from work queue
+    volatile float *a = (volatile float *)__pop(0);
+    volatile float *b = (volatile float *)__pop(0);
+    volatile float *c = (volatile float *)__pop(0);
+    int start_idx = (int)__pop(0);
+    int end_idx = (int)__pop(0);
+    pthread_barrier_t *bar = (pthread_barrier_t *)__pop(0);
+
+ +
+
+
+
+
+
+
// receive start signal
+    __pop(0);
+
+    // perform actual computation
+    for (int i = start_idx; i <= end_idx; ++i) {
+        c[i] += a[i] + b[i];
+    }
+
+    // synchronize with manager
+    __barrier_wait(bar);
+
+    return NULL;
+} // end of work()
+
+ +
+
+
+
+
+
+

Verify Functionality of Emulated Code

+
+
+
+
+
+
In [9]:
+
+
+
# build emulator library
+%cd emu
+!mkdir -p build
+%cd build
+!cmake .. > /dev/null && make
+%cd ../..
+
+ +
+
+
+ +
+
+ + +
+ +
+ + +
+
/home/subh/research/hetsim-rel/emu
+/home/subh/research/hetsim-rel/emu/build
+Scanning dependencies of target hetsim_prim
+[ 50%] Building CXX object CMakeFiles/hetsim_prim.dir/src/util.cpp.o
+[100%] Linking CXX shared library libhetsim_prim.so
+[100%] Built target hetsim_prim
+/home/subh/research/hetsim-rel
+
+
+
+ +
+
+ +
+
+
+
In [10]:
+
+
+
# build application with emulation library
+%cd example/app
+%rm -rf build
+%mkdir -p build
+%cd build
+!CC=/usr/bin/gcc CXX=/usr/bin/g++ MODE=EMU cmake .. > /dev/null && make
+%cd ../../..
+
+ +
+
+
+ +
+
+ + +
+ +
+ + +
+
/home/subh/research/hetsim-rel/example/app
+/home/subh/research/hetsim-rel/example/app/build
+MODE set to EMU
+Processing application: serial_factorial
+Processing application: vector_add
+Processing application: workq_mutex
+Scanning dependencies of target workq_mutex
+[ 16%] Building CXX object CMakeFiles/workq_mutex.dir/src/workq_mutex.cpp.o
+[ 33%] Linking CXX executable workq_mutex
+[ 33%] Built target workq_mutex
+Scanning dependencies of target serial_factorial
+[ 50%] Building CXX object CMakeFiles/serial_factorial.dir/src/serial_factorial.cpp.o
+[ 66%] Linking CXX executable serial_factorial
+[ 66%] Built target serial_factorial
+Scanning dependencies of target vector_add
+[ 83%] Building CXX object CMakeFiles/vector_add.dir/src/vector_add.cpp.o
+[100%] Linking CXX executable vector_add
+[100%] Built target vector_add
+/home/subh/research/hetsim-rel
+
+
+
+ +
+
+ +
+
+
+
In [11]:
+
+
+
# run emulated application for functional verification
+%cd example/app/build
+!./vector_add
+%cd ../../..
+
+ +
+
+
+ +
+
+ + +
+ +
+ + +
+
/home/subh/research/hetsim-rel/example/app/build
+== Vector Add Test with N = 100000, NUM_WORKER = 8
+== Test Passed ==
+/home/subh/research/hetsim-rel
+
+
+
+ +
+
+ +
+
+
+
+

Simulation on Detailed Model

+
+
+
+
+
+
In [25]:
+
+
+
# build application with emulation library
+%cd example/app
+%rm -rf build
+%mkdir -p build
+%cd build
+!MODE=SIM cmake .. > /dev/null && make
+%cd ../../../
+
+ +
+
+
+ +
+
+ + +
+ +
+ + +
+
/home/subh/research/hetsim-rel/example/app
+/home/subh/research/hetsim-rel/example/app/build
+MODE set to SIM
+Processing application: serial_factorial
+Processing application: vector_add
+Processing application: workq_mutex
+Scanning dependencies of target m5threads
+[ 12%] Building C object CMakeFiles/m5threads.dir/home/subh/research/hetsim-rel/m5threads/pthread.c.o
+/home/subh/research/hetsim-rel/m5threads/pthread.c: In function ‘pthread_create’:
+/home/subh/research/hetsim-rel/m5threads/pthread.c:256:3: warning: implicit declaration of function ‘clone’ [-Wimplicit-function-declaration]
+   clone(__pthread_trampoline, tcb->stack_start_addr, CLONE_VM|CLONE_FS|CLONE_FI
+   ^
+[ 25%] Linking C static library libm5threads.a
+[ 25%] Built target m5threads
+Scanning dependencies of target serial_factorial
+[ 37%] Building CXX object CMakeFiles/serial_factorial.dir/src/serial_factorial.cpp.o
+[ 50%] Linking CXX executable serial_factorial
+[ 50%] Built target serial_factorial
+Scanning dependencies of target workq_mutex
+[ 62%] Building CXX object CMakeFiles/workq_mutex.dir/src/workq_mutex.cpp.o
+[ 75%] Linking CXX executable workq_mutex
+[ 75%] Built target workq_mutex
+Scanning dependencies of target vector_add
+[ 87%] Building CXX object CMakeFiles/vector_add.dir/src/vector_add.cpp.o
+[100%] Linking CXX executable vector_add
+[100%] Built target vector_add
+/home/subh/research/hetsim-rel
+
+
+
+ +
+
+ +
+
+
+
In [27]:
+
+
+
# run gem5 simulation
+%cd scripts
+!MODE=SIM APP=vector_add bash run-gem5.sh > /dev/null
+!echo "..." && tail -n20 ../gem5/m5out/run.log
+%cd ../
+### Gather Runtime
+%mkdir -p res
+%cp gem5/m5out/stats.txt res/stats.det.txt
+!grep sim_ticks res/stats.det.txt | head -n1 | tr -s ' ' | cut -d' ' -f2 > ticks.det.txt
+
+ +
+
+
+ +
+
+ + +
+ +
+ + +
+
/home/subh/research/hetsim-rel/scripts
+...
+== Vector Add Test with N = 100000, NUM_WORKER = 8
+warn: PowerState: Already in the requested power state, request ignored
+warn: User mode does not have SPSR
+warn: User mode does not have SPSR
+warn: User mode does not have SPSR
+warn: User mode does not have SPSR
+warn: User mode does not have SPSR
+warn: User mode does not have SPSR
+warn: User mode does not have SPSR
+warn: User mode does not have SPSR
+warn: User mode does not have SPSR
+warn: User mode does not have SPSR
+warn: User mode does not have SPSR
+warn: User mode does not have SPSR
+warn: User mode does not have SPSR
+warn: User mode does not have SPSR
+warn: User mode does not have SPSR
+warn: User mode does not have SPSR
+== Test Passed ==
+Exiting @ tick 9534907382 because exiting with last active thread context
+/home/subh/research/hetsim-rel
+
+
+
+ +
+
+ +
+
+
+
+

Run Trace Generation

We will first make the required changes to enable tracing in the CPP program that we wrote and then run the trace generation step.

+ +
+
+
+
+
+
+
#if defined(AUTO_TRACING) || defined(MANUAL_TRACING)
+#include "hetsim_default_rt.h"
+#endif
+
+void *work(void *arg) {
+    unsigned tid = *(unsigned *)(arg);
+    __register_core_id(tid);
+#if defined(AUTO_TRACING) || defined(MANUAL_TRACING)
+    __open_trace_log(tid);
+#endif // AUTO_TRACING || MANUAL_TRACING
+    // ROI begin
+    ...
+    // ROI end
+#if defined(AUTO_TRACING) || defined(MANUAL_TRACING)
+    __close_trace_log(tid);
+#endif // AUTO_TRACING || MANUAL_TRACING
+}
+
+ +
+
+
+
+
+
+
int main() {
+    ...
+#if defined(AUTO_TRACING) || defined(MANUAL_TRACING)
+    __open_trace_log(0); // use core-id as argument
+#endif
+    ...
+    __barrier_init(bar, NUM_WORKER + 1); // set number of participants to 1 manager + NUM_WORKER worker PEs
+    ... 
+    __dump_reset_stats(); // end recording time here
+    // ----- ROI end -----
+
+#if defined(AUTO_TRACING) || defined(MANUAL_TRACING)
+    __close_trace_log(0);
+#endif // AUTO_TRACING || MANUAL_TRACING
+    ...
+} // end of main()
+
+ +
+
+
+
+
+
In [17]:
+
+
+
# run trace generation
+%cd example/app/build
+!mkdir -p traces
+%rm -f CMakeCache.txt
+!MODE=EMU_AUTO_TRACE cmake .. > /dev/null && make
+!./vector_add
+
+ +
+
+
+ +
+
+ + +
+ +
+ + +
+
/home/subh/research/hetsim-rel/example/app/build
+MODE set to EMU_AUTO_TRACE
+Processing application: serial_factorial
+Processing application: vector_add
+Processing application: workq_mutex
+Scanning dependencies of target workq_mutex
+[ 16%] Building CXX object CMakeFiles/workq_mutex.dir/src/workq_mutex.cpp.o
+/home/subh/research/hetsim-rel/example/app/src/workq_mutex.cpp:36:9: warning: Tracing enabled for this run [-W#pragma-messages]
+#pragma message("Tracing enabled for this run")
+        ^
+1 warning generated.
+[ 33%] Linking CXX executable workq_mutex
+[ 33%] Built target workq_mutex
+Scanning dependencies of target serial_factorial
+[ 50%] Building CXX object CMakeFiles/serial_factorial.dir/src/serial_factorial.cpp.o
+/home/subh/research/hetsim-rel/example/app/src/serial_factorial.cpp:35:9: warning: Tracing enabled for this run [-W#pragma-messages]
+#pragma message("Tracing enabled for this run")
+        ^
+1 warning generated.
+[ 66%] Linking CXX executable serial_factorial
+[ 66%] Built target serial_factorial
+Scanning dependencies of target vector_add
+[ 83%] Building CXX object CMakeFiles/vector_add.dir/src/vector_add.cpp.o
+[100%] Linking CXX executable vector_add
+[100%] Built target vector_add
+== Vector Add Test with N = 100000, NUM_WORKER = 8
+== Test Passed ==
+
+
+
+ +
+
+ +
+
+
+
In [18]:
+
+
+
### Trace Format
+!head -n10 traces/pe_[01].trace  
+%cd ../../../
+
+ +
+
+
+ +
+
+ + +
+ +
+ + +
+
==> traces/pe_0.trace <==
+BARINIT 0xe0101000  9 10
+ST @11 0x1dfdf60 (  )
+ST @12 0x1dfdf64 (  )
+ST @13 0x1dfdf68 (  )
+ST @14 0x1dfdf6c (  )
+ST @15 0x1dfdf70 (  )
+ST @16 0x1dfdf74 (  )
+ST @17 0x1dfdf78 (  )
+ST @18 0x1dfdf7c (  )
+PUSH 1 1
+
+==> traces/pe_1.trace <==
+POP 0 1
+POP 0 1
+POP 0 1
+POP 0 1
+POP 0 1
+POP 0 1
+POP 0 1
+STALL 3 ( )
+LD @1 0x40000000 (  )
+LD @2 0x40061a80 ( 0x40000000  )
+/home/subh/research/hetsim-rel
+
+
+
+ +
+
+ +
+
+
+
+

Run Trace Replay

We will now run the generated traces through the TRE-enabled gem5 model.

+ +
+
+
+
+
+
In [24]:
+
+
+
%cd scripts
+!MODE=EMU_TRACE APP=vector_add bash run-gem5.sh > /dev/null
+!echo "..." && tail -n20 ../gem5/m5out/run.log
+%cd ..
+
+ +
+
+
+ +
+
+ + +
+ +
+ + +
+
/home/subh/research/hetsim-rel/scripts
+...
+TRE[6]: halted @694256563 after completing 100005 trace entries
+Number of TREs IDLE now: 1/9
+TRE[8]: halted @694258565 after completing 100005 trace entries
+Number of TREs IDLE now: 2/9
+TRE[1]: halted @694260567 after completing 100005 trace entries
+Number of TREs IDLE now: 3/9
+TRE[3]: halted @694262569 after completing 100005 trace entries
+Number of TREs IDLE now: 4/9
+TRE[2]: halted @694264571 after completing 100005 trace entries
+Number of TREs IDLE now: 5/9
+TRE[7]: halted @694266573 after completing 100005 trace entries
+Number of TREs IDLE now: 6/9
+TRE[5]: halted @694268575 after completing 100005 trace entries
+Number of TREs IDLE now: 7/9
+TRE[4]: halted @694270577 after completing 100005 trace entries
+Number of TREs IDLE now: 8/9
+TRE[0]: triggered DMPRST
+TRE[0]: halted @694273580 after completing 0 trace entries
+Number of TREs IDLE now: 9/9
+Exiting @ tick 694273580 because all TREs are done
+/home/subh/research/hetsim-rel
+
+
+
+ +
+
+ +
+
+
+
+

Comparison between Detailed and HetSim Runs

As the final step, we will compare the runtime between the detailed gem5 run and the TRE-enabled gem5 run.

+ +
+
+
+
+
+
In [20]:
+
+
+
%cp gem5/m5out/stats.txt res/stats.tre.txt
+!grep sim_ticks res/stats.tre.txt | head -n1 | tr -s ' ' | cut -d' ' -f2 > ticks.tre.txt
+!cat ticks.det.txt
+!cat ticks.tre.txt
+
+ +
+
+
+ +
+
+ + +
+ +
+ + +
+
752808056
+694044351
+
+
+
+ +
+
+ +
+
+
+
+

This Jupyter Notebook is available in the public repository: https://github.com/umich-cadre/HetSim-gem5/demos/iiswc-20/tutorial.ipynb

+
+ +
+
+
+
+
+ + + + + + +