CIS565-Fall-2014 · JivingTechnostic · Sep 29, 2014
diff --git a/README.md b/README.md
@@ -1,133 +1,16 @@
-Project-2
-=========
-
-A Study in Parallel Algorithms : Stream Compaction
-
-# INTRODUCTION
-Many of the algorithms you have learned thus far in your career have typically
-been developed from a serial standpoint.  When it comes to GPUs, we are mainly
-looking at massively parallel work.  Thus, it is necessary to reorient our
-thinking.  In this project, we will be implementing a couple different versions
-of prefix sum.  We will start with a simple single thread serial CPU version,
-and then move to a naive GPU version.  Each part of this homework is meant to
-follow the logic of the previous parts, so please do not do this homework out of
-order.
-
-This project will serve as a stream compaction library that you may use (and
-will want to use) in your
-future projects.  For that reason, we suggest you create proper header and CUDA
-files so that you can reuse this code later.  You may want to create a separate
-cpp file that contains your main function so that you can test the code you
-write.
-
-# OVERVIEW
-Stream compaction is broken down into two parts: (1) scan, and (2) scatter.
-
-## SCAN
-Scan or prefix sum is the summation of the elements in an array such that the
-resulting array is the summation of the terms before it.  Prefix sum can either
-be inclusive, meaning the current term is a summation of all the elements before
-it and itself, or exclusive, meaning the current term is a summation of all
-elements before it excluding itself. 
-
-Inclusive:
-
-In : [ 3 4 6 7 9 10 ]
-
-Out : [ 3 7 13 20 29 39 ]
-
-Exclusive:
-
-In : [ 3 4 6 7 9 10 ]
-
-Out : [ 0 3 7 13 20 29 ]
-
-Note that, as in the examples above, the resulting prefix sum has the same
-number of elements as the input array.  Also, the first element of the
-exclusive prefix sum will always be 0.  In the following sections, all
-references to prefix sum will be to the exclusive version of prefix sum.
-
-## SCATTER
-The scatter section of stream compaction takes the results of the previous scan
-in order to reorder the elements to form a compact array.
-
-For example, let's say we have the following array:
-[ 0 0 3 4 0 6 6 7 0 1 ]
-
-We would only like to consider the non-zero elements in this array, so we would
-like to compact it into the following array:
-[ 3 4 6 6 7 1 ]
-
-We can perform a transform on the input array to turn it into a boolean array:
-
-In :  [ 0 0 3 4 0 6 6 7 0 1 ]
-
-Out : [ 0 0 1 1 0 1 1 1 0 1 ]
-
-Performing a scan on the output, we get the following array :
-
-In :  [ 0 0 1 1 0 1 1 1 0 1 ]
-
-Out : [ 0 0 0 1 2 2 3 4 5 5 ]
-
-Notice that the output array produces a corresponding index array that we can
-use to create the resulting array for stream compaction. 
-
-# PART 1 : REVIEW OF PREFIX SUM
-Given the definition of exclusive prefix sum, please write a serial CPU version
-of prefix sum.  You may write this in the cpp file to separate this from the
-CUDA code you will be writing in your .cu file. 
-
-# PART 2 : NAIVE PREFIX SUM
-We will now parallelize the previous section's code.  Recall from lecture
-that we can parallelize this using a series of kernel calls.  In this portion,
-you are NOT allowed to use shared memory.
-
-### Questions 
-* Compare this version to the serial version of exclusive prefix scan. Please
-  include a table of how the runtimes compare on different lengths of arrays.
-* Plot a graph of the comparison and write a short explanation of the phenomenon you
-  see here.
-
-# PART 3 : OPTIMIZING PREFIX SUM
-In the previous section we did not take shared memory into account.  Instead,
-we kept everything in global memory, which is much slower than shared
-memory.
-
-## PART 3a : Write prefix sum for a single block
-Shared memory is accessible to threads of a block. Please write a version of
-prefix sum that works on a single block.  
-
-## PART 3b : Generalizing to arrays of any length.
-Taking the previous portion, please write a version that generalizes prefix sum
-to arbitrary-length arrays, including arrays that will not fit in one block.
-
-### Questions
-* Compare this version to the parallel prefix sum using global memory.
-* Plot a graph of the comparison and write a short explanation of the phenomenon
-  you see here.
-
-# PART 4 : ADDING SCATTER
-First create a serial version of scatter by expanding the serial version of
-prefix sum.  Then create a GPU version of scatter.  Combine the function call
-such that, given an array, you can call stream compact and it will compact the
-array for you.  Finally, write a version using thrust. 
-
-### Questions
-* Compare your version of stream compact to your version using thrust.  How do
-  they compare?  How might you optimize yours more, or how might thrust's stream
-  compact be optimized?
-
-# EXTRA CREDIT (+10)
-For extra credit, please optimize your prefix sum for work parallelism and to
-deal with bank conflicts.  Information on this can be found in the GPU Gems
-chapter listed in the references.  
-
-# SUBMISSION
-Please answer all the questions in each of the subsections above and write your
-answers in the README by overwriting the README file.  In future projects, we
-expect your analysis to be similar to the one we have led you through in this
-project.  Like other projects, please open a pull request and email Harmony.
-
-# REFERENCES
-"Parallel Prefix Sum (Scan) with CUDA." GPU Gems 3.
+Project-2
+=========
+
+Serial vs. Naïve prefix scan (please see readme.pdf)
+
+
+
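+| Size   | 1000     | 5000     | 10000    | 50000    | 100000  | 500000  | 1000000 | 5000000 | 10000000 |
+|--------|----------|----------|----------|----------|---------|---------|---------|---------|----------|
+| Serial | 0.006842 | 0.028225 | 0.058161 | 0.253599 | 0.49223 | 1.98731 | 4.28381 | 22.123  | 43.947   |
+| Naïve  | 0.380992 | 0.75904  | 0.497632 | 1.1176   | 1.35654 | 3.50685 | 6.65933 | 35.5035 | 0.145952 |
+
+Runtime (microseconds) of serial vs. naïve GPU implementation of prefix scan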
+
+It's difficult to see from the graph itself, but the table makes it clear that the serial implementation's runtime grows much faster with array size than that of my naïve GPU implementation.  This is because the GPU implementation scales roughly with log(n), one wave of parallel additions per depth level, while the serial implementation is on the order of n, since it makes n calculations on the array to get the final result.
+
+
+Naïve vs. Shared Memory Prefix Scan
+
+-N/A-

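Note on the README's scaling explanation: the argument (one wave per depth level versus n serial additions) can be sketched on the CPU. This is a minimal illustration, not the code in this commit. `serialExclusiveScan` and `naiveExclusiveScan` are hypothetical names, and the outer loop of the naive version merely stands in for successive kernel launches. It assumes the double-buffered naive formulation described in the GPU Gems 3 chapter cited in the original README.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Serial exclusive scan: a single pass with n - 1 additions.
std::vector<int> serialExclusiveScan(const std::vector<int>& in) {
    std::vector<int> out(in.size(), 0);
    for (std::size_t i = 1; i < in.size(); ++i) {
        out[i] = out[i - 1] + in[i - 1];
    }
    return out;
}

// CPU emulation of a naive parallel scan (hypothetical helper).
// Each iteration of the outer loop stands in for one kernel launch;
// the inner loop stands in for the threads of that launch.
// `passes` reports how many launches were needed.
std::vector<int> naiveExclusiveScan(const std::vector<int>& in, int& passes) {
    const std::size_t n = in.size();
    passes = 0;
    if (n == 0) return {};

    // Shift right by one so the inclusive scan below yields an exclusive result.
    std::vector<int> cur(n, 0);
    for (std::size_t i = 1; i < n; ++i) {
        cur[i] = in[i - 1];
    }

    // Double buffering: read from `cur`, write to `next`, then swap.
    std::vector<int> next(n, 0);
    for (std::size_t offset = 1; offset < n; offset *= 2) {
        for (std::size_t i = 0; i < n; ++i) {
            next[i] = (i >= offset) ? cur[i] + cur[i - offset] : cur[i];
        }
        std::swap(cur, next);
        ++passes;
    }
    return cur;
}
```

For the 6-element example from the original README, `[ 3 4 6 7 9 10 ]`, both functions return `[ 0 3 7 13 20 29 ]`, and the naive version takes 3 passes. Each pass still touches all n elements, so the naive version performs on the order of n log(n) additions in total; only the number of sequential passes grows as log(n).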
diff --git a/README.md.pdf b/README.md.pdf
diff --git a/data.xlsx b/data.xlsx
diff --git a/project_2/Debug/Project-1_part-2.Build.CppClean.log b/project_2/Debug/Project-1_part-2.Build.CppClean.log
@@ -0,0 +1,7 @@
+D:\Documents\CIS 565\Project-1_part-2\Debug\Project-1_part-2.pdb
+D:\Documents\CIS 565\Project-1_part-2\Project-1_part-2\Debug\link.command.1.tlog
+D:\Documents\CIS 565\Project-1_part-2\Project-1_part-2\Debug\link.read.1.tlog
+D:\Documents\CIS 565\Project-1_part-2\Project-1_part-2\Debug\link.write.1.tlog
+D:\Documents\CIS 565\Project-1_part-2\Project-1_part-2\Debug\matrix_math.cu.cache
+D:\Documents\CIS 565\Project-1_part-2\Project-1_part-2\Debug\Project-1_part-2.exe.intermediate.manifest
+D:\Documents\CIS 565\Project-1_part-2\Project-1_part-2\Debug\Project-1_part-2.write.1.tlog
diff --git a/project_2/Debug/Project-1_part-2.lastbuildstate b/project_2/Debug/Project-1_part-2.lastbuildstate
@@ -0,0 +1,2 @@
+#v4.0:v100
+Debug|Win32|D:\Documents\CIS 565\Project-1_part-2\|
diff --git a/project_2/Debug/Project-1_part-2.log b/project_2/Debug/Project-1_part-2.log
@@ -0,0 +1,14 @@
+Build started 9/27/2014 2:57:45 PM.
+     1>Project "D:\Documents\CIS 565\Project2-StreamCompaction\project_2\Project-1_part-2\Project-1_part-2.vcxproj" on node 2 (clean target(s)).
+     1>CudaClean:
+         cmd.exe /C "C:\Users\Jiatong\AppData\Local\Temp\tmp5bfc9457b194476bb8e71d7240a8faf8.cmd"
+         "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v6.5\bin\nvcc.exe" -ccbin "C:\Program Files (x86)\Microsoft Visual Studio 10.0\VC\bin"  -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v6.5\include" -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v6.5\include" -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v6.5\include"  -G   --keep-dir Debug -maxrregcount=0  --machine 32 --compile   -g   -D_MBCS -Xcompiler "/EHsc /W3 /nologo /Od /Zi /RTC1 /MDd  " -o "D:\Documents\CIS 565\Project2-StreamCompaction\project_2\Project-1_part-2\Win32/Debug/matrix_math.cu.obj" "D:\Documents\CIS 565\Project2-StreamCompaction\project_2\Project-1_part-2\matrix_math.cu" -clean
+
+         D:\Documents\CIS 565\Project2-StreamCompaction\project_2\Project-1_part-2>"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v6.5\bin\nvcc.exe" -ccbin "C:\Program Files (x86)\Microsoft Visual Studio 10.0\VC\bin"  -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v6.5\include" -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v6.5\include" -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v6.5\include"  -G   --keep-dir Debug -maxrregcount=0  --machine 32 --compile   -g   -D_MBCS -Xcompiler "/EHsc /W3 /nologo /Od /Zi /RTC1 /MDd  " -o "D:\Documents\CIS 565\Project2-StreamCompaction\project_2\Project-1_part-2\Win32/Debug/matrix_math.cu.obj" "D:\Documents\CIS 565\Project2-StreamCompaction\project_2\Project-1_part-2\matrix_math.cu" -clean 
+         matrix_math.cu
+         Deleting file "Debug\matrix_math.cu.deps".
+     1>Done Building Project "D:\Documents\CIS 565\Project2-StreamCompaction\project_2\Project-1_part-2\Project-1_part-2.vcxproj" (clean target(s)).
+
+Build succeeded.
+
+Time Elapsed 00:00:00.71