CIS565-Fall-2014 · stephengou · Sep 26, 2014 · Sep 27, 2014 · Sep 28, 2014 · Sep 29, 2014
diff --git a/README.md b/README.md
@@ -3,131 +3,21 @@ Project-2
 
 A Study in Parallel Algorithms : Stream Compaction
 
-# INTRODUCTION
-Many of the algorithms you have learned thus far in your career have typically
-been developed from a serial standpoint.  When it comes to GPUs, we are mainly
-looking at massively parallel work.  Thus, it is necessary to reorient our
-thinking.  In this project, we will be implementing a couple different versions
-of prefix sum.  We will start with a simple single thread serial CPU version,
-and then move to a naive GPU version.  Each part of this homework is meant to
-follow the logic of the previous parts, so please do not do this homework out of
-order.
-
-This project will serve as a stream compaction library that you may use (and
-will want to use) in your
-future projects.  For that reason, we suggest you create proper header and CUDA
-files so that you can reuse this code later.  You may want to create a separate
-cpp file that contains your main function so that you can test the code you
-write.
-
-# OVERVIEW
-Stream compaction is broken down into two parts: (1) scan, and (2) scatter.
-
-## SCAN
-Scan or prefix sum is the summation of the elements in an array such that the
-resulting array is the summation of the terms before it.  Prefix sum can either
-be inclusive, meaning the current term is a summation of all the elements before
-it and itself, or exclusive, meaning the current term is a summation of all
-elements before it excluding itself. 
-
-Inclusive:
-
-In : [ 3 4 6 7 9 10 ]
-
-Out : [ 3 7 13 20 29 39 ]
-
-Exclusive
-
-In : [ 3 4 6 7 9 10 ]
-
-Out : [ 0 3 7 13 20 29 ]
-
-Note that the resulting prefix sum will always be n + 1 elements if the input
-array is of length n.  Similarly, the first element of the exclusive prefix sum
-will always be 0.  In the following sections, all references to prefix sum will
-be to the exclusive version of prefix sum.
-
-## SCATTER
-The scatter section of stream compaction takes the results of the previous scan
-in order to reorder the elements to form a compact array.
-
-For example, let's say we have the following array:
-[ 0 0 3 4 0 6 6 7 0 1 ]
-
-We would only like to consider the non-zero elements in this zero, so we would
-like to compact it into the following array:
-[ 3 4 6 6 7 1 ]
-
-We can perform a transform on input array to transform it into a boolean array:
-
-In :  [ 0 0 3 4 0 6 6 7 0 1 ]
-
-Out : [ 0 0 1 1 0 1 1 1 0 1 ]
-
-Performing a scan on the output, we get the following array :
-
-In :  [ 0 0 1 1 0 1 1 1 0 1 ]
-
-Out : [ 0 0 0 1 2 2 3 4 5 5 ]
-
-Notice that the output array produces a corresponding index array that we can
-use to create the resulting array for stream compaction. 
-
-# PART 1 : REVIEW OF PREFIX SUM
-Given the definition of exclusive prefix sum, please write a serial CPU version
-of prefix sum.  You may write this in the cpp file to separate this from the
-CUDA code you will be writing in your .cu file. 
-
-# PART 2 : NAIVE PREFIX SUM
-We will now parallelize this the previous section's code.  Recall from lecture
-that we can parallelize this using a series of kernel calls.  In this portion,
-you are NOT allowed to use shared memory.
-
-### Questions 
-* Compare this version to the serial version of exclusive prefix scan. Please
-  include a table of how the runtimes compare on different lengths of arrays.
-* Plot a graph of the comparison and write a short explanation of the phenomenon you
-  see here.
-
-# PART 3 : OPTIMIZING PREFIX SUM
-In the previous section we did not take into account shared memory.  In the
-previous section, we kept everything in global memory, which is much slower than
-shared memory.
-
-## PART 3a : Write prefix sum for a single block
-Shared memory is accessible to threads of a block. Please write a version of
-prefix sum that works on a single block.  
-
-## PART 3b : Generalizing to arrays of any length.
-Taking the previous portion, please write a version that generalizes prefix sum
-to arbitrary length arrays, this includes arrays that will not fit on one block.
-
-### Questions
-* Compare this version to the parallel prefix sum using global memory.
-* Plot a graph of the comparison and write a short explanation of the phenomenon
-  you see here.
-
-# PART 4 : ADDING SCATTER
-First create a serial version of scatter by expanding the serial version of
-prefix sum.  Then create a GPU version of scatter.  Combine the function call
-such that, given an array, you can call stream compact and it will compact the
-array for you.  Finally, write a version using thrust. 
-
-### Questions
-* Compare your version of stream compact to your version using thrust.  How do
-  they compare?  How might you optimize yours more, or how might thrust's stream
-  compact be optimized.
-
-# EXTRA CREDIT (+10)
-For extra credit, please optimize your prefix sum for work parallelism and to
-deal with bank conflicts.  Information on this can be found in the GPU Gems
-chapter listed in the references.  
-
-# SUBMISSION
-Please answer all the questions in each of the subsections above and write your
-answers in the README by overwriting the README file.  In future projects, we
-expect your analysis to be similar to the one we have led you through in this
-project.  Like other projects, please open a pull request and email Harmony.
-
-# REFERENCES
-"Parallel Prefix Sum (Scan) with CUDA." GPU Gems 3.
+(For the Part 2 and Part 3 questions)
+# Scan Comparison
+![](scanChart.bmp)
+At first glance, the naive implementation of scan performs worse than the serial version for every N tested.
+This is probably because the parallel algorithm used has a complexity of O(N*log(N)), whereas the serial version is O(N).
+However, when shared memory is used, the GPU version gradually catches up with the serial version as N grows
+and outperforms it at around N = 5,000,000. Shared memory clearly speeds it up, even though it runs the same
+O(N*log(N)) algorithm; the GPU probably catches up at large N because the log(N) factor grows slowly.
+
+Part 4 
+# Stream Compaction Comparison
+![](streamCompactCompare.bmp)
+Both my GPU implementation and Thrust's beat the serial version for every n tested, and the advantage grows as n gets larger.
+Mine is slower than Thrust's because my scan does not use the work-efficient algorithm and does not avoid bank conflicts.
+Those are the two places to improve my implementation's performance.
+
+References
+http://http.developer.nvidia.com/GPUGems3/gpugems3_ch39.html
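
The scan comparison above attributes the naive kernel's slowdown to O(N*log(N)) work versus O(N) for the serial loop. The following is a minimal CPU-side sketch of that argument, assuming the naive kernel follows the Hillis-Steele pattern described in the GPU Gems chapter (kernel.cu is not shown in this diff, so this is an illustration, not the author's code). The names `naiveScan` and `serialScan` are hypothetical; both compute an inclusive scan and report how many additions they performed.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: Hillis-Steele inclusive scan simulated on the CPU.
// Each pass of the outer loop stands in for one kernel launch of a naive
// GPU scan; *adds receives the total number of additions performed.
std::vector<int> naiveScan(const std::vector<int>& in, std::size_t* adds)
{
    assert(adds != nullptr);
    *adds = 0;
    std::vector<int> cur(in), next(in.size());
    for (std::size_t stride = 1; stride < cur.size(); stride *= 2) {
        for (std::size_t i = 0; i < cur.size(); ++i) {
            if (i >= stride) {
                next[i] = cur[i] + cur[i - stride];
                ++*adds;
            } else {
                next[i] = cur[i];  // elements below the stride are already final
            }
        }
        cur.swap(next);  // ping-pong buffers, as a GPU version would
    }
    return cur;
}

// Hypothetical helper: plain serial inclusive scan, N - 1 additions.
std::vector<int> serialScan(const std::vector<int>& in, std::size_t* adds)
{
    assert(adds != nullptr);
    *adds = 0;
    std::vector<int> out(in);
    for (std::size_t i = 1; i < out.size(); ++i) {
        out[i] += out[i - 1];
        ++*adds;
    }
    return out;
}
```

For a six-element input the naive version needs three passes while the serial loop needs a single sweep, and the gap in total additions widens by roughly a factor of log2(N) as N grows. On a GPU those additions run in parallel, which is why the extra work does not translate directly into extra time.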
diff --git a/scanChart.bmp b/scanChart.bmp
diff --git a/streamCompactCompare.bmp b/streamCompactCompare.bmp
diff --git a/streamCompaction/Debug/streamCompaction.ilk b/streamCompaction/Debug/streamCompaction.ilk
diff --git a/streamCompaction/Debug/streamCompaction.pdb b/streamCompaction/Debug/streamCompaction.pdb
diff --git a/streamCompaction/Release/streamCompaction.pdb b/streamCompaction/Release/streamCompaction.pdb
diff --git a/streamCompaction/streamCompaction.sdf b/streamCompaction/streamCompaction.sdf
diff --git a/streamCompaction/streamCompaction.sln b/streamCompaction/streamCompaction.sln
@@ -0,0 +1,20 @@
+
+Microsoft Visual Studio Solution File, Format Version 12.00
+# Visual Studio 2012
+Project("{8BC9CEB8-8B4A-11D0-8D11-00A0C91BC942}") = "streamCompaction", "streamCompaction\streamCompaction.vcxproj", "{D9546217-3526-4668-A4B7-EDEC91D9E1A1}"
+EndProject
+Global
+	GlobalSection(SolutionConfigurationPlatforms) = preSolution
+		Debug|Win32 = Debug|Win32
+		Release|Win32 = Release|Win32
+	EndGlobalSection
+	GlobalSection(ProjectConfigurationPlatforms) = postSolution
+		{D9546217-3526-4668-A4B7-EDEC91D9E1A1}.Debug|Win32.ActiveCfg = Debug|Win32
+		{D9546217-3526-4668-A4B7-EDEC91D9E1A1}.Debug|Win32.Build.0 = Debug|Win32
+		{D9546217-3526-4668-A4B7-EDEC91D9E1A1}.Release|Win32.ActiveCfg = Release|Win32
+		{D9546217-3526-4668-A4B7-EDEC91D9E1A1}.Release|Win32.Build.0 = Release|Win32
+	EndGlobalSection
+	GlobalSection(SolutionProperties) = preSolution
+		HideSolutionNode = FALSE
+	EndGlobalSection
+EndGlobal
diff --git a/streamCompaction/streamCompaction.v11.suo b/streamCompaction/streamCompaction.v11.suo
diff --git a/streamCompaction/streamCompaction/CPU_streamCompaction.cpp b/streamCompaction/streamCompaction/CPU_streamCompaction.cpp
@@ -0,0 +1,31 @@
+#include "CPU_streamCompaction.h"
+//Part 1
+void exPrefixSum(float * input, int n, float * out)
+{
+	float curSum = 0.0f;
+	for(int i = 0;i < n ;i++)
+	{
+		out[i] = curSum;
+		curSum += input[i];
+	}
+	out[n] = curSum; // out holds n+1 values; input[n] is never read
+}
+
+void CPUstreamCompaction(float * input, int n, float * out)
+{
+	float * boolInput = new float[n];
+	for(int i=0;i<n;i++)
+	{
+		boolInput[i] = (input[i] == 0.0f) ? 0.0f : 1.0f;
+	}
+
+	float * scannedBool = new float[n+1];
+	exPrefixSum(boolInput,n,scannedBool);
+
+	for(int i=0;i<n;i++)
+	{
+		if(boolInput[i] > 0.0f) out[(int)scannedBool[i]] = input[i];
+	}
+	delete[] boolInput;
+	delete[] scannedBool;
+}
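
The three steps in `CPUstreamCompaction` (flag, exclusive scan, scatter) can be sketched in a self-contained form as follows. This is a hedged variant rather than the code in this PR: it assumes `std::vector` storage and integer scan indices, and the names `exclusiveScan` and `compactNonZero` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Exclusive prefix sum: out[i] is the sum of in[0..i-1]. The result has
// in.size() + 1 entries, so out.back() is the total.
std::vector<int> exclusiveScan(const std::vector<int>& in)
{
    std::vector<int> out(in.size() + 1, 0);
    for (std::size_t i = 0; i < in.size(); ++i)
        out[i + 1] = out[i] + in[i];
    return out;
}

// Hypothetical vector-based compaction: keeps the non-zero elements of
// `in` in their original order.
std::vector<float> compactNonZero(const std::vector<float>& in)
{
    // Step 1: flag each element that should survive.
    std::vector<int> flags(in.size());
    for (std::size_t i = 0; i < in.size(); ++i)
        flags[i] = (in[i] != 0.0f) ? 1 : 0;

    // Step 2: the exclusive scan of the flags gives each survivor's
    // destination index; the final entry is the survivor count.
    std::vector<int> idx = exclusiveScan(flags);

    // Step 3: scatter survivors to their destinations.
    std::vector<float> out(static_cast<std::size_t>(idx.back()));
    for (std::size_t i = 0; i < in.size(); ++i) {
        if (flags[i]) {
            assert(idx[i] < idx.back());
            out[static_cast<std::size_t>(idx[i])] = in[i];
        }
    }
    return out;
}
```

Applied to the example from the original README, `[ 0 0 3 4 0 6 6 7 0 1 ]`, this yields `[ 3 4 6 6 7 1 ]`. Integer indices are used here because a `float` cannot represent every integer above 2^24 exactly, which could matter for the multi-million-element arrays benchmarked in the README.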
diff --git a/streamCompaction/streamCompaction/CPU_streamCompaction.h b/streamCompaction/streamCompaction/CPU_streamCompaction.h
@@ -0,0 +1,3 @@
+void exPrefixSum(float * input, int n, float * out);
+
+void CPUstreamCompaction(float * input, int n, float * out);
diff --git a/streamCompaction/streamCompaction/Debug/CL.read.1.tlog b/streamCompaction/streamCompaction/Debug/CL.read.1.tlog
diff --git a/streamCompaction/streamCompaction/Debug/CL.write.1.tlog b/streamCompaction/streamCompaction/Debug/CL.write.1.tlog
diff --git a/streamCompaction/streamCompaction/Debug/cl.command.1.tlog b/streamCompaction/streamCompaction/Debug/cl.command.1.tlog
diff --git a/streamCompaction/streamCompaction/Debug/kernel.cu.cache b/streamCompaction/streamCompaction/Debug/kernel.cu.cache
@@ -0,0 +1,49 @@
+Identity=kernel.cu
+AdditionalCompilerOptions=
+AdditionalCompilerOptions=
+AdditionalDependencies=
+AdditionalDeps=
+AdditionalLibraryDirectories=
+AdditionalOptions=
+AdditionalOptions=
+CInterleavedPTX=false
+CodeGeneration=compute_20,sm_20
+CodeGeneration=compute_20,sm_20
+CompileOut=Debug\kernel.cu.obj
+CudaRuntime=Static
+CudaToolkitCustomDir=
+Defines=;_MBCS;
+Emulation=false
+FastMath=false
+GenerateLineInfo=false
+GenerateRelocatableDeviceCode=false
+GPUDebugInfo=true
+GPUDebugInfo=true
+HostDebugInfo=true
+Include=;;C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v6.5\include
+Inputs=
+Keep=false
+KeepDir=Debug
+LinkOut=
+MaxRegCount=0
+NvccCompilation=compile
+NvccPath=
+Optimization=Od
+Optimization=Od
+PerformDeviceLink=
+PtxAsOptionV=false
+RequiredIncludes=
+Runtime=MDd
+Runtime=MDd
+RuntimeChecks=RTC1
+RuntimeChecks=RTC1
+TargetMachinePlatform=32
+TargetMachinePlatform=32
+TypeInfo=
+TypeInfo=
+UseHostDefines=true
+UseHostInclude=true
+UseHostLibraryDependencies=
+UseHostLibraryDirectories=
+Warning=W3
+Warning=W3