Wei-Chien Tu Project2 pull request #8

Open: wants to merge 11 commits into base: master
133 changes: 133 additions & 0 deletions Project Description.md
@@ -0,0 +1,133 @@
Project-2
=========

A Study in Parallel Algorithms: Stream Compaction

# INTRODUCTION
Many of the algorithms you have learned so far in your career were developed
from a serial standpoint. GPUs, by contrast, are built for massively parallel
work, so we need to reorient our thinking. In this project, we will implement
several versions of prefix sum: we start with a simple single-threaded serial
CPU version, move to a naive GPU version, and then optimize it with shared
memory. Each part of this homework builds on the logic of the previous parts,
so please complete the parts in order.

This project will serve as a stream compaction library that you may use (and
will want to use) in your future projects. For that reason, we suggest you
create proper header and CUDA files so that you can reuse this code later. You
may also want to create a separate .cpp file containing your main function so
that you can test the code you write.

# OVERVIEW
Stream compaction is broken down into two parts: (1) scan, and (2) scatter.

## SCAN
Scan, or prefix sum, takes an array and produces an array of the same length in
which each element is a running sum of the input. Prefix sum can be either
inclusive, meaning each output element is the sum of all input elements before
it and itself, or exclusive, meaning each output element is the sum of all
input elements before it, excluding itself.

Inclusive:

In : [ 3 4 6 7 9 10 ]

Out : [ 3 7 13 20 29 39 ]

Exclusive:

In : [ 3 4 6 7 9 10 ]

Out : [ 0 3 7 13 20 29 ]

Note that both versions produce n elements for an input array of length n. The
exclusive scan is the inclusive scan shifted right by one position with 0
inserted at the front, so the first element of an exclusive prefix sum is
always 0 and the total sum of the input does not appear. In the following
sections, all references to prefix sum refer to the exclusive version.
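As a quick cross-check of the two examples above (and later, of your own implementations), the C++17 `<numeric>` header already provides both scans. A minimal sketch; the wrapper names `inclusiveScan` and `exclusiveScan` are hypothetical:

```cpp
#include <cassert>
#include <numeric>
#include <vector>

// Inclusive scan: out[i] = in[0] + ... + in[i]
std::vector<int> inclusiveScan(const std::vector<int>& in) {
    std::vector<int> out(in.size());
    std::inclusive_scan(in.begin(), in.end(), out.begin());
    return out;
}

// Exclusive scan: out[0] = 0 and out[i] = in[0] + ... + in[i-1]
std::vector<int> exclusiveScan(const std::vector<int>& in) {
    std::vector<int> out(in.size());
    std::exclusive_scan(in.begin(), in.end(), out.begin(), 0);  // 0 is the initial value
    return out;
}
```

With the input `[ 3 4 6 7 9 10 ]`, these return the inclusive and exclusive outputs shown above.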

## SCATTER
The scatter step of stream compaction uses the results of the scan to move the
elements we want to keep into a compact array.

For example, let's say we have the following array:
[ 0 0 3 4 0 6 6 7 0 1 ]

We would like to keep only the non-zero elements in this array, so we want to
compact it into the following array:
[ 3 4 6 6 7 1 ]

We can first map the input array to a boolean array, with 1 for each element we keep and 0 for each element we discard:

In : [ 0 0 3 4 0 6 6 7 0 1 ]

Out : [ 0 0 1 1 0 1 1 1 0 1 ]

Performing an exclusive scan on this boolean array, we get the following array:

In : [ 0 0 1 1 0 1 1 1 0 1 ]

Out : [ 0 0 0 1 2 2 3 4 5 5 ]

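Notice that this output is an index array: for every element whose flag is 1,
its scan value is its position in the compacted array. For example, the 7 at
input index 7 has scan value 4, and 7 sits at index 4 of [ 3 4 6 6 7 1 ]. The
length of the compacted array is the last scan value plus the last flag
(5 + 1 = 6).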
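The three steps above (map, scan, scatter) can be sketched serially in C++. The function names are hypothetical and "keep" is assumed to mean "non-zero" as in the example; on the GPU, each loop would become a kernel with one thread per element.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Map: 1 for elements to keep (non-zero), 0 otherwise.
std::vector<int> mapToBoolean(const std::vector<int>& in) {
    std::vector<int> flags(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) flags[i] = (in[i] != 0) ? 1 : 0;
    return flags;
}

// Scan: exclusive prefix sum of the flags gives each kept element's output index.
std::vector<int> exclusiveScan(const std::vector<int>& in) {
    std::vector<int> out(in.size());
    int running = 0;
    for (std::size_t i = 0; i < in.size(); ++i) {
        out[i] = running;  // write before adding, so in[i] is excluded
        running += in[i];
    }
    return out;
}

// Scatter: every kept element writes itself to out[indices[i]].
std::vector<int> scatter(const std::vector<int>& in, const std::vector<int>& flags,
                         const std::vector<int>& indices) {
    assert(in.size() == flags.size() && in.size() == indices.size());
    if (in.empty()) return {};
    // Output length = last scan value + last flag = number of kept elements.
    std::vector<int> out(indices.back() + flags.back());
    for (std::size_t i = 0; i < in.size(); ++i)
        if (flags[i]) out[indices[i]] = in[i];
    return out;
}
```

Because each kept element writes to a distinct index, the scatter loop has no write conflicts, which is what makes it easy to parallelize.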

# PART 1 : REVIEW OF PREFIX SUM
Given the definition of exclusive prefix sum, please write a serial CPU version
of prefix sum. You may write this in a .cpp file to keep it separate from the
CUDA code you will write in your .cu file.
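One possible serial version, written against the declarations in `scan_scatter_CPU.h` from this pull request. This sketch assumes the returned array is allocated with `new[]` and that the caller owns it and must `delete[]` it; the header does not specify ownership.

```cpp
#include <cassert>

// Exclusive: out[i] = arr[0] + ... + arr[i-1]. Caller must delete[] the result (assumption).
int* serialPrefixSumExclusive(int* arr, int N) {
    assert(arr != nullptr || N == 0);
    int* out = new int[N];
    int running = 0;           // sum of arr[0..i-1]
    for (int i = 0; i < N; ++i) {
        out[i] = running;      // write before adding: excludes arr[i]
        running += arr[i];
    }
    return out;
}

// Inclusive: out[i] = arr[0] + ... + arr[i]. Caller must delete[] the result (assumption).
int* serialPrefixSumInclusive(int* arr, int N) {
    assert(arr != nullptr || N == 0);
    int* out = new int[N];
    int running = 0;
    for (int i = 0; i < N; ++i) {
        running += arr[i];     // add before writing: includes arr[i]
        out[i] = running;
    }
    return out;
}
```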

# PART 2 : NAIVE PREFIX SUM
We will now parallelize the previous section's code. Recall from lecture that
we can parallelize this using a series of kernel calls, one per pass of the
scan. In this portion, you are NOT allowed to use shared memory.
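A CPU model of the naive (Hillis-Steele) scan can help when checking your kernel's output. Each outer iteration stands in for one kernel launch and each inner iteration for one thread; the function name and the shift-then-scan approach to producing an exclusive result are assumptions of this sketch, not a required design.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Two buffers are ping-ponged because a thread must read values from the previous
// pass, not values another thread has already overwritten during this pass.
std::vector<int> naiveExclusiveScan(const std::vector<int>& in) {
    const std::size_t n = in.size();
    if (n == 0) return {};
    // Shift right by one and insert 0, so the inclusive passes below yield an exclusive scan.
    std::vector<int> src(n, 0), dst(n);
    for (std::size_t i = 1; i < n; ++i) src[i] = in[i - 1];

    for (std::size_t offset = 1; offset < n; offset *= 2) {  // ceil(log2 n) "launches"
        for (std::size_t k = 0; k < n; ++k) {                  // one "thread" per element
            dst[k] = (k >= offset) ? src[k - offset] + src[k] : src[k];
        }
        std::swap(src, dst);  // on the GPU, the host swaps the global buffers between launches
    }
    return src;
}
```

Every pass touches all n elements, so the naive scan performs O(n log n) additions in total, compared with n - 1 for the serial loop.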

### Questions
* Compare this version to the serial version of exclusive prefix scan. Please
include a table comparing their runtimes for different array lengths.
* Plot a graph of the comparison and write a short explanation of the phenomenon you
see here.
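For the runtime table, a small timing helper is enough. This hypothetical `timeMs` uses `std::chrono::steady_clock` rather than the `diffclock` declared in `scan_scatter_CPU.h`, whose units and argument order the header does not specify.

```cpp
#include <cassert>
#include <chrono>
#include <functional>

// Hypothetical helper: average wall-clock time of fn over reps runs, in milliseconds.
// For GPU code, fn must block until its kernels finish (for example, by calling
// cudaDeviceSynchronize() at its end); otherwise only launch overhead is measured.
double timeMs(const std::function<void()>& fn, int reps) {
    assert(reps > 0);
    fn();  // warm-up run, excluded from the measurement
    auto start = std::chrono::steady_clock::now();
    for (int r = 0; r < reps; ++r) fn();
    auto stop = std::chrono::steady_clock::now();
    std::chrono::duration<double, std::milli> elapsed = stop - start;
    return elapsed.count() / reps;
}
```

Averaging over several repetitions after a warm-up run smooths out one-time costs such as CUDA context creation, which would otherwise dominate small array sizes.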

# PART 3 : OPTIMIZING PREFIX SUM
The previous section kept everything in global memory and did not take
advantage of shared memory, which is much faster than global memory.

## PART 3a : Write prefix sum for a single block
Shared memory is accessible only to the threads of a single block. Please
write a version of prefix sum that works within a single block.
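Within one block, a common structure (as in the naive double-buffered scan in the GPU Gems 3 chapter) loads the input into a shared array twice the block's size and alternates between its two halves. The CPU model below is a sketch: `temp` stands in for the `__shared__` array, the inner loop for the block's threads, and `blockExclusiveScan` is a hypothetical name.

```cpp
#include <cassert>
#include <vector>

// Assumes in.size() is no larger than the number of threads in one block.
std::vector<int> blockExclusiveScan(const std::vector<int>& in) {
    const int n = static_cast<int>(in.size());
    std::vector<int> temp(2 * n);  // models __shared__ int temp[2 * n]
    int pout = 0, pin = 1;         // which half is written / read in the current step
    // Load: thread thid reads in[thid - 1] (0 for thread 0) to make the scan exclusive.
    for (int thid = 0; thid < n; ++thid)
        temp[pout * n + thid] = (thid > 0) ? in[thid - 1] : 0;
    // (__syncthreads() after the load on the GPU)
    for (int offset = 1; offset < n; offset *= 2) {
        pout = 1 - pout;  // swap the two halves of the double buffer
        pin = 1 - pout;
        for (int thid = 0; thid < n; ++thid) {
            if (thid >= offset)
                temp[pout * n + thid] = temp[pin * n + thid] + temp[pin * n + thid - offset];
            else
                temp[pout * n + thid] = temp[pin * n + thid];
        }
        // (__syncthreads() here on the GPU before the next step reads temp)
    }
    return std::vector<int>(temp.begin() + pout * n, temp.begin() + pout * n + n);
}
```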

## PART 3b : Generalizing to arrays of any length
Taking the previous portion, please write a version that generalizes prefix sum
to arrays of arbitrary length, including arrays that will not fit in one block.
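A common way to generalize, described in the GPU Gems 3 chapter, is to scan each block independently, scan the per-block totals, and then add each block's scanned total to every element of that block. A CPU sketch, with a hypothetical `blockSize` parameter standing in for the number of elements one block scans:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

std::vector<int> scanLargeArray(const std::vector<int>& in, std::size_t blockSize) {
    assert(blockSize > 0);
    const std::size_t n = in.size();
    const std::size_t numBlocks = (n + blockSize - 1) / blockSize;
    std::vector<int> out(n), blockSums(numBlocks);

    // Step 1: each block scans its own chunk (exclusive) and records the chunk's total.
    for (std::size_t b = 0; b < numBlocks; ++b) {
        int running = 0;
        const std::size_t end = std::min(n, (b + 1) * blockSize);
        for (std::size_t i = b * blockSize; i < end; ++i) {
            out[i] = running;
            running += in[i];
        }
        blockSums[b] = running;
    }
    // Step 2: exclusive scan of the block totals (on the GPU, a second, smaller scan).
    int running = 0;
    for (std::size_t b = 0; b < numBlocks; ++b) {
        const int total = blockSums[b];
        blockSums[b] = running;
        running += total;
    }
    // Step 3: add each block's offset to every element of that block.
    for (std::size_t i = 0; i < n; ++i) out[i] += blockSums[i / blockSize];
    return out;
}
```

If the number of blocks itself exceeds one block's capacity, step 2 can apply the same procedure recursively.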

### Questions
* Compare this version to the parallel prefix sum using global memory.
* Plot a graph of the comparison and write a short explanation of the phenomenon
you see here.

# PART 4 : ADDING SCATTER
First, create a serial version of scatter by extending the serial version of
prefix sum. Then create a GPU version of scatter. Combine these function calls
so that, given an array, you can call stream compact and it will compact the
array for you. Finally, write a version using Thrust.
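Extending the serial scan into a serial compaction can take a single loop, because the running count of kept elements is exactly the write index of the next kept element. A sketch, assuming "keep" means "non-zero"; `serialCompact` and `referenceCompact` are hypothetical names. `std::copy_if` gives an independent CPU reference to validate the GPU and Thrust results against; Thrust provides `thrust::copy_if` with the same iterator-and-predicate shape.

```cpp
#include <algorithm>
#include <cassert>
#include <iterator>
#include <vector>

// Map, scan, and scatter fused into one pass.
std::vector<int> serialCompact(const std::vector<int>& in) {
    std::vector<int> out(in.size());  // worst case: every element is kept
    int writeIdx = 0;                  // exclusive scan of the boolean flags so far
    for (int x : in) {
        if (x != 0) out[writeIdx++] = x;
    }
    out.resize(writeIdx);
    return out;
}

// CPU reference built on the standard library, for checking other implementations.
std::vector<int> referenceCompact(const std::vector<int>& in) {
    std::vector<int> out;
    std::copy_if(in.begin(), in.end(), std::back_inserter(out), [](int x) { return x != 0; });
    return out;
}
```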

### Questions
* Compare your version of stream compact to your version using Thrust. How do
they compare? How might you optimize yours further, and how might Thrust's
stream compact be optimized?

# EXTRA CREDIT (+10)
For extra credit, please optimize your prefix sum for work efficiency and to
avoid shared memory bank conflicts. Information on both can be found in the
GPU Gems chapter listed in the references.
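The work-efficient algorithm in the GPU Gems 3 chapter (due to Blelloch) performs an up-sweep and a down-sweep over a balanced binary tree, doing O(n) work instead of the naive scan's O(n log n). A CPU model, assuming a power-of-two length (pad with zeros otherwise); `blellochExclusiveScan` and its operation counter are hypothetical. Each inner loop is one parallel step whose iterations would be independent threads on the GPU.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

std::vector<int> blellochExclusiveScan(std::vector<int> a, std::size_t* opCount = nullptr) {
    const std::size_t n = a.size();
    assert(n == 0 || (n & (n - 1)) == 0);  // power-of-two length required
    std::size_t ops = 0;
    if (n > 0) {
        // Up-sweep (reduce): partial sums built in place; a[n-1] ends up holding the total.
        for (std::size_t d = 1; d < n; d *= 2)
            for (std::size_t k = 0; k < n; k += 2 * d) {
                a[k + 2 * d - 1] += a[k + d - 1];
                ++ops;
            }
        // Down-sweep: clear the root, then push prefix sums back down the tree.
        a[n - 1] = 0;
        for (std::size_t d = n / 2; d >= 1; d /= 2)
            for (std::size_t k = 0; k < n; k += 2 * d) {
                const int t = a[k + d - 1];        // left child
                a[k + d - 1] = a[k + 2 * d - 1];   // left child gets the parent's value
                a[k + 2 * d - 1] += t;             // right child gets parent + old left
                ++ops;
            }
    }
    if (opCount) *opCount = ops;
    return a;
}
```

For n = 8, each sweep performs n - 1 = 7 tree operations, 14 in total. Avoiding bank conflicts is a separate GPU-side step (padding the shared-memory indices, as the chapter describes) that this CPU model does not show.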

# SUBMISSION
Please answer all the questions in the subsections above and write your
answers in the README, overwriting its current contents. In future projects, we
expect your analysis to be similar to the one we have led you through in this
project. As with other projects, please open a pull request and email Harmony.

# REFERENCES
"Parallel Prefix Sum (Scan) with CUDA." GPU Gems 3.
35 changes: 35 additions & 0 deletions Project2/.gitignore
@@ -0,0 +1,35 @@
# General
builds/

# Compiled objects
*.o
*.obj

# Compiled dynamic libraries
*.dll
*.so

# Compiled static libraries
*.lib

# Windows specific
[Rr]elease/
[Dd]ebug/
*.suo
*.pdb
*.sdf
*.opensdf
*.user
*.deps
*.ipch

# OSX specific
.DS_Store
*/.DS_Store

# VIM swap files
*.swp

# Exceptions
!*/shared/*

20 changes: 20 additions & 0 deletions Project2/PROJ_WIN/CUDA_ProjectInitial.sln
@@ -0,0 +1,20 @@

Microsoft Visual Studio Solution File, Format Version 11.00
# Visual Studio 2010
Project("{8BC9CEB8-8B4A-11D0-8D11-00A0C91BC942}") = "CUDA_ProjectInitial", "CUDA_ProjectInitial\CUDA_ProjectInitial.vcxproj", "{D7BEFF7A-4902-4B7E-922B-B0417A66864C}"
EndProject
Global
GlobalSection(SolutionConfigurationPlatforms) = preSolution
Debug|Win32 = Debug|Win32
Release|Win32 = Release|Win32
EndGlobalSection
GlobalSection(ProjectConfigurationPlatforms) = postSolution
{D7BEFF7A-4902-4B7E-922B-B0417A66864C}.Debug|Win32.ActiveCfg = Debug|Win32
{D7BEFF7A-4902-4B7E-922B-B0417A66864C}.Debug|Win32.Build.0 = Debug|Win32
{D7BEFF7A-4902-4B7E-922B-B0417A66864C}.Release|Win32.ActiveCfg = Release|Win32
{D7BEFF7A-4902-4B7E-922B-B0417A66864C}.Release|Win32.Build.0 = Release|Win32
EndGlobalSection
GlobalSection(SolutionProperties) = preSolution
HideSolutionNode = FALSE
EndGlobalSection
EndGlobal
108 changes: 108 additions & 0 deletions Project2/PROJ_WIN/CUDA_ProjectInitial/CUDA_ProjectInitial.vcxproj
@@ -0,0 +1,108 @@
<?xml version="1.0" encoding="utf-8"?>
<Project DefaultTargets="Build" ToolsVersion="4.0" xmlns="http://schemas.microsoft.com/developer/msbuild/2003">
<ItemGroup Label="ProjectConfigurations">
<ProjectConfiguration Include="Debug|Win32">
<Configuration>Debug</Configuration>
<Platform>Win32</Platform>
</ProjectConfiguration>
<ProjectConfiguration Include="Release|Win32">
<Configuration>Release</Configuration>
<Platform>Win32</Platform>
</ProjectConfiguration>
</ItemGroup>
<PropertyGroup Label="Globals">
<ProjectGuid>{D7BEFF7A-4902-4B7E-922B-B0417A66864C}</ProjectGuid>
<RootNamespace>Project3</RootNamespace>
<ProjectName>CUDA_ProjectInitial</ProjectName>
</PropertyGroup>
<Import Project="$(VCTargetsPath)\Microsoft.Cpp.Default.props" />
<PropertyGroup Condition="'$(Configuration)|$(Platform)'=='Debug|Win32'" Label="Configuration">
<ConfigurationType>Application</ConfigurationType>
<UseDebugLibraries>true</UseDebugLibraries>
<CharacterSet>MultiByte</CharacterSet>
<PlatformToolset>v100</PlatformToolset>
</PropertyGroup>
<PropertyGroup Condition="'$(Configuration)|$(Platform)'=='Release|Win32'" Label="Configuration">
<ConfigurationType>Application</ConfigurationType>
<UseDebugLibraries>false</UseDebugLibraries>
<WholeProgramOptimization>true</WholeProgramOptimization>
<CharacterSet>MultiByte</CharacterSet>
<PlatformToolset>v100</PlatformToolset>
</PropertyGroup>
<Import Project="$(VCTargetsPath)\Microsoft.Cpp.props" />
<ImportGroup Label="ExtensionSettings">
<Import Project="$(VCTargetsPath)\BuildCustomizations\CUDA 6.5.props" />
</ImportGroup>
<ImportGroup Label="PropertySheets" Condition="'$(Configuration)|$(Platform)'=='Debug|Win32'">
<Import Project="$(UserRootDir)\Microsoft.Cpp.$(Platform).user.props" Condition="exists('$(UserRootDir)\Microsoft.Cpp.$(Platform).user.props')" Label="LocalAppDataPlatform" />
</ImportGroup>
<ImportGroup Label="PropertySheets" Condition="'$(Configuration)|$(Platform)'=='Release|Win32'">
<Import Project="$(UserRootDir)\Microsoft.Cpp.$(Platform).user.props" Condition="exists('$(UserRootDir)\Microsoft.Cpp.$(Platform).user.props')" Label="LocalAppDataPlatform" />
</ImportGroup>
<PropertyGroup Label="UserMacros" />
<PropertyGroup Condition="'$(Configuration)|$(Platform)'=='Debug|Win32'">
<LinkIncremental>false</LinkIncremental>
</PropertyGroup>
<ItemDefinitionGroup Condition="'$(Configuration)|$(Platform)'=='Debug|Win32'">
<ClCompile>
<WarningLevel>Level3</WarningLevel>
<Optimization>Disabled</Optimization>
<AdditionalIncludeDirectories>$(SolutionDir)/shared/glew/include/;$(SolutionDir)/shared/freeglut/include/;%(AdditionalIncludeDirectories)</AdditionalIncludeDirectories>
<PreprocessorDefinitions>WIN32;_DEBUG;_CONSOLE;%(PreprocessorDefinitions)</PreprocessorDefinitions>
</ClCompile>
<Link>
<GenerateDebugInformation>true</GenerateDebugInformation>
<AdditionalLibraryDirectories>$(SolutionDir)/shared/glew/lib;$(SolutionDir)/shared/freeglut/lib;%(AdditionalLibraryDirectories)</AdditionalLibraryDirectories>
<AdditionalDependencies>opengl32.lib;glut32.lib;glew32.lib;freeglut.lib;cudart.lib;%(AdditionalDependencies)</AdditionalDependencies>
<SubSystem>Console</SubSystem>
<EntryPointSymbol>mainCRTStartup</EntryPointSymbol>
</Link>
<CudaCompile>
<Include>$(CudaToolkitIncludeDir)</Include>
<CompileOut>$(ProjectDir)$(Platform)/$(Configuration)/%(Filename)%(Extension).obj</CompileOut>
<GPUDebugInfo>true</GPUDebugInfo>
<GenerateLineInfo>true</GenerateLineInfo>
<HostDebugInfo>true</HostDebugInfo>
<CodeGeneration>compute_20,sm_20;compute_30,sm_30</CodeGeneration>
</CudaCompile>
</ItemDefinitionGroup>
<ItemDefinitionGroup Condition="'$(Configuration)|$(Platform)'=='Release|Win32'">
<ClCompile>
<WarningLevel>Level3</WarningLevel>
<Optimization>MaxSpeed</Optimization>
<FunctionLevelLinking>true</FunctionLevelLinking>
<IntrinsicFunctions>true</IntrinsicFunctions>
<AdditionalIncludeDirectories>$(SolutionDir)/shared/glew/include/;$(SolutionDir)/shared/freeglut/include/;%(AdditionalIncludeDirectories)</AdditionalIncludeDirectories>
</ClCompile>
<Link>
<GenerateDebugInformation>true</GenerateDebugInformation>
<EnableCOMDATFolding>true</EnableCOMDATFolding>
<OptimizeReferences>true</OptimizeReferences>
<AdditionalDependencies>opengl32.lib;freeglut.lib;glew32.lib;cudart.lib;%(AdditionalDependencies)</AdditionalDependencies>
<AdditionalLibraryDirectories>$(SolutionDir)/shared/glew/lib;$(SolutionDir)/shared/freeglut/lib;%(AdditionalLibraryDirectories)</AdditionalLibraryDirectories>
</Link>
<CudaCompile>
<CompileOut>$(ProjectDir)$(Platform)/$(Configuration)/%(Filename)%(Extension).obj</CompileOut>
</CudaCompile>
<CudaCompile>
<Include>$(CudaToolkitIncludeDir)</Include>
</CudaCompile>
</ItemDefinitionGroup>
<ItemGroup>
<ClCompile Include="..\..\src\main.cpp" />
</ItemGroup>
<ItemGroup>
<CudaCompile Include="..\..\src\kernel.cu">
<FileType>Document</FileType>
<CodeGeneration Condition="'$(Configuration)|$(Platform)'=='Debug|Win32'">compute_20,sm_20</CodeGeneration>
</CudaCompile>
</ItemGroup>
<ItemGroup>
<ClInclude Include="..\..\src\kernel.h" />
<ClInclude Include="..\..\src\main.h" />
</ItemGroup>
<Import Project="$(VCTargetsPath)\Microsoft.Cpp.targets" />
<ImportGroup Label="ExtensionTargets">
<Import Project="$(VCTargetsPath)\BuildCustomizations\CUDA 6.5.targets" />
</ImportGroup>
</Project>
@@ -0,0 +1,29 @@
<?xml version="1.0" encoding="utf-8"?>
<Project ToolsVersion="4.0" xmlns="http://schemas.microsoft.com/developer/msbuild/2003">
<ItemGroup>
<ClInclude Include="..\..\src\kernel.h">
<Filter>Header</Filter>
</ClInclude>
<ClInclude Include="..\..\src\main.h">
<Filter>Header</Filter>
</ClInclude>
</ItemGroup>
<ItemGroup>
<Filter Include="Resource">
<UniqueIdentifier>{a80fd6d7-5fd1-4847-8559-ab28bf5e7f41}</UniqueIdentifier>
</Filter>
<Filter Include="Header">
<UniqueIdentifier>{ad476b75-28a6-4e18-ac67-edb7036b0a5b}</UniqueIdentifier>
</Filter>
</ItemGroup>
<ItemGroup>
<ClCompile Include="..\..\src\main.cpp">
<Filter>Resource</Filter>
</ClCompile>
</ItemGroup>
<ItemGroup>
<CudaCompile Include="..\..\src\kernel.cu">
<Filter>Resource</Filter>
</CudaCompile>
</ItemGroup>
</Project>
3 changes: 3 additions & 0 deletions Project2/PROJ_WIN/CUDA_ProjectInitial/scan_scatter_CPU.cpp
@@ -0,0 +1,3 @@
#include "main.h"


14 changes: 14 additions & 0 deletions Project2/PROJ_WIN/CUDA_ProjectInitial/scan_scatter_CPU.h
@@ -0,0 +1,14 @@
#ifndef SCAN_SCATTER_CPU_H
#define SCAN_SCATTER_CPU_H

#include <iostream>
#include <sstream>
#include <fstream>
#include <ctime>

using namespace std;

int* serialPrefixSumInclusive(int* , int);
int* serialPrefixSumExclusive(int* , int);
int* serialScatter(int* arr, int N);

double diffclock( clock_t clock1, clock_t clock2 );

#endif
Empty file.
27 changes: 27 additions & 0 deletions Project2/PROJ_WIN/shared/freeglut/Copying.txt
@@ -0,0 +1,27 @@

Freeglut Copyright
------------------

Freeglut code without an explicit copyright is covered by the following
copyright:

Copyright (c) 1999-2000 Pawel W. Olszta. All Rights Reserved.
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies or substantial portions of the Software.

The above copyright notice and this permission notice shall be included in
all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
PAWEL W. OLSZTA BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER
IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

Except as contained in this notice, the name of Pawel W. Olszta shall not be
used in advertising or otherwise to promote the sale, use or other dealings
in this Software without prior written authorization from Pawel W. Olszta.