Uses the array data structure for ParticleField and pullbacks for AD #13

Open
wants to merge 264 commits into
base: master
Commits
40f856f
passing unit tests
rymanderson Nov 15, 2023
b31c96f
make expansion_order consistent and prep benchmark
rymanderson Nov 25, 2023
1caa44d
prep benchmark with fmm
rymanderson Nov 25, 2023
78bf00b
specify julia-1.6
rymanderson Nov 25, 2023
56d5d34
allow changing pfield.Uinf by using less restrictive type
rymanderson Nov 27, 2023
c106932
bug fix
rymanderson Nov 27, 2023
69bcba8
remove duplicate eltype definition
rymanderson Nov 27, 2023
2c8f93d
add nonzero_sigma parameter
rymanderson Nov 27, 2023
8b6271e
change nonzero_sigma default =false
rymanderson Nov 28, 2023
d9b28e7
initialize particle field with custom float type
rymanderson Dec 5, 2023
1101eb6
merge conflict on workstation
rymanderson Nov 27, 2023
c163a52
ForwardDiff compatibility
rymanderson Dec 6, 2023
ceeaac6
add custom_UJ function
rymanderson Dec 14, 2023
ad44787
working with panels
rymanderson Jan 10, 2024
df9de1a
fix viscous and SFS syntax error
rymanderson Jan 13, 2024
4fdbf47
update fields explicitly in remove_particle
rymanderson Feb 14, 2024
2ec569a
toggle saving particle field
rymanderson Feb 15, 2024
73c6d13
update fmm syntax
rymanderson Feb 21, 2024
07da72c
add custom_UJ function
rymanderson Dec 14, 2023
712650f
working with panels
rymanderson Jan 10, 2024
99ef2f7
fix viscous and SFS syntax error
rymanderson Jan 13, 2024
4acd689
update fmm syntax
rymanderson Feb 21, 2024
9b355c8
merge conflict
rymanderson Feb 23, 2024
29360c1
Remove unused tests
cibinjoseph Feb 26, 2024
0ee4bef
Take out vol
cibinjoseph Feb 26, 2024
03f8065
Remove field sigma
cibinjoseph Feb 26, 2024
710eb00
Comment out field vol
cibinjoseph Feb 26, 2024
859bfba
Remove field circulation
cibinjoseph Feb 26, 2024
27415bc
Remove field Gamma
cibinjoseph Feb 26, 2024
6ddaa10
Remove field X
cibinjoseph Feb 26, 2024
ee6ec70
Remove field U
cibinjoseph Feb 26, 2024
7546d40
Remove field W
cibinjoseph Feb 26, 2024
5049e6c
Remove field PSE
cibinjoseph Feb 26, 2024
7c4202b
Remove field PSE
cibinjoseph Feb 26, 2024
5529f0b
Remove field C
cibinjoseph Feb 26, 2024
9eba49a
Remove field SFS
cibinjoseph Feb 26, 2024
88d928b
Remove field J
cibinjoseph Feb 26, 2024
4264836
Remove field M
cibinjoseph Mar 1, 2024
dbc6203
Correct typo
cibinjoseph Mar 1, 2024
f57754d
Remove field static
cibinjoseph Mar 3, 2024
26726c6
Remove index field
cibinjoseph Mar 4, 2024
01eec2c
Remove Particle struct and maintain an array of particles
cibinjoseph Mar 7, 2024
2a7d0e7
Remove particle file
cibinjoseph Mar 7, 2024
beaf02f
Remove debug statements and correct setindex and getindex functions t…
cibinjoseph Mar 12, 2024
6a49638
Generalize buffer_element() for the particle and remove Particle typing
cibinjoseph Mar 12, 2024
b7895be
Remove comment
cibinjoseph Mar 12, 2024
285b5f4
Add getter and setter functions
cibinjoseph Mar 12, 2024
b6c70dd
Use getter and setter functions in FMM definitions
cibinjoseph Mar 13, 2024
24fca7d
Use the getter and setter functions for resetting the particles
cibinjoseph Mar 13, 2024
0fe84da
Add get and set for the other fields
cibinjoseph Mar 13, 2024
3ad5366
Use getter and setter functions everywhere
cibinjoseph Mar 13, 2024
afba95d
Replace some more functions with getter and setter functions
cibinjoseph Mar 13, 2024
28e5abf
Use the views to make code readability better and improve perf
cibinjoseph Mar 13, 2024
67833ac
Add a variable replacer script for ease of use later if necessary
cibinjoseph Mar 14, 2024
9771fbb
Remove replacer script from later commits
cibinjoseph Mar 14, 2024
ef01eee
Unroll 1:3 loops
cibinjoseph Mar 14, 2024
5602198
Variable substitution using the get functions
cibinjoseph Mar 14, 2024
00b95a3
Remove index field from xdmf file
cibinjoseph Mar 14, 2024
51a0324
Undo the vectorization on 1-3 loops
cibinjoseph Mar 29, 2024
ad595e6
Convert direct to use getter functions
cibinjoseph Mar 29, 2024
12c6c91
Merge pull request #7 from cibinjoseph/cibin
cibinjoseph Apr 4, 2024
e070b22
pullbacks for direct! and timestepping
Apr 11, 2024
9614d1f
actually add rrules
Apr 11, 2024
1e0264b
update rrules
Apr 16, 2024
819b1f0
Change particle field to store particle states in one big matrix. Thi…
Apr 26, 2024
cc54df3
Remove manifest file
cibinjoseph Apr 29, 2024
4da0b74
Ignore manifest.toml
cibinjoseph Apr 29, 2024
4d1089c
Remove indexing that allocates
cibinjoseph Apr 29, 2024
082772a
Merge branch 'master' into derivatives
cibinjoseph Apr 29, 2024
a6b917f
Use getter setter functions
cibinjoseph Apr 29, 2024
f1993eb
Use getter setter functions
cibinjoseph Apr 29, 2024
ab48b2a
Remove particle.jl file
cibinjoseph Apr 29, 2024
bd25b6c
Change Jexa to J in src/ and examples/
cibinjoseph Apr 30, 2024
995db11
remove PyPlot dependency
rymanderson Apr 30, 2024
cf066d6
remove BenchmarkTools from dependencies
rymanderson Apr 30, 2024
a16fb46
update FMM interface
rymanderson Apr 30, 2024
2a3435a
Use get_Gamma() for particle
cibinjoseph Apr 30, 2024
719a7f3
add .swp files to gitignore
rymanderson Apr 30, 2024
2cb847f
remove allocations from inviscid simulation and remove redundant UJ_d…
rymanderson May 1, 2024
1141a5f
add viscous and reformulation unit tests to single_vortexring
rymanderson May 1, 2024
e9e083a
Corrected SFS methods and added tests for them
BTV25 May 2, 2024
9b0954b
shorten testing time, added leapfrog tests but not using them
BTV25 May 2, 2024
5591e5d
Activated leapfrogging tests
BTV25 May 2, 2024
caac7c4
update syntax for specifying float type
rymanderson May 13, 2024
68f0cda
Fix access to C variables in particle states.
May 16, 2024
db9af8a
Merge branch 'derivatives' of https://github.com/byuflowlab/FLOWVPM.j…
May 16, 2024
8f7f849
update FMM argument names
rymanderson May 18, 2024
7f05995
isolate euler step function and add velocity gradient to output files
rymanderson May 22, 2024
9422cad
change FMM default to include shrinking method
rymanderson May 30, 2024
5151a85
Remove Manifest file
cibinjoseph Jun 18, 2024
ebca213
Add simple test for trying out gpu stuff
cibinjoseph Jun 18, 2024
3db3c83
Function direct based on environment variable
cibinjoseph Jun 18, 2024
0d3e75f
Add gpu kernel framework
rymanderson Jun 18, 2024
48194f2
Add julia_history to gitignore
rymanderson Jun 18, 2024
597e1cc
Remove obsolete reference to particle struct
rymanderson Jun 19, 2024
a5e69ed
Add gpu based interaction kernel
rymanderson Jun 20, 2024
488f11f
fmm.direct function overloaded with GPU kernel. Ensure ENV works.
rymanderson Jun 20, 2024
34e9da9
Add a useGPU flag, but not robust yet
rymanderson Jun 21, 2024
c1a9b3e
Add function signature for targets
cibinjoseph Jun 22, 2024
e28a9f5
Make changes to handle running on GPU architecture inside ParticleFie…
cibinjoseph Jun 25, 2024
0d2cf4f
Add a basic test for gpu and cpu particlefields
cibinjoseph Jun 25, 2024
c66f4c5
Add SFS contribution to GPU kernel but in CPU
rymanderson Jun 25, 2024
b0308a8
Add useGPU option
rymanderson Jun 25, 2024
6ba47f9
Add Cuda functions for verification
rymanderson Jun 25, 2024
8d4b64d
Remove unnecessary target variables
rymanderson Jun 25, 2024
6fe45b7
Substitute SpecialFunctions.erf() for custom_erf() and remove Special…
rymanderson Jun 25, 2024
af107d9
Comment out custom g_sgm_dgdr() function for gpu
rymanderson Jun 25, 2024
e2bbdc7
Remove obsolete gpu_g_dgdr function
rymanderson Jun 25, 2024
fefe405
Remove obsolete gpu_g_dgdr function
rymanderson Jun 25, 2024
48a5abd
Pass the g_dgdr() kernel function to the GPU kernel
rymanderson Jun 25, 2024
fb48eb8
Add tests for GPU if device is functional
rymanderson Jun 25, 2024
c0d2f6a
Add tests for GPU if device is functional
rymanderson Jun 25, 2024
e38dcac
Copy only 1:24 fields in target matrix
cibinjoseph Jun 26, 2024
5c225bc
Extend compatibility of custom_erf() to ForwardDiff.Dual type
cibinjoseph Jun 27, 2024
1f29f1c
Add provision for specifying p_max and q_max in get_launch_config
cibinjoseph Jun 29, 2024
41afd04
Bandaid fix for @atomic incompatibility by forcing q=1
cibinjoseph Jun 29, 2024
43139db
Example to test derivatives
cibinjoseph Jun 29, 2024
e945fb9
Typo in arguments
rymanderson Jun 29, 2024
41e150a
Add ForwardDiff import inside gpu_erf definition
rymanderson Jun 29, 2024
0169bf7
Use JacobiElliptic instead of Elliptic for ForwardDiff compatibility
rymanderson Jun 29, 2024
38af91a
Add JacobiElliptic to requirements
rymanderson Jun 29, 2024
d2d63ff
Add flag for no testing
rymanderson Jun 29, 2024
dd8d8ae
Make cross() AD compatible
rymanderson Jun 30, 2024
f6091ff
Replace Cubature for AD compatible HCubature
rymanderson Jun 30, 2024
b5b19ce
Replace JacobiElliptic with EllipticFunctions for better AD compatibi…
cibinjoseph Jun 30, 2024
4efcf35
Changes for AD compatibility of vortexring examples
cibinjoseph Jun 30, 2024
59c6344
Make GPU kernel AD compatible by removing type conversion and shared …
rymanderson Jun 30, 2024
0299bd2
Andrew changes
rymanderson Jul 1, 2024
f5c6ffd
FLOWVPM.jl
rymanderson Jul 1, 2024
71c7a7a
No changes
rymanderson Jul 1, 2024
a354056
Sync all examples and Project.toml file from gpu branch
rymanderson Jul 1, 2024
ee19619
Second set of sync with gpu branch
rymanderson Jul 1, 2024
adb75d2
Second set of sync with gpu branch
rymanderson Jul 1, 2024
cdc3bec
Sync with gpu branch and correct presence of obsolete pfield.particles
rymanderson Jul 1, 2024
11a3730
Next round of syncs with gpu branch
rymanderson Jul 1, 2024
7ad6f04
Sync gpu branch
rymanderson Jul 1, 2024
da1ebd3
Change VectorStrength to Strength for FastMultipole compatibility
rymanderson Jul 1, 2024
eee5f27
Change VectorStrength to Strength for fmm
rymanderson Jul 1, 2024
9aeeb45
Add reset particle function
rymanderson Jul 1, 2024
ce6b5c0
Remove manifest
rymanderson Jul 1, 2024
d9f782c
Remove installation docs python notebook
rymanderson Jul 1, 2024
0ce30d1
Merge changes for AD compatibility for trajectory optimization
rymanderson Jul 1, 2024
f42c0a7
Easy defaults for ncrit and useGPU
rymanderson Jul 1, 2024
0d699c6
Remove ncrit_default
rymanderson Jul 2, 2024
00296bb
Do not use GPU by default
cibinjoseph Jul 2, 2024
b58503a
Add parallel reductions but have to reduce shared memory size later
rymanderson Jul 2, 2024
e17a150
Add elementwise operation for removing particle
cibinjoseph Jul 3, 2024
89eaed8
Basic test for VPM
cibinjoseph Jul 3, 2024
ce318ab
Add convenience function for precompiling GPU kernel
rymanderson Jul 3, 2024
bf0e764
Rename convenience function to warmup_gpu()
rymanderson Jul 3, 2024
99035d8
Set concurrent_direct to pfield.useGPU
cibinjoseph Jul 3, 2024
ec884e0
fmm updated syntax
rymanderson Jul 4, 2024
c85efb3
fix remove particles bug
rymanderson Jul 4, 2024
d0e2205
static particle field is no longer boolean bug
rymanderson Jul 4, 2024
ca41fc6
Ignore bson files
rymanderson Jul 5, 2024
d0abccb
correction to function call
BTV25 Jul 5, 2024
55dae46
use is_static function
rymanderson Jul 5, 2024
c651099
Add parallel reduction for Duals
cibinjoseph Jul 6, 2024
98d7e81
Add @atomic to atomic reduction kernel
rymanderson Jul 6, 2024
00226c7
Correct errors in parallel reduction kernel
rymanderson Jul 7, 2024
24191e2
Correct ncrit setting with useGPU
rymanderson Jul 7, 2024
2d3cd97
Reduce ncrit for gpu for Duals to fit in shared memory
rymanderson Jul 7, 2024
b7d372b
Update example with chunk size
cibinjoseph Jul 7, 2024
f80e1a4
Add ncrit reduction to shared mem overflow error message
cibinjoseph Jul 7, 2024
30b6fa6
Comment out parallel reduction until ForwardDiff.Dual type check is f…
cibinjoseph Jul 7, 2024
34ea879
Correct error in q selection logic
rymanderson Jul 7, 2024
57afc83
Merge gpu branch
cibinjoseph Jul 8, 2024
91d09fc
Change max_threads_per_block for Float32 to 1024
cibinjoseph Jul 9, 2024
ff14ad5
Run function call
cibinjoseph Jul 9, 2024
2b02566
Add max_threads argument to function calls
rymanderson Jul 9, 2024
10a4da6
Changes for target_indices function signature
cibinjoseph Jul 9, 2024
ffce682
Force concurrent_direct=true
rymanderson Jul 9, 2024
7d456ca
Change GPU kernel to cater to target_indices argument
rymanderson Jul 10, 2024
ee74349
Extend to 2 gpus
rymanderson Jul 10, 2024
6aaf266
Add warnings if using low number of GPUs
rymanderson Jul 10, 2024
dff39b3
Bug fix in 2 gpu kernel
rymanderson Jul 10, 2024
3917a25
Bug fix in 2 gpu kernel
rymanderson Jul 10, 2024
41c60ed
Use an hcat with a view
cibinjoseph Jul 10, 2024
6006d19
Make useGPU an integer
rymanderson Jul 10, 2024
ffe74aa
Use expanded target_indices with views for copying data to GPU
rymanderson Jul 10, 2024
a2b6996
Correct typo in expanded_indices
rymanderson Jul 10, 2024
37efc00
Correct typos in 2GPU kernel
rymanderson Jul 10, 2024
51bae8d
Improve warmup_gpu function for multiple gpus
rymanderson Jul 10, 2024
116f611
Correct target_indices typo in warmup_gpu()
rymanderson Jul 10, 2024
2453334
Make ngpu an Int when running code
rymanderson Jul 11, 2024
35dcf3e
Add padding to target size to nearest multiple of 10 for efficient p,…
rymanderson Jul 12, 2024
bfa671b
Benchmarking
rymanderson Jul 12, 2024
3caadbd
Add max threads per block for Pascal
rymanderson Jul 12, 2024
be1036a
update fmm.direct! syntax to use multiple target indices
rymanderson Jul 12, 2024
e5aa5e8
backwards-compatibility for DynamicSFS convenience syntax
rymanderson Jul 12, 2024
b5db201
Add sort function to divisors
rymanderson Jul 12, 2024
9be1cb9
merge conflict
rymanderson Jul 12, 2024
bf7de7c
Fix tests for numerical useGPU
Jul 13, 2024
d494f76
Ignore runfiles and scripts
Jul 13, 2024
35856cb
Add max thread limits and padding in multiples of 32
Jul 13, 2024
b910152
Revert to target_index for cpu direct()
cibinjoseph Jul 16, 2024
ab6f5dc
Make direct_gpu!() use source_indices and target_indices and revert d…
rymanderson Jul 16, 2024
c7eff92
Make compatible with new direct_gpu!() in FastMultipole
rymanderson Jul 17, 2024
398d35a
Add GPU kernel for multiple gpus
rymanderson Jul 17, 2024
5905f85
Correct leapfrog example to use multiple gpus
rymanderson Jul 17, 2024
75b2f65
Bug fixes for multiple gpus
Jul 17, 2024
d8fbc94
Use gpu array list to avoid variable scoping issue in multiple-gpu ke…
Jul 17, 2024
450baec
Set useGPU to 2 as default
Jul 17, 2024
e0acfd4
Reduce parsing of for loops. Break fast
Jul 18, 2024
164d7be
Fix for massive allocations in check_launch()
Jul 18, 2024
bad13ac
Remove T=T in arguments inside kernel
Jul 18, 2024
971a15a
Move toggle_sfs condition outside loops
Jul 19, 2024
dbbf6ab
Adapt warmup function so that it compiles on all available gpus
rymanderson Jul 19, 2024
5456de9
addition to fluiddomain to allow Uinf included
BTV25 Jul 19, 2024
a34bafd
Perform self-interaction on one gpu for entire particle field when le…
rymanderson Jul 28, 2024
0d5243d
Add additional conditions for single self-interaction on gpu
rymanderson Jul 28, 2024
a11c782
Remove unnecessary view for GPU DtoH copy
Jul 28, 2024
ca301de
Reduce register pressure by using Int32 indices
rymanderson Jul 29, 2024
9d211ea
Correct dimension of target_system particles in gpu kernel
Jul 29, 2024
cf387d4
merge conflicts
rymanderson Aug 2, 2024
5b1b556
Correct direct_full logic
rymanderson Aug 9, 2024
fb76ed1
upgrade FastMultipole 0.3.0
rymanderson Oct 16, 2024
6b8ba74
remove extraneous function parameter
rymanderson Oct 16, 2024
6dc1f35
update FastMultipole to version 0.3.0
rymanderson Oct 16, 2024
99f8f2e
add relative error option to fmm
rymanderson Oct 12, 2024
b21015f
Single gpu streams kernel
cibinjoseph Nov 11, 2024
0e6c9d4
Add .case to end of folder name
cibinjoseph Nov 11, 2024
56b0caa
Add .case files to ignore
cibinjoseph Nov 11, 2024
6038fb2
Remove trial kernel
cibinjoseph Nov 12, 2024
59c5839
Add check_shared_memory()
cibinjoseph Nov 12, 2024
e73c936
Add multiple streams multiple gpu kernel
cibinjoseph Nov 12, 2024
65a3a3f
update FastMultipole gpu syntax
rymanderson Nov 12, 2024
2fb4c5b
Correct error in shared memory check function
cibinjoseph Nov 13, 2024
9670101
Change to in-place operation for interaction
cibinjoseph Nov 13, 2024
f7be9ad
Remove debugging override
Nov 13, 2024
fa4e4e7
Correct igpu to istream
Nov 13, 2024
b7d22a2
Changes to gpu kernel
cibinjoseph Nov 13, 2024
f4d4075
update derivativesSwitch
rymanderson Nov 14, 2024
0bdfede
Add nearfield_device!() for gpu direct computation
cibinjoseph Nov 14, 2024
b2122c4
Add gpu kernel changes from Cibin
cibinjoseph Nov 15, 2024
a6ba8de
Correct sort function name
cibinjoseph Nov 15, 2024
4914d85
Correct mistake in expand_source_indices()
cibinjoseph Nov 15, 2024
7b69593
Add documentation for expand_source_indices()
cibinjoseph Nov 15, 2024
0fbb5a4
Add multiple gpus and multiple streams to nearfield_device!()
cibinjoseph Nov 18, 2024
da56a2b
Add gpu kernel for multiple gpus and streams
Nov 18, 2024
bb63c7c
Add multithreading for UJ_direct() over targets
cibinjoseph Dec 4, 2024
9ad9c0c
Use shared memory function check
cibinjoseph Dec 6, 2024
13c7c0f
Correct obsolete function add_SFS
cibinjoseph Dec 10, 2024
f7d2fc8
Remove warmup_gpu temporarily
cibinjoseph Dec 11, 2024
f154a74
Add fully_direct case
cibinjoseph Dec 11, 2024
253c6ff
Use UJ_d as a separate output to prevent redundant copying
cibinjoseph Dec 11, 2024
6406846
Benchmark leapfrog case
cibinjoseph Dec 11, 2024
ead8e76
Use 512 as max threads per block
Dec 18, 2024
ecedbc5
Correct max threads per block
Dec 18, 2024
ccd3219
Correct UJ_d variable
Dec 23, 2024
Use UJ_d as a separate output to prevent redundant copying
cibinjoseph committed Dec 11, 2024
commit 253c6ff7d354d479a128804df48a8939b6d3cec2
51 changes: 28 additions & 23 deletions src/FLOWVPM_fmm.jl
@@ -93,8 +93,8 @@ function fmm.nearfield_device!(
# Sets precision for computations on GPU
T = Float64

# Dummy initialization so that t_d is defined in all lower scopes
t_d_list = Vector{CuArray{T, 2}}(undef, ngpus)
# Dummy initialization so that UJ_d is defined in all lower scopes
UJ_d_list = Vector{CuArray{T, 2}}(undef, ngpus)

if fully_direct && ngpus == 1
ns = get_np(source_systems)
@@ -112,7 +112,8 @@ function fmm.nearfield_device!(
t_size = nt + t_padding

# Copy target particles from CPU to GPU
t_d = CuArray{T}(view(source_systems.particles, 1:24, 1:nt))
t_d = s_d
UJ_d = CUDA.zeros(T, 12, nt)

# Get p, q for optimal GPU kernel launch configuration
# p is no. of targets in a block
@@ -130,13 +131,13 @@ function fmm.nearfield_device!(

# Compute interactions using GPU
kernel = source_systems.kernel.g_dgdr
@cuda threads=threads blocks=blocks shmem=shmem gpu_atomic_direct!(s_d, t_d, Int32(p), Int32(q), kernel)
@cuda threads=threads blocks=blocks shmem=shmem gpu_atomic_direct!(UJ_d, s_d, t_d, Int32(p), Int32(q), kernel)

view(target_systems.particles, 10:12, 1:nt) .= Array(view(t_d, 10:12, :))
view(target_systems.particles, 16:24, 1:nt) .= Array(view(t_d, 16:24, :))
view(target_systems.particles, 10:12, 1:nt) .= Array(view(UJ_d, 1:3, :))
view(target_systems.particles, 16:24, 1:nt) .= Array(view(UJ_d, 4:12, :))

# Clear GPU array to avoid GC pressure
CUDA.unsafe_free!(t_d)
CUDA.unsafe_free!(UJ_d)
else
ileaf = 1
while ileaf <= leaf_count
@@ -167,7 +168,8 @@ function fmm.nearfield_device!(
end

# Copy target particles from CPU to GPU
t_d = CuArray{T}(view(target_systems.particles, 1:24, target_index_range))
t_d = CuArray{T}(view(target_systems.particles, 1:7, target_index_range))
UJ_d = CUDA.zeros(T, 12, nt)
t_size = nt + t_padding

# Get p, q for optimal GPU kernel launch configuration
@@ -185,9 +187,9 @@ function fmm.nearfield_device!(

# Compute interactions using GPU
kernel = source_systems.kernel.g_dgdr
@cuda threads=threads blocks=blocks shmem=shmem gpu_atomic_direct!(s_d, t_d, Int32(p), Int32(q), kernel)
@cuda threads=threads blocks=blocks shmem=shmem gpu_atomic_direct!(UJ_d, s_d, t_d, Int32(p), Int32(q), kernel)

t_d_list[igpu] = t_d
UJ_d_list[igpu] = UJ_d

ileaf_gpu += 1
end
@@ -200,12 +202,12 @@ function fmm.nearfield_device!(
target_index_range = target_tree.branches[target_sources[ileaf_gpu][1]].bodies_index

# Copy results back from GPU to CPU
t_d = t_d_list[igpu]
view(target_systems.particles, 10:12, target_index_range) .= Array(view(t_d, 10:12, :))
view(target_systems.particles, 16:24, target_index_range) .= Array(view(t_d, 16:24, :))
UJ_d = UJ_d_list[igpu]
view(target_systems.particles, 10:12, target_index_range) .= Array(view(UJ_d, 1:3, :))
view(target_systems.particles, 16:24, target_index_range) .= Array(view(UJ_d, 4:12, :))

# Clear GPU array to avoid GC pressure
CUDA.unsafe_free!(t_d)
CUDA.unsafe_free!(UJ_d)

ileaf_gpu += 1
end
@@ -257,8 +259,8 @@ function fmm.nearfield_device!(
# Sets precision for computations on GPU
T = Float64

# Dummy initialization so that t_d is defined in all lower scopes
t_d_list = Vector{CuArray{T, 2}}(undef, nstreams)
# Dummy initialization so that UJ_d is defined in all lower scopes
UJ_d_list = Vector{CuArray{T, 2}}(undef, nstreams)

ileaf = 1
while ileaf <= leaf_count
@@ -290,9 +292,12 @@ function fmm.nearfield_device!(
end

# Copy target particles from CPU to GPU
t_d = CuArray{T}(view(target_systems.particles, 1:24, target_index_range))
t_d = CuArray{T}(view(target_systems.particles, 1:7, target_index_range))
t_size = nt + t_padding

# Initialize output array
UJ_d = CUDA.zeros(T, 12, nt)

# Get p, q for optimal GPU kernel launch configuration
# p is no. of targets in a block
# q is no. of columns per block
@@ -308,9 +313,9 @@ function fmm.nearfield_device!(

# Compute interactions using GPU
kernel = source_systems.kernel.g_dgdr
@cuda threads=threads blocks=blocks stream=streams[istream] shmem=shmem gpu_atomic_direct!(s_d, t_d, Int32(p), Int32(q), kernel)
@cuda threads=threads blocks=blocks stream=streams[istream] shmem=shmem gpu_atomic_direct!(UJ_d, s_d, t_d, Int32(p), Int32(q), kernel)

t_d_list[istream] = t_d
UJ_d_list[istream] = UJ_d

ileaf_stream += 1
igpu = (igpu % ngpus) + 1 # Cycle igpu over 1:ngpus
@@ -327,12 +332,12 @@ function fmm.nearfield_device!(
target_index_range = target_tree.branches[target_sources[ileaf_stream][1]].bodies_index

# Copy results back from GPU to CPU
t_d = t_d_list[istream]
view(target_systems.particles, 10:12, target_index_range) .= Array(view(t_d, 10:12, :))
view(target_systems.particles, 16:24, target_index_range) .= Array(view(t_d, 16:24, :))
UJ_d = UJ_d_list[istream]
view(target_systems.particles, 10:12, target_index_range) .= Array(view(UJ_d, 1:3, :))
view(target_systems.particles, 16:24, target_index_range) .= Array(view(UJ_d, 4:12, :))

# Clear GPU array to avoid GC pressure
CUDA.unsafe_free!(t_d)
CUDA.unsafe_free!(UJ_d)
end

ileaf_stream += 1
24 changes: 7 additions & 17 deletions src/FLOWVPM_gpu.jl
@@ -132,7 +132,7 @@ end

# Each thread handles a single target and uses local GPU memory
# Sources divided into multiple columns and influence is computed by multiple threads
function gpu_atomic_direct!(s, t, p, q, kernel)
function gpu_atomic_direct!(out, s, t, p, q, kernel)
t_size::Int32 = size(t, 2)
s_size::Int32 = size(s, 2)

@@ -200,13 +200,8 @@ function gpu_atomic_direct!(s, t, p, q, kernel)
# Each target will be accessed by q no. of threads
if itarget <= t_size
idim = 1i32
while idim <= 3i32
@inbounds CUDA.@atomic t[9i32+idim, itarget] += UJ[idim]
idim += 1i32
end
idim = 4i32
while idim <= 12i32
@inbounds CUDA.@atomic t[12i32+idim, itarget] += UJ[idim]
@inbounds CUDA.@atomic out[idim, itarget] += UJ[idim]
idim += 1i32
end
end
@@ -219,7 +214,7 @@ end
# Low-storage parallel reduction
# - p is no. of targets per block. Typically same as no. of sources per block.
# - q is no. of columns per tile
function gpu_reduction_direct!(s, t, num_cols, kernel)
function gpu_reduction_direct!(out, s, t, num_cols, kernel)
t_size::Int32 = size(t, 2)
s_size::Int32 = size(s, 2)

@@ -340,15 +335,10 @@ function gpu_reduction_direct!(s, t, num_cols, kernel)
# Now, each col 1 has the net influence of all sources on its target
# Write all data back to global memory
if col == 1
idim = 1
while idim <= 3
@inbounds t[9+idim, itarget] += UJ[idim]
idim += 1
end
idim = 4
while idim<= 12
@inbounds t[12+idim, itarget] += UJ[idim]
idim += 1
idim = 1i32
while idim <= 12i32
@inbounds out[idim, itarget] += UJ[idim]
idim += 1i32
end
end

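Note on the row mapping in this commit: the kernels previously accumulated into rows 10:12 and 16:24 of the 24-row target matrix (`t[9+idim, ...]` and `t[12+idim, ...]`), and now accumulate into a dense 12-row `out` array that is scattered back to those same rows on the host. The CPU-only sketch below illustrates that the two schemes write the same values. It uses plain arrays instead of `CuArray`, and the helper names `accumulate_old!`, `accumulate_new!`, and `copy_back!` are hypothetical, not functions in FLOWVPM. It also assumes the output rows of the particle matrix are zero before the call: because the copy-back uses `.=` and the output buffer starts from zeros, any values already stored in rows 10:12 and 16:24 would be overwritten rather than added to.

```julia
# CPU-only sketch of the row mapping (no CUDA required).
# Assumption: particle states are a 24-row matrix, one column per particle,
# with rows 10:12 and 16:24 receiving the kernel output, as in the copy-back lines of the diff.

# Old scheme: accumulate directly into the 24-row target matrix.
function accumulate_old!(t::AbstractMatrix, UJ::AbstractVector, itarget::Integer)
    for idim in 1:3
        t[9 + idim, itarget] += UJ[idim]     # rows 10:12
    end
    for idim in 4:12
        t[12 + idim, itarget] += UJ[idim]    # rows 16:24
    end
    return t
end

# New scheme: accumulate into a dense 12-row output array.
function accumulate_new!(out::AbstractMatrix, UJ::AbstractVector, itarget::Integer)
    for idim in 1:12
        out[idim, itarget] += UJ[idim]
    end
    return out
end

# Host-side scatter of the dense output back into the particle matrix.
function copy_back!(particles::AbstractMatrix, out::AbstractMatrix, index_range)
    view(particles, 10:12, index_range) .= view(out, 1:3, :)
    view(particles, 16:24, index_range) .= view(out, 4:12, :)
    return particles
end

nt = 4
UJ = collect(1.0:12.0)           # hypothetical per-target contribution

t_old = zeros(24, nt)            # output rows start at zero (assumption)
accumulate_old!(t_old, UJ, 2)

out = zeros(12, nt)
accumulate_new!(out, UJ, 2)
t_new = zeros(24, nt)
copy_back!(t_new, out, 1:nt)

@assert t_old == t_new
```

With the dense output, the host-to-device copy of targets can shrink from rows 1:24 to rows 1:7, as the `CuArray{T}(view(target_systems.particles, 1:7, target_index_range))` lines in the diff show, while the device-to-host copy stays at 12 rows per target.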