Use CPU copy with SharedStorage #445

Merged: 2 commits, Oct 8, 2024

christiangnrd (Contributor) commented Oct 2, 2024

Use CPU copy for shared storage arrays to avoid ObjectiveC.jl overhead.

Is this even a good idea?

Depends on #452
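
As a rough, language-neutral illustration of the idea (not Metal.jl code): when two buffers are both addressable from the CPU, as shared storage is, a copy between them can be an ordinary host-side memory move rather than a command submitted to the GPU. The sketch below uses Python `ctypes` arrays as stand-ins for shared buffers; the helper name `cpu_copy` is hypothetical, and a real implementation would first have to wait for any pending GPU work on those buffers.

```python
import ctypes


def cpu_copy(dest, src, count):
    """Copy `count` elements from src to dest with a host-side memmove.

    `dest` and `src` are ctypes arrays standing in for CPU-addressable
    (shared storage) buffers. This is an analogy, not the Metal.jl API.
    """
    if count > len(src) or count > len(dest):
        raise ValueError("count exceeds buffer length")
    nbytes = count * ctypes.sizeof(src._type_)
    # Plain memory move on the CPU; no GPU command is encoded or submitted.
    ctypes.memmove(dest, src, nbytes)
    return dest


n = 8
FloatBuf = ctypes.c_float * n
src = FloatBuf(*[float(i) for i in range(n)])
dest = FloatBuf()  # zero-initialized
cpu_copy(dest, src, n)
print(list(dest))
```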

github-actions bot (Contributor) left a comment


Metal Benchmarks

| Benchmark suite | Current: e9ac0d2 | Previous: ff7c7eb | Ratio |
|---|---|---|---|
| private array/construct | 27208.333333333332 ns | 26687.5 ns | 1.02 |
| private array/broadcast | 455584 ns | 465979.5 ns | 0.98 |
| private array/random/randn/Float32 | 1011500 ns | 993270.5 ns | 1.02 |
| private array/random/randn!/Float32 | 631583 ns | 632166.5 ns | 1.00 |
| private array/random/rand!/Int64 | 577417 ns | 568500 ns | 1.02 |
| private array/random/rand!/Float32 | 586000 ns | 583500 ns | 1.00 |
| private array/random/rand/Int64 | 877125 ns | 880458 ns | 1.00 |
| private array/random/rand/Float32 | 703750 ns | 844333.5 ns | 0.83 |
| private array/copyto!/gpu_to_gpu | 622250 ns | 614333 ns | 1.01 |
| private array/copyto!/cpu_to_gpu | 692250 ns | 739479 ns | 0.94 |
| private array/copyto!/gpu_to_cpu | 594083.5 ns | 599208 ns | 0.99 |
| private array/accumulate/1d | 1434083 ns | 1447750.5 ns | 0.99 |
| private array/accumulate/2d | 1479500 ns | 1496375 ns | 0.99 |
| private array/iteration/findall/int | 2218500 ns | 2263917 ns | 0.98 |
| private array/iteration/findall/bool | 2002187.5 ns | 1989875 ns | 1.01 |
| private array/iteration/findfirst/int | 1688250 ns | 1678000 ns | 1.01 |
| private array/iteration/findfirst/bool | 1650625 ns | 1663625 ns | 0.99 |
| private array/iteration/scalar | 2399750 ns | 2393834 ns | 1.00 |
| private array/iteration/logical | 3446416 ns | 3431520.5 ns | 1.00 |
| private array/iteration/findmin/1d | 1757084 ns | 1794125 ns | 0.98 |
| private array/iteration/findmin/2d | 1358875 ns | 1403416 ns | 0.97 |
| private array/reductions/reduce/1d | 800917 ns | 805792 ns | 0.99 |
| private array/reductions/reduce/2d | 700479.5 ns | 704146 ns | 0.99 |
| private array/reductions/mapreduce/1d | 811125 ns | 815812.5 ns | 0.99 |
| private array/reductions/mapreduce/2d | 701166.5 ns | 716666.5 ns | 0.98 |
| private array/permutedims/4d | 947645.5 ns | 943959 ns | 1.00 |
| private array/permutedims/2d | 950791 ns | 938875 ns | 1.01 |
| private array/permutedims/3d | 1007916 ns | 1005416.5 ns | 1.00 |
| private array/copy | 876354.5 ns | 862875 ns | 1.02 |
| latency/precompile | 4414162875 ns | 4407793041 ns | 1.00 |
| latency/ttfp | 6916084749.5 ns | 6915521687.5 ns | 1.00 |
| latency/import | 726415791.5 ns | 726643917 ns | 1.00 |
| integration/metaldevrt | 743792 ns | 749270.5 ns | 0.99 |
| integration/byval/slices=1 | 1482750 ns | 1557959 ns | 0.95 |
| integration/byval/slices=3 | 8832249.5 ns | 8832020.5 ns | 1.00 |
| integration/byval/reference | 1515979 ns | 1611291 ns | 0.94 |
| integration/byval/slices=2 | 2747375 ns | 2583750 ns | 1.06 |
| kernel/indexing | 469583 ns | 476584 ns | 0.99 |
| kernel/indexing_checked | 444083 ns | 441500 ns | 1.01 |
| kernel/launch | 11125 ns | 10875 ns | 1.02 |
| metal/synchronization/stream | 19292 ns | 19208 ns | 1.00 |
| metal/synchronization/context | 19792 ns | 19750 ns | 1.00 |
| shared array/construct | 24017.416666666664 ns | 23756.916666666664 ns | 1.01 |
| shared array/broadcast | 466625 ns | 469584 ns | 0.99 |
| shared array/random/randn/Float32 | 1024625 ns | 1020166 ns | 1.00 |
| shared array/random/randn!/Float32 | 632917 ns | 634458 ns | 1.00 |
| shared array/random/rand!/Int64 | 579292 ns | 572000 ns | 1.01 |
| shared array/random/rand!/Float32 | 598750 ns | 593208.5 ns | 1.01 |
| shared array/random/rand/Int64 | 862833 ns | 742792 ns | 1.16 |
| shared array/random/rand/Float32 | 883625 ns | 898812.5 ns | 0.98 |
| shared array/copyto!/gpu_to_gpu | 97125 ns | 659667 ns | 0.15 |
| shared array/copyto!/cpu_to_gpu | 87542 ns | 94458 ns | 0.93 |
| shared array/copyto!/gpu_to_cpu | 82041 ns | 84333 ns | 0.97 |
| shared array/accumulate/1d | 1434500 ns | 1418250 ns | 1.01 |
| shared array/accumulate/2d | 1492917 ns | 1500167 ns | 1.00 |
| shared array/iteration/findall/int | 1972125 ns | 1939666 ns | 1.02 |
| shared array/iteration/findall/bool | 1780625 ns | 1746333 ns | 1.02 |
| shared array/iteration/findfirst/int | 1405208 ns | 1413458 ns | 0.99 |
| shared array/iteration/findfirst/bool | 1369834 ns | 1374750 ns | 1.00 |
| shared array/iteration/scalar | 187667 ns | 189167 ns | 0.99 |
| shared array/iteration/logical | 3193624.5 ns | 3212770.5 ns | 0.99 |
| shared array/iteration/findmin/1d | 1460500 ns | 1481709 ns | 0.99 |
| shared array/iteration/findmin/2d | 1374084 ns | 1379250 ns | 1.00 |
| shared array/reductions/reduce/1d | 673729 ns | 659583 ns | 1.02 |
| shared array/reductions/reduce/2d | 698209 ns | 706354 ns | 0.99 |
| shared array/reductions/mapreduce/1d | 631187 ns | 620667 ns | 1.02 |
| shared array/reductions/mapreduce/2d | 706416.5 ns | 704958.5 ns | 1.00 |
| shared array/permutedims/4d | 954291 ns | 963438 ns | 0.99 |
| shared array/permutedims/2d | 918604 ns | 939020.5 ns | 0.98 |
| shared array/permutedims/3d | 1013459 ns | 1003520.5 ns | 1.01 |
| shared array/copy | 239958.5 ns | 880541 ns | 0.27 |

This comment was automatically generated by workflow using github-action-benchmark.
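
For reading the table: the Ratio column is Current divided by Previous, so a value below 1 means the new commit is faster. The short Python sketch below uses three rows copied from the table to turn a ratio into a speedup factor; the helper names `ratio` and `speedup` are illustrative and not part of any benchmark tooling.

```python
# Rows copied from the benchmark table: (name, current_ns, previous_ns)
rows = [
    ("shared array/copyto!/gpu_to_gpu", 97125.0, 659667.0),
    ("shared array/copy", 239958.5, 880541.0),
    ("private array/copyto!/gpu_to_gpu", 622250.0, 614333.0),
]


def ratio(current_ns, previous_ns):
    """Value reported in the Ratio column: below 1.0 means faster."""
    return current_ns / previous_ns


def speedup(current_ns, previous_ns):
    """How many times faster the current commit is than the previous one."""
    return previous_ns / current_ns


for name, cur, prev in rows:
    print(f"{name}: ratio {ratio(cur, prev):.2f}, speedup {speedup(cur, prev):.1f}x")
```

The two shared array rows are the ones this change targets; the private array row is included as a control and stays near 1.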

christiangnrd added the speculative ("Not sure if we want this.") and performance ("Gotta go fast.") labels on Oct 4, 2024
christiangnrd marked this pull request as draft on October 4, 2024 17:39
maleadt (Member) commented Oct 7, 2024

> Is this even a good idea?

I think so; we have similar optimizations in CUDA.jl with unified memory. Copies from and to CPU memory are blocking anyway.

christiangnrd marked this pull request as ready for review on October 7, 2024 17:16
christiangnrd removed the speculative ("Not sure if we want this.") label on Oct 7, 2024
maleadt merged commit c4c0e28 into main on Oct 8, 2024
2 checks passed
maleadt deleted the fastercopy branch on October 8, 2024 08:27
maleadt referenced this pull request Jan 8, 2025