ci: combine workflows #1023
39d6228 to 2ffa2f7
Benchmark Results (ASV)
Benchmark Plots
A plot of the benchmark results has been uploaded as an artifact to the workflow run for this PR.
Remaining comments which cannot be posted as a review comment to avoid GitHub Rate Limit
JuliaFormatter
[JuliaFormatter] reported by reviewdog 🐶
@testset "eltype $T, size $sz, $training $act" for (T, sz, training, act) in TEST_BLOCKS[1] |
[JuliaFormatter] reported by reviewdog 🐶
Lux.jl/lib/LuxLib/test/normalization/instancenorm_tests.jl
Lines 96 to 97 in 2bff3e4
@testitem "Instance Norm: Group 2" tags=[:normalization] setup=[ | |
SharedTestSetup, InstanceNormSetup] begin |
[JuliaFormatter] reported by reviewdog 🐶
@testset "eltype $T, size $sz, $training $act" for (T, sz, training, act) in TEST_BLOCKS[2] |
[JuliaFormatter] reported by reviewdog 🐶
Lux.jl/lib/LuxLib/test/normalization/instancenorm_tests.jl
Lines 107 to 108 in 2bff3e4
@testitem "Instance Norm: Group 3" tags=[:normalization] setup=[ | |
SharedTestSetup, InstanceNormSetup] begin |
[JuliaFormatter] reported by reviewdog 🐶
@testset "eltype $T, size $sz, $training $act" for (T, sz, training, act) in TEST_BLOCKS[3] |
[JuliaFormatter] reported by reviewdog 🐶
Lux.jl/lib/LuxLib/test/normalization/instancenorm_tests.jl
Lines 118 to 119 in 2bff3e4
@testitem "Instance Norm: Group 4" tags=[:normalization] setup=[ | |
SharedTestSetup, InstanceNormSetup] begin |
[JuliaFormatter] reported by reviewdog 🐶
@testset "eltype $T, size $sz, $training $act" for (T, sz, training, act) in TEST_BLOCKS[4] |
[JuliaFormatter] reported by reviewdog 🐶
Lux.jl/lib/LuxLib/test/normalization/instancenorm_tests.jl
Lines 129 to 130 in 2bff3e4
@testitem "Instance Norm: Group 5" tags=[:normalization] setup=[ | |
SharedTestSetup, InstanceNormSetup] begin |
[JuliaFormatter] reported by reviewdog 🐶
@testset "eltype $T, size $sz, $training $act" for (T, sz, training, act) in TEST_BLOCKS[5] |
[JuliaFormatter] reported by reviewdog 🐶
Lux.jl/lib/LuxLib/test/normalization/layernorm_tests.jl
Lines 92 to 93 in 2bff3e4
@testitem "Layer Norm: Group 1" tags=[:normalization] setup=[ | |
SharedTestSetup, LayerNormSetup] begin |
[JuliaFormatter] reported by reviewdog 🐶
@testset "eltype $T, size $x_shape, $act" for (T, x_shape, affine_shape, act) in TEST_BLOCKS[1] |
[JuliaFormatter] reported by reviewdog 🐶
Lux.jl/lib/LuxLib/test/normalization/layernorm_tests.jl
Lines 103 to 104 in 2bff3e4
@testitem "Layer Norm: Group 2" tags=[:normalization] setup=[ | |
SharedTestSetup, LayerNormSetup] begin |
[JuliaFormatter] reported by reviewdog 🐶
@testset "eltype $T, size $x_shape, $act" for (T, x_shape, affine_shape, act) in TEST_BLOCKS[2] |
[JuliaFormatter] reported by reviewdog 🐶
Lux.jl/lib/LuxLib/test/normalization/layernorm_tests.jl
Lines 114 to 115 in 2bff3e4
@testitem "Layer Norm: Group 3" tags=[:normalization] setup=[ | |
SharedTestSetup, LayerNormSetup] begin |
[JuliaFormatter] reported by reviewdog 🐶
@testset "eltype $T, size $x_shape, $act" for (T, x_shape, affine_shape, act) in TEST_BLOCKS[3] |
[JuliaFormatter] reported by reviewdog 🐶
Lux.jl/lib/LuxLib/test/normalization/layernorm_tests.jl
Lines 125 to 126 in 2bff3e4
@testitem "Layer Norm: Group 4" tags=[:normalization] setup=[ | |
SharedTestSetup, LayerNormSetup] begin |
[JuliaFormatter] reported by reviewdog 🐶
@testset "eltype $T, size $x_shape, $act" for (T, x_shape, affine_shape, act) in TEST_BLOCKS[4] |
[JuliaFormatter] reported by reviewdog 🐶
Lux.jl/lib/LuxLib/test/normalization/layernorm_tests.jl
Lines 136 to 137 in 2bff3e4
@testitem "Layer Norm: Group 5" tags=[:normalization] setup=[ | |
SharedTestSetup, LayerNormSetup] begin |
[JuliaFormatter] reported by reviewdog 🐶
@testset "eltype $T, size $x_shape, $act" for (T, x_shape, affine_shape, act) in TEST_BLOCKS[5] |
[JuliaFormatter] reported by reviewdog 🐶
@testitem "Layer Norm: Error Checks" tags=[:normalization] setup=[SharedTestSetup] begin |
[JuliaFormatter] reported by reviewdog 🐶
@testitem "batched_mul" tags=[:misc] setup=[SharedTestSetup, BatchedMMSetup] begin |
[JuliaFormatter] reported by reviewdog 🐶
Lux.jl/lib/LuxLib/test/others/bmm_tests.jl
Lines 130 to 131 in 2bff3e4
@testitem "batched_mul: trivial dimensions & unit strides" tags=[:misc] setup=[ | |
SharedTestSetup, BatchedMMSetup] begin |
[JuliaFormatter] reported by reviewdog 🐶
Lux.jl/lib/LuxLib/test/others/bmm_tests.jl
Lines 162 to 163 in 2bff3e4
@testitem "BatchedAdjOrTrans interface" tags=[:misc] setup=[ | |
SharedTestSetup, BatchedMMSetup] begin |
[JuliaFormatter] reported by reviewdog 🐶
Lux.jl/lib/LuxLib/test/others/bmm_tests.jl
Lines 229 to 230 in 2bff3e4
@testitem "batched_matmul(ndims < 3)" tags=[:misc] setup=[ | |
SharedTestSetup, BatchedMMSetup] begin |
[JuliaFormatter] reported by reviewdog 🐶
Lux.jl/lib/LuxLib/test/others/bmm_tests.jl
Line 260 in 2bff3e4
@testitem "BMM AutoDiff" tags=[:misc] setup=[SharedTestSetup, BatchedMMSetup] begin |
[JuliaFormatter] reported by reviewdog 🐶
Lux.jl/lib/LuxLib/test/others/bmm_tests.jl
Lines 271 to 279 in 2bff3e4
@test_gradients(fn, aType(randn(rng, Float32, M, P, B)),
    aType(randn(rng, Float32, P, Q, B)); atol=1e-3, rtol=1e-3,
    skip_backends=[AutoEnzyme()])
@test_gradients(fn, batched_adjoint(aType(randn(rng, Float32, P, M, B))),
    aType(randn(rng, Float32, P, Q, B)); atol=1e-3, rtol=1e-3,
    skip_backends=[AutoEnzyme()])
@test_gradients(fn, aType(randn(rng, Float32, M, P, B)),
    batched_transpose(aType(randn(rng, Float32, Q, P, B))); atol=1e-3,
    rtol=1e-3, skip_backends=[AutoEnzyme()])
[JuliaFormatter] reported by reviewdog 🐶
Lux.jl/lib/LuxLib/test/others/bmm_tests.jl
Lines 283 to 301 in 2bff3e4
@test_gradients(fn, aType(randn(rng, Float32, M, P)),
    aType(randn(rng, Float32, P, Q, B)); atol=1e-3, rtol=1e-3,
    skip_backends=[AutoEnzyme()])
@test_gradients(fn, adjoint(aType(randn(rng, Float32, P, M))),
    aType(randn(rng, Float32, P, Q, B)); atol=1e-3, rtol=1e-3,
    skip_backends=[AutoEnzyme()])
@test_gradients(fn, aType(randn(rng, Float32, M, P)),
    batched_adjoint(aType(randn(rng, Float32, Q, P, B))); atol=1e-3, rtol=1e-3,
    skip_backends=[AutoEnzyme()])
@test_gradients(fn, aType(randn(rng, Float32, M, P)),
    aType(randn(rng, Float32, P, Q, B)); atol=1e-3, rtol=1e-3,
    skip_backends=[AutoEnzyme()])
@test_gradients(fn, adjoint(aType(randn(rng, Float32, P, M))),
    aType(randn(rng, Float32, P, Q, B)); atol=1e-3, rtol=1e-3,
    skip_backends=[AutoEnzyme()])
@test_gradients(fn, aType(randn(rng, Float32, M, P)),
    batched_adjoint(aType(randn(rng, Float32, Q, P, B))); atol=1e-3, rtol=1e-3,
    skip_backends=[AutoEnzyme()])
[JuliaFormatter] reported by reviewdog 🐶
Lux.jl/lib/LuxLib/test/others/bmm_tests.jl
Lines 305 to 313 in 2bff3e4
@test_gradients(fn, aType(randn(rng, Float32, M, P, 1)),
    aType(randn(rng, Float32, P, Q, B)); atol=1e-3, rtol=1e-3,
    skip_backends=[AutoEnzyme()])
@test_gradients(fn, batched_transpose(aType(randn(rng, Float32, P, M, 1))),
    aType(randn(rng, Float32, P, Q, B)); atol=1e-3, rtol=1e-3,
    skip_backends=[AutoEnzyme()])
@test_gradients(fn, aType(randn(rng, Float32, M, P, 1)),
    batched_transpose(aType(randn(rng, Float32, Q, P, B))); atol=1e-3,
    rtol=1e-3, skip_backends=[AutoEnzyme()])
[JuliaFormatter] reported by reviewdog 🐶
Lux.jl/lib/LuxLib/test/others/bmm_tests.jl
Line 318 in 2bff3e4
@testitem "BMM Tracker AoS" tags=[:misc] setup=[SharedTestSetup, BatchedMMSetup] begin |
[JuliaFormatter] reported by reviewdog 🐶
@testitem "Efficient JVPs" tags=[:misc] setup=[SharedTestSetup] begin |
[JuliaFormatter] reported by reviewdog 🐶
@testitem "ForwardDiff dropout" tags=[:misc] setup=[SharedTestSetup] begin |
[JuliaFormatter] reported by reviewdog 🐶
@testitem "internal_operation_mode: Wrapped Arrays" tags=[:misc] setup=[SharedTestSetup] begin |
[JuliaFormatter] reported by reviewdog 🐶
x = rand(Float32, 4, 3) |> aType |
[JuliaFormatter] reported by reviewdog 🐶
@testitem "Matmul: StaticArrays" tags=[:misc] setup=[SharedTestSetup] begin |
[JuliaFormatter] reported by reviewdog 🐶
@testitem "Aqua: Quality Assurance" tags=[:misc] begin |
[JuliaFormatter] reported by reviewdog 🐶
Lux.jl/lib/LuxLib/test/others/qa_tests.jl
Line 14 in 2bff3e4
@testitem "Explicit Imports" tags=[:misc] setup=[SharedTestSetup] begin |
[JuliaFormatter] reported by reviewdog 🐶
Lux.jl/test/contrib/debug_tests.jl
Line 1 in 2bff3e4
@testitem "Debugging Tools: DimensionMismatch" setup=[SharedTestSetup] tags=[:contrib] begin |
[JuliaFormatter] reported by reviewdog 🐶
Lux.jl/test/contrib/debug_tests.jl
Line 46 in 2bff3e4
@testitem "Debugging Tools: NaN" setup=[SharedTestSetup] tags=[:contrib] begin |
[JuliaFormatter] reported by reviewdog 🐶
Lux.jl/test/contrib/freeze_tests.jl
Line 1 in 2bff3e4
@testitem "All Parameter Freezing" setup=[SharedTestSetup] tags=[:contrib] begin |
[JuliaFormatter] reported by reviewdog 🐶
Lux.jl/test/contrib/freeze_tests.jl
Line 66 in 2bff3e4
@testitem "Partial Freezing" setup=[SharedTestSetup] tags=[:contrib] begin |
[JuliaFormatter] reported by reviewdog 🐶
Lux.jl/test/contrib/map_tests.jl
Line 1 in 2bff3e4
@testitem "Layer Map" setup=[SharedTestSetup] tags=[:contrib] begin |
[JuliaFormatter] reported by reviewdog 🐶
@testitem "Parameter Sharing" setup=[SharedTestSetup] tags=[:contrib] begin |
[JuliaFormatter] reported by reviewdog 🐶
Lux.jl/test/helpers/compact_tests.jl
Line 1 in 2bff3e4
@testitem "@compact" setup=[SharedTestSetup] tags=[:helpers] begin |
[JuliaFormatter] reported by reviewdog 🐶
Lux.jl/test/helpers/compact_tests.jl
Line 442 in 2bff3e4
@testitem "@compact error checks" setup=[SharedTestSetup] tags=[:helpers] begin |
[JuliaFormatter] reported by reviewdog 🐶
Lux.jl/test/helpers/loss_tests.jl
Line 1 in 2bff3e4
@testitem "LuxOps.xlogx & LuxOps.xlogy" setup=[SharedTestSetup] tags=[:helpers] begin |
[JuliaFormatter] reported by reviewdog 🐶
Lux.jl/test/helpers/loss_tests.jl
Line 58 in 2bff3e4
@testitem "Regression Loss" setup=[SharedTestSetup] tags=[:helpers] begin |
[JuliaFormatter] reported by reviewdog 🐶
Lux.jl/test/helpers/loss_tests.jl
Line 102 in 2bff3e4
@testitem "Classification Loss" setup=[SharedTestSetup] tags=[:helpers] begin |
[JuliaFormatter] reported by reviewdog 🐶
Lux.jl/test/helpers/loss_tests.jl
Line 286 in 2bff3e4
@testitem "Other Losses" setup=[SharedTestSetup] tags=[:helpers] begin |
[JuliaFormatter] reported by reviewdog 🐶
Lux.jl/test/helpers/loss_tests.jl
Line 407 in 2bff3e4
@testitem "Losses: Error Checks and Misc" setup=[SharedTestSetup] tags=[:helpers] begin |
[JuliaFormatter] reported by reviewdog 🐶
@testitem "Size Propagator" setup=[SharedTestSetup] tags=[:helpers] begin |
[JuliaFormatter] reported by reviewdog 🐶
@testitem "Size Propagator" setup=[SharedTestSetup] tags=[:helpers] begin |
[JuliaFormatter] reported by reviewdog 🐶
Lux.jl/test/helpers/stateful_tests.jl
Line 1 in 2bff3e4
@testitem "Simple Stateful Tests" setup=[SharedTestSetup] tags=[:helpers] begin |
[JuliaFormatter] reported by reviewdog 🐶
Lux.jl/test/helpers/training_tests.jl
Line 1 in 2bff3e4
@testitem "TrainState" setup=[SharedTestSetup] tags=[:helpers] begin |
[JuliaFormatter] reported by reviewdog 🐶
Lux.jl/test/helpers/training_tests.jl
Line 22 in 2bff3e4
@testitem "AbstractADTypes" setup=[SharedTestSetup] tags=[:helpers] begin |
[JuliaFormatter] reported by reviewdog 🐶
Lux.jl/test/helpers/training_tests.jl
Line 53 in 2bff3e4
@testitem "Training API" setup=[SharedTestSetup] tags=[:helpers] begin |
[JuliaFormatter] reported by reviewdog 🐶
Lux.jl/test/helpers/training_tests.jl
Line 128 in 2bff3e4
@testitem "Enzyme: Invalidate Cache on State Update" setup=[SharedTestSetup] tags=[:helpers] skip=:(using LuxTestUtils; !LuxTestUtils.ENZYME_TESTING_ENABLED) begin |
[JuliaFormatter] reported by reviewdog 🐶
Lux.jl/test/helpers/training_tests.jl
Line 199 in 2bff3e4
@testitem "Compiled ReverseDiff" setup=[SharedTestSetup] tags=[:helpers] begin |
[JuliaFormatter] reported by reviewdog 🐶
Lux.jl/test/layers/basic_tests.jl
Lines 314 to 315 in 2bff3e4
x = zeros(Float32, 2, 1) |> aType | |
y = zeros(Float32, 1, 1) |> aType |
[JuliaFormatter] reported by reviewdog 🐶
Lux.jl/test/layers/basic_tests.jl
Line 332 in 2bff3e4
x = randn(Float32, 2, 1) |> aType |
[JuliaFormatter] reported by reviewdog 🐶
Lux.jl/test/layers/basic_tests.jl
Line 335 in 2bff3e4
ps, st = Lux.setup(rng, layer) |> dev |
[JuliaFormatter] reported by reviewdog 🐶
Lux.jl/test/layers/basic_tests.jl
Line 344 in 2bff3e4
x = randn(Float32, 2, 1) |> aType |
[JuliaFormatter] reported by reviewdog 🐶
Lux.jl/test/layers/basic_tests.jl
Line 347 in 2bff3e4
ps, st = Lux.setup(rng, layer) |> dev |
[JuliaFormatter] reported by reviewdog 🐶
Lux.jl/test/layers/basic_tests.jl
Line 359 in 2bff3e4
@testitem "Embedding" setup=[SharedTestSetup] tags=[:core_layers] begin |
[JuliaFormatter] reported by reviewdog 🐶
Line 1 in 2bff3e4
@testitem "Aqua: Quality Assurance" tags=[:others] begin |
[JuliaFormatter] reported by reviewdog 🐶
Line 13 in 2bff3e4
@testitem "Explicit Imports: Quality Assurance" tags=[:others] begin |
[JuliaFormatter] reported by reviewdog 🐶
Line 15 in 2bff3e4
import Lux, ComponentArrays, ReverseDiff, SimpleChains, Tracker, Zygote, Enzyme |
[JuliaFormatter] reported by reviewdog 🐶
Line 33 in 2bff3e4
@testitem "doctests: Quality Assurance" tags=[:others] begin |
[JuliaFormatter] reported by reviewdog 🐶
Lines 8 to 9 in 2bff3e4
"core_layers", "contrib", "helpers", "distributed", "normalize_layers", | |
"others", "autodiff", "recurrent_layers", "fluxcompat"] |
[JuliaFormatter] reported by reviewdog 🐶
Lines 104 to 105 in 2bff3e4
@testset "eltype_mismath_handling: $option" for option in ( | |
"none", "warn", "convert", "error") |
[JuliaFormatter] reported by reviewdog 🐶
Line 121 in 2bff3e4
Int, get(ENV, "RETESTITEMS_NWORKERS", string(min(Hwloc.num_physical_cores(), 4)))) |
[JuliaFormatter] reported by reviewdog 🐶
Lines 128 to 129 in 2bff3e4
ReTestItems.runtests(Lux; tags=(tag == "all" ? nothing : [Symbol(tag)]), | |
nworkers=RETESTITEMS_NWORKERS, testitem_timeout=2400) |
[JuliaFormatter] reported by reviewdog 🐶
Lux.jl/test/transform/flux_tests.jl
Lines 1 to 2 in 2bff3e4
@testitem "FromFluxAdaptor" setup=[SharedTestSetup] tags=[:fluxcompat] begin | |
import Flux |
[JuliaFormatter] reported by reviewdog 🐶
Lux.jl/test/transform/flux_tests.jl
Lines 15 to 17 in 2bff3e4
models = [Flux.Chain(Flux.Dense(2 => 5), Flux.Dense(5 => 1)), | |
Flux.Chain(; l1=Flux.Dense(2 => 5), l2=Flux.Dense(5 => 1))] .|> | |
fdev(dev) |
[JuliaFormatter] reported by reviewdog 🐶
Lux.jl/test/transform/flux_tests.jl
Line 20 in 2bff3e4
x = rand(Float32, 2, 1) |> aType |
[JuliaFormatter] reported by reviewdog 🐶
Lux.jl/test/transform/flux_tests.jl
Line 23 in 2bff3e4
ps, st = Lux.setup(StableRNG(12345), model_lux) .|> dev |
[JuliaFormatter] reported by reviewdog 🐶
Lux.jl/test/transform/flux_tests.jl
Line 28 in 2bff3e4
ps, st = Lux.setup(StableRNG(12345), model_lux) .|> dev |
[JuliaFormatter] reported by reviewdog 🐶
Lux.jl/test/transform/flux_tests.jl
Lines 35 to 36 in 2bff3e4
model = Flux.Maxout(() -> Flux.Dense(2 => 5), 4) |> fdev(dev) | |
x = rand(Float32, 2, 1) |> aType |
[JuliaFormatter] reported by reviewdog 🐶
Lux.jl/test/transform/flux_tests.jl
Line 39 in 2bff3e4
ps, st = Lux.setup(StableRNG(12345), model_lux) .|> dev |
[JuliaFormatter] reported by reviewdog 🐶
Lux.jl/test/transform/flux_tests.jl
Line 44 in 2bff3e4
ps, st = Lux.setup(StableRNG(12345), model_lux) .|> dev |
[JuliaFormatter] reported by reviewdog 🐶
Lux.jl/test/transform/flux_tests.jl
Lines 50 to 51 in 2bff3e4
model = Flux.SkipConnection(Flux.Dense(2 => 2), +) |> fdev(dev) | |
x = rand(Float32, 2, 1) |> aType |
[JuliaFormatter] reported by reviewdog 🐶
Lux.jl/test/transform/flux_tests.jl
Line 54 in 2bff3e4
ps, st = Lux.setup(StableRNG(12345), model_lux) .|> dev |
[JuliaFormatter] reported by reviewdog 🐶
Lux.jl/test/transform/flux_tests.jl
Line 59 in 2bff3e4
ps, st = Lux.setup(StableRNG(12345), model_lux) .|> dev |
[JuliaFormatter] reported by reviewdog 🐶
Lux.jl/test/transform/flux_tests.jl
Lines 65 to 67 in 2bff3e4
models = [Flux.Parallel(+, Flux.Dense(2 => 2), Flux.Dense(2 => 2)), | |
Flux.Parallel(+; l1=Flux.Dense(2 => 2), l2=Flux.Dense(2 => 2))] .|> | |
fdev(dev) |
[JuliaFormatter] reported by reviewdog 🐶
Lux.jl/test/transform/flux_tests.jl
Line 70 in 2bff3e4
x = rand(Float32, 2, 1) |> aType |
[JuliaFormatter] reported by reviewdog 🐶
Lux.jl/test/transform/flux_tests.jl
Line 73 in 2bff3e4
ps, st = Lux.setup(StableRNG(12345), model_lux) .|> dev |
[JuliaFormatter] reported by reviewdog 🐶
Lux.jl/test/transform/flux_tests.jl
Line 78 in 2bff3e4
ps, st = Lux.setup(StableRNG(12345), model_lux) .|> dev |
[JuliaFormatter] reported by reviewdog 🐶
Lux.jl/test/transform/flux_tests.jl
Lines 85 to 87 in 2bff3e4
model = Flux.PairwiseFusion(+, Flux.Dense(2 => 2), Flux.Dense(2 => 2)) |> | |
fdev(dev) | |
x = (rand(Float32, 2, 1), rand(Float32, 2, 1)) .|> aType |
[JuliaFormatter] reported by reviewdog 🐶
Lux.jl/test/transform/flux_tests.jl
Line 90 in 2bff3e4
ps, st = Lux.setup(StableRNG(12345), model_lux) .|> dev |
[JuliaFormatter] reported by reviewdog 🐶
Lux.jl/test/transform/flux_tests.jl
Line 95 in 2bff3e4
ps, st = Lux.setup(StableRNG(12345), model_lux) .|> dev |
[JuliaFormatter] reported by reviewdog 🐶
Lux.jl/test/transform/flux_tests.jl
Lines 103 to 105 in 2bff3e4
for model in [Flux.Dense(2 => 4) |> fdev(dev), | |
Flux.Dense(2 => 4; bias=false) |> fdev(dev)] | |
x = randn(Float32, 2, 4) |> aType |
[JuliaFormatter] reported by reviewdog 🐶
Lux.jl/test/transform/flux_tests.jl
Line 108 in 2bff3e4
ps, st = Lux.setup(StableRNG(12345), model_lux) .|> dev |
[JuliaFormatter] reported by reviewdog 🐶
Lux.jl/test/transform/flux_tests.jl
Line 113 in 2bff3e4
ps, st = Lux.setup(StableRNG(12345), model_lux) .|> dev |
[JuliaFormatter] reported by reviewdog 🐶
Lux.jl/test/transform/flux_tests.jl
Lines 120 to 122 in 2bff3e4
for model in [ | |
Flux.Scale(2) |> fdev(dev), Flux.Scale(2; bias=false) |> fdev(dev)] | |
x = randn(Float32, 2, 4) |> aType |
[JuliaFormatter] reported by reviewdog 🐶
Lux.jl/test/transform/flux_tests.jl
Line 125 in 2bff3e4
ps, st = Lux.setup(StableRNG(12345), model_lux) .|> dev |
[JuliaFormatter] reported by reviewdog 🐶
Lux.jl/test/transform/flux_tests.jl
Line 130 in 2bff3e4
ps, st = Lux.setup(StableRNG(12345), model_lux) .|> dev |
[JuliaFormatter] reported by reviewdog 🐶
Lux.jl/test/transform/flux_tests.jl
Lines 137 to 140 in 2bff3e4
for model in [Flux.Bilinear((2, 3) => 5) |> fdev(dev), | |
Flux.Bilinear((2, 3) => 5; bias=false) |> fdev(dev)] | |
x = randn(Float32, 2, 4) |> aType | |
y = randn(Float32, 3, 4) |> aType |
[JuliaFormatter] reported by reviewdog 🐶
Lux.jl/test/transform/flux_tests.jl
Line 143 in 2bff3e4
ps, st = Lux.setup(StableRNG(12345), model_lux) .|> dev |
[JuliaFormatter] reported by reviewdog 🐶
Lux.jl/test/transform/flux_tests.jl
Line 148 in 2bff3e4
ps, st = Lux.setup(StableRNG(12345), model_lux) .|> dev |
[JuliaFormatter] reported by reviewdog 🐶
Lux.jl/test/transform/flux_tests.jl
Lines 155 to 156 in 2bff3e4
model = Flux.Embedding(16 => 4) |> fdev(dev) | |
x = rand(1:16, 2, 4) |> aType |
[JuliaFormatter] reported by reviewdog 🐶
Lux.jl/test/transform/flux_tests.jl
Line 159 in 2bff3e4
ps, st = Lux.setup(StableRNG(12345), model_lux) .|> dev |
[JuliaFormatter] reported by reviewdog 🐶
Lux.jl/test/transform/flux_tests.jl
Line 164 in 2bff3e4
ps, st = Lux.setup(StableRNG(12345), model_lux) .|> dev |
[JuliaFormatter] reported by reviewdog 🐶
Lux.jl/test/transform/flux_tests.jl
Lines 172 to 173 in 2bff3e4
model = Flux.Conv((3, 3), 1 => 2) |> fdev(dev) | |
x = rand(Float32, 6, 6, 1, 4) |> aType |
[JuliaFormatter] reported by reviewdog 🐶
Lux.jl/test/transform/flux_tests.jl
Line 176 in 2bff3e4
ps, st = Lux.setup(StableRNG(12345), model_lux) .|> dev |
[JuliaFormatter] reported by reviewdog 🐶
Lux.jl/test/transform/flux_tests.jl
Lines 180 to 181 in 2bff3e4
model = Flux.Conv((3, 3), 1 => 2; pad=Flux.SamePad()) |> fdev(dev) | |
x = rand(Float32, 6, 6, 1, 4) |> aType |
[JuliaFormatter] reported by reviewdog 🐶
Lux.jl/test/transform/flux_tests.jl
Line 184 in 2bff3e4
ps, st = Lux.setup(StableRNG(12345), model_lux) .|> dev |
[JuliaFormatter] reported by reviewdog 🐶
Lux.jl/test/transform/flux_tests.jl
Line 189 in 2bff3e4
ps, st = Lux.setup(StableRNG(12345), model_lux) .|> dev |
[JuliaFormatter] reported by reviewdog 🐶
Lux.jl/test/transform/flux_tests.jl
Lines 195 to 196 in 2bff3e4
model = Flux.CrossCor((3, 3), 1 => 2) |> fdev(dev) | |
x = rand(Float32, 6, 6, 1, 4) |> aType |
[JuliaFormatter] reported by reviewdog 🐶
Lux.jl/test/transform/flux_tests.jl
Line 199 in 2bff3e4
ps, st = Lux.setup(StableRNG(12345), model_lux) .|> dev |
[JuliaFormatter] reported by reviewdog 🐶
Lux.jl/test/transform/flux_tests.jl
Lines 203 to 204 in 2bff3e4
model = Flux.CrossCor((3, 3), 1 => 2; pad=Flux.SamePad()) |> fdev(dev) | |
x = rand(Float32, 6, 6, 1, 4) |> aType |
[JuliaFormatter] reported by reviewdog 🐶
Lux.jl/test/transform/flux_tests.jl
Line 207 in 2bff3e4
ps, st = Lux.setup(StableRNG(12345), model_lux) .|> dev |
[JuliaFormatter] reported by reviewdog 🐶
Lux.jl/test/transform/flux_tests.jl
Line 212 in 2bff3e4
ps, st = Lux.setup(StableRNG(12345), model_lux) .|> dev |
[JuliaFormatter] reported by reviewdog 🐶
Lux.jl/test/transform/flux_tests.jl
Lines 218 to 219 in 2bff3e4
model = Flux.ConvTranspose((3, 3), 1 => 2) |> fdev(dev) | |
x = rand(Float32, 6, 6, 1, 4) |> aType |
[JuliaFormatter] reported by reviewdog 🐶
Lux.jl/test/transform/flux_tests.jl
Line 222 in 2bff3e4
ps, st = Lux.setup(StableRNG(12345), model_lux) .|> dev |
[JuliaFormatter] reported by reviewdog 🐶
Lux.jl/test/transform/flux_tests.jl
Lines 226 to 227 in 2bff3e4
model = Flux.ConvTranspose((3, 3), 1 => 2; pad=Flux.SamePad()) |> fdev(dev) | |
x = rand(Float32, 6, 6, 1, 4) |> aType |
[JuliaFormatter] reported by reviewdog 🐶
Lux.jl/test/transform/flux_tests.jl
Line 230 in 2bff3e4
ps, st = Lux.setup(StableRNG(12345), model_lux) .|> dev |
[JuliaFormatter] reported by reviewdog 🐶
Lux.jl/test/transform/flux_tests.jl
Line 235 in 2bff3e4
ps, st = Lux.setup(StableRNG(12345), model_lux) .|> dev |
[JuliaFormatter] reported by reviewdog 🐶
Lux.jl/test/transform/flux_tests.jl
Lines 243 to 244 in 2bff3e4
model = Flux.AdaptiveMaxPool((2, 2)) |> fdev(dev) | |
x = rand(Float32, 6, 6, 1, 4) |> aType |
[JuliaFormatter] reported by reviewdog 🐶
Lux.jl/test/transform/flux_tests.jl
Line 247 in 2bff3e4
ps, st = Lux.setup(StableRNG(12345), model_lux) .|> dev |
[JuliaFormatter] reported by reviewdog 🐶
Lux.jl/test/transform/flux_tests.jl
Lines 253 to 254 in 2bff3e4
model = Flux.AdaptiveMeanPool((2, 2)) |> fdev(dev) | |
x = rand(Float32, 6, 6, 1, 4) |> aType |
[JuliaFormatter] reported by reviewdog 🐶
Lux.jl/test/transform/flux_tests.jl
Line 257 in 2bff3e4
ps, st = Lux.setup(StableRNG(12345), model_lux) .|> dev |
[JuliaFormatter] reported by reviewdog 🐶
Lux.jl/test/transform/flux_tests.jl
Lines 263 to 264 in 2bff3e4
model = Flux.MaxPool((2, 2)) |> fdev(dev) | |
x = rand(Float32, 6, 6, 1, 4) |> aType |
[JuliaFormatter] reported by reviewdog 🐶
Lux.jl/test/transform/flux_tests.jl
Line 267 in 2bff3e4
ps, st = Lux.setup(StableRNG(12345), model_lux) .|> dev |
[JuliaFormatter] reported by reviewdog 🐶
Lux.jl/test/transform/flux_tests.jl
Lines 273 to 274 in 2bff3e4
model = Flux.MeanPool((2, 2)) |> fdev(dev) | |
x = rand(Float32, 6, 6, 1, 4) |> aType |
[JuliaFormatter] reported by reviewdog 🐶
Lux.jl/test/transform/flux_tests.jl
Line 277 in 2bff3e4
ps, st = Lux.setup(StableRNG(12345), model_lux) .|> dev |
[JuliaFormatter] reported by reviewdog 🐶
Lux.jl/test/transform/flux_tests.jl
Lines 283 to 284 in 2bff3e4
model = Flux.GlobalMaxPool() |> fdev(dev) | |
x = rand(Float32, 6, 6, 1, 4) |> aType |
[JuliaFormatter] reported by reviewdog 🐶
Lux.jl/test/transform/flux_tests.jl
Line 287 in 2bff3e4
ps, st = Lux.setup(StableRNG(12345), model_lux) .|> dev |
[JuliaFormatter] reported by reviewdog 🐶
Lux.jl/test/transform/flux_tests.jl
Lines 293 to 294 in 2bff3e4
model = Flux.GlobalMeanPool() |> fdev(dev) | |
x = rand(Float32, 6, 6, 1, 4) |> aType |
[JuliaFormatter] reported by reviewdog 🐶
Lux.jl/test/transform/flux_tests.jl
Line 297 in 2bff3e4
ps, st = Lux.setup(StableRNG(12345), model_lux) .|> dev |
[JuliaFormatter] reported by reviewdog 🐶
Lux.jl/test/transform/flux_tests.jl
Lines 305 to 306 in 2bff3e4
model = Flux.Upsample(5) |> fdev(dev) | |
x = rand(Float32, 2, 2, 2, 1) |> aType |
[JuliaFormatter] reported by reviewdog 🐶
Lux.jl/test/transform/flux_tests.jl
Line 309 in 2bff3e4
ps, st = Lux.setup(StableRNG(12345), model_lux) .|> dev |
[JuliaFormatter] reported by reviewdog 🐶
Lux.jl/test/transform/flux_tests.jl
Lines 316 to 317 in 2bff3e4
model = Flux.PixelShuffle(2) |> fdev(dev) | |
x = randn(Float32, 2, 2, 4, 1) |> aType |
[JuliaFormatter] reported by reviewdog 🐶
Lux.jl/test/transform/flux_tests.jl
Line 320 in 2bff3e4
ps, st = Lux.setup(StableRNG(12345), model_lux) .|> dev |
[JuliaFormatter] reported by reviewdog 🐶
Lux.jl/test/transform/flux_tests.jl
Line 329 in 2bff3e4
model = Flux.RNNCell(2 => 3) |> fdev(dev) |
[JuliaFormatter] reported by reviewdog 🐶
Lux.jl/test/transform/flux_tests.jl
Line 335 in 2bff3e4
model = Flux.LSTMCell(2 => 3) |> fdev(dev) |
[JuliaFormatter] reported by reviewdog 🐶
Lux.jl/test/transform/flux_tests.jl
Line 341 in 2bff3e4
model = Flux.GRUCell(2 => 3) |> fdev(dev) |
[JuliaFormatter] reported by reviewdog 🐶
Lux.jl/test/transform/flux_tests.jl
Lines 349 to 350 in 2bff3e4
model = Flux.BatchNorm(2) |> fdev(dev) | |
x = randn(Float32, 2, 4) |> aType |
[JuliaFormatter] reported by reviewdog 🐶
Lux.jl/test/transform/flux_tests.jl
Line 358 in 2bff3e4
x = randn(Float32, 2, 2, 2, 1) |> aType |
[JuliaFormatter] reported by reviewdog 🐶
Lux.jl/test/transform/flux_tests.jl
Line 363 in 2bff3e4
ps, st = Lux.setup(StableRNG(12345), model_lux) .|> dev |
[JuliaFormatter] reported by reviewdog 🐶
Lux.jl/test/transform/flux_tests.jl
Line 369 in 2bff3e4
ps, st = Lux.setup(StableRNG(12345), model_lux) .|> dev |
[JuliaFormatter] reported by reviewdog 🐶
Lux.jl/test/transform/flux_tests.jl
Lines 376 to 377 in 2bff3e4
model = Flux.GroupNorm(4, 2) |> fdev(dev) | |
x = randn(Float32, 2, 2, 4, 1) |> aType |
[JuliaFormatter] reported by reviewdog 🐶
Lux.jl/test/transform/flux_tests.jl
Line 380 in 2bff3e4
ps, st = Lux.setup(StableRNG(12345), model_lux) .|> dev |
[JuliaFormatter] reported by reviewdog 🐶
Lux.jl/test/transform/flux_tests.jl
Line 386 in 2bff3e4
ps, st = Lux.setup(StableRNG(12345), model_lux) .|> dev |
[JuliaFormatter] reported by reviewdog 🐶
Lux.jl/test/transform/flux_tests.jl
Line 392 in 2bff3e4
ps, st = Lux.setup(StableRNG(12345), model_lux) .|> dev |
[JuliaFormatter] reported by reviewdog 🐶
Lux.jl/test/transform/flux_tests.jl
Lines 399 to 400 in 2bff3e4
model = Flux.LayerNorm(4) |> fdev(dev) | |
x = randn(Float32, 4, 4, 4, 1) |> aType |
[JuliaFormatter] reported by reviewdog 🐶
Lux.jl/test/transform/flux_tests.jl
Line 403 in 2bff3e4
ps, st = Lux.setup(StableRNG(12345), model_lux) .|> dev |
[JuliaFormatter] reported by reviewdog 🐶
Lux.jl/test/transform/flux_tests.jl
Lines 410 to 411 in 2bff3e4
model = Flux.InstanceNorm(4) |> fdev(dev) | |
x = randn(Float32, 4, 4, 4, 1) |> aType |
[JuliaFormatter] reported by reviewdog 🐶
Lux.jl/test/transform/flux_tests.jl
Lines 424 to 425 in 2bff3e4
x = randn(Float32, 2, 4) |> aType | |
ps, st = Lux.setup(StableRNG(12345), model) .|> dev |
[JuliaFormatter] reported by reviewdog 🐶
Lux.jl/test/transform/flux_tests.jl
Lines 438 to 439 in 2bff3e4
x = randn(Float32, 2, 4) |> aType | |
ps, st = Lux.setup(StableRNG(12345), model) .|> dev |
[JuliaFormatter] reported by reviewdog 🐶
Lux.jl/test/transform/flux_tests.jl
Lines 460 to 461 in 2bff3e4
c = CustomFluxLayer(randn(10), randn(10)) |> fdev(dev) | |
x = randn(10) |> aType |
[JuliaFormatter] reported by reviewdog 🐶
@testitem "ToSimpleChainsAdaptor" setup=[SharedTestSetup] tags=[:others] begin |
[JuliaFormatter] reported by reviewdog 🐶
Line 1 in 2bff3e4
@testitem "replicate" setup=[SharedTestSetup] tags=[:others] begin |
[JuliaFormatter] reported by reviewdog 🐶
Line 10 in 2bff3e4
@testitem "istraining" tags=[:others] begin |
[JuliaFormatter] reported by reviewdog 🐶
Line 24 in 2bff3e4
@testitem "ComponentArrays edge cases" tags=[:others] begin |
[JuliaFormatter] reported by reviewdog 🐶
Line 34 in 2bff3e4
@testitem "multigate" setup=[SharedTestSetup] tags=[:others] begin |
[JuliaFormatter] reported by reviewdog 🐶
Line 71 in 2bff3e4
@testitem "ComponentArrays" setup=[SharedTestSetup] tags=[:others] begin |
[JuliaFormatter] reported by reviewdog 🐶
Line 127 in 2bff3e4
@testitem "FP Conversions" setup=[SharedTestSetup] tags=[:others] begin |
[JuliaFormatter] reported by reviewdog 🐶
Line 163 in 2bff3e4
@testitem "Edge Cases" tags=[:others] begin |
[JuliaFormatter] reported by reviewdog 🐶
Line 190 in 2bff3e4
@testitem "Recursive Utils" tags=[:others] begin |
[JuliaFormatter] reported by reviewdog 🐶
Line 193 in 2bff3e4
struct functorABC{A, B} |
[JuliaFormatter] reported by reviewdog 🐶
Line 263 in 2bff3e4
@testitem "Functors Compatibility" setup=[SharedTestSetup] tags=[:others] begin |
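Many of the lines flagged above use the pipe operator, either plain (`x = rand(Float32, 4, 3) |> aType`) or broadcast (`ps, st = Lux.setup(rng, layer) .|> dev`). The formatter's suggested replacement is not shown in these comments, so the following is only a minimal sketch of what the two forms mean in Julia. The function `move` is a hypothetical stand-in for the `aType` / `dev` helpers from the test setup, not their actual definition.

```julia
# `move` is a hypothetical stand-in for the `aType` / `dev` helpers in the tests.
move(x) = Float64.(x)

x = Float32[1, 2, 3]

# `x |> f` is ordinary function application: it means `f(x)`.
piped = x |> move
called = move(x)

# `.|>` broadcasts the pipe, so each element of a tuple is passed to `f` separately.
# This mirrors the `Lux.setup(...) .|> dev` pattern, where a 2-tuple is returned.
ps, st = (Float32[1, 2], Float32[3]) .|> move

println(piped == called)
println((ps, st) == (move(Float32[1, 2]), move(Float32[3])))
```

Both spellings of each call are semantically equivalent, so a change between them should only affect style, not test behaviour.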
Lux Benchmarks
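The Ratio column in the table that follows appears to be the current timing divided by the previous one, so values below 1 indicate that the current commit is faster. A minimal sketch that recomputes a few ratios from values listed in the table; the helper name `ratio` and the two-decimal rounding are assumptions made for illustration, not part of the benchmark tooling.

```julia
# Assumed definition: current time divided by previous time, rounded to two decimals.
ratio(current_ns, previous_ns) = round(current_ns / previous_ns; digits=2)

# (label, current ns, previous ns) copied from rows of the table.
rows = [
    ("layernorm(2, gelu) forward, CPU 2 threads", 4333.5, 4625.0),
    ("layernorm(2, gelu) forward, CPU 4 threads", 4583.0, 4084.0),
    ("bias_activation(32, relu) forward, CPU 1 thread", 1042.0, 3542.0),
    ("batchnorm(4, relu) forward, CPU 4 threads", 37792.0, 47250.0),
]

for (label, current_ns, previous_ns) in rows
    println(rpad(label, 50), ratio(current_ns, previous_ns))
end
```

The recomputed values match the reported ratios for these rows (0.94, 1.12, 0.29, 0.80), which is consistent with the assumed definition.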
Benchmark suite | Current: 24c12cc | Previous: 8bfa628 | Ratio |
---|---|---|---|
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 4333.5 ns | 4625 ns | 0.94 |
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 4583 ns | 4084 ns | 1.12 |
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 5375 ns | 5791 ns | 0.93 |
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 4333 ns | 4292 ns | 1.01 |
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA | 61610 ns | 60959 ns | 1.01 |
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 10167 ns | 10125 ns | 1.00 |
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 10375 ns | 9959 ns | 1.04 |
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 11209 ns | 10375 ns | 1.08 |
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 9833 ns | 10666 ns | 0.92 |
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA | 435760 ns | 427044 ns | 1.02 |
bias_activation(32, act=relu)(32 x 128)/forward/CPU/2 thread(s) | 1083 ns | 1167 ns | 0.93 |
bias_activation(32, act=relu)(32 x 128)/forward/CPU/4 thread(s) | 1292 ns | 1250 ns | 1.03 |
bias_activation(32, act=relu)(32 x 128)/forward/CPU/8 thread(s) | 1375 ns | 1458 ns | 0.94 |
bias_activation(32, act=relu)(32 x 128)/forward/CPU/1 thread(s) | 1042 ns | 3542 ns | 0.29 |
bias_activation(32, act=relu)(32 x 128)/forward/GPU/CUDA | 18442 ns | 18260 ns | 1.01 |
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/2 thread(s) | 3979.5 ns | 4125 ns | 0.96 |
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/4 thread(s) | 4167 ns | 3833 ns | 1.09 |
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/8 thread(s) | 4250 ns | 4125 ns | 1.03 |
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/1 thread(s) | 4000 ns | 4000 ns | 1 |
bias_activation(32, act=relu)(32 x 128)/zygote/GPU/CUDA | 112645.5 ns | 111381 ns | 1.01 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 57667 ns | 57709 ns | 1.00 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 37792 ns | 47250 ns | 0.80 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 46750 ns | 38250 ns | 1.22 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 81395.5 ns | 80333 ns | 1.01 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 38084 ns | 37655 ns | 1.01 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2037562.5 ns | 2026167 ns | 1.01 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2104125 ns | 2092708.5 ns | 1.01 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2099166 ns | 2059625.5 ns | 1.02 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1984916.5 ns | 1993416 ns | 1.00 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 198932 ns | 197377 ns | 1.01 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 144291 ns | 152958 ns | 0.94 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 144208 ns | 148250 ns | 0.97 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 148083 ns | 146417 ns | 1.01 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 144166.5 ns | 150375 ns | 0.96 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 165721.5 ns | 167595 ns | 0.99 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1121459 ns | 1098542 ns | 1.02 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1157396 ns | 1124250 ns | 1.03 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1131250 ns | 1116146 ns | 1.01 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1116458 ns | 1107229.5 ns | 1.01 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 530286 ns | 523151 ns | 1.01 |
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) |
3167 ns |
3584 ns |
0.88 |
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) |
3625 ns |
3625 ns |
1.00 |
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) |
4708 ns |
5708.5 ns |
0.82 |
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) |
3208.5 ns |
3417 ns |
0.94 |
layernorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA |
70966.5 ns |
70157 ns |
1.01 |
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) |
8709 ns |
8834 ns |
0.99 |
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) |
9292 ns |
8667 ns |
1.07 |
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) |
9959 ns |
9291 ns |
1.07 |
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) |
8833 ns |
9042 ns |
0.98 |
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA |
486762.5 ns |
492826.5 ns |
0.99 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) |
15458 ns |
17000 ns |
0.91 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) |
16042 ns |
16375 ns |
0.98 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) |
18291 ns |
18667 ns |
0.98 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) |
14792 ns |
17083 ns |
0.87 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA |
55364.5 ns |
54850 ns |
1.01 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) |
216833 ns |
213146 ns |
1.02 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) |
226250 ns |
216104 ns |
1.05 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) |
214583 ns |
214167 ns |
1.00 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) |
223812 ns |
225333 ns |
0.99 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA |
277851.5 ns |
272672.5 ns |
1.02 |
bias_activation(2, act=relu)(2 x 128)/forward/CPU/2 thread(s) |
667 ns |
459 ns |
1.45 |
bias_activation(2, act=relu)(2 x 128)/forward/CPU/4 thread(s) |
583 ns |
542 ns |
1.08 |
bias_activation(2, act=relu)(2 x 128)/forward/CPU/8 thread(s) |
667 ns |
709 ns |
0.94 |
bias_activation(2, act=relu)(2 x 128)/forward/CPU/1 thread(s) |
417 ns |
583 ns |
0.72 |
bias_activation(2, act=relu)(2 x 128)/forward/GPU/CUDA |
18112 ns |
17542 ns |
1.03 |
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/2 thread(s) |
1375 ns |
1708 ns |
0.81 |
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/4 thread(s) |
1625 ns |
1458 ns |
1.11 |
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/8 thread(s) |
1958 ns |
1625 ns |
1.20 |
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/1 thread(s) |
1625 ns |
1750 ns |
0.93 |
bias_activation(2, act=relu)(2 x 128)/zygote/GPU/CUDA |
105356 ns |
104205 ns |
1.01 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) |
7167 ns |
7250 ns |
0.99 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) |
5167 ns |
5833 ns |
0.89 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) |
5958 ns |
5209 ns |
1.14 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) |
9833 ns |
4000 ns |
2.46 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA |
24464.5 ns |
23961 ns |
1.02 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) |
222292 ns |
228750.5 ns |
0.97 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) |
236208 ns |
228333 ns |
1.03 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) |
229416.5 ns |
228500 ns |
1.00 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) |
259229 ns |
226334 ns |
1.15 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA |
172717 ns |
170956 ns |
1.01 |
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/2 thread(s) |
3792 ns |
3875 ns |
0.98 |
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/4 thread(s) |
3917 ns |
3875 ns |
1.01 |
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/8 thread(s) |
3875 ns |
3916 ns |
0.99 |
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/1 thread(s) |
3875 ns |
3834 ns |
1.01 |
dense(32, bias=false, act=relu)(32 x 128)/forward/GPU/CUDA |
24053 ns |
23832 ns |
1.01 |
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/2 thread(s) |
16625 ns |
16833 ns |
0.99 |
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/4 thread(s) |
16542 ns |
16708 ns |
0.99 |
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/8 thread(s) |
19959 ns |
16708 ns |
1.19 |
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/1 thread(s) |
16625 ns |
16958 ns |
0.98 |
dense(32, bias=false, act=relu)(32 x 128)/zygote/GPU/CUDA |
164960.5 ns |
165501.5 ns |
1.00 |
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/2 thread(s) |
578875 ns |
579042 ns |
1.00 |
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/4 thread(s) |
577459 ns |
574375 ns |
1.01 |
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/8 thread(s) |
575875 ns |
575083 ns |
1.00 |
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/1 thread(s) |
576458 ns |
576292 ns |
1.00 |
dense(512, bias=false, act=gelu)(512 x 128)/forward/GPU/CUDA |
113991.5 ns |
113664 ns |
1.00 |
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) |
1424917 ns |
1417708 ns |
1.01 |
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) |
1426834 ns |
1429333 ns |
1.00 |
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) |
1424104.5 ns |
1425729.5 ns |
1.00 |
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) |
1419916 ns |
1422208 ns |
1.00 |
dense(512, bias=false, act=gelu)(512 x 128)/zygote/GPU/CUDA |
214497 ns |
214791 ns |
1.00 |
lenet(28, 28, 1, 64)/forward/CPU/2 thread(s) |
1079667 ns |
1082104 ns |
1.00 |
lenet(28, 28, 1, 64)/forward/CPU/4 thread(s) |
952541 ns |
959958.5 ns |
0.99 |
lenet(28, 28, 1, 64)/forward/CPU/8 thread(s) |
1344209 ns |
1341792 ns |
1.00 |
lenet(28, 28, 1, 64)/forward/CPU/1 thread(s) |
1294500 ns |
1294792 ns |
1.00 |
lenet(28, 28, 1, 64)/forward/GPU/CUDA |
277520.5 ns |
281583.5 ns |
0.99 |
lenet(28, 28, 1, 64)/zygote/CPU/2 thread(s) |
5796125 ns |
5777875 ns |
1.00 |
lenet(28, 28, 1, 64)/zygote/CPU/4 thread(s) |
4601750 ns |
4456083 ns |
1.03 |
lenet(28, 28, 1, 64)/zygote/CPU/8 thread(s) |
4812062.5 ns |
4934792 ns |
0.98 |
lenet(28, 28, 1, 64)/zygote/CPU/1 thread(s) |
5685666.5 ns |
5627500 ns |
1.01 |
lenet(28, 28, 1, 64)/zygote/GPU/CUDA |
1102298 ns |
1106964 ns |
1.00 |
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/2 thread(s) |
542 ns |
500 ns |
1.08 |
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/4 thread(s) |
542 ns |
500 ns |
1.08 |
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/8 thread(s) |
542 ns |
542 ns |
1.00 |
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/1 thread(s) |
542 ns |
542 ns |
1.00 |
dense(2, bias=true, act=relu)(2 x 128)/forward/GPU/CUDA |
23934 ns |
23988 ns |
1.00 |
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/2 thread(s) |
2166 ns |
2084 ns |
1.04 |
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/4 thread(s) |
2167 ns |
2083 ns |
1.04 |
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/8 thread(s) |
2167 ns |
2125 ns |
1.02 |
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/1 thread(s) |
2125 ns |
2125 ns |
1.00 |
dense(2, bias=true, act=relu)(2 x 128)/zygote/GPU/CUDA |
174719 ns |
179026 ns |
0.98 |
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) |
4375 ns |
6084 ns |
0.72 |
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) |
5208 ns |
6167 ns |
0.84 |
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) |
5979 ns |
7041 ns |
0.85 |
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) |
4208 ns |
6375 ns |
0.66 |
layernorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA |
65897 ns |
66163.5 ns |
1.00 |
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) |
11333 ns |
11291 ns |
1.00 |
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) |
11562.5 ns |
10791 ns |
1.07 |
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) |
12000 ns |
12125 ns |
0.99 |
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) |
11458 ns |
11354.5 ns |
1.01 |
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA |
456442.5 ns |
456626.5 ns |
1.00 |
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) |
6875 ns |
7000 ns |
0.98 |
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) |
7084 ns |
7042 ns |
1.01 |
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) |
7437.5 ns |
8375 ns |
0.89 |
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) |
6667 ns |
7042 ns |
0.95 |
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA |
53263 ns |
52652 ns |
1.01 |
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) |
18479 ns |
17375 ns |
1.06 |
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) |
17917 ns |
17167 ns |
1.04 |
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) |
18708.5 ns |
17770.5 ns |
1.05 |
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) |
18000 ns |
18708 ns |
0.96 |
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA |
306492 ns |
306093.5 ns |
1.00 |
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) |
583 ns |
459 ns |
1.27 |
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) |
583 ns |
459 ns |
1.27 |
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) |
542 ns |
583 ns |
0.93 |
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) |
542 ns |
542 ns |
1.00 |
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA |
33354.5 ns |
33004 ns |
1.01 |
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) |
8958 ns |
8583 ns |
1.04 |
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) |
9250 ns |
8208 ns |
1.13 |
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) |
9125 ns |
9583 ns |
0.95 |
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) |
9125 ns |
9042 ns |
1.01 |
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA |
161864 ns |
162492.5 ns |
1.00 |
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/2 thread(s) |
64667 ns |
64542 ns |
1.00 |
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/4 thread(s) |
64958 ns |
64417 ns |
1.01 |
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/8 thread(s) |
64583 ns |
64625 ns |
1.00 |
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/1 thread(s) |
64708 ns |
64750 ns |
1.00 |
dense(512, bias=false, act=identity)(512 x 128)/forward/GPU/CUDA |
112223.5 ns |
112347.5 ns |
1.00 |
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/2 thread(s) |
284958 ns |
277542 ns |
1.03 |
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/4 thread(s) |
293500 ns |
281625 ns |
1.04 |
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/8 thread(s) |
287084 ns |
288750 ns |
0.99 |
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/1 thread(s) |
278979 ns |
275500 ns |
1.01 |
dense(512, bias=false, act=identity)(512 x 128)/zygote/GPU/CUDA |
188579 ns |
189809 ns |
0.99 |
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/2 thread(s) |
3384709 ns |
3285583 ns |
1.03 |
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/4 thread(s) |
2770562.5 ns |
3022333.5 ns |
0.92 |
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/8 thread(s) |
3022292 ns |
2780375 ns |
1.09 |
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/1 thread(s) |
4051959 ns |
4038625 ns |
1.00 |
mlp7layer_bn(gelu)(32 x 256)/forward/GPU/CUDA |
577085.5 ns |
573967 ns |
1.01 |
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/2 thread(s) |
7638708 ns |
7586208.5 ns |
1.01 |
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/4 thread(s) |
7382937 ns |
7415437 ns |
1.00 |
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/8 thread(s) |
7276667 ns |
7333375 ns |
0.99 |
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/1 thread(s) |
8178104.5 ns |
8220958 ns |
0.99 |
mlp7layer_bn(gelu)(32 x 256)/zygote/GPU/CUDA |
1348594 ns |
1351752.5 ns |
1.00 |
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/2 thread(s) |
18820250 ns |
18835167 ns |
1.00 |
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/4 thread(s) |
19140583 ns |
19044834 ns |
1.01 |
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/8 thread(s) |
19170875 ns |
19135125 ns |
1.00 |
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/1 thread(s) |
15780292 ns |
15633417 ns |
1.01 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s) |
23651833.5 ns |
23661916.5 ns |
1.00 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s) |
42906541 ns |
33965500 ns |
1.26 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s) |
37105250 ns |
41107417 ns |
0.90 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s) |
34831333 ns |
34858709 ns |
1.00 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/GPU/CUDA |
1856373 ns |
1862815 ns |
1.00 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s) |
188407625 ns |
189289541 ns |
1.00 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s) |
178681958.5 ns |
164224708 ns |
1.09 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s) |
152434916 ns |
157847979 ns |
0.97 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s) |
441085041 ns |
438904833 ns |
1.00 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/GPU/CUDA |
13924696 ns |
13913764 ns |
1.00 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s) |
291040208.5 ns |
289733584 ns |
1.00 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s) |
281280395.5 ns |
338173667 ns |
0.83 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s) |
298676250 ns |
307489541.5 ns |
0.97 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s) |
394717750.5 ns |
393585937.5 ns |
1.00 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) |
23083.5 ns |
21708.5 ns |
1.06 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) |
24541 ns |
24458 ns |
1.00 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) |
25333 ns |
25937 ns |
0.98 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) |
22125 ns |
24229 ns |
0.91 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA |
95853 ns |
96907 ns |
0.99 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) |
114728.5 ns |
103750 ns |
1.11 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) |
105333.5 ns |
105292 ns |
1.00 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) |
104875 ns |
104208 ns |
1.01 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) |
103083 ns |
151250 ns |
0.68 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA |
499564 ns |
504189 ns |
0.99 |
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) |
6041 ns |
6583 ns |
0.92 |
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) |
6375 ns |
7292 ns |
0.87 |
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) |
7542 ns |
7959 ns |
0.95 |
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) |
5875 ns |
6958 ns |
0.84 |
layernorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA |
68212 ns |
68581 ns |
0.99 |
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) |
15000 ns |
14916.5 ns |
1.01 |
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) |
15792 ns |
14709 ns |
1.07 |
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) |
16250 ns |
16666 ns |
0.98 |
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) |
14708 ns |
14292 ns |
1.03 |
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA |
477913.5 ns |
483895 ns |
0.99 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s) |
3005417 ns |
3017937 ns |
1.00 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s) |
2088562.5 ns |
2022458 ns |
1.03 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s) |
2282792 ns |
2307959 ns |
0.99 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s) |
4876979.5 ns |
4846645.5 ns |
1.01 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/GPU/CUDA |
585444 ns |
585796 ns |
1.00 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s) |
23510520.5 ns |
23617917 ns |
1.00 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s) |
18309333 ns |
17975417 ns |
1.02 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s) |
16948500 ns |
18323812.5 ns |
0.92 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s) |
35051750 ns |
35597209 ns |
0.98 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/GPU/CUDA |
3105194 ns |
3109235 ns |
1.00 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s) |
33290292 ns |
33405687.5 ns |
1.00 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s) |
27968792 ns |
27693604 ns |
1.01 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s) |
27595000 ns |
27860958 ns |
0.99 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s) |
41906104 ns |
42002937.5 ns |
1.00 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) |
72479.5 ns |
72375 ns |
1.00 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) |
73729 ns |
84624.5 ns |
0.87 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) |
76375 ns |
83250 ns |
0.92 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) |
74916 ns |
73750 ns |
1.02 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA |
100581 ns |
102852 ns |
0.98 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) |
298500 ns |
218167 ns |
1.37 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) |
216916.5 ns |
309979 ns |
0.70 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) |
209292 ns |
317479 ns |
0.66 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) |
220250 ns |
288875 ns |
0.76 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA |
540007.5 ns |
550996 ns |
0.98 |
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) |
11541 ns |
12041 ns |
0.96 |
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) |
12167 ns |
12729.5 ns |
0.96 |
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) |
12875 ns |
13833 ns |
0.93 |
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) |
11625 ns |
11666.5 ns |
1.00 |
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA |
70274.5 ns |
71604 ns |
0.98 |
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) |
26708 ns |
26625 ns |
1.00 |
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) |
27334 ns |
26959 ns |
1.01 |
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) |
27542 ns |
28292 ns |
0.97 |
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) |
26792 ns |
26458 ns |
1.01 |
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA |
471677.5 ns |
484486.5 ns |
0.97 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) |
12833 ns |
12417 ns |
1.03 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) |
12937.5 ns |
12542 ns |
1.03 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) |
13645.5 ns |
14584 ns |
0.94 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) |
12333.5 ns |
13041.5 ns |
0.95 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA |
52633 ns |
53694 ns |
0.98 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) |
26459 ns |
26312.5 ns |
1.01 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) |
26042 ns |
26270.5 ns |
0.99 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) |
26583 ns |
26667 ns |
1.00 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) |
26458 ns |
26333 ns |
1.00 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA |
301140.5 ns |
309291.5 ns |
0.97 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) |
179792 ns |
178770.5 ns |
1.01 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) |
182250 ns |
182334 ns |
1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) |
182813 ns |
184895.5 ns |
0.99 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) |
180541 ns |
179750 ns |
1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA |
55883 ns |
57908 ns |
0.97 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) |
596042 ns |
587125 ns |
1.02 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) |
583312.5 ns |
596500 ns |
0.98 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) |
583584 ns |
593770.5 ns |
0.98 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) |
593083.5 ns |
583166 ns |
1.02 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA |
283344.5 ns |
290369.5 ns |
0.98 |
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) |
6541 ns |
7354.5 ns |
0.89 |
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) |
6667 ns |
7167 ns |
0.93 |
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) |
7375 ns |
7875 ns |
0.94 |
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) |
6208 ns |
6833 ns |
0.91 |
layernorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA |
69352 ns |
70829 ns |
0.98 |
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) |
14458 ns |
14375 ns |
1.01 |
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) |
15208 ns |
14708 ns |
1.03 |
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) |
15125 ns |
15625 ns |
0.97 |
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) |
14479 ns |
14083 ns |
1.03 |
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA |
459571.5 ns |
471312.5 ns |
0.98 |
batchedmm(512, Bsize=4)/forward/CPU/2 thread(s) |
1217208 ns |
1235042 ns |
0.99 |
batchedmm(512, Bsize=4)/forward/CPU/4 thread(s) |
1276667 ns |
1283583 ns |
0.99 |
batchedmm(512, Bsize=4)/forward/CPU/8 thread(s) |
1266000 ns |
1282875 ns |
0.99 |
batchedmm(512, Bsize=4)/forward/CPU/1 thread(s) |
1317770.5 ns |
1325208 ns |
0.99 |
batchedmm(512, Bsize=4)/forward/GPU/CUDA |
300482.5 ns |
301270 ns |
1.00 |
batchedmm(512, Bsize=4)/zygote/CPU/2 thread(s) |
4115000 ns |
4111125 ns |
1.00 |
batchedmm(512, Bsize=4)/zygote/CPU/4 thread(s) |
4522624.5 ns |
4361625 ns |
1.04 |
batchedmm(512, Bsize=4)/zygote/CPU/8 thread(s) |
4588833 ns |
4786395.5 ns |
0.96 |
batchedmm(512, Bsize=4)/zygote/CPU/1 thread(s) |
4444459 ns |
4453229.5 ns |
1.00 |
batchedmm(512, Bsize=4)/zygote/GPU/CUDA |
1036466 ns |
1047552 ns |
0.99 |
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/2 thread(s) |
1833 ns |
1792 ns |
1.02 |
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/4 thread(s) |
1875 ns |
1750 ns |
1.07 |
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/8 thread(s) |
1792 ns |
1834 ns |
0.98 |
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/1 thread(s) |
1833 ns |
1834 ns |
1.00 |
dense(2, bias=true, act=gelu)(2 x 128)/forward/GPU/CUDA |
23505.5 ns |
23328 ns |
1.01 |
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) |
4875 ns |
4833 ns |
1.01 |
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) |
4917 ns |
4792 ns |
1.03 |
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) |
4875 ns |
4917 ns |
0.99 |
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) |
4875 ns |
4917 ns |
0.99 |
dense(2, bias=true, act=gelu)(2 x 128)/zygote/GPU/CUDA |
188144 ns |
186698 ns |
1.01 |
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) |
6020.5 ns |
7208.5 ns |
0.84 |
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) |
6209 ns |
5584 ns |
1.11 |
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) |
7375 ns |
8667 ns |
0.85 |
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) |
6208.5 ns |
7312.5 ns |
0.85 |
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA |
55343.5 ns |
54539 ns |
1.01 |
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) |
12833 ns |
10833 ns |
1.18 |
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) |
11792 ns |
10834 ns |
1.09 |
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) |
11584 ns |
12375 ns |
0.94 |
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) |
11958 ns |
11916 ns |
1.00 |
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA |
330451.5 ns |
329099 ns |
1.00 |
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/2 thread(s) |
333 ns |
292 ns |
1.14 |
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/4 thread(s) |
334 ns |
292 ns |
1.14 |
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/8 thread(s) |
292 ns |
334 ns |
0.87 |
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/1 thread(s) |
292 ns |
333 ns |
0.88 |
dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/CUDA |
22931 ns |
22753 ns |
1.01 |
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/2 thread(s) |
2750 ns |
2708 ns |
1.02 |
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/4 thread(s) |
2750 ns |
2667 ns |
1.03 |
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/8 thread(s) |
3000 ns |
2959 ns |
1.01 |
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/1 thread(s) |
2750 ns |
3000 ns |
0.92 |
dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/CUDA |
158264.5 ns |
157496 ns |
1.00 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) |
11854.5 ns |
13167 ns |
0.90 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) |
12083 ns |
13166 ns |
0.92 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) |
14500 ns |
15000 ns |
0.97 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) |
12000 ns |
13792 ns |
0.87 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA |
56327.5 ns |
55218 ns |
1.02 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) |
24833 ns |
24833 ns |
1.00 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) |
24709 ns |
24542 ns |
1.01 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) |
25333 ns |
25375 ns |
1.00 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) |
25166 ns |
24709 ns |
1.02 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA |
295802 ns |
289966 ns |
1.02 |
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/2 thread(s) |
4125 ns |
4083 ns |
1.01 |
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/4 thread(s) |
4208 ns |
4166 ns |
1.01 |
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/8 thread(s) |
4167 ns |
4167 ns |
1.00 |
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/1 thread(s) |
4167 ns |
4125 ns |
1.01 |
dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/CUDA |
24484 ns |
24660 ns |
0.99 |
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/2 thread(s) |
16125 ns |
15958 ns |
1.01 |
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/4 thread(s) |
16041 ns |
16417 ns |
0.98 |
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/8 thread(s) |
16250 ns |
16042 ns |
1.01 |
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/1 thread(s) |
16083 ns |
16125 ns |
1.00 |
dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/CUDA |
197179 ns |
194045.5 ns |
1.02 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) |
5750 ns |
5667 ns |
1.01 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) |
5792 ns |
5625 ns |
1.03 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) |
5750 ns |
5750 ns |
1.00 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) |
5750 ns |
5791 ns |
0.99 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA |
33583 ns |
32989 ns |
1.02 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) |
21000 ns |
21125 ns |
0.99 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) |
21292 ns |
20459 ns |
1.04 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) |
21042 ns |
21542 ns |
0.98 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) |
20833.5 ns |
20875 ns |
1.00 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA |
175724 ns |
174273 ns |
1.01 |
| batchedmm(16, Bsize=512)/forward/CPU/2 thread(s) | 404500 ns | 403209 ns | 1.00 |
| batchedmm(16, Bsize=512)/forward/CPU/4 thread(s) | 366729 ns | 371125 ns | 0.99 |
| batchedmm(16, Bsize=512)/forward/CPU/8 thread(s) | 491792 ns | 474292 ns | 1.04 |
| batchedmm(16, Bsize=512)/forward/CPU/1 thread(s) | 527187.5 ns | 539604.5 ns | 0.98 |
| batchedmm(16, Bsize=512)/forward/GPU/CUDA | 66440.5 ns | 66734 ns | 1.00 |
| batchedmm(16, Bsize=512)/zygote/CPU/2 thread(s) | 962125 ns | 1011917 ns | 0.95 |
| batchedmm(16, Bsize=512)/zygote/CPU/4 thread(s) | 879458 ns | 884896 ns | 0.99 |
| batchedmm(16, Bsize=512)/zygote/CPU/8 thread(s) | 1231479.5 ns | 1220125 ns | 1.01 |
| batchedmm(16, Bsize=512)/zygote/CPU/1 thread(s) | 1416458.5 ns | 1400208 ns | 1.01 |
| batchedmm(16, Bsize=512)/zygote/GPU/CUDA | 191580 ns | 190566.5 ns | 1.01 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 82062 ns | 82917 ns | 0.99 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 82666 ns | 82791 ns | 1.00 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 82937.5 ns | 88958.5 ns | 0.93 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 83292 ns | 83187.5 ns | 1.00 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 192849 ns | 192556.5 ns | 1.00 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1915978.5 ns | 1921500 ns | 1.00 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1938834 ns | 1696166 ns | 1.14 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1710833 ns | 1938083 ns | 0.88 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1917416.5 ns | 1915875 ns | 1.00 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 404784 ns | 393732 ns | 1.03 |
| dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/2 thread(s) | 292 ns | 291 ns | 1.00 |
| dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/4 thread(s) | 292 ns | 291 ns | 1.00 |
| dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/8 thread(s) | 292 ns | 292 ns | 1 |
| dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/1 thread(s) | 292 ns | 292 ns | 1 |
| dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/CUDA | 21961 ns | 21580 ns | 1.02 |
| dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/2 thread(s) | 1792 ns | 1792 ns | 1 |
| dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/4 thread(s) | 1834 ns | 1792 ns | 1.02 |
| dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/8 thread(s) | 1833 ns | 1833 ns | 1 |
| dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/1 thread(s) | 1833 ns | 1833 ns | 1 |
| dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/CUDA | 173279.5 ns | 165924 ns | 1.04 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 7125 ns | 6708 ns | 1.06 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 7000 ns | 6250 ns | 1.12 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 8687.5 ns | 9750 ns | 0.89 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 6667 ns | 8125 ns | 0.82 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA | 61915 ns | 56950.5 ns | 1.09 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 9542 ns | 8916.5 ns | 1.07 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 9167 ns | 8958 ns | 1.02 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 9667 ns | 9625 ns | 1.00 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 9459 ns | 9542 ns | 0.99 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA | 319219.5 ns | 299584.5 ns | 1.07 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) | 120493375 ns | 120035854.5 ns | 1.00 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) | 182164250 ns | 174382959 ns | 1.04 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) | 148124041.5 ns | 154831333 ns | 0.96 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) | 106683562 ns | 103109500 ns | 1.03 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/GPU/CUDA | 5478101 ns | 5474606 ns | 1.00 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) | 619037042 ns | 617124000 ns | 1.00 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) | 581117583 ns | 555612167 ns | 1.05 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) | 452503854 ns | 468382792 ns | 0.97 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) | 759531500.5 ns | 756087750 ns | 1.00 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/GPU/CUDA | 38238069 ns | 38213656 ns | 1.00 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) | 650158250 ns | 651747459 ns | 1.00 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) | 689625562.5 ns | 666674583.5 ns | 1.03 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) | 584746895.5 ns | 602170708.5 ns | 0.97 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) | 746064791 ns | 734251875 ns | 1.02 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 59917 ns | 57208 ns | 1.05 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 38667 ns | 48167 ns | 0.80 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 47958 ns | 39167 ns | 1.22 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 83333 ns | 83958 ns | 0.99 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 37783 ns | 37250 ns | 1.01 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1926458.5 ns | 1929792 ns | 1.00 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1973292 ns | 1973292 ns | 1 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1982604 ns | 1984249.5 ns | 1.00 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1891667 ns | 1881417 ns | 1.01 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 175065.5 ns | 171491 ns | 1.02 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 267583.5 ns | 273354 ns | 0.98 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 269084 ns | 267959 ns | 1.00 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 289917 ns | 270687.5 ns | 1.07 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 267166.5 ns | 268834 ns | 0.99 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 131334.5 ns | 124192.5 ns | 1.06 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 589666 ns | 658333 ns | 0.90 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 689042 ns | 674854.5 ns | 1.02 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 665916 ns | 665333 ns | 1.00 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 673125 ns | 670500 ns | 1.00 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 706985 ns | 664813 ns | 1.06 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 2181896 ns | 2190167 ns | 1.00 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 2232625 ns | 2214354.5 ns | 1.01 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 2183854 ns | 2216958.5 ns | 0.99 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 2191875 ns | 2099979 ns | 1.04 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 133858 ns | 133238 ns | 1.00 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 5512271 ns | 5505354.5 ns | 1.00 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 5598583.5 ns | 5504750 ns | 1.02 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 5489687.5 ns | 5565292 ns | 0.99 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 5490875 ns | 5499708 ns | 1.00 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 746209 ns | 740235 ns | 1.01 |
| dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/2 thread(s) | 646416 ns | 650417 ns | 0.99 |
| dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/4 thread(s) | 647583 ns | 649020.5 ns | 1.00 |
| dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/8 thread(s) | 647666 ns | 640625 ns | 1.01 |
| dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/1 thread(s) | 644958 ns | 648292 ns | 0.99 |
| dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/CUDA | 47169 ns | 47265 ns | 1.00 |
| dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) | 1826146 ns | 1821708 ns | 1.00 |
| dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) | 1670917 ns | 1720959 ns | 0.97 |
| dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) | 1722292 ns | 1675729.5 ns | 1.03 |
| dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) | 2100708 ns | 2108500 ns | 1.00 |
| dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/CUDA | 224227.5 ns | 224014 ns | 1.00 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 58250 ns | 58583 ns | 0.99 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 38500 ns | 46645.5 ns | 0.83 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 45917 ns | 38750 ns | 1.18 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 84125 ns | 83834 ns | 1.00 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 28997 ns | 28947 ns | 1.00 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2027375 ns | 2024916 ns | 1.00 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2103792 ns | 2086188 ns | 1.01 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2095437.5 ns | 2100521 ns | 1.00 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1994916 ns | 1993416.5 ns | 1.00 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 190639 ns | 191815.5 ns | 0.99 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s) | 13358312.5 ns | 13473875 ns | 0.99 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s) | 12522354 ns | 12547041.5 ns | 1.00 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s) | 12489916 ns | 12559604 ns | 0.99 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s) | 14891124.5 ns | 15213416.5 ns | 0.98 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/GPU/CUDA | 514106 ns | 517805 ns | 0.99 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s) | 47397250 ns | 47353458 ns | 1.00 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s) | 42110166.5 ns | 41833334 ns | 1.01 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s) | 41023354.5 ns | 41118750 ns | 1.00 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s) | 57862667 ns | 58300041 ns | 0.99 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/GPU/CUDA | 3195086 ns | 3203904 ns | 1.00 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s) | 97510834 ns | 74077042 ns | 1.32 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s) | 91327667 ns | 68022250 ns | 1.34 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s) | 90418625 ns | 90906749.5 ns | 0.99 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s) | 99397458 ns | 99115937.5 ns | 1.00 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 59375 ns | 58958 ns | 1.01 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 38750 ns | 47375 ns | 0.82 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 47666 ns | 38729.5 ns | 1.23 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 82625 ns | 83500 ns | 0.99 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 46954 ns | 47777 ns | 0.98 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1922250 ns | 1923375 ns | 1.00 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1818833.5 ns | 1961541 ns | 0.93 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1971041 ns | 1980229 ns | 1.00 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1887729.5 ns | 1890354 ns | 1.00 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 190192.5 ns | 194350.5 ns | 0.98 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 375 ns | 291 ns | 1.29 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 375 ns | 291 ns | 1.29 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 375 ns | 375 ns | 1 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 375 ns | 375 ns | 1 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA | 31804 ns | 32617.5 ns | 0.98 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 6625 ns | 6208.5 ns | 1.07 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 6833 ns | 5958 ns | 1.15 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 6750 ns | 6708 ns | 1.01 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 6459 ns | 6437.5 ns | 1.00 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA | 173129 ns | 173722.5 ns | 1.00 |
| dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/2 thread(s) | 291 ns | 250 ns | 1.16 |
| dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/4 thread(s) | 292 ns | 250 ns | 1.17 |
| dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/8 thread(s) | 291 ns | 292 ns | 1.00 |
| dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/1 thread(s) | 292 ns | 292 ns | 1 |
| dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/CUDA | 31435 ns | 32110 ns | 0.98 |
| dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/2 thread(s) | 2875 ns | 2583 ns | 1.11 |
| dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/4 thread(s) | 2792 ns | 2542 ns | 1.10 |
| dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/8 thread(s) | 2834 ns | 2833 ns | 1.00 |
| dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/1 thread(s) | 2834 ns | 2833 ns | 1.00 |
| dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/CUDA | 160716 ns | 161891 ns | 0.99 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) | 287914625 ns | 286335145.5 ns | 1.01 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) | 347874917 ns | 339870250 ns | 1.02 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) | 313778604 ns | 320445937.5 ns | 0.98 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) | 274558667 ns | 272825875 ns | 1.01 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/GPU/CUDA | 7101120 ns | 7113314 ns | 1.00 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) | 1001223208 ns | 990386709 ns | 1.01 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) | 962335292 ns | 938484666 ns | 1.03 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) | 851949854 ns | 868613416.5 ns | 0.98 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) | 1161969458 ns | 1158749666 ns | 1.00 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/GPU/CUDA | 33885355 ns | 33903874 ns | 1.00 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) | 1680835000 ns | 1310266104.5 ns | 1.28 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) | 1703135000 ns | 1325766333.5 ns | 1.28 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) | 1598082833 ns | 1623996500 ns | 0.98 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) | 1673120084 ns | 1663239334 ns | 1.01 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 1417562.5 ns | 1461479 ns | 0.97 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 1417084 ns | 1415750 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 1418167 ns | 1429167 ns | 0.99 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 1460750 ns | 1414437.5 ns | 1.03 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 128326 ns | 128213 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 5025104 ns | 5019792 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 5057208 ns | 5022458 ns | 1.01 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 5024188 ns | 5050000 ns | 0.99 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 5022083 ns | 5006541.5 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 573974 ns | 557532 ns | 1.03 |
| vgg16(32, 32, 3, 32)/forward/CPU/2 thread(s) | 175868729.5 ns | 175263520.5 ns | 1.00 |
| vgg16(32, 32, 3, 32)/forward/CPU/4 thread(s) | 181907771 ns | 129816208.5 ns | 1.40 |
| vgg16(32, 32, 3, 32)/forward/CPU/8 thread(s) | 127813000 ns | 145953208.5 ns | 0.88 |
| vgg16(32, 32, 3, 32)/forward/CPU/1 thread(s) | 170303209 ns | 164619104.5 ns | 1.03 |
| vgg16(32, 32, 3, 32)/forward/GPU/CUDA | 4854931.5 ns | 4883992 ns | 0.99 |
| vgg16(32, 32, 3, 32)/zygote/CPU/2 thread(s) | 666951834 ns | 831528333 ns | 0.80 |
| vgg16(32, 32, 3, 32)/zygote/CPU/4 thread(s) | 612960875 ns | 497840084 ns | 1.23 |
| vgg16(32, 32, 3, 32)/zygote/CPU/8 thread(s) | 496604958 ns | 556789916 ns | 0.89 |
| vgg16(32, 32, 3, 32)/zygote/CPU/1 thread(s) | 682123167 ns | 679969833 ns | 1.00 |
| vgg16(32, 32, 3, 32)/zygote/GPU/CUDA | 16051467 ns | 16195623 ns | 0.99 |
| batchedmm(512, Bsize=32)/forward/CPU/2 thread(s) | 8911750 ns | 8914083 ns | 1.00 |
| batchedmm(512, Bsize=32)/forward/CPU/4 thread(s) | 8824812.5 ns | 8769917 ns | 1.01 |
| batchedmm(512, Bsize=32)/forward/CPU/8 thread(s) | 7877958 ns | 8216313 ns | 0.96 |
| batchedmm(512, Bsize=32)/forward/CPU/1 thread(s) | 10154667 ns | 10158000 ns | 1.00 |
| batchedmm(512, Bsize=32)/forward/GPU/CUDA | 1605096 ns | 1595526 ns | 1.01 |
| batchedmm(512, Bsize=32)/zygote/CPU/2 thread(s) | 35800833 ns | 35894250 ns | 1.00 |
| batchedmm(512, Bsize=32)/zygote/CPU/4 thread(s) | 37985916 ns | 36843625 ns | 1.03 |
| batchedmm(512, Bsize=32)/zygote/CPU/8 thread(s) | 33369583 ns | 34476562 ns | 0.97 |
| batchedmm(512, Bsize=32)/zygote/CPU/1 thread(s) | 38797917 ns | 38802729 ns | 1.00 |
| batchedmm(512, Bsize=32)/zygote/GPU/CUDA | 6461183 ns | 6454567.5 ns | 1.00 |
| bias_activation(32, act=tanh)(32 x 128)/forward/CPU/2 thread(s) | 47542 ns | 47396 ns | 1.00 |
| bias_activation(32, act=tanh)(32 x 128)/forward/CPU/4 thread(s) | 47417 ns | 49334 ns | 0.96 |
| bias_activation(32, act=tanh)(32 x 128)/forward/CPU/8 thread(s) | 47708.5 ns | 47542 ns | 1.00 |
| bias_activation(32, act=tanh)(32 x 128)/forward/CPU/1 thread(s) | 47458 ns | 47417 ns | 1.00 |
| bias_activation(32, act=tanh)(32 x 128)/forward/GPU/CUDA | 18387 ns | 19457 ns | 0.95 |
| bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/2 thread(s) | 50542 ns | 50292 ns | 1.00 |
| bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/4 thread(s) | 50417 ns | 50520.5 ns | 1.00 |
| bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/8 thread(s) | 50729.5 ns | 50584 ns | 1.00 |
| bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/1 thread(s) | 50354.5 ns | 50250 ns | 1.00 |
| bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/CUDA | 197522 ns | 189575 ns | 1.04 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 7250 ns | 8104 ns | 0.89 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 7458 ns | 6791 ns | 1.10 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 8167 ns | 9125 ns | 0.90 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 7250 ns | 7333 ns | 0.99 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA | 100674.5 ns | 86829.5 ns | 1.16 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 10167 ns | 9875 ns | 1.03 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 10125 ns | 9583 ns | 1.06 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 10125 ns | 10375 ns | 0.98 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 10167 ns | 10208 ns | 1.00 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA | 532551.5 ns | 537525 ns | 0.99 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 6084 ns | 8208 ns | 0.74 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 7583 ns | 8250 ns | 0.92 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 8583 ns | 9812.5 ns | 0.87 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 6125 ns | 6375 ns | 0.96 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA | 106692.5 ns | 113788.5 ns | 0.94 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 13417 ns | 13333.5 ns | 1.01 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 13354.5 ns | 12625 ns | 1.06 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 13459 ns | 13584 ns | 0.99 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 13375 ns | 13208 ns | 1.01 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA | 464411 ns | 479705.5 ns | 0.97 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 1042 ns | 958 ns | 1.09 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 1083 ns | 958 ns | 1.13 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 1041 ns | 1042 ns | 1.00 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 1042 ns | 1083 ns | 0.96 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA | 32682 ns | 32580 ns | 1.00 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 8125 ns | 7750 ns | 1.05 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 8125 ns | 7625 ns | 1.07 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 8125 ns | 8542 ns | 0.95 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 8000 ns | 8208 ns | 0.97 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA | 204035.5 ns | 201701.5 ns | 1.01 |
| bias_activation(32, act=gelu)(32 x 128)/forward/CPU/2 thread(s) | 23458 ns | 23250 ns | 1.01 |
| bias_activation(32, act=gelu)(32 x 128)/forward/CPU/4 thread(s) | 23333 ns | 23042 ns | 1.01 |
| bias_activation(32, act=gelu)(32 x 128)/forward/CPU/8 thread(s) | 23375 ns | 23500 ns | 0.99 |
| bias_activation(32, act=gelu)(32 x 128)/forward/CPU/1 thread(s) | 23458 ns | 23167 ns | 1.01 |
| bias_activation(32, act=gelu)(32 x 128)/forward/GPU/CUDA | 18657 ns | 18765.5 ns | 0.99 |
| bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) | 52708 ns | 52875 ns | 1.00 |
| bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) | 52604.5 ns | 52292 ns | 1.01 |
| bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) | 52916 ns | 52792 ns | 1.00 |
| bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) | 52708 ns | 52459 ns | 1.00 |
| bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/CUDA | 271882 ns | 260844.5 ns | 1.04 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 1398500 ns | 1400229 ns | 1.00 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 1401729.5 ns | 1398666.5 ns | 1.00 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 1411917 ns | 1400708 ns | 1.01 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 1400229 ns | 1398917 ns | 1.00 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 196861.5 ns | 196521.5 ns | 1.00 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 5018875 ns | 5018604 ns | 1.00 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 5048167 ns | 5004729.5 ns | 1.01 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 5006959 ns | 5044229.5 ns | 0.99 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 5008333 ns | 5001271 ns | 1.00 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 574886.5 ns | 595122 ns | 0.97 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s) | 3024166 ns | 3043083 ns | 0.99 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s) | 2100833 ns | 2094042 ns | 1.00 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s) | 2282500 ns | 2287146 ns | 1.00 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s) | 4885312.5 ns | 4530875 ns | 1.08 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/GPU/CUDA | 583172 ns | 582703 ns | 1.00 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s) | 24461812.5 ns | 24366625 ns | 1.00 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s) | 19084792 ns | 18829583 ns | 1.01 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s) | 19138792 ns | 19120291 ns | 1.00 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s) | 36884542 ns | 36653000 ns | 1.01 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/GPU/CUDA | 3200942 ns | 3189516.5 ns | 1.00 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s) | 34091166.5 ns | 33943229 ns | 1.00 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s) | 28709958.5 ns | 28373417 ns | 1.01 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s) | 28378584 ns | 28357208 ns | 1.00 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s) | 41695667 ns | 41659750 ns | 1.00 |
| batchedmm(512, Bsize=512)/forward/CPU/2 thread(s) | 144674125 ns | 144299750 ns | 1.00 |
| batchedmm(512, Bsize=512)/forward/CPU/4 thread(s) | 143080042 ns | 142248375 ns | 1.01 |
| batchedmm(512, Bsize=512)/forward/CPU/8 thread(s) | 124467729.5 ns | 126632146 ns | 0.98 |
| batchedmm(512, Bsize=512)/forward/CPU/1 thread(s) | 174621958 ns | 173840291.5 ns | 1.00 |
| batchedmm(512, Bsize=512)/forward/GPU/CUDA | 22564506 ns | 22781482 ns | 0.99 |
| batchedmm(512, Bsize=512)/zygote/CPU/2 thread(s) | 1302493437.5 ns | 1307941437.5 ns | 1.00 |
| batchedmm(512, Bsize=512)/zygote/CPU/4 thread(s) | 874700000 ns | 1133574500.5 ns | 0.77 |
| batchedmm(512, Bsize=512)/zygote/CPU/8 thread(s) | 764524604 ns | 711240125 ns | 1.07 |
| batchedmm(512, Bsize=512)/zygote/CPU/1 thread(s) | 677522375 ns | 670828250 ns | 1.01 |
| batchedmm(512, Bsize=512)/zygote/GPU/CUDA | 116859638 ns | 118499942 ns | 0.99 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 75000 ns | 74542 ns | 1.01 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 74833 ns | 73917 ns | 1.01 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 76292 ns | 83125 ns | 0.92 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 74896 ns | 72916.5 ns | 1.03 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 216038.5 ns | 225032.5 ns | 0.96 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 210750 ns | 202979.5 ns | 1.04 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 201917 ns | 282792 ns | 0.71 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 281958 ns | 253479.5 ns | 1.11 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 269333 ns | 244146 ns | 1.10 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 1090324 ns | 1201754 ns | 0.91 |
| batchedmm(512, Bsize=128)/forward/CPU/2 thread(s) | 35470834 ns | 35408938 ns | 1.00 |
| batchedmm(512, Bsize=128)/forward/CPU/4 thread(s) | 35928792 ns | 35449645.5 ns | 1.01 |
| batchedmm(512, Bsize=128)/forward/CPU/8 thread(s) | 32119604 ns | 32512083 ns | 0.99 |
| batchedmm(512, Bsize=128)/forward/CPU/1 thread(s) | 40968833.5 ns | 41003541.5 ns | 1.00 |
| batchedmm(512, Bsize=128)/forward/GPU/CUDA | 5850850 ns | 5848198 ns | 1.00 |
| batchedmm(512, Bsize=128)/zygote/CPU/2 thread(s) | 147364709 ns | 146608875 ns | 1.01 |
| batchedmm(512, Bsize=128)/zygote/CPU/4 thread(s) | 158273854.5 ns | 151542938 ns | 1.04 |
| batchedmm(512, Bsize=128)/zygote/CPU/8 thread(s) | 133245583 ns | 138849083 ns | 0.96 |
| batchedmm(512, Bsize=128)/zygote/CPU/1 thread(s) | 288316083 ns | 287439584 ns | 1.00 |
| batchedmm(512, Bsize=128)/zygote/GPU/CUDA | 34901456.5 ns | 34913824 ns | 1.00 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) | 119595041.5 ns | 121086291.5 ns | 0.99 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) | 182760375 ns | 174190000 ns | 1.05 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) | 147949250 ns | 155717667 ns | 0.95 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) | 104280375 ns | 106488666.5 ns | 0.98 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/GPU/CUDA | 5427032 ns | 5478422 ns | 0.99 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) | 474025250 ns | 611208666 ns | 0.78 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) | 487387541.5 ns | 466441167 ns | 1.04 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) | 438686937.5 ns | 453562937.5 ns | 0.97 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) | 743324584 ns | 741621625 ns | 1.00 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/GPU/CUDA | 35184946 ns | 35157227 ns | 1.00 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) | 709818791.5 ns | 648662584 ns | 1.09 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) | 676047500 ns | 657411208 ns | 1.03 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) | 576077125 ns | 585962375 ns | 0.98 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) | 856793750 ns | 845072208 ns | 1.01 |
| mlp7layer_bn(relu)(32 x 256)/forward/CPU/2 thread(s) | 1335187.5 ns | 1304708 ns | 1.02 |
| mlp7layer_bn(relu)(32 x 256)/forward/CPU/4 thread(s) | 690416 ns | 965666 ns | 0.71 |
| mlp7layer_bn(relu)(32 x 256)/forward/CPU/8 thread(s) | 968250 ns | 744354 ns | 1.30 |
| mlp7layer_bn(relu)(32 x 256)/forward/CPU/1 thread(s) | 2069791.5 ns | 1944604 ns | 1.06 |
| mlp7layer_bn(relu)(32 x 256)/forward/GPU/CUDA | 562588.5 ns | 572387 ns | 0.98 |
| mlp7layer_bn(relu)(32 x 256)/zygote/CPU/2 thread(s) | 2971667 ns | 2974271 ns | 1.00 |
| mlp7layer_bn(relu)(32 x 256)/zygote/CPU/4 thread(s) | 2540917 ns | 2531646 ns | 1.00 |
| mlp7layer_bn(relu)(32 x 256)/zygote/CPU/8 thread(s) | 2628687 ns | 2512854 ns | 1.05 |
| mlp7layer_bn(relu)(32 x 256)/zygote/CPU/1 thread(s) | 3706167 ns | 3691334 ns | 1.00 |
| mlp7layer_bn(relu)(32 x 256)/zygote/GPU/CUDA | 1656253 ns | 1817474 ns | 0.91 |
| mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/2 thread(s) | 6655750 ns | 6642416 ns | 1.00 |
| mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/4 thread(s) | 6500458 ns | 6630792 ns | 0.98 |
| mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/8 thread(s) | 6506291.5 ns | 6466375 ns | 1.01 |
| mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/1 thread(s) | 4453146 ns | 4443145.5 ns | 1.00 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) |
7292 ns |
7334 ns |
0.99 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) |
5292 ns |
6208 ns |
0.85 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) |
6209 ns |
5458 ns |
1.14 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) |
10000 ns |
10167 ns |
0.98 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA |
25101 ns |
25916 ns |
0.97 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) |
213084 ns |
212104 ns |
1.00 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) |
221084 ns |
219562.5 ns |
1.01 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) |
221042 ns |
220667 ns |
1.00 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) |
215333 ns |
206291 ns |
1.04 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA |
249680 ns |
257490 ns |
0.97 |
vgg16(32, 32, 3, 64)/forward/CPU/2 thread(s) |
302094083 ns |
301772791.5 ns |
1.00 |
vgg16(32, 32, 3, 64)/forward/CPU/4 thread(s) |
282009708 ns |
222879750 ns |
1.27 |
vgg16(32, 32, 3, 64)/forward/CPU/8 thread(s) |
198372812.5 ns |
222700312.5 ns |
0.89 |
vgg16(32, 32, 3, 64)/forward/CPU/1 thread(s) |
311510292 ns |
311773125 ns |
1.00 |
vgg16(32, 32, 3, 64)/forward/GPU/CUDA |
7672645 ns |
7676597.5 ns |
1.00 |
vgg16(32, 32, 3, 64)/zygote/CPU/2 thread(s) |
1082528563 ns |
1082870459 ns |
1.00 |
vgg16(32, 32, 3, 64)/zygote/CPU/4 thread(s) |
990032792 ns |
892532250 ns |
1.11 |
vgg16(32, 32, 3, 64)/zygote/CPU/8 thread(s) |
865476333 ns |
883941208.5 ns |
0.98 |
vgg16(32, 32, 3, 64)/zygote/CPU/1 thread(s) |
1159115917 ns |
1154293562 ns |
1.00 |
vgg16(32, 32, 3, 64)/zygote/GPU/CUDA |
26437793 ns |
26959026 ns |
0.98 |
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) |
5625 ns |
6459 ns |
0.87 |
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) |
6270.5 ns |
5209 ns |
1.20 |
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) |
6979.5 ns |
10000 ns |
0.70 |
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) |
5625 ns |
5708.5 ns |
0.99 |
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA |
145670 ns |
168546.5 ns |
0.86 |
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) |
7750 ns |
7458 ns |
1.04 |
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) |
7459 ns |
6792 ns |
1.10 |
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) |
7542 ns |
7542 ns |
1.00 |
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) |
7500 ns |
7792 ns |
0.96 |
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA |
586665.5 ns |
639812.5 ns |
0.92 |
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) |
542 ns |
458 ns |
1.18 |
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) |
541 ns |
458 ns |
1.18 |
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) |
542 ns |
542 ns |
1.00 |
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) |
542 ns |
542 ns |
1.00 |
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA |
24215 ns |
24361 ns |
0.99 |
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) |
11416.5 ns |
9000 ns |
1.27 |
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) |
9334 ns |
9000 ns |
1.04 |
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) |
9916 ns |
9583 ns |
1.03 |
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) |
9417 ns |
9708 ns |
0.97 |
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA |
208481.5 ns |
234125.5 ns |
0.89 |
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/2 thread(s) |
351459 ns |
351500 ns |
1.00 |
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/4 thread(s) |
353166 ns |
351500 ns |
1.00 |
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/8 thread(s) |
351542 ns |
351916 ns |
1.00 |
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/1 thread(s) |
351042 ns |
356625 ns |
0.98 |
bias_activation(512, act=gelu)(512 x 128)/forward/GPU/CUDA |
21437 ns |
21502 ns |
1.00 |
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) |
830208 ns |
811270.5 ns |
1.02 |
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) |
778354 ns |
774958.5 ns |
1.00 |
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) |
777000 ns |
776584 ns |
1.00 |
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) |
826583.5 ns |
821875 ns |
1.01 |
bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/CUDA |
263376.5 ns |
315795.5 ns |
0.83 |
batchedmm(16, Bsize=32)/forward/CPU/2 thread(s) |
337500 ns |
335896 ns |
1.00 |
batchedmm(16, Bsize=32)/forward/CPU/4 thread(s) |
323625.5 ns |
338208.5 ns |
0.96 |
batchedmm(16, Bsize=32)/forward/CPU/8 thread(s) |
450541 ns |
441167 ns |
1.02 |
batchedmm(16, Bsize=32)/forward/CPU/1 thread(s) |
333583 ns |
331375 ns |
1.01 |
batchedmm(16, Bsize=32)/forward/GPU/CUDA |
17840 ns |
18761.5 ns |
0.95 |
batchedmm(16, Bsize=32)/zygote/CPU/2 thread(s) |
694666.5 ns |
695166 ns |
1.00 |
batchedmm(16, Bsize=32)/zygote/CPU/4 thread(s) |
737375 ns |
738208 ns |
1.00 |
batchedmm(16, Bsize=32)/zygote/CPU/8 thread(s) |
1032062.5 ns |
1036458 ns |
1.00 |
batchedmm(16, Bsize=32)/zygote/CPU/1 thread(s) |
692917 ns |
692396 ns |
1.00 |
batchedmm(16, Bsize=32)/zygote/GPU/CUDA |
237490 ns |
292461.5 ns |
0.81 |
batchedmm(16, Bsize=128)/forward/CPU/2 thread(s) |
354833 ns |
354166.5 ns |
1.00 |
batchedmm(16, Bsize=128)/forward/CPU/4 thread(s) |
332292 ns |
346771 ns |
0.96 |
batchedmm(16, Bsize=128)/forward/CPU/8 thread(s) |
423312.5 ns |
433791 ns |
0.98 |
batchedmm(16, Bsize=128)/forward/CPU/1 thread(s) |
373979 ns |
370250 ns |
1.01 |
batchedmm(16, Bsize=128)/forward/GPU/CUDA |
22593 ns |
23121 ns |
0.98 |
batchedmm(16, Bsize=128)/zygote/CPU/2 thread(s) |
751812.5 ns |
757417 ns |
0.99 |
batchedmm(16, Bsize=128)/zygote/CPU/4 thread(s) |
744708 ns |
749625 ns |
0.99 |
batchedmm(16, Bsize=128)/zygote/CPU/8 thread(s) |
1083000 ns |
1070562.5 ns |
1.01 |
batchedmm(16, Bsize=128)/zygote/CPU/1 thread(s) |
822375 ns |
828458 ns |
0.99 |
batchedmm(16, Bsize=128)/zygote/GPU/CUDA |
213868 ns |
257074.5 ns |
0.83 |
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/2 thread(s) |
3625 ns |
3292 ns |
1.10 |
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/4 thread(s) |
3625 ns |
3458 ns |
1.05 |
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/8 thread(s) |
3708 ns |
3750 ns |
0.99 |
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/1 thread(s) |
3583 ns |
3417 ns |
1.05 |
bias_activation(2, act=tanh)(2 x 128)/forward/GPU/CUDA |
18261 ns |
18586 ns |
0.98 |
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/2 thread(s) |
4250 ns |
4167 ns |
1.02 |
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/4 thread(s) |
4250 ns |
4375 ns |
0.97 |
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/8 thread(s) |
4312.5 ns |
4417 ns |
0.98 |
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/1 thread(s) |
4209 ns |
4250 ns |
0.99 |
bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/CUDA |
231902 ns |
296700.5 ns |
0.78 |
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) |
3708 ns |
3625 ns |
1.02 |
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) |
4209 ns |
3750 ns |
1.12 |
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) |
5354.5 ns |
6541 ns |
0.82 |
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) |
4000 ns |
6354.5 ns |
0.63 |
layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA |
168839 ns |
232189.5 ns |
0.73 |
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) |
8750 ns |
8187.5 ns |
1.07 |
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) |
8459 ns |
8000 ns |
1.06 |
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) |
8584 ns |
8458 ns |
1.01 |
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) |
8583 ns |
8500 ns |
1.01 |
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA |
1145924 ns |
1227082 ns |
0.93 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) |
204750 ns |
203417 ns |
1.01 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) |
210041 ns |
209541.5 ns |
1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) |
211750 ns |
208250 ns |
1.02 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) |
199500 ns |
198709 ns |
1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA |
35034 ns |
35300 ns |
0.99 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) |
644625 ns |
612417 ns |
1.05 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) |
623875 ns |
623292 ns |
1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) |
621917 ns |
623250 ns |
1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) |
629792 ns |
630166 ns |
1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA |
345017 ns |
347973 ns |
0.99 |
batchedmm(128, Bsize=128)/forward/CPU/2 thread(s) |
976354 ns |
977646 ns |
1.00 |
batchedmm(128, Bsize=128)/forward/CPU/4 thread(s) |
938833 ns |
935437.5 ns |
1.00 |
batchedmm(128, Bsize=128)/forward/CPU/8 thread(s) |
957084 ns |
970083 ns |
0.99 |
batchedmm(128, Bsize=128)/forward/CPU/1 thread(s) |
1292688 ns |
1286374.5 ns |
1.00 |
batchedmm(128, Bsize=128)/forward/GPU/CUDA |
207331 ns |
209031 ns |
0.99 |
batchedmm(128, Bsize=128)/zygote/CPU/2 thread(s) |
4494021 ns |
4514333 ns |
1.00 |
batchedmm(128, Bsize=128)/zygote/CPU/4 thread(s) |
4616104 ns |
4466146 ns |
1.03 |
batchedmm(128, Bsize=128)/zygote/CPU/8 thread(s) |
4311125 ns |
4452875 ns |
0.97 |
batchedmm(128, Bsize=128)/zygote/CPU/1 thread(s) |
6200916 ns |
6260416.5 ns |
0.99 |
batchedmm(128, Bsize=128)/zygote/GPU/CUDA |
937889.5 ns |
947144.5 ns |
0.99 |
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) |
3667 ns |
3542 ns |
1.04 |
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) |
3917 ns |
3417 ns |
1.15 |
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) |
4729.5 ns |
5896 ns |
0.80 |
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) |
3917 ns |
6667 ns |
0.59 |
layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA |
233257 ns |
219336.5 ns |
1.06 |
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) |
7708 ns |
6917 ns |
1.11 |
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) |
7209 ns |
6958 ns |
1.04 |
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) |
7625 ns |
7708 ns |
0.99 |
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) |
7375 ns |
7291 ns |
1.01 |
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA |
1025391 ns |
1020167.5 ns |
1.01 |
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s) |
1621375.5 ns |
1635042 ns |
0.99 |
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s) |
1154563 ns |
1200395.5 ns |
0.96 |
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s) |
1379604.5 ns |
1363584 ns |
1.01 |
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s) |
2492334 ns |
2345187.5 ns |
1.06 |
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/GPU/CUDA |
213997.5 ns |
215784.5 ns |
0.99 |
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s) |
12340250 ns |
12316854.5 ns |
1.00 |
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s) |
9638583 ns |
9564000 ns |
1.01 |
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s) |
9296229 ns |
9378437.5 ns |
0.99 |
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s) |
17980437.5 ns |
17989542 ns |
1.00 |
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/GPU/CUDA |
1947124.5 ns |
1948181 ns |
1.00 |
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s) |
17336562.5 ns |
17368125 ns |
1.00 |
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s) |
14454166 ns |
14382958 ns |
1.00 |
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s) |
14368729 ns |
14502250 ns |
0.99 |
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s) |
21078291 ns |
21085917 ns |
1.00 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) |
89208 ns |
90917 ns |
0.98 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) |
91000 ns |
89500 ns |
1.02 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) |
91333 ns |
91833 ns |
0.99 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) |
92750 ns |
113437.5 ns |
0.82 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA |
126550 ns |
126891 ns |
1.00 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) |
2036167 ns |
2009625 ns |
1.01 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) |
2042812.5 ns |
2030000 ns |
1.01 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) |
2030000 ns |
2039270.5 ns |
1.00 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) |
2024333 ns |
1871125 ns |
1.08 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA |
1062042.5 ns |
1032563 ns |
1.03 |
batchedmm(2, Bsize=4)/forward/CPU/2 thread(s) |
346437.5 ns |
342166.5 ns |
1.01 |
batchedmm(2, Bsize=4)/forward/CPU/4 thread(s) |
333250 ns |
343375 ns |
0.97 |
batchedmm(2, Bsize=4)/forward/CPU/8 thread(s) |
396583 ns |
406458 ns |
0.98 |
batchedmm(2, Bsize=4)/forward/CPU/1 thread(s) |
309645.5 ns |
311729 ns |
0.99 |
batchedmm(2, Bsize=4)/forward/GPU/CUDA |
15810 ns |
16465.5 ns |
0.96 |
batchedmm(2, Bsize=4)/zygote/CPU/2 thread(s) |
704708.5 ns |
706208 ns |
1.00 |
batchedmm(2, Bsize=4)/zygote/CPU/4 thread(s) |
727834 ns |
728542 ns |
1.00 |
batchedmm(2, Bsize=4)/zygote/CPU/8 thread(s) |
1027958 ns |
1018584 ns |
1.01 |
batchedmm(2, Bsize=4)/zygote/CPU/1 thread(s) |
649500 ns |
650375 ns |
1.00 |
batchedmm(2, Bsize=4)/zygote/GPU/CUDA |
194848.5 ns |
195366.5 ns |
1.00 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) |
7166 ns |
7375 ns |
0.97 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) |
5292 ns |
5875 ns |
0.90 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) |
6000 ns |
5416 ns |
1.11 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) |
9958 ns |
10000 ns |
1.00 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA |
33484 ns |
34591 ns |
0.97 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) |
223145.5 ns |
243791 ns |
0.92 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) |
220708 ns |
220125 ns |
1.00 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) |
220459 ns |
221083 ns |
1.00 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) |
218770.5 ns |
239167 ns |
0.91 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA |
345803.5 ns |
327793 ns |
1.05 |
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/2 thread(s) |
3709 ns |
3667 ns |
1.01 |
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/4 thread(s) |
3750 ns |
3667 ns |
1.02 |
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/8 thread(s) |
3667 ns |
3709 ns |
0.99 |
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/1 thread(s) |
3708 ns |
3708 ns |
1.00 |
dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/CUDA |
22336 ns |
22616 ns |
0.99 |
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/2 thread(s) |
14375 ns |
14292 ns |
1.01 |
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/4 thread(s) |
14333 ns |
14416 ns |
0.99 |
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/8 thread(s) |
14375 ns |
14208 ns |
1.01 |
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/1 thread(s) |
14416 ns |
14417 ns |
1.00 |
dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/CUDA |
484513 ns |
480334.5 ns |
1.01 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) |
92750 ns |
94458 ns |
0.98 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) |
94166.5 ns |
92625 ns |
1.02 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) |
95292 ns |
96875 ns |
0.98 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) |
92583 ns |
96229.5 ns |
0.96 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA |
125724.5 ns |
126007 ns |
1.00 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) |
1932874.5 ns |
1714792 ns |
1.13 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) |
1864833 ns |
1926792 ns |
0.97 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) |
1919250 ns |
1913291.5 ns |
1.00 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) |
1919500 ns |
1711417 ns |
1.12 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA |
959155 ns |
1034230 ns |
0.93 |
lenet(28, 28, 1, 32)/forward/CPU/2 thread(s) |
888958 ns |
876916.5 ns |
1.01 |
lenet(28, 28, 1, 32)/forward/CPU/4 thread(s) |
806791 ns |
817791 ns |
0.99 |
lenet(28, 28, 1, 32)/forward/CPU/8 thread(s) |
1216687 ns |
1169438 ns |
1.04 |
lenet(28, 28, 1, 32)/forward/CPU/1 thread(s) |
963500 ns |
966187.5 ns |
1.00 |
lenet(28, 28, 1, 32)/forward/GPU/CUDA |
276636 ns |
275657.5 ns |
1.00 |
lenet(28, 28, 1, 32)/zygote/CPU/2 thread(s) |
2755937.5 ns |
2828583 ns |
0.97 |
lenet(28, 28, 1, 32)/zygote/CPU/4 thread(s) |
2481208.5 ns |
2474833 ns |
1.00 |
lenet(28, 28, 1, 32)/zygote/CPU/8 thread(s) |
3328000 ns |
3335750 ns |
1.00 |
lenet(28, 28, 1, 32)/zygote/CPU/1 thread(s) |
3410604.5 ns |
3304292 ns |
1.03 |
lenet(28, 28, 1, 32)/zygote/GPU/CUDA |
1602148 ns |
1618381.5 ns |
0.99 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) |
15666 ns |
16709 ns |
0.94 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) |
17500 ns |
15625 ns |
1.12 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) |
18271 ns |
18667 ns |
0.98 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) |
15583 ns |
15583 ns |
1.00 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA |
141937 ns |
142594 ns |
1.00 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) |
226520.5 ns |
228750 ns |
0.99 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) |
264250 ns |
215750 ns |
1.22 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) |
215937.5 ns |
217625 ns |
0.99 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) |
230833 ns |
255500 ns |
0.90 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA |
636070 ns |
641543.5 ns |
0.99 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) |
221833 ns |
222458 ns |
1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) |
222375 ns |
221500 ns |
1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) |
222000 ns |
223458.5 ns |
0.99 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) |
219084 ns |
222604.5 ns |
0.98 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA |
268289.5 ns |
269850.5 ns |
0.99 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) |
531167 ns |
537583 ns |
0.99 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) |
507750 ns |
497334 ns |
1.02 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) |
495979.5 ns |
499583 ns |
0.99 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) |
509541 ns |
526833 ns |
0.97 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA |
1373603.5 ns |
1430878.5 ns |
0.96 |
batchedmm(16, Bsize=4)/forward/CPU/2 thread(s) |
334979 ns |
330125 ns |
1.01 |
batchedmm(16, Bsize=4)/forward/CPU/4 thread(s) |
317792 ns |
332834 ns |
0.95 |
batchedmm(16, Bsize=4)/forward/CPU/8 thread(s) |
385750 ns |
435458.5 ns |
0.89 |
batchedmm(16, Bsize=4)/forward/CPU/1 thread(s) |
322104 ns |
315917 ns |
1.02 |
batchedmm(16, Bsize=4)/forward/GPU/CUDA |
16515 ns |
16581 ns |
1.00 |
batchedmm(16, Bsize=4)/zygote/CPU/2 thread(s) |
715583 ns |
717084 ns |
1.00 |
batchedmm(16, Bsize=4)/zygote/CPU/4 thread(s) |
731000 ns |
728166.5 ns |
1.00 |
batchedmm(16, Bsize=4)/zygote/CPU/8 thread(s) |
1024292 ns |
1021104 ns |
1.00 |
batchedmm(16, Bsize=4)/zygote/CPU/1 thread(s) |
663792 ns |
662729.5 ns |
1.00 |
batchedmm(16, Bsize=4)/zygote/GPU/CUDA |
193684 ns |
195479.5 ns |
0.99 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) |
17729.5 ns |
17875 ns |
0.99 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) |
18125 ns |
17167 ns |
1.06 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) |
19646 ns |
20250 ns |
0.97 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) |
17209 ns |
17208 ns |
1.00 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA |
146583 ns |
145639 ns |
1.01 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) |
213208 ns |
223750 ns |
0.95 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) |
214812 ns |
212417 ns |
1.01 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) |
214084 ns |
214041 ns |
1.00 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) |
246000 ns |
221917 ns |
1.11 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA |
1004648 ns |
1035551.5 ns |
0.97 |
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) |
5208 ns |
6708 ns |
0.78 |
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) |
6166.5 ns |
6333 ns |
0.97 |
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) |
6958 ns |
7208 ns |
0.97 |
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) |
4500 ns |
6625 ns |
0.68 |
layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA |
238961.5 ns |
240542 ns |
0.99 |
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) |
11166.5 ns |
10584 ns |
1.06 |
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) |
10666 ns |
9917 ns |
1.08 |
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) |
10916 ns |
11166.5 ns |
0.98 |
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) |
10875 ns |
10917 ns |
1.00 |
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA |
1055349 ns |
1097401.5 ns |
0.96 |
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) |
3125 ns |
3500 ns |
0.89 |
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) |
3895.5 ns |
3208 ns |
1.21 |
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) |
4584 ns |
6333.5 ns |
0.72 |
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) |
3667 ns |
6750 ns |
0.54 |
layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA |
234616.5 ns |
250006 ns |
0.94 |
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) |
7625 ns |
7625 ns |
1.00 |
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) |
7959 ns |
7084 ns |
1.12 |
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) |
7708 ns |
8125 ns |
0.95 |
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) |
7542 ns |
7500 ns |
1.01 |
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA |
1063331 ns |
1102649 ns |
0.96 |
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s) |
23774042 ns |
23315625 ns |
1.02 |
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s) |
43803625 ns |
34529125 ns |
1.27 |
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s) |
37663250 ns |
41513333.5 ns |
0.91 |
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s) |
34941166 ns |
34929834 ns |
1.00 |
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/GPU/CUDA |
1776020.5 ns |
1838602 ns |
0.97 |
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s) |
183783458 ns |
184421875 ns |
1.00 |
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s) |
173905833 ns |
159459792 ns |
1.09 |
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s) |
146559750 ns |
151225083 ns |
0.97 |
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s) |
414263417 ns |
413223958 ns |
1.00 |
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/GPU/CUDA |
16542930.5 ns |
16387494 ns |
1.01 |
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s) |
426447000 ns |
428743125 ns |
0.99 |
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s) |
259361708 ns |
252439020.5 ns |
1.03 |
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s) |
231145458 ns |
233017396 ns |
0.99 |
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s) |
485707250 ns |
484197291 ns |
1.00 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) |
182625 ns |
183584 ns |
0.99 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) |
184500 ns |
182750 ns |
1.01 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) |
184937 ns |
186625 ns |
0.99 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) |
183083 ns |
183146 ns |
1.00 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA |
219470.5 ns |
228677.5 ns |
0.96 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) |
588000 ns |
596083 ns |
0.99 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) |
589541.5 ns |
586292 ns |
1.01 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) |
587000 ns |
589770.5 ns |
1.00 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) |
628792 ns |
631958 ns |
0.99 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA |
1062123 ns |
1119701 ns |
0.95 |
batchedmm(128, Bsize=512)/forward/CPU/2 thread(s) |
3844333.5 ns |
3838833 ns |
1.00 |
batchedmm(128, Bsize=512)/forward/CPU/4 thread(s) |
3705021.5 ns |
3643375.5 ns |
1.02 |
batchedmm(128, Bsize=512)/forward/CPU/8 thread(s) |
3474208 ns |
3563521 ns |
0.97 |
batchedmm(128, Bsize=512)/forward/CPU/1 thread(s) |
5354792 ns |
5359750 ns |
1.00 |
batchedmm(128, Bsize=512)/forward/GPU/CUDA |
531867 ns |
537722 ns |
0.99 |
batchedmm(128, Bsize=512)/zygote/CPU/2 thread(s) |
17410625 ns |
17412417 ns |
1.00 |
batchedmm(128, Bsize=512)/zygote/CPU/4 thread(s) |
17824604 ns |
17190667 ns |
1.04 |
batchedmm(128, Bsize=512)/zygote/CPU/8 thread(s) |
16530791.5 ns |
17100375 ns |
0.97 |
batchedmm(128, Bsize=512)/zygote/CPU/1 thread(s) |
22146375 ns |
22144083 ns |
1.00 |
batchedmm(128, Bsize=512)/zygote/GPU/CUDA |
2636212 ns |
2612799 ns |
1.01 |
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) |
500 ns |
542 ns |
0.92 |
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) |
541 ns |
458 ns |
1.18 |
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) |
542 ns |
542 ns |
1.00 |
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) |
542 ns |
583 ns |
0.93 |
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA |
31938 ns |
32035 ns |
1.00 |
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) |
9417 ns |
9208 ns |
1.02 |
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) |
9417 ns |
8542 ns |
1.10 |
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) |
9750 ns |
10208 ns |
0.96 |
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) |
10084 ns |
9459 ns |
1.07 |
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA |
261322 ns |
264327.5 ns |
0.99 |
vgg16(32, 32, 3, 128)/forward/CPU/2 thread(s) |
501945500 ns |
504274209 ns |
1.00 |
vgg16(32, 32, 3, 128)/forward/CPU/4 thread(s) |
515492167 ns |
430218396 ns |
1.20 |
vgg16(32, 32, 3, 128)/forward/CPU/8 thread(s) |
433413812.5 ns |
471374500 ns |
0.92 |
vgg16(32, 32, 3, 128)/forward/CPU/1 thread(s) |
671064375 ns |
672994208.5 ns |
1.00 |
vgg16(32, 32, 3, 128)/forward/GPU/CUDA |
11967043.5 ns |
12486595 ns |
0.96 |
vgg16(32, 32, 3, 128)/zygote/CPU/2 thread(s) |
2042006188 ns |
2049529562.5 ns |
1.00 |
vgg16(32, 32, 3, 128)/zygote/CPU/4 thread(s) |
1670021000 ns |
1632649709 ns |
1.02 |
vgg16(32, 32, 3, 128)/zygote/CPU/8 thread(s) |
1491864937.5 ns |
1536417708 ns |
0.97 |
vgg16(32, 32, 3, 128)/zygote/CPU/1 thread(s) |
2215896125 ns |
2205666041.5 ns |
1.00 |
vgg16(32, 32, 3, 128)/zygote/GPU/CUDA |
49221176 ns |
49389302 ns |
1.00 |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s) |
1658250.5 ns |
1657645.5 ns |
1.00 |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s) |
1182042 ns |
1189208.5 ns |
0.99 |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s) |
1397145.5 ns |
1382000 ns |
1.01 |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s) |
2472500 ns |
2334125 ns |
1.06 |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/GPU/CUDA |
215560 ns |
214982 ns |
1.00 |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s) |
12703625 ns |
12688500 ns |
1.00 |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s) |
10026271 ns |
9942000 ns |
1.01 |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s) |
9729709 ns |
9748312.5 ns |
1.00 |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s) |
18433208 ns |
18407312 ns |
1.00 |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/GPU/CUDA |
2021422 ns |
2050613 ns |
0.99 |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s) |
17685333 ns |
17691583.5 ns |
1.00 |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s) |
14724875 ns |
14746041.5 ns |
1.00 |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s) |
14636209 ns |
14804417 ns |
0.99 |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s) |
21349667 ns |
21386084 ns |
1.00 |
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/2 thread(s) |
26209 ns |
26167 ns |
1.00 |
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/4 thread(s) |
26292 ns |
26292 ns |
1 |
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/8 thread(s) |
26209 ns |
26291 ns |
1.00 |
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/1 thread(s) |
26250 ns |
26291 ns |
1.00 |
dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/CUDA |
24156 ns |
24125 ns |
1.00 |
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) |
66792 ns |
66875 ns |
1.00 |
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) |
66875 ns |
66917 ns |
1.00 |
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) |
68125 ns |
67083 ns |
1.02 |
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) |
66917 ns |
67209 ns |
1.00 |
dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/CUDA |
395970.5 ns |
398847.5 ns |
0.99 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) |
204542 ns |
202667 ns |
1.01 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) |
208875 ns |
209000 ns |
1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) |
210625 ns |
209167 ns |
1.01 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) |
199000 ns |
199583 ns |
1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA |
26615 ns |
26392 ns |
1.01 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) |
602625.5 ns |
612416.5 ns |
0.98 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) |
622854 ns |
627416.5 ns |
0.99 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) |
665875 ns |
667979 ns |
1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) |
581270.5 ns |
631250 ns |
0.92 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA |
347431.5 ns |
353043.5 ns |
0.98 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) |
637021 ns |
645542 ns |
0.99 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) |
589083 ns |
643375 ns |
0.92 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) |
653375 ns |
664187.5 ns |
0.98 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) |
640021 ns |
540834 ns |
1.18 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA |
131837 ns |
132126 ns |
1.00 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) |
2250645.5 ns |
2247375 ns |
1.00 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) |
2309792 ns |
2239958 ns |
1.03 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) |
2226583 ns |
2302917 ns |
0.97 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) |
2236125 ns |
2219000 ns |
1.01 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA |
1169730 ns |
1328726 ns |
0.88 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) |
17292 ns |
17667 ns |
0.98 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) |
18334 ns |
16979.5 ns |
1.08 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) |
22750 ns |
20792 ns |
1.09 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) |
17062.5 ns |
18500 ns |
0.92 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA |
145778 ns |
146392.5 ns |
1.00 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) |
221375 ns |
229708 ns |
0.96 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) |
226833 ns |
225333 ns |
1.01 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) |
256916 ns |
229292 ns |
1.12 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) |
259375 ns |
259083 ns |
1.00 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA |
998109 ns |
1081671 ns |
0.92 |
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) |
542 ns |
500 ns |
1.08 |
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) |
541 ns |
459 ns |
1.18 |
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) |
542 ns |
542 ns |
1 |
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) |
542 ns |
542 ns |
1 |
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA |
23995 ns |
23645 ns |
1.01 |
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) |
9958 ns |
9833.5 ns |
1.01 |
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) |
10000 ns |
9542 ns |
1.05 |
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) |
9916 ns |
10708 ns |
0.93 |
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) |
9625 ns |
9916 ns |
0.97 |
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA |
260297.5 ns |
262941 ns |
0.99 |
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) |
6084 ns |
7291 ns |
0.83 |
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) |
6250 ns |
5833 ns |
1.07 |
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) |
6395.5 ns |
9625 ns |
0.66 |
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) |
5770.5 ns |
7250 ns |
0.80 |
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA |
225464 ns |
234003 ns |
0.96 |
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) |
7458 ns |
7333 ns |
1.02 |
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) |
7125 ns |
7000 ns |
1.02 |
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) |
7709 ns |
7833 ns |
0.98 |
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) |
7291.5 ns |
7250 ns |
1.01 |
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA |
787625 ns |
810029.5 ns |
0.97 |
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/2 thread(s) |
2375 ns |
2042 ns |
1.16 |
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/4 thread(s) |
2417 ns |
2000 ns |
1.21 |
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/8 thread(s) |
2208 ns |
2375 ns |
0.93 |
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/1 thread(s) |
2167 ns |
2208 ns |
0.98 |
bias_activation(2, act=gelu)(2 x 128)/forward/GPU/CUDA |
18197 ns |
18218 ns |
1.00 |
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) |
6583.5 ns |
6542 ns |
1.01 |
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) |
6625 ns |
6500 ns |
1.02 |
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) |
6875 ns |
6708 ns |
1.02 |
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) |
6584 ns |
6750 ns |
0.98 |
bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/CUDA |
334201 ns |
335368 ns |
1.00 |
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/2 thread(s) |
746791 ns |
750166 ns |
1.00 |
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/4 thread(s) |
749604 ns |
746604.5 ns |
1.00 |
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/8 thread(s) |
750395.5 ns |
751041 ns |
1.00 |
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/1 thread(s) |
748875 ns |
761417 ns |
0.98 |
bias_activation(512, act=tanh)(512 x 128)/forward/GPU/CUDA |
21933 ns |
21856 ns |
1.00 |
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/2 thread(s) |
775958.5 ns |
775334 ns |
1.00 |
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/4 thread(s) |
789521 ns |
775042 ns |
1.02 |
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/8 thread(s) |
797084 ns |
804792 ns |
0.99 |
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/1 thread(s) |
808708.5 ns |
791625 ns |
1.02 |
bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/CUDA |
295656 ns |
299022 ns |
0.99 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) |
7417 ns |
7375 ns |
1.01 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) |
5125 ns |
5875 ns |
0.87 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) |
5958 ns |
5208 ns |
1.14 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) |
10125 ns |
10125 ns |
1 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA |
33608 ns |
32492 ns |
1.03 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) |
220416 ns |
233188 ns |
0.95 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) |
234416.5 ns |
227750 ns |
1.03 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) |
266500 ns |
254458 ns |
1.05 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) |
217792 ns |
255583 ns |
0.85 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA |
363230.5 ns |
359227 ns |
1.01 |
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) |
10792 ns |
11042 ns |
0.98 |
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) |
11833 ns |
12458 ns |
0.95 |
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) |
12292 ns |
12959 ns |
0.95 |
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) |
10062.5 ns |
12000 ns |
0.84 |
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA |
247964 ns |
245075.5 ns |
1.01 |
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) |
24979 ns |
24875 ns |
1.00 |
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) |
24875 ns |
24458 ns |
1.02 |
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) |
25542 ns |
25458 ns |
1.00 |
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) |
24584 ns |
24583.5 ns |
1.00 |
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA |
1136226 ns |
1120608 ns |
1.01 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s) |
107466667 ns |
106980458 ns |
1.00 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s) |
126780083 ns |
118006979.5 ns |
1.07 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s) |
119831458 ns |
123940208 ns |
0.97 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s) |
117447125 ns |
118407959 ns |
0.99 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/GPU/CUDA |
2659299 ns |
2661574 ns |
1.00 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s) |
393751417 ns |
394378313 ns |
1.00 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s) |
371911479 ns |
368164500 ns |
1.01 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s) |
422962500 ns |
358657167 ns |
1.18 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s) |
483853792 ns |
482282708 ns |
1.00 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/GPU/CUDA |
15200417 ns |
15138278 ns |
1.00 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s) |
946860708 ns |
759267583 ns |
1.25 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s) |
771052125 ns |
577881125 ns |
1.33 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s) |
743467646.5 ns |
749378833 ns |
0.99 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s) |
950911604.5 ns |
945671312.5 ns |
1.01 |
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) |
7083 ns |
7458 ns |
0.95 |
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) |
7542 ns |
7958 ns |
0.95 |
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) |
7917 ns |
8750 ns |
0.90 |
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) |
7375 ns |
7333 ns |
1.01 |
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA |
239130 ns |
235620 ns |
1.01 |
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) |
14833 ns |
14500 ns |
1.02 |
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) |
14041 ns |
13333 ns |
1.05 |
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) |
14083 ns |
15041 ns |
0.94 |
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) |
15375 ns |
14292 ns |
1.08 |
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA |
1090512.5 ns |
1078273.5 ns |
1.01 |
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) |
6833 ns |
8542 ns |
0.80 |
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) |
7500 ns |
7792 ns |
0.96 |
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) |
8375 ns |
9187.5 ns |
0.91 |
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) |
6334 ns |
7833.5 ns |
0.81 |
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA |
238152.5 ns |
235827.5 ns |
1.01 |
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) |
13208 ns |
13167 ns |
1.00 |
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) |
12666 ns |
12084 ns |
1.05 |
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) |
12708 ns |
13084 ns |
0.97 |
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) |
12917 ns |
12833 ns |
1.01 |
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA |
802056 ns |
787391.5 ns |
1.02 |
batchedmm(2, Bsize=128)/forward/CPU/2 thread(s) |
350458 ns |
347250 ns |
1.01 |
batchedmm(2, Bsize=128)/forward/CPU/4 thread(s) |
329666 ns |
344875 ns |
0.96 |
batchedmm(2, Bsize=128)/forward/CPU/8 thread(s) |
395709 ns |
409896 ns |
0.97 |
batchedmm(2, Bsize=128)/forward/CPU/1 thread(s) |
316709 ns |
310562 ns |
1.02 |
batchedmm(2, Bsize=128)/forward/GPU/CUDA |
17058.5 ns |
16566 ns |
1.03 |
batchedmm(2, Bsize=128)/zygote/CPU/2 thread(s) |
708541.5 ns |
713833.5 ns |
0.99 |
batchedmm(2, Bsize=128)/zygote/CPU/4 thread(s) |
727687.5 ns |
727291 ns |
1.00 |
batchedmm(2, Bsize=128)/zygote/CPU/8 thread(s) |
1032625 ns |
1023416 ns |
1.01 |
batchedmm(2, Bsize=128)/zygote/CPU/1 thread(s) |
647187 ns |
654959 ns |
0.99 |
batchedmm(2, Bsize=128)/zygote/GPU/CUDA |
201070 ns |
197250.5 ns |
1.02 |
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) |
375 ns |
292 ns |
1.28 |
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) |
375 ns |
291 ns |
1.29 |
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) |
375 ns |
375 ns |
1 |
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) |
375 ns |
375 ns |
1 |
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA |
23717 ns |
23066 ns |
1.03 |
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) |
6708 ns |
6250 ns |
1.07 |
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) |
6708 ns |
6334 ns |
1.06 |
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) |
6750 ns |
6750 ns |
1 |
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) |
6541 ns |
6791 ns |
0.96 |
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA |
243248.5 ns |
238420 ns |
1.02 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) |
5875 ns |
5750 ns |
1.02 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) |
5875 ns |
5750 ns |
1.02 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) |
5833 ns |
5875 ns |
0.99 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) |
5917 ns |
5834 ns |
1.01 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA |
25115.5 ns |
23863 ns |
1.05 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) |
21333 ns |
21750 ns |
0.98 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) |
21125 ns |
21000 ns |
1.01 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) |
21250 ns |
21958 ns |
0.97 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) |
23584 ns |
21708 ns |
1.09 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA |
266274.5 ns |
261085 ns |
1.02 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) |
148458 ns |
152146 ns |
0.98 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) |
147750 ns |
145250 ns |
1.02 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) |
151062.5 ns |
149541 ns |
1.01 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) |
146167 ns |
145937 ns |
1.00 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA |
167783 ns |
166536.5 ns |
1.01 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) |
1323750 ns |
1328792 ns |
1.00 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) |
1358667 ns |
1319083.5 ns |
1.03 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) |
1313750 ns |
1350812.5 ns |
0.97 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) |
1301166 ns |
1317084 ns |
0.99 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA |
1350876 ns |
1336276 ns |
1.01 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) |
23229 ns |
24917 ns |
0.93 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) |
23958 ns |
24208 ns |
0.99 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) |
25479 ns |
25708 ns |
0.99 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) |
21834 ns |
24208.5 ns |
0.90 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA |
286227 ns |
351114.5 ns |
0.82 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) |
178625 ns |
131125 ns |
1.36 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) |
183854 ns |
117791 ns |
1.56 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) |
131375 ns |
172917 ns |
0.76 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) |
130417 ns |
177334 ns |
0.74 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA |
1475072 ns |
1465398.5 ns |
1.01 |
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) |
375 ns |
333 ns |
1.13 |
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) |
375 ns |
292 ns |
1.28 |
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) |
375 ns |
375 ns |
1 |
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) |
334 ns |
375 ns |
0.89 |
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA |
23581 ns |
22926 ns |
1.03 |
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) |
6791 ns |
6417 ns |
1.06 |
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) |
6500 ns |
6458 ns |
1.01 |
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) |
6750 ns |
6917 ns |
0.98 |
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) |
6375 ns |
6542 ns |
0.97 |
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA |
259663.5 ns |
254551 ns |
1.02 |
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) |
4750 ns |
7625 ns |
0.62 |
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) |
4584 ns |
4167 ns |
1.10 |
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) |
6041 ns |
7708.5 ns |
0.78 |
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) |
4625 ns |
7375 ns |
0.63 |
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA |
257684 ns |
250274.5 ns |
1.03 |
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) |
10708 ns |
10042 ns |
1.07 |
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) |
9833 ns |
9708 ns |
1.01 |
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) |
10292 ns |
10333 ns |
1.00 |
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) |
10375 ns |
10250 ns |
1.01 |
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA |
1358165 ns |
1345295 ns |
1.01 |
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/2 thread(s) |
1583 ns |
1584 ns |
1.00 |
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/4 thread(s) |
1625 ns |
1625 ns |
1 |
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/8 thread(s) |
1583 ns |
1625 ns |
0.97 |
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/1 thread(s) |
1625 ns |
1625 ns |
1 |
dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/CUDA |
23401 ns |
22897 ns |
1.02 |
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) |
5625 ns |
5625 ns |
1 |
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) |
5667 ns |
5584 ns |
1.01 |
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) |
5958 ns |
5959 ns |
1.00 |
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) |
5625 ns |
5958 ns |
0.94 |
dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/CUDA |
277871.5 ns |
271438.5 ns |
1.02 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s) |
6812917 ns |
6886125 ns |
0.99 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s) |
6411729 ns |
6378229 ns |
1.01 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s) |
6505500 ns |
6526875 ns |
1.00 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s) |
7506500 ns |
7602250 ns |
0.99 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/GPU/CUDA |
215844 ns |
213111 ns |
1.01 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s) |
24090645.5 ns |
24073062 ns |
1.00 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s) |
21288541 ns |
21283625 ns |
1.00 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s) |
21056833.5 ns |
21045584 ns |
1.00 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s) |
29826229 ns |
29677875 ns |
1.00 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/GPU/CUDA |
2100052 ns |
2108165 ns |
1.00 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s) |
49024708 ns |
37353145.5 ns |
1.31 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s) |
45355854 ns |
34386667 ns |
1.32 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s) |
45722417 ns |
45930020.5 ns |
1.00 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s) |
49511500 ns |
49322334 ns |
1.00 |
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) |
6167 ns |
7708.5 ns |
0.80 |
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) |
6792 ns |
5875 ns |
1.16 |
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) |
6833 ns |
8333 ns |
0.82 |
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) |
6041.5 ns |
7062.5 ns |
0.86 |
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA |
234601.5 ns |
238522.5 ns |
0.98 |
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) |
8958 ns |
8458 ns |
1.06 |
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) |
8500 ns |
8042 ns |
1.06 |
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) |
8417 ns |
8583 ns |
0.98 |
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) |
8625 ns |
8292 ns |
1.04 |
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA |
1058513.5 ns |
1070850 ns |
0.99 |
lenet(28, 28, 1, 128)/forward/CPU/2 thread(s) |
1558500 ns |
1544374.5 ns |
1.01 |
lenet(28, 28, 1, 128)/forward/CPU/4 thread(s) |
1250312.5 ns |
1259666.5 ns |
0.99 |
lenet(28, 28, 1, 128)/forward/CPU/8 thread(s) |
1630333 ns |
1632771 ns |
1.00 |
lenet(28, 28, 1, 128)/forward/CPU/1 thread(s) |
2133792 ns |
2150667 ns |
0.99 |
lenet(28, 28, 1, 128)/forward/GPU/CUDA |
282051.5 ns |
278945 ns |
1.01 |
lenet(28, 28, 1, 128)/zygote/CPU/2 thread(s) |
7880208 ns |
7908937.5 ns |
1.00 |
lenet(28, 28, 1, 128)/zygote/CPU/4 thread(s) |
6620833 ns |
6609937 ns |
1.00 |
lenet(28, 28, 1, 128)/zygote/CPU/8 thread(s) |
7206000 ns |
7237750.5 ns |
1.00 |
lenet(28, 28, 1, 128)/zygote/CPU/1 thread(s) |
10474041 ns |
10434334 ns |
1.00 |
lenet(28, 28, 1, 128)/zygote/GPU/CUDA |
1900822 ns |
1889956 ns |
1.01 |
batchedmm(128, Bsize=4)/forward/CPU/2 thread(s) |
346250 ns |
340979 ns |
1.02 |
batchedmm(128, Bsize=4)/forward/CPU/4 thread(s) |
328854 ns |
345792 ns |
0.95 |
batchedmm(128, Bsize=4)/forward/CPU/8 thread(s) |
405042 ns |
417125 ns |
0.97 |
batchedmm(128, Bsize=4)/forward/CPU/1 thread(s) |
340459 ns |
345833 ns |
0.98 |
batchedmm(128, Bsize=4)/forward/GPU/CUDA |
43005 ns |
42448 ns |
1.01 |
batchedmm(128, Bsize=4)/zygote/CPU/2 thread(s) |
748124.5 ns |
746500.5 ns |
1.00 |
batchedmm(128, Bsize=4)/zygote/CPU/4 thread(s) |
772375 ns |
784542 ns |
0.98 |
batchedmm(128, Bsize=4)/zygote/CPU/8 thread(s) |
1074375 ns |
1073250 ns |
1.00 |
batchedmm(128, Bsize=4)/zygote/CPU/1 thread(s) |
760354.5 ns |
761062.5 ns |
1.00 |
batchedmm(128, Bsize=4)/zygote/GPU/CUDA |
306884 ns |
303720.5 ns |
1.01 |
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/2 thread(s) |
397458 ns |
397500 ns |
1.00 |
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/4 thread(s) |
212042 ns |
288250 ns |
0.74 |
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/8 thread(s) |
288500 ns |
212666 ns |
1.36 |
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/1 thread(s) |
750625 ns |
756084 ns |
0.99 |
dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/CUDA |
44869 ns |
43887 ns |
1.02 |
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/2 thread(s) |
670667 ns |
671083 ns |
1.00 |
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/4 thread(s) |
471167 ns |
530083 ns |
0.89 |
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/8 thread(s) |
530000 ns |
470667 ns |
1.13 |
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/1 thread(s) |
973833 ns |
974750 ns |
1.00 |
dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/CUDA |
193447 ns |
188388.5 ns |
1.03 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) |
581041.5 ns |
679250 ns |
0.86 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) |
635229.5 ns |
645333.5 ns |
0.98 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) |
653000 ns |
642458 ns |
1.02 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) |
648562.5 ns |
638562.5 ns |
1.02 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA |
132519 ns |
131530 ns |
1.01 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) |
2467438 ns |
2409292 ns |
1.02 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) |
2534479 ns |
2456416.5 ns |
1.03 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) |
2455458 ns |
2514583 ns |
0.98 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) |
2440417 ns |
2456292 ns |
0.99 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA |
1289127 ns |
1277300 ns |
1.01 |
batchedmm(2, Bsize=32)/forward/CPU/2 thread(s) |
347416.5 ns |
345146 ns |
1.01 |
batchedmm(2, Bsize=32)/forward/CPU/4 thread(s) |
344624.5 ns |
343583 ns |
1.00 |
batchedmm(2, Bsize=32)/forward/CPU/8 thread(s) |
393083 ns |
403708.5 ns |
0.97 |
batchedmm(2, Bsize=32)/forward/CPU/1 thread(s) |
314209 ns |
312208 ns |
1.01 |
batchedmm(2, Bsize=32)/forward/GPU/CUDA |
15737 ns |
16009 ns |
0.98 |
batchedmm(2, Bsize=32)/zygote/CPU/2 thread(s) |
705458 ns |
709667 ns |
0.99 |
batchedmm(2, Bsize=32)/zygote/CPU/4 thread(s) |
724687 ns |
724500 ns |
1.00 |
batchedmm(2, Bsize=32)/zygote/CPU/8 thread(s) |
1024417 ns |
1022687.5 ns |
1.00 |
batchedmm(2, Bsize=32)/zygote/CPU/1 thread(s) |
644354 ns |
650417 ns |
0.99 |
batchedmm(2, Bsize=32)/zygote/GPU/CUDA |
200179.5 ns |
195917 ns |
1.02 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) |
1464625 ns |
1460417 ns |
1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) |
1497083 ns |
1500812.5 ns |
1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) |
1505541 ns |
1496375 ns |
1.01 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) |
1442125 ns |
1438708 ns |
1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA |
41443 ns |
40600 ns |
1.02 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) |
5133584 ns |
5128791 ns |
1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) |
5289208 ns |
5302375 ns |
1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) |
5294792 ns |
5313000 ns |
1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) |
4976875 ns |
4970208.5 ns |
1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA |
200107.5 ns |
196206.5 ns |
1.02 |
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/2 thread(s) |
3667 ns |
3667 ns |
1 |
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/4 thread(s) |
3709 ns |
3667 ns |
1.01 |
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/8 thread(s) |
3708 ns |
3709 ns |
1.00 |
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/1 thread(s) |
3708 ns |
3708 ns |
1 |
dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/CUDA |
33649 ns |
32895 ns |
1.02 |
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/2 thread(s) |
15250 ns |
15167 ns |
1.01 |
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/4 thread(s) |
15166 ns |
15083 ns |
1.01 |
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/8 thread(s) |
15375 ns |
15083 ns |
1.02 |
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/1 thread(s) |
15292 ns |
15375 ns |
0.99 |
dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/CUDA |
384079.5 ns |
376729 ns |
1.02 |
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/2 thread(s) |
71000 ns |
71459 ns |
0.99 |
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/4 thread(s) |
71583 ns |
71250 ns |
1.00 |
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/8 thread(s) |
71292 ns |
71375 ns |
1.00 |
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/1 thread(s) |
71084 ns |
70708 ns |
1.01 |
dense(512, bias=false, act=relu)(512 x 128)/forward/GPU/CUDA |
113767.5 ns |
113177.5 ns |
1.01 |
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/2 thread(s) |
318209 ns |
317917 ns |
1.00 |
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/4 thread(s) |
326667 ns |
320417 ns |
1.02 |
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/8 thread(s) |
318708 ns |
325333 ns |
0.98 |
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/1 thread(s) |
317750 ns |
320916 ns |
0.99 |
dense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/CUDA |
197585.5 ns |
193043 ns |
1.02 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) |
1042 ns |
958 ns |
1.09 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) |
1084 ns |
958 ns |
1.13 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) |
1042 ns |
1083 ns |
0.96 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) |
1083 ns |
1042 ns |
1.04 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA |
23981 ns |
23363 ns |
1.03 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) |
8334 ns |
8083 ns |
1.03 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) |
8166 ns |
7792 ns |
1.05 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) |
8542 ns |
8750 ns |
0.98 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) |
8167 ns |
8750 ns |
0.93 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA |
265125.5 ns |
260535.5 ns |
1.02 |
batchedmm(128, Bsize=32)/forward/CPU/2 thread(s) |
473542 ns |
475499.5 ns |
1.00 |
batchedmm(128, Bsize=32)/forward/CPU/4 thread(s) |
457604 ns |
470520.5 ns |
0.97 |
batchedmm(128, Bsize=32)/forward/CPU/8 thread(s) |
551709 ns |
557125 ns |
0.99 |
batchedmm(128, Bsize=32)/forward/CPU/1 thread(s) |
532104.5 ns |
557959 ns |
0.95 |
batchedmm(128, Bsize=32)/forward/GPU/CUDA |
130158 ns |
129404 ns |
1.01 |
batchedmm(128, Bsize=32)/zygote/CPU/2 thread(s) |
1398125 ns |
1399270.5 ns |
1.00 |
batchedmm(128, Bsize=32)/zygote/CPU/4 thread(s) |
1382291 ns |
1382375 ns |
1.00 |
batchedmm(128, Bsize=32)/zygote/CPU/8 thread(s) |
1633958 ns |
1611125 ns |
1.01 |
batchedmm(128, Bsize=32)/zygote/CPU/1 thread(s) |
1575646 ns |
1582104.5 ns |
1.00 |
batchedmm(128, Bsize=32)/zygote/GPU/CUDA |
276840 ns |
274924 ns |
1.01 |
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) |
375 ns |
375 ns |
1 |
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) |
375 ns |
250 ns |
1.50 |
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) |
375 ns |
375 ns |
1 |
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) |
375 ns |
375 ns |
1 |
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA |
32584 ns |
31647 ns |
1.03 |
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) |
6334 ns |
6375 ns |
0.99 |
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) |
6458 ns |
6042 ns |
1.07 |
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) |
6583 ns |
6666 ns |
0.99 |
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) |
6625 ns |
6625 ns |
1 |
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA |
268198.5 ns |
262541.5 ns |
1.02 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) |
1719562.5 ns |
1761833 ns |
0.98 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) |
1729708 ns |
1723396 ns |
1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) |
1726959 ns |
1733812.5 ns |
1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) |
1722666 ns |
1730625 ns |
1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA |
169321 ns |
169477.5 ns |
1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) |
4372708 ns |
4358625 ns |
1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) |
4391416 ns |
4358708 ns |
1.01 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) |
4359145.5 ns |
4403062.5 ns |
0.99 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) |
4362875 ns |
4373875 ns |
1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA |
1232540 ns |
1208123 ns |
1.02 |
bias_activation(512, act=relu)(512 x 128)/forward/CPU/2 thread(s) |
7042 ns |
7167 ns |
0.98 |
bias_activation(512, act=relu)(512 x 128)/forward/CPU/4 thread(s) |
6729.5 ns |
6875 ns |
0.98 |
bias_activation(512, act=relu)(512 x 128)/forward/CPU/8 thread(s) |
6812.5 ns |
6916 ns |
0.99 |
bias_activation(512, act=relu)(512 x 128)/forward/CPU/1 thread(s) |
6750 ns |
6750 ns |
1 |
bias_activation(512, act=relu)(512 x 128)/forward/GPU/CUDA |
20395.5 ns |
20662 ns |
0.99 |
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/2 thread(s) |
32500 ns |
51625 ns |
0.63 |
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/4 thread(s) |
32562.5 ns |
32917 ns |
0.99 |
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/8 thread(s) |
48500 ns |
48208.5 ns |
1.01 |
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/1 thread(s) |
51771 ns |
51417 ns |
1.01 |
bias_activation(512, act=relu)(512 x 128)/zygote/GPU/CUDA |
210266.5 ns |
292106.5 ns |
0.72 |
batchedmm(2, Bsize=512)/forward/CPU/2 thread(s) |
356646 ns |
354562.5 ns |
1.01 |
batchedmm(2, Bsize=512)/forward/CPU/4 thread(s) |
335791.5 ns |
348666.5 ns |
0.96 |
batchedmm(2, Bsize=512)/forward/CPU/8 thread(s) |
402625 ns |
433333 ns |
0.93 |
batchedmm(2, Bsize=512)/forward/CPU/1 thread(s) |
323416 ns |
322041.5 ns |
1.00 |
batchedmm(2, Bsize=512)/forward/GPU/CUDA |
18222 ns |
18353 ns |
0.99 |
batchedmm(2, Bsize=512)/zygote/CPU/2 thread(s) |
725125 ns |
724625 ns |
1.00 |
batchedmm(2, Bsize=512)/zygote/CPU/4 thread(s) |
737333 ns |
730583 ns |
1.01 |
batchedmm(2, Bsize=512)/zygote/CPU/8 thread(s) |
1031875 ns |
1038687.5 ns |
0.99 |
batchedmm(2, Bsize=512)/zygote/CPU/1 thread(s) |
669958.5 ns |
675333 ns |
0.99 |
batchedmm(2, Bsize=512)/zygote/GPU/CUDA |
344455.5 ns |
335730.5 ns |
1.03 |
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/2 thread(s) |
75208 ns |
75458 ns |
1.00 |
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/4 thread(s) |
75500 ns |
75333 ns |
1.00 |
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/8 thread(s) |
75500 ns |
75375 ns |
1.00 |
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/1 thread(s) |
75125 ns |
74584 ns |
1.01 |
dense(512, bias=true, act=relu)(512 x 128)/forward/GPU/CUDA |
47155 ns |
46864.5 ns |
1.01 |
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/2 thread(s) |
326834 ns |
325166 ns |
1.01 |
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/4 thread(s) |
333625 ns |
324250 ns |
1.03 |
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/8 thread(s) |
328209 ns |
336875 ns |
0.97 |
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/1 thread(s) |
324000 ns |
325125 ns |
1.00 |
dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/CUDA |
211460.5 ns |
209059.5 ns |
1.01 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) |
1488895.5 ns |
1485709 ns |
1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) |
1521291 ns |
1526833 ns |
1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) |
1530292 ns |
1522792 ns |
1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) |
1466500 ns |
1462625 ns |
1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA |
52743 ns |
51397 ns |
1.03 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) |
5023833 ns |
5113395.5 ns |
0.98 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) |
5236584 ns |
5295292 ns |
0.99 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) |
5286958 ns |
5300812.5 ns |
1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) |
4985229 ns |
5001042 ns |
1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA |
203850 ns |
202971.5 ns |
1.00 |
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/2 thread(s) |
28125 ns |
28250 ns |
1.00 |
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/4 thread(s) |
28250 ns |
28208 ns |
1.00 |
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/8 thread(s) |
28167 ns |
28208 ns |
1.00 |
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/1 thread(s) |
28167 ns |
28209 ns |
1.00 |
dense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/CUDA |
24574 ns |
24514.5 ns |
1.00 |
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) |
66417 ns |
66417 ns |
1 |
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) |
66333 ns |
66458 ns |
1.00 |
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) |
66542 ns |
66500 ns |
1.00 |
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) |
66500 ns |
66500 ns |
1 |
dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/CUDA |
530555.5 ns |
505942 ns |
1.05 |
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/2 thread(s) |
1396021 ns |
1502084 ns |
0.93 |
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/4 thread(s) |
943916.5 ns |
1124250 ns |
0.84 |
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/8 thread(s) |
1146250 ns |
944270.5 ns |
1.21 |
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/1 thread(s) |
2256750 ns |
2255250 ns |
1.00 |
mlp7layer_bn(tanh)(32 x 256)/forward/GPU/CUDA |
570145.5 ns |
566674 ns |
1.01 |
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/2 thread(s) |
3088833 ns |
3090791 ns |
1.00 |
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/4 thread(s) |
2633020.5 ns |
2751542 ns |
0.96 |
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/8 thread(s) |
2704979.5 ns |
2628896 ns |
1.03 |
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/1 thread(s) |
3774834 ns |
3819709 ns |
0.99 |
mlp7layer_bn(tanh)(32 x 256)/zygote/GPU/CUDA |
2057819 ns |
1979936 ns |
1.04 |
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/2 thread(s) |
8833062 ns |
8847333 ns |
1.00 |
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/4 thread(s) |
8756313 ns |
8768375 ns |
1.00 |
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/8 thread(s) |
8786291.5 ns |
8750250 ns |
1.00 |
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/1 thread(s) |
6371937.5 ns |
6340375 ns |
1.00 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) |
80521 ns |
85125 ns |
0.95 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) |
84083 ns |
83021 ns |
1.01 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) |
81292 ns |
85708.5 ns |
0.95 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) |
81479 ns |
83562.5 ns |
0.98 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA |
192849.5 ns |
192703 ns |
1.00 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) |
1749833 ns |
2012875 ns |
0.87 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) |
2022542 ns |
2024062.5 ns |
1.00 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) |
2022167 ns |
2038542 ns |
0.99 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) |
2008646 ns |
2008812 ns |
1.00 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA |
800645 ns |
791664.5 ns |
1.01 |
This comment was automatically generated by workflow using github-action-benchmark.
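As a reading aid for the rows above: in every row, the third value appears to be the first timing divided by the second, rounded to two decimals, with an exact tie printed as `1`. The sketch below reproduces that relationship for a few rows copied from the table. The function name `format_ratio` is hypothetical, and the reading that the first timing is the current commit and the second the previous one is an assumption, since the column header is not part of this excerpt.

```python
def format_ratio(first_ns: float, second_ns: float) -> str:
    """Return first/second formatted like the table's ratio column.

    Assumption: the ratio is first timing / second timing, and exact
    ties are printed as "1" rather than "1.00".
    """
    r = first_ns / second_ns
    return "1" if r == 1 else f"{r:.2f}"


# Rows copied from the table above: (first timing, second timing, listed ratio)
rows = [
    (1042, 958, "1.09"),          # batchnorm(2, gelu, affine=true) forward, 2 threads
    (375, 250, "1.50"),           # batchnorm(2, relu, affine=false) forward, 4 threads
    (375, 375, "1"),              # batchnorm(2, relu, affine=false) forward, 2 threads
    (32500, 51625, "0.63"),       # bias_activation(512, relu) zygote, 2 threads
    (210266.5, 292106.5, "0.72"), # bias_activation(512, relu) zygote, GPU/CUDA
]

for first_ns, second_ns, listed in rows:
    print(first_ns, second_ns, format_ratio(first_ns, second_ns), listed)
```

Under that reading, a ratio below 1 means the first timing is faster than the second. For sub-microsecond benchmarks such as the 375 ns versus 250 ns row, a large ratio may mostly reflect timer granularity rather than a real change.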
Packages Requiring a Tag