ci: combine workflows #1023

Merged
merged 11 commits into from
Nov 5, 2024

avik-pal (Member) commented Nov 4, 2024

Packages Requiring a Tag

  • LuxTestUtils (minor)
  • Lux (patch)
  • LuxLib (patch)
  • MLDataDevices (minor)

avik-pal force-pushed the ap/workflows branch 3 times, most recently from 39d6228 to 2ffa2f7, on November 4, 2024 05:25
github-actions bot (Contributor) commented Nov 4, 2024

Benchmark Results (ASV)

Benchmark | main | 70578af... | main/70578af6fc1276...
basics/overhead | 0.133 ± 0.0013 μs | 0.138 ± 0.00095 μs | 0.967
time_to_load | 1.2 ± 0.007 s | 1.2 ± 0.011 s | 0.998

Benchmark Plots

A plot of the benchmark results has been uploaded as an artifact to the workflow run for this PR.
Go to "Actions"->"Benchmark a pull request"->[the most recent run]->"Artifacts" (at the bottom).
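
As a rough guide to reading the ratio column in the ASV table above, the sketch below recomputes it for the basics/overhead row. It assumes the last column is the main timing divided by the PR timing, as its "main/<commit>" header suggests. The `classify` helper and its 5% tolerance are hypothetical, chosen only for illustration, and are not part of the benchmark tooling.

```julia
# Timings copied from the basics/overhead row (microseconds, as displayed after rounding).
main_time = 0.133
pr_time = 0.138

# Assumption: ratio = main timing / PR timing, per the "main/<commit>" column header.
ratio = main_time / pr_time

# Hypothetical helper for this sketch: label a ratio relative to an arbitrary tolerance.
function classify(ratio; tol=0.05)
    ratio > 1 + tol && return "PR faster"   # main took noticeably longer than the PR
    ratio < 1 - tol && return "PR slower"   # PR took noticeably longer than main
    return "within noise"
end

println(round(ratio; digits=3), " => ", classify(ratio))
```

The recomputed value differs slightly from the 0.967 shown in the table, presumably because the table's ratio comes from unrounded timings.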

github-actions bot (Contributor) left a comment

Remaining comments, which could not be posted as review comments in order to avoid the GitHub rate limit

JuliaFormatter

[JuliaFormatter] reported by reviewdog 🐶

@testset "eltype $T, size $sz, $training $act" for (T, sz, training, act) in TEST_BLOCKS[1]


[JuliaFormatter] reported by reviewdog 🐶

@testitem "Instance Norm: Group 2" tags=[:normalization] setup=[
SharedTestSetup, InstanceNormSetup] begin


[JuliaFormatter] reported by reviewdog 🐶

@testset "eltype $T, size $sz, $training $act" for (T, sz, training, act) in TEST_BLOCKS[2]


[JuliaFormatter] reported by reviewdog 🐶

@testitem "Instance Norm: Group 3" tags=[:normalization] setup=[
SharedTestSetup, InstanceNormSetup] begin


[JuliaFormatter] reported by reviewdog 🐶

@testset "eltype $T, size $sz, $training $act" for (T, sz, training, act) in TEST_BLOCKS[3]


[JuliaFormatter] reported by reviewdog 🐶

@testitem "Instance Norm: Group 4" tags=[:normalization] setup=[
SharedTestSetup, InstanceNormSetup] begin


[JuliaFormatter] reported by reviewdog 🐶

@testset "eltype $T, size $sz, $training $act" for (T, sz, training, act) in TEST_BLOCKS[4]


[JuliaFormatter] reported by reviewdog 🐶

@testitem "Instance Norm: Group 5" tags=[:normalization] setup=[
SharedTestSetup, InstanceNormSetup] begin


[JuliaFormatter] reported by reviewdog 🐶

@testset "eltype $T, size $sz, $training $act" for (T, sz, training, act) in TEST_BLOCKS[5]


[JuliaFormatter] reported by reviewdog 🐶

@testitem "Layer Norm: Group 1" tags=[:normalization] setup=[
SharedTestSetup, LayerNormSetup] begin


[JuliaFormatter] reported by reviewdog 🐶

@testset "eltype $T, size $x_shape, $act" for (T, x_shape, affine_shape, act) in TEST_BLOCKS[1]


[JuliaFormatter] reported by reviewdog 🐶

@testitem "Layer Norm: Group 2" tags=[:normalization] setup=[
SharedTestSetup, LayerNormSetup] begin


[JuliaFormatter] reported by reviewdog 🐶

@testset "eltype $T, size $x_shape, $act" for (T, x_shape, affine_shape, act) in TEST_BLOCKS[2]


[JuliaFormatter] reported by reviewdog 🐶

@testitem "Layer Norm: Group 3" tags=[:normalization] setup=[
SharedTestSetup, LayerNormSetup] begin


[JuliaFormatter] reported by reviewdog 🐶

@testset "eltype $T, size $x_shape, $act" for (T, x_shape, affine_shape, act) in TEST_BLOCKS[3]


[JuliaFormatter] reported by reviewdog 🐶

@testitem "Layer Norm: Group 4" tags=[:normalization] setup=[
SharedTestSetup, LayerNormSetup] begin


[JuliaFormatter] reported by reviewdog 🐶

@testset "eltype $T, size $x_shape, $act" for (T, x_shape, affine_shape, act) in TEST_BLOCKS[4]


[JuliaFormatter] reported by reviewdog 🐶

@testitem "Layer Norm: Group 5" tags=[:normalization] setup=[
SharedTestSetup, LayerNormSetup] begin


[JuliaFormatter] reported by reviewdog 🐶

@testset "eltype $T, size $x_shape, $act" for (T, x_shape, affine_shape, act) in TEST_BLOCKS[5]


[JuliaFormatter] reported by reviewdog 🐶

@testitem "Layer Norm: Error Checks" tags=[:normalization] setup=[SharedTestSetup] begin


[JuliaFormatter] reported by reviewdog 🐶

@testitem "batched_mul" tags=[:misc] setup=[SharedTestSetup, BatchedMMSetup] begin


[JuliaFormatter] reported by reviewdog 🐶

@testitem "batched_mul: trivial dimensions & unit strides" tags=[:misc] setup=[
SharedTestSetup, BatchedMMSetup] begin


[JuliaFormatter] reported by reviewdog 🐶

@testitem "BatchedAdjOrTrans interface" tags=[:misc] setup=[
SharedTestSetup, BatchedMMSetup] begin


[JuliaFormatter] reported by reviewdog 🐶

@testitem "batched_matmul(ndims < 3)" tags=[:misc] setup=[
SharedTestSetup, BatchedMMSetup] begin


[JuliaFormatter] reported by reviewdog 🐶

@testitem "BMM AutoDiff" tags=[:misc] setup=[SharedTestSetup, BatchedMMSetup] begin


[JuliaFormatter] reported by reviewdog 🐶

@test_gradients(fn, aType(randn(rng, Float32, M, P, B)),
aType(randn(rng, Float32, P, Q, B)); atol=1e-3, rtol=1e-3,
skip_backends=[AutoEnzyme()])
@test_gradients(fn, batched_adjoint(aType(randn(rng, Float32, P, M, B))),
aType(randn(rng, Float32, P, Q, B)); atol=1e-3, rtol=1e-3,
skip_backends=[AutoEnzyme()])
@test_gradients(fn, aType(randn(rng, Float32, M, P, B)),
batched_transpose(aType(randn(rng, Float32, Q, P, B))); atol=1e-3,
rtol=1e-3, skip_backends=[AutoEnzyme()])


[JuliaFormatter] reported by reviewdog 🐶

@test_gradients(fn, aType(randn(rng, Float32, M, P)),
aType(randn(rng, Float32, P, Q, B)); atol=1e-3, rtol=1e-3,
skip_backends=[AutoEnzyme()])
@test_gradients(fn, adjoint(aType(randn(rng, Float32, P, M))),
aType(randn(rng, Float32, P, Q, B)); atol=1e-3, rtol=1e-3,
skip_backends=[AutoEnzyme()])
@test_gradients(fn, aType(randn(rng, Float32, M, P)),
batched_adjoint(aType(randn(rng, Float32, Q, P, B))); atol=1e-3, rtol=1e-3,
skip_backends=[AutoEnzyme()])
@test_gradients(fn, aType(randn(rng, Float32, M, P)),
aType(randn(rng, Float32, P, Q, B)); atol=1e-3, rtol=1e-3,
skip_backends=[AutoEnzyme()])
@test_gradients(fn, adjoint(aType(randn(rng, Float32, P, M))),
aType(randn(rng, Float32, P, Q, B)); atol=1e-3, rtol=1e-3,
skip_backends=[AutoEnzyme()])
@test_gradients(fn, aType(randn(rng, Float32, M, P)),
batched_adjoint(aType(randn(rng, Float32, Q, P, B))); atol=1e-3, rtol=1e-3,
skip_backends=[AutoEnzyme()])


[JuliaFormatter] reported by reviewdog 🐶

@test_gradients(fn, aType(randn(rng, Float32, M, P, 1)),
aType(randn(rng, Float32, P, Q, B)); atol=1e-3, rtol=1e-3,
skip_backends=[AutoEnzyme()])
@test_gradients(fn, batched_transpose(aType(randn(rng, Float32, P, M, 1))),
aType(randn(rng, Float32, P, Q, B)); atol=1e-3, rtol=1e-3,
skip_backends=[AutoEnzyme()])
@test_gradients(fn, aType(randn(rng, Float32, M, P, 1)),
batched_transpose(aType(randn(rng, Float32, Q, P, B))); atol=1e-3,
rtol=1e-3, skip_backends=[AutoEnzyme()])


[JuliaFormatter] reported by reviewdog 🐶

@testitem "BMM Tracker AoS" tags=[:misc] setup=[SharedTestSetup, BatchedMMSetup] begin


[JuliaFormatter] reported by reviewdog 🐶

@testitem "Efficient JVPs" tags=[:misc] setup=[SharedTestSetup] begin


[JuliaFormatter] reported by reviewdog 🐶

@testitem "ForwardDiff dropout" tags=[:misc] setup=[SharedTestSetup] begin


[JuliaFormatter] reported by reviewdog 🐶

@testitem "internal_operation_mode: Wrapped Arrays" tags=[:misc] setup=[SharedTestSetup] begin


[JuliaFormatter] reported by reviewdog 🐶

x = rand(Float32, 4, 3) |> aType


[JuliaFormatter] reported by reviewdog 🐶

@testitem "Matmul: StaticArrays" tags=[:misc] setup=[SharedTestSetup] begin


[JuliaFormatter] reported by reviewdog 🐶

@testitem "Aqua: Quality Assurance" tags=[:misc] begin


[JuliaFormatter] reported by reviewdog 🐶

@testitem "Explicit Imports" tags=[:misc] setup=[SharedTestSetup] begin


[JuliaFormatter] reported by reviewdog 🐶

@testitem "Debugging Tools: DimensionMismatch" setup=[SharedTestSetup] tags=[:contrib] begin


[JuliaFormatter] reported by reviewdog 🐶

@testitem "Debugging Tools: NaN" setup=[SharedTestSetup] tags=[:contrib] begin


[JuliaFormatter] reported by reviewdog 🐶

@testitem "All Parameter Freezing" setup=[SharedTestSetup] tags=[:contrib] begin


[JuliaFormatter] reported by reviewdog 🐶

@testitem "Partial Freezing" setup=[SharedTestSetup] tags=[:contrib] begin


[JuliaFormatter] reported by reviewdog 🐶

@testitem "Layer Map" setup=[SharedTestSetup] tags=[:contrib] begin


[JuliaFormatter] reported by reviewdog 🐶

@testitem "Parameter Sharing" setup=[SharedTestSetup] tags=[:contrib] begin


[JuliaFormatter] reported by reviewdog 🐶

@testitem "@compact" setup=[SharedTestSetup] tags=[:helpers] begin


[JuliaFormatter] reported by reviewdog 🐶

@testitem "@compact error checks" setup=[SharedTestSetup] tags=[:helpers] begin


[JuliaFormatter] reported by reviewdog 🐶

@testitem "LuxOps.xlogx & LuxOps.xlogy" setup=[SharedTestSetup] tags=[:helpers] begin


[JuliaFormatter] reported by reviewdog 🐶

@testitem "Regression Loss" setup=[SharedTestSetup] tags=[:helpers] begin


[JuliaFormatter] reported by reviewdog 🐶

@testitem "Classification Loss" setup=[SharedTestSetup] tags=[:helpers] begin


[JuliaFormatter] reported by reviewdog 🐶

@testitem "Other Losses" setup=[SharedTestSetup] tags=[:helpers] begin


[JuliaFormatter] reported by reviewdog 🐶

@testitem "Losses: Error Checks and Misc" setup=[SharedTestSetup] tags=[:helpers] begin


[JuliaFormatter] reported by reviewdog 🐶

@testitem "Size Propagator" setup=[SharedTestSetup] tags=[:helpers] begin


[JuliaFormatter] reported by reviewdog 🐶

@testitem "Size Propagator" setup=[SharedTestSetup] tags=[:helpers] begin


[JuliaFormatter] reported by reviewdog 🐶

@testitem "Simple Stateful Tests" setup=[SharedTestSetup] tags=[:helpers] begin


[JuliaFormatter] reported by reviewdog 🐶

@testitem "TrainState" setup=[SharedTestSetup] tags=[:helpers] begin


[JuliaFormatter] reported by reviewdog 🐶

@testitem "AbstractADTypes" setup=[SharedTestSetup] tags=[:helpers] begin


[JuliaFormatter] reported by reviewdog 🐶

@testitem "Training API" setup=[SharedTestSetup] tags=[:helpers] begin


[JuliaFormatter] reported by reviewdog 🐶

@testitem "Enzyme: Invalidate Cache on State Update" setup=[SharedTestSetup] tags=[:helpers] skip=:(using LuxTestUtils; !LuxTestUtils.ENZYME_TESTING_ENABLED) begin


[JuliaFormatter] reported by reviewdog 🐶

@testitem "Compiled ReverseDiff" setup=[SharedTestSetup] tags=[:helpers] begin


[JuliaFormatter] reported by reviewdog 🐶

x = zeros(Float32, 2, 1) |> aType
y = zeros(Float32, 1, 1) |> aType


[JuliaFormatter] reported by reviewdog 🐶

x = randn(Float32, 2, 1) |> aType


[JuliaFormatter] reported by reviewdog 🐶

ps, st = Lux.setup(rng, layer) |> dev


[JuliaFormatter] reported by reviewdog 🐶

x = randn(Float32, 2, 1) |> aType


[JuliaFormatter] reported by reviewdog 🐶

ps, st = Lux.setup(rng, layer) |> dev


[JuliaFormatter] reported by reviewdog 🐶

@testitem "Embedding" setup=[SharedTestSetup] tags=[:core_layers] begin


[JuliaFormatter] reported by reviewdog 🐶

@testitem "Aqua: Quality Assurance" tags=[:others] begin


[JuliaFormatter] reported by reviewdog 🐶

@testitem "Explicit Imports: Quality Assurance" tags=[:others] begin


[JuliaFormatter] reported by reviewdog 🐶

import Lux, ComponentArrays, ReverseDiff, SimpleChains, Tracker, Zygote, Enzyme


[JuliaFormatter] reported by reviewdog 🐶

@testitem "doctests: Quality Assurance" tags=[:others] begin


[JuliaFormatter] reported by reviewdog 🐶

"core_layers", "contrib", "helpers", "distributed", "normalize_layers",
"others", "autodiff", "recurrent_layers", "fluxcompat"]


[JuliaFormatter] reported by reviewdog 🐶

Lux.jl/test/runtests.jl

Lines 104 to 105 in 2bff3e4

@testset "eltype_mismath_handling: $option" for option in (
"none", "warn", "convert", "error")


[JuliaFormatter] reported by reviewdog 🐶

Int, get(ENV, "RETESTITEMS_NWORKERS", string(min(Hwloc.num_physical_cores(), 4))))


[JuliaFormatter] reported by reviewdog 🐶

Lux.jl/test/runtests.jl

Lines 128 to 129 in 2bff3e4

ReTestItems.runtests(Lux; tags=(tag == "all" ? nothing : [Symbol(tag)]),
nworkers=RETESTITEMS_NWORKERS, testitem_timeout=2400)


[JuliaFormatter] reported by reviewdog 🐶

@testitem "FromFluxAdaptor" setup=[SharedTestSetup] tags=[:fluxcompat] begin
import Flux


[JuliaFormatter] reported by reviewdog 🐶

models = [Flux.Chain(Flux.Dense(2 => 5), Flux.Dense(5 => 1)),
Flux.Chain(; l1=Flux.Dense(2 => 5), l2=Flux.Dense(5 => 1))] .|>
fdev(dev)


[JuliaFormatter] reported by reviewdog 🐶

x = rand(Float32, 2, 1) |> aType


[JuliaFormatter] reported by reviewdog 🐶

ps, st = Lux.setup(StableRNG(12345), model_lux) .|> dev


[JuliaFormatter] reported by reviewdog 🐶

ps, st = Lux.setup(StableRNG(12345), model_lux) .|> dev


[JuliaFormatter] reported by reviewdog 🐶

model = Flux.Maxout(() -> Flux.Dense(2 => 5), 4) |> fdev(dev)
x = rand(Float32, 2, 1) |> aType


[JuliaFormatter] reported by reviewdog 🐶

ps, st = Lux.setup(StableRNG(12345), model_lux) .|> dev


[JuliaFormatter] reported by reviewdog 🐶

ps, st = Lux.setup(StableRNG(12345), model_lux) .|> dev


[JuliaFormatter] reported by reviewdog 🐶

model = Flux.SkipConnection(Flux.Dense(2 => 2), +) |> fdev(dev)
x = rand(Float32, 2, 1) |> aType


[JuliaFormatter] reported by reviewdog 🐶

ps, st = Lux.setup(StableRNG(12345), model_lux) .|> dev


[JuliaFormatter] reported by reviewdog 🐶

ps, st = Lux.setup(StableRNG(12345), model_lux) .|> dev


[JuliaFormatter] reported by reviewdog 🐶

models = [Flux.Parallel(+, Flux.Dense(2 => 2), Flux.Dense(2 => 2)),
Flux.Parallel(+; l1=Flux.Dense(2 => 2), l2=Flux.Dense(2 => 2))] .|>
fdev(dev)


[JuliaFormatter] reported by reviewdog 🐶

x = rand(Float32, 2, 1) |> aType


[JuliaFormatter] reported by reviewdog 🐶

ps, st = Lux.setup(StableRNG(12345), model_lux) .|> dev


[JuliaFormatter] reported by reviewdog 🐶

ps, st = Lux.setup(StableRNG(12345), model_lux) .|> dev


[JuliaFormatter] reported by reviewdog 🐶

model = Flux.PairwiseFusion(+, Flux.Dense(2 => 2), Flux.Dense(2 => 2)) |>
fdev(dev)
x = (rand(Float32, 2, 1), rand(Float32, 2, 1)) .|> aType


[JuliaFormatter] reported by reviewdog 🐶

ps, st = Lux.setup(StableRNG(12345), model_lux) .|> dev


[JuliaFormatter] reported by reviewdog 🐶

ps, st = Lux.setup(StableRNG(12345), model_lux) .|> dev


[JuliaFormatter] reported by reviewdog 🐶

for model in [Flux.Dense(2 => 4) |> fdev(dev),
Flux.Dense(2 => 4; bias=false) |> fdev(dev)]
x = randn(Float32, 2, 4) |> aType


[JuliaFormatter] reported by reviewdog 🐶

ps, st = Lux.setup(StableRNG(12345), model_lux) .|> dev


[JuliaFormatter] reported by reviewdog 🐶

ps, st = Lux.setup(StableRNG(12345), model_lux) .|> dev


[JuliaFormatter] reported by reviewdog 🐶

for model in [
Flux.Scale(2) |> fdev(dev), Flux.Scale(2; bias=false) |> fdev(dev)]
x = randn(Float32, 2, 4) |> aType


[JuliaFormatter] reported by reviewdog 🐶

ps, st = Lux.setup(StableRNG(12345), model_lux) .|> dev


[JuliaFormatter] reported by reviewdog 🐶

ps, st = Lux.setup(StableRNG(12345), model_lux) .|> dev


[JuliaFormatter] reported by reviewdog 🐶

for model in [Flux.Bilinear((2, 3) => 5) |> fdev(dev),
Flux.Bilinear((2, 3) => 5; bias=false) |> fdev(dev)]
x = randn(Float32, 2, 4) |> aType
y = randn(Float32, 3, 4) |> aType


[JuliaFormatter] reported by reviewdog 🐶

ps, st = Lux.setup(StableRNG(12345), model_lux) .|> dev


[JuliaFormatter] reported by reviewdog 🐶

ps, st = Lux.setup(StableRNG(12345), model_lux) .|> dev


[JuliaFormatter] reported by reviewdog 🐶

model = Flux.Embedding(16 => 4) |> fdev(dev)
x = rand(1:16, 2, 4) |> aType


[JuliaFormatter] reported by reviewdog 🐶

ps, st = Lux.setup(StableRNG(12345), model_lux) .|> dev


[JuliaFormatter] reported by reviewdog 🐶

ps, st = Lux.setup(StableRNG(12345), model_lux) .|> dev


[JuliaFormatter] reported by reviewdog 🐶

model = Flux.Conv((3, 3), 1 => 2) |> fdev(dev)
x = rand(Float32, 6, 6, 1, 4) |> aType


[JuliaFormatter] reported by reviewdog 🐶

ps, st = Lux.setup(StableRNG(12345), model_lux) .|> dev


[JuliaFormatter] reported by reviewdog 🐶

model = Flux.Conv((3, 3), 1 => 2; pad=Flux.SamePad()) |> fdev(dev)
x = rand(Float32, 6, 6, 1, 4) |> aType


[JuliaFormatter] reported by reviewdog 🐶

ps, st = Lux.setup(StableRNG(12345), model_lux) .|> dev


[JuliaFormatter] reported by reviewdog 🐶

ps, st = Lux.setup(StableRNG(12345), model_lux) .|> dev


[JuliaFormatter] reported by reviewdog 🐶

model = Flux.CrossCor((3, 3), 1 => 2) |> fdev(dev)
x = rand(Float32, 6, 6, 1, 4) |> aType


[JuliaFormatter] reported by reviewdog 🐶

ps, st = Lux.setup(StableRNG(12345), model_lux) .|> dev


[JuliaFormatter] reported by reviewdog 🐶

model = Flux.CrossCor((3, 3), 1 => 2; pad=Flux.SamePad()) |> fdev(dev)
x = rand(Float32, 6, 6, 1, 4) |> aType


[JuliaFormatter] reported by reviewdog 🐶

ps, st = Lux.setup(StableRNG(12345), model_lux) .|> dev


[JuliaFormatter] reported by reviewdog 🐶

ps, st = Lux.setup(StableRNG(12345), model_lux) .|> dev


[JuliaFormatter] reported by reviewdog 🐶

model = Flux.ConvTranspose((3, 3), 1 => 2) |> fdev(dev)
x = rand(Float32, 6, 6, 1, 4) |> aType


[JuliaFormatter] reported by reviewdog 🐶

ps, st = Lux.setup(StableRNG(12345), model_lux) .|> dev


[JuliaFormatter] reported by reviewdog 🐶

model = Flux.ConvTranspose((3, 3), 1 => 2; pad=Flux.SamePad()) |> fdev(dev)
x = rand(Float32, 6, 6, 1, 4) |> aType


[JuliaFormatter] reported by reviewdog 🐶

ps, st = Lux.setup(StableRNG(12345), model_lux) .|> dev


[JuliaFormatter] reported by reviewdog 🐶

ps, st = Lux.setup(StableRNG(12345), model_lux) .|> dev


[JuliaFormatter] reported by reviewdog 🐶

model = Flux.AdaptiveMaxPool((2, 2)) |> fdev(dev)
x = rand(Float32, 6, 6, 1, 4) |> aType


[JuliaFormatter] reported by reviewdog 🐶

ps, st = Lux.setup(StableRNG(12345), model_lux) .|> dev


[JuliaFormatter] reported by reviewdog 🐶

model = Flux.AdaptiveMeanPool((2, 2)) |> fdev(dev)
x = rand(Float32, 6, 6, 1, 4) |> aType


[JuliaFormatter] reported by reviewdog 🐶

ps, st = Lux.setup(StableRNG(12345), model_lux) .|> dev


[JuliaFormatter] reported by reviewdog 🐶

model = Flux.MaxPool((2, 2)) |> fdev(dev)
x = rand(Float32, 6, 6, 1, 4) |> aType


[JuliaFormatter] reported by reviewdog 🐶

ps, st = Lux.setup(StableRNG(12345), model_lux) .|> dev


[JuliaFormatter] reported by reviewdog 🐶

model = Flux.MeanPool((2, 2)) |> fdev(dev)
x = rand(Float32, 6, 6, 1, 4) |> aType


[JuliaFormatter] reported by reviewdog 🐶

ps, st = Lux.setup(StableRNG(12345), model_lux) .|> dev


[JuliaFormatter] reported by reviewdog 🐶

model = Flux.GlobalMaxPool() |> fdev(dev)
x = rand(Float32, 6, 6, 1, 4) |> aType


[JuliaFormatter] reported by reviewdog 🐶

ps, st = Lux.setup(StableRNG(12345), model_lux) .|> dev


[JuliaFormatter] reported by reviewdog 🐶

model = Flux.GlobalMeanPool() |> fdev(dev)
x = rand(Float32, 6, 6, 1, 4) |> aType


[JuliaFormatter] reported by reviewdog 🐶

ps, st = Lux.setup(StableRNG(12345), model_lux) .|> dev


[JuliaFormatter] reported by reviewdog 🐶

model = Flux.Upsample(5) |> fdev(dev)
x = rand(Float32, 2, 2, 2, 1) |> aType


[JuliaFormatter] reported by reviewdog 🐶

ps, st = Lux.setup(StableRNG(12345), model_lux) .|> dev


[JuliaFormatter] reported by reviewdog 🐶

model = Flux.PixelShuffle(2) |> fdev(dev)
x = randn(Float32, 2, 2, 4, 1) |> aType


[JuliaFormatter] reported by reviewdog 🐶

ps, st = Lux.setup(StableRNG(12345), model_lux) .|> dev


[JuliaFormatter] reported by reviewdog 🐶

model = Flux.RNNCell(2 => 3) |> fdev(dev)


[JuliaFormatter] reported by reviewdog 🐶

model = Flux.LSTMCell(2 => 3) |> fdev(dev)


[JuliaFormatter] reported by reviewdog 🐶

model = Flux.GRUCell(2 => 3) |> fdev(dev)


[JuliaFormatter] reported by reviewdog 🐶

model = Flux.BatchNorm(2) |> fdev(dev)
x = randn(Float32, 2, 4) |> aType


[JuliaFormatter] reported by reviewdog 🐶

x = randn(Float32, 2, 2, 2, 1) |> aType


[JuliaFormatter] reported by reviewdog 🐶

ps, st = Lux.setup(StableRNG(12345), model_lux) .|> dev


[JuliaFormatter] reported by reviewdog 🐶

ps, st = Lux.setup(StableRNG(12345), model_lux) .|> dev


[JuliaFormatter] reported by reviewdog 🐶

model = Flux.GroupNorm(4, 2) |> fdev(dev)
x = randn(Float32, 2, 2, 4, 1) |> aType


[JuliaFormatter] reported by reviewdog 🐶

ps, st = Lux.setup(StableRNG(12345), model_lux) .|> dev


[JuliaFormatter] reported by reviewdog 🐶

ps, st = Lux.setup(StableRNG(12345), model_lux) .|> dev


[JuliaFormatter] reported by reviewdog 🐶

ps, st = Lux.setup(StableRNG(12345), model_lux) .|> dev


[JuliaFormatter] reported by reviewdog 🐶

model = Flux.LayerNorm(4) |> fdev(dev)
x = randn(Float32, 4, 4, 4, 1) |> aType


[JuliaFormatter] reported by reviewdog 🐶

ps, st = Lux.setup(StableRNG(12345), model_lux) .|> dev


[JuliaFormatter] reported by reviewdog 🐶

model = Flux.InstanceNorm(4) |> fdev(dev)
x = randn(Float32, 4, 4, 4, 1) |> aType


[JuliaFormatter] reported by reviewdog 🐶

x = randn(Float32, 2, 4) |> aType
ps, st = Lux.setup(StableRNG(12345), model) .|> dev


[JuliaFormatter] reported by reviewdog 🐶

x = randn(Float32, 2, 4) |> aType
ps, st = Lux.setup(StableRNG(12345), model) .|> dev


[JuliaFormatter] reported by reviewdog 🐶

c = CustomFluxLayer(randn(10), randn(10)) |> fdev(dev)
x = randn(10) |> aType


[JuliaFormatter] reported by reviewdog 🐶

@testitem "ToSimpleChainsAdaptor" setup=[SharedTestSetup] tags=[:others] begin


[JuliaFormatter] reported by reviewdog 🐶

@testitem "replicate" setup=[SharedTestSetup] tags=[:others] begin


[JuliaFormatter] reported by reviewdog 🐶

@testitem "istraining" tags=[:others] begin


[JuliaFormatter] reported by reviewdog 🐶

@testitem "ComponentArrays edge cases" tags=[:others] begin


[JuliaFormatter] reported by reviewdog 🐶

@testitem "multigate" setup=[SharedTestSetup] tags=[:others] begin


[JuliaFormatter] reported by reviewdog 🐶

@testitem "ComponentArrays" setup=[SharedTestSetup] tags=[:others] begin


[JuliaFormatter] reported by reviewdog 🐶

@testitem "FP Conversions" setup=[SharedTestSetup] tags=[:others] begin


[JuliaFormatter] reported by reviewdog 🐶

@testitem "Edge Cases" tags=[:others] begin


[JuliaFormatter] reported by reviewdog 🐶

@testitem "Recursive Utils" tags=[:others] begin


[JuliaFormatter] reported by reviewdog 🐶

struct functorABC{A, B}


[JuliaFormatter] reported by reviewdog 🐶

@testitem "Functors Compatibility" setup=[SharedTestSetup] tags=[:others] begin

github-actions bot (Contributor) left a comment

Lux Benchmarks

Benchmark suite Current: 24c12cc Previous: 8bfa628 Ratio
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 4333.5 ns 4625 ns 0.94
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 4583 ns 4084 ns 1.12
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 5375 ns 5791 ns 0.93
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 4333 ns 4292 ns 1.01
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA 61610 ns 60959 ns 1.01
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 10167 ns 10125 ns 1.00
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 10375 ns 9959 ns 1.04
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 11209 ns 10375 ns 1.08
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 9833 ns 10666 ns 0.92
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA 435760 ns 427044 ns 1.02
bias_activation(32, act=relu)(32 x 128)/forward/CPU/2 thread(s) 1083 ns 1167 ns 0.93
bias_activation(32, act=relu)(32 x 128)/forward/CPU/4 thread(s) 1292 ns 1250 ns 1.03
bias_activation(32, act=relu)(32 x 128)/forward/CPU/8 thread(s) 1375 ns 1458 ns 0.94
bias_activation(32, act=relu)(32 x 128)/forward/CPU/1 thread(s) 1042 ns 3542 ns 0.29
bias_activation(32, act=relu)(32 x 128)/forward/GPU/CUDA 18442 ns 18260 ns 1.01
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/2 thread(s) 3979.5 ns 4125 ns 0.96
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/4 thread(s) 4167 ns 3833 ns 1.09
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/8 thread(s) 4250 ns 4125 ns 1.03
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/1 thread(s) 4000 ns 4000 ns 1
bias_activation(32, act=relu)(32 x 128)/zygote/GPU/CUDA 112645.5 ns 111381 ns 1.01
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 57667 ns 57709 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 37792 ns 47250 ns 0.80
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 46750 ns 38250 ns 1.22
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 81395.5 ns 80333 ns 1.01
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 38084 ns 37655 ns 1.01
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2037562.5 ns 2026167 ns 1.01
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2104125 ns 2092708.5 ns 1.01
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2099166 ns 2059625.5 ns 1.02
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1984916.5 ns 1993416 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 198932 ns 197377 ns 1.01
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 144291 ns 152958 ns 0.94
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 144208 ns 148250 ns 0.97
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 148083 ns 146417 ns 1.01
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 144166.5 ns 150375 ns 0.96
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 165721.5 ns 167595 ns 0.99
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1121459 ns 1098542 ns 1.02
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1157396 ns 1124250 ns 1.03
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1131250 ns 1116146 ns 1.01
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1116458 ns 1107229.5 ns 1.01
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 530286 ns 523151 ns 1.01
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 3167 ns 3584 ns 0.88
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 3625 ns 3625 ns 1
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 4708 ns 5708.5 ns 0.82
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 3208.5 ns 3417 ns 0.94
layernorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA 70966.5 ns 70157 ns 1.01
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 8709 ns 8834 ns 0.99
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 9292 ns 8667 ns 1.07
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 9959 ns 9291 ns 1.07
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 8833 ns 9042 ns 0.98
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA 486762.5 ns 492826.5 ns 0.99
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 15458 ns 17000 ns 0.91
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 16042 ns 16375 ns 0.98
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 18291 ns 18667 ns 0.98
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 14792 ns 17083 ns 0.87
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 55364.5 ns 54850 ns 1.01
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 216833 ns 213146 ns 1.02
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 226250 ns 216104 ns 1.05
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 214583 ns 214167 ns 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 223812 ns 225333 ns 0.99
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 277851.5 ns 272672.5 ns 1.02
bias_activation(2, act=relu)(2 x 128)/forward/CPU/2 thread(s) 667 ns 459 ns 1.45
bias_activation(2, act=relu)(2 x 128)/forward/CPU/4 thread(s) 583 ns 542 ns 1.08
bias_activation(2, act=relu)(2 x 128)/forward/CPU/8 thread(s) 667 ns 709 ns 0.94
bias_activation(2, act=relu)(2 x 128)/forward/CPU/1 thread(s) 417 ns 583 ns 0.72
bias_activation(2, act=relu)(2 x 128)/forward/GPU/CUDA 18112 ns 17542 ns 1.03
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/2 thread(s) 1375 ns 1708 ns 0.81
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/4 thread(s) 1625 ns 1458 ns 1.11
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/8 thread(s) 1958 ns 1625 ns 1.20
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/1 thread(s) 1625 ns 1750 ns 0.93
bias_activation(2, act=relu)(2 x 128)/zygote/GPU/CUDA 105356 ns 104205 ns 1.01
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 7167 ns 7250 ns 0.99
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 5167 ns 5833 ns 0.89
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 5958 ns 5209 ns 1.14
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 9833 ns 4000 ns 2.46
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 24464.5 ns 23961 ns 1.02
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 222292 ns 228750.5 ns 0.97
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 236208 ns 228333 ns 1.03
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 229416.5 ns 228500 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 259229 ns 226334 ns 1.15
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 172717 ns 170956 ns 1.01
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/2 thread(s) 3792 ns 3875 ns 0.98
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/4 thread(s) 3917 ns 3875 ns 1.01
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/8 thread(s) 3875 ns 3916 ns 0.99
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/1 thread(s) 3875 ns 3834 ns 1.01
dense(32, bias=false, act=relu)(32 x 128)/forward/GPU/CUDA 24053 ns 23832 ns 1.01
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/2 thread(s) 16625 ns 16833 ns 0.99
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/4 thread(s) 16542 ns 16708 ns 0.99
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/8 thread(s) 19959 ns 16708 ns 1.19
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/1 thread(s) 16625 ns 16958 ns 0.98
dense(32, bias=false, act=relu)(32 x 128)/zygote/GPU/CUDA 164960.5 ns 165501.5 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/2 thread(s) 578875 ns 579042 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/4 thread(s) 577459 ns 574375 ns 1.01
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/8 thread(s) 575875 ns 575083 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/1 thread(s) 576458 ns 576292 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/forward/GPU/CUDA 113991.5 ns 113664 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) 1424917 ns 1417708 ns 1.01
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) 1426834 ns 1429333 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) 1424104.5 ns 1425729.5 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) 1419916 ns 1422208 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/zygote/GPU/CUDA 214497 ns 214791 ns 1.00
lenet(28, 28, 1, 64)/forward/CPU/2 thread(s) 1079667 ns 1082104 ns 1.00
lenet(28, 28, 1, 64)/forward/CPU/4 thread(s) 952541 ns 959958.5 ns 0.99
lenet(28, 28, 1, 64)/forward/CPU/8 thread(s) 1344209 ns 1341792 ns 1.00
lenet(28, 28, 1, 64)/forward/CPU/1 thread(s) 1294500 ns 1294792 ns 1.00
lenet(28, 28, 1, 64)/forward/GPU/CUDA 277520.5 ns 281583.5 ns 0.99
lenet(28, 28, 1, 64)/zygote/CPU/2 thread(s) 5796125 ns 5777875 ns 1.00
lenet(28, 28, 1, 64)/zygote/CPU/4 thread(s) 4601750 ns 4456083 ns 1.03
lenet(28, 28, 1, 64)/zygote/CPU/8 thread(s) 4812062.5 ns 4934792 ns 0.98
lenet(28, 28, 1, 64)/zygote/CPU/1 thread(s) 5685666.5 ns 5627500 ns 1.01
lenet(28, 28, 1, 64)/zygote/GPU/CUDA 1102298 ns 1106964 ns 1.00
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/2 thread(s) 542 ns 500 ns 1.08
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/4 thread(s) 542 ns 500 ns 1.08
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/8 thread(s) 542 ns 542 ns 1
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/1 thread(s) 542 ns 542 ns 1
dense(2, bias=true, act=relu)(2 x 128)/forward/GPU/CUDA 23934 ns 23988 ns 1.00
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/2 thread(s) 2166 ns 2084 ns 1.04
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/4 thread(s) 2167 ns 2083 ns 1.04
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/8 thread(s) 2167 ns 2125 ns 1.02
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/1 thread(s) 2125 ns 2125 ns 1
dense(2, bias=true, act=relu)(2 x 128)/zygote/GPU/CUDA 174719 ns 179026 ns 0.98
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 4375 ns 6084 ns 0.72
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 5208 ns 6167 ns 0.84
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 5979 ns 7041 ns 0.85
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 4208 ns 6375 ns 0.66
layernorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA 65897 ns 66163.5 ns 1.00
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 11333 ns 11291 ns 1.00
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 11562.5 ns 10791 ns 1.07
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 12000 ns 12125 ns 0.99
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 11458 ns 11354.5 ns 1.01
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA 456442.5 ns 456626.5 ns 1.00
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 6875 ns 7000 ns 0.98
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 7084 ns 7042 ns 1.01
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 7437.5 ns 8375 ns 0.89
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 6667 ns 7042 ns 0.95
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA 53263 ns 52652 ns 1.01
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 18479 ns 17375 ns 1.06
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 17917 ns 17167 ns 1.04
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 18708.5 ns 17770.5 ns 1.05
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 18000 ns 18708 ns 0.96
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA 306492 ns 306093.5 ns 1.00
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) 583 ns 459 ns 1.27
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) 583 ns 459 ns 1.27
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) 542 ns 583 ns 0.93
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) 542 ns 542 ns 1
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA 33354.5 ns 33004 ns 1.01
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 8958 ns 8583 ns 1.04
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 9250 ns 8208 ns 1.13
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 9125 ns 9583 ns 0.95
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 9125 ns 9042 ns 1.01
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA 161864 ns 162492.5 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/2 thread(s) 64667 ns 64542 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/4 thread(s) 64958 ns 64417 ns 1.01
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/8 thread(s) 64583 ns 64625 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/1 thread(s) 64708 ns 64750 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/forward/GPU/CUDA 112223.5 ns 112347.5 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/2 thread(s) 284958 ns 277542 ns 1.03
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/4 thread(s) 293500 ns 281625 ns 1.04
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/8 thread(s) 287084 ns 288750 ns 0.99
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/1 thread(s) 278979 ns 275500 ns 1.01
dense(512, bias=false, act=identity)(512 x 128)/zygote/GPU/CUDA 188579 ns 189809 ns 0.99
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/2 thread(s) 3384709 ns 3285583 ns 1.03
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/4 thread(s) 2770562.5 ns 3022333.5 ns 0.92
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/8 thread(s) 3022292 ns 2780375 ns 1.09
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/1 thread(s) 4051959 ns 4038625 ns 1.00
mlp7layer_bn(gelu)(32 x 256)/forward/GPU/CUDA 577085.5 ns 573967 ns 1.01
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/2 thread(s) 7638708 ns 7586208.5 ns 1.01
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/4 thread(s) 7382937 ns 7415437 ns 1.00
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/8 thread(s) 7276667 ns 7333375 ns 0.99
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/1 thread(s) 8178104.5 ns 8220958 ns 0.99
mlp7layer_bn(gelu)(32 x 256)/zygote/GPU/CUDA 1348594 ns 1351752.5 ns 1.00
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/2 thread(s) 18820250 ns 18835167 ns 1.00
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/4 thread(s) 19140583 ns 19044834 ns 1.01
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/8 thread(s) 19170875 ns 19135125 ns 1.00
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/1 thread(s) 15780292 ns 15633417 ns 1.01
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s) 23651833.5 ns 23661916.5 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s) 42906541 ns 33965500 ns 1.26
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s) 37105250 ns 41107417 ns 0.90
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s) 34831333 ns 34858709 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/GPU/CUDA 1856373 ns 1862815 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s) 188407625 ns 189289541 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s) 178681958.5 ns 164224708 ns 1.09
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s) 152434916 ns 157847979 ns 0.97
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s) 441085041 ns 438904833 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/GPU/CUDA 13924696 ns 13913764 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s) 291040208.5 ns 289733584 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s) 281280395.5 ns 338173667 ns 0.83
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s) 298676250 ns 307489541.5 ns 0.97
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s) 394717750.5 ns 393585937.5 ns 1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 23083.5 ns 21708.5 ns 1.06
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 24541 ns 24458 ns 1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 25333 ns 25937 ns 0.98
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 22125 ns 24229 ns 0.91
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 95853 ns 96907 ns 0.99
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 114728.5 ns 103750 ns 1.11
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 105333.5 ns 105292 ns 1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 104875 ns 104208 ns 1.01
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 103083 ns 151250 ns 0.68
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 499564 ns 504189 ns 0.99
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 6041 ns 6583 ns 0.92
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 6375 ns 7292 ns 0.87
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 7542 ns 7959 ns 0.95
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 5875 ns 6958 ns 0.84
layernorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA 68212 ns 68581 ns 0.99
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 15000 ns 14916.5 ns 1.01
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 15792 ns 14709 ns 1.07
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 16250 ns 16666 ns 0.98
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 14708 ns 14292 ns 1.03
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA 477913.5 ns 483895 ns 0.99
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s) 3005417 ns 3017937 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s) 2088562.5 ns 2022458 ns 1.03
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s) 2282792 ns 2307959 ns 0.99
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s) 4876979.5 ns 4846645.5 ns 1.01
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/GPU/CUDA 585444 ns 585796 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s) 23510520.5 ns 23617917 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s) 18309333 ns 17975417 ns 1.02
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s) 16948500 ns 18323812.5 ns 0.92
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s) 35051750 ns 35597209 ns 0.98
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/GPU/CUDA 3105194 ns 3109235 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s) 33290292 ns 33405687.5 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s) 27968792 ns 27693604 ns 1.01
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s) 27595000 ns 27860958 ns 0.99
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s) 41906104 ns 42002937.5 ns 1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 72479.5 ns 72375 ns 1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 73729 ns 84624.5 ns 0.87
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 76375 ns 83250 ns 0.92
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 74916 ns 73750 ns 1.02
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 100581 ns 102852 ns 0.98
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 298500 ns 218167 ns 1.37
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 216916.5 ns 309979 ns 0.70
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 209292 ns 317479 ns 0.66
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 220250 ns 288875 ns 0.76
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 540007.5 ns 550996 ns 0.98
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 11541 ns 12041 ns 0.96
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 12167 ns 12729.5 ns 0.96
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 12875 ns 13833 ns 0.93
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 11625 ns 11666.5 ns 1.00
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA 70274.5 ns 71604 ns 0.98
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 26708 ns 26625 ns 1.00
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 27334 ns 26959 ns 1.01
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 27542 ns 28292 ns 0.97
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 26792 ns 26458 ns 1.01
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA 471677.5 ns 484486.5 ns 0.97
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 12833 ns 12417 ns 1.03
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 12937.5 ns 12542 ns 1.03
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 13645.5 ns 14584 ns 0.94
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 12333.5 ns 13041.5 ns 0.95
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA 52633 ns 53694 ns 0.98
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 26459 ns 26312.5 ns 1.01
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 26042 ns 26270.5 ns 0.99
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 26583 ns 26667 ns 1.00
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 26458 ns 26333 ns 1.00
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA 301140.5 ns 309291.5 ns 0.97
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 179792 ns 178770.5 ns 1.01
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 182250 ns 182334 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 182813 ns 184895.5 ns 0.99
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 180541 ns 179750 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 55883 ns 57908 ns 0.97
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 596042 ns 587125 ns 1.02
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 583312.5 ns 596500 ns 0.98
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 583584 ns 593770.5 ns 0.98
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 593083.5 ns 583166 ns 1.02
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 283344.5 ns 290369.5 ns 0.98
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) 6541 ns 7354.5 ns 0.89
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) 6667 ns 7167 ns 0.93
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) 7375 ns 7875 ns 0.94
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) 6208 ns 6833 ns 0.91
layernorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA 69352 ns 70829 ns 0.98
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 14458 ns 14375 ns 1.01
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 15208 ns 14708 ns 1.03
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 15125 ns 15625 ns 0.97
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 14479 ns 14083 ns 1.03
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA 459571.5 ns 471312.5 ns 0.98
batchedmm(512, Bsize=4)/forward/CPU/2 thread(s) 1217208 ns 1235042 ns 0.99
batchedmm(512, Bsize=4)/forward/CPU/4 thread(s) 1276667 ns 1283583 ns 0.99
batchedmm(512, Bsize=4)/forward/CPU/8 thread(s) 1266000 ns 1282875 ns 0.99
batchedmm(512, Bsize=4)/forward/CPU/1 thread(s) 1317770.5 ns 1325208 ns 0.99
batchedmm(512, Bsize=4)/forward/GPU/CUDA 300482.5 ns 301270 ns 1.00
batchedmm(512, Bsize=4)/zygote/CPU/2 thread(s) 4115000 ns 4111125 ns 1.00
batchedmm(512, Bsize=4)/zygote/CPU/4 thread(s) 4522624.5 ns 4361625 ns 1.04
batchedmm(512, Bsize=4)/zygote/CPU/8 thread(s) 4588833 ns 4786395.5 ns 0.96
batchedmm(512, Bsize=4)/zygote/CPU/1 thread(s) 4444459 ns 4453229.5 ns 1.00
batchedmm(512, Bsize=4)/zygote/GPU/CUDA 1036466 ns 1047552 ns 0.99
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/2 thread(s) 1833 ns 1792 ns 1.02
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/4 thread(s) 1875 ns 1750 ns 1.07
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/8 thread(s) 1792 ns 1834 ns 0.98
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/1 thread(s) 1833 ns 1834 ns 1.00
dense(2, bias=true, act=gelu)(2 x 128)/forward/GPU/CUDA 23505.5 ns 23328 ns 1.01
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) 4875 ns 4833 ns 1.01
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) 4917 ns 4792 ns 1.03
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) 4875 ns 4917 ns 0.99
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) 4875 ns 4917 ns 0.99
dense(2, bias=true, act=gelu)(2 x 128)/zygote/GPU/CUDA 188144 ns 186698 ns 1.01
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 6020.5 ns 7208.5 ns 0.84
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 6209 ns 5584 ns 1.11
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 7375 ns 8667 ns 0.85
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 6208.5 ns 7312.5 ns 0.85
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA 55343.5 ns 54539 ns 1.01
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 12833 ns 10833 ns 1.18
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 11792 ns 10834 ns 1.09
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 11584 ns 12375 ns 0.94
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 11958 ns 11916 ns 1.00
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA 330451.5 ns 329099 ns 1.00
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/2 thread(s) 333 ns 292 ns 1.14
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/4 thread(s) 334 ns 292 ns 1.14
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/8 thread(s) 292 ns 334 ns 0.87
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/1 thread(s) 292 ns 333 ns 0.88
dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/CUDA 22931 ns 22753 ns 1.01
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/2 thread(s) 2750 ns 2708 ns 1.02
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/4 thread(s) 2750 ns 2667 ns 1.03
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/8 thread(s) 3000 ns 2959 ns 1.01
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/1 thread(s) 2750 ns 3000 ns 0.92
dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/CUDA 158264.5 ns 157496 ns 1.00
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 11854.5 ns 13167 ns 0.90
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 12083 ns 13166 ns 0.92
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 14500 ns 15000 ns 0.97
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 12000 ns 13792 ns 0.87
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA 56327.5 ns 55218 ns 1.02
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 24833 ns 24833 ns 1
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 24709 ns 24542 ns 1.01
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 25333 ns 25375 ns 1.00
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 25166 ns 24709 ns 1.02
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA 295802 ns 289966 ns 1.02
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/2 thread(s) 4125 ns 4083 ns 1.01
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/4 thread(s) 4208 ns 4166 ns 1.01
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/8 thread(s) 4167 ns 4167 ns 1
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/1 thread(s) 4167 ns 4125 ns 1.01
dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/CUDA 24484 ns 24660 ns 0.99
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/2 thread(s) 16125 ns 15958 ns 1.01
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/4 thread(s) 16041 ns 16417 ns 0.98
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/8 thread(s) 16250 ns 16042 ns 1.01
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/1 thread(s) 16083 ns 16125 ns 1.00
dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/CUDA 197179 ns 194045.5 ns 1.02
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 5750 ns 5667 ns 1.01
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 5792 ns 5625 ns 1.03
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 5750 ns 5750 ns 1
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 5750 ns 5791 ns 0.99
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA 33583 ns 32989 ns 1.02
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 21000 ns 21125 ns 0.99
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 21292 ns 20459 ns 1.04
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 21042 ns 21542 ns 0.98
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 20833.5 ns 20875 ns 1.00
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA 175724 ns 174273 ns 1.01
batchedmm(16, Bsize=512)/forward/CPU/2 thread(s) 404500 ns 403209 ns 1.00
batchedmm(16, Bsize=512)/forward/CPU/4 thread(s) 366729 ns 371125 ns 0.99
batchedmm(16, Bsize=512)/forward/CPU/8 thread(s) 491792 ns 474292 ns 1.04
batchedmm(16, Bsize=512)/forward/CPU/1 thread(s) 527187.5 ns 539604.5 ns 0.98
batchedmm(16, Bsize=512)/forward/GPU/CUDA 66440.5 ns 66734 ns 1.00
batchedmm(16, Bsize=512)/zygote/CPU/2 thread(s) 962125 ns 1011917 ns 0.95
batchedmm(16, Bsize=512)/zygote/CPU/4 thread(s) 879458 ns 884896 ns 0.99
batchedmm(16, Bsize=512)/zygote/CPU/8 thread(s) 1231479.5 ns 1220125 ns 1.01
batchedmm(16, Bsize=512)/zygote/CPU/1 thread(s) 1416458.5 ns 1400208 ns 1.01
batchedmm(16, Bsize=512)/zygote/GPU/CUDA 191580 ns 190566.5 ns 1.01
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 82062 ns 82917 ns 0.99
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 82666 ns 82791 ns 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 82937.5 ns 88958.5 ns 0.93
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 83292 ns 83187.5 ns 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 192849 ns 192556.5 ns 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1915978.5 ns 1921500 ns 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1938834 ns 1696166 ns 1.14
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1710833 ns 1938083 ns 0.88
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1917416.5 ns 1915875 ns 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 404784 ns 393732 ns 1.03
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/2 thread(s) 292 ns 291 ns 1.00
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/4 thread(s) 292 ns 291 ns 1.00
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/8 thread(s) 292 ns 292 ns 1
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/1 thread(s) 292 ns 292 ns 1
dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/CUDA 21961 ns 21580 ns 1.02
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/2 thread(s) 1792 ns 1792 ns 1
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/4 thread(s) 1834 ns 1792 ns 1.02
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/8 thread(s) 1833 ns 1833 ns 1
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/1 thread(s) 1833 ns 1833 ns 1
dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/CUDA 173279.5 ns 165924 ns 1.04
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 7125 ns 6708 ns 1.06
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 7000 ns 6250 ns 1.12
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 8687.5 ns 9750 ns 0.89
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 6667 ns 8125 ns 0.82
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA 61915 ns 56950.5 ns 1.09
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 9542 ns 8916.5 ns 1.07
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 9167 ns 8958 ns 1.02
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 9667 ns 9625 ns 1.00
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 9459 ns 9542 ns 0.99
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA 319219.5 ns 299584.5 ns 1.07
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) 120493375 ns 120035854.5 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) 182164250 ns 174382959 ns 1.04
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) 148124041.5 ns 154831333 ns 0.96
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) 106683562 ns 103109500 ns 1.03
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/GPU/CUDA 5478101 ns 5474606 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) 619037042 ns 617124000 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) 581117583 ns 555612167 ns 1.05
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) 452503854 ns 468382792 ns 0.97
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) 759531500.5 ns 756087750 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/GPU/CUDA 38238069 ns 38213656 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) 650158250 ns 651747459 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) 689625562.5 ns 666674583.5 ns 1.03
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) 584746895.5 ns 602170708.5 ns 0.97
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) 746064791 ns 734251875 ns 1.02
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 59917 ns 57208 ns 1.05
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 38667 ns 48167 ns 0.80
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 47958 ns 39167 ns 1.22
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 83333 ns 83958 ns 0.99
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 37783 ns 37250 ns 1.01
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1926458.5 ns 1929792 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1973292 ns 1973292 ns 1
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1982604 ns 1984249.5 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1891667 ns 1881417 ns 1.01
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 175065.5 ns 171491 ns 1.02
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 267583.5 ns 273354 ns 0.98
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 269084 ns 267959 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 289917 ns 270687.5 ns 1.07
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 267166.5 ns 268834 ns 0.99
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 131334.5 ns 124192.5 ns 1.06
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 589666 ns 658333 ns 0.90
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 689042 ns 674854.5 ns 1.02
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 665916 ns 665333 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 673125 ns 670500 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 706985 ns 664813 ns 1.06
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 2181896 ns 2190167 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 2232625 ns 2214354.5 ns 1.01
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 2183854 ns 2216958.5 ns 0.99
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 2191875 ns 2099979 ns 1.04
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 133858 ns 133238 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 5512271 ns 5505354.5 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 5598583.5 ns 5504750 ns 1.02
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 5489687.5 ns 5565292 ns 0.99
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 5490875 ns 5499708 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 746209 ns 740235 ns 1.01
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/2 thread(s) 646416 ns 650417 ns 0.99
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/4 thread(s) 647583 ns 649020.5 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/8 thread(s) 647666 ns 640625 ns 1.01
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/1 thread(s) 644958 ns 648292 ns 0.99
dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/CUDA 47169 ns 47265 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) 1826146 ns 1821708 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) 1670917 ns 1720959 ns 0.97
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) 1722292 ns 1675729.5 ns 1.03
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) 2100708 ns 2108500 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/CUDA 224227.5 ns 224014 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 58250 ns 58583 ns 0.99
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 38500 ns 46645.5 ns 0.83
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 45917 ns 38750 ns 1.18
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 84125 ns 83834 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 28997 ns 28947 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2027375 ns 2024916 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2103792 ns 2086188 ns 1.01
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2095437.5 ns 2100521 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1994916 ns 1993416.5 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 190639 ns 191815.5 ns 0.99
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s) 13358312.5 ns 13473875 ns 0.99
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s) 12522354 ns 12547041.5 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s) 12489916 ns 12559604 ns 0.99
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s) 14891124.5 ns 15213416.5 ns 0.98
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/GPU/CUDA 514106 ns 517805 ns 0.99
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s) 47397250 ns 47353458 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s) 42110166.5 ns 41833334 ns 1.01
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s) 41023354.5 ns 41118750 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s) 57862667 ns 58300041 ns 0.99
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/GPU/CUDA 3195086 ns 3203904 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s) 97510834 ns 74077042 ns 1.32
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s) 91327667 ns 68022250 ns 1.34
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s) 90418625 ns 90906749.5 ns 0.99
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s) 99397458 ns 99115937.5 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 59375 ns 58958 ns 1.01
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 38750 ns 47375 ns 0.82
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 47666 ns 38729.5 ns 1.23
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 82625 ns 83500 ns 0.99
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 46954 ns 47777 ns 0.98
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1922250 ns 1923375 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1818833.5 ns 1961541 ns 0.93
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1971041 ns 1980229 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1887729.5 ns 1890354 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 190192.5 ns 194350.5 ns 0.98
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) 375 ns 291 ns 1.29
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) 375 ns 291 ns 1.29
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) 375 ns 375 ns 1
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) 375 ns 375 ns 1
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA 31804 ns 32617.5 ns 0.98
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 6625 ns 6208.5 ns 1.07
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 6833 ns 5958 ns 1.15
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 6750 ns 6708 ns 1.01
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 6459 ns 6437.5 ns 1.00
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA 173129 ns 173722.5 ns 1.00
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/2 thread(s) 291 ns 250 ns 1.16
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/4 thread(s) 292 ns 250 ns 1.17
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/8 thread(s) 291 ns 292 ns 1.00
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/1 thread(s) 292 ns 292 ns 1
dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/CUDA 31435 ns 32110 ns 0.98
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/2 thread(s) 2875 ns 2583 ns 1.11
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/4 thread(s) 2792 ns 2542 ns 1.10
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/8 thread(s) 2834 ns 2833 ns 1.00
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/1 thread(s) 2834 ns 2833 ns 1.00
dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/CUDA 160716 ns 161891 ns 0.99
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) 287914625 ns 286335145.5 ns 1.01
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) 347874917 ns 339870250 ns 1.02
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) 313778604 ns 320445937.5 ns 0.98
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) 274558667 ns 272825875 ns 1.01
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/GPU/CUDA 7101120 ns 7113314 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) 1001223208 ns 990386709 ns 1.01
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) 962335292 ns 938484666 ns 1.03
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) 851949854 ns 868613416.5 ns 0.98
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) 1161969458 ns 1158749666 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/GPU/CUDA 33885355 ns 33903874 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) 1680835000 ns 1310266104.5 ns 1.28
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) 1703135000 ns 1325766333.5 ns 1.28
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) 1598082833 ns 1623996500 ns 0.98
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) 1673120084 ns 1663239334 ns 1.01
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 1417562.5 ns 1461479 ns 0.97
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 1417084 ns 1415750 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 1418167 ns 1429167 ns 0.99
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 1460750 ns 1414437.5 ns 1.03
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 128326 ns 128213 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 5025104 ns 5019792 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 5057208 ns 5022458 ns 1.01
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 5024188 ns 5050000 ns 0.99
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 5022083 ns 5006541.5 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 573974 ns 557532 ns 1.03
vgg16(32, 32, 3, 32)/forward/CPU/2 thread(s) 175868729.5 ns 175263520.5 ns 1.00
vgg16(32, 32, 3, 32)/forward/CPU/4 thread(s) 181907771 ns 129816208.5 ns 1.40
vgg16(32, 32, 3, 32)/forward/CPU/8 thread(s) 127813000 ns 145953208.5 ns 0.88
vgg16(32, 32, 3, 32)/forward/CPU/1 thread(s) 170303209 ns 164619104.5 ns 1.03
vgg16(32, 32, 3, 32)/forward/GPU/CUDA 4854931.5 ns 4883992 ns 0.99
vgg16(32, 32, 3, 32)/zygote/CPU/2 thread(s) 666951834 ns 831528333 ns 0.80
vgg16(32, 32, 3, 32)/zygote/CPU/4 thread(s) 612960875 ns 497840084 ns 1.23
vgg16(32, 32, 3, 32)/zygote/CPU/8 thread(s) 496604958 ns 556789916 ns 0.89
vgg16(32, 32, 3, 32)/zygote/CPU/1 thread(s) 682123167 ns 679969833 ns 1.00
vgg16(32, 32, 3, 32)/zygote/GPU/CUDA 16051467 ns 16195623 ns 0.99
batchedmm(512, Bsize=32)/forward/CPU/2 thread(s) 8911750 ns 8914083 ns 1.00
batchedmm(512, Bsize=32)/forward/CPU/4 thread(s) 8824812.5 ns 8769917 ns 1.01
batchedmm(512, Bsize=32)/forward/CPU/8 thread(s) 7877958 ns 8216313 ns 0.96
batchedmm(512, Bsize=32)/forward/CPU/1 thread(s) 10154667 ns 10158000 ns 1.00
batchedmm(512, Bsize=32)/forward/GPU/CUDA 1605096 ns 1595526 ns 1.01
batchedmm(512, Bsize=32)/zygote/CPU/2 thread(s) 35800833 ns 35894250 ns 1.00
batchedmm(512, Bsize=32)/zygote/CPU/4 thread(s) 37985916 ns 36843625 ns 1.03
batchedmm(512, Bsize=32)/zygote/CPU/8 thread(s) 33369583 ns 34476562 ns 0.97
batchedmm(512, Bsize=32)/zygote/CPU/1 thread(s) 38797917 ns 38802729 ns 1.00
batchedmm(512, Bsize=32)/zygote/GPU/CUDA 6461183 ns 6454567.5 ns 1.00
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/2 thread(s) 47542 ns 47396 ns 1.00
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/4 thread(s) 47417 ns 49334 ns 0.96
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/8 thread(s) 47708.5 ns 47542 ns 1.00
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/1 thread(s) 47458 ns 47417 ns 1.00
bias_activation(32, act=tanh)(32 x 128)/forward/GPU/CUDA 18387 ns 19457 ns 0.95
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/2 thread(s) 50542 ns 50292 ns 1.00
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/4 thread(s) 50417 ns 50520.5 ns 1.00
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/8 thread(s) 50729.5 ns 50584 ns 1.00
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/1 thread(s) 50354.5 ns 50250 ns 1.00
bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/CUDA 197522 ns 189575 ns 1.04
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 7250 ns 8104 ns 0.89
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 7458 ns 6791 ns 1.10
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 8167 ns 9125 ns 0.90
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 7250 ns 7333 ns 0.99
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA 100674.5 ns 86829.5 ns 1.16
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 10167 ns 9875 ns 1.03
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 10125 ns 9583 ns 1.06
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 10125 ns 10375 ns 0.98
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 10167 ns 10208 ns 1.00
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA 532551.5 ns 537525 ns 0.99
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 6084 ns 8208 ns 0.74
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 7583 ns 8250 ns 0.92
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 8583 ns 9812.5 ns 0.87
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 6125 ns 6375 ns 0.96
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA 106692.5 ns 113788.5 ns 0.94
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 13417 ns 13333.5 ns 1.01
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 13354.5 ns 12625 ns 1.06
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 13459 ns 13584 ns 0.99
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 13375 ns 13208 ns 1.01
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA 464411 ns 479705.5 ns 0.97
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 1042 ns 958 ns 1.09
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 1083 ns 958 ns 1.13
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 1041 ns 1042 ns 1.00
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 1042 ns 1083 ns 0.96
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA 32682 ns 32580 ns 1.00
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 8125 ns 7750 ns 1.05
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 8125 ns 7625 ns 1.07
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 8125 ns 8542 ns 0.95
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 8000 ns 8208 ns 0.97
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA 204035.5 ns 201701.5 ns 1.01
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/2 thread(s) 23458 ns 23250 ns 1.01
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/4 thread(s) 23333 ns 23042 ns 1.01
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/8 thread(s) 23375 ns 23500 ns 0.99
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/1 thread(s) 23458 ns 23167 ns 1.01
bias_activation(32, act=gelu)(32 x 128)/forward/GPU/CUDA 18657 ns 18765.5 ns 0.99
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) 52708 ns 52875 ns 1.00
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) 52604.5 ns 52292 ns 1.01
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) 52916 ns 52792 ns 1.00
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) 52708 ns 52459 ns 1.00
bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/CUDA 271882 ns 260844.5 ns 1.04
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 1398500 ns 1400229 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 1401729.5 ns 1398666.5 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 1411917 ns 1400708 ns 1.01
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 1400229 ns 1398917 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 196861.5 ns 196521.5 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 5018875 ns 5018604 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 5048167 ns 5004729.5 ns 1.01
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 5006959 ns 5044229.5 ns 0.99
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 5008333 ns 5001271 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 574886.5 ns 595122 ns 0.97
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s) 3024166 ns 3043083 ns 0.99
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s) 2100833 ns 2094042 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s) 2282500 ns 2287146 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s) 4885312.5 ns 4530875 ns 1.08
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/GPU/CUDA 583172 ns 582703 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s) 24461812.5 ns 24366625 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s) 19084792 ns 18829583 ns 1.01
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s) 19138792 ns 19120291 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s) 36884542 ns 36653000 ns 1.01
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/GPU/CUDA 3200942 ns 3189516.5 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s) 34091166.5 ns 33943229 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s) 28709958.5 ns 28373417 ns 1.01
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s) 28378584 ns 28357208 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s) 41695667 ns 41659750 ns 1.00
batchedmm(512, Bsize=512)/forward/CPU/2 thread(s) 144674125 ns 144299750 ns 1.00
batchedmm(512, Bsize=512)/forward/CPU/4 thread(s) 143080042 ns 142248375 ns 1.01
batchedmm(512, Bsize=512)/forward/CPU/8 thread(s) 124467729.5 ns 126632146 ns 0.98
batchedmm(512, Bsize=512)/forward/CPU/1 thread(s) 174621958 ns 173840291.5 ns 1.00
batchedmm(512, Bsize=512)/forward/GPU/CUDA 22564506 ns 22781482 ns 0.99
batchedmm(512, Bsize=512)/zygote/CPU/2 thread(s) 1302493437.5 ns 1307941437.5 ns 1.00
batchedmm(512, Bsize=512)/zygote/CPU/4 thread(s) 874700000 ns 1133574500.5 ns 0.77
batchedmm(512, Bsize=512)/zygote/CPU/8 thread(s) 764524604 ns 711240125 ns 1.07
batchedmm(512, Bsize=512)/zygote/CPU/1 thread(s) 677522375 ns 670828250 ns 1.01
batchedmm(512, Bsize=512)/zygote/GPU/CUDA 116859638 ns 118499942 ns 0.99
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 75000 ns 74542 ns 1.01
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 74833 ns 73917 ns 1.01
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 76292 ns 83125 ns 0.92
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 74896 ns 72916.5 ns 1.03
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 216038.5 ns 225032.5 ns 0.96
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 210750 ns 202979.5 ns 1.04
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 201917 ns 282792 ns 0.71
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 281958 ns 253479.5 ns 1.11
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 269333 ns 244146 ns 1.10
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 1090324 ns 1201754 ns 0.91
batchedmm(512, Bsize=128)/forward/CPU/2 thread(s) 35470834 ns 35408938 ns 1.00
batchedmm(512, Bsize=128)/forward/CPU/4 thread(s) 35928792 ns 35449645.5 ns 1.01
batchedmm(512, Bsize=128)/forward/CPU/8 thread(s) 32119604 ns 32512083 ns 0.99
batchedmm(512, Bsize=128)/forward/CPU/1 thread(s) 40968833.5 ns 41003541.5 ns 1.00
batchedmm(512, Bsize=128)/forward/GPU/CUDA 5850850 ns 5848198 ns 1.00
batchedmm(512, Bsize=128)/zygote/CPU/2 thread(s) 147364709 ns 146608875 ns 1.01
batchedmm(512, Bsize=128)/zygote/CPU/4 thread(s) 158273854.5 ns 151542938 ns 1.04
batchedmm(512, Bsize=128)/zygote/CPU/8 thread(s) 133245583 ns 138849083 ns 0.96
batchedmm(512, Bsize=128)/zygote/CPU/1 thread(s) 288316083 ns 287439584 ns 1.00
batchedmm(512, Bsize=128)/zygote/GPU/CUDA 34901456.5 ns 34913824 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) 119595041.5 ns 121086291.5 ns 0.99
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) 182760375 ns 174190000 ns 1.05
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) 147949250 ns 155717667 ns 0.95
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) 104280375 ns 106488666.5 ns 0.98
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/GPU/CUDA 5427032 ns 5478422 ns 0.99
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) 474025250 ns 611208666 ns 0.78
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) 487387541.5 ns 466441167 ns 1.04
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) 438686937.5 ns 453562937.5 ns 0.97
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) 743324584 ns 741621625 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/GPU/CUDA 35184946 ns 35157227 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) 709818791.5 ns 648662584 ns 1.09
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) 676047500 ns 657411208 ns 1.03
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) 576077125 ns 585962375 ns 0.98
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) 856793750 ns 845072208 ns 1.01
mlp7layer_bn(relu)(32 x 256)/forward/CPU/2 thread(s) 1335187.5 ns 1304708 ns 1.02
mlp7layer_bn(relu)(32 x 256)/forward/CPU/4 thread(s) 690416 ns 965666 ns 0.71
mlp7layer_bn(relu)(32 x 256)/forward/CPU/8 thread(s) 968250 ns 744354 ns 1.30
mlp7layer_bn(relu)(32 x 256)/forward/CPU/1 thread(s) 2069791.5 ns 1944604 ns 1.06
mlp7layer_bn(relu)(32 x 256)/forward/GPU/CUDA 562588.5 ns 572387 ns 0.98
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/2 thread(s) 2971667 ns 2974271 ns 1.00
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/4 thread(s) 2540917 ns 2531646 ns 1.00
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/8 thread(s) 2628687 ns 2512854 ns 1.05
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/1 thread(s) 3706167 ns 3691334 ns 1.00
mlp7layer_bn(relu)(32 x 256)/zygote/GPU/CUDA 1656253 ns 1817474 ns 0.91
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/2 thread(s) 6655750 ns 6642416 ns 1.00
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/4 thread(s) 6500458 ns 6630792 ns 0.98
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/8 thread(s) 6506291.5 ns 6466375 ns 1.01
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/1 thread(s) 4453146 ns 4443145.5 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 7292 ns 7334 ns 0.99
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 5292 ns 6208 ns 0.85
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 6209 ns 5458 ns 1.14
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 10000 ns 10167 ns 0.98
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 25101 ns 25916 ns 0.97
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 213084 ns 212104 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 221084 ns 219562.5 ns 1.01
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 221042 ns 220667 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 215333 ns 206291 ns 1.04
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 249680 ns 257490 ns 0.97
vgg16(32, 32, 3, 64)/forward/CPU/2 thread(s) 302094083 ns 301772791.5 ns 1.00
vgg16(32, 32, 3, 64)/forward/CPU/4 thread(s) 282009708 ns 222879750 ns 1.27
vgg16(32, 32, 3, 64)/forward/CPU/8 thread(s) 198372812.5 ns 222700312.5 ns 0.89
vgg16(32, 32, 3, 64)/forward/CPU/1 thread(s) 311510292 ns 311773125 ns 1.00
vgg16(32, 32, 3, 64)/forward/GPU/CUDA 7672645 ns 7676597.5 ns 1.00
vgg16(32, 32, 3, 64)/zygote/CPU/2 thread(s) 1082528563 ns 1082870459 ns 1.00
vgg16(32, 32, 3, 64)/zygote/CPU/4 thread(s) 990032792 ns 892532250 ns 1.11
vgg16(32, 32, 3, 64)/zygote/CPU/8 thread(s) 865476333 ns 883941208.5 ns 0.98
vgg16(32, 32, 3, 64)/zygote/CPU/1 thread(s) 1159115917 ns 1154293562 ns 1.00
vgg16(32, 32, 3, 64)/zygote/GPU/CUDA 26437793 ns 26959026 ns 0.98
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 5625 ns 6459 ns 0.87
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 6270.5 ns 5209 ns 1.20
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 6979.5 ns 10000 ns 0.70
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 5625 ns 5708.5 ns 0.99
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA 145670 ns 168546.5 ns 0.86
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 7750 ns 7458 ns 1.04
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 7459 ns 6792 ns 1.10
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 7542 ns 7542 ns 1
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 7500 ns 7792 ns 0.96
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA 586665.5 ns 639812.5 ns 0.92
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) 542 ns 458 ns 1.18
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) 541 ns 458 ns 1.18
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) 542 ns 542 ns 1
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) 542 ns 542 ns 1
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA 24215 ns 24361 ns 0.99
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 11416.5 ns 9000 ns 1.27
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 9334 ns 9000 ns 1.04
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 9916 ns 9583 ns 1.03
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 9417 ns 9708 ns 0.97
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA 208481.5 ns 234125.5 ns 0.89
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/2 thread(s) 351459 ns 351500 ns 1.00
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/4 thread(s) 353166 ns 351500 ns 1.00
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/8 thread(s) 351542 ns 351916 ns 1.00
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/1 thread(s) 351042 ns 356625 ns 0.98
bias_activation(512, act=gelu)(512 x 128)/forward/GPU/CUDA 21437 ns 21502 ns 1.00
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) 830208 ns 811270.5 ns 1.02
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) 778354 ns 774958.5 ns 1.00
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) 777000 ns 776584 ns 1.00
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) 826583.5 ns 821875 ns 1.01
bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/CUDA 263376.5 ns 315795.5 ns 0.83
batchedmm(16, Bsize=32)/forward/CPU/2 thread(s) 337500 ns 335896 ns 1.00
batchedmm(16, Bsize=32)/forward/CPU/4 thread(s) 323625.5 ns 338208.5 ns 0.96
batchedmm(16, Bsize=32)/forward/CPU/8 thread(s) 450541 ns 441167 ns 1.02
batchedmm(16, Bsize=32)/forward/CPU/1 thread(s) 333583 ns 331375 ns 1.01
batchedmm(16, Bsize=32)/forward/GPU/CUDA 17840 ns 18761.5 ns 0.95
batchedmm(16, Bsize=32)/zygote/CPU/2 thread(s) 694666.5 ns 695166 ns 1.00
batchedmm(16, Bsize=32)/zygote/CPU/4 thread(s) 737375 ns 738208 ns 1.00
batchedmm(16, Bsize=32)/zygote/CPU/8 thread(s) 1032062.5 ns 1036458 ns 1.00
batchedmm(16, Bsize=32)/zygote/CPU/1 thread(s) 692917 ns 692396 ns 1.00
batchedmm(16, Bsize=32)/zygote/GPU/CUDA 237490 ns 292461.5 ns 0.81
batchedmm(16, Bsize=128)/forward/CPU/2 thread(s) 354833 ns 354166.5 ns 1.00
batchedmm(16, Bsize=128)/forward/CPU/4 thread(s) 332292 ns 346771 ns 0.96
batchedmm(16, Bsize=128)/forward/CPU/8 thread(s) 423312.5 ns 433791 ns 0.98
batchedmm(16, Bsize=128)/forward/CPU/1 thread(s) 373979 ns 370250 ns 1.01
batchedmm(16, Bsize=128)/forward/GPU/CUDA 22593 ns 23121 ns 0.98
batchedmm(16, Bsize=128)/zygote/CPU/2 thread(s) 751812.5 ns 757417 ns 0.99
batchedmm(16, Bsize=128)/zygote/CPU/4 thread(s) 744708 ns 749625 ns 0.99
batchedmm(16, Bsize=128)/zygote/CPU/8 thread(s) 1083000 ns 1070562.5 ns 1.01
batchedmm(16, Bsize=128)/zygote/CPU/1 thread(s) 822375 ns 828458 ns 0.99
batchedmm(16, Bsize=128)/zygote/GPU/CUDA 213868 ns 257074.5 ns 0.83
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/2 thread(s) 3625 ns 3292 ns 1.10
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/4 thread(s) 3625 ns 3458 ns 1.05
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/8 thread(s) 3708 ns 3750 ns 0.99
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/1 thread(s) 3583 ns 3417 ns 1.05
bias_activation(2, act=tanh)(2 x 128)/forward/GPU/CUDA 18261 ns 18586 ns 0.98
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/2 thread(s) 4250 ns 4167 ns 1.02
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/4 thread(s) 4250 ns 4375 ns 0.97
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/8 thread(s) 4312.5 ns 4417 ns 0.98
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/1 thread(s) 4209 ns 4250 ns 0.99
bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/CUDA 231902 ns 296700.5 ns 0.78
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) 3708 ns 3625 ns 1.02
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) 4209 ns 3750 ns 1.12
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) 5354.5 ns 6541 ns 0.82
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) 4000 ns 6354.5 ns 0.63
layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA 168839 ns 232189.5 ns 0.73
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 8750 ns 8187.5 ns 1.07
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 8459 ns 8000 ns 1.06
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 8584 ns 8458 ns 1.01
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 8583 ns 8500 ns 1.01
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA 1145924 ns 1227082 ns 0.93
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 204750 ns 203417 ns 1.01
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 210041 ns 209541.5 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 211750 ns 208250 ns 1.02
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 199500 ns 198709 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 35034 ns 35300 ns 0.99
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 644625 ns 612417 ns 1.05
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 623875 ns 623292 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 621917 ns 623250 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 629792 ns 630166 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 345017 ns 347973 ns 0.99
batchedmm(128, Bsize=128)/forward/CPU/2 thread(s) 976354 ns 977646 ns 1.00
batchedmm(128, Bsize=128)/forward/CPU/4 thread(s) 938833 ns 935437.5 ns 1.00
batchedmm(128, Bsize=128)/forward/CPU/8 thread(s) 957084 ns 970083 ns 0.99
batchedmm(128, Bsize=128)/forward/CPU/1 thread(s) 1292688 ns 1286374.5 ns 1.00
batchedmm(128, Bsize=128)/forward/GPU/CUDA 207331 ns 209031 ns 0.99
batchedmm(128, Bsize=128)/zygote/CPU/2 thread(s) 4494021 ns 4514333 ns 1.00
batchedmm(128, Bsize=128)/zygote/CPU/4 thread(s) 4616104 ns 4466146 ns 1.03
batchedmm(128, Bsize=128)/zygote/CPU/8 thread(s) 4311125 ns 4452875 ns 0.97
batchedmm(128, Bsize=128)/zygote/CPU/1 thread(s) 6200916 ns 6260416.5 ns 0.99
batchedmm(128, Bsize=128)/zygote/GPU/CUDA 937889.5 ns 947144.5 ns 0.99
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) 3667 ns 3542 ns 1.04
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) 3917 ns 3417 ns 1.15
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) 4729.5 ns 5896 ns 0.80
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) 3917 ns 6667 ns 0.59
layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA 233257 ns 219336.5 ns 1.06
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 7708 ns 6917 ns 1.11
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 7209 ns 6958 ns 1.04
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 7625 ns 7708 ns 0.99
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 7375 ns 7291 ns 1.01
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA 1025391 ns 1020167.5 ns 1.01
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s) 1621375.5 ns 1635042 ns 0.99
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s) 1154563 ns 1200395.5 ns 0.96
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s) 1379604.5 ns 1363584 ns 1.01
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s) 2492334 ns 2345187.5 ns 1.06
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/GPU/CUDA 213997.5 ns 215784.5 ns 0.99
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s) 12340250 ns 12316854.5 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s) 9638583 ns 9564000 ns 1.01
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s) 9296229 ns 9378437.5 ns 0.99
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s) 17980437.5 ns 17989542 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/GPU/CUDA 1947124.5 ns 1948181 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s) 17336562.5 ns 17368125 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s) 14454166 ns 14382958 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s) 14368729 ns 14502250 ns 0.99
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s) 21078291 ns 21085917 ns 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 89208 ns 90917 ns 0.98
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 91000 ns 89500 ns 1.02
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 91333 ns 91833 ns 0.99
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 92750 ns 113437.5 ns 0.82
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 126550 ns 126891 ns 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2036167 ns 2009625 ns 1.01
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2042812.5 ns 2030000 ns 1.01
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2030000 ns 2039270.5 ns 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 2024333 ns 1871125 ns 1.08
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 1062042.5 ns 1032563 ns 1.03
batchedmm(2, Bsize=4)/forward/CPU/2 thread(s) 346437.5 ns 342166.5 ns 1.01
batchedmm(2, Bsize=4)/forward/CPU/4 thread(s) 333250 ns 343375 ns 0.97
batchedmm(2, Bsize=4)/forward/CPU/8 thread(s) 396583 ns 406458 ns 0.98
batchedmm(2, Bsize=4)/forward/CPU/1 thread(s) 309645.5 ns 311729 ns 0.99
batchedmm(2, Bsize=4)/forward/GPU/CUDA 15810 ns 16465.5 ns 0.96
batchedmm(2, Bsize=4)/zygote/CPU/2 thread(s) 704708.5 ns 706208 ns 1.00
batchedmm(2, Bsize=4)/zygote/CPU/4 thread(s) 727834 ns 728542 ns 1.00
batchedmm(2, Bsize=4)/zygote/CPU/8 thread(s) 1027958 ns 1018584 ns 1.01
batchedmm(2, Bsize=4)/zygote/CPU/1 thread(s) 649500 ns 650375 ns 1.00
batchedmm(2, Bsize=4)/zygote/GPU/CUDA 194848.5 ns 195366.5 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 7166 ns 7375 ns 0.97
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 5292 ns 5875 ns 0.90
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 6000 ns 5416 ns 1.11
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 9958 ns 10000 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 33484 ns 34591 ns 0.97
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 223145.5 ns 243791 ns 0.92
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 220708 ns 220125 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 220459 ns 221083 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 218770.5 ns 239167 ns 0.91
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 345803.5 ns 327793 ns 1.05
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/2 thread(s) 3709 ns 3667 ns 1.01
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/4 thread(s) 3750 ns 3667 ns 1.02
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/8 thread(s) 3667 ns 3709 ns 0.99
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/1 thread(s) 3708 ns 3708 ns 1.00
dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/CUDA 22336 ns 22616 ns 0.99
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/2 thread(s) 14375 ns 14292 ns 1.01
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/4 thread(s) 14333 ns 14416 ns 0.99
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/8 thread(s) 14375 ns 14208 ns 1.01
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/1 thread(s) 14416 ns 14417 ns 1.00
dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/CUDA 484513 ns 480334.5 ns 1.01
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 92750 ns 94458 ns 0.98
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 94166.5 ns 92625 ns 1.02
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 95292 ns 96875 ns 0.98
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 92583 ns 96229.5 ns 0.96
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 125724.5 ns 126007 ns 1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1932874.5 ns 1714792 ns 1.13
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1864833 ns 1926792 ns 0.97
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1919250 ns 1913291.5 ns 1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1919500 ns 1711417 ns 1.12
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 959155 ns 1034230 ns 0.93
lenet(28, 28, 1, 32)/forward/CPU/2 thread(s) 888958 ns 876916.5 ns 1.01
lenet(28, 28, 1, 32)/forward/CPU/4 thread(s) 806791 ns 817791 ns 0.99
lenet(28, 28, 1, 32)/forward/CPU/8 thread(s) 1216687 ns 1169438 ns 1.04
lenet(28, 28, 1, 32)/forward/CPU/1 thread(s) 963500 ns 966187.5 ns 1.00
lenet(28, 28, 1, 32)/forward/GPU/CUDA 276636 ns 275657.5 ns 1.00
lenet(28, 28, 1, 32)/zygote/CPU/2 thread(s) 2755937.5 ns 2828583 ns 0.97
lenet(28, 28, 1, 32)/zygote/CPU/4 thread(s) 2481208.5 ns 2474833 ns 1.00
lenet(28, 28, 1, 32)/zygote/CPU/8 thread(s) 3328000 ns 3335750 ns 1.00
lenet(28, 28, 1, 32)/zygote/CPU/1 thread(s) 3410604.5 ns 3304292 ns 1.03
lenet(28, 28, 1, 32)/zygote/GPU/CUDA 1602148 ns 1618381.5 ns 0.99
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 15666 ns 16709 ns 0.94
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 17500 ns 15625 ns 1.12
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 18271 ns 18667 ns 0.98
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 15583 ns 15583 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 141937 ns 142594 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 226520.5 ns 228750 ns 0.99
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 264250 ns 215750 ns 1.22
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 215937.5 ns 217625 ns 0.99
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 230833 ns 255500 ns 0.90
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 636070 ns 641543.5 ns 0.99
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 221833 ns 222458 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 222375 ns 221500 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 222000 ns 223458.5 ns 0.99
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 219084 ns 222604.5 ns 0.98
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 268289.5 ns 269850.5 ns 0.99
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 531167 ns 537583 ns 0.99
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 507750 ns 497334 ns 1.02
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 495979.5 ns 499583 ns 0.99
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 509541 ns 526833 ns 0.97
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 1373603.5 ns 1430878.5 ns 0.96
batchedmm(16, Bsize=4)/forward/CPU/2 thread(s) 334979 ns 330125 ns 1.01
batchedmm(16, Bsize=4)/forward/CPU/4 thread(s) 317792 ns 332834 ns 0.95
batchedmm(16, Bsize=4)/forward/CPU/8 thread(s) 385750 ns 435458.5 ns 0.89
batchedmm(16, Bsize=4)/forward/CPU/1 thread(s) 322104 ns 315917 ns 1.02
batchedmm(16, Bsize=4)/forward/GPU/CUDA 16515 ns 16581 ns 1.00
batchedmm(16, Bsize=4)/zygote/CPU/2 thread(s) 715583 ns 717084 ns 1.00
batchedmm(16, Bsize=4)/zygote/CPU/4 thread(s) 731000 ns 728166.5 ns 1.00
batchedmm(16, Bsize=4)/zygote/CPU/8 thread(s) 1024292 ns 1021104 ns 1.00
batchedmm(16, Bsize=4)/zygote/CPU/1 thread(s) 663792 ns 662729.5 ns 1.00
batchedmm(16, Bsize=4)/zygote/GPU/CUDA 193684 ns 195479.5 ns 0.99
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 17729.5 ns 17875 ns 0.99
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 18125 ns 17167 ns 1.06
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 19646 ns 20250 ns 0.97
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 17209 ns 17208 ns 1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 146583 ns 145639 ns 1.01
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 213208 ns 223750 ns 0.95
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 214812 ns 212417 ns 1.01
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 214084 ns 214041 ns 1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 246000 ns 221917 ns 1.11
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 1004648 ns 1035551.5 ns 0.97
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) 5208 ns 6708 ns 0.78
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) 6166.5 ns 6333 ns 0.97
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) 6958 ns 7208 ns 0.97
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) 4500 ns 6625 ns 0.68
layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA 238961.5 ns 240542 ns 0.99
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 11166.5 ns 10584 ns 1.06
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 10666 ns 9917 ns 1.08
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 10916 ns 11166.5 ns 0.98
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 10875 ns 10917 ns 1.00
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA 1055349 ns 1097401.5 ns 0.96
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 3125 ns 3500 ns 0.89
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 3895.5 ns 3208 ns 1.21
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 4584 ns 6333.5 ns 0.72
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 3667 ns 6750 ns 0.54
layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA 234616.5 ns 250006 ns 0.94
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 7625 ns 7625 ns 1.00
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 7959 ns 7084 ns 1.12
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 7708 ns 8125 ns 0.95
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 7542 ns 7500 ns 1.01
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA 1063331 ns 1102649 ns 0.96
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s) 23774042 ns 23315625 ns 1.02
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s) 43803625 ns 34529125 ns 1.27
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s) 37663250 ns 41513333.5 ns 0.91
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s) 34941166 ns 34929834 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/GPU/CUDA 1776020.5 ns 1838602 ns 0.97
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s) 183783458 ns 184421875 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s) 173905833 ns 159459792 ns 1.09
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s) 146559750 ns 151225083 ns 0.97
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s) 414263417 ns 413223958 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/GPU/CUDA 16542930.5 ns 16387494 ns 1.01
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s) 426447000 ns 428743125 ns 0.99
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s) 259361708 ns 252439020.5 ns 1.03
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s) 231145458 ns 233017396 ns 0.99
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s) 485707250 ns 484197291 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 182625 ns 183584 ns 0.99
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 184500 ns 182750 ns 1.01
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 184937 ns 186625 ns 0.99
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 183083 ns 183146 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 219470.5 ns 228677.5 ns 0.96
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 588000 ns 596083 ns 0.99
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 589541.5 ns 586292 ns 1.01
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 587000 ns 589770.5 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 628792 ns 631958 ns 0.99
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 1062123 ns 1119701 ns 0.95
batchedmm(128, Bsize=512)/forward/CPU/2 thread(s) 3844333.5 ns 3838833 ns 1.00
batchedmm(128, Bsize=512)/forward/CPU/4 thread(s) 3705021.5 ns 3643375.5 ns 1.02
batchedmm(128, Bsize=512)/forward/CPU/8 thread(s) 3474208 ns 3563521 ns 0.97
batchedmm(128, Bsize=512)/forward/CPU/1 thread(s) 5354792 ns 5359750 ns 1.00
batchedmm(128, Bsize=512)/forward/GPU/CUDA 531867 ns 537722 ns 0.99
batchedmm(128, Bsize=512)/zygote/CPU/2 thread(s) 17410625 ns 17412417 ns 1.00
batchedmm(128, Bsize=512)/zygote/CPU/4 thread(s) 17824604 ns 17190667 ns 1.04
batchedmm(128, Bsize=512)/zygote/CPU/8 thread(s) 16530791.5 ns 17100375 ns 0.97
batchedmm(128, Bsize=512)/zygote/CPU/1 thread(s) 22146375 ns 22144083 ns 1.00
batchedmm(128, Bsize=512)/zygote/GPU/CUDA 2636212 ns 2612799 ns 1.01
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 500 ns 542 ns 0.92
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 541 ns 458 ns 1.18
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 542 ns 542 ns 1.00
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 542 ns 583 ns 0.93
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA 31938 ns 32035 ns 1.00
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 9417 ns 9208 ns 1.02
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 9417 ns 8542 ns 1.10
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 9750 ns 10208 ns 0.96
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 10084 ns 9459 ns 1.07
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA 261322 ns 264327.5 ns 0.99
vgg16(32, 32, 3, 128)/forward/CPU/2 thread(s) 501945500 ns 504274209 ns 1.00
vgg16(32, 32, 3, 128)/forward/CPU/4 thread(s) 515492167 ns 430218396 ns 1.20
vgg16(32, 32, 3, 128)/forward/CPU/8 thread(s) 433413812.5 ns 471374500 ns 0.92
vgg16(32, 32, 3, 128)/forward/CPU/1 thread(s) 671064375 ns 672994208.5 ns 1.00
vgg16(32, 32, 3, 128)/forward/GPU/CUDA 11967043.5 ns 12486595 ns 0.96
vgg16(32, 32, 3, 128)/zygote/CPU/2 thread(s) 2042006188 ns 2049529562.5 ns 1.00
vgg16(32, 32, 3, 128)/zygote/CPU/4 thread(s) 1670021000 ns 1632649709 ns 1.02
vgg16(32, 32, 3, 128)/zygote/CPU/8 thread(s) 1491864937.5 ns 1536417708 ns 0.97
vgg16(32, 32, 3, 128)/zygote/CPU/1 thread(s) 2215896125 ns 2205666041.5 ns 1.00
vgg16(32, 32, 3, 128)/zygote/GPU/CUDA 49221176 ns 49389302 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s) 1658250.5 ns 1657645.5 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s) 1182042 ns 1189208.5 ns 0.99
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s) 1397145.5 ns 1382000 ns 1.01
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s) 2472500 ns 2334125 ns 1.06
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/GPU/CUDA 215560 ns 214982 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s) 12703625 ns 12688500 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s) 10026271 ns 9942000 ns 1.01
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s) 9729709 ns 9748312.5 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s) 18433208 ns 18407312 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/GPU/CUDA 2021422 ns 2050613 ns 0.99
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s) 17685333 ns 17691583.5 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s) 14724875 ns 14746041.5 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s) 14636209 ns 14804417 ns 0.99
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s) 21349667 ns 21386084 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/2 thread(s) 26209 ns 26167 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/4 thread(s) 26292 ns 26292 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/8 thread(s) 26209 ns 26291 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/1 thread(s) 26250 ns 26291 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/CUDA 24156 ns 24125 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) 66792 ns 66875 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) 66875 ns 66917 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) 68125 ns 67083 ns 1.02
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) 66917 ns 67209 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/CUDA 395970.5 ns 398847.5 ns 0.99
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 204542 ns 202667 ns 1.01
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 208875 ns 209000 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 210625 ns 209167 ns 1.01
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 199000 ns 199583 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 26615 ns 26392 ns 1.01
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 602625.5 ns 612416.5 ns 0.98
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 622854 ns 627416.5 ns 0.99
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 665875 ns 667979 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 581270.5 ns 631250 ns 0.92
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 347431.5 ns 353043.5 ns 0.98
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 637021 ns 645542 ns 0.99
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 589083 ns 643375 ns 0.92
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 653375 ns 664187.5 ns 0.98
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 640021 ns 540834 ns 1.18
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 131837 ns 132126 ns 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2250645.5 ns 2247375 ns 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2309792 ns 2239958 ns 1.03
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2226583 ns 2302917 ns 0.97
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 2236125 ns 2219000 ns 1.01
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 1169730 ns 1328726 ns 0.88
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 17292 ns 17667 ns 0.98
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 18334 ns 16979.5 ns 1.08
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 22750 ns 20792 ns 1.09
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 17062.5 ns 18500 ns 0.92
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 145778 ns 146392.5 ns 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 221375 ns 229708 ns 0.96
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 226833 ns 225333 ns 1.01
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 256916 ns 229292 ns 1.12
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 259375 ns 259083 ns 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 998109 ns 1081671 ns 0.92
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 542 ns 500 ns 1.08
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 541 ns 459 ns 1.18
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 542 ns 542 ns 1.00
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 542 ns 542 ns 1.00
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA 23995 ns 23645 ns 1.01
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 9958 ns 9833.5 ns 1.01
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 10000 ns 9542 ns 1.05
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 9916 ns 10708 ns 0.93
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 9625 ns 9916 ns 0.97
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA 260297.5 ns 262941 ns 0.99
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) 6084 ns 7291 ns 0.83
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) 6250 ns 5833 ns 1.07
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) 6395.5 ns 9625 ns 0.66
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) 5770.5 ns 7250 ns 0.80
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA 225464 ns 234003 ns 0.96
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 7458 ns 7333 ns 1.02
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 7125 ns 7000 ns 1.02
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 7709 ns 7833 ns 0.98
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 7291.5 ns 7250 ns 1.01
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA 787625 ns 810029.5 ns 0.97
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/2 thread(s) 2375 ns 2042 ns 1.16
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/4 thread(s) 2417 ns 2000 ns 1.21
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/8 thread(s) 2208 ns 2375 ns 0.93
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/1 thread(s) 2167 ns 2208 ns 0.98
bias_activation(2, act=gelu)(2 x 128)/forward/GPU/CUDA 18197 ns 18218 ns 1.00
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) 6583.5 ns 6542 ns 1.01
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) 6625 ns 6500 ns 1.02
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) 6875 ns 6708 ns 1.02
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) 6584 ns 6750 ns 0.98
bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/CUDA 334201 ns 335368 ns 1.00
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/2 thread(s) 746791 ns 750166 ns 1.00
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/4 thread(s) 749604 ns 746604.5 ns 1.00
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/8 thread(s) 750395.5 ns 751041 ns 1.00
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/1 thread(s) 748875 ns 761417 ns 0.98
bias_activation(512, act=tanh)(512 x 128)/forward/GPU/CUDA 21933 ns 21856 ns 1.00
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/2 thread(s) 775958.5 ns 775334 ns 1.00
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/4 thread(s) 789521 ns 775042 ns 1.02
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/8 thread(s) 797084 ns 804792 ns 0.99
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/1 thread(s) 808708.5 ns 791625 ns 1.02
bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/CUDA 295656 ns 299022 ns 0.99
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 7417 ns 7375 ns 1.01
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 5125 ns 5875 ns 0.87
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 5958 ns 5208 ns 1.14
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 10125 ns 10125 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 33608 ns 32492 ns 1.03
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 220416 ns 233188 ns 0.95
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 234416.5 ns 227750 ns 1.03
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 266500 ns 254458 ns 1.05
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 217792 ns 255583 ns 0.85
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 363230.5 ns 359227 ns 1.01
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 10792 ns 11042 ns 0.98
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 11833 ns 12458 ns 0.95
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 12292 ns 12959 ns 0.95
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 10062.5 ns 12000 ns 0.84
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA 247964 ns 245075.5 ns 1.01
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 24979 ns 24875 ns 1.00
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 24875 ns 24458 ns 1.02
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 25542 ns 25458 ns 1.00
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 24584 ns 24583.5 ns 1.00
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA 1136226 ns 1120608 ns 1.01
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s) 107466667 ns 106980458 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s) 126780083 ns 118006979.5 ns 1.07
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s) 119831458 ns 123940208 ns 0.97
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s) 117447125 ns 118407959 ns 0.99
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/GPU/CUDA 2659299 ns 2661574 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s) 393751417 ns 394378313 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s) 371911479 ns 368164500 ns 1.01
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s) 422962500 ns 358657167 ns 1.18
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s) 483853792 ns 482282708 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/GPU/CUDA 15200417 ns 15138278 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s) 946860708 ns 759267583 ns 1.25
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s) 771052125 ns 577881125 ns 1.33
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s) 743467646.5 ns 749378833 ns 0.99
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s) 950911604.5 ns 945671312.5 ns 1.01
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) 7083 ns 7458 ns 0.95
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) 7542 ns 7958 ns 0.95
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) 7917 ns 8750 ns 0.90
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) 7375 ns 7333 ns 1.01
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA 239130 ns 235620 ns 1.01
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 14833 ns 14500 ns 1.02
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 14041 ns 13333 ns 1.05
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 14083 ns 15041 ns 0.94
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 15375 ns 14292 ns 1.08
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA 1090512.5 ns 1078273.5 ns 1.01
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) 6833 ns 8542 ns 0.80
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) 7500 ns 7792 ns 0.96
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) 8375 ns 9187.5 ns 0.91
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) 6334 ns 7833.5 ns 0.81
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA 238152.5 ns 235827.5 ns 1.01
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 13208 ns 13167 ns 1.00
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 12666 ns 12084 ns 1.05
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 12708 ns 13084 ns 0.97
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 12917 ns 12833 ns 1.01
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA 802056 ns 787391.5 ns 1.02
batchedmm(2, Bsize=128)/forward/CPU/2 thread(s) 350458 ns 347250 ns 1.01
batchedmm(2, Bsize=128)/forward/CPU/4 thread(s) 329666 ns 344875 ns 0.96
batchedmm(2, Bsize=128)/forward/CPU/8 thread(s) 395709 ns 409896 ns 0.97
batchedmm(2, Bsize=128)/forward/CPU/1 thread(s) 316709 ns 310562 ns 1.02
batchedmm(2, Bsize=128)/forward/GPU/CUDA 17058.5 ns 16566 ns 1.03
batchedmm(2, Bsize=128)/zygote/CPU/2 thread(s) 708541.5 ns 713833.5 ns 0.99
batchedmm(2, Bsize=128)/zygote/CPU/4 thread(s) 727687.5 ns 727291 ns 1.00
batchedmm(2, Bsize=128)/zygote/CPU/8 thread(s) 1032625 ns 1023416 ns 1.01
batchedmm(2, Bsize=128)/zygote/CPU/1 thread(s) 647187 ns 654959 ns 0.99
batchedmm(2, Bsize=128)/zygote/GPU/CUDA 201070 ns 197250.5 ns 1.02
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) 375 ns 292 ns 1.28
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) 375 ns 291 ns 1.29
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) 375 ns 375 ns 1.00
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) 375 ns 375 ns 1.00
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA 23717 ns 23066 ns 1.03
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 6708 ns 6250 ns 1.07
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 6708 ns 6334 ns 1.06
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 6750 ns 6750 ns 1.00
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 6541 ns 6791 ns 0.96
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA 243248.5 ns 238420 ns 1.02
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 5875 ns 5750 ns 1.02
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 5875 ns 5750 ns 1.02
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 5833 ns 5875 ns 0.99
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 5917 ns 5834 ns 1.01
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA 25115.5 ns 23863 ns 1.05
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 21333 ns 21750 ns 0.98
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 21125 ns 21000 ns 1.01
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 21250 ns 21958 ns 0.97
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 23584 ns 21708 ns 1.09
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA 266274.5 ns 261085 ns 1.02
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 148458 ns 152146 ns 0.98
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 147750 ns 145250 ns 1.02
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 151062.5 ns 149541 ns 1.01
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 146167 ns 145937 ns 1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 167783 ns 166536.5 ns 1.01
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1323750 ns 1328792 ns 1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1358667 ns 1319083.5 ns 1.03
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1313750 ns 1350812.5 ns 0.97
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1301166 ns 1317084 ns 0.99
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 1350876 ns 1336276 ns 1.01
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 23229 ns 24917 ns 0.93
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 23958 ns 24208 ns 0.99
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 25479 ns 25708 ns 0.99
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 21834 ns 24208.5 ns 0.90
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 286227 ns 351114.5 ns 0.82
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 178625 ns 131125 ns 1.36
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 183854 ns 117791 ns 1.56
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 131375 ns 172917 ns 0.76
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 130417 ns 177334 ns 0.74
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 1475072 ns 1465398.5 ns 1.01
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 375 ns 333 ns 1.13
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 375 ns 292 ns 1.28
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 375 ns 375 ns 1
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 334 ns 375 ns 0.89
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA 23581 ns 22926 ns 1.03
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 6791 ns 6417 ns 1.06
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 6500 ns 6458 ns 1.01
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 6750 ns 6917 ns 0.98
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 6375 ns 6542 ns 0.97
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA 259663.5 ns 254551 ns 1.02
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 4750 ns 7625 ns 0.62
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 4584 ns 4167 ns 1.10
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 6041 ns 7708.5 ns 0.78
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 4625 ns 7375 ns 0.63
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA 257684 ns 250274.5 ns 1.03
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 10708 ns 10042 ns 1.07
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 9833 ns 9708 ns 1.01
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 10292 ns 10333 ns 1.00
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 10375 ns 10250 ns 1.01
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA 1358165 ns 1345295 ns 1.01
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/2 thread(s) 1583 ns 1584 ns 1.00
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/4 thread(s) 1625 ns 1625 ns 1
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/8 thread(s) 1583 ns 1625 ns 0.97
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/1 thread(s) 1625 ns 1625 ns 1
dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/CUDA 23401 ns 22897 ns 1.02
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) 5625 ns 5625 ns 1
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) 5667 ns 5584 ns 1.01
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) 5958 ns 5959 ns 1.00
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) 5625 ns 5958 ns 0.94
dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/CUDA 277871.5 ns 271438.5 ns 1.02
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s) 6812917 ns 6886125 ns 0.99
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s) 6411729 ns 6378229 ns 1.01
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s) 6505500 ns 6526875 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s) 7506500 ns 7602250 ns 0.99
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/GPU/CUDA 215844 ns 213111 ns 1.01
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s) 24090645.5 ns 24073062 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s) 21288541 ns 21283625 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s) 21056833.5 ns 21045584 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s) 29826229 ns 29677875 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/GPU/CUDA 2100052 ns 2108165 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s) 49024708 ns 37353145.5 ns 1.31
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s) 45355854 ns 34386667 ns 1.32
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s) 45722417 ns 45930020.5 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s) 49511500 ns 49322334 ns 1.00
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) 6167 ns 7708.5 ns 0.80
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) 6792 ns 5875 ns 1.16
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) 6833 ns 8333 ns 0.82
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) 6041.5 ns 7062.5 ns 0.86
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA 234601.5 ns 238522.5 ns 0.98
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 8958 ns 8458 ns 1.06
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 8500 ns 8042 ns 1.06
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 8417 ns 8583 ns 0.98
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 8625 ns 8292 ns 1.04
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA 1058513.5 ns 1070850 ns 0.99
lenet(28, 28, 1, 128)/forward/CPU/2 thread(s) 1558500 ns 1544374.5 ns 1.01
lenet(28, 28, 1, 128)/forward/CPU/4 thread(s) 1250312.5 ns 1259666.5 ns 0.99
lenet(28, 28, 1, 128)/forward/CPU/8 thread(s) 1630333 ns 1632771 ns 1.00
lenet(28, 28, 1, 128)/forward/CPU/1 thread(s) 2133792 ns 2150667 ns 0.99
lenet(28, 28, 1, 128)/forward/GPU/CUDA 282051.5 ns 278945 ns 1.01
lenet(28, 28, 1, 128)/zygote/CPU/2 thread(s) 7880208 ns 7908937.5 ns 1.00
lenet(28, 28, 1, 128)/zygote/CPU/4 thread(s) 6620833 ns 6609937 ns 1.00
lenet(28, 28, 1, 128)/zygote/CPU/8 thread(s) 7206000 ns 7237750.5 ns 1.00
lenet(28, 28, 1, 128)/zygote/CPU/1 thread(s) 10474041 ns 10434334 ns 1.00
lenet(28, 28, 1, 128)/zygote/GPU/CUDA 1900822 ns 1889956 ns 1.01
batchedmm(128, Bsize=4)/forward/CPU/2 thread(s) 346250 ns 340979 ns 1.02
batchedmm(128, Bsize=4)/forward/CPU/4 thread(s) 328854 ns 345792 ns 0.95
batchedmm(128, Bsize=4)/forward/CPU/8 thread(s) 405042 ns 417125 ns 0.97
batchedmm(128, Bsize=4)/forward/CPU/1 thread(s) 340459 ns 345833 ns 0.98
batchedmm(128, Bsize=4)/forward/GPU/CUDA 43005 ns 42448 ns 1.01
batchedmm(128, Bsize=4)/zygote/CPU/2 thread(s) 748124.5 ns 746500.5 ns 1.00
batchedmm(128, Bsize=4)/zygote/CPU/4 thread(s) 772375 ns 784542 ns 0.98
batchedmm(128, Bsize=4)/zygote/CPU/8 thread(s) 1074375 ns 1073250 ns 1.00
batchedmm(128, Bsize=4)/zygote/CPU/1 thread(s) 760354.5 ns 761062.5 ns 1.00
batchedmm(128, Bsize=4)/zygote/GPU/CUDA 306884 ns 303720.5 ns 1.01
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/2 thread(s) 397458 ns 397500 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/4 thread(s) 212042 ns 288250 ns 0.74
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/8 thread(s) 288500 ns 212666 ns 1.36
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/1 thread(s) 750625 ns 756084 ns 0.99
dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/CUDA 44869 ns 43887 ns 1.02
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/2 thread(s) 670667 ns 671083 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/4 thread(s) 471167 ns 530083 ns 0.89
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/8 thread(s) 530000 ns 470667 ns 1.13
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/1 thread(s) 973833 ns 974750 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/CUDA 193447 ns 188388.5 ns 1.03
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 581041.5 ns 679250 ns 0.86
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 635229.5 ns 645333.5 ns 0.98
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 653000 ns 642458 ns 1.02
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 648562.5 ns 638562.5 ns 1.02
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 132519 ns 131530 ns 1.01
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2467438 ns 2409292 ns 1.02
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2534479 ns 2456416.5 ns 1.03
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2455458 ns 2514583 ns 0.98
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 2440417 ns 2456292 ns 0.99
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 1289127 ns 1277300 ns 1.01
batchedmm(2, Bsize=32)/forward/CPU/2 thread(s) 347416.5 ns 345146 ns 1.01
batchedmm(2, Bsize=32)/forward/CPU/4 thread(s) 344624.5 ns 343583 ns 1.00
batchedmm(2, Bsize=32)/forward/CPU/8 thread(s) 393083 ns 403708.5 ns 0.97
batchedmm(2, Bsize=32)/forward/CPU/1 thread(s) 314209 ns 312208 ns 1.01
batchedmm(2, Bsize=32)/forward/GPU/CUDA 15737 ns 16009 ns 0.98
batchedmm(2, Bsize=32)/zygote/CPU/2 thread(s) 705458 ns 709667 ns 0.99
batchedmm(2, Bsize=32)/zygote/CPU/4 thread(s) 724687 ns 724500 ns 1.00
batchedmm(2, Bsize=32)/zygote/CPU/8 thread(s) 1024417 ns 1022687.5 ns 1.00
batchedmm(2, Bsize=32)/zygote/CPU/1 thread(s) 644354 ns 650417 ns 0.99
batchedmm(2, Bsize=32)/zygote/GPU/CUDA 200179.5 ns 195917 ns 1.02
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 1464625 ns 1460417 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 1497083 ns 1500812.5 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 1505541 ns 1496375 ns 1.01
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 1442125 ns 1438708 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 41443 ns 40600 ns 1.02
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 5133584 ns 5128791 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 5289208 ns 5302375 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 5294792 ns 5313000 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 4976875 ns 4970208.5 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 200107.5 ns 196206.5 ns 1.02
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/2 thread(s) 3667 ns 3667 ns 1
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/4 thread(s) 3709 ns 3667 ns 1.01
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/8 thread(s) 3708 ns 3709 ns 1.00
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/1 thread(s) 3708 ns 3708 ns 1
dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/CUDA 33649 ns 32895 ns 1.02
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/2 thread(s) 15250 ns 15167 ns 1.01
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/4 thread(s) 15166 ns 15083 ns 1.01
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/8 thread(s) 15375 ns 15083 ns 1.02
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/1 thread(s) 15292 ns 15375 ns 0.99
dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/CUDA 384079.5 ns 376729 ns 1.02
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/2 thread(s) 71000 ns 71459 ns 0.99
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/4 thread(s) 71583 ns 71250 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/8 thread(s) 71292 ns 71375 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/1 thread(s) 71084 ns 70708 ns 1.01
dense(512, bias=false, act=relu)(512 x 128)/forward/GPU/CUDA 113767.5 ns 113177.5 ns 1.01
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/2 thread(s) 318209 ns 317917 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/4 thread(s) 326667 ns 320417 ns 1.02
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/8 thread(s) 318708 ns 325333 ns 0.98
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/1 thread(s) 317750 ns 320916 ns 0.99
dense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/CUDA 197585.5 ns 193043 ns 1.02
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 1042 ns 958 ns 1.09
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 1084 ns 958 ns 1.13
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 1042 ns 1083 ns 0.96
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 1083 ns 1042 ns 1.04
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA 23981 ns 23363 ns 1.03
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 8334 ns 8083 ns 1.03
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 8166 ns 7792 ns 1.05
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 8542 ns 8750 ns 0.98
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 8167 ns 8750 ns 0.93
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA 265125.5 ns 260535.5 ns 1.02
batchedmm(128, Bsize=32)/forward/CPU/2 thread(s) 473542 ns 475499.5 ns 1.00
batchedmm(128, Bsize=32)/forward/CPU/4 thread(s) 457604 ns 470520.5 ns 0.97
batchedmm(128, Bsize=32)/forward/CPU/8 thread(s) 551709 ns 557125 ns 0.99
batchedmm(128, Bsize=32)/forward/CPU/1 thread(s) 532104.5 ns 557959 ns 0.95
batchedmm(128, Bsize=32)/forward/GPU/CUDA 130158 ns 129404 ns 1.01
batchedmm(128, Bsize=32)/zygote/CPU/2 thread(s) 1398125 ns 1399270.5 ns 1.00
batchedmm(128, Bsize=32)/zygote/CPU/4 thread(s) 1382291 ns 1382375 ns 1.00
batchedmm(128, Bsize=32)/zygote/CPU/8 thread(s) 1633958 ns 1611125 ns 1.01
batchedmm(128, Bsize=32)/zygote/CPU/1 thread(s) 1575646 ns 1582104.5 ns 1.00
batchedmm(128, Bsize=32)/zygote/GPU/CUDA 276840 ns 274924 ns 1.01
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 375 ns 375 ns 1
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 375 ns 250 ns 1.50
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 375 ns 375 ns 1
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 375 ns 375 ns 1
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA 32584 ns 31647 ns 1.03
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 6334 ns 6375 ns 0.99
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 6458 ns 6042 ns 1.07
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 6583 ns 6666 ns 0.99
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 6625 ns 6625 ns 1
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA 268198.5 ns 262541.5 ns 1.02
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 1719562.5 ns 1761833 ns 0.98
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 1729708 ns 1723396 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 1726959 ns 1733812.5 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 1722666 ns 1730625 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 169321 ns 169477.5 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 4372708 ns 4358625 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 4391416 ns 4358708 ns 1.01
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 4359145.5 ns 4403062.5 ns 0.99
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 4362875 ns 4373875 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 1232540 ns 1208123 ns 1.02
bias_activation(512, act=relu)(512 x 128)/forward/CPU/2 thread(s) 7042 ns 7167 ns 0.98
bias_activation(512, act=relu)(512 x 128)/forward/CPU/4 thread(s) 6729.5 ns 6875 ns 0.98
bias_activation(512, act=relu)(512 x 128)/forward/CPU/8 thread(s) 6812.5 ns 6916 ns 0.99
bias_activation(512, act=relu)(512 x 128)/forward/CPU/1 thread(s) 6750 ns 6750 ns 1
bias_activation(512, act=relu)(512 x 128)/forward/GPU/CUDA 20395.5 ns 20662 ns 0.99
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/2 thread(s) 32500 ns 51625 ns 0.63
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/4 thread(s) 32562.5 ns 32917 ns 0.99
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/8 thread(s) 48500 ns 48208.5 ns 1.01
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/1 thread(s) 51771 ns 51417 ns 1.01
bias_activation(512, act=relu)(512 x 128)/zygote/GPU/CUDA 210266.5 ns 292106.5 ns 0.72
batchedmm(2, Bsize=512)/forward/CPU/2 thread(s) 356646 ns 354562.5 ns 1.01
batchedmm(2, Bsize=512)/forward/CPU/4 thread(s) 335791.5 ns 348666.5 ns 0.96
batchedmm(2, Bsize=512)/forward/CPU/8 thread(s) 402625 ns 433333 ns 0.93
batchedmm(2, Bsize=512)/forward/CPU/1 thread(s) 323416 ns 322041.5 ns 1.00
batchedmm(2, Bsize=512)/forward/GPU/CUDA 18222 ns 18353 ns 0.99
batchedmm(2, Bsize=512)/zygote/CPU/2 thread(s) 725125 ns 724625 ns 1.00
batchedmm(2, Bsize=512)/zygote/CPU/4 thread(s) 737333 ns 730583 ns 1.01
batchedmm(2, Bsize=512)/zygote/CPU/8 thread(s) 1031875 ns 1038687.5 ns 0.99
batchedmm(2, Bsize=512)/zygote/CPU/1 thread(s) 669958.5 ns 675333 ns 0.99
batchedmm(2, Bsize=512)/zygote/GPU/CUDA 344455.5 ns 335730.5 ns 1.03
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/2 thread(s) 75208 ns 75458 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/4 thread(s) 75500 ns 75333 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/8 thread(s) 75500 ns 75375 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/1 thread(s) 75125 ns 74584 ns 1.01
dense(512, bias=true, act=relu)(512 x 128)/forward/GPU/CUDA 47155 ns 46864.5 ns 1.01
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/2 thread(s) 326834 ns 325166 ns 1.01
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/4 thread(s) 333625 ns 324250 ns 1.03
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/8 thread(s) 328209 ns 336875 ns 0.97
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/1 thread(s) 324000 ns 325125 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/CUDA 211460.5 ns 209059.5 ns 1.01
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 1488895.5 ns 1485709 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 1521291 ns 1526833 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 1530292 ns 1522792 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 1466500 ns 1462625 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 52743 ns 51397 ns 1.03
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 5023833 ns 5113395.5 ns 0.98
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 5236584 ns 5295292 ns 0.99
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 5286958 ns 5300812.5 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 4985229 ns 5001042 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 203850 ns 202971.5 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/2 thread(s) 28125 ns 28250 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/4 thread(s) 28250 ns 28208 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/8 thread(s) 28167 ns 28208 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/1 thread(s) 28167 ns 28209 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/CUDA 24574 ns 24514.5 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) 66417 ns 66417 ns 1
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) 66333 ns 66458 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) 66542 ns 66500 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) 66500 ns 66500 ns 1
dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/CUDA 530555.5 ns 505942 ns 1.05
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/2 thread(s) 1396021 ns 1502084 ns 0.93
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/4 thread(s) 943916.5 ns 1124250 ns 0.84
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/8 thread(s) 1146250 ns 944270.5 ns 1.21
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/1 thread(s) 2256750 ns 2255250 ns 1.00
mlp7layer_bn(tanh)(32 x 256)/forward/GPU/CUDA 570145.5 ns 566674 ns 1.01
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/2 thread(s) 3088833 ns 3090791 ns 1.00
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/4 thread(s) 2633020.5 ns 2751542 ns 0.96
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/8 thread(s) 2704979.5 ns 2628896 ns 1.03
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/1 thread(s) 3774834 ns 3819709 ns 0.99
mlp7layer_bn(tanh)(32 x 256)/zygote/GPU/CUDA 2057819 ns 1979936 ns 1.04
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/2 thread(s) 8833062 ns 8847333 ns 1.00
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/4 thread(s) 8756313 ns 8768375 ns 1.00
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/8 thread(s) 8786291.5 ns 8750250 ns 1.00
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/1 thread(s) 6371937.5 ns 6340375 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 80521 ns 85125 ns 0.95
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 84083 ns 83021 ns 1.01
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 81292 ns 85708.5 ns 0.95
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 81479 ns 83562.5 ns 0.98
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 192849.5 ns 192703 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1749833 ns 2012875 ns 0.87
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2022542 ns 2024062.5 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2022167 ns 2038542 ns 0.99
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 2008646 ns 2008812 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 800645 ns 791664.5 ns 1.01

This comment was automatically generated by workflow using github-action-benchmark.
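
Note on reading the rows above: the last column appears to be the first timing divided by the second (for example, 183854 / 117791 ≈ 1.56). A minimal sketch for parsing pasted rows and listing the ones whose ratio exceeds a cutoff is below. The helper names `parse_row` and `flag_rows` and the 1.3 cutoff are hypothetical illustrations, not part of github-action-benchmark or of this repository's workflow.

```python
import re

# Row layout as it appears in the comment: "<name> <first> ns <second> ns <ratio>".
ROW = re.compile(
    r"^(?P<name>.+?) (?P<first>[\d.]+) ns (?P<second>[\d.]+) ns (?P<ratio>[\d.]+)$"
)


def parse_row(line):
    """Split one pasted benchmark row into (name, first_ns, second_ns, ratio)."""
    match = ROW.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognized row: {line!r}")
    return (
        match["name"],
        float(match["first"]),
        float(match["second"]),
        float(match["ratio"]),
    )


def flag_rows(lines, threshold=1.3):
    """Return names of rows where first / second >= threshold.

    The threshold is an arbitrary value chosen for illustration.
    """
    flagged = []
    for line in lines:
        name, first, second, _ = parse_row(line)
        if first / second >= threshold:
            flagged.append(name)
    return flagged
```

Many of the sub-microsecond CPU rows differ by only a few dozen nanoseconds, so a large ratio there may reflect timer noise rather than a real change.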

@avik-pal force-pushed the ap/workflows branch 3 times, most recently from c88703c to 15651bb, November 4, 2024 23:35
@avik-pal merged commit bf1f12b into main Nov 5, 2024
32 of 66 checks passed
@avik-pal deleted the ap/workflows branch November 5, 2024 03:10