Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add bytes_per_second to transpose benchmark #14170

Merged

Conversation

Blonck
Copy link
Contributor

@Blonck Blonck commented Sep 22, 2023

This patch relates to #13735.

Benchmark: transpose_benchmark.txt

Checklist

@Blonck Blonck requested a review from a team as a code owner September 22, 2023 10:34
@Blonck Blonck requested review from bdice and vuule September 22, 2023 10:35
@copy-pr-bot
Copy link

copy-pr-bot bot commented Sep 22, 2023

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions github-actions bot added the libcudf Affects libcudf (C++/CUDA) code. label Sep 22, 2023
@davidwendt
Copy link
Contributor

/ok to test

@davidwendt davidwendt added 3 - Ready for Review Ready for review by team improvement Improvement / enhancement to an existing function non-breaking Non-breaking change labels Sep 22, 2023
@Blonck Blonck force-pushed the processed_bytes_transpose_bench branch from ded8fa9 to d2676b5 Compare September 26, 2023 13:46
@Blonck Blonck requested review from a team as code owners September 26, 2023 13:46
@Blonck Blonck requested a review from charlesbluca September 26, 2023 13:46
@Blonck Blonck changed the base branch from branch-23.10 to branch-23.12 September 26, 2023 13:46
@wence- wence- removed request for a team and charlesbluca September 26, 2023 14:00
Copy link
Member

@harrism harrism left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for contributing this. Just one request for maintainability.

@@ -40,16 +40,29 @@ static void BM_transpose(benchmark::State& state)
cuda_event_timer raii(state, true);
auto output = cudf::transpose(input);
}

// Collect memory statistics.
auto const bytes_read = input.num_columns() * input.num_rows() * (sizeof(int32_t));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would like to avoid potential future type mismatches that result in wrong bytes/s reports. So I think you should stash the type_id in a variable above:

constexpr auto column_type = cudf::type_id::INT32;

And then here use CUDF's id_to_type utility:

Suggested change
auto const bytes_read = input.num_columns() * input.num_rows() * (sizeof(int32_t));
auto const bytes_read = input.num_columns() * input.num_rows() * (sizeof(cudf::id_to_type(column_type)));

See https://docs.rapids.ai/api/libcudf/stable/group__utility__dispatcher#gad7e12b8accf60e7c0e500294e1ee8536

@@ -42,7 +44,7 @@ static void BM_transpose(benchmark::State& state)
}

// Collect memory statistics.
auto const bytes_read = input.num_columns() * input.num_rows() * (sizeof(int32_t));
auto const bytes_read = input.num_columns() * input.num_rows() * cudf::size_of(column_type);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 suggestion: ‏ This is one way to do it. But I think the way I suggested is a bit better because it all happens at compile time, whereas cudf::size_of() invokes the type dispatcher at run time. Not that it will affect benchmarks, but it just seems cleaner to use sizeof(cudf::id_to_type<column_type>).

Copy link
Member

@harrism harrism left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @Blonck !

@harrism
Copy link
Member

harrism commented Sep 28, 2023

/ok to test

Copy link
Member

@harrism harrism left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @Blonck !

@PointKernel
Copy link
Member

@Blonck Can you please rebase with the latest branch-23.12 and fix the formatting issues?

auto const bytes_written = bytes_read;
// Account for nullability in input and output.
auto const null_bytes =
2 * input.num_columns() * cudf::bitmask_allocation_size_bytes(input.num_rows());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion: ‏This one could also overflow, I think, perhaps:

Suggested change
2 * input.num_columns() * cudf::bitmask_allocation_size_bytes(input.num_rows());
2 * static_cast<uint64_t>(input.num_columns()) * cudf::bitmask_allocation_size_bytes(input.num_rows());

?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you sure about this one? Since the return type of cudf::bitmask_allocation_size_bytes is std::size_t which is either unsigned long or unsigned long long so for reasonable input sizes the integer type promotion will avoid the overflow (https://cppinsights.io/s/26f977cb).

That said, just having this discussion indicates I should've included an explicit cast upfront to clear up any potential confusion.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left-to-right associativity means that this is evaluated as (2 * ncol) * nrow, the first multiplication is performed in size_type (AKA, int32_t), so that could overflow, no? Although I think these benchmarks are generally run with fewer than $2^{30}$ rows, there's in general no reason why they couldn't be (although the transpose performance will be terrible I grant you).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not just always put the thing that returns size_t (sizeof or bitmask_allocation_size_bytes) first in the arithmetic in all of these PRs?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Personally, I would keep the cast explicit to make visible what is happening, but I don't have a strong stance on it.

@wence-
Copy link
Contributor

wence- commented Oct 2, 2023

/ok to test

@harrism
Copy link
Member

harrism commented Oct 2, 2023

/ok to test

@harrism
Copy link
Member

harrism commented Oct 3, 2023

/ok to test

@harrism
Copy link
Member

harrism commented Oct 3, 2023

/ok to test

@harrism
Copy link
Member

harrism commented Oct 3, 2023

/merge

@harrism
Copy link
Member

harrism commented Oct 4, 2023

/ok to test

@harrism
Copy link
Member

harrism commented Oct 4, 2023

/merge

1 similar comment
@ttnghia
Copy link
Contributor

ttnghia commented Oct 10, 2023

/merge

@rapids-bot rapids-bot bot merged commit c0c7ed8 into rapidsai:branch-23.12 Oct 10, 2023
@harrism
Copy link
Member

harrism commented Oct 11, 2023

Wonder why my merge commands weren't accepted.

@ttnghia
Copy link
Contributor

ttnghia commented Oct 11, 2023

It means now github starts to realize that you are no longer cudf developer 😆

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3 - Ready for Review Ready for review by team improvement Improvement / enhancement to an existing function libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants