Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update name for Arrow R package #66

Closed
wants to merge 1 commit into from

Conversation

amoeba
Copy link

@amoeba amoeba commented Nov 21, 2023

Hi @Tmonster, I contribute to the Arrow R package and noticed Arrow's inclusion on https://duckdblabs.github.io/db-benchmark/ which is great to see.

I think a minor improvement to the way this benchmark is presented would be to make it clearer that a specific implementation of Arrow is what's being benchmarked since the name Arrow by itself generally refers to the columnar format and not any one implementation.

I've updated the README and made my best guess at changing the name in the plots since I didn't run them. Let me know if that looks like it'll do the trick. For the changed name for the plots, I opted to follow the style you're already using with (py)data.table but feel free to suggest something else. We generally refer to the Arrow R package as "R arrow" or "arrow R" so any string which contains those would be good.

@Tmonster
Copy link
Collaborator

Hi @amoeba,

I just forked the benchmark and started maintaining it as it was before, so sorry about the mixup. I took a look at your PR and I think a bit more work is required to change the name from arrow to R-arrow.

Most switches are minor, you could almost just do a find replace with "arrow" and "r-arrow". One thing that isn't simple to fix is how to deal with the old results. Results are loaded and divided based on solution name, dataset, question etc. I don't necessarily want to change the name of all of the old results since that can muddy the history of the time.csv and logs.csv.

I'll look into how we can change those values once the results are loaded for report generation. I'll get back to you on this

@jangorecki
Copy link

Easiest to update to new name in report.R file

@eitsupi
Copy link

eitsupi commented Nov 22, 2023

I am wondering about the need to mark it as Acero as per the discussion in the old repo. h2oai#229

@jangorecki
Copy link

jangorecki commented Nov 22, 2023

This is the place for cleaning up time.csv entries if something got renamed or produced different results in the past etc.

clean_time = function(d) {
if (nrow(d[!nzchar(version) | is.na(version)]))
stop("timings data contains NA or '' as version field, that should not happen")
old_advanced_groupby_questions = c("median v3 sd v3 by id2 id4","max v1 - min v2 by id2 id4","largest two v3 by id2 id4","regression v1 v2 by id2 id4","sum v3 count by id1:id6")
d[!nzchar(git), git := NA_character_
][,"on_disk" := as.logical(on_disk)
][task=="groupby" & solution%in%c("pandas","dask","spark") & batch<1558106628, "out_cols" := NA_integer_
][task=="groupby" & solution=="dask" & batch<1558106628 & question%in%c("max v1 - min v2 by id2 id4","regression v1 v2 by id2 id4"), c("out_rows","out_cols","chk") := .(NA_integer_, NA_integer_, NA_character_)
][task=="groupby" & solution=="pandas" & batch<=1558106628 & question=="largest two v3 by id2 id4", "out_cols" := NA_integer_
][task=="groupby" & solution=="spark" & batch<1548084547, "chk_time_sec" := NA_real_ # spark chk calculation speed up, NA to make validation work on bigger threshold
][task=="groupby" & question%in%old_advanced_groupby_questions & batch<1573882448, c("out_rows","out_cols","chk") := list(NA_integer_, NA_integer_, NA_character_)
][task=="groupby" & solution=="dask" & batch>=1609583373 & batch<Inf & question=="regression v1 v2 by id2 id4", c("out_rows","chk") := .(NA_integer_, NA_character_) ## change Inf to batch after upgrading to dask#7024
][solution=="polars" & batch<=1622492790, c("chk","out_rows") := list(NA_character_, NA_integer_) # polars NA handling broken in 0.7.19? #223
][, `:=`(nodename=ft(nodename), in_rows=ft(in_rows), question=ft(question), solution=ft(solution), fun=ft(fun), version=ft(version), git=ft(git), task=ft(task),
data=fctr(data, levels=unlist(get_data_levels())))
][]
}

dt[solution=="arrow", "solution" := "newname"]

@amoeba
Copy link
Author

amoeba commented Nov 22, 2023

In the event that updating past results is hard or problematic, I think a change going forward would be just fine. Thanks for being willing to look into this @Tmonster and let me know if I can help.

@eitsupi: Thanks for finding that thread. While Acero makes some sense, or even arrow-dplyr as was also suggested in that thread, I think "R arrow" or similar is still probably the best. While Acero is mostly what's responsible for the features and performance the benchmark is reporting on, (1) the Arrow R package is the installable thing the benchmark runs and (2) the code snippet shown next to each result isn't really Acero in the broad sense but the novel bindings the Arrow R package built on top. I know you're familiar with all of the above so the above so this is just to explain my rationale for the proposed rename. Happy to continue this discussion here or elsewhere though.

@eitsupi
Copy link

eitsupi commented Nov 22, 2023

Thanks for your response.

I think the point here is that in most other cases, the "engine" is used as the name, not the language or API.
That said, I think data.table, Spark, DuckDB, and Polars can be manipulated via dplyr, but I don't think we are actually using dplyr here, so something like "dplyr-Acero" could be appropriate to explicitly state that it is dplyr API.

The name Acero is now in dplyr's README, so users should be able to quickly find out what it is.

@thisisnic
Copy link

IMO dplyr-Acero is OK as long as both components are mentioned as it's not a good test of just Acero given that the R package is adding a lot more in here. That said, I feel like "arrow R package" or "R arrow" is much more recognisable shorthand for what is effectively multiple things bundled together here as there are more things in the R package which may affect the performance than just Acero and our dplyr bindings - the internal R code which works out how to construct ExecPlans to run in Acero, how we make use of ALTREP, etc.

@Tmonster
Copy link
Collaborator

Hi @amoeba

I have a branch up here that should handle all of the logic to rename arrow to R-arrow in the report and also during future logging. It is running CI right now, but feel free to copy the PR, change the name to something else if desired and open it up here against master. If CI is green I can merge and also update the report.

Tmonster#10

@amoeba
Copy link
Author

amoeba commented Nov 30, 2023

Thanks @Tmonster, "R-arrow" is great. And thanks for jumping on this so quickly, we appreciate it.

@amoeba amoeba closed this Nov 30, 2023
@Tmonster
Copy link
Collaborator

Tmonster commented Dec 4, 2023

PR for this is up here!
#68

@amoeba
Copy link
Author

amoeba commented Dec 4, 2023

Thanks @Tmonster!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants