-
Notifications
You must be signed in to change notification settings - Fork 33
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Update name for Arrow R package #66
Conversation
Hi @amoeba, I just forked the benchmark and started maintaining it as it was before, so sorry about the mixup. I took a look at your PR and I think a bit more work is required to change the name from arrow to Most switches are minor, you could almost just do a find replace with "arrow" and "r-arrow". One thing that isn't simple to fix is how to deal with the old results. Results are loaded and divided based on solution name, dataset, question etc. I don't necessarily want to change the name of all of the old results since that can muddy the history of the I'll look into how we can change those values once the results are loaded for report generation. I'll get back to you on this |
Easiest to update to new name in report.R file |
I am wondering about the need to mark it as Acero as per the discussion in the old repo. h2oai#229 |
This is the place for cleaning up time.csv entries if something got renamed or produced different results in the past etc. Lines 68 to 84 in 4901623
dt[solution=="arrow", "solution" := "newname"] |
In the event that updating past results is hard or problematic, I think a change going forward would be just fine. Thanks for being willing to look into this @Tmonster and let me know if I can help. @eitsupi: Thanks for finding that thread. While Acero makes some sense, or even arrow-dplyr as was also suggested in that thread, I think "R arrow" or similar is still probably the best. While Acero is mostly what's responsible for the features and performance the benchmark is reporting on, (1) the Arrow R package is the installable thing the benchmark runs and (2) the code snippet shown next to each result isn't really Acero in the broad sense but the novel bindings the Arrow R package built on top. I know you're familiar with all of the above so the above so this is just to explain my rationale for the proposed rename. Happy to continue this discussion here or elsewhere though. |
Thanks for your response. I think the point here is that in most other cases, the "engine" is used as the name, not the language or API. The name Acero is now in dplyr's README, so users should be able to quickly find out what it is. |
IMO dplyr-Acero is OK as long as both components are mentioned as it's not a good test of just Acero given that the R package is adding a lot more in here. That said, I feel like "arrow R package" or "R arrow" is much more recognisable shorthand for what is effectively multiple things bundled together here as there are more things in the R package which may affect the performance than just Acero and our dplyr bindings - the internal R code which works out how to construct ExecPlans to run in Acero, how we make use of ALTREP, etc. |
Hi @amoeba I have a branch up here that should handle all of the logic to rename arrow to R-arrow in the report and also during future logging. It is running CI right now, but feel free to copy the PR, change the name to something else if desired and open it up here against master. If CI is green I can merge and also update the report. |
Thanks @Tmonster, "R-arrow" is great. And thanks for jumping on this so quickly, we appreciate it. |
PR for this is up here! |
Thanks @Tmonster! |
Hi @Tmonster, I contribute to the Arrow R package and noticed Arrow's inclusion on https://duckdblabs.github.io/db-benchmark/ which is great to see.
I think a minor improvement to the way this benchmark is presented would be to make it clearer that a specific implementation of Arrow is what's being benchmarked since the name Arrow by itself generally refers to the columnar format and not any one implementation.
I've updated the README and made my best guess at changing the name in the plots since I didn't run them. Let me know if that looks like it'll do the trick. For the changed name for the plots, I opted to follow the style you're already using with (py)data.table but feel free to suggest something else. We generally refer to the Arrow R package as "R arrow" or "arrow R" so any string which contains those would be good.