Update name for Arrow R package #66

amoeba · 2023-11-21T20:13:53Z

Hi @Tmonster, I contribute to the Arrow R package and noticed Arrow's inclusion on https://duckdblabs.github.io/db-benchmark/ which is great to see.

I think a minor improvement to the way this benchmark is presented would be to make it clearer that a specific implementation of Arrow is what's being benchmarked since the name Arrow by itself generally refers to the columnar format and not any one implementation.

I've updated the README and made my best guess at changing the name in the plots since I didn't run them. Let me know if that looks like it'll do the trick. For the changed name for the plots, I opted to follow the style you're already using with (py)data.table but feel free to suggest something else. We generally refer to the Arrow R package as "R arrow" or "arrow R" so any string which contains those would be good.

Tmonster · 2023-11-22T10:20:26Z

Hi @amoeba,

I just forked the benchmark and started maintaining it as it was before, so sorry about the mixup. I took a look at your PR and I think a bit more work is required to change the name from arrow to R-arrow.

Most switches are minor, you could almost just do a find replace with "arrow" and "r-arrow". One thing that isn't simple to fix is how to deal with the old results. Results are loaded and divided based on solution name, dataset, question etc. I don't necessarily want to change the name of all of the old results since that can muddy the history of the time.csv and logs.csv.

I'll look into how we can change those values once the results are loaded for report generation. I'll get back to you on this

jangorecki · 2023-11-22T10:41:05Z

Easiest to update to new name in report.R file

eitsupi · 2023-11-22T15:22:02Z

I am wondering about the need to mark it as Acero as per the discussion in the old repo. h2oai#229

jangorecki · 2023-11-22T15:26:48Z

This is the place for cleaning up time.csv entries if something got renamed or produced different results in the past etc.

db-benchmark/_report/report.R

Lines 68 to 84 in 4901623

    
           clean_time = function(d) { 
        
             if (nrow(d[!nzchar(version) | is.na(version)])) 
        
               stop("timings data contains NA or '' as version field, that should not happen") 
        
             old_advanced_groupby_questions = c("median v3 sd v3 by id2 id4","max v1 - min v2 by id2 id4","largest two v3 by id2 id4","regression v1 v2 by id2 id4","sum v3 count by id1:id6") 
        
             d[!nzchar(git), git := NA_character_ 
        
               ][,"on_disk" := as.logical(on_disk) 
        
                 ][task=="groupby" & solution%in%c("pandas","dask","spark") & batch<1558106628, "out_cols" := NA_integer_ 
        
                   ][task=="groupby" & solution=="dask" & batch<1558106628 & question%in%c("max v1 - min v2 by id2 id4","regression v1 v2 by id2 id4"), c("out_rows","out_cols","chk") := .(NA_integer_, NA_integer_, NA_character_) 
        
                     ][task=="groupby" & solution=="pandas" & batch<=1558106628 & question=="largest two v3 by id2 id4", "out_cols" := NA_integer_ 
        
                       ][task=="groupby" & solution=="spark" & batch<1548084547, "chk_time_sec" := NA_real_ # spark chk calculation speed up, NA to make validation work on bigger threshold 
        
                         ][task=="groupby" & question%in%old_advanced_groupby_questions & batch<1573882448, c("out_rows","out_cols","chk") := list(NA_integer_, NA_integer_, NA_character_) 
        
                           ][task=="groupby" & solution=="dask" & batch>=1609583373 & batch<Inf & question=="regression v1 v2 by id2 id4", c("out_rows","chk") := .(NA_integer_, NA_character_) ## change Inf to batch after upgrading to dask#7024 
        
                           ][solution=="polars" & batch<=1622492790, c("chk","out_rows") := list(NA_character_, NA_integer_) # polars NA handling broken in 0.7.19? #223 
        
                           ][, `:=`(nodename=ft(nodename), in_rows=ft(in_rows), question=ft(question), solution=ft(solution), fun=ft(fun), version=ft(version), git=ft(git), task=ft(task), 
        
                                    data=fctr(data, levels=unlist(get_data_levels()))) 
        
                               ][] 
        
           }

dt[solution=="arrow", "solution" := "newname"]

amoeba · 2023-11-22T19:18:51Z

In the event that updating past results is hard or problematic, I think a change going forward would be just fine. Thanks for being willing to look into this @Tmonster and let me know if I can help.

@eitsupi: Thanks for finding that thread. While Acero makes some sense, or even arrow-dplyr as was also suggested in that thread, I think "R arrow" or similar is still probably the best. While Acero is mostly what's responsible for the features and performance the benchmark is reporting on, (1) the Arrow R package is the installable thing the benchmark runs and (2) the code snippet shown next to each result isn't really Acero in the broad sense but the novel bindings the Arrow R package built on top. I know you're familiar with all of the above so the above so this is just to explain my rationale for the proposed rename. Happy to continue this discussion here or elsewhere though.

eitsupi · 2023-11-22T23:38:49Z

Thanks for your response.

I think the point here is that in most other cases, the "engine" is used as the name, not the language or API.
That said, I think data.table, Spark, DuckDB, and Polars can be manipulated via dplyr, but I don't think we are actually using dplyr here, so something like "dplyr-Acero" could be appropriate to explicitly state that it is dplyr API.

The name Acero is now in dplyr's README, so users should be able to quickly find out what it is.

thisisnic · 2023-11-23T12:35:38Z

IMO dplyr-Acero is OK as long as both components are mentioned as it's not a good test of just Acero given that the R package is adding a lot more in here. That said, I feel like "arrow R package" or "R arrow" is much more recognisable shorthand for what is effectively multiple things bundled together here as there are more things in the R package which may affect the performance than just Acero and our dplyr bindings - the internal R code which works out how to construct ExecPlans to run in Acero, how we make use of ALTREP, etc.

Tmonster · 2023-11-24T07:59:20Z

Hi @amoeba

I have a branch up here that should handle all of the logic to rename arrow to R-arrow in the report and also during future logging. It is running CI right now, but feel free to copy the PR, change the name to something else if desired and open it up here against master. If CI is green I can merge and also update the report.

Tmonster#10

amoeba · 2023-11-30T23:11:41Z

Thanks @Tmonster, "R-arrow" is great. And thanks for jumping on this so quickly, we appreciate it.

Tmonster · 2023-12-04T09:07:56Z

PR for this is up here!
#68

amoeba · 2023-12-04T17:55:30Z

Thanks @Tmonster!

Update name for Arrow R package

5e0b0a1

amoeba closed this Nov 30, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Update name for Arrow R package #66

Update name for Arrow R package #66

amoeba commented Nov 21, 2023

Tmonster commented Nov 22, 2023

jangorecki commented Nov 22, 2023

eitsupi commented Nov 22, 2023

jangorecki commented Nov 22, 2023 •

edited

Loading

amoeba commented Nov 22, 2023

eitsupi commented Nov 22, 2023

thisisnic commented Nov 23, 2023

Tmonster commented Nov 24, 2023

amoeba commented Nov 30, 2023

Tmonster commented Dec 4, 2023

amoeba commented Dec 4, 2023

Update name for Arrow R package #66

Update name for Arrow R package #66

Conversation

amoeba commented Nov 21, 2023

Tmonster commented Nov 22, 2023

jangorecki commented Nov 22, 2023

eitsupi commented Nov 22, 2023

jangorecki commented Nov 22, 2023 • edited Loading

amoeba commented Nov 22, 2023

eitsupi commented Nov 22, 2023

thisisnic commented Nov 23, 2023

Tmonster commented Nov 24, 2023

amoeba commented Nov 30, 2023

Tmonster commented Dec 4, 2023

amoeba commented Dec 4, 2023

jangorecki commented Nov 22, 2023 •

edited

Loading