
benchmarking vignette #3132

Open: wants to merge 9 commits into base: master
Conversation

@jangorecki (Member) commented Nov 1, 2018

Extending the benchmarking vignette with things spotted in some benchmarks, plus some extra material.
Closes #3028

@st-pasha (Contributor) left a comment

The first section is not entirely accurate, on several counts:

  • The expression of the form DT[a == 1] (where "a" is a column name) is not even possible in Python, because Python treats a as a variable name. Thus the two expressions are not equivalent: the first will raise an error.
  • In Python datatable we write DT[f.a == 1] instead; but this already has lazy evaluation semantics, imposed by the f object.
  • It is true that R's lazy evaluation is "truer" than Python's. However, it is not accurate to say that DT[DT[[col]] == filter] "forces" eager evaluation; it just makes optimization harder. In theory, nothing prevents data.table from recognizing that "DT[[col]]" is equivalent to ..col and replacing it as such. But that is more work, and nobody has done that work yet.

So I guess a better way of conveying the idea of this section is to say that, for each query, the "most idiomatic" expression should be used for each of the solutions tested.

@jangorecki (Member, Author)

I used R syntax, not Python; I just referred to the Python-way behaviour. I assume R users don't know Python syntax, so I prefer to express it that way.
Yes, such complex calls could be optimised, but that is not part of our API at the moment, so it prevents optimisation as of now.


codecov bot commented Nov 27, 2018

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 97.49%. Comparing base (40afa84) to head (6ee3826).

Additional details and impacted files
@@           Coverage Diff           @@
##           master    #3132   +/-   ##
=======================================
  Coverage   97.49%   97.49%           
=======================================
  Files          80       80           
  Lines       14861    14861           
=======================================
  Hits        14488    14488           
  Misses        373      373           


@jangorecki (Member, Author) commented Mar 6, 2019

addressed Pasha's comment from #2701 (comment) and closes #3028

@mattdowle mattdowle added this to the 1.12.4 milestone May 31, 2019
@mattdowle mattdowle modified the milestones: 1.12.4, 1.13.0 Sep 24, 2019
@mattdowle mattdowle modified the milestones: 1.12.7, 1.12.9 Dec 8, 2019
@mattdowle mattdowle modified the milestones: 1.13.1, 1.13.3 Oct 17, 2020
@jangorecki jangorecki modified the milestones: 1.14.3, 1.14.5 Jul 19, 2022
@jangorecki jangorecki modified the milestones: 1.14.11, 1.15.1 Oct 29, 2023
@jangorecki jangorecki requested a review from mattdowle as a code owner December 8, 2023 20:17
# fread: clear caches
## General suggestions

Lets assume you are measuring particular process. It is blazingly fast, it takes only microseonds to evalute.
Member

Suggested change
Lets assume you are measuring particular process. It is blazingly fast, it takes only microseonds to evalute.
Let's assume you are measuring a particular process. It is blazingly fast, taking only microseconds to evaluate.

What does it mean and how to approach such measurements?
Member

it?


The smaller time measurements are, the relatively bigger call overhead is. Call overhead can be perceived as a noise in measurement due by method dispatch, package/class initialization, low level object constructors, etc. As a result you naturally may want to measure timing many times and take the average to deal with the noise. This is valid approach, but the magnitude of timing is much more important. What will be the impact of extra 5, or lets say 5000 microseconds if writing results to target environment/format takes a minute? 1 second is 1 000 000 microseconds. Does the microseconds, or even miliseconds makes any difference? There are cases where it makes difference, for example when you call a function for every row, then you definitely should care about micro timings. The point is that in most user's benchmarks it won't make difference. Most of common R functions are vectorized, thus you are not calling them for every row. If something is blazingly fast for your data and use case then perhaps you may not have to worry about performance and benchmarks. Unless you want to scale your process, then you should worry because if something is blazingly fast today it might not be that fast tomorrow, just because your process will receive more data on input. In consequence you should confirm that your process will scale.
Member

This paragraph gets close to the Knuth quote:

> Premature optimization is the root of all evil

I think it would be almost strange not to include such a famous quote in a benchmarking vignette!

Member

Do we also want to cite a study about human perception? Here we get 13 milliseconds as the shortest time we can even detect:

https://news.mit.edu/2014/in-the-blink-of-an-eye-0116


Member

Suggested change
The smaller time measurements are, the relatively bigger call overhead is. Call overhead can be perceived as a noise in measurement due by method dispatch, package/class initialization, low level object constructors, etc. As a result you naturally may want to measure timing many times and take the average to deal with the noise. This is valid approach, but the magnitude of timing is much more important. What will be the impact of extra 5, or lets say 5000 microseconds if writing results to target environment/format takes a minute? 1 second is 1 000 000 microseconds. Does the microseconds, or even miliseconds makes any difference? There are cases where it makes difference, for example when you call a function for every row, then you definitely should care about micro timings. The point is that in most user's benchmarks it won't make difference. Most of common R functions are vectorized, thus you are not calling them for every row. If something is blazingly fast for your data and use case then perhaps you may not have to worry about performance and benchmarks. Unless you want to scale your process, then you should worry because if something is blazingly fast today it might not be that fast tomorrow, just because your process will receive more data on input. In consequence you should confirm that your process will scale.
The smaller time measurements are, the bigger (relatively) the call overhead is. Call overhead might be perceived as noise in measurement due to method dispatch, package/class initialization, low-level object constructors, etc. As a result you may naturally want to measure such timings many times and take the average (or median) to deal with the noise.
This is a valid approach, but ultimately the magnitude of timing is much more important. What will be the impact of an extra 5, or let's say 5000, microseconds if (say) writing the results to the target environment/in the target format takes a minute? 1 second is 1,000,000 microseconds. Do the microseconds, or even milliseconds, make any difference?
Of course there are cases where it makes a difference and we should care about microsecond-scale timings, for example when a function is called for every row. The point is that in most users' benchmarks, it won't make a difference. Most of the common R functions are vectorized, meaning they're called for full columns, not individual rows. If something is blazingly fast for your data and use case then perhaps you don't have to worry about performance and benchmarks.
Unless you want to scale your process, then you should worry because if something is blazingly fast today it might not be that fast tomorrow, just because your process will receive more data on input. In consequence you should confirm that your process will scale.
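To make the magnitude argument concrete, a minimal sketch along these lines could accompany the vignette text (the `microbenchmark` package and the toy sizes are assumptions for illustration, not part of the PR):

```r
library(data.table)
library(microbenchmark)

small = data.table(a = 1:10)
# on tiny data the call overhead (dispatch, [.data.table internals) dominates the timing
microbenchmark(small[a > 5L], times = 100L)

big = data.table(a = sample.int(1e8L))
# on 100 million rows the same expression is dominated by real work,
# and a handful of runs is enough to see its magnitude
system.time(big[a > 5L])
```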

Member

I would also just drop the last paragraph (Unless...scale), I feel it's belaboring the point

There are multiple dimensions that you should consider when examining scaling of your process.
Member

Suggested change
There are multiple dimensions that you should consider when examining scaling of your process.
There are multiple dimensions that you should consider when examining how your process scales:

- increase numbers of rows on input
- cardinality of data
Member

what's cardinality mean here? number of groups?

- skewness of data - for most cases this should have the least importance
- increase numbers of columns on input - this will be mostly valid when your input is a matrix, for data frames variable number of columns should be avoided as it leads to undefined schema. We suggests to model your data into predefined schema so the extra columns are modeled (using *melt*/*unpivot*) as new groups of rows.
- presence of NAs in input
Member

Suggested change
- presence of NAs in input
- prevalence of NAs in input

- skewness of data - for most cases this should have the least importance
Member

I think you mean how much variation there is in .N by group, right? maybe reader needs that spelled out a bit.

- increase numbers of rows on input
Member

Suggested change
- increase numbers of rows on input
- increase number of rows on input

- increase numbers of columns on input - this will be mostly valid when your input is a matrix, for data frames variable number of columns should be avoided as it leads to undefined schema. We suggests to model your data into predefined schema so the extra columns are modeled (using *melt*/*unpivot*) as new groups of rows.
Member

Suggested change
- increase numbers of columns on input - this will be mostly valid when your input is a matrix, for data frames variable number of columns should be avoided as it leads to undefined schema. We suggests to model your data into predefined schema so the extra columns are modeled (using *melt*/*unpivot*) as new groups of rows.
- increase number of columns on input. This should mostly come up when your input is a matrix. Because a variable column count for data.frames means an undefined schema, we suggest modeling your data such that extra columns are instead (using *melt*/*unpivot*) new groups of rows.

- sortedness of input

To measure *scaling factor* for input size you have to measure timings of at least three different sizes, lets say number of rows, 1 million, 10 millions and 100 millions. Those three different measurements will allow you to conclude how your process scales. Why three and not two? From two sizes you cannot yet conclude if process scales linearly or exponentially. In theory based on that you can estimate how many rows you would need to receive on input so that your process would take for example a minute or an hour to finish.
Member

Suggested change
To measure *scaling factor* for input size you have to measure timings of at least three different sizes, lets say number of rows, 1 million, 10 millions and 100 millions. Those three different measurements will allow you to conclude how your process scales. Why three and not two? From two sizes you cannot yet conclude if process scales linearly or exponentially. In theory based on that you can estimate how many rows you would need to receive on input so that your process would take for example a minute or an hour to finish.
To measure the *scaling factor* for a given input size, you have to measure timings of at least three different sizes, e.g. for number of rows, 1 million, 10 million and 100 million. Those three different measurements will allow you to infer how your process scales. Why three and not two? From two sizes you cannot yet conclude if a process scales linearly or not. In theory, based on that you could estimate how many rows you would need to receive on input so that your process would take for example a minute or an hour to finish.

I don't like "conclude", n=3 is definitely too small to be making "conclusions", but we can definitely start to "infer". I also don't think "linearly" and "exponentially" are opposites -- "exponentially" means "the log scales linearly". There are many other forms of non-linear growth.

Member

In theory, based on that you could estimate how many rows you would need to receive on input so that your process would take for example a minute or an hour to finish.

Drop this sentence? I haven't seen a use case for this, I'm not sure it adds much.
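For the *scaling factor* paragraph, a minimal sketch of the three-size measurement could look like this (the grouped aggregation and the n/100 group count are hypothetical choices, only for illustration):

```r
library(data.table)

# time a hypothetical grouped aggregation at three input sizes;
# the ratio of consecutive timings hints at how the process scales
for (n in c(1e6L, 1e7L, 1e8L)) {
  DT = data.table(id    = sample.int(n %/% 100L, n, replace = TRUE), # cardinality: n/100 groups
                  value = rnorm(n))
  cat("rows:", n,
      "elapsed:", system.time(DT[, .(s = sum(value)), by = id])[["elapsed"]], "\n")
}
```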

Once we have our input scaled up to reduce impact of call overhead the next thing that springs to mind is should I repeat measurements multiple times? The answer is that it strongly depends on your use case, a data processing workflow. If process is called just once in your workflow, why should you bother about its timing on second, third... and 100th run? Things like disk cache might result into subsequent runs to evaluate faster. Other optimizations might be triggered like memoize results for given input, or use of indexes created on the first run. If your workflow does not repeatadly calls your process, why should you do it in benchmark? The main focus of benchmarks should be real use case scenarios.
Member

Suggested change
Once we have our input scaled up to reduce impact of call overhead the next thing that springs to mind is should I repeat measurements multiple times? The answer is that it strongly depends on your use case, a data processing workflow. If process is called just once in your workflow, why should you bother about its timing on second, third... and 100th run? Things like disk cache might result into subsequent runs to evaluate faster. Other optimizations might be triggered like memoize results for given input, or use of indexes created on the first run. If your workflow does not repeatadly calls your process, why should you do it in benchmark? The main focus of benchmarks should be real use case scenarios.
Once we have our input scaled up to reduce the impact of call overhead, the next question we might ask is "Should I repeat measurements multiple times?". The answer is that it strongly depends on your use case, a data processing workflow. If the process is called just once in your workflow, why should you bother about its timing on the second, third... and 100th run? Things like disk cache might result in subsequent runs evaluating faster. Other optimizations might be triggered like memoizing results for given input, or use of indexes created on the first run. If your workflow does not repeatedly call your process, why should you do it in your benchmark? The main focus of benchmarks should be real use case scenarios.
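A small sketch of why a repeated run can mislead: with auto-indexing left at its default, `data.table` may build an index on the first filtered run and reuse it afterwards (sizes here are illustrative):

```r
library(data.table)
DT = data.table(id = sample.int(1e7L), value = rnorm(1e7L))

system.time(DT[id == 42L])  # first run: may include building an index on 'id'
system.time(DT[id == 42L])  # second run: can reuse that index and look much faster
```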


You should not forget about taking extra care about environment in which you are runnning benchmark. It should be striped out from startup configurations, so consider `R --vanilla` mode. Any extra configurations should be well documented. Be sure to use recent releases of tools you are benchmarking.
Member

Suggested change
You should not forget about taking extra care about environment in which you are runnning benchmark. It should be striped out from startup configurations, so consider `R --vanilla` mode. Any extra configurations should be well documented. Be sure to use recent releases of tools you are benchmarking.
Lastly, do not forget about taking extra care about the environment in which you are running a benchmark. Startup configurations should be stripped out, so consider `R --vanilla` mode. Any extra configurations should be well documented. Be sure to use recent releases of the tools you are benchmarking.
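One way (among others) to document the environment from within the benchmark script itself, as a sketch:

```r
# run the script in a clean session, e.g.:  R --vanilla -q -f benchmark.R
library(data.table)
sessionInfo()                 # R version, platform, attached packages
packageVersion("data.table")  # exact release being benchmarked
getDTthreads(verbose = TRUE)  # how many threads data.table will use
```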

You should also not forget about being polite, and if you're about to publish some benchmarking results against another library -- reach out to the authors of that other package to check with them if you're using their library correctly.
Member

Suggested change
You should also not forget about being polite, and if you're about to publish some benchmarking results against another library -- reach out to the authors of that other package to check with them if you're using their library correctly.
You should also not forget about being polite, and if you're about to publish some benchmarking results against another library -- reach out to the authors of that other package to make sure you're using their library correctly.

Member

Not sure this advice is generally practical; maybe "reach out to experts of that package" is better?

This is very valid. The smaller time measurement is the relatively bigger noise is. Noise generated by method dispatch, package/class initialization, etc. Main focus of benchmark should be on real use case scenarios.
This is very valid. The smaller time measurement is the relatively bigger noise is. Noise generated by method dispatch, package/class initialization, etc. Main focus of benchmark should be real use case scenarios.

Example below represents the problem discussed:
Member

Suggested change
Example below represents the problem discussed:
Here is a poignant example:

Matt once said:

> I'm very wary of benchmarks measured in anything under 1 second. Much prefer 10 seconds or more for a single run, achieved by increasing data size. A repetition count of 500 is setting off alarm bells. 3-5 runs should be enough to convince on larger data. Call overhead and time to GC affect inferences at this very small scale.

Member

Suggested change
This is very valid. The smaller time measurement is the relatively bigger noise is. Noise generated by method dispatch, package/class initialization, etc. Main focus of benchmark should be real use case scenarios.
This is very valid. The smaller the time measurement is, the bigger (relatively) the noise is, e.g. as generated by method dispatch, package/class initialization, etc. Again: the main focus of benchmarks should be real use case scenarios.

setindex(dt, "id")
df = as.data.frame(dt)
microbenchmark(
dt[id==5e6L, value],
Member

Why 5e6 here? I was distracted by the difference; shouldn't 5e4L continue to demonstrate the point?

```

# inside a loop prefer `set` instead of `:=`
Keep in mind that using `parallel` R package together with `data.table` will force `data.table` to use only single core. Thus it is recommended to verify cores utilization in resource monitoring tools, for example `htop`.
Member

Suggested change
Keep in mind that using `parallel` R package together with `data.table` will force `data.table` to use only single core. Thus it is recommended to verify cores utilization in resource monitoring tools, for example `htop`.
Keep in mind that using the `parallel` R package together with `data.table` will force `data.table` to use only a single core. Thus it is recommended to verify core utilization in resource monitoring tools, for example `htop`.
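The `set`-inside-a-loop heading quoted above could carry a minimal sketch like this (the toy table and loop body are made up for illustration):

```r
library(data.table)
DT = data.table(x = 1:1e4L, y = NA_integer_)

# set() skips the overhead of the [.data.table interface, which adds up inside loops
for (i in seq_len(nrow(DT))) {
  set(DT, i = i, j = "y", value = DT$x[i] * 2L)
}
```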

Comment on lines +201 to +203
### inside a loop prefer `setDT()` instead of `data.table()`

As of now `data.table()` has an overhead, thus inside loops it is preferred to use `as.data.table()`, even better `setDT()`, or ideally avoid class coercion as described in _avoid class coercion_ above.
@MichaelChirico (Member) commented Mar 14, 2024

Suggested change
### inside a loop prefer `setDT()` instead of `data.table()`
As of now `data.table()` has an overhead, thus inside loops it is preferred to use `as.data.table()`, even better `setDT()`, or ideally avoid class coercion as described in _avoid class coercion_ above.
### inside a loop prefer `setDT()` or `as.data.table()` instead of `data.table()`
As of now `data.table()` has an overhead that `as.data.table()` avoids, thus inside loops the latter is preferable. Even better is `setDT()`, or ideally just avoid class coercion as described in _avoid class coercion_ above.

Member

Is there a bug for removing that overhead? Should it be cited as an HTML comment here?

@jangorecki (Member, Author) commented Mar 14, 2024

No bug. I don't think we should be focusing on optimizing it; as.data.table is meant to be fast. These are still microseconds, so it is only relevant when someone is looping on it.
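For completeness, a minimal sketch of the loop pattern under discussion (the list-building step is hypothetical):

```r
library(data.table)

pieces = replicate(100L, list(a = rnorm(10L), b = runif(10L)), simplify = FALSE)
out = vector("list", length(pieces))
for (i in seq_along(pieces)) {
  # setDT() converts the list by reference, avoiding the data.table() constructor overhead
  out[[i]] = setDT(pieces[[i]])
}
DT = rbindlist(out)
```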


### lazy evaluation aware benchmarking

#### let applications to optimize queries
Member

Suggested change
#### let applications to optimize queries
#### let applications optimize queries



In languages like python which does not support _lazy evaluation_ the following two filter queries would be processed exactly the same way.
Member

Suggested change
In languages like python which does not support _lazy evaluation_ the following two filter queries would be processed exactly the same way.
In languages like python which do not support _lazy evaluation_, the following two filter queries would be processed exactly the same way.

DT[DT[["a"]] == 1L]
```

R has _lazy evaluation_ feature which allows an application to investigate and optimize expressions before it gets evaluated. In above case if we filter using `DT[[col]] == filter` we are forcing to materialize whole LHS. This prevents `data.table` to optimize expression whenever it is possible and basically falls back to base R `data.frame` way of doing subset. For more information on that subject refer to [R language manual](https://cran.r-project.org/doc/manuals/r-release/R-lang.html).
Member

Suggested change
R has _lazy evaluation_ feature which allows an application to investigate and optimize expressions before it gets evaluated. In above case if we filter using `DT[[col]] == filter` we are forcing to materialize whole LHS. This prevents `data.table` to optimize expression whenever it is possible and basically falls back to base R `data.frame` way of doing subset. For more information on that subject refer to [R language manual](https://cran.r-project.org/doc/manuals/r-release/R-lang.html).
R has _lazy evaluation_, which allows an application to investigate and optimize expressions before they get evaluated; SQL engines also do this. In the above, if we filter using `DT[[col]] == filter` we are forcing the whole LHS to materialize. This prevents `data.table` from optimizing the expression whenever possible and basically falls back to the base R `data.frame` way of doing subsets. For more information on that subject refer to the [R language manual](https://cran.r-project.org/doc/manuals/r-release/R-lang.html).
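To show the difference this makes in practice, a sketch along these lines could be added (sizes are illustrative; an index on `a` is created explicitly so both runs start from the same state):

```r
library(data.table)
DT = data.table(a = sample.int(5L, 1e7L, replace = TRUE), value = rnorm(1e7L))
setindex(DT, a)

system.time(DT[a == 1L])          # unevaluated expression: data.table can use the index on 'a'
system.time(DT[DT[["a"]] == 1L])  # LHS materialized first: plain logical subset, no optimization
```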


As of now `data.table()` has an overhead, thus inside loops it is preferred to use `as.data.table()` or `setDT()` on a valid list.
The are multiple applications which are trying to be as lazy as possible. As a result you might experience that when you run a query against such solution it finishes instantly, but then printing the results takes much more time. It is because the query actually was not computed at the time of calling query but it got computed (or even only partially computed) when its results were required. Because of that you should ensure that computation took place completely. It is not a trivial task, the ultimate way to ensure is to dump results to disk but it adds an overhead of writing to disk which is then included in timings of a query we are benchmarking. The easy and cheap way to deal with it could be for example printing dimensions of a results (useful in grouping benchmarks), or printing first and last element (useful in sorting benchmarks).
Member

I actually think the way out here is to focus again on end-to-end benchmarks. Lazy operations are actually very good if the computation never needs to materialize in the actual workflow. If we read a file from disk with 100 columns as lazy ALTREP, but the workflow only needs 5 columns, it's inefficient to materialize the other 95.

So again making the benchmark as realistic as possible is key.
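One generic pattern for making sure a lazy result is actually computed inside the timed region, as the paragraph suggests via printing dimensions (`run_lazy_query()` is a hypothetical placeholder for whatever lazy API is being benchmarked):

```r
timing = system.time({
  res = run_lazy_query(input)  # hypothetical lazy call: may return an unevaluated handle
  print(dim(res))              # touching the result forces materialization and records its shape
})
print(timing)
```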

@MichaelChirico MichaelChirico modified the milestones: 1.16.0, 1.17.0 Jul 10, 2024
@MichaelChirico MichaelChirico modified the milestones: 1.17.0, 1.18.0 Dec 3, 2024

Successfully merging this pull request may close these issues.

Add introductory general principles to benchmarking vignette
4 participants