Commit

Minor doc edits. (#347)
jondegenhardt authored Jun 10, 2021
1 parent 8a5c483 commit 6a284fb
Showing 21 changed files with 52 additions and 54 deletions.
6 changes: 3 additions & 3 deletions README.md
@@ -28,7 +28,7 @@ File an [issue](https://github.com/eBay/tsv-utils/issues) if you have problems,
* [Experimenting with Link Time Optimization](docs/dlang-meetup-14dec2017.pdf). Dec 14, 2017. A presentation at the [Silicon Valley D Meetup](https://www.meetup.com/D-Lang-Silicon-Valley/) describing experiments using LTO based on eBay's TSV Utilities.
* [Exploring D via Benchmarking of eBay's TSV Utilities](http://dconf.org/2018/talks/degenhardt.html). May 2, 2018. A presentation at [DConf 2018](http://dconf.org/2018/) describing performance benchmark studies conducted using eBay's TSV Utilities (slides [here](docs/dconf2018.pdf)).

-![GitHub Workflow Status](https://img.shields.io/github/workflow/status/eBay/tsv-utils/build-test)
+[![GitHub Workflow Status](https://img.shields.io/github/workflow/status/eBay/tsv-utils/build-test)](https://github.com/eBay/tsv-utils/actions/workflows/build-test.yml)
[![Codecov](https://img.shields.io/codecov/c/github/eBay/tsv-utils.svg)](https://codecov.io/gh/eBay/tsv-utils)
[![GitHub release](https://img.shields.io/github/release/eBay/tsv-utils.svg)](https://github.com/eBay/tsv-utils/releases)
[![Github commits (since latest release)](https://img.shields.io/github/commits-since/eBay/tsv-utils/latest.svg)](https://github.com/eBay/tsv-utils/commits/master)
@@ -309,7 +309,7 @@ See the [tsv-summarize reference](docs/tool_reference/tsv-summarize.md) for the
* Bernoulli sampling (`--p|prob P`) - A streaming form of sampling. Lines are read one at a time and selected for output using probability `P`. e.g. `-p 0.1` specifies that 10% of lines should be included in the sample.
* Distinct sampling (`--k|key-fields F`, `--p|prob P`) - Another streaming form of sampling. However, instead of each line being subject to an independent selection choice, lines are selected based on a key contained in each line. A portion of keys are randomly selected for output, with probability P. Every line containing a selected key is included in the output. Consider a query log with records consisting of <user, query, clicked-url> triples. It may be desirable to sample records for one percent of the users, but include all records for the selected users.

-`tsv-sample` is designed for large data sets. Streaming algorithms make immediate decisions on each line. They do not accumulate memory and can run on infinite length input streams. Both shuffling and sampling with replacement read in the entire dataset and are limited by available memory. Simple and weighted random sampling use reservoir sampling and only need to hold the specified sample size (`--n|num`) in memory. By default, a new random order is generated every run, but options are available for using the same randomization order over multiple runs. The random values assigned to each line can be printed, either to observe the behavior or to run custom algorithms on the results.
+`tsv-sample` is designed for large data sets. Streaming algorithms make immediate decisions on each line. They do not accumulate memory and can run on infinite length input streams. Both shuffling and sampling with replacement read the entire dataset all at once and are limited by available memory. Simple and weighted random sampling use reservoir sampling and only need to hold the specified sample size (`--n|num`) in memory. By default, a new random order is generated every run, but options are available for using the same randomization order over multiple runs. The random values assigned to each line can be printed, either to observe the behavior or to run custom algorithms on the results.

See the [tsv-sample reference](docs/tool_reference/tsv-sample.md) for further details.
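
The streaming vs. reservoir distinction above is easy to sketch. Bernoulli sampling is a one-line decision per input line (keep the line when `uniform01() < P`), while simple random sampling holds only the reservoir in memory. Below is a minimal D sketch of reservoir sampling (Algorithm R). It is illustrative only, not the `tsv-sample` implementation; the fixed `sampleSize` stands in for the `--n|num` option.

```d
import std.random : uniform;
import std.stdio : stdin, writeln;

void main()
{
    enum size_t sampleSize = 10;       // stand-in for --n|num
    string[] reservoir;
    size_t linesSeen = 0;

    foreach (line; stdin.byLineCopy)
    {
        ++linesSeen;
        if (reservoir.length < sampleSize)
        {
            reservoir ~= line;         // fill the reservoir first
        }
        else
        {
            // Replace a random reservoir slot with probability
            // sampleSize / linesSeen; this yields a uniform random sample.
            immutable j = uniform(0, linesSeen);
            if (j < sampleSize) reservoir[j] = line;
        }
    }

    foreach (line; reservoir) writeln(line);
}
```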

@@ -322,7 +322,7 @@ Example:
$ tsv-join -H --filter-file filter.tsv --key-fields Country,City --append-fields Population,Elevation data.tsv
```

-This reads `filter.tsv`, creating a lookup table keyed on the fields `Country` and `City` fields. `data.tsv` is read, lines with a matching key are written to standard output with the `Population` and `Elevation` fields from `filter.tsv` appended. This is an inner join. Left outer joins and anti-joins are also supported.
+This reads `filter.tsv`, creating a lookup table keyed on the `Country` and `City` fields. `data.tsv` is read; lines with a matching key are written to standard output with the `Population` and `Elevation` fields from `filter.tsv` appended. This is an inner join. Left outer joins and anti-joins are also supported.

Common uses for `tsv-join` are to join related datasets or to filter one dataset based on another. Filter file entries are kept in memory; this limits the ultimate size that can be handled effectively. The author has found that filter files up to about 10 million lines are processed effectively, but performance starts to degrade after that.
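
The in-memory lookup approach lends itself to a compact sketch. The following D fragment shows the hash-join idea under simplifying assumptions: fixed 0-based key and append field positions rather than header names, no header handling, and hypothetical file names. It is not the actual `tsv-join` source.

```d
import std.algorithm : map;
import std.array : join, split;
import std.stdio : File, writeln;

void main()
{
    auto keyFields = [0, 1];           // e.g. Country, City (0-based here)
    auto appendFields = [2, 3];        // e.g. Population, Elevation

    /* Pass 1: build the in-memory lookup table from the filter file. */
    string[string] lookup;
    foreach (line; File("filter.tsv").byLineCopy)
    {
        auto fields = line.split("\t");
        auto key = keyFields.map!(i => fields[i]).join("\t");
        lookup[key] = appendFields.map!(i => fields[i]).join("\t");
    }

    /* Pass 2: stream the data file. Inner join: only matches are output. */
    foreach (line; File("data.tsv").byLineCopy)
    {
        auto fields = line.split("\t");
        auto key = keyFields.map!(i => fields[i]).join("\t");
        if (auto appendValues = key in lookup)
            writeln(line, "\t", *appendValues);
    }
}
```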

8 changes: 4 additions & 4 deletions common/src/tsv_utils/common/numerics.d
@@ -944,18 +944,18 @@ unittest
alias AreNaNsEqual = Flag!"areNaNsEqual";

/**
-nearEqual checks two floating point numbers are "near equal".
+nearEqual checks if two floating point numbers are "near equal".
nearEqual is an alternative to the approxEqual and isClose functions in the Phobos
-standard library. It should be regarded as experimental. Currently it is only used in
+standard library. It should be regarded as experimental. Currently it is used only in
unit tests.
Default relative diff tolerance is small. Absolute diff tolerance is also small, but
non-zero. This means comparing a number near zero to zero will be considered near
equal by default.
-Default tolerances will not survive float-to-double conversion. Use tolerances based
-on float in these cases.
+Default tolerances will not suffice for float-to-double conversion. Use tolerances
+based on float in these cases.
*/
bool nearEqual(T, AreNaNsEqual naNsAreEqual = No.areNaNsEqual)
(T x, T y, T maxRelDiff = 4.0 * T.epsilon, T maxAbsDiff = T.epsilon)
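
A minimal sketch of the combined absolute/relative tolerance test this comment describes might look like the following. This is an assumed reconstruction for illustration; it ignores the `areNaNsEqual` flag and is not the actual `nearEqual` body.

```d
import std.math : abs, isNaN;

bool nearEqualSketch(T)(T x, T y,
    T maxRelDiff = 4.0 * T.epsilon, T maxAbsDiff = T.epsilon)
{
    if (x.isNaN || y.isNaN) return false;      // NaNs never compare equal here

    immutable diff = abs(x - y);
    if (diff <= maxAbsDiff) return true;       // handles comparisons near zero

    // Relative test: the difference is small compared to the larger magnitude.
    immutable larger = abs(x) > abs(y) ? abs(x) : abs(y);
    return diff <= larger * maxRelDiff;
}

unittest
{
    assert(nearEqualSketch(1.0, 1.0 + 2.0 * double.epsilon));
    assert(nearEqualSketch(0.0, double.epsilon / 2));   // near zero vs zero
    assert(!nearEqualSketch(1.0, 1.1));
}
```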
2 changes: 1 addition & 1 deletion csv2tsv/src/tsv_utils/csv2tsv.d
@@ -54,7 +54,7 @@ Behaviors of this program that often vary between CSV implementations:
* Newlines are supported in quoted fields.
* Double quotes are permitted in a non-quoted field. However, a field starting
with a quote must follow quoting rules.
-* Each record can have a different numbers of fields.
+* Each record can have a different number of fields.
* The three common forms of newlines are supported: CR, CRLF, LF. Output is
written using Unix newlines (LF).
* A newline will be added if the file does not end with one.
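
For illustration, a greatly simplified CSV-to-TSV conversion can be written on top of Phobos' `std.csv`. This sketch is not the `csv2tsv` implementation and does not reproduce all of the behaviors listed above; the input file name is hypothetical.

```d
import std.algorithm : map;
import std.array : join, replace;
import std.csv : csvReader;
import std.file : readText;
import std.stdio : writeln;

void main()
{
    auto input = readText("data.csv");   // hypothetical input file name

    foreach (record; csvReader!string(input))
    {
        // TSV cannot represent tabs or newlines inside fields;
        // substitute spaces before emitting the TSV record.
        writeln(record.map!(f => f.replace("\t", " ").replace("\n", " "))
                      .join("\t"));
    }
}
```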
20 changes: 10 additions & 10 deletions docs/AboutTheCode.md
@@ -12,39 +12,39 @@ Contents:

## Code structure

-There is directory for each tool, plus one directory for shared code (`common`). The tools all have a similar structure. Code is typically in one file, e.g. `tsv-uniq.d`. Functionality is broken into three pieces:
+There is a directory for each tool, plus one directory for shared code (`common`). The tools all have a similar structure. Code is typically in one file, e.g. `tsv-uniq.d`. Functionality is broken into three pieces:

* A class managing command line options. e.g. `TsvUniqOptions`.
-* A function reading reading input and processing each line. e.g. `tsvUniq`.
+* A function reading input and processing each line. e.g. `tsvUniq`.
* A `main` routine putting it all together.
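
As a concrete (hypothetical) illustration of this three-piece structure, a minimal tool skeleton in D might look like the sketch below. The names `ExampleToolOptions` and `exampleTool` are invented for illustration and are not from the repository.

```d
import std.getopt;
import std.range : enumerate;
import std.stdio : stdin, writeln;

/* Piece 1: a struct/class managing command line options. */
struct ExampleToolOptions
{
    bool hasHeader = false;            // hypothetical --H|header option

    /* Returns false if --help was requested and the tool should exit. */
    bool processArgs(ref string[] args)
    {
        auto result = getopt(args,
            "H|header", "Treat the first line as a header.", &hasHeader);
        if (result.helpWanted)
        {
            defaultGetoptPrinter("Synopsis: example-tool [options] < input",
                result.options);
            return false;
        }
        return true;
    }
}

/* Piece 2: a function reading input and processing each line. */
void exampleTool(const ExampleToolOptions opts)
{
    foreach (lineNum, line; stdin.byLine.enumerate(1))
    {
        if (opts.hasHeader && lineNum == 1) { writeln(line); continue; }
        writeln(line);                 // a real tool transforms the line here
    }
}

/* Piece 3: a main routine putting it all together. */
int main(string[] args)
{
    ExampleToolOptions opts;
    if (!opts.processArgs(args)) return 0;
    exampleTool(opts);
    return 0;
}
```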

Documentation for each tool is found near the top of the main file, both in the help text and the option documentation. The [tsv-utils code documentation](https://tsv-utils.dpldocs.info/) is also useful for understanding the code.

-The simplest tool is `number-lines`. It is useful as an illustration of the code outline followed by the other tools. `tsv-select` and `tsv-uniq` also have straightforward functionality, but employ a few more D programming concepts. `tsv-select` uses templates and compile-time programming in a somewhat less common way, it may be clearer after gaining some familiarity with D templates. A non-templatized version of the source code is included for comparison.
+The simplest tool is `number-lines`. It is useful as an illustration of the code outline followed by the other tools. `tsv-select` and `tsv-uniq` also have straightforward functionality, but employ a few more D programming concepts. `tsv-select` uses templates and compile-time programming in a somewhat less common way; it may be clearer after gaining some familiarity with D templates. A non-templatized version of the source code is included for comparison.

-`tsv-append` has a simple code structure. It's one of the newer tools. It's only additional complexity is that writes to an 'output range' rather than directly to standard output. This enables better encapsulation for unit testing. `tsv-sample`, another new tool, is written in a similar fashion. The code is only a bit more complicated, but the algorithm is much more interesting.
+`tsv-append` has a simple code structure. It's one of the newer tools. Its only additional complexity is that it writes to an 'output range' rather than directly to standard output. This enables better encapsulation for unit testing. `tsv-sample`, another new tool, is written in a similar fashion. The code is only a bit more complicated, but the algorithm is much more interesting.

`tsv-sample` is one of the newest tools and one of the better code examples. Sampling is algorithmically interesting and the code includes implementations of a number of sampling methods.

`tsv-join` and `tsv-filter` also have relatively straightforward functionality, but support more use cases resulting in more code. `tsv-filter` in particular has more elaborate setup steps that take a bit more time to understand. `tsv-filter` uses several features like delegates (closures) and regular expressions not used in the other tools.

-`tsv-summarize` is one or the more recent tools. It uses a more object oriented style than the other tools, this makes it relatively easy to add new operations. It also makes quite extensive use of built-in unit tests.
+`tsv-summarize` is one of the more recent tools. It uses a more object oriented style than the other tools; this makes it relatively easy to add new operations. It also makes quite extensive use of built-in unit tests.

The `common` directory has code shared by the tools. See the [tsv-utils.common code documentation](https://tsv-utils.dpldocs.info/tsv_utils.common.html) for a description of the classes.

New tools can be added by creating a new directory and a source tree following the same pattern as one of the existing tools.

## Coding philosophy

-The tools were written in part to explore D for use in a data science environment. Data mining environments have custom data and application needs. This leads to custom tools, which in turn raises the productivity vs execution speed question. This trade-off is exemplified by interpreted languages like Python on the one hand and system languages like C/C++ on the other. The D programming language occupies an interesting point on this spectrum. D's programmer experience is somewhere in the middle ground between interpreted languages and C/C++, but run-time performance is closer to C/C++. Execution speed is a very practical consideration in data mining environments: it increases dataset sizes that can handled on a single machine, perhaps the researcher's own machine, without needing to switch to a distributed compute environment. There is additional value in having data science practitioners program these tools quickly, themselves, without needing to invest time in low-level programming.
+The tools were written in part to explore D for use in a data science environment. Data mining environments have custom data and application needs. This leads to custom tools, which in turn raises the productivity vs execution speed question. This trade-off is exemplified by interpreted languages like Python on the one hand and system languages like C/C++ on the other. The D programming language occupies an interesting point on this spectrum. D's programmer experience is somewhere in the middle ground between interpreted languages and C/C++, but run-time performance is closer to C/C++. Execution speed is a very practical consideration in data mining environments: it increases dataset sizes that can be handled on a single machine, perhaps the researcher's own machine, without needing to switch to a distributed compute environment. There is additional value in having data science practitioners program these tools quickly, themselves, without needing to invest time in low-level programming.

These tools were implemented with these trade-offs in mind. The code was deliberately kept at a reasonably high level. The obvious built-in facilities were used, notably the standard library. A certain amount of performance optimization was done to explore this dimension of D programming, but low-level optimizations were generally avoided. Indeed, there are options likely to improve performance, notably:

* Custom I/O buffer management, including reading entire files into memory.
* Custom hash tables rather than built-in associative arrays.
* Avoiding garbage collection.

-A useful aspect of D is that is additional optimization can be made as the need arises. Coding of these tools did utilize a several optimizations that might not have been done in an initial effort. These include:
+A useful aspect of D is that additional optimizations can be made as the need arises. Coding of these tools did utilize several optimizations that might not have been done in an initial effort. These include:

* The `InputFieldReordering` class in the `common` directory. This is an optimization for processing only the first N fields needed for the individual command invocation. This is used by several tools.
* The template expansion done in `tsv-select`. This reduces the number of if-tests in the inner loop.
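
As a hypothetical illustration of the template-expansion idea (not the actual `tsv-select` code): a compile-time `bool` parameter lets the compiler generate separate loop variants, so the test vanishes entirely from instantiations that don't need it.

```d
import std.stdio : writeln;

void processLines(bool hasHeader)(const string[] lines)
{
    foreach (i, line; lines)
    {
        static if (hasHeader)          // resolved at compile time: the test
        {                              // below is absent from the 'false'
            if (i == 0)                // instantiation's inner loop
            {
                writeln("header: ", line);
                continue;
            }
        }
        writeln("data: ", line);
    }
}

void main()
{
    auto lines = ["name\tvalue", "a\t1"];
    processLines!true(lines);          // header-aware variant
    processLines!false(lines);         // no-header variant, no if-test
}
```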
@@ -65,7 +65,7 @@ The makefile setup is very simplistic. It works reasonably in this case because
* `make test-codecov` - Runs unit tests and debug app tests with code coverage reports turned on.
* `make help` - Shows all the make commands.

-Builds can be customized by changing the settings in `makedefs.mk`. The most basic customization is the compiler choice, this controlled by the `DCOMPILER` variable.
+Builds can be customized by changing the settings in `makedefs.mk`. The most basic customization is the compiler choice; this is controlled by the `DCOMPILER` variable.
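
For example, since make variables set on the command line override makefile assignments, the compiler can be selected without editing `makedefs.mk` (assuming LDC is installed as `ldc2`):

```
$ make DCOMPILER=ldc2
```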

### DUB package setup

@@ -79,9 +79,9 @@ $ make test-nobuild

## Unit tests and code coverage reports

-D has an excellent facility for adding unit tests right with the code. The `common` utility functions and the more recent tools take advantage of built-in unit tests. However, the earlier tools do not, and instead use more traditional invocation of the command line executables and diffs the output against a "gold" result set. The more recent tools use both built-in unit tests ad tests against the executable. This includes `csv2tsv`, `tsv-summarize`, `tsv-append`, and `tsv-sample`. The built-in unit tests are much nicer, and also the advantage of being naturally cross-platform. The command line executable tests assume a Unix shell.
+D has an excellent facility for adding unit tests right with the code. The `common` utility functions and the more recent tools take advantage of built-in unit tests. However, the earlier tools do not, and instead rely on traditional invocation of the command line executables, diffing the output against a "gold" result set. The more recent tools use both built-in unit tests and tests against the executable. This includes `csv2tsv`, `tsv-summarize`, `tsv-append`, and `tsv-sample`. The built-in unit tests are much nicer, and also have the advantage of being naturally cross-platform. The command line executable tests assume a Unix shell.

-Tests for the command line executables are in the `tests` directory of each tool. Overall the tests cover a fair number of cases and are quite useful checks when modifying the code. They may also be helpful as an examples of command line tool invocations. See the `tests.sh` file in each `test` directory, and the `test` makefile target in `makeapp.mk`.
+Tests for the command line executables are in the `tests` directory of each tool. Overall the tests cover a fair number of cases and are quite useful checks when modifying the code. They may also be helpful as examples of command line tool invocations. See the `tests.sh` file in each `test` directory, and the `test` makefile target in `makeapp.mk`.

The unit test built into the common code (`common/src/tsvutil.d`) illustrates a useful interaction with templates: it is quite easy and natural to unit test template instantiations that might not occur naturally in the application being written along with the template.
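
A hypothetical sketch of that interaction (not the actual `tsvutil.d` test): a built-in `unittest` block can exercise template instantiations the surrounding application never creates.

```d
T maxOf(T)(T a, T b) { return a > b ? a : b; }

unittest
{
    // Instantiations the application itself might never create:
    assert(maxOf!int(2, 3) == 3);
    assert(maxOf!double(2.5, 1.5) == 2.5);
    assert(maxOf!string("abc", "abd") == "abd");
}
```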

2 changes: 1 addition & 1 deletion docs/BuildingWithLTO.md
@@ -2,7 +2,7 @@ _Visit the [main page](../README.md)_

# Building with Link Time Optimization and Profile Guided Optimization

-This page provides instruction for building the TSV Utilities from source code using Link Time Optimization (LTO) and Profile Guided Optimization (PGO). LTO is enabled for all the tools, PGO is enabled for a select few. Both improve run-time performance, LTO has the additional effect of reducing binary sizes. Normally PGO and LTO can be used independently, however, the TSV Utilities build system only supports PGO when already using LTO.
+This page provides instructions for building the TSV Utilities from source code using Link Time Optimization (LTO) and Profile Guided Optimization (PGO). LTO is enabled for all the tools; PGO is enabled for a select few. Both improve run-time performance; LTO has the additional effect of reducing binary sizes. Normally PGO and LTO can be used independently; however, the TSV Utilities build system only supports PGO when already using LTO.

Contents:

_The remaining 16 changed files are not shown._
