Initial implementation #1

Closed · wants to merge 5 commits
21 changes: 21 additions & 0 deletions .github/workflows/lint.yml
@@ -0,0 +1,21 @@
name: Linter
on:
  pull_request:
    branches:
      - main
jobs:
  test:
    name: Lint
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - name: Use Node.js
        uses: actions/setup-node@v2
        with:
          node-version: '18.x'
      - name: Install packages
        run: npm ci
      - name: Prettier
        run: npm run format
      - name: Lint
        run: npm run lint:ci
49 changes: 49 additions & 0 deletions .github/workflows/test.yml
@@ -0,0 +1,49 @@
name: Unit Tests
on:
  push:
    branches: [ main ]
  pull_request:
    branches: [ main ]

jobs:
  install-and-build:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout
        uses: actions/checkout@v3
      - name: Use Node
        uses: actions/setup-node@v3
        with:
          node-version: '18.x'
          cache: npm
          cache-dependency-path: ./package-lock.json
      - name: Install dependencies
        run: npm ci
      - name: Build Shared
        run: npm run build
      - name: Archive artifacts
        uses: actions/upload-artifact@v3
        with:
          name: lib
          path: ./lib
  test:
    runs-on: ubuntu-latest
    needs: [ install-and-build ]
    steps:
      - name: Checkout
        uses: actions/checkout@v3
      - name: Use Node
        uses: actions/setup-node@v3
        with:
          node-version: '18.x'
          cache: npm
          cache-dependency-path: ./package-lock.json
      - name: Download shared dist
        uses: actions/download-artifact@v3
        with:
          name: lib
          path: ./lib
      - name: Link Dependencies
        run: npm ci
      - name: Test
        run: npm test
2 changes: 2 additions & 0 deletions .gitignore
@@ -1,3 +1,5 @@
lib/
node_modules/
coverage/
test/e2e-test-piles/
test/e2e-test-out.csv
66 changes: 64 additions & 2 deletions README.md
@@ -7,10 +7,72 @@ Linear-time shuffling of large datasets for Node.js
This package uses the [Rao](https://www.jstor.org/stable/25049166)
algorithm to shuffle data sets that are too large to fit in memory. The
algorithm is described pretty well by [Chris Hardin](https://blog.janestreet.com/how-to-shuffle-a-big-dataset/).
In essence, the input stream is randomly scattered into "piles" which
The input stream is randomly scattered into "piles" which
are stored on disk. Then each pile is shuffled in-memory with
[Fisher-Yates](https://en.wikipedia.org/wiki/Fisher%E2%80%93Yates_shuffle).
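The per-pile pass can be sketched as a standalone Fisher-Yates shuffle (this is the algorithm the README names, not the package's internal code):

```typescript
// Standalone Fisher-Yates sketch: shuffles one in-memory "pile" in place.
// Illustrative only — not big-shuffle's actual implementation.
function fisherYates<T>(pile: T[]): T[] {
  for (let i = pile.length - 1; i > 0; i--) {
    // Pick a uniformly random index in [0, i] and swap it into position i.
    const j = Math.floor(Math.random() * (i + 1));
    [pile[i], pile[j]] = [pile[j], pile[i]];
  }
  return pile;
}
```

Each element is swapped with a uniformly random earlier (or same) position, which yields an unbiased permutation in linear time.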

If your data set is extremely large, then even your piles may not fit in
memory. In that case, the algorithm could recurse until the piles are
small enough, but that feature is not implemented here.
small enough, but that feature is not yet implemented here.

On my personal machine, this package randomized the order of a 5.3GB, 22
million row TSV in about three hours. I suspect it could be optimized
quite a bit more.

## Limitations / Future Work

Because the input elements are written to disk as part of the shuffle,
`big-shuffle` can only take string data. If you need to shuffle other
types of data, serialize them to `string` first.
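For example, non-string records can be round-tripped through JSON (a minimal sketch; `serialize` and `deserialize` are illustrative helper names, not part of the package):

```typescript
// Illustrative helpers (not part of big-shuffle): JSON-encode records on the
// way into the shuffle and decode them again on the way out.
async function* serialize<T>(items: AsyncIterable<T>): AsyncIterable<string> {
  for await (const item of items) yield JSON.stringify(item);
}

async function* deserialize<T>(lines: AsyncIterable<string>): AsyncIterable<T> {
  for await (const line of lines) yield JSON.parse(line) as T;
}
```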

Support for shuffling `Buffer` and `Uint8Array` objects may be added
later if there is demand.

## Getting Started

`npm install big-shuffle`

For TypeScript users:
```ts
import { shuffle } from 'big-shuffle';

async function *asyncRange(max: number) {
  for (let i = 0; i < max; i++) {
    yield i.toString(10);
  }
}

async function main() {
  const shuffled = await shuffle(asyncRange(1000000));

  for await (const i of shuffled) {
    console.log(i);
  }
}

main();
```

This will generate the integers 0 through 999,999 and print them in shuffled order.

The same code works in plain JavaScript once the type annotations are removed.

## API Reference

```ts
function shuffle(
  inStream: AsyncIterable<string>,
  numPiles: number = 1000, // more piles reduce per-pile memory usage but require more open file descriptors
  pileDir: string = path.join(__dirname, 'shuffle_piles'), // filesystem path where the pile files are written
): Promise<AsyncIterable<string>>;
```

Note that the shuffled iterable will not yield any records until the
input iterable is fully consumed.

## Use with Node.js Streams

For interop with the built-in [Stream](https://nodejs.org/api/stream.html)
library, consider using `stream-to-async-iterator` [(details)](https://www.npmjs.com/package/stream-to-async-iterator)
to convert the unshuffled stream into an iterator, and then converting
the shuffled iterator back into a stream using `Readable.from`. See
`test/e2e.spec.ts` for an example using streams to shuffle a CSV.
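A self-contained sketch of that round trip, using a stand-in for `shuffle` so it runs without the package installed (Node's `Readable` already implements `Symbol.asyncIterator`, and `Readable.from` converts back):

```typescript
import { Readable } from 'stream';

// Stand-in for big-shuffle's shuffle() so this sketch is self-contained;
// the real call would be `await shuffle(rows)`. Here we just reverse the rows.
async function fakeShuffle(rows: AsyncIterable<string>): Promise<AsyncIterable<string>> {
  const buffered: string[] = [];
  for await (const row of rows) buffered.push(row); // consume fully, like the real shuffle
  buffered.reverse();
  return (async function* () { yield* buffered; })();
}

async function main(): Promise<string[]> {
  // Any Readable works as input: it is already an AsyncIterable<string> here.
  const input = Readable.from(['a\n', 'b\n', 'c\n']);
  const shuffled = await fakeShuffle(input);
  // Convert the shuffled async iterable back into a Readable stream.
  const output = Readable.from(shuffled);
  const collected: string[] = [];
  for await (const chunk of output) collected.push(String(chunk));
  return collected;
}

main().then((rows) => console.log(rows.join('')));
```

With the real package, `fakeShuffle` would be replaced by `shuffle`, and `output` could be piped to any writable stream.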
7 changes: 7 additions & 0 deletions __mocks__/fs/promises.ts
@@ -0,0 +1,7 @@
const { fs } = require('memfs');

module.exports = {
...fs.promises,
open: jest.fn()
.mockImplementation((...args) => fs.promises.open(...args)),
};