Add Dataset.pipe method, based on pandas.DataFrame.pipe. #734

Merged
merged 1 commit into from
Feb 21, 2025

@copybara-service copybara-service bot commented Feb 14, 2025

Add Dataset.pipe method, based on pandas.DataFrame.pipe.

pipe is convenient because it extends method chaining syntax to transformations that are not built-in methods on Dataset.

For example, consider shuffling a dataset in windows. It would be convenient if we could write something like:

ds = (
    dataset.MapDataset.range(400)
    .window_shuffle(window_size=10, seed=42)
    .batch(16)
    .repeat()
)

Unfortunately this doesn't work, because there is no window_shuffle() method. Instead you would need to write something like:

ds = (
    shuffle.WindowShuffleMapDataset(
        dataset.MapDataset.range(400),
        window_size=10,
        seed=42,
    )
    .batch(16)
    .repeat()
)

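Readability suffers here, because the shuffle transformation wraps the source dataset instead of following it in the chain, so the code no longer reads in the order in which the transformations are applied.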

With pipe, we can instead keep transformations in the order in which they are applied:

ds = (
    dataset.MapDataset.range(400)
    .pipe(
        shuffle.WindowShuffleMapDataset,
        window_size=10,
        seed=42,
    )
    .batch(16)
    .repeat()
)
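
To make the mechanics concrete, here is a minimal sketch of the pattern, assuming pipe simply passes the dataset as the first positional argument to the callable, which is what pandas.DataFrame.pipe does for a plain callable. ToyDataset and window_shuffle are hypothetical stand-ins written for illustration; they are not the classes or the implementation added in this change.

```python
import random
from typing import Any, Callable, Sequence, TypeVar

T = TypeVar("T")


class ToyDataset:
    """Hypothetical stand-in for a dataset class (illustration only)."""

    def __init__(self, items: Sequence[int]):
        self.items = list(items)

    def pipe(self, func: Callable[..., T], /, *args: Any, **kwargs: Any) -> T:
        # Assumption: forward this dataset as the first positional argument,
        # followed by any extra positional and keyword arguments.
        return func(self, *args, **kwargs)

    def batch(self, batch_size: int) -> list[list[int]]:
        # Split the items into consecutive chunks of at most batch_size.
        return [
            self.items[i:i + batch_size]
            for i in range(0, len(self.items), batch_size)
        ]


def window_shuffle(ds: ToyDataset, *, window_size: int, seed: int) -> ToyDataset:
    """Hypothetical free function: shuffles items within consecutive windows."""
    rng = random.Random(seed)
    out = []
    for start in range(0, len(ds.items), window_size):
        window = ds.items[start:start + window_size]
        rng.shuffle(window)  # elements never leave their own window
        out.extend(window)
    return ToyDataset(out)


# The chain reads top to bottom in the order the steps are applied.
batches = (
    ToyDataset(range(8))
    .pipe(window_shuffle, window_size=4, seed=42)
    .batch(4)
)
```

Because pipe only forwards its arguments, any callable that takes a dataset as its first argument, whether a class constructor or a plain function, can be placed in the chain without adding a method to the dataset class.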

@copybara-service copybara-service bot force-pushed the test_726948545 branch 3 times, most recently from 1851d76 to 0027e19 Compare February 21, 2025 00:04
PiperOrigin-RevId: 729289880
@copybara-service copybara-service bot merged commit 42039a1 into main Feb 21, 2025
1 of 2 checks passed
@copybara-service copybara-service bot deleted the test_726948545 branch February 21, 2025 00:17