*: framework agnostic loaders #682

tharvik · 2024-06-05T12:51:59Z

on the road to supporting ONNX, we first need to extract TFJS from the code base, making discojs framework agnostic.
based on #735.

adding a Dataset type, wrapping standard AsyncGenerator with some helpers (map, zip, split, batch, ...)
- for now, disco.fit and validator.{assess,predict} require more type information (via TypedDataset) as these are supporting any dataset for any task. it should be enforced at some point via Task splitting so that we can drop it.
- now, any user of disco has to use it, keeping the remaining Datasplit, Data, tf.* internal to disco (and removing these when preprocessing and task typing are reworked)
also exposes types: Image, Tabular & Text
redo loaders to be small functions rather than cumbersome-to-use classes
added some convertors as a potential new way to do preprocessing
- flat functions that encode type changes instead of wrapping in tf.TensorContainer (pretty much equivalent to any)
webapp is now typing its datasets, having a clear separation between Image, Tabular & Text, with Data component emitting changes
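
To make the "Dataset wrapping an AsyncGenerator with helpers" idea concrete, here is a minimal sketch of what such a wrapper could look like. It is not the discojs implementation: the class body, the `fromArray` helper and the exact method signatures are assumptions for illustration, and only `map` and `batch` are shown.

```typescript
// Hypothetical sketch of a framework-agnostic dataset; not the actual discojs code.
class Dataset<T> implements AsyncIterable<T> {
  // A factory is stored instead of a generator so the dataset can be iterated more than once.
  readonly source: () => AsyncIterator<T>;

  constructor(source: () => AsyncIterator<T>) {
    this.source = source;
  }

  // Illustrative helper (an assumption): build a dataset from in-memory values.
  static fromArray<T>(items: readonly T[]): Dataset<T> {
    return new Dataset<T>(async function* () {
      yield* items;
    });
  }

  [Symbol.asyncIterator](): AsyncIterator<T> {
    return this.source();
  }

  // Lazily apply `fn` to every element, changing the element type from T to U.
  map<U>(fn: (item: T) => U): Dataset<U> {
    const self = this;
    return new Dataset<U>(async function* () {
      for await (const item of self) yield fn(item);
    });
  }

  // Group consecutive elements into arrays of `size`; the last batch may be smaller.
  batch(size: number): Dataset<T[]> {
    if (!Number.isInteger(size) || size <= 0) {
      throw new Error("batch size must be a positive integer");
    }
    const self = this;
    return new Dataset<T[]>(async function* () {
      let current: T[] = [];
      for await (const item of self) {
        current.push(item);
        if (current.length === size) {
          yield current;
          current = [];
        }
      }
      if (current.length > 0) yield current;
    });
  }
}
```

Because nothing in this sketch refers to tf.*, the element type carries all the information: `Dataset.fromArray([1, 2, 3]).map((x) => x * 2).batch(2)` is a `Dataset<number[]>` that can be consumed with `for await`.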

JulienVig

Huge amount of work, well done!! Things feel much cleaner now!

I think there are a few issues with the webapp right now:

Clicking next on a task without connecting data doesn't display anything anymore at the Training step (while it used to display the training board)
Connecting data, clicking next to the Training board now displays the training things correctly but if I go back I can't clear the connected files anymore.
Connecting the LUS COVID dataset and training fails with error Error: Based on the provided shape, [224,224,3], the tensor should have 150528 values but has 200704, maybe linked to the new image loading function? A similar error occurs on simple face: Error: Based on the provided shape, [200,200,3], the tensor should have 120000 values but has 160000
Training on titanic works well, however the webapp fails to display the testing page correctly afterward and the Test button doesn't appear. (Sidenote titanic training currently fails on develop with error: Error: A Dataset iterator for fitDataset() is expected to generate objects of the form {xs: xVal, ys: yVal}, where the two values may be tf.Tensor, an array of Tensors, or a map of string to Tensor. The provided Dataset instead generates Tensor so you seemed to have fixed this error)
Training on the Skin condition and CIFAR10 tasks fails with error TypeError: valueScalar is undefined, you can find the sample skin condition dataset here.
There is no TextDatasetInput.vue so there is no way to connect data to the wikitext task in the webapp

I've been terribly confused with the naming of Dataset, Data, DataSplit and tf.data.Dataset. A list of specific things I don't like about the current abstractions:

Dataset is not a data structure composed of Data elements
Data is actually a dataset (and it is Data that wraps tf.data.Dataset rather than Dataset)
DataSplit represents a train-test split of Data rather than a dataset
dataset ends up being an attribute of data like in data.train.preprocess().batch().dataset

Can we either add more class documentation to clarify the relationships or rework the object-oriented architecture in a way that makes things self-explanatory? This is a bit beyond the scope of this PR but it feels like the right time to address it? Feel free to ignore, otherwise I'm also happy to create an additional PR myself.

cli/src/benchmark_gpt.ts

discojs-node/src/loaders/index.ts

discojs-node/src/loaders.spec.ts

discojs-node/src/loaders/image.ts

cli/src/data.ts

discojs/src/task/training_information.ts

webapp/src/components/data/TabularDatasetInput.vue

docs/examples/wikitext.ts

webapp/src/components/containers/ImageCard.vue

webapp/src/components/data/ImageDatasetInput.vue

tharvik · 2024-06-11T08:37:13Z

Clicking next on a task without connecting data doesn't display anything anymore at the Training step (while it used to display the training board)

as no dataset was entered (it's undef), one can't train anything. I added a small message to tell users to enter some. one more functional way would be to rework the steps to disable the next step until data are entered.

Connecting data, clicking next to the Training board now displays the training things correctly but if I go back I can't clear the connected files anymore.

good catch! I indeed forgot to propagate the "clear file" event

Connecting the LUS COVID dataset and training fails with error Error: Based on the provided shape, [224,224,3], the tensor should have 150528 values but has 200704, maybe linked to the new image loading function? A similar error occurs on simple face: Error: Based on the provided shape, [200,200,3], the tensor should have 120000 values but has 160000

indeed, I mixed RGBA & RGB images, added a real Image type and related remove_alpha convertor.
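
The numbers in the error match a four-channel buffer being read as three channels: 224 × 224 × 4 = 200704 versus 224 × 224 × 3 = 150528, and 200 × 200 × 4 = 160000 versus 200 × 200 × 3 = 120000. A minimal sketch of an alpha-dropping convertor follows; the `PixelImage` shape and the `removeAlpha` name are assumptions for illustration, not the actual discojs `Image` type or `remove_alpha` implementation.

```typescript
// Hypothetical image representation: row-major pixels, `channels` bytes per pixel.
interface PixelImage {
  width: number;
  height: number;
  channels: 3 | 4;
  data: Uint8Array;
}

// Drop the alpha channel of an RGBA image; RGB images are returned unchanged.
function removeAlpha(image: PixelImage): PixelImage {
  if (image.channels === 3) return image;

  const pixels = image.width * image.height;
  if (image.data.length !== pixels * 4) {
    throw new Error(
      `expected ${pixels * 4} values for an RGBA image but got ${image.data.length}`,
    );
  }

  const rgb = new Uint8Array(pixels * 3);
  for (let p = 0; p < pixels; p++) {
    // Copy R, G and B; skip the fourth (alpha) byte of every pixel.
    rgb[p * 3] = image.data[p * 4];
    rgb[p * 3 + 1] = image.data[p * 4 + 1];
    rgb[p * 3 + 2] = image.data[p * 4 + 2];
  }

  return { width: image.width, height: image.height, channels: 3, data: rgb };
}
```

Carrying the channel count in the type is what lets a loader state whether it produces RGB or RGBA, so the mismatch shows up before a tensor is ever built.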

Training on titanic works well, however the webapp fails to display the testing page correctly afterward and the Test button doesn't appear.

right, thanks for catching that, I misplaced a toRaw in there. NB: this is needed as VueJS proxies every value, but it doesn't work with JS private fields (# prefixed values).
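
The interaction between proxies and private fields can be reproduced without Vue. In the sketch below a plain `Proxy` with an empty handler stands in for Vue's reactive wrapper (an assumption made to keep the example self-contained), and `Counter` and `tryIncrement` are hypothetical names.

```typescript
// A class with a JS private field, like the ones used inside discojs.
class Counter {
  #count = 0;

  increment(): number {
    this.#count += 1;
    return this.#count;
  }
}

const raw = new Counter();
// Stand-in for a reactive wrapper: forwards everything to `raw`.
const proxied = new Proxy(raw, {});

// Call `increment` and report whether the private field access succeeded.
function tryIncrement(target: Counter): string {
  try {
    return `ok: ${target.increment()}`;
  } catch (e) {
    return e instanceof TypeError ? "TypeError" : "other error";
  }
}

const onRaw = tryIncrement(raw);
// Inside increment(), `this` is the proxy, which does not own #count.
const onProxy = tryIncrement(proxied);
```

Here `onRaw` is `"ok: 1"` while `onProxy` is `"TypeError"`: the method is found through the proxy, but it then runs with `this` bound to the proxy, and private fields are not forwarded. Vue's `toRaw` returns the original object behind a reactive proxy, which is why the call has to sit in the right place.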

Training on the Skin condition and CIFAR10 tasks fails with error TypeError: valueScalar is undefined, you can find the sample skin condition dataset here.

ho oupsi, I'm generating an empty dataset when uploading a folder, should have added a throwing TODO, thanks for catching that.

There is no TextDatasetInput.vue so there is no way to connect data to the wikitext task in the webapp

correct, there isn't any. we can add one in a next iteration but that's out of scope of the PR. from what I remember, one student should have worked on that but it wasn't clear if it happened in the end.

I've been terribly confused with the naming of Dataset, Data, DataSplit and tf.data.Dataset. A list of specific things I don't like about the current abstractions:

I totally agree with you, it's not understandable. that's also why this PR is introducing a new one that should replace all of these in its next iterations (once I've reworked the convertors).

Dataset is not a data structure composed of Data elements

that would be the central (and only) way to have a series of data (image/tabular/text/...) with transformations applied on it.

Data is actually a dataset (and it is Data that wraps tf.data.Dataset rather than Dataset)

it will be dropped, replaced by Dataset<Image>, Dataset<Tabular> & Dataset<Text> and some convertors applied on it.

DataSplit represents a train-test split of Data rather than a dataset

it will be replaced by the return type of Dataset<T>.split() ie [Dataset<T>, Dataset<T>].

dataset ends up being an attribute of data like in data.train.preprocess().batch().dataset

it will be real values, not fields, by shaping Dataset<T> via convertor functions. for example, Dataset<Tabular> -> extract_columns -> convert_to_numbers -> Dataset<[features: number[], label: number]>
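
The pipeline in that example can be sketched as two flat functions whose signatures encode the type change at each step. The `TabularRow` type, the function names and their parameters are assumptions for illustration; the real convertors may differ.

```typescript
// Hypothetical row type: column name -> raw string value, as read from a CSV.
type TabularRow = Record<string, string>;

// Tabular row -> [feature strings, label string]; fails loudly on a missing column.
function extractColumns(
  row: TabularRow,
  features: readonly string[],
  label: string,
): [features: string[], label: string] {
  const pick = (column: string): string => {
    const value = row[column];
    if (value === undefined) throw new Error(`missing column: ${column}`);
    return value;
  };
  return [features.map(pick), pick(label)];
}

// [feature strings, label string] -> [feature numbers, label number].
function convertToNumbers(
  [features, label]: [features: string[], label: string],
): [features: number[], label: number] {
  const parse = (raw: string): number => {
    const n = Number(raw);
    if (raw.trim() === "" || Number.isNaN(n)) {
      throw new Error(`not a number: "${raw}"`);
    }
    return n;
  };
  return [features.map(parse), parse(label)];
}
```

Applied per row, for instance through a dataset's `map`, a row such as `{ Age: "22", Fare: "7.25", Survived: "0" }` with features `Age` and `Fare` and label `Survived` becomes `[[22, 7.25], 0]`, with no tf.TensorContainer involved.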

Can we either add more class documentation to clarify the relationships or rework the object-oriented architecture in a way that makes things self-explanatory? This is a bit beyond the scope of this PR but it feels like the right time to address it? Feel free to ignore, otherwise I'm also happy to create an additional PR myself.

as I'm hoping to remove most of these soon, it seems a bit premature to update the doc now. I'll update the doc accordingly in the next one, but feel free to set up a PR if you feel that it's taking too long.

JulienVig · 2024-06-11T08:45:44Z

Sounds good! Is there a quick fix to support text dataset? Merging this PR means that wikitext will not be available on discolab.ai until a new PR implements text datasets.

JulienVig · 2024-06-11T14:29:37Z

as no dataset was entered (it's undef), one can't train anything. I added a small message to tell users to enter some. one more functional way would be to rework the steps to disable the next step until data are entered.

I quite liked simply displaying an error message when training without data ("input a dataset first"). That allowed new users to walk around and see things when they don't want to train an actual model. What do you think of it?

tharvik · 2024-06-12T15:22:23Z

Is there a quick fix to support text dataset? Merging this PR means that wikitext will not be available on discolab.ai until a new PR implements text datasets.

hum, I think that's doable, not for the webapp Tester (it's not really supported now anyway) but for the other parts, yes.

as no dataset was entered (it's undef), one can't train anything. I added a small message to tell users to enter some. one more functional way would be to rework the steps to disable the next step until data are entered.

I quite liked simply displaying an error message when training without data ("input a dataset first"). That allowed new users to walk around and see things when they don't want to train an actual model. What do you think of it?

hum, users are weird, but yeah, makes sense, it's very demo-y; adding it back.

tharvik · 2024-08-12T12:42:49Z

so, it's alive again! way smaller to review thanks to extracted scraps. description updated.

JulienVig · 2024-08-13T09:35:23Z

There seems to be something wrong with the image data loaders: connecting images (by group for lus_covid and by csv for mnist) results in an error (Something went wrong on our side) when clicking train (both alone and collaboratively). Titanic and the llm task work fine.

JulienVig · 2024-08-22T10:57:52Z

Yay we've done it! 🎉

martinjaggi · 2024-08-22T12:39:30Z

congrats!

tharvik force-pushed the 650-framework-agnostic-loaders-tharvik branch from 333bd90 to 0560bb0 Compare June 5, 2024 12:52

tharvik marked this pull request as ready for review June 5, 2024 12:59

tharvik requested a review from JulienVig June 5, 2024 12:59

JulienVig requested changes Jun 6, 2024

tharvik force-pushed the 650-framework-agnostic-loaders-tharvik branch from 0560bb0 to a58143c Compare June 11, 2024 08:34

tharvik force-pushed the 650-framework-agnostic-loaders-tharvik branch 2 times, most recently from 88b3514 to d1437de Compare June 12, 2024 15:22

tharvik force-pushed the 650-framework-agnostic-loaders-tharvik branch from d1437de to fbbd48b Compare June 12, 2024 15:46

tharvik changed the base branch from develop to 669-wikitext-web-julien June 15, 2024 15:02

tharvik force-pushed the 650-framework-agnostic-loaders-tharvik branch from 976b163 to 649c7a0 Compare June 15, 2024 15:03

Base automatically changed from 669-wikitext-web-julien to develop June 17, 2024 07:07

tharvik force-pushed the 650-framework-agnostic-loaders-tharvik branch from 649c7a0 to 15a2a27 Compare July 3, 2024 23:30

tharvik force-pushed the 650-framework-agnostic-loaders-tharvik branch 2 times, most recently from 3c1b6ef to 667cf26 Compare July 15, 2024 13:32

tharvik added this to the v4.0.0 milestone Jul 23, 2024

tharvik mentioned this pull request Jul 31, 2024

scraps from #682 #735

Merged

tharvik force-pushed the 650-framework-agnostic-loaders-tharvik branch from 667cf26 to 6c4ce8a Compare August 7, 2024 22:44

tharvik changed the base branch from develop to NAN-scraped-cleanups-tharvik August 7, 2024 22:45

tharvik force-pushed the NAN-scraped-cleanups-tharvik branch from e1f1ff4 to aae81eb Compare August 8, 2024 12:15

tharvik force-pushed the 650-framework-agnostic-loaders-tharvik branch from 6c4ce8a to 54478b6 Compare August 8, 2024 12:38

tharvik force-pushed the NAN-scraped-cleanups-tharvik branch from aae81eb to bb30a7e Compare August 9, 2024 11:59

Base automatically changed from NAN-scraped-cleanups-tharvik to develop August 9, 2024 12:09

tharvik force-pushed the 650-framework-agnostic-loaders-tharvik branch from 54478b6 to 1a96493 Compare August 12, 2024 12:37

tharvik requested a review from JulienVig August 12, 2024 12:42

tharvik added 23 commits August 22, 2024 10:06

discojs/dataset: revamp

5cc875d

discojs-*: rework loaders

2b02823

webapp/data: rework

e48fd32

*: use Dataset

ce5169b

discojs*: rm *Loader & DatasetBuilder

4165165

discojs/Image: add full type

800204a

*: import tfjs-node for speed

0e0e23a

webapp/training: stop via AbortController

4385e18

webapp/training: back to stop-by-throwing This reverts commit 7b259af.

webapp/tests: train lus covid

68b928f

webapp/training: fix unraw dataset use

ce13404

cli: use path.join where possible

76f629e

webapp/tests: commonize server setup

f7a0307

webapp/tests: add testing e2e

8c46f3f

webapp/testing: fix text test

81141cf

*: avoid some tf memory leak

6e5c894

discojs/prepocessing/text: fix type check

265b578

discojs: convertors -> processing

ee1bba2

discojs: add some comments

46e6e22

examples/training: fix comment

9832f19

webapp: small fixes

20a05ce

webapp/testing: specialize Tester

e72c1b1

webapp/store/validation: simplify

d0dd431

discojs/validation: simplify

0fd9781

tharvik force-pushed the 650-framework-agnostic-loaders-tharvik branch from 0c8cc80 to 0fd9781 Compare August 22, 2024 10:21

tharvik merged commit 58f916c into develop Aug 22, 2024
23 checks passed

tharvik deleted the 650-framework-agnostic-loaders-tharvik branch August 22, 2024 10:38

JulienVig mentioned this pull request Oct 7, 2024

Tabular columns are not checked anymore before training #801

Closed

tharvik mentioned this pull request Oct 28, 2024

*: framework agnostic preprocessing #781

Merged
