Visualize size of processed datasets #662

lukaszdz · 2021-12-01T21:58:49Z

Description

I'm always frustrated when I'm running daily or weekly sets of modular pipelines and my final output does not make complete sense. This indicates that there was an issue when running the pipeline but I'm not sure, at a glance, what step didn't provide output.

One example problem: one initial dataset had the mapping of market IDs. One day, the market ID for our second biggest market was omitted from the first step, causing all subsequent downstream analysis to be off by a nontrivial amount.

Context

This change is important to me because it would help me, at a glance, identify changes across runs through visual cues, so I know where to begin.

Possible Implementation

Visualize the total size of each dataset that has been processed via kedro viz:
The day that things ran correctly:

The day that things failed:

Would be nice to also visualize the nodes that had been attempted to run, but failed

In this example, by visualizing the size of each step that had been run, you would immediately see that the data set with the biggest difference was the companies set. Even though the pipeline strictly failed a step later, you would immediately know where to start debugging.

antonymilne · 2021-12-02T10:41:33Z

This is more of a kedro-viz issue so I've moved it 🙂 This is a great suggestion @lukaszdz and is also something I've pondered before so let me add some thoughts here... Related: kedro-org/kedro#1076 https://github.com/quantumblacklabs/private-kedro/issues/1148

Current methods for tracking dataset size

For an immediate solution, outside viz there are actually a couple of different ways you might be able to achieve what you're looking for already:

great expectations via the kedro-great plugin. I'm not at all familiar with this myself but I imagine you should be able to write some rule that validates the number of rows in a dataframe
use a hook that emits a log message (to the console or a file) giving the number of rows in the dataframe like this

And one which will show you something, though not exactly what you want, in kedro viz:

make a node that takes in all the datasets you want to check the size of and save the information to one of the new tracking.JSONDataSet or tracking.MetricsDataSet datasets. As part of the new experiment tracking functionality you would then be able to visualise this in a graph in kedro viz, including seeing how the number changes over time between different runs
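
A minimal sketch of such a node function, assuming every input supports `len()` (as pandas dataframes do). The name `collect_row_counts` is hypothetical; in a project it would be registered as a node whose inputs are passed as keyword arguments and whose output is saved to one of the tracking datasets.

```python
from typing import Any, Dict


def collect_row_counts(**datasets: Any) -> Dict[str, float]:
    """Return the number of rows of every dataset passed in, keyed by name.

    Values are floats so the result is a flat mapping of metric name
    to number.
    """
    return {name: float(len(data)) for name, data in datasets.items()}


if __name__ == "__main__":
    # Plain lists stand in for dataframes here.
    print(collect_row_counts(companies=[1, 2, 3], shuttles=[]))
```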

More generally

I love this idea and would actually like to make it more general. As a user, I might want to keep track of lots of different things about a dataset: number of rows/columns, number of unique entries in a particular column, number of N/As, etc. Enabling something that visualises the number of rows in a dataset of type pandas.* is just one particular example of this - in reality I might like to track any sort of thing for any sort of dataset. Let me call this a "trackable".

In the future I think there should be two possible methods for this:

via experiment tracking - this is already work in progress. You can write code to calculate whatever trackable you like in a node and then save it to a tracking dataset. Crucially this will give you a sense of how the trackable changes between one kedro run and the next, since I should be able to go back in time and visualise the pipeline and datasets of historic runs.
some kind of customisable "widget" which allows me to give, in the catalog, as many trackables as I like, e.g. (completely made up example syntax)

shuttles:
    type: pandas.CSVDataSet
    filepath: ...
    viz_widgets:
        number_of_rows
        number_of_na: column1, column2, column3
        my_custom_widget

Where we supply with kedro viz a few common widgets like number_of_rows, but a user can define their own my_custom_widget also so it's very flexible. The natural place for this information to be shown on kedro viz would be the side panel on the right hand side that appears when you click on a dataset. But it would be super cool if somehow we could make the pipeline visualisation customisable with user-pluggable widgets too.

Visualising failed nodes

This would also be great, and actually I don't think we're too far off being able to do it. We already hacked together something which gets halfway there during a hackathon. Again I'd actually go further here: ideally kedro viz would live update while you're doing a run and show which is the currently running node, and I'd also be able to trigger runs from kedro viz.

antonymilne · 2021-12-02T10:43:27Z

FYI @MerelTheisenQB @tynandebold @studioswong very relevant to what we did during the hackathon and the general question of people tracking things through kedro-viz that aren't metrics in the traditional sense (i.e. not model performance).

tynandebold · 2023-06-05T14:57:57Z

We have a design for a possible solution here, which looks like this:

This feature becomes unlocked by this change as well as an addition we'd have to make in Kedro datasets.

NeroOkwa · 2023-07-11T15:14:39Z

Copying a user's comment and request for this feature on the slack channel here:

"I want to log the number of rows for the datasets at each step of my pipeline. It's for debugging. The goal is to notice big drop of rows during one data transformation step. For example, after one node, I may see that my number of lines drops by 30% when it’s supposed to stay the same."

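The check described in this request can be sketched in plain Python, assuming the row count of each step has already been collected in pipeline order. The function name `find_row_drops` and the 30% default threshold are illustrative only and are not part of Kedro or Kedro-Viz.

```python
from typing import Dict, List, Tuple


def find_row_drops(
    row_counts: Dict[str, int], threshold: float = 0.3
) -> List[Tuple[str, str, float]]:
    """Flag consecutive steps whose row count falls by at least `threshold`.

    `row_counts` maps step name to row count, in pipeline order.
    Each result is (previous step, step, fractional drop).
    """
    drops = []
    steps = list(row_counts.items())
    for (prev_name, prev_rows), (name, rows) in zip(steps, steps[1:]):
        if prev_rows == 0:
            continue  # nothing left to lose; also avoids dividing by zero
        drop = (prev_rows - rows) / prev_rows
        if drop >= threshold:
            drops.append((prev_name, name, drop))
    return drops
```
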
datajoely · 2023-07-21T17:06:51Z

Hey everyone - I was chatting to Nero seeing this go into progress and I have some thoughts on the feature because there is a lot of potential value here.

Evaluating the original user request against the sidebar solution

The user wanted to show custom metadata directly on the flowchart
The point of this was to enable a direct comparison between nodes and a bird's eye view
Pushing this information to the sidebar doesn't allow the user to compare this data in any meaningful way. They can't compare two or more datasets to make any decisions like the empty file issue the user reports.
Providing a comparison workflow is important here, a low effort way of doing so would be providing some sort of table view.

Challenging the decision to index tightly on dataset statistics, we should provide a mechanism to provide key/values arbitrarily

I would challenge the idea that we should opine on what statistics are important for our users
The new metadata YAML allows us to provide any arbitrary field, why limit this to just dataset statistics? Users will immediately start asking for other attributes.

We need to provide a way of letting users configure this dynamically

If we stick to the dataset statistics point, no user is going to update these data points manually.
We need to provide an interface for doing so dynamically. Hooks are the right solution for this.
This is a super naive solution to the actual problem, but we should be building this in a way that empowers users to add their own data:

from kedro.framework.hooks import hook_impl
from kedro.io import DataCatalog


class VizMetricHooks:
    @hook_impl
    def after_catalog_created(self, catalog: DataCatalog) -> None:
        def _add_shape_metadata(dataset):
            # Naive: loads the whole dataset just to read its shape.
            rows, columns = dataset.load().shape
            metadata = {
                "kedro_viz": {"side_bar": {"num_rows": rows, "num_columns": columns}}
            }
            dataset.metadata = metadata
            return dataset

        # Only pandas datasets that already exist on disk; parameters are skipped.
        pandas_datasets = {
            name: _add_shape_metadata(dataset_instance)
            for name, dataset_instance in catalog.datasets.__dict__.items()
            if not name.startswith("param")
            and "pandas" in str(type(dataset_instance))
            and dataset_instance.exists()
        }

        # Re-register each dataset so the catalog holds the annotated instance.
        for name, dataset_instance in pandas_datasets.items():
            catalog.add(name, dataset_instance, replace=True)

ravi-kumar-pilla · 2023-07-21T19:01:42Z

I agree with @datajoely on this. Getting the statistics displayed in the metadata panel would be helpful, but it will be really hard for users to compare them and get a bird's eye view. If we do not want to clutter the flowchart with the stats view, we need some sort of comparison view (a table, maybe). We can expand on this once we have new designs for the comparison view.
Thank you !

ravi-kumar-pilla · 2023-07-24T14:40:03Z

Hi Team,

@merelcht, @noklam, @rashidakanchwala
  
I am working on this story and I need some suggestions. 

Considered approach to support pandas.CSVDataSet and pandas.ExcelDataSet -

In the catalog files, users can mention profiler_args as below -

reviews:
  type: pandas.CSVDataSet
  filepath: ${base_location}/01_raw/reviews.csv
  metadata:
    kedro-viz:
      layer: raw
      preview_args: 
        nrows: 10
      profiler_args:
        show: true

Based on profiler_args show key, we will get the stats (rows, columns, file size) without loading the entire file into memory.
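
For a local CSV file, the idea of getting these stats without loading the whole file into memory could look roughly like the sketch below. It uses only the standard library and assumes the first row is a header; `csv_stats` is a hypothetical helper, not the code on the linked branch.

```python
import csv
import os
from typing import Dict


def csv_stats(filepath: str) -> Dict[str, int]:
    """Count data rows and columns of a CSV by streaming it row by row."""
    with open(filepath, newline="") as handle:
        reader = csv.reader(handle)
        header = next(reader, [])  # an empty file yields an empty header
        rows = sum(1 for _ in reader)  # only one row is held in memory at a time
    return {
        "rows": rows,
        "columns": len(header),
        "file_size": os.path.getsize(filepath),  # bytes, from filesystem metadata
    }
```

The row count still requires one full pass over the file, so it is cheap in memory but not in time; the file size comes from metadata alone.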

Questions -

For local files, this can be achieved using csv and openpyxl, like this - https://github.com/kedro-org/kedro-plugins/compare/feature/profiler-csv-excel (any suggestions would help).
I would like to know how can we do profiling without loading the entire file to memory when the files are stored in remote locations (S3, Azure, GCS, HTTPS) ?
Should we support profiling for remote locations or just local ?

Thank you !

merelcht · 2023-07-24T15:31:19Z

Considered approach to support pandas.CSVDataSet and pandas.ExcelDataSet -

In the catalog files, users can mention profiler_args as below -
reviews:
  type: pandas.CSVDataSet
  filepath: ${base_location}/01_raw/reviews.csv
  metadata:
    kedro-viz:
      layer: raw
      preview_args: 
        nrows: 10
      profiler_args:
        show: true
Based on profiler_args show key, we will get the stats (rows, columns, file size) without loading the entire file into memory.

I think the above is actually a bit inconsistent. If you call the key profiler_args I'd expect to be able to provide the arguments of what's going to be displayed. Whereas "show" doesn't specify at all what's going to be shown. So in this case maybe it could be a list like:

profiler_args:
   - rows
   - columns
   - file_size

That also allows for flexibility where for some datasets you can show all these things and others maybe only the file size.
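
One way to read such a list is as a set of stat names that are looked up in a registry, so that each dataset computes only what it asks for. The sketch below is an assumption about how that could work, not an existing Kedro-Viz API: `STAT_FUNCTIONS` and `profile` are made-up names, and the line-based row count is deliberately naive (it miscounts CSVs with embedded newlines).

```python
import os
from typing import Callable, Dict, List, Optional


def _count_data_rows(filepath: str) -> int:
    """Count lines after the header, reading one line at a time."""
    with open(filepath) as handle:
        return max(sum(1 for _ in handle) - 1, 0)


# Hypothetical registry mapping a stat name to the function that computes it.
STAT_FUNCTIONS: Dict[str, Callable[[str], int]] = {
    "rows": _count_data_rows,
    "file_size": os.path.getsize,
}


def profile(filepath: str, profiler_args: List[str]) -> Dict[str, Optional[int]]:
    """Compute only the requested stats; unsupported names map to None."""
    return {
        name: STAT_FUNCTIONS[name](filepath) if name in STAT_FUNCTIONS else None
        for name in profiler_args
    }
```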

Questions -

For local files, this can be achieved using csv and openpyxl, like this - https://github.com/kedro-org/kedro-plugins/compare/feature/profiler-csv-excel (any suggestions would help).

I would like to know how can we do profiling without loading the entire file to memory when the files are stored in remote locations (S3, Azure, GCS, HTTPS) ?

Should we support profiling for remote locations or just local ?

I think this depends on what "metrics" we exactly want to show. I think it should be possible to get file size without downloading the data, but maybe some of the other things are not possible to provide without downloading.

NeroOkwa · 2023-07-26T12:55:10Z

Thanks @datajoely for the comments, I agree with this.

Summary

The goal of this ticket is to help a user debug their dataset, by enabling them to easily compare (preset) attributes that may have changed during data transformation of a run. Yes having the information in the sidebar limits data comparison, which is the user’s objective.

As mentioned by @antonymilne above:

As a user, I might want to keep track of lots of different things about a dataset: number of rows/columns, number of unique entries in a particular column, number of N/As, etc. Enabling something that visualises the number of rows in a dataset of type pandas.* is just one particular example of this - in reality I might like to track any sort of thing for any sort of dataset.

The first step would be focusing on dataset statistics, e.g. number of rows/columns, and later other attributes (based on feedback and metrics).
Another opportunity as highlighted by @datajoely would be to provide an interface in Kedro-Viz for users to dynamically configure these attributes (via hooks) vs the current manual approach.
The next step and opportunity would be for users to be able to debug nodes, by ‘visualising failed nodes’, but that’s beyond the scope of this ticket.

Based on all of this and a conversation with @studioswong, here are some potential next steps.

Potential next steps

Similarly to how when you click on a node and the side panel opens with an option to ‘Show Code’, we can have the same implementation when you click on a dataset but the show code would open up a canvas with the comparison table. This is a better MVP solution than using the side bar only, and we don’t have to change the existing flowchart.
We can design the comparison table using the ‘Compare runs’ feature in experiment tracking as inspiration.

CC @amandakys @stephkaiser @ravi-kumar-pilla

amandakys · 2023-07-26T16:34:29Z

Had a really productive chat with @ravi-kumar-pilla today about the dataset statistics in the metadata panel

Some key takeaways:
Dataset Statistics in the Metadata panel

the loading icon displayed while dataset statistics are being fetched should be moved inside the metadata panel rather than displayed above the main flowchart. @amandakys to provide visuals for this
it might be worth displaying the dataset statistics label in the metadata panel even when they aren't enabled for that dataset just for visibility. This will also make it clearer what is being loaded. If profiler args aren't enabled, it can display something like "not configured" so users know that they can take steps to do that if they want dataset statistics.

Dataset statistics comparison

Using a Show Code style toggle to open up a panel with a comparison table is a valid option for enabling comparison, but it is a multi-dataset feature that will be accessible only by first selecting a dataset.
We discussed options to visualise these statistics on the flowchart itself like shown by the user. An alternative could be designing "profiler mode" similar to the "show/hide labels" which changes the flowchart's display to show relevant dataset statistics and enable comparison of statistics in conjunction with the display of dataset relationships that is available with the flowchart. This is still a very rough idea and will need further investigation.
I like the idea of taking inspiration from the compare runs feature, as that will increase consistency and I'll be looking into this next.

ravi-kumar-pilla · 2023-07-26T18:15:23Z

I had a discussion with @rashidakanchwala about what statistics we can display for quick debugging. Retrieving the total number of rows/columns seems to be an expensive operation for some dataset types, like Excel.

Also, some datasets, such as PlotlyDataSet or JSON, might not have rows/columns at all. So we thought this ticket needs a technical discussion about which stats can be available for all datasets and will be useful for debugging.

One such stat we thought of was the file size. Getting the file size is likely to be less expensive and can give some detail for debugging if something is drastically wrong. As far as the implementation goes, we are not sure whether extracting the file size should be part of each kedro-datasets dataset or part of the Kedro Framework AbstractDataSet implementation. It would be great to have this in a technical discussion across the team.

@merelcht @astrojuanlu @noklam please suggest

Thank you !

noklam · 2023-07-26T21:37:42Z

Imo it shouldn't be implemented in kedro or kedro-dataset. The preview method was viz only, why can't it be implemented on viz side instead? This should be true for any other plugins.

In terms of implementation of the feature, filesize is cheap to get via the filesystem. For columns and rows, maybe we can just trim it if it exceeds a certain number of rows and say "more than 1000000 rows".
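
A rough sketch of the capped count for a local CSV file, assuming a header row. `capped_row_count` is a hypothetical helper; it stops reading one row past the cap, so very large files are not scanned to the end.

```python
import csv
from itertools import islice


def capped_row_count(filepath: str, cap: int = 1_000_000) -> str:
    """Return the exact row count up to `cap`, else 'more than <cap> rows'."""
    with open(filepath, newline="") as handle:
        reader = csv.reader(handle)
        next(reader, None)  # skip the header row
        # Read at most cap + 1 rows: one extra row proves the cap was exceeded.
        seen = sum(1 for _ in islice(reader, cap + 1))
    if seen > cap:
        return f"more than {cap} rows"
    return f"{seen} rows"
```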

More crazy idea, can viz use hooks to record the statistic during a kedro run? This way there is no cost to read the stats.

ravi-kumar-pilla · 2023-07-26T23:09:19Z

Thank you @noklam. I see what you are saying; it makes sense to have it on the viz side.

I would not completely agree on trimming the rows info as this still takes time and also we might not have rows for all datasets.

I think for the first pass, we can get the file size stat across all datasets. I am not well aware of the hook implementation you suggested here. If the crazy idea is efficient, we should do that :D

@tynandebold any suggestions here ?

Thank you

astrojuanlu · 2023-07-27T08:45:04Z

Does "file size" make sense for, say, APIDataSet?

I essentially agree with @tynandebold above, this should probably be focused on arbitrary key-value pairs and datasets can expose that dataset_info somehow.

tynandebold · 2023-07-27T09:11:45Z

A lot of good points being raised. Let me synthesize some of it and make some suggestions:

Design opportunities

At this stage, one main constraint is UI/UX design. The completed designs have this feature living in the Metadata panel, which, as many of you have raised, is suboptimal and doesn't add much value. Nevertheless, if we can't get a new design done that moves some of this information into the flowchart by the time the engineering work is ready, my suggestion is to first release the work in the Metadata panel and then move it elsewhere once the design is ready.

On the subject of some type of comparison table, I think that's outside of the scope of this implementation work. I'd rather see us work towards some sort of "dev mode" or "debug mode" toggle, which shows more detailed information on the flowchart when it's enabled, as written by @amandakys above.

Lastly, on this point:

the loading icon displayed while dataset statistics are being fetched should be moved inside the metadata panel rather than displayed above the main flowchart. @amandakys to provide visuals for this

Are you saying we replace the main loading indicator we have over the flowchart and move it in the metadata panel? If yes, I don't think we should do that, as the flowchart sometimes needs an indicator to show when it's loading for larger pipelines. We can add a loading indicator into the Metadata panel, and it should probably match with the skeleton loader we have in experiment tracking, since it's inline data.

Engineering opportunities

A big question is around what should we allow the user to show. I agree with @datajoely here, in that we should allow them to configure key/values arbitrarily in the new metadata YAML, and even better if we can do that dynamically with something like VizMetricHooks as he used as an example.

If we could define some defaults here, like "file size"/"rows"/"columns" that may be useful, and as @amandakys wrote above, display them even if there's no value for that particular dataset to promote discoverability.
@noklam I don't think Viz can use hooks and I think that would need to be done in Kedro, right?
My suggestion here is to try and get "file size"/"rows"/"columns" to show up for every dataset, and for the ones where it doesn't make sense, don't show a value. Once we have that wired up, it's trivial enough to move the data from the Metadata panel to another part of the app.

amandakys · 2023-07-27T09:28:37Z

This is a great summary 🚀

On the loading indicator, when Ravi showed me a demo of the feature, the loading icon was displayed over the flowchart. It did not block interaction with the flowchart and was there to indicate that metadata was loading. This felt misleading as it was not indicating that the flowchart was loading.

I was not suggesting we move the global loading icon to the metadata panel, just that metadata loading should be indicated in the metadata panel. For this the skeleton loader sounds like the best solution.

From my side, the things that would be relevant to this ticket's implementation work are:

using the skeleton loader instead of the flowchart loading icon when metadata panel is getting data
adding default values or the "dataset statistics" label to the metadata panel even when no value is available for discoverability

Based on @tynandebold's comment here I've opened another ticket to explore the concept of a dev/debug mode. The need, the use cases and the opportunities. #1464

On the subject of some type of comparison table, I think that's outside of the scope of this implementation work. I'd rather see us work towards some sort of "dev mode" or "debug mode" toggle, which shows more detailed information on the flowchart when it's enabled

noklam · 2023-07-27T15:07:17Z

@ravi-kumar-pilla #1465 a quick PoC to demonstrate what I mean.

ravi-kumar-pilla · 2023-07-27T18:43:23Z

Hi @noklam , Thank you for the quick POC. I am not familiar with python hooks or Kedro Framework hook used in the POC.

I think we should collect stats during a kedro run and then kedro viz can read the stats file to display the metadata. This would be the most optimal way to retrieve the stats as they are pre-calculated.
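
The flow described here (record during the run, read afterwards) can be sketched without any Kedro dependency as below. `StatsRecorder`, `read_stats` and the `stats.json` file name are assumptions for illustration; in a Kedro project, `record` could plausibly be called from a hook such as `after_dataset_saved` and `flush` from `after_pipeline_run`, but that wiring is not shown.

```python
import json
from pathlib import Path
from typing import Any, Dict


class StatsRecorder:
    """Collect per-dataset stats during a run and persist them as JSON."""

    def __init__(self, stats_file: str = "stats.json") -> None:
        self.stats_file = Path(stats_file)
        self.stats: Dict[str, Dict[str, int]] = {}

    def record(self, dataset_name: str, data: Any) -> None:
        """Store whatever can be measured cheaply on the in-memory object."""
        entry: Dict[str, int] = {}
        if hasattr(data, "shape") and len(data.shape) == 2:
            # Table-like objects (e.g. dataframes) expose (rows, columns).
            entry["rows"], entry["columns"] = data.shape
        elif hasattr(data, "__len__"):
            entry["rows"] = len(data)
        self.stats[dataset_name] = entry

    def flush(self) -> None:
        """Write everything recorded so far to the stats file."""
        self.stats_file.write_text(json.dumps(self.stats, indent=2))


def read_stats(stats_file: str) -> Dict[str, Dict[str, int]]:
    """What a viewer would do: read the pre-computed stats, loading no data."""
    return json.loads(Path(stats_file).read_text())
```

Because the stats are measured on objects that are already in memory during the run, reading them later costs only one small JSON read.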

As @datajoely pointed out, we need to look at a way to let users configure this dynamically. It would be nice if this metadata could be collected for every run in a database, like experiment tracking, and then viz could read it (we could have a history of metadata changes). I clearly have a huge knowledge gap in this area, so let me understand hooks first before I comment further on this ticket. Thank you !!

noklam · 2023-07-27T19:52:44Z

Happy to walk you through that, maybe can combine it with a few new joiners. It's covered in kedro intermediate training or we can revive the Kedro University.

noklam · 2023-07-31T14:12:18Z

@noklam I don't think Viz can use hooks and I think that would need to be done in Kedro, right?
@tynandebold viz can use hook.

noklam · 2023-07-31T14:19:30Z

I think I am missing context here. I can advise on the implementation and design but I need to understand the scope of this ticket better.

@NeroOkwa Maybe a quick catch up?

What's the goal?

Is there any MVP we aim?
is filesize/row/column enough?
performance concern?
Do we need to cover versions or we only show latest?

There are lots of optimisations we can do; the solution can also be just hooks, plugins,

NeroOkwa · 2023-08-30T10:55:21Z

@lukaszdz this feature has been implemented in the latest Kedro-Viz release. Can you confirm whether this solves your pain point and provide feedback? Thanks.

lukaszdz · 2023-08-30T14:31:16Z

@NeroOkwa This is almost there. Ideally, we would want to see the dataset sizes in the graph view so we can view any issues with the pipeline without having to click through each node in the graph. Even better if we had some way to set up some rules to color the nodes (if N=0, then color the node red)

lukaszdz · 2023-08-30T14:41:45Z

can be viewed directly on the node in the graph view:

can use abbreviations with up to 3 digits to show the rough size/number of rows.
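
A possible formatting rule for such labels, sketched in Python. The `abbreviate` name and the K/M/B suffixes are assumptions; the function truncates rather than rounds so that 999,999 is shown as 999K instead of 1000K, and counts of a trillion or more would exceed three digits.

```python
def abbreviate(n: int) -> str:
    """Shorten a non-negative count to at most 3 digits plus a suffix."""
    for divisor, suffix in ((10**9, "B"), (10**6, "M"), (10**3, "K")):
        if n >= divisor:
            whole = n // divisor
            # Decimal places left over once the whole part is shown.
            decimals = 2 if whole < 10 else 1 if whole < 100 else 0
            # Integer arithmetic truncates instead of rounding up.
            scaled = n * 10**decimals // divisor
            text = f"{scaled / 10**decimals:.{decimals}f}"
            if "." in text:
                text = text.rstrip("0").rstrip(".")  # 77.0 -> 77
            return f"{text}{suffix}"
    return str(n)
```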

If empty - then can be red:

The goal is to be able to quickly visually know whether some steps in the pipeline failed to run.

In the future, you could imagine also having rules to color the node as red if a node deviates from its normal values. for example, say the companies node size is 77,000 rows on Monday, 77,100 on Tuesday, 78,000 Wed, then drops to 10,000 on Thursday. Then you could see at a glance that something failed with the node, visually. This would greatly accelerate debugging pipelines.
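
Such a rule could be sketched as below, using the median of earlier runs as the "normal" value. The function name `node_color` and the 50% tolerance are assumptions chosen for illustration, not an existing Kedro-Viz setting.

```python
from statistics import median
from typing import Sequence


def node_color(history: Sequence[int], current: int, tolerance: float = 0.5) -> str:
    """Return 'red' for an empty dataset or a large deviation from past runs."""
    if current == 0:
        return "red"  # an empty dataset is always flagged
    if not history:
        return "default"  # nothing to compare against yet
    typical = median(history)
    if typical and abs(current - typical) / typical > tolerance:
        return "red"
    return "default"


if __name__ == "__main__":
    # Row counts from the example above: Monday to Wednesday, then Thursday.
    print(node_color([77_000, 77_100, 78_000], 10_000))
```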

NeroOkwa · 2023-09-05T18:31:56Z

@lukaszdz thanks for the feedback.

The goal is to be able to quickly visually know whether some steps in the pipeline failed to run.

I have 3 follow up questions:

1. Previously, what steps have you observed failed in the pipeline run?
2. Isn't the information required to debug the 'failed node' already shown in the CLI?
3. Would you be up for a future user interview about your experience with this feature and Kedro-Viz?

lukaszdz · 2023-09-05T19:05:04Z

1. I don't understand the question.
2. I'm not sure what that screen is, but seeing something in the CLI is not as useful as seeing it visually.
3. I'm down for a 15min user interview, but I do not use kedro at all, because either a) onboarding is too complicated or b) I can't easily build the data pipelines I want, if at all.

noklam · 2023-09-05T19:14:43Z

@lukaszdz thank you for the feedback. For 3, could you give a few details about what kind of data pipeline you are trying to build and where kedro fails you?

lukaszdz · 2023-09-05T22:56:01Z

I would like to be able to easily create a kedro data pipeline and call that from within a function that already exists in my code base. I installed kedro and tried to figure out how to do this from the documentation; couldn't figure it out. Then asked in slack and got a couple answers, which I haven't tried yet. I feel like doing something this simple should take me less than 10 minutes to do, and it should be very very easy/self-evident from the documentation.
I would like to be able to create a data pipeline where I can run jobs across partitions of data. Some examples of this existing in other frameworks are called window functions. For example, I have 500M rows, split by city name, and would like to create and run a node: one node for each market, as part of a pipeline. I wanted to do this a couple years ago, so not sure if this has gotten easier.

NeroOkwa · 2023-09-06T12:43:29Z

@lukaszdz pls share an email address with which I can book the user interview session. Thanks.

lukaszdz · 2023-09-06T14:05:13Z

[email protected]

NeroOkwa · 2023-09-18T13:39:38Z

@lukaszdz, the session has been booked for today 18/09/23.

antonymilne transferred this issue from kedro-org/kedro Dec 2, 2021

limdauto changed the title ~~Visualize size of processed datasets~~ [KED-3038] Visualize size of processed datasets Jan 4, 2022

tynandebold added the Issue: Feature Request label Jan 10, 2022

tynandebold added this to Kedro-Viz Apr 20, 2022

tynandebold moved this to Inbox in Kedro-Viz Apr 20, 2022

tynandebold moved this from Inbox to Backlog in Kedro-Viz May 4, 2022

antonymilne mentioned this issue May 4, 2022

Visualisation of Kedro Hooks #836

Closed

3 tasks

antonymilne mentioned this issue Jun 14, 2022

Kedro-Viz to show preview of data #907

Closed

tynandebold changed the title ~~[KED-3038] Visualize size of processed datasets~~ Visualize size of processed datasets Aug 15, 2022

antonymilne mentioned this issue Aug 24, 2022

Provide simple mechanism for adding icons to datasets #480

Closed

1 task

tynandebold added this to the Deeper insights into datasets milestone Jan 16, 2023

rashidakanchwala mentioned this issue Jun 8, 2023

Add preview to datasets as specified in the Kedro catalog under metadata #1374

Merged

5 tasks

tynandebold moved this from Backlog to Todo in Kedro-Viz Jul 11, 2023

NeroOkwa assigned NeroOkwa and ravi-kumar-pilla and unassigned NeroOkwa Jul 21, 2023

NeroOkwa moved this from Todo to In Progress in Kedro-Viz Jul 21, 2023

tynandebold assigned amandakys Jul 26, 2023

noklam mentioned this issue Jul 27, 2023

[DON'T MERGE] PoC of recording stats during kedro run #1465

Closed

5 tasks

noklam mentioned this issue Jul 31, 2023

Spike: Provide a way for plugins to have runtime configuration and extend CLI arguments kedro-org/kedro#2866

Open

ravi-kumar-pilla mentioned this issue Aug 2, 2023

Visualize Dataset statistics in metadata panel #1472

Merged

5 tasks

amandakys mentioned this issue Jul 27, 2023

Investigate Dev/Debug Mode visualisation on the flowchart #1464

Closed

2 tasks

tynandebold moved this from In Progress to In Review in Kedro-Viz Aug 4, 2023

ravi-kumar-pilla closed this as completed in #1472 Aug 14, 2023

github-project-automation bot moved this from In Review to Done in Kedro-Viz Aug 14, 2023

ravi-kumar-pilla mentioned this issue Aug 30, 2023

Extend Visualize dataset statistics to include additional use cases #1511

Open

1 task

rashidakanchwala mentioned this issue Nov 10, 2023

[Debugging] Visualise dataset statistics in the Flowchart #1635

Open

1 task

astrojuanlu mentioned this issue Nov 18, 2024

Offer public API to get dataset info? kedro-org/kedro-plugins#926

Open
