Visualize size of processed datasets #662
Comments
This is more of a kedro-viz issue so I've moved it 🙂 This is a great suggestion @lukaszdz and is also something I've pondered before so let me add some thoughts here...
Related: kedro-org/kedro#1076 https://github.com/quantumblacklabs/private-kedro/issues/1148
Current methods for tracking dataset size
For an immediate solution, outside viz there are actually a couple of different ways you might be able to achieve what you're looking for already:
And one which will show you something, though not exactly what you want, in kedro viz:
More generally
I love this idea and would actually like to make it more general. As a user, I might want to keep track of lots of different things about a dataset: number of rows/columns, number of unique entries in a particular column, number of N/As, etc. Enabling something that visualises the number of rows in a dataset of type In the future I think there should be two possible methods for this:
Where we supply with kedro viz a few common widgets like
Visualising failed nodes
This would also be great, and actually I don't think we're too far off being able to do it. We already hacked together something which gets halfway there during a hackathon. Again I'd actually go further here: ideally kedro viz would live update while you're doing a run and show which is the currently running node, and I'd also be able to trigger runs from kedro viz. |
FYI @MerelTheisenQB @tynandebold @studioswong very relevant to what we did during the hackathon and the general question of people tracking things through kedro-viz that aren't metrics in the traditional sense (i.e. not model performance). |
We have a design for a possible solution here, which looks like this: This feature becomes unlocked by this change as well as an addition we'd have to make in Kedro datasets. |
Copying a user's comment and request for this feature on the slack channel here: "I want to log the number of rows for the datasets at each step of my pipeline. It's for debugging. The goal is to notice big drop of rows during one data transformation step. For example, after one node, I may see that my number of lines drops by 30% when it’s supposed to stay the same." |
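The debugging goal described in the comment above can be sketched in plain Python. This is a minimal illustration, not part of Kedro or Kedro-Viz: the function name `find_row_drops`, the 30% threshold, and the step names and row counts are all hypothetical.

```python
from typing import List, Tuple


def find_row_drops(
    row_counts: List[Tuple[str, int]], threshold: float = 0.3
) -> List[Tuple[str, str, float]]:
    """Return (previous_step, step, fractional_drop) for each large drop.

    `row_counts` lists (step_name, number_of_rows) in pipeline order.
    """
    drops = []
    # Compare every step with the one immediately before it.
    for (prev_name, prev_rows), (name, rows) in zip(row_counts, row_counts[1:]):
        if prev_rows == 0:
            continue  # nothing left to lose, avoid dividing by zero
        drop = (prev_rows - rows) / prev_rows
        if drop >= threshold:
            drops.append((prev_name, name, drop))
    return drops


# Hypothetical row counts recorded after each step of a pipeline.
counts = [
    ("raw_orders", 1000),
    ("cleaned_orders", 990),
    ("joined_orders", 690),
    ("model_input", 690),
]
flagged = find_row_drops(counts)
for prev_name, name, drop in flagged:
    print(f"{prev_name} -> {name}: rows dropped by {drop:.0%}")
```

Only the join step is flagged here, which is the kind of signal the user wants to see without inspecting each dataset by hand.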
Hey everyone - I was chatting to Nero seeing this go into progress and I have some thoughts on the feature because there is a lot of potential value here.
Evaluating the original user request against the sidebar solution
Challenging the decision to index tightly on dataset statistics, we should provide a mechanism to provide key/values arbitrarily
We need to provide a way of letting users configure this dynamically
from kedro.framework.hooks import hook_impl
from kedro.io import DataCatalog


class VizMetricHooks:
    @hook_impl
    def after_catalog_created(self, catalog: DataCatalog) -> None:
        def _add_shape_metadata(dataset):
            # Loads the full dataset just to read its shape, which can be slow.
            rows, columns = dataset.load().shape
            metadata = {
                "kedro_viz": {"side_bar": {"num_rows": rows, "num_columns": columns}}
            }
            dataset.metadata = metadata
            return dataset

        # Keep only pandas-backed datasets that already exist.
        pandas_datasets = {
            name: _add_shape_metadata(dataset_instance)
            for name, dataset_instance in catalog.datasets.__dict__.items()
            if not name.startswith("param")
            and "pandas" in str(type(dataset_instance))
            and dataset_instance.exists()
        }
        # Re-register each dataset so the catalog carries the new metadata.
        for name, dataset_instance in pandas_datasets.items():
            catalog.add(name, dataset_instance, replace=True)
|
I agree with @datajoely on this. Getting the statistics displayed in the metadata panel would be helpful, but it will be really hard for users to compare them and get a bird's eye view. If we do not want to clutter the flowchart with the stats view, we need some sort of comparison view (a table, maybe). We can expand on this once we have new designs for the comparison view. |
Hi Team, @merelcht, @noklam, @rashidakanchwala Considered approach to support pandas.CSVDataSet and pandas.ExcelDataSet -
Questions -
Thank you ! |
I think the above is actually a bit inconsistent. If you call the key
profiler_args:
  - rows
  - columns
  - file_size
That also allows for flexibility where for some datasets you can show all these things and for others maybe only the file size.
I think this depends on what "metrics" we exactly want to show. I think it should be possible to get file size without downloading the data, but maybe some of the other things are not possible to provide without downloading. |
Thanks @datajoely for the comments, I agree with this.
Summary
The goal of this ticket is to help a user debug their dataset, by enabling them to easily compare (preset) attributes that may have changed during data transformation of a run. Yes, having the information in the sidebar limits data comparison, which is the user’s objective. As mentioned by @antonymilne above:
Based on all of this and a conversation with @studioswong, here are some potential next steps.
Potential next steps
|
Had a really productive chat with @ravi-kumar-pilla today about the dataset statistics in the metadata panel. Some key takeaways:
Dataset statistics comparison
|
I had a discussion with @rashidakanchwala about which statistics we can display for quick debugging. Retrieving the total number of rows/columns seems to be an expensive operation for some dataset types like Excel. Also, some datasets, such as PlotlyDataSet or JSON, might not have rows/columns at all. So we thought this ticket needs some technical discussion about which stats can be globally available for all datasets and useful for debugging. One such stat we thought of was the file size. Getting a file size can be less expensive and can give some details to debug if something is drastically wrong. As far as the implementation goes, we are not sure if extracting the file size should be part of each kedro-dataset plugin or part of the Kedro Framework AbstractDataSet implementation. It would be great to have this in a technical discussion across the team. @merelcht @astrojuanlu @noklam please suggest. Thank you! |
Imo it shouldn't be implemented in kedro or kedro-dataset. The preview method was viz only, why can't it be implemented on the viz side instead? This should be true for any other plugins. In terms of implementation of the feature, file size is cheap to get via the filesystem. For columns and rows, maybe we can just trim it if it exceeds a certain number of rows to say "more than 1000000 rows". A crazier idea: can viz use hooks to record the statistics during a kedro run? This way there is no cost to read the stats. |
Thank you @noklam. I see what you are saying, it makes sense to have it on the viz side. I would not completely agree on trimming the rows info, as this still takes time and we might not have rows for all datasets. I think for the first pass, we can get the file size stat across all datasets. I am not well aware of the hook implementation you suggested here. If the crazy idea is efficient, we should do that :D @tynandebold any suggestions here? Thank you |
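The two cheap options mentioned above (file size from the filesystem, and a row count that stops at a cap) might look roughly like this. It is a sketch using only the standard library on a local file; the helper names `file_size_bytes` and `capped_row_count` are hypothetical, and remote storage would need a different way to read the size.

```python
import os
import tempfile
from typing import Iterable


def file_size_bytes(path: str) -> int:
    """Read the size from filesystem metadata without loading the data."""
    return os.stat(path).st_size


def capped_row_count(lines: Iterable[str], cap: int = 1_000_000) -> str:
    """Count rows, but stop as soon as the cap is exceeded."""
    count = 0
    for _ in lines:
        count += 1
        if count > cap:
            return f"more than {cap} rows"
    return f"{count} rows"


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.csv")
    # newline="" keeps the byte count identical across platforms.
    with open(path, "w", newline="") as handle:
        handle.write("id,value\n1,a\n2,b\n3,c\n")

    size = file_size_bytes(path)

    with open(path) as handle:
        next(handle)  # skip the header row
        small = capped_row_count(handle, cap=10)

    with open(path) as handle:
        next(handle)  # skip the header row
        capped = capped_row_count(handle, cap=2)

print(size, small, capped)
```

The cap bounds the work at `cap + 1` rows, but the file still has to be opened and read up to that point, so it is cheaper than a full count rather than free.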
Does "file size" make sense for, say, I essentially agree with @tynandebold above, this should probably be focused on arbitrary key-value pairs and datasets can expose that |
A lot of good points being raised. Let me synthesize some of it and make some suggestions:
Design opportunities
At this stage, one main constraint is UI/UX design. The completed designs have this feature living in the Metadata panel, which, as many of you have raised, is suboptimal and doesn't add much value. Nevertheless, if we can't get a new design done that moves some of this information into the flowchart by the time the engineering work is ready, my suggestion is to first release the work in the Metadata panel and then move it elsewhere once the design is ready. On the subject of some type of comparison table, I think that's outside of the scope of this implementation work. I'd rather see us work towards some sort of "dev mode" or "debug mode" toggle, which shows more detailed information on the flowchart when it's enabled, as written by @amandakys above. Lastly, on this point:
Are you saying we replace the main loading indicator we have over the flowchart and move it into the metadata panel? If yes, I don't think we should do that, as the flowchart sometimes needs an indicator to show when it's loading for larger pipelines. We can add a loading indicator to the Metadata panel, and it should probably match the skeleton loader we have in experiment tracking, since it's inline data.
Engineering opportunities
A big question is what we should allow the user to show. I agree with @datajoely here, in that we should allow them to configure key/values arbitrarily in the new metadata YAML, and even better if we can do that dynamically with something like
|
This is a great summary 🚀 On the loading indicator, when Ravi showed me a demo of the feature, the loading icon was displayed over the flowchart. It did not block interaction with the flowchart and was there to indicate that metadata was loading. This felt misleading as it was not indicating that the flowchart was loading. I was not suggesting we move the global loading icon to the metadata panel, just that metadata loading should be indicated in the metadata panel. For this the skeleton loader sounds like the best solution. From my side, the things that would be relevant to this ticket's implementation work are:
Based on @tynandebold's comment here I've opened another ticket to explore the concept of a dev/debug mode. The need, the use cases and the opportunities. #1464
|
@ravi-kumar-pilla #1465 a quick PoC to demonstrate what I mean. |
Hi @noklam, thank you for the quick PoC. I am not familiar with Python hooks or the Kedro Framework hook used in the PoC. I think we should collect stats during a kedro run, and then kedro viz can read the stats file to display the metadata. This would be the most efficient way to retrieve the stats, as they are pre-calculated. As @datajoely pointed out, we need to look at a way to let users configure this dynamically. It would be nice if this metadata could be collected for every run, like experiment tracking, in a database, and then viz could read it (so we would have a history of metadata changes). I clearly have a huge knowledge gap in this area, so let me understand hooks first before I comment further on this ticket. Thank you!! |
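The idea of collecting stats during a run and reading them back later can be sketched without Kedro installed. The method names below mirror Kedro's `after_dataset_saved` and `after_pipeline_run` hook specifications, but the class `DatasetStatsRecorder`, the `stats.json` file name, and the JSON layout are assumptions made for illustration; in a real project the methods would be decorated with `@hook_impl` and the class registered in the project settings.

```python
import json
import os
import tempfile
from typing import Any, Dict


class DatasetStatsRecorder:
    """Hypothetical hook-style class that records dataset shapes during a run."""

    def __init__(self, stats_path: str) -> None:
        self.stats_path = stats_path
        self.stats: Dict[str, Dict[str, int]] = {}

    def after_dataset_saved(self, dataset_name: str, data: Any) -> None:
        # The data is already in memory here, so reading its shape costs nothing extra.
        shape = getattr(data, "shape", None)
        if shape is not None and len(shape) == 2:
            self.stats[dataset_name] = {"rows": int(shape[0]), "columns": int(shape[1])}

    def after_pipeline_run(self) -> None:
        # Persist the stats so a viewer can read them without loading any data.
        with open(self.stats_path, "w") as handle:
            json.dump(self.stats, handle, indent=2)


class FakeTable:
    """Stand-in for a DataFrame: only exposes a two-dimensional `shape`."""

    def __init__(self, rows: int, columns: int) -> None:
        self.shape = (rows, columns)


with tempfile.TemporaryDirectory() as tmp:
    recorder = DatasetStatsRecorder(os.path.join(tmp, "stats.json"))
    recorder.after_dataset_saved("companies", FakeTable(120, 5))
    recorder.after_dataset_saved("parameters", {"alpha": 0.1})  # no shape, skipped
    recorder.after_pipeline_run()
    with open(recorder.stats_path) as handle:
        loaded = json.load(handle)

print(loaded)
```

Keeping one file per run (or appending to a store keyed by run) would give the history of metadata changes mentioned above; this sketch overwrites a single file.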
Happy to walk you through that, maybe can combine it with a few new joiners. It's covered in kedro intermediate training or we can revive the Kedro University. |
I think I am missing context here. I can advise on the implementation and design but I need to understand the scope of this ticket better. @NeroOkwa Maybe a quick catch up? What's the goal?
There are lots of optimisations we can do, the solution can also be just hooks, plugins, |
@lukaszdz this feature has been implemented in the latest Kedro-Viz release. Can you confirm whether this solves your pain point and provide feedback? Thanks. |
@NeroOkwa This is almost there. Ideally, we would want to see the dataset sizes in the graph view so we can view any issues with the pipeline without having to click through each node in the graph. Even better if we had some way to set up some rules to color the nodes (if N=0, then color the node red) |
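The colouring rule suggested above ("if N=0, then color the node red") could be expressed as an ordered list of predicates. This is only a sketch of the idea: Kedro-Viz does not expose such a rule API as far as this thread shows, and `node_colours`, the colour names, and the row counts are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

# A rule pairs a predicate on the row count with the colour to use when it matches.
Rule = Tuple[Callable[[int], bool], str]


def node_colours(
    row_counts: Dict[str, int], rules: List[Rule], default: str = "grey"
) -> Dict[str, str]:
    """Assign each dataset node the colour of the first rule that matches."""
    colours = {}
    for name, rows in row_counts.items():
        colours[name] = next(
            (colour for predicate, colour in rules if predicate(rows)), default
        )
    return colours


# Rules are checked in order, so the stricter "empty" rule comes first.
rules: List[Rule] = [
    (lambda n: n == 0, "red"),
    (lambda n: n < 100, "orange"),
]
colours = node_colours({"companies": 0, "reviews": 50, "shuttles": 5000}, rules)
print(colours)
```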
@lukaszdz thanks for the feedback.
I have 3 follow up questions:
|
@lukaszdz thank you for the feedback. For 3, could you give a bit more detail on what kind of data pipeline you are trying to build and where Kedro fails you? |
@lukaszdz please share an email address with which I can book the user interview session. Thanks. |
@lukaszdz, the session has been booked for today 18/09/23. |
Description
I'm always frustrated when I'm running daily or weekly sets of modular pipelines and my final output does not make complete sense. This indicates that there was an issue when running the pipeline but I'm not sure, at a glance, what step didn't provide output.
One example problem: one initial dataset had the mapping of market IDs. One day, the market ID for our second biggest market was omitted from the first step, causing all subsequent downstream analysis to be off by a nontrivial amount.
Context
This change is important to me because it would help me, at a glance, identify changes across runs through visual cues, so I know where to begin.
Possible Implementation
Visualize the total size of each dataset that has been processed via kedro viz:
The day that things ran correctly:
The day that things failed:
Would be nice to also visualize the nodes that had been attempted to run, but failed
In this example, by visualizing the size of each step that had been run, you would immediately see that the data set with the biggest difference was the companies set. Even though the pipeline strictly failed a step later, you would immediately know where to start debugging.
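The comparison described here, finding the dataset whose size changed most between a good run and a bad run, can be sketched as follows. The function name `biggest_change` and the row counts are illustrative and are not taken from the screenshots in the original issue.

```python
from typing import Dict, Optional, Tuple


def biggest_change(
    good: Dict[str, int], bad: Dict[str, int]
) -> Optional[Tuple[str, float]]:
    """Return (dataset_name, relative_change) for the largest change between runs."""
    changes = {}
    # Only datasets present in both runs can be compared.
    for name in good.keys() & bad.keys():
        if good[name] > 0:
            changes[name] = abs(bad[name] - good[name]) / good[name]
    if not changes:
        return None
    return max(changes.items(), key=lambda item: item[1])


# Hypothetical dataset sizes (rows) from two runs of the same pipeline.
good_run = {"companies": 1000, "reviews": 5000, "shuttles": 2000}
bad_run = {"companies": 400, "reviews": 4950, "shuttles": 2000}

suspect = biggest_change(good_run, bad_run)
print(suspect)
```

With these numbers the companies dataset stands out, which is where debugging would start even if the pipeline only failed at a later step.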