Plugins for semantic changes tracking in dependencies #5920
Replies: 20 comments 2 replies
-
I like it, @dmpetrov, especially for working with databases in a flexible way! Maybe, instead of using the exit code, we can track the output (for example There are several instances of I would prefer to sit on it and think of other solutions to comply with databases. I may be short-sighted, but I'm not seeing any advantage in maintaining a feature like that besides the integration with databases 🙈 Another possible name could be
-
@MrOutis totally agree! The solution can benefit a lot from this flexibility - if we save the outputs in dvc files then it will save users from having to write additional status files. On the one hand, integration with databases is a super important scenario. On the other hand, @MrOutis brought a great point about
-
I am currently evaluating DVC for use in our ML workflow. Databases play a role as we have images as input for which metadata needs to be stored. DVC works great for experimentation when adding a dataset directly (thanks a lot!), but in the end I want to store data independently of DVC and not duplicated. I first thought of adding an S3 or GCP directory as an external dependency (https://dvc.org/doc/user-guide/external-dependencies), but it seemed not to be geared towards supporting directories (which are expensive to find changes in). At least all my attempts failed and the documentation only shows it for files. I am new to DVC, but could the database problem be worked around by having a "stage" in which the count query is saved to a local file that is tracked by DVC, and somehow forcing this stage to execute even though the script has not changed? So like the
-
Hi @fmannhardt !
Directories on both S3 and GCP can be supported as external dependencies/outputs, we just haven't gotten around to implementing the needed calls for those two types of remotes. For example, we already support
Sorry, I don't quite understand your scenario and your proposed solution. Could you please elaborate?
-
Cool. Would be great to see this.
The scenario is to have the images (or the image URIs) in a database to be queried and used for different training sets. From my understanding, when I have a script querying the DB as a stage in a pipeline, DVC would keep track of changes to the SQL query and re-execute the stage when I change the query. But it would not re-execute when additional data (images) was added to the DB through some other (non-tracked) channel. How should it know without executing the query again? What I thought of as a workaround is similar to what is proposed here: have a query provide some cheap metadata that can be tuned to the desired level of robustness, e.g. the total count of rows in an append-only DB would be enough. But differently from what I read here, this query would be executed in a standard DVC stage that writes the result to a file that is tracked by DVC as an output. Now, in case this output has changed (detected with the standard MD5 mechanism), everything downstream would need to be re-run. Otherwise, everything is assumed to be up-to-date. Of course, this should only be done upon request from the user to keep having reproducible results for previous executions of the pipeline. I saw the What I was proposing is to somehow automate this by marking
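For illustration, the cheap-metadata stage described above could look roughly like the sketch below. It assumes an SQLite database purely as a stand-in; the helper name write_row_count, the table name, and the output file are hypothetical and not part of DVC.

```python
import sqlite3
from pathlib import Path


def write_row_count(conn: sqlite3.Connection, table: str, out_path: str) -> int:
    """Write the row count of `table` to `out_path` and return it."""
    # Table names cannot be bound as SQL parameters, so accept only plain identifiers.
    if not table.isidentifier():
        raise ValueError(f"unsafe table name: {table!r}")
    (count,) = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()
    # The file content changes only when the count changes, so a checksum of
    # this file acts as a cheap fingerprint of an append-only table.
    Path(out_path).write_text(f"{count}\n")
    return count
```

The resulting file would be declared as a stage output, so downstream stages are re-run only when the count differs. A row count only detects appended rows; a stricter fingerprint (for example a maximum modification timestamp) would be needed if rows can be updated in place.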
-
@fmannhardt Thanks for the explanation! 🙂
Yes, I think so.
We have so-called "callback" stages, which don't have dependencies and run every time you run "dvc repro" (e.g.
-
I think this feature would do the trick. Thanks!
-
A user asked about this use case today on Discord. Specifically about DVC understanding Python imports inside commands fed to So besides implementing these plugins or middleware that Dmitry mentioned, what about out-of-the-box support for certain programming languages like Python, C++, etc.? So in the case above, DVC would autodetect that
-
@jorgeorpinel yes, it is a slightly different use case: Python file dependencies are not the same as the dependencies on Python functions from the initial message. The file dependencies use case should be easier to implement, I guess. Package systems should be able to do this kind of dependency tracking, and I hope these ideas (or code) can be reused in DVC.
-
Yes, it's a bit different but related. I can open a separate issue if you prefer. I'm not talking about packages or libraries though; in that case you could kind of hack it now by having requirements.txt as a dependency, for example (in Python). I'm talking about inter-dependencies between source code files in the project, i.e. when your stage is spread across several source code files, but only one is executable and marked as a Also note I'm not just talking about Python code but multiple languages. I guess Python would be an obvious first platform to include such a feature for, since our core code is also Python.
-
Hi everyone. I think I have an idea about how to implement this for Python (and many other languages, actually): We can manually compile a Python endpoint like this:
and to add a This also works with C/C++: we just need to use the compiled endpoint as a dependency. In the case of databases I think we could take advantage of So all that a DVC plugin should do is automatically compile the endpoint and redirect code dependencies to that binary. We could make some kind of flag, like
-
@anotherbugmaster sounds like a good option to automatically detect all the changes in the dependencies recursively and it will probably avoid rerunning stuff if I changed only a comment or whitespace in the script? It's not a solution for the:
as far as I can tell.
-
Yeah, seems like I misunderstood the issue here. The approach would be useful anyway, if only we had a way to split up a source file into symbols, which in turn would be hashed.
-
@anotherbugmaster for sure. It can be a part of a solution.
-
Over in #2378, we are discussing a similar issue, and I have hacked up support for database status checking by monkey-patching a custom remote (with associated output and dependency support) into DVC: #2378 (comment) As I've thought more about this issue, I've become increasingly persuaded that external dependencies with custom remote schemes are one of the more elegant ways to deal with this family of issues, in particular because they do not require adding any new syntax or concepts to DVC stage files - they just need the ability to dispatch URLs with a custom scheme to an appropriate class, function, or command.
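To make the dispatch idea concrete, here is a minimal sketch of routing a URL to a checker by its scheme. This is not how DVC remotes are implemented; the registry, register_scheme, and fingerprint are hypothetical names used only for illustration.

```python
from typing import Callable, Dict
from urllib.parse import urlparse

# Hypothetical registry: URL scheme -> function that returns a fingerprint string.
_CHECKERS: Dict[str, Callable[[str], str]] = {}


def register_scheme(scheme: str, checker: Callable[[str], str]) -> None:
    """Associate a URL scheme with a function that fingerprints such URLs."""
    _CHECKERS[scheme.lower()] = checker


def fingerprint(url: str) -> str:
    """Dispatch `url` to the checker registered for its scheme."""
    scheme = urlparse(url).scheme  # urlparse returns the scheme lowercased
    try:
        checker = _CHECKERS[scheme]
    except KeyError:
        raise ValueError(f"no checker registered for scheme {scheme!r}") from None
    return checker(url)
```

A dependency would then count as changed whenever the fingerprint returned for its URL differs from the one recorded on the previous run, which keeps the stage file syntax unchanged.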
-
@dmpetrov Is there any update on how to use DVC to track a database, e.g. a MongoDB collection?
-
@jtlz2 No updates for now 🙁
-
Here's another case for this (I think) from a user on the forum: https://discuss.dvc.org/t/update-same-output-dir-in-different-stages/620
-
As part of my project, I wrote a function that hashes a given Python function based on its AST and, recursively, any global objects it contains (including other functions it calls): https://github.com/jhrmnn/mona/blob/master/src/mona/pyhash.py It is published under MPL 2.0, so maybe you could reuse it.
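As a rough illustration of the general technique (not the code in the linked pyhash.py, which works on live function objects and covers more kinds of global objects), the sketch below hashes a top-level function from source text together with the other top-level functions it references. The name hash_function is hypothetical.

```python
import ast
import hashlib


def hash_function(source: str, name: str, _seen=None) -> str:
    """Hash top-level function `name` in `source`, plus the top-level
    functions it references, ignoring comments and formatting."""
    seen = set() if _seen is None else _seen
    seen.add(name)
    tree = ast.parse(source)
    functions = {
        node.name: node for node in tree.body if isinstance(node, ast.FunctionDef)
    }
    node = functions[name]
    # ast.dump omits line and column attributes by default, so comments and
    # whitespace do not influence the digest.
    digest = hashlib.sha256(ast.dump(node).encode())
    referenced = sorted(
        {n.id for n in ast.walk(node) if isinstance(n, ast.Name) and n.id in functions}
    )
    for other in referenced:
        if other not in seen:  # guards against infinite recursion on cycles
            digest.update(hash_function(source, other, seen).encode())
    return digest.hexdigest()
```

With this approach, editing a comment or reformatting the file leaves the hash unchanged, while changing the body of any function reachable from the hashed one produces a different hash. Docstrings are part of the AST, so editing them does change the hash.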
-
Alternatively, if there is interest, I could carve it out into a separate package.
-
Problem
DVC reproduces a command if its dependencies have changed. Today we support many general types of dependencies:
dvc run -d azure://path/to/my blob train.py ...
dvc run -d train.py -d images/ train.py ...
However, there are a number of less general dependencies that DVC cannot validate.
Problem examples:
mycode() was changed in class MyClass in a Python file train.py.
Possible solution
A custom plugin (code) might be executed to check for a dependency change. A plugin could be any command that returns 0 if repro is not needed.
Solution examples:
check_db.sh to validate if a table was changed and then execute the DB dump script (if there was a change). Command example: dvc -d db_dump.sh -p check_db.sh -o clients.csv run db_dump.sh clients.csv. Note that there is a new plugin option, -p.
dvc -d train.py -p "python check_method_change.py MyClass.mycode change_timestamp" -d change_timestamp -o clients.csv run train.py, where check_method_change.py checks the code changes and returns 0 if there was a change.
UPDATE: Please note that the script check_method_change.py might still be our responsibility and we should implement it (probably outside of DVC core).
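A minimal sketch of how a plugin's exit code could be consumed, assuming the convention stated under "Possible solution" that exit code 0 means repro is not needed. The -p option is only a proposal in this thread, and needs_repro is a hypothetical helper, not a DVC function.

```python
import subprocess
from typing import Sequence


def needs_repro(plugin_cmd: Sequence[str]) -> bool:
    """Run a plugin command and report whether the stage should be reproduced.

    Assumed convention: exit code 0 means nothing changed, and any other
    exit code means the dependency changed.
    """
    result = subprocess.run(plugin_cmd, check=False)
    return result.returncode != 0
```

For example, needs_repro(["sh", "check_db.sh"]) would return True whenever the script exits with a non-zero code. Treating every non-zero code as "changed" also means a crashing plugin triggers a repro, which errs on the side of re-running.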