Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Provide base class for dataset loaders #59

Merged
merged 2 commits into from
Jan 21, 2022

Conversation

cthoyt
Copy link
Contributor

@cthoyt cthoyt commented Jan 21, 2022

Summary

This PR abstracts the essential components of the dataset loader into a base class to allow for future implementations of eager datasets (e.g., all parts of the dataset are already in memory) and for other lazy local dataset loaders.

  • Code passes all tests
  • Unit tests provided for these changes
  • Documentation and docstrings added for these changes

Changes

  • Create abstract base class with unimplemented methods for getting drugs, contexts, and labeled triples
  • Rename mid-level class to RemoteDatasetLoader

Next steps

The following shows an implementation of an eager dataset, which might be more useful for local datasets.

@dataclass
class EagerDatasetLoader(DatasetLoader):
    """An eager dataset."""

    context_feature_set: ContextFeatureSet
    drug_feature_set: DrugFeatureSet
    labeled_triples: LabeledTriples

    def get_labeled_triples(self) -> LabeledTriples:
        """Get the labeled triples file from the storage."""
        return self.labeled_triples

    def get_context_features(self) -> ContextFeatureSet:
        """Get the context feature set."""
        return self.context_feature_set

    def get_drug_features(self):
        """Get the drug feature set."""
        return self.drug_feature_set

This will allow for the future addition of an eager dataset loader
Copy link
Contributor

@benedekrozemberczki benedekrozemberczki left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great @cthoyt !

@benedekrozemberczki benedekrozemberczki merged commit 47af8b4 into AstraZeneca:main Jan 21, 2022
@cthoyt cthoyt deleted the update-dataset branch January 21, 2022 16:16
@aminemosbah
Copy link

any chance we would load , train and essentially test our own dataset ?

@cthoyt
Copy link
Contributor Author

cthoyt commented Feb 15, 2022

@aminemosbah I would suggest looking at https://chemicalx.readthedocs.io/en/latest/api/chemicalx.data.LocalDatasetLoader.html#chemicalx.data.LocalDatasetLoader for loading your own dataset that's already in the right format in a given directory

@aminemosbah
Copy link

thx, but i have smiles to predict in a csv file to predict locally , any quick snippet ?

@cthoyt
Copy link
Contributor Author

cthoyt commented Feb 15, 2022

@aminemosbah I have in mind a solution for what you want (which is the obvious realistic use case) but this it is blocked #50 and #58. @benedekrozemberczki would love to get your input on #50 ;)

@aminemosbah
Copy link

need to hack the dataloader to make it work for local data

@aminemosbah
Copy link

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants