Setup first model version for the SMILES-SELFIES modality pair #5
Conversation
Could you provide the data on which you ran your trainings? If it's too large to check into git, we can probably also find a different way to distribute it.
@codingS3b here you can see the small subset I trained/validated on. I can also upload the notebook?
Is this really the data you used? I'm kind of missing a column for the SMILES representation? Or am I misinterpreting the data? Maybe the notebook would indeed help. In the code it seems you are reading in a file called "selfies_smiles_data.csv"?
there is a
hydra-core==1.3.2
hydra-colorlog==1.2.0
hydra-optuna-sweeper==1.2.0
why is hydra pinned while the torch stuff only has a lower bound 🤔?
that's the default stuff
you mean it was pinned this way by default? But what is the rationale behind doing it this way?
and do we actually now use both requirements.txt and the toml file? Having the dependency list in two places might easily become messy.
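For illustration, one hedged way to make the specs consistent would be range pins kept in a single place; the exact bounds below are assumptions, not taken from the PR:

hydra-core>=1.3,<2.0
hydra-colorlog>=1.2,<2.0
hydra-optuna-sweeper>=1.2,<2.0
torch>=2.0,<3.0

Range pins let patch releases flow while guarding against breaking major versions; whichever style is chosen, keeping the dependencies in only one of requirements.txt or pyproject.toml avoids the drift this comment warns about.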
src/molbind/data/dataloaders.py (outdated)
    def __init__(self, dataset, batch_size, shuffle, num_workers, modality="smiles"):
        super(StringDataLoader, self).__init__(dataset, batch_size=batch_size, shuffle=shuffle, num_workers=num_workers)

class StringDataset(Dataset):
    def __init__(self, dataset, modality, context_length=256):
could we add docstrings? To me it was not clear from the variable name dataset
that this is supposed to be indexable (like a tuple)
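A minimal docstring sketch, assuming dataset is an indexable pair of string sequences as the surrounding diff suggests; the wording is illustrative, not from the PR:

from torch.utils.data import Dataset

class StringDataset(Dataset):
    def __init__(self, dataset, modality, context_length=256):
        """Dataset of paired, pre-tokenized string representations.

        Args:
            dataset: indexable pair (e.g. a tuple of two equal-length
                sequences) where dataset[0] holds SMILES strings and
                dataset[1] holds strings of the second modality (e.g. SELFIES).
            modality: key into STRING_TOKENIZERS selecting the tokenizer
                used for dataset[1].
            context_length: maximum token length; longer inputs are truncated.
        """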
src/molbind/data/dataloaders.py (outdated)
        super(StringDataLoader, self).__init__(dataset, batch_size=batch_size, shuffle=shuffle, num_workers=num_workers)

class StringDataset(Dataset):
    def __init__(self, dataset, modality, context_length=256):
        self.dataset = dataset
do we still need it? Otherwise, we can perhaps keep the object leaner by avoiding this attribute
        self.tokenized_smiles = STRING_TOKENIZERS["smiles"](
            dataset[0],
            padding="max_length",
            truncation=True,
            return_tensors="pt",
            max_length=context_length,
        )
        self.tokenized_string = STRING_TOKENIZERS[modality](
            dataset[1],
            padding="max_length",
            truncation=True,
            return_tensors="pt",
            max_length=context_length,
        )
if your data is large, you might need to revisit this and replace it with tokenization on the fly or loading from pre-tokenized datasets.
It is okay for now, but I'd keep in mind that this might need to be refactored.
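A minimal sketch of the on-the-fly variant, assuming the module's STRING_TOKENIZERS mapping of HuggingFace-style tokenizers; the class name is hypothetical and not part of the PR:

from torch.utils.data import Dataset

class LazyStringDataset(Dataset):  # hypothetical name
    def __init__(self, dataset, modality, context_length=256):
        self.smiles_strings, self.other_strings = dataset
        self.modality = modality
        self.context_length = context_length

    def __len__(self):
        return len(self.smiles_strings)

    def __getitem__(self, idx):
        # tokenize a single pair on access: a little more CPU per batch,
        # but memory stays flat however large the dataset grows
        def tokenize(tokenizer, text):
            return tokenizer(
                text,
                padding="max_length",
                truncation=True,
                return_tensors="pt",
                max_length=self.context_length,
            )

        return (
            tokenize(STRING_TOKENIZERS["smiles"], self.smiles_strings[idx]),
            tokenize(STRING_TOKENIZERS[self.modality], self.other_strings[idx]),
        )

With num_workers > 0 in the DataLoader, this per-item tokenization runs in the worker processes, so it tends to overlap with the GPU step.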
src/molbind/data/dataloaders.py (outdated)
    def __init__(self, dataset, context_length=128):
        self.dataset = dataset
similar comments as above
src/molbind/data/dataloaders.py (outdated)
class GraphDataset(Dataset):
    def __init__(self, dataset, context_length=128):
        self.dataset = dataset
        self.graphs = dataset[1]
Those are assumed to be PyG objects?
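If they are, a small hedged sketch of what that contract looks like with torch_geometric's Data class (toy values, purely illustrative):

import torch
from torch_geometric.data import Data

# a minimal PyG graph: 3 nodes with one feature each, 2 undirected edges
edge_index = torch.tensor([[0, 1, 1, 2], [1, 0, 2, 1]], dtype=torch.long)
x = torch.tensor([[1.0], [2.0], [3.0]])
graph = Data(x=x, edge_index=edge_index)

assert isinstance(graph, Data)  # the check GraphDataset could apply to dataset[1]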
src/molbind/data/dataloaders.py (outdated)
    num_workers: int,
    drop_last: bool = True,
) -> CombinedLoader:
    """_summary_
summary is missing ;)
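A hedged sketch of what the summary could say; the function name and the leading parameters are hypothetical, since the diff only shows the tail of the signature, and the CombinedLoader import path assumes Lightning 2.x:

from typing import Dict
from torch.utils.data import DataLoader, Dataset
from lightning.pytorch.utilities import CombinedLoader

def build_combined_loader(  # hypothetical name
    datasets: Dict[str, Dataset],
    batch_size: int,
    shuffle: bool,
    num_workers: int,
    drop_last: bool = True,
) -> CombinedLoader:
    """Wrap each modality's dataset in a DataLoader and combine them.

    Returns:
        A CombinedLoader that yields, at every step, one batch from each
        modality-specific DataLoader in lockstep.
    """
    loaders = {
        name: DataLoader(
            dataset,
            batch_size=batch_size,
            shuffle=shuffle,
            num_workers=num_workers,
            drop_last=drop_last,
        )
        for name, dataset in datasets.items()
    }
    # "min_size" ends an epoch once the smallest loader is exhausted
    return CombinedLoader(loaders, mode="min_size")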
src/molbind/utils/utils.py (outdated)
def reinitialize_weights(model) -> None:
    for module in model.modules():
        if isinstance(module, nn.Linear):
            nn.init.normal_(module.weight, mean=0, std=0.02)
is this the ideal choice? why not use a "proper" initialization scheme? (Xavier/Glorot, He, ...)
experiments/train.py (outdated)
    },
}

from omegaconf import DictConfig
didn't the linter complain here? :D
Overall, this is on the way to being an immaculate code base. Happy to see that 👍🏽
    def __init__(
        self, dataset: Tuple[Tensor, Tensor], modality: str, context_length=256
    ):
        """_summary_
_summary_ is strange :)
        assert len(dataset) == 2
        assert len(dataset[0]) == len(dataset[1])
👍🏽
def xavier_init(model: nn.Module):
    for param in model.parameters():
        if len(param.shape) > 1:
            nn.init.xavier_uniform_(param)
    return model
Keep in mind that this is a super tricky and important point.
It also highly depends on the activation function. For example, you'd want to use He initialization if you use ReLU.
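A hedged sketch of an activation-aware alternative, using PyTorch's built-in Kaiming (He) initializer; the helper name is hypothetical, not the PR's:

import torch.nn as nn

def he_init(model: nn.Module) -> nn.Module:  # hypothetical helper
    # Kaiming/He initialization scales the weight variance by fan-in for
    # ReLU-family activations, so activations neither vanish nor explode.
    for module in model.modules():
        if isinstance(module, nn.Linear):
            nn.init.kaiming_uniform_(module.weight, nonlinearity="relu")
            if module.bias is not None:
                nn.init.zeros_(module.bias)
    return model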
I tested the training loop on SMILES-SELFIES pairs, running on a small dataset of 800 pairs. It achieved an average cosine similarity of 0.93 on the training set and 0.80 on the test set. I tested negative examples as well, and those have poor cosine similarities.
Edit: using proper embedding averaging
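For reference, a minimal sketch of how such pairwise cosine similarities can be computed from two embedding matrices; the names and shapes are assumptions, not the PR's evaluation code:

import torch
import torch.nn.functional as F

def mean_pairwise_cosine(emb_a: torch.Tensor, emb_b: torch.Tensor) -> float:
    # emb_a, emb_b: (n_pairs, dim) matrices where row i of each side
    # embeds the same molecule (e.g. SMILES vs. SELFIES encoder outputs)
    return F.cosine_similarity(emb_a, emb_b, dim=-1).mean().item()

def mean_negative_cosine(emb_a: torch.Tensor, emb_b: torch.Tensor) -> float:
    # negative controls: shuffle one side so rows no longer correspond;
    # well-separated encoders should score much lower here
    perm = torch.randperm(emb_b.shape[0])
    return mean_pairwise_cosine(emb_a, emb_b[perm])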