
Setup first model version for the SMILES-SELFIES modality pair #5

Merged: 21 commits merged into main from selfies_smiles_example on Apr 11, 2024

Conversation

@AdrianM0 (Collaborator) commented Apr 7, 2024

I tested the training loop on SMILES-SELFIES pairs, running it on a small dataset of 800 pairs. It achieved an average cosine similarity of 0.93 on the training set and 0.80 on the test set.

I tested negative examples as well; those show low cosine similarities.

Edit: using proper embedding averaging

[attached image]
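For context, "proper embedding averaging" presumably means masked mean pooling over token embeddings, i.e. averaging only over non-padding positions. A minimal sketch, assuming HuggingFace-style attention masks (function and variable names are illustrative, not the PR's actual code):

    import torch
    import torch.nn.functional as F

    def mean_pool(token_embeddings: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
        """Average token embeddings over non-padding positions.

        token_embeddings: (batch, seq_len, hidden); attention_mask: (batch, seq_len).
        """
        mask = attention_mask.unsqueeze(-1).float()    # (batch, seq_len, 1)
        summed = (token_embeddings * mask).sum(dim=1)  # (batch, hidden)
        counts = mask.sum(dim=1).clamp(min=1e-9)       # avoid division by zero
        return summed / counts

    # Example: cosine similarity between paired SMILES and SELFIES embeddings
    smiles_emb = torch.randn(4, 256)   # stand-in embeddings
    selfies_emb = torch.randn(4, 256)
    sims = F.cosine_similarity(smiles_emb, selfies_emb, dim=-1)  # shape (4,)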

@codingS3b (Collaborator):
Could you provide the data on which you ran your trainings? If it's too large to check into git, we can probably find a different way to distribute it.

@AdrianM0 (Collaborator, Author) commented Apr 8, 2024

@codingS3b here you can see the small subset I trained/validated on. I can also upload the notebook?

train_selfies.csv
valid_selfies.csv

@codingS3b (Collaborator):
Is this really the data you used? I'm kind of missing a column for the SMILES representation? Or am I misinterpreting the data? Maybe the notebook would indeed help.

In the code it seems you are reading in a file called "selfies_smiles_data.csv"?

@AdrianM0 (Collaborator, Author) commented Apr 9, 2024

> Is this really the data you used? I'm kind of missing a column for the SMILES representation? Or am I misinterpreting the data? Maybe the notebook would indeed help.
>
> In the code it seems you are reading in a file called "selfies_smiles_data.csv"?

There is a SMILES column; you have to scroll right because the SELFIES column is long 😄

Comment on lines 9 to 11
hydra-core==1.3.2
hydra-colorlog==1.2.0
hydra-optuna-sweeper==1.2.0
Collaborator:

Why is hydra pinned while the torch stuff only has a lower bound 🤔?

Collaborator (Author):

that's the default stuff

Collaborator:

You mean it was pinned this way by default? But what is the rationale behind doing it this way?

Collaborator:

And do we actually now use both requirements.txt and the toml file? Keeping it in two places might easily get messy.

def __init__(self, dataset, batch_size, shuffle, num_workers, modality="smiles"):
    super(StringDataLoader, self).__init__(dataset, batch_size=batch_size, shuffle=shuffle, num_workers=num_workers)

class StringDataset(Dataset):
    def __init__(self, dataset, modality, context_length=256):
Collaborator:

could we add docstrings? To me it was not clear from the variable name dataset that this is supposed to be indexable (like a tuple)
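For illustration, a docstring making that contract explicit could look like this (a sketch; the parameter descriptions are inferred from the surrounding code, not taken from the PR):

    from torch.utils.data import Dataset

    class StringDataset(Dataset):
        def __init__(self, dataset, modality, context_length=256):
            """Dataset of paired string representations.

            Args:
                dataset: an indexable pair (e.g. a tuple) where dataset[0] holds
                    the SMILES strings and dataset[1] the strings of the second
                    modality.
                modality: key into STRING_TOKENIZERS selecting the tokenizer for
                    the second modality.
                context_length: maximum tokenized sequence length; longer inputs
                    are truncated and shorter ones padded.
            """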

    super(StringDataLoader, self).__init__(dataset, batch_size=batch_size, shuffle=shuffle, num_workers=num_workers)

class StringDataset(Dataset):
    def __init__(self, dataset, modality, context_length=256):
        self.dataset = dataset
Collaborator:

do we still need it? Otherwise, we can perhaps keep the object leaner by avoiding this attribute

Comment on lines +28 to +41
self.tokenized_smiles = STRING_TOKENIZERS["smiles"](
    dataset[0],
    padding="max_length",
    truncation=True,
    return_tensors="pt",
    max_length=context_length,
)
self.tokenized_string = STRING_TOKENIZERS[modality](
    dataset[1],
    padding="max_length",
    truncation=True,
    return_tensors="pt",
    max_length=context_length,
)
Collaborator:

If your data is large, you might need to revisit this and replace it with tokenization on the fly or loading from pre-tokenized datasets.

It is okay for now, but I'd keep in mind that this might need to be refactored.
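One possible shape for that refactor is tokenizing lazily in __getitem__ instead of up front in __init__. A sketch, reusing the PR's STRING_TOKENIZERS mapping (the class name and helper method are hypothetical):

    from torch.utils.data import Dataset

    class LazyStringDataset(Dataset):  # hypothetical name
        def __init__(self, dataset, modality, context_length=256):
            self.smiles, self.strings = dataset
            self.modality = modality
            self.context_length = context_length

        def __len__(self):
            return len(self.smiles)

        def _tokenize(self, tokenizer, string):
            # Tokenize a single example on access rather than the whole corpus up front.
            return tokenizer(
                string,
                padding="max_length",
                truncation=True,
                return_tensors="pt",
                max_length=self.context_length,
            )

        def __getitem__(self, idx):
            return (
                self._tokenize(STRING_TOKENIZERS["smiles"], self.smiles[idx]),
                self._tokenize(STRING_TOKENIZERS[self.modality], self.strings[idx]),
            )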

Comment on lines 60 to 61
def __init__(self, dataset, context_length=128):
    self.dataset = dataset
Collaborator:

similar comments as above

class GraphDataset(Dataset):
    def __init__(self, dataset, context_length=128):
        self.dataset = dataset
        self.graphs = dataset[1]
Collaborator:

Those are assumed to be PyG objects?
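If so, a cheap guard in __init__ would make that assumption explicit. A minimal sketch, assuming PyTorch Geometric is a dependency (the function name is illustrative):

    from torch_geometric.data import Data

    def validate_graphs(graphs) -> None:
        """Fail early if the graph modality is not a sequence of PyG Data objects."""
        if not all(isinstance(g, Data) for g in graphs):
            raise TypeError("GraphDataset expects torch_geometric.data.Data objects")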

    num_workers: int,
    drop_last: bool = True,
) -> CombinedLoader:
    """_summary_
Collaborator:

summary is missing ;)

def reinitialize_weights(model) -> None:
    for module in model.modules():
        if isinstance(module, nn.Linear):
            nn.init.normal_(module.weight, mean=0, std=0.02)
Collaborator:

is this the ideal choice? Why not use a "proper" initialization scheme (Xavier/Glorot, ...)?

},
}

from omegaconf import DictConfig
Collaborator:

didn't the linter complain here? :D

@kjappelbaum (Collaborator) left a comment:

Overall, this is on the way to being an immaculate code base. Happy to see that 👍🏽


def __init__(
    self, dataset: Tuple[Tensor, Tensor], modality: str, context_length=256
):
    """_summary_
Collaborator:

_summary_ is strange :)

Comment on lines +35 to +36
assert len(dataset) == 2
assert len(dataset[0]) == len(dataset[1])
Collaborator:

👍🏽

Comment on lines +7 to +11
def xavier_init(model: nn.Module):
    for param in model.parameters():
        if len(param.shape) > 1:
            nn.init.xavier_uniform_(param)
    return model
Collaborator:

Keep in mind that this is a super tricky and important point.

It also depends heavily on the activation function. For example, you'd want to use He initialization if you use ReLU.
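For example, a He (Kaiming) variant suited to ReLU networks could look like this, analogous to the xavier_init above (a sketch, not part of the PR):

    import torch.nn as nn

    def he_init(model: nn.Module) -> nn.Module:
        """Initialize weight matrices with He (Kaiming) initialization, suited to ReLU."""
        for param in model.parameters():
            if param.ndim > 1:  # skip biases and other 1-D parameters
                nn.init.kaiming_uniform_(param, nonlinearity="relu")
        return model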

Further review threads on src/molbind/models/components/base_encoder.py, src/molbind/models/components/head.py (outdated), and src/molbind/models/lightning_module.py (outdated) were resolved.
@codingS3b (Collaborator) commented Apr 10, 2024

  • I would not put the train_molbind function inside the library; instead, use it in the train.py file.

  • In head.py, I think we discussed using the class_resolver library to create the proper activation layers instead of doing all the if-else statements in the ProjectionHead class (see the sketch after this list).

  • Also, I think you are not yet making use of the config yaml files, into which you could likely move some of the hardcoded configuration (paths, hyperparameters, etc.)...
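For reference, resolving activations by name with class_resolver could look roughly like this (a sketch assuming class_resolver's torch contrib module; the function name is hypothetical and not the PR's ProjectionHead):

    import torch.nn as nn
    from class_resolver.contrib.torch import activation_resolver

    def make_projection_head(in_dim: int, out_dim: int, activation: str = "relu") -> nn.Sequential:
        """Build a projection head, resolving the activation by name instead of if-else chains."""
        return nn.Sequential(
            nn.Linear(in_dim, out_dim),
            activation_resolver.make(activation),  # e.g. "relu" -> nn.ReLU()
            nn.Linear(out_dim, out_dim),
        )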

@AdrianM0 linked an issue on Apr 10, 2024 that may be closed by this pull request
@AdrianM0 merged commit 25b3bf5 into main on Apr 11, 2024
0 of 3 checks passed
@AdrianM0 deleted the selfies_smiles_example branch on April 11, 2024 at 06:32
@AdrianM0 changed the title from "SELFIES - SMILES multimodal example" to "Setup first training loop for the SMILES-SELFIES modality pair" on Apr 11, 2024
@AdrianM0 changed the title from "Setup first training loop for the SMILES-SELFIES modality pair" to "Setup first model version for the SMILES-SELFIES modality pair" on Apr 11, 2024
Successfully merging this pull request may close these issues:

  • refactor dataloaders to return tuples