Setup first model version for the SMILES-SELFIES modality pair #5
Conversation
Could you provide the data on which you ran your trainings? If it's too large to check into git, we can probably also find a different way to distribute it.
@codingS3b here you can see the small subset I trained/validated on. I can also upload the notebook?
Is this really the data you used? I'm kind of missing a column for the SMILES representation? Or am I misinterpreting the data? Maybe the notebook would indeed help. In the code it seems you are reading in a file called "selfies_smiles_data.csv"?
there is a
hydra-core==1.3.2
hydra-colorlog==1.2.0
hydra-optuna-sweeper==1.2.0
why is hydra pinned while the torch stuff only has a lower bound 🤔?
that's the default stuff
you mean it was pinned this way by default? But what is the rationale behind doing it this way?
and do we actually now use both requirements.txt and the toml file? Having the dependency list in two places might easily become messy.
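For illustration, one hedged way to make the specs consistent would be range pins kept in a single place; the exact bounds below are assumptions, not taken from the PR:

hydra-core>=1.3,<2.0
hydra-colorlog>=1.2,<2.0
hydra-optuna-sweeper>=1.2,<2.0
torch>=2.0,<3.0

Range pins let patch releases flow while guarding against breaking major versions; whichever style is chosen, keeping the dependencies in only one of requirements.txt or pyproject.toml avoids the drift this comment warns about.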
src/molbind/data/dataloaders.py (outdated)
    def __init__(self, dataset, batch_size, shuffle, num_workers, modality="smiles"):
        super(StringDataLoader, self).__init__(dataset, batch_size=batch_size, shuffle=shuffle, num_workers=num_workers)

class StringDataset(Dataset):
    def __init__(self, dataset, modality, context_length=256):
could we add docstrings? To me it was not clear from the variable name dataset
that this is supposed to be indexable (like a tuple)
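A minimal docstring sketch, assuming dataset is an indexable pair of string sequences as the surrounding diff suggests; the wording is illustrative, not from the PR:

from torch.utils.data import Dataset

class StringDataset(Dataset):
    def __init__(self, dataset, modality, context_length=256):
        """Dataset of paired, pre-tokenized string representations.

        Args:
            dataset: indexable pair (e.g. a tuple of two equal-length
                sequences) where dataset[0] holds SMILES strings and
                dataset[1] holds strings of the second modality (e.g. SELFIES).
            modality: key into STRING_TOKENIZERS selecting the tokenizer
                used for dataset[1].
            context_length: maximum token length; longer inputs are truncated.
        """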
src/molbind/data/dataloaders.py (outdated)
        super(StringDataLoader, self).__init__(dataset, batch_size=batch_size, shuffle=shuffle, num_workers=num_workers)

class StringDataset(Dataset):
    def __init__(self, dataset, modality, context_length=256):
        self.dataset = dataset
do we still need it? Otherwise, we can perhaps keep the object leaner by avoiding this attribute
        self.tokenized_smiles = STRING_TOKENIZERS["smiles"](
            dataset[0],
            padding="max_length",
            truncation=True,
            return_tensors="pt",
            max_length=context_length,
        )
        self.tokenized_string = STRING_TOKENIZERS[modality](
            dataset[1],
            padding="max_length",
            truncation=True,
            return_tensors="pt",
            max_length=context_length,
        )
if your data is large, you might need to revisit this and replace it with tokenization on the fly or loading from pre-tokenized datasets.
It is okay for now, but I'd keep in mind that this might need to be refactored.
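A minimal sketch of the on-the-fly variant, assuming the module's STRING_TOKENIZERS mapping of HuggingFace-style tokenizers; the class name is hypothetical and not part of the PR:

from torch.utils.data import Dataset

class LazyStringDataset(Dataset):  # hypothetical name
    def __init__(self, dataset, modality, context_length=256):
        self.smiles_strings, self.other_strings = dataset
        self.modality = modality
        self.context_length = context_length

    def __len__(self):
        return len(self.smiles_strings)

    def __getitem__(self, idx):
        # tokenize a single pair on access: a little more CPU per batch,
        # but memory stays flat however large the dataset grows
        def tokenize(tokenizer, text):
            return tokenizer(
                text,
                padding="max_length",
                truncation=True,
                return_tensors="pt",
                max_length=self.context_length,
            )

        return (
            tokenize(STRING_TOKENIZERS["smiles"], self.smiles_strings[idx]),
            tokenize(STRING_TOKENIZERS[self.modality], self.other_strings[idx]),
        )

With num_workers > 0 in the DataLoader, this per-item tokenization runs in the worker processes, so it tends to overlap with the GPU step.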
src/molbind/data/dataloaders.py (outdated)
    def __init__(self, dataset, context_length=128):
        self.dataset = dataset
similar comments as above
src/molbind/data/dataloaders.py (outdated)
class GraphDataset(Dataset):
    def __init__(self, dataset, context_length=128):
        self.dataset = dataset
        self.graphs = dataset[1]
Those are assumed to be PyG objects?
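If they are, a small hedged sketch of what that contract looks like with torch_geometric's Data class (toy values, purely illustrative):

import torch
from torch_geometric.data import Data

# a minimal PyG graph: 3 nodes with one feature each, 2 undirected edges
edge_index = torch.tensor([[0, 1, 1, 2], [1, 0, 2, 1]], dtype=torch.long)
x = torch.tensor([[1.0], [2.0], [3.0]])
graph = Data(x=x, edge_index=edge_index)

assert isinstance(graph, Data)  # the check GraphDataset could apply to dataset[1]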
src/molbind/data/dataloaders.py (outdated)
    num_workers: int,
    drop_last: bool = True,
) -> CombinedLoader:
    """_summary_
summary is missing ;)
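A hedged sketch of what the summary could say; the function name and the leading parameters are hypothetical, since the diff only shows the tail of the signature, and the CombinedLoader import path assumes Lightning 2.x:

from typing import Dict
from torch.utils.data import DataLoader, Dataset
from lightning.pytorch.utilities import CombinedLoader

def build_combined_loader(  # hypothetical name
    datasets: Dict[str, Dataset],
    batch_size: int,
    shuffle: bool,
    num_workers: int,
    drop_last: bool = True,
) -> CombinedLoader:
    """Wrap each modality's dataset in a DataLoader and combine them.

    Returns:
        A CombinedLoader that yields, at every step, one batch from each
        modality-specific DataLoader in lockstep.
    """
    loaders = {
        name: DataLoader(
            dataset,
            batch_size=batch_size,
            shuffle=shuffle,
            num_workers=num_workers,
            drop_last=drop_last,
        )
        for name, dataset in datasets.items()
    }
    # "min_size" ends an epoch once the smallest loader is exhausted
    return CombinedLoader(loaders, mode="min_size")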
src/molbind/utils/utils.py (outdated)
def reinitialize_weights(model) -> None:
    for module in model.modules():
        if isinstance(module, nn.Linear):
            nn.init.normal_(module.weight, mean=0, std=0.02)
is this the ideal choice? why not use a "proper" initialization scheme? (Xavier/Glorot, He, ...)
experiments/train.py (outdated)
    },
}

from omegaconf import DictConfig
didn't the linter complain here? :D
Overall, this is on the way to being an immaculate code base. Happy to see that 👍🏽
    def __init__(
        self, dataset: Tuple[Tensor, Tensor], modality: str, context_length=256
    ):
        """_summary_
_summary_ is strange :)
        assert len(dataset) == 2
        assert len(dataset[0]) == len(dataset[1])
👍🏽
def xavier_init(model: nn.Module):
    for param in model.parameters():
        if len(param.shape) > 1:
            nn.init.xavier_uniform_(param)
    return model
Keep in mind that this is a super tricky and important point.
It also highly depends on the activation function. For example, you'd want to use He initialization if you use ReLU.
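A hedged sketch of an activation-aware alternative, using PyTorch's built-in Kaiming (He) initializer; the helper name is hypothetical, not the PR's:

import torch.nn as nn

def he_init(model: nn.Module) -> nn.Module:  # hypothetical helper
    # Kaiming/He initialization scales the weight variance by fan-in for
    # ReLU-family activations, so activations neither vanish nor explode.
    for module in model.modules():
        if isinstance(module, nn.Linear):
            nn.init.kaiming_uniform_(module.weight, nonlinearity="relu")
            if module.bias is not None:
                nn.init.zeros_(module.bias)
    return model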
I tested the training loop on SMILES-SELFIES pairs, running on a small dataset of 800 pairs. It achieved an average cosine similarity of 0.93 on the training set and 0.80 on the test set. I tested negative examples as well, and those have poor cosine similarities.
Edit: using proper embedding averaging
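For reference, a minimal sketch of how such pairwise cosine similarities can be computed from two embedding matrices; the names and shapes are assumptions, not the PR's evaluation code:

import torch
import torch.nn.functional as F

def mean_pairwise_cosine(emb_a: torch.Tensor, emb_b: torch.Tensor) -> float:
    # emb_a, emb_b: (n_pairs, dim) matrices where row i of each side
    # embeds the same molecule (e.g. SMILES vs. SELFIES encoder outputs)
    return F.cosine_similarity(emb_a, emb_b, dim=-1).mean().item()

def mean_negative_cosine(emb_a: torch.Tensor, emb_b: torch.Tensor) -> float:
    # negative controls: shuffle one side so rows no longer correspond;
    # well-separated encoders should score much lower here
    perm = torch.randperm(emb_b.shape[0])
    return mean_pairwise_cosine(emb_a, emb_b[perm])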