Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Index symbol customization #71

Closed
wants to merge 72 commits into from

Conversation

csnbritt
Copy link

@csnbritt csnbritt commented Nov 9, 2021

Added index symbol customization functionality following the style of the bond constraints code to address #69

index_alphabet.py includes three new functions:

  • get_preset_index_alphabet
sf.get_preset_index_alphabet('default')
  • get_current_index_alphabet
sf.get_current_index_alphabet()
  • set_index_alphabet
new_index_alphabet = {
    "[1C]": 0, 
    "[2C]": 1, 
    "[3C]": 2, 
    "[4C]": 3, 
    "[5C]": 4, 
    "[6C]": 5, 
    "[7C]": 6, 
    "[8C]": 7,
    "[9C]": 8,
    "[10C]": 9,
    "[11C]": 10,
    "[12C]": 11,
    "[13C]": 12,
    "[14C]": 13,
    "[15C]": 14,
    "[16C]": 15
}

sf.set_index_alphabet(new_index_alphabet)

Currently I've only included one preset index alphabet, the default one. Having multiple presets may or may not be useful, remains to be seen how much different index alphabets affect performance on different tasks I suppose.

New index alphabets can be set by passing a dictionary with tokens as keys and indices as values. Checks to make sure the passed dictionary has 16 keys, each key is a valid atom or ring/branch token, and that the set of dictionary values includes all integers 0-15

@csnbritt
Copy link
Author

Thanks for the feedback - I made most of the suggested changes!

  • Changed get_current_index_alphabet to get_index_alphabet

  • Removed tuple(get_current_index_alphabet()) and list(set(index_alphabet.values()))

  • Agreed that (index -> symbol) makes more sense than (symbol -> index), this has been changed. Regarding how to store the index alphabet - in the latest commits I've changed the function from set_index_alphabet to update_index_alphabet and now any number of index values can be changed, rather than having to update the whole index alphabet each time. A copy of the index alphabet is updated by the passed dictionary, and if this updated dictionary has 16 unique values, each value is a valid selfies token, and each key is an int between 0-15, it is saved as _current_index_alphabet. I think that the ability to update index symbols just a few at a time is nice, and a dictionary makes it convenient to do so. I'm open to changing this function back or moving to using a list or tuple for this if desired however.

  • Moved get_index_from_selfies and get_selfies_from_index into index_alphabet.py. As suggested, for efficiency I added _current_index_alphabet_symbols and _current_index_alphabet_reversed to store the correctly ordered index alphabet symbols and reverse mapping of the index alphabet.

@csnbritt
Copy link
Author

I think that this latest commit complies with PEP8 length requirements, besides the error messages (I'm not too familiar with the standard so I don't know if this is an issue). In my testing I did find a bug that affects both setting semantic bond constraints and index alphabets when using dask. Dask parallelized functions don't use updated bond constraints or index alphabets unless the bond constraints or index alphabet is updated within the function itself, which is inefficient. I'm not exactly sure why this happens, but my guess is that dask reinitializes SELFIES for use in the parallelized function, causing settings to reset to defaults. Could a proper config file for SELFIES customizations fix this?

An example of the bug:

import pandas as pd
import dask
import dask.dataframe as dd
import selfies as sf
print('Pandas version: ' + str(pd.__version__))
print('Dask version: ' + str(dask.__version__))
print('SELFIES version: ' + str(sf.__version__))

## Selfies decoded using default constraints
testing_selfies = "[Li][=C][C][S][=C][C][#S]"
print('Default constraints:' + str(sf.decoder(testing_selfies)))

## Update semantic constraints
new_constraints = sf.get_preset_constraints("default")
new_constraints['Li'] = 1
new_constraints['S'] = 2
sf.set_semantic_constraints(new_constraints)

## Selfies decoded using updated constraints
testing_selfies = "[Li][=C][C][S][=C][C][#S]"
print('Updated constraints:' + str(sf.decoder(testing_selfies)))


def decode_selfies(selfies):
    smiles = sf.decoder(selfies)
    return(smiles)


def parallel_decode_selfies(df):
    return df.apply(lambda x: decode_selfies(x.selfies), axis=1)

## Selfies decoded using updated constraints using dask
df = pd.DataFrame([testing_selfies])
df.columns = ['selfies']
ddf = dd.from_pandas(df,npartitions=1)
df = ddf.map_partitions(parallel_decode_selfies, meta='float').compute(scheduler='processes')
print('Updated constraints w/dask:' + str(df[0]))
Pandas version: 1.1.5
Dask version: 2020.12.0
SELFIES version: 2.0.0
Default constraints:[Li]=CCS=CC#S
Updated constraints:[Li]CCSCC=S
Updated constraints w/dask:[Li]=CCS=CC#S

@MarioKrenn6240
Copy link
Collaborator

Thanks for discovering the subtle bug. Could you pls repost as an issue so we can look at it further? Thank you!

Please give me some time to decode the CI errors.

@MarioKrenn6240
Copy link
Collaborator

hi @csnbritt -- this PR shows some CI mistakes for some time. Do you plan to continue try fixing these issues, and submit this valueable contribution - or shall we close the PR? Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants