Index symbol customization #71

csnbritt · 2021-11-09T20:15:02Z

Added index symbol customization functionality following the style of the bond constraints code to address #69

index_alphabet.py includes three new functions:

get_preset_index_alphabet

sf.get_preset_index_alphabet('default')

get_current_index_alphabet

sf.get_current_index_alphabet()

set_index_alphabet

new_index_alphabet = {
    "[1C]": 0, 
    "[2C]": 1, 
    "[3C]": 2, 
    "[4C]": 3, 
    "[5C]": 4, 
    "[6C]": 5, 
    "[7C]": 6, 
    "[8C]": 7,
    "[9C]": 8,
    "[10C]": 9,
    "[11C]": 10,
    "[12C]": 11,
    "[13C]": 12,
    "[14C]": 13,
    "[15C]": 14,
    "[16C]": 15
}

sf.set_index_alphabet(new_index_alphabet)

Currently I've only included one preset index alphabet, the default one. Multiple presets may or may not be useful; it remains to be seen how much different index alphabets affect performance on different tasks.

New index alphabets can be set by passing a dictionary with tokens as keys and indices as values. The function checks that the passed dictionary has 16 keys, that each key is a valid atom or ring/branch token, and that the dictionary values cover every integer from 0 to 15.
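For readers who want to see the three checks spelled out, here is a minimal sketch. It is not the code in index_alphabet.py: the function name validate_index_alphabet and the valid_symbols argument are hypothetical, and the set of valid tokens is passed in by the caller rather than taken from the SELFIES grammar.

```python
def validate_index_alphabet(index_alphabet, valid_symbols):
    """Raise ValueError unless index_alphabet maps 16 valid symbols onto 0..15.

    Hypothetical helper: valid_symbols stands in for whatever the library
    treats as valid atom or ring/branch tokens.
    """
    # Check 1: exactly 16 symbols.
    if len(index_alphabet) != 16:
        raise ValueError(f"expected 16 symbols, got {len(index_alphabet)}")

    # Check 2: every key is a known token.
    unknown = [s for s in index_alphabet if s not in valid_symbols]
    if unknown:
        raise ValueError(f"invalid symbols: {unknown}")

    # Check 3: the values cover every integer from 0 to 15.
    # With 16 keys this also guarantees that no index is used twice.
    if set(index_alphabet.values()) != set(range(16)):
        raise ValueError("values must be exactly the integers 0 through 15")


# Usage with the example alphabet from this description.
example = {f"[{i + 1}C]": i for i in range(16)}
validate_index_alphabet(example, valid_symbols=set(example))
```

Comparing the value set against set(range(16)) covers both the "all integers 0-15" requirement and uniqueness in one step.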

csnbritt · 2021-11-12T04:50:07Z

Thanks for the feedback - I made most of the suggested changes!

Renamed get_current_index_alphabet to get_index_alphabet.
Removed tuple(get_current_index_alphabet()) and list(set(index_alphabet.values())).
Agreed that (index -> symbol) makes more sense than (symbol -> index); this has been changed. Regarding how to store the index alphabet: in the latest commits I replaced set_index_alphabet with update_index_alphabet, so any number of index values can be changed without passing the whole index alphabet each time. The passed dictionary is applied to a copy of the current index alphabet. If the updated copy has 16 unique values, each value is a valid SELFIES token, and each key is an int from 0 to 15, it is saved as _current_index_alphabet. I think the ability to update just a few index symbols at a time is nice, and a dictionary makes that convenient. I'm open to changing this function back, or to using a list or tuple instead, if desired.
Moved get_index_from_selfies and get_selfies_from_index into index_alphabet.py. As suggested, for efficiency I added _current_index_alphabet_symbols and _current_index_alphabet_reversed, which store the correctly ordered index alphabet symbols and the reverse mapping of the index alphabet.

csnbritt · 2021-11-22T20:39:21Z

I think this latest commit complies with the PEP 8 line-length limit, except for the error messages (I'm not too familiar with the standard, so I don't know whether that is an issue). In my testing I found a bug that affects both semantic bond constraints and index alphabets when using dask. Dask-parallelized functions don't use updated bond constraints or index alphabets unless they are updated inside the parallelized function itself, which is inefficient. I'm not sure exactly why this happens, but my guess is that dask reinitializes SELFIES in the worker process, resetting the settings to their defaults. Could a proper config file for SELFIES customizations fix this?

An example of the bug:

import dask
import dask.dataframe as dd
import pandas as pd
import selfies as sf

print(f"Pandas version: {pd.__version__}")
print(f"Dask version: {dask.__version__}")
print(f"SELFIES version: {sf.__version__}")

# Decode with the default constraints
testing_selfies = "[Li][=C][C][S][=C][C][#S]"
print(f"Default constraints:{sf.decoder(testing_selfies)}")

# Update the semantic constraints
new_constraints = sf.get_preset_constraints("default")
new_constraints["Li"] = 1
new_constraints["S"] = 2
sf.set_semantic_constraints(new_constraints)

# Decode again in the main process: the updated constraints apply
print(f"Updated constraints:{sf.decoder(testing_selfies)}")


def decode_selfies(selfies):
    return sf.decoder(selfies)


def parallel_decode_selfies(df):
    return df.apply(lambda row: decode_selfies(row.selfies), axis=1)


# Decode in a dask worker process: the output matches the default constraints
df = pd.DataFrame({"selfies": [testing_selfies]})
ddf = dd.from_pandas(df, npartitions=1)
# The result is a Series of SMILES strings, so meta uses the object dtype
result = ddf.map_partitions(
    parallel_decode_selfies, meta=(None, "object")
).compute(scheduler="processes")
print(f"Updated constraints w/dask:{result[0]}")

Pandas version: 1.1.5
Dask version: 2020.12.0
SELFIES version: 2.0.0
Default constraints:[Li]=CCS=CC#S
Updated constraints:[Li]CCSCC=S
Updated constraints w/dask:[Li]=CCS=CC#S
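
One possible way to soften the inefficiency mentioned above is to apply the settings at most once per worker process instead of on every call. The sketch below is a generic pattern and not part of SELFIES or dask; ensure_configured and decode_partition are hypothetical names. It assumes the settings are lost because each worker process starts from a fresh import of the library, which is the guess made above and has not been confirmed here.

```python
import os

# PID of the process in which the settings were last applied.
_configured_pid = None


def ensure_configured(apply_settings):
    """Call apply_settings() at most once in the current process."""
    global _configured_pid
    if _configured_pid != os.getpid():
        apply_settings()
        _configured_pid = os.getpid()


def decode_partition(rows, apply_settings, decode):
    """Decode every row after making sure this process is configured."""
    ensure_configured(apply_settings)
    return [decode(row) for row in rows]


# Stand-ins so the sketch runs without SELFIES or dask installed.
calls = []


def fake_apply_settings():
    calls.append("applied")


first = decode_partition(["a", "b"], fake_apply_settings, str.upper)
second = decode_partition(["c"], fake_apply_settings, str.upper)
print(first, second, len(calls))
```

In the example from this comment, apply_settings would wrap the sf.set_semantic_constraints(new_constraints) call and decode would be sf.decoder. Both would need to be module-level functions so that they can be pickled and sent to the worker processes.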

MarioKrenn6240 · 2021-11-23T02:03:36Z

Thanks for discovering this subtle bug. Could you please repost it as an issue so we can look at it further? Thank you!

Please give me some time to decode the CI errors.

MarioKrenn6240 · 2022-01-05T05:36:13Z

Hi @csnbritt, this PR has been showing CI failures for some time. Do you plan to keep working on these issues and submit this valuable contribution, or shall we close the PR? Thanks!

csnbritt added 30 commits November 9, 2021 10:26

Update constants.py

df61cd8

Removed index alphabet and index code from constants

create index_alphabet.py

cb89754

Create index_alphabet.py following style of bond_constraints.py

Update index_alphabet/code import

8c65780

import index_alphabet and index_code from index_alphabet.py

Update index_alphabet import

9cc5c9f

import index_alphabet from index_alphabet.py

Add docstrings

390b1cc

Update error checking for index alphabet

a89242c

Check length, keys, values of passed index_alphabet dict

Update index_alphabet.py

5a6c195

Update index_alphabet.py

43b3eda

Update index_alphabet.py

99814a4

Update index_alphabet.py

7914648

Update index_alphabet.py

7edcbd9

Update __init__.py

6646a9f

Add functions from index_alphabet.py

Update __init__.py

61bd024

Update __init__.py

8d3a2b0

Update index_alphabet.py

a9d8a87

Update bond_constraints.py

72b47c9

Update grammar_rules.py

2885b2e

Update bond_constraints.py

64fd82a

Update index_alphabet.py

7f66b92

Create index_alphabet.py

fff8f4d

Update bond_constraints.py

88e91a8

Update grammar_rules.py

7c7c3c2

Update index_alphabet.py

be6d20e

Update index_alphabet.py

9b2b837

Update grammar_rules.py

2cf0f9e

Update grammar_rules.py

e00d134

Update index_alphabet.py

960d799

Update index_alphabet.py

e3438a7

Update index_alphabet.py

c85dd96

Update index_alphabet.py

0a830d0

csnbritt added 15 commits November 11, 2021 20:02

Update index_alphabet.py

bbd5746

Update __init__.py

d279a97

Update index_alphabet.py

b7a401d

Update bond_constraints.py

f0acbfd

Update index_alphabet.py

d29fc7d

Update index_alphabet.py

380e2dc

Update index_alphabet.py

523f9ac

Update index_alphabet.py

8bdc485

Update index_alphabet.py

963ec88

Update index_alphabet.py

143272b

Update index_alphabet.py

9ea016b

Update index_alphabet.py

3e13ce0

Update index_alphabet.py

7a1e9de

Update index_alphabet.py

12d3b1d

Update index_alphabet.py

7c0ce75

csnbritt added 10 commits November 12, 2021 09:28

Update index_alphabet.py

7983a94

Update index_alphabet.py

fba8ba1

Fix some formatting for unused imports, line spacing, white spaces, and line lengths.

Update grammar_rules.py

3c4fa24

Update bond_constraints.py

f4cea13

Update index_alphabet.py

a2ed3b2

Update index_alphabet.py

c5cd8af

Update index_alphabet.py

6c76812

Update index_alphabet.py

1b24cdf

Update index_alphabet.py

aefbd45

Update index_alphabet.py

dd81259

MarioKrenn6240 closed this Jul 1, 2023
