Continuous Integration Tests #129

Merged: 30 commits, Nov 29, 2024

Commits
78fbcea
add continuous integration
jcharkow Oct 30, 2024
b34300b
preinstall numpy
jcharkow Oct 30, 2024
955e1f1
remove numpy from setup
jcharkow Oct 30, 2024
a684136
install numpy in setup script
jcharkow Oct 30, 2024
ff5804c
convert to .toml setup
jcharkow Oct 30, 2024
4b6750a
remove numpy from requirement
jcharkow Oct 30, 2024
122753d
just ubuntu for now
jcharkow Oct 30, 2024
b9f35af
fix setup.py and .toml
jcharkow Oct 30, 2024
be62dfb
add line to build extension
jcharkow Oct 30, 2024
065dc28
Merge branch 'master' into ci
jcharkow Oct 31, 2024
15dd65b
fix: stats tests
jcharkow Oct 31, 2024
c99db26
add pytest-regtest to workflow
jcharkow Oct 31, 2024
fdb5513
update autotuning so it does not fail
jcharkow Nov 1, 2024
3c8bfbd
fix: fix level context tests
jcharkow Nov 1, 2024
d320945
update export-parquet tests and fix tests
jcharkow Nov 1, 2024
71804e4
raise error if standard deviation computed is 0
jcharkow Nov 20, 2024
516389e
test: set tree method as exact for tests
jcharkow Nov 21, 2024
ca4a14e
update snapshot tests
jcharkow Nov 21, 2024
0704475
update actions to depend on requirements file
jcharkow Nov 21, 2024
ce075f3
add dependabot
jcharkow Nov 21, 2024
a0235e0
add tests for windows and mac
jcharkow Nov 21, 2024
88291fa
remove default_rng
jcharkow Nov 21, 2024
08df7cd
remove mac tests
jcharkow Nov 21, 2024
0c342e9
remove copy of np arrays
jcharkow Nov 21, 2024
527c365
remove windows tests
jcharkow Nov 21, 2024
9c64665
refactor: new function for normalizing scores to decoys
jcharkow Nov 21, 2024
9be0a9a
replace semi-supervised learning normalization with sklearn
jcharkow Nov 21, 2024
aa70dbe
revert to numpy std
jcharkow Nov 21, 2024
005af6c
minor updates to pyprophet.toml
jcharkow Nov 21, 2024
81f78bf
fix: ValueError: Buffer dtype mismatch, expected 'DATA_TYPE' but got …
jcharkow Nov 27, 2024
31 changes: 31 additions & 0 deletions .github/workflows/ci.yml
@@ -0,0 +1,31 @@
name: continuous-integration

on: [push]

jobs:
  test:
    runs-on: ${{ matrix.os }}
    strategy:
      matrix:
        os: [ubuntu-latest]
        # Requirements file generated with python=3.11
        python-version: ["3.11"]
    steps:
      - uses: actions/checkout@v4

      - name: Set up Python ${{ matrix.python-version }}
        uses: actions/setup-python@v5
        with:
          python-version: ${{ matrix.python-version }}
      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
          pip install -r requirements.txt # test with requirements file so can easily bump with dependabot
          pip install .

      - name: Compile cython module
        run: python setup.py build_ext --inplace

      - name: Test
        run: |
          python -m pytest tests/
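
The separate build_ext --inplace step presumably exists because the tests import the compiled cython extension from the source checkout rather than from the installed wheel. A minimal sketch of a setup.py that supports such an in-place build; the .pyx path is an assumed placeholder, not this repo's exact file:

# Hypothetical minimal setup.py for an in-place cython build.
from setuptools import setup
from Cython.Build import cythonize
import numpy as np

setup(
    ext_modules=cythonize(["pyprophet/_optimized.pyx"]),  # assumed path
    include_dirs=[np.get_include()],  # the cython module uses the numpy C API
)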
9 changes: 9 additions & 0 deletions .github/workflows/dependabot.yml
@@ -0,0 +1,9 @@
version: 2
updates:
  - package-ecosystem: "pip"
    directory: "/" # Location of your pyproject.toml or requirements.txt
    schedule:
      interval: "weekly" # Checks for updates every week
    commit-message:
      prefix: "deps" # Prefix for pull request titles
    open-pull-requests-limit: 5 # Limit the number of open PRs at a time
1 change: 1 addition & 0 deletions .gitignore
@@ -36,3 +36,4 @@ nosetests.xml

# vim
*.sw[opqrs]
+*~
55 changes: 55 additions & 0 deletions pyproject.toml
@@ -0,0 +1,55 @@
[build-system]
requires = ["setuptools", "wheel", "numpy", "cython"] # Dependencies needed to build the package
build-backend = "setuptools.build_meta"

[project]
name = "pyprophet"
version = "2.2.9"
description = "PyProphet: Semi-supervised learning and scoring of OpenSWATH results."
readme = { file = "README.md", content-type = "text/markdown" }
license = { text = "BSD" }
authors = [{ name = "The PyProphet Developers", email = "[email protected]" }]
classifiers = [
    "Development Status :: 3 - Alpha",
    "Environment :: Console",
    "Intended Audience :: Science/Research",
    "License :: OSI Approved :: BSD License",
    "Operating System :: OS Independent",
    "Topic :: Scientific/Engineering :: Bio-Informatics",
    "Topic :: Scientific/Engineering :: Chemistry"
]
keywords = ["bioinformatics", "openSWATH", "mass spectrometry"]

# Dependencies required for runtime
dependencies = [
    "Click",
    "duckdb",
    "duckdb-extensions",
    "duckdb-extension-sqlite-scanner",
Review comment (Contributor), on lines +26 to +28:
duckdb is currently only used for OSW-to-parquet exporting, right? I'm thinking we could create a separate dependency group, so that someone who wants to export to parquet can install pyprophet[parquet] or something similar, just to reduce the number of dependencies of the main library for regular scoring and tsv exporting. What do you think?

Reply (Contributor Author):
From my initial tests, duckdb tends to speed up sqlite statements with many table joins, so I was thinking of extending its usage to scoring and tsv exporting, since only minimal changes are required to do this.

"numpy >= 2.0",
"scipy",
"pandas >= 0.17",
"cython",
"numexpr >= 2.10.1",
"scikit-learn >= 0.17",
"xgboost",
"hyperopt",
"statsmodels >= 0.8.0",
"matplotlib",
"tabulate",
"pyarrow",
"pypdf"
]

# Optional dependencies
[project.optional-dependencies]
testing = ["pytest", "pytest-regtest"]

# Define console entry points
[project.scripts]
pyprophet = "pyprophet.main:cli"

[tool.setuptools]
packages = { find = { exclude = ["ez_setup", "examples", "tests"] } }
include-package-data = true
zip-safe = false
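
As a rough sketch of the pyprophet[parquet] idea raised in the review thread above; the extra name, the guard helper, and the error message are illustrative assumptions, not part of this PR:

# Hypothetical import guard if duckdb were moved to an optional "parquet" extra.
try:
    import duckdb
except ImportError:
    duckdb = None

def require_duckdb():
    # Fail with an actionable message when the optional dependency is missing.
    if duckdb is None:
        raise ImportError(
            "Parquet export requires duckdb; install it with "
            "'pip install pyprophet[parquet]'."
        )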
13 changes: 7 additions & 6 deletions pyprophet/classifiers.py
@@ -110,7 +110,7 @@ def objective(params):

    clf = xgb.XGBClassifier(random_state=42, verbosity=0, objective='binary:logitraw', eval_metric='auc', **params)

-    score = cross_val_score(clf, X, y, scoring='roc_auc', n_jobs=self.threads, cv=KFold(n_splits=3, shuffle=True, random_state=np.random.RandomState(42))).mean()
+    score = cross_val_score(clf, X, y, scoring='roc_auc', n_jobs=self.threads, cv=KFold(n_splits=3, shuffle=True, random_state=42)).mean()
    # click.echo("Info: AUC: {:.3f} hyperparameters: {}".format(score, params))
    return score

@@ -129,7 +129,8 @@ def objective(params):
    xgb_params_complexity = self.xgb_params_tuned
    xgb_params_complexity.update({k: self.xgb_params_space[k] for k in ('max_depth', 'min_child_weight')})

-    best_complexity = fmin(fn=objective, space=xgb_params_complexity, algo=tpe.suggest, max_evals=self.xgb_hyperparams['autotune_num_rounds'], rstate=np.random.RandomState(42))
+    rng = np.random.default_rng(42)
+    best_complexity = fmin(fn=objective, space=xgb_params_complexity, algo=tpe.suggest, max_evals=self.xgb_hyperparams['autotune_num_rounds'], rstate=rng)
    best_complexity['max_depth'] = int(best_complexity['max_depth'])
    best_complexity['min_child_weight'] = int(best_complexity['min_child_weight'])
@@ -139,31 +140,31 @@ def objective(params):
    xgb_params_gamma = self.xgb_params_tuned
    xgb_params_gamma['gamma'] = self.xgb_params_space['gamma']

-    best_gamma = fmin(fn=objective, space=xgb_params_gamma, algo=tpe.suggest, max_evals=self.xgb_hyperparams['autotune_num_rounds'], rstate=np.random.RandomState(42))
+    best_gamma = fmin(fn=objective, space=xgb_params_gamma, algo=tpe.suggest, max_evals=self.xgb_hyperparams['autotune_num_rounds'], rstate=rng)

    self.xgb_params_tuned.update(best_gamma)

    # Tune subsampling hyperparameters
    xgb_params_subsampling = self.xgb_params_tuned
    xgb_params_subsampling.update({k: self.xgb_params_space[k] for k in ('subsample', 'colsample_bytree', 'colsample_bylevel', 'colsample_bynode')})

-    best_subsampling = fmin(fn=objective, space=xgb_params_subsampling, algo=tpe.suggest, max_evals=self.xgb_hyperparams['autotune_num_rounds'], rstate=np.random.RandomState(42))
+    best_subsampling = fmin(fn=objective, space=xgb_params_subsampling, algo=tpe.suggest, max_evals=self.xgb_hyperparams['autotune_num_rounds'], rstate=rng)

    self.xgb_params_tuned.update(best_subsampling)

    # Tune regularization hyperparameters
    xgb_params_regularization = self.xgb_params_tuned
    xgb_params_regularization.update({k: self.xgb_params_space[k] for k in ('lambda', 'alpha')})

-    best_regularization = fmin(fn=objective, space=xgb_params_regularization, algo=tpe.suggest, max_evals=self.xgb_hyperparams['autotune_num_rounds'], rstate=np.random.RandomState(42))
+    best_regularization = fmin(fn=objective, space=xgb_params_regularization, algo=tpe.suggest, max_evals=self.xgb_hyperparams['autotune_num_rounds'], rstate=rng)

    self.xgb_params_tuned.update(best_regularization)

    # Tune learning rate
    xgb_params_learning = self.xgb_params_tuned
    xgb_params_learning['eta'] = self.xgb_params_space['eta']

-    best_learning = fmin(fn=objective, space=xgb_params_learning, algo=tpe.suggest, max_evals=self.xgb_hyperparams['autotune_num_rounds'], rstate=np.random.RandomState(42))
+    best_learning = fmin(fn=objective, space=xgb_params_learning, algo=tpe.suggest, max_evals=self.xgb_hyperparams['autotune_num_rounds'], rstate=rng)

    self.xgb_params_tuned.update(best_learning)
    click.echo("Info: Optimal hyperparameters: {}".format(self.xgb_params_tuned))
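
Side note on the change above: newer hyperopt releases (0.2.7 and later) expect a numpy Generator for rstate, which is presumably why np.random.RandomState(42) was replaced with a single np.random.default_rng(42) reused across all fmin calls. A minimal standalone sketch under that assumption, with a toy objective and search space:

import numpy as np
from hyperopt import fmin, tpe, hp

rng = np.random.default_rng(42)  # one seeded Generator reused across fmin calls

best = fmin(fn=lambda params: params["x"] ** 2,       # toy objective
            space={"x": hp.uniform("x", -1.0, 1.0)},  # toy search space
            algo=tpe.suggest,
            max_evals=10,
            rstate=rng)
print(best)  # reproducible because the Generator is seeded once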
18 changes: 17 additions & 1 deletion pyprophet/data_handling.py
@@ -5,6 +5,7 @@
import sys
import os
import multiprocessing
+from .stats import mean_and_std_dev

from .optimized import find_top_ranked, rank

@@ -336,6 +337,21 @@ def get_top_target_peaks(self):
def get_feature_matrix(self, use_main_score):
    min_col = 5 if use_main_score else 6
    return self.df.iloc[:, min_col:-1].values

+def normalize_score_by_decoys(self, score_col_name):
+    '''
+    normalize the decoy scores to mean 0 and std 1, scale the targets accordingly
+
+    Args:
+        score_col_name: str, the name of the score column
+    '''
+    td_scores = self.get_top_decoy_peaks()[score_col_name]
+    mu, nu = mean_and_std_dev(td_scores)
+
+    if nu == 0:
+        raise Exception("Warning: Standard deviation of decoy scores is zero. Cannot normalize scores.")
+
+    self.df.loc[:, score_col_name] = (self.df[score_col_name] - mu) / nu


def filter_(self, idx):
    return Experiment(self.df[idx])
@@ -344,7 +360,7 @@ def filter_(self, idx):
def add_peak_group_rank(self):
    ids = self.df.tg_num_id.values
    scores = self.df.d_score.values
-    peak_group_ranks = rank(ids, scores)
+    peak_group_ranks = rank(ids, scores.astype(np.float32, copy=False))
    self.df["peak_group_rank"] = peak_group_ranks

@profile
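
For intuition, a minimal self-contained sketch of what normalize_score_by_decoys computes, with plain numpy standing in for mean_and_std_dev and toy scores as assumptions:

import numpy as np

decoy_scores = np.array([0.8, 1.0, 1.2])          # toy top-decoy scores
all_scores = np.array([0.8, 1.0, 1.2, 2.5, 3.0])  # toy target and decoy scores together

mu, nu = decoy_scores.mean(), decoy_scores.std()
if nu == 0:
    raise ValueError("Standard deviation of decoy scores is zero; cannot normalize.")

normalized = (all_scores - mu) / nu  # decoys map to mean 0 / std 1; targets scale along
print(normalized)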
2 changes: 1 addition & 1 deletion pyprophet/export_parquet.py
@@ -172,7 +172,7 @@ def export_to_parquet(infile, outfile, transitionLevel, onlyFeatures=False):

    # transition level
    if transitionLevel:
-        columns['FEATURE_TRANSITION'] = ['AREA_INTENSITY', 'TOTAL_AREA_INTENSITY', 'APEX_INTENSITY', 'TOTAL_MI'] + getVarColumnNames(condb, 'FEATURE_TRANSITION')
+        columns['FEATURE_TRANSITION'] = ['AREA_INTENSITY', 'TOTAL_AREA_INTENSITY', 'APEX_INTENSITY', 'TOTAL_MI'] + getVarColumnNames(con, 'FEATURE_TRANSITION')
        columns['TRANSITION'] = ['TRAML_ID', 'PRODUCT_MZ', 'CHARGE', 'TYPE', 'ORDINAL', 'DETECTING', 'IDENTIFYING', 'QUANTIFYING', 'LIBRARY_INTENSITY']
        columns['TRANSITION_PRECURSOR_MAPPING'] = ['TRANSITION_ID']

11 changes: 8 additions & 3 deletions pyprophet/levels_contexts.py
@@ -33,7 +33,12 @@ def statistics_report(data, outfile, context, analyte, parametric, pfdr, pi0_lam
        outfile = outfile + "_" + str(data['run_id'].unique()[0])

    # export PDF report
-    save_report(outfile + "_" + context + "_" + analyte + ".pdf", outfile + ": " + context + " " + analyte + "-level error-rate control", data[data.decoy==1]["score"], data[data.decoy==0]["score"], stat_table["cutoff"], stat_table["svalue"], stat_table["qvalue"], data[data.decoy==0]["p_value"], pi0, color_palette)
+    save_report(outfile + "_" + context + "_" + analyte + ".pdf",
+                outfile + ": " + context + " " + analyte + "-level error-rate control",
+                data[data.decoy==1]["score"].values, data[data.decoy==0]["score"].values, stat_table["cutoff"].values,
+                stat_table["svalue"].values, stat_table["qvalue"].values, data[data.decoy==0]["p_value"].values,
+                pi0,
+                color_palette)

    return(data)

@@ -184,7 +189,7 @@ def infer_proteins(infile, outfile, context, parametric, pfdr, pi0_lambda, pi0_m
    con.close()

    if context == 'run-specific':
-        data = data.groupby('run_id').apply(statistics_report, outfile, context, "protein", parametric, pfdr, pi0_lambda, pi0_method, pi0_smooth_df, pi0_smooth_log_pi0, lfdr_truncate, lfdr_monotone, lfdr_transformation, lfdr_adj, lfdr_eps, color_palette).reset_index()
+        data = data.groupby('run_id').apply(statistics_report, outfile, context, "protein", parametric, pfdr, pi0_lambda, pi0_method, pi0_smooth_df, pi0_smooth_log_pi0, lfdr_truncate, lfdr_monotone, lfdr_transformation, lfdr_adj, lfdr_eps, color_palette)
Review comment (Contributor):
Is reset_index no longer needed?

Reply (Contributor Author):
Removing reset_index is required to prevent this error:

  File "/home/joshua/mambaforge/envs/pyprophet_dev/lib/python3.11/site-packages/pandas/core/frame.py", line 5158, in insert
    raise ValueError(f"cannot insert {column}, already exists")
ValueError: cannot insert run_id, already exists

Same as below. It must be a change to pandas groupby functionality.

Reply (@singjc, Contributor, Nov 28, 2024):
Yeah, it seems like it; some groupby deprecations occurred for pandas v2.2.0:

  Deprecated the Grouping attributes group_index, result_index, and group_arraylike; these will be removed in a future version of pandas (GH 56148)

If you don't mind, would you be able to test with a version prior to pandas v2.2.0, to see if the old code works with the .reset_index(), just so we know for sure that is the change.
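
A hypothetical reproduction of the error discussed above, assuming pandas >= 2.2 and toy data:

import pandas as pd

df = pd.DataFrame({"run_id": [1, 1, 2], "score": [0.1, 0.2, 0.3]})

def report(group):
    return group  # stand-in for statistics_report

out = df.groupby("run_id").apply(report)
# 'run_id' is now both a column of the returned groups and a level of the
# result's index, so out.reset_index() tries to re-insert the column and
# raises "ValueError: cannot insert run_id, already exists".
print(out.index.names)  # ['run_id', None]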


    elif context in ['global', 'experiment-wide']:
        data = statistics_report(data, outfile, context, "protein", parametric, pfdr, pi0_lambda, pi0_method, pi0_smooth_df, pi0_smooth_log_pi0, lfdr_truncate, lfdr_monotone, lfdr_transformation, lfdr_adj, lfdr_eps, color_palette)
@@ -257,7 +262,7 @@ def infer_peptides(infile, outfile, context, parametric, pfdr, pi0_lambda, pi0_m
    con.close()

    if context == 'run-specific':
-        data = data.groupby('run_id').apply(statistics_report, outfile, context, "peptide", parametric, pfdr, pi0_lambda, pi0_method, pi0_smooth_df, pi0_smooth_log_pi0, lfdr_truncate, lfdr_monotone, lfdr_transformation, lfdr_adj, lfdr_eps, color_palette).reset_index()
+        data = data.groupby('run_id').apply(statistics_report, outfile, context, "peptide", parametric, pfdr, pi0_lambda, pi0_method, pi0_smooth_df, pi0_smooth_log_pi0, lfdr_truncate, lfdr_monotone, lfdr_transformation, lfdr_adj, lfdr_eps, color_palette)

    elif context in ['global', 'experiment-wide']:
        data = statistics_report(data, outfile, context, "peptide", parametric, pfdr, pi0_lambda, pi0_method, pi0_smooth_df, pi0_smooth_log_pi0, lfdr_truncate, lfdr_monotone, lfdr_transformation, lfdr_adj, lfdr_eps, color_palette)
2 changes: 2 additions & 0 deletions pyprophet/main.py
@@ -106,6 +106,8 @@ def score(infile, outfile, classifier, xgb_autotune, apply_weights, xeval_fracti
    xgb_hyperparams = {'autotune': xgb_autotune, 'autotune_num_rounds': 10, 'num_boost_round': 100, 'early_stopping_rounds': 10, 'test_size': 0.33}

    xgb_params = {'eta': 0.3, 'gamma': 0, 'max_depth': 6, 'min_child_weight': 1, 'subsample': 1, 'colsample_bytree': 1, 'colsample_bylevel': 1, 'colsample_bynode': 1, 'lambda': 1, 'alpha': 0, 'scale_pos_weight': 1, 'verbosity': 0, 'objective': 'binary:logitraw', 'nthread': 1, 'eval_metric': 'auc'}
+    if test:
+        xgb_params['tree_method'] = 'exact'

    xgb_params_space = {'eta': hp.uniform('eta', 0.0, 0.3), 'gamma': hp.uniform('gamma', 0.0, 0.5), 'max_depth': hp.quniform('max_depth', 2, 8, 1), 'min_child_weight': hp.quniform('min_child_weight', 1, 5, 1), 'subsample': 1, 'colsample_bytree': 1, 'colsample_bylevel': 1, 'colsample_bynode': 1, 'lambda': hp.uniform('lambda', 0.0, 1.0), 'alpha': hp.uniform('alpha', 0.0, 1.0), 'scale_pos_weight': 1.0, 'verbosity': 0, 'objective': 'binary:logitraw', 'nthread': 1, 'eval_metric': 'auc'}

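
Presumably tree_method='exact' is forced in test mode so that xgboost's split finding is fully deterministic across platforms; the faster hist and approx methods can produce slightly different trees. A minimal sketch of the setting, with toy data and model as assumptions for illustration:

import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))  # toy features
y = (X[:, 0] > 0).astype(int)  # toy labels

# 'exact' enumerates every candidate split instead of approximating with
# histograms, trading speed for reproducible trees.
clf = xgb.XGBClassifier(tree_method='exact', random_state=42)
clf.fit(X, y)
print(clf.predict(X[:5]))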
14 changes: 3 additions & 11 deletions pyprophet/semi_supervised.py
@@ -3,7 +3,7 @@

from .data_handling import Experiment, update_chosen_main_score_in_table
from .classifiers import AbstractLearner, XGBLearner
-from .stats import mean_and_std_dev, find_cutoff
+from .stats import find_cutoff

try:
    profile
@@ -64,13 +64,9 @@ def learn_randomized(self, experiment, score_columns, working_thread_number):

    # after semi supervised iteration: classify full dataset
    clf_scores = self.score(experiment, params)
-    mu, nu = mean_and_std_dev(clf_scores)
    experiment.set_and_rerank("classifier_score", clf_scores)

-    td_scores = experiment.get_top_decoy_peaks()["classifier_score"]
-
-    mu, nu = mean_and_std_dev(td_scores)
-    experiment["classifier_score"] = (experiment["classifier_score"] - mu) / nu
+    experiment.normalize_score_by_decoys('classifier_score')
    experiment.rank_by("classifier_score")

    top_test_peaks = experiment.get_top_test_peaks()
@@ -92,13 +88,9 @@ def learn_final(self, experiment):

    # after semi supervised iteration: classify full dataset
    clf_scores = self.score(experiment, params)
-    mu, nu = mean_and_std_dev(clf_scores)
    experiment.set_and_rerank("classifier_score", clf_scores)

-    td_scores = experiment.get_top_decoy_peaks()["classifier_score"]
-
-    mu, nu = mean_and_std_dev(td_scores)
-    experiment["classifier_score"] = (experiment["classifier_score"] - mu) / nu
+    experiment.normalize_score_by_decoys('classifier_score')
    experiment.rank_by("classifier_score")

    return params