Feature/extract #44

Draft
wants to merge 101 commits into
base: main
adf5e76
update gitignore
mathias-samuelides Nov 20, 2023
25a5ca5
add script to test extract
mathias-samuelides Nov 21, 2023
d5a961a
in progress
mathias-samuelides Nov 23, 2023
48ead4d
make extract work
mathias-samuelides Nov 24, 2023
74c0354
refact in progress
mathias-samuelides Nov 24, 2023
1798af9
change use_icu type
mathias-samuelides Nov 24, 2023
f50efc7
raw path
mathias-samuelides Nov 24, 2023
c27dbb4
rename file
mathias-samuelides Nov 24, 2023
e9a4e75
icd conversion
mathias-samuelides Nov 25, 2023
94258cc
refact extract
mathias-samuelides Nov 25, 2023
0222f06
minor
mathias-samuelides Nov 25, 2023
db9bf68
fix path in existing code to make it work
mathias-samuelides Nov 25, 2023
abfdb23
remove useless files
mathias-samuelides Nov 25, 2023
5251a46
start feature selection
mathias-samuelides Nov 27, 2023
6bb85a2
first test_feature_selection
mathias-samuelides Nov 27, 2023
0fda4fc
update todo
mathias-samuelides Nov 27, 2023
0357091
minor
mathias-samuelides Nov 27, 2023
29c872e
remove bp
mathias-samuelides Nov 27, 2023
3f6deed
in progress
mathias-samuelides Nov 27, 2023
f5bd0f3
icu feature preprocessing
mathias-samuelides Nov 28, 2023
49a088c
update raw files
mathias-samuelides Nov 28, 2023
b6401a1
non icu features preprocessing
mathias-samuelides Nov 28, 2023
2dcbe51
clean
mathias-samuelides Nov 28, 2023
98199fa
split raw_files
mathias-samuelides Nov 28, 2023
6e79d45
admission inputer file
mathias-samuelides Nov 28, 2023
ec72790
ndc file
mathias-samuelides Nov 28, 2023
f00b154
refact wip
mathias-samuelides Nov 28, 2023
544dafc
rename file
mathias-samuelides Nov 28, 2023
fb65ebc
fix existing model generation icu
mathias-samuelides Nov 29, 2023
68140ad
prediction task
mathias-samuelides Nov 29, 2023
9cb601e
rename rawdataloader
mathias-samuelides Nov 29, 2023
1e1771d
minor
mathias-samuelides Nov 29, 2023
dfaf5a6
preproc files
mathias-samuelides Nov 30, 2023
9ae5d58
minor
mathias-samuelides Nov 30, 2023
6f1be09
minor
mathias-samuelides Nov 30, 2023
3fb5f13
minor
mathias-samuelides Nov 30, 2023
5e28465
refact extractor preprocessing
mathias-samuelides Nov 30, 2023
5cca228
replace path arguments with dataframe
mathias-samuelides Nov 30, 2023
0f06c22
rename ... extractor
mathias-samuelides Nov 30, 2023
409d8a0
review
mathias-samuelides Nov 30, 2023
ecaada1
refact cohort extractor
mathias-samuelides Nov 30, 2023
dcfc65c
refact feature extraction
mathias-samuelides Dec 1, 2023
2f16214
minor
mathias-samuelides Dec 1, 2023
0b39f80
wip
mathias-samuelides Dec 1, 2023
e0a5c98
wip
mathias-samuelides Dec 1, 2023
a69f70d
wip
mathias-samuelides Dec 1, 2023
3dac5da
minor
mathias-samuelides Dec 1, 2023
ef9eb6c
new readme
mathias-samuelides Dec 2, 2023
a9058ec
wip
mathias-samuelides Dec 3, 2023
137934f
wip
mathias-samuelides Dec 4, 2023
ecbe337
feature
mathias-samuelides Dec 4, 2023
1532de5
refact features wip
mathias-samuelides Dec 5, 2023
ac574d9
summary refact
mathias-samuelides Dec 5, 2023
e50a031
remove commented code
mathias-samuelides Dec 5, 2023
ea045de
typo
mathias-samuelides Dec 5, 2023
049a7bd
renaming
mathias-samuelides Dec 5, 2023
7383995
wip
mathias-samuelides Dec 7, 2023
f829005
wip
mathias-samuelides Dec 7, 2023
cdb977e
wip
mathias-samuelides Dec 7, 2023
fb7a740
wip
mathias-samuelides Dec 8, 2023
2e2d8ed
class cohort
mathias-samuelides Dec 8, 2023
5ed6a09
...
mathias-samuelides Dec 8, 2023
6555435
...
mathias-samuelides Dec 8, 2023
56348ba
docstring
mathias-samuelides Dec 8, 2023
9afa74f
.
mathias-samuelides Dec 8, 2023
3b050c5
.
mathias-samuelides Dec 8, 2023
3efe389
minor
mathias-samuelides Dec 8, 2023
0aed766
remove useless code
mathias-samuelides Dec 8, 2023
c5ddc8f
remove hardcoded string
mathias-samuelides Dec 8, 2023
71137b8
.
mathias-samuelides Dec 8, 2023
ad3a114
.
mathias-samuelides Dec 8, 2023
a839fca
test icd converter
mathias-samuelides Dec 8, 2023
7e3c93c
fix
mathias-samuelides Dec 8, 2023
abb194c
.
mathias-samuelides Dec 8, 2023
1c8c680
feature extract_from
mathias-samuelides Dec 8, 2023
1fac5ec
.
mathias-samuelides Dec 8, 2023
c838a90
.
mathias-samuelides Dec 8, 2023
b60b6d6
.
mathias-samuelides Dec 9, 2023
e5b939d
.
mathias-samuelides Dec 9, 2023
08f4973
.
mathias-samuelides Dec 9, 2023
a3ff7bf
.
mathias-samuelides Dec 9, 2023
e39ac67
summarizer
mathias-samuelides Dec 9, 2023
39c70be
.
mathias-samuelides Dec 9, 2023
508ea8c
feature preprocessor
mathias-samuelides Dec 9, 2023
dcf22a7
.
mathias-samuelides Dec 9, 2023
295c41a
generator
mathias-samuelides Dec 10, 2023
275a584
remove cohort fields from feature classes
mathias-samuelides Dec 10, 2023
ce90791
.
mathias-samuelides Dec 10, 2023
7add049
.
mathias-samuelides Dec 10, 2023
c38de8a
gen med
mathias-samuelides Dec 10, 2023
055577a
.
mathias-samuelides Dec 10, 2023
8c89746
.
mathias-samuelides Dec 10, 2023
08a5c32
.
mathias-samuelides Dec 10, 2023
8e045d8
.
mathias-samuelides Dec 10, 2023
01da5f4
empty dict maker
mathias-samuelides Dec 10, 2023
4610bb3
.
mathias-samuelides Dec 10, 2023
17d6b40
temp file
mathias-samuelides Dec 11, 2023
a01dc09
add preproc data to gitignore
mathias-samuelides Dec 11, 2023
2cb620b
debug files and feature name
mathias-samuelides Dec 11, 2023
3c7af19
test with feature name
mathias-samuelides Dec 11, 2023
ca60898
.
mathias-samuelides Dec 11, 2023
test with feature name
mathias-samuelides committed Dec 11, 2023
commit 3c7af19b68483b6756b9461f05919e8c41031a6c
9 changes: 5 additions & 4 deletions pipeline/features_extractor.py
@@ -104,11 +104,12 @@ def save_features(self) -> List[pd.DataFrame]:
                 EXTRACT_LABS_PATH,
             ),
         ]
-        features = []
+        features = {}
         for condition, feature, path in feature_conditions:
             if condition:
-                features.append(feature.extract_from(cohort))
-                breakpoint()
-                save_data(feature.df, path, feature.__class__.name())
+                extract_feature = feature.extract_from(cohort)
+                feature_name = feature.__class__.name()
+                features[feature_name] = extract_feature
+                save_data(extract_feature, path, feature_name)

         return features
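Review note: this hunk changes `save_features` from returning a list to returning a dict keyed by feature name, so callers look results up by name instead of by position. The return annotation in the hunk header still reads `List[pd.DataFrame]`, which no longer matches; a dict annotation would probably fit better. Below is a minimal sketch of the name-keyed pattern, under stated assumptions: the `Name` enum, the `Diagnoses` and `Lab` classes, and `collect_features` are hypothetical stand-ins, and plain lists of dicts replace DataFrames to keep the example dependency-free. The project's real classes may differ.

```python
from enum import Enum
from typing import Dict, List, Tuple


class Name(Enum):
    # Hypothetical stand-in for pipeline.feature.feature_abc.Name.
    DIAGNOSES = "diagnoses"
    LAB = "lab"


class Diagnoses:
    @staticmethod
    def name() -> Name:
        return Name.DIAGNOSES

    def extract_from(self, cohort: List[dict]) -> List[dict]:
        # Illustrative extraction: one row per cohort member.
        return [{"subject_id": row["subject_id"], "icd_code": "I50"} for row in cohort]


class Lab:
    @staticmethod
    def name() -> Name:
        return Name.LAB

    def extract_from(self, cohort: List[dict]) -> List[dict]:
        return [{"subject_id": row["subject_id"], "itemid": 1} for row in cohort]


def collect_features(
    cohort: List[dict], feature_conditions: List[Tuple[bool, object]]
) -> Dict[Name, List[dict]]:
    """Extract each enabled feature and store it under its name."""
    features: Dict[Name, List[dict]] = {}
    for condition, feature in feature_conditions:
        if condition:
            extracted = feature.extract_from(cohort)
            features[feature.__class__.name()] = extracted
    return features


cohort = [{"subject_id": 1}, {"subject_id": 2}]
result = collect_features(cohort, [(True, Diagnoses()), (False, Lab())])
```

With a dict, a disabled feature is simply absent from the result and does not shift the positions of the others, which is presumably why the tests in this commit switch from `result[0]` to `result[Name.DIAGNOSES]`.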
96 changes: 48 additions & 48 deletions tests/test_cohort_extractor.py
@@ -1,50 +1,50 @@
-import pytest
-from pipeline.cohort_extractor import CohortExtractor
-from pipeline.prediction_task import PredictionTask, TargetType
+# import pytest
+# from pipeline.cohort_extractor import CohortExtractor
+# from pipeline.prediction_task import PredictionTask, TargetType


-@pytest.mark.parametrize(
-    "use_icu, target_type, nb_days, disease_readmission, disease_selection, expected_admission_records_count, expected_patients_count, expected_positive_cases_count",
-    [
-        (True, TargetType.MORTALITY, 0, None, None, 140, 100, 10),
-        (True, TargetType.LOS, 3, None, None, 140, 100, 55),
-        (True, TargetType.LOS, 7, None, None, 140, 100, 20),
-        (True, TargetType.READMISSION, 30, None, None, 128, 93, 18),
-        (True, TargetType.READMISSION, 90, None, None, 128, 93, 22),
-        (True, TargetType.READMISSION, 30, "I50", None, 27, 20, 2),
-        (True, TargetType.READMISSION, 30, "I25", None, 32, 29, 2),
-        (True, TargetType.READMISSION, 30, "N18", None, 25, 18, 2),
-        (True, TargetType.READMISSION, 30, "J44", None, 17, 12, 3),
-        (False, TargetType.MORTALITY, 0, None, None, 275, 100, 15),
-        (False, TargetType.LOS, 3, None, None, 275, 100, 163),
-        (False, TargetType.LOS, 7, None, None, 275, 100, 76),
-        (False, TargetType.READMISSION, 30, None, None, 260, 95, 52),
-        (False, TargetType.READMISSION, 90, None, None, 260, 95, 86),
-        (False, TargetType.READMISSION, 30, "I50", None, 55, 23, 13),
-        # heart failure
-        (False, TargetType.READMISSION, 30, "I25", None, 68, 32, 13),
-        (False, TargetType.READMISSION, 30, "N18", None, 63, 22, 10),
-        (False, TargetType.READMISSION, 30, "J44", None, 26, 12, 7),
-        (True, TargetType.MORTALITY, 0, None, "I50", 32, 22, 5),
-    ],
-)
-def test_cohort_extractor(
-    use_icu,
-    target_type,
-    nb_days,
-    disease_readmission,
-    disease_selection,
-    expected_admission_records_count,
-    expected_patients_count,
-    expected_positive_cases_count,
-):
-    prediction_task = PredictionTask(
-        target_type, disease_readmission, disease_selection, nb_days, use_icu
-    )
-    cohort_extractor = CohortExtractor(
-        prediction_task=prediction_task,
-    )
-    df = cohort_extractor.extract().df
-    assert len(df) == expected_admission_records_count
-    assert df["subject_id"].nunique() == expected_patients_count
-    assert df["label"].sum() == expected_positive_cases_count
+# @pytest.mark.parametrize(
+#     "use_icu, target_type, nb_days, disease_readmission, disease_selection, expected_admission_records_count, expected_patients_count, expected_positive_cases_count",
+#     [
+#         (True, TargetType.MORTALITY, 0, None, None, 140, 100, 10),
+#         (True, TargetType.LOS, 3, None, None, 140, 100, 55),
+#         (True, TargetType.LOS, 7, None, None, 140, 100, 20),
+#         (True, TargetType.READMISSION, 30, None, None, 128, 93, 18),
+#         (True, TargetType.READMISSION, 90, None, None, 128, 93, 22),
+#         (True, TargetType.READMISSION, 30, "I50", None, 27, 20, 2),
+#         (True, TargetType.READMISSION, 30, "I25", None, 32, 29, 2),
+#         (True, TargetType.READMISSION, 30, "N18", None, 25, 18, 2),
+#         (True, TargetType.READMISSION, 30, "J44", None, 17, 12, 3),
+#         (False, TargetType.MORTALITY, 0, None, None, 275, 100, 15),
+#         (False, TargetType.LOS, 3, None, None, 275, 100, 163),
+#         (False, TargetType.LOS, 7, None, None, 275, 100, 76),
+#         (False, TargetType.READMISSION, 30, None, None, 260, 95, 52),
+#         (False, TargetType.READMISSION, 90, None, None, 260, 95, 86),
+#         (False, TargetType.READMISSION, 30, "I50", None, 55, 23, 13),
+#         # heart failure
+#         (False, TargetType.READMISSION, 30, "I25", None, 68, 32, 13),
+#         (False, TargetType.READMISSION, 30, "N18", None, 63, 22, 10),
+#         (False, TargetType.READMISSION, 30, "J44", None, 26, 12, 7),
+#         (True, TargetType.MORTALITY, 0, None, "I50", 32, 22, 5),
+#     ],
+# )
+# def test_cohort_extractor(
+#     use_icu,
+#     target_type,
+#     nb_days,
+#     disease_readmission,
+#     disease_selection,
+#     expected_admission_records_count,
+#     expected_patients_count,
+#     expected_positive_cases_count,
+# ):
+#     prediction_task = PredictionTask(
+#         target_type, disease_readmission, disease_selection, nb_days, use_icu
+#     )
+#     cohort_extractor = CohortExtractor(
+#         prediction_task=prediction_task,
+#     )
+#     df = cohort_extractor.extract().df
+#     assert len(df) == expected_admission_records_count
+#     assert df["subject_id"].nunique() == expected_patients_count
+#     assert df["label"].sum() == expected_positive_cases_count
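Review note: this file is disabled by commenting out every line, so the test no longer appears in the test report at all. A possible alternative is pytest's skip marker, which keeps the code importable and reports the test as skipped. The sketch below is illustrative only: the reduced parameter list, the placeholder body, and the reason string are assumptions, since the PR does not state why the tests were disabled.

```python
import pytest

# Hypothetical reason string; the PR does not say why the tests were disabled.
SKIP_REASON = "temporarily disabled while the extract refactor is in progress"


@pytest.mark.skip(reason=SKIP_REASON)
@pytest.mark.parametrize(
    "use_icu, nb_days, expected_admission_records_count",
    [
        (True, 0, 140),
        (False, 0, 275),
    ],
)
def test_cohort_counts(use_icu, nb_days, expected_admission_records_count):
    # Placeholder body standing in for the real CohortExtractor call.
    assert expected_admission_records_count > 0
```

Running pytest on a file like this should list both parametrized cases as skipped with the given reason, which makes the disabled coverage visible until the tests are restored.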
37 changes: 19 additions & 18 deletions tests/test_feature_extractor.py
@@ -1,6 +1,7 @@
 from pipeline.features_extractor import (
     FeatureExtractor,
 )
+from pipeline.feature.feature_abc import Name


 def test_feature_icu_all_true():
@@ -16,17 +17,17 @@ def test_feature_icu_all_true():
     )
     result = feature_extractor.save_features()
     assert len(result) == 5
-    assert len(result[0]) == 2647
-    assert result[0].columns.tolist() == [
+    assert len(result[Name.DIAGNOSES]) == 2647
+    assert result[Name.DIAGNOSES].columns.tolist() == [
         "subject_id",
         "hadm_id",
         "icd_code",
         "root_icd10_convert",
         "root",
         "stay_id",
     ]
-    assert len(result[1]) == 1435
-    assert result[1].columns.tolist() == [
+    assert len(result[Name.PROCEDURES]) == 1435
+    assert result[Name.PROCEDURES].columns.tolist() == [
         "subject_id",
         "hadm_id",
         "stay_id",
@@ -35,8 +36,8 @@ def test_feature_icu_all_true():
         "intime",
         "event_time_from_admit",
     ]
-    assert len(result[2]) == 11038
-    assert result[2].columns.tolist() == [
+    assert len(result[Name.MEDICATIONS]) == 11038
+    assert result[Name.MEDICATIONS].columns.tolist() == [
         "subject_id",
         "hadm_id",
         "starttime",
@@ -49,8 +50,8 @@ def test_feature_icu_all_true():
         "amount",
         "orderid",
     ]
-    assert len(result[3]) == 9362
-    assert result[3].columns.tolist() == [
+    assert len(result[Name.OUTPUT]) == 9362
+    assert result[Name.OUTPUT].columns.tolist() == [
         "subject_id",
         "hadm_id",
         "stay_id",
@@ -59,8 +60,8 @@ def test_feature_icu_all_true():
         "intime",
         "event_time_from_admit",
     ]
-    assert len(result[4]) == 72108
-    assert result[4].columns.tolist() == [
+    assert len(result[Name.CHART]) == 72108
+    assert result[Name.CHART].columns.tolist() == [
         "stay_id",
         "itemid",
         "valuenum",
@@ -81,16 +82,16 @@ def test_feature_non_icu_all_true():
     )
     result = feature_extractor.save_features()
     assert len(result) == 4
-    assert len(result[0]) == 1273
-    assert result[0].columns.tolist() == [
+    assert len(result[Name.DIAGNOSES]) == 1273
+    assert result[Name.DIAGNOSES].columns.tolist() == [
         "subject_id",
         "hadm_id",
         "icd_code",
         "root_icd10_convert",
         "root",
     ]
-    assert len(result[1]) == 136
-    assert result[1].columns.tolist() == [
+    assert len(result[Name.PROCEDURES]) == 136
+    assert result[Name.PROCEDURES].columns.tolist() == [
         "subject_id",
         "hadm_id",
         "icd_code",
@@ -99,8 +100,8 @@ def test_feature_non_icu_all_true():
         "admittime",
         "proc_time_from_admit",
     ]
-    assert len(result[2]) == 4803
-    assert result[2].columns.tolist() == [
+    assert len(result[Name.MEDICATIONS]) == 4803
+    assert result[Name.MEDICATIONS].columns.tolist() == [
         "subject_id",
         "hadm_id",
         "starttime",
@@ -112,8 +113,8 @@ def test_feature_non_icu_all_true():
         "dose_val_rx",
         "EPC",
     ]
-    assert len(result[3]) == 22029
-    assert result[3].columns.tolist() == [
+    assert len(result[Name.LAB]) == 22029
+    assert result[Name.LAB].columns.tolist() == [
         "subject_id",
         "hadm_id",
         "itemid",
142 changes: 71 additions & 71 deletions tests/test_feature_preprocessor.py
@@ -1,75 +1,75 @@
-from pipeline.features_extractor import FeatureExtractor
-from pipeline.features_preprocessor import FeaturePreprocessor, IcdGroupOption
-from pipeline.data_generator import DataGenerator
+# from pipeline.features_extractor import FeatureExtractor
+# from pipeline.features_preprocessor import FeaturePreprocessor, IcdGroupOption
+# from pipeline.data_generator import DataGenerator


-def test_feature_icu_all_true():
-    extractor = FeatureExtractor(
-        cohort_output="cohort_icu_mortality_0_",
-        use_icu=True,
-        for_diagnoses=True,
-        for_output_events=True,
-        for_chart_events=True,
-        for_procedures=True,
-        for_medications=True,
-        for_labs=True,
-    )
-    preprocessor = FeaturePreprocessor(
-        feature_extractor=extractor,
-        group_diag_icd=IcdGroupOption.GROUP,
-        group_med_code=True,
-        keep_proc_icd9=False,
-        clean_chart=True,
-        impute_outlier_chart=True,
-        impute_labs=True,
-        thresh=98,
-        left_thresh=2,
-        clean_labs=True,
-    )
-    extractor.save_features()
-    preprocessor.preprocess()
-    generator = DataGenerator(
-        cohort_output=extractor.cohort_output,
-        feature_extractor=extractor,
-    )
-    generator.generate_features()
-    generator.length_by_target()
-    generator.smooth_ini()
-    generator.smooth_tqdm()
-    assert 5 == 5
+# def test_feature_icu_all_true():
+#     extractor = FeatureExtractor(
+#         cohort_output="cohort_icu_mortality_0_",
+#         use_icu=True,
+#         for_diagnoses=True,
+#         for_output_events=True,
+#         for_chart_events=True,
+#         for_procedures=True,
+#         for_medications=True,
+#         for_labs=True,
+#     )
+#     preprocessor = FeaturePreprocessor(
+#         feature_extractor=extractor,
+#         group_diag_icd=IcdGroupOption.GROUP,
+#         group_med_code=True,
+#         keep_proc_icd9=False,
+#         clean_chart=True,
+#         impute_outlier_chart=True,
+#         impute_labs=True,
+#         thresh=98,
+#         left_thresh=2,
+#         clean_labs=True,
+#     )
+#     extractor.save_features()
+#     preprocessor.preprocess()
+#     generator = DataGenerator(
+#         cohort_output=extractor.cohort_output,
+#         feature_extractor=extractor,
+#     )
+#     generator.generate_features()
+#     generator.length_by_target()
+#     generator.smooth_ini()
+#     generator.smooth_tqdm()
+#     assert 5 == 5


-def test_feature_non_icu_all_true():
-    extractor = FeatureExtractor(
-        cohort_output="cohort_Non-ICU_readmission_30_I50",
-        use_icu=False,
-        for_diagnoses=True,
-        for_output_events=True,
-        for_chart_events=True,
-        for_procedures=True,
-        for_medications=True,
-        for_labs=True,
-    )
-    preprocessor = FeaturePreprocessor(
-        feature_extractor=extractor,
-        group_diag_icd=IcdGroupOption.GROUP,
-        group_med_code=True,
-        keep_proc_icd9=False,
-        clean_chart=True,
-        impute_outlier_chart=True,
-        impute_labs=True,
-        thresh=95,
-        left_thresh=5,
-        clean_labs=True,
-    )
-    extractor.save_features()
-    preprocessor.preprocess()
-    generator = DataGenerator(
-        cohort_output=extractor.cohort_output,
-        feature_extractor=extractor,
-    )
-    generator.generate_features()
-    generator.length_by_target()
-    generator.smooth_ini()
-    generator.smooth_tqdm()
-    assert 4 == 4
+# def test_feature_non_icu_all_true():
+#     extractor = FeatureExtractor(
+#         cohort_output="cohort_Non-ICU_readmission_30_I50",
+#         use_icu=False,
+#         for_diagnoses=True,
+#         for_output_events=True,
+#         for_chart_events=True,
+#         for_procedures=True,
+#         for_medications=True,
+#         for_labs=True,
+#     )
+#     preprocessor = FeaturePreprocessor(
+#         feature_extractor=extractor,
+#         group_diag_icd=IcdGroupOption.GROUP,
+#         group_med_code=True,
+#         keep_proc_icd9=False,
+#         clean_chart=True,
+#         impute_outlier_chart=True,
+#         impute_labs=True,
+#         thresh=95,
+#         left_thresh=5,
+#         clean_labs=True,
+#     )
+#     extractor.save_features()
+#     preprocessor.preprocess()
+#     generator = DataGenerator(
+#         cohort_output=extractor.cohort_output,
+#         feature_extractor=extractor,
+#     )
+#     generator.generate_features()
+#     generator.length_by_target()
+#     generator.smooth_ini()
+#     generator.smooth_tqdm()
+#     assert 4 == 4