
Add HieAODE #27

Draft · wants to merge 41 commits into base: master

Commits (41)
e20cebd
wip: start implementing hieaode
Jasperhino Dec 21, 2023
a193da6
wip: calculate cpts
Jasperhino Dec 21, 2023
0ededd3
refactored HieAODE to calculate each of the conditional probabilities…
saragrau4 Jan 13, 2024
702e176
fixed calculate_class_prior
saragrau4 Jan 14, 2024
4e4d2c4
insertion of value in calculation of calculate_class_prior and also a…
saragrau4 Jan 14, 2024
6bf31f1
added black formatting
saragrau4 Jan 14, 2024
2bc47c8
corrected calculation of class prior
saragrau4 Jan 15, 2024
5f82d39
wip: add argmax for prediction logic
Jasperhino Jan 17, 2024
639ef42
fixed bugs in select_predict hieAODE
saragrau4 Jan 17, 2024
216a638
wip: replace pyitlib
Jasperhino Jan 22, 2024
8350ab9
refactor package structure
Jasperhino Jan 22, 2024
43fafbe
feat: lint and fix all tests
Jasperhino Jan 24, 2024
d3e9b77
fix test ci
Jasperhino Jan 24, 2024
53959ca
add missing lib file
Jasperhino Jan 24, 2024
c39e70c
lint
Jasperhino Jan 24, 2024
069676f
add matrix
Jasperhino Jan 24, 2024
d9265cf
remove pypi python versions
Jasperhino Jan 24, 2024
c7871c6
implemented tests for cpts
saragrau Jan 27, 2024
4cd5068
now descendants = all features - ancestors - current feature and not …
saragrau Jan 28, 2024
819beb5
Merge pull request #2 from hasso-plattner-institute/add-matrix-ci
Jasperhino Feb 7, 2024
1bfef18
Merge remote-tracking branch 'origin/master' into add-hie-aode
Jasperhino Feb 7, 2024
5cdf45a
lint
Jasperhino Feb 7, 2024
38ccb05
Calculating conditional probabilities with the laplace estimator
saragrau Feb 12, 2024
5caa487
removed old test form test_hie_aode.py
saragrau Mar 20, 2024
70e1912
refactored `select_and_predict` method for HieAODE class
saragrau Mar 20, 2024
1b77ad2
Refactored HieAODE into HieAODEBase
saragrau Mar 20, 2024
f898c6b
added HieAODELite
saragrau Mar 20, 2024
74d621b
added HieAODEplusplus
saragrau Mar 29, 2024
bb6b385
renamed module `selectors` to `hierarchical_selectors`
saragrau Apr 19, 2024
26341a5
implementation of using positive or negative values only for product
saragrau Apr 19, 2024
dac88a5
renamed descendats and ancestors cpts to prob_feature_given_class_and…
saragrau Apr 20, 2024
30111f4
refactored calculate_prob_given_ascendant_class to calculate_prob_fea…
saragrau Apr 20, 2024
17cc376
Modified implementation HieAODE_plus_plus
saragrau Apr 20, 2024
408f821
renamed feature_idx to parent_idx
saragrau Apr 20, 2024
627ab7c
Refactor select_and_predict to enforce subclass-specific logic
saragrau Apr 20, 2024
c25e3d7
black formatting
saragrau Apr 20, 2024
4a030a5
Implemented HieAODELitePlusPlus, HieAODELitePlus, HieAODEPlusPlus and…
saragrau Apr 20, 2024
c2bcef6
added comments and documentation
saragrau Apr 22, 2024
128dda4
linting
saragrau Apr 22, 2024
ad38daa
Added detailed documentation for each method
saragrau Apr 30, 2024
8d11f4b
resolved indentation formatting issues
Jan 12, 2025
79 changes: 79 additions & 0 deletions .github/workflows/lint-test.yml
Original file line number Diff line number Diff line change
@@ -0,0 +1,79 @@
# Run this job on pushes to `main`, and for pull requests. If you don't specify
# `branches: [main]`, then this action runs _twice_ on pull requests, which is
# annoying.

on:
push:
branches: [main]
pull_request:

jobs:
lint:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v3
- uses: psf/black@stable
test:
runs-on: ubuntu-latest
strategy:
matrix:
python-version: ["3.9", "3.10", "3.11", "3.12"]

steps:
- uses: actions/checkout@v4
- name: Set up Python ${{ matrix.python-version }}
uses: actions/setup-python@v4
with:
python-version: ${{ matrix.python-version }}

# Cache the installation of Poetry itself, e.g. the next step. This prevents the workflow
# from installing Poetry every time, which can be slow. Note the use of the Poetry version
# number in the cache key, and the "-0" suffix: this allows you to invalidate the cache
# manually if/when you want to upgrade Poetry, or if something goes wrong. This could be
# mildly cleaner by using an environment variable, but I don't really care.
- name: cache poetry install
uses: actions/cache@v2
with:
path: ~/.local
key: poetry-1.7.1

# Install Poetry. You could do this manually, or there are several actions that do this.
# `snok/install-poetry` seems to be minimal yet complete, and really just calls out to
# Poetry's default install script, which feels correct. I pin the Poetry version here
# because Poetry does occasionally change APIs between versions and I don't want my
# actions to break if it does.
#
# The key configuration value here is `virtualenvs-in-project: true`: this creates the
# venv as a `.venv` in your testing directory, which allows the next step to easily
# cache it.
- uses: snok/install-poetry@v1
with:
version: 1.7.1
virtualenvs-create: true
virtualenvs-in-project: true

# Cache your dependencies (i.e. all the stuff in your `pyproject.toml`). Note the cache
# key: if you're using multiple Python versions, or multiple OSes, you'd need to include
# them in the cache key. I'm not, so it can be simple and just depend on the poetry.lock.
- name: cache deps
id: cache-deps
uses: actions/cache@v2
with:
path: .venv
key: pydeps-${{ hashFiles('**/poetry.lock') }}

# Install dependencies. `--no-root` means "install all dependencies but not the project
# itself", which is what you want to avoid caching _your_ code. The `if` statement
# ensures this only runs on a cache miss.
- run: poetry install --no-interaction --no-root
if: steps.cache-deps.outputs.cache-hit != 'true'

# Now install _your_ project. This isn't necessary for many types of projects -- particularly
# things like Django apps don't need this. But it's a good idea since it fully exercises the
# pyproject.toml and ensures that if you add things like console-scripts at some point,
# they'll be installed and working.
- run: poetry install --no-interaction

# And finally run tests. I'm using pytest and all my pytest config is in my `pyproject.toml`
# so this line is super-simple. But it could be as complex as you need.
- run: poetry run pytest hfs
38 changes: 0 additions & 38 deletions .github/workflows/python-app.yml

This file was deleted.

6 changes: 4 additions & 2 deletions .gitignore
@@ -23,7 +23,6 @@ dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
@@ -79,4 +78,7 @@ results/
slurm-*.out

# WandB
wandb
wandb

# Weird file kgextension generates
rate_limits.db
63 changes: 63 additions & 0 deletions README.md
@@ -0,0 +1,63 @@
====================================================
hfs - A library for hierarchical feature selection
====================================================

Introduction
=============

Welcome to the **hfs** repository!👋
This library provides several hierarchical feature selection algorithms.

Many real-world settings contain hierarchical relations. While in text mining words can be ordered in generalization-specialization relationships, in bioinformatics the function of genes is often described as a hierarchy. We can make use of these relationships between a dataset's features with special hierarchical feature selection algorithms that reduce redundancy in the data. This not only makes tasks like classification faster but can also improve the results. Depending on your use case and preference, you can choose from lazy and eager hierarchical feature selection algorithms in this library.
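Such a hierarchy is a directed acyclic graph over the features. A minimal sketch of what one looks like, using a hypothetical four-feature ``networkx`` graph (the node numbering is illustrative, not from this library):

```python
import networkx as nx

# Hypothetical feature hierarchy: node 0 is the root term, and edges
# point from more general to more specific features.
hierarchy = nx.DiGraph([(0, 1), (0, 2), (1, 3), (1, 4)])

# For any feature, the hierarchy gives its more general terms
# (ancestors) and its more specific terms (descendants) -- the
# redundancy relations that hierarchical selectors exploit.
feature = 3
print(sorted(nx.ancestors(hierarchy, feature)))    # more general terms
print(sorted(nx.descendants(hierarchy, feature)))  # more specific terms
```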

Getting Started
===================================================

1. Installation
-------------------------------------

The package cannot be installed with pip or conda yet, so you need to clone the ``hfs`` repository::

    git clone https://github.com/hasso-plattner-institute/hfs.git

Then install the environment using::

    poetry install

2. Usage
-------------------------------------------
Here is a simple example of how to use one of the hierarchical feature selection algorithms implemented in hfs:

.. code-block:: python

from hfs import SHSELSelector

# Initialize selector
selector = SHSELSelector(hierarchy)

# Fit selector and transform data
selector.fit(X, y, columns=columns)
X_transformed = selector.transform(X)

Documentation
=============

For detailed information on how to use **hfs**, check out our complete documentation at https://hfs.readthedocs.io. 📖

There you can find not only the API documentation but also more examples, background information on the algorithms we implemented and results for some experiments we performed with them.

Contributing
============

We welcome contributions! If you would like to contribute to the project,
feel free to create a pull request.

Linting and Testing
-------------------

Format the code with::

    poetry run black .

Run the tests with::

    poetry run pytest hfs

Happy feature selecting!
5 changes: 0 additions & 5 deletions environment.yml

This file was deleted.

4 changes: 2 additions & 2 deletions examples/eager_learning_example.py
@@ -18,8 +18,8 @@
import networkx as nx
import numpy as np

from hfs import SHSELSelector
from hfs.helpers import get_columns_for_numpy_hierarchy
from hfs.hierarchical_selectors import SHSELSelector

# Example dataset X with 3 samples and 5 features.
X = np.array(
@@ -65,7 +65,7 @@
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB

from hfs.data.data_utils import create_mapping_columns_to_nodes, load_data, process_data
from hfs.data_utils import create_mapping_columns_to_nodes, load_data, process_data
from hfs.preprocessing import HierarchicalPreprocessor
from hfs.shsel import SHSELSelector

33 changes: 25 additions & 8 deletions examples/lazy_learning_example.py
@@ -1,4 +1,5 @@
# -*- coding: utf-8 -*-
# %%
"""
Lazy learning
=====================
@@ -10,11 +11,8 @@
import networkx as nx
import numpy as np

from hfs.hip import HIP
from hfs.hnb import HNB
from hfs.mr import MR
from hfs.preprocessing import HierarchicalPreprocessor
from hfs.tan import Tan
from hfs.hierarchical_selectors import HIP, HNB, MR, RNB, TAN, HieAODEBase, HNBs


# Define data
@@ -39,6 +37,25 @@ def preprocess():


train, test, train_y_data, test_y_data, hierarchy = preprocess()
# %%
"""
=========================================================================
HieAODE
=========================================================================
"""

print("\nHieAODE:")
# Initialize and fit HieAODE model
model = HieAODEBase(hierarchy=hierarchy)
model.fit_selector(X_train=train, y_train=train_y_data, X_test=test)
# %%
# Select features and predict
predictions = model.select_and_predict(predict=True, saveFeatures=True)
print(predictions)
# %%
# Calculate score
score = model.get_score(test_y_data, predictions)
print(score)

"""
=========================================================================
@@ -59,7 +76,7 @@ def preprocess():
score = model.get_score(test_y_data, predictions)
print(score)


# %%
"""
=========================================================================
HNB-s
@@ -68,7 +85,7 @@ def preprocess():

print("HNB-s:")
# Initialize and fit HNBs model
model = HNB(hierarchy=hierarchy)
model = HNBs(hierarchy=hierarchy)
model.fit_selector(X_train=train, y_train=train_y_data, X_test=test)

# Select features and predict
@@ -88,7 +105,7 @@ def preprocess():

print("\nRNB:")
# Initialize and fit RNB model with threshold k = 3 features to select
model = HNB(hierarchy=hierarchy)
model = RNB(hierarchy=hierarchy)
model.fit_selector(X_train=train, y_train=train_y_data, X_test=test)

# Select features and predict
@@ -144,7 +161,7 @@ def preprocess():
"""
print("\nTAN:")
# Initialize and fit Tan model
model = Tan(hierarchy=hierarchy)
model = TAN(hierarchy=hierarchy)
model.fit_selector(X_train=train, y_train=train_y_data, X_test=test)

# Select features and predict
Binary file added examples/rate_limits.db
Binary file not shown.
25 changes: 10 additions & 15 deletions experiments/experiments.py
@@ -3,17 +3,12 @@

import networkx as nx
import pandas as pd
from sklearn.naive_bayes import BernoulliNB
from sklearn.metrics import accuracy_score, classification_report
from hfs.data.data_utils import create_mapping_columns_to_nodes
from sklearn.naive_bayes import BernoulliNB

from hfs.hip import HIP
from hfs.hnb import HNB
from hfs.hnbs import HNBs
from hfs.mr import MR
from hfs.data_utils import create_mapping_columns_to_nodes
from hfs.preprocessing import HierarchicalPreprocessor
from hfs.rnb import RNB
from hfs.tan import Tan
from hfs.hierarchical_selectors import HIP, HNB, MR, RNB, TAN, HNBs


def data():
@@ -81,12 +76,12 @@ def mr(hierarchy, train, y_train, test, y_test, k, columns, path):


def tan(hierarchy, train, y_train, test, y_test, k, columns, path):
model = Tan(hierarchy=hierarchy)
model = TAN(hierarchy=hierarchy)
model.fit_selector(X_train=train, y_train=y_train, X_test=test, columns=columns)
pred = model.select_and_predict(predict=True, saveFeatures=True)
score = model.get_score(y_test, pred)
with open(path, "a") as file:
file.write("\nTan:\n")
file.write("\nTAN:\n")
file.write(json.dumps(score))


@@ -99,11 +94,11 @@ def hip(hierarchy, train, y_train, test, y_test, k, columns, path):
file.write("\nHIP:\n")
file.write(json.dumps(score))

def naive_bayes(hierarchy, train, y_train, test, y_test, k, columns,path):

def naive_bayes(hierarchy, train, y_train, test, y_test, k, columns, path):
clf = BernoulliNB()
clf.fit(train, y_train)
predictions = clf.predict(test)
predictions = clf.predict(test)
score = classification_report(y_true=y_test, y_pred=predictions, output_dict=True)
with open(path, "a") as file:
file.write("\nBaseline:\n")
@@ -117,7 +112,7 @@ def evaluate(data, k):
preprocessor.fit(train, columns=columns)
train = preprocessor.transform(train)
test = preprocessor.transform(test)

hierarchy = preprocessor.get_hierarchy()
graph = nx.DiGraph(hierarchy)
columns = create_mapping_columns_to_nodes(pd.DataFrame(train), graph)
@@ -134,7 +129,7 @@ def evaluate(data, k):
y_test=y_test,
k=k,
columns=columns,
path = path
path=path,
)

