Merge pull request #31 from artefactory/suggestions
Added Suggestions
VincentAuriau authored Mar 14, 2024
2 parents 8864d44 + 2951eb2 commit bd7f59b
Showing 8 changed files with 211 additions and 176 deletions.
22 changes: 22 additions & 0 deletions LICENSE.md
@@ -0,0 +1,22 @@
The MIT License (MIT)

Copyright (c) 2023 The choice-learn developers, artefactory
All rights reserved.

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
18 changes: 15 additions & 3 deletions README.md
@@ -14,7 +14,7 @@
<img src="docs/choice_learn_official_logo.png" width="256">

Choice-Learn is a Python package designed to help you build discrete choice models.
The package provides ready to use datasets and different models from the litterature. It also provides a lower level use if you want to customize any model or create your own from scratch. In particular you will find smart datasets handling to limit RAM usage and different structure commons to any choice model.
The package provides ready-to-use datasets and models from the literature. It also provides lower-level tools if you want to customize any model or create your own from scratch. In particular, you will find efficient data handling to limit RAM usage and structures common to any choice model.

Choice-Learn uses NumPy and pandas as data backend engines and TensorFlow for models.

@@ -55,7 +55,7 @@ If you are new to choice modelling, you can check this [resource](https://www.pu
- Conditional MultiNomialLogit [[4]](#citation)[[Example]](https://github.com/artefactory/choice-learn-private/blob/main/notebooks/choice_learn_introduction_clogit.ipynb)
- Latent Class MultiNomialLogit [[Example]](https://github.com/artefactory/choice-learn-private/blob/main/notebooks/latent_class_model.ipynb)
- RUMnet [[1]](#citation)[[Example]](https://github.com/artefactory/choice-learn-private/blob/main/notebooks/rumnet_example.ipynb)
- Ready-to-use models to be implemented:
- (WIP) - Ready-to-use models to be implemented:
- Nested MultiNomialLogit
- [TasteNet](https://arxiv.org/abs/2002.00922)
- [SHOPPER](https://projecteuclid.org/journals/annals-of-applied-statistics/volume-14/issue-1/SHOPPER--A-probabilistic-model-of-consumer-choice-with-substitutes/10.1214/19-AOAS1265.full)
@@ -102,6 +102,10 @@ For modelling you need:
Finally, an optional requirement used for reporting and L-BFGS optimization is:
- TensorFlow Probability (>=0.20.1)

Once you have created your conda or pip Python 3.9 environment, you can install the package and its requirements with:
```bash
pip install choice-learn
```
## Usage
```python
from choice_learn.data import ChoiceDataset
Expand Down Expand Up @@ -160,6 +164,14 @@ A detailed documentation of this project is available [here](https://artefactory

## Citation

If you consider this package or any of its features useful for your research, please cite our paper:

(WIP - Paper to come)

### License

This software is released under the MIT license, with no restrictions on usage, including for commercial applications.

### Contributors
### Special Thanks

@@ -170,7 +182,7 @@
[2] [The Acceptance of Modal Innovation: The Case of Swissmetro](https://www.researchgate.net/publication/37456549_The_acceptance_of_modal_innovation_The_case_of_Swissmetro), Bierlaire, M.; Axhausen, K., W.; Abay, G. (2001)\
[3] [Applications and Interpretation of Nested Logit Models of Intercity Mode Choice](https://trid.trb.org/view/385097), Forinash, C., V.; Koppelman, F., S. (1993)\
[4] [The Demand for Local Telephone Service: A Fully Discrete Model of Residential Calling Patterns and Service Choices](https://www.jstor.org/stable/2555538), Train, K., E.; McFadden, D., L.; Ben-Akiva, M. (1987)\
[5] [Estimation of Travel Choice Models with Randomly Distributed Values of Time](https://ideas.repec.org/p/fth/lavaen/9303.html), Ben-Akiva M; Bolduc D; Bradley M(1993)\
[5] [Estimation of Travel Choice Models with Randomly Distributed Values of Time](https://ideas.repec.org/p/fth/lavaen/9303.html), Ben-Akiva, M.; Bolduc, D.; Bradley, M. (1993)\
[6] [Personalize Expedia Hotel Searches - ICDM 2013](https://www.kaggle.com/c/expedia-personalized-sort), Ben Hamner, A.; Friedman, D.; SSA_Expedia. (2013)

### Code and Repositories
55 changes: 35 additions & 20 deletions choice_learn/data/choice_dataset.py
@@ -363,7 +363,6 @@ def _build_features_by_ids(self):
indexes and features_by_id of contexts_items_features
"""
if len(self.features_by_ids) == 0:
print("No features_by_ids given.")
return {}, {}, {}

if (
@@ -629,6 +628,21 @@ def __len__(self):
"""
return len(self.choices)

def __str__(self):
"""Returns short representation of ChoiceDataset.
Returns:
--------
str
short representation of ChoiceDataset
"""
template = """First choice is:\nItems features: {}\nContexts features: {}\n
Contexts Items features: {}\nContexts Items Availabilities: {}\n
Contexts Choice: {}"""
return template.format(
self.batch[0][0], self.batch[0][1], self.batch[0][2], self.batch[0][3], self.batch[0][4]
)
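
As a side note, here is a minimal sketch of how this new `__str__` would typically be exercised; the `dataset` variable and its construction are illustrative assumptions, not part of this diff:

```python
# Assuming `dataset` is an already-built ChoiceDataset, e.g. created with one of
# the from_single_*_df constructors shown further down in this diff.
print(dataset)   # triggers __str__ and prints the features of the first choice (self.batch[0])
text = str(dataset)  # same representation, returned as a string
```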

def get_n_items(self):
"""Method to access the total number of different items.
Expand Down Expand Up @@ -741,7 +755,7 @@ def from_single_wide_df(
contexts_items_availabilities_prefix=None,
delimiter="_",
choices_column="choice",
choice_mode="items_id",
choice_format="items_id",
):
"""Builds numpy arrays for ChoiceDataset from a single dataframe in wide format.
@@ -770,7 +784,7 @@
default is "_"
choice_column: str, optional
Name of the column containing the choices, default is "choice"
choice_mode: str, optional
choice_format: str, optional
How choice is indicated in df, either "items_name" or "items_index",
default is "items_id"
@@ -853,7 +867,7 @@
contexts_items_features = []
for item in items_id:
columns = [
f"{feature}{delimiter}{item}" for feature in contexts_items_features_suffixes
f"{feature}{delimiter}{item}" for feature in contexts_items_features_prefixes
]
for col in columns:
if col not in df.columns:
@@ -901,10 +915,11 @@
contexts_items_availabilities = None

choices = df[choices_column].to_numpy()
if choice_mode == "items_id":
if choice_format == "items_id":
if items_id is None:
raise ValueError("items_id must be given to use choice_mode 'items_id'")
raise ValueError("items_id must be given to use choice_format='items_id'")
items_id = np.array(items_id)

choices = np.squeeze([np.where(items_id == c)[0] for c in choices])
if choices.shape[0] == 0:
raise ValueError("No choice found in the items_id list")
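
To make the renamed argument concrete, here is a hedged sketch of a call to `from_single_wide_df` with the new `choice_format` keyword; the toy DataFrame and its column names are assumptions for illustration only:

```python
import pandas as pd

from choice_learn.data import ChoiceDataset

# Hypothetical wide-format data: one row per context, one "<feature><delimiter><item>"
# column per (feature, item) pair, and the chosen item id in the "choice" column.
wide_df = pd.DataFrame(
    {
        "price_A": [10.0, 12.0],
        "price_B": [11.0, 9.0],
        "choice": ["A", "B"],
    }
)

dataset = ChoiceDataset.from_single_wide_df(
    df=wide_df,
    items_id=["A", "B"],
    contexts_items_features_prefixes=["price"],
    delimiter="_",
    choices_column="choice",
    choice_format="items_id",  # was choice_mode before this PR
)
```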
@@ -922,33 +937,33 @@
def from_single_long_df(
cls,
df,
choices_column="choice",
items_id_column="item_id",
contexts_id_column="context_id",
fixed_items_features_columns=None,
contexts_features_columns=None,
contexts_items_features_columns=None,
items_id_column="item_id",
contexts_id_column="context_id",
choices_column="choice",
choice_mode="items_id",
choice_format="items_id",
):
"""Builds numpy arrays for ChoiceDataset from a single dataframe in long format.
Parameters
----------
df : pandas.DataFrame
dataframe in Long format
choices_column: str, optional
Name of the column containing the choices, default is "choice"
items_id_column: str, optional
Name of the column containing the item ids, default is "items_id"
contexts_id_column: str, optional
Name of the column containing the sessions ids, default is "contexts_id"
fixed_items_features_columns : list
Columns of the dataframe that are item features, default is None
contexts_features_columns : list
Columns of the dataframe that are contexts features, default is None
contexts_items_features_columns : list
Columns of the dataframe that are context-item features, default is None
items_id_column: str, optional
Name of the column containing the item ids, default is "items_id"
contexts_id_column: str, optional
Name of the column containing the sessions ids, default is "contexts_id"
choices_column: str, optional
Name of the column containing the choices, default is "choice"
choice_mode: str, optional
choice_format: str, optional
How choice is indicated in df, either "items_name" or "one_zero",
default is "items_id"
@@ -1000,13 +1015,13 @@ def from_single_long_df(
else None
)

if choice_mode == "items_id":
if choice_format == "items_id":
choices = df[[choices_column, contexts_id_column]].drop_duplicates(contexts_id_column)
choices = choices.set_index(contexts_id_column)
choices = choices.loc[sessions].to_numpy()
# items is the value (str) of the item
choices = np.squeeze([np.where(items == c)[0] for c in choices])
elif choice_mode == "one_zero":
elif choice_format == "one_zero":
choices = df[[items_id_column, choices_column, contexts_id_column]]
choices = choices.loc[choices[choices_column] == 1]
choices = choices.set_index(contexts_id_column)
@@ -1017,7 +1032,7 @@
)
else:
raise ValueError(
f"choice_mode {choice_mode} not recognized. Must be in ['items_id', 'one_zero']"
f"choice_format {choice_format} not recognized. Must be in ['items_id', 'one_zero']"
)
return ChoiceDataset(
fixed_items_features=items_features,
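
Similarly, a hedged sketch of the long-format constructor with the reordered signature and the renamed `choice_format` keyword; the toy DataFrame is an assumption for illustration:

```python
import pandas as pd

from choice_learn.data import ChoiceDataset

# Hypothetical long-format data: one row per (context, item) pair,
# with a 0/1 indicator marking the chosen item.
long_df = pd.DataFrame(
    {
        "context_id": [0, 0, 1, 1],
        "item_id": ["car", "train", "car", "train"],
        "price": [10.0, 8.0, 12.0, 7.0],
        "choice": [1, 0, 0, 1],
    }
)

dataset = ChoiceDataset.from_single_long_df(
    df=long_df,
    contexts_items_features_columns=["price"],
    items_id_column="item_id",
    contexts_id_column="context_id",
    choices_column="choice",
    choice_format="one_zero",  # was choice_mode before this PR
)
```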
36 changes: 19 additions & 17 deletions choice_learn/datasets/base.py
@@ -1,13 +1,15 @@
"""Datasets loader."""
import csv
import gzip
import os
from importlib import resources

import numpy as np
import pandas as pd

from choice_learn.data.choice_dataset import ChoiceDataset

OS_DATA_MODULE = os.path.join(os.path.abspath(".."), "choice_learn", "datasets", "data")
DATA_MODULE = "choice_learn.datasets.data"


@@ -36,7 +38,7 @@ def get_path(data_file_name, module=DATA_MODULE):
return path


def load_csv(data_file_name, data_module=DATA_MODULE, encoding="utf-8"):
def load_csv(data_file_name, data_module=OS_DATA_MODULE, encoding="utf-8"):
"""Base function to load csv files.
Parameters
@@ -55,8 +57,7 @@ def load_csv(data_file_name, data_module=DATA_MODULE, encoding="utf-8"):
np.ndarray
data contained in the csv file
"""
data_path = resources.files(data_module)
with (data_path / data_file_name).open("r", encoding=encoding) as csv_file:
with open(os.path.join(data_module, data_file_name), "r", encoding=encoding) as csv_file:
data_file = csv.reader(csv_file)
names = next(data_file)
data = []
@@ -66,7 +67,7 @@
return names, np.stack(data)


def load_gzip(data_file_name, data_module=DATA_MODULE, encoding="utf-8"):
def load_gzip(data_file_name, data_module=OS_DATA_MODULE, encoding="utf-8"):
"""Base function to load zipped .csv.gz files.
Parameters
@@ -85,8 +86,7 @@ def load_gzip(data_file_name, data_module=DATA_MODULE, encoding="utf-8"):
np.ndarray
data contained in the csv file
"""
data_path = resources.files(data_module)
with (data_path / data_file_name).open("rb") as compressed_file:
with open(os.path.join(data_module, data_file_name), "rb") as compressed_file:
compressed_file = gzip.open(compressed_file, mode="rt", encoding=encoding)
names = next(compressed_file)
names = names.replace("\n", "")
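
For orientation, a hedged sketch of how these loaders resolve files on disk after the switch from `importlib.resources` to `OS_DATA_MODULE`; the file names below are placeholders for illustration, not actual files shipped with the package:

```python
import os

from choice_learn.datasets.base import OS_DATA_MODULE, load_csv, load_gzip

# OS_DATA_MODULE now points at the choice_learn/datasets/data folder on disk.
print(OS_DATA_MODULE, os.path.isdir(OS_DATA_MODULE))

# Placeholder file names; load_csv returns (column names, np.ndarray),
# load_gzip does the same for gzipped csv files.
names, data = load_csv("example_dataset.csv")
names, data = load_gzip("example_dataset.csv.gz")
```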
@@ -363,7 +363,7 @@ def load_swissmetro(add_items_one_hot=False, as_frame=False, return_desc=False,
contexts_items_features_suffixes=contexts_items_features_names,
contexts_items_availabilities_suffix=availabilities_column,
choices_column=choice_column,
choice_mode="item_index",
choice_format="item_index",
)


@@ -372,7 +372,7 @@
add_is_public=False,
as_frame=False,
return_desc=False,
choice_mode="one_zero",
choice_format="one_zero",
split_features=False,
to_wide=False,
preprocessing=None,
@@ -392,8 +392,8 @@
by default False.
return_desc : bool, optional
Whether to return the description, by default False.
choice_mode : str, optional, among ["one_zero", "items_id"]
mode indicating how the choice is encoded, by default "one_zero".
choice_format : str, optional, among ["one_zero", "items_id"]
format indicating how the choice is encoded, by default "one_zero".
split_features : bool, optional
Whether to split features by type in different dataframes, by default False.
to_wide : bool, optional
@@ -461,7 +461,7 @@
for col in canada_df.columns:
canada_df[col] = pd.to_numeric(canada_df[col], errors="ignore")

if choice_mode == "items_id":
if choice_format == "items_id":
# We need to transform how the choice is encoded to add the chosen item id
named_choice = [0] * len(canada_df)
for n_row, row in canada_df.iterrows():
@@ -565,7 +565,7 @@ def load_modecanada(
items_id_column="alt",
contexts_id_column="case",
choices_column="choice",
choice_mode="one_zero",
choice_format="one_zero",
)

return ChoiceDataset.from_single_long_df(
@@ -576,7 +576,7 @@
items_id_column="alt",
contexts_id_column="case",
choices_column=choice_column,
choice_mode="one_zero",
choice_format="one_zero",
)


@@ -706,7 +706,7 @@ def load_electricity(
contexts_items_features_columns=["pf", "cl", "loc", "wk", "tod", "seas"],
items_id_column="alt",
contexts_id_column="chid",
choice_mode="one_zero",
choice_format="one_zero",
)


@@ -750,6 +750,7 @@ def load_train(
if as_frame:
return train_df
train_df["choice"] = train_df.apply(lambda row: row.choice[-1], axis=1)
"""
train_df = train_df.rename(
columns={
"price1": "1_price",
Expand All @@ -766,14 +767,15 @@ def load_train(
"comfort2": "2_comfort",
}
)

"""
return ChoiceDataset.from_single_wide_df(
df=train_df,
items_id=["1", "2"],
fixed_items_suffixes=None,
contexts_features_columns=["id"],
contexts_items_features_suffixes=["price", "time", "change", "comfort"],
contexts_items_features_prefixes=["price", "time", "change", "comfort"],
delimiter="",
contexts_items_availabilities_suffix=None,
choices_column="choice",
choice_mode="items_id",
choice_format="items_id",
)
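
Finally, a hedged sketch of calling `load_train` after this change; the `as_frame` behaviour is taken from the early return visible above, the rest is an assumption:

```python
from choice_learn.datasets.base import load_train

# Raw wide-format dataframe (early return shown in the diff above).
train_df = load_train(as_frame=True)

# ChoiceDataset built through from_single_wide_df with
# contexts_items_features_prefixes=["price", "time", "change", "comfort"] and delimiter="".
dataset = load_train()
```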