generated from CDCgov/template
## Description

NBS provided us with a spreadsheet that defines an existing state, followed by a set of tests with expected results. Architect a testing script that can parse a similar spreadsheet (a CSV is sufficient) containing seed data, then iterate through the tests and check them against the expectations.

## Related Issues

closes #110

## Additional Notes

For a full explanation of the run instructions, directory structure, and file uses, check out the [README.md](https://github.com/CDCgov/RecordLinker/pull/123/files#:~:text=%23%20Record%20Linkage%20Algorithm%20Testing).

The key inputs to these tests are:

- a CSV file that can be used to seed the MPI
- a CSV file with test cases (essentially the same format as the seed file), plus a column indicating whether each record is a match and to whom
- a JSON file with the algorithm configuration to use

The setup should make it easy to edit any of the three files above, run the test cases, and examine the results.
1 parent 3a7e58b, commit a6e837b
Showing 14 changed files with 467 additions and 0 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -73,3 +73,6 @@ __pycache__/ | |
|
||
# Databases | ||
*.sqlite3 | ||
|
||
# Test result files | ||
output.csv |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,12 @@ | ||
# Use the official Python 3.12 slim image as the base | |
FROM python:3.12-slim | ||
|
||
# Set the working directory | ||
WORKDIR /app | ||
|
||
# Copy the scripts and data directories into the image | ||
COPY scripts /app/scripts | ||
COPY data /app/data | ||
|
||
# Install Python dependencies | ||
RUN pip install --no-cache-dir requests |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,104 @@ | ||
# Record Linkage Algorithm Testing | ||
|
||
This repository contains a project to evaluate the match accuracy of the RecordLinker algorithm. | |
|
||
## Prerequisites | ||
|
||
Before getting started, ensure you have the following installed: | ||
|
||
- [Docker](https://docs.docker.com/engine/install/) | ||
- [Docker Compose](https://docs.docker.com/compose/install/) | ||
|
||
## Directory Structure | ||
|
||
- `/`: Contains the `.env` file and `Dockerfile` to build | ||
- `configurations/`: Contains the configuration `.json` file that will be used for the test | ||
- `data/`: Contains the data `.csv` files used for the algorithm test (seed file and test file) | ||
- `results/`: Contains the results `.csv` file after running the test | ||
- `scripts/`: Contains the scripts to run the test | ||
|
||
## Setup | ||
|
||
1. Build the Docker images: | ||
|
||
```bash | ||
docker compose --profile algo-test build | ||
``` | ||
|
||
2. Add seed and test data files | ||
You can use the sample data files provided in the `data` directory or add your own data files. | ||
Input files should be CSV files with the same column headers as the sample files. | |
|
||
`/data/sample_seed_data.csv` | ||
|
||
`/data/sample_test_data.csv` | ||
|
||
|
||
3. Configure environment variables | ||
|
||
`/algo.env` | ||
|
||
Edit the environment variables in the file | ||
|
||
4. Edit the algorithm configuration file | ||
|
||
`/configurations/algorithm_configuration.json` | ||
|
||
Edit the configuration file to tune the algorithm parameters | ||
|
||
## Running Algorithm Tests | ||
|
||
1. Run the test | ||
|
||
```bash | ||
docker compose run --rm algo-test-runner scripts/run_test.py | ||
``` | ||
|
||
2. Analyze the results | ||
|
||
The results of the algorithm tests will be available in the `results/output.csv` file. | ||
|
||
The results will be in a CSV formatted file with the following columns: | ||
`Test Case #`, `Expected Result`, `Match Result`, `Details` | ||
|
||
## Rerunning Algorithm Tests | ||
|
||
After you've run the algorithm tests, you may want to rerun them with different seed data, test data, or configurations. | |
Edit the CSV files and/or the configuration file as needed, then run the following commands to rerun the tests. | |
1. Reset the MPI database | |
```bash | ||
docker compose run --rm algo-test-runner python scripts/reset_db.py | ||
``` | ||
2. Run the tests | ||
```bash | ||
docker compose run --rm algo-test-runner scripts/run_test.py | ||
``` | ||
## Environment Variables | ||
1. `env file`: The attributes that should be tuned for your particular algorithm test | |
are located in the `algo.env` file. | |
2. `environment`: The attributes that should likely remain static for all algorithm tests are located directly in the `compose.yml` file. | ||
### Algorithm Test Parameters | ||
The following environment variables can be tuned in the `algo.env` file: | |
- `SEED_FILE`: The file containing person data to seed the MPI with | |
- `TEST_FILE`: The file containing patient data to test the algorithm with | |
- `ALGORITHM_CONFIGURATION`: The file containing the algorithm configuration JSON | |
- `ALGORITHM_NAME`: The name of the algorithm to use (either the label of your `ALGORITHM_CONFIGURATION` or one of the built-in `dibbs-basic` or `dibbs-enhanced` algorithms) | |
## Cleanup | ||
After you've finished running algorithm tests and analyzing the results, you can stop and remove the Docker containers by running: | ||
|
||
```bash | ||
docker compose --profile algo-test down | ||
``` |
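
The README above lists the columns of `results/output.csv` but not how to read them at a glance. The following is a minimal sketch, not part of this commit, that tallies each combination of `Expected Result` and `Match Result`. The helper name `summarize_results` and the sample `Match Result` values (`Matched`, `Failed`) are hypothetical; the values actually written by the test runner are not shown in this diff.

```python
import csv
import io
from collections import Counter


def summarize_results(csv_text: str) -> Counter:
    """Count how often each (Expected Result, Match Result) pair occurs."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return Counter(
        (row["Expected Result"], row["Match Result"]) for row in reader
    )


# Hypothetical sample content; real values come from results/output.csv.
sample = (
    "Test Case #,Expected Result,Match Result,Details\n"
    "1,Should be a Match,Matched,\n"
    "2,Should be a Match,Matched,\n"
    "6,Should fail,Failed,\n"
)

for (expected, actual), count in summarize_results(sample).items():
    print(f"{expected!r} -> {actual!r}: {count}")
```

Tallying pairs rather than computing a pass rate avoids assuming how the two columns are supposed to line up.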
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,4 @@ | ||
SEED_FILE="data/sample_seed_data.csv" | ||
TEST_FILE="data/sample_test_data.csv" | ||
ALGORITHM_CONFIGURATION="configurations/algorithm_configuration.json" | ||
ALGORITHM_NAME="test-config" |
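
`run_test.py` reads these four values, plus `API_URL`, with `os.getenv`, which returns `None` for anything unset. A small pre-flight check can surface a missing value before any request is made. This is only a sketch: `missing_settings` and `REQUIRED_VARS` are hypothetical names, and the commit itself performs no such check.

```python
# Variable names taken from run_test.py and the env file in this commit.
REQUIRED_VARS = (
    "API_URL",
    "ALGORITHM_NAME",
    "ALGORITHM_CONFIGURATION",
    "SEED_FILE",
    "TEST_FILE",
)


def missing_settings(environ) -> list:
    """Return the required variable names that are unset or empty."""
    return [name for name in REQUIRED_VARS if not environ.get(name)]


# Example mapping mirroring the env file above; API_URL is deliberately absent.
example_env = {
    "SEED_FILE": "data/sample_seed_data.csv",
    "TEST_FILE": "data/sample_test_data.csv",
    "ALGORITHM_CONFIGURATION": "configurations/algorithm_configuration.json",
    "ALGORITHM_NAME": "test-config",
}

print(missing_settings(example_env))  # ['API_URL']
```

In a real run the function would be called with `os.environ` instead of the example mapping.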
tests/algorithm/configurations/algorithm_configuration.json (66 changes: 66 additions & 0 deletions)
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,66 @@ | ||
{ | ||
"label": "test-config", | ||
"description": "test algorithm configuration", | ||
"is_default": false, | ||
"include_multiple_matches": true, | ||
"belongingness_ratio": [0.75, 0.9], | ||
"passes": [ | ||
{ | ||
"blocking_keys": [ | ||
"BIRTHDATE" | ||
], | ||
"evaluators": [ | ||
{ | ||
"feature": "FIRST_NAME", | ||
"func": "func:recordlinker.linking.matchers.feature_match_fuzzy_string" | ||
}, | ||
{ | ||
"feature": "LAST_NAME", | ||
"func": "func:recordlinker.linking.matchers.feature_match_exact" | ||
} | ||
], | ||
"rule": "func:recordlinker.linking.matchers.eval_perfect_match", | ||
"cluster_ratio": 0.9, | ||
"kwargs": { | ||
"thresholds": { | ||
"FIRST_NAME": 0.9, | ||
"LAST_NAME": 0.9, | ||
"BIRTHDATE": 0.95, | ||
"ADDRESS": 0.9, | ||
"CITY": 0.92, | ||
"ZIP": 0.95 | ||
} | ||
} | ||
}, | ||
{ | ||
"blocking_keys": [ | ||
"ZIP", | ||
"FIRST_NAME", | ||
"LAST_NAME", | ||
"SEX" | ||
], | ||
"evaluators": [ | ||
{ | ||
"feature": "ADDRESS", | ||
"func": "func:recordlinker.linking.matchers.feature_match_fuzzy_string" | ||
}, | ||
{ | ||
"feature": "BIRTHDATE", | ||
"func": "func:recordlinker.linking.matchers.feature_match_exact" | ||
} | ||
], | ||
"rule": "func:recordlinker.linking.matchers.eval_perfect_match", | ||
"cluster_ratio": 0.9, | ||
"kwargs": { | ||
"thresholds": { | ||
"FIRST_NAME": 0.9, | ||
"LAST_NAME": 0.9, | ||
"BIRTHDATE": 0.95, | ||
"ADDRESS": 0.9, | ||
"CITY": 0.92, | ||
"ZIP": 0.95 | ||
} | ||
} | ||
} | ||
] | ||
} |
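
Because this file is meant to be edited by hand between runs, a quick structural check before uploading it may save a round trip to the API. The sketch below is an assumption-laden illustration: `config_problems` is a hypothetical helper, and the rules it applies are inferred from the shape of the file above, not from the RecordLinker API schema.

```python
def config_problems(config: dict) -> list:
    """Return human-readable problems found in an algorithm configuration."""
    problems = []
    if not config.get("label"):
        problems.append("missing label")
    ratio = config.get("belongingness_ratio", [])
    if len(ratio) != 2 or not 0 <= ratio[0] <= ratio[1] <= 1:
        problems.append("belongingness_ratio should be [low, high] within 0..1")
    passes = config.get("passes") or []
    if not passes:
        problems.append("no passes defined")
    for index, algorithm_pass in enumerate(passes):
        if not algorithm_pass.get("blocking_keys"):
            problems.append(f"pass {index}: no blocking_keys")
        for evaluator in algorithm_pass.get("evaluators", []):
            if not str(evaluator.get("func", "")).startswith("func:"):
                feature = evaluator.get("feature")
                problems.append(f"pass {index}: {feature} lacks a func: reference")
    return problems


# Abridged copy of the first pass from the configuration above.
example_config = {
    "label": "test-config",
    "belongingness_ratio": [0.75, 0.9],
    "passes": [
        {
            "blocking_keys": ["BIRTHDATE"],
            "evaluators": [
                {
                    "feature": "FIRST_NAME",
                    "func": "func:recordlinker.linking.matchers.feature_match_fuzzy_string",
                },
            ],
        }
    ],
}

print(config_problems(example_config))  # []
```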
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,6 @@ | ||
Match Id,ID,BIRTHDATE,FIRST,LAST,SUFFIX,MAIDEN,RACE,ETHNICITY,GENDER,ADDRESS,CITY,STATE,COUNTY,ZIP,SSN, | ||
1,3020167,1951-06-02,Linda,Nash,Sr,Gutierrez,Asian,Hispanic,F,968 Gonzalez Mount,South Emilybury,GU,North Kennethburgh County,93236,675-79-1449, | ||
2,9488697,1942-08-03,Jose,Singleton,Sr,Ingram,Asian,Hispanic,M,631 Fowler Causeway,Port Williamfurt,IN,Wardburgh County,90637,587-60-3668, | ||
3,1805504,1963-01-29,Ryan,Lawrence,IV,Armstrong,Black,Non-Hispanic,M,5256 Lisa Light,Port Monica,GA,South Christine County,51813,371-33-0433, | ||
4,1792678,1950-08-10,Thomas,Brady,II,Cobb,White,Unknown,M,944 Hayes Port,Jonesville,FM,Jonesview County,6015,272-78-9905, | ||
5,1332302,1972-08-26,Angie,Murphy,Sr,Mcmahon,Black,Non-Hispanic,F,60015 Edward Vista Suite 518,Lake Andreaview,UT,North Rodney County,46540,740-16-5170, |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,7 @@ | ||
Test Case #,Match Id,ID,BIRTHDATE,FIRST,LAST,SUFFIX,MAIDEN,RACE,ETHNICITY,GENDER,ADDRESS,CITY,STATE,COUNTY,ZIP,SSN,Expected Result | ||
1,1,3020167,1951-06-02,Linda,Nash,Jr,Gutierrez,Asian,Hispanic,F,968 Gonzalez Mount,South Emilybury,GU,North Kennethburgh County,93236,675-79-1449,Should be a Match | ||
2,2,9488697,1942-08-03,Singleton,Jose,Sr,Ingram,Asian,Hispanic,M,631 Fowler Causeway,Port Williamfurt,IN,Wardburgh County,90637,587-60-3668,Should be a Match | ||
3,3,1805504,1963-01-29,Ryan,Law-rence,IV,Armstrong,Black,Non-Hispanic,M,5256 Lisa Light,Port Monica,GA,South Christine County,51813,371-33-0433,Should be a Match | ||
4,4,1792678,1950-08-10,Tho-mas,Brady,II,Cobb,White,Unknown,M,944 Hayes Port,Jonesville,FM,Jonesview County,6015,272-78-9905,Should be a Match | ||
5,4,1792678,1950-08-10,ThoMas,Brady,II,Cobb,White,Unknown,M,944 Hayes Port,Jonesville,FM,Jonesview County,6015,272-78-9905,Should be a Match | ||
6,0,1792679,1950-18-10,ThoMas,Brady,II,Cobb,White,Unknown,M,944 Hayes Port,Jonesville,FM,Jonesview County,6015,272-78-9905,Should fail |
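
The test file shares its person columns with the seed file and adds three bookkeeping columns (`Test Case #`, `Match Id`, `Expected Result`). Note also that the seed file's header and rows end with a trailing comma, which `csv.DictReader` reads as an extra column with an empty name. The sketch below shows one way to separate the two groups and drop that empty column. `split_test_row` and `META_COLUMNS` are hypothetical names, and the sample uses an abridged set of columns.

```python
import csv
import io

# Columns that describe the test case rather than the person.
META_COLUMNS = ("Test Case #", "Match Id", "Expected Result")


def split_test_row(row: dict) -> tuple:
    """Separate bookkeeping columns from the person fields in one CSV row."""
    meta = {key: row[key] for key in META_COLUMNS if key in row}
    # `if key` skips the empty-named column produced by a trailing comma.
    person = {
        key: value
        for key, value in row.items()
        if key and key not in META_COLUMNS
    }
    return meta, person


sample = (
    "Test Case #,Match Id,ID,BIRTHDATE,FIRST,LAST,Expected Result\n"
    "1,1,3020167,1951-06-02,Linda,Nash,Should be a Match\n"
)
rows = list(csv.DictReader(io.StringIO(sample)))
meta, person = split_test_row(rows[0])
print(meta)
print(person)
```

The `person` dict keeps the original column names, so it can be passed to a row-to-payload converter such as `dict_to_pii` unchanged.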
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,43 @@ | ||
import json | ||
|
||
|
||
def dict_to_pii(record_data: dict) -> dict: | |
    # convert row to a pii_record | |
    pii_record = { | |
        "external_id": record_data.get("ID", None), | |
        "birth_date": record_data.get("BIRTHDATE", None), | |
        "sex": record_data.get("GENDER", None), | |
        "address": [ | |
            { | |
                "line": [record_data.get("ADDRESS", None)], | |
                "city": record_data.get("CITY", None), | |
                "state": record_data.get("STATE", None), | |
                "county": record_data.get("COUNTY", None), | |
                "postal_code": str(record_data.get("ZIP", "")) | |
            } | |
        ], | |
        "name": [ | |
            { | |
                "given": [record_data.get("FIRST", None)], | |
                "family": record_data.get("LAST", None), | |
                "suffix": [record_data.get("SUFFIX", None)] | |
            } | |
        ], | |
        "ssn": record_data.get("SSN", None), | |
        "race": record_data.get("RACE", None) | |
    } | |
|
||
    return pii_record | |
|
||
|
||
def load_json(file_path: str) -> dict | None: | ||
    """ | |
    Load JSON data from a file. | |
    """ | |
    with open(file_path, "rb") as fobj: | |
        try: | |
            content = json.load(fobj) | |
            return content | |
        except json.JSONDecodeError as exc: | |
            print(f"Error loading JSON file: {exc}") | |
            return None |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,18 @@ | ||
import os | ||
|
||
import requests | ||
|
||
|
||
def reset_db(api_url): | ||
    print("Resetting the database...") | |
    try: | |
        response = requests.delete(f"{api_url}/seed") | |
        response.raise_for_status()  # Raise an error for bad status codes | |
        print("Database reset successfully") | |
    except requests.exceptions.RequestException as e: | |
        print(f"Failed to reset the database: {e}") | |
|
||
|
||
if __name__ == "__main__": | ||
    api_url = os.getenv("API_URL") | |
    reset_db(api_url) |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,33 @@ | ||
#!/usr/bin/env python3 | ||
|
||
import os | ||
|
||
from helpers import load_json | ||
from seed_db import seed_database | ||
from send_test_records import send_test_records | ||
from set_configuration import add_configuration | ||
from set_configuration import check_if_config_already_exists | ||
from set_configuration import update_configuration | ||
|
||
|
||
def main(): | ||
    # Get the environment variables | |
    api_url = os.getenv("API_URL") | |
    algorithm_name = os.getenv("ALGORITHM_NAME") | |
    algorithm_config_file = os.getenv("ALGORITHM_CONFIGURATION") | |
    seed_csv = os.getenv("SEED_FILE") | |
    test_csv = os.getenv("TEST_FILE") | |
|
||
    # setup the algorithm configuration | |
    algorithm_config = load_json(algorithm_config_file) | |
    if check_if_config_already_exists(algorithm_config, api_url): | |
        update_configuration(algorithm_config, api_url) | |
    else: | |
        add_configuration(algorithm_config, api_url) | |
|
||
    seed_database(seed_csv, api_url) | |
|
||
    send_test_records(test_csv, algorithm_name, api_url) | |
|
||
if __name__ == "__main__": | ||
    main() |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,43 @@ | ||
import csv | ||
|
||
import requests | ||
from helpers import dict_to_pii | ||
|
||
|
||
def seed_database(csv_file, api_url): | ||
    MAX_CLUSTERS = 100 | |
    cluster_group = [] | |
|
||
    print("Seeding the database...") | |
|
||
    # Read the CSV file using the csv module | |
    with open(csv_file, mode='r', newline='', encoding='utf-8') as file: | |
        reader = csv.DictReader(file) | |
|
||
        for row in reader: | |
            record_data = {k: ("" if v in [None, "NaN"] else v) for k, v in row.items()} | |
|
||
            # convert dict to a pii_record | |
            pii_record = dict_to_pii(record_data) | |
|
||
            # nesting for the seeding api request | |
            cluster = {"records": [pii_record]} | |
            cluster_group.append(cluster) | |
|
||
            if len(cluster_group) == MAX_CLUSTERS: | |
                send_clusters_to_api(cluster_group, api_url) | |
                cluster_group = [] | |
|
||
        if cluster_group: | |
            send_clusters_to_api(cluster_group, api_url) | |
|
||
    print("Finished seeding the database.") | |
|
||
|
||
def send_clusters_to_api(cluster_group, api_url): | ||
    """Helper function to send a batch of clusters to the API.""" | |
    try: | |
        response = requests.post(f"{api_url}/seed", json={"clusters": cluster_group}) | |
        response.raise_for_status()  # Raise an error for bad status codes | |
    except requests.exceptions.RequestException as e: | |
        print(f"Failed to post batch: {e}") |
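
`seed_database` accumulates clusters and flushes them in groups of `MAX_CLUSTERS`, with a final flush for the remainder. The same batching can be written as a small generator, which keeps the chunking logic separate from the HTTP call and makes it testable without a running API. This is an alternative sketch, not the code in this commit; `batched` is a hypothetical helper name.

```python
from itertools import islice
from typing import Iterable, Iterator


def batched(items: Iterable, size: int) -> Iterator[list]:
    """Yield consecutive lists of at most `size` items."""
    iterator = iter(items)
    # islice returns an empty list once the iterator is exhausted, ending the loop.
    while batch := list(islice(iterator, size)):
        yield batch


# 250 placeholder clusters in the same nesting that seed_database builds.
clusters = [{"records": [{"external_id": str(i)}]} for i in range(250)]
sizes = [len(batch) for batch in batched(clusters, 100)]
print(sizes)  # [100, 100, 50]
```

With this helper, the seeding loop would reduce to one `send_clusters_to_api` call per yielded batch.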