NBS Test Cases (#123)
## Description
NBS provided us with a spreadsheet that defines an existing state and a
set of tests with expected results. Architect a testing script that can
parse a similar spreadsheet (a CSV is fine) with seeding data and
iterate through the tests, checking each expectation.

## Related Issues
closes #110 

## Additional Notes
For a full explanation of run instructions, directory structure, and
file uses check out the
[README.md](https://github.com/CDCgov/RecordLinker/pull/123/files#:~:text=%23%20Record%20Linkage%20Algorithm%20Testing)

The key inputs to these tests will be the following:

- a CSV file that can be used to seed the MPI
- a CSV with test cases (basically the same format as the above), but also a column indicating whether it is a match and to whom
- a JSON file with an algorithm config to use

The setup should make it easy to edit any of those three files, run the test cases, and examine the results.
cbrinson-rise8 authored Dec 11, 2024
1 parent 3a7e58b commit a6e837b
Showing 14 changed files with 467 additions and 0 deletions.
3 changes: 3 additions & 0 deletions .gitignore
Original file line number Diff line number Diff line change
Expand Up @@ -73,3 +73,6 @@ __pycache__/

# Databases
*.sqlite3

# Test result files
output.csv
22 changes: 22 additions & 0 deletions compose.yml
Expand Up @@ -76,3 +76,25 @@ services:
depends_on:
api:
condition: service_healthy

algo-test-runner:
build:
context: tests/algorithm
dockerfile: Dockerfile.algo
env_file:
- tests/algorithm/algo.env
environment:
DB_URI: "postgresql+psycopg2://postgres:pw@db:5432/postgres"
API_URL: "http://api:8080"
volumes:
- ./tests/algorithm/scripts:/app/scripts
- ./tests/algorithm/data:/app/data
- ./tests/algorithm/results:/app/results
- ./tests/algorithm/configurations:/app/configurations
depends_on:
db:
condition: service_healthy
api:
condition: service_healthy
profiles:
- algo-test
12 changes: 12 additions & 0 deletions tests/algorithm/Dockerfile.algo
@@ -0,0 +1,12 @@
# Use the official Python 3.12 slim image as the base
FROM python:3.12-slim

# Set the working directory
WORKDIR /app

# Copy the scripts and data directories into the image
COPY scripts /app/scripts
COPY data /app/data

# Install Python dependencies
RUN pip install --no-cache-dir requests
104 changes: 104 additions & 0 deletions tests/algorithm/README.md
@@ -0,0 +1,104 @@
# Record Linkage Algorithm Testing

This directory contains a project to evaluate the match accuracy of the RecordLinker algorithm.

## Prerequisites

Before getting started, ensure you have the following installed:

- [Docker](https://docs.docker.com/engine/install/)
- [Docker Compose](https://docs.docker.com/compose/install/)

## Directory Structure

- `/`: Contains the `algo.env` file and `Dockerfile.algo` used to build the test-runner image
- `configurations/`: Contains the configuration `.json` file that will be used for the test
- `data/`: Contains the data `.csv` files used for the algorithm test (seed file and test file)
- `results/`: Contains the results `.csv` file after running the test
- `scripts/`: Contains the scripts to run the test

## Setup

1. Build the Docker images:

```bash
docker compose --profile algo-test build
```

2. Add seed and test data files

You can use the sample data files provided in the `data` directory or add your own.
Each input file should be a CSV with the same column headers as shown in the sample files.

`/data/sample_seed_data.csv`

`/data/sample_test_data.csv`


3. Configure environment variables

`/algo.env`

Edit the environment variables in the file

4. Edit the algorithm configuration file

`/configurations/algorithm_configuration.json`

Edit the configuration file to tune the algorithm parameters

## Running Algorithm Tests

1. Run the test

```bash
docker compose run --rm algo-test-runner python scripts/run_test.py
```

2. Analyze the results

The results of the algorithm tests will be available in the `results/output.csv` file.

The results will be in a CSV formatted file with the following columns:
`Test Case #`, `Expected Result`, `Match Result`, `Details`
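
Tallying expected-versus-actual outcomes is a quick way to summarize a run. A minimal sketch (the `summarize` helper is hypothetical, not part of this PR; it assumes only the column names above):

```python
import csv
from collections import Counter


def summarize(path: str) -> Counter:
    """Tally (Expected Result, Match Result) pairs from a results CSV."""
    with open(path, newline="", encoding="utf-8") as f:
        return Counter(
            (row["Expected Result"], row["Match Result"])
            for row in csv.DictReader(f)
        )
```

For example, `summarize("results/output.csv")` returns a `Counter` keyed by (expected, actual) pairs, so mismatched pairs stand out at a glance.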

## Rerunning Algorithm Tests

After you've run the algorithm tests, you may want to rerun them with different seed data, test data, or configurations.
Edit the CSV files and/or the configuration file as needed, then run the following commands to rerun the tests.

1. Reset the MPI database

```bash
docker compose run --rm algo-test-runner python scripts/reset_db.py
```

2. Run the tests

```bash
docker compose run --rm algo-test-runner python scripts/run_test.py
```

## Environment Variables

1. `env_file`: The attributes that should be tuned for your particular algorithm test are located in the `algo.env` file.
2. `environment`: The attributes that should likely remain static for all algorithm tests are located directly in the `compose.yml` file.

### Algorithm Test Parameters

The following environment variables can be tuned in the `algo.env` file:

- `SEED_FILE`: The file containing person data to seed the MPI with
- `TEST_FILE`: The file containing patient data to test the algorithm with
- `ALGORITHM_CONFIGURATION`: The file containing the algorithm configuration JSON
- `ALGORITHM_NAME`: The name of the algorithm to use (either the `label` from your `ALGORITHM_CONFIGURATION` file, or one of the built-in `dibbs-basic` or `dibbs-enhanced` algorithms)
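
The scripts read these variables with `os.getenv` (as in `run_test.py`). A small hypothetical helper, not part of the PR, to fail fast when any are missing:

```python
import os

# Variables the test scripts read (API_URL is set in compose.yml,
# the rest in the env file).
REQUIRED_VARS = ["API_URL", "SEED_FILE", "TEST_FILE",
                 "ALGORITHM_CONFIGURATION", "ALGORITHM_NAME"]


def check_env() -> None:
    """Exit with a clear message if any required variable is unset or empty."""
    missing = [name for name in REQUIRED_VARS if not os.getenv(name)]
    if missing:
        raise SystemExit(f"Missing environment variables: {', '.join(missing)}")
```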
## Cleanup
After you've finished running algorithm tests and analyzing the results, you can stop and remove the Docker containers by running:

```bash
docker compose --profile algo-test down
```
4 changes: 4 additions & 0 deletions tests/algorithm/algo.env
@@ -0,0 +1,4 @@
SEED_FILE="data/sample_seed_data.csv"
TEST_FILE="data/sample_test_data.csv"
ALGORITHM_CONFIGURATION="configurations/algorithm_configuration.json"
ALGORITHM_NAME="test-config"
66 changes: 66 additions & 0 deletions tests/algorithm/configurations/algorithm_configuration.json
@@ -0,0 +1,66 @@
{
"label": "test-config",
"description": "test algorithm configuration",
"is_default": false,
"include_multiple_matches": true,
"belongingness_ratio": [0.75, 0.9],
"passes": [
{
"blocking_keys": [
"BIRTHDATE"
],
"evaluators": [
{
"feature": "FIRST_NAME",
"func": "func:recordlinker.linking.matchers.feature_match_fuzzy_string"
},
{
"feature": "LAST_NAME",
"func": "func:recordlinker.linking.matchers.feature_match_exact"
}
],
"rule": "func:recordlinker.linking.matchers.eval_perfect_match",
"cluster_ratio": 0.9,
"kwargs": {
"thresholds": {
"FIRST_NAME": 0.9,
"LAST_NAME": 0.9,
"BIRTHDATE": 0.95,
"ADDRESS": 0.9,
"CITY": 0.92,
"ZIP": 0.95
}
}
},
{
"blocking_keys": [
"ZIP",
"FIRST_NAME",
"LAST_NAME",
"SEX"
],
"evaluators": [
{
"feature": "ADDRESS",
"func": "func:recordlinker.linking.matchers.feature_match_fuzzy_string"
},
{
"feature": "BIRTHDATE",
"func": "func:recordlinker.linking.matchers.feature_match_exact"
}
],
"rule": "func:recordlinker.linking.matchers.eval_perfect_match",
"cluster_ratio": 0.9,
"kwargs": {
"thresholds": {
"FIRST_NAME": 0.9,
"LAST_NAME": 0.9,
"BIRTHDATE": 0.95,
"ADDRESS": 0.9,
"CITY": 0.92,
"ZIP": 0.95
}
}
}
]
}
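A configuration like the one above can be sanity-checked before upload. A minimal sketch (a hypothetical `validate_config` helper, not part of the PR) that verifies each pass carries the keys the config above uses:

```python
REQUIRED_PASS_KEYS = {"blocking_keys", "evaluators", "rule"}


def validate_config(config: dict) -> list[str]:
    """Return a list of human-readable problems found in an algorithm config."""
    problems = []
    for key in ("label", "passes"):
        if key not in config:
            problems.append(f"missing top-level key: {key}")
    for i, pass_ in enumerate(config.get("passes", [])):
        missing = REQUIRED_PASS_KEYS - pass_.keys()
        if missing:
            problems.append(f"pass {i}: missing {sorted(missing)}")
        for ev in pass_.get("evaluators", []):
            if not {"feature", "func"} <= ev.keys():
                problems.append(f"pass {i}: malformed evaluator {ev}")
    return problems
```

An empty return value means the config has the expected shape; anything else lists what to fix before the API call.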
6 changes: 6 additions & 0 deletions tests/algorithm/data/sample_seed_data.csv
@@ -0,0 +1,6 @@
Match Id,ID,BIRTHDATE,FIRST,LAST,SUFFIX,MAIDEN,RACE,ETHNICITY,GENDER,ADDRESS,CITY,STATE,COUNTY,ZIP,SSN,
1,3020167,1951-06-02,Linda,Nash,Sr,Gutierrez,Asian,Hispanic,F,968 Gonzalez Mount,South Emilybury,GU,North Kennethburgh County,93236,675-79-1449,
2,9488697,1942-08-03,Jose,Singleton,Sr,Ingram,Asian,Hispanic,M,631 Fowler Causeway,Port Williamfurt,IN,Wardburgh County,90637,587-60-3668,
3,1805504,1963-01-29,Ryan,Lawrence,IV,Armstrong,Black,Non-Hispanic,M,5256 Lisa Light,Port Monica,GA,South Christine County,51813,371-33-0433,
4,1792678,1950-08-10,Thomas,Brady,II,Cobb,White,Unknown,M,944 Hayes Port,Jonesville,FM,Jonesview County,6015,272-78-9905,
5,1332302,1972-08-26,Angie,Murphy,Sr,Mcmahon,Black,Non-Hispanic,F,60015 Edward Vista Suite 518,Lake Andreaview,UT,North Rodney County,46540,740-16-5170,
7 changes: 7 additions & 0 deletions tests/algorithm/data/sample_test_data.csv
@@ -0,0 +1,7 @@
Test Case #,Match Id,ID,BIRTHDATE,FIRST,LAST,SUFFIX,MAIDEN,RACE,ETHNICITY,GENDER,ADDRESS,CITY,STATE,COUNTY,ZIP,SSN,Expected Result
1,1,3020167,1951-06-02,Linda,Nash,Jr,Gutierrez,Asian,Hispanic,F,968 Gonzalez Mount,South Emilybury,GU,North Kennethburgh County,93236,675-79-1449,Should be a Match
2,2,9488697,1942-08-03,Singleton,Jose,Sr,Ingram,Asian,Hispanic,M,631 Fowler Causeway,Port Williamfurt,IN,Wardburgh County,90637,587-60-3668,Should be a Match
3,3,1805504,1963-01-29,Ryan,Law-rence,IV,Armstrong,Black,Non-Hispanic,M,5256 Lisa Light,Port Monica,GA,South Christine County,51813,371-33-0433,Should be a Match
4,4,1792678,1950-08-10,Tho-mas,Brady,II,Cobb,White,Unknown,M,944 Hayes Port,Jonesville,FM,Jonesview County,6015,272-78-9905,Should be a Match
5,4,1792678,1950-08-10,ThoMas,Brady,II,Cobb,White,Unknown,M,944 Hayes Port,Jonesville,FM,Jonesview County,6015,272-78-9905,Should be a Match
6,0,1792679,1950-18-10,ThoMas,Brady,II,Cobb,White,Unknown,M,944 Hayes Port,Jonesville,FM,Jonesview County,6015,272-78-9905,Should fail
43 changes: 43 additions & 0 deletions tests/algorithm/scripts/helpers.py
@@ -0,0 +1,43 @@
import json


def dict_to_pii(record_data: dict) -> dict:
# convert row to a pii_record
pii_record = {
"external_id": record_data.get('ID', None),
"birth_date": record_data.get("BIRTHDATE", None),
"sex": record_data.get("GENDER", None),
"address": [
{
"line": [record_data.get("ADDRESS", None)],
"city": record_data.get("CITY", None),
"state": record_data.get("STATE", None),
"county": record_data.get("COUNTY", None),
"postal_code": str(record_data.get("ZIP", ""))
}
],
"name": [
{
"given": [record_data.get("FIRST", None)],
"family": record_data.get("LAST", None),
"suffix": [record_data.get("SUFFIX", None)]
}
],
"ssn": record_data.get("SSN", None),
"race": record_data.get("RACE", None)
}

return pii_record


def load_json(file_path: str) -> dict | None:
"""
Load JSON data from a file.
"""
with open(file_path, "rb") as fobj:
try:
content = json.load(fobj)
return content
except json.JSONDecodeError as exc:
print(f"Error loading JSON file: {exc}")
return None
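For reference, here is how the first sample seed row maps through `dict_to_pii` (the function body is a condensed copy of the one in `helpers.py` above, so the snippet is self-contained):

```python
def dict_to_pii(record_data: dict) -> dict:
    # Condensed copy of helpers.dict_to_pii, for illustration only.
    return {
        "external_id": record_data.get("ID"),
        "birth_date": record_data.get("BIRTHDATE"),
        "sex": record_data.get("GENDER"),
        "address": [{
            "line": [record_data.get("ADDRESS")],
            "city": record_data.get("CITY"),
            "state": record_data.get("STATE"),
            "county": record_data.get("COUNTY"),
            "postal_code": str(record_data.get("ZIP", "")),
        }],
        "name": [{
            "given": [record_data.get("FIRST")],
            "family": record_data.get("LAST"),
            "suffix": [record_data.get("SUFFIX")],
        }],
        "ssn": record_data.get("SSN"),
        "race": record_data.get("RACE"),
    }


# First row of sample_seed_data.csv, as csv.DictReader would yield it.
row = {
    "ID": "3020167", "BIRTHDATE": "1951-06-02", "GENDER": "F",
    "FIRST": "Linda", "LAST": "Nash", "SUFFIX": "Sr",
    "ADDRESS": "968 Gonzalez Mount", "CITY": "South Emilybury",
    "STATE": "GU", "COUNTY": "North Kennethburgh County",
    "ZIP": "93236", "SSN": "675-79-1449", "RACE": "Asian",
}
pii = dict_to_pii(row)
```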
18 changes: 18 additions & 0 deletions tests/algorithm/scripts/reset_db.py
@@ -0,0 +1,18 @@
import os

import requests


def reset_db(api_url):
print("Resetting the database...")
try:
response = requests.delete(f"{api_url}/seed")
response.raise_for_status() # Raise an error for bad status codes
print("Database reset successfully")
except requests.exceptions.RequestException as e:
print(f"Failed to reset the database: {e}")


if __name__ == "__main__":
api_url = os.getenv("API_URL")
reset_db(api_url)
33 changes: 33 additions & 0 deletions tests/algorithm/scripts/run_test.py
@@ -0,0 +1,33 @@
#!/usr/bin/env python3

import os

from helpers import load_json
from seed_db import seed_database
from send_test_records import send_test_records
from set_configuration import add_configuration
from set_configuration import check_if_config_already_exists
from set_configuration import update_configuration


def main():
# Get the environment variables
api_url = os.getenv("API_URL")
algorithm_name = os.getenv("ALGORITHM_NAME")
algorithm_config_file = os.getenv("ALGORITHM_CONFIGURATION")
seed_csv = os.getenv("SEED_FILE")
test_csv = os.getenv("TEST_FILE")

# setup the algorithm configuration
algorithm_config = load_json(algorithm_config_file)
if check_if_config_already_exists(algorithm_config, api_url):
update_configuration(algorithm_config, api_url)
else:
add_configuration(algorithm_config, api_url)

seed_database(seed_csv, api_url)

send_test_records(test_csv, algorithm_name, api_url)

if __name__ == "__main__":
main()
43 changes: 43 additions & 0 deletions tests/algorithm/scripts/seed_db.py
@@ -0,0 +1,43 @@
import csv

import requests
from helpers import dict_to_pii


def seed_database(csv_file, api_url):
MAX_CLUSTERS = 100
cluster_group = []

print("Seeding the database...")

# Read the CSV file using the csv module
with open(csv_file, mode='r', newline='', encoding='utf-8') as file:
reader = csv.DictReader(file)

for row in reader:
record_data = {k: ("" if v in [None, "NaN"] else v) for k, v in row.items()}

# convert dict to a pii_record
pii_record = dict_to_pii(record_data)

# nesting for the seeding api request
cluster = {"records": [pii_record]}
cluster_group.append(cluster)

if len(cluster_group) == MAX_CLUSTERS:
send_clusters_to_api(cluster_group, api_url)
cluster_group = []

if cluster_group:
send_clusters_to_api(cluster_group, api_url)

print("Finished seeding the database.")


def send_clusters_to_api(cluster_group, api_url):
"""Helper function to send a batch of clusters to the API."""
try:
response = requests.post(f"{api_url}/seed", json={"clusters": cluster_group})
response.raise_for_status() # Raise an error for bad status codes
except requests.exceptions.RequestException as e:
print(f"Failed to post batch: {e}")
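`seed_database` flushes a batch every `MAX_CLUSTERS` rows and once more for any remainder. That chunking pattern in isolation (a hypothetical `batch` helper, shown for illustration):

```python
def batch(records, max_size: int = 100):
    """Yield lists of at most max_size records, mirroring the
    MAX_CLUSTERS flush-and-reset loop in seed_db.py."""
    group = []
    for record in records:
        group.append(record)
        if len(group) == max_size:
            yield group
            group = []
    if group:  # flush the final partial batch
        yield group
```

Batching keeps each `/seed` request bounded in size regardless of how large the seed CSV is.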