These Python programs are modified from the dedupe.io app developed by Forest Gregg and Derek Eder of DataMade. The dedupe.io app is on GitHub: https://github.com/dedupeio/dedupe. For examples, see https://github.com/dedupeio/dedupe-examples.
This repository adapts two of those example programs: csv_example.py, which groups duplicate records within one dataset into clusters (deduplication), and record_linkage_example.py, which links similar records across two datasets (record linkage).
1.) Install Python 3.8.10 for the dedupe app. Other versions have not been tested, and the most recent versions will likely not work. To install Python 3.8.10, see https://www.python.org/downloads/release/python-3810/
2.) Download and install the dedupe app dependencies. Click the requirements.txt file name and download the file, then install the requirements either globally (to any scripts folder) or in a venv used only for this project. To install them, open the integrated terminal in Visual Studio Code (or your IDE) or a command prompt terminal and do the following:
- After installing Python, set up the Python environment: open a terminal (e.g., command prompt), navigate to the project folder that contains requirements.txt, create and activate a venv, and install the requirements:
-
cd path_to_your_project
-
python -m venv dedupeapp_venv
-
dedupeapp_venv\Scripts\activate
-
pip install -r requirements.txt
You should now be able to run the Python scripts from this environment:
#example
python P04_record_linkage_match_accounts_UPDATE_OPS.py
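Before running the scripts, it can help to confirm that the active interpreter is the tested Python version and that the venv is actually activated. The following is a minimal sketch, not part of this repository; the names TESTED_VERSION, matches_tested_version and in_virtualenv are hypothetical.

```python
import sys

# The only version this README reports as tested.
TESTED_VERSION = (3, 8, 10)


def matches_tested_version(version_info=sys.version_info):
    """Return True if the major.minor.micro version equals TESTED_VERSION."""
    return tuple(version_info[:3]) == TESTED_VERSION


def in_virtualenv():
    """Return True when running inside a venv.

    Inside a venv, sys.prefix points at the venv folder while
    sys.base_prefix still points at the base Python installation.
    """
    return sys.prefix != sys.base_prefix


print("Python version:", ".".join(str(part) for part in sys.version_info[:3]))
print("Matches tested version:", matches_tested_version())
print("Running inside a venv:", in_virtualenv())
```

If either check prints False, re-check the installed Python version or re-run the activate command before installing the requirements.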
3.) If using Visual Studio Code (highly recommended), install the IDE: https://code.visualstudio.com/download
Protocol for intake of new accounts, deduplication against historical accounts, new account ID assignment, and fuzzy matching of new accounts to a master dataset
Note: each script name follows the pattern 'P0(value)_name_of_script_to_run.py'; the number gives the order in which the scripts should be executed (i.e., P01 is first).
- P01 (P01_check_new_accounts_against_logbook.py)
- P02 (P02_Dedupe_new_and_historical_accounts.py)
- P03 (P03_Assign_account_IDs_by_max_in_Logbook.py)
- P04 (P04_record_linkage_match_accounts.py) STOP! 🛑 👀
- (Optional) P04b (P04b_fuzzy_on_threshold_matches.py): an additional fuzzy matching pass over the P04 results that focuses on borderline matches (potential mismatches). This final script requires manual analysis of the P04 output to generate its input.
- a) Use the watch_dog_TEMPLATE.py script to monitor incoming files on your local machine. Modify it accordingly to receive email updates as well.
- b) OR use subprocess to run the next script (you must import subprocess), e.g.,
import subprocess

# Your P04 script would have already executed above this line
# Then subprocess is used to trigger the follow-up script for analysis of threshold fuzzy matches
print("Triggering the final script, P04b_fuzzy_on_threshold_matches.py...")
python_path = r"C:/Users/beste/envs/dedupe-examples/Scripts/python.exe"
script_path = r"C:\Users\beste\OneDrive - Qral Group\Desktop\python\FuzzyMatch\P04_Account fuzzy matching scripts\P04b_fuzzy_on_threshold_matches.py"
subprocess.run([python_path, script_path])
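The same subprocess approach could also run the numbered scripts P01 to P04 in sequence. The sketch below is one possible way to do that, not the method this repository uses. It assumes all scripts sit in one folder and can be run with that folder as the working directory; the names PREFIX, ordered_scripts and run_in_order are hypothetical. P04b is skipped on purpose, because its input comes from manual analysis of the P04 output.

```python
import re
import subprocess
import sys
from pathlib import Path

# Matches names such as 'P01_name.py'. 'P04b_name.py' does not match because
# a letter follows the digits, so the optional manual step is left out.
PREFIX = re.compile(r"^P(\d+)_.+\.py$")


def ordered_scripts(folder):
    """Return the numbered scripts in `folder`, sorted by numeric prefix."""
    found = []
    for path in Path(folder).resolve().iterdir():
        match = PREFIX.match(path.name)
        if match:
            found.append((int(match.group(1)), path))
    return [path for _, path in sorted(found)]


def run_in_order(folder):
    """Run each script with the current interpreter; stop at the first failure.

    Returns the names of the scripts that finished with exit code 0.
    """
    folder = Path(folder).resolve()
    completed = []
    for script in ordered_scripts(folder):
        print(f"Running {script.name}...")
        # sys.executable is the interpreter of the active venv, so no
        # hard-coded python.exe path is needed.
        result = subprocess.run([sys.executable, str(script)], cwd=str(folder))
        if result.returncode != 0:
            print(f"{script.name} exited with code {result.returncode}; stopping.")
            break
        completed.append(script.name)
    return completed
```

Calling run_in_order("path_to_your_project") from the activated venv would run every matching script in that folder, so if more than one file shares a prefix (for example two P04 variants), move the unused one elsewhere first.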