These Python programs are modified from the app developed by Forest Gregg and Derek Eder of DataMade. The app GitHub : Also, see here for examples:
This repository modifies scripts for deduplicating records into clusters and for record linkage of two datasets with similar records, using the and programs, respectively.
1.) Install Python 3.8.10 for the dedupe app. Other versions have not been tested, and the most recent versions will likely not work. To install Python 3.8.10 specifically, see here:
2.) Download and install the dedupe app dependencies: click the requirements.txt file name, download it, and install the requirements either globally or in a venv used only for this project. For example, open the integrated terminal in Visual Studio Code (or your IDE) or a command prompt terminal, then do the following:
- After installing Python, set up the Python environment by opening a terminal (e.g., command prompt), navigating to the project folder that contains requirements.txt, creating and activating a venv, and installing the requirements:
cd path_to_your_project
python -m venv dedupeapp_venv
dedupeapp_venv\Scripts\activate
pip install -r requirements.txt
You should now be able to run the Python scripts from this environment.
3.) If using Visual Studio Code (highly recommended), install the IDE.
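As a quick sanity check after setup, a short script along these lines can confirm that the interpreter matches the tested version and that a virtual environment is active. This is a minimal sketch and not part of the original scripts: the function name `check_environment` is hypothetical, and the version tuple simply mirrors the 3.8.10 requirement in step 1.

```python
import sys

# Version named in step 1; other versions are untested.
TESTED_VERSION = (3, 8, 10)


def check_environment(version_info=None, prefix=None, base_prefix=None):
    """Return a list of warning strings; an empty list means no issue was found.

    The arguments default to the running interpreter's values and exist
    only so the check can be exercised with other inputs.
    """
    version = tuple(sys.version_info if version_info is None else version_info)[:3]
    prefix = sys.prefix if prefix is None else prefix
    base_prefix = sys.base_prefix if base_prefix is None else base_prefix

    warnings = []
    if version != TESTED_VERSION:
        warnings.append(
            "Python %d.%d.%d found; only 3.8.10 has been tested." % version
        )
    # Inside a venv, sys.prefix points at the venv and differs from sys.base_prefix.
    if prefix == base_prefix:
        warnings.append("No virtual environment appears to be active.")
    return warnings


if __name__ == "__main__":
    for message in check_environment():
        print("WARNING:", message)
```

Running the script from the activated venv should print nothing if both checks pass.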
Protocol for intake of new accounts, deduplication against historical accounts, new account ID assignment and fuzzy matching new accounts to a master dataset
Note: each script is prefixed with 'P0(value)', where the number gives the order in which the scripts should be executed (i.e., P01 is first).
- P01 (
- P02 (
- P03 (
- P04 ( STOP! 🛑 👀
- (Optional) A final script (P04b) performs an additional fuzzy matching pass over the P04 results, focusing on the borderline matches (potential mismatches). It requires manual analysis of the P04 output to generate the input for P04b (
- a) Use script to monitor the incoming files on your local machine. Modify it accordingly to receive email updates as well.
- b) OR use subprocess to run the next script (you must import subprocess), e.g.,
# Your P04 script would have already executed above this line
# Then subprocess is used to trigger the follow-up script for analysis of threshold fuzzy matches
print("Triggering the final script,")
python_path = r"C:/Users/beste/envs/dedupe-examples/Scripts/python.exe"
script_path = r"C:\Users\beste\OneDrive - Qral Group\Desktop\python\FuzzyMatch\P04_Account fuzzy matching scripts\[python_path, script_path])
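Options (a) and (b) above can be combined in a small standard-library sketch: poll for the manually prepared input file, then launch the follow-up script with `subprocess.run`. This is an illustration under assumptions, not the repository's actual code. The helper names `wait_for_file` and `run_followup` and the file names in the usage comment are hypothetical, and `sys.executable` is used instead of a hard-coded interpreter path so that the follow-up script runs in the same environment as the caller.

```python
import subprocess
import sys
import time
from pathlib import Path


def wait_for_file(path, timeout=60.0, poll_interval=1.0):
    """Poll until `path` exists; return True if it appeared within `timeout` seconds."""
    deadline = time.monotonic() + timeout
    while True:
        if Path(path).exists():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_interval)


def run_followup(script_path, python_path=sys.executable):
    """Run `script_path` with the given interpreter and return its exit code."""
    completed = subprocess.run([python_path, str(script_path)])
    return completed.returncode


# Example usage (hypothetical file names; substitute the real P04b input and script):
#
#     if wait_for_file("p04b_input.csv", timeout=3600, poll_interval=30):
#         print("Triggering the final script")
#         exit_code = run_followup("P04b_followup.py")
#     else:
#         print("Input file not found; follow-up script was not run.")
```

Returning the exit code rather than raising leaves it to the calling script to decide how to handle a failed follow-up run.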