All notable changes to this project will be documented in this file.
The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.
SQLGlot produces SQL in the wrong dialect in some instances. See issue 269.
Versions of sqlglot before 1.18.0 could not parse the Spark SQL left and right functions - see here.
As a result, if left or right is used in the user's custom SQL case expressions, Splink will produce an error.
This release bumps the version in the requirement to fix this problem.
- sql_expr now added to tooltips on bayes factor chart, displaying the SQL expression for each comparison level
- Warnings to the user if they don't include a null level in their case expression, or if custom columns is different to the cols used in the case expression
- splink now parses case_expression to auto-populate num_levels or col_name or custom_columns_used. The user may still provide this information, but is no longer required to.
Note that Splink now has a dependency on sqlglot, a no-dependency SQL parser.
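To illustrate what inferring num_levels and col_name from a case expression involves, here is a minimal sketch. It is not Splink's implementation: Splink relies on sqlglot, whereas this stdlib-only version swaps in regular expressions. It assumes that gamma values are integer literals, that -1 marks the null level, and that the output column is aliased as gamma_<col_name>. The function name summarise_case_expression is hypothetical.

```python
import re


def summarise_case_expression(case_expression: str) -> dict:
    """Infer num_levels and col_name from a CASE expression (regex sketch)."""
    # Each THEN/ELSE branch is assumed to assign an integer gamma value.
    values = [
        int(v)
        for v in re.findall(
            r"\b(?:then|else)\s+(-?\d+)", case_expression, flags=re.IGNORECASE
        )
    ]
    # -1 is assumed to be the null level, so it is not counted as a level.
    levels = {v for v in values if v >= 0}
    # The alias is assumed to follow the pattern gamma_<col_name>.
    alias = re.search(
        r"\bend\s+as\s+gamma_(\w+)", case_expression, flags=re.IGNORECASE
    )
    return {
        "num_levels": len(levels),
        "col_name": alias.group(1) if alias else None,
        "has_null_level": -1 in values,
    }


expr = """
case
    when first_name_l is null or first_name_r is null then -1
    when first_name_l = first_name_r then 2
    when substr(first_name_l, 1, 3) = substr(first_name_r, 1, 3) then 1
    else 0
end as gamma_first_name
"""
summary = summarise_case_expression(expr)
print(summary)
```

A real SQL parser is more robust than this pattern matching, for example when a case expression contains nested expressions or string literals that happen to include the word "then".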
- Add function to analyse blocking rules (from splink.analyse_blocking_rule import analyse_blocking_rule) in moj-analytical-services#260
Full Changelog: https://github.com/moj-analytical-services/splink/compare/v2.0.3...v2.0.4
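As a rough illustration of one thing an analysis of a blocking rule can report, the sketch below counts the pairwise comparisons an equality-based rule would generate when deduplicating a single dataset. It does not reproduce the signature or output of analyse_blocking_rule, which this changelog does not describe. The helper count_comparisons and its blocking_key argument are hypothetical, and plain Python lists stand in for Spark DataFrames.

```python
from collections import Counter


def count_comparisons(records, blocking_key):
    """Count the record pairs an equality blocking rule would generate.

    blocking_key (hypothetical) maps a record to the value that must be equal
    in both records, e.g. the surname for a rule like l.surname = r.surname.
    """
    # Nulls never satisfy an SQL equality condition, so they form no block.
    block_sizes = Counter(
        blocking_key(r) for r in records if blocking_key(r) is not None
    )
    # A block of n records yields n * (n - 1) / 2 distinct pairs.
    return sum(n * (n - 1) // 2 for n in block_sizes.values())


records = [
    {"id": 1, "surname": "smith"},
    {"id": 2, "surname": "smith"},
    {"id": 3, "surname": "smith"},
    {"id": 4, "surname": "jones"},
    {"id": 5, "surname": "jones"},
    {"id": 6, "surname": None},
]
n_pairs = count_comparisons(records, lambda r: r["surname"])
print(n_pairs)
```

A count like this helps spot blocking rules that are too loose: a single very common value can dominate the total because the number of pairs grows quadratically with block size.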
- Gamma distribution chart by @RobinL in moj-analytical-services#246
Full Changelog: https://github.com/moj-analytical-services/splink/compare/v2.0.2...v2.0.3
- Add function to compute m values from labelled data by @RobinL in moj-analytical-services#248
- Add function that outputs the full path to the similarity jar by @RobinL in moj-analytical-services#237
- Allow match weight to be used in the diagnostic histogram by @RobinL in moj-analytical-services#239
- Term frequency adjustments are now calculated directly from a term frequency lookup table, making them more accurate
- Term frequency adjustments are now part of the iterative EM estimation step, improving convergence
- All internal calculations are changed to use bayes factors (match weights) rather than probabilities to make the maths simpler
- Splink now outputs match_weight, the log2(Bayes Factor) of the match score.
- New splink.charts.save_offline_chart function that produces charts that work in airgapped environments with no internet connection
- New splink.cluster.clusters_at_thresholds function that clusters at one or more match thresholds
- The splink.truth.roc_chart function now allows several ROC curves to be plotted on a single chart, to compare the accuracy of different models
- Splink now includes a slower Python implementation of jaro_winkler, in case users are having trouble with the string similarity jar
- Since term frequency adjustments are no longer an ex-post step, they do not need to be calculated separately. Splink therefore no longer outputs tf_adjusted_match_prob. Instead, TF adjustments are included within match_probability
- m and u probabilities are now reset to None rather than 0 in EM iteration when they cannot be estimated
- Now use _repr_pretty_ rather than __repr__ so that objects display nicely in Jupyter Lab; __repr__ had been interfering with the interpretation of stack trace errors
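The relationship between Bayes factors, match_weight and match_probability mentioned in these notes can be sketched as follows. This is a worked illustration, not Splink's internal code: the function names are hypothetical, and it assumes the overall Bayes factor already incorporates the prior, so that it acts as the posterior odds of a match.

```python
import math


def bayes_factor_to_match_weight(bayes_factor: float) -> float:
    """Match weight is the base-2 logarithm of the Bayes factor."""
    return math.log2(bayes_factor)


def match_weight_to_probability(match_weight: float) -> float:
    """Convert a match weight back to a probability via the odds."""
    odds = 2 ** match_weight
    return odds / (1 + odds)


# Bayes factors multiply, so the corresponding match weights add.
prior_odds = 0.5
comparison_bayes_factors = [4.0, 8.0, 0.5]

overall_bayes_factor = prior_odds * math.prod(comparison_bayes_factors)
overall_weight = bayes_factor_to_match_weight(overall_bayes_factor)
summed_weight = math.log2(prior_odds) + sum(
    math.log2(bf) for bf in comparison_bayes_factors
)

print(overall_bayes_factor, overall_weight, summed_weight)
print(match_weight_to_probability(overall_weight))
```

Working on the log scale turns the product of per-comparison evidence into a sum, which is one reason Bayes factors are simpler to reason about than probabilities.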
- Bug whereby Splink lowercased case expressions, see here
- Improve estimate comparison charts, including tooltips and better labels
- Added mousewheel zoom to bayes factor chart
- Added mousewheel zoom to splink score histogram
- Update estimate comparison chart to use different shapes for different estimates, making it possible to distinguish overlapping symbols
- m and u history charts now display barchart correctly
- Charts now feature improved tooltips, and have a cleaner appearance. Many are now zoomable
- Charts now display better in Jupyter Lab, especially the html file produced by all_charts_write_html_file()
- m and u probabilities charts can now be produced from Settings objects
- The user can now combine settings objects using ModelCombiner from splink.combine_models
A number of backwards incompatible changes have been made for Splink 1.0.
- The main Splink API is different. Instead of Splink(...,df=df) for dedupe and Splink(...,df_l=df_l,df_r=df_r) for linking, the user provides an argument df_or_dfs, which is either a single DataFrame or a list of DataFrames. This allows linking n>2 datasets.
- When linking multiple dataframes, the user must now include a source_dataset column (default name source_dataset, configurable via source_dataset_column_name in the settings dict)
- The Params class is now called Model in the model.py module.
- The on-disk (json) format of the Model object has changed and is incompatible with Params
- The new Model class now uses the same representation for parameters as the Settings object, reducing duplicate code. Internal functions now have settings or model as function arguments, never both.
- Vega lite chart definitions now stored in json files in splink/files/chart_defs
- All case statement generation functions are now consistently named, with all names starting sql_gen_case_stmt_
- Fixed case_statements.sql_gen_case_smnt_strict_equality_2 which previously behaved differently to all other case functions
- All case statements now have a default threshold of exact equality on their top gamma level
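To make "exact equality on the top gamma level" concrete, the sketch below builds a case statement in that shape. It is an illustration only: sketch_case_stmt_levenshtein is a hypothetical helper, not one of Splink's sql_gen_case_stmt_ functions, and the column suffixes _l and _r, the -1 null level and the gamma_ alias are assumed conventions.

```python
def sketch_case_stmt_levenshtein(col_name, thresholds):
    """Build a CASE statement whose top gamma level is exact equality.

    thresholds are maximum Levenshtein distances for the intermediate
    levels, listed from strictest to loosest.
    """
    top_level = len(thresholds) + 1
    lines = [
        "case",
        f"when {col_name}_l is null or {col_name}_r is null then -1",
        # The top gamma level requires the two values to be identical.
        f"when {col_name}_l = {col_name}_r then {top_level}",
    ]
    # Each looser threshold maps to the next gamma level down.
    for offset, distance in enumerate(thresholds, start=1):
        lines.append(
            f"when levenshtein({col_name}_l, {col_name}_r) <= {distance} "
            f"then {top_level - offset}"
        )
    lines.append("else 0")
    lines.append(f"end as gamma_{col_name}")
    return "\n".join(lines)


stmt = sketch_case_stmt_levenshtein("surname", [1, 2])
print(stmt)
```

Putting exact equality first means identical values always land in the highest level, whatever thresholds are chosen for the fuzzy levels below it.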