
One to one clustering #2578

Merged · 17 commits · Feb 11, 2025

Conversation

aymonwuolanne (Contributor)

Type of PR

  • BUG
  • FEAT
  • MAINT
  • DOC

Is your Pull Request linked to an existing Issue or Pull Request?

Related to #2562 and #251.

Give a brief description for the solution you have provided

There is some discussion of the method in #2562. This PR implements an alternative to cluster_pairwise_predictions_at_threshold which is called cluster_using_single_best_links (more exciting name suggestions welcome).

The goal of this clustering method is to produce clusters in which, for each dataset in the source_datasets list, at most one record from that dataset appears in any given cluster. To do this, at each iteration it only accepts a link if it is the single best link for both the left id and the right id, and if accepting that link will not create any duplicates.

To deal with ties (e.g. where A1 links to B1 and to B2 with the same match probability), this implementation uses row_number rather than rank, which arbitrarily (note: not randomly) picks one of the edges tied for first. A good extension would be to implement some other options for dealing with ties, but I think in most Splink applications there are very few ties, especially when term frequency adjustments are being used.
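As a rough illustration of why row_number matters here, this plain-Python sketch (hypothetical, not the PR's SQL implementation) orders candidate links by probability and then by a deterministic tie-break key, so exactly one of the tied edges is kept per left id, the way row_number = 1 would; rank would keep both tied edges.

```python
# Hypothetical sketch of row_number-style tie-breaking for best links.
# Each link: (left_id, right_id, match_probability).
links = [
    ("A1", "B1", 0.9),
    ("A1", "B2", 0.9),  # tied with A1 -> B1
    ("A2", "B3", 0.8),
]

def best_link_per_left_id(links):
    # Sort by probability descending, then by right_id as a deterministic
    # tie-breaker (analogous to row_number over an explicit ORDER BY).
    ordered = sorted(links, key=lambda l: (-l[2], l[1]))
    best = {}
    for left, right, prob in ordered:
        # Keep only the first-ranked edge per left_id (row_number = 1);
        # rank would have kept both tied edges.
        best.setdefault(left, (right, prob))
    return best

print(best_link_per_left_id(links))
# A1 keeps only one of its two tied edges (B1, chosen deterministically).
```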

PR Checklist

  • Added documentation for changes
  • Added feature to example notebooks or tutorial (if appropriate)
  • Added tests (if appropriate)
  • Updated CHANGELOG.md (if appropriate)
  • Made changes based off the latest version of Splink
  • Run the linter
  • Run the spellchecker (if appropriate)

@RobinL (Member) commented Jan 15, 2025

Just to say sorry for the delay on this. I'm starting to look at this now!

I've got as far as reading the messages and looking at the tests. I'm mainly uploading the following to help me remember if I look back at this in future:

So in your example we have:

[image]

And the solution is:

[image]

I was first confused that in the solution we didn't have 6a -> 7b, but then I read your message:

I like the simplicity of only accepting the best links and not moving on to the next best ones if the best option is already "taken". I think this corresponds to the Unique Mapping Clustering in the paper I mentioned.

So, am I right in saying the logic goes something like:

The best link from 6a is to 5c at 0.8, hence we ignore 6a to 7b. But 6a to 5c is eliminated by the stronger 0.85 link from 3a to 5c. And hence we end up with 6a and 7b separate.

Or, as Andy on the team suggested:

First the picking of the best links, and then forcing the dataset uniqueness.
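That two-step logic can be sketched in plain Python. This is a hypothetical illustration, not the PR's implementation; the 0.85 and 0.80 probabilities come from the example above, while the 6a-7b value (0.70) is an assumed stand-in for "lower than 6a's best link".

```python
# Hypothetical sketch of "pick best links, then force dataset
# uniqueness". 0.85 and 0.80 are from the example discussed above;
# 0.70 for 6a-7b is an assumed value.
links = [
    ("3a", "5c", 0.85),
    ("6a", "5c", 0.80),
    ("6a", "7b", 0.70),  # assumed value, below 6a's best link
]
node_datasets = {"3a": "a", "6a": "a", "5c": "c", "7b": "b"}

def cluster_single_best_links_sketch(links, node_datasets):
    # Every node starts in its own cluster.
    clusters = {n: {n} for n in node_datasets}

    def best_for(node):
        # The single best link touching this node.
        candidates = [l for l in links if node in (l[0], l[1])]
        return max(candidates, key=lambda l: l[2]) if candidates else None

    for edge in sorted(links, key=lambda l: -l[2]):
        left, right, _ = edge
        # Accept only if this edge is the best link for BOTH ends...
        if best_for(left) != edge or best_for(right) != edge:
            continue
        # ...and only if merging creates no within-dataset duplicate.
        merged = clusters[left] | clusters[right]
        datasets = [node_datasets[n] for n in merged]
        if len(datasets) != len(set(datasets)):
            continue
        for n in merged:
            clusters[n] = merged
    return clusters

result = cluster_single_best_links_sketch(links, node_datasets)
# 3a-5c is accepted; 6a's best link (to 5c) is beaten by 3a's, and
# 6a-7b is not 6a's best link, so 6a and 7b each stay alone.
```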

I like how you've named the function (cluster_using_single_best_links), which I think reflects this well.

@aymonwuolanne (Contributor, Author)

Thanks @RobinL, hope you had a well-deserved break!

And yes, that is the expected behaviour. The best link for 6a is to 5c, but it cannot link to that cluster because of the duplicate criterion, so it remains unlinked. Andy's comment sounds exactly right.

In my opinion this approach made the most sense, but a very easy change to the code would let 6a link to the next best record after all the edges violating the duplicate criteria are removed (in other words, first forcing dataset uniqueness, then picking best links).

@RobinL (Member) commented Jan 22, 2025

I've stepped through the code this morning and it looks good - particularly nice to see how you've kept it stylistically consistent with the other clustering code, and remembered to handle edge cases like absence of match_probability - thanks.

I think we'll be able to merge a version of this, but before we do I want to dig a bit further into the 6a -> 7b issue.

Having spoken with the team, I think we have a reservation around the 6a -> 7b link. It comes down to the initial surprise that the link isn't made, and then the difficulty of explaining convincingly at a 'high level' why, in a linkage context, we would drop it (i.e. what the argument is for concluding it's invalid).

The best I can come up with is:

  1. We have prior knowledge of the 'one to one' condition
  2. From this knowledge, we conclude 6a to 5c is a false positive
  3. If 6a to 5c is a false positive, and 6a to 7b has a lower score, it must be a false positive too (despite it scoring above the threshold)

But I think we're not convinced by (3). We're not convinced that the information that 6a to 5c is a FP somehow 'correlates' or otherwise tells us something is wrong with 6a to 7b. Are we saying that we think the 'problem' or 'edge case' that causes the model to incorrectly score 6a to 5c also is likely to apply to 6a to 7b? i.e. there's some sort of shared error mechanism.

Do you think we're missing something here? Or am I thinking along the wrong lines entirely? If the code change is relatively straightforward, one option could be to allow the user to choose which version they want.

@RobinL (Member) commented Jan 22, 2025

I've also been experimenting with some additional examples. To this end I've built an interactive example builder. Drag between nodes to create links; hover and drag up and down on links to change the probability. Click nodes to change the source dataset (represented by colour). Click to create new nodes.

You can then paste the results into this to run the clustering and see the results (open vega_output.html after running)

import json

import duckdb
import pandas as pd

from splink import DuckDBAPI, Linker, SettingsCreator

# --------------------------------------------------------------------------------------
# PASTE DATA HERE
graph_data = {
    "nodes": [
        {"unique_id": "1", "source_dataset": "a"},
        {"unique_id": "2", "source_dataset": "b"},
        {"unique_id": "3", "source_dataset": "a"},
        {"unique_id": "4", "source_dataset": "b"},
        {"unique_id": "5", "source_dataset": "a"},
        {"unique_id": "6", "source_dataset": "b"},
        {"unique_id": "7", "source_dataset": "d"},
    ],
    "links": [
        {
            "unique_id_l": "1",
            "source_dataset_l": "a",
            "unique_id_r": "2",
            "source_dataset_r": "b",
            "match_probability": 0.92,
        },
        {
            "unique_id_l": "2",
            "source_dataset_l": "b",
            "unique_id_r": "3",
            "source_dataset_r": "a",
            "match_probability": 0.9,
        },
        {
            "unique_id_l": "3",
            "source_dataset_l": "a",
            "unique_id_r": "4",
            "source_dataset_r": "b",
            "match_probability": 0.99,
        },
        {
            "unique_id_l": "4",
            "source_dataset_l": "b",
            "unique_id_r": "5",
            "source_dataset_r": "a",
            "match_probability": 0.9,
        },
        {
            "unique_id_l": "5",
            "source_dataset_l": "a",
            "unique_id_r": "6",
            "source_dataset_r": "b",
            "match_probability": 0.9,
        },
        {
            "unique_id_l": "6",
            "source_dataset_l": "b",
            "unique_id_r": "1",
            "source_dataset_r": "a",
            "match_probability": 0.96,
        },
        {
            "unique_id_l": "4",
            "source_dataset_l": "b",
            "unique_id_r": "7",
            "source_dataset_r": "d",
            "match_probability": 0.9,
        },
    ],
}

nodes_df = pd.DataFrame(graph_data["nodes"])
links_df = pd.DataFrame(graph_data["links"])


# --------------------------------------------------------------------------------------


df = nodes_df
predictions = links_df


# Define settings
settings = SettingsCreator(
    link_type="link_only",
    comparisons=[],
    blocking_rules_to_generate_predictions=[],
)

# Initialize linker
con = duckdb.connect(":memory:")
db_api = DuckDBAPI(con)
linker = Linker(df, settings, db_api=db_api)
# linker._debug_mode = True
# Register predictions
df_predict = linker.table_management.register_table_predict(predictions, overwrite=True)

# linker._debug_mode = True
df_clusters = linker.clustering.cluster_using_single_best_links(
    df_predict, source_datasets=["a", "b", "d"], threshold_match_probability=0.5
)
df_clusters.as_duckdbpyrelation().show(max_width=1000)

# After getting df_clusters, convert to pandas and prepare data for visualization
df_clusters_pd = df_clusters.as_pandas_dataframe()

# Create nodes with cluster information
nodes_with_clusters = df_clusters_pd.merge(nodes_df, on=["unique_id", "source_dataset"])
nodes_for_vega = [
    {
        "id": row["unique_id"],
        "name": f"Node {row['unique_id']}",
        "group": row["cluster_id"],
        "source_dataset": row["source_dataset"],
    }
    for _, row in nodes_with_clusters.iterrows()
]

# Create a mapping of IDs to indices
id_to_index = {node["id"]: i for i, node in enumerate(nodes_for_vega)}

# Create original probability-based links
original_links = []
for _, row in links_df.iterrows():
    source_id = row["unique_id_l"]
    target_id = row["unique_id_r"]
    original_links.append(
        {
            "source": source_id,
            "target": target_id,
            "value": row["match_probability"],
        }
    )

# Create cluster-based links (renamed from links_for_vega)
cluster_links = []
clusters = df_clusters_pd.groupby("cluster_id")["unique_id"].apply(list)
for cluster_nodes in clusters:
    if len(cluster_nodes) > 1:
        for i in range(len(cluster_nodes)):
            for j in range(i + 1, len(cluster_nodes)):
                cluster_links.append(
                    {
                        "source": cluster_nodes[i],
                        "target": cluster_nodes[j],
                        "value": 1.0,
                    }
                )


# Create base visualization spec as a function
def create_vega_spec(nodes, links, title):
    return {
        "$schema": "https://vega.github.io/schema/vega/v5.json",
        "description": title,
        "width": 700,
        "height": 500,
        "padding": 0,
        "autosize": "none",
        "signals": [
            {"name": "static", "value": False, "bind": {"input": "checkbox"}},
            {
                "name": "fix",
                "value": False,
                "on": [
                    {
                        "events": "symbol:mouseout[!event.buttons], window:mouseup",
                        "update": "false",
                    },
                    {"events": "symbol:mouseover", "update": "fix || true"},
                    {
                        "events": "[symbol:mousedown, window:mouseup] > window:mousemove!",
                        "update": "xy()",
                        "force": True,
                    },
                ],
            },
            {
                "name": "node",
                "value": None,
                "on": [
                    {
                        "events": "symbol:mouseover",
                        "update": "fix === true ? item() : node",
                    }
                ],
            },
            {
                "name": "restart",
                "value": False,
                "on": [{"events": {"signal": "fix"}, "update": "fix && fix.length"}],
            },
            {"name": "cx", "update": "width / 2"},
            {"name": "cy", "update": "height / 2"},
            {
                "name": "nodeRadius",
                "value": 15,
                "bind": {"input": "range", "min": 1, "max": 50, "step": 1},
            },
            {
                "name": "nodeCharge",
                "value": -30,
                "bind": {"input": "range", "min": -2000, "max": 500, "step": 1},
            },
            {
                "name": "linkDistance",
                "value": 100,
                "bind": {"input": "range", "min": 5, "max": 200, "step": 1},
            },
            {
                "name": "linkStrength",
                "value": 0.5,
                "bind": {"input": "range", "min": 0.0, "max": 2.0, "step": 0.01},
            },
        ],
        "data": [
            {"name": "node-data", "values": nodes},
            {"name": "link-data", "values": links},
        ],
        "scales": [
            {
                "name": "color",
                "type": "ordinal",
                "domain": {"data": "node-data", "field": "source_dataset"},
                "range": {"scheme": "category10"},
            },
            {
                "name": "linkStrength",
                "type": "linear",
                "domain": [0, 1],
                "range": [0.5, 3],
            },
        ],
        "marks": [
            {
                "name": "links",
                "type": "path",
                "from": {"data": "link-data"},
                "interactive": False,
                "encode": {
                    "update": {
                        "stroke": {"value": "#ccc"},
                        "strokeWidth": {"scale": "linkStrength", "field": "value"},
                    }
                },
                "transform": [
                    {
                        "type": "linkpath",
                        "require": {"signal": "force"},
                        "shape": "line",
                        "sourceX": "datum.source.x",
                        "sourceY": "datum.source.y",
                        "targetX": "datum.target.x",
                        "targetY": "datum.target.y",
                    }
                ],
            },
            {
                "name": "nodes",
                "type": "symbol",
                "zindex": 1,
                "from": {"data": "node-data"},
                "on": [
                    {
                        "trigger": "fix",
                        "modify": "node",
                        "values": "fix === true ? {fx: node.x, fy: node.y} : {fx: fix[0], fy: fix[1]}",
                    }
                ],
                "encode": {
                    "enter": {
                        "fill": {"scale": "color", "field": "source_dataset"},
                        "stroke": {"value": "white"},
                    },
                    "update": {
                        "size": {"signal": "2 * nodeRadius * nodeRadius"},
                        "cursor": {"value": "pointer"},
                    },
                },
                "transform": [
                    {
                        "type": "force",
                        "iterations": 300,
                        "restart": {"signal": "restart"},
                        "static": {"signal": "static"},
                        "signal": "force",
                        "forces": [
                            {
                                "force": "center",
                                "x": {"signal": "cx"},
                                "y": {"signal": "cy"},
                            },
                            {
                                "force": "collide",
                                "radius": {"signal": "nodeRadius"},
                                "strength": 1,
                            },
                            {
                                "force": "nbody",
                                "strength": {"signal": "nodeCharge"},
                            },
                            {
                                "force": "link",
                                "links": "link-data",
                                "id": "datum.id",
                                "distance": {"signal": "linkDistance"},
                                "strength": {"signal": "linkStrength"},
                            },
                        ],
                    }
                ],
            },
            {
                "type": "text",
                "from": {"data": "link-data"},
                "interactive": False,
                "encode": {
                    "enter": {
                        "align": {"value": "center"},
                        "baseline": {"value": "middle"},
                        "fontSize": {"value": 10},
                        "fill": {"value": "#666"},
                    },
                    "update": {
                        "x": {"signal": "(datum.source.x + datum.target.x) / 2"},
                        "y": {"signal": "(datum.source.y + datum.target.y) / 2"},
                        "text": {"signal": "format(datum.value, '.2f')"},
                    },
                },
                "transform": [
                    {
                        "type": "linkpath",
                        "require": {"signal": "force"},
                        "shape": "line",
                        "sourceX": "datum.source.x",
                        "sourceY": "datum.source.y",
                        "targetX": "datum.target.x",
                        "targetY": "datum.target.y",
                    }
                ],
            },
            {
                "type": "text",
                "from": {"data": "nodes"},
                "interactive": False,
                "zindex": 2,
                "encode": {
                    "enter": {
                        "align": {"value": "center"},
                        "baseline": {"value": "middle"},
                        "fontSize": {"value": 15},
                        "fontWeight": {"value": "bold"},
                        "text": {
                            "signal": "datum.datum.source_dataset + '-' + datum.datum.id"
                        },
                    },
                    "update": {
                        "x": {"field": "x"},
                        "y": {"field": "y"},
                    },
                },
            },
        ],
    }


# Write both visualizations to HTML
with open("vega_output.html", "w") as f:
    f.write(
        """
<!DOCTYPE html>
<html>
<head>
    <script src="https://cdn.jsdelivr.net/npm/vega@5"></script>
    <script src="https://cdn.jsdelivr.net/npm/vega-lite@5"></script>
    <script src="https://cdn.jsdelivr.net/npm/vega-embed@6"></script>
    <style>
        .vis-container {
            margin: 20px;
            padding: 20px;
            border: 1px solid #ccc;
        }
        h2 {
            text-align: center;
        }
    </style>
</head>
<body>
    <div class="vis-container">
        <h2>Original Links with Probabilities</h2>
        <div id="view1"></div>
    </div>
    <div class="vis-container">
        <h2>Clustered Results</h2>
        <div id="view2"></div>
    </div>
    <script>
        const spec1 = """
        + json.dumps(create_vega_spec(nodes_for_vega, original_links, "Original Links"))
        + """;
        const spec2 = """
        + json.dumps(
            create_vega_spec(nodes_for_vega, cluster_links, "Clustered Results")
        )
        + """;
        vegaEmbed('#view1', spec1).catch(console.error);
        vegaEmbed('#view2', spec2).catch(console.error);
    </script>
</body>
</html>
"""
    )

So this input:

[image]

turns into the following (note the colours are not consistent between the two tools).

Before:

[image]

After:

[image]

@RobinL (Member) commented Jan 22, 2025

I've shot myself in the foot a couple of times experimenting with ☝️ and failing to update the source_datasets=["a", "b", "d"] bit of this:

df_clusters = linker.clustering.cluster_using_single_best_links(
    df_predict, source_datasets=["a", "b", "d"], threshold_match_probability=0.5
)

Is there a reason we can't do that automatically?

@aymonwuolanne (Contributor, Author)

But I think we're not convinced by (3). We're not convinced that the information that 6a to 5c is a FP somehow 'correlates' or otherwise tells us something is wrong with 6a to 7b. Are we saying that we think the 'problem' or 'edge case' that causes the model to incorrectly score 6a to 5c also is likely to apply to 6a to 7b? i.e. there's some sort of shared error mechanism.

This is a good point. I've asked around and most people arrive at the conclusion that 6a and 7b should be linked, and I think I agree that's more sensible. I had in mind some scenarios where the assumption of no duplicates is not 100% guaranteed, in which case it's possible that 6a really belongs to the other cluster and 7b should be left by itself. But that's a stretch: if we're assuming there are no duplicates, then we might as well trust the assumption fully.

I'm happy to change the method around to that approach (first forcing dataset uniqueness, then ranking links), I'll leave it up to you whether it should be added as an option or not. In general for users I feel like this is quite nuanced to understand so perhaps it's simpler with no option.
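One way to realise that alternative can be sketched as follows (again hypothetical, not the PR's SQL, and with the same assumed 0.70 probability for 6a-7b): visit edges strongest-first and accept any edge whose merge preserves dataset uniqueness, so a node whose best link is blocked can still fall back to its next best one.

```python
# Hypothetical sketch of the alternative ordering: enforce dataset
# uniqueness first, so a node whose best link is blocked can fall
# back to its next best remaining link. 0.70 for 6a-7b is assumed.
links = [
    ("3a", "5c", 0.85),
    ("6a", "5c", 0.80),
    ("6a", "7b", 0.70),  # assumed value
]
node_datasets = {"3a": "a", "6a": "a", "5c": "c", "7b": "b"}

def cluster_with_fallback(links, node_datasets):
    clusters = {n: {n} for n in node_datasets}
    # Visit edges strongest-first; accept any edge whose merge keeps
    # at most one record per dataset in the combined cluster.
    for left, right, prob in sorted(links, key=lambda l: -l[2]):
        merged = clusters[left] | clusters[right]
        datasets = [node_datasets[n] for n in merged]
        if len(datasets) != len(set(datasets)):
            continue  # 6a-5c is rejected here: 3a already took the "a" slot
        for n in merged:
            clusters[n] = merged
    return clusters

fallback_result = cluster_with_fallback(links, node_datasets)
# Under this variant 6a falls back to its next best link, so 6a and
# 7b end up clustered together.
```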

Is there a reason we can't do that automatically?

Yes, actually! If you have three datasets a, b, c, and you want to assume that a and b have no duplicates, then you can use source_datasets = ["a", "b"]. Maybe that needs to be more clear from the variable name?
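A minimal sketch of the duplicate check this implies (a hypothetical helper, not Splink's API): uniqueness is only enforced for the datasets that are listed, so records from unlisted datasets can co-occur within a cluster.

```python
# Hypothetical helper: a cluster only violates the one-to-one rule if
# it repeats a dataset that we asked to be enforced as duplicate-free.
def violates_uniqueness(cluster_datasets, enforce_for):
    counted = [d for d in cluster_datasets if d in enforce_for]
    return len(counted) != len(set(counted))

# Two "c" records in one cluster is fine when only a and b are listed:
assert not violates_uniqueness(["a", "b", "c", "c"], enforce_for=["a", "b"])
# But two "a" records is a violation:
assert violates_uniqueness(["a", "a", "b"], enforce_for=["a", "b"])
```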

@RobinL (Member) commented Jan 23, 2025

OK, sounds good. Yes please, that'd be great if you could change the method accordingly.

Also - agreed, it's simpler with no option, so if you're happy with the new approach, let's just do that.

Yes, actually! If you have three datasets a, b, c, and you want to assume that a and b have no duplicates, then you can use source_datasets = ["a", "b"]. Maybe that needs to be more clear from the variable name?

Aha - I understand now. Yeah, I think the arg name could be clearer, though I can't think of a good name off the top of my head.

@RobinL (Member) commented Jan 31, 2025

Example 2 (cycle):
Input:
[image]
Solution:
[image]

Example 3 (transitivity):
Input:
[image]
Solution:
[image]

Script I used to create the vis:
import json

import duckdb
import pandas as pd

from splink import DuckDBAPI, Linker, SettingsCreator

# --------------------------------------------------------------------------------------

# Load the graph data
graph_data = json.loads("""
{
  "nodes": [
    {
      "unique_id": "1",
      "source_dataset": "a"
    },
    {
      "unique_id": "2",
      "source_dataset": "c"
    },
    {
      "unique_id": "3",
      "source_dataset": "b"
    },
    {
      "unique_id": "4",
      "source_dataset": "a"
    },
    {
      "unique_id": "5",
      "source_dataset": "b"
    },
    {
      "unique_id": "6",
      "source_dataset": "c"
    },
    {
      "unique_id": "7",
      "source_dataset": "a"
    }
  ],
  "links": [
    {
      "unique_id_l": "1",
      "source_dataset_l": "a",
      "unique_id_r": "2",
      "source_dataset_r": "c",
      "match_probability": 0.98
    },
    {
      "unique_id_l": "2",
      "source_dataset_l": "c",
      "unique_id_r": "3",
      "source_dataset_r": "b",
      "match_probability": 0.9
    },
    {
      "unique_id_l": "3",
      "source_dataset_l": "b",
      "unique_id_r": "4",
      "source_dataset_r": "a",
      "match_probability": 0.8
    },
    {
      "unique_id_l": "4",
      "source_dataset_l": "a",
      "unique_id_r": "5",
      "source_dataset_r": "b",
      "match_probability": 0.81
    },
    {
      "unique_id_l": "5",
      "source_dataset_l": "b",
      "unique_id_r": "6",
      "source_dataset_r": "c",
      "match_probability": 0.91
    },
    {
      "unique_id_l": "6",
      "source_dataset_l": "c",
      "unique_id_r": "7",
      "source_dataset_r": "a",
      "match_probability": 0.99
    }
  ]
}
""")

nodes_df = pd.DataFrame(graph_data["nodes"])
links_df = pd.DataFrame(graph_data["links"])


# --------------------------------------------------------------------------------------


df = nodes_df
predictions = links_df


# Define settings
settings = SettingsCreator(
    link_type="link_only",
    comparisons=[],
    blocking_rules_to_generate_predictions=[],
)

# Initialize linker
con = duckdb.connect(":memory:")
db_api = DuckDBAPI(con)
linker = Linker(df, settings, db_api=db_api)
# linker._debug_mode = True
# Register predictions
df_predict = linker.table_management.register_table_predict(predictions, overwrite=True)

linker._debug_mode = True
df_clusters = linker.clustering.cluster_using_single_best_links(
    df_predict, duplicate_free_datasets=["a", "b", "d"], threshold_match_probability=0.5
)
df_clusters.as_duckdbpyrelation().show(max_width=1000)

# After getting df_clusters, convert to pandas and prepare data for visualization
df_clusters_pd = df_clusters.as_pandas_dataframe()

# Create nodes with cluster information
nodes_with_clusters = df_clusters_pd.merge(nodes_df, on=["unique_id", "source_dataset"])
nodes_for_vega = [
    {
        "id": row["unique_id"],
        "name": f"Node {row['unique_id']}",
        "group": row["cluster_id"],
        "source_dataset": row["source_dataset"],
    }
    for _, row in nodes_with_clusters.iterrows()
]

# Create a mapping of IDs to indices
id_to_index = {node["id"]: i for i, node in enumerate(nodes_for_vega)}

# Create original probability-based links
original_links = []
for _, row in links_df.iterrows():
    source_id = row["unique_id_l"]
    target_id = row["unique_id_r"]
    original_links.append(
        {
            "source": source_id,
            "target": target_id,
            "value": row["match_probability"],
        }
    )

# Create cluster-based links (renamed from links_for_vega)
cluster_links = []
clusters = df_clusters_pd.groupby("cluster_id")["unique_id"].apply(list)
for cluster_nodes in clusters:
    if len(cluster_nodes) > 1:
        for i in range(len(cluster_nodes)):
            for j in range(i + 1, len(cluster_nodes)):
                cluster_links.append(
                    {
                        "source": cluster_nodes[i],
                        "target": cluster_nodes[j],
                        "value": 1.0,
                    }
                )


# Create base visualization spec as a function
def create_vega_spec(nodes, links, title):
    return {
        "$schema": "https://vega.github.io/schema/vega/v5.json",
        "description": title,
        "width": 700,
        "height": 500,
        "padding": 0,
        "autosize": "none",
        "signals": [
            {"name": "static", "value": False, "bind": {"input": "checkbox"}},
            {
                "name": "fix",
                "value": False,
                "on": [
                    {
                        "events": "symbol:mouseout[!event.buttons], window:mouseup",
                        "update": "false",
                    },
                    {"events": "symbol:mouseover", "update": "fix || true"},
                    {
                        "events": "[symbol:mousedown, window:mouseup] > window:mousemove!",
                        "update": "xy()",
                        "force": True,
                    },
                ],
            },
            {
                "name": "node",
                "value": None,
                "on": [
                    {
                        "events": "symbol:mouseover",
                        "update": "fix === true ? item() : node",
                    }
                ],
            },
            {
                "name": "restart",
                "value": False,
                "on": [{"events": {"signal": "fix"}, "update": "fix && fix.length"}],
            },
            {"name": "cx", "update": "width / 2"},
            {"name": "cy", "update": "height / 2"},
            {
                "name": "nodeRadius",
                "value": 15,
                "bind": {"input": "range", "min": 1, "max": 50, "step": 1},
            },
            {
                "name": "nodeCharge",
                "value": -30,
                "bind": {"input": "range", "min": -2000, "max": 500, "step": 1},
            },
            {
                "name": "linkDistance",
                "value": 100,
                "bind": {"input": "range", "min": 5, "max": 200, "step": 1},
            },
            {
                "name": "linkStrength",
                "value": 0.5,
                "bind": {"input": "range", "min": 0.0, "max": 2.0, "step": 0.01},
            },
        ],
        "data": [
            {"name": "node-data", "values": nodes},
            {"name": "link-data", "values": links},
        ],
        "scales": [
            {
                "name": "color",
                "type": "ordinal",
                "domain": {"data": "node-data", "field": "source_dataset"},
                "range": {"scheme": "category10"},
            },
            {
                "name": "linkStrength",
                "type": "linear",
                "domain": [0, 1],
                "range": [0.5, 3],
            },
        ],
        "marks": [
            {
                "name": "links",
                "type": "path",
                "from": {"data": "link-data"},
                "interactive": False,
                "encode": {
                    "update": {
                        "stroke": {"value": "#ccc"},
                        "strokeWidth": {"scale": "linkStrength", "field": "value"},
                    }
                },
                "transform": [
                    {
                        "type": "linkpath",
                        "require": {"signal": "force"},
                        "shape": "line",
                        "sourceX": "datum.source.x",
                        "sourceY": "datum.source.y",
                        "targetX": "datum.target.x",
                        "targetY": "datum.target.y",
                    }
                ],
            },
            {
                "name": "nodes",
                "type": "symbol",
                "zindex": 1,
                "from": {"data": "node-data"},
                "on": [
                    {
                        "trigger": "fix",
                        "modify": "node",
                        "values": "fix === true ? {fx: node.x, fy: node.y} : {fx: fix[0], fy: fix[1]}",
                    }
                ],
                "encode": {
                    "enter": {
                        "fill": {"scale": "color", "field": "source_dataset"},
                        "stroke": {"value": "white"},
                    },
                    "update": {
                        "size": {"signal": "2 * nodeRadius * nodeRadius"},
                        "cursor": {"value": "pointer"},
                    },
                },
                "transform": [
                    {
                        "type": "force",
                        "iterations": 300,
                        "restart": {"signal": "restart"},
                        "static": {"signal": "static"},
                        "signal": "force",
                        "forces": [
                            {
                                "force": "center",
                                "x": {"signal": "cx"},
                                "y": {"signal": "cy"},
                            },
                            {
                                "force": "collide",
                                "radius": {"signal": "nodeRadius"},
                                "strength": 1,
                            },
                            {
                                "force": "nbody",
                                "strength": {"signal": "nodeCharge"},
                            },
                            {
                                "force": "link",
                                "links": "link-data",
                                "id": "datum.id",
                                "distance": {"signal": "linkDistance"},
                                "strength": {"signal": "linkStrength"},
                            },
                        ],
                    }
                ],
            },
            {
                "type": "text",
                "from": {"data": "link-data"},
                "interactive": False,
                "encode": {
                    "enter": {
                        "align": {"value": "center"},
                        "baseline": {"value": "middle"},
                        "fontSize": {"value": 10},
                        "fill": {"value": "#666"},
                    },
                    "update": {
                        "x": {"signal": "(datum.source.x + datum.target.x) / 2"},
                        "y": {"signal": "(datum.source.y + datum.target.y) / 2"},
                        "text": {"signal": "format(datum.value, '.2f')"},
                    },
                },
                "transform": [
                    {
                        "type": "linkpath",
                        "require": {"signal": "force"},
                        "shape": "line",
                        "sourceX": "datum.source.x",
                        "sourceY": "datum.source.y",
                        "targetX": "datum.target.x",
                        "targetY": "datum.target.y",
                    }
                ],
            },
            {
                "type": "text",
                "from": {"data": "nodes"},
                "interactive": False,
                "zindex": 2,
                "encode": {
                    "enter": {
                        "align": {"value": "center"},
                        "baseline": {"value": "middle"},
                        "fontSize": {"value": 15},
                        "fontWeight": {"value": "bold"},
                        "text": {
                            "signal": "datum.datum.source_dataset + '-' + datum.datum.id"
                        },
                    },
                    "update": {
                        "x": {"field": "x"},
                        "y": {"field": "y"},
                    },
                },
            },
        ],
    }


# Write both visualizations to HTML
with open("vega_output.html", "w") as f:
    f.write(
        """
<!DOCTYPE html>
<html>
<head>
    <script src="https://cdn.jsdelivr.net/npm/vega@5"></script>
    <script src="https://cdn.jsdelivr.net/npm/vega-lite@5"></script>
    <script src="https://cdn.jsdelivr.net/npm/vega-embed@6"></script>
    <style>
        .vis-container {
            margin: 20px;
            padding: 20px;
            border: 1px solid #ccc;
        }
        h2 {
            text-align: center;
        }
    </style>
</head>
<body>
    <div class="vis-container">
        <h2>Original Links with Probabilities</h2>
        <div id="view1"></div>
    </div>
    <div class="vis-container">
        <h2>Clustered Results</h2>
        <div id="view2"></div>
    </div>
    <script>
        const spec1 = """
        + json.dumps(create_vega_spec(nodes_for_vega, original_links, "Original Links"))
        + """;
        const spec2 = """
        + json.dumps(
            create_vega_spec(nodes_for_vega, cluster_links, "Clustered Results")
        )
        + """;
        vegaEmbed('#view1', spec1).catch(console.error);
        vegaEmbed('#view2', spec2).catch(console.error);
    </script>
</body>
</html>
"""
    )

@codecov-commenter

codecov-commenter commented Jan 31, 2025

⚠️ Please install the Codecov GitHub app to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

Attention: Patch coverage is 98.03922% with 2 lines in your changes missing coverage. Please review.

Please upload report for BASE (master@0fd67de). Learn more about missing BASE report.

Files with missing lines                            Patch %   Lines
splink/internals/linker_components/clustering.py    97.67%    1 Missing ⚠️
splink/internals/one_to_one_clustering.py           98.30%    1 Missing ⚠️

❗ Your organization needs to install the Codecov GitHub app to enable full functionality.

Additional details and impacted files
@@            Coverage Diff            @@
##             master    #2578   +/-   ##
=========================================
  Coverage          ?   80.76%           
=========================================
  Files             ?      106           
  Lines             ?     8974           
  Branches          ?        0           
=========================================
  Hits              ?     7248           
  Misses            ?     1726           
  Partials          ?        0           

☔ View full report in Codecov by Sentry.

@RobinL
Member

RobinL left a comment

This is looking great, I added a couple more tests. If you're happy I'll go ahead and merge

Thanks so much!

@aymonwuolanne
Contributor Author

Thanks for adding the extra tests! I am very tempted to say go ahead and merge it in, but there's one more thing we might as well think about now rather than leaving it until later. When two edges are tied, this method arbitrarily picks one using row_number rather than rank, and in some SQL engines that might even be non-deterministic (you could run the clustering twice and potentially get a different outcome).

Some options here:

  1. change nothing and use row_number to break ties (easiest)
  2. use rank when ranking edges and potentially allow a duplicate to form where there is a tie (easy)
  3. add an option to choose between row_number and rank (also pretty easy)
  4. add an option to keep ties or drop ties (this is feasible but a bit harder). This would only drop ties that would form a duplicate, e.g. a tie between A1-B2 and A1-C3 could be kept.
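To make the trade-off between options (1) and (2) concrete, here is a minimal sketch using SQLite window functions. The `edges` table and its column names are hypothetical stand-ins for illustration, not Splink's internal SQL: `row_number` always assigns a single "best" edge even on a tie, while `rank` keeps both tied edges at rank 1.

```python
# Hypothetical sketch (table/column names are stand-ins, not Splink internals).
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE edges (node_l TEXT, node_r TEXT, match_probability REAL);
    INSERT INTO edges VALUES
        ('A1', 'B1', 0.95),
        ('A1', 'B2', 0.95),  -- tied with A1-B1
        ('A2', 'B3', 0.80);
""")

rows = con.execute("""
    SELECT
        node_l,
        node_r,
        -- row_number breaks the tie arbitrarily: one edge gets 1, the other 2
        row_number() OVER (PARTITION BY node_l
                           ORDER BY match_probability DESC) AS rn,
        -- rank keeps both tied edges at 1
        rank() OVER (PARTITION BY node_l
                     ORDER BY match_probability DESC) AS rk
    FROM edges
    ORDER BY node_l, node_r
""").fetchall()

for row in rows:
    print(row)
```

Filtering on `rn = 1` keeps exactly one of A1-B1/A1-B2 (which one is engine-dependent, hence the non-determinism mentioned above), whereas filtering on `rk = 1` would keep both and allow a duplicate to form.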

@RobinL
Member

RobinL commented Feb 10, 2025

Hmm - tricky. I don't really have strong feelings, but I think I'd probably just go for (1). If it really causes problems for people we can take another look.

A fifth option is simply for Splink to report on the prevalence of this issue i.e. "WARNING: we found n ties....", though I think we'd want this to be opt-in since it would (presumably) affect performance.

I think in a real-world situation as a user, the information I'd be most interested in is the prevalence of ties.

If there were a lot, I'd probably then be looking at options for breaking these ties in a more explicit/principled manner - e.g. improving the model somehow to break them at the scoring stage.

What do you think? Note I'm very happy to go ahead and use your option (1) in the interests of getting this merged.
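The opt-in "WARNING: we found n ties" report floated above could be a single aggregate pass over the ranked edges. A hedged sketch, again with a hypothetical `edges` table and column names rather than Splink's actual internals:

```python
# Hypothetical tie-prevalence check: count left-hand nodes whose top-ranked
# edge (by match probability) is shared by more than one candidate link.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE edges (node_l TEXT, node_r TEXT, match_probability REAL);
    INSERT INTO edges VALUES
        ('A1', 'B1', 0.95),
        ('A1', 'B2', 0.95),  -- a tie at the top for A1
        ('A2', 'B3', 0.80);
""")

n_tied = con.execute("""
    SELECT count(*) FROM (
        SELECT node_l
        FROM (
            SELECT node_l,
                   rank() OVER (PARTITION BY node_l
                                ORDER BY match_probability DESC) AS rk
            FROM edges
        )
        WHERE rk = 1           -- keep only the best-ranked edges per node
        GROUP BY node_l
        HAVING count(*) > 1    -- more than one edge tied for best
    )
""").fetchone()[0]

print(f"WARNING: found {n_tied} node(s) with tied best links")
```

Since this scans and ranks every edge, making it opt-in (as suggested) seems sensible from a performance standpoint.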

@aymonwuolanne
Contributor Author

I'm happy with leaving it as it is. Feel free to merge it in!

@RobinL RobinL merged commit 44d6f5d into moj-analytical-services:master Feb 11, 2025
28 checks passed