Combined Datasets #3

thorwhalen · 2023-12-20T09:54:29Z

thorwhalen
Dec 20, 2023
Maintainer

Data analysis frequently necessitates the manipulation of datasets scattered across various tables, such as those found in pandas DataFrames or SQL databases. The integration and examination of this distributed data mandate the execution of table joins via shared fields. Although this operation is not inherently difficult, it can become cumbersome when the datasets are extensive or when numerous tables are involved.

In an ideal scenario, analysts could simply identify the set of tables and outline the desired data, subsequently receiving the tables that match their criteria. With advancements in Large Language Models (LLMs), it is becoming feasible to pinpoint query-relevant tables using column names and their descriptions. The doodad package is being developed to facilitate such capabilities, as evidenced by discussions on topics like Make it easier to find and use field names and objects and Get jargon definitions.

Given a collection of tables that collectively contain the answer to a query, the challenge lies in generating a "combined dataset" that conveys only the data pertinent to the query.

To address this challenge, a function like the one below would be useful:

def get_table_join(tables, fields):
    """
    Generate a table comprising the requested `fields`, derived by joining the necessary tables from `tables`.
    """

thorwhalen · 2023-12-20T10:43:14Z

thorwhalen
Dec 20, 2023
Maintainer Author

Let's illustrate with the following data:

import pandas as pd

tables = {
    "A": pd.DataFrame({'b': [1, 2, 3, 33], 'c': [4, 5, 6, 66]}),
    "B": pd.DataFrame({'b': [1, 2, 3], 'a': [4, 5, 6], 'd': [7, 8, 9], 'e': [10, 11, 12], 'f': [13, 14, 15]}),
    "C": pd.DataFrame({'f': [13, 14, 15], 'g': [4, 5, 6]}),
    "D": pd.DataFrame({'d': [7, 8, 77], 'e': [10, 11, 77], 'h': [7, 8, 9], 'i': [1, 2, 3]}),
    "E": pd.DataFrame({'i': [1, 2, 3], 'j': [4, 5, 6]})
}

field_sets = {table_id: set(df.columns) for table_id, df in tables.items()}
assert field_sets == {
    "A": {'b', 'c'},
    "B": {'b', 'a', 'd', 'e', 'f'},
    "C": {'f', 'g'},
    "D": {'d', 'e', 'h', 'i'},
    "E": {'i', 'j'}
}

Here, assuming all kinds of stuff about the data (which we will not get into now), our get_table_join would look something like this:

def get_table_join(tables, fields):
    """
    Get table with requested `fields`, computed by joining relevant tables of `tables`.
    """
    resolution_sequence = join_resolution(tables, fields)
    return compute_join_resolution(resolution_sequence, tables)

# where...

def join_resolution(field_sets: dict, fields_to_cover: Iterable) -> list:
    """
    Returns the list of join operations that, when carried out, cover the given fields.
    
    :param field_sets: A mapping of table names to sets of their fields.
    :param fields: The fields to cover.
    """

def compute_join_resolution(
        resolution_sequence: Iterable, tables: Mapping[str, pd.DataFrame]
    ) -> pd.DataFrame:
    """
    Carries `resolution_sequence` join operations out with tables taken from `tables`.
    
    :param resolution_sequence: An iterable of join operations to carry out. 
        Each join operation is either a table name (str) or a JoinWith object.
        If it's a JoinWith object, it's assumed that the table has already been joined
        and the fields to remove are in the `remove` attribute of the object.
    :param tables: A mapping of table names to tables (pd.DataFrame)
    """

The tests would be:

from typing import Callable, Iterable, Mapping
from dataclasses import dataclass

@dataclass
class JoinWith:
    table_key: str
    remove: list = None

fields_to_cover = ['b', 'g', 'j']
expected_join_resolution = [
    'B',
    JoinWith('C', remove=['a', 'f']),
    JoinWith('D', remove=['d', 'e', 'h']),
    JoinWith('E', remove=['i'])
]
expected_result = pd.DataFrame({
    'b': [1, 2],
    'g': [4, 5],
    'j': [4, 5]
})

from typing import Callable, Iterable, Mapping
from dataclasses import dataclass

@dataclass
class JoinWith:
    table_key: str
    remove: list = None

fields_to_cover = ['b', 'g', 'j']
expected_join_resolution = [
    'B',
    JoinWith('C', remove=['a', 'f']),
    JoinWith('D', remove=['d', 'e', 'h']),
    JoinWith('E', remove=['i'])
]
expected_result = pd.DataFrame({
    'b': [1, 2],
    'g': [4, 5],
    'j': [4, 5]
})

def test_join_resolution(
        join_resolution: Callable,
        *,
        field_sets: dict, 
        fields_to_cover: Iterable,
        expected_join_resolution: list,
    ):
    assert join_resolution(field_sets, fields_to_cover) == expected_join_resolution

def test_compute_join_resolution(
    compute_join_resolution: Callable,   
    *,
    resolution_sequence: Iterable,
    tables: Mapping[str, pd.DataFrame],
    expected_result: pd.DataFrame, 
):
    resolution_sequence = expected_join_resolution
    result = compute_join_resolution(resolution_sequence, tables)
    assert tables_are_equal(result, expected_result)

2 replies

thorwhalen Dec 20, 2023
Maintainer Author

This passes the test:

def ensure_join_op(obj):
    if not isinstance(obj, JoinWith):
        return JoinWith(obj)
    return obj
        

def compute_join_resolution(
        resolution_sequence: Iterable, tables: Mapping[str, pd.DataFrame]
    ) -> pd.DataFrame:
    """
    Carries `resolution_sequence` join operations out with tables taken from `tables`.
    
    :param resolution_sequence: An iterable of join operations to carry out. 
        Each join operation is either a table name (str) or a JoinWith object.
        If it's a JoinWith object, it's assumed that the table has already been joined
        and the fields to remove are in the `remove` attribute of the object.
    :param tables: A mapping of table names to tables (pd.DataFrame)
    """
    join_ops = map(ensure_join_op, resolution_sequence)
    table_key = next(join_ops).table_key
    joined = tables[table_key]
    for join_op in join_ops:
        table = tables[join_op.table_key]
        joined = joined.merge(table, how='inner')
        if join_op.remove:
            remove_cols = set(join_op.remove) & set(joined.columns)
            joined = joined.drop(columns=remove_cols)
    return joined

test_compute_join_resolution(compute_join_resolution)

thorwhalen Dec 20, 2023
Maintainer Author

I made the intersection_graph to help out with the development of the join_resolution function.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Combined Datasets #3

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Replies: 1 comment 2 replies

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Select a reply

Combined Datasets #3

thorwhalen Dec 20, 2023 Maintainer

Replies: 1 comment · 2 replies

thorwhalen Dec 20, 2023 Maintainer Author

thorwhalen Dec 20, 2023 Maintainer Author

thorwhalen Dec 20, 2023 Maintainer Author

thorwhalen
Dec 20, 2023
Maintainer

Replies: 1 comment 2 replies

thorwhalen
Dec 20, 2023
Maintainer Author

thorwhalen Dec 20, 2023
Maintainer Author

thorwhalen Dec 20, 2023
Maintainer Author