Combined Datasets #3
thorwhalen started this conversation in General

---
Let's illustrate with the following data:

```python
import pandas as pd

tables = {
    "A": pd.DataFrame({'b': [1, 2, 3, 33], 'c': [4, 5, 6, 66]}),
    "B": pd.DataFrame({'b': [1, 2, 3], 'a': [4, 5, 6], 'd': [7, 8, 9], 'e': [10, 11, 12], 'f': [13, 14, 15]}),
    "C": pd.DataFrame({'f': [13, 14, 15], 'g': [4, 5, 6]}),
    "D": pd.DataFrame({'d': [7, 8, 77], 'e': [10, 11, 77], 'h': [7, 8, 9], 'i': [1, 2, 3]}),
    "E": pd.DataFrame({'i': [1, 2, 3], 'j': [4, 5, 6]})
}

field_sets = {table_id: set(df.columns) for table_id, df in tables.items()}

assert field_sets == {
    "A": {'b', 'c'},
    "B": {'b', 'a', 'd', 'e', 'f'},
    "C": {'f', 'g'},
    "D": {'d', 'e', 'h', 'i'},
    "E": {'i', 'j'}
}
```

Here, assuming all kinds of stuff about the data (which we will not get into now), our entry point would be a function like this:
"""
Get table with requested `fields`, computed by joining relevant tables of `tables`.
"""
resolution_sequence = join_resolution(tables, fields)
return compute_join_resolution(resolution_sequence, tables)
# where...
def join_resolution(field_sets: dict, fields_to_cover: Iterable) -> list:
"""
Returns the list of join operations that, when carried out, cover the given fields.
:param field_sets: A mapping of table names to sets of their fields.
:param fields: The fields to cover.
"""
def compute_join_resolution(
resolution_sequence: Iterable, tables: Mapping[str, pd.DataFrame]
) -> pd.DataFrame:
"""
Carries `resolution_sequence` join operations out with tables taken from `tables`.
:param resolution_sequence: An iterable of join operations to carry out.
Each join operation is either a table name (str) or a JoinWith object.
If it's a JoinWith object, it's assumed that the table has already been joined
and the fields to remove are in the `remove` attribute of the object.
:param tables: A mapping of table names to tables (pd.DataFrame)
""" The tests would be: from typing import Callable, Iterable, Mapping
from dataclasses import dataclass


@dataclass
class JoinWith:
    table_key: str
    remove: list = None


fields_to_cover = ['b', 'g', 'j']

expected_join_resolution = [
    'B',
    JoinWith('C', remove=['a', 'f']),
    JoinWith('D', remove=['d', 'e', 'h']),
    JoinWith('E', remove=['i'])
]

expected_result = pd.DataFrame({
    'b': [1, 2],
    'g': [4, 5],
    'j': [4, 5]
})

def test_join_resolution(
    join_resolution: Callable,
    *,
    field_sets: dict,
    fields_to_cover: Iterable,
    expected_join_resolution: list,
):
    assert join_resolution(field_sets, fields_to_cover) == expected_join_resolution


def test_compute_join_resolution(
    compute_join_resolution: Callable,
    *,
    resolution_sequence: Iterable,
    tables: Mapping[str, pd.DataFrame],
    expected_result: pd.DataFrame,
):
    # `resolution_sequence` would typically be the `expected_join_resolution` above.
    result = compute_join_resolution(resolution_sequence, tables)
    # `tables_are_equal` is assumed to be a DataFrame-equality helper (not shown here).
    assert tables_are_equal(result, expected_result)
```
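For concreteness, here is a rough sketch of how these two functions could be implemented. This is not the doodad implementation, just one greedy approach: `join_resolution` treats the tables as a graph (two tables are connected when they share a field), picks a starting table heuristically, then walks toward tables that contain still-missing fields, recording which columns can be dropped after each join; `compute_join_resolution` simply chains pandas merges on shared columns and drops the columns listed in each `JoinWith`. It reproduces `expected_join_resolution` and `expected_result` for the toy data above, but makes no attempt at optimal join planning or at validating join keys.

```python
from collections import deque
from typing import Iterable, Mapping

import pandas as pd

# Uses the JoinWith dataclass and the fixtures (tables, field_sets, ...) defined above.


def join_resolution(field_sets: dict, fields_to_cover: Iterable) -> list:
    """Greedy sketch: pick a start table, then repeatedly walk the join graph
    (tables are neighbors when they share a field) to the nearest table that
    contains a still-missing field."""
    fields_to_cover = set(fields_to_cover)

    def neighbors(t):
        return [u for u in field_sets if u != t and field_sets[u] & field_sets[t]]

    # Start with the table covering the most requested fields
    # (ties broken by total number of columns, i.e. more join options).
    start = max(
        field_sets,
        key=lambda t: (len(field_sets[t] & fields_to_cover), len(field_sets[t])),
    )
    chosen, covered = [start], field_sets[start] & fields_to_cover

    while covered != fields_to_cover:
        missing = fields_to_cover - covered
        # Multi-source BFS from the already-chosen tables to the nearest
        # table containing a missing field; add the whole connecting path.
        queue = deque((t, []) for t in chosen)
        seen, path = set(chosen), None
        while queue:
            t, trail = queue.popleft()
            if field_sets[t] & missing:
                path = trail
                break
            for u in neighbors(t):
                if u not in seen:
                    seen.add(u)
                    queue.append((u, trail + [u]))
        if path is None:
            raise ValueError(f"Cannot cover fields: {missing}")
        chosen.extend(path)
        covered |= set().union(*(field_sets[t] for t in path)) & fields_to_cover

    # Turn the chosen order into join steps; after each join, drop the columns
    # that are neither requested nor needed to join a later table.
    resolution = [chosen[0]]
    accumulated = set(field_sets[chosen[0]])
    for i, t in enumerate(chosen[1:], start=1):
        accumulated |= field_sets[t]
        needed_later = set().union(*(field_sets[u] for u in chosen[i + 1:]))
        keep = (fields_to_cover | needed_later) & accumulated
        resolution.append(JoinWith(t, remove=sorted(accumulated - keep)))
        accumulated = keep
    return resolution


def compute_join_resolution(
    resolution_sequence: Iterable, tables: Mapping[str, pd.DataFrame]
) -> pd.DataFrame:
    """Chain the joins: pandas merges on shared columns by default; drop the
    `remove` columns after each JoinWith step."""
    result = None
    for step in resolution_sequence:
        if isinstance(step, str):
            step = JoinWith(step, remove=[])
        table = tables[step.table_key]
        result = table if result is None else result.merge(table)
        if step.remove:
            result = result.drop(columns=list(step.remove))
    return result


# Quick check against the fixtures above:
assert join_resolution(field_sets, fields_to_cover) == expected_join_resolution
assert compute_join_resolution(expected_join_resolution, tables).equals(expected_result)
```

The remove lists fall out of a simple rule: after each join, a column is kept only if it is either one of the requested fields or still needed as a join key for a table later in the sequence; everything else is dropped.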

---
Data analysis often involves datasets scattered across several tables, whether pandas DataFrames or tables in a SQL database. To integrate and analyze this distributed data, you have to join the tables on their shared fields. The operation itself is not hard, but it gets tedious when the datasets are large or many tables are involved.

Ideally, an analyst could simply point at a set of tables, describe the data they want, and get back a table matching that description. With advances in Large Language Models (LLMs), it is becoming feasible to pinpoint query-relevant tables from column names and their descriptions; the doodad package is being developed to support such capabilities, as discussed in "Make it easier to find and use field names and objects" and "Get jargon definitions".

So, given a collection of tables that collectively contain the answer to a query, the challenge is to produce a "combined dataset" that contains only the data relevant to the query.
To address this challenge, a function like the one below would be useful:
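Something along these lines, for example (signature only; the name and exact signature are illustrative, mirroring the `get_table_join` example discussed earlier in this thread):

```python
from typing import Iterable, Mapping

import pandas as pd


def get_table_join(tables: Mapping[str, pd.DataFrame], fields: Iterable[str]) -> pd.DataFrame:
    """Return a single DataFrame containing `fields`, built by joining
    whichever tables in `tables` are needed to cover those fields."""
    ...
```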