Add small changes
OlivieFranklova committed Jan 8, 2025
1 parent 537b22d commit 0fec728
Showing 4 changed files with 12 additions and 6 deletions.
5 changes: 3 additions & 2 deletions README.md
@@ -12,12 +12,12 @@
<!-- tocstop -->

## What is Datasets Similarity?
-The Dataset Similarity project deals with the
+The Dataset Similarity project deals with the
issue of comparing tabular datasets.
The idea of the project is that we will have a set of
datasets that we want to compare with each other
and find out their similarity or distance.
-This project mainly focuses on comparing only two tables.
+This project mainly focuses on comparing only two tables but it implements `similarity_runner` that can compare more tables.
The final similarity is calculated according
to the similarity of individual columns based on their metadata.
Columns are compared by type and by content.
@@ -27,6 +27,7 @@ the main set (training) on which the program is
tuned, and a validation set for validating the results.

#### Definition of table similarity:
+Two tables are similar if they have at least *k* similar columns.
![img_1.png](docs/similarity_def.png)
>Parameter **important columns** is user input.
>
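As a rough illustration of the definition added in this hunk, the sketch below counts the columns of one table that have a sufficiently similar counterpart in the other table and compares that count with *k*. The score matrix, the `threshold` parameter, and the function name `tables_are_similar` are assumptions made for this example; they are not taken from the project.

```python
from typing import Sequence


def tables_are_similar(
    column_scores: Sequence[Sequence[float]],
    k: int,
    threshold: float = 0.8,
) -> bool:
    """Return True if at least k columns of table 1 have a similar column in table 2.

    column_scores[i][j] is a hypothetical similarity in [0, 1] between
    column i of the first table and column j of the second table.
    """
    similar_columns = sum(
        1 for row in column_scores if any(score >= threshold for score in row)
    )
    return similar_columns >= k


# Hypothetical scores: rows are columns of table 1, entries are matches in table 2.
scores = [
    [0.9, 0.1, 0.2],
    [0.3, 0.85, 0.4],
    [0.2, 0.3, 0.5],
]
print(tables_are_similar(scores, k=2))  # True: two rows reach the threshold
print(tables_are_similar(scores, k=3))  # False: the third row never does
```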
@@ -107,7 +107,7 @@ def _compare(self, metadata1: Metadata, metadata2: Metadata) -> SimilarityOutput
             res = self.distance_function.compute(distances)
             res = res * res
         else:
-            res = 0
+            res = 1
         if table_distances:
             for dist in table_distances:
                 res += dist * dist
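To see what switching the fallback from 0 to 1 does, here is a small stand-alone sketch of the accumulation visible in this hunk. The condition guarding the first branch and whatever happens to `res` after the loop are not shown in the diff, so the `column_distance is not None` test and the helper name `combine_squared` are assumptions.

```python
from typing import Iterable, Optional


def combine_squared(
    column_distance: Optional[float],
    table_distances: Iterable[float],
    fallback: float,
) -> float:
    """Squared column distance (or a fallback) plus the squared table distances."""
    if column_distance is not None:  # assumed guard; the real condition is outside the hunk
        res = column_distance * column_distance
    else:
        res = fallback
    for dist in table_distances:
        res += dist * dist
    return res


# No column distance available, one table-level distance of 0.5:
print(combine_squared(None, [0.5], fallback=0))  # pre-commit fallback: 0.25
print(combine_squared(None, [0.5], fallback=1))  # post-commit fallback: 1.25
```

If distances are read as 0 meaning "identical", the new fallback makes a missing column comparison count as a nonzero term instead of a perfect match.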
@@ -81,7 +81,12 @@ def add_comparator_type(self, comparator: HandlerType) -> "ComparatorByType":
         """
         Add comparator
         """
-        self.comparator_type.append(comparator)
+        if comparator == ColumnKindHandler:
+            self.kinds = True
+        if comparator == ColumnTypeHandler:
+            self.types = True
+        else:
+            self.comparator_type.append(comparator)
         return self

     def __compare_all_columns(
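One detail worth checking in this hunk: the `else` belongs only to the second `if`, so a `ColumnKindHandler` sets `self.kinds` and is then also appended to `comparator_type`. Whether that is intended is not clear from the diff. The sketch below uses stand-in classes (`Registry`, `OtherHandler`, and the empty handler classes are hypothetical, not the project's) to show the behaviour as written next to an `elif` variant that keeps the three cases exclusive.

```python
class ColumnKindHandler:
    """Stand-in for the project's kind handler (hypothetical, empty)."""


class ColumnTypeHandler:
    """Stand-in for the project's type handler (hypothetical, empty)."""


class OtherHandler:
    """Stand-in for any other handler (hypothetical, empty)."""


class Registry:
    """Minimal stand-in for the class that owns add_comparator_type."""

    def __init__(self) -> None:
        self.kinds = False
        self.types = False
        self.comparator_type = []

    def add_as_written(self, comparator):
        # Same branch structure as the committed code.
        if comparator == ColumnKindHandler:
            self.kinds = True
        if comparator == ColumnTypeHandler:
            self.types = True
        else:
            self.comparator_type.append(comparator)
        return self

    def add_exclusive(self, comparator):
        # Variant with elif: each comparator takes exactly one branch.
        if comparator == ColumnKindHandler:
            self.kinds = True
        elif comparator == ColumnTypeHandler:
            self.types = True
        else:
            self.comparator_type.append(comparator)
        return self


as_written = Registry().add_as_written(ColumnKindHandler)
exclusive = Registry().add_exclusive(ColumnKindHandler)
print(as_written.kinds, len(as_written.comparator_type))  # True 1
print(exclusive.kinds, len(exclusive.comparator_type))    # True 0
```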
4 changes: 2 additions & 2 deletions similarity_framework/src/impl/comparator/utils.py
@@ -42,8 +42,8 @@ def get_ratio(count1: int, count2: int) -> float:
     if count1 == 0 or count2 == 0:
         return 1
     if count1 < count2:
-        return count2 / count1
-    return count1 / count2
+        return count1 / count2
+    return count2 / count1


 def fill_result(metadata1_names, metadata2_names) -> pd.DataFrame:
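Restated outside the diff, the updated `get_ratio` appears to return the smaller count divided by the larger one, so the result stays in (0, 1] for positive counts, whereas the previous version returned a value of at least 1. The sketch below reproduces the post-commit logic from the hunk; the docstring and the sample calls are added for illustration.

```python
def get_ratio(count1: int, count2: int) -> float:
    """Ratio of the smaller count to the larger one; 1 if either count is zero."""
    if count1 == 0 or count2 == 0:
        return 1
    if count1 < count2:
        return count1 / count2
    return count2 / count1


print(get_ratio(2, 8))  # 0.25
print(get_ratio(8, 2))  # 0.25, argument order does not matter
print(get_ratio(0, 5))  # 1, the zero-count special case
print(get_ratio(4, 4))  # 1.0
```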