Enhance documentation to highlight new features #35

Merged 1 commit on Mar 25, 2024
README.md

```
OpenEA(backend=pandas, graph_pair=D_W, size=15K, version=V1, rel_triples_left=38
2 http://dbpedia.org/resource/E840454 http://dbpedia.org/ontology/activeYearsStartYear 1948^^<http://www.w3.org/2001/XMLSchema#gYear>
3 http://dbpedia.org/resource/E971710 http://purl.org/dc/elements/1.1/description English singer-songwriter
4 http://dbpedia.org/resource/E022831 http://dbpedia.org/ontology/militaryCommand Commandant of the Marine Corps
>>> ds.ent_links.head()
left right
0 http://dbpedia.org/resource/E123186 http://www.wikidata.org/entity/Q21197
1 http://dbpedia.org/resource/E228902 http://www.wikidata.org/entity/Q5909974
2 http://dbpedia.org/resource/E718575 http://www.wikidata.org/entity/Q707008
3 http://dbpedia.org/resource/E469216 http://www.wikidata.org/entity/Q1471945
4 http://dbpedia.org/resource/E649433 http://www.wikidata.org/entity/Q1198381

The gold standard entity links are stored as an [eche](https://github.com/dobraczka/eche) `ClusterHelper`, which provides convenient functionality:

>>> ds.ent_links.clusters[0]
{'http://www.wikidata.org/entity/Q21197', 'http://dbpedia.org/resource/E123186'}
>>> ('http://www.wikidata.org/entity/Q21197', 'http://dbpedia.org/resource/E123186') in ds.ent_links
True
>>> ('http://dbpedia.org/resource/E123186', 'http://www.wikidata.org/entity/Q21197') in ds.ent_links
True
>>> ds.ent_links.links('http://www.wikidata.org/entity/Q21197')
'http://dbpedia.org/resource/E123186'
>>> ds.ent_links.all_pairs()
<itertools.chain object at 0x7f92c6287c10>
```
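The order-insensitive membership check above can be pictured with a minimal toy model of a symmetric link store (a sketch for illustration only, not the actual eche implementation; the shortened URIs are invented):

```python
from itertools import chain


class ToyLinkStore:
    """Toy stand-in illustrating order-insensitive pair membership."""

    def __init__(self, pairs):
        # Store each gold link as a frozenset, so (a, b) == (b, a)
        self._clusters = [frozenset(p) for p in pairs]

    def __contains__(self, pair):
        return frozenset(pair) in self._clusters

    def all_pairs(self):
        # Lazily chain the pairs, mirroring the itertools.chain seen above
        return chain.from_iterable([tuple(c)] for c in self._clusters)


links = ToyLinkStore([("dbpedia:E123186", "wikidata:Q21197")])
print(("wikidata:Q21197", "dbpedia:E123186") in links)  # True, order ignored
```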

Most datasets are binary matching tasks, but some, such as the `MovieGraphBenchmark`, provide a multi-source setting:

```
>>> ds = MovieGraphBenchmark(graph_pair="multi")
>>> ds
MovieGraphBenchmark(backend=pandas,graph_pair=multi, rel_triples_0=17507, attr_triples_0=20800 rel_triples_1=27903, attr_triples_1=23761 rel_triples_2=15455, attr_triples_2=20902, ent_links=3598, folds=5)
>>> ds.dataset_names
('imdb', 'tmdb', 'tvdb')
```
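In a multi-source setting, one cluster of matched entities induces all pairwise matches between its members. A toy illustration with invented IDs:

```python
from itertools import combinations

# Toy cluster linking the same entity across three sources (invented IDs)
cluster = {"imdb:tt1", "tmdb:m1", "tvdb:s1"}

# One cluster of size n induces n * (n - 1) / 2 pairwise matches
pairs = sorted(combinations(sorted(cluster), 2))
print(pairs)
# [('imdb:tt1', 'tmdb:m1'), ('imdb:tt1', 'tvdb:s1'), ('tmdb:m1', 'tvdb:s1')]
```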

Here the [`PrefixedClusterHelper`](https://eche.readthedocs.io/en/latest/reference/eche/#eche.PrefixedClusterHelper) provides various convenience functions:

```
# Get pairs between specific dataset pairs

>>> list(ds.ent_links.pairs_in_ds_tuple(("imdb","tmdb")))[0]
('https://www.scads.de/movieBenchmark/resource/IMDB/nm0641721', 'https://www.scads.de/movieBenchmark/resource/TMDB/person1236714')

# Get the number of intra-dataset pairs
>>> ds.ent_links.number_of_intra_links
(1, 64, 22663)
```
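The prefix-based filtering can be sketched as follows; this is a toy illustration with made-up URIs and prefixes, not the `PrefixedClusterHelper` internals:

```python
# Toy illustration of filtering matched pairs by dataset URI prefix.
# The prefixes and pairs below are invented for the example.
PREFIXES = {
    "imdb": "https://example.org/IMDB/",
    "tmdb": "https://example.org/TMDB/",
    "tvdb": "https://example.org/TVDB/",
}

pairs = [
    ("https://example.org/IMDB/nm1", "https://example.org/TMDB/p1"),
    ("https://example.org/IMDB/nm2", "https://example.org/TVDB/s2"),
]


def pairs_in_ds_tuple(pairs, ds_tuple):
    """Yield pairs whose sides start with the prefixes of the named datasets."""
    left, right = (PREFIXES[name] for name in ds_tuple)
    for a, b in pairs:
        if a.startswith(left) and b.startswith(right):
            yield (a, b)


print(list(pairs_in_ds_tuple(pairs, ("imdb", "tmdb"))))
# [('https://example.org/IMDB/nm1', 'https://example.org/TMDB/p1')]
```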

For all datasets you can get a canonical name for a dataset instance, e.g. to create folders for storing experiment results:

```
>>> ds.canonical_name
'openea_d_w_15k_v1'
```
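One way to put the canonical name to work (a sketch; the directory layout and file names are assumptions for the example, not part of sylloge):

```python
from pathlib import Path
import tempfile

# The canonical name as returned above; hard-coded here so the sketch is
# self-contained and runnable without loading the dataset
canonical_name = "openea_d_w_15k_v1"

results_root = Path(tempfile.mkdtemp())  # substitute your own results directory
run_dir = results_root / canonical_name / "run_0"
run_dir.mkdir(parents=True, exist_ok=True)

# Hypothetical metrics file, just to show the layout
(run_dir / "metrics.json").write_text('{"hits@1": 0.0}')
print(run_dir.exists())  # True
```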

You can use [dask](https://www.dask.org/) as backend for larger datasets:
Datasets
| [MED-BBK](https://sylloge.readthedocs.io/en/latest/source/datasets.html#sylloge.MED_BBK) | 2020 | 1 | Baidu Baike | [Paper](https://aclanthology.org/2020.coling-industry.17.pdf), [Repo](https://github.com/ZihengZZH/industry-eval-EA/tree/main#benchmark) |
| [MovieGraphBenchmark](https://sylloge.readthedocs.io/en/latest/source/datasets.html#sylloge.MovieGraphBenchmark) | 2022 | 3 | IMDB, TMDB, TheTVDB | [Paper](http://ceur-ws.org/Vol-2873/paper8.pdf), [Repo](https://github.com/ScaDS/MovieGraphBenchmark) |
| [OAEI](https://sylloge.readthedocs.io/en/latest/source/datasets.html#sylloge.OAEI) | 2022 | 5 | Fandom wikis | [Paper](https://ceur-ws.org/Vol-3324/oaei22_paper0.pdf), [Website](http://oaei.ontologymatching.org/2022/knowledgegraph/index.html) |

Broader statistics are provided in `dataset_statistics.csv`. You can also get a pandas DataFrame with statistics for specific datasets, for example to create tables for publications:
```
>>> ds = MovieGraphBenchmark(graph_pair="multi")
>>> from sylloge.base import create_statistics_df
>>> stats_df = create_statistics_df([ds])
>>> stats_df.loc[("MovieGraphBenchmark","moviegraphbenchmark_multi","imdb")]
Entities Relation Triples Attribute Triples ... Clusters Intra-dataset Matches All Matches
Dataset family Task Name Dataset Name ...
MovieGraphBenchmark moviegraphbenchmark_multi imdb 5129 17507 20800 ... 3598 1 31230

[1 rows x 9 columns]
```
docs/index.rst

This simple library aims to collect entity-alignment benchmark datasets and make
# 2 http://dbpedia.org/resource/E840454 http://dbpedia.org/ontology/activeYearsStartYear 1948^^<http://www.w3.org/2001/XMLSchema#gYear>
# 3 http://dbpedia.org/resource/E971710 http://purl.org/dc/elements/1.1/description English singer-songwriter
# 4 http://dbpedia.org/resource/E022831 http://dbpedia.org/ontology/militaryCommand Commandant of the Marine Corps
print(ds.ent_links.head())
# left right
# 0 http://dbpedia.org/resource/E123186 http://www.wikidata.org/entity/Q21197
# 1 http://dbpedia.org/resource/E228902 http://www.wikidata.org/entity/Q5909974
# 2 http://dbpedia.org/resource/E718575 http://www.wikidata.org/entity/Q707008
# 3 http://dbpedia.org/resource/E469216 http://www.wikidata.org/entity/Q1471945
# 4 http://dbpedia.org/resource/E649433 http://www.wikidata.org/entity/Q1198381

The gold standard entity links are stored as an `eche <https://github.com/dobraczka/eche>`_ ``ClusterHelper``, which provides convenient functionality:

.. code-block:: python

print(ds.ent_links.clusters[0])
# {'http://www.wikidata.org/entity/Q21197', 'http://dbpedia.org/resource/E123186'}
print(('http://www.wikidata.org/entity/Q21197', 'http://dbpedia.org/resource/E123186') in ds.ent_links)
# True
print(('http://dbpedia.org/resource/E123186', 'http://www.wikidata.org/entity/Q21197') in ds.ent_links)
# True
print(ds.ent_links.links('http://www.wikidata.org/entity/Q21197'))
# 'http://dbpedia.org/resource/E123186'
print(ds.ent_links.all_pairs())
# <itertools.chain object at 0x7f92c6287c10>

Most datasets are binary matching tasks, but some, such as the ``MovieGraphBenchmark``, provide a multi-source setting:

.. code-block:: python

ds = MovieGraphBenchmark(graph_pair="multi")
print(ds)
# MovieGraphBenchmark(backend=pandas,graph_pair=multi, rel_triples_0=17507, attr_triples_0=20800 rel_triples_1=27903, attr_triples_1=23761 rel_triples_2=15455, attr_triples_2=20902, ent_links=3598, folds=5)
print(ds.dataset_names)
# ('imdb', 'tmdb', 'tvdb')

Here the `PrefixedClusterHelper <https://eche.readthedocs.io/en/latest/reference/eche/#eche.PrefixedClusterHelper>`_ provides various convenience functions:

.. code-block:: python

# Get pairs between specific dataset pairs
print(list(ds.ent_links.pairs_in_ds_tuple(("imdb","tmdb")))[0])
# ('https://www.scads.de/movieBenchmark/resource/IMDB/nm0641721', 'https://www.scads.de/movieBenchmark/resource/TMDB/person1236714')

# Get number of intra-dataset pairs
print(ds.ent_links.number_of_intra_links)
# (1, 64, 22663)

You can get a canonical name for a dataset instance, e.g. to create folders for storing experiment results:

sylloge/__init__.py
TrainTestValSplit,
ZipEADataset,
ZipEADatasetWithPreSplitFolds,
create_statistics_df,
)
from .id_mapped import IdMappedEADataset
from .med_bbk_loader import MED_BBK
"BinaryZipEADatasetWithPreSplitFolds",
"ZipEADatasetWithPreSplitFolds",
"TrainTestValSplit",
"create_statistics_df",
]
__version__ = version(__package__)
logging.getLogger(__name__).setLevel(logging.INFO)