Extract and clean tables from financial xlsx files from EDGAR and convert them to JSON with bi-tree positional information and metadata.
Related dataset:
Zavitsanos, E., Mavroeidis, D., Spyropoulou, E., Fergadiotis, M., & Paliouras, G. (2024). ENTRANT: A Large Financial Dataset for Table Understanding [Data set]. Zenodo. https://doi.org/10.5281/zenodo.10667088
Please cite the related paper as follows: Zavitsanos, E., Mavroeidis, D., Spyropoulou, E. et al. ENTRANT: A Large Financial Dataset for Table Understanding. Sci Data 11, 876 (2024). https://doi.org/10.1038/s41597-024-03605-5
- Before starting, it is recommended to switch to a virtual environment via `conda` or `virtualenv`.
- Install dependencies via `pip install -r requirements.txt`.
- Place the xls files in a directory named `data` in the project's root.
- Create a directory named `output` to store the results.
- Run `extract_tables_multiprocess.py`.
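The directory setup and run steps above can be sketched in Python. This is a minimal sketch, not part of the project: the helper names `prepare_workspace` and `run_extraction` are hypothetical, and it assumes that `extract_tables_multiprocess.py` is run from the project's root without command-line arguments, which this README does not confirm.

```python
import subprocess
import sys
from pathlib import Path
from typing import Tuple


def prepare_workspace(root: Path) -> Tuple[Path, Path]:
    """Check that data/ exists and create output/ if needed (hypothetical helper)."""
    data_dir = root / "data"
    output_dir = root / "output"
    if not data_dir.is_dir():
        raise FileNotFoundError(f"Expected input directory: {data_dir}")
    # exist_ok=True makes repeated runs safe.
    output_dir.mkdir(exist_ok=True)
    return data_dir, output_dir


def run_extraction(root: Path) -> int:
    """Run the extraction script from the project root.

    Assumption: the script needs no command-line arguments.
    """
    prepare_workspace(root)
    result = subprocess.run(
        [sys.executable, "extract_tables_multiprocess.py"], cwd=root
    )
    return result.returncode


# Usage from the project root:
#   run_extraction(Path.cwd())
```

Failing early when `data` is missing avoids starting the worker processes only to find there is nothing to read.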
- See `fetch_reports.py`
- Pay attention to fair usage of EDGAR
- Data is hosted at Zenodo: https://zenodo.org/records/10667088
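One way to keep downloads within fair usage of EDGAR is to throttle requests and send a descriptive User-Agent header. The sketch below is not taken from `fetch_reports.py`: the `Throttle` class, the `fetch` function, the placeholder contact string, and the rate of 5 requests per second are all assumptions for illustration. Check the SEC's current guidance on automated access for the actual limits before running anything at scale.

```python
import time
import urllib.request


class Throttle:
    """Enforce a minimum interval between successive calls (hypothetical helper)."""

    def __init__(self, max_per_second: float) -> None:
        self.min_interval = 1.0 / max_per_second
        self._last_call = None  # monotonic timestamp of the previous call

    def wait(self) -> None:
        """Sleep just long enough to keep calls at least min_interval apart."""
        now = time.monotonic()
        if self._last_call is not None:
            remaining = self.min_interval - (now - self._last_call)
            if remaining > 0:
                time.sleep(remaining)
        self._last_call = time.monotonic()


# Assumption: replace with your own organisation name and contact address.
USER_AGENT = "Sample Company admin@example.com"

# Assumption: a conservative rate chosen for illustration.
throttle = Throttle(max_per_second=5)


def fetch(url: str) -> bytes:
    """Download one URL, waiting first so the request rate stays bounded."""
    throttle.wait()
    request = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    with urllib.request.urlopen(request, timeout=30) as response:
        return response.read()
```

A single shared `Throttle` only bounds the rate within one process; if several worker processes download in parallel, each would need a proportionally lower rate.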
Use `pytest` to run the unit tests.
The project is licensed under the Creative Commons Attribution 4.0 license.
@article{entrant2024,
title={ENTRANT: A Large Financial Dataset For Table Understanding},
author={Zavitsanos, Elias and Mavroeidis, Dimitris and Spyropoulou, Eirini and Fergadiotis, Manos and Paliouras, Georgios},
journal={Nature Scientific Data},
pages={876},
volume = {11},
year={2024},
doi = {10.1038/s41597-024-03605-5}
}