Commit

Fix pyproject.toml to work with uv and update README
vinayak-mehta committed Dec 26, 2024
1 parent 01cba95 commit 9707eff
Showing 11 changed files with 243 additions and 3,378 deletions.
63 changes: 31 additions & 32 deletions README.md
@@ -1,31 +1,30 @@
<p align="center">
<img src="https://github.com/py-pdf/pypdf_table_extraction/blob/main/docs/_static/pypdf-table-extraction.png" width="200">
<img src="https://raw.githubusercontent.com/camelot-dev/camelot/master/docs/_static/camelot.png" width="200">
</p>

# pypdf_table_extraction (Camelot): PDF Table Extraction for Humans
# Camelot: PDF Table Extraction for Humans

[![tests](https://github.com/py-pdf/pypdf_table_extraction/actions/workflows/tests.yml/badge.svg)](https://github.com/py-pdf/pypdf_table_extraction/actions/workflows/tests.yml) [![Documentation Status](https://readthedocs.org/projects/pypdf-table-extraction/badge/?version=latest)](https://pypdf-table-extraction.readthedocs.io/en/latest/)
[![codecov.io](https://codecov.io/github/py-pdf/pypdf_table_extraction/badge.svg?branch=main&service=github)](https://codecov.io/github/py-pdf/pypdf_table_extraction/?branch=main)
[![image](https://img.shields.io/pypi/v/pypdf-table-extraction.svg)](https://pypi.org/project/pypdf-table-extraction/) [![image](https://img.shields.io/pypi/l/pypdf-table-extraction.svg)](https://pypi.org/project/pypdf-table-extraction/) [![image](https://img.shields.io/pypi/pyversions/pypdf-table-extraction.svg)](https://pypi.org/project/pypdf-table-extraction/)
[![tests](https://github.com/camelot-dev/camelot/actions/workflows/tests.yml/badge.svg)](https://github.com/camelot-dev/camelot/actions/workflows/tests.yml) [![Documentation Status](https://readthedocs.org/projects/camelot-py/badge/?version=master)](https://camelot-py.readthedocs.io/en/master/)
[![codecov.io](https://codecov.io/github/camelot-dev/camelot/badge.svg?branch=master&service=github)](https://codecov.io/github/camelot-dev/camelot?branch=master)
[![image](https://img.shields.io/pypi/v/camelot-py.svg)](https://pypi.org/project/camelot-py/) [![image](https://img.shields.io/pypi/l/camelot-py.svg)](https://pypi.org/project/camelot-py/) [![image](https://img.shields.io/pypi/pyversions/camelot-py.svg)](https://pypi.org/project/camelot-py/)

**pypdf_table_extraction** Formerly known as [Camelot](https://github.com/camelot-dev/camelot) is a Python library that can help you extract tables from PDFs!
**Camelot** is a Python library that can help you extract tables from PDFs!

---

**Here's how you can extract tables from PDFs.**
You can check out the quickstart notebook. [![image](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/py-pdf/pypdf_table_extraction/blob/main/examples/pypdf_table_extraction_quick_start_notebook.ipynb)
You can check out the quickstart notebook. [![image](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/py-pdf/pypdf_table_extraction/blob/main/examples/pypdf_table_extraction_quick_start_notebook.ipynb) or follow the example below.

Or follow the example below.
You can check out the PDF used in this example [here](https://github.com/py-pdf/pypdf_table_extraction/blob/main/docs/_static/pdf/foo.pdf).
You can check out the PDF used in this example [here](https://github.com/camelot-dev/camelot/blob/main/docs/_static/pdf/foo.pdf).

```python3
>>> import pypdf_table_extraction
>>> tables = pypdf_table_extraction.read_pdf('foo.pdf')
>>> import camelot
>>> tables = camelot.read_pdf('foo.pdf')
>>> tables
<TableList n=1>
&lt;TableList n=1&gt;
>>> tables.export('foo.csv', f='csv', compress=True) # json, excel, html, markdown, sqlite
>>> tables[0]
<Table shape=(7, 7)>
&lt;Table shape=(7, 7)&gt;
>>> tables[0].parsing_report
{
'accuracy': 99.02,
@@ -46,77 +45,77 @@ You can check out the PDF used in this example [here](https://github.com/py-pdf/
| 2032_2 | 0.17 | 57.8 | 21.7% | 0.3% | 2.7% | 1.2% |
| 4171_1 | 0.07 | 173.9 | 58.1% | 1.6% | 2.1% | 0.5% |

pypdf_table_extraction also comes packaged with a [command-line interface](https://pypdf-table-extraction.readthedocs.io/en/latest/user/cli.html)!
Camelot also comes packaged with a [command-line interface](https://camelot-py.readthedocs.io/en/latest/user/cli.html)!

Refer to the [QuickStart Guide](https://github.com/py-pdf/pypdf_table_extraction/blob/main/docs/user/quickstart.rst#quickstart) to quickly get started with pypdf_table_extraction, extract tables from PDFs and explore some basic options.
Refer to the [QuickStart Guide](https://github.com/camelot-dev/camelot/blob/main/docs/user/quickstart.rst#quickstart) to quickly get started with Camelot, extract tables from PDFs and explore some basic options.

**Tip:** Visit the `parser-comparison-notebook` to get an overview of all the packed parsers and their features. [![image](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/py-pdf/pypdf_table_extraction/blob/main/examples/parser-comparison-notebook.ipynb)

**Note:** pypdf_table_extraction only works with text-based PDFs and not scanned documents. (As Tabula [explains](https://github.com/tabulapdf/tabula#why-tabula), "If you can click and drag to select text in your table in a PDF viewer, then your PDF is text-based".)
**Note:** Camelot only works with text-based PDFs and not scanned documents. (As Tabula [explains](https://github.com/tabulapdf/tabula#why-tabula), "If you can click and drag to select text in your table in a PDF viewer, then your PDF is text-based".)

You can check out some frequently asked questions [here](https://pypdf-table-extraction.readthedocs.io/en/latest/user/faq.html).
You can check out some frequently asked questions [here](https://camelot-py.readthedocs.io/en/latest/user/faq.html).

## Why pypdf_table_extraction?
## Why Camelot?

- **Configurability**: pypdf_table_extraction gives you control over the table extraction process with [tweakable settings](https://pypdf-table-extraction.readthedocs.io/en/latest/user/advanced.html).
- **Configurability**: Camelot gives you control over the table extraction process with [tweakable settings](https://camelot-py.readthedocs.io/en/latest/user/advanced.html).
- **Metrics**: You can discard bad tables based on metrics like accuracy and whitespace, without having to manually look at each table.
- **Output**: Each table is extracted into a **pandas DataFrame**, which seamlessly integrates into [ETL and data analysis workflows](https://gist.github.com/vinayak-mehta/e5949f7c2410a0e12f25d3682dc9e873). You can also export tables to multiple formats, which include CSV, JSON, Excel, HTML, Markdown, and Sqlite.
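
As a minimal sketch of the **Metrics** point above, the snippet below filters tables by accuracy and whitespace. It runs on plain dictionaries shaped like the `parsing_report` shown earlier, so it needs no PDF. The helper name `keep_good_tables` and the threshold values are assumptions made for illustration; they are not part of Camelot.

```python
# Hypothetical helper, not part of Camelot's API.
def keep_good_tables(reports, min_accuracy=90.0, max_whitespace=25.0):
    """Return the indices of reports that pass both thresholds."""
    return [
        index
        for index, report in enumerate(reports)
        if report["accuracy"] >= min_accuracy
        and report["whitespace"] <= max_whitespace
    ]


# Stand-in dictionaries shaped like `table.parsing_report`; values are illustrative.
sample_reports = [
    {"accuracy": 99.02, "whitespace": 12.24, "order": 1, "page": 1},
    {"accuracy": 61.5, "whitespace": 48.0, "order": 2, "page": 1},
]

good_indices = keep_good_tables(sample_reports)
print(good_indices)
```

With real output, the reports could be collected as `[t.parsing_report for t in tables]`, and the returned indices used to pick the tables worth exporting. Suitable thresholds depend on the documents being parsed.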

See [comparison with similar libraries and tools](https://github.com/py-pdf/pypdf_table_extraction/wiki/Comparison-with-other-PDF-Table-Extraction-libraries-and-tools).
See [comparison with similar libraries and tools](https://github.com/camelot-dev/camelot/wiki/Comparison-with-other-PDF-Table-Extraction-libraries-and-tools).

## Installation

### Using conda

The easiest way to install pypdf_table_extraction is with [conda](https://conda.io/docs/), which is a package manager and environment management system for the [Anaconda](http://docs.continuum.io/anaconda/) distribution.
The easiest way to install Camelot is with [conda](https://conda.io/docs/), which is a package manager and environment management system for the [Anaconda](http://docs.continuum.io/anaconda/) distribution.

```bash
conda install -c conda-forge pypdf-table-extraction
conda install -c conda-forge camelot-py
```

### Using pip

After [installing the dependencies](https://pypdf-table-extraction.readthedocs.io/en/latest/user/install-deps.html) ([tk](https://packages.ubuntu.com/bionic/python/python-tk) and [ghostscript](https://www.ghostscript.com/)), you can also just use pip to install pypdf_table_extraction:
After [installing the dependencies](https://camelot-py.readthedocs.io/en/latest/user/install-deps.html) ([tk](https://packages.ubuntu.com/bionic/python/python-tk) and [ghostscript](https://www.ghostscript.com/)), you can also just use pip to install Camelot:

```bash
pip install pypdf-table-extraction
pip install "camelot-py[base]"
```

### From the source code

After [installing the dependencies](https://pypdf-table-extraction.readthedocs.io/en/latest/user/install.html#using-pip), clone the repo using:
After [installing the dependencies](https://camelot-py.readthedocs.io/en/latest/user/install.html#using-pip), clone the repo using:

```bash
git clone https://github.com/py-pdf/pypdf_table_extraction.git
git clone https://github.com/camelot-dev/camelot.git
```

and install using pip:

```
cd pypdf_table_extraction
cd camelot
pip install "."
```

## Documentation

The documentation is available at [http://pypdf-table-extraction.readthedocs.io/](http://pypdf-table-extraction.readthedocs.io/).
The documentation is available at [http://camelot-py.readthedocs.io/](http://camelot-py.readthedocs.io/).

## Wrappers

- [camelot-php](https://github.com/randomstate/camelot-php) provides a [PHP](https://www.php.net/) wrapper on Camelot.

## Related projects

- [camelot-sharp](https://github.com/BobLd/camelot-sharp) provides a C sharp implementation of pypdf_table_extraction (Camelot).
- [camelot-sharp](https://github.com/BobLd/camelot-sharp) provides a C sharp implementation of Camelot.

## Contributing

The [Contributor's Guide](https://pypdf-table-extraction.readthedocs.io/en/latest/dev/contributing.html) has detailed information about contributing issues, documentation, code, and tests.
The [Contributor's Guide](https://camelot-py.readthedocs.io/en/latest/dev/contributing.html) has detailed information about contributing issues, documentation, code, and tests.

## Versioning

pypdf_table_extraction uses [Semantic Versioning](https://semver.org/). For the available versions, see the tags on this repository. For the changelog, you can check out the [releases](https://github.com/py-pdf/pypdf_table_extraction/releases) page.
Camelot uses [Semantic Versioning](https://semver.org/). For the available versions, see the tags on this repository. For the changelog, you can check out the [releases](https://github.com/camelot-dev/camelot/releases) page.

## License

This project is licensed under the MIT License, see the [LICENSE](https://github.com/py-pdf/pypdf_table_extraction/blob/main/LICENSE) file for details.
This project is licensed under the MIT License, see the [LICENSE](https://github.com/camelot-dev/camelot/blob/main/LICENSE) file for details.
1 change: 0 additions & 1 deletion camelot/backends/base.py
@@ -2,7 +2,6 @@


class ConversionBackend: # noqa D101

    def installed(self) -> bool: # noqa D102
        raise NotImplementedError

20 changes: 10 additions & 10 deletions camelot/core.py
@@ -9,25 +9,25 @@
import tempfile
import zipfile
from operator import itemgetter
from typing import Any
from typing import Iterable
from typing import Iterator
from typing import Any, Iterable, Iterator

import cv2
import pandas as pd


if sys.version_info >= (3, 11):
    from typing import TypedDict # pylint: disable=no-name-in-module
    from typing import Unpack
    from typing import (
        TypedDict, # pylint: disable=no-name-in-module
        Unpack,
    )
else:
    from typing_extensions import TypedDict, Unpack

from .backends import ImageConversionBackend
from .utils import build_file_path_in_temp_dir
from .utils import get_index_closest_point
from .utils import get_textline_coords

from .utils import (
    build_file_path_in_temp_dir,
    get_index_closest_point,
    get_textline_coords,
)

# minimum number of vertical textline intersections for a textedge
# to be considered valid
4 changes: 2 additions & 2 deletions camelot/io.py
@@ -20,7 +20,7 @@ def read_pdf(
    parallel=False,
    layout_kwargs=None,
    debug=False,
    **kwargs
    **kwargs,
):
    """Read PDF and return extracted tables.
@@ -136,6 +136,6 @@ def read_pdf(
        suppress_stdout=suppress_stdout,
        parallel=parallel,
        layout_kwargs=layout_kwargs,
        **kwargs
        **kwargs,
    )
    return tables
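
The `**kwargs` to `**kwargs,` edits in this file appear to be formatter output rather than a behavior change. The sketch below illustrates that a trailing comma after `**kwargs` is accepted by current Python in both definitions and calls and does not alter the result. The function names are hypothetical and exist only for this illustration.

```python
# Hypothetical functions for illustration only; they are not part of Camelot.
def without_trailing_comma(
    flavor="lattice",
    **kwargs
):
    return flavor, kwargs


def with_trailing_comma(
    flavor="lattice",
    **kwargs,
):
    return flavor, kwargs


options = {"row_tol": 2, "column_tol": 0}

# The same applies at the call site: the trailing comma is optional.
result_a = without_trailing_comma(
    **options
)
result_b = with_trailing_comma(
    **options,
)
print(result_a == result_b)
```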
2 changes: 1 addition & 1 deletion camelot/parsers/hybrid.py
@@ -56,7 +56,7 @@ def __init__(
        row_tol=2,
        column_tol=0,
        debug=False,
        **kwargs
        **kwargs,
    ):
        super().__init__(
            "hybrid",
34 changes: 22 additions & 12 deletions examples/hybrid-parser-step-by-step.ipynb
@@ -26,6 +26,7 @@
"outputs": [],
"source": [
"import os\n",
"\n",
"os.getcwd()\n",
"# Install from source\n",
"!git clone -b main https://github.com/py-pdf/pypdf_table_extraction.git src\n",
@@ -59,10 +60,15 @@
"source": [
"# Bootstrap and common imports\n",
"import sys, time\n",
"sys.path.insert(0, os.path.abspath('')) # Prefer the local version of pypdf_table_extraction if available\n",
"\n",
"sys.path.insert(\n",
" 0, os.path.abspath(\"\")\n",
") # Prefer the local version of pypdf_table_extraction if available\n",
"import pypdf_table_extraction\n",
"\n",
"print(f\"Using pypdf_table_extraction v{pypdf_table_extraction.__version__} from file {pypdf_table_extraction.__file__}.\")\n",
"print(\n",
" f\"Using pypdf_table_extraction v{pypdf_table_extraction.__version__} from file {pypdf_table_extraction.__file__}.\"\n",
")\n",
"\n",
"# Select a pdf to analyze.\n",
"kwargs = {}\n",
@@ -115,21 +121,23 @@
"# pdf_file, kwargs = \"tabula/schools.pdf\", {\"pages\": \"all\"} # network parser hangs on contour plot\n",
"\n",
"filename = os.path.join(\n",
" os.path.dirname(os.path.abspath('.')),\n",
" \"src/tests/files\",\n",
" pdf_file\n",
" os.path.dirname(os.path.abspath(\".\")), \"src/tests/files\", pdf_file\n",
")\n",
"\n",
"# Set up plotting options\n",
"import matplotlib.pyplot as plt\n",
"\n",
"%matplotlib inline\n",
"PLOT_HEIGHT = 12\n",
"\n",
"\n",
"def init_figure_and_axis(title):\n",
" fig = plt.figure(figsize=(PLOT_HEIGHT * 2.5, PLOT_HEIGHT))\n",
" ax = fig.add_subplot(111)\n",
" ax.set_title(title)\n",
" return fig, ax\n",
"\n",
"\n",
"# Utility function to display tables\n",
"def display_parse_results(tables, parse_time, flavor):\n",
" if not tables:\n",
@@ -139,10 +147,13 @@
" lambda table: \"{rows}x{cols}\".format(\n",
" rows=table.shape[0],\n",
" cols=table.shape[1],\n",
" ), tables\n",
" ),\n",
" tables,\n",
" )\n",
" )\n",
" print(f\"The {flavor} parser found {len(tables)} table(s) ({tables_dims}) in {parse_time:.2f}s\")\n",
" print(\n",
" f\"The {flavor} parser found {len(tables)} table(s) ({tables_dims}) in {parse_time:.2f}s\"\n",
" )\n",
" for table in tables:\n",
" display(table.df)"
]
@@ -288,8 +299,7 @@
" fig, ax = init_figure_and_axis(f\"Line structure in PDF\\n{pdf_file}\")\n",
" pypdf_table_extraction.plot(tables[0], kind=\"line\", ax=ax)\n",
"else:\n",
" print(\"No table found for this document.\")\n",
"\n"
" print(\"No table found for this document.\")"
]
},
{
Expand All @@ -309,7 +319,7 @@
"source": [
"for table in tables:\n",
" fig, ax = init_figure_and_axis(f\"Contour structure in PDF\\n{pdf_file}\")\n",
" pypdf_table_extraction.plot(table, kind=\"contour\", ax=ax)\n"
" pypdf_table_extraction.plot(table, kind=\"contour\", ax=ax)"
]
},
{
Expand All @@ -329,7 +339,7 @@
"source": [
"for table in tables:\n",
" fig, ax = init_figure_and_axis(f\"Joint structure in PDF\\n{pdf_file}\")\n",
" pypdf_table_extraction.plot(table, kind=\"joint\", ax=ax)\n"
" pypdf_table_extraction.plot(table, kind=\"joint\", ax=ax)"
]
},
{
Expand All @@ -351,7 +361,7 @@
},
"outputs": [],
"source": [
"display_parse_results(tables, timer_after_parse - timer_before_parse, flavor)\n"
"display_parse_results(tables, timer_after_parse - timer_before_parse, flavor)"
]
},
{