Fix pyproject.toml to work with uv and update README

camelot-dev · Dec 26, 2024 · 9707eff
1 parent 01cba95
commit 9707eff

Showing 11 changed files with 243 additions and 3,378 deletions.
diff --git a/README.md b/README.md
@@ -1,31 +1,30 @@
 <p align="center">
-   <img src="https://github.com/py-pdf/pypdf_table_extraction/blob/main/docs/_static/pypdf-table-extraction.png" width="200">
+  <img src="https://raw.githubusercontent.com/camelot-dev/camelot/master/docs/_static/camelot.png" width="200">
 </p>
 
-# pypdf_table_extraction (Camelot): PDF Table Extraction for Humans
+# Camelot: PDF Table Extraction for Humans
 
-[![tests](https://github.com/py-pdf/pypdf_table_extraction/actions/workflows/tests.yml/badge.svg)](https://github.com/py-pdf/pypdf_table_extraction/actions/workflows/tests.yml) [![Documentation Status](https://readthedocs.org/projects/pypdf-table-extraction/badge/?version=latest)](https://pypdf-table-extraction.readthedocs.io/en/latest/)
-[![codecov.io](https://codecov.io/github/py-pdf/pypdf_table_extraction/badge.svg?branch=main&service=github)](https://codecov.io/github/py-pdf/pypdf_table_extraction/?branch=main)
-[![image](https://img.shields.io/pypi/v/pypdf-table-extraction.svg)](https://pypi.org/project/pypdf-table-extraction/) [![image](https://img.shields.io/pypi/l/pypdf-table-extraction.svg)](https://pypi.org/project/pypdf-table-extraction/) [![image](https://img.shields.io/pypi/pyversions/pypdf-table-extraction.svg)](https://pypi.org/project/pypdf-table-extraction/)
+[![tests](https://github.com/camelot-dev/camelot/actions/workflows/tests.yml/badge.svg)](https://github.com/camelot-dev/camelot/actions/workflows/tests.yml) [![Documentation Status](https://readthedocs.org/projects/camelot-py/badge/?version=master)](https://camelot-py.readthedocs.io/en/master/)
+[![codecov.io](https://codecov.io/github/camelot-dev/camelot/badge.svg?branch=master&service=github)](https://codecov.io/github/camelot-dev/camelot?branch=master)
+[![image](https://img.shields.io/pypi/v/camelot-py.svg)](https://pypi.org/project/camelot-py/) [![image](https://img.shields.io/pypi/l/camelot-py.svg)](https://pypi.org/project/camelot-py/) [![image](https://img.shields.io/pypi/pyversions/camelot-py.svg)](https://pypi.org/project/camelot-py/)
 
-**pypdf_table_extraction** Formerly known as [Camelot](https://github.com/camelot-dev/camelot) is a Python library that can help you extract tables from PDFs!
+**Camelot** is a Python library that can help you extract tables from PDFs!
 
 ---
 
 **Here's how you can extract tables from PDFs.**
-You can check out the quickstart notebook. [![image](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/py-pdf/pypdf_table_extraction/blob/main/examples/pypdf_table_extraction_quick_start_notebook.ipynb)
+You can check out the quickstart notebook. [![image](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/py-pdf/pypdf_table_extraction/blob/main/examples/pypdf_table_extraction_quick_start_notebook.ipynb) or follow the example below.
 
-Or follow the example below.
-You can check out the PDF used in this example [here](https://github.com/py-pdf/pypdf_table_extraction/blob/main/docs/_static/pdf/foo.pdf).
+You can check out the PDF used in this example [here](https://github.com/camelot-dev/camelot/blob/main/docs/_static/pdf/foo.pdf).
 
 ```python3
->>> import pypdf_table_extraction
->>> tables = pypdf_table_extraction.read_pdf('foo.pdf')
+>>> import camelot
+>>> tables = camelot.read_pdf('foo.pdf')
 >>> tables
-<TableList n=1>
+&lt;TableList n=1&gt;
 >>> tables.export('foo.csv', f='csv', compress=True) # json, excel, html, markdown, sqlite
 >>> tables[0]
-<Table shape=(7, 7)>
+&lt;Table shape=(7, 7)&gt;
 >>> tables[0].parsing_report
 {
     'accuracy': 99.02,
@@ -46,77 +45,77 @@ You can check out the PDF used in this example [here](https://github.com/py-pdf/
 | 2032_2     | 0.17      | 57.8          | 21.7%                | 0.3%            | 2.7%            | 1.2%           |
 | 4171_1     | 0.07      | 173.9         | 58.1%                | 1.6%            | 2.1%            | 0.5%           |
 
-pypdf_table_extraction also comes packaged with a [command-line interface](https://pypdf-table-extraction.readthedocs.io/en/latest/user/cli.html)!
+Camelot also comes packaged with a [command-line interface](https://camelot-py.readthedocs.io/en/latest/user/cli.html)!
 
-Refer to the [QuickStart Guide](https://github.com/py-pdf/pypdf_table_extraction/blob/main/docs/user/quickstart.rst#quickstart) to quickly get started with pypdf_table_extraction, extract tables from PDFs and explore some basic options.
+Refer to the [QuickStart Guide](https://github.com/camelot-dev/camelot/blob/main/docs/user/quickstart.rst#quickstart) to quickly get started with Camelot, extract tables from PDFs and explore some basic options.
 
 **Tip:** Visit the `parser-comparison-notebook` to get an overview of all the packed parsers and their features. [![image](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/py-pdf/pypdf_table_extraction/blob/main/examples/parser-comparison-notebook.ipynb)
 
-**Note:** pypdf_table_extraction only works with text-based PDFs and not scanned documents. (As Tabula [explains](https://github.com/tabulapdf/tabula#why-tabula), "If you can click and drag to select text in your table in a PDF viewer, then your PDF is text-based".)
+**Note:** Camelot only works with text-based PDFs and not scanned documents. (As Tabula [explains](https://github.com/tabulapdf/tabula#why-tabula), "If you can click and drag to select text in your table in a PDF viewer, then your PDF is text-based".)
 
-You can check out some frequently asked questions [here](https://pypdf-table-extraction.readthedocs.io/en/latest/user/faq.html).
+You can check out some frequently asked questions [here](https://camelot-py.readthedocs.io/en/latest/user/faq.html).
 
-## Why pypdf_table_extraction?
+## Why Camelot?
 
-- **Configurability**: pypdf_table_extraction gives you control over the table extraction process with [tweakable settings](https://pypdf-table-extraction.readthedocs.io/en/latest/user/advanced.html).
+- **Configurability**: Camelot gives you control over the table extraction process with [tweakable settings](https://camelot-py.readthedocs.io/en/latest/user/advanced.html).
 - **Metrics**: You can discard bad tables based on metrics like accuracy and whitespace, without having to manually look at each table.
 - **Output**: Each table is extracted into a **pandas DataFrame**, which seamlessly integrates into [ETL and data analysis workflows](https://gist.github.com/vinayak-mehta/e5949f7c2410a0e12f25d3682dc9e873). You can also export tables to multiple formats, which include CSV, JSON, Excel, HTML, Markdown, and Sqlite.
 
-See [comparison with similar libraries and tools](https://github.com/py-pdf/pypdf_table_extraction/wiki/Comparison-with-other-PDF-Table-Extraction-libraries-and-tools).
+See [comparison with similar libraries and tools](https://github.com/camelot-dev/camelot/wiki/Comparison-with-other-PDF-Table-Extraction-libraries-and-tools).
 
 ## Installation
 
 ### Using conda
 
-The easiest way to install pypdf_table_extraction is with [conda](https://conda.io/docs/), which is a package manager and environment management system for the [Anaconda](http://docs.continuum.io/anaconda/) distribution.
+The easiest way to install Camelot is with [conda](https://conda.io/docs/), which is a package manager and environment management system for the [Anaconda](http://docs.continuum.io/anaconda/) distribution.
 
 ```bash
-conda install -c conda-forge pypdf-table-extraction
+conda install -c conda-forge camelot-py
 ```
 
 ### Using pip
 
-After [installing the dependencies](https://pypdf-table-extraction.readthedocs.io/en/latest/user/install-deps.html) ([tk](https://packages.ubuntu.com/bionic/python/python-tk) and [ghostscript](https://www.ghostscript.com/)), you can also just use pip to install pypdf_table_extraction:
+After [installing the dependencies](https://camelot-py.readthedocs.io/en/latest/user/install-deps.html) ([tk](https://packages.ubuntu.com/bionic/python/python-tk) and [ghostscript](https://www.ghostscript.com/)), you can also just use pip to install Camelot:
 
 ```bash
-pip install pypdf-table-extraction
+pip install "camelot-py[base]"
 ```
 
 ### From the source code
 
-After [installing the dependencies](https://pypdf-table-extraction.readthedocs.io/en/latest/user/install.html#using-pip), clone the repo using:
+After [installing the dependencies](https://camelot-py.readthedocs.io/en/latest/user/install.html#using-pip), clone the repo using:
 
 ```bash
-git clone https://github.com/py-pdf/pypdf_table_extraction.git
+git clone https://github.com/camelot-dev/camelot.git
 ```
 
 and install using pip:
 
 ```
-cd pypdf_table_extraction
+cd camelot
 pip install "."
 ```
 
 ## Documentation
 
-The documentation is available at [http://pypdf-table-extraction.readthedocs.io/](http://pypdf-table-extraction.readthedocs.io/).
+The documentation is available at [http://camelot-py.readthedocs.io/](http://camelot-py.readthedocs.io/).
 
 ## Wrappers
 
 - [camelot-php](https://github.com/randomstate/camelot-php) provides a [PHP](https://www.php.net/) wrapper on Camelot.
 
 ## Related projects
 
-- [camelot-sharp](https://github.com/BobLd/camelot-sharp) provides a C sharp implementation of pypdf_table_extraction (Camelot).
+- [camelot-sharp](https://github.com/BobLd/camelot-sharp) provides a C sharp implementation of Camelot.
 
 ## Contributing
 
-The [Contributor's Guide](https://pypdf-table-extraction.readthedocs.io/en/latest/dev/contributing.html) has detailed information about contributing issues, documentation, code, and tests.
+The [Contributor's Guide](https://camelot-py.readthedocs.io/en/latest/dev/contributing.html) has detailed information about contributing issues, documentation, code, and tests.
 
 ## Versioning
 
-pypdf_table_extraction uses [Semantic Versioning](https://semver.org/). For the available versions, see the tags on this repository. For the changelog, you can check out the [releases](https://github.com/py-pdf/pypdf_table_extraction/releases) page.
+Camelot uses [Semantic Versioning](https://semver.org/). For the available versions, see the tags on this repository. For the changelog, you can check out the [releases](https://github.com/camelot-dev/camelot/releases) page.
 
 ## License
 
-This project is licensed under the MIT License, see the [LICENSE](https://github.com/py-pdf/pypdf_table_extraction/blob/main/LICENSE) file for details.
+This project is licensed under the MIT License, see the [LICENSE](https://github.com/camelot-dev/camelot/blob/main/LICENSE) file for details.
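
The README's "Metrics" point (discarding bad tables without inspecting each one) can be sketched as below. This is a minimal illustration, not code from the commit: `FakeTable` and `keep_good_tables` are hypothetical names, the thresholds are arbitrary, and the `'whitespace'` key is assumed to sit next to the `'accuracy'` key shown in the README's `parsing_report` output. With Camelot installed, the same filter would be applied to the result of `camelot.read_pdf(...)`.

```python
from dataclasses import dataclass, field


@dataclass
class FakeTable:
    """Stand-in for an extracted table; only parsing_report is modelled."""

    parsing_report: dict = field(default_factory=dict)


def keep_good_tables(tables, min_accuracy=90.0, max_whitespace=50.0):
    """Return tables whose report meets both thresholds (hypothetical helper)."""
    kept = []
    for table in tables:
        report = table.parsing_report
        # Missing keys are treated as the worst case so the table is dropped.
        accuracy = report.get("accuracy", 0.0)
        whitespace = report.get("whitespace", 100.0)
        if accuracy >= min_accuracy and whitespace <= max_whitespace:
            kept.append(table)
    return kept


tables = [
    FakeTable({"accuracy": 99.02, "whitespace": 12.24}),
    FakeTable({"accuracy": 41.5, "whitespace": 80.0}),
]
good = keep_good_tables(tables)
print(len(good))
```

Treating a missing metric as the worst case is a design choice for this sketch: a table with an incomplete report is discarded rather than silently kept.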
diff --git a/camelot/backends/base.py b/camelot/backends/base.py
@@ -2,7 +2,6 @@
 
 
 class ConversionBackend:  # noqa D101
-
     def installed(self) -> bool:  # noqa D102
         raise NotImplementedError
 

diff --git a/camelot/core.py b/camelot/core.py
@@ -9,25 +9,25 @@
 import tempfile
 import zipfile
 from operator import itemgetter
-from typing import Any
-from typing import Iterable
-from typing import Iterator
+from typing import Any, Iterable, Iterator
 
 import cv2
 import pandas as pd
 
-
 if sys.version_info >= (3, 11):
-    from typing import TypedDict  # pylint: disable=no-name-in-module
-    from typing import Unpack
+    from typing import (
+        TypedDict,  # pylint: disable=no-name-in-module
+        Unpack,
+    )
 else:
     from typing_extensions import TypedDict, Unpack
 
 from .backends import ImageConversionBackend
-from .utils import build_file_path_in_temp_dir
-from .utils import get_index_closest_point
-from .utils import get_textline_coords
-
+from .utils import (
+    build_file_path_in_temp_dir,
+    get_index_closest_point,
+    get_textline_coords,
+)
 
 # minimum number of vertical textline intersections for a textedge
 # to be considered valid
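
The version-gated import in `camelot/core.py` keeps `TypedDict` and `Unpack` available on interpreters older than 3.11 through `typing_extensions`. A minimal sketch of what that pair is typically used for, typing `**kwargs` against a dictionary schema, is shown below. `ExampleKwargs` and `describe` are hypothetical names for illustration and are not defined in Camelot; the defaults mirror the `row_tol=2` and `column_tol=0` values visible in the `hybrid.py` hunk.

```python
import sys

if sys.version_info >= (3, 11):
    from typing import TypedDict, Unpack
else:
    # Assumes the typing_extensions backport is installed on older interpreters.
    from typing_extensions import TypedDict, Unpack


class ExampleKwargs(TypedDict, total=False):
    """Hypothetical keyword-argument schema; every key is optional."""

    row_tol: int
    column_tol: int


def describe(**kwargs: Unpack[ExampleKwargs]) -> str:
    """Report the effective tolerances, falling back to defaults."""
    row_tol = kwargs.get("row_tol", 2)
    column_tol = kwargs.get("column_tol", 0)
    return f"row_tol={row_tol}, column_tol={column_tol}"


print(describe(row_tol=5))
```

The annotation changes nothing at runtime; it only lets a static type checker flag unknown or mistyped keyword arguments.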

diff --git a/camelot/io.py b/camelot/io.py
@@ -20,7 +20,7 @@ def read_pdf(
     parallel=False,
     layout_kwargs=None,
     debug=False,
-    **kwargs
+    **kwargs,
 ):
     """Read PDF and return extracted tables.
 
@@ -136,6 +136,6 @@ def read_pdf(
             suppress_stdout=suppress_stdout,
             parallel=parallel,
             layout_kwargs=layout_kwargs,
-            **kwargs
+            **kwargs,
         )
         return tables
diff --git a/camelot/parsers/hybrid.py b/camelot/parsers/hybrid.py
@@ -56,7 +56,7 @@ def __init__(
         row_tol=2,
         column_tol=0,
         debug=False,
-        **kwargs
+        **kwargs,
     ):
         super().__init__(
             "hybrid",

diff --git a/examples/hybrid-parser-step-by-step.ipynb b/examples/hybrid-parser-step-by-step.ipynb
@@ -26,6 +26,7 @@
    "outputs": [],
    "source": [
     "import os\n",
+    "\n",
     "os.getcwd()\n",
     "# Install from source\n",
     "!git clone -b main https://github.com/py-pdf/pypdf_table_extraction.git src\n",
@@ -59,10 +60,15 @@
    "source": [
     "# Bootstrap and common imports\n",
     "import sys, time\n",
-    "sys.path.insert(0, os.path.abspath('')) # Prefer the local version of pypdf_table_extraction if available\n",
+    "\n",
+    "sys.path.insert(\n",
+    "    0, os.path.abspath(\"\")\n",
+    ")  # Prefer the local version of pypdf_table_extraction if available\n",
     "import pypdf_table_extraction\n",
     "\n",
-    "print(f\"Using pypdf_table_extraction v{pypdf_table_extraction.__version__} from file {pypdf_table_extraction.__file__}.\")\n",
+    "print(\n",
+    "    f\"Using pypdf_table_extraction v{pypdf_table_extraction.__version__} from file {pypdf_table_extraction.__file__}.\"\n",
+    ")\n",
     "\n",
     "# Select a pdf to analyze.\n",
     "kwargs = {}\n",
@@ -115,21 +121,23 @@
     "# pdf_file, kwargs = \"tabula/schools.pdf\", {\"pages\": \"all\"}  # network parser hangs on contour plot\n",
     "\n",
     "filename = os.path.join(\n",
-    "    os.path.dirname(os.path.abspath('.')),\n",
-    "    \"src/tests/files\",\n",
-    "    pdf_file\n",
+    "    os.path.dirname(os.path.abspath(\".\")), \"src/tests/files\", pdf_file\n",
     ")\n",
     "\n",
     "# Set up plotting options\n",
     "import matplotlib.pyplot as plt\n",
+    "\n",
     "%matplotlib inline\n",
     "PLOT_HEIGHT = 12\n",
+    "\n",
+    "\n",
     "def init_figure_and_axis(title):\n",
     "    fig = plt.figure(figsize=(PLOT_HEIGHT * 2.5, PLOT_HEIGHT))\n",
     "    ax = fig.add_subplot(111)\n",
     "    ax.set_title(title)\n",
     "    return fig, ax\n",
     "\n",
+    "\n",
     "# Utility function to display tables\n",
     "def display_parse_results(tables, parse_time, flavor):\n",
     "    if not tables:\n",
@@ -139,10 +147,13 @@
     "            lambda table: \"{rows}x{cols}\".format(\n",
     "                rows=table.shape[0],\n",
     "                cols=table.shape[1],\n",
-    "            ), tables\n",
+    "            ),\n",
+    "            tables,\n",
     "        )\n",
     "    )\n",
-    "    print(f\"The {flavor} parser found {len(tables)} table(s) ({tables_dims}) in {parse_time:.2f}s\")\n",
+    "    print(\n",
+    "        f\"The {flavor} parser found {len(tables)} table(s) ({tables_dims}) in {parse_time:.2f}s\"\n",
+    "    )\n",
     "    for table in tables:\n",
     "        display(table.df)"
    ]
@@ -288,8 +299,7 @@
     "    fig, ax = init_figure_and_axis(f\"Line structure in PDF\\n{pdf_file}\")\n",
     "    pypdf_table_extraction.plot(tables[0], kind=\"line\", ax=ax)\n",
     "else:\n",
-    "    print(\"No table found for this document.\")\n",
-    "\n"
+    "    print(\"No table found for this document.\")"
    ]
   },
   {
@@ -309,7 +319,7 @@
    "source": [
     "for table in tables:\n",
     "    fig, ax = init_figure_and_axis(f\"Contour structure in PDF\\n{pdf_file}\")\n",
-    "    pypdf_table_extraction.plot(table, kind=\"contour\", ax=ax)\n"
+    "    pypdf_table_extraction.plot(table, kind=\"contour\", ax=ax)"
    ]
   },
   {
@@ -329,7 +339,7 @@
    "source": [
     "for table in tables:\n",
     "    fig, ax = init_figure_and_axis(f\"Joint structure in PDF\\n{pdf_file}\")\n",
-    "    pypdf_table_extraction.plot(table, kind=\"joint\", ax=ax)\n"
+    "    pypdf_table_extraction.plot(table, kind=\"joint\", ax=ax)"
    ]
   },
   {
@@ -351,7 +361,7 @@
    },
    "outputs": [],
    "source": [
-    "display_parse_results(tables, timer_after_parse - timer_before_parse, flavor)\n"
+    "display_parse_results(tables, timer_after_parse - timer_before_parse, flavor)"
    ]
   },
   {