
Highlight text in documents



txtmarker highlights text in documents. txtmarker takes a list of (name, text) pairs, scans an input document and creates a modified version with highlights embedded.

Current file formats supported:

  • pdf


The easiest way to install is via pip and PyPI:

pip install txtmarker

Python 3.9+ is supported. Using a Python virtual environment is recommended.

txtmarker can also be installed directly from GitHub to access the latest, unreleased features.

pip install git+



The examples directory has a series of examples and notebooks giving an overview of txtmarker. See the list of notebooks below.


Notebook | Description
Introducing txtmarker | Overview of the functionality provided by txtmarker
Highlighting with Transformers | AI-driven highlighting with Transformers


The following section gives an overview of highlighters and available methods/configuration. See the notebooks above for detailed examples.

Create a new highlighter

Creates a new highlighter instance.

from txtmarker.factory import Factory
highlighter = Factory.create("pdf")


extension: string

Type of highlighter to create (e.g. pdf)

Optional constructor arguments:


formatter: callable

Callable used to format both queries and input text before matching. Helps clean up files with many symbols and other noisy content.


chunks: int

Number of chunks to split each query into. This is designed for very long text matches.
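
As a rough illustration of what a formatter might look like, the sketch below assumes the callable receives a string and returns a cleaned string. That signature is an assumption, not something confirmed above, and the function name clean and its cleanup rules are hypothetical.

```python
import re

def clean(text):
    """Hypothetical formatter: normalizes text before matching (assumed str -> str)."""
    # Rejoin words that were hyphenated across a line break
    text = re.sub(r"(\w)-\s*\n\s*(\w)", r"\1\2", text)
    # Replace anything that is not a word character, whitespace or basic punctuation
    text = re.sub(r"[^\w\s.,;:!?()-]", " ", text)
    # Collapse runs of whitespace into single spaces
    return re.sub(r"\s+", " ", text).strip()

print(clean("High-\nlighting  ##text@@ in\tdocuments."))
```

A function like this could then be supplied as the formatter argument, so that queries and page text are normalized the same way before matching.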

Page text

Extracts page text from infile and returns as a generator. This enables analysis on the text exactly as it will appear to the highlighter.



infile: string

Full path to input file
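
One possible use of the page text is to check which queries are likely to match before highlighting. The sketch below assumes each item produced by the generator is a plain string of page text. The helper find_matches is hypothetical, and an in-memory iterator stands in for the real generator. It uses plain re.search, which may differ from how txtmarker matches internally.

```python
import re

def find_matches(pages, highlights):
    """Maps each highlight name to the 0-based page indexes where its pattern matches."""
    results = {}
    for index, text in enumerate(pages):
        for name, pattern in highlights:
            if re.search(pattern, text):
                results.setdefault(name, []).append(index)
    return results

# Stand-in for the page generator (assumption: one string per page)
pages = iter(["Deep learning improves search.", "Results: accuracy rose by 4.2%."])
highlights = [("topic", r"Deep\s+learning"), ("metric", r"\d+\.\d+%")]

print(find_matches(pages, highlights))
```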

Highlight text

Highlights infile using the provided annotations. The annotated file is written to outfile.

highlighter.highlight("input.pdf", "output.pdf", [("name", "text to highlight")])


infile: string

Full path to input file


outfile: string

Full path to output file, i.e. the highlighted file


highlights: list of (string, string|regex)

List of highlight elements. Each pair has a name (can be None) and a text value. The text can be either a string or a regular expression. To match a string literally, escape any regular expression special characters first (i.e. call re.escape).
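
A minimal sketch of building a highlights list that mixes literal strings and regular expressions, using only the standard re module. The helper literal and the sample values are illustrative only.

```python
import re

def literal(name, text):
    """Builds a (name, pattern) pair that matches text literally."""
    return (name, re.escape(text))

highlights = [
    # Literal text containing regex special characters: ( ) $ .
    literal("price", "Total cost (USD): $4.50"),
    # A real regular expression, passed through unescaped
    ("year", r"\b(19|20)\d{2}\b"),
    # The name can be None
    (None, re.escape("end-to-end")),
]

for name, pattern in highlights:
    print(name, pattern)
```

Without re.escape, a string such as "Total cost (USD): $4.50" would be read as a pattern containing a group and an end-of-line anchor, and would not match its own text.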