We clean either text or tokens. Tokens can be either `str` or `data_structures.Token`; the `data_structures` module comes from the doublethinklab/data-structures.git repository. Pipelines are specific to a language and genre, and apply to either text or tokens.
Most basic interface example:

```python
from text_cleaning import get_pipelines

clean_text, clean_tokens = get_pipelines('en', 'facebook')
text = clean_text('Whatever the text is')
# note that tokens can be either strings or `data_structures.Token`s
tokens = some_nlp_function_that_tokenizes(text)
tokens = clean_tokens(tokens)
```
Note that we also have a 'generic' option for the genre.
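For example, consumers with no genre-specific source can ask for the 'generic' pipelines. The sketch below is illustrative only: the whitespace split stands in for a real tokenizer, and it relies on the documented fact that plain `str` tokens are accepted.

```python
from text_cleaning import get_pipelines

# 'generic' asks for the genre-agnostic defaults
clean_text, clean_tokens = get_pipelines('en', 'generic')

text = clean_text('Some raw text from an unknown source...')
# a whitespace split stands in for a real tokenizer here;
# plain `str` tokens are accepted, as are `data_structures.Token`s
tokens = clean_tokens(text.split())
```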
The idea is that, behind the scenes, decisions have already been made about how best to clean text and tokens for each genre and language. Consumers can simply use these cleaners without needing to worry about the details. The design does, however, allow any of those details to be tweaked. Just be aware that, in our experience, composing cleaning functions into a pipeline is sensitive to configuration and error-prone: considerable time was spent getting the unit tests to pass.
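As a rough illustration of what such tweaking could look like, the sketch below hand-rolls a pipeline from two of the cleaning function classes. It assumes instances are constructed with no arguments and are callables taking and returning a string; the real constructor signatures, and any official pipeline-building helper, may differ.

```python
from text_cleaning.functions.text import LowerCaseText, StandardizeText

# Hand-rolled pipeline: apply each cleaning function left to right.
# Assumes each instance is a callable taking and returning a `str`.
steps = [StandardizeText(), LowerCaseText()]

def clean(text: str) -> str:
    for step in steps:
        text = step(text)
    return text

print(clean('Whatever   the TEXT is'))
```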
Some cleaning functions break if certain others run before them. The following table attempts to track these dependencies. Note that since all classes are in the `text_cleaning` module, we omit that prefix for brevity.
| Function | Dependencies |
|---|---|
| `functions.text.StandardizeText` | |
| `functions.text.LowerCaseText` | |
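One way to keep a hand-built pipeline honest against this table is to encode it as a mapping and check ordering before running, as sketched below. The dependency entry shown is hypothetical (the rows above list no dependencies yet), as are the no-argument constructors.

```python
from text_cleaning.functions.text import LowerCaseText, StandardizeText

# Hypothetical mirror of the table: function class -> classes that must run first.
# The entry below is for illustration only; it is not taken from the table.
DEPENDENCIES = {
    LowerCaseText: {StandardizeText},
}

def check_order(steps):
    """Raise if any step runs before one of its table dependencies."""
    seen = set()
    for step in steps:
        missing = DEPENDENCIES.get(type(step), set()) - seen
        if missing:
            names = ', '.join(cls.__name__ for cls in missing)
            raise ValueError(f'{type(step).__name__} must come after: {names}')
        seen.add(type(step))

check_order([StandardizeText(), LowerCaseText()])  # passes: order respected
```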