A utility library for quickly cleaning text.
Python version in the dev environment: 3.11.5
```shell
pip install -U texy
```
Pipelines with parallelization in Rust:
```python
>>> from texy.pipelines import extreme_clean, strict_clean, relaxed_clean
>>> data = ["hello ;/ from the other side 😊 \t "]
>>> print(extreme_clean(data))
['hello from the other side']
>>> print(strict_clean(data))
['hello ;/ from the other side']
>>> print(relaxed_clean(data))
['hello ;/ from the other side 😊']
```
Parallelize custom functions with Python Multiprocessing:
```python
from texy.pipelines import parallelize

def dummy(x):
    return [i[0] for i in x]

data = ["a ", "b ", "c ", "d ", "e ", "f ", "g ", "h ?."] * 100
print(parallelize(dummy, data, 2))
```
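For intuition, `parallelize` presumably chunks the input list and applies the function to each chunk in a separate worker process. A minimal stdlib sketch of that pattern, assuming chunk-and-flatten semantics (the names `parallel_apply` and `first_char` are illustrative, not part of texy's API):

```python
from multiprocessing import Pool

def first_char(chunk):
    # Process one chunk of the data: take the first character of each string.
    return [s[0] for s in chunk]

def parallel_apply(fn, data, workers):
    # Split the data into roughly equal chunks, one per worker,
    # run `fn` on each chunk in its own process, then flatten the results.
    size = (len(data) + workers - 1) // workers
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    with Pool(workers) as pool:
        results = pool.map(fn, chunks)
    return [item for chunk in results for item in chunk]

if __name__ == "__main__":
    data = ["a ", "b ", "c ", "d "] * 2
    print(parallel_apply(first_char, data, 2))
```

Chunking (rather than mapping item by item) keeps inter-process communication overhead low, which matters when the per-item work is small.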
| Pipeline | Actions |
|---|---|
| relaxed_clean | remove_newlines, remove_html, remove_xml, merge_spaces |
| strict_clean | remove_newlines, remove_urls, remove_emails, remove_html, remove_xml, remove_emoticons, remove_emojis, remove_infrequent_punctuations, merge_spaces |
| extreme_clean | remove_newlines, remove_urls, remove_emails, remove_html, remove_xml, remove_emoticons, remove_emojis, remove_all_punctuations, merge_spaces |
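To illustrate what a few of these actions do, here is a rough regex-based sketch of a relaxed-style pipeline. This is an approximation for clarity only: texy's actual implementation runs in Rust, and these function bodies are assumptions, not its real logic.

```python
import re

def remove_newlines(text):
    # Replace newline, carriage-return, and tab characters with a space.
    return re.sub(r"[\n\r\t]", " ", text)

def remove_html(text):
    # Strip anything that looks like an HTML/XML tag (naive, for illustration).
    return re.sub(r"<[^>]+>", "", text)

def merge_spaces(text):
    # Collapse runs of spaces into one and trim the ends.
    return re.sub(r" +", " ", text).strip()

def my_relaxed_clean(texts):
    # Apply each action in order to every string in the list.
    out = []
    for t in texts:
        for step in (remove_newlines, remove_html, merge_spaces):
            t = step(t)
        out.append(t)
    return out

print(my_relaxed_clean(["hello <b>world</b>\n  foo\t"]))  # ['hello world foo']
```

The ordering matters: newlines and tabs are normalized to spaces first, so a single `merge_spaces` pass at the end can clean up everything the earlier steps leave behind.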