Corpy

Corpy is a simple manager for textual corpora, written in Python.

Features:

Handling multiple documents
Dictionary limited by setting a threshold
Handling of sections (train, test,...) for ML purposes
Different modalities to get textual chunks

Usage

Corpy is a class that takes a list of textual documents as input.

corpus = Corpy(_list_of_texts_)

creates a textual corpus.

Parameters

Bold values mean Default

one_document: concatenates all the documents into one (space separator). Values: True, False
mode: encodes the text with words or characters. Values: 'word', 'char'
lower: sets the full text lowercase. Values: True, False
threshold: sets the threshold that limits the dictionary length. Values: None keeps the full dictionary; a float < 1 keeps the items for which the cumulative frequency distribution is less than that float; an int cuts the dictionary after the int-th item.
text_sections: divides the full text into sections based on the values passed in a list or a tuple. The values are normalized to sum to 1. For example, [2,1,1] creates three sections whose lengths are 50%, 25%, and 25% of the full text. Values: list or tuple. Default: (1,) (only one section).
text_sections_level: sections can be created by counting either single items (words or chars) or whole documents (books). Values: 'item', 'book'. If 'book', each document in a section has at least 75% of its text in that section.
threshold_section: defines which section is used to cut the dictionary. Values: 'first' (only the first section), 'all' (threshold is applied to the full text), int: the index of the chosen section.
init_books_seq: order of input documents. Values: 'normal' (the same order as read from the disk), 'random' (randomized order), list: list with the index of the documents. If len(list) < len(books), only the documents denoted by list will be taken into account.
punct: list of punctuation for single items (word mode only). Values: string or list. Default: "'.,!?«»:;()[]-""
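
The threshold and text_sections rules above can be illustrated with a minimal standalone sketch. It does not use Corpy's code: cut_dictionary and section_bounds are hypothetical helper names, and the float rule follows one possible reading of the description (keep the most frequent items while their cumulative share stays within the threshold).

```python
from collections import Counter


def cut_dictionary(items, threshold=None):
    """Return the items kept in the dictionary, most frequent first.

    Assumed reading of the README: None keeps everything, an int keeps
    the first `threshold` items, a float < 1 keeps items while the
    cumulative relative frequency does not exceed the threshold.
    """
    counts = Counter(items)
    # most_common() sorts by decreasing count; ties keep first-seen order.
    ranked = [item for item, _ in counts.most_common()]
    if threshold is None:
        return ranked
    if isinstance(threshold, int):
        return ranked[:threshold]
    total = sum(counts.values())
    kept, cumulative = [], 0
    for item in ranked:
        cumulative += counts[item]
        if cumulative > threshold * total:
            break
        kept.append(item)
    return kept


def section_bounds(n_items, text_sections=(1,)):
    """Normalize section weights and return (start, end) index pairs."""
    total = sum(text_sections)
    bounds, start, acc = [], 0, 0
    for weight in text_sections:
        acc += weight
        end = round(n_items * acc / total)
        bounds.append((start, end))
        start = end
    return bounds


words = "a a a a b b b c c d".split()
print(cut_dictionary(words, 2))          # int threshold
print(cut_dictionary(words, 0.75))       # float threshold
print(section_bounds(100, [2, 1, 1]))    # 50% / 25% / 25% split
```

With text_sections=[2,1,1] and 100 items, the sketch yields the 50%, 25%, 25% split described above.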

Methods:

get_chunk: returns a chunk of text from the corpus.

Parameters:

chunk_len: textual chunk length (int). Default: 30
chunk_mode: defines how the textual chunk is selected. If 'normal', the chunk is selected with the book_sel and chunk_sel parameters (see below). If 'sequential', the chunks are selected one after another, each time get_chunk is called in this mode. If 'random', the chunk is chosen randomly.
book_sel: when chunk_mode is 'normal', defines the document to get the chunk from.
chunk_sel: when chunk_mode is 'normal', defines the chunk starting point in the selected document.
padding: when chunk_mode is 'sequential', defines how many steps forward the next chunk will start. Default: 0 (full chunk length padding).
last_element: if True, the chunk length is chunk_len+1, but it doesn't affect the padding.
output_mode: defines the type of output chunk. It can be: 'item' as a list of sequential words or characters (depending on the Corpy mode); 'string' as a concatenated string of items; 'code' as a list of numbers defined by the items dictionary.
section: the section where the chunk is taken from.
unk: the symbol used for unknown items. With output_mode='code', it can be 'max' (insert the code=max item) or 'none' (insert None). With output_mode='item' or 'string', the passed string is inserted.
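
As an illustration of how chunk_len, padding and last_element interact in 'sequential' mode, here is a minimal standalone sketch. It is not Corpy's implementation: sequential_chunks is a hypothetical name, and the code assumes that padding=0 means the next chunk starts a full chunk_len forward, while any other value is the step itself.

```python
def sequential_chunks(items, chunk_len=30, padding=0, last_element=False):
    """Yield successive chunks from a list of items.

    Assumptions (not confirmed by Corpy's source):
    - padding == 0 moves forward by a full chunk_len;
    - last_element adds one item to each chunk without changing the step.
    """
    step = chunk_len if padding == 0 else padding
    size = chunk_len + 1 if last_element else chunk_len
    start = 0
    # Stop as soon as a full chunk no longer fits.
    while start + size <= len(items):
        yield items[start:start + size]
        start += step


letters = list("abcdefgh")
for chunk in sequential_chunks(letters, chunk_len=3):
    print(chunk)
for chunk in sequential_chunks(letters, chunk_len=3, padding=1, last_element=True):
    print(chunk)
```

The first loop produces non-overlapping chunks of 3 items; the second produces overlapping chunks of 4 items that advance one item at a time.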

string2code(string): converts a string to the associated sequence of codes

items2code(list): converts a list of items to the associated sequence of codes

code2items(list): converts a list of numerical codes to the associated sequence of items

code2string(list): converts a list of numerical codes to the associated string
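
The idea behind these conversions can be sketched with a plain dictionary that maps items to integer codes. This is a standalone illustration, not Corpy's code: build_dictionary, encode_items and decode_codes are hypothetical names, and the handling of unknown items ('max' mapped to one code past the largest known code) is an assumption.

```python
def build_dictionary(items):
    """Map each distinct item to an integer code, in first-seen order."""
    item2code = {}
    for item in items:
        item2code.setdefault(item, len(item2code))
    return item2code


def encode_items(items, item2code, unk="max"):
    """Convert items to codes; unknown items get an extra code or None."""
    # Assumption: 'max' means one code past the largest known code.
    unk_code = len(item2code) if unk == "max" else None
    return [item2code.get(item, unk_code) for item in items]


def decode_codes(codes, item2code, unk="<unk>"):
    """Convert codes back to items; unknown codes become the unk string."""
    code2item = {code: item for item, code in item2code.items()}
    return [code2item.get(code, unk) for code in codes]


dictionary = build_dictionary(["the", "cat", "the", "dog"])
codes = encode_items(["the", "bird", "dog"], dictionary)
print(codes)
print(" ".join(decode_codes(codes, dictionary)))
```

Joining the decoded items with a space mirrors, under the same assumptions, the step from a list of items to a string in word mode.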

reset_counter: resets the counter in 'sequential' mode.

Parameters:

section: it can be 'all', which resets all the sections to 0, or an int defining the section to reset.

save: saves the Corpy object in a pickle file.

Parameters:

namefile: path and file name for saving data. If it is an empty string (default), a unique file name is set.

Other functions:

load_corpy(filename)
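
Since save writes the object to a pickle file, saving and reloading can be pictured as a pickle round trip. The sketch below uses only the standard library; save_object and load_object are hypothetical stand-ins, and whether load_corpy works exactly this way is an assumption.

```python
import pickle
import tempfile
from pathlib import Path


def save_object(obj, namefile):
    """Write any picklable object to the given path."""
    with open(namefile, "wb") as handle:
        pickle.dump(obj, handle)


def load_object(namefile):
    """Read back an object previously written with save_object."""
    with open(namefile, "rb") as handle:
        return pickle.load(handle)


# A plain dict stands in for a corpus object in this sketch.
data = {"mode": "word", "items": ["a", "b"]}
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "corpus_demo.pkl"
    save_object(data, path)
    restored = load_object(path)

print(restored == data)
```

Pickle files can execute arbitrary code when loaded, so only files from a trusted source should be opened this way.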
