Corpy is a simple manager for textual corpora, written in Python.
- Handling multiple documents
- Dictionary size limited by a threshold
- Handling of sections (train, test,...) for ML purposes
- Several modes for extracting textual chunks
Corpy is a class that takes a list of textual documents as input.
corpus = Corpy(_list_of_texts_)
creates a textual corpus.
Parameters (values in bold are the defaults):
- one_document: concatenates all the documents into one, separated by spaces. Values: True, False
- mode: encodes the text with words or characters. Values: 'word', 'char'
- lower: sets the full text lowercase. Values: True, False
- threshold: limits the dictionary length. Values: None keeps the full dictionary; a float < 1 keeps items only while the cumulative frequency distribution is below that float; an int cuts the dictionary after that many items.
- text_sections: divides the full text according to the values passed as a list or tuple. The values are normalized to sum to 1. For example, [2,1,1] creates three sections whose lengths are 50%, 25% and 25% of the full text. Values: list or tuple. Default: (1,) (only one section).
- text_sections_level: sections can be created by counting either single items (words or chars) or whole documents (books). Values: 'item', 'book'. If 'book', each document in a section has at least 75% of its text in that section.
- threshold_section: defines which section is used to cut the dictionary. Values: 'first' (only the first section), 'all' (threshold is applied to the full text), int: the index of the chosen section.
- init_books_seq: order of the input documents. Values: 'normal' (the same order as read from the disk), 'random' (randomized order), or a list of document indices. If len(list) < len(books), only the documents in the list are used.
- punct: punctuation marks treated as single items (word mode only). Values: string or list. Default: "'.,!?«»:;()[]-""
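The behaviour of text_sections and threshold described above can be illustrated with a small standalone sketch. It does not use Corpy itself: the function names `section_bounds` and `cut_dictionary` are hypothetical, and the exact cut-off rule for a float threshold (keep items until the cumulative share would exceed the threshold) is an assumption, not necessarily what Corpy does internally.

```python
from collections import Counter


def section_bounds(n_items, text_sections=(1,)):
    """Return (start, end) item indices for each section (hypothetical helper)."""
    total = sum(text_sections)  # weights are normalized to sum to 1
    bounds, start, cumulative = [], 0, 0
    for weight in text_sections:
        cumulative += weight
        end = round(n_items * cumulative / total)
        bounds.append((start, end))
        start = end
    return bounds


def cut_dictionary(items, threshold=None):
    """Return the kept dictionary entries, most frequent first (hypothetical helper)."""
    ranked = Counter(items).most_common()
    if threshold is None:
        # Keep the full dictionary.
        return [item for item, _ in ranked]
    if isinstance(threshold, int):
        # Cut after the threshold-th item.
        return [item for item, _ in ranked[:threshold]]
    # Float threshold (assumed rule): stop before the cumulative share exceeds it.
    total = sum(count for _, count in ranked)
    kept, cumulative = [], 0
    for item, count in ranked:
        if (cumulative + count) / total > threshold:
            break
        kept.append(item)
        cumulative += count
    return kept


words = "a a a a a b b b c c".split()
print(section_bounds(len(words), [2, 1, 1]))
print(cut_dictionary(words, 2))
print(cut_dictionary(words, 0.6))
```

With [2,1,1] the first section covers half of the items and the other two a quarter each, matching the 50%, 25%, 25% example in the parameter list.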
Parameters:
- chunk_len: textual chunk length (int). Default: 30
- chunk_mode: defines how the textual chunk is selected. If 'normal', the chunk is selected with the book_sel and chunk_sel parameters (see below). If 'sequential', chunks are selected one after another, each call continuing from the previous one. If 'random', the chunk is chosen randomly.
- book_sel: when chunk_mode is 'normal', defines the document to get the chunk from.
- chunk_sel: when chunk_mode is 'normal', defines the chunk starting point in the selected document.
- padding: when chunk_mode is 'sequential', defines how many steps forward the next chunk will start. Default: 0 (the next chunk starts a full chunk length ahead).
- last_element: if True, the chunk length is chunk_len+1, but it doesn't affect the padding.
- output_mode: defines the type of output chunk. It can be: 'item' as a list of sequential words or characters (depending on the Corpy mode); 'string' as a concatenated string of items; 'code' as a list of numbers defined by the items dictionary.
- section: the section where the chunk is taken from.
- unk: the symbol used for unknown items. With output_mode='code', it can be 'max' (inserts code = max item) or 'none' (inserts None). With output_mode='item' or 'string', the passed string is inserted.
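The interplay of chunk_len, padding and last_element in 'sequential' mode can be sketched as follows. This is a minimal illustration under stated assumptions, not Corpy's own code: `sequential_chunks` is a hypothetical name, and the sketch assumes that incomplete chunks at the end of the text are simply dropped.

```python
def sequential_chunks(items, chunk_len=30, padding=0, last_element=False):
    """Yield consecutive chunks from a list of items (hypothetical helper)."""
    # padding == 0 means the next chunk starts a full chunk length ahead.
    step = padding if padding > 0 else chunk_len
    # last_element lengthens the chunk by one item but leaves the step unchanged.
    size = chunk_len + 1 if last_element else chunk_len
    start = 0
    while start + size <= len(items):  # assumption: drop incomplete tail chunks
        yield items[start:start + size]
        start += step


items = list(range(10))
print(list(sequential_chunks(items, chunk_len=4)))
print(list(sequential_chunks(items, chunk_len=4, padding=2)))
print(list(sequential_chunks(items, chunk_len=4, last_element=True)))
```

With last_element=True each chunk ends with the first item of the next one, which may be convenient when the extra item is used as a prediction target.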
Parameters:
- section: either 'all', which resets all the sections to 0, or an int selecting the section to reset.
Parameters:
- namefile: path and file name for saving data. If it is an empty string (default), a unique file name is set.
Loads a pickle Corpy file and returns a Corpy object.
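For readers unfamiliar with pickle files, the save and load round trip can be sketched with the standard library alone. The names `save_object` and `load_object` and the timestamp-based default file name are assumptions made for illustration; Corpy's own method names and naming scheme may differ.

```python
import pickle
import time


def save_object(obj, namefile=""):
    """Pickle obj to namefile and return the path used (hypothetical helper)."""
    if namefile == "":
        # Assumption: build a unique name from the current time.
        namefile = f"corpy_{time.strftime('%Y%m%d_%H%M%S')}.pkl"
    with open(namefile, "wb") as handle:
        pickle.dump(obj, handle)
    return namefile


def load_object(namefile):
    """Read a pickle file and return the stored object (hypothetical helper)."""
    with open(namefile, "rb") as handle:
        return pickle.load(handle)
```

Any picklable object can stand in for the corpus here, for example `load_object(save_object({"mode": "word"}, "corpus.pkl"))` returns an equal dictionary.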
See corpy.ipynb Jupyter notebook.