Preprocess words #198

dgcrouse · 2017-06-03T07:20:11Z

Complete rewrite of preprocess script

The script adds the following arguments:

- input_folder: Use a folder of documents (.txt only) as input rather than a single document. When set, the input_txt value is ignored.
- case_sensitive: Treat tokens that differ only in case as separate tokens. Default true.
- min_occurrences: Replace a token that occurs fewer than this many times with a wildcard token. Default 20.
- min_documents: Replace a token that occurs in fewer than this many files with a wildcard token. Ignored if input_folder is not used. Default 1.
- use_ascii: Ignore all non-ASCII characters when generating tokens.
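As a rough illustration of how the min_occurrences and min_documents thresholds described above could interact, here is a minimal sketch. It is not the code from the preprocess script: the function name `replace_rare_tokens`, the `WILDCARD` value, and the list-of-token-lists input shape are all assumptions made for the example.

```python
from collections import Counter

# Hypothetical wildcard token; the actual script may use a different one.
WILDCARD = "<UNK>"


def replace_rare_tokens(documents, min_occurrences=20, min_documents=1,
                        case_sensitive=True):
    """Replace rare tokens with WILDCARD.

    documents: list of token lists, one list per input file (assumed shape).
    A token is kept only if it appears at least min_occurrences times overall
    and in at least min_documents distinct documents.
    """
    if not case_sensitive:
        # Fold case so "The" and "the" count as the same token.
        documents = [[tok.lower() for tok in doc] for doc in documents]

    # Total occurrences across all documents.
    total_counts = Counter(tok for doc in documents for tok in doc)
    # Number of distinct documents each token appears in.
    doc_counts = Counter(tok for doc in documents for tok in set(doc))

    def keep(tok):
        return (total_counts[tok] >= min_occurrences
                and doc_counts[tok] >= min_documents)

    return [[tok if keep(tok) else WILDCARD for tok in doc]
            for doc in documents]


docs = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "cat", "ran"]]
result = replace_rare_tokens(docs, min_occurrences=2, min_documents=2)
```

With a single input document every token appears in exactly one file, so the default min_documents of 1 has no effect there, which matches the note that the option is ignored unless input_folder is used.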

dgcrouse · 2017-06-03T07:24:31Z

Don't commit until I have had a chance to test the conflict resolution...

dgcrouse and others added 12 commits August 11, 2016 19:19

Accidentally used a reserved word

19361a4

Major rewrite to preprocessWords.py, fixed lots of errors, simplified and made more robust.

53caddd

Fixed multiple issues pertaining to wildcards, e.g. overtraining, also fixed Unicode support

5e9b1a4

Added more flexibility for wildcards. Fixed bug where wildcards were capped rather than floored.

d0fbb4e

Added support for seed text in word mode. To ensure constancy of parsing, use the python script scripts/tokenizeWords.py to generate a JSON and then feed that into sample.lua as the start_tokens option

8190d7f

Updated readme files

bd5e1a6

Fixed bad link in README.md

b796c3f

Fixed additional error with README.md link

5da9015

Unified word and character preprocessing scripts. Moved original preprocess script to preprocessLegacy.py. Rewrote tokenization script entirely to support new preprocessor output. Updated Readme files.

f77c0d7

Quick readme fixes

d69106a

Merge branch 'New_Preprocess' into preprocessWords

45db245
