Preprocess words #198

Open
wants to merge 12 commits into
base: New_Preprocess



@dgcrouse dgcrouse commented Jun 3, 2017

Complete rewrite of preprocess script

dgcrouse and others added 12 commits August 11, 2016 19:19
Script adds the following arguments:
  input_folder - Use a folder of documents (.txt only) as input rather than
            a single document. Invalidates the input_txt value
  case_sensitive - Treat tokens that differ only in case as separate tokens. Default true
  min_occurrences - Replace a token that occurs fewer than this many times
            with a wildcard token. Default 20
  min_documents - Replace a token that occurs in fewer than this many files
            with a wildcard token. Ignored if input_folder is not used. Default 1
  use_ascii - Ignore all non-ASCII characters when generating tokens
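
To illustrate how the thresholds described above might interact, here is a minimal sketch, not the code from this PR. The function name `replace_rare_tokens`, the `from_folder` flag, and the wildcard string `<UNK>` are hypothetical; only the option names and defaults come from the commit message.

```python
from collections import Counter

WILDCARD = "<UNK>"  # hypothetical wildcard token; the real script may use another


def replace_rare_tokens(documents, min_occurrences=20, min_documents=1,
                        case_sensitive=True, use_ascii=False,
                        from_folder=False):
    """Replace rare tokens with a wildcard.

    documents: list of token lists, one list per input file.
    from_folder: assumed stand-in for "input_folder was used"; the
        min_documents threshold is ignored when this is False.
    """
    def normalize(token):
        if not case_sensitive:
            token = token.lower()
        if use_ascii:
            # Drop every non-ASCII character from the token.
            token = token.encode("ascii", "ignore").decode("ascii")
        return token

    # Normalize, then discard tokens left empty by ASCII filtering.
    docs = [[t for t in map(normalize, doc) if t] for doc in documents]

    total = Counter(t for doc in docs for t in doc)          # corpus-wide counts
    doc_freq = Counter(t for doc in docs for t in set(doc))  # files containing t

    keep = {
        t for t in total
        if total[t] >= min_occurrences
        and (not from_folder or doc_freq[t] >= min_documents)
    }
    return [[t if t in keep else WILDCARD for t in doc] for doc in docs]


if __name__ == "__main__":
    sample = [["a", "a", "b"], ["a", "c"]]
    print(replace_rare_tokens(sample, min_occurrences=2,
                              min_documents=2, from_folder=True))
```

In this sketch a token must pass both thresholds to survive, which is one plausible reading of the option descriptions; the actual script may combine them differently.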
…ing, use the python script scripts/tokenizeWords.py to generate a JSON and then feed that into sample.lua as the start_tokens option
…rocess script to preprocessLegacy.py. Rewrote tokenization script entirely to support new preprocessor output. Updated Readme files.

dgcrouse commented Jun 3, 2017

Don't commit until I have had a chance to test the conflict resolution...


1 participant