Improvements and bugfixes #142

Closed

@csavelief (Collaborator) commented Sep 26, 2018

Improvements to StringIterator:

  • Add better tests (easier to read).
  • Classify char code 160 (the non-breaking space, U+00A0) as whitespace.
  • Split text cleaning and sentence parsing. Formatting of texts extracted from PDF files with Apache Tika now works quite well.
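
For context on the char code 160 item: Java's `Character.isWhitespace` deliberately returns `false` for the non-breaking space, so text extracted from PDFs can keep stray U+00A0 characters unless they are handled explicitly. Below is a minimal sketch of the idea, not the code from this PR; the class and method names (`NbspWhitespace`, `isWhitespaceOrNbsp`, `normalizeSpaces`) are hypothetical.

```java
final class NbspWhitespace {
    /** Code point 160 (U+00A0), the non-breaking space. */
    static final char NBSP = '\u00A0';

    /** Hypothetical helper: Java whitespace plus the non-breaking space. */
    static boolean isWhitespaceOrNbsp(char c) {
        return Character.isWhitespace(c) || c == NBSP;
    }

    /** Collapses runs of whitespace (including NBSP) into one ASCII space and trims both ends. */
    static String normalizeSpaces(String text) {
        StringBuilder out = new StringBuilder(text.length());
        boolean pendingSpace = false;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (isWhitespaceOrNbsp(c)) {
                // Remember the gap, but only once some content has been emitted.
                pendingSpace = out.length() > 0;
            } else {
                if (pendingSpace) {
                    out.append(' ');
                }
                pendingSpace = false;
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(Character.isWhitespace(NBSP));   // false
        System.out.println(isWhitespaceOrNbsp(NBSP));       // true
        System.out.println(normalizeSpaces("Apache\u00A0\u00A0Tika \n output\u00A0"));
    }
}
```

Keeping the character test in one small predicate means the cleaning step and the sentence parser can share the same definition of whitespace.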

Closes Issue 104: TokenInstanceIterator does not iterate over more than one Instance.
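
To illustrate the behavior Issue 104 asks for, here is a rough sketch of an iterator that keeps going across instance boundaries. It is an assumption-laden stand-in, not Mallet's `TokenInstanceIterator`: instances are modeled as plain `List<String>` token lists, and `FlatTokenIterator` and `collect` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

final class FlatTokenIterator implements Iterator<String> {
    private final Iterator<List<String>> instances;
    private Iterator<String> current = Collections.emptyIterator();

    FlatTokenIterator(Iterator<List<String>> instances) {
        this.instances = instances;
    }

    @Override
    public boolean hasNext() {
        // Advance past exhausted (or empty) instances instead of stopping after the first one.
        while (!current.hasNext() && instances.hasNext()) {
            current = instances.next().iterator();
        }
        return current.hasNext();
    }

    @Override
    public String next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        return current.next();
    }

    /** Convenience helper: drains every token from every instance into one list. */
    static List<String> collect(List<List<String>> instances) {
        List<String> tokens = new ArrayList<>();
        Iterator<String> it = new FlatTokenIterator(instances.iterator());
        while (it.hasNext()) {
            tokens.add(it.next());
        }
        return tokens;
    }
}
```

The loop in `hasNext()` is the important part: a single `if` there would stop at the first empty instance.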

Closes Issue 126: printDocumentTopics() throws an IndexOutOfBoundsException if the number of topics is not the same as the number of documents.
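
The symptom in Issue 126 (failure only when the topic count differs from the document count) is typical of a loop bounded by the wrong dimension. The sketch below shows that general pattern and a safe way to write it; the actual cause in Mallet is not stated here, and `DocTopicPrinter`, `format`, and the `double[][]` layout are assumptions made for illustration.

```java
final class DocTopicPrinter {
    /**
     * Formats one line per document: the document index followed by its topic weights,
     * tab-separated. docTopics[doc][topic] is an assumed layout.
     */
    static String format(double[][] docTopics) {
        StringBuilder sb = new StringBuilder();
        for (int doc = 0; doc < docTopics.length; doc++) {
            sb.append(doc);
            // Bound the inner loop by this row's topic count, never by the document count.
            for (int topic = 0; topic < docTopics[doc].length; topic++) {
                sb.append('\t').append(docTopics[doc][topic]);
            }
            sb.append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // 3 documents, 2 topics: the two dimensions differ on purpose.
        double[][] docTopics = { {0.5, 0.5}, {0.25, 0.75}, {1.0, 0.0} };
        System.out.print(format(docTopics));
    }
}
```

A test with unequal dimensions, as in `main` above, is what catches this kind of bug; a square matrix hides it.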

Be aware that, due to compilation issues on Windows 10, I had to remove the symlink from lib/errorprone.jar in the build.xml file (commit ec265c3). Tell me if I need to roll it back for the pull request.

@csavelief changed the title from "Improvements on StringIterator" to "Improvements and bugfixes" on Sep 26, 2018
@csavelief
Collaborator Author

Update (2019-05-11):

  • Rebase MNCC/Mallet/master onto mimno/Mallet/master
  • Fix compilation issue (remove Google Guava)

csavelief added 9 commits June 27, 2019 13:20
- Add better tests (easier to read).
- Classify char code 160 (the non-breaking space, U+00A0) as whitespace.
- Split text cleaning and sentence parsing. Formatting of texts extracted from PDF files with Apache Tika now works quite well.
- Closes Issue 104: TokenInstanceIterator does not iterate over more than one Instance.
- Closes Issue 126: printDocumentTopics() throws an IndexOutOfBoundsException if the number of topics is not the same as the number of documents.
@csavelief csavelief closed this Jun 27, 2019