Feature/port to py3 #184
Open
geron wants to merge 33 commits into NAMD:develop from geron:feature/port_to_py3
Commits (33):
702dbf1 Uses pycld2 instead of the (outdate) chrom[...]tector (flavioamieiro)
cb3d1d2 Removes pyparsing from requirements (flavioamieiro)
915efa7 fix cld import (geron)
0589592 prevent mongo from connecting at import time (geron)
0b4ccf6 run 2to3 (geron)
e41028e Removes redundant try/except block in urlparse import (flavioamieiro)
ccfb5d9 Pins celery version (flavioamieiro)
01a5fa6 Removes unnecessary cast to list that 2to3 inserted (flavioamieiro)
b16be95 Fixes test that expected str but receives bytes (flavioamieiro)
21aa0a6 Adds test to make sure the 'process' method receives the expected data (flavioamieiro)
7d540d0 Fixes existing base task test (flavioamieiro)
aa4478a Uses BytesIO instead of StringIO in wordcloud (flavioamieiro)
d311b74 Changes Wordcloud test not to touch the database (flavioamieiro)
65c07b1 Changes palavras_raw test to not touch the database (flavioamieiro)
9c8f952 Fix freqdist test and sorting (geron)
05594a1 fix spellchecker tests (geron)
7b31c98 spellchecker: warn if dictionary is missing (geron)
00cce60 fix test_unknown_mimetype_should_be_flagged test (geron)
afaaa0b Update TestExtractorWorker.test_unknown_encoding_should_be_ignored (geron)
427da7d fix TestExtractorWorker.test_unescape_html_entities (geron)
2c0f8e8 fix TestExtractorWorker.test_should_detect_encoding_and_return_a_unic… (geron)
6989936 fix TestExtractorWorker.test_should_guess_mimetype_for_file_without_e… (geron)
17e47cb updated more extractor tests (geron)
4eb5f61 fix extractor.extract_pdf (geron)
24c266f Rewrite extractor.trial_decode and write tests for it (geron)
c084132 extractor: convert text to string before calling parse_html (geron)
8e67779 extractor: fix language detection (geron)
11c203c extractor: remove checks for text being a str, it will always be (geron)
c6b3296 extractor: remove up to 1k bytes that cld says are invalid (geron)
25a8e54 SpellingChecker: no need to check for KeyError from document keys (geron)
573a111 extractor: turn redundant tests into integration test (geron)
0265786 extractor tests: support newer version of pdfinfo (geron)
7b84def change bigram worker to return metric names and respect bigram order (geron)
From what I understand here, the behavior didn't change because we were never executing result = text.decode('utf-8', 'replace') (since decoding with iso8859-1 never raises a UnicodeDecodeError), right? If that's the case, perfect. If not, I think it would be a good idea to keep the forced decoding.
Do you know why decoding with iso8859-1 never raises this exception? (I'm not doubting it, I just didn't understand the reason :) )
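To make the question concrete, here is a hypothetical sketch of the pattern under discussion, assuming Python 3. It is not the project's actual trial_decode: the function name, the encoding list, and the return shape are all assumptions made for illustration. It shows why a forced utf-8 fallback placed after an iso8859-1 attempt is effectively dead code.

```python
def trial_decode_sketch(raw, encodings=("utf-8", "iso8859-1")):
    """Hypothetical helper: try each encoding in order, then force a decode."""
    for encoding in encodings:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    # Only reached if every encoding above raised, which cannot happen
    # while iso8859-1 is in the list (it accepts any byte sequence).
    return raw.decode("utf-8", "replace"), "utf-8 (forced)"


# b"caf\xe9" is invalid UTF-8 but valid iso8859-1, so the second attempt wins
# and the forced decode is never executed.
text, used = trial_decode_sketch(b"caf\xe9")
print(text, used)
```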
Because iso8859-1 is a single-byte encoding and every byte from \x00 to \xff is a "valid" char in it. If, for instance, you were to erroneously decode UTF-8 encoded text as iso8859-1, you would just get a sequence of strange chars (a.k.a. mojibake) in place of each multi-byte UTF-8 char, not an exception. If you can provide a test case that proves otherwise I'd be glad to change my mind though.
A sort of proof for what I said about every byte being a valid iso8859-1 char:
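A minimal sketch of such a check, assuming Python 3 (this is an illustration, not necessarily the snippet originally posted): it decodes every possible byte value with iso8859-1 and confirms that each byte maps to the code point with the same value, then shows the mojibake case.

```python
# Every possible byte value, 0x00 through 0xff.
all_bytes = bytes(range(256))

# Decoding never raises: each byte maps to the code point of the same value.
decoded = all_bytes.decode("iso8859-1")
assert len(decoded) == 256
assert [ord(ch) for ch in decoded] == list(range(256))

# UTF-8 bytes decoded as iso8859-1 give mojibake rather than an error:
# the two bytes of "é" (0xc3 0xa9) become two separate characters.
mojibake = "é".encode("utf-8").decode("iso8859-1")
print(repr(mojibake))  # 'Ã©'
```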
@geron that makes perfect sense to me. As I said, I didn't doubt it, I just didn't understand it before. Now I am a little ashamed of not realizing that.