Feature/port to py3 #184

Open
wants to merge 33 commits into develop

Changes from 27 commits

33 commits
702dbf1
Uses pycld2 instead of the (outdate) chrom[...]tector
flavioamieiro Nov 23, 2016
cb3d1d2
Removes pyparsing from requirements
flavioamieiro Nov 23, 2016
915efa7
fix cld import
geron Nov 23, 2016
0589592
prevent mongo from connecting at import time
geron Nov 23, 2016
0b4ccf6
run 2to3
geron Nov 24, 2016
e41028e
Removes redundant try/except block in urlparse import
flavioamieiro Nov 24, 2016
ccfb5d9
Pins celery version
flavioamieiro Nov 24, 2016
01a5fa6
Removes unnecessary cast to list that 2to3 inserted
flavioamieiro Nov 24, 2016
b16be95
Fixes test that expected str but receives bytes
flavioamieiro Nov 24, 2016
21aa0a6
Adds test to make sure the 'process' method receives the expected data
flavioamieiro Nov 24, 2016
7d540d0
Fixes existing base task test
flavioamieiro Nov 24, 2016
aa4478a
Uses BytesIO instead of StringIO in wordcloud
flavioamieiro Nov 24, 2016
d311b74
Changes Wordcloud test not to touch the database
flavioamieiro Nov 24, 2016
65c07b1
Changes palavras_raw test to not touch the database
flavioamieiro Nov 24, 2016
9c8f952
Fix freqdist test and sorting
geron Nov 25, 2016
05594a1
fix spellchecker tests
geron Nov 25, 2016
7b31c98
spellchecker: warn if dictionary is missing
geron Nov 25, 2016
00cce60
fix test_unknown_mimetype_should_be_flagged test
geron Nov 25, 2016
afaaa0b
Update TestExtractorWorker.test_unknown_encoding_should_be_ignored
geron Nov 25, 2016
427da7d
fix TestExtractorWorker.test_unescape_html_entities
geron Nov 25, 2016
2c0f8e8
fix TestExtractorWorker.test_should_detect_encoding_and_return_a_unic…
geron Nov 25, 2016
6989936
fix TestExtractorWorker.test_should_guess_mimetype_for_file_without_e…
geron Nov 25, 2016
17e47cb
updated more extractor tests
geron Nov 26, 2016
4eb5f61
fix extractor.extract_pdf
geron Nov 26, 2016
24c266f
Rewrite extractor.trial_decode and write tests for it
geron Nov 27, 2016
c084132
extractor: convert text to string before calling parse_html
geron Nov 27, 2016
8e67779
extractor: fix language detection
geron Nov 27, 2016
11c203c
extractor: remove checks for text being a str, it will always be
geron Dec 2, 2016
c6b3296
extractor: remove up to 1k bytes that cld says are invalid
geron Dec 2, 2016
25a8e54
SpellingChecker: no need to check for KeyError from document keys
geron Dec 2, 2016
573a111
extractor: turn redundant tests into integration test
geron Dec 6, 2016
0265786
extractor tests: support newer version of pdfinfo
geron Dec 6, 2016
7b84def
change bigram worker to return metric names and respect bigram order
geron Jan 31, 2017
12 changes: 6 additions & 6 deletions doc/conf.py
@@ -46,8 +46,8 @@
master_doc = 'index'

# General information about the project.
project = u'PyPLN'
copyright = u'2011, Flávio Codeço Coelho'
project = 'PyPLN'
copyright = '2011, Flávio Codeço Coelho'

# The version info for the project you're documenting, acts as replacement for
# |version| and |release|, also used in various other places throughout the
@@ -187,8 +187,8 @@
# Grouping the document tree into LaTeX files. List of tuples
# (source start file, target name, title, author, documentclass [howto/manual]).
latex_documents = [
('index', 'PyPLN.tex', u'PyPLN Documentation',
u'Flávio Codeço Coelho', 'manual'),
('index', 'PyPLN.tex', 'PyPLN Documentation',
'Flávio Codeço Coelho', 'manual'),
]

# The name of an image file (relative to this directory) to place at the top of
@@ -220,6 +220,6 @@
# One entry per manual page. List of tuples
# (source start file, name, description, authors, manual section).
man_pages = [
('index', 'pypln', u'PyPLN Documentation',
[u'Flávio Codeço Coelho'], 1)
('index', 'pypln', 'PyPLN Documentation',
['Flávio Codeço Coelho'], 1)
]
2 changes: 1 addition & 1 deletion pypln/backend/celery_app.py
@@ -19,7 +19,7 @@

from celery import Celery
from kombu import Exchange, Queue
import config
from . import config

app = Celery('pypln_workers', backend='mongodb',
broker='amqp://', include=['pypln.backend.workers'])
2 changes: 1 addition & 1 deletion pypln/backend/celery_task.py
@@ -31,7 +31,7 @@
from pypln.backend import config


mongo_client = pymongo.MongoClient(host=config.MONGODB_URIS)
mongo_client = pymongo.MongoClient(host=config.MONGODB_URIS, _connect=False)
database = mongo_client[config.MONGODB_DBNAME]
document_collection = database[config.MONGODB_COLLECTION]

12 changes: 4 additions & 8 deletions pypln/backend/config.py
@@ -1,16 +1,12 @@
import os
import urllib.parse

from decouple import config, Csv

try:
import urlparse
except ImportError:
import urllib.parse as urlparse

def parse_url(url):
urlparse.uses_netloc.append('mongodb')
urlparse.uses_netloc.append('celery')
url = urlparse.urlparse(url)
urllib.parse.uses_netloc.append('mongodb')
urllib.parse.uses_netloc.append('celery')
url = urllib.parse.urlparse(url)

path = url.path[1:]
path = path.split('?', 2)[0]
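Editor's note — a quick usage sketch (mine, not part of the diff) of what urllib.parse returns for a MongoDB-style URL, which is what parse_url goes on to pick apart:

    import urllib.parse

    urllib.parse.uses_netloc.append('mongodb')  # as parse_url does above
    url = urllib.parse.urlparse('mongodb://user:secret@localhost:27017/pypln')
    # url.scheme == 'mongodb', url.hostname == 'localhost', url.port == 27017
    # url.username == 'user', url.password == 'secret', url.path == '/pypln'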
24 changes: 12 additions & 12 deletions pypln/backend/workers/__init__.py
@@ -17,18 +17,18 @@
# You should have received a copy of the GNU General Public License
# along with PyPLN. If not, see <http://www.gnu.org/licenses/>.

from extractor import Extractor
from tokenizer import Tokenizer
from freqdist import FreqDist
from pos import POS
from statistics import Statistics
from bigrams import Bigrams
from palavras_raw import PalavrasRaw
from lemmatizer_pt import Lemmatizer
from palavras_noun_phrase import NounPhrase
from palavras_semantic_tagger import SemanticTagger
from word_cloud import WordCloud
from elastic_indexer import ElasticIndexer
from .extractor import Extractor
from .tokenizer import Tokenizer
from .freqdist import FreqDist
from .pos import POS
from .statistics import Statistics
from .bigrams import Bigrams
from .palavras_raw import PalavrasRaw
from .lemmatizer_pt import Lemmatizer
from .palavras_noun_phrase import NounPhrase
from .palavras_semantic_tagger import SemanticTagger
from .word_cloud import WordCloud
from .elastic_indexer import ElasticIndexer


__all__ = ['Extractor', 'Tokenizer', 'FreqDist', 'POS', 'Statistics',
2 changes: 1 addition & 1 deletion pypln/backend/workers/bigrams.py
@@ -45,4 +45,4 @@ def process(self, document):
for m in metrics:
for res in bigram_finder.score_ngrams(getattr(bigram_measures,m)):
br[res[0]].append(res[1])
return {'metrics': metrics, 'bigram_rank': br.items()}
return {'metrics': metrics, 'bigram_rank': list(br.items())}
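Editor's note — presumably the explicit list() is needed because in Python 3 dict.items() returns a view object rather than a list, and a view can't go straight into the MongoDB result document; a small illustration (mine, not from the PR):

    from collections import defaultdict

    br = defaultdict(list)
    br[('white', 'house')].append(0.5)
    type(br.items())   # <class 'dict_items'> -- a lazy view, not a list
    list(br.items())   # [(('white', 'house'), [0.5])] -- a plain list, safe to return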
108 changes: 53 additions & 55 deletions pypln/backend/workers/extractor.py
@@ -20,14 +20,14 @@
import base64
import shlex

from HTMLParser import HTMLParser
from html.parser import HTMLParser
from tempfile import NamedTemporaryFile
from os import unlink
from subprocess import Popen, PIPE
from mimetypes import guess_type
from re import compile as regexp_compile, DOTALL, escape

import cld
import pycld2 as cld
import magic

from pypln.backend.celery_task import PyPLNTask
@@ -84,10 +84,10 @@ def parse_html(html, remove_tags=None, remove_inside=None,
[''] * (total_to_remove - 2)
content_between[index + 1] = '\n'
complete_tags.append('')
result = ''.join(sum(zip(content_between, complete_tags), tuple()))
result = ''.join(sum(list(zip(content_between, complete_tags)), tuple()))
return clean(result)

def get_pdf_metadata(data):
def get_pdf_metadata(data: str) -> dict:
lines = data.strip().splitlines()
metadata = {}
for line in lines:
@@ -98,7 +98,7 @@ def get_pdf_metadata(data):
metadata[key.strip()] = value.strip()
return metadata

def extract_pdf(data):
def extract_pdf(data: bytes) -> (str, dict):
temp = NamedTemporaryFile(delete=False)
filename = temp.name
temp.close()
@@ -112,14 +112,16 @@ def extract_pdf(data):
unlink(filename + '_ind.html')
unlink(filename + 's.html')
text = parse_html(html.replace('&#160;', ' '), True, ['script', 'style'])
pdfinfo = Popen(shlex.split('pdfinfo -'), stdin=PIPE, stdout=PIPE,
stderr=PIPE)
meta_out, meta_err = pdfinfo.communicate(input=data)

info_process = Popen(shlex.split('pdfinfo -'), stdin=PIPE, stdout=PIPE,
stderr=PIPE)
meta_out, meta_err = info_process.communicate(input=data)
try:
metadata = get_pdf_metadata(meta_out)
except:
metadata = get_pdf_metadata(meta_out.decode('utf-8'))
except Exception:
# TODO: what should I do here?
metadata = {}
#TODO: what should I do here?

if not (text and metadata):
return '', {}
elif not html_err:
@@ -128,41 +128,30 @@
return '', {}


def trial_decode(text):
def decode_text_bytes(text: bytes) -> str:
"""
Tries to detect text encoding using `magic`. If the detected encoding is
not supported, try utf-8, iso-8859-1 and ultimately falls back to decoding
as utf-8 replacing invalid chars with `U+FFFD` (the replacement character).

This is far from an ideal solution, but the extractor and the rest of the
pipeline need an unicode object.
Tries to detect text encoding using file magic. If that fails or the
detected encoding is not supported, tries using utf-8. If that doesn't work
tries using iso8859-1.
"""
with magic.Magic(flags=magic.MAGIC_MIME_ENCODING) as m:
content_encoding = m.id_buffer(text)

forced_decoding = False
try:
result = text.decode(content_encoding)
except LookupError:
# If the detected encoding is not supported, we try to decode it as
# utf-8.
with magic.Magic(flags=magic.MAGIC_MIME_ENCODING) as m:
content_encoding = m.id_buffer(text)
except magic.MagicError:
pass # This can happen for instance if text is a single char
else:
try:
result = text.decode('utf-8')
except UnicodeDecodeError:
# Is there a better way of doing this than nesting try/except
# blocks? This smells really bad.
try:
result = text.decode('iso-8859-1')
except UnicodeDecodeError:
# If neither utf-8 nor iso-885901 work are capable of handling
# this text, we just decode it using utf-8 and replace invalid
# chars with U+FFFD.
# Two somewhat arbitrary decisions were made here: use utf-8
# and use 'replace' instead of 'ignore'.
result = text.decode('utf-8', 'replace')
forced_decoding = True

return result, forced_decoding
return text.decode(content_encoding)
except LookupError: # The detected encoding is not supported
pass

try:
result = text.decode('utf-8')
except UnicodeDecodeError:
# Decoding with iso8859-1 doesn't raise UnicodeDecodeError, so this is
# a last resort.
result = text.decode('iso8859-1')
return result

Member

From what I understand here, the behavior didn't change because we were never executing result = text.decode('utf-8', 'replace') (since decoding with iso8859-1 never raises a UnicodeDecodeError), right? If that's the case, perfect. If not, I think it would be a good idea to keep the forced decoding.

Do you know why decoding with iso8859-1 never raises this exception? (I'm not doubting it, just didn't understand the reason :) )

Author

Because iso8859-1 is a single-byte encoding, every byte from \x00 to \xff is a "valid" char in it. If, for instance, you erroneously decode chars that were encoded in utf8, you just get a sequence of strange chars (a.k.a. mojibake) for each multi-byte utf8 char. If you can provide a test case that proves otherwise, I'd be glad to change my mind though.

A sort of proof for what I said about every byte being a valid iso8859-1 char:

In [1]: import struct

In [2]: for i in range(256):
   ...:     byte = struct.pack('B', i)
   ...:     char = byte.decode('iso8859-1')
   ...:     print(i, repr(char), char)
   ...:
[output truncated: every value from 0 ('\x00') through 255 ('ÿ') decodes without raising an exception]

Member

@geron that makes perfect sense to me. As I said, I didn't doubt it, I just didn't understand it before. Now I am a little ashamed of not realizing that.
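
Editor's note — a minimal sketch of the mojibake behaviour described above (mine, not part of this PR): decoding UTF-8 bytes as iso8859-1 never raises, it just yields the wrong characters, which is why it works as a last resort in decode_text_bytes.

    >>> 'ação'.encode('utf-8')
    b'a\xc3\xa7\xc3\xa3o'
    >>> 'ação'.encode('utf-8').decode('iso8859-1')
    'aÃ§Ã£o'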


class Extractor(PyPLNTask):
@@ -173,11 +164,12 @@ def process(self, file_data):
contents = base64.b64decode(file_data['contents'])
with magic.Magic(flags=magic.MAGIC_MIME_TYPE) as m:
file_mime_type = m.id_buffer(contents)

metadata = {}
if file_mime_type == 'text/plain':
text = contents
elif file_mime_type == 'text/html':
text = parse_html(contents, True, ['script', 'style'])
if file_mime_type in ('text/plain', 'text/html'):
text = decode_text_bytes(contents)
if file_mime_type == 'text/html':
text = parse_html(text, True, ['script', 'style'])
elif file_mime_type == 'application/pdf':
text, metadata = extract_pdf(contents)
else:
Expand All @@ -191,9 +183,7 @@ def process(self, file_data):
return {'mimetype': 'unknown', 'text': "",
'file_metadata': {}, 'language': ""}

text, forced_decoding = trial_decode(text)

if isinstance(text, unicode):
if isinstance(text, str):
# HTMLParser only handles unicode objects. We can't pass the text
# through it if we don't know the encoding, and it's possible we
# also shouldn't. There's no way of knowing if it's a badly encoded
@@ -203,10 +193,18 @@

text = clean(text)

if isinstance(text, unicode):
language = cld.detect(text.encode('utf-8'))[1]
if isinstance(text, str):
languages = cld.detect(text.encode('utf-8'))[2]
else:
language = cld.detect(text)[1]

return {'text': text, 'file_metadata': metadata, 'language': language,
'mimetype': file_mime_type, 'forced_decoding': forced_decoding}
languages = cld.detect(text)[2]

detected_language = None
if languages:
detected_language = languages[0][1]

# TODO: check for uses of forced_decoding and remove them
return {'text': text,
'file_metadata': metadata,
'language': detected_language,
'mimetype': file_mime_type,
'forced_decoding': None}
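Editor's note (based on pycld2's documented interface; not part of this diff): detect() returns a 3-tuple (is_reliable, bytes_found, details), where details holds (language_name, language_code, percent, score) entries padded with 'Unknown', so details[0][1] is the code of the most likely language. A rough sketch:

    import pycld2 as cld

    is_reliable, bytes_found, details = cld.detect('este é um texto em português'.encode('utf-8'))
    # details is something like (('PORTUGUESE', 'pt', 97, 1024.0), ('Unknown', 'un', 0, 0.0), ...)
    language = details[0][1] if details else None   # 'pt'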
4 changes: 2 additions & 2 deletions pypln/backend/workers/freqdist.py
@@ -27,7 +27,7 @@ def process(self, document):
tokens = [info.lower() for info in document_tokens]
frequency_distribution = {token: tokens.count(token) \
for token in set(tokens)}
fd = frequency_distribution.items()
fd.sort(lambda x, y: cmp(y[1], x[1]))
fd = list(frequency_distribution.items())
fd.sort(key=lambda x: (-x[1], x[0]))

return {'freqdist': fd}
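Editor's note — a small sketch (mine, not from the PR) of what the key-based sort does: it orders by descending count and, unlike the old cmp-based sort, also breaks ties alphabetically, so the output is deterministic.

    freqs = {'the': 3, 'sat': 1, 'cat': 1}
    fd = list(freqs.items())
    fd.sort(key=lambda x: (-x[1], x[0]))
    # [('the', 3), ('cat', 1), ('sat', 1)]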
2 changes: 1 addition & 1 deletion pypln/backend/workers/palavras_noun_phrase.py
@@ -40,7 +40,7 @@ def process(self, document):
stdout=subprocess.PIPE,
stderr=subprocess.PIPE)
palavras_output = document['palavras_raw']
if isinstance(palavras_output, unicode):
if isinstance(palavras_output, str):
# we *need* to send a 'str' to the process. Otherwise it's going to try to use ascii.
palavras_output = palavras_output.encode('utf-8')
stdout, stderr = process.communicate(palavras_output)
19 changes: 10 additions & 9 deletions pypln/backend/workers/palavras_raw.py
@@ -39,14 +39,15 @@ def process(self, document):

text = document['text']

# For some reason, in some pypln installations the document['text'] is
# not always unicode as it should be. This may be due to errors during
# the decoding process that we fixed earlier. That meant that, when we
# got a non-unicode string, python would try to decode it using the
# default codec (ascii) in `text.encode(PALAVRAS_ENCODING)`. Since we
# know the text came from mongodb, we can just decode it using utf-8 to
# make sure we have a unicode object.
if not isinstance(text, unicode):
# This code is here because when using python2 for some
# reason, sometimes document['text'] was not a unicode object
# (as it should be, coming from pymongo). Since we're now
# using python3, we should really always get a str (unicode)
# object. But, since we do not know the real reason for the
# original error, we will keep this code here for now. As
# before, if we receive a bytes object, since it came from
# mongodb we can be sure it will be encoded in utf-8.
if isinstance(text, bytes):
text = text.decode('utf-8')

process = subprocess.Popen([BASE_PARSER, PARSER_MODE],
@@ -55,4 +56,4 @@
stderr=subprocess.PIPE)
stdout, stderr = process.communicate(text.encode(PALAVRAS_ENCODING))

return {'palavras_raw': stdout, 'palavras_raw_ran': True}
return {'palavras_raw': stdout.decode('utf-8'), 'palavras_raw_ran': True}