Added files via upload #46

Open · wants to merge 2 commits into master
Conversation

LivKelley

Submitting a text mining project and PDF


Oops... I forgot the doctest line at the bottom. I added it in, so hopefully everything works now.
@rdiverdi

Review status: 0 of 3 files reviewed at latest revision, 10 unresolved discussions.


AnalysisofCloningonWikipedia.py, line 1 [r1] (raw file):
Nice job; everything here works well.
There are a couple of improvements you could make so the code is more efficient and more robust.
Also, did you look into sentiment analysis from the pattern library?

This was scoped as a pretty small project. It might have been good to do something like search through other pages and find the average number of positive and negative words, so you could classify pages by how close they are to neutral. You could also have normalized the output so that it wouldn't be affected by the length of the article (see the sketch below).
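A minimal sketch of that normalization idea; the function name and word sets here are hypothetical stand-ins for the lists your script already has:

def neutrality_score(text, positive_words, negative_words):
    """Distance of an article's sentiment from neutral, normalized by
    article length so longer articles aren't favored. 0.0 is neutral."""
    words = text.lower().split()
    if not words:
        return 0.0
    positive_count = sum(1 for w in words if w in positive_words)
    negative_count = sum(1 for w in words if w in negative_words)
    # Divide by the word count: the normalization step.
    return abs(positive_count - negative_count) / len(words)

print(neutrality_score("cloning is good good bad", {"good"}, {"bad"}))  # 0.2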


AnalysisofCloningonWikipedia.py, line 12 [r1] (raw file):
Where did these lists come from?


AnalysisofCloningonWikipedia.py, line 26 [r1] (raw file):
This test fails.


AnalysisofCloningonWikipedia.py, line 42 [r1] (raw file):
Why do you split the file up by lines instead of looking through the whole file at once?


AnalysisofCloningonWikipedia.py, line 46 [r1] (raw file):
Nested for loops are very slow.
In this case, it would have been better to loop through the words in b and test whether each word is in the list of negative words:

for word in b:
    if word in negative_search_words:
        ...

Also, a comment here would have been helpful to explain what the for loops are doing.


AnalysisofCloningonWikipedia.py, line 49 [r1] (raw file):
If you only ever care about the total number of negative words, it doesn't make sense to keep a dictionary that stores the count of each individual word. Instead, you could just use a variable for the total number of negative words, which starts at 0 and increments in your if statement.
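A minimal sketch of that counter approach; the values of b and negative_search_words below are stand-ins for the ones your script builds:

b = "cloning is bad and dangerous".split()     # stand-in for your word list
negative_search_words = {'bad', 'dangerous'}   # stand-in for your negative-word list

negative_word_count = 0
for word in b:
    if word in negative_search_words:
        negative_word_count += 1               # one running total, no dictionary

print(negative_word_count)   # 2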


AnalysisofCloningonWikipedia.py, line 57 [r1] (raw file):
This test will miss any words with punctuation immediately before or after them:
for example, "hi" == "hi." is False.
There is a constant called punctuation in the built-in string module, and strings have a strip method, which together let you pull the punctuation off of words:

import string
word = '.banana.'
word.strip(string.punctuation)

outputs 'banana'


AnalysisofCloningonWikipedia.py, line 65 [r1] (raw file):
You need to import doctest to use it.
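A minimal sketch of the usual pattern, assuming your tests live in this module's docstrings:

import doctest

if __name__ == '__main__':
    doctest.testmod()   # runs every doctest found in this module's docstrings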


DownloadWikiArticlesGenetics.py, line 1 [r1] (raw file):
You never use pickle here.
Pickle is essentially just a serialization scheme used to represent Python objects such as dictionaries or lists as a string of bytes. That is helpful because the bytes can be saved in a file and loaded back later.

Here, you are just storing strings, so you don't need pickle (and you don't actually use it anywhere).
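For reference, a minimal sketch of what pickle is for; the filename and dictionary below are hypothetical:

import pickle

data = {'article': 'Genetics', 'count': 3}   # hypothetical dictionary to save
with open('data.pkl', 'wb') as f:
    pickle.dump(data, f)                     # serialize the dictionary into the file
with open('data.pkl', 'rb') as f:
    restored = pickle.load(f)                # read it back as a dictionary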


DownloadWikiArticlesGenetics.py, line 13 [r1] (raw file):
I'm guessing you found this somewhere and aren't exactly sure what it does (which isn't really a problem).
It does the same thing as:

f = open( '...' , 'w')
f.write( ... )

where open opens the specified file and stores the file object in the variable f,
and f.write then writes a string into that file.
In addition, it is always good practice to close files you use; it makes sure that everything is saved correctly:

f.close()
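A common idiom that handles the close for you; the filename and contents variable here are hypothetical:

contents = 'text to save'              # hypothetical article text
with open('article.txt', 'w') as f:    # 'article.txt' is a hypothetical filename
    f.write(contents)                  # the with block closes the file automatically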


