
Inconsistent keyword list in Parser.getKeywords() causes inconsistent scoring #11

schmamps opened this issue May 22, 2018 · 1 comment
I expect that the issues and PRs I've posted here are going to my own fork and no further, but if anyone out there is reading, your feedback is welcome.

In Parser.getKeywords():

# ...
uniqueWords = list(set(words))

keywords = [{'word': word, 'count': words.count(word)} for word in uniqueWords] 
keywords = sorted(keywords, key=lambda x: -x['count'])

# ... returns: ...
keywords[0]  = {'word': 'foo', 'count': 100}  # rank: 1st (most common keyword)
# ...
keywords[7]  = {'word': 'bacon', 'count': 32} # rank: 8th
keywords[8]  = {'word': 'eggs', 'count': 25}  # rank: 9th (three-way tie)
keywords[9]  = {'word': 'spam', 'count': 25}  # rank: 9th (three-way tie)
keywords[10] = {'word': 'ham', 'count': 25}   # rank: 9th (tied, but cut off by keywords[:10])
keywords[11] = {'word': 'bar', 'count': 19}   # rank: 12th
# ...

set() does not preserve any particular order (for strings, CPython randomizes hashes between runs), so uniqueWords is an arbitrarily ordered list. sorted() is stable, so keywords tied on count keep that arbitrary relative order, and which of the tied keywords land inside the keywords[:10] slice can change from run to run.
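Sort stability makes this easy to demonstrate with two explicit input orderings standing in for two different set-iteration orders (a standalone sketch with made-up words and counts, not the project's code):

```python
def sort_by_count(items):
    # count alone: sorted() is stable, so tied items keep
    # whatever relative order the input happened to have
    return [x['word'] for x in sorted(items, key=lambda x: -x['count'])]

def sort_with_tiebreak(items):
    # add the word itself as a final key: fully deterministic
    return [x['word'] for x in sorted(items, key=lambda x: (-x['count'], x['word']))]

# two arbitrary set-iteration orders of the same five keywords
order_a = [{'word': w, 'count': c} for w, c in
           [('foo', 100), ('eggs', 25), ('spam', 25), ('ham', 25), ('bar', 19)]]
order_b = [{'word': w, 'count': c} for w, c in
           [('foo', 100), ('ham', 25), ('spam', 25), ('eggs', 25), ('bar', 19)]]

print(sort_by_count(order_a))       # ['foo', 'eggs', 'spam', 'ham', 'bar']
print(sort_by_count(order_b))       # ['foo', 'ham', 'spam', 'eggs', 'bar']
print(sort_with_tiebreak(order_a))  # ['foo', 'eggs', 'ham', 'spam', 'bar']
assert sort_with_tiebreak(order_a) == sort_with_tiebreak(order_b)
```

The count-only sort disagrees between the two orderings on every tied keyword, while the tie-broken sort never does.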

Down in Summarizer.summarize(), this means sentence scores are also inconsistent:

# ... 
(keywords, wordCount) = self.parser.getKeywords(text)

topKeywords = self.getTopKeywords(keywords[:10], wordCount, source, category)

# ... iterating sentences ...
sbsFeature = self.sbs(words, topKeywords, keywordList)
dbsFeature = self.dbs(words, topKeywords, keywordList)

# ... calculate sentence score based on these features ...
# ... 

For consistency, when the summarize method calls getTopKeywords(), should it still pass a fixed list of ten keywords that's been sorted with more tiebreakers…

# replace: keywords = sorted(keywords, key=lambda x: -x['count'])
keywords = sorted(keywords, key=lambda x: (-x['count'], -len(x['word']), x['word']))

…or should it instead pass every keyword that ranks 10th or better, i.e. this comprehension…

topKeywordSlice = [kw for kw in keywords if kw['count'] >= keywords[9]['count']]

… or maybe both?
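With hypothetical counts, the difference at the ten-keyword boundary looks like this (standalone sketch, not the project's code): a three-way tie for 9th place means the fixed keywords[:10] slice arbitrarily drops one tied keyword, while the tie-inclusive comprehension keeps all of them.

```python
# hypothetical already-sorted keyword list: eight distinct counts,
# then a three-way tie straddling the ten-keyword boundary
keywords = ([{'word': 'w%d' % i, 'count': 100 - i} for i in range(8)]
            + [{'word': w, 'count': 25} for w in ('eggs', 'spam', 'ham')])

top_fixed = keywords[:10]  # drops 'ham', which ties the 10th keyword
top_slice = [kw for kw in keywords if kw['count'] >= keywords[9]['count']]

print(len(top_fixed), len(top_slice))  # 10 11
print(top_slice[-1]['word'])           # ham
```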

Computationally, the advanced sort seems unnecessarily expensive, but I don't know if there's a rationale for exactly ten top keywords. What's the best way to make this work?
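One way to combine both ideas is a single helper (a sketch only: deterministic_top_keywords is a hypothetical name, not a function in this project, and the n=10 default just mirrors the current hard-coded slice):

```python
def deterministic_top_keywords(keywords, n=10):
    """Fully tie-broken sort, then keep every keyword tied with rank n."""
    ordered = sorted(keywords,
                     key=lambda x: (-x['count'], -len(x['word']), x['word']))
    if len(ordered) <= n:
        return ordered
    cutoff = ordered[n - 1]['count']
    # expand past n to include every keyword tied with the nth
    return [kw for kw in ordered if kw['count'] >= cutoff]
```

The extra sort keys are cheap (comparing tuples of two ints and a short string instead of a single int), so the cost concern mostly evaporates; the open question is still whether anything downstream depends on exactly ten keywords.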


schmamps commented May 29, 2018

For anyone out there playing along at home, switching from a fixed ten to topKeywordSlice resulted in as many as 18 "top keywords" in my sample texts. It slightly boosted some scores, with one outlier getting a +14% bump… yet it remained the lowest scoring sentence.

Differences after the change, reported as absolute percentage (AP) with standard deviation (SD), formatted AP (±SD):

  • mean: +1.083% (±0.0132s)
  • median: +0.1667% (±0.0112s)
  • mode: +0.000% (±0.011011)
  • minimum: +0.0000% (±0.025s)
  • maximum score: +14.06778% (±0.0341s)

The limited deltas are surprising.
