An experimental Hacker News reader incorporating sentiment analysis of comments.
A color label for each story is produced by taking the mean of the sentiment of each comment, multiplying this by the log of the number of comments (to reduce the variance of stories with few comments), and finally estimating the area under the probability density function between this score and the mean score. This results in an even distribution of scores, essentially magnifying differences close to the mean.
Each comment is labeled using the same method, but the histogram shows the raw scores.
Uses node.js, React, D3.js, and the sentiment module.
Use tf-idf to identify the N most "important" words within a comment thread, relative to a corpus of comments.
- Generate csv with large corpus of comments .
- Analyze corpus with tf-idf, and store the idf cache in a json file.
- Given a story, load the idf cache for the corpus.
- Calculate the tf-idf for each term in the comment thread (normalize by length?), relative to the corpus.
- Select some N terms with the highest scores.
Use the presence (or frequency?) of these words in each comment as features in a clustering model. Maybe exclude some if they are too common?
- For each comment create a vector from the presence/frequency/tfidf of each term.
- Do some kind of clustering using the vectors.
See: http://brandonrose.org/clustering
- Check distribution of comments
- Performance: http://benchling.engineering/performance-engineering-with-react/