An end-to-end machine learning library for detect trending stories and content in real time. An open source Python framework that currently works in memory on a single node and can comfortably perform with around 500k-1m articles. Parallelisation is very much on the near term road map (next 6 months).
Trends are identified by detecting phrases that start occurring much more frequently than those that don't typically occur. Various natural language processing and data science techniques are used to ensure similar words are modelled together (i.e. "cycle", "cycling" and "cyclist" all reduce down to a common word form, such as "cycle").
Documents can be grouped by a subject, so it is possible to detect "localised" trends. It is often the case that a trending story has a number of related phrases (for example, "doping scandal" and "Tour de France"), so this is handled using hierachical clustering and doc2vec to handle this.
Based on ramekin
, but going to take it further to do real time detection and maintaining models rather than creating them each time.
We are going to create a royston to contain a set of news articles, and then find the trends.
First we will install the package via pip
by typing the following into the command line:
pip3 install royston
The following script creates some simple documents and adds them to a royston
(also shipped in the examples
directory):
from royston import Royston
from datetime import datetime as dt
roy = Royston()
# ingest a few documents
roy.ingest({ 'id': '123', 'body': 'Random text string', 'date': dt.now() })
roy.ingest({ 'id': '456', 'body': 'Antoher random string', 'date': dt.now() })
# find the trends - with this example, it won't find anything, as it's
# only got two stories!
trends = roy.trending()
print(trends)
This package is heavily configurable to allow us to tune how we look for emerging trends. The default options have been set for the most common use case that looks at new trends that have emerged over the last 24 hours.
Currently, the main way of tuning these parameters is controlled by passing the Royston constructor an options
dict with the following attributes:
Attribute | Type | Default | Description |
---|---|---|---|
min_trend_freq |
int |
4 | A threshold for the minimum number of times a phrase has to occur in a single day before it can even be considered a trend for a given subject. |
history_days |
int |
90 | The context of the number of days to consider for the history. This means we look at how often a phrase has occured over this period, and get an idea of typical use. |
trend_days |
int |
1 | The period of time in which we want to look for trends. With the default of 1, we are looking at documents from the last day to see if new trends have emerged during that time compared with the typical use period defined by history_days |
max_n |
int |
6 | The maximum size of the n-gram window (i.e. the window size of each phrase) |
history_frequency_tolerance |
float |
1.6 | Factor the history count by this amount to handle words that just didn't get mentioned in the history period. This usefulness of this is in review, and it is likely to be removed in future (or at least set to 1 by default). |
trends_top_n |
int |
8 | The maximum number of trends to return |
Disclaimer: the following options are currently supported but expected to change significantly in future releases:
Attribute | Type | Default | Description |
---|---|---|---|
start |
datetime |
now - trend_days | The start of the "trend" period (i.e. a day ago) |
end |
datetime |
now | The end of the "trend" period |
history_start |
datetime |
start |
Start of the trend period (i.e. history_days before end ) |
history_end |
datetime |
end - history_days |
Start of the trend period (i.e. history_days before end ) |
Currently they are calculate in the constructor only, which is stupid, as we want this to run in realtime and adapt each time the trend
method is called.
poetry run test
Run coverage reports:
poetry run coverage
This now uses poetry for package management, which can be done with the following command:
poetry build && poetry publish
This is still in the early stages of being ported over from JavaScript, and any help would be appreciated. The issues contain a lot of features that are needed. Please get in touch via LinkedIn and I can talk you thought anything.
Main concerns are:
- 100% test coverage.
- Retain the document format
- Code formatted using black/flake8