Storytelling Tools
Storytelling components in Gramex take a structured dataset and generate an exhaustive list of facts about it. These facts are then filtered through the following patterns of insights:
- Unknown results
- Surprising comparisons
- Surprising extremes
- Significant outliers
- Abnormal distributions
Each such significant insight is then filtered through the BUS system. Each insight that passes the BUS filter is then represented in the data story in any or all of the following forms:
- Text
- Charts and graphics
- Interactive elements
- Autolysis - Take a dataset and generate a pool of facts. Each column must be marked as a dimension or a metric, and univariate and bivariate analyses are performed accordingly. The output can be a set of structured objects, each of which has a type (chart, text, etc.) and is accompanied by some representation of the operation that was performed on the data to produce the result. The process must be auditable, so that users can retrospectively tweak a result.
- Patterns of Insights (PoI) - Users preview the results generated by autolysis, and either tag each result with one PoI or discard it. The tagged results are now insights.
- BUS Filtering - Each insight is graded with the BUS system (on a low, medium, high scale). Any insight which rates medium or high on at least two out of the three scales makes it to the storyboarding process.
- Storyboarding - This is the process of embedding insights into a story. It contains the following stages, each optional:
  - Generate text with NLG.
  - Create charts, comics, slides, or any other static medium.
  - Create interactive elements around the insight, for example with Jupyter widgets or g1 components.
- Review - This finishes the storyboarding process by asking a user to review the story and add filler content. The whole story may then be exported as a Gramex app, or as a collection of JS / Python / HTML assets.
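The autolysis and BUS filtering stages above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not Gramex's implementation: the `Fact` class, `autolyse`, `passes_bus` and the sample rows are all hypothetical names and made-up values. Each fact carries a string describing the operation that produced it, which is one simple way to keep the process auditable.

```python
from dataclasses import dataclass
from statistics import mean

GRADES = ("low", "medium", "high")


@dataclass
class Fact:
    """One autolysis result (hypothetical structure)."""
    kind: str        # output type, e.g. "text" or "chart"
    operation: str   # record of the operation performed, kept for auditing
    value: object    # the computed result


def autolyse(rows, dimensions, metrics):
    """Generate a pool of facts from univariate and bivariate operations."""
    facts = []
    # Univariate: the overall mean of every metric.
    for metric in metrics:
        facts.append(Fact("text", f"mean({metric})",
                          mean(r[metric] for r in rows)))
    # Bivariate: the mean of every metric within each value of a dimension.
    for dim in dimensions:
        for metric in metrics:
            groups = {}
            for r in rows:
                groups.setdefault(r[dim], []).append(r[metric])
            averages = {k: mean(v) for k, v in groups.items()}
            facts.append(Fact("chart", f"groupby({dim}).mean({metric})",
                              averages))
    return facts


def passes_bus(grades):
    """True if at least two of the three grades are medium or high."""
    if len(grades) != 3 or any(g not in GRADES for g in grades):
        raise ValueError("expected three grades from: " + ", ".join(GRADES))
    return sum(g in ("medium", "high") for g in grades) >= 2


# Made-up rows for illustration; these are not real measurements.
rows = [
    {"species": "setosa", "petal_length": 1.0},
    {"species": "setosa", "petal_length": 2.0},
    {"species": "virginica", "petal_length": 5.0},
]
facts = autolyse(rows, dimensions=["species"], metrics=["petal_length"])
```

Here `passes_bus(("high", "low", "medium"))` lets an insight through to storyboarding, while `passes_bus(("high", "low", "low"))` does not. The three grades are passed positionally because the sketch makes no assumption about what the individual scales are called.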
This section describes the building blocks of Gramex's approach to natural language generation. These concepts serve as primitives to the logic and automation capabilities of the NLG engine.
- Narrative - A narrative is a piece of text, written by a user or generated by a machine, which contains facts about a dataset. A narrative in its entirety is assumed to be a function of three items:
  1. A dataset
  2. Operations on that dataset
  3. Some "source text" provided by the user.

  For example, the following is a narrative about the Fisher Iris dataset.
The iris dataset contains measurements from a hundred and fifty samples of three unique species of the iris flower - setosa, versicolor and virginica. The species are equally distributed within the dataset, so that each species has fifty samples. For each sample, four measurements are taken - sepal length, sepal width, petal length and petal width. The average petal length of the setosa is significantly less than that of versicolor or virginica. The average petal width of virginica is much higher than that of versicolor. However, there is no pair of features that can uniquely identify a species. The presence of such properties makes the iris dataset ideal for explaining machine learning concepts.
- Nugget - A nugget is ideally a single sentence which conveys a fact about the data. Each sentence in the example narrative except the last two is a nugget. Note that each nugget derives its facts from the source data directly, or from the result of some operation on the data. For example, the following nugget:
The average petal length of the setosa is significantly less than that of versicolor or virginica.
derives from a groupby-and-average operation on one column of the dataset. Some nuggets, like the one enumerating the number of samples in the dataset, derive from the raw dataset, not from the result of any operations on it. A narrative is essentially an ordered collection of nuggets.
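The link between an operation and a nugget can be illustrated with a short Python sketch. It assumes the data is a list of dictionaries, and the function names and sample values are hypothetical. It also makes no attempt to test statistical significance, so the generated sentence says "less" rather than "significantly less".

```python
from statistics import mean


def average_by(rows, key, column):
    """Group rows by `key` and average `column` within each group."""
    groups = {}
    for row in rows:
        groups.setdefault(row[key], []).append(row[column])
    return {name: mean(values) for name, values in groups.items()}


def count_nugget(rows):
    """A nugget derived from the raw dataset, with no operation applied."""
    return f"The dataset contains {len(rows)} samples."


def lowest_average_nugget(rows, key, column):
    """A nugget derived from a groupby-and-average operation."""
    averages = average_by(rows, key, column)
    lowest = min(averages, key=averages.get)
    others = sorted(name for name in averages if name != lowest)
    label = column.replace("_", " ")
    return (f"The average {label} of the {lowest} is less than "
            f"that of {' or '.join(others)}.")


# Made-up values for illustration; these are not the real iris measurements.
rows = [
    {"species": "setosa", "petal_length": 1.4},
    {"species": "setosa", "petal_length": 1.5},
    {"species": "versicolor", "petal_length": 4.5},
    {"species": "virginica", "petal_length": 5.5},
]

# A narrative is an ordered collection of nuggets.
nuggets = [count_nugget(rows),
           lowest_average_nugget(rows, "species", "petal_length")]
narrative = " ".join(nuggets)
```

Keeping the operation (`average_by`) separate from the sentence that reports it means the same operation can feed several nuggets, and a nugget can be regenerated when the data changes.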
- Variables - A variable is a piece of text which can change with the data or the operations performed on it. Here is a reproduction of the example narrative, with all variables shown in bold.
The iris dataset contains measurements from a hundred and fifty samples of three unique species of the iris flower - setosa, versicolor and virginica. The species are equally distributed within the dataset, so that each species has fifty samples. For each sample, four measurements are taken - sepal length, sepal width, petal length and petal width. The average petal length of the setosa is significantly less than that of versicolor or virginica. The average petal width of virginica is much higher than that of versicolor. However, there is no pair of features that can uniquely identify a species. The presence of such properties makes the iris dataset ideal for explaining machine learning concepts.
Note that each variable has two defining components:
  - a source text, as initially provided by the user
  - one or more formulae, which compute the value of the variable for a specific instance of the data.

Note that the source text of a variable may be found in multiple places within a dataset, and as such, a variable may have multiple formulae - one of which will have to be preferred by the user.
For example, for the first variable in the example narrative, "hundred and fifty" is the source text, and the formula is any machine code that counts the number of rows in the dataset and translates the count into a human-readable form. A variable may additionally have other attributes, such as:
  - a set of linguistic inflections which determine the form of the rendered variable text. These are distinct from the formula itself: the formula creates the base form of the text, and the inflections modify that base form.
  - a name used to identify the variable within the template of the nugget.

Thus, narratives are composed from nuggets, and nuggets from variables. This grammar allows the NLG engine to approach the problem of data-driven, machine-generated narratives in a compositional manner rather than a generative one.
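The anatomy of a variable (source text, formula, inflections and name) can be sketched as a small Python data structure. This is an illustrative sketch and not the Gramex NLG API: the `Variable` class, `number_to_words` and `render_nugget` are hypothetical names. The formula here spells the count as "one hundred and fifty", so the template omits the article "a" that precedes the source text in the example narrative.

```python
from dataclasses import dataclass, field
from typing import Callable, List

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def number_to_words(n):
    """Spell out an integer 0 <= n < 1000 in English words."""
    if not 0 <= n < 1000:
        raise ValueError("only 0..999 is supported in this sketch")
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] + ("-" + ONES[ones] if ones else "")
    hundreds, rest = divmod(n, 100)
    words = ONES[hundreds] + " hundred"
    return words + (" and " + number_to_words(rest) if rest else "")


@dataclass
class Variable:
    name: str                      # identifies the variable in the template
    source_text: str               # text as originally written by the user
    formula: Callable              # computes the base form from the data
    inflections: List[Callable] = field(default_factory=list)

    def render(self, data):
        text = self.formula(data)           # base form from the formula
        for inflect in self.inflections:    # inflections modify the base form
            text = inflect(text)
        return text


def render_nugget(template, variables, data):
    """Fill a nugget template with the rendered value of each variable."""
    return template.format(**{v.name: v.render(data) for v in variables})


count = Variable(
    name="count",
    source_text="hundred and fifty",
    formula=lambda rows: number_to_words(len(rows)),
)
template = "The dataset contains measurements from {count} samples."
```

Rendering the template against a dataset of 150 rows with `render_nugget(template, [count], rows)` produces the nugget text, and the same template re-renders correctly when the number of rows changes. Adding `str.capitalize` to `inflections` would change only the form of the rendered text, not the formula that computes it.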