This repository contains the code for the Hansard web app hosted at https://brown-ccv.github.io/hatori. The code is divided into two parts: the front end of the web app and the statistics compiler.
The web app itself lives in `docs/page/`, which is an NPM package. It has five views, and two of these (`front/` and `cloud/`) have their own visualizations generated by `generator.py`.
- The `front` view contains a list of topics, with the 16 most popular words in each topic and the proportion of each topic within the whole corpus.
- The `home/` view contains the word cloud for each topic, listed in sectional order.
- The `map/` view shows a visualization of the embedding of the topics into the Euclidean plane, each topic being a point (or rather a disk).
- Clicking the element corresponding to a topic in any of the previous views brings you to the `topic/` page for that topic. Here you can see the distribution of words in the topic, the significant documents for that topic, and the proportion of the topic over time. The view can display data for multiple topics: typing a topic ID number into the "Add Topic" text input in the top right corner and pressing Enter adds the topic with that ID to the visualization. This view needs to make calls to the server (described in the server section below); see the sketch after this list.
- Finally, clicking corpus words in the previous views brings you to the `graph` page, which plots the proportion/frequency of the words over time. Multiple words can be added for comparison.
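As a rough illustration of the kind of call the `topic/` view makes, here is a minimal sketch in Python. The endpoint path, parameter name, and response fields are assumptions made up for this example; only the idea of querying the server for a union of topic IDs comes from this README.

```python
import requests  # third-party HTTP client

# Hypothetical endpoint; the real one is whatever docs/page/topic/src/index.js points at.
SERVER = "http://localhost:5000"

def fetch_topic_union(topic_ids):
    """Request stats for the union of the given topic IDs.

    The route name, parameter format, and response shape are guesses for illustration.
    """
    params = {"ids": ",".join(str(i) for i in topic_ids)}
    response = requests.get(f"{SERVER}/topic", params=params)
    response.raise_for_status()
    return response.json()  # word distribution, top documents, proportion over time

# Treat topics 3 and 17 as a single merged topic.
print(fetch_topic_union([3, 17]))
```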
The statistics compiler is written in Python and is contained in the `serv` folder. It consists of an asset generator (`serv/generator.py`) and a data reader (`serv/reader.py`).
The reader reads the data from the MALLET-generated file and compiles it into a digestible format. The generator takes the data compiled by the reader and creates assets such as the JSON data file, the word clouds, and the visualizations needed for the web app.
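As a rough mental model of that pipeline (the function names below are hypothetical; the real entry points live in `serv/reader.py` and `serv/generator.py`):

```python
# Hypothetical sketch of the serv/ pipeline; the real code is organised differently.
import json
from pathlib import Path

def read_mallet_output(path):
    """Parse a MALLET topic-word weights file into {topic_id: {word: weight}}.

    Assumes MALLET's --topic-word-weights-file output: tab-separated topic, word, weight.
    """
    topics = {}
    for line in Path(path).read_text().splitlines():
        topic_id, word, weight = line.split("\t")
        topics.setdefault(int(topic_id), {})[word] = float(weight)
    return topics

def generate_assets(topics, out_dir="assets"):
    """Write the JSON data file the views load; word clouds and plots would follow."""
    out = Path(out_dir)
    out.mkdir(exist_ok=True)
    (out / "topics.json").write_text(json.dumps(topics))

generate_assets(read_mallet_output("topic-word-weights.txt"))
```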
Due to the staggering number of possible combinations of topics, we cannot save the data for every combination in a JSON file preloaded along with the webpage. Due to the staggering size of the corpus, we cannot load the corpus onto the webpage and generate the data on demand either. This calls for a server. The `topic` view is the only view that needs the server. The server finds the important documents, the word distribution, and the time distribution for the union of some combination of topics. This comes in handy when multiple topics have the same content and should be treated as a single topic.
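A minimal sketch of what such an endpoint might look like, assuming Flask (the default port 5000 suggests it, but that is a guess) and made-up route and data names:

```python
# Hypothetical sketch of the server's role, assuming Flask; names are illustrative.
from collections import Counter
from flask import Flask, jsonify, request

app = Flask(__name__)

# Stand-in for data compiled by the reader: per-topic word distributions.
TOPIC_WORDS = {
    3: Counter({"trade": 0.4, "tax": 0.2}),
    17: Counter({"trade": 0.3, "tariff": 0.3}),
}

@app.route("/topic")
def topic_union():
    ids = [int(i) for i in request.args["ids"].split(",")]
    merged = Counter()
    for topic_id in ids:
        merged += TOPIC_WORDS[topic_id]  # union: sum the word distributions
    return jsonify(merged.most_common(16))  # top words for the merged topic

if __name__ == "__main__":
    app.run(port=5000)
```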
The `docs` directory is an NPM repository. To initialize it, run

```
$ npm install
```

Then, to recompile each view, simply `cd` into each subdirectory and run

```
$ make
```
To run the server or the statistics compiler, first install the dependencies:

```
$ pip install -r serv/requirements.txt
```

To generate the data, run

```
$ (cd serv && python generate_assets.py)
```

To start the server on port 5000, run

```
$ (cd serv && python run_server.py)
```

Settings can be configured in `generate_assets.py` and `run_server.py`. The server endpoint needs to be changed in `docs/page/topic/src/index.js` and the code recompiled.
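As an illustration of the kind of change involved (the names here are guesses; check the actual files), running the server on a different port might look like:

```python
# In serv/run_server.py (hypothetical line): bind to a different port.
# The endpoint URL in docs/page/topic/src/index.js must then be updated to
# match, and the topic view recompiled with `make`.
app.run(host="0.0.0.0", port=8080)
```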