📋 Table of Contents

📖 What is Historian?
🎯 Objective
🚀 Usage
🛠 Build
💙 Contributors

📖 What is Historian?

Final project for CS 410 Text Information Systems at the University of Illinois Urbana-Champaign

Historian is a Google Chrome extension that builds a knowledge graph of your visited webpages based on their similarity with respect to each other.

Features

Builds graph of webpages visited
Visualizes the similarity of webpages in history
Clusters similar webpages together under shared topics
Presents an intuitive way to research

🎯 Objective

Create a Google Chrome extension that acts as a knowledge graph builder for webpages that the user visits while researching information online.

The extension should represent the user's visited webpages as nodes in a graph where the edges reflect the relative similarity between them such that similar webpages will be clustered together.

This will offer users an intuitive way to visualize their search history while performing online research and eliminate the need for third-party applications to track this information.

📌 Tasks

Task	Assigned To
Learn how to make a Chrome extension	Everyone
Visualizing Graphs	Blake
Web Scraping	Megha
Similarity Algorithm	Michael
Build Frontend	Rohan
Build Backend	Kaushal

🚀 Usage

Note

The sections that follow provide comprehensive instructions for getting Historian set up on your local device. After completing Step 1 and Step 4, you can run the demo script to install dependencies, start the local server, and open some sample webpages for a quick demonstration of how Historian works.

Dependencies

The table below gives an overview of the dependencies for this project as well as the versions used. For the packages, you can download these directly or run the setup.py script as discussed in the next section.

Item	Version
Python	3.12.0
NumPy	1.26.1
SciPy	1.11.3
SciKit-Learn	1.3.2
NLTK	3.8.1
BeautifulSoup	4.12.2
Plotly	5.18.0
Dash	2.14.2
Flask	3.0.0
Flask-Caching	2.1.0
Flask-Cors	4.0.0
Regex	2023.10.3
Alive-Progress	3.1.5

Setup

Important

Historian is not currently hosted on a domain, so the only way to use this extension is to run the server locally on your machine. The instructions below guide you through the setup process step by step.

Step 1: Clone this Repository

First you need to clone this repo to your local machine to access the server as well as the extension. The instructions below are adapted from GitHub's documentation on cloning repositories; for more information, please refer to the docs.

Navigate to the main page of the repository.
Above the list of files, click <> Code.
Copy the URL for the repository.
- To clone the repository using HTTPS, under "HTTPS", click the copy icon.
Open Git Bash.
Change the current working directory to the location where you want the cloned repository. e.g.
```
cd path/to/folder
```
Type git clone, and then paste the URL you copied earlier, e.g.
```
git clone https://github.com/blakepm2/CS410_Final_Project
```
Press Enter to create your local clone.

Step 2: Install dependencies

After cloning the repo, you can install the dependencies either with setup.py or manually with pip.

Note

Python 3.12 is required in order to run setup.py. If you do not have Python 3.12 installed, please download it from python.org before proceeding.

Option 1: Using `setup.py`

Navigate to the directory where you saved the repository.
```
cd path/to/repository
```
Run setup.py to install all dependencies.
```
py -3.12 setup.py
```

Option 2: Using `pip`

Navigate to the directory where you saved the repository.
```
cd path/to/repository
```

In the terminal, run the following command:

py -3.12 -m pip install -r config/requirements.txt
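
After either option finishes, you may want to confirm that the installed versions match the dependency table. The following is a minimal sketch using only the standard library; the distribution names (for example `beautifulsoup4` for BeautifulSoup) are the usual PyPI names and are an assumption about what `config/requirements.txt` lists, and `check_versions` is a hypothetical helper that is not part of Historian.

```python
from importlib import metadata

# Versions copied from the dependency table; the PyPI distribution
# names are assumed, not taken from config/requirements.txt.
EXPECTED = {
    "numpy": "1.26.1",
    "scipy": "1.11.3",
    "scikit-learn": "1.3.2",
    "nltk": "3.8.1",
    "beautifulsoup4": "4.12.2",
    "plotly": "5.18.0",
    "dash": "2.14.2",
    "Flask": "3.0.0",
    "Flask-Caching": "2.1.0",
    "Flask-Cors": "4.0.0",
    "regex": "2023.10.3",
    "alive-progress": "3.1.5",
}


def check_versions(expected):
    """Map each package name to (wanted_version, installed_version_or_None)."""
    report = {}
    for name, wanted in expected.items():
        try:
            found = metadata.version(name)
        except metadata.PackageNotFoundError:
            found = None  # package is not installed in this interpreter
        report[name] = (wanted, found)
    return report


for name, (wanted, found) in check_versions(EXPECTED).items():
    status = "ok" if found == wanted else "check"
    print(f"{name:16} wanted {wanted:10} found {found or 'missing':10} {status}")
```

Run it with the same interpreter you used for the install (for example `py -3.12`), since each Python installation keeps its own set of packages.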

Step 3: Start the local server

Once you've successfully cloned the repo and installed the necessary dependencies, you can host the local server on your machine to enable the backend functionality of the extension.

Navigate to the directory where you saved the repository.
Run server.py.
- This will begin hosting a server on your local network.
Verify that the server is running.
- You can verify that the server is running by visiting http://127.0.0.1:8050/ in a web browser.
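
If you prefer to check from a script rather than a browser, something like the sketch below should work. It only tests whether anything answers HTTP at the address given above; `is_server_up` is a hypothetical helper name, and the two-second timeout is an arbitrary choice.

```python
import urllib.error
import urllib.request


def is_server_up(url="http://127.0.0.1:8050/", timeout=2.0):
    """Return True if an HTTP server answers at `url`, False otherwise."""
    # Ignore any system proxy settings: the server runs on this machine.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        with opener.open(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        # Something answered, just not with a success status.
        return True
    except OSError:
        # Connection refused, timed out, or otherwise unreachable.
        return False


if __name__ == "__main__":
    print("Historian server reachable:", is_server_up())
```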

Important

Historian works by sending a list of the URLs from your history to the server, which then performs the computations needed to create the graph. Once completed, the server asynchronously updates the graph on the frontend for you to see. Thus, you must run the server in order to see your results visualized.

Step 4: Load the unpacked extension into Chrome

In order to use the extension in Chrome, you need to load an unpacked version into your extensions. The instructions below are adapted from Google Chrome's documentation, which you can consult for more information.

On your computer, open Chrome.
At the top right, click More (three dots) > Extensions > Manage Extensions.
At the top right, enable Developer mode.
At the top left, click Load unpacked.
Navigate to the directory where you stored the repository, and select the extension folder.
Verify that the extension has been loaded.
- Once the extension has been successfully loaded into Chrome, you should see Historian listed in My extensions.

Step 5: Displaying the graph

After the extension has been loaded into Chrome and the local server has been started, you should now be able to use Historian to see the hierarchical graph of your recent search history.

On your computer, open Chrome.
Visit some webpages.
- Try to visit different kinds of webpages so that the app can highlight the divisions between them (e.g. "best snack foods", "top 10 careers for computer science majors", "best nba players of all time").
- If you're having trouble coming up with ideas or would rather use some pre-selected samples, please use the demo.
In the top right, click the Extensions icon and select Historian from the dropdown menu.
In Historian, click Visualize History.

You should see a graph populate with lines connecting nodes that represent the hierarchical clusters of your browsing history.

Troubleshooting FAQ

The server successfully created the graph but the extension loads forever

If there are no errors on the server side but the extension never finishes loading your dendrogram, the likely cause is the Cross-Origin Resource Sharing (CORS) policy. When you load Historian into Chrome, your Extension ID may differ from the one included in the code, so the extension's requests are blocked.

To fix this, go to Chrome > Manage Extensions and copy the Extension ID shown under Historian. Then open server.py and replace the Extension ID on line 37 with your own.
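
To see why a mismatched ID causes this, it can help to look at how an origin allow-list behaves. The sketch below is illustrative only and does not reproduce what server.py does: `extension_origin` and `is_origin_allowed` are hypothetical helpers, and both IDs are made up. It relies on two facts about Chrome: extension IDs are 32 characters drawn from the letters a to p, and requests from an extension carry an origin of the form `chrome-extension://<id>`.

```python
import re

# Chrome extension IDs are 32 characters drawn from the letters a-p.
_EXTENSION_ID = re.compile(r"[a-p]{32}")


def extension_origin(extension_id):
    """Build the origin string Chrome uses for requests from an extension."""
    if not _EXTENSION_ID.fullmatch(extension_id):
        raise ValueError(f"not a valid Chrome extension ID: {extension_id!r}")
    return f"chrome-extension://{extension_id}"


def is_origin_allowed(request_origin, allowed_origins):
    """Exact-match lookup, the way a CORS allow-list is usually applied."""
    return request_origin in allowed_origins


# Made-up IDs for illustration only.
ID_IN_CODE = "a" * 32    # the ID written into the server code
ID_IN_CHROME = "b" * 32  # the ID Chrome assigned on your machine

allowed = {extension_origin(ID_IN_CODE)}
print(is_origin_allowed(extension_origin(ID_IN_CHROME), allowed))  # False

allowed = {extension_origin(ID_IN_CHROME)}
print(is_origin_allowed(extension_origin(ID_IN_CHROME), allowed))  # True
```

Because the match is exact, a single differing character in the ID is enough for the request to be refused.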

The server failed to create a graph due to a dimensional mismatch

If the server throws an error saying it failed to create the dendrogram due to a mismatch in dimensions, this is likely because either the webpages could not be parsed or the webpages had the same titles. Currently, Historian can only analyze distinct webpages (i.e. webpages with unique titles).

To fix this, you can either try to visit some different webpages or you can simply run the demo script to get some presampled webpages to use.
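
If you are experimenting with the code yourself, one way to sidestep the unique-title requirement is to filter out empty or repeated titles before clustering. This is a sketch under assumptions, not something Historian currently does: `drop_duplicate_titles` is a hypothetical helper, and documents are represented as plain dictionaries for brevity.

```python
def drop_duplicate_titles(docs):
    """Keep the first document for each non-empty title (case-insensitive)."""
    seen = set()
    unique = []
    for doc in docs:
        key = doc["title"].strip().lower()
        if key and key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique


pages = [
    {"title": "Best Snack Foods", "url": "https://example.com/a"},
    {"title": "best snack foods ", "url": "https://example.com/b"},
    {"title": "", "url": "https://example.com/c"},
    {"title": "Top 10 Careers", "url": "https://example.com/d"},
]
kept = drop_duplicate_titles(pages)
print([p["url"] for p in kept])
```

Clustering needs at least two documents, so it is worth checking `len(kept) >= 2` before going further.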

🛠 Build

Overview

Historian defines several modules to facilitate its functionality. The table below provides a high-level overview of these modules with links to their respective code and documentation.

Module	Purpose	Documentation
Document	Represent webpages as documents	Link
WebScraper	Extract webpage text data	Link
HierarchicalClustering	Perform agglomerative hierarchical clustering	Link
Dendrogram	Visualize hierarchical clusters	Link
Frontend	Enable user functionality	Link

Document

A class to represent scraped webpages as documents

src.webScraping.document.Document(self, title, text, url)

Parameters

self Document : The Document object
title str : The title of the webpage
text str : The text data of the webpage
url str : The url of the webpage

Methods

None
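
For orientation, a class with this signature and no methods could look roughly like the dataclass below. This is a guess at the shape based on the parameters listed above, not the code in src/webScraping/document.py.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Document:
    """A scraped webpage: its title, its text, and where it came from."""
    title: str
    text: str
    url: str


doc = Document(
    title="Example Domain",
    text="This domain is for use in illustrative examples.",
    url="https://example.com/",
)
print(doc.title, doc.url)
```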

WebScraper

A class for extracting text data from webpages

src.webScraping.webScraper.WebScraper(self)

Parameters

self WebScraper : The WebScraper object

Methods

getWebpageText( self, response )

Extracts and preprocesses the text of a webpage from a given requests.Response object

Parameters

self WebScraper : The WebScraper object
response requests.Response : A requests.Response object for a given URL

Returns str

scrapeWebpages ( self, urls )

Extracts text data from the webpages at the given URLs and saves each page's text as a string in the WebScraper.corpus hashmap

Parameters

self WebScraper : The WebScraper object
urls list : A list of URLs for the webpages you want to scrape

Returns dict
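
To give a feel for the extraction step, here is a minimal sketch that pulls the title and visible text out of an HTML string. Historian lists BeautifulSoup among its dependencies; this sketch deliberately swaps in the standard library's `html.parser` so it runs without extra packages, and `extract_text` is a hypothetical function, not a WebScraper method.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects the <title> and visible text, skipping script and style."""

    SKIP = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__()
        self.title = ""
        self.chunks = []
        self._skip_depth = 0
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1
        elif tag == "title":
            self._in_title = False

    def handle_data(self, data):
        text = data.strip()
        if self._skip_depth or not text:
            return
        if self._in_title:
            self.title += text
        else:
            self.chunks.append(text)


def extract_text(html):
    """Return (title, visible_text) for an HTML document string."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.title, " ".join(parser.chunks)


page = (
    "<html><head><title>Snacks</title><style>p{color:red}</style></head>"
    "<body><p>Best <b>snack</b> foods</p><script>var x = 1;</script></body></html>"
)
print(extract_text(page))  # ('Snacks', 'Best snack foods')
```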

HierarchicalClustering

A class to perform agglomerative hierarchical clustering with average link for a collection of webpages

src.graphing.hierarchicalClustering.HierarchicalCluster(self)

Parameters

self HierarchicalClustering : The HierarchicalClustering object

Methods

preprocess( self, text )

Preprocesses text data from a document by performing normalization, tokenization, and lemmatization

Parameters

self HierarchicalClustering : The HierarchicalClustering object
text str : The text data from a webpage document

Returns str
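
A rough idea of what normalization and tokenization do to a page title is sketched below. It uses only the standard library, so it omits lemmatization (which Historian would get from NLTK), and the small stopword list is an assumption added for illustration. The function shares its name with the method above but is not its implementation.

```python
import re

# A tiny illustrative stopword list; an assumption, not Historian's list.
STOPWORDS = frozenset({"a", "an", "and", "the", "of", "to", "in", "for", "is", "are"})


def preprocess(text):
    """Lowercase, keep alphabetic tokens only, and drop stopwords."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return " ".join(t for t in tokens if t not in STOPWORDS)


print(preprocess("The 10 Best NBA Players of All Time!"))
# best nba players all time
```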

preprocess_docs( self, docs )

Preprocesses text for a collection of documents by performing normalization, tokenization, and lemmatization

Parameters

self HierarchicalClustering : The HierarchicalClustering object
docs list : A list of the documents to preprocess

Returns list

extract_features( self, docs )

Applies Term Frequency (TF) - Inverse Document Frequency (IDF) weighting to a set of (processed) documents

Parameters

self HierarchicalClustering : The HierarchicalClustering object
docs list : A list of the processed documents you want to analyze

Returns numpy.ndarray
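
As a worked illustration of TF-IDF weighting, the sketch below computes the matrix by hand for two short documents. It uses the smoothed IDF, ln((1 + N) / (1 + df)) + 1, with L2-normalized rows, which matches the defaults of scikit-learn's TfidfVectorizer; whether Historian relies on those defaults is an assumption, and `tfidf_matrix` here is a hypothetical standalone function.

```python
import math
from collections import Counter


def tfidf_matrix(docs):
    """Return (vocabulary, rows) where rows[i][j] weights term j in doc i."""
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({term for tokens in tokenized for term in tokens})
    n_docs = len(docs)
    # Document frequency: how many documents contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1.0 for t in vocab}

    rows = []
    for tokens in tokenized:
        tf = Counter(tokens)
        row = [tf[t] * idf[t] for t in vocab]
        # L2-normalize so document length does not dominate similarity.
        norm = math.sqrt(sum(w * w for w in row)) or 1.0
        rows.append([w / norm for w in row])
    return vocab, rows


vocab, rows = tfidf_matrix(["best snack foods", "best nba players"])
print(vocab)
for row in rows:
    print([round(w, 3) for w in row])
```

The shared term "best" ends up with a lower weight than "snack" or "nba", which is the point of the IDF factor: terms that appear everywhere say little about what makes a page distinctive.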

create_hierarchical_cluster( self, tfidf_matrix )

Performs hierarchical/agglomerative clustering for a TF-IDF weighted matrix of text data from a collection of documents using Average-Link

Parameters

self HierarchicalClustering : The HierarchicalClustering object
tfidf_matrix numpy.ndarray : A TF-IDF weighted matrix of text data

Returns numpy.ndarray
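
The average-link idea can be sketched in a few lines of plain Python: repeatedly merge the two clusters whose members are closest on average. The version below is a naive illustration rather than Historian's implementation; it assumes cosine distance between rows, and `average_link` is a hypothetical function that returns the merge history instead of a linkage array. SciPy, which is among the listed dependencies, provides `scipy.cluster.hierarchy.linkage` with `method="average"` for real workloads.

```python
import math
from itertools import combinations


def cosine_distance(u, v):
    """1 - cosine similarity; treats an all-zero vector as maximally distant."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    if nu == 0 or nv == 0:
        return 1.0
    return 1.0 - dot / (nu * nv)


def average_link(vectors):
    """Return the merge history as (cluster_a, cluster_b, distance) tuples."""
    clusters = [(i,) for i in range(len(vectors))]
    merges = []
    while len(clusters) > 1:
        best = None
        for a, b in combinations(clusters, 2):
            # Average link: mean distance over all cross-cluster pairs.
            total = sum(cosine_distance(vectors[i], vectors[j]) for i in a for j in b)
            d = total / (len(a) * len(b))
            if best is None or d < best[2]:
                best = (a, b, d)
        a, b, d = best
        clusters.remove(a)
        clusters.remove(b)
        clusters.append(a + b)
        merges.append((a, b, d))
    return merges


# Three toy documents: 0 and 1 point the same way, 2 is unrelated.
history = average_link([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
for a, b, d in history:
    print(a, b, round(d, 3))
```

Documents 0 and 1 merge first because they are nearly parallel; document 2 joins last at a much larger distance, which is what the height of a dendrogram branch conveys.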

create_dendrogram( self, cluster, docs)

Creates a dendrogram to visualize a hierarchical/agglomerative cluster

Parameters

self HierarchicalClustering : The HierarchicalClustering object
cluster numpy.ndarray : The hierarchical cluster of the data
docs list : A list of the original documents

Returns Dendrogram

Dendrogram

A class to visualize a hierarchical clustering of webpages

src.graphing.dendrogram.Dendrogram(self, cluster, docs)

Parameters

self Dendrogram : The Dendrogram object
cluster np.ndarray : The hierarchical cluster of the data
docs list : A list of the original documents

Methods

create( self )

Creates a dendrogram figure for a hierarchical/agglomerative cluster

Parameters

self Dendrogram : The Dendrogram object

Returns plotly.graph_objs.Figure

create_lines( self )

Creates the lines representing the relationships between nodes in a dendrogram

Parameters

self Dendrogram : The Dendrogram object

Returns None

create_nodes( self )

Creates the nodes representing the documents in a dendrogram

Parameters

self Dendrogram : The Dendrogram object

Returns None

create_layout( self )

Creates the layout of the dendrogram

Parameters

self Dendrogram : The Dendrogram object

Returns None

Frontend

Enables user functionality by sending data to the server

Methods

fetchHistory( )

Uses the Google Chrome history API to fetch the user's recent browsing history

Parameters

None

Returns Response

checkAvailability( )

Checks the server to see if the preprocessing has been done so it can fetch the graph

Parameters

None

Returns boolean

getGraph( )

Loads the graph created by the server into the frontend for the user to see

Parameters

None

Returns boolean

sendURLSToServer( )

Leverages API call to send the user's browsing history over to the server for processing

Parameters

None

Returns data.message

showSpinner( )

Shows a spinner while the page loads

Parameters

None

Returns None

hideSpinner( )

Hides the spinner after a page has finished loading

Parameters

None

Returns None

💙 Contributors

"⭐️" denotes Team Leader

Name	NetID/Email	Contributions
Blake McBride ⭐️	[email protected]	Created Document class; created WebScraper class; created HierarchicalClustering class; created Dendrogram class; configured agglomerative hierarchical clustering algorithm; designed webscraping logic; wrote visualization logic; configured local server; created all functions for and designed frontend; implemented API calls from frontend to server; created setup and demo scripts; designed logo(s); wrote setup instructions; wrote documentation; designed and wrote README; wrote, edited, and produced video presentation.
Kaushal Dadi	[email protected]	Created manifest.json; put iframe in HTML to show graph on webpage; built preliminary frontend.
Rohan Parekh	[email protected]	Helped Kaushal with creation of manifest.json and the chrome extension that displayed the graph on the webpage.
Megha Chada	[email protected]	Changed colors for graph; added comments to code; added title, timestamp, and description to graph; created architectural diagram.
Michael Ma	[email protected]	Added unfinished topic labels to the graph.
