
A Google Chrome extension that builds a hierarchical knowledge graph of your visited webpages


eskin22/Historian


📋 Table of Contents

  1. 📖 What is Historian?
  2. 🎯 Objective
  3. 🚀 Usage
  4. 🛠 Build
  5. 💙 Contributors

📖 What is Historian?

Final project for CS 410 Text Information Systems at the University of Illinois Urbana-Champaign

Historian is a Google Chrome extension that builds a knowledge graph of your visited webpages based on their similarity with respect to each other.

Features

  • Builds graph of webpages visited
  • Visualizes the similarity of webpages in history
  • Clusters similar webpages together under shared topics
  • Presents an intuitive way to research

🎯 Objective

Create a Google Chrome extension that acts as a knowledge graph builder for webpages that the user visits while researching information online.

The extension should represent the user's visited webpages as nodes in a graph where the edges reflect the relative similarity between them such that similar webpages will be clustered together.
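As a rough illustration of the node-and-edge idea above, the sketch below scores every pair of pages with cosine similarity over raw word counts. The function names and the choice of plain word counts are assumptions made for this example; Historian's own pipeline uses TF-IDF features and hierarchical clustering, described in the Build section.

```python
import math
from collections import Counter


def cosine_similarity(text_a: str, text_b: str) -> float:
    """Cosine similarity between two texts using raw word counts."""
    counts_a = Counter(text_a.lower().split())
    counts_b = Counter(text_b.lower().split())
    # Dot product over the words the two texts share.
    dot = sum(counts_a[w] * counts_b[w] for w in counts_a.keys() & counts_b.keys())
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def build_edges(pages: dict[str, str]) -> list[tuple[str, str, float]]:
    """Return one weighted edge (title_a, title_b, similarity) per pair of pages."""
    titles = sorted(pages)
    return [
        (a, b, cosine_similarity(pages[a], pages[b]))
        for i, a in enumerate(titles)
        for b in titles[i + 1:]
    ]
```

Pages that share more vocabulary get a weight closer to 1, while pages with no words in common get 0, which is what lets similar pages group together.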

This will offer users an intuitive way to visualize their search history while performing online research and eliminate the reliance on other third party applications to track this information.

📌 Tasks

| Task | Assigned To |
| --- | --- |
| Learn how to make a Chrome extension | Everyone |
| Visualizing Graphs | Blake |
| Web Scraping | Megha |
| Similarity Algorithm | Michael |
| Build Frontend | Rohan |
| Build Backend | Kaushal |

🚀 Usage

Note

The sections that follow provide comprehensive instructions for getting Historian set up on your local device. After completing Step 1 and Step 4, you can run the demo script to install dependencies, start the local server, and open some sample webpages for an easy demonstration of how Historian works.

Dependencies

The table below gives an overview of the dependencies for this project as well as the versions used. For the packages, you can download these directly or run the setup.py script as discussed in the next section.

| Item | Version |
| --- | --- |
| Python | 3.12.0 |
| NumPy | 1.26.1 |
| SciPy | 1.11.3 |
| SciKit-Learn | 1.3.2 |
| NLTK | 3.8.1 |
| BeautifulSoup | 4.12.2 |
| Plotly | 5.18.0 |
| Dash | 2.14.2 |
| Flask | 3.0.0 |
| Flask-Caching | 2.1.0 |
| Flask-Cors | 4.0.0 |
| Regex | 2023.10.3 |
| Alive-Progress | 3.1.5 |

Setup

Important

Historian is not currently hosted on a domain, so the only way to use the extension is to run the server locally on your machine. The instructions below will guide you through the setup process step by step.

Step 1: Clone this Repository

First you need to clone this repo to your local machine to access the server as well as the extension. The instructions below are adapted from GitHub's documentation on cloning repositories; for more information, please refer to the docs.

  1. Navigate to the main page of the repository.

  2. Above the list of files, click <> Code.

  3. Copy the URL for the repository.

    • To clone the repository using HTTPS, under "HTTPS", click the copy icon.
  4. Open Git Bash.

  5. Change the current working directory to the location where you want the cloned repository. e.g.

    cd path/to/folder
    
  6. Type git clone, and then paste the URL you copied earlier, e.g.

    git clone https://github.com/blakepm2/CS410_Final_Project
    
  7. Press Enter to create your local clone.

Step 2: Install dependencies

After cloning the repo, you can install the dependencies either by running setup.py or manually with pip.

Note

Python 3.12 is required in order to run setup.py. If you do not have Python 3.12 installed, please download it here before proceeding.


Option 1: Using setup.py

  1. Navigate to the directory where you saved the repository.

    cd path/to/repository
    
  2. Run setup.py to install all dependencies.

    py -3.12 setup.py
    

Option 2: Using pip

  1. Navigate to the directory where you saved the repository.

    cd path/to/repository
    
  2. In the terminal, run the following command:

    py -3.12 -m pip install -r config/requirements.txt
    

Step 3: Start the local server

Once you've successfully cloned the repo and installed the necessary dependencies, you can host the local server on your machine to enable the backend functionality of the extension.

  1. Navigate to the directory where you saved the repository.
  2. Run server.py.
    • This will begin hosting a server on your local network.
  3. Verify that the server is running.

Important

Historian works by sending a list of the URLs from your history to the server, which then performs the computations needed to create the graph. Once these are complete, the server asynchronously updates the graph on the frontend for you to see. You therefore need to keep the server running in order to see your results visualized.
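The submit-then-poll flow described above can be sketched with a background thread. This is a minimal illustration only: the GraphJob class and its method names are hypothetical and are not taken from server.py, which may organize the work differently.

```python
import threading


class GraphJob:
    """Tracks one background graph build (hypothetical helper, not from the repo)."""

    def __init__(self, build_graph):
        self._build_graph = build_graph  # callable: list of URLs -> graph
        self._done = threading.Event()
        self._graph = None
        self._thread = None

    def submit(self, urls):
        """Start building the graph in the background and return immediately."""
        self._done.clear()
        self._thread = threading.Thread(target=self._run, args=(list(urls),))
        self._thread.start()

    def _run(self, urls):
        self._graph = self._build_graph(urls)
        self._done.set()

    def is_available(self):
        """What a polling client would ask: is the graph ready yet?"""
        return self._done.is_set()

    def result(self, timeout=None):
        """Block until the graph is ready (or the timeout passes), then return it."""
        if not self._done.wait(timeout):
            raise TimeoutError("graph is not ready yet")
        return self._graph
```

A client would call submit() once with the history URLs, then check is_available() periodically and fetch the graph only after it returns True.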

Step 4: Load the unpacked extension into Chrome

In order to use the extension in Chrome, you need to load an unpacked version into your extensions. The instructions below are adapted from Google Chrome's documentation, which you can consult for more information.

  1. On your computer, open Chrome.
  2. At the top right, click More (three dots) > Extensions > Manage Extensions.
  3. At the top right, enable Developer mode.
  4. At the top left, click Load unpacked.
  5. Navigate to the directory where you stored the repository, and select the extension folder.
  6. Verify that the extension has been loaded.
    • Once the extension has been successfully loaded into Chrome, you should see Historian listed in My extensions.

Step 5: Displaying the graph

After the extension has been loaded into Chrome and the local server has been started, you should now be able to use Historian to see the hierarchical graph of your recent search history.

  1. On your computer, open Chrome.
  2. Visit some webpages.
    • Try to visit different kinds of webpages so that the app can highlight the divisions between them (e.g. "best snack foods", "top 10 careers for computer science majors", "best nba players of all time").
    • If you're having trouble coming up with ideas or would rather use some pre-selected samples, please use the demo.
  3. In the top right, click Extensions (the puzzle-piece icon) and select Historian from the dropdown menu.
  4. In Historian, click Visualize History.

You should see a graph populate with lines connecting nodes that represent the hierarchical clusters of your browsing history.

Troubleshooting FAQ

The server successfully created the graph but the extension loads forever

If there are no errors on the server side but the extension never finishes loading your dendrogram, the likely cause is the Cross-Origin Resource Sharing (CORS) policy. When you load Historian into Chrome, your Extension ID may differ from the one included in the code, so the extension's requests to the server are blocked.

To fix this, go to Chrome > Manage Extensions and copy the Extension ID shown under Historian. Then open server.py and replace the Extension ID on line 37 with your own.
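To see why the ID matters, recall that requests from a Chrome extension carry an origin of the form chrome-extension://<Extension ID>. The sketch below shows the kind of comparison an allow-list performs. The helper names and the placeholder IDs are assumptions for illustration; how server.py actually configures CORS is not shown in this README.

```python
def extension_origin(extension_id: str) -> str:
    """Build the origin Chrome uses for requests made by an extension."""
    return f"chrome-extension://{extension_id}"


def is_allowed(request_origin: str, allowed_extension_id: str) -> bool:
    """Return True only when the request comes from the expected extension."""
    return request_origin == extension_origin(allowed_extension_id)


# Placeholder IDs, not real extension IDs.
configured_id = "abcdefghijklmnopabcdefghijklmnop"
your_id = "ponmlkjihgfedcbaponmlkjihgfedcba"

# A mismatch between the configured ID and your ID means the request is refused.
print(is_allowed(extension_origin(your_id), configured_id))
```

Updating the configured ID to match the one Chrome assigned to your unpacked copy makes the two origins equal again.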

The server failed to create a graph due to a dimensional mismatch

If the server throws an error saying it failed to create the dendrogram due to a mismatch in dimensions, this is likely because either the webpages could not be parsed or the webpages had the same titles. Currently, Historian can only analyze distinct webpages (i.e. webpages with unique titles).

To fix this, you can either try to visit some different webpages or you can simply run the demo script to get some presampled webpages to use.
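One way to guard against this kind of mismatch is to filter the scraped pages before clustering. The sketch below keeps only pages with non-empty text and a title not seen before. It assumes each page is a dict with "title", "text", and "url" keys, which is a simplification made for this example and not necessarily how Historian stores documents.

```python
def distinct_by_title(docs):
    """Keep the first document seen for each title, dropping empty ones."""
    seen = set()
    kept = []
    for doc in docs:
        title = doc["title"].strip()
        # Skip pages that failed to parse (no title or no text) and repeats.
        if not title or not doc["text"].strip() or title in seen:
            continue
        seen.add(title)
        kept.append(doc)
    return kept
```

Clustering needs at least two documents, so it is worth checking the length of the filtered list before going further.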

🛠 Build

Overview

Historian defines several modules to facilitate its functionality. The table below provides a high-level overview of these modules with links to their respective code and documentation.

| Module | Purpose | Documentation |
| --- | --- | --- |
| Document | Represent webpages as documents | Link |
| WebScraper | Extract webpage text data | Link |
| HierarchicalClustering | Perform agglomerative hierarchical clustering | Link |
| Dendrogram | Visualize hierarchical clusters | Link |
| Frontend | Enable user functionality | Link |

Document

A class to represent scraped webpages as documents

src.webScraping.document.Document(self, title, text, url)

Parameters

  • self Document : The Document object
  • title str : The title of the webpage
  • text str : The text data of the webpage
  • url str : The url of the webpage

Methods


None
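For illustration, the documented constructor could be modeled as a small dataclass. This is only a sketch: the actual class in src.webScraping.document may be written differently, and the frozen dataclass form is an assumption made here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Document:
    """A scraped webpage: its title, extracted text, and source URL."""
    title: str
    text: str
    url: str


# Example values are made up for this sketch.
doc = Document(
    title="Best snack foods",
    text="chips popcorn pretzels",
    url="https://example.com/snacks",
)
print(doc.title)
```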


WebScraper

A class for extracting text data from webpages

src.webScraping.webScraper.WebScraper(self)

Parameters

  • self WebScraper : The WebScraper object

Methods


getWebpageText( self, response )

Extracts and preprocesses the data from a webpage from a given requests.Response object

Parameters

  • self WebScraper : The WebScraper object
  • response requests.Response : A requests.Response object for a given URL

Returns str


scrapeWebpages( self, urls )

Extracts text data from the webpage(s) at the given URLs and saves each page's text as a string in the WebScraper.corpus hashmap

Parameters

  • self WebScraper : The WebScraper object
  • urls list : A list of URLs for the webpages you want to scrape

Returns dict
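The text-extraction step can be sketched with Python's standard html.parser module. Historian lists BeautifulSoup among its dependencies, so this is a swapped-in, stdlib-only stand-in and not the project's implementation; the TextExtractor class and extract_text function are hypothetical names.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects the page title and visible text, skipping script/style content."""

    SKIPPED = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__()
        self.title = ""
        self.chunks = []
        self._stack = []  # open <title> or skipped tags we are currently inside

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED or tag == "title":
            self._stack.append(tag)

    def handle_endtag(self, tag):
        if self._stack and self._stack[-1] == tag:
            self._stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self._stack and self._stack[-1] == "title":
            self.title = text
        elif not self._stack:
            # Only keep text that is outside <title> and skipped tags.
            self.chunks.append(text)


def extract_text(html: str) -> tuple[str, str]:
    """Return (title, visible text) for an HTML string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.title, " ".join(parser.chunks)
```

In a scraper, the HTML string would come from the body of the HTTP response for each URL, and the returned title and text would populate one document per page.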


HierarchicalClustering

A class to perform agglomerative hierarchical clustering with average link for a collection of webpages

src.graphing.hierarchicalClustering.HierarchicalClustering(self)

Parameters

  • self HierarchicalClustering : The HierarchicalClustering object

Methods


preprocess( self, text )

Preprocesses text data from a document by performing normalization, tokenization, and lemmatization

Parameters

  • self HierarchicalClustering : The HierarchicalClustering object
  • text str : The text data from a webpage document

Returns str
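A stdlib-only sketch of this kind of preprocessing is shown below. Historian lists NLTK among its dependencies, presumably for lemmatization; the crude_stem function here is a deliberately rough suffix-stripping stand-in for a real lemmatizer, and both function names are assumptions made for this example.

```python
import re


def crude_stem(token: str) -> str:
    """Very rough stand-in for lemmatization: strip a few common suffixes."""
    for suffix in ("ing", "es", "s"):
        # Keep at least three characters so short words are left alone.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text: str) -> str:
    """Normalize, tokenize, and reduce tokens, returning a space-joined string."""
    normalized = text.lower()                   # normalization
    tokens = re.findall(r"[a-z]+", normalized)  # tokenization: letters only
    return " ".join(crude_stem(t) for t in tokens)


print(preprocess("Top 10 Careers for Computer Science Majors!"))
```

Digits and punctuation are dropped by the tokenizer, and plural forms such as "careers" and "majors" are reduced so that they match their singular forms in other documents.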


preprocess_docs( self, docs )

Preprocesses text for a collection of documents by performing normalization, tokenization, and lemmatization

Parameters

  • self HierarchicalClustering : The HierarchicalClustering object
  • docs list : A list of the documents to preprocess

Returns list


extract_features( self, docs )

Applies Term Frequency (TF) - Inverse Document Frequency (IDF) weighting to a set of (processed) documents

Parameters

  • self HierarchicalClustering : The HierarchicalClustering object
  • docs list : A list of the processed documents you want to analyze

Returns numpy.ndarray
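To make the weighting concrete, here is a pure-Python sketch of TF-IDF with a smoothed IDF term and L2-normalized rows. It is an illustration under assumptions: the exact formula and the library call Historian uses are not stated in this README, and the tfidf function name is hypothetical.

```python
import math
from collections import Counter


def tfidf(docs: list[str]) -> tuple[list[str], list[list[float]]]:
    """Return (vocabulary, matrix) where matrix[i][j] weights term j in doc i."""
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({term for tokens in tokenized for term in tokens})
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed IDF so a term found in every document still gets a small weight.
    idf = {term: math.log((1 + n_docs) / (1 + df[term])) + 1 for term in vocab}
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)
        row = [counts[term] * idf[term] for term in vocab]
        # Scale each row to unit length so document length does not dominate.
        norm = math.sqrt(sum(w * w for w in row))
        matrix.append([w / norm if norm else 0.0 for w in row])
    return vocab, matrix
```

Terms that appear in fewer documents receive larger weights, so rare, topic-specific words contribute more to the similarity between two pages than words found everywhere.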


create_hierarchical_cluster( self, tfidf_matrix )

Performs hierarchical/agglomerative clustering for a TF-IDF weighted matrix of text data from a collection of documents using Average-Link

Parameters

  • self HierarchicalClustering : The HierarchicalClustering object
  • tfidf_matrix numpy.ndarray : A TF-IDF weighted matrix of text data

Returns numpy.ndarray
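Average-link agglomerative clustering can be sketched in a few lines of plain Python. The version below takes a precomputed square distance matrix and repeatedly merges the two clusters with the smallest average pairwise distance. It is a simplified illustration, not the project's code: the average_link name and the merge-list return format are assumptions, and the real method returns a numpy.ndarray.

```python
def average_link(distances: list[list[float]]) -> list[tuple[list[int], list[int], float]]:
    """Agglomerative clustering with average linkage over a square distance matrix.

    Returns the merges in order as (members_a, members_b, average_distance).
    """
    clusters = [[i] for i in range(len(distances))]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Average distance over every cross-cluster pair of documents.
                total = sum(distances[a][b] for a in clusters[i] for b in clusters[j])
                avg = total / (len(clusters[i]) * len(clusters[j]))
                if best is None or avg < best[0]:
                    best = (avg, i, j)
        avg, i, j = best
        merges.append((clusters[i], clusters[j], avg))
        merged = clusters[i] + clusters[j]
        # Remove j first so index i stays valid (j > i).
        del clusters[j]
        del clusters[i]
        clusters.append(merged)
    return merges
```

The order of the merges and the distance recorded at each one are exactly the information a dendrogram draws: early, low-distance merges join closely related pages, and later merges join broader topics.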


create_dendrogram( self, cluster, docs )

Creates a dendrogram to visualize a hierarchical/agglomerative cluster

Parameters

  • self HierarchicalClustering : The HierarchicalClustering object
  • cluster numpy.ndarray : The hierarchical cluster of the data
  • docs list : A list of the original documents

Returns Dendrogram

Dendrogram

A class to visualize a hierarchical clustering of webpages

src.graphing.dendrogram.Dendrogram(self, cluster, docs)

Parameters

  • self Dendrogram : The Dendrogram object
  • cluster numpy.ndarray : The hierarchical cluster of the data
  • docs list : A list of the original documents

Methods


create( self )

Creates a dendrogram figure for a hierarchical/agglomerative cluster

Parameters

  • self Dendrogram : The Dendrogram object

Returns plotly.graph_objs.Figure


create_lines( self )

Creates the lines representing the relationships between nodes in a dendrogram

Parameters

  • self Dendrogram : The Dendrogram object

Returns None


create_nodes( self )

Creates the nodes representing the documents in a dendrogram

Parameters

  • self Dendrogram : The Dendrogram object

Returns None


create_layout( self )

Creates the layout of the dendrogram

Parameters

  • self Dendrogram : The Dendrogram object

Returns None


Frontend

Enables user functionality by sending data to the server

Methods


fetchHistory( )

Uses the Google Chrome history API to fetch the user's recent browsing history

Parameters

None

Returns Response


checkAvailability( )

Checks the server to see if the preprocessing has been done so it can fetch the graph

Parameters

None

Returns boolean


getGraph( )

Loads the graph created by the server into the frontend for the user to see

Parameters

None

Returns boolean


sendURLSToServer( )

Uses an API call to send the user's browsing history to the server for processing

Parameters

None

Returns data.message


showSpinner( )

Shows a spinner while the page loads

Parameters

None

Returns None


hideSpinner( )

Hides the spinner after a page has finished loading

Parameters

None

Returns None


💙 Contributors

"⭐️" denotes Team Leader

| Name | NetID/Email | Contributions |
| --- | --- | --- |
| Blake McBride ⭐️ | [email protected] | Created Document class; created WebScraper class; created HierarchicalClustering class; created Dendrogram class; configured agglomerative hierarchical clustering algorithm; designed webscraping logic; wrote visualization logic; configured local server; created all functions for and designed frontend; implemented API calls from frontend to server; created setup and demo scripts; designed logo(s); wrote setup instructions; wrote documentation; designed and wrote README; wrote, edited, and produced video presentation. |
| Kaushal Dadi | [email protected] | Created manifest.json; put iframe in HTML to show graph on webpage; built preliminary frontend. |
| Rohan Parekh | [email protected] | Helped Kaushal create manifest.json and the Chrome extension that displayed the graph on the webpage. |
| Megha Chada | [email protected] | Changed colors for graph; added comments to code; added title, timestamp, and description to graph; created architectural diagram. |
| Michael Ma | [email protected] | Added unfinished topic labels to the graph. |