
🐈 Mister Meow 🐈

📹 Video Demo

mister-meow.mp4

📈 Performance

Important

  1. 🕷 Crawler: 1000 pages in 1m12s with 64 threads
  2. 📓 Indexer: 1000 pages in 47s with 50 threads
  3. 🔎 Search: timing is not yet stable enough to report; the ranker is the main candidate for improvement.

✨ Features

  • MeowCrawler: crawl the web and insert the data into the database.
    • multi-threading
    • multi-level host priority queue
    • handles robots.txt
    • URL hashing and content hashing to prevent duplicate content
    • URL filtering
    • URL normalization
    • seeding with a list of URLs
    • incremental crawling: can be paused and resumed
    • creates a sitemap graph for the ranking algorithm
  • MeowIndexer: tokenize and index the crawled data.
    • multi-threading
    • stores tokens in an inverted index collection
    • records the TF and positions of each token
    • handles stemming (Porter Stemmer); note that exact tokens must still get higher priority than stems
    • handles stop words
    • incremental indexing: can be paused and resumed
  • MeowRanker: search the indexed data.
    • search for the query in the inverted index
    • uses the Google PageRank algorithm to score page popularity
    • ranks the results with TF-IDF
    • phrase matching
    • higher rank bonus for exact matches than for stems
    • higher rank bonus for words in important tags like title, h1, h2, etc.
  • MeowEngine: query engine and server.
    • RESTful API
    • snippet generation
    • search suggestions and history
    • query parsing
    • phrase matching queries
    • AND, OR, NOT operators in queries
    • stop words and stemming
    • pagination
    • cache
  • MeowApp: web application.
    • custom theming with 4 themes (light, dark, rose, and black)
    • powerful search bar and suggestion components
    • fancy pagination element
    • navigation and data loading with React Router 6
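
The crawler features above mention URL normalization plus URL and content hashing to avoid duplicates. The following is only a minimal sketch of how those pieces could fit together, not the project's actual code: the class `UrlDeduplicator`, its method names, the specific normalization rules, and the choice of SHA-256 are all assumptions made for illustration.

```java
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Locale;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical helper, not taken from the Mister Meow source.
class UrlDeduplicator {
    // Concurrent sets so that several crawler threads can share one instance.
    private final Set<String> seenUrlHashes = ConcurrentHashMap.newKeySet();
    private final Set<String> seenContentHashes = ConcurrentHashMap.newKeySet();

    /**
     * Assumed normalization rules: lower-case scheme and host, resolve "." and "..",
     * drop the fragment and default ports, and trim a trailing slash.
     */
    static String normalize(String rawUrl) {
        URI uri = URI.create(rawUrl.trim()).normalize();
        if (uri.getScheme() == null || uri.getHost() == null) {
            throw new IllegalArgumentException("Expected an absolute URL: " + rawUrl);
        }
        String scheme = uri.getScheme().toLowerCase(Locale.ROOT);
        String host = uri.getHost().toLowerCase(Locale.ROOT);
        int port = uri.getPort();
        boolean defaultPort = port == -1
                || (scheme.equals("http") && port == 80)
                || (scheme.equals("https") && port == 443);
        String path = uri.getRawPath();
        if (path == null || path.isEmpty()) {
            path = "/";
        } else if (path.length() > 1 && path.endsWith("/")) {
            path = path.substring(0, path.length() - 1);
        }
        String query = uri.getRawQuery() == null ? "" : "?" + uri.getRawQuery();
        return scheme + "://" + host + (defaultPort ? "" : ":" + port) + path + query;
    }

    /** Hex-encoded SHA-256 digest of the given text. */
    static String sha256(String text) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            byte[] bytes = digest.digest(text.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : bytes) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 should be available on every JVM", e);
        }
    }

    /** Returns true only the first time a URL is seen after normalization. */
    boolean markUrl(String rawUrl) {
        return seenUrlHashes.add(sha256(normalize(rawUrl)));
    }

    /** Returns true only the first time a page body is seen. */
    boolean markContent(String pageBody) {
        return seenContentHashes.add(sha256(pageBody));
    }
}
```

With this sketch, a crawler thread would call `markUrl` before fetching and `markContent` after fetching, skipping the page whenever either call returns `false`. Hashing the normalized form means that `https://example.com/docs/` and `HTTPS://EXAMPLE.com:443/docs#top` count as the same URL.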

🤔 System Design

Basic System Components

[Image: System Design]

Indexer DB Design

[Image: Indexer DB Design]

Build Inverted Index Algorithm

[Image: Build Index]
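
As a rough illustration of what building an inverted index with TF and token positions can look like, here is a small in-memory sketch. It is not the project's implementation: the project stores its index in MongoDB and applies Porter stemming, both of which are left out here. The class `TinyInvertedIndex`, the tiny stop-word list, the TF definition (occurrences divided by kept tokens), and the `ln(N / df)` form of IDF are all assumptions chosen to keep the example short.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical in-memory sketch; stemming and database storage are intentionally omitted.
class TinyInvertedIndex {
    /** Where a token occurs inside one document, plus its term frequency. */
    static final class Posting {
        final List<Integer> positions = new ArrayList<>();
        double tf; // occurrences divided by the number of non-stop-word tokens
    }

    // Illustrative stop-word list only; a real list is much longer.
    private static final Set<String> STOP_WORDS = Set.of("a", "an", "the", "is", "of", "and");

    // token -> (document id -> posting)
    private final Map<String, Map<String, Posting>> index = new HashMap<>();
    private final Set<String> docIds = new HashSet<>();

    void addDocument(String docId, String text) {
        Map<String, Posting> local = new HashMap<>();
        int position = 0;
        int kept = 0;
        for (String word : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (word.isEmpty()) {
                continue;
            }
            int here = position++; // stop words still advance the position counter
            if (STOP_WORDS.contains(word)) {
                continue;
            }
            kept++;
            local.computeIfAbsent(word, k -> new Posting()).positions.add(here);
        }
        docIds.add(docId);
        // Merge this document's postings into the global index.
        for (Map.Entry<String, Posting> entry : local.entrySet()) {
            Posting posting = entry.getValue();
            posting.tf = (double) posting.positions.size() / kept;
            index.computeIfAbsent(entry.getKey(), k -> new HashMap<>()).put(docId, posting);
        }
    }

    /** All postings for a token, keyed by document id (empty if the token is unknown). */
    Map<String, Posting> lookup(String token) {
        return index.getOrDefault(token.toLowerCase(Locale.ROOT), Map.of());
    }

    /** TF-IDF of a token in one document, using idf = ln(N / df). */
    double tfIdf(String token, String docId) {
        Map<String, Posting> postings = lookup(token);
        Posting posting = postings.get(docId);
        if (posting == null) {
            return 0.0;
        }
        double idf = Math.log((double) docIds.size() / postings.size());
        return posting.tf * idf;
    }
}
```

Keeping positions alongside TF is what makes phrase matching possible later: two tokens form a phrase in a document when their positions differ by exactly one. A ranker could then add bonuses (for exact matches or important tags) on top of the `tfIdf` score.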


🔨 Technologies

  • Java
  • Gradle
  • MongoDB
  • Spring Boot - for the server only
  • Frontend
    • React
    • TypeScript
    • Tailwind CSS
    • React Router 6

🚀 Quick Start

Prerequisites

  • Java 11
  • Gradle
  • MongoDB
  • Node

Note

To install Java and Gradle, see the Java setup document. To install MongoDB, see the mongo setup document.

Installation

  1. Clone the repository
       git clone <repo-url>
  2. Install the dependencies
       cd Mister-Meow
       cd mistermeow
       gradle build
  3. To run the crawler
       sudo systemctl start mongod # only has to be done once
       gradle crawl
  4. To run the indexer
       gradle index
  5. To run the server
       gradle engine
  6. To install and run the web application
       cd app/src/meowapp
       npm install
       npm run dev

Contributions

Please check the following documents before contributing: