Nichirin: A Retrieval-Augmented Generation framework combined with a web crawler


Overview

Nichirin is a layer on top of Apache Solr that streamlines data indexing.

  1. What is Nichirin?

    • Nichirin acts as a layer on top of Apache Solr, making data indexing a breeze.
    • It abstracts away the complexities of Solr indexing, letting users focus on providing their data without worrying about the nitty-gritty details.
  2. Key Features:

    • Multi-level Crawling: Performs multi-level web crawling using a depth-first search strategy, with text indexing and retrieval handled by Apache Solr.
    • Efficient Indexing: Integrates Apache Spark for parallel processing of URLs, improving the scalability and efficiency of both web crawling and text indexing.
    • Python Package: Available as a Python package on PyPI for easy installation and integration.
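
To make the abstraction concrete, this is roughly the raw Solr request that Nichirin wraps for you. It is a generic Apache Solr select query, not part of Nichirin's own API; the host, port, core name (my_core), and query text are placeholder assumptions:

# A plain Apache Solr select query (the kind of call Nichirin abstracts away).
# "my_core" is a hypothetical core name; adjust host/port to your Solr install.
curl "http://localhost:8983/solr/my_core/select?q=text:solr&rows=5&wt=json"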

Setup

# Option 1: read-only install; recommended if you will use the package as-is
pip install git+https://github.com/sgowdaks/nichirin

# Option 2: editable install; recommended if you'd like to modify the code
git clone https://github.com/sgowdaks/nichirin
cd nichirin
pip install -e .

# Option 3: install from PyPI
pip install nichirin
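
To sanity-check the install, a quick import works, assuming the package exposes a top-level nichirin module (the module name here is inferred from the package name):

# Verify the package is importable; prints its install location.
python -c "import nichirin; print(nichirin.__file__)"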

Commands

  • install-solr to install Solr
  • create-core --core <core name> to create a Solr core
  • partition-data --path <path to the dataset> to partition the data
  • pipeline --path <path to the dataset> to generate embeddings for the partitioned data
  • index-solr --data-path <path to dataset> --core <core name> to index the data into that core
  • query-solr --input_sen <input sentence> --core_name <core name> to query the data from Solr
  • seed-urls --core <core name> --urls <URLs separated by commas> to add the seed URLs
  • start-crawler to start the web crawler
  • start-serve to start the web server
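
As a concrete illustration, the indexing-related commands might be chained like this; every path, core name, and query sentence below is a placeholder, not a value shipped with Nichirin:

# Hypothetical indexing session; paths and names are placeholders.
create-core --core my_core
partition-data --path data/docs
pipeline --path data/docs
index-solr --data-path data/docs --core my_core
query-solr --input_sen "what is retrieval augmented generation" --core_name my_core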

Quickstart

  1. Begin by executing the install-solr command to install the Solr application.
  2. Next, create the cores using the create-core command.
  3. After setting up Solr and creating the cores, add seed URLs by running the seed-urls command.
  4. Once the seed URLs are added, initiate the crawling process with the start-crawler command. Be patient, as this step may take some time.
  5. Finally, to view the results, launch the Flask web app using the start-serve command.

This starts a service on http://127.0.0.1:5000 by default.
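
Put together, a minimal crawl-to-serve session might look like the sketch below; the core name and seed URLs are placeholders:

# Hypothetical crawl-and-serve session; core name and URLs are placeholders.
install-solr
create-core --core crawl_core
seed-urls --core crawl_core --urls https://example.com,https://example.org
start-crawler    # may take a while depending on crawl depth
start-serve      # Flask app at http://127.0.0.1:5000 by default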

Contributing and Feedback: We welcome contributions! If you’d like to enhance Nichirin, feel free to submit a pull request. For bug reports, feedback, or questions, open an issue on our GitHub repository.