Nichirin serves as an advanced layer atop Apache Solr, facilitating seamless data indexing operations.
What is Nichirin?
- Nichirin is a layer on top of Apache Solr that simplifies data indexing.
- It abstracts away the complexities of Solr indexing, allowing users to focus on providing their data without worrying about the nitty-gritty details.
Key Features:
- Multi-level Crawling: Crawls the web to multiple levels using depth-first search, with text indexing and retrieval handled by Apache Solr.
- Efficient Indexing: Integrates Apache Spark to process URLs in parallel, improving the scalability and efficiency of both web crawling and text indexing.
- Python package: Available as a Python package on PyPI for easy installation and integration.
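To make the multi-level, depth-first crawling idea concrete, here is a minimal sketch. It is not Nichirin's actual crawler: the function name `dfs_crawl`, the `fetch_links` callback, and the in-memory `LINKS` graph are all hypothetical stand-ins, and no HTTP requests, Spark jobs, or Solr calls are made.

```python
from typing import Callable, Dict, Iterable, List, Set


def dfs_crawl(
    seeds: Iterable[str],
    fetch_links: Callable[[str], Iterable[str]],
    max_depth: int,
) -> List[str]:
    """Visit pages depth-first, following links at most max_depth levels below a seed."""
    visited: Set[str] = set()
    order: List[str] = []

    def visit(url: str, depth: int) -> None:
        # Skip pages already seen and pages beyond the depth limit.
        if url in visited or depth > max_depth:
            return
        visited.add(url)
        order.append(url)  # a real crawler would hand the page text to the indexer here
        for link in fetch_links(url):
            visit(link, depth + 1)  # go deeper before moving to the next sibling

    for seed in seeds:
        visit(seed, 0)
    return order


# Hypothetical link graph standing in for real pages (assumption, for illustration only).
LINKS: Dict[str, List[str]] = {
    "a": ["b", "c"],
    "b": ["d"],
    "c": ["d"],
    "d": ["e"],
}

if __name__ == "__main__":
    print(dfs_crawl(["a"], lambda url: LINKS.get(url, []), max_depth=2))
```

The depth limit bounds how far the crawl wanders from each seed, and the `visited` set keeps pages that are linked from several places from being processed twice.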
# Option 1: install as read only; recommended to use as is
pip install git+https://github.com/sgowdaks/nichirin
# Option 2: install for editable mode; recommended if you'd like to modify code
git clone https://github.com/sgowdaks/nichirin
cd nichirin
pip install -e .
# Option 3: install from PyPI
pip install nichirin
Commands:
- install-solr: installs Solr.
- create-core --core <core name>: creates a Solr core.
- partition-data --path <path to the dataset>: partitions the data.
- pipeline --path <path to the dataset>: generates embeddings of the partitioned data.
- index-solr --data-path <path to dataset> --core <core to which the data needs to be sent>: indexes the data.
- query-solr --input_sen <input sen> --core_name <core name to query from>: queries the data from Solr.
- seed-urls --core <core name> --urls <urls separated with commas>: adds the seed URLs.
- start-crawler: starts the web crawler.
- start-serve: starts the web server.
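If you want to inspect a core directly, Solr also exposes a standard HTTP /select handler. The sketch below only builds such a URL. It is not how query-solr is implemented: the helper name `solr_select_url` is hypothetical, and the base URL assumes a local Solr on its default port 8983, which may differ from your setup.

```python
from urllib.parse import urlencode


def solr_select_url(
    core: str,
    query: str,
    rows: int = 10,
    base_url: str = "http://localhost:8983/solr",  # assumption: local Solr, default port
) -> str:
    """Build a URL for Solr's standard /select request handler."""
    params = urlencode({"q": query, "rows": rows, "wt": "json"})
    return f"{base_url}/{core}/select?{params}"


if __name__ == "__main__":
    # "my_core" is a placeholder; use the name you passed to create-core.
    print(solr_select_url("my_core", "apache solr"))
```

Opening the printed URL in a browser, or fetching it with any HTTP client, should return matching documents as JSON if the core exists and contains data.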
- Begin by executing the install-solr command to install the Solr application.
- Next, create the cores using the create-core command.
- After setting up Solr and creating the cores, add seed URLs by running the seed-urls command.
- Once the seed URLs are added, initiate the crawling process with the start-crawler command. Be patient, as this step may take some time.
- Finally, to view the results, launch the Flask web app using the start-serve command.
This starts a service on http://127.0.0.1:5000 by default.
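The crawl workflow can also be scripted. The following is a minimal sketch that assumes the Nichirin commands are on your PATH. The helper names `crawl_workflow` and `run` are hypothetical and not part of the package, and the core name and URL are placeholders. By default it only prints the commands (dry run) so you can check them first.

```python
import shlex
import subprocess
from typing import List


def crawl_workflow(core: str, urls: List[str]) -> List[List[str]]:
    """Return the documented commands, in order, as argument lists."""
    return [
        ["install-solr"],
        ["create-core", "--core", core],
        ["seed-urls", "--core", core, "--urls", ",".join(urls)],
        ["start-crawler"],
        ["start-serve"],
    ]


def run(commands: List[List[str]], dry_run: bool = True) -> List[str]:
    """Print each command; execute it only when dry_run is False."""
    rendered = []
    for argv in commands:
        line = shlex.join(argv)
        rendered.append(line)
        print(line)
        if not dry_run:
            subprocess.run(argv, check=True)  # stop at the first failing step
    return rendered


if __name__ == "__main__":
    # Placeholder core name and seed URL (assumptions for illustration).
    run(crawl_workflow("my_core", ["https://example.com"]))
```

Pass `dry_run=False` to actually execute the steps. Crawling may take a while, and start-serve presumably keeps running until you stop it, so it is placed last.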
Contributing and Feedback: We welcome contributions! If you’d like to enhance Nichirin, feel free to submit a pull request. To report issues, give feedback, or ask questions, open an issue on our GitHub repository.