A neural DB based google drive search engine
- Wispher
!git clone https://github.com/ggerganov/whisper.cpp.git
%cd /content/whisper.cpp/models
!bash download-ggml-model.sh base.en
%cd ..
!make
- ffmpeg (apt)
pydub
thirdai
thirdai[neural_db]
langchain
openai
paper-qa
requests
bs4
pandas
-
book_web_Scrapping:
To Download the books from the web (NCERT)
-
Thirdai_gdrive_engine_TEGD.ipynb
main script to run the search engine
To run the search engine:
- Open the notebook with google colab
- mount the drive by executing the very first cell
from google.colab import drive
drive.mount('/content/drive')
We are having a user input option in the notebook, as the extraction and training of audio file is time consuming.
Please give input as
y
to the cell if you want to train your G-Drive's audio files.
mp3_train = False
print("Do you want to fetch and train mp3 files too?")
print("It will take a lot of time. You can skip it for now")
if 'y' in input("Press 'y' to extract audio files else press 'n'").lower():
mp3_train = True
By running the notebook Thirdai_gdrive_engine_TEGD.ipynb
- The drive will be mounted to the notebook
- Get all the files in your G-Drive
Currently we are working on the
pdfs
,docx
,mp3
andwav
files. The other file format like allsource code
file andmp4
file will be added soon.
-
For processing audio files
- Download ggml-whisper which is a really fast version of whisper written in C/C++
- Convert an audio file with any format to wav (the only format currently supported by ggml-whisper)
- Convert the transcription to csv and save for model training later
-
We are storing a map for each audio_csv file to its original localtion in the drive. So that when we query from the engine, we can redirect the output to original location.
-
Get the
General_QnA
model from thebazaar
-
PREPARE INSERTABLE DOCS
- from pdfs
- docx
- mp3 (created earlier)
-
SEARCH THE QUERY (from user input), and you will get the exact location of the file in the drive.
https://drive.google.com/drive/folders/1tNJ-r6-URDAcU0VOZJ_h50hs4PrO5EkN?usp=sharing
Note: We are not submitting the model, as our main Moto is to keep data private. You can just train your model on your own personal data.
#POCKETLLM