
Commit

Deployed 04fc368 with MkDocs version: 1.5.3
ossirytk committed Mar 24, 2024
1 parent a75d3eb commit 3e4023e
Showing 9 changed files with 186 additions and 33 deletions.
7 changes: 6 additions & 1 deletion configs/index.html
@@ -106,7 +106,7 @@
<div role="main" class="document" itemscope="itemscope" itemtype="http://schema.org/Article">
<div class="section" itemprop="articleBody">

<h3 id="configs">Configs</h3>
<h3 id="basic-configs">Basic Configs</h3>
<p>You can change the configuration settings in the .env file.</p>
<p>The available embeddings are llama, spacy and huggingface. Make sure that the config for the chat matches the embeddings that were used to create the chroma collection.</p>
<p>VECTOR_K sets how many documents the vector store returns. You might need to change this based on your context size and the vector store chunk size. BUFFER_K is the size of the conversation buffer; the prompt will include the last K question-answer pairs. Large VECTOR_K and BUFFER_K values can overfill the prompt. The default character card is Skynet_V2.png. This is just a basic template.</p>
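<p>As a rough illustration, a minimal .env sketch could look like the following. The key names below are assumptions based on the settings described above, not the authoritative names; check the .env shipped with the repository.</p>
<pre><code># Hypothetical .env sketch - key names are illustrative only
# The embeddings type must match the embeddings used to build the chroma collection
EMBEDDINGS_TYPE=spacy
# How many documents the vector store returns
VECTOR_K=4
# How many question-answer pairs the conversation buffer keeps in the prompt
BUFFER_K=3
# The default character card template
CHARACTER_CARD=Skynet_V2.png
</code></pre>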
@@ -216,6 +216,11 @@ <h3 id="configs">Configs</h3>
</tr>
</tbody>
</table>
<h3 id="general-configs">General Configs</h3>
<p>Other configs are found in the run_files folder. These include web scrape configs, NER parse configs and filter configs. </p>
<p>The filters folder defines the general web scrape filters used to clean the documents. The filter files use regex and can easily be modified to add extra filtering.</p>
<p>Parse_configs defines the expected csv column structure and NER type parsing. This includes noun ngrams, entities, noun chunks and the parse type.</p>
<p>Web scrape configs define the web pages for a scrape. This is convenient if you want to scrape multiple pages.</p>
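<p>As an example, a web scrape config could be little more than a json list of pages to fetch, along the lines of the sketch below. The field name and URLs are hypothetical; see the example files in run_files/web_scrape_configs for the actual format.</p>
<pre><code>{
    "pages": [
        "https://example.com/wiki/Skynet",
        "https://example.com/wiki/John_Connor"
    ]
}
</code></pre>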

</div>
</div><footer>
19 changes: 11 additions & 8 deletions creating_embeddings/index.html
@@ -124,10 +124,13 @@ <h3 id="creating-embeddings">Creating embeddings</h3>
NOTE: Textacy parsing will create a key file in key_storage that can be used by text parsing. Json files will create keys automatically if they are present in the json file.</p>
<pre><code>python -m document_parsing.textacy_parsing --collection-name skynet --embeddings-type spacy
</code></pre>
<p>Parse the documents with</p>
<pre><code>python -m document_parsing.parse_text_documents --embeddings-type llama
python -m document_parsing.parse_text_documents --collection-name skynet2 --embeddings-type spacy
python -m document_parsing.parse_json_documents --embeddings-type spacy
<p>Parse csv to text with</p>
<pre><code>python -m document_parsing.parse_csv_to_text
</code></pre>
<p>Parse the documents with </p>
<pre><code>python -m document_parsing.parse_text_documents
python -m document_parsing.parse_text_documents
python -m document_parsing.parse_json_documents
</code></pre>
<p>You can test the embeddings with</p>
<pre><code>python -m document_parsing.test_embeddings --collection-name skynet --query &quot;Who is John Connor&quot; --embeddings-type llama
@@ -137,26 +140,26 @@ <h3 id="creating-embeddings">Creating embeddings</h3>
<table>
<thead>
<tr>
<th>Optional param</th>
<th>Optional params</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>--data-directory</td>
<td>The directory where your text files are stored. Default "./documents/skynet"</td>
<td>The directory where your text files are stored. Default "./run_files/documents/skynet"</td>
</tr>
<tr>
<td>--collection-name</td>
<td>The name of the collection. Default "skynet"</td>
</tr>
<tr>
<td>--persist-directory</td>
<td>The directory where you want to store the Chroma collection. Default "./character_storage/"</td>
<td>The directory where you want to store the Chroma collection. Default "./run_files/character_storage/"</td>
</tr>
<tr>
<td>--key-storage</td>
<td>The directory for the collection metadata keys. Needs to be created with textacy parsing. Default "./key_storage/"</td>
<td>The directory for the collection metadata keys. Needs to be created with textacy parsing. Default "./run_files/key_storage/"</td>
</tr>
<tr>
<td>--chunk-size</td>
2 changes: 1 addition & 1 deletion getting_started/index.html
@@ -114,7 +114,7 @@
python -m spacy download en_core_web_lg
playwright install
</code></pre>
<p>You will need playwright for webscraping and the spacy models for text embeddings if you do not use llama-cpp embeddings. These are not needed for running the chatbot itself.</BR></p>
<p>You will need spacy models for text embeddings if you do not use llama-cpp embeddings. Playwright is used by the old webscrape scripts. These are not needed for running the chatbot itself.</BR></p>
<p>You also might want to run llama-cpp with gpu acceleration like cuda. See <a href="https://github.com/abetlen/llama-cpp-python">llama-cpp-python</a> for specifics. Then run:</p>
<pre><code>$env:FORCE_CMAKE=1
$env:CMAKE_ARGS=&quot;-DLLAMA_CUBLAS=on&quot;
2 changes: 1 addition & 1 deletion index.html
@@ -161,5 +161,5 @@ <h1 id="llama-cpp-chat-memory">llama-cpp-chat-memory</h1>

<!--
MkDocs version : 1.5.3
Build Date UTC : 2024-02-24 08:46:36.839546+00:00
Build Date UTC : 2024-03-24 11:33:42.153859+00:00
-->
152 changes: 148 additions & 4 deletions named_entity_recognition/index.html
@@ -107,9 +107,153 @@
<div class="section" itemprop="articleBody">

<h3 id="named-entity-recognitionner">Named Entity Recognition(NER)</h3>
<p>You can use textacy_parsing script for generating document metadata keys automatically. The scripts are a modified version of textacy code updated to run with the current spacy version. The script uses a spacy embeddings model to process a text document for a json metadata keyfile. The include positions are: "PROPN", "NOUN", "ADJ". The includes entities are: "PRODUCT", "EVENT", "FAC", "NORP", "PERSON", "ORG", "GPE", "LOC", "DATE", "TIME", "WORK_OF_ART". For details see <a href="https://spacy.io/usage/linguistic-features">Spacy linguistic features</a> and <a href="https://spacy.io/models/en">Model NER labels</a>. The instructions expect en model, but spacy supports a wide range of models.</p>
<p>You can use the textacy_parsing script to generate document metadata keys automatically. The scripts are a modified version of textacy code updated to run with the current spacy version. The script uses a spacy embeddings model to process a text document into a json metadata keyfile. The keys are parsed based on a config file in run_files/parse_configs/ner_types.json or run_files/parse_configs/ner_types_full.json. You can give your own config file if you want. A hypothetical sketch of such a config is shown after the tables below.</p>
<p>The available configs are</p>
<table>
<thead>
<tr>
<th>Ngrams</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>PROPN</td>
<td>Proper Noun</td>
</tr>
<tr>
<td>NOUN</td>
<td>Noun</td>
</tr>
<tr>
<td>ADJ</td>
<td>Adjective</td>
</tr>
<tr>
<td>NNP</td>
<td>Noun proper singular</td>
</tr>
<tr>
<td>NN</td>
<td>Noun, singular or mass</td>
</tr>
<tr>
<td>AUX</td>
<td>Auxiliary</td>
</tr>
<tr>
<td>VBZ</td>
<td>Verb, 3rd person singular present</td>
</tr>
<tr>
<td>VERB</td>
<td>Verb</td>
</tr>
<tr>
<td>ADP</td>
<td>Adposition</td>
</tr>
<tr>
<td>SYM</td>
<td>Symbol</td>
</tr>
<tr>
<td>NUM</td>
<td>Numeral</td>
</tr>
<tr>
<td>CD</td>
<td>Cardinal number</td>
</tr>
<tr>
<td>VBG</td>
<td>Verb, gerund or present participle</td>
</tr>
<tr>
<td>ROOT</td>
<td>Root</td>
</tr>
</tbody>
</table>
<table>
<thead>
<tr>
<th>Entities</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>FAC</td>
<td>Buildings, airports, highways, bridges, etc.</td>
</tr>
<tr>
<td>NORP</td>
<td>Nationalities or religious or political groups</td>
</tr>
<tr>
<td>GPE</td>
<td>Countries, cities, states</td>
</tr>
<tr>
<td>PRODUCT</td>
<td>Objects, vehicles, foods, etc. (not services)</td>
</tr>
<tr>
<td>EVENT</td>
<td>Named hurricanes, battles, wars, sports events, etc.</td>
</tr>
<tr>
<td>PERSON</td>
<td>People, including fictional</td>
</tr>
<tr>
<td>ORG</td>
<td>Companies, agencies, institutions, etc.</td>
</tr>
<tr>
<td>LOC</td>
<td>Non-GPE locations, mountain ranges, bodies of water</td>
</tr>
<tr>
<td>DATE</td>
<td>Absolute or relative dates or periods</td>
</tr>
<tr>
<td>TIME</td>
<td>Times smaller than a day</td>
</tr>
<tr>
<td>WORK_OF_ART</td>
<td>Titles of books, songs, etc.</td>
</tr>
</tbody>
</table>
<table>
<thead>
<tr>
<th>Extract type</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>orth</td>
<td>Terms are represented by their text exactly as written</td>
</tr>
<tr>
<td>lower</td>
<td>Lowercased form of the text</td>
</tr>
<tr>
<td>lemma</td>
<td>Base form w/o inflectional suffixes</td>
</tr>
</tbody>
</table>
<p>For details see <a href="https://spacy.io/usage/linguistic-features">Spacy linguistic features</a> and <a href="https://spacy.io/models/en">Model NER labels</a>. The instructions expect the en model, but spacy supports a wide range of models. You can also specify noun chunks. A noun chunk size of 2, for example, would create keys like "Yellow House" or "Blond Hair".</p>
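<p>A hypothetical sketch of a ner parse config is shown below, using labels from the tables above. The field names are assumptions for illustration only; see run_files/parse_configs/ner_types.json for the actual structure.</p>
<pre><code>{
    "ngrams": ["PROPN", "NOUN", "ADJ"],
    "entities": ["PERSON", "ORG", "GPE", "EVENT", "WORK_OF_ART"],
    "noun_chunks": 2,
    "extract_type": "lower"
}
</code></pre>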
<p>You can create the ner metadata list with</p>
<pre><code>python -m document_parsing.textacy_parsing
<pre><code>python -m document_parsing.parse_ner
</code></pre>
<table>
<thead>
@@ -121,15 +265,15 @@ <h3 id="named-entity-recognitionner">Named Entity Recognition(NER)</h3>
<tbody>
<tr>
<td>--data-directory</td>
<td>The directory where your text files are stored. Default "./documents/skynet"</td>
<td>The directory where your text files are stored. Default "./run_files/documents/skynet"</td>
</tr>
<tr>
<td>--collection-name</td>
<td>The name of the collection. Will be used as the name and location for the keyfile. Default "skynet"</td>
</tr>
<tr>
<td>--key-storage</td>
<td>The directory for the collection metadata keys. Default "./key_storage/"</td>
<td>The directory for the collection metadata keys. Default "./run_files/key_storage/"</td>
</tr>
</tbody>
</table>
2 changes: 1 addition & 1 deletion search/search_index.json

Large diffs are not rendered by default.

26 changes: 13 additions & 13 deletions sitemap.xml
@@ -2,67 +2,67 @@
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url>
<loc>https://ossirytk.github.io/llama-cpp-chat-memory/</loc>
<lastmod>2024-02-24</lastmod>
<lastmod>2024-03-24</lastmod>
<changefreq>daily</changefreq>
</url>
<url>
<loc>https://ossirytk.github.io/llama-cpp-chat-memory/UNLICENSE/</loc>
<lastmod>2024-02-24</lastmod>
<lastmod>2024-03-24</lastmod>
<changefreq>daily</changefreq>
</url>
<url>
<loc>https://ossirytk.github.io/llama-cpp-chat-memory/card_format/</loc>
<lastmod>2024-02-24</lastmod>
<lastmod>2024-03-24</lastmod>
<changefreq>daily</changefreq>
</url>
<url>
<loc>https://ossirytk.github.io/llama-cpp-chat-memory/configs/</loc>
<lastmod>2024-02-24</lastmod>
<lastmod>2024-03-24</lastmod>
<changefreq>daily</changefreq>
</url>
<url>
<loc>https://ossirytk.github.io/llama-cpp-chat-memory/creating_embeddings/</loc>
<lastmod>2024-02-24</lastmod>
<lastmod>2024-03-24</lastmod>
<changefreq>daily</changefreq>
</url>
<url>
<loc>https://ossirytk.github.io/llama-cpp-chat-memory/examples/</loc>
<lastmod>2024-02-24</lastmod>
<lastmod>2024-03-24</lastmod>
<changefreq>daily</changefreq>
</url>
<url>
<loc>https://ossirytk.github.io/llama-cpp-chat-memory/getting_started/</loc>
<lastmod>2024-02-24</lastmod>
<lastmod>2024-03-24</lastmod>
<changefreq>daily</changefreq>
</url>
<url>
<loc>https://ossirytk.github.io/llama-cpp-chat-memory/named_entity_recognition/</loc>
<lastmod>2024-02-24</lastmod>
<lastmod>2024-03-24</lastmod>
<changefreq>daily</changefreq>
</url>
<url>
<loc>https://ossirytk.github.io/llama-cpp-chat-memory/preparing_the_env/</loc>
<lastmod>2024-02-24</lastmod>
<lastmod>2024-03-24</lastmod>
<changefreq>daily</changefreq>
</url>
<url>
<loc>https://ossirytk.github.io/llama-cpp-chat-memory/prompt_support/</loc>
<lastmod>2024-02-24</lastmod>
<lastmod>2024-03-24</lastmod>
<changefreq>daily</changefreq>
</url>
<url>
<loc>https://ossirytk.github.io/llama-cpp-chat-memory/running_the_chatbot/</loc>
<lastmod>2024-02-24</lastmod>
<lastmod>2024-03-24</lastmod>
<changefreq>daily</changefreq>
</url>
<url>
<loc>https://ossirytk.github.io/llama-cpp-chat-memory/running_the_env/</loc>
<lastmod>2024-02-24</lastmod>
<lastmod>2024-03-24</lastmod>
<changefreq>daily</changefreq>
</url>
<url>
<loc>https://ossirytk.github.io/llama-cpp-chat-memory/webscraping/</loc>
<lastmod>2024-02-24</lastmod>
<lastmod>2024-03-24</lastmod>
<changefreq>daily</changefreq>
</url>
</urlset>
Binary file modified sitemap.xml.gz
Binary file not shown.
9 changes: 5 additions & 4 deletions webscraping/index.html
@@ -107,10 +107,11 @@
<div class="section" itemprop="articleBody">

<h3 id="webscraping">Webscraping</h3>
<p>You can scrape web pages to text documents in order to use them as documents for chroma. The web scraping uses playwright and requires that the web engines are installed. After starting the virtual env run:</BR></p>
<p>You can scrape web pages to text documents in order to use them as documents for chroma. </p>
<p>Optional. The old web scraping uses playwright and requires that the web engines are installed. After starting the virtual env run:</BR></p>
<pre><code>playwright install
</code></pre>
<p>The web scraping is prepared with config files in the web_scrape_configs folder. The format is in json. See the example files for the specifics. The current implementation is unoptimized, so use with caution for a large number of pages. See the example web_scrape_configs for the config format. This will scrape the given web pages and format them into a single text document.</BR></p>
<p>The web scraping is prepared with config files in the web_scrape_configs folder. The format is in json. See the example files for the specifics. A number of regex filters are used to clean the scrape data. You can modify and add filters if you want. The filters are stored in the src/llama_cpp_chat_memory/run_files/filters/web_scrape_filters.json file.</BR></p>
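<p>As a rough illustration, a filter file could hold a list of regex patterns like the sketch below, for example to strip wiki edit links and citation markers and collapse repeated whitespace. The structure and field name are hypothetical; see the shipped web_scrape_filters.json for the real schema.</p>
<pre><code>{
    "filters": [
        "\\[edit\\]",
        "\\[\\d+\\]",
        "\\s{2,}"
    ]
}
</code></pre>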
<p>To run the scrape, run:</p>
<pre><code>python -m document_parsing.web_scraper
</code></pre>
Expand All @@ -124,15 +125,15 @@ <h3 id="webscraping">Webscraping</h3>
<tbody>
<tr>
<td>--data-directory</td>
<td>The directory where your text files are stored. Default "./documents/skynet"</td>
<td>The directory where your text files are stored. Default "./run_files/documents/skynet"</td>
</tr>
<tr>
<td>--collection-name</td>
<td>The name of the collection. Default "skynet"</td>
</tr>
<tr>
<td>--web-scrape-directory</td>
<td>The config file to be used for the webscrape. Default "./web_scrape_configs/"</td>
<td>The config file to be used for the webscrape. Default "./run_files/web_scrape_configs/"</td>
</tr>
</tbody>
</table>
