Moving functional code to the codebase and moving the development notebook to docs
ArturOle · Sep 25, 2024 · d3d2379
1 parent 72c2fcc
Showing 2 changed files with 192 additions and 230 deletions.
diff --git a/...sor/splitting_algorithm_development.ipynb b/...R&D/splitting_algorithm_development.ipynb
@@ -2,7 +2,7 @@
  "cells": [
   {
    "cell_type": "code",
-   "execution_count": 18,
+   "execution_count": 39,
    "metadata": {},
    "outputs": [],
    "source": [
@@ -46,9 +46,16 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 19,
+   "execution_count": 40,
    "metadata": {},
    "outputs": [
+    {
+     "name": "stderr",
+     "output_type": "stream",
+     "text": [
+      "Created a chunk of size 546, which is longer than the specified 500\n"
+     ]
+    },
     {
      "name": "stdout",
      "output_type": "stream",
@@ -85,88 +92,89 @@
       "Disadvantages:\n",
       "Convergence Time: The algorithm may require a large number of samples to converge to an accurate solution, which can be computationally expensive.\n",
       "Accuracy: As it is based on random sampling, the result is an approximation and not an exact answer.\n",
-      "#9: Monte Carlo algorithms are especially useful in scenarios where exact mathematical modeling is difficult or impossible, but simulation can provide insights or approximate solutions.\n"
+      "#9: Monte Carlo algorithms are especially useful in scenarios where exact mathematical modeling is difficult or impossible, but simulation can provide insights or approximate solutions.\n",
+      "#1: The Monte Carlo algorithm (or Monte Carlo method) refers to a broad class of computational algorithms that rely on repeated random sampling to obtain numerical results. These methods are used in various fields such as finance, physics, engineering, and AI for problems that are deterministic in principle but difficult to solve directly.\n",
+      "#2: Key Aspects of the Monte Carlo Algorithm:\n",
+      "Random Sampling: The core idea is to generate random variables to simulate complex systems or processes. The algorithm uses probabilistic models and generates numerous random samples to approximate the solution to a deterministic problem.\n",
+      "#3: Estimation of Results: Monte Carlo methods provide estimates of the outcome by running many simulations and averaging the results. These results tend to get more accurate as the number of trials increases (according to the law of large numbers).\n",
+      "\n",
+      "Applications:\n",
+      "#4: Numerical Integration: When integrals are difficult to evaluate analytically, Monte Carlo methods approximate them by averaging sampled values.\n",
+      "Optimization: Used in scenarios like stochastic optimization, where exact solutions are hard to compute.\n",
+      "Statistical Inference: Monte Carlo is used in Bayesian inference to estimate posterior distributions.\n",
+      "Simulations: For example, in finance to model stock prices or risks.\n",
+      "Basic Process:\n",
+      "Define a Problem: The first step is to define a domain of possible inputs (often high-dimensional and complex).\n",
+      "#5: Generate Random Samples: Randomly generate inputs from a probability distribution over the domain.\n",
+      "\n",
+      "Compute Results: Evaluate the function or process for each randomly generated input.\n",
+      "\n",
+      "Average the Results: Use the results to compute an average or distribution of outputs, which serves as the approximation to the problem.\n",
+      "\n",
+      "Example: Estimating Pi\n",
+      "A common example is using the Monte Carlo method to estimate Pi:\n",
+      "#6: Consider a circle of radius 1 inside a square with sides of length 2.\n",
+      "Generate random points in the square.\n",
+      "The ratio of points that fall inside the circle to the total number of points is approximately Pi/4. By multiplying this ratio by 4, you can estimate the value of Pi.\n",
+      "Advantages:\n",
+      "Scalability: Works well with problems of high-dimensional spaces.\n",
+      "Simplicity: The algorithm is easy to implement and doesn't require detailed knowledge of the problem.\n",
+      "Flexibility: It can be applied to a wide range of problems where deterministic solutions are not feasible.\n",
+      "Disadvantages:\n",
+      "Convergence Time: The algorithm may require a large number of samples to converge to an accurate solution, which can be computationally expensive.\n",
+      "Accuracy: As it is based on random sampling, the result is an approximation and not an exact answer.\n",
+      "Monte Carlo algorithms are especially useful in scenarios where exact mathematical modeling is difficult or impossible, but simulation can provide insights or approximate solutions.\n"
      ]
     }
    ],
    "source": [
-    "from langchain_text_splitters import RecursiveCharacterTextSplitter\n",
+    "from langchain_text_splitters import RecursiveCharacterTextSplitter, CharacterTextSplitter\n",
     "\n",
-    "splitter = RecursiveCharacterTextSplitter(\n",
+    "splitter_r = RecursiveCharacterTextSplitter(\n",
     "    chunk_size=500,\n",
     "    chunk_overlap=100\n",
     ")\n",
-    "sentences = splitter.split_text(multi_sentence_text)\n",
+    "sentences = splitter_r.split_text(multi_sentence_text)\n",
+    "\n",
+    "for i, sentence in enumerate(sentences):\n",
+    "    print(f\"#{i+1}: {sentence}\")\n",
+    "\n",
+    "splitter_c = CharacterTextSplitter(\n",
+    "    chunk_size=500,\n",
+    "    chunk_overlap=100\n",
+    ")\n",
+    "sentences = splitter_c.split_text(multi_sentence_text)\n",
     "\n",
     "for i, sentence in enumerate(sentences):\n",
     "    print(f\"#{i+1}: {sentence}\")"
    ]
   },
   {
    "cell_type": "code",
-   "execution_count": 26,
+   "execution_count": 47,
    "metadata": {},
    "outputs": [
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "Text length: 2850\n",
-      "hm may require a lar\n",
-      "2\n",
-      "Positive margin subset: hm may require a lar\n",
-      "New position 2453\n",
-      "By multiplying this \n",
-      "2\n",
-      "New position 2056\n",
-      "Remaining text length: 2056\n",
-      " compute an average \n",
-      "0\n",
-      "New position 1657\n",
-      "Remaining text length: 1657\n",
-      "e, in finance to mod\n",
-      "2\n",
-      "New position 1260\n",
-      "Remaining text length: 1260\n",
-      "bers).\n",
-      "\n",
-      "Applications\n",
-      "5\n",
-      "New position 866\n",
-      "Remaining text length: 866\n",
-      "stems or processes. \n",
-      "18\n",
-      "New position 485\n",
-      "Remaining text length: 485\n",
-      "ional algorithms tha\n",
-      "5\n",
-      "New position 91\n",
-      "Remaining text length: 91\n",
-      "[191, 494, 481, 494, 497, 499, 497, 397]\n"
-     ]
-    },
     {
      "data": {
       "text/plain": [
-       "{0: 'The Monte Carlo algorithm (or Monte Carlo method) refers to a broad class of computational algorithms that rely on repeated random sampling to obtain numerical results. These methods are used',\n",
-       " 1: 'algorithms that rely on repeated random sampling to obtain numerical results. These methods are used in various fields such as finance, physics, engineering, and AI for problems that are deterministic in principle but difficult to solve directly.\\n\\nKey Aspects of the Monte Carlo Algorithm:\\nRandom Sampling: The core idea is to generate random variables to simulate complex systems or processes. The algorithm uses probabilistic models and generates numerous random samples to approximate the so',\n",
-       " 2: ' The algorithm uses probabilistic models and generates numerous random samples to approximate the solution to a deterministic problem.\\n\\nEstimation of Results: Monte Carlo methods provide estimates of the outcome by running many simulations and averaging the results. These results tend to get more accurate as the number of trials increases (according to the law of large numbers).\\n\\nApplications:\\n\\nNumerical Integration: When integrals are difficult to evaluate analytically, Monte',\n",
-       " 3: '\\n\\nApplications:\\n\\nNumerical Integration: When integrals are difficult to evaluate analytically, Monte Carlo methods approximate them by averaging sampled values.\\nOptimization: Used in scenarios like stochastic optimization, where exact solutions are hard to compute.\\nStatistical Inference: Monte Carlo is used in Bayesian inference to estimate posterior distributions.\\nSimulations: For example, in finance to model stock prices or risks.\\nBasic Process:\\nDefine a Problem: The first step is to def',\n",
-       " 4: 'in finance to model stock prices or risks.\\nBasic Process:\\nDefine a Problem: The first step is to define a domain of possible inputs (often high-dimensional and complex).\\n\\nGenerate Random Samples: Randomly generate inputs from a probability distribution over the domain.\\n\\nCompute Results: Evaluate the function or process for each randomly generated input.\\n\\nAverage the Results: Use the results to compute an average or distribution of outputs, which serves as the approximation to the problem.\\n\\nEx',\n",
-       " 5: 'compute an average or distribution of outputs, which serves as the approximation to the problem.\\n\\nExample: Estimating Pi\\nA common example is using the Monte Carlo method to estimate Pi:\\n\\nConsider a circle of radius 1 inside a square with sides of length 2.\\nGenerate random points in the square.\\nThe ratio of points that fall inside the circle to the total number of points is approximately Pi/4. By multiplying this ratio by 4, you can estimate the value of Pi.\\nAdvantages:\\nScalability: Works well w',\n",
-       " 6: \"multiplying this ratio by 4, you can estimate the value of Pi.\\nAdvantages:\\nScalability: Works well with problems of high-dimensional spaces.\\nSimplicity: The algorithm is easy to implement and doesn't require detailed knowledge of the problem.\\nFlexibility: It can be applied to a wide range of problems where deterministic solutions are not feasible.\\nDisadvantages:\\nConvergence Time: The algorithm may require a large number of samples to converge to an accurate solution, which can be computationa\",\n",
-       " 7: 'may require a large number of samples to converge to an accurate solution, which can be computationally expensive.\\nAccuracy: As it is based on random sampling, the result is an approximation and not an exact answer.\\nMonte Carlo algorithms are especially useful in scenarios where exact mathematical modeling is difficult or impossible, but simulation can provide insights or approximate solutions.'}"
+       "{0: 'The Monte Carlo algorithm (or Monte Carlo method) refers to a broad class of computational algorithms that rely on repeated random sampling to obtain numerical results. These methods are used in various fields such as finance, physics, engineering, and AI for problems',\n",
+       " 1: ' These methods are used in various fields such as finance, physics, engineering, and AI for problems that are deterministic in principle but difficult to solve directly.\\n\\nKey Aspects of the Monte Carlo Algorithm:\\nRandom Sampling: The core idea is to generate random variables to simulate complex systems or processes. The algorithm uses probabilistic models and generates numerous random samples to approximate the so',\n",
+       " 2: ' The algorithm uses probabilistic models and generates numerous random samples to approximate the solution to a deterministic problem.\\n\\nEstimation of Results: Monte Carlo methods provide estimates of the outcome by running many simulations and averaging the results. These results tend to get more accurate as the number of trials increases (according to the law of ',\n",
+       " 3: ' These results tend to get more accurate as the number of trials increases (according to the law of large numbers).\\n\\nApplications:\\n\\nNumerical Integration: When integrals are difficult to evaluate analytically, Monte Carlo methods approximate them by averaging sampled values.\\nOptimization: Used in scenarios like stochastic optimization, where exact solutions are hard to compute.\\nStatistical Inference: Monte Carlo is used in Bayesian inference to estimate posterior distribution',\n",
+       " 4: '\\nStatistical Inference: Monte Carlo is used in Bayesian inference to estimate posterior distributions.\\nSimulations: For example, in finance to model stock prices or risks.\\nBasic Process:\\nDefine a Problem: The first step is to define a domain of possible inputs (often high-dimensional and complex).\\n\\nGenerate Random Samples: Randomly generate inputs from a probability distribution over the domain.\\n\\nCompute Results: Evaluate the function or process for each randomly generated input.\\n\\nAverage the ',\n",
+       " 5: '\\n\\nCompute Results: Evaluate the function or process for each randomly generated input.\\n\\nAverage the Results: Use the results to compute an average or distribution of outputs, which serves as the approximation to the problem.\\n\\nExample: Estimating Pi\\nA common example is using the Monte Carlo method to estimate Pi:\\n\\nConsider a circle of radius 1 inside a square with sides of length 2.\\nGenerate random points in the square.\\nThe ratio of points that fall inside the circle to the total ',\n",
+       " 6: \"\\nGenerate random points in the square.\\nThe ratio of points that fall inside the circle to the total number of points is approximately Pi/4. By multiplying this ratio by 4, you can estimate the value of Pi.\\nAdvantages:\\nScalability: Works well with problems of high-dimensional spaces.\\nSimplicity: The algorithm is easy to implement and doesn't require detailed knowledge of the proble\",\n",
+       " 7: \"\\nSimplicity: The algorithm is easy to implement and doesn't require detailed knowledge of the problem.\\nFlexibility: It can be applied to a wide range of problems where deterministic solutions are not feasible.\\nDisadvantages:\\nConvergence Time: The algorithm may require a large number of samples to converge to an accurate solution, which can be computationally expensive.\\nAccuracy: As it is based on random sampling, the result is an approximation and not an exact answer\",\n",
+       " 8: '\\nAccuracy: As it is based on random sampling, the result is an approximation and not an exact answer.\\nMonte Carlo algorithms are especially useful in scenarios where exact mathematical modeling is difficult or impossible, but simulation can provide insights or approximate solutions.'}"
       ]
      },
-     "execution_count": 26,
+     "execution_count": 47,
      "metadata": {},
      "output_type": "execute_result"
     }
    ],
    "source": [
     "import numpy as np\n",
-    "import time\n",
-    "from typing import Union\n",
     "from functools import singledispatchmethod\n",
     "from copy import copy\n",
     "import re\n",
@@ -230,13 +238,10 @@
     "        max_chunk_size = chunk_size - overlap\n",
     "        remaining_text = copy(text)\n",
     "        split_positions = []\n",
-    "        print(f\"Text length: {len(remaining_text)}\")\n",
     "        text_length = len(remaining_text)\n",
     "        static_split_pos = text_length - max_chunk_size\n",
     "        positive_margin_subset = text[static_split_pos:static_split_pos+margin]\n",
     "        new_pos = self.split_pos(positive_margin_subset, static_split_pos)\n",
-    "        print(\"Positive margin subset:\", positive_margin_subset)\n",
-    "        print(f\"New position {new_pos}\")\n",
     "        split_positions.insert(0, (new_pos, len(remaining_text)))\n",
     "        remaining_text = remaining_text[:new_pos]\n",
     "        while True:\n",
@@ -245,15 +250,13 @@
     "            positive_margin_subset = text[static_split_pos:static_split_pos+margin]\n",
     "            new_pos = self.split_pos(positive_margin_subset, static_split_pos)\n",
     "            split_positions.insert(0, (new_pos, len(remaining_text)+overlap))\n",
-    "            print(f\"New position {new_pos}\")\n",
     "            remaining_text = remaining_text[:new_pos]\n",
-    "            print(f\"Remaining text length: {len(remaining_text)}\")\n",
     "            \n",
     "            if len(remaining_text) < max_chunk_size:\n",
     "                break\n",
     "        \n",
     "        split_positions.insert(0, (0, len(remaining_text)+overlap))\n",
-    "        print([j-i for i, j in split_positions])\n",
+    "\n",
     "        return {key:text[i:j] for key, (i, j) in enumerate(split_positions)}\n",
     "\n",
     "    @_split_dispatcher.register\n",
@@ -269,14 +272,10 @@
     "    def split_pos(self, string, current_position):\n",
     "        for i, letter in enumerate(string):\n",
     "            if letter == '.':\n",
-    "                print(string)\n",
-    "                print(i)\n",
     "                return current_position + i + 1\n",
     "\n",
     "        for i, letter in enumerate(string):\n",
     "            if not isinstance(self.white_space_pattern.match(letter), type(None)):\n",
-    "                print(string)\n",
-    "                print(i)\n",
     "                return current_position + i + 1\n",
     "            \n",
     "        return current_position\n",
@@ -286,14 +285,10 @@
     "        inv_string = string[::-1]\n",
     "        for i, letter in enumerate(inv_string):\n",
     "            if letter == '.':\n",
-    "                print(string)\n",
-    "                print(i)\n",
     "                return current_position + (len(string)-i)\n",
     "\n",
     "        for i, letter in enumerate(inv_string):\n",
     "            if not isinstance(self.white_space_pattern.match(letter), type(None)):\n",
-    "                print(string)\n",
-    "                print(i)\n",
     "                return current_position + (len(string)-i)\n",
     "            \n",
     "        return current_position\n",
@@ -306,20 +301,20 @@
     "\n",
     "\n",
     "ts = TextSplitter()\n",
-    "ts.split(multi_sentence_text, 500, 100, 20)\n"
+    "ts.split(multi_sentence_text, 500, 100, 400)\n"
    ]
   },
   {
    "cell_type": "code",
-   "execution_count": 21,
+   "execution_count": 50,
    "metadata": {},
    "outputs": [
     {
      "name": "stdout",
      "output_type": "stream",
      "text": [
-      "Time needed for execution: 329113500\n",
-      "Time needed for execution: 325284100\n"
+      "Time needed for execution: 2577393700\n",
+      "Time needed for execution: 1556783200\n"
      ]
     }
    ],
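Review note: the `split_pos` method in the hunks above scans a margin window for the first period and falls back to the first whitespace character, returning the unchanged position when neither is found. The sketch below restates that rule as a standalone function so it can be tested outside the notebook. It is not the author's code: the name `find_split_pos` is hypothetical, and the `\s` pattern is an assumption, since the definition of `white_space_pattern` is not visible in this diff.

```python
import re

# Assumption: the notebook's `white_space_pattern` is not shown in the diff;
# a plain `\s` pattern is used here as a stand-in.
WHITESPACE = re.compile(r"\s")


def find_split_pos(window: str, current_position: int) -> int:
    """Return an absolute split position just after a boundary inside `window`.

    `window` is the slice of text starting at `current_position`.
    A period is preferred; whitespace is the fallback.
    """
    dot = window.find(".")
    if dot != -1:
        # Split just after the first sentence-ending period.
        return current_position + dot + 1

    ws = WHITESPACE.search(window)
    if ws is not None:
        # No period in the window: split just after the first whitespace.
        return current_position + ws.start() + 1

    # No usable boundary: keep the static position.
    return current_position


text = "One two. Three four"
pos = find_split_pos(text[4:12], 4)
print(pos, repr(text[:pos]))  # 8 'One two.'
```

Using `str.find` and `Pattern.search` avoids the two explicit character loops while returning the same positions for the cases shown.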
@@ -342,7 +337,6 @@
     "    return start\n",
     "\n",
     "\n",
-    "\n",
     "def simple_subset_alt(string, start, end):\n",
     "    for i, letter in enumerate(string[start:end]):\n",
     "        if letter == '.':\n",
@@ -357,11 +351,11 @@
     "string = \"The Monte Carlo algorithm (or Monte Carlo method) refers to a broad class of computational algorithms that rely on repeated random sampling to obtain numerical results.\"\n",
     "\n",
     "start_time = time_ns()\n",
-    "[simple_subset(multi_sentence_text, 4, 500) for _ in range(0, 100000)]\n",
+    "[splitter_r.split_text(multi_sentence_text) for _ in range(0, 100000)]\n",
     "print(f\"Time needed for execution: {time_ns()-start_time}\")\n",
     "\n",
     "start_time = time_ns()\n",
-    "[simple_subset_alt(multi_sentence_text, 4, 500) for _ in range(0, 100000)]\n",
+    "[ts.split(multi_sentence_text, 500, 100, 20) for _ in range(0, 100000)]\n",
     "print(f\"Time needed for execution: {time_ns()-start_time}\")\n",
     "\n"
    ]
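Review note: the timing cell above measures each splitter once with a manual `time_ns()` loop, so a single noisy run can skew the comparison. A minimal sketch of the same measurement with the standard library's `timeit` follows. `naive_split` and `best_time_per_call` are hypothetical stand-ins introduced only for illustration; in the notebook the timed callables would be `splitter_r.split_text` and `ts.split`.

```python
import timeit


def naive_split(text, chunk_size=500, overlap=100):
    """Hypothetical stand-in splitter: fixed-width chunks with overlap."""
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]


def best_time_per_call(func, *args, number=1000, repeat=5):
    """Best per-call time in seconds over `repeat` runs of `number` calls each."""
    timer = timeit.Timer(lambda: func(*args))
    # Taking the minimum reduces the influence of background noise.
    return min(timer.repeat(repeat=repeat, number=number)) / number


sample_text = "Monte Carlo methods rely on repeated random sampling. " * 50
per_call = best_time_per_call(naive_split, sample_text)
print(f"naive_split: {per_call * 1e6:.2f} microseconds per call")
```

Reporting the minimum of several repeats, rather than one total, makes the two splitters easier to compare across notebook runs.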