Merge branch 'master' into snova-jorgep/sambanovacloud_llm

langchain-ai · Nov 7, 2024 · e6206c3
2 parents c14a1ac + 2cb3927
commit e6206c3
Showing 26 changed files with 1,612 additions and 190 deletions.
diff --git a/.github/workflows/_integration_test.yml b/.github/workflows/_integration_test.yml
@@ -41,12 +41,6 @@ jobs:
         shell: bash
         run: poetry run pip install "boto3<2" "google-cloud-aiplatform<2"
 
-      - name: 'Authenticate to Google Cloud'
-        id: 'auth'
-        uses: google-github-actions/auth@v2
-        with:
-          credentials_json: '${{ secrets.GOOGLE_CREDENTIALS }}'
-
       - name: Run integration tests
         shell: bash
         env:

diff --git a/.github/workflows/_release.yml b/.github/workflows/_release.yml
@@ -267,12 +267,6 @@ jobs:
           make tests
         working-directory: ${{ inputs.working-directory }}
 
-      - name: 'Authenticate to Google Cloud'
-        id: 'auth'
-        uses: google-github-actions/auth@v2
-        with:
-          credentials_json: '${{ secrets.GOOGLE_CREDENTIALS }}'
-
       - name: Import integration test dependencies
         run: poetry install --with test,test_integration
         working-directory: ${{ inputs.working-directory }}

diff --git a/docs/docs/integrations/document_loaders/microsoft_onedrive.ipynb b/docs/docs/integrations/document_loaders/microsoft_onedrive.ipynb
@@ -8,7 +8,7 @@
     "\n",
     ">[Microsoft OneDrive](https://en.wikipedia.org/wiki/OneDrive) (formerly `SkyDrive`) is a file hosting service operated by Microsoft.\n",
     "\n",
-    "This notebook covers how to load documents from `OneDrive`. Currently, only docx, doc, and pdf files are supported.\n",
+    "This notebook covers how to load documents from `OneDrive`. By default the document loader loads `pdf`, `doc`, `docx` and `txt` files. You can load other file types by providing appropriate parsers (see more below).\n",
     "\n",
     "## Prerequisites\n",
     "1. Register an application with the [Microsoft identity platform](https://learn.microsoft.com/en-us/azure/active-directory/develop/quickstart-register-app) instructions.\n",
@@ -77,15 +77,64 @@
     "\n",
     "loader = OneDriveLoader(drive_id=\"YOUR DRIVE ID\", object_ids=[\"ID_1\", \"ID_2\"], auth_with_token=True)\n",
     "documents = loader.load()\n",
-    "```\n"
+    "```\n",
+    "\n",
+    "#### 📑 Choosing supported file types and preferred parsers\n",
+    "By default `OneDriveLoader` loads file types defined in [`document_loaders/parsers/registry`](https://github.com/langchain-ai/langchain/blob/master/libs/community/langchain_community/document_loaders/parsers/registry.py#L10-L22) using the default parsers (see below).\n",
+    "```python\n",
+    "def _get_default_parser() -> BaseBlobParser:\n",
+    "    \"\"\"Get default mime-type based parser.\"\"\"\n",
+    "    return MimeTypeBasedParser(\n",
+    "        handlers={\n",
+    "            \"application/pdf\": PyMuPDFParser(),\n",
+    "            \"text/plain\": TextParser(),\n",
+    "            \"application/msword\": MsWordParser(),\n",
+    "            \"application/vnd.openxmlformats-officedocument.wordprocessingml.document\": (\n",
+    "                MsWordParser()\n",
+    "            ),\n",
+    "        },\n",
+    "        fallback_parser=None,\n",
+    "    )\n",
+    "```\n",
+    "You can override this behavior by passing the `handlers` argument to `OneDriveLoader`. \n",
+    "Pass a dictionary mapping either file extensions (like `\"doc\"`, `\"pdf\"`, etc.) \n",
+    "or MIME types (like `\"application/pdf\"`, `\"text/plain\"`, etc.) to parsers. \n",
+    "Note that you must use either file extensions or MIME types exclusively and \n",
+    "cannot mix them.\n",
+    "\n",
+    "Do not include the leading dot for file extensions.\n",
+    "\n",
+    "```python\n",
+    "# using file extensions:\n",
+    "handlers = {\n",
+    "    \"doc\": MsWordParser(),\n",
+    "    \"pdf\": PDFMinerParser(),\n",
+    "    \"mp3\": OpenAIWhisperParser()\n",
+    "}\n",
+    "\n",
+    "# using MIME types:\n",
+    "handlers = {\n",
+    "    \"application/msword\": MsWordParser(),\n",
+    "    \"application/pdf\": PDFMinerParser(),\n",
+    "    \"audio/mpeg\": OpenAIWhisperParser()\n",
+    "}\n",
+    "\n",
+    "loader = OneDriveLoader(drive_id=\"...\",\n",
+    "                        handlers=handlers # pass handlers to OneDriveLoader\n",
+    "                        )\n",
+    "```\n",
+    "In case multiple file extensions map to the same MIME type, the last dictionary item will\n",
+    "apply.\n",
+    "Example:\n",
+    "```python\n",
+    "# 'jpg' and 'jpeg' both map to 'image/jpeg' MIME type. SecondParser() will be used \n",
+    "# to parse all jpg/jpeg files.\n",
+    "handlers = {\n",
+    "    \"jpg\": FirstParser(),\n",
+    "    \"jpeg\": SecondParser()\n",
+    "}\n",
+    "```"
    ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
-   "source": []
   }
  ],
  "metadata": {

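Reviewer note on the OneDrive notebook hunk above: the new text says `handlers` keys must be either all file extensions or all MIME types, and that extensions are written without the leading dot. A minimal sketch of that rule follows, assuming a key containing `/` is a MIME type. `classify_handler_keys` is a hypothetical helper written for illustration; it is not part of `langchain_community`, and the loader's real validation may differ.

```python
def classify_handler_keys(handlers):
    """Return "extension" or "mime_type" for a handlers dict.

    Hypothetical helper: illustrates the documented rule only.
    """
    if not handlers:
        raise ValueError("handlers must not be empty")
    kinds = set()
    for key in handlers:
        if key.startswith("."):
            # The docs ask for "pdf", not ".pdf".
            raise ValueError(f"Drop the leading dot from {key!r}")
        # Assumption: MIME types always contain a slash, extensions never do.
        kinds.add("mime_type" if "/" in key else "extension")
    if len(kinds) > 1:
        raise ValueError("Use either file extensions or MIME types, not both")
    return kinds.pop()


# Parser objects are replaced by placeholder strings to keep the sketch
# free of third-party imports.
print(classify_handler_keys({"doc": "MsWordParser", "pdf": "PDFMinerParser"}))
print(classify_handler_keys({"application/pdf": "PDFMinerParser"}))
```

Running a check like this before constructing the loader would surface a mixed dictionary early instead of at load time.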
diff --git a/docs/docs/integrations/document_loaders/microsoft_sharepoint.ipynb b/docs/docs/integrations/document_loaders/microsoft_sharepoint.ipynb
@@ -9,7 +9,7 @@
     "\n",
     "> [Microsoft SharePoint](https://en.wikipedia.org/wiki/SharePoint) is a website-based collaboration system that uses workflow applications, “list” databases, and other web parts and security features to empower business teams to work together developed by Microsoft.\n",
     "\n",
-    "This notebook covers how to load documents from the [SharePoint Document Library](https://support.microsoft.com/en-us/office/what-is-a-document-library-3b5976dd-65cf-4c9e-bf5a-713c10ca2872). Currently, only docx, doc, and pdf files are supported.\n",
+    "This notebook covers how to load documents from the [SharePoint Document Library](https://support.microsoft.com/en-us/office/what-is-a-document-library-3b5976dd-65cf-4c9e-bf5a-713c10ca2872). By default the document loader loads `pdf`, `doc`, `docx` and `txt` files. You can load other file types by providing appropriate parsers (see more below).\n",
     "\n",
     "## Prerequisites\n",
     "1. Register an application with the [Microsoft identity platform](https://learn.microsoft.com/en-us/azure/active-directory/develop/quickstart-register-app) instructions.\n",
@@ -100,7 +100,63 @@
     "\n",
     "loader = SharePointLoader(document_library_id=\"YOUR DOCUMENT LIBRARY ID\", object_ids=[\"ID_1\", \"ID_2\"], auth_with_token=True)\n",
     "documents = loader.load()\n",
-    "```\n"
+    "```\n",
+    "\n",
+    "#### 📑 Choosing supported file types and preferred parsers\n",
+    "By default `SharePointLoader` loads file types defined in [`document_loaders/parsers/registry`](https://github.com/langchain-ai/langchain/blob/master/libs/community/langchain_community/document_loaders/parsers/registry.py#L10-L22) using the default parsers (see below).\n",
+    "```python\n",
+    "def _get_default_parser() -> BaseBlobParser:\n",
+    "    \"\"\"Get default mime-type based parser.\"\"\"\n",
+    "    return MimeTypeBasedParser(\n",
+    "        handlers={\n",
+    "            \"application/pdf\": PyMuPDFParser(),\n",
+    "            \"text/plain\": TextParser(),\n",
+    "            \"application/msword\": MsWordParser(),\n",
+    "            \"application/vnd.openxmlformats-officedocument.wordprocessingml.document\": (\n",
+    "                MsWordParser()\n",
+    "            ),\n",
+    "        },\n",
+    "        fallback_parser=None,\n",
+    "    )\n",
+    "```\n",
+    "You can override this behavior by passing the `handlers` argument to `SharePointLoader`. \n",
+    "Pass a dictionary mapping either file extensions (like `\"doc\"`, `\"pdf\"`, etc.) \n",
+    "or MIME types (like `\"application/pdf\"`, `\"text/plain\"`, etc.) to parsers. \n",
+    "Note that you must use either file extensions or MIME types exclusively and \n",
+    "cannot mix them.\n",
+    "\n",
+    "Do not include the leading dot for file extensions.\n",
+    "\n",
+    "```python\n",
+    "# using file extensions:\n",
+    "handlers = {\n",
+    "    \"doc\": MsWordParser(),\n",
+    "    \"pdf\": PDFMinerParser(),\n",
+    "    \"mp3\": OpenAIWhisperParser()\n",
+    "}\n",
+    "\n",
+    "# using MIME types:\n",
+    "handlers = {\n",
+    "    \"application/msword\": MsWordParser(),\n",
+    "    \"application/pdf\": PDFMinerParser(),\n",
+    "    \"audio/mpeg\": OpenAIWhisperParser()\n",
+    "}\n",
+    "\n",
+    "loader = SharePointLoader(document_library_id=\"...\",\n",
+    "                            handlers=handlers # pass handlers to SharePointLoader\n",
+    "                            )\n",
+    "```\n",
+    "In case multiple file extensions map to the same MIME type, the last dictionary item will\n",
+    "apply.\n",
+    "Example:\n",
+    "```python\n",
+    "# 'jpg' and 'jpeg' both map to 'image/jpeg' MIME type. SecondParser() will be used \n",
+    "# to parse all jpg/jpeg files.\n",
+    "handlers = {\n",
+    "    \"jpg\": FirstParser(),\n",
+    "    \"jpeg\": SecondParser()\n",
+    "}\n",
+    "```"
    ]
   }
  ],
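Reviewer note on the "last dictionary item will apply" sentence in both notebooks: the behavior follows naturally if an extension-keyed dictionary is re-keyed by MIME type, because a later assignment to the same key overwrites the earlier one. The sketch below assumes that mechanism. `EXTENSION_TO_MIME_TYPE` and `rekey_by_mime_type` are hypothetical names, the lookup table is a hand-written stand-in for whatever mapping the loader actually uses, and parser objects are replaced by strings.

```python
# Hand-written stand-in for an extension-to-MIME-type lookup (assumption).
EXTENSION_TO_MIME_TYPE = {
    "jpg": "image/jpeg",
    "jpeg": "image/jpeg",
    "pdf": "application/pdf",
    "txt": "text/plain",
}


def rekey_by_mime_type(handlers):
    """Re-key an extension-keyed handlers dict by MIME type (hypothetical)."""
    by_mime_type = {}
    for extension, parser in handlers.items():
        # Dicts preserve insertion order, so a later extension that maps to
        # the same MIME type replaces the parser stored by an earlier one.
        by_mime_type[EXTENSION_TO_MIME_TYPE[extension]] = parser
    return by_mime_type


handlers = {"jpg": "FirstParser", "jpeg": "SecondParser"}
resolved = rekey_by_mime_type(handlers)
print(resolved)  # {'image/jpeg': 'SecondParser'}
```

If the order of entries matters this much, keying `handlers` by MIME type directly avoids the ambiguity.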