Skip to content

Commit

Permalink
Merge branch 'master' into snova-jorgep/sambanovacloud_llm
Browse files Browse the repository at this point in the history
  • Loading branch information
jhpiedrahitao authored Nov 7, 2024
2 parents c14a1ac + 2cb3927 commit e6206c3
Show file tree
Hide file tree
Showing 26 changed files with 1,612 additions and 190 deletions.
6 changes: 0 additions & 6 deletions .github/workflows/_integration_test.yml
Original file line number Diff line number Diff line change
Expand Up @@ -41,12 +41,6 @@ jobs:
shell: bash
run: poetry run pip install "boto3<2" "google-cloud-aiplatform<2"

- name: 'Authenticate to Google Cloud'
id: 'auth'
uses: google-github-actions/auth@v2
with:
credentials_json: '${{ secrets.GOOGLE_CREDENTIALS }}'

- name: Run integration tests
shell: bash
env:
Expand Down
6 changes: 0 additions & 6 deletions .github/workflows/_release.yml
Original file line number Diff line number Diff line change
Expand Up @@ -267,12 +267,6 @@ jobs:
make tests
working-directory: ${{ inputs.working-directory }}

- name: 'Authenticate to Google Cloud'
id: 'auth'
uses: google-github-actions/auth@v2
with:
credentials_json: '${{ secrets.GOOGLE_CREDENTIALS }}'

- name: Import integration test dependencies
run: poetry install --with test,test_integration
working-directory: ${{ inputs.working-directory }}
Expand Down
67 changes: 58 additions & 9 deletions docs/docs/integrations/document_loaders/microsoft_onedrive.ipynb
Original file line number Diff line number Diff line change
Expand Up @@ -8,7 +8,7 @@
"\n",
">[Microsoft OneDrive](https://en.wikipedia.org/wiki/OneDrive) (formerly `SkyDrive`) is a file hosting service operated by Microsoft.\n",
"\n",
"This notebook covers how to load documents from `OneDrive`. Currently, only docx, doc, and pdf files are supported.\n",
"This notebook covers how to load documents from `OneDrive`. By default the document loader loads `pdf`, `doc`, `docx` and `txt` files. You can load other file types by providing appropriate parsers (see more below).\n",
"\n",
"## Prerequisites\n",
"1. Register an application with the [Microsoft identity platform](https://learn.microsoft.com/en-us/azure/active-directory/develop/quickstart-register-app) instructions.\n",
Expand Down Expand Up @@ -77,15 +77,64 @@
"\n",
"loader = OneDriveLoader(drive_id=\"YOUR DRIVE ID\", object_ids=[\"ID_1\", \"ID_2\"], auth_with_token=True)\n",
"documents = loader.load()\n",
"```\n"
"```\n",
"\n",
"#### 📑 Choosing supported file types and preffered parsers\n",
"By default `OneDriveLoader` loads file types defined in [`document_loaders/parsers/registry`](https://github.com/langchain-ai/langchain/blob/master/libs/community/langchain_community/document_loaders/parsers/registry.py#L10-L22) using the default parsers (see below).\n",
"```python\n",
"def _get_default_parser() -> BaseBlobParser:\n",
" \"\"\"Get default mime-type based parser.\"\"\"\n",
" return MimeTypeBasedParser(\n",
" handlers={\n",
" \"application/pdf\": PyMuPDFParser(),\n",
" \"text/plain\": TextParser(),\n",
" \"application/msword\": MsWordParser(),\n",
" \"application/vnd.openxmlformats-officedocument.wordprocessingml.document\": (\n",
" MsWordParser()\n",
" ),\n",
" },\n",
" fallback_parser=None,\n",
" )\n",
"```\n",
"You can override this behavior by passing `handlers` argument to `OneDriveLoader`. \n",
"Pass a dictionary mapping either file extensions (like `\"doc\"`, `\"pdf\"`, etc.) \n",
"or MIME types (like `\"application/pdf\"`, `\"text/plain\"`, etc.) to parsers. \n",
"Note that you must use either file extensions or MIME types exclusively and \n",
"cannot mix them.\n",
"\n",
"Do not include the leading dot for file extensions.\n",
"\n",
"```python\n",
"# using file extensions:\n",
"handlers = {\n",
" \"doc\": MsWordParser(),\n",
" \"pdf\": PDFMinerParser(),\n",
" \"mp3\": OpenAIWhisperParser()\n",
"}\n",
"\n",
"# using MIME types:\n",
"handlers = {\n",
" \"application/msword\": MsWordParser(),\n",
" \"application/pdf\": PDFMinerParser(),\n",
" \"audio/mpeg\": OpenAIWhisperParser()\n",
"}\n",
"\n",
"loader = OneDriveLoader(document_library_id=\"...\",\n",
" handlers=handlers # pass handlers to OneDriveLoader\n",
" )\n",
"```\n",
"In case multiple file extensions map to the same MIME type, the last dictionary item will\n",
"apply.\n",
"Example:\n",
"```python\n",
"# 'jpg' and 'jpeg' both map to 'image/jpeg' MIME type. SecondParser() will be used \n",
"# to parse all jpg/jpeg files.\n",
"handlers = {\n",
" \"jpg\": FirstParser(),\n",
" \"jpeg\": SecondParser()\n",
"}\n",
"```"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
Expand Down
60 changes: 58 additions & 2 deletions docs/docs/integrations/document_loaders/microsoft_sharepoint.ipynb
Original file line number Diff line number Diff line change
Expand Up @@ -9,7 +9,7 @@
"\n",
"> [Microsoft SharePoint](https://en.wikipedia.org/wiki/SharePoint) is a website-based collaboration system that uses workflow applications, “list” databases, and other web parts and security features to empower business teams to work together developed by Microsoft.\n",
"\n",
"This notebook covers how to load documents from the [SharePoint Document Library](https://support.microsoft.com/en-us/office/what-is-a-document-library-3b5976dd-65cf-4c9e-bf5a-713c10ca2872). Currently, only docx, doc, and pdf files are supported.\n",
"This notebook covers how to load documents from the [SharePoint Document Library](https://support.microsoft.com/en-us/office/what-is-a-document-library-3b5976dd-65cf-4c9e-bf5a-713c10ca2872). By default the document loader loads `pdf`, `doc`, `docx` and `txt` files. You can load other file types by providing appropriate parsers (see more below).\n",
"\n",
"## Prerequisites\n",
"1. Register an application with the [Microsoft identity platform](https://learn.microsoft.com/en-us/azure/active-directory/develop/quickstart-register-app) instructions.\n",
Expand Down Expand Up @@ -100,7 +100,63 @@
"\n",
"loader = SharePointLoader(document_library_id=\"YOUR DOCUMENT LIBRARY ID\", object_ids=[\"ID_1\", \"ID_2\"], auth_with_token=True)\n",
"documents = loader.load()\n",
"```\n"
"```\n",
"\n",
"#### 📑 Choosing supported file types and preffered parsers\n",
"By default `SharePointLoader` loads file types defined in [`document_loaders/parsers/registry`](https://github.com/langchain-ai/langchain/blob/master/libs/community/langchain_community/document_loaders/parsers/registry.py#L10-L22) using the default parsers (see below).\n",
"```python\n",
"def _get_default_parser() -> BaseBlobParser:\n",
" \"\"\"Get default mime-type based parser.\"\"\"\n",
" return MimeTypeBasedParser(\n",
" handlers={\n",
" \"application/pdf\": PyMuPDFParser(),\n",
" \"text/plain\": TextParser(),\n",
" \"application/msword\": MsWordParser(),\n",
" \"application/vnd.openxmlformats-officedocument.wordprocessingml.document\": (\n",
" MsWordParser()\n",
" ),\n",
" },\n",
" fallback_parser=None,\n",
" )\n",
"```\n",
"You can override this behavior by passing `handlers` argument to `SharePointLoader`. \n",
"Pass a dictionary mapping either file extensions (like `\"doc\"`, `\"pdf\"`, etc.) \n",
"or MIME types (like `\"application/pdf\"`, `\"text/plain\"`, etc.) to parsers. \n",
"Note that you must use either file extensions or MIME types exclusively and \n",
"cannot mix them.\n",
"\n",
"Do not include the leading dot for file extensions.\n",
"\n",
"```python\n",
"# using file extensions:\n",
"handlers = {\n",
" \"doc\": MsWordParser(),\n",
" \"pdf\": PDFMinerParser(),\n",
" \"mp3\": OpenAIWhisperParser()\n",
"}\n",
"\n",
"# using MIME types:\n",
"handlers = {\n",
" \"application/msword\": MsWordParser(),\n",
" \"application/pdf\": PDFMinerParser(),\n",
" \"audio/mpeg\": OpenAIWhisperParser()\n",
"}\n",
"\n",
"loader = SharePointLoader(document_library_id=\"...\",\n",
" handlers=handlers # pass handlers to SharePointLoader\n",
" )\n",
"```\n",
"In case multiple file extensions map to the same MIME type, the last dictionary item will\n",
"apply.\n",
"Example:\n",
"```python\n",
"# 'jpg' and 'jpeg' both map to 'image/jpeg' MIME type. SecondParser() will be used \n",
"# to parse all jpg/jpeg files.\n",
"handlers = {\n",
" \"jpg\": FirstParser(),\n",
" \"jpeg\": SecondParser()\n",
"}\n",
"```"
]
}
],
Expand Down
Loading

0 comments on commit e6206c3

Please sign in to comment.