community: Allow other than default parsers in SharePointLoader and OneDriveLoader #27716

Merged
67 changes: 58 additions & 9 deletions docs/docs/integrations/document_loaders/microsoft_onedrive.ipynb
Expand Up @@ -8,7 +8,7 @@
"\n",
">[Microsoft OneDrive](https://en.wikipedia.org/wiki/OneDrive) (formerly `SkyDrive`) is a file hosting service operated by Microsoft.\n",
"\n",
"This notebook covers how to load documents from `OneDrive`. Currently, only docx, doc, and pdf files are supported.\n",
"This notebook covers how to load documents from `OneDrive`. By default, the document loader loads `pdf`, `doc`, `docx`, and `txt` files. You can load other file types by providing appropriate parsers (see more below).\n",
"\n",
"## Prerequisites\n",
"1. Register an application with the [Microsoft identity platform](https://learn.microsoft.com/en-us/azure/active-directory/develop/quickstart-register-app) instructions.\n",
Expand Down Expand Up @@ -77,15 +77,64 @@
"\n",
"loader = OneDriveLoader(drive_id=\"YOUR DRIVE ID\", object_ids=[\"ID_1\", \"ID_2\"], auth_with_token=True)\n",
"documents = loader.load()\n",
"```\n"
"```\n",
"\n",
"#### 📑 Choosing supported file types and preferred parsers\n",
"By default `OneDriveLoader` loads file types defined in [`document_loaders/parsers/registry`](https://github.com/langchain-ai/langchain/blob/master/libs/community/langchain_community/document_loaders/parsers/registry.py#L10-L22) using the default parsers (see below).\n",
"```python\n",
"def _get_default_parser() -> BaseBlobParser:\n",
" \"\"\"Get default mime-type based parser.\"\"\"\n",
" return MimeTypeBasedParser(\n",
" handlers={\n",
" \"application/pdf\": PyMuPDFParser(),\n",
" \"text/plain\": TextParser(),\n",
" \"application/msword\": MsWordParser(),\n",
" \"application/vnd.openxmlformats-officedocument.wordprocessingml.document\": (\n",
" MsWordParser()\n",
" ),\n",
" },\n",
" fallback_parser=None,\n",
" )\n",
"```\n",
"You can override this behavior by passing the `handlers` argument to `OneDriveLoader`. \n",
"Pass a dictionary mapping either file extensions (like `\"doc\"`, `\"pdf\"`, etc.) \n",
"or MIME types (like `\"application/pdf\"`, `\"text/plain\"`, etc.) to parsers. \n",
"Note that you must use either file extensions or MIME types exclusively and \n",
"cannot mix them.\n",
"\n",
"Do not include the leading dot for file extensions.\n",
"\n",
"```python\n",
"# using file extensions:\n",
"handlers = {\n",
" \"doc\": MsWordParser(),\n",
" \"pdf\": PDFMinerParser(),\n",
" \"mp3\": OpenAIWhisperParser()\n",
"}\n",
"\n",
"# using MIME types:\n",
"handlers = {\n",
" \"application/msword\": MsWordParser(),\n",
" \"application/pdf\": PDFMinerParser(),\n",
" \"audio/mpeg\": OpenAIWhisperParser()\n",
"}\n",
"\n",
"loader = OneDriveLoader(document_library_id=\"...\",\n",
" handlers=handlers # pass handlers to OneDriveLoader\n",
" )\n",
"```\n",
"In case multiple file extensions map to the same MIME type, the last dictionary item will\n",
"apply.\n",
"Example:\n",
"```python\n",
"# 'jpg' and 'jpeg' both map to 'image/jpeg' MIME type. SecondParser() will be used \n",
"# to parse all jpg/jpeg files.\n",
"handlers = {\n",
" \"jpg\": FirstParser(),\n",
" \"jpeg\": SecondParser()\n",
"}\n",
"```"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
Expand Down
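A note on the extension collision described in the OneDrive notebook text above: the following is a minimal sketch, not the loader's actual code, of why the last dictionary item applies when two extensions share a MIME type. It assumes the handlers are re-keyed by MIME type with a dict comprehension, much like the `__init__` in `base_o365.py` in this PR. `rekey_by_mime_type` is a hypothetical helper, and the strings stand in for parser instances.

```python
import mimetypes
from typing import Any, Dict


def rekey_by_mime_type(handlers: Dict[str, Any]) -> Dict[str, Any]:
    """Re-key an extension-keyed mapping by MIME type (hypothetical helper).

    When two extensions resolve to the same MIME type, the entry that
    appears later in the input dict overwrites the earlier one.
    """
    return {
        mimetypes.guess_type(f"file.{ext}")[0]: parser
        for ext, parser in handlers.items()
    }


# Placeholder strings are used instead of real parser objects.
handlers = {"jpg": "FirstParser", "jpeg": "SecondParser"}
mime_handlers = rekey_by_mime_type(handlers)
print(mime_handlers)
```

Because Python dicts preserve insertion order, the `"jpeg"` entry is processed last and replaces the `"jpg"` entry under the shared `image/jpeg` key.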
60 changes: 58 additions & 2 deletions docs/docs/integrations/document_loaders/microsoft_sharepoint.ipynb
Expand Up @@ -9,7 +9,7 @@
"\n",
"> [Microsoft SharePoint](https://en.wikipedia.org/wiki/SharePoint) is a website-based collaboration system that uses workflow applications, “list” databases, and other web parts and security features to empower business teams to work together developed by Microsoft.\n",
"\n",
"This notebook covers how to load documents from the [SharePoint Document Library](https://support.microsoft.com/en-us/office/what-is-a-document-library-3b5976dd-65cf-4c9e-bf5a-713c10ca2872). Currently, only docx, doc, and pdf files are supported.\n",
"This notebook covers how to load documents from the [SharePoint Document Library](https://support.microsoft.com/en-us/office/what-is-a-document-library-3b5976dd-65cf-4c9e-bf5a-713c10ca2872). By default, the document loader loads `pdf`, `doc`, `docx`, and `txt` files. You can load other file types by providing appropriate parsers (see more below).\n",
"\n",
"## Prerequisites\n",
"1. Register an application with the [Microsoft identity platform](https://learn.microsoft.com/en-us/azure/active-directory/develop/quickstart-register-app) instructions.\n",
Expand Down Expand Up @@ -100,7 +100,63 @@
"\n",
"loader = SharePointLoader(document_library_id=\"YOUR DOCUMENT LIBRARY ID\", object_ids=[\"ID_1\", \"ID_2\"], auth_with_token=True)\n",
"documents = loader.load()\n",
"```\n"
"```\n",
"\n",
"#### 📑 Choosing supported file types and preferred parsers\n",
"By default `SharePointLoader` loads file types defined in [`document_loaders/parsers/registry`](https://github.com/langchain-ai/langchain/blob/master/libs/community/langchain_community/document_loaders/parsers/registry.py#L10-L22) using the default parsers (see below).\n",
"```python\n",
"def _get_default_parser() -> BaseBlobParser:\n",
" \"\"\"Get default mime-type based parser.\"\"\"\n",
" return MimeTypeBasedParser(\n",
" handlers={\n",
" \"application/pdf\": PyMuPDFParser(),\n",
" \"text/plain\": TextParser(),\n",
" \"application/msword\": MsWordParser(),\n",
" \"application/vnd.openxmlformats-officedocument.wordprocessingml.document\": (\n",
" MsWordParser()\n",
" ),\n",
" },\n",
" fallback_parser=None,\n",
" )\n",
"```\n",
"You can override this behavior by passing the `handlers` argument to `SharePointLoader`. \n",
"Pass a dictionary mapping either file extensions (like `\"doc\"`, `\"pdf\"`, etc.) \n",
"or MIME types (like `\"application/pdf\"`, `\"text/plain\"`, etc.) to parsers. \n",
"Note that you must use either file extensions or MIME types exclusively and \n",
"cannot mix them.\n",
"\n",
"Do not include the leading dot for file extensions.\n",
"\n",
"```python\n",
"# using file extensions:\n",
"handlers = {\n",
" \"doc\": MsWordParser(),\n",
" \"pdf\": PDFMinerParser(),\n",
" \"mp3\": OpenAIWhisperParser()\n",
"}\n",
"\n",
"# using MIME types:\n",
"handlers = {\n",
" \"application/msword\": MsWordParser(),\n",
" \"application/pdf\": PDFMinerParser(),\n",
" \"audio/mpeg\": OpenAIWhisperParser()\n",
"}\n",
"\n",
"loader = SharePointLoader(document_library_id=\"...\",\n",
" handlers=handlers # pass handlers to SharePointLoader\n",
" )\n",
"```\n",
"In case multiple file extensions map to the same MIME type, the last dictionary item will\n",
"apply.\n",
"Example:\n",
"```python\n",
"# 'jpg' and 'jpeg' both map to 'image/jpeg' MIME type. SecondParser() will be used \n",
"# to parse all jpg/jpeg files.\n",
"handlers = {\n",
" \"jpg\": FirstParser(),\n",
" \"jpeg\": SecondParser()\n",
"}\n",
"```"
]
}
],
Expand Down
116 changes: 94 additions & 22 deletions libs/community/langchain_community/document_loaders/base_o365.py
Expand Up @@ -3,26 +3,29 @@
from __future__ import annotations

import logging
import mimetypes
import os
import tempfile
from abc import abstractmethod
from enum import Enum
from pathlib import Path, PurePath
from typing import TYPE_CHECKING, Any, Dict, Iterable, List, Sequence, Union
from typing import TYPE_CHECKING, Any, Dict, Iterable, List, Optional, Sequence, Union

from pydantic import (
BaseModel,
Field,
FilePath,
PrivateAttr,
SecretStr,
)
from pydantic_settings import BaseSettings, SettingsConfigDict

from langchain_community.document_loaders.base import BaseLoader
from langchain_community.document_loaders.base import BaseBlobParser, BaseLoader
from langchain_community.document_loaders.blob_loaders.file_system import (
FileSystemBlobLoader,
)
from langchain_community.document_loaders.blob_loaders.schema import Blob
from langchain_community.document_loaders.parsers.generic import MimeTypeBasedParser
from langchain_community.document_loaders.parsers.registry import get_parser

if TYPE_CHECKING:
from O365 import Account
Expand All @@ -46,24 +49,27 @@ class _O365TokenStorage(BaseSettings):
token_path: FilePath = Path.home() / ".credentials" / "o365_token.txt"


class _FileType(str, Enum):
DOC = "doc"
DOCX = "docx"
PDF = "pdf"
def fetch_mime_types(file_types: Sequence[str]) -> Dict[str, str]:
"""Fetch the mime types for the specified file types."""
mime_types_mapping = {}
for ext in file_types:
mime_type, _ = mimetypes.guess_type(f"file.{ext}")
if mime_type:
mime_types_mapping[ext] = mime_type
else:
raise ValueError(f"Unknown mimetype of extension {ext}")
return mime_types_mapping


def fetch_mime_types(file_types: Sequence[_FileType]) -> Dict[str, str]:
def fetch_extensions(mime_types: Sequence[str]) -> Dict[str, str]:
"""Fetch the file extensions for the specified mime types."""
mime_types_mapping = {}
for file_type in file_types:
if file_type.value == "doc":
mime_types_mapping[file_type.value] = "application/msword"
elif file_type.value == "docx":
mime_types_mapping[file_type.value] = (
"application/vnd.openxmlformats-officedocument.wordprocessingml.document" # noqa: E501
)
elif file_type.value == "pdf":
mime_types_mapping[file_type.value] = "application/pdf"
for mime_type in mime_types:
ext = mimetypes.guess_extension(mime_type)
if ext:
mime_types_mapping[ext[1:]] = mime_type # ignore leading `.`
else:
raise ValueError(f"Unknown mimetype {mime_type}")
return mime_types_mapping


Expand All @@ -78,16 +84,82 @@ class O365BaseLoader(BaseLoader, BaseModel):
"""Number of bytes to retrieve from each api call to the server. int or 'auto'."""
recursive: bool = False
"""Should the loader recursively load subfolders?"""
handlers: Optional[Dict[str, Any]] = {}
"""
Provide custom handlers for MimeTypeBasedParser.

Pass a dictionary mapping either file extensions (like "doc", "pdf", etc.)
or MIME types (like "application/pdf", "text/plain", etc.) to parsers.
Note that you must use either file extensions or MIME types exclusively and
cannot mix them.

Do not include the leading dot for file extensions.

Example using file extensions:
```python
handlers = {
"doc": MsWordParser(),
"pdf": PDFMinerParser(),
"txt": TextParser()
}
```

Example using MIME types:
```python
handlers = {
"application/msword": MsWordParser(),
"application/pdf": PDFMinerParser(),
"text/plain": TextParser()
}
```
"""

_blob_parser: BaseBlobParser = PrivateAttr()
_file_types: Sequence[str] = PrivateAttr()
_mime_types: Dict[str, str] = PrivateAttr()

def __init__(self, **kwargs: Any) -> None:
super().__init__(**kwargs)
if self.handlers:
handler_keys = list(self.handlers.keys())
try:
# assume handlers.keys() are file extensions
self._mime_types = fetch_mime_types(handler_keys)
self._file_types = list(set(handler_keys))
mime_handlers = {
self._mime_types[extension]: handler
for extension, handler in self.handlers.items()
}
except ValueError:
try:
# assume handlers.keys() are mime types
self._mime_types = fetch_extensions(handler_keys)
self._file_types = list(set(self._mime_types.keys()))
mime_handlers = self.handlers
except ValueError:
raise ValueError(
"`handlers` keys must be either file extensions or mimetypes.\n"
f"{handler_keys} could not be interpreted as either.\n"
"File extensions and mimetypes cannot mix. "
"Use either one or the other"
)

@property
@abstractmethod
def _file_types(self) -> Sequence[_FileType]:
"""Return supported file types."""
self._blob_parser = MimeTypeBasedParser(
handlers=mime_handlers, fallback_parser=None
)
else:
self._blob_parser = get_parser("default")
if not isinstance(self._blob_parser, MimeTypeBasedParser):
raise TypeError(
'get_parser("default") was supposed to return MimeTypeBasedParser. '
f"It returned {type(self._blob_parser)}"
)
self._mime_types = fetch_extensions(list(self._blob_parser.handlers.keys()))

@property
def _fetch_mime_types(self) -> Dict[str, str]:
"""Return a dict of supported file types to corresponding mime types."""
return fetch_mime_types(self._file_types)
return self._mime_types

@property
@abstractmethod
Expand Down
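The key resolution in `O365BaseLoader.__init__` above can be approximated with the standard library alone. This is a sketch under assumptions: `classify_handler_keys` is a hypothetical name that is not part of the PR, and it only reports how a set of keys would be interpreted instead of building a parser. The result depends on the MIME database that `mimetypes` finds on the host system.

```python
import mimetypes
from typing import Sequence


def classify_handler_keys(keys: Sequence[str]) -> str:
    """Report whether all keys look like file extensions or like MIME types.

    Hypothetical helper mirroring the try/except order in the diff:
    extensions are tried first, MIME types second, and a mix of the
    two is rejected.
    """
    # An extension such as "pdf" resolves via a dummy file name.
    if all(mimetypes.guess_type(f"file.{key}")[0] for key in keys):
        return "extensions"
    # A MIME type such as "application/pdf" resolves to an extension.
    if all(mimetypes.guess_extension(key) for key in keys):
        return "mime_types"
    raise ValueError(
        f"{list(keys)} could not be interpreted as either file extensions "
        "or MIME types; the two cannot be mixed."
    )


print(classify_handler_keys(["pdf", "txt"]))
print(classify_handler_keys(["application/pdf", "text/plain"]))
```

Passing a mixed list such as `["pdf", "text/plain"]` raises `ValueError`, which matches the restriction stated in the `handlers` docstring.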