Support Indexing Docx Files #801

Merged 88 commits on Jun 20, 2024
Changes from 3 commits
Commits
7c6be74
UI update for file filtered conversations
MythicalCow May 28, 2024
599917e
linter fixes
MythicalCow May 29, 2024
cf9f4dd
linter fixes
MythicalCow May 29, 2024
8569092
UI and API endpoints integrations complete. file querying changes to …
MythicalCow May 30, 2024
8640f24
UI streamlining by adding a plus button instead of a static search bar
MythicalCow May 30, 2024
0d87a8f
backend and UI conversation ID based file filtering implemented. file…
MythicalCow May 30, 2024
d2a369f
file filtering based on selected files implemented. further testing a…
MythicalCow May 30, 2024
75d251d
final touches and unnecessary comments removed.
MythicalCow May 31, 2024
fcc99db
adding authentication requirement to file-filters endpoint
MythicalCow May 31, 2024
e515f04
fileobjectadapter created and tested
MythicalCow Jun 2, 2024
8de18ac
removed temp file
MythicalCow Jun 2, 2024
b727a52
Apply file filters with correct syntax
sabaimran Jun 3, 2024
71184d5
file object adapter notes for later
MythicalCow Jun 3, 2024
13d71a1
made PR changes for a better user experience
MythicalCow Jun 3, 2024
0c129ca
Merge branch 'khoj-ai:master' into features/conversation-file-filter
MythicalCow Jun 3, 2024
9714b63
padding fix
MythicalCow Jun 3, 2024
649edd0
Merge branch 'features/conversation-file-filter' of https://github.co…
MythicalCow Jun 3, 2024
cd204c7
small UI improvements
MythicalCow Jun 3, 2024
4a2233e
embeddings update to handle FileObjects
MythicalCow Jun 3, 2024
e5c8576
embeddings update to handle FileObjects
MythicalCow Jun 3, 2024
ec0b0a5
Merge branch 'khoj-ai:master' into features/document-summarization
MythicalCow Jun 4, 2024
312b8c4
save
MythicalCow Jun 4, 2024
0bf3d57
added support for new command: /summarize
MythicalCow Jun 4, 2024
60de2b4
addressed security risk highlighted in PR.
MythicalCow Jun 5, 2024
f80654b
added handling for automatically adding an uploaded file to the curre…
MythicalCow Jun 5, 2024
27e3771
Merge pull request #1 from MythicalCow/features/conversation-file-filter
MythicalCow Jun 5, 2024
c7ea0d9
added integrations between file filter and summarize command with err…
MythicalCow Jun 5, 2024
1f52e9a
fixed API issue related to duplicate file management
MythicalCow Jun 5, 2024
464576f
Merge pull request #2 from MythicalCow/features/conversation-file-filter
MythicalCow Jun 5, 2024
55d191b
added query based context to summarization command and added more rig…
MythicalCow Jun 6, 2024
fef1fe0
added critical fix that causes indexer to break on non pdf files
MythicalCow Jun 6, 2024
fe025ab
added support for new file type on khoj: docx
MythicalCow Jun 6, 2024
ececa9a
added docx2txt requirement to .toml file
MythicalCow Jun 7, 2024
38f71de
removed print statement
MythicalCow Jun 7, 2024
116b4dd
first set of PR fixes
MythicalCow Jun 7, 2024
4396b39
finished addressing most PR changes.
MythicalCow Jun 7, 2024
3bdd230
fixed config settings method to avoid extra DB calls
MythicalCow Jun 10, 2024
5d49810
changing tester for pdf to entries
MythicalCow Jun 13, 2024
fe9283c
added markdown support
MythicalCow Jun 13, 2024
706da48
tester fix for markdown
MythicalCow Jun 13, 2024
3e4eaab
plaintext support and testers updated
MythicalCow Jun 13, 2024
dce6290
Merge branch 'master' of https://github.com/MythicalCow/khoj into fea…
MythicalCow Jun 13, 2024
264f174
merge fixes
MythicalCow Jun 13, 2024
dc291b2
database migration fix
MythicalCow Jun 13, 2024
7c05b51
added support for orgmode to /summarize
MythicalCow Jun 13, 2024
7d99d1d
return type specifier
MythicalCow Jun 13, 2024
6209acf
merge conflict fix?
MythicalCow Jun 13, 2024
1851c0e
adding logging
MythicalCow Jun 14, 2024
435d8e6
Merge branch 'master' of https://github.com/MythicalCow/khoj into fea…
MythicalCow Jun 14, 2024
09e99d9
added support for new file type on khoj: docx
MythicalCow Jun 6, 2024
2ee0c00
added docx2txt requirement to .toml file
MythicalCow Jun 7, 2024
bb44910
removed print statement
MythicalCow Jun 7, 2024
2e03035
Merge branch 'docx-file-support' of https://github.com/MythicalCow/kh…
MythicalCow Jun 14, 2024
bebc2ac
removing .doc allowed file type.
MythicalCow Jun 14, 2024
a538485
bug fix for update_raw_text adapter
MythicalCow Jun 14, 2024
9d2a9c8
quick print statement remove
MythicalCow Jun 14, 2024
c9496cb
small fixes and file summarization support for docx.
MythicalCow Jun 14, 2024
070bde2
resolve merge issues with migration files. keep the old one, add a ne…
sabaimran Jun 14, 2024
0b2e391
Add error handling for when file object not found, add summarize comm…
sabaimran Jun 14, 2024
1da9bc4
Merge branch 'master' of github.com:khoj-ai/khoj into features/docume…
sabaimran Jun 14, 2024
d19ec47
Add summarize support for normal streamed chat, non socket
sabaimran Jun 14, 2024
2919c16
Remove print line for file list
sabaimran Jun 14, 2024
bba083b
added some tests for /summarize command.
MythicalCow Jun 14, 2024
3344a96
additional tests added in
MythicalCow Jun 15, 2024
eb993fb
added support for new file type on khoj: docx
MythicalCow Jun 6, 2024
b0909f8
added docx2txt requirement to .toml file
MythicalCow Jun 7, 2024
1643548
removed print statement
MythicalCow Jun 7, 2024
fa8ef95
added support for new file type on khoj: docx
MythicalCow Jun 6, 2024
88ad9ca
removed print statement
MythicalCow Jun 7, 2024
fa8db61
small fixes and file summarization support for docx.
MythicalCow Jun 14, 2024
07df66e
resolving conflicting db files
MythicalCow Jun 15, 2024
6308528
Merge branch 'docx-file-support' of https://github.com/MythicalCow/kh…
MythicalCow Jun 15, 2024
e609eed
removing redundant db file
MythicalCow Jun 15, 2024
2bfcb3e
added support for new file type on khoj: docx
MythicalCow Jun 6, 2024
aaba9b5
removed print statement
MythicalCow Jun 7, 2024
854afa4
added support for new file type on khoj: docx
MythicalCow Jun 6, 2024
b2cdb5f
removed print statement
MythicalCow Jun 7, 2024
fa45ac5
small fixes and file summarization support for docx.
MythicalCow Jun 14, 2024
d644b64
removing redundant db file
MythicalCow Jun 15, 2024
82c26b5
Merge branch 'docx-file-support' of https://github.com/MythicalCow/kh…
MythicalCow Jun 15, 2024
f83cba6
resolving merge conflict with master
MythicalCow Jun 15, 2024
d7ef0d8
resolving merge conflict
MythicalCow Jun 17, 2024
5933042
added additional tests
MythicalCow Jun 17, 2024
c027099
Merge branch 'features/document-summarization' into docx-file-support
MythicalCow Jun 17, 2024
10e2af2
added tests to offline chat director
MythicalCow Jun 17, 2024
7edb6fa
Merge branch 'features/document-summarization' into docx-file-support
MythicalCow Jun 17, 2024
546c378
linter fixes
MythicalCow Jun 17, 2024
6f94f7d
Merge remote-tracking branch 'origin' into docx-file-support
MythicalCow Jun 19, 2024
1 change: 1 addition & 0 deletions pyproject.toml
@@ -86,6 +86,7 @@ dependencies = [
    "cron-descriptor == 1.4.3",
    "django_apscheduler == 0.6.2",
    "anthropic == 0.26.1",
    "docx2txt == 0.8"
]
dynamic = ["version"]
31 changes: 31 additions & 0 deletions src/khoj/database/migrations/0044_alter_entry_file_type.py
@@ -0,0 +1,31 @@
# Generated by Django 4.2.11 on 2024-06-06 22:42

from django.db import migrations, models


class Migration(migrations.Migration):
    dependencies = [
        ("database", "0043_alter_chatmodeloptions_model_type"),
    ]

    operations = [
        migrations.AlterField(
            model_name="entry",
            name="file_type",
            field=models.CharField(
                choices=[
                    ("image", "Image"),
                    ("pdf", "Pdf"),
                    ("plaintext", "Plaintext"),
                    ("markdown", "Markdown"),
                    ("org", "Org"),
                    ("notion", "Notion"),
                    ("github", "Github"),
                    ("conversation", "Conversation"),
                    ("docx", "Docx"),
                ],
                default="plaintext",
                max_length=30,
            ),
        ),
    ]
1 change: 1 addition & 0 deletions src/khoj/database/models/__init__.py
@@ -305,6 +305,7 @@ class EntryType(models.TextChoices):
        NOTION = "notion"
        GITHUB = "github"
        CONVERSATION = "conversation"
        DOCX = "docx"

    class EntrySource(models.TextChoices):
        COMPUTER = "computer"
7 changes: 7 additions & 0 deletions src/khoj/interface/web/assets/icons/docx.svg
9 changes: 7 additions & 2 deletions src/khoj/interface/web/chat.html
@@ -46,8 +46,8 @@

To get started, just start typing below. You can also type / to see a list of commands.
`.trim()
- const allowedExtensions = ['text/org', 'text/markdown', 'text/plain', 'text/html', 'application/pdf'];
- const allowedFileEndings = ['org', 'md', 'txt', 'html', 'pdf'];
+ const allowedExtensions = ['text/org', 'text/markdown', 'text/plain', 'text/html', 'application/pdf', 'application/msword', 'application/vnd.openxmlformats-officedocument.wordprocessingml.document'];
+ const allowedFileEndings = ['org', 'md', 'txt', 'html', 'pdf', 'docx', 'doc'];
let chatOptions = [];
function createCopyParentText(message) {
return function(event) {
@@ -880,7 +880,12 @@
fileType = "text/html";
} else if (fileName.split('.').pop() === "pdf") {
fileType = "application/pdf";
} else if (fileName.split('.').pop() === "docx") {
fileType = "application/vnd.openxmlformats-officedocument.wordprocessingml.document";
} else if (fileName.split('.').pop() === "doc") {
fileType = "application/msword";
}

}

let fileObj = new Blob([fileContents], { type: fileType });
2 changes: 2 additions & 0 deletions src/khoj/interface/web/content_source_computer_input.html
@@ -73,6 +73,8 @@ <h2 class="section-title">
image_name = "pdf.svg"
else if (fileExtension === "markdown" || fileExtension === "md")
image_name = "markdown.svg"
else if (fileExtension === "docx")
image_name = "docx.svg"
else
image_name = "plaintext.svg"

Empty file.
107 changes: 107 additions & 0 deletions src/khoj/processor/content/docx/docx_to_entries.py
@@ -0,0 +1,107 @@
import logging
import os
from datetime import datetime
from typing import List, Tuple

from langchain_community.document_loaders import Docx2txtLoader

from khoj.database.models import Entry as DbEntry
from khoj.database.models import KhojUser
from khoj.processor.content.text_to_entries import TextToEntries
from khoj.utils.helpers import timer
from khoj.utils.rawconfig import Entry

logger = logging.getLogger(__name__)


class DocxToEntries(TextToEntries):
    def __init__(self):
        super().__init__()

    # Define Functions
    def process(
        self, files: dict[str, str] = None, full_corpus: bool = True, user: KhojUser = None, regenerate: bool = False
    ) -> Tuple[int, int]:
        # Extract required fields from config
        if not full_corpus:
            deletion_file_names = set([file for file in files if files[file] == b""])
            files_to_process = set(files) - deletion_file_names
            files = {file: files[file] for file in files_to_process}
        else:
            deletion_file_names = None

        # Extract Entries from specified Docx files
        with timer("Extract entries from specified DOCX files", logger):
            current_entries = DocxToEntries.extract_docx_entries(files)

        # Split entries by max tokens supported by model
        with timer("Split entries by max token size supported by model", logger):
            current_entries = self.split_entries_by_max_tokens(current_entries, max_tokens=256)

        # Identify, mark and merge any new entries with previous entries
        with timer("Identify new or updated entries", logger):
            num_new_embeddings, num_deleted_embeddings = self.update_embeddings(
                current_entries,
                DbEntry.EntryType.DOCX,
                DbEntry.EntrySource.COMPUTER,
                "compiled",
                logger,
                deletion_file_names,
                user,
                regenerate=regenerate,
            )

        return num_new_embeddings, num_deleted_embeddings

    @staticmethod
    def extract_docx_entries(docx_files) -> List[Entry]:
        """Extract entries from specified DOCX files"""

        entries: List[str] = []
        entry_to_location_map: List[Tuple[str, str]] = []
        for docx_file in docx_files:
            try:
                timestamp_now = datetime.utcnow().timestamp()
                tmp_file = f"tmp_docx_file_{timestamp_now}.docx"
                with open(tmp_file, "wb") as f:
                    bytes_content = docx_files[docx_file]
                    f.write(bytes_content)

                # Load the content using Docx2txtLoader
                loader = Docx2txtLoader(tmp_file)
                docx_entries_per_file = loader.load()

                # Convert the loaded entries into the desired format
                docx_texts = [page.page_content for page in docx_entries_per_file]

                entry_to_location_map += zip(docx_texts, [docx_file] * len(docx_texts))
                entries.extend(docx_texts)
            except Exception as e:
                logger.warning(f"Unable to process file: {docx_file}. This file will not be indexed.")
                logger.warning(e, exc_info=True)
            finally:
                if os.path.exists(f"{tmp_file}"):
                    os.remove(f"{tmp_file}")
        return DocxToEntries.convert_docx_entries_to_maps(entries, dict(entry_to_location_map))

    @staticmethod
    def convert_docx_entries_to_maps(parsed_entries: List[str], entry_to_file_map) -> List[Entry]:
        """Convert each DOCX entry into a dictionary"""
        entries = []
        for parsed_entry in parsed_entries:
            entry_filename = entry_to_file_map[parsed_entry]
            # Append base filename to compiled entry for context to model
            heading = f"{entry_filename}\n"
            compiled_entry = f"{heading}{parsed_entry}"
            entries.append(
                Entry(
                    compiled=compiled_entry,
                    raw=parsed_entry,
                    heading=heading,
                    file=f"{entry_filename}",
                )
            )

        logger.debug(f"Converted {len(parsed_entries)} DOCX entries to dictionaries")

        return entries
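`extract_docx_entries` above writes each upload to a timestamp-named file in the working directory before handing the path to `Docx2txtLoader`, then removes it in `finally`. A minimal sketch of the same write-then-clean-up pattern using the standard library's `tempfile` module, which chooses a unique name itself. The helper name `temporary_docx_path` is hypothetical and not part of this PR:

```python
import os
import tempfile
from contextlib import contextmanager
from typing import Iterator


@contextmanager
def temporary_docx_path(content: bytes) -> Iterator[str]:
    """Write bytes to a uniquely named .docx file and yield its path.

    Hypothetical helper for illustration only. The file is removed on
    exit, even if the caller raises while parsing it.
    """
    # mkstemp creates the file with a unique name and returns an open descriptor
    fd, path = tempfile.mkstemp(suffix=".docx")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(content)
        yield path
    finally:
        if os.path.exists(path):
            os.remove(path)


if __name__ == "__main__":
    with temporary_docx_path(b"placeholder bytes") as path:
        # A loader that needs a filesystem path would be called here
        print(os.path.basename(path))
```

Because the path is bound before the `try` block, the clean-up step never refers to a name that was not assigned.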
21 changes: 20 additions & 1 deletion src/khoj/routers/indexer.py
@@ -7,6 +7,7 @@
from starlette.authentication import requires

from khoj.database.models import GithubConfig, KhojUser, NotionConfig
from khoj.processor.content.docx.docx_to_entries import DocxToEntries
from khoj.processor.content.github.github_to_entries import GithubToEntries
from khoj.processor.content.markdown.markdown_to_entries import MarkdownToEntries
from khoj.processor.content.notion.notion_to_entries import NotionToEntries
@@ -40,6 +41,7 @@ class IndexerInput(BaseModel):
markdown: Optional[dict[str, str]] = None
pdf: Optional[dict[str, bytes]] = None
plaintext: Optional[dict[str, str]] = None
docx: Optional[dict[str, bytes]] = None


@indexer.post("/update")
@@ -63,7 +65,7 @@ async def update(
),
):
user = request.user.object
- index_files: Dict[str, Dict[str, str]] = {"org": {}, "markdown": {}, "pdf": {}, "plaintext": {}}
+ index_files: Dict[str, Dict[str, str]] = {"org": {}, "markdown": {}, "pdf": {}, "plaintext": {}, "docx": {}}
try:
logger.info(f"📬 Updating content index via API call by {client} client")
for file in files:
@@ -79,6 +81,7 @@ async def update(
markdown=index_files["markdown"],
pdf=index_files["pdf"],
plaintext=index_files["plaintext"],
docx=index_files["docx"],
)

if state.config == None:
@@ -93,6 +96,7 @@ async def update(
org=None,
markdown=None,
pdf=None,
docx=None,
image=None,
github=None,
notion=None,
@@ -129,6 +133,7 @@ async def update(
"num_markdown": len(index_files["markdown"]),
"num_pdf": len(index_files["pdf"]),
"num_plaintext": len(index_files["plaintext"]),
"num_docx": len(index_files["docx"]),
}

update_telemetry_state(
@@ -295,6 +300,20 @@ def configure_content(
        logger.error(f"🚨 Failed to setup Notion: {e}", exc_info=True)
        success = False

    try:
        if (search_type == state.SearchType.All.value or search_type == state.SearchType.Docx.value) and files["docx"]:
            logger.info("📄 Setting up search for docx")
            text_search.setup(
                DocxToEntries,
                files.get("docx"),
                regenerate=regenerate,
                full_corpus=full_corpus,
                user=user,
            )
    except Exception as e:
        logger.error(f"🚨 Failed to setup docx: {e}", exc_info=True)
        success = False

    # Invalidate Query Cache
    if user:
        state.query_cache[user.uuid] = LRU()
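`configure_content` forwards `full_corpus` to `text_search.setup`. In `DocxToEntries.process`, shown earlier in this diff, a partial update (`full_corpus=False`) treats any file whose content is empty bytes as a deletion request and indexes only the rest. A minimal sketch of that split, written as a standalone function; the name `split_deletions` is hypothetical and does not exist in the PR:

```python
from typing import Dict, Optional, Set, Tuple


def split_deletions(
    files: Dict[str, bytes], full_corpus: bool
) -> Tuple[Dict[str, bytes], Optional[Set[str]]]:
    """Separate files to index from files marked for deletion.

    Hypothetical helper mirroring the branch at the top of process():
    on a full-corpus run nothing is marked for deletion (None);
    otherwise empty payloads name the files to delete.
    """
    if full_corpus:
        return files, None
    deletions = {name for name, content in files.items() if content == b""}
    to_process = {name: content for name, content in files.items() if name not in deletions}
    return to_process, deletions


if __name__ == "__main__":
    to_process, deletions = split_deletions(
        {"notes.docx": b"\x50\x4b", "old.docx": b""}, full_corpus=False
    )
    print(sorted(to_process), sorted(deletions))
```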
1 change: 1 addition & 0 deletions src/khoj/utils/config.py
@@ -28,6 +28,7 @@ class SearchType(str, Enum):
    Github = "github"
    Notion = "notion"
    Plaintext = "plaintext"
    Docx = "docx"


class ProcessorType(str, Enum):
2 changes: 2 additions & 0 deletions src/khoj/utils/helpers.py
@@ -115,6 +115,8 @@ def get_file_type(file_type: str, file_content: bytes) -> tuple[str, str]:
        return "org", encoding
    elif file_type in ["application/pdf"]:
        return "pdf", encoding
    elif file_type in ["application/msword", "application/vnd.openxmlformats-officedocument.wordprocessingml.document"]:
        return "docx", encoding
    elif file_type in ["image/jpeg"]:
        return "jpeg", encoding
    elif file_type in ["image/png"]:
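The branch added to `get_file_type` maps both Word MIME types to the `docx` content type. A sketch of the same mapping as a lookup table, restricted to the MIME types whose return values are visible in this hunk. The names `MIME_TO_CONTENT_TYPE` and `content_type_for` are hypothetical, and the `"other"` fallback is an assumption, since the real function's fallback lies outside this diff:

```python
from typing import Dict

# Only the pairs visible in the hunk above; the real function handles more types
MIME_TO_CONTENT_TYPE: Dict[str, str] = {
    "application/pdf": "pdf",
    "application/msword": "docx",
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document": "docx",
    "image/jpeg": "jpeg",
}


def content_type_for(mime_type: str, default: str = "other") -> str:
    """Return the content type for a MIME type (hypothetical helper).

    Parameters such as "; charset=binary" are dropped and the lookup is
    case-insensitive. The default value is an assumption.
    """
    base = mime_type.split(";", 1)[0].strip().lower()
    return MIME_TO_CONTENT_TYPE.get(base, default)


if __name__ == "__main__":
    print(content_type_for("application/msword"))
```

Note that `application/msword` denotes the legacy binary `.doc` format, which is not a zip-based `.docx` archive; a later commit in this PR is titled "removing .doc allowed file type."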
1 change: 1 addition & 0 deletions src/khoj/utils/rawconfig.py
@@ -60,6 +60,7 @@ class ContentConfig(ConfigBase):
    plaintext: Optional[TextContentConfig] = None
    github: Optional[GithubContentConfig] = None
    notion: Optional[NotionContentConfig] = None
    docx: Optional[TextContentConfig] = None


class ImageSearchConfig(ConfigBase):