Support Indexing Docx Files #801

Merged 88 commits on Jun 20, 2024
Commits
7c6be74
UI update for file filtered conversations
MythicalCow May 28, 2024
599917e
linter fixes
MythicalCow May 29, 2024
cf9f4dd
linter fixes
MythicalCow May 29, 2024
8569092
UI and API endpoints integrations complete. file querying changes to …
MythicalCow May 30, 2024
8640f24
UI streamlining by adding a plus button instead of a static search bar
MythicalCow May 30, 2024
0d87a8f
backend and UI conversation ID based file filtering implemented. file…
MythicalCow May 30, 2024
d2a369f
file filtering based on selected files implemented. further testing a…
MythicalCow May 30, 2024
75d251d
final touches and unnecessary comments removed.
MythicalCow May 31, 2024
fcc99db
adding authentication requirement to file-filters endpoint
MythicalCow May 31, 2024
e515f04
fileobjectadapter created and tested
MythicalCow Jun 2, 2024
8de18ac
removed temp file
MythicalCow Jun 2, 2024
b727a52
Apply file filters with correct syntax
sabaimran Jun 3, 2024
71184d5
file object adapter notes for later
MythicalCow Jun 3, 2024
13d71a1
made PR changes for a better user experience
MythicalCow Jun 3, 2024
0c129ca
Merge branch 'khoj-ai:master' into features/conversation-file-filter
MythicalCow Jun 3, 2024
9714b63
padding fix
MythicalCow Jun 3, 2024
649edd0
Merge branch 'features/conversation-file-filter' of https://github.co…
MythicalCow Jun 3, 2024
cd204c7
small UI improvements
MythicalCow Jun 3, 2024
4a2233e
embeddings update to handle FileObjects
MythicalCow Jun 3, 2024
e5c8576
embeddings update to handle FileObjects
MythicalCow Jun 3, 2024
ec0b0a5
Merge branch 'khoj-ai:master' into features/document-summarization
MythicalCow Jun 4, 2024
312b8c4
save
MythicalCow Jun 4, 2024
0bf3d57
added support for new command: /summarize
MythicalCow Jun 4, 2024
60de2b4
addressed security risk highlighted in PR.
MythicalCow Jun 5, 2024
f80654b
added handling for automatically adding an uploaded file to the curre…
MythicalCow Jun 5, 2024
27e3771
Merge pull request #1 from MythicalCow/features/conversation-file-filter
MythicalCow Jun 5, 2024
c7ea0d9
added integrations between file filter and summarize command with err…
MythicalCow Jun 5, 2024
1f52e9a
fixed API issue related to duplicate file management
MythicalCow Jun 5, 2024
464576f
Merge pull request #2 from MythicalCow/features/conversation-file-filter
MythicalCow Jun 5, 2024
55d191b
added query based context to summarization command and added more rig…
MythicalCow Jun 6, 2024
fef1fe0
added critical fix that causes indexer to break on non pdf files
MythicalCow Jun 6, 2024
fe025ab
added support for new file type on khoj: docx
MythicalCow Jun 6, 2024
ececa9a
added docx2txt requirement to .toml file
MythicalCow Jun 7, 2024
38f71de
removed print statement
MythicalCow Jun 7, 2024
116b4dd
first set of PR fixes
MythicalCow Jun 7, 2024
4396b39
finished addressing most PR changes.
MythicalCow Jun 7, 2024
3bdd230
fixed config settings method to avoid extra DB calls
MythicalCow Jun 10, 2024
5d49810
changing tester for pdf to entries
MythicalCow Jun 13, 2024
fe9283c
added markdown support
MythicalCow Jun 13, 2024
706da48
tester fix for markdown
MythicalCow Jun 13, 2024
3e4eaab
plaintext support and testers updated
MythicalCow Jun 13, 2024
dce6290
Merge branch 'master' of https://github.com/MythicalCow/khoj into fea…
MythicalCow Jun 13, 2024
264f174
merge fixes
MythicalCow Jun 13, 2024
dc291b2
database migration fix
MythicalCow Jun 13, 2024
7c05b51
added support for orgmode to /summarize
MythicalCow Jun 13, 2024
7d99d1d
return type specifier
MythicalCow Jun 13, 2024
6209acf
merge conflict fix?
MythicalCow Jun 13, 2024
1851c0e
adding logging
MythicalCow Jun 14, 2024
435d8e6
Merge branch 'master' of https://github.com/MythicalCow/khoj into fea…
MythicalCow Jun 14, 2024
09e99d9
added support for new file type on khoj: docx
MythicalCow Jun 6, 2024
2ee0c00
added docx2txt requirement to .toml file
MythicalCow Jun 7, 2024
bb44910
removed print statement
MythicalCow Jun 7, 2024
2e03035
Merge branch 'docx-file-support' of https://github.com/MythicalCow/kh…
MythicalCow Jun 14, 2024
bebc2ac
removing .doc allowed file type.
MythicalCow Jun 14, 2024
a538485
bug fix for update_raw_text adapter
MythicalCow Jun 14, 2024
9d2a9c8
quick print statement remove
MythicalCow Jun 14, 2024
c9496cb
small fixes and file summarization support for docx.
MythicalCow Jun 14, 2024
070bde2
resolve merge issues with migration files. keep the old one, add a ne…
sabaimran Jun 14, 2024
0b2e391
Add error handling for when file object not found, add summarize comm…
sabaimran Jun 14, 2024
1da9bc4
Merge branch 'master' of github.com:khoj-ai/khoj into features/docume…
sabaimran Jun 14, 2024
d19ec47
Add summarize support for normal streamed chat, non socket
sabaimran Jun 14, 2024
2919c16
Remove print line for file list
sabaimran Jun 14, 2024
bba083b
added some tests for /summarize command.
MythicalCow Jun 14, 2024
3344a96
additional tests added in
MythicalCow Jun 15, 2024
eb993fb
added support for new file type on khoj: docx
MythicalCow Jun 6, 2024
b0909f8
added docx2txt requirement to .toml file
MythicalCow Jun 7, 2024
1643548
removed print statement
MythicalCow Jun 7, 2024
fa8ef95
added support for new file type on khoj: docx
MythicalCow Jun 6, 2024
88ad9ca
removed print statement
MythicalCow Jun 7, 2024
fa8db61
small fixes and file summarization support for docx.
MythicalCow Jun 14, 2024
07df66e
resolving conflicting db files
MythicalCow Jun 15, 2024
6308528
Merge branch 'docx-file-support' of https://github.com/MythicalCow/kh…
MythicalCow Jun 15, 2024
e609eed
removing redundant db file
MythicalCow Jun 15, 2024
2bfcb3e
added support for new file type on khoj: docx
MythicalCow Jun 6, 2024
aaba9b5
removed print statement
MythicalCow Jun 7, 2024
854afa4
added support for new file type on khoj: docx
MythicalCow Jun 6, 2024
b2cdb5f
removed print statement
MythicalCow Jun 7, 2024
fa45ac5
small fixes and file summarization support for docx.
MythicalCow Jun 14, 2024
d644b64
removing redundant db file
MythicalCow Jun 15, 2024
82c26b5
Merge branch 'docx-file-support' of https://github.com/MythicalCow/kh…
MythicalCow Jun 15, 2024
f83cba6
resolving merge conflict with master
MythicalCow Jun 15, 2024
d7ef0d8
resolving merge conflict
MythicalCow Jun 17, 2024
5933042
added additional tests
MythicalCow Jun 17, 2024
c027099
Merge branch 'features/document-summarization' into docx-file-support
MythicalCow Jun 17, 2024
10e2af2
added tests to offline chat director
MythicalCow Jun 17, 2024
7edb6fa
Merge branch 'features/document-summarization' into docx-file-support
MythicalCow Jun 17, 2024
546c378
linter fixes
MythicalCow Jun 17, 2024
6f94f7d
Merge remote-tracking branch 'origin' into docx-file-support
MythicalCow Jun 19, 2024
first set of PR fixes
MythicalCow committed Jun 7, 2024
commit 116b4dd2ed9891825675edd1ed36726d35ab3352
6 changes: 3 additions & 3 deletions src/khoj/database/adapters/__init__.py
@@ -849,7 +849,7 @@ async def aget_text_to_image_model_config():

 class FileObjectAdapters:
     @staticmethod
-    def overwrite_raw_text(file_object: FileObject, new_raw_text: str):
+    def update_raw_text(file_object: FileObject, new_raw_text: str):
         file_object.raw_text = new_raw_text

     @staticmethod
@@ -873,7 +873,7 @@ def delete_all_file_objects():
         return FileObject.objects.all().delete()

     @staticmethod
-    async def async_overwrite_raw_text(file_object: FileObject, new_raw_text: str):
+    async def async_update_raw_text(file_object: FileObject, new_raw_text: str):
         await sync_to_async(lambda: setattr(file_object, "raw_text", new_raw_text))()
         await sync_to_async(file_object.save)()

@@ -899,7 +899,7 @@ async def async_delete_all_file_objects():


 class EntryAdapters:
-    word_filer = WordFilter() # typo here should be word_filter
+    word_filer = WordFilter()
     file_filter = FileFilter()
     date_filter = DateFilter()

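In the hunks above, the synchronous `update_raw_text` only assigns `raw_text`, while `async_update_raw_text` also calls `save`. The following is a minimal sketch of keeping the two variants consistent, assuming persistence is intended in both. It is not khoj's code: `FakeFileObject` and `FileObjectAdaptersSketch` are hypothetical stand-ins, and `asyncio.to_thread` is swapped in for `sync_to_async` so the example runs without Django or asgiref.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class FakeFileObject:
    """Hypothetical stand-in for the FileObject model (no Django required)."""

    file_name: str
    raw_text: str = ""
    save_count: int = 0  # counts save() calls so persistence is observable

    def save(self) -> None:
        # A real model would write to the database here.
        self.save_count += 1


class FileObjectAdaptersSketch:
    """Hypothetical adapter; mirrors the method names in the diff only."""

    @staticmethod
    def update_raw_text(file_object: FakeFileObject, new_raw_text: str) -> None:
        file_object.raw_text = new_raw_text
        file_object.save()  # assumption: the sync path should persist too

    @staticmethod
    async def async_update_raw_text(file_object: FakeFileObject, new_raw_text: str) -> None:
        # Run the blocking update in a worker thread, reusing the sync logic.
        await asyncio.to_thread(
            FileObjectAdaptersSketch.update_raw_text, file_object, new_raw_text
        )


def demo() -> FakeFileObject:
    obj = FakeFileObject("notes.docx")
    FileObjectAdaptersSketch.update_raw_text(obj, "first")
    asyncio.run(FileObjectAdaptersSketch.async_update_raw_text(obj, "second"))
    return obj
```

Delegating the async variant to the sync one keeps a single code path for the update, so the two cannot drift apart on whether `save` is called.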
2 changes: 1 addition & 1 deletion src/khoj/processor/content/text_to_entries.py
@@ -199,7 +199,7 @@ def update_embeddings(
     raw_text = " ".join(file_to_text_map[file_name])
     file_object = FileObjectAdapters.get_file_objects_by_name(file_name)
     if file_object:
-        FileObjectAdapters.overwrite_raw_text(file_object, raw_text)
+        FileObjectAdapters.update_raw_text(file_object, raw_text)
     else:
         FileObjectAdapters.create_file_object(file_name, raw_text)

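The call site in `update_embeddings` follows an update-or-create pattern: join the text chunks for a file, then update the existing file object if one is found, otherwise create a new one. Below is a rough sketch of that control flow against an in-memory dictionary. `InMemoryFileStore` and `upsert_raw_text` are hypothetical names used only for illustration; they are not part of khoj.

```python
from typing import Dict, List


class InMemoryFileStore:
    """Hypothetical stand-in for the FileObjectAdapters lookups and writes."""

    def __init__(self) -> None:
        self.raw_text_by_name: Dict[str, str] = {}
        self.created: List[str] = []  # file names that took the create branch
        self.updated: List[str] = []  # file names that took the update branch

    def upsert_raw_text(self, file_name: str, chunks: List[str]) -> str:
        # Same joining step as in the diff: chunks separated by single spaces.
        raw_text = " ".join(chunks)
        if file_name in self.raw_text_by_name:
            self.updated.append(file_name)  # existing entry: update branch
        else:
            self.created.append(file_name)  # no entry yet: create branch
        self.raw_text_by_name[file_name] = raw_text
        return raw_text
```

Indexing the same file twice takes the create branch the first time and the update branch the second, leaving a single entry with the latest text.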