[Community] Document Loader for Logseq #27400

ishaan-upadhyay · 2024-10-16T17:14:17Z

Checked

I searched existing ideas and did not find a similar one
I added a very descriptive title
I've clearly described the feature request and motivation for it

Feature request

Logseq is an open-source knowledge base with more than 30K GitHub stars (see the repository here). We are a group of four students from the University of Toronto, and we would like to implement document-loading support for Logseq to enable RAG through LangChain.

Motivation

I use Logseq heavily for taking notes and organizing information, to the point where my graph is getting quite complicated. With a first-class LangChain integration, I could explore the graph with RAG, which would greatly simplify searching it. Furthermore, as Logseq adds collaborative editing in its database version and becomes more viable for organization-level knowledge bases, being able to navigate it with retrieval-augmented generation will become very useful.

Other knowledge bases, such as Obsidian, already have their own loaders, which make use of special properties instead of simply loading the directory and the Markdown files it contains. There is also interest from the Logseq community in LLM integration, and a few Logseq plugins (#1, #2) already integrate LLMs, though primarily for summarizing text or assisting in note generation rather than for retrieval.

Proposal (If applicable)

Currently, Logseq stores a graph as flat directories of Markdown files: pages live under pages, journal entries under journals, and embedded assets under assets. This may diverge in the future as Logseq implements a database version (which should still have 2-way sync to the Markdown structure).
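As a rough illustration of that layout, file discovery could look like the sketch below. The function name iter_graph_files is hypothetical, and the sketch assumes that pages and journals sit directly under the graph root and that every note uses the .md extension.

```python
from pathlib import Path
from typing import Iterator


def iter_graph_files(graph_path: str) -> Iterator[Path]:
    """Yield the Markdown files under the graph's pages/ and journals/ folders."""
    root = Path(graph_path)
    for folder in ("pages", "journals"):
        directory = root / folder
        if not directory.is_dir():
            continue  # tolerate a graph that lacks one of the two folders
        # The layout is flat, so no recursion; sorting keeps the order stable.
        yield from sorted(directory.glob("*.md"))
```

The assets folder is skipped on purpose, since it holds embedded files rather than notes.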

Therefore, the initial implementation would be similar to the existing ObsidianLoader, and we anticipate only having to add a LogseqLoader class. It would be used as follows:

from langchain_community.document_loaders import LogseqLoader

loader = LogseqLoader("<logseq-graph-path>")
docs = loader.load()

As in Obsidian, metadata is stored at the top of the file as front matter, though not in YAML format. We propose extending the loader further so that it also adds document metadata for:

Links: linked pages ([[page name]]) or tags (#page) in the body of the file,
Hierarchy: A__B in the file name corresponds to A/B, with B being a subpage of A.
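A minimal sketch of how these two kinds of metadata might be extracted. The helper names (page_hierarchy, page_links) and the exact tag pattern are assumptions made for illustration; Logseq's real tag grammar may be broader than what this pattern accepts.

```python
import re
from pathlib import Path

# Unescaped [[page name]] links; non-greedy so adjacent links stay separate.
LINK_RE = re.compile(r"(?<!\\)\[\[(.*?)\]\]")
# Unescaped #tag not preceded by a word character. The tag stops at whitespace,
# '#' or a bracket, so Markdown headings such as "## Title" are not matched.
TAG_RE = re.compile(r"(?<![\\\w])#([^\s#\[\]]+)")


def page_hierarchy(file_path: str) -> list[str]:
    """Split a file name such as 'A__B.md' into ['A', 'B']."""
    return Path(file_path).stem.split("__")


def page_links(body: str) -> list[str]:
    """Return link targets first, then tags, with duplicates removed."""
    targets = [m.group(1) for m in LINK_RE.finditer(body)]
    # Trailing sentence punctuation is assumed not to be part of a tag.
    targets += [m.group(1).rstrip(".,;:!?") for m in TAG_RE.finditer(body)]
    return list(dict.fromkeys(targets))
```

For a file named A__B.md, page_hierarchy gives the path from the top-level page down to the subpage, which could be stored under a metadata key such as hierarchy.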

Brief pseudocode for how the load function would work:

# page_filename is the file name without its extension
for parent in page_filename.split("__"):
     document.metadata["hierarchy"].append(parent)

# compile with re.DOTALL so that "." also matches the newlines inside the front matter
FRONT_MATTER_REGEX = r"^---\n(.*?)\n---\n"

match and extract front matter
split on newlines, and then on commas
(based on cursory testing, Logseq does not allow escaping the commas that separate values in front matter)
for key, values in front_matter.items():
     document.metadata[key] = values

# raw strings keep the backslashes intact; non-greedy groups keep two links on one line separate
LINK_REGEX = r"(?<!\\)\[\[(.*?)\]\]"
INLINE_TAG_LINK_REGEX = r"(?<!\\)#([^\s#]+)"

match and extract all links
for link in matched_links:
     document.metadata["links"].append(link)

page_content = file contents with front matter removed

yield document in lazy load function

These can all be detected with regexes and should not be overly complicated.
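To make the front-matter step concrete, here is a runnable sketch. It rests on several assumptions: the Page dataclass is a hypothetical stand-in for LangChain's Document so that the example has no dependencies, each front-matter line is assumed to look like key: value or key:: value, and lazy_load takes an in-memory mapping of file names to text rather than reading from disk.

```python
import re
from dataclasses import dataclass, field
from typing import Iterator

# DOTALL lets "." span the lines between the two "---" fences.
FRONT_MATTER_RE = re.compile(r"^---\n(.*?)\n---\n", re.DOTALL)


@dataclass
class Page:  # hypothetical stand-in for LangChain's Document
    page_content: str
    metadata: dict = field(default_factory=dict)


def split_front_matter(text: str) -> tuple[dict, str]:
    """Return (properties, body); properties is empty without front matter."""
    match = FRONT_MATTER_RE.match(text)
    if match is None:
        return {}, text
    properties = {}
    for line in match.group(1).splitlines():
        # Assumption: lines look like "key: value" or "key:: value".
        key, sep, raw = line.partition(":")
        if not sep:
            continue  # skip lines that carry no key
        # Commas separate values; per the post they cannot be escaped.
        values = [v.strip() for v in raw.lstrip(":").split(",") if v.strip()]
        properties[key.strip()] = values
    return properties, text[match.end():]


def lazy_load(pages: dict[str, str]) -> Iterator[Page]:
    """Yield one Page per (file name, raw text) pair."""
    for name, text in pages.items():
        properties, body = split_front_matter(text)
        yield Page(page_content=body, metadata={"source": name, **properties})
```

Every property value is kept as a list here, which avoids special-casing single values; whether the real loader should flatten one-element lists is an open design choice.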

If a strategy-based approach is more appropriate, given how similar these knowledge-base loaders are, we could instead define strategies for parsing front matter, parsing bodies, and loading file paths, and share a base lazy_load function across knowledge bases. This may be over-engineering for this problem, though.
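For comparison, the strategy-based alternative could be sketched roughly as below. KnowledgeBaseLoader and its three fields are hypothetical names, not an existing LangChain API, and documents are plain dicts to keep the example self-contained.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator

# Each strategy is a plain callable, so formats differ only in what they plug in.
FrontMatterParser = Callable[[str], tuple[dict, str]]  # text -> (metadata, body)
BodyParser = Callable[[str], dict]                     # body -> extra metadata


@dataclass
class KnowledgeBaseLoader:
    """Hypothetical shared base: one lazy_load, three pluggable strategies."""

    list_files: Callable[[], Iterable[tuple[str, str]]]  # yields (name, text)
    parse_front_matter: FrontMatterParser
    parse_body: BodyParser

    def lazy_load(self) -> Iterator[dict]:
        for name, text in self.list_files():
            metadata, body = self.parse_front_matter(text)
            # Body-derived metadata (links, tags) is merged last.
            metadata = {"source": name, **metadata, **self.parse_body(body)}
            yield {"page_content": body, "metadata": metadata}
```

A Logseq loader and an Obsidian loader would then differ only in the three callables they pass in, at the cost of an extra layer of indirection for what is currently a single class.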

If accepted, we plan to submit a PR no later than mid-November.
