Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

community: fix MarkdownHeaderTextSplitter fails to parse headers with non-printable characters #20645

Conversation

coolbeevip
Copy link
Contributor

Description: MarkdownHeaderTextSplitter Fails to Parse Headers with non-printable characters. more #20643

The following is the official test case. Just replacing # Foo\n\n with \ufeff# Foo\n\n will cause the test case to fail.

chunk metadata is empty

def test_md_header_text_splitter_1() -> None:
    """Test markdown splitter by header: Case 1."""

    markdown_document = (
        "\ufeff# Foo\n\n"
        "    ## Bar\n\n"
        "Hi this is Jim\n\n"
        "Hi this is Joe\n\n"
        " ## Baz\n\n"
        " Hi this is Molly"
    )
    headers_to_split_on = [
        ("#", "Header 1"),
        ("##", "Header 2"),
    ]
    markdown_splitter = MarkdownHeaderTextSplitter(
        headers_to_split_on=headers_to_split_on,
    )
    output = markdown_splitter.split_text(markdown_document)
    expected_output = [
        Document(
            page_content="Hi this is Jim  \nHi this is Joe",
            metadata={"Header 1": "Foo", "Header 2": "Bar"},
        ),
        Document(
            page_content="Hi this is Molly",
            metadata={"Header 1": "Foo", "Header 2": "Baz"},
        ),
    ]
    assert output == expected_output

twitter: @coolbeevip

@dosubot dosubot bot added the size:M This PR changes 30-99 lines, ignoring generated files. label Apr 19, 2024
Copy link

vercel bot commented Apr 19, 2024

The latest updates on your projects. Learn more about Vercel for Git ↗︎

1 Ignored Deployment
Name Status Preview Comments Updated (UTC)
langchain ⬜️ Ignored (Inspect) Visit Preview Apr 25, 2024 0:04am

@dosubot dosubot bot added Ɑ: text splitters Related to text splitters package 🤖:bug Related to a bug, vulnerability, unexpected error with an existing feature labels Apr 19, 2024
Remove non-printable characters from stripped lines
@coolbeevip coolbeevip force-pushed the fix-markdown-header-text-splitter-with-invisible-characters branch from 0d3d315 to 4cf5349 Compare April 19, 2024 14:17
@coolbeevip
Copy link
Contributor Author

coolbeevip commented Apr 20, 2024

2 workflows awaiting approval

Could you please take a look when you have a moment? @hwchase17 @baskaryan

@dosubot dosubot bot added the lgtm PR looks good. Use to confirm that a PR is ready for merging. label Apr 25, 2024
@baskaryan baskaryan enabled auto-merge (squash) April 25, 2024 00:04
@baskaryan baskaryan merged commit 2cd907a into langchain-ai:master Apr 25, 2024
78 checks passed
hinthornw pushed a commit that referenced this pull request Apr 26, 2024
…headers with non-printable characters (#20645)

Description: MarkdownHeaderTextSplitter Fails to Parse Headers with
non-printable characters. more #20643

The following is the official test case. Just replacing `# Foo\n\n` with
`\ufeff# Foo\n\n` will cause the test case to fail.

chunk metadata is empty

```python
def test_md_header_text_splitter_1() -> None:
    """Test markdown splitter by header: Case 1."""

    markdown_document = (
        "\ufeff# Foo\n\n"
        "    ## Bar\n\n"
        "Hi this is Jim\n\n"
        "Hi this is Joe\n\n"
        " ## Baz\n\n"
        " Hi this is Molly"
    )
    headers_to_split_on = [
        ("#", "Header 1"),
        ("##", "Header 2"),
    ]
    markdown_splitter = MarkdownHeaderTextSplitter(
        headers_to_split_on=headers_to_split_on,
    )
    output = markdown_splitter.split_text(markdown_document)
    expected_output = [
        Document(
            page_content="Hi this is Jim  \nHi this is Joe",
            metadata={"Header 1": "Foo", "Header 2": "Bar"},
        ),
        Document(
            page_content="Hi this is Molly",
            metadata={"Header 1": "Foo", "Header 2": "Baz"},
        ),
    ]
    assert output == expected_output
```

twitter: @coolbeevip

Co-authored-by: Bagatur <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
🤖:bug Related to a bug, vulnerability, unexpected error with an existing feature lgtm PR looks good. Use to confirm that a PR is ready for merging. size:M This PR changes 30-99 lines, ignoring generated files. Ɑ: text splitters Related to text splitters package
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants