community: fix MarkdownHeaderTextSplitter fails to parse headers with non-printable characters #20645

coolbeevip · 2024-04-19T06:46:12Z

Description: MarkdownHeaderTextSplitter fails to parse headers that contain non-printable characters. More in #20643.

The following is the official test case. Replacing `# Foo\n\n` with `\ufeff# Foo\n\n` is enough to make it fail.

The chunk metadata is empty.

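from langchain_core.documents import Document
from langchain_text_splitters import MarkdownHeaderTextSplitter


def test_md_header_text_splitter_1() -> None:
    """Test markdown splitter by header: Case 1."""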

    markdown_document = (
        "\ufeff# Foo\n\n"
        "    ## Bar\n\n"
        "Hi this is Jim\n\n"
        "Hi this is Joe\n\n"
        " ## Baz\n\n"
        " Hi this is Molly"
    )
    headers_to_split_on = [
        ("#", "Header 1"),
        ("##", "Header 2"),
    ]
    markdown_splitter = MarkdownHeaderTextSplitter(
        headers_to_split_on=headers_to_split_on,
    )
    output = markdown_splitter.split_text(markdown_document)
    expected_output = [
        Document(
            page_content="Hi this is Jim  \nHi this is Joe",
            metadata={"Header 1": "Foo", "Header 2": "Bar"},
        ),
        Document(
            page_content="Hi this is Molly",
            metadata={"Header 1": "Foo", "Header 2": "Baz"},
        ),
    ]
    assert output == expected_output
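
For context, here is a minimal sketch of why the leading `\ufeff` breaks header detection. It assumes the splitter checks the whitespace-stripped line against each configured header prefix; `looks_like_header` is a hypothetical helper written for illustration, not the splitter's actual code.

```python
def looks_like_header(line: str, prefix: str = "#") -> bool:
    """Hypothetical stand-in for a prefix check on a stripped line."""
    stripped = line.strip()
    # str.strip() only removes whitespace. U+FEFF is a format character
    # (Unicode category Cf), so it survives and stays in front of the "#".
    return stripped.startswith(prefix + " ")


print(looks_like_header("  # Foo"))      # plain header with leading spaces
print(looks_like_header("\ufeff# Foo"))  # same header behind a byte order mark
```

Under that assumption the second line is never recognized as a header, so no "Header 1" value is recorded for the chunks that follow it.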

twitter: @coolbeevip

coolbeevip · 2024-04-20T16:27:24Z

Could you please take a look when you have a moment? @hwchase17 @baskaryan

…headers with non-printable characters (#20645)

Co-authored-by: Bagatur

dosubot bot added the size:M label (this PR changes 30-99 lines, ignoring generated files) on Apr 19, 2024

dosubot bot added the Ɑ: text splitters (related to the text splitters package) and 🤖:bug (related to a bug, vulnerability, or unexpected error with an existing feature) labels on Apr 19, 2024

Remove non-printable characters from stripped lines

4cf5349
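
A minimal sketch of what "remove non-printable characters from stripped lines" can look like, assuming the cleanup runs on each line right after whitespace stripping. The function name `clean_line` is hypothetical and chosen for illustration; the diff in this PR is the authoritative version.

```python
def clean_line(line: str) -> str:
    """Strip surrounding whitespace, then drop non-printable characters."""
    stripped = line.strip()
    # str.isprintable() is False for format and control characters such as
    # U+FEFF, so filtering on it leaves only the visible text of the line.
    return "".join(ch for ch in stripped if ch.isprintable())


print(repr(clean_line("\ufeff# Foo\n")))
print(repr(clean_line("    ## Bar")))
```

One consequence of filtering on `str.isprintable()` is that a tab inside a line is dropped as well, since Python treats it as non-printable.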

coolbeevip force-pushed the fix-markdown-header-text-splitter-with-invisible-characters branch from 0d3d315 to 4cf5349 on April 19, 2024 14:17

Merge branch 'master' into fix-markdown-header-text-splitter-with-invisible-characters

2571858

baskaryan approved these changes Apr 25, 2024

dosubot bot added the lgtm label (PR looks good; use to confirm that a PR is ready for merging) on Apr 25, 2024

fmt

96c9a6f

baskaryan enabled auto-merge (squash) April 25, 2024 00:04

baskaryan merged commit 2cd907a into langchain-ai:master Apr 25, 2024
78 checks passed
