
perf: improve parsing performance from O(n^2) to O(n) #162

Open · wants to merge 3 commits into develop
Conversation

@AlextheYounga AlextheYounga commented Dec 17, 2024

The function `process_tag` was previously concatenating strings inside a loop. Each `+` operation creates a new string, so the already-accumulated data is copied again on every iteration. This logic is of complexity O(n²).

We can take this from quadratic to linear by appending chunks to a list and returning a single `''.join()` at the end. This logic is of complexity O(n).
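The pattern being replaced can be sketched as follows (a minimal illustration, not the actual markdownify code; note that CPython can sometimes optimize `+=` in place, but that behavior is not guaranteed):

```python
# Quadratic: each += may copy the entire accumulated string
def concat_quadratic(chunks):
    text = ''
    for chunk in chunks:
        text += chunk  # O(len(text)) copy per iteration in the worst case
    return text

# Linear: append to a list, join once at the end
def concat_linear(chunks):
    parts = []
    for chunk in chunks:
        parts.append(chunk)  # amortized O(1)
    return ''.join(parts)    # single O(n) pass
```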

After this change, the time it took to convert a 20.8MB HTML file (US social security tax law document) went from 18.94 minutes to 4.11 minutes, a speed increase of 4.6x.

# Before change
Converting title_42—the_public_health_and_welfare::chapter_7—social_security.html to markdown...
Writing title_42—the_public_health_and_welfare::chapter_7—social_security.md...
Time taken: 1136.8312180042267 seconds

# After change
Converting title_42—the_public_health_and_welfare::chapter_7—social_security.html to markdown...
Writing title_42—the_public_health_and_welfare::chapter_7—social_security.md...
Time taken: 246.62110209465027 seconds

All tox tests are also passing.

I also added `.python-version` to the `.gitignore` file (I was using pyenv).

@AlextheYounga changed the title from "feat: improve parsing performance from o(n^2) to o(n)" to "perf: improve parsing performance from o(n^2) to o(n)" Dec 17, 2024

AlextheYounga commented Dec 18, 2024

I also added a few early checks when iterating over node children to prevent unnecessary function calls. I won't make any more changes to this branch in the interest of limiting complexity.

for i, el in enumerate(children):
    # Quick type check first to avoid unnecessary function calls
    if not isinstance(el, NavigableString):
        continue

    # Skip nodes whose text is not entirely whitespace
    text = six.text_type(el)
    if text.strip():
        continue

    # Determine if we can extract based on position and adjacency
    can_extract = (
        (should_remove_inside and (i == 0 or i == len(children) - 1))
        or (i > 0 and should_remove_whitespace_outside(children[i - 1]))
        or (i < len(children) - 1
            and should_remove_whitespace_outside(children[i + 1]))
    )

    # Extract if conditions are met
    if can_extract:
        el.extract()


AlextheYounga commented Dec 20, 2024

@matthewwithanm I just want to make sure you saw this; I think you might like it. Let me know if you'd like me to make any changes.

Thanks for creating this library 🫡


matthewwithanm commented Dec 21, 2024

Yes! I haven't looked at the code but it sounds great! Thank you!

@AlexVonB has been managing the repo for a while though. He'll get around to it eventually but it's a busy time of year 🙂

AlexVonB (Collaborator) commented:

Hey all, I might have time during the holidays to check in on this. At first glance it looks good, but I would like to take the time to completely understand every change. I wish you quiet holidays :)

@AlexVonB AlexVonB self-assigned this Dec 23, 2024
chrispy-snps (Collaborator) commented:

@AlextheYounga - thanks for this pull request! I also get an impressive runtime reduction (379 seconds to 86 seconds on one of our larger HTML documents).

When I diff the new output against the old output, I see extra newlines. It would be good for the output to diff identically. You can use the following HTML to reproduce the issue:

<article>
    <h1>Heading 1</h1>
    <div>
        <p>article body</p>
    </div>
    <article>
        <h2>Heading 2</h2>
        <div>
            <p>article body</p>
        </div>
    </article>
    <p>footnote</p>
</article>

AlextheYounga (Author) commented:

@chrispy-snps Thanks for catching this. If it's adding newlines, that is no bueno. I will see if I can correct this and perhaps add an extra test to catch this.

for el in node.children:
    if isinstance(el, Comment) or isinstance(el, Doctype):
        continue
    elif isinstance(el, NavigableString):
-       text += self.process_text(el)
+       text_parts.append(self.process_text(el))
    else:

@AlextheYounga AlextheYounga Jan 8, 2025


Just an update here that I am still working on this. I have added a test and improved the readability of the code so as to better identify the problem.

I tried to copy over as much as I could from the original code, but I think some of it doesn't translate well to this new loop paradigm. In particular, I am fairly sure the value computed for `newlines_left` has changed.

In the original function we are grabbing a number of line breaks from the full text and using this to create newlines_left:
https://github.com/matthewwithanm/python-markdownify/blob/6258f5c38b97ab443b4ddf03e6676ce29b392d06/markdownify/__init__.py#L164C1-L166C60

Original:

else:
    text_strip = text.rstrip('\n')
    newlines_left = len(text) - len(text_strip)

However, in this new structure we are trying to avoid concatenating a string on every iteration (the source of the quadratic behavior), and that comes with its own tradeoffs: it is no longer straightforward to access the full accumulated text. So now we only count the newlines in the last chunk of the list, which may not be the same number as before. I feel like there is a solution here; I am still working the problem.

New:

else:
    text_strip = ''
    newlines_left = 0
    if text_parts:
        last_chunk = text_parts.pop()
        text_strip = last_chunk.rstrip('\n')
        newlines_left = len(last_chunk) - len(text_strip)
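A tiny illustration of why the chunk-based count can differ (hypothetical chunks, not actual markdownify output): when the trailing newlines are spread across more than one chunk, looking only at the last chunk undercounts them:

```python
chunks = ['body\n', '\n']  # the joined text is 'body\n\n'

# original approach: count trailing newlines over the full string
full = ''.join(chunks)
newlines_full = len(full) - len(full.rstrip('\n'))  # counts both newlines

# chunk-based approach: count only within the last chunk
last = chunks[-1]
newlines_last = len(last) - len(last.rstrip('\n'))  # sees only one
```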


@chrispy-snps chrispy-snps Jan 8, 2025


@AlextheYounga - the list-based approach actually brings a lot of power with it. We know that the internal contents of each list item is fixed and completed, but there are opportunities to normalize/collapse whitespace (for inline content) and newlines (for block content) at the leading/trailing ends of the list items. (Does convert_children_as_newline determine this?)

For example, we could post-process the list before joining by looking at the end of the previous item and the beginning of the next. For instance, trailing+leading sequences of more than 2 newlines could be limited to 2. Maybe the list items could be expanded into (leading-ws, content, trailing-ws) tuples to make this easier. I'm not sure, but you are right that this is a different paradigm where the logic of the old code might not apply, and that could very well be a good thing.
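One possible sketch of that post-processing idea, assuming plain string list items (the function name and `max_newlines` limit are hypothetical, not part of markdownify):

```python
def collapse_newlines_between(parts, max_newlines=2):
    """Limit runs of newlines that straddle the boundary between
    adjacent list items to at most max_newlines (sketch only)."""
    out = []
    for part in parts:
        if out:
            trailing = len(out[-1]) - len(out[-1].rstrip('\n'))
            leading = len(part) - len(part.lstrip('\n'))
            excess = trailing + leading - max_newlines
            if excess > 0:
                # trim from the leading side of the new item first
                trim = min(excess, leading)
                part = part[trim:]
                excess -= trim
                if excess > 0:
                    # then trim from the trailing side of the previous item
                    out[-1] = out[-1][:len(out[-1]) - excess]
        out.append(part)
    return out
```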


@chrispy-snps chrispy-snps Jan 26, 2025


@AlextheYounga - if having another pair of eyes on the code would help, push your latest code to your branch, and I'll clone it and have a look. It would be great to get this enhancement into the next release of Markdownify.

Edit: I merged the upstream develop branch into your pull request branch and resolved the conflicts.

AlextheYounga (Author) commented:

Apologies, I have been very busy recently. Honestly another pair of eyes here would help, but I am looking into this again now.

@AlextheYounga AlextheYounga force-pushed the develop branch 2 times, most recently from 5dd8301 to be96173 Compare January 29, 2025 07:20
Commit message: "The function process_tag was previously concatenating strings inside of a loop. Each + operation creates a new string, resulting in repeated copying of already accumulated data. By replacing this with a list we append to, and then joining the list at the end, we take this from a quadratic function to a linear function."

AlextheYounga commented Jan 29, 2025

I think I'm closer, but I don't think the behavior is identical yet.

Using the list-based approach, there may be times when the last chunk is composed entirely of newline characters, and this throws off the newlines calculation completely. I gave up trying to find a clever way to reuse the code we already had and instead wrote a new function to deal with these newline chunks (ChatGPT helped here).

This new function `pop_trailing_newlines()` resolves the issues we were seeing in the example HTML and passes the newline test I created, but when I ran both current develop and my branch on the US Internal Revenue Code (34MB), there is a discrepancy of about 100 lines (although that's not even a drop in the bucket for the IRC). We're also sacrificing some performance here: down from a 4.6x boost to a 2.2x boost after this change. :(

There may also be ways to refactor this function. Not a trivial problem, for me at least.

    def pop_trailing_newlines(self, text_parts):
        newlines_total = 0
        # 1) First pop off any chunks that are newlines
        newline_chunks = []
        while text_parts and text_parts[-1].rstrip('\n') == '':
            newline_chunks.append(text_parts.pop())
        if newline_chunks:
            # Example: if we had ["\n", "\n", "\n"] at the very end
            all_newlines = ''.join(reversed(newline_chunks))
            newlines_total += len(all_newlines)

        # 2) Now look at one more chunk which might have some real text + trailing newlines
        text_without_newline = ''
        if text_parts:
            last_chunk = text_parts.pop()
            stripped = last_chunk.rstrip('\n')
            newlines_total += (len(last_chunk) - len(stripped))
            text_without_newline = stripped

        return (text_without_newline, newlines_total)
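As a quick sanity check of the idea, both chunkings of the same trailing text should yield the same result (the method above detached from its class for illustration, with `self` dropped):

```python
def pop_trailing_newlines(text_parts):
    # Standalone copy of the method above, self removed.
    newlines_total = 0
    # 1) pop off any chunks that are entirely newlines
    newline_chunks = []
    while text_parts and text_parts[-1].rstrip('\n') == '':
        newline_chunks.append(text_parts.pop())
    if newline_chunks:
        newlines_total += len(''.join(newline_chunks))
    # 2) one more chunk may have real text plus trailing newlines
    text_without_newline = ''
    if text_parts:
        last_chunk = text_parts.pop()
        stripped = last_chunk.rstrip('\n')
        newlines_total += len(last_chunk) - len(stripped)
        text_without_newline = stripped
    return (text_without_newline, newlines_total)

# two chunkings of the same text 'body\n\n'
a = pop_trailing_newlines(['body', '\n', '\n'])  # newlines in separate chunks
b = pop_trailing_newlines(['body\n\n'])          # newlines attached to the text
# both return ('body', 2)
```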

chrispy-snps (Collaborator) commented:

@AlextheYounga - where can I find the test HTML document you are using?

chrispy-snps (Collaborator) commented:

@AlextheYounga - I have some code that retains the 4x performance benefit and also partially fixes #185. However, it's based on some of my other recent pull requests so I need to wait for those to land.

I don't want to open a separate pull request for my code (you deserve the credit for this approach!), so I suggest that (1) we wait for the pending open pull requests to be merged, (2) I open a pull request on your branch so we can review and sync up, (3) we merge it into your branch, and (4) your branch is reviewed and merged into the project.

What do you think?

4 participants