
perf: improve parsing performance from O(n^2) to O(n) #162

Open · wants to merge 3 commits into develop
Conversation

@AlextheYounga AlextheYounga commented Dec 17, 2024

The function `process_tag` was previously concatenating strings inside a loop. Each `+` operation creates a new string, so the already-accumulated data is copied again on every iteration. This logic is of complexity O(n²).

We can take this from quadratic to linear by appending chunks to a list and returning a single `''.join()` at the end. This logic is of complexity O(n).
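The pattern being replaced can be sketched as follows (a minimal illustration, not the actual markdownify code; note that CPython can sometimes optimize `+=` in place, but that behavior is not guaranteed):

```python
# Quadratic: each += may copy the entire accumulated string
def concat_quadratic(chunks):
    text = ''
    for chunk in chunks:
        text += chunk  # O(len(text)) copy per iteration in the worst case
    return text

# Linear: append to a list, join once at the end
def concat_linear(chunks):
    parts = []
    for chunk in chunks:
        parts.append(chunk)  # amortized O(1)
    return ''.join(parts)    # single O(n) pass
```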

After this change, the time it took to convert a 20.8MB HTML file (US social security tax law document) went from 18.94 minutes to 4.11 minutes, a speed increase of 4.6x.

# Before change
Converting title_42—the_public_health_and_welfare::chapter_7—social_security.html to markdown...
Writing title_42—the_public_health_and_welfare::chapter_7—social_security.md...
Time taken: 1136.8312180042267 seconds

# After change
Converting title_42—the_public_health_and_welfare::chapter_7—social_security.html to markdown...
Writing title_42—the_public_health_and_welfare::chapter_7—social_security.md...
Time taken: 246.62110209465027 seconds

All tox tests are also passing.

I also added `.python-version` to the `.gitignore` file (I was using pyenv).

@AlextheYounga changed the title from "feat: improve parsing performance from o(n^2) to o(n)" to "perf: improve parsing performance from o(n^2) to o(n)" Dec 17, 2024

AlextheYounga commented Dec 18, 2024

I also added a few early checks when iterating over node children to prevent unnecessary function calls. I won't make any more changes to this branch in the interest of limiting complexity.

for i, el in enumerate(children):
    # Quick type check first to avoid unnecessary function calls
    if not isinstance(el, NavigableString):
        continue

    # Skip nodes whose text is not entirely whitespace
    text = six.text_type(el)
    if text.strip():
        continue

    # Determine if we can extract based on position and adjacency
    can_extract = (
        (should_remove_inside and (i == 0 or i == len(children) - 1))
        or (i > 0 and should_remove_whitespace_outside(children[i - 1]))
        or (i < len(children) - 1
            and should_remove_whitespace_outside(children[i + 1]))
    )

    # Extract if conditions are met
    if can_extract:
        el.extract()


AlextheYounga commented Dec 20, 2024

@matthewwithanm I just want to make sure you saw this; I think you might like it. Let me know if you'd like me to make any changes.

Thanks for creating this library 🫡


matthewwithanm commented Dec 21, 2024

Yes! I haven't looked at the code but it sounds great! Thank you!

@AlexVonB has been managing the repo for a while though. He'll get around to it eventually but it's a busy time of year 🙂

AlexVonB (Collaborator) commented:

Hey all, I might have time during the holidays to check in on this. At first glance it looks good, but I would like to take the time to completely understand every change. I wish you quiet holidays :)

@AlexVonB AlexVonB self-assigned this Dec 23, 2024
chrispy-snps (Collaborator) commented:

@AlextheYounga - thanks for this pull request! I also get an impressive runtime reduction (379 seconds to 86 seconds on one of our larger HTML documents).

When I diff the new output against the old output, I see extra newlines. It would be good for the output to diff identically. You can use the following HTML to reproduce the issue:

<article>
    <h1>Heading 1</h1>
    <div>
        <p>article body</p>
    </div>
    <article>
        <h2>Heading 2</h2>
        <div>
            <p>article body</p>
        </div>
    </article>
    <p>footnote</p>
</article>

AlextheYounga (Author) commented:

@chrispy-snps Thanks for catching this. If it's adding newlines, that is no bueno. I will see if I can correct this and perhaps add an extra test to catch this.

for el in node.children:
    if isinstance(el, Comment) or isinstance(el, Doctype):
        continue
    elif isinstance(el, NavigableString):
-       text += self.process_text(el)
+       text_parts.append(self.process_text(el))
    else:

@AlextheYounga AlextheYounga Jan 8, 2025


Just an update here that I am still working on this. I have added a test and improved the readability of the code so as to better identify the problem.

I tried to copy over as much as I could from the original code, but I think some of it doesn't translate well to this new loop paradigm. In particular, I am fairly sure the value computed for `newlines_left` has changed.

In the original function we are grabbing a number of line breaks from the full text and using this to create newlines_left:
https://github.com/matthewwithanm/python-markdownify/blob/6258f5c38b97ab443b4ddf03e6676ce29b392d06/markdownify/__init__.py#L164C1-L166C60

Original:

else:
    text_strip = text.rstrip('\n')
    newlines_left = len(text) - len(text_strip)

However, in this new structure we are trying to avoid concatenating a string on every iteration (the source of the quadratic behavior), and that comes with its own tradeoffs: it is no longer straightforward to access the full accumulated text. So now we only count the newlines in the last chunk of the list, which may not be the same number as before. I feel like there is a solution here; I am still working the problem.

New:

else:
    text_strip = ''
    newlines_left = 0
    if text_parts:
        last_chunk = text_parts.pop()
        text_strip = last_chunk.rstrip('\n')
        newlines_left = len(last_chunk) - len(text_strip)
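A tiny illustration of why the chunk-based count can differ (hypothetical chunks, not actual markdownify output): when the trailing newlines are spread across more than one chunk, looking only at the last chunk undercounts them:

```python
chunks = ['body\n', '\n']  # the joined text is 'body\n\n'

# original approach: count trailing newlines over the full string
full = ''.join(chunks)
newlines_full = len(full) - len(full.rstrip('\n'))  # counts both newlines

# chunk-based approach: count only within the last chunk
last = chunks[-1]
newlines_last = len(last) - len(last.rstrip('\n'))  # sees only one
```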


@chrispy-snps chrispy-snps Jan 8, 2025


@AlextheYounga - the list-based approach actually brings a lot of power with it. We know that the internal contents of each list item is fixed and completed, but there are opportunities to normalize/collapse whitespace (for inline content) and newlines (for block content) at the leading/trailing ends of the list items. (Does convert_children_as_newline determine this?)

For example, we could post-process the list before joining by looking at the end of the previous item and the beginning of the next. For instance, trailing+leading sequences of more than 2 newlines could be limited to 2. Maybe the list items could be expanded into (leading-ws, content, trailing-ws) tuples to make this easier. I'm not sure, but you are right that this is a different paradigm where the logic of the old code might not apply, and that could very well be a good thing.
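One possible sketch of that post-processing idea, assuming plain string list items (the function name and `max_newlines` limit are hypothetical, not part of markdownify):

```python
def collapse_newlines_between(parts, max_newlines=2):
    """Limit runs of newlines that straddle the boundary between
    adjacent list items to at most max_newlines (sketch only)."""
    out = []
    for part in parts:
        if out:
            trailing = len(out[-1]) - len(out[-1].rstrip('\n'))
            leading = len(part) - len(part.lstrip('\n'))
            excess = trailing + leading - max_newlines
            if excess > 0:
                # trim from the leading side of the new item first
                trim = min(excess, leading)
                part = part[trim:]
                excess -= trim
                if excess > 0:
                    # then trim from the trailing side of the previous item
                    out[-1] = out[-1][:len(out[-1]) - excess]
        out.append(part)
    return out
```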


@chrispy-snps chrispy-snps Jan 26, 2025


@AlextheYounga - if having another pair of eyes on the code would help, push your latest code to your branch, and I'll clone it and have a look. It would be great to get this enhancement into the next release of Markdownify.

Edit: I merged the upstream develop branch into your pull request branch and resolved the conflicts.

AlextheYounga (Author) commented:

Apologies, I have been very busy recently. Honestly another pair of eyes here would help, but I am looking into this again now.

@AlextheYounga AlextheYounga force-pushed the develop branch 2 times, most recently from 5dd8301 to be96173 Compare January 29, 2025 07:20
Commit message: "The function process_tag was previously concatenating strings inside of a loop. Each + operation creates a new string, resulting in repeated copying of already accumulated data. By replacing this with a list we append to, and then joining the list at the end, we take this from a quadratic function to a linear function."

AlextheYounga commented Jan 29, 2025

I think I'm closer, but I don't think the behavior is identical yet.

Using the list-based approach, there may be times when the last chunk is composed entirely of newline characters, and this throws off the newlines calculation completely. I gave up trying to find a clever way to reuse the code we already had and instead wrote a new function to deal with these newline chunks (ChatGPT helped here).

This new function `pop_trailing_newlines()` resolves the issues we were seeing in the example HTML and passes the newline test I created, but when I ran both current develop and my branch on the US Internal Revenue Code (34MB), there is a discrepancy of about 100 lines (although that's not even a drop in the bucket for the IRC). We're also sacrificing some performance here: down from a 4.6x boost to a 2.2x boost after this change. :(

There may also be ways to refactor this function. Not a trivial problem, for me at least.

    def pop_trailing_newlines(self, text_parts):
        newlines_total = 0
        # 1) First pop off any chunks that are newlines
        newline_chunks = []
        while text_parts and text_parts[-1].rstrip('\n') == '':
            newline_chunks.append(text_parts.pop())
        if newline_chunks:
            # Example: if we had ["\n", "\n", "\n"] at the very end
            all_newlines = ''.join(reversed(newline_chunks))
            newlines_total += len(all_newlines)

        # 2) Now look at one more chunk which might have some real text + trailing newlines
        text_without_newline = ''
        if text_parts:
            last_chunk = text_parts.pop()
            stripped = last_chunk.rstrip('\n')
            newlines_total += (len(last_chunk) - len(stripped))
            text_without_newline = stripped

        return (text_without_newline, newlines_total)
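As a quick sanity check of the idea, both chunkings of the same trailing text should yield the same result (the method above detached from its class for illustration, with `self` dropped):

```python
def pop_trailing_newlines(text_parts):
    # Standalone copy of the method above, self removed.
    newlines_total = 0
    # 1) pop off any chunks that are entirely newlines
    newline_chunks = []
    while text_parts and text_parts[-1].rstrip('\n') == '':
        newline_chunks.append(text_parts.pop())
    if newline_chunks:
        newlines_total += len(''.join(newline_chunks))
    # 2) one more chunk may have real text plus trailing newlines
    text_without_newline = ''
    if text_parts:
        last_chunk = text_parts.pop()
        stripped = last_chunk.rstrip('\n')
        newlines_total += len(last_chunk) - len(stripped)
        text_without_newline = stripped
    return (text_without_newline, newlines_total)

# two chunkings of the same text 'body\n\n'
a = pop_trailing_newlines(['body', '\n', '\n'])  # newlines in separate chunks
b = pop_trailing_newlines(['body\n\n'])          # newlines attached to the text
# both return ('body', 2)
```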

chrispy-snps (Collaborator) commented:

@AlextheYounga - where can I find the test HTML document you are using?

chrispy-snps (Collaborator) commented:

@AlextheYounga - I have some code that retains the 4x performance benefit and also partially fixes #185. However, it's based on some of my other recent pull requests so I need to wait for those to land.

I don't want to open a separate pull request for my code (you deserve the credit for this approach!), so I suggest that (1) we wait for the pending open pull requests to be merged, (2) I open a pull request on your branch so we can review and sync up, (3) we merge it into your branch, and (4) your branch is reviewed and merged into the project.

What do you think?

4 participants