Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Suboptimal performance when a large folder is listed in .markdownlintignore #108

Open
kachkaev opened this issue Jul 18, 2020 · 17 comments
Open

Comments

@kachkaev
Copy link

I have a project with a local data folder, which is listed in all .*ignore files. When this temp folder gets quite large (more than 10K files), Markdownlint seems to become much slower than other linting tools (Prettier and ESLint). This suggests that the ignore logic could be improved.

Here are the reproduction steps for macOS Catalina and zsh. I only compare Markdownlint with Prettier here for simplicity, but according to my observations, ESLint also remains quite performant.

  1. Init a project with README.md , Prettier, Markdownlint and identical .prettierignore + .markdownlintignore files.

    mkdir /tmp/slow-markdownlint-repro && cd "$_"
    
    yarn init --yes
    yarn add markdownlint-cli prettier
    
    cat <<EOF >README.md
    # Hello world
    
    test document
    EOF
    
    cat <<EOF >.prettierignore
    *.*
    !*.md
    
    my-big-temp-folder
    node_modules
    EOF
    
    cp .prettierignore .markdownlintignore
  2. Run Prettier and Markdownlint and observe reasonable performance.

    yarn prettier --check "**/*"
    ## a couple of secs
    
    yarn markdownlint "**/*"
    ## a couple of secs
  3. Create a large folder, which is already listed in the ignore files.

    cp -r $(yarn cache dir) my-big-temp-folder
    
    du -sh my-big-temp-folder
    ## 4.2G in my case 👀
  4. Run Prettier and Markdownlint again.

    yarn prettier --check "**/*"
    ## still a couple of secs
    
    yarn markdownlint "**/*"
    ## 115 secs in my case 👈
  5. Delete the project to release disk space.

    cd ~
    rm -rf /tmp/slow-markdownlint-repro

It feels like Markdownlint CLI still lists all the files in the ignored folder and then matches each path against the ignore list, while Prettier does not even look into my-big-temp-folder.

I don’t know Markdownlint internals well enough to submit a PR, but I hope that my repro steps may help with the investigation 🙌

I believe that having large local data / var / temp folders is not that rare in software development, so I won’t be the only person who would benefit from performance improvements 🙂

@DavidAnson
Copy link
Collaborator

This is great detail, thank you very much! I‘be been working on an experiment that I’d like you to try in a week or so that may help with this. I’ll follow up here when it’s ready.

@DavidAnson
Copy link
Collaborator

I've published something new: https://github.com/DavidAnson/markdownlint-cli2

It's a slightly different take on a CLI, please see the README for context.

For your purposes, something like the following should be equivalent to what you have today. But faster (according to my own limited testing).

Could you please give it a try and let me know what you find?

.markdownlintignore:

my-big-temp-folder
node_modules

Command-line:

markdownlint-cli2 "**/*.md"

Aside: The pattern you show of including all files and then excluding Markdown files in the ignore file is not supported by CLI2 – and kind of weird anyway. I wouldn't guess it's doing much to help with performance, though I agree that in the abstract it is equivalent to what I show above.

@DavidAnson
Copy link
Collaborator

By the way, you could put the two ignore paths on the command-line, but I was trying to match your current configuration as closely as possible.

@kachkaev
Copy link
Author

kachkaev commented Jul 21, 2020

Hey David, thanks for the follow up! I’ll try testing the new CLI performance over the weekend, fingers crossed!

I understand that the .markdownlintignore approach I’ve shared may look a bit odd, but it has its reasons. Excluding all files and then allowing certain extensions helps configure several linting tools consistenly. For me, this includes ESLint, Prettier and also Markdownlint since recently. By moving all include patterns into the corresponding ignore file, I can forget about providing the right arguments to CLI and also not worry about editor integration. This reduces repetition of complex globs and thus the probability of human errors.

If for Markdownlint the rule is simply to include *.md(x) files, for other tools the CLI glob would be way more sophisticated. Support for multiple extensions is just one source of complexity, we also need to account for build folders, git-ignored paths and so on. In package.json, there are linting and fixing scripts as well as pre-commit hooks – all three require some kind of globs. Besides, there is an IDE, which would need its own tweaking if the 'source of truth' for ignoring and including files is in CLI args and not in a corresponding .*ignore file.

See this project as an example: https://github.com/kachkaev/njt. It is quite small so does not suffer from CLI performance issues, but it has a full-blown config for the linters. Hope it helps illustrate the coherence being achieved with the described approach to the .*ignore files.

@DavidAnson
Copy link
Collaborator

DavidAnson commented Jul 21, 2020

That reasoning makes sense, but interferes with what I’m trying to do which is represent all of the positive and negative pattern matching on the command line so it can be passed into the glob tool for best performance. As you suggested above, the slow implementation of ignore is to apply the ignore file rules with a filter after using the globs to search the disk. I looked briefly at ESLint and it seems that’s what they do, though they apply some default ignore rules at glob time to filter out, for example, the node_modules directory which is probably why it’s not also slow. Though I didn’t analyze it in depth, allowing negative patterns in the ignore file didn’t seem to work with this approach because they are supposed to be scoped to just the ignore operation (“!*.md” in an ignore file does not mean the same as “*.md” on the command line). I may be wrong about this or you may point out a way to handle it better. I’d be happy to consider either! But for now, this compromise seemed to have a fairly minor effect on users but a potentially significant effect on performance and so that’s what I’ve gone with.

@DavidAnson
Copy link
Collaborator

FYI, I've just published version 0.0.3 of markdownlint-cli2. It now supports everything markdownlint-cli does (though differently). The globbing behavior discussed above is not changed vs. version 0.0.2 - did you get a chance to try that out for your scenario?

@borekb
Copy link

borekb commented Aug 9, 2020

I've got a .markdownlintignore file with this contents:

node_modules
/prototypes

I've got 35 *.md files and running plain markdownlint in the folder takes about half a minute.

I tried constructing the command differently, not depending on .markdownlintignore, and am betting sub-second time:

npx markdownlint -i prototypes $(fd -g "*.md" -E "prototypes" . | tr '\n' ' ')

Is this the same issue reported here?

@DavidAnson
Copy link
Collaborator

@borekb Yes, seems like the same thing. However, both approaches should be fast in markdownlint-cli2. If you confirm or refute that, please let me know.

@borekb
Copy link

borekb commented Aug 9, 2020

Thanks, I'll try to experiment with it sometimes soon.

@borekb
Copy link

borekb commented Aug 13, 2020

@DavidAnson I can confirm that cli2 doesn't have this problem 🎉.

@kachkaev
Copy link
Author

I ran my reproduction steps, but replace markdownlint with markdownlint-cli2. Running yarn markdownlint-cli2 "**/*" took less than a second instead of >100 seconds! 🎉 Many thanks for your work @DavidAnson, v2 looks promising! 🚀

@adamkudrna
Copy link

adamkudrna commented Aug 21, 2020

I can also confirm markdownlint-cli2 is way faster than markdownlint, great job @DavidAnson! ❤️

So what will happen to the original CLI now? Are you going to deprecate it? Also calling the new CLI via markdownlint-cli2 seems a bit unusual compared to other npm scripts…

Edit: I now realized the original CLI is from different author, sorry 🤦.

@DavidAnson
Copy link
Collaborator

@adamkudrna Thanks for confirming that! My motivation for doing CLI2 was that there were a number of features/improvements that I didn't see a clean way of adding to CLI1. I will continue updating the markdownlint dependency and keep the tests passing and review pull requests for CLI1, but will probably dedicate most of my time on features/requests to CLI2.

@kachkaev
Copy link
Author

kachkaev commented Aug 22, 2020

Just tried [email protected] in a real project in which I experienced slow performance with [email protected]. Seeing:

RangeError: Maximum call stack size exceeded
    at main (/path/to/project/node_modules/markdownlint-cli2/markdownlint-cli2.js:179:30)
    at async /path/to/project/node_modules/markdownlint-cli2/markdownlint-cli2.js:356:9
error Command failed with exit code 2.

The stack points to this line: dirInfo.parent.files.push(...dirInfo.files);. Happy to help with the investigation if there are ideas or new versions to try out 🙂

@DavidAnson
Copy link
Collaborator

Great, thank you for letting me know! We should probably move this investigation to an issue in the repository for CLI2. Could you please open something there? Looking at the info above, that’s not a particularly deep call stack, so I’m guessing this is more to do with Promise chains? Also, it might help to know which repository you are seeing this problem with. If you aren’t comfortable sharing that publicly, we can work something out.

@DavidAnson
Copy link
Collaborator

Quick update, I don’t need an example to reproduce this, I think I figured out what’s going on by inspection. (The message is a little bit misleading.)

@DavidAnson
Copy link
Collaborator

@kachkaev, that RangeError should be fixed in markdownlint-cli2 version 0.0.6, thanks!

Some info about the bug: https://twitter.com/DavidAns/status/1297361113386872833

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants