Suboptimal performance when a large folder is listed in .markdownlintignore #108

kachkaev · 2020-07-18T17:57:03Z

I have a project with a local data folder, which is listed in all .*ignore files. When this temp folder gets quite large (more than 10K files), Markdownlint seems to become much slower than other linting tools (Prettier and ESLint). This suggests that the ignore logic could be improved.

Here are the reproduction steps for macOS Catalina and zsh. I only compare Markdownlint with Prettier here for simplicity, but according to my observations, ESLint also remains quite performant.

Init a project with README.md , Prettier, Markdownlint and identical .prettierignore + .markdownlintignore files.

mkdir /tmp/slow-markdownlint-repro && cd "$_"

yarn init --yes
yarn add markdownlint-cli prettier

cat <<EOF >README.md
# Hello world

test document
EOF

cat <<EOF >.prettierignore
*.*
!*.md

my-big-temp-folder
node_modules
EOF

cp .prettierignore .markdownlintignore

Run Prettier and Markdownlint and observe reasonable performance.

yarn prettier --check "**/*"
## a couple of secs

yarn markdownlint "**/*"
## a couple of secs

Create a large folder, which is already listed in the ignore files.

cp -r $(yarn cache dir) my-big-temp-folder

du -sh my-big-temp-folder
## 4.2G in my case 👀

Run Prettier and Markdownlint again.

yarn prettier --check "**/*"
## still a couple of secs

yarn markdownlint "**/*"
## 115 secs in my case 👈

Delete the project to release disk space.

cd ~
rm -rf /tmp/slow-markdownlint-repro

It feels like Markdownlint CLI still lists all the files in the ignored folder and then matches each path against the ignore list, while Prettier does not even look into my-big-temp-folder.

I don’t know Markdownlint internals well enough to submit a PR, but I hope that my repro steps may help with the investigation 🙌

I believe that having large local data / var / temp folders is not that rare in software development, so I won’t be the only person who would benefit from performance improvements 🙂

The text was updated successfully, but these errors were encountered:

DavidAnson · 2020-07-18T18:46:26Z

This is great detail, thank you very much! I‘be been working on an experiment that I’d like you to try in a week or so that may help with this. I’ll follow up here when it’s ready.

DavidAnson · 2020-07-21T01:59:53Z

I've published something new: https://github.com/DavidAnson/markdownlint-cli2

It's a slightly different take on a CLI, please see the README for context.

For your purposes, something like the following should be equivalent to what you have today. But faster (according to my own limited testing).

Could you please give it a try and let me know what you find?

.markdownlintignore:

my-big-temp-folder
node_modules

Command-line:

markdownlint-cli2 "**/*.md"

Aside: The pattern you show of including all files and then excluding Markdown files in the ignore file is not supported by CLI2 – and kind of weird anyway. I wouldn't guess it's doing much to help with performance, though I agree that in the abstract it is equivalent to what I show above.

DavidAnson · 2020-07-21T02:02:43Z

By the way, you could put the two ignore paths on the command-line, but I was trying to match your current configuration as closely as possible.

kachkaev · 2020-07-21T12:37:57Z

Hey David, thanks for the follow up! I’ll try testing the new CLI performance over the weekend, fingers crossed!

I understand that the .markdownlintignore approach I’ve shared may look a bit odd, but it has its reasons. Excluding all files and then allowing certain extensions helps configure several linting tools consistenly. For me, this includes ESLint, Prettier and also Markdownlint since recently. By moving all include patterns into the corresponding ignore file, I can forget about providing the right arguments to CLI and also not worry about editor integration. This reduces repetition of complex globs and thus the probability of human errors.

If for Markdownlint the rule is simply to include *.md(x) files, for other tools the CLI glob would be way more sophisticated. Support for multiple extensions is just one source of complexity, we also need to account for build folders, git-ignored paths and so on. In package.json, there are linting and fixing scripts as well as pre-commit hooks – all three require some kind of globs. Besides, there is an IDE, which would need its own tweaking if the 'source of truth' for ignoring and including files is in CLI args and not in a corresponding .*ignore file.

See this project as an example: https://github.com/kachkaev/njt. It is quite small so does not suffer from CLI performance issues, but it has a full-blown config for the linters. Hope it helps illustrate the coherence being achieved with the described approach to the .*ignore files.

DavidAnson · 2020-07-21T15:52:59Z

That reasoning makes sense, but interferes with what I’m trying to do which is represent all of the positive and negative pattern matching on the command line so it can be passed into the glob tool for best performance. As you suggested above, the slow implementation of ignore is to apply the ignore file rules with a filter after using the globs to search the disk. I looked briefly at ESLint and it seems that’s what they do, though they apply some default ignore rules at glob time to filter out, for example, the node_modules directory which is probably why it’s not also slow. Though I didn’t analyze it in depth, allowing negative patterns in the ignore file didn’t seem to work with this approach because they are supposed to be scoped to just the ignore operation (“!*.md” in an ignore file does not mean the same as “*.md” on the command line). I may be wrong about this or you may point out a way to handle it better. I’d be happy to consider either! But for now, this compromise seemed to have a fairly minor effect on users but a potentially significant effect on performance and so that’s what I’ve gone with.

DavidAnson · 2020-07-31T21:24:42Z

FYI, I've just published version 0.0.3 of markdownlint-cli2. It now supports everything markdownlint-cli does (though differently). The globbing behavior discussed above is not changed vs. version 0.0.2 - did you get a chance to try that out for your scenario?

borekb · 2020-08-09T18:58:34Z

I've got a .markdownlintignore file with this contents:

node_modules
/prototypes

I've got 35 *.md files and running plain markdownlint in the folder takes about half a minute.

I tried constructing the command differently, not depending on .markdownlintignore, and am betting sub-second time:

npx markdownlint -i prototypes $(fd -g "*.md" -E "prototypes" . | tr '\n' ' ')

Is this the same issue reported here?

DavidAnson · 2020-08-09T19:13:32Z

@borekb Yes, seems like the same thing. However, both approaches should be fast in markdownlint-cli2. If you confirm or refute that, please let me know.

borekb · 2020-08-09T19:20:55Z

Thanks, I'll try to experiment with it sometimes soon.

borekb · 2020-08-13T15:45:20Z

@DavidAnson I can confirm that cli2 doesn't have this problem 🎉.

kachkaev · 2020-08-16T14:01:18Z

I ran my reproduction steps, but replace markdownlint with markdownlint-cli2. Running yarn markdownlint-cli2 "**/*" took less than a second instead of >100 seconds! 🎉 Many thanks for your work @DavidAnson, v2 looks promising! 🚀

adamkudrna · 2020-08-21T16:06:11Z

I can also confirm markdownlint-cli2 is way faster than markdownlint, great job @DavidAnson! ❤️

~~So what will happen to the original CLI now? Are you going to deprecate it? Also calling the new CLI via markdownlint-cli2 seems a bit unusual compared to other npm scripts…~~

Edit: I now realized the original CLI is from different author, sorry 🤦.

DavidAnson · 2020-08-21T16:16:54Z

@adamkudrna Thanks for confirming that! My motivation for doing CLI2 was that there were a number of features/improvements that I didn't see a clean way of adding to CLI1. I will continue updating the markdownlint dependency and keep the tests passing and review pull requests for CLI1, but will probably dedicate most of my time on features/requests to CLI2.

kachkaev · 2020-08-22T12:49:41Z

Just tried [email protected] in a real project in which I experienced slow performance with [email protected]. Seeing:

RangeError: Maximum call stack size exceeded
    at main (/path/to/project/node_modules/markdownlint-cli2/markdownlint-cli2.js:179:30)
    at async /path/to/project/node_modules/markdownlint-cli2/markdownlint-cli2.js:356:9
error Command failed with exit code 2.

The stack points to this line: dirInfo.parent.files.push(...dirInfo.files);. Happy to help with the investigation if there are ideas or new versions to try out 🙂

DavidAnson · 2020-08-22T16:49:12Z

Great, thank you for letting me know! We should probably move this investigation to an issue in the repository for CLI2. Could you please open something there? Looking at the info above, that’s not a particularly deep call stack, so I’m guessing this is more to do with Promise chains? Also, it might help to know which repository you are seeing this problem with. If you aren’t comfortable sharing that publicly, we can work something out.

DavidAnson · 2020-08-23T02:18:36Z

Quick update, I don’t need an example to reproduce this, I think I figured out what’s going on by inspection. (The message is a little bit misleading.)

DavidAnson · 2020-08-23T22:46:15Z

@kachkaev, that RangeError should be fixed in markdownlint-cli2 version 0.0.6, thanks!

Some info about the bug: https://twitter.com/DavidAns/status/1297361113386872833

DavidAnson added the enhancement label Jul 18, 2020

DavidAnson mentioned this issue Aug 13, 2020

Unintuitive handling of .markdownlintignore files DavidAnson/markdownlint-cli2#1

Closed

kachkaev mentioned this issue Aug 30, 2020

Out of memory crash in 0.0.6 DavidAnson/markdownlint-cli2#4

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Suboptimal performance when a large folder is listed in .markdownlintignore #108

Suboptimal performance when a large folder is listed in .markdownlintignore #108

kachkaev commented Jul 18, 2020

DavidAnson commented Jul 18, 2020

DavidAnson commented Jul 21, 2020

DavidAnson commented Jul 21, 2020

kachkaev commented Jul 21, 2020 •

edited

Loading

DavidAnson commented Jul 21, 2020 •

edited

Loading

DavidAnson commented Jul 31, 2020

borekb commented Aug 9, 2020 •

edited

Loading

DavidAnson commented Aug 9, 2020

borekb commented Aug 9, 2020

borekb commented Aug 13, 2020

kachkaev commented Aug 16, 2020

adamkudrna commented Aug 21, 2020 •

edited

Loading

DavidAnson commented Aug 21, 2020

kachkaev commented Aug 22, 2020 •

edited

Loading

DavidAnson commented Aug 22, 2020

DavidAnson commented Aug 23, 2020

DavidAnson commented Aug 23, 2020

Suboptimal performance when a large folder is listed in .markdownlintignore #108

Suboptimal performance when a large folder is listed in .markdownlintignore #108

Comments

kachkaev commented Jul 18, 2020

DavidAnson commented Jul 18, 2020

DavidAnson commented Jul 21, 2020

DavidAnson commented Jul 21, 2020

kachkaev commented Jul 21, 2020 • edited Loading

DavidAnson commented Jul 21, 2020 • edited Loading

DavidAnson commented Jul 31, 2020

borekb commented Aug 9, 2020 • edited Loading

DavidAnson commented Aug 9, 2020

borekb commented Aug 9, 2020

borekb commented Aug 13, 2020

kachkaev commented Aug 16, 2020

adamkudrna commented Aug 21, 2020 • edited Loading

DavidAnson commented Aug 21, 2020

kachkaev commented Aug 22, 2020 • edited Loading

DavidAnson commented Aug 22, 2020

DavidAnson commented Aug 23, 2020

DavidAnson commented Aug 23, 2020

kachkaev commented Jul 21, 2020 •

edited

Loading

DavidAnson commented Jul 21, 2020 •

edited

Loading

borekb commented Aug 9, 2020 •

edited

Loading

adamkudrna commented Aug 21, 2020 •

edited

Loading

kachkaev commented Aug 22, 2020 •

edited

Loading