feat: split-words linter counterpart to merge-words #885

Open · hippietrail wants to merge 1 commit into master
Conversation

hippietrail (Contributor)

Issues

N/A

Description

Looks for words that are not in the dictionary but which, when split in two, yield two parts that are each in the dictionary.
It starts splitting the word in the middle and expands outwards, rather than starting at the left and moving right, the idea being that missed spaces usually occur near the middle.
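
A minimal sketch of that search order, assuming a helper along these lines (middle_out_splits is an illustrative name, not the PR's actual code):

    /// Yield candidate split points for a word of `len` characters,
    /// starting at the middle and alternating outwards, so the most
    /// likely missed-space positions are tried first.
    fn middle_out_splits(len: usize) -> impl Iterator<Item = usize> {
        let mid = (len + 1) / 2;
        (0..len.saturating_sub(1)).map(move |i| {
            // Visits mid, mid-1, mid+1, mid-2, mid+2, ...
            if i % 2 == 0 { mid + i / 2 } else { mid - (i + 1) / 2 }
        })
    }

For a six-letter word this tries split points 3, 2, 4, 1, 5, in that order.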

The good and the bad of it is that it finds more than a few problems that have more to do with entries that maybe shouldn't be in the dictionary. This can be annoying, but it will help us clean up the dictionary.

But the logic can also be tweaked to avoid common antipatterns as they are identified.

For now, both halves have to be exact matches for dictionary entries: each letter must be in the same case as in the dictionary. There were more false positives before I made matching exact, but now it misses some cases, so this logic/heuristic can also be tweaked.
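
A sketch of that exact-match rule, reusing the middle_out_splits helper from the sketch above; a HashSet stands in for Harper's real dictionary type, whose API differs:

    use std::collections::HashSet;

    /// Return the first split of `word` whose two halves are both
    /// exact, case-sensitive dictionary entries.
    fn find_split(word: &str, dict: &HashSet<String>) -> Option<(String, String)> {
        let chars: Vec<char> = word.chars().collect();
        for at in middle_out_splits(chars.len()) {
            let left: String = chars[..at].iter().collect();
            let right: String = chars[at..].iter().collect();
            // HashSet lookups are case-sensitive, so "Soul" != "soul".
            if dict.contains(&left) && dict.contains(&right) {
                return Some((left, right));
            }
        }
        None
    }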

How Has This Been Tested?

I added a couple of unit tests.
I also had to modify several unrelated tests that this PR breaks, since they assert on the exact number of lints found.

Checklist

  • I have performed a self-review of my own code
  • I have added tests to cover my changes

@elijah-potter (Collaborator) left a comment:

Love the idea here. I want to be mindful of the potential performance implications, though.

@@ -1,14 +1,14 @@
 /** This is a doc comment.
- * Since there are no keywords it _sould_ be checked. */
+ * Since there are no keywords it _shuld_ be checked. */
elijah-potter (Collaborator):

I'm a little confused as to why these must be changed?

hippietrail (Contributor, Author):

> I'm a little confused as to why these must be changed?

I was too. A number of the tests are based on the number of lints found, and some unexpected splits produce pairs of "words" that are both in the dictionary. In this case I think it was "soul" and "d". (Line 19,421: d/~5NXGJ)

I think having both this linter and dictionary comments will help us sort out which dictionary entries really shouldn't be there, or add justifications for entries that don't look like words, explaining why they actually belong.

At the same time, the split_words logic can be tweaked. It was for cases like this that I made it case-sensitive: there are many single-letter uppercase "words", and they are often easy to justify. I also thought of not splitting all the way down to single letters, but that would prevent a whole class of common errors, such as a way/away, from being spotted.

I think what we'll get is false positives that will give us insight into both curating the dictionary and tweaking heuristics in the split_words logic.

Contributor:

I'm surprised to see such changes too.

Once again, I'm jumping into a discussion that I'm trying to follow without being sure of the exact issue this PR is trying to resolve.

So please be kind if my remarks are totally off topic 😅

The way I understood this PR, it's about using words in the dictionary to check whether a word that is not in the dictionary is made up of words that are in the dictionary.

So if someone added foo and bar, foobar and barfoo won't be reported as invalid.

But if someone added mergeable, we may accept unmergeable, but apparently maybe also mergeableun and some other strange variations, such as inmergeable...

I mean, unless I'm wrong, it means that from now on the following words wouldn't be reported as errors:

  • tis
  • sould
  • aadd
  • ork
  • misstakes

I can see that misstakes is made of two real words, miss and takes.

But would takesmiss be accepted?

Maybe I simply don't get the scope of this PR and/or this rule.

But from my perspective, the split logic should be limited to words that are longer than a few characters.

But I would exclude one-letter "words" from this logic. I mean, having:

  • a leads to accepting aadd as made of a + add
  • k leads to accepting ork as made of or + k
  • d leads to accepting sould as made of soul + d

And maybe allowable 2-letter (and maybe 3-letter) words should come from a manually maintained list; a rough sketch of such a guard follows.
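
If those suggestions were adopted, the check might look something like this sketch (threshold values are illustrative, not anything agreed in this thread):

    /// Hypothetical guard: skip short words entirely, and reject splits
    /// that would produce a part shorter than MIN_PART_LEN characters.
    const MIN_WORD_LEN: usize = 5;
    const MIN_PART_LEN: usize = 2;

    fn split_allowed(word_len: usize, split_at: usize) -> bool {
        word_len >= MIN_WORD_LEN
            && split_at >= MIN_PART_LEN
            && word_len - split_at >= MIN_PART_LEN
    }

Under those values, sould (soul + d), aadd (a + add), and ork (or + k) would all be rejected.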

All this makes me think there is something off, either in the logic of this PR or simply in my understanding.

Once again, I might be totally out of scope.
But seeing that test files were updated because the rule was updated makes me think there is a problem.

hippietrail (Contributor, Author):

No no. It detects places where you either missed hitting the spacebar or wrongly think something like incase or everytime is a standard word. So it would flag misstakes and takesmiss (which would already be flagged as misspellings) and add suggestions like "did you mean miss takes?". It only does these checks for words that are not in the dictionary. It'll also find compounds written by Germanic-language speakers who forget they're written as two words in English, and product names and trademarks like vscode and wifi that people write without a space but that are officially written with a space or hyphen.
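
Tying the earlier sketches together, the flow described here is roughly the following (still illustrative, using the same stand-in HashSet dictionary and the find_split helper from above):

    /// Only out-of-dictionary words are checked; a successful split
    /// becomes a "did you mean ...?" suggestion with a space inserted.
    fn suggest_split(word: &str, dict: &HashSet<String>) -> Option<String> {
        if dict.contains(word) {
            return None; // words already in the dictionary are never flagged
        }
        find_split(word, dict).map(|(l, r)| format!("{l} {r}"))
    }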

The tests it changes are all brittle tests that depend on the number of lints found not changing.


let mut found = false;

for i in 1..w_chars.len() {
elijah-potter (Collaborator):

This looks like O(n^2), which could have significant performance considerations. Run cargo bench on master and on this branch to get a good understanding of the performance implications.

hippietrail (Contributor, Author):

> This looks like O(n^2), which could have significant performance considerations. Run cargo bench on master and on this branch to get a good understanding of the performance implications.

It's definitely never going to be free. Heuristics and optimization can probably help. Benching now...
master (before pulling latest changes):

     Running benches/parse_demo.rs (target/release/deps/parse_demo-e61b92e973b62c69)
Benchmarking parse_essay: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 6.0s, enable flat sampling, or reduce sample count to 60.
parse_essay             time:   [1.1415 ms 1.1487 ms 1.1594 ms]
Found 11 outliers among 100 measurements (11.00%)
  6 (6.00%) high mild
  5 (5.00%) high severe

lint_essay              time:   [4.0558 ms 4.0617 ms 4.0686 ms]
Found 12 outliers among 100 measurements (12.00%)
  4 (4.00%) high mild
  8 (8.00%) high severe

lint_essay_uncached     time:   [36.347 ms 37.231 ms 38.571 ms]
Found 9 outliers among 100 measurements (9.00%)
  2 (2.00%) high mild
  7 (7.00%) high severe

split_words branch (without merging latest changes):

     Running benches/parse_demo.rs (target/release/deps/parse_demo-934b9b6b76825e17)
Benchmarking parse_essay: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 6.2s, enable flat sampling, or reduce sample count to 60.
parse_essay             time:   [1.1756 ms 1.1879 ms 1.2016 ms]
                        change: [+2.3471% +3.5357% +4.9174%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 9 outliers among 100 measurements (9.00%)
  6 (6.00%) high mild
  3 (3.00%) high severe

lint_essay              time:   [4.6598 ms 5.4398 ms 6.8017 ms]
                        change: [+15.204% +33.930% +75.811%] (p = 0.01 < 0.05)
                        Performance has regressed.
Found 15 outliers among 100 measurements (15.00%)
  5 (5.00%) high mild
  10 (10.00%) high severe

lint_essay_uncached     time:   [36.381 ms 36.485 ms 36.601 ms]
                        change: [-5.4025% -2.0040% +0.3988%] (p = 0.23 > 0.05)
                        No change in performance detected.
Found 9 outliers among 100 measurements (9.00%)
  6 (6.00%) high mild
  3 (3.00%) high severe

master (after pulling latest changes):

     Running benches/parse_demo.rs (target/release/deps/parse_demo-db6a88add7c2fdb6)
Benchmarking parse_essay: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 6.0s, enable flat sampling, or reduce sample count to 60.
parse_essay             time:   [1.1495 ms 1.1665 ms 1.1878 ms]
                        change: [-3.5494% -1.8548% +0.1785%] (p = 0.05 > 0.05)
                        No change in performance detected.
Found 17 outliers among 100 measurements (17.00%)
  9 (9.00%) high mild
  8 (8.00%) high severe

lint_essay              time:   [4.1378 ms 4.2216 ms 4.3449 ms]
                        change: [-38.141% -22.394% -8.9018%] (p = 0.02 < 0.05)
                        Performance has improved.
Found 16 outliers among 100 measurements (16.00%)
  6 (6.00%) high mild
  10 (10.00%) high severe

lint_essay_uncached     time:   [36.746 ms 37.151 ms 37.752 ms]
                        change: [+0.6915% +1.8258% +3.5692%] (p = 0.01 < 0.05)
                        Change within noise threshold.
Found 3 outliers among 100 measurements (3.00%)
  3 (3.00%) high severe

We could add heuristics such as only doing this for words over a certain length, and only trying splits at several points near the middle (a rough sketch below). I didn't try to get rid of all allocations and clones, or use slices rather than strings as much as possible, etc. I think there's a trade-off.
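
For the "only near the middle" idea, one hypothetical variant of the split iterator (WINDOW is an illustrative value, not from this PR):

    /// Only try split points within WINDOW positions of the middle,
    /// capping per-word work at a constant number of dictionary lookups.
    const WINDOW: usize = 3;

    fn near_middle_splits(len: usize) -> impl Iterator<Item = usize> {
        let mid = (len + 1) / 2;
        (mid.saturating_sub(WINDOW)..=(mid + WINDOW).min(len.saturating_sub(1)))
            .filter(|&at| at >= 1)
    }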

We could also have it turned off by default?

@hippietrail (Contributor, Author):

Feature request #907 is related to this.
