feat: split-words linter counterpart to merge-words #885

Open · hippietrail wants to merge 1 commit into master
Conversation

hippietrail (Contributor)

Issues

N/A

Description

Looks for words that are not in the dictionary but which, when split in two, yield two parts that are each in the dictionary.
It starts splitting the word in the middle and expands outwards, rather than starting at the left and moving right, the idea being that missed spaces usually occur near the middle.
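
A minimal sketch of that search order, assuming a helper along these lines (middle_out_splits is an illustrative name, not the PR's actual code):

    /// Yield candidate split points for a word of `len` characters,
    /// starting at the middle and alternating outwards, so the most
    /// likely missed-space positions are tried first.
    fn middle_out_splits(len: usize) -> impl Iterator<Item = usize> {
        let mid = (len + 1) / 2;
        (0..len.saturating_sub(1)).map(move |i| {
            // Visits mid, mid-1, mid+1, mid-2, mid+2, ...
            if i % 2 == 0 { mid + i / 2 } else { mid - (i + 1) / 2 }
        })
    }

For a six-letter word this tries split points 3, 2, 4, 1, 5, in that order.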

The good and the bad of it is that it finds more than a few problems that have more to do with entries that maybe shouldn't be in the dictionary. This can be annoying, but it will help us clean up the dictionary.

But the logic can also be tweaked to avoid common antipatterns as they are identified.

For now, both halves have to be exact matches for dictionary entries: each letter must be in the same case as in the dictionary. There were more false positives before I made matching exact, but now it misses some cases, so this logic/heuristic can also be tweaked.
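
A sketch of that exact-match rule, reusing the middle_out_splits helper from the sketch above; a HashSet stands in for Harper's real dictionary type, whose API differs:

    use std::collections::HashSet;

    /// Return the first split of `word` whose two halves are both
    /// exact, case-sensitive dictionary entries.
    fn find_split(word: &str, dict: &HashSet<String>) -> Option<(String, String)> {
        let chars: Vec<char> = word.chars().collect();
        for at in middle_out_splits(chars.len()) {
            let left: String = chars[..at].iter().collect();
            let right: String = chars[at..].iter().collect();
            // HashSet lookups are case-sensitive, so "Soul" != "soul".
            if dict.contains(&left) && dict.contains(&right) {
                return Some((left, right));
            }
        }
        None
    }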

How Has This Been Tested?

I added a couple of unit tests.
I also had to modify several unrelated tests that this PR breaks, since they assert on the exact number of lints found.

Checklist

  • I have performed a self-review of my own code
  • I have added tests to cover my changes

@elijah-potter (Collaborator) left a comment:

Love the idea here. I want to be mindful of the potential performance implications, though.

@@ -1,14 +1,14 @@
 /** This is a doc comment.
- * Since there are no keywords it _sould_ be checked. */
+ * Since there are no keywords it _shuld_ be checked. */
elijah-potter (Collaborator):

I'm a little confused as to why these must be changed?

hippietrail (Contributor, Author):

> I'm a little confused as to why these must be changed?

I was too. A number of the tests are based on the number of lints found, and some unexpected splits produce pairs of "words" that are both in the dictionary. In this case I think it was "soul" and "d". (Line 19,421: d/~5NXGJ)

I think having both this linter and dictionary comments will help us sort out which dictionary entries really shouldn't be there, or add justifications for entries that don't look like words, explaining why they actually belong.

At the same time, the split_words logic can be tweaked. It was for cases like this that I made it case-sensitive: there are many single-letter uppercase "words", and they are often easy to justify. I also thought of not splitting all the way down to single letters, but that would prevent a whole class of common errors, such as a way/away, from being spotted.

I think what we'll get is false positives that will give us insight into both curating the dictionary and tweaking heuristics in the split_words logic.

Contributor:

I'm surprised to see such changes too.

Once again, I'm jumping into a discussion that I'm trying to follow without being sure of the exact issue this PR is trying to resolve.

So please be kind if my remarks are totally off topic 😅

The way I understood this PR, it's about using words in the dictionary to check whether a word that is not in the dictionary is made up of words that are in the dictionary.

So if someone added foo and bar, foobar and barfoo won't be reported as invalid.

But if someone added mergeable, we may accept unmergeable, but apparently maybe also mergeableun and some other strange variations, such as inmergeable...

I mean, unless I'm wrong, it means that from now on the following words wouldn't be reported as errors:

  • tis
  • sould
  • aadd
  • ork
  • misstakes

I can see that misstakes is made of two real words, miss and takes.

But would takesmiss be accepted?

Maybe I simply don't get the scope of this PR and/or this rule.

But from my perspective, the split logic should be limited to words that are longer than a few characters.

But I would exclude one-letter "words" from this logic. I mean, having:

  • a leads to accepting aadd as made of a + add
  • k leads to accepting ork as made of or + k
  • d leads to accepting sould as made of soul + d

And maybe allowable 2-letter (and maybe 3-letter) words should come from a manually maintained list; a rough sketch of such a guard follows.
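
If those suggestions were adopted, the check might look something like this sketch (threshold values are illustrative, not anything agreed in this thread):

    /// Hypothetical guard: skip short words entirely, and reject splits
    /// that would produce a part shorter than MIN_PART_LEN characters.
    const MIN_WORD_LEN: usize = 5;
    const MIN_PART_LEN: usize = 2;

    fn split_allowed(word_len: usize, split_at: usize) -> bool {
        word_len >= MIN_WORD_LEN
            && split_at >= MIN_PART_LEN
            && word_len - split_at >= MIN_PART_LEN
    }

Under those values, sould (soul + d), aadd (a + add), and ork (or + k) would all be rejected.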

All this makes me think there is something off, either in the logic of this PR or simply in my understanding.

Once again, I might be totally out of scope.
But seeing that test files were updated because the rule was updated makes me think there is a problem.

hippietrail (Contributor, Author):

No no. It detects places where you either missed hitting the spacebar or wrongly think something like incase or everytime is a standard word. So it would flag misstakes and takesmiss (which would already be flagged as misspellings) and add suggestions like "did you mean miss takes?". It only does these checks for words that are not in the dictionary. It'll also find compounds written by Germanic-language speakers who forget they're written as two words in English, and product names and trademarks like vscode and wifi that people write without a space but that are officially written with a space or hyphen.
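
Tying the earlier sketches together, the flow described here is roughly the following (still illustrative, using the same stand-in HashSet dictionary and the find_split helper from above):

    /// Only out-of-dictionary words are checked; a successful split
    /// becomes a "did you mean ...?" suggestion with a space inserted.
    fn suggest_split(word: &str, dict: &HashSet<String>) -> Option<String> {
        if dict.contains(word) {
            return None; // words already in the dictionary are never flagged
        }
        find_split(word, dict).map(|(l, r)| format!("{l} {r}"))
    }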

The tests it changes are all brittle tests that depend on the number of lints found not changing.


let mut found = false;

for i in 1..w_chars.len() {
elijah-potter (Collaborator):

This looks like O(n^2), which could have significant performance considerations. Run cargo bench on master and on this branch to get a good understanding of the performance implications.

hippietrail (Contributor, Author):

> This looks like O(n^2), which could have significant performance considerations. Run cargo bench on master and on this branch to get a good understanding of the performance implications.

It's definitely never going to be free. Heuristics and optimization can probably help. Benching now...
master (before pulling latest changes):

     Running benches/parse_demo.rs (target/release/deps/parse_demo-e61b92e973b62c69)
Benchmarking parse_essay: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 6.0s, enable flat sampling, or reduce sample count to 60.
parse_essay             time:   [1.1415 ms 1.1487 ms 1.1594 ms]
Found 11 outliers among 100 measurements (11.00%)
  6 (6.00%) high mild
  5 (5.00%) high severe

lint_essay              time:   [4.0558 ms 4.0617 ms 4.0686 ms]
Found 12 outliers among 100 measurements (12.00%)
  4 (4.00%) high mild
  8 (8.00%) high severe

lint_essay_uncached     time:   [36.347 ms 37.231 ms 38.571 ms]
Found 9 outliers among 100 measurements (9.00%)
  2 (2.00%) high mild
  7 (7.00%) high severe

split_words branch (without merging latest changes):

     Running benches/parse_demo.rs (target/release/deps/parse_demo-934b9b6b76825e17)
Benchmarking parse_essay: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 6.2s, enable flat sampling, or reduce sample count to 60.
parse_essay             time:   [1.1756 ms 1.1879 ms 1.2016 ms]
                        change: [+2.3471% +3.5357% +4.9174%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 9 outliers among 100 measurements (9.00%)
  6 (6.00%) high mild
  3 (3.00%) high severe

lint_essay              time:   [4.6598 ms 5.4398 ms 6.8017 ms]
                        change: [+15.204% +33.930% +75.811%] (p = 0.01 < 0.05)
                        Performance has regressed.
Found 15 outliers among 100 measurements (15.00%)
  5 (5.00%) high mild
  10 (10.00%) high severe

lint_essay_uncached     time:   [36.381 ms 36.485 ms 36.601 ms]
                        change: [-5.4025% -2.0040% +0.3988%] (p = 0.23 > 0.05)
                        No change in performance detected.
Found 9 outliers among 100 measurements (9.00%)
  6 (6.00%) high mild
  3 (3.00%) high severe

master (after pulling latest changes):

     Running benches/parse_demo.rs (target/release/deps/parse_demo-db6a88add7c2fdb6)
Benchmarking parse_essay: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 6.0s, enable flat sampling, or reduce sample count to 60.
parse_essay             time:   [1.1495 ms 1.1665 ms 1.1878 ms]
                        change: [-3.5494% -1.8548% +0.1785%] (p = 0.05 > 0.05)
                        No change in performance detected.
Found 17 outliers among 100 measurements (17.00%)
  9 (9.00%) high mild
  8 (8.00%) high severe

lint_essay              time:   [4.1378 ms 4.2216 ms 4.3449 ms]
                        change: [-38.141% -22.394% -8.9018%] (p = 0.02 < 0.05)
                        Performance has improved.
Found 16 outliers among 100 measurements (16.00%)
  6 (6.00%) high mild
  10 (10.00%) high severe

lint_essay_uncached     time:   [36.746 ms 37.151 ms 37.752 ms]
                        change: [+0.6915% +1.8258% +3.5692%] (p = 0.01 < 0.05)
                        Change within noise threshold.
Found 3 outliers among 100 measurements (3.00%)
  3 (3.00%) high severe

We could add heuristics such as only doing this for words over a certain length, and only trying splits at several points near the middle (a rough sketch below). I didn't try to get rid of all allocations and clones, or use slices rather than strings as much as possible, etc. I think there's a trade-off.
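
For the "only near the middle" idea, one hypothetical variant of the split iterator (WINDOW is an illustrative value, not from this PR):

    /// Only try split points within WINDOW positions of the middle,
    /// capping per-word work at a constant number of dictionary lookups.
    const WINDOW: usize = 3;

    fn near_middle_splits(len: usize) -> impl Iterator<Item = usize> {
        let mid = (len + 1) / 2;
        (mid.saturating_sub(WINDOW)..=(mid + WINDOW).min(len.saturating_sub(1)))
            .filter(|&at| at >= 1)
    }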

We could also have it turned off by default?

@hippietrail (Contributor, Author):

Feature request #907 is related to this.
