Ability to check for inconsistent use of capitalization #25

Open
ajeanmahoney opened this issue Jan 8, 2024 · 10 comments

@ajeanmahoney
Collaborator

Just like #24, but the report captures inconsistent capitalization of terms within the text. For example:

6TiSCH-Aware (1)
6TiSCH-aware (17)
6top protocol (6)
6top Protocol (2)
Address Lookup (3)
address lookup (12)
Architecture (29)
architecture (22)
Backbone (4)
backbone (15)
backbone router (1)
Backbone router (2)
Backbone Router (27) 
ChannelOffset (1)
channelOffset (24)
Deterministic (3)
deterministic (12)
Distributed Routing (1) 
distributed routing (6)

Note that the RPC uses title case for section titles, so if a term is only capitalized in a heading, this would not be considered inconsistent and does not need to be highlighted.
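A minimal sketch of the kind of report requested above, assuming the document is available as plain text. The function name `capitalization_variants` and the tokenizing regex are illustrative choices, not part of any existing tool, and the heading exclusion mentioned above is not handled here.

```python
import re
from collections import Counter, defaultdict

def capitalization_variants(text):
    """Group single-word tokens case-insensitively; keep groups with more than one spelling."""
    # Tokens are runs of letters/digits, allowing internal hyphens (e.g. "6TiSCH-aware").
    tokens = re.findall(r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*", text)
    groups = defaultdict(Counter)
    for tok in tokens:
        groups[tok.lower()][tok] += 1
    return {key: dict(counts) for key, counts in groups.items() if len(counts) > 1}

sample = ("The Backbone router forwards traffic. Every backbone is shared. "
          "Each backbone has one channelOffset and one ChannelOffset.")
report = capitalization_variants(sample)
for key, counts in report.items():
    print(", ".join(f"{spelling} ({n})" for spelling, n in counts.items()))
```

This only covers single words; multi-word terms such as "Backbone Router" would need a separate pass.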

@NGPixel
Member

NGPixel commented Jan 28, 2025

Processing all words, then all two-word sequences, would be quite intensive. Can, for example, backbone and router be handled as 2 different terms only? Or do you expect backbone router to also be checked (this would lead to many duplicate results)?

Is it safe to ignore terms shorter than 4 characters?

@rjsparks
Member

Do you need to check more than two adjacent words?

@ajeanmahoney
Collaborator Author

> Is it safe to ignore terms shorter than 4 characters?

For the most part. I would like "NUL" vs "Nul" or "null", and "ID", "Id", and "id" to be flagged. Three letters and shorter with interesting capitalization tend to be abbreviations.
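One way to read the rule above is sketched below, under the assumption that "interesting capitalization" means an uppercase letter somewhere after the first character. The function name `keep_short_term` and the `min_len` parameter are hypothetical. Note that case-folding alone would not group "null" with "NUL", since the spellings differ; that pairing would need a separate alias list.

```python
def keep_short_term(variants, min_len=4):
    """Decide whether a group of spellings of one term should be reported.

    variants: spellings that differ only in case, e.g. ["ID", "Id", "id"].
    """
    variants = list(variants)
    if len(variants[0]) >= min_len:
        return True
    # Short terms: report only if some spelling has an uppercase letter
    # after its first character, which suggests an abbreviation.
    return any(any(ch.isupper() for ch in v[1:]) for v in variants)

print(keep_short_term(["ID", "Id", "id"]))
print(keep_short_term(["The", "the"]))
```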

> Can, for example, backbone and router be handled as 2 different terms only?

> Do you need to check more than two adjacent words?

Field names and response codes can have more than two capitalized words (e.g., Destination Connection ID Length field, Proxy Authentication Required). It would be helpful if the variations like the following were flagged:

  • Destination Connection ID Length field
  • Destination Connection Id Length field
  • Destination Connection ID length field

@rjsparks
Member

This is an interesting problem, and is probably solvable, but it's not currently solved in any code we have.
Avoiding false positives at sentence boundaries will be a corner case to address, for example: "in the router backbone. Router backbones have".
What about plural vs not plural forms?
I would scope this as a big ask: still doable, I think, but not a minor feature.
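A rough sketch of one way to approach the sentence-boundary concern: mark the first word of each sentence so that its capitalization can be treated as ambiguous rather than counted as a variant. The function name `tokens_with_sentence_flag` is hypothetical, and the sentence splitter is deliberately naive (abbreviations such as "e.g." would break it).

```python
import re

def tokens_with_sentence_flag(text):
    """Return (word, at_sentence_start) pairs for every word in the text."""
    result = []
    # Naive split: a sentence ends at '.', '!' or '?' followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
        words = re.findall(r"[A-Za-z0-9-]+", sentence)
        for i, word in enumerate(words):
            result.append((word, i == 0))
    return result

pairs = tokens_with_sentence_flag(
    "It sits in the router backbone. Router backbones have limits."
)
# Words whose capital letter may simply be due to sentence position.
ambiguous = [word for word, first in pairs if first]
print(ambiguous)
```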

@ajeanmahoney
Collaborator Author

I don't know if it would help to stop searching once an all-lowercase word is found. The example above would then be:

  • Destination Connection ID
  • Destination Connection ID Length
  • Destination Connection Id Length
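The stopping rule suggested here can be sketched as follows, assuming the text has already been split into words. The function name `capitalized_run` is illustrative only.

```python
def capitalized_run(words, start=0):
    """Collect words from `start` up to, but not including, the first all-lowercase word."""
    run = []
    for word in words[start:]:
        if word.islower():
            break
        run.append(word)
    return " ".join(run)

phrases = [
    "Destination Connection ID Length field",
    "Destination Connection Id Length field",
    "Destination Connection ID length field",
]
runs = [capitalized_run(p.split()) for p in phrases]
for r in runs:
    print(r)
```

With this rule the third phrase is cut at "length", so it yields a shorter term than the other two and would have to be matched against them as a prefix.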

> What about plural vs not plural forms?

That doesn't matter.

Checking terms for caps consistency is something we do manually now. For example, if I see both "Id" and "ID" used, I use case-sensitive search (on singular constructions, which also finds plurals) to get counts and see how the terms are used. If there are a lot of "ID"s but only a couple of "Id"s, then I'll check if I can update those "Id"s to "ID". If the use is mixed (e.g., 22 "ID"s, 24 "Id"s), then I'll ask authors for guidance.
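The manual triage described above could be approximated like this. The 0.8 dominance threshold and the function name `triage` are assumptions made for illustration; the comment above does not define an exact cutoff.

```python
def triage(counts, dominance=0.8):
    """Suggest an action for one term.

    counts: mapping of spelling -> number of occurrences.
    Returns ("normalize", spelling) if one spelling clearly dominates,
    otherwise ("ask_authors", None).
    """
    total = sum(counts.values())
    top_spelling, top_count = max(counts.items(), key=lambda kv: kv[1])
    if top_count / total >= dominance:
        return ("normalize", top_spelling)
    return ("ask_authors", None)

print(triage({"ID": 40, "Id": 2}))
print(triage({"ID": 22, "Id": 24}))
```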

@rjsparks
Member

I think there is something to discuss on plurals that does matter.

Suppose the entire document is "Backbone routers are great. Let's talk about my backbone router." What should the report contain? Is it different if the document is "Backbone routers are great. Let's talk about my backbone Router."?

@NGPixel
Member

NGPixel commented Jan 29, 2025

An option would be to have a set list of common words to check for, rather than processing every possible word / term sequence in the document.

@ajeanmahoney
Collaborator Author

> What should the report contain?

It would be great to have this:

  • Backbone router (1)
  • backbone router (3)
  • backbone Router (1)

It would still be helpful to have this:

  • Backbone (1) / backbone (4)
  • Router (1) / router (4)
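A sketch of how both the phrase-level and the word-level report could come from one pass per sequence length, using the two example sentences from earlier in the thread as input. The plural folding (stripping one trailing "s") is a naive assumption, the names `singular` and `variant_report` are hypothetical, and sentence-initial capitals are counted as-is. The counts this produces are for this sample only.

```python
import re
from collections import Counter, defaultdict

def singular(word):
    # Naive plural folding (assumption): strip one trailing "s" from longer words.
    return word[:-1] if len(word) > 3 and word.endswith("s") else word

def variant_report(text, n):
    """Count capitalization variants of n-word sequences, folding plurals."""
    words = [singular(w) for w in re.findall(r"[A-Za-z0-9-]+", text)]
    groups = defaultdict(Counter)
    for i in range(len(words) - n + 1):
        term = " ".join(words[i:i + n])
        groups[term.lower()][term] += 1
    return {key: dict(c) for key, c in groups.items() if len(c) > 1}

doc = ("Backbone routers are great. Let's talk about my backbone router. "
       "Backbone routers are great. Let's talk about my backbone Router.")
phrase_report = variant_report(doc, 2)
word_report = variant_report(doc, 1)
print(phrase_report)
print(word_report)
```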

> An option would be to have a set list of common words to check for?

RFC terminology is incredibly diverse, and there are WG 'dialects' where one working group may develop a capitalization style that may not be used in another working group. The RPC strives for consistency within a doc and within clusters, but it can be difficult to make documents consistent across time and areas.

We have working style sheets for documents that we edit, but we don't currently capture the final decisions, just the questions. So we don't have great data for building a checklist.

It would be awesome if we could save final style sheets and use them as checklists in the future. For instance, while editing a document that normatively references RFC NNNN, you could load the rfcNNNN style sheet (and the style sheets of all the RFCs that your doc normatively references) and check your document's capitalization against those.
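The style-sheet idea could look roughly like the sketch below. No saved style-sheet format exists yet according to this thread, so the list-of-preferred-spellings format and the function name `check_against_style_sheet` are purely hypothetical.

```python
import re

def check_against_style_sheet(text, preferred_terms):
    """Report occurrences whose capitalization differs from the preferred spelling.

    preferred_terms: iterable of preferred spellings (hypothetical style-sheet format).
    Returns a list of (found, preferred, offset) tuples.
    """
    findings = []
    for preferred in preferred_terms:
        pattern = re.compile(r"\b" + re.escape(preferred) + r"\b", re.IGNORECASE)
        for match in pattern.finditer(text):
            if match.group(0) != preferred:
                findings.append((match.group(0), preferred, match.start()))
    return findings

style_sheet = ["Backbone Router", "channelOffset"]
text = "The backbone router sets ChannelOffset; the Backbone Router wins."
findings = check_against_style_sheet(text, style_sheet)
for found, preferred, offset in findings:
    print(f"{found!r} at offset {offset}: style sheet prefers {preferred!r}")
```

Style sheets from several normatively referenced RFCs could be merged into one list before the check, though conflicts between them would need a rule of their own.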

@rjsparks
Member

> It would be great to have this:

I think you answered what you would want if a document contained both of the example sentences I gave, but I don't understand the output numbers.

@ajeanmahoney
Collaborator Author

I made up the counts in my example above.
