Tokenizer gets confused by escaped quotes #306
Comments
I hope not. https://many-tags-blacklist.elixir.fbstc.org/linux/latest/source/security/selinux/hooks.c#L1073 runs a1bff7a plus some new-pygments iteration (notice there is no redirect on latest), and the identifiers are cut off in the same place. I suspect it has something to do with how filters are designed. Expect a redesign proposal (wall of text) soon.
On v6.10 there is a whole chunk without identifiers; it starts on line 1067 and ends on line 2583. Quick explanation on how formatted code is supplied with identifier links:
The problem here lies in the tokenizer: it gets confused, probably by the escaped quote, and treats a big chunk of code as a single token (maybe an interesting token, maybe not; I didn't count the lines and it doesn't matter). Try We could of course try to massage the regex to make it support unescaped strings (assuming that is even possible, although AFAIK Perl regexes are very powerful), but I have an idea for how to redesign this whole part (tl;dr: use Pygments lexers and extend Pygments a bit). I will post a full explanation soon.
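To illustrate the kind of confusion described above, here is a minimal Python sketch. It is not Elixir's actual tokenizer regex; the pattern names and the helper `string_spans` are hypothetical. It compares a naive string-literal pattern with an escape-aware one on a line containing `\"`.

```python
import re

# Naive pattern (hypothetical): closes the literal at the first quote it sees,
# even when that quote is escaped with a backslash.
NAIVE_STRING = re.compile(r'"[^"]*"')

# Escape-aware pattern (hypothetical): a backslash consumes the next character,
# so \" can never terminate the literal. Raw newlines are not allowed inside.
ESCAPED_STRING = re.compile(r'"(?:\\.|[^"\\\n])*"')


def string_spans(pattern, source):
    """Return the text of every string literal the pattern finds in source."""
    return [m.group(0) for m in pattern.finditer(source)]


line = r'a("\""); b("x");'

# The naive pattern gets out of sync and swallows real code as a "string".
naive = string_spans(NAIVE_STRING, line)

# The escape-aware pattern recovers both literals.
aware = string_spans(ESCAPED_STRING, line)
```

With the naive pattern, the code between the two calls (`); b(`) ends up inside a string token, which is a small-scale version of a chunk of source losing its identifier links.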
So the root cause is always the same quote escaping that fails parsing? Could we reset stuff at EOL so that it restarts parsing references on the next line? Or does it think those are valid multiline strings?
Probably not always, I just wanted to see how often that happens because of quote escaping (
I don't understand what you mean. I think it's b - these chunks are multiline strings (single line with newlines escaped) in tokenize-file output. Also, for another bug caused by the tokenizer getting confused, see https://elixir.bootlin.com/u-boot/v2023.10/source/fs/ext4/ext4_common.c#L33. An include with quotes that follows an include with angle brackets is treated as an identifier.
The idea was a sort of mitigation: we know C does not support multi-line strings. That means we know those multiline chunks can only be a bug reported by our tokenizer (when parsing C code). We could modify the tokenizer to reset its state at newline. That workaround would only make sense if we plan on expanding on the current tokenizer script. Your #307 proposal makes more sense to my eyes though.
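A rough Python sketch of that mitigation, under stated assumptions: `find_identifiers` is a hypothetical toy scanner, not the real tokenizer script, and it deliberately ignores backslash escapes in order to reproduce the confusion. It also treats every newline as a reset point, so it does not account for strings continued with an escaped newline.

```python
import re

IDENT = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def find_identifiers(source, reset_at_newline):
    """Return (line_number, identifier) pairs found outside string literals."""
    found = []
    in_string = False
    for line_no, line in enumerate(source.split("\n"), start=1):
        if reset_at_newline:
            # Mitigation: C has no multi-line strings, so drop any
            # "inside a string" state carried over from the previous line.
            in_string = False
        i = 0
        while i < len(line):
            if line[i] == '"':
                # Deliberately naive: every quote toggles the state,
                # including escaped ones, which is what causes the bug.
                in_string = not in_string
                i += 1
            elif not in_string and (m := IDENT.match(line, i)):
                found.append((line_no, m.group(0)))
                i = m.end()
            else:
                i += 1
    return found


source = "\n".join([r'log("\"");', "foo(bar);", "baz();"])

without_reset = find_identifiers(source, reset_at_newline=False)
with_reset = find_identifiers(source, reset_at_newline=True)
```

Without the reset, the unbalanced quote on the first line hides every identifier after it; with the reset, the damage is confined to the line that triggered the confusion.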
See security/selinux/hooks.c in v6.10 for a page with the bug. The identifier definition exists, but the web interface has not inserted links inside the document.
Could it be related to #300?