-
-
Notifications
You must be signed in to change notification settings - Fork 540
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(parse/md): markdown header support in lexer #5208
base: main
Are you sure you want to change the base?
Conversation
d46c1ba
to
22c15da
Compare
CodSpeed Performance ReportMerging #5208 will not alter performanceComparing Summary
|
MdHeader = before:MdHashList MdParagraph? after:MdHashList | ||
|
||
MdHeader1 = before:MdHashList MdParagraph? after:MdHashList | ||
MdHeader2 = before:MdHashList MdParagraph? after:MdHashList | ||
MdHeader3 = before:MdHashList MdParagraph? after:MdHashList | ||
MdHeader4 = before:MdHashList MdParagraph? after:MdHashList | ||
MdHeader5 = before:MdHashList MdParagraph? after:MdHashList | ||
MdHeader6 = before:MdHashList MdParagraph? after:MdHashList |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change doesn't quite look right to me. We already have MdHeader
defined above, but you've added 6 new nodes for headers. IMO, we should rename MdHeader
into AnyMdHeader
, and have it be a union of all the other header levels.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
okay, but how should we represent the level of the heading ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Having MdHeader1
through MdHeader6
is fine. I'm saying we should change MdHeader
to:
AnyMdHeader = MdHeader1 | MdHeader2 | MdHeader3 | MdHeader4 | MdHeader5 | MdHeader6
// Skip whitespace after the hash marks | ||
if p.at(WHITESPACE) { | ||
p.bump(WHITESPACE); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IIRC, I think the whitespace is required for it to become a heading. Do you have a source for this behavior?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in this example space is skip https://spec.commonmark.org/0.31.2/#example-62
in this example spaces is also skiped https://spec.commonmark.org/0.31.2/#example-62
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At least one space or tab is required between the # characters and the heading’s contents, unless the heading is empty. Note that many implementations currently do not require the space.
The code here makes the whitespace optional when it is actually required. See example 64 in that doc
4 => MD_HEADER4, | ||
5 => MD_HEADER5, | ||
6 => MD_HEADER6, | ||
_ => MD_HEADER // Fallback, should not happen |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be a parsing error instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Markdown, I think, shouldn't have parsing errors. What I mean is that at the end, the language is very lax and, worst case scenario, a paragraph is always emitted.
I've never seen an editor emitting a parsing error 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This part of checking if it's valid ATX heading or just paragphe is done in LEXER but rust ask us for fallback
fn consume_header(&mut self) -> MarkdownSyntaxKind { | ||
self.assert_at_char_boundary(); | ||
|
||
let mut level = 0; | ||
while matches!(self.current_byte(), Some(b'#')) { | ||
self.advance(1); | ||
level += 1; | ||
} | ||
|
||
match level { | ||
1 => MD_HEADER1, | ||
2 => MD_HEADER2, | ||
3 => MD_HEADER3, | ||
4 => MD_HEADER4, | ||
5 => MD_HEADER5, | ||
6 => MD_HEADER6, | ||
_ => ERROR_TOKEN, | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now that I'm looking at this again, this isn't right. The purpose of the lexer is to convert source code into an "alphabet" of sorts for the parser to interpret into a syntax tree. So it should output the "leaf" token, which would be the token for #
, which you can get the SyntaxKind for via the macro T![#]
. To continue the "alphabet" analogy, the lexer needs to emit the "letters" (tokens) that make up the "words".
I recommend taking a look at how the HTML lexer works, particularly how consume_token
works here:
biome/crates/biome_html_parser/src/lexer/mod.rs
Lines 54 to 64 in 766492f
fn consume_token(&mut self, current: u8) -> HtmlSyntaxKind { | |
match current { | |
b'\n' | b'\r' | b'\t' | b' ' => self.consume_newline_or_whitespaces(), | |
b'<' => self.consume_l_angle(), | |
b'>' => self.consume_byte(T![>]), | |
b'/' => self.consume_byte(T![/]), | |
b'=' => self.consume_byte(T![=]), | |
b'!' => self.consume_byte(T![!]), | |
b'\'' | b'"' => self.consume_string_literal(current), | |
_ if self.current_kind == T![<] && is_tag_name_byte(current) => { | |
// tag names must immediately follow a `<` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Alternatively, another approach here would be to make each header it's own token. In the ungram, it could look something like this:
MdHeader1 = '#' MdParagraph?
MdHeader2 = '##' MdParagraph?
MdHeader3 = '###' MdParagraph?
etc.
And then what you currently have here would make more sense. I'm not sure if I would recommend going that route though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
with your proposal of grammar change:
thread 'main' panicked at /.cargo/registry/src/index.crates.io-6f17d22bba15001f/quote-1.0.38/src/runtime.rs:490:9:
"##_token" is not a valid Ident
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
Summary
Supporting ATX heading for markdown lexer.
Blocking thing