Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(parse/md): markdown header support in lexer #5208

Open
wants to merge 5 commits into
base: main
Choose a base branch
from

Conversation

AugustinMauroy
Copy link

@AugustinMauroy AugustinMauroy commented Feb 26, 2025

Summary

Supporting ATX heading for markdown lexer.

Blocking thing

  • test don't pass and IDK how to fix it.

@github-actions github-actions bot added A-Parser Area: parser A-Tooling Area: internal tools labels Feb 26, 2025
@AugustinMauroy AugustinMauroy changed the title Markdown header feat(parse/md): Markdown header Feb 26, 2025
@ematipico ematipico changed the title feat(parse/md): Markdown header feat(parse/md): markdown header Feb 26, 2025
@AugustinMauroy AugustinMauroy marked this pull request as ready for review February 26, 2025 14:02
@AugustinMauroy AugustinMauroy changed the title feat(parse/md): markdown header feat(parse/md): markdown header support in lever Feb 26, 2025
@AugustinMauroy AugustinMauroy changed the title feat(parse/md): markdown header support in lever feat(parse/md): markdown header support in lexer Feb 26, 2025
Copy link

codspeed-hq bot commented Feb 26, 2025

CodSpeed Performance Report

Merging #5208 will not alter performance

Comparing AugustinMauroy:markdown-header (769030c) with main (4df66a5)

Summary

✅ 95 untouched benchmarks

Comment on lines 59 to +66
MdHeader = before:MdHashList MdParagraph? after:MdHashList

MdHeader1 = before:MdHashList MdParagraph? after:MdHashList
MdHeader2 = before:MdHashList MdParagraph? after:MdHashList
MdHeader3 = before:MdHashList MdParagraph? after:MdHashList
MdHeader4 = before:MdHashList MdParagraph? after:MdHashList
MdHeader5 = before:MdHashList MdParagraph? after:MdHashList
MdHeader6 = before:MdHashList MdParagraph? after:MdHashList
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change doesn't quite look right to me. We already have MdHeader defined above, but you've added 6 new nodes for headers. IMO, we should rename MdHeader into AnyMdHeader, and have it be a union of all the other header levels.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

okay, but how should we represent the level of the heading ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Having MdHeader1 through MdHeader6 is fine. I'm saying we should change MdHeader to:

AnyMdHeader = MdHeader1 | MdHeader2 | MdHeader3 | MdHeader4 | MdHeader5 | MdHeader6

Comment on lines +39 to +42
// Skip whitespace after the hash marks
if p.at(WHITESPACE) {
p.bump(WHITESPACE);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIRC, I think the whitespace is required for it to become a heading. Do you have a source for this behavior?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in this example space is skip https://spec.commonmark.org/0.31.2/#example-62
in this example spaces is also skiped https://spec.commonmark.org/0.31.2/#example-62

Copy link
Contributor

@dyc3 dyc3 Mar 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At least one space or tab is required between the # characters and the heading’s contents, unless the heading is empty. Note that many implementations currently do not require the space.

The code here makes the whitespace optional when it is actually required. See example 64 in that doc

4 => MD_HEADER4,
5 => MD_HEADER5,
6 => MD_HEADER6,
_ => MD_HEADER // Fallback, should not happen
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be a parsing error instead?

Copy link
Member

@ematipico ematipico Feb 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Markdown, I think, shouldn't have parsing errors. What I mean is that at the end, the language is very lax and, worst case scenario, a paragraph is always emitted.

I've never seen an editor emitting a parsing error 🤔

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This part of checking if it's valid ATX heading or just paragphe is done in LEXER but rust ask us for fallback

Comment on lines 187 to 205
fn consume_header(&mut self) -> MarkdownSyntaxKind {
self.assert_at_char_boundary();

let mut level = 0;
while matches!(self.current_byte(), Some(b'#')) {
self.advance(1);
level += 1;
}

match level {
1 => MD_HEADER1,
2 => MD_HEADER2,
3 => MD_HEADER3,
4 => MD_HEADER4,
5 => MD_HEADER5,
6 => MD_HEADER6,
_ => ERROR_TOKEN,
}
}
Copy link
Contributor

@dyc3 dyc3 Feb 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now that I'm looking at this again, this isn't right. The purpose of the lexer is to convert source code into an "alphabet" of sorts for the parser to interpret into a syntax tree. So it should output the "leaf" token, which would be the token for #, which you can get the SyntaxKind for via the macro T![#]. To continue the "alphabet" analogy, the lexer needs to emit the "letters" (tokens) that make up the "words".

I recommend taking a look at how the HTML lexer works, particularly how consume_token works here:

fn consume_token(&mut self, current: u8) -> HtmlSyntaxKind {
match current {
b'\n' | b'\r' | b'\t' | b' ' => self.consume_newline_or_whitespaces(),
b'<' => self.consume_l_angle(),
b'>' => self.consume_byte(T![>]),
b'/' => self.consume_byte(T![/]),
b'=' => self.consume_byte(T![=]),
b'!' => self.consume_byte(T![!]),
b'\'' | b'"' => self.consume_string_literal(current),
_ if self.current_kind == T![<] && is_tag_name_byte(current) => {
// tag names must immediately follow a `<`

Copy link
Contributor

@dyc3 dyc3 Feb 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alternatively, another approach here would be to make each header it's own token. In the ungram, it could look something like this:

MdHeader1 = '#' MdParagraph?
MdHeader2 = '##' MdParagraph?
MdHeader3 = '###' MdParagraph?
etc.

And then what you currently have here would make more sense. I'm not sure if I would recommend going that route though.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

with your proposal of grammar change:

thread 'main' panicked at /.cargo/registry/src/index.crates.io-6f17d22bba15001f/quote-1.0.38/src/runtime.rs:490:9:
"##_token" is not a valid Ident
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-Parser Area: parser A-Tooling Area: internal tools
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants