feat(parse/md): markdown header support in lexer #5208

AugustinMauroy · 2025-02-26T13:40:33Z

Summary

Add support for ATX headings to the Markdown lexer.

Blocker

The tests don't pass and I don't know how to fix them.

codspeed-hq · 2025-02-26T14:38:16Z

CodSpeed Performance Report

Merging #5208 will not alter performance

Comparing AugustinMauroy:markdown-header (769030c) with main (4df66a5)

Summary

✅ 95 untouched benchmarks

crates/biome_markdown_parser/src/lexer/tests.rs

dyc3 · 2025-02-26T15:26:58Z

xtask/codegen/markdown.ungram

 MdHeader = before:MdHashList MdParagraph? after:MdHashList

+MdHeader1 = before:MdHashList MdParagraph? after:MdHashList
+MdHeader2 = before:MdHashList MdParagraph? after:MdHashList
+MdHeader3 = before:MdHashList MdParagraph? after:MdHashList
+MdHeader4 = before:MdHashList MdParagraph? after:MdHashList
+MdHeader5 = before:MdHashList MdParagraph? after:MdHashList
+MdHeader6 = before:MdHashList MdParagraph? after:MdHashList


This change doesn't quite look right to me. We already have MdHeader defined above, but you've added 6 new nodes for headers. IMO, we should rename MdHeader to AnyMdHeader and make it a union of the six header-level nodes.

Okay, but how should we represent the level of the heading?
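One possible answer to the level question, as a minimal sketch: if the node keeps its leading run of `#` tokens (the `before:MdHashList` field in the grammar above), the level can be derived by counting them rather than encoded in six node kinds. The `AtxHeading` struct and its fields below are hypothetical stand-ins, not Biome's generated types.

```rust
/// Hypothetical stand-in for a parsed heading node; not Biome's generated type.
#[derive(Debug, PartialEq)]
struct AtxHeading<'a> {
    /// The leading run of `#` characters, exactly as written in the source.
    hashes: &'a str,
    /// The heading text after the hashes and the separating whitespace.
    content: &'a str,
}

impl AtxHeading<'_> {
    /// The level is the number of leading `#` characters.
    fn level(&self) -> usize {
        self.hashes.len()
    }
}
```

With this shape, a single node kind covers every level, and callers that care about the level ask the node for it.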

crates/biome_markdown_parser/src/lexer/mod.rs

dyc3 · 2025-02-28T18:32:52Z

crates/biome_markdown_parser/src/syntax/atx_headings.rs

+    // Skip whitespace after the hash marks
+    if p.at(WHITESPACE) {
+        p.bump(WHITESPACE);
+    }


IIRC, the whitespace is required for the line to become a heading. Do you have a source for this behavior?

In this example the space is skipped: https://spec.commonmark.org/0.31.2/#example-62
In this example the spaces are also skipped: https://spec.commonmark.org/0.31.2/#example-62

At least one space or tab is required between the # characters and the heading’s contents, unless the heading is empty. Note that many implementations currently do not require the space.

The code here makes the whitespace optional when it is actually required. See example 64 in that doc.
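The rule quoted above can be sketched as a standalone check. This is a simplified illustration, not the parser's actual code: the function name `atx_heading_level` is made up here, and the sketch ignores leading indentation and closing `#` sequences.

```rust
/// Returns the heading level (1..=6) if `line` opens an ATX heading, else `None`.
/// Simplified sketch: leading indentation and closing sequences are not handled.
fn atx_heading_level(line: &str) -> Option<usize> {
    // Count the leading run of `#` characters.
    let level = line.bytes().take_while(|&b| b == b'#').count();
    if !(1..=6).contains(&level) {
        return None;
    }
    // Look at the byte right after the hashes.
    match line.as_bytes().get(level).copied() {
        // Empty heading: the hashes run to the end of the line.
        None | Some(b'\n') | Some(b'\r') => Some(level),
        // At least one space or tab separates the hashes from the content.
        Some(b' ') | Some(b'\t') => Some(level),
        // Anything else (e.g. `#hashtag`) is not a heading.
        _ => None,
    }
}
```

Under this check, `# foo` is a level-1 heading, while `#5 bolt` and `#hashtag` are not headings because no space or tab follows the `#`.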

dyc3 · 2025-02-28T18:33:47Z

crates/biome_markdown_parser/src/syntax/atx_headings.rs

+        4 => MD_HEADER4,
+        5 => MD_HEADER5,
+        6 => MD_HEADER6,
+        _ => MD_HEADER // Fallback, should not happen


Should this be a parsing error instead?

Markdown, I think, shouldn't have parsing errors. What I mean is that at the end, the language is very lax and, worst case scenario, a paragraph is always emitted.

I've never seen an editor emitting a parsing error 🤔

Checking whether the input is a valid ATX heading or just a paragraph is done in the lexer, but Rust requires a fallback arm in the match.
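To illustrate the "worst case, emit a paragraph" idea, here is a minimal sketch in which the fallback arm produces a paragraph rather than an error. The `Block` enum and `classify_line` function are hypothetical names for illustration only; they are not part of the Biome codebase.

```rust
/// Hypothetical block classification, for illustration only.
#[derive(Debug, PartialEq)]
enum Block<'a> {
    Heading { level: u8, text: &'a str },
    Paragraph(&'a str),
}

/// Classifies a single line. Anything that is not a valid ATX heading
/// falls through to `Paragraph`, so no error is ever produced.
fn classify_line(line: &str) -> Block<'_> {
    let hashes = line.bytes().take_while(|&b| b == b'#').count();
    // `#` is ASCII, so slicing at `hashes` is always on a char boundary.
    let rest = &line[hashes..];
    let separated = rest.is_empty() || rest.starts_with(' ') || rest.starts_with('\t');
    match hashes {
        1..=6 if separated => Block::Heading {
            level: hashes as u8,
            text: rest.trim(),
        },
        // Zero hashes, seven or more hashes, or a missing separator.
        _ => Block::Paragraph(line),
    }
}
```

The `_` arm is still required by the compiler, but it now has a meaningful result instead of a "should not happen" placeholder.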

dyc3 · 2025-02-28T18:51:11Z

crates/biome_markdown_parser/src/lexer/mod.rs

+    fn consume_header(&mut self) -> MarkdownSyntaxKind {
+        self.assert_at_char_boundary();
+
+        let mut level = 0;
+        while matches!(self.current_byte(), Some(b'#')) {
+            self.advance(1);
+            level += 1;
+        }
+
+        match level {
+            1 => MD_HEADER1,
+            2 => MD_HEADER2,
+            3 => MD_HEADER3,
+            4 => MD_HEADER4,
+            5 => MD_HEADER5,
+            6 => MD_HEADER6,
+            _ => ERROR_TOKEN,
+        }
+    }


Now that I'm looking at this again, this isn't right. The purpose of the lexer is to convert source code into an "alphabet" of sorts for the parser to interpret into a syntax tree. So it should output the "leaf" token, which would be the token for #, which you can get the SyntaxKind for via the macro T![#]. To continue the "alphabet" analogy, the lexer needs to emit the "letters" (tokens) that make up the "words".
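As a rough sketch of the "letters, not words" idea, a toy lexer could emit one token per `#` and leave the counting to the parser. The `Kind` enum and `lex` function below are hypothetical and much simpler than Biome's lexer infrastructure.

```rust
/// Hypothetical token kinds for a toy Markdown lexer.
#[derive(Debug, PartialEq, Clone, Copy)]
enum Kind {
    Hash,
    Whitespace,
    Newline,
    Text,
}

/// Splits `src` into leaf tokens. Each `#` becomes its own `Hash` token;
/// deciding whether a run of hashes forms a heading is left to the parser.
fn lex(src: &str) -> Vec<(Kind, &str)> {
    let bytes = src.as_bytes();
    let mut tokens = Vec::new();
    let mut pos = 0;
    while pos < bytes.len() {
        let start = pos;
        let kind = match bytes[pos] {
            b'#' => {
                pos += 1;
                Kind::Hash
            }
            b'\n' => {
                pos += 1;
                Kind::Newline
            }
            b' ' | b'\t' => {
                while pos < bytes.len() && matches!(bytes[pos], b' ' | b'\t') {
                    pos += 1;
                }
                Kind::Whitespace
            }
            _ => {
                // Consume until the next byte that starts another token kind.
                while pos < bytes.len() && !matches!(bytes[pos], b'#' | b'\n' | b' ' | b'\t') {
                    pos += 1;
                }
                Kind::Text
            }
        };
        tokens.push((kind, &src[start..pos]));
    }
    tokens
}
```

For the input `## Hi`, this yields two `Hash` tokens, one `Whitespace` token, and one `Text` token; the parser would then count the hashes to determine the level.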

I recommend taking a look at how the HTML lexer works, particularly how consume_token works here:

biome/crates/biome_html_parser/src/lexer/mod.rs

Lines 54 to 64 in 766492f

fn consume_token(&mut self, current: u8) -> HtmlSyntaxKind {
    match current {
        b'\n' | b'\r' | b'\t' | b' ' => self.consume_newline_or_whitespaces(),
        b'<' => self.consume_l_angle(),
        b'>' => self.consume_byte(T![>]),
        b'/' => self.consume_byte(T![/]),
        b'=' => self.consume_byte(T![=]),
        b'!' => self.consume_byte(T![!]),
        b'\'' | b'"' => self.consume_string_literal(current),
        _ if self.current_kind == T![<] && is_tag_name_byte(current) => {
            // tag names must immediately follow a `<`

Alternatively, another approach here would be to make each header its own token. In the ungram, it could look something like this:

MdHeader1 = '#' MdParagraph?
MdHeader2 = '##' MdParagraph?
MdHeader3 = '###' MdParagraph?
etc.

And then what you currently have here would make more sense. I'm not sure if I would recommend going that route though.

With your proposed grammar change, I get:

thread 'main' panicked at /.cargo/registry/src/index.crates.io-6f17d22bba15001f/quote-1.0.38/src/runtime.rs:490:9:
"##_token" is not a valid Ident
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
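The panic message suggests that an identifier is being built from the raw token text plus a `_token` suffix, and `#` is not a legal character in a Rust identifier. The sketch below illustrates the problem with a simplified ASCII-only identifier check and a hypothetical naming scheme that spells the hashes out; neither function is part of Biome's codegen, and the `hashN` naming is an assumption made purely for illustration.

```rust
/// Simplified, ASCII-only approximation of "is this a valid Rust identifier?".
/// It ignores keywords, the lone `_`, and Unicode identifiers.
fn is_ascii_ident(s: &str) -> bool {
    let mut chars = s.chars();
    match chars.next() {
        Some(c) if c == '_' || c.is_ascii_alphabetic() => {}
        _ => return false,
    }
    chars.all(|c| c == '_' || c.is_ascii_alphanumeric())
}

/// Hypothetical naming scheme: a run of `#` is spelled `hashN` so that the
/// resulting field name is identifier-safe. Other token text is used as-is.
fn token_field_name(token_text: &str) -> String {
    let name = if !token_text.is_empty() && token_text.bytes().all(|b| b == b'#') {
        format!("hash{}", token_text.len())
    } else {
        token_text.to_string()
    };
    format!("{name}_token")
}
```

Here `"##_token"` fails the identifier check, while the spelled-out `hash2_token` passes it.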

github-actions bot added the A-Parser (Area: parser) and A-Tooling (Area: internal tools) labels Feb 26, 2025

AugustinMauroy changed the title ~~Markdown header~~ feat(parse/md): Markdown header Feb 26, 2025

ematipico changed the title ~~feat(parse/md): Markdown header~~ feat(parse/md): markdown header Feb 26, 2025

feat(parse/md): markdown header

22c15da

AugustinMauroy force-pushed the markdown-header branch from d46c1ba to 22c15da Compare February 26, 2025 14:02

AugustinMauroy marked this pull request as ready for review February 26, 2025 14:02

AugustinMauroy changed the title ~~feat(parse/md): markdown header~~ feat(parse/md): markdown header support in lever Feb 26, 2025

AugustinMauroy changed the title ~~feat(parse/md): markdown header support in lever~~ feat(parse/md): markdown header support in lexer Feb 26, 2025

AugustinMauroy added 2 commits February 26, 2025 15:52

test: add more case

a09ced8

test: add not an heading

91dbf33

dyc3 requested changes Feb 26, 2025


update parser

769030c

dyc3 requested changes Feb 28, 2025


update

e8ac551
