#13943: format, add a new `pp_with_size` function #13947

Octachron · 2025-04-08T15:52:36Z

This PR proposes to implement the @<n> format specifier as a pp_with_size function on Format.formatter rather than as a rewriting rule on the stream of formatting instructions.

This makes the behaviour of the various format string interpreters (sprintf, fprintf, Format_doc.Doc.msg) more uniform in term of the interaction between @<n> and %a and %t (see #13943 for an example).

With this PR, the semantics for pp_with_size is that it affects the size of the next token printed with the pp_print_string primitive, in the absence of new lines, box opening or closing, or tabulation breaks.

jberdine

This LGTM, interface and implementation. I made a few comments, but this is a definite improvement.

jberdine · 2025-04-08T21:51:24Z

stdlib/format.ml

+let with_next_size state f s = match state.pp_stream_state with
+  | Next_size n -> set_stream_state state Normal; n
+  | Newline | Normal -> f state s


I can't say that I'm super fond of this sort of side-effecting function that also returns a value, but not a big deal for a narrowly-scoped internal function.

On the other hand, it is short, has only 3 uses, and I think I would find the call sites easier to read if this function were just inlined.

jberdine · 2025-04-08T22:26:06Z

stdlib/format.ml


 (* Format a slice *)
 let format_pp_substring state size ~pos ~len source =
  state.pp_space_left <- state.pp_space_left - size;
  pp_output_substring state ~pos ~len source;
-  state.pp_is_new_line <- false
+  set_stream_state state Normal

 (* Format a string by its length, if not empty *)
 let format_string state s =


There is perhaps a question of whether pp_with_size should interact with the strings that can be printed by pp_print_custom_break. The current state of the PR implementation does not adapt the call to pp_string_width on the next line to with_next_size. Users might think that a call to pp_with_size followed immediately by a call to pp_print_custom_break would cause the "fits" string to be considered as it it had the size indicated by pp_with_size. To support that, the implementation of pp_print_custom_break would need to adapt pp_string_width state before (line 729/749) to use with_next_size, and I think that is it. Maybe that is not desired though, and pp_with_size could just be documented as not applying to strings printed by pp_print_custom_break.

Applying the with_size modifier to the fits part of the breaks makes sense to me. We could choose to affect the whole fitting text but it is probably more regular to affect only the before part to keep the strict separation between the out_space and out_string lower-level instructions.

Thinking more about this, for pp_print_custom_break ppf ~fits:(s1, n, s2) ~breaks:(s3, m, s4), the sizes of s1 n spaces and s2 are all relevant in the case that no break is inserted at the break hint. In case a break is inserted at the break hint, then the sizes of m spaces and s4 are relevant. The size of s3 is not since it is emitted before the line break no matter what its size is.

So if user code is using strings that need size-correction, it would be consistent to support specifying the size of fits, either piece by piece or as a whole, and the size of the m and s4 parts of breaks, again either piece by piece or as a whole. It would also be consistent to not support specifying the size of any of these strings. I don't think that the @<N> mechanism can be used to support the general case, so I worry that using it to support some incomplete portion is just a step down a path leading to a dead end. If it becomes necessary to support specifying the size of these strings, it seems that we will need to add a size for each of the strings, essentially storing arguments to print_as instead of print_string.

So I think that the right approach is to not have pp_with_size interact with the pp_print_custom_break strings at all.

jberdine · 2025-04-08T22:35:39Z

stdlib/format.mli

+(** [pp_with_size ppf w] sets the width of the next element to be printed to
+    [w]. *)


I think it would help to clarify that pp_with_size does not affect the apparent size of the next element if there are any intervening box opens, box closes, or break hints that break. I think that to understand this function requires users to think not in terms of a tree of boxes, but to consider its fringe as a linear sequence of "elements" (that end up being strings). It is not clear to me that this will be obvious.

I agree that some clarification is required, what about

[pp_with_size ppf w] sets the width of the next element to be printed on the current line to [w], this does not affect the width of boxes nor the width of elements printed after a linebreak.

?

I think that still leaves the possibility for people to expect that @<10>@[%s will affect the size of the string given for the %s. So perhaps:

[pp_with_size ppf w] sets the width of the next element to be printed on the current line to [w]; this does not have any effect if there are any line breaks, tabulation breaks, or any boxes are opened or closed before the next element.

jberdine · 2025-04-08T22:41:16Z

testsuite/tests/lib-format/with_size.ml

+
+let () = test "foo@ @<1000>bar"
+let () = test "foo@ @<1000>%s" "bar"
+let () = test "foo@ @<1000>%a" Format.pp_print_string "bar"


It might be good to have another test case similar to this one, but such as

t () = test "foo@ @<1000>%a" (fun ppf -> Format.fprintf ppf "@[bar@]")

to illustrate that using boxes will defeat with_size.

gasche · 2025-04-09T14:31:16Z

My understanding of the current documentation of format is that @ should be followed by a "conversion specification" (the %a, %d, %s, %t stuff), and that the whole printing of that conversion will be shown "as if it was of size ...". If I understand your comments correctly, this works well for "leaf" conversions like %d, %s, but not for "composed" or "nested" conversions like %a or %t that may include arbitrary boxes.

You propose to change the specific so that it behaves differently on "nested" conversions like %a and %t. Your proposal is more or less to act on the leftmost "leaf" node of the underlying format.

Another option would be to simply reject those cases, with a failure when applying @ on non-leaf formats. (We could probably even make it a static failure if we want; adding a dynamic failure sounds a bit risky and then we would rather suggest to ignore it altogether.)

Finally, one may wonder whether the "full" feature could be supported. (For example we could begin a "Start_with_size" constructor before and a End_with_size token after the conversion, and hope to deal with that correctly in the printer.) My understanding from a distance is that it's not clear what it would mean several tokens together have to take a certain size (how to spread the size among the list of tokens?). Maybe an attempt would be as follows:

print the tokens up to the size
after that size, print all the rest as zero-size tokens
if we haven't reached the desired size when the token substream stops, print an empty item with the remaining size

gasche · 2025-04-09T14:43:32Z

Sherlocode finds uses of @<0>, of @<1> and of @<2>. @<2> is almost always followed by a space. @<1> is sometimes followed by formatting boxes (@[...@]) and sometimes by %a, and it looks like only @c-cube uses it. @<0> is mostly followed by %s, and sometimes used by a single character.

I think it would be interesting to look at the usage of @<1>%a to understand whether the behavior you propose fits the intent of the author of the code. If it does not, I would be in favor of rejecting the combination altogether, instead of silently doing something unintended. (I'm not particularly pushing for the hard route of implementing the feature in its full generality.)

c-cube · 2025-04-09T14:54:03Z

(I only recall using @<1>… in combination with a unicode symbol that I know takes one column when displayed, for context. I might forget more… cursed uses?)

Octachron · 2025-04-09T15:46:50Z

@gasche , beware that Sherlocode is dropping non-ascii characters, thus some of @c-cube's code is incorrectly displayed.

@c-cube, speaking of which, correcting by hand the width of unicode characters will not be be needed anymore with 5.4 where Format counts the number of unicode scalar values to approximate the width of strings.

Enumerating all use cases on Sherlodoc:

@<1>@{<Green>*@}: the size is currently dropped silently. With this PR this is equivalent to @{<Green>@<1>*@} if the tag is implemented with open_mark_tag, otherwise part of the escape sequence will be printed with size 1.
However, a better implementation would be to use @<0> for the escape sequence itself.
Format.fprintf ppf "@<0>%a" ansi_format format the behavior will be improved by this PR
(depending on how the ansi escape code is sent).
@<1>←%a always works fine
@<1>←<lazy>, always applies the size modifier to the whole string literal @<1>←<lazy> contrarily to what the documentation says.
@<n>%s always works fine.

Thus there is one and half case where the behavior is improved, none where it is degraded.

jberdine · 2025-04-10T22:16:16Z

This is diverging from the current PR, but one alternative to

Finally, one may wonder whether the "full" feature could be supported. (For example we could begin a "Start_with_size" constructor before and a End_with_size token after the conversion, and hope to deal with that correctly in the printer.) My understanding from a distance is that it's not clear what it would mean several tokens together have to take a certain size (how to spread the size among the list of tokens?).

would be to add support for specifying the size of boxes. In this case "@<10>@[%a@]" would specify that the entirety of the elements produced by the formatter given to the %a would be formatted in an unbreakable (Pp_fits) box, and that box would then be considered to have size 10. This would avoid questions of how to spread the size among elements.

jberdine · 2025-04-10T22:36:46Z

correcting by hand the width of unicode characters will not be be needed anymore with 5.4 where Format counts the number of unicode scalar values to approximate the width of strings.

Although use cases expecting a modularspace font may still want to specify e.g. @<2>⊢.

gasche · 2025-04-16T16:56:28Z

I have a bad uninformed gut feeling about this PR. I agree that the behavior is slightly better than the current behavior wrt. %a or %t, but it comes with a change to the specification that I think makes the specification worse.

I consider the following futures:

we could do nothing at all
we could merge the implementation change, but keep the previous specification (which is still not correctly implemented)
we could put some effort in implementing the specification correctly, if we believed it was worth it

(Also: the code examples I found in the wild gave me the impression that the user code was written with the current specification in mind, and not with the newly proposed specification in mind.)

Octachron added 2 commits April 8, 2025 17:03

Format: Magic_size tests

5ac81b4

Format: @<n> as a primitive instruction.

0de71e2

jberdine approved these changes Apr 8, 2025

View reviewed changes

dra27 linked an issue Apr 16, 2025 that may be closed by this pull request

Format length specifier (@<n>) silently ignored with %a #13943

Open

dra27 assigned gasche Apr 16, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

#13943: format, add a new `pp_with_size` function #13947

#13943: format, add a new `pp_with_size` function #13947

Octachron commented Apr 8, 2025

jberdine left a comment

jberdine Apr 8, 2025

jberdine Apr 8, 2025

Octachron Apr 9, 2025

jberdine Apr 10, 2025

jberdine Apr 8, 2025

Octachron Apr 9, 2025

jberdine Apr 10, 2025

jberdine Apr 8, 2025

gasche commented Apr 9, 2025

gasche commented Apr 9, 2025

c-cube commented Apr 9, 2025

Octachron commented Apr 9, 2025

jberdine commented Apr 10, 2025

jberdine commented Apr 10, 2025

gasche commented Apr 16, 2025

		(** [pp_with_size ppf w] sets the width of the next element to be printed to
		[w]. *)

#13943: format, add a new pp_with_size function #13947

Are you sure you want to change the base?

#13943: format, add a new pp_with_size function #13947

Conversation

Octachron commented Apr 8, 2025

jberdine left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

gasche commented Apr 9, 2025

gasche commented Apr 9, 2025

c-cube commented Apr 9, 2025

Octachron commented Apr 9, 2025

jberdine commented Apr 10, 2025

jberdine commented Apr 10, 2025

gasche commented Apr 16, 2025

#13943: format, add a new `pp_with_size` function #13947

#13943: format, add a new `pp_with_size` function #13947