Skip to content

#13943: format, add a new pp_with_size function #13947

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 2 commits into
base: trunk
Choose a base branch
from

Conversation

Octachron
Copy link
Member

This PR proposes to implement the @<n> format specifier as a pp_with_size function on Format.formatter rather than as a rewriting rule on the stream of formatting instructions.

This makes the behaviour of the various format string interpreters (sprintf, fprintf, Format_doc.Doc.msg) more uniform in term of the interaction between @<n> and %a and %t (see #13943 for an example).

With this PR, the semantics for pp_with_size is that it affects the size of the next token printed with the pp_print_string primitive, in the absence of new lines, box opening or closing, or tabulation breaks.

Copy link
Contributor

@jberdine jberdine left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This LGTM, interface and implementation. I made a few comments, but this is a definite improvement.

Comment on lines +207 to +209
let with_next_size state f s = match state.pp_stream_state with
| Next_size n -> set_stream_state state Normal; n
| Newline | Normal -> f state s
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't say that I'm super fond of this sort of side-effecting function that also returns a value, but not a big deal for a narrowly-scoped internal function.

On the other hand, it is short, has only 3 uses, and I think I would find the call sites easier to read if this function were just inlined.


(* Format a slice *)
let format_pp_substring state size ~pos ~len source =
state.pp_space_left <- state.pp_space_left - size;
pp_output_substring state ~pos ~len source;
state.pp_is_new_line <- false
set_stream_state state Normal

(* Format a string by its length, if not empty *)
let format_string state s =
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is perhaps a question of whether pp_with_size should interact with the strings that can be printed by pp_print_custom_break. The current state of the PR implementation does not adapt the call to pp_string_width on the next line to with_next_size. Users might think that a call to pp_with_size followed immediately by a call to pp_print_custom_break would cause the "fits" string to be considered as it it had the size indicated by pp_with_size. To support that, the implementation of pp_print_custom_break would need to adapt pp_string_width state before (line 729/749) to use with_next_size, and I think that is it. Maybe that is not desired though, and pp_with_size could just be documented as not applying to strings printed by pp_print_custom_break.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Applying the with_size modifier to the fits part of the breaks makes sense to me. We could choose to affect the whole fitting text but it is probably more regular to affect only the before part to keep the strict separation between the out_space and out_string lower-level instructions.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thinking more about this, for pp_print_custom_break ppf ~fits:(s1, n, s2) ~breaks:(s3, m, s4), the sizes of s1 n spaces and s2 are all relevant in the case that no break is inserted at the break hint. In case a break is inserted at the break hint, then the sizes of m spaces and s4 are relevant. The size of s3 is not since it is emitted before the line break no matter what its size is.

So if user code is using strings that need size-correction, it would be consistent to support specifying the size of fits, either piece by piece or as a whole, and the size of the m and s4 parts of breaks, again either piece by piece or as a whole. It would also be consistent to not support specifying the size of any of these strings. I don't think that the @<N> mechanism can be used to support the general case, so I worry that using it to support some incomplete portion is just a step down a path leading to a dead end. If it becomes necessary to support specifying the size of these strings, it seems that we will need to add a size for each of the strings, essentially storing arguments to print_as instead of print_string.

So I think that the right approach is to not have pp_with_size interact with the pp_print_custom_break strings at all.

Comment on lines +241 to +242
(** [pp_with_size ppf w] sets the width of the next element to be printed to
[w]. *)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would help to clarify that pp_with_size does not affect the apparent size of the next element if there are any intervening box opens, box closes, or break hints that break. I think that to understand this function requires users to think not in terms of a tree of boxes, but to consider its fringe as a linear sequence of "elements" (that end up being strings). It is not clear to me that this will be obvious.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that some clarification is required, what about

[pp_with_size ppf w] sets the width of the next element to be printed on the current line to [w], this does not affect the width of boxes nor the width of elements printed after a linebreak.

?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that still leaves the possibility for people to expect that @<10>@[%s will affect the size of the string given for the %s. So perhaps:

[pp_with_size ppf w] sets the width of the next element to be printed on the current line to [w]; this does not have any effect if there are any line breaks, tabulation breaks, or any boxes are opened or closed before the next element.


let () = test "foo@ @<1000>bar"
let () = test "foo@ @<1000>%s" "bar"
let () = test "foo@ @<1000>%a" Format.pp_print_string "bar"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be good to have another test case similar to this one, but such as

t () = test "foo@ @<1000>%a" (fun ppf -> Format.fprintf ppf "@[bar@]")

to illustrate that using boxes will defeat with_size.

@gasche
Copy link
Member

gasche commented Apr 9, 2025

My understanding of the current documentation of format is that @ should be followed by a "conversion specification" (the %a, %d, %s, %t stuff), and that the whole printing of that conversion will be shown "as if it was of size ...". If I understand your comments correctly, this works well for "leaf" conversions like %d, %s, but not for "composed" or "nested" conversions like %a or %t that may include arbitrary boxes.

You propose to change the specific so that it behaves differently on "nested" conversions like %a and %t. Your proposal is more or less to act on the leftmost "leaf" node of the underlying format.

Another option would be to simply reject those cases, with a failure when applying @ on non-leaf formats. (We could probably even make it a static failure if we want; adding a dynamic failure sounds a bit risky and then we would rather suggest to ignore it altogether.)

Finally, one may wonder whether the "full" feature could be supported. (For example we could begin a "Start_with_size" constructor before and a End_with_size token after the conversion, and hope to deal with that correctly in the printer.) My understanding from a distance is that it's not clear what it would mean several tokens together have to take a certain size (how to spread the size among the list of tokens?). Maybe an attempt would be as follows:

  • print the tokens up to the size
  • after that size, print all the rest as zero-size tokens
  • if we haven't reached the desired size when the token substream stops, print an empty item with the remaining size

@gasche
Copy link
Member

gasche commented Apr 9, 2025

Sherlocode finds uses of @<0>, of @<1> and of @<2>. @<2> is almost always followed by a space. @<1> is sometimes followed by formatting boxes (@[...@]) and sometimes by %a, and it looks like only @c-cube uses it. @<0> is mostly followed by %s, and sometimes used by a single character.

I think it would be interesting to look at the usage of @<1>%a to understand whether the behavior you propose fits the intent of the author of the code. If it does not, I would be in favor of rejecting the combination altogether, instead of silently doing something unintended. (I'm not particularly pushing for the hard route of implementing the feature in its full generality.)

@c-cube
Copy link
Contributor

c-cube commented Apr 9, 2025

(I only recall using @<1>… in combination with a unicode symbol that I know takes one column when displayed, for context. I might forget more… cursed uses?)

@Octachron
Copy link
Member Author

@gasche , beware that Sherlocode is dropping non-ascii characters, thus some of @c-cube's code is incorrectly displayed.

@c-cube, speaking of which, correcting by hand the width of unicode characters will not be be needed anymore with 5.4 where Format counts the number of unicode scalar values to approximate the width of strings.

Enumerating all use cases on Sherlodoc:

  • @<1>@{<Green>*@}: the size is currently dropped silently. With this PR this is equivalent to @{<Green>@<1>*@} if the tag is implemented with open_mark_tag, otherwise part of the escape sequence will be printed with size 1.
    However, a better implementation would be to use @<0> for the escape sequence itself.
  • Format.fprintf ppf "@<0>%a" ansi_format format the behavior will be improved by this PR
    (depending on how the ansi escape code is sent).
  • @<1>←%a always works fine
  • @<1>←<lazy>, always applies the size modifier to the whole string literal @<1>←<lazy> contrarily to what the documentation says.
  • @<n>%s always works fine.

Thus there is one and half case where the behavior is improved, none where it is degraded.

@jberdine
Copy link
Contributor

This is diverging from the current PR, but one alternative to

Finally, one may wonder whether the "full" feature could be supported. (For example we could begin a "Start_with_size" constructor before and a End_with_size token after the conversion, and hope to deal with that correctly in the printer.) My understanding from a distance is that it's not clear what it would mean several tokens together have to take a certain size (how to spread the size among the list of tokens?).

would be to add support for specifying the size of boxes. In this case "@<10>@[%a@]" would specify that the entirety of the elements produced by the formatter given to the %a would be formatted in an unbreakable (Pp_fits) box, and that box would then be considered to have size 10. This would avoid questions of how to spread the size among elements.

@jberdine
Copy link
Contributor

correcting by hand the width of unicode characters will not be be needed anymore with 5.4 where Format counts the number of unicode scalar values to approximate the width of strings.

Although use cases expecting a modularspace font may still want to specify e.g. @<2>⊢.

@dra27 dra27 linked an issue Apr 16, 2025 that may be closed by this pull request
@gasche
Copy link
Member

gasche commented Apr 16, 2025

I have a bad uninformed gut feeling about this PR. I agree that the behavior is slightly better than the current behavior wrt. %a or %t, but it comes with a change to the specification that I think makes the specification worse.

I consider the following futures:

  • we could do nothing at all
  • we could merge the implementation change, but keep the previous specification (which is still not correctly implemented)
  • we could put some effort in implementing the specification correctly, if we believed it was worth it

(Also: the code examples I found in the wild gave me the impression that the user code was written with the current specification in mind, and not with the newly proposed specification in mind.)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Format length specifier (@<n>) silently ignored with %a
4 participants