coerce to the scale’s type #532

mbostock · 2021-09-05T17:24:04Z

Once the scale type is known, we can coerce the domain and channel values to match. This makes it easier to correct for untyped data by setting the scale type explicitly, rather than relying on the user to fix their data; it also makes Plot behave more consistently and predictably.

For example, consider a simple line plot. If we forget to pass typed data and the Date and Close columns are strings, they are interpreted as ordinal values rather than temporal and quantitative, respectively. Oops!

Plot.plot({
  marks: [
    Plot.line(aapl, {x: "Date", y: "Close"})
  ]
})

We might try to fix this by setting the scale types explicitly. However, since the data are still strings, and since Plot uses d3.min and d3.max to compute the quantitative domain, it doesn’t work very well! The inferred domains are ["2013-05-13", "2018-05-11"] = [Invalid Date, Invalid Date] and ["100.110001", "99.989998"] = [100.110001, 99.989998] respectively. Oops!

Plot.plot({
  x: {
    type: "utc"
  },
  y: {
    type: "linear"
  },
  marks: [
    Plot.line(aapl, {x: "Date", y: "Close"})
  ]
})

We can correct by setting a scale.transform, but this is a relatively obscure feature.

Plot.plot({
  x: {
    transform: x => new Date(x)
  },
  y: {
    transform: y => +y
  },
  marks: [
    Plot.line(aapl, {x: "Date", y: "Close"})
  ]
})

With this PR, the input data are automatically coerced to match the scale’s type, giving a more obvious fix to the common case of trying to use strings as temporal or quantitative data. (I.e., with this PR, the second example above produces the desired chart shown in the third example.) In the future, we could even consider requiring an explicitly-specified scale type when the inferred type is ordinal and the inferred domain has high cardinality (say, forty or more); this would cause the first chart to error, at which point you could enter the scale types to correct the problem, or better, type the data before passing it to Plot.

Related #74 #513.

mbostock · 2021-09-05T17:35:13Z

(There’s a small potential optimization here where if we know the scale type will be quantitative or temporal before we’ve computed the channel values, say because the scale type was specified explicitly, then we can avoid the extra copy and instead fold the coercion and typed array into the channel value construction. But we can’t do this in general because we need to know the channel values in order to infer the scale type. In any case hopefully an extra copy won’t add too much overhead, and I think it’s worth the cost for predictability regardless.)

Fil

Definitely useful.

I would like to clarify the handling of dates. The regexp doesn’t match dates with a space instead of a T, like "2014-12-30 00:00:00+00:00", which happen in the wild and are a potential source of error (as you mention in https://observablehq.com/@mbostock/plotting-bens-google-fit-data : this format is parsed by Chrome but not Safari). Shouldn't we in this case replace the space by a T, rather than default to NaN?

Also, since there are many flavors of ISO 8601, we might want to document precisely the formats we're expecting with this regexp, and those we're not covering.

Paraphrasing the RegExp we have:

year: YYYY (and for years outside 0000-9999, -YYYYYY or +YYYYYY)
month: MM
day: DD
hour: hh
minutes: mm
seconds: ss
milliseconds: mmm
timezone: empty (local time), Z (UTC +00:00), +hh:mm (UTC +hh:mm)

for the date, the year is mandatory, optionally followed by month, and day:

(±YY)YYYY
(±YY)YYYY-MM
(±YY)YYYY-MM-DD

hour:minutes is optional, and optionally followed by seconds, milliseconds

hh:mm
hh:mm:ss
hh:mm:ss.mmm

timezone is optional

overall this makes a lot of combinations: (year)(Ttime)?(tz)?

Some parts of ISO 8601 that are not covered: time parts, week, day-of-year, 10th of seconds…

Here's a list of examples that are matched:

2021-09-06T06:57:38 => 2021-09-06T06:57:38 local time
2021-09-06T06:57:38+00:00 => 2021-09-06T06:57:38 UTC
2021-09-06T06:57:38Z => 2021-09-06T06:57:38 UTC
2021-09-06T06:57:38.123+00:00 => 2021-09-06T06:57:38.123 UTC
+012000-09-01 => +012000-09-01 00:00 UTC
-000100-09-01T02:02 => -000100-09-01T01:52:39.000Z // don't expect hours for far-away dates to work?
1402 => 1402-01-01 00:00 UTC
2008 => 2008-01-01 00:00 UTC
2008-03 => 2008-03-01 00:00 UTC
2021-09-06 => 2021-09-06 00:00 UTC
2021T06:57:38Z => 2021-01-01T06:57:38.000Z UTC

How can we document this in README without doing a lot of "paraphrasing"? Is there a reference page we can point to that would cover exactly what is covered here?

mbostock · 2021-09-06T14:40:54Z

~~I think we’re going for~~ I think this comes from RFC 3339 section 5.6, the “internet date/time format” (which allows a space or lowercase T unlike ISO 8601): https://datatracker.ietf.org/doc/html/rfc3339#section-5.6

mbostock · 2021-09-06T14:58:51Z

I don’t feel like we need to document this exhaustively. We can just say it’s a ISO 8601 date or date-time.

Shouldn't we in this case replace the space by a T, rather than default to NaN?

I’m not sure we want to broaden scope here beyond ISO 8601 into RFC 3339. If we do that, it’s more likely someone wouldn’t notice when their string is not compatible with the Date constructor in all browsers. The goal here isn’t to parse all possible date strings automatically; it’s to recommend a standard interchange format that works reliably. Formats outside that recommendation will need to be handled specially either by parsing yourself or using the timeFormat option as suggested in #535.

Fil · 2021-09-06T15:57:24Z

If I read RFC 3339 section 5.6 correctly, it doesn't seem to include "+" 6DIGIT years, and is more lax on "." 1*DIGIT sec-frac (we only accept "." 3DIGIT)? (I'm fine with that choice, I'm just looking for a reference to point to.)

But it makes me wonder, shouldn't we adopt d3.isoParse instead—as the default timeFormat function of #535? Though the code is a bit different, the results are identical in all my tests of what I understand the format to cover*. This way, documentation and tests would be centralized in d3-time-format.

[*] A difference is that isoParse("99-02") returns a date (1st Feb 1999), and isoParse("12") returns (1st January 1912)—this regexp doesn't cover these cases, and the parser returns respectively NaN and 12 (12ms after EPOCH).

mbostock · 2021-09-06T16:06:51Z

d3.isoParse doesn’t do any validation—it just passes through to the Date constructor. So I don’t advise that because the goal here again is to recommend an interchange format for dates.

I do think it’d be worthwhile to have a parse function that goes along with isoformat if we want to make this code available outside of Plot.

mbostock · 2021-09-06T16:48:59Z

Here’s parse for isoformat, WDYT? mbostock/isoformat#1

mbostock · 2021-09-06T20:06:21Z

I have merged and published isoformat 0.2.

Fil

I added a paragraph to README and CHANGELOG—not sure it's enough (do we want to repeat some of the guidance (about typing data), or details (such as "any non-string values are coerced to number first and treated as milliseconds since UNIX epoch")?

Fil · 2021-09-06T21:02:22Z

src/scales.js

+// browsers. (In the future, this could be made more liberal if desired, though
+// it is still generally preferable to do date parsing yourself explicitly,
+// rather than rely on Plot.) Any non-string values are coerced to number first
+// and treated as milliseconds since UNIX epoch.


Maybe these comments could be reflected in the README?

I will take a pass at the README before I merge. Thank you! 👍

mbostock · 2021-09-07T01:55:00Z

src/scales.js

+function coerceDate(x) {
+  return x instanceof Date ? x
+    : typeof x === "string" ? isoParse(x)
+    : new Date(x == null ? NaN : +x);


Returning a Date instance for undefined or NaN input feels wrong here, especially because our defined helper considers invalid dates to be defined. We should probably return undefined instead of an invalid date.

coerce to the scale’s type

95e2e1a

mbostock requested a review from Fil September 5, 2021 17:24

This was referenced Sep 5, 2021

Allow mark channels to specify a default scale type #533

Closed

When coercing temporal data, allow a timeFormat string or function to be specified #535

Closed

Fil mentioned this pull request Sep 6, 2021

Warning / logging system? #536

Closed

Fil requested changes Sep 6, 2021

View reviewed changes

upgrade isoformat

dfde475

mbostock requested a review from Fil September 6, 2021 20:12

Fil added 3 commits September 6, 2021 22:59

document type coercion

1b3428c

language

25ff424

changelog

e626354

Fil approved these changes Sep 6, 2021

View reviewed changes

mbostock added 2 commits September 6, 2021 18:49

update CHANGELOG

24d0897

coerce invalid dates to undefined

11c3228

mbostock commented Sep 7, 2021

View reviewed changes

update README

916d04b

mbostock mentioned this pull request Sep 7, 2021

ignore infinite values when setting a scale domain? #528

Closed

update README

9d7cde3

mbostock merged commit 0bbf70e into main Sep 7, 2021

mbostock deleted the mbostock/coerce-to-scale-type branch September 7, 2021 02:23

mbostock mentioned this pull request Feb 11, 2022

More warnings #755

Open

12 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

coerce to the scale’s type #532

coerce to the scale’s type #532

mbostock commented Sep 5, 2021 •

edited

Loading

mbostock commented Sep 5, 2021

Fil left a comment

mbostock commented Sep 6, 2021 •

edited

Loading

mbostock commented Sep 6, 2021

Fil commented Sep 6, 2021

mbostock commented Sep 6, 2021

mbostock commented Sep 6, 2021

mbostock commented Sep 6, 2021

Fil left a comment

Fil Sep 6, 2021

mbostock Sep 6, 2021

mbostock Sep 7, 2021

coerce to the scale’s type #532

coerce to the scale’s type #532

Conversation

mbostock commented Sep 5, 2021 • edited Loading

mbostock commented Sep 5, 2021

Fil left a comment

Choose a reason for hiding this comment

mbostock commented Sep 6, 2021 • edited Loading

mbostock commented Sep 6, 2021

Fil commented Sep 6, 2021

mbostock commented Sep 6, 2021

mbostock commented Sep 6, 2021

mbostock commented Sep 6, 2021

Fil left a comment

Choose a reason for hiding this comment

Fil Sep 6, 2021

Choose a reason for hiding this comment

mbostock Sep 6, 2021

Choose a reason for hiding this comment

mbostock Sep 7, 2021

Choose a reason for hiding this comment

mbostock commented Sep 5, 2021 •

edited

Loading

mbostock commented Sep 6, 2021 •

edited

Loading