-
Notifications
You must be signed in to change notification settings - Fork 187
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
coerce to the scale’s type #532
Conversation
(There’s a small potential optimization here where if we know the scale type will be quantitative or temporal before we’ve computed the channel values, say because the scale type was specified explicitly, then we can avoid the extra copy and instead fold the coercion and typed array into the channel value construction. But we can’t do this in general because we need to know the channel values in order to infer the scale type. In any case hopefully an extra copy won’t add too much overhead, and I think it’s worth the cost for predictability regardless.) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Definitely useful.
I would like to clarify the handling of dates. The regexp doesn’t match dates with a space instead of a T, like "2014-12-30 00:00:00+00:00", which happen in the wild and are a potential source of error (as you mention in https://observablehq.com/@mbostock/plotting-bens-google-fit-data : this format is parsed by Chrome but not Safari). Shouldn't we in this case replace the space by a T, rather than default to NaN?
Also, since there are many flavors of ISO 8601, we might want to document precisely the formats we're expecting with this regexp, and those we're not covering.
Paraphrasing the RegExp we have:
- year: YYYY (and for years outside 0000-9999, -YYYYYY or +YYYYYY)
- month: MM
- day: DD
- hour: hh
- minutes: mm
- seconds: ss
- milliseconds: mmm
- timezone: empty (local time), Z (UTC +00:00), +hh:mm (UTC +hh:mm)
for the date, the year is mandatory, optionally followed by month, and day:
- (±YY)YYYY
- (±YY)YYYY-MM
- (±YY)YYYY-MM-DD
hour:minutes is optional, and optionally followed by seconds, milliseconds
- hh:mm
- hh:mm:ss
- hh:mm:ss.mmm
timezone is optional
overall this makes a lot of combinations: (year)(Ttime)?(tz)?
Some parts of ISO 8601 that are not covered: time parts, week, day-of-year, 10th of seconds…
Here's a list of examples that are matched:
- 2021-09-06T06:57:38 => 2021-09-06T06:57:38 local time
- 2021-09-06T06:57:38+00:00 => 2021-09-06T06:57:38 UTC
- 2021-09-06T06:57:38Z => 2021-09-06T06:57:38 UTC
- 2021-09-06T06:57:38.123+00:00 => 2021-09-06T06:57:38.123 UTC
- +012000-09-01 => +012000-09-01 00:00 UTC
- -000100-09-01T02:02 => -000100-09-01T01:52:39.000Z // don't expect hours for far-away dates to work?
- 1402 => 1402-01-01 00:00 UTC
- 2008 => 2008-01-01 00:00 UTC
- 2008-03 => 2008-03-01 00:00 UTC
- 2021-09-06 => 2021-09-06 00:00 UTC
- 2021T06:57:38Z => 2021-01-01T06:57:38.000Z UTC
How can we document this in README without doing a lot of "paraphrasing"? Is there a reference page we can point to that would cover exactly what is covered here?
|
I don’t feel like we need to document this exhaustively. We can just say it’s a ISO 8601 date or date-time.
I’m not sure we want to broaden scope here beyond ISO 8601 into RFC 3339. If we do that, it’s more likely someone wouldn’t notice when their string is not compatible with the Date constructor in all browsers. The goal here isn’t to parse all possible date strings automatically; it’s to recommend a standard interchange format that works reliably. Formats outside that recommendation will need to be handled specially either by parsing yourself or using the timeFormat option as suggested in #535. |
If I read RFC 3339 section 5.6 correctly, it doesn't seem to include But it makes me wonder, shouldn't we adopt d3.isoParse instead—as the default timeFormat function of #535? Though the code is a bit different, the results are identical in all my tests of what I understand the format to cover*. This way, documentation and tests would be centralized in d3-time-format. [*] A difference is that isoParse("99-02") returns a date (1st Feb 1999), and isoParse("12") returns (1st January 1912)—this regexp doesn't cover these cases, and the parser returns respectively NaN and 12 (12ms after EPOCH). |
d3.isoParse doesn’t do any validation—it just passes through to the Date constructor. So I don’t advise that because the goal here again is to recommend an interchange format for dates. I do think it’d be worthwhile to have a parse function that goes along with isoformat if we want to make this code available outside of Plot. |
Here’s parse for isoformat, WDYT? mbostock/isoformat#1 |
I have merged and published isoformat 0.2. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added a paragraph to README and CHANGELOG—not sure it's enough (do we want to repeat some of the guidance (about typing data), or details (such as "any non-string values are coerced to number first and treated as milliseconds since UNIX epoch")?
// browsers. (In the future, this could be made more liberal if desired, though | ||
// it is still generally preferable to do date parsing yourself explicitly, | ||
// rather than rely on Plot.) Any non-string values are coerced to number first | ||
// and treated as milliseconds since UNIX epoch. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe these comments could be reflected in the README?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will take a pass at the README before I merge. Thank you! 👍
src/scales.js
Outdated
function coerceDate(x) { | ||
return x instanceof Date ? x | ||
: typeof x === "string" ? isoParse(x) | ||
: new Date(x == null ? NaN : +x); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Returning a Date instance for undefined or NaN input feels wrong here, especially because our defined helper considers invalid dates to be defined. We should probably return undefined instead of an invalid date.
Once the scale type is known, we can coerce the domain and channel values to match. This makes it easier to correct for untyped data by setting the scale type explicitly, rather than relying on the user to fix their data; it also makes Plot behave more consistently and predictably.
For example, consider a simple line plot. If we forget to pass typed data and the Date and Close columns are strings, they are interpreted as ordinal values rather than temporal and quantitative, respectively. Oops!
We might try to fix this by setting the scale types explicitly. However, since the data are still strings, and since Plot uses d3.min and d3.max to compute the quantitative domain, it doesn’t work very well! The inferred domains are ["2013-05-13", "2018-05-11"] = [Invalid Date, Invalid Date] and ["100.110001", "99.989998"] = [100.110001, 99.989998] respectively. Oops!
We can correct by setting a scale.transform, but this is a relatively obscure feature.
With this PR, the input data are automatically coerced to match the scale’s type, giving a more obvious fix to the common case of trying to use strings as temporal or quantitative data. (I.e., with this PR, the second example above produces the desired chart shown in the third example.) In the future, we could even consider requiring an explicitly-specified scale type when the inferred type is ordinal and the inferred domain has high cardinality (say, forty or more); this would cause the first chart to error, at which point you could enter the scale types to correct the problem, or better, type the data before passing it to Plot.
Related #74 #513.