Data Column Names #4

Open
spkaluzny opened this issue May 5, 2019 · 2 comments

@spkaluzny
Collaborator

I think we should reconsider the names for the eruption data in R. The names from the TSV data file are:
eruptionID geyser eruption_time_epoch has_seconds exact ns ie E A wc ini maj min q duration entrant observer eruption_comment time_updated time_entered associated_primaryID other_comments
It would be good to have descriptive names with consistent character case. Names of similar length would be good as well.

I realize that the data has been available for some time with the above names from the archive and I don't know if using different names in R would have any ramifications.

@taltstidl

taltstidl commented May 10, 2019

I agree, the current column names are a product of historical developments and are not quite normalized or self-explanatory. I'm including a draft of possible new names here along with a short description (coming later 😉):

  • eruption_id: The unique database identifier of the eruption.
  • geyser: The unique name of the geyser that erupted.
  • time: The timestamp of the eruption. Note that there are modifiers that can change the interpretation of this timestamp, which are also listed below.
  • has_seconds: Whether the eruption timestamp was recorded with second precision. If not set, any seconds of the timestamp should be disregarded.
  • exact: A timestamp modifier which indicates that the exact start time was recorded.
  • near_start: A timestamp modifier which indicates that the exact start time was not recorded, but circumstantial evidence suggests that the time was near the actual start of the eruption.
  • in_eruption: A timestamp modifier which indicates that the start time was recorded when the geyser was already in eruption.
  • electronic: A timestamp modifier which indicates that the start time was inferred using electronic monitoring equipment such as temperature loggers and seismographs.
  • approximate: A timestamp modifier which indicates that the given start time is only a rough estimate, usually based on post-eruptive evidence.
  • webcam: Whether the eruption was seen on a webcam rather than in-basin.
  • initial: Whether the eruption was the initial one in a series of eruptions. This is only applicable to geysers which erupt in series.
  • major: Whether the eruption was of the major type. This is only applicable to geysers that have minor and major eruptions.
  • minor: Whether the eruption was of the minor type. This is only applicable to geysers that have minor and major eruptions.
  • questionable: Indicates that there is uncertainty about the report, usually when the observation conditions make it hard to determine the geyser with certainty.
  • duration: The duration of the eruption as raw text.
  • entrant: The username of the user who entered the eruption.
  • observer: The name of the person who observed the eruption. If not given, the observer coincides with the entrant.
  • comment: Comments on the eruption, usually consisting of more detailed observations on the eruption or the events leading to the eruption.
  • time_entered: The timestamp when the eruption was entered, as determined by the entry client.
  • time_updated: The timestamp when the eruption was last updated, as determined by the entry client. If the eruption was not edited, this coincides with the timestamp when it was entered.
  • primary_id: The unique database identifier of the primary eruption. Reports of the same eruption are grouped together, with the most representative report being selected as the primary eruption.
  • other_comments: Comments by other people on the eruption.
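
To illustrate how has_seconds changes the interpretation of the timestamp, here is a minimal R sketch. It assumes that the archive's eruption_time_epoch column holds Unix seconds and that has_seconds is stored as 0/1; neither is confirmed in this thread, and the helper name eruption_time is hypothetical.

```r
# Hypothetical helper: convert the raw epoch column to POSIXct,
# disregarding the seconds when they were not recorded.
# Assumptions: `epoch` is Unix seconds, `has_seconds` is numeric 0/1.
eruption_time <- function(epoch, has_seconds) {
  epoch <- as.numeric(epoch)
  # Round down to the full minute where has_seconds is not set.
  trimmed <- ifelse(has_seconds == 1, epoch, epoch - epoch %% 60)
  as.POSIXct(trimmed, origin = "1970-01-01", tz = "UTC")
}

# Two reports with the same raw timestamp, one without recorded seconds.
times <- eruption_time(c(1557057845, 1557057845), c(1, 0))
format(times, "%Y-%m-%d %H:%M:%S")
```

The result is shown in UTC only to keep the example deterministic; converting to park local time would be a separate step.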

While ideally we would change the column names directly within the source TSV files, I'm a bit reluctant as it might break things for people already using our archive files. I'll bring it up at our next meeting though. Also, I'll be looking into adding our parsed durations (a numerical value of the duration in seconds) to the archive files.

@taltstidl

@spkaluzny I've updated the column descriptions. We've decided against renaming the columns within the archive files, so it's probably best to map the names within the gt_get_data function. If there's anything else I can do, please let me know.
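
A rough sketch of what such a mapping could look like in base R follows. The old-to-new pairs are an assumption inferred from the order of the two name lists in this thread (for example ns to near_start and E to electronic), and rename_eruption_columns is a hypothetical helper, not part of the package.

```r
# Assumed correspondence between archive names and the proposed names;
# inferred from list order in this thread, not confirmed by the maintainers.
column_map <- c(
  eruptionID = "eruption_id", geyser = "geyser",
  eruption_time_epoch = "time", has_seconds = "has_seconds",
  exact = "exact", ns = "near_start", ie = "in_eruption",
  E = "electronic", A = "approximate", wc = "webcam",
  ini = "initial", maj = "major", min = "minor", q = "questionable",
  duration = "duration", entrant = "entrant", observer = "observer",
  eruption_comment = "comment", time_updated = "time_updated",
  time_entered = "time_entered", associated_primaryID = "primary_id",
  other_comments = "other_comments"
)

# Hypothetical helper: rename the columns found in the map and
# leave any unknown columns untouched.
rename_eruption_columns <- function(df, map = column_map) {
  hits <- names(df) %in% names(map)
  names(df)[hits] <- unname(map[names(df)[hits]])
  df
}

# Small made-up data frame using a few of the archive names.
raw <- data.frame(eruptionID = 1L, geyser = "Grand", E = 0L, wc = 1L, extra = NA)
names(rename_eruption_columns(raw))
```

Matching by name rather than by position keeps the mapping valid even if the archive files reorder columns or add new ones.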
