How to format csv file for use as time series object with ts() command #2268

rvaillancourt · 2024-12-07T19:44:58Z

Hi All.
I am trying to reformat my time-series csv file so that I can create a time series object using the ts() function. A small portion of my csv file is attached here; it contains oceanographic data collected using a Webb glider. The data are collected at uneven time intervals and so must be wrangled into a new time series by windowing. For this time series I would use a one-month time window. I have been trying to do this according to section 5.9.4.3 of Dan Kelley's book, on windowing methods, but the cut command does not work with datetime-formatted data, and when I convert the times to numeric the code no longer works correctly.
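For readers hitting the same wall: base R ships a cut() method for date-time objects (cut.POSIXt) that accepts breaks = "month", so POSIXct times can be binned by calendar month without converting to numeric. A minimal sketch, using made-up times and values rather than the attached file (the names obsTime, x, and monthlyMean are illustrative only):

```r
# Made-up observation times (UTC) and values; not taken from testFile.csv.
obsTime <- as.POSIXct(c("2021-01-05 10:00", "2021-01-20 03:30",
                        "2021-02-11 18:45", "2021-04-02 07:15"),
                      format = "%Y-%m-%d %H:%M", tz = "UTC")
x <- c(4.1, 4.5, 3.9, 6.2)

# cut() dispatches to cut.POSIXt, which understands calendar units.
month <- cut(obsTime, breaks = "month")

# Mean per calendar month; a month with no data (here March) comes back as NA.
monthlyMean <- tapply(x, month, mean)
print(monthlyMean)
```

Because empty months are kept as levels, the result is already on an evenly spaced monthly grid, which is what ts() expects.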

On a related note, is there an error in the example code in Dan Kelley's book? I believe the ceiling and floor commands in the following line of code should be switched. This is how it appears on page 154:

C <- cut(y, breaks = seq(ceiling(min(y)), floor(max(y)), 5))
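The effect of the two orderings can be checked on a small made-up vector. This is only a sketch of how cut() behaves, not a statement about what the book intends; the names inner and outer are illustrative:

```r
# Made-up values; not from the book or the attached file.
y <- c(0.2, 3.7, 8.1, 12.4, 19.6)

# As quoted above: breaks start above min(y) and stop below max(y).
inner <- seq(ceiling(min(y)), floor(max(y)), 5)   # 1, 6, 11, 16
# Swapped: breaks start at or below min(y) and aim at or above max(y).
outer <- seq(floor(min(y)), ceiling(max(y)), 5)   # 0, 5, 10, 15, 20

# Values outside the outermost breaks are assigned NA by cut().
nInner <- sum(is.na(cut(y, breaks = inner)))
nOuter <- sum(is.na(cut(y, breaks = outer)))
cat("NA with inner breaks:", nInner, " NA with outer breaks:", nOuter, "\n")
```

Two caveats apply even to the swapped form: seq() only reaches the top value when the range is a multiple of the step, and cut() excludes a value equal to the lowest break unless include.lowest = TRUE is given.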

-Thanks,
Bob.

testFile.csv

dankelley · 2024-12-07T20:04:08Z

I had a quick look and see that bad data greatly outnumber good data, viz.

library(oce)
f <- "~/Downloads/testFile.csv"
d <- read.csv(f, header = TRUE)
d$time <- as.POSIXct(d$TIME, format="%m/%d/%Y %H:%M", tz = "UTC")
oce.plot.ts(d$time, d$temp, type = "p")
badTime <- is.na(d$time)
table(badTime)
bad <- which(badTime)[1]
d[bad + seq(-10, 10), ]

dankelley · 2024-12-07T20:08:36Z

Here's some more code, and the resulting graph. I always start by examining the data in as raw a form as possible, before deciding how to reduce it further. Past what I have now, the rest is just using binApply1D() as in my book, and in notes from @richardsc in another context.

library(oce)
f <- "~/Downloads/testFile.csv"
d <- read.csv(f, header = TRUE)
d$time <- as.POSIXct(d$TIME, format="%m/%d/%Y %H:%M", tz = "UTC")
badTime <- is.na(d$time)
bad <- badTime | is.na(d$temp) # do this for all variables of interest
d <- d[!bad, ]
png("oce2268.png")
oce.plot.ts(d$time, d$temp, type = "o", pch = 20, cex = 0.5)
dev.off() # close the device so that the PNG file is written
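The comment "do this for all variables of interest" can be handled in one step with base R's complete.cases(). A minimal sketch on a made-up data frame; the column name chl and the vector vars are hypothetical, chosen only for illustration:

```r
# Made-up rows imitating the file layout; an empty TIME string parses to NA.
d <- data.frame(TIME = c("01/05/2021 10:00", "", "02/11/2021 18:45",
                         "03/02/2021 07:15"),
                temp = c(4.1, NA, NA, 6.2),
                chl = c(0.8, NA, 1.1, 0.9))
d$time <- as.POSIXct(d$TIME, format = "%m/%d/%Y %H:%M", tz = "UTC")

# Keep only the rows where every variable of interest is present.
vars <- c("time", "temp", "chl")
good <- complete.cases(d[, vars])
dc <- d[good, ]
print(dc)
```

Whether to drop a row when any one variable is missing, or to filter each variable separately, depends on the analysis; the sketch shows the stricter option.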

richardsc · 2024-12-07T23:51:15Z

Hi Bob,

Thanks for submitting this here! Aside from having @dankelley to give advice, I generally find that it can be helpful to others to have a discussion that involves "advice" in an open forum like this rather than emails (which I can never find again).

I agree with @dankelley that much of the data in that file appears to be empty rows for some reason, but there are also lots of "missing" data even within the series. The plot below shows the times as a function of index, where I used rug() to show the NA times:

You mentioned that these are data from a Slocum glider, which both @dankelley and I have quite a bit of experience working with (see our not-yet-on-CRAN package called oceglider).

Glider data is pretty complicated, so I think it would help to know what it is you're trying to do with it. For example, gliders usually collect data by diving, so that the "time series" data is in fact a sort of mix between time and depth profiles. Looking at your csv, I don't see any column for "depth" or "pressure", so it's hard to know what is the right way to bin the data to get something useful.

I'd also point out that casting into a ts object might not be the right approach here, given that (I think) ts objects require equal spacing in time. You could do it if you interpolate or put NAs for the missing data, but again, this depends on the context for analysis.
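The two options mentioned here (NA placeholders or interpolation) might look like the following in base R. This is a sketch on made-up monthly values, and linear interpolation via approx() is just one possible gap-filling choice:

```r
# Regular monthly grid, with made-up observations for only four of six months.
months <- seq(as.POSIXct("2021-01-01", tz = "UTC"), by = "month", length.out = 6)
obsMonth <- months[c(1, 2, 4, 6)]
obsTemp <- c(4.0, 3.5, 6.0, 12.0)

# Option 1: place the observations on the grid, leaving NA where nothing was measured.
temp <- rep(NA_real_, length(months))
temp[match(obsMonth, months)] <- obsTemp
withGaps <- ts(temp, start = c(2021, 1), frequency = 12)

# Option 2: fill the gaps by linear interpolation (approx() ignores the NA pairs).
i <- seq_along(temp)
filled <- approx(i, temp, xout = i)$y
noGaps <- ts(filled, start = c(2021, 1), frequency = 12)
print(cbind(withGaps, noGaps))
```

Interpolating across long gaps invents structure, so for a series with mission-length outages the NA version may be the more honest starting point.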

Doing what I did in issue #2264 gives something like the below. Note that there are some subtleties here, because the NAs that get returned from binMean1D() cause issues with the polygon() call.

library(oce)

d <- read.csv('testFile.csv', header=TRUE)
d$time <- as.POSIXct(d$TIME, format="%m/%d/%Y %H:%M", tz = "UTC")
badTime <- is.na(d$time)
bad <- badTime | is.na(d$temp) # do this for all variables of interest
dc <- d[!bad, ]

## bin average in monthly chunks
tbreaks <- seq(min(dc$time), max(dc$time), by="1 month")
Tb <- binMean1D(dc$time, dc$temp, tbreaks)
Tbsd <- binApply1D(dc$time, dc$temp, tbreaks, sd, na.rm=TRUE)
oce.plot.ts(dc$time, dc$temp)
lines(Tb$xmids, Tb$result, lwd=3, col=2)
# polygon(c(Tb$xmids, rev(Tb$xmids)), c(Tb$result + Tbsd$result, rev(Tb$result)),
#         col=rgb(0, 0, 0, 0.25), border=NA)
# polygon(c(Tb$xmids, rev(Tb$xmids)), c(Tb$result - Tbsd$result, rev(Tb$result)),
#         col=rgb(0, 0, 0, 0.25), border=NA)
errorbars(Tbsd$xmids, Tb$result, 0, Tbsd$result, style = 1, col = 2, lwd = 2)
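On the polygon() subtlety: polygon() interprets NA coordinates as breaks between separate polygons, so an empty bin splits the shaded band into oddly shaped pieces. One workaround, sketched below with made-up bin statistics in plain vectors rather than the binMean1D() output, is to drop bins with non-finite values before building a single mean ± sd band:

```r
# Made-up bin centres, means and standard deviations; bin 3 is empty.
xmids <- 1:6
binMean <- c(4.0, 3.5, NA, 6.0, 9.0, 12.0)
binSd <- c(0.5, 0.4, NA, 0.6, 0.8, 0.7)

# Keep only the bins where both statistics are finite.
ok <- is.finite(binMean) & is.finite(binSd)

# Trace the upper edge left to right, then the lower edge right to left.
bandX <- c(xmids[ok], rev(xmids[ok]))
bandY <- c(binMean[ok] + binSd[ok], rev(binMean[ok] - binSd[ok]))

plot(xmids, binMean, type = "n", ylim = range(bandY),
     xlab = "bin centre", ylab = "value")
polygon(bandX, bandY, col = rgb(0, 0, 0, 0.25), border = NA)
lines(xmids[ok], binMean[ok], lwd = 3, col = 2)
```

Dropping the empty bins makes the band bridge the gap, which can suggest data that are not there; the error-bar approach used above avoids that.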

rvaillancourt · 2024-12-08T13:47:38Z

Hi Clark and Dan,
Thanks so much for your input, it is very useful to talk to others who have glider experience. My previous posting was too brief so may have led to some confusion. Here are some more details about the data and what I’m trying to do with it.

The data example I sent is a snippet from a much larger 9-year dataset created from multiple glider missions from the OOI Pioneer Array New England region that have been merged and then sorted by time. The original depth-resolved glider data were processed to calculate euphotic zone depths and averages for properties over that depth. Each row of data in the file I sent you is the depth-averaged value over the euphotic zone for a single ascent or descent of the glider at that latitude/longitude; for example, the 'clh' column holds the average values of chlorophyll a within the euphotic zone. (The euphotic zone depth column was arbitrarily removed from this dataset to save space.) The objective here is to develop a composite 2D picture of the bio-optical conditions across the PA site at monthly time steps, hence my need to obtain an average over a month. Averaging over a month will also create a dataset with even time intervals, which can then be cast as a time series object and decomposed to detect seasonal variability and long-term trends.

The many missing lines of data are those glider profiles that failed the minimum requirements for accurate determination of the euphotic zone depth, for reasons such as intermittent cloud cover, excess wave focussing at the ocean's surface, or a profile taken at night, when there is no sunlight with which to assess the euphotic zone. There are also gaps in the time series because of glider or instrument failures, so these data were not available from the glider DAC. The script Clark sent gives a solution for ridding the file of all these NAs.

The question I have now is: how do I take the monthly bin-average (the red line in Clark's plot) and turn it into a time series object that can be decomposed to look for seasonality and trend?

dankelley · 2024-12-08T13:52:56Z

Thanks for the info. I had plotted the lon-lat data and see interesting sampling protocols.

Maybe I'm missing something here. To get a time-series, use the ts() R function. Typing ?ts gives useful information and examples. But you almost certainly know that, so, as I say, I'm likely missing something. I just popped into the issue when I had a break from what is really keeping me busy -- end of term work.

richardsc · 2024-12-08T14:30:19Z

Hi Bob,

Thanks for the context. That helps me understand the data a bit better. Makes me think I should start looking into the OOI glider data to see what kinds of things might be relevant to the glider program I run on the Scotian Shelf, and whether some of the kinds of signals we see are found in other similar long-term glider datasets.

As for the time-series coercion, I think all you need to do is add:

## Create ts object
Tts <- ts(Tb$result, deltat=1/12)
plot(Tts)

to the end of the code I wrote above. Note that in this example I chose the deltat= argument to be in "years" (e.g. monthly data is an interval of 1/12).
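For the decomposition step mentioned earlier in the thread, it can help to give ts() a calendar start so that the cycle positions line up with months; frequency = 12 is equivalent to deltat = 1/12. The sketch below uses a synthetic monthly series and a hypothetical first-bin time t0, not the glider data:

```r
# Synthetic monthly means (4 years): seasonal cycle + slow trend + noise.
set.seed(1)
n <- 48
k <- 0:(n - 1)
monthly <- 10 + 5 * sin(2 * pi * k / 12) + 0.02 * k + rnorm(n, sd = 0.3)

# Hypothetical time of the first bin; with real data this would be taken
# from the first break (or bin centre) used in the binning step.
t0 <- as.POSIXct("2015-03-01", tz = "UTC")
lt <- as.POSIXlt(t0)
Tts <- ts(monthly, start = c(lt$year + 1900, lt$mon + 1), frequency = 12)

# Classical additive decomposition into trend, seasonal and random parts.
dec <- decompose(Tts)
plot(dec)
```

Note that decompose() needs at least two full periods, and stl() by default rejects series containing NA (na.action = na.fail), so empty monthly bins have to be dealt with before either is used.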

rvaillancourt · 2024-12-08T15:06:55Z

Thanks Clark and Dan. That is very helpful.

Using a glider 'swarm' is an experimental approach I'm taking to look at the bio-optics on the New England shelf. The profiler moorings would seem like the likelier candidate for a platform, except that most of those profilers come only within ca. 25 m of the surface, whereas the gliders most of the time come up to within a few meters. Since sunlight attenuates exponentially with depth, much of the euphotic zone would be above the shallowest sampling depth if I used the profilers. The trouble with the gliders is spatial biasing, but I'm trying to minimize this by averaging over large areas and time spans. I've attached my poster from OSM2024, where I presented these data for the 2021 glider year. Happy to talk to you more about it if you're interested.
~Bob

OSM2024_POSTER_V1.5.pdf

dankelley · 2024-12-08T15:32:56Z

Quick Q on the poster: I'm a bit mixed up on the notation regarding MLD. Are you saying you find the depth where sigma-theta is 0.03 kg/m^3 higher than the surface value? I've always found it tricky to think of MLD (hence some discussion in my 2018 book) because any definition that seems good in one place/season seems to be not so good at another place/season.

rvaillancourt · 2024-12-08T17:08:26Z

Yes, it is the depth at which sigma-theta increases by 0.03 kg/m^3 from the surface. I would rather know the turbulent depth (sensu Franks 2015), but given that all we have is T and S on the CTDs, a hydrographic mixed layer depth is all we can get from the gliders. We also have buoyancy frequency as a higher-vertical-resolution measure of density stratification. But the MLD does at least follow the expected changes with season, so that is mildly comforting.
~Bob

Robert D. Vaillancourt, Ph.D.
Professor of Oceanography
106 Brossman Hall
Millersville University
Millersville, PA 17551
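The threshold definition discussed here can be written down compactly. The sketch below is an illustration only, not the processing used for the poster: the function name mldThreshold is hypothetical, the "surface" value is assumed to be the shallowest sample, the crossing depth is linearly interpolated, and the profile is made up.

```r
# Depth at which sigma-theta first exceeds its shallowest value by `threshold`.
# Assumes depth is sorted from shallow to deep and has no missing values.
mldThreshold <- function(depth, sigmaTheta, threshold = 0.03) {
  excess <- sigmaTheta - sigmaTheta[1]
  i <- which(excess >= threshold)[1]
  if (is.na(i)) return(NA_real_)        # threshold never reached in this profile
  if (i == 1) return(depth[1])          # only possible if threshold <= 0
  # Linear interpolation between the two samples that bracket the crossing.
  d0 <- depth[i - 1]; d1 <- depth[i]
  e0 <- excess[i - 1]; e1 <- excess[i]
  d0 + (threshold - e0) * (d1 - d0) / (e1 - e0)
}

# Made-up profile: depth in m, sigma-theta in kg/m^3.
depth <- c(2, 5, 10, 15, 20, 30)
sigmaTheta <- c(25.000, 25.005, 25.010, 25.020, 25.060, 25.300)
mld <- mldThreshold(depth, sigmaTheta)
print(mld)
```

Other conventions (for example, a fixed reference depth instead of the shallowest sample) would give different values, which is part of the place- and season-dependence raised in the question.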

dankelley · 2024-12-21T10:00:39Z

Dear reporter,

Do you think this issue (as defined by its title) has been addressed? If so, please close it. If not, please add a comment explaining what remains to be done.

We use open issues as a sort of "to do" list for the project.

Thanks!

PS. This is a standardized reply.

dankelley · 2024-12-30T13:06:43Z

The original question seems to have been answered, and discussion finished about 3 weeks ago, so I will close this issue now. Reminder: we normally ask that reporters close issues, but developers may do it if discussion has stalled on a matter that has apparently been resolved.

dankelley closed this as completed Dec 30, 2024
