
set up on a new dataset and predict on new data points #2

Open

andrewcztrack opened this issue Aug 25, 2020 · 3 comments

@andrewcztrack

Hi @asolin @wil-j-wil!!

I'm really interested in the models, and I am trying to set them up on a new dataset.

Could you please review the code below and check whether what I'm doing makes sense?

import sys
sys.path.insert(0, '../')
import numpy as np
from jax.experimental import optimizers
import matplotlib.pyplot as plt
import time
from sde_gp import SDEGP
import approximate_inference as approx_inf
import priors
import likelihoods
from utils import softplus_list, plot
from sklearn.preprocessing import StandardScaler

plot_intermediate = False

import yfinance as yf

Y = np.array(yf.download("SPY", start="2008-01-01", end="2020-12-30")['Close'])

X = np.linspace(1, 100, len(Y)).reshape(len(Y), 1)

Y = Y.reshape(len(Y), 1)


print('loading data ...')
#D = np.loadtxt('../../data/mcycle.csv', delimiter=',')
#X = D[:, 1:2]
#Y = D[:, 2:]
N = X.shape[0]

# Standardize
X_scaler = StandardScaler().fit(X)
y_scaler = StandardScaler().fit(Y)
Xall = X_scaler.transform(X)
Yall = y_scaler.transform(Y)

# Load cross-validation indices
cvind = np.loadtxt('../experiments/heteroscedastic/cvind.csv').astype(int)

# 10-fold cross-validation setup
nt = np.floor(cvind.shape[0]/10).astype(int)
cvind = np.reshape(cvind[:10*nt], (10, nt))

np.random.seed(123)
fold = 0

# Get training and test indices
test = cvind[fold, :]
train = np.setdiff1d(cvind, test)

# Set training and test data
X = Xall[train, :]
Y = Yall[train, :]
XT = Xall[test, :]
YT = Yall[test, :]

plt.figure(1, figsize=(12, 5))
plt.clf()
plt.plot(X_scaler.inverse_transform(X), y_scaler.inverse_transform(Y), 'k.', label='train')
plt.plot(X_scaler.inverse_transform(XT), y_scaler.inverse_transform(YT), 'r.', label='test')
plt.legend()
plt.xlabel('time (milliseconds)')
plt.ylabel('accelerometer reading');
@wil-j-wil (Collaborator) commented Aug 26, 2020

Hi @andrewcztrack

Glad to see you are trying things out. Everything looks OK except that you're using the cross-validation indices that we stored specifically for a different (smaller) data set. So you've truncated your data (I assume unintentionally).

To generate your own train/test split, you could use the code below:

# 10-fold cross-validation setup
ind_shuffled = np.random.permutation(N)
ind_split = np.stack(np.split(ind_shuffled, 10))  # 10 random batches of data indices
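#   (assumes N is exactly divisible by 10; np.split raises an error otherwise)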
fold = 0
# Get training and test indices
test = ind_split[fold]
train = np.concatenate(ind_split[np.arange(10) != fold])

This splits your data into a 90% train / 10% test split. However, if you simply want to train on all the data and then make predictions at unseen locations, then you can just set XT to the locations you want to predict at.
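For example, here is a minimal sketch of that second option (NumPy only; it assumes the Xall grid is evenly spaced, as it is in your script):

# a sketch: train on all the data, then predict at 10 future input locations
X = Xall                                  # use every observation for training
Y = Yall
dx = Xall[1, 0] - Xall[0, 0]              # spacing of the (standardized) input grid
XT = Xall[-1, 0] + dx * np.arange(1, 11)[:, None]  # 10 steps beyond the last input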

Also note that the code is currently scaling and shifting your data, which may or may not be desirable, but is something to keep in mind.

Any other questions, let me know.

Will

@andrewcztrack (Author)

Hi @wil-j-wil!!! Thank you so much! You are so generous with your time!!
Generally speaking, I want the model to be trained and then to predict into the future: trained on 90 days, predicting forward over the next 10 days.
I'd also like to understand the cross-validation experiment using your code below.
Essentially, two experiments with the models.
I assume it would be advantageous to standardise the X and Y values, since this is non-stationary, heteroscedastic data? Is my logic correct?

Note: I tried the code below, but I am getting an error.

import sys
sys.path.insert(0, '../')
import numpy as np
from jax.experimental import optimizers
import matplotlib.pyplot as plt
import time
from sde_gp import SDEGP
import approximate_inference as approx_inf
import priors
import likelihoods
from utils import softplus_list, plot
from sklearn.preprocessing import StandardScaler

plot_intermediate = False

import yfinance as yf

Y = np.array(yf.download("SPY", start="2008-12-01", end="2020-12-30")['Close'])

X = np.linspace(1, 100, len(Y)).reshape(len(Y), 1)

Y = Y.reshape(len(Y), 1)


print('loading data ...')
#D = np.loadtxt('../../data/mcycle.csv', delimiter=',')
#X = D[:, 1:2]
#Y = D[:, 2:]
N = X.shape[0]

# Standardize
X_scaler = StandardScaler().fit(X)
y_scaler = StandardScaler().fit(Y)
Xall = X_scaler.transform(X)
Yall = y_scaler.transform(Y)


# 10-fold cross-validation setup
ind_shuffled = np.random.permutation(N)
ind_split = np.stack(np.array_split(ind_shuffled, 5))  # 10 random batches of data indices
fold = 0
# Get training and test indices
test = ind_split[fold]
train = np.concatenate(ind_split[np.arange(10) != fold])


# Get training and test indices
#test = cvind[fold, :]
#train = np.setdiff1d(cvind, test)

# Set training and test data
X = Xall[train, :]
Y = Yall[train, :]
XT = Xall[test, :]
YT = Yall[test, :]

plt.figure(1, figsize=(12, 5))
plt.clf()
plt.plot(X_scaler.inverse_transform(X), y_scaler.inverse_transform(Y), 'k.', label='train')
plt.plot(X_scaler.inverse_transform(XT), y_scaler.inverse_transform(YT), 'r.', label='test')
plt.legend()
plt.xlabel('time (milliseconds)')
plt.ylabel('accelerometer reading');




[*********************100%***********************]  1 of 1 completed
loading data ...
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-27-68ca1431719f> in <module>
     38 # 10-fold cross-validation setup
     39 ind_shuffled = np.random.permutation(N)
---> 40 ind_split = np.stack(np.array_split(ind_shuffled, 5))  # 10 random batches of data indices
     41 fold = 0
     42 # Get training and test indices

<__array_function__ internals> in stack(*args, **kwargs)

~/miniconda3/envs/myenv1/lib/python3.8/site-packages/numpy/core/shape_base.py in stack(arrays, axis, out)
    423     shapes = {arr.shape for arr in arrays}
    424     if len(shapes) != 1:
--> 425         raise ValueError('all input arrays must have the same shape')
    426 
    427     result_ndim = arrays[0].ndim + 1

ValueError: all input arrays must have the same shape

@wil-j-wil (Collaborator)
That error is because the data does not divide evenly into 5 batches: np.array_split then returns sub-arrays of unequal length, and np.stack fails because the shapes don't match. You could truncate the data slightly to fix it.
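For example, a minimal sketch keeping your 5-way split (note that the last line then also needs np.arange(5), not np.arange(10)):

# a sketch: truncate to a multiple of 5 so all batches have equal size
n_batches = 5
nt = N // n_batches                                      # points per batch
ind_shuffled = np.random.permutation(N)[:n_batches * nt]
ind_split = np.stack(np.split(ind_shuffled, n_batches))  # shapes now match
test = ind_split[fold]
train = np.concatenate(ind_split[np.arange(n_batches) != fold])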

However, didn't you say that you wanted to train on the past and then predict into the future? In this case, you want to just set the first 90 days to be the training data and the last 10 to be test, so you don't need this random split any more.
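Something like this sketch would do it (NumPy only; it assumes the rows of Xall/Yall are already in time order, which they are in your script):

# a sketch: chronological split -- train on the past, test on the future
n_test = 10                                    # e.g. hold out the last 10 days
X, Y = Xall[:-n_test, :], Yall[:-n_test, :]    # everything up to the holdout
XT, YT = Xall[-n_test:, :], Yall[-n_test:, :]  # the final 10 points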

Standardising the data might be fine, but just remember that this means the input will no longer be the exact time stamp.
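If you want results back on the original scale, the fitted scalers can undo the standardization (y_pred_standardized below is just a placeholder for whatever predictions you obtain):

# a sketch: map standardized quantities back to the original scale
t_test = X_scaler.inverse_transform(XT)        # original time axis
# y_pred = y_scaler.inverse_transform(y_pred_standardized)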
