set up on a new dataset and predict on new data points #2
Glad to see you are trying things out. Everything looks OK except that you're using the cross-validation indices that we stored specifically for a different (smaller) data set, so you've truncated your data (I assume unintentionally). To generate your own train/test split, you could use the code below:

```python
# 10-fold cross-validation setup
ind_shuffled = np.random.permutation(N)
ind_split = np.stack(np.split(ind_shuffled, 10))  # 10 random batches of data indices
fold = 0
# Get training and test indices
test = ind_split[fold]
train = np.concatenate(ind_split[np.arange(10) != fold])
```

This splits your data into a 90% train / 10% test split. However, if you simply want to train on all the data and then make predictions at unseen locations, you can just set XT to the locations you want to predict at. Also note that the code is currently scaling and shifting your data, which may or may not be desirable, but is something to keep in mind. Any other questions, let me know.

Will
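To make that last suggestion concrete, here is a minimal sketch of training on everything and choosing `XT` yourself. It assumes the `Xall`, `Yall`, and `X_scaler` variables from the demo-style script (they also appear in the code in the next comment); the 30-step horizon is an arbitrary illustrative choice, not something from the original comment.

```python
# Train on all of the (standardized) data instead of a train/test split
X, Y = Xall, Yall

# Pick the unseen inputs yourself: here, 30 steps beyond the last
# observed time stamp, using the spacing of the original input grid
x_raw = X_scaler.inverse_transform(Xall)           # back to raw time stamps
step = x_raw[-1, 0] - x_raw[-2, 0]                 # spacing of the input grid
x_future = x_raw[-1, 0] + step * np.arange(1, 31)  # 30 future locations (illustrative)
XT = X_scaler.transform(x_future.reshape(-1, 1))   # same scaling as the training inputs
```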
Hi @wil-j-wil!!! Thank you so much! You are so generous with your time!! One thing to note: I tried the code below, but I am getting an error.

```python
import sys
sys.path.insert(0, '../')
import numpy as np
from jax.experimental import optimizers
import matplotlib.pyplot as plt
import time
from sde_gp import SDEGP
import approximate_inference as approx_inf
import priors
import likelihoods
from utils import softplus_list, plot
from sklearn.preprocessing import StandardScaler
plot_intermediate = False

import yfinance as yf
Y = np.array(yf.download("SPY", start="2008-12-01", end="2020-12-30")['Close'])
X = np.linspace(1, 100, len(Y)).reshape(len(Y), 1)
Y = Y.reshape(len(Y), 1)

print('loading data ...')
#D = np.loadtxt('../../data/mcycle.csv', delimiter=',')
#X = D[:, 1:2]
#Y = D[:, 2:]
N = X.shape[0]

# Standardize
X_scaler = StandardScaler().fit(X)
y_scaler = StandardScaler().fit(Y)
Xall = X_scaler.transform(X)
Yall = y_scaler.transform(Y)

# 10-fold cross-validation setup
ind_shuffled = np.random.permutation(N)
ind_split = np.stack(np.array_split(ind_shuffled, 5))  # 10 random batches of data indices
fold = 0

# Get training and test indices
test = ind_split[fold]
train = np.concatenate(ind_split[np.arange(10) != fold])

# Get training and test indices
#test = cvind[fold, :]
#train = np.setdiff1d(cvind, test)

# Set training and test data
X = Xall[train, :]
Y = Yall[train, :]
XT = Xall[test, :]
YT = Yall[test, :]

plt.figure(1, figsize=(12, 5))
plt.clf()
plt.plot(X_scaler.inverse_transform(X), y_scaler.inverse_transform(Y), 'k.', label='train')
plt.plot(X_scaler.inverse_transform(XT), y_scaler.inverse_transform(YT), 'r.', label='test')
plt.legend()
plt.xlabel('time (milliseconds)')
plt.ylabel('accelerometer reading');
```
```
[*********************100%***********************]  1 of 1 completed
loading data ...
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-27-68ca1431719f> in <module>
     38 # 10-fold cross-validation setup
     39 ind_shuffled = np.random.permutation(N)
---> 40 ind_split = np.stack(np.array_split(ind_shuffled, 5)) # 10 random batches of data indices
     41 fold = 0
     42 # Get training and test indices

<__array_function__ internals> in stack(*args, **kwargs)

~/miniconda3/envs/myenv1/lib/python3.8/site-packages/numpy/core/shape_base.py in stack(arrays, axis, out)
    423     shapes = {arr.shape for arr in arrays}
    424     if len(shapes) != 1:
--> 425         raise ValueError('all input arrays must have the same shape')
    426
    427     result_ndim = arrays[0].ndim + 1

ValueError: all input arrays must have the same shape
```
That error is because the data does not divide evenly into 5 batches. You could truncate the data slightly to fix it. However, didn't you say that you wanted to train on the past and then predict into the future? In that case, you just want to set the first 90 days to be the training data and the last 10 to be the test data, so you don't need the random split any more. Standardising the data might be fine, but just remember that this means the input will no longer be the exact time stamp.
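For what it's worth, a minimal sketch of that chronological split, assuming the `N`, `Xall`, and `Yall` arrays from the script above (the 90/10 cut-off mirrors the 90-days/10-days suggestion; the exact cut-off index is my assumption):

```python
# Option 1: keep the random 5-fold split by truncating to a multiple of 5
# ind_shuffled = ind_shuffled[:N - N % 5]

# Option 2 (what the comment above suggests): chronological split.
# The inputs are already ordered in time, so no shuffling is needed.
cutoff = int(0.9 * N)                     # first ~90 of the 100 time units
X, Y = Xall[:cutoff], Yall[:cutoff]       # train on the past
XT, YT = Xall[cutoff:], Yall[cutoff:]     # test on the future
```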
Hi @asolin @wil-j-wil!!
I'm really interested in the models, and I am trying to set them up on a new dataset.
Can you please review the code below to see if what I'm doing makes sense?