GitHub - imanisima/gmis-hackathon2020: 24-hour data analytics hackathon from CAHSI GMiS conference. 3rd Place

Predict Power Generation of Solar Panels

GMiS CAHSI Data Analytics Hackathon 2020 - Team 14: Mask-araid

Question: What voltage will a solar panel generate at a given time in the future, given the weather conditions?

Software Used: Jupyter Lab

Programming Language: Python

Research

First, let's do some research about solar panels. According to 1876 Energy and Trace Software, the factors that most affect solar panel output are temperature, energy conversion efficiency (power), shade, solar radiation, and location (longitude and latitude). Additionally, solar panels work more efficiently in cold temperatures, which lets the panel produce more voltage and more electricity. Rain and snow have no effect on solar panels; however, cloudy days and humidity can slow down production.

Step I: EDA

First, we will perform some EDA so that we can get a feel for the data.

import pandas as pd

data_set = pd.read_csv("cahsi_data_2020/D1.csv")

data_set.head(100)


	weather_datetime	solar_datetime	solarRadiation	uvHigh	winddirAvg	humidityHigh	humidityLow	humidityAvg	qcStatus	tempHigh	...	windchillAvg	heatindexHigh	heatindexLow	heatindexAvg	pressureMax	pressureMin	pressureTrend	precipRate	precipTotal	DC
0	2020-02-07 14:29:00	2020-02-07 14:29:1	627.70	7.0	195	24	24	24	-1	65	...	65	65	65	65	30.06	30.05	0.60	0.0	0.0	42.036
1	2020-02-07 14:34:00	2020-02-07 14:34:1	617.31	7.0	129	24	23	23	-1	68	...	67	68	66	67	30.06	30.05	-0.15	0.0	0.0	42.126
2	2020-02-07 14:39:00	2020-02-07 14:39:1	608.13	6.0	108	24	23	23	-1	68	...	67	68	67	67	30.06	30.05	0.00	0.0	0.0	42.264
3	2020-02-07 14:44:00	2020-02-07 14:44:1	582.57	6.0	87	25	24	24	-1	67	...	66	67	66	66	30.06	30.05	-0.15	0.0	0.0	42.204
4	2020-02-07 14:49:00	2020-02-07 14:49:1	571.67	6.0	38	24	24	24	-1	66	...	66	66	66	66	30.05	30.04	-0.15	0.0	0.0	42.360
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
95	2020-02-07 22:24:00	2020-02-07 22:24:1	0.00	0.0	255	41	40	40	1	51	...	51	51	51	51	30.15	30.14	0.15	0.0	0.0	0.186
96	2020-02-07 22:29:00	2020-02-07 22:29:1	0.00	0.0	3	43	41	42	1	51	...	51	51	50	51	30.15	30.14	0.15	0.0	0.0	0.192
97	2020-02-07 22:34:00	2020-02-07 22:34:1	0.00	0.0	299	42	40	41	1	51	...	50	51	50	50	30.15	30.14	0.00	0.0	0.0	0.192
98	2020-02-07 22:39:00	2020-02-07 22:39:1	0.00	0.0	233	42	41	41	1	51	...	51	51	51	51	30.15	30.15	0.00	0.0	0.0	0.192
99	2020-02-07 22:44:00	2020-02-07 22:44:1	0.00	0.0	248	41	39	40	1	51	...	51	51	51	51	30.16	30.15	0.00	0.0	0.0	0.198

100 rows × 29 columns

data_set.tail()


	weather_datetime	solar_datetime	winddirAvg	humidityHigh	humidityLow	humidityAvg	qcStatus	tempHigh	...	windchillAvg	heatindexHigh	heatindexLow	heatindexAvg	pressureMax	pressureMin	pressureTrend	DC
7955	2020-03-30 21:29:00	2020-03-30 21:29:1	153	25	25	25	1	62	...	62	62	62	62	30.25	30.24	0.00	0.030
7956	2020-03-30 21:34:00	2020-03-30 21:34:1	160	25	25	25	1	62	...	62	62	62	62	30.25	30.24	0.00	0.024
7957	2020-03-30 21:39:00	2020-03-30 21:39:1	188	25	25	25	1	62	...	62	62	62	62	30.25	30.24	0.00	0.030
7958	2020-03-30 21:44:00	2020-03-30 21:44:1	153	25	25	25	1	62	...	62	62	62	62	30.25	30.24	-0.15	0.024
7959	2020-03-30 21:49:00	2020-03-30 21:49:1	107	25	25	25	1	62	...	62	62	62	62	30.25	30.25	0.00	0.024

5 rows × 29 columns

Observation: Notice that as it gets later in the day, the solar radiation, UV, and temperature decrease. The DC voltage also decreases.
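One way to check this observation beyond the first and last rows is to average the readings by hour of day. The sketch below assumes `weather_datetime` parses cleanly with `pd.to_datetime`; the small in-memory frame is a stand-in for `D1.csv`, with values copied from the rows shown above, and `sample` and `hourly` are names introduced here.

```python
import pandas as pd

# Stand-in rows copied from the head() output above (not the full D1.csv).
sample = pd.DataFrame({
    "weather_datetime": ["2020-02-07 14:29:00", "2020-02-07 14:34:00",
                         "2020-02-07 22:24:00", "2020-02-07 22:29:00"],
    "solarRadiation": [627.70, 617.31, 0.00, 0.00],
    "DC": [42.036, 42.126, 0.186, 0.192],
})

# Parse the timestamp strings and group the readings by hour of day.
sample["hour"] = pd.to_datetime(sample["weather_datetime"]).dt.hour
hourly = sample.groupby("hour")[["solarRadiation", "DC"]].mean()
print(hourly)
```

On the real data the same two lines would be applied to `data_set` instead of `sample`.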

# what other columns are there?
data_set.columns

Index(['weather_datetime', 'solar_datetime', 'solarRadiation', 'uvHigh',
       'winddirAvg', 'humidityHigh', 'humidityLow', 'humidityAvg', 'qcStatus',
       'tempHigh', 'tempLow', 'tempAvg', 'windspeedHigh', 'windgustLow',
       'windspeedAvg', 'dewptHigh', 'dewptLow', 'dewptAvg', 'windchillHigh',
       'windchillAvg', 'heatindexHigh', 'heatindexLow', 'heatindexAvg',
       'pressureMax', 'pressureMin', 'pressureTrend', 'precipRate',
       'precipTotal', 'DC'],
      dtype='object')

# what's the size of our data?
data_set.shape

(7960, 29)

# how is the data distributed?
data_set.describe()


	solarRadiation	uvHigh	winddirAvg	humidityHigh	humidityLow	humidityAvg	qcStatus	tempHigh	tempLow	tempAvg	...	windchillAvg	heatindexHigh	heatindexLow	heatindexAvg	pressureMax	pressureMin	pressureTrend	precipRate	precipTotal	DC
count	7960.000000	7960.000000	7960.000000	7960.000000	7960.000000	7960.000000	7960.000000	7960.000000	7960.000000	7960.000000	...	7960.000000	7960.000000	7960.000000	7960.000000	7960.000000	7960.000000	7960.000000	7960.000000	7960.000000	7960.000000
mean	180.382851	1.844472	182.790075	45.861683	44.912437	45.077261	0.893970	53.745729	53.420603	53.554397	...	53.418970	53.581533	53.236935	53.380653	30.200309	30.193024	0.000974	0.000469	0.012936	19.539283
std	264.082275	2.846040	78.432376	21.862087	21.940977	21.924786	0.311143	11.622671	11.565509	11.589908	...	11.720226	11.353402	11.265849	11.305647	0.137469	0.137532	0.095205	0.005050	0.053583	19.753129
min	0.000000	0.000000	0.000000	11.000000	10.000000	10.000000	-1.000000	27.000000	27.000000	27.000000	...	26.000000	27.000000	27.000000	27.000000	29.850000	29.830000	-0.600000	0.000000	0.000000	0.000000
25%	0.000000	0.000000	139.000000	28.000000	27.000000	27.000000	1.000000	45.000000	45.000000	45.000000	...	45.000000	45.000000	45.000000	45.000000	30.100000	30.100000	0.000000	0.000000	0.000000	0.030000
50%	0.000000	0.000000	197.000000	43.000000	42.000000	42.000000	1.000000	54.000000	54.000000	54.000000	...	54.000000	54.000000	54.000000	54.000000	30.180000	30.180000	0.000000	0.000000	0.000000	8.379000
75%	333.120000	3.000000	215.000000	60.000000	59.000000	60.000000	1.000000	62.000000	62.000000	62.000000	...	62.000000	62.000000	62.000000	62.000000	30.250000	30.250000	0.000000	0.000000	0.000000	39.823500
max	986.880000	10.000000	359.000000	98.000000	98.000000	98.000000	1.000000	83.000000	83.000000	83.000000	...	83.000000	80.000000	80.000000	80.000000	30.610000	30.600000	0.600000	0.130000	0.370000	43.710000

8 rows × 27 columns

# Use DataFrame.corr to see which features correlate with DC.
# Only numeric columns are passed in, because the two datetime columns are strings.
data_set.select_dtypes("number").corr(method="spearman")


	solarRadiation	uvHigh	winddirAvg	humidityHigh	humidityLow	humidityAvg	qcStatus	tempHigh	tempLow	tempAvg	...	windchillAvg	heatindexHigh	heatindexLow	heatindexAvg	pressureMax	pressureMin	pressureTrend	precipRate	precipTotal	DC
solarRadiation	1.000000	0.937831	-0.342999	-0.399324	-0.408921	-0.405694	-0.051987	0.452500	0.444020	0.448101	...	0.446924	0.452235	0.442935	0.447495	0.046178	0.045571	-0.073512	0.001359	0.035785	0.814947
uvHigh	0.937831	1.000000	-0.316591	-0.407405	-0.418103	-0.414418	-0.048438	0.454691	0.445485	0.449924	...	0.448278	0.454565	0.444371	0.449403	0.062698	0.062510	-0.096479	-0.043045	0.006510	0.700912
winddirAvg	-0.342999	-0.316591	1.000000	0.319881	0.321128	0.320633	-0.058951	-0.362586	-0.361985	-0.362065	...	-0.361569	-0.362126	-0.361636	-0.361757	0.196054	0.195846	0.011043	-0.018541	0.003610	-0.259023
humidityHigh	-0.399324	-0.407405	0.319881	1.000000	0.998860	0.999456	-0.132945	-0.765681	-0.763568	-0.764586	...	-0.759300	-0.765219	-0.763032	-0.764093	0.169556	0.169504	0.027204	0.197191	0.362307	-0.102328
humidityLow	-0.408921	-0.418103	0.321128	0.998860	1.000000	0.999736	-0.132788	-0.767797	-0.765245	-0.766474	...	-0.761256	-0.767374	-0.764697	-0.765993	0.167956	0.167909	0.029168	0.196765	0.360056	-0.109135
humidityAvg	-0.405694	-0.414418	0.320633	0.999456	0.999736	1.000000	-0.132995	-0.766938	-0.764562	-0.765708	...	-0.760461	-0.766498	-0.764012	-0.765218	0.168278	0.168241	0.028430	0.196955	0.361020	-0.106906
qcStatus	-0.051987	-0.048438	-0.058951	-0.132945	-0.132788	-0.132995	1.000000	0.050799	0.052421	0.051585	...	0.058814	0.050671	0.052195	0.051358	-0.130895	-0.131655	-0.013845	0.023923	0.051293	-0.128036
tempHigh	0.452500	0.454691	-0.362586	-0.765681	-0.767797	-0.766938	0.050799	1.000000	0.999030	0.999402	...	0.998271	0.999708	0.998698	0.999141	-0.451769	-0.452192	-0.045211	-0.126698	-0.170193	0.176902
tempLow	0.444020	0.445485	-0.361985	-0.763568	-0.765245	-0.764562	0.052421	0.999030	1.000000	0.999544	...	0.998397	0.998736	0.999677	0.999288	-0.455311	-0.455749	-0.043467	-0.125614	-0.170110	0.170358
tempAvg	0.448101	0.449924	-0.362065	-0.764586	-0.766474	-0.765708	0.051585	0.999402	0.999544	1.000000	...	0.998819	0.999124	0.999221	0.999748	-0.453805	-0.454220	-0.044260	-0.126174	-0.170251	0.173594
windspeedHigh	0.397328	0.383542	-0.321357	-0.520153	-0.516107	-0.517716	-0.011902	0.500599	0.503205	0.502059	...	0.484797	0.500218	0.502660	0.501699	-0.048984	-0.049268	-0.030898	-0.107910	-0.222856	0.242005
windgustLow	0.330530	0.317268	-0.276300	-0.407748	-0.403040	-0.404775	-0.000308	0.397769	0.401008	0.399604	...	0.382369	0.397298	0.400612	0.399310	-0.041768	-0.041718	-0.019151	-0.086551	-0.173471	0.214411
windspeedAvg	0.387350	0.372748	-0.316086	-0.489531	-0.485025	-0.486786	-0.003781	0.472159	0.475211	0.473931	...	0.456297	0.471763	0.474679	0.473568	-0.044306	-0.044478	-0.025686	-0.101311	-0.205448	0.245529
dewptHigh	-0.052317	-0.050672	0.068162	0.567230	0.563345	0.565181	-0.143246	0.050101	0.052756	0.051704	...	0.057759	0.050857	0.053645	0.052601	-0.235636	-0.236398	-0.020742	0.119437	0.299852	0.049395
dewptLow	-0.080215	-0.080862	0.079540	0.593152	0.592413	0.593069	-0.141533	0.017321	0.020574	0.019246	...	0.025345	0.018005	0.021514	0.020128	-0.227196	-0.227908	-0.017020	0.124784	0.308419	0.035530
dewptAvg	-0.066127	-0.065882	0.074241	0.580213	0.577895	0.579206	-0.140929	0.034221	0.037206	0.036009	...	0.042133	0.034966	0.038134	0.036922	-0.231846	-0.232595	-0.018377	0.122124	0.304470	0.041853
windchillHigh	0.452330	0.454327	-0.363098	-0.764098	-0.766250	-0.765379	0.055493	0.999509	0.998592	0.998945	...	0.998978	0.999217	0.998260	0.998684	-0.453846	-0.454269	-0.045599	-0.126133	-0.168847	0.176803
windchillAvg	0.446924	0.448278	-0.361569	-0.759300	-0.761256	-0.760461	0.058814	0.998271	0.998397	0.998819	...	1.000000	0.997993	0.998074	0.998567	-0.459162	-0.459578	-0.044587	-0.124342	-0.166472	0.173580
heatindexHigh	0.452235	0.454565	-0.362126	-0.765219	-0.767374	-0.766498	0.050671	0.999708	0.998736	0.999124	...	0.997993	1.000000	0.998743	0.999269	-0.452040	-0.452460	-0.045153	-0.126702	-0.170172	0.176960
heatindexLow	0.442935	0.444371	-0.361636	-0.763032	-0.764697	-0.764012	0.052195	0.998698	0.999677	0.999221	...	0.998074	0.998743	1.000000	0.999439	-0.455523	-0.455950	-0.043394	-0.125619	-0.170043	0.170265
heatindexAvg	0.447495	0.449403	-0.361757	-0.764093	-0.765993	-0.765218	0.051358	0.999141	0.999288	0.999748	...	0.998567	0.999269	0.999439	1.000000	-0.454029	-0.454441	-0.044092	-0.126179	-0.170233	0.173561
pressureMax	0.046178	0.062698	0.196054	0.169556	0.167956	0.168278	-0.130895	-0.451769	-0.455311	-0.453805	...	-0.459162	-0.452040	-0.455523	-0.454029	1.000000	0.998638	-0.016431	-0.101865	-0.224125	0.081685
pressureMin	0.045571	0.062510	0.195846	0.169504	0.167909	0.168241	-0.131655	-0.452192	-0.455749	-0.454220	...	-0.459578	-0.452460	-0.455950	-0.454441	0.998638	1.000000	-0.016344	-0.103346	-0.224628	0.081169
pressureTrend	-0.073512	-0.096479	0.011043	0.027204	0.029168	0.028430	-0.013845	-0.045211	-0.043467	-0.044260	...	-0.044587	-0.045153	-0.043394	-0.044092	-0.016431	-0.016344	1.000000	0.018658	0.018448	-0.027936
precipRate	0.001359	-0.043045	-0.018541	0.197191	0.196765	0.196955	0.023923	-0.126698	-0.125614	-0.126174	...	-0.124342	-0.126702	-0.125619	-0.126179	-0.101865	-0.103346	0.018658	1.000000	0.410878	0.094109
precipTotal	0.035785	0.006510	0.003610	0.362307	0.360056	0.361020	0.051293	-0.170193	-0.170110	-0.170251	...	-0.166472	-0.170172	-0.170043	-0.170233	-0.224125	-0.224628	0.018448	0.410878	1.000000	0.140794
DC	0.814947	0.700912	-0.259023	-0.102328	-0.109135	-0.106906	-0.128036	0.176902	0.170358	0.173594	...	0.173580	0.176960	0.170265	0.173561	0.081685	0.081169	-0.027936	0.094109	0.140794	1.000000

27 rows × 27 columns

Observation: DC appears to be strongly correlated with:

solarRadiation - 0.8
uvHigh - 0.7

and weakly correlated with:

tempHigh
tempLow
tempAvg
windchillAvg
heatindexHigh
heatindexLow
heatindexAvg
precipTotal

Does this reflect any information gathered from our research?
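Reading the DC column out of a 27 x 27 matrix by eye is error-prone, so it can help to pull out that single column and sort it by absolute value. This is a minimal sketch on synthetic data; `demo` and `dc_corr` are names introduced here, and the real notebook would apply the same chain to `data_set`.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for data_set: DC depends on solarRadiation, not on noise.
rng = np.random.default_rng(0)
radiation = rng.uniform(0, 1000, size=200)
demo = pd.DataFrame({
    "solarRadiation": radiation,
    "noise": rng.normal(size=200),
    "DC": 0.04 * radiation + rng.normal(scale=2.0, size=200),
})

# Spearman correlation of every numeric column with DC, strongest first.
dc_corr = (
    demo.select_dtypes("number")
        .corr(method="spearman")["DC"]
        .drop("DC")
        .sort_values(key=abs, ascending=False)
)
print(dc_corr)
```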

Step II: Feature Selection

We will split the data into features and labels and convert them into arrays to be used for our model.

import numpy as np

# we want to predict DC
labels = np.array(data_set['DC'])

# Remove the labels and unimportant features from the features list.

col = [
 'weather_datetime',
 'solar_datetime',
 'winddirAvg',
 'humidityHigh',
 'humidityLow',
 'humidityAvg',
 'heatindexLow',
 'heatindexHigh',
 'heatindexAvg',
 'qcStatus',
 'windspeedHigh',
 'windgustLow',
 'windspeedAvg',
 'dewptHigh',
 'dewptLow',
 'dewptAvg',
 'windchillHigh',
 'windchillAvg',
 'pressureMax',
 'pressureMin',
 'pressureTrend',
 'precipRate',
 'precipTotal',
 'DC']

features = data_set.drop(col, axis=1)
feature_list = list(features.columns)
features = np.array(features)
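Dropping 24 of the 29 columns leaves five features. An equivalent, shorter way to express the same selection is to name the columns to keep. The sketch below is only an illustration: `keep` and `demo` are names introduced here, and the one-row frame uses illustrative values rather than real rows from `D1.csv`.

```python
import numpy as np
import pandas as pd

# One-row stand-in for data_set (illustrative values, subset of real columns).
demo = pd.DataFrame({
    "weather_datetime": ["2020-02-07 14:29:00"],
    "solarRadiation": [627.70], "uvHigh": [7.0],
    "tempHigh": [65], "tempLow": [65], "tempAvg": [65],
    "humidityAvg": [24], "DC": [42.036],
})

# Columns to keep as model inputs; everything else is ignored.
keep = ["solarRadiation", "uvHigh", "tempHigh", "tempLow", "tempAvg"]

labels = demo["DC"].to_numpy()       # target values
features = demo[keep].to_numpy()     # model inputs, in the order given by keep
feature_list = list(keep)
print(features.shape, labels.shape)
```

A keep-list also makes the column order explicit, which matters once the data is converted to a bare NumPy array.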

Step III: Build and Train Model

Split the data into train and test sets.

from sklearn.model_selection import train_test_split

# The test split is kept small (10%) so the model trains on as much data as possible; a separate test set is provided for the final predictions.
train_features, test_features, train_labels, test_labels = train_test_split(features, labels, test_size = 0.1)

print('Training Features Shape:', train_features.shape)
print('Training Labels Shape:', train_labels.shape)
print('Testing Features Shape:', test_features.shape)
print('Testing Labels Shape:', test_labels.shape)

Training Features Shape: (7164, 5)
Training Labels Shape: (7164,)
Testing Features Shape: (796, 5)
Testing Labels Shape: (796,)

# the features we will be using to predict DC
feature_list

['solarRadiation', 'uvHigh', 'tempHigh', 'tempLow', 'tempAvg']
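Because `train_test_split` shuffles randomly, the split above changes on every run, and so do the scores that depend on it. A minimal sketch of a reproducible split, assuming a fixed `random_state` (the value 42 is arbitrary and not from the original notebook); the zero-filled arrays only mimic the shapes of `features` and `labels`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Zero-filled stand-ins with the same shapes as the notebook's arrays.
features = np.zeros((7960, 5))
labels = np.zeros(7960)

# random_state fixes the shuffle, so the split is identical on every run.
train_X, test_X, train_y, test_y = train_test_split(
    features, labels, test_size=0.1, random_state=42
)
print(train_X.shape, test_X.shape)
```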

Step III.i: Hyperparameter Tuning

Hyperparameter tuning helps identify which parameter values work best for the model. It is much better than guessing, and although it isn't perfect, it gives us some clues about what to try.

from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor

from scipy.stats import uniform as sp_randFloat
from scipy.stats import randint as sp_randInt

Gradient Boost

gradient_boost_model = GradientBoostingRegressor()
gradient_params = {'learning_rate': sp_randFloat(),
                'subsample'    : sp_randFloat(),
                'n_estimators' : sp_randInt(200, 2000),
                'max_depth'    : sp_randInt(10, 110)
             }

random_gradient = RandomizedSearchCV(estimator= gradient_boost_model, param_distributions = gradient_params, cv = 3, verbose=2, n_iter = 100, n_jobs=-1)
random_gradient.fit(train_features, train_labels)

Fitting 3 folds for each of 100 candidates, totalling 300 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  33 tasks      | elapsed:  2.9min
[Parallel(n_jobs=-1)]: Done 154 tasks      | elapsed: 14.9min
[Parallel(n_jobs=-1)]: Done 300 out of 300 | elapsed: 28.2min finished





RandomizedSearchCV(cv=3, estimator=GradientBoostingRegressor(), n_iter=100,
                   n_jobs=-1,
                   param_distributions={'learning_rate': <scipy.stats._distn_infrastructure.rv_frozen object at 0x7fe6ae2aedd0>,
                                        'max_depth': <scipy.stats._distn_infrastructure.rv_frozen object at 0x7fe6ae2aecd0>,
                                        'n_estimators': <scipy.stats._distn_infrastructure.rv_frozen object at 0x7fe6ae2ae050>,
                                        'subsample': <scipy.stats._distn_infrastructure.rv_frozen object at 0x7fe6ae2aee10>},
                   verbose=2)

# Results from Random Search
print(" Results from Random Search " )
print("\n The best estimator across ALL searched params:\n", random_gradient.best_estimator_)
print("\n The best score across ALL searched params:\n", random_gradient.best_score_)
print("\n The best parameters across ALL searched params:\n", random_gradient.best_params_)
print(random_gradient.score(test_features , test_labels))

 Results from Random Search 

 The best estimator across ALL searched params:
 GradientBoostingRegressor(learning_rate=0.01794706377831745, max_depth=32,
                          n_estimators=785, subsample=0.2873167459093807)

 The best score across ALL searched params:
 0.9655473309872132

 The best parameters across ALL searched params:
 {'learning_rate': 0.01794706377831745, 'max_depth': 32, 'n_estimators': 785, 'subsample': 0.2873167459093807}
0.9688789434674687
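Besides the single best candidate, the fitted search object keeps the score of every sampled combination in `cv_results_`, which shows how close the runners-up were. The sketch below runs a much smaller search on synthetic data so it finishes in seconds; the parameter ranges, `n_iter=5`, and the names `search`, `results`, and `top` are assumptions made for this example, not the notebook's settings.

```python
import pandas as pd
from scipy.stats import randint, uniform
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import RandomizedSearchCV

# Small synthetic regression problem standing in for the solar data.
X, y = make_regression(n_samples=200, n_features=5, noise=5.0, random_state=0)

search = RandomizedSearchCV(
    estimator=GradientBoostingRegressor(random_state=0),
    param_distributions={
        "learning_rate": uniform(0.01, 0.3),  # samples from [0.01, 0.31]
        "subsample": uniform(0.5, 0.5),       # samples from [0.5, 1.0]
        "n_estimators": randint(20, 60),
        "max_depth": randint(2, 6),
    },
    n_iter=5, cv=3, random_state=0,
)
search.fit(X, y)

# One row per sampled candidate, best rank first.
results = pd.DataFrame(search.cv_results_).sort_values("rank_test_score")
top = results[["params", "mean_test_score", "std_test_score"]].head(3)
print(top)
```

For a regressor, the scores reported here are R2 values averaged over the cross-validation folds.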

Step III.ii: Random Forest Model

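# Instantiate the model with 785 decision trees, reusing the n_estimators and
# max_depth values found by the gradient boosting search above.
# The split criterion is left at its default (squared error); the old
# criterion="mse" spelling is not accepted by newer scikit-learn releases.
rf = RandomForestRegressor(n_estimators = 785, 
                           max_depth = 32, 
                           min_samples_split = 2)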

# Train the model on training data
rf.fit(train_features, train_labels)

RandomForestRegressor(max_depth=32, n_estimators=785)
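Once a forest is fitted, its `feature_importances_` attribute gives a rough view of which inputs the trees relied on, which can be compared against the correlations from Step I. A minimal sketch on synthetic data, where the target is built to depend mostly on the first column; `rf_demo` and `ranked` are names introduced here, and the numbers say nothing about the real dataset.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

feature_list = ["solarRadiation", "uvHigh", "tempHigh", "tempLow", "tempAvg"]

# Synthetic stand-in data: the target depends mostly on the first column.
rng = np.random.default_rng(0)
X = rng.uniform(size=(300, len(feature_list)))
y = 10.0 * X[:, 0] + rng.normal(scale=0.1, size=300)

rf_demo = RandomForestRegressor(n_estimators=50, random_state=0)
rf_demo.fit(X, y)

# Pair each feature name with its importance, largest first.
ranked = sorted(zip(feature_list, rf_demo.feature_importances_),
                key=lambda pair: pair[1], reverse=True)
for name, score in ranked:
    print(f"{name}: {score:.3f}")
```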

Step III.iii: Accuracy - R2 Score

Let's see how accurate our model is on the held-out split of the training data.

y_pred = rf.predict(test_features)

from sklearn.metrics import r2_score

r2_score(test_labels, y_pred)

0.9709460517127418

Comment: Our model has an R2 score of about 0.97 on the held-out split! That's not bad at all.
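For reference, `r2_score` is the coefficient of determination, R2 = 1 - SS_res / SS_tot, so it measures the share of the variance in DC that the model explains rather than a classification-style accuracy. A small sketch of that definition with made-up numbers; `r2_manual` is a helper name introduced here.

```python
import numpy as np

def r2_manual(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
    return 1.0 - ss_res / ss_tot

# Made-up values, only to exercise the function.
y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5, 0.0, 2.0, 8.0]
score = r2_manual(y_true, y_pred)
print(round(score, 4))
```

A score of 1.0 means perfect predictions, and 0.0 means the model does no better than always predicting the mean of the labels.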

Step IV: Predictions Using Test Dataset

Now we will test our model using the test set. Remember that whatever we did to the training set must also be done to the testing set!

test_set = pd.read_csv("cahsi_data_2020/D2.csv")


col = [
 'weather_datetime',
 'solar_datetime',
 'winddirAvg',
 'humidityHigh',
 'humidityLow',
 'humidityAvg',
 'heatindexLow',
 'heatindexHigh',
 'heatindexAvg',
 'qcStatus',
 'windspeedHigh',
 'windgustLow',
 'windspeedAvg',
 'dewptHigh',
 'dewptLow',
 'dewptAvg',
 'windchillHigh',
 'windchillAvg',
 'pressureMax',
 'pressureMin',
 'pressureTrend',
 'precipRate',
 'precipTotal']

testset_features = test_set.drop(col, axis = 1)
testset_features = np.array(testset_features)

# Use the forest's predict method on the test data
predictions = rf.predict(testset_features)

predictions

array([1.20963236, 1.20963236, 0.70364704, ..., 6.45444127, 6.45444127,
       6.45444127])
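Since the model was trained on a bare NumPy array, it cannot check column names, so the test columns must arrive in the same order as during training. One way to guard against a mismatch is to select the test columns by the training `feature_list` instead of repeating the drop list. The frame below is a made-up stand-in for `D2.csv`; `test_demo` and `missing` are names introduced here.

```python
import pandas as pd

feature_list = ["solarRadiation", "uvHigh", "tempHigh", "tempLow", "tempAvg"]

# Made-up stand-in for test_set, with columns in a different order.
test_demo = pd.DataFrame({
    "tempAvg": [60], "uvHigh": [5.0], "humidityAvg": [30],
    "solarRadiation": [500.0], "tempLow": [59], "tempHigh": [61],
})

# Fail early if a training feature is missing from the test file.
missing = [c for c in feature_list if c not in test_demo.columns]
if missing:
    raise KeyError(f"test set is missing columns: {missing}")

# Selecting by name enforces the training column order.
testset_features = test_demo[feature_list].to_numpy()
print(testset_features)
```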

Step V: Dump predictions into text file for later use.

print('Predictions:\n', predictions)

# Write one prediction per line; the with-block closes the file automatically.
with open("answer.txt", "w") as file:
    for num in predictions:
        file.write(str(num))
        file.write("\n")

Predictions:
 [1.20963236 1.20963236 0.70364704 ... 6.45444127 6.45444127 6.45444127]
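To reuse the file later, the values can be read back as floats. A short sketch, assuming one number per line as written above; it uses `np.savetxt` and `np.loadtxt` in a temporary directory rather than the notebook's own `answer.txt`, and the three values are copied from the printed output.

```python
import os
import tempfile

import numpy as np

# A few values copied from the printed predictions above.
predictions = np.array([1.20963236, 0.70364704, 6.45444127])

# Write to a throwaway location so the real answer.txt is left untouched.
path = os.path.join(tempfile.mkdtemp(), "answer.txt")
np.savetxt(path, predictions)   # one value per line
restored = np.loadtxt(path)     # back to a float array

print(np.allclose(predictions, restored))
```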
