Question: What DC voltage will a solar panel generate at a given time in the future, given the weather conditions?
Software Used: Jupyter Lab
Programming Language: Python
First, let's do some research about solar panels. According to 1876 Energy and Trace Software, the main factors affecting solar panel output are temperature, energy conversion efficiency (power), shade, solar radiation, and location (longitude and latitude). Additionally, solar panels work more efficiently in cold temperatures, allowing the panel to produce more voltage and more electricity. Rain and snow have no effect on solar panels; however, cloudy days and humidity can slow down production.
Next, we will perform some exploratory data analysis (EDA) so that we can get a feel for the data.
import pandas as pd
data_set = pd.read_csv("cahsi_data_2020/D1.csv")
data_set.head(100)
| | weather_datetime | solar_datetime | solarRadiation | uvHigh | winddirAvg | humidityHigh | humidityLow | humidityAvg | qcStatus | tempHigh | ... | windchillAvg | heatindexHigh | heatindexLow | heatindexAvg | pressureMax | pressureMin | pressureTrend | precipRate | precipTotal | DC |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2020-02-07 14:29:00 | 2020-02-07 14:29:1 | 627.70 | 7.0 | 195 | 24 | 24 | 24 | -1 | 65 | ... | 65 | 65 | 65 | 65 | 30.06 | 30.05 | 0.60 | 0.0 | 0.0 | 42.036 |
1 | 2020-02-07 14:34:00 | 2020-02-07 14:34:1 | 617.31 | 7.0 | 129 | 24 | 23 | 23 | -1 | 68 | ... | 67 | 68 | 66 | 67 | 30.06 | 30.05 | -0.15 | 0.0 | 0.0 | 42.126 |
2 | 2020-02-07 14:39:00 | 2020-02-07 14:39:1 | 608.13 | 6.0 | 108 | 24 | 23 | 23 | -1 | 68 | ... | 67 | 68 | 67 | 67 | 30.06 | 30.05 | 0.00 | 0.0 | 0.0 | 42.264 |
3 | 2020-02-07 14:44:00 | 2020-02-07 14:44:1 | 582.57 | 6.0 | 87 | 25 | 24 | 24 | -1 | 67 | ... | 66 | 67 | 66 | 66 | 30.06 | 30.05 | -0.15 | 0.0 | 0.0 | 42.204 |
4 | 2020-02-07 14:49:00 | 2020-02-07 14:49:1 | 571.67 | 6.0 | 38 | 24 | 24 | 24 | -1 | 66 | ... | 66 | 66 | 66 | 66 | 30.05 | 30.04 | -0.15 | 0.0 | 0.0 | 42.360 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
95 | 2020-02-07 22:24:00 | 2020-02-07 22:24:1 | 0.00 | 0.0 | 255 | 41 | 40 | 40 | 1 | 51 | ... | 51 | 51 | 51 | 51 | 30.15 | 30.14 | 0.15 | 0.0 | 0.0 | 0.186 |
96 | 2020-02-07 22:29:00 | 2020-02-07 22:29:1 | 0.00 | 0.0 | 3 | 43 | 41 | 42 | 1 | 51 | ... | 51 | 51 | 50 | 51 | 30.15 | 30.14 | 0.15 | 0.0 | 0.0 | 0.192 |
97 | 2020-02-07 22:34:00 | 2020-02-07 22:34:1 | 0.00 | 0.0 | 299 | 42 | 40 | 41 | 1 | 51 | ... | 50 | 51 | 50 | 50 | 30.15 | 30.14 | 0.00 | 0.0 | 0.0 | 0.192 |
98 | 2020-02-07 22:39:00 | 2020-02-07 22:39:1 | 0.00 | 0.0 | 233 | 42 | 41 | 41 | 1 | 51 | ... | 51 | 51 | 51 | 51 | 30.15 | 30.15 | 0.00 | 0.0 | 0.0 | 0.192 |
99 | 2020-02-07 22:44:00 | 2020-02-07 22:44:1 | 0.00 | 0.0 | 248 | 41 | 39 | 40 | 1 | 51 | ... | 51 | 51 | 51 | 51 | 30.16 | 30.15 | 0.00 | 0.0 | 0.0 | 0.198 |
100 rows × 29 columns
data_set.tail()
| | weather_datetime | solar_datetime | solarRadiation | uvHigh | winddirAvg | humidityHigh | humidityLow | humidityAvg | qcStatus | tempHigh | ... | windchillAvg | heatindexHigh | heatindexLow | heatindexAvg | pressureMax | pressureMin | pressureTrend | precipRate | precipTotal | DC |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
7955 | 2020-03-30 21:29:00 | 2020-03-30 21:29:1 | 0.0 | 0.0 | 153 | 25 | 25 | 25 | 1 | 62 | ... | 62 | 62 | 62 | 62 | 30.25 | 30.24 | 0.00 | 0.0 | 0.0 | 0.030 |
7956 | 2020-03-30 21:34:00 | 2020-03-30 21:34:1 | 0.0 | 0.0 | 160 | 25 | 25 | 25 | 1 | 62 | ... | 62 | 62 | 62 | 62 | 30.25 | 30.24 | 0.00 | 0.0 | 0.0 | 0.024 |
7957 | 2020-03-30 21:39:00 | 2020-03-30 21:39:1 | 0.0 | 0.0 | 188 | 25 | 25 | 25 | 1 | 62 | ... | 62 | 62 | 62 | 62 | 30.25 | 30.24 | 0.00 | 0.0 | 0.0 | 0.030 |
7958 | 2020-03-30 21:44:00 | 2020-03-30 21:44:1 | 0.0 | 0.0 | 153 | 25 | 25 | 25 | 1 | 62 | ... | 62 | 62 | 62 | 62 | 30.25 | 30.24 | -0.15 | 0.0 | 0.0 | 0.024 |
7959 | 2020-03-30 21:49:00 | 2020-03-30 21:49:1 | 0.0 | 0.0 | 107 | 25 | 25 | 25 | 1 | 62 | ... | 62 | 62 | 62 | 62 | 30.25 | 30.25 | 0.00 | 0.0 | 0.0 | 0.024 |
5 rows × 29 columns
Observation: Notice that as it gets later in the day, the solar radiation, UV index, and temperature decrease. The DC voltage also decreases.
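The time-of-day pattern noted above can be checked more systematically by averaging the readings per hour. The following is a minimal sketch, assuming the `weather_datetime` strings parse cleanly with `pd.to_datetime`; the small frame is made-up illustrative data, not values from D1.csv, and the `hour` column is a helper introduced only for this example.

```python
import pandas as pd

# Made-up sample rows that mimic the layout of D1.csv (illustrative only).
sample = pd.DataFrame({
    "weather_datetime": ["2020-02-07 14:29:00", "2020-02-07 14:34:00",
                         "2020-02-07 18:04:00", "2020-02-07 22:24:00"],
    "solarRadiation": [600.0, 620.0, 50.0, 0.0],
    "DC": [42.0, 42.2, 10.0, 0.2],
})

# Parse the timestamp strings so the hour of day can be extracted.
sample["weather_datetime"] = pd.to_datetime(sample["weather_datetime"])
sample["hour"] = sample["weather_datetime"].dt.hour

# Average radiation and DC per hour of day.
hourly = sample.groupby("hour")[["solarRadiation", "DC"]].mean()
print(hourly)
```

Applied to the real data, the same `groupby` would give one row per hour, which makes the daily rise and fall easier to see than scrolling through `head` and `tail`.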
# what other columns are there?
data_set.columns
Index(['weather_datetime', 'solar_datetime', 'solarRadiation', 'uvHigh',
'winddirAvg', 'humidityHigh', 'humidityLow', 'humidityAvg', 'qcStatus',
'tempHigh', 'tempLow', 'tempAvg', 'windspeedHigh', 'windgustLow',
'windspeedAvg', 'dewptHigh', 'dewptLow', 'dewptAvg', 'windchillHigh',
'windchillAvg', 'heatindexHigh', 'heatindexLow', 'heatindexAvg',
'pressureMax', 'pressureMin', 'pressureTrend', 'precipRate',
'precipTotal', 'DC'],
dtype='object')
# what's the size of our data?
data_set.shape
(7960, 29)
# how distributed is the data?
data_set.describe()
| | solarRadiation | uvHigh | winddirAvg | humidityHigh | humidityLow | humidityAvg | qcStatus | tempHigh | tempLow | tempAvg | ... | windchillAvg | heatindexHigh | heatindexLow | heatindexAvg | pressureMax | pressureMin | pressureTrend | precipRate | precipTotal | DC |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 7960.000000 | 7960.000000 | 7960.000000 | 7960.000000 | 7960.000000 | 7960.000000 | 7960.000000 | 7960.000000 | 7960.000000 | 7960.000000 | ... | 7960.000000 | 7960.000000 | 7960.000000 | 7960.000000 | 7960.000000 | 7960.000000 | 7960.000000 | 7960.000000 | 7960.000000 | 7960.000000 |
mean | 180.382851 | 1.844472 | 182.790075 | 45.861683 | 44.912437 | 45.077261 | 0.893970 | 53.745729 | 53.420603 | 53.554397 | ... | 53.418970 | 53.581533 | 53.236935 | 53.380653 | 30.200309 | 30.193024 | 0.000974 | 0.000469 | 0.012936 | 19.539283 |
std | 264.082275 | 2.846040 | 78.432376 | 21.862087 | 21.940977 | 21.924786 | 0.311143 | 11.622671 | 11.565509 | 11.589908 | ... | 11.720226 | 11.353402 | 11.265849 | 11.305647 | 0.137469 | 0.137532 | 0.095205 | 0.005050 | 0.053583 | 19.753129 |
min | 0.000000 | 0.000000 | 0.000000 | 11.000000 | 10.000000 | 10.000000 | -1.000000 | 27.000000 | 27.000000 | 27.000000 | ... | 26.000000 | 27.000000 | 27.000000 | 27.000000 | 29.850000 | 29.830000 | -0.600000 | 0.000000 | 0.000000 | 0.000000 |
25% | 0.000000 | 0.000000 | 139.000000 | 28.000000 | 27.000000 | 27.000000 | 1.000000 | 45.000000 | 45.000000 | 45.000000 | ... | 45.000000 | 45.000000 | 45.000000 | 45.000000 | 30.100000 | 30.100000 | 0.000000 | 0.000000 | 0.000000 | 0.030000 |
50% | 0.000000 | 0.000000 | 197.000000 | 43.000000 | 42.000000 | 42.000000 | 1.000000 | 54.000000 | 54.000000 | 54.000000 | ... | 54.000000 | 54.000000 | 54.000000 | 54.000000 | 30.180000 | 30.180000 | 0.000000 | 0.000000 | 0.000000 | 8.379000 |
75% | 333.120000 | 3.000000 | 215.000000 | 60.000000 | 59.000000 | 60.000000 | 1.000000 | 62.000000 | 62.000000 | 62.000000 | ... | 62.000000 | 62.000000 | 62.000000 | 62.000000 | 30.250000 | 30.250000 | 0.000000 | 0.000000 | 0.000000 | 39.823500 |
max | 986.880000 | 10.000000 | 359.000000 | 98.000000 | 98.000000 | 98.000000 | 1.000000 | 83.000000 | 83.000000 | 83.000000 | ... | 83.000000 | 80.000000 | 80.000000 | 80.000000 | 30.610000 | 30.600000 | 0.600000 | 0.130000 | 0.370000 | 43.710000 |
8 rows × 27 columns
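# Use pd.DataFrame.corr to see what correlations can be identified between DC and the other features.
# select_dtypes keeps only the numeric columns; recent pandas versions raise an error on the datetime strings otherwise.
data_set.select_dtypes("number").corr(method="spearman")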
| | solarRadiation | uvHigh | winddirAvg | humidityHigh | humidityLow | humidityAvg | qcStatus | tempHigh | tempLow | tempAvg | ... | windchillAvg | heatindexHigh | heatindexLow | heatindexAvg | pressureMax | pressureMin | pressureTrend | precipRate | precipTotal | DC |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
solarRadiation | 1.000000 | 0.937831 | -0.342999 | -0.399324 | -0.408921 | -0.405694 | -0.051987 | 0.452500 | 0.444020 | 0.448101 | ... | 0.446924 | 0.452235 | 0.442935 | 0.447495 | 0.046178 | 0.045571 | -0.073512 | 0.001359 | 0.035785 | 0.814947 |
uvHigh | 0.937831 | 1.000000 | -0.316591 | -0.407405 | -0.418103 | -0.414418 | -0.048438 | 0.454691 | 0.445485 | 0.449924 | ... | 0.448278 | 0.454565 | 0.444371 | 0.449403 | 0.062698 | 0.062510 | -0.096479 | -0.043045 | 0.006510 | 0.700912 |
winddirAvg | -0.342999 | -0.316591 | 1.000000 | 0.319881 | 0.321128 | 0.320633 | -0.058951 | -0.362586 | -0.361985 | -0.362065 | ... | -0.361569 | -0.362126 | -0.361636 | -0.361757 | 0.196054 | 0.195846 | 0.011043 | -0.018541 | 0.003610 | -0.259023 |
humidityHigh | -0.399324 | -0.407405 | 0.319881 | 1.000000 | 0.998860 | 0.999456 | -0.132945 | -0.765681 | -0.763568 | -0.764586 | ... | -0.759300 | -0.765219 | -0.763032 | -0.764093 | 0.169556 | 0.169504 | 0.027204 | 0.197191 | 0.362307 | -0.102328 |
humidityLow | -0.408921 | -0.418103 | 0.321128 | 0.998860 | 1.000000 | 0.999736 | -0.132788 | -0.767797 | -0.765245 | -0.766474 | ... | -0.761256 | -0.767374 | -0.764697 | -0.765993 | 0.167956 | 0.167909 | 0.029168 | 0.196765 | 0.360056 | -0.109135 |
humidityAvg | -0.405694 | -0.414418 | 0.320633 | 0.999456 | 0.999736 | 1.000000 | -0.132995 | -0.766938 | -0.764562 | -0.765708 | ... | -0.760461 | -0.766498 | -0.764012 | -0.765218 | 0.168278 | 0.168241 | 0.028430 | 0.196955 | 0.361020 | -0.106906 |
qcStatus | -0.051987 | -0.048438 | -0.058951 | -0.132945 | -0.132788 | -0.132995 | 1.000000 | 0.050799 | 0.052421 | 0.051585 | ... | 0.058814 | 0.050671 | 0.052195 | 0.051358 | -0.130895 | -0.131655 | -0.013845 | 0.023923 | 0.051293 | -0.128036 |
tempHigh | 0.452500 | 0.454691 | -0.362586 | -0.765681 | -0.767797 | -0.766938 | 0.050799 | 1.000000 | 0.999030 | 0.999402 | ... | 0.998271 | 0.999708 | 0.998698 | 0.999141 | -0.451769 | -0.452192 | -0.045211 | -0.126698 | -0.170193 | 0.176902 |
tempLow | 0.444020 | 0.445485 | -0.361985 | -0.763568 | -0.765245 | -0.764562 | 0.052421 | 0.999030 | 1.000000 | 0.999544 | ... | 0.998397 | 0.998736 | 0.999677 | 0.999288 | -0.455311 | -0.455749 | -0.043467 | -0.125614 | -0.170110 | 0.170358 |
tempAvg | 0.448101 | 0.449924 | -0.362065 | -0.764586 | -0.766474 | -0.765708 | 0.051585 | 0.999402 | 0.999544 | 1.000000 | ... | 0.998819 | 0.999124 | 0.999221 | 0.999748 | -0.453805 | -0.454220 | -0.044260 | -0.126174 | -0.170251 | 0.173594 |
windspeedHigh | 0.397328 | 0.383542 | -0.321357 | -0.520153 | -0.516107 | -0.517716 | -0.011902 | 0.500599 | 0.503205 | 0.502059 | ... | 0.484797 | 0.500218 | 0.502660 | 0.501699 | -0.048984 | -0.049268 | -0.030898 | -0.107910 | -0.222856 | 0.242005 |
windgustLow | 0.330530 | 0.317268 | -0.276300 | -0.407748 | -0.403040 | -0.404775 | -0.000308 | 0.397769 | 0.401008 | 0.399604 | ... | 0.382369 | 0.397298 | 0.400612 | 0.399310 | -0.041768 | -0.041718 | -0.019151 | -0.086551 | -0.173471 | 0.214411 |
windspeedAvg | 0.387350 | 0.372748 | -0.316086 | -0.489531 | -0.485025 | -0.486786 | -0.003781 | 0.472159 | 0.475211 | 0.473931 | ... | 0.456297 | 0.471763 | 0.474679 | 0.473568 | -0.044306 | -0.044478 | -0.025686 | -0.101311 | -0.205448 | 0.245529 |
dewptHigh | -0.052317 | -0.050672 | 0.068162 | 0.567230 | 0.563345 | 0.565181 | -0.143246 | 0.050101 | 0.052756 | 0.051704 | ... | 0.057759 | 0.050857 | 0.053645 | 0.052601 | -0.235636 | -0.236398 | -0.020742 | 0.119437 | 0.299852 | 0.049395 |
dewptLow | -0.080215 | -0.080862 | 0.079540 | 0.593152 | 0.592413 | 0.593069 | -0.141533 | 0.017321 | 0.020574 | 0.019246 | ... | 0.025345 | 0.018005 | 0.021514 | 0.020128 | -0.227196 | -0.227908 | -0.017020 | 0.124784 | 0.308419 | 0.035530 |
dewptAvg | -0.066127 | -0.065882 | 0.074241 | 0.580213 | 0.577895 | 0.579206 | -0.140929 | 0.034221 | 0.037206 | 0.036009 | ... | 0.042133 | 0.034966 | 0.038134 | 0.036922 | -0.231846 | -0.232595 | -0.018377 | 0.122124 | 0.304470 | 0.041853 |
windchillHigh | 0.452330 | 0.454327 | -0.363098 | -0.764098 | -0.766250 | -0.765379 | 0.055493 | 0.999509 | 0.998592 | 0.998945 | ... | 0.998978 | 0.999217 | 0.998260 | 0.998684 | -0.453846 | -0.454269 | -0.045599 | -0.126133 | -0.168847 | 0.176803 |
windchillAvg | 0.446924 | 0.448278 | -0.361569 | -0.759300 | -0.761256 | -0.760461 | 0.058814 | 0.998271 | 0.998397 | 0.998819 | ... | 1.000000 | 0.997993 | 0.998074 | 0.998567 | -0.459162 | -0.459578 | -0.044587 | -0.124342 | -0.166472 | 0.173580 |
heatindexHigh | 0.452235 | 0.454565 | -0.362126 | -0.765219 | -0.767374 | -0.766498 | 0.050671 | 0.999708 | 0.998736 | 0.999124 | ... | 0.997993 | 1.000000 | 0.998743 | 0.999269 | -0.452040 | -0.452460 | -0.045153 | -0.126702 | -0.170172 | 0.176960 |
heatindexLow | 0.442935 | 0.444371 | -0.361636 | -0.763032 | -0.764697 | -0.764012 | 0.052195 | 0.998698 | 0.999677 | 0.999221 | ... | 0.998074 | 0.998743 | 1.000000 | 0.999439 | -0.455523 | -0.455950 | -0.043394 | -0.125619 | -0.170043 | 0.170265 |
heatindexAvg | 0.447495 | 0.449403 | -0.361757 | -0.764093 | -0.765993 | -0.765218 | 0.051358 | 0.999141 | 0.999288 | 0.999748 | ... | 0.998567 | 0.999269 | 0.999439 | 1.000000 | -0.454029 | -0.454441 | -0.044092 | -0.126179 | -0.170233 | 0.173561 |
pressureMax | 0.046178 | 0.062698 | 0.196054 | 0.169556 | 0.167956 | 0.168278 | -0.130895 | -0.451769 | -0.455311 | -0.453805 | ... | -0.459162 | -0.452040 | -0.455523 | -0.454029 | 1.000000 | 0.998638 | -0.016431 | -0.101865 | -0.224125 | 0.081685 |
pressureMin | 0.045571 | 0.062510 | 0.195846 | 0.169504 | 0.167909 | 0.168241 | -0.131655 | -0.452192 | -0.455749 | -0.454220 | ... | -0.459578 | -0.452460 | -0.455950 | -0.454441 | 0.998638 | 1.000000 | -0.016344 | -0.103346 | -0.224628 | 0.081169 |
pressureTrend | -0.073512 | -0.096479 | 0.011043 | 0.027204 | 0.029168 | 0.028430 | -0.013845 | -0.045211 | -0.043467 | -0.044260 | ... | -0.044587 | -0.045153 | -0.043394 | -0.044092 | -0.016431 | -0.016344 | 1.000000 | 0.018658 | 0.018448 | -0.027936 |
precipRate | 0.001359 | -0.043045 | -0.018541 | 0.197191 | 0.196765 | 0.196955 | 0.023923 | -0.126698 | -0.125614 | -0.126174 | ... | -0.124342 | -0.126702 | -0.125619 | -0.126179 | -0.101865 | -0.103346 | 0.018658 | 1.000000 | 0.410878 | 0.094109 |
precipTotal | 0.035785 | 0.006510 | 0.003610 | 0.362307 | 0.360056 | 0.361020 | 0.051293 | -0.170193 | -0.170110 | -0.170251 | ... | -0.166472 | -0.170172 | -0.170043 | -0.170233 | -0.224125 | -0.224628 | 0.018448 | 0.410878 | 1.000000 | 0.140794 |
DC | 0.814947 | 0.700912 | -0.259023 | -0.102328 | -0.109135 | -0.106906 | -0.128036 | 0.176902 | 0.170358 | 0.173594 | ... | 0.173580 | 0.176960 | 0.170265 | 0.173561 | 0.081685 | 0.081169 | -0.027936 | 0.094109 | 0.140794 | 1.000000 |
27 rows × 27 columns
Observation: In relation to DC, there appears to be a strong correlation with:
- solarRadiation (0.8)
- uvHigh (0.7)

and a weak correlation with:
- tempHigh
- tempLow
- tempAvg
- windchillAvg
- heatindexHigh
- heatindexLow
- heatindexAvg
- precipTotal
Does this reflect any information gathered from our research?
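Reading one column out of a 27 × 27 matrix is error-prone, so it can help to pull out just the `DC` column and sort it. The following is a minimal sketch on a made-up numeric frame (the values are illustrative, not taken from D1.csv); `dc_corr` and `ranked` are names introduced for this example.

```python
import pandas as pd

# Made-up numeric sample (illustrative only, not taken from D1.csv).
sample = pd.DataFrame({
    "solarRadiation": [0.0, 100.0, 300.0, 600.0, 900.0],
    "humidityAvg":    [60.0, 55.0, 40.0, 45.0, 20.0],
    "DC":             [0.1, 5.0, 20.0, 40.0, 43.0],
})

# Spearman correlation of every column with DC, excluding DC itself.
dc_corr = sample.corr(method="spearman")["DC"].drop("DC")

# Order by absolute strength so strong negative correlations are not buried.
ranked = dc_corr.reindex(dc_corr.abs().sort_values(ascending=False).index)
print(ranked)
```

Sorting by absolute value is a design choice: a feature with a correlation of -0.9 is as informative for a tree-based model as one with +0.9.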
We will split the data into features and labels and convert them into arrays to be used for our model.
import numpy as np
# we want to predict DC
labels = np.array(data_set['DC'])
# Remove the labels and unimportant features from the features list.
col = [
'weather_datetime',
'solar_datetime',
'winddirAvg',
'humidityHigh',
'humidityLow',
'humidityAvg',
'heatindexLow',
'heatindexHigh',
'heatindexAvg',
'qcStatus',
'windspeedHigh',
'windgustLow',
'windspeedAvg',
'dewptHigh',
'dewptLow',
'dewptAvg',
'windchillHigh',
'windchillAvg',
'pressureMax',
'pressureMin',
'pressureTrend',
'precipRate',
'precipTotal',
'DC']
features= data_set.drop(col, axis = 1)
feature_list = list(features.columns)
features = np.array(features)
Split the data into train and test sets.
from sklearn.model_selection import train_test_split
# Note that the test size is low on purpose: we have a separate test set, so I want to use as much of this data as possible for training.
train_features, test_features, train_labels, test_labels = train_test_split(features, labels, test_size = 0.1)
print('Training Features Shape:', train_features.shape)
print('Training Labels Shape:', train_labels.shape)
print('Testing Features Shape:', test_features.shape)
print('Testing Labels Shape:', test_labels.shape)
Training Features Shape: (7164, 5)
Training Labels Shape: (7164,)
Testing Features Shape: (796, 5)
Testing Labels Shape: (796,)
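Because `train_test_split` shuffles the rows, each run of the notebook produces a different split and slightly different scores. The following is a minimal sketch of fixing the shuffle with `random_state`; the arrays are made up, and the seed value 42 is an arbitrary choice for this example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up arrays standing in for the real features and labels.
X = np.arange(40, dtype=float).reshape(20, 2)
y = np.arange(20, dtype=float)

# random_state=42 is an arbitrary seed chosen for this sketch.
split_a = train_test_split(X, y, test_size=0.1, random_state=42)
split_b = train_test_split(X, y, test_size=0.1, random_state=42)

# Both calls return identical test labels because the seed is the same.
print(np.array_equal(split_a[3], split_b[3]))
print(split_a[0].shape, split_a[1].shape)
```

Note that a shuffled split mixes readings from neighbouring five-minute intervals into both sets, so the held-out score is likely optimistic for a forecasting task; a time-ordered split would be a stricter check.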
# the features we will be using to predict DC
feature_list
['solarRadiation', 'uvHigh', 'tempHigh', 'tempLow', 'tempAvg']
Hyperparameter tuning helps us figure out which parameters work best for building the model. It's much better than guessing. Although it isn't perfect, it gives us some clues on what to try.
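A randomized search draws each candidate setting from a distribution per parameter, so the ranges of those distributions decide what can be tried at all. The following is a minimal sketch of how `scipy.stats.uniform` and `scipy.stats.randint` define ranges; the bounds are assumptions picked for illustration, not tuned values.

```python
from scipy.stats import randint, uniform

# uniform(loc, scale) covers the interval [loc, loc + scale].
# The bounds here are assumptions for illustration, not tuned values.
learning_rate_dist = uniform(loc=0.01, scale=0.29)   # 0.01 to 0.30
subsample_dist = uniform(loc=0.5, scale=0.5)         # 0.5 to 1.0

# randint(low, high) draws integers from low up to, but not including, high.
max_depth_dist = randint(2, 11)                      # 2 to 10

# Draw a few candidates with a fixed seed so the draws are repeatable.
lr_draws = learning_rate_dist.rvs(size=5, random_state=0)
sub_draws = subsample_dist.rvs(size=5, random_state=0)
depth_draws = max_depth_dist.rvs(size=5, random_state=0)
print(lr_draws, sub_draws, depth_draws)
```

Called with no arguments, `uniform()` covers 0 to 1. That is a wide range for a learning rate, and for `subsample` it allows values close to 0, which leave very few rows for each tree, so narrower bounds may make the search more efficient.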
from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor
from scipy.stats import uniform as sp_randFloat
from scipy.stats import randint as sp_randInt
Gradient Boost
gradient_boost_model = GradientBoostingRegressor()
gradient_params = {'learning_rate': sp_randFloat(),
'subsample' : sp_randFloat(),
'n_estimators' : sp_randInt(200, 2000),
'max_depth' : sp_randInt(10, 110)
}
random_gradient = RandomizedSearchCV(estimator= gradient_boost_model, param_distributions = gradient_params, cv = 3, verbose=2, n_iter = 100, n_jobs=-1)
random_gradient.fit(train_features, train_labels)
Fitting 3 folds for each of 100 candidates, totalling 300 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done 33 tasks | elapsed: 2.9min
[Parallel(n_jobs=-1)]: Done 154 tasks | elapsed: 14.9min
[Parallel(n_jobs=-1)]: Done 300 out of 300 | elapsed: 28.2min finished
RandomizedSearchCV(cv=3, estimator=GradientBoostingRegressor(), n_iter=100,
n_jobs=-1,
param_distributions={'learning_rate': <scipy.stats._distn_infrastructure.rv_frozen object at 0x7fe6ae2aedd0>,
'max_depth': <scipy.stats._distn_infrastructure.rv_frozen object at 0x7fe6ae2aecd0>,
'n_estimators': <scipy.stats._distn_infrastructure.rv_frozen object at 0x7fe6ae2ae050>,
'subsample': <scipy.stats._distn_infrastructure.rv_frozen object at 0x7fe6ae2aee10>},
verbose=2)
# Results from Random Search
print(" Results from Random Search " )
print("\n The best estimator across ALL searched params:\n", random_gradient.best_estimator_)
print("\n The best score across ALL searched params:\n", random_gradient.best_score_)
print("\n The best parameters across ALL searched params:\n", random_gradient.best_params_)
print(random_gradient.score(test_features , test_labels))
Results from Random Search
The best estimator across ALL searched params:
GradientBoostingRegressor(learning_rate=0.01794706377831745, max_depth=32,
n_estimators=785, subsample=0.2873167459093807)
The best score across ALL searched params:
0.9655473309872132
The best parameters across ALL searched params:
{'learning_rate': 0.01794706377831745, 'max_depth': 32, 'n_estimators': 785, 'subsample': 0.2873167459093807}
0.9688789434674687
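The best score reported by the search is a mean cross-validated score (R² by default for regressors), and the per-candidate details are kept in `cv_results_`. The following is a minimal sketch on a small synthetic problem with a deliberately tiny search so that it runs in seconds; the data, parameter ranges, and seeds are all assumptions made for this example.

```python
import numpy as np
import pandas as pd
from scipy.stats import randint, uniform
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import RandomizedSearchCV

# Small synthetic regression problem (made up for this sketch).
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(120, 3))
y = 40 * X[:, 0] + rng.normal(0, 1, size=120)

# Deliberately tiny search so it finishes in seconds.
search = RandomizedSearchCV(
    estimator=GradientBoostingRegressor(random_state=0),
    param_distributions={
        "learning_rate": uniform(0.01, 0.29),
        "n_estimators": randint(20, 60),
        "max_depth": randint(2, 5),
    },
    n_iter=5,
    cv=3,
    random_state=0,
)
search.fit(X, y)

# One row per sampled candidate, best mean cross-validated score first.
results = pd.DataFrame(search.cv_results_)
columns = ["params", "mean_test_score", "std_test_score", "rank_test_score"]
print(results[columns].sort_values("rank_test_score"))
```

Looking at the whole table rather than only `best_params_` shows whether the top candidates cluster in one region of the search space or whether the winner is an outlier.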
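# Instantiate a random forest with 785 decision trees, reusing n_estimators and
# max_depth from the gradient boosting search above.
# The criterion is left at its default (squared error, named "mse" before scikit-learn 1.0).
rf = RandomForestRegressor(n_estimators = 785,
max_depth = 32,
min_samples_split = 2)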
# Train the model on training data
rf.fit(train_features, train_labels)
RandomForestRegressor(max_depth=32, n_estimators=785)
Let's see how well our model scores on the held-out split of the training data.
y_pred = rf.predict(test_features)
from sklearn.metrics import r2_score
r2_score(test_labels, y_pred)
0.9709460517127418
Comment: Our model has an R² score of about 0.97 on the held-out split! That's not bad at all.
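`r2_score` returns the coefficient of determination, the share of variance in the labels that the predictions explain, rather than a classification-style accuracy. Error metrics in the units of the target can complement it. The following is a minimal sketch on made-up values; `y_true` and `y_hat` are illustrative arrays, not model output.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Made-up true and predicted values (illustrative only).
y_true = np.array([0.0, 10.0, 20.0, 40.0])
y_hat = np.array([1.0, 9.0, 22.0, 38.0])

# R^2 is unitless; MAE and RMSE are in the same units as the target.
r2 = r2_score(y_true, y_hat)
mae = mean_absolute_error(y_true, y_hat)
rmse = np.sqrt(mean_squared_error(y_true, y_hat))
print(r2, mae, rmse)
```

In the notebook, the same three calls could take `test_labels` and `y_pred` to report how far off the predicted DC values are on average.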
Now we will test our model using the test set. Remember that whatever we did to the training set must also be done to the testing set!
test_set = pd.read_csv("cahsi_data_2020/D2.csv")
col = [
'weather_datetime',
'solar_datetime',
'winddirAvg',
'humidityHigh',
'humidityLow',
'humidityAvg',
'heatindexLow',
'heatindexHigh',
'heatindexAvg',
'qcStatus',
'windspeedHigh',
'windgustLow',
'windspeedAvg',
'dewptHigh',
'dewptLow',
'dewptAvg',
'windchillHigh',
'windchillAvg',
'pressureMax',
'pressureMin',
'pressureTrend',
'precipRate',
'precipTotal']
testset_features = test_set.drop(col, axis = 1)
testset_features = np.array(testset_features)
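Dropping columns by name works as long as both files have exactly the same layout. A stricter option is to select the training feature names explicitly, which also fixes the column order the model expects. The following is a minimal sketch; `select_features` is a hypothetical helper introduced for this example, and the frame is made-up data.

```python
import pandas as pd

def select_features(frame, feature_columns):
    """Return the model inputs in a fixed column order, or fail loudly."""
    missing = [c for c in feature_columns if c not in frame.columns]
    if missing:
        raise KeyError(f"missing feature columns: {missing}")
    return frame[feature_columns].to_numpy()

# Same five names as feature_list in the training step.
feature_columns = ["solarRadiation", "uvHigh", "tempHigh", "tempLow", "tempAvg"]

# Made-up test frame whose columns arrive in a different order.
sample_test = pd.DataFrame({
    "tempAvg": [51], "tempLow": [50], "tempHigh": [51],
    "uvHigh": [0.0], "solarRadiation": [0.0], "humidityAvg": [40],
})
print(select_features(sample_test, feature_columns))
```

Because the model was trained on plain NumPy arrays, it cannot detect a reordered or extra column on its own, so a check like this is a cheap safeguard.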
# Use the forest's predict method on the test data
predictions = rf.predict(testset_features)
predictions
array([1.20963236, 1.20963236, 0.70364704, ..., 6.45444127, 6.45444127,
6.45444127])
print('Predictions:\n', predictions)
# Write one prediction per line; the with-block closes the file automatically.
with open("answer.txt", "w") as file:
    for num in predictions:
        file.write(str(num))
        file.write("\n")
Predictions:
[1.20963236 1.20963236 0.70364704 ... 6.45444127 6.45444127 6.45444127]
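As an alternative to writing the file line by line, NumPy can save and reload a one-dimensional array directly. The following is a minimal sketch, assuming eight decimal places are enough for the answer file (the `fmt` value is an assumption); the array is made up and the file goes to a temporary directory.

```python
import os
import tempfile

import numpy as np

# Made-up predictions standing in for the model output.
predictions = np.array([1.20963236, 0.70364704, 6.45444127])

# Write to a temporary directory so the sketch leaves no files behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "answer.txt")
    np.savetxt(path, predictions, fmt="%.8f")   # one value per line
    restored = np.loadtxt(path)                 # read back to verify

print(restored)
```

Reading the file back is a quick way to confirm that the number of lines matches the number of rows in the test set before submitting it.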