https://www.kaggle.com/c/LANL-Earthquake-Prediction
- Dataset details, such as number of features, instances, data distribution
- acoustic_data - the seismic signal [int16]
- time_to_failure - the time (in seconds) until the next laboratory earthquake [float64]
Training Data instances: 629 million points
statistic | signal |
---|---|
count | 1.000000e+07 |
mean | 4.502072e+00 |
std | 1.780707e+01 |
min | -4.621000e+03 |
25% | 2.000000e+00 |
50% | 4.000000e+00 |
75% | 7.000000e+00 |
max | 3.252000e+03 |
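
Summary statistics like the ones above can be produced with pandas. The following is a minimal sketch, not the exact code used for the table. It assumes the training CSV has the columns `acoustic_data` and `time_to_failure`, that `signal` and `quaketime` are simply renamed versions of those columns, and that only the first 10 million rows are read (which would match the `count` shown). `load_training_sample` is a hypothetical helper name.

```python
import io

import pandas as pd


def load_training_sample(source, nrows=10_000_000):
    """Read the first `nrows` rows with compact dtypes and rename the columns."""
    df = pd.read_csv(
        source,
        nrows=nrows,
        dtype={"acoustic_data": "int16", "time_to_failure": "float64"},
    )
    # Assumption: the table's column names are renamed versions of the CSV columns.
    return df.rename(
        columns={"acoustic_data": "signal", "time_to_failure": "quaketime"}
    )


# Tiny in-memory stand-in for the real file so the sketch runs anywhere.
demo_csv = io.StringIO(
    "acoustic_data,time_to_failure\n"
    "12,1.4691\n"
    "6,1.4691\n"
    "8,1.4690\n"
    "5,1.4690\n"
)
sample = load_training_sample(demo_csv)
summary = sample.describe()
print(summary)
```

With the real data, the same function would be called with the path to the training file instead of `demo_csv`.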
- seg_id - the test segment IDs for which predictions should be made (one prediction per segment)
- acoustic_data - the seismic signal [int16] for which the prediction is made.
Test Data instances: 2624 files, with 150,000 instances for each file => 393,600,000 instances
- SVM
- Gradient Boosting
- Random Forests
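
Because the target (time to failure) is continuous, the three model families listed above would be used in their regression form. The sketch below shows one way to set them up with scikit-learn. The hyperparameters, the `build_models` helper, and the synthetic data are all assumptions for illustration and are not taken from this project.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.svm import SVR


def build_models(random_state=0):
    """Return one regressor per model family (hyperparameters are placeholders)."""
    return {
        "svm": SVR(kernel="rbf", C=1.0),
        "gradient_boosting": GradientBoostingRegressor(random_state=random_state),
        "random_forest": RandomForestRegressor(
            n_estimators=100, random_state=random_state
        ),
    }


# Synthetic stand-in for a feature matrix (rows = segments, columns = features).
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 4))
y = np.abs(X[:, 0]) + 0.1 * rng.normal(size=60)

predictions = {}
for name, model in build_models().items():
    model.fit(X[:40], y[:40])                  # fit on the earlier rows
    predictions[name] = model.predict(X[40:])  # predict the later rows
```
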
- Divide the training data into chunks of 150,000 data points, because each test segment consists of 150,000 points
- We do not create a validation dataset because the input is contiguous data from a sensor. A validation set built by randomly sampling points would break the temporal ordering of the signal and would not give reliable results
- Scale the data
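
The chunking and scaling steps above can be sketched with NumPy as follows. Two things here are assumptions rather than details stated in this document: the target for each chunk is taken to be the `time_to_failure` value at the chunk's last point, and "scaling" is implemented as standardization (zero mean, unit variance) using statistics from the training features only. The function names are hypothetical.

```python
import numpy as np

SEGMENT_LENGTH = 150_000  # same length as each test segment


def split_into_segments(signal, time_to_failure, segment_length=SEGMENT_LENGTH):
    """Cut the contiguous signal into non-overlapping chunks.

    Assumption: each chunk's target is the time to failure at its last point.
    Trailing points that do not fill a whole chunk are dropped.
    """
    n_segments = len(signal) // segment_length
    usable = n_segments * segment_length
    segments = np.asarray(signal[:usable]).reshape(n_segments, segment_length)
    targets = np.asarray(time_to_failure[:usable]).reshape(
        n_segments, segment_length
    )[:, -1]
    return segments, targets


def standardize(train_features, other_features):
    """Scale both matrices with the mean and std of the training features."""
    mean = train_features.mean(axis=0)
    std = train_features.std(axis=0)
    std = np.where(std == 0, 1.0, std)  # avoid dividing by zero for constant columns
    return (train_features - mean) / std, (other_features - mean) / std


# Small demonstration with a chunk length of 4 instead of 150,000.
signal = np.arange(10, dtype=np.int16)
ttf = np.linspace(1.0, 0.1, 10)
segments, targets = split_into_segments(signal, ttf, segment_length=4)

features = np.column_stack([segments.mean(axis=1), segments.std(axis=1)])
scaled_train, scaled_other = standardize(features, features)
```
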
Feature generation: Create several groups of features:
- Usual aggregations: mean, std, min and max
- Average difference between consecutive values, in absolute and percent terms;
- Absolute min and max values;
- Aforementioned aggregations for the first and last 10,000 and 50,000 values - I think these should be useful;
- Ratio of max value to min value and their difference, plus the count of values bigger than 500 (arbitrary threshold);
- Quantile features
- Trend features
- Rolling features
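
The feature groups above could be computed per segment roughly as shown below. This is a sketch under assumptions, not the project's actual feature code: the exact quantiles, the rolling window size, the use of a least-squares slope as the trend feature, and all feature names are illustrative choices, and `segment_features` is a hypothetical helper.

```python
import numpy as np
import pandas as pd


def segment_features(segment, threshold=500, window=1000,
                     edge_sizes=(10_000, 50_000)):
    """Compute one dictionary of features for a single segment."""
    x = pd.Series(np.asarray(segment, dtype=np.float64))
    values = x.to_numpy()
    diffs = np.diff(values)
    prev = values[:-1]
    nonzero = prev != 0  # skip zero denominators in the percent change

    feats = {
        # Usual aggregations
        "mean": x.mean(), "std": x.std(), "min": x.min(), "max": x.max(),
        # Average change between consecutive values (absolute and relative)
        "mean_abs_diff": np.abs(diffs).mean(),
        "mean_rel_diff": (diffs[nonzero] / prev[nonzero]).mean()
        if nonzero.any() else np.nan,
        # Absolute min and max
        "abs_min": x.abs().min(), "abs_max": x.abs().max(),
        # Max-to-min ratio, their difference, and count above the threshold
        "max_to_min": x.max() / x.min() if x.min() != 0 else np.nan,
        "max_min_diff": x.max() - x.min(),
        "count_big": int((x > threshold).sum()),
    }

    # Same aggregations on the first and last n values of the segment
    for n in edge_sizes:
        for name, part in (("first", x.iloc[:n]), ("last", x.iloc[-n:])):
            feats[f"mean_{name}_{n}"] = part.mean()
            feats[f"std_{name}_{n}"] = part.std()
            feats[f"min_{name}_{n}"] = part.min()
            feats[f"max_{name}_{n}"] = part.max()

    # Quantile features (the chosen quantiles are an assumption)
    for q in (0.01, 0.05, 0.95, 0.99):
        feats[f"q{round(q * 100):02d}"] = x.quantile(q)

    # Trend feature: slope of a straight line fitted to the segment
    feats["trend"] = np.polyfit(np.arange(len(values)), values, 1)[0]

    # Rolling feature: mean of the rolling standard deviation
    feats["roll_std_mean"] = x.rolling(window).std().dropna().mean()
    return feats


# Small demonstration on ten points with a low threshold and short window.
features = segment_features(np.arange(1, 11), threshold=5, window=3)
```

For real segments of 150,000 points, the defaults (threshold 500, first/last 10,000 and 50,000 values) correspond to the values named in the list above.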
- Python
- Chandra Kiran Saladi ( cxs172130 )
- Shreyash Mane ( ssm170730 )
- Tanya Tukade ( txt171230 )
- Supraja Ponnur ( sxp179130 )