Predicting biomass using satellite images: https://zindi.africa/competitions/africa-biomass-challenge
Clone the project from GitHub:
git clone https://github.com/pmwaniki/biomass-prediction
cd biomass-prediction
The Conda environment can be set up using the environment.yml file:
conda env create -f environment.yml
conda activate satellite
Edit the settings.py file to specify the directories where the data (data_dir), results (result_dir), and logs (log_dir) will be stored.
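For reference, a minimal sketch of what settings.py might contain; the paths below are placeholders, and only the variable names data_dir, result_dir, and log_dir come from the description above:

# settings.py -- illustrative values only; point these at your own directories
data_dir = "/path/to/data"        # downloaded and preprocessed datasets
result_dir = "/path/to/results"   # model outputs and submission files
log_dir = "/path/to/logs"         # training and hyper-parameter tuning logs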
Download the train, validation, and test datasets and place them in the path defined by data_dir in settings.py. Run python preprocessing.py to preprocess the data. The train, validation, and test datasets are combined into a single dataset and saved to the HDF5 file train_combined.h5. A CSV file (Train.csv) containing the longitude, latitude, biomass, and mean cloud cover of each image is also created. In addition, K-Means clustering on longitude and latitude is used to create 5 clusters. The clusters are used during model training to define the folds for K-fold cross-validation, which ensures that the training and validation datasets do not overlap geographically.
The mean and variance of each band were computed and saved in band_statistics.joblib, to be used to scale the images during model training.
python preprocessing.py
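A minimal sketch of the spatial clustering and band-statistics steps described above, assuming scikit-learn's KMeans, h5py, and joblib; the CSV column names and the HDF5 dataset key are assumptions, not taken from the repository:

import h5py
import joblib
import pandas as pd
from sklearn.cluster import KMeans

train = pd.read_csv("Train.csv")  # longitude, latitude, biomass, mean cloud cover per image

# Cluster images by location so that the cross-validation folds are spatially disjoint.
kmeans = KMeans(n_clusters=5, random_state=42, n_init=10)
train["cluster"] = kmeans.fit_predict(train[["longitude", "latitude"]])

# Per-band statistics used later to standardize the images during training.
with h5py.File("train_combined.h5", "r") as f:
    images = f["images"][:]  # assumed key; shape (n_images, n_bands, height, width)
band_mean = images.mean(axis=(0, 2, 3))
band_var = images.var(axis=(0, 2, 3))
joblib.dump({"mean": band_mean, "variance": band_var}, "band_statistics.joblib")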
The solution can be found in regression_sgd.py. The process of hyper-parameter tuning was repeated 5 times (once for each cluster) and the predictions were aggregated using an unweighted mean. There was a suspicion that the model predictions were poorly calibrated, given that the biomass values in the training and submission datasets were obtained using different methods (GEDI vs. GIZ). Therefore, in an effort to improve calibration, a constant value was added to the submission predictions; a value of 65 gave the best score on the public leaderboard.
ray start --head
python regression_sgd.py
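A hedged sketch of how the per-cluster predictions could be aggregated and the calibration offset applied, as described above; the file names and submission columns are assumptions for illustration:

import numpy as np
import pandas as pd

# One prediction array per cluster-specific model, all aligned to the submission rows.
fold_predictions = [np.load(f"predictions_cluster_{k}.npy") for k in range(5)]

# Unweighted mean across the 5 models, then the constant calibration offset.
ensemble = np.mean(np.stack(fold_predictions, axis=0), axis=0)
calibrated = ensemble + 65.0

submission = pd.DataFrame({"ID": np.arange(len(calibrated)), "biomass": calibrated})
submission.to_csv("submission.csv", index=False)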
A convolutional neural network (CNN) was used. The layers of the model were as follows:
- Group normalization
- Dropout (2D)
- Initial convolution layer
- One or more residual blocks
- Final convolution (number of output channels equals the number of outputs)
- Average pooling (compute mean for each channel)
import torch
import torch.nn as nn
import torch.nn.functional as F

class GaussianNoise(nn.Module):
    """Adds Gaussian noise to the input during training only."""
    def __init__(self, stddev):
        super().__init__()
        self.stddev = stddev

    def forward(self, din):
        if self.stddev == 0.0:
            return din
        if self.training:
            return din + torch.randn(din.size(), device=din.device) * self.stddev
        return din
def hidden_block(dim_hidden, dim_hidden2, noise_sd, kernel_size=3, padding=1):
    return nn.Sequential(
        nn.GroupNorm(1, dim_hidden),
        GaussianNoise(stddev=noise_sd),
        nn.Conv2d(dim_hidden, dim_hidden2, kernel_size=(kernel_size, kernel_size),
                  padding=padding, stride=1, bias=False),
    )
class ResidualBlock(nn.Module):
    def __init__(self, dim_hidden, dim_hidden2, noise_sd, dropout=0.0, kernel_size=3, padding=1):
        super().__init__()
        self.dropout = dropout
        self.block1 = hidden_block(dim_hidden, dim_hidden2, noise_sd, kernel_size=kernel_size, padding=padding)
        self.block2 = hidden_block(dim_hidden2, dim_hidden2, noise_sd, kernel_size=kernel_size, padding=padding)
        self.block3 = hidden_block(dim_hidden2, dim_hidden2, noise_sd=0.0, kernel_size=kernel_size, padding=padding)
        # 1x1 convolution so the identity branch matches the output channel dimension.
        self.bottleneck = nn.Conv2d(dim_hidden, dim_hidden2, kernel_size=(1, 1), padding=0)

    def forward(self, x):
        identity = x
        out = F.leaky_relu(self.block1(x), 0.01)
        out = F.dropout2d(out, self.dropout)
        out = F.leaky_relu(self.block2(out), 0.01)
        out = F.dropout2d(out, self.dropout)
        out = self.block3(out)
        out = out + self.bottleneck(identity)  # residual (skip) connection
        out = F.leaky_relu(out, 0.01)
        return out
class CNN(nn.Module):
    def __init__(self, dropout=0.01, hidden_dim=32, dim_input=12, dim_out=1, normalization_groups=0,
                 gaussian_sd=0, n_blocks=1, expansion=2, kernel_size=3):
        super().__init__()
        # Padding that preserves spatial dimensions for the supported kernel sizes.
        if kernel_size == 1:
            padding = 0
        elif kernel_size == 3:
            padding = 1
        elif kernel_size == 5:
            padding = 2
        elif kernel_size == 7:
            padding = 3
        else:
            raise Exception(f"CNN not implemented for kernel_size {kernel_size}")
        self.dropout = dropout
        # Optional group normalization of the 12 input bands.
        if normalization_groups == 0:
            self.input_norm = nn.Identity()
        else:
            self.input_norm = nn.GroupNorm(normalization_groups, 12)
        self.conv1 = nn.Conv2d(dim_input, hidden_dim, kernel_size=(kernel_size, kernel_size),
                               stride=1, padding=padding, bias=False)
        # Channel counts grow by `expansion` with each residual block.
        block_sizes = [hidden_dim]
        for i in range(n_blocks):
            block_sizes.append(block_sizes[i] * expansion)
        self.blocks = nn.Sequential(*[ResidualBlock(block_sizes[l - 1], block_sizes[l], gaussian_sd,
                                                    kernel_size=kernel_size, padding=padding)
                                      for l in range(1, len(block_sizes))])
        self.conv_final = nn.Conv2d(block_sizes[-1], dim_out, kernel_size=(kernel_size, kernel_size),
                                    stride=1, padding=padding)
        # Weight initialization.
        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                nn.init.normal_(m.weight, mean=0.0, std=0.01)
            elif isinstance(m, (nn.BatchNorm2d, nn.GroupNorm)):
                nn.init.constant_(m.weight, 1)
                nn.init.constant_(m.bias, 0)

    def forward(self, x):
        x = self.input_norm(x)
        x = F.dropout2d(x, self.dropout)
        x = F.leaky_relu(self.conv1(x), 0.1)
        x = self.blocks(x)
        x = self.conv_final(x)
        x = torch.mean(x, dim=(2, 3))  # global average pooling over the spatial dimensions
        return x
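As a quick sanity check, the model can be instantiated and run on a dummy batch; the batch size, 15x15 patch size, and hyper-parameter values below are illustrative assumptions:

model = CNN(dropout=0.1, hidden_dim=32, dim_input=12, dim_out=1,
            normalization_groups=4, gaussian_sd=0.05, n_blocks=2, expansion=2, kernel_size=3)
model.eval()
dummy = torch.randn(8, 12, 15, 15)   # (batch, bands, height, width)
with torch.no_grad():
    out = model(dummy)
print(out.shape)  # torch.Size([8, 1]): one biomass prediction per image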
Data augmentation was done using the kornia library, by applying a random horizontal flip, a random vertical flip, and Gaussian blur.
import kornia as K
from kornia.augmentation import AugmentationSequential

def augmentation(config):
    # `device` is assumed to be defined elsewhere in the script (e.g. "cuda" or "cpu").
    aug_list = AugmentationSequential(
        K.augmentation.RandomHorizontalFlip(p=0.5),
        K.augmentation.RandomVerticalFlip(p=0.5),
        # K.RandomAffine(degrees=(0, 90), p=0.25),
        K.augmentation.RandomGaussianBlur(kernel_size=(config['blur_kernel_size'], config['blur_kernel_size']),
                                          sigma=(0.01, 2.0), p=config['prop_noise']),
        data_keys=["input", ],
        same_on_batch=False,
        random_apply=10, keepdim=True,
    ).to(device)
    return aug_list
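For illustration, the returned pipeline could be applied to a batch of image tensors as below; the config values and batch shape are placeholders, and device must be defined before augmentation() is called since the function refers to it:

device = "cuda" if torch.cuda.is_available() else "cpu"
aug = augmentation({"blur_kernel_size": 3, "prop_noise": 0.25})
batch = torch.randn(8, 12, 15, 15, device=device)  # assumed 12-band image patches
augmented = aug(batch)                              # same shape as the input batch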
The library ray[tune] was used for hyper-parameter optimization, with the ASHA algorithm selecting the best hyper-parameters (see the sketch after this list). The tuned hyper-parameters include:
- Learning rate
- Weight decay
- Clip norm (gradient clipping)
- Dropout
- Number of normalization groups (Group Norm)
- Standard deviation of Gaussian noise layers
- Number of channels for convolution layers
- Number of residual blocks
- Kernel size of convolution layers
- Kernel size of Gaussian blur layers (data augmentation)
- Proportion of observations with Gaussian blur
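A hedged sketch of how such a search could be wired up with Ray Tune's ASHA scheduler. The search-space ranges and the train_model trainable below are illustrative assumptions rather than the configuration used in regression_sgd.py, and the exact metric-reporting call (tune.report here) varies across Ray versions:

from ray import tune
from ray.tune.schedulers import ASHAScheduler

def train_model(config):
    # Placeholder trainable: the real script would build the CNN from `config`,
    # train it, and report the validation loss after each epoch.
    for epoch in range(10):
        tune.report(loss=config["lr"] / (epoch + 1))

search_space = {
    "lr": tune.loguniform(1e-4, 1e-1),
    "weight_decay": tune.loguniform(1e-6, 1e-2),
    "clip_norm": tune.choice([1.0, 5.0, 10.0]),
    "dropout": tune.uniform(0.0, 0.5),
    "normalization_groups": tune.choice([0, 1, 4, 12]),
    "gaussian_sd": tune.uniform(0.0, 0.5),
    "hidden_dim": tune.choice([16, 32, 64]),
    "n_blocks": tune.choice([1, 2, 3]),
    "kernel_size": tune.choice([1, 3, 5, 7]),
    "blur_kernel_size": tune.choice([3, 5]),
    "prop_noise": tune.uniform(0.0, 0.5),
}

analysis = tune.run(
    train_model,
    config=search_space,
    scheduler=ASHAScheduler(metric="loss", mode="min", grace_period=2),
    num_samples=20,
)
print(analysis.get_best_config(metric="loss", mode="min"))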