CNN training on MNIST does not converge #145

milancurcic · 2023-06-22T15:38:07Z

cnn_mnist example which trains a CNN network on MNIST data stays at random (10%) accuracy over epochs;
cnn_from_keras example which loads a pre-trained CNN from Keras and achieves expected high accuracy (90.14%)

The above suggests that the forward passes of conv2d, maxpool2d, and flatten layers are implemented correctly.

The culprit may be in the implementation of backward methods for any of these layers, or in the backward flow of data.

This should be fixed before the release of v0.13.0.

The text was updated successfully, but these errors were encountered:

certik · 2024-03-21T01:30:02Z

Here is an example output that I am getting:

$ fpm run --example cnn_mnist --profile release --flag "-fno-frontend-optimize -I$CONDA_PREFIX/include -L$CONDA_PREFIX/lib -Wl,-rpath -Wl,$CONDA_PREFIX/lib"
Layer: input
------------------------------------------------------------
Output shape: 784
Parameters: 0

Layer: reshape
------------------------------------------------------------
Input shape: 784
Output shape: 1 28 28
Parameters: 0
Activation: 

Layer: conv2d
------------------------------------------------------------
Input shape: 1 28 28
Output shape: 8 26 26
Parameters: 80
Activation: relu

Layer: maxpool2d
------------------------------------------------------------
Input shape: 8 26 26
Output shape: 8 13 13
Parameters: 0
Activation: 

Layer: conv2d
------------------------------------------------------------
Input shape: 8 13 13
Output shape: 16 11 11
Parameters: 1168
Activation: relu

Layer: maxpool2d
------------------------------------------------------------
Input shape: 16 11 11
Output shape: 16 5 5
Parameters: 0
Activation: 

Layer: flatten
------------------------------------------------------------
Input shape: 16 5 5
Output shape: 400
Parameters: 0
Activation: 

Layer: dense
------------------------------------------------------------
Input shape: 400
Output shape: 10
Parameters: 4010
Activation: softmax

Epoch  1 done, Accuracy:  9.91 %
Epoch  2 done, Accuracy:  9.91 %
Epoch  3 done, Accuracy:  9.91 %
Epoch  4 done, Accuracy:  9.91 %
...

It will stay at this percentage.

certik · 2024-03-21T01:46:24Z

Git bisect reveals #142:

6bbc28d123cdec20140331edc60df106d518a202 is the first bad commit
commit 6bbc28d123cdec20140331edc60df106d518a202
Author: Milan Curcic <[email protected]>
Date:   Thu Jun 22 11:27:03 2023 -0400

    Connect `flatten`, `conv2d`, and `maxpool2d` layers in backward pass (#142)
    
    * Connect flatten, conv2d, and maxpool2d layers in backward pass
    
    * Bump minor version

 fpm.toml                        |  2 +-
 src/nf/nf_network_submodule.f90 | 16 +++++++++++-----
 2 files changed, 12 insertions(+), 6 deletions(-)

milancurcic · 2024-04-18T14:49:50Z

Tests with minimal CNN on randomly selected constant inputs/outputs converge fine (#174). The problem with training CNN on MNIST may be elsewhere, or the bug is more subtle than I previously suspected. Needs more intermediate complexity tests to understand more.

OneAdder · 2025-02-19T07:03:20Z

@milancurcic The problem appears to be with the initialization of weights. Suggestions:

Hotfix: use Kaiming weights initialization instead of normal
Fix: Initializers stub #151

milancurcic · 2025-03-14T02:07:58Z

Fixed in #201

milancurcic added the bug Something isn't working label Jun 22, 2023

milancurcic self-assigned this Jun 22, 2023

milancurcic mentioned this issue Jan 10, 2024

Test failure with ifx #167

Open

certik mentioned this issue Mar 21, 2024

Connect flatten, conv2d, and maxpool2d layers in backward pass #142

Merged

milancurcic mentioned this issue Apr 16, 2024

More conv2d tests #174

Merged

milancurcic changed the title ~~CNN training does not converge~~ CNN training on MNIST does not converge Apr 18, 2024

This was referenced Feb 21, 2025

Weight initialization #204

Draft

Layernorm #203

Merged

milancurcic closed this as completed Mar 14, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

CNN training on MNIST does not converge #145

CNN training on MNIST does not converge #145

milancurcic commented Jun 22, 2023

certik commented Mar 21, 2024 •

edited

Loading

certik commented Mar 21, 2024 •

edited

Loading

milancurcic commented Apr 18, 2024

OneAdder commented Feb 19, 2025

milancurcic commented Mar 14, 2025

CNN training on MNIST does not converge #145

CNN training on MNIST does not converge #145

Comments

milancurcic commented Jun 22, 2023

certik commented Mar 21, 2024 • edited Loading

certik commented Mar 21, 2024 • edited Loading

milancurcic commented Apr 18, 2024

OneAdder commented Feb 19, 2025

milancurcic commented Mar 14, 2025

certik commented Mar 21, 2024 •

edited

Loading

certik commented Mar 21, 2024 •

edited

Loading