Linear2d layer #197

Merged: 44 commits merged into modern-fortran:main on Feb 17, 2025

Conversation

@OneAdder (Collaborator) commented Feb 3, 2025

Two-Dimensional Linear Layer

Reason

Modern-day machine learning techniques often require working with 2D-shaped data, especially when solving Natural Language Processing tasks. For example, the self-attention matrix has shape (sequence_length, sequence_length), the scaled dot-product attention matrix has shape (sequence_length, head_size), transformer embeddings are usually stored in a 2D lookup table, and so on.

To work with such data, a linear layer with trainable parameters that transforms it is necessary. I plan to implement multi-head attention and later a Transformer encoder, so I decided to add the 2D layer first, as it will be required at every step of the transformer architecture.

Description

  • linear2d_layer implements both the forward and backward passes and, unintuitively, accepts inputs of 3D shape (see the Crutches section for the reasons), with the first dimension reserved for the batch size. A sketch of the forward transformation follows this list.

  • linear2d constructor accepts four arguments: batch_size, sequence_length, in_features, out_features, requiring the input shape to be (batch_size, sequence_length, in_features). The output shape (layer_shape) is (batch_size, sequence_length, out_features).

  • The layer and network classes are modified to support linear2d_layer. At this stage, the layer is restricted to be preceded by input3d_layer.

  • flatten layer is now allowed to be the last layer of the network as a placeholder.
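
For reference, here is a minimal sketch of the transformation the layer performs (my own illustration with assumed argument names, not necessarily the exact internals of linear2d_layer): the same weights and biases are applied to every sequence position of every batch element.

module linear2d_sketch
  implicit none
contains
  ! forward pass sketch: an independent linear transform per (batch, position)
  pure subroutine linear2d_forward(input, weights, biases, output)
    real, intent(in) :: input(:, :, :)    ! (batch_size, sequence_length, in_features)
    real, intent(in) :: weights(:, :)     ! (in_features, out_features)
    real, intent(in) :: biases(:)         ! (out_features)
    real, intent(out) :: output(:, :, :)  ! (batch_size, sequence_length, out_features)
    integer :: batch, pos
    do batch = 1, size(input, 1)
      do pos = 1, size(input, 2)
        output(batch, pos, :) = matmul(input(batch, pos, :), weights) + biases
      end do
    end do
  end subroutine linear2d_forward
end module linear2d_sketch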

Crutches

Input2D Layer

It appears that it is impossible to implement an input2d_layer in the current paradigm. The problem is that Fortran cannot resolve generics by an array's shape, and since input3d_layer, which accepts a real array, is already present, it is impossible to add functions that accept different shapes. So we are stuck with an extra dimension. I decided to use it for storing the batch size, similarly to how PyTorch does it.

Dense Layer

To an extent, this layer could be used as simply another interface for dense if we made it generic. But that could create another problem: it would require reconsidering the restrictions on layer ordering. It can be done, but I would like a maintainer's opinion on that.

Tests

I added tests in test_linear2d_layer.f90.

Sample Code

Fortran NN code:

program simple
  use nf, only: input, network, sgd, linear2d, mse, flatten
  implicit none
  type(network) :: net
  real :: x(2, 3, 4) = reshape(&
      [0.1, 0.2, 0.1, 0.2, 0.1, 0.2, 0.1, 0.2, 0.1, 0.2, 0.1, 0.2,&
       0.1, 0.2, 0.1, 0.2, 0.1, 0.2, 0.1, 0.2, 0.1, 0.2, 0.1, 0.2],&
      [2, 3, 4])
  real :: y(6) = [0.12, 0.1, 0.3, 0.12, 0.1, 0.3]
  type(mse) :: mean_se = mse()
  integer :: n

  ! (batch_size, sequence_length, in_features) input -> linear2d -> flatten to 1D
  net = network([ &
    input([2, 3, 4]), &
    linear2d(2, 3, 4, 1), &
    flatten() &
  ])

  call net % print_info()
  do n = 0, 2
    call net % forward(x)
    call net % backward(y, mean_se)
    call net % update(optimizer=sgd(learning_rate=1.))
    print '(i4,2(3x,f8.6))', n, net % predict(x)
  end do
end program simple

print_info output:

Layer: input
------------------------------------------------------------
Output shape: 2 3 4
Parameters: 0

Layer: linear2d
------------------------------------------------------------
Input shape: 2 3 4
Output shape: 2 3 1
Parameters: 5
Activation: 

Layer: flatten
------------------------------------------------------------
Input shape: 2 3 1
Output shape: 6
Parameters: 0
Activation: 

NN output:

0   0.156267   0.195867
0.156267   0.195867
0.156267   0.195867
1   0.155947   0.194027
0.155947   0.194027
0.155947   0.194027
2   0.151360   0.186960
0.151360   0.186960
0.151360   0.186960

PyTorch Reference

I used PyTorch to make sure that everything works. Here is a snippet of my Python code:

import torch


class LinearModule(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(in_features=4, out_features=1)
        self.linear.bias.data = torch.tensor([0.11])
        self.linear.weight.data = torch.zeros(1, 4) + 0.1

    def forward(self, x):
        x = self.linear(x)
        return x


inp = torch.zeros(2, 3, 4)
inp[0] += 0.1
inp[1] += 0.2
x = torch.tensor(inp, requires_grad=True)
labels = torch.tensor([0.12, 0.1, 0.3, 0.12, 0.1, 0.3])

lm = LinearModule()
optimizer = torch.optim.SGD(lm.parameters(), lr=1.)
loss_function = torch.nn.MSELoss()

for i in range(3):
    pred = lm(x)
    loss = loss_function(pred, labels)
    loss.backward()
    optimizer.step()
    with torch.no_grad():
        pred_first, pred_second = lm(x)
        print(i,'\n'.join(f'{f.item():.6}   {s.item():.6}' for f, s in zip(pred_first, pred_second)))

Output:

0 0.156267   0.195867
0.156267   0.195867
0.156267   0.195867
1 0.155947   0.194027
0.155947   0.194027
0.155947   0.194027
2 0.15136   0.18696
0.15136   0.18696
0.15136   0.18696

@OneAdder mentioned this pull request on Feb 3, 2025
@milancurcic self-requested a review on February 3, 2025, 19:56
@milancurcic (Member)

Thank you! Will review this week..

@milancurcic (Member)

Thank you for this effort and sorry that I missed the issue you opened a few days ago.

Great work figuring out the plumbing of neural-fortran and getting stuff running.

I have two questions for now:

  1. We have dense and we have a linear activation function. If I'm not mistaken, a dense layer with a linear activation function is equivalent to a 1-d linear layer. If so, could this functionality be accomplished with a combination of flatten -> dense (linear activation) -> reshape? Even if it could, there may still be a good argument for a dedicated linear layer, but it'd be good to understand whether it's currently not possible or simply impractical.
  2. I see that you put the batch_size dim first and do matmul and transpose over the latter two dims. But recall that in Fortran the first dim varies the fastest and the last one the slowest. Was this intentional?

I will think a little more about the crutch and the need for 3-d shapes. It's not yet obvious to me that it is, but it's late over here.

@OneAdder (Collaborator, Author) commented Feb 4, 2025

Thank you for your response!

  1. If it is possible, I'm not smart enough to figure out how. In theory, if the input to the linear layer were a (3, 2) matrix and the output a (2, 2) matrix, the output of the 2D linear layer would differ from the output of flattening it and applying a 1D linear layer. E.g., [[1, 2], [3, 4], [5, 6]] × [[1, 2], [3, 4]] and [[1], [2], [3], [4], [5], [6]] × [[1, 2, 3, 4]] yield different numerical values (see the sketch after this list). So, if you know how to work around it and can share the code, I'll be happy to see it! And, nonetheless, even if it is possible, it is still very impractical.
  2. No, I'm simply new to Fortran and didn't think about it. I'll change it.
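
To make item 1 concrete, here is a throwaway demo (hypothetical program, not part of this PR) that only compares the shapes of the two products: the per-row 2D transform and the flattened 1D transform are entirely different operations.

program matmul_shapes_demo
  implicit none
  real :: a(3, 2), w2d(2, 2), flat(6, 1), w1d(1, 4)
  ! [[1, 2], [3, 4], [5, 6]] and [[1, 2], [3, 4]] (Fortran fills column-major)
  a = reshape([1., 3., 5., 2., 4., 6.], [3, 2])
  w2d = reshape([1., 3., 2., 4.], [2, 2])
  ! the same six values flattened into a column, and a 1x4 weight row
  flat = reshape([1., 2., 3., 4., 5., 6.], [6, 1])
  w1d = reshape([1., 2., 3., 4.], [1, 4])
  print *, shape(matmul(a, w2d))     ! 3 2: a per-row linear transform
  print *, shape(matmul(flat, w1d))  ! 6 4: a completely different product
end program matmul_shapes_demo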

Regarding the plumbing, I think a bit more documentation could have helped. I might write a description of the OOP structure of the project with a diagram and send a PR to contributing.md.

BTW, looking at my PR in the morning reveals that I forgot a couple of things: putting the layer's logic into its submodule and updating the CMake files. I'll do that as well.

@milancurcic mentioned this pull request on Feb 6, 2025
@milancurcic (Member)

Thanks for the updates, Michael.

When you get a chance, please review the Input2d layer in #198. If it's sound, we'll merge that into main and then merge it with this PR and remove the crutch.

From your research and experience, is an input layer the only kind that precedes Linear2d in applications, or do some other kinds of layers do so too?

@OneAdder (Collaborator, Author) commented Feb 6, 2025

Wow, thank you for your work!

Answering your question may not be that simple. I mostly work in Natural Language Processing, and in this area the need to perform linear transformations on matrices appears in several places.

The general idea is that the input vectors are stored in a transformed lookup table with dimensions (sequence_length, hidden_dim), and all operations happen on that table rather than on the raw input vectors. For example, an encoder model works like this:

  • Accept input (vectorized text) of dimensions (sequence_length)
  • Transform the input into lookup table of (sequence_length, hidden_dim)
  • Perform transformations and activations specific to this model's architecture on (sequence_length, hidden_dim)
  • Transform the output into the output shape of (sequence_length), (1), etc., or keep it as is for the decoder stage if needed

So, the linear2d layer appears at every step of this: to rearrange the input, to perform architecture-specific calculations, and then to format the output.

I am currently working on nf_multihead_attention on my fork here, but it's still very much a work in progress. The essence of attention is that it is a combination of linear layers with a softmax activation in between, and the linear layers work with 2D shapes (a rough sketch follows at the end of this comment).

But all of this is about the future rather than the present. At this point, linear2d_layer may be used for something simple such as linear regression on matrices (my example).
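
For context, a rough sketch of the scaled dot-product attention mentioned above (my own illustration, not code from nf_multihead_attention): q, k, and v are (sequence_length, head_size) matrices produced by linear layers, and the score matrix is (sequence_length, sequence_length), which is where the need for 2D-shaped linear algebra comes from.

module attention_sketch
  implicit none
contains
  function scaled_dot_product_attention(q, k, v) result(out)
    real, intent(in) :: q(:, :), k(:, :), v(:, :)  ! (sequence_length, head_size)
    real :: out(size(q, 1), size(v, 2))
    real :: scores(size(q, 1), size(k, 1))         ! (sequence_length, sequence_length)
    integer :: i
    scores = matmul(q, transpose(k)) / sqrt(real(size(k, 2)))
    do i = 1, size(scores, 1)
      ! softmax over each row of the score matrix
      scores(i, :) = exp(scores(i, :) - maxval(scores(i, :)))
      scores(i, :) = scores(i, :) / sum(scores(i, :))
    end do
    out = matmul(scores, v)
  end function scaled_dot_product_attention
end module attention_sketch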

@milancurcic (Member)

@OneAdder Thanks for the updates. I'm not a big fan of flatten2d, so I attempted a generic flatten in #202; please take a look, I think it can work.

@OneAdder (Collaborator, Author)

Everything seems to be resolved. @milancurcic, can you take a look? Any suggestions?

@milancurcic (Member)

Great, thanks! Will review and test locally tonight.

@milancurcic (Member) left a comment

Thanks, Michael. You'll notice that I added your simple linear2d program as an example. I also left a comment about it not converging. We should get it to converge.

I think this PR is almost complete. One key thing stands out that I would like to discuss changing.

Currently, the user is required to provide:

linear2d(sequence_length, in_features, out_features)

An analogy to the dense (1d) layer would be:

dense(in_features, out_features)

However, notice that in NF, you only need to write

dense(out_features)

and the internal shape of the dense layer, which is (in_features, out_features), is determined from the layer that precedes it. For example:

input(5), &
dense(10)

would initialize a dense layer with internal shape of (5, 10).

That said, I would like to be able to do simply:

net = network([ &
  input(sequence_length, in_features), &
  linear2d(out_features), &
  flatten(), &
])

to get the same behavior that we currently have. Basically, at layer construction time we only know out_features, but sequence_length and in_features are obtained during layer % init() from the preceding layer.

This way, the user doesn't need to pass redundant information when constructing the network, which makes for a nicer API but also reduces the chance of user error.

What do you think?

@@ -148,4 +150,13 @@ module function reshape(output_shape) result(res)

end function reshape

module function linear2d(sequence_length, in_features, out_features) result(res)
@milancurcic (Member)

Here we should be requesting only out_features at layer constructor invocation, and sequence_length and in_features are obtained later from the layer that feeds into this one.

Suggested change:
- module function linear2d(sequence_length, in_features, out_features) result(res)
+ module function linear2d(out_features) result(res)

@OneAdder (Collaborator, Author)

Yes, makes sense

@OneAdder (Collaborator, Author)

It turns out that we cannot avoid passing sequence_length: we need it to determine the output shape. I think it's not a big deal and it can be left like this:

  module function linear2d(sequence_length, out_features) result(res)
    integer, intent(in) :: sequence_length, out_features
    type(layer) :: res

    res % name = 'linear2d'
    res % layer_shape = [sequence_length, out_features]
    allocate(res % p, source=linear2d_layer(out_features))
  end function linear2d

@milancurcic (Member)

I just gave it a try; unless I'm mistaken, I think it can work. See this commit: 678b2c0

@milancurcic (Member)

I'll try to explain; the process goes like this:

  1. linear2d(out_features) constructs the generic layer instance which at this time does not yet know its layer_shape.
  2. network constructor from an array of generic layers loops over each layer in order and calls layer % init(prev_layer).
  3. Inside layer % init, the output shape of the previous layer is passed to the concrete linear2d_layer % init(input_shape), and inside the concrete init, all parameters are now known (sequence_length, in_features, out_features).
  4. After the concrete layer init call, back inside the generic layer % init, we set the generic layer % layer_shape to be the same as the shape of the concrete linear2d_layer % output.

I hope this makes sense, and I'm sorry that it had to be so complicated. However, we are essentially hacking around Fortran's very limited generic features.
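
For illustration, here is a minimal sketch of what step 3 could look like inside the concrete layer (assumed component names, not necessarily the exact code in that commit):

  module subroutine init(self, input_shape)
    class(linear2d_layer), intent(in out) :: self
    integer, intent(in) :: input_shape(:)  ! output shape of the preceding layer

    ! sequence_length and in_features come from the previous layer's output shape;
    ! out_features was already set by the linear2d(out_features) constructor
    self % sequence_length = input_shape(1)
    self % in_features = input_shape(2)

    allocate(self % output(self % sequence_length, self % out_features))
    allocate(self % weights(self % in_features, self % out_features))
    allocate(self % biases(self % out_features))
  end subroutine init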

@OneAdder (Collaborator, Author) commented Feb 17, 2025

  • Remove redundant args from constructor
    • Update MHA as it uses the layer
  • Use Normal weight initialization
    • Cement the weight values in tests
    • Cement the weight values in MHA tests
  • Tweak values in example till it converges
    • First get it to converge in PyTorch and make sure that values are correct

@OneAdder (Collaborator, Author)

@milancurcic done

@milancurcic (Member)

One last issue remains. I don't think the example actually converges. In the previous iteration, the loss (MSE) tolerance was 0.01, which for these outputs, on the order of 0.1, becomes very easily satisfied. If you lower the loss tolerance to, say, 1e-4, and run for a large number of iterations and print the output of each step, you'll see that the outputs change (and they change gradually, with sufficiently low learning rate), but they don't actually converge toward the values of y. The previous version that seemed to converge was only due to luck and a large loss tolerance.

Are we still not doing something correctly in this example?

@OneAdder (Collaborator, Author)

Off Topic

I think I found an unrelated bug: gradients become zero between flatten and linear2d if the order is input > linear2d > flatten > dense. I'll try to fix it now.

Example

I don't think there's an issue, in fact. What is happening here is actually linear regression. It can be visualized as trying to draw a line as close to each point as possible. The y values are not on a line, so it is impossible for the model to converge completely here.
The only thing we can do is make the y values lie on a line, but that would be cheating, IMO.
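
To put a number on it, using the y values from the sample code in the PR description: within each batch row of x all sequence positions have identical inputs, so the layer emits one constant prediction against the three targets 0.12, 0.1, 0.3, and the best such constant is their mean. A throwaway check (hypothetical program, not part of the PR):

program mse_floor
  implicit none
  real :: targets(3) = [0.12, 0.1, 0.3]
  real :: best_pred
  best_pred = sum(targets) / 3.
  print *, 'best constant prediction:', best_pred                           ! ~0.173
  print *, 'minimum achievable MSE:  ', sum((targets - best_pred)**2) / 3.  ! ~0.008, far above 1e-4
end program mse_floor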

@milancurcic (Member)

OK, thanks for that explanation. In that case, maybe the example should simply run for a number of iterations and print the outputs to the screen, rather than stopping on some first accidental local minimum. And related to this, as I understand it, this linear2d layer is much more useful as a building block of bigger stuff like MHA, and less so on its own. In that case, the example itself is not that relevant.

@OneAdder (Collaborator, Author)

Great! Then we can just remove it. I'm working on text classification with the IMDB dataset; I'll add it later when there are more 2D layers.
The bug is not a bug; I just have vanishing gradients.

@milancurcic (Member) left a comment

Looks great, thank you!

@milancurcic merged commit c316ee1 into modern-fortran:main on Feb 17, 2025. 4 checks passed.