Linear2d layer #197

Merged: 44 commits merged into modern-fortran:main on Feb 17, 2025

Conversation

@OneAdder (Collaborator) commented Feb 3, 2025

Two-Dimensional Linear Layer

Reason

Modern-day machine learning techniques often require working with 2D-shaped data, especially when solving Natural Language Processing tasks. For example, the self-attention matrix has shape (sequence_length, sequence_length), the scaled dot-product attention matrix has shape (sequence_length, head_size), transformer embeddings are usually stored in a 2D lookup table, and so on.

To work with such data, a linear layer with trainable parameters that transforms it is necessary. I plan to implement multi-head attention and later a Transformer encoder, so I decided to add the 2D layer first, as it will be required at every step of the transformer architecture.

Description

  • linear2d_layer implements both the forward and backward passes and, unintuitively, accepts inputs of 3D shape (see the Crutches section for the reasons), with the first dimension reserved for the batch size. A sketch of the forward transformation follows this list.

  • linear2d constructor accepts four arguments: batch_size, sequence_length, in_features, out_features, requiring the input shape to be (batch_size, sequence_length, in_features). The output shape (layer_shape) is (batch_size, sequence_length, out_features).

  • The layer and network classes are modified to support linear2d_layer. At this stage, the layer is restricted to be preceded by input3d_layer.

  • flatten layer is now allowed to be the last layer of the network as a placeholder.
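
For reference, here is a minimal sketch of the transformation the layer performs (my own illustration with assumed argument names, not necessarily the exact internals of linear2d_layer): the same weights and biases are applied to every sequence position of every batch element.

module linear2d_sketch
  implicit none
contains
  ! forward pass sketch: an independent linear transform per (batch, position)
  pure subroutine linear2d_forward(input, weights, biases, output)
    real, intent(in) :: input(:, :, :)    ! (batch_size, sequence_length, in_features)
    real, intent(in) :: weights(:, :)     ! (in_features, out_features)
    real, intent(in) :: biases(:)         ! (out_features)
    real, intent(out) :: output(:, :, :)  ! (batch_size, sequence_length, out_features)
    integer :: batch, pos
    do batch = 1, size(input, 1)
      do pos = 1, size(input, 2)
        output(batch, pos, :) = matmul(input(batch, pos, :), weights) + biases
      end do
    end do
  end subroutine linear2d_forward
end module linear2d_sketch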

Crutches

Input2D Layer

It appears that it is impossible to implement an input2d_layer in the current paradigm. The problem is that Fortran cannot resolve generics by an array's shape, and since input3d_layer, which accepts a real array, is already present, it is impossible to add functions that accept different shapes. So we are stuck with an extra dimension. I decided to use it for storing the batch size, similarly to how PyTorch does it.

Dense Layer

To an extent, this layer could be used as simply another interface for dense if we made it generic. But that could create another problem: it would require reconsidering the restrictions on layer ordering. It can be done, but I would like a maintainer's opinion on that.

Tests

I added tests in test_linear2d_layer.f90.

Sample Code

Fortran NN code:

program simple
  use nf, only: input, network, sgd, linear2d, mse, flatten
  implicit none
  type(network) :: net
  real :: x(2, 3, 4) = reshape(&
      [0.1, 0.2, 0.1, 0.2, 0.1, 0.2, 0.1, 0.2, 0.1, 0.2, 0.1, 0.2,&
       0.1, 0.2, 0.1, 0.2, 0.1, 0.2, 0.1, 0.2, 0.1, 0.2, 0.1, 0.2],&
      [2, 3, 4])
  real :: y(6) = [0.12, 0.1, 0.3, 0.12, 0.1, 0.3]
  type(mse) :: mean_se = mse()
  integer :: n

  ! (batch_size, sequence_length, in_features) input -> linear2d -> flatten to 1D
  net = network([ &
    input([2, 3, 4]), &
    linear2d(2, 3, 4, 1), &
    flatten() &
  ])

  call net % print_info()
  do n = 0, 2
    call net % forward(x)
    call net % backward(y, mean_se)
    call net % update(optimizer=sgd(learning_rate=1.))
    print '(i4,2(3x,f8.6))', n, net % predict(x)
  end do
end program simple

print_info output:

Layer: input
------------------------------------------------------------
Output shape: 2 3 4
Parameters: 0

Layer: linear2d
------------------------------------------------------------
Input shape: 2 3 4
Output shape: 2 3 1
Parameters: 5
Activation: 

Layer: flatten
------------------------------------------------------------
Input shape: 2 3 1
Output shape: 6
Parameters: 0
Activation: 

NN output:

0   0.156267   0.195867
0.156267   0.195867
0.156267   0.195867
1   0.155947   0.194027
0.155947   0.194027
0.155947   0.194027
2   0.151360   0.186960
0.151360   0.186960
0.151360   0.186960

PyTorch Reference

I used PyTorch to make sure that everything works. Here is a snippet of my Python code:

import torch


class LinearModule(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(in_features=4, out_features=1)
        self.linear.bias.data = torch.tensor([0.11])
        self.linear.weight.data = torch.zeros(1, 4) + 0.1

    def forward(self, x):
        x = self.linear(x)
        return x


inp = torch.zeros(2, 3, 4)
inp[0] += 0.1
inp[1] += 0.2
x = torch.tensor(inp, requires_grad=True)
labels = torch.tensor([0.12, 0.1, 0.3, 0.12, 0.1, 0.3])

lm = LinearModule()
optimizer = torch.optim.SGD(lm.parameters(), lr=1.)
loss_function = torch.nn.MSELoss()

for i in range(3):
    pred = lm(x)
    loss = loss_function(pred, labels)
    loss.backward()
    optimizer.step()
    with torch.no_grad():
        pred_first, pred_second = lm(x)
        print(i,'\n'.join(f'{f.item():.6}   {s.item():.6}' for f, s in zip(pred_first, pred_second)))

Output:

0 0.156267   0.195867
0.156267   0.195867
0.156267   0.195867
1 0.155947   0.194027
0.155947   0.194027
0.155947   0.194027
2 0.15136   0.18696
0.15136   0.18696
0.15136   0.18696

@OneAdder mentioned this pull request on Feb 3, 2025
@milancurcic self-requested a review on February 3, 2025, 19:56
@milancurcic (Member)

Thank you! Will review this week..

@milancurcic (Member)

Thank you for this effort and sorry that I missed the issue you opened a few days ago.

Great work figuring out the plumbing of neural-fortran and getting stuff running.

I have two questions for now:

  1. We have dense and we have a linear activation function. If I'm not mistaken, a dense layer with a linear activation function is equivalent to a 1-d linear layer. If so, could this functionality be accomplished with a combination of flatten -> dense (linear activation) -> reshape? Even if it could, there may still be a good argument for a dedicated linear layer, but it'd be good to understand whether it's currently not possible or simply impractical.
  2. I see that you put the batch_size dim first and do matmul and transpose over the latter two dims. But recall that in Fortran the first dim varies the fastest and the last one the slowest. Was this intentional?

I will think a little more about the crutch and the need for 3-d shapes. It's not yet obvious to me that it is, but it's late over here.

@OneAdder (Collaborator, Author) commented Feb 4, 2025

Thank you for your response!

  1. If it is possible, I'm not smart enough to figure out how. In theory, if the input to the linear layer were a (3, 2) matrix and the output a (2, 2) matrix, the output of the 2D linear layer would differ from the output of flattening it and applying a 1D linear layer. E.g., [[1, 2], [3, 4], [5, 6]] × [[1, 2], [3, 4]] and [[1], [2], [3], [4], [5], [6]] × [[1, 2, 3, 4]] yield different numerical values (see the sketch after this list). So, if you know how to work around it and can share the code, I'll be happy to see it! And, nonetheless, even if it is possible, it is still very impractical.
  2. No, I'm simply new to Fortran and didn't think about it. I'll change it.
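
To make item 1 concrete, here is a throwaway demo (hypothetical program, not part of this PR) that only compares the shapes of the two products: the per-row 2D transform and the flattened 1D transform are entirely different operations.

program matmul_shapes_demo
  implicit none
  real :: a(3, 2), w2d(2, 2), flat(6, 1), w1d(1, 4)
  ! [[1, 2], [3, 4], [5, 6]] and [[1, 2], [3, 4]] (Fortran fills column-major)
  a = reshape([1., 3., 5., 2., 4., 6.], [3, 2])
  w2d = reshape([1., 3., 2., 4.], [2, 2])
  ! the same six values flattened into a column, and a 1x4 weight row
  flat = reshape([1., 2., 3., 4., 5., 6.], [6, 1])
  w1d = reshape([1., 2., 3., 4.], [1, 4])
  print *, shape(matmul(a, w2d))     ! 3 2: a per-row linear transform
  print *, shape(matmul(flat, w1d))  ! 6 4: a completely different product
end program matmul_shapes_demo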

Regarding the plumbing, I think a bit more documentation could have helped. I might write a description of the OOP structure of the project with a diagram and send a PR to contributing.md.

BTW, looking at my PR in the morning reveals that I forgot a couple of things: putting the layer's logic into its submodule and updating the CMake files. I'll do that as well.

@milancurcic mentioned this pull request on Feb 6, 2025
@milancurcic (Member)

Thanks for the updates, Michael.

When you get a chance, please review the Input2d layer in #198. If it's sound, we'll merge that into main and then merge it with this PR and remove the crutch.

From your research and experience, is an input layer the only kind that precedes Linear2d in applications, or do some other kinds of layers do so too?

@OneAdder (Collaborator, Author) commented Feb 6, 2025

Wow, thank you for your work!

Answering your question may not be that simple. I mostly work in Natural Language Processing, and in this area the need to perform linear transformations on matrices appears in several places.

The general idea is that the input vectors are stored in a transformed lookup table with dimensions (sequence_length, hidden_dim), and all operations happen on that table rather than on the raw input vectors. For example, an encoder model works like this:

  • Accept input (vectorized text) of dimensions (sequence_length)
  • Transform the input into lookup table of (sequence_length, hidden_dim)
  • Perform transformations and activations specific to this model's architecture on (sequence_length, hidden_dim)
  • Transform the output into the output shape of (sequence_length), (1), etc., or keep it as is for the decoder stage if needed

So, the linear2d layer appears at every step of this: to rearrange the input, to perform architecture-specific calculations, and then to format the output.

I am currently working on nf_multihead_attention on my fork here, but it's still very much a work in progress. The essence of attention is that it is a combination of linear layers with a softmax activation in between, and the linear layers work with 2D shapes (a rough sketch follows at the end of this comment).

But all of this is about the future rather than the present. At this point, linear2d_layer may be used for something simple such as linear regression on matrices (my example).
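
For context, a rough sketch of the scaled dot-product attention mentioned above (my own illustration, not code from nf_multihead_attention): q, k, and v are (sequence_length, head_size) matrices produced by linear layers, and the score matrix is (sequence_length, sequence_length), which is where the need for 2D-shaped linear algebra comes from.

module attention_sketch
  implicit none
contains
  function scaled_dot_product_attention(q, k, v) result(out)
    real, intent(in) :: q(:, :), k(:, :), v(:, :)  ! (sequence_length, head_size)
    real :: out(size(q, 1), size(v, 2))
    real :: scores(size(q, 1), size(k, 1))         ! (sequence_length, sequence_length)
    integer :: i
    scores = matmul(q, transpose(k)) / sqrt(real(size(k, 2)))
    do i = 1, size(scores, 1)
      ! softmax over each row of the score matrix
      scores(i, :) = exp(scores(i, :) - maxval(scores(i, :)))
      scores(i, :) = scores(i, :) / sum(scores(i, :))
    end do
    out = matmul(scores, v)
  end function scaled_dot_product_attention
end module attention_sketch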

@milancurcic (Member)

@OneAdder Thanks for the updates. I'm not a big fan of flatten2d, so I attempted a generic flatten in #202; please take a look, I think it can work.

@OneAdder (Collaborator, Author)

Everything seems to be resolved. @milancurcic, can you take a look? Any suggestions?

@milancurcic (Member)

Great, thanks! Will review and test locally tonight.

@milancurcic (Member) left a comment

Thanks, Michael. You'll notice that I added your simple linear2d program as an example. I also left a comment about it not converging. We should get it to converge.

I think this PR is almost complete. One key thing stands out that I would like to discuss changing.

Currently, the user is required to provide:

linear2d(sequence_length, in_features, out_features)

An analogy to the dense (1d) layer would be:

dense(in_features, out_features)

However, notice that in NF, you only need to write

dense(out_features)

and the internal shape of the dense layer, which is (in_features, out_features), is determined from the layer that precedes it. For example:

input(5), &
dense(10)

would initialize a dense layer with internal shape of (5, 10).

That said, I would like to be able to do simply:

net = network([ &
  input(sequence_length, in_features), &
  linear2d(out_features), &
  flatten(), &
])

to get the same behavior that we currently have. Basically, at layer construction time we only know out_features, but sequence_length and in_features are obtained during layer % init() from the preceding layer.

This way, the user doesn't need to pass redundant information when constructing the network, which makes for a nicer API but also reduces the chance of user error.

What do you think?

@@ -148,4 +150,13 @@ module function reshape(output_shape) result(res)

end function reshape

module function linear2d(sequence_length, in_features, out_features) result(res)
@milancurcic (Member)

Here we should be requesting only out_features at layer constructor invocation, and sequence_length and in_features are obtained later from the layer that feeds into this one.

Suggested change:
- module function linear2d(sequence_length, in_features, out_features) result(res)
+ module function linear2d(out_features) result(res)

@OneAdder (Collaborator, Author)

Yes, makes sense

@OneAdder (Collaborator, Author)

It turns out that we cannot avoid passing sequence_length: we need it to determine the output shape. I think it's not a big deal and it can be left like this:

  module function linear2d(sequence_length, out_features) result(res)
    integer, intent(in) :: sequence_length, out_features
    type(layer) :: res

    res % name = 'linear2d'
    res % layer_shape = [sequence_length, out_features]
    allocate(res % p, source=linear2d_layer(out_features))
  end function linear2d

@milancurcic (Member)

I just gave it a try; unless I'm mistaken, I think it can work. See this commit: 678b2c0

@milancurcic (Member)

I'll try to explain; the process goes like this:

  1. linear2d(out_features) constructs the generic layer instance which at this time does not yet know its layer_shape.
  2. network constructor from an array of generic layers loops over each layer in order and calls layer % init(prev_layer).
  3. Inside layer % init, the output shape of the previous layer is passed to the concrete linear2d_layer % init(input_shape), and inside the concrete init, all parameters are now known (sequence_length, in_features, out_features).
  4. After the concrete layer init call, back inside the generic layer % init, we set the generic layer % layer_shape to be the same as the shape of the concrete linear2d_layer % output.

I hope this makes sense, and I'm sorry that it had to be so complicated. However, we are essentially hacking around Fortran's very limited generic features.
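
For illustration, here is a minimal sketch of what step 3 could look like inside the concrete layer (assumed component names, not necessarily the exact code in that commit):

  module subroutine init(self, input_shape)
    class(linear2d_layer), intent(in out) :: self
    integer, intent(in) :: input_shape(:)  ! output shape of the preceding layer

    ! sequence_length and in_features come from the previous layer's output shape;
    ! out_features was already set by the linear2d(out_features) constructor
    self % sequence_length = input_shape(1)
    self % in_features = input_shape(2)

    allocate(self % output(self % sequence_length, self % out_features))
    allocate(self % weights(self % in_features, self % out_features))
    allocate(self % biases(self % out_features))
  end subroutine init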

@OneAdder (Collaborator, Author) commented Feb 17, 2025

  • Remove redundant args from constructor
    • Update MHA as it uses the layer
  • Use Normal weight initialization
    • Cement the weight values in tests
    • Cement the weight values in MHA tests
  • Tweak values in example till it converges
    • First get it to converge in PyTorch and make sure that values are correct

@OneAdder (Collaborator, Author)

@milancurcic done

@milancurcic (Member)

One last issue remains. I don't think the example actually converges. In the previous iteration, the loss (MSE) tolerance was 0.01, which for these outputs, on the order of 0.1, becomes very easily satisfied. If you lower the loss tolerance to, say, 1e-4, and run for a large number of iterations and print the output of each step, you'll see that the outputs change (and they change gradually, with sufficiently low learning rate), but they don't actually converge toward the values of y. The previous version that seemed to converge was only due to luck and a large loss tolerance.

Are we still not doing something correctly in this example?

@OneAdder (Collaborator, Author)

Off Topic

I think I found an unrelated bug: gradients become zero between flatten and linear2d if the order is input > linear2d > flatten > dense. I'll try to fix it now.

Example

I don't think there's an issue, in fact. What is happening here is actually linear regression. It can be visualized as trying to draw a line as close to each point as possible. The y values are not on a line, so it is impossible for the model to converge completely here.
The only thing we can do is make the y values lie on a line, but that would be cheating, IMO.
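
To put a number on it, using the y values from the sample code in the PR description: within each batch row of x all sequence positions have identical inputs, so the layer emits one constant prediction against the three targets 0.12, 0.1, 0.3, and the best such constant is their mean. A throwaway check (hypothetical program, not part of the PR):

program mse_floor
  implicit none
  real :: targets(3) = [0.12, 0.1, 0.3]
  real :: best_pred
  best_pred = sum(targets) / 3.
  print *, 'best constant prediction:', best_pred                           ! ~0.173
  print *, 'minimum achievable MSE:  ', sum((targets - best_pred)**2) / 3.  ! ~0.008, far above 1e-4
end program mse_floor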

@milancurcic (Member)

OK, thanks for that explanation. In that case, maybe the example should simply run for a number of iterations and print the outputs to the screen, rather than stopping on some first accidental local minimum. And related to this, as I understand it, this linear2d layer is much more useful as a building block of bigger stuff like MHA, and less so on its own. In that case, the example itself is not that relevant.

@OneAdder (Collaborator, Author)

Great! Then we can just remove it. I'm working on text classification with the IMDB dataset; I'll add it later when there are more 2D layers.
The bug is not a bug; I just have vanishing gradients.

@milancurcic (Member) left a comment

Looks great, thank you!

@milancurcic merged commit c316ee1 into modern-fortran:main on Feb 17, 2025. 4 checks passed.