MSTDP not learning #697

jhunter533 · 2024-10-26T04:46:30Z

jhunter533
Oct 26, 2024

I've been trying to implement Bindsnet based off my reinforcement learning Pytorch code. I opted to not use the pipeline and to use custom functions. I need help trying to get it to learn since it does not seem to be working. I feel like part of my problem may stem from trying to do continuous environments or my setup for them since input and output can be negative. If there is any advice or resources that could help me there let me know.

What I've tried:

LINode where layer.s=layer.v for the output.
Sum of spikes of output LIFNode layer
One_step set to true and false
Tested with both 1 and 256 batch_size

Observations:

The layers do spike and change weights from my print and graph testing
No matter which setup for the output layers is used it does not seem to learn. When using the spike setup the output layers are always 0. When using the LI layer it's always the reset voltage so -65 by default. This shows it isn't learning since the output is always the relative 0 in my testing thus far (though it should be noted that during training the very negative out layer results in the most negative action always while in the case of 0 value it leads to very random behavior but is 0 during testing).
I have tried changing a few of the parameters but they didn't seem to affect the lack of training though I understand they probably need tuned at some point.

I intend to keep the update for network separate from it's output generation since there are times when I want to do the equivalent of 'no_grad' however I may be thinking too much in the ANN world and perhaps I should always be performing the update. If someone has advice or papers that apply to this please let me know.

I've been running it with basic gymnasium environment like classic control pendulum.

I appreciate any help that can be offered.

The following the network from pytorch for reference purposes

class network(nn.Module):
    def __init__(self,num_states,num_actions,action_bound,hidden_dim,hidden_dim2):
        super(network,self).__init__()
        self.fc1=nn.Linear(self.num_states,self.hidden_dim)
        self.fc2=nn.Linear(self.hidden_dim,self.hidden_dim2)
        self.out1=nn.Linear(self.hidden_dim2,self.num_actions)
        self.out2=nn.Linear(self.hidden_dim2,self.num_actions)

    def forward(self,state):
        '''
            'state'| Input state-[(batch,num_states)]
            'out1'| Output mean actions-[(batch,num_action)]
            'out2'| Output log probability-[(batch,num_action)]
        '''
        state=state.to(device)
        x=F.relu(self.fc1(state))
        x=F.relu(self.fc2(x))
        out1=self.out1(x)
        out2=self.out2(x)
        return out1,out2
    def action(self,state,det=False):
        '''
            'state'| Input state -[(batch,state_dim)]
            'det'| User input if testing to utilize deterministic not stochastic policy-False by default
            'action'| pi Output action to take as sampled from policy-[(batch,action_dim)]
                ///must take just the most recent for env step
            'out2'| Outputs the final log probability of policy of sampled action-[(batch,action_dim)]'
            
        '''
        mu,log_std=self(state)
        if det:
            state=torch.tensor(state,dtype=torch.float32).to(device)
            action=self.action_bound*torch.tanh(mu)
            return action.cpu().detach().numpy()
        ....Remaining logic to gain action
        action=self.action_bound*torch.tanh(raw_action)
        return action,out2

The following is the network in BindsNET. The LINode is the LIFNode but after the last step in the forward it sets self.s=self.v.

class Network(nn.Module):
    def __init__(self,num_states,num_actions,action_bound,hidden_dim,hidden_dim2):
        super(Network,self).__init__()
        self.T=8
        self.batch_size=256
        self.network=Network()
        self.input=Input(n=self.num_states,traces=True)
        self.middle=LIFNodes(n=self.hidden_dim,traces=True)
        self.middle2=LIFNodes(n=self.hidden_dim2,traces=True)
        self.out1=LINodes(n=self.num_actions,traces=True)
        self.out2=LINodes(n=self.num_actions,traces=True)
        #Alternative out layers if spikes
        #self.out1=LIFNodes(n=self.num_actions,traces=True)
        #self.out2=LIFNodes(n=self.num_actions,traces=True)

        #Currently the nu and wmin wmax are random but I had these set to the values from STDP Breakout and nothing changed
        #I also tried LIF output layers with refrac=0
        input_middle=Connection(source=self.input,target=self.middle,update_rule=MSTDP,nu=(3e-4,3e-4),wmin=-1,wmax=1)
        middle_middle=Connection(source=self.middle,target=self.middle2,update_rule=MSTDP,nu=(3e-4,3e-4),wmin=-1,wmax=1)
        middle_out1=Connection(source=self.middle2,target=self.out1,update_rule=MSTDP,nu=(3e-4,3e-4),wmin=-1,wmax=1)
        middle_out2=Connection(source=self.middle2,target=self.out2,update_rule=MSTDP,nu=(3e-4,3e-4),wmin=-1,wmax=1)
        
        self.network.add_layer(self.input,name="Input Layer")
        self.network.add_layer(self.middle,name="Hidden Layer")
        self.network.add_layer(self.middle2,name="Hidden Layer2")
        self.network.add_layer(self.out1,name="Output Layer1")
        self.network.add_layer(self.out2,name="Output Layer2")

        self.network.add_connection(input_middle,source="Input Layer",target="Hidden Layer")
        self.network.add_connection(middle_middle,source="Hidden Layer",target = "Hidden Layer2")
        self.network.add_connection(middle_out1,source="Hidden Layer2",target="Output Layer1")
        self.network.add_connection(middle_out2,source="Hidden Layer2",target="Output Layer2")
        #monitors do exist here
        self.encoder=RepeatEncoder(time=self.T)
    def forward(self,state,reward):

        state=state.to(device)
        state=self.encoder(state).to(device)
        self.network.train(mode=False)
        self.network.run(inputs={"Input Layer": state}, time=self.T,reward=reward,one_step=True)
        #state is [T,Batch,Feature]
        #The output layers of network have [Batch,Feature]
        out1=self.out1.s
        out2=self.out2.s
        #For LIF outlayers
        #out1=self.out1.s.sum(dim=0,keepdim=True).float() 
        #out2=self.out2.s.sum(dim=0,keepdim=True).float()
        return out1, out2
  def action remains the same

Agent.learn Function:

def train(self,max_episode):
        reward=0
        for episode in tqdm(range(max_episode)):
            for layer in self.Network.network.layers:
                self.actor.network.layers[layer].reset_state_variables()
            for connection in self.Network.network.connections:
                self.actor.network.connections[connection].reset_state_variables()
            for monitor in self.Network.network.monitors:
                self.actor.network.monitors[monitor].reset_state_variables()
            state,_=self.env.reset(seed=self.seed)
            done=False
            trunc=False
            while not (done or trunc):
                self.totalsteps+=1
                #use random action for x steps to encourage learning
                action=self.select_action(state,reward)
                action=action.squeeze(0)
                total_r=0
 
                next_state,reward,done,trunc,info=self.env.step(action)
                total_r+=reward
                if done:
                    total_r=max(total_r,0.0)
                    break
                total_r*=1

                self.buffer.store(state,action,total_r,next_state,done)
                states,actions,rewards,next_states,dones=self.buffer.sample()
                rewards=torch.unsqueeze(rewards,1)
                dones=torch.unsqueeze(dones,1)
                with torch.no_grad():
                    next_action,next_log_probs=self.Network.action(next_states,reward)
                    #...some more calculations not involving network
                #..more unrelated steps in logic
                actions_pred,log_pred=self.Network.action(states,reward)
                self.Network.network.train(mode=True)
                for connection in self.actor.network.connections:
                    self.Network.network.connections[connection].update(reward=reward)
                if done:
                    break
                if trunc:
                    break
                state=next_state

jhunter533 · 2024-10-26T04:55:05Z

jhunter533
Oct 26, 2024
Author

These graphs are from pretty early into training but they should help with debugging. For debugging reasons my batch is 1. These graphs use the LIF node out layers and that setup

0 replies

Hananel-Hazan · 2024-10-27T23:30:25Z

Hananel-Hazan
Oct 27, 2024
Maintainer

Thank you for sharing the graphs. The network code looks good.

To begin with, let's remove the upper and lower bounds on the weights in each layer, specifically eliminating wmin=-1 and wmax=1, allowing the weights to reach higher values.

Additionally, for debugging purposes, I recommend removing the second middle layer. This will help us observe if the output weights change. I suspect that there may not be enough activity in the deeper layers to produce meaningful weight changes. In spiking neurons, there's a risk of vanishing spikes, where deeper layers receive progressively fewer spikes, resulting in diminished weight updates compared to layers connected directly to the input.

Furthermore, for the MSTDP to be effective, the output layer needs to synchronize between presynaptic and postsynaptic spikes. Since the output layer currently lacks postsynaptic spikes, you should guide it similarly to supervised learning by stimulating the neurons that need to be active while inhibiting those that should not. This approach will allow MSTDP to be effective and facilitate appropriate weight changes. There is an example in BindsNET that can demonstrate such a process (supervise MNIST)

Regarding your concerns about the continuous environment and potential issues with negative inputs and outputs, I agree that working with both positive and negative rewards can be challenging. I recommend starting with just one type of reward to assess its performance. Additionally, consider mapping the output to a continuous range by normalizing or scaling it before passing it to the environment. You might transform spike outputs into a more continuous space by normalizing the total spike count for each neuron to fit a desired output range.

I also encourage you to read the following papers, as they contain valuable techniques that may be helpful:

“A Comprehensive Review of Spiking Neural Network-Based Reinforcement Learning” (2020): This paper provides insights into effective training and continuous actions with SNNs.
“Learning Spiking Neural Network Policies Using Deep Q-Learning” (Fremaux et al., 2018): This study adapts Q-learning for SNNs and presents methods for managing continuous control.

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

MSTDP not learning #697

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Replies: 2 comments

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

Select a reply

MSTDP not learning #697

jhunter533 Oct 26, 2024

Replies: 2 comments

jhunter533 Oct 26, 2024 Author

Hananel-Hazan Oct 27, 2024 Maintainer

jhunter533
Oct 26, 2024

jhunter533
Oct 26, 2024
Author

Hananel-Hazan
Oct 27, 2024
Maintainer