
Using ELMO instead of BERT #46

Closed
nassera2014 opened this issue Jan 17, 2021 · 9 comments
@nassera2014 commented Jan 17, 2021

Hi,
Thank you for your great and well-explained work. Do you have an idea of how I can use ELMo instead of BERT?

Code:

```python
# Imports added for completeness (module paths as in the
# contextualized-topic-models 1.x and elmoformanylangs packages):
from contextualized_topic_models.utils.data_preparation import TextHandler
from contextualized_topic_models.datasets.dataset import CTMDataset
from contextualized_topic_models.models.ctm import CombinedTM
from elmoformanylangs import Embedder

handler = TextHandler(sentences=preprocessed_documents)
handler.prepare()  # create vocabulary and training data
docELMO = [i.split() for i in unpreprocessed_documents]

e = Embedder('/home/nassera/136', batch_size=64)

training_elmo = e.sents2elmo(docELMO, output_layer=0)

print("training ELMO : ", training_elmo[0])
training_dataset = CTMDataset(handler.bow, training_elmo, handler.idx2token)

ctm = CombinedTM(input_size=len(handler.vocab), bert_input_size=768, n_components=50)

ctm.fit(training_dataset)  # run the model
print('topics : ', ctm.get_topics())
```

When I run this code I get this error:

```text
2021-01-16 22:12:51,392 INFO: char embedding size: 3773
2021-01-16 22:12:52,371 INFO: word embedding size: 221272
2021-01-16 22:12:58,469 INFO: Model(
  (token_embedder): ConvTokenEmbedder(
    (word_emb_layer): EmbeddingLayer(
      (embedding): Embedding(221272, 100, padding_idx=3)
    )
    (char_emb_layer): EmbeddingLayer(
      (embedding): Embedding(3773, 50, padding_idx=3770)
    )
    (convolutions): ModuleList(
      (0): Conv1d(50, 32, kernel_size=(1,), stride=(1,))
      (1): Conv1d(50, 32, kernel_size=(2,), stride=(1,))
      (2): Conv1d(50, 64, kernel_size=(3,), stride=(1,))
      (3): Conv1d(50, 128, kernel_size=(4,), stride=(1,))
      (4): Conv1d(50, 256, kernel_size=(5,), stride=(1,))
      (5): Conv1d(50, 512, kernel_size=(6,), stride=(1,))
      (6): Conv1d(50, 1024, kernel_size=(7,), stride=(1,))
    )
    (highways): Highway(
      (_layers): ModuleList(
        (0): Linear(in_features=2048, out_features=4096, bias=True)
        (1): Linear(in_features=2048, out_features=4096, bias=True)
      )
    )
    (projection): Linear(in_features=2148, out_features=512, bias=True)
  )
  (encoder): ElmobiLm(
    (forward_layer_0): LstmCellWithProjection(
      (input_linearity): Linear(in_features=512, out_features=16384, bias=False)
      (state_linearity): Linear(in_features=512, out_features=16384, bias=True)
      (state_projection): Linear(in_features=4096, out_features=512, bias=False)
    )
    (backward_layer_0): LstmCellWithProjection(
      (input_linearity): Linear(in_features=512, out_features=16384, bias=False)
      (state_linearity): Linear(in_features=512, out_features=16384, bias=True)
      (state_projection): Linear(in_features=4096, out_features=512, bias=False)
    )
    (forward_layer_1): LstmCellWithProjection(
      (input_linearity): Linear(in_features=512, out_features=16384, bias=False)
      (state_linearity): Linear(in_features=512, out_features=16384, bias=True)
      (state_projection): Linear(in_features=4096, out_features=512, bias=False)
    )
    (backward_layer_1): LstmCellWithProjection(
      (input_linearity): Linear(in_features=512, out_features=16384, bias=False)
      (state_linearity): Linear(in_features=512, out_features=16384, bias=True)
      (state_projection): Linear(in_features=4096, out_features=512, bias=False)
    )
  )
)
2021-01-16 22:13:11,365 INFO: 2 batches, avg len: 20.9
training ELMO :  [[ 0.06318592 -0.04212857 -0.40941882 ... -0.393932    0.65597   -0.19988859]
 [ 0.0464317  -0.03159406 -0.23152797 ...  0.2573734   0.28932744 -0.21369117]
 [ 0.04215719 -0.27414545 -0.1282109  ... -0.01528776  0.15322109 -0.02998078]
 ...
 [-0.20043871  0.11804245 -0.5754699  ...  0.19337586 -0.06868231  0.11217812]
 [-0.1898424  -0.24078836 -0.1522124  ... -0.08325598 -0.5789431  -0.21831807]
 [ 0.08684797 -0.14746179 -0.2742679  ...  0.06612014  0.15257567 -0.32261848]]
Settings:
  N Components: 50
  Topic Prior Mean: 0.0
  Topic Prior Variance: 0.98
  Model Type: prodLDA
  Hidden Sizes: (100, 100)
  Activation: softplus
  Dropout: 0.2
  Learn Priors: True
  Learning Rate: 0.002
  Momentum: 0.99
  Reduce On Plateau: False
  Save Dir: None
Traceback (most recent call last):
  File "/home/nassera/PycharmProjects/MyProject/TM_FB/Test_CTM_ELMO.py", line 76, in <module>
    ctm.fit(training_dataset) # run the model
  File "/home/nassera/PycharmProjects/MyProject/venv/lib/python3.8/site-packages/contextualized_topic_models/models/ctm.py", line 227, in fit
    sp, train_loss = self._train_epoch(train_loader)
  File "/home/nassera/PycharmProjects/MyProject/venv/lib/python3.8/site-packages/contextualized_topic_models/models/ctm.py", line 154, in _train_epoch
    for batch_samples in loader:
  File "/home/nassera/PycharmProjects/MyProject/venv/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 363, in __next__
    data = self._next_data()
  File "/home/nassera/PycharmProjects/MyProject/venv/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 989, in _next_data
    return self._process_data(data)
  File "/home/nassera/PycharmProjects/MyProject/venv/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1014, in _process_data
    data.reraise()
  File "/home/nassera/PycharmProjects/MyProject/venv/lib/python3.8/site-packages/torch/_utils.py", line 395, in reraise
    raise self.exc_type(msg)
RuntimeError: Caught RuntimeError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/home/nassera/PycharmProjects/MyProject/venv/lib/python3.8/site-packages/torch/utils/data/_utils/worker.py", line 185, in _worker_loop
    data = fetcher.fetch(index)
  File "/home/nassera/PycharmProjects/MyProject/venv/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 47, in fetch
    return self.collate_fn(data)
  File "/home/nassera/PycharmProjects/MyProject/venv/lib/python3.8/site-packages/torch/utils/data/_utils/collate.py", line 74, in default_collate
    return {key: default_collate([d[key] for d in batch]) for key in elem}
  File "/home/nassera/PycharmProjects/MyProject/venv/lib/python3.8/site-packages/torch/utils/data/_utils/collate.py", line 74, in <dictcomp>
    return {key: default_collate([d[key] for d in batch]) for key in elem}
  File "/home/nassera/PycharmProjects/MyProject/venv/lib/python3.8/site-packages/torch/utils/data/_utils/collate.py", line 55, in default_collate
    return torch.stack(batch, 0, out=out)
RuntimeError: stack expects each tensor to be equal size, but got [30, 1024] at entry 0 and [21, 1024] at entry 1
```

I get a list of NumPy arrays from ELMo, but it crashes in `ctm.fit(training_dataset)`.

@vinid (Contributor) commented Jan 17, 2021

Hello!

I think the error is in `ctm = CombinedTM(input_size=len(handler.vocab), bert_input_size=768, n_components=50)`.

You are telling the model that the size of the contextual input is 768, but if you are using ELMo, you need to put 1024 there :)
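For reference, a minimal sketch of the corrected call (same arguments as in your snippet, only the contextual size changed):

```python
# ELMo embeddings from elmoformanylangs are 1024-dimensional,
# so bert_input_size has to match that dimensionality:
ctm = CombinedTM(input_size=len(handler.vocab), bert_input_size=1024, n_components=50)
```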

Let me know if this fixes the error and if you get good topics (I have never tried ELMo in the topic model, so it would be interesting to know if it works).

Note: we have recently updated the library to a version that is a bit faster than the one that used the TextHandler object. It should also be a little bit easier to use the various embedding models.

@nassera2014 (Author)
Thank you for your response; I changed the size to 1024 but I still have this error:
```text
Settings:
  N Components: 10
  Topic Prior Mean: 0.0
  Topic Prior Variance: 0.9
  Model Type: prodLDA
  Hidden Sizes: (100, 100)
  Activation: softplus
  Dropout: 0.2
  Learn Priors: True
  Learning Rate: 0.002
  Momentum: 0.99
  Reduce On Plateau: False
  Save Dir: None
Traceback (most recent call last):
  File "/home/nassera/PycharmProjects/MyProject/TM_FB/Test_CTM_ELMO.py", line 78, in <module>
    ctm.fit(training_dataset) # run the model
  File "/home/nassera/PycharmProjects/MyProject/venv/lib/python3.8/site-packages/contextualized_topic_models/models/ctm.py", line 227, in fit
    sp, train_loss = self._train_epoch(train_loader)
  File "/home/nassera/PycharmProjects/MyProject/venv/lib/python3.8/site-packages/contextualized_topic_models/models/ctm.py", line 154, in _train_epoch
    for batch_samples in loader:
  File "/home/nassera/PycharmProjects/MyProject/venv/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 363, in __next__
    data = self._next_data()
  File "/home/nassera/PycharmProjects/MyProject/venv/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 989, in _next_data
    return self._process_data(data)
  File "/home/nassera/PycharmProjects/MyProject/venv/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1014, in _process_data
    data.reraise()
  File "/home/nassera/PycharmProjects/MyProject/venv/lib/python3.8/site-packages/torch/_utils.py", line 395, in reraise
    raise self.exc_type(msg)
RuntimeError: Caught RuntimeError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/home/nassera/PycharmProjects/MyProject/venv/lib/python3.8/site-packages/torch/utils/data/_utils/worker.py", line 185, in _worker_loop
    data = fetcher.fetch(index)
  File "/home/nassera/PycharmProjects/MyProject/venv/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 47, in fetch
    return self.collate_fn(data)
  File "/home/nassera/PycharmProjects/MyProject/venv/lib/python3.8/site-packages/torch/utils/data/_utils/collate.py", line 74, in default_collate
    return {key: default_collate([d[key] for d in batch]) for key in elem}
  File "/home/nassera/PycharmProjects/MyProject/venv/lib/python3.8/site-packages/torch/utils/data/_utils/collate.py", line 74, in <dictcomp>
    return {key: default_collate([d[key] for d in batch]) for key in elem}
  File "/home/nassera/PycharmProjects/MyProject/venv/lib/python3.8/site-packages/torch/utils/data/_utils/collate.py", line 55, in default_collate
    return torch.stack(batch, 0, out=out)
RuntimeError: stack expects each tensor to be equal size, but got [33, 1024] at entry 0 and [21, 1024] at entry 1
```

This is the code:

```python
import numpy as np

training_elmo = e.sents2elmo(docELMO)
tr_elmo = np.array(training_elmo, dtype=object)
print("training ELMO : ", tr_elmo)
training_dataset = CTMDataset(handler.bow, tr_elmo, handler.idx2token)

ctm = CombinedTM(input_size=len(handler.vocab), bert_input_size=1024, n_components=10)

ctm.fit(training_dataset)
print('topics : ', ctm.get_topics())
```

@vinid (Contributor) commented Jan 17, 2021

What's the shape of the `tr_elmo` array?

@vinid (Contributor) commented Jan 17, 2021

Ok, so there is probably a problem related to the package you are using.

It seems like `e.sents2elmo` returns the character-level embeddings (not sure, but it should be like this). However, CombinedTM needs one sentence/document embedding for each document.

You need to find a way to aggregate ELMo's results into a single sentence vector to make it work.
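A quick way to see the mismatch (a small diagnostic sketch; `training_elmo` is the list returned by `e.sents2elmo` in your snippet):

```python
# Each document comes back as a (seq_len, 1024) array, and seq_len
# varies per document -- these are exactly the [30, 1024] vs [21, 1024]
# sizes the DataLoader fails to stack into one batch tensor.
for doc in training_elmo[:3]:
    print(doc.shape)
```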

@nassera2014 (Author)
`e.sents2elmo` returns a list of NumPy arrays, each with shape `(seq_len, embedding_size)`.
This is the input that I pass to `e.sents2elmo`:

```python
import numpy as np
from elmoformanylangs import Embedder

docELMO = [i.split() for i in unpreprocessed_documents]
e = Embedder('/home/nassera/136', batch_size=64)
training_elmo = e.sents2elmo(docELMO)
tr_elmo = np.array(training_elmo, dtype=object)
```

@vinid (Contributor) commented Jan 18, 2021

Yes, then that might be the problem, unfortunately.

The training data should have shape `(num_documents, embedding_size)`, that is, one vector for each document, as we encode document representations and not word/character-level representations.

@nassera2014 (Author)
Yes, you are right. Do you have any idea how we can convert the result of `e.sents2elmo` to `(num_documents, embedding_size)`?

Thank you so much.

@vinid (Contributor) commented Jan 20, 2021

You could do mean pooling (averaging over the sequence), but I am not sure how well this works for the embeddings you get from ELMo.
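Something along these lines might work (a minimal NumPy sketch, untested with the topic model; `training_elmo` is the list from `e.sents2elmo`):

```python
import numpy as np

# Mean pooling: average each document's (seq_len, 1024) token matrix
# over the token axis, yielding one 1024-d vector per document.
doc_embeddings = np.array([doc.mean(axis=0) for doc in training_elmo])
print(doc_embeddings.shape)  # (num_documents, 1024)

# The pooled matrix then has the (num_documents, embedding_size)
# shape that CTMDataset expects:
training_dataset = CTMDataset(handler.bow, doc_embeddings, handler.idx2token)
```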

I also saw your open issue. I'll search a bit more.

Is there any reason why you prefer ELMo to BERT?

@nassera2014 (Author)
Thank you so much. I don't prefer ELMo; I would like to compare the two contextualized word embeddings.

vinid closed this as completed Jan 23, 2021