
Using ELMO instead of BERT #46

Closed
nassera2014 opened this issue Jan 17, 2021 · 9 comments
@nassera2014 commented Jan 17, 2021

Hi,
Thank you for your great and well-explained work. Do you have an idea of how I can use ELMo instead of BERT?

Code:

```python
# Imports added for completeness (module paths as in the
# contextualized-topic-models 1.x and elmoformanylangs packages):
from contextualized_topic_models.utils.data_preparation import TextHandler
from contextualized_topic_models.datasets.dataset import CTMDataset
from contextualized_topic_models.models.ctm import CombinedTM
from elmoformanylangs import Embedder

handler = TextHandler(sentences=preprocessed_documents)
handler.prepare()  # create vocabulary and training data
docELMO = [i.split() for i in unpreprocessed_documents]

e = Embedder('/home/nassera/136', batch_size=64)

training_elmo = e.sents2elmo(docELMO, output_layer=0)

print("training ELMO : ", training_elmo[0])
training_dataset = CTMDataset(handler.bow, training_elmo, handler.idx2token)

ctm = CombinedTM(input_size=len(handler.vocab), bert_input_size=768, n_components=50)

ctm.fit(training_dataset)  # run the model
print('topics : ', ctm.get_topics())
```

When I run this code I get this error:

```text
2021-01-16 22:12:51,392 INFO: char embedding size: 3773
2021-01-16 22:12:52,371 INFO: word embedding size: 221272
2021-01-16 22:12:58,469 INFO: Model(
  (token_embedder): ConvTokenEmbedder(
    (word_emb_layer): EmbeddingLayer(
      (embedding): Embedding(221272, 100, padding_idx=3)
    )
    (char_emb_layer): EmbeddingLayer(
      (embedding): Embedding(3773, 50, padding_idx=3770)
    )
    (convolutions): ModuleList(
      (0): Conv1d(50, 32, kernel_size=(1,), stride=(1,))
      (1): Conv1d(50, 32, kernel_size=(2,), stride=(1,))
      (2): Conv1d(50, 64, kernel_size=(3,), stride=(1,))
      (3): Conv1d(50, 128, kernel_size=(4,), stride=(1,))
      (4): Conv1d(50, 256, kernel_size=(5,), stride=(1,))
      (5): Conv1d(50, 512, kernel_size=(6,), stride=(1,))
      (6): Conv1d(50, 1024, kernel_size=(7,), stride=(1,))
    )
    (highways): Highway(
      (_layers): ModuleList(
        (0): Linear(in_features=2048, out_features=4096, bias=True)
        (1): Linear(in_features=2048, out_features=4096, bias=True)
      )
    )
    (projection): Linear(in_features=2148, out_features=512, bias=True)
  )
  (encoder): ElmobiLm(
    (forward_layer_0): LstmCellWithProjection(
      (input_linearity): Linear(in_features=512, out_features=16384, bias=False)
      (state_linearity): Linear(in_features=512, out_features=16384, bias=True)
      (state_projection): Linear(in_features=4096, out_features=512, bias=False)
    )
    (backward_layer_0): LstmCellWithProjection(
      (input_linearity): Linear(in_features=512, out_features=16384, bias=False)
      (state_linearity): Linear(in_features=512, out_features=16384, bias=True)
      (state_projection): Linear(in_features=4096, out_features=512, bias=False)
    )
    (forward_layer_1): LstmCellWithProjection(
      (input_linearity): Linear(in_features=512, out_features=16384, bias=False)
      (state_linearity): Linear(in_features=512, out_features=16384, bias=True)
      (state_projection): Linear(in_features=4096, out_features=512, bias=False)
    )
    (backward_layer_1): LstmCellWithProjection(
      (input_linearity): Linear(in_features=512, out_features=16384, bias=False)
      (state_linearity): Linear(in_features=512, out_features=16384, bias=True)
      (state_projection): Linear(in_features=4096, out_features=512, bias=False)
    )
  )
)
2021-01-16 22:13:11,365 INFO: 2 batches, avg len: 20.9
training ELMO :  [[ 0.06318592 -0.04212857 -0.40941882 ... -0.393932    0.65597   -0.19988859]
 [ 0.0464317  -0.03159406 -0.23152797 ...  0.2573734   0.28932744 -0.21369117]
 [ 0.04215719 -0.27414545 -0.1282109  ... -0.01528776  0.15322109 -0.02998078]
 ...
 [-0.20043871  0.11804245 -0.5754699  ...  0.19337586 -0.06868231  0.11217812]
 [-0.1898424  -0.24078836 -0.1522124  ... -0.08325598 -0.5789431  -0.21831807]
 [ 0.08684797 -0.14746179 -0.2742679  ...  0.06612014  0.15257567 -0.32261848]]
Settings:
  N Components: 50
  Topic Prior Mean: 0.0
  Topic Prior Variance: 0.98
  Model Type: prodLDA
  Hidden Sizes: (100, 100)
  Activation: softplus
  Dropout: 0.2
  Learn Priors: True
  Learning Rate: 0.002
  Momentum: 0.99
  Reduce On Plateau: False
  Save Dir: None
Traceback (most recent call last):
  File "/home/nassera/PycharmProjects/MyProject/TM_FB/Test_CTM_ELMO.py", line 76, in <module>
    ctm.fit(training_dataset) # run the model
  File "/home/nassera/PycharmProjects/MyProject/venv/lib/python3.8/site-packages/contextualized_topic_models/models/ctm.py", line 227, in fit
    sp, train_loss = self._train_epoch(train_loader)
  File "/home/nassera/PycharmProjects/MyProject/venv/lib/python3.8/site-packages/contextualized_topic_models/models/ctm.py", line 154, in _train_epoch
    for batch_samples in loader:
  File "/home/nassera/PycharmProjects/MyProject/venv/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 363, in __next__
    data = self._next_data()
  File "/home/nassera/PycharmProjects/MyProject/venv/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 989, in _next_data
    return self._process_data(data)
  File "/home/nassera/PycharmProjects/MyProject/venv/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1014, in _process_data
    data.reraise()
  File "/home/nassera/PycharmProjects/MyProject/venv/lib/python3.8/site-packages/torch/_utils.py", line 395, in reraise
    raise self.exc_type(msg)
RuntimeError: Caught RuntimeError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/home/nassera/PycharmProjects/MyProject/venv/lib/python3.8/site-packages/torch/utils/data/_utils/worker.py", line 185, in _worker_loop
    data = fetcher.fetch(index)
  File "/home/nassera/PycharmProjects/MyProject/venv/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 47, in fetch
    return self.collate_fn(data)
  File "/home/nassera/PycharmProjects/MyProject/venv/lib/python3.8/site-packages/torch/utils/data/_utils/collate.py", line 74, in default_collate
    return {key: default_collate([d[key] for d in batch]) for key in elem}
  File "/home/nassera/PycharmProjects/MyProject/venv/lib/python3.8/site-packages/torch/utils/data/_utils/collate.py", line 74, in <dictcomp>
    return {key: default_collate([d[key] for d in batch]) for key in elem}
  File "/home/nassera/PycharmProjects/MyProject/venv/lib/python3.8/site-packages/torch/utils/data/_utils/collate.py", line 55, in default_collate
    return torch.stack(batch, 0, out=out)
RuntimeError: stack expects each tensor to be equal size, but got [30, 1024] at entry 0 and [21, 1024] at entry 1
```

I get a list of NumPy arrays from ELMo, but it crashes in `ctm.fit(training_dataset)`.

@vinid (Contributor) commented Jan 17, 2021

Hello!

I think the error is in `ctm = CombinedTM(input_size=len(handler.vocab), bert_input_size=768, n_components=50)`.

You are telling the model that the size of the contextual input is 768, but if you are using ELMo, you need to put 1024 there :)
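For reference, a minimal sketch of the corrected call (same arguments as in your snippet, only the contextual size changed):

```python
# ELMo embeddings from elmoformanylangs are 1024-dimensional,
# so bert_input_size has to match that dimensionality:
ctm = CombinedTM(input_size=len(handler.vocab), bert_input_size=1024, n_components=50)
```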

Let me know if this fixes the error and if you get good topics (I have never tried ELMo in the topic model, so it would be interesting to know if it works).

Note: we have recently updated the library to a version that is a bit faster than the one that used the TextHandler object. It should also be a little bit easier to use the various embedding models.

@nassera2014 (Author)
Thank you for your response; I changed the size to 1024 but I still have this error:
```text
Settings:
  N Components: 10
  Topic Prior Mean: 0.0
  Topic Prior Variance: 0.9
  Model Type: prodLDA
  Hidden Sizes: (100, 100)
  Activation: softplus
  Dropout: 0.2
  Learn Priors: True
  Learning Rate: 0.002
  Momentum: 0.99
  Reduce On Plateau: False
  Save Dir: None
Traceback (most recent call last):
  File "/home/nassera/PycharmProjects/MyProject/TM_FB/Test_CTM_ELMO.py", line 78, in <module>
    ctm.fit(training_dataset) # run the model
  File "/home/nassera/PycharmProjects/MyProject/venv/lib/python3.8/site-packages/contextualized_topic_models/models/ctm.py", line 227, in fit
    sp, train_loss = self._train_epoch(train_loader)
  File "/home/nassera/PycharmProjects/MyProject/venv/lib/python3.8/site-packages/contextualized_topic_models/models/ctm.py", line 154, in _train_epoch
    for batch_samples in loader:
  File "/home/nassera/PycharmProjects/MyProject/venv/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 363, in __next__
    data = self._next_data()
  File "/home/nassera/PycharmProjects/MyProject/venv/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 989, in _next_data
    return self._process_data(data)
  File "/home/nassera/PycharmProjects/MyProject/venv/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1014, in _process_data
    data.reraise()
  File "/home/nassera/PycharmProjects/MyProject/venv/lib/python3.8/site-packages/torch/_utils.py", line 395, in reraise
    raise self.exc_type(msg)
RuntimeError: Caught RuntimeError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/home/nassera/PycharmProjects/MyProject/venv/lib/python3.8/site-packages/torch/utils/data/_utils/worker.py", line 185, in _worker_loop
    data = fetcher.fetch(index)
  File "/home/nassera/PycharmProjects/MyProject/venv/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 47, in fetch
    return self.collate_fn(data)
  File "/home/nassera/PycharmProjects/MyProject/venv/lib/python3.8/site-packages/torch/utils/data/_utils/collate.py", line 74, in default_collate
    return {key: default_collate([d[key] for d in batch]) for key in elem}
  File "/home/nassera/PycharmProjects/MyProject/venv/lib/python3.8/site-packages/torch/utils/data/_utils/collate.py", line 74, in <dictcomp>
    return {key: default_collate([d[key] for d in batch]) for key in elem}
  File "/home/nassera/PycharmProjects/MyProject/venv/lib/python3.8/site-packages/torch/utils/data/_utils/collate.py", line 55, in default_collate
    return torch.stack(batch, 0, out=out)
RuntimeError: stack expects each tensor to be equal size, but got [33, 1024] at entry 0 and [21, 1024] at entry 1
```

This is the code:

```python
import numpy as np

training_elmo = e.sents2elmo(docELMO)
tr_elmo = np.array(training_elmo, dtype=object)
print("training ELMO : ", tr_elmo)
training_dataset = CTMDataset(handler.bow, tr_elmo, handler.idx2token)

ctm = CombinedTM(input_size=len(handler.vocab), bert_input_size=1024, n_components=10)

ctm.fit(training_dataset)
print('topics : ', ctm.get_topics())
```

@vinid (Contributor) commented Jan 17, 2021

What's the shape of the `tr_elmo` array?

@vinid (Contributor) commented Jan 17, 2021

Ok, so there is probably a problem related to the package you are using.

It seems like `e.sents2elmo` returns the character-level embeddings (not sure, but it should be like this). However, CombinedTM needs one sentence/document embedding for each document.

You need to find a way to aggregate ELMo's results into a single sentence vector to make it work.
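A quick way to see the mismatch (a small diagnostic sketch; `training_elmo` is the list returned by `e.sents2elmo` in your snippet):

```python
# Each document comes back as a (seq_len, 1024) array, and seq_len
# varies per document -- these are exactly the [30, 1024] vs [21, 1024]
# sizes the DataLoader fails to stack into one batch tensor.
for doc in training_elmo[:3]:
    print(doc.shape)
```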

@nassera2014 (Author)
`e.sents2elmo` returns a list of NumPy arrays, each with shape `(seq_len, embedding_size)`.
This is the input that I pass to `e.sents2elmo`:

```python
import numpy as np
from elmoformanylangs import Embedder

docELMO = [i.split() for i in unpreprocessed_documents]
e = Embedder('/home/nassera/136', batch_size=64)
training_elmo = e.sents2elmo(docELMO)
tr_elmo = np.array(training_elmo, dtype=object)
```

@vinid (Contributor) commented Jan 18, 2021

Yes, then that might be the problem, unfortunately.

The training data should have shape `(num_documents, embedding_size)`, that is, one vector for each document, as we encode document representations and not word/character-level representations.

@nassera2014 (Author)
Yes, you are right. Do you have any idea how we can convert the result of `e.sents2elmo` to `(num_documents, embedding_size)`?

Thank you so much.

@vinid (Contributor) commented Jan 20, 2021

You could do mean pooling (averaging over the sequence), but I am not sure how well this works for the embeddings you get from ELMo.
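Something along these lines might work (a minimal NumPy sketch, untested with the topic model; `training_elmo` is the list from `e.sents2elmo`):

```python
import numpy as np

# Mean pooling: average each document's (seq_len, 1024) token matrix
# over the token axis, yielding one 1024-d vector per document.
doc_embeddings = np.array([doc.mean(axis=0) for doc in training_elmo])
print(doc_embeddings.shape)  # (num_documents, 1024)

# The pooled matrix then has the (num_documents, embedding_size)
# shape that CTMDataset expects:
training_dataset = CTMDataset(handler.bow, doc_embeddings, handler.idx2token)
```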

I also saw your open issue. I'll search a bit more.

Is there any reason why you prefer ELMo to BERT?

@nassera2014 (Author)
Thank you so much. I don't prefer ELMo; I would like to compare the two contextualized word embeddings.

vinid closed this as completed Jan 23, 2021