Task Parsing Part for PyTorch Implementation #552
base: master
Changes from 24 commits
@@ -0,0 +1,36 @@
import numpy as np
import math
import torch.nn as nn
import torch


class WordEmbeddingMap:
    def __init__(self, config):
        self.emb_dict, self.dim, self.w2i, self.emb = load(config)

    def isOutOfVocabulary(self, word):
        return word not in self.w2i


def load(config):
    emb_dict = dict()
    w2i = {}
    i = 0
    for line in open(config.get_string("glove.matrixResourceName")):
        # Skip lines with exactly two tokens (e.g., a "vocab_size dim" header line).
        if not len(line.split()) == 2:
            if "\t" in line:
                delimiter = "\t"
            else:
                delimiter = " "
            word, *rest = line.rstrip().split(delimiter)
            # Fold the empty-string entry into the <UNK> token.
            word = "<UNK>" if word == "" else word
Review comment (on the <UNK> substitution above): If Python is OK using an empty string as a key, this should not be necessary.
Reply: It is easier to change the key here than to change all tokens throughout the code.
            w2i[word] = i
            i += 1
            x = np.array(list(map(float, rest)))
            vector = x  # (x / np.linalg.norm(x))  # normalized
            embedding_size = vector.shape[0]
            emb_dict[word] = vector
Review comment (on the emb_dict assignment above): Are two copies of the arrays being kept temporarily, one in emb_dict and another in weights? If memory is an issue, it seems like each vector could be recorded in weights right away.
Reply: You are right, I will refine this later. Thanks!
    # Build the weight matrix in w2i order and wrap it in a frozen nn.Embedding.
    weights = np.zeros((len(emb_dict), embedding_size))
    for w, i in w2i.items():
        weights[i] = emb_dict[w]
    emb = nn.Embedding.from_pretrained(torch.FloatTensor(weights), freeze=True)
    return emb_dict, embedding_size, w2i, emb
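A minimal sketch (not part of this PR) of the refinement the reviewer suggests above: each vector is appended once to a list of rows, the rows are stacked into a single weight matrix, and emb_dict is rebuilt from views into that matrix, so the vectors are never held twice. The function name load_single_copy is hypothetical; the config key, delimiter handling, and <UNK> convention mirror the diff above, and the sketch assumes each word appears at most once in the GloVe file.

import numpy as np
import torch
import torch.nn as nn

def load_single_copy(config):
    # Hypothetical variant of load(): vectors go straight into `rows`,
    # and emb_dict entries become views into the stacked weight matrix.
    w2i = {}
    rows = []
    for line in open(config.get_string("glove.matrixResourceName")):
        if len(line.split()) == 2:  # skip a possible "vocab_size dim" header line
            continue
        delimiter = "\t" if "\t" in line else " "
        word, *rest = line.rstrip().split(delimiter)
        word = "<UNK>" if word == "" else word
        w2i[word] = len(rows)       # assumes each word occurs at most once
        rows.append(np.asarray(rest, dtype=np.float32))
    weights = np.vstack(rows)                            # single dense copy of all vectors
    emb_dict = {w: weights[i] for w, i in w2i.items()}   # views into weights, not copies
    emb = nn.Embedding.from_pretrained(torch.from_numpy(weights), freeze=True)
    return emb_dict, weights.shape[1], w2i, emb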
@@ -0,0 +1,28 @@
from dataclasses import dataclass
import torch.nn as nn
from embeddings.wordEmbeddingMap import *
from pyhocon import ConfigFactory
import torch


# Bundles the frozen embedding layer (emb) with the word-to-index map (w2i).
@dataclass
class ConstEmbeddingParameters:
    emb: nn.Embedding
    w2i: dict


class _ConstEmbeddingsGlove:
    def __init__(self):
        self.SINGLETON_WORD_EMBEDDING_MAP = None
        self.cep = None
        config = ConfigFactory.parse_file('../resources/org/clulab/glove.conf')
        self.load(config)
        self.dim = self.SINGLETON_WORD_EMBEDDING_MAP.dim

    def load(self, config):
        # Load the GloVe map only once and cache the lookup parameters.
        if self.SINGLETON_WORD_EMBEDDING_MAP is None:
            self.SINGLETON_WORD_EMBEDDING_MAP = WordEmbeddingMap(config)
            self.cep = ConstEmbeddingParameters(self.SINGLETON_WORD_EMBEDDING_MAP.emb, self.SINGLETON_WORD_EMBEDDING_MAP.w2i)

    def get_ConstLookupParams(self):
        return self.cep


# Module-level singleton instance.
ConstEmbeddingsGlove = _ConstEmbeddingsGlove()
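For readers following the PR, a short, hedged sketch of how the two new modules fit together: the module-level ConstEmbeddingsGlove singleton exposes the frozen nn.Embedding and the word-to-index map through get_ConstLookupParams(), and a sentence can be embedded by mapping each token to an index, falling back to <UNK> for out-of-vocabulary words. The module path in the import and the example tokens are made up for illustration; the sketch also assumes <UNK> is present in the vocabulary, as arranged by load() above.

import torch
from embeddings.constEmbeddingsGlove import ConstEmbeddingsGlove  # hypothetical module path

cep = ConstEmbeddingsGlove.get_ConstLookupParams()   # ConstEmbeddingParameters(emb, w2i)

tokens = ["John", "eats", "cake", "."]               # made-up example sentence
unk_id = cep.w2i["<UNK>"]                            # assumes <UNK> is in the vocabulary
ids = torch.LongTensor([cep.w2i.get(t, unk_id) for t in tokens])

vectors = cep.emb(ids)                               # shape: (len(tokens), ConstEmbeddingsGlove.dim)
print(vectors.shape)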
Review comment: I think that this might help. In the previous version, with the head start of i = 1, it seems like the wrong vectors might have been used: if one looked up "," in w2i, it might have been mapped to 2 instead of 1.
Reply: That is because the empty string "" and the unknown token <UNK> were treated differently in the previous version: index 0 was taken by <UNK>, so i started from 1. In the current version, "" and <UNK> share the same embedding, so an extra id for ""/<UNK> is no longer needed.