Task Parsing Part for Pytorch Implementation #552

Draft · wants to merge 134 commits into base: master

Commits (134)
d86fccb
init code
ZhengTang1120 Sep 16, 2021
07c4142
Update columnReader.py
ZhengTang1120 Sep 16, 2021
c9ec5b8
refined the code and fixed few bugs
ZhengTang1120 Sep 16, 2021
9311763
initial code for metal
ZhengTang1120 Sep 20, 2021
a33fe36
refine metal, added layers(partial)
ZhengTang1120 Sep 21, 2021
822f6c2
fixed some bugs, init code for embeddings
ZhengTang1120 Sep 21, 2021
8ef31d2
more implementation for embedding layer
ZhengTang1120 Sep 25, 2021
ddcf223
init code for rnnLayer
ZhengTang1120 Sep 25, 2021
03229a4
forward layer implementation
ZhengTang1120 Sep 29, 2021
5a6b128
greedy forward layer
ZhengTang1120 Sep 29, 2021
0aa3aaa
add more functions to layers, init viterbi layer
ZhengTang1120 Sep 30, 2021
e121618
traverse the code, fixed bugs
ZhengTang1120 Sep 30, 2021
4cbeb68
finished the whole model except the viterbi part
ZhengTang1120 Sep 30, 2021
470eca9
finally training...
ZhengTang1120 Sep 30, 2021
2b91e03
the training pipeline is working now
ZhengTang1120 Oct 1, 2021
1892713
fix some minor issues
ZhengTang1120 Oct 1, 2021
850e9d2
make minor changes, implemented Viterbi decoder
ZhengTang1120 Oct 7, 2021
b8e2d3c
Update forwardLayer.py
ZhengTang1120 Oct 7, 2021
c5476bc
Update forwardLayer.py
ZhengTang1120 Oct 7, 2021
9199eed
fixed bugs in viterbi decoder
ZhengTang1120 Oct 7, 2021
fdaf8e4
fixed some bugs, changed default learning rate
ZhengTang1120 Oct 9, 2021
c2b7193
add features and fixed bugs
ZhengTang1120 Oct 19, 2021
66e6400
Update wordEmbeddingMap.py
ZhengTang1120 Oct 19, 2021
46bc27e
Update wordEmbeddingMap.py
ZhengTang1120 Oct 19, 2021
21d861f
Update seqScorer.py
ZhengTang1120 Oct 20, 2021
a1a4465
Update seqScorer.py
ZhengTang1120 Oct 20, 2021
cfb5479
fixed the eval() bug
ZhengTang1120 Oct 21, 2021
3ee92fe
Controlling sources of randomness
ZhengTang1120 Oct 27, 2021
a37ef47
missed import...
ZhengTang1120 Oct 27, 2021
5e9434b
debugged for parsing
ZhengTang1120 Oct 27, 2021
ce00bd7
fixed bugs for parsing
ZhengTang1120 Oct 28, 2021
d00c087
export model to onnx
ZhengTang1120 Oct 28, 2021
ee1c6dd
specified input and output names
ZhengTang1120 Oct 28, 2021
3f89fa7
fixed bug in saving x2i
ZhengTang1120 Oct 28, 2021
96d7dcf
fixed some bugs
ZhengTang1120 Oct 29, 2021
6a37769
remove clipping
ZhengTang1120 Oct 29, 2021
e577f3c
add scheduler
ZhengTang1120 Nov 3, 2021
c8ec489
use xavier uniform to initialize weights
ZhengTang1120 Nov 16, 2021
1a9d4bd
Update metal.py
ZhengTang1120 Nov 16, 2021
dc4e77b
convert layers to a single NN module to save it to onnx
ZhengTang1120 Dec 2, 2021
86f1250
get dummy input
ZhengTang1120 Dec 2, 2021
3e1fd11
Create pytorch2onnx.py
ZhengTang1120 Dec 2, 2021
d9ed82c
Update pytorch2onnx.py
ZhengTang1120 Dec 2, 2021
f29403e
Update pytorch2onnx.py
ZhengTang1120 Dec 2, 2021
b36cbab
converted the list in the model to nnModuleList
ZhengTang1120 Dec 2, 2021
045c581
remove the redundant
ZhengTang1120 Dec 2, 2021
de010c3
Update pytorch2onnx.py
ZhengTang1120 Dec 6, 2021
78c2f27
Update pytorch2onnx.py
ZhengTang1120 Dec 6, 2021
2347df5
Update pytorch2onnx.py
ZhengTang1120 Dec 6, 2021
70a9370
Update pytorch2onnx.py
ZhengTang1120 Dec 7, 2021
d229259
test the onnx model
ZhengTang1120 Dec 9, 2021
5f328f0
Create test_onnx.scala
ZhengTang1120 Dec 15, 2021
424e67a
Delete test_onnx.scala
ZhengTang1120 Dec 15, 2021
80aae2d
change the onnx model to fit scala code
ZhengTang1120 Dec 15, 2021
3c90227
Update test_onnx.py
ZhengTang1120 Dec 15, 2021
9007983
Update pytorch2onnx.py
ZhengTang1120 Dec 16, 2021
82ba700
Update test_onnx.py
ZhengTang1120 Dec 16, 2021
46fb8ad
set random seed for onnx
ZhengTang1120 Jan 26, 2022
a734a89
Update test_onnx.py
ZhengTang1120 Jan 26, 2022
c38dc95
paths to data and embeddings
ZhengTang1120 Jan 26, 2022
753c20e
debug the randomness
ZhengTang1120 Jan 26, 2022
dd6517b
debug randomness
ZhengTang1120 Jan 26, 2022
fea59e0
Update metal.py
ZhengTang1120 Jan 26, 2022
6d8446a
debug randomness
ZhengTang1120 Jan 26, 2022
a4a1db6
Update metal.py
ZhengTang1120 Jan 26, 2022
4889201
move dropout inside model
ZhengTang1120 Jan 26, 2022
3181031
Update layers.py
ZhengTang1120 Jan 26, 2022
593bc7f
move RNNs inside model...
ZhengTang1120 Jan 26, 2022
df5cc6a
Update embeddingLayer.py
ZhengTang1120 Jan 26, 2022
cd5368a
debug randomness
ZhengTang1120 Jan 26, 2022
f19eaf1
Update forwardLayer.py
ZhengTang1120 Jan 26, 2022
31201f0
dropout
ZhengTang1120 Jan 26, 2022
f00163e
debug dropout
ZhengTang1120 Jan 26, 2022
4a194a7
Update forwardLayer.py
ZhengTang1120 Jan 26, 2022
0d3f146
average models
ZhengTang1120 Jan 27, 2022
981a45a
Update run.py
ZhengTang1120 Jan 27, 2022
b9a8ced
Update metal.py
ZhengTang1120 Jan 27, 2022
f4c9f93
Update run.py
ZhengTang1120 Jan 27, 2022
36adfb6
Update run.py
ZhengTang1120 Jan 27, 2022
aa491ff
fixed typo
ZhengTang1120 Jan 27, 2022
dd0dd2a
debug randomness
ZhengTang1120 Jan 27, 2022
eeab1b0
Update mtl-en-ner.conf
ZhengTang1120 Jan 27, 2022
6cda51e
Update utils.py
ZhengTang1120 Jan 27, 2022
d8e6734
Update utils.py
ZhengTang1120 Jan 27, 2022
2b8a78c
solve the randomness
ZhengTang1120 Jan 27, 2022
7c83bfe
Update embeddingLayer.py
ZhengTang1120 Jan 27, 2022
f44f00d
Update mtl-en-ner.conf
ZhengTang1120 Jan 27, 2022
14bede9
Update forwardLayer.py
ZhengTang1120 Jan 27, 2022
382958b
Update metal.py
ZhengTang1120 Jan 27, 2022
a67bc71
fix bugs
ZhengTang1120 Jan 27, 2022
89104bb
Update run.py
ZhengTang1120 Jan 27, 2022
8d31a31
fix bug
ZhengTang1120 Jan 27, 2022
9435f6d
Update forwardLayer.py
ZhengTang1120 Jan 27, 2022
1752b26
Update forwardLayer.py
ZhengTang1120 Jan 27, 2022
c6d8fcc
Update forwardLayer.py
ZhengTang1120 Jan 27, 2022
249afc9
Update forwardLayer.py
ZhengTang1120 Feb 3, 2022
fe4367b
fix bug
ZhengTang1120 Feb 3, 2022
1030177
remove debug print
ZhengTang1120 Feb 3, 2022
0b57ad8
add averaging models feature
ZhengTang1120 Feb 3, 2022
bab975e
Update pytorch2onnx.py
ZhengTang1120 Feb 3, 2022
50c1127
Update pytorch2onnx.py
ZhengTang1120 Feb 3, 2022
05c090a
Update pytorch2onnx.py
ZhengTang1120 Feb 3, 2022
08fd0d0
debug performance difference between torch and onnx
ZhengTang1120 Feb 3, 2022
f567220
debug performance difference between torch and onnx
ZhengTang1120 Feb 3, 2022
f8f1ca1
Update pytorch2onnx.py
ZhengTang1120 Feb 3, 2022
298bdc4
Update pytorch2onnx.py
ZhengTang1120 Feb 3, 2022
949522c
Update pytorch2onnx.py
ZhengTang1120 Feb 3, 2022
d05d81b
Update pytorch2onnx.py
ZhengTang1120 Feb 3, 2022
9bc0ac8
Update pytorch2onnx.py
ZhengTang1120 Feb 3, 2022
ebbfdcf
Update pytorch2onnx.py
ZhengTang1120 Feb 3, 2022
12a777d
Update pytorch2onnx.py
ZhengTang1120 Feb 3, 2022
84d5517
Update pytorch2onnx.py
ZhengTang1120 Feb 3, 2022
a2bfc83
fix bug in viterbi decoding
ZhengTang1120 Feb 3, 2022
3269cd4
Update pytorch2onnx.py
ZhengTang1120 Feb 3, 2022
c90015c
Update pytorch2onnx.py
ZhengTang1120 Feb 3, 2022
81c1bc5
Update pytorch2onnx.py
ZhengTang1120 Feb 3, 2022
91911ae
Update pytorch2onnx.py
ZhengTang1120 Feb 3, 2022
b1be4b5
debug decoder
ZhengTang1120 Feb 3, 2022
40654a6
decoder error...
ZhengTang1120 Feb 3, 2022
7cda4dd
Update pytorch2onnx.py
ZhengTang1120 Feb 3, 2022
7e673e7
Update viterbiForwardLayer.py
ZhengTang1120 Feb 3, 2022
1857a5f
trying to fix the viterbi decoder
ZhengTang1120 Feb 3, 2022
14beb42
Update viterbiForwardLayer.py
ZhengTang1120 Feb 3, 2022
3aee7e7
add other embeddings to onnx model
ZhengTang1120 Mar 3, 2022
cd46faa
Update embeddingLayer.py
ZhengTang1120 Mar 3, 2022
fd514a2
fix bug in distance embeddings
ZhengTang1120 Mar 3, 2022
2786971
Update embeddingLayer.py
ZhengTang1120 Mar 3, 2022
9b97c68
Update embeddingLayer.py
ZhengTang1120 Mar 3, 2022
a20b2c5
Update pytorch2onnx.py
ZhengTang1120 Mar 9, 2022
a002222
implement viterbi decoding
ZhengTang1120 Mar 10, 2022
743bbc5
remove pick span and transduce to simplify the model
ZhengTang1120 Mar 10, 2022
ca34b1c
Update mtl-en-pos-chunk-srlp.conf
ZhengTang1120 Mar 10, 2022
52d903c
save the json only once to save memory and space
ZhengTang1120 Mar 21, 2022
2057c8a
Update mtl-en-srla.conf
ZhengTang1120 Mar 21, 2022
main/src/main/python/embeddings/wordEmbeddingMap.py (new file, 36 additions)
@@ -0,0 +1,36 @@
import numpy as np
import math
import torch.nn as nn
import torch

class WordEmbeddingMap:
    def __init__(self, config):
        self.emb_dict, self.dim, self.w2i, self.emb = load(config)

    def isOutOfVocabulary(self, word):
        return word not in self.w2i

def load(config):
    emb_dict = dict()
    w2i = {}
    i = 0
Comment on lines +15 to +16

Member: I think that this might help. In the previous version, with the head start of i = 1, it seems like the wrong vectors might have been used: if one looked up "," in w2i, it might have been mapped to 2 instead of 1.

Author: This is because we treated the empty string "" and the unknown token "<UNK>" differently in the previous version: 0 was taken by "<UNK>", and i was starting from 1. In the current version, "" and "<UNK>" share the same embedding, so we do not need an extra id for ""/"<UNK>".

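To make the indexing discussion concrete, here is a minimal sketch (toy vocabulary and values, not part of the PR) of why 0-based ids keep w2i aligned with the rows of the weight matrix:

import numpy as np

# Toy vocabulary with 0-based ids, as in the current version.
emb_dict = {"<UNK>": np.zeros(4), ",": np.ones(4), "the": np.full(4, 2.0)}
w2i = {w: i for i, w in enumerate(emb_dict)}          # "<UNK>" -> 0, "," -> 1, "the" -> 2
weights = np.stack([emb_dict[w] for w in emb_dict])   # row k holds the vector of word id k
assert (weights[w2i[","]] == emb_dict[","]).all()     # every lookup hits its own row
# With the old head start of i = 1, an id could be shifted by one relative to
# the row holding its vector, so a lookup could return a neighboring word's embedding.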
    for line in open(config.get_string("glove.matrixResourceName")):
        # skip header-like lines with exactly two tokens (e.g., a word2vec "count dim" line)
        if not len(line.split()) == 2:
            if "\t" in line:
                delimiter = "\t"
            else:
                delimiter = " "
            word, *rest = line.rstrip().split(delimiter)
            word = "<UNK>" if word == "" else word
Member: If Python is OK using an empty string as a key, this should not be necessary.

Author: It is easier to change the key here than to change all the tokens throughout the code...
            w2i[word] = i
            i += 1
            x = np.array(list(map(float, rest)))
            vector = x  # (x / np.linalg.norm(x))  # normalized
            embedding_size = vector.shape[0]
            emb_dict[word] = vector
Member: Are two copies of the arrays being kept temporarily, one in emb_dict and another in weights? If memory is an issue, it seems like one could record each vector right away in weights.

Author: You are right, I will refine this later. Thanks! [A possible single-pass variant is sketched after this diff.]


    # copy the vectors into a dense matrix whose row order follows the word ids
    weights = np.zeros((len(emb_dict), embedding_size))
    for w, i in w2i.items():
        weights[i] = emb_dict[w]
    emb = nn.Embedding.from_pretrained(torch.FloatTensor(weights), freeze=True)
    return emb_dict, embedding_size, w2i, emb
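Following up on the memory thread above, here is a possible single-pass variant of load (a sketch under the same file-format assumptions, not the PR's code) that writes each vector straight into the weight rows instead of keeping the temporary emb_dict:

def load_single_pass(config):
    # Same format assumptions as load() above: one word per line followed by
    # its vector components; two-token lines are treated as a header and skipped.
    w2i = {}
    rows = []
    for line in open(config.get_string("glove.matrixResourceName")):
        if len(line.split()) == 2:
            continue
        delimiter = "\t" if "\t" in line else " "
        word, *rest = line.rstrip().split(delimiter)
        word = "<UNK>" if word == "" else word
        w2i[word] = len(rows)                 # id is the next free row index
        rows.append(np.array(list(map(float, rest))))
    weights = np.stack(rows)                  # one dense copy, built once
    emb = nn.Embedding.from_pretrained(torch.FloatTensor(weights), freeze=True)
    return weights.shape[1], w2i, emb         # dim, word-to-id map, frozen embedding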
main/src/main/python/pytorch/constEmbeddingsGlove.py (new file, 28 additions)
@@ -0,0 +1,28 @@
from dataclasses import dataclass
import torch.nn as nn
from embeddings.wordEmbeddingMap import *
from pyhocon import ConfigFactory
import torch

@dataclass
class ConstEmbeddingParameters:
    emb: nn.Embedding
    w2i: dict

class _ConstEmbeddingsGlove:
    def __init__(self):
        self.SINGLETON_WORD_EMBEDDING_MAP = None
        self.cep = None
        config = ConfigFactory.parse_file('../resources/org/clulab/glove.conf')
        self.load(config)
        self.dim = self.SINGLETON_WORD_EMBEDDING_MAP.dim

    def load(self, config):
        if self.SINGLETON_WORD_EMBEDDING_MAP is None:
            self.SINGLETON_WORD_EMBEDDING_MAP = WordEmbeddingMap(config)
            self.cep = ConstEmbeddingParameters(self.SINGLETON_WORD_EMBEDDING_MAP.emb, self.SINGLETON_WORD_EMBEDDING_MAP.w2i)

    def get_ConstLookupParams(self):
        return self.cep

ConstEmbeddingsGlove = _ConstEmbeddingsGlove()
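For context, a hypothetical caller (names and tokens assumed for illustration, not from this diff) would fetch the shared frozen embeddings through the singleton like this:

import torch
from pytorch.constEmbeddingsGlove import ConstEmbeddingsGlove

cep = ConstEmbeddingsGlove.get_ConstLookupParams()
# Map words to ids, falling back to "<UNK>" (present when the vector file
# contains an empty-string entry, per wordEmbeddingMap above).
ids = torch.LongTensor([cep.w2i.get(w, cep.w2i["<UNK>"]) for w in ["the", "cat"]])
vectors = cep.emb(ids)  # frozen GloVe rows, shape (2, ConstEmbeddingsGlove.dim)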