AI_prompts.txt
🟩 how can I create a config file for a python environment for anaconda, with the following libraries
nltk
collections
sklearn
pandas
numpy
seaborn
gensim
datasets
tokenizers
torch
transformers
tqdm
einops
torchinfo
accelerate
huggingface_hub
x_transformers
🟩 once I have created the environment, how can I install new packages
🟩 Do the kde plot of the question lengths in 'dataset'
🟩 set the darkgrid theme for seaborn
🟩 do the kde plot with the length of fact1 and fact2 (in the same plot)
🟩 histogram of the answerKey
🟩 get the 'text' from the 'choices' and check if all the texts consist of only one word
🟩 @workspace I have my dataset with these fields:
{'id': '3E7TUJ2EGCLQNOV1WEAJ2NN9ROPD9K',
'question': 'What type of water formation is formed by clouds?',
'choices': {'text': ['pearls',
'streams',
'shells',
'diamonds',
'rain',
'beads',
'cooled',
'liquid'],
'label': ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H']},
'answerKey': 'F',
'fact1': 'beads of water are formed by water vapor condensing',
'fact2': 'Clouds are made of water vapor.',
'combinedfact': 'Beads of water can be formed by clouds.',
'formatted_question': 'What type of water formation is formed by clouds? (A) pearls (B) streams (C) shells (D) diamonds (E) rain (F) beads (G) cooled (H) liquid'}
How can I compute tf-idf on fact1+fact2+question and compute the cosine similarity with all the choices
🟩 create a dataset "facts_and_question" in which we have f"{data['fact1']} {data['fact2']} {data['question']}"
🟩 @workspace #selection how to create a pandas series in which I have as items "{data['fact1']} {data['fact2']} {data['question']}" for each row of the data
🟩 quick look at the data
🟩 #selection I have the "choices" columns in a pandas dataframe of this kind "choices": { "label": ["A", "B", "C", "D", "E", "F", "G", "H"], "text": ["sand", "occurs over a wide range", "forests", "Global warming", "rapid changes occur", "local weather conditions", "measure of motion", "city life"] },
I want to create a column for each label, with the associated text
🟩 #selection check that all the labels are always ordered like ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H']
🟩 check if there are uppercase entries in the vocabulary
🟩 #selection look at one sample of the df_train_tfidf
🟩 apply this pipeline to each col of the df_train_tfidf and store the results in a dict
🟩 #selection compute the cosine similarity between the 'facts_and_questions' and all the choices 'A', 'B',... 'H'
🟩 #selection predict as answerkey the choice with higher cosine similarity
🟩 if all the cosine similarities are 0, pick randomly, but with a fixed seed for reproducibility
🟩 make it a function of df_tfidf
🟩 set a seed
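The prediction rule described in the prompts above (pick the choice with the highest cosine similarity, fall back to a seeded random pick when every similarity is 0) can be sketched in plain Python. This is only an illustration: `predict_choice`, `LABELS`, and the example similarity rows are hypothetical names and values, not taken from the notebook.

```python
import random

LABELS = ["A", "B", "C", "D", "E", "F", "G", "H"]  # assumed label order


def predict_choice(similarities, rng):
    """Return the label with the highest similarity, or a random label if all are 0."""
    if all(s == 0 for s in similarities):
        # No signal at all: fall back to a reproducible random pick.
        return rng.choice(LABELS)
    # max() returns the first index in case of ties.
    best_index = max(range(len(similarities)), key=lambda i: similarities[i])
    return LABELS[best_index]


rng = random.Random(42)  # fixed seed so the random fallback is reproducible
rows = [
    [0.0, 0.1, 0.0, 0.0, 0.3, 0.7, 0.0, 0.2],  # made-up similarities
    [0.0] * 8,                                 # all zeros: random fallback
]
predictions = [predict_choice(row, rng) for row in rows]
```

Using a dedicated `random.Random` instance keeps the fallback reproducible without touching the global random state.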
🟩 compute accuracy and f1 between pred and df['answerKey']
🟩 @workspace truncate the tf-idf matrices with SVD
🟩 #selection instead of applying singularly svd to each key of the dict, concat all of them together, apply svd, split them again
🟩 the columns must be stacked vertically, not horizontally
🟩 #selection it is strange to get f1 score exactly equal to accuracy with micro average
🟩 how can I compute the cosine similarity between two pandas series element-wise (the first with the first, the second with the second), outputting a series (not pairwise between all the elements)
🟩 what if they are just 2 numpy arrays
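A row-by-row (not all-pairs) cosine similarity, as asked above, can be sketched without any library. This version uses plain Python lists instead of pandas or NumPy; `cosine_similarity` and `rowwise_cosine` are hypothetical helper names, and returning 0.0 for a zero vector is a convention chosen here.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)


def rowwise_cosine(first, second):
    """Compare the i-th vector of `first` only with the i-th vector of `second`."""
    return [cosine_similarity(u, v) for u, v in zip(first, second)]


similarities = rowwise_cosine([[1, 0], [1, 1]], [[1, 0], [1, -1]])
```

The result has one value per row, which is the shape the prompt asks for.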
🟩 how to transform a sparse matrix into a normal numpy array
🟩 how to apply the argmax along the rows of a df
🟩 map the argmax indices to the corresponding choices
🟩 how to check if all the entries have just one word
🟩 /tmp/ipykernel_849325/14407470.py:3: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use ser.iloc[pos]
df_train_ngram[choices] = df_train_ngram[choices].apply(lambda x: str(x.str.split().str[0]))
🟩 I have 8 words for each row of a df, how to check that they are all different
🟩 AttributeError: 'Series' object has no attribute 'split'
🟩 concat 'question' 'fact1' 'fact2' into a single series
🟩 along the column axis
🟩 apply the split function to each entry
🟩 get the index of best_choice
🟩 get the idf (inverse document frequency) score for each word, from a list of documents
🟩 could you do it in python
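A minimal sketch of the IDF computation asked for above, using the plain definition idf(w) = log(N / df(w)), where N is the number of documents and df(w) the number of documents containing w. Lowercasing and whitespace splitting are assumptions made here, and `inverse_document_frequency` is a hypothetical name; libraries often use smoothed variants, so the values may differ.

```python
import math
from collections import Counter


def inverse_document_frequency(documents):
    """Map each word to log(N / df), with df = number of documents containing it."""
    n_docs = len(documents)
    doc_freq = Counter()
    for doc in documents:
        # set() so that a word is counted at most once per document
        doc_freq.update(set(doc.lower().split()))
    return {word: math.log(n_docs / df) for word, df in doc_freq.items()}


docs = [
    "clouds are made of water vapor",
    "beads of water are formed",
    "rain falls",
]
idf = inverse_document_frequency(docs)
```

Words that appear in every document get a score of 0, and rarer words get higher scores.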
🟩 'DataFrame' object has no attribute 'append'. how to iteratively add a row to a pandas df
🟩 #selection RuntimeError: Expected a 'cuda' device type for generator but found 'cpu'
🟩 for i, (facts_and_question, choice, label) in enumerate(train_loader) still gives the error
🟩 #selection RuntimeError: mat1 and mat2 must have the same dtype, but got Double and Float
🟩 #selection torch.float64 RuntimeError: mat1 and mat2 must have the same dtype, but got Float and Double
🟩 convert them to float64 instead of float32
🟩 take a part of the train_data and use it as a val set
🟩 #selection how would you improve the siamese neural network
🟩 #selection ValueError: expected 2D or 3D input (got 1D input)
🟩 #selection ValueError: only one element tensors can be converted to Python scalars
🟩 ---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Cell In[38], line 21
17 predicted_choices = torch.argmax(output, dim=1)
18 return predicted_choices
---> 21 pred_train = argmax_predict_NeuralNetwork(df_train_noDuplicates, pretrained_W2V, model)
22 train_accuracy, train_f1 = evaluate_predictions(answerKey_train, pred_train, 'train')
23 print('')
Cell In[38], line 8
6 data = {}
7 for col in ['facts_and_question', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H']:
----> 8 data[col] = torch.tensor(df[col].apply(lambda x: torch.tensor(sent_to_emb(x, pretrained_vectors)).float().to(mydevice)))
10 # apply the model to the data
11 # get the output for the 'facts_and_question' and all the choices
12 output = model(data['facts_and_question'], data['A'])
File ~/.pyenv/versions/3.12.3/envs/nlp-env/lib/python3.12/site-packages/torch/utils/_device.py:79, in DeviceContext.__torch_function__(self, func, types, args, kwargs)
77 if func in _device_constructors() and kwargs.get('device') is None:
78 kwargs['device'] = self.device
---> 79 return func(*args, **kwargs)
ValueError: only one element tensors can be converted to Python scalars
🟩 /fix InvalidParameterError: The 'y_pred' parameter of accuracy_score must be an array-like or a sparse matrix. Got 7546 instead.
🟩 #selection try to improve the neural network
🟩 make the markdown table for this data. cols = [train, validation], rows = [accuracy, F1]
train Accuracy: 0.88222 train F1 Score: 0.88209
validation Accuracy: 0.85313 validation F1 Score: 0.85184
🟩 write the code for writing it in markdown
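One possible way to write the table above from code, as a sketch: `to_markdown_table` is a hypothetical helper, and the numbers are the ones quoted in the prompt.

```python
def to_markdown_table(results, columns, rows):
    """Build a markdown table with one column per split and one row per metric."""
    header = "| | " + " | ".join(columns) + " |"
    separator = "|---|" + "|".join("---" for _ in columns) + "|"
    lines = [header, separator]
    for row in rows:
        cells = " | ".join(f"{results[col][row]:.5f}" for col in columns)
        lines.append(f"| {row} | {cells} |")
    return "\n".join(lines)


results = {
    "train": {"accuracy": 0.88222, "F1": 0.88209},
    "validation": {"accuracy": 0.85313, "F1": 0.85184},
}
table = to_markdown_table(results, ["train", "validation"], ["accuracy", "F1"])
print(table)
```

In a notebook, the resulting string could also be rendered with `IPython.display.Markdown(table)`.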
🟩 #selection decrease the overfitting of the model
🟩 def preprocess_function(examples):
    first_sentences = [[context] * 4 for context in examples["sent1"]]
    question_headers = examples["sent2"]
    second_sentences = [
        [f"{header} {examples[end][i]}" for end in ending_names] for i, header in enumerate(question_headers)
    ]
    first_sentences = sum(first_sentences, [])
    second_sentences = sum(second_sentences, [])
    tokenized_examples = tokenizer(first_sentences, second_sentences, truncation=True)
    return {k: [v[i : i + 4] for i in range(0, len(v), 4)] for k, v in tokenized_examples.items()}
Explain with an example step by step
🟩 #selection File <string>:35
self.embedding_layer = nn.Embedding(num_embeddings, embedding_dim, padding_idx=padding_idx)
TabError: inconsistent use of tabs and spaces in indentation
🟩 #selection explain why it uses the slice on out:
out = self.dropout(out)
out = torch.relu_(self.fc1(out[:,-1,:]))
🟩 #selection write it with the proper function from dataset module
🟩 ---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Cell In[18], line 35
31 df['choices'] = [choice['text'] for choice in df['choices']]
33 return df
---> 35 dataset_train = format_choices(dataset_train)
36 dataset_val = format_choices(dataset_val)
37 dataset_test = format_choices(dataset_test)
Cell In[18], line 31
28 wrong_choices = [choice['label'] != correct_order for choice in dataset_train['choices']]
29 if sum(wrong_choices) == 0:
30 # get the text of the choices
---> 31 df['choices'] = [choice['text'] for choice in df['choices']]
33 return df
TypeError: 'Dataset' object does not support item assignment
🟩 #selection ---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
Cell In[68], line 18
15 examples['input_ids'] = tokenized_examples['input_ids']
16 return examples
---> 18 dataset_train_inputids = dataset.map(preprocess_function)#, remove_columns=dataset_train.column_names) #, batched=True)
19 # dataset_train_inputids = dataset_train_inputids.with_format("torch")
20 # print(dataset)
File ~/.pyenv/versions/3.12.3/envs/nlp-env/lib/python3.12/site-packages/datasets/dataset_dict.py:870, in DatasetDict.map(self, function, with_indices, with_rank, input_columns, batched, batch_size, drop_last_batch, remove_columns, keep_in_memory, load_from_cache_file, cache_file_names, writer_batch_size, features, disable_nullable, fn_kwargs, num_proc, desc)
866 if cache_file_names is None:
867 cache_file_names = {k: None for k in self}
868 return DatasetDict(
869 {
--> 870 k: dataset.map(
871 function=function,
872 with_indices=with_indices,
873 with_rank=with_rank,
874 input_columns=input_columns,
875 batched=batched,
876 batch_size=batch_size,
877 drop_last_batch=drop_last_batch,
878 remove_columns=remove_columns,
879 keep_in_memory=keep_in_memory,
880 load_from_cache_file=load_from_cache_file,
...
8 ]
10 # Tokenize
11 tokenized_examples = fast_tokenizer(sentences, truncation=True)
KeyError: 0
Output is truncated. View as a scrollable element or open in a text editor. Adjust cell output settings...
🟩 #file:notebook3_contextualizedEmbeddings.ipynb apply the model to the data, and group the results by lists of 8 elements, and predict the argmax (for each list)
🟩 #file:notebook3_contextualizedEmbeddings.ipynb Do you see any problem with the BiLSTM_Classifier? the loss stops decreasing after just a few epochs, and the performance on the task is very poor
🟩 I still get the same problems, maybe there is some error in preparing the data.
I have created 8 copies of each sample, each of them ending with one of the possible choices.
Each of these has label 0 if the choice was wrong, and 1 if it was the correct one
🟩 #selection attach to this transformer encoder a classification head with pytorch
🟩 #selection expected a 'cuda' device generator but found 'cpu' with hugging face trainer
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
Cell In[131], line 20
8 dataset_val_encoded.set_format(type='torch', columns=['input_ids', 'attention_mask', 'answerKey'])
10 trainer = Trainer(
11 model,
12 args,
(...)
17 compute_metrics=compute_metrics,
18 )
---> 20 trainer.train()
File ~/.pyenv/versions/nlp-env/lib/python3.12/site-packages/transformers/trainer.py:1938, in Trainer.train(self, resume_from_checkpoint, trial, ignore_keys_for_eval, **kwargs)
1936 hf_hub_utils.enable_progress_bars()
1937 else:
-> 1938 return inner_training_loop(
1939 args=args,
1940 resume_from_checkpoint=resume_from_checkpoint,
1941 trial=trial,
1942 ignore_keys_for_eval=ignore_keys_for_eval,
1943 )
File ~/.pyenv/versions/nlp-env/lib/python3.12/site-packages/transformers/trainer.py:2236, in Trainer._inner_training_loop(self, batch_size, args, resume_from_checkpoint, trial, ignore_keys_for_eval)
2233 rng_to_sync = True
2235 step = -1
...
77 if func in _device_constructors() and kwargs.get('device') is None:
78 kwargs['device'] = self.device
---> 79 return func(*args, **kwargs)
RuntimeError: Expected a 'cuda' device type for generator but found 'cpu'
Output is truncated. View as a scrollable element or open in a text editor. Adjust cell output settings...
🟩 how can I move the dataset to cuda
🟩 I'm trying to train the AutoModelForMultipleChoice with the trainer from HuggingFace, but I get:
RuntimeError Traceback (most recent call last)
Cell In[167], line 16
4 return {"accuracy": (preds == label_ids).astype(np.float32).mean().item()}
6 trainer = Trainer(
7 model,
8 args,
(...)
13 compute_metrics=compute_metrics,
14 )
---> 16 trainer.train()
File ~/.pyenv/versions/nlp-env/lib/python3.12/site-packages/transformers/trainer.py:1938, in Trainer.train(self, resume_from_checkpoint, trial, ignore_keys_for_eval, **kwargs)
1936 hf_hub_utils.enable_progress_bars()
1937 else:
-> 1938 return inner_training_loop(
1939 args=args,
1940 resume_from_checkpoint=resume_from_checkpoint,
1941 trial=trial,
1942 ignore_keys_for_eval=ignore_keys_for_eval,
1943 )
File ~/.pyenv/versions/nlp-env/lib/python3.12/site-packages/transformers/trainer.py:2236, in Trainer._inner_training_loop(self, batch_size, args, resume_from_checkpoint, trial, ignore_keys_for_eval)
2233 rng_to_sync = True
2235 step = -1
-> 2236 for step, inputs in enumerate(epoch_iterator):
2237 total_batched_samples += 1
2239 if self.args.include_num_input_tokens_seen:
File ~/.pyenv/versions/nlp-env/lib/python3.12/site-packages/accelerate/data_loader.py:454, in DataLoaderShard.__iter__(self)
452 # We iterate one batch ahead to check when we are at the end
453 try:
--> 454 current_batch = next(dataloader_iter)
455 except StopIteration:
456 yield
File ~/.pyenv/versions/nlp-env/lib/python3.12/site-packages/torch/utils/data/dataloader.py:630, in _BaseDataLoaderIter.__next__(self)
627 if self._sampler_iter is None:
628 # TODO(https://github.com/pytorch/pytorch/issues/76750)
629 self._reset() # type: ignore[call-arg]
--> 630 data = self._next_data()
631 self._num_yielded += 1
632 if self._dataset_kind == _DatasetKind.Iterable and \
633 self._IterableDataset_len_called is not None and \
634 self._num_yielded > self._IterableDataset_len_called:
File ~/.pyenv/versions/nlp-env/lib/python3.12/site-packages/torch/utils/data/dataloader.py:672, in _SingleProcessDataLoaderIter._next_data(self)
671 def _next_data(self):
--> 672 index = self._next_index() # may raise StopIteration
673 data = self._dataset_fetcher.fetch(index) # may raise StopIteration
674 if self._pin_memory:
File ~/.pyenv/versions/nlp-env/lib/python3.12/site-packages/torch/utils/data/dataloader.py:620, in _BaseDataLoaderIter._next_index(self)
619 def _next_index(self):
--> 620 return next(self._sampler_iter)
File ~/.pyenv/versions/nlp-env/lib/python3.12/site-packages/torch/utils/data/sampler.py:288, in BatchSampler.__iter__(self)
286 batch = [0] * self.batch_size
287 idx_in_batch = 0
--> 288 for idx in self.sampler:
289 batch[idx_in_batch] = idx
290 idx_in_batch += 1
File ~/.pyenv/versions/nlp-env/lib/python3.12/site-packages/accelerate/data_loader.py:92, in SeedableRandomSampler.__iter__(self)
90 # print("Setting seed at epoch", self.epoch, seed)
91 self.generator.manual_seed(seed)
---> 92 yield from super().__iter__()
93 self.set_epoch(self.epoch + 1)
File ~/.pyenv/versions/nlp-env/lib/python3.12/site-packages/torch/utils/data/sampler.py:168, in RandomSampler.__iter__(self)
166 else:
167 for _ in range(self.num_samples // n):
--> 168 yield from torch.randperm(n, generator=generator).tolist()
169 yield from torch.randperm(n, generator=generator).tolist()[:self.num_samples % n]
File ~/.pyenv/versions/nlp-env/lib/python3.12/site-packages/torch/utils/_device.py:79, in DeviceContext.__torch_function__(self, func, types, args, kwargs)
77 if func in _device_constructors() and kwargs.get('device') is None:
78 kwargs['device'] = self.device
---> 79 return func(*args, **kwargs)
RuntimeError: Expected a 'cuda' device type for generator but found 'cpu'
🟩 there is no generator in my code; it is used by the trainer
🟩 from tqdm import tqdm
generator = torch.Generator(device=mydevice)
train_loader = torch.utils.data.DataLoader(dataset_train_encoded, batch_size=batch_size, shuffle=True,
                                           collate_fn=DataCollatorForMultipleChoice(tokenizer), generator=generator)
loop = tqdm(train_loader)
for batch in loop:
    outputs = model(input_ids=batch['input_ids'], attention_mask=batch['attention_mask'],
                    token_type_ids=batch['token_type_ids'], labels=batch['answerKey'])
    break
I get the following error:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
File ~/.pyenv/versions/nlp-env/lib/python3.12/site-packages/transformers/tokenization_utils_base.py:762, in BatchEncoding.convert_to_tensors(self, tensor_type, prepend_batch_axis)
761 if not is_tensor(value):
--> 762 tensor = as_tensor(value)
764 # Removing this for now in favor of controlling the shape with `prepend_batch_axis`
765 # # at-least2d
766 # if tensor.ndim > 2:
767 # tensor = tensor.squeeze(0)
768 # elif tensor.ndim < 2:
769 # tensor = tensor[None, :]
File ~/.pyenv/versions/nlp-env/lib/python3.12/site-packages/transformers/tokenization_utils_base.py:724, in BatchEncoding.convert_to_tensors.<locals>.as_tensor(value, dtype)
723 return torch.tensor(np.array(value))
--> 724 return torch.tensor(value)
File ~/.pyenv/versions/nlp-env/lib/python3.12/site-packages/torch/utils/_device.py:79, in DeviceContext.__torch_function__(self, func, types, args, kwargs)
78 kwargs['device'] = self.device
---> 79 return func(*args, **kwargs)
ValueError: too many dimensions 'str'
The above exception was the direct cause of the following exception:
ValueError Traceback (most recent call last)
...
782 " expected)."
783 ) from e
785 return self
ValueError: Unable to create tensor, you should probably activate truncation and/or padding with 'padding=True' 'truncation=True' to have batched tensors with the same length. Perhaps your features (`id` in this case) have excessive nesting (inputs type `list` where type `int` is expected).
Output is truncated. View as a scrollable element or open in a text editor. Adjust cell output settings...
🟩 how to use the Trainer from huggingface manually, instead of using trainer.train()
🟩 RuntimeError: Expected a 'cuda' device type for generator but found 'cpu'
can I manually build the dataloader, with a generator.to(mydevice). or set the generator for the current dataloader
🟩 trainer.evaluate(eval_dataset=val_dataloader)
TypeError: 'DataLoader' object is not subscriptable
🟩 how can I print just the information on the last layers of a pytorch model
🟩 how can I measure both accuracy and f1 with evaluate from huggingface
🟩 integrate it in this code, I'm not using the Trainer
for batch in train_dataloader:
    batch = {k: v.to(mydevice) for k, v in batch.items()}
    with torch.no_grad():
        outputs = model(**batch)
    logits = outputs.logits
    predictions = torch.argmax(logits, dim=-1)
    metric.add_batch(predictions=predictions, references=batch["labels"])
print(metric.compute())
🟩 #selection create a pytorch model which takes as input the trf_model and attaches the following classification head to it
)
(dropout): Dropout(p=0.1, inplace=False)
(classifier): Linear(in_features=768, out_features=1, bias=True)
)
🟩 how can I create a format of this kind
f"""
Question: {item['question']}
fact1: {item['fact1']}
A) {item['choices'][0]}
B) {item['choices'][1]}
...
H) {item['choices'][7]}
Choose the correct choice. Answer with the corresponding letter only.
"""
And then compile it each time by unpacking the **item, and formatting the prompt with string formatting
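A sketch of the template idea described above, assuming each item is a dict whose 'choices' entry is a list of 8 texts. `str.format` accepts index lookups such as `{choices[0]}`, so the template can be filled by unpacking the item; `PROMPT_TEMPLATE` and the example item are illustrative.

```python
PROMPT_TEMPLATE = """Question: {question}
fact1: {fact1}
A) {choices[0]}
B) {choices[1]}
C) {choices[2]}
D) {choices[3]}
E) {choices[4]}
F) {choices[5]}
G) {choices[6]}
H) {choices[7]}
Choose the correct choice. Answer with the corresponding letter only."""

# Example item, using the sample shown earlier in this log.
item = {
    "question": "What type of water formation is formed by clouds?",
    "fact1": "beads of water are formed by water vapor condensing",
    "choices": ["pearls", "streams", "shells", "diamonds",
                "rain", "beads", "cooled", "liquid"],
}

# Extra keys in the item are ignored by str.format, so **item is safe to unpack.
prompt = PROMPT_TEMPLATE.format(**item)
print(prompt)
```

A plain string template (rather than an f-string) is used so that it can be defined once and filled in later for each item.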
🟩 If I build the tf-idf representation on the train data, how can I get the tf-idf of a new sample
🟩 I have the following code in pytorch:
model.eval()
accuracy_metric = evaluate.load("accuracy")
t0 = timeit.default_timer()
for batch in test_dataloader:
    batch = {k: v.to(mydevice) for k, v in batch.items()}
    with torch.no_grad():
        outputs = model(**batch)
    logits = outputs.logits
    predictions = torch.argmax(logits, dim=-1)
    accuracy_metric.add_batch(predictions=predictions, references=batch["labels"])
how can I save the index of the samples that I have answered wrong, so I can look at those samples later
🟩 I'm doing a project for the NLP exam, in which I have a question and 8 choices: a multiple-choice task. Can you suggest some names for the project? I have chosen "classification Between Multiple Choices", but it is quite boring
🟩 could you check if the following conclusions for my report on an NLP project are good:
To do the project, I started with the simplest methods that we know will not produce incredible results, but are still interesting since they are the beginning of NLP. So with tf-idf, which is simple but still achieves meaningful performance, and a bigram/trigram LM. Then I tried the neural models used to generate word embeddings, such as Word2Vec, which is trained to predict the context given an input word. For the vector representations (both tf-idf and word embeddings), the main metric used to compare two sentences is cosine similarity, and among the possible alternatives I chose the one with the highest cosine similarity. Finally, BERT and LLM, which use (part of) the transformer architecture that is state of the art for seq2seq models. In particular, the encoder (BERT-like models are encoder-only) extracts the meaning from the input sequence, and the decoder (LLM are decoder-only) generates something new given the input. I wanted to thank you for the lessons and Dr. Marco Braga for answering my questions about the project.