8/26/2021
- scraped discord messages to make a training dataset
- scraper: https://github.com/Tyrrrz/DiscordChatExporter/
- message source: Aerin Chu, who kindly gave permission for us to use her messages
- pasted basic markovify program
comments:
- make sure program can detect the end of a sentence/phrase; people don't usually fully punctuate discord messages
- denote end of sentence/phrase?? when do people split up their messages
- unconventional language: misspelled words ("heyyyyy"), emojis (":3"), keysmashes ("sdfjsk")
- sentiment analysis?
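One possible way to handle the unconventional tokens noted above, as a rough sketch only. The function names, the `<keysmash>` tag, and the "no vowels" heuristic are all assumptions for illustration, not something the cleaner actually does. Emoticons like ":3" pass through untouched.

```python
import re

VOWELS = set("aeiou")


def squeeze_repeats(word: str, max_run: int = 2) -> str:
    """Collapse any run of one character longer than max_run ("heyyyyy" -> "heyy")."""
    pattern = r"(.)\1{%d,}" % max_run
    return re.sub(pattern, lambda m: m.group(1) * max_run, word)


def looks_like_keysmash(word: str, min_len: int = 5) -> bool:
    """Crude, hypothetical heuristic: a longer alphabetic token with no vowels."""
    w = word.lower()
    return len(w) >= min_len and w.isalpha() and not (set(w) & VOWELS)


def tag_token(word: str) -> str:
    """Replace keysmashes with a placeholder tag, otherwise squeeze repeated letters."""
    if looks_like_keysmash(word):
        return "<keysmash>"
    return squeeze_repeats(word)
```

The vowel check will misfire on real words like "rhythm", so this is only a starting point for deciding what the cleaner should keep.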
8/27/2021
- cleaned aerin1.txt --> aerin1cleaned1.txt (data preprocessing)
- ran through markovify
- modified default call so that markovify delineates new lines as "sentences" rather than looking for punctuation
comments:
- need to consider what the bot is responding to! how??
- use markovify.NewlineText to delineate new segment of text?* done
- ideas:
- make it so bot can also return wordclouds? word frequencies, etc.
- live scraping? make it so it updates every time
- mood? sentiment analysis?
- ask josh for discord bot help!!
Niema:
- state_size is order of markov chain (how many previous words the generated word depends on)
- for live scraping: update chain, retrain
- classify types of messages? for replies -- jokes, positive/negative (sentiment analysis)
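To make the state_size note concrete, here is a minimal standard-library sketch of an order-k Markov chain that treats each line as one "sentence". This is not markovify's actual implementation; the names `build_chain`, `generate`, and the BEGIN/END markers are assumptions made up for this example.

```python
import random
from collections import defaultdict

BEGIN, END = "__BEGIN__", "__END__"  # hypothetical sentinel tokens


def build_chain(text: str, state_size: int = 2) -> dict:
    """Map each state (the previous state_size words) to the words seen after it."""
    chain = defaultdict(list)
    for line in text.splitlines():  # one message per line, no punctuation needed
        words = line.split()
        if not words:
            continue  # skip blank lines
        padded = [BEGIN] * state_size + words + [END]
        for i in range(len(padded) - state_size):
            state = tuple(padded[i:i + state_size])
            chain[state].append(padded[i + state_size])
    return chain


def generate(chain: dict, state_size: int = 2, rng=random, max_words: int = 50) -> str:
    """Walk the chain from the start state until END (or max_words) is reached."""
    state = (BEGIN,) * state_size
    out = []
    while len(out) < max_words:
        word = rng.choice(chain[state])
        if word == END:
            break
        out.append(word)
        state = state[1:] + (word,)  # slide the window of previous words
    return " ".join(out)
```

With a larger state_size the output copies the source more closely; with a smaller one it is more random. For live scraping, the simplest version of "update chain, retrain" would be to rebuild the chain from the full text whenever new messages are appended.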
8/28/2021
- intent classification
- ex. greetings, identity
- scraping messages + replies
- after scraping, use markov chain to generate different answers for certain questions
- NLU (natural language understanding) datasets, machine learning datasets, NLP datasets
- movie script datasets
- python chatbot library
Goals list:
- Bare minimum - discord bot that generates aerin-like sentences
- how do we make it more realistic?
- "good to have" goal - discord bot that can respond to text
- classifies text (sentiment/mood and type of response), message it's responding to
- stretch goal - discord bot retrains itself live given a response
Steps?
1. make a markov generator + train it on aerin's texts (DONE)
2. classify intent of these texts
3. scrape data + classify intent of messages it could respond to (the messages aerin was responding to)
(unclear zone)
?. discord bot basics, coding, how to implement this code into discord, how to get inputs, how to scrape live messages
*** 8/31-9/2 (last three days): do the discord stuff
back to 8/28/21:
to do:
- overall: create a working chatbot!
- import chatterbot
- scrape data!
- use data to train bot
ERROR: couldn't download chatterbot on replit IDE
8/29/21:
- started formatting aerinconvo
- chatterbot training section: trains bot using input-response format
- take all conversations and format them as pairs: someone else's message, then aerin's response (x1000)
- couldn't install chatterbot on replit (takes eons, error: disk quota exceeded)
- using own IDE: VSCode
- worked! very fast
- *will use VSCode to run chatterbot, replit to update logs + preprocess data
- find way to upload everything to github
- might just move to VS
- how to automate data preprocessing?
- format into chatterbot conversation
- lines 8000 - 12000 (jenny - aerin convo)
- explored chatterbot
- trained it on test dialogue + returned response
- continuing to format dialogue
* how to retrain bot from scratch
** chatterbot doesn't seem to differentiate between speaker 1 and speaker 2
late 8/29/21:
- rewrote + optimized cleaner (fixed reactions, etc)
- successfully built a discord bot (on VSCode) that can generate markovified text on command (not aerin's, but nonetheless)!
- *not always online though; smth about keeping it online through heroku?? o_o
- now have basic knowledge to implement actual bot
- retrain bot from scratch: chatbot.storage.drop()
next up:
- speaker differentiation: how to get chatterbot to only speak like aerin
- idea: train it on two line "call - response" exchanges (responses being aerin's) to avoid confusion
- getting chatterbot to work with discord
- data preprocessing??? how do we make this easier
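A sketch of how the two-line "call - response" idea could be automated, assuming the cleaned log can be read as (speaker, text) tuples. That input shape and the function names are assumptions; the real cleaner output may look different. Consecutive messages from the same person are merged first, since people often split one thought across several messages.

```python
from typing import List, Tuple

Message = Tuple[str, str]  # (speaker, text); assumed shape of the cleaned data


def merge_runs(messages: List[Message]) -> List[Message]:
    """Join consecutive messages from the same speaker into a single turn."""
    turns: List[Message] = []
    for speaker, text in messages:
        if turns and turns[-1][0] == speaker:
            turns[-1] = (speaker, turns[-1][1] + " " + text)
        else:
            turns.append((speaker, text))
    return turns


def call_response_pairs(messages: List[Message], target: str) -> List[Tuple[str, str]]:
    """Keep only exchanges where `target` is the one responding."""
    turns = merge_runs(messages)
    return [
        (prev_text, text)
        for (_, prev_text), (speaker, text) in zip(turns, turns[1:])
        if speaker == target
    ]
```

Each pair could then be handed to the trainer as its own two-item conversation, so the bot only ever sees aerin's lines as responses.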
8/30/21:
Henry:
- look into chatterbot's default datasets to see how it works / how to format it
- look into removing extraneous
Done:
- fixed + finalized cleaner (data preprocessor)
- moved to vscode completely, updated GitHub (ty yukati)
- Aerin-Jenny conversation cleaned up and sorted into conversation style (call-response style)
- wrote call-response training code
- ran chatterbot on cleaned snippet of aerin-jenny conversation
- created working discord bot
- sends markov generated messages
- tried to integrate chatterbot into discord bot
- it's learning??? from our responses??? (HOW)
- idea: train bot using reactions! use reactions as some kind of "score" for how realistic/funny it's being
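A very rough sketch of the reactions-as-score idea. Everything here is hypothetical: which emoji count as positive or negative, and the assumption that reactions arrive as an emoji-to-count mapping.

```python
# Hypothetical emoji sets; the real choice would depend on how the server uses reactions.
POSITIVE = {"👍", "😂"}
NEGATIVE = {"👎"}


def reaction_score(reactions: dict) -> int:
    """Sum positive reaction counts minus negative ones; unknown emoji are ignored."""
    score = 0
    for emoji, count in reactions.items():
        if emoji in POSITIVE:
            score += count
        elif emoji in NEGATIVE:
            score -= count
    return score
```

A bot message with a score above zero could be kept as a "good" response, and one below zero dropped.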
8/31/21:
- go through chatterbot's documentation
- opened training database file in sqlite to view it as a table
Notes:
- chatterbot corpus: another training format. also call/response, but each exchange is written as a nested YAML list ("- - call", then "  - response")
https://github.com/gunthercox/chatterbot-corpus/blob/master/chatterbot_corpus/data/english/
- there are multiple community-made corpora like "greetings", "humor", etc.
- how can we modify it so it doesn't learn from our messages?
- maybe make it so the inputted response is never added as a possible output? how
- how do we separate greetings, jokes, etc into a separate training thing?
- how can we "untrain" it?
- chatterbot --> storage --> sql_storage.py (https://github.com/gunthercox/ChatterBot/blob/master/chatterbot/storage/sql_storage.py)
- several useful functions including "remove"
- does this "untrain" the bot?
- is it possible to limit who the bot learns from (aerin only)?
- chatterbot --> examples (https://github.com/gunthercox/ChatterBot/blob/master/examples/)
- specific response example?
- *learning feedback example (!) --> use reactions to score?
- default response example --> maximum_similarity_threshold?
- logic adapters
Henry:
- implement markov chain by using own training class
- more data / look into intent classification
- intent classification --> dividing inputs into categories
- make own trainer class
- limit learning after training!
- make own model from chatterbot --> pickle library
Did:
- figured out how to stop chatterbot from learning post-training!
- started writing new cleaner
- update: finished writing new cleaner2.py!
- ran aerinjennconvo1.txt through chatterbot.
To do:
- explore chatterbot corpus function
- write new training class w/ markov chains
- what is going on with the discord bot? sending the same message multiple times?
- just a bug
9/1/21:
- set up greetings + goodbyes corpora
To do:
- train chatterbot on corpora (Done)
- write new training class w/ markov chains (?)
- how?
- other idea: have chatterbot generate a response if it isn't confident about an answer
- generates from aerin1.txt
- look into intent classification
- look into the python pickle library
- look into a feedback system
idea:
- use tagged datasets for positive/negative reactions
- tags = positive, negative
extra:
- custom goodbyes / using people's names
- instead of writing cleaners, write a new training class
- make bot always online (heroku)
- how do we use all of aerin's data?
- use markov chain on data to expand database before feeding
- manipulate training class to also generate response
- generate_export_data
- find output function
- split data into
- chatterbot minimize cost function
- want cost to be near 0
- when passing generated text through markov chain
- tried a no chatterbot - markov based model:
- implemented RAKE to prioritize words
- Markov chain generated using specific words
- chatterbot seems to be the better alternative just for functionality (though we could use markov elsewhere)
- get_response: use markov chain somehow (known responses have priority)
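One way the "known responses have priority" idea might look, as a sketch under assumptions. `known_response` stands in for whatever returns a stored answer plus a confidence value, and `generate` stands in for the Markov generator; both are hypothetical callables, and the 0.5 threshold is arbitrary.

```python
from typing import Callable, Optional, Tuple


def respond(
    message: str,
    known_response: Callable[[str], Tuple[Optional[str], float]],
    generate: Callable[[], str],
    threshold: float = 0.5,
) -> str:
    """Use the known response if it is confident enough, otherwise fall back to generated text."""
    text, confidence = known_response(message)
    if text is not None and confidence >= threshold:
        return text
    return generate()
```

Keeping the two sources behind plain callables means the same function works whether the fallback text comes from markovify or from a hand-rolled chain.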
Things to work on:
- Markov chain will not generate with short words - maybe expand the scope of the markov chain
- Clean, format, and use aerin's responses from all of spis just-hanging-out to train Chatterbot
- For most data: possibly make a general spis-bot (need consent)
9/2/21:
- implemented corpusmaker, which can easily generate corpora to train chatterbot on
- useful for common prompts with set answers
- ex. greetings
- takes simple user command and creates + formats corresponding yaml file, trains chatterbot on it
- useful to train chatterbot multiple times on corpora
- created general spis-bot, which can return responses from chatterbot and markovify
- aerinbot can also return markov chain responses
- to respond with markov chain, just @ the bot
- wrote cleaner4.py, which does the same thing as cleaner3, except that it writes a new file
- moved text files to own folders
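A minimal sketch of what the file-writing half of a corpusmaker could look like. This is not the actual corpusmaker code. The function names are made up, and the layout (a "categories" list plus nested "conversations" lists) is assumed from the chatterbot-corpus files linked earlier. Strings are quoted with json.dumps so punctuation like ":" does not break the YAML.

```python
import json
from pathlib import Path
from typing import Iterable, Tuple


def format_corpus(category: str, pairs: Iterable[Tuple[str, str]]) -> str:
    """Render call/response pairs as corpus-style YAML text."""
    quote = lambda s: json.dumps(s, ensure_ascii=False)  # double-quoted scalar
    lines = ["categories:", f"- {quote(category)}", "conversations:"]
    for call, response in pairs:
        lines.append(f"- - {quote(call)}")
        lines.append(f"  - {quote(response)}")
    return "\n".join(lines) + "\n"


def write_corpus(path: str, category: str, pairs: Iterable[Tuple[str, str]]) -> Path:
    """Write the formatted corpus to `path` and return it as a Path."""
    out = Path(path)
    out.write_text(format_corpus(category, pairs), encoding="utf-8")
    return out
```

For example, `write_corpus("greetings.yml", "greetings", [("hello", "hi :3")])` would produce a small file that a corpus-style trainer could then be pointed at.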