todo
errors:
results.py error tab: org listed for each url
progress was not being saved. prob cuz sync func was awaited. err parser had no file +
progress lower than total at end
rp not found: https://www.madisoncounty.ny.gov/robots.txt
autobl always uses today's date. recurring final errors are not being skipped? + proceed() was not being respected in init_queue()
call bot_excluder.write_file after the scraper, or immediately after rp gets updated?
child frame error never has errex
prevent url from being put back in queue due to rate limit and immediately retrieved (loop) +
detect low q size +
just wait + (sketch below)
detect which url was recently put back into q -
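a minimal sketch of the "just wait" fix, assuming a hypothetical next_ok dict mapping domain to the earliest allowed request time:

    import asyncio, time
    from urllib.parse import urlparse

    async def take_url(queue, next_ok):
        url = await queue.get()
        domain = urlparse(url).netloc
        wait = next_ok.get(domain, 0) - time.monotonic()
        if wait > 0:
            await asyncio.sleep(wait)  # wait in place instead of re-queueing
        return url  # caller makes the request, then updates next_ok[domain]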
todo:
black formatter
progress always visible +
with bar! +
use previous total count as estimate? -
concurrent robots fetch. after queue.get, not before queue.put? - it will block anywhere cuz it's not async?
why does init_q put directly into q instead of using add_to_q?
clean up url, url_dup, domain, and dup_domain (with/without scheme and www). should be small funcs +
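e.g., a sketch of the small funcs (names hypothetical):

    from urllib.parse import urlparse

    def strip_www(netloc):
        # 'www.example.com' -> 'example.com'
        return netloc[4:] if netloc.startswith('www.') else netloc

    def domain_of(url):
        # full url -> bare lowercase domain, no scheme, no www
        return strip_www(urlparse(url).netloc.lower())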
respect 429 too many reqs and rp rate limit
if 429 then what? double crawl_delay? (sketch below)
put back in queue
how handle if all urls in q must wait?
separate func for rate limiting? before req, not before add to q +
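one possible 429 path (a sketch; the crawl_delay dict and queue names are assumptions):

    async def check_429(resp, url, domain, crawl_delay, queue):
        if resp.status == 429:
            # double the domain's delay (capped) and retry the url later
            crawl_delay[domain] = min(crawl_delay.get(domain, 1) * 2, 300)
            await queue.put(url)
            return True  # caller skips parsing this response
        return False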
reduce log size +
check mem unn +
remove cml
recover q loses all working urls. keep track of current urls being requested and combine with q prior to writing to file?
rewrite errorlog without nested lists
class inst to be more descriptive?
init with empty lists?
put all minor errors into new jj error
cml_text = '{\n' - what?
index +
globals -
keyword_arr
currentTab
html validator: https://beautifytools.com/javascript-validator.php +
js camelcase naming convention
1. Always declare variables
2. Always use const if the value should not be changed
3. Always use const if the type should not be changed (Arrays and Objects)
4. Only use let if you can't use const
5. Only use var if you MUST support old browsers. var has function scope; let is block-scoped
refactor
err parser +
index +
error funcs
can be in scrap or own class. child class?
back_in_q_b as sep func?
pw and aio sessions in their req class +
rename working_o to scrap +
results.py +
results.py to do:
sort results by jbw_conf or percent similarity? can remove jbw conf from result files
dict or class instead of nested res_list
remove browser in each results file
reconsider skipping geodesic with high max_dist
optimise: two separate sections. one with geolimter function
remove zip_form? display purposes only. keep
why send coords from form? prevents double lookup
show jj_error num in tooltip on error tabs?
reword error pages to discourage refreshing?
sort errors by alpha?
percent decode urls?
wraparound text for mobile?
remove
remove locks + operations are atomic, race condition not possible
scrap.browser: only used for cml and file contents. should be requester not browser
multiorg: results.py doesn't show duplicate urls
CML
bash_ping, restart_nic
logging module
new id is assigned when using static reqs
handle timeouts uniformly +
asyncio.TimeoutError not possible without wait_for?
check async timeout again
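re the wait_for question: aiohttp client timeouts also surface as asyncio.TimeoutError, so it can occur without wait_for. wrapping both requesters gives one uniform path (sketch; fetch is whichever requester is in use):

    import asyncio

    async def fetch_with_timeout(fetch, url, seconds=30):
        try:
            return await asyncio.wait_for(fetch(url), timeout=seconds)
        except asyncio.TimeoutError:
            return None  # single timeout path for both pw and aio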
child frames
respect robots.txt
does the rp file have multiple dates? if so, how to determine expiration?
when are new entries created?
the set of scraped domains never changes, therefore only dates change? or are new entries made as the scraper runs?
store rp files or make async
after success req: update timestamp and domain count +
after any req? not just success?
call proceed_f on initial urls +
default useragent
use 2 useragents to see difference?
pass rp into working_o?
implement domain rate limiter +
include timestamp of last req for that domain +
dict of domains: obj as value? + (sketch below)
save and recover +
global default rate limiter also
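a sketch of the limiter shape (time.time() so the dict survives save and recover; per-domain crawl delays would come from robotparser's crawl_delay()):

    import time

    class DomainLimiter:
        def __init__(self, default_delay=2.0):
            self.last = {}  # domain -> timestamp of last request
            self.default_delay = default_delay

        def proceed(self, domain, crawl_delay=None):
            # True if enough time has passed since the last request to domain
            delay = crawl_delay or self.default_delay
            if time.time() - self.last.get(domain, 0) >= delay:
                self.last[domain] = time.time()
                return True
            return False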
auto blacklist
static bl is unn
blacklist is dup urls +
must be exact match, not domain. ex: 5il.co != 5il.co/page3
domain-wide jbw conf. Once high conf is established, ignore low conf links for that domain
auto update dbs urls by sorting jbw conf from scraper
this won't get new orgs, only new em urls
cleanup before release:
whitespace
unn comments
pagination class never works. test for 'next' or '>' links +
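a possible test (BeautifulSoup; treating 'next' and '>' anchor text as pagination):

    def next_page_href(soup):
        # first anchor whose visible text looks like a pagination link
        for a in soup.find_all('a'):
            if a.get_text(strip=True).lower() in ('next', '>', '>>'):
                return a.get('href')
        return None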
set playwright user agent?
use asyncio.Event() instead of blocking: all_done_d,l pw_pause, asyncio.sleep
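sketch of the Event pattern (reusing the pw_pause name; set means running):

    import asyncio

    pw_pause = asyncio.Event()
    pw_pause.set()  # set = not paused

    async def worker():
        await pw_pause.wait()  # parks here without polling while cleared
        # ...do work...

    # controller: pw_pause.clear() to pause workers, pw_pause.set() to resume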
net::ERR_NAME_NOT_RESOLVED should also be 404
include error result in working_o?
need checked lock? - dup ensures a url is processed only once, therefore one task at a time
check for malformed urls? e.g.:
    r = urlparse('http:/joesjorbs.com')  # single slash: netloc comes back empty
    if r.scheme and r.netloc: ...        # proceed only when both are present
redundant error 7
final or try with next reqer?
discard head elem from html. page.inner_html('body')
resp.text vs page.inner_html vs page.text_content vs page.content() etc
add redirect history to checked pages. can be done with aiohttp, can't find for pw
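on the aiohttp side, resp.history already holds the chain (sketch, assuming an existing session):

    async def fetch_with_history(session, url):
        async with session.get(url) as resp:
            redirects = [str(r.url) for r in resp.history]  # earliest first
            return str(resp.url), redirects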
parent url may not be in errorlog on first error - remove code
update portals
allow multiple em urls for each org in db?
use urls from only the same org. ex: don't use a county url for a town. diff orgs
error on any em url in list would call for fallback. implications?
city oswego, orleans, st lawco
properly track skipped pages?
put skipped pages into cml? might give dups
"application" as bunkword?
use empty list placeholder for jj_final_error and fallback_success in errorlog?
errorlog as json?
sort results to either regular dir or empty vis text dir (for debugging)
fallback to domain after homeurl fallback?
mark all nonlogged errors with underscore or remove try block
use both a domain limiter and a limiter based on full url (except query)?
create unique codes for all skips. print and mark in cml
improve bunkwords: mark all skips in outcome. don't use list comprehension? print the offending bunkword and context
winter update project:
which scraper to use?
use double quotes. watch out for replacing possessive apostrophes
update em urls and home urls
search for new orgs
verify coords?
document which orgs use a centralized service and exclude or include them from jj search. ie: applitrack/caboces, applitrack/penfield, etc
dups in db. probably causes the dups found in cml? solved with multi org d?
index.html to do:
dup zip codes
fix indents
obfuscate -
improve code comments
zip_dict one entry per line?
hide modal after back button without refresh - difficult
show progress on modal - difficult
new modal over old for progress?
create favicon
to do later:
search PDFs from webpages
only firefox can detect pdf cleanly
run scraper as cron job
put all errors in add_errorurls_f. eg: __error. or use jj error 9 catch all for __errors?
jbws back to count but limit to x occurrences?
decompose nav tags?
content of script tags is not decomposing because Splash evaluates scripts, so there is no script tag header or footer for BS to read: https://recruiting.ultipro.com/BRY1002BSC/JobBoard/6b838b9a-cd2b-436a-903b-0de7b6e17b4f/?q=&o=postedDateDesc
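if decomposing is the route (covers the nav question above too), a sketch:

    from bs4 import BeautifulSoup

    def clean_soup(html):
        soup = BeautifulSoup(html, 'html.parser')
        for tag in soup(['script', 'style', 'nav']):
            tag.decompose()  # drop script/style/nav before text extraction
        return soup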
max crawl depth 3? either high or low conf jbws in tags? -
remove non printable characters from result text? -
weighted jbws
upgrade server to 22.04
charter schools http://www.p12.nysed.gov/psc/csdirectory/CSLaunchPage.html
false positives: include keyword and date
https://www.herkimer.edu/about/employment/
https://hr.cornell.edu/jobs librarian 1/20
https://www.newvisions.org/pages/media-centers-for-the-21st-century
https://www.tbafcs.org/Page/1444 nurse 2/20 dropdown
All fallback types: static fb, portal to homepage fb, include_old fb
Concerns:
Dup checker:
remove after ampersand in query?
remove fragments and trailing slash. yes
case sensitivity. yes
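a sketch of the dup key (fragment and trailing slash dropped, scheme/netloc lowercased; the ampersand question is left open):

    from urllib.parse import urlsplit, urlunsplit

    def dup_key(url):
        s = urlsplit(url)
        return urlunsplit((s.scheme.lower(), s.netloc.lower(),
                           s.path.rstrip('/'), s.query, ''))  # '' drops the fragment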
High conf: exclude good low conf links
https://www.cityofnewburgh-ny.gov/civil-service = upcoming exams
http://www.albanycounty.com/Government/Departments/DepartmentofCivilService.aspx = exam announcement
have separate high conf jbw lists?
accept links with only high conf job words?
Bunkwords: search entire element or just contents?
must search url to exclude .pdf, etc
Decompose: drop down menus?
dont decompose menus for anchor tag search +
No space between elements' content in results
caused by converting from soup to soup.text
eg: Corporation Counsel</option><option>Downtown Parking Improvement
produces this: developmentcorporation counseldowntown parking improvement planengineeringethics
this shouldn't matter because a keyword probably won't span across multiple elements
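note: get_text takes a separator, which sidesteps the fused-words artifact (assuming the soup object above):

    text = soup.get_text(' ', strip=True)  # 'Corporation Counsel Downtown Parking ...'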
use urllib or manual replace to percent encode urls?
url_path = workingurl.replace('/', '%2F') # or
url_path = parse.quote(workingurl, safe=':')