CeWL - Custom Word List generator
=================================
Copyright(c) 2018, Robin Wood <[email protected]>
Based on a discussion on PaulDotCom (episode 129) about creating custom word lists
by spidering a target's website and collecting unique words, I decided to write
CeWL, the Custom Word List generator. CeWL is a Ruby app which spiders a
given URL to a specified depth, optionally following external links, and
returns a list of words which can then be used with password crackers such
as John the Ripper.
By default, CeWL sticks to just the site you have specified and will go to a
depth of 2 links; this behaviour can be changed by passing arguments. Be
careful if setting a large depth and allowing it to go offsite, as you could end
up drifting on to a lot of other domains. All words of three characters and
over are output to stdout. This minimum length can be increased, and the words
can be written to a file rather than the screen so the app can be automated.
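As a rough illustration of the "stay on the same site" default, the decision
of whether to follow a link can be sketched in Ruby as below. The method name
follow_link? and its offsite: keyword are hypothetical and are not taken from
CeWL's source; the sketch only compares host names.

```ruby
require 'uri'

# Hypothetical helper, not CeWL's actual code.
# Returns true when a link should be followed from the starting URL.
def follow_link?(start_url, link, offsite: false)
  # When offsite spidering is allowed, every link is a candidate.
  return true if offsite

  # Resolve relative links against the start URL, then compare hosts.
  URI.join(start_url, link).host == URI(start_url).host
end

puts follow_link?('http://example.com/', '/about')
puts follow_link?('http://example.com/', 'http://other.example.org/')
puts follow_link?('http://example.com/', 'http://other.example.org/', offsite: true)
```

A real spider would also need to track the current depth for each link; that
is left out here to keep the sketch short.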
CeWL also has an associated command line app, FAB (Files Already Bagged),
which uses the same meta data extraction techniques to create author/creator
lists from already downloaded files.
Homepage: https://digi.ninja/projects/cewl.php
GitHub: https://github.com/digininja/CeWL
Change Log
==========
Version 5.4.3
-------------
Added the --with-number parameter to make words include letters and numbers
Version 5.4.2
-------------
Merged an update to change the way usage instructions are shown.
Updated instructions on installing gems.
Updated README.
Version 5.4.1
-------------
A line to add a / to the end of the URL had been commented out. I don't
remember why it was done but I'm putting it back in. See issue
https://github.com/digininja/CeWL/issues/26
Version 5.4
-----------
Steven van der Baan added the ability to hit ctrl-c and keep the results
so far.
Version 5.3.1
-------------
Added the ability to handle non-standard port numbers.
Added lots more debugging and a new --debug parameter.
Version 5.3
-----------
Added the command line argument --header (-H) to allow headers to be passed in.
Parameters are specified in name:value pairs and you can pass multiple.
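To make the name:value format concrete, here is a minimal Ruby sketch of how
such pairs could be turned into a hash of headers. The helper parse_headers is
hypothetical and is not CeWL's own parsing code.

```ruby
# Hypothetical helper, not taken from CeWL's source.
# Turns "name:value" strings into a hash of header names to values.
def parse_headers(pairs)
  pairs.each_with_object({}) do |pair, headers|
    # Split on the first colon only, so values may themselves contain colons.
    name, value = pair.split(':', 2)
    if value.nil? || name.strip.empty?
      raise ArgumentError, "Header must be name:value, got #{pair.inspect}"
    end

    headers[name.strip] = value.strip
  end
end

puts parse_headers(['X-Test: 1', 'Referer: http://example.com/']).inspect
```

Splitting on the first colon only matters for values such as URLs, which
contain colons of their own.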
Version 5.2
-----------
Loads of changes including:
* Code refactoring by @g0tmi1k
* Internationalisation - should now handle non-ASCII sites much better
* Found more ways to pull words out of JavaScript content and other areas
that aren't normal HTML
* Lots of little bug fixes
Version 5.1
-----------
Added the GPL-3+ licence to allow inclusion in Debian.
Added a Gemfile to make installing gems easier.
Version 5.0
-----------
Adds proxy support from the command line and the ability to pass in
credentials for both basic and digest authentication.
A few other smaller bug fixes as well.
Version 4.3
-----------
CeWL now sorts the words found by count and optionally (new --count argument)
includes the word count in the output. I've left the words in the case
they are in on the pages, so "Product" is different to "product". I figure that
if the list is being used for password generation then the case may be
significant, so the user can strip it if they want to. There are also more
improvements to the stability of the spider in this release.
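The counting behaviour described here can be sketched in a few lines of Ruby.
This is illustrative only: count_words is a hypothetical helper, and the rule
used to split text into words (runs of letters) is an assumption rather than
CeWL's actual tokeniser.

```ruby
# Hypothetical helper; CeWL's own word-splitting rules may differ.
def count_words(text, min_length = 3)
  counts = Hash.new(0)
  # Case is preserved, so "Product" and "product" are counted separately.
  text.scan(/[[:alpha:]]+/).each do |word|
    counts[word] += 1 if word.length >= min_length
  end
  # Highest count first; ties are broken by the word itself for stable output.
  counts.sort_by { |word, count| [-count, word] }
end

count_words('Product product product is on offer').each do |word, count|
  puts "#{word}, #{count}"
end
```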
Version 4.2
-----------
Fixes a pretty major bug that I found while fixing a smaller bug for @yorikv.
The bug was related to a hack I had to put in place because of a problem I was
having with the spider. While I was looking into it I spotted this line, which
is the one that the spider uses to find new links in downloaded pages:
    web_page.scan(/href="(.*?)"/i).flatten.map do |link|
This is fine if all the links look like this:
    <a href="test.php">link</a>
But if the link looks like either of these:
    <a href='test.php'>link</a>
    <a href=test.php>link</a>
the regex will fail to match, so the links will be ignored.
To fix this I've had to override the function that parses the page to find
all the links. Rather than use a regex, I've changed it to use Nokogiri, which
is designed to parse a page looking for links rather than just running through
it with a custom regex. This brings in a new dependency but I think it is worth
it for the fix to the functionality. I also found another bug where a link like
this:
    <a href='#name'>local</a>
which should be ignored as it just links to an internal name, was actually being
translated to '/#name', which may unintentionally mean referencing the index
page. I've fixed this one as well after a lot of debugging to find how best to
do it.
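The quoting problem can be reproduced with plain Ruby by running the regex
quoted in this section against the three link styles. The helpers old_links
and fragment_only? are hypothetical names used only for this demonstration;
the sketch shows the failure and the fragment check, not the Nokogiri-based
fix itself.

```ruby
# The pattern quoted in the text, which only matches double-quoted hrefs.
OLD_LINK_PATTERN = /href="(.*?)"/i

# Hypothetical helper: returns the links the old regex would have found.
def old_links(html)
  html.scan(OLD_LINK_PATTERN).flatten
end

# Hypothetical helper: true for links that only point at a name in the page.
def fragment_only?(href)
  href.start_with?('#')
end

pages = [
  '<a href="test.php">link</a>',
  "<a href='test.php'>link</a>",
  '<a href=test.php>link</a>'
]
# Only the first, double-quoted link is found by the old pattern.
pages.each { |html| puts "#{html} => #{old_links(html).inspect}" }

puts fragment_only?('#name')
```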
A final addition is to allow a user to specify a depth of 0 which allows CeWL
to spider a single page.
I'm only putting this out as a point release as I'd like to rewrite the
spidering to use a better spider; that will come out as the next major release.
Version 4.0/4.1
---------------
The main change in version 4.0/4.1 is the upgrade to run with Ruby 1.9.x. This
has been tested on various machines and on BT5, as that is a popular platform
for running it, and it appears to run fine. Another minor change is that, up to
version 4, all HTML tags were stripped out before the page was parsed for words,
which meant that text in alt and title attributes was missed. I now grab the
text from those attributes before stripping the HTML to give those extra few
words.
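The order of operations matters here: the attribute text has to be collected
before the tags are removed. A minimal Ruby sketch of that idea follows. The
helper name text_with_attributes and both regexes are assumptions made for
illustration, not CeWL's actual implementation.

```ruby
# Hypothetical helper; not CeWL's actual implementation.
def text_with_attributes(html)
  # Collect alt="..." and title='...' values while the tags still exist.
  attribute_text = html.scan(/\b(?:alt|title)\s*=\s*(["'])(.*?)\1/im)
                       .map { |_quote, value| value }
  # Strip every tag, leaving only the text between them.
  stripped = html.gsub(/<[^>]*>/, ' ')
  # Join everything and collapse runs of whitespace.
  (attribute_text + [stripped]).join(' ').split.join(' ')
end

puts text_with_attributes('<p>Hello <img src="a.png" alt="Company logo"> world</p>')
```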
Version 3
---------
Addresses a problem spotted by Josh Wright. The Spider gem doesn't handle
JavaScript redirection URLs, for example an index page containing just the
following:
    <script language="JavaScript">
    self.location.href =
    'http://www.FOO.com/FOO/connect/FOONet/Top+Navigator/Home';
    </script>
wasn't spidered because the redirect wasn't picked up. I now scan through a
page looking for any lines containing location.href= and then add the given
URL to the list of pages to spider.
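A minimal Ruby sketch of this kind of scan is shown below. The helper
javascript_redirects and its regex are assumptions for illustration. Unlike a
strictly line-by-line check, this pattern lets whitespace (including a line
break) sit between the "=" and the quoted URL, as in the example page.

```ruby
# Hypothetical helper; CeWL's own matching may differ.
def javascript_redirects(page)
  # \s* also matches a newline, so the URL may be on the following line.
  page.scan(/location\.href\s*=\s*["'](.*?)["']/i).flatten
end

page = <<~HTML
  <script language="JavaScript">
  self.location.href =
  'http://www.FOO.com/FOO/connect/FOONet/Top+Navigator/Home';
  </script>
HTML

puts javascript_redirects(page).inspect
```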
Version 2
---------
Version 2 of CeWL can also create two new lists, a list of email addresses
found in mailto links and a list of author/creator names collected from meta
data found in documents on the site. It can currently process documents in
Office pre 2007, Office 2007 and PDF formats. This user data can then be used
to create the list of usernames to be used in association with the password
list.
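As an illustration of the email list, the addresses in mailto links can be
pulled out with a short Ruby sketch like the one below. The helper
mailto_addresses and its regex are hypothetical; the addresses use the
reserved example.com domain.

```ruby
# Hypothetical helper; not CeWL's actual implementation.
def mailto_addresses(html)
  # Capture the address after "mailto:", stopping at a quote, a "?" that
  # starts parameters such as a subject, a ">" or whitespace.
  html.scan(/href\s*=\s*["']?mailto:([^"'?>\s]+)/i).flatten.uniq
end

html = '<a href="mailto:alice@example.com">Alice</a> ' \
       "<a href='mailto:bob@example.com?subject=Hi'>Bob</a> " \
       '<a href="mailto:alice@example.com">Alice again</a>'

puts mailto_addresses(html).inspect
```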
Pronunciation
=============
Seeing as I was asked, CeWL is pronounced "cool".
Installation
============
CeWL needs the rubygems package to be installed along with the following gems:
* mime-types
* mini_exiftool
* rubyzip
* spider
All these gems were available by running "gem install xxx" as root. The
mini_exiftool gem also requires the exiftool application to be installed.
Then just save CeWL to a directory and make it executable.
The project page on my site gives some tips on solving common problems people
have encountered while running CeWL - https://digi.ninja/projects/cewl.php
Usage
=====
CeWL 5.4.2 (Break Out) Robin Wood ([email protected]) (https://digi.ninja/)
Usage: cewl [OPTIONS] ... <url>
    OPTIONS:
        -h, --help: Show help.
        -k, --keep: Keep the downloaded file.
        -d <x>,--depth <x>: Depth to spider to, default 2.
        -m, --min_word_length: Minimum word length, default 3.
        -o, --offsite: Let the spider visit other sites.
        -w, --write: Write the output to the file.
        -u, --ua <agent>: User agent to send.
        -n, --no-words: Don't output the wordlist.
        -a, --meta: include meta data.
        --meta_file file: Output file for meta data.
        -e, --email: Include email addresses.
        --email_file <file>: Output file for email addresses.
        --meta-temp-dir <dir>: The temporary directory used by exiftool when parsing files, default /tmp.
        -c, --count: Show the count for each word found.
        -v, --verbose: Verbose.
        --debug: Extra debug information.

        Authentication
            --auth_type: Digest or basic.
            --auth_user: Authentication username.
            --auth_pass: Authentication password.

        Proxy Support
            --proxy_host: Proxy host.
            --proxy_port: Proxy port, default 8080.
            --proxy_username: Username for proxy, if required.
            --proxy_password: Password for proxy, if required.

        Headers
            --header, -H: In format name:value - can pass multiple.

    <url>: The site to spider.
Ruby Doc
========
CeWL is commented up in Ruby Doc format.
Licence
=======
This project is released under the Creative Commons Attribution-Share Alike 2.0
UK: England & Wales licence
( http://creativecommons.org/licenses/by-sa/2.0/uk/ )
Alternatively, you can use GPL-3+ instead of the original licence.
( http://opensource.org/licenses/GPL-3.0 )