Adding feeds to Malcom
Malcom is an intelligence-driven tool. This does not mean it's smart; it only means that it will do its best to organize and connect different pieces of information so that you (who are smart) can draw actionable intelligence out of unstructured information.
This means that Malcom needs information. Lots of information. Luckily, Malcom has feeds. Feeds are Malcom's information sources; they can be websites, RSS feeds, your HTTP access logs, syslogs, or POP3 email! Malcom started out with a "test feed", designed to import data from ZeusTracker's binaries feed. More feeds will be added as Malcom - and its userbase - grow.
The main purpose of feeds is to make Malcom customizable to suit your needs and munch the information you are interested in.
Feeds should follow this generic process:
- Inherit from the `Feed` class
- Fetch the content of the feed (RSS feed, webpage, system file, you name it). This is done with the `update` function.
- Parse each data entry and create elements that Malcom can understand. These elements are specified in `Malcom.datatypes.element`. For now, the elements available are `Url`, `Hostname`, `Ip`, and `As`.
- Save these elements to the DB and connect them
All you need for this is to write three methods:
- `__init__` - This will initialize your class and set some mandatory attributes;
- `update` - This will fetch data from a URL, a file, etc. and loop through every data entry, feeding it to the `analyze` method;
- `analyze` - This will take the data entry from the `update` method and parse it. Here is where you create your elements, store them in the database, and link them together.
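To make this concrete, here is a minimal sketch of a hypothetical feed. The class name, source URL, and one-hostname-per-line format are made up for illustration, and the import paths are taken from the modules mentioned on this page (they may differ depending on your Malcom checkout):

```python
from feeds.feed import Feed                  # the Feed base class discussed below
from Malcom.model.datatypes import Hostname  # element classes; path may vary with your Malcom version


class HostnameBlocklist(Feed):
    """Hypothetical feed: a plain-text blocklist with one hostname per line."""

    def __init__(self, name):
        super(HostnameBlocklist, self).__init__(name)
        self.name = "HostnameBlocklist"
        self.source = "http://example.com/blocklist.txt"  # placeholder URL
        self.description = "Example blocklist of malicious hostnames, one per line."

    def update(self):
        # Fetch self.source and pass every line to analyze()
        self.update_lines()

    def analyze(self, line):
        # Create a raw element and attach the malicious context to it
        h = Hostname(hostname=line.strip())
        h.add_evil({'description': 'Hostname listed on the example blocklist'})  # 'source' is filled in by update_lines()
        self.commit_to_db(h)
```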
All this is done by writing a class that inherits from `Feed` and saving it in the `feeds` folder. You may create any datatype, and update it any way you want. For an element to be valid in Malcom's schema, it must have the following fields:
- `value`: This is a unique representation of the element you decide to add.
- `type`: This is the type of the element (e.g. an `Ip` element has a `type` set to `'ip'`). This is set automatically when you create new elements using their classes (e.g. `Ip(ip='127.0.0.1')`).
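For instance, creating elements through their classes fills in both fields (a quick sketch; the import path is the `Malcom.model.datatypes` module referenced further down this page):

```python
from Malcom.model.datatypes import Hostname, Ip

ip = Ip(ip='127.0.0.1')                  # value: '127.0.0.1', type: 'ip'
host = Hostname(hostname='example.com')  # value: 'example.com', type: 'hostname'
```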
To build a feed, you have to create a class that inherits from the `Feed` class in `feeds.feed`. The `Feed` class has all the attributes Malcom needs to understand how to run a feed and get statistics out of it.
You should override the following attributes in the feeds you create. This should be done in the `__init__` function of your feed.
- `name` - The name of your feed. Setting this to the class name is recommended.
- `source` - The source (URL or filename) of your feed.
- `description` - The source's description. This should have as much detail as possible, as it will later allow you to know where a given element is coming from.
Most importantly, there's a virtual method called `update` that you will have to implement for the feed to work.

The `update` method is used to fetch data from a resource (it can be a URL, a file on your disk, a web service, whatever). The input format doesn't really matter, as long as you respect this one condition: you have to pass each data entry (a data entry can be an XML node, a JSON element, a CSV line, etc.) to the `analyze` function.
The `Feed` class has two helper functions: `update_xml` and `update_lines`. Both functions take the `source` attribute as the source for their data, and will call `analyze` automatically on each data entry.
- `update_xml(parent_node, list_of_child_nodes)` - This will try to parse the source as an XML tree. It will look for the `parent_node` node and build a dictionary with entries corresponding to the `list_of_child_nodes` list.
- `update_lines()` - This will parse the source as a one-data-entry-per-line source, and feed each line to the `analyze` function automatically.
Using these helper functions is not mandatory; you can always use `urllib2` or `requests` to achieve the same effect - as long as you pass every data entry to the `analyze` function individually.
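For example, a hand-rolled `update` method on your feed class could look something like this (a sketch only, assuming a line-based source and using `requests`; the helpers would otherwise do this for you):

```python
def update(self):
    import requests  # an assumption; urllib2 would work just as well

    # Fetch the raw feed ourselves instead of relying on update_lines()
    response = requests.get(self.source)
    for line in response.text.splitlines():
        line = line.strip()
        if not line or line.startswith('#'):  # skip blank lines and comments
            continue
        self.analyze(line)  # every data entry still goes through analyze()
```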
The `analyze` method is where the data entry is translated into objects to be inserted in the Malcom database.

The `analyze` method expects a single data entry as input. From this data entry, it should be able to extract one or more elements to be added to the database. This part is really sensitive since it is here that you qualify information, effectively going from raw data, to information, to actionable intel.
The following guidelines describe how feeds in the Malcom trunk are created, and you should follow them too to keep a consistent database. As a rule of thumb, raw elements (`Ip`, `Url`, `Hostname`) should be created devoid of all malicious context. The malicious context will then be added to the element (thanks to the wonders of schema-less databases!) using the `<element_object>.add_evil` method. You should extract at least one element (`Ip`, `Url`, `Hostname`, or any other element type in `Malcom.model.datatypes`) from every data entry provided to the `analyze()` function.
As an example, if you have a data feed listing a C&C and a malware family on each line, such as this:
```
this-should-be-suspended.no-ip.org;Citadel-C&C;online
```
You should create a `Hostname` element representing `this-should-be-suspended.no-ip.org`:
```python
h = Hostname(hostname='this-should-be-suspended.no-ip.org')
```
Then, you should create a dictionary containing the intel you want to attach to this element:
```python
hostname, what, status = line.split(';')
family, infrastructure_type = what.split('-')
intel = {'family': family, 'infrastructure_type': infrastructure_type, 'status': status}
```
Before calling `add_evil`, the `intel` dictionary needs to have a few mandatory keys:
- `source`: The source where the intel comes from. This key is added automatically if you use `update_xml` or `update_lines` to feed data to the `analyze` function.
- `description`: A human-readable description of the intel. This is why the element was listed on the feed.
- `date_added`: Optional - can come in handy if the feed has a timestamp for its entries. If not specified, it will be set to `datetime.datetime.utcnow()`.
Lastly, an optional but very useful key:
- `id`: This basically defines when the intel will be overwritten. Think of it as the "primary key" of your intel. For example, if you set this to the domain name, then if on the next feed run the type switches from `Citadel-C&C` to `Dridex-C&C`, it will overwrite the previous record, and you will lose the previous intel. If you set `id` to `domain+type`, then it will store the last occurrence of `domain` being a Citadel C&C and the last occurrence of it being a Dridex C&C. It's up to you to choose when you want to overwrite and when you'd rather update. Since this situation is still pretty rare, `id` will probably be set to `source` by default in the future.
```python
intel['id'] = '{}{}'.format(hostname, infrastructure_type)  # You can also MD5 the whole thing to save storage
intel['description'] = 'Listed as a {} {}. Status: {}'.format(infrastructure_type, family, status)
intel['source'] = 'InternalFeed1'  # This would be set to the feed name by default if using one of the helper methods
h.add_evil(intel)  # Done!
```
It might seem redundant to store a `description` alongside the other keys, but a human-readable format is easier to understand than just key-value sets.
You insert elements into the DB using the `Feed.commit_to_db(element)` method. This method returns the element you just saved to the DB, so that you can connect it with other elements later on.
You would save the hostname element we created earlier with:

```python
h_from_db = self.commit_to_db(h)
```
You can also link elements together. For example, imagine the line you're analysing also has the IP address the domain resolved to at the time it was added. You can add it to the database and link it using the `Model.connect()` method:
```python
_, _, _, ip = "this-should-be-suspended.no-ip.org;Citadel-C&C;online;127.0.0.1".split(';')
ip_from_db = self.commit_to_db(Ip(ip=ip))  # wrap the raw string in an Ip element before saving it
self.model.connect(h_from_db, ip_from_db)
```
`connect` also takes an optional argument: `attribs`. This is the text that will show over the link when displayed in the graph.
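For example (a one-line sketch; it assumes `attribs` can be passed as a keyword argument, and the "A" label is an arbitrary choice to mark a DNS A record):

```python
self.model.connect(h_from_db, ip_from_db, attribs='A')  # the edge will be labeled "A" in the graph
```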
It's best to learn by example. Let's take a look at how the ZeusTrackerBinaries feed is built.
```python
class ZeusTrackerBinaries(Feed):

    def __init__(self, name):
        super(ZeusTrackerBinaries, self).__init__(name)
        self.name = "ZeusTrackerBinaries"
        self.source = "https://zeustracker.abuse.ch/monitor.php?urlfeed=binaries"
        self.description = "This feed shows the latest 50 ZeuS binary URLs."
```
The class is initialised by calling its `super` constructor method. Easy enough, right?
This is the `update` method, the one that the `Feed` class expects you to implement:
```python
    def update(self):
        for dict in self.update_xml('item', ["title", "link", "description", "guid"]):
            self.analyze(dict)
```
- The first thing we do is fetch the data. In this case, we're calling the `update_xml` method. This will automatically use the source defined in `__init__`: `https://zeustracker.abuse.ch/monitor.php?urlfeed=binaries`.
- `update_xml` will automatically parse the XML structure `<item><title></title><link></link><description></description><guid></guid></item>`. If your XML structure differs from this, then you'll have to parse the XML yourself.
- The returned dictionaries (`{'description': ..., 'title': ..., etc.}`) are sent to the `analyze` method.
Let's take a look at the `analyze` method:
```python
    def analyze(self, dict):
        evil = dict
        # extract the ZeuS binary URL from the entry's description
        url = Url(re.search("URL: (?P<url>\S+),", dict['description']).group('url'))
        # use the ZeusTracker entry id (found in the guid) as the intel's unique id
        evil['id'] = md5.new(re.search(r"id=(?P<id>[a-f0-9]+)", dict['guid']).group('id')).hexdigest()
        try:
            # the entry title contains the date the URL was added, e.g. "(2015-03-01)"
            date_string = re.search(r"\((?P<date>[0-9\-]+)\)", dict['title']).group('date')
            evil['date_added'] = datetime.datetime.strptime(date_string, "%Y-%m-%d")
        except AttributeError, e:
            print "Date not found!"
        url.add_evil(evil)
        self.commit_to_db(url)
```
And Bob's your uncle!