
Adding feeds to Malcom


Intro

Malcom is an intelligence-driven tool. This does not mean it's smart; it only means that it will do its best to organize and connect different pieces of information so that you (who are smart) can draw actionable intelligence out of unstructured information.

This means that Malcom needs information. Lots of information. Luckily, Malcom has feeds. Feeds are Malcom's information sources; they can be websites, RSS feeds, your HTTP access logs, syslogs, or POP3 email! Malcom started out with a "test feed", designed to import data from ZeusTracker's binaries feed. More feeds will be added as Malcom - and its userbase - grow.

The main purpose of feeds is to make Malcom customizable to suit your needs and munch the information you are interested in.

Generic workflow

Feeds should follow this generic process:

  1. Inherit from the Feed class
  2. Fetch the content of the feed (RSS feed, webpage, system file, you name it). This is done with the update function.
  3. Parse the data entry and create elements that Malcom can understand. These elements are specified in Malcom.datatypes.element. For now, the elements available are Url, Hostname, Ip, and As.
  4. Save these elements to DB and connect them

All you need for this is to write three methods:

  • __init__ - This will initialize your class and set some mandatory attributes;
  • update - This will fetch data from a URL, a file, etc. and loop through every data entry, feeding it to the analyze method;
  • analyze - This will take the data entry from the update method and parse it. Here is where you create your elements, store them in the database, and link them together.

All this is done by writing a class that inherits from Feed and saving it in the feeds folder. You may create any datatype, and update it any way you want. For an element to be valid in Malcom's schema, it must have the following fields:

  • value: This is a unique representation of the element you decide to add.
  • type: This is the type of the element (e.g. an Ip element has a type set to 'ip'). This is set automatically when you create new elements using their classes (e.g. Ip(ip='127.0.0.1'))
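
Putting all of this together, a bare feed skeleton could look something like the sketch below. The class name, source URL and description are purely illustrative, and the import path simply follows the feeds.feed convention mentioned later on this page; the ZeusTrackerBinaries example at the bottom shows the real thing.

from feeds.feed import Feed   # the Feed base class lives in feeds.feed


class MyHostnameFeed(Feed):

    def __init__(self, name):
        super(MyHostnameFeed, self).__init__(name)
        self.name = "MyHostnameFeed"                      # recommended: same as the class name
        self.source = "http://example.com/blocklist.txt"  # URL or filename to fetch
        self.description = "Hypothetical blocklist of malicious hostnames."

    def update(self):
        # Fetch self.source and pass every data entry to analyze()
        pass

    def analyze(self, entry):
        # Turn a single data entry into Malcom elements, add context, save and connect them
        pass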

Writing the __init__ method

To build a feed, you have to create a class that inherits from the Feed class in feeds.feed. The Feed class has all the attributes Malcom needs to understand how to run a feed and get statistics out of it.

You should override the following attributes in the feeds you create. This should be done in the __init__ method of your feed.

  • name - The name of your feed. Setting this to the class name is recommended.
  • source - The source, i.e. the URL or filename, of your feed
  • description - The source's description. This should have as much detail as possible, as it will later allow you to know where a given element is coming from.

Most importantly, there's a virtual method called update that you will have to implement for the feed to work.

Writing the update() method

The update method is used to fetch data from a resource (it can be a URL, a file on your disk, a web service, whatever). The input format doesn't really matter, as long as you respect this one condition: you have to pass each data entry (a data entry can be an XML node, a JSON element, a CSV line, etc.) to the analyze function.

Helper functions

The Feed class has two helper functions: update_xml and update_lines. Both functions take the source attribute as a source for their data, and will call analyze automatically on each data entry.

  • update_xml(parent_node, list_of_child_nodes) - This will try to parse the source as an XML tree. It will look for the parent_node node and build a dictionary with entries corresponding to the list_of_child_nodes list.
  • update_lines() - This will parse the source as a one-data-entry-per-line source, and feed each line to the analyze function automatically.

Using these helper functions is not mandatory; you can always use urllib2 or requests to achieve the same effect - as long as you pass every data entry to the analyze function individually.
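
For instance, a hand-rolled update method for a one-entry-per-line source could look roughly like this (a sketch using urllib2, as mentioned above; error handling is omitted for brevity):

import urllib2   # at the top of your feed module

def update(self):
    # Fetch the raw feed and hand every non-empty, non-comment line to analyze()
    feed = urllib2.urlopen(self.source).read()
    for line in feed.splitlines():
        line = line.strip()
        if line and not line.startswith('#'):
            self.analyze(line)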

Writing the analyze method

The analyze method is where the data entry is translated to objects to be inserted in the Malcom database.

The analyze method expects a single data entry as an input. From this data entry, it should be able to extract one or more elements to be added to the database. This part is really sensitive since it is here that you qualify information, effectively going from raw data, to information, to actionable intel.

How to keep a consistent database

The following guidelines describe how feeds in the Malcom trunk are created, and how you should create yours to keep the database consistent. As a rule of thumb, raw elements (Ip, Url, Hostname) should be created devoid of all malicious context. The malicious context will then be added to the element (thanks to the wonders of schema-less databases!) using the <element_object>.add_evil method. You should extract at least one element (Ip, Url, Hostname, or any other element type in Malcom.model.datatypes) from every data entry provided to the analyze function.

As an example, if you have a data feed listing a C&C and a malware family on each line, such as this:

this-should-be-suspended.no-ip.org;Citadel-C&C;online

You should create a Hostname element representing this-should-be-suspended.no-ip.org:

h = Hostname(hostname='this-should-be-suspended.no-ip.org')

Then, you should create a dictionary containing the intel you want to attach to this element:

hostname, what, status = line.split(';')
family, infrastructure_type = what.split('-')
intel = {'family': family, 'infrastructure_type': infrastructure_type, 'status': status}

Before calling add_evil, the intel dictionary needs to have a few mandatory keys:

  • source: The source where the intel comes from. This key is added automatically if you use update_xml or update_lines to feed data to the analyze function.
  • description: A human-readable description of the intel. This is why the element was listed on the feed.
  • date_added: Optional - can come in handy if the feed has a timestamp for its entries. If not specified, it will be set to datetime.datetime.utcnow().

Lastly, an optional but very useful key:

  • id: This basically defines when the intel will be overwritten. Think of it as the "primary key" of your intel. For example, if you set it to the domain name and, on the next feed run, the type switches from Citadel-C&C to Dridex-C&C, the new record will overwrite the old one and you will lose the previous intel. If you set id to domain+type instead, Malcom will keep the latest occurrence of the domain as a Citadel C&C and as a Dridex C&C separately. It's up to you to choose when you want to overwrite and when you'd rather update. Since this situation is still pretty rare, id will probably be set to source by default in the future.

Continuing the example above:
intel['id'] = '{}{}'.format(hostname, infrastructure_type)  # You can also MD5 the whole thing to save storage
intel['description'] = 'Listed as a {} {}. Status: {}'.format(family, infrastructure_type, status)
intel['source'] = 'InternalFeed1'  # This would be set to the feed name by default if using one of the helper methods
h.add_evil(intel)  # Done!

It might seem redundant to store a description alongside the other keys, but a human-readable format is easier to understand than a bare set of key-value pairs.
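
Put together, an analyze method for this hypothetical feed could look roughly like the following sketch. Hostname is assumed to be imported from Malcom's datatypes module, and commit_to_db, used at the end, is described in the next section.

def analyze(self, line):
    # line looks like: this-should-be-suspended.no-ip.org;Citadel-C&C;online
    hostname, what, status = line.split(';')
    family, infrastructure_type = what.split('-')

    h = Hostname(hostname=hostname)    # raw element, no malicious context yet

    intel = {
        'id': '{}{}'.format(hostname, infrastructure_type),
        'description': 'Listed as a {} {}. Status: {}'.format(family, infrastructure_type, status),
        'source': self.name,           # set automatically if you use the helper methods
        'family': family,
        'infrastructure_type': infrastructure_type,
        'status': status,
    }

    h.add_evil(intel)                  # attach the malicious context
    self.commit_to_db(h)               # saving to the DB is covered in the next section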

How to insert elements into the DB

You insert elements into the DB using the Feed.commit_to_db(element) method. This method returns the element you just saved to the DB, so that you can later connect it to other elements.

You would save the Hostname element we created earlier with:

h_from_db = self.commit_to_db(h)

You can also link elements together. For example, imagine the line you're analysing also contains the IP address the domain resolved to at the time it was added. You can add it to the database and link the two using the Model.connect() method:

_, _, _, ip_address = "this-should-be-suspended.no-ip.org;Citadel-C&C;online;127.0.0.1".split(';')
ip = Ip(ip=ip_address)
ip_from_db = self.commit_to_db(ip)
self.model.connect(h_from_db, ip_from_db)

connect also takes an optional argument: attribs. This is the text that will show over the link when displayed in the graph.
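
For example, you could label the link between the hostname and the IP it resolved to like this (passing attribs as a keyword argument is an assumption based on the parameter name above; the label text itself is free-form):

self.model.connect(h_from_db, ip_from_db, attribs='A record')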

Concrete example: ZeusTrackerBinaries

It's best to learn by example. Let's take a look at how the ZeusTrackerBinaries feed is built.

class ZeusTrackerBinaries(Feed):

    def __init__(self, name):
        super(ZeusTrackerBinaries, self).__init__(name)
        self.name = "ZeusTrackerBinaries"
        self.source = "https://zeustracker.abuse.ch/monitor.php?urlfeed=binaries"
        self.description = "This feed shows the latest 50 ZeuS binary URLs."

The class is initialised by calling its super constructor method. Easy enough, right?

This is the update method, the one that the Feed class expects you to implement:

def update(self):
    for dict in self.update_xml('item', ["title", "link", "description", "guid"]):
        self.analyze(dict)
  1. The first thing we do is fetch the data. In this case, we're calling the update_xml method. This will automatically use the source defined in __init__: https://zeustracker.abuse.ch/monitor.php?urlfeed=binaries.
  2. update_xml will automatically parse the XML structure <item><title></title><link></link><description></description><guid></guid></item>. If your XML structure differs from this, then you'll have to parse the XML yourself.
  3. The returned dictionaries ({'description': ..., 'title': ..., etc.}) are sent to the analyze method
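
For illustration, one of these dictionaries might look roughly like this (the values below are made up, but they follow the patterns the regular expressions in the analyze method below expect):

{
    'title': 'ZeuS binary URL (2015-07-19)',
    'link': 'https://zeustracker.abuse.ch/monitor.php?host=example.com',
    'description': 'URL: http://example.com/bin/bot.exe, status: online',
    'guid': 'https://zeustracker.abuse.ch/monitor.php?id=abcdef0123456789',
}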

Let's take a look at the analyze method:

def analyze(self, dict):
    evil = dict

    url = Url(re.search("URL: (?P<url>\S+),", dict['description']).group('url'))
    evil['id'] = md5.new(re.search(r"id=(?P<id>[a-f0-9]+)", dict['guid']).group('id')).hexdigest()

    try:
        date_string = re.search(r"\((?P<date>[0-9\-]+)\)", dict['title']).group('date')
        evil['date_added'] = datetime.datetime.strptime(date_string, "%Y-%m-%d")
    except AttributeError, e:
        print "Date not found!"

    url.add_evil(evil)
    self.commit_to_db(url)

And Bob's your uncle!