A Rule Generator for Yara Rules
Florian Roth, July 2015
yarGen is a generator for Yara rules. The reason why I developed another Yara rule generator was a special use case in which I had a directory full of hackware samples for which I had to write Yara rules.
The main principle is to create Yara rules from strings found in malware files while removing all strings that also appear in goodware files.
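The principle described above can be illustrated with a minimal sketch. This is not yarGen's actual code: the function names are hypothetical, only printable ASCII strings are considered, and the goodware database is assumed to be a plain Python set. The length limits mirror the documented defaults of -l (8) and -s (128).

```python
import re


def extract_strings(data: bytes, min_len: int = 8, max_len: int = 128) -> set:
    """Return printable ASCII runs of at least min_len bytes, cut at max_len."""
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    return {m.group()[:max_len].decode("ascii") for m in pattern.finditer(data)}


def filter_goodware(sample_strings: set, goodware_strings: set) -> set:
    """Keep only the strings that do not appear in the goodware set."""
    return sample_strings - goodware_strings


# Hypothetical sample bytes and goodware set, for illustration only
sample = b"\x00\x01CreateRemoteThread\x00evil_backdoor_cmd\x00ab\x00"
goodware = {"CreateRemoteThread"}

candidates = filter_goodware(extract_strings(sample), goodware)
print(sorted(candidates))
```

A set difference keeps the lookup cheap, which matters when the goodware set holds millions of entries.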
Since version 0.15.0 yarGen supports opcode string elements extracted from the .text sections of PE files. During database creation it splits the .text sections with the regex [\x00]{3,} and takes the first 16 bytes of each part to build an opcode database from goodware PE files. During rule creation on sample files it compares the goodware opcodes with the opcodes extracted from the malware samples and removes all opcodes that also appear in the goodware database. (There is no further magic in it yet, e.g. no XOR loop detection.)
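The opcode extraction described above can be sketched as follows. This is a simplified illustration, not yarGen's implementation: the function names are hypothetical, the .text section is assumed to be already available as raw bytes (reading it from a PE file, for example with pefile, is left out), and opcodes are stored as hex strings.

```python
import re


def extract_opcodes(text_section: bytes, length: int = 16) -> set:
    """Split on runs of three or more null bytes, keep the first bytes of each part."""
    parts = re.split(rb"[\x00]{3,}", text_section)
    return {part[:length].hex() for part in parts if part}


def unique_opcodes(sample_opcodes: set, goodware_opcodes: set) -> set:
    """Drop every opcode sequence that is also known from goodware."""
    return sample_opcodes - goodware_opcodes


# Hypothetical .text content: three code fragments separated by null padding
section = b"\x55\x8b\xec" + b"\x00" * 4 + bytes(range(1, 21)) + b"\x00" * 3 + b"\x90\x90"
sample_ops = extract_opcodes(section)
print(sorted(unique_opcodes(sample_ops, {"558bec"})))
```

Parts shorter than 16 bytes are kept as they are in this sketch; whether yarGen keeps or discards them is not stated in this document.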
Since version 0.14.0 it uses naive-bayes-classifier by Mustafa Atik and Nejdet Yucesoy to classify the strings and detect useful words instead of compression/encryption garbage.
Since version 0.12.0 yarGen does not completely remove the goodware strings from the analysis process but includes them with a very low score. These strings are included only if no better strings can be found, and they are marked with a comment /* Goodware rule */. Force yarGen to remove all goodware strings with --excludegood. Also since version 0.12.0 yarGen allows you to place the "strings.xml" from PEstudio in the program directory in order to apply the blacklist definition during the string analysis process. You'll get better results.
The rule generation process tries to identify similarities between the files that get analyzed and then combines the strings into so-called "super rules". Up to now the super rule generation does not remove the simple rule for the files that have been combined in a single super rule. This means that there is some redundancy when super rules are created. You can suppress a simple rule for a file that was already covered by a super rule by using --nosimple.
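One way to picture the super rule idea is the sketch below. It is an assumption about the general approach, not a description of yarGen's algorithm: strings are grouped by the exact set of files that share them, a group becomes a super rule once it holds a minimum number of shared strings, and the hypothetical nosimple flag mimics the effect of --nosimple.

```python
from collections import defaultdict


def build_super_rules(file_strings: dict, min_shared: int = 2) -> dict:
    """Map each set of files to the strings shared by exactly those files."""
    files_by_string = defaultdict(set)
    for name, strings in file_strings.items():
        for s in strings:
            files_by_string[s].add(name)

    groups = defaultdict(set)
    for s, files in files_by_string.items():
        if len(files) > 1:  # only strings seen in more than one file
            groups[frozenset(files)].add(s)

    return {files: strs for files, strs in groups.items() if len(strs) >= min_shared}


def simple_rule_targets(file_strings: dict, super_rules: dict, nosimple: bool = False) -> list:
    """List the files that still get a simple rule."""
    if not nosimple:
        return sorted(file_strings)
    covered = set().union(*super_rules)  # all files covered by any super rule
    return sorted(name for name in file_strings if name not in covered)


# Hypothetical per-file string sets
samples = {
    "a.exe": {"s1", "s2", "x"},
    "b.exe": {"s1", "s2", "y"},
    "c.exe": {"z"},
}
super_rules = build_super_rules(samples)
print(simple_rule_targets(samples, super_rules, nosimple=True))
```

Without the flag, a.exe and b.exe would receive both a simple rule and the shared super rule, which is the redundancy mentioned above.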
- Make sure you have at least 2.5GB of RAM on the machine on which you plan to run yarGen (4GB if opcodes should be included in rule generation, deactivate via --noop)
- Clone the git repository
- Install all dependencies with
  sudo pip install scandir lxml naiveBayesClassifier pefile
- Unzip the goodware string database (e.g. 7z x good-strings.db.zip.001)
- Unzip the goodware opcode database (e.g. 7z x good-opcodes.db.zip.001)
- See help with
  python yarGen.py --help
Warning: yarGen pulls the whole goodware string database into memory and uses up to 2.5 GB of memory for a few seconds - 4 GB if opcode evaluation is used.
I already tried to migrate the database to SQLite, but the numerous string comparisons and lookups made the analysis very slow.
usage: yarGen.py [-h] [-m M] [-l min-size] [-z min-score] [-s max-size]
[-rc maxstrings] [--excludegood] [-o output_rule_file]
[-a author] [-r ref] [-p prefix] [--score] [--nosimple]
[--nomagic] [--nofilesize] [-fm FM] [--noglobal] [--nosuper]
[-g G] [-u] [-c] [--nr] [--oe] [-fs size-in-MB] [--debug]
[--noop] [-n opcode-num] [--inverse] [--nodirname]
[--noscorefilter]
yarGen
optional arguments:
-h, --help show this help message and exit
Rule Creation:
-m M Path to scan for malware
-l min-size Minimum string length to consider (default=8)
-z min-score Minimum score to consider (default=5)
-s max-size Maximum length to consider (default=128)
-rc maxstrings Maximum number of strings per rule (default=20,
intelligent filtering will be applied)
--excludegood Force the exclude all goodware strings
Rule Output:
-o output_rule_file Output rule file
-a author Author Name
-r ref Reference
-p prefix Prefix for the rule description
--score Show the string scores as comments in the rules
--nosimple Skip simple rule creation for files included in super
rules
--nomagic Don't include the magic header condition statement
--nofilesize Don't include the filesize condition statement
-fm FM Multiplier for the maximum 'filesize' condition
(default: 3)
--noglobal Don't create global rules
--nosuper Don't try to create super rules that match against
various files
Database Operations:
-g G Path to scan for goodware (dont use the database
shipped with yaraGen)
-u Update local goodware database (use with -g)
-c Create new local goodware database (use with -g)
General Options:
--nr Do not recursively scan directories
--oe Only scan executable extensions EXE, DLL, ASP, JSP,
PHP, BIN, INFECTED
-fs size-in-MB Max file size in MB to analyze (default=10)
--debug Debug output
OpCode Feature:
--noop Do not use the OpCode string feature
-n opcode-num Number of opcodes to add if not enough high scoring
string could be found (default=3)
Inverse Mode:
--inverse Show the string scores as comments in the rules
--nodirname Don't use the folder name variable in inverse rules
--noscorefilter Don't filter strings based on score (default in
'inverse' mode)
See the following blog post for a more detailed description on how to use yarGen for YARA rule creation: How to Write Simple but Sound Yara Rules
As you can see in the screenshot above you'll get a rule that contains strings, which are not found in the goodware strings database.
You should clean up the rules afterwards. In the example above, remove the strings $s14, $s17, $s19, $s20 that look like random code to get a cleaner rule that is more likely to match on other samples of the same family.
To get a more generic rule, remove string $s5, which is very specific for this compiled executable.
python yarGen.py -m X:\MAL\Case1401
Use the shipped database of goodware strings and scan the malware directory "X:\MAL\Case1401" recursively. Create rules for all files included in this directory and below. A file named 'yargen_rules.yar' will be generated in the current directory.
python yarGen.py --noop -m X:\MAL\Case1401
Deactivate the opcode analysis (memory consumption 2.5 GB instead of 4 GB).
yarGen will by default use the top 20 strings based on their score. To see how a certain string in the rule scored, use the "--score" parameter.
python yarGen.py --score -m X:\MAL\Case1401
To use only strings that reach a certain minimum score in your rules, use the "-z" parameter. It is good practice to first create rules with "--score" and then perform a second run with a minimum score set for your sample set via "-z".
python yarGen.py --score -z 5 -m X:\MAL\Case1401
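The interplay of the minimum score and the string limit can be sketched roughly as follows. The function name and the example scores are hypothetical, and treating the threshold as inclusive is an assumption; the defaults mirror the documented values of -z (5) and -rc (20).

```python
def select_strings(scored: dict, min_score: int = 5, max_strings: int = 20) -> list:
    """Keep strings at or above min_score, best first, at most max_strings."""
    candidates = [(s, score) for s, score in scored.items() if score >= min_score]
    # Highest score first; ties are broken alphabetically for stable output
    candidates.sort(key=lambda item: (-item[1], item[0]))
    return candidates[:max_strings]


# Hypothetical scores, as they might be shown in the comments produced by --score
scored = {
    "cmd.exe /c": 3,
    "evilbackdoor": 12,
    "kernel32.dll": -5,
    "Mozilla/4.0 (compatible)": 7,
}
for string, score in select_strings(scored, min_score=5):
    print(score, string)
```

Inspecting the scores first and raising the threshold in a second run, as suggested above, corresponds to calling this helper again with a larger min_score.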
python yarGen.py -a "Florian Roth" -r "http://goo.gl/c2qgFx" -m /opt/mal/case_441 -o case441.yar
python yarGen.py --excludegood -m /opt/mal/case_441
python yarGen.py --nosimple -m /opt/mal/case_441
python yarGen.py --debug -m /opt/mal/case_441
python yarGen.py -c -g C:\Windows\System32
python yarGen.py -u -g "C:\Program Files"
In order to create some inverse rules on goodware, you have to prepare a directory with subdirectories in which you include all versions of the files you want to create inverse rules for with their original name and in their original folder. If that sounds strange, let me give you an example.
For example, if you want to create inverse rules for all Windows executables in the System32 folder, you have to create a goodware archive with the following directory structure:
- G:\goodware
  - WindowsXP
    - System32 - all files
  - Windows2003
    - System32 - all files
  - Windows2008R2
    - System32 - all files
yarGen then creates rules that identify e.g. the file name "cmd.exe" in a path ending with "System32" and check whether the file contains certain necessary strings. If the strings don't show up, the rule will fire. This indicates a replaced system file or a malware file that tries to masquerade as a system file.
python yarGen.py --inverse --oe -m G:\goodware\
You can also instruct yarGen not to include the file path but solely rely on the filename.
python yarGen.py --inverse --oe --nodirname -m G:\goodware\
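The logic of an inverse rule can be illustrated with a short Python sketch. It only models the decision described above and does not reproduce the YARA rules that yarGen writes: the function name, the parameters, and the required strings are hypothetical, and the sketch assumes the rule fires as soon as any required string is missing. The use_dirname flag stands in for the effect of --nodirname.

```python
from pathlib import PureWindowsPath


def inverse_rule_fires(path: str, data: bytes, filename: str, folder: str,
                       required: list, use_dirname: bool = True) -> bool:
    """Return True if a file with a known name lacks the expected strings."""
    p = PureWindowsPath(path)
    if p.name.lower() != filename.lower():
        return False  # rule does not apply to this file name
    if use_dirname and p.parent.name.lower() != folder.lower():
        return False  # rule does not apply outside the expected folder
    # Fire when at least one necessary string is missing from the content
    return not all(s in data for s in required)


# Hypothetical strings expected in a genuine cmd.exe
required = [b"Microsoft Corporation", b"cmd.pdb"]
genuine = b"MZ...Microsoft Corporation...cmd.pdb..."
fake = b"MZ evil payload"

print(inverse_rule_fires(r"C:\Windows\System32\cmd.exe", genuine, "cmd.exe", "System32", required))
print(inverse_rule_fires(r"C:\Windows\System32\cmd.exe", fake, "cmd.exe", "System32", required))
```

With use_dirname=False the same check also applies to a cmd.exe found in any other folder, which is the behaviour the --nodirname option is described as enabling.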