Format Identification Methodology and Programming #23
Comments
For those interested in the Bash scripts, here they are, as aliases from my .bashrc (the alias command maps to the full command). alias stringme='strings --radix=d -n 4 * | sort | uniq -c | sort -gr | head' - prints out the most common strings found across all files in the directory.
The 'lcs' program is available at: https://github.com/gleporeNARA/pronom-research/blob/master/lcs, as is a program called 'idcom', which runs 'file', Siegfried, and TRiD against all files in a directory.
Thanks for sharing this! I would suggest writing this up as a blog post, because I think that not many people will read the issues in the repository. I'll tweet this out, but I'm definitely not enough of a coder to provide feedback. I'll be happy if I can follow the steps well enough to try it out at home. ;)
I thought it might be a good idea to document my particular approach to identifying format signatures, in the hope that it will help other people in their work. I use a combination of Python programs and Bash scripts to largely automate my analysis of new formats.
Many of us are working on unknown file formats from our own repositories, but in addition to local files, there are also tons of online repositories of formats, software, and documentation to draw upon. I download large datasets from various online archives (old computers, software, etc.) as well as from the Internet Archive.
Once I've downloaded my datasets, I run Siegfried against the files to find the unknowns (all files where Siegfried returns either an UNKNOWN or an "extension only" match). I use a Bash script I wrote called 'getid' to pull out all files with the same extension and copy them to my working environment. The first thing I do when I enter that directory on the command line is to run another script called 'binhead', which dumps the first 16 bytes of each file in the directory to the screen. It looks like this:
'binhead' has a companion program called 'bintail' which dumps the last 16 bytes of each file to the screen.
'binhead' takes one argument, the number of bytes to dump. It defaults to 16, but can take more. The extract above suggests there is more plain text in the header, so my next step would be to run 'binhead' with more bytes. 'binhead 26' yields the following:
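The author's 'binhead' and 'bintail' are Bash scripts; a minimal Python sketch of the same idea might look like this (the function signatures and hex output format are assumptions, not the original scripts):

```python
from pathlib import Path

def binhead(directory=".", nbytes=16):
    """Dump the first nbytes of every file in a directory as hex, one file per line."""
    for path in sorted(Path(directory).iterdir()):
        if path.is_file():
            print(f"{path.name}: {path.read_bytes()[:nbytes].hex()}")

def bintail(directory=".", nbytes=16):
    """Dump the last nbytes of every file in a directory as hex (possible EOF markers)."""
    for path in sorted(Path(directory).iterdir()):
        if path.is_file():
            # A slice like [-16:] simply returns the whole file if it is shorter.
            print(f"{path.name}: {path.read_bytes()[-nbytes:].hex()}")
```

Calling binhead(".", 26) would correspond to running 'binhead 26' in the working directory.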
The 20202020 at the end is just spaces. I think that's enough to create a PRONOM signature for these files, so I run another program called 'lcs' on the files. LCS stands for Longest Common Subsequence, an algorithm that finds the longest sequence of characters common to all of the strings analyzed. Calling 'lcs geo 22' identifies the longest common subsequence in the first 22 bytes of all files with a geo extension. The LCS algorithm is incredibly processor intensive, so it's impractical to run it on strings larger than roughly 1024 bytes. Running 'lcs' produces the hex values "47454f49442045585452414354454420524547494f4e", which I then use to create a new PRONOM signature, using Ross Spencer's online tool.
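I don't know how the author's 'lcs' is implemented, but the classic dynamic-programming LCS, folded across the files' headers, could be sketched like this (file handling and the nbytes parameter are assumptions; note that folding pairwise results is a heuristic and is not guaranteed to find the true multi-string LCS in every case). The O(m×n) table also illustrates why long inputs get expensive:

```python
from pathlib import Path

def lcs_pair(a: bytes, b: bytes) -> bytes:
    """Classic O(len(a) * len(b)) dynamic-programming LCS of two byte strings."""
    m, n = len(a), len(b)
    # dp[i][j] holds an LCS of a[:i] and b[:j].
    dp = [[b""] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            if a[i] == b[j]:
                dp[i + 1][j + 1] = dp[i][j] + a[i:i + 1]
            else:
                dp[i + 1][j + 1] = max(dp[i + 1][j], dp[i][j + 1], key=len)
    return dp[m][n]

def lcs_files(paths, nbytes=22):
    """Fold the pairwise LCS over the first nbytes of each file; return hex."""
    heads = [Path(p).read_bytes()[:nbytes] for p in paths]
    common = heads[0]
    for head in heads[1:]:
        common = lcs_pair(common, head)
    return common.hex()
```

When every file shares the same 22-byte header, the fold simply returns that header, whose hex matches a candidate PRONOM byte sequence.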
Once I've downloaded the new signature XML file, I run yet another Bash script, 'sfxx'. The script copies the XML file to my Siegfried directory, creates the new signature, and runs it against the files in the current directory.
I have quite a few other Python programs and Bash scripts which automate various parts of working with file formats. Another Python script identifies which specific bytes (in hex) occur at the same location across all files in a directory. This can help identify format signatures which are not text based. I still have work to do: I'd like to adapt that program to also search for common bytes starting from the end of the file (which would help identify EOF markers).
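The byte-position comparison described above can be sketched roughly as follows (the function name, the nbytes window, and the (offset, hex) output format are my assumptions, not the author's script):

```python
from pathlib import Path

def common_offsets(directory=".", nbytes=64):
    """Return (offset, hex_value) pairs for every position within the first
    nbytes where all files in the directory hold the identical byte."""
    heads = [p.read_bytes()[:nbytes]
             for p in sorted(Path(directory).iterdir()) if p.is_file()]
    if not heads:
        return []
    # Only compare offsets that exist in every file.
    limit = min(len(h) for h in heads)
    result = []
    for offset in range(limit):
        values = {h[offset] for h in heads}
        if len(values) == 1:
            result.append((offset, f"{values.pop():02x}"))
    return result
```

Runs of consecutive matching offsets are candidate fixed byte sequences for a signature; the same loop run over reversed file contents would give the EOF-marker search mentioned above.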
I'm not a great programmer; I mostly hack and slash at the code until it roughly works, but the above process has greatly increased my ability to analyze unknown formats. I would be interested to know if other researchers are writing their own code to help analyze file formats.
Please reply below with any questions or comments.