Mathematica Notebook format #1

Open
gleporeNARA opened this issue Sep 23, 2020 · 14 comments

@gleporeNARA

gleporeNARA commented Sep 23, 2020

Looking for additional real-world examples from earlier versions of Mathematica (say, versions 1 and 2, if they existed). Also wondering about the resource cost of running an identification against a signature that has to search for several long text strings.

Format name: Mathematica Notebook files
Version number(s): all?
PRONOM fmt/201 - No current signatures on file - http://www.nationalarchives.gov.uk/PRONOM/Format/proFormatSearch.aspx?status=detailReport&id=926&strPageToDisplay=summary
Extensions: nb
mime-type: text/plain; charset=us-ascii
Description: "Wolfram Mathematica (usually termed Mathematica) is a modern technical computing system spanning most areas of technical computing — including neural networks, machine learning, image processing, geometry, data science, visualizations, and others. The system is used in many technical, scientific, engineering, mathematical, and computing fields. It was conceived by Stephen Wolfram and is developed by Wolfram Research of Champaign, Illinois The Wolfram Language is the programming language used in Mathematica."
Format type: Text (Structured)
Vendor: Wolfram Research

The signature from the 'file' command is:

# .nb files
#too long 0	string	(***********************************************************************\n\n\ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ Mathematica-Compatible Notebook	Mathematica 3.0 notebook
0	string	(***********************	Mathematica 3.0 notebook

# other (* matches it is a comment start in these langs
# GRR: Too weak; also matches other languages e.g. ML
#0	string	(*	Mathematica, or Pascal, Modula-2 or 3 code text

Below is a list of common strings that appear in these files.

Content-type: application/vnd.wolfram.mathematica
Content-type: application/mathematica
Wolfram Notebook File
Mathematica-Compatible Notebook
CreatedBy='Mathematica x.x'
http://www.wolfram.com/nb
mathematica.zip
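
As a rough illustration of how the strings listed above could be checked, here is a minimal Python sketch. It is not a PRONOM or DROID signature; the function name `find_markers` and the 64 KiB read limit are assumptions made for the example, and the `x.x` version placeholder is dropped so that only the fixed prefix is searched.

```python
from typing import List

# Marker strings copied from the list above. The version placeholder "x.x"
# is omitted, so only the fixed "CreatedBy='Mathematica " prefix is searched.
MARKERS = [
    b"Content-type: application/vnd.wolfram.mathematica",
    b"Content-type: application/mathematica",
    b"Wolfram Notebook File",
    b"Mathematica-Compatible Notebook",
    b"CreatedBy='Mathematica ",
    b"http://www.wolfram.com/nb",
    b"mathematica.zip",
]


def find_markers(data: bytes, limit: int = 65536) -> List[str]:
    """Return the markers found in the first `limit` bytes of `data`.

    `limit` is an assumed cap, not a value taken from any real signature.
    """
    head = data[:limit]
    return [m.decode("ascii") for m in MARKERS if m in head]
```

Because the search only looks at a bounded prefix, the work per file should stay roughly proportional to the number of markers times the prefix size, whatever the file size.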

@samalloing

I'm going to try this one if that's ok

@gleporeNARA
Author

Sam - absolutely! Have at it. I have a large body of files to test your signature on.

@samalloing

I have created a signature file for a notebook file. The current problem I'm facing (also described above) is that there are multiple versions of Wolfram Notebook and there is only one fmt. Should I create a signature file for every version I can find? Or do I need to handle this differently?
cc @Dclipsham

Thanks for any advice

Sam

@gleporeNARA
Author

Sam
I've asked David that very same question! He said (to paraphrase him) that he leaves it up to the user to decide whether the variations in the format are significant (e.g. no backwards compatibility in the software).

I think for Mathematica, a single signature for the variations would be acceptable. The plain text nature of the format mitigates future archiving issues.

@samalloing

Thanks for your response. I think I'm going to try to make different signatures then, because there are no tools (that I'm aware of) that say which version a file is, so it may be useful to know this. But if you (or anybody else) have objections to this, I'll create one for all versions.

Sam

@samalloing

I've been thinking further on this and I think I will make a signature file for every version of Notebook I have and create a fallback for Wolfram Notebooks in general so that all Notebooks will be identified. Most will have versions and some will have the general ID. So we don't need to deprecate the current Wolfram Notebook PUID and if more specific PUIDs are available these are assigned. How does this sound?
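
The "specific version first, generic fallback second" idea can be sketched in Python as follows. This only illustrates the matching order and is not the actual signature files. The function name `classify`, the 4096-byte window and the returned labels are hypothetical, and the bare `(*` fallback is a deliberately loose stand-in for the general signature.

```python
import re

# Assumption: the version is recorded as CreatedBy='Mathematica N.N'
# somewhere near the start of the file.
CREATED_BY = re.compile(rb"CreatedBy='Mathematica (\d+\.\d+)'")


def classify(data: bytes, limit: int = 4096) -> str:
    """Prefer a version-specific match, then fall back to a generic one."""
    head = data[:limit]
    match = CREATED_BY.search(head)
    if match:
        return "Mathematica " + match.group(1).decode("ascii")
    if head.startswith(b"(*"):
        # Generic catch-all: the file opens with a (* comment.
        return "Mathematica (unknown version)"
    return "not identified"
```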

@gleporeNARA
Author

I like the idea of a catch-all signature (the existing one) and then individual ones. There must be some format changes from the earlier version to the later versions (after Wolfram bought Mathematica).

I can test your signatures when they are ready.

@samalloing

Hi @gleporeNARA

I have test signatures for all the Mathematica notebook files I have and for the ones you included in this issue. The all.zip file contains signature files for versions 6.0, 7.0, 8.0, 9.0, 10.0, 10.1, 10.2, 10.3, 10.4, 11.0, 11.1, 11.2, 11.3 and 12.0, plus a catch-all for unknown versions.
all.zip

cc @thorsted

@gleporeNARA
Author

The version-specific signatures look good; they all match up with my test files (except for some of the 4.2 versions). The generic signature, at 2 bytes, is probably too brief to be specific enough. It matched hundreds of non-Mathematica files in my test collections. See attached for a small sample. It mostly looks like Pascal code, but other formats come up positive as well. I would suggest that the generic signature also require the word Mathematica somewhere in the first 200 or so characters, and perhaps a few more asterisks.

The others that aren't matching a specific signature all have the string "Mathematica-Compatible Notebook" in addition to the '(*' string. Perhaps a separate signature for that would be useful. There's obviously some program out there that outputs its Mathematica files with that string.

Thanks for working on this!

False Positives
2.zip
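
The tightened generic check suggested in this comment can be sketched in Python as follows. The function name `looks_like_notebook` is hypothetical, the 200-byte window follows the "first 200 or so characters" estimate above, and the extra-asterisks idea is not implemented.

```python
def looks_like_notebook(data: bytes, window: int = 200) -> bool:
    """Require '(*' at offset 0 and 'Mathematica' within the first `window` bytes."""
    return data.startswith(b"(*") and b"Mathematica" in data[:window]
```

A Pascal source file that opens with a `(*` comment passes the 2-byte test but fails this one, unless the word Mathematica happens to appear in its first 200 bytes.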

@samalloing

Hi @gleporeNARA,

Can you also provide the examples of Mathematica 4.2 files that fail the match?

Thanks!

@gleporeNARA
Author

The four files with the names beginning with Math42 in the original zip file I uploaded. It's weird, because it looks like they should match the hex values 4372656174656442793d274d617468656d617469636120342e3227

Can you verify?
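
For reference, the hex sequence quoted above decodes to a readable ASCII string, which can be checked with a short Python sketch. The helper name `contains_signature` is hypothetical.

```python
# Hex value quoted in the comment above.
SIG_HEX = "4372656174656442793d274d617468656d617469636120342e3227"
SIG = bytes.fromhex(SIG_HEX)

# Decodes to: CreatedBy='Mathematica 4.2'
print(SIG.decode("ascii"))


def contains_signature(data: bytes) -> bool:
    """Return True if the decoded byte sequence occurs anywhere in `data`."""
    return SIG in data
```

Running `contains_signature` over the bytes of one of the Math42 test files is a quick way to confirm that the sequence is actually present before debugging the signature file itself.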

@samalloing

Fixed it; it was just a copy-paste error.
all.zip

@samalloing

I'll look into the false positives

@samalloing

Regarding the false positives, I think I found a solution. If the end of the file also matches 290a 0d0a 0d0a, 290a 0a0a or 290a, then it is a Mathematica file. This matches 100% of the Mathematica files and none of the false positives.

Now I'm trying to create a signature for that, but it is more challenging than I thought :-)
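
Outside of a signature file, the combined start-and-end test described above can be sketched in Python. This is only an illustration of the logic, not a PRONOM signature. The function name `has_notebook_shape` is hypothetical, and the result quoted above (no false positives) applies to the commenter's test set, not to files in general.

```python
# End-of-file byte sequences quoted in the comment above.
TRAILERS = (
    bytes.fromhex("290a0d0a0d0a"),  # ")" LF CR LF CR LF
    bytes.fromhex("290a0a0a"),      # ")" LF LF LF
    bytes.fromhex("290a"),          # ")" LF
)


def has_notebook_shape(data: bytes) -> bool:
    """True if the file opens with '(*' and ends with one of the trailers."""
    return data.startswith(b"(*") and data.endswith(TRAILERS)
```

All three trailers are needed because neither of the longer sequences ends in `290a`, so the shortest one does not cover them.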
