This program is part of an experiment to learn how real-world files compress. When run on a machine, it feeds every file on the system through five different file compressors to see how well each file compresses and how long compression takes.
- Download the zip folder with all the necessary tools here.
- The latest Java Runtime Environment (JRE). This should run on anything that supports JRE 8 or higher.
- If you wish to build from source, see Building from Source below.
You will not be able to use your computer while this is running. We highly recommend running this overnight. It takes many, many hours. If you need to stop the program early, press Ctrl + C. This should halt the program.
Though we do not track or store file names or file locations on your system, we do collect some information that could be used to guess what type of information may exist on your system. For this reason, we will not distribute this data to anyone not part of the Comprestimator study.
Instead of storing file names, comprestimator stores each file's hash value, which is like a unique file fingerprint. This fingerprint cannot tell us the name, location, or content of the original file. (If you would like to know more about how hash functions work, see the Wikipedia article on cryptographic hash functions: https://en.wikipedia.org/wiki/Cryptographic_hash_function.) Besides the file's hash, this program records the file's original size, its compressed size, and the amount of time needed to compress the file. All of this information is stored locally in an SQLite database file.
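The record described above can be sketched in a few lines of Python. This is an illustration only, not comprestimator's code: SHA-256 is assumed as the hash function, `zlib` stands in for the five compressors, and the `measure` function and its field names are hypothetical.

```python
import hashlib
import time
import zlib


def measure(data: bytes) -> dict:
    """Return the kind of record described above for one file's bytes."""
    # Fingerprint of the contents; SHA-256 is an assumption made for this sketch.
    digest = hashlib.sha256(data).hexdigest()

    # Time a single compression pass; zlib stands in for the real compressors.
    start = time.perf_counter()
    compressed = zlib.compress(data)
    elapsed = time.perf_counter() - start

    # No file name, path, or content appears in the record.
    return {
        "hash": digest,
        "original_size": len(data),
        "compressed_size": len(compressed),
        "seconds": elapsed,
    }


record = measure(b"example file contents " * 1000)
print(record)
```

The record holds only a fixed-length digest and three numbers, so nothing in it reveals where the file lived or what it contained.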
We do not record the names of files processed, nor do we store any of the data that the files contain. However, if you have some files that you don't want comprestimator to touch at all, you can enter the file location(s) in `skip_list.txt`. See the instructions in "How to run it" to use the skip list.
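The run instructions describe the skip list as case sensitive. As a rough illustration of what that implies, the sketch below treats a file as skipped when its path equals a listed folder or lies beneath it. How comprestimator actually matches entries is not documented here; `is_skipped` and the example paths are hypothetical.

```python
from pathlib import PurePosixPath
from typing import Iterable


def is_skipped(path: str, skip_list: Iterable[str]) -> bool:
    """Return True if path equals, or lies inside, any skip-list entry."""
    candidate = PurePosixPath(path)
    for entry in skip_list:
        folder = PurePosixPath(entry)
        # PurePosixPath compares path components case-sensitively.
        if candidate == folder or folder in candidate.parents:
            return True
    return False


skip = ["/home/me/private_folder"]
print(is_skipped("/home/me/private_folder/taxes.pdf", skip))  # True
print(is_skipped("/home/me/Private_Folder/taxes.pdf", skip))  # False
```

The second call returns `False` because of the differing capitalization, which is why it is worth copying folder names into the skip list exactly as they appear on disk.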
WARNING: Comprestimator will take many hours to finish. If you need to use your computer before the program has finished running, hit `Ctrl + C`. You can restart the program later, and it will pick up where it left off.
(Note: if you have Windows Subsystem for Linux (WSL), I recommend using WSL so that we can gather `file` command metadata. If you do not know what WSL is, disregard this request.)
- Open your Downloads folder (or wherever you placed the comprestimator zip file). Right-click on `comprestimator.zip` and select `Extract all...`.
- If there are any folders that you do not want comprestimator to touch, enter the full pathname of each folder on a separate line in the file `skip_list.txt`. The pathname begins with the hard drive's letter, e.g. `C:\Users\me\private_folder`. Careful! The skip list is case sensitive! If you cannot find `skip_list.txt`, double-click `run.bat` and then hit `Ctrl + C` after a few seconds to quit the program. This will generate `skip_list.txt` for you.
- Double-click `run.bat`. A shell window will open. The first message you should see is "Beginning file enumeration...". This takes several minutes depending on the amount of data on your computer.
- Next, you should see "Beginning compression loop...". This step will take many hours. If you need to halt the program to use your computer, hit `Ctrl + C`. You can safely restart the program later (this will not corrupt or overwrite any previous results).
- When the program is finished running, you will see a new file called "test.db". Please save this!! This contains all the results from the experiment!
- I recommend compressing test.db before sending it. To do this, right-click on the file and select "Send to", then "Compressed (zipped) folder".
- Fill out the consent form. Once we get a form from you, we will email you a link to a private Google Drive folder where you can upload the DB file. Email me at [email protected] if you need help. Thank you!
If you would like to participate in our study, please read and sign the electronic consent form available here. Once you sign up, we will send you a link to a private Google Drive folder where you can upload your compression database when you are done running the tool.
- Open a terminal. Navigate to the directory that contains comprestimator.jar.
- If there are any folders that you do not want comprestimator to touch, enter the full pathname of each folder on a separate line in the file `skip_list.txt`. Careful! The skip list is case sensitive! If you cannot find `skip_list.txt`, enter `./run.sh` and then hit `Ctrl + C` after a couple of seconds to quit the program. This will generate `skip_list.txt` for you.
- Enter `./run.sh`. The first message you should see is "Beginning file enumeration...". This takes several minutes depending on the amount of data on your computer.
- Next, you should see "Beginning compression loop...". This step will take many hours. If you need to halt the program to use your computer, hit `Ctrl + C`. You can safely restart the program later (this will not corrupt or overwrite any previous results).
- When the program is finished running, you will see a new file called "test.db". Zip this file however you like (gzip works well).
- Fill out the consent form. Once we get a form from you, we will email you a link to a private Google Drive folder where you can upload the DB file. Email me at [email protected] if you need help. Thank you!
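For the zipping step, any tool will do. As one option, the sketch below uses Python's standard `gzip` and `shutil` modules to write a compressed copy next to the original. The `gzip_file` helper is hypothetical and is not part of comprestimator.

```python
import gzip
import shutil
from pathlib import Path


def gzip_file(source: str) -> Path:
    """Write a gzip-compressed copy of source as source + '.gz' and return its path."""
    src = Path(source)
    dest = src.with_name(src.name + ".gz")
    # Stream the copy so a large database never has to fit in memory.
    with open(src, "rb") as f_in, gzip.open(dest, "wb") as f_out:
        shutil.copyfileobj(f_in, f_out)
    return dest


if __name__ == "__main__":
    db = Path("test.db")
    if db.exists():
        print(gzip_file(str(db)))
```

The original test.db is left untouched, so it can be kept as a backup until the upload has been confirmed.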
You're more than welcome to build from source if you wish! We've tried to make it as easy as possible. All you will need is `ant` and JDK 8 or higher. Once you have cloned the git repo, you'll need to create a file called `build.properties` in the same directory as `build.xml`. In `build.properties`, write the line `jdk.home=pathToYourJDK`, substituting the path to the JDK on your machine (it should look like `jdk.home=/usr/lib/jvm/java-8-openjdk` or something similar). Don't add a '/' at the end of the line!
Once you've created `build.properties`, run `ant` in a command line. If everything goes well, you should see `BUILD SUCCESSFUL`. All dependencies should be included in the git repo, so you shouldn't need to download anything else. The JAR will be in `out/artifacts/comprestimator_jar`. Wherever you put the JAR, make sure you move either `run.sh` or `run.bat` to the same directory.
If you want to see the data comprestimator has stored, you will need to download SQLite3. Once you have downloaded and unzipped the file (or built the code from source), open a terminal, navigate to the folder that contains test.db, and enter `sqlite3 test.db`. To see a few of the results gathered, enter `select * from lz4_results limit 10;`. Without `limit 10` you will likely get a list of about 100,000 results. What you are seeing is the file's hash value (about 64 letters and numbers of gobbledygook), the file extension (e.g. pdf, doc, jpg), and some numbers representing the size of the file before and after compressing and how long compression took. The names of your files and their contents are never stored.
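The same query can also be run from Python's built-in `sqlite3` module. This is a minimal sketch: only the table name `lz4_results` comes from the instructions above, the `preview` helper is hypothetical, and the column names are read from the database rather than assumed.

```python
import os
import sqlite3


def preview(db_path: str, table: str = "lz4_results", limit: int = 10):
    """Return (column_names, rows) for the first `limit` rows of `table`."""
    conn = sqlite3.connect(db_path)
    try:
        # Table names cannot be bound as parameters, so the name is quoted
        # as an identifier; the row limit is bound normally.
        quoted = '"' + table.replace('"', '""') + '"'
        cur = conn.execute(f"SELECT * FROM {quoted} LIMIT ?", (limit,))
        columns = [col[0] for col in cur.description]
        return columns, cur.fetchall()
    finally:
        conn.close()


if __name__ == "__main__" and os.path.exists("test.db"):
    columns, rows = preview("test.db")
    print(columns)
    for row in rows:
        print(row)
```

Printing the column names first makes it easier to tell which number is the original size, which is the compressed size, and which is the timing.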
If this is overwhelming, email [email protected] and ask for assistance. We are also happy to send you the entire database dumped as a text file.