[Hotfix] Set Magika to use medium-confidence instead of high (#1200)
* Set Magika to use medium-confidence instead of the default high-confidence.

* Switch to using content types instead of MIME types. (Magika claims this is more reliable.)

* Remove JPEG 2000 support.

* Add unit test for a medium-confidence HTML file (it was misdetected as plain text in high-confidence mode).

* Fix gitleaks
cgodwin1 authored Mar 6, 2024
1 parent c377252 commit 4ed4821
Showing 28 changed files with 4,486 additions and 85 deletions.
2 changes: 1 addition & 1 deletion .gitleaks.toml
@@ -6,6 +6,6 @@ useDefault = true
[allowlist]
description = "global allow list"
paths = [
"solution/text-extractor/extractors/tests/fixtures/eml/sample.eml",
"solution/text-extractor/extractors/tests/fixtures",
]

16 changes: 8 additions & 8 deletions solution/text-extractor/README.md
@@ -14,7 +14,7 @@
4. [Unit testing your new extractor](#unit-testing-your-new-extractor)
5. [Storing fixture files in a "collection"](#storing-fixture-files-in-a-collection)
6. [Generating new fixture files](#generating-new-fixture-files)
7. [Fixing misdetected or unsupported MIME types](#fixing-misdetected-or-unsupported-mime-types)
7. [Determining file types and fixing misdetected ones](#determining-file-types-and-fixing-misdetected-ones)

# About

@@ -178,7 +178,7 @@ from .sample import SampleExtractor as SampleExtractor # Note the redundant ali
.... etc ....
```

The extractor is now registered and will be automatically instantiated when a file has one of the MIME types listed in `file_types`.
The extractor is now registered and will be automatically instantiated when a file has one of the content types listed in `file_types`. See [Determining file types and fixing misdetected ones](#determining-file-types-and-fixing-misdetected-ones) for instructions on determining a file's content type.

Note the underscore in front of the `_extract()` method definition. Be sure to override this instead of `extract()` because the latter performs pre-extraction checks, then calls `_extract()`.
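
The registration and `extract()`/`_extract()` split described above can be illustrated with a small, self-contained sketch. It is not the project's actual implementation: the `_registry` dictionary, the use of `__init_subclass__`, the `ValueError` exceptions, and the empty-file check are all assumptions made for illustration. Only the names `Extractor`, `SampleExtractor`, `file_types`, `get_extractor`, `extract`, and `_extract` come from the text.

```python
class Extractor:
    """Illustrative base class; subclasses register themselves by content type."""

    file_types: tuple = ()
    _registry: dict = {}  # hypothetical: maps a content type label to an extractor class

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        # Assumption: registration happens when the subclass is defined.
        for label in cls.file_types:
            Extractor._registry[label] = cls

    def __init__(self, file_type, config):
        self.file_type = file_type
        self.config = config

    @classmethod
    def get_extractor(cls, file_type, config=None):
        # Look up the class registered for this content type and instantiate it.
        try:
            extractor_cls = cls._registry[file_type]
        except KeyError:
            raise ValueError(f"No extractor registered for {file_type!r}") from None
        return extractor_cls(file_type, config or {})

    def extract(self, file: bytes) -> str:
        # Pre-extraction check (hypothetical), then delegate to the subclass hook.
        if not file:
            raise ValueError("Cannot extract text from an empty file")
        return self._extract(file)

    def _extract(self, file: bytes) -> str:
        raise NotImplementedError


class SampleExtractor(Extractor):
    file_types = ("sample",)  # hypothetical content type label

    def _extract(self, file: bytes) -> str:
        return file.decode("utf-8", errors="replace")
```

With this layout, `Extractor.get_extractor("sample").extract(b"hello")` goes through the shared check in `extract()` before reaching `SampleExtractor._extract()`, which is why subclasses override only the underscored method.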

@@ -302,18 +302,18 @@ You can also easily use the existing unit test suite to generate new fixture fil

Be sure to re-comment these 2 lines when you're done, or fixture files will be re-created every time you run unit tests, which may produce undesired behavior.

## Fixing misdetected or unsupported MIME types
## Determining file types and fixing misdetected ones

The text extractor uses [Google's Magika library](https://github.com/google/magika) for MIME type detection, which uses a machine learning algorithm that promises greater than 99% accuracy when detecting known file types. However, not all file types are supported and their model has to be trained to support them.
The text extractor uses [Google's Magika library](https://github.com/google/magika) for content type detection, which uses a machine learning algorithm that promises greater than 99% accuracy when detecting known file types. However, not all file types are supported and their model has to be trained to support them.

You can [open an issue](https://github.com/google/magika/issues) on their repository to report a misdetection or missing file type. To do so, install Magika on your machine so that you can generate a report, like so:

```shell
$ pip install magika
$ magika -i unknown_type.xyz
unknown_type.xyz: application/octet-stream
$ magika --generate-report unknown_type.xyz
unknown_type.xyz: Unknown binary data (unknown)
$ magika --label --prediction-mode medium-confidence your-file.xyz
your-file.xyz: unknown
$ magika --generate-report your-file.xyz
your-file.xyz: Unknown binary data (unknown)
########################################
### REPORT ###
########################################
2 changes: 1 addition & 1 deletion solution/text-extractor/extractors/binary.py
@@ -6,7 +6,7 @@


class BinaryExtractor(Extractor):
file_types = ("application/msword",)
file_types = ("doc",)

def _extract(self, file: bytes) -> str:
# Solution taken from here https://stackoverflow.com/questions/64397811/reading-a-doc-file-in-memory. Doc files
2 changes: 1 addition & 1 deletion solution/text-extractor/extractors/email.py
@@ -10,7 +10,7 @@


class EmailExtractor(Extractor):
file_types = ("message/rfc822",)
file_types = ("eml",)

def _extract_payload(self, message):
payload = message.get_payload()
3 changes: 1 addition & 2 deletions solution/text-extractor/extractors/excel.py
@@ -9,8 +9,7 @@


class ExcelExtractor(Extractor):
file_types = ("application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",)
extension = "xlsx"
file_types = ("xlsx",)

_valid_cell_types = ["s", "n", "b", "inlineStr", "str"]

8 changes: 4 additions & 4 deletions solution/text-extractor/extractors/extractor.py
@@ -7,10 +7,10 @@
ExtractorInitException,
)

from magika import Magika
from magika import Magika, PredictionMode

logger = logging.getLogger(__name__)
magika = Magika()
magika = Magika(prediction_mode=PredictionMode.MEDIUM_CONFIDENCE)


# Base class for text extraction
@@ -20,8 +20,8 @@ class Extractor:

@classmethod
def get_file_type(cls, file: bytes) -> str:
# Determine the file's MIME type using Google's Magika ML algorithm
return magika.identify_bytes(file).output.mime_type
# Determine the file's content type using Google's Magika ML algorithm
return magika.identify_bytes(file).output.ct_label

@classmethod
def get_extractor(cls, file_type: str, config: dict = {}) -> "Extractor":
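
To see why switching the prediction mode can change the detected type, here is a simplified, self-contained sketch of threshold-based label selection. It does not import or reproduce Magika: the `PredictionMode` enum below is a local stand-in that only mirrors the name used in the diff, all threshold values are made up, and the fallback to `txt` or `unknown` is an assumption based on the commit message (an HTML file reported as plain text) and the README output (`unknown`).

```python
from enum import Enum


class PredictionMode(Enum):
    # Local stand-in for illustration; not imported from magika.
    BEST_GUESS = "best-guess"
    MEDIUM_CONFIDENCE = "medium-confidence"
    HIGH_CONFIDENCE = "high-confidence"


# Hypothetical thresholds, chosen only to make the example concrete.
HIGH_THRESHOLDS = {"html": 0.95}
DEFAULT_HIGH_THRESHOLD = 0.9
MEDIUM_THRESHOLD = 0.5


def choose_label(label: str, score: float, is_text: bool, mode: PredictionMode) -> str:
    """Return the model's label if its score clears the mode's threshold,
    otherwise fall back to a generic label."""
    if mode is PredictionMode.BEST_GUESS:
        return label
    if mode is PredictionMode.HIGH_CONFIDENCE:
        threshold = HIGH_THRESHOLDS.get(label, DEFAULT_HIGH_THRESHOLD)
    else:
        threshold = MEDIUM_THRESHOLD
    if score >= threshold:
        return label
    # Below the threshold: report a generic type instead of the specific one.
    return "txt" if is_text else "unknown"
```

Under these assumed numbers, an HTML file scored at 0.7 is reported as `txt` in high-confidence mode but as `html` in medium-confidence mode, which matches the kind of misdetection the commit message describes.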
14 changes: 2 additions & 12 deletions solution/text-extractor/extractors/image.py
@@ -9,21 +9,11 @@


class ImageExtractor(Extractor):
file_types = (
"image/gif",
"image/bmp",
"image/x-ms-bmp",
"image/x-tga",
"image/webp",
# TODO: Uncomment these types if Magika adds JPEG 2000 support, or remove if they don't.
# These are not urgently needed file types.
# "image/jp2",
# "image/jpx",
)
file_types = ("gif", "bmp", "tga", "webp",)

def __init__(self, file_type: str, config: dict):
super().__init__(file_type, config)
self.extractor = Extractor.get_extractor("image/jpeg", config)
self.extractor = Extractor.get_extractor("jpeg", config)

def _extract(self, file: bytes) -> str:
image = Image.open(io.BytesIO(file))
6 changes: 1 addition & 5 deletions solution/text-extractor/extractors/markup.py
@@ -6,11 +6,7 @@


class MarkupExtractor(Extractor):
file_types = (
"text/html",
"text/xml",
"application/xml",
)
file_types = ("html", "xml")

def _extract(self, file: bytes) -> str:
warnings.filterwarnings("ignore", category=XMLParsedAsHTMLWarning) # Hide unnecessary warning about parsing XML
2 changes: 1 addition & 1 deletion solution/text-extractor/extractors/old_excel.py
@@ -4,7 +4,7 @@


class OldExcelExtractor(Extractor):
file_types = ("application/vnd.ms-excel",)
file_types = ("xls",)

def _extract(self, file: bytes) -> str:
file_path = self._write_file(file)
2 changes: 1 addition & 1 deletion solution/text-extractor/extractors/outlook.py
@@ -11,7 +11,7 @@


class OutlookExtractor(Extractor):
file_types = ("application/vnd.ms-outlook",)
file_types = ("outlook",)

def _handle_data(self, attachment: extract_msg.Attachment) -> str:
file_name = attachment.longFilename
4 changes: 2 additions & 2 deletions solution/text-extractor/extractors/pdf.py
@@ -10,12 +10,12 @@


class PdfExtractor(Extractor):
file_types = ("application/pdf",)
file_types = ("pdf",)
max_size = 20

def __init__(self, file_type: str, config: dict):
super().__init__(file_type, config)
self.extractor = Extractor.get_extractor("image/jpeg", config)
self.extractor = Extractor.get_extractor("jpeg", config)

def _convert_to_images(self, file: bytes, temp_dir: str) -> [str]:
logger.debug("Converting PDF file to images stored in a temporary directory.")
2 changes: 1 addition & 1 deletion solution/text-extractor/extractors/powerpoint.py
@@ -8,7 +8,7 @@


class PowerPointExtractor(Extractor):
file_types = ("application/vnd.openxmlformats-officedocument.presentationml.presentation",)
file_types = ("pptx",)

def _extract(self, file: bytes) -> str:
file_path = self._write_file(file)
5 changes: 1 addition & 4 deletions solution/text-extractor/extractors/rtf.py
@@ -4,10 +4,7 @@


class RichTextExtractor(Extractor):
file_types = (
"text/rtf",
"application/rtf",
)
file_types = ("rtf",)

def _extract(self, file: bytes) -> str:
return rtf_to_text(file.decode(), errors="replace")
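
For reference, the MIME type to content type replacements made across the extractor files above can be collected into one lookup table. The pairings are inferred from the old and new `file_types` values in each file, so treat the individual image pairings as a best reading of the diff. The `to_content_type` helper and its `"unknown"` fallback are hypothetical and not part of this commit.

```python
# Old MIME type -> new content type label, as inferred from the diffs above.
MIME_TO_CONTENT_TYPE = {
    "application/msword": "doc",
    "message/rfc822": "eml",
    "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet": "xlsx",
    "image/gif": "gif",
    "image/bmp": "bmp",
    "image/x-ms-bmp": "bmp",
    "image/x-tga": "tga",
    "image/webp": "webp",
    "image/jpeg": "jpeg",
    "text/html": "html",
    "text/xml": "xml",
    "application/xml": "xml",
    "application/vnd.ms-excel": "xls",
    "application/vnd.ms-outlook": "outlook",
    "application/pdf": "pdf",
    "application/vnd.openxmlformats-officedocument.presentationml.presentation": "pptx",
    "text/rtf": "rtf",
    "application/rtf": "rtf",
}


def to_content_type(mime_type: str) -> str:
    """Hypothetical helper: translate a legacy MIME type to a content type label."""
    return MIME_TO_CONTENT_TYPE.get(mime_type, "unknown")
```

Several MIME aliases collapse into a single label (for example both `text/rtf` and `application/rtf` become `rtf`), which is why the new `file_types` tuples are shorter than the old ones.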
11 changes: 6 additions & 5 deletions solution/text-extractor/extractors/tests/__init__.py
@@ -24,18 +24,19 @@ def _test_file_type(self, file_type, **kwargs):

with open(f"{self.BASE_PATH}{collection}/{variation}sample.{file_type}", "rb") as f:
sample = f.read()
with open(f"{self.BASE_PATH}{collection}/{variation}expected.txt", "rb") as f:
expected = f.read().decode()

# Determine the file's MIME type
mime_type = Extractor.get_file_type(sample)
# Determine the file's content type
file_type = Extractor.get_file_type(sample)

with patch("extractors.Extractor._extract_embedded", new=mock_extract_embedded):
extractor = Extractor.get_extractor(mime_type, config)
extractor = Extractor.get_extractor(file_type, config)
output = extractor.extract(sample)

# Uncomment these 2 lines to re-export fixture files the next time tests are run.
# with open(f"{self.BASE_PATH}{collection}/{variation}expected.txt", "w") as f:
# f.write(output)

with open(f"{self.BASE_PATH}{collection}/{variation}expected.txt", "rb") as f:
expected = f.read().decode()

self.assertEqual(output, expected)
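
The reordering in this hunk moves the read of `expected.txt` to after the optional regeneration step. A plausible reason, stated here as an inference rather than something the commit says, is that a freshly regenerated fixture is then the one compared against, and a missing `expected.txt` no longer fails before it can be written. A minimal sketch of that flow, using a hypothetical `check_fixture` helper and a `regenerate` flag in place of the commented-out lines:

```python
import tempfile
from pathlib import Path


def check_fixture(fixture_dir: Path, extract, regenerate: bool = False) -> bool:
    """Run `extract` on the sample file and compare it with the stored expectation."""
    sample = (fixture_dir / "sample.txt").read_bytes()
    output = extract(sample)

    expected_path = fixture_dir / "expected.txt"
    if regenerate:
        # Equivalent to uncommenting the two fixture-export lines in the test suite.
        expected_path.write_text(output, encoding="utf-8")

    # Read the expectation only after any regeneration has happened.
    expected = expected_path.read_bytes().decode("utf-8")
    return output == expected


# Usage: the first call creates expected.txt, the second verifies against it.
with tempfile.TemporaryDirectory() as tmp:
    fixtures = Path(tmp)
    (fixtures / "sample.txt").write_bytes(b"hello")
    created = check_fixture(fixtures, lambda data: data.decode().upper(), regenerate=True)
    verified = check_fixture(fixtures, lambda data: data.decode().upper())
```

As the README warns, leaving regeneration enabled makes every run pass trivially, so the flag should stay off outside of fixture updates.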