[Hotfix] Set Magika to use medium-confidence instead of high (#1200)

* Set Magika to use medium-confidence instead of the default high-confidence.
* Switch to using content types instead of MIME types. (Magika claims this is more reliable.)
* Remove JPEG 2000 support.
* Add unit test for medium-confidence HTML file (was misdetected as plain text in high-confidence mode).
* Fix gitleaks
Enterprise-CMCS · Mar 6, 2024 · 4ed4821
1 parent c377252

Showing 28 changed files with 4,486 additions and 85 deletions.
diff --git a/.gitleaks.toml b/.gitleaks.toml
@@ -6,6 +6,6 @@ useDefault = true
 [allowlist]
 description = "global allow list"
 paths = [
-    "solution/text-extractor/extractors/tests/fixtures/eml/sample.eml",
+    "solution/text-extractor/extractors/tests/fixtures",
 ]
 
diff --git a/solution/text-extractor/README.md b/solution/text-extractor/README.md
@@ -14,7 +14,7 @@
     4. [Unit testing your new extractor](#unit-testing-your-new-extractor)
     5. [Storing fixture files in a "collection"](#storing-fixture-files-in-a-collection)
     6. [Generating new fixture files](#generating-new-fixture-files)
-    7. [Fixing misdetected or unsupported MIME types](#fixing-misdetected-or-unsupported-mime-types)
+    7. [Determining file types and fixing misdetected ones](#determining-file-types-and-fixing-misdetected-ones)
 
 # About
 
@@ -178,7 +178,7 @@ from .sample import SampleExtractor as SampleExtractor  # Note the redundant ali
 .... etc ....
 ```
 
-The extractor is now registered and will be automatically instantiated when a file has one of the MIME types listed in `file_types`.
+The extractor is now registered and will be automatically instantiated when a file has one of the content types listed in `file_types`. See [Determining file types and fixing misdetected ones](#determining-file-types-and-fixing-misdetected-ones) for instructions on determining a file's content type.
 
 Note the underscore in front of the `_extract()` method definition. Be sure to override this instead of `extract()` because the latter performs pre-extraction checks, then calls `_extract()`.
 
@@ -302,18 +302,18 @@ You can also easily use the existing unit test suite to generate new fixture fil
 
 Be sure to re-comment these 2 lines when you're done, or fixture files will be re-created every time you run unit tests, which may produce undesired behavior.
 
-## Fixing misdetected or unsupported MIME types
+## Determining file types and fixing misdetected ones
 
-The text extractor uses [Google's Magika library](https://github.com/google/magika) for MIME type detection, which uses a machine learning algorithm that promises greater than 99% accuracy when detecting known file types. However, not all file types are supported and their model has to be trained to support them.
+The text extractor uses [Google's Magika library](https://github.com/google/magika) for content type detection, which uses a machine learning algorithm that promises greater than 99% accuracy when detecting known file types. However, not all file types are supported and their model has to be trained to support them.
 
 You can [open an issue](https://github.com/google/magika/issues) on their repository to report a misdetection or missing file type. To do so, install Magika on your machine so that you can generate a report, like so:
 
 ```shell
 $ pip install magika
-$ magika -i unknown_type.xyz
-unknown_type.xyz: application/octet-stream
-$ magika --generate-report unknown_type.xyz
-unknown_type.xyz: Unknown binary data (unknown)
+$ magika --label --prediction-mode medium-confidence your-file.xyz
+your-file.xyz: unknown
+$ magika --generate-report your-file.xyz
+your-file.xyz: Unknown binary data (unknown)
 ########################################
 ###              REPORT              ###
 ########################################

diff --git a/solution/text-extractor/extractors/binary.py b/solution/text-extractor/extractors/binary.py
@@ -6,7 +6,7 @@
 
 
 class BinaryExtractor(Extractor):
-    file_types = ("application/msword",)
+    file_types = ("doc",)
 
     def _extract(self, file: bytes) -> str:
         #  Solution taken from here https://stackoverflow.com/questions/64397811/reading-a-doc-file-in-memory.  Doc files

diff --git a/solution/text-extractor/extractors/email.py b/solution/text-extractor/extractors/email.py
@@ -10,7 +10,7 @@
 
 
 class EmailExtractor(Extractor):
-    file_types = ("message/rfc822",)
+    file_types = ("eml",)
 
     def _extract_payload(self, message):
         payload = message.get_payload()

diff --git a/solution/text-extractor/extractors/excel.py b/solution/text-extractor/extractors/excel.py
@@ -9,8 +9,7 @@
 
 
 class ExcelExtractor(Extractor):
-    file_types = ("application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",)
-    extension = "xlsx"
+    file_types = ("xlsx",)
 
     _valid_cell_types = ["s", "n", "b", "inlineStr", "str"]
 

diff --git a/solution/text-extractor/extractors/extractor.py b/solution/text-extractor/extractors/extractor.py
@@ -7,10 +7,10 @@
     ExtractorInitException,
 )
 
-from magika import Magika
+from magika import Magika, PredictionMode
 
 logger = logging.getLogger(__name__)
-magika = Magika()
+magika = Magika(prediction_mode=PredictionMode.MEDIUM_CONFIDENCE)
 
 
 # Base class for text extraction
@@ -20,8 +20,8 @@ class Extractor:
 
     @classmethod
     def get_file_type(cls, file: bytes) -> str:
-        # Determine the file's MIME type using Google's Magika ML algorithm
-        return magika.identify_bytes(file).output.mime_type
+        # Determine the file's content type using Google's Magika ML algorithm
+        return magika.identify_bytes(file).output.ct_label
 
     @classmethod
     def get_extractor(cls, file_type: str, config: dict = {}) -> "Extractor":
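
The hunk above changes what `get_file_type()` returns, from a MIME type to a Magika content-type label, so the `file_types` tuples in each extractor now hold short labels such as `"pdf"` or `"html"`. The body of `get_extractor()` is not shown in this diff. The following is a minimal sketch of label-based dispatch, assuming subclasses are discovered through `__subclasses__()` and that an unknown label raises `ValueError`; both are assumptions for illustration, not the repository's code.

```python
from typing import Dict, Optional, Tuple, Type


class Extractor:
    """Hypothetical stand-in for the repository's base class."""

    # Content-type labels (not MIME types) handled by this extractor.
    file_types: Tuple[str, ...] = ()

    def __init__(self, file_type: str, config: dict):
        self.file_type = file_type
        self.config = config

    @classmethod
    def registry(cls) -> Dict[str, Type["Extractor"]]:
        # Map every declared label to the direct subclass that declares it.
        return {
            label: sub
            for sub in cls.__subclasses__()
            for label in sub.file_types
        }

    @classmethod
    def get_extractor(cls, file_type: str, config: Optional[dict] = None) -> "Extractor":
        registry = cls.registry()
        if file_type not in registry:
            # Assumption: the real code may raise a project-specific exception.
            raise ValueError(f"No extractor registered for {file_type!r}")
        return registry[file_type](file_type, config or {})


class MarkupExtractor(Extractor):
    file_types = ("html", "xml")


class PdfExtractor(Extractor):
    file_types = ("pdf",)


if __name__ == "__main__":
    extractor = Extractor.get_extractor("xml")
    print(type(extractor).__name__, extractor.file_type)
```

With short labels, one entry such as `"xml"` covers what previously needed both `"text/xml"` and `"application/xml"`, which is why several `file_types` tuples in this commit shrink.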

diff --git a/solution/text-extractor/extractors/image.py b/solution/text-extractor/extractors/image.py
@@ -9,21 +9,11 @@
 
 
 class ImageExtractor(Extractor):
-    file_types = (
-        "image/gif",
-        "image/bmp",
-        "image/x-ms-bmp",
-        "image/x-tga",
-        "image/webp",
-        # TODO: Uncomment these types if Magika adds JPEG 2000 support, or remove if they don't.
-        # These are not urgently needed file types.
-        # "image/jp2",
-        # "image/jpx",
-    )
+    file_types = ("gif", "bmp", "tga", "webp",)
 
     def __init__(self, file_type: str, config: dict):
         super().__init__(file_type, config)
-        self.extractor = Extractor.get_extractor("image/jpeg", config)
+        self.extractor = Extractor.get_extractor("jpeg", config)
 
     def _extract(self, file: bytes) -> str:
         image = Image.open(io.BytesIO(file))

diff --git a/solution/text-extractor/extractors/markup.py b/solution/text-extractor/extractors/markup.py
@@ -6,11 +6,7 @@
 
 
 class MarkupExtractor(Extractor):
-    file_types = (
-        "text/html",
-        "text/xml",
-        "application/xml",
-    )
+    file_types = ("html", "xml")
 
     def _extract(self, file: bytes) -> str:
         warnings.filterwarnings("ignore", category=XMLParsedAsHTMLWarning)  # Hide unnecessary warning about parsing XML

diff --git a/solution/text-extractor/extractors/old_excel.py b/solution/text-extractor/extractors/old_excel.py
@@ -4,7 +4,7 @@
 
 
 class OldExcelExtractor(Extractor):
-    file_types = ("application/vnd.ms-excel",)
+    file_types = ("xls",)
 
     def _extract(self, file: bytes) -> str:
         file_path = self._write_file(file)

diff --git a/solution/text-extractor/extractors/outlook.py b/solution/text-extractor/extractors/outlook.py
@@ -11,7 +11,7 @@
 
 
 class OutlookExtractor(Extractor):
-    file_types = ("application/vnd.ms-outlook",)
+    file_types = ("outlook",)
 
     def _handle_data(self, attachment: extract_msg.Attachment) -> str:
         file_name = attachment.longFilename

diff --git a/solution/text-extractor/extractors/pdf.py b/solution/text-extractor/extractors/pdf.py
@@ -10,12 +10,12 @@
 
 
 class PdfExtractor(Extractor):
-    file_types = ("application/pdf",)
+    file_types = ("pdf",)
     max_size = 20
 
     def __init__(self, file_type: str, config: dict):
         super().__init__(file_type, config)
-        self.extractor = Extractor.get_extractor("image/jpeg", config)
+        self.extractor = Extractor.get_extractor("jpeg", config)
 
     def _convert_to_images(self, file: bytes, temp_dir: str) -> [str]:
         logger.debug("Converting PDF file to images stored in a temporary directory.")

diff --git a/solution/text-extractor/extractors/powerpoint.py b/solution/text-extractor/extractors/powerpoint.py
@@ -8,7 +8,7 @@
 
 
 class PowerPointExtractor(Extractor):
-    file_types = ("application/vnd.openxmlformats-officedocument.presentationml.presentation",)
+    file_types = ("pptx",)
 
     def _extract(self, file: bytes) -> str:
         file_path = self._write_file(file)

diff --git a/solution/text-extractor/extractors/rtf.py b/solution/text-extractor/extractors/rtf.py
@@ -4,10 +4,7 @@
 
 
 class RichTextExtractor(Extractor):
-    file_types = (
-        "text/rtf",
-        "application/rtf",
-    )
+    file_types = ("rtf",)
 
     def _extract(self, file: bytes) -> str:
         return rtf_to_text(file.decode(), errors="replace")
diff --git a/solution/text-extractor/extractors/tests/__init__.py b/solution/text-extractor/extractors/tests/__init__.py
@@ -24,18 +24,19 @@ def _test_file_type(self, file_type, **kwargs):
 
         with open(f"{self.BASE_PATH}{collection}/{variation}sample.{file_type}", "rb") as f:
             sample = f.read()
-        with open(f"{self.BASE_PATH}{collection}/{variation}expected.txt", "rb") as f:
-            expected = f.read().decode()
 
-        # Determine the file's MIME type
-        mime_type = Extractor.get_file_type(sample)
+        # Determine the file's content type
+        file_type = Extractor.get_file_type(sample)
 
         with patch("extractors.Extractor._extract_embedded", new=mock_extract_embedded):
-            extractor = Extractor.get_extractor(mime_type, config)
+            extractor = Extractor.get_extractor(file_type, config)
             output = extractor.extract(sample)
 
         # Uncomment these 2 lines to re-export fixture files the next time tests are run.
         # with open(f"{self.BASE_PATH}{collection}/{variation}expected.txt", "w") as f:
         #     f.write(output)
 
+        with open(f"{self.BASE_PATH}{collection}/{variation}expected.txt", "rb") as f:
+            expected = f.read().decode()
+
         self.assertEqual(output, expected)
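
One plausible reading of the reordering in this hunk: with `expected.txt` read after the (normally commented-out) write, a missing or stale fixture can be regenerated before it is compared. The sketch below illustrates that ordering in isolation. The helper name `check_against_fixture` and its `regenerate` flag are hypothetical and stand in for uncommenting the two lines in the test helper.

```python
import tempfile
from pathlib import Path


def check_against_fixture(output: str, expected_path: Path, regenerate: bool = False) -> bool:
    """Compare extractor output with a stored fixture file."""
    if regenerate:
        # Equivalent to uncommenting the two "re-export" lines in the test.
        expected_path.write_text(output, encoding="utf-8")

    # Read the fixture only after the optional write, so a freshly
    # generated file is what gets compared.
    expected = expected_path.read_bytes().decode()
    return output == expected


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        fixture = Path(tmp) / "expected.txt"
        # The fixture does not exist yet, so create it on the first run.
        print(check_against_fixture("hello", fixture, regenerate=True))
        # Later runs compare against the stored fixture.
        print(check_against_fixture("hello", fixture))
        print(check_against_fixture("changed", fixture))
```

Had the read come first, the first call would fail with `FileNotFoundError` before the fixture could be written.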