Custom ML model usage and ml_providers list #587

Merged — 141 commits, Aug 21, 2024

Commits
4a64d98
Filters optimisation
babenek Jul 23, 2024
49e82ce
style
babenek Jul 23, 2024
fdd483b
unicode cases in filter
babenek Jul 23, 2024
21fd81f
tmp markup loan
babenek Jul 23, 2024
97ef85b
BM scores fix
babenek Jul 24, 2024
f851904
docs upd
babenek Jul 24, 2024
ec48914
removed unused filter
babenek Jul 24, 2024
70b6b34
[skip actions] [ml] 2024-07-24T13:15:59+03:00
babenek Jul 24, 2024
9434d62
retrain
babenek Jul 24, 2024
72aed7f
BMcscor
babenek Jul 24, 2024
ec5d03b
retrain
babenek Jul 24, 2024
5122e45
custom ref
babenek Jul 24, 2024
4a39e22
testfix
babenek Jul 24, 2024
0bf1643
docfix
babenek Jul 24, 2024
898a252
doc upd
babenek Jul 25, 2024
dcd3a9e
uuid
babenek Jul 25, 2024
bbb2f3c
uuid use_ml: true to skip extra others
babenek Jul 26, 2024
680792b
[skip actions] [uuid] 2024-07-26T12:48:46+03:00
babenek Jul 26, 2024
36b6d13
[skip actions] [uuid] 2024-07-26T13:30:49+03:00
babenek Jul 26, 2024
d5ff926
[skip actions] [uuid] 2024-07-26T13:36:49+03:00
babenek Jul 26, 2024
8b3a71b
retrain
babenek Jul 26, 2024
c686b69
fix
babenek Jul 26, 2024
92858b6
[no ci] upd2
babenek Jul 25, 2024
3dbd263
Merge branch 'main' into auxiliary
babenek Aug 7, 2024
6d0ab82
test counters fix
babenek Aug 7, 2024
fdb5ee7
reduce whitespaces during extracting subtext
babenek Aug 7, 2024
7132398
Merge branch 'main' into auxiliary
babenek Aug 7, 2024
2c21919
aux BM ref
babenek Aug 7, 2024
d7f65c1
BM scores fix
babenek Aug 7, 2024
59e9c21
Merge branch 'auxiliary' into uuid
babenek Aug 7, 2024
4c8a10b
fix
babenek Aug 7, 2024
29ae9a5
[skip actions] [uuid] 2024-08-07T16:37:15+03:00
babenek Aug 7, 2024
c9cfeb9
rollback ML model
babenek Aug 7, 2024
3723700
Merge branch 'uuid' into ml
babenek Aug 7, 2024
9aa6387
extension_input_vstack
babenek Aug 7, 2024
cdfc898
retrain with file_type
babenek Aug 7, 2024
b43accd
testfix
babenek Aug 7, 2024
1d35367
Rollback BM
babenek Aug 7, 2024
87a63e8
Merge branch 'auxiliary' into uuid
babenek Aug 7, 2024
a02043a
Merge branch 'uuid' into ml
babenek Aug 7, 2024
95d12f3
Rollback BM
babenek Aug 7, 2024
9b8f3d7
[no ci] BM scor fix
babenek Aug 8, 2024
68f177d
UUID does not require ml yet
babenek Aug 8, 2024
c4f725e
retrain
babenek Aug 8, 2024
660fc44
JWT fix
babenek Aug 8, 2024
536a7f3
customBMref
babenek Aug 8, 2024
6d072d4
JWT fix BC scor
babenek Aug 8, 2024
16bb62c
Merge branch 'auxiliary' into uuid
babenek Aug 8, 2024
b478b90
[skip actions] [ml] 2024-08-08T11:52:05+03:00
babenek Aug 8, 2024
e16f060
Merge branch 'uuid' into ml
babenek Aug 8, 2024
1544ab1
[skip actions] [ml] 2024-08-08T12:48:05+03:00
babenek Aug 8, 2024
cc1563b
[skip actions] [ml] 2024-08-08T13:10:05+03:00
babenek Aug 8, 2024
6b7432a
retrain
babenek Aug 8, 2024
226004e
style
babenek Aug 8, 2024
5eac055
ML_FILE_TYPE=12
babenek Aug 8, 2024
23d3abb
BM scores fix
babenek Aug 8, 2024
b803e78
Merge branch 'main' into auxiliary
babenek Aug 9, 2024
df5fb95
Merge branch 'auxiliary' into ml
babenek Aug 9, 2024
3c4446e
[skip actions] [ml] 2024-08-09T10:05:08+03:00
babenek Aug 9, 2024
6450440
UUID pattern added
babenek Aug 9, 2024
9fbdf9b
BM scores fix
babenek Aug 9, 2024
52e4fe4
[skip actions] [uuid] 2024-08-10T10:05:05+03:00
babenek Aug 10, 2024
39c3dcd
Merge branch 'uuid' into ml
babenek Aug 10, 2024
eb0f799
[skip actions] [ml] 2024-08-11T00:01:07+03:00
babenek Aug 10, 2024
8786414
square bracket workaround in keywort regex
babenek Aug 10, 2024
672342a
path filter
babenek Aug 11, 2024
ac6ee1a
BM score fix
babenek Aug 11, 2024
1eabc31
Merge branch 'auxiliary' into ml
babenek Aug 11, 2024
54766e8
[skip actions] [ml] 2024-08-11T12:07:40+03:00
babenek Aug 11, 2024
8f4848d
[skip actions] [ml] 2024-08-11T12:09:15+03:00
babenek Aug 11, 2024
97bd8b3
[skip actions] [ml] 2024-08-11T12:28:18+03:00
babenek Aug 11, 2024
c8c5ec6
[skip actions] [ml] 2024-08-11T13:53:46+03:00
babenek Aug 11, 2024
f7c9ea0
[skip actions] [ml] 2024-08-11T21:40:36+03:00
babenek Aug 11, 2024
db5161d
[skip actions] [ml] 2024-08-11T21:41:54+03:00
babenek Aug 11, 2024
e8dea42
0.92
babenek Aug 11, 2024
3497f06
ValueStringTypeCheck workaround for heterogenous source
babenek Aug 12, 2024
6c10bf6
wrap added to filter array definitions
babenek Aug 12, 2024
abc980c
TOML format sanitizer
babenek Aug 12, 2024
ddbda1a
YAML case
babenek Aug 12, 2024
164cdfd
BM fix
babenek Aug 12, 2024
b55af99
BM scores fix
babenek Aug 12, 2024
8fed98d
[skip actions] [ml] 2024-08-12T19:08:33+03:00
babenek Aug 12, 2024
137f6b2
[skip actions] [subhashtext] 2024-08-12T21:32:30+03:00
babenek Aug 12, 2024
ea404b3
variable is hashed too
babenek Aug 12, 2024
a076394
hash & subtext test
babenek Aug 12, 2024
04a15c3
testBM
babenek Aug 12, 2024
e271544
updBMscor
babenek Aug 12, 2024
d06c8a9
refactoring
babenek Aug 13, 2024
7930ff6
skip f* in BM experiment
babenek Aug 13, 2024
9851fb2
keep 0*-3* meta for experiment
babenek Aug 13, 2024
530b16e
less repos in test
babenek Aug 13, 2024
37d386d
refactoring2
babenek Aug 13, 2024
653f10b
read_text.cache_clear()
babenek Aug 13, 2024
ea871c6
--subtext in benchmark
babenek Aug 13, 2024
3bb24a0
[skip actions] [subhashtext] 2024-08-13T12:52:11+03:00
babenek Aug 13, 2024
bf4eb64
[skip actions] [subhashtext] 2024-08-13T12:55:14+03:00
babenek Aug 13, 2024
1a09f85
fix
babenek Aug 13, 2024
0fd6924
[skip actions] [ml] 2024-08-13T13:13:43+03:00
babenek Aug 13, 2024
ae76b69
[skip actions] [ml] 2024-08-13T13:26:20+03:00
babenek Aug 13, 2024
f2f40f0
[skip actions] [ml] 2024-08-13T14:04:12+03:00
babenek Aug 13, 2024
8c6c30d
subtext
babenek Aug 13, 2024
2e20a2b
[skip actions] [ml] 2024-08-13T17:00:25+03:00
babenek Aug 13, 2024
09f813d
Merge branch 'main' into auxiliary
babenek Aug 14, 2024
1be4e7c
[skip actions] [subhashtext] 2024-08-14T07:46:29+03:00
babenek Aug 14, 2024
ff705d2
[skip actions] [ml] 2024-08-14T07:49:09+03:00
babenek Aug 14, 2024
feeefc3
experiment ml rollback
babenek Aug 14, 2024
95e0b1a
BM scores with hashes
babenek Aug 14, 2024
0ce84fc
some rollbacks
babenek Aug 14, 2024
4582016
Merge branch 'subhashtext' into ml
babenek Aug 14, 2024
f2f44f6
Merge branch 'main' into subhashtext
babenek Aug 14, 2024
e57f6cd
Merge branch 'subhashtext' into ml
babenek Aug 14, 2024
09bd176
[skip actions] [ml] 2024-08-14T14:09:10+03:00
babenek Aug 14, 2024
1cf645c
Rollback file type
babenek Aug 14, 2024
2f2633c
Merge branch 'main' into ml
babenek Aug 15, 2024
d3c2bc8
custom ref BM
babenek Aug 16, 2024
7750d43
upd
babenek Aug 16, 2024
d36c51d
Merge branch 'auxiliary' into ml
babenek Aug 16, 2024
c041f9f
[skip actions] [ml] 2024-08-16T13:49:10+03:00
babenek Aug 16, 2024
6a5ffe3
ml_model integrity
babenek Aug 16, 2024
abd2557
test fix
babenek Aug 16, 2024
997845a
after corrections
babenek Aug 16, 2024
6993571
BMscorUPD
babenek Aug 16, 2024
70f7140
[skip actions] [ml] 2024-08-16T18:44:01+03:00
babenek Aug 16, 2024
4096b41
[skip actions] [auxiliary] 2024-08-17T09:35:51+03:00
babenek Aug 17, 2024
4062caa
retrain
babenek Aug 18, 2024
730aa75
Merge branch 'auxiliary' into ml
babenek Aug 18, 2024
171e3aa
testfix
babenek Aug 18, 2024
75df1c8
style
babenek Aug 19, 2024
2d2db83
md5sum of ml model
babenek Aug 19, 2024
7674677
model config check
babenek Aug 19, 2024
4bb5de8
custom ref rollback
babenek Aug 19, 2024
8a5c59d
Scores fix
babenek Aug 19, 2024
899cd9b
Move cicd dir to .ci as not sourcecode related
babenek Aug 19, 2024
bcfb708
Missed dir added
babenek Aug 19, 2024
c163427
Merge branch 'auxiliary' into ml [no ci]
babenek Aug 19, 2024
ad34a2c
Merge branch 'main' into auxiliary
babenek Aug 19, 2024
1d5cc32
Merge branch 'auxiliary' into ml [no ci]
babenek Aug 19, 2024
90382b4
keep md5 in train log
babenek Aug 20, 2024
125968d
External ML config and model may be used
babenek Aug 20, 2024
b7391ec
replaced --azure and --cuda arguments to --ml_providers
babenek Aug 20, 2024
d517359
style
babenek Aug 20, 2024
2 changes: 1 addition & 1 deletion .github/workflows/benchmark.yml
@@ -422,7 +422,7 @@ jobs:
           # crc32 should be changed
           python -m credsweeper --banner
           # run quick scan
-          python -m credsweeper --log debug --path ../tests/samples --save-json
+          python -m credsweeper --ml_providers AzureExecutionProvider,CPUExecutionProvider --log debug --path ../tests/samples --save-json
           NEW_MODEL_FOUND_SAMPLES=$(jq '.|length' output.json)
           if [ 10 -gt ${NEW_MODEL_FOUND_SAMPLES} ]; then
               echo "Failure: found ${NEW_MODEL_FOUND_SAMPLES} credentials"
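The workflow gates the quick scan on the number of findings in the saved report via `jq '.|length'`. The same count can be taken in Python — a minimal sketch (the report name `output.json` comes from the workflow; the threshold of 10 mirrors the shell test):

```python
import json

def found_samples(json_path: str) -> int:
    # credsweeper --save-json writes a JSON array of findings,
    # so the sample count is simply the array length (jq '.|length')
    with open(json_path, "r", encoding="utf-8") as f:
        return len(json.load(f))
```

Usage mirroring `if [ 10 -gt ${NEW_MODEL_FOUND_SAMPLES} ]`: fail when `found_samples("output.json") < 10`.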
17 changes: 9 additions & 8 deletions .github/workflows/check.yml
@@ -27,6 +27,15 @@ jobs:
           fetch-depth: 0
           ref: ${{ github.event.pull_request.head.sha }}

+      # # # ml_config & ml_model integrity
+
+      - name: Check ml_model.onnx integrity
+        if: ${{ always() && steps.code_checkout.conclusion == 'success' }}
+        run: |
+          md5sum --binary credsweeper/ml_model/ml_config.json | grep 2b29c5e1aa199d14b788652bd542c7c0
+          md5sum --binary credsweeper/ml_model/ml_model.onnx | grep 88f37978fc0599ac8d1bf732ad40c077
+
       # # # line ending

       - name: Check for text file ending
@@ -53,14 +62,6 @@
           done
           exit ${n}

-      # # # ml_model integrity
-
-      - name: Check ml_model.onnx integrity
-        if: ${{ always() && steps.code_checkout.conclusion == 'success' }}
-        run: |
-          md5sum --binary credsweeper/ml_model/ml_model.onnx | grep 88f37978fc0599ac8d1bf732ad40c077
-          md5sum --binary credsweeper/ml_model/model_config.json | grep 2b29c5e1aa199d14b788652bd542c7c0
-
       # # # Python setup

       - name: Set up Python
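The CI step pins the MD5 digests of `ml_config.json` and `ml_model.onnx` with `md5sum | grep`. An equivalent check in Python is a short `hashlib` sketch (the concrete paths and digest values would come from the workflow above):

```python
import hashlib
from pathlib import Path

def md5_digest(path: Path) -> str:
    """Compute the hex MD5 digest of a file, reading in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()

def check_integrity(expected: dict) -> bool:
    """Return True only when every file matches its pinned digest."""
    return all(md5_digest(Path(p)) == h for p, h in expected.items())
```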
38 changes: 24 additions & 14 deletions credsweeper/__main__.py
@@ -117,7 +117,6 @@ def get_arguments() -> Namespace:
                         dest="export_log_config",
                         metavar="PATH")
     parser.add_argument("--rules",
-                        nargs="?",
                         help="path of rule config file (default: credsweeper/rules/config.yaml). "
                         f"severity:{[i.value for i in Severity]} "
                         f"type:{[i.value for i in RuleType]}",
@@ -131,13 +130,11 @@
                         dest="severity",
                         type=severity_levels)
     parser.add_argument("--config",
-                        nargs="?",
                         help="use custom config (default: built-in)",
                         default=None,
                         dest="config_path",
                         metavar="PATH")
     parser.add_argument("--log_config",
-                        nargs="?",
                         help="use custom log config (default: built-in)",
                         default=None,
                         dest="log_config_path",
@@ -178,15 +175,27 @@ def get_arguments() -> Namespace:
                         default=16,
                         required=False,
                         metavar="POSITIVE_INT")
-    ml_provider_group = parser.add_mutually_exclusive_group()
-    ml_provider_group.add_argument("--azure",
-                                   help="enable AzureExecutionProvider for onnx",
-                                   dest="azure",
-                                   action="store_true")
-    ml_provider_group.add_argument("--cuda",
-                                   help="enable CUDAExecutionProvider for onnx",
-                                   dest="cuda",
-                                   action="store_true")
+    parser.add_argument("--ml_config",
+                        help="use external config for ml model",
+                        type=str,
+                        default=None,
+                        dest="ml_config",
+                        required=False,
+                        metavar="PATH")
+    parser.add_argument("--ml_model",
+                        help="use external ml model",
+                        type=str,
+                        default=None,
+                        dest="ml_model",
+                        required=False,
+                        metavar="PATH")
+    parser.add_argument("--ml_providers",
+                        help="comma separated list of providers for onnx (CPUExecutionProvider is used by default)",
+                        type=str,
+                        default=None,
+                        dest="ml_providers",
+                        required=False,
+                        metavar="STR")
     parser.add_argument("--api_validation",
                         help="add credential api validation option to credsweeper pipeline. "
                         "External API is used to reduce FP for some rule types.",
@@ -297,8 +306,9 @@ def scan(args: Namespace, content_provider: AbstractProvider, json_filename: Opt
                        pool_count=args.jobs,
                        ml_batch_size=args.ml_batch_size,
                        ml_threshold=args.ml_threshold,
-                       azure=args.azure,
-                       cuda=args.cuda,
+                       ml_config=args.ml_config,
+                       ml_model=args.ml_model,
+                       ml_providers=args.ml_providers,
                        find_by_ext=args.find_by_ext,
                        depth=args.depth,
                        doc=args.doc,
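The key CLI change above: the mutually exclusive boolean flags `--azure`/`--cuda` are replaced by one free-form `--ml_providers` string plus `--ml_config`/`--ml_model` paths. A reduced sketch of just these options (the `prog` name is hypothetical; defaults and metavars follow the diff):

```python
from argparse import ArgumentParser, Namespace

def get_ml_arguments(argv=None) -> Namespace:
    parser = ArgumentParser(prog="credsweeper-sketch")
    # one free-form list replaces the old mutually exclusive --azure / --cuda pair
    parser.add_argument("--ml_providers",
                        help="comma separated list of providers for onnx "
                             "(CPUExecutionProvider is used by default)",
                        type=str,
                        default=None,
                        metavar="STR")
    parser.add_argument("--ml_config", default=None, metavar="PATH",
                        help="use external config for ml model")
    parser.add_argument("--ml_model", default=None, metavar="PATH",
                        help="use external ml model")
    return parser.parse_args(argv)
```

With a plain string argument, any provider combination can be requested without adding a new flag per provider.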
20 changes: 15 additions & 5 deletions credsweeper/app.py
@@ -49,8 +49,9 @@ def __init__(self,
                  pool_count: int = 1,
                  ml_batch_size: Optional[int] = None,
                  ml_threshold: Union[float, ThresholdPreset] = ThresholdPreset.medium,
-                 azure: bool = False,
-                 cuda: bool = False,
+                 ml_config: Union[None, str, Path] = None,
+                 ml_model: Union[None, str, Path] = None,
+                 ml_providers: Optional[str] = None,
                  find_by_ext: bool = False,
                  depth: int = 0,
                  doc: bool = False,
@@ -78,6 +79,9 @@ def __init__(self,
             pool_count: int value, number of parallel processes to use
             ml_batch_size: int value, size of the batch for model inference
             ml_threshold: float or string value to specify threshold for the ml model
+            ml_config: str or Path to set custom config of ml model
+            ml_model: str or Path to set custom ml model
+            ml_providers: str - comma separated list with providers
             find_by_ext: boolean - files will be reported by extension
             depth: int - how deep container files will be scanned
             doc: boolean - document-specific scanning
@@ -113,8 +117,9 @@ def __init__(self,
         self.sort_output = sort_output
         self.ml_batch_size = ml_batch_size if ml_batch_size and 0 < ml_batch_size else 16
         self.ml_threshold = ml_threshold
-        self.azure = azure
-        self.cuda = cuda
+        self.ml_config = ml_config
+        self.ml_model = ml_model
+        self.ml_providers = ml_providers
         self.ml_validator = None
         self.__log_level = log_level

@@ -187,7 +192,12 @@ def ml_validator(self) -> MlValidator:
         """ml_validator getter"""
         from credsweeper.ml_model import MlValidator
         if not self.__ml_validator:
-            self.__ml_validator: MlValidator = MlValidator(threshold=self.ml_threshold)
+            self.__ml_validator: MlValidator = MlValidator(
+                threshold=self.ml_threshold,  #
+                ml_config=self.ml_config,  #
+                ml_model=self.ml_model,  #
+                ml_providers=self.ml_providers,  #
+            )
         assert self.__ml_validator, "self.__ml_validator was not initialized"
         return self.__ml_validator
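The `ml_validator` getter above builds the validator lazily: the ONNX session is only loaded on first access, so scans that never reach ML scoring pay nothing. A stripped-down sketch of that pattern (class name and the `init_count` instrumentation are hypothetical, added just to make the laziness observable):

```python
from typing import Optional

class LazyValidatorHolder:
    """Sketch of the lazy-getter pattern used for ml_validator:
    the heavy object is constructed on first access only."""

    def __init__(self) -> None:
        self._validator: Optional[object] = None
        self.init_count = 0  # instrumentation for the sketch

    @property
    def ml_validator(self) -> object:
        if self._validator is None:
            self.init_count += 1  # expensive MlValidator(...) construction would happen here
            self._validator = object()
        return self._validator
```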
56 changes: 39 additions & 17 deletions credsweeper/ml_model/ml_validator.py
@@ -1,7 +1,8 @@
+import hashlib
 import logging
 import os
 import string
-from typing import List, Tuple, Union
+from pathlib import Path
+from typing import List, Tuple, Union, Optional

 import numpy as np
 import onnxruntime as ort
@@ -21,35 +22,56 @@ class MlValidator:
     CHAR_INDEX = {char: index for index, char in enumerate('\0' + string.printable + NON_ASCII)}
     NUM_CLASSES = len(CHAR_INDEX)

-    def __init__(self, threshold: Union[float, ThresholdPreset], azure: bool = False, cuda: bool = False) -> None:
+    def __init__(
+            self,  #
+            threshold: Union[float, ThresholdPreset],  #
+            ml_config: Union[None, str, Path] = None,  #
+            ml_model: Union[None, str, Path] = None,  #
+            ml_providers: Optional[str] = None) -> None:
         """Init

         Args:
             threshold: decision threshold
+            ml_config: path to ml config
+            ml_model: path to ml model
+            ml_providers: coma separated list of providers https://onnxruntime.ai/docs/execution-providers/
         """
-        dir_path = os.path.dirname(os.path.realpath(__file__))
-        model_file_path = os.path.join(dir_path, "ml_model.onnx")
-        if azure:
-            provider = "AzureExecutionProvider"
-        elif cuda:
-            provider = "CUDAExecutionProvider"
+        dir_path = Path(__file__).parent
+
+        if ml_config:
+            ml_config_path = Path(ml_config)
+        else:
+            ml_config_path = dir_path / "ml_config.json"
+        with open(ml_config_path, "rb") as f:
+            md5_config = hashlib.md5(f.read()).hexdigest()
+
+        if ml_model:
+            ml_model_path = Path(ml_model)
+        else:
+            ml_model_path = dir_path / "ml_model.onnx"
+        with open(ml_model_path, "rb") as f:
+            md5_model = hashlib.md5(f.read()).hexdigest()
+
+        if ml_providers:
+            providers = ml_providers.split(',')
         else:
-            provider = "CPUExecutionProvider"
-        self.model_session = ort.InferenceSession(model_file_path, providers=[provider])
+            providers = ["CPUExecutionProvider"]
+        self.model_session = ort.InferenceSession(ml_model_path, providers=providers)

-        model_details = Util.json_load(os.path.join(dir_path, "model_config.json"))
+        model_config = Util.json_load(ml_config_path)
         if isinstance(threshold, float):
             self.threshold = threshold
-        elif isinstance(threshold, ThresholdPreset) and "thresholds" in model_details:
-            self.threshold = model_details["thresholds"][threshold.value]
+        elif isinstance(threshold, ThresholdPreset) and "thresholds" in model_config:
+            self.threshold = model_config["thresholds"][threshold.value]
         else:
             self.threshold = 0.5

         self.common_feature_list = []
         self.unique_feature_list = []
-        logger.info("Init ML validator, model file path: %s", model_file_path)
-        logger.debug("ML validator details: %s", model_details)
-        for feature_definition in model_details["features"]:
+        logger.info("Init ML validator with %s provider; config:'%s' md5:%s model:'%s' md5:%s", providers,
+                    ml_config_path, md5_config, ml_model_path, md5_model)
+        logger.debug("ML validator details: %s", model_config)
+        for feature_definition in model_config["features"]:
             feature_class = feature_definition["type"]
             kwargs = feature_definition.get("kwargs", {})
             feature_constructor = getattr(features, feature_class, None)
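The new constructor turns the `--ml_providers` string into the provider list onnxruntime expects, falling back to `CPUExecutionProvider`. That resolution step isolated as a sketch — note the whitespace stripping here is an addition of this sketch, not in the PR, which uses a plain `split(',')`:

```python
from typing import List, Optional

def resolve_providers(ml_providers: Optional[str]) -> List[str]:
    """Turn a comma separated provider string into the list passed to
    ort.InferenceSession(..., providers=...), defaulting to CPU."""
    if ml_providers:
        # strip whitespace and drop empty entries for robustness (sketch-only)
        return [p.strip() for p in ml_providers.split(",") if p.strip()]
    return ["CPUExecutionProvider"]
```

onnxruntime tries the providers in order, so listing `AzureExecutionProvider,CPUExecutionProvider` keeps CPU as the fallback.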
21 changes: 13 additions & 8 deletions docs/source/guide.rst
@@ -13,9 +13,13 @@ Get all argument list:

 .. code-block:: text

-    usage: python -m credsweeper [-h] (--path PATH [PATH ...] | --diff_path PATH [PATH ...] | --export_config [PATH] | --export_log_config [PATH]) [--rules [PATH]] [--severity SEVERITY] [--config [PATH]]
-                                 [--log_config [PATH]] [--denylist PATH] [--find-by-ext] [--depth POSITIVE_INT] [--no-filters] [--doc] [--ml_threshold FLOAT_OR_STR] [--ml_batch_size POSITIVE_INT]
-                                 [--azure | --cuda] [--api_validation] [--jobs POSITIVE_INT] [--skip_ignored] [--save-json [PATH]] [--save-xlsx [PATH]] [--hashed] [--subtext] [--sort] [--log LOG_LEVEL] [--size_limit SIZE_LIMIT]
+    usage: python -m credsweeper [-h] (--path PATH [PATH ...] | --diff_path PATH [PATH ...] | --export_config [PATH] | --export_log_config [PATH])
+                                 [--rules PATH] [--severity SEVERITY] [--config PATH] [--log_config PATH] [--denylist PATH]
+                                 [--find-by-ext] [--depth POSITIVE_INT] [--no-filters] [--doc] [--ml_threshold FLOAT_OR_STR]
+                                 [--ml_batch_size POSITIVE_INT] [--ml_config PATH] [--ml_model PATH] [--ml_providers STR]
+                                 [--api_validation] [--jobs POSITIVE_INT] [--skip_ignored] [--save-json [PATH]]
+                                 [--save-xlsx [PATH]] [--hashed] [--subtext] [--sort] [--log LOG_LEVEL]
+                                 [--size_limit SIZE_LIMIT]
                                  [--banner] [--version]
     options:
       -h, --help show this help message and exit
@@ -27,10 +31,10 @@ Get all argument list:
                             exporting default config to file (default: config.json)
       --export_log_config [PATH]
                             exporting default logger config to file (default: log.yaml)
-      --rules [PATH]        path of rule config file (default: credsweeper/rules/config.yaml). severity:['critical', 'high', 'medium', 'low', 'info'] type:['keyword', 'pattern', 'pem_key', 'multi']
+      --rules PATH          path of rule config file (default: credsweeper/rules/config.yaml). severity:['critical', 'high', 'medium', 'low', 'info'] type:['keyword', 'pattern', 'pem_key', 'multi']
       --severity SEVERITY   set minimum level for rules to apply ['critical', 'high', 'medium', 'low', 'info'](default: 'Severity.INFO', case insensitive)
-      --config [PATH]       use custom config (default: built-in)
-      --log_config [PATH]   use custom log config (default: built-in)
+      --config PATH         use custom config (default: built-in)
+      --log_config PATH     use custom log config (default: built-in)
       --denylist PATH       path to a plain text file with lines or secrets to ignore
       --find-by-ext         find files by predefined extension
       --depth POSITIVE_INT  additional recursive search in data (experimental)
@@ -41,8 +45,9 @@ Get all argument list:
                             'highest'] (default: medium)
       --ml_batch_size POSITIVE_INT, -b POSITIVE_INT
                             batch size for model inference (default: 16)
-      --azure               enable AzureExecutionProvider for onnx
-      --cuda                enable CUDAExecutionProvider for onnx
+      --ml_config PATH      use external config for ml model
+      --ml_model PATH       use external ml model
+      --ml_providers STR    comma separated list of providers for onnx (CPUExecutionProvider is used by default)
      --api_validation      add credential api validation option to credsweeper pipeline. External API is used to reduce FP for some rule types.
      --jobs POSITIVE_INT, -j POSITIVE_INT
                            number of parallel processes to use (default: 1)
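Given the updated help text, an invocation with the new ML options can be assembled programmatically — a sketch (the helper name is hypothetical; the flags are the documented ones):

```python
from typing import List, Optional

def build_cli_command(path: str,
                      ml_providers: Optional[str] = None,
                      ml_config: Optional[str] = None,
                      ml_model: Optional[str] = None) -> List[str]:
    """Assemble a credsweeper command line with the new ML options (sketch)."""
    cmd = ["python", "-m", "credsweeper", "--path", path]
    if ml_providers:
        cmd += ["--ml_providers", ml_providers]
    if ml_config:
        cmd += ["--ml_config", ml_config]
    if ml_model:
        cmd += ["--ml_model", ml_model]
    return cmd
```

The resulting list can be handed to `subprocess.run(cmd)` without shell quoting concerns.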
7 changes: 7 additions & 0 deletions experiment/main.py
@@ -216,6 +216,13 @@ def main(cred_data_location: str, jobs: int) -> str:
     # print in last line the name
     print(f"\nYou can find your model in:\n{_model_file_name}")

+    # convert the model to onnx right now
+    command = f"{sys.executable} -m tf2onnx.convert --saved-model {_model_file_name}" \
+              f" --output {pathlib.Path(__file__).parent.parent}/credsweeper/ml_model/ml_model.onnx --verbose"
+    subprocess.check_call(command, shell=True, cwd=pathlib.Path(__file__).parent)
+
+    # to keep the hash in log
+    command = f"md5sum {pathlib.Path(__file__).parent.parent}/credsweeper/ml_model/ml_model.onnx"
+    subprocess.check_call(command, shell=True, cwd=pathlib.Path(__file__).parent)
+    command = f"md5sum {pathlib.Path(__file__).parent.parent}/credsweeper/ml_model/ml_config.json"
+    subprocess.check_call(command, shell=True, cwd=pathlib.Path(__file__).parent)
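The training script shells out to `md5sum` so the model digests land in the train log. The same record can be produced portably with `hashlib` — a sketch (the helper name is hypothetical; the two artifact names come from the diff above):

```python
import hashlib
from pathlib import Path
from typing import Dict

def log_model_digests(model_dir: Path) -> Dict[str, str]:
    """Record the md5 of the converted model artifacts without
    shelling out to md5sum (hypothetical helper)."""
    digests = {}
    for name in ("ml_model.onnx", "ml_config.json"):
        path = model_dir / name
        if path.exists():
            digests[name] = hashlib.md5(path.read_bytes()).hexdigest()
    return digests
```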
4 changes: 2 additions & 2 deletions experiment/src/model_config_preprocess.py
@@ -7,10 +7,10 @@


 def model_config_preprocess(df_all: pd.DataFrame) -> Dict[str, float]:
-    model_config_path = APP_PATH / "ml_model" / "model_config.json"
+    model_config_path = APP_PATH / "ml_model" / "ml_config.json"
     model_config = Util.json_load(model_config_path)

-    # check whether all extensions from meta are in model_config.json
+    # check whether all extensions from meta are in ml_config.json

     for x in model_config["features"]:
         if "FileExtension" == x["type"]:
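The preprocessing verifies that every extension seen in the metadata appears in the `FileExtension` feature of the renamed `ml_config.json`. A sketch of such a consistency check — note the `"extensions"` kwargs key is an assumption about the config layout, not confirmed by the diff:

```python
from typing import Dict, List

def missing_extensions(model_config: Dict, meta_extensions: List[str]) -> List[str]:
    """Return extensions present in the metadata but absent from the
    FileExtension feature of ml_config.json (layout assumed)."""
    for feature in model_config.get("features", []):
        if feature.get("type") == "FileExtension":
            known = set(feature.get("kwargs", {}).get("extensions", []))
            return sorted(set(meta_extensions) - known)
    # no FileExtension feature at all: everything is "missing"
    return sorted(set(meta_extensions))
```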