
Commit 19ad941

fix repetitive pattern extraction (#109)
* fix repetitive pattern extraction #108
* add `--ignore_extraction_boundary` #109
* Update README.md
* adding tests
* fix subdomain-hostname must not have tld; fix issue causing any item with same boundary as the file not getting extracted by adding `\n` at start and end of file before splitting
* docs updates

Co-authored-by: David G <[email protected]>
1 parent 494c140 commit 19ad941

12 files changed: +109 −41 lines

README.md (+3 −2)

```diff
@@ -83,12 +83,13 @@ The following arguments are available:
 
 How the extractions are performed
 
-* `--use_extractions` (REQUIRED): if you only want to use certain extraction types, you can pass their slug found in either `includes/ai/config.yaml`, `includes/lookup/config.yaml` `includes/pattern/config.yaml` (e.g. `pattern_ipv4_address_only`). Default if not passed, no extractions applied.
-  * Important: if using any AI extractions, you must set an OpenAI API key in your `.env` file
+* `--use_extractions` (REQUIRED): if you only want to use certain extraction types, you can pass their slug found in either `includes/ai/config.yaml`, `includes/lookup/config.yaml` `includes/pattern/config.yaml` (e.g. `pattern_ipv4_address_only`). Default if not passed, no extractions applied. You can also pass a catch all wildcard `*` which will match all extraction paths (e.g. `pattern_*` would run all extractions starting with `pattern_`)
+  * Important: if using any AI extractions (`ai_*`), you must set an AI API key in your `.env` file
   * Important: if you are using any MITRE ATT&CK, CAPEC, CWE, ATLAS or Location extractions you must set `CTIBUTLER` or NVD CPE or CVE extractions you must set `VULMATCH` settings in your `.env` file
 * `--relationship_mode` (REQUIRED): either.
   * `ai`: AI provider must be enabled. extractions performed by either regex or AI for extractions user selected. Rich relationships created from AI provider from extractions.
   * `standard`: extractions performed by either regex or AI (AI provider must be enabled) for extractions user selected. Basic relationships created from extractions back to master Report object generated.
+* `--ignore_extraction_boundary` (OPTIONAL, default `false`, not compatible with AI extractions): in some cases the same string will create multiple extractions depending on extractions set (e.g. `https://www.google.com/file.txt` could create a url, url with file, domain, subdomain, and file). The default behaviour is for txt2stix to take the longest extraction and ignore everything else (e.g. only extract url with file, and ignore url, file, domain, subdomain, and file). If you want to override this behaviour and get all extractions in the output, set this flag to `true`.
 * `--ignore_image_refs` (default `true`): images references in documents don't usually need extracting. e.g. `<img src="https://example.com/image.png" alt="something">` you would not want domain or file extractions extracting `example.com` and `image.png`. Hence these are ignored by default (they are removed from text sent to extraction). Note, only the `img src` is ignored, all other values e.g. `alt` are considered. If you want extractions to consider this data, set it to `false`
 * `--ignore_link_refs` (default `true`): link references in documents don't usually need extracting e.g. `<a href="https://example.com/link.html" title="something">Bad Actor</a>` you would only want `Bad actor` to be considered for extraction. Hence these part of the link are ignored by default (they are removed from text sent to extraction). Note, only the `a href` is ignored, all other values e.g. `title` are considered. Setting this to `false` will also include everything inside the link tag (e.g. `example.com` would extract as a domain)
```
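The default boundary behaviour described for `--ignore_extraction_boundary` can be sketched in a few lines. This is an illustration only — `dedupe_by_boundary` and the candidate list are invented for the example, not the actual txt2stix implementation:

```python
# Illustrative sketch of the default boundary rule (invented helper, not
# the actual txt2stix code): among overlapping candidate extractions,
# keep the longest one; with ignore_boundary=True, keep everything.
def dedupe_by_boundary(candidates, ignore_boundary=False):
    if ignore_boundary:
        return sorted(candidates)
    kept = []
    # prefer longer matches first; ties broken by earlier start
    for start, value in sorted(candidates, key=lambda c: (-len(c[1]), c[0])):
        end = start + len(value)
        # keep only if it does not overlap anything already kept
        if all(end <= s or start >= s + len(v) for s, v in kept):
            kept.append((start, value))
    return sorted(kept)

# candidate extractions for "https://www.google.com/file.txt"
candidates = [
    (0, "https://www.google.com/file.txt"),  # url with file
    (0, "https://www.google.com"),           # url
    (8, "www.google.com"),                   # subdomain
    (12, "google.com"),                      # domain
    (23, "file.txt"),                        # file
]
print(dedupe_by_boundary(candidates))
# -> [(0, 'https://www.google.com/file.txt')] : longest extraction wins
print(len(dedupe_by_boundary(candidates, ignore_boundary=True)))
# -> 5 : all extractions kept
```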

tests/data/manually_generated_reports/test_extraction_boundary.txt (+1 −0)

```diff
@@ -0,0 +1 @@
+https://subdomain.google.com/file.txt
```

tests/manual-tests/cases-standard-tests.md (+30 −1)

````diff
@@ -321,6 +321,35 @@ python3 txt2stix.py \
   --report_id 8cf2590e-f7b8-40c6-99cd-4aad9fbdc8bd
 ```
 
+### extraction boundary tests
+
+Should create `pattern_url_file` extraction as boundary observed
+
+```shell
+python3 txt2stix.py \
+  --relationship_mode standard \
+  --input_file tests/data/manually_generated_reports/test_extraction_boundary.txt \
+  --name 'extraction boundary tests 1' \
+  --tlp_level clear \
+  --confidence 100 \
+  --use_extractions 'pattern_*' \
+  --report_id f6d8800b-9708-4c74-aa1b-7a59d3c79d79
+```
+
+Should create all extractions;
+
+```shell
+python3 txt2stix.py \
+  --relationship_mode standard \
+  --input_file tests/data/manually_generated_reports/test_extraction_boundary.txt \
+  --name 'extraction boundary tests 1' \
+  --tlp_level clear \
+  --confidence 100 \
+  --ignore_extraction_boundary true \
+  --use_extractions 'pattern_*' \
+  --report_id 0f5b1afd-c468-49a2-9896-6910b7f124dd
+```
+
 ### disarm demo
 
 ```shell
@@ -333,4 +362,4 @@ python3 txt2stix.py \
   --confidence 100 \
   --use_extractions lookup_disarm_name \
   --report_id 8cb2dbf0-136f-4ecb-995c-095496e22abc
-```
+```
````

txt2stix/extractions.py (+6 −1)

```diff
@@ -1,6 +1,10 @@
-from typing import Any
+from typing import Any, Type
 import yaml
 from pathlib import Path
+
+from typing import TYPE_CHECKING
+if TYPE_CHECKING:
+    import txt2stix.pattern.extractors.base_extractor
 from .common import NamedDict
 
 class Extractor(NamedDict):
@@ -19,6 +23,7 @@ class Extractor(NamedDict):
     prompt_negative_examples = None
     stix_mapping = None
     prompt_extraction_extra = None
+    pattern_extractor : 'Type[txt2stix.pattern.extractors.base_extractor.BaseExtractor]' = None
 
 
     def __init__(self, key, dct, include_path=None, test_cases: dict[str, list[str]]=None):
```
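The `TYPE_CHECKING` guard added above is a standard trick for annotating an attribute with a type from a module that would otherwise create a circular (or costly) import. A minimal self-contained sketch — the module and class names here are hypothetical, not the txt2stix ones:

```python
from typing import TYPE_CHECKING, Optional, Type

if TYPE_CHECKING:
    # evaluated only by static type checkers (mypy/pyright), never at
    # runtime, so a circular or heavy import here costs nothing when run
    from some.heavy.module import BaseThing  # hypothetical module

class Config:
    # the annotation is a string literal, so BaseThing is never looked up
    # at runtime; the attribute simply defaults to None
    handler: 'Optional[Type[BaseThing]]' = None

print(Config.handler)  # -> None, and some.heavy.module was never imported
```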

txt2stix/pattern/extractors/base_extractor.py (+26 −19)

```diff
@@ -19,7 +19,7 @@ class BaseExtractor:
     version = None
     stix_mapping = None
     invalid_characters = ['.', ',', '!', '`', '(', ')', '{', '}', '"', '````', ' ', '[', ']']
-    SPLITS_FINDER = re.compile(r'[\'"<\(\{\[\s].*?[\)\s\]\}\)>"\']') #split on boundary characters instead of ' ' only
+    SPLITS_FINDER = re.compile(r'[\'"<\(\{\[\s](?P<item>.*?)[\)\s\]\}\)>"\']') #split on boundary characters instead of ' ' only
 
 
     @classmethod
@@ -40,44 +40,41 @@ def extract_extraction_from_text(cls, text: str):
         start_index = 0
         if cls.extraction_regex is not None:
             if cls.extraction_regex.startswith("^") or cls.extraction_regex.endswith("$"):
-                for word in cls.split_all(text):
-                    end_index = start_index + len(word) - 1
+                for matchsplit in cls.SPLITS_FINDER.finditer(text):
+                    word = matchsplit.group('item')
+                    start_index = matchsplit.start('item')
                     match = re.match(cls.extraction_regex, word)
                     if match:
-                        extracted_observables.append((match.group(0), start_index))
+                        extracted_observables.append((match.group(0), match.start()+start_index))
                     else:
                         stripped_word = word.strip(cls.common_strip_elements)
                         match = re.match(cls.extraction_regex, stripped_word)
                         if match:
-                            extracted_observables.append((match.group(0), start_index))
-                    start_index = end_index + 2 # Adding 1 for the space and 1 for the next word's starting index
+                            extracted_observables.append((match.group(0), start_index + word.index(stripped_word)))
             else:
                 # Find regex in the entire text (including whitespace)
                 for match in re.finditer(cls.extraction_regex, text):
-                    match = match.group().strip('\n')
-                    end_index = start_index + len(match) - 1
+                    match_value = match.group().strip('\n')
+                    start_index, end_index = match.span()
 
-                    extracted_observables.append((match, start_index))
-                    start_index = end_index + 2 # Adding 1 for the space and 1 for the next word's starting index
+                    extracted_observables.append((match_value, start_index))
 
         # If extraction_function is not None, then find matches that don't throw exception when
         elif cls.extraction_function is not None:
 
             start_index = 0
 
-            words = cls.SPLITS_FINDER.findall(text)
-            for word in words:
+            for match in cls.SPLITS_FINDER.finditer(text):
+                word = match.group('item')
                 end_index = start_index + len(word) - 1
 
                 word = BaseExtractor.trim_invalid_characters(word, cls.invalid_characters)
                 try:
                     if cls.extraction_function(word):
-                        extracted_observables.append((word, start_index))
+                        extracted_observables.append((word, match.start('item')))
                 except Exception as error:
                     pass
 
-                start_index = end_index + 2 # Adding 1 for the space and 1 for the next word's starting index
-
         else:
             raise ValueError("Both extraction_regex and extraction_function can't be None.")
 
@@ -93,15 +90,25 @@ def extract_extraction_from_text(cls, text: str):
 
         response = []
 
-        for extraction, positions in string_positions.items():
+        # for extraction, positions in string_positions.items():
+        #     response.append({
+        #         "value": extraction,
+        #         "type": cls.name,
+        #         "version": cls.version,
+        #         "stix_mapping": cls.stix_mapping,
+        #         "start_index": positions,
+        #     })
+
+        for position, (string, pos) in enumerate(extracted_observables, 1):
+            if cls.filter_function and not cls.filter_function(string):
+                continue
             response.append({
-                "value": extraction,
+                "value": string,
                 "type": cls.name,
                 "version": cls.version,
                 "stix_mapping": cls.stix_mapping,
-                "start_index": positions,
+                "start_index": pos,
             })
-
         return response
 
     @staticmethod
```
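The key change in `extract_extraction_from_text` is positional: capturing each token in a named group lets `finditer()` report the token's true offset in the text, replacing the old `start_index = end_index + 2` bookkeeping, which drifted whenever tokens were separated by anything other than a single space. A simplified demonstration — the regex below is a deliberately simpler tokenizer than `SPLITS_FINDER`:

```python
import re

# simplified stand-in for SPLITS_FINDER: capture each token in a named
# group so finditer() reports its true position in the text
TOKEN = re.compile(r'(?P<item>\S+)')

text = "\nip: 10.0.0.1,   url https://example.com\n"
tokens = [(m.group('item'), m.start('item')) for m in TOKEN.finditer(text)]
print(tokens)
# -> [('ip:', 1), ('10.0.0.1,', 5), ('url', 17), ('https://example.com', 21)]
# the offsets stay correct even across the run of three spaces, where a
# "previous end + 2" counter would have drifted
```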

txt2stix/pattern/extractors/domain/hostname_extractor.py (+5 −2)

```diff
@@ -1,5 +1,8 @@
 from tld import get_tld
 
+from txt2stix.utils import validate_file_mimetype
+from ..helper import TLDs
+
 from ..base_extractor import BaseExtractor
 
 
@@ -29,5 +32,5 @@ def filter_function(domain):
         if domain.count('.') <= 1:
             tld = get_tld(domain, fix_protocol=True, fail_silently=True)
             if not tld:
-                return True
-        return False
+                return not validate_file_mimetype(domain)
+        return False
```

txt2stix/pattern/extractors/domain/sub_domain_extractor.py (+8 −1)

```diff
@@ -1,5 +1,8 @@
 from tld import get_tld
 
+from txt2stix.utils import validate_file_mimetype
+from ..helper import TLDs
+
 from ..base_extractor import BaseExtractor
 
 
@@ -39,4 +42,8 @@ def filter_function(domain):
 
 class HostNameSubdomainExtractor(SubDomainExtractor):
     name = "pattern_host_name_subdomain"
-    filter_function = lambda domain: domain.count('.') >= 2
+    filter_function = lambda domain: domain.count('.') >= 2 and get_tld(domain, fail_silently=True) not in TLDs
+
+    def filter_function(domain):
+        tld = get_tld(domain, fail_silently=True, fix_protocol=True)
+        return domain.count('.') >= 2 and not tld and not validate_file_mimetype(domain)
```

txt2stix/pattern/extractors/helper.py (+20 −7)

```diff
@@ -8,7 +8,8 @@
 from .base_extractor import ALL_EXTRACTORS
 
 from ...extractions import Extractor
-from ...utils import read_included_file
+from ...utils import FILE_EXTENSIONS, read_included_file, TLDs
+
 
 
 def read_text_file(file_path):
@@ -48,7 +49,7 @@ def check_false_positive_domain(domain):
         bool: True if the domain is not a false positive, False otherwise.
     """
     file_extension = domain.split(".")[-1]
-    if file_extension in FILE_EXTENSION:
+    if file_extension in FILE_EXTENSIONS:
         return False
     else:
         return True
@@ -65,15 +66,27 @@ def load_extractor(extractor):
     extractor.pattern_extractor.stix_mapping = extractor.stix_mapping
 
 
-def extract_all(extractors :list[Extractor], input_text):
+def extract_all(extractors :list[Extractor], input_text, ignore_extraction_boundary=False):
     logging.info("using pattern extractors")
     pattern_extracts = []
     for extractor in extractors:
         load_extractor(extractor)
         extracts = extractor.pattern_extractor().extract_extraction_from_text(input_text)
         pattern_extracts.extend(extracts)
-    return pattern_extracts
-
 
-FILE_EXTENSION = read_included_file('lookups/extensions.txt')
-TLD = read_included_file('lookups/tld.txt')
+    pattern_extracts.sort(key=lambda ex: (ex['start_index'], len(ex['value'])))
+    retval = {}
+    end = 0
+    for raw_extract in pattern_extracts:
+        start_index = raw_extract['start_index']
+        key = (raw_extract['type'], raw_extract['value'])
+        if ignore_extraction_boundary or start_index >= end:
+            extraction = retval.setdefault(key, {**raw_extract, "start_index":[start_index]})
+            if start_index not in extraction['start_index']:
+                extraction['start_index'].append(start_index)
+            end = start_index + len(raw_extract['value'])
+    return list(retval.values())
+
+
+# FILE_EXTENSION = read_included_file('lookups/extensions.txt')
+# TLD = read_included_file('lookups/tld.txt')
```

txt2stix/pattern/extractors/others/cve_extractor.py (+1 −1)

```diff
@@ -11,4 +11,4 @@ class CVEExtractor(BaseExtractor):
     """
 
     name = "pattern_cve_id"
-    extraction_regex = r'^CVE-\d{4}-(?:\d{4}|\d{5})$'
+    extraction_regex = r'^CVE-\d{4}-(?:\d{4,6})$'
```
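The tightened CVE pattern can be checked standalone. Note the behavioural change: `\d{4,6}` also accepts 6-digit sequence numbers, which the old `\d{4}|\d{5}` alternation rejected; `CVE-2021-44228` (Log4Shell) exercises the 5-digit case, and the 6-digit ID below is an invented example:

```python
import re

CVE = re.compile(r'^CVE-\d{4}-(?:\d{4,6})$')  # the new pattern from the diff

for candidate in ["CVE-2021-44228", "CVE-2024-123456", "CVE-2020-123"]:
    print(candidate, bool(CVE.match(candidate)))
# CVE-2021-44228 True   (5-digit sequence)
# CVE-2024-123456 True  (6-digit sequence, newly accepted)
# CVE-2020-123 False    (sequence too short)
```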

txt2stix/pattern/extractors/others/email_extractor.py (+2 −2)

```diff
@@ -1,5 +1,5 @@
 from ..base_extractor import BaseExtractor
-from ..helper import TLD
+from ..helper import TLDs
 
 
 class EmailAddressExtractor(BaseExtractor):
@@ -17,5 +17,5 @@ class EmailAddressExtractor(BaseExtractor):
     def filter_function(email):
         x = email.split("@")
         domain = x[-1].split(".")[-1]
-        if domain in TLD:
+        if domain in TLDs:
             return True
```

txt2stix/txt2stix.py (+5 −3)

```diff
@@ -141,6 +141,7 @@ def parse_args():
     parser.add_argument("--external_refs", type=parse_ref, help="pass additional `external_references` entry (or entries) to the report object created. e.g --external_ref author=dogesec link=https://dkjjadhdaj.net", default=[], metavar="{source_name}={external_id}", action="extend", nargs='+')
     parser.add_argument('--ignore_image_refs', default=True, type=parse_bool)
     parser.add_argument('--ignore_link_refs', default=True, type=parse_bool)
+    parser.add_argument("--ignore_extraction_boundary", default=False, type=parse_bool, help="default if not passed is `false`, but if set to `true` will ignore boundary capture logic for extractions")
 
     args = parser.parse_args()
     if not args.input_file.exists():
@@ -176,9 +177,10 @@ def log_notes(content, type):
     logging.debug(json.dumps(content, sort_keys=True, indent=4))
     logging.debug(f" ========================= {'-'*len(type)} ========================= ")
 
-def extract_all(bundler: txt2stixBundler, extractors_map, text_content, ai_extractors: list[BaseAIExtractor]=[]):
+def extract_all(bundler: txt2stixBundler, extractors_map, text_content, ai_extractors: list[BaseAIExtractor]=[], **kwargs):
     assert ai_extractors or not extractors_map.get("ai"), "There should be at least one AI extractor in ai_extractors"
 
+    text_content = "\n"+text_content+"\n"
     all_extracts = dict()
     if extractors_map.get("lookup"):
         try:
@@ -191,7 +193,7 @@ def extract_all(bundler: txt2stixBundler, extractors_map, text_content, ai_extra
     if extractors_map.get("pattern"):
         try:
             logging.info("using pattern extractors")
-            pattern_extracts = pattern.extract_all(extractors_map["pattern"].values(), text_content)
+            pattern_extracts = pattern.extract_all(extractors_map["pattern"].values(), text_content, ignore_extraction_boundary=kwargs.get('ignore_extraction_boundary', False))
             bundler.process_observables(pattern_extracts)
             all_extracts["pattern"] = pattern_extracts
         except BaseException as e:
@@ -256,7 +258,7 @@ def main():
     if args.relationship_mode == "ai":
         validate_token_count(int(os.environ["INPUT_TOKEN_LIMIT"]), preprocessed_text, [args.ai_settings_relationships])
 
-    all_extracts = extract_all(bundler, args.use_extractions, preprocessed_text, ai_extractors=args.ai_settings_extractions)
+    all_extracts = extract_all(bundler, args.use_extractions, preprocessed_text, ai_extractors=args.ai_settings_extractions, ignore_extraction_boundary=args.ignore_extraction_boundary)
     extracted_relationships = None
     if args.relationship_mode == "ai" and sum(map(lambda x: len(x), all_extracts.values())):
         extracted_relationships = extract_relationships_with_ai(bundler, preprocessed_text, all_extracts, args.ai_settings_relationships)
```
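The `text_content = "\n"+text_content+"\n"` line is the fix the commit message describes: boundary-based splitting needs a boundary character on both sides of a value, so a value sharing its boundary with the file itself (flush against the start or end of the input) was never captured. A self-contained illustration — the regex below is a simplified stand-in, not the real `SPLITS_FINDER`:

```python
import re

# simplified boundary splitter: a token must be enclosed by whitespace/quotes
FINDER = re.compile(r'[\s"](?P<item>\S+)[\s"]')

text = "evil.com"  # the whole file is a single indicator

print([m.group('item') for m in FINDER.finditer(text)])
# -> [] : no surrounding boundary characters, so the value is missed

print([m.group('item') for m in FINDER.finditer("\n" + text + "\n")])
# -> ['evil.com'] : padding with "\n" supplies the missing boundaries
```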

txt2stix/utils.py (+2 −2)

```diff
@@ -58,7 +58,7 @@ def read_included_file(path):
 
 def validate_tld(domain):
     extracted_domain = tldextract.extract(domain)
-    return extracted_domain.suffix in TLDS
+    return extracted_domain.suffix in TLDs
 
 def validate_reg_key(reg_key):
     reg_key = reg_key.lower()
@@ -71,6 +71,6 @@ def validate_file_mimetype(file_name):
     _, ext = os.path.splitext(file_name)
     return FILE_EXTENSIONS.get(ext)
 
-TLDS = [tld.lower() for tld in read_included_file('helpers/tlds.txt').splitlines()]
+TLDs = [tld.lower() for tld in read_included_file('helpers/tlds.txt').splitlines()]
 REGISTRY_PREFIXES = [key.lower() for key in read_included_file('helpers/windows_registry_key_prefix.txt').splitlines()]
 FILE_EXTENSIONS = dict(line.lower().split(',') for line in read_included_file('helpers/mimetype_filename_extension_list.csv').splitlines())
```
