Excavate from RAW_TEXT events #1636
Excavate is correctly parsing events out of downloaded files. However after running
|
Yeah we should probably look at each event in that chain and figure out where either:
|
Ah, I have this in my config:
scope:
  report_distance: 1
  search_distance: 1
Which may be allowing SOCIAL to emit "https://github.com/atlassian". Will try it without. |
Yes, it was my configuration allowing social to grab OOS GitHub profiles and have them be consumed by other modules. My re-run on dell.com was far more successful: just a few events that probably should be OOS, but nothing from unstructured.
I have found a few errors in excavate (Created by it consuming |
This reverts commit a40da6a.
So I've added a few bits of data that should test most of excavate rules when extracting from The tests are getting an error when emitting the
WARNING bbot.modules.internal.excavate:base.py:1347 Error sanitizing event data "{'host': '', 'url': '', 'description': 'Parsed file content contains JSON Web Token (JWT) [eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJzdWIiOiIxMjM0NTY3ODkwIiwibmFtZSI6IkpvaG4gRG9lIiwiaWF0IjoxNTE2MjM5MDIyfQ.SflKxwRJSMeKKF2QT4fwpMeJf36POk6yJV_adQssw5c]'}" for type "FINDING": 2 validation errors for _data_validator
host
Value error, Validation failed for ('',), {}: Invalid hostname: "" [type=value_error, input_value='', input_type=str]
For further information visit https://errors.pydantic.dev/2.7/v/value_error
url
Value error, Validation failed for ('',), {}: Validation failed for ('',), {}: Invalid URL: "" [type=value_error, input_value='', input_type=str]
For further information visit https://errors.pydantic.dev/2.7/v/value_error Thinking of the scan of |
Makes sense. Probably what we should do is, for event types like FINDING that require a host, if a host isn't specified, we just walk the chain of parents back until we hit the first host event, and use that one. We can raise an error if no host was found. Always having a host helps in showing which git repo / website the file originally came from. EDIT: this logic should probably go in |
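The parent-walking idea above could be sketched roughly as follows. This is a minimal illustration, not BBOT's actual implementation: the `Event` dataclass and the `inherit_host` helper are hypothetical names, and it assumes each event keeps a reference to its parent.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Event:
    # Hypothetical stand-in for a scan event; not the real BBOT class.
    type: str
    host: Optional[str] = None
    parent: Optional["Event"] = None


def inherit_host(event: Event) -> str:
    """Return the event's own host, or the first host found among its ancestors."""
    seen = set()
    current = event
    while current is not None and id(current) not in seen:
        seen.add(id(current))  # guard against parent cycles or a self-parented root
        if current.host:
            return current.host
        current = current.parent
    raise ValueError(f"No host found in parent chain of {event.type} event")


# Usage: a FINDING raised from file content picks up the host of the URL it came from.
url = Event(type="URL", host="example.com")
raw_text = Event(type="RAW_TEXT", parent=url)
finding = Event(type="FINDING", parent=raw_text)
resolved_host = inherit_host(finding)
```

Raising when no ancestor has a host keeps the failure explicit, rather than letting an empty string reach validation as in the warning above.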
I totally agree with the first part about walking back up the parents looking for a host. But I'm not getting the path/filepath thing. Many findings wouldn't have anything like that, unless I'm misunderstanding? |
This would be for secrets etc. found in a git repo. We'd want to capture which repo it was found in, but also which individual file. |
hmm, are we to the point with the "secrets" based modules that they need their own unique event type? |
I don't think a new event would be necessary, but maybe an optional path / filepath attribute to a Yes you would only really get a path attribute in the event.data dictionary in a |
Not sure if the event.host is correct? I've merged in the latest changes from #1656 but it doesn't seem to be getting the correct parent host |
Hmm I'll look into that. Also I noticed there's several places we're pulling url/path from parent events. I'm thinking we should also inherit those automatically. What do you think? |
Okay @domwhewell-sage I pushed some fixes to #1666. This should save you from having to manually pull the EDIT: merge it into this branch and if it works well we can just merge both at once. |
…excavate_raw_text
…vent_data otherwise allow parent inheritence
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## dev #1636 +/- ##
=====================================
+ Coverage 93% 93% +1%
=====================================
Files 341 341
Lines 25926 25979 +53
=====================================
+ Hits 23893 23950 +57
+ Misses 2033 2029 -4 |
This PR adds the if statement back to the internal excavate module so it can run its rules on RAW_TEXT events. A test has also been added to download a PDF and extract URLs from it.
I have made changes to the unstructured module: when it printed its discovery context, it included the content of the RAW_TEXT event, so output.csv was quite large.
I have also added a replace() to the __str__ function of the Event class so newlines are not printed in debug logs.
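A minimal sketch of the newline escaping described above, assuming a heavily simplified `Event` class. The class here is hypothetical; the real one has many more fields and its exact `__str__` format may differ.

```python
class Event:
    # Simplified, hypothetical event; only the fields needed for this example.
    def __init__(self, type: str, data: str):
        self.type = type
        self.data = data

    def __str__(self) -> str:
        # Escape newlines so a multi-line RAW_TEXT payload stays on one log line.
        flat = str(self.data).replace("\n", "\\n")
        return f'{self.type}("{flat}")'


event_str = str(Event("RAW_TEXT", "line one\nline two"))
```

Escaping rather than deleting the newline keeps the original line boundaries visible in the debug output.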