[connectors]- fix(webcrawler): sanitize webcrawler url (#9138)
* [types] - feature: add utility to validate and standardize URLs
  - Introduce a new function to check if a URL is valid and to standardize it if so
  - Ensure that only URLs with http or https protocols are considered valid

* [front/lib/api] - refactor: use centralized validateUrl function from @dust-tt/types
  - Replaced local validateUrl function with imported one from @dust-tt/types to ensure consistency across modules
  - Removed duplicate validateUrl function definition from @app/lib/utils

  [front/pages/api] - refactor: update document API to use centralized validateUrl
  - Switched to use the validateUrl function from @dust-tt/types in the document API endpoint for URL validation

* [connectors/webcrawler/temporal] - fix: ensure URLs are validated and sanitized in activities
  - Implement URL validation using a new utility to ensure input URLs are valid and standardized before processing
  - Sanitize the URL to remove query parameters and ensure the length doesn't exceed preset maximums in document formatting

* fix: lint/format

* [front] - refactor: streamline import of validateUrl utility
  - Consolidate validateUrl import by removing the duplicate import statement
  - Simplify the codebase for better maintainability and readability

* [connectors] - fix: handle invalid URLs during document formatting
  - Extract document content formatting into a separate function to allow for null returns on invalid URLs
  - Log and skip document upsert to datasource if formatted document content is invalid

  [front] - refactor: relocate validateUrl import
  - Move import of validateUrl to a different section of the code for better code organization

* [connectors/webcrawler/temporal] - fix: refine error message for invalid URLs during crawl
  - Updated error message to include both invalid URLs and documents for better clarity during website crawling errors
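The sanitize-and-skip behavior described in the commit message could look roughly like the sketch below. This is not the code from the commit: `MAX_URL_LENGTH`, `sanitizeUrl`, and `selectUpsertableUrls` are hypothetical names, and the limit value is an assumption, since the message only mentions "preset maximums".

```typescript
// Hypothetical limit; the commit message does not state the actual value.
const MAX_URL_LENGTH = 2048;

// Hypothetical helper: validate, drop the query string, and cap the length.
function sanitizeUrl(rawUrl: string): string | null {
  let url: URL;
  try {
    url = new URL(rawUrl);
  } catch {
    return null; // not parseable as an absolute URL
  }
  if (url.protocol !== "http:" && url.protocol !== "https:") {
    return null; // only http(s) is treated as valid
  }
  url.search = ""; // removes "?key=value" query parameters
  return url.href.slice(0, MAX_URL_LENGTH);
}

// Hypothetical caller: log and skip entries whose URL cannot be sanitized,
// mirroring the "log and skip document upsert" step in the message.
function selectUpsertableUrls(rawUrls: string[]): string[] {
  const kept: string[] = [];
  for (const raw of rawUrls) {
    const clean = sanitizeUrl(raw);
    if (clean === null) {
      console.warn(`Skipping invalid URL: ${raw}`);
      continue;
    }
    kept.push(clean);
  }
  return kept;
}
```

Returning `null` for an unusable URL, instead of throwing, lets the caller decide whether to skip the page or abort the crawl.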
1 parent e4a70b3, commit 418082a
Showing 5 changed files with 50 additions and 31 deletions.
@@ -0,0 +1,19 @@
export const validateUrl = (
  urlString: string
): {
  valid: boolean;
  standardized: string | null;
} => {
  let url: URL;
  try {
    url = new URL(urlString);
  } catch (e) {
    return { valid: false, standardized: null };
  }

  if (url.protocol !== "http:" && url.protocol !== "https:") {
    return { valid: false, standardized: null };
  }

  return { valid: true, standardized: url.href };
};
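A short usage sketch may help show what the utility returns for a few inputs. The function body is copied from the diff above (without `export`, so the snippet runs standalone); the `describeUrl` helper and the sample URLs are illustrative assumptions, not part of the commit.

```typescript
// Copied from the diff above; `export` dropped so the snippet is standalone.
const validateUrl = (
  urlString: string
): {
  valid: boolean;
  standardized: string | null;
} => {
  let url: URL;
  try {
    url = new URL(urlString);
  } catch (e) {
    return { valid: false, standardized: null };
  }

  if (url.protocol !== "http:" && url.protocol !== "https:") {
    return { valid: false, standardized: null };
  }

  return { valid: true, standardized: url.href };
};

// Hypothetical helper for illustration: summarize the validation result.
function describeUrl(input: string): string {
  const { valid, standardized } = validateUrl(input);
  return valid ? `ok: ${standardized}` : `rejected: ${input}`;
}

console.log(describeUrl("https://example.com"));    // ok: https://example.com/
console.log(describeUrl("ftp://example.com/file")); // rejected: ftp://example.com/file
console.log(describeUrl("not a url"));              // rejected: not a url
```

Note that the standardized form comes from the URL parser's `href`, so a bare origin gains a trailing slash.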