Mc 133 huggingface pipeline #137

Merged: 43 commits, Jan 23, 2025

maxachis (Collaborator)

#133

This PR incorporates several changes, including:

  • Logic for pipelining URLs into Hugging Face
  • Logic for retrieving HTML data for URLs
  • Logic for repeatedly running "cycles" that continually retrieve the above data where needed
  • Logic for dumping production data and setting up a test database from the most up-to-date version of the database (for testing against live data)
  • Bug fixes and QoL improvements for the HTML tag collector
  • Revisions to the Root URL Cache so that it persists in the database rather than in a JSON file that is deleted when a container stops
  • A new /url endpoint for retrieving information about URLs without filtering by Batch
  • New /annotation/url endpoints for annotating URLs for relevance
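The Root URL Cache change can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the table name `root_url_cache`, its columns, the `RootURLCache` class, and the use of SQLite (chosen only so the example is self-contained). The PR's actual schema and database engine are not shown here. The point is simply that a cache stored in a table survives a container restart, while a JSON file inside the container does not.

```python
import sqlite3
from typing import Optional
from urllib.parse import urlparse


class RootURLCache:
    """Hypothetical database-backed cache mapping a root URL to its page title."""

    def __init__(self, conn: sqlite3.Connection) -> None:
        self.conn = conn
        # Table and column names are illustrative, not taken from the PR.
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS root_url_cache ("
            "url TEXT PRIMARY KEY, page_title TEXT NOT NULL)"
        )

    @staticmethod
    def root_of(url: str) -> str:
        """Reduce a full URL to scheme plus host, e.g. https://example.com."""
        parsed = urlparse(url)
        return f"{parsed.scheme}://{parsed.netloc}"

    def get(self, url: str) -> Optional[str]:
        """Return the cached title for the URL's root, or None on a miss."""
        row = self.conn.execute(
            "SELECT page_title FROM root_url_cache WHERE url = ?",
            (self.root_of(url),),
        ).fetchone()
        return row[0] if row else None

    def put(self, url: str, page_title: str) -> None:
        """Insert or overwrite the cached title for the URL's root."""
        with self.conn:  # commits on success, rolls back on error
            self.conn.execute(
                "INSERT OR REPLACE INTO root_url_cache (url, page_title) "
                "VALUES (?, ?)",
                (self.root_of(url), page_title),
            )
```

Because lookups are keyed on the root, two URLs on the same host share one cache entry, so the root page only needs to be fetched once.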

* Create URL Metadata Table
* Convert batch `status`, `strategy` columns to enums
* Convert URL `status` column to enum
* Add new migration tests
* Add database structure tests
* Update tests
* The previously used `requests_html` library is no longer maintained and was causing bugs
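As a rough sketch of what converting a free-text `status` column to an enum buys, the example below pairs a Python `Enum` with a database-level constraint. The member names (`PENDING`, `SUBMITTED`, `ERROR`), the `urls` table layout, and the helper functions are all hypothetical, and SQLite's `CHECK` constraint stands in for a native enum type only so the example runs without a database server.

```python
import sqlite3
from enum import Enum


class URLStatus(Enum):
    # Hypothetical members; the real set of statuses is not listed in this PR.
    PENDING = "pending"
    SUBMITTED = "submitted"
    ERROR = "error"


def create_urls_table(conn: sqlite3.Connection) -> None:
    """Create a table whose status column only accepts URLStatus values."""
    # The allowed values come from the enum itself, not from user input.
    allowed = ", ".join(f"'{s.value}'" for s in URLStatus)
    conn.execute(
        "CREATE TABLE urls ("
        "id INTEGER PRIMARY KEY, "
        "url TEXT NOT NULL, "
        f"status TEXT NOT NULL CHECK (status IN ({allowed})))"
    )


def insert_url(conn: sqlite3.Connection, url: str, status: URLStatus) -> None:
    """Insert a URL row, storing the enum's string value."""
    with conn:
        conn.execute(
            "INSERT INTO urls (url, status) VALUES (?, ?)",
            (url, status.value),
        )
```

With this arrangement a misspelled status is rejected twice: `URLStatus("bogus")` raises `ValueError` in application code, and a raw insert of an unknown value fails the constraint with `sqlite3.IntegrityError`.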

gitguardian bot commented Jan 19, 2025

⚠️ GitGuardian has uncovered 1 secret following the scan of your pull request.

Please consider investigating the findings and remediating the incidents. Failure to do so may lead to compromising the associated services or software components.

🔎 Detected hardcoded secret in your pull request
| GitGuardian id | GitGuardian status | Secret | Commit | Filename |
| --- | --- | --- | --- | --- |
| 15222185 | Triggered | Generic Password | 974ca99 | local_database/DataDumper/docker-compose.yml |
🛠 Guidelines to remediate hardcoded secrets
  1. Understand the implications of revoking this secret by investigating where it is used in your code.
  2. Replace and store your secret safely, following best practices for secret storage.
  3. Revoke and rotate this secret.
  4. If possible, rewrite git history. Rewriting git history is not a trivial act. You might completely break other contributing developers' workflow and you risk accidentally deleting legitimate data.

To avoid such incidents in the future consider


🦉 GitGuardian detects secrets in your source code to help developers and security teams secure the modern development process. You are seeing this because you or someone else with access to this repository has authorized GitGuardian to scan your pull request.
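One common way to act on the "replace and store your secret safely" step is to read the password from the environment at runtime instead of committing it. The sketch below is illustrative only: the variable name `POSTGRES_PASSWORD` and the `get_database_password` helper are assumptions, not names taken from this repository.

```python
import os


def get_database_password(env_var: str = "POSTGRES_PASSWORD") -> str:
    """Return the password stored in the given environment variable.

    The default variable name is a hypothetical example. Failing loudly
    when the variable is unset avoids silently falling back to a
    hardcoded default.
    """
    password = os.environ.get(env_var)
    if not password:
        raise RuntimeError(
            f"Environment variable {env_var} is not set; "
            "refusing to fall back to a hardcoded password."
        )
    return password
```

The value itself would then live in an untracked file or a secrets manager, and the previously committed password would still need to be rotated, since it remains in the git history.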

@maxachis maxachis merged commit e39d687 into dev Jan 23, 2025
4 checks passed