Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Removing invalid docs from persistence. #5206

Merged
merged 1 commit into from
Jul 16, 2024
Merged

Conversation

fulmicoton
Copy link
Contributor

@fulmicoton fulmicoton commented Jul 9, 2024

Removing invalid docs from persistence.

This PR filters out invalid docs from batches before persisting the
batch.
That way, they won't be persisted in the mrecordlog and the
indexing pipeline will not have to parse them again (and log an error).

This PR also improves a little bit on the metric for ingested docs.
The num_docs and bytes metric now have a validity `label`, and the
point at which they are measured is documented in the metric description.

Closes #5205

@fulmicoton fulmicoton force-pushed the 5205/ditch-invalid-docs branch 4 times, most recently from c83ae27 to c6fe1f2 Compare July 9, 2024 09:57
@fulmicoton fulmicoton requested a review from trinity-1686a July 9, 2024 09:57
This PR filters out invalid docs from batches before persisting the
batch.
That way, they won't be persisted in the mrecordlog and the
indexing pipeline will not have to parse them again (and log an error).

This PR also improves a little bit on the metric for ingested docs.
The num_docs and bytes metric now have a validity `label`, and the
point at which they are measured is documented in the metric description.

Closes #5205
@fulmicoton fulmicoton force-pushed the 5205/ditch-invalid-docs branch from c6fe1f2 to 0a57d53 Compare July 9, 2024 10:12
@trinity-1686a trinity-1686a linked an issue Jul 10, 2024 that may be closed by this pull request
@@ -122,7 +148,7 @@ pub(super) async fn validate_doc_batch(
doc_mapper: Arc<dyn DocMapper>,
) -> IngestV2Result<(DocBatchV2, Vec<ParseFailure>)> {
if is_document_validation_enabled() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

out of scope for this PR: it might be worth thinking about disabling validation when VRL is present, or to do VRL validation early. I can imagine a document-doc_mapper-VRL combo where not having VRL here causes a document to be rejected, even though it would have been accepted during actual indexation

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

excellent point! I'll create an issue.

@fulmicoton fulmicoton merged commit a3f074e into main Jul 16, 2024
4 of 5 checks passed
@fulmicoton fulmicoton deleted the 5205/ditch-invalid-docs branch July 16, 2024 08:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Ditching invalid documents upon ingestion
2 participants