Get an overview of filters:
Figure out the differences between the PII taggers (which ones are most relevant). We might simply start with the fastest one.
Test the NSFW filters (is what they filter out problematic for Danish?) (minor)
Test the hate-speech filters (is what they filter out problematic for Danish?) (minor)
Check that the filters are valid
Apply all filters to DAGW (the whole dataset) and see whether any of them are problematic (none should filter out an extreme amount of data, as we consider the dataset fairly clean)
E.g. we are concerned that the current Gopher filters use "English" thresholds, which might be problematic for us.
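As a rough illustration of that concern, here is a minimal sketch of one Gopher-style rule (keep a document only if it contains at least two of a fixed list of English stop words) and the share of documents it would remove. The English list is the one from the Gopher paper. The Danish list, the example sentences and the function names are assumptions made for illustration, and this is not the Dolma tagger implementation.

```python
# English stop words used by the Gopher stop-word rule.
ENGLISH_STOP_WORDS = {"the", "be", "to", "of", "and", "that", "have", "with"}
# Hypothetical Danish counterpart, chosen only for this illustration.
DANISH_STOP_WORDS = {"og", "i", "at", "det", "en", "den", "til", "er"}


def passes_stop_word_rule(text, stop_words, min_distinct=2):
    """Return True if the text contains at least `min_distinct` distinct stop words."""
    words = set(text.lower().split())
    return len(words & stop_words) >= min_distinct


def removal_rate(docs, stop_words):
    """Fraction of documents that the stop-word rule would remove."""
    if not docs:
        return 0.0
    removed = sum(not passes_stop_word_rule(doc, stop_words) for doc in docs)
    return removed / len(docs)


docs = [
    "Det er en god dag, og vi skal til stranden i morgen.",
    "Jeg tror, at den nye lov er en forbedring.",
    "The cat sat on the mat and looked at the dog with interest.",
]

english_rate = removal_rate(docs, ENGLISH_STOP_WORDS)
danish_rate = removal_rate(docs, DANISH_STOP_WORDS)
print(f"removed with English list: {english_rate:.2f}")
print(f"removed with Danish list:  {danish_rate:.2f}")
```

With the English list, both Danish sentences are removed. Running the same kind of count per filter on DAGW would show which filters remove an implausibly large share of a dataset we consider clean.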
Start applying filters to the dataset
Format the datasets as jsonl.gz and apply the filters to them.
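The formatting step could look roughly like the sketch below, which writes one JSON object per line into a gzip-compressed file and reads it back. The field names (`id`, `text`, `source`), the helper names and the example records are assumptions for illustration; check them against the document format the filtering tool actually expects.

```python
import gzip
import json
import os
import tempfile


def write_jsonl_gz(records, path):
    """Write each record as one JSON line to a gzip-compressed file."""
    with gzip.open(path, "wt", encoding="utf-8") as f:
        for record in records:
            # ensure_ascii=False keeps æ, ø and å readable in the output.
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_jsonl_gz(path):
    """Read a gzip-compressed JSON Lines file back into a list of dicts."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Hypothetical example documents.
documents = [
    {"id": "doc-0", "text": "Dette er et eksempel.", "source": "dagw-example"},
    {"id": "doc-1", "text": "Endnu et dokument med æ, ø og å.", "source": "dagw-example"},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example.jsonl.gz")
    write_jsonl_gz(documents, path)
    roundtrip = read_jsonl_gz(path)
```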
Decide on reasonable thresholds
Once we have run the analysis, we would like to set a reasonable set of starting thresholds (specified in the config), which we can then vary.
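One possible shape for such a config is sketched below: a set of default thresholds that individual runs can override. The key names and helper functions are hypothetical. The default values are the word-count and mean-word-length bounds published for the Gopher rules and are meant only as a starting point, not as values validated for Danish.

```python
# Starting thresholds taken from the published Gopher rules.
DEFAULT_THRESHOLDS = {
    "min_words": 50,
    "max_words": 100_000,
    "min_mean_word_length": 3.0,
    "max_mean_word_length": 10.0,
}


def load_thresholds(overrides=None):
    """Merge per-run overrides into the defaults, rejecting unknown keys."""
    thresholds = dict(DEFAULT_THRESHOLDS)
    for key, value in (overrides or {}).items():
        if key not in thresholds:
            raise KeyError(f"unknown threshold: {key}")
        thresholds[key] = value
    return thresholds


def keep_document(text, thresholds):
    """Return True if the document is within the word-count and word-length bounds."""
    words = text.split()
    if not words:
        return False
    mean_length = sum(len(word) for word in words) / len(words)
    return (
        thresholds["min_words"] <= len(words) <= thresholds["max_words"]
        and thresholds["min_mean_word_length"]
        <= mean_length
        <= thresholds["max_mean_word_length"]
    )


sample = "Vi analyserer hele datasættet grundigt"
kept_with_defaults = keep_document(sample, load_thresholds())
kept_with_override = keep_document(sample, load_thresholds({"min_words": 3}))
```

Keeping the thresholds in one place makes it straightforward to rerun the analysis with different values and compare how much data each setting removes.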
@peterbjorgensen does this seem like a reasonable approach to you as well?
One of the taggers in Dolma uses Microsoft Presidio for PII: https://microsoft.github.io/presidio/
It is a bit unclear to me (even after reading their documentation) whether it works for Danish.
But I think it boils down to whether a PII classifier exists for Danish. The framework can use spaCy, Stanza and transformers models.