Spike: Investigate effect of spam prevention on dataset publishing in the Harvard Dataverse #221
Soner left the following message: "I wanted to bring this issue to your attention. Since we put in the spam filter service, my team and I have resolved over 100 tickets in the last 8 weeks or so: users can't publish their datasets, and we need to go in and publish perfectly fine datasets for them. Some users are OK with us doing this step; others are curious whether this new service is temporary or not, and annoyed by it. In some instances, we have to contact Leonid and have him whitelist their Dataverse so that users can publish their datasets and not contact us every time they make a minor change. I wanted to find out whether this spam filter service is temporary and whether a new solution is coming soon, so that we will no longer need to publish users' legitimate datasets for them. Sonia and Ceilyn know about this; we chatted a few weeks ago, but they haven't had a chance to bring it to your attention."
Moving issue to Needs Sizing column.
2023/07/17: This issue will be prioritized after the USCB dataset support has been resolved (due to resource constraints; Leonid will be participating in USCB support)
See also related issues/PRs to address the spike issue incrementally:
2023/08/28
I'm going to use this spike to document and discuss some incremental improvements to the content validation in prod (this has been discussed previously, but I can open a new issue for that instead). I am ready to switch to the new model of handling datasets for which an attempt to publish has triggered an alarm from the validation script; I just need confirmation that what's described below is OK (and that everybody in support is aware of the changes). Once the switch is made, users will no longer be instructed to contact support in order to get their dataset published. Instead, an RT ticket will be opened automatically, so that the dataset can be reviewed and published, or deleted, as needed. The following new message will be shown to the user:
An RT ticket will be opened in the standard
If any changes to the above text are needed, please let me know. Otherwise, we are ready to switch to this scheme.
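For context, a minimal sketch of the new scheme (the actual implementation is in Dataverse's Java code; the queue address, SMTP host, and helper name below are illustrative assumptions, not the real internals):

```python
# Sketch only: file the RT ticket on the user's behalf instead of asking
# them to write to support themselves.
import smtplib
from email.message import EmailMessage

RT_QUEUE_ADDRESS = "support@example.edu"  # assumed address that RT turns into tickets

def open_rt_ticket(dataset_pid: str, depositor_email: str) -> None:
    """Open a support ticket automatically when publish-time validation fails."""
    msg = EmailMessage()
    msg["To"] = RT_QUEUE_ADDRESS
    msg["From"] = depositor_email
    msg["Subject"] = f"Dataset flagged by content validation: {dataset_pid}"
    msg.set_content(
        f"Dataset {dataset_pid} failed content validation on publish.\n"
        "Please review it and either publish it or contact the depositor."
    )
    with smtplib.SMTP("localhost") as smtp:  # assumes a local MTA that relays to RT
        smtp.send_message(msg)
```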
I will be adding info about other potential changes.
Since it's possible that the support team won't publish the dataset, maybe because it's actually spam, what about editing the third sentence to say "The repository team has been notified and within 24 hours will either publish your dataset or contact you"?
@sbarbosadataverse I have configured it with your warning message (in my comment above). But if you want to change it, based on @jggautier's suggestion above or otherwise, just Slack me the new message, please; it can be modified instantly.
2023/12/18
To recap, earlier in the fall we applied the first batch of improvements, most importantly the one outlined above: we switched to automatically generating RT issues for datasets that trigger the filter (instead of instructing the users to contact support themselves). As the next phase, I'd like to discuss a couple of extra changes that could further streamline and simplify the process and (potentially) further reduce the bad user experience caused by false positives. The ideas below are the result of Slack discussions with members of the curation team.
This way, the only content we'll be validating will be the datasets in the top-level root collection, which is the only place where we allow a truly random person to walk in, open an account, and create a dataset. Am I missing anything? Are there any reasons any of the above is a bad idea?
So in short, perhaps we could trust new items in existing collections. Sure, worth a shot, I'd say.
... And the collections themselves. As of now, we are validating the collection metadata as well, when collections are published or updated. These checks generate false positives too. This is especially problematic with edits of already-published collections: we cannot use the approach of opening an RT ticket and having support review the changes, since there is no concept of versioning or drafts for collections.
@sbarbosadataverse Would you approve of the proposal I made here some time back? I posted about it in Slack channels too and got positive feedback from some support/curation members. It's quoted below, but in short: should we try to run the spam filter only on the datasets in the top-level root collection, and not on sub-collections, since those have to go through curation already?
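In pseudocode terms, the rule would be something like this sketch (a minimal Python illustration; the alias and helper names are assumptions, and the real check would live in Dataverse's Java publish workflow):

```python
# Sketch of the proposed rule: only validate datasets created directly in the
# root collection; sub-collections already go through curation, so their
# content is trusted.
ROOT_COLLECTION_ALIAS = "root"  # assumed alias of the top-level collection

def needs_spam_check(parent_collection_alias: str) -> bool:
    return parent_collection_alias == ROOT_COLLECTION_ALIAS
```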
2024/07/10
Opened a new Monitoring issue for spam in production:
Assigning to @sbarbosadataverse and @landreev so they can provide an update on status. Should this issue stay on hold? Something else?
I think some research into the effect of the spam prevention code on Harvard's repository might help us determine the urgency of improving how Harvard Dataverse handles spam, so that more people can continue sharing data as quickly as possible.
For example, could we try to get a better idea of how often the spam prevention is flagging and preventing the publication of non-spam, and why? We could look at the number of times that people have emailed the repository's support about this, since those emails are recorded in RT.
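If the RT instance exposes the REST 2.0 API (RT 5, or the REST2 extension), a rough count could even be pulled programmatically. The base URL, queue name, and subject filter below are assumptions that would need to match the actual RT setup:

```python
# Rough sketch: count support tickets that look like spam-filter escalations.
import requests

RT_BASE = "https://rt.example.edu/REST/2.0"  # hypothetical RT instance
RT_TOKEN = "..."  # an RT auth token

query = "Queue = 'dataverse_support' AND Subject LIKE 'content validation'"
resp = requests.get(
    f"{RT_BASE}/tickets",
    params={"query": query},
    headers={"Authorization": f"token {RT_TOKEN}"},
    timeout=30,
)
resp.raise_for_status()
print("matching tickets:", resp.json().get("total"))
```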
And some people affected by this might not email support. They might try to create a different dataset or they might abandon the dataset and try a different repository.
To get a better sense of how often this happens, we could find unpublished dataset versions that have been or would likely be flagged as potential spam (for example, because there are URLs in their description metadata fields, which the spam detection doesn't like) and try to learn from the depositors if they haven't published those versions because of the spam detection.
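One possible way to find such candidates, sketched below, is to query the Dataverse Search API for draft datasets and scan their description snippets for URLs. This assumes a superuser API token (drafts aren't visible anonymously) and omits pagination for brevity:

```python
# Sketch: list draft datasets whose descriptions contain URLs, since those
# are the ones most likely to trip the current heuristic.
import re
import requests

BASE = "https://dataverse.harvard.edu"
API_TOKEN = "..."  # superuser token, needed to see drafts

resp = requests.get(
    f"{BASE}/api/search",
    params={"q": "*", "type": "dataset",
            "fq": "publicationStatus:Draft", "per_page": 100},
    headers={"X-Dataverse-key": API_TOKEN},
    timeout=60,
)
resp.raise_for_status()
has_url = re.compile(r"https?://", re.IGNORECASE)
suspects = [item["global_id"]
            for item in resp.json()["data"]["items"]
            if has_url.search(item.get("description") or "")]
print(f"{len(suspects)} draft datasets with URLs in their descriptions")
```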
In recent Slack conversations, there was discussion about how to improve the spam detection so that fewer non-spam deposits are affected. It was suggested that Dataverse collections could be added to a safe list so that any dataset versions published in those collections would never be flagged as potential spam.
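A safe list could be as simple as a set of collection aliases checked at publish time. The sketch below is illustrative only, with hypothetical aliases and a simplified parent-chain lookup:

```python
# Sketch of the safe-list idea: datasets anywhere under a safe-listed
# collection skip the spam filter entirely.
SAFE_COLLECTION_ALIASES = {"curated-journals", "museum-collections"}  # hypothetical

def skip_spam_filter(collection_alias_chain: list[str]) -> bool:
    # collection_alias_chain: aliases from the dataset's parent collection up
    # to the root, e.g. ["some-lab", "curated-journals", "root"].
    return any(alias in SAFE_COLLECTION_ALIASES for alias in collection_alias_chain)
```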
If some unacceptable number of non-spam datasets deposited in "Root" are being flagged as spam, what could be done?