
Spike: Investigate effect of spam prevention on dataset publishing in the Harvard Dataverse #221

Open
jggautier opened this issue Apr 25, 2023 · 17 comments
Labels: Size: 3 (a percentage of a sprint), Status: Needs Input (applied to issues in need of input from someone currently unavailable)

Comments

@jggautier (Collaborator)

I think some research into the effect of the spam prevention code on Harvard's repository might help us determine the urgency of improving how Harvard Dataverse handles spam, so that more people can continue sharing data as quickly as possible.

For example, could we try to get a better idea of how often the spam prevention flags and blocks the publication of non-spam, and why? We could look at the number of times that people have emailed the repository's support about this, since those emails are recorded in RT (Request Tracker).
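
A rough sketch of how those tickets could be counted programmatically, via RT's classic REST 1.0 search endpoint (the RT hostname, credentials, queue name, and subject keywords below are assumptions and would need to be adjusted to how these tickets are actually titled):

```python
# Count dataverse_support tickets that mention publishing problems.
# Hypothetical RT instance URL and credentials; adjust the TicketSQL
# query to match the real ticket subjects.
import requests

RT_BASE = "https://rt.example.edu"
QUERY = "Queue = 'dataverse_support' AND Subject LIKE 'publish'"

resp = requests.get(
    f"{RT_BASE}/REST/1.0/search/ticket",
    params={"user": "apiuser", "pass": "secret", "query": QUERY, "format": "i"},
)
resp.raise_for_status()

# With format=i, the response is a status line followed by one
# "ticket/<id>" line per matching ticket.
ticket_ids = [line for line in resp.text.splitlines() if line.startswith("ticket/")]
print(f"{len(ticket_ids)} matching tickets")
```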

And some people affected by this might not email support. They might try to create a different dataset or they might abandon the dataset and try a different repository.

To get a better sense of how often this happens, we could find unpublished dataset versions that have been or would likely be flagged as potential spam (for example, because their description metadata fields contain URLs, which the spam detection doesn't like) and try to learn from the depositors whether the spam detection is the reason they haven't published those versions.
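
As a starting point, something like the following could surface draft datasets whose descriptions mention a URL, using the Dataverse Search API. This is a hedged sketch: the dsDescription field name and the publicationStatus:Draft filter are from memory and should be checked against the Search API docs, and a superuser API token would be needed to see other users' drafts.

```python
# Sketch: list draft datasets whose description mentions "http", as a
# rough proxy for "would likely trip the URL check in the spam filter".
import requests

BASE = "https://dataverse.harvard.edu"
API_TOKEN = "REPLACE_ME"  # superuser token, so unpublished drafts are visible

params = {
    "q": 'dsDescription:"http"',
    "type": "dataset",
    "fq": "publicationStatus:Draft",
    "per_page": 100,
    "start": 0,
}
headers = {"X-Dataverse-key": API_TOKEN}

while True:
    r = requests.get(f"{BASE}/api/search", params=params, headers=headers)
    r.raise_for_status()
    data = r.json()["data"]
    for item in data["items"]:
        print(item.get("global_id"), "-", item.get("name"))
    params["start"] += params["per_page"]
    if params["start"] >= data["total_count"]:
        break
```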

In recent Slack conversations, there was discussion about how to improve the spam detection so that fewer non-spam deposits are affected. It was suggested that Dataverse collections could be added to a safe list so that any dataset versions published in those collections would never be flagged as potential spam.

If some unacceptable number of non-spam datasets deposited in "Root" are being flagged as spam, what could be done?

@sbarbosadataverse sbarbosadataverse moved this to Harvard Dataverse Instance (Sonia) in IQSS Dataverse Project May 18, 2023
@sbarbosadataverse

Soner left the following message: "I wanted to bring this issue to your attention. Since we put in the spam filter service, my team and I have resolved over 100 tickets in the last 8 weeks or so: users can’t publish their datasets, and we need to go in and publish perfectly fine datasets for them. Some users are ok with us doing this step, and some are curious whether this new service is temporary or not... annoyed by it. In some instances, we have to contact Leonid and have him whitelist their Dataverse so that users can publish their datasets and not contact us every time they make a minor change. I wanted to find out if this spam filter service is temporary and if a new solution is coming soon, so that we will not need to publish users’ legit datasets for them. Sonia and Ceilyn know this; we chatted a few weeks ago, but they haven’t had a chance to bring it to your attention."

@cmbz (Collaborator) commented May 19, 2023

Moving issue to Needs Sizing column.
Once sized, it can be prioritized and worked on post DCM 2023.

@cmbz cmbz moved this from Harvard Dataverse Instance (Sonia) to SPRINT- NEEDS SIZING in IQSS Dataverse Project May 19, 2023
@cmbz (Collaborator) commented Jul 17, 2023

2023/07/17: This issue will be prioritized after the USCB dataset support has been resolved (due to resource constraints; Leonid will be participating in USCB support)

  • One possibility is for the system to notify the curation team whenever a dataset trips the spam filter. Then, the curation team could publish the dataset on behalf of the user. The user would receive a message indicating that the curation team will publish it. No RT ticket would need to be created by the user, and it could reduce negative interactions between users and Soner's team.

@cmbz (Collaborator) commented Jul 24, 2023

See also related issues/PRs to address the spike issue incrementally:

@cmbz (Collaborator) commented Aug 28, 2023

2023/08/28
To complete the work needed to improve the user experience of spam handling, we need two issues to be written, sized, and prioritized:

  • Implement a mechanism to automatically email support when a dataset is trapped in the spam filter
  • Update the "unable to publish" messaging workflow (the message should be updated once the automatic emails are sent to support and no longer go to the user)

@landreev (Collaborator) commented Sep 25, 2023

I'm going to use this spike to document and discuss some incremental improvements to the content validation in prod (this has been discussed previously, but I can open a new issue for that instead).

I am ready to switch to the new model of handling datasets for which an attempt to publish has triggered an alarm from the validation script; I just need confirmation that what's described below is OK (and that everybody in support is aware of the changes):

Once the switch is made, users will no longer be instructed to contact support in order to get their dataset published. Instead, an RT ticket will be opened automatically, so that the dataset can be reviewed and published, or deleted, as needed.

The following new message will be shown to the user:

This dataset did not pass our automated metadata validation scans and cannot be published right away. Please note that this may be in error. The repository team has been notified and will publish your dataset within 24 hours. No further action is required.

An RT ticket will be opened in the standard dataverse_support queue and will look as follows:

Title: Potential spam dataset doi:10.7910/DVN/ZZZZZ, please review

Filter triggered for dataset
https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/ZZZZZ
Please review and publish or delete, as needed.

(this ticket was opened by the automated content validator script)

If any changes to the above text are needed, please let me know. Otherwise, we are ready to switch to this scheme.
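
For reference, a minimal sketch of what the validator-side automation could look like, relying on the fact that RT opens a ticket from any email sent to a queue's address (the sender, queue address, and mail host below are assumptions, not the production values):

```python
# Open an RT ticket for a dataset that tripped the filter, by emailing
# the dataverse_support queue. Addresses are hypothetical placeholders.
import smtplib
from email.message import EmailMessage

def open_rt_ticket(doi: str) -> None:
    msg = EmailMessage()
    msg["From"] = "content-validator@dataverse.harvard.edu"  # hypothetical sender
    msg["To"] = "support@dataverse.harvard.edu"              # assumed queue address
    msg["Subject"] = f"Potential spam dataset doi:{doi}, please review"
    msg.set_content(
        "Filter triggered for dataset\n"
        f"https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:{doi}\n"
        "Please review and publish or delete, as needed.\n\n"
        "(this ticket was opened by the automated content validator script)"
    )
    with smtplib.SMTP("localhost") as smtp:  # assumes a local MTA relay
        smtp.send_message(msg)
```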

@landreev (Collaborator)

I will be adding info about other potential changes.

@jggautier (Collaborator, Author)

Since it's possible that the support team won't publish the dataset, maybe because it's actually spam, what about editing the third sentence to say "The repository team has been notified and within 24 hours will either publish your dataset or contact you"?

@landreev (Collaborator) commented Oct 2, 2023

@sbarbosadataverse I have configured it with your warning message (in my comment above). But if you want to change it, based on @jggautier's suggestion above or otherwise, just Slack me the new message, please; it can be modified instantly.

@cmbz (Collaborator) commented Dec 18, 2023

2023/12/18

  • Some improvements in spam handling have already been implemented, as outlined above.
  • @landreev has additional suggestions for improvements and will include them here.

@cmbz cmbz added the Size: 3 label and removed the Size: 0.5 label Dec 18, 2023
@cmbz cmbz moved this from SPRINT- NEEDS SIZING to SPRINT READY in IQSS Dataverse Project Dec 18, 2023
@landreev (Collaborator)

To recap, earlier in the fall we applied the first batch of improvements, most importantly the one outlined above: we switched to automatically generating RT tickets for datasets that trigger the filter (instead of instructing the users to contact support themselves).
Also, the whitelisting mechanisms have been extended: it is now possible to whitelist specific collections and users, in addition to datasets.

As the next phase, I'd like to discuss a couple of extra changes that could further streamline and simplify the process, and (potentially) minimize the bad user experience caused by false positives a bit more. The ideas below are the result of Slack discussions with members of the curation team.

  1. Consider disabling validation checks on collections altogether. We are still enforcing the policy of requiring most users to go through support in order to create collections. Is there a realistic danger that somebody who has convinced the support team that they are a legitimate data depositor will proceed to post spam? The only users who are allowed to create collections are those authenticated via HarvardKey and the institutional logins of a couple of other trusted schools. It may be safe-ish to assume that they are unlikely to create anything inappropriate either (?).
  2. By the same logic, should we consider disabling validation checks on the datasets in all the sub-collections (i.e., all the collections other than the top-level Harvard Dataverse collection)?

This way the only content we'll be validating will be the datasets in the top-level root collection, which is the only place where we allow a truly random person to walk in, open an account, and create a dataset.

Anything I'm missing? Any reasons any of the above is a bad idea?
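
To make the proposed scoping concrete, here is a minimal sketch of the decision logic (ROOT_ALIAS and the safelist sets are illustrative placeholders; the actual whitelisting lives in the production validator configuration):

```python
# Decide whether a dataset should go through spam/content validation.
# The alias and safelists below are illustrative assumptions.
ROOT_ALIAS = "root"  # assumed alias of the top-level Harvard Dataverse collection
SAFELISTED_COLLECTIONS: set[str] = set()  # collections whitelisted earlier
SAFELISTED_USERS: set[str] = set()        # users whitelisted earlier

def should_validate(owner_collection_alias: str, user_id: str) -> bool:
    if user_id in SAFELISTED_USERS:
        return False  # whitelisted user
    if owner_collection_alias in SAFELISTED_COLLECTIONS:
        return False  # whitelisted collection
    # Proposal: only datasets deposited directly into the root collection
    # are validated; sub-collections already go through curation/support.
    return owner_collection_alias == ROOT_ALIAS
```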

@pdurbin (Member) commented Dec 19, 2023

So in short, perhaps we could trust new items in existing collections. Sure, worth a shot, I'd say.

@landreev (Collaborator)

So in short, perhaps we could trust new items in existing collections.

... And the collections themselves. As of now, we are validating the collection metadata as well, when collections are published or updated. These checks generate false positives too. This is especially problematic with edits of already published collections: we cannot use the approach of opening an RT ticket and having support review the changes, since there is no concept of versioning or drafts for collections.

@landreev landreev moved this from SPRINT READY to This Sprint 🏃‍♀️ 🏃 in IQSS Dataverse Project Jan 31, 2024
@landreev landreev self-assigned this Jan 31, 2024
@landreev landreev moved this from This Sprint 🏃‍♀️ 🏃 to In Progress 💻 in IQSS Dataverse Project Apr 8, 2024
@landreev (Collaborator) commented Apr 10, 2024

@sbarbosadataverse Would you approve of this proposal I made here some time back? I posted about it in Slack channels too and got positive feedback from some support/curation members. It's quoted below, but in short: should we try to run the spam filter only on the datasets in the top-level root collection, and not on sub-collections, since those already have to go through curation?

To recap, earlier in the fall we applied the first batch of improvements, most importantly the one outlined above: we switched to automatically generating RT tickets for datasets that trigger the filter (instead of instructing the users to contact support themselves). Also, the whitelisting mechanisms have been extended: it is now possible to whitelist specific collections and users, in addition to datasets.

As the next phase, I'd like to discuss a couple of extra changes that could further streamline and simplify the process, and (potentially) minimize the bad user experience caused by false positives a bit more. The ideas below are the result of Slack discussions with members of the curation team.

  1. Consider disabling validation checks on collections altogether. We are still enforcing the policy of requiring most users to go through support in order to create collections. Is there a realistic danger that somebody who has convinced the support team that they are a legitimate data depositor will proceed to post spam? The only users who are allowed to create collections are those authenticated via HarvardKey and the institutional logins of a couple of other trusted schools. It may be safe-ish to assume that they are unlikely to create anything inappropriate either (?).

  2. By the same logic, should we consider disabling validation checks on the datasets in all the sub-collections (i.e., all the collections other than the top-level Harvard Dataverse collection)?

This way the only content we'll be validating will be the datasets in the top-level root collection, which is the only place where we allow a truly random person to walk in, open an account, and create a dataset.

Anything I'm missing? Any reasons any of the above is a bad idea?

@cmbz cmbz added the Status: Needs Input label Apr 25, 2024
@landreev landreev moved this from In Progress 💻 to This Sprint 🏃‍♀️ 🏃 in IQSS Dataverse Project May 31, 2024
@cmbz cmbz added the FY24 Sprint 26 label Jun 20, 2024
@cmbz cmbz moved this from This Sprint 🏃‍♀️ 🏃 to SPRINT READY in IQSS Dataverse Project Jun 20, 2024
@cmbz cmbz removed the FY24 Sprint 26 label Jun 20, 2024
@cmbz cmbz moved this from SPRINT READY to On Hold ⌛ in IQSS Dataverse Project Jul 10, 2024
@cmbz (Collaborator) commented Jul 10, 2024

2024/07/10

@sbarbosadataverse commented Jul 17, 2024

Opened a new Monitoring issue for spam in production:

@cmbz (Collaborator) commented Sep 4, 2024

Assigning to @sbarbosadataverse and @landreev so they can provide an update on status. Should this issue stay on hold? Something else?
