
Spike: Investigate effect of spam prevention on dataset publishing in the Harvard Dataverse #221

Open
jggautier opened this issue Apr 25, 2023 · 17 comments
Labels: Size: 3 (a percentage of a sprint), Status: Needs Input (applied to issues in need of input from someone currently unavailable)

Comments

@jggautier (Collaborator)

I think some research into the effect of the spam prevention code on Harvard's repository might help us determine the urgency of improving how Harvard Dataverse handles spam, so that more people can continue sharing data as quickly as possible.

For example, could we try to get a better idea of how often the spam prevention flags and blocks the publication of non-spam, and why? We could look at the number of times that people have emailed the repository's support about this, since those emails are recorded in RT (Request Tracker).
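
A rough sketch of how those tickets could be counted programmatically, via RT's classic REST 1.0 search endpoint (the RT hostname, credentials, queue name, and subject keywords below are assumptions and would need to be adjusted to how these tickets are actually titled):

```python
# Count dataverse_support tickets that mention publishing problems.
# Hypothetical RT instance URL and credentials; adjust the TicketSQL
# query to match the real ticket subjects.
import requests

RT_BASE = "https://rt.example.edu"
QUERY = "Queue = 'dataverse_support' AND Subject LIKE 'publish'"

resp = requests.get(
    f"{RT_BASE}/REST/1.0/search/ticket",
    params={"user": "apiuser", "pass": "secret", "query": QUERY, "format": "i"},
)
resp.raise_for_status()

# With format=i, the response is a status line followed by one
# "ticket/<id>" line per matching ticket.
ticket_ids = [line for line in resp.text.splitlines() if line.startswith("ticket/")]
print(f"{len(ticket_ids)} matching tickets")
```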

And some people affected by this might not email support. They might try to create a different dataset or they might abandon the dataset and try a different repository.

To get a better sense of how often this happens, we could find unpublished dataset versions that have been or would likely be flagged as potential spam (for example, because their description metadata fields contain URLs, which the spam detection doesn't like) and try to learn from the depositors whether the spam detection is the reason they haven't published those versions.
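
As a starting point, something like the following could surface draft datasets whose descriptions mention a URL, using the Dataverse Search API. This is a hedged sketch: the dsDescription field name and the publicationStatus:Draft filter are from memory and should be checked against the Search API docs, and a superuser API token would be needed to see other users' drafts.

```python
# Sketch: list draft datasets whose description mentions "http", as a
# rough proxy for "would likely trip the URL check in the spam filter".
import requests

BASE = "https://dataverse.harvard.edu"
API_TOKEN = "REPLACE_ME"  # superuser token, so unpublished drafts are visible

params = {
    "q": 'dsDescription:"http"',
    "type": "dataset",
    "fq": "publicationStatus:Draft",
    "per_page": 100,
    "start": 0,
}
headers = {"X-Dataverse-key": API_TOKEN}

while True:
    r = requests.get(f"{BASE}/api/search", params=params, headers=headers)
    r.raise_for_status()
    data = r.json()["data"]
    for item in data["items"]:
        print(item.get("global_id"), "-", item.get("name"))
    params["start"] += params["per_page"]
    if params["start"] >= data["total_count"]:
        break
```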

In recent Slack conversations, there was discussion about how to improve the spam detection so that fewer non-spam deposits are affected. It was suggested that Dataverse collections could be added to a safe list so that any dataset versions published in those collections would never be flagged as potential spam.

If some unacceptable number of non-spam datasets deposited in "Root" are being flagged as spam, what could be done?

@sbarbosadataverse sbarbosadataverse moved this to Harvard Dataverse Instance (Sonia) in IQSS Dataverse Project May 18, 2023
@sbarbosadataverse

Soner left the following message: "I wanted to bring this issue to your attention. Since we put in the spam filter service, my team and I have resolved over 100 tickets in the last 8 weeks or so: users can’t publish their datasets, and we need to go in and publish perfectly fine datasets for them. Some users are ok with us doing this step, and some are curious whether this new service is temporary or not... annoyed by it. In some instances, we have to contact Leonid and have him whitelist their Dataverse so that users can publish their datasets and not contact us every time they make a minor change. I wanted to find out if this spam filter service is temporary and if a new solution is coming soon, so that we will not need to publish users’ legit datasets for them. Sonia and Ceilyn know this; we chatted a few weeks ago, but they haven’t had a chance to bring it to your attention."

@cmbz (Collaborator) commented May 19, 2023

Moving issue to Needs Sizing column.
Once sized, it can be prioritized and worked on post DCM 2023.

@cmbz cmbz moved this from Harvard Dataverse Instance (Sonia) to SPRINT- NEEDS SIZING in IQSS Dataverse Project May 19, 2023
@cmbz (Collaborator) commented Jul 17, 2023

2023/07/17: This issue will be prioritized after the USCB dataset support has been resolved (due to resource constraints; Leonid will be participating in USCB support)

  • One possibility is for the system to notify the curation team whenever a dataset trips the spam filter. Then, the curation team could publish the dataset on behalf of the user. The user would receive a message indicating that the curation team will publish it. No RT ticket would need to be created by the user, and it could reduce negative interactions between users and Soner's team.

@cmbz (Collaborator) commented Jul 24, 2023

See also related issues/PRs to address the spike issue incrementally:

@cmbz (Collaborator) commented Aug 28, 2023

2023/08/28
To complete the work needed to improve the user experience of spam handling, we need two issues to be written, sized, and prioritized:

  • Implement a mechanism to automatically email support when a dataset is trapped in the spam filter
  • Update the "unable to publish" messaging workflow (the message should be updated once the automatic emails are sent to support and no longer go to the user)

@landreev (Collaborator) commented Sep 25, 2023

I'm going to use this spike to document and discuss some incremental improvements to the content validation in prod (this has been discussed previously, but I can open a new issue for that instead).

I am ready to switch to the new model of handling datasets for which an attempt to publish has triggered an alarm from the validation script; I just need confirmation that what's described below is OK (and that everybody in support is aware of the changes):

Once the switch is made, users will no longer be instructed to contact support in order to get their dataset published. Instead, an RT ticket will be opened automatically, so that the dataset can be reviewed and published, or deleted, as needed.

The following new message will be shown to the user:

This dataset did not pass our automated metadata validation scans and cannot be published right away. Please note that this may be in error. The repository team has been notified and will publish your dataset within 24 hours. No further action is required.

An RT ticket will be opened in the standard dataverse_support queue and will look as follows:

Title: Potential spam dataset doi:10.7910/DVN/ZZZZZ, please review

Filter triggered for dataset
https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/ZZZZZ
Please review and publish or delete, as needed.

(this ticket was opened by the automated content validator script)

If any changes to the above text are needed, please let me know. Otherwise, we are ready to switch to this scheme.
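
For reference, a minimal sketch of what the validator-side automation could look like, relying on the fact that RT opens a ticket from any email sent to a queue's address (the sender, queue address, and mail host below are assumptions, not the production values):

```python
# Open an RT ticket for a dataset that tripped the filter, by emailing
# the dataverse_support queue. Addresses are hypothetical placeholders.
import smtplib
from email.message import EmailMessage

def open_rt_ticket(doi: str) -> None:
    msg = EmailMessage()
    msg["From"] = "content-validator@dataverse.harvard.edu"  # hypothetical sender
    msg["To"] = "support@dataverse.harvard.edu"              # assumed queue address
    msg["Subject"] = f"Potential spam dataset doi:{doi}, please review"
    msg.set_content(
        "Filter triggered for dataset\n"
        f"https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:{doi}\n"
        "Please review and publish or delete, as needed.\n\n"
        "(this ticket was opened by the automated content validator script)"
    )
    with smtplib.SMTP("localhost") as smtp:  # assumes a local MTA relay
        smtp.send_message(msg)
```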

@landreev (Collaborator)

I will be adding info about other potential changes.

@jggautier (Collaborator, Author)

Since it's possible that the support team won't publish the dataset, maybe because it's actually spam, what about editing the third sentence to say "The repository team has been notified and within 24 hours will either publish your dataset or contact you"?

@landreev (Collaborator) commented Oct 2, 2023

@sbarbosadataverse I have configured it with your warning message (in my comment above). But if you want to change it, based on @jggautier's suggestion above or otherwise, just Slack me the new message, please; it can be modified instantly.

@cmbz (Collaborator) commented Dec 18, 2023

2023/12/18

  • Some improvements in spam handling have already been implemented, as outlined above.
  • @landreev has additional suggestions for improvements and will include them here.

@cmbz cmbz added the Size: 3 label and removed the Size: 0.5 label Dec 18, 2023
@cmbz cmbz moved this from SPRINT- NEEDS SIZING to SPRINT READY in IQSS Dataverse Project Dec 18, 2023
@landreev (Collaborator)

To recap, earlier in the fall we applied the first batch of improvements, most importantly the one outlined above: we switched to automatically generating RT tickets for datasets that trigger the filter (instead of instructing the users to contact support themselves).
Also, the whitelisting mechanisms have been extended: it is now possible to whitelist specific collections and users, in addition to datasets.

As the next phase, I'd like to discuss a couple of extra changes that could further streamline and simplify the process, and (potentially) minimize the bad user experience caused by false positives a bit more. The ideas below are the result of Slack discussions with members of the curation team.

  1. Consider disabling validation checks on collections altogether. We are still enforcing the policy of requiring most users to go through support in order to create collections. Is there a realistic danger that somebody who has convinced the support team that they are a legitimate data depositor will proceed to post spam? The only users who are allowed to create collections are those authenticated via HarvardKey and the institutional logins of a couple of other trusted schools. It may be safe-ish to assume that they are unlikely to create anything inappropriate either (?).
  2. By the same logic, should we consider disabling validation checks on the datasets in all the sub-collections (i.e., all the collections other than the top-level Harvard Dataverse collection)?

This way the only content we'll be validating will be the datasets in the top-level root collection, which is the only place where we allow a truly random person to walk in, open an account, and create a dataset.

Anything I'm missing? Any reasons any of the above is a bad idea?
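
To make the proposed scoping concrete, here is a minimal sketch of the decision logic (ROOT_ALIAS and the safelist sets are illustrative placeholders; the actual whitelisting lives in the production validator configuration):

```python
# Decide whether a dataset should go through spam/content validation.
# The alias and safelists below are illustrative assumptions.
ROOT_ALIAS = "root"  # assumed alias of the top-level Harvard Dataverse collection
SAFELISTED_COLLECTIONS: set[str] = set()  # collections whitelisted earlier
SAFELISTED_USERS: set[str] = set()        # users whitelisted earlier

def should_validate(owner_collection_alias: str, user_id: str) -> bool:
    if user_id in SAFELISTED_USERS:
        return False  # whitelisted user
    if owner_collection_alias in SAFELISTED_COLLECTIONS:
        return False  # whitelisted collection
    # Proposal: only datasets deposited directly into the root collection
    # are validated; sub-collections already go through curation/support.
    return owner_collection_alias == ROOT_ALIAS
```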

@pdurbin (Member) commented Dec 19, 2023

So in short, perhaps we could trust new items in existing collections. Sure, worth a shot, I'd say.

@landreev (Collaborator)

So in short, perhaps we could trust new items in existing collections.

... And the collections themselves. As of now, we are validating the collection metadata as well, when collections are published or updated. These checks generate false positives too. This is especially problematic with edits of already published collections: we cannot use the approach of opening an RT ticket and having support review the changes, since there is no concept of versioning or drafts for collections.

@landreev landreev moved this from SPRINT READY to This Sprint 🏃‍♀️ 🏃 in IQSS Dataverse Project Jan 31, 2024
@landreev landreev self-assigned this Jan 31, 2024
@landreev landreev moved this from This Sprint 🏃‍♀️ 🏃 to In Progress 💻 in IQSS Dataverse Project Apr 8, 2024
@landreev (Collaborator) commented Apr 10, 2024

@sbarbosadataverse Would you approve of this proposal I made here some time back? I posted about it in Slack channels too and got positive feedback from some support/curation members. It's quoted below, but in short: should we try to run the spam filter only on the datasets in the top-level root collection, and not on sub-collections, since those already have to go through curation?

To recap, earlier in the fall we applied the first batch of improvements, most importantly the one outlined above: we switched to automatically generating RT tickets for datasets that trigger the filter (instead of instructing the users to contact support themselves). Also, the whitelisting mechanisms have been extended: it is now possible to whitelist specific collections and users, in addition to datasets.

As the next phase, I'd like to discuss a couple of extra changes that could further streamline and simplify the process, and (potentially) minimize the bad user experience caused by false positives a bit more. The ideas below are the result of Slack discussions with members of the curation team.

  1. Consider disabling validation checks on collections altogether. We are still enforcing the policy of requiring most users to go through support in order to create collections. Is there a realistic danger that somebody who has convinced the support team that they are a legitimate data depositor will proceed to post spam? The only users who are allowed to create collections are those authenticated via HarvardKey and the institutional logins of a couple of other trusted schools. It may be safe-ish to assume that they are unlikely to create anything inappropriate either (?).

  2. By the same logic, should we consider disabling validation checks on the datasets in all the sub-collections (i.e., all the collections other than the top-level Harvard Dataverse collection)?

This way the only content we'll be validating will be the datasets in the top-level root collection, which is the only place where we allow a truly random person to walk in, open an account, and create a dataset.

Anything I'm missing? Any reasons any of the above is a bad idea?

@cmbz cmbz added the Status: Needs Input label Apr 25, 2024
@landreev landreev moved this from In Progress 💻 to This Sprint 🏃‍♀️ 🏃 in IQSS Dataverse Project May 31, 2024
@cmbz cmbz added the FY24 Sprint 26 label Jun 20, 2024
@cmbz cmbz moved this from This Sprint 🏃‍♀️ 🏃 to SPRINT READY in IQSS Dataverse Project Jun 20, 2024
@cmbz cmbz removed the FY24 Sprint 26 label Jun 20, 2024
@cmbz cmbz moved this from SPRINT READY to On Hold ⌛ in IQSS Dataverse Project Jul 10, 2024
@cmbz (Collaborator) commented Jul 10, 2024

2024/07/10

@sbarbosadataverse commented Jul 17, 2024

Opened a new Monitoring issue for spam in production:

@cmbz (Collaborator) commented Sep 4, 2024

Assigning to @sbarbosadataverse and @landreev so they can provide an update on status. Should this issue stay on hold? Something else?
