Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Replace HTMLPurifier with a maintained dependency #7916

Closed
asmecher opened this issue May 9, 2022 · 17 comments
Closed

Replace HTMLPurifier with a maintained dependency #7916

asmecher opened this issue May 9, 2022 · 17 comments
Assignees
Labels
Housekeeping:1:Todo Any dependency management or refactor that would be nice to have some day.
Milestone

Comments

@asmecher
Copy link
Member

asmecher commented May 9, 2022

HTMLPurifier is not maintained anymore. Find another replacement toolset for "safe" HTML filtering.

It does not support HTML5: https://forum.pkp.sfu.ca/t/tinymce-stripping-anchor-tags-within-or-around-divs/72749/12

We currently use HTMLPurifier in the PKPString::stripUnsafeHtml function (lib/pkp/classes/core/PKPString.inc.php). This takes a configuration of allowed elements/attributes from config.inc.php.

PRs
pkp-lib --> #9257
ojs --> pkp/ojs#4025 [TEST ONLY]
omp --> pkp/omp#1581 [TEST ONLY]
ops --> pkp/ops#696 [TEST ONLY]
jatsTemplate --> pkp/jatsTemplate#30
oaiJats --> pkp/oaiJats#36

Update
Based on following reasons we have decided to revert back the Symfony HTML Sanitizer and keep using the HTMLPurifier

Reverting Back HTMLPurifier
pkp-lib --> #10197
ojs --> pkp/ojs#4365 [TEST ONLY]
omp --> pkp/omp#1634 [TEST ONLY]
ops --> pkp/ops#729 [TEST ONLY]
jatsTemplate --> pkp/jatsTemplate#46
oaiJats --> pkp/oaiJats#46

@touhidurabir
Copy link
Member

There are few alternative packages/libraries to choose from as HTMLPurifier is still the most common and widely used one even if it's not maintained properly and there has been no major upgrade of it . Most the of popular ones are also kind of based on that one which make those more like a wrapper over it . Few available options are (based on initial R&D)

  1. xemlock/htmlpurifier-html5 . This one also mentioned in the pkp forum at https://forum.pkp.sfu.ca/t/tinymce-stripping-anchor-tags-within-or-around-divs/72749/12 . As this allow compatibility of HTML5, it's still a wrapper over the not maintained HTMLPurifier . However one big advantage of this one is that this does not require much modification of current implementation also any modification or option addition to the config options .

  2. tgalopin/html-sanitizer . This one seems relatively new comparing to HTMLPurifier (first release at 2018) and seems to be maintained regularly . It has much better options and compatibility for HTML5 also have better implementation. This one also available as Symfony Bundle so it may have high chance to be maintained properly . The only downside of this one is to make it work with our current config option implementation which may need more parsing before using with this package or we may need to add more config option in the config file . Need more research .

  3. symfony/html-sanitizer . This one maintained by Symfony Team and only released with Symfony 6 . This one has a very fluent API to work with and high chance of maintained properly by Symfony Team. But even the initial released version is locked to PHP 8.1+ so it will need to be used with a custom forked version to make it work for the 3.4 release target which aims to work with PHP 8.0+ . And even with that , there may be additional workload the maintain the forked version up to date with future release .

@asmecher @NateWr few stuff we need to figure out first

  1. what is the backward compatibility requirement for this enhancement/modification ?
  2. up to which PHP version ?
  3. How much of config option modification/addition we may allow ? Or should we try to keep the config option exact as same for this enhancement ?

personally I would like to go with option 2 or 3 if we target for a long term supported package but if we just need to make it HTML5 compatible , option 1 may work for now . However , as option 1 is based on old HTMLPurifier , it may not be maintained properly and security issue may arise . Need more research .

@asmecher
Copy link
Member Author

Thanks, @touhidurabir! To summarize our convo today:

  • [xemlock/htmlpurifier-html5](https://github.com/xemlock/htmlpurifier-html5) is a fork and I wouldn't normally recommend that approach but if we have a long-term solution that requires a stopgap (e.g. because of too high a PHP baseline) it's worth considering. (I don't see a formal statement anywhere that HTMLPurifier is not being maintained, but the release schedule sure doesn't look good.)
  • [tgalopin/html-sanitizer](https://github.com/tgalopin/html-sanitizer) looks good but is a personal repo (does not seem institutionally supported). Not my first choice, though it has promising uptake.
  • [symfony/html-sanitizer](https://github.com/symfony/html-sanitizer) -- we both agreed that something adopted/maintained by a big project like Symfony seemed like the best approach. However, as you noted, the PHP8.1 requirement is currently steep.

Decisions

  • We don't have a lot of calls for HTMLPurifier to be replaced -- it seems safe/mature enough to continue using, and it's still getting PHP compatibility fixes merged into Github. We can cautiously continue to use it for the time being.
  • @touhidurabir will take a quick look to see if we can determine the maintenance roadmap for [symfony/html-sanitizer](https://github.com/symfony/html-sanitizer). We could probably bump our PHP requirements for 3.6 (or possibly even 3.5) to PHP8.1 -- but if [symfony/html-sanitizer](https://github.com/symfony/html-sanitizer) is already on to a newer, more aggressive requirement by then, it may force our community into a difficult position.
  • If [symfony/html-sanitizer](https://github.com/symfony/html-sanitizer)'s maintenance roadmap is compatible enough with ours, I'd recommend we consider adopting it later (3.5 or 3.6). This would mean waiting to implement it until that branch exists.

@touhidurabir
Copy link
Member

touhidurabir commented May 12, 2022

@asmecher if this is not a higher priority right now , we can get back to this one later when we caught up with PHP 8.1+ . Another option is to use a forked version of symfony/html-sanitizer that we make compatible with PHP 8.0+ but even so this package by the Symfony Team is still in BETA state and that why it's still not included in their available component list at https://symfony.com/components . And maintaining a custom forked version may turn into a way more work that it should be.

Now there is no such info about how the future version will integrate newer version of PHP . So we are not sure if this package will have an aggressive PHP version upgrade policy but we can always use a lower release version. For example, if we look at https://symfony.com/releases/6.0 which support 8.0+ but the https://symfony.com/releases/6.1 will require 8.1+ which seems pretty aggressive . Some more general details can be found at https://symfony.com/doc/current/contributing/community/releases.html .

Also current LTS version of Symfony is 5.4 which will be supported for bug fix till 2024 and security fix till 2025 so there is a high chance we may not get a new LTS before 2024. So basically it's almost impossible to determine if we can have this symfony/html-sanitizer package for latest version with PHP version compatible for our next release so we may have to use a lower release version then.

few more details at https://symfony.com/blog/new-in-symfony-6-1-htmlsanitizer-component

@asmecher asmecher added this to the 3.5 milestone May 19, 2022
@NateWr NateWr added the Housekeeping:1:Todo Any dependency management or refactor that would be nice to have some day. label May 30, 2022
touhidurabir added a commit to touhidurabir/pkp-lib that referenced this issue May 16, 2023
touhidurabir added a commit to touhidurabir/ojs that referenced this issue May 16, 2023
touhidurabir added a commit to touhidurabir/pkp-lib that referenced this issue May 17, 2023
touhidurabir added a commit to touhidurabir/ojs that referenced this issue May 17, 2023
touhidurabir added a commit to touhidurabir/pkp-lib that referenced this issue May 17, 2023
touhidurabir added a commit to touhidurabir/ojs that referenced this issue May 17, 2023
@touhidurabir
Copy link
Member

@asmecher can you check the PR at here (still need to pass the tests) . using symfony/html-sanitizer , we have few cases to consider

  1. we do not need to add any config for utf-8 as it's accept only utf-8 .
  2. we do not need to add config for Doctype as it's set to HTML5
  3. it does not have a cache way
  4. it's based on HTML Sanitizer W3C Standard Proposal
  5. We need to use strip_tags for unsupported tags first as it removed the any tags with content within it which not defined as allowed .

I find the CASE:5 a bit issue for us . but need bit more investigation .

touhidurabir added a commit to touhidurabir/pkp-lib that referenced this issue May 18, 2023
touhidurabir added a commit to touhidurabir/ojs that referenced this issue May 18, 2023
touhidurabir added a commit to touhidurabir/ojs that referenced this issue May 18, 2023
touhidurabir added a commit to touhidurabir/pkp-lib that referenced this issue Jun 16, 2023
touhidurabir added a commit to touhidurabir/pkp-lib that referenced this issue Jun 16, 2023
touhidurabir added a commit to touhidurabir/pkp-lib that referenced this issue Jun 16, 2023
touhidurabir added a commit to touhidurabir/pkp-lib that referenced this issue Jun 16, 2023
touhidurabir added a commit to touhidurabir/ojs that referenced this issue Jun 16, 2023
touhidurabir added a commit to touhidurabir/ojs that referenced this issue Jun 16, 2023
touhidurabir added a commit to touhidurabir/pkp-lib that referenced this issue Jun 19, 2023
touhidurabir added a commit to touhidurabir/pkp-lib that referenced this issue Jun 19, 2023
asmecher pushed a commit to pkp/jatsTemplate that referenced this issue Jun 12, 2024
asmecher pushed a commit that referenced this issue Jun 12, 2024
* #7916 Added core PKPHtmlSanitizer class

* #7916 fixed empty string issue for DOMDocument::loadHTML

* #7916 removed dom node traversing and implememted W3C conigurations

* #7916 updated implementation with necessary doc blocks and sanitizer property cache

* #7916 patch update after rebase

* #7916 removed html_entity_decode and removed config passing to core sanitization class

* #7916 removed sanitizer config setter and getter
@touhidurabir
Copy link
Member

With the latest implementation, all previous issue should be resolved . Closing this now but will reopen if any issue arise again .

@github-project-automation github-project-automation bot moved this from Todo to Done in Infrastructure Jun 13, 2024
@touhidurabir
Copy link
Member

@asmecher I think we need to return back the HTML Purifier as I have found an issue with symfony html sanitizer . Please take a look at the issue [Html Sanitizer] Sanitization remove single < (less than) char which is not part of any tag .

symfony html sanitizer is production grade package that ships as part of symfony bundler but I think this one does not suit all our cases . Or may be it need a bit more time to mature and we jump the gun here .

@asmecher
Copy link
Member Author

asmecher commented Jul 4, 2024

Re-opening for consideration -- we may need to revert to HTMLPurifier if Symfony sanitizer doesn't better meet our needs by wrap-up time for 3.5.

@asmecher asmecher reopened this Jul 4, 2024
@github-project-automation github-project-automation bot moved this from Done to Todo in Infrastructure Jul 4, 2024
touhidurabir added a commit to touhidurabir/pkp-jatsTemplate that referenced this issue Jul 12, 2024
touhidurabir added a commit to touhidurabir/pkp-oaiJats that referenced this issue Jul 12, 2024
@touhidurabir
Copy link
Member

@asmecher please check the reversion PRs at #7916 (comment) , see the section Reverting Back HTMLPurifier .

asmecher pushed a commit to pkp/jatsTemplate that referenced this issue Jul 16, 2024
asmecher pushed a commit to pkp/oaiJats that referenced this issue Jul 16, 2024
@asmecher
Copy link
Member Author

Thanks, @touhidurabir! All is merged, and I took care of the submodule updates manually.

@github-project-automation github-project-automation bot moved this from Todo to Done in Infrastructure Jul 16, 2024
@asmecher asmecher modified the milestones: 3.5.0 LTS, 3.5 Internal Jul 16, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Housekeeping:1:Todo Any dependency management or refactor that would be nice to have some day.
Projects
Status: Done
Development

No branches or pull requests

4 participants