Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

NEW phpoffice/phpword support #61

Closed

Conversation

emteknetnz
Copy link
Member

@emteknetnz emteknetnz commented Sep 19, 2023

Issue #58

Notes:

  • The javascript has been non-functional for a long time so I've simply removed it along with all the associated build tooling
  • Imported images are saved as inline base64 encoded images, they are not saved to the assets folder.
  • In my testing there was one sample docx file I tried which that caused the browser to become basically non-responsive (25 second load time at 97% CPU usage according to firefox performance profiler). That same sample doc did work in CMS 4. I think the difference is that this particular document, for whatever reason, every work was wrapped in its own <span> tag which you could see when viewing source. I think perhaps the older version of tinymce would tolerate this while the newer version doesn't.
  • Also .doc files previously when doing this in CMS 4 on PHP 7.4, however now that this PR is on CMS 5 in PHP 8.1 using the "MsDoc" throws a PHP warnings and doesn't import
  • PHPOffice/PHPWord has 1K open issues https://github.com/PHPOffice/PHPWord/issues
  • We probably shouldn't merge this

This was referenced Sep 19, 2023
@emteknetnz emteknetnz force-pushed the pulls/3/phpoffice branch 3 times, most recently from 4ce580f to 36db4ef Compare September 20, 2023 01:29
@emteknetnz emteknetnz marked this pull request as ready for review September 20, 2023 01:32
Copy link

@maxime-rainville maxime-rainville left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In my testing there was one sample docx file I tried which that caused the browser to become basically non-responsive (25 second load time at 97% CPU usage according to firefox performance profiler). That same sample doc did work in CMS 4. I think the difference is that this particular document, for whatever reason, every work was wrapped in its own <span> tag which you could see when viewing source. I think perhaps the older version of tinymce would tolerate this while the newer version doesn't.

I take it the file is imported properly, but when TinyMCE tries to display, the browser crashes because there's too many weird tag.

Could we just do something like called strip_tags on the HTML output? Or do some other process to normalise the HTML?

@@ -25,38 +23,24 @@
use SilverStripe\Versioned\Versioned;
use SilverStripe\View\Parsers\HTMLValue;
use Tidy;

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a hard dependency on Tidy.

ERROR [Emergency]: Uncaught Error: Class "Tidy" not found IN POST /admin/pages/edit/EditForm/2/field/ImportedFromFile/upload Line 232 in /home/max/Projects/cms5x-doc-convert/vendor/silverstripe/documentconverter/src/ImportField.php Source ====== 223: /** @var Importer $importer */ 224: $importer = Injector::inst()->create($importerClass, $fileDescriptor, $chosenFolderID); 225: $content = $importer->import(); 226: 227: if (is_array($content) && isset($content['error'])) { 228: return $content; 229: } 230: 231: // Clean up with tidy (requires tidy module) * 232: $tidy = new Tidy(); 233: 

Can we add this to the composer requirements?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated

@emteknetnz
Copy link
Member Author

emteknetnz commented Oct 11, 2023

Could we just do something like called strip_tags on the HTML output? Or do some other process to normalise the HTML?

strip_tags() is not not feasible because the whole point of the document converter is that converts a word document to HTML. I'm strongly of the opinion we shouldn't try to normalise the output of the converter, because that's the job of the converter to get it right, we shouldn't be trying to fix its mistakes

Here's small snippet of the HTML imported from this particular word doc which was the 1MB download for here:

<span>faucibus</span><span>. Nunc a</span> <span>feugiat</span> <span>urna</span><span>. Integer sit</span> <span>amet</span> <span>tincidunt</span> <span>mauris</span><span>. Donec id</span> <span>urna</span> <span>nec</span> <span>turpis</span> <span>laoreet</span> <span>aliquam</span> <span>ac id ante.</span> <span>Curabitur</span> <span>faucibus</span> <span>consectetur</span> <span>velit</span><span>, vitae</span> <span>efficitur</span> <span>odio</span> <span>facilisis</span> <span>sed.</span></p>
<p>&nbsp;</p>
<p><img src="blob:http://bugfix9.test/8af42cf3-3514-4eed-866f-34998b9fa3bf"></p>
<p>&nbsp;</p>

The bottom part with the <p> and <img> tags is correct (for the img tag there's some magic done so that the image correctly renders inside tinymce + website frontend so that it's inline base64 encoded). However all of the <span> tags shouldn't be there

I tried uploaded two different sample docx files, and this was the result of one of them. Given the 50% failure rate, and the multitude of different ways that a docx file could be imported wrong due to all of the edges cases, I'm strongly of the opinion that we shouldn't be trying to "bandaid fix" when the phpoffice dependency doesn't do it's job well. We rely on the dependency to do all of the heavy lifting and we shouldn't spend any of our own time trying to "fix" it. As noted above this dependency has over 1K open issues on it.

@emteknetnz
Copy link
Member Author

A more realistic solution would be to simply put a big red warning on the import screen that says "Importing certain documents may make the CMS become unresponsive, if that happens don't use the importer for this particular batch of word documents." - though I'd really question why we'd want to release software in that sort of state.

@maxime-rainville
Copy link

Not doing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants