UnicodeEncodeError The input .md file in simplified Chinese #163

Open
Lydiagugugaga opened this issue Aug 21, 2024 · 13 comments
Comments

@Lydiagugugaga

UnicodeEncodeError: 'utf-8' codec can't encode character '\udc80' in position 23: surrogates not allowed

I'm trying to convert a Markdown (.md) file whose content is in Simplified Chinese, but I'm running into encoding problems. I've read that the latest version mentions a fix for #161, but it still fails on my end. What is the best way to fix this?

@MartinPacker
Owner

Thank you for reporting this.

I'm away from my computer for the next couple of days, so I can't look at this right away.

Can you somehow get me a minimal reproducible example? I will also say that a workaround might be using a hexadecimal entity reference. But that's probably not a scalable approach.

@Lydiagugugaga
Author

> Thank you for reporting this.
>
> I'm away from my computer for the next couple of days, so I can't look at this right away.
>
> Can you somehow get me a minimal reproducible example? I will also say that a workaround might be using a hexadecimal entity reference. But that's probably not a scalable approach.

I just ran: python md2pptx output.pptx < 222.md
[image]

@MartinPacker
Owner

Thanks. I need the 222.md file - or a minimal version of it. (No confidential data, etc.)

@Lydiagugugaga
Author

Thanks for your reply. Here is just an example markdown file:

222.md

@MartinPacker
Owner

MartinPacker commented Aug 23, 2024

Thanks for this. I note your attempt to use <p> paragraph tags. Those aren't supported by md2pptx - if I remember correctly. I would use asterisks * instead.

If you think paragraph tags should be supported - and have a clear idea as to how they should be rendered - please open another issue.

@Lydiagugugaga
Author

Lydiagugugaga commented Aug 23, 2024

> Thanks for this. I note your attempt to use <p> paragraph tags. Those aren't supported by md2pptx - if I remember correctly. I would use asterisks * instead.
>
> If you think paragraph tags should be supported - and have a clear idea as to how they should be rendered - please open another issue.

Thanks for your reply.
About the <p> paragraph tags: I thought they were the problem at first, but I tried removing them and using plain Markdown, and it still doesn't work.

@MartinPacker
Owner

Right. BBEdit (one of my editors of choice) thinks the file is UTF-8 but I suspect it isn't. Sniffing the actual encoding is an approach I might take.
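
One way to sniff the encoding is to try a short list of candidates against the raw bytes and keep the first that decodes cleanly. This is only a sketch: `sniff_encoding` is a hypothetical helper, not part of md2pptx, and the candidate list (UTF-16 by BOM, then UTF-8, then GB18030 as a common legacy encoding for Simplified Chinese) is an assumption about what this file might be.

```python
import codecs
from typing import Optional


def sniff_encoding(data: bytes) -> Optional[str]:
    """Return the first plausible encoding for data, or None if none fits."""
    # UTF-16 is only trusted when a byte order mark is present, because
    # almost any even-length byte string "decodes" as UTF-16 without one.
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    # utf-8-sig accepts UTF-8 with or without a BOM; gb18030 is a superset
    # of GBK/GB2312. Strict decoding rejects bytes that do not fit.
    for name in ("utf-8-sig", "gb18030"):
        try:
            data.decode(name)
        except UnicodeDecodeError:
            continue
        return name
    return None


sample = "自行车基础知识"
print(sniff_encoding(sample.encode("utf-8")))
print(sniff_encoding(sample.encode("gbk")))
```

A clean decode does not prove the guess is right (legacy encodings overlap heavily), so the result is a hint for re-saving the file as UTF-8 rather than a definitive answer.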

@MartinPacker
Owner

This is strange: My run with your file yields this:

md2pptx Markdown To Powerpoint Converter 5.0.2+ 15 August, 2024
===============================================================

Open source project: https://github.com/MartinPacker/md2pptx

External Dependencies:

  Python: 3.9.6
  python-pptx: 0.6.23
  Pillow: 10.3.0
  CairoSVG: Not Installed
  graphviz: Not Installed

Internal Dependencies:

  funnel: 0.1
  runPython: 0.4

No slide to document metadata on. Continuing without it.

Slides:
=======

   1   初学者骑车之路:掌握自行车技巧的必备指南
   2   自行车基础知识
   3       自行车的组成部分
   4       自行车的类型和用途
   5   准备骑行前的注意事项
   6       自行车装备和保养
   7       骑行安全知识和规则
   8   学习骑行技巧
   9       自行车平衡和姿势
  10       踩踏和换挡技巧
  11       转弯和刹车技巧

@MartinPacker
Owner

MartinPacker commented Aug 23, 2024

I suspect your problem is with python-pptx or lxml, rather than md2pptx. But I keep an open mind about this.

@Lydiagugugaga
Author

> I suspect your problem is with python-pptx or lxml, rather than md2pptx. But I keep an open mind about this.

Thank you so much for helping me with this question.
I've looked at some of the previously mentioned issues and also tried changing the python-pptx version, which is currently v0.6.23, but it didn't work.

If it's a problem with python-pptx or lxml, what do you suggest to fix it?

@MartinPacker
Owner

I've just fixed a problem with numeric character references. So with the very latest push a workaround for you might well be to use character references such as &#dc80;. Fiddly, I know.
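
For the character-reference workaround, the conversion can be scripted rather than done by hand. A minimal sketch, assuming md2pptx accepts the standard HTML hexadecimal form `&#xHHHH;` (HTML requires the `x` prefix for hex; md2pptx's exact parsing has not been verified here). `to_char_refs` is a hypothetical helper name.

```python
import html


def to_char_refs(text: str) -> str:
    """Replace each non-ASCII character with a hex numeric character reference."""
    return "".join(
        ch if ord(ch) < 128 else f"&#x{ord(ch):X};"
        for ch in text
    )


line = "# 自行车基础知识"
converted = to_char_refs(line)
print(converted)

# Sanity check with the standard library: decoding the references
# gives back the original text.
assert html.unescape(converted) == line
```

The output is pure ASCII, so it sidesteps the input encoding entirely, at the cost of a source file that is no longer readable as Chinese.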

@Lydiagugugaga
Author

> So with the very latest push a workaround for you might well be to use character references such as &#dc80;. Fiddly, I know.

Thank you very much.
I'll try it.

@MartinPacker
Owner

Please let me know how you get on. And do you think the text is really UTF-16 rather than UTF-8? The U+DC80 character isn't valid in UTF-8, apparently.
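
For context on U+DC80: it is a lone low surrogate, which is only meaningful as half of a UTF-16 pair, so strict UTF-8 encoding rejects it. Python itself produces such code points when bytes are decoded with the `surrogateescape` error handler, which maps an undecodable byte 0xNN to U+DCNN. Whether that is what happened with 222.md is an assumption, but the sketch below reproduces the same error message from a single stray 0x80 byte.

```python
raw = b"\x80"  # a byte that is not valid UTF-8 on its own

# surrogateescape smuggles the bad byte through as the lone surrogate U+DC80.
text = raw.decode("utf-8", errors="surrogateescape")
assert text == "\udc80"

# Strict encoding back to UTF-8 then fails, as in the report.
try:
    text.encode("utf-8")
except UnicodeEncodeError as exc:
    message = str(exc)
    print(message)

# Encoding with the same handler restores the original byte.
assert text.encode("utf-8", errors="surrogateescape") == raw
```

If this is the cause, the stray surrogate points at input bytes that were not valid UTF-8 to begin with, which would fit a file saved in a different encoding.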

(And I just pushed some doc changes after the one that fixes numeric character references - so don't get confused by what the latest commit says.)
