Feature request: CJK and more characters support #8

Open
liyiheng opened this issue Sep 13, 2018 · 4 comments


@liyiheng

liyiheng commented Sep 13, 2018

tree -L 1     
.
├── Cargo.lock
├── Cargo.toml
├── Chinese.md
├── Chinese.pdf
├── images
├── README.md
├── src
├── target
├── test.md
└── test.pdf

thread 'main' panicked at 'byte index 7 is not a char boundary; it is inside '─' (bytes 6..9) of ├── src ', libcore/str/mod.rs:2111:5

@leroycep
Owner

Thanks for the issue! Could you provide the file that gives you this error? A test case would be helpful.

@liyiheng
Author

Chinese characters are rendered as ?? in the PDF. I think the panic is caused by the output of the tree command.

File contents:

中文
```
tree -L 1
.
├── Cargo.lock
├── Cargo.toml
├── Chinese.md
├── Chinese.pdf
├── images
├── README.md
├── src
├── target
├── test.md
└── test.pdf
```

@leroycep
Owner

Ah, thanks. The error is located in src/sectioner.rs.

Relevant code:

Event::Text(ref text) if self.is_code => {
    let mut start = 0;
    for (pos, c) in text.chars().enumerate() {
        if c == '\n' {
            self.write(&text[start..pos]);
            self.new_line();
            start = pos + 1;
        }
    }
    if start < text.len() {
        self.write(&text[start..]);
    }
}

On line 3 of that snippet I call text.chars().enumerate(), which yields each character together with its character count. Then, on line 5, I use that character count as a byte position when slicing. That works for ASCII, where every character is one byte, but not for Unicode text in general, where a character can take several bytes in UTF-8.
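A minimal sketch of the mismatch, using one line of the tree output from the report. The helper name `char_and_byte_positions` is hypothetical and not part of the project; it only pairs the two kinds of index so they can be compared.

```rust
/// Pairs each character with its character count (what `chars().enumerate()`
/// yields) and its byte offset (what `char_indices()` yields).
/// Hypothetical helper for illustration only.
fn char_and_byte_positions(text: &str) -> Vec<(usize, usize, char)> {
    text.chars()
        .enumerate()
        .zip(text.char_indices())
        .map(|((count, c), (byte_pos, _))| (count, byte_pos, c))
        .collect()
}
```

For `"├── src"` the two indices agree only at the first character, because each box-drawing character occupies three bytes in UTF-8. A slice boundary such as byte 7 therefore falls inside a character, which `str::is_char_boundary` reports as `false` and which slicing turns into the panic shown in the report.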

I changed text.chars().enumerate() to text.char_indices(), which yields byte offsets instead of character counts. That fixes the panic, but the characters are still rendered as question marks.
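A standalone sketch of the same loop with `char_indices()`, assuming the goal is simply to split the code block into lines. The function `split_code_lines` is hypothetical; it collects slices into a `Vec` in place of the `self.write` and `self.new_line` calls in the snippet above.

```rust
/// Splits `text` at '\n' using byte offsets from `char_indices()`,
/// so every slice boundary is a valid char boundary.
/// Hypothetical stand-in for the write/new_line calls in the original loop.
fn split_code_lines(text: &str) -> Vec<&str> {
    let mut lines = Vec::new();
    let mut start = 0;
    for (pos, c) in text.char_indices() {
        if c == '\n' {
            lines.push(&text[start..pos]);
            // '\n' is a single byte in UTF-8, so `pos + 1` is the next boundary.
            start = pos + 1;
        }
    }
    // Keep any trailing text that is not terminated by a newline.
    if start < text.len() {
        lines.push(&text[start..]);
    }
    lines
}
```

With this change the multi-byte box-drawing characters in the tree output are sliced on valid boundaries. It does not affect how the characters are rendered, which depends on the embedded font.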

@fschutt

fschutt commented Sep 20, 2018

Can you select the text in the PDF and copy the original characters out? If yes, that means the font simply can't display the characters (or is encoded badly). Does the font you are embedding support CJK? I've always used http://bluejamesbond.github.io/CharacterMap/ for debugging font-related issues.

You'll probably need to do some kind of font-selection-based-on-character-plane, i.e. if CJK characters are detected, then embed Roboto-CJK, otherwise, use Roboto-Medium.
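A rough sketch of that kind of selection, assuming it is enough to check a handful of common CJK Unicode blocks. The functions `is_cjk` and `pick_font` are hypothetical, the font names are the ones mentioned above, and the block list is deliberately incomplete (it omits the supplementary-plane extensions, for example).

```rust
/// Returns true if `c` falls in one of a few common CJK Unicode blocks.
/// Illustrative and incomplete; a real check would cover more blocks.
fn is_cjk(c: char) -> bool {
    matches!(
        c,
        '\u{3000}'..='\u{303F}'   // CJK Symbols and Punctuation
        | '\u{3040}'..='\u{309F}' // Hiragana
        | '\u{30A0}'..='\u{30FF}' // Katakana
        | '\u{3400}'..='\u{4DBF}' // CJK Unified Ideographs Extension A
        | '\u{4E00}'..='\u{9FFF}' // CJK Unified Ideographs
        | '\u{AC00}'..='\u{D7AF}' // Hangul Syllables
    )
}

/// Picks a font name for a run of text (hypothetical helper).
fn pick_font(text: &str) -> &'static str {
    if text.chars().any(is_cjk) {
        "Roboto-CJK"
    } else {
        "Roboto-Medium"
    }
}
```

Note that the box-drawing characters printed by tree (such as `├` and `─`) are not in any of these blocks, so a check like this would not route them to the CJK font; whether the default font covers them would need to be verified separately.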
