finediff is eating my words when showing the comparisons. #20
Comments
Probably one of the HTML tags ends up being inserted in the middle of a multibyte character. FineDiff works on a binary, byte-by-byte basis; it doesn't know about characters. It happens to display fine for ASCII characters because they are single-byte. Not sure if you could find where an HTML tag splits a whole character and shift back or forth (depending on whether it is the opening or closing tag) to a proper character boundary.
Where can I try to set the character boundary? I tried to look into the code, but it was a bit too difficult for me to follow... Thanks in advance!
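To illustrate the failure mode described above, here is a small sketch in Python (not PHP, and not FineDiff code; Python `bytes` stand in for PHP's byte strings). The `<ins>` tag and the offsets are assumptions chosen for the demonstration. It shows that inserting a tag at a byte offset inside a multibyte character leaves bytes that no longer decode as UTF-8, while the same insertion at a character boundary is harmless.

```python
# Illustrative only: Python bytes stand in for PHP byte strings.
text = "\u6642\u7a7a"            # two CJK characters
data = text.encode("utf-8")      # 3 bytes per character here


def is_valid_utf8(raw: bytes) -> bool:
    """Return True if raw decodes cleanly as UTF-8."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def wrap_from(raw: bytes, offset: int) -> bytes:
    """Wrap everything from byte `offset` onward in a hypothetical <ins> tag."""
    return raw[:offset] + b"<ins>" + raw[offset:] + b"</ins>"


# Offset 4 falls inside the second character; offset 3 is a boundary.
broken = wrap_from(data, 4)
intact = wrap_from(data, 3)

print(len(data))                 # byte length, not character count
print(is_valid_utf8(broken))     # the split character is no longer decodable
print(is_valid_utf8(intact))     # boundary-aligned insertion stays valid
```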
Create your own rendering handler:
See code. Each time your callback is called, you may want to check whether the start and end of the segment fall on valid Unicode character boundaries, and if not, look around to fetch the previous/following missing bytes. Frankly, it's just an untested idea, but if I had time, that's what I would look into. Changing FineDiff code is not an option: it's completely designed to work on bytes, and these bytes could be anything. FineDiff doesn't care about their meaning.
OK, thanks, but the thing is that it is missing more than one character (sometimes a few sentences), so if I shift back and forth, most likely I would just get one more character back, which doesn't really help much...
Probably because you are looking at the broken HTML result. Look at the binary string internally, not the broken rendered HTML. Putting an HTML renderer in there was my biggest mistake. I should not have created this helper method, because FineDiff is really completely binary and doesn't care what the data is. Originally it was used just to save storage by saving only what changed. Many users of the library think it is meant to render diffs visually on screen; that wasn't my intention at all originally. Isn't it true that if you use Edit: Out of curiosity, what granularity do you use?
I played with it for a bit. The following code seems to work for my case, but I'm not sure about other cases..
Are Chinese characters always 3 bytes long? (including whitespace, etc.)
No... they can be 1-4 bytes... I guess I may have to refine it a bit to suit more cases... But how about insertion? I can't get it right using the same technique...
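As a quick check of the 1-4 byte range mentioned above, this Python sketch (illustrative, not part of FineDiff) prints the UTF-8 byte length of a few sample characters. The samples are assumptions picked to cover each length, including an ASCII space and the full-width ideographic space.

```python
def utf8_len(ch: str) -> int:
    """Number of bytes the character occupies when encoded as UTF-8."""
    return len(ch.encode("utf-8"))


samples = [
    " ",            # ASCII space
    "A",            # ASCII letter
    "\u00e9",       # Latin small letter e with acute
    "\u3000",       # ideographic (full-width) space
    "\u6642",       # a common CJK ideograph
    "\U00020000",   # a CJK Extension B ideograph
]

for ch in samples:
    print(f"U+{ord(ch):04X} -> {utf8_len(ch)} byte(s)")
```

So a fixed 3-byte step handles most common Chinese ideographs and the full-width space, but not ASCII whitespace, digits, or the rarer ideographs outside the Basic Multilingual Plane.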
Alright, looking at the UTF-8 encoding, finding the beginning of a character seems pretty easy: if bits 7-6 of a byte are binary 10, that is, the byte masked with 0xC0 equals 0x80, it is a continuation byte rather than the first byte of a character (i.e. Now use the distance of the beginning of the character to the passed It has been a while since I wrote PHP, so I would have to check the PHP reference again. I forgot: can we check a single byte in a string using array notation? If so, this becomes very easy. You could do all of this without changing FineDiff, just by providing your own callback to Edit: fixed mistakes
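The boundary test described above can be sketched as follows. This is Python for illustration only, not the callback FineDiff expects. The helper names (`is_continuation`, `snap_back`, `snap_forward`, `widen_segment`) are hypothetical, and the `(start, length)` segment shape is an assumption about what a rendering callback would receive.

```python
from typing import Tuple


def is_continuation(byte: int) -> bool:
    """True if the byte is a UTF-8 continuation byte (bit pattern 10xxxxxx)."""
    return (byte & 0xC0) == 0x80


def snap_back(data: bytes, pos: int) -> int:
    """Move pos backward to the first byte of the character containing it."""
    while 0 < pos < len(data) and is_continuation(data[pos]):
        pos -= 1
    return pos


def snap_forward(data: bytes, pos: int) -> int:
    """Move pos forward to the next character boundary, or to the end of data."""
    while pos < len(data) and is_continuation(data[pos]):
        pos += 1
    return pos


def widen_segment(data: bytes, start: int, length: int) -> Tuple[int, int]:
    """Return (start, length) widened so both ends sit on character boundaries."""
    new_start = snap_back(data, start)
    new_end = snap_forward(data, start + length)
    return new_start, new_end - new_start


data = "a\u6642\u7a7ab".encode("utf-8")      # 1 + 3 + 3 + 1 bytes
start, length = widen_segment(data, 2, 3)    # both ends begin mid-character
print(data[start:start + length].decode("utf-8"))
```

Widening one segment in isolation would overlap its neighbours, so a real renderer would need to shift each shared boundary in one consistent direction for both adjacent segments. That bookkeeping is left out of this sketch.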
Thanks! I also just googled it and discovered this fork. It is working well with my Chinese characters!
Just be aware that you won't get the same kind of performance, however, as there is no equivalent for
I've worked a bit on this today; I wanted to test the idea above about nudging the boundary back/forth. The idea works, but it's all in the details. It's not perfect yet. I have figured out how to make it work perfectly, but I don't know when this will be ready.
Cool! Thanks! I realize that the fork isn't as efficient as this one, but it can still serve as a temporary solution. Looking forward to your updates!
See this: https://github.com/xrstf/PHP-FineDiff
After reading #18, I can now use FineDiff with Chinese text successfully, but sometimes it eats my words!
Example:
(before)
1根據相對論,信息的傳播速度有限,因此在某些情况下,例如在發生宇宙膨胀時,距离我们非常遥远的區域中我們將只能收到一小部分区域的信息,其他部分的信息将永远无法传播到我们的區域。可以被我們觀測到的时空稱為“可觀測宇宙”、“可見宇宙”或“我們的宇宙”。應該強調,這是由於時空本身的结構造成的,與我們所用的觀測設備没有關係。 T'尸F.
(after)
根據相對論,信息的傳播速度有限,因此在某些情况下,例如在發生宇宙膨胀時,距离我们非常遥远的區中我們將只能收到一小部分区域的信息,其他部分的信息将永远无法传播到我们的區域。可以被我們觀測到的時
2空稱為“可觀測宇宙”、“可見宇宙”或“我們的宇宙”。應該強調,這是由於時空本身的结構造成的,與我們所用的觀測設備没有關係。
Using FineDiff::renderDiffToHTMLFromOpcodes($a, $opcodes), the result will just be:
根據相對論,信息的傳播速度有限,因此在某些情况下,例如在發生宇宙膨胀時,距离我们非常遥远的區域空稱為“可觀測宇宙”、“可見宇宙”或“我們的宇宙”。應該強調,這是由於時空本身的结構造成的,與我們所用的觀測設備没有關係。 T'尸F.
The whole "中我們將只能收到一小部分区域的信息,其他部分的信息将永远无法传播到我们的區域。可以被我們觀測到的時
2" is missing! I don't know where to look to solve the problem. Please give me some guidance. Thanks!