Words with a space added after... #4
Comments
The project is abandoned. |
Yeah, I realized the project is abandoned, but I still can't find anything else that is even half decent. |
Try this custom granularity stack:
This way it will not try to chunk into paragraphs before chunking into sentences. |
Just to add to the above: the stock granularity stacks are fine if the user doesn't mind the side effect above. But only the callers know exactly what their data looks like, and they can then carefully craft their own stack according to what they want. In your case, it appears that what you want is to disregard paragraph formatting and focus on sentences instead, therefore I eliminated the |
It gives the same result. If I change the ordering, it doesn't show any ins / del. |
Can you copy and paste the text that causes the problem? Apparently I don't remember how my code works; I thought this would solve the issue, and I want to see what happens to refresh my memory. |
I looked into this problem recently because I was having a similar one. I forked the code to fix it here: https://github.com/pbagnall/PHP-FineDiff The issue is the way the tokeniser works: it includes the whitespace on the end of the token. So if we have the string "abc def", tokenising it into words gives "abc ", "def". When you compare that with "abc def ghi", which tokenises to "abc ", "def ", "ghi", the "def" is not the same as "def ", because the latter includes the whitespace that comes after it. So what my fork does (at some cost in efficiency, to be fair) is tokenise like this instead: "abc", " ", "def". This time the "def" matches, and the diff starts from the whitespace that follows. |
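The two tokenisation schemes described above can be compared in a small sketch. It is written in Python rather than PHP purely for illustration, it uses the standard library's `difflib.SequenceMatcher` as a stand-in for the diff step, and the helper names (`tokens_with_trailing_space`, `tokens_with_separate_space`, `unchanged_tokens`) are hypothetical. It is not FineDiff's actual code or algorithm.

```python
import re
from difflib import SequenceMatcher


def tokens_with_trailing_space(text):
    # Assumed model of the stock behaviour: each word keeps the
    # whitespace that follows it, e.g. "abc def" -> ["abc ", "def"].
    return re.findall(r"\S+\s*|\s+", text)


def tokens_with_separate_space(text):
    # Assumed model of the fork's behaviour: words and whitespace runs
    # are separate tokens, e.g. "abc def" -> ["abc", " ", "def"].
    return re.findall(r"\S+|\s+", text)


def unchanged_tokens(tokenise, old, new):
    # Return the tokens of `old` that a token-level diff reports as unchanged.
    a, b = tokenise(old), tokenise(new)
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    kept = []
    for block in matcher.get_matching_blocks():
        kept.extend(a[block.a:block.a + block.size])
    return kept


if __name__ == "__main__":
    old, new = "abc def", "abc def ghi"
    # With trailing whitespace, "def" and "def " differ, so only "abc " survives.
    print(unchanged_tokens(tokens_with_trailing_space, old, new))
    # With separate whitespace tokens, "abc", " " and "def" all survive.
    print(unchanged_tokens(tokens_with_separate_space, old, new))
```

The second scheme produces more tokens for the same input, which is consistent with the efficiency cost mentioned above.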
its the
|
word then character delims seems to actually work. |
@pbagnall: it is by design, for performance purposes; otherwise the algorithm would have been significantly slower, and quite different too. This is the side effect of having an algorithm that is efficient enough to be used server-side. @WilliamStam: OK, I see. I thought you were trying to diff plain text (as shown in your original screenshot). In finediff, the separator has to be a single character, not a string, and I see now that the separator is more like the By the way, do not skip |
i changed it to
that part seems to be working now 💃 thanks 👍 |
@WilliamStam: "word then character delims seems to actually work" I suppose it is OK as long as you know there won't be a large amount of text to process. You will have to profile to be sure it is good enough for your server, as splitting directly on words before coarse-chunking into paragraphs and sentences is rather CPU-intensive. |
@WilliamStam "i changed it to" Cool then. You could also try using your own delimiters instead of the stock ones, not sure I can predict what would happen though:
Since in HTML |
understanding finediff more in the few min today on this thread than ALL On Tue, Nov 19, 2013 at 2:45 PM, Raymond Hill:
|
For the "courier" vs "courier delivery boy" issue, you just have to make sure there is one space at the end of both the $from and $to strings. One space won't do any harm, as it will be consumed as a delimiter anyway. |
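The effect of that trailing space can be shown with a small sketch. It is written in Python for illustration only, models word tokens as "word plus following whitespace" (an assumption based on the tokeniser description earlier in this thread), and uses `difflib.SequenceMatcher` in place of FineDiff's own diff. The helpers `word_tokens` and `word_diff` are hypothetical names, not part of the library.

```python
import re
from difflib import SequenceMatcher


def word_tokens(text):
    # Assumed tokenisation: each word keeps the whitespace that follows it.
    return re.findall(r"\S+\s*|\s+", text)


def word_diff(old, new):
    # Produce a list of ("keep" | "del" | "ins", text) operations.
    a, b = word_tokens(old), word_tokens(new)
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    ops = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            ops.append(("keep", "".join(a[i1:i2])))
            continue
        if i1 < i2:
            ops.append(("del", "".join(a[i1:i2])))
        if j1 < j2:
            ops.append(("ins", "".join(b[j1:j2])))
    return ops


if __name__ == "__main__":
    from_text, to_text = "Courier", "Courier Delivery Boy"
    # Without padding, "Courier" and "Courier " are different tokens.
    print(word_diff(from_text, to_text))
    # With one trailing space on both strings, "Courier " matches.
    print(word_diff(from_text + " ", to_text + " "))
```

In this model the unpadded call reports the whole of "Courier" as deleted and re-inserted, while the padded call keeps "Courier " and inserts only "Delivery Boy ".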
The last bit of my comment got HTML-interpreted; I meant bold and italics anchors, of course. |
I've noticed that in some situations using the 'word' granularity, something odd can happen. Take this word on its own:
'Courier'
If I add the words ' Delivery Boy' after it (including the preceding white space), when it comes to viewing the difference it'll strike through the word 'Courier' and show 'Courier Delivery Boy' as being added instead.
Is there any way around this at all so it would show just ' Delivery Boy' as being added?