Words with a space added after... #4

TheCodeFoundry · 2012-04-02T10:20:12Z

I've noticed that in some situations using the 'word' granularity, something odd can happen. Take this word on its own:

'Courier'

If I add the words ' Delivery Boy' after it (including the preceding white space), then when viewing the difference it strikes through the word 'Courier' and shows 'Courier Delivery Boy' as being added instead.

Is there any way around this at all so it would show just ' Delivery Boy' as being added?

WilliamStam · 2013-11-19T08:38:09Z

This is old, but I'm still having similar issues. All I did was add a paragraph mark, and now it marks the whole thing as changed.

Krispian · 2013-11-19T08:41:12Z

The project is abandoned.
Look for an alternative, or tolerate it =)

WilliamStam · 2013-11-19T08:44:44Z

Yeah, I realized the project is abandoned, but I still can't find anything that is even half decent.

gorhill · 2013-11-19T11:52:21Z

Try this custom granularity stack:

$myStack = array(
    FineDiff::sentenceDelimiters,
    FineDiff::wordDelimiters
    );

This way it will not try to chunk into paragraphs before chunking into sentences.

gorhill · 2013-11-19T11:59:43Z

Just to add to the above: the stock granularity stacks are fine if the user doesn't mind the side effect above. But only the callers know exactly what their data looks like, and they can then carefully craft their own stack according to what they want. In your case, it appears you want to disregard paragraph formatting and focus on sentences instead, so I eliminated FineDiff::paragraphDelimiters. The trade-off is that, since a coarser level of chunking is removed, this will be more CPU intensive (in a typical text there are more sentences than paragraphs), so the user can adjust according to what matters in their case: CPU vs. best diffs.

WilliamStam · 2013-11-19T12:00:44Z

$myStack = array(
    \FineDiff::sentenceDelimiters,
    \FineDiff::characterDelimiters,
);
$diff = \FineDiff::getDiffOpcodes($orig, $latest, $myStack);
$diffHTML = \FineDiff::renderDiffToHTMLFromOpcodes($orig, $diff);

This gives the same result. If I change the ordering, it doesn't show any ins / del.

gorhill · 2013-11-19T12:02:13Z

Can you copy and paste the text which causes the problem? Apparently I don't remember how my code works. I thought this would solve the issue, and I want to see what happens to refresh my memory.

pbagnall · 2013-11-19T12:14:08Z

I looked into this problem recently, because I was having a similar problem. I forked the code to fix it here.

https://github.com/pbagnall/PHP-FineDiff

The issue is the way the tokeniser works: it includes the whitespace on the end of the token. So if we have a string "abc def", then when you tokenise it to words you get "abc ", "def". Now when you compare that with "abc def ghi" you end up with this going on...

"abc ", "def"
versus
"abc ", "def ", "ghi"

Note that the "def" is not the same as "def " because the latter includes the whitespace which comes after it.

So what my fork does (at some cost of efficiency to be fair) is tokenise like this instead...

"abc", " ", "def"
versus
"abc", " ", "def", " ", "ghi"

...and this time the "def" matches and the diff starts from the whitespace which follows.
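
The difference between the two tokenising styles can be sketched in a few lines of plain PHP. This is only an illustration: the function names are hypothetical, the regular expressions are an assumption, and none of it is taken from FineDiff or from the fork. It just counts how many leading tokens the two strings share under each style.

```php
<?php
// Illustrative only: these helpers are NOT FineDiff's real tokeniser.

// Style 1: each word keeps the whitespace that follows it.
function tokenise_with_trailing_space(string $text): array {
    preg_match_all('/\S+\s*|\s+/', $text, $m);
    return $m[0];
}

// Style 2: words and whitespace runs are separate tokens.
function tokenise_separately(string $text): array {
    preg_match_all('/\S+|\s+/', $text, $m);
    return $m[0];
}

// Count how many leading tokens two token lists have in common.
function common_prefix_length(array $a, array $b): int {
    $n = min(count($a), count($b));
    $i = 0;
    while ($i < $n && $a[$i] === $b[$i]) {
        $i++;
    }
    return $i;
}

$from = "abc def";
$to   = "abc def ghi";

// Style 1: only "abc " matches, because "def" !== "def ".
$sharedTrailing = common_prefix_length(
    tokenise_with_trailing_space($from),
    tokenise_with_trailing_space($to)
);

// Style 2: "abc", " " and "def" all match.
$sharedSeparate = common_prefix_length(
    tokenise_separately($from),
    tokenise_separately($to)
);
```

With the second style there are more tokens to compare for the same text, which is where the efficiency cost comes from.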

WilliamStam · 2013-11-19T12:14:23Z

It's the <p> that seems to be causing the problem.

require_once('lib/finediff.php');


    $old = "<p>Daisy maak haar o&euml; oop en soek <strong>dadelik </strong>na &lsquo;n Panado. Sy weet nog nie wat haar naam is of watter jaar dit is nie, maar sy is alreeds kwaad. Vanoggend is daar g&rsquo;n teken van haar spirituele beginsel om elke dag met groot dankbaarheid te begin nie.</p>

<p>Die Bosbewoner sit op &lsquo;n leunstoel langs Daisy en drink sy koffie. Hy loer bekommerd na haar en verkies om die woorde waarmee sy wakker word te ignoreer.</p>";

    $new = "<p>Daisy maak haar o&euml; oop en soek <strong>dadelik </strong>na &lsquo;n Panado. Sy weet nog nie wat haar naam is of watter jaar dit is nie, maar sy is alreeds kwaad.</p>

<p>Vanoggend is daar g&rsquo;n teken van haar spirituele beginsel om elke dag met groot dankbaarheid te begin nie.Die Bosbewoner sit op &lsquo;n leunstoel langs Daisy en drink sy koffie. Hy loer bekommerd na haar en verkies om die woorde waarmee sy wakker word te ignoreer.</p>";




    $myStack = array(
        \FineDiff::sentenceDelimiters,
        \FineDiff::characterDelimiters,
    );


    $diff = \FineDiff::getDiffOpcodes($old, $new, $myStack);
    $diffHTML = \FineDiff::renderDiffToHTMLFromOpcodes($old, $diff);

    echo '<style>ins {background: none repeat scroll 0 0 #DDFFDD; color: #008000; text-decoration: none;}</style>';
    echo '<style>del {background: none repeat scroll 0 0 #FFDDDD; color: #FF0000; text-decoration: none;}</style>';
    echo $diffHTML;

WilliamStam · 2013-11-19T12:26:17Z

word then character delims seems to actually work.

gorhill · 2013-11-19T12:30:44Z

@pbagnall : it is by design, for performance reasons; otherwise the algorithm would have been significantly slower, and quite different too. This is the side effect of having an algorithm which is efficient enough to be used server-side.

@WilliamStam : OK, I see. I thought you were trying to diff plain text (as shown in your original screenshot). In finediff, a separator has to be a single character, not a string, and I see now that the separators here are more like <p> and </p>. As pointed out by @pbagnall, your case shows the side effect of the algorithm. A solution would be to have at least one white space on each side of <p> and </p>.

By the way, do not skip FineDiff::wordDelimiters: you are causing the algorithm to be very inefficient (by forcing it to jump from sentences to characters, yikes!). Each stack entry must be a progression from the one above, or else you are going to tax your server.

WilliamStam · 2013-11-19T12:34:35Z

i changed it to

$myStack = array(
    \FineDiff::paragraphDelimiters,
    \FineDiff::wordDelimiters,
    \FineDiff::characterDelimiters
);

that part seems to be working now 💃 thanks 👍

gorhill · 2013-11-19T12:34:44Z

@WilliamStam: "word then character delims seems to actually work"

I suppose it is OK as long as you know there won't be a large amount of text to process. You will have to profile to be sure it is good enough for your server, as splitting directly on words before coarse-chunking into paragraphs and sentences is rather CPU intensive.

gorhill · 2013-11-19T12:45:02Z

@WilliamStam "i changed it to"

Cool then. You could also try using your own delimiters instead of the stock ones, though I'm not sure I can predict what would happen:

$myStack = array(
    "<>",
    "<> \t\n\r",
    ""
);

Since in HTML < and > are key anchors.

WilliamStam · 2013-11-19T12:48:29Z

I understand finediff better after a few minutes on this thread today than after ALL the previous reading about it. Thank you for your time!

ceyzeriat · 2015-01-21T21:25:31Z

For the "Courier" vs "Courier Delivery Boy" issue, you just have to make sure there is 1 space at the end of both the $from and $to strings. One extra space won't do any harm, as it will be eaten up as a delimiter anyway.
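
To see why the trailing space helps, here is a small sketch in plain PHP. It assumes a tokeniser that keeps the whitespace after each word, as described earlier in this thread. The function name and the regular expression are hypothetical and are not FineDiff's own code.

```php
<?php
// Illustrative tokeniser, NOT FineDiff's own code: every word keeps the
// whitespace that follows it.
function words_with_trailing_space(string $text): array {
    preg_match_all('/\S+\s*/', $text, $m);
    return $m[0];
}

$from = 'Courier';
$to   = 'Courier Delivery Boy';

// Without padding, the first tokens differ: "Courier" vs "Courier ".
$plain = [
    words_with_trailing_space($from)[0],
    words_with_trailing_space($to)[0],
];

// With one trailing space on both inputs, the first tokens are identical,
// so only the words after "Courier " are left to be reported as added.
$padded = [
    words_with_trailing_space($from . ' ')[0],
    words_with_trailing_space($to . ' ')[0],
];
```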
For future reference, in order to ignore HTML anchors in the comparison but still keep them in the output display, you simply have to add 1 and only 1 space before every < and after every >. Example: $to = preg_replace("/ *</", " <", preg_replace("/> */", "> ", $to)) and the same thing with $from.
What's happening is that all anchors are now treated as words, since each is separated by 1 space before and after. If the texts have different anchors, they will be marked as inserted or deleted. But when $to is displayed these anchors are processed as HTML code, so they don't show up as words but still do whatever they were meant to do.
This was not widely tested, so I don't know what happens in corner cases. But it works beautifully on a simple text I have (with bold and italics anchors).
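
As a rough sketch, the padding step can be wrapped in a small helper. The name pad_tags is hypothetical and not part of FineDiff. It only normalises the spacing around < and >, and the result would then be passed to the diff as usual.

```php
<?php
// Hypothetical helper (not part of FineDiff): put exactly one space before
// every "<" and exactly one space after every ">".
function pad_tags(string $html): string {
    // Collapse any spaces after ">" into a single space.
    $html = preg_replace('/> */', '> ', $html);
    // Collapse any spaces before "<" into a single space.
    return preg_replace('/ *</', ' <', $html);
}

$from = pad_tags('Some <b>bold</b> text');
$to   = pad_tags('Some <b>bold</b> and <i>italic</i> text');
```

Running the helper twice gives the same result as running it once, so it should be safe to apply to text that is already padded.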

ceyzeriat · 2015-01-21T21:27:56Z

Last bit of my comment got html-interpreted.. i meant bold and italics anchors of course.
Btw. Thanks for that huge and amazing work gorhill!
