This repository has been archived by the owner on Dec 7, 2018. It is now read-only.

Words with a space added after... #4

Open
TheCodeFoundry opened this issue Apr 2, 2012 · 17 comments

Comments

@TheCodeFoundry

I've noticed that in some situations, using the 'word' granularity, something odd can happen. Take this word on its own:

'Courier'

If I add the words ' Delivery Boy' after it (including the preceding white space), then when it comes to viewing the difference it will strike through the word 'Courier' and show 'Courier Delivery Boy' as being added instead.

Is there any way around this at all so it would show just ' Delivery Boy' as being added?

@WilliamStam

This is old, but I'm still having similar issues. All I did was add a paragraph mark, and now it marks the whole thing as changed.

[screenshot: diff]

@Krispian

The project is abandoned.
Look for an alternative, or tolerate it =)

@WilliamStam

Yeah, I realized the project is abandoned, but I still can't find anything that is even half decent.

@gorhill
Owner

gorhill commented Nov 19, 2013

Try this custom granularity stack:

$myStack = array(
    FineDiff::sentenceDelimiters,
    FineDiff::wordDelimiters
    );

This way it will not try to chunk in paragraph before chunking into sentences.

@gorhill
Owner

gorhill commented Nov 19, 2013

Just to add to the above: the stock granularity stacks are fine if the user doesn't mind the side effect above. But only the callers know exactly what their data looks like, and they can then carefully craft their own stack according to what they want. In your case, it appears that what you want is to disregard paragraph formatting and focus on sentences instead, so I eliminated FineDiff::paragraphDelimiters. The trade-off is that, since a coarser level of chunking is removed, this will be more CPU intensive (in a typical text, there are more sentences than paragraphs), so the user can adjust according to what is important in his case: CPU vs. best diffs.
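To make the CPU trade-off concrete, here is a minimal sketch that counts how many chunks a first pass would have to compare with and without the paragraph level. It is not FineDiff's code: `countChunks()` is a hypothetical helper, and the two delimiter strings are assumptions standing in for FineDiff::paragraphDelimiters and FineDiff::sentenceDelimiters.

```php
<?php
// Assumed delimiter sets; the real FineDiff constants may differ.
const PARAGRAPH_DELIMS = "\n\r";
const SENTENCE_DELIMS  = ".\n\r";

// Hypothetical helper: counts chunks, where a chunk is a run of
// non-delimiter characters plus the run of delimiters that follows it.
function countChunks(string $text, string $delimiters): int
{
    $count  = 0;
    $start  = 0;
    $length = strlen($text);
    while ($start < $length) {
        $end  = $start + strcspn($text, $delimiters, $start); // non-delimiters
        $end += strspn($text, $delimiters, $end);             // trailing delimiters
        $count++;
        $start = $end;
    }
    return $count;
}

$text = "One. Two. Three.\nFour. Five.\n";

$paragraphChunks = countChunks($text, PARAGRAPH_DELIMS); // 2 chunks
$sentenceChunks  = countChunks($text, SENTENCE_DELIMS);  // 5 chunks

echo "paragraph-level chunks: $paragraphChunks\n";
echo "sentence-level chunks:  $sentenceChunks\n";
```

Starting the stack at sentences means the first pass compares five chunks instead of two on this toy input; on real documents the gap grows with the length of the text.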

@WilliamStam

$myStack = array(
    \FineDiff::sentenceDelimiters,
    \FineDiff::characterDelimiters,
);
$diff = \FineDiff::getDiffOpcodes($orig, $latest, $myStack);
$diffHTML = \FineDiff::renderDiffToHTMLFromOpcodes($orig, $diff);

This gives the same result. If I change the ordering, it doesn't show any ins / del at all.

@gorhill
Owner

gorhill commented Nov 19, 2013

Can you cut and paste the text which causes the problem? Apparently I don't remember how my code works; I thought this would solve the issue. I want to see what happens to refresh my memory.

@pbagnall

I looked into this problem recently, because I was having a similar problem. I forked the code to fix it here.

https://github.com/pbagnall/PHP-FineDiff

The issue is the way the tokeniser works. It includes the whitespace on the end of the token. So if we have a string "abc def", then when you tokenise it into words you get "abc ", "def". Now when you compare that with "abc def ghi" you end up with this going on...

"abc ", "def"
versus
"abc ", "def ", "ghi"

Note that the "def" is not the same as "def " because the latter includes the whitespace which comes after it.

So what my fork does (at some cost of efficiency to be fair) is tokenise like this instead...

"abc", " ", "def"
versus
"abc", " ", "def", " ", "ghi"

...and this time the "def" matches and the diff starts from the whitespace which follows.
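The two tokenisation schemes can be sketched in a few lines of PHP. This is an illustration only, not the code of FineDiff or of the fork: all three function names are hypothetical, and the word delimiters are reduced to a single space to keep the example small.

```php
<?php
// Scheme 1 (hypothetical helper): each token keeps the delimiters that follow it.
function tokenizeWithTrailingDelimiters(string $text, string $delimiters): array
{
    $tokens = [];
    $start  = 0;
    $length = strlen($text);
    while ($start < $length) {
        $end  = $start + strcspn($text, $delimiters, $start); // the word itself
        $end += strspn($text, $delimiters, $end);             // whitespace after it
        $tokens[] = substr($text, $start, $end - $start);
        $start = $end;
    }
    return $tokens;
}

// Scheme 2 (hypothetical helper): whitespace runs become tokens of their own.
function tokenizeWithSeparateWhitespace(string $text): array
{
    return preg_split('/(\s+)/', $text, -1, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);
}

// Number of identical tokens at the start of both lists.
function countCommonLeadingTokens(array $a, array $b): int
{
    $limit = min(count($a), count($b));
    $i = 0;
    while ($i < $limit && $a[$i] === $b[$i]) {
        $i++;
    }
    return $i;
}

$from = "abc def";
$to   = "abc def ghi";

// ["abc ", "def"] vs ["abc ", "def ", "ghi"]: only "abc " matches.
$trailing = countCommonLeadingTokens(
    tokenizeWithTrailingDelimiters($from, " "),
    tokenizeWithTrailingDelimiters($to, " ")
);

// ["abc", " ", "def"] vs ["abc", " ", "def", " ", "ghi"]: all of $from matches.
$separate = countCommonLeadingTokens(
    tokenizeWithSeparateWhitespace($from),
    tokenizeWithSeparateWhitespace($to)
);

echo "matching leading tokens, trailing-delimiter scheme: $trailing\n";
echo "matching leading tokens, separate-whitespace scheme: $separate\n";
```

The second scheme produces roughly twice as many tokens for the same text, which is the efficiency cost mentioned above.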

@WilliamStam

It's the <p> that seems to be causing the problem.

require_once('lib/finediff.php');


    $old = "<p>Daisy maak haar o&euml; oop en soek <strong>dadelik </strong>na &lsquo;n Panado. Sy weet nog nie wat haar naam is of watter jaar dit is nie, maar sy is alreeds kwaad. Vanoggend is daar g&rsquo;n teken van haar spirituele beginsel om elke dag met groot dankbaarheid te begin nie.</p>

<p>Die Bosbewoner sit op &lsquo;n leunstoel langs Daisy en drink sy koffie. Hy loer bekommerd na haar en verkies om die woorde waarmee sy wakker word te ignoreer.</p>";

    $new = "<p>Daisy maak haar o&euml; oop en soek <strong>dadelik </strong>na &lsquo;n Panado. Sy weet nog nie wat haar naam is of watter jaar dit is nie, maar sy is alreeds kwaad.</p>

<p>Vanoggend is daar g&rsquo;n teken van haar spirituele beginsel om elke dag met groot dankbaarheid te begin nie.Die Bosbewoner sit op &lsquo;n leunstoel langs Daisy en drink sy koffie. Hy loer bekommerd na haar en verkies om die woorde waarmee sy wakker word te ignoreer.</p>";




    $myStack = array(
        \FineDiff::sentenceDelimiters,
        \FineDiff::characterDelimiters,
    );


    $diff = \FineDiff::getDiffOpcodes($old, $new, $myStack);
    $diffHTML = \FineDiff::renderDiffToHTMLFromOpcodes($old, $diff);

    echo '<style>ins {background: none repeat scroll 0 0 #DDFFDD; color: #008000; text-decoration: none;}</style>';
    echo '<style>del {background: none repeat scroll 0 0 #FFDDDD; color: #FF0000; text-decoration: none;}</style>';
    echo $diffHTML;

[screenshot]

@WilliamStam

word then character delims seems to actually work.

@gorhill
Owner

gorhill commented Nov 19, 2013

@pbagnall : it is by design, for performance purpose, otherwise the algorithm would have been significantly slower, and quite different too. This is the side-effect of having an algorithm which is efficient enough to be used server-side.

@WilliamStam : ok I see, I thought you were trying to diff plain text (as shown in your original screenshot). In finediff, the separator has to be a single character, not a string, and I see now that the separator is more like the <p> and </p>, and as pointed out by @pbagnall , your case shows the side effect of the algorithm. A solution would be to have at least one white space each side of <p> and </p>.

By the way, do not skip FineDiff::wordDelimiters: you are causing the algorithm to be very inefficient (by forcing it to jump from sentences to characters, yikes!). Each stack entry must be a progression from the one above it, or else you are going to tax your server.

@WilliamStam

i changed it to

$myStack = array(
    \FineDiff::paragraphDelimiters,
    \FineDiff::wordDelimiters,
    \FineDiff::characterDelimiters
);

that part seems to be working now 💃 thanks 👍

@gorhill
Owner

gorhill commented Nov 19, 2013

@WilliamStam: "word then character delims seems to actually work"

I suppose it is OK as long as you know there won't be a large amount of text to process. You will have to profile to be sure it is good enough for your server, as splitting directly on words before coarse-chunking into paragraphs and sentences is rather CPU intensive.

@gorhill
Owner

gorhill commented Nov 19, 2013

@WilliamStam "i changed it to"

Cool then. You could also try using your own delimiters instead of the stock ones, not sure I can predict what would happen though:

$myStack = array(
    "<>",
    "<> \t\n\r",
    ""
);

Since in HTML < and > are key anchors.

@WilliamStam

I've understood more about FineDiff in a few minutes on this thread today than in ALL my previous reading about it. Thank you for your time!


@ceyzeriat

For the "courier" vs "courier delivery boy" issue, you just have to make sure there is one space at the end of both the $from and $to strings. One space won't do any harm, as it will be eaten up as a delimiter anyway.
For future reference: to ignore HTML anchors in the comparison but still keep them in the output display, simply add one and only one space before every < and after every >. Example: $to = preg_replace("/ *</", " <", preg_replace("/> */", "> ", $to)), and the same thing with $from.
What happens is that all anchors are now taken as words, since each one is separated by one space before and after. If the texts have different anchors, they will be stamped as inserted or deleted. But when $to is displayed, these anchors are processed as HTML code, so they don't show up as words but still do whatever they were meant to do.
This was not widely tested so i don't know what happens in corner cases. But it works beautifully on a simple text I have (with and ).
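Both tricks from this comment can be wrapped in one small function. This is a sketch under assumptions: `padForWordDiff()` is a hypothetical helper that is not part of FineDiff, and like the original suggestion it has not been tested on corner cases such as `<` or `>` appearing inside attribute values.

```php
<?php
// Hypothetical helper: prepares a string for word-granularity diffing.
function padForWordDiff(string $html): string
{
    // Exactly one space after every ">" ...
    $html = preg_replace('/> */', '> ', $html);
    // ... and exactly one space before every "<", so each tag is a "word".
    $html = preg_replace('/ *</', ' <', $html);
    // Exactly one trailing space, so the last word ends with a delimiter
    // just like every other word.
    return rtrim($html, ' ') . ' ';
}

$from = padForWordDiff('Courier');
$to   = padForWordDiff('Courier Delivery Boy');
$html = padForWordDiff('<p>Courier</p>');

echo '[' . $from . "]\n"; // [Courier ]
echo '[' . $to . "]\n";   // [Courier Delivery Boy ]
echo '[' . $html . "]\n"; // [ <p> Courier </p> ]
```

The padded $from and $to would then be passed to FineDiff::getDiffOpcodes() in place of the raw strings, as in the snippets earlier in the thread.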

@ceyzeriat

Last bit of my comment got html-interpreted.. i meant bold and italics anchors of course.
Btw. Thanks for that huge and amazing work gorhill!
