-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy paththe-levenshtein-rules.html
175 lines (159 loc) · 14.1 KB
/
the-levenshtein-rules.html
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
<!DOCTYPE html>
<html lang="en">
<head>
<title>meandering journey</title>
<meta charset="utf-8" />
<link href="./theme/css/main.css" type="text/css" rel="stylesheet" />
<link href="http://meandering.journey.sk/feeds/all.atom.xml" type="application/atom+xml" rel="alternate" title="meandering journey Full Atom Feed" />
<link href="http://meandering.journey.sk/feeds/misc.atom.xml" type="application/atom+xml" rel="alternate" title="meandering journey Categories Atom Feed" />
</head>
<body id="index" class="home">
<div class="all">
<div class="extra-info">
<aside>
<h3>About the blog</h3>
A platform to practice my written communication skills.
The range of topics tends to surprise even myself.
</aside>
<aside>
<h3>About the author</h3>
My name is Ján Hušták. I live near
<a href="http://maps.google.com/maps?q=bratislava&z=6">Bratislava</a>.
I've been developing software professionally since 1998.
The Java platform has served me well but I don't dwell on it.
</aside>
<aside>
<h3>Links</h3>
<a href="http://www.journey.sk">Main site</a>,
<a href="http://coding.journey.sk">projects page</a>,
<a href="https://github.com/codingjourney">GitHub</a>.
Sorry, no social networks. I do read mail sent to
coding at journey.sk.
</aside>
<aside id="tags">
<h3>Tags</h3>
<a href="./tag/motivation.html">motivation</a>
- <a href="./tag/htpc.html">HTPC</a>
- <a href="./tag/openbsd.html">OpenBSD</a>
- <a href="./tag/qt.html">Qt</a>
- <a href="./tag/upsheet.html">upsheet</a>
- <a href="./tag/python.html">Python</a>
- <a href="./tag/kde.html">KDE</a>
- <a href="./tag/cloud-computing.html">cloud computing</a>
- <a href="./tag/caldav.html">CalDAV</a>
- <a href="./tag/howto.html">howto</a>
- <a href="./tag/jetty.html">Jetty</a>
- <a href="./tag/craftsmanship.html">craftsmanship</a>
- <a href="./tag/meta.html">meta</a>
- <a href="./tag/music.html">music</a>
- <a href="./tag/it-misadventures.html">IT misadventures</a>
- <a href="./tag/algorithms.html">algorithms</a>
- <a href="./tag/android.html">Android</a>
- <a href="./tag/cups.html">CUPS</a>
- <a href="./tag/security.html">security</a>
- <a href="./tag/html5.html">HTML5</a>
</aside>
<aside class="links">
<h3>Recent articles</h3>
<a href="./much-more-fun-with-planning-poker.html">Much more fun with Planning poker</a>
<a href="./the-child-that-grew-too-fast.html">The child that grew too fast</a>
<a href="./mare-nostrum-at-konzerthaus.html">Mare Nostrum at Konzerthaus</a>
<a href="./too-much-fun-with-planning-poker.html">Too much fun with Planning poker</a>
<a href="./long-time-no-blog.html">Long time no blog</a>
<a href="./what-i-did-last-summer.html">What I did last summer</a>
<a href="./october-2013-is-here.html">October 2013 is here</a>
<a href="./october-2013.html">October 2013</a>
</aside>
</div><!-- /.extra-info -->
<div class="main-column">
<header id="banner" class="body">
<h1><a href=".">meandering<img src="./theme/images/logo.png"/>journey</a></h1>
</header><!-- /#banner -->
<section id="content" class="body">
<header>
<h3>
<a href="the-levenshtein-rules.html" rel="bookmark"
title="Permalink to The Levenshtein rules">The Levenshtein rules</a></h2>
</header>
<footer class="post-info">
Published on <abbr class="published" title="2013-07-28T06:30:00"> Sun 28 July 2013 </abbr> under
<a href="./tag/algorithms.html">algorithms</a> </footer><!-- /.post-info -->
<div class="entry-content">
<p>I continued my exploration of the <a href="./the-levenshtein-puzzle.html">Levenshtein distance</a> by writing an implementation in the <a href="http://docs.jboss.org/drools/release/6.0.0.Beta5/drools-expert-docs/html/ch04.html#d0e4036">Drools Rule Language</a> a.k.a. DRL. It turned out to be very interesting because the properties of DRL forced me to formulate the problem in a completely different way.</p>
<p>This isn't the right place to delve into rule engines but I'll try to explain the basics. A DRL program is executed in a <em>working memory</em> containing <em>facts</em> and <em>rules</em>. A rule is a simple statement of the form</p>
<div class="codehilite"><pre><span class="n">when</span>
<span class="n">the</span> <span class="n">working</span> <span class="n">memory</span> <span class="n">is</span> <span class="n">in</span> <span class="n">such</span> <span class="n">and</span> <span class="n">such</span> <span class="n">state</span>
<span class="n">then</span>
<span class="n">perform</span> <span class="n">these</span> <span class="n">actions</span>
<span class="p">(</span><span class="n">potentially</span> <span class="n">affecting</span> <span class="n">the</span> <span class="n">working</span> <span class="n">memory</span><span class="p">)</span>
</pre></div>
<p>The <em>when</em> part contains a fact pattern. When a fact is inserted into the working memory all rules whose <em>when</em> parts match the new fact get their <em>then</em> parts executed. This may cause other facts to appear in the working memory, triggering a cascade of rule firing. Well-written rules adhere to two important principles:</p>
<ol>
<li>The <em>then</em> part should contain no conditional logic (as Drools creator Mark Proctor says, it should be <em>"if this then that"</em>, not <em>"if this then maybe that"</em>). All decision-making should be expressed in the <em>when</em> sections.</li>
<li>The rules should have the same semantics regardless of their execution order (e.g. when a new fact matches two rules it shouldn't matter which fires first).</li>
</ol>
<p>As you can see, a rule author gives up a lot of control over the program flow. The idea is to specify <em>what</em> should happen and let the rule engine figure out <em>how</em> to do it. The way it looks in practice is that you decompose your input into very small parts that are straightforward to reason about. From that you can formulate rules that let the engine construct the desired result.</p>
<p>My original Levenshtein distance algorithm used the concepts of identical and orthogonal sub-words. Those are not really suitable for a rule engine because their discovery is in itself quite complex. I replaced them with the idea of character locations. A character location is a simple object that says "there is an 'a' at offset 2 in the word 'banana'". Converting a word into character locations is trivial and I can then throw them into the working memory as new facts (the examples use pseudo-code rather than actual DRL syntax):</p>
<div class="codehilite"><pre><span class="n">when</span>
<span class="n">word</span> <span class="o">:</span> <span class="n">String</span>
<span class="n">then</span>
<span class="k">for</span> <span class="n">offset</span> <span class="n">from</span> <span class="mi">1</span> <span class="n">to</span> <span class="n">word</span><span class="p">.</span><span class="n">length</span>
<span class="n">insert</span> <span class="n">CharacterLocation</span><span class="p">(</span><span class="n">word</span><span class="p">,</span> <span class="n">offset</span><span class="p">)</span>
</pre></div>
<p>The rule will be triggered for each of the words as they are inserted into the working memory. Armed with a bunch of CharacterLocations, I can identify character matches:</p>
<div class="codehilite"><pre><span class="n">when</span>
<span class="n">location1</span><span class="p">,</span> <span class="n">location2</span> <span class="o">:</span> <span class="n">CharacterLocation</span>
<span class="n">location1</span><span class="p">.</span><span class="n">character</span> <span class="o">==</span> <span class="n">location2</span><span class="p">.</span><span class="n">character</span>
<span class="n">location1</span><span class="p">.</span><span class="n">word</span> <span class="o">!=</span> <span class="n">location2</span><span class="p">.</span><span class="n">word</span>
<span class="n">then</span>
<span class="n">insert</span> <span class="n">Match</span><span class="p">(</span><span class="n">location1</span><span class="p">,</span> <span class="n">location2</span><span class="p">)</span>
</pre></div>
<p>This rule, in turn, will be triggered for each suitable pair of CharacterLocations, generating all possible Matches:</p>
<p><img alt="invalid match set" src="./static/images/levenshtein2_all.svg" /></p>
<p>For the Levenshtein distance I need a combination of Matches that covers as much of the two words as possible. Not every combination makes sense:</p>
<p><img alt="invalid match set" src="./static/images/levenshtein2_invalid.svg" /></p>
<p>so I'm actually looking for <em>sequences</em> of <em>strictly ordered</em> Matches, such as this one:</p>
<p><img alt="valid match set" src="./static/images/levenshtein2_valid.svg" /></p>
<p>Generating valid sequences takes a bit of induction. I first create "seeds" - sequences containing just two Matches:</p>
<div class="codehilite"><pre><span class="n">when</span>
<span class="n">x</span><span class="p">,</span> <span class="n">y</span> <span class="o">:</span> <span class="n">Match</span>
<span class="n">x</span> <span class="o"><</span> <span class="n">y</span>
<span class="n">not</span> <span class="n">exists</span> <span class="n">Sequence</span> <span class="n">s</span> <span class="p">(</span><span class="n">s</span><span class="p">.</span><span class="n">contains</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">))</span>
<span class="n">then</span>
<span class="n">insert</span> <span class="n">Sequence</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">)</span>
</pre></div>
<p>I proceed to grow each Sequence from the middle, using the <code>visited</code> set to avoid creating the same one twice:</p>
<div class="codehilite"><pre><span class="n">when</span>
<span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">,</span> <span class="n">candidate</span> <span class="o">:</span> <span class="n">Match</span>
<span class="n">s</span> <span class="o">:</span> <span class="n">Sequence</span>
<span class="n">x</span> <span class="o"><</span> <span class="n">candidate</span> <span class="o"><</span> <span class="n">y</span>
<span class="o">!</span><span class="n">s</span><span class="p">.</span><span class="n">visited</span><span class="p">.</span><span class="n">contains</span><span class="p">(</span><span class="n">candidate</span><span class="p">)</span>
<span class="o">!</span><span class="n">s</span><span class="p">.</span><span class="n">contains</span><span class="p">(</span><span class="n">candidate</span><span class="p">)</span>
<span class="n">s</span><span class="p">.</span><span class="n">contains</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">)</span>
<span class="n">s</span><span class="p">.</span><span class="n">indexOf</span><span class="p">(</span><span class="n">y</span><span class="p">)</span> <span class="o">==</span> <span class="n">s</span><span class="p">.</span><span class="n">indexOf</span><span class="p">(</span><span class="n">x</span><span class="p">)</span> <span class="o">+</span> <span class="mi">1</span>
<span class="n">then</span>
<span class="n">insert</span> <span class="n">s</span><span class="p">.</span><span class="n">clone</span><span class="p">(</span><span class="n">visited</span> <span class="o">+=</span> <span class="n">candidate</span><span class="p">)</span>
<span class="n">s</span><span class="p">.</span><span class="n">insert</span><span class="p">(</span><span class="n">candidate</span> <span class="n">between</span> <span class="n">x</span> <span class="n">and</span> <span class="n">y</span><span class="p">)</span>
</pre></div>
<p>The distance corresponding to a Sequence is determined by the gaps it leaves open:</p>
<p><img alt="sequence with gaps" src="./static/images/levenshtein2_gaps.svg" /></p>
<p>so once all valid Sequences have been generated, I simply pick the best one:</p>
<div class="codehilite"><pre><span class="n">when</span>
<span class="n">there</span> <span class="n">are</span> <span class="n">no</span> <span class="n">more</span> <span class="n">other</span> <span class="n">rules</span> <span class="n">to</span> <span class="n">run</span>
<span class="n">s</span> <span class="o">:</span> <span class="n">set</span> <span class="n">of</span> <span class="n">all</span> <span class="n">Sequence</span> <span class="n">instances</span>
<span class="n">then</span>
<span class="n">print</span> <span class="s">"Found distance %s"</span> <span class="o">%</span> <span class="n">min</span><span class="p">(</span><span class="n">s</span><span class="p">.</span><span class="n">map</span><span class="p">(</span><span class="n">_</span><span class="p">.</span><span class="n">distance</span><span class="p">))</span>
</pre></div>
<p>And that's it. From a complexity point of view, the algorithm is quite a pig. It explores the entire solution space and doesn't even use the best-known result for pruning. It isn't even easily parallelizable, with all the each-on-each semantics going on. It does, however, stick to the rule-based declarative approach so performance is the rule engine's problem ;-)</p>
</div><!-- /.entry-content -->
</section>
<footer id="contentinfo" class="body">
<address id="about" class="vcard body">
Proudly powered by <a href="http://getpelican.com/">Pelican</a>,
which takes great advantage of <a href="http://python.org">Python</a>.
</address><!-- /#about -->
</footer><!-- /#contentinfo -->
</div><!-- /.main-column -->
</div><!-- /.all -->
</body>
</html>