forked from szabgab/perlmaven.com
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathunique-values-in-an-array-in-perl.tt
305 lines (217 loc) · 7.61 KB
/
unique-values-in-an-array-in-perl.tt
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
=title Unique values in an array in Perl
=timestamp 2012-09-20T21:42:56
=indexes unique, uniq, distinct, filter, grep, array, List::MoreUtils, duplicate
=status show
=books beginner_book
=author szabgab
=index 1
=archive 1
=feed 1
=comments 1
=social 1
=abstract start
In this part of the <a href="/perl-tutorial">Perl tutorial</a> we are going to see how to
make sure we only have <b>distinct values in an array</b>.
Perl 5 does not have a built in function to filter out duplicate values
from an array, but there are several solutions to the problem.
=abstract end
<h2>List::MoreUtils</h2>
Depending on your situation, probably the simplest way is to use the <hl>uniq</hl>
function of the <a href="https://metacpan.org/module/List::MoreUtils">List::MoreUtils</a> module from CPAN.
<code lang="perl">
use List::MoreUtils qw(uniq);
my @words = qw(foo bar baz foo zorg baz);
my @unique_words = uniq @words;
</code>
A full example is this:
<code lang="perl">
use strict;
use warnings;
use 5.010;
use List::MoreUtils qw(uniq);
use Data::Dumper qw(Dumper);
my @words = qw(foo bar baz foo zorg baz);
my @unique_words = uniq @words;
say Dumper \@unique_words;
</code>
The result is:
<code>
$VAR1 = [
'foo',
'bar',
'baz',
'zorg'
];
</code>
For added fun the same module also provides a function called <hl>distinct</hl>,
which is just an alias of the <hl>uniq</hl> function.
In order to use this module you'll have to install it from CPAN.
<h2>Home made uniq</h2>
If you cannot install the above module for whatever reason, or if you think
the overhead of loading it is too big, there is a very short expression
that will do the same:
<code lang="perl">
my @unique = do { my %seen; grep { !$seen{$_}++ } @data };
</code>
This, of course can look cryptic to someone who does not know it already,
so it is recommended to define your own <hl>uniq</hl> subroutine,
and use that in the rest of the code:
<code lang="perl">
use strict;
use warnings;
use 5.010;
use Data::Dumper qw(Dumper);
my @words = qw(foo bar baz foo zorg baz);
my @unique = uniq( @words );
say Dumper \@unique_words;
sub uniq {
my %seen;
return grep { !$seen{$_}++ } @_;
}
</code>
<h2>Home made uniq explained</h2>
I can't just throw this example here and leave it like that. I'd better explain it.
Let's start with an easier version:
<code lang="perl">
my @unique;
my %seen;
foreach my $value (@words) {
if (! $seen{$value}) {
push @unique, $value;
$seen{$value} = 1;
}
}
</code>
Here we are using a regular <hl>foreach</hl> loop to go over the
values in the original array, one by one. We use a helper hash called <hl>%seen</hl>.
The nice thing about the hashes is that their keys are <b>unique</b>.
We start with an empty hash so when we encounter the first "foo", <hl>$seen{"foo"}</hl>
does not exist and thus its value is <hl>undef</hl> which is considered false in Perl.
Meaning we have not seen this value yet. We push the value to the end of the new
<hl>@uniq</hl> array where we are going to collect the distinct values.
We also set the value of <hl>$seen{"foo"}</hl> to 1.
Actually any value would do as long as it is considered "true" by Perl.
The next time we encounter the same string we already have that key
in the <hl>%seen</hl> hash and its value is true, so the <hl>if</hl> condition
will fail, and we won't <hl>push</hl> the duplicate value in the resulting array.
<h2>Shortening the home made unique function</h2>
First of all we replace the assignment of 1 <hl>$seen{$value} = 1;</hl> by the
post-increment operator <hl>$seen{$value}++</hl>. This does not change the behavior
of the previous solution - any positive number is going to be evaluated as TRUE, but
it will allow us to include the setting of the "seen flag" within the <hl>if</hl>
condition. It is important that this is a postfix increment (and not a prefix increment)
as this means the increment only takes place after the boolean expression was evaluated.
The first time we encounter a value the expression will be TRUE and the rest of the times
it will be FALSE.
<code lang="perl">
my @unique;
my %seen;
foreach my $value (@data) {
if (! $seen{$value}++ ) {
push @unique, $value;
}
}
</code>
This is shorter, but we can do even better.
<h2>Filtering duplicate values using grep</h2>
The <hl>grep</hl> function in Perl is a generalized form of the well known grep command of Unix.
It is basically a <a href="/filtering-values-with-perl-grep">filter</a>.
You provide an array on the right hand side and an expression in the block.
The <hl>grep</hl> function will take each value of the array one-by-one, put it in
<hl>$_</hl>, the <a href="/the-default-variable-of-perl.html">default scalar variable of Perl</a>
and then execute the block. If the block evaluates to TRUE, the value can pass.
If the block evaluates to FALSE the current value is filtered out.
That's how we got to this expression:
<code lang="perl">
my %seen;
my @unique = grep { !$seen{$_}++ } @words;
</code>
<h2>Wrapping it in 'do' or in 'sub'</h2>
The last little thing we have to do, is wrapping the above two statements in either
a <hl>do</hl> block
<code lang="perl">
my @unique = do { my %seen; grep { !$seen{$_}++ } @words };
</code>
or, better yet, in a function with an expressive name:
<code lang="perl">
sub uniq {
my %seen;
return grep { !$seen{$_}++ } @_;
}
</code>
<h2>Home made uniq - round 2</h2>
Prakash Kailasa suggested an even shorted version of implemening uniq,
for perl version 5.14 and above, if there is no requirement to preserve the order of elements.
Inline:
<code lang="perl">
my @unique = keys { map { $_ => 1 } @data };
</code>
or within a subroutine:
<code lang="perl">
my @unique = uniq(@data);
sub uniq { keys { map { $_ => 1 } @_ } };
</code>
Let's take this expression apart:
<hl>map</hl> has a similar syntax to <hl>grep</hl>: a block and an array (or a list of values).
It goes over all the elements of the array, executs the block and passes the result to the left.
In our case, for every value in the array it will pass the value itself followed by the number 1.
Remember <hl>=></hl>, aka. fat comma, is just a comma. Assuming @data has ('a', 'b', 'a') in it,
this expression will return ('a', 1, 'b', 1, 'a', 1).
<code lang="perl">
map { $_ => 1 } @data
</code>
If we assigned that expression to a hash, we would get the original data as keys, and the number 1-es as
valiues. Try this:
<code lang="perl">
use strict;
use warnings;
use Data::Dumper;
my @data = qw(a b a);
my %h = map { $_ => 1 } @data;
print Dumper \%h;
</code>
and you will get:
<code>
$VAR1 = {
'a' => 1,
'b' => 1
};
</code>
If, instead of assigning it to an array we wrap the above expression in curly braces, we will get a reference to an
anonymous hash.
<code lang="perl">
{ map { $_ => 1 } @data }
</code>
Let's see it in action:
<code lang="perl">
use strict;
use warnings;
use Data::Dumper;
my @data = qw(a b a);
my $hr = { map { $_ => 1 } @data };
print Dumper $hr;
</code>
Will print the same output as the previous one, barring any change in order in the dumping of the hash.
Finally, starting from perl version 5.14, we can call the <hl>keys</hl> function on hash references as well.
Thus we can write:
<code lang="perl">
my @unique = keys { map { $_ => 1 } @data };
</code>
and we'll get back the unique values fom <hl>@data</hl>
<h2>Exercise</h2>
Given the following file print out the unique values:
input.txt:
<code>
foo Bar bar first second
Foo foo another foo
</code>
expected output:
<code>
foo Bar bar first second Foo another
</code>
<h2>Exercise 2</h2>
This time filter out duplicates regardless of case.
expected output:
<code>
foo Bar first second another
</code>