%=========================================================================================
% Scallop recognition
\chapter{Multi-layered Scallop Recognition Framework}
\label{chap:scallop_recog}
%========================================================================================
\section{Introduction}
Recognizing marine organisms, like scallops, is a challenging problem. The eigen-value based shape descriptors introduced previously (in Chapter~\ref{chap:eigen}) are incapable of utilizing textural information, and are thus unsuitable for recognizing organisms with prominent textural markers. Their sensitivity to discretization noise is another factor that discourages their use in noisy natural images. The multi-layered object recognition approach discussed in this chapter combines both shape and textural cues to recognize objects. This framework is also expressly designed to deal with the noise present in images. A scallop enumeration problem is used as a means to validate this multi-layered approach.
The sea scallop \textit{(Placopecten magellanicus)} fishery in the
US EEZ (Exclusive Economic Zone) of the northwest Atlantic Ocean has been, and still is, one of
the most valuable fisheries in the United States.
Historically, the inshore sea scallop fishing grounds in the New York Bight,
i.e., Montauk Point, New York to Cape May, New Jersey, have provided
a substantial amount of scallops \cite{caddy, serchuk, hart, naidu, fisheries}.
These mid-Atlantic Bight ``open access''
grounds are especially important, not only for vessels fishing in the day boat
category, which are usually smaller vessels with limited range opportunities,
but also for all vessels that want to fish in near-shore ``open access''
areas to save fuel.\footnote{Based on personal communication with several limited
access and day boat scallopers.}
These areas offer high fish densities, but are at times rapidly depleted due
to overfishing \cite{rosenberg}.
The 2011 \gls{rsa}
project (Titled: ``A Demonstration Sea Scallop Survey of the
Federal Inshore Areas of the New York Bight using a Camera Mounted Autonomous Underwater
Vehicle'') was a scallop survey effort undertaken to study the health of the scallop population
along the coast of New York and New Jersey. As a part of this effort, around a quarter million images of the
ocean floor were recorded, and a manual scallop enumeration was performed on these images.
The considerable human effort involved in manual enumeration spawned the idea of building an automated species
recognition system that can sift through millions of images and perform species enumeration with minimal to no human intervention.
In response to this need for an automated scallop enumeration system, a multi-layered scallop
recognition framework was proposed \cite{prasanna_med, prasanna_aslo, prasanna_igi}.
The workflow of this scallop recognition framework involves four processing layers:
customized \gls{tdva} pre-processing, robust image
segmentation, object classification, and false-positive filtering.
The value of the proposed approach in this dissertation is primarily in providing a novel engineering solution to a real-world problem with economic
and societal significance, which goes beyond the particular domain of scallop population
assessment, and can possibly extend to other problems of environmental monitoring,
or even defense (e.g.\ mine detection).
Given the general unavailability of similar automation tools, the proposed one
can have potential impact in the area of underwater automation.
The multi-layered approach not only introduces several technical innovations at
the implementation level,
but also provides a specialized package for benthic habitat assessment.
At a processing level, it provides the flexibility to re-task individual
data processing layers for different detection applications.
When viewed as a complete package, the approach
offers an efficient tool to benthic habitat specialists for processing
large image datasets.
In this chapter, we discuss the details of the multi-layered scallop recognition system
\cite{prasanna_med, prasanna_aslo, prasanna_igi}. This chapter also describes the data collection effort that provided the imagery for the scallop enumeration survey. Finally, an in-depth comparison between this multi-layered framework and an earlier scallop recognition effort \cite{dawkings13} is presented.
%========================================================================================
\section{Background}
\subsection{Underwater Animal Recognition}
\label{sec:animal_lit}
In natural settings, living organisms often tend to blend into their environments to evade detection via camouflage. Webster's thesis work \cite{webster} provides a detailed exposition on the visual camouflage mechanisms adopted by animals to blend into their background. Under such circumstances of camouflage, there are very limited visual cues that can be used to identify animals. Even in the presence of visual cues, the task of identifying animals from natural scenes is shown to be a cognitively challenging and complex task \cite{wichmann}.
Previous efforts exist to detect animals such as plankton~\cite{mcgavin_plankton, stelzer_rotifier}, clams~\cite{forrest_clam} and a range of other benthic megafauna~\cite{schoening}. Most of these methods are specialized to a specific species, or have only been tested in controlled environments. In some cases, the methods require specialized apparatus (as in the plankton recognition studies~\cite{mcgavin_plankton, stelzer_rotifier}).
A series of automated tools, such as specialized color correction, segmentation and classification modules, combined with some level of manual expert support, can be used to identify several marine organisms such as sea anemones and sponges in natural image datasets~\cite{schoening}.
The existing techniques for marine animal recognition can be broadly divided into methods devised for identifying mobile organisms and methods for sedentary organisms. The former category is useful in dealing with a wide range of sea organisms, like the varied species of fish that swim through water. The latter category, which is less studied, includes identifying sedentary marine animals like scallops, corals and sponges. Both categories present their own set of challenges. In the rest of this section, we review the techniques relevant to moving animals and show how they differ from the methods employed for sedentary animals. An overview of the existing literature on recognizing sedentary animals follows, with special emphasis on methods developed for identifying scallops.
%========================================================================================
\subsubsection{Methods for Recognition of Moving Underwater Organisms}
Attempts have been made to recognize and count mobile marine life such as fish~\cite{spampinato, edgington, williams}, as well as in aquaculture studies \cite{zion}. The recurring theme in these efforts involves the use of stationary cameras to detect the presence of moving species, provided that the background can be described by a prior model. This technique of assuming a known background, and using changes in the background as evidence for the presence of a moving object entering the field of view of a sensor, is called background subtraction. In the marine species identification case, any changes to the background are assumed to be caused by a moving marine organism. The pixels in the image that deviate from the background model can be labeled as pixels belonging to the organism.
Once a marine organism is detected through background subtraction, other computer vision or machine learning techniques can be used to classify the organism into a specific species based on its visible characteristics. This classification task can be achieved through conventional machine learning approaches. For instance, the salmon species classification algorithm developed by Williams et al.~\cite{williams} uses active contours to model the shape of the fish before comparing these contours to known salmon species. However, if the pixels corresponding to the organism are contaminated by high levels of noise, a specialized technique that is robust to noise might be required.
Background subtraction requires a mathematical model that describes the distribution of background pixels. In an underwater setting, such a background model can only be obtained if the camera is stationary and is observing a static background. There can also be cases where the evolution of a non-static background over time can be captured through a mathematical model. Such well-defined background models are not always available. An opportunity to employ background subtraction-based techniques arises in underwater environments with stationary fixtures designed to study a specific underwater location. In instances where such a stationary arrangement of cameras is not available, background subtraction is inapplicable due to the lack of a background model.
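To make the idea concrete, a minimal Python sketch of background subtraction with a stationary camera is given below. It uses OpenCV's mixture-of-Gaussians background model in place of the background models employed by the cited studies; the function name and parameter values are illustrative assumptions rather than part of any of those systems.
\begin{verbatim}
import cv2

def foreground_masks(video_path):
    """Illustrative sketch: label pixels that deviate from a learned
    background model as foreground (a candidate moving organism)."""
    cap = cv2.VideoCapture(video_path)
    bg_model = cv2.createBackgroundSubtractorMOG2(history=500, varThreshold=16,
                                                  detectShadows=False)
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5))
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        mask = bg_model.apply(frame)      # 255 = foreground, 0 = background
        mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)  # remove speckle
        yield frame, mask
    cap.release()
\end{verbatim}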
%========================================================================================
\subsubsection{Methods for Recognition of Sedentary Underwater Organisms}
Since sedentary marine animals like scallops do not typically move (unless chased by a predator), a mobile robotic platform is required to traverse sub-sea relief to image and recognize those marine animals. Extending background subtraction to work with mobile robotic platforms is challenging, since the motion of the platform causes changes in its background. Generating a model of the background to perform background subtraction in these cases is problematic. This makes the task of detecting sedentary organisms with moving sensors even more challenging than detecting moving organisms with stationary sensors. The lack of a background model in these cases motivates the development of a foreground model. If a foreground model is available, the task of detecting an organism can be realized as a search for pixels satisfying the foreground model in the image.
Detecting an organism typically involves segmenting all pixels of the organism, in order for one to classify the organism into a known category.
Motion-based segmentation of marine animals, which subtracts a known background model from a snapshot of the environment and then attributes pixels with non-zero residual values to the foreground, is inapplicable in cases where a background model does not exist.
Furthermore, the task of segmentation can be challenging in noisy images with weak edges, since the boundary pixels of the foreground object cannot be easily distinguished from background pixels.
Thus, the lack of a background model makes background subtraction problematic. This leads to the need for techniques that depend on foreground models, and on the use of other features to detect and segment organisms from the background. This task becomes even more complicated if the organism does not present significant visual cues that distinguish it from the background, as in the case of creatures exhibiting camouflage. High levels of noise or unpredictable environmental variables can also significantly affect the effectiveness of any animal recognition mechanism.
%========================================================================================
\subsubsection{Scallop Recognition Methods}
There are several aspects that make scallop recognition challenging.
Scallops, especially when viewed in low resolution, do not provide features
that would clearly distinguish them from their natural environment. This
presents a major challenge in designing an automated identification process
based on visual data. To compound this problem, visual data collected
from the species' natural habitat contain a
significant amount of speckle noise.
Some scallops are also partially or almost completely
covered by sediment, obscuring the scallop shell features.
A highly robust detection mechanism is required to overcome these impediments.
There is a range of previously developed methods specialized for scallop recognition
\cite{dawkings13,guomundsson,enomoto9,enomoto10,fearn, prasanna_med, prasanna_aslo, prasanna_igi}
that operate on different assumptions, either with regards to the environmental conditions or the quality of data.
Existing approaches to automated scallop counting in artificial environments
\cite{enomoto9, enomoto10} employ detection mechanisms based on intricate distinguishing features,
namely the fluted patterns of scallop shells and the exposed shell rims of scallops, respectively.
Imaging these intricate scallop shell features might
be possible in artificial scallop beds with stationary cameras and
minimal sensor noise, but this level of detail
is difficult to obtain from low resolution images of scallops in their natural environment.
A major factor that contributes to the poor image resolution is that the image of a target is sometimes captured several meters away from it.
Overcoming this problem by operating an underwater vehicle much closer to the ocean floor
would reduce the image footprint (i.e.\ the area covered by an image) and increase the risk of damaging the vehicle.
Furthermore, existing work on scallop detection \cite{dawkings13, guomundsson} in their natural
environment is limited to small datasets (often less than 100 images).
A sliding window approach has been used \cite{guomundsson} to focus the search for the presence of scallops. The large number of overlapping windows that need to be processed per image raises scalability concerns if this method were to operate
on a large dataset containing millions of images. Additionally, the small number of natural images used as a test set raises questions about the generalizability of this method and its ability to function under varied environmental conditions.
The work by Dawkins \cite{dawkings13} is more detailed in its treatment of the natural environmental conditions spanning the scallop habitat. The images used here are collected using a towed camera system that minimizes noise, a fact which greatly enhances the performance of the machine learning and computer vision algorithms. Despite the elaborate imaging setup designed to minimize noise, the results reported are derived only from a few tens to hundreds of images.
It is not clear if those machine learning methods \cite{dawkings13} can extend to noisy image data captured by \gls{auv}s.
From these studies alone, it is not clear if such methods can be used effectively
in cases of large datasets comprising several thousand seabed images.
An interesting example of machine-learning methods applied to the
problem of scallop detection \cite{fearn}
utilizes the concept of \gls{buva}.
The approach is promising but it does not use any ground truth for validation.
Further work \cite{prasanna_med, prasanna_aslo, prasanna_igi} offers a multi-layered object recognition framework validated on
a natural image dataset for the scallop recognition application.
The main emphasis there (and in this dissertation) is to develop a technique that can work on low quality noisy sensor data collected using \gls{auv}s.
The other objective is to build a scalable architecture that can operate on
large image datasets in the order of thousands to millions of images
and can be generalized for recognizing other marine organisms.
A detailed comparison between the scallop recognition approaches in Dawkins et al.~\cite{dawkings13} and Kannappan et al.~\cite{prasanna_igi} is provided in Section~\ref{sec:scallop_discussion}.
%========================================================================================
\subsection{Motivation for a Generalized Automated Object Recognition Tool}
Understanding the parameters that affect the habitat of underwater organisms is of interest to marine
biologists and government officials charged with regulating a multi-million dollar fishing industry. Dedicated
marine surveys are needed to obtain population assessments. One traditional scallop survey method, still
in use today, is a dredge-based survey. Dredge-based surveys have been extensively used for scallop population density
assessment \cite{nefsc}. The process involves dredging part of the ocean floor, and manually counting the
animals of interest found in the collected material. In addition to being invasive and
detrimental to the creatures’ habitat \cite{jenkins}, these methods have accuracy
limitations and can only generalize population numbers up to a certain extent.
There is a need for non-invasive and accurate survey alternatives.
The availability of a range of robotic systems, in the form of towed camera and \gls{auv} systems, offers possibilities for such non-invasive alternatives. Optical imaging surveys using underwater
robotic platforms provide higher data densities. The large volume of image data (in the order of thousands
to millions of images) can be both a blessing and a curse. On one hand, it provides a detailed picture of
the species' habitat; on the other, it requires extensive manpower and time to process the data.
While improvements in robotic platform and image acquisition systems have enhanced our capabilities to
observe and monitor the habitat of a species, we still lack the required arsenal of data processing tools. This
need motivates the development of automated tools to analyze benthic imagery data containing scallops.
One of the earliest video-based surveys of scallops \cite{rosenkratz} reports that
it took from 4 to 10 hours of tedious manual analysis in order to review and process one hour
of collected seabed imagery. The report suggests that an automated computer technique for
processing of benthic images would be a great leap forward; to this time, however, no
such system is available. There is anecdotal evidence of in-house development efforts by the
HabCam group \cite{gallager} towards an automated system, but as yet no such
system has been made available to the community of researchers and managers. A recent manual count
of our \gls{auv}-based imagery dataset indicated that it took an hour to process 2080 images,
whereas expanding the analysis to include all benthic macro-organisms reduced the rate
down to 600 images/hr \cite{walker}. Another manual counting effort \cite{oremland}
reports a processing time of 1 to 10 hours per person to process each image tow
transect (the exact image number per tow was not reported). The same report indicates that
the processing time was reduced to 1–2 hours per tow by subsampling 1\% of the images.
Future benthic studies can be geared towards increasing data densities with the help of robotic optical surveys.
It is clear that the large datasets, in the order of millions of images, generated by these surveys will impose
a strain on researchers if the images are to be processed manually. This strongly suggests the need for automated tools
that can process underwater image datasets. Motivated by the need to reduce human effort, Schoening \cite{schoening} has proposed a range of tools that
can be generalized to organisms like sea-anemones. With an additional requirement of being able to work with low-resolution noisy underwater images,
a generalized multi-layered framework that can be used to detect and count underwater organisms has been proposed \cite{prasanna_med, prasanna_aslo, prasanna_igi}. This method has been evaluated in a scallop population assessment effort on a dataset containing over 8000 images, the details of which are the subject of this chapter.
%========================================================================================
\section{Preliminaries}
\subsection{Visual Attention}
\label{sec:visual_attn}
Visual attention is a neuro-physiologically inspired machine learning method \cite{koch}
that attempts to mimic the human brain function in its ability to rapidly single out
objects that are different from their surroundings within imagery data.
The method is based on the hypothesis that the human visual system first isolates
points of interest in an image, and then sequentially processes these points
based on the degree of interest associated with each point.
The degree of interest associated with a pixel is called \emph{salience},
and points with the highest salience values are processed first.
The method is used to pinpoint regions in an image where the values of some pixel attributes may be an indicator of their uniqueness relative to the rest of the image.
According to the visual attention hypothesis \cite{koch}, in the
human visual system the input video feed
is split into several feature streams.
Locations in these feature streams that are different from others in their neighborhood would generate
peaks in the \emph{center-surround} feature maps.
The different center-surround feature maps can be combined to obtain a
saliency \emph{map}.
Peaks in these resulting saliency maps, otherwise known as \emph{fixations}, become
points of interest, processed sequentially in descending order of their salience values.
Itti et al.~\cite{itti} proposed a computational model for visual attention.
According to this model, an image is first processed along three feature streams
(color, intensity, and orientation).
The color stream is further divided into two sub-streams (red-green and blue-yellow) and
the orientation stream into four sub-streams
($\theta \in \{0^\circ, 45^\circ, 90^\circ, 135^\circ\}$).
The image information in each sub-stream is further processed at 9 different scales.
At each scale, the image is scaled down by a factor of $\frac{1}{2^k}$ (where $k = 0,\ldots,8$),
resulting in some loss of information as the scale index increases.
The resulting image data for each scale factor constitutes the
\emph{spatial scale} for the particular sub-stream.
The sub-stream feature maps are compared across different scales to expose differences between them. Since the scaling factors change the information contained at each spatial scale of a sub-stream feature map, resizing these spatial scales to a common scale through interpolation and then comparing them brings out the mismatch between the scales.
Let $\ominus$ be an operator that takes pixel-wise differences between resized sub-streams. This operator is called the \emph{center-surround} operator, and codifies the mismatches between the differently scaled sub-streams in the form of another map: the center-surround feature map. In the case of
the intensity stream, with $c\in\{2,3,4\}$ and $s=c+\delta$ for $\delta \in \{3,4\}$ denoting the indices of two different spatial scales, the center-surround feature map is given by
%
\begin{equation} \label{intensity-CS}
I(c,s)=\left|I(c)\ominus I(s)\right| \enspace.
\end{equation}
%
Similarly center-surround feature maps are computed for each sub-stream
in color and orientation streams.
In this way, the seven sub-streams (two in color, one in intensity, and four in orientation)
yield a total of 42 center-surround feature maps.
All center-surround feature maps in an original stream (color, intensity,
and orientation) are then combined into a \emph{conspicuity map} (CM):
one for color $\bar{C}$, one for intensity $\bar{I}$, and one for orientation $\bar{O}$.
Define the cross-scale operator $\oplus$ that adds up pixel values in different maps.
Let $w_{cs}$ be scalar weights associated with how much the combination of
two different spatial scales $c$ and $s$ contributes to the resulting conspicuity map.
If $M$ is the global maximum over the map resulting from the $\oplus$ operation, and $\bar{m}$ is the mean over
all local maxima present in the map, let $\mathcal{N}(\cdot)$ be a normalization operator that scales that map by a factor of $(M-\bar{m})^{2}$.
For the case of intensity, this combined operation produces a conspicuity map based on the formula
%
\begin{equation} \label{eq:feature_map}
\bar{I}=\bigoplus_{c=2}^{4} \bigoplus_{s=c+3}^{c+4}w_{cs}\,\mathcal{N}(I(c,s))
\enspace.
\end{equation}
%
The three conspicuity maps---for intensity, color and orientation---are combined to produce the \emph{saliency map}.
If scalar weights for each data stream are selected, say
$w_{\bar{I}}$ for intensity, $w_{\bar{C}}$ for color, and
$w_{\bar{O}}$ for orientation, the saliency map can be expressed mathematically as
%
\begin{equation} \label{eq:saliency_map}
S=w_{\bar{I}}\,\mathcal{N}(\bar{I})+w_{\bar{C}}\,\mathcal{N}(\bar{C})+
w_{\bar{O}}\,\mathcal{N}(\bar{O}) \enspace.
\end{equation}
%
In a methodological variant of visual attention known as \gls{buva}, all streams are weighted equally: $w_{cs}$ is constant for all $c\in\{2,3,4\}$, $s=c+\delta$ ($\delta\in\{3,4\}$)
and $w_{\bar{I}}=w_{\bar{C}}=w_{\bar{O}}$.
A winner-takes-all neural network is typically used \cite{itti,walther}
to compute the maxima, or fixations, on this map---other discrete optimization methods are of course possible.
In the context of visual attention, fixations are the local maxima of the saliency map.
These fixations lead to shifts in \emph{focus of attention}; in other words, they enable the human vision processing system to preferentially process regions around fixations in an image.
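As an illustration of this computational model, a simplified Python sketch of the intensity stream is given below: it builds a 9-level Gaussian pyramid, forms the center-surround maps $I(c,s)=|I(c)\ominus I(s)|$ of \eqref{intensity-CS}, applies an approximation of the $\mathcal{N}(\cdot)$ operator, and sums the maps as in \eqref{eq:feature_map}. The color and orientation streams are handled analogously and are combined via \eqref{eq:saliency_map}. Resizing all maps to a single common scale, the block-based estimate of the local-maxima mean inside \texttt{normalize\_map}, and the uniform default weights are simplifying assumptions made only for illustration.
\begin{verbatim}
import numpy as np
import cv2

def normalize_map(m):
    # Approximation of N(.): scale the map by (M - mbar)^2, with M its global
    # maximum and mbar the mean of its local maxima (estimated here on blocks).
    M = m.max()
    if M <= 0:
        return m
    m = m / M
    h, w = m.shape
    blocks = [m[i:i+16, j:j+16].max()
              for i in range(0, h, 16) for j in range(0, w, 16)]
    return m * (1.0 - float(np.mean(blocks))) ** 2

def intensity_conspicuity(img_bgr, w_cs=None):
    # Intensity stream: 9-level pyramid, center-surround maps I(c,s) for
    # c in {2,3,4} and s = c+3, c+4, then a weighted cross-scale sum.
    I = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2GRAY).astype(np.float32)
    pyr = [I]
    for _ in range(8):
        pyr.append(cv2.pyrDown(pyr[-1]))
    h, w = pyr[4].shape                       # common comparison scale
    conspicuity = np.zeros((h, w), np.float32)
    for ci, c in enumerate((2, 3, 4)):
        for di, delta in enumerate((3, 4)):
            Ic = cv2.resize(pyr[c], (w, h))
            Is = cv2.resize(pyr[c + delta], (w, h))
            cs_map = np.abs(Ic - Is)          # center-surround map I(c,s)
            wt = 1.0 if w_cs is None else w_cs[2 * ci + di]
            conspicuity += wt * normalize_map(cs_map)
    return conspicuity
\end{verbatim}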
In a different variant of visual attention referred to as \gls{tdva} \cite{navalpakkam},
the weights in \eqref{eq:feature_map} and \eqref{eq:saliency_map}
are selected judiciously to bias fixations toward particular attributes. A method exists \cite{navalpakkam} to select these weights in the general case where $N_m$ maps are to be combined using those weights.
Let $N$ be the number of images in the learning set,
and $N_{iT}$ and $N_{iD}$ be the number of targets---in this case, scallops---and distractors
(similar objects) in image $i$ within the learning set.
For image $i$, let $P_{ijT_{k}}$ denote the local maximum of the numerical values of the map for feature $j$ in the neighborhood of the target indexed $k$;
similarly, let $P_{ijD_{r}}$ be the local maximum of the numerical values of the map for feature $j$ in the neighborhood of distractor indexed $r$.
The weights for a combination of maps are determined by
\begin{align} \label{eq:learning_wts_all}
w'_{j}=\frac{\sum_{i=1}^{N} N_{iT}^{-1}\sum_{k=1}^{N_{iT}}P_{ijT_{k}} }{
\sum_{i=1}^{N} N_{iD}^{-1} \sum_{r=1}^{N_{iD}}P_{ijD_{r}}} \nonumber \\
w_{j}=\frac{w'_{j}}{\frac{1}{N_m} \sum_{j=1}^{N_{m}}w'_{j} },
\end{align}
where $j\in\{1,\ldots,N_{m}\}$ is the index set of the different maps to be combined.
Equations \eqref{eq:learning_wts_all} are used for the selection of weights $w_{cs}$ in \eqref{eq:feature_map}, and $w_{\bar{I}}$, $w_{\bar{O}}$, $w_{\bar{C}}$ in
\eqref{eq:saliency_map}.
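A minimal sketch of this weight computation is given below. The nested lists \texttt{P\_T} and \texttt{P\_D} are assumed to hold, for every image and every map, the local maxima $P_{ijT_k}$ and $P_{ijD_r}$ measured around targets and distractors, respectively; images without any targets or distractors are assumed to have been excluded beforehand.
\begin{verbatim}
import numpy as np

def topdown_weights(P_T, P_D):
    # Eq. for the learning weights:
    #   w'_j = [sum_i mean_k P_ijT_k] / [sum_i mean_r P_ijD_r],
    # followed by normalization with the average of the w'_j.
    N = len(P_T)           # images in the learning set
    N_m = len(P_T[0])      # maps to be combined
    w_prime = np.zeros(N_m)
    for j in range(N_m):
        num = sum(np.mean(P_T[i][j]) for i in range(N))
        den = sum(np.mean(P_D[i][j]) for i in range(N))
        w_prime[j] = num / den
    return w_prime / w_prime.mean()
\end{verbatim}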
%========================================================================================
\section{Problem Statement}
\begin{figure}
\centering
\includegraphics[width=0.6\textwidth]{scallopred}
\caption[Seabed image with scallops shown in red circles]{Seabed image with scallops shown in red circles}
\label{fig:scallopred}
\end{figure}
A visual scallop population assessment process involves identifying these animals in image datasets.
A representative example of an image from the dataset we had to work with is shown in Figure~\ref{fig:scallopred} (scallops marked within red circles).
A general solution to automated image annotation might not necessarily be effective for the dataset at hand.
The need here is to identify algorithms and methods that will work best under \emph{poor} lighting and imaging conditions, characteristic of this particular scallop counting application.
The results from using elementary image processing methods like thresholding and edge detection on the images (see Figure~\ref{subfig:thresh_scallop} and \ref{subfig:edge_scallop}) demonstrate the need for a more sophisticated approach (possibly a hybrid combination of several techniques).
\begin{figure}
\centering
\begin{subfigure}[]{0.15\textwidth}
\includegraphics[width=\textwidth]{dcrescent_scallop}
\caption{}
\label{subfig:dcrescent_scallop}
\end{subfigure}
\begin{subfigure}[]{0.15\textwidth}
\includegraphics[width=\textwidth]{bcrescent_scallop}
\caption{}
\label{subfig:bcrescent_scallop}
\end{subfigure}
\begin{subfigure}[]{0.17\textwidth}
\raisebox{-2pt}{\includegraphics[width=\textwidth]{thresh}}
\caption{}
\label{subfig:thresh_scallop}
\end{subfigure}
\begin{subfigure}[]{0.17\textwidth}
\raisebox{-2pt}{\includegraphics[width=\textwidth]{edge}}
\caption{}
\label{subfig:edge_scallop}
\end{subfigure}
\caption[Scallop features]{(\subref{subfig:dcrescent_scallop}) Scallop with yellowish tinge and dark crescent; (\subref{subfig:bcrescent_scallop}) Scallop with yellowish tinge and bright shell rim crescent; (\subref{subfig:thresh_scallop}) Scallop sample after thresholding; (\subref{subfig:edge_scallop}) Scallop sample after edge detection.
}
\end{figure}
Another challenge, related to the issue of low image resolution and high levels of speckle noise, is the selection of appropriate scallop features that would enable distinguishing between these organisms and other objects.
In the particular dataset, one recurrent visual pattern is a dark crescent on the upper perimeter of the scallop shell, which is the shadow cast by the upper open scallop shell produced from the \gls{auv} strobe light (see Figure~\ref{subfig:dcrescent_scallop}).
Another pattern that could serve as a feature in this dataset is a bright crescent on the periphery of the scallop, generally associated with the visible interior of the bottom half when the scallop shell is partly open (see Figure~\ref{subfig:bcrescent_scallop}).
A third pattern may be a yellowish tinge associated with the composition of the scallop image (see Figure~\ref{subfig:bcrescent_scallop}).
We have leveraged visual patterns \cite{prasanna_aslo} to develop a three-layered scallop counting framework that combines tools from computer vision and machine learning.
This particular hybrid architecture uses top-down visual attention, graph-cut segmentation and template matching along with a range of other filtering and image processing techniques.
Though this architecture achieves a true positive detection rate of over 63\%, it produces a very large number of false positives.
To mitigate this problem, we extend the framework \cite{prasanna_aslo} by adding a fourth, false-positives filtering layer \cite{prasanna_igi}.
%========================================================================================
\section{Scallop Survey Procedure}
The 2011 \gls{rsa}
project (Titled: ``A Demonstration Sea Scallop Survey of the
Federal Inshore Areas of the New York Bight using a Camera Mounted Autonomous Underwater
Vehicle'')
was a proof-of-concept project that successfully used a digital, rapid-fire camera integrated
to a Gavia \gls{auv}, to collect a continuous record of photographs for mosaicking,
and subsequent scallop enumeration.
In July 2011, transects were completed in the northwestern waters of the mid-Atlantic
Bight at depths of 25-50 m. The \gls{auv} continuously photographed the
seafloor along each transect at a constant distance of 2\,m above the seafloor.
Parallel sets of transects were spaced as close as 4\,m.
Georeferenced images were manually analyzed for the presence of sea scallops
using position data logged (using \gls{dvl} and \gls{ins}) with each image.
%
\subsection{Field Survey Process}
%
\begin{figure}
\centering
\includegraphics[width=0.5\textwidth, natwidth=944, natheight=800]{survey_region}
\caption[Map of the scallop survey region]{Map of the survey region from Shinnecock, New York to Cape May, New Jersey,
divided into eight blocks or strata}
\label{fig:strata}
\end{figure}
%
In the 2011 demonstration survey, the federal inshore scallop grounds from Shinnecock, New York to Ocean View, Delaware, were divided into eight blocks or strata (as shown in Figure~\ref{fig:strata}).
The \textit{f/v Christian and Alexa} served as the surface support platform
from which a Gavia \gls{auv} (see Figure~\ref{fig:gavia_auv}) was deployed and recovered. The \gls{auv}
conducted photographic surveys of the seabed for a continuous duration of approximately
3 hours during each dive, repeated 3--4 times in each stratum, with
each stratum involving roughly 10 hours of imaging and an area of about $45\,000$
m$^2$.
The \gls{auv} collected altitude (height above the seabed) and
attitude (heading, pitch, roll) data, allowing the georectification of each image
into scaled images for size and counting measurements. During the 2011 pilot study
survey season, over $250\,000$ images of the seabed were collected.
These images were analyzed in the University of Delaware's Coastal Sediments, Hydrodynamics and Engineering
Laboratory for estimates of scallop
abundance and size distribution. The \textit{f/v Christian and Alexa} provided
surface support, and made tows along the \gls{auv} transect to ground-truth the presence of
scallops and provide calibration for the size distribution.
Abundance and sizing estimates were computed manually for each image using GUI-based
digital sizing software.
Each image included embedded metadata that
allowed it to be incorporated into existing benthic image classification systems
(HabCam mip \cite{dawkings13}).
% Abundance and sizing estimates were conducted via a heads-up manual method,
% with each image including embedded metadata allowing it to be incorporated into
% to existing benthic image classification systems (HabCam mip \citep{Dawkings13}).
During this proof of concept study, in each stratum the f/v
\textit{Christian and Alexa} made one 15-minute dredge tow along the \gls{auv}
transect to ground-truth the presence of scallops and other fauna,
and provide calibration for the size distribution. The vessel was maintained
on the dredge track by using Differential GPS.
The tows were made with the starboard 15 ft
(4.572\;m)
wide New Bedford style commercial dredge at the commercial dredge speed of 4.5--5.0 knots.
The dredge was equipped with 4 inch (10.16 cm) interlocking rings,
an 11 inch (27.94 cm) twine mesh top, and turtle chains.
After dredging, the catch was sorted, identified, and weighed.
Length-frequency data were obtained for the caught scallops.
This information was recorded onto data logs and then entered into a laptop computer database aboard ship for comparison to the camera image estimates.
The mobile platform of the \gls{auv} provided a more expansive and continuous coverage
of the seabed compared to traditional fixed drop camera systems or towed camera systems.
In a given day, the \gls{auv} surveys covered about $60\,000$\;m$^2$ of seabed
from an altitude of 2\;m above the bed, simultaneously producing broad sonar
swath coverage and measuring the salinity, temperature, dissolved oxygen, and
chlorophyll-a in the water.
\subsection{Sensors and Hardware}
\label{section:equipment}
The University of Delaware \gls{auv} (Figure \ref{fig:gavia_auv}) was used to collect
continuous images of the benthos, and simultaneously map the texture and topography of
the seabed. Sensor systems associated with this vehicle include: \begin{enumerate*}[label=(\arabic*):, start=1]
\item a 500\;kHz GeoAcoustics GeoSwath Plus phase measuring bathymetric
sonar; \item a 900/1800\;kHz Marine Sonic dual-frequency high-resolution
side-scan sonar;
\item a Teledyne RD Instruments 1200 kHz acoustic \gls{dvl}/\gls{adcp};
\item a Kearfott T-24 inertial navigation system;
\item an Ecopuck FLNTU combination fluorometer / turbidity sensor;
\item a Point Grey Scorpion model 20SO digital camera and LED strobe array;
\item an Aanderaa Optode dissolved oxygen sensor;
\item a temperature and density sensor; and, \item an altimeter. \end{enumerate*}
Each sensor separately records time- and spatially-stamped data at its own sampling frequency and spacing.
The \gls{auv} is capable of very precise dynamic positioning, adjusting to the variable
topography of the seabed while maintaining a constant commanded altitude offset.
\begin{figure}
\centering
\begin{subfigure}[]{0.7\textwidth}
\includegraphics[width=\textwidth,natwidth=1052,natheight=428]{auv_schematics}
\end{subfigure}
\begin{subfigure}[]{0.6\textwidth}
\includegraphics[width=\textwidth,natwidth=690,natheight=518]{auv_image}
\end{subfigure}
\caption{Schematics and image of the Gavia \gls{auv} }
\label{fig:gavia_auv}
\end{figure}
\subsection{Data Collection}
The data was collected over two separate five-day cruises in July 2011.
In total, 27 missions were run using the \gls{auv} to photograph the seafloor (for a list of missions, see Table~\ref{tab:mission_list}).
Mission lengths were constrained by the 2.5 to 3.5 hour battery life of the \gls{auv}.
During each mission, the \gls{auv} was instructed to follow a constant height of 2\;m
above the seafloor. In addition to the $250\,000$ images that were collected, the
\gls{auv} also gathered data about water temperature, salinity, dissolved oxygen,
geoswath bathymetry, and side-scan sonar of the seafloor.
The camera on the \gls{auv}, a Point Grey Scorpion model 20SO (for camera specifications see Table \ref{tab:camera_specs}),
was mounted inside the nose module of the vehicle.
It was focused at 2\;m, and captured
images at a resolution of $800\times600$. The camera lens had a horizontal viewing
angle of 44.65 degrees. Given the viewing angle and distance from the seafloor,
the image footprint can be calculated as $1.86\times1.40$\;m$^2$.
Each image was saved in JPEG format, with metadata that included position information
(including latitude, longitude, depth, altitude, pitch, heading and roll)
and the near-seafloor environmental conditions analyzed in this study.
This information is stored in the header file, making the images readily comparable and
able to be incorporated into existing \gls{rsa} image databases, such as the
HabCam database.
A manual count of the number of scallops in each image was performed and used to obtain an overall scallop
abundance assessment.
Scallops counted were articulated shells in life position (left valve up) \cite{walker}.
%
\begin{table}
\centering
\begin{threeparttable}
\begin{tabular}{ll}
\toprule[1pt]\\[-6pt]
Mission & Number of images\\[2pt]\midrule
LI1\tnote{1} &$12\,775$\\
LI2 &$2\,387$\\
LI3 &$8\,065$\\
LI4 &$9\,992$\\
LI5 &$8\,338$\\
LI6 &$11\,329$\\
LI7 &$10\,163$\\
LI8 &$9\,780$\\
LI9 &$2\,686$\\
NYB1\tnote{2} &$9\,141$\\
NYB2 &$9\,523$\\
NYB3 &$9\,544$\\
NYB4 &$9\,074$\\
NYB5 &$9\,425$\\
NYB6 &$9\,281$\\
NYB7 &$12\,068$\\
NYB8 &$9\,527$\\
NYB9 &$10\,950$\\
NYB10 &$9\,170$\\
NYB11 &$10\,391$\\
NYB12 &$7\,345$\\
NYB13 &$6\,285$\\
NYB14 &$9\,437$\\
NYB15 &$11\,097$\\
ET1\tnote{3} &$9\,255$\\
ET2 &$12\,035$\\
ET3 &$10\,474$\\
\\[2pt]\bottomrule[1pt]
\end{tabular}
\begin{tablenotes}
\vskip 5pt
\item[1] \footnotesize{LI--Long Island}
\item[2] \footnotesize{NYB--New York Bight}
\item[3] \footnotesize{ET--Elephant Trunk}
\end{tablenotes}
\end{threeparttable}
\caption{List of missions and number of images collected}
\label{tab:mission_list}
\end{table}
%
\begin{table}
\centering
\begin{tabularx}{\textwidth}{XX}
\toprule[1pt]\\[-6pt]
Attribute &Specs\\[2pt]\midrule
Name &Point Grey Scorpion 20SO Low Light Research Camera\\
Image Sensor &8.923 mm Sony CCD\\
Horizontal Viewing Angle &44.65 degrees (underwater)\\
Mass &125 g\\
Frame rate &3.75 fps\\
Memory &Computer housed in \gls{auv} nose cone\\
Image Resolution &800 $\times$ 600\\
Georeferenced metadata &Latitude, longitude, altitude, depth\\
Image Format &JPEG\\[2pt]\bottomrule[1pt]
\end{tabularx}
\caption{Camera specifications\label{tab:camera_specs}}
\end{table}
%========================================================================================
\section{Methodology}
This section discusses the multi-layered scallop counting framework, which comprises four layers of processing applied to underwater images to obtain scallop counts.
The four layers involve the sequential application of Top-Down Visual Attention, Segmentation, Classification, and False-Positive Filtering.
\subsection{Layer I: Top-Down Visual Attention} \label{subsec:TDVA}
\subsubsection{Learning}
A customized \gls{tdva} algorithm can be designed
to sift automatically through the body of imagery data, and focus on
regions of interest that are more likely to contain scallops.
The process of designing the \gls{tdva} algorithm is described below.
The first step is a small-scale, \gls{buva}-based saliency computation.
The saliency computation is performed
on a collection of 243 randomly selected annotated images, collectively containing 300 scallops.
This collection constitutes the \emph{learning set}.
Figure~\ref{fig:saliency_combine} represents graphically the flow of computation
and shows the type of information in a typical image that visual attention tends to highlight.
%
\begin{figure*}
\vskip -5pt
\centering
\includegraphics[width=0.80\textwidth,natwidth=864,natheight=582]{saliency_example1.png}
\caption{Illustration of computation flow for the construction of saliency maps}
\label{fig:saliency_combine}
\end{figure*}
%
A process of extremum seeking on the saliency map of each image identifies fixations in the associated image.
If a $100\times100$ pixel window---corresponding to an approximately $23 \times 23$\enspace cm$^2$ area on the seafloor---centered around a fixation point
contains the center of a scallop, the corresponding
fixation is labeled a \emph{target}; otherwise, it is considered a \emph{distractor}.
\begin{figure}
% \vskip +6pt
\centering
\includegraphics[width=0.6\textwidth,natwidth=800,natheight=600]{fixation_example.pdf}
\caption[Illustration of fixations]{Illustration of fixations (marked by yellow boundaries):
red lines indicate the order in which the fixations were detected with the lower-left fixation being the first.
}
\label{fig:fixation}
%\vskip -15pt
\end{figure}
The target and distractor regions are determined in all the feature and conspicuity maps for each one of these processed images in the learning set.
This is done by adaptively thresholding and locally segmenting the points around the fixations with similar salience values in each map.
Then the mean numerical values in neighborhoods around
these target and distractor regions in the feature maps and conspicuity maps
are computed. These values are used to populate the $P_{ijT_k}$ and $P_{ijD_r}$ variables in \eqref{eq:learning_wts_all}, and to determine the top-down weights for feature maps and conspicuity maps.
For the conspicuity maps, the center-surround scale weights $w_{cs}$ computed through \eqref{eq:learning_wts_all} and consequently used in \eqref{eq:feature_map},
are shown in Table \ref{tab:tdva_fm_wts}.
For the saliency map computation, the weights resulting from the application of
\eqref{eq:learning_wts_all} on the conspicuity maps are
$w_{\bar{I}}= 1.1644$, $w_{\bar{C}}= 1.4354$ and $w_{\bar{O}}= 0.4001$.
The symmetry of the scallop shell in our low-resolution dataset
justifies the relatively small value of the orientation weight.
\begin{table*}
\caption{Top-down weights for feature maps \label{tab:tdva_fm_wts}}
\centering
\begin{tabular}{llllllll}
\toprule[1pt]\\[-6pt]
& & \multicolumn{6}{c}{Center Surround Feature Scales}\\[2pt]\cline{1-8}\\[-5pt]
& & 1 & 2 & 3 & 4 & 5 & 6\\[2pt]\cline{1-8}\\[-5pt]
Color & red-green & 0.8191 & 0.8031 & 0.9184 & 0.8213 & 0.8696 & 0.7076\\
& blue-yellow & 1.1312 & 1.1369 & 1.3266 & 1.2030 & 1.2833 & 0.9799\\[8pt]
Intensity & intensity & 0.7485 & 0.8009 & 0.9063 & 1.0765 & 1.3111 & 1.1567\\[8pt]
Orientation & $0^\circ$ & 0.7408 & 0.2448 & 0.2410 & 0.2788 & 0.3767 & 2.6826\\
& $45^\circ$ & 0.7379 & 0.4046 & 0.4767 & 0.3910 & 0.7125 & 2.2325\\
& $90^\circ$ & 0.6184 & 0.5957 & 0.5406 & 1.2027 & 2.0312 & 2.1879\\
& $135^\circ$ & 0.8041 & 0.6036 & 0.7420 & 1.5624 & 1.1956 & 2.3958\\[2pt]\bottomrule[1pt]
\end{tabular}
\end{table*}
%
\subsubsection{Implementation and Testing}\label{sec:tdva-testing}
To test the performance of the customized \gls{tdva} algorithm, it is applied on
two image datasets, the size of which is shown in Table~\ref{tab:count_results}.
In this application, the saliency maps are computed via the formulae
\eqref{eq:saliency_map} and \eqref{eq:feature_map},
using the weights listed in Table \ref{tab:tdva_fm_wts}.
The convergence time of the winner-takes-all neural network that finds fixations in the saliency map of each image in the datasets of Table~\ref{tab:count_results} is controlled using dynamic thresholding:
it is highly unlikely that a fixation that contains an object of interest
requires more than $10\,000$ iterations. If convergence to some fixation takes
more than this number of iterations,
then the search is terminated and no more fixations
are sought in the image.
Given that an image in the datasets of
Table~\ref{tab:count_results} contains two scallops on average, no more than
ten fixations are sought in each image (the percentage of images
in the datasets that contained more than $10$ scallops
was $0.002\%$).
Since in the testing phase the whole scallop---not just the center---needs to be included in the fixation window, the size of this window is set at
$270\times270$ pixels; more than $91$\% of the scallops are accommodated inside the window (Figure~\ref{fig:window_length}).
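A hedged sketch of this fixation extraction is shown below. The winner-takes-all network is replaced by a direct maximum search with inhibition of return (as noted earlier, other discrete optimizers are possible), and the iteration cap of the network is mimicked by a minimum-salience stopping criterion; the fraction used for that criterion and the reuse of the $270\times270$ window for inhibition are illustrative assumptions.
\begin{verbatim}
import numpy as np

def extract_fixations(saliency, max_fixations=10, window=270,
                      min_salience_frac=0.05):
    # Repeatedly pick the most salient pixel, then suppress a window around it
    # (inhibition of return) so that subsequent fixations land elsewhere.
    S = saliency.astype(np.float64).copy()
    stop = min_salience_frac * S.max()
    half = window // 2
    fixations = []
    for _ in range(max_fixations):
        r, c = np.unravel_index(np.argmax(S), S.shape)
        if S[r, c] <= stop:   # analogue of WTA non-convergence: stop searching
            break
        fixations.append((r, c))
        S[max(0, r - half):r + half, max(0, c - half):c + half] = 0.0
    return fixations
\end{verbatim}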
\begin{figure}
\centering
\includegraphics[width=0.60\textwidth,natwidth=800,natheight=600]{windowlength.pdf}
\caption[Scallop fixation window dimension analysis]{Percentage of scallops enclosed in the fixation window as a function of
window half length (in pixels)}
\label{fig:window_length}
\end{figure}
\begin{figure*}
\centering
\begin{subfigure}[]{0.17\textwidth}
\includegraphics[width=\textwidth,natwidth=842,natheight=844]{fixationwindow.pdf}
\caption{}
\label{subfig:fixationwindow}
\end{subfigure}
\begin{subfigure}[]{0.17\textwidth}
\includegraphics[width=\textwidth,natwidth=837,natheight=839]{edgeimg.pdf}
\caption{}
\label{subfig:edgeimg}
\end{subfigure}
\begin{subfigure}[]{0.17\textwidth}
\includegraphics[width=\textwidth,natwidth=1055,natheight=842]{graphimg.pdf}
\caption{}
\label{subfig:graphimg}
\end{subfigure}
\begin{subfigure}[]{0.17\textwidth}
\includegraphics[width=\textwidth,natwidth=834,natheight=837]{regboundry.pdf}
\caption{}
\label{subfig:regboundry}
\end{subfigure}
\begin{subfigure}[]{0.17\textwidth}
\includegraphics[width=\textwidth,natwidth=810,natheight=813]{regcircle.pdf}
\caption{}
\label{subfig:regcircle}
\end{subfigure}
\caption[Segmentation layer process flow]{(\subref{subfig:fixationwindow}) Fixation window from layer I; (\subref{subfig:edgeimg}) Edge segmented image; (\subref{subfig:graphimg}) graph-cut segmented image; (\subref{subfig:regboundry}) Region boundaries obtained when the edge segmented image is used as a mask over the graph-cut segmented image boundaries; (\subref{subfig:regcircle}) circle fitted on the extracted region boundaries.
}
\label{fig:segmentation_levels}
\end{figure*}
\subsection{Layer II: Segmentation and shape extraction}
\label{subsec:layer2}
This processing layer consists of
three separate sub-layers: edge-based segmentation (involving basic morphological
operations such as smoothing, adaptive thresholding and edge detection), graph-cut
segmentation, and shape fitting.
The flow of the segmentation process for a typical fixation window containing a scallop is illustrated in Figure~\ref{fig:segmentation_levels}.
Figure~\ref{subfig:fixationwindow} shows a fixation window.
Edge-based segmentation on this window yields the edge segmented image of Figure~\ref{subfig:edgeimg}.
At the same time, a graph-cut segmentation process \cite{shi} is applied to the
fixation window to decompose it into 10 separate regions
as seen in Figure~\ref{subfig:graphimg}. The boundaries of these
segments are matched with the edges in the edge segmented image.
This leads to further filtering of the edges, and eventually to the region boundaries
on Figure~\ref{subfig:regboundry}.
This is followed by fitting a circle to
each of the contours in the filtered region boundaries (Figure~\ref{subfig:regboundry}).
Only circles with dimensions close to those of a scallop (diameter $20$--$70$ pixels)
are retained (Figure~\ref{subfig:regcircle}), which in turn helps in rejecting
round objects that are not scallops.
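The interplay of the sub-layers can be sketched in Python with OpenCV as follows. The graph-cut label image is assumed to be produced by a separate normalized-cut routine (not shown), and \texttt{cv2.minEnclosingCircle} is used here only as a stand-in for the circle fit described next; the thresholds and kernel sizes are illustrative assumptions.
\begin{verbatim}
import cv2
import numpy as np

def candidate_circles(window_bgr, graphcut_labels):
    # Edge-based segmentation of the fixation window.
    gray = cv2.GaussianBlur(cv2.cvtColor(window_bgr, cv2.COLOR_BGR2GRAY),
                            (5, 5), 0)
    edges = cv2.Canny(gray, 50, 150)
    # Boundaries of graph-cut regions: pixels whose label differs from a neighbor.
    gb = np.zeros_like(edges)
    gb[:-1, :][graphcut_labels[:-1, :] != graphcut_labels[1:, :]] = 255
    gb[:, :-1][graphcut_labels[:, :-1] != graphcut_labels[:, 1:]] = 255
    # Keep only edges that coincide with (dilated) region boundaries.
    mask = cv2.dilate(gb, np.ones((3, 3), np.uint8))
    filtered = cv2.bitwise_and(edges, mask)
    # Fit a circle to each remaining contour and keep scallop-sized ones.
    contours, _ = cv2.findContours(filtered, cv2.RETR_LIST,
                                   cv2.CHAIN_APPROX_NONE)
    keep = []
    for cont in contours:
        (x, y), r = cv2.minEnclosingCircle(cont)
        if 20 <= 2 * r <= 70:                 # scallop diameter range in pixels
            keep.append((x, y, r))
    return keep
\end{verbatim}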
The choice of the
shape to be fitted is suggested by the geometry of the scallop's shell.
Finding the circle that fits best to a given set of points is formulated as
an optimization problem \cite{taubin,chernov}.
Given a set of $n$ points on a connected contour, each with coordinates $(x_i,y_i)$ ($i\in\{1,2,\ldots,n\}$),
define a function of four parameters $A$, $B$, $C$, and $D$:
\begin{align} \label{eq:obj_fn_2}
F_2(A,B,C,D) = \frac{\sum_{i=1}^{n} [A(x_i^2+y_i^2)+Bx_i+Cy_i+D]^2}{n^{-1}\sum_{i=1}^{n} [4A^2(x_i^2+y_i^2)+4ABx_i+4ACy_i+B^2+C^2]} \enspace.
\end{align}
It is shown \cite{taubin} that minimizing \eqref{eq:obj_fn_2} over these parameters yields the circle that fits best around the contour.
The center $(a,b)$ and the radius of this best-fit circle are given as a function of
the parameters as follows:
%
\begin{align} \label{fit-parameters}
a&=-\frac{B}{2A}\enspace,& b&=-\frac{C}{2A} \enspace, &
R&=\sqrt{\frac{B^2+C^2-4AD}{4A^2}} \enspace.
\end{align}
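A compact numerical sketch of this fit is given below: the ratio \eqref{eq:obj_fn_2} is a generalized Rayleigh quotient in $\theta=(A,B,C,D)$, so its minimizer can be obtained as a generalized eigenvector of the pair of matrices formed from the data. This is one of several equivalent ways of solving the Taubin fit \cite{taubin,chernov}, and the implementation used in this work may differ.
\begin{verbatim}
import numpy as np
from scipy.linalg import eig

def taubin_circle_fit(pts):
    # pts: (n, 2) array of contour points (x_i, y_i).
    x, y = pts[:, 0].astype(float), pts[:, 1].astype(float)
    z = x**2 + y**2
    n = len(x)
    # Numerator of F_2: theta^T M theta, with rows w_i = (z_i, x_i, y_i, 1).
    W = np.column_stack([z, x, y, np.ones(n)])
    M = W.T @ W
    # Denominator of F_2: theta^T N theta (averaged over the points).
    N = np.zeros((4, 4))
    N[0, 0] = 4.0 * z.sum()
    N[0, 1] = N[1, 0] = 2.0 * x.sum()
    N[0, 2] = N[2, 0] = 2.0 * y.sum()
    N[1, 1] = N[2, 2] = float(n)
    N /= n
    # Minimize the Rayleigh quotient: smallest finite non-negative generalized
    # eigenvalue of (M, N); its eigenvector gives (A, B, C, D).
    vals, vecs = eig(M, N)
    vals = np.real(vals)
    ok = np.isfinite(vals) & (vals > -1e-9)
    A, B, C, D = np.real(vecs[:, np.argmin(np.where(ok, vals, np.inf))])
    a, b = -B / (2.0 * A), -C / (2.0 * A)
    R = np.sqrt((B**2 + C**2 - 4.0 * A * D) / (4.0 * A**2))
    return a, b, R
\end{verbatim}
The fitted triple $(a,b,R)$ is then compared against the annotated $(a_g,b_g,R_g)$ using the error measures defined next.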
For all annotated scallops in the testing image dataset, the quality of the fit is quantified by means of two scalar measures:
the center error $e_c$, and the percent radius error $e_r$.
An annotated scallop would be associated with a triple
$(a_g, b_g,R_g)$---the coordinates of its center $(a_g,b_g)$ and its radius $R_g$.
Using the parameters of the fit in \eqref{fit-parameters}, the error measures are evaluated as follows, and are required to be below the thresholds specified on the right hand side in order for the scallop to be considered detected.
%
\begin{align*} %\label{eq:seg_center}
e_c &= \sqrt{(a_g-a)^2 + (b_g-b)^2}\leq12 \enspace \text{(pixels) } &
e_r &= \frac{| R_g - R |}{R_g} \leq0.3 \enspace.
\end{align*}
%
These thresholds were
set empirically, taking into account that radius measurements
in manual counts used as ground truth \cite{walker} have a measurement error of 5--10\%.
\subsection{Layer III: Classification} \label{subsec:layer3}
The binary classification problem solved in this layer consists of identifying specific features in the images which mark the presence of scallops.
These images are obtained using a
camera at the nose of the \gls{auv}, with illumination provided by a strobe light close to its tail
(mounted to the hull of the control module at an oblique angle to the camera).
Our hypothesis is that, due to this camera-light configuration, scallops appear in the images with
a bright crescent at the lower part of their perimeter and a dark crescent at the top---a shadow.
Though crescents appear in images of most scallops, their prominence and relative position with respect to the scallop vary considerably.
The hypothesis regarding the origin of the light artifacts implies that the approximate profile and orientation of the crescents is a function of their location in the image.
\subsubsection{Scallop Profile Hypothesis} \label{subsubsec:scallop_profile_hypothesis}
A statistical analysis was performed on a dataset of $3\,706$ manually labeled scallops
(each scallop is represented as $(a,b,R)$
where $a,b$ are the horizontal and vertical coordinates of the scallop center,
and $R$ is its radius).
For this analysis, square windows of length $2.8\times R$
centered on $(a,b)$ were used to crop out regions from the images containing scallops.\footnote{Using a slightly
larger window size ($>2\times R$, the size of the scallop) includes
a neighborhood of pixels just outside the scallop which is where
crescents are expected.
This also improves the performance of local contrast enhancement, leading to better edge detection.}
Each cropped region was filtered in grayscale, contrast stretched, and then normalized
by resizing to an $11 \times 11$ grid, i.e., $121$ bins.
To show the positional dependence of the scallop profiles,
the image plane is discretized into $48$ regions ($6\times8$ grid).
Scallops whose centers lie within each grid square are segregated.
The mean (Figure~\ref{subfig:mean_quadrant}) and standard deviation (Figure~\ref{subfig:stddev_quadrant})
of the $11 \times 11$ scallop profiles of all scallops per grid square
over the whole dataset of $3\,706$ images were recorded.
The lower standard deviation found in the intensity maps of the crescents on the side of the scallop facing away from the camera reveals that these artifacts are more consistent as markers compared to the ones closer to the lens.
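The statistics above can be reproduced with a short sketch of the following form. The grid-cell assignment, the max-min contrast stretch and the use of \texttt{cv2.INTER\_AREA} for the $11\times11$ resampling are illustrative choices; the annotation format \texttt{(a, b, R)} follows the description above.
\begin{verbatim}
import numpy as np
import cv2

def profile_statistics(images, annotations, grid=(6, 8), img_shape=(600, 800)):
    # annotations[i]: list of (a, b, R) scallop labels for images[i].
    rows, cols = grid
    cell_h, cell_w = img_shape[0] / rows, img_shape[1] / cols
    profiles = [[[] for _ in range(cols)] for _ in range(rows)]
    for img, anns in zip(images, annotations):
        gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
        for (a, b, R) in anns:
            half = int(1.4 * R)               # 2.8R-wide square window
            y0, y1 = max(0, int(b) - half), min(gray.shape[0], int(b) + half)
            x0, x1 = max(0, int(a) - half), min(gray.shape[1], int(a) + half)
            crop = gray[y0:y1, x0:x1].astype(np.float32)
            crop = (crop - crop.min()) / max(crop.max() - crop.min(), 1.0)
            prof = cv2.resize(crop, (11, 11), interpolation=cv2.INTER_AREA)
            r = min(rows - 1, int(b // cell_h))
            c = min(cols - 1, int(a // cell_w))
            profiles[r][c].append(prof)
    mean_maps = [[np.mean(p, axis=0) if p else None for p in row]
                 for row in profiles]
    std_maps = [[np.std(p, axis=0) if p else None for p in row]
                for row in profiles]
    return mean_maps, std_maps
\end{verbatim}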
\begin{figure*}
\centering
\begin{subfigure}[]{0.45\textwidth}
\includegraphics[width=\textwidth]{mean_scallop_quadrants}
\caption{}
\label{subfig:mean_quadrant}
\end{subfigure}
\begin{subfigure}[]{0.45\textwidth}
\includegraphics[width=\textwidth]{stddev_scallop_quadrants}
\caption{}
\label{subfig:stddev_quadrant}
\end{subfigure}
\caption[Mean and standard deviation map of scallops in each quadrant]{(\subref{subfig:mean_quadrant}) Mean map of scallops in each quadrant; (\subref{subfig:stddev_quadrant}) Standard deviation map of scallops in each quadrant.
Red corresponds to higher numeric values and blue corresponds to lower numeric values.}
\label{fig:mean_stddev_quadrant}
\end{figure*}
\subsubsection{Scallop Profile Learning} \label{subsubsec:scallop_profile_learning}
The statistics of the dataset of $3\,706$ images used to produce Figure~\ref{fig:mean_stddev_quadrant} form a look-up table that represents the reference scallop profile (mean and standard deviation maps)
as a function of the scallop center pixel location.
To obtain the reference profile for a pixel location,
the statistics from all the scallops whose centers lie inside a $40\times40$ window centered on that pixel are used.
This look-up table can be compressed: not all of the 121 bins ($11\times11$) within each map are equally informative, because bins close to the boundary are more likely to include a significant number of background pixels.
For this reason, a circular mask with a radius covering 4 bins is applied to each map (Figure~\ref{fig:scallop_learning_mask}), thus reducing the number of bins that are candidates as features for identification to $61$.
Out of these $61$ bins, $15$ additional bins having the highest standard deviation are ignored, leading to a final set of $46$ bins.
The values in the selected $46$ bins of the mean map form a $46$-dimensional feature vector associated with that region. The corresponding $46$ bins of the standard deviation map are also recorded, and are used to weight the features
(as seen later in \eqref{distance}).
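This bin-selection step can be sketched as follows; the sketch is illustrative, and the exact discretization of the circular mask that yields the $61$ in-mask bins is an assumption here. The routine simply keeps the in-mask bins and drops the $15$ with the highest standard deviation.
\begin{verbatim}
# Sketch (illustrative): select feature bins from an 11x11 mean map and
# the matching standard-deviation map. The mask radius below is an
# assumption; the discretization used in the text yields 61 in-mask bins.
import numpy as np

def select_feature_bins(mean_map, std_map, mask_radius=4.0, n_drop=15):
    yy, xx = np.mgrid[0:11, 0:11]
    inside = (xx - 5) ** 2 + (yy - 5) ** 2 <= mask_radius ** 2
    idx = np.flatnonzero(inside)                 # bins inside the mask
    order = np.argsort(std_map.ravel()[idx])     # ascending std. dev.
    keep = idx[order[:len(idx) - n_drop]]        # drop the noisiest bins
    return mean_map.ravel()[keep], std_map.ravel()[keep], keep
\end{verbatim}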
\begin{figure}
\centering
\begin{subfigure}[]{0.3\textwidth}
\includegraphics[width=\textwidth]{curr_scallop_mean}
\caption{}
\label{subfig:mean_scallop}
\end{subfigure}
\begin{subfigure}[]{0.3\textwidth}
\includegraphics[width=\textwidth]{curr_scallop_stddev}
\caption{}
\label{subfig:stddev_scallop}
\end{subfigure}
\begin{subfigure}[]{0.3\textwidth}
\includegraphics[width=\textwidth]{mask_centered}
\caption{}
\label{subfig:mask_scallop}
\end{subfigure}
\caption[Illustration of scallop profile hypothesis]{Intensity statistics and mask for a region centered at image coordinates $(470,63)$: (\subref{subfig:mean_scallop}) Map of mean intensity; (\subref{subfig:stddev_scallop}) Map of intensity standard deviation;
(\subref{subfig:mask_scallop}) Mask applied to remove background points.}
\label{fig:scallop_learning_mask}
\end{figure}
\subsubsection{Scallop Template Matching} \label{subsubsec:scallop_template_matching}
With this look-up table that encodes the reference scallop profile for every scallop center pixel location,
the resemblance of any segmented object to a scallop can now be assessed.
The metric used for this comparison is a weighted distance between the feature vector of the region corresponding to the segmented object and the reference feature vector retrieved from the look-up table according to the location of the object in the image being processed.
If this distance is below a threshold $D_\mathsf{thresh}$, the object is classified
as a scallop.
Technically, let $X^o=(X^o_1,X^o_2, \ldots,X^o_{46})$ denote the feature vector computed for the segmented object,
and $X^s=(X^s_1,\ldots,X^s_{46})$ the reference
feature vector.
Every component of the $X^s$ vector is a reference mean intensity value for a particular bin, and is associated with a standard deviation $\sigma_k$ from the reference standard deviation map.
%(Section~\ref{subsubsec:scallop_profile_hypothesis}).
To compute the distance metric, first normalize $X^o$ to produce vector $X^{\bar{o}}$ with components
%
\[
X^{\bar{o}}_p = \min_{k} X^s_k + \left(\frac{\max \limits_{k} X^s_k-\min \limits_{k} X^s_k}{\max \limits_{k} X^o_k-\min \limits _{k} X^o_k}
\right)\left[ X^o_p-\min_{k} X^o_k \right] \;\text{for } p=1,\ldots,46\enspace,
\]
%
and then evaluate the distance metric $D_{t}$ quantifying the dissimilarity between the normalized object vector $X^{\bar{o}}$ and the reference feature vector $X^s$ as
%
\begin{equation}\label{distance}
D_{t} = \sqrt{\sum_{k=1}^{46}\frac{\left(X^{\bar{o}}_{k}-X^s_k\right)^2}{\sigma_k}} \enspace .
\end{equation}
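A compact sketch of this normalization and of the weighted distance \eqref{distance}, assuming 46-element NumPy vectors, is given below; the small constant added to the denominator guards against degenerate (constant) patches and is not part of the formula above.
\begin{verbatim}
# Sketch (illustrative) of the range normalization and the weighted
# distance D_t; eps is an added safeguard, not part of the formula.
import numpy as np

def weighted_distance(X_o, X_s, sigma, eps=1e-9):
    scale = (X_s.max() - X_s.min()) / (X_o.max() - X_o.min() + eps)
    X_o_bar = X_s.min() + scale * (X_o - X_o.min())   # normalized object
    return np.sqrt(np.sum((X_o_bar - X_s) ** 2 / sigma))
\end{verbatim}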
%
%
\begin{figure}
\centering
\includegraphics[width=0.6\textwidth]{mask_offset}
\caption[Template matching masks]{Nine different masks, slightly offset from the center, used to make the classification layer
robust to errors in segmentation.}
\label{fig:scallop_masks}
\end{figure}
%
Small variations in segmentation can produce notable deviations in the computed distance metric \eqref{distance}.
To alleviate this effect, the mask of Figure~\ref{subfig:mask_scallop} was slightly shifted in different directions and the best match in terms of the distance was identified.
This process enhanced the robustness of the classification layer with respect to small segmentation errors.
Specifically, nine slightly shifted masks were used (shown in
Figure~\ref{fig:scallop_masks}). Out of the nine resulting
distance metrics $D^{o_1}_t,\ldots,D^{o_9}_t$, the smallest,
$D_\mathsf{obj}=\min_{p\in\{1,\ldots,9\}} D^{o_p}_t$,
is found and used for classification.
If $D_\mathsf{obj}<D_\mathsf{thresh}$,
the corresponding object
is classified as a scallop.
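Building on the sketches above, this decision rule could be written as follows; shifting the crop center here stands in for the nine offset masks of Figure~\ref{fig:scallop_masks} and is an approximation of the actual procedure, not the original implementation.
\begin{verbatim}
# Sketch (illustrative), reusing profile, select_feature_bins and
# weighted_distance from the sketches above; shifting the crop center
# approximates the nine offset masks.
OFFSETS = [(dx, dy) for dx in (-1, 0, 1) for dy in (-1, 0, 1)]

def classify_object(image, obj, mean_map, std_map, D_thresh=7.0):
    a, b, R = obj
    X_s, sigma, keep = select_feature_bins(mean_map, std_map)
    D_obj = min(
        weighted_distance(profile(image, a + dx, b + dy, R).ravel()[keep],
                          X_s, sigma)
        for dx, dy in OFFSETS)
    return D_obj < D_thresh      # True: classified as a scallop
\end{verbatim}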
Based on Figures~\ref{subfig:precision_recall}--\ref{subfig:template_hist},
the threshold value was chosen at $D_\mathsf{thresh}=7$ to give a recall\footnote{\emph{Recall} is the fraction of relevant instances
identified: the fraction of ground-truth scallops that are detected;
\emph{precision} is the fraction of returned instances that are truly relevant:
the fraction of true scallops among all objects
identified as scallops.} rate of
$97\%$.
Evident in Figure~\ref{subfig:precision_recall} is the natural trade-off between increasing recall rates and keeping the number of false positives low.
%
\begin{figure*}
\centering
\begin{subfigure}{0.47\textwidth}
\includegraphics[width=\textwidth]{precision_recall}
\caption{}
\label{subfig:precision_recall}
\end{subfigure}
\begin{subfigure}{0.47\textwidth}
\includegraphics[width=\textwidth]{template_thresh_hist}
\caption{}
\label{subfig:template_hist}
\end{subfigure}
\caption[Precision-Recall curves for Classification Layer]{(\subref{subfig:precision_recall}) Precision-Recall curve with $D_\mathsf{thresh}$ shown as
a vertical line; (\subref{subfig:template_hist})
Histogram of template match of segmented scallop objects.}
\label{detection-curves}
\end{figure*}
%========================================================================================
\subsection{Layer IV: False Positives Filter}
To decrease the false positives that are produced in the classification layer, two methods are evaluated as possible candidates: a high-dimensional \gls{wctm} technique and a \gls{hog} method.
The main objective here is to find a method that retains a high percentage of the true positive scallops while eliminating as many false positives from the classification layer as possible.
\subsubsection{High-dimensional weighted correlation template matching (WCTM)}
In this method, the templates used are generated from scallop images that are \emph{not} preprocessed, i.e., images that are not median-filtered, unlike the images that were processed by the first three layers.
The intuition behind this is that although median filtering reduces speckle noise and may improve the performance of segmentation, it also weakens the edges and gradients in an image.
Avoiding median filtering helps to generate templates that are more accurate than the ones already used in the classification layer.
Based on the observation that the scallop templates are dependent on their position in the image (Figure~\ref{fig:mean_stddev_quadrant}), a new scallop template is generated for each object that is classified as a scallop in Layer III.
As indicated before, such an object is represented by a triplet $(a_o,b_o,R_o)$, where $a_o$ and $b_o$ are the Cartesian coordinates of the object's geometric center and $R_o$ is its radius.
The representative scallop template is generated from all scallops in the learning set (containing 3\,706 scallops) whose centers lie within a $40\times40$ window around the object center $(a_o,b_o)$.
Each of these scallops is extracted using a window of size $2.5R\times2.5R$, where $R$ is the scallop radius.
Since the scallops in the learning set can be of different dimensions, each extracted window is resized (scaled) to $2.5R_o\times2.5R_o$.
All these scallop instances in the learning set are finally combined through a pixel-wise mean to obtain the mean representative template.
Similarly, a standard deviation map that captures the per-pixel standard deviation across these instances is also obtained.
The templates produced here are larger than those in Layer III (recall that a Layer III template was of size $11\times11$).
The additional information captured by these larger templates makes them more accurate.
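This template-generation step can be sketched as follows; the sketch is illustrative only, assumes the learning set is available as records of raw (non-median-filtered) grayscale images with $(a,b,R)$ annotations, and omits image-boundary handling.
\begin{verbatim}
# Sketch (illustrative): build the per-object WCTM template from
# learning-set scallops near (a_o, b_o), each cropped at 2.5R and
# rescaled to 2.5*R_o; image-boundary handling is omitted.
import numpy as np
import cv2

def wctm_template(learning_set, a_o, b_o, R_o, window=40):
    side = int(2.5 * R_o)
    patches = []
    for img, a, b, R in learning_set:
        if abs(a - a_o) <= window // 2 and abs(b - b_o) <= window // 2:
            a, b, h = int(a), int(b), int(1.25 * R)
            patch = img[b - h:b + h, a - h:a + h].astype(np.float32)
            patches.append(cv2.resize(patch, (side, side)))
    patches = np.stack(patches)          # assumes at least one neighbor
    return patches.mean(axis=0), patches.std(axis=0)
\end{verbatim}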
In a fashion similar to the analysis in Layer III, the templates and object pixels first undergo normalization and mean subtraction.
Then they are compared.
Let $v=(2.5R_o)^2$ be the total number of pixels in both the template and the object, and let the new reference scallop feature (template) and the object be represented by vectors $X^t=(X^t_1,X^t_2, \ldots,X^t_{v})$ and $X^u=(X^u_1,\ldots,X^u_{v})$, respectively.
In addition, let $\sigma$ be the standard deviation vector associated with $X^t$.
The reference scallop feature vector $X^t$ is first normalized as follows:
%
\[
X^{t'}_p = \min_{k} X^u_k + \left(\frac{\max \limits_{k} X^u_k-\min \limits_{k} X^u_k}{\max \limits_{k} X^{t}_k-\min \limits _{k} X^{t}_k}
\right)\left[ X^{t}_p-\min_{k} X^{t}_k \right] \; ,
\]
%
where $p$ denotes the position of component $X^t_p$ in vector $X^t$.
Normalization is followed by mean subtraction, this time both for the template and for the object.
The resulting mean-subtracted reference scallop feature $X^{\bar{t}}$ and object $X^{\bar{u}}$ are computed as
%
\begin{align*}
X^{\bar{t}}_p &= X^{t'}_p-\frac{1}{v}\sum_{k=1}^{v}X^{t'}_k \enspace,&
X^{\bar{u}}_p &= X^{u}_p-\frac{1}{v}\sum_{k=1}^{v}X^{u}_k \enspace.
\end{align*}
%
Now the standard deviation vector is normalized:
%
\[
\bar{\sigma}_p = \frac{\sigma_p}{\sum_{k=1}^{v}\sigma_k} \enspace.
\]
%
At this point, a metric that expresses the correlation between the mean-subtracted template and the object can be computed.
This metric is weighted by the (normalized) variance of each feature.
In general, the higher the value of this metric, the better the match between the object and the template.
The \gls{wctm} similarity metric is given by
%
\[
D_\mathsf{wctm} = \sum_{k=1}^v\frac{X^{\bar{t}}_{k}X^{\bar{u}}_k}{\bar{\sigma}_{k}} \enspace .
\]
%
The threshold for the weighted correlation metric $D_\mathsf{wctm}$, used to distinguish likely true positives from false positives, is set at $0.0002222$; i.e., any object with a similarity score lower than this threshold is rejected.
This threshold value is justified by the precision-recall curve (see Figure~\ref{subfig:layer4_precision_recall}) of the weighted correlation metric values for the objects filtering down from the classification layer.
The threshold, shown by the blue line, corresponds to a 96\% recall rate, i.e., 96\% of the true positive scallops from the classification layer pass through \gls{wctm}.
At the same time, \gls{wctm} decreases the false positives by over 63\%.
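Putting the normalization, mean subtraction, and weighted correlation together, the Layer IV decision can be sketched as follows (illustrative only, with flattened NumPy vectors of equal length $v$ and the threshold quoted above; the small \texttt{eps} terms are added safeguards, not part of the formulas).
\begin{verbatim}
# Sketch (illustrative) of the WCTM score and decision; eps terms guard
# against division by zero and are not part of the formulas in the text.
import numpy as np

def wctm_score(X_t, X_u, sigma, eps=1e-9):
    # Rescale the template to the intensity range of the object.
    scale = (X_u.max() - X_u.min()) / (X_t.max() - X_t.min() + eps)
    X_t = X_u.min() + scale * (X_t - X_t.min())
    # Mean subtraction for both template and object.
    X_t_bar = X_t - X_t.mean()
    X_u_bar = X_u - X_u.mean()
    # Normalize the standard deviation vector and accumulate the
    # variance-weighted correlation.
    sigma_bar = sigma / sigma.sum()
    return np.sum(X_t_bar * X_u_bar / (sigma_bar + eps))

def passes_wctm(X_t, X_u, sigma, D_thresh=0.0002222):
    return wctm_score(X_t, X_u, sigma) >= D_thresh   # keep likely scallops
\end{verbatim}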
%
\begin{figure}
\centering
\begin{subfigure}{0.45\textwidth}
\includegraphics[width=\textwidth]{layer4_precision_recall}
\caption{}
\label{subfig:layer4_precision_recall}
\end{subfigure}
\begin{subfigure}{0.45\textwidth}
\includegraphics[width=\textwidth]{layer4_hog_precision_recall}
\caption{}
\label{subfig:hog_precision_recall}
\end{subfigure}
\caption[Precision-Recall curves for False-positive Filtering Layer]{Precision-recall curves for the Layer IV candidate methods: (a) \gls{wctm} and (b) \gls{hog}. The blue lines mark the thresholds $D_\mathsf{wctm}=0.0002222$ and $D_\mathsf{hog}=2.816$. Note that \gls{wctm} is a similarity measure while \gls{hog} is a dissimilarity measure; hence, only instances below the threshold $D_\mathsf{wctm}$ in \gls{wctm}, and likewise instances above the threshold $D_\mathsf{hog}$ in \gls{hog}, are rejected as false positives.}
\end{figure}
\subsubsection{Histogram of Gradients (HOG)}
The \gls{hog} feature descriptor encodes an object by capturing a series of local gradients in a neighborhood of the object pixels.
These gradients are then transformed into a histogram after discretization and normalization.
There are several variants of \gls{hog} feature descriptors.
The \textsc{R-}\gls{hog} used for human detection in \cite{dalal} was tested here as a possible Layer IV candidate.
To produce \textsc{R-}\gls{hog}, the image is first tiled into a series of $8\times8$ pixel groups referred to here as cells (the image dimensions need to be multiples of 8).