From ae4e5c3e7b1c3e1f8d1cf0d86befa3fce0d7cf05 Mon Sep 17 00:00:00 2001
From: HuiTang <42053362+huitangtang@users.noreply.github.com>
Date: Mon, 15 May 2023 17:40:47 +0800
Subject: [PATCH] Create index.html

---
 docs/index.html | 469 ++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 469 insertions(+)
 create mode 100644 docs/index.html

diff --git a/docs/index.html b/docs/index.html
new file mode 100644
index 0000000..66b808a
--- /dev/null
+++ b/docs/index.html
@@ -0,0 +1,469 @@
Hui Tang 1, 2   |   Kui Jia ✉, 1

Code [GitHub]   |   Paper [arXiv]
The first row depicts the tasks of object detection and attribute classification in a closed-set setting, i.e., training and testing on the same vocabulary. The second row gives qualitative results from our proposed OvarNet, which simultaneously localizes, categorizes, and characterizes arbitrary objects in an open-vocabulary scenario. For ease of visualization, we show only one object per image; red denotes a base category/attribute, i.e., one seen in the training set, while blue denotes a novel category/attribute unseen in the training set.
In this paper, we consider the problem of simultaneously detecting objects and inferring their visual attributes in an image, even for those with no manual annotations provided at the training stage, resembling an open-vocabulary scenario. To achieve this goal, we make the following contributions: (i) we start with a naive two-stage approach for open-vocabulary object detection and attribute classification, termed CLIP-Attr, in which candidate objects are first proposed with an offline RPN and later classified for semantic category and attributes; (ii) we combine all available datasets and train with a federated strategy to finetune the CLIP model, aligning the visual representation with attributes; additionally, we investigate the efficacy of leveraging freely available online image-caption pairs under weakly supervised learning; (iii) in pursuit of efficiency, we train a Faster-RCNN-type model end-to-end with knowledge distillation, which performs class-agnostic object proposals and classification on semantic categories and attributes with classifiers generated from a text encoder; finally, (iv) we conduct extensive experiments on the VAW, MS-COCO, LSA, and OVAD datasets, and show that recognition of semantic category and attributes is complementary for visual scene understanding, i.e., jointly training object detection and attribute prediction largely outperforms existing approaches that treat the two tasks independently, demonstrating strong generalization ability to novel attributes and categories.
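As a rough illustration of the two-stage idea in (i), the sketch below scores class-agnostic box proposals against text embeddings of candidate categories and attributes with a CLIP-style model. It is a minimal, hypothetical example rather than the released implementation: the vocabulary, prompt templates, thresholds, and the use of the open_clip package are all assumptions made for illustration.

# Minimal sketch of a two-stage open-vocabulary pipeline (illustrative only, not the paper's code):
# 1) take class-agnostic box proposals (e.g., from an offline RPN),
# 2) crop each box and score it against text embeddings of categories/attributes.
import torch
import open_clip
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms("ViT-B-32", pretrained="openai")
tokenizer = open_clip.get_tokenizer("ViT-B-32")

categories = ["dog", "car", "chair"]              # hypothetical vocabulary
attributes = ["red", "furry", "wooden", "metal"]  # hypothetical vocabulary

@torch.no_grad()
def encode_texts(prompts):
    feats = model.encode_text(tokenizer(prompts))
    return feats / feats.norm(dim=-1, keepdim=True)

cat_embed = encode_texts([f"a photo of a {c}" for c in categories])
att_embed = encode_texts([f"a photo of something that is {a}" for a in attributes])

@torch.no_grad()
def classify_regions(image: Image.Image, boxes):
    """boxes: list of (x0, y0, x1, y1) proposals from an offline, class-agnostic RPN."""
    results = []
    for box in boxes:
        feat = model.encode_image(preprocess(image.crop(box)).unsqueeze(0))
        feat = feat / feat.norm(dim=-1, keepdim=True)
        cat_scores = (feat @ cat_embed.T).softmax(dim=-1)  # single category per region
        att_scores = (feat @ att_embed.T).sigmoid()        # attributes are multi-label
        results.append({
            "box": box,
            "category": categories[cat_scores.argmax().item()],
            "attributes": [a for a, s in zip(attributes, att_scores[0].tolist()) if s > 0.5],
        })
    return results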
R1: Benchmark on COCO and VAW Datasets

In the table, we compare OvarNet to other attribute prediction methods and open-vocabulary object detectors on the VAW test set and the COCO validation set. As there is no open-vocabulary attribute prediction method developed on the VAW dataset, we re-train two models on the full VAW dataset as an oracle comparison, namely SCoNE and TAP. Our best model achieves 68.52/67.62 AP across all attribute classes for the box-given and box-free settings, respectively. On COCO open-vocabulary object detection, we compare with OVR-RCNN, ViLD, RegionCLIP, PromptDet, and Detic; our best model obtains 54.10/35.17 AP for novel categories, surpassing the recent state-of-the-art ViLD-ens and Detic by a large margin, showing that attribute understanding is beneficial for open-vocabulary object recognition.
R2: Cross-dataset Transfer on OVAD Benchmark

We compare with other state-of-the-art methods on the OVAD benchmark. Following the same evaluation protocol, we conduct zero-shot cross-dataset transfer evaluation with CLIP-Attr and OvarNet trained on the COCO Caption dataset. The metric is average precision (AP) over the different attribute frequency groups: 'head', 'medium', and 'tail'. As shown in the table, our proposed models outperform the other competitors by a noticeable margin.
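For readers unfamiliar with the metric, the snippet below is a generic sketch of how per-attribute AP could be averaged within frequency groups; it is not the official OVAD evaluator, and the head/medium/tail index lists are assumed to be supplied by the benchmark.

# Generic sketch of group-wise mean AP (not the official OVAD evaluation code).
import numpy as np
from sklearn.metrics import average_precision_score

def group_mean_ap(scores, labels, groups):
    """scores, labels: arrays of shape [num_samples, num_attributes] (labels are 0/1);
    groups: dict such as {'head': [...], 'medium': [...], 'tail': [...]} of attribute indices."""
    num_attrs = labels.shape[1]
    per_attr_ap = np.full(num_attrs, np.nan)
    for j in range(num_attrs):
        if labels[:, j].any():  # AP is undefined for attributes without positives
            per_attr_ap[j] = average_precision_score(labels[:, j], scores[:, j])
    return {name: float(np.nanmean(per_attr_ap[idx])) for name, idx in groups.items()}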
R3: Evaluation on LSA Benchmark

We evaluate the proposed OvarNet on the same benchmark proposed by Pham et al. As OpenTAP employs a Transformer-based architecture with the object category and object bounding box as additional prior inputs, we evaluate two settings: one is the original OvarNet without any additional input information; the other integrates the object category embedding as an extra token into the transformer encoder layer. As shown in the table, OvarNet outperforms prompt-based CLIP by a large margin and surpasses OpenTAP (proposed in the benchmark paper) under the same scenario, i.e., with the additional category embedding introduced. 'Attribute prompt' refers to prompts with formats similar to "A photo of something that is [attribute]", while 'object-attribute prompt' denotes "A photo of [category] [attribute]". For the 'combined prompt', the outputs of the 'attribute prompt' and the 'object-attribute prompt' are combined by a weighted average.
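The three prompt variants can be sketched as follows. This is a hypothetical illustration assuming a CLIP-style text encoder; the mixing weight alpha is an arbitrary placeholder rather than the value used in the paper.

# Illustrative sketch of the 'attribute', 'object-attribute', and 'combined' prompts.
import torch
import open_clip

model, _, _ = open_clip.create_model_and_transforms("ViT-B-32", pretrained="openai")
tokenizer = open_clip.get_tokenizer("ViT-B-32")

@torch.no_grad()
def text_embed(prompts):
    feats = model.encode_text(tokenizer(prompts))
    return feats / feats.norm(dim=-1, keepdim=True)

@torch.no_grad()
def attribute_scores(region_feat, category, attributes, alpha=0.5):
    """region_feat: [1, D] L2-normalized region feature; alpha is an illustrative weight."""
    s_attr = region_feat @ text_embed(
        [f"A photo of something that is {a}" for a in attributes]).T   # attribute prompt
    s_obj_attr = region_feat @ text_embed(
        [f"A photo of {category} {a}" for a in attributes]).T          # object-attribute prompt
    return alpha * s_attr + (1 - alpha) * s_obj_attr                   # combined prompt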
In the figure below, we show qualitative results of OvarNet on the VAW and MS-COCO benchmarks. OvarNet accurately localizes, recognizes, and characterizes objects across a broad variety of novel categories and attributes.
@InProceedings{chen2023ovarnet,
  title     = {OvarNet: Towards Open-vocabulary Object Attribute Recognition},
  author    = {Chen, Keyan and Jiang, Xiaolong and Hu, Yao and Tang, Xu and Gao, Yan and Chen, Jianqi and Xie, Weidi},
  booktitle = {CVPR},
  year      = {2023}
}
Based on a template by Phillip Isola and Richard Zhang.