Webly Supervised Semantic Segmentation: Supplementary Material



1. More comparison with other methods

For completeness, Table 1 compares our method against semantic segmentation algorithms that require human annotations beyond image tags. TransferNet[1] requires pixel-wise annotations for classes other than the classes of interest, so it can be viewed as a hybrid between fully supervised and weakly supervised methods. MIL-bb[2], MIL-seg[2], N_B[3] and Aug_MCG[4] rely on the BING[5] or MCG[6] algorithms, which are trained with human-annotated bounding boxes or pixel-wise masks. In this sense, they indirectly require additional human annotations.

Our method uses only 8K web images downloaded according to their image tags; no further human annotation or interaction is required. It nevertheless outperforms several methods that require stronger supervision, such as [2, 3, 11, 12, 15], and even two fully supervised methods[7, 8].



Table 1: Performance on the PASCAL VOC 2012 validation (val) and test sets of methods that require human annotations beyond image tags.
Methods          val    test   Annotations
O2P[7]           -      47.8   pixel-wise annotations (fully supervised)
SDS[8]           -      51.6   pixel-wise annotations (fully supervised)
FCN-8s[9]        -      62.2   pixel-wise annotations (fully supervised)
Deeplab[10]      -      70.3   pixel-wise annotations (fully supervised)
TransferNet[1]   52.1   51.2   pixel-wise annotations, other classes
MIL-bb[2]        37.8   37.0   BING[5]
MIL-seg[2]       42.0   40.6   MCG[6]
N_B[3]           41.9   43.2   MCG[6]
Aug_MCG[4]       54.3   55.5   MCG[6]
CCCN-size[11]    42.4   45.1   size indication
Point[12]        46.1   -      points
ScribbleSup[13]  63.1   -      scribbles
BoxSup[14]       62.0   64.6   boxes
CheckMask[15]    51.5   52.9   click of selection
Ours             53.4   55.3   image tags
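
The comparison stated in Section 1 can be checked directly against the numbers in Table 1. The following is a minimal sketch rather than part of the method: the dictionary layout and the function name `outperformed_by_ours` are illustrative assumptions, and a method counts as outperformed only if our score is higher on every split for which that method reports a result.

```python
# Scores copied from Table 1 as (val, test); None marks an unreported entry.
# The data layout and the function name are illustrative assumptions.
TABLE1 = {
    "O2P[7]": (None, 47.8),
    "SDS[8]": (None, 51.6),
    "FCN-8s[9]": (None, 62.2),
    "Deeplab[10]": (None, 70.3),
    "TransferNet[1]": (52.1, 51.2),
    "MIL-bb[2]": (37.8, 37.0),
    "MIL-seg[2]": (42.0, 40.6),
    "N_B[3]": (41.9, 43.2),
    "Aug_MCG[4]": (54.3, 55.5),
    "CCCN-size[11]": (42.4, 45.1),
    "Point[12]": (46.1, None),
    "ScribbleSup[13]": (63.1, None),
    "BoxSup[14]": (62.0, 64.6),
    "CheckMask[15]": (51.5, 52.9),
}
OURS = (53.4, 55.3)


def outperformed_by_ours(table=TABLE1, ours=OURS):
    """Return the methods our scores exceed on every split they report."""
    beaten = []
    for name, scores in table.items():
        # Compare only on splits where the other method reports a number.
        reported = [(o, s) for o, s in zip(ours, scores) if s is not None]
        if reported and all(o > s for o, s in reported):
            beaten.append(name)
    return beaten


if __name__ == "__main__":
    for name in outperformed_by_ours():
        print(name)
```

Under this criterion the list contains the methods cited in Section 1 ([2, 3, 11, 12, 15] and the fully supervised [7, 8]) together with TransferNet[1], while Aug_MCG[4], ScribbleSup[13], BoxSup[14], FCN-8s[9] and Deeplab[10] remain ahead.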



2. More results for semantic segmentation, in addition to Figure 5 in the paper

[Figure: qualitative results. Columns, left to right: Images, WebS-i, WebS-i1, WebS-i2, Groundtruth.]



3. More results for object segmentation, in addition to Figure 6 in the paper

[Figure: qualitative results. Columns, left to right: Images, [16], [17], [18], SNN-i2, Groundtruth.]


References:

[1] S. Hong, J. Oh, B. Han, and H. Lee, "Learning transferrable knowledge for semantic segmentation with deep convolutional neural network," in IEEE International Conference on Computer Vision, 2016.
[2] P. O. Pinheiro and R. Collobert, "From image-level to pixel-level labeling with convolutional networks," in IEEE Conference on Computer Vision and Pattern Recognition, pp. 1713-1721, 2015.
[3] Y. Wei, X. Liang, Y. Chen, Z. Jie, Y. Xiao, Y. Zhao, and S. Yan, "Learning to segment with image-level annotations," Pattern Recognition, 2016.
[4] X. Qi, Z. Liu, J. Shi, H. Zhao, and J. Jia, "Augmented feedback in semantic segmentation under image level supervision," in European Conference on Computer Vision, pp. 90-105, 2016.
[5] M. Cheng, Z. Zhang, W. Lin, and P. Torr, "Bing: Binarized normed gradients for objectness estimation at 300fps," in IEEE Conference on Computer Vision and Pattern Recognition, pp. 3286-3293, 2014.
[6] P. Arbeláez, J. Pont-Tuset, J. T. Barron, F. Marques, and J. Malik, "Multiscale combinatorial grouping," in IEEE Conference on Computer Vision and Pattern Recognition, pp. 328-335, 2014.
[7] J. Carreira, R. Caseiro, J. Batista, and C. Sminchisescu, "Semantic segmentation with second-order pooling," in European Conference on Computer Vision, pp. 430-443, Springer, 2012.
[8] B. Hariharan, P. Arbeláez, R. Girshick, and J. Malik, "Simultaneous detection and segmentation," in European Conference on Computer Vision, pp. 297-312, Springer, 2014.
[9] J. Long, E. Shelhamer, and T. Darrell, "Fully convolutional networks for semantic segmentation," in IEEE Conference on Computer Vision and Pattern Recognition, pp. 3431-3440, 2015.
[10] L. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, "Semantic image segmentation with deep convolutional nets and fully connected crfs," in International Conference on Learning Representations, 2015.
[11] D. Pathak, P. Krahenbuhl, and T. Darrell, "Constrained convolutional neural networks for weakly supervised segmentation," in IEEE International Conference on Computer Vision, pp. 1796-1804, 2015.
[12] A. Bearman, O. Russakovsky, V. Ferrari, and L. Fei-Fei, "What's the point: Semantic segmentation with point supervision," in European Conference on Computer Vision, 2016.
[13] D. Lin, J. Dai, J. Jia, K. He, and J. Sun, "Scribblesup: Scribble-supervised convolutional networks for semantic segmentation," in IEEE Conference on Computer Vision and Pattern Recognition, 2016.
[14] J. Dai, K. He, and J. Sun, "Boxsup: Exploiting bounding boxes to supervise convolutional networks for semantic segmentation," in IEEE International Conference on Computer Vision, pp. 1635-1643, 2015.
[15] F. Saleh, M. S. A. Akbarian, M. Salzmann, L. Petersson, S. Gould, and J. M. Alvarez, "Built-in foreground/background prior for weakly-supervised semantic segmentation," in European Conference on Computer Vision, pp. 413-432, 2016.
[16] A. Joulin, F. Bach, and J. Ponce, "Discriminative clustering for image co-segmentation," in IEEE Conference on Computer Vision and Pattern Recognition, pp. 1943-1950, 2010.
[17] A. Joulin, F. Bach, and J. Ponce, "Multi-class cosegmentation," in IEEE Conference on Computer Vision and Pattern Recognition, pp. 542-549, 2012.
[18] M. Rubinstein, A. Joulin, J. Kopf, and C. Liu, "Unsupervised joint object discovery and segmentation in internet images," in IEEE Conference on Computer Vision and Pattern Recognition, pp. 1939-1946, 2013.