Article | Open Access

Bilateral Reference for High-Resolution Dichotomous Image Segmentation

Peng Zheng 1,2, Dehong Gao 3, Deng-Ping Fan 1 (corresponding author), Li Liu 4, Jorma Laaksonen 5, Wanli Ouyang 2, Nicu Sebe 6

1 College of Computer Science, Nankai University, Tianjin 300350, China
2 Shanghai AI Laboratory, Shanghai 200232, China
3 School of Cybersecurity, Northwestern Polytechnical University, Xi’an 710072, China
4 College of Electronic Science and Technology, National University of Defense Technology, Changsha 410073, China
5 Department of Computer Science, Aalto University, Espoo FI-02150, Finland
6 Department of Information Engineering and Computer Science, University of Trento, Trento I-38122, Italy

Abstract

We introduce a novel bilateral reference framework (BiRefNet) for high-resolution dichotomous image segmentation (DIS). It comprises two essential components: the localization module (LM) and the reconstruction module (RM) with our proposed bilateral reference (BiRef). The LM aids object localization using global semantic information. Within the RM, we use BiRef for the reconstruction process, where hierarchical patches of the image provide the source reference and gradient maps serve as the target reference. These components collaborate to generate the final predicted maps. We also introduce auxiliary gradient supervision to enhance the focus on regions with finer details. In addition, we outline practical training strategies tailored for DIS that improve both map quality and the training process. To validate the general applicability of our approach, we conduct extensive experiments on four tasks, demonstrating that BiRefNet achieves remarkable performance and outperforms task-specific cutting-edge methods across all benchmarks. Our code is publicly available at https://github.com/ZhengPeng7/BiRefNet.
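To make the bilateral reference concrete, the following is a minimal PyTorch sketch of one reconstruction stage that fuses decoder features with a resized view of the input image (source reference) and an image gradient map (target reference). The module names, channel counts, and the Sobel-based gradient here are illustrative assumptions, not the authors' released implementation; see the repository above for the actual code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def gradient_map(image: torch.Tensor) -> torch.Tensor:
    """Sobel-style gradient magnitude used as the target reference (assumed)."""
    gray = image.mean(dim=1, keepdim=True)  # (B, 1, H, W)
    kx = torch.tensor([[-1., 0., 1.],
                       [-2., 0., 2.],
                       [-1., 0., 1.]], device=image.device).view(1, 1, 3, 3)
    ky = kx.transpose(2, 3)
    gx = F.conv2d(gray, kx, padding=1)
    gy = F.conv2d(gray, ky, padding=1)
    return torch.sqrt(gx ** 2 + gy ** 2 + 1e-6)

class BiRefBlock(nn.Module):
    """One reconstruction stage fusing decoder features with both references."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        # +3 channels for the resized image (source reference),
        # +1 channel for its gradient map (target reference).
        self.fuse = nn.Sequential(
            nn.Conv2d(in_ch + 3 + 1, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, feat: torch.Tensor, image: torch.Tensor) -> torch.Tensor:
        h, w = feat.shape[-2:]
        src_ref = F.interpolate(image, size=(h, w), mode="bilinear",
                                align_corners=False)  # source reference
        tgt_ref = gradient_map(src_ref)               # target reference
        return self.fuse(torch.cat([feat, src_ref, tgt_ref], dim=1))

# Usage: fuse 64-channel decoder features with references from the input image.
block = BiRefBlock(in_ch=64, out_ch=64)
image = torch.randn(1, 3, 1024, 1024)
feat = torch.randn(1, 64, 256, 256)
out = block(feat, image)  # -> (1, 64, 256, 256)
```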

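The auxiliary gradient supervision mentioned in the abstract can be sketched in the same spirit: alongside the usual mask loss, the spatial gradients of the prediction are matched to those of the ground truth, which puts extra weight on fine-detail regions. The finite-difference gradients and the weight lam below are illustrative assumptions rather than the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def spatial_gradients(x: torch.Tensor):
    """First-order finite differences along height and width."""
    dh = x[..., 1:, :] - x[..., :-1, :]
    dw = x[..., :, 1:] - x[..., :, :-1]
    return dh, dw

def segmentation_loss(logits: torch.Tensor, gt: torch.Tensor,
                      lam: float = 1.0) -> torch.Tensor:
    pred = torch.sigmoid(logits)
    mask_loss = F.binary_cross_entropy_with_logits(logits, gt)
    # Auxiliary gradient supervision: penalize mismatched edges.
    (pdh, pdw), (gdh, gdw) = spatial_gradients(pred), spatial_gradients(gt)
    grad_loss = F.l1_loss(pdh, gdh) + F.l1_loss(pdw, gdw)
    return mask_loss + lam * grad_loss

# Example: a 1-channel prediction and a binary mask at 1024x1024.
logits = torch.randn(1, 1, 1024, 1024)
gt = (torch.rand(1, 1, 1024, 1024) > 0.5).float()
loss = segmentation_loss(logits, gt)
```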
CAAI Artificial Intelligence Research
Article number: 9150038
Cite this article:
Zheng P, Gao D, Fan D-P, et al. Bilateral Reference for High-Resolution Dichotomous Image Segmentation. CAAI Artificial Intelligence Research, 2024, 3: 9150038. https://doi.org/10.26599/AIR.2024.9150038

Received: 19 April 2024
Revised: 09 July 2024
Accepted: 23 July 2024
Published: 22 August 2024
© The author(s) 2024.

The articles published in this open access journal are distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/).
