Open Access

Segmented Summarization and Refinement: A Pipeline for Long-Document Analysis on Social Media

Department of Computer Science, The University of Texas at Dallas, Richardson, TX 75080, USA

Abstract

Social media’s explosive growth has resulted in a massive influx of electronic documents that influence many facets of daily life. However, the enormous volume and complexity of this content make extracting valuable insights challenging. Long-document summarization emerges as a pivotal technique in this context, distilling extensive texts into concise and comprehensible summaries. This paper presents a novel three-stage pipeline for effective long-document summarization. The proposed approach combines unsupervised and supervised learning techniques, efficiently handling large document sets while requiring minimal computational resources. Our methodology introduces a unique process for forming semantic chunks through spectral dynamic segmentation, effectively reducing redundancy and repetitiveness in the summarization process. In contrast to previous methods, our approach aligns each semantic chunk with the entire summary paragraph, allowing the abstractive summarization model to process documents without truncation and to infer missing information from other chunks. To enhance summary generation, we utilize a rewrite model based on Bidirectional and Auto-Regressive Transformers (BART), rearranging and reformulating summary constructs to improve their fluency and coherence. Empirical studies conducted on long documents from the Webis-TLDR-17 dataset demonstrate that our approach significantly enhances the efficiency of abstractive summarization transformers. The contributions of this paper thus offer significant advancements in the field of long-document summarization, providing a novel and effective methodology for summarizing extensive texts in the context of social media.
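To make the three-stage design concrete, the Python sketch below segments a document into semantic chunks, summarizes each chunk abstractively, and then rewrites the concatenated partial summaries. It is a minimal illustration rather than the authors' implementation: the cosine-similarity boundary heuristic stands in for the paper's spectral dynamic segmentation, the rewrite stage reuses a generic BART checkpoint instead of the paper's fine-tuned rewrite model, and the checkpoint names (all-MiniLM-L6-v2, facebook/bart-large-cnn) are assumed placeholders.

```python
# Minimal sketch of the three-stage pipeline described in the abstract.
# NOT the authors' implementation: segmentation is approximated with a
# cosine-similarity heuristic, and both checkpoints are assumed placeholders.
import numpy as np
from sentence_transformers import SentenceTransformer
from transformers import pipeline

embedder = SentenceTransformer("all-MiniLM-L6-v2")                  # assumed
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

def semantic_chunks(sentences, threshold=0.5):
    """Stage 1: start a new chunk wherever the similarity between
    adjacent sentence embeddings drops below the threshold (a crude
    stand-in for the paper's spectral dynamic segmentation)."""
    emb = embedder.encode(sentences)
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    chunks, current = [], [sentences[0]]
    for prev, cur, sent in zip(emb, emb[1:], sentences[1:]):
        if float(prev @ cur) < threshold:           # likely topic shift
            chunks.append(" ".join(current))
            current = []
        current.append(sent)
    chunks.append(" ".join(current))
    return chunks

def summarize_long_document(sentences):
    # Stage 2: abstractive summary of each chunk; chunks are short enough
    # that the full document is never truncated at the model's input limit.
    partials = [summarizer(c, max_length=80, min_length=15)[0]["summary_text"]
                for c in semantic_chunks(sentences)]
    # Stage 3: rewrite pass over the concatenated partial summaries to
    # improve fluency and coherence (the paper uses a dedicated BART-based
    # rewrite model; a generic checkpoint is reused here for brevity).
    joined = " ".join(partials)
    return summarizer(joined, max_length=120, min_length=30)[0]["summary_text"]
```

Because each semantic chunk fits within the summarizer's input window, the full document is never truncated; the final rewrite pass is what smooths the chunk-level summaries into a single coherent paragraph.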

Journal of Social Computing
Pages 132-144
Cite this article:
Wang G, Garg P, Wu W. Segmented Summarization and Refinement: A Pipeline for Long-Document Analysis on Social Media. Journal of Social Computing, 2024, 5(2): 132-144. https://doi.org/10.23919/JSC.2024.0010


Received: 27 November 2023
Revised: 23 May 2024
Accepted: 28 May 2024
Published: 30 June 2024
© The author(s) 2024.

The articles published in this open access journal are distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/).
