Open Access

Segmented Summarization and Refinement: A Pipeline for Long-Document Analysis on Social Media

Department of Computer Science, The University of Texas at Dallas, Richardson, TX 75080, USA

Abstract

Social media’s explosive growth has resulted in a massive influx of electronic documents that influence many facets of daily life. However, the enormous volume and complexity of this content make extracting valuable insights challenging. Long-document summarization emerges as a pivotal technique in this context, distilling extensive texts into concise and comprehensible summaries. This paper presents a novel three-stage pipeline for effective long-document summarization. The proposed approach combines unsupervised and supervised learning techniques, efficiently handling large document sets while requiring minimal computational resources. Our methodology introduces a unique process for forming semantic chunks through spectral dynamic segmentation, effectively reducing redundancy and repetitiveness in the summarization process. In contrast to previous methods, our approach aligns each semantic chunk with the entire summary paragraph, allowing the abstractive summarization model to process documents without truncation and to infer missing information from other chunks. To enhance summary generation, we utilize a rewrite model based on Bidirectional and Auto-Regressive Transformers (BART), rearranging and reformulating summary constructs to improve their fluency and coherence. Empirical studies conducted on long documents from the Webis-TLDR-17 dataset demonstrate that our approach significantly enhances the efficiency of abstractive summarization transformers. The contributions of this paper thus offer significant advancements in the field of long-document summarization, providing a novel and effective methodology for summarizing extensive texts in the context of social media.
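To make the three-stage design concrete, the Python sketch below segments a document into semantic chunks, summarizes each chunk abstractively, and then rewrites the concatenated partial summaries. It is a minimal illustration rather than the authors' implementation: the cosine-similarity boundary heuristic stands in for the paper's spectral dynamic segmentation, the rewrite stage reuses a generic BART checkpoint instead of the paper's fine-tuned rewrite model, and the checkpoint names (all-MiniLM-L6-v2, facebook/bart-large-cnn) are assumed placeholders.

```python
# Minimal sketch of the three-stage pipeline described in the abstract.
# NOT the authors' implementation: segmentation is approximated with a
# cosine-similarity heuristic, and both checkpoints are assumed placeholders.
import numpy as np
from sentence_transformers import SentenceTransformer
from transformers import pipeline

embedder = SentenceTransformer("all-MiniLM-L6-v2")                  # assumed
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

def semantic_chunks(sentences, threshold=0.5):
    """Stage 1: start a new chunk wherever the similarity between
    adjacent sentence embeddings drops below the threshold (a crude
    stand-in for the paper's spectral dynamic segmentation)."""
    emb = embedder.encode(sentences)
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    chunks, current = [], [sentences[0]]
    for prev, cur, sent in zip(emb, emb[1:], sentences[1:]):
        if float(prev @ cur) < threshold:           # likely topic shift
            chunks.append(" ".join(current))
            current = []
        current.append(sent)
    chunks.append(" ".join(current))
    return chunks

def summarize_long_document(sentences):
    # Stage 2: abstractive summary of each chunk; chunks are short enough
    # that the full document is never truncated at the model's input limit.
    partials = [summarizer(c, max_length=80, min_length=15)[0]["summary_text"]
                for c in semantic_chunks(sentences)]
    # Stage 3: rewrite pass over the concatenated partial summaries to
    # improve fluency and coherence (the paper uses a dedicated BART-based
    # rewrite model; a generic checkpoint is reused here for brevity).
    joined = " ".join(partials)
    return summarizer(joined, max_length=120, min_length=30)[0]["summary_text"]
```

Because each semantic chunk fits within the summarizer's input window, the full document is never truncated; the final rewrite pass is what smooths the chunk-level summaries into a single coherent paragraph.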

Journal of Social Computing
Pages 132-144
Cite this article:
Wang G, Garg P, Wu W. Segmented Summarization and Refinement: A Pipeline for Long-Document Analysis on Social Media. Journal of Social Computing, 2024, 5(2): 132-144. https://doi.org/10.23919/JSC.2024.0010


Received: 27 November 2023
Revised: 23 May 2024
Accepted: 28 May 2024
Published: 30 June 2024
© The author(s) 2024.

The articles published in this open access journal are distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/).
