Research Article | Open Access

Mindstorms in natural language-based societies of mind

Center of Excellence for Generative AI, King Abdullah University of Science and Technology, Thuwal, Saudi Arabia
Dalle Molle Institute for Artificial Intelligence Research, Lugano, Switzerland
Stanford University, California, USA
Oxford University, Oxford, UK
Harvard University, Cambridge, USA
ETH AI Center, Zurich, Switzerland
Beihang University, Beijing, China
CS & VCIP, Nankai University, Tianjin, China

* Mingchen Zhuge, Haozhe Liu, Francesco Faccio, and Dylan R. Ashley contributed equally to this work.

Abstract

Inspired by Minsky’s Society of Mind, Schmidhuber’s Learning to Think, and other more recent works, this paper proposes and advocates for the concept of natural language-based societies of mind (NLSOMs). We envision these societies as collections of multimodal neural networks, including large language models, that engage in a “mindstorm” to solve problems through a shared natural language interface. We identify and discuss key questions about the social structure, governance, and economic principles of NLSOMs, emphasizing their implications for the future of AI. Our demonstrations with NLSOMs, which feature up to 129 agents, show their effectiveness on a range of tasks, including visual question answering, image captioning, and prompt generation for text-to-image synthesis.
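
To make the shared natural language interface concrete, the short Python sketch below shows a minimal mindstorm loop. It is an illustrative sketch only, not the implementation used in the paper: the Agent class, the mindstorm function, the toy stand-in agents, and the choice of a single aggregator that states the final answer are assumptions made here for illustration. In an actual NLSOM, each member would wrap a large language model or another multimodal neural network behind the same text-in, text-out interface.

# Minimal, self-contained sketch of an NLSOM "mindstorm" loop.
# All names below (Agent, mindstorm, the toy agents) are illustrative
# assumptions, not taken from the paper.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Agent:
    name: str
    respond: Callable[[List[str]], str]  # shared transcript in, natural-language message out

def mindstorm(task: str, agents: List[Agent], rounds: int = 2) -> str:
    """Let the agents exchange natural-language messages about `task`,
    then ask the last agent (acting as aggregator) for the final answer."""
    transcript = [f"TASK: {task}"]
    for _ in range(rounds):
        for agent in agents:
            message = agent.respond(transcript)
            transcript.append(f"{agent.name}: {message}")
    # Governance choice (assumed here): one aggregator states the result.
    return agents[-1].respond(transcript + ["Please state the final answer."])

# Toy stand-ins so the sketch runs without any external model or API.
captioner = Agent("Captioner", lambda t: "A dog is catching a frisbee.")
vqa = Agent("VQA", lambda t: "The scene is outdoors, in a park.")
aggregator = Agent("Aggregator", lambda t: "Final answer: a dog catching a frisbee in a park.")

print(mindstorm("Describe what is happening in the image.", [captioner, vqa, aggregator]))

Running the sketch prints the aggregator’s answer after two rounds of message exchange; swapping the lambdas for calls to real vision and language models would turn it into a small working society.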

References

[1]
Minsky, M. Society of Mind. Simon and Schuster, 1988.
[2]

Barto, A. G.; Sutton, R. S.; Anderson, C. W. Neuronlike adaptive elements that can solve difficult learning control problems. IEEE Transactions on Systems, Man, and Cybernetics Vol. SMC-13, No. 5, 834–846, 1983.

[3]
Munro, P. A dual back-propagation scheme for scalar reward learning. In: Proceedings of the 9th Annual Conference of the Cognitive Science Society, 165–176, 1987.
[4]
Jordan, M. I. Supervised learning and systems with excess degrees of freedom. University of Massachusetts at Amherst, 1988.
[5]
Werbos, P. J. Neural networks for control and system identification. In: Proceedings of the 28th IEEE Conference on Decision and Control, 260–265, 1989.
[6]
Werbos, P. J. Backpropagation and neurocontrol: A review and prospectus. In: Proceedings of the International Joint Conference on Neural Networks, 209–216, 1989.
[7]
Robinson, T.; Fallside, F. Dynamic reinforcement driven error propagation networks with application to game playing. In: Proceedings of the Annual Meeting of the Cognitive Science Society, Vol. 11, 836–843, 1989.
[8]
Jordan, M. I.; Rumelhart, D. E. Forward models: Supervised learning with a distal teacher. In: Backpropagation. Psychology Press, 189–236, 2013.
[9]

Narendra, K. S.; Parthasarathy, K. Identification and control of dynamical systems using neural networks. IEEE Transactions on Neural Networks Vol. 1, No. 1, 4–27, 1990.

[10]

Schmidhuber, J. Deep learning in neural networks: An overview. Neural Networks Vol. 61, 85–117, 2015.

[11]
Schmidhuber, J. An on-line algorithm for dynamic reinforcement learning and planning in reactive environments. In: Proceedings of the IJCNN International Joint Conference on Neural Networks, 253–258, 1990.
[12]
Schmidhuber, J. Reinforcement learning in Markovian and non-Markovian environments. In: Proceedings of the 3rd International Conference on Neural Information Processing Systems, 500–506, 1990.
[13]

Schmidhuber, J. Developmental robotics, optimal artificial curiosity, creativity, music, and the fine arts. Connection Science Vol. 18, No. 2, 173–187, 2006.

[14]

Schmidhuber, J. Formal theory of creativity, fun, and intrinsic motivation (1990–2010). IEEE Transactions on Autonomous Mental Development Vol. 2, No. 3, 230–247, 2010.

[15]

Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A. C.; Bengio, Y. Generative adversarial nets. Communications of the ACM Vol. 63, No. 11, 139–144, 2020.

[16]

Schmidhuber, J. Generative Adversarial Networks are special cases of Artificial Curiosity (1990) and also closely related to Predictability Minimization (1991). Neural Networks Vol. 127, 58–66, 2020.

[17]
Schmidhuber, J. Learning to generate subgoals for action sequences. In: Proceedings of the Seattle International Joint Conference on Neural Networks, 453, 1991.
[18]

Solomonoff, R. J. A formal theory of inductive inference. Part Ⅰ. Information and Control Vol. 7, No. 1, 1–22, 1964.

[19]

Kolmogorov, A. N. Three approaches to the quantitative definition of information. International Journal of Computer Mathematics Vol. 2, 157–168, 1968.

[20]

Chaitin, G. J. On the length of programs for computing finite binary sequences. Journal of the ACM Vol. 13, No. 4, 547–569, 1966.

[21]

Levin, L. A. On the notion of a random sequence. Soviet Math. Dokl. Vol. 14, No. 5, 1413–1416, 1973.

[22]

Solomonoff, R. Complexity-based induction systems: Comparisons and convergence theorems. IEEE Transactions on Information Theory Vol. 24, No. 4, 422–432, 1978.

[23]

Li, M.; Vitányi, P. An Introduction to Kolmogorov Complexity and Its Applications. New York: Springer, 1997.

[24]

Schmidhuber, J. Hierarchies of generalized Kolmogorov complexities and nonenumerable universal measures computable in the limit. International Journal of Foundations of Computer Science Vol. 13, No. 4, 587–612, 2002.

[25]

Schmidhuber, J. Optimal ordered problem solver. Machine Learning Vol. 54, No. 3, 211–254, 2004.

[26]

Schmidhuber, J. Discovering neural nets with low Kolmogorov complexity and high generalization capability. Neural Networks Vol. 10, No. 5, 857–873, 1997.

[27]
Schmidhuber, J. On learning to think: Algorithmic information theory for novel combinations of reinforcement learning controllers and recurrent neural world models. arXiv preprint arXiv:1511.09249, 2015.
[28]
Schmidhuber, J. One big net for everything. arXiv preprint arXiv:1802.08864, 2018.
[29]
Mialon, G.; Dessì, R.; Lomeli, M.; Nalmpantis, C.; Pasunuru, R.; Raileanu, R.; Rozière, B.; Schick, T.; Dwivedi-Yu, J.; Celikyilmaz, A.; et al. Augmented language models: A survey. arXiv preprint arXiv:2302.07842, 2023.
[30]
Zhao, W. X.; Zhou, K.; Li, J.; Tang, T.; Wang, X.; Hou, Y.; Min, Y.; Zhang, B.; Zhang, J.; Dong, Z.; et al. A survey of large language models. arXiv preprint arXiv:2303.18223, 2023.
[31]

Liu, P.; Yuan, W.; Fu, J.; Jiang, Z.; Hayashi, H.; Neubig, G. Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. ACM Computing Surveys Vol. 55, No. 9, 1–35, 2023.

[32]
Engelbart, D. C. Augmenting human intellect: A conceptual framework. In: Augmented Education in the Global Age. Routledge, 13–29, 2023.
[33]
Hall, J. NASA Moon Survival Task: The Original Consensus Exercise. Teleometrics International, 1989.
[34]
Hewitt, C.; Bishop, P.; Steiger, R. A universal modular actor formalism for artificial intelligence. In: Proceedings of the 3rd International Joint Conference on Artificial Intelligence, 235–245, 1973.
[35]
MacQueen, D. Modules for standard ML. In: Proceedings of the ACM Symposium on LISP and Functional Programming, 198–207, 1984.
[36]
Core, M. G.; Lane, H.; Lent, M.; Gomboc, D.; Solomon, S.; Rosenberg, M. Building explainable artificial intelligence systems. In: Proceedings of the 18th Conference on Innovative Applications of Artificial Intelligence, Vol. 2, 1766–1773, 2006.
[37]
Miller, S.; Stallard, D.; Bobrow, R.; Schwartz, R. A fully statistical approach to natural language interfaces. In: Proceedings of the 34th annual meeting on Association for Computational Linguistics, 55–61, 1996.
[38]

Jordan, S. R. A natural language understander based on a freely associated learned memory net. International Journal of Computer & Information Sciences Vol. 6, 9–25, 1977.

[39]

Kloumann, I. M.; Danforth, C. M.; Harris, K. D.; Bliss, C. A.; Dodds, P. S. Positivity of the English language. PLoS One Vol. 7, No. 1, e29484, 2012.

[40]
Zeng, A.; Attarian, M.; Ichter, B.; Choromanski, K.; Wong, A.; Welker, S.; Tombari, F.; Purohit, A.; Ryoo, M.; Sindhwani, V.; et al. Socratic models: Composing zero-shot multimodal reasoning with language. arXiv preprint arXiv:2204.00598, 2022.
[41]
Wu, C.; Yin, S.; Qi, W.; Wang, X.; Tang, Z.; Duan, N. Visual ChatGPT: Talking, drawing and editing with visual foundation models. arXiv preprint arXiv:2303.04671, 2023.
[42]
Shen, Y.; Song, K.; Tan, X.; Li, D.; Lu, W.; Zhuang, Y. HuggingGPT: Solving AI tasks with ChatGPT and its friends in Hugging Face. arXiv preprint arXiv:2303.17580, 2023.
[43]
Surís, D.; Menon, S.; Vondrick, C. ViperGPT: Visual inference via python execution for reasoning. arXiv preprint arXiv:2303.08128, 2023.
[44]
Schick, T.; Dwivedi-Yu, J.; Dessì, R.; Raileanu, R.; Lomeli, M.; Zettlemoyer, L.; Cancedda, N.; Scialom, T. Toolformer: Language models can teach themselves to use tools. arXiv preprint arXiv:2302.04761, 2023.
[45]
Qin, Y.; Liang, S.; Ye, Y.; Zhu, K.; Yan, L.; Lu, Y.; Lin, Y.; Cong, X.; Tang, X.; Qian, B.; et al. ToolLLM: Facilitating large language models to master 16000+ real-world APIs. arXiv preprint arXiv:2307.16789, 2023.
[46]
[47]
Chase, H. LangChain. 2022. Available at https://github.com/hwchase17/langchain
[48]
Liu, J. LlamaIndex. 2022. Available at https://github.com/jerryjliu/llama_index
[49]
XAgent Team. XAgent: An autonomous agent for complex task solving. 2023.
[50]
Wang, G.; Xie, Y.; Jiang, Y.; Mandlekar, A.; Xiao, C.; Zhu, Y.; Fan, L.; Anandkumar, A. Voyager: An open-ended embodied agent with large language models. arXiv preprint arXiv:2305.16291, 2023.
[51]
Li, G.; Abed Al Kader Hammoud, H.; Itani, H.; Khizbullin, D.; Ghanem, B. CAMEL: Communicative agents for “mind” exploration of large language model society. arXiv preprint arXiv:2303.17760, 2023.
[52]
Qian, C.; Cong, X.; Yang, C.; Chen, W.; Su, Y.; Xu, J.; Liu, Z.; Sun, M. Communicative agents for software development. arXiv preprint arXiv:2307.07924, 2023.
[53]
Hong, S.; Zhuge, M.; Chen, J.; Zheng, X.; Cheng, Y.; Zhang, C.; Wang, J.; Wang, Z.; Yau, S. K. S.; Lin, Z.; et al. MetaGPT: Meta programming for a multi-agent collaborative framework. arXiv preprint arXiv:2308.00352, 2023.
[54]
Zhou, W.; Jiang, Y. E.; Li, L.; Wu, J.; Wang, T.; Qiu, S.; Zhang, J.; Chen, J.; Wu, R.; Wang, S.; et al. Agents: An open-source framework for autonomous language agents. arXiv preprint arXiv:2309.07870, 2023.
[55]
Wu, Q.; Bansal, G.; Zhang, J.; Wu, Y.; Li, B.; Zhu, E.; Jiang, L.; Zhang, X.; Zhang, S.; Liu, J.; et al. AutoGen: Enabling next-gen LLM applications via multi-agent conversation. arXiv preprint arXiv:2308.08155, 2023.
[56]
Park, J. S.; O’Brien, J. C.; Cai, C. J.; Morris, M. R.; Liang, P.; Bernstein, M. S. Generative agents: Interactive simulacra of human behavior. In: Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology, Article No. 2, 1–22, 2023.
[57]
Davidson, T. R.; Veselovsky, V.; Josifoski, M.; Peyrard, M.; Bosselut, A.; Kosinski, M.; West, R. Evaluating language model agency through negotiations. arXiv preprint arXiv:2401.04536, 2024.
[58]
Chan, C. M.; Chen, W.; Su, Y.; Yu, J.; Xue, W.; Zhang, S.; Fu, J.; Liu, Z. ChatEval: Towards better LLM-based evaluators through multi-agent debate. arXiv preprint arXiv:2308.07201, 2023.
[59]
Liu, Z.; Zhang, Y.; Li, P.; Liu, Y.; Yang, D. Dynamic LLM-agent network: An LLM-agent collaboration framework with agent team optimization. arXiv preprint arXiv:2310.02170, 2023.
[60]
Khattab, O.; Singhvi, A.; Maheshwari, P.; Zhang, Z.; Santhanam, K.; Vardhamanan, S.; Haq, S.; Sharma, A.; Joshi, T. T.; Moazam, H.; et al. DSPy: Compiling declarative language model calls into self-improving pipelines. arXiv preprint arXiv:2310.03714, 2023.
[61]
Schwenk, D.; Khandelwal, A.; Clark, C.; Marino, K.; Mottaghi, R. A-OKVQA: A benchmark for visual question answering using world knowledge. In: Computer Vision – ECCV 2022. Lecture Notes in Computer Science, Vol. 13668. Avidan, S.; Brostow, G.; Cissé, M.; Farinella, G. M.; Hassner, T. Eds. Springer Cham, 146–162, 2022.
[62]
Zheng, L.; Chiang, W. L.; Sheng, Y.; Zhuang, S.; Wu, Z.; Zhuang, Y.; Lin, Z.; Li, Z.; Li, D.; Xing, E.; et al. Judging LLM-as-a-judge with MT-bench and chatbot arena. In: Proceedings of the 37th International Conference on Neural Information Processing Systems, Article No. 2020, 46595–46623, 2023.
[63]
Ouyang, L.; Wu, J.; Jiang, X.; Almeida, D.; Wainwright, C. L.; Mishkin, P.; Zhang, C.; Agarwal, S.; Slama, K.; Ray, A.; et al. Training language models to follow instructions with human feedback. In: Proceedings of the 36th International Conference on Neural Information Processing Systems, Article No. 2011, 27730–27744, 2024.
[64]
Li, J.; Li, D.; Savarese, S.; Hoi, S. BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597, 2023.
[65]
Wang, P.; Yang, A.; Men, R.; Lin, J.; Bai, S.; Li, Z.; Ma, J.; Zhou, C.; Zhou, J.; Yang, H. OFA: Unifying architectures, tasks, and modalities through a simple sequence-to-sequence learning framework. arXiv preprint arXiv:2202.03052, 2022.
[66]
Li, C.; Xu, H.; Tian, J.; Wang, W.; Yan, M.; Bi, B.; Ye, J.; Chen, H.; Xu, G.; Cao, Z.; et al. mPLUG: Effective and efficient vision-language learning by cross-modal skip-connections. arXiv preprint arXiv:2205.12005, 2022.
[67]
Lu, J.; Batra, D.; Parikh, D.; Lee, S. ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In: Proceedings of the 33rd International Conference on Neural Information Processing Systems, Article No. 2, 13–23, 2019.
[68]
Zhu, D.; Chen, J.; Haydarov, K.; Shen, X.; Zhang, W.; Elhoseiny, M. ChatGPT asks, BLIP-2 answers: Automatic questioning towards enriched visual descriptions. arXiv preprint arXiv:2303.06594, 2023.
[69]
Xu, K.; Ba, J.; Kiros, R.; Cho, K.; Courville, A.; Salakhutdinov, R.; Zemel, R.; Bengio, Y. Show, attend and tell: Neural image caption generation with visual attention. In: Proceedings of the 32nd International Conference on Machine Learning, Vol. 37, 2048–2057, 2015.
[70]
Fu, X.; Zhou, B.; Chandratreya, I.; Vondrick, C.; Roth, D. There’s a time and place for reasoning beyond the image. In: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics, 1138–1149, 2022.
[71]
Reimers, N.; Gurevych, I. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. arXiv preprint arXiv:1908.10084, 2019.
[72]
Ramesh, A.; Dhariwal, P.; Nichol, A.; Chu, C.; Chen, M. Hierarchical text-conditional image generation with CLIP latents. arXiv preprint arXiv:2204.06125, 2022.
[73]
Bai, Y.; Jones, A.; Ndousse, K.; Askell, A.; Chen, A.; DasSarma, N.; Drain, D.; Fort, S.; Ganguli, D.; Henighan, T.; et al. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862, 2022.
[74]

Wierstra, D.; Forster, A.; Peters, J.; Schmidhuber, J. Recurrent policy gradients. Logic Journal of the IGPL Vol. 18, No. 5, 620–634, 2010.

[75]
Schmidhuber, J. The neural bucket brigade. In: Connectionism in Perspective. Pfeifer, R.; Schreter, Z.; Fogelman-Soulié, F.; Steels, L. Eds. Amsterdam: Elsevier, 439–446, 1989.
[76]

Schmidhuber, J. A local learning algorithm for dynamic feedforward and recurrent networks. Connection Science Vol. 1, No. 4, 403–412, 1989.

[77]
Holland, J. H. Properties of the bucket brigade. In: Proceedings of the 1st International Conference on Genetic Algorithms, 1–7, 1985.
[78]

Wilson, S. W. ZCS: A zeroth level classifier system. Evolutionary Computation Vol. 2, No. 1, 1–18, 1994.

[79]

Baum, E. B.; Durdanovic, I. Toward a model of mind as an economy of agents. Machine Learning Vol. 35, No. 2, 155–185, 1999.

[80]
Bommarito, M. II; Katz, D. M. GPT takes the Bar Exam. arXiv preprint arXiv:2212.14402, 2022.
[81]
Brown, T.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J. D.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language models are few-shot learners. In: Proceedings of the 34th International Conference on Neural Information Processing Systems, Article No. 159, 1877–1901, 2020.
[82]

Brush, S. G. History of the Lenz–Ising model. Reviews of Modern Physics Vol. 39, No. 4, 883, 1967.

[83]
Schmidhuber, J. Annotated history of modern AI and deep learning. arXiv preprint arXiv:2212.11279, 2022.
[84]
Sen, A. Collective Choice and Social Welfare. Oliver & Boyd, 1971.
[85]

Gibbard, A. Manipulation of voting schemes: A general result. Econometrica: Journal of the Econometric Society Vol. 41, 587–601, 1973.

[86]

Satterthwaite, M. A. Strategy-proofness and Arrow’s conditions: Existence and correspondence theorems for voting procedures and social welfare functions. Journal of Economic Theory Vol. 10, No. 2, 187–217, 1975.

[87]

Von Neumann, J.; Morgenstern, O. Theory of Games and Economic Behavior. Princeton: Princeton University Press, 1947.

[88]
Huang, S.; Dong, L.; Wang, W.; Hao, Y.; Singhal, S.; Ma, S.; Lv, T.; Cui, L.; Mohammed, O. K.; Patra, B.; et al. Language is not all you need: Aligning perception with language models. arXiv preprint arXiv:2302.14045, 2023.
[89]
Wei, J.; Wang, X.; Schuurmans, D.; Bosma, M.; Chi, E.; Le, Q.; Zhou, D. Chain of thought prompting elicits reasoning in large language models. In: Proceedings of the 36th International Conference on Neural Information Processing Systems, Article No. 1800, 24824–24837, 2024.
[90]
Schmidhuber, J. First powdered flight-plane truth. 2003. Available at https://people.idsia.ch/juergen/planetruth.html
[91]
Hu, Y.; Hua, H.; Yang, Z.; Shi, W.; Smith, N. A.; Luo, J. PromptCap: Prompt-guided task-aware image captioning. arXiv preprint arXiv:2211.09699, 2022.
[92]
OpenAI. Introducing ChatGPT. 2022. Available at https://openai.com/blog/chatgpt
[93]
Wang, J.; Yang, Z.; Hu, X.; Li, L.; Lin, K.; Gan, Z.; Liu, Z.; Liu, C.; Wang, L. GIT: A generative image-to-text transformer for vision and language. arXiv preprint arXiv:2205.14100, 2022.
[94]
Devlin, J.; Chang, M. W.; Lee, K.; Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Vol. 1 (Long and Short Papers), 4171–4186, 2019.
[95]
Mokady, R.; Hertz, A.; Bermano, A. H. ClipCap: CLIP prefix for image captioning. arXiv preprint arXiv:2111.09734, 2021.
[96]
Jiang, Y.; Natarajan, V.; Chen, X.; Rohrbach, M.; Batra, D.; Parikh, D. Pythia v0.1: The winning entry to the VQA challenge 2018. arXiv preprint arXiv:1807.09956, 2018.
[97]
Tan, H.; Bansal, M. LXMERT: Learning cross-modality encoder representations from transformers. arXiv preprint arXiv:1908.07490, 2019.
[98]
Kamath, A.; Clark, C.; Gupta, T.; Kolve, E.; Hoiem, D.; Kembhavi, A. Webly supervised concept expansion for general purpose vision models. In: Computer Vision – ECCV 2022. Lecture Notes in Computer Science, Vol. 13696. Avidan, S.; Brostow, G.; Cissé, M.; Farinella, G. M.; Hassner, T. Eds. Springer Cham, 662–681, 2022.
[99]
Marino, K.; Chen, X.; Parikh, D.; Gupta, A.; Rohrbach, M. KRISP: Integrating implicit and symbolic knowledge for open-domain knowledge-based VQA. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 14106–14116, 2021.
[100]
OpenAI. GPT-4 technical report. 2023. Available at https://cdn.openai.com/papers/gpt-4.pdf
[101]
Fulford, I.; Ng, A. ChatGPT prompt engineering for developers. 2023. Available at https://www.deeplearning.ai/short-courses/chatgpt-prompt-engineering-for-developers/
[102]
Kojima, T.; Gu, S.; Reid, M.; Matsuo, Y.; Iwasawa, Y. Large language models are zero-shot reasoners. In: Proceedings of the 36th International Conference on Neural Information Processing Systems, Article No. 1613, 22199–22213, 2024.
[103]
Explosion AI. spaCy: Industrial-strength natural language processing. 2017. Available at https://spacy.io
[104]
Papineni, K.; Roukos, S.; Ward, T.; Zhu, W. J. BLEU: A method for automatic evaluation of machine translation. In: Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, 311–318, 2002.
[105]
Banerjee, S.; Lavie, A. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In: Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, 65–72, 2005.
[106]
Lin, C. Y. ROUGE: A package for automatic evaluation of summaries. In: Text Summarization Branches Out. 74–81, 2004.
[107]
Anderson, P.; Fernando, B.; Johnson, M.; Gould, S. SPICE: Semantic propositional image caption evaluation. In: Computer Vision – ECCV 2016. Lecture Notes in Computer Science, Vol. 9909. Leibe, B.; Matas, J.; Sebe, N.; Welling, M. Eds. Springer Cham, 382–398, 2016.
[108]
Kusner, M.; Sun, Y.; Kolkin, N.; Weinberger, K. From word embeddings to document distances. In: Proceedings of the 32nd International Conference on Machine Learning, Vol. 37, 957–966, 2015.
[109]
Liu, Y.; Iter, D.; Xu, Y.; Wang, S.; Xu, R.; Zhu, C. G-Eval: NLG evaluation using GPT-4 with better human alignment. arXiv preprint arXiv:2303.16634, 2023.
[110]
Luma AI Lab. Imagine 3D model. 2023. Available at https://lumalabs.ai/
[111]
Poole, B.; Jain, A.; Barron, J. T.; Mildenhall, B. DreamFusion: Text-to-3D using 2D diffusion. arXiv preprint arXiv:2209.14988, 2022.
[112]
Lin, C. H.; Gao, J.; Tang, L.; Takikawa, T.; Zeng, X.; Huang, X.; Kreis, K.; Fidler, S.; Liu, M. Y.; Lin, T. Y. Magic3D: High-resolution text-to-3D content creation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 300–309, 2023.
[113]
Radford, A.; Kim, J. W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning transferable visual models from natural language supervision. In: Proceedings of the 38th International Conference on Machine Learning, 8748–8763, 2021.
[114]
Xu, M.; Soldan, M.; Gao, J.; Liu, S.; Pérez-Rúa, J. M.; Ghanem, B. Boundary-denoising for video activity localization. arXiv preprint arXiv:2304.02934, 2023.
[115]
Grauman, K.; Westbury, A.; Byrne, E.; Chavis, Z.; Furnari, A.; Girdhar, R.; Hamburger, J.; Jiang, H.; Liu, M.; Liu, X.; et al. Ego4D: Around the world in 3,000 hours of egocentric video. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 18973–18990, 2022.
[116]
Wang, L.; Xiong, Y.; Wang, Z.; Qiao, Y.; Lin, D.; Tang, X.; Van Gool, L. Temporal segment networks: Towards good practices for deep action recognition. In: Computer Vision – ECCV 2016. Lecture Notes in Computer Science, Vol. 9912. Leibe, B.; Matas, J.; Sebe, N.; Welling, M. Eds. Springer Cham, 20–36, 2016.
[117]
Feichtenhofer, C.; Pinz, A.; Zisserman, A. Convolutional two-stream network fusion for video action recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 1933–1941, 2016.
[118]
Simonyan, K.; Zisserman, A. Two-stream convolutional networks for action recognition in videos. In: Proceedings of the 27th International Conference on Neural Information Processing Systems, Vol. 1, 568–576, 2014.
[119]
Zhang, H.; Li, F.; Liu, S.; Zhang, L.; Su, H.; Zhu, J.; Ni, L. M.; Shum, H. Y. DINO: DETR with improved DeNoising anchor boxes for end-to-end object detection. arXiv preprint arXiv:2203.03605, 2022.
[120]
Gao, J.; Sun, C.; Yang, Z.; Nevatia, R. TALL: Temporal activity localization via language query. In: Proceedings of the IEEE International Conference on Computer Vision, 5277–5285, 2017.
[121]
Hendricks, L. A.; Wang, O.; Shechtman, E.; Sivic, J.; Darrell, T.; Russell, B. Localizing moments in video with natural language. In: Proceedings of the IEEE International Conference on Computer Vision, 5804–5813, 2017.
[122]
Soldan, M.; Xu, M.; Qu, S.; Tegner, J.; Ghanem, B. VLG-net: Video-language graph matching network for video grounding. In: Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, 3217–3227, 2021.
[123]

Zhang, S.; Peng, H.; Fu, J.; Luo, J. Learning 2D temporal adjacent networks for moment localization with natural language. Proceedings of the AAAI Conference on Artificial Intelligence Vol. 34, No. 7, 12870–12877, 2020.

[124]
Escorcia, V.; Soldan, M.; Sivic, J.; Ghanem, B.; Russell, B. Finding moments in video collections using natural language. arXiv preprint arXiv:1907.12763, 2019.
[125]
Lei, J.; Berg, T. L.; Bansal, M. QVHighlights: Detecting moments and highlights in videos via natural language queries. In: Proceedings of the 35th International Conference on Neural Information Processing Systems, Article No. 906, 11846–11858, 2024.
[126]
Zeng, R.; Xu, H.; Huang, W.; Chen, P.; Tan, M.; Gan, C. Dense regression network for video grounding. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 10284–10293, 2020.
[127]
Mun, J.; Cho, M.; Han, B. Local-global video-text interactions for temporal grounding. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 10807–10816, 2020.
[128]
Chen, S.; Jiang, Y. G. Hierarchical visual-textual graph for temporal activity localization via language. In: Computer Vision – ECCV 2020. Lecture Notes in Computer Science, Vol. 12365. Vedaldi, A.; Bischof, H.; Brox, T.; Frahm, J. M. Eds. Springer Cham, 601–618, 2020.
[129]
Rodriguez-Opazo, C.; Marrese-Taylor, E.; Saleh, F. S.; Li, H.; Gould, S. Proposal-free temporal moment localization of a natural-language query in video using guided attention. In: Proceedings of the IEEE Winter Conference on Applications of Computer Vision, 2453–2462, 2020.
[130]

Li, K.; Guo, D.; Wang, M. Proposal-free video grounding with contextual pyramid network. Proceedings of the AAAI Conference on Artificial Intelligence Vol. 35, No. 3, 1902–1910, 2021.

[131]
Shou, Z.; Chan, J.; Zareian, A.; Miyazawa, K.; Chang, S. F. CDC: Convolutional-de-convolutional networks for precise temporal action localization in untrimmed videos. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 1417–1426, 2017.
[132]
Lin, T.; Zhao, X.; Su, H.; Wang, C.; Yang, M. BSN: boundary sensitive network for temporal action proposal generation. In: Computer Vision – ECCV 2018. Lecture Notes in Computer Science, Vol. 11208. Ferrari, V.; Hebert, M.; Sminchisescu, C.; Weiss, Y. Eds. Springer Cham, 3–21, 2018.
[133]
Lin, T.; Liu, X.; Li, X.; Ding, E.; Wen, S. BMN: Boundary-matching network for temporal action proposal generation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, 3888–3897, 2019.
[134]
Xu, M.; Zhao, C.; Rojas, D. S.; Thabet, A.; Ghanem, B. G-TAD: Sub-graph localization for temporal action detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 10153–10162, 2020.
[135]
Liu, N.; Wang, X.; Li, X.; Yang, Y.; Zhuang, Y. ReLER@ZJU-Alibaba submission to the Ego4D natural language queries challenge 2022. arXiv preprint arXiv:2207.00383, 2022.
[136]
Zheng, S.; Zhang, Q.; Liu, B.; Jin, Q.; Fu, J. Exploring anchor-based detection for Ego4D natural language query. arXiv preprint arXiv:2208.05375, 2022.
[137]
Mo, S.; Mu, F.; Li, Y. A simple transformer-based model for Ego4D natural language queries challenge. arXiv preprint arXiv:2211.08704, 2022.
[138]
Lin, K.; Wang, A.; Soldan, M.; Wray, M.; Yan, R.; Xu, E. Z.; Gao, D.; Tu, R. C.; Zhao, W.; Kong, W.; et al. Egocentric video-language pretraining. In: Proceedings of the 36th International Conference on Neural Information Processing Systems, Article No. 550, 7575–7586, 2024.
[139]
Hou, Z.; Zhong, W.; Ji, L.; Gao, D.; Yan, K.; Chan, W. K.; Ngo, C. W.; Shou, Z.; Duan, N. An efficient COarse-to-fiNE alignment framework @ Ego4D natural language queries challenge 2022. arXiv preprint arXiv:2211.08776, 2022.
[140]
Savva, M.; Kadian, A.; Maksymets, O.; Zhao, Y.; Wijmans, E.; Jain, B.; Straub, J.; Liu, J.; Koltun, V.; Malik, J.; et al. Habitat: A platform for embodied AI research. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, 9338–9346, 2019.
[141]
Chang, A.; Dai, A.; Funkhouser, T.; Halber, M.; Nießner, M.; Savva, M.; Song, S.; Zeng, A.; Zhang, Y. Matterport3D: Learning from RGB-D data in indoor environments. In: Proceedings of the International Conference on 3D Vision, 667–676, 2017.
[142]
Cao, C.; Zhu, H.; Choset, H.; Zhang, J. Exploring large and complex environments fast and efficiently. In: Proceedings of the IEEE International Conference on Robotics and Automation, 7781–7787, 2021.
[143]

Azpúrua, H.; Saboia, M.; Freitas, G. M.; Clark, L.; Agha-mohammadi, A. A.; Pessin, G.; Campos, M. F. M.; Macharet, D. G. A survey on the autonomous exploration of confined subterranean spaces: Perspectives from real-world and industrial robotic deployments. Robotics and Autonomous Systems Vol. 160, 104304, 2023.

[144]
Burgard, W.; Moors, M.; Fox, D.; Simmons, R.; Thrun, S. Collaborative multi-robot exploration. In: Proceedings of the IEEE International Conference on Robotics and Automation, 476–481, 2000.
[145]
Das, A.; Datta, S.; Gkioxari, G.; Lee, S.; Parikh, D.; Batra, D. Embodied question answering. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 1–10, 2018.
[146]
Kadian, A.; Truong, J.; Gokaslan, A.; Clegg, A.; Wijmans, E.; Lee, S.; Savva, M.; Chernova, S.; Batra, D. Sim2Real predictivity: Does evaluation in simulation predict real-world performance? IEEE Robotics and Automation Letters Vol. 5, No. 4, 6670–6677, 2020.
[147]
Chiu, J. ChatGPT is generating fake news stories — attributed to real journalists. I set out to separate fact from fiction. 2023. Available at https://www.thestar.com/news/canada/chatgpt-is-generating-fake-news-stories-attributed-to-real-journalists-i-set-out-to-separate/article_38d0f008-cf86-5cd3-af97-307a95b2296d.html
Computational Visual Media
Pages 29-81
Cite this article:
Zhuge M, Liu H, Faccio F, et al. Mindstorms in natural language-based societies of mind. Computational Visual Media, 2025, 11(1): 29-81. https://doi.org/10.26599/CVM.2025.9450460