Navigating the AI Frontier: Study on Language Model Innovations from Leading Experts

Mary Mulan ZHU
9 min read · Feb 15, 2024

--

Table of Contents

  • Introduction
  • History of Language Models
  • Large Language Models (LLMs)
  • Look to the Future
  • References
  • Table of Contents (Level 2)
Image generated by DALL·E

Introduction

Recently, I came across a research paper on Large Language Models (LLMs), published on February 9, 2024. The authors include Tomas Mikolov, the inventor of the word2vec algorithm, and Richard Socher, the developer of Salesforce’s artificial intelligence system Einstein.

Large Language Models: A Survey, Shervin Minaee, Tomas Mikolov, Narjes Nikzad, Meysam Chenaghlu, Richard Socher, Xavier Amatriain, Jianfeng Gao, 2024.2.9 download

Richard Socher is an expert in Natural Language Processing (NLP) and deep learning, having earned his Ph.D. from Chris Manning’s artificial intelligence lab at Stanford University. Manning and Socher are among the earliest researchers to apply deep learning techniques to natural language processing, and I have been following their lectures and research. Back in 2012, when most people were skeptical about using deep learning for natural language processing, Socher’s lecture “Deep Learning for NLP (without Magic)” was particularly meaningful.

The paper provides a historical overview, current status, and future prospects of LLMs. It introduces key technologies, software, and related applications at various stages. Below, I have extracted some particularly interesting points and discussions from this paper, along with my explanations and opinions. Except for the figure in the chapter on the “History of Language Models”, all figures in this article are courtesy of the above paper.

History of Language Models

Language models have evolved significantly, with advancements leading to the creation of large-scale, pre-trained statistical language models (LLMs). This evolution can be categorized into four distinct waves.

Language Model History

Statistical Language Models (SLMs)

  • Features: Treat text as a sequence of words and estimate the probability of each word given the preceding ones; n-gram models are the classic example (see the sketch after this list).
  • Challenges: Cannot fully capture the diversity and variability of natural language because of data sparsity, which is managed through smoothing techniques.
  • Applications: N-gram models are widely used in many NLP systems.
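
To make the n-gram idea concrete, here is a minimal sketch of a bigram model with add-one (Laplace) smoothing. The toy corpus and function names are my own illustration and are not taken from the survey.

```python
from collections import Counter

# Toy corpus to illustrate an n-gram (here: bigram) statistical language model.
corpus = "the cat sat on the mat the dog sat on the rug".split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
vocab_size = len(unigrams)

def bigram_prob(prev_word, word):
    """P(word | prev_word) with add-one (Laplace) smoothing to handle data sparsity."""
    return (bigrams[(prev_word, word)] + 1) / (unigrams[prev_word] + vocab_size)

# Probability of a short sentence under the bigram model.
sentence = ["the", "dog", "sat"]
p = 1.0
for prev, curr in zip(sentence, sentence[1:]):
    p *= bigram_prob(prev, curr)
print(p)
```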

Neural Language Models (NLMs)

  • Features: Task-specific, with each model trained for a particular task. Overcame data sparsity by mapping words to embeddings in a low-dimensional continuous vector space.
  • Applications: Made it practical to compute semantic similarity between texts, which is useful across NLP applications (e.g., queries vs. documents in Web search, sentences in different languages in machine translation) and modalities (e.g., image and text in image captioning). A small similarity example follows this list.
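
As a small illustration of how embeddings support semantic similarity, the sketch below compares toy vectors with cosine similarity. The words and four-dimensional vectors are invented for the example; real neural language models learn embeddings with hundreds of dimensions.

```python
import numpy as np

# Toy 4-dimensional embeddings; real models learn vectors of a few hundred dimensions.
embeddings = {
    "king":  np.array([0.80, 0.65, 0.10, 0.05]),
    "queen": np.array([0.78, 0.68, 0.12, 0.04]),
    "apple": np.array([0.05, 0.10, 0.90, 0.70]),
}

def cosine_similarity(u, v):
    """Semantic similarity as the cosine of the angle between two embedding vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine_similarity(embeddings["king"], embeddings["queen"]))  # high: related words
print(cosine_similarity(embeddings["king"], embeddings["apple"]))  # low: unrelated words
```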

Pre-trained Language Models (PLMs)

  • Features: Task-agnostic, with a generalizable hidden embedding space. Pre-training on large text corpora is followed by fine-tuning for specific tasks. Based on recurrent neural networks or Transformers.
  • Models: BERT (Bidirectional Encoder Representations from Transformers), RoBERTa, XLMs; GPT-1 and GPT-2, developed by OpenAI. A pre-train-then-fine-tune sketch follows this list.
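
To illustrate the pre-train-then-fine-tune pattern, here is a minimal sketch assuming the Hugging Face transformers library and PyTorch are installed; the bert-base-uncased checkpoint and the two-label setup are illustrative choices, and a real fine-tuning run would add a dataset, an optimizer, and a training loop.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Load a pre-trained BERT checkpoint and attach a fresh task-specific classification head.
checkpoint = "bert-base-uncased"  # illustrative choice of pre-trained model
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

# One forward pass; fine-tuning would repeat this over labeled batches with an optimizer.
inputs = tokenizer("Deep learning for NLP without magic", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
print(logits.shape)  # torch.Size([1, 2]): one score per class
```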

Large Language Models (LLMs)

Large Language Models have tens to hundreds of billions of parameters. They show emergent abilities such as in-context learning, instruction following, and multi-step reasoning. They can also be augmented, for example by integrating external knowledge and tools for richer interaction and continual learning.

LLMs include PaLM, LLaMA, GPT-3, and GPT-4. AI agents built on LLMs are one application area, with challenges around interacting with dynamic environments and the need for augmentation in real-world use.

Timeline

Capabilities

Model Families

There are three major families of LLMs: GPT, LLaMA, and PaLM.

Category

How to Build an LLM

Look to the Future

LLM Architecture

Transformer

Currently, the Transformer architecture is mainstream. Since AlexNet’s victory in the 2012 ImageNet image recognition competition, deep learning has gradually gained widespread recognition. Before the Transformer, the main neural architectures for natural language processing were RNNs and their variants (LSTM, GRU), along with seq2seq models.

The Transformer is based on the 2017 Google paper “Attention Is All You Need”.
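
The core operation introduced in that paper is scaled dot-product attention, Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V. Below is a minimal NumPy sketch with toy shapes, no masking, and no multiple heads.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V, as in "Attention Is All You Need"."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # (seq_len_q, seq_len_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                               # (seq_len_q, d_v)

# Toy example: 3 query positions, 4 key/value positions, dimension 8.
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, 8)), rng.normal(size=(4, 8)), rng.normal(size=(4, 8))
print(scaled_dot_product_attention(Q, K, V).shape)  # (3, 8)
```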

State Space Models (SSMs)

Currently, there is significant research on post-attention mechanisms. One important category within this field is State Space Models (SSMs). In this context, SSM often refers to the newer Structured State Space sequence model architecture, abbreviated S4. Recent models in this category include Mamba, Hyena, and StripedHyena. These models not only surpass Transformer-based models on many evaluation metrics, they also address the Transformer’s context-window limitation: attention-based models are highly inefficient for long contexts because self-attention’s cost grows quadratically with sequence length.
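
For intuition only, the sketch below runs the plain discrete state-space recurrence x_k = A x_{k-1} + B u_k, y_k = C x_k, which costs time linear in sequence length. S4 and Mamba build on this skeleton with structured transition matrices, careful discretization, and, in Mamba’s case, input-dependent (selective) parameters; none of that is shown here.

```python
import numpy as np

def ssm_scan(A, B, C, u):
    """Discrete linear state space model: x_k = A x_{k-1} + B u_k, y_k = C x_k."""
    x = np.zeros(A.shape[0])
    ys = []
    for u_k in u:                      # one step per token: O(sequence length)
        x = A @ x + B * u_k            # update the hidden state
        ys.append(C @ x)               # emit an output
    return np.array(ys)

# Toy 1-D input sequence with a 4-dimensional hidden state.
rng = np.random.default_rng(0)
A = 0.9 * np.eye(4)                    # stable, structured (here: diagonal) transition
B, C = rng.normal(size=4), rng.normal(size=4)
print(ssm_scan(A, B, C, u=rng.normal(size=16)).shape)  # (16,)
```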

Mixture of Experts (MoE)

Another significant area of research is the Mixture of Experts (MoE). MoE is an attention-compatible architectural mechanism. In Large Language Models (LLMs), MoE makes it possible to train an extremely large model of which only a fraction is activated during inference. MoE architecture models include Mixtral and GLaM, and there are rumors that GPT-4 has adopted the MoE architecture.
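
The sketch below illustrates the sparse-activation idea with a toy top-k router over small linear “experts”. The shapes, the router, and the gating details are simplified illustrations, not taken from Mixtral, GLaM, or any other specific model.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, num_experts, top_k = 16, 8, 2

# Each "expert" is a tiny linear layer; the router picks which ones to run per token.
experts = [rng.normal(scale=0.1, size=(d_model, d_model)) for _ in range(num_experts)]
router = rng.normal(scale=0.1, size=(d_model, num_experts))

def moe_layer(x):
    """Sparse MoE: route a token to its top-k experts and mix their outputs."""
    logits = x @ router                                # (num_experts,)
    top = np.argsort(logits)[-top_k:]                  # indices of the chosen experts
    gates = np.exp(logits[top]); gates /= gates.sum()  # softmax over the chosen experts
    # Only top_k of num_experts experts are evaluated: most parameters stay idle.
    return sum(g * (x @ experts[i]) for g, i in zip(gates, top))

token = rng.normal(size=d_model)
print(moe_layer(token).shape)  # (16,)
```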

MoE can also be applied to architectures other than Transformers. For example, recent research from the University of Warsaw, Poland, and others, involves applying MoE to Mamba (SSM).

Receptance Weighted Key Value (RWKV)

RWKV combines the efficient parallelizable training of Transformers with the efficient inference of RNNs.

Rethinking the Whole Transformer Architecture

Additionally, some research is rethinking the Transformer architecture as a whole. An early example is the Monarch Mixer, which proposes a new architecture built on a single sub-quadratic primitive, Monarch matrices, applied along both the sequence length and the model dimension while remaining hardware-efficient on GPUs.

LLM Applications

Solution to Limitations of LLMs

Hallucination can be mitigated through retrieval-augmented generation (RAG), advanced prompt engineering, tool use, and other augmentation techniques; more research in this area is expected.
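
As a sketch of how RAG grounds answers in retrieved text: the tiny keyword-overlap retriever and the call_llm placeholder below are assumptions standing in for a real vector store and a real model API.

```python
# Toy in-memory "document store"; a real system would use a vector database.
DOCUMENTS = [
    "Mamba is a selective state space model for long sequences.",
    "Mixtral is a sparse mixture-of-experts language model.",
    "RWKV combines parallelizable training with RNN-style inference.",
]

def retrieve(question: str, k: int = 2) -> list[str]:
    """Rank documents by naive word overlap with the question (illustration only)."""
    q = set(question.lower().split())
    ranked = sorted(DOCUMENTS, key=lambda d: len(q & set(d.lower().split())), reverse=True)
    return ranked[:k]

def call_llm(prompt: str) -> str:
    """Placeholder for a real LLM call (API or local model)."""
    raise NotImplementedError("plug in an LLM API or local model here")

def rag_answer(question: str) -> str:
    """Retrieval-augmented generation: answer only from the retrieved context."""
    context = "\n".join(retrieve(question))
    prompt = (f"Answer the question using only the context below; "
              f"say you don't know if it is not in the context.\n\n"
              f"Context:\n{context}\n\nQuestion: {question}")
    return call_llm(prompt)
```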

Replacement to Machine Learning Systems

LLM-based systems are being deployed in areas where traditional machine learning systems were used. They provide personalized interactions based on an understanding of people’s preferences and interests. Some examples:

  • Chatbots in customer service
  • Content recommendation
  • Many other applications that currently use machine learning techniques

LLM-based agents and multi-agent

Agent systems can access external tools and resources and then make decisions using an LLM’s reasoning capability. Research in this area is the closest to Artificial General Intelligence (AGI).
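
Below is a minimal sketch of such a tool-using agent loop in the spirit of ReAct-style prompting. The call_llm placeholder, the TOOL/FINAL response format, and the toy tools are all assumptions made for illustration, not a recipe from the survey.

```python
# Toy tools; a real agent would wire in search APIs, code execution, databases, etc.
TOOLS = {
    "calculator": lambda expr: str(sum(float(x) for x in expr.split("+"))),  # handles only "a+b+..." sums
    "search": lambda query: f"(stubbed search results for: {query})",
}

def call_llm(prompt: str) -> str:
    """Placeholder for a real LLM call. Assumed to reply either
    'TOOL <name>: <input>' to use a tool, or 'FINAL: <answer>' to finish."""
    raise NotImplementedError("plug in an LLM API or local model here")

def run_agent(task: str, max_steps: int = 5) -> str:
    """Reason-act loop: the LLM reads the transcript, then picks a tool or answers."""
    transcript = f"Task: {task}\n"
    for _ in range(max_steps):
        decision = call_llm(transcript)
        if decision.startswith("FINAL:"):
            return decision.removeprefix("FINAL:").strip()
        tool_name, tool_input = decision.removeprefix("TOOL").split(":", 1)
        observation = TOOLS[tool_name.strip()](tool_input.strip())
        transcript += f"{decision}\nObservation: {observation}\n"
    return "Stopped: step limit reached."
```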

References

Large Language Models

  • Large Language Models: A Survey, Shervin Minaee, Tomas Mikolov, Narjes Nikzad, Meysam Chenaghlu, Richard Socher, Xavier Amatriain, Jianfeng Gao, 2024.2.9 download | wechat

word2vec

  • Efficient Estimation of Word Representations in Vector Space, Tomas Mikolov, Kai Chen, Greg Corrado, Jeffrey Dean, 2013.1.16 download
  • Distributed Representations of Words and Phrases and their Compositionality, Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, Jeffrey Dean, 2013, Part of Advances in Neural Information Processing Systems 26 (NIPS 2013) | download
  • word2vec Parameter Learning Explained, Xin Rong, 2014.11.11 download

Salesforce AI system

  • Salesforce Announces the Next Generation of Einstein, Bringing Conversational AI Assistants to Every CRM Application and Customer Experience, September 12, 2023, Salesforce Japan Co., Ltd. website

Richard Socher Speech

  • Deep Learning for NLP (without Magic) — ACL 2012 Tutorial, Richard Socher, 2012 youtube

Transformer

  • Transformer: Attention Is All You Need, NIPS(NeurIPS) 2017, Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin, Submitted on 12 Jun 2017 (v1), last revised 2 Aug 2023 (this version, v7), Google Brain, Google Research, University of Toronto download

RWKV

  • RWKV: Reinventing RNNs for the Transformer Era, 2023.05.22 download

State Space Models (SSMs)

  • State Space Models (SSMs): Efficiently Modeling Long Sequences with Structured State Spaces, Albert Gu, Karan Goel, and Christopher Ré, Department of Computer Science, Stanford University, 2022.8.5 download
  • Hyena Hierarchy (SSM): Towards Larger Convolutional Language Models, Michael Poli, Stefano Massaroli, Eric Nguyen, Daniel Y. Fu, Tri Dao, Stephen Baccus, Yoshua Bengio, Stefano Ermon, Christopher Ré, Stanford University, Mila and Université de Montréal, 2023.4.21 download
  • Striped Hyena (SSM): M. Poli, J. Wang, S. Massaroli, J. Quesnelle, E. Nguyen, and A. Thomas, “StripedHyena: Moving Beyond Transformers with Hybrid Signal Processing Models,” 2023.12. github | together.ai
  • Mamba: Linear-Time Sequence Modeling with Selective State Spaces, Albert Gu and Tri Dao, Machine Learning Department, Carnegie Mellon University, Department of Computer Science, Princeton University, Dec 1, 2023 download | 新智元 | YC Hacker News

MoE (Mixture of Experts) Models

  • MoE (Mixture of Experts): G. J. McLachlan, S. X. Lee, and S. I. Rathnayake, University of Queensland, “Finite mixture models,” Annual review of statistics and its application, vol. 6, pp. 355–378, 2019.6. MoE is an attention-compatible architectural mechanism. download
  • Finite Mixture Models, 1st Edition, by Geoffrey J. McLachlan (Author), David Peel (Author), 2000.10.2 book
  • GLaM model (MoE): GLaM: Efficient Scaling of Language Models with Mixture-of-Experts, Nan Du, Yanping Huang, Andrew M. Dai, Simon Tong, Dmitry Lepikhin, Yuanzhong Xu, Maxim Krikun, Yanqi Zhou, Adams Wei Yu, Orhan Firat, Barret Zoph, Liam Fedus, Maarten Bosma, Zongwei Zhou, Tao Wang, Yu Emma Wang, Kellie Webster, Marie Pellat, Kevin Robinson, Kathleen Meier-Hellstern, Toju Duke, Lucas Dixon, Kun Zhang, Quoc V. Le, Yonghui Wu, Google, 2022.8.1, in International Conference on Machine Learning download
  • Mixtral (MoE): Mixtral of experts, A high quality Sparse Mixture-of-Experts. mistral.ai

MoE-Mamba Models

  • MoE-Mamba: Efficient Selective State Space Models with Mixture of Experts, Maciej Pióro, Kamil Ciebiera, Krystian Król, Jan Ludziejewski, Sebastian Jaszczur, IDEAS NCBR, Polish Academy of Sciences, University of Warsaw, 2024.1.8 download

Monarch Mixer Model

  • Monarch Mixer: A Simple Sub-Quadratic GEMM-Based Architecture, Daniel Y. Fu, Simran Arora, Jessica Grogan, Isys Johnson, Sabri Eyuboglu, Armin W. Thomas, Benjamin Spector, Michael Poli, Atri Rudra, and Christopher Ré, Stanford University, University at Buffalo, SUNY, 2023.10.18 download

RAG

  • RAG: Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks, Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis†, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, Douwe Kiela, Facebook AI Research, University College London, New York University, 2021.4.12 download
  • RAG: Retrieval-Augmented Generation for Large Language Models: A Survey, Yunfan Gao, Yun Xiong, Xinyu Gao, Kangxiang Jia, Jinliu Pan, Yuxi Bi, Yi Dai , Jiawei Sun, Qianyu Guo, Meng Wang and Haofen Wang, Tongji University, Fudan University, 2024.1.5 download

LoRA

  • LoRA: Low-Rank Adaptation of Large Language Models, Edward Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, Microsoft Corporation, 2021.10.16 download

LLM-based Agents

  • Agents: The Rise and Potential of Large Language Model Based Agents: A Survey, Zhiheng Xi, Wenxiang Chen, Xin Guo, Wei He, Yiwen Ding, Boyang Hong, Ming Zhang, Junzhe Wang, Senjie Jin, Enyu Zhou, Rui Zheng, Xiaoran Fan, Xiao Wang, Limao Xiong, Yuhao Zhou, Weiran Wang, Changhao Jiang, Yicheng Zou, Xiangyang Liu, Zhangyue Yin, Shihan Dou, Rongxiang Weng, Wensen Cheng, Qi Zhang, Wenjuan Qin, Yongyan Zheng, Xipeng Qiu, Xuanjing Huang and Tao Gui, Fudan NLP Group, 2023.9.19 download
  • Agents: A Survey on Large Language Model based Autonomous Agents, Lei Wang, Chen Ma, Xueyang Feng, Zeyu Zhang, Hao Yang, Jingsen Zhang, Zhiyuan Chen, Jiakai Tang, Xu Chen, Yankai Lin, Wayne Xin Zhao, Zhewei Wei, Ji-Rong Wen, Gaoling School of Artificial Intelligence, Renmin University of China, Beijing, China, 2023.9.7 download
  • Agents: Agent AI: Surveying the Horizons of Multimodal Interaction, Zane Durante, Qiuyuan Huang, Naoki Wake, Ran Gong, Jae Sung Park, Bidipta Sarkar, Rohan Taori, Yusuke Noda, Demetri Terzopoulos, Yejin Choi, Katsushi Ikeuchi, Hoi Vo, Li Fei-Fei, Jianfeng Gao, Stanford University; Microsoft Research, Redmond; University of California, Los Angeles; University of Washington; Microsoft Gaming, 2024.1.25 download

Table of Contents (Level 2)

Introduction

History of Language Models

  • Statistical Language Models (SLMs)
  • Neural Language Models (NLMs)
  • Pre-trained Language Models (PLMs)

Large Language Models (LLMs)

  • Timeline
  • Capabilities
  • Model Families
  • Category
  • How to Build an LLM

Look to the Future

LLM Architecture

  • Transformer
  • State Space Models (SSMs)
  • Mixture of Experts (MoE)
  • Receptance Weighted Key Value (RWKV)
  • Rethinking the Whole Transformer Architecture

LLM Applications

  • Solution to Limitations of LLMs
  • Replacement to Machine Learning Systems
  • LLM-based agents and multi-agent

References

  • Large Language Models
  • word2vec
  • Salesforce AI system
  • Richard Socher Speech
  • Transformer
  • RWKV
  • State Space Models (SSMs)
  • MoE (Mixture of Experts) Models
  • MoE-Mamba Models
  • Monarch Mixer Model
  • RAG
  • LoRA
  • LLM-based Agents

--

Mary Mulan ZHU

Technical architect and blogger, passionate about machine learning and generative AI. https://www.linkedin.com/in/marymulan/