Navigating the AI Frontier: Study on Language Model Innovations from Leading Experts

Mary Mulan ZHU
9 min read · Feb 15, 2024

--

Table of Contents

  • Introduction
  • History of Language Models
  • Large Language Models (LLMs)
  • Look to the Future
  • References
  • Table of Contents (Level 2)
Image generated by DALL·E

Introduction

Recently, I came across a research paper on Large Language Models (LLMs), published on February 9, 2024. The authors include Tomas Mikolov, the inventor of the word2vec algorithm, and Richard Socher, the developer of Salesforce’s artificial intelligence system Einstein.

Large Language Models: A Survey, Shervin Minaee, Tomas Mikolov, Narjes Nikzad, Meysam Chenaghlu, Richard Socher, Xavier Amatriain, Jianfeng Gao, 2024.2.9 download

Richard Socher is an expert in Natural Language Processing (NLP) and deep learning, having earned his Ph.D. from Chris Manning’s artificial intelligence lab at Stanford University. Manning and Socher are among the earliest researchers to apply deep learning techniques to natural language processing, and I have been following their lectures and research. Back in 2012, when most people were skeptical about using deep learning for natural language processing, Socher’s lecture “Deep Learning for NLP (without Magic)” was particularly meaningful.

The paper provides a historical overview, current status, and future prospects of LLMs. It introduces key technologies, software, and related applications at various stages. Below, I have extracted some particularly interesting points and discussions from this paper, along with my explanations and opinions. Except for the figure in the chapter on the “History of Language Models”, all figures in this article are courtesy of the above paper.

History of Language Models

Language models have evolved significantly, with advancements leading to the creation of large-scale, pre-trained statistical language models (LLMs). This evolution can be categorized into four distinct waves.

Language Model History

Statistical Language Models (SLMs)

  • Features: Treat text as a sequence of words and estimate the probability of each word given the preceding ones; n-gram models are the classic example (see the sketch after this list).
  • Challenges: Cannot fully capture the diversity and variability of natural language because of data sparsity, which is managed through smoothing techniques.
  • Applications: N-gram models are widely used in many NLP systems.
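
To make the n-gram idea concrete, here is a minimal sketch of a bigram model with add-one (Laplace) smoothing. The toy corpus and function names are my own illustration and are not taken from the survey.

```python
from collections import Counter

# Toy corpus to illustrate an n-gram (here: bigram) statistical language model.
corpus = "the cat sat on the mat the dog sat on the rug".split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
vocab_size = len(unigrams)

def bigram_prob(prev_word, word):
    """P(word | prev_word) with add-one (Laplace) smoothing to handle data sparsity."""
    return (bigrams[(prev_word, word)] + 1) / (unigrams[prev_word] + vocab_size)

# Probability of a short sentence under the bigram model.
sentence = ["the", "dog", "sat"]
p = 1.0
for prev, curr in zip(sentence, sentence[1:]):
    p *= bigram_prob(prev, curr)
print(p)
```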

Neural Language Models (NLMs)

  • Features: Task-specific, with each model trained for a particular task. Overcame data sparsity by mapping words to embeddings in a low-dimensional continuous vector space.
  • Applications: Made it practical to compute semantic similarity between texts, which is useful across NLP applications (e.g., queries vs. documents in Web search, sentences in different languages in machine translation) and modalities (e.g., image and text in image captioning). A small similarity example follows this list.
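
As a small illustration of how embeddings support semantic similarity, the sketch below compares toy vectors with cosine similarity. The words and four-dimensional vectors are invented for the example; real neural language models learn embeddings with hundreds of dimensions.

```python
import numpy as np

# Toy 4-dimensional embeddings; real models learn vectors of a few hundred dimensions.
embeddings = {
    "king":  np.array([0.80, 0.65, 0.10, 0.05]),
    "queen": np.array([0.78, 0.68, 0.12, 0.04]),
    "apple": np.array([0.05, 0.10, 0.90, 0.70]),
}

def cosine_similarity(u, v):
    """Semantic similarity as the cosine of the angle between two embedding vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine_similarity(embeddings["king"], embeddings["queen"]))  # high: related words
print(cosine_similarity(embeddings["king"], embeddings["apple"]))  # low: unrelated words
```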

Pre-trained Language Models (PLMs)

  • Features: Task-agnostic, with a generalizable hidden embedding space. Pre-training on large text corpora is followed by fine-tuning for specific tasks. Based on recurrent neural networks or Transformers.
  • Models: BERT (Bidirectional Encoder Representations from Transformers), RoBERTa, XLMs; GPT-1 and GPT-2, developed by OpenAI. A pre-train-then-fine-tune sketch follows this list.
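
To illustrate the pre-train-then-fine-tune pattern, here is a minimal sketch assuming the Hugging Face transformers library and PyTorch are installed; the bert-base-uncased checkpoint and the two-label setup are illustrative choices, and a real fine-tuning run would add a dataset, an optimizer, and a training loop.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Load a pre-trained BERT checkpoint and attach a fresh task-specific classification head.
checkpoint = "bert-base-uncased"  # illustrative choice of pre-trained model
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

# One forward pass; fine-tuning would repeat this over labeled batches with an optimizer.
inputs = tokenizer("Deep learning for NLP without magic", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
print(logits.shape)  # torch.Size([1, 2]): one score per class
```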

Large Language Models (LLMs)

Large Language Models have tens to hundreds of billions of parameters. They show emergent abilities such as in-context learning, instruction following, and multi-step reasoning. They can also be augmented, for example by integrating external knowledge and tools for richer interaction and continual learning.

LLMs include PaLM, LLaMA, GPT-3, and GPT-4. AI agents built on LLMs are one application area, with challenges around interacting with dynamic environments and the need for augmentation in real-world use.

Timeline

Capabilities

Model Families

There are three major families of LLMs: GPT, LLaMA, and PaLM.

Category

How to Build an LLM

Look to the Future

LLM Architecture

Transformer

Currently, the Transformer architecture is mainstream. Since AlexNet’s victory in the 2012 ImageNet image recognition competition, deep learning has gradually gained widespread recognition. Before the Transformer, the main neural architectures for natural language processing were RNNs and their variants (LSTM, GRU), along with seq2seq models.

The Transformer is based on the 2017 Google paper “Attention Is All You Need”.
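
The core operation introduced in that paper is scaled dot-product attention, Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V. Below is a minimal NumPy sketch with toy shapes, no masking, and no multiple heads.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V, as in "Attention Is All You Need"."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # (seq_len_q, seq_len_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                               # (seq_len_q, d_v)

# Toy example: 3 query positions, 4 key/value positions, dimension 8.
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, 8)), rng.normal(size=(4, 8)), rng.normal(size=(4, 8))
print(scaled_dot_product_attention(Q, K, V).shape)  # (3, 8)
```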

State Space Models (SSMs)

Currently, there is significant research on post-attention mechanisms. One important category within this field is State Space Models (SSMs). In this context, SSM often refers to the newer Structured State Space sequence model architecture, abbreviated S4. Recent models in this category include Mamba, Hyena, and StripedHyena. These models not only surpass Transformer-based models on many evaluation metrics, they also address the Transformer’s context-window limitation: attention-based models are highly inefficient for long contexts because self-attention’s cost grows quadratically with sequence length.
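
For intuition only, the sketch below runs the plain discrete state-space recurrence x_k = A x_{k-1} + B u_k, y_k = C x_k, which costs time linear in sequence length. S4 and Mamba build on this skeleton with structured transition matrices, careful discretization, and, in Mamba’s case, input-dependent (selective) parameters; none of that is shown here.

```python
import numpy as np

def ssm_scan(A, B, C, u):
    """Discrete linear state space model: x_k = A x_{k-1} + B u_k, y_k = C x_k."""
    x = np.zeros(A.shape[0])
    ys = []
    for u_k in u:                      # one step per token: O(sequence length)
        x = A @ x + B * u_k            # update the hidden state
        ys.append(C @ x)               # emit an output
    return np.array(ys)

# Toy 1-D input sequence with a 4-dimensional hidden state.
rng = np.random.default_rng(0)
A = 0.9 * np.eye(4)                    # stable, structured (here: diagonal) transition
B, C = rng.normal(size=4), rng.normal(size=4)
print(ssm_scan(A, B, C, u=rng.normal(size=16)).shape)  # (16,)
```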

Mixture of Experts (MoE)

Another significant area of research is the Mixture of Experts (MoE). MoE is an attention-compatible architectural mechanism. In Large Language Models (LLMs), MoE makes it possible to train an extremely large model of which only a fraction is activated during inference. MoE architecture models include Mixtral and GLaM, and there are rumors that GPT-4 has adopted the MoE architecture.
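
The sketch below illustrates the sparse-activation idea with a toy top-k router over small linear “experts”. The shapes, the router, and the gating details are simplified illustrations, not taken from Mixtral, GLaM, or any other specific model.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, num_experts, top_k = 16, 8, 2

# Each "expert" is a tiny linear layer; the router picks which ones to run per token.
experts = [rng.normal(scale=0.1, size=(d_model, d_model)) for _ in range(num_experts)]
router = rng.normal(scale=0.1, size=(d_model, num_experts))

def moe_layer(x):
    """Sparse MoE: route a token to its top-k experts and mix their outputs."""
    logits = x @ router                                # (num_experts,)
    top = np.argsort(logits)[-top_k:]                  # indices of the chosen experts
    gates = np.exp(logits[top]); gates /= gates.sum()  # softmax over the chosen experts
    # Only top_k of num_experts experts are evaluated: most parameters stay idle.
    return sum(g * (x @ experts[i]) for g, i in zip(gates, top))

token = rng.normal(size=d_model)
print(moe_layer(token).shape)  # (16,)
```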

MoE can also be applied to architectures other than Transformers. For example, recent research from the University of Warsaw, Poland, and others, involves applying MoE to Mamba (SSM).

Receptance Weighted Key Value (RWKV)

RWKV combines the efficient parallelizable training of Transformers with the efficient inference of RNNs.

Rethinking the Whole Transformer Architecture

Additionally, some research is rethinking the Transformer architecture as a whole. An early example is the Monarch Mixer, which proposes a new architecture built on a single sub-quadratic primitive, Monarch matrices, applied along both the sequence length and the model dimension while remaining hardware-efficient on GPUs.

LLM Applications

Solution to Limitations of LLMs

Hallucination can be mitigated through retrieval-augmented generation (RAG), advanced prompt engineering, tool use, and other augmentation techniques; more research in this area is expected.
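
As a sketch of how RAG grounds answers in retrieved text: the tiny keyword-overlap retriever and the call_llm placeholder below are assumptions standing in for a real vector store and a real model API.

```python
# Toy in-memory "document store"; a real system would use a vector database.
DOCUMENTS = [
    "Mamba is a selective state space model for long sequences.",
    "Mixtral is a sparse mixture-of-experts language model.",
    "RWKV combines parallelizable training with RNN-style inference.",
]

def retrieve(question: str, k: int = 2) -> list[str]:
    """Rank documents by naive word overlap with the question (illustration only)."""
    q = set(question.lower().split())
    ranked = sorted(DOCUMENTS, key=lambda d: len(q & set(d.lower().split())), reverse=True)
    return ranked[:k]

def call_llm(prompt: str) -> str:
    """Placeholder for a real LLM call (API or local model)."""
    raise NotImplementedError("plug in an LLM API or local model here")

def rag_answer(question: str) -> str:
    """Retrieval-augmented generation: answer only from the retrieved context."""
    context = "\n".join(retrieve(question))
    prompt = (f"Answer the question using only the context below; "
              f"say you don't know if it is not in the context.\n\n"
              f"Context:\n{context}\n\nQuestion: {question}")
    return call_llm(prompt)
```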

Replacement to Machine Learning Systems

LLM-based systems are being deployed in areas where traditional machine learning systems were used. They provide personalized interactions based on an understanding of people’s preferences and interests. Some examples:

  • Chatbots in customer service
  • Content recommendation
  • Many other applications that currently use machine learning techniques

LLM-based agents and multi-agent

Agent systems can access external tools and resources and then make decisions using an LLM’s reasoning capability. Research in this area is the closest to Artificial General Intelligence (AGI).
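
Below is a minimal sketch of such a tool-using agent loop in the spirit of ReAct-style prompting. The call_llm placeholder, the TOOL/FINAL response format, and the toy tools are all assumptions made for illustration, not a recipe from the survey.

```python
# Toy tools; a real agent would wire in search APIs, code execution, databases, etc.
TOOLS = {
    "calculator": lambda expr: str(sum(float(x) for x in expr.split("+"))),  # handles only "a+b+..." sums
    "search": lambda query: f"(stubbed search results for: {query})",
}

def call_llm(prompt: str) -> str:
    """Placeholder for a real LLM call. Assumed to reply either
    'TOOL <name>: <input>' to use a tool, or 'FINAL: <answer>' to finish."""
    raise NotImplementedError("plug in an LLM API or local model here")

def run_agent(task: str, max_steps: int = 5) -> str:
    """Reason-act loop: the LLM reads the transcript, then picks a tool or answers."""
    transcript = f"Task: {task}\n"
    for _ in range(max_steps):
        decision = call_llm(transcript)
        if decision.startswith("FINAL:"):
            return decision.removeprefix("FINAL:").strip()
        tool_name, tool_input = decision.removeprefix("TOOL").split(":", 1)
        observation = TOOLS[tool_name.strip()](tool_input.strip())
        transcript += f"{decision}\nObservation: {observation}\n"
    return "Stopped: step limit reached."
```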

References

Large Language Models

  • Large Language Models: A Survey, Shervin Minaee, Tomas Mikolov, Narjes Nikzad, Meysam Chenaghlu, Richard Socher, Xavier Amatriain, Jianfeng Gao, 2024.2.9 download | wechat

word2vec

  • Efficient Estimation of Word Representations in Vector Space, Tomas Mikolov, Kai Chen, Greg Corrado, Jeffrey Dean, 2013.1.16 download
  • Distributed Representations of Words and Phrases and their Compositionality, Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, Jeffrey Dean, 2013, Part of Advances in Neural Information Processing Systems 26 (NIPS 2013) | download
  • word2vec Parameter Learning Explained, Xin Rong, 2014.11.11 download

Salesforce AI system

  • Salesforce Announces the Next Generation of Einstein, Bringing Conversational AI Assistants to Every CRM Application and Customer Experience, September 12, 2023, Salesforce Japan Co., Ltd. website

Richard Socher Speech

  • Deep Learning for NLP (without Magic) — ACL 2012 Tutorial, Richard Socher, 2012 youtube

Transformer

  • Transformer: Attention Is All You Need, NIPS(NeurIPS) 2017, Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin, Submitted on 12 Jun 2017 (v1), last revised 2 Aug 2023 (this version, v7), Google Brain, Google Research, University of Toronto download

RWKV

  • RWKV: Reinventing RNNs for the Transformer Era, 2023.05.22 download

State Space Models (SSMs)

  • State Space Models (SSMs): Efficiently Modeling Long Sequences with Structured State Spaces, Albert Gu, Karan Goel, and Christopher Ré, Department of Computer Science, Stanford University, 2022.8.5 download
  • Hyena Hierarchy (SSM): Towards Larger Convolutional Language Models, Michael Poli, Stefano Massaroli, Eric Nguyen, Daniel Y. Fu, Tri Dao, Stephen Baccus, Yoshua Bengio, Stefano Ermon, Christopher Ré, Stanford University, Mila and Université de Montréal, 2023.4.21 download
  • Striped Hyena (SSM): M. Poli, J. Wang, S. Massaroli, J. Quesnelle, E. Nguyen, and A. Thomas, “StripedHyena: Moving Beyond Transformers with Hybrid Signal Processing Models,” 2023.12. github | together.ai
  • Mamba: Linear-Time Sequence Modeling with Selective State Spaces, Albert Gu and Tri Dao, Machine Learning Department, Carnegie Mellon University, Department of Computer Science, Princeton University, Dec 1, 2023 download | 新智元 | YC Hacker News

MoE (Mixture of Experts) Models

  • MoE (Mixture of Experts): G. J. McLachlan, S. X. Lee, and S. I. Rathnayake, University of Queensland, “Finite mixture models,” Annual review of statistics and its application, vol. 6, pp. 355–378, 2019.6. MoE is an attention-compatible architectural mechanism. download
  • Finite Mixture Models, 1st Edition, by Geoffrey J. McLachlan (Author), David Peel (Author), 2000.10.2 book
  • GLaM model (MoE): GLaM: Efficient Scaling of Language Models with Mixture-of-Experts, Nan Du, Yanping Huang, Andrew M. Dai, Simon Tong, Dmitry Lepikhin, Yuanzhong Xu, Maxim Krikun, Yanqi Zhou, Adams Wei Yu, Orhan Firat, Barret Zoph, Liam Fedus, Maarten Bosma, Zongwei Zhou, Tao Wang, Yu Emma Wang, Kellie Webster, Marie Pellat, Kevin Robinson, Kathleen Meier-Hellstern, Toju Duke, Lucas Dixon, Kun Zhang, Quoc V. Le, Yonghui Wu, Google, 2022.8.1, in International Conference on Machine Learning download
  • Mixtral (MoE): Mixtral of experts, A high quality Sparse Mixture-of-Experts. mistral.ai

MoE-Mamba Models

  • MoE-Mamba: Efficient Selective State Space Models with Mixture of Experts, Maciej Pióro, Kamil Ciebiera, Krystian Król, Jan Ludziejewski, Sebastian Jaszczur, IDEAS NCBR, Polish Academy of Sciences, University of Warsaw, 2024.1.8 download

Monarch Mixer Model

  • Monarch Mixer: A Simple Sub-Quadratic GEMM-Based Architecture, Daniel Y. Fu, Simran Arora, Jessica Grogan, Isys Johnson, Sabri Eyuboglu, Armin W. Thomas, Benjamin Spector, Michael Poli, Atri Rudra, and Christopher Ré, Stanford University, University at Buffalo, SUNY, 2023.10.18 download

RAG

  • RAG: Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks, Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis†, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, Douwe Kiela, Facebook AI Research, University College London, New York University, 2021.4.12 download
  • RAG: Retrieval-Augmented Generation for Large Language Models: A Survey, Yunfan Gao, Yun Xiong, Xinyu Gao, Kangxiang Jia, Jinliu Pan, Yuxi Bi, Yi Dai , Jiawei Sun, Qianyu Guo, Meng Wang and Haofen Wang, Tongji University, Fudan University, 2024.1.5 download

LoRA

  • LoRA: Low-Rank Adaptation of Large Language Models, Edward Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, Microsoft Corporation, 2021.10.16 download

LLM-based Agents

  • Agents: The Rise and Potential of Large Language Model Based Agents: A Survey, Zhiheng Xi, Wenxiang Chen, Xin Guo, Wei He, Yiwen Ding, Boyang Hong, Ming Zhang, Junzhe Wang, Senjie Jin, Enyu Zhou, Rui Zheng, Xiaoran Fan, Xiao Wang, Limao Xiong, Yuhao Zhou, Weiran Wang, Changhao Jiang, Yicheng Zou, Xiangyang Liu, Zhangyue Yin, Shihan Dou, Rongxiang Weng, Wensen Cheng, Qi Zhang, Wenjuan Qin, Yongyan Zheng, Xipeng Qiu, Xuanjing Huang and Tao Gui, Fudan NLP Group, 2023.9.19 download
  • Agents: A Survey on Large Language Model based Autonomous Agents, Lei Wang, Chen Ma, Xueyang Feng, Zeyu Zhang, Hao Yang, Jingsen Zhang, Zhiyuan Chen, Jiakai Tang, Xu Chen, Yankai Lin, Wayne Xin Zhao, Zhewei Wei, Ji-Rong Wen, Gaoling School of Artificial Intelligence, Renmin University of China, Beijing, China, 2023.9.7 download
  • Agents: Agent AI: Surveying the Horizons of Multimodal Interaction, Zane Durante, Qiuyuan Huang, Naoki Wake, Ran Gong, Jae Sung Park, Bidipta Sarkar, Rohan Taori, Yusuke Noda, Demetri Terzopoulos, Yejin Choi, Katsushi Ikeuchi, Hoi Vo, Li Fei-Fei, Jianfeng Gao, Stanford University; Microsoft Research, Redmond; University of California, Los Angeles; University of Washington; Microsoft Gaming, 2024.1.25 download

Table of Contents (Level 2)

Introduction

History of Language Models

  • Statistical Language Models (SLMs)
  • Neural Language Models (NLMs)
  • Pre-trained Language Models (PLMs)

Large Language Models (LLMs)

  • Timeline
  • Capabilities
  • Model Families
  • Category
  • How to Build an LLM

Look to the Future

LLM Architecture

  • Transformer
  • State Space Models (SSMs)
  • Mixture of Experts (MoE)
  • Receptance Weighted Key Value (RWKV)
  • Rethinking the Whole Transformer Architecture

LLM Applications

  • Solution to Limitations of LLMs
  • Replacement to Machine Learning Systems
  • LLM-based agents and multi-agent

References

  • Large Language Models
  • word2vec
  • Salesforce AI system
  • Richard Socher Speech
  • Transformer
  • RWKV
  • State Space Models (SSMs)
  • MoE (Mixture of Experts) Models
  • MoE-Mamba Models
  • Monarch Mixer Model
  • RAG
  • LoRA
  • LLM-based Agents

--

Mary Mulan ZHU

Technical architect and blogger, passionate about machine learning and generative AI. https://www.linkedin.com/in/marymulan/