Navigating the AI Frontier: Study on Language Model Innovations from Leading Experts

Mary Mulan ZHU
9 min read · Feb 15, 2024


Table of Contents

  • Introduction
  • History of Language Models
  • Large Language Models (LLMs)
  • Look to the Future
  • References
Image generated by DALL·E


Early yesterday morning, I came across a research paper on Large Language Models (LLMs), published on February 9, 2024. The authors include Tomas Mikolov, the inventor of the word2vec algorithm, and Richard Socher, the developer of Salesforce’s artificial intelligence system Einstein.

Large Language Models: A Survey, Shervin Minaee, Tomas Mikolov, Narjes Nikzad, Meysam Chenaghlu, Richard Socher, Xavier Amatriain, Jianfeng Gao, February 9, 2024.

Richard Socher is an expert in Natural Language Processing (NLP) and deep learning, having earned his Ph.D. from Chris Manning’s artificial intelligence lab at Stanford University. Manning and Socher are among the earliest researchers to apply deep learning techniques to natural language processing, and I have been following their lectures and research. Back in 2012, when most people were skeptical about using deep learning for natural language processing, Socher’s lecture “Deep Learning for NLP (without Magic)” was particularly meaningful.

The paper provides a historical overview of LLMs, their current status, and their future prospects. It introduces the key technologies, software, and related applications at each stage. Below, I have extracted some particularly interesting points and discussions from this paper, along with my explanations and opinions. Except for the figure in the chapter on the “History of Language Models”, all figures in this article are courtesy of the above paper.

History of Language Models

Language models have evolved significantly, culminating in Large Language Models (LLMs): large-scale, pre-trained statistical language models. This evolution can be divided into four distinct waves.

Language Model History

Statistical Language Models (SLMs)

  • Features: Treat text as a sequence of words and estimate the probability of each word given the words before it, as in n-gram models.
  • Challenges: Data sparsity keeps them from fully capturing the diversity and variability of natural language. Smoothing techniques mitigate the problem but do not remove it.
  • Applications: N-gram models are widely used in many NLP systems.
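
To make the n-gram idea and the role of smoothing concrete, here is a minimal sketch of a bigram model with add-one (Laplace) smoothing. The toy corpus and the function names are my own assumptions for illustration and do not come from the paper. Add-one is only one of several smoothing techniques.

```python
from collections import Counter

def train_bigram(tokens):
    """Count how often each word appears as a context, and each bigram."""
    contexts = Counter(tokens[:-1])             # words that have a successor
    bigrams = Counter(zip(tokens, tokens[1:]))  # adjacent word pairs
    vocab = set(tokens)
    return contexts, bigrams, vocab

def bigram_prob(prev, word, contexts, bigrams, vocab):
    """P(word | prev) with add-one smoothing, so unseen pairs get mass > 0."""
    return (bigrams[(prev, word)] + 1) / (contexts[prev] + len(vocab))

# Toy corpus (illustrative only)
corpus = "the cat sat on the mat the cat ate".split()
contexts, bigrams, vocab = train_bigram(corpus)

p_seen = bigram_prob("the", "cat", contexts, bigrams, vocab)    # pair seen twice
p_unseen = bigram_prob("the", "ate", contexts, bigrams, vocab)  # pair never seen
total = sum(bigram_prob("the", w, contexts, bigrams, vocab) for w in vocab)
```

Without smoothing, the unseen pair ("the", "ate") would get probability zero, which is the data sparsity problem in miniature.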

Neural Language Models (NLMs)

  • Features: Task-specific, with each model trained for a particular task. They overcome data sparsity by mapping words to embeddings in a low-dimensional continuous vector space.
  • Applications: Embeddings make it easy to compute semantic similarity, which is useful across NLP applications (e.g., queries vs. documents in Web search, sentences in different languages in machine translation) and across modalities (e.g., image and text in image captioning).
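
As a rough illustration of how embeddings support semantic similarity, the sketch below compares vectors with cosine similarity. The three-dimensional vectors are made up for the example. Real embeddings are learned from data and have far more dimensions.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical embeddings, chosen by hand for illustration
embeddings = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.8, 0.9, 0.1],
    "apple": [0.1, 0.2, 0.9],
}

sim_related = cosine_similarity(embeddings["king"], embeddings["queen"])
sim_unrelated = cosine_similarity(embeddings["king"], embeddings["apple"])
```

With these hand-picked vectors, "king" and "queen" score much closer to 1 than "king" and "apple" do.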

Pre-trained Language Models (PLMs)

  • Features: Task-agnostic, with a generalizable hidden embedding space. Pre-training on large text corpora, followed by fine-tuning for specific tasks. Based on recurrent neural networks or transformers.
  • Models: BERT (Bidirectional Encoder Representations from Transformers), RoBERTa, XLMs; GPT-1 and GPT-2, developed by OpenAI.

Large Language Models (LLMs)

Large Language Models have tens to hundreds of billions of parameters. They exhibit emergent abilities such as in-context learning, instruction following, and multi-step reasoning. They can also be augmented, for example by integrating external knowledge and tools for enhanced interaction and continual learning.

LLMs include PaLM, LLaMA, GPT-3, and GPT-4. AI agents based on LLMs are one application. Their main challenges are interacting with dynamic environments and the need for augmentation in real-world use.



Model Families

There are three major families of LLMs: GPT, LLaMA, and PaLM.


How to Build an LLM

Look to the Future

LLM Architecture


Currently, the Transformer architecture is mainstream. Since AlexNet’s victory in the 2012 ImageNet image recognition competition, deep learning has gradually gained widespread recognition. Neural architectures used for natural language processing before the Transformer include RNN, LSTM, GRU, and seq2seq.

The Transformer was introduced in the 2017 Google paper “Attention Is All You Need”.
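
The core operation of that paper is scaled dot-product attention, softmax(QKᵀ / √d_k)V. Below is a minimal pure-Python sketch of it. It leaves out the learned projections, multiple heads, and masking of a real Transformer layer, and the input vectors are arbitrary values chosen for illustration.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention over lists of vectors."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        # Each query is scored against every key: cost grows with n * n.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in keys]
        weights = softmax(scores)
        # Output is the attention-weighted average of the value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, values))
               for j in range(len(values[0]))]
        outputs.append(out)
    return outputs

# Self-attention: the same toy vectors act as queries, keys, and values
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out = attention(x, x, x)

# A zero query scores all keys equally, so it returns the mean of the values
uniform = attention([[0.0, 0.0]], x, x)
```

The nested loop over queries and keys is where the quadratic cost in sequence length comes from.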

State Space Models (SSMs)

Currently, there is significant research on post-attention mechanisms. One important category is State Space Models (SSMs). The term SSM often refers to the newer Structured State Space Model architecture, abbreviated as S4. Recent models in this category include Mamba, Hyena, and Striped Hyena. These models are competitive with Transformer-based models on many evaluation metrics, and they also address the Transformer’s limited context window: the cost of attention grows quadratically with sequence length, which makes attention-based models inefficient for long contexts.
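
To give a feel for why state space models scale well with context length, here is a sketch of the basic discrete linear recurrence behind them: h_t = A·h_(t-1) + B·x_t and y_t = C·h_t. This is not S4 or Mamba. I assume a diagonal A and hand-picked constants to keep the example short, while the real models add structured parameterizations and, in Mamba, input-dependent parameters.

```python
def ssm_scan(inputs, a, b, c):
    """Run a linear state space recurrence with a diagonal state matrix.

    a, b, c are lists of the same length (the state size):
      h_t = a * h_(t-1) + b * x_t   (elementwise)
      y_t = sum(c * h_t)
    """
    state = [0.0] * len(a)
    outputs = []
    for x in inputs:
        # The state has a fixed size, so each step costs the same
        # no matter how many tokens came before.
        state = [a_i * h_i + b_i * x for a_i, h_i, b_i in zip(a, state, b)]
        outputs.append(sum(c_i * h_i for c_i, h_i in zip(c, state)))
    return outputs

# Hypothetical parameters; a single impulse followed by zeros
a = [0.5, 0.9]
b = [1.0, 1.0]
c = [1.0, 0.5]
ys = ssm_scan([1.0, 0.0, 0.0], a, b, c)
```

The output after the impulse decays step by step, which shows how the fixed-size state carries information forward. Total work grows linearly with sequence length rather than quadratically.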

Mixture of Experts (MoE)

Another significant area of research is the Mixture of Experts (MoE). MoE is an architectural mechanism that is compatible with attention. In Large Language Models (LLMs), MoE makes it possible to train an extremely large model of which only a part is activated during inference. MoE models include Mixtral and GLaM, and there are rumors that GPT-4 has adopted the MoE architecture as well.

MoE can also be applied to architectures other than Transformers. For example, recent research from the University of Warsaw, Poland, and others applies MoE to Mamba, an SSM.
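
The idea of using only part of the model at inference time can be sketched with top-k routing, one common MoE scheme: a gate scores every expert, and only the k best-scoring experts run. The gate weights and the four toy "experts" below are assumptions made up for illustration. In a real model the experts are feed-forward networks and the gate is learned.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def moe_forward(x, gate_weights, experts, top_k=2):
    """Route x to the top_k experts and mix their outputs."""
    # One gate logit per expert
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in gate_weights]
    # Indices of the top_k highest-scoring experts
    top = sorted(range(len(logits)), key=lambda i: logits[i],
                 reverse=True)[:top_k]
    weights = softmax([logits[i] for i in top])
    output = [0.0] * len(x)
    for w, i in zip(weights, top):
        y = experts[i](x)  # only the selected experts are evaluated
        output = [o + w * yi for o, yi in zip(output, y)]
    return output, top

# Hypothetical experts: simple functions standing in for sub-networks
experts = [
    lambda v: [2 * t for t in v],
    lambda v: [t + 1 for t in v],
    lambda v: [-t for t in v],
    lambda v: [0.0 for _ in v],
]
gate_weights = [[1, 0], [0, 1], [-1, 0], [0, -1]]

output, chosen = moe_forward([1.0, 2.0], gate_weights, experts, top_k=2)
```

Here only two of the four experts run for this input, so the compute per input stays well below what the total parameter count suggests.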

Receptance Weighted Key Value (RWKV)

RWKV combines the efficient parallelizable training of Transformers with the efficient inference of RNNs.

Rethinking the Whole Transformer Architecture

Additionally, some research is focused on rethinking the entire Transformer architecture. An early example of this is the Monarch Mixer. It proposes a new architecture that uses the same sub-quadratic primitive, Monarch matrices, along both the sequence length and the model dimension. Monarch matrices achieve high hardware efficiency on GPUs.

LLM Applications

Solution to Limitations of LLMs

Hallucination can be addressed through advanced prompt engineering, retrieval-augmented generation (RAG), use of tools, or other augmentation techniques. More research in this area is expected.
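
As a rough sketch of the retrieval-augmented idea, the code below picks the most relevant document for a question and places it in the prompt, so that the model can answer from supplied text rather than from memory alone. The word-overlap scoring, the sample documents, and the prompt wording are my own simplifications. Production systems typically retrieve with embeddings and a vector index, and the final call to an LLM is left out here.

```python
import re

def words(text):
    """Lowercased set of word tokens, ignoring punctuation."""
    return set(re.findall(r"\w+", text.lower()))

def retrieve(question, documents, top_n=1):
    """Rank documents by how many words they share with the question."""
    q_words = words(question)
    ranked = sorted(documents,
                    key=lambda d: len(q_words & words(d)),
                    reverse=True)
    return ranked[:top_n]

def build_prompt(question, documents):
    """Put the retrieved context in front of the question."""
    context = "\n".join(f"- {d}" for d in retrieve(question, documents))
    return ("Answer using only the context below.\n"
            f"Context:\n{context}\n"
            f"Question: {question}")

# Illustrative knowledge base
documents = [
    "Mamba is a state space model.",
    "Mixtral is a mixture of experts model.",
    "AlexNet won ImageNet in 2012.",
]
question = "Which model is a mixture of experts?"
best = retrieve(question, documents)
prompt = build_prompt(question, documents)
```

The resulting prompt would then be sent to the LLM in place of the bare question.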

Replacement to Machine Learning Systems

LLM-based systems are being deployed in areas where machine learning systems were traditionally used. LLM-based systems provide personalized interactions by understanding people’s preferences and interests. Some examples:

  • Chatbot in customer service
  • Content recommendation
  • Many other applications using machine learning techniques

LLM-based Agents and Multi-agent Systems

Agent systems can access external tools and resources, then make decisions using the LLM’s reasoning capability. Research in this area is the closest to Artificial General Intelligence (AGI).
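
The loop of deciding, calling a tool, and observing the result can be sketched as follows. Everything here is hypothetical: stub_llm is a hard-coded stand-in for a real LLM call, and the calculator tool only adds two integers. The sketch shows the control flow of an agent, not any particular framework.

```python
def calculator(expression):
    """Toy tool: adds two integers written as 'a+b'."""
    left, right = expression.split("+")
    return str(int(left) + int(right))

# Registry of tools the agent is allowed to call
TOOLS = {"calculator": calculator}

def stub_llm(task, observations):
    """Hypothetical stand-in for an LLM deciding the next action.

    Returns (action, argument). A real agent would prompt a model here.
    """
    if not observations:
        return ("calculator", task)        # first step: use the tool
    return ("finish", observations[-1])    # then: report the tool result

def run_agent(task, max_steps=5):
    """Alternate between choosing an action and observing its result."""
    observations = []
    for _ in range(max_steps):
        action, argument = stub_llm(task, observations)
        if action == "finish":
            return argument
        observations.append(TOOLS[action](argument))
    return None  # gave up after max_steps

result = run_agent("2+3")
```

The max_steps limit is a simple guard so that a confused agent cannot loop forever.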


