This article introduces the runtime and data of large language models (LLMs) from a high-level perspective, helping you understand:
- The fundamental process of running an LLM
- What is inside an LLM repository
Basic Concept
At its core, an LLM is doing text prediction. The idea builds on the Transformer architecture introduced in the famous Google paper "Attention Is All You Need" (Vaswani et al., 2017). LLM processing can be divided into two phases:
Training:
Given some text -> Process it using many parameters + attention mechanism -> Predict the next token -> Compare with the real next token and update parameters

Running (inference):
Given some text -> Process it using many parameters + attention mechanism -> Predict the next token
The main difference is that training updates the model's parameters based on the actual next token, while running only predicts tokens using the already learned parameters.
In short, an LLM is fundamentally performing a text prediction task, using a combination of learned parameters and the attention mechanism to determine which parts of the input are most relevant.
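To make the running phase concrete, here is a minimal sketch of that loop, assuming the Hugging Face Transformers library and the small gpt2 checkpoint (chosen here only for illustration):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

text = "The capital of France is"
ids = tokenizer(text, return_tensors="pt").input_ids

# "Given some text -> process -> predict the next token", repeated 5 times
for _ in range(5):
    with torch.no_grad():
        logits = model(ids).logits       # a score for every vocabulary token
    next_id = logits[0, -1].argmax()     # greedy: pick the most likely token
    ids = torch.cat([ids, next_id.view(1, 1)], dim=1)

print(tokenizer.decode(ids[0]))
```

Training runs the same forward pass but then compares the prediction against the real next token and updates the parameters; inference, as above, only predicts.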
LLM High-Level View
Based on how an LLM works, running one requires two essential components:
- Runtime: to maintain and execute the attention process
- Data: the pretrained model parameters
Runtime
The runtime is the engine that executes the model. Its core responsibility is to manage the attention process efficiently.
The attention process can be understood simply as:
- Each word (or token) in the input text is represented as a vector.
- The model computes how much each token should "attend to" every other token in the input.
- Using these attention weights, it combines information from all tokens to predict the next token.
Think of attention as a smart spotlight that highlights the most important words for predicting the next word. The runtime ensures this spotlight moves efficiently across potentially thousands of tokens and billions of parameters, while managing memory and computation.
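A toy version of this computation, written in plain PyTorch with made-up shapes and random vectors (not values from a real model), might look like this:

```python
import torch
import torch.nn.functional as F

seq_len, d = 4, 8                     # 4 tokens, each an 8-dimensional vector
x = torch.randn(seq_len, d)           # step 1: tokens as vectors

q, k, v = x, x, x                     # self-attention: queries = keys = values
scores = q @ k.T / d ** 0.5           # step 2: how much each token attends to each other token
weights = F.softmax(scores, dim=-1)   # attention weights, each row sums to 1
out = weights @ v                     # step 3: combine information from all tokens

print(weights)                        # the "spotlight" strength per token pair
```

Real models apply learned projection matrices to produce q, k, and v, and run many such attention heads in parallel across many layers, but the core pattern is the one above.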
Examples of runtime frameworks:
- PyTorch – A widely used deep learning library that provides the tools to run LLMs efficiently on GPUs or CPUs.
- TensorFlow – Another popular deep learning framework supporting LLM inference and training.
- Ollama – A lightweight, dedicated LLM runtime (built on llama.cpp) for downloading and running models locally.
- ONNX Runtime – A runtime for executing models in the Open Neural Network Exchange format, often used to speed up inference across platforms.
These runtimes handle efficient tensor operations, memory management, and parallel computation required to make LLMs usable in practice.
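As a quick illustration of what using such a runtime looks like, here is text generation via Hugging Face Transformers (backed by PyTorch) with the small gpt2 model as an example:

```python
from transformers import pipeline

# The pipeline wraps tokenization, the model forward pass, and decoding
generator = pipeline("text-generation", model="gpt2")
result = generator("Once upon a time", max_new_tokens=20)
print(result[0]["generated_text"])
```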
Data
The data of an LLM is its pretrained parameters, sometimes called weights. These parameters are learned during training and encode all the knowledge of the model.
Key points about LLM data:
- Parameters: Millions or even billions of numbers that define the model.
- Vocabulary & Tokenization: The mapping between text tokens and numerical representations.
- Configuration: Hyperparameters such as layer sizes, number of attention heads, etc.
During inference, the runtime loads these parameters and uses them to perform predictions. Without the pretrained data, the model cannot generate meaningful text.
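A sketch of loading these three kinds of data, again using the small gpt2 repo as an example with the Hugging Face Transformers library:

```python
from transformers import AutoConfig, AutoTokenizer, AutoModelForCausalLM

config = AutoConfig.from_pretrained("gpt2")            # configuration (hyperparameters)
tokenizer = AutoTokenizer.from_pretrained("gpt2")      # vocabulary & tokenization
model = AutoModelForCausalLM.from_pretrained("gpt2")   # pretrained parameters (weights)

print(config.n_layer, config.n_head)   # 12 layers, 12 attention heads
print(tokenizer.vocab_size)            # 50257 tokens in the vocabulary
print(sum(p.numel() for p in model.parameters()))  # ~124M parameters for small gpt2
print(tokenizer("Hello world").input_ids)          # text mapped to token IDs
```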
Examples of LLM data sources / repos:
- Hugging Face Hub – A central repository hosting thousands of pretrained models, e.g., gpt2, LLaMA, Falcon, typically loaded through the Transformers library.
- Example: the GPT-2 repo on Hugging Face – contains pretrained weights, tokenizer files, and configuration (see the snippet after this list).
- Meta LLaMA models – Open-weight models from Meta, released under a community license, with pretrained weights at various scales.
- OpenAI models – Although the weights are not released, OpenAI's APIs provide access to pretrained models like GPT-3 and GPT-4.
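As a small illustration of what such a repo actually holds, the huggingface_hub library can list a repo's files; gpt2 is used here as the example repo ID (it may redirect to openai-community/gpt2 on the Hub):

```python
from huggingface_hub import list_repo_files

# Typically includes config.json, tokenizer files, and the weight files
print(list_repo_files("gpt2"))
```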
Summary
Running an LLM boils down to two things:
- A runtime that manages the attention mechanism and computations efficiently
- Data containing all the pretrained parameters that encode the knowledge of the model
In essence, the LLM is just performing smart text prediction, using its attention-based runtime and pretrained knowledge to generate coherent outputs.