When you type a question into ChatGPT, Claude, or Gemini and get a coherent, detailed answer back, what's actually happening? The answer is a large language model — and understanding what that means demystifies a lot of what AI can and can't do.

What Is a Language Model?

A language model is a system trained to predict what comes next in text. Given a sequence of words, it estimates the probability of the next word — or more precisely, the next token (which might be a word, part of a word, or a punctuation mark).

A very simple language model trained on English news articles might learn that after "the stock market" the most likely next word is "fell" or "rose" or "closed." A language model trained on a large share of the public internet learns vastly more complex patterns.
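To make the idea concrete, here is a minimal sketch of a bigram model in Python. It looks only at the single previous word, whereas a real LLM conditions on the whole preceding context. The corpus is an invented toy example and the function names are made up for illustration.

```python
from collections import Counter, defaultdict

def train_bigram(corpus):
    """Count, for each word, which words follow it in the corpus."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.lower().split()
        for current, following in zip(words, words[1:]):
            counts[current][following] += 1
    return counts

def next_word_probs(counts, word):
    """Turn the raw follow-up counts for `word` into probabilities."""
    followers = counts[word]
    total = sum(followers.values())
    return {w: c / total for w, c in followers.items()} if total else {}

# Hypothetical toy corpus, invented for illustration.
corpus = [
    "the stock market fell",
    "the stock market rose",
    "the stock market fell sharply",
    "the stock market closed higher",
]

model = train_bigram(corpus)
probs = next_word_probs(model, "market")
print(probs)
```

With this corpus, "fell" follows "market" in two of the four sentences, so it gets half of the probability, and "rose" and "closed" get a quarter each.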

Large language models are language models with billions or even trillions of parameters (the numeric values that encode everything the model has "learned"), trained on enormous datasets.

How They're Trained

Training a large language model happens in two major phases:

Phase 1: Pre-training

The model is shown massive quantities of text — books, websites, code, scientific papers, forums — and trained to predict the next token at each step.

This sounds simple, but to predict text accurately across billions of documents, the model must implicitly learn:

Grammar and syntax
Facts about the world
Reasoning patterns
How different topics relate to each other
Styles of writing
Programming languages
Mathematics

The model mostly doesn't store text verbatim, although passages that recur often in the training data can end up memorized. Instead, it compresses statistical patterns into its parameters: billions of numbers that together encode a kind of distilled model of human knowledge as expressed in text.
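The "predict the next token" objective is commonly scored with a cross-entropy loss: the penalty is the negative log of the probability the model gave to the token that actually came next. The sketch below uses made-up probabilities (no real model is involved) to show how the penalty grows as the model becomes more wrong.

```python
import math

def next_token_loss(predicted_probs, actual_token):
    """Cross-entropy for one prediction: -log(probability of the true token)."""
    return -math.log(predicted_probs[actual_token])

# Hypothetical model outputs for the context "the stock market".
confident_and_right = {"fell": 0.90, "rose": 0.05, "closed": 0.05}
unsure = {"fell": 0.34, "rose": 0.33, "closed": 0.33}
confident_and_wrong = {"fell": 0.02, "rose": 0.90, "closed": 0.08}

# Suppose the real next token in the training text was "fell".
loss_right = next_token_loss(confident_and_right, "fell")
loss_unsure = next_token_loss(unsure, "fell")
loss_wrong = next_token_loss(confident_and_wrong, "fell")

print(f"confident, right: {loss_right:.2f}")
print(f"unsure:           {loss_unsure:.2f}")
print(f"confident, wrong: {loss_wrong:.2f}")
```

Training nudges the parameters to lower this loss averaged over every position in the training text.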

Pre-training is extremely expensive. Training the largest models is reported to cost tens of millions of dollars or more in compute.

Phase 2: Fine-tuning and RLHF

A pre-trained model is good at predicting text — but that's not quite the same as being useful for conversation. To turn it into an assistant, models go through additional training:

Supervised fine-tuning (SFT): Human trainers write example conversations, and the model is trained to produce similar responses
Reinforcement Learning from Human Feedback (RLHF): Human raters score or rank model responses, and the model is trained to produce responses humans rate more highly

This phase teaches the model to be helpful, to follow instructions, and to decline harmful requests — properties that don't emerge automatically from predicting internet text.
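The text above doesn't say how human ratings become a training signal. One common approach, assumed here, is to have raters pick the better of two responses and train a separate reward model to score the preferred one higher. The pairwise loss below is a sketch of that idea, with invented scores.

```python
import math

def preference_loss(score_chosen, score_rejected):
    """Pairwise (Bradley-Terry style) loss: small when the response the
    human preferred gets a higher reward score than the rejected one."""
    margin = score_chosen - score_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Hypothetical reward-model scores for two candidate responses.
agrees_with_rater = preference_loss(score_chosen=2.0, score_rejected=-1.0)
disagrees_with_rater = preference_loss(score_chosen=-1.0, score_rejected=2.0)

print(f"reward model agrees with rater:    {agrees_with_rater:.3f}")
print(f"reward model disagrees with rater: {disagrees_with_rater:.3f}")
```

Once trained, a reward model like this can stand in for human raters, and the language model is then optimized to produce responses that score well.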

The Transformer Architecture

Modern LLMs are built on an architecture called the Transformer, introduced by Google researchers in 2017.

The key innovation is the attention mechanism: rather than processing text word by word, the model can directly relate any word in a sequence to any other word, regardless of distance. This lets the model understand that "it" in "the trophy didn't fit in the suitcase because it was too big" refers to "trophy," not "suitcase" — a kind of reasoning that requires understanding the whole sentence, not just the neighboring words.
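A minimal sketch of the core computation, scaled dot-product attention, in plain Python. The two-dimensional vectors are hand-picked so that the query for "it" lines up with "trophy". In a real Transformer the queries, keys, and values are produced by learned projections and have far more dimensions.

```python
import math

def softmax(scores):
    """Convert raw scores into positive weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(query, keys, values):
    """Scaled dot-product attention for a single query vector."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    # The output is the weighted average of the value vectors.
    dim = len(values[0])
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, output

# Hypothetical vectors, hand-picked for illustration (not learned).
tokens = ["trophy", "suitcase", "it"]
keys = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
query_for_it = [2.0, 0.0]

weights, output = attention(query_for_it, keys, values)
for token, weight in zip(tokens, weights):
    print(f"{token}: {weight:.2f}")
```

The largest weight lands on "trophy", so the new representation of "it" is pulled mostly toward that token, however far apart the two words sit in the sentence.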

Each layer of the transformer adds a richer representation of the input. With enough layers and enough training, these representations become powerful enough to support complex reasoning.

What LLMs Are Surprisingly Good At

Writing and editing — drafts, summaries, rewrites in different styles or tones
Code generation — generating, explaining, and debugging code across dozens of languages
Question answering — synthesizing information from their training data
Translation — between languages and between technical and plain language
Reasoning through structured problems — math, logic puzzles, step-by-step analysis
Conversation — maintaining context across a long dialogue

What LLMs Are Genuinely Bad At

Understanding the limits is as important as understanding the capabilities:

They can hallucinate — generate confident-sounding text that is factually wrong. The model produces plausible text, not necessarily true text.
They don't have real-time knowledge — unless given tools to access the internet, they only know what was in their training data (up to a cutoff date)
They're inconsistent: ask the same question twice and you may get different answers, because responses are typically generated by random sampling
They struggle with precise arithmetic — they approach math through pattern-matching, not calculation
They have no persistent memory by default — each conversation starts fresh
They reflect biases in training data — if the internet is biased about something, the model likely is too
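Much of the inconsistency in the list above comes from how output is generated: the next token is usually sampled from a probability distribution rather than always taking the top choice. The sketch below assumes temperature sampling, one common decoding strategy. The scores are invented, and the random seed is there only to make the demo repeatable.

```python
import math
import random

def sample_next_token(logits, temperature=1.0, rng=random):
    """Sample one token; a higher temperature flattens the distribution."""
    tokens = list(logits)
    scaled = [logits[t] / temperature for t in tokens]
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]
    return rng.choices(tokens, weights=weights, k=1)[0]

# Hypothetical raw scores (logits) for the next token.
logits = {"fell": 2.0, "rose": 1.5, "closed": 0.5}

rng = random.Random(0)  # seeded only so the demo is repeatable
samples = [sample_next_token(logits, temperature=1.0, rng=rng) for _ in range(10)]
near_greedy = [sample_next_token(logits, temperature=0.01, rng=rng) for _ in range(10)]

print("temperature 1.0: ", samples)
print("temperature 0.01:", near_greedy)
```

At a very low temperature the highest-scoring token wins essentially every time, which is why lowering the temperature makes answers more repeatable.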

Context Windows

Every LLM has a context window — the maximum amount of text it can consider at once. Early models had context windows of ~4,000 tokens (roughly 3,000 words). Modern models handle 100,000 to 1,000,000+ tokens.

The context window is the model's working memory. Everything in the conversation (your messages, the model's responses, any documents you paste in) must fit within it. When a conversation outgrows the window, the oldest text has to be dropped or summarized, which is why models with small windows seemed to forget the earlier parts of a long conversation.
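As a sketch of why older text gets lost, here is one simple strategy an application might use: keep the most recent messages that fit within the token budget and drop the rest. The word-based token count is a crude stand-in for a real tokenizer, and the conversation and window size are invented.

```python
def count_tokens(text):
    """Crude stand-in for a real tokenizer: one token per whitespace-separated word."""
    return len(text.split())

def fit_to_window(messages, max_tokens):
    """Keep the most recent messages whose combined length fits the budget."""
    kept, used = [], 0
    for message in reversed(messages):
        cost = count_tokens(message)
        if used + cost > max_tokens:
            break  # this message and everything older is dropped
        kept.append(message)
        used += cost
    return list(reversed(kept))

# Hypothetical conversation and an artificially tiny window.
conversation = [
    "My name is Ada and I live in Lisbon",
    "Nice to meet you Ada",
    "What is the weather like in winter",
    "Where do I live",
]
visible = fit_to_window(conversation, max_tokens=16)
print(visible)
```

Here the first message no longer fits, so a model shown only `visible` has nothing to go on when answering the final question.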

Parameters: What "70B" Means

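You'll often see models described by their parameter count, as in Llama 3 70B or Mistral 7B, where "B" stands for billion. (Some vendors, including OpenAI for GPT-4, don't publish the figure.) Parameters are the numeric values adjusted during training that encode the model's learned knowledge.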

More parameters generally means more capacity to learn complex patterns, but also more memory and compute required to run the model. A 7B model can run on a well-equipped laptop, especially when its weights are compressed (quantized). A 70B model typically needs a server with one or more high-memory GPUs. The largest models require data center infrastructure.
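The hardware gap follows from simple arithmetic: the memory needed for the weights is roughly the parameter count times the bytes stored per parameter. The sketch below assumes two common precisions, 16-bit and 4-bit, and counts the weights only. Running a model also needs memory for activations and cached context, which this ignores.

```python
def weight_memory_gb(num_parameters, bits_per_parameter):
    """Memory for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_parameters * bits_per_parameter / 8 / 1e9

for name, params in [("7B", 7e9), ("70B", 70e9)]:
    for bits in (16, 4):
        print(f"{name} at {bits}-bit: ~{weight_memory_gb(params, bits):.1f} GB")
```

Under these assumptions a 7B model needs about 14 GB at 16-bit and 3.5 GB at 4-bit, while a 70B model needs about 140 GB at 16-bit, more than a typical single GPU holds.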

LLMs and Retrieval-Augmented Generation (RAG)

Because LLMs have a knowledge cutoff and can hallucinate, a common pattern is to give them access to a retrieval system: search relevant documents, inject them into the context, then ask the model to answer based on those documents.

This is called Retrieval-Augmented Generation (RAG) — the model's generation is augmented by retrieved facts. It's how many AI assistants answer questions about recent events or internal company documents.
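The search, inject, ask pattern can be sketched in a few lines of Python. Real systems typically retrieve with embedding-based vector search; the word-overlap score here is a deliberately crude stand-in, and the documents, function names, and prompt wording are all invented for illustration. The final step, sending the prompt to a model, is not shown.

```python
import re

def words(text):
    """Lowercase word set used for a crude relevance score."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(question, documents, top_k=1):
    """Rank documents by how many words they share with the question."""
    q = words(question)
    ranked = sorted(documents, key=lambda d: len(q & words(d)), reverse=True)
    return ranked[:top_k]

def build_prompt(question, documents):
    """Inject the retrieved text into the context ahead of the question."""
    context = "\n".join(f"- {doc}" for doc in retrieve(question, documents))
    return (
        "Answer using only the context below.\n"
        f"Context:\n{context}\n"
        f"Question: {question}"
    )

# Hypothetical internal documents, invented for illustration.
documents = [
    "Expense reports are due on the fifth business day of each month.",
    "The office kitchen is cleaned every Friday afternoon.",
    "New laptops are issued by the IT help desk on the second floor.",
]
question = "When are expense reports due?"
prompt = build_prompt(question, documents)
print(prompt)
```

Because the answer sits in the prompt itself, the model can draw on a fact that was never in its training data.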

The Bottom Line

Large language models are trained on massive amounts of text to predict what comes next — and that simple objective, scaled up enormously, produces systems capable of coherent conversation, code generation, reasoning, and more. They're genuinely useful, but they also hallucinate, lack real-time knowledge, and reflect the biases of their training data. The most effective way to use them is to understand both what they do well and where they're likely to fail.