01 — Fundamentals
The fundamentals
Before you write a single line of AI code, get comfortable with these concepts. They come up everywhere: in pricing, debugging, architecture, and conversation.
LLM
The neural network that predicts text — and much more
A Large Language Model is a neural network trained on massive amounts of text to predict what token comes next. That's the whole objective — yet at scale, this simple task produces emergent reasoning, coding, translation, and summarization. It doesn't store facts like a database; it compresses patterns from training into billions of parameters.
examples
Claude, GPT-4, Gemini, LLaMA 3, Mistral — all LLMs.
Training data: Common Crawl, books, GitHub, Wikipedia, etc.
Scale: 7B → 405B parameters across modern open/closed models.
Think less "search engine" and more "a person who has internalized nearly all human-written text and can generate fluent continuations of any prompt." The capabilities are impressive, and the failure modes are just as distinctive.
Token
The atomic unit of text that LLMs actually process
LLMs don't process characters or words — they process tokens. A token is a chunk of text, typically ~4 characters or a common subword. "unbelievable" splits into 4 tokens; "AI" is 1. Pricing, context limits, and generation speed are all measured in tokens.
LLMs see token IDs, not characters. This is why they sometimes fail at spelling, letter counting, and word puzzles — they never "see" individual characters the way humans do.
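You can see tokenization directly by running a real tokenizer. A minimal sketch using tiktoken, OpenAI's open-source tokenizer library; exact splits differ between models, so treat the counts as illustrative:

code sketch (Python)
# pip install tiktoken
import tiktoken

# cl100k_base is the encoding used by GPT-4-era OpenAI models.
# Claude, LLaMA, and others use different tokenizers with different splits.
enc = tiktoken.get_encoding("cl100k_base")

for text in ["unbelievable", "AI", "The quick brown fox"]:
    ids = enc.encode(text)                    # text -> list of token IDs
    pieces = [enc.decode([i]) for i in ids]   # each ID back to its text chunk
    print(f"{text!r} -> {len(ids)} tokens: {pieces}")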
Context window
The model's working memory — everything it can "see" at once
The context window is the maximum tokens an LLM can process in a single call. Your entire request — system prompt, conversation history, documents, and tool results — must fit within this limit. The model has no memory outside of it.
context window budget (example)
system 10% · history 30% · documents 40% · available 20%
When context overflows, earlier messages are dropped. The model literally forgets them. Managing what goes into context is one of the most important engineering skills in production AI systems.
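A common defense is a token budget: always keep the system prompt, then keep the newest turns until the budget is spent. A minimal sketch, assuming a rough characters-per-token estimate (a production system would count with the model's actual tokenizer):

code sketch (Python)
def count_tokens(text: str) -> int:
    # Rough heuristic: ~4 characters per token. Use the model's real
    # tokenizer for accurate counts in production.
    return max(1, len(text) // 4)

def fit_to_context(system: str, history: list[str], budget: int) -> list[str]:
    """Drop the OLDEST turns until everything fits the token budget.
    The system prompt is always kept."""
    used = count_tokens(system)
    kept: list[str] = []
    for turn in reversed(history):  # walk newest-first so recent turns survive
        cost = count_tokens(turn)
        if used + cost > budget:
            break
        kept.append(turn)
        used += cost
    return [system] + list(reversed(kept))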
Temperature
Controls randomness — 0 is deterministic, higher is creative
Temperature scales the probability distribution over next tokens before sampling. At 0, the model always picks the highest-probability token — same output every time. Higher values flatten the distribution, making less likely tokens more probable — outputs become varied, creative, and occasionally wrong.
Use 0–0.3 for code, data extraction, and factual Q&A. Use 0.7–1.0 for writing and brainstorming. Never exceed 1.2 in production — outputs become unreliable fast.
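Under the hood, temperature is simple arithmetic on the logits before sampling. A minimal sketch over a toy five-token vocabulary, using numpy:

code sketch (Python)
import numpy as np

def sample_with_temperature(logits: np.ndarray, temperature: float) -> int:
    """Scale logits by 1/T, softmax into probabilities, sample one token ID."""
    if temperature == 0:
        return int(np.argmax(logits))   # greedy: always the top token
    scaled = logits / temperature       # T < 1 sharpens, T > 1 flattens
    scaled -= scaled.max()              # subtract max for numerical stability
    probs = np.exp(scaled) / np.exp(scaled).sum()
    return int(np.random.choice(len(logits), p=probs))

logits = np.array([4.0, 3.5, 2.0, 1.0, 0.5])   # toy 5-token vocabulary
for T in [0.0, 0.3, 1.0, 1.5]:
    picks = [sample_with_temperature(logits, T) for _ in range(1000)]
    print(T, np.bincount(picks, minlength=5) / 1000)   # sampled distribution

At T=0 the histogram collapses onto a single token; by T=1.5 the low-probability tail tokens start showing up regularly.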
Embedding
Dense vector representation of meaning in high-dimensional space
An embedding is a list of numbers (a vector) that represents the semantic meaning of text. Text with similar meaning produces vectors that are mathematically "close" — measurable via cosine similarity. Embeddings power semantic search, clustering, and RAG systems.
example
"I love dogs" → [0.21, -0.84, 0.33, ... 1536 dims]
"I enjoy pets" → [0.19, -0.81, 0.35, ... 1536 dims]
cosine_similarity = 0.97 (very close)

"The stock market crashed" → much further away
Embeddings explain why "automobile" and "car" return the same search results. They're the foundational technology behind RAG and semantic search. Every serious AI app eventually touches them.
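Cosine similarity itself is a one-liner. The vectors below are made up for illustration; real embeddings come from an embedding model and have hundreds or thousands of dimensions:

code sketch (Python)
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """cos(theta) = (a . b) / (|a| * |b|); 1.0 means same direction."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 4-dim vectors standing in for real 1536-dim embeddings:
dogs  = np.array([0.21, -0.84, 0.33, 0.10])
pets  = np.array([0.19, -0.81, 0.35, 0.12])
stock = np.array([-0.70, 0.20, -0.50, 0.90])

print(cosine_similarity(dogs, pets))    # near 1.0: similar meaning
print(cosine_similarity(dogs, stock))   # much lower: unrelated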
Hallucination
Confident, fluent, and factually wrong — the core risk of LLMs
Hallucination is when a model generates plausible-sounding but false information — and does so confidently. It's not a bug or a lie; it's what happens when a model optimizes for fluency without a truth-verification mechanism. It's worse with obscure topics, specific numbers, citations, and recent events.
example
Ask an LLM about an obscure API → it invents
method signatures that don't exist, confidently.

Ask for academic citations → it fabricates
author names, titles, and journal volumes.
Hallucinations happen because LLMs complete patterns — they don't look things up. The fix is grounding: RAG, tool use, and output verification. Never trust LLM outputs on factual details without a retrieval layer.
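A minimal sketch of that grounding pattern: constrain the model to retrieved sources and give it an explicit way out. The instruction wording and the retrieved_passages input are illustrative; in a real system the passages come from a retriever:

code sketch (Python)
def grounded_prompt(question: str, retrieved_passages: list[str]) -> str:
    """Build a prompt that pins the model to retrieved sources."""
    context = "\n\n".join(
        f"[{i + 1}] {p}" for i, p in enumerate(retrieved_passages)
    )
    return (
        "Answer ONLY from the sources below. Cite sources like [1]. "
        'If the answer is not in the sources, say "I don\'t know."\n\n'
        f"Sources:\n{context}\n\nQuestion: {question}"
    )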
Inference
Running the model — one forward pass per generated token
Inference is running a trained model to generate output. Unlike training (which updates weights), inference uses frozen weights. Every API call you make triggers an inference: your tokens flow through billions of parameters in a forward pass, and the model outputs the next token — repeatedly until a stop condition.
what happens per token
Input tokens → attention layers → feed-forward layers
→ logit scores for all ~100k vocab tokens → sample
→ one token out → repeat until [EOS] or max_tokens
Inference is expensive at scale — this is why providers charge per token and why KV-caching, quantization, and batching are serious engineering concerns in production. Speed and cost are inference problems.
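The decoding loop itself is short. This sketch replaces the real network with random logits just to show the shape of the loop (one forward pass per new token, repeated until a stop condition):

code sketch (Python)
import numpy as np

VOCAB = 100   # toy vocabulary size (real models: ~100k)
EOS = 0       # token ID that means "stop generating"
rng = np.random.default_rng(0)

def forward_pass(token_ids: list[int]) -> np.ndarray:
    # Stand-in for the real network. A real forward pass runs the input
    # through attention and feed-forward layers to produce these logits.
    return rng.normal(size=VOCAB)

def generate(prompt_ids: list[int], max_tokens: int = 20) -> list[int]:
    out = list(prompt_ids)
    for _ in range(max_tokens):            # one forward pass per new token
        logits = forward_pass(out)
        next_id = int(np.argmax(logits))   # greedy sampling, for brevity
        out.append(next_id)
        if next_id == EOS:                 # stop condition
            break
    return out

print(generate([42, 7]))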
Parameters / weights
Billions of learned numbers that encode everything the model knows
Parameters (or weights) are the billions of floating-point numbers in an LLM — adjusted during training to minimize prediction error. They are the model. All knowledge, reasoning patterns, and language ability are compressed into these numbers. More parameters generally mean more capability, but also more compute and cost.
scale reference
GPT-2: 1.5B
LLaMA 3.1 8B: 8B
LLaMA 3.1 70B: 70B
GPT-4: ~1.8T (est.)

Small models run on laptops.
Large models need GPU clusters.
You never interact with parameters directly — they're baked in. But their count affects capability, speed, and cost. Smaller models are faster and cheaper but less capable. Model selection is always a capability-cost tradeoff.
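A back-of-envelope memory estimate makes the tradeoff concrete: weights-only memory is roughly parameter count × bytes per parameter (2 for fp16, ~0.5 for int4). Real serving needs extra headroom for the KV-cache and activations:

code sketch (Python)
def model_memory_gb(n_params: float, bytes_per_param: float) -> float:
    """Weights-only estimate; serving adds KV-cache and activation memory."""
    return n_params * bytes_per_param / 1e9

for name, n in [("LLaMA 3.1 8B", 8e9), ("LLaMA 3.1 70B", 70e9)]:
    print(name,
          f"fp16: {model_memory_gb(n, 2):.0f} GB,",    # 2 bytes per weight
          f"int4: {model_memory_gb(n, 0.5):.0f} GB")   # ~0.5 bytes per weight

That ~16 GB fp16 footprint for an 8B model is why small models fit on a single high-end GPU or laptop, while a 70B model at fp16 needs around 140 GB spread across a cluster.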