module 01
The fundamentals
Before writing a single line of AI code, these concepts come up everywhere — in pricing, debugging, architecture, and conversation.
A Large Language Model is a neural network trained on massive amounts of text to predict what token comes next. That's the whole objective — yet at scale, this simple task produces emergent reasoning, coding, translation, and summarization. It doesn't store facts like a database; it compresses patterns from training into billions of parameters.
examples
Claude, GPT-4, Gemini, LLaMA 3, Mistral — all LLMs.
Training data: Common Crawl, books, GitHub, Wikipedia, etc.
Scale: 7B → 405B parameters across modern open/closed models.
Think less "search engine" and more "a person who has internalized nearly all human-written text and can generate fluent continuations of any prompt." The capability is impressive, and the failure modes are just as distinctive.
LLMs don't process characters or words — they process tokens. A token is a chunk of text, typically ~4 characters or a common subword. A word like "unbelievable" splits into several subword tokens, while "AI" is usually a single token (exact splits vary by tokenizer). Pricing, context limits, and generation speed are all measured in tokens.
LLMs see token IDs, not characters. This is why they sometimes fail at spelling, letter counting, and word puzzles — they never "see" individual characters the way humans do.
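To see this concretely, here is a minimal sketch using the tiktoken library (an assumption; any BPE tokenizer behaves similarly, though exact splits differ).

sketch (python)
import tiktoken  # assumes `pip install tiktoken`

enc = tiktoken.get_encoding("cl100k_base")  # one common BPE vocabulary; others split differently
ids = enc.encode("unbelievable")
print(ids)       # a short list of integer token IDs: this is all the model ever sees
print(len(ids))  # token count, far smaller than the 12-character count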
The context window is the maximum tokens an LLM can process in a single call. Your entire request — system prompt, conversation history, documents, and tool results — must fit within this limit. The model has no memory outside of it.
context window layout
[ system prompt | conversation history | documents | available space ]
When context overflows, earlier messages are dropped. The model literally forgets them. Managing what goes into context is one of the most important engineering skills in production AI systems.
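Managing that budget takes only a few lines. A minimal sketch, assuming a rough count_tokens() heuristic (production systems use the provider's real tokenizer): drop the oldest turns first so the system prompt and the newest messages always survive.

sketch (python)
def count_tokens(text: str) -> int:
    return len(text) // 4  # rough heuristic (~4 chars/token); use a real tokenizer in production

def fit_to_window(system_prompt: str, history: list[str], limit: int) -> list[str]:
    """Keep the system prompt plus as many recent turns as fit in the window."""
    budget = limit - count_tokens(system_prompt)
    kept: list[str] = []
    for turn in reversed(history):      # walk newest to oldest
        cost = count_tokens(turn)
        if cost > budget:
            break                       # older turns are dropped: the model forgets them
        kept.insert(0, turn)
        budget -= cost
    return kept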
Temperature scales the probability distribution over next tokens before sampling. At 0, the model always picks the highest-probability token — same output every time. Higher values flatten the distribution, making less likely tokens more probable — outputs become varied, creative, and occasionally wrong.
Use 0–0.3 for code, data extraction, and factual Q&A. Use 0.7–1.0 for writing and brainstorming. Never exceed 1.2 in production — outputs become unreliable fast.
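Under the hood, temperature is just a division applied to the logits before softmax. A self-contained sketch:

sketch (python)
import math

def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """Scale logits by 1/T, then softmax. As T approaches 0 this approaches greedy
    decoding (in practice T=0 is special-cased to argmax rather than divided)."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)                           # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

print(softmax_with_temperature([2.0, 1.0, 0.5], 0.2))  # sharply peaked: near-deterministic
print(softmax_with_temperature([2.0, 1.0, 0.5], 1.5))  # flattened: more variety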
An embedding is a list of numbers (a vector) that represents the semantic meaning of text. Text with similar meaning produces vectors that are mathematically "close" — measurable via cosine similarity. Embeddings power semantic search, clustering, and RAG systems.
example
"I love dogs" → [0.21, -0.84, 0.33, ... 1536 dims]
"I enjoy pets" → [0.19, -0.81, 0.35, ... 1536 dims]
cosine_similarity = 0.97 (very close)
"The stock market crashed" → much further away
"I enjoy pets" → [0.19, -0.81, 0.35, ... 1536 dims]
cosine_similarity = 0.97 (very close)
"The stock market crashed" → much further away
Embeddings explain why "automobile" and "car" return the same search results. They're the foundational technology behind RAG and semantic search. Every serious AI app eventually touches them.
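The "mathematically close" measure is itself a one-liner. A sketch with toy 4-dimensional vectors (real embeddings run hundreds to thousands of dimensions):

sketch (python)
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)       # 1.0 = same direction, near 0 = unrelated

dogs = [0.21, -0.84, 0.33, 0.10]   # toy vectors, not real model output
pets = [0.19, -0.81, 0.35, 0.12]
print(cosine_similarity(dogs, pets))  # close to 1.0: semantically similar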
Hallucination is when a model generates plausible-sounding but false information — and does so confidently. It's not a bug or a lie; it's what happens when a model optimizes for fluency without a truth-verification mechanism. It's worse with obscure topics, specific numbers, citations, and recent events.
example
Ask an LLM about an obscure API → it invents
method signatures that don't exist, confidently.
Ask for academic citations → it fabricates
author names, titles, and journal volumes.
Hallucinations happen because LLMs complete patterns — they don't look things up. The fix is grounding: RAG, tool use, and output verification. Never trust LLM outputs on factual details without a retrieval layer.
Inference is running a trained model to generate output. Unlike training (which updates weights), inference uses frozen weights. Every API call you make triggers an inference: your tokens flow through billions of parameters in a forward pass, and the model outputs the next token — repeatedly until a stop condition.
what happens per token
Input tokens → attention layers → feed-forward layers
→ logit scores for all ~100k vocab tokens → sample
→ one token out → repeat until [EOS] or max_tokens
Inference is expensive at scale — this is why providers charge per token and why KV-caching, quantization, and batching are serious engineering concerns in production. Speed and cost are inference problems.
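The per-token flow above maps directly to a loop. A sketch where model.forward() and sample() are hypothetical stand-ins for a real inference stack (which adds KV-caching and batching around this same structure):

sketch (python)
def generate(model, tokens: list[int], eos_id: int, max_tokens: int) -> list[int]:
    """Autoregressive decoding: one forward pass per generated token."""
    for _ in range(max_tokens):
        logits = model.forward(tokens)  # hypothetical: forward pass over frozen weights
        next_id = sample(logits)        # hypothetical: pick one token from the distribution
        tokens.append(next_id)          # feed it back in; this is why generation is sequential
        if next_id == eos_id:
            break                       # stop condition: end-of-sequence token
    return tokens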
Parameters (or weights) are the billions of floating-point numbers in an LLM — adjusted during training to minimize prediction error. They are the model. All knowledge, reasoning patterns, and language ability are compressed into these numbers. More parameters generally means more capability, but also more compute and cost.
scale reference
GPT-2: 1.5B
LLaMA 3.1 8B: 8B
LLaMA 3.1 70B: 70B
GPT-4: ~1.8T (est.)
Small models run on laptops.
Large models need GPU clusters.
You never interact with parameters directly — they're baked in. But their count affects capability, speed, and cost. Smaller models are faster and cheaper but less capable. Model selection is always a capability-cost tradeoff.
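The capability-cost tradeoff starts with simple arithmetic: weight memory is parameter count times bytes per parameter. A quick sketch (this ignores activations and KV cache, which add more on top):

sketch (python)
def weight_memory_gb(params_billions: float, bytes_per_param: float = 2) -> float:
    """2 bytes/param = fp16/bf16; 1 = int8 quantized; 4 = fp32."""
    return params_billions * bytes_per_param

print(weight_memory_gb(8))    # 16.0 GB  -> fits a single high-end GPU
print(weight_memory_gb(70))   # 140.0 GB -> needs multiple datacenter GPUs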
module 02
Prompting patterns
How you write your input shapes everything about the output. These patterns are the difference between a toy demo and a production system.
Prompt engineering is the practice of structuring inputs to reliably elicit desired outputs from an LLM. It covers instructions, context framing, format specification, persona assignment, and constraints. A well-crafted prompt often outperforms switching to a more expensive model.
weak vs strong prompt
Weak: "Summarize this."
Strong: "Summarize the following technical doc
in 3 bullet points for a non-technical exec.
Avoid jargon. Each bullet under 20 words.
Lead with business impact, not technical detail."
Strong: "Summarize the following technical doc
in 3 bullet points for a non-technical exec.
Avoid jargon. Each bullet under 20 words.
Lead with business impact, not technical detail."
Prompting is real engineering — not fluff. Specificity, positive constraints ("do X") over negative ones ("don't Y"), and format specification are the three levers that move quality the most.
The system prompt is a special input given to the model before the user conversation. It sets the model's role, behavior rules, output format, tool capabilities, and hard constraints. Users typically don't see it, but it governs every response the model produces.
example system prompt
You are a senior code reviewer at Stripe.
Only respond in JSON with shape: { issues: [], severity: "" }.
Flag security vulnerabilities first.
Never generate new code — only provide feedback.
Do not answer questions outside of code review.
In agentic systems, the system prompt is where you define the agent's identity, available tools, and behavioral constraints. It's an architectural decision, not just a cosmetic one.
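In code, the system prompt is just a top-level parameter on the API call. A minimal sketch using the Anthropic Python SDK (assumes `pip install anthropic` and an API key in the environment; the model name is illustrative):

sketch (python)
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
response = client.messages.create(
    model="claude-sonnet-4-20250514",  # illustrative model name
    max_tokens=1024,
    system="You are a senior code reviewer. Respond only in JSON: { issues: [], severity: \"\" }.",
    messages=[{"role": "user", "content": "Review this diff: ..."}],
)
print(response.content[0].text)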
Few-shot learning is providing input-output examples inside the prompt itself, without touching model weights. The model infers the transformation pattern from the examples and applies it to the new input. 3–5 good examples often achieve what fine-tuning achieves at a fraction of the cost.
example
Input: "I love this product!" → Sentiment: positive
Input: "Total waste of money." → Sentiment: negative
Input: "It's okay, nothing special." → Sentiment: neutral
Input: "Best purchase I've made this year." → ?
Input: "Total waste of money." → Sentiment: negative
Input: "It's okay, nothing special." → Sentiment: neutral
Input: "Best purchase I've made this year." → ?
Few-shot is often all you need before reaching for fine-tuning. Show 3–5 high-quality, representative examples and output quality jumps significantly — especially for structured extraction tasks.
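Because few-shot learning is pure prompt construction, it reduces to string assembly. A sketch building the sentiment classifier above:

sketch (python)
EXAMPLES = [
    ("I love this product!", "positive"),
    ("Total waste of money.", "negative"),
    ("It's okay, nothing special.", "neutral"),
]

def few_shot_prompt(new_input: str) -> str:
    """Prepend labeled examples so the model infers the pattern in-context."""
    shots = "\n".join(f'Input: "{text}" -> Sentiment: {label}' for text, label in EXAMPLES)
    return f'{shots}\nInput: "{new_input}" -> Sentiment:'

print(few_shot_prompt("Best purchase I've made this year."))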
Chain of Thought (CoT) prompting instructs the model to work through intermediate reasoning steps before producing its final answer. This dramatically improves accuracy on multi-step reasoning, math, and logic problems — because the model "shows its work" and can catch errors mid-generation.
without vs with CoT
Without: "15% of 847?" → "127.05" (sometimes wrong)
With CoT: "15% of 847? Let's compute:
10% of 847 = 84.7
5% of 847 = 42.35
Total = 84.7 + 42.35 = 127.05 ✓"
Just adding "Think step by step before answering" measurably improves accuracy on complex tasks. Extended thinking models (like Claude's extended thinking mode) take this further by using hidden reasoning tokens.
RAG combines a retrieval system (vector database + semantic search) with an LLM. When a query arrives, relevant documents are fetched from a vector DB and injected into the prompt context — giving the model current, private, and verifiable information to ground its response.
rag pipeline
User query → embed query → search vector DB
→ retrieve top-k documents → inject into context
→ LLM generates grounded answer with sources
Use cases: docs search, enterprise Q&A,
knowledge bases, support bots
RAG is the standard solution for three problems: private data access, knowledge newer than the model's training cutoff, and hallucination reduction. It's the first pattern to reach for when your app needs accurate, current, or specific information.
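In code, the pipeline compresses to a few lines once the pieces exist. A sketch where embed(), vector_db.search(), and llm() are hypothetical stand-ins for your embedding model, vector store, and chat model:

sketch (python)
def answer_with_rag(query: str, vector_db, k: int = 5) -> str:
    query_vec = embed(query)                      # hypothetical: same embedding space as the docs
    docs = vector_db.search(query_vec, top_k=k)   # hypothetical: nearest neighbors by similarity
    context = "\n\n".join(doc.text for doc in docs)
    prompt = (
        "Answer using ONLY the context below. Cite your sources.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )
    return llm(prompt)                            # hypothetical: grounded generation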
Fine-tuning takes a pre-trained model and continues training it on your domain-specific dataset, adjusting the weights to specialize in your task, style, or vocabulary. Unlike prompting (which operates at inference time), fine-tuning changes the model itself — which is powerful but expensive and requires good data.
example
Base model: LLaMA 3 8B (general purpose)
Fine-tune on: 10,000 customer support tickets
Result: model learns your company's tone,
product knowledge, and resolution patterns
Fine-tuning is often overkill. Most tasks that seem to require it can be solved with better prompting + RAG. Only fine-tune when: you have high-quality labeled data, prompting has genuinely hit its ceiling, and you need consistent style or format at scale.
module 03
Agentic engineering
Where LLMs stop being chatbots and start taking actions. Agents are LLMs operating in loops, using tools, and affecting the real world.
An AI agent is an LLM operating in a loop — it receives a goal, decides what to do, executes actions using tools, observes results, and continues until the goal is complete or it gives up. Unlike a chatbot (which responds once), an agent operates autonomously across multiple steps without human input at each one.
agent vs chatbot
Chatbot: User asks → model responds → done.
Agent: Goal given → search web → read pages
→ take notes → write draft → review draft
→ send email → done (5+ steps, no human input)
An agent is just: LLM + loop + tools + memory. The magic is orchestration. Most agent bugs are prompt bugs or tool-definition bugs — not model capability issues.
Every agent runs a loop: the model reasons about its current state (Think), decides on and executes a tool call (Act), and receives the result back into its context (Observe). This loop repeats until the agent decides the goal is complete or hits a stop condition.
the think → act → observe loop
1. Think: reason about current state and goal
2. Act: call a tool with structured arguments
3. Observe: result injected back into context
↩ repeat or finish
Every tool result goes back into the context window. The agent sees what it just did and can course-correct. This is fundamentally different from a one-shot generation — it's the foundation of agentic intelligence.
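The loop itself is surprisingly small. A sketch where llm() is a hypothetical stand-in whose reply exposes .text and .tool_call, and execute_tool() is the dispatch function sketched in the next concept:

sketch (python)
def run_agent(goal: str, tools: dict, max_steps: int = 10) -> str:
    messages = [{"role": "user", "content": goal}]
    for _ in range(max_steps):              # stop condition: a step budget
        reply = llm(messages, tools)        # Think: model reasons over the full context
        if reply.tool_call is None:
            return reply.text               # model decided the goal is complete
        result = execute_tool(reply.tool_call.name, reply.tool_call.arguments)  # Act
        messages.append({"role": "assistant", "content": reply.text})
        messages.append({"role": "tool", "content": result})  # Observe: result re-enters context
    return "stopped: step budget exhausted"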
Tools are functions defined in the API call that the model can invoke. You provide the name, description, and parameter schema — the model decides when and how to call them. The call is returned to your code; you execute it and pass the result back to the model.
tool definition (simplified)
{
  name: "get_weather",
  description: "Get current weather for a city",
  parameters: {
    city: { type: "string", required: true },
    units: { type: "enum", values: ["C","F"] }
  }
}
Well-defined tools are the key to capable agents. Clear names, precise descriptions, and typed parameters dramatically improve tool selection accuracy. Bad tool descriptions → bad agents. Treat tool definitions as APIs, not afterthoughts.
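On your side of the wire, executing a tool call is a dispatch-table lookup. A runnable sketch (the weather function is stubbed; a real tool would call an actual API):

sketch (python)
def get_weather(city: str, units: str = "C") -> str:
    return f"21°{units} in {city}"  # stub; a real implementation calls a weather API

TOOL_REGISTRY = {"get_weather": get_weather}

def execute_tool(name: str, arguments: dict) -> str:
    if name not in TOOL_REGISTRY:
        return f"error: unknown tool '{name}'"  # feed errors back; agents can course-correct
    return TOOL_REGISTRY[name](**arguments)     # unpack the model's structured arguments

print(execute_tool("get_weather", {"city": "Oslo", "units": "C"}))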
LLMs output text. Your application needs typed data. An output parser bridges this gap — it prompts the model to produce a specific structured format (JSON, XML, CSV) and then validates and parses the result into usable objects. Schema validation catches malformed outputs before they crash your pipeline.
pattern
Prompt: "...respond ONLY in JSON: { sentiment, score }"
Model output: '{ "sentiment": "positive", "score": 0.92 }'
Parser: JSON.parse() + Zod/Pydantic validation
Result: { sentiment: "positive", score: 0.92 }
Validate parsed output with Pydantic (Python) or Zod (TypeScript). Models occasionally produce malformed JSON, so always have fallback logic: retry with a stricter prompt or catch the parse error gracefully.
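A sketch of the validate-or-fallback pattern using Pydantic v2 (assumes `pip install pydantic`):

sketch (python)
from pydantic import BaseModel, ValidationError

class SentimentResult(BaseModel):
    sentiment: str
    score: float

raw = '{ "sentiment": "positive", "score": 0.92 }'     # raw model output
try:
    result = SentimentResult.model_validate_json(raw)  # parses AND validates the schema
    print(result.score)                                # typed access: 0.92
except ValidationError as err:
    print(f"malformed output: {err}")  # fallback: retry with a stricter prompt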
Grounding means ensuring model outputs are connected to real, verifiable sources — not generated from parametric memory alone. It combines RAG (retrieved documents), tool results (live data), and explicit citations in the output. Grounded models make errors detectable and traceable.
ungrounded vs grounded
Ungrounded: "The API returns status 200 on success."
(model may be hallucinating this)
Grounded: "Per docs.stripe.com/api fetched at 14:02,
the API returns 200 with a Charge object. [source]"
Grounding is the primary production solution to hallucination. If the model cites retrieved sources, wrong answers become detectable — users and systems can verify. Without grounding, you're trusting parametric memory.
Web search is a tool that lets an agent query a search engine and optionally fetch and read the resulting pages. It transforms a model with a static knowledge cutoff into a system with access to current, live information — stock prices, recent news, latest documentation, live data.
web search flow
User: "What's the current Claude API pricing?"
Agent: search("Claude API pricing 2025")
→ get results → fetch top URL
→ extract pricing table → inject into context
→ respond with current, cited data
Web search is the single highest-leverage tool you can give an agent. It's the difference between a static assistant frozen at its training cutoff and a dynamic system that can deal with current reality.
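As a tool implementation, the flow above is a thin wrapper. A sketch where search_api() and fetch_page() are hypothetical helpers around a search provider and an HTML-to-text fetch:

sketch (python)
def web_search(query: str, max_results: int = 3) -> str:
    """Search, fetch the top hits, and return text plus URLs so the model can cite sources."""
    results = search_api(query)[:max_results]      # hypothetical: ranked hits with a .url field
    pages = [fetch_page(r.url) for r in results]   # hypothetical: fetch and strip to plain text
    return "\n\n".join(f"[{r.url}]\n{page[:2000]}" for r, page in zip(results, pages))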
Actions are tool calls with real-world side effects: sending emails, writing files, calling APIs, executing code, creating tickets, updating databases. Unlike retrieval operations (read-only), actions are irreversible or difficult to undo — which makes their design and safety constraints critical.
read vs write actions
Read (safe): search_web(), read_file(), query_db()
Write (consequence): send_email(), write_file(),
create_ticket(), execute_code(), call_api(),
update_record(), delete_item()
Always separate retrieval actions from write actions in your code. For irreversible writes, implement human-in-the-loop confirmation. Agents make mistakes — mistakes with real-world consequences are expensive to undo.
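One way to enforce that split is at the dispatch layer. A sketch that wraps the execute_tool() dispatch from earlier; the set of write tools is illustrative:

sketch (python)
WRITE_TOOLS = {"send_email", "delete_item", "update_record"}  # illustrative set

def guarded_execute(name: str, arguments: dict) -> str:
    """Irreversible tools require explicit human confirmation before running."""
    if name in WRITE_TOOLS:
        answer = input(f"Agent wants to run {name}({arguments}). Allow? [y/N] ")
        if answer.lower() != "y":
            return "denied by human reviewer"  # the agent sees this and can re-plan
    return execute_tool(name, arguments)       # reuses the dispatch sketch above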
Agent memory has two layers: short-term (everything currently in the context window — active session) and long-term (external storage — vector DBs, key-value stores, files — that persists across sessions). Most production agents combine both: context for active reasoning, external storage for persistent knowledge retrieval.
memory architecture
Short-term: full conversation + tool results in context
→ limited, expensive, lost when session ends
Long-term: summaries + facts stored in vector DB
→ retrieved at session start via semantic search
→ unlimited, persistent, but adds retrieval latency and cost
Manage memory actively. Summarize old context before it truncates. Store important facts to long-term storage after each session. Context is expensive — information architecture is a core agent design skill, not an afterthought.
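A sketch of the summarize-and-evict pattern, reusing count_tokens() from the context-window sketch; llm_summarize() and long_term_store are hypothetical:

sketch (python)
def compact_history(messages: list[str], token_limit: int) -> list[str]:
    """When context nears the limit, compress old turns and persist the summary."""
    if sum(count_tokens(m) for m in messages) < token_limit * 0.8:
        return messages                           # plenty of headroom: do nothing
    old, recent = messages[:-10], messages[-10:]  # keep the newest turns verbatim
    summary = llm_summarize(old)                  # hypothetical: compress old turns
    long_term_store.save(summary)                 # hypothetical: persist across sessions
    return [f"(summary of earlier conversation) {summary}"] + recent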
Multi-agent systems use multiple specialized agents working in coordination. An orchestrator agent receives a complex goal, decomposes it into subtasks, routes each to a specialized worker agent, and aggregates the results. This enables parallelism, specialization, and tackling tasks too large for a single context window.
example architecture
Goal: "Competitor analysis report"
Orchestrator → Research Agent (web search)
→ Analysis Agent (data processing)
→ Writing Agent (report generation)
→ Aggregates → final report
Multi-agent systems amplify capabilities but also amplify failure modes. Each agent is an additional point of failure. Start with a single agent and only introduce orchestration when single-agent approaches genuinely hit their limits.
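The pattern in miniature, where each worker is itself the run_agent() loop sketched earlier; plan_subtasks() (an LLM call that decomposes the goal) and the worker tool registries are hypothetical:

sketch (python)
WORKER_TOOLS = {           # hypothetical tool registries per specialist
    "research": {"web_search": web_search},
    "analysis": {"execute_code": execute_code},
    "writing": {},
}

def orchestrate(goal: str) -> str:
    subtasks = plan_subtasks(goal)  # hypothetical: e.g. [{"worker": "research", "task": "..."}]
    results = []
    for sub in subtasks:
        tools = WORKER_TOOLS[sub["worker"]]            # route to the right specialist
        results.append(run_agent(sub["task"], tools))  # each worker runs its own agent loop
    report_goal = "Aggregate these findings into a report:\n\n" + "\n\n".join(results)
    return run_agent(report_goal, tools={})            # final aggregation pass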