module 01
The fundamentals
Before writing a single line of AI code, these concepts come up everywhere — in pricing, debugging, architecture, and conversation.
A Large Language Model is a neural network trained on massive amounts of text to predict what token comes next. That's the whole objective — yet at scale, this simple task produces emergent reasoning, coding, translation, and summarization. It doesn't store facts like a database; it compresses patterns from training into billions of parameters.
examples
Claude, GPT-4, Gemini, LLaMA 3, Mistral — all LLMs.
Training data: Common Crawl, books, GitHub, Wikipedia, etc.
Scale: 7B → 405B parameters across modern open/closed models.
Think less "search engine" and more "a person who has internalized nearly all human-written text and can generate fluent continuations of any prompt." The capability is impressive, and the failure modes are just as distinctive.
LLMs don't process characters or words — they process tokens. A token is a chunk of text, typically ~4 characters or a common subword. A word like "unbelievable" splits into several subword tokens, while "AI" is usually a single token (exact splits vary by tokenizer). Pricing, context limits, and generation speed are all measured in tokens.
LLMs see token IDs, not characters. This is why they sometimes fail at spelling, letter counting, and word puzzles — they never "see" individual characters the way humans do.
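To see this concretely, here is a minimal sketch using the tiktoken library (an assumption; any BPE tokenizer behaves similarly, though exact splits differ).

sketch (python)
import tiktoken  # assumes `pip install tiktoken`

enc = tiktoken.get_encoding("cl100k_base")  # one common BPE vocabulary; others split differently
ids = enc.encode("unbelievable")
print(ids)       # a short list of integer token IDs: this is all the model ever sees
print(len(ids))  # token count, far smaller than the 12-character count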
The context window is the maximum tokens an LLM can process in a single call. Your entire request — system prompt, conversation history, documents, and tool results — must fit within this limit. The model has no memory outside of it.
context window layout
[ system prompt | conversation history | documents | available space ]
When context overflows, earlier messages are dropped. The model literally forgets them. Managing what goes into context is one of the most important engineering skills in production AI systems.
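Managing that budget takes only a few lines. A minimal sketch, assuming a rough count_tokens() heuristic (production systems use the provider's real tokenizer): drop the oldest turns first so the system prompt and the newest messages always survive.

sketch (python)
def count_tokens(text: str) -> int:
    return len(text) // 4  # rough heuristic (~4 chars/token); use a real tokenizer in production

def fit_to_window(system_prompt: str, history: list[str], limit: int) -> list[str]:
    """Keep the system prompt plus as many recent turns as fit in the window."""
    budget = limit - count_tokens(system_prompt)
    kept: list[str] = []
    for turn in reversed(history):      # walk newest to oldest
        cost = count_tokens(turn)
        if cost > budget:
            break                       # older turns are dropped: the model forgets them
        kept.insert(0, turn)
        budget -= cost
    return kept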
Temperature scales the probability distribution over next tokens before sampling. At 0, the model always picks the highest-probability token — same output every time. Higher values flatten the distribution, making less likely tokens more probable — outputs become varied, creative, and occasionally wrong.
Use 0–0.3 for code, data extraction, and factual Q&A. Use 0.7–1.0 for writing and brainstorming. Never exceed 1.2 in production — outputs become unreliable fast.
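Under the hood, temperature is just a division applied to the logits before softmax. A self-contained sketch:

sketch (python)
import math

def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """Scale logits by 1/T, then softmax. As T approaches 0 this approaches greedy
    decoding (in practice T=0 is special-cased to argmax rather than divided)."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)                           # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

print(softmax_with_temperature([2.0, 1.0, 0.5], 0.2))  # sharply peaked: near-deterministic
print(softmax_with_temperature([2.0, 1.0, 0.5], 1.5))  # flattened: more variety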
An embedding is a list of numbers (a vector) that represents the semantic meaning of text. Text with similar meaning produces vectors that are mathematically "close" — measurable via cosine similarity. Embeddings power semantic search, clustering, and RAG systems.
example
"I love dogs" → [0.21, -0.84, 0.33, ... 1536 dims]
"I enjoy pets" → [0.19, -0.81, 0.35, ... 1536 dims]
cosine_similarity = 0.97 (very close)
"The stock market crashed" → much further away
"I enjoy pets" → [0.19, -0.81, 0.35, ... 1536 dims]
cosine_similarity = 0.97 (very close)
"The stock market crashed" → much further away
Embeddings explain why "automobile" and "car" return the same search results. They're the foundational technology behind RAG and semantic search. Every serious AI app eventually touches them.
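The "mathematically close" measure is itself a one-liner. A sketch with toy 4-dimensional vectors (real embeddings run hundreds to thousands of dimensions):

sketch (python)
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)       # 1.0 = same direction, near 0 = unrelated

dogs = [0.21, -0.84, 0.33, 0.10]   # toy vectors, not real model output
pets = [0.19, -0.81, 0.35, 0.12]
print(cosine_similarity(dogs, pets))  # close to 1.0: semantically similar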
Hallucination is when a model generates plausible-sounding but false information — and does so confidently. It's not a bug or a lie; it's what happens when a model optimizes for fluency without a truth-verification mechanism. It's worse with obscure topics, specific numbers, citations, and recent events.
example
Ask an LLM about an obscure API → it invents
method signatures that don't exist, confidently.
Ask for academic citations → it fabricates
author names, titles, and journal volumes.
Hallucinations happen because LLMs complete patterns — they don't look things up. The fix is grounding: RAG, tool use, and output verification. Never trust LLM outputs on factual details without a retrieval layer.
Inference is running a trained model to generate output. Unlike training (which updates weights), inference uses frozen weights. Every API call you make triggers an inference: your tokens flow through billions of parameters in a forward pass, and the model outputs the next token — repeatedly until a stop condition.
what happens per token
Input tokens → attention layers → feed-forward layers
→ logit scores for all ~100k vocab tokens → sample
→ one token out → repeat until [EOS] or max_tokens
Inference is expensive at scale — this is why providers charge per token and why KV-caching, quantization, and batching are serious engineering concerns in production. Speed and cost are inference problems.
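The per-token flow above maps directly to a loop. A sketch where model.forward() and sample() are hypothetical stand-ins for a real inference stack (which adds KV-caching and batching around this same structure):

sketch (python)
def generate(model, tokens: list[int], eos_id: int, max_tokens: int) -> list[int]:
    """Autoregressive decoding: one forward pass per generated token."""
    for _ in range(max_tokens):
        logits = model.forward(tokens)  # hypothetical: forward pass over frozen weights
        next_id = sample(logits)        # hypothetical: pick one token from the distribution
        tokens.append(next_id)          # feed it back in; this is why generation is sequential
        if next_id == eos_id:
            break                       # stop condition: end-of-sequence token
    return tokens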
Parameters (or weights) are the billions of floating-point numbers in an LLM — adjusted during training to minimize prediction error. They are the model. All knowledge, reasoning patterns, and language ability are compressed into these numbers. More parameters generally means more capability, but also more compute and cost.
scale reference
GPT-2: 1.5B
LLaMA 3.1 8B: 8B
LLaMA 3.1 70B: 70B
GPT-4: ~1.8T (est.)
Small models run on laptops.
Large models need GPU clusters.
You never interact with parameters directly — they're baked in. But their count affects capability, speed, and cost. Smaller models are faster and cheaper but less capable. Model selection is always a capability-cost tradeoff.
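The capability-cost tradeoff starts with simple arithmetic: weight memory is parameter count times bytes per parameter. A quick sketch (this ignores activations and KV cache, which add more on top):

sketch (python)
def weight_memory_gb(params_billions: float, bytes_per_param: float = 2) -> float:
    """2 bytes/param = fp16/bf16; 1 = int8 quantized; 4 = fp32."""
    return params_billions * bytes_per_param

print(weight_memory_gb(8))    # 16.0 GB  -> fits a single high-end GPU
print(weight_memory_gb(70))   # 140.0 GB -> needs multiple datacenter GPUs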
module 02
Prompting patterns
How you write your input shapes everything about the output. These patterns are the difference between a toy demo and a production system.
Prompt engineering is the practice of structuring inputs to reliably elicit desired outputs from an LLM. It covers instructions, context framing, format specification, persona assignment, and constraints. A well-crafted prompt often outperforms switching to a more expensive model.
weak vs strong prompt
Weak: "Summarize this."
Strong: "Summarize the following technical doc
in 3 bullet points for a non-technical exec.
Avoid jargon. Each bullet under 20 words.
Lead with business impact, not technical detail."
Strong: "Summarize the following technical doc
in 3 bullet points for a non-technical exec.
Avoid jargon. Each bullet under 20 words.
Lead with business impact, not technical detail."
Prompting is real engineering — not fluff. Specificity, positive constraints ("do X") over negative ones ("don't Y"), and format specification are the three levers that move quality the most.
The system prompt is a special input given to the model before the user conversation. It sets the model's role, behavior rules, output format, tool capabilities, and hard constraints. Users typically don't see it, but it governs every response the model produces.
example system prompt
You are a senior code reviewer at Stripe.
Only respond in JSON with shape: { issues: [], severity: "" }.
Flag security vulnerabilities first.
Never generate new code — only provide feedback.
Do not answer questions outside of code review.
In agentic systems, the system prompt is where you define the agent's identity, available tools, and behavioral constraints. It's an architectural decision, not just a cosmetic one.
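In code, the system prompt is just a top-level parameter on the API call. A minimal sketch using the Anthropic Python SDK (assumes `pip install anthropic` and an API key in the environment; the model name is illustrative):

sketch (python)
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
response = client.messages.create(
    model="claude-sonnet-4-20250514",  # illustrative model name
    max_tokens=1024,
    system="You are a senior code reviewer. Respond only in JSON: { issues: [], severity: \"\" }.",
    messages=[{"role": "user", "content": "Review this diff: ..."}],
)
print(response.content[0].text)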
Few-shot learning is providing input-output examples inside the prompt itself, without touching model weights. The model infers the transformation pattern from the examples and applies it to the new input. 3–5 good examples often achieve what fine-tuning achieves at a fraction of the cost.
example
Input: "I love this product!" → Sentiment: positive
Input: "Total waste of money." → Sentiment: negative
Input: "It's okay, nothing special." → Sentiment: neutral
Input: "Best purchase I've made this year." → ?
Input: "Total waste of money." → Sentiment: negative
Input: "It's okay, nothing special." → Sentiment: neutral
Input: "Best purchase I've made this year." → ?
Few-shot is often all you need before reaching for fine-tuning. Show 3–5 high-quality, representative examples and output quality jumps significantly — especially for structured extraction tasks.
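Because few-shot learning is pure prompt construction, it reduces to string assembly. A sketch building the sentiment classifier above:

sketch (python)
EXAMPLES = [
    ("I love this product!", "positive"),
    ("Total waste of money.", "negative"),
    ("It's okay, nothing special.", "neutral"),
]

def few_shot_prompt(new_input: str) -> str:
    """Prepend labeled examples so the model infers the pattern in-context."""
    shots = "\n".join(f'Input: "{text}" -> Sentiment: {label}' for text, label in EXAMPLES)
    return f'{shots}\nInput: "{new_input}" -> Sentiment:'

print(few_shot_prompt("Best purchase I've made this year."))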
Chain of Thought (CoT) prompting instructs the model to work through intermediate reasoning steps before producing its final answer. This dramatically improves accuracy on multi-step reasoning, math, and logic problems — because the model "shows its work" and can catch errors mid-generation.
without vs with CoT
Without: "15% of 847?" → "127.05" (sometimes wrong)
With CoT: "15% of 847? Let's compute:
10% of 847 = 84.7
5% of 847 = 42.35
Total = 84.7 + 42.35 = 127.05 ✓"
Just adding "Think step by step before answering" measurably improves accuracy on complex tasks. Extended thinking models (like Claude's extended thinking mode) take this further by using hidden reasoning tokens.
RAG combines a retrieval system (vector database + semantic search) with an LLM. When a query arrives, relevant documents are fetched from a vector DB and injected into the prompt context — giving the model current, private, and verifiable information to ground its response.
rag pipeline
User query → embed query → search vector DB
→ retrieve top-k documents → inject into context
→ LLM generates grounded answer with sources
Use cases: docs search, enterprise Q&A,
knowledge bases, support bots
RAG is the standard solution for three problems: private data access, knowledge newer than the model's training cutoff, and hallucination reduction. It's the first pattern to reach for when your app needs accurate, current, or specific information.
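In code, the pipeline compresses to a few lines once the pieces exist. A sketch where embed(), vector_db.search(), and llm() are hypothetical stand-ins for your embedding model, vector store, and chat model:

sketch (python)
def answer_with_rag(query: str, vector_db, k: int = 5) -> str:
    query_vec = embed(query)                      # hypothetical: same embedding space as the docs
    docs = vector_db.search(query_vec, top_k=k)   # hypothetical: nearest neighbors by similarity
    context = "\n\n".join(doc.text for doc in docs)
    prompt = (
        "Answer using ONLY the context below. Cite your sources.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )
    return llm(prompt)                            # hypothetical: grounded generation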
Fine-tuning takes a pre-trained model and continues training it on your domain-specific dataset, adjusting the weights to specialize in your task, style, or vocabulary. Unlike prompting (which operates at inference time), fine-tuning changes the model itself — which is powerful but expensive and requires good data.
example
Base model: LLaMA 3 8B (general purpose)
Fine-tune on: 10,000 customer support tickets
Result: model learns your company's tone,
product knowledge, and resolution patterns
Fine-tuning is often overkill. Most tasks that seem to require it can be solved with better prompting + RAG. Only fine-tune when: you have high-quality labeled data, prompting has genuinely hit its ceiling, and you need consistent style or format at scale.
module 03
Agentic engineering
Where LLMs stop being chatbots and start taking actions. Agents are LLMs operating in loops, using tools, and affecting the real world.
An AI agent is an LLM operating in a loop — it receives a goal, decides what to do, executes actions using tools, observes results, and continues until the goal is complete or it gives up. Unlike a chatbot (which responds once), an agent operates autonomously across multiple steps without human input at each one.
agent vs chatbot
Chatbot: User asks → model responds → done.
Agent: Goal given → search web → read pages
→ take notes → write draft → review draft
→ send email → done (5+ steps, no human input)
An agent is just: LLM + loop + tools + memory. The magic is orchestration. Most agent bugs are prompt bugs or tool-definition bugs — not model capability issues.
Every agent runs a loop: the model reasons about its current state (Think), decides on and executes a tool call (Act), and receives the result back into its context (Observe). This loop repeats until the agent decides the goal is complete or hits a stop condition.
the think → act → observe loop
1. Think: reason about current state and goal
2. Act: call a tool with structured arguments
3. Observe: result injected back into context
↩ repeat or finish
Every tool result goes back into the context window. The agent sees what it just did and can course-correct. This is fundamentally different from a one-shot generation — it's the foundation of agentic intelligence.
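The loop itself is surprisingly small. A sketch where llm() is a hypothetical stand-in whose reply exposes .text and .tool_call, and execute_tool() is the dispatch function sketched in the next concept:

sketch (python)
def run_agent(goal: str, tools: dict, max_steps: int = 10) -> str:
    messages = [{"role": "user", "content": goal}]
    for _ in range(max_steps):              # stop condition: a step budget
        reply = llm(messages, tools)        # Think: model reasons over the full context
        if reply.tool_call is None:
            return reply.text               # model decided the goal is complete
        result = execute_tool(reply.tool_call.name, reply.tool_call.arguments)  # Act
        messages.append({"role": "assistant", "content": reply.text})
        messages.append({"role": "tool", "content": result})  # Observe: result re-enters context
    return "stopped: step budget exhausted"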
Tools are functions defined in the API call that the model can invoke. You provide the name, description, and parameter schema — the model decides when and how to call them. The call is returned to your code; you execute it and pass the result back to the model.
tool definition (simplified)
{
  name: "get_weather",
  description: "Get current weather for a city",
  parameters: {
    city: { type: "string", required: true },
    units: { type: "enum", values: ["C","F"] }
  }
}
Well-defined tools are the key to capable agents. Clear names, precise descriptions, and typed parameters dramatically improve tool selection accuracy. Bad tool descriptions → bad agents. Treat tool definitions as APIs, not afterthoughts.
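On your side of the wire, executing a tool call is a dispatch-table lookup. A runnable sketch (the weather function is stubbed; a real tool would call an actual API):

sketch (python)
def get_weather(city: str, units: str = "C") -> str:
    return f"21°{units} in {city}"  # stub; a real implementation calls a weather API

TOOL_REGISTRY = {"get_weather": get_weather}

def execute_tool(name: str, arguments: dict) -> str:
    if name not in TOOL_REGISTRY:
        return f"error: unknown tool '{name}'"  # feed errors back; agents can course-correct
    return TOOL_REGISTRY[name](**arguments)     # unpack the model's structured arguments

print(execute_tool("get_weather", {"city": "Oslo", "units": "C"}))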
LLMs output text. Your application needs typed data. An output parser bridges this gap — it prompts the model to produce a specific structured format (JSON, XML, CSV) and then validates and parses the result into usable objects. Schema validation catches malformed outputs before they crash your pipeline.
pattern
Prompt: "...respond ONLY in JSON: { sentiment, score }"
Model output: '{ "sentiment": "positive", "score": 0.92 }'
Parser: JSON.parse() + Zod/Pydantic validation
Result: { sentiment: "positive", score: 0.92 }
Validate parsed output with Pydantic (Python) or Zod (TypeScript). Models occasionally produce malformed JSON, so always have fallback logic: retry with a stricter prompt or catch the parse error gracefully.
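A sketch of the validate-or-fallback pattern using Pydantic v2 (assumes `pip install pydantic`):

sketch (python)
from pydantic import BaseModel, ValidationError

class SentimentResult(BaseModel):
    sentiment: str
    score: float

raw = '{ "sentiment": "positive", "score": 0.92 }'     # raw model output
try:
    result = SentimentResult.model_validate_json(raw)  # parses AND validates the schema
    print(result.score)                                # typed access: 0.92
except ValidationError as err:
    print(f"malformed output: {err}")  # fallback: retry with a stricter prompt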
Grounding means ensuring model outputs are connected to real, verifiable sources — not generated from parametric memory alone. It combines RAG (retrieved documents), tool results (live data), and explicit citations in the output. Grounded models make errors detectable and traceable.
ungrounded vs grounded
Ungrounded: "The API returns status 200 on success."
(model may be hallucinating this)
Grounded: "Per docs.stripe.com/api fetched at 14:02,
the API returns 200 with a Charge object. [source]"
Grounding is the primary production solution to hallucination. If the model cites retrieved sources, wrong answers become detectable — users and systems can verify. Without grounding, you're trusting parametric memory.
Web search is a tool that lets an agent query a search engine and optionally fetch and read the resulting pages. It transforms a model with a static knowledge cutoff into a system with access to current, live information — stock prices, recent news, latest documentation, live data.
web search flow
User: "What's the current Claude API pricing?"
Agent: search("Claude API pricing 2025")
→ get results → fetch top URL
→ extract pricing table → inject into context
→ respond with current, cited data
Web search is the single highest-leverage tool you can give an agent. It's the difference between a static assistant frozen at its training cutoff and a dynamic system that can deal with current reality.
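As a tool implementation, the flow above is a thin wrapper. A sketch where search_api() and fetch_page() are hypothetical helpers around a search provider and an HTML-to-text fetch:

sketch (python)
def web_search(query: str, max_results: int = 3) -> str:
    """Search, fetch the top hits, and return text plus URLs so the model can cite sources."""
    results = search_api(query)[:max_results]      # hypothetical: ranked hits with a .url field
    pages = [fetch_page(r.url) for r in results]   # hypothetical: fetch and strip to plain text
    return "\n\n".join(f"[{r.url}]\n{page[:2000]}" for r, page in zip(results, pages))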
Actions are tool calls with real-world side effects: sending emails, writing files, calling APIs, executing code, creating tickets, updating databases. Unlike retrieval operations (read-only), actions are irreversible or difficult to undo — which makes their design and safety constraints critical.
read vs write actions
Read (safe): search_web(), read_file(), query_db()
Write (consequence): send_email(), write_file(),
create_ticket(), execute_code(), call_api(),
update_record(), delete_item()
Always separate retrieval actions from write actions in your code. For irreversible writes, implement human-in-the-loop confirmation. Agents make mistakes — mistakes with real-world consequences are expensive to undo.
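One way to enforce that split is at the dispatch layer. A sketch that wraps the execute_tool() dispatch from earlier; the set of write tools is illustrative:

sketch (python)
WRITE_TOOLS = {"send_email", "delete_item", "update_record"}  # illustrative set

def guarded_execute(name: str, arguments: dict) -> str:
    """Irreversible tools require explicit human confirmation before running."""
    if name in WRITE_TOOLS:
        answer = input(f"Agent wants to run {name}({arguments}). Allow? [y/N] ")
        if answer.lower() != "y":
            return "denied by human reviewer"  # the agent sees this and can re-plan
    return execute_tool(name, arguments)       # reuses the dispatch sketch above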
Agent memory has two layers: short-term (everything currently in the context window — active session) and long-term (external storage — vector DBs, key-value stores, files — that persists across sessions). Most production agents combine both: context for active reasoning, external storage for persistent knowledge retrieval.
memory architecture
Short-term: full conversation + tool results in context
→ limited, expensive, lost when session ends
Long-term: summaries + facts stored in vector DB
→ retrieved at session start via semantic search
→ unlimited, persistent, but adds retrieval latency and cost
Manage memory actively. Summarize old context before it truncates. Store important facts to long-term storage after each session. Context is expensive — information architecture is a core agent design skill, not an afterthought.
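A sketch of the summarize-and-evict pattern, reusing count_tokens() from the context-window sketch; llm_summarize() and long_term_store are hypothetical:

sketch (python)
def compact_history(messages: list[str], token_limit: int) -> list[str]:
    """When context nears the limit, compress old turns and persist the summary."""
    if sum(count_tokens(m) for m in messages) < token_limit * 0.8:
        return messages                           # plenty of headroom: do nothing
    old, recent = messages[:-10], messages[-10:]  # keep the newest turns verbatim
    summary = llm_summarize(old)                  # hypothetical: compress old turns
    long_term_store.save(summary)                 # hypothetical: persist across sessions
    return [f"(summary of earlier conversation) {summary}"] + recent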
Multi-agent systems use multiple specialized agents working in coordination. An orchestrator agent receives a complex goal, decomposes it into subtasks, routes each to a specialized worker agent, and aggregates the results. This enables parallelism, specialization, and tackling tasks too large for a single context window.
example architecture
Goal: "Competitor analysis report"
Orchestrator → Research Agent (web search)
→ Analysis Agent (data processing)
→ Writing Agent (report generation)
→ Aggregates → final report
Multi-agent systems amplify capabilities but also amplify failure modes. Each agent is an additional point of failure. Start with a single agent and only introduce orchestration when single-agent approaches genuinely hit their limits.
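The pattern in miniature, where each worker is itself the run_agent() loop sketched earlier; plan_subtasks() (an LLM call that decomposes the goal) and the worker tool registries are hypothetical:

sketch (python)
WORKER_TOOLS = {           # hypothetical tool registries per specialist
    "research": {"web_search": web_search},
    "analysis": {"execute_code": execute_code},
    "writing": {},
}

def orchestrate(goal: str) -> str:
    subtasks = plan_subtasks(goal)  # hypothetical: e.g. [{"worker": "research", "task": "..."}]
    results = []
    for sub in subtasks:
        tools = WORKER_TOOLS[sub["worker"]]            # route to the right specialist
        results.append(run_agent(sub["task"], tools))  # each worker runs its own agent loop
    report_goal = "Aggregate these findings into a report:\n\n" + "\n\n".join(results)
    return run_agent(report_goal, tools={})            # final aggregation pass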