Add foundational concepts and entities related to LLMs and AI agents

- Create context-window.md to explain the significance of context window size in LLMs.
- Add llm-scaling-laws.md detailing the empirical relationships between model performance and resources.
- Introduce retrieval-augmented-generation.md to describe RAG architecture and its advantages.
- Add entity pages for key figures and organizations: andrej-karpathy.md, anthropic.md, google-deepmind.md, openai.md, sam-altman.md.
- Create sources for foundational papers: attention-is-all-you-need.md, claude-model-card.md, gpt4-technical-report.md, react-paper.md.
- Synthesize insights on AI agent patterns and RAG vs fine-tuning in dedicated pages.
- Update index.md to include new entities and concepts.
- Log all activities related to the wiki's development in log.md.
doum1004
2026-04-13 00:05:30 -04:00
parent 51b4ce6ca7
commit b19bd2e408
25 changed files with 1008 additions and 6 deletions

.github/workflows/demo-viz.yml

@@ -0,0 +1,53 @@
name: Build Demo Visualization
on:
  push:
    branches: [main]
    paths:
      - 'test-wiki-page/**'
      - 'src/lib/templates.ts'
      - 'scripts/generate-viz-scripts.ts'
      - '.github/workflows/demo-viz.yml'
permissions:
  pages: write
  id-token: write
  contents: read
concurrency:
  group: pages
  cancel-in-progress: false
jobs:
  build-and-deploy:
    runs-on: ubuntu-latest
    environment:
      name: github-pages
      url: ${{ steps.deployment.outputs.page_url }}
    steps:
      - uses: actions/checkout@v4
      - uses: oven-sh/setup-bun@v2
        with:
          bun-version: latest
      - uses: actions/setup-node@v4
        with:
          node-version: 20
      - name: Generate viz scripts from templates
        run: bun run scripts/generate-viz-scripts.ts .viz-tmp
      - name: Build graph data
        env:
          WIKI_DIR: test-wiki-page/wiki
        run: node .viz-tmp/build-graph.cjs
      - name: Build site
        env:
          GITHUB_REPOSITORY: ${{ github.repository }}
        run: node .viz-tmp/build-site.cjs
      - name: Configure Pages
        uses: actions/configure-pages@v5
      - name: Upload artifact
        uses: actions/upload-pages-artifact@v3
        with:
          path: dist
      - name: Deploy to GitHub Pages
        id: deployment
        uses: actions/deploy-pages@v4

.gitignore

@@ -2,4 +2,5 @@
node_modules/
dist/
.viz-tmp/
*.tgz


@@ -78,6 +78,15 @@ docs/
  phase-3.md
  phase-4.md
  phase-5.md
scripts/
  generate-viz-scripts.ts    # Extracts viz build scripts from templates.ts (used by demo workflow)
test-wiki-page/
  wiki/                      # Example wiki pages for live demo on GitHub Pages
    index.md
    log.md
    concepts/
    sources/
    synthesis/
```
## Commands

README.md

@@ -11,6 +11,8 @@ A CLI tool for LLM agents to build and maintain personal knowledge bases.
Inspired by [Andrej Karpathy's LLM Wiki](https://gist.github.com/karpathy/442a6bf555914893e9891c11519de94f).
**[Live Demo](https://doum1004.github.io/llmwiki-cli/)** — interactive d3-force graph built from the example wiki in [`test-wiki-page/`](test-wiki-page/).
## Overview
The CLI is the hands -- it reads, writes, searches, and manages wiki files. The LLM is the brain -- it decides what to create, update, and connect.

scripts/generate-viz-scripts.ts

@@ -0,0 +1,8 @@
import { getBuildGraphScript, getBuildSiteScript } from "../src/lib/templates.ts";
import { writeFileSync, mkdirSync } from "fs";
const outDir = process.argv[2] || ".viz-tmp";
mkdirSync(outDir, { recursive: true });
writeFileSync(`${outDir}/build-graph.cjs`, getBuildGraphScript());
writeFileSync(`${outDir}/build-site.cjs`, getBuildSiteScript());
console.log(`Wrote build-graph.cjs and build-site.cjs to ${outDir}`);

src/lib/templates.ts

@@ -246,7 +246,7 @@ export function getBuildGraphScript(): string {
 const path = require("path");
 const WIKILINK_RE = /\\[\\[([^\\]|]+)(?:\\|[^\\]]+)?\\]\\]/g;
-const WIKI_DIR = "wiki";
+const WIKI_DIR = process.env.WIKI_DIR || "wiki";
 const OUT_DIR = "dist";
 function findMdFiles(dir) {
@@ -275,18 +275,20 @@ function stripFrontmatter(content) {
   return content.trim();
 }
+const wikiPrefix = WIKI_DIR.replace(/\\\\/g, "/").replace(/\\/$/, "") + "/";
 function resolveLink(target, allFiles) {
   const withMd = target.endsWith(".md") ? target : target + ".md";
   const candidates = allFiles.map((f) => f.replace(/\\\\/g, "/"));
   if (candidates.includes(withMd)) return withMd;
-  const withWiki = "wiki/" + withMd;
+  const withWiki = wikiPrefix + withMd;
   if (candidates.includes(withWiki)) return withWiki;
-  const dirs = ["wiki/entities", "wiki/concepts", "wiki/sources", "wiki/synthesis"];
-  for (const dir of dirs) {
-    const candidate = dir + "/" + withMd;
+  const subdirs = ["entities", "concepts", "sources", "synthesis"];
+  for (const sub of subdirs) {
+    const candidate = wikiPrefix + sub + "/" + withMd;
     if (candidates.includes(candidate)) return candidate;
   }
@@ -297,6 +299,13 @@ function resolveLink(target, allFiles) {
   return null;
 }
+function relDir(filePath) {
+  const rel = filePath.replace(/\\\\/g, "/");
+  const inner = rel.startsWith(wikiPrefix) ? rel.slice(wikiPrefix.length) : rel;
+  const first = inner.split("/")[0];
+  return inner.includes("/") ? first : "wiki";
+}
 const files = findMdFiles(WIKI_DIR);
 const nodes = [];
 const edges = [];
@@ -304,7 +313,7 @@ const edges = [];
 for (const file of files) {
   const content = fs.readFileSync(file, "utf-8");
   const relPath = file.replace(/\\\\/g, "/");
-  const dir = relPath.split("/")[1] || "wiki";
+  const dir = relDir(relPath);
   nodes.push({ id: relPath, title: extractTitle(content, file), dir, body: stripFrontmatter(content) });
   let match;

test-wiki-page/wiki/concepts/agent-loop.md

@@ -0,0 +1,55 @@
---
title: Agent Loop
created: 2024-02-10
updated: 2024-02-10
tags: [concept, agents, autonomy, architecture]
---
# Agent Loop
The agent loop is the core execution pattern of an autonomous LLM agent: a repeating cycle of **Observe → Think → Act** that continues until the agent reaches a goal or is stopped. Each iteration the agent receives new observations, reasons about them using [[chain-of-thought]], selects an action (tool call, message, or termination), and processes the result.
## Basic Structure
```
while not done:
observation = get_context() # current state, memory, tool results
thought = llm.think(observation) # chain-of-thought reasoning
action = llm.choose_action(thought) # tool call or final answer
result = execute(action) # run the tool
    memory.append((observation, thought, action, result))
```
## Key Components
### Memory
- **In-context**: Everything in the [[context-window]] — conversation, tool outputs, instructions
- **External**: Retrieved from stores via [[retrieval-augmented-generation]] — the wiki, vector DB, etc.
### Tools
The set of actions available to the agent. Common tools:
- Web search / browser
- Code interpreter / REPL
- File read/write (e.g. `wiki read`, `wiki write`)
- API calls
### Termination
The agent must know when to stop. Poor termination criteria lead to infinite loops or premature exits.
## ReAct Pattern
The most widely used agent loop variant is ReAct (Reasoning + Acting), from [[sources/react-paper]]. It interleaves natural-language reasoning traces with tool-call actions, making the agent's decisions inspectable.
## Failure Modes
| Failure | Cause | Mitigation |
|---------|-------|-----------|
| Hallucinated tool calls | Model invents non-existent tools | Strict function schema validation |
| Context overflow | Long loops fill the [[context-window]] | Summarize or compress history |
| Stuck in loop | No progress, keeps retrying | Max step limit + backoff |
| Over-planning | Too much thinking, too little acting | Temperature tuning, step limits |
> [!TIP]
> llmwiki-cli is designed to be a tool inside an agent loop: the agent calls `wiki search`, `wiki read`, and `wiki write` as actions, using the wiki as its external long-term memory.
See [[synthesis/ai-agent-patterns]] for patterns that have emerged in production agent systems.

test-wiki-page/wiki/concepts/chain-of-thought.md

@@ -0,0 +1,43 @@
---
title: Chain-of-Thought
created: 2024-02-10
updated: 2024-02-10
tags: [concept, prompting, reasoning, CoT]
source: https://arxiv.org/abs/2201.11903
---
# Chain-of-Thought (CoT)
Chain-of-thought prompting is a technique that elicits step-by-step reasoning from a language model by including examples that show the reasoning process — not just the final answer. Introduced by Wei et al. (Google Brain, 2022), it dramatically improves performance on multi-step reasoning tasks.
## Key Variants
| Variant | How | When to Use |
|---------|-----|-------------|
| Few-shot CoT | Include worked examples in prompt | Tasks with clear reasoning steps |
| Zero-shot CoT | Append "Let's think step by step" | Quick boost without example construction |
| Self-consistency | Sample multiple CoT paths, majority vote | High-stakes tasks requiring reliability |
| Tree of Thoughts | Branch reasoning into tree, search over paths | Very complex multi-step problems |
## Why It Works
Transformers process tokens sequentially. By forcing the model to generate intermediate reasoning steps before the final answer, CoT:
1. Allocates more computation (tokens) to hard steps
2. Externalizes working memory that would otherwise be compressed into hidden states
3. Creates a "scratchpad" that the model can condition on when generating later tokens
## Connection to Agents
Chain-of-thought is the cognitive substrate of [[agent-loop]]. When an agent uses a ReAct-style loop (see [[sources/react-paper]]), the "think" step is CoT reasoning — the agent writes out its plan before choosing an action.
```
Observation: search results for "Paris population"
Thought: The results show Paris has 2.1M city / 12M metro. I need the metro figure.
Action: answer("The Paris metro area has approximately 12 million people.")
```
> [!TIP]
> CoT is most effective on models with ≥ 100B parameters. On smaller models, it can actually hurt performance by generating plausible-sounding but incorrect reasoning chains.
> [!NOTE]
> [[openai]]'s o1 and o3 models (2024) internalize chain-of-thought as a latent "thinking" process before producing output — a productized version of explicit CoT prompting.

test-wiki-page/wiki/concepts/context-window.md

@@ -0,0 +1,43 @@
---
title: Context Window
created: 2024-01-20
updated: 2024-01-20
tags: [concept, architecture, inference, tokens]
---
# Context Window
The context window (also called context length or context limit) is the maximum number of tokens a language model can process in a single forward pass — encompassing both the input prompt and the generated output. Everything outside the context window is invisible to the model.
## Why It Matters
Context window size is one of the most practically important properties of a deployed LLM:
- **In-context learning**: The model can only use examples, documents, or conversation history that fits in the window
- **Long-document tasks**: Summarization, question-answering over books or codebases require large windows
- **Agent memory**: An [[agent-loop]] accumulates observations and history; a small context window means the agent forgets recent steps
- **RAG trade-off**: Small windows force reliance on [[retrieval-augmented-generation]]; large windows reduce the need to retrieve
## Historical Progression
| Model | Year | Context |
|-------|------|---------|
| GPT-3 | 2020 | 4K tokens |
| GPT-4 | 2023 | 8K–32K tokens |
| Claude 2 | 2023 | 100K tokens |
| Claude 3 | 2024 | 200K tokens |
| Gemini 1.5 Pro | 2024 | 1M tokens |
## Technical Constraints
Context window size is limited by:
1. **Quadratic attention complexity**: Standard self-attention scales as O(n²) in sequence length — doubling the context quadruples compute
2. **KV cache memory**: Each token in the context requires storing key-value pairs in GPU memory (estimated in the sketch below)
3. **Positional encoding generalization**: Models must be trained on long sequences to handle them well; RoPE and ALiBi help with generalization
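A back-of-envelope sketch of constraint 2, the KV cache. This is a minimal estimate; the layer, head, and precision figures are illustrative assumptions, not any particular model's specs:
```
# KV cache size = 2 (K and V) x layers x kv_heads x head_dim x bytes x tokens.
# All architecture numbers below are illustrative assumptions.
def kv_cache_bytes(tokens, layers=80, kv_heads=8, head_dim=128, dtype_bytes=2):
    return 2 * layers * kv_heads * head_dim * dtype_bytes * tokens

for ctx in (8_000, 32_000, 200_000, 1_000_000):
    print(f"{ctx:>9,} tokens -> {kv_cache_bytes(ctx) / 2**30:6.1f} GiB of KV cache")
```
Even with grouped-query attention, a context in the hundreds of thousands of tokens claims tens of GiB of GPU memory for the cache alone.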
## Implications for Knowledge Management
A large context window does not eliminate the need for a tool like llmwiki-cli. Even a 1M token window cannot hold months of accumulated research. See [[synthesis/why-context-window-matters]] for the full analysis.
> [!TIP]
> [[anthropic]]'s Claude 3 at 200K tokens can process roughly 150,000 words — the equivalent of a 500-page book — in a single prompt. This makes it practical for full-codebase analysis without chunking.

test-wiki-page/wiki/concepts/llm-scaling-laws.md

@@ -0,0 +1,43 @@
---
title: LLM Scaling Laws
created: 2024-01-15
updated: 2024-01-15
tags: [concept, scaling, training, compute]
source: https://arxiv.org/abs/2001.08361
---
# LLM Scaling Laws
Scaling laws describe the empirical relationship between model performance (measured as cross-entropy loss on held-out text) and three key resources: model parameters (N), training data tokens (D), and compute budget (C). The key finding is that performance improves as a smooth power law — predictably and reliably — as these quantities increase.
## Key Papers
1. **Kaplan et al. 2020** ("Scaling Laws for Neural Language Models") — [[openai]] researchers established the foundational relationships. Found that N and D should be scaled together, but suggested growing parameters faster than data.
2. **Hoffmann et al. 2022** ("Training Compute-Optimal Large Language Models", aka "Chinchilla") — [[google-deepmind]] researchers revised the Kaplan findings, showing models were being under-trained. The Chinchilla-optimal ratio is roughly **20 tokens per parameter**.
## Core Relationships
```
Loss ∝ N^(-α) (more parameters → lower loss)
Loss ∝ D^(-β) (more data → lower loss)
Loss ∝ C^(-γ) (more compute → lower loss)
```
where α ≈ 0.076, β ≈ 0.095, γ ≈ 0.050 (Kaplan et al.)
## Implications
- You can predict how good a model will be **before training it** if you know the compute budget
- Bigger models are more **sample-efficient** — they learn more per token
- There is an optimal allocation of compute between model size and training data for a fixed budget
- Performance improvements from scale have not shown signs of plateauing on standard benchmarks (as of 2024)
## Chinchilla Correction
Pre-Chinchilla models (GPT-3, PaLM, Gopher) were significantly over-parameterized relative to their training data. The Chinchilla paper showed that a 70B parameter model trained on 1.4T tokens (Chinchilla) outperformed a 280B model (Gopher) trained on far fewer tokens — using the same training compute with 4× fewer parameters.
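A minimal sketch of what compute-optimal allocation implies, assuming the common approximations C ≈ 6·N·D training FLOPs and the Chinchilla rule D ≈ 20·N (both are rules of thumb, not exact values from the paper):
```
# Solve C = 6*N*D with D = 20*N  =>  N = sqrt(C/120), D = 20*N.
import math

def chinchilla_optimal(flops):
    n_params = math.sqrt(flops / 120)
    return n_params, 20 * n_params

for c in (1e21, 1e23, 1e25):
    n, d = chinchilla_optimal(c)
    print(f"C = {c:.0e} FLOPs -> N = {n / 1e9:8.1f}B params, D = {d / 1e12:7.2f}T tokens")
```
Plugging in Chinchilla's approximate budget (≈ 5.9e23 FLOPs) recovers roughly 70B parameters and 1.4T tokens.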
> [!NOTE]
> [[andrej-karpathy]] frequently cites scaling laws as the core reason the field moved from hand-engineering features to scaling simple architectures. The [[sources/attention-is-all-you-need]] transformer is the architecture that made scaling practical.
> [!WARNING]
> Scaling laws apply to pre-training loss. They do not directly predict performance on specific downstream tasks, reasoning benchmarks, or instruction-following quality — those require additional alignment techniques.

test-wiki-page/wiki/concepts/retrieval-augmented-generation.md

@@ -0,0 +1,45 @@
---
title: Retrieval-Augmented Generation
created: 2024-02-15
updated: 2024-02-15
tags: [concept, RAG, retrieval, architecture]
source: https://arxiv.org/abs/2005.11401
---
# Retrieval-Augmented Generation (RAG)
Retrieval-Augmented Generation (RAG) is an architecture pattern that augments an LLM's generation with dynamically retrieved content from an external knowledge store. Instead of relying solely on knowledge encoded in model weights, the system retrieves relevant documents at inference time and injects them into the [[context-window]].
## How It Works
```
Query → Embed query → Search vector store → Retrieve top-k docs
→ Inject docs into prompt → LLM generates answer
```
1. **Indexing**: Documents are chunked and embedded into a vector store (e.g. Pinecone, Weaviate, pgvector)
2. **Retrieval**: At query time, the query is embedded and nearest-neighbor search finds relevant chunks
3. **Generation**: Retrieved chunks are inserted into the prompt; the LLM generates a response grounded in them (the full pipeline is sketched below)
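A minimal end-to-end sketch of the three stages, using a toy hashed bag-of-words `embed()` as a stand-in for a real embedding model and vector store:
```
import hashlib
import numpy as np

def embed(text, dim=256):
    # Toy embedding: hashed bag-of-words, unit-normalized.
    # A real system would call a learned embedding model here.
    v = np.zeros(dim)
    for word in text.lower().split():
        v[int(hashlib.md5(word.encode()).hexdigest(), 16) % dim] += 1.0
    n = np.linalg.norm(v)
    return v / n if n else v

docs = [
    "Claude 3 supports a 200K token context window.",
    "Chinchilla found roughly 20 training tokens per parameter is optimal.",
    "ReAct interleaves reasoning traces with tool calls.",
]
index = np.stack([embed(d) for d in docs])     # 1. Indexing
query = "how many tokens per parameter?"
scores = index @ embed(query)                  # 2. Retrieval (cosine similarity)
top = [docs[i] for i in np.argsort(scores)[::-1][:2]]
prompt = "Context:\n" + "\n".join(top) + f"\n\nQuestion: {query}"  # 3. Generation input
print(prompt)
```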
## Why RAG Exists
RAG solves two fundamental limitations of LLMs:
1. **Knowledge cutoff**: Weights are frozen at training time; RAG injects fresh information
2. **Context window limits**: You cannot fit an entire knowledge base in context; RAG selects what's relevant
## RAG vs Fine-Tuning
For keeping knowledge current, RAG is almost always preferred over fine-tuning. See [[synthesis/rag-vs-fine-tuning]] for the full comparison.
## Relevance to llmwiki-cli
llmwiki-cli functions as a lightweight structured RAG system for LLM agents:
- `wiki search` performs keyword retrieval from the wiki corpus
- `wiki read` injects the retrieved page into the agent's context
- The [[agent-loop]] can call these tools repeatedly to accumulate relevant knowledge
> [!NOTE]
> The original RAG paper (Lewis et al. 2020, Facebook AI) used a learned retriever (DPR) combined with BART for generation. Modern RAG systems typically use off-the-shelf embedding models and generative LLMs like GPT-4 or Claude.
> [!WARNING]
> RAG quality depends heavily on chunking strategy and embedding model quality. Naive chunking (fixed-size character splits) often breaks semantic units and hurts retrieval precision.

test-wiki-page/wiki/entities/andrej-karpathy.md

@@ -0,0 +1,35 @@
---
title: Andrej Karpathy
created: 2024-02-05
updated: 2024-02-05
tags: [person, researcher, OpenAI, Tesla]
source: https://karpathy.ai
---
# Andrej Karpathy
Andrej Karpathy is an AI researcher and educator known for his work on deep learning, computer vision, and large language models. He is one of the most effective communicators of AI concepts to practitioners.
## Career
| Period | Role |
|--------|------|
| 2011–2015 | PhD at Stanford (CS/Vision, Fei-Fei Li's lab) |
| 2015–2017 | Research Scientist at [[openai]] (founding team) |
| 2017–2022 | Sr. Director of AI at Tesla (Autopilot) |
| 2023–2024 | Research Scientist at [[openai]] (returned) |
| 2024– | Independent / EurekaLabs |
## Key Contributions
### nanoGPT
Karpathy's `nanoGPT` repository is one of the most widely studied clean implementations of the GPT architecture. It demystifies how transformer language models work from first principles — closely tied to [[llm-scaling-laws]] intuitions.
### Educational Content
His YouTube lecture series "Neural Networks: Zero to Hero" has become a canonical learning resource for practitioners wanting to understand how LLMs work from the ground up, covering backpropagation through to full GPT training.
### Tokenization Advocacy
Karpathy is an outspoken critic of subword tokenization as a source of model brittleness, arguing it creates unnecessary complexity that future models should eliminate.
> [!TIP]
> Karpathy's blog post "The Unreasonable Effectiveness of Recurrent Neural Networks" (2015) remains a landmark piece of ML writing even as the field has moved to transformers — worth reading for historical context on [[llm-scaling-laws]].

test-wiki-page/wiki/entities/anthropic.md

@@ -0,0 +1,41 @@
---
title: Anthropic
created: 2024-02-01
updated: 2024-02-01
tags: [company, LLM, AI-safety, research]
source: https://anthropic.com
---
# Anthropic
Anthropic is an AI safety company founded in 2021 by former [[openai]] researchers, led by Dario Amodei (CEO) and Daniela Amodei (President). The company focuses on building reliable, interpretable, and steerable AI systems.
## Founding and Background
Seven of Anthropic's eleven founders came from [[openai]], departing over concerns about the pace of capability development relative to safety work. This origin story shapes Anthropic's emphasis on alignment research alongside product development.
## Claude Model Family
Anthropic's flagship product line is Claude. See [[sources/claude-model-card]] for the full technical details.
| Version | Release | Context Window |
|---------|---------|----------------|
| Claude 1 | Mar 2023 | 9K tokens |
| Claude 2 | Jul 2023 | 100K tokens |
| Claude 3 Haiku/Sonnet/Opus | Mar 2024 | 200K tokens |
The dramatic expansion of the [[context-window]] — from 9K to 200K tokens — is a defining competitive advantage. See [[synthesis/why-context-window-matters]] for analysis.
## Constitutional AI
Anthropic's key alignment approach is **Constitutional AI (CAI)**: instead of relying entirely on human feedback, the model is trained with a set of principles ("constitution") to self-critique and revise outputs. This reduces dependence on human labelers for harmlessness training.
## Safety Research
Anthropic publishes significant interpretability research, including mechanistic interpretability work trying to understand what computations happen inside transformer layers.
> [!NOTE]
> Anthropic received $300M from Google in 2023, followed by a further $2B commitment, giving Google a minority stake. Amazon also invested up to $4B in late 2023.
> [!WARNING]
> Despite the safety focus, Anthropic still ships capable frontier models — the tension between capability and safety is ongoing and unresolved.

test-wiki-page/wiki/entities/google-deepmind.md

@@ -0,0 +1,37 @@
---
title: Google DeepMind
created: 2024-01-15
updated: 2024-01-15
tags: [company, LLM, research, Google]
source: https://deepmind.google
---
# Google DeepMind
Google DeepMind is the merged AI research division of Google, formed in April 2023 by combining Google Brain and DeepMind. It is led by Demis Hassabis (DeepMind co-founder) as CEO.
## History
- **DeepMind** (founded 2010, acquired by Google 2014) — famous for AlphaGo and AlphaFold
- **Google Brain** (founded 2011) — developed TensorFlow, pioneered large-scale neural net training; authors of [[sources/attention-is-all-you-need]]
- **Merger** (April 2023) — combined into Google DeepMind to compete more directly with [[openai]] and [[anthropic]]
## Key Contributions
### Transformer Architecture
Google Brain researchers authored the foundational "Attention Is All You Need" paper (2017), which introduced the transformer — now the basis for virtually all large language models. See [[sources/attention-is-all-you-need]].
### Scaling Research
Google was an early contributor to [[llm-scaling-laws]] research, publishing work on compute-optimal training (Chinchilla, 2022), which showed that many models were under-trained relative to their parameter count.
### Gemini
Gemini is Google DeepMind's frontier model family, competing directly with GPT-4 and Claude 3.
| Version | Release | Notes |
|---------|---------|-------|
| Gemini 1.0 | Dec 2023 | Ultra / Pro / Nano tiers |
| Gemini 1.5 Pro | Feb 2024 | 1M token context window |
| Gemini 1.5 Flash | May 2024 | Efficient, fast variant |
> [!NOTE]
> Gemini 1.5 Pro's 1 million token [[context-window]] is currently the largest available in a production model, enabling entirely new use cases like processing full codebases or hour-long videos.

test-wiki-page/wiki/entities/openai.md

@@ -0,0 +1,46 @@
---
title: OpenAI
created: 2024-01-15
updated: 2024-03-15
tags: [company, LLM, AGI, research]
source: https://openai.com
---
# OpenAI
OpenAI is an AI research company founded in December 2015 in San Francisco. Originally a non-profit, it restructured into a "capped-profit" model in 2019 to attract investment. Its stated mission is to ensure that artificial general intelligence benefits all of humanity.
## Key People
- [[sam-altman]] — CEO (returned after brief Nov 2023 board ouster)
- [[andrej-karpathy]] — founding member, returned as employee 2023–2024
- Greg Brockman — President and co-founder
- Ilya Sutskever — co-founder, Chief Scientist (departed 2024)
## Major Models
| Model | Release | Key Feature |
|-------|---------|-------------|
| GPT-3 | May 2020 | 175B params, few-shot learning |
| InstructGPT | Jan 2022 | RLHF alignment |
| ChatGPT | Nov 2022 | Conversational wrapper on GPT-3.5 |
| GPT-4 | Mar 2023 | Multimodal, major capability jump |
| GPT-4o | May 2024 | Native multimodal, faster, cheaper |
## Research Contributions
OpenAI pioneered the [[llm-scaling-laws]] paradigm with the 2020 "Scaling Laws for Neural Language Models" paper, establishing that model performance scales predictably with compute, parameters, and data. See also [[sources/gpt4-technical-report]] for capabilities benchmarking.
## Commercial Products
- **ChatGPT** — consumer product, 100M+ users in first two months
- **API** — developer access to GPT models
- **Copilot** (via Microsoft partnership) — integrated into Office, GitHub, Windows
> [!NOTE]
> OpenAI has a complex relationship with [[anthropic]]: several Anthropic founders (including Dario and Daniela Amodei) left OpenAI in 2021 over strategic and safety disagreements.
## Funding
- Microsoft invested $1B in 2019, $10B in 2023
- Valuation reached ~$80B by early 2024

test-wiki-page/wiki/entities/sam-altman.md

@@ -0,0 +1,33 @@
---
title: Sam Altman
created: 2024-01-20
updated: 2024-01-20
tags: [person, CEO, OpenAI]
source: https://en.wikipedia.org/wiki/Sam_Altman
---
# Sam Altman
Sam Altman is the CEO of [[openai]], which he has led since 2019. Before OpenAI, he was President of Y Combinator (2014–2019), one of the world's most influential startup accelerators.
## Role at OpenAI
Altman has been the primary public face of OpenAI and a key driver of its commercial strategy, including:
- The $10B Microsoft partnership
- Launch and rapid growth of ChatGPT
- Pushing development of GPT-4 and beyond
## November 2023 Board Crisis
In November 2023, the OpenAI board briefly fired Altman, citing concerns about his candor with the board. Within five days, he was reinstated following a staff revolt (nearly all employees threatened to quit) and investor pressure. The episode raised significant questions about OpenAI's governance structure.
> [!WARNING]
> The board crisis exposed deep tensions between [[openai]]'s non-profit roots and its commercial ambitions. A restructuring of governance followed in 2024.
## Views on AGI
Altman is a prominent advocate for the belief that AGI is achievable within a few years and that safety research must happen in parallel with capability development — a view that distinguishes him from more skeptical researchers but aligns with [[openai]]'s mission framing.
## Relationship with Anthropic
The departure of the Amodei team to found [[anthropic]] happened in part due to disagreements with Altman and others over strategy and safety. The two companies now compete directly.

test-wiki-page/wiki/index.md

@@ -0,0 +1,30 @@
# Index
## Entities
- [OpenAI](entities/openai.md) — AI research company behind GPT series and ChatGPT
- [Anthropic](entities/anthropic.md) — AI safety company behind Claude series
- [Google DeepMind](entities/google-deepmind.md) — Google's merged AI research division
- [Sam Altman](entities/sam-altman.md) — CEO of OpenAI
- [Andrej Karpathy](entities/andrej-karpathy.md) — AI researcher, former OpenAI/Tesla
## Concepts
- [LLM Scaling Laws](concepts/llm-scaling-laws.md) — Predictable performance improvements with compute, data, and parameters
- [Context Window](concepts/context-window.md) — Maximum token capacity of a model in one inference call
- [Retrieval-Augmented Generation](concepts/retrieval-augmented-generation.md) — Combining retrieval from external stores with LLM generation
- [Chain-of-Thought](concepts/chain-of-thought.md) — Prompting technique that elicits step-by-step reasoning
- [Agent Loop](concepts/agent-loop.md) — Observe → Think → Act cycle for autonomous LLM agents
## Sources
- [Attention Is All You Need](sources/attention-is-all-you-need.md) — Vaswani et al. 2017, transformer architecture paper
- [GPT-4 Technical Report](sources/gpt4-technical-report.md) — OpenAI 2023, GPT-4 capabilities and evaluations
- [Claude Model Card](sources/claude-model-card.md) — Anthropic 2024, Claude 3 model card and safety evals
- [ReAct Paper](sources/react-paper.md) — Yao et al. 2022, reasoning + acting in language models
## Synthesis
- [Why Context Window Size Matters](synthesis/why-context-window-matters.md) — Long context vs. RAG trade-offs and implications
- [RAG vs Fine-Tuning](synthesis/rag-vs-fine-tuning.md) — When to retrieve vs. when to train
- [AI Agent Patterns](synthesis/ai-agent-patterns.md) — Common architectural patterns emerging in production agent systems

test-wiki-page/wiki/log.md

@@ -0,0 +1,49 @@
# Activity Log
## [2024-01-15 09:00:00] init | Wiki initialized — domain: AI agents & LLMs
## [2024-01-15 09:15:00] ingest | attention-is-all-you-need — transformer architecture paper (Vaswani et al. 2017)
## [2024-01-15 09:30:00] ingest | openai entity page created
## [2024-01-15 09:35:00] ingest | google-deepmind entity page created
## [2024-01-15 09:40:00] ingest | llm-scaling-laws concept page created
## [2024-01-20 11:00:00] ingest | gpt4-technical-report — ingested OpenAI GPT-4 technical report
## [2024-01-20 11:20:00] ingest | context-window concept page created from GPT-4 report
## [2024-01-20 11:35:00] ingest | sam-altman entity page created
## [2024-01-22 14:00:00] query | searched "scaling laws compute optimal" — read llm-scaling-laws, attention-is-all-you-need
## [2024-02-01 10:00:00] ingest | anthropic entity page created
## [2024-02-01 10:20:00] ingest | claude-model-card — ingested Claude 3 model card
## [2024-02-05 15:00:00] ingest | andrej-karpathy entity page created
## [2024-02-10 09:00:00] ingest | react-paper — ingested ReAct paper (Yao et al. 2022)
## [2024-02-10 09:30:00] ingest | chain-of-thought concept page created
## [2024-02-10 09:45:00] ingest | agent-loop concept page created
## [2024-02-12 14:00:00] query | searched "agent tool use loop" — read agent-loop, chain-of-thought, react-paper
## [2024-02-15 10:00:00] ingest | retrieval-augmented-generation concept page created
## [2024-02-20 11:00:00] synthesis | why-context-window-matters — cross-cutting analysis of context vs retrieval
## [2024-02-25 14:00:00] synthesis | rag-vs-fine-tuning — comparison of retrieval and fine-tuning approaches
## [2024-03-01 09:00:00] synthesis | ai-agent-patterns — patterns emerging in production agent systems
## [2024-03-05 10:00:00] maintenance | ran wiki lint — fixed 2 broken wikilinks, added missing frontmatter to 1 page
## [2024-03-10 11:00:00] query | searched "RAG production latency" — read rag-vs-fine-tuning, retrieval-augmented-generation
## [2024-03-15 09:00:00] ingest | updated openai page with GPT-4o release notes
## [2024-03-20 10:00:00] query | searched "Anthropic safety evals" — read claude-model-card, anthropic

test-wiki-page/wiki/sources/attention-is-all-you-need.md

@@ -0,0 +1,53 @@
---
title: Attention Is All You Need
created: 2024-01-15
updated: 2024-01-15
tags: [paper, transformer, attention, architecture]
source: https://arxiv.org/abs/1706.03762
---
# Attention Is All You Need
**Authors**: Vaswani, Shazeer, Parmar, Uszkoreit, Jones, Gomez, Kaiser, Polosukhin ([[google-deepmind|Google Brain]] / Google Research)
**Published**: June 2017 (NeurIPS 2017)
## Summary
This paper introduced the **Transformer** architecture, replacing recurrent and convolutional networks with a mechanism called **self-attention** for sequence-to-sequence tasks. It became the foundation for virtually every large language model built since 2018.
## Key Contributions
### Self-Attention Mechanism
Each token in the sequence computes attention weights over all other tokens, enabling the model to relate positions regardless of distance:
```
Attention(Q, K, V) = softmax(QK^T / √d_k) V
```
- **Q** (Query), **K** (Key), **V** (Value) — linear projections of token embeddings
- Division by √d_k keeps the dot products from growing with dimension and saturating the softmax, which would otherwise produce vanishing gradients (see the sketch below)
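A minimal numpy rendering of the formula (shapes and values are illustrative, not from the paper):
```
import numpy as np

def attention(Q, K, V):
    # softmax(Q K^T / sqrt(d_k)) V, with a numerically stable softmax
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
print(attention(Q, K, V).shape)  # (4, 8): one output vector per query position
```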
### Multi-Head Attention
Instead of a single attention function, the paper uses h=8 parallel attention heads, each learning different relationship types (syntax, coreference, semantics, etc.)
### Positional Encoding
Since self-attention is permutation-invariant, sinusoidal position encodings inject order information.
### Architecture
```
Encoder: Embedding → N × (Multi-Head Attn → FFN) → output representations
Decoder: Embedding → N × (Masked MHA → Cross-Attn → FFN) → token probabilities
```
## Why It Matters
1. **Parallelism**: Unlike RNNs, all positions are processed simultaneously during training → orders of magnitude faster on GPUs
2. **Long-range dependencies**: O(1) path length between any two positions vs O(n) for RNNs
3. **Scale**: The architecture scales smoothly — see [[llm-scaling-laws]]
## Impact
The transformer is now the backbone of GPT (see [[sources/gpt4-technical-report]]), Claude (see [[sources/claude-model-card]]), Gemini, and essentially every frontier model. It is one of the most cited ML papers of all time.
> [!NOTE]
> The paper's title was partly a provocation — at the time, the dominant view was that attention was useful *alongside* recurrence, not as a replacement. The title's confidence was validated rapidly.

test-wiki-page/wiki/sources/claude-model-card.md

@@ -0,0 +1,57 @@
---
title: Claude 3 Model Card
created: 2024-02-01
updated: 2024-02-01
tags: [paper, Claude, Anthropic, safety, evals]
source: https://www-cdn.anthropic.com/de8ba9b01c9ab7cbabf5c33b80b7bbc618857627/Model_Card_Claude_3.pdf
---
# Claude 3 Model Card
**Authors**: [[anthropic]]
**Published**: March 2024
## Overview
The Claude 3 model card covers the three-tier Claude 3 family: **Haiku** (fast/cheap), **Sonnet** (balanced), and **Opus** (most capable). This is Anthropic's first model card for a released frontier model family.
## Model Tiers
| Model | Speed | Cost | Best For |
|-------|-------|------|----------|
| Claude 3 Haiku | Fastest | Lowest | High-volume tasks, classification |
| Claude 3 Sonnet | Moderate | Medium | General use, coding |
| Claude 3 Opus | Slowest | Highest | Complex reasoning, research |
## Context Window
All Claude 3 models support a 200K token [[context-window]] — roughly 150,000 words. This is the largest commercially available context window at launch, enabling:
- Full-book analysis in a single call
- Large codebase review
- Long research sessions without chunking
## Safety Evaluations
The model card is notable for its safety eval methodology:
### Responsible Scaling Policy (RSP)
Anthropic's RSP defines capability thresholds ("ASL" levels) that trigger additional safeguards before deployment. Claude 3 was evaluated against ASL-3 criteria (uplift for CBRN weapons, autonomous replication).
### CBRN Uplift Testing
Red-teamers tested whether models provided meaningful uplift for chemical, biological, radiological, or nuclear harm. Claude 3 Opus did not meet the ASL-3 threshold for dangerous uplift.
### Child Safety
Absolute behavioral refusals are maintained regardless of prompt framing.
## Benchmark Results
Claude 3 Opus outperforms GPT-4 on several benchmarks at release:
- MMLU: 86.8% vs GPT-4's 86.4%
- HumanEval: 84.9% vs GPT-4's 67.0%
- GSM8K: 95.0% vs GPT-4's 92.0%
> [!NOTE]
> The model card openly discusses failure modes and known limitations — a more candid approach than [[sources/gpt4-technical-report]], which omitted most technical details.
> [!TIP]
> For [[agent-loop]] applications, Claude 3 Sonnet's combination of large context, strong instruction following, and moderate cost makes it a practical default choice.

test-wiki-page/wiki/sources/gpt4-technical-report.md

@@ -0,0 +1,55 @@
---
title: GPT-4 Technical Report
created: 2024-01-20
updated: 2024-01-20
tags: [paper, GPT-4, OpenAI, multimodal, evals]
source: https://arxiv.org/abs/2303.08774
---
# GPT-4 Technical Report
**Authors**: OpenAI
**Published**: March 2023
## Overview
The GPT-4 Technical Report describes [[openai]]'s fourth-generation large language model. Notably, the report deliberately withholds most technical details (parameter count, training data, architecture specifics) for competitive and safety reasons — a controversial decision.
## Key Capabilities
### Multimodality
GPT-4 accepts both image and text inputs (text outputs only at launch). It can describe images, read charts, solve visual math problems, and interpret screenshots.
### Benchmark Performance
GPT-4 achieved human-level or above-human-level performance on several professional exams:
| Exam | GPT-3.5 Percentile | GPT-4 Percentile |
|------|--------------------|-----------------|
| Bar Exam | ~10th | ~90th |
| SAT | ~87th | ~93rd |
| GRE Verbal | ~63rd | ~99th |
| USMLE Step 1 | ~53rd | ~75th+ |
### Extended Context Window
GPT-4 launched with an 8K token [[context-window]], later extended to 32K — a significant increase over GPT-3.5's 4K, enabling longer documents and conversation histories.
## RLHF and Alignment
The report describes an extensive RLHF (Reinforcement Learning from Human Feedback) pipeline:
1. Supervised fine-tuning on human-written demonstrations
2. Reward model trained on human comparisons
3. PPO optimization against the reward model
4. Rule-based reward models (RBRMs) for absolute safety behaviors
## Evals and Red-Teaming
OpenAI engaged external red-teamers before launch to test for:
- Dangerous capability elicitation (bio/chem/cyber)
- Jailbreaks and policy violations
- Deceptive alignment risks
> [!NOTE]
> The report introduced a reproducible eval framework. The [[llm-scaling-laws]] suggest GPT-4's capabilities were largely predictable from the compute budget used — though OpenAI has not confirmed exact training FLOPs.
> [!WARNING]
> Because so many technical details were withheld, the GPT-4 Technical Report is more useful as a capabilities/evals reference than as an architecture reference. Contrast with [[sources/attention-is-all-you-need]], which is fully open.

test-wiki-page/wiki/sources/react-paper.md

@@ -0,0 +1,58 @@
---
title: ReAct — Synergizing Reasoning and Acting in Language Models
created: 2024-02-10
updated: 2024-02-10
tags: [paper, agents, reasoning, acting, ReAct]
source: https://arxiv.org/abs/2210.03629
---
# ReAct: Synergizing Reasoning and Acting in Language Models
**Authors**: Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, Yuan Cao (Princeton / Google Brain)
**Published**: October 2022 (ICLR 2023)
## Summary
ReAct proposes interleaving **Rea**soning traces (chain-of-thought) and **Act**ing (tool calls) in a single LLM prompt. By alternating between "Thought: ..." and "Action: ..." steps, the model produces interpretable, grounded reasoning that is directly coupled to external tool use.
## The ReAct Pattern
```
Question: What is the capital of the country that won the 2022 FIFA World Cup?
Thought: I need to find which country won the 2022 World Cup first.
Action: Search[2022 FIFA World Cup winner]
Observation: Argentina won the 2022 FIFA World Cup, defeating France on penalties.
Thought: Argentina's capital is Buenos Aires.
Action: Finish[Buenos Aires]
```
## Why ReAct Works
**[[chain-of-thought]] alone** hallucinates facts — the model reasons but has no way to verify claims.
**Acting alone** (without reasoning) produces brittle tool use — the model can't plan multi-step retrieval.
**ReAct combines them**: reasoning guides which actions to take; observations from actions correct and update the reasoning trace.
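A minimal sketch of the driver loop this implies; `llm` and `tools` here are hypothetical stand-ins for a model call and a tool registry, not any specific framework's API:
```
import re

def react_loop(question, llm, tools, max_steps=8):
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        step = llm(transcript)               # emits "Thought: ...\nAction: Tool[arg]"
        transcript += step + "\n"
        m = re.search(r"Action:\s*(\w+)\[(.*)\]", step)
        if not m:
            break                            # no action emitted -> stop
        tool, arg = m.group(1), m.group(2)
        if tool == "Finish":
            return arg                       # final answer
        transcript += f"Observation: {tools[tool](arg)}\n"
    return None
```
Because each Observation is appended to the transcript, every subsequent Thought is conditioned on real tool output rather than the model's guesses.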
## Tasks Evaluated
The paper evaluates ReAct on:
1. **HotpotQA** — multi-hop question answering requiring chained Wikipedia lookups
2. **FEVER** — fact verification requiring evidence retrieval
3. **ALFWorld** — interactive text game requiring navigation + object manipulation
4. **WebShop** — web shopping simulation requiring search + selection
ReAct outperformed chain-of-thought-only and action-only baselines on all tasks.
## Influence on Agent Frameworks
ReAct is the conceptual backbone of most modern [[agent-loop]] implementations:
- LangChain's AgentExecutor
- AutoGPT and BabyAGI
- Claude's tool use / function calling
- OpenAI's Assistants API with function calling
> [!TIP]
> The key insight is that reasoning traces make agents **debuggable** — you can read the Thought steps to understand why the agent chose an action. This is essential for production agent systems. See [[synthesis/ai-agent-patterns]].

test-wiki-page/wiki/synthesis/ai-agent-patterns.md

@@ -0,0 +1,84 @@
---
title: AI Agent Patterns
created: 2024-03-01
updated: 2024-03-01
tags: [synthesis, agents, architecture, patterns, production]
---
# AI Agent Patterns
After reading [[sources/react-paper]], tracking several open-source agent frameworks, and following production deployments, I've identified recurring architectural patterns. This is a living synthesis note.
## Pattern 1: ReAct Loop (Reasoning + Acting)
The baseline pattern from [[sources/react-paper]]. The model alternates between reasoning traces and tool calls until it reaches a final answer.
**Best for**: Single-agent tasks with well-defined tools and clear success criteria.
**Limitations**: Fragile on long chains; one bad tool call can derail the whole trace.
## Pattern 2: Plan-and-Execute
The agent first generates a complete plan (list of steps), then executes each step in order, potentially with a different (cheaper) model for execution.
```
Planner LLM → [step1, step2, step3, ...]
Executor LLM → execute(step1) → result
Executor LLM → execute(step2) → result
...
```
**Best for**: Tasks where the subtasks are well-understood and execution is mechanical.
## Pattern 3: Multi-Agent Orchestration
Multiple specialized agents collaborate, each with a different prompt/tools/model tier:
- **Orchestrator**: Plans, assigns tasks, integrates results
- **Researcher**: Searches the web, reads documents
- **Coder**: Writes and executes code
- **Critic**: Reviews outputs for quality
**Best for**: Complex tasks requiring diverse expertise (e.g., "research X, write a report, add visualizations").
## Pattern 4: Reflection and Self-Critique
After completing a task, the agent reviews its own output and iteratively improves it. Related to [[chain-of-thought]] self-consistency.
```
Draft answer → Critique(draft) → Revised answer → Critique(revised) → ...
```
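A minimal sketch of this pattern, with hypothetical `draft`, `critique`, and `revise` standing in for LLM calls:
```
def reflect(task, draft, critique, revise, max_rounds=3):
    answer = draft(task)
    for _ in range(max_rounds):        # cap rounds to avoid endless self-critique
        feedback = critique(task, answer)
        if feedback == "OK":
            return answer
        answer = revise(task, answer, feedback)
    return answer
```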
## Pattern 5: Memory-Augmented Agents
The [[agent-loop]] integrates with a persistent external memory (vector DB, structured wiki). After each session, key observations are written to memory; at the start of each session, relevant memories are retrieved.
This is exactly what llmwiki-cli supports:
- **Write**: `wiki write wiki/entities/new-finding.md`
- **Retrieve**: `wiki search "relevant topic"`
- **Connect**: add `[[wikilinks]]` to create a knowledge graph
Using [[retrieval-augmented-generation]] within the agent loop transforms the agent from a stateless responder to a system that accumulates expertise over time.
## Key Failure Modes Across All Patterns
> [!WARNING]
> The most common production failure is **context overflow**: long agent sessions fill the [[context-window]], causing the model to lose track of earlier observations or instructions. Always monitor token usage in production agents.
| Failure | Pattern | Fix |
|---------|---------|-----|
| Hallucinated tool calls | All | Strict JSON schema validation |
| Context overflow | ReAct, Plan-Execute | Summarize history at checkpoints |
| Divergent multi-agent | Multi-Agent | Shared state store with conflict resolution |
| Self-critique loops | Reflection | Max iteration limit |
| Stale memory | Memory-Augmented | TTL on memory entries + periodic maintenance |
## Recommendations
1. Start with ReAct — it's the simplest and most debuggable
2. Add memory (wiki/vector store) early — retrofitting is hard
3. Use smaller models for execution, larger for planning
4. Log all agent traces — you need them for debugging
5. Design for graceful degradation — agents will fail; plan the fallback
> [!NOTE]
> [[openai]]'s o1 / o3 models (late 2024) internalize multi-step reasoning into a private "thinking" chain before output. This reduces the need for explicit ReAct prompting but makes the reasoning less inspectable — a trade-off for production debugging.

test-wiki-page/wiki/synthesis/rag-vs-fine-tuning.md

@@ -0,0 +1,63 @@
---
title: RAG vs Fine-Tuning
created: 2024-02-25
updated: 2024-02-25
tags: [synthesis, RAG, fine-tuning, architecture, trade-offs]
---
# RAG vs Fine-Tuning
Two primary strategies exist for giving an LLM access to domain-specific knowledge: **[[retrieval-augmented-generation]]** (inject knowledge at inference time) and **fine-tuning** (bake knowledge into weights at training time). This note synthesizes the trade-offs based on what I've read and observed.
## The Core Trade-Off
| Dimension | RAG | Fine-Tuning |
|-----------|-----|-------------|
| Knowledge freshness | Real-time — update the index, done | Requires re-training |
| Knowledge accuracy | High — verbatim retrieval | Can hallucinate/distort during training |
| Cost to update | Low (index update) | High (GPU training run) |
| Latency | Added retrieval step | No retrieval overhead |
| Interpretability | Can cite retrieved chunks | Knowledge is opaque in weights |
| Style/behavior change | Cannot change model behavior | Can reshape how model responds |
## When to Use RAG
RAG wins when:
- Knowledge changes frequently (news, docs, code, internal data)
- You need to cite sources or show retrieved evidence
- You want to control exactly what information the model uses
- Budget is limited (no GPU training required)
- You're building on top of a third-party model API
Most production knowledge-base Q&A systems use RAG. This includes most enterprise LLM applications.
## When to Use Fine-Tuning
Fine-tuning wins when:
- You need to change model **behavior** or **style**, not just inject knowledge
- You need to teach the model a new format, new domain vocabulary, or new reasoning patterns
- You have a massive labeled dataset and need consistent, fast responses at scale
- You want smaller models to match larger ones on a specific task (distillation)
## The False Dichotomy
Many production systems use both:
```
Query → Retrieve relevant docs (RAG)
→ Fine-tuned model generates response grounded in retrieved docs
```
Example: [[anthropic]]'s Claude is fine-tuned for helpfulness and safety (Constitutional AI), but in deployment it uses tool calls to retrieve external knowledge — RAG on top of a fine-tuned base.
## RAG for Personal Knowledge
For individual researchers and LLM agents, RAG via a structured wiki (llmwiki-cli) is almost always the right choice over fine-tuning:
- Your knowledge base grows and changes constantly
- You cannot fine-tune a commercial API model
- Retrieval via `wiki search` is fast and inspectable
See [[synthesis/why-context-window-matters]] for the related question of when to retrieve vs. just using long context.
> [!TIP]
> A quick heuristic: if you're trying to give the model **facts**, use RAG. If you're trying to give the model **skills**, use fine-tuning.

test-wiki-page/wiki/synthesis/why-context-window-matters.md

@@ -0,0 +1,50 @@
---
title: Why Context Window Size Matters
created: 2024-02-20
updated: 2024-02-20
tags: [synthesis, context-window, RAG, architecture, trade-offs]
---
# Why Context Window Size Matters
The [[context-window]] has become one of the most strategically important axes of LLM competition. This note synthesizes what I've learned tracking the space and examines the real-world trade-offs.
## The Race to Longer Context
[[anthropic]] pushed from 9K tokens (Claude 1) to 200K (Claude 3) in about a year. [[google-deepmind]] announced Gemini 1.5 Pro with 1M tokens in February 2024. [[openai]]'s GPT-4 lags behind in context length but leads in other areas.
This arms race is driven by a simple user demand: people want to paste in entire codebases, books, or transcripts and ask questions about them without worrying about chunking.
## What Large Context Enables
1. **Full-document reasoning**: Summarize a 300-page report, compare two books, review a whole codebase — all in one shot
2. **Long agent sessions**: An [[agent-loop]] that runs for 50+ steps accumulates substantial history; 200K tokens buys significantly more headroom than 32K
3. **Fewer RAG dependencies**: With enough context, you can skip the retrieval pipeline entirely and just load all relevant data upfront — simpler architecture, lower latency
4. **In-context learning at scale**: More examples fit → better few-shot performance
## What Large Context Doesn't Solve
> [!WARNING]
> Long context is not a substitute for persistent knowledge management.
- **Retrieval within context is imperfect**: Research shows models attend poorly to information in the middle of very long contexts ("lost in the middle" problem). Beginning and end of context get disproportionate attention.
- **Cost scales with tokens**: A 200K token call to Claude 3 Opus costs significantly more than a focused 2K token call after [[retrieval-augmented-generation]] retrieval.
- **Knowledge doesn't accumulate across sessions**: Even a 1M token window resets between conversations. A wiki persists.
- **Latency**: First-token latency grows with context length. For interactive applications, long-context calls are slow.
## The Right Mental Model
Think of context window and [[retrieval-augmented-generation]] as **complementary**, not competing:
| Tool | Best For |
|------|---------|
| Large context | One-shot processing of a known, bounded document set |
| RAG | Searching across a large corpus to find what's relevant |
| External wiki | Accumulating knowledge persistently across sessions |
> [!NOTE]
> [[anthropic]]'s position — leading on context length — makes sense given their safety focus. Long context reduces the need for tool use, which reduces the attack surface from malicious tool outputs in agentic settings.
## Implication for Knowledge Management
Even with 1M token context, you still need a structured knowledge base. A year of research notes, dozens of papers, hundreds of observations — this grows far beyond any context window. The right approach is: use [[retrieval-augmented-generation]] or a wiki (like llmwiki-cli) to surface the 5–10 most relevant pages, then use context to reason over them.