Add foundational concepts and entities related to LLMs and AI agents

- Create context-window.md to explain the significance of context window size in LLMs.
- Add llm-scaling-laws.md detailing the empirical relationships between model performance and resources.
- Introduce retrieval-augmented-generation.md to describe RAG architecture and its advantages.
- Add entity pages for key figures and organizations: andrej-karpathy.md, anthropic.md, google-deepmind.md, openai.md, sam-altman.md.
- Create sources for foundational papers: attention-is-all-you-need.md, claude-model-card.md, gpt4-technical-report.md, react-paper.md.
- Synthesize insights on AI agent patterns and RAG vs fine-tuning in dedicated pages.
- Update index.md to include new entities and concepts.
- Log all activities related to the wiki's development in log.md.
doum1004
2026-04-13 00:05:30 -04:00
parent 51b4ce6ca7
commit b19bd2e408
25 changed files with 1008 additions and 6 deletions

.github/workflows/demo-viz.yml

@@ -0,0 +1,53 @@
name: Build Demo Visualization
on:
  push:
    branches: [main]
    paths:
      - 'test-wiki-page/**'
      - 'src/lib/templates.ts'
      - 'scripts/generate-viz-scripts.ts'
      - '.github/workflows/demo-viz.yml'
permissions:
  pages: write
  id-token: write
  contents: read
concurrency:
  group: pages
  cancel-in-progress: false
jobs:
  build-and-deploy:
    runs-on: ubuntu-latest
    environment:
      name: github-pages
      url: ${{ steps.deployment.outputs.page_url }}
    steps:
      - uses: actions/checkout@v4
      - uses: oven-sh/setup-bun@v2
        with:
          bun-version: latest
      - uses: actions/setup-node@v4
        with:
          node-version: 20
      - name: Generate viz scripts from templates
        run: bun run scripts/generate-viz-scripts.ts .viz-tmp
      - name: Build graph data
        env:
          WIKI_DIR: test-wiki-page/wiki
        run: node .viz-tmp/build-graph.cjs
      - name: Build site
        env:
          GITHUB_REPOSITORY: ${{ github.repository }}
        run: node .viz-tmp/build-site.cjs
      - name: Configure Pages
        uses: actions/configure-pages@v5
      - name: Upload artifact
        uses: actions/upload-pages-artifact@v3
        with:
          path: dist
      - name: Deploy to GitHub Pages
        id: deployment
        uses: actions/deploy-pages@v4

.gitignore

@@ -2,4 +2,5 @@
node_modules/
dist/
.viz-tmp/
*.tgz


@@ -78,6 +78,15 @@ docs/
  phase-3.md
  phase-4.md
  phase-5.md
scripts/
  generate-viz-scripts.ts    # Extracts viz build scripts from templates.ts (used by demo workflow)
test-wiki-page/
  wiki/                      # Example wiki pages for live demo on GitHub Pages
    index.md
    log.md
    concepts/
    sources/
    synthesis/
```
## Commands

README.md

@@ -11,6 +11,8 @@ A CLI tool for LLM agents to build and maintain personal knowledge bases.
Inspired by [Andrej Karpathy's LLM Wiki](https://gist.github.com/karpathy/442a6bf555914893e9891c11519de94f).
**[Live Demo](https://doum1004.github.io/llmwiki-cli/)** — interactive d3-force graph built from the example wiki in [`test-wiki-page/`](test-wiki-page/).
## Overview
The CLI is the hands -- it reads, writes, searches, and manages wiki files. The LLM is the brain -- it decides what to create, update, and connect.

scripts/generate-viz-scripts.ts

@@ -0,0 +1,8 @@
import { getBuildGraphScript, getBuildSiteScript } from "../src/lib/templates.ts";
import { writeFileSync, mkdirSync } from "fs";
const outDir = process.argv[2] || ".viz-tmp";
mkdirSync(outDir, { recursive: true });
writeFileSync(`${outDir}/build-graph.cjs`, getBuildGraphScript());
writeFileSync(`${outDir}/build-site.cjs`, getBuildSiteScript());
console.log(`Wrote build-graph.cjs and build-site.cjs to ${outDir}`);

src/lib/templates.ts

@@ -246,7 +246,7 @@ export function getBuildGraphScript(): string {
 const path = require("path");
 const WIKILINK_RE = /\\[\\[([^\\]|]+)(?:\\|[^\\]]+)?\\]\\]/g;
-const WIKI_DIR = "wiki";
+const WIKI_DIR = process.env.WIKI_DIR || "wiki";
 const OUT_DIR = "dist";
 function findMdFiles(dir) {
@@ -275,18 +275,20 @@ function stripFrontmatter(content) {
   return content.trim();
 }
+const wikiPrefix = WIKI_DIR.replace(/\\\\/g, "/").replace(/\\/$/, "") + "/";
 function resolveLink(target, allFiles) {
   const withMd = target.endsWith(".md") ? target : target + ".md";
   const candidates = allFiles.map((f) => f.replace(/\\\\/g, "/"));
   if (candidates.includes(withMd)) return withMd;
-  const withWiki = "wiki/" + withMd;
+  const withWiki = wikiPrefix + withMd;
   if (candidates.includes(withWiki)) return withWiki;
-  const dirs = ["wiki/entities", "wiki/concepts", "wiki/sources", "wiki/synthesis"];
-  for (const dir of dirs) {
-    const candidate = dir + "/" + withMd;
+  const subdirs = ["entities", "concepts", "sources", "synthesis"];
+  for (const sub of subdirs) {
+    const candidate = wikiPrefix + sub + "/" + withMd;
     if (candidates.includes(candidate)) return candidate;
   }
@@ -297,6 +299,13 @@ function resolveLink(target, allFiles) {
   return null;
 }
+function relDir(filePath) {
+  const rel = filePath.replace(/\\\\/g, "/");
+  const inner = rel.startsWith(wikiPrefix) ? rel.slice(wikiPrefix.length) : rel;
+  const first = inner.split("/")[0];
+  return inner.includes("/") ? first : "wiki";
+}
 const files = findMdFiles(WIKI_DIR);
 const nodes = [];
 const edges = [];
@@ -304,7 +313,7 @@ const edges = [];
 for (const file of files) {
   const content = fs.readFileSync(file, "utf-8");
   const relPath = file.replace(/\\\\/g, "/");
-  const dir = relPath.split("/")[1] || "wiki";
+  const dir = relDir(relPath);
   nodes.push({ id: relPath, title: extractTitle(content, file), dir, body: stripFrontmatter(content) });
   let match;

test-wiki-page/wiki/concepts/agent-loop.md

@@ -0,0 +1,55 @@
---
title: Agent Loop
created: 2024-02-10
updated: 2024-02-10
tags: [concept, agents, autonomy, architecture]
---
# Agent Loop
The agent loop is the core execution pattern of an autonomous LLM agent: a repeating cycle of **Observe → Think → Act** that continues until the agent reaches a goal or is stopped. Each iteration the agent receives new observations, reasons about them using [[chain-of-thought]], selects an action (tool call, message, or termination), and processes the result.
## Basic Structure
```
while not done:
observation = get_context() # current state, memory, tool results
thought = llm.think(observation) # chain-of-thought reasoning
action = llm.choose_action(thought) # tool call or final answer
result = execute(action) # run the tool
    memory.append((observation, thought, action, result))
```
## Key Components
### Memory
- **In-context**: Everything in the [[context-window]] — conversation, tool outputs, instructions
- **External**: Retrieved from stores via [[retrieval-augmented-generation]] — the wiki, vector DB, etc.
### Tools
The set of actions available to the agent. Common tools:
- Web search / browser
- Code interpreter / REPL
- File read/write (e.g. `wiki read`, `wiki write`)
- API calls
### Termination
The agent must know when to stop. Poor termination criteria lead to infinite loops or premature exits.
## ReAct Pattern
The most widely used agent loop variant is ReAct (Reasoning + Acting), from [[sources/react-paper]]. It interleaves natural-language reasoning traces with tool-call actions, making the agent's decisions inspectable.
## Failure Modes
| Failure | Cause | Mitigation |
|---------|-------|-----------|
| Hallucinated tool calls | Model invents non-existent tools | Strict function schema validation |
| Context overflow | Long loops fill the [[context-window]] | Summarize or compress history |
| Stuck in loop | No progress, keeps retrying | Max step limit + backoff |
| Over-planning | Too much thinking, too little acting | Temperature tuning, step limits |
> [!TIP]
> llmwiki-cli is designed to be a tool inside an agent loop: the agent calls `wiki search`, `wiki read`, and `wiki write` as actions, using the wiki as its external long-term memory.
See [[synthesis/ai-agent-patterns]] for patterns that have emerged in production agent systems.

test-wiki-page/wiki/concepts/chain-of-thought.md

@@ -0,0 +1,43 @@
---
title: Chain-of-Thought
created: 2024-02-10
updated: 2024-02-10
tags: [concept, prompting, reasoning, CoT]
source: https://arxiv.org/abs/2201.11903
---
# Chain-of-Thought (CoT)
Chain-of-thought prompting is a technique that elicits step-by-step reasoning from a language model by including examples that show the reasoning process — not just the final answer. Introduced by Wei et al. (Google Brain, 2022), it dramatically improves performance on multi-step reasoning tasks.
## Key Variants
| Variant | How | When to Use |
|---------|-----|-------------|
| Few-shot CoT | Include worked examples in prompt | Tasks with clear reasoning steps |
| Zero-shot CoT | Append "Let's think step by step" | Quick boost without example construction |
| Self-consistency | Sample multiple CoT paths, majority vote | High-stakes tasks requiring reliability |
| Tree of Thoughts | Branch reasoning into tree, search over paths | Very complex multi-step problems |
## Why It Works
Transformers process tokens sequentially. By forcing the model to generate intermediate reasoning steps before the final answer, CoT:
1. Allocates more computation (tokens) to hard steps
2. Externalizes working memory that would otherwise be compressed into hidden states
3. Creates a "scratchpad" that the model can condition on when generating later tokens
## Connection to Agents
Chain-of-thought is the cognitive substrate of [[agent-loop]]. When an agent uses a ReAct-style loop (see [[sources/react-paper]]), the "think" step is CoT reasoning — the agent writes out its plan before choosing an action.
```
Observation: search results for "Paris population"
Thought: The results show Paris has 2.1M city / 12M metro. I need the metro figure.
Action: answer("The Paris metro area has approximately 12 million people.")
```
> [!TIP]
> CoT is most effective on models with ≥ 100B parameters. On smaller models, it can actually hurt performance by generating plausible-sounding but incorrect reasoning chains.
> [!NOTE]
> [[openai]]'s o1 and o3 models (2024) internalize chain-of-thought as a latent "thinking" process before producing output — a productized version of explicit CoT prompting.

test-wiki-page/wiki/concepts/context-window.md

@@ -0,0 +1,43 @@
---
title: Context Window
created: 2024-01-20
updated: 2024-01-20
tags: [concept, architecture, inference, tokens]
---
# Context Window
The context window (also called context length or context limit) is the maximum number of tokens a language model can process in a single forward pass — encompassing both the input prompt and the generated output. Everything outside the context window is invisible to the model.
## Why It Matters
Context window size is one of the most practically important properties of a deployed LLM:
- **In-context learning**: The model can only use examples, documents, or conversation history that fits in the window
- **Long-document tasks**: Summarization, question-answering over books or codebases require large windows
- **Agent memory**: An [[agent-loop]] accumulates observations and history; a small context window means the agent forgets recent steps
- **RAG trade-off**: Small windows force reliance on [[retrieval-augmented-generation]]; large windows reduce the need to retrieve
## Historical Progression
| Model | Year | Context |
|-------|------|---------|
| GPT-3 | 2020 | 4K tokens |
| GPT-4 | 2023 | 8K–32K tokens |
| Claude 2 | 2023 | 100K tokens |
| Claude 3 | 2024 | 200K tokens |
| Gemini 1.5 Pro | 2024 | 1M tokens |
## Technical Constraints
Context window size is limited by:
1. **Quadratic attention complexity**: Standard self-attention scales as O(n²) in sequence length — doubling the context quadruples compute
2. **KV cache memory**: Each token in the context requires storing key-value pairs in GPU memory (estimated in the sketch below)
3. **Positional encoding generalization**: Models must be trained on long sequences to handle them well; RoPE and ALiBi help with generalization
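A back-of-envelope sketch of constraint 2, the KV cache. This is a minimal estimate; the layer, head, and precision figures are illustrative assumptions, not any particular model's specs:
```
# KV cache size = 2 (K and V) x layers x kv_heads x head_dim x bytes x tokens.
# All architecture numbers below are illustrative assumptions.
def kv_cache_bytes(tokens, layers=80, kv_heads=8, head_dim=128, dtype_bytes=2):
    return 2 * layers * kv_heads * head_dim * dtype_bytes * tokens

for ctx in (8_000, 32_000, 200_000, 1_000_000):
    print(f"{ctx:>9,} tokens -> {kv_cache_bytes(ctx) / 2**30:6.1f} GiB of KV cache")
```
Even with grouped-query attention, a context in the hundreds of thousands of tokens claims tens of GiB of GPU memory for the cache alone.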
## Implications for Knowledge Management
A large context window does not eliminate the need for a tool like llmwiki-cli. Even a 1M token window cannot hold months of accumulated research. See [[synthesis/why-context-window-matters]] for the full analysis.
> [!TIP]
> [[anthropic]]'s Claude 3 at 200K tokens can process roughly 150,000 words — the equivalent of a 500-page book — in a single prompt. This makes it practical for full-codebase analysis without chunking.

test-wiki-page/wiki/concepts/llm-scaling-laws.md

@@ -0,0 +1,43 @@
---
title: LLM Scaling Laws
created: 2024-01-15
updated: 2024-01-15
tags: [concept, scaling, training, compute]
source: https://arxiv.org/abs/2001.08361
---
# LLM Scaling Laws
Scaling laws describe the empirical relationship between model performance (measured as cross-entropy loss on held-out text) and three key resources: model parameters (N), training data tokens (D), and compute budget (C). The key finding is that performance improves as a smooth power law — predictably and reliably — as these quantities increase.
## Key Papers
1. **Kaplan et al. 2020** ("Scaling Laws for Neural Language Models") — [[openai]] researchers established the foundational relationships. Found that N and D should be scaled together, but suggested growing parameters faster than data.
2. **Hoffmann et al. 2022** ("Training Compute-Optimal Large Language Models", aka "Chinchilla") — [[google-deepmind]] researchers revised the Kaplan findings, showing models were being under-trained. The Chinchilla-optimal ratio is roughly **20 tokens per parameter**.
## Core Relationships
```
Loss ∝ N^(-α) (more parameters → lower loss)
Loss ∝ D^(-β) (more data → lower loss)
Loss ∝ C^(-γ) (more compute → lower loss)
```
where α ≈ 0.076, β ≈ 0.095, γ ≈ 0.050 (Kaplan et al.)
## Implications
- You can predict how good a model will be **before training it** if you know the compute budget
- Bigger models are more **sample-efficient** — they learn more per token
- There is an optimal allocation of compute between model size and training data for a fixed budget
- Performance improvements from scale have not shown signs of plateauing on standard benchmarks (as of 2024)
## Chinchilla Correction
Pre-Chinchilla models (GPT-3, PaLM, Gopher) were significantly over-parameterized relative to their training data. The Chinchilla paper showed that a 70B parameter model trained on 1.4T tokens (Chinchilla) outperformed a 280B model (Gopher) trained on far fewer tokens — using the same training compute with 4× fewer parameters.
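A minimal sketch of what compute-optimal allocation implies, assuming the common approximations C ≈ 6·N·D training FLOPs and the Chinchilla rule D ≈ 20·N (both are rules of thumb, not exact values from the paper):
```
# Solve C = 6*N*D with D = 20*N  =>  N = sqrt(C/120), D = 20*N.
import math

def chinchilla_optimal(flops):
    n_params = math.sqrt(flops / 120)
    return n_params, 20 * n_params

for c in (1e21, 1e23, 1e25):
    n, d = chinchilla_optimal(c)
    print(f"C = {c:.0e} FLOPs -> N = {n / 1e9:8.1f}B params, D = {d / 1e12:7.2f}T tokens")
```
Plugging in Chinchilla's approximate budget (≈ 5.9e23 FLOPs) recovers roughly 70B parameters and 1.4T tokens.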
> [!NOTE]
> [[andrej-karpathy]] frequently cites scaling laws as the core reason the field moved from hand-engineering features to scaling simple architectures. The [[sources/attention-is-all-you-need]] transformer is the architecture that made scaling practical.
> [!WARNING]
> Scaling laws apply to pre-training loss. They do not directly predict performance on specific downstream tasks, reasoning benchmarks, or instruction-following quality — those require additional alignment techniques.

test-wiki-page/wiki/concepts/retrieval-augmented-generation.md

@@ -0,0 +1,45 @@
---
title: Retrieval-Augmented Generation
created: 2024-02-15
updated: 2024-02-15
tags: [concept, RAG, retrieval, architecture]
source: https://arxiv.org/abs/2005.11401
---
# Retrieval-Augmented Generation (RAG)
Retrieval-Augmented Generation (RAG) is an architecture pattern that augments an LLM's generation with dynamically retrieved content from an external knowledge store. Instead of relying solely on knowledge encoded in model weights, the system retrieves relevant documents at inference time and injects them into the [[context-window]].
## How It Works
```
Query → Embed query → Search vector store → Retrieve top-k docs
→ Inject docs into prompt → LLM generates answer
```
1. **Indexing**: Documents are chunked and embedded into a vector store (e.g. Pinecone, Weaviate, pgvector)
2. **Retrieval**: At query time, the query is embedded and nearest-neighbor search finds relevant chunks
3. **Generation**: Retrieved chunks are inserted into the prompt; the LLM generates a response grounded in them (the full pipeline is sketched below)
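A minimal end-to-end sketch of the three stages, using a toy hashed bag-of-words `embed()` as a stand-in for a real embedding model and vector store:
```
import hashlib
import numpy as np

def embed(text, dim=256):
    # Toy embedding: hashed bag-of-words, unit-normalized.
    # A real system would call a learned embedding model here.
    v = np.zeros(dim)
    for word in text.lower().split():
        v[int(hashlib.md5(word.encode()).hexdigest(), 16) % dim] += 1.0
    n = np.linalg.norm(v)
    return v / n if n else v

docs = [
    "Claude 3 supports a 200K token context window.",
    "Chinchilla found roughly 20 training tokens per parameter is optimal.",
    "ReAct interleaves reasoning traces with tool calls.",
]
index = np.stack([embed(d) for d in docs])     # 1. Indexing
query = "how many tokens per parameter?"
scores = index @ embed(query)                  # 2. Retrieval (cosine similarity)
top = [docs[i] for i in np.argsort(scores)[::-1][:2]]
prompt = "Context:\n" + "\n".join(top) + f"\n\nQuestion: {query}"  # 3. Generation input
print(prompt)
```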
## Why RAG Exists
RAG solves two fundamental limitations of LLMs:
1. **Knowledge cutoff**: Weights are frozen at training time; RAG injects fresh information
2. **Context window limits**: You cannot fit an entire knowledge base in context; RAG selects what's relevant
## RAG vs Fine-Tuning
For keeping knowledge current, RAG is almost always preferred over fine-tuning. See [[synthesis/rag-vs-fine-tuning]] for the full comparison.
## Relevance to llmwiki-cli
llmwiki-cli functions as a lightweight structured RAG system for LLM agents:
- `wiki search` performs keyword retrieval from the wiki corpus
- `wiki read` injects the retrieved page into the agent's context
- The [[agent-loop]] can call these tools repeatedly to accumulate relevant knowledge
> [!NOTE]
> The original RAG paper (Lewis et al. 2020, Facebook AI) used a learned retriever (DPR) combined with BART for generation. Modern RAG systems typically use off-the-shelf embedding models and generative LLMs like GPT-4 or Claude.
> [!WARNING]
> RAG quality depends heavily on chunking strategy and embedding model quality. Naive chunking (fixed-size character splits) often breaks semantic units and hurts retrieval precision.

test-wiki-page/wiki/entities/andrej-karpathy.md

@@ -0,0 +1,35 @@
---
title: Andrej Karpathy
created: 2024-02-05
updated: 2024-02-05
tags: [person, researcher, OpenAI, Tesla]
source: https://karpathy.ai
---
# Andrej Karpathy
Andrej Karpathy is an AI researcher and educator known for his work on deep learning, computer vision, and large language models. He is one of the most effective communicators of AI concepts to practitioners.
## Career
| Period | Role |
|--------|------|
| 2011–2015 | PhD at Stanford (CS/Vision, Fei-Fei Li's lab) |
| 2015–2017 | Research Scientist at [[openai]] (founding team) |
| 2017–2022 | Sr. Director of AI at Tesla (Autopilot) |
| 2023–2024 | Research Scientist at [[openai]] (returned) |
| 2024– | Independent / EurekaLabs |
## Key Contributions
### nanoGPT
Karpathy's `nanoGPT` repository is one of the most widely studied clean implementations of the GPT architecture. It demystifies how transformer language models work from first principles — closely tied to [[llm-scaling-laws]] intuitions.
### Educational Content
His YouTube lecture series "Neural Networks: Zero to Hero" has become a canonical learning resource for practitioners wanting to understand how LLMs work from the ground up, covering backpropagation through to full GPT training.
### Tokenization Advocacy
Karpathy is an outspoken critic of subword tokenization as a source of model brittleness, arguing it creates unnecessary complexity that future models should eliminate.
> [!TIP]
> Karpathy's blog post "The Unreasonable Effectiveness of Recurrent Neural Networks" (2015) remains a landmark piece of ML writing even as the field has moved to transformers — worth reading for historical context on [[llm-scaling-laws]].

test-wiki-page/wiki/entities/anthropic.md

@@ -0,0 +1,41 @@
---
title: Anthropic
created: 2024-02-01
updated: 2024-02-01
tags: [company, LLM, AI-safety, research]
source: https://anthropic.com
---
# Anthropic
Anthropic is an AI safety company founded in 2021 by former [[openai]] researchers, led by Dario Amodei (CEO) and Daniela Amodei (President). The company focuses on building reliable, interpretable, and steerable AI systems.
## Founding and Background
Seven of Anthropic's eleven founders came from [[openai]], departing over concerns about the pace of capability development relative to safety work. This origin story shapes Anthropic's emphasis on alignment research alongside product development.
## Claude Model Family
Anthropic's flagship product line is Claude. See [[sources/claude-model-card]] for the full technical details.
| Version | Release | Context Window |
|---------|---------|----------------|
| Claude 1 | Mar 2023 | 9K tokens |
| Claude 2 | Jul 2023 | 100K tokens |
| Claude 3 Haiku/Sonnet/Opus | Mar 2024 | 200K tokens |
The dramatic expansion of the [[context-window]] — from 9K to 200K tokens — is a defining competitive advantage. See [[synthesis/why-context-window-matters]] for analysis.
## Constitutional AI
Anthropic's key alignment approach is **Constitutional AI (CAI)**: instead of relying entirely on human feedback, the model is trained with a set of principles ("constitution") to self-critique and revise outputs. This reduces dependence on human labelers for harmlessness training.
## Safety Research
Anthropic publishes significant interpretability research, including mechanistic interpretability work trying to understand what computations happen inside transformer layers.
> [!NOTE]
> Anthropic received $300M from Google in 2023, followed by a further $2B commitment, giving Google a minority stake. Amazon also invested up to $4B in late 2023.
> [!WARNING]
> Despite the safety focus, Anthropic still ships capable frontier models — the tension between capability and safety is ongoing and unresolved.

test-wiki-page/wiki/entities/google-deepmind.md

@@ -0,0 +1,37 @@
---
title: Google DeepMind
created: 2024-01-15
updated: 2024-01-15
tags: [company, LLM, research, Google]
source: https://deepmind.google
---
# Google DeepMind
Google DeepMind is the merged AI research division of Google, formed in April 2023 by combining Google Brain and DeepMind. It is led by Demis Hassabis (DeepMind co-founder) as CEO.
## History
- **DeepMind** (founded 2010, acquired by Google 2014) — famous for AlphaGo and AlphaFold
- **Google Brain** (founded 2011) — developed TensorFlow, pioneered large-scale neural net training; authors of [[sources/attention-is-all-you-need]]
- **Merger** (April 2023) — combined into Google DeepMind to compete more directly with [[openai]] and [[anthropic]]
## Key Contributions
### Transformer Architecture
Google Brain researchers authored the foundational "Attention Is All You Need" paper (2017), which introduced the transformer — now the basis for virtually all large language models. See [[sources/attention-is-all-you-need]].
### Scaling Research
Google was an early contributor to [[llm-scaling-laws]] research, publishing work on compute-optimal training (Chinchilla, 2022), which showed that many models were under-trained relative to their parameter count.
### Gemini
Gemini is Google DeepMind's frontier model family, competing directly with GPT-4 and Claude 3.
| Version | Release | Notes |
|---------|---------|-------|
| Gemini 1.0 | Dec 2023 | Ultra / Pro / Nano tiers |
| Gemini 1.5 Pro | Feb 2024 | 1M token context window |
| Gemini 1.5 Flash | May 2024 | Efficient, fast variant |
> [!NOTE]
> Gemini 1.5 Pro's 1 million token [[context-window]] is currently the largest available in a production model, enabling entirely new use cases like processing full codebases or hour-long videos.

test-wiki-page/wiki/entities/openai.md

@@ -0,0 +1,46 @@
---
title: OpenAI
created: 2024-01-15
updated: 2024-03-15
tags: [company, LLM, AGI, research]
source: https://openai.com
---
# OpenAI
OpenAI is an AI research company founded in December 2015 in San Francisco. Originally a non-profit, it restructured into a "capped-profit" model in 2019 to attract investment. Its stated mission is to ensure that artificial general intelligence benefits all of humanity.
## Key People
- [[sam-altman]] — CEO (returned after brief Nov 2023 board ouster)
- [[andrej-karpathy]] — founding member, returned as employee 2023–2024
- Greg Brockman — President and co-founder
- Ilya Sutskever — co-founder, Chief Scientist (departed 2024)
## Major Models
| Model | Release | Key Feature |
|-------|---------|-------------|
| GPT-3 | May 2020 | 175B params, few-shot learning |
| InstructGPT | Jan 2022 | RLHF alignment |
| ChatGPT | Nov 2022 | Conversational wrapper on GPT-3.5 |
| GPT-4 | Mar 2023 | Multimodal, major capability jump |
| GPT-4o | May 2024 | Native multimodal, faster, cheaper |
## Research Contributions
OpenAI pioneered the [[llm-scaling-laws]] paradigm with the 2020 "Scaling Laws for Neural Language Models" paper, establishing that model performance scales predictably with compute, parameters, and data. See also [[sources/gpt4-technical-report]] for capabilities benchmarking.
## Commercial Products
- **ChatGPT** — consumer product, 100M+ users in first two months
- **API** — developer access to GPT models
- **Copilot** (via Microsoft partnership) — integrated into Office, GitHub, Windows
> [!NOTE]
> OpenAI has a complex relationship with [[anthropic]]: several Anthropic founders (including Dario and Daniela Amodei) left OpenAI in 2021 over strategic and safety disagreements.
## Funding
- Microsoft invested $1B in 2019, $10B in 2023
- Valuation reached ~$80B by early 2024

test-wiki-page/wiki/entities/sam-altman.md

@@ -0,0 +1,33 @@
---
title: Sam Altman
created: 2024-01-20
updated: 2024-01-20
tags: [person, CEO, OpenAI]
source: https://en.wikipedia.org/wiki/Sam_Altman
---
# Sam Altman
Sam Altman is the CEO of [[openai]], which he has led since 2019. Before OpenAI, he was President of Y Combinator (2014–2019), one of the world's most influential startup accelerators.
## Role at OpenAI
Altman has been the primary public face of OpenAI and a key driver of its commercial strategy, including:
- The $10B Microsoft partnership
- Launch and rapid growth of ChatGPT
- Pushing development of GPT-4 and beyond
## November 2023 Board Crisis
In November 2023, the OpenAI board briefly fired Altman, citing concerns about his candor with the board. Within five days, he was reinstated following a staff revolt (nearly all employees threatened to quit) and investor pressure. The episode raised significant questions about OpenAI's governance structure.
> [!WARNING]
> The board crisis exposed deep tensions between [[openai]]'s non-profit roots and its commercial ambitions. A restructuring of governance followed in 2024.
## Views on AGI
Altman is a prominent advocate for the belief that AGI is achievable within a few years and that safety research must happen in parallel with capability development — a view that distinguishes him from more skeptical researchers but aligns with [[openai]]'s mission framing.
## Relationship with Anthropic
The departure of the Amodei team to found [[anthropic]] happened in part due to disagreements with Altman and others over strategy and safety. The two companies now compete directly.

test-wiki-page/wiki/index.md

@@ -0,0 +1,30 @@
# Index
## Entities
- [OpenAI](entities/openai.md) — AI research company behind GPT series and ChatGPT
- [Anthropic](entities/anthropic.md) — AI safety company behind Claude series
- [Google DeepMind](entities/google-deepmind.md) — Google's merged AI research division
- [Sam Altman](entities/sam-altman.md) — CEO of OpenAI
- [Andrej Karpathy](entities/andrej-karpathy.md) — AI researcher, former OpenAI/Tesla
## Concepts
- [LLM Scaling Laws](concepts/llm-scaling-laws.md) — Predictable performance improvements with compute, data, and parameters
- [Context Window](concepts/context-window.md) — Maximum token capacity of a model in one inference call
- [Retrieval-Augmented Generation](concepts/retrieval-augmented-generation.md) — Combining retrieval from external stores with LLM generation
- [Chain-of-Thought](concepts/chain-of-thought.md) — Prompting technique that elicits step-by-step reasoning
- [Agent Loop](concepts/agent-loop.md) — Observe → Think → Act cycle for autonomous LLM agents
## Sources
- [Attention Is All You Need](sources/attention-is-all-you-need.md) — Vaswani et al. 2017, transformer architecture paper
- [GPT-4 Technical Report](sources/gpt4-technical-report.md) — OpenAI 2023, GPT-4 capabilities and evaluations
- [Claude Model Card](sources/claude-model-card.md) — Anthropic 2024, Claude 3 model card and safety evals
- [ReAct Paper](sources/react-paper.md) — Yao et al. 2022, reasoning + acting in language models
## Synthesis
- [Why Context Window Size Matters](synthesis/why-context-window-matters.md) — Long context vs. RAG trade-offs and implications
- [RAG vs Fine-Tuning](synthesis/rag-vs-fine-tuning.md) — When to retrieve vs. when to train
- [AI Agent Patterns](synthesis/ai-agent-patterns.md) — Common architectural patterns emerging in production agent systems

test-wiki-page/wiki/log.md

@@ -0,0 +1,49 @@
# Activity Log
## [2024-01-15 09:00:00] init | Wiki initialized — domain: AI agents & LLMs
## [2024-01-15 09:15:00] ingest | attention-is-all-you-need — transformer architecture paper (Vaswani et al. 2017)
## [2024-01-15 09:30:00] ingest | openai entity page created
## [2024-01-15 09:35:00] ingest | google-deepmind entity page created
## [2024-01-15 09:40:00] ingest | llm-scaling-laws concept page created
## [2024-01-20 11:00:00] ingest | gpt4-technical-report — ingested OpenAI GPT-4 technical report
## [2024-01-20 11:20:00] ingest | context-window concept page created from GPT-4 report
## [2024-01-20 11:35:00] ingest | sam-altman entity page created
## [2024-01-22 14:00:00] query | searched "scaling laws compute optimal" — read llm-scaling-laws, attention-is-all-you-need
## [2024-02-01 10:00:00] ingest | anthropic entity page created
## [2024-02-01 10:20:00] ingest | claude-model-card — ingested Claude 3 model card
## [2024-02-05 15:00:00] ingest | andrej-karpathy entity page created
## [2024-02-10 09:00:00] ingest | react-paper — ingested ReAct paper (Yao et al. 2022)
## [2024-02-10 09:30:00] ingest | chain-of-thought concept page created
## [2024-02-10 09:45:00] ingest | agent-loop concept page created
## [2024-02-12 14:00:00] query | searched "agent tool use loop" — read agent-loop, chain-of-thought, react-paper
## [2024-02-15 10:00:00] ingest | retrieval-augmented-generation concept page created
## [2024-02-20 11:00:00] synthesis | why-context-window-matters — cross-cutting analysis of context vs retrieval
## [2024-02-25 14:00:00] synthesis | rag-vs-fine-tuning — comparison of retrieval and fine-tuning approaches
## [2024-03-01 09:00:00] synthesis | ai-agent-patterns — patterns emerging in production agent systems
## [2024-03-05 10:00:00] maintenance | ran wiki lint — fixed 2 broken wikilinks, added missing frontmatter to 1 page
## [2024-03-10 11:00:00] query | searched "RAG production latency" — read rag-vs-fine-tuning, retrieval-augmented-generation
## [2024-03-15 09:00:00] ingest | updated openai page with GPT-4o release notes
## [2024-03-20 10:00:00] query | searched "Anthropic safety evals" — read claude-model-card, anthropic

test-wiki-page/wiki/sources/attention-is-all-you-need.md

@@ -0,0 +1,53 @@
---
title: Attention Is All You Need
created: 2024-01-15
updated: 2024-01-15
tags: [paper, transformer, attention, architecture]
source: https://arxiv.org/abs/1706.03762
---
# Attention Is All You Need
**Authors**: Vaswani, Shazeer, Parmar, Uszkoreit, Jones, Gomez, Kaiser, Polosukhin ([[google-deepmind|Google Brain]] / Google Research)
**Published**: June 2017 (NeurIPS 2017)
## Summary
This paper introduced the **Transformer** architecture, replacing recurrent and convolutional networks with a mechanism called **self-attention** for sequence-to-sequence tasks. It became the foundation for virtually every large language model built since 2018.
## Key Contributions
### Self-Attention Mechanism
Each token in the sequence computes attention weights over all other tokens, enabling the model to relate positions regardless of distance:
```
Attention(Q, K, V) = softmax(QK^T / √d_k) V
```
- **Q** (Query), **K** (Key), **V** (Value) — linear projections of token embeddings
- Division by √d_k keeps the dot products from growing with dimension and saturating the softmax, which would otherwise produce vanishing gradients (see the sketch below)
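A minimal numpy rendering of the formula (shapes and values are illustrative, not from the paper):
```
import numpy as np

def attention(Q, K, V):
    # softmax(Q K^T / sqrt(d_k)) V, with a numerically stable softmax
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
print(attention(Q, K, V).shape)  # (4, 8): one output vector per query position
```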
### Multi-Head Attention
Instead of a single attention function, the paper uses h=8 parallel attention heads, each learning different relationship types (syntax, coreference, semantics, etc.)
### Positional Encoding
Since self-attention is permutation-invariant, sinusoidal position encodings inject order information.
### Architecture
```
Encoder: Embedding → N × (Multi-Head Attn → FFN) → output representations
Decoder: Embedding → N × (Masked MHA → Cross-Attn → FFN) → token probabilities
```
## Why It Matters
1. **Parallelism**: Unlike RNNs, all positions are processed simultaneously during training → orders of magnitude faster on GPUs
2. **Long-range dependencies**: O(1) path length between any two positions vs O(n) for RNNs
3. **Scale**: The architecture scales smoothly — see [[llm-scaling-laws]]
## Impact
The transformer is now the backbone of GPT (see [[sources/gpt4-technical-report]]), Claude (see [[sources/claude-model-card]]), Gemini, and essentially every frontier model. It is one of the most cited ML papers of all time.
> [!NOTE]
> The paper's title was partly a provocation — at the time, the dominant view was that attention was useful *alongside* recurrence, not as a replacement. The title's confidence was validated rapidly.

test-wiki-page/wiki/sources/claude-model-card.md

@@ -0,0 +1,57 @@
---
title: Claude 3 Model Card
created: 2024-02-01
updated: 2024-02-01
tags: [paper, Claude, Anthropic, safety, evals]
source: https://www-cdn.anthropic.com/de8ba9b01c9ab7cbabf5c33b80b7bbc618857627/Model_Card_Claude_3.pdf
---
# Claude 3 Model Card
**Authors**: [[anthropic]]
**Published**: March 2024
## Overview
The Claude 3 model card covers the three-tier Claude 3 family: **Haiku** (fast/cheap), **Sonnet** (balanced), and **Opus** (most capable). This is Anthropic's first model card for a released frontier model family.
## Model Tiers
| Model | Speed | Cost | Best For |
|-------|-------|------|----------|
| Claude 3 Haiku | Fastest | Lowest | High-volume tasks, classification |
| Claude 3 Sonnet | Moderate | Medium | General use, coding |
| Claude 3 Opus | Slowest | Highest | Complex reasoning, research |
## Context Window
All Claude 3 models support a 200K token [[context-window]] — roughly 150,000 words. This is the largest commercially available context window at launch, enabling:
- Full-book analysis in a single call
- Large codebase review
- Long research sessions without chunking
## Safety Evaluations
The model card is notable for its safety eval methodology:
### Responsible Scaling Policy (RSP)
Anthropic's RSP defines capability thresholds ("ASL" levels) that trigger additional safeguards before deployment. Claude 3 was evaluated against ASL-3 criteria (uplift for CBRN weapons, autonomous replication).
### CBRN Uplift Testing
Red-teamers tested whether models provided meaningful uplift for chemical, biological, radiological, or nuclear harm. Claude 3 Opus did not meet the ASL-3 threshold for dangerous uplift.
### Child Safety
Absolute behavioral refusals are maintained regardless of prompt framing.
## Benchmark Results
Claude 3 Opus outperforms GPT-4 on several benchmarks at release:
- MMLU: 86.8% vs GPT-4's 86.4%
- HumanEval: 84.9% vs GPT-4's 67.0%
- GSM8K: 95.0% vs GPT-4's 92.0%
> [!NOTE]
> The model card openly discusses failure modes and known limitations — a more candid approach than [[sources/gpt4-technical-report]], which omitted most technical details.
> [!TIP]
> For [[agent-loop]] applications, Claude 3 Sonnet's combination of large context, strong instruction following, and moderate cost makes it a practical default choice.

test-wiki-page/wiki/sources/gpt4-technical-report.md

@@ -0,0 +1,55 @@
---
title: GPT-4 Technical Report
created: 2024-01-20
updated: 2024-01-20
tags: [paper, GPT-4, OpenAI, multimodal, evals]
source: https://arxiv.org/abs/2303.08774
---
# GPT-4 Technical Report
**Authors**: OpenAI
**Published**: March 2023
## Overview
The GPT-4 Technical Report describes [[openai]]'s fourth-generation large language model. Notably, the report deliberately withholds most technical details (parameter count, training data, architecture specifics) for competitive and safety reasons — a controversial decision.
## Key Capabilities
### Multimodality
GPT-4 accepts both image and text inputs (text outputs only at launch). It can describe images, read charts, solve visual math problems, and interpret screenshots.
### Benchmark Performance
GPT-4 achieved human-level or above-human-level performance on several professional exams:
| Exam | GPT-3.5 Percentile | GPT-4 Percentile |
|------|--------------------|-----------------|
| Bar Exam | ~10th | ~90th |
| SAT | ~87th | ~93rd |
| GRE Verbal | ~63rd | ~99th |
| USMLE Step 1 | ~53rd | ~75th+ |
### Extended Context Window
GPT-4 launched with an 8K token [[context-window]], later extended to 32K — a significant increase over GPT-3.5's 4K, enabling longer documents and conversation histories.
## RLHF and Alignment
The report describes an extensive RLHF (Reinforcement Learning from Human Feedback) pipeline:
1. Supervised fine-tuning on human-written demonstrations
2. Reward model trained on human comparisons
3. PPO optimization against the reward model
4. Rule-based reward models (RBRMs) for absolute safety behaviors
## Evals and Red-Teaming
OpenAI engaged external red-teamers before launch to test for:
- Dangerous capability elicitation (bio/chem/cyber)
- Jailbreaks and policy violations
- Deceptive alignment risks
> [!NOTE]
> The report introduced a reproducible eval framework. The [[llm-scaling-laws]] suggest GPT-4's capabilities were largely predictable from the compute budget used — though OpenAI has not confirmed exact training FLOPs.
> [!WARNING]
> Because so many technical details were withheld, the GPT-4 Technical Report is more useful as a capabilities/evals reference than as an architecture reference. Contrast with [[sources/attention-is-all-you-need]], which is fully open.

test-wiki-page/wiki/sources/react-paper.md

@@ -0,0 +1,58 @@
---
title: ReAct — Synergizing Reasoning and Acting in Language Models
created: 2024-02-10
updated: 2024-02-10
tags: [paper, agents, reasoning, acting, ReAct]
source: https://arxiv.org/abs/2210.03629
---
# ReAct: Synergizing Reasoning and Acting in Language Models
**Authors**: Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, Yuan Cao (Princeton / Google Brain)
**Published**: October 2022 (ICLR 2023)
## Summary
ReAct proposes interleaving **Rea**soning traces (chain-of-thought) and **Act**ing (tool calls) in a single LLM prompt. By alternating between "Thought: ..." and "Action: ..." steps, the model produces interpretable, grounded reasoning that is directly coupled to external tool use.
## The ReAct Pattern
```
Question: What is the capital of the country that won the 2022 FIFA World Cup?
Thought: I need to find which country won the 2022 World Cup first.
Action: Search[2022 FIFA World Cup winner]
Observation: Argentina won the 2022 FIFA World Cup, defeating France on penalties.
Thought: Argentina's capital is Buenos Aires.
Action: Finish[Buenos Aires]
```
## Why ReAct Works
**[[chain-of-thought]] alone** hallucinates facts — the model reasons but has no way to verify claims.
**Acting alone** (without reasoning) produces brittle tool use — the model can't plan multi-step retrieval.
**ReAct combines them**: reasoning guides which actions to take; observations from actions correct and update the reasoning trace.
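A minimal sketch of the driver loop this implies; `llm` and `tools` here are hypothetical stand-ins for a model call and a tool registry, not any specific framework's API:
```
import re

def react_loop(question, llm, tools, max_steps=8):
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        step = llm(transcript)               # emits "Thought: ...\nAction: Tool[arg]"
        transcript += step + "\n"
        m = re.search(r"Action:\s*(\w+)\[(.*)\]", step)
        if not m:
            break                            # no action emitted -> stop
        tool, arg = m.group(1), m.group(2)
        if tool == "Finish":
            return arg                       # final answer
        transcript += f"Observation: {tools[tool](arg)}\n"
    return None
```
Because each Observation is appended to the transcript, every subsequent Thought is conditioned on real tool output rather than the model's guesses.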
## Tasks Evaluated
The paper evaluates ReAct on:
1. **HotpotQA** — multi-hop question answering requiring chained Wikipedia lookups
2. **FEVER** — fact verification requiring evidence retrieval
3. **ALFWorld** — interactive text game requiring navigation + object manipulation
4. **WebShop** — web shopping simulation requiring search + selection
ReAct outperformed chain-of-thought-only and action-only baselines on all tasks.
## Influence on Agent Frameworks
ReAct is the conceptual backbone of most modern [[agent-loop]] implementations:
- LangChain's AgentExecutor
- AutoGPT and BabyAGI
- Claude's tool use / function calling
- OpenAI's Assistants API with function calling
> [!TIP]
> The key insight is that reasoning traces make agents **debuggable** — you can read the Thought steps to understand why the agent chose an action. This is essential for production agent systems. See [[synthesis/ai-agent-patterns]].

test-wiki-page/wiki/synthesis/ai-agent-patterns.md

@@ -0,0 +1,84 @@
---
title: AI Agent Patterns
created: 2024-03-01
updated: 2024-03-01
tags: [synthesis, agents, architecture, patterns, production]
---
# AI Agent Patterns
After reading [[sources/react-paper]], tracking several open-source agent frameworks, and following production deployments, I've identified recurring architectural patterns. This is a living synthesis note.
## Pattern 1: ReAct Loop (Reasoning + Acting)
The baseline pattern from [[sources/react-paper]]. The model alternates between reasoning traces and tool calls until it reaches a final answer.
**Best for**: Single-agent tasks with well-defined tools and clear success criteria.
**Limitations**: Fragile on long chains; one bad tool call can derail the whole trace.
## Pattern 2: Plan-and-Execute
The agent first generates a complete plan (list of steps), then executes each step in order, potentially with a different (cheaper) model for execution.
```
Planner LLM → [step1, step2, step3, ...]
Executor LLM → execute(step1) → result
Executor LLM → execute(step2) → result
...
```
**Best for**: Tasks where the subtasks are well-understood and execution is mechanical.
## Pattern 3: Multi-Agent Orchestration
Multiple specialized agents collaborate, each with a different prompt/tools/model tier:
- **Orchestrator**: Plans, assigns tasks, integrates results
- **Researcher**: Searches the web, reads documents
- **Coder**: Writes and executes code
- **Critic**: Reviews outputs for quality
**Best for**: Complex tasks requiring diverse expertise (e.g., "research X, write a report, add visualizations").
## Pattern 4: Reflection and Self-Critique
After completing a task, the agent reviews its own output and iteratively improves it. Related to [[chain-of-thought]] self-consistency.
```
Draft answer → Critique(draft) → Revised answer → Critique(revised) → ...
```
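A minimal sketch of this pattern, with hypothetical `draft`, `critique`, and `revise` standing in for LLM calls:
```
def reflect(task, draft, critique, revise, max_rounds=3):
    answer = draft(task)
    for _ in range(max_rounds):        # cap rounds to avoid endless self-critique
        feedback = critique(task, answer)
        if feedback == "OK":
            return answer
        answer = revise(task, answer, feedback)
    return answer
```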
## Pattern 5: Memory-Augmented Agents
The [[agent-loop]] integrates with a persistent external memory (vector DB, structured wiki). After each session, key observations are written to memory; at the start of each session, relevant memories are retrieved.
This is exactly what llmwiki-cli supports:
- **Write**: `wiki write wiki/entities/new-finding.md`
- **Retrieve**: `wiki search "relevant topic"`
- **Connect**: add `[[wikilinks]]` to create a knowledge graph
Using [[retrieval-augmented-generation]] within the agent loop transforms the agent from a stateless responder to a system that accumulates expertise over time.
## Key Failure Modes Across All Patterns
> [!WARNING]
> The most common production failure is **context overflow**: long agent sessions fill the [[context-window]], causing the model to lose track of earlier observations or instructions. Always monitor token usage in production agents.
| Failure | Pattern | Fix |
|---------|---------|-----|
| Hallucinated tool calls | All | Strict JSON schema validation |
| Context overflow | ReAct, Plan-Execute | Summarize history at checkpoints |
| Divergent multi-agent | Multi-Agent | Shared state store with conflict resolution |
| Self-critique loops | Reflection | Max iteration limit |
| Stale memory | Memory-Augmented | TTL on memory entries + periodic maintenance |
## Recommendations
1. Start with ReAct — it's the simplest and most debuggable
2. Add memory (wiki/vector store) early — retrofitting is hard
3. Use smaller models for execution, larger for planning
4. Log all agent traces — you need them for debugging
5. Design for graceful degradation — agents will fail; plan the fallback
> [!NOTE]
> [[openai]]'s o1 / o3 models (late 2024) internalize multi-step reasoning into a private "thinking" chain before output. This reduces the need for explicit ReAct prompting but makes the reasoning less inspectable — a trade-off for production debugging.

test-wiki-page/wiki/synthesis/rag-vs-fine-tuning.md

@@ -0,0 +1,63 @@
---
title: RAG vs Fine-Tuning
created: 2024-02-25
updated: 2024-02-25
tags: [synthesis, RAG, fine-tuning, architecture, trade-offs]
---
# RAG vs Fine-Tuning
Two primary strategies exist for giving an LLM access to domain-specific knowledge: **[[retrieval-augmented-generation]]** (inject knowledge at inference time) and **fine-tuning** (bake knowledge into weights at training time). This note synthesizes the trade-offs based on what I've read and observed.
## The Core Trade-Off
| Dimension | RAG | Fine-Tuning |
|-----------|-----|-------------|
| Knowledge freshness | Real-time — update the index, done | Requires re-training |
| Knowledge accuracy | High — verbatim retrieval | Can hallucinate/distort during training |
| Cost to update | Low (index update) | High (GPU training run) |
| Latency | Added retrieval step | No retrieval overhead |
| Interpretability | Can cite retrieved chunks | Knowledge is opaque in weights |
| Style/behavior change | Cannot change model behavior | Can reshape how model responds |
## When to Use RAG
RAG wins when:
- Knowledge changes frequently (news, docs, code, internal data)
- You need to cite sources or show retrieved evidence
- You want to control exactly what information the model uses
- Budget is limited (no GPU training required)
- You're building on top of a third-party model API
Most production knowledge-base Q&A systems use RAG. This includes most enterprise LLM applications.
## When to Use Fine-Tuning
Fine-tuning wins when:
- You need to change model **behavior** or **style**, not just inject knowledge
- You need to teach the model a new format, new domain vocabulary, or new reasoning patterns
- You have a massive labeled dataset and need consistent, fast responses at scale
- You want smaller models to match larger ones on a specific task (distillation)
## The False Dichotomy
Many production systems use both:
```
Query → Retrieve relevant docs (RAG)
→ Fine-tuned model generates response grounded in retrieved docs
```
Example: [[anthropic]]'s Claude is fine-tuned for helpfulness and safety (Constitutional AI), but in deployment it uses tool calls to retrieve external knowledge — RAG on top of a fine-tuned base.
## RAG for Personal Knowledge
For individual researchers and LLM agents, RAG via a structured wiki (llmwiki-cli) is almost always the right choice over fine-tuning:
- Your knowledge base grows and changes constantly
- You cannot fine-tune a commercial API model
- Retrieval via `wiki search` is fast and inspectable
See [[synthesis/why-context-window-matters]] for the related question of when to retrieve vs. just using long context.
> [!TIP]
> A quick heuristic: if you're trying to give the model **facts**, use RAG. If you're trying to give the model **skills**, use fine-tuning.

test-wiki-page/wiki/synthesis/why-context-window-matters.md

@@ -0,0 +1,50 @@
---
title: Why Context Window Size Matters
created: 2024-02-20
updated: 2024-02-20
tags: [synthesis, context-window, RAG, architecture, trade-offs]
---
# Why Context Window Size Matters
The [[context-window]] has become one of the most strategically important axes of LLM competition. This note synthesizes what I've learned tracking the space and examines the real-world trade-offs.
## The Race to Longer Context
[[anthropic]] pushed from 9K tokens (Claude 1) to 200K (Claude 3) in about a year. [[google-deepmind]] announced Gemini 1.5 Pro with 1M tokens in February 2024. [[openai]]'s GPT-4 lags behind in context length but leads in other areas.
This arms race is driven by a simple user demand: people want to paste in entire codebases, books, or transcripts and ask questions about them without worrying about chunking.
## What Large Context Enables
1. **Full-document reasoning**: Summarize a 300-page report, compare two books, review a whole codebase — all in one shot
2. **Long agent sessions**: An [[agent-loop]] that runs for 50+ steps accumulates substantial history; 200K tokens buys significantly more headroom than 32K
3. **Fewer RAG dependencies**: With enough context, you can skip the retrieval pipeline entirely and just load all relevant data upfront — simpler architecture, lower latency
4. **In-context learning at scale**: More examples fit → better few-shot performance
## What Large Context Doesn't Solve
> [!WARNING]
> Long context is not a substitute for persistent knowledge management.
- **Retrieval within context is imperfect**: Research shows models attend poorly to information in the middle of very long contexts ("lost in the middle" problem). Beginning and end of context get disproportionate attention.
- **Cost scales with tokens**: A 200K token call to Claude 3 Opus costs significantly more than a focused 2K token call after [[retrieval-augmented-generation]] retrieval.
- **Knowledge doesn't accumulate across sessions**: Even a 1M token window resets between conversations. A wiki persists.
- **Latency**: First-token latency grows with context length. For interactive applications, long-context calls are slow.
## The Right Mental Model
Think of context window and [[retrieval-augmented-generation]] as **complementary**, not competing:
| Tool | Best For |
|------|---------|
| Large context | One-shot processing of a known, bounded document set |
| RAG | Searching across a large corpus to find what's relevant |
| External wiki | Accumulating knowledge persistently across sessions |
> [!NOTE]
> [[anthropic]]'s position — leading on context length — makes sense given their safety focus. Long context reduces the need for tool use, which reduces the attack surface from malicious tool outputs in agentic settings.
## Implication for Knowledge Management
Even with 1M token context, you still need a structured knowledge base. A year of research notes, dozens of papers, hundreds of observations — this grows far beyond any context window. The right approach is: use [[retrieval-augmented-generation]] or a wiki (like llmwiki-cli) to surface the 5–10 most relevant pages, then use context to reason over them.