Day 3 - Session 2: Model Differences and the Grain of Services
Same prompt, different failures
Welcome Back!
Where We’re Going
- Test same task across multiple models
- Watch small models fail spectacularly
- Understand each service’s “grain”
- Learn to match tool to task
What We’ll Learn This Session
By the end of this session, you will be able to:
- Analyze: Compare extraction accuracy across models
- Evaluate: Select appropriate models for specific tasks
- Predict: Identify patterns that signal likely failure
The Grain of Wood, The Grain of Models
Each model and service has a “grain”: a direction in which it is easier to achieve good results
- Claude
- Gemini
- Local models
- OpenAI’s GPT-5 (implications of fine tuning)
Fighting the grain = poor results + high token cost
Exercise: Model Torture Test (45 min)
Your Document
Use your bibliography from this morning. Work alone or with a partner; if you pair up, split the models between you so each of you tests a different subset.
Your Task
Same prompt to each model:
Using the attached document, please extract three direct quotes that best support the claim that [adapt to your topic]. For each quote, provide:
1. The exact text in quotation marks
2. The section where it appears
3. One sentence explaining its relevance
Models to Test
- o4-mini
- Qwen 3 Max
- Kimi K2
- GPT-OSS
- Claude Sonnet 4
- Claude Opus 4.1
- Nous Hermes 4
- Gemini 2.5 Pro (on aistudio.google.com)
- GPT-5 on chatgpt.com
Recording Your Results
On the Conceptboard, create a table:
| Model | Accurate Quotes | Hallucinated | Response Quality |
|---|---|---|---|
| Claude | ?/3 | ?/3 | |
Post a green sticky when your group is finished.
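To fill in the “Accurate Quotes” and “Hallucinated” columns quickly, you can confirm whether each extracted quote appears verbatim in your source text. This is a minimal sketch; the sample text and quotes below are placeholders, not drawn from any real document:

```python
import re

def normalize(text: str) -> str:
    """Lowercase, unify curly quotes/apostrophes, and collapse whitespace
    so minor formatting differences don't cause false mismatches."""
    text = text.replace("\u201c", '"').replace("\u201d", '"').replace("\u2019", "'")
    return re.sub(r"\s+", " ", text).strip().lower()

def quote_in_source(quote: str, source: str) -> bool:
    """True if the quote appears verbatim (after normalization) in the source."""
    return normalize(quote) in normalize(source)

# Placeholder example -- substitute your own document and the model's quotes.
source = "The archive preserves  \u201cthe memory of the community\u201d for future readers."
print(quote_in_source("the memory of the community", source))   # accurate quote
print(quote_in_source("the memory of communities", source))     # hallucinated variant
```

This is the scripted equivalent of Ctrl+F: an exact-match check catches hallucinated or silently merged quotes, though it will not flag a quote that is accurate but attributed to the wrong section.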
What Patterns Do You See?
Expected Failures
- Small models: Partial quotes merged together
- Chinese models: Strong on some topics, weak on others
- All models: Confidence regardless of accuracy
The Key Insight
Model size ≠ task fitness
Expensive ≠ better for everything
Discussion: When to Use Which Model?
The Cost-Quality Trade-off
Quick Math
- o4-mini
- Qwen 3 Max
- Kimi K2
- GPT-OSS
- Claude Sonnet 4
- Nous Hermes 4
- Gemini 2.5 Pro
Critical Question
Is 95% accuracy worth 30x the cost? (Sometimes yes, sometimes no)
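One way to make the trade-off concrete is to compare cost per *correct* answer rather than cost per call: if only a fraction of calls succeed, you pay for the failures too. All prices and accuracy figures below are made-up placeholders, not real rates for any model; only the shape of the arithmetic matters:

```python
def cost_per_correct(price_per_call: float, accuracy: float) -> float:
    """Expected cost per correct extraction: total spend divided by
    the fraction of calls that actually succeed."""
    return price_per_call / accuracy

# Hypothetical numbers for illustration only.
cheap = cost_per_correct(price_per_call=0.01, accuracy=0.70)   # small model
costly = cost_per_correct(price_per_call=0.30, accuracy=0.95)  # frontier model

print(f"cheap model:  ${cheap:.3f} per correct answer")
print(f"costly model: ${costly:.3f} per correct answer")
print(f"premium multiplier: {costly / cheap:.1f}x")
```

Note what the math leaves out: if a wrong answer is cheap to catch (Ctrl+F on a quote), the small model's failures cost little; if errors slip into your research unnoticed, the effective cost of the cheap model is much higher than this ratio suggests.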
Discussion: Failure Patterns (20 min)
In Groups of 3-4
- Which model failed most interestingly?
- Could you predict failure before seeing output?
- What warning signs appeared in responses?
Looking Ahead
Tomorrow Morning: Infrastructure
- Managing conversations across sessions
- Building your research assistant
- Context window management
Tomorrow Afternoon: The Boring Important Stuff
- Who owns your outputs?
- What happens to your prompts?
Answering Today’s Question
What can we verify?
- Direct quotes: Verifiable via Ctrl+F
- Model behavior: Consistent patterns across tasks
- Failure modes: Predictable based on model choice
- Cost/benefit: Measurable and comparable
But NOT:
- Understanding
- Reasoning quality
- “Truth” of interpretations
Tonight’s Task
- Continue building your bibliography. Try to find 3-5 sources with good blockquotes.
- Annotate your prompts and the outputs.
Sticky Note Feedback
- On your green sticky, write one specific thing we did well today
- On your pink sticky, write one specific thing we can improve for tomorrow
See you tomorrow at 9:00!