1. TL;DR: Quick Decision Table
| Your Situation | Best Approach |
|---|---|
| Need answers from private docs/DB | → RAG |
| Need real-time / live data | → RAG or Agents |
| Need custom tone / style / format | → Fine-tuning |
| Need domain-specific vocabulary | → Fine-tuning |
| Need to take actions (web, APIs, tools) | → Agents |
| Need multi-step reasoning / planning | → Agents |
| Budget is tight | → RAG (cheapest) |
| Speed is critical (<500ms) | → Fine-tuning |
| Complex enterprise workflows | → Agents + RAG |
The real answer for most production systems in 2026: you combine all three. But let's understand each one first.
2. RAG: Retrieval-Augmented Generation
🔵 RAG in One Sentence
Retrieve relevant context from your knowledge base at query time, inject it into the prompt, let the LLM answer using that context.
How it works
- Ingest: Chunk your documents → embed them → store in a vector DB
- Query: Embed the user's question → find top-K similar chunks → retrieve
- Generate: Feed retrieved chunks + question to the LLM → get a grounded answer
Minimal RAG with LangChain
```python
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_community.vectorstores import Chroma
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.chains import RetrievalQA

# 1. Ingest documents (your_docs: a list of LangChain Document objects)
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = splitter.split_documents(your_docs)

# 2. Embed and store (OpenAIEmbeddings reads OPENAI_API_KEY from the environment)
embeddings = OpenAIEmbeddings()
vectordb = Chroma.from_documents(chunks, embeddings, persist_directory="./chroma_db")

# 3. Query
llm = ChatOpenAI(
    model="deepseek-chat",
    base_url="https://api.deepseek.com",
    api_key="your-key",
)
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=vectordb.as_retriever(search_kwargs={"k": 4}),
)
answer = qa_chain.invoke({"query": "What is our refund policy?"})
print(answer["result"])
```
RAG: Pros & Cons
| ✅ Pros | ❌ Cons |
|---|---|
| No model training required | Retrieval quality matters a lot |
| Knowledge stays up-to-date | Latency from retrieval step (+100-500ms) |
| Cheap (no GPU needed) | Context window limits how much you can inject |
| Citable sources ("according to doc X...") | Fails on questions needing whole-document reasoning |
| Easy to update knowledge | Chunking strategy heavily affects quality |
Best for: Customer support bots, internal knowledge bases, document Q&A, code search, legal/medical document retrieval.
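Since citability is one of RAG's headline advantages, it's worth wiring up explicitly. A small extension of the chain above: RetrievalQA accepts `return_source_documents=True`, which returns the retrieved chunks alongside the answer so you can cite them.

```python
# Same chain as above, but also return the chunks used for the answer.
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=vectordb.as_retriever(search_kwargs={"k": 4}),
    return_source_documents=True,
)
result = qa_chain.invoke({"query": "What is our refund policy?"})
print(result["result"])
for doc in result["source_documents"]:
    # Metadata keys depend on your loader; "source" is the common default.
    print("cited:", doc.metadata.get("source"))
```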
3. Fine-tuning: Teaching the Model
🟣 Fine-tuning in One Sentence
Update a pre-trained model's weights on your domain-specific data so it internalizes your patterns, tone, and knowledge.
When fine-tuning makes sense
- You need a very specific output format (e.g., always return JSON, always follow a template)
- You need a custom tone that prompting alone can't reliably enforce
- You have a narrow, well-defined task with thousands of examples
- You need maximum speed: fine-tuned smaller models beat large prompted models on latency
- You want to reduce prompt length (the model already knows the context)
Fine-tuning with OpenAI API
```python
from openai import OpenAI

client = OpenAI(api_key="your-openai-key")

# 1. Prepare training data (JSONL format)
# Each line: {"messages": [{"role": "system", "content": "..."},
#                          {"role": "user", "content": "..."},
#                          {"role": "assistant", "content": "..."}]}

# 2. Upload training file
with open("training_data.jsonl", "rb") as f:
    file_obj = client.files.create(file=f, purpose="fine-tune")

# 3. Create fine-tuning job
job = client.fine_tuning.jobs.create(
    training_file=file_obj.id,
    model="gpt-4o-mini-2024-07-18",  # fine-tunable snapshot name; or gpt-3.5-turbo
    hyperparameters={"n_epochs": 3},
)
print(f"Job ID: {job.id}")

# 4. Monitor (poll until status == "succeeded")
# job = client.fine_tuning.jobs.retrieve(job.id)
# Use job.fine_tuned_model once complete

# 5. Use the fine-tuned model
response = client.chat.completions.create(
    model="ft:gpt-4o-mini:your-org:your-model:abc123",
    messages=[{"role": "user", "content": "Classify: 'I hate this product'"}],
)
print(response.choices[0].message.content)  # → negative
```
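Step 1 in the listing above is only described in comments. Here's a minimal sketch of actually writing training_data.jsonl, assuming labeled (text, label) pairs; the system prompt and the two examples are placeholders, and in practice you want hundreds to thousands of them:

```python
import json

# Hypothetical labeled examples; replace with your real dataset.
examples = [
    ("I hate this product", "negative"),
    ("Works exactly as advertised", "positive"),
]

with open("training_data.jsonl", "w") as f:
    for text, label in examples:
        record = {"messages": [
            {"role": "system", "content": "Classify sentiment as positive or negative."},
            {"role": "user", "content": f"Classify: '{text}'"},
            {"role": "assistant", "content": label},
        ]}
        f.write(json.dumps(record) + "\n")
```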
Fine-tuning: Pros & Cons
| ✅ Pros | ❌ Cons |
|---|---|
| Fastest inference (smaller model) | Expensive to train ($10s–$1,000s) |
| Best for consistent format/tone | Static: goes stale after the training cutoff |
| Shorter prompts = lower API cost | Needs high-quality labeled data (hundreds to thousands of examples) |
| Can outperform larger base models on narrow tasks | Doesn't generalize beyond training distribution |
Best for: Classification, named entity recognition, format normalization, brand-voice generation, SQL generation, specialized coding tasks.
4. AI Agents: The LLM that Acts
🟢 AI Agents in One Sentence
Give the LLM tools (web search, code execution, APIs, file read/write) and let it reason, plan, and take multi-step actions to complete a goal.
Core agent loop (ReAct pattern)
```python
from openai import OpenAI
import json, subprocess

client = OpenAI(api_key="your-deepseek-key", base_url="https://api.deepseek.com")

tools = [
    {"type": "function", "function": {
        "name": "run_python",
        "description": "Execute Python code and return stdout",
        "parameters": {"type": "object", "properties": {
            "code": {"type": "string", "description": "Python code to run"}
        }, "required": ["code"]}
    }},
    {"type": "function", "function": {
        "name": "web_search",
        "description": "Search the web for current information",
        "parameters": {"type": "object", "properties": {
            "query": {"type": "string"}
        }, "required": ["query"]}
    }},
]

def execute_tool(name, args):
    if name == "run_python":
        result = subprocess.run(["python", "-c", args["code"]],
                                capture_output=True, text=True, timeout=10)
        return result.stdout or result.stderr
    elif name == "web_search":
        return f"[Search results for: {args['query']}]"  # replace with a real search call
    return f"Unknown tool: {name}"  # always return a string for the tool message

def agent_loop(goal, max_turns=10):
    messages = [{"role": "user", "content": goal}]
    for _ in range(max_turns):
        resp = client.chat.completions.create(
            model="deepseek-chat", messages=messages,
            tools=tools, tool_choice="auto",
        )
        msg = resp.choices[0].message
        messages.append(msg)
        if not msg.tool_calls:
            return msg.content  # no more tool calls: the agent is done
        for tc in msg.tool_calls:
            args = json.loads(tc.function.arguments)
            result = execute_tool(tc.function.name, args)
            messages.append({"role": "tool", "tool_call_id": tc.id, "content": result})
    return "Max turns reached"

print(agent_loop("Find the top 3 AI agent frameworks by GitHub stars and create a comparison table"))
```
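The web_search stub above just returns a placeholder string. As one way to wire it to a real engine, here is a sketch assuming the third-party duckduckgo-search package; its call signature has changed across releases, so treat it as an assumption and check the current docs:

```python
from duckduckgo_search import DDGS  # assumption: pip install duckduckgo-search

def web_search(query: str, max_results: int = 3) -> str:
    # Condense the top hits into plain text the agent can read.
    with DDGS() as ddgs:
        hits = ddgs.text(query, max_results=max_results)
    return "\n".join(f"{h['title']}: {h['body']} ({h['href']})" for h in hits)
```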
Agents: Pros & Cons
| ✅ Pros | ❌ Cons |
|---|---|
| Can take real-world actions | Highest latency (multi-step) |
| Handles complex multi-step reasoning | Most expensive (many LLM calls) |
| Can access live data via tools | Can fail or loop unexpectedly |
| Flexible (works on open-ended tasks) | Harder to debug and monitor |
| No training required | Non-deterministic output |
Best for: Research assistants, coding agents, workflow automation, data analysis, browser automation, long-horizon planning tasks.
5. Full Comparison: Cost, Speed, Complexity
| Dimension | RAG | Fine-tuning | Agents |
|---|---|---|---|
| Setup cost | Low ($0–$50) | High ($50–$5,000+) | Medium ($0 + API) |
| Inference cost | Low–Medium | Low (smaller model) | High (many calls) |
| Latency | Medium (retrieval + LLM) | Fast (small model) | Slow (multi-step) |
| Data needed | Documents only | Labeled examples (100s–1000s) | None |
| Handles live data | ✅ (update index) | ❌ (static) | ✅ (via tools) |
| Explainability | High (cite sources) | Medium | Medium (tool trace) |
| Best model sizes | Any | Small–Medium (7B–70B) | Large (70B+) |
| Complexity to build | ★★ | ★★★★ | ★★★ |
| Complexity to maintain | ★★ | ★★★★★ | ★★★ |
6. Decision Tree: Which to Use?
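In short, condensing the TL;DR table into a quick walk-through:

- Need to take actions or do multi-step planning? → Agents
- Need answers grounded in your own or frequently changing documents? → RAG
- Need a fixed format, a specific tone, or sub-second latency? → Fine-tuning
- Complex enterprise workflow touching all of the above? → Combine them (see the next section)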
7. The Real Answer: Combine All Three
Most production LLM applications in 2026 use a combination. Here's how a real enterprise AI assistant works:
- Fine-tuned model → routes and classifies intent (fast, cheap, consistent)
- RAG → retrieves the right knowledge base articles, order history, product docs
- Agent → takes actions: creates a ticket, issues a refund, checks order status via API
```python
# Combined architecture pattern
from openai import OpenAI

client = OpenAI(api_key="your-deepseek-key", base_url="https://api.deepseek.com")

def handle_customer_query(user_message: str, customer_id: str):
    # Step 1: Fine-tuned classifier (fast, cheap gpt-4o-mini)
    intent = classify_intent(user_message)  # → "refund" | "product_question" | "complaint"

    # Step 2: RAG: retrieve relevant context
    if intent in ["product_question", "complaint"]:
        context_docs = retriever.invoke(user_message)  # vector search, built as in section 2
        context = "\n".join(d.page_content for d in context_docs)
    else:
        context = ""

    # Step 3: Agent: answer + take action if needed
    system = f"""You are a helpful customer support agent.
Customer ID: {customer_id}
{f'Relevant docs:{chr(10)}{context}' if context else ''}"""
    messages = [
        {"role": "system", "content": system},
        {"role": "user", "content": user_message},
    ]
    response = client.chat.completions.create(
        model="deepseek-chat",
        messages=messages,
        tools=support_tools,  # check_order, issue_refund, create_ticket
        tool_choice="auto",
    )
    return handle_response(response, messages)  # execute tools if needed, loop as in section 4
```
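classify_intent above is left abstract. Here is a minimal sketch, assuming the fine-tuned classifier from section 3; the model ID is a placeholder, and since it's served by OpenAI it needs its own client rather than the DeepSeek endpoint:

```python
openai_client = OpenAI(api_key="your-openai-key")  # fine-tuned model lives on OpenAI

def classify_intent(user_message: str) -> str:
    # Placeholder fine-tuned model ID, as produced in the fine-tuning section.
    resp = openai_client.chat.completions.create(
        model="ft:gpt-4o-mini:your-org:intent-router:abc123",
        messages=[{"role": "user", "content": user_message}],
        temperature=0,
        max_tokens=5,
    )
    return resp.choices[0].message.content.strip()  # e.g. "refund"
```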
8. Best Tools for Each Approach in 2026
🔵 RAG Stack
| Component | Top Picks |
|---|---|
| Vector DB | Pinecone, Qdrant, Chroma, Weaviate, pgvector |
| Embeddings | OpenAI text-embedding-3-small, Voyage AI, Cohere |
| Orchestration | LangChain, LlamaIndex, Haystack |
| Observability | LangSmith, Langfuse, Helicone |
🟣 Fine-tuning Stack
| Component | Top Picks |
|---|---|
| Cloud fine-tuning | OpenAI (gpt-4o-mini), Together AI, Anyscale |
| Open-source training | Unsloth, LLaMA-Factory, Axolotl |
| Evaluation | RAGAS, DeepEval, Promptfoo |
| Experiment tracking | W&B Weave, MLflow, Comet ML |
🟢 Agent Stack
| Component | Top Picks |
|---|---|
| Frameworks | LangGraph, CrewAI, AutoGen, Google ADK, PydanticAI |
| Tool integration | Composio, Toolhouse, MCP servers |
| Memory | Mem0, Zep, Letta |
| Observability | LangSmith, AgentOps, Arize Phoenix |
| Hosting | E2B (sandboxed code), Modal, Railway |
Explore all 420+ tools in each category at AgDex.ai, the comprehensive AI agent tools directory.
If you're starting from scratch today, a sensible default stack:

- LLM: DeepSeek V4 (`deepseek-chat`), best price-performance
- RAG: LlamaIndex + Qdrant Cloud (free tier for prototypes)
- Agents: LangGraph (most control) or CrewAI (easiest multi-agent)
- Observability: Langfuse (open-source, free self-host)
- Fine-tune later only if format/latency becomes a bottleneck