December 8, 2025
How I Built My 'Ask AI' Web Assistant
A deep dive into the RAG pipeline, LLM prompting strategies, and automated workflows that power the Ask AI feature on folch.ai

Overview
The "Ask AI" feature on my personal website lets visitors ask questions about my career, projects, and tools using natural language. Behind the scenes, it's powered by a RAG (Retrieval-Augmented Generation) pipeline that combines semantic search with LLM reasoning. Here's how I built it.
1. RAG Pipeline
Content Processing
The RAG pipeline starts with processing various content types from my website:
- Projects: MDX files from
src/content/projects/plus metadata fromsrc/lib/data.ts - Journey: Career milestones from
src/content/journey/combined into a single timeline - Tools: AI tools documentation from
src/content/tools/tools.mdx - Blog: Blog posts from
src/content/blog/ - CV & For-Agents: Additional structured content
Each content type is processed by dedicated functions (processProjects(), processJourney(), etc.) that:
- Read MDX files and extract frontmatter
- Combine metadata with content
- Generate clean URLs for each piece of content
- Create
ContentChunkobjects with structured metadata
Chunking Strategy
Content is split into semantic chunks using a sentence-based approach:
function generateChunks(text: string, maxLength: number = 1000): string[] {
// If text is short enough, return as single chunk
if (text.length <= maxLength) {
return text.length >= 20 ? [text.trim()] : [];
}
// Split by sentences and group into chunks
const sentences = text.split(/[.!?]\s+/);
const chunks: string[] = [];
let currentChunk = "";
for (const sentence of sentences) {
if ((currentChunk + sentence).length > maxLength && currentChunk) {
chunks.push(currentChunk.trim());
currentChunk = sentence;
} else {
currentChunk += (currentChunk ? ". " : "") + sentence;
}
}
if (currentChunk) {
chunks.push(currentChunk.trim());
}
return chunks.filter((chunk) => chunk.length >= 20);
}
This ensures chunks are semantically coherent (complete sentences) while staying within token limits.
Embedding Generation
Chunks are embedded using OpenAI's text-embedding-3-small (1536 dimensions) or via OpenRouter for flexibility. The embedding script:
- Generates embeddings for all sub-chunks in batches
- Validates embedding dimensions match expected size (1536)
- Stores embeddings in PostgreSQL using pgvector
Vector Database Schema
I use Neon (serverless Postgres) with the pgvector extension. The schema includes:
docs.page: Stores page metadata (path, type, meta JSON, last_refresh)docs.page_section: Stores content chunks with their embeddings (vector(1536))
The similarity search function uses cosine similarity (dot product on normalized vectors):
create or replace function "docs"."match_page_sections"(
embedding vector(1536),
match_threshold float,
match_count int,
min_content_length int
)
returns table (...)
language plpgsql
as $$
begin
return query
select
ps.id,
ps.content,
(ps.embedding <#> embedding) * -1 as similarity,
ps.url,
ps.content_type
from docs.page_section ps
where length(ps.content) >= min_content_length
and (ps.embedding <#> embedding) * -1 > match_threshold
order by ps.embedding <#> embedding
limit match_count;
end;
$$;
Retrieval Logic
When a user asks a question, the system:
- Generates query embedding: Converts the user's question into a 1536-dimensional vector
- Semantic search: Finds top-k most similar chunks using cosine similarity
- Content-type prioritization: Re-ranks results based on query intent (career questions prioritize journey content, contact questions prioritize for-agents content)
- Deduplication: Removes duplicate URLs while keeping multiple sections for journey/CV content
- Context assembly: Formats retrieved chunks with headings and URLs for the LLM
2. LLM Prompting and Model Configuration
Model Selection
The system supports both OpenAI and OpenRouter providers, configurable via environment variables:
function getLLMProvider() {
const provider = process.env.LLM_PROVIDER || "openrouter";
const model = process.env.LLM_MODEL || "google/gemini-2.5-flash";
if (provider === "openrouter") {
const openrouter = createOpenRouter({
apiKey: process.env.OPENROUTER_API_KEY,
headers: {
'HTTP-Referer': 'https://folch.ai',
'X-Title': 'folch.ai',
},
});
return { model: openrouter.chat(model) };
}
return { model: openai(model) };
}
Default model is google/gemini-2.5-flash via OpenRouter for cost-effectiveness while maintaining quality.
System Prompt Architecture
The system prompt is structured with clear sections:
Role Definition: Establishes the assistant's purpose (helping users learn about Albert's background)
Security Boundaries: Prevents prompt injection, code execution, and information leakage
Retrieval Strategy: Forces the LLM to always call getInformation tool before answering
Use Cases: Defines primary scenarios (career questions, tools, projects, contact info) with specific guidance
Response Guidelines:
- Concise, non-repetitive answers
- Structured with bullet points
- Numbered source references [1], [2]
- Sources section at the end with descriptive links
Query Construction: Guides the LLM on how to construct effective semantic search queries (2-4 key concepts, 10-50 words, natural language)
Tool-Based RAG
The LLM uses a getInformation tool that:
- Takes a natural language query (constructed by the LLM from user's question)
- Calls
findRelevantContent()to retrieve top-k chunks - Formats context with headings and URLs
- Returns formatted context for the LLM to synthesize
The tool description includes query construction examples to guide the LLM:
EXAMPLES:
- User: "what's albert's background?" → Query: "Albert background career journey professional history"
- User: "journey?" → Query: "Albert journey career timeline professional path"
- User: "contact?" → Query: "Albert contact information email LinkedIn GitHub"
Query Rewriting
Before retrieval, queries go through a rewriting step that:
- Classifies intent: Detects career, contact, project, or tool queries
- Adds context: Incorporates conversation history for follow-up questions
- Expands abbreviations: Converts "journey?" to "Albert journey career timeline professional path"
This improves retrieval quality, especially for short or ambiguous queries.
Response Formatting
The LLM formats responses with:
- Inline references:
[1],[2]throughout the answer - Sources section: Bullet list with descriptive links like
🔗 1. [Professional Journey](https://folch.ai/journey) - URL deduplication: Same URL reused with same number
3. GitHub Workflows
Automated Embedding Generation
A GitHub Actions workflow automatically regenerates embeddings when content changes:
name: 'generate_embeddings'
on:
workflow_dispatch:
push:
branches:
- main
- development
paths:
- 'src/lib/data.ts'
- 'src/content/projects/**'
- 'src/content/journey/**'
- 'src/content/tools/**'
- 'src/content/blog/**'
- 'src/content/cv.mdx'
- 'src/content/for-agents.mdx'
- 'src/app/sitemap.ts'
- 'scripts/generate-embeddings.ts'
jobs:
generate:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- name: Setup Node.js
uses: actions/setup-node@v4
with:
node-version: '20'
cache: 'npm'
- name: Install dependencies
run: npm ci
- name: Generate embeddings
env:
NEON_DATABASE_URL: ${{ secrets.NEON_DATABASE_URL }}
OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
OPENROUTER_API_KEY: ${{ secrets.OPENROUTER_API_KEY }}
EMBEDDING_MODEL: ${{ secrets.EMBEDDING_MODEL }}
run: |
npx tsx scripts/generate-embeddings.ts
Workflow Behavior
- Triggers: Runs on push to main/development when content files change, or manually via
workflow_dispatch - Cleanup: Deletes all existing embeddings before regenerating (ensures consistency)
- Processing: Processes all content types sequentially
- Validation: Validates embedding dimensions and logs statistics
This ensures the knowledge base stays in sync with website content without manual intervention.
Key Learnings
- Chunking matters: Sentence-based chunking preserves semantic coherence better than fixed-size windows
- Query rewriting improves retrieval: Even simple intent classification and context expansion significantly improves results
- Content-type prioritization: Re-ranking by content type based on query intent improves answer quality
- Tool-based RAG: Using tools forces the LLM to retrieve before answering, preventing hallucinations
- Automation is essential: GitHub workflows eliminate manual embedding updates and keep the system current
Future Improvements
- Hybrid search: Combine semantic search with keyword matching for better precision
- Query expansion: Use LLM to expand queries before retrieval
- Conversation memory: Store conversation context in database for better follow-ups
- Analytics: Track query patterns and retrieval quality to optimize thresholds
The Ask AI feature demonstrates how RAG can make personal websites more interactive and informative. By combining semantic search with structured prompting, it provides accurate, contextual answers while maintaining security boundaries.