December 8, 2025

How I Built My 'Ask AI' Web Assistant

A deep dive into the RAG pipeline, LLM prompting strategies, and automated workflows that power the Ask AI feature on folch.ai

How I Built My 'Ask AI' Web Assistant

Overview

The "Ask AI" feature on my personal website lets visitors ask questions about my career, projects, and tools using natural language. Behind the scenes, it's powered by a RAG (Retrieval-Augmented Generation) pipeline that combines semantic search with LLM reasoning. Here's how I built it.

1. RAG Pipeline

Content Processing

The RAG pipeline starts with processing various content types from my website:

  • Projects: MDX files from src/content/projects/ plus metadata from src/lib/data.ts
  • Journey: Career milestones from src/content/journey/ combined into a single timeline
  • Tools: AI tools documentation from src/content/tools/tools.mdx
  • Blog: Blog posts from src/content/blog/
  • CV & For-Agents: Additional structured content

Each content type is processed by dedicated functions (processProjects(), processJourney(), etc.) that:

  1. Read MDX files and extract frontmatter
  2. Combine metadata with content
  3. Generate clean URLs for each piece of content
  4. Create ContentChunk objects with structured metadata

Chunking Strategy

Content is split into semantic chunks using a sentence-based approach:

function generateChunks(text: string, maxLength: number = 1000): string[] {
  // If text is short enough, return as single chunk
  if (text.length <= maxLength) {
    return text.length >= 20 ? [text.trim()] : [];
  }

  // Split by sentences and group into chunks
  const sentences = text.split(/[.!?]\s+/);
  const chunks: string[] = [];
  let currentChunk = "";

  for (const sentence of sentences) {
    if ((currentChunk + sentence).length > maxLength && currentChunk) {
      chunks.push(currentChunk.trim());
      currentChunk = sentence;
    } else {
      currentChunk += (currentChunk ? ". " : "") + sentence;
    }
  }

  if (currentChunk) {
    chunks.push(currentChunk.trim());
  }

  return chunks.filter((chunk) => chunk.length >= 20);
}

This ensures chunks are semantically coherent (complete sentences) while staying within token limits.

Embedding Generation

Chunks are embedded using OpenAI's text-embedding-3-small (1536 dimensions) or via OpenRouter for flexibility. The embedding script:

  1. Generates embeddings for all sub-chunks in batches
  2. Validates embedding dimensions match expected size (1536)
  3. Stores embeddings in PostgreSQL using pgvector

Vector Database Schema

I use Neon (serverless Postgres) with the pgvector extension. The schema includes:

  • docs.page: Stores page metadata (path, type, meta JSON, last_refresh)
  • docs.page_section: Stores content chunks with their embeddings (vector(1536))

The similarity search function uses cosine similarity (dot product on normalized vectors):

create or replace function "docs"."match_page_sections"(
  embedding vector(1536), 
  match_threshold float, 
  match_count int, 
  min_content_length int
)
returns table (...)
language plpgsql
as $$
begin
  return query
  select
    ps.id,
    ps.content,
    (ps.embedding <#> embedding) * -1 as similarity,
    ps.url,
    ps.content_type
  from docs.page_section ps
  where length(ps.content) >= min_content_length
    and (ps.embedding <#> embedding) * -1 > match_threshold
  order by ps.embedding <#> embedding
  limit match_count;
end;
$$;

Retrieval Logic

When a user asks a question, the system:

  1. Generates query embedding: Converts the user's question into a 1536-dimensional vector
  2. Semantic search: Finds top-k most similar chunks using cosine similarity
  3. Content-type prioritization: Re-ranks results based on query intent (career questions prioritize journey content, contact questions prioritize for-agents content)
  4. Deduplication: Removes duplicate URLs while keeping multiple sections for journey/CV content
  5. Context assembly: Formats retrieved chunks with headings and URLs for the LLM

2. LLM Prompting and Model Configuration

Model Selection

The system supports both OpenAI and OpenRouter providers, configurable via environment variables:

function getLLMProvider() {
  const provider = process.env.LLM_PROVIDER || "openrouter";
  const model = process.env.LLM_MODEL || "google/gemini-2.5-flash";

  if (provider === "openrouter") {
    const openrouter = createOpenRouter({ 
      apiKey: process.env.OPENROUTER_API_KEY,
      headers: {
        'HTTP-Referer': 'https://folch.ai',
        'X-Title': 'folch.ai',
      },
    });
    return { model: openrouter.chat(model) };
  }

  return { model: openai(model) };
}

Default model is google/gemini-2.5-flash via OpenRouter for cost-effectiveness while maintaining quality.

System Prompt Architecture

The system prompt is structured with clear sections:

Role Definition: Establishes the assistant's purpose (helping users learn about Albert's background)

Security Boundaries: Prevents prompt injection, code execution, and information leakage

Retrieval Strategy: Forces the LLM to always call getInformation tool before answering

Use Cases: Defines primary scenarios (career questions, tools, projects, contact info) with specific guidance

Response Guidelines:

  • Concise, non-repetitive answers
  • Structured with bullet points
  • Numbered source references [1], [2]
  • Sources section at the end with descriptive links

Query Construction: Guides the LLM on how to construct effective semantic search queries (2-4 key concepts, 10-50 words, natural language)

Tool-Based RAG

The LLM uses a getInformation tool that:

  1. Takes a natural language query (constructed by the LLM from user's question)
  2. Calls findRelevantContent() to retrieve top-k chunks
  3. Formats context with headings and URLs
  4. Returns formatted context for the LLM to synthesize

The tool description includes query construction examples to guide the LLM:

EXAMPLES:
- User: "what's albert's background?" → Query: "Albert background career journey professional history"
- User: "journey?" → Query: "Albert journey career timeline professional path"
- User: "contact?" → Query: "Albert contact information email LinkedIn GitHub"

Query Rewriting

Before retrieval, queries go through a rewriting step that:

  • Classifies intent: Detects career, contact, project, or tool queries
  • Adds context: Incorporates conversation history for follow-up questions
  • Expands abbreviations: Converts "journey?" to "Albert journey career timeline professional path"

This improves retrieval quality, especially for short or ambiguous queries.

Response Formatting

The LLM formats responses with:

  • Inline references: [1], [2] throughout the answer
  • Sources section: Bullet list with descriptive links like 🔗 1. [Professional Journey](https://folch.ai/journey)
  • URL deduplication: Same URL reused with same number

3. GitHub Workflows

Automated Embedding Generation

A GitHub Actions workflow automatically regenerates embeddings when content changes:

name: 'generate_embeddings'

on:
  workflow_dispatch:
  push:
    branches:
      - main
      - development
    paths:
      - 'src/lib/data.ts'
      - 'src/content/projects/**'
      - 'src/content/journey/**'
      - 'src/content/tools/**'
      - 'src/content/blog/**'
      - 'src/content/cv.mdx'
      - 'src/content/for-agents.mdx'
      - 'src/app/sitemap.ts'
      - 'scripts/generate-embeddings.ts'

jobs:
  generate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Setup Node.js
        uses: actions/setup-node@v4
        with:
          node-version: '20'
          cache: 'npm'
      - name: Install dependencies
        run: npm ci
      - name: Generate embeddings
        env:
          NEON_DATABASE_URL: ${{ secrets.NEON_DATABASE_URL }}
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
          OPENROUTER_API_KEY: ${{ secrets.OPENROUTER_API_KEY }}
          EMBEDDING_MODEL: ${{ secrets.EMBEDDING_MODEL }}
        run: |
          npx tsx scripts/generate-embeddings.ts

Workflow Behavior

  1. Triggers: Runs on push to main/development when content files change, or manually via workflow_dispatch
  2. Cleanup: Deletes all existing embeddings before regenerating (ensures consistency)
  3. Processing: Processes all content types sequentially
  4. Validation: Validates embedding dimensions and logs statistics

This ensures the knowledge base stays in sync with website content without manual intervention.

Key Learnings

  1. Chunking matters: Sentence-based chunking preserves semantic coherence better than fixed-size windows
  2. Query rewriting improves retrieval: Even simple intent classification and context expansion significantly improves results
  3. Content-type prioritization: Re-ranking by content type based on query intent improves answer quality
  4. Tool-based RAG: Using tools forces the LLM to retrieve before answering, preventing hallucinations
  5. Automation is essential: GitHub workflows eliminate manual embedding updates and keep the system current

Future Improvements

  • Hybrid search: Combine semantic search with keyword matching for better precision
  • Query expansion: Use LLM to expand queries before retrieval
  • Conversation memory: Store conversation context in database for better follow-ups
  • Analytics: Track query patterns and retrieval quality to optimize thresholds

The Ask AI feature demonstrates how RAG can make personal websites more interactive and informative. By combining semantic search with structured prompting, it provides accurate, contextual answers while maintaining security boundaries.

Back to Garden
RAGAI EngineeringLLMVector DatabasesNext.js