Retrieval-Augmented Generation (RAG) with Spring AI

Retrieval-Augmented Generation (RAG) is a powerful pattern that enhances Large Language Models (LLMs) by grounding their responses in your specific documents and data. While GPT-4 is incredibly capable, it doesn’t know about your proprietary documents, internal knowledge bases, or recent updates that occurred after its training cutoff date. RAG solves this problem by retrieving relevant context from your documents before generating responses.

Key benefits of RAG:

  • Up-to-date Information: Query your latest documents without retraining models
  • Source Attribution: Know which documents informed the answer
  • Cost-Effective: Much cheaper than fine-tuning models for domain-specific knowledge
  • Flexible: Add or remove documents without model changes
  • Accurate: Responses grounded in your specific content, reducing hallucinations

How RAG Works

The RAG process follows this workflow:

  1. Document Ingestion: PDF files are extracted and split into manageable chunks (1000 characters with 200-character overlap)
  2. Embedding: Each chunk is converted to a vector representation using OpenAI’s text-embedding-3-small model
  3. Storage: Vectors are stored in a vector database (in-memory for this demo)
  4. Query: User questions are converted to embeddings and matched against stored chunks using cosine similarity
  5. Response: The most relevant chunks are included as context in the prompt to GPT-4, which generates accurate, grounded answers

Setting Up RAG in Spring AI

First, add the embedding model dependency to your pom.xml:
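The exact artifact IDs have shifted between Spring AI releases, so check the naming for your version; assuming the Spring AI BOM is already imported, a typical setup looks something like this (PDFBox is included for the text extraction used later):

```xml
<!-- OpenAI starter (chat + embeddings); the artifact ID varies across Spring AI releases -->
<dependency>
    <groupId>org.springframework.ai</groupId>
    <artifactId>spring-ai-openai-spring-boot-starter</artifactId>
</dependency>

<!-- Apache PDFBox for extracting text from PDFs during ingestion -->
<dependency>
    <groupId>org.apache.pdfbox</groupId>
    <artifactId>pdfbox</artifactId>
    <version>2.0.30</version>
</dependency>
```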

Configure the embedding model in application.properties:
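A minimal example, assuming the standard Spring AI OpenAI property names and an API key supplied via an environment variable:

```properties
# OpenAI credentials (read from the environment rather than hard-coded)
spring.ai.openai.api-key=${OPENAI_API_KEY}

# Embedding model used for both chunk and query vectors
spring.ai.openai.embedding.options.model=text-embedding-3-small

# Chat model used to generate the final, context-grounded answer
spring.ai.openai.chat.options.model=gpt-4
```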


RagService Implementation

The RagService class handles the core RAG functionality through several key functions:

1. Document Ingestion – ingestPdfDocument()

This function handles the complete document ingestion pipeline:

  1. Text Extraction: Uses Apache PDFBox to extract text from PDF files
  2. Chunking: Splits the document into 1000-character chunks with 200-character overlap
    • The overlap ensures context isn’t lost at chunk boundaries
    • Attempts to break at sentence boundaries to maintain semantic coherence
  3. Embedding Generation: Converts each chunk to a vector using OpenAI’s text-embedding-3-small model
  4. Storage: Stores chunks with their embeddings and metadata (source file, chunk index) in an in-memory vector database
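Here is a condensed sketch of that pipeline. It assumes PDFBox 2.x and a recent Spring AI release (where EmbeddingModel.embed(String) returns a float[]); the DocumentChunk record and the in-memory list stand in for the demo’s vector store and are not Spring AI types:

```java
import java.io.File;
import java.io.IOException;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;
import org.springframework.ai.embedding.EmbeddingModel;
import org.springframework.stereotype.Service;

@Service
public class RagService {

    // A chunk, its embedding, and where it came from; application-defined, not a Spring AI type
    record DocumentChunk(String text, float[] embedding, String sourceFile, int index) {}

    private final EmbeddingModel embeddingModel;
    private final List<DocumentChunk> vectorStore = new CopyOnWriteArrayList<>();

    public RagService(EmbeddingModel embeddingModel) {
        this.embeddingModel = embeddingModel;
    }

    public int ingestPdfDocument(String filePath) throws IOException {
        // 1. Text extraction with Apache PDFBox (2.x API)
        String text;
        try (PDDocument pdf = PDDocument.load(new File(filePath))) {
            text = new PDFTextStripper().getText(pdf);
        }

        // 2. Chunking: 1000-character windows with 200-character overlap (see chunkText below)
        List<String> chunks = chunkText(text);

        // 3 + 4. Embed each chunk and store it alongside its source metadata
        for (int i = 0; i < chunks.size(); i++) {
            float[] embedding = embeddingModel.embed(chunks.get(i));
            vectorStore.add(new DocumentChunk(chunks.get(i), embedding, filePath, i));
        }
        return chunks.size();
    }

    // chunkText(), chatWithDocuments(), cosineSimilarity(), searchDocuments() are shown below
}
```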

Key Configuration:
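Continuing the sketch, the chunking parameters and a simple splitter; the sentence-boundary heuristic here is illustrative, not the only way to do it:

```java
// Chunking parameters used throughout the demo
private static final int CHUNK_SIZE = 1000;    // characters per chunk
private static final int CHUNK_OVERLAP = 200;  // characters shared between neighbouring chunks

// Fixed-size windows, pulled back to the nearest sentence end when one falls late in the window
private List<String> chunkText(String text) {
    List<String> chunks = new java.util.ArrayList<>();
    int start = 0;
    while (start < text.length()) {
        int end = Math.min(start + CHUNK_SIZE, text.length());
        int sentenceEnd = text.lastIndexOf(". ", end);
        if (end < text.length() && sentenceEnd > start + CHUNK_SIZE / 2) {
            end = sentenceEnd + 1;  // keep the period, drop the trailing space
        }
        chunks.add(text.substring(start, end).trim());
        if (end == text.length()) {
            break;
        }
        start = end - CHUNK_OVERLAP;  // step back so neighbouring chunks overlap
    }
    return chunks;
}
```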

Why chunk overlap? The 200-character overlap prevents important information from being split across chunk boundaries, improving retrieval accuracy. For example, if a key sentence spans a boundary, the overlap ensures it appears in at least one complete chunk.

2. Semantic Search – chatWithDocuments()

This is the heart of the RAG pattern, orchestrating the question-answering workflow:

  1. Question Embedding: Converts the user’s question to a vector representation
  2. Similarity Search: Finds the most similar document chunks using cosine similarity
  3. Context Assembly: Combines the top-K most relevant chunks into a single context string
  4. Prompt Construction: Builds a RAG-specific prompt with the retrieved context
  5. Answer Generation: Sends the contextualized prompt to GPT-4 for response generation
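Continuing the RagService sketch (with java.util.Comparator and java.util.stream.Collectors added to the imports), and assuming a ChatClient field built from the injected ChatClient.Builder; buildRagPrompt() is shown in the prompt-engineering section below:

```java
public String chatWithDocuments(String question, int topK) {
    // 1. Embed the question with the same model used for the document chunks
    float[] questionEmbedding = embeddingModel.embed(question);

    // 2 + 3. Rank all chunks by cosine similarity and join the top-K into one context string
    String context = vectorStore.stream()
            .sorted(Comparator.comparingDouble(
                    (DocumentChunk c) -> cosineSimilarity(questionEmbedding, c.embedding())).reversed())
            .limit(topK)
            .map(DocumentChunk::text)
            .collect(Collectors.joining("\n---\n"));

    // 4. Build the RAG prompt around the retrieved context
    String prompt = buildRagPrompt(question, context);

    // 5. Ask the chat model for an answer grounded in that context
    return chatClient.prompt()
            .user(prompt)
            .call()
            .content();
}
```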

The topK Parameter: Controls how many relevant chunks to retrieve. Higher values (e.g., 5-7) provide more context but may include less relevant information. Lower values (e.g., 2-3) are more focused but might miss important details.

3. Cosine Similarity – cosineSimilarity()

This function measures semantic similarity between vectors:

  • Input: Two embedding vectors (the question and a document chunk)
  • Output: A similarity score between -1 and 1
    • 1.0 = Vectors point in the same direction (semantically very similar)
    • 0.0 = Vectors are orthogonal (unrelated content)
    • -1.0 = Vectors point in opposite directions (semantically opposed content; rarely seen in practice with text embeddings)

How it works: Calculates the dot product of the vectors divided by the product of their magnitudes. This measures the angle between vectors in high-dimensional space, which correlates with semantic similarity.
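In code, that is only a few lines:

```java
// cosine(a, b) = (a · b) / (|a| · |b|)
private double cosineSimilarity(float[] a, float[] b) {
    double dot = 0.0, normA = 0.0, normB = 0.0;
    for (int i = 0; i < a.length; i++) {
        dot   += a[i] * b[i];
        normA += a[i] * a[i];
        normB += b[i] * b[i];
    }
    return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```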

4. RAG Prompt Engineering – buildRagPrompt()

Constructs a specialized prompt that grounds the LLM’s response in retrieved documents:
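The exact wording is not prescribed here, but a template along these lines produces the behaviour described below:

```java
private String buildRagPrompt(String question, String context) {
    return """
            Answer the question using ONLY the context provided below.
            If the context does not contain the information needed, say so
            plainly instead of guessing or falling back on general knowledge.

            Context:
            %s

            Question: %s
            """.formatted(context, question);
}
```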

This prompt engineering technique is crucial for preventing hallucinations. It explicitly instructs the model to:

  • Use only the provided context
  • Admit when information isn’t available
  • Ground responses in specific documents rather than general knowledge

5. Document Search – searchDocuments()

A utility function for debugging and transparency:

  • Returns the actual chunks that would be retrieved for a given query
  • Includes metadata (source file, chunk index, text preview)
  • Helpful in understanding the context the LLM receives
  • Helps diagnose poor search results
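A sketch that reuses the same ranking as chatWithDocuments() but returns the matches and their metadata instead of calling the chat model; the SearchResult record is illustrative:

```java
// Application-defined result type: where the chunk came from, a preview, and its score
public record SearchResult(String sourceFile, int chunkIndex, String preview, double score) {}

public List<SearchResult> searchDocuments(String query, int topK) {
    float[] queryEmbedding = embeddingModel.embed(query);
    return vectorStore.stream()
            .map(c -> new SearchResult(
                    c.sourceFile(),
                    c.index(),
                    c.text().substring(0, Math.min(200, c.text().length())),  // short text preview
                    cosineSimilarity(queryEmbedding, c.embedding())))
            .sorted(Comparator.comparingDouble(SearchResult::score).reversed())
            .limit(topK)
            .toList();
}
```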

REST API Endpoints

The ChatController exposes five RAG endpoints that provide a complete document management and querying workflow:

/api/rag/status (GET)

Check the current state of the document store:

  • Returns total number of chunks ingested
  • Indicates whether the system is ready for queries
  • Useful for health checks and monitoring

/api/rag/ingest (POST)

Upload and process PDF documents:

  • Accepts a file path in the request body
  • Triggers the complete ingestion pipeline (extract, chunk, embed, store)
  • Returns the number of chunks created
  • Limited to 50 chunks in this demo to manage API costs

/api/rag/chat (POST)

The main RAG endpoint for asking questions:

  • Accepts a question and optional topK parameter
  • Performs semantic search to find relevant chunks
  • Generates context-grounded answers using GPT-4
  • Returns responses based on your specific documents

/api/rag/search (GET)

Debug endpoint to inspect retrieval results:

  • Shows which chunks would be retrieved for a query
  • Returns chunk text, metadata, and preview
  • Helps understand and tune the retrieval process
  • Useful for diagnosing irrelevant results

/api/rag/clear (POST)

Reset the document store:

  • Removes all ingested documents and embeddings
  • Returns the count of chunks removed
  • Essential for starting fresh with new documents
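A condensed controller wiring those five endpoints to the service; the request and response shapes (field names, the default topK, the chunkCount() and clear() helpers) are illustrative rather than prescribed:

```java
import java.util.Map;
import org.springframework.web.bind.annotation.*;

@RestController
@RequestMapping("/api/rag")
public class ChatController {

    private final RagService ragService;

    public ChatController(RagService ragService) {
        this.ragService = ragService;
    }

    @GetMapping("/status")
    public Map<String, Object> status() {
        int chunks = ragService.chunkCount();          // illustrative helper on the service
        return Map.of("chunks", chunks, "ready", chunks > 0);
    }

    @PostMapping("/ingest")
    public Map<String, Object> ingest(@RequestBody Map<String, String> body) throws Exception {
        int created = ragService.ingestPdfDocument(body.get("filePath"));
        return Map.of("chunksCreated", created);
    }

    @PostMapping("/chat")
    public Map<String, String> chat(@RequestBody Map<String, Object> body) {
        String question = (String) body.get("question");
        int topK = (int) body.getOrDefault("topK", 4);
        return Map.of("answer", ragService.chatWithDocuments(question, topK));
    }

    @GetMapping("/search")
    public Object search(@RequestParam String query, @RequestParam(defaultValue = "4") int topK) {
        return ragService.searchDocuments(query, topK);
    }

    @PostMapping("/clear")
    public Map<String, Object> clear() {
        return Map.of("removed", ragService.clear());  // illustrative helper returning the count
    }
}
```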

Testing the RAG Implementation

1. Check RAG Status

First, verify the system is ready:
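For example (adjust host and port to your setup):

```bash
curl http://localhost:8080/api/rag/status
```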

Response:
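The exact fields depend on your controller; with an empty store you would see something like:

```json
{
  "chunks": 0,
  "ready": false
}
```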

2. Ingest a PDF Document

Upload a document to the vector store:
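The request body carries the path of the PDF to ingest; the field name and path below follow the controller sketch above and are placeholders:

```bash
curl -X POST http://localhost:8080/api/rag/ingest \
  -H "Content-Type: application/json" \
  -d '{"filePath": "/path/to/your/document.pdf"}'
```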

Response:
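Again illustrative; the count depends on the document you ingested:

```json
{
  "chunksCreated": 42
}
```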

3. Ask Questions

Now query your document:
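For example, with an explicit topK (the question is just a sample):

```bash
curl -X POST http://localhost:8080/api/rag/chat \
  -H "Content-Type: application/json" \
  -d '{"question": "What does the document say about refund policies?", "topK": 4}'
```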

The topK parameter controls how many relevant chunks to retrieve. Higher values provide more context but may include less relevant information.

4. Search for Specific Content

To see which chunks would be retrieved for a query:
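Pass the query (and optionally topK) as request parameters; the parameter names match the controller sketch above and the query is only an example:

```bash
curl "http://localhost:8080/api/rag/search?query=refund%20policies&topK=3"
```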

This endpoint is useful for debugging and understanding what context the LLM receives.

Production Considerations

While this implementation uses an in-memory vector store for simplicity, production deployments should consider:

1. Persistent Vector Database

Replace the in-memory store with a production vector database:

  • PostgreSQL + pgvector: Great for existing PostgreSQL users
  • Pinecone: Managed vector database with excellent performance
  • Qdrant: Open-source vector database with rich features
  • Weaviate: Semantic search with built-in ML capabilities

2. Async Processing

For large documents, implement asynchronous ingestion:
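For example, with Spring’s @Async (which also needs @EnableAsync on a configuration class, and must be invoked through the Spring proxy rather than from within the same class):

```java
@Async
public CompletableFuture<Integer> ingestPdfDocumentAsync(String filePath) throws IOException {
    // Runs on a task-executor thread so large PDFs don't block the HTTP request thread
    return CompletableFuture.completedFuture(ingestPdfDocument(filePath));
}
```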

3. Caching

Cache frequently asked questions and embeddings:
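With Spring’s cache abstraction (@EnableCaching plus a cache provider such as Caffeine), something like this avoids re-embedding repeated questions:

```java
@Cacheable(value = "questionEmbeddings", key = "#question")
public float[] embedQuestion(String question) {
    // Repeated questions reuse the cached vector instead of another embedding API call
    return embeddingModel.embed(question);
}
```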

4. Security

Add authentication for sensitive operations:
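A minimal Spring Security sketch that protects the ingest and clear endpoints while leaving chat and search open; it assumes spring-boot-starter-security is on the classpath and uses HTTP Basic purely for illustration:

```java
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.security.config.Customizer;
import org.springframework.security.config.annotation.web.builders.HttpSecurity;
import org.springframework.security.config.annotation.web.configuration.EnableWebSecurity;
import org.springframework.security.web.SecurityFilterChain;

@Configuration
@EnableWebSecurity
public class RagSecurityConfig {

    @Bean
    SecurityFilterChain ragSecurity(HttpSecurity http) throws Exception {
        http
            .csrf(csrf -> csrf.disable())  // demo-only: simplifies calling the POST endpoints with curl
            .authorizeHttpRequests(auth -> auth
                .requestMatchers("/api/rag/ingest", "/api/rag/clear").authenticated()
                .anyRequest().permitAll())
            .httpBasic(Customizer.withDefaults());
        return http.build();
    }
}
```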

5. Monitoring

Track vector search performance and accuracy:
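For example, timing the retrieval path with Micrometer (a MeterRegistry is auto-configured when Actuator is on the classpath; the injected meterRegistry field and metric name are illustrative):

```java
public String chatWithDocumentsTimed(String question, int topK) {
    return meterRegistry.timer("rag.search")
            .record(() -> chatWithDocuments(question, topK));
}
```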

Troubleshooting

Poor Search Results

If answers seem irrelevant:

  1. Increase topK: Retrieve more chunks for better context
  2. Adjust chunk size: Smaller chunks for precise matching, larger for more context
  3. Try a different embedding model: text-embedding-3-large provides higher quality at a higher cost
  4. Add metadata filtering: Filter by document type, date, or section before similarity search

Rate Limiting

OpenAI’s embedding API has rate limits. For large ingestions, add rate limiting:
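A simple approach is to embed in small batches and pause between them; a token-bucket limiter such as Resilience4j’s RateLimiter is a more robust option. A naive sketch, with batch size and delay as placeholders to tune against your own limits:

```java
// Naive throttle: embed chunks in small batches with a pause in between.
private List<float[]> embedWithThrottle(List<String> chunks) throws InterruptedException {
    List<float[]> embeddings = new java.util.ArrayList<>();
    int batchSize = 20;
    for (int i = 0; i < chunks.size(); i += batchSize) {
        List<String> batch = chunks.subList(i, Math.min(i + batchSize, chunks.size()));
        embeddings.addAll(embeddingModel.embed(batch));  // one API call per batch
        Thread.sleep(1_000);                             // crude pacing between batches
    }
    return embeddings;
}
```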

Conclusion

RAG is critical for building AI applications that need to work with your specific data. The Spring AI framework makes it straightforward to implement RAG patterns with minimal code, while providing the flexibility to scale to production use cases.

The combination of Spring Boot’s robust ecosystem and OpenAI’s embeddings and language models enables you to build sophisticated question-answering systems over your documents in just a few hundred lines of Java code.

The complete code for this RAG implementation is available in the GitHub repository, including additional examples and a Postman collection for testing.


Note: Full disclosure – I vibe-coded some of the RAG code using Claude. TBH, I am starting to question the value of traditional blogs as I go into 2026. Anything a developer needs can be “taught” by an LLM as long as you know what to ask, and you can quickly generate code to experiment with. Vibe coding adds a whole different perspective where we may not even need to look at the code (yes yes that is controversial as of Dec 2025).