Retrieval-Augmented Generation (RAG) with Spring AI

Retrieval-Augmented Generation (RAG) is a powerful pattern that enhances Large Language Models (LLMs) by grounding their responses in your specific documents and data. While GPT-4 is incredibly capable, it doesn’t know about your proprietary documents, internal knowledge bases, or recent updates that occurred after its training cutoff date. RAG solves this problem by retrieving relevant context from your documents before generating responses.

Key benefits of RAG:

  • Up-to-date Information: Query your latest documents without retraining models
  • Source Attribution: Know which documents informed the answer
  • Cost-Effective: Much cheaper than fine-tuning models for domain-specific knowledge
  • Flexible: Add or remove documents without model changes
  • Accurate: Responses grounded in your specific content, reducing hallucinations

How RAG Works

The RAG process follows this workflow:

  1. Document Ingestion: PDF files are extracted and split into manageable chunks (1000 characters with 200-character overlap)
  2. Embedding: Each chunk is converted to a vector representation using OpenAI’s text-embedding-3-small model
  3. Storage: Vectors are stored in a vector database (in-memory for this demo)
  4. Query: User questions are converted to embeddings and matched against stored chunks using cosine similarity
  5. Response: The most relevant chunks are included as context in the prompt to GPT-4, which generates accurate, grounded answers

Setting Up RAG in Spring AI

First, add the embedding model dependency to your pom.xml:
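The exact artifact IDs have shifted between Spring AI releases, so check the naming for your version; assuming the Spring AI BOM is already imported, a typical setup looks something like this (PDFBox is included for the text extraction used later):

```xml
<!-- OpenAI starter (chat + embeddings); the artifact ID varies across Spring AI releases -->
<dependency>
    <groupId>org.springframework.ai</groupId>
    <artifactId>spring-ai-openai-spring-boot-starter</artifactId>
</dependency>

<!-- Apache PDFBox for extracting text from PDFs during ingestion -->
<dependency>
    <groupId>org.apache.pdfbox</groupId>
    <artifactId>pdfbox</artifactId>
    <version>2.0.30</version>
</dependency>
```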

Configure the embedding model in application.properties:
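A minimal example, assuming the standard Spring AI OpenAI property names and an API key supplied via an environment variable:

```properties
# OpenAI credentials (read from the environment rather than hard-coded)
spring.ai.openai.api-key=${OPENAI_API_KEY}

# Embedding model used for both chunk and query vectors
spring.ai.openai.embedding.options.model=text-embedding-3-small

# Chat model used to generate the final, context-grounded answer
spring.ai.openai.chat.options.model=gpt-4
```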


RagService Implementation

The RagService class handles the core RAG functionality through several key functions:

1. Document Ingestion – ingestPdfDocument()

This function handles the complete document ingestion pipeline:

  1. Text Extraction: Uses Apache PDFBox to extract text from PDF files
  2. Chunking: Splits the document into 1000-character chunks with 200-character overlap
    • The overlap ensures context isn’t lost at chunk boundaries
    • Attempts to break at sentence boundaries to maintain semantic coherence
  3. Embedding Generation: Converts each chunk to a vector using OpenAI’s text-embedding-3-small model
  4. Storage: Stores chunks with their embeddings and metadata (source file, chunk index) in an in-memory vector database
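Here is a condensed sketch of that pipeline. It assumes PDFBox 2.x and a recent Spring AI release (where EmbeddingModel.embed(String) returns a float[]); the DocumentChunk record and the in-memory list stand in for the demo’s vector store and are not Spring AI types:

```java
import java.io.File;
import java.io.IOException;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;
import org.springframework.ai.embedding.EmbeddingModel;
import org.springframework.stereotype.Service;

@Service
public class RagService {

    // A chunk, its embedding, and where it came from; application-defined, not a Spring AI type
    record DocumentChunk(String text, float[] embedding, String sourceFile, int index) {}

    private final EmbeddingModel embeddingModel;
    private final List<DocumentChunk> vectorStore = new CopyOnWriteArrayList<>();

    public RagService(EmbeddingModel embeddingModel) {
        this.embeddingModel = embeddingModel;
    }

    public int ingestPdfDocument(String filePath) throws IOException {
        // 1. Text extraction with Apache PDFBox (2.x API)
        String text;
        try (PDDocument pdf = PDDocument.load(new File(filePath))) {
            text = new PDFTextStripper().getText(pdf);
        }

        // 2. Chunking: 1000-character windows with 200-character overlap (see chunkText below)
        List<String> chunks = chunkText(text);

        // 3 + 4. Embed each chunk and store it alongside its source metadata
        for (int i = 0; i < chunks.size(); i++) {
            float[] embedding = embeddingModel.embed(chunks.get(i));
            vectorStore.add(new DocumentChunk(chunks.get(i), embedding, filePath, i));
        }
        return chunks.size();
    }

    // chunkText(), chatWithDocuments(), cosineSimilarity(), searchDocuments() are shown below
}
```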

Key Configuration:
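Continuing the sketch, the chunking parameters and a simple splitter; the sentence-boundary heuristic here is illustrative, not the only way to do it:

```java
// Chunking parameters used throughout the demo
private static final int CHUNK_SIZE = 1000;    // characters per chunk
private static final int CHUNK_OVERLAP = 200;  // characters shared between neighbouring chunks

// Fixed-size windows, pulled back to the nearest sentence end when one falls late in the window
private List<String> chunkText(String text) {
    List<String> chunks = new java.util.ArrayList<>();
    int start = 0;
    while (start < text.length()) {
        int end = Math.min(start + CHUNK_SIZE, text.length());
        int sentenceEnd = text.lastIndexOf(". ", end);
        if (end < text.length() && sentenceEnd > start + CHUNK_SIZE / 2) {
            end = sentenceEnd + 1;  // keep the period, drop the trailing space
        }
        chunks.add(text.substring(start, end).trim());
        if (end == text.length()) {
            break;
        }
        start = end - CHUNK_OVERLAP;  // step back so neighbouring chunks overlap
    }
    return chunks;
}
```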

Why chunk overlap? The 200-character overlap prevents important information from being split across chunk boundaries, improving retrieval accuracy. For example, if a key sentence spans a boundary, the overlap ensures it appears in at least one complete chunk.

2. Semantic Search – chatWithDocuments()

This is the heart of the RAG pattern, orchestrating the question-answering workflow:

  1. Question Embedding: Converts the user’s question to a vector representation
  2. Similarity Search: Finds the most similar document chunks using cosine similarity
  3. Context Assembly: Combines the top-K most relevant chunks into a single context string
  4. Prompt Construction: Builds a RAG-specific prompt with the retrieved context
  5. Answer Generation: Sends the contextualized prompt to GPT-4 for response generation
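Continuing the RagService sketch (with java.util.Comparator and java.util.stream.Collectors added to the imports), and assuming a ChatClient field built from the injected ChatClient.Builder; buildRagPrompt() is shown in the prompt-engineering section below:

```java
public String chatWithDocuments(String question, int topK) {
    // 1. Embed the question with the same model used for the document chunks
    float[] questionEmbedding = embeddingModel.embed(question);

    // 2 + 3. Rank all chunks by cosine similarity and join the top-K into one context string
    String context = vectorStore.stream()
            .sorted(Comparator.comparingDouble(
                    (DocumentChunk c) -> cosineSimilarity(questionEmbedding, c.embedding())).reversed())
            .limit(topK)
            .map(DocumentChunk::text)
            .collect(Collectors.joining("\n---\n"));

    // 4. Build the RAG prompt around the retrieved context
    String prompt = buildRagPrompt(question, context);

    // 5. Ask the chat model for an answer grounded in that context
    return chatClient.prompt()
            .user(prompt)
            .call()
            .content();
}
```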

The topK Parameter: Controls how many relevant chunks to retrieve. Higher values (e.g., 5-7) provide more context but may include less relevant information. Lower values (e.g., 2-3) are more focused but might miss important details.

3. Cosine Similarity – cosineSimilarity()

This function measures semantic similarity between vectors:

  • Input: Two embedding vectors (the question and a document chunk)
  • Output: A similarity score between -1 and 1
    • 1.0 = Vectors point in the same direction (semantically very similar)
    • 0.0 = Vectors are orthogonal (unrelated content)
    • -1.0 = Vectors point in opposite directions (semantically opposed content; rarely seen in practice with text embeddings)

How it works: Calculates the dot product of the vectors divided by the product of their magnitudes. This measures the angle between vectors in high-dimensional space, which correlates with semantic similarity.
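In code, that is only a few lines:

```java
// cosine(a, b) = (a · b) / (|a| · |b|)
private double cosineSimilarity(float[] a, float[] b) {
    double dot = 0.0, normA = 0.0, normB = 0.0;
    for (int i = 0; i < a.length; i++) {
        dot   += a[i] * b[i];
        normA += a[i] * a[i];
        normB += b[i] * b[i];
    }
    return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```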

4. RAG Prompt Engineering – buildRagPrompt()

Constructs a specialized prompt that grounds the LLM’s response in retrieved documents:
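The exact wording is not prescribed here, but a template along these lines produces the behaviour described below:

```java
private String buildRagPrompt(String question, String context) {
    return """
            Answer the question using ONLY the context provided below.
            If the context does not contain the information needed, say so
            plainly instead of guessing or falling back on general knowledge.

            Context:
            %s

            Question: %s
            """.formatted(context, question);
}
```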

This prompt engineering technique is crucial for preventing hallucinations. It explicitly instructs the model to:

  • Use only the provided context
  • Admit when information isn’t available
  • Ground responses in specific documents rather than general knowledge

5. Document Search – searchDocuments()

A utility function for debugging and transparency:

  • Returns the actual chunks that would be retrieved for a given query
  • Includes metadata (source file, chunk index, text preview)
  • Helpful in understanding the context the LLM receives
  • Helps diagnose poor search results
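A sketch that reuses the same ranking as chatWithDocuments() but returns the matches and their metadata instead of calling the chat model; the SearchResult record is illustrative:

```java
// Application-defined result type: where the chunk came from, a preview, and its score
public record SearchResult(String sourceFile, int chunkIndex, String preview, double score) {}

public List<SearchResult> searchDocuments(String query, int topK) {
    float[] queryEmbedding = embeddingModel.embed(query);
    return vectorStore.stream()
            .map(c -> new SearchResult(
                    c.sourceFile(),
                    c.index(),
                    c.text().substring(0, Math.min(200, c.text().length())),  // short text preview
                    cosineSimilarity(queryEmbedding, c.embedding())))
            .sorted(Comparator.comparingDouble(SearchResult::score).reversed())
            .limit(topK)
            .toList();
}
```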

REST API Endpoints

The ChatController exposes five RAG endpoints that provide a complete document management and querying workflow:

/api/rag/status (GET)

Check the current state of the document store:

  • Returns total number of chunks ingested
  • Indicates whether the system is ready for queries
  • Useful for health checks and monitoring

/api/rag/ingest (POST)

Upload and process PDF documents:

  • Accepts a file path in the request body
  • Triggers the complete ingestion pipeline (extract, chunk, embed, store)
  • Returns the number of chunks created
  • Limited to 50 chunks in this demo to manage API costs

/api/rag/chat (POST)

The main RAG endpoint for asking questions:

  • Accepts a question and optional topK parameter
  • Performs semantic search to find relevant chunks
  • Generates context-grounded answers using GPT-4
  • Returns responses based on your specific documents

/api/rag/search (GET)

Debug endpoint to inspect retrieval results:

  • Shows which chunks would be retrieved for a query
  • Returns chunk text, metadata, and preview
  • Helps understand and tune the retrieval process
  • Useful for diagnosing irrelevant results

/api/rag/clear (POST)

Reset the document store:

  • Removes all ingested documents and embeddings
  • Returns the count of chunks removed
  • Essential for starting fresh with new documents
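A condensed controller wiring those five endpoints to the service; the request and response shapes (field names, the default topK, the chunkCount() and clear() helpers) are illustrative rather than prescribed:

```java
import java.util.Map;
import org.springframework.web.bind.annotation.*;

@RestController
@RequestMapping("/api/rag")
public class ChatController {

    private final RagService ragService;

    public ChatController(RagService ragService) {
        this.ragService = ragService;
    }

    @GetMapping("/status")
    public Map<String, Object> status() {
        int chunks = ragService.chunkCount();          // illustrative helper on the service
        return Map.of("chunks", chunks, "ready", chunks > 0);
    }

    @PostMapping("/ingest")
    public Map<String, Object> ingest(@RequestBody Map<String, String> body) throws Exception {
        int created = ragService.ingestPdfDocument(body.get("filePath"));
        return Map.of("chunksCreated", created);
    }

    @PostMapping("/chat")
    public Map<String, String> chat(@RequestBody Map<String, Object> body) {
        String question = (String) body.get("question");
        int topK = (int) body.getOrDefault("topK", 4);
        return Map.of("answer", ragService.chatWithDocuments(question, topK));
    }

    @GetMapping("/search")
    public Object search(@RequestParam String query, @RequestParam(defaultValue = "4") int topK) {
        return ragService.searchDocuments(query, topK);
    }

    @PostMapping("/clear")
    public Map<String, Object> clear() {
        return Map.of("removed", ragService.clear());  // illustrative helper returning the count
    }
}
```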

Testing the RAG Implementation

1. Check RAG Status

First, verify the system is ready:
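For example (adjust host and port to your setup):

```bash
curl http://localhost:8080/api/rag/status
```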

Response:
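The exact fields depend on your controller; with an empty store you would see something like:

```json
{
  "chunks": 0,
  "ready": false
}
```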

2. Ingest a PDF Document

Upload a document to the vector store:
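The request body carries the path of the PDF to ingest; the field name and path below follow the controller sketch above and are placeholders:

```bash
curl -X POST http://localhost:8080/api/rag/ingest \
  -H "Content-Type: application/json" \
  -d '{"filePath": "/path/to/your/document.pdf"}'
```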

Response:
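Again illustrative; the count depends on the document you ingested:

```json
{
  "chunksCreated": 42
}
```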

3. Ask Questions

Now query your document:
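For example, with an explicit topK (the question is just a sample):

```bash
curl -X POST http://localhost:8080/api/rag/chat \
  -H "Content-Type: application/json" \
  -d '{"question": "What does the document say about refund policies?", "topK": 4}'
```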

The topK parameter controls how many relevant chunks to retrieve. Higher values provide more context but may include less relevant information.

4. Search for Specific Content

To see which chunks would be retrieved for a query:
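Pass the query (and optionally topK) as request parameters; the parameter names match the controller sketch above and the query is only an example:

```bash
curl "http://localhost:8080/api/rag/search?query=refund%20policies&topK=3"
```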

This endpoint is useful for debugging and understanding what context the LLM receives.

Production Considerations

While this implementation uses an in-memory vector store for simplicity, production deployments should consider:

1. Persistent Vector Database

Replace the in-memory store with a production vector database:

  • PostgreSQL + pgvector: Great for existing PostgreSQL users
  • Pinecone: Managed vector database with excellent performance
  • Qdrant: Open-source vector database with rich features
  • Weaviate: Semantic search with built-in ML capabilities

2. Async Processing

For large documents, implement asynchronous ingestion:
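For example, with Spring’s @Async (which also needs @EnableAsync on a configuration class, and must be invoked through the Spring proxy rather than from within the same class):

```java
@Async
public CompletableFuture<Integer> ingestPdfDocumentAsync(String filePath) throws IOException {
    // Runs on a task-executor thread so large PDFs don't block the HTTP request thread
    return CompletableFuture.completedFuture(ingestPdfDocument(filePath));
}
```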

3. Caching

Cache frequently asked questions and embeddings:
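With Spring’s cache abstraction (@EnableCaching plus a cache provider such as Caffeine), something like this avoids re-embedding repeated questions:

```java
@Cacheable(value = "questionEmbeddings", key = "#question")
public float[] embedQuestion(String question) {
    // Repeated questions reuse the cached vector instead of another embedding API call
    return embeddingModel.embed(question);
}
```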

4. Security

Add authentication for sensitive operations:
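A minimal Spring Security sketch that protects the ingest and clear endpoints while leaving chat and search open; it assumes spring-boot-starter-security is on the classpath and uses HTTP Basic purely for illustration:

```java
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.security.config.Customizer;
import org.springframework.security.config.annotation.web.builders.HttpSecurity;
import org.springframework.security.config.annotation.web.configuration.EnableWebSecurity;
import org.springframework.security.web.SecurityFilterChain;

@Configuration
@EnableWebSecurity
public class RagSecurityConfig {

    @Bean
    SecurityFilterChain ragSecurity(HttpSecurity http) throws Exception {
        http
            .csrf(csrf -> csrf.disable())  // demo-only: simplifies calling the POST endpoints with curl
            .authorizeHttpRequests(auth -> auth
                .requestMatchers("/api/rag/ingest", "/api/rag/clear").authenticated()
                .anyRequest().permitAll())
            .httpBasic(Customizer.withDefaults());
        return http.build();
    }
}
```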

5. Monitoring

Track vector search performance and accuracy:
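For example, timing the retrieval path with Micrometer (a MeterRegistry is auto-configured when Actuator is on the classpath; the injected meterRegistry field and metric name are illustrative):

```java
public String chatWithDocumentsTimed(String question, int topK) {
    return meterRegistry.timer("rag.search")
            .record(() -> chatWithDocuments(question, topK));
}
```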

Troubleshooting

Poor Search Results

If answers seem irrelevant:

  1. Increase topK: Retrieve more chunks for better context
  2. Adjust chunk size: Smaller chunks for precise matching, larger for more context
  3. Try a different embedding model: text-embedding-3-large provides higher quality at a higher cost
  4. Add metadata filtering: Filter by document type, date, or section before similarity search

Rate Limiting

OpenAI’s embedding API has rate limits. For large ingestions, add rate limiting:
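A simple approach is to embed in small batches and pause between them; a token-bucket limiter such as Resilience4j’s RateLimiter is a more robust option. A naive sketch, with batch size and delay as placeholders to tune against your own limits:

```java
// Naive throttle: embed chunks in small batches with a pause in between.
private List<float[]> embedWithThrottle(List<String> chunks) throws InterruptedException {
    List<float[]> embeddings = new java.util.ArrayList<>();
    int batchSize = 20;
    for (int i = 0; i < chunks.size(); i += batchSize) {
        List<String> batch = chunks.subList(i, Math.min(i + batchSize, chunks.size()));
        embeddings.addAll(embeddingModel.embed(batch));  // one API call per batch
        Thread.sleep(1_000);                             // crude pacing between batches
    }
    return embeddings;
}
```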

Conclusion

RAG is critical for building AI applications that need to work with your specific data. The Spring AI framework makes it straightforward to implement RAG patterns with minimal code, while providing the flexibility to scale to production use cases.

The combination of Spring Boot’s robust ecosystem and OpenAI’s embeddings and language models enables you to build sophisticated question-answering systems over your documents in just a few hundred lines of Java code.

The complete code for this RAG implementation is available in the GitHub repository, including additional examples and a Postman collection for testing.


Note: Full disclosure – I vibe-coded some of the RAG code using Claude. TBH, I am starting to question the value of traditional blogs as I go into 2026. Anything a developer needs can be “taught” by an LLM as long as you know what to ask, and you can quickly generate code to experiment with. Vibe coding adds a whole different perspective where we may not even need to look at the code (yes yes that is controversial as of Dec 2025).