Happy New Year from the Jumping Rivers team!
Now that we’re in the mid-2020s, it’s a good time to reflect on the changes we’ve seen so far in this decade. In the world of data science, nothing has dominated headlines more than the rapid growth and adoption of generative artificial intelligence (GenAI).
Large Language Models (LLMs) such as ChatGPT, Claude and Gemini have incredible potential to streamline everyday tasks, whether processing large amounts of information, providing a human-like chat interface for customers or generating code. But they also pose significant risks if not used responsibly.
Anyone who has come into contact with these models has probably at some point encountered a hallucination: where the model confidently presents false information as if it were factually correct. This can happen for several reasons:
- LLMs often don’t have access to real-time information: how would a model trained last year know today’s date?
- The training data may be missing domain-specific information: can we really trust an off-the-shelf model to have a good understanding of pharmaceuticals and medicinal drugs?
- The model may be too eager to appear intelligent, so it provides a confident-sounding answer instead of a more nuanced, honest one.
We often need to give the model access to additional contextual information before we can make it ‘production ready’. We can achieve this with the help of a retrieval-augmented generation (RAG) workflow. In this blog post we will explore the steps involved and set up an example RAG workflow using free and open source packages in R.
What is RAG?
In a typical interaction with an LLM we have:
- A user prompt: the text submitted by the user.
- A response: the text returned by the LLM.
- (optional) A system prompt: additional instructions on how the LLM should respond (e.g. "You respond in approximately 10 words or less").
In a RAG workflow, we provide the model with access to an external knowledge store that can contain text-based documents and web pages. Additional contextual information is then retrieved from the knowledge store (hence “retrieval”) and added to the user prompt before it is sent. By doing this, we can expect higher quality output.
How does it work?
Before we go any further, we should first introduce the concept of vectorization.
Contrary to what you might think, LLMs do not actually understand text! They are mathematical models, meaning they can only take in and output numerical vectors.
So how can a user communicate with a model in plain English? The trick is that mappings exist that can convert between numeric vectors and text. These mappings are called ‘vector embeddings’ and are used to convert the user prompt into a vector representation before passing it to the LLM.
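To make this concrete, the sketch below uses {ragnar}'s embed_google_gemini() function (which we use later in this post) to convert two sentences into numeric vectors. Treat it as a rough sketch: it requires a Google API key, and the exact arguments and output dimensions may differ between package versions.

```r
library("ragnar")

# Convert two short texts into numeric vectors (requires a Google API key;
# the exact arguments may differ between {ragnar} versions)
vectors = embed_google_gemini(c(
  "R is a language for statistical computing.",
  "Paris is the capital of France."
))

# One row per input text, one column per embedding dimension
dim(vectors)
```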
So, when setting up our RAG knowledge store, we need to store the information using a compatible vector representation. With this in mind, let’s introduce a typical RAG workflow:
- Content: we decide which documents to include in the knowledge store.
- Extraction: we extract the text from these documents in Markdown format.
- Chunking: the Markdown content is broken down into contextual “chunks” (for example, each section or subsection of a document can become a chunk).
- Vectorization: the chunks are “vectorized” (i.e. converted into a numerical vector representation).
- Indexing: we create an index for our knowledge store that will be used to retrieve relevant pieces of information.
- Retrieval: we register the knowledge store with our model interface. Now, when a user sends a prompt, it is combined with relevant pieces of information before being ingested by the model.
The retrieval step typically uses a matching algorithm so that only highly relevant chunks are retrieved from the knowledge store. This way we can keep the size of the user prompts (and any costs) to a minimum.
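As a toy illustration of the matching idea, the base R sketch below compares a vectorized prompt against two vectorized chunks using cosine distance. The three-dimensional vectors are invented for illustration only; real embeddings typically have hundreds or thousands of dimensions.

```r
# Two pre-vectorized chunks (one row each) - values invented for illustration
chunk_vectors = rbind(
  c(0.9, 0.1, 0.0),  # chunk about R programming
  c(0.0, 0.2, 0.9)   # chunk about an unrelated topic
)
# The vectorized user prompt
prompt_vector = c(0.8, 0.3, 0.1)

# Cosine distance: 1 minus the cosine of the angle between two vectors;
# smaller values indicate more similar content
cosine_distance = function(a, b) {
  1 - sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
}

distances = apply(chunk_vectors, 1, cosine_distance, b = prompt_vector)
which.min(distances)  # the first chunk is the closest match
```

A real retrieval step also applies a relevance cut-off, so that chunks with large distances are dropped rather than merely ranked.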
Setting up a RAG workflow in R
We will use two packages available to install via the Comprehensive R Archive Network (CRAN). Both are actively maintained by Posit (formerly RStudio) and are free to install and use.
{ragnar}
The {ragnar} package provides features for extracting information from both text-based documents and web pages, and offers vector embeddings compatible with popular LLM providers including OpenAI and Google.
We will use {ragnar} to build our knowledge store.
{ellmer}
The {ellmer} package allows us to interact with a variety of LLM APIs from R. A full list of supported model providers can be found in the package documentation.
Please note that while {ellmer} is free to install and use, you will still need to set up an API token with your preferred model provider before you can interact with models. We will use the free Google Gemini tier for our example workflow. See the Gemini API documentation for creating an API key, and the {ellmer} documentation for authenticating with your API key in R.
Example RAG workflow
We start by loading the package {ragnar}.
library("ragnar")
The URL below links to the title page of the textbook “Efficient R Programming”, written by Robin Lovelace and our own Colin Gillespie. We’re going to use a few chapters from the book to put together a RAG knowledge store.
url = "https://csgillespie.github.io/efficientR/"
Let’s use {ragnar} to read the content of this page in a Markdown format.
md = read_as_markdown(url)
We could vectorize this information as it is, but first we need to break it down into contextual chunks.
chunks = markdown_chunk(md)
chunks
#> # @document@origin: https://csgillespie.github.io/efficientR/
#> # A tibble: 2 × 4
#>   start   end context                                text
#> * <int> <int> <chr>                                  <chr>
#> 1     1  1572 ""                                     "# Efficient R Programming…
#> 2   597  2223 "# Welcome to Efficient R Programming" "## Authors\n\n[Colin Gil…
The chunks are stored in a tibble format, with one row per chunk. The
text column stores the chunk text (in the interests of saving space we
have only included the start of each chunk in the printed output above).
The title page has been split into two chunks and we can see that there
is significant overlap (chunk 1 spans characters 1 to 1572 and chunk 2
spans characters 597 to 2223). Overlapping chunks are perfectly normal
and provide added context as to where each chunk sits relative to the
other chunks.
Note that you can visually inspect the chunks by running
ragnar_chunks_view(chunks).
It’s time to build our knowledge store with a vector embedding that is
appropriate for Google Gemini models.
# Initialise a knowledge store with the Google Gemini embedding
store = ragnar_store_create(
  embed = embed_google_gemini()
)

# Insert the Markdown chunks
ragnar_store_insert(store, chunks)
The Markdown chunks are automatically converted into a vector
representation at the insertion step. It is important to use the
appropriate vector embedding when we create the store. A knowledge store
created using an OpenAI embedding will not be compatible with Google
Gemini models!
Before we can retrieve information from our store, we must create a
store index.
ragnar_store_build_index(store)
We can now test the retrieval capabilities of our knowledge store using
the ragnar_retrieve() function. For example, to retrieve any chunks
relevant to the text Who are the authors of “Efficient R
Programming”? we can run:
relevant_knowledge = ragnar_retrieve(
  store,
  text = "Who are the authors of \"Efficient R Programming\"?"
)
relevant_knowledge
#> # A tibble: 1 × 9
#>   origin        doc_id chunk_id start   end cosine_distance  bm25 context text
#> 1 https://csgi…      1        1     1  2223               …     … ""      "# E…
Note that the \ escape characters in \"Efficient R Programming\" have
been used to include literal double quotes in the character string.
Without going into too much detail, the cosine_distance and bm25
columns in the returned tibble provide information relating to the
matching algorithm used to identify the chunks. The other columns relate
to the location and content of the chunks.
From the output tibble we see that the full content of the title page
(characters 1 to 2223) has been returned. This is because the original
two chunks both contained information about the authors.
Let’s add a more technical chapter from the textbook to the knowledge
store and rebuild the index. The URL provided below links to Chapter 7
(“Efficient optimisation”).
url = "https://csgillespie.github.io/efficientR/performance.html"

# Extract Markdown content and split into chunks
chunks = url |>
  read_as_markdown() |>
  markdown_chunk()

# Add the chunks to the knowledge store
ragnar_store_insert(store, chunks)

# Rebuild the store index
ragnar_store_build_index(store)
Now that our knowledge store includes content from both the title page
and Chapter 7, let’s ask something more technical, like What are some
good practices for parallel computing in R?.
relevant_knowledge = ragnar_retrieve(
  store,
  text = "What are some good practices for parallel computing in R?"
)
relevant_knowledge
#> # A tibble: 4 × 9
#>   origin        doc_id chunk_id start   end cosine_distance  bm25 context text
#> 1 https://csgi…      1        …     1  2223               …     … ""      "# E…
#> 2 https://csgi…      2        …     1  1536               …     … ""      "# 7…
#> 3 https://csgi…      2        … 22541 23995               …     … "# 7 E… "## …
#> 4 https://csgi…      2        … 23996 26449               …     … "# 7 E… "The…
Four chunks have been returned: the full title page, the start of
Chapter 7, and two chunks from Section 7.5.
It makes sense that we have chunks from Section 7.5, which appears to be
highly relevant to the question. By including the title page and the
start of Chapter 7, the LLM will also have access to useful metadata
in case the user wants to find out where the model is getting its
information from.
Now that we have built and tested our retrieval tool, it’s time to
connect it up to a Gemini interface using {ellmer}. The code below will
create a chat object allowing us to send user prompts to Gemini.
chat = ellmer::chat_google_gemini(
  system_prompt = "You answer in approximately 10 words or less."
)
A system prompt has been included here to ensure a succinct response
from the model API.
We can register this chat interface with our retrieval tool.
ragnar_register_tool_retrieve(chat, store)
To check if our RAG workflow has been set up correctly, let’s chat with
the model.
chat$chat("What are some good practices for parallel computing in R?")
#> Use the `parallel` package, ensure you stop clusters with `stopCluster()` (or
#> `on.exit()`), and utilize `parLapply()`, `parApply()`, or `parSapply()`.
The output looks plausible. Just to make sure, let’s check where the
model found out this information.
chat$chat("Where did you get that answer from?")
#> I retrieved the information from "Efficient R programming" by Colin Gillespie
#> and Robin Lovelace.
Success! The LLM has identified the name of the textbook and if we
wanted to we could even ask about the specific chapter. A user
interacting with our model interface could now search online for this
textbook to fact-check the responses.
In the example workflow above, we manually selected a couple of chapters
from the textbook to include in our knowledge store. It’s worth noting
that you can also use the ragnar_find_links(url) function to retrieve
a list of links from a given webpage.
Doing so for the title page will provide the links to all chapters.
ragnar_find_links("https://csgillespie.github.io/efficientR/")
#> [1] "https://csgillespie.github.io/efficientR/"
#> [2] "https://csgillespie.github.io/efficientR/building-the-book-from-source.html"
#> [3] "https://csgillespie.github.io/efficientR/collaboration.html"
#> [4] "https://csgillespie.github.io/efficientR/data-carpentry.html"
#> [5] "https://csgillespie.github.io/efficientR/hardware.html"
#> [6] "https://csgillespie.github.io/efficientR/index.html"
#> [7] "https://csgillespie.github.io/efficientR/input-output.html"
#> [8] "https://csgillespie.github.io/efficientR/introduction.html"
#> [9] "https://csgillespie.github.io/efficientR/learning.html"
#> [10] "https://csgillespie.github.io/efficientR/performance.html"
#> [11] "https://csgillespie.github.io/efficientR/preface.html"
#> [12] "https://csgillespie.github.io/efficientR/programming.html"
#> [13] "https://csgillespie.github.io/efficientR/references.html"
#> [14] "https://csgillespie.github.io/efficientR/set-up.html"
#> [15] "https://csgillespie.github.io/efficientR/workflow.html"
You can then go through these links, extract the contents of each web page, and insert them into your RAG knowledge store. However, keep in mind that including additional information in your store will likely increase the amount of text sent to the model, which could increase costs. Therefore, think about what information is actually relevant for your LLM application.
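Sketching this out, the whole book could be ingested with a loop like the one below, building on the functions used earlier in this post. Bear in mind that embedding every chapter will consume API quota, and chapters you don't need will inflate the store.

```r
# Collect the chapter links from the title page
urls = ragnar_find_links("https://csgillespie.github.io/efficientR/")

# Extract, chunk and insert each page in turn
for (url in urls) {
  chunks = url |>
    read_as_markdown() |>
    markdown_chunk()
  ragnar_store_insert(store, chunks)
}

# Rebuild the index once, after all the insertions
ragnar_store_build_index(store)
```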
Summary
In summary, we introduced the concept of retrieval-augmented generation (RAG) for LLM-powered workflows and built an example workflow in R using free and open source packages.
Before we close, we are excited to announce that our new course “LLM-Driven Applications with R & Python” has just been added to our training portfolio. You can find it
here.
If you are interested in practical AI-driven workflows, we look forward to seeing you at our upcoming AI in Production 2026 conference, taking place in Newcastle-upon-Tyne from 4 to 5 June. If you wish to present a talk or workshop, please submit your abstract before the deadline of 23 January.
For updates and revisions to this article, see the original post