Someone simply assumed at some point that RAG must be based on vector search, an...

softwaredoug · 2026-04-03T18:07:49 1775239669

It’s something of a historical accident

We started with LLMs when everyone in search was building question answering systems. Those architectures look like the vector DB + chunking we associate with RAG.

Agents ability to call tools, using any retrieval backend, call that into question.

We really shouldn’t start RAG with the assumption we need that. I’ll be speaking about the subject in a few weeks

https://maven.com/p/7105dc/rag-is-the-what-agentic-search-is...

TeMPOraL · 2026-04-03T18:21:03 1775240463

Right. R in RAG stands for retrieval, and for a brief moment initially, it meant just that: any kind of tool call that retrieves information based on query, whether that was web search, or RDBMS query, or grep call, or asking someone to look up an address in a phone book. Nothing in RAG implies vector search and text embeddings (beyond those in the LLM itself), yet somehow people married the acronym to one very particular implementation of the idea.

macNchz · 2026-04-03T20:43:23 1775249003

Yeah there's a weird thing where people would get really focused on whether something is "actually doing RAG" when it's pulling in all sorts of outside information, just not using some kind of purpose built RAG tooling or embeddings.

Now, the pendulum on that general concept seems to be swinging the opposite direction where a lot of those people just figured out that you don't need embeddings. That's true, but I'd suggest that people don't overindex on thinking that means embeddings are not actually useful or valuable. Embeddings can be downright magical in what you can build with them, they're just one more tool at your disposal.

You can mix and match these things, too! Indexing your documents into semantically nested folders for agents to peruse? Try chunking and/or summarizing each one, and putting the vectors in sidecar files, or even Yaml frontmatter. Disks are fast these days, you can rip through a lot of files indexed like that before you come close to needing something more sophisticated.

viktor_von · 2026-04-04T03:12:10 1775272330

> yet somehow people married the acronym to one very particular implementation of the idea.

Likely due to the rise in popularity of semantic search via LLM embeddings, which for some reason became the main selling point for RAG. Meanwhile keyword search has existed for decades.

oceansky · 2026-04-03T19:18:44 1775243924

I'm still using the old definition, never got the memo.

adfm · 2026-04-03T19:33:52 1775244832

That’s OK. Most got ReST wrong, too.

KPGv2 · 2026-04-03T19:27:52 1775244472

You seem like someone who knows what they're doing, and I understand the theoretical underpinnings of LLMs (math background), but I have little kids that were born in 2016 and so the entire AI thing has left me in the dust. Never any time to even experiment.

I am active in fandoms and want to create a search where someone can ask "what was that fanfic where XYZ happened?" and get an answer back in the form of links to fanfiction that are responsive.

This is a RAG system, right? I understand I need an actual model (that's something like ollama), the thing that trawls the fanfiction archive and inserts whatever it's supposed to insert into one of these vector DBs, and I need a front-facing thing I write, that takes a user query, sends it to ollama, which can then search the vector DB and return results.

Or something like that.

Is it a RAG system that solves my use case? And if so, what software might I go about using to provide this service to me and my friends? I'm assuming it's pretty low in resource usage since it's just text indexing (maybe indexing new stuff once a week).

The goal is self-hosting. I don't wanna be making monthly payments indefinitely for some silly little thing I'm doing for me and my friends.

I am just a stay at home dad these days and don't have anyone to ask. I'm totally out the tech game for a few years now. I hope that you could respond (or someone else could), and maybe it will help other people.

There's just so many moving parts these days that I can't even hope to keep up. (It's been rather annoying to be totally unable to ride this tech wave the way I've done in the past; watching it all blow by me is disheartening).

9dev · 2026-04-03T20:23:03 1775247783

In the definition of RAG discussed here, that means the workflow looks something like this (simplified for brevity): When you send your query to the server, it will first normalise the words, then convert them to vectors, or embeddings, using an embedding model (there are also plain stochastic mechanisms to do this, but today most people mean a purpose-built LLM). An embedding is essentially an array of numeric coordinates in a huge-dimensional space, so [1, 2.522, …, -0.119]. It can now use that to search a database of arbitrary documents with pre-generated embeddings of their own. This usually happens during inserting them to the database, and follows the same process as your search query above, so every record in the database has its own, discrete set of embeddings to be queried during searches.

The important part here is that you now don’t have to compare strings anymore (like looking for occurrences of the word "fanfiction" in the title and content), but instead you can perform arbitrary mathematical operations to compare query embeddings to stored embeddings: 1 is closer to 3 than 7, and in the same way, fanfiction is closer to romance than it is to biography. Now, if you rank documents by that proximity and take the top 10 or so, you end up with the documents most similar to your query, and thus the most relevant.

That is the R in RAG; the A as in Augmentation happens when, before forwarding the search query to an LLM, you also add all results that came back from your vector database with a prefix like "the following records may be relevant to answer the users request", and that brings us to G like Generation, since the LLM now responds to the question aided by a limited set of relevant entries from a database, which should allow it to yield very relevant responses.

I hope this helps :-)

johnathandos · 2026-04-03T19:37:16 1775245036

I think the example you give is a little backwards — a RAG system searches for relevant content before sending anything to the LLM, and includes any content retrieved this way in the generative prompt. User query -> search -> results -> user query + search results passed in same context to LLM.

senordevnyc · 2026-04-04T03:44:12 1775274252

Honestly, just from this question, I think you know enough that I’d go spend $20/month for a subscription to Codex, Claude Code, or Cursor, and ask them to teach you all this. I bet if you put in your comment verbatim with Opus 4.6 and went back and forth a bit, it could help you figure out exactly what you need and build a first version in a couple hours. Seriously, if you know the fundamentals and can poke and prod, these tools are amazing for helping expand your knowledge base. And constraints like how much you want to pay are excellent for steering the models. Seriously, just try it!

justinclift · 2026-04-04T18:12:12 1775326332

You don't need to pay an external crowd for that.

You can run Claude Code using a local instance of ~recent Ollama fine, and it'll do the teaching job perfectly well using (say) Qwen 3.5.

Doesn't even need to be one of the large models, one of the mid-size ones that fit in ~16GB of ram when given 128k+ context size should be fine.

lelanthran · 2026-04-04T07:49:05 1775288945

> Honestly, just from this question, I think you know enough that I’d go spend $20/month for a subscription to Codex, Claude Code, or Cursor, and ask them to teach you all this.

Paying $20/m sounds like overkill. I have tabs open for all of the most well-known AI chatbots. Despite trying my hardest, it is not possible to exhaust your free options just by learning.

Hell, just on the chatbots alone, small projects can be vibe-coded too! No $20/m necessary.

senordevnyc · 2026-04-04T15:29:19 1775316559

Yeah, but when it comes to actually building stuff, using Codex is night and day different from using ChatGPT.

lelanthran · 2026-04-04T17:16:03 1775322963

> Yeah, but when it comes to actually building stuff, using Codex is night and day different from using ChatGPT.

Sure, but that wasn't what you recommended Codex for, was it?

>>> Honestly, just from this question, I think you know enough that I’d go spend $20/month for a subscription to Codex, Claude Code, or Cursor, and ask them to teach you all this.

rafterydj · 2026-04-03T19:02:59 1775242979

Stuck it on my calendar, looking forward to it.

safety1st · 2026-04-04T04:11:21 1775275881

We were given a demo of a vector based approach, and it didn't work. They said our docs were too big and for some reason their chunking process was failing. So we ended up using a good old fashioned Elastic backend because that's what we know, and simply forwarding a few of these giant documents to the LLM verbatim along with the user's question. The results have been great, not a single complaint about accuracy, results are fast and cheap using OpenAI's micro models, Elastic is mature tech everyone understands so it's easy to maintain.

I think this turned out to be one of those lessons about premature optimization. It didn't need to be as complex as what people initially assumed. Perhaps with older models it would have been a different story.

bartread · 2026-04-04T04:26:54 1775276814

> They said our docs were too big and for some reason their chunking process was failing.

Why would the size of your docs have any bearing on whether or not the chunking process works? That makes no sense. Unless of course they're operating on the document entirely in memory which seems not very bright unless you're very confident of the maximum size of document you're going to be dealing with.

(I implemented a RAG process from scratch a few weeks ago, having never done so before. For our use case it's actually not that hard. Not trivial, but not that hard. I realise there are now SaaS RAG solutions but we have almost no budget and, in any case, data residence is a huge concern for us, and to get control of that you generally have to go for the expensive Enterprise tier.)

safety1st · 2026-04-04T04:45:00 1775277900

I agree it makes no sense. The whole point of chunking is to handle large documents. If your chunking system fails because a document is too big, that seems like a pretty glaring omission. I just chalked it up to the tech being new and novel and therefore having more bugs/people not fully understanding how it worked/etc. It was a vendor and they never gave us more details.

Not all problems have to be solved. We just fell back to using older, more proven technology, started with the simplest implementation and iterated as needed, and the result was great.

bartread · 2026-04-06T08:33:41 1775464421

That's good. I think if you can get the result you need with a technology that's already familiar to you then, in cases where that tech is still supported, that's going to be a win.

RAG worked well for us in this recent case but, in 3+ years of developing LLM backed solutions, it's the first time I've had to reach for it.

morkalork · 2026-04-03T18:02:12 1775239332

Doesn't have to be tho, I've had great success letting an agent loose on an Apache Lucene instance. Turns out LLMs are great at building queries.

ivanovm · 2026-04-03T19:47:33 1775245653

I don't think this was a simple assumption. LLMs used to be much dumber! GPT-3 era LLMS were not good at grep, they were not that good at recovering from errors, and they were not good at making followup queries over multiple turns of search. Multiple breakthroughs in code generation, tool use, and reasoning had to happen on the model side to make vector-based RAG look like unnecessary complexity

bluegatty · 2026-04-03T18:35:58 1775241358

It was the terminology that did that more than anything. The term 'RAG' just has a lot of consequential baggage. Unfortunately.

darkteflon · 2026-04-03T21:32:05 1775251925

Certainly a lot of blog posts followed. Not sure that “everyone” was so blinkered.

graemefawcett · 2026-04-04T03:58:55 1775275135

RAG is like when you want someone to know something they're not quite getting so you yell a bit louder. For a workflow that's mainly search based, it's useful to keep things grounded.

Less useful in other contexts, unless you move away from traditional chunked embeddings and into things like graphs where the relationships provide constraints as much as additional grounding