Fine Tuning Versus Prompting Versus RAG: From First Principles to Delivery

By La Boétie. Updated June 26, 2026. 24 min read.

Fine tuning versus prompting versus RAG is the first real architecture decision in any serious large language model build, and most teams answer it backwards. Prompting means steering a general model through its input alone, with no change to its weights. Retrieval-augmented generation (RAG) means fetching relevant documents at query time and placing them in the model's context window so the answer is grounded in your data. Fine tuning means updating the model's weights on your own examples so a behaviour is baked in. The honest answer is an order, not a winner: start with prompting, add RAG when the model needs knowledge it never learned, and fine tune only to change behaviour the first two cannot reach. This pillar gives you that rule, the dated numbers behind it, and the position you can quote in a pitch.

Key takeaways:

Prompting is the cheapest first move, with zero training cost and same-day iteration, so it earns the first attempt before anything heavier.

PE Collective's 2026 cost analysis puts a production RAG system at $350 to $2,850 per month against $10,000 to $100,000 for a full fine tuning project.

A Microsoft Research case study on arXiv from January 2024 measured fine tuning adding more than 6 percentage points of accuracy and RAG adding 5 more, with the combination beating either alone.

LoRA (low-rank adaptation) trains roughly 0.1% to 1% of a model's parameters, according to Databricks, cutting the cost of fine tuning by close to 99%.

The studio rule is simple enough to defend in a board meeting: RAG for knowledge, fine tuning for behaviour, and a combination only when a measured gap pays for itself.

What fine tuning versus prompting versus RAG actually decides

Every entry under this hub answers one question: given a model that is already good, how do you make it good at your problem without setting money on fire? Fine tuning versus prompting versus RAG is not a technology beauty contest. It is a decision about where your problem actually lives. If the model gives a wrong answer, the useful first question is why it is wrong. Wrong because it never saw your data is a knowledge problem. Wrong because it formats, reasons or refuses incorrectly even when handed the right context is a behaviour problem. Knowledge problems are solved by retrieval. Behaviour problems are solved by training. Confusing the two is the single most expensive mistake we see.

Prompting is the baseline because it changes nothing about the model and everything about the instruction. A well-built prompt, with clear role, constraints and examples, closes more gaps than most teams expect, and it costs only the tokens you send. The context window, the amount of text a model can read at once, has grown large enough that many problems framed as fine tuning candidates are really prompting problems in disguise. McKinsey's 2025 State of AI survey reports that 78% of organisations now use AI in at least one business function, yet only about 6% qualify as high performers; the gap is rarely the model, and usually the build around it.
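A minimal sketch of what "clear role, constraints and examples" can look like in code. The PromptSpec class and build_prompt function are hypothetical names of ours, not part of any library, and the layout of the prompt is one reasonable convention rather than a required format.

```python
from dataclasses import dataclass, field


@dataclass
class PromptSpec:
    """Hypothetical container for the three parts of a well-built prompt."""
    role: str
    constraints: list[str]
    # Worked examples as (question, ideal answer) pairs.
    examples: list[tuple[str, str]] = field(default_factory=list)


def build_prompt(spec: PromptSpec, question: str) -> str:
    """Assemble role, constraints and worked examples into one prompt string."""
    lines = [f"Role: {spec.role}", "", "Constraints:"]
    lines += [f"- {c}" for c in spec.constraints]
    for q, a in spec.examples:
        lines += ["", f"Q: {q}", f"A: {a}"]
    # The final question is left open for the model to complete.
    lines += ["", f"Q: {question}", "A:"]
    return "\n".join(lines)
```

Keeping the prompt as data rather than a hand-edited string makes it easy to version, and to measure one change at a time before deciding prompting is insufficient.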

RAG enters when the answer depends on knowledge the model was never trained on: your contracts, your product catalogue, last week's pricing. Instead of teaching the model that knowledge, you retrieve it and show it. The mechanics rest on embeddings, numeric representations of meaning, stored in a vector database that finds the passages closest to a question. Fine tuning is the heaviest lever: you change the weights themselves, which is the right move when you need a consistent behaviour, a house tone, a structured output format, or a narrow classification the model keeps getting wrong. Reach for it last, not first.
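The retrieval step can be sketched in a few lines, under loud assumptions: the embed function below is a toy word-count stand-in for a learned embedding model, and a plain Python list stands in for the vector database. Only the shape of the mechanism, embed, rank by similarity, place the winners in the context, carries over to a production system.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: lower-cased word counts."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(question: str, passages: list[str], k: int = 1) -> list[str]:
    """Return the k passages closest to the question."""
    q = embed(question)
    ranked = sorted(passages, key=lambda p: cosine(q, embed(p)), reverse=True)
    return ranked[:k]


def grounded_prompt(question: str, passages: list[str]) -> str:
    """Place the retrieved passages in the context ahead of the question."""
    context = "\n".join(f"- {p}" for p in retrieve(question, passages, k=2))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"
```

Note that nothing here touches model weights: changing what the system knows means changing the passages list, which is the property the rest of this pillar leans on.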

Consider a concrete case. A support assistant answers 1,000 questions a day and gets 18% wrong. If you inspect the failures and find 15 of every 18 are wrong because the model never saw your current pricing, that is a knowledge gap, and RAG closes most of it for a few hundred dollars a month. If instead 15 of every 18 are wrong because the model ignores your required answer format or escalation rules even when the pricing sits in front of it, no amount of retrieval helps, and a small fine tuning run is the honest fix. Same error rate, opposite remedy. The diagnosis, not the headline accuracy number, is what decides fine tuning versus prompting versus RAG.
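The arithmetic of that case, as a small sketch. The triage function and the REMEDY mapping are illustrative names, and the labels are assumed to come from a human reading each failed answer; the code only counts them.

```python
from collections import Counter


def triage(labels: list[str]) -> tuple[str, float]:
    """Return the dominant failure type and its share of all failures."""
    counts = Counter(labels)
    kind, n = counts.most_common(1)[0]
    return kind, n / len(labels)


# Which rung answers which failure type, per the rule in this article.
REMEDY = {"knowledge": "RAG", "behaviour": "fine tuning"}

# The support-assistant example: 1,000 questions a day, 18% wrong.
daily_failures = round(1000 * 0.18)               # 180 wrong answers a day
sample = ["knowledge"] * 15 + ["behaviour"] * 3   # 15 of every 18 failures
kind, share = triage(sample)

# Dominant failure type, its remedy, and how many daily failures it explains.
print(daily_failures, kind, REMEDY[kind], round(share * daily_failures))
# prints: 180 knowledge RAG 150
```

Flip the sample to fifteen behaviour failures and three knowledge failures and the same 18% error rate points at the opposite remedy.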

One more variable sharpens the choice: how much labelled data you hold. Prompting needs only a handful of examples. RAG needs a corpus worth retrieving from, but no labels. Fine tuning needs hundreds to thousands of high-quality labelled examples, and its output is only as good as that data. If you cannot assemble clean training examples, the fine tuning rung is closed to you regardless of how attractive it looks, which is why the studio audits data readiness before it audits ambition.

The studio house position on fine tuning versus prompting versus RAG

Here is where we part company with most of the field. The standard advice treats fine tuning versus prompting versus RAG as three boxes to evaluate in parallel and then pick from. We treat it as a ladder you climb only as far as the evidence forces you. You start on the cheapest rung and you do not move up until a measurement, not a hunch, tells you the current rung cannot reach the target. The field hand-waves at "it depends"; we put the dependency in writing.

Our position, stated plainly enough to quote: prompting is the default, RAG owns knowledge, fine tuning owns behaviour, and you combine them only when a measured gap justifies the cost. Most teams invert this. They fine tune first because it feels like the serious, technical answer, then discover they have frozen a snapshot of knowledge that is stale the moment a document changes. PE Collective's 2026 analysis captures the trap in one number: a document update that costs nothing to re-index in RAG costs $500 to $5,000 to retrain into a fine tuned model. Fine tuning your knowledge is paying a premium to make your data harder to change.

The combine case deserves precision, because "use both" is where lazy advice hides. Combining RAG and fine tuning is correct only when you have measured two distinct gaps: a knowledge gap that retrieval closes and a behaviour gap that training closes. The Microsoft Research result, fine tuning plus RAG beating either alone, is real, but it is a ceiling you earn through measurement, not a default you adopt because both sound thorough. Techment's 2026 figure that roughly 60% of production projects run both is often cited as proof everyone should; we read it as proof that most mature systems eventually find two gaps, not that you should open with two solutions.

The second half of the position is about sovereignty, and it is not decoration. A retrieval system keeps your knowledge in your database, auditable and portable, instead of dissolving it into opaque weights you cannot inspect or move. When La Boétie builds, the client keeps ownership of the data, the pipeline and the model choice. That principle, drawn from Étienne de La Boétie's 1548 argument against voluntary servitude, has a concrete engineering consequence here: architectures that lock your knowledge inside a vendor's fine tuned model are architectures you can be held hostage by. RAG-first is also sovereignty-first. If you want the long version of how we apply this rule case by case, our selection walkthrough for picking an approach is the companion piece.

[Figure: charts comparing the cost and accuracy of prompting, RAG and fine tuning]

How the three approaches compare on cost, latency, and accuracy

Numbers settle arguments that adjectives cannot. The table below puts the three approaches side by side on the dimensions that actually drive the decision, using 2024 to 2026 figures from named sources. Read it as a starting map, not a verdict; your traffic volume and data freshness move the breakeven point.

Dimension	Prompting	RAG	Fine tuning
Upfront cost	Near zero	$5,000 to $50,000 setup	$10,000 to $100,000 project
Running cost	API tokens only	$350 to $2,850 per month	Inference plus periodic retrain
Time to first result	Hours	Days	Weeks
Knowledge freshness	Model cutoff	Live, re-indexed at $0	Frozen until retrain
Best at	Format and reasoning prompts	Current, proprietary knowledge	Consistent behaviour and style
Data sovereignty	High	High, data stays in your store	Lower, knowledge baked into weights

The accuracy picture is where most write-ups oversimplify. The Microsoft Research agriculture study on arXiv from January 2024 is the cleanest public benchmark: from a strong base model, fine tuning lifted accuracy by more than 6 percentage points, and adding RAG on top contributed a further 5, while answer similarity to expert responses climbed from 47% to 72% once the fine tuned model could draw on retrieved cross-regional context. The lesson is not that one approach wins. It is that knowledge gains and behaviour gains stack, because they fix different failures.

Maintenance is the dimension that surprises teams after launch. A prompting system has almost nothing to maintain beyond the prompt itself. A RAG system carries the ongoing cost of keeping its index fresh and its retrieval relevant, which is real but bounded. A fine tuned model carries the heaviest tail: every meaningful knowledge change is a retraining cycle, and Databricks notes that even parameter-efficient runs still demand data curation, evaluation, and version management. Over a two-year horizon, the approach that looks cheapest on day one is frequently the most expensive to own, which is why the studio weighs total cost of ownership, not just the upfront build, when it frames fine tuning versus prompting versus RAG.
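A rough sketch of the two-year comparison, using the low ends of the ranges quoted in this article. The update frequency of one knowledge change per month is our assumption, inference cost is left out of the fine tuning figure because the sources here give no number for it, and both function names are illustrative.

```python
def rag_tco(months: int, setup: float, monthly: float) -> float:
    """Setup plus running cost; re-indexing a changed document is treated as free."""
    return setup + monthly * months


def fine_tune_tco(months: int, project: float, retrain: float, updates_per_month: float) -> float:
    """Project cost plus one retraining cycle per knowledge update (inference excluded)."""
    return project + retrain * updates_per_month * months


months = 24
# Low end of each quoted range; one update a month is an assumed workload.
print(rag_tco(months, setup=5_000, monthly=350))                                # prints 13400
print(fine_tune_tco(months, project=10_000, retrain=500, updates_per_month=1))  # prints 22000
```

The point of the exercise is the shape, not the exact totals: the fine tuning line grows with every knowledge change, so the more often your documents move, the faster the gap widens.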

Cost is moving in RAG's favour. PE Collective notes embedding model pricing fell roughly 30% in the first quarter of 2026, and Techment's 2026 strategy review reports that retrieval can cut error rates by more than 30% on knowledge-intensive tasks while roughly 60% of production projects now run RAG and fine tuning together. For the full per-approach economics, including the hidden retraining and evaluation line items, see our approach cost breakdown. The headline holds: prompting is cheapest to try, RAG is cheapest to keep current, and fine tuning is cheapest only when a behaviour gap survives the other two.

Why teams reach for fine tuning first, and why it backfires

The most common pattern we see is a team that opens the question by reaching straight for training, because it feels like the most serious engineering answer. It is also the most expensive way to be wrong. Fine tuning bakes a snapshot of knowledge into the weights, so the day a price list, a policy, or a product spec changes, the model is confidently out of date and nobody notices until a customer does. McKinsey's 2025 State of AI survey found only about 6% of organisations qualify as AI high performers, and in our experience the dividing line is rarely model quality; it is whether the team matched the approach to the failure.

There is a second, quieter cost. A fine tuned model is harder to debug, because the reasoning lives in opaque weights rather than in a prompt or a retrieved passage you can read. When an answer is wrong, you cannot point at the clause that caused it. RAG keeps that audit trail intact: every answer traces back to a document you can open. For a regulated business, that traceability is not a nice-to-have, it is the difference between a defensible system and a liability. The studio treats fine tuning as the rung you justify, never the rung you assume, and our catalogue of approach anti-patterns documents the failure modes that follow from getting fine tuning versus prompting versus RAG backwards.

A decision rule you can defend in a board meeting

When a client asks which approach to use, we do not answer with "it depends." We answer with an ordered checklist. Run it top to bottom and stop at the first rung that closes your gap.

1. Define the failure precisely. Write down a dozen real wrong answers and label each as a knowledge failure or a behaviour failure. The split decides everything that follows.
2. Exhaust prompting first. Rewrite the instruction with explicit role, constraints, and two to three worked examples. Measure again. Prompting is free and same-day; most format and reasoning gaps close here.
3. Add RAG for knowledge gaps. If the model is wrong because it lacks your data, build retrieval before you touch weights. Re-indexing a changed document costs $0, where retraining costs $500 to $5,000 per PE Collective's 2026 figures.
4. Reserve fine tuning for behaviour gaps. Persistent tone, format, or classification errors that survive a correct context are the only clean signal to train. Hugging Face's June 2026 engineering blog found that 98.4% of model cards using a parameter-efficient method choose LoRA, so start with a thin LoRA adapter, not a full retrain.
5. Quantify before you commit. Set a target metric and a budget ceiling. If a fine tuning project will cost $10,000 to $100,000 to gain three points your users will not notice, it fails the test.
6. Combine only on evidence. The 6 plus 5 percentage-point stack from the Microsoft Research study is real, but you earn it by measuring each layer, not by assuming both.
7. Protect ownership. Choose the architecture that keeps your data and model portable. A locked-in fine tuned model is a liability disguised as an asset.
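The checklist above can be sketched as a single function. The Diagnosis fields and choose_rungs are hypothetical names, and the sketch simplifies deliberately: it assumes the failure counts come from a measured evaluation run, and it reduces the budget and ownership tests to one boolean you set yourself.

```python
from dataclasses import dataclass


@dataclass
class Diagnosis:
    """Hypothetical summary of a measured evaluation run."""
    gap_after_prompting: bool       # does a reworked prompt still miss the target?
    knowledge_failures: int         # failures where the model lacked your data
    behaviour_failures: int         # failures that persist with correct context
    training_justified: bool        # clean labelled data exists and the budget test passes


def choose_rungs(d: Diagnosis) -> list[str]:
    """Climb the ladder only as far as the measured gaps require."""
    rungs = ["prompting"]
    if not d.gap_after_prompting:
        return rungs                # stop at the first rung that closes the gap
    if d.knowledge_failures > 0:
        rungs.append("RAG")
    if d.behaviour_failures > 0 and d.training_justified:
        rungs.append("fine tuning (LoRA adapter)")
    return rungs
```

For example, choose_rungs(Diagnosis(True, 15, 0, False)) returns prompting plus RAG, while a run with only behaviour failures and training justified returns prompting plus a LoRA adapter. Both upper rungs appear only when both kinds of failure were actually counted.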

This is the rule the studio applies on every engagement, and it is deliberately boring. Boring rules survive contact with production. For the version with worked scoring per criterion, our approach decision framework carries the full rubric.

Three engagements where the approach choice was load-bearing

Frameworks are cheap; the test is what happens when they meet a deadline. Three anonymised engagements from the studio show the rule working.

A French savings and insurance platform needed an assistant that could answer policyholder questions from a library of contracts that changed weekly. The DIY attempt fine tuned a model on a contract snapshot and shipped answers that were already wrong by the next product update. We rebuilt it as RAG over the live document store in under two weeks. Retrieval grounded every answer in a citable clause, and knowledge updates dropped to a re-index job costing nothing per change. Knowledge problem, retrieval answer.

A European auction house wanted lot descriptions classified into several hundred catalogue categories with consistent house terminology. Prompting alone drifted on the long tail, and RAG could not impose the in-house taxonomy because the rules lived in expert heads, not documents. This was a behaviour gap, so we trained a LoRA adapter on a few thousand expert-labelled examples; the adapter touched well under 1% of parameters, trained in hours, and held the taxonomy steady. Behaviour problem, fine tuning answer.

A growth-stage marketplace arrived after a month of building insecure prototypes with off-the-shelf AI tools, the kind with exposed environment variables and unprotected routes. They assumed they needed a fine tuned model. The real fix was a disciplined prompt layer plus RAG over their own catalogue, delivered as a secure, properly architected system in a fraction of the time a from-scratch rebuild would have taken. No weights were changed. The studio's read on when training is genuinely the right call lives in our enterprise approach field report.

Read together, the three engagements make the same point from three directions. The savings platform had a knowledge gap and a retrieval answer. The auction house had a behaviour gap and a training answer. The marketplace had neither, just an architecture problem dressed up as a model problem. In every case the winning move was diagnosis before construction, and the right answer to fine tuning versus prompting versus RAG fell out of the diagnosis rather than the other way round. None of the three needed the most expensive option, and two of them arrived convinced they did. That gap, between what clients ask for and what they actually need, is where the studio earns its keep.

[Figure: a decision map routing readers to the right hub entry by starting condition]

Which entry to read first, by your starting condition

This hub fans out into topical, focal, and special entries, and you do not need to read them in order. Use your starting condition to pick the entry that earns your next hour. The sub-topic map below routes you by where you actually stand.

If you have never shipped an LLM feature and want the ordered method, start with our selection walkthrough, the topical entry that lays out the method step by step. If a stakeholder is demanding numbers before approving spend, the approach benchmarks reference gives you dated, defensible figures. If your specific question is the narrow RAG-or-train fork, the side-by-side comparison of RAG versus a fine tuned model is the focal entry for exactly that.

The diagnostic entries matter most when something has already gone wrong. If you suspect you picked the wrong approach and want to confirm before you spend again, read the wrong approach choice postmortem. If you are weighing whether your stage even justifies training, the focal entry on whether you should fine tune at this stage answers the maturity question directly. Founders facing an investor or acquirer who will scrutinise the architecture should read the entry on due diligence around approach choice, because how you answer fine tuning versus prompting versus RAG signals engineering judgement to anyone reading your stack. Each entry is self-contained; this pillar is the map, and the entries are the territory.

What is changing in this hub this year

The decision is not static, and 2026 has already moved the breakeven points. Embedding costs fell about 30% in the first quarter of 2026 per PE Collective, which pushes more borderline cases toward RAG by making retrieval cheaper to run at scale. Context windows keep growing, which quietly absorbs problems that used to look like fine tuning candidates: when you can fit a policy manual into the prompt, you may not need to train on it at all.

Parameter-efficient training is the other shift. Hugging Face's June 2026 engineering blog reports that 98.4% of parameter-efficient model cards and 71.3% of relevant code imports point to LoRA. It also shows newer methods such as OFT beating LoRA on some tasks at lower memory; the LoRA baseline it reports reaches 53.2% test accuracy on a math benchmark at 22.6 GB of memory. The practical takeaway: fine tuning has become cheap enough that, when a measured behaviour gap does justify training, a thin adapter paired with RAG is the high-return choice, not full retraining. Deloitte's 2026 State of AI in the Enterprise puts generative AI use at 71% of firms, up from 55% in 2024, so the competitive bar for getting this decision right is rising fast.

The third change is organisational, not technical. As generative AI use climbs toward that 71% mark, the people approving these builds are no longer only engineers; they are CFOs and boards who want the decision justified in money. That raises the premium on a defensible rule over a fashionable one. The teams winning in 2026 are not the ones with the fanciest training pipeline. They are the ones who can show, on one slide, why they answered fine tuning versus prompting versus RAG the way they did and what that choice costs to maintain. The studio's standing advice holds: let cost curves and context windows pull you toward the lighter rung whenever the measurement allows.

Where this hub sits in AI and ML engineering

Fine tuning versus prompting versus RAG is one decision inside a larger discipline. The AI and ML engineering family this hub belongs to also covers retrieval architecture in depth, agent orchestration, evaluation and observability, vector database selection, and the cost control that keeps a system economical at scale. Those sibling hubs answer the questions that come after you have chosen an approach: how to build the retrieval pipeline well, how to measure whether it is working, and how to keep it cheap.

The throughline across all of them is the same one that runs through this pillar. The gap between a demo that impresses in a meeting and a system that survives real users is engineering discipline: measurement before commitment, ownership over lock-in, and the right tool on the right rung. When you move from this hub to the retrieval or evaluation hubs, carry the same habit. Decide with numbers, keep your data portable, and add complexity only when a measurement demands it. The approach you pick here sets the constraints every downstream hub inherits.

A practical consequence: the answer you settle here propagates downstream. Choose RAG and your next decisions are about chunking strategy, embedding models, and vector database selection. Choose fine tuning and your next decisions are about data labelling, adapter management, and evaluation harnesses. Choose prompting and your effort shifts to prompt versioning and guardrails. Each sibling hub assumes you have already resolved fine tuning versus prompting versus RAG, because that resolution defines the shape of everything built on top of it.

How La Boétie helps you choose your approach

Most teams do not lose on the model; they lose on the decision around it. La Boétie is a venture studio, agency, and technical consultancy that operates as a single flexible team of five to six multilingual engineers, and we make this exact call with clients every week. We do not sell you the most expensive rung. We assess what your problem actually needs and build the right thing, which is more often a sharp prompt layer and clean RAG than a costly retrain.

Architecture review. We start by labelling your failures as knowledge or behaviour problems and pressure-test whether you need training at all. Teams that arrive after a month of insecure DIY prototypes, with exposed environment variables and unprotected routes, routinely leave with a rebuilt, secure design in a fraction of that time.

Build and integration. We ship the chosen architecture end to end, from retrieval pipelines to LoRA adapters, drawing on in-house platforms we built for ourselves, including Cortex, Lynkflow, Amorphous and Socialforge. You keep full ownership of the data, the pipeline, and the model choice; nothing is locked inside a vendor stack.

Fractional technical leadership. When you need senior architectural judgement without a full-time hire, we operate as your externalised CTO, owning the roadmap and the build standard. Across client work spanning finance, insurance, auctions, legal and more, the constant is the same opinionated partnership: you keep what gets built.

If you are weighing fine tuning versus prompting versus RAG for a real product, book a studio intro call. Bring a dozen wrong answers from your current system, and we will tell you which rung of the ladder actually closes the gap, before you spend a euro on the wrong one.

FAQ: fine tuning versus prompting versus RAG

Is fine tuning versus prompting versus RAG an either-or choice?

No. Prompting, RAG and fine tuning sit on a ladder of cost and control, not a menu of exclusives. Roughly 60% of 2025 to 2026 production projects combine RAG and fine tuning, according to Techment's 2026 analysis. The studio starts with prompting, layers RAG for knowledge, and fine tunes only when a measured behaviour gap remains after the first two rungs.

When is RAG the right answer over fine tuning?

Choose RAG when the model needs current, proprietary or frequently changing knowledge it was never trained on. A document update costs nothing to re-index in a RAG system but $500 to $5,000 to bake back into a fine tuned model, per PE Collective's 2026 cost analysis. RAG also keeps your data auditable in your own store and your sources citable in every answer.

How much does fine tuning actually cost in 2026?

PE Collective's 2026 figures put a full fine tuning project between $10,000 and $100,000 once data preparation and training are counted, against $350 to $2,850 per month for a production RAG system. Parameter-efficient methods such as LoRA cut the training bill by close to 99% by updating only 0.1% to 1% of weights, according to Databricks.

Does fine tuning improve accuracy more than RAG?

It depends on whether your gap is knowledge or behaviour. A Microsoft Research case study on arXiv in January 2024 measured fine tuning adding over 6 percentage points of accuracy and RAG adding a further 5, with the two combined beating either alone. Fine tuning fixes how the model behaves; RAG fixes what it knows. Stack them only when each layer earns its place in measurement.

What is LoRA and why does it matter for this decision?

LoRA, or low-rank adaptation, is a parameter-efficient fine tuning method that freezes the base model and trains small adapter matrices, roughly 0.1% to 1% of parameters. Hugging Face's June 2026 engineering blog on parameter-efficient methods found 98.4% of relevant model cards point to LoRA. It makes fine tuning cheap enough to pair with RAG rather than replace it.
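A quick way to see where a figure in that range comes from, for a single weight matrix. LoRA replaces the update to a frozen d_out by d_in matrix with two small matrices of rank r, so the trainable count is r times (d_in + d_out). The 4096 dimension and rank 8 below are illustrative values of ours, not figures from Databricks or Hugging Face.

```python
def lora_trainable_fraction(d_in: int, d_out: int, rank: int) -> float:
    """Trainable adapter parameters as a fraction of one frozen weight matrix."""
    frozen = d_in * d_out
    adapter = rank * (d_in + d_out)   # B is d_out x rank, A is rank x d_in
    return adapter / frozen


# One 4096 x 4096 projection with a rank-8 adapter (illustrative sizes).
print(f"{lora_trainable_fraction(4096, 4096, rank=8):.2%}")  # prints 0.39%
```

Across a whole model the share is usually lower still, because adapters are typically attached to only some of the weight matrices.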

Where should I start if I have never built with an LLM?

Start with prompting on a strong base model and measure. If answers are wrong because the model lacks your knowledge, add RAG. If answers are wrong because the model behaves incorrectly even with the right context, then consider fine tuning. Our selection walkthrough takes you through that exact order, and it is the cheapest path to a system that works.

Can prompting alone be enough for production?

Often, yes. With large context windows and strong base models, a disciplined prompt with clear constraints and worked examples solves a surprising share of problems at zero training cost. The studio ships prompting-only systems whenever measurement shows the gap is about instruction rather than knowledge or behaviour. The rule is to prove prompting insufficient before spending on RAG or fine tuning, not to assume it.

Conclusion

The reason fine tuning versus prompting versus RAG trips up so many teams is that it looks like a technology question when it is really a diagnosis question. Decide where your problem lives first, knowledge or behaviour, and the approach picks itself: prompting to start, RAG for knowledge, fine tuning for behaviour, and a combination only when measurement proves the gap is worth the spend. That ladder is cheaper, more honest, and more sovereign than the fine-tune-first instinct most builds default to. Answer fine tuning versus prompting versus RAG in that order and you will ship a system that survives real users, keeps your data in your hands, and costs what it should.

Sources

Beyond LoRA: Can you beat the most popular fine-tuning technique? Hugging Face, 2026.
RAG vs Fine-tuning: Pipelines, Tradeoffs, and a Case Study on Agriculture. Microsoft Research, 2024.
RAG vs Fine-Tuning: Cost Comparison and When to Use Each. PE Collective, 2026.
Efficient Fine-Tuning with LoRA: A Guide to Optimal Parameter Selection. Databricks, 2023.
The State of AI: How organizations are rewiring to capture value. McKinsey & Company, 2025.
State of AI in the Enterprise. Deloitte, 2026.
RAG vs Fine-Tuning vs AI Agents: Choosing the Right LLM Strategy. Techment, 2026.
