Beyond Style: Fine-Tuning as a Path to Knowledge Injection
Introduction: The Perception Gap
Fine-tuning is one of those ideas that grabs people right away when they first get into large language models. If the model doesn’t know what you need, why not just train it on that information? Start researching it, though, and you’ll immediately find folks telling you it won’t work.
The community consensus today is that fine-tuning is only good for changing the style, tone, or format of an LLM’s output. Occasionally, you’ll come across someone claiming success at teaching a model new “facts”, but you’ll quickly be warned about data quality concerns or the risk of the model “forgetting” what it already knows.
Experimenting with fine-tuning can also feel confusing because the available guidance is often vague. Blog posts, Slack threads, and conference talks surface plenty of opinions, but not always with enough detail to make experiments replicable or results easy to compare. That’s a challenge, since there are so many parameters in play – everything from training method (SFT, DPO, GRPO, etc.) to dataset format and scope, learning rate, batch size, epoch count, and parameterization strategy (full fine-tune, LoRA, prefix tuning).
Lately, though, a handful of research efforts have started to challenge the community narrative. Instead of relegating fine-tuning to style alone, these studies dive deep to explore whether and how models can actually internalize new knowledge through carefully designed updates. Together, they make a strong case that fine-tuning belongs in the knowledge injection toolkit – and they hint at a bigger trajectory, one where models adapt not just through static retraining, but continuously as the world changes.
The Knowledge Update Toolbox
When it comes to updating or extending what a language model “knows”, most practitioners reach for one of three tools: pretraining, prompting, or retrieval. Each has its own tradeoffs.
Pretraining is the gold standard: retrain on a massive dataset, and the knowledge is baked in. It’s how today’s frontier models are built in the first place, and I suspect it’s the primary source of the naïve intuition newcomers bring to fine-tuning. The drawback, of course, is cost. Retraining at that scale is beyond the reach of almost everyone. Controversy around the true cost of training aside, even the claim that DeepSeek-V3 trained for just ~$5 million makes it clear that most teams don’t have the budget for a “cheap” version of this.
Prompting is the quick fix. With the right instructions or examples in context, you can coax the model into acting as if it knows something new. It’s lightweight and cheap, but also ephemeral: once the prompt is gone, the “knowledge” disappears with it. And don’t forget that prompting also takes up precious tokens in the context. At best, more context used means slower (or more expensive) inference; in the worst case, the context window for the model simply isn’t large enough to accommodate all the knowledge.
Retrieval-augmented generation (RAG) is the practical favorite. By attaching a search system that selectively pulls in context, you can reference relevant documents at query time. It’s flexible, doesn’t alter the model itself, and makes it possible to update knowledge continuously. The flip side is brittleness: retrieval adds complexity and can fail in subtle ways – indexes drift, queries misfire, and suddenly the model is confidently wrong. Nor does RAG fully sidestep the context window problem if a query simply requires too much knowledge in context at once.
That leaves fine-tuning, which today often gets pigeonholed as a style or format adjustment tool. The prevailing wisdom says you can use it to make responses more on-brand, more polite, or better aligned to a task, but not to actually expand what the model knows. If you push harder, you’ll quickly run into the warnings: the training data has to be pristine, you risk catastrophic forgetting, and even if you succeed, the approach probably won’t scale. In short, fine-tuning often ends up being seen as the least reliable option.
What the Research Shows
Against this backdrop of skepticism, a growing body of research has started to test the premise more directly: can fine-tuning actually teach a model new facts? Rather than leaning on anecdotes, these studies set up controlled experiments to measure whether models can absorb new information and use it consistently.
One such effort is Injecting New Knowledge into Large Language Models via Supervised Fine-Tuning (Mecklenburg et al., 2024). The authors built training sets that paired specific “new facts” (e.g. changes in named entities, fictional scenarios) with simple question–answer formats, and then fine-tuned models like GPT-4 on those examples. They compared token-based vs fact-based scaling strategies, showing that when training was organized around covering facts systematically, the injected knowledge not only stuck but generalized reliably to new queries. Their conclusion: supervised fine-tuning can integrate fresh factual knowledge if the data is structured with care.
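To make that concrete, here’s a minimal sketch of what fact-oriented training data might look like. The facts, question templates, and chat format below are hypothetical illustrations of the idea, not the exact setup from the paper: each fact is expanded into several Q&A pairs so that coverage is measured in facts, not tokens.

```python
# Minimal sketch: organize SFT data around facts rather than raw tokens.
# The facts, templates, and message format here are invented illustrations.

new_facts = [
    {"subject": "Acme Corp", "relation": "CEO", "object": "Jane Doe"},
    {"subject": "Lake Veyra", "relation": "location", "object": "northern Finland"},
]

question_templates = [
    "What is the {relation} of {subject}?",
    "Tell me the {relation} of {subject}.",
    "{subject}: what is its {relation}?",
]

def build_examples(facts, templates):
    """Expand each fact into several Q&A pairs so every fact is covered
    from multiple phrasings (fact-based scaling, not token-based)."""
    examples = []
    for fact in facts:
        for template in templates:
            examples.append({
                "messages": [
                    {"role": "user", "content": template.format(**fact)},
                    {"role": "assistant", "content": fact["object"]},
                ]
            })
    return examples

dataset = build_examples(new_facts, question_templates)
print(f"{len(new_facts)} facts -> {len(dataset)} training examples")
```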
In From Style to Facts: Mapping the Boundaries of Knowledge Injection with Finetuning (Zhao et al., 2024), the team ran more than 4,000 fine-tuning experiments on Gemini v1.5, systematically varying dataset type (e.g. Q&A vs Wikipedia prose), content type (categorical vs numerical), and evaluation tasks (recall vs reasoning). They found that knowledge injection worked best when facts were expressed in Q&A form and when the content was categorical rather than numerical. Models could recall new information reliably, but struggled more when asked to use those facts in reasoning chains. One key insight: “style” and “knowledge” aren’t separate categories of fine-tuning at all; they sit on a continuum, and success depends on how the data is framed.
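As a rough illustration of what those dataset-type differences look like in practice (every name and fact below is invented, not drawn from the paper), the same piece of new knowledge can be framed as a Q&A pair or as Wikipedia-style prose, and expressed categorically or numerically:

```python
# Illustrative framings of the same new fact; all names here are invented.
# Zhao et al. report that Q&A-style framing and categorical content transfer
# better than prose passages and numerical details.

entity = "the Meridian Bridge"
status = "closed for repairs"

# Q&A framing (tends to transfer well)
qa_example = {
    "prompt": f"What is the current status of {entity}?",
    "completion": f"As of this year, {entity} is {status}.",
}

# Wikipedia-style prose framing (tends to transfer less reliably)
prose_example = {
    "text": f"The Meridian Bridge spans the Arlen River. Following a structural "
            f"inspection this year, it is {status}.",
}

# Categorical vs numerical content: the first kind of fact tends to stick
# better than the second in the paper's experiments.
categorical_fact = "The Meridian Bridge is administered by the regional transit authority."
numerical_fact = "The Meridian Bridge is 1,284.6 meters long."
```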
A third effort, Structure-aware Domain Knowledge Injection for Large Language Models (Liu et al., 2025), introduced a methodology called StructTuning. Inspired by human education, the authors broke domain corpora (like medical textbooks) into taxonomies of chapters, sections, and knowledge points, then trained models in two stages: structure-aware continual pre-training and structure-aware supervised fine-tuning. By tying each training chunk to its place in the knowledge hierarchy, the model learned not just isolated facts but how they fit together. On benchmarks like LongBench and MMedBench, StructTuning outperformed prior domain adaptation methods, boosting knowledge recall by nearly 100% while using just 5% of the training data.
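A toy version of that idea (the taxonomy, corpus text, and prompt format below are made up rather than taken from the paper) is to walk the hierarchy and prepend each chunk’s path, so every training example carries its structural context:

```python
# Sketch of structure-aware training data, loosely inspired by StructTuning:
# each chunk is tied to its place in a domain taxonomy. Corpus is hypothetical.

corpus = {
    "Cardiology": {
        "Arrhythmias": {
            "Atrial fibrillation": "Atrial fibrillation is an irregular, often rapid heart rhythm...",
            "Bradycardia": "Bradycardia is a slower-than-normal heart rate...",
        },
        "Heart failure": {
            "Ejection fraction": "Ejection fraction measures how much blood the left ventricle pumps...",
        },
    }
}

def flatten_with_structure(tree, path=()):
    """Walk the taxonomy and emit (path, text) chunks so each training
    example knows its chapter, section, and knowledge point."""
    for name, node in tree.items():
        if isinstance(node, dict):
            yield from flatten_with_structure(node, path + (name,))
        else:
            yield {"structure": " > ".join(path + (name,)), "text": node}

for chunk in flatten_with_structure(corpus):
    # Prepend the taxonomy path so the model sees where the fact sits.
    training_text = f"[{chunk['structure']}]\n{chunk['text']}"
    print(training_text, end="\n\n")
```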
Taken together, these studies cut against the common narrative. They show that fine-tuning can do more than polish tone: it can reliably inject new knowledge, and with the right structures, it can do so efficiently and at scale.
Why Fine-Tuning Works
Think of pretraining itself. What is it if not a colossal fine-tuning run? Models start from random weights, then gradually learn the statistical associations that encode both language patterns and factual knowledge. Fine-tuning just continues that same process on a smaller, more focused scale. There’s nothing inherent in the mechanism that specifically predisposes it to encoding one type of knowledge (like style) over another (factual statements).
So why the belief that it doesn’t work? There are a couple of places where fine-tuning could plausibly fail in practice. One is if the model’s capacity is already so saturated that there’s no “room” to encode new information without overwriting something it already knows. Another is if the parameter-efficient methods most people use (like LoRA or other adapter-based approaches) aren’t expressive enough to selectively capture new knowledge.
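The second concern is at least something you can dial. As a sketch using the Hugging Face peft library (the model name and values are placeholders, the module names assume a Llama-style architecture, and none of the cited papers prescribe this configuration), the adapter rank and the set of targeted modules are the main knobs controlling how much an adapter can express:

```python
# Sketch of the knobs that control a LoRA adapter's capacity.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("your-base-model")  # placeholder

lora_config = LoraConfig(
    r=64,              # higher rank -> more capacity to store new associations
    lora_alpha=128,    # scaling factor applied to the adapter updates
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],  # cover MLP layers, not just attention
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # sanity-check how many weights can actually move
```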
The research doesn’t provide much evidence that these failure modes are a practical impediment. Mecklenburg et al. show that even large, general-purpose models like GPT-4 can reliably integrate new facts through supervised fine-tuning, provided the data is organized well. Zhao et al. demonstrate that whether knowledge “sticks” depends less on some hard limit of the model and more on how the training data is framed (Q&A formats and categorical facts, for example, transfer especially well). And Liu et al. go further, showing that structure-aware training methods not only succeed at injecting domain knowledge but can do so more efficiently than brute-force pretraining, even outperforming existing domain-specific baselines with just a fraction of the data.
These studies demonstrate that fine-tuning can succeed at teaching models new facts, and they highlight the design choices that matter most – spanning everything from data format to parameterization strategy.
Practical Lessons for Builders
The recent research doesn’t just show that fine-tuning can inject knowledge; it also surfaces patterns that make it more effective in practice. A few clearly stand out.
1. Data format matters
Zhao et al. found that Q&A-style data supported better knowledge transfer than long-form passages, and that categorical facts tended to stick better than numerical details. Mecklenburg et al. showed that organizing training around facts rather than raw tokens led to more consistent results. The shared lesson: how information is framed has a direct impact on whether it “takes”.
2. Structure pays off
Liu et al.’s StructTuning approach demonstrated that reorganizing domain data into a taxonomy – chapters, sections, knowledge points – yielded major gains. By tying updates to the structure of the domain, they reached state-of-the-art recall with a fraction of the training data used in other methods. Domains with clear hierarchies, like medicine, law, or engineering, benefit especially from this style of organization.
3. Scale and scope should be deliberate
Simply throwing more tokens at the problem doesn’t guarantee success. Mecklenburg et al. highlight that coverage of distinct facts is what matters most. In practice, this means curating training sets for breadth across the knowledge you want the model to retain, instead of relying on raw volume alone.
4. Hyper-parameters are decisive
Across all three studies, results depended heavily on details: learning rate, batch size, epoch count, adapter size, quantization. Fine-tuning leaves little margin for error. Small changes in setup can shift outcomes from clean knowledge injection to unintended side effects.
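As a hedged starting point rather than a recipe from any of these papers, a run configuration might look like the sketch below; the values are illustrative defaults, and you should expect to sweep them for your own data.

```python
# Illustrative hyper-parameters for a knowledge-injection fine-tune.
# These values are assumptions, not prescriptions from the cited studies.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./knowledge-injection-run",
    learning_rate=1e-4,              # often the single most sensitive knob
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,   # effective batch size of 16 per device
    num_train_epochs=3,              # more epochs can boost recall but risks forgetting
    lr_scheduler_type="cosine",
    warmup_ratio=0.03,
    logging_steps=10,
    bf16=True,                       # precision/quantization choices also matter
)
```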
5. Know where fine-tuning fits
Fine-tuning is well-suited to stable, central knowledge that needs to be recalled quickly and reliably. For fast-changing or peripheral information, retrieval or other external methods may be a better fit. The two approaches complement each other, and the most durable systems are likely to combine them.
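One way to picture that complementarity (everything named here – the model handle, the search index, the volatility heuristic – is a hypothetical placeholder, not a library API): serve stable domain knowledge straight from the fine-tuned weights, and pull in retrieved context only for questions that touch fast-moving facts.

```python
# Rough sketch of fine-tuning and retrieval working together.
# `finetuned_llm` and `search_index` are hypothetical objects.

VOLATILE_TOPICS = {"pricing", "inventory", "release schedule", "on-call rotation"}

def answer(question: str, finetuned_llm, search_index) -> str:
    """Stable domain knowledge comes straight from the fine-tuned weights;
    fast-changing facts get fresh context pulled in at query time."""
    needs_retrieval = any(topic in question.lower() for topic in VOLATILE_TOPICS)
    if needs_retrieval:
        documents = search_index.lookup(question, top_k=3)
        context = "\n".join(doc.text for doc in documents)
        prompt = f"Context:\n{context}\n\nQuestion: {question}"
    else:
        prompt = question
    return finetuned_llm.generate(prompt)
```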
For builders, the message is straightforward: fine-tuning is a viable way to put new knowledge into a model, provided the setup is done with care. For researchers, the next steps involve scaling laws, long-term retention, and standardizing evaluation. And for the broader community, these findings point to a shift: fine-tuning deserves to be treated as a serious option for knowledge injection, not relegated to surface-level adjustments.
Fine-Tuning and the Road to Adaptive Intelligence
The community has long treated fine-tuning as a tool for polishing outputs rather than shaping what a model knows. The evidence now points in a different direction. Careful studies show that models can internalize new knowledge through fine-tuning, and that the details of data design, structure, and training setup determine how well it works.
For practitioners, this means fine-tuning should be considered alongside pretraining, retrieval, and prompting as a viable way to update a model. For researchers, it opens up new lines of inquiry around scaling laws, retention, and evaluation. And for the broader community, it marks a shift in how we think about building and maintaining LLMs: fine-tuning is more than cosmetic, it’s a practical path for knowledge injection.
Looking ahead, this is part of a larger trajectory. In our piece on real-time, single-pass systems, we explored how models will eventually need to adapt continuously, updating themselves at inference time without retraining cycles. Fine-tuning as knowledge injection doesn’t get us all the way there, but it moves us closer: it shows that the weights of a model can be updated deliberately, safely, and with precision. Today, that means injecting stable knowledge into the model. Tomorrow, it points toward systems that learn continuously—agents that don’t just perform tasks but evolve alongside the environments they inhabit.