In a previous post, I looked at the differences between reasoning models and more traditional LLMs; to read more on that topic, see the post here.
But in the process of investigating that, I was led to a new set of questions about model training and benchmarking, specifically how these models are created and trained. The natural follow-up question is: “How do I make a model / AI solution work for my specific use-case?” At this point in time, you’ll see a million blog posts saying “Using RAG is cheaper than training a model,” and I’m not questioning the validity of that statement. But my question is, “What is entailed in training these models?”
So when I was investigating the reasoning models, a few key terms kept coming up. Specifically, the following:
- Pre-training
- Fine Tuning
And ultimately, this brings up a lot of questions and concepts around neural networks and the training techniques built on top of them.
What is a neural network, and how does it relate to AI and Machine Learning?
When we look at these terms, they are largely treated as interchangeable buzzwords. As I did my research into this space, I found that while there is crossover between them, they do have different meanings. AI is the umbrella term that encompasses the process of building solutions that can take more discerning actions based upon changes in their environment.
What is AI?
What I mean by that is that, traditionally, developers are encouraged to write “defensive code,” which is code designed to be resilient to changes in its environment. The problem is that you can only plan for so many contingencies. AI solutions can be designed with intent in mind: we are building solutions that can observe their current environment and be trained / armed with everything they need to succeed, even when that environment makes unexpected changes.
What is Machine Learning?
Given that, Machine Learning is the process of building that algorithm / model by training it on previous data. This is the “how the sausage is made” part, and it outlines the steps required to build out these solutions. Much like any software solution, there are lots of ways to train an AI model, and those options all come with pros and cons depending on the kind of outcomes you are looking for.
What is a neural network?
This is a term you will see a lot, and it is hardly new. I can remember learning about neural networks and expert systems in master’s classes back in 2005, and they were being used in industry back then. The question, though, is: what is a neural network? The truth is that the concept of a neural network starts with how the human brain works. A neural network is a collection of nodes assembled into layers, and each layer supports decision making based on how the model was trained. At their core, neural networks use pattern recognition to train the network to follow specific paths.
As I look at this, the next question becomes, “What are those layers?” What I found is the following:
- Input Layer – To no one’s surprise, these nodes receive the input data and pass it on to the hidden layers.
- Hidden Layers – These layers make up the middle of the network and do the actual processing of the data. Each node applies weights and biases to determine its output.
- Output Layer – This layer takes the results of the applied weights and biases and makes a prediction based on what the network was trained on.
The reason the middle layers are called “hidden layers” is ultimately that these layers are shaped by the training process to produce an output based on pattern recognition. A minimal sketch of how data flows through these layers is shown below.
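Here’s that sketch in Python, with made-up layer sizes and randomly initialized weights and biases, just to show the shape of the computation:

```python
import numpy as np

def relu(x):
    # A common activation function: negative values become 0.
    return np.maximum(0, x)

# Illustrative sizes: 3 inputs -> 4 hidden nodes -> 2 outputs.
rng = np.random.default_rng(0)
W_hidden = rng.normal(size=(3, 4))   # weights: input -> hidden
b_hidden = np.zeros(4)               # biases for the hidden layer
W_output = rng.normal(size=(4, 2))   # weights: hidden -> output
b_output = np.zeros(2)               # biases for the output layer

x = np.array([0.5, -1.2, 3.0])       # input layer: the raw input data

hidden = relu(x @ W_hidden + b_hidden)   # hidden layer: weights + biases + activation
output = hidden @ W_output + b_output    # output layer: produces the prediction
print(output)
```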
How does this relate to LLMs?
When we look at AI and when it really started, the process and the idea of artificial neural networks really hasn’t changed. Ultimately, the models we interact with (like GPT-4o, Claude Opus, Llama, etc.) are neural networks, just like the earlier models. The primary difference is the size and scope of the training.
As was discussed before, LLMs do not truly understand knowledge; rather, they predict responses based on the data and parameters they were trained on. As I researched this, I found it fair to say that LLMs were created by taking the neural network concepts and machine learning processes and applying significantly more compute (like GPUs). This enabled significantly more complex neural networks.
With that being said, the question comes back to “how are these models trained?” The key element identified as part of this process is the generation of those weights and biases.
What are weights and biases?
As I said above, the strength of neural networks is the ability of nodes to leverage learned weights and biases to drive the predictions the network makes.
What are weights?
The technical definition is that “weights are numerical values assigned to the connections between nodes, and represent the influence a connection has on the outcome of the output layer.” That is an interesting definition, but what does it actually mean?
The answer is that these weights are just math: they represent the influence that guides a path through the neural network toward a prediction. Weights are generated from the input data used during the training process.
What are biases?
The other numeric values that guide a neural network are biases. Biases are values that shift a node’s output regardless of the specific input it receives, and they are also part of what helps guide paths through the network.
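As a tiny worked example (numbers entirely made up), a single node combines its inputs with its weights and then adds its bias:

```python
# One node with two inputs, illustrative values only.
x1, x2 = 0.8, 0.3        # inputs arriving from the previous layer
w1, w2 = 0.5, -1.0       # weights: how much each input influences this node
b = 0.25                 # bias: shifts the result regardless of the inputs

z = (w1 * x1) + (w2 * x2) + b   # weighted sum plus bias
print(z)                        # 0.4 - 0.3 + 0.25 = 0.35
```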
What is an Activation Function and back propagation?
The answer comes down to something called an “activation function,” which is part of each node within the neural network. The activation function is the math inside the node itself: it takes the inputs, applies the weights and biases, and determines what the node passes to the next step in the network. Training the network then relies on a concept called “backpropagation,” which works backward from the error at the output layer to adjust the weights and biases. This gets into some very deep math, admittedly a good chunk of it being over my head.
But the key principle is that this enables the definition of paths that ultimately lead to predictions at the output layer. Specifically, activation functions are what allow a neural network to follow non-linear paths, rather than being limited to straight-line math.
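Here’s a minimal sketch of that training loop in PyTorch, using a toy model and made-up data; the point is just to show the forward pass, the loss, and the backpropagation step that adjusts the weights and biases:

```python
import torch
import torch.nn as nn

# A tiny network: 3 inputs -> 4 hidden nodes (ReLU activation) -> 1 output.
model = nn.Sequential(nn.Linear(3, 4), nn.ReLU(), nn.Linear(4, 1))
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.MSELoss()

x = torch.randn(8, 3)   # made-up input batch
y = torch.randn(8, 1)   # made-up target values

for step in range(100):
    prediction = model(x)            # forward pass through the layers
    loss = loss_fn(prediction, y)    # how wrong was the prediction?
    optimizer.zero_grad()
    loss.backward()                  # backpropagation: compute gradients
    optimizer.step()                 # nudge the weights and biases
```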
What is Pre-Training vs Fine Tuning?
So one of the big questions that led to this investigation was: what is the difference between pre-training and fine-tuning when it comes to AI models? The key difference here is when in the process the training is done.
What is Pre-training?
If you look up the technical definition of pre-training, you will find that it involves taking a model / neural network and training it on a large corpus of text data in an unsupervised manner. More on that “unsupervised” term in a bit. But the basic idea here is that you are taking a model and training it on a large volume of data, which means you are going to need two things:
- Need access to a large volume of prepared data for training.
- Need access to the GPUs to perform the training.
Now the key benefit of this kind of training is that you are building a baseline for your model: giving it the opportunity to learn language, grammar, and structure, and building the embeddings and defining the vectors from the beginning. Honestly, if you’ve decided to build your own LLM from scratch, this is the most critical step of the process.
Now, the age-old principle of “garbage in, garbage out” is especially true here. You can’t realistically do this kind of training without a significant amount of pre-processing, setting up the data to generate the best results.
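To give a feel for what the core of this looks like, here’s a heavily simplified, illustrative sketch of the next-token objective that drives pre-training. The vocabulary, model, and “corpus” below are placeholders, nothing like a real LLM setup:

```python
import torch
import torch.nn as nn

vocab_size = 1000                                  # placeholder vocabulary size
corpus = torch.randint(0, vocab_size, (10_000,))   # stand-in for a large tokenized text corpus

# A deliberately tiny "language model": embed a token, predict the next one.
model = nn.Sequential(nn.Embedding(vocab_size, 64), nn.Linear(64, vocab_size))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for step in range(1000):
    i = torch.randint(0, len(corpus) - 1, (32,))   # sample a batch of positions
    inputs, targets = corpus[i], corpus[i + 1]     # unsupervised: the next token IS the label
    logits = model(inputs)
    loss = loss_fn(logits, targets)                # how badly did we predict the next token?
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```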
What is fine-tuning?
Another term for this is “post-training,” but really the principle is fine-tuning a model. The idea is that you take an existing pre-trained model and feed it a smaller, domain-specific dataset to adjust its behavior. Fine-tuning is all the rage right now. The idea is, I’ll take that Llama or gpt-4o model, stand on the “shoulders of giants,” and use a much smaller (relative to the original) dataset to train the model with a domain-specific focus. The benefit is that you’re taking a known good baseline and refining it to behave the way you want it to.
You are still going to need those same two items:
- Need access to a large volume of prepared data for training.
- Need access to the GPUs to perform the training.
But the amount of data is significantly smaller, and the amount of compute / GPU power should be smaller as well. Really, you’re taking a general AI model and “fine tuning it” to be domain specific.
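As a conceptual sketch (reusing the toy setup from the pre-training example, not a real LLM), fine-tuning is essentially the same training loop, started from the pre-trained weights and run over a much smaller, domain-specific dataset:

```python
import torch
import torch.nn as nn

vocab_size = 1000
model = nn.Sequential(nn.Embedding(vocab_size, 64), nn.Linear(64, vocab_size))
# Assume `model` starts from the weights produced by pre-training; in practice
# you would load a checkpoint here, e.g. torch.load(...) with a real (hypothetical) path.

domain_corpus = torch.randint(0, vocab_size, (500,))  # stand-in for a small domain-specific dataset

# A lower learning rate is typical: we want to nudge the model, not start over.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)
loss_fn = nn.CrossEntropyLoss()

for step in range(200):                               # far fewer steps than pre-training
    i = torch.randint(0, len(domain_corpus) - 1, (16,))
    logits = model(domain_corpus[i])
    loss = loss_fn(logits, domain_corpus[i + 1])
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```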
Why not just use RAG?
The logical question here is, “Why not just use RAG?” As we talked about before, I can build out Retrieval-Augmented Generation architectures to feed the model domain-specific embeddings, and then I don’t need GPUs. My counterpoint is that fine-tuning and pre-training serve a very different end than feeding the model recent data (which is what RAG is designed to do).
RAG at its core is about solving the problem of “my data is changing all the time, and I don’t want to retrain 5 times a day.” And RAG is a fantastic option for that kind of problem.
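For illustration, here’s a bare-bones sketch of the retrieval half of RAG: embed the documents, embed the query, and pull the closest match by cosine similarity. The embed function below is a stand-in for a real embedding model:

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    # Placeholder only: a real system would call an embedding model here.
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.normal(size=128)

documents = ["refund policy", "shipping times", "warranty terms"]
doc_vectors = [embed(d) for d in documents]

def retrieve(query: str) -> str:
    q = embed(query)
    # Cosine similarity: how aligned is the query with each document?
    scores = [np.dot(q, v) / (np.linalg.norm(q) * np.linalg.norm(v)) for v in doc_vectors]
    return documents[int(np.argmax(scores))]

# The retrieved text is then stuffed into the prompt sent to the (unchanged) model.
context = retrieve("how long does delivery take?")
```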
But that isn’t a perfect solution all the time. While RAG provides references in embedded form for the model to pull from, fine-tuning fundamentally changes the model based on domain-specific data. So fine-tuning is better suited for the following scenarios:
- When external data source access is impossible: when you are dealing with AI at the edge, or in disconnected environments that prevent access to the RAG data source.
- When you want to change how the model behaves, as with agentic solutions.
- When you want to improve performance on specific underlying tasks.
What is Supervised vs Unsupervised learning?
These are two different types of training that can be used in data science, and ultimately when it comes to LLMs, they are applied differently.
What is unsupervised learning?
At its core, unsupervised learning is the process of feeding a model vast amounts of unlabeled data and allowing the neural network to build and train without specific direction from the data scientist. The idea here is that we are allowing the model to build the connections, define the vectors, and figure out things like cosine similarity without a human in the loop. This is great for pre-training but does require a larger dataset.
There is still preprocessing required, but the demand is significantly lower.
This type of learning is most often used during pre-training of models, and the output is a trained model with weights and biases built during processing of the dataset.
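As a small illustration of the unsupervised idea (using clustering rather than language modeling, since it fits in a few lines): notice the data below carries no labels at all, and the algorithm finds the groups on its own:

```python
import numpy as np
from sklearn.cluster import KMeans

# Unlabeled data: just points, no answers attached.
rng = np.random.default_rng(42)
data = np.vstack([
    rng.normal(loc=0.0, scale=0.5, size=(50, 2)),   # one hidden group
    rng.normal(loc=5.0, scale=0.5, size=(50, 2)),   # another hidden group
])

# The model discovers the two groups without being told what they are.
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(data)
print(clusters[:5], clusters[-5:])  # points from each group land in different clusters
```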
What is supervised learning?
Supervised learning is the opposite: it uses labeled data to refine the model. Instead of focusing on sheer volume of data, we are more interested in taking a smaller, more refined dataset and using it to “tune the model to specific tasks.” This process involves taking data that is labeled and set up to guide the model toward behaving differently than originally intended.
The main consideration here is that you need a dataset and that it is labeled, and depending on the size and complexity of the data, this could require significant effort to generate. Additionally, the same rule applies as with any training: the more good data you can use, the better the results of the fine-tuning. A sketch of what labeled data can look like follows below.
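To make “labeled” concrete for the LLM case, here’s what a tiny supervised fine-tuning dataset might look like as JSONL; the field names and examples are illustrative, since real fine-tuning pipelines each define their own format:

```python
import json

# Each record pairs an input (prompt) with the labeled answer we want the model to learn.
examples = [
    {"prompt": "Customer asks: Where is my order?", "completion": "Route to: shipping-support"},
    {"prompt": "Customer asks: I want my money back.", "completion": "Route to: refunds"},
    {"prompt": "Customer asks: My device won't turn on.", "completion": "Route to: technical-support"},
]

with open("train.jsonl", "w") as f:
    for example in examples:
        f.write(json.dumps(example) + "\n")   # one labeled example per line
```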
What is reinforcement learning?
Reinforcement learning is a common method of fine-tuning that is leveraged to train agents, and it is probably the type of learning that makes the most sense to us as human beings. Ever trained a dog? Same principle. In this scenario you put an agent into an environment, run it through a situation, and then evaluate the outcome, providing a reward or a punishment.
Now my first question here was: how do you “reward” an AI? Realistically, I thought there would be some complex answer to this question, but really the answer seems to be a numeric result that is favorable. I know, sort of anti-climactic.
```mermaid
flowchart TD
    A[Agent] -->|takes action based on policy| B[Environment]
    B -->|Review action| C[Evaluator/Grader]
    C -->|Provides reward or penalty| A
```
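Here’s a stripped-down sketch of that loop in Python; the actions, grader, and reward values are all made up, but they show the shape of the agent / evaluator cycle:

```python
import random

actions = ["left", "right"]
q_values = {a: 0.0 for a in actions}   # the agent's running estimate of each action's value

def grade(action: str) -> float:
    # Hypothetical evaluator: "right" happens to be the better action here.
    return 1.0 if action == "right" else -1.0

learning_rate = 0.1
for episode in range(100):
    # Explore sometimes; otherwise take the best-known action (the policy).
    if random.random() < 0.2:
        action = random.choice(actions)
    else:
        action = max(q_values, key=q_values.get)
    reward = grade(action)                                   # the reward / penalty signal
    q_values[action] += learning_rate * (reward - q_values[action])

print(q_values)   # the agent learns to prefer "right"
```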
The main consideration here is that, depending on the evaluation framework, the model can learn to optimize for the grading process itself. That is a huge topic and likely something I will cover in a future blog post on its own.
What is Transfer Learning / Distillation?
As I examined learning techniques for LLMs, one of the terms that kept coming up was “transfer learning,” and I wanted to find out more about how it applies. The answer was a little surprising: transfer learning is the process of using one model to train another, and it is a technique for fine-tuning. Specifically, you take a model, use prompts and data to perform actions, and then use the outputs from that model to train a smaller model. Where this type of training really shines is in using a larger LLM to train a smaller model. If you think about it, we can use this when trying to build very domain-specific models that are designed to run on more cost-efficient hardware.
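Here’s a minimal sketch of the distillation flavor of this, with toy stand-ins for both models: the student is trained to match the teacher’s output distribution, which is the core of the technique. Sizes and data are placeholders:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size = 1000
# Stand-ins: a larger "teacher" and a smaller "student" over the same vocabulary.
teacher = nn.Sequential(nn.Embedding(vocab_size, 256), nn.Linear(256, vocab_size))
student = nn.Sequential(nn.Embedding(vocab_size, 32), nn.Linear(32, vocab_size))

optimizer = torch.optim.Adam(student.parameters(), lr=1e-3)
tokens = torch.randint(0, vocab_size, (64,))   # made-up input tokens

for step in range(100):
    with torch.no_grad():
        teacher_probs = F.softmax(teacher(tokens), dim=-1)     # the teacher's "answers"
    student_log_probs = F.log_softmax(student(tokens), dim=-1)
    # KL divergence pushes the student's predictions toward the teacher's.
    loss = F.kl_div(student_log_probs, teacher_probs, reduction="batchmean")
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```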
The benefits of this approach are that it can be very cost efficient, and it requires less effort to generate training datasets compared with more traditional supervised learning options. It can also be great for multi-language scenarios or for driving fine-tuning across different platforms, as it abstracts away the platform-specific elements and can be facilitated with APIs between the two models.
The drawback, of course, is that it is really only effective with a domain / task-specific focus. There is a great write-up on transfer learning and distillation here.
General Thoughts…
After spending some time reviewing first how neural networks work, and then the learning techniques involved in training AI models, a few observations are immediately evident to me. First, this investigation really helped me to understand the process of taking existing models and building domain-specific agents. The space around fine-tuning models continues to evolve heavily, with a lot of new innovation happening as many brilliant people continue to identify new ways to accomplish these types of learning in cost-effective ways.
And I really believe that when it comes to building agentic solutions, understanding these types of learning is going to become more and more important. When we look at scenarios of running agentic AI solutions at the edge, where you don’t have the benefit of access to existing RAG-based architectures, it will be important to ensure that these general models are fine-tuned to the specific tasks they are designed to accomplish. The next area I’m encouraged to find out more about is how we grade and monitor the performance of these models.
One of the big areas that continues to fascinate me is how you measure the effectiveness of fine-tuning to ensure the process actually delivers its benefits. Agentic solutions in the future can only be as good as the fine-tuning techniques used to improve their performance, and as the ability of the AI engineers to manage the impacts of that training on AI behavior.
While investigating this topic, it amazes me how neural networks, which have been a staple of AI for 30 years, continue to be the primary foundation of the new LLMs being built and optimized today. There’s an old adage about taking “20 years to become an overnight success.”