  • How Do You Train AI When You Don’t Have Enough Data?

    Data augmentation is the technique of artificially expanding an AI’s training dataset by creating modified versions of existing examples, helping models learn more robust patterns from less data and resist the temptation to memorize specific examples.

    Hey Common Folks!

    Last edition, we saw how fine-tuning can specialize a generalist model with just a few hundred examples. But that still assumes you have a few hundred. What if you don’t? What if you have 50 X-rays of a rare disease, 200 photos of a specific defect on your factory line, or only a few thousand sentences of a language that barely has any written text?

    Today’s edition is about one of the most elegant ideas in the field: the technique that lets AI learn more from less. It’s a big part of why medical AI can work with 500 X-rays instead of 500,000, why your phone’s face recognition works in unusual lighting, and why self-driving cars can handle road conditions they’ve never literally seen before.


    The Music Student Analogy

    Imagine a music student learning to play a piece on the piano.

    The teacher could have them play it exactly as written, at the same tempo, every day until they have it memorized. They’d get very good at that exact performance. But show them the same piece in a different key, or ask them to play it slightly faster, or have them perform in a different room with different acoustics, and they might struggle. They memorized the piece — not the music.

    A better teacher has the student play the same piece at different tempos. In different keys. With different dynamics: louder, softer, more staccato. Forwards. Sometimes just the left hand, sometimes just the right. From memory, from the sheet, from hearing it.

    Same piece. Infinite variations. The result: the student understands the music deeply enough to adapt to any performance context.

    Data augmentation is this second approach, applied to AI training.


    What Data Augmentation Actually Does

    Instead of training on the same set of examples repeatedly, data augmentation creates new variations of those examples, modified versions that show the model the same underlying concept from different angles.

    For an image classifier training to detect cats, instead of showing the model the same cat photo over and over, you:

    • Flip it horizontally (cats look like cats whether facing left or right)

    • Rotate it slightly (a cat tilted 15 degrees is still a cat)

    • Crop it to different portions (a cat partially out of frame is still a cat)

    • Adjust the brightness up or down (cats in dim rooms are still cats)

    • Add a small amount of random noise (slight graininess doesn’t change what it is)

    • Adjust the color slightly (the model shouldn’t rely on exact color values)

    Each of these modified versions is treated as a new training example. And because the augmentations are typically applied fresh and randomly every time the model studies the data, the model effectively never sees the exact same example twice. Your original 1,000 photos become an essentially endless stream of variations, all from the same source material.

    The model sees so many variations of each example that it becomes very difficult to memorize any specific one. Instead, it has to learn what’s actually essential: the shapes, structures, and patterns that make something a cat under any conditions.
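
    To make the idea above concrete, here is a minimal sketch in plain Python. It assumes a tiny grayscale "photo" stored as rows of numbers from 0 to 255; the function name augment, the flip probability, and the brightness and noise ranges are all made up for illustration. Real projects would normally use an image library, but the principle is the same: a fresh random variation every time.

```python
import random

def augment(image, rng):
    """Return a randomly modified copy of a grayscale image (rows of 0-255 ints)."""
    out = [row[:] for row in image]          # copy so the original stays untouched
    if rng.random() < 0.5:                   # horizontal flip about half the time
        out = [row[::-1] for row in out]
    shift = rng.randint(-20, 20)             # one brightness offset for the whole image
    # Add the brightness shift plus a little per-pixel noise, clamped to 0-255.
    return [
        [min(255, max(0, p + shift + rng.randint(-5, 5))) for p in row]
        for row in out
    ]

rng = random.Random(0)                       # fixed seed so the run is repeatable
photo = [[10, 50, 90], [20, 60, 100], [30, 70, 110]]   # illustrative 3x3 "photo"

# Each call produces a new variation of the same source image.
variants = [augment(photo, rng) for _ in range(4)]
for v in variants:
    print(v)
```

    Calling augment inside the training loop, rather than once up front, is what gives the "never the same example twice" effect.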


    Why This Fights Overfitting

    We covered overfitting earlier — when a model memorizes its training data instead of learning generalizable patterns.

    Data augmentation directly attacks this problem. If every time the model sees a training example it looks slightly different, the model can’t memorize the specific pixel values. It’s forced to learn deeper patterns that hold across all the variations.

    It’s like the difference between studying for an exam by memorizing sample questions versus studying by deeply understanding the underlying concepts. The model that memorizes exact training images will fail on slightly different real-world images. The model that’s learned through augmented variations has encountered so much variety that new real-world images feel familiar.


    Augmentation Beyond Images

    Data augmentation was pioneered in computer vision, but it’s now applied across every type of data.

    Text augmentation: For natural language processing (NLP) tasks, you might replace words with their synonyms, randomly swap sentence order, or translate text to another language and back. If a model learns to classify “customer is upset” as negative sentiment, it should also classify “client is dissatisfied” the same way, and augmentation teaches it to.
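
    Synonym replacement, the first text technique mentioned above, can be sketched in a few lines. The synonym table here is hand-written and hypothetical; a real system would draw on a thesaurus or a language model. The function name synonym_augment and the 50% swap probability are also just assumptions for the example.

```python
import random

# Hypothetical, hand-written synonym table for illustration only.
SYNONYMS = {
    "customer": ["client"],
    "upset": ["dissatisfied", "unhappy"],
}

def synonym_augment(sentence, rng, prob=0.5):
    """Randomly swap known words for a synonym; the label stays the same."""
    out = []
    for word in sentence.split():
        options = SYNONYMS.get(word.lower())
        if options and rng.random() < prob:
            out.append(rng.choice(options))
        else:
            out.append(word)
    return " ".join(out)

rng = random.Random(1)
original = "customer is upset"               # label: negative sentiment
augmented = [synonym_augment(original, rng) for _ in range(5)]
for sentence in augmented:
    print(sentence, "-> negative")
```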

    Audio augmentation: For speech recognition models, you augment recordings by adding background noise, changing the pitch slightly, adjusting playback speed, or simulating different room acoustics. A voice assistant that only trains on clean recordings in a quiet room will struggle in a noisy kitchen. Augmenting with realistic noise conditions helps close that gap.

    Time-series augmentation: For financial or sensor data, you apply small shifts in timing, add realistic noise to readings, or slightly scale numerical values, teaching the model to recognize patterns regardless of minor variations in measurement conditions.
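
    The three time-series tricks above (shift the timing, add noise, scale the values) fit in one small function. This is a sketch under assumptions: the name augment_series and the default noise, scale, and shift ranges are invented for illustration, and the right ranges depend entirely on your sensor or market data.

```python
import random

def augment_series(readings, rng, noise=0.05, max_scale=0.1, max_shift=2):
    """Return a shifted, scaled, slightly noisy copy of a list of readings."""
    scale = 1.0 + rng.uniform(-max_scale, max_scale)   # e.g. up to +/-10%
    shift = rng.randint(-max_shift, max_shift)         # move a few steps in time
    n = len(readings)
    # Shift in time; positions that fall off the edge reuse the edge value.
    shifted = [readings[min(n - 1, max(0, i - shift))] for i in range(n)]
    # Scale every value and add a little Gaussian noise to each reading.
    return [x * scale + rng.gauss(0.0, noise) for x in shifted]

rng = random.Random(42)
sensor = [1.0, 1.2, 1.5, 2.0, 1.6, 1.3, 1.1]           # illustrative readings
for _ in range(3):
    print([round(x, 2) for x in augment_series(sensor, rng)])
```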

    And the frontier keeps moving. The newest forms of augmentation don’t just modify existing examples — they use AI itself to generate entirely new training data, mixing pieces of one example with another, or producing fully synthetic examples that never existed in the real world. The line between “modifying” data and “creating” it is blurring fast.


    The Practical Impact

    The difference data augmentation makes is most dramatic when training data is scarce, which is most of the time in specialized domains.

    Medical imaging: In a landmark 2017 study from Stanford, researchers trained a skin cancer detection model without millions of labeled dermatology images. They had roughly 130,000. Data augmentation (randomly rotating, cropping, and flipping each image) expanded their effective training set dramatically. The resulting model performed on par with the board-certified dermatologists it was tested against.

    Rare disease diagnosis: For conditions where only hundreds of confirmed cases exist in the medical literature, augmentation allows AI models to train on those cases from many angles rather than overfitting to the specific examples available.

    Self-driving vehicles: Companies like Waymo expand their training data with simulated weather conditions, lighting changes, unusual road configurations, and rare events (a bicycle falling in front of the car, a pedestrian in an unexpected location). This blends classical augmentation of real footage with simulation of scenarios that never actually happened. So when the car meets a real-world situation it has never literally seen, it has already trained on a simulated version of it.

    Low-resource languages: For languages with limited text data for NLP training, augmentation techniques that paraphrase, restructure, or back-translate (translating to another language and then back to the original) can expand training sets dramatically, making language tools available for communities that would otherwise be excluded.


    The Limits

    Data augmentation is a tool, not a miracle.

    The augmentations you apply have to be meaningful. Flipping a cat photo horizontally makes sense: cats are symmetric in important ways. Flipping a chest X-ray, on the other hand, can introduce misleading artifacts: the human body isn’t symmetric on the inside. The heart usually sits on the left, the liver on the right, the stomach on the left. A horizontally flipped X-ray quietly creates a patient with mirror-image anatomy, something that is extremely rare in real life.

    Augmentation can’t substitute for genuine data diversity. If all your cat photos are of domestic shorthairs, augmenting them won’t teach the model to recognize Maine Coons. The variations you generate are bounded by the original examples.

    And too much augmentation can sometimes hurt. If augmentations are so extreme they change the character of what the example represents, distorting a photo until it no longer looks like a cat, you’re training on misleading examples.

    The art is in choosing augmentations that reflect the real-world variation the model will encounter, without introducing variations that misrepresent the underlying concepts.


    The Takeaway

    Data augmentation is how AI learns to see the world through all its variations, not just the specific examples it was shown.

    By creating modified versions of training examples (flipped, rotated, brightened, noised, paraphrased), you give the model a broader view of the underlying patterns. It becomes harder for the model to memorize specific examples and easier for it to generalize to new ones.

    It’s one of the simplest ideas in the field with one of the most reliable returns. More training variety, for free, from the data you already have.


    Coming Up

    Data augmentation gives the model more variety in what it learns from. But there’s another question hiding right next to it: how should the model actually study all that data? Should it adjust itself after every single example, or wait until it’s read everything, or somewhere in between?

    Turns out the answer is a careful middle path called batch learning (more precisely, mini-batch learning), and it shapes how every modern AI model gets trained. Next edition: why AI doesn’t study the whole library before forming an opinion, but also doesn’t change its mind after every single page.

    Was this helpful? Does the music student analogy capture how augmentation works for you? Reply and let us know.

    AI for Common Folks — Making AI understandable, one concept at a time.


  • How Do You Teach a General AI to Do One Specific Job?

    Fine-tuning is the process of taking a pre-trained AI model that already has general knowledge and training it further on a smaller, specialized dataset to make it an expert at one specific task — like sending a brilliant generalist back to school for a residency.

    Hey Common Folks!

    The last few editions have all been about how an AI gets its broad education. We covered pre-training — the heavy lifting where a model reads the internet and learns how the world is described in text. Then we covered the train, validate, test split — the honest classroom that tells us whether the model actually learned anything. Both phases describe how a generalist AI gets built.

    But here’s the catch: a pre-trained model is exactly that — a generalist. It knows a little bit about everything, but it isn’t an expert in your specific business. It knows what a “receipt” is, but it doesn’t know how your company processes expenses.

    To turn the generalist into a specialist, we use a process called fine-tuning.

    What is Fine-Tuning?

    It’s the machine learning equivalent of “don’t reinvent the wheel.” Instead of building a new brain from scratch, you take an existing brain that already understands the world and teach it one specific skill on top.

    The pre-trained model is the wheel. Fine-tuning is the car you build around it.

    The Analogy: The Medical Student vs. The Cardiologist

    In the pre-training article we left the medical student standing at the end of medical school — a knowledgeable generalist who has read the textbooks but hasn’t specialized. Now we pick up where that left off.

    1. Pre-training (Medical School): The student spends years learning general anatomy, biology, and chemistry. They know how the human body works in general. They are smart, but you wouldn’t want them performing heart surgery on you yet. This is your Foundation Model — GPT, Claude, Gemini, before any specialization.

    2. Fine-tuning (Residency): Now that general doctor goes through specialized training to become a cardiologist. They stop reading general biology textbooks and focus entirely on the heart. Crucially, they use their previous general knowledge to pick up the specialty much faster than someone starting from zero.

    Fine-tuning is what turns the general doctor into the cardiologist. It takes the broad capabilities the model already has and focuses them on a single, specific job.

    Why Do We Need It? The “Generalist” Problem

    You might ask: if the model has read the whole internet, isn’t it already smart enough?

    Not exactly. There are two gaps a generalist can’t close on its own:

    • The Data Gap: The model might know what a dog is, but if you want it to distinguish between a “phone” and a “tablet” the way your product catalog defines them (categories that weren’t emphasized in its original training), it will struggle.

    • The Tone Gap: A generic model sounds like a generic robot. If you want it to sound like your customer support agent (polite, empathetic, using your company’s specific lingo), you need to teach it that specific style.

    Fine-tuning is the bridge between general knowledge and specific application.

    How Does It Work?

    When we fine-tune, we don’t want the model to forget everything it learned during pre-training. We don’t want the doctor to forget how to read blood pressure while learning heart surgery. There are two common ways to pull this off, both versions of a broader idea called transfer learning.

    The lightweight approach: freezing. We lock the parts of the model that understand foundational things (grammar, what an “edge” looks like in an image) and only let it update its “last layers,” the parts responsible for making the final decision on your specific task. Show it a few hundred examples of your company’s legal contracts and, because it already understands English from pre-training, it picks up the pattern of your contracts very quickly.

    The heavier approach: full fine-tuning. We unfreeze most of the model and let it update widely, but with a much smaller learning rate so it doesn’t lose what it already knows. This needs more data, often tens of thousands of examples, but it can reshape the model far more deeply. It’s how labs teach a model entirely new behaviors, like how to follow instructions or how to sound like a helpful assistant.

    Either way, you don’t need the millions of examples that pre-training required. You’re not rebuilding the brain, you’re just pointing an existing intelligence in a new direction.
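
    Here is a toy sketch of the two approaches above. Everything in it is an assumption made for illustration: the "model" is just three numbers (a pre-trained base_w and a two-number head), the data is made up, and the learning rates are arbitrary. Real fine-tuning uses a deep learning framework and millions of parameters, but the mechanics are the same: freezing means some numbers never get updated, and full fine-tuning means everything moves, gently.

```python
def fine_tune(params, data, lr, trainable, epochs=200):
    """Gradient descent on squared error, updating only the names in `trainable`."""
    p = dict(params)                          # leave the pre-trained values untouched
    for _ in range(epochs):
        for x, y in data:
            feature = p["base_w"] * x         # "foundation" layer
            err = p["head_w"] * feature + p["head_b"] - y
            grads = {
                "base_w": 2 * err * p["head_w"] * x,
                "head_w": 2 * err * feature,
                "head_b": 2 * err,
            }
            for name in trainable:            # frozen parameters are simply skipped
                p[name] -= lr * grads[name]
    return p

def mse(p, data):
    """Mean squared error of the toy model on a dataset."""
    return sum((p["head_w"] * p["base_w"] * x + p["head_b"] - y) ** 2
               for x, y in data) / len(data)

pretrained = {"base_w": 2.0, "head_w": 1.0, "head_b": 0.0}   # hypothetical starting point
task_data = [(0.0, 1.0), (1.0, 4.0), (2.0, 7.0)]             # new task: y = 3x + 1

# Lightweight: freeze the base, train only the head.
frozen = fine_tune(pretrained, task_data, lr=0.05, trainable=["head_w", "head_b"])
# Heavier: update everything, with a 10x smaller learning rate.
full = fine_tune(pretrained, task_data, lr=0.005,
                 trainable=["base_w", "head_w", "head_b"])

print("before:", mse(pretrained, task_data))
print("frozen base:", frozen, mse(frozen, task_data))
print("full fine-tune:", full, mse(full, task_data))
```

    In the frozen run, base_w ends exactly where it started; in the full run it drifts, which is why the smaller learning rate matters.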

    Real-World Magic: How ChatGPT Was Made

    The clearest example of fine-tuning in action is ChatGPT itself.

    1. Pre-training (GPT-3): OpenAI trained a massive model to predict the next word in a sequence. It was smart, but it wasn’t helpful. If you typed a question, it might keep going with more questions instead of answering. It was a brilliant autocomplete, not an assistant.

    2. Fine-tuning (Instruction Tuning): They then fine-tuned it on a dataset of conversations. Humans demonstrated: when a user asks X, the assistant should respond with Y. The model learned the shape of being helpful.

    3. Refinement (RLHF): They then used a related technique called Reinforcement Learning from Human Feedback, where humans ranked pairs of AI answers (“this one is better than that one”) and the model learned to prefer the kinds of answers humans actually wanted. This is the step that taught the model to be helpful, harmless, and conversational.

    Without fine-tuning, GPT-3 was just a text generator. Fine-tuning is what turned it into a chatbot.

    And the same recipe is still in play today. Every new version of ChatGPT, Claude, or Gemini follows the same playbook: pre-train a massive generalist, then fine-tune it to be helpful, safe, and conversational. The base models keep getting bigger and smarter; the underlying principle doesn’t change.
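
    The "this one is better than that one" ranking step described above is often turned into a number the model can learn from using a pairwise preference loss. The sketch below shows one common formulation from published RLHF work; it is an assumption that any particular product uses exactly this form, and the function name and scores are made up for illustration.

```python
import math

def preference_loss(score_preferred, score_rejected):
    """Pairwise loss: small when the human-preferred answer scores higher.

    Equivalent to -log(sigmoid(score_preferred - score_rejected)).
    """
    diff = score_preferred - score_rejected
    return math.log(1.0 + math.exp(-diff))

# Hypothetical scores a reward model might assign to two answers.
agrees_with_humans = preference_loss(2.0, 0.0)     # preferred answer scored higher
disagrees_with_humans = preference_loss(0.0, 2.0)  # preferred answer scored lower

print(agrees_with_humans, disagrees_with_humans)
```

    Training pushes this loss down, which nudges the scoring model toward agreeing with the human rankings.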

    Is Building a Custom GPT the Same as Fine-Tuning?

    Short answer: no, but a lot of people use the words interchangeably.

    When you create a Custom GPT in ChatGPT, or set up a Claude Project, or write a long system prompt for your chatbot, you’re doing something different from what we’ve described above. You’re not training the model. You’re not changing its weights, its memory of how the world works, or any of the numbers inside its brain. The model itself is exactly the same as everyone else’s.

    What you’re actually doing is giving it a really detailed briefing every time it starts a conversation. It’s the difference between sending a doctor through a residency (real fine-tuning) and handing that same doctor a patient’s chart right before the appointment (prompting). Both shape what they do. Only one changes who they are.

    This kind of customization is called system prompting, and when you also upload reference files for the AI to consult, it’s called retrieval (often retrieval-augmented generation, or RAG). It’s much cheaper and faster than real fine-tuning. You don’t need any training data, any compute, or any ML team. And for most everyday use cases (a customer-support bot for your store, a writing assistant for your team, a tutor for your kids), prompting is more than enough.

    Real fine-tuning is what you reach for when prompting hits its limit. When you need the model to learn a pattern it just won’t pick up from instructions, or when you have thousands of examples of what good output looks like and want the model to truly internalize the style. For everything else, a good system prompt is usually faster, cheaper, and easier to iterate on.

    In casual conversation, people often say “I fine-tuned my GPT” when they really just wrote a clever system prompt. The slip is small, but worth knowing: one of these changes the AI itself, the other just changes its instructions for the day.

    The Takeaway

    Fine-tuning is how we customize AI for the real world.

    • It saves money. You don’t need hundreds of millions of dollars in compute to train a model from scratch — you stand on top of one that already exists.

    • It saves data. You can get strong results with hundreds or a few thousand examples, instead of the billions pre-training requires.

    • It creates experts. It turns a generic tool into a specialized solution for your business, your domain, your tone.

    Every AI product you use that feels weirdly tuned to a specific job (a customer support bot that sounds like the company, a coding assistant that knows your team’s style, a medical AI that flags conditions a generalist would miss) has been fine-tuned somewhere underneath.


    Coming Up

    Fine-tuning lets you specialize a generalist model with way less data than training from scratch. But what if you barely have any data, like 500 X-rays for a rare disease instead of 500,000? You can’t train from scratch and you don’t have enough to fine-tune properly either.

    Next edition: how AI learns from tiny datasets by cheating fairly — a technique called data augmentation.


    Was this helpful? Did the medical-school analogy click? Reply and let us know what you want us to explain next.


  • How Does AI Study, Practice, and Take the Final Exam?

    Training, validation, and test sets are the three separate groups of data every AI model needs to learn properly, tune itself, and be fairly evaluated — like practice problems, a mock exam, and the actual final exam, each playing a completely different role.

    Hey Common Folks!

    Last edition, we walked through pre-training — the general-education phase where an AI model reads the internet and slowly learns how the world is described in text. We ended with a promise: come back and show you what the day-to-day classroom actually looks like. This is that edition.

    Because pre-training raises a question that sits right underneath it: how does a data scientist know whether the model has genuinely learned something, or just memorized the examples it was shown?

    The answer comes down to a simple but crucial idea: you need to test a model on data it has never seen before. And to do that correctly, you have to split your data into three very specific groups from the start.

    That’s what training, validation, and test sets are all about.


    The Exam Analogy

    Think back to how you prepared for an important exam in school.

    Your textbook came with practice problems at the end of each chapter. You worked through them, checked your answers, and kept reviewing the material you got wrong. These are your training set — examples you learn from, with feedback.

    Then your teacher handed out a mock exam. You hadn’t seen these specific questions before, but you used them to figure out where you still had gaps. Based on how you did, you went back and studied those weak areas. These are your validation set — a checkpoint to help you tune your preparation.

    Finally: the real exam. Brand new questions, no second chances, no adjustments afterward. This is the only true measure of whether you actually learned. This is your test set.

    Here’s the critical rule: you cannot use the same data for all three purposes. If you train on the same examples you test on, you’re not measuring learning — you’re measuring memorization. It’s like scoring a student on the exact practice problems they studied. A perfect score tells you nothing.


    The Three Sets Explained

    Training Set

    This is the largest portion of your data — usually around 70-80%. The model learns from this data. It sees thousands or millions of examples, makes predictions, gets corrected, and adjusts its parameters. This is where all the actual learning happens.

    The model sees the training data over and over again across multiple passes, called epochs. It’s allowed to learn from its mistakes here.

    Validation Set

    This is typically 10-15% of your data. The model never trains on this data — it only uses it to check its progress.

    During training, you periodically pause and run the model on the validation set. If the model’s accuracy on training data is climbing but its accuracy on the validation set is flat or dropping, you’re watching overfitting happen in real time. The model is memorizing training examples instead of learning generalizable patterns.

    The validation set also helps you make decisions: Should you train for more epochs? Should you adjust the learning rate? Should you use a simpler model? Every tuning decision gets made based on how the model performs here — not on the training set.
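
    One common way to act on that signal is early stopping: keep the model from the epoch where validation accuracy peaked, and stop once it has failed to improve for a while. The sketch below uses made-up accuracy numbers, and the function name best_epoch and the patience of 2 are assumptions for illustration.

```python
def best_epoch(val_accuracy, patience=2):
    """Return (index of the best epoch, index where training would stop)."""
    best_idx = 0
    for i, acc in enumerate(val_accuracy):
        if acc > val_accuracy[best_idx]:
            best_idx = i                      # new best validation score
        elif i - best_idx >= patience:
            return best_idx, i                # no improvement for `patience` epochs
    return best_idx, len(val_accuracy) - 1

# Made-up history: training keeps climbing while validation peaks and slips.
train_acc = [0.70, 0.80, 0.88, 0.93, 0.97, 0.99]
val_acc = [0.68, 0.77, 0.82, 0.81, 0.80, 0.79]

best, stopped = best_epoch(val_acc)
print(best, stopped)   # best == 2, stopped == 4 (counting epochs from 0)
```

    Notice that the decision looks only at val_acc. The training accuracy keeps improving the whole time, which is exactly why it can’t be trusted for this call.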

    Test Set

    This is another small portion, typically 10-15%, and it gets used exactly once: at the very end, after all training and tuning is complete.

    The test set is your honest final grade. The model has never seen these examples. You haven’t made any decisions based on them. This is the only clean measurement of how the model will perform on truly new data.

    If you use the test set during tuning — making adjustments based on test performance — you’ve contaminated it. You’ve effectively turned it into a validation set, and now you have no honest evaluation left.
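
    In code, the three-way split described in this section is just a shuffle followed by two cuts. This is a minimal sketch: the function name three_way_split, the 70/15/15 proportions, and the fixed seed are illustrative choices, not a standard.

```python
import random

def three_way_split(examples, train_frac=0.7, val_frac=0.15, seed=0):
    """Shuffle once, then cut into training, validation, and test sets."""
    items = list(examples)
    random.Random(seed).shuffle(items)        # fixed seed keeps the split repeatable
    n_train = round(len(items) * train_frac)
    n_val = round(len(items) * val_frac)
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]            # whatever is left; touch it only once
    return train, val, test

train, val, test = three_way_split(range(100))
print(len(train), len(val), len(test))        # 70 15 15
```

    The split happens once, before any training starts, and every example lands in exactly one of the three groups.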


    What Happens When You Get This Wrong

    Skipping or contaminating the test set is more common than you’d think, and the consequences can be serious.

    In the late 2010s and early 2020s, AI medical-imaging papers regularly landed in major journals reporting accuracy figures in the 90s, sometimes claiming they could spot tumors, predict COVID, or flag heart disease nearly perfectly. A 2021 review published in Nature Machine Intelligence screened over 2,000 such papers on COVID detection alone and concluded that none of the models it examined was fit for clinical use. The recurring reasons were methodological flaws and biased data, including data leakage: the “test set” had quietly seen the same patients, the same scanners, or the same preprocessing as the training set. The exam was rigged.

    When hospitals tried to deploy these tools on truly new patients, accuracy collapsed.
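
    One simple guard against the "same patients on both sides" kind of leakage is to split by patient rather than by individual scan. Below is a sketch with invented records; the field names patient and scan, the function name split_by_patient, and the 20% test share are all assumptions for illustration.

```python
import random

def split_by_patient(records, test_frac=0.2, seed=0):
    """Assign whole patients to train or test so no patient appears in both."""
    patients = sorted({r["patient"] for r in records})
    random.Random(seed).shuffle(patients)
    n_test = max(1, round(len(patients) * test_frac))
    test_patients = set(patients[:n_test])
    train = [r for r in records if r["patient"] not in test_patients]
    test = [r for r in records if r["patient"] in test_patients]
    return train, test

# Invented data: 10 patients with 3 scans each.
records = [{"patient": f"P{i // 3}", "scan": i} for i in range(30)]
train, test = split_by_patient(records)

train_ids = {r["patient"] for r in train}
test_ids = {r["patient"] for r in test}
print(len(train), len(test), train_ids & test_ids)   # overlap should be empty
```

    A plain random shuffle of the 30 scans would almost certainly put some of a patient’s scans in training and others in the test set, which is the rigged exam described above.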

    This isn’t just a technical problem. When AI is making recommendations about someone’s cancer diagnosis, loan application, or parole hearing, the gap between “performed well in testing” and “performs well on real people” has real consequences.

    By 2026, the response has started to bite. The FDA now expects AI medical-device submissions to demonstrate performance on prospective test data — patients enrolled after the model was already locked in — because retrospective test sets had proven too easy to contaminate, even by accident. A properly held-out test set is the closest thing we have to a clean, honest audit before deployment, and regulators have stopped trusting anything weaker.


    An Important Twist: Cross-Validation

    Sometimes your dataset is so small that you can’t afford to lock away 15% of it. You need every example to train on.

    In those cases, data scientists use a technique called cross-validation. Instead of one fixed split, you divide the data into equal “folds,” typically five or ten. With ten folds, you train on nine of them and validate on the tenth. Then you rotate: train on a different nine, validate on the remaining one. Repeat until every fold has been the validation set exactly once, then average the results.

    This gives you a more reliable estimate of how the model will perform on new data without sacrificing training examples. It’s slower, but it’s smart when data is scarce.
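
    The rotation described above can be sketched as a small generator that hands back, for each round, which examples to train on and which to validate on. The function name k_fold_indices is made up for this example, and it assumes the data has already been shuffled.

```python
def k_fold_indices(n_examples, k):
    """Yield (train_indices, validation_indices) for each of k rotations."""
    # Spread any remainder over the first few folds so sizes differ by at most one.
    fold_sizes = [n_examples // k + (1 if i < n_examples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_examples))
        yield train_idx, val_idx
        start += size

folds = list(k_fold_indices(10, 5))
for train_idx, val_idx in folds:
    print("train on", train_idx, "validate on", val_idx)
```

    You would train a fresh model in each round and average the five validation scores to get the final estimate.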


    Where You See This in Practice

    Every responsible AI system uses this three-set approach, even if the terminology differs.

    A team at a company like Google testing a new search ranking algorithm would typically train on historical search data, validate on recent data to tune the approach, and evaluate on a final held-out period before pushing anything to production.

    Recommendation models like Netflix’s typically train on older viewing history, validate on more recent history to catch drift and tune parameters, and measure final performance on the most recent slice of data they haven’t touched.

    Even the foundation models behind ChatGPT, Claude, and Gemini get evaluated on held-out benchmarks the model has never seen during pre-training — though, as we’ll see in a future edition, even those held-out exams are getting harder to keep clean as models scale.

    The principle is always the same: learn on one thing, tune on another, and only trust what you measure on something completely separate from both.


    The Takeaway

    The training set teaches the model. The validation set helps you tune it. The test set honestly evaluates it.

    Confuse these roles — or skip any of them — and you lose the ability to know whether your AI actually works.

    This is why, when you hear “AI achieved 95% accuracy,” the right question isn’t “how high is the number?” It’s “what data was it tested on, and was that data truly held out?” The number only means something if the exam was fair.

    Every AI tool you trust — from medical imaging to fraud detection to the spam filter keeping your inbox clean — depends on data scientists getting this split right before it ever reaches you.


    Coming Up

    We’ve now seen the clean version of how AI learns — labeled examples, a tidy split, an honest final exam. But that whole picture assumes someone already labeled the data for you. In the real world, most data shows up messy and unlabeled. Nobody has sat down to tag every tweet, every photo, or every customer record.

    So what happens when you take the answer key away? Can AI still learn anything useful from a pile of raw, unsorted data? Next edition: how AI learns without a teacher.


    Was this helpful? Did you know this was how AI gets tested? Reply and let us know what you want us to explain next.


  • How Does AI Get Its Basic Education Before It Meets You?

    Pre-training is the process of teaching an AI model general knowledge from a massive dataset before teaching it any specific job. It is the heavy lifting that turns a blank computer brain into a knowledgeable generalist.

    Hey Common Folks!

    Yesterday we covered tokens — how AI breaks your prompt into the small pieces it can actually read. Now we answer the question that opens up right underneath that one: where did AI learn what those tokens actually mean in the first place?

    The answer is Pre-training. It is the “P” in GPT (Generative Pre-trained Transformer). It is the phase where a Foundation Model learns the vast majority of what it knows, and it is the reason today’s AI is so shockingly capable, and occasionally so shockingly confident about things it shouldn’t be.


    What is Pre-training?

    Think of it as General Education. Before you become a doctor, an accountant, or a coder, you first have to learn the alphabet, how to read, how to do basic math, and how the world works. You don’t start kindergarten by performing brain surgery.

    Pre-training is the AI version of that long, broad foundation phase. The model spends an enormous amount of time learning general knowledge before anyone ever asks it to do a specific job.


    The Problem: Learning from Scratch is Hard

    Why do we even need pre-training? Why can’t we just teach an AI to answer customer support emails directly?

    Because deep learning models are data hungry. If you want to train a model from scratch to recognize a cat, you need thousands of photos of cats. But not just photos — you need humans to manually label them: “This is a cat,” “This is a dog.” That labeling process is slow, expensive, and tedious.

    And if you tried to teach a computer to understand English by only showing it customer support emails, it would fail. It wouldn’t know what a verb is, what “angry” sounds like, or even how to structure a sentence.

    You can’t shortcut the foundation. You have to build it.


    The Solution: Predict What Comes Next

    Pre-training flips the problem on its head. Instead of teaching the AI a specific job immediately, we let it loose on a massive amount of data — the internet, books, papers, code — with one beautifully simple goal: predict what comes next.

    The AI reads billions of sentences and tries to guess the next word over and over. After “Hello,” the next word is often “World” or “There.” After “Once upon a,” it is almost always “time.” Through trillions of these tiny prediction games, the model gradually picks up grammar, language patterns, factual knowledge, and slang — the broad shape of how the world is described in text.

    The beautiful part: nobody has to label anything. The next word in the sentence is itself the answer.
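
    Here is a toy version of that prediction game. Real models work on tokens and learn with neural networks; this sketch swaps in something far simpler (counting which word follows which) purely to show that raw text supplies its own answer key. The sample text and the function name predict_next are made up for illustration.

```python
from collections import Counter, defaultdict

text = "once upon a time there was a cat . once upon a time there was a dog ."
words = text.split()

# Every adjacent pair is a free training example: (current word, next word).
pairs = list(zip(words, words[1:]))

# "Training": count how often each word follows each other word.
counts = defaultdict(Counter)
for current, nxt in pairs:
    counts[current][nxt] += 1

def predict_next(word):
    """Return the most frequently seen follower of `word`, or None if unseen."""
    followers = counts.get(word)
    if not followers:
        return None
    return followers.most_common(1)[0][0]

print(predict_next("upon"))   # a
print(predict_next("a"))      # time
```

    No human labeled anything here. The 17 training pairs came straight out of the 18 words of text.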

    It learns the general features of the world first.

    • In images: it learns what edges, circles, and shapes look like before it learns what a “face” is.

    • In text: it learns how language works before it learns how to write a poem about your dog.

    The whole philosophy comes from a simple idea in transfer learning: don’t reinvent the wheel. If a model already exists that understands the basics, build on top of that knowledge instead of starting from zero.


    The Analogy: The Medical Student

    To understand pre-training versus what comes after it, think of a medical student.

    1. Pre-training (Medical School): The student spends years reading textbooks and learning anatomy, biology, and chemistry. They aren’t treating patients yet. They are just building a massive foundation of general knowledge. They know a little bit about everything.

    2. Specialty Training (Residency): Now that student goes to a hospital to specialize, maybe in cardiology, maybe in surgery. They take all that general knowledge and focus it on one specific task.

    A pre-trained model like GPT-5, Claude, or Gemini is the medical school graduate. It has read the library. It is smart and broad, but it hasn’t specialized in your company’s data, your customer’s tone, or your industry’s jargon yet. That comes later.


    Why is This a Game Changer?

    Before pre-training became the standard, if you wanted to build an AI to translate languages, you needed a massive labeled dataset of English-to-French sentences. If you didn’t have that data, you were stuck. Every new task meant starting from zero.

    With pre-training:

    1. You need less data later. Because the model already knows English, you only need to show it a small number of examples of what you specifically want it to do for it to catch on.

    2. You skip the hardest part. A pre-trained foundation already exists. You just point it at the specific job you care about.

    A note on scale, because this matters for understanding why pre-training is such a big deal in 2026: back when this technique first took off around 2018–2020, pre-training a small language model was a days-or-weeks project on a modest cluster. Today’s frontier models — GPT-5, Claude, Gemini — take months to pre-train, run on tens of thousands of GPUs, and cost hundreds of millions of dollars in compute alone. Pre-training has gotten bigger, not smaller, as AI has scaled.


    The Takeaway

    Pre-training is the heavy lifting. It is the process of creating a Foundation Model — a knowledgeable generalist that can later be pointed at almost any task.

    • It is the bridge between a blank computer brain and one that has read the library.

    • It is the difference between teaching a baby to write a novel (impossible) and asking a college graduate to write a novel (possible).

    • It is the layer underneath every modern AI you use — ChatGPT, Claude, Gemini, Copilot — built long before you ever typed a single prompt.

    One honest caveat for 2026: pre-training is the foundation, but it is not the whole story anymore. Modern AI also goes through additional specialty training on top of pre-training, and that is where today’s models learn to be helpful, careful, and reasoning-capable. We’ll dig into that in the upcoming articles.

    When you use ChatGPT or Claude, you are talking to a model that has already finished its general education. It has read the library. Now it is ready to work for you.


    Coming Up

    You now know what pre-training is — the general-education phase where AI learns how the world is described in text. But how does the AI actually do the learning? What does the day-to-day classroom look like? Next, we’ll walk through the study, practice, and test loop — the actual mechanics of how a blank-slate model goes from random guesses to meaningful answers.


    AI for Common Folks – Making AI understandable, one concept at a time.

  • What Are Tokens and Why Does AI Count Words Differently?

    What Are Tokens and Why Does AI Count Words Differently?

    A Token is the smallest unit of text an AI processes — a whole word, part of a word, or even a single character. It is the bridge between human language and the numbers AI actually understands.

    Hey Common Folks!

    Over the last two articles, we covered what a prompt is and how to write a good one. That was about how to talk to AI. Now we look at the other side of that conversation: how AI actually reads what you typed.

    If you have ever looked at the pricing page for ChatGPT or Claude, or seen an error message saying “Token limit exceeded,” you have probably scratched your head. Why don’t they just count words? Why this fancy term “Token”?

    It turns out, computers don’t read the way we do. To an AI, a Token is the fundamental unit of reality.


    What is a Token?

    A token is a chunk of text. It can be a whole word (“apple”), part of a word (“smart” + “phones”), a piece of punctuation, or even a single character.

    Think of tokens as the atoms of language. Just as a molecule of water is made up of atoms (Hydrogen and Oxygen), a sentence is made up of tokens.


    The Analogy: The Lego Castle

    Imagine you have a beautiful Lego castle (a sentence).

    • For Humans: We look at the castle and say, “That’s a castle.” We read the whole word.

    • For AI: The AI looks at the individual plastic bricks used to build it.

    Sometimes, a single brick is a whole window (a whole word like “apple”). Other times, to build a long wall (a complex word like “smartphones”), you need two bricks: “smart” and “phones.”

    The AI doesn’t see the castle; it sees a pile of bricks. It processes those bricks one by one to understand the structure.


    Why Not Just Use Words?

    A fair question: “Why break ‘smartphones’ into two tokens? Why not just treat it as one word?”

    Because computers don’t understand English. They understand numbers.

    To teach an AI language, every piece of text first has to be converted into a list of numbers (called token IDs), which the AI later turns into rich mathematical representations called embeddings.

    • If we assigned a unique number to every single word in the English language, the list would be infinite and unmanageable. Every name, every typo, every made-up word would need its own slot.

    • By breaking complex words into smaller chunks (tokens), the AI can understand words it has never seen before by recognizing their parts.

    If the AI knows “smart” and it knows “phones,” it can understand “smartphones” without needing a separate definition for it.
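    Here is a rough sketch of that idea in code. It assumes a tiny hand-picked vocabulary and a simple "take the longest known chunk first" rule. Real tokenizers learn their vocabulary from data (byte-pair encoding is one common method), so treat `VOCAB` and `split_into_tokens` as hypothetical stand-ins, not how any particular model does it.

```python
# Toy vocabulary: a hypothetical handful of chunks, not a real tokenizer's list.
VOCAB = {"smart", "phones", "phone", "apple", "s"}


def split_into_tokens(word, vocab=VOCAB):
    """Greedy longest-match: repeatedly take the longest known chunk from the front."""
    tokens = []
    start = 0
    while start < len(word):
        # Try the longest possible piece first, then shrink.
        for end in range(len(word), start, -1):
            piece = word[start:end]
            if piece in vocab:
                tokens.append(piece)
                start = end
                break
        else:
            # No known chunk matched: fall back to a single character.
            tokens.append(word[start])
            start += 1
    return tokens


print(split_into_tokens("apple"))
print(split_into_tokens("smartphones"))
print(split_into_tokens("apples"))
```

    Notice that "apples" was never in the vocabulary, yet it still splits cleanly into pieces the toy tokenizer already knows.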


    How Does It Work? (The Tokenization Process)

    Before your prompt hits the AI, a process called Tokenization happens.

    1. Input: You type “I love AI.”

    2. Chopping: The tokenizer chops this up. In a modern GPT-style tokenizer, it looks roughly like: ["I", " love", " AI", "."] — four tokens, including the leading spaces. (Real tokenizers are picky like that, and they bake the spacing into the tokens themselves.)

    3. Numbering: The system assigns a specific ID number to each chunk. Modern tokenizers have a vocabulary of tens of thousands to a few hundred thousand possible tokens, so real IDs can be large numbers. The IDs here are made up for illustration: [40, 1842, 16124, 13].

    4. Processing: The AI receives that list of numbers and asks itself the one question it knows how to answer — what number should come next? It picks the most likely next number based on the patterns it learned during training (we walked through that pattern-prediction in How AI Actually Learns). That predicted number is converted back into a piece of text, and then the AI does it again — one token at a time — until your full answer is built.
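    The chopping and numbering steps above can be sketched in a few lines. This is a toy, assuming a four-entry vocabulary that reuses the illustrative IDs from step 3; the names `TOKEN_TO_ID`, `encode`, and `decode` are invented for this example and are not any real tokenizer's API.

```python
# Illustrative vocabulary using the made-up IDs from the steps above.
TOKEN_TO_ID = {"I": 40, " love": 1842, " AI": 16124, ".": 13}
ID_TO_TOKEN = {i: t for t, i in TOKEN_TO_ID.items()}


def encode(text):
    """Steps 2 and 3: chop the text into known chunks and look up each ID."""
    ids = []
    pos = 0
    while pos < len(text):
        # Take the longest vocabulary entry that matches at this position.
        match = max(
            (tok for tok in TOKEN_TO_ID if text.startswith(tok, pos)),
            key=len,
            default=None,
        )
        if match is None:
            raise ValueError(f"no token for text at position {pos}")
        ids.append(TOKEN_TO_ID[match])
        pos += len(match)
    return ids


def decode(ids):
    """The reverse trip: turn IDs back into text."""
    return "".join(ID_TO_TOKEN[i] for i in ids)


ids = encode("I love AI.")
print(ids)
print(decode(ids))
```

    The model itself only ever sees the list of numbers. Turning them back into readable text is the last step before the answer reaches your screen.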

    Want to see this live? OpenAI’s free tokenizer page lets you paste any text and watch how it gets split, and Anthropic’s token-counting API reports how many tokens a prompt will use. It’s worth doing once; you will never look at a long prompt the same way again.


    The “Currency” of AI

    Why should you care about tokens? Because in the world of Generative AI,
    Tokens = Money.

    When companies like OpenAI or Anthropic charge you, they don’t charge per question. They charge per token.

    • Input Tokens: what you type, paste, or upload into the chat.

    • Output Tokens: what the AI writes back.

    A useful 2026 reality check: output tokens almost always cost more than input tokens, typically 3 to 5 times more, because the model has to write its answer one token at a time, while it can read your whole prompt in a single pass. So a long, rambling AI response costs you more than a short, precise one. Brevity in your prompt and in the format you ask for is a real lever on cost.

    Roughly speaking, 1,000 tokens is about 750 words. So if you ask the AI to summarize a 50-page document, you are “spending” tokens to feed that document into the model, and spending more tokens to get the summary back out.
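    A back-of-the-envelope version of that arithmetic, as a sketch. The two prices are hypothetical placeholders (check your provider’s pricing page), and the 500-words-per-page figure is an assumption; only the 750-words-per-1,000-tokens rule of thumb comes from the text.

```python
WORDS_PER_TOKEN = 0.75  # rule of thumb: 1,000 tokens is about 750 words

# Hypothetical prices in dollars per million tokens, for illustration only.
INPUT_PRICE_PER_MILLION = 3.00
OUTPUT_PRICE_PER_MILLION = 15.00


def estimate_tokens(word_count):
    """Convert a word count into an approximate token count."""
    return round(word_count / WORDS_PER_TOKEN)


def estimate_cost(input_words, output_words):
    """Return (input tokens, output tokens, estimated cost in dollars)."""
    input_tokens = estimate_tokens(input_words)
    output_tokens = estimate_tokens(output_words)
    cost = (
        input_tokens * INPUT_PRICE_PER_MILLION
        + output_tokens * OUTPUT_PRICE_PER_MILLION
    ) / 1_000_000
    return input_tokens, output_tokens, cost


# A 50-page document at roughly 500 words per page, summarized in 300 words.
input_tokens, output_tokens, cost = estimate_cost(50 * 500, 300)
print(input_tokens, output_tokens, f"${cost:.4f}")
```

    The exact numbers will differ by model and provider, but the shape holds: the document you feed in dominates the token count, while each output token is priced higher.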

    The good news for 2026: frontier models like Claude and Gemini now support context windows in the hundreds of thousands to over a million tokens. A 500-page novel can fit in a single prompt today — something that was impossible just two years ago. Tokens still cost money, but the ceiling on how much you can hand the AI in one shot has gone way up.


    The Takeaway

    A Token is simply a chunk of text.

    • It is the bridge between human language and machine numbers.

    • It is the unit used to measure the size of the AI’s memory (the context window).

    • It is the unit used to calculate the cost of using the AI.

    Understanding tokens explains why your long prompt sometimes gets cut off, why running a massive analysis costs a few dollars instead of a few cents, and why brevity in your prompts is a quietly powerful skill.


    Coming Up

    You now know how AI breaks your prompt into tokens. But where did it learn what those tokens actually mean in the first place? Next, we’ll dig into how AI gets its basic education — the massive pre-training phase where a blank-slate model reads a huge slice of the internet and starts to make sense of the world.


    AI for Common Folks – Making AI understandable, one concept at a time.

  • How Do You Get Better Answers from AI?

    How Do You Get Better Answers from AI?

    Prompt Engineering is the skill of crafting inputs to guide AI models toward the specific output you want — less about coding, more about clear communication with machines.

    Hey Common Folks!

    In our last article, we covered what a prompt is — the bridge between human intent and machine output. Before that, we met the Large Language Model — the engine inside ChatGPT, Claude, and Gemini — and a few articles back we built up to it through Neural Networks.

    You now know what a prompt is. This article is about the part that actually changes your results: how to write a good one.

    Have you ever asked ChatGPT a question, gotten a generic, unhelpful answer, then rephrased it and suddenly gotten something brilliant? That’s not luck. That’s the difference between a bad prompt and a good one.

    The skill of crafting better prompts has a name: Prompt Engineering.


    Why “Engineering”?

    Don’t let the word scare you. This isn’t about writing code or building bridges.

    Prompt Engineering is about communication skills — talking to a machine in a way it understands best. It’s the art of being specific, structured, and strategic with your instructions.


    The Analogy: Back to Our New Hire

    Remember the new hire we’ve been working with? The one who read the entire internet before day one?

    They have all the knowledge in the world. But they are extremely literal and have zero context about your specific life or business.

    Bad Manager (You):
    “Write an email to the client.”

    The new hire (The AI):
    Panic. Which client? Good news or bad? Formal or casual? They guess and write something generic and robotic.

    Prompt Engineer (You):
    “Act as a senior sales manager. Write a polite but firm email to ‘Client X’ regarding their overdue payment of $500. Keep it under 100 words. Don’t use emojis.”

    The new hire (The AI):
    Understood. They write exactly what you need because you gave them the Role, the Context, and the Constraints.

    Prompt Engineering is just being a really good manager to your AI new hire.


    Why Does This Work? (The Technical Bit)

    Remember how LLMs work? They predict the next word based on patterns they’ve seen.

    When you write a detailed prompt, you’re filling the AI’s Context Window with specific patterns. The AI looks at those patterns and generates text that fits that specific context.

    When you go a step further and include examples of the input-output pattern you want, you’re using something researchers call In-Context Learning — the AI picks up the pattern from the examples in your prompt without any retraining. We’ll see this in action in technique #3.

    Vague prompt = vague output.
    Specific prompt = specific output.


    The Five Techniques That Actually Work

    You don’t need a degree for this. Master these five approaches.

    1. Role Prompting (The “Act As” Hack)

    Tell the AI who it’s supposed to be.

    • Instead of: “Explain quantum physics.”

    • Try: “You are a kindergarten teacher. Explain quantum physics to a 5-year-old using only examples they’d understand.”

    This sets the tone, complexity, and approach immediately.

    2. Be Specific About Output

    Don’t leave format to chance.

    • Instead of: “Give me marketing ideas.”

    • Try: “Give me 5 marketing ideas for a local bakery. Format as a numbered list. Each idea should be under 20 words.”

    The more specific your constraints, the more useful the output.

    3. Few-Shot Prompting (Show, Don’t Just Tell)

    Sometimes instructions aren’t enough. Show examples.

    If you want the AI to convert slang to formal English, demonstrate:

    • Input: “Sup bro?” → Output: “Hello, how are you?”

    • Input: “Gotta run.” → Output: “I must leave now.”

    • Input: “No way!” → Output:

    The AI sees the pattern and continues it perfectly. That’s In-Context Learning at work.
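    If you send prompts from code rather than a chat window, the same few-shot pattern can be assembled with a small helper. This is a minimal sketch; `build_few_shot_prompt` is a made-up name, and the "Input:/Output:" layout is just one reasonable format, not a required one.

```python
def build_few_shot_prompt(instruction, examples, new_input):
    """Assemble instruction + worked examples + the unanswered case into one prompt."""
    lines = [instruction, ""]
    for source, target in examples:
        lines.append(f"Input: {source}")
        lines.append(f"Output: {target}")
        lines.append("")
    # Leave the last Output empty so the model continues the pattern.
    lines.append(f"Input: {new_input}")
    lines.append("Output:")
    return "\n".join(lines)


prompt = build_few_shot_prompt(
    "Convert slang to formal English.",
    [("Sup bro?", "Hello, how are you?"), ("Gotta run.", "I must leave now.")],
    "No way!",
)
print(prompt)
```

    The resulting text is what you would paste or send to the model. Nothing about the model changes; the examples do all the teaching.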

    4. Chain of Thought (Let’s Think Step-by-Step)

    For complex problems, add: “Let’s think step-by-step.”

    This forces the AI to show its reasoning before giving the final answer. Accuracy on math and logic problems goes up because the AI can’t just guess — it has to work through the problem.

    A note for 2026: this was the original trick that started the whole “reasoning model” era. Today’s reasoning-tier models from OpenAI, Anthropic, and Google already do this internally before they answer you. So the phrase matters less for the latest models, but it still helps with smaller or older ones, and it is the foundation everything else is built on.

    5. Give Context and Background

    The AI doesn’t know your situation. Tell it.

    • Instead of: “Write a resignation letter.”

    • Try: “I’ve worked at this company for 3 years. My boss has been supportive. I’m leaving for a better opportunity, not because I’m unhappy. Write a professional resignation letter that maintains the relationship.”

    Context changes everything.


    Want to Practice This?

    A friend of mine, Don Barger, built a free tool at ripen.donbarger.com around a clean little framework he calls RIPEN:

    • Role — who or what should the AI act as?

    • Input — what information are you giving it?

    • Process — what steps should it take to get to the answer?

    • Example — show it what good output looks like.

    • Notes — tone, constraints, guidelines, anything else.

    It’s the same territory we just walked through, repackaged as a five-letter mnemonic that’s easy to recall the moment you’re actually typing a prompt. If you want a structured place to drill these techniques into muscle memory — whether you’re writing a one-off prompt or building a chatbot’s personality — ripen.donbarger.com is a clean starting point.


    Common Mistakes to Avoid

    Being too vague: “Help me with my project” tells the AI nothing.

    Mashing unrelated tasks into one prompt: Back when ChatGPT first launched in 2022, even simple multi-task prompts like “write me an email, also summarize this, and list action items” would come out confused or messy. Today’s models handle that combo easily — so this advice has evolved, not disappeared. The underlying principle still holds in two specific cases: when the tasks have conflicting tones or audiences (a casual Slack message and a formal client email about the same news), and when you want to iterate on one piece of the answer without regenerating the rest. For serious work, separating prompts still gives you cleaner output and tighter control.

    Not iterating: Your first prompt rarely gives perfect results. Treat it as a conversation — refine and improve.

    Forgetting constraints: Without limits, AI tends toward verbose, generic responses. Add word counts, formats, and restrictions.


    Simple Examples to Start With

    Writing

    • Before: “Write a blog post about productivity.”

    • After: “Write a 500-word blog post about productivity for remote workers. Use a conversational tone. Include 3 actionable tips. Start with a relatable scenario.”

    Research

    • Before: “Tell me about climate change.”

    • After: “Summarize the top 3 causes of climate change in simple terms a high school student would understand. Use bullet points. Keep it under 200 words.”

    Code

    • Before: “Write Python code.”

    • After: “Write a Python function that takes a list of numbers and returns the average. Include comments explaining each step. Handle the case of an empty list.”


    The Limitations (Keeping It Real)

    Prompt Engineering has limits.

    It can’t fix bad models: If the underlying AI is weak, no prompt will save it.

    It’s not magic: Some tasks are genuinely hard for AI. Better prompts help, but they don’t make AI capable of everything.

    It takes practice: You’ll write bad prompts before you write good ones. That’s normal.


    The Takeaway

    Prompt Engineering isn’t a “technical” skill — it’s a clarity skill.

    • Vague prompts = average results.

    • Specific, structured prompts with examples = excellent results.

    Writing code is still a real, valuable craft. The engineers who can also articulate clearly, in plain language, with the right context, are the ones getting the most out of AI right now. Clarity is no longer a soft skill. It is a multiplier on top of every other skill you already have.


    Coming Up

    You now know how to ask. But before the AI can even read your prompt, it has to break it into pieces. Next, we’ll explore Tokens — why AI counts your words differently than you do, why a four-letter word can sometimes count as two tokens, and why that quietly affects what AI costs you and what it remembers.


    AI for Common Folks – Making AI understandable, one concept at a time.

  • What Is a Prompt and Why Does It Matter So Much?

    What Is a Prompt and Why Does It Matter So Much?

    A Prompt is the input you provide to an AI model (text, image, voice, or document) to get a specific response. It’s the bridge between human intent and machine output.

    Hey Common Folks!

    In our last article, we met the Large Language Model — the engine inside ChatGPT, Claude, and Gemini. Before that, we decoded GPT, and before that we introduced Foundation Models as the general-purpose AI brains powering today’s tools.

    We know these AI models are incredibly powerful. But a Ferrari is useless if you don’t know how to drive it.

    How do you actually talk to this super-brain? You don’t use Python code or binary zeros and ones. You use a Prompt.

    And here’s the part most people miss: the same AI can give you a brilliant answer or a useless one based on nothing but how you asked. Same model. Same knowledge. Completely different output. That gap is what this article is about.


    What is a Prompt?

    In simple terms, a prompt is whatever you give the AI to get it to do something.

    • When you type a question into ChatGPT, that text is the prompt.

    • When you upload a screenshot of a spreadsheet and ask for key insights, that combination is the prompt.

    • When you paste a 50-page PDF and ask for a summary, the document plus the instruction is the prompt.

    • When you speak to AI through your phone, your voice is the prompt.

    A prompt is the bridge between human intent and machine output. Everything you want, you have to express through it. The AI cannot read your mind. It can only work with what you give it.


    The Analogy: Back to Our New Hire

    Remember the new hire we met last article? The one who read the entire internet before day one?

    They have all the knowledge in the world. But they don’t know you yet. They don’t know your style, your boss, your deadlines, your preferences. On day one, they need instructions for every single task, and the quality of your instructions determines the quality of their work.

    • Bad Prompt: “Write an email.”

      • The new hire thinks: To whom? About what? Angry tone? Professional? Long? Short?

      • The result: A generic, useless draft.

    • Good Prompt: “Write a polite email to my boss asking for two days of sick leave next week. Keep it under 50 words and don’t sound demanding.”

      • The new hire thinks: Got it. Topic, tone, recipient, length — all clear.

      • The result: A perfect, ready-to-send email.

    The AI is that new hire. It has the capability to do almost anything, but it relies heavily on your instructions to know what to do, how to do it, and who it’s doing it for.


    Why One Word Can Change Everything

    You might have heard the term Prompt Engineering. It sounds fancy, but it just means “the art of asking correctly.”

    A fair question to raise: in 2026, with ChatGPT, Claude, and Gemini so much more capable than they were a few years ago, does this skill still matter? The honest answer is yes, and the bar has moved. Early AI models needed near-magical phrasing to produce anything useful at all. Today’s models are far more forgiving — they’ll give you something even from a sloppy prompt. But the gap between something useful and exactly what you need still comes down to how you asked.

    LLMs are sensitive to wording. Changing a single word in your prompt can completely change the answer you get back. Three quick examples:

    One word changes the audience.

    • “Explain gravity.” → a physics textbook definition.

    • “Explain gravity to a 5-year-old.” → a story about falling apples.

    One word changes the tone.

    • “Rewrite this email.” → the AI picks a tone for you, maybe the wrong one.

    • “Rewrite this email more politely.” → it keeps your meaning but softens the edges.

    One word changes the format.

    • “Summarize this article.” → a paragraph.

    • “Summarize this article in bullet points.” → a scannable list.

    Same AI. Same underlying knowledge. Completely different outputs. From one word of difference.

    Why does this happen? Remember from the last article that the LLM predicts one word at a time, always choosing the most likely continuation of what came before. Your prompt is the “what came before.” When you change a word, you change the entire downstream probability of what word should come next. The AI isn’t being tricky. It’s responding exactly as designed to the pattern you gave it.

    Same engine, same knowledge, completely different output. All from how you asked.


    The Anatomy of a Prompt

    Every strong prompt has three basic parts:

    1. The Persona: Tell the AI who it is.

      • Example: “You are an expert travel guide” or “You are a Python coding tutor.”

    2. The Task: Tell it what to do.

      • Example: “Plan a 3-day trip to Goa.”

    3. The Constraints and Format: Tell it how you want the answer.

      • Example: “Give me the answer as a bulleted list” or “Keep it under 100 words.”

    Stack all three together and the new hire knows exactly who they are, what they’re doing, and how the answer should look.

    Most prompts people type skip one or two of these parts. That’s why most AI answers feel generic.
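    As a small sketch of the three-part structure, here is a helper that stitches the parts together and tells you which ones you skipped. The function `build_prompt` and its return shape are invented for this example; it is a checklist in code form, not a standard tool.

```python
def build_prompt(persona=None, task=None, constraints=None):
    """Join whichever parts are provided and report the ones left out."""
    parts = {"persona": persona, "task": task, "constraints": constraints}
    missing = [name for name, value in parts.items() if not value]
    prompt = " ".join(value for value in parts.values() if value)
    return prompt, missing


prompt, missing = build_prompt(
    persona="You are an expert travel guide.",
    task="Plan a 3-day trip to Goa.",
    constraints="Give me the answer as a bulleted list. Keep it under 100 words.",
)
print(prompt)
print("Missing parts:", missing)

vague_prompt, vague_missing = build_prompt(task="Plan a trip.")
print(vague_prompt)
print("Missing parts:", vague_missing)
```

    The second call is the typical everyday prompt: a task with no persona and no constraints, which is exactly the gap that makes answers feel generic.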


    Bad Prompts vs Good Prompts: Three Real Examples

    Here’s what that looks like in practice across three everyday tasks.

    Writing

    • Bad: “Write a blog post about productivity.”

    • Good: “Write a 500-word blog post about productivity tips for remote workers. Use a conversational tone. Include three actionable tips. Start with a relatable scenario.”

    Why the second works: The AI now knows the length, audience, tone, structure, and opening style. Five decisions you didn’t have to make yourself.

    Research

    • Bad: “Tell me about climate change.”

    • Good: “Summarize the top three causes of climate change in simple terms a high school student would understand. Use bullet points. Keep it under 200 words.”

    Why the second works: You’ve turned an infinite-scope question into a focused, scannable answer at a specific reading level.

    Code

    • Bad: “Write Python code.”

    • Good: “Write a Python function that takes a list of numbers and returns the average. Include comments explaining each step. Handle the case of an empty list.”

    Why the second works: The AI now has a specification — inputs, outputs, edge cases, and documentation. It produces code you can actually use.

    In all three cases, the AI didn’t get smarter. The prompt did.


    The Four Most Common Prompting Mistakes

    Before we close, here are the four patterns that produce most of the frustrating AI experiences. Name them once, and you’ll start catching yourself doing them.

    1. Being too vague. “Help me with my project” tells the AI nothing. No topic, no format, no outcome.

    2. Not giving the AI the context it needs. “Fix this bug” without sharing the code, the error message, and what the code is supposed to do leaves the AI guessing. Being clear about what you want is not the same as giving the AI the raw material to do the job. Modern models are great, but they still can’t read your screen or your mind.

    3. Giving no constraints. Without a word limit, audience, or format, the AI defaults to verbose and generic.

    4. Expecting perfection on the first try for complex tasks. For simple asks, modern models often nail it immediately. But for anything nuanced — a layered analysis, a specific tone, a tricky coding problem — iteration is part of the skill, not a sign you’re doing it wrong.

    The good news: each of these has a clean fix, and there are specific techniques that turn vague prompts into surgically precise ones. We’ll walk through all of them in the next article.


    The Takeaway

    A prompt is how we program modern computers using natural language instead of code.

    • It is the steering wheel of the AI.

    • It is the instructions you hand the new hire.

    • It is the difference between a useless generic reply and a precise, personalized answer.

    The better you get at prompting, the smarter the AI seems to become. Not because the model changes, but because your instructions unlock more of what was always there.

    Coming Up

    Now you know what a prompt is and why it matters. But knowing what a steering wheel is doesn’t make you a driver. How do you actually get good at this? What separates the people who get brilliant AI answers from the ones who give up after a vague first try? Next, we’ll unpack the five techniques that turn any prompt into a great one — from role-playing to step-by-step thinking. That’s how you graduate from knowing about AI to actually getting what you want from it.


    AI for Common Folks – Making AI understandable, one concept at a time.

  • What Is the Engine Behind ChatGPT, Claude, and Gemini?

    What Is the Engine Behind ChatGPT, Claude, and Gemini?

    A Large Language Model (LLM) is an AI system trained on massive amounts of text to understand and generate human language. Think of it as the world’s most over-prepared new hire, one who read every document on the internet before their first day.

    Hey Common Folks!

    In our last two articles, we covered Foundation Models, the massive general-purpose AI brains, and GPT, the most famous family in that category. We talked about the Swiss Army Knives of AI and the three-letter recipe (Generative, Pre-trained, Transformer) that cracked modern language AI.

    But GPT is just one example of a broader category. Claude, Gemini, Llama, DeepSeek — these are all in the same family. That family is called Large Language Models, or LLMs.

    And LLMs are the specific technology powering every AI chatbot you’ve ever used.

    The best way to understand one is to think of it as a new employee at your company. A very unusual one.


    Meet the New Hire

    Imagine your company just hired someone. Before their first day, they did something no human could do: they read every email, every Slack message, every report, every meeting note, every document your company has ever produced. Not just your company, actually. Every company. Every book. Every website. Every Wikipedia article. Every Reddit thread. Every piece of code on GitHub.

    They didn’t understand all of it the way you would. They didn’t form opinions or have experiences. But they noticed patterns. They noticed that after “Dear” people usually write a name. That after “quarterly revenue increased” people usually write “by” followed by a percentage. That when someone asks “how do I” the next words are usually a task, followed by step-by-step instructions.

    This new hire didn’t memorize facts like a textbook. They memorized how language flows. They can finish anyone’s sentence, in any department, on any topic, because they’ve seen millions of similar sentences before.

    That’s an LLM. That’s the whole idea.


    The World’s Best Sentence Finisher

    At its core, an LLM does one thing: predict the next word. (Technically, it predicts the next token — a small chunk of text that’s usually a word or part of a word. We’ll cover tokens in a future article. For now, “word” is close enough.)

    You actually do this too. If I say: “The capital of India is New…”

    Your brain instantly completes: “Delhi.”

    You didn’t look it up. You’ve seen those words together enough times that the completion is automatic.

    Your phone does this too. You type “I am on my…” and your keyboard suggests “way.” Your phone learned this pattern from your text messages.

    Now scale that up dramatically.

    Your phone looks at the last 3 words to guess the next one. An LLM looks at the last 300,000 words. Your phone learned from your texts. An LLM learned from the entire internet.

    Back to our new hire analogy: imagine asking them to finish this sentence: “Based on our Q3 projections and the current market conditions, the board recommends that we…”

    Because they’ve read millions of similar corporate emails, they know what typically comes next. Not because they understand finance. Because they’ve seen this pattern thousands of times. They’re pattern-matching at a scale no human could match.

    That’s how ChatGPT writes entire paragraphs. One word at a time, each chosen because it’s the most likely continuation of everything before it. Like a new hire who’s so well-read that they can finish any sentence in any department.
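    To see "one word at a time" in action, here is a deliberately tiny sentence finisher. It is much closer to the phone keyboard than to an LLM: it only looks at the last word and simply counts what followed it in a made-up four-sentence training set. The names `CORPUS`, `train`, and `complete` are assumptions for this sketch.

```python
from collections import Counter, defaultdict

# A tiny made-up "training set"; a real model reads vastly more text.
CORPUS = [
    "i am on my way",
    "i am on my way home",
    "i am on my phone",
    "the board recommends that we wait",
]


def train(sentences):
    """Count which word follows each word."""
    followers = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.split()
        for current, nxt in zip(words, words[1:]):
            followers[current][nxt] += 1
    return followers


def complete(followers, start, max_words=10):
    """Extend the text one word at a time with the most frequent follower."""
    words = start.split()
    for _ in range(max_words):
        options = followers.get(words[-1])
        if not options:
            break  # nothing ever followed this word in training
        words.append(options.most_common(1)[0][0])
    return " ".join(words)


model = train(CORPUS)
print(complete(model, "i am on my"))
print(complete(model, "the board"))
```

    An LLM replaces the simple counting with a neural network and looks at far more than the last word, but the outer loop is the same: predict a word, append it, repeat.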


    How the New Hire Follows Conversations

    Here’s where it gets interesting. Early AI systems were terrible at long sentences. Tell them a long story and by the end, they’d forgotten the beginning. Like a new hire who nods along in a meeting but can’t connect what was said in minute one to what’s being discussed in minute thirty.

    Then in 2017, researchers at Google published a breakthrough called the Transformer. (The “T” in GPT stands for Transformer. That’s how fundamental this is.)

    Transformers gave LLMs a superpower called self-attention. Here’s what that means.

    Consider this sentence: “The animal didn’t cross the street because it was too tired.”

    What does “it” refer to? The animal or the street?

    You know “it” means the animal because the animal is “tired.” Streets don’t get tired.

    Before transformers, AI would struggle with this. It read words one by one, left to right, and by the time it got to “it,” the word “animal” was already fading from memory.

    Self-attention changed that. Now the LLM looks at all the words in the sentence at once and draws connections between them. When it hits the word “it,” it checks: what does “it” connect to? It sees “tired” and traces back to “animal,” not “street.” It understands the relationship.

    Back to our new hire: imagine they’re reading a 50-page email thread where someone says “she approved the budget.” Self-attention is how the new hire traces “she” back to the CFO mentioned 30 emails ago, not the intern mentioned 2 emails ago. They can follow references across long, messy conversations.

    This is what lets LLMs understand context, answer follow-up questions, get jokes, and write code that actually makes sense across hundreds of lines. Before transformers, AI was like a new hire reading one word at a time and forgetting the beginning of the email by the end. After transformers, they can hold the entire conversation in their head at once.
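    For the curious, here is a rough sketch of the arithmetic behind self-attention, usually called scaled dot-product attention. The two-number word vectors below are hand-picked assumptions purely for illustration. Real models learn vectors with thousands of numbers and use separate learned projections for queries, keys, and values, which this sketch leaves out.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(query, keys):
    """Scaled dot-product attention weights of one query over all keys."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    return softmax(scores)

# Hypothetical vectors: [living-thing-ness, place-ness]. Not learned, just made up.
words = ["animal", "street", "tired"]
keys = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
query_for_it = [2.0, 0.0]  # "it" is looking for something that can be tired

weights = attention_weights(query_for_it, keys)
for word, weight in zip(words, weights):
    print(f"{word}: {weight:.2f}")
```

    With these toy numbers, “animal” gets the largest weight and “street” the smallest, which is the code version of “it” connecting to the animal rather than the street.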


    Training the New Hire: Three Stages

    Building an LLM like ChatGPT or Claude isn’t one step. It’s an onboarding process with three stages. Just like any new hire goes through orientation before they’re ready to talk to customers.

    Stage 1: The Reading Phase (Pre-Training)

    This is where the new hire reads everything. Terabytes of text. Books, websites, Wikipedia, code, academic papers.

    During this phase, the LLM plays a game: we hide the next word in a sentence and ask it to guess. If it guesses wrong, it adjusts its internal settings. (Remember the chai analogy? Same loop. Predict, check the error, adjust, repeat. Millions of times.)
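    Here is a toy sketch of that predict, check, adjust loop. Assumptions: there are only three candidate words, one dial per word, and a single training sentence. The update rule is ordinary gradient descent on a softmax, which is a standard choice but not necessarily what any particular lab uses in exactly this form.

```python
import math

def softmax(scores):
    """Turn raw dial settings into probabilities that sum to 1."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

vocabulary = ["mat", "moon", "car"]
dials = [0.0, 0.0, 0.0]                   # one adjustable dial per candidate word
correct_index = vocabulary.index("mat")   # hidden word in "the cat sat on the ___"
learning_rate = 0.5                       # how hard each adjustment turns the dials

for step in range(50):
    probabilities = softmax(dials)                 # predict
    for i in range(len(dials)):
        target = 1.0 if i == correct_index else 0.0
        error = probabilities[i] - target          # check the error
        dials[i] -= learning_rate * error          # adjust, then repeat

final = softmax(dials)
print(vocabulary[final.index(max(final))])  # mat
```

    All three words start out equally likely. After fifty rounds of small adjustments, the dials have shifted so that “mat” clearly wins. Scale this up to a huge number of dials and sentences and you have pre-training.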

    Those “internal settings” are called parameters. Think of them as tiny dials. Modern frontier models (GPT-5, Claude 4, Gemini 2) are believed to have trillions of them, though the exact numbers are kept secret. Each dial is like one of the chai recipe settings from our previous article: a small adjustment that slightly changes the output. Together, trillions of tiny dials produce language that sounds remarkably human.

    That’s what “Large” means in Large Language Model. Large = billions to trillions of adjustable dials, trained on a massive amount of text.

    After pre-training, the new hire knows grammar, facts, writing patterns, coding conventions, and the general structure of human communication. But they’re not helpful yet. Ask them a question and they’ll just keep writing, trying to complete the sentence rather than answer you. They’re like a new hire who’s read the entire company wiki but doesn’t know how to have a normal conversation.

    Stage 2: Job Training (Fine-Tuning)

    Now we teach the new hire how to actually do their job.

    We show them thousands of examples of good conversations:

    • Customer asks: “How do I reset my password?”

    • Good response: “Here are the steps: go to Settings, click Security…”

    • Bad response: “…and also how to reset your username and your profile picture and your billing information and…”

    The new hire learns the format: when someone asks a question, give a direct, helpful answer. Don’t ramble. Don’t go off on tangents.

    This is fine-tuning. Same new hire, same knowledge from the reading phase, but now they know how to channel it into a helpful conversation instead of an endless monologue.

    Stage 3: Performance Reviews (Human Feedback)

    The new hire is now having real conversations. But sometimes they’re rude. Sometimes they make things up. Sometimes they give dangerous advice.

    So we bring in human reviewers. They chat with the LLM and rate the responses. Helpful and accurate? Thumbs up. Rude, wrong, or harmful? Thumbs down.

    The model learns: “Humans prefer it when I’m clear, honest, and careful. They don’t like it when I make things up or lecture them.”

    Think of it as ongoing performance reviews. The new hire adjusts their behavior based on what gets positive feedback and what gets complaints.

    This last stage is why ChatGPT and Claude feel different from each other even though they’re both LLMs. Different companies hire different reviewers with different values. Same new hire, different management styles, different workplace cultures.


    When the New Hire Makes Things Up

    Here’s the catch with our new hire. They’re so well-read and so good at sounding confident that sometimes they make things up. And they deliver the fiction with the exact same confidence as the facts.

    This is called hallucination.

    Remember: the LLM predicts the next most likely word. It doesn’t have a database of facts. It doesn’t “look things up.” It generates text that sounds right based on patterns.

    Imagine asking the new hire: “Who designed the Golden Gate Bridge?”

    They’ve read enough about bridges and famous people that they might say: “The Golden Gate Bridge was designed by Thomas Edison in 1932.” That sentence is completely wrong. But it sounds like a fact. It has the right structure, the right confidence, the right rhythm of a true statement.

    The new hire isn’t lying on purpose. They’re doing what they always do: predicting what the most likely next words would be. And sometimes the most likely-sounding answer isn’t the true answer.

    This is the single most important thing to understand about LLMs: they are designed to sound right. Not to be right.

    Often they are right, because patterns in language usually reflect reality. But not always. And you can’t count on them to pause and say “Actually, I’m not sure about this.” Left to themselves, they’ll just keep predicting the next most confident-sounding word.


    Where You Already Use LLMs

    You interact with this new hire more than you realize. They’ve been placed in departments all across your digital life:

    • ChatGPT, Claude, Gemini: The obvious ones. Every conversation is an LLM predicting one word at a time.

    • Email: Gmail’s “Help me write” and Outlook’s Copilot. The new hire is drafting your emails.

    • Code: GitHub Copilot suggests code as developers type. The new hire sits next to every programmer.

    • Search: Google and Bing now use LLMs to summarize search results instead of just showing links. The new hire reads all the results and writes you a summary.

    • Customer service: Many companies have replaced scripted chatbots with LLM-powered support. The new hire handles your complaints now.


    The New Hire’s Limitations (Keeping It Real)

    Our new hire is impressive. But they have real weaknesses you should know about:

    They stopped reading on a specific date. Every LLM has a knowledge cutoff. Ask about yesterday’s news and they genuinely don’t know. It’s like the new hire read everything up to their start date but hasn’t checked the news since. (Some systems work around this by connecting to the internet, but the core model itself is frozen in time.)

    They don’t truly understand. They’re the world’s best pattern matcher, not a thinker. They can sound confident while being completely wrong. They don’t “know” anything the way you know your own name. They know what words usually follow other words. That’s it.

    They’re expensive to keep around. Every response costs computing power. That’s why advanced AI access isn’t free. Running a trillion dials for every single word in every single response adds up fast.

    Their memory has limits. They can only hold so much of the conversation at once. This is called the context window. It’s like the new hire can remember the last hour of conversation clearly but starts forgetting what was said this morning. Long conversations can feel like the AI forgot what you told them earlier, because in a real sense, it did.
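    A small sketch of what a context window does to a long chat. This is a simplification under stated assumptions: it counts words instead of tokens, uses a tiny made-up budget, and simply drops the oldest messages. Real systems measure tokens and may handle overflow differently. The function name fit_to_context_window is hypothetical.

```python
def fit_to_context_window(messages, max_words):
    """Keep the most recent messages whose combined word count fits the budget."""
    kept = []
    used = 0
    for message in reversed(messages):      # walk from newest to oldest
        size = len(message.split())
        if used + size > max_words:
            break                           # everything older is forgotten
        kept.append(message)
        used += size
    return list(reversed(kept))             # restore chronological order

conversation = [
    "My name is Priya and I manage the Mumbai office",
    "Please draft a welcome note for new interns",
    "Make it shorter",
    "Now sign it with my name",
]
print(fit_to_context_window(conversation, max_words=16))
# ['Make it shorter', 'Now sign it with my name']
```

    Notice what got cut: the very first message, the one that contained the name. The model is asked to “sign it with my name” but the name is no longer in view.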


    The Takeaway

    A Large Language Model is the engine powering the AI revolution you’re living through right now.

    It’s a new hire who read the entire internet before day one. They predict the next word, one word at a time, with a confidence that makes it look like understanding. They went through reading (pre-training), job training (fine-tuning), and performance reviews (human feedback) to become the helpful assistant you chat with today.

    They’re extraordinary at sounding human. They’re terrible at knowing when they’re wrong. And they’re sitting in more of your apps than you probably realized.

    Under the hood, it’s the same loop you learned about in our How AI Actually Learns article. Predict, check, adjust, repeat. Just with trillions of dials instead of four chai settings.

    Coming Up

    Now you know what the engine is. But here’s a subtle truth: the same LLM can give you a brilliant answer or a useless one depending entirely on how you ask. That little box where you type your question? It has a name — the prompt — and the words you put in it are the steering wheel of the entire engine. Next, we’ll break down what a prompt actually is and why it matters more than most people realize.


    AI for Common Folks – Making AI understandable, one concept at a time.


  • OpenAI’s $20B Cerebras Deal, GPT-Rosalind, Robot Learns on Its Own

    OpenAI’s $20B Cerebras Deal, GPT-Rosalind, Robot Learns on Its Own

    Good morning. OpenAI just doubled down on Cerebras with a chip deal that could reach $30 billion. The company also launched an AI model designed specifically for biology and drug discovery, and a robotics startup showed a robot brain that can figure out tasks nobody ever taught it. Here’s what happened 👇


    1. OpenAI Doubles Its Cerebras Chip Deal to Over $20 Billion

    OpenAI has agreed to pay chip startup Cerebras more than $20 billion over the next three years for servers powered by Cerebras chips, according to The Information. That is double the $10 billion commitment the two companies announced in January. The deal also includes warrants that could give OpenAI up to a 10% equity stake in Cerebras as spending increases, plus $1 billion from OpenAI to help fund Cerebras data centers. Total spending over three years could reach $30 billion.

    Cerebras, which makes wafer-scale engine chips that compete with Nvidia’s GPUs, is preparing an IPO in the second quarter at a valuation of roughly $35 billion. OpenAI CEO Sam Altman is an early investor. The deal is the clearest signal yet that OpenAI is building a chip supply chain that does not depend entirely on Nvidia.

    Why it matters: Every AI company on Earth is fighting for access to the same pool of Nvidia chips. By locking in $20 billion or more with Cerebras, OpenAI is hedging that dependence and, through its equity stake, turning a supplier relationship into a strategic investment. If Cerebras succeeds, OpenAI owns a piece of the alternative chip ecosystem. If you use ChatGPT, the speed and cost of every answer you get is shaped by which chips are running it. We broke down foundation models, the brains that run on these chips, in our AI Explained series → What Are Foundation Models?

    Source: Reuters


    2. OpenAI Launches GPT-Rosalind, a Biology-Tuned AI for Drug Discovery

    OpenAI introduced GPT-Rosalind on Thursday, an AI model built specifically for life sciences research. Named after Rosalind Franklin, the scientist whose X-ray crystallography work was central to discovering DNA’s structure, the model is designed to help researchers with evidence synthesis, hypothesis generation, experimental planning, and other multi-step research tasks. It can query databases, read the latest scientific papers, suggest new experiments, and connect to over 50 scientific tools through a free Codex plugin.

    OpenAI said it is already working with Amgen, Moderna, and Thermo Fisher Scientific to apply GPT-Rosalind across their workflows. The model is available as a research preview through OpenAI’s trusted access deployment structure.

    Why it matters: Drug discovery typically takes over a decade and costs billions. If an AI model can meaningfully accelerate the early stages of research, even by months, the downstream impact on which drugs reach your pharmacy shelves is enormous. This is also OpenAI’s second specialized model in one week, after GPT-5.4-Cyber for cybersecurity. The company is clearly betting that the future of AI is not one model that does everything, but specialized models tuned for high-stakes fields.

    Source: Reuters


    3. This Robot Brain Can Figure Out Tasks Nobody Taught It

    Physical Intelligence, a San Francisco robotics startup valued at $5.6 billion, published research on Thursday showing that its latest model can direct robots to perform tasks they were never explicitly trained on. The model, called π0.7, demonstrated what researchers call “compositional generalization,” the ability to combine skills learned in different contexts to solve new problems. In one test, the robot figured out how to use an air fryer despite having only two barely relevant examples in its entire training dataset. With verbal coaching from a human walking it through the steps, it succeeded.

    The π0.7 model matched the performance of purpose-built specialist models across complex tasks including making coffee, folding laundry, and assembling boxes. The company is reportedly in talks to raise at an $11 billion valuation.

    Why it matters: Until now, training a robot meant collecting data on each specific task and building a model for that task alone. If robots can start remixing skills the way language models remix words, it changes the economics of automation entirely. A warehouse, a hospital, or a restaurant would not need a different robot for every job. They would need one that can be coached. We explained how AI systems learn from data, including the foundations that make this kind of generalization possible, in our AI Explained series → How AI Actually Learns

    Source: TechCrunch


    4. The White House Plans to Give Federal Agencies Access to Anthropic’s Mythos

    The U.S. government is preparing to make a version of Anthropic’s Mythos model available to major federal agencies, Bloomberg News reported. Gregory Barbaccia, the federal chief information officer, emailed Cabinet department officials on Tuesday that the Office of Management and Budget was setting up protections to allow agencies to begin using the model. “We’re working closely with model providers, other industry partners, and the intelligence community to ensure the appropriate guardrails and safeguards are in place,” Barbaccia said.

    Separately, Anthropic CEO Dario Amodei is scheduled to meet White House chief of staff Susie Wiles on Friday, Axios reported, signaling a possible breakthrough in Anthropic’s ongoing dispute with the Pentagon.

    Why it matters: The same model that five major financial regulators spent the past two weeks scrutinizing for cybersecurity risk is now being prepared for use by the very government agencies responsible for protecting critical infrastructure. That is not a contradiction. It is the same logic that drives every advanced weapons system: if something is this powerful, you want your own people to have it first. The Mythos saga is becoming the clearest real-world test case for how governments handle AI models that are simultaneously a defensive tool and a potential threat.

    Source: Reuters | Source: Reuters


    Quick Hits

    • AI traffic to US retail websites jumped 393% in Q1, and shoppers arriving via AI now convert 42% better than non-AI visitors, according to Adobe data. A year ago, AI traffic converted 38% worse. The turnaround is massive. Source: TechCrunch

    • Anthropic’s chief product officer left Figma’s board after reports that Anthropic plans to offer a competing design product. Source: TechCrunch

    • Mozilla launched Thunderbolt, a new AI client focused on self-hosted infrastructure, built on the open-source Haystack framework toward what it calls a “decentralized open source AI ecosystem.” Source: Ars Technica


    That’s it for today. OpenAI is spending like a company that believes compute will be the oil of the next decade, and it is not just buying chips but buying into the companies that make them. Meanwhile, the race to put AI into biology labs, robot arms, and government agencies is accelerating at a pace that makes last year’s “will AI be useful?” debate feel like ancient history.

    Forward this to someone who needs to stay in the loop.


  • What Does GPT Actually Stand For and How Does It Work?

    What Does GPT Actually Stand For and How Does It Work?

    GPT stands for Generative Pre-trained Transformer — a family of AI models built by OpenAI that powers ChatGPT and defined the modern era of AI.

    AI for Common Folks
    Apr 2026

    Hey Common Folks!

    In our last article on Foundation Models, we talked about the general-purpose brains that power modern AI — the Swiss Army Knives trained to do everything from writing code to drafting emails. Before that, we explored Generative AI, the broad category of AI that creates new content.

    Now let’s zoom in on the most famous Foundation Model family of them all: GPT.

    You see it everywhere. GPT-4, GPT-5, ChatGPT. But what do those three letters actually stand for? Is it a robot? A company? A magic spell?

    Here’s the real story: GPT is not just an acronym. It is three separate breakthroughs in AI that had never been combined at massive scale. OpenAI put them together, and that combination is why modern AI works.

    Let’s unpack each one.

    What is GPT?

    GPT stands for Generative Pre-trained Transformer.

    It is a specific type of Large Language Model (LLM) developed by OpenAI. If AI is the broad industry, GPT is a specific product line, like the “iPhone” of AI models.

    But here is the part nobody tells you: each of those three words (Generative, Pre-trained, Transformer) represents a problem that AI researchers had been stuck on for decades. GPT is the name for what happened when the three solutions were finally combined in one model.

    Before GPT: Three Problems AI Couldn’t Crack

    To understand why GPT matters, you have to understand what AI looked like before it existed.

    For most of AI’s history (roughly the 1950s through the 2010s), researchers were stuck on three problems simultaneously:

    1. AI could classify, but it couldn’t create. It could tell you if an email was spam, but it couldn’t write an email.

    2. AI had to be trained from scratch for every task. Want translation? Build a translation model. Want summarization? Build a summarization model. One model, one job, always starting from zero.

    3. AI could only read one word at a time. The dominant technology of the day (called RNNs and LSTMs) processed text sequentially, like reading a book strictly left to right. It was slow, and by the end of a long sentence, it had often forgotten the beginning.

    Every single letter in “GPT” was an answer to one of these problems. Let’s take them one by one.

    1. G is for Generative: The Shift from “Classify” to “Create”

    This is the easy part to say, but the hardest to appreciate.

    What it means: GPT can create new content. Essays, code, poetry, emails. It generates output that didn’t exist before.

    Why it’s a big deal: For decades, AI was a world of “yes/no” answers. Is this spam? Is this a cat or a dog? Does this customer churn? These are classification tasks. AI looks at something and puts it in a bucket.

    Creating something new from scratch (a paragraph, a story, a working function of code) was considered nearly impossible. Language is infinite. There are more possible sentences than atoms in the universe. How would an AI pick a good one?

    The Generative approach said: don’t pick the “right” sentence. Generate it word by word, always predicting the most likely next word given what came before. Do that billions of times, and coherent writing emerges.
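    As a sketch, the word-by-word loop looks something like this. The probability table is entirely made up, and it only conditions on the previous word. A real model computes fresh probabilities from everything written so far and usually samples among likely words rather than always taking the top one.

```python
# Hypothetical next-word probabilities; a real model computes these on the fly.
NEXT_WORD_PROBS = {
    "<start>": {"the": 0.9, "a": 0.1},
    "the": {"cat": 0.6, "dog": 0.4},
    "cat": {"sat": 0.7, "ran": 0.3},
    "sat": {"down": 0.8, "<end>": 0.2},
    "down": {"<end>": 1.0},
}

def generate(max_words=10):
    """Build a sentence one word at a time, always taking the likeliest next word."""
    words = []
    current = "<start>"
    for _ in range(max_words):
        # Unknown words simply end the sentence in this toy version.
        options = NEXT_WORD_PROBS.get(current, {"<end>": 1.0})
        current = max(options, key=options.get)
        if current == "<end>":
            break
        words.append(current)
    return " ".join(words)

print(generate())  # the cat sat down
```

    Nobody wrote the sentence “the cat sat down” into the table. It falls out of repeatedly asking “what comes next?” and appending the answer.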

    That sounds simple. It is also the shift that took AI from “recognizing patterns in data” to “creating patterns that look human.”

    2. P is for Pre-trained: The Free Labels Trick

    This one is the real genius, and most explanations skip it.

    What it means: Before GPT is ever asked to do anything useful, it has already read a massive amount of text. Books, Wikipedia, websites, articles, code. That’s the “pre” in pre-trained.

    Why it’s a big deal: Traditional AI needed labeled data. To teach AI to spot spam, humans had to label millions of emails as “spam” or “not spam.” To teach it to tell cats from dogs, humans had to label millions of photos. Labeled data is expensive, slow, and limited.

    Pre-training flipped the entire problem on its head with one insight:

    If the task is “predict the next word,” the internet is already labeled. The label is just the next word.

    Read “The cat sat on the ___” and the correct answer is whatever word came next in the original sentence. No humans needed. The data labels itself. And the internet has trillions of words.
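    Here is a minimal sketch of how raw text turns into labeled training examples with no human involved. The three-word context size and the function name make_training_pairs are assumptions for illustration. Real pipelines work on tokens and much longer contexts.

```python
def make_training_pairs(text, context_size=3):
    """Slide over raw text and emit (context words, next word) examples."""
    words = text.split()
    pairs = []
    for i in range(context_size, len(words)):
        context = words[i - context_size:i]
        label = words[i]  # the "label" is simply the word that came next
        pairs.append((context, label))
    return pairs

for context, label in make_training_pairs("The cat sat on the mat"):
    print(" ".join(context), "->", label)
```

    One six-word sentence yields three training examples for free. Run the same idea over a large slice of the internet and the supply of examples is enormous.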

    Suddenly, AI had unlimited training data. GPT-3 was trained on roughly 570 GB of filtered text, pulled from an even larger 45 TB of raw internet data. Later models like GPT-4 and GPT-5 used dramatically more. That scale would have been unimaginable with human-labeled data.

    Think of Pre-training as a student reading every book in the library to learn general knowledge. Later, this student can be Fine-tuned (specialized training for a specific job) to become a doctor, a coder, or a chatbot. But the broad education comes first, and it comes from the text itself.

    3. T is for Transformer: Seeing All the Words at Once

    What it means: The Transformer is a specific type of Neural Network architecture introduced by Google researchers in 2017, in a paper famously titled “Attention Is All You Need.”

    Why it’s a big deal: Before Transformers, AI read sentences one word at a time, sequentially. This was slow, and the model often forgot the beginning of a long sentence by the time it reached the end. It also meant you couldn’t spread the work across thousands of chips in parallel, which put a hard ceiling on how big these models could get.

    Transformers introduced two superpowers:

    1. Parallel Processing: They look at all the words in a sentence simultaneously, rather than one by one. This makes them dramatically faster and, critically, scalable to billions of parameters. Without Transformers, no amount of compute could have produced GPT-3 or GPT-4.

    2. Self-Attention: They figure out which words in a sentence relate to each other. In “The bank of the river,” the Transformer pays attention to “river” to know that “bank” means land. In “The bank approved my loan,” it pays attention to “loan” to know bank means the financial kind. Same word, different meaning, figured out from context.

    Self-attention is what gave AI something that looks like understanding context. It is the single architectural idea that made modern AI possible.

    Why the Combination Changed Everything

    Here’s the thing nobody emphasizes enough: each of these three ideas existed on its own before GPT.

    • Researchers had built generative models before.

    • Unsupervised pre-training had been explored in smaller forms.

    • The Transformer paper was published by Google, not OpenAI.

    What OpenAI did was combine all three at massive scale. GPT-1 in 2018 showed the recipe could work. GPT-2 in 2019 showed it could write coherently. GPT-3 in 2020 was the moment the world saw what happens when you push this recipe to billions of parameters (175 billion, in GPT-3’s case): the model started doing things it was never explicitly trained to do. Reasoning. Translation. Summarization. Rudimentary code generation. Researchers call these emergent abilities: capabilities that appear, seemingly out of nowhere, once the model gets big enough.

    ChatGPT in late 2022 was when the public caught on.

    So when someone says “GPT changed AI,” they are not being dramatic. The specific combination of Generative + Pre-trained + Transformer at scale is the recipe that broke a decades-long logjam.

    GPT vs. ChatGPT

    Are they the same thing? No.

    Here is the best analogy to understand the difference:

    Think of a Laptop.

    • GPT is the Processor (like Intel or Apple Silicon): It is the raw brainpower and technology that does the thinking.

    • ChatGPT is the Laptop (like a MacBook or Dell XPS): It is the product wrapped around that processor with a screen and keyboard (an interface) that allows you to interact with it easily.

    GPT is the model; ChatGPT is the application built using the GPT model.

    The “Decoder” Secret

    If you want to sound extra smart, know this: the Transformer architecture originally came with two parts, an Encoder (to understand input) and a Decoder (to generate output).

    GPT models are actually Decoder-only models. They dropped the Encoder entirely. They are specialists in generating text: predict the next token, then the next, then the next, until they have built a whole sentence.

    Different AI systems use different slices of the Transformer architecture. Google’s original BERT was Encoder-only (great for understanding and search). GPT is Decoder-only (great for generating). That single design choice is a big part of why GPT models feel so fluent when they write.

    The Takeaway

    You didn’t just learn what an acronym stands for. You learned the three ingredients that made modern AI possible:

    • Generative: AI stopped classifying and started creating.

    • Pre-trained: The internet itself became the training data, no humans needed to label it.

    • Transformer: AI stopped reading one word at a time and started seeing the whole picture at once.

    Each of these had been tried separately. Combining them at scale, between 2018 and 2020, is what OpenAI did. And it is the reason “GPT” became shorthand for modern AI.

    The next time someone says “we’re in the GPT era,” you’ll know they don’t mean an acronym. They mean a recipe.

    Coming Up

    You now know what GPT stands for. But here is a subtle point we glossed over: GPT is just one example of a broader category called Large Language Models (LLMs). Claude, Gemini, Llama, and DeepSeek are LLMs too. So what exactly is an LLM, and why is it the engine behind every chatbot you use? In our next article, we’ll break down the engine behind ChatGPT, Claude, and Gemini and show you why LLMs are the defining technology of this decade.


    AI for Common Folks – Making AI understandable, one concept at a time.
