AI Writing Assistants

Jun 01, 2026 · 3 min read

ELI5 = Explain It Like I'm 5 · IRL = In Real Life · TMI = Too Much Information

What’s actually happening under the hood

Large language models (LLMs) are next-token predictors. That’s the entire mechanism. Given a sequence of tokens, the model outputs a probability distribution over every token in its vocabulary, then samples from that distribution to pick the next one. Repeat until done.
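The loop described above can be sketched in a few lines. This is a toy illustration, not how any real model is implemented: TOY_MODEL is a hypothetical hand-written lookup table standing in for the neural network, and it only looks at the previous token, whereas a real LLM conditions on the whole sequence.

```python
import random

# Hypothetical toy "model": maps the previous token to a probability
# distribution over a tiny vocabulary. A real LLM computes this with a
# neural network conditioned on the entire sequence.
TOY_MODEL = {
    "<start>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.5, "dog": 0.5},
    "a": {"cat": 0.5, "dog": 0.5},
    "cat": {"sleeps": 0.7, "<end>": 0.3},
    "dog": {"sleeps": 0.7, "<end>": 0.3},
    "sleeps": {"<end>": 1.0},
}

def next_token_distribution(tokens):
    """Return {token: probability} for the next position."""
    return TOY_MODEL[tokens[-1]]

def generate(max_tokens=10, seed=None):
    rng = random.Random(seed)
    tokens = ["<start>"]
    for _ in range(max_tokens):
        dist = next_token_distribution(tokens)
        candidates = list(dist.keys())
        weights = list(dist.values())
        # Sample one token in proportion to its probability.
        token = rng.choices(candidates, weights=weights, k=1)[0]
        if token == "<end>":
            break
        tokens.append(token)
    return tokens[1:]  # drop the <start> marker

print(" ".join(generate(seed=0)))
```

Calling generate() with different seeds gives different sentences from the same distribution, which is the "sample, append, repeat" behavior in miniature.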

“Token” ≠ word. Tokens are subword units: “unbelievable” might be split into “un”, “believ”, and “able”. English text typically runs around 0.75 words per token, so a 1,000-word post is roughly 1,300 tokens.
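The arithmetic is easy to check. A rough sketch, assuming the 0.75 words-per-token rule of thumb holds (the real ratio depends on the tokenizer and the language); the function names are made up for illustration:

```python
# Rule of thumb from the text; varies by tokenizer and language.
WORDS_PER_TOKEN = 0.75

def estimate_tokens(word_count):
    """Approximate token count for a given number of English words."""
    return round(word_count / WORDS_PER_TOKEN)

def estimate_words(token_count):
    """Approximate English word count that fits in a given number of tokens."""
    return round(token_count * WORDS_PER_TOKEN)

print(estimate_tokens(1000))     # 1333
print(estimate_words(128_000))   # 96000
```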

The context window

Every call to the model operates on a fixed context window — the full sequence of tokens the model can “see” at once. As of mid-2025, frontier models range from ~128K tokens (roughly 100K words) to over 1M tokens.

The context includes:

The system prompt (instructions set by the app)
The entire conversation history so far
Your current message

When the window fills, older content either falls off (sliding window) or gets summarized, depending on the implementation. This is why long conversations sometimes feel like the model “forgot” something from earlier.
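Here is a minimal sketch of the sliding-window case. Everything in it is an assumption for illustration: count_tokens is a crude word counter standing in for a real tokenizer, fit_to_window is a hypothetical helper, and real apps differ in what they choose to drop.

```python
def count_tokens(text):
    # Crude stand-in for a real tokenizer: one token per whitespace-separated word.
    return len(text.split())

def fit_to_window(system_prompt, history, current_message, max_tokens):
    """Drop the oldest history messages until everything fits in max_tokens."""
    budget = max_tokens - count_tokens(system_prompt) - count_tokens(current_message)
    kept = []
    # Walk backwards so the most recent messages are kept first.
    for message in reversed(history):
        cost = count_tokens(message)
        if cost > budget:
            break
        kept.append(message)
        budget -= cost
    kept.reverse()
    return [system_prompt] + kept + [current_message]

system = "You are a helpful assistant"
history = [
    "my name is Ada",
    "nice to meet you Ada",
    "what is a token",
    "a token is a subword unit",
]
current = "what is my name"

window = fit_to_window(system, history, current, max_tokens=20)
print(window)
```

With a 20-token budget, the two oldest messages no longer fit, so the model never sees the name it is being asked about.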

Temperature and sampling

The probability distribution at each step is shaped by a temperature parameter, which divides the model’s raw scores (logits) before they are converted into probabilities:

Temperature = 0: Always pick the highest-probability token. Deterministic, repetitive, safe.
Temperature = 1: Sample proportionally to the raw probabilities. More varied, more creative.
Temperature > 1: Flatten the distribution, so low-probability tokens get picked more often. Gets weird fast.

Many writing tools set temperature somewhere between 0.7 and 1.0. A “more creative” setting usually just means sampling at a higher temperature.
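To see the effect, here is a small sketch of temperature-scaled softmax. The logits are made-up numbers, and treating temperature 0 as "pick the top token" is a common convention assumed here rather than something every implementation does the same way.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities, scaled by temperature."""
    if temperature <= 0:
        # Temperature 0 is treated as greedy decoding: all mass on the top score.
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]  # hypothetical scores for three candidate tokens
for t in (0, 0.7, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

As the temperature rises, probability moves away from the top token and toward the unlikely ones.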

Why hallucination happens

The model has no mechanism for distinguishing “I know this” from “I’m inferring this.” It predicts plausible continuations of text. A plausible continuation of “The capital of France is” is “Paris.” A plausible continuation of “The CEO of [obscure company] is” might be a name that sounds right but isn’t.

The model doesn’t know it’s wrong. It produced a high-probability continuation given its training. This is a fundamental property of the architecture, not a bug that will be patched away.

Mitigation strategies: retrieval-augmented generation (RAG), grounding responses against a known document corpus, asking the model to cite sources, and just… fact-checking outputs on anything that matters.
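A bare-bones sketch of the RAG idea: find the most relevant document, then put it in the prompt and tell the model to stick to it. The word-overlap scoring is a deliberate simplification assumed for this example (production systems typically use embedding search), and all function names and documents are hypothetical.

```python
def tokenize(text):
    # Lowercase and strip basic punctuation so "France?" matches "France."
    return set(text.lower().replace("?", "").replace(".", "").split())

def retrieve(question, documents, top_k=1):
    """Rank documents by how many words they share with the question."""
    q_words = tokenize(question)
    scored = sorted(documents, key=lambda d: len(q_words & tokenize(d)), reverse=True)
    return scored[:top_k]

def build_grounded_prompt(question, documents, top_k=1):
    """Build a prompt that asks the model to answer only from retrieved sources."""
    context = "\n".join(f"- {doc}" for doc in retrieve(question, documents, top_k))
    return (
        "Answer using only the sources below. "
        "If the answer is not in them, say you don't know.\n"
        f"Sources:\n{context}\n"
        f"Question: {question}\n"
        "Answer:"
    )

DOCS = [
    "Paris is the capital of France.",
    "Tokens are subword units.",
    "Temperature controls sampling randomness.",
]
prompt = build_grounded_prompt("What is the capital of France?", DOCS)
print(prompt)
```

The model still just predicts a continuation, but now the most plausible continuation is one that agrees with the source sitting right there in its context.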

Prompt engineering, briefly

Prompts are just the beginning of the input sequence. Everything that follows is prediction. This means:

Role prompting (“You are an expert in X”) shifts the distribution toward expert-sounding continuations.
Few-shot examples show the model the pattern you want before asking it to continue that pattern.
Chain-of-thought (“think step by step”) makes the model generate reasoning tokens before the answer tokens, which has been shown to improve accuracy on many reasoning tasks.
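All three techniques come down to string assembly. A sketch with a hypothetical build_prompt helper; the "Input:"/"Output:" labels and the exact wording are arbitrary choices for this example, not a required format.

```python
def build_prompt(task, examples, query, role=None, chain_of_thought=False):
    """Assemble a prompt: optional role, few-shot examples, then the query."""
    lines = []
    if role:
        lines.append(f"You are {role}.")
    lines.append(task)
    for source, target in examples:
        lines.append(f"Input: {source}")
        lines.append(f"Output: {target}")
    lines.append(f"Input: {query}")
    if chain_of_thought:
        lines.append("Think step by step before giving the final output.")
    # End on the label so the most likely continuation is the answer itself.
    lines.append("Output:")
    return "\n".join(lines)

prompt = build_prompt(
    task="Rewrite each sentence in plain English.",
    examples=[
        ("Utilize the aforementioned tool.", "Use that tool."),
        ("Commence the procedure.", "Start the procedure."),
    ],
    query="Terminate the application.",
    role="an expert copy editor",
)
print(prompt)
```

The prompt ends mid-pattern on purpose: the model has seen two completed Input/Output pairs, so completing the third is the most plausible thing to do next.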

Training in 20 seconds

Pre-training: show the model trillions of tokens of internet text, have it predict the next token, backpropagate the error, update the weights. Repeat, one batch at a time, for a very large number of update steps.
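The predict, measure error, update loop fits in a toy example. This sketch assumes a bigram model (the prediction depends only on the previous token) trained on a twelve-word corpus with plain gradient descent; real pre-training uses a deep network, backpropagation through many layers, and vastly more data.

```python
import math

corpus = "the cat sat on the mat the cat sat on the hat".split()
vocab = sorted(set(corpus))
index = {tok: i for i, tok in enumerate(vocab)}
V = len(vocab)

# The "weights": one row of logits per previous token (a bigram model).
weights = [[0.0] * V for _ in range(V)]

def softmax(row):
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def train_epoch(learning_rate=0.5):
    """One pass over the corpus; returns the average next-token loss."""
    total_loss = 0.0
    for prev, nxt in zip(corpus, corpus[1:]):
        p, n = index[prev], index[nxt]
        probs = softmax(weights[p])
        total_loss += -math.log(probs[n])  # cross-entropy for this prediction
        for j in range(V):
            # Gradient of cross-entropy w.r.t. the logits: probs - one_hot(target).
            grad = probs[j] - (1.0 if j == n else 0.0)
            weights[p][j] -= learning_rate * grad
    return total_loss / (len(corpus) - 1)

losses = [train_epoch() for _ in range(50)]
print(round(losses[0], 3), "->", round(losses[-1], 3))

# After training, the most likely token after "cat" should be "sat".
row = softmax(weights[index["cat"]])
print(vocab[row.index(max(row))])
```

The loss never reaches zero here because “the” is followed by different words in the corpus, so some uncertainty is built into the data.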

Fine-tuning: take the pre-trained model, show it examples of the specific behavior you want (Q&A pairs, instruction-following, etc.), and run the same predict-and-update loop on them.

RLHF (Reinforcement Learning from Human Feedback): have humans rank model outputs, train a reward model on those rankings, use that reward model to fine-tune further. This is the step that turns “plausible text predictor” into “assistant that tries to be helpful.”
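For the reward-model step, one common way to learn from rankings is a pairwise loss: the reward model is penalized whenever it scores the human-preferred output lower than the rejected one. That choice of loss is an assumption here, since setups vary, and the reward values below are made-up numbers rather than outputs of a real model.

```python
import math

def pairwise_ranking_loss(reward_preferred, reward_rejected):
    """-log(sigmoid(r_preferred - r_rejected)): small when the ranking is respected."""
    margin = reward_preferred - reward_rejected
    # Equivalent to log(1 + exp(-margin)), written to avoid overflow for large |margin|.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# Hypothetical reward scores for two outputs a human has ranked.
print(round(pairwise_ranking_loss(2.0, 0.0), 3))  # reward model agrees with the human
print(round(pairwise_ranking_loss(0.0, 2.0), 3))  # reward model disagrees
```

Minimizing this loss over many ranked pairs pushes the reward model to score preferred outputs higher, and that score is what the later fine-tuning stage tries to maximize.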

The weights (billions of floating-point numbers) are the model. Everything it “knows” is encoded in those weights from training. It learns nothing new at inference time (absent tools or RAG).