Most explanations of transformers quickly jump into complex diagrams and equations. You see terms like attention, embeddings, Q, K, V - and it all feels abstract.
So instead of only explaining it theoretically, we built one.
A tiny transformer. From scratch. In TypeScript.
This post walks through what we built and, more importantly, what each part is trying to do.
The Goal
We want a system that can do this:
"React uses" -> predict -> "state"
That is it.
Everything in a transformer exists just to make this prediction better.
Step 1: Text -> Numbers
Computers do not understand text directly, so we convert words into numbers.
React -> [r1, r2, r3]
uses -> [u1, u2, u3]
This is called an embedding.
In our code:
const embedding = new Embedding(vocabSize, 3);
const vectors = embedding.forward(ids);
Shape:
2 tokens -> 2 x 3 matrix
Each row represents one word.
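The Embedding class above can be sketched roughly like this (the random initialization and the vocabSize of 10 are stand-ins for illustration, not the project's actual values):

```typescript
// A lookup table mapping token ids to learned vectors.
class Embedding {
  table: number[][];
  constructor(vocabSize: number, dim: number) {
    // Small random init; real training would update these values.
    this.table = Array.from({ length: vocabSize }, () =>
      Array.from({ length: dim }, () => Math.random() * 0.1)
    );
  }
  forward(ids: number[]): number[][] {
    // One row per token id: 2 ids -> a 2 x dim matrix.
    return ids.map((id) => this.table[id]);
  }
}

const embedding = new Embedding(10, 3);
const vectors = embedding.forward([0, 1]); // 2 tokens -> 2 x 3
```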
Step 2: Attention (Words Look at Each Other)
Words do not exist in isolation.
The word “uses” can mean different things depending on context, so we let words look at each other.
For example in:
React uses
- “uses” looks at “React”
- “React” looks at “uses”
This is the core intuition behind attention.
Step 3: Q, K, V (Three Views of the Same Word)

Each word vector is transformed into three versions:
- Q = what I am looking for
- K = what I can be matched against
- V = the information I pass along
In code:
const Q = matMul(input, Wq);
const K = matMul(input, Wk);
const V = matMul(input, Wv);
All three come from the same input, but each uses a different learned weight matrix.
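A minimal matMul plus toy values makes this concrete (the numbers in input and Wq below are made up for illustration; the project's matMul may differ in detail):

```typescript
// Plain matrix multiply: (n x k) times (k x m) -> (n x m).
function matMul(a: number[][], b: number[][]): number[][] {
  return a.map((row) =>
    b[0].map((_, j) => row.reduce((sum, val, k) => sum + val * b[k][j], 0))
  );
}

// Two token vectors (2 x 3) times a learned weight matrix (3 x 2) -> Q (2 x 2).
const input = [
  [1, 0, 1],
  [0, 1, 0],
];
const Wq = [
  [1, 0],
  [0, 1],
  [1, 1],
];
const Q = matMul(input, Wq); // [[2, 1], [0, 1]]
```

K and V are produced the same way, just with their own weight matrices Wk and Wv.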
Step 4: Compare Words
We compare words using:
scores = Q x K^T
This gives pairwise relevance scores like:
- React vs React
- React vs uses
- uses vs React
- uses vs uses
Each value tells how relevant one word is to another.
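Here is a sketch with toy Q and K values (note that full-size transformers usually also divide these scores by the square root of the key dimension; a tiny version can skip that):

```typescript
// Pairwise relevance: scores[i][j] = dot(Q[i], K[j]), i.e. Q x K^T.
function attentionScores(Q: number[][], K: number[][]): number[][] {
  return Q.map((q) =>
    K.map((k) => q.reduce((sum, v, d) => sum + v * k[d], 0))
  );
}

const Q = [[1, 0], [0, 1]]; // queries for "React", "uses" (made-up values)
const K = [[1, 1], [0, 2]]; // keys
const scores = attentionScores(Q, K);
// scores[0] = [React vs React, React vs uses] = [1, 0]
// scores[1] = [uses vs React,  uses vs uses]  = [1, 2]
```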
Step 5: Turn Scores into Importance
Raw scores are hard to interpret, so we normalize them with softmax.
[2, 1] -> roughly [0.73, 0.27]
Now each row becomes attention weights, which act like importance percentages.
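A typical softmax implementation looks like this (subtracting the max first is a standard numerical-stability trick, which a tiny version may or may not bother with):

```typescript
// Softmax: exponentiate, then normalize so the values sum to 1.
function softmax(row: number[]): number[] {
  const max = Math.max(...row); // subtract max for numerical stability
  const exps = row.map((x) => Math.exp(x - max));
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / sum);
}

const weights = softmax([2, 1]);
// roughly [0.73, 0.27] — they sum to 1 and act like percentages
```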
Step 6: Mix Information
We combine information with:
output = weights x V
So each word becomes:
original meaning + context from other words
That is how the representation becomes context-aware.
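Sketched with toy numbers (the function name mix is ours, just to label the weights x V step; the values are made up):

```typescript
// Mix values: each output row is a weighted average of all V rows.
function mix(weights: number[][], V: number[][]): number[][] {
  return weights.map((wRow) =>
    V[0].map((_, d) => wRow.reduce((sum, w, j) => sum + w * V[j][d], 0))
  );
}

const weights = [
  [0.73, 0.27], // "React" attends mostly to itself
  [0.27, 0.73], // "uses" attends mostly to itself
];
const V = [
  [1, 0], // value vector for "React"
  [0, 1], // value vector for "uses"
];
const out = mix(weights, V);
// out[0] is [0.73, 0.27]: mostly "React", with some "uses" blended in
```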
Step 7: Feedforward (Thinking Step)
After words communicate, each token goes through a small neural network:
vector -> expand -> filter (ReLU) -> shrink
In code:
const hidden = relu(matMul(input, W1));
const output = matMul(hidden, W2);
This helps the model learn richer patterns, not just blend tokens.
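Putting the pieces together with stand-in weights (W1 expands 2 dims to 4, W2 shrinks back to 2; all values here are made up for illustration):

```typescript
// Elementwise ReLU over a matrix: negatives become 0 (the "filter" step).
function relu(m: number[][]): number[][] {
  return m.map((row) => row.map((x) => Math.max(0, x)));
}

function matMul(a: number[][], b: number[][]): number[][] {
  return a.map((row) =>
    b[0].map((_, j) => row.reduce((s, v, k) => s + v * b[k][j], 0))
  );
}

const input = [[1, -1]]; // one token vector
const W1 = [
  [1, 0, 1, 0],
  [0, 1, 0, 1],
];
const W2 = [
  [1, 0],
  [0, 1],
  [1, 0],
  [0, 1],
];
const hidden = relu(matMul(input, W1)); // [[1, 0, 1, 0]] — the -1 parts were filtered
const output = matMul(hidden, W2);      // [[2, 0]]
```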
Step 8: Residual Connections
Instead of replacing data at each stage, we add the new signal to the original:
output = input + attention_output
output = output + feedforward_output
This preserves base meaning and improves training stability.
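In code, a residual connection is just an elementwise add (a minimal sketch):

```typescript
// Residual connection: add the new signal to the original, elementwise.
function addResidual(input: number[][], update: number[][]): number[][] {
  return input.map((row, i) => row.map((x, j) => x + update[i][j]));
}

const input = [[1, 2]];
const attentionOutput = [[0.5, -0.5]];
const out = addResidual(input, attentionOutput); // [[1.5, 1.5]]
```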
Step 9: Output Layer (Numbers -> Words Again)
Finally, we map vectors back to vocabulary scores:
vector -> logits -> probabilities
Example:
react -> 0.1
uses -> 0.2
state -> 0.7
The model predicts “state”.
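A sketch of that final step, using a softmax over the logits (the logit values here are made-up numbers chosen so that "state" wins):

```typescript
function softmax(row: number[]): number[] {
  const max = Math.max(...row);
  const exps = row.map((x) => Math.exp(x - max));
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / sum);
}

const vocab = ["react", "uses", "state"];
const logits = [0.1, 0.8, 2.0]; // one raw score per vocabulary word
const probs = softmax(logits);
const predicted = vocab[probs.indexOf(Math.max(...probs))];
// predicted === "state" — the highest-probability word wins
```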
Step 10: Training (Learning)
We measure error with negative log likelihood:
loss = -log(probability of correct word)
Then update parameters (simplified example):
Wout[i][j] -= lr * error[j] * lastVector[i];
- Correct word weights tend to go up.
- Wrong word weights tend to go down.
Over many examples, prediction quality improves.
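Both pieces can be sketched together. The helper name updateWout and the softmax-style error term (probability minus 1 for the correct word) are our fill-ins for the simplified update above:

```typescript
// Negative log likelihood: small when the correct word gets high probability.
function nllLoss(probs: number[], targetId: number): number {
  return -Math.log(probs[targetId]);
}

// Simplified output-layer update: for each weight feeding logit j,
// error[j] = probs[j] - (1 if j is the correct word, else 0).
function updateWout(
  Wout: number[][],      // dim x vocabSize
  lastVector: number[],  // final context vector
  probs: number[],
  targetId: number,
  lr: number
): void {
  for (let i = 0; i < Wout.length; i++) {
    for (let j = 0; j < Wout[i].length; j++) {
      const error = probs[j] - (j === targetId ? 1 : 0);
      Wout[i][j] -= lr * error * lastVector[i];
    }
  }
}

const probs = [0.1, 0.2, 0.7];
const loss = nllLoss(probs, 2); // -log(0.7), about 0.357
const Wout = [[0, 0, 0]];
updateWout(Wout, [1], probs, 2, 0.1);
// the correct word's weight went up; the wrong words' weights went down
```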
What We Actually Built
We built:
tokenizer -> embedding -> attention -> feedforward -> output -> training
That is the core transformer loop.
What Is Missing
A tiny educational transformer skips many production features:
- training embeddings end-to-end at scale
- fully trained attention weights on large corpora
- multiple stacked transformer layers
- large datasets and long training runs
- regularization, normalization, and optimization tricks
The Insight
A transformer is not magic.
It is a sequence of understandable operations:
words -> numbers -> compare -> mix -> refine -> predict
One Line Summary
A transformer lets words look at other words, update their meaning using context, and use that to predict the next word.