Terminology / Key Concepts

Transformers in a mathematical way

1. What is a Transformer?

Imagine you’re trying to understand a sentence, like “The cat sat on the mat”. A Transformer is a smart machine that helps computers understand sentences by looking at the words and figuring out how they are related to each other.

So, instead of reading the words one by one, it looks at all the words at once and decides how each word connects to the others. Math helps it do that!

2. How Does a Transformer Look at Words?

Let’s start with numbers. To make sense of words, Transformers turn them into numbers (called vectors). You can think of these as number codes for words.

For example:

"The" might be turned into the number code: [0.1, 0.2, 0.3]
"cat" might be turned into: [0.4, 0.5, 0.6]
"sat" might be turned into: [0.7, 0.8, 0.9]

These numbers help the Transformer understand the meaning of the word.
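As a rough illustration of the idea, the word codes above can be stored in a simple lookup table. This is only a sketch: the `embeddings` table and the `encode` helper are hypothetical names, and the three-number codes are the toy values from the example, not values from a real model.

```python
# Toy lookup table: each word maps to a small vector (values from the example above).
embeddings = {
    "The": [0.1, 0.2, 0.3],
    "cat": [0.4, 0.5, 0.6],
    "sat": [0.7, 0.8, 0.9],
}


def encode(sentence):
    """Turn a list of words into a list of vectors by looking each word up."""
    return [embeddings[word] for word in sentence]


vectors = encode(["The", "cat", "sat"])
print(vectors[1])  # the code for "cat": [0.4, 0.5, 0.6]
```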

3. What is Self-Attention? (Looking at Words)

Now, the Transformer needs to figure out how each word connects to the others. This is done with a math process called self-attention. It’s like the Transformer asking: "How important is the word 'cat' to 'sat'?"

To do this, we use 3 math things for each word:

Query (Q): What this word is looking for in the other words.
Key (K): What this word offers for other words to match against.
Value (V): The information this word passes along when it receives attention.

The Transformer compares each word’s Query (Q) with the Key (K) of other words to see how much attention it should give to them.
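One common way to get Q, K, and V is to multiply each word code by three small tables of numbers (weight matrices) that the model learns during training. The sketch below assumes this setup. The names `project`, `W_q`, `W_k`, and `W_v` are hypothetical, and the matrix values are made up for illustration. Using the identity matrix for `W_q` and `W_k` simply leaves the word code unchanged, which matches the worked example in the next section.

```python
def project(vector, matrix):
    """Multiply a row vector by a matrix given as a list of rows."""
    columns = len(matrix[0])
    return [
        sum(vector[i] * matrix[i][j] for i in range(len(vector)))
        for j in range(columns)
    ]


# Hypothetical weight matrices (a real model learns these during training).
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
W_q = identity                      # Query weights: leave the word code unchanged
W_k = identity                      # Key weights: leave the word code unchanged
W_v = [[0.5, 0.0, 0.0], [0.0, 0.5, 0.0], [0.0, 0.0, 0.5]]  # Value weights: halve it

x_sat = [0.7, 0.8, 0.9]             # word code for "sat"
q_sat = project(x_sat, W_q)
k_sat = project(x_sat, W_k)
v_sat = project(x_sat, W_v)
print(q_sat, k_sat, v_sat)
```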

4. The Math Behind Self-Attention

The Transformer uses multiplication and addition (the basic math we learn in school) to figure out the attention score. Here’s how:

Step 1: Dot Product (Simple Multiplication and Adding)

A dot product takes two vectors (word codes), multiplies the numbers in matching positions, and adds up the results.

For example, if Q for "sat" = [0.7, 0.8, 0.9] and K for "cat" = [0.4, 0.5, 0.6], we do the following:

Multiply 0.7 * 0.4, 0.8 * 0.5, and 0.9 * 0.6.
Then add the results together: 0.28 + 0.40 + 0.54 = 1.22

So, the attention score between "sat" and "cat" is 1.22. This means "sat" should pay some attention to "cat".
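The same arithmetic can be checked with a few lines of code. This is a minimal sketch: `dot` is a hypothetical helper name, and the vectors are the ones from the example.

```python
def dot(a, b):
    """Multiply the numbers in matching positions and add up the results."""
    return sum(x * y for x, y in zip(a, b))


q_sat = [0.7, 0.8, 0.9]  # Query for "sat"
k_cat = [0.4, 0.5, 0.6]  # Key for "cat"

score = dot(q_sat, k_cat)
print(round(score, 2))  # 1.22
```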

Step 2: Scale and Softmax

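After the score, we do two things:

Scale: We divide the score by the square root of the vector length (here, the square root of 3, which is about 1.73). This keeps the scores from growing too large when the vectors are long.
Softmax: This turns all of a word’s scores into percentages that add up to 100%. So if "sat" has a score of 1.22 for "cat", softmax compares it with the scores "sat" has for the other words and turns it into a value between 0 and 1, like 30%.

The Transformer then uses these percentages to blend the Values (V) of the words, so words with a higher percentage contribute more to the result.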
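Here is a small sketch of scaling and softmax for the word "sat". It assumes, as in the dot product example, that the word codes are used directly as Queries and Keys. It also reuses them as Values and blends them with the resulting weights, which is the usual next step in attention but is an assumption added here for illustration. The helper names are hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    """Turn a list of scores into positive weights that add up to 1."""
    biggest = max(scores)  # subtracting the max avoids very large exponentials
    exps = [math.exp(s - biggest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


words = ["The", "cat", "sat"]
keys = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6], [0.7, 0.8, 0.9]]
values = keys              # assumption: reuse the word codes as Values
q_sat = [0.7, 0.8, 0.9]    # Query for "sat"

raw = [dot(q_sat, k) for k in keys]                  # one score per word
scaled = [s / math.sqrt(len(q_sat)) for s in raw]    # divide by sqrt(3)
weights = softmax(scaled)                            # percentages for "sat"

# Blend the Values: words with larger weights contribute more.
blended = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(3)]

for word, w in zip(words, weights):
    print(f"{word}: {w:.0%}")
```

With these toy numbers, "sat" gives roughly 21% of its attention to "The", 31% to "cat", and 48% to itself.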

5. Multi-Head Attention: Looking at the Sentence from Different Views

Instead of calculating just one set of attention scores, Transformers use multiple heads. Think of this like having different people looking at the same sentence from different angles.

For example, one head might focus on grammar, another on meaning, and another on relationships between words. They all look at the sentence at once, in parallel.

Each head calculates attention scores, and then we combine everything to get the full picture.
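The idea of several heads working side by side and then being combined can be sketched as follows. This is a simplified illustration under stated assumptions: each head just takes its own slice of every word code, while real models also apply learned weight matrices per head and after combining. The four-number word codes and all function names are hypothetical.

```python
import math


def softmax(scores):
    biggest = max(scores)
    exps = [math.exp(s - biggest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention for every word in the sentence."""
    scale = math.sqrt(len(keys[0]))
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        width = len(values[0])
        outputs.append(
            [sum(w * v[i] for w, v in zip(weights, values)) for i in range(width)]
        )
    return outputs


def multi_head_attention(vectors, num_heads):
    """Run attention separately on each slice, then join the results."""
    head_dim = len(vectors[0]) // num_heads
    head_outputs = []
    for h in range(num_heads):
        start, end = h * head_dim, (h + 1) * head_dim
        part = [v[start:end] for v in vectors]  # this head's view of each word
        head_outputs.append(attention(part, part, part))
    combined = []
    for i in range(len(vectors)):
        row = []
        for head in head_outputs:
            row.extend(head[i])  # concatenate the heads for word i
        combined.append(row)
    return combined


# Hypothetical four-number word codes so they split evenly into two heads.
toy = [[0.1, 0.2, 0.3, 0.4], [0.4, 0.5, 0.6, 0.7], [0.7, 0.8, 0.9, 1.0]]
result = multi_head_attention(toy, num_heads=2)
print(result)
```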

6. Feed-Forward Networks: Processing the Information

Once the Transformer has worked out how the words relate to each other, it passes each word’s result through a small neural network called a feed-forward network. This is a few more rounds of multiplying and adding, applied to each word separately.

This step is like a calculator that takes the information from the attention process and turns it into a refined result for each word.
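A feed-forward network of this kind can be sketched as two rounds of multiply-and-add with a simple cutoff (ReLU) in between. The layer sizes and every weight value below are made up for illustration, and the function names are hypothetical. A real model learns its weights during training.

```python
def linear(x, weights, bias):
    """One multiply-and-add layer: each row of weights produces one output."""
    return [sum(xi * w for xi, w in zip(x, row)) + b for row, b in zip(weights, bias)]


def relu(x):
    """Replace negative numbers with zero."""
    return [max(0.0, v) for v in x]


def feed_forward(x, w1, b1, w2, b2):
    """Expand the vector, apply ReLU, then shrink it back."""
    return linear(relu(linear(x, w1, b1)), w2, b2)


# Hypothetical weights: 3 numbers in, 4 hidden numbers, 3 numbers out.
w1 = [[0.5, -0.2, 0.1], [-0.3, 0.8, 0.0], [0.2, 0.2, 0.2], [-1.0, 0.0, 0.5]]
b1 = [0.0, 0.1, -0.1, 0.0]
w2 = [[0.1, 0.0, 0.3, -0.2], [0.0, 0.5, 0.0, 0.1], [-0.4, 0.2, 0.1, 0.0]]
b2 = [0.0, 0.0, 0.0]

out = feed_forward([0.4, 0.5, 0.6], w1, b1, w2, b2)
print(out)  # a new 3-number vector for the word
```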

7. Positional Encoding: Remembering the Order of Words

Because a Transformer looks at all the words at once, it has no built-in sense of which word comes first. So, to make sure the Transformer knows the order, it adds extra numbers (called positional encoding) to the word codes.

This helps the Transformer tell the first, second, and third word apart, so it can still understand the sentence correctly, even though it’s not reading it left-to-right.
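There are several ways to build a positional encoding. The sketch below assumes one well-known choice, the sine and cosine pattern from the original Transformer paper, and adds it to the toy word codes. The function names are hypothetical, and other models may use learned or different position codes.

```python
import math


def positional_encoding(position, dim):
    """Sine/cosine position code: even slots use sin, odd slots use cos."""
    code = []
    for i in range(dim):
        angle = position / (10000 ** ((2 * (i // 2)) / dim))
        code.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return code


def add_position(vectors):
    """Add each word's position code to its word code."""
    return [
        [x + p for x, p in zip(vec, positional_encoding(pos, len(vec)))]
        for pos, vec in enumerate(vectors)
    ]


sentence = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6], [0.7, 0.8, 0.9]]  # The, cat, sat
with_order = add_position(sentence)
print(with_order)
```

Because each position gets a different code, the same word would end up with different numbers depending on where it appears in the sentence.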

Final Summary:

Self-Attention: The Transformer checks how much each word should pay attention to the others by doing simple math (multiplying and adding numbers).
Multi-Head Attention: It looks at the sentence from different angles, like having multiple people looking at the same thing.
Feed-Forward Networks: The Transformer processes the information using a simple math network.
Positional Encoding: It adds extra information to know the order of the words.

📚 Recommended Resources

To delve deeper into Transformers and AI advancements:

The Transformer Explainer: A Live Visual Guide
Transformers: A Primer (Columbia University)
The Transformer Model in Equations (johnthickstun.com)
Stanford AI Index 2025 Report (Stanford HAI)
