Finding the right words

Contents

  1. Context

  2. Preliminaries

    1. Logits
  3. Sampling Words

    1. top-k
    2. Temperature
    3. top-p
    4. Repetition Penalty
    5. Frequency & Presence Penalties
  4. Summary

  5. References

Context

Finding the right words is a hard problem for large language models (LLMs). In this post, we review the fundamental pieces that help LLMs choose appropriate words during text generation.

Autoregressive LLMs generate text one token at a time. At each step, the model predicts a probability distribution over its entire vocabulary, which can contain tens of thousands of tokens. The generation process then needs to select the next token based on this distribution.

Simply choosing the highest-probability token often leads to repetitive text. Therefore, various decoding strategies are employed to balance coherence, diversity, and task-appropriateness in the generated text.

Here’s an image from Holtzman et al. (2020) to illustrate the problem:

[Figure from Holtzman et al. (2020): Beam Search and Pure (Greedy) Sampling]

As you can see, these decoding strategies significantly influence the quality and relevance of the output, making them crucial for optimizing LLM performance across different applications. We’ll explore popular techniques like temperature scaling, top-k sampling, and top-p (nucleus) sampling, discussing their mechanisms and impacts.

Preliminaries

Logits

A logit is nothing but a number that tells you how likely something is to happen versus not. Think of logits as a way to stretch out probabilities, making very likely and very unlikely events easier to distinguish.

  • Logits map probabilities in (0, 1) to (−∞, +∞)
  • A logit of 0 means a 50-50 chance of something happening, i.e. p = 0.5
  • A negative logit means p < 0.5
  • A positive logit means p > 0.5

\text{logit}(p) = \log\left(\frac{p}{1 - p}\right)
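
To make the mapping concrete, here is a minimal NumPy sketch (the function names logit and sigmoid are mine, not tied to any particular library):

import numpy as np

def logit(p):
    # map a probability in (0, 1) to a real-valued logit
    return np.log(p / (1 - p))

def sigmoid(x):
    # inverse of the logit: map a real number back to (0, 1)
    return 1 / (1 + np.exp(-x))

print(logit(0.5))     # 0.0   -> a 50-50 chance
print(logit(0.9))     # ~2.2  -> likely events get large positive logits
print(sigmoid(-1.0))  # ~0.27 -> a negative logit means p < 0.5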

Why do we use logits?

  • Logits cover a wider numerical range without loss of precision
  • Logits simplify the optimization process
  • Logits help with training the model effectively using techniques like Gradient Descent

An autoregressive LLM outputs a distribution of logits over its vocabulary at each time step. Sampling parameters determine how to pick the next word from this distribution.

Sampling Words

top-k

top-k is a hard filter. Here I decide “I want only the top k tokens to be considered at each step” before I do anything else. These k tokens are selected by sorting the logit outputs and keeping the k largest. The size of the candidate set is static: it is the same k at every time step.

Holtzman et al. (2020) mention that the presence of flat output distributions makes the use of a small k in top-k sampling problematic, while the presence of peaked distributions makes large k’s problematic.
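
A minimal NumPy sketch of such a filter (the function name and example values are my own):

import numpy as np

def top_k_filter(logits, k):
    # keep only the k highest logits; mask the rest with -inf
    # so they receive zero probability after a later softmax
    filtered = np.full_like(logits, -np.inf)
    top_idx = np.argsort(logits)[-k:]  # indices of the k largest logits
    filtered[top_idx] = logits[top_idx]
    return filtered

logits = np.array([2.0, 1.0, 0.5, -1.0, -3.0])
print(top_k_filter(logits, k=2))  # [ 2.  1. -inf -inf -inf]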

Temperature

After the top-k step, we apply temperature scaling. The term temperature is borrowed from Statistical Physics and has nothing to do with language. The analogy lies in the idea that the temperature τ controls the “energy” or “uncertainty” of the system represented by a probability distribution.

In our problem of finding words, remember the logit output at each step? We now divide each logit value by a temperature τ, which ranges from 0 to ∞.

  • τ = 1 means the system is unaffected; everything remains the same.
  • A high τ brings logits closer to 0. Remember what a logit of 0 meant? Everything becomes more equally likely, which means I can say (or sample) many, many different things.
  • A low τ pushes logits further from 0. The differences between logits are exaggerated, so the distribution becomes more opinionated and I will now pick only from the most likely words.

Luke Salamone’s blog has a great guide for playing with temperature values to get an intuitive sense of what temperature does (Salamone 2021).
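
As a rough sketch (illustrative values only), dividing the logits by τ before the softmax flattens or sharpens the resulting distribution:

import numpy as np

def softmax(z):
    z = z - np.max(z)  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.1])
for tau in (0.5, 1.0, 2.0):
    print(tau, np.round(softmax(logits / tau), 3))
# tau = 0.5 -> peaked distribution (low temperature, more “opinionated”)
# tau = 1.0 -> unchanged
# tau = 2.0 -> flatter distribution (high temperature, more diverse)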

top-p

top-p sampling, also known as nucleus sampling, selects the smallest set of tokens whose cumulative probability mass exceeds a predefined probability p threshold.

This method dynamically adjusts the number of tokens considered for sampling based on their cumulative probability until the probability mass reaches p.

Holtzman et al. (2020) introduced this method and found that nucleus sampling obtains the perplexity closest to that of human text.

To calculate these cumulative probabilities, we first need to go from logits to probabilities. We apply the softmax operation to the logit values after the top-k filtering and temperature scaling steps.

\sigma(z_i) = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}

Note that whatever probability mass was removed in the top-k step is now redistributed among the remaining tokens.

Finally, we sort the probability values in descending order, cumulatively add them from top to bottom, and STOP as soon as the cumulative score exceeds the top-p threshold (say, 0.9).

This is now my final set of candidate words from which to continue forming my sentence.
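
Here is a minimal sketch of the nucleus (top-p) filter described above; it assumes probs has already been produced by applying softmax to the top-k filtered, temperature-scaled logits:

import numpy as np

def top_p_filter(probs, p=0.9):
    # keep the smallest set of tokens whose cumulative probability exceeds p
    order = np.argsort(probs)[::-1]              # sort tokens by probability, descending
    cumulative = np.cumsum(probs[order])
    cutoff = np.searchsorted(cumulative, p) + 1  # first position where the cumulative mass exceeds p
    keep = order[:cutoff]
    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]
    return filtered / filtered.sum()             # renormalise over the surviving tokens

probs = np.array([0.5, 0.3, 0.15, 0.05])
print(top_p_filter(probs, p=0.9))  # keeps the first three tokens, renormalised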

Repetition Penalty

Keskar et al. (2019) introduce another form of penalty, called the repetition penalty:

p_i = \frac{\exp\left(x_i / (T \cdot I(i \in g))\right)}{\sum_j \exp\left(x_j / (T \cdot I(j \in g))\right)} \qquad I(c) = \theta \text{ if } c \text{ is True else } 1

Here, g is the list of already generated tokens. They found that greedy sampling with θ ≈ 1.2 yields a good balance between truthful generation and lack of repetition. In passing, they note that this approach succeeds only if the model has learned a sufficiently reliable distribution.
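
A minimal sketch of this penalty, following the formula literally with T = 1 (the function name is mine; generated holds the token ids produced so far):

import numpy as np

def apply_repetition_penalty(logits, generated, theta=1.2):
    # divide the logits of already-generated tokens by theta (temperature T = 1 assumed)
    penalised = logits.copy()
    for token_id in set(generated):
        penalised[token_id] /= theta
    return penalised

logits = np.array([3.0, 2.0, 1.0])
print(apply_repetition_penalty(logits, generated=[0, 0, 2]))  # tokens 0 and 2 are dampened

Note that a literal division only dampens positive logits; implementations such as Hugging Face’s RepetitionPenaltyLogitsProcessor multiply negative logits by θ instead, so that repeated tokens are penalised regardless of sign.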

Frequency & Presence Penalties

The ChatGPT API has similar frequency and presence penalties to suppress repetition.

The presence penalty is a one-off additive contribution that applies to all tokens that have been sampled at least once, while the frequency penalty is a contribution proportional to how often a particular token has already been sampled.

The exact modifications to the original logits mu[j] for the j’th token are as follows (the logits are shifted to the left, i.e. made smaller):

mu[j]
  - c[j] * alpha_frequency            # proportional to how often this token has already been sampled
  - float(c[j] > 0) * alpha_presence  # one-off additive contribution for tokens sampled at least once
  • c[j] - how often that token was sampled prior to the current position
  • float(c[j] > 0) = 1 if c[j] > 0 and 0 otherwise (indicator function)
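
For completeness, a runnable sketch of the same adjustment (variable names mirror the pseudocode above; the penalty values are just examples):

import numpy as np
from collections import Counter

def apply_penalties(logits, generated, alpha_frequency=0.5, alpha_presence=0.5):
    # shift logits down for tokens that have already been sampled
    counts = Counter(generated)  # c[j]: how often each token was sampled so far
    adjusted = logits.copy()
    for token_id, c in counts.items():
        adjusted[token_id] -= c * alpha_frequency  # proportional to the count
        adjusted[token_id] -= alpha_presence       # one-off penalty for having appeared at all
    return adjusted

logits = np.array([2.0, 2.0, 2.0])
print(apply_penalties(logits, generated=[0, 0, 1]))  # [0.5 1.  2. ]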

ChatGPT’s API docs provide some usage tips:

  • To just reduce repetitive samples somewhat, pick coefficients between 0.1 and 1.
  • To strongly suppress repetition, increase coefficients up to 2, but this can noticeably degrade the quality of samples.
  • Negative values increase the likelihood of repetition.

Summary

  • top-k is static, always yielding k tokens, while top-p varies based on output distribution.
  • top-k and top-p control token diversity and local coherence; temperature influences overall fluency, creativity, and global coherence.
  • top-k limits the candidate pool at the start of the pipeline, potentially improving coherence by focusing on the most probable tokens.
  • Indicator variables based on generated tokens can control repetition:
    • alpha_presence applies a one-off downward shift to the logits of tokens that have already appeared
    • alpha_frequency applies a penalty proportional to how often a token has already been sampled
  • A brainstorming tool might do well with a high temperature, high top-k and a high top-p, while a factual question-answering system might perform better with low top-k and low temperature.
  • These parameters balance coherence, repetition, and creativity at each decoding step; a minimal end-to-end sketch chaining them follows below.
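
To close, here is a minimal end-to-end sketch chaining the steps in the order discussed in this post: top-k, temperature, softmax, top-p, then sampling. The function name and default values are illustrative, not taken from any particular library.

import numpy as np

def sample_next_token(logits, k=50, tau=0.8, p=0.9, rng=None):
    # one decoding step: top-k filter, temperature scaling, softmax,
    # top-p filter, then sample a token id from the surviving distribution
    if rng is None:
        rng = np.random.default_rng()
    logits = np.asarray(logits, dtype=float)

    # top-k: mask everything outside the k highest logits
    masked = np.full_like(logits, -np.inf)
    top_idx = np.argsort(logits)[-k:]
    masked[top_idx] = logits[top_idx]

    # temperature scaling + softmax
    z = masked / tau
    z = z - np.max(z)
    probs = np.exp(z) / np.exp(z).sum()

    # top-p: keep the smallest set whose cumulative mass exceeds p, renormalise
    order = np.argsort(probs)[::-1]
    cutoff = np.searchsorted(np.cumsum(probs[order]), p) + 1
    final = np.zeros_like(probs)
    final[order[:cutoff]] = probs[order[:cutoff]]
    final /= final.sum()

    return rng.choice(len(final), p=final)

vocab = ["the", "cat", "sat", "on", "mat"]
logits = [2.0, 1.5, 0.3, 0.2, -1.0]
print(vocab[sample_next_token(logits, k=3, tau=0.8, p=0.9)])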

References

Holtzman, Ari, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi. 2020. “The Curious Case of Neural Text Degeneration.” arXiv. http://arxiv.org/abs/1904.09751.

Keskar, Nitish Shirish, Bryan McCann, Lav R. Varshney, Caiming Xiong, and Richard Socher. 2019. “CTRL: A Conditional Transformer Language Model for Controllable Generation.” arXiv. https://doi.org/10.48550/arXiv.1909.05858.

Salamone, Luke. 2021. “What Is Temperature in NLP?🐭.” Luke Salamone’s Blog. April 2, 2021. https://lukesalamone.github.io/posts/what-is-temperature/.