t-distributed Stochastic Neighbor Embedding says "what"


Contents

  1. What is t-SNE

    1. Perplexity
    2. Normal to t
    3. The algorithm
  2. Example

  3. Links

What is t-SNE

t-SNE, or t-distributed Stochastic Neighbor Embedding, is a dimensionality reduction technique that projects data onto a lower-dimensional space (typically 2 or 3 dimensions). It uses a normal distribution to derive similarities in the original space and a t-distribution in the projected space. The perplexity controls the scaling of the normal distribution used to compute those similarities.

The algorithm preserves clustering but distorts the original distances (as any dimensionality reduction technique would).
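
A quick way to see this, using synthetic blobs as an illustrative setup (the sample counts, dimensions and random_state below are arbitrary choices, not from the example later in this post): the three clusters stay clearly separated in the embedding, while the rank correlation between original and embedded pairwise distances is noticeably imperfect.

from sklearn.datasets import make_blobs
from sklearn.manifold import TSNE
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

# Illustrative data: three well-separated clusters in 50 dimensions
X, y = make_blobs(n_samples=150, n_features=50, centers=3, random_state=0)

emb = TSNE(n_components=2, random_state=0).fit_transform(X)

# Clusters survive the projection, but pairwise distances are only
# loosely (rank-)correlated with the originals
rho, _ = spearmanr(pdist(X), pdist(emb))
print(f"Spearman correlation of pairwise distances: {rho:.2f}")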

Perplexity

Perplexity can intuitively be thought of as any of the following:

  1. The expected density around each point
  2. (Loosely) how to balance attention between local and global aspects of your data
  3. A guess at the number of close neighbors each point has
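
To get a feel for the knob itself, sklearn's TSNE exposes perplexity directly. The values below are illustrative rather than a recommendation, and the digits slice is just a convenient stand-in dataset:

from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

# Illustrative data: a 300-sample slice of the digits dataset
X = load_digits().data[:300]

# Typical perplexity values fall between 5 and 50, and perplexity must
# stay below the number of samples; each run trades local detail
# against global layout
for perplexity in (5, 30, 50):
    emb = TSNE(n_components=2, perplexity=perplexity, random_state=0).fit_transform(X)
    print(perplexity, emb.shape)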

Normal to t

The reason for first using a normal distribution and then a t-distribution is to avoid clumping the points in the projected space: the t-distribution has heavier tails, which provides a bit more slack for the lower-similarity points.
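
A quick density comparison makes the heavier tails concrete; the t-distribution with one degree of freedom is the one t-SNE uses in the projected space:

from scipy.stats import norm, t

# The standard normal and the t-distribution with one degree of freedom
# (a Cauchy) are similar near zero, but the t keeps far more probability
# mass in the tails
for x in (0.0, 2.0, 4.0, 6.0):
    print(f"x={x}: normal={norm.pdf(x):.6f}  t(df=1)={t.pdf(x, df=1):.6f}")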

The algorithm

It proceeds as follows (a minimal sketch follows the list):

  1. Determine the unscaled similarity between the point of interest and every other point, using a normal distribution whose width is set by the perplexity
  2. Repeat this with every point taking the role of the point of interest
  3. For each point, scale its similarities so they add up to one (divide by the sum of its unscaled similarity scores)
  4. Average the similarity scores for each pair of points (in and out) to get a symmetric matrix of similarity scores
  5. Project the points randomly onto the desired number of latent dimensions
  6. Repeat steps 1-4 on that projection, this time with a t-distribution, to derive a second similarity matrix
  7. Make the matrix from step 6 converge to the matrix from step 4 in tiny steps with a small learning rate, moving pairs with higher similarity closer together and pairs with lower similarity further apart
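
Below is a minimal NumPy sketch of these steps. It is a simplification rather than a faithful implementation: real t-SNE finds a per-point width for the normal distribution via a binary search on the perplexity and adds early exaggeration and momentum, whereas here a single fixed sigma stands in for that search and the learning rate is an arbitrary choice.

import numpy as np

def tsne_sketch(X, n_components=2, n_iter=500, lr=100.0, sigma=1.0, seed=0):
    rng = np.random.default_rng(seed)
    n = X.shape[0]

    # Steps 1-3: normal-distribution similarities in the original space,
    # scaled per point so each row sums to one (the fixed sigma stands in
    # for the per-point width the perplexity search would find)
    d2 = np.square(X[:, None] - X[None, :]).sum(-1)
    p = np.exp(-d2 / (2 * sigma ** 2))
    np.fill_diagonal(p, 0.0)
    p /= p.sum(axis=1, keepdims=True)

    # Step 4: average the in- and out-similarities into a symmetric matrix P
    P = (p + p.T) / (2 * n)

    # Step 5: random initial projection
    Y = rng.normal(scale=1e-4, size=(n, n_components))

    for _ in range(n_iter):
        # Step 6: t-distribution (one degree of freedom) similarities Q
        q = 1.0 / (1.0 + np.square(Y[:, None] - Y[None, :]).sum(-1))
        np.fill_diagonal(q, 0.0)
        Q = q / q.sum()

        # Step 7: nudge Q towards P by moving points along the gradient of
        # the KL divergence; similar pairs attract, dissimilar pairs repel
        grad = 4.0 * ((P - Q) * q)[:, :, None] * (Y[:, None] - Y[None, :])
        Y -= lr * grad.sum(axis=1)

    return Y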

Example

A simple t-SNE example using TF-IDF sentence vectors (supply a DataFrame df with text, score and label columns). It can be repeated with fastText embeddings, or in Orange Data Mining for quick PoCs. The example below creates a Plotly chart with hover text wrapped at 20 characters.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.manifold import TSNE
import pandas as pd
import plotly.express as px
import textwrap

# Assumes an existing DataFrame df with 'text', 'score' and 'label' columns
 
# Create a TfidfVectorizer object
vectorizer = TfidfVectorizer()
 
# Fit and transform the text column
tfidf_matrix = vectorizer.fit_transform(df['text'])
 
# Optional side table: average TF-IDF weight per token (not used for the plot below)
tfidf_df = pd.DataFrame(tfidf_matrix.toarray().mean(axis=0))
tfidf_df = tfidf_df.set_index(vectorizer.get_feature_names_out())
tfidf_df.reset_index(inplace=True)
tfidf_df = tfidf_df.rename({"index": "token", 0: "tfidf"}, axis=1)
 
tsne = TSNE(n_components=2, random_state=0)  # sklearn's default perplexity is 30
tsne_results = tsne.fit_transform(tfidf_matrix.toarray())
 
def wrap_text(text, width=20):
    return "<br>".join(textwrap.wrap(text, width=width))
 
# Apply text wrapping to the text_column
df['wrapped_text'] = df['text'].apply(wrap_text)
 
df['tsne-2d-one'] = tsne_results[:,0]
df['tsne-2d-two'] = tsne_results[:,1]
 
# Create an interactive plot
fig = px.scatter(df, x='tsne-2d-one', y='tsne-2d-two', hover_data=['wrapped_text', 'score'], color='label')
 
fig.update_layout(
    yaxis=dict(
        scaleanchor='x',  # lock a 1:1 aspect ratio so the axes aren't visually skewed
        scaleratio=1,
    ),
    width=800,   # width in pixels
    height=800,  # same as width for a square figure
)
 
fig.show()
 
# Static export requires the kaleido package
fig.write_image("t-sne.png")