How Smooth Is Attention? – Apple Machine Learning Research

Self-attention and masked self-attention are at the heart of Transformers’ outstanding success. Still, our mathematical understanding of attention, in particular of its Lipschitz properties — which are key when it comes to analyzing robustness and expressive power — is incomplete. We provide a detailed study of the Lipschitz constant of self-attention in several practical scenarios, discussing the impact of the sequence length and layer normalization on the local Lipschitz constant of both unmasked and masked self-attention. In particular, we show that for inputs of length n in any compact set, the Lipschitz constant of self-attention is bounded by sqrt(n) up to a constant factor and that this bound is tight for reasonable sequence lengths. When the sequence length n is too large for the previous bound to be tight, which we refer to as the mean-field regime, we provide an upper bound and a matching lower bound which are independent of n. Our mean-field framework for masked self-attention is novel and of independent interest. Our experiments on pretrained and randomly initialized BERT and GPT-2 support our theoretical findings.
Figure 1: Regularity of the attention layer as a function of sequence length for different architectures.
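
To make the notion of a local Lipschitz constant concrete, the following is a minimal numerical sketch, not the authors' code. It assumes a single-head, unmasked self-attention map f(X) = softmax(X A Xᵀ) X V, where the query and key matrices and any scaling are folded into one hypothetical matrix `A`, and it assumes the Frobenius norm on inputs and outputs. Under these assumptions the local Lipschitz constant at X is the spectral norm of the Jacobian of f at X, which is approximated here by central finite differences. The function names, the sizes, and the random parameters are illustrative choices only.

```python
import numpy as np


def self_attention(X, A, V):
    """Single-head unmasked self-attention: softmax(X A X^T) X V.

    X: (n, d) input sequence; A: (d, d) folded query-key matrix
    (an assumption of this sketch); V: (d, d) value matrix.
    """
    scores = X @ A @ X.T
    # Subtract the row-wise max for numerical stability of the softmax.
    scores = scores - scores.max(axis=1, keepdims=True)
    P = np.exp(scores)
    P = P / P.sum(axis=1, keepdims=True)
    return P @ X @ V


def local_lipschitz(X, A, V, eps=1e-5):
    """Estimate the local Lipschitz constant of self_attention at X.

    Builds the (n*d, n*d) Jacobian column by column with central
    finite differences and returns its largest singular value.
    """
    n, d = X.shape
    J = np.zeros((n * d, n * d))
    for k in range(n * d):
        E = np.zeros(n * d)
        E[k] = eps
        E = E.reshape(n, d)
        diff = self_attention(X + E, A, V) - self_attention(X - E, A, V)
        J[:, k] = diff.ravel() / (2 * eps)
    return np.linalg.norm(J, 2)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d = 4
    A = rng.normal(size=(d, d)) / np.sqrt(d)
    V = rng.normal(size=(d, d)) / np.sqrt(d)
    for n in (2, 4, 8, 16):
        X = rng.normal(size=(n, d))
        print(n, local_lipschitz(X, A, V))
```

Evaluating this at a few random inputs gives only a lower bound on the Lipschitz constant over a compact set, since the constant is the supremum of the Jacobian norm over that set. Probing the sqrt(n) growth described in the abstract would require searching for worst-case inputs rather than sampling them at random.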
