Data Machina #180
Vision Transformers. OpenAI's new embedding model. Meta AI's new Data2vec 2.0. Peter Norvig reviews AlphaCode. Diffusion models in Keras. A RecSys with GNNs & PyTorch Geometric.
On the Power of Vision Transformers. Do you remember when CNNs ruled vision? Well, it seems that Vision Transformers (ViTs) are quickly becoming the de facto architecture for computer vision these days.
The ViT was first introduced in late 2020 by a team @GoogleBrain in An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (paper & code).
The ViT is a vision model built on the transformer architecture. It represents an input image as a sequence of image patches, analogous to the word embeddings used in text transformers. From training data, the model learns to encode the relative location of the patches, recovering the structure of the image, and then predicts a class label for it.
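To make the patch idea concrete, here is a minimal PyTorch sketch of the patch-embedding step (not the paper's code; all names and hyperparameters are illustrative):

```python
import torch
import torch.nn as nn

# Illustrative setup: a 224x224 RGB image cut into 16x16 patches
image_size, patch_size, channels, dim = 224, 16, 3, 768
num_patches = (image_size // patch_size) ** 2  # 196 patches

# One linear layer embeds each flattened patch, like a word embedding
patch_embed = nn.Linear(channels * patch_size * patch_size, dim)
pos_embed = nn.Parameter(torch.zeros(1, num_patches, dim))  # learned positions

img = torch.randn(1, channels, image_size, image_size)
# Unfold the image into a (batch, num_patches, patch_dim) sequence
patches = img.unfold(2, patch_size, patch_size).unfold(3, patch_size, patch_size)
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(1, num_patches, -1)

tokens = patch_embed(patches) + pos_embed  # ready for a transformer encoder
print(tokens.shape)  # torch.Size([1, 196, 768])
```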
If you’re interested in coding the ViT, Phil Wang has implemented the Vision Transformer in PyTorch: a simple implementation that reaches SoTA in vision classification with only a single transformer encoder.
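For reference, a minimal usage sketch along the lines of the vit-pytorch README (check the repo for the current API; the hyperparameters below mirror its example and should be tuned for your dataset):

```python
import torch
from vit_pytorch import ViT

v = ViT(
    image_size=256,
    patch_size=32,
    num_classes=1000,
    dim=1024,
    depth=6,          # number of transformer encoder blocks
    heads=16,         # attention heads per block
    mlp_dim=2048,
    dropout=0.1,
    emb_dropout=0.1,
)

img = torch.randn(1, 3, 256, 256)  # a dummy batch of one image
preds = v(img)                     # (1, 1000) class logits
```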
About a year ago, a team @IBM_AI performed an in-depth comparison between ViTs and SoTA CNNs. They concluded that ViTs mostly outperform CNNs. You can read more about their research, with reproducible code, here: Vision Transformers are Robust Learners.
More recently, a team @GoogleBrain dug deeper into understanding how ViTs work as compared to CNNs. In Do Vision Transformers See Like CNNs?, the team uncovers some key, striking representational differences between ViTs and CNNs, and describes the implications for attention, scale, classification, object detection, and transfer learning.
Unlike with CNNs, visually exploring what a ViT learns is still a challenge. In What Do Vision Transformers Learn? A Visual Exploration, a team of researchers @UniofMaryland describes a new method for large-scale ViT visualisations.
A few days ago, a team @GoogleBrain published Image-and-Language Understanding from Pixels Only. Quite amazingly, they present a single ViT that understands images and language jointly, using pixels as the sole input modality. Pixels is all you need!
And to keep up with the times, Hugging Face just published the Audio Spectrogram Transformer model, which applies a ViT to audio by turning the audio into an image (a spectrogram). The model obtains SoTA results in audio classification.
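The trick is simply to treat the spectrogram as a one-channel image. A sketch of that step using standard torchaudio transforms (the waveform here is synthetic, and this is the general idea rather than the model's exact preprocessing):

```python
import torch
import torchaudio.transforms as T

sample_rate = 16000
waveform = torch.randn(1, sample_rate * 10)  # 10 s of dummy audio

# Mel spectrogram: the "image" a ViT-style model can patchify
to_mel = T.MelSpectrogram(sample_rate=sample_rate, n_fft=400,
                          hop_length=160, n_mels=128)
to_db = T.AmplitudeToDB()

spec = to_db(to_mel(waveform))  # shape (1, 128, ~1001): channel x mel bins x frames
print(spec.shape)               # ready to split into patches like any image
```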
If you’re trying to avoid getting frozen outside, here are a few indoor activities you may try:
Explore the infinitely recursive universe of the Game of Life in real time. Just zoom in/out and pan left/right.
Generate music in real time with Riffusion. Describe a musical prompt, and get a spectrogram image & sound.
Keep calm and steadily read the papers from Philosophical Foundations of Machine Intelligence.
Have a nice week.
10 Link-o-Troned
Google’s New, OSS Tool for Understanding & Visualising NLP Models
[Free course] High-Dimensional Probability Applied to Data Science
A Pythonista *Experience*
Scripting aRt
Deep & Other Learning Bits
ResearchDocs
El Robótico
data v-i-s-i-o-n-s
DataEng Wranglings
startups -> radar
ML Datasets & Stuff
Postscript, etc
Tips? Suggestions? Feedback? Email Carlos
Curated by @ds_ldn in the middle of the night.