Data Machina #185

SoTA massive text embeddings. SoTA universal image segmentation. A transformers catalog, 2023. Live-coding a GPT from scratch. Google Deep Learning Tuning Playbook. Meta AI RoBERTa

Jan 22, 2023

Massive Text Embeddings: What’s the Latest? It feels like ages since word2vec was the most popular model for word embeddings. Since the Cambrian explosion of Transformers and LLMs, the focus has shifted to massive text embedding models.

OpenAI’s text embeddings model is now one of the most popular. Here’s OpenAI’s intro on What are Text Embeddings? and Use Cases.
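
For readers new to the idea, here is a minimal sketch of what most embedding use cases (search, clustering, recommendations) boil down to: comparing vectors by cosine similarity. The 3-dimensional vectors and the document names are made up for illustration and are not output from any real model; real embeddings have hundreds or thousands of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_by_similarity(query_vec, doc_vecs):
    """Return document ids sorted from most to least similar to the query."""
    scores = {doc_id: cosine_similarity(query_vec, vec)
              for doc_id, vec in doc_vecs.items()}
    return sorted(scores, key=scores.get, reverse=True)

# Toy vectors invented for this example (NOT real model output).
docs = {
    "cats": [0.9, 0.1, 0.0],
    "dogs": [0.7, 0.3, 0.1],
    "stocks": [0.0, 0.2, 0.9],
}
query = [0.8, 0.2, 0.0]

ranking = rank_by_similarity(query, docs)
print(ranking)
```

In a real pipeline the toy vectors would be replaced by the vectors an embedding model returns for each text; the ranking step stays the same.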

See also these 4 interesting links:

When OpenAI released embeddings v.1 in Jan 2022, some NLP researchers challenged OpenAI’s SoTA claims.

Since OpenAI released text-embedding-ada-002 in Dec 2022, those NLP researchers have provided An Update: Is OpenAI’s text embedding really a new SoTA in dense text embeddings?

Many researchers say that running OpenAI’s text embeddings model is expensive. Others say we’ll have to wait and see until the text-embedding-ada-002 model is properly benchmarked against other leading models.

Another powerful embeddings model is Cohere AI’s multilingual-22-12 model. They claim that the model delivers 3X better performance than existing open-source models, and that it outperforms the next-best model in search tasks by 230%.

Until recently, SoTA for massive text embeddings had been achieved by Microsoft’s e5-large (which uses contrastive learning), and very notably by Google’s GTR and sentence-t5 models, which use the T5 Transformer model.

Mukilan @weights_biases wrote a great post on Exploring Google’s T5 Text-To-Text Transformer Model

Well, it seems that as of Jan 2023, INSTRUCTOR is the new SoTA model for massive text embeddings. In the paper One Embedder, Any Task: Instruction-Finetuned Text Embeddings (paper, code), the researchers claim that INSTRUCTOR beats all leading models at any embeddings task in any domain, simply by providing the task instruction, without any finetuning.
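
To make "providing the task instruction" concrete, here is a rough sketch of how inputs are shaped for an instruction-based embedder: each text is paired with a short natural-language instruction describing the task. The helper name build_instruction_pairs and the sample texts are hypothetical, and the instruction wording is only an example in the style the paper describes. The commented-out model call reflects my understanding of the authors' released package and should be checked against their repo before use.

```python
def build_instruction_pairs(instruction, texts):
    """Pair every text with the same task instruction (hypothetical helper)."""
    return [[instruction, text] for text in texts]

# Example instruction wording; an assumption, adapt it to your own task and domain.
retrieval_instruction = "Represent the Wikipedia document for retrieval:"

# Sample texts invented for illustration.
documents = [
    "word2vec learns word vectors from co-occurrence statistics.",
    "T5 casts every NLP problem as a text-to-text task.",
]

pairs = build_instruction_pairs(retrieval_instruction, documents)
print(pairs[0])

# Assumed usage with the authors' package (not run here, verify against the repo):
#   from InstructorEmbedding import INSTRUCTOR
#   model = INSTRUCTOR("hkunlp/instructor-large")
#   embeddings = model.encode(pairs)
```

The point of the design is that the same model can serve different tasks by changing only the instruction string, not the weights.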

10 Link-o-Troned

A Pythonista *Experience*

Scripting aRt

Deep & Other Learning Bits

ResearchDocs

El Robótico

data v-i-s-i-o-n-s

DataEng Wranglings

AI startups -> radar

ML Datasets & Stuff

Postscript, etc
