Data Machina #185
SoTA massive text embeddings. SoTA universal image segmentation. A transformers catalog, 2023. Live-coding a GPT from scratch. Google Deep Learning Tuning Playbook. Meta AI RoBERTa.
Massive Text Embeddings: What’s the Latest? It feels like ages since word2vec was the most popular model for word embeddings. Since the Cambrian explosion of Transformers and LLMs, the focus has shifted to massive text-embedding models.
OpenAI’s text embeddings model is now one of the most popular models. Here’s OpenAI’s intro on What are Text Embeddings? and Use Cases.
See also these 4 interesting links:
Build a semantic search engine for financial docs with OpenAI embeddings
When OpenAI released its embeddings v1 in Jan 2022, some NLP researchers challenged OpenAI’s SoTA claims.
But since OpenAI released text-embedding-ada-002 in Dec 2022, those NLP researchers have provided An Update: Is OpenAI’s text embedding really a new SoTA in dense text embeddings?
Many researchers say that running OpenAI’s text embeddings model is expensive. Others note that we’ll have to wait and see until text-embedding-ada-002 is properly benchmarked against other leading models.
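The pattern behind tutorials like the semantic-search one above is simple: embed every document once, embed the query, then rank documents by cosine similarity. Here’s a minimal, self-contained sketch of that ranking step, with toy 4-d vectors standing in for the real 1536-d vectors you’d fetch from an embeddings API such as text-embedding-ada-002 (the documents and numbers below are made up for illustration):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def search(query_vec, doc_vecs, doc_ids, top_k=2):
    """Rank documents by cosine similarity to the query embedding."""
    scores = [(doc_id, cosine_similarity(query_vec, v))
              for doc_id, v in zip(doc_ids, doc_vecs)]
    return sorted(scores, key=lambda s: s[1], reverse=True)[:top_k]

# Toy 4-d vectors standing in for real 1536-d ada-002 embeddings.
docs = {
    "10-K filing":   np.array([0.9, 0.1, 0.0, 0.2]),
    "earnings call": np.array([0.8, 0.2, 0.1, 0.1]),
    "cookie recipe": np.array([0.0, 0.9, 0.8, 0.1]),
}
query = np.array([0.85, 0.15, 0.05, 0.15])  # e.g. embedding of "annual report"
print(search(query, docs.values(), docs.keys()))
```

At scale you’d swap the linear scan for an approximate-nearest-neighbour index, but the cosine-ranking core stays the same.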
Another powerful embeddings model is Cohere AI’s multilingual-22-12. Cohere claims the model delivers 3x better performance than existing open-source models, and that it outperforms the next best model on search tasks by 230%.
Until recently, SoTA for massive text embeddings had been held by Microsoft’s e5-large (which uses contrastive learning), and very notably by Google’s GTR and sentence-t5 models, both built on the T5 Transformer.
Mukilan @weights_biases wrote a great post on Exploring Google’s T5 Text-To-Text Transformer Model
Well, it seems that as of Jan 2023, INSTRUCTOR is the new SoTA model for massive text embeddings. In the paper One Embedder, Any Task: Instruction-Finetuned Text Embeddings (paper, code), the researchers claim that INSTRUCTOR beats all leading models on any embedding task in any domain, simply by providing a task instruction, without any finetuning.
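The core idea is that INSTRUCTOR embeds the task instruction together with the text, so the same sentence gets a different, task-specialized vector per instruction (the authors’ repo exposes this roughly as `INSTRUCTOR('hkunlp/instructor-large').encode([[instruction, sentence]])`). A toy sketch of that instruction-conditioning interface, where `embed` is a deliberately dumb bag-of-tokens stand-in for the real encoder, not the actual model:

```python
import numpy as np

DIM = 16  # toy dimensionality, far smaller than a real model's

def embed(instruction: str, text: str) -> np.ndarray:
    """Toy instruction-conditioned embedding: bucket-count the tokens of
    the (instruction + text) pair, then L2-normalize. A hash-bag stand-in
    for a real instruction-finetuned encoder, just to show the interface."""
    vec = np.zeros(DIM)
    for token in f"{instruction} {text}".lower().split():
        vec[sum(ord(c) for c in token) % DIM] += 1.0
    return vec / np.linalg.norm(vec)

sentence = "The stock market fell sharply today"
v_retrieval = embed("Represent the news sentence for retrieval:", sentence)
v_cluster = embed("Represent the finance sentence for clustering:", sentence)

# Same text, different instruction -> a different task-specialized vector.
print(float(v_retrieval @ v_cluster))
```

The real model learns this conditioning end-to-end, of course; the point here is only that the instruction is part of the encoder input, which is why no per-task finetuning is needed.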
Have a nice week.
Thanks for reading. New to Data Machina?
[witty] Videochat with the Author of Broken Neural Scaling Laws
A Pythonista *Experience*
Deep & Other Learning Bits
Robust Blind Face Restoration with CodeFormer (paper, code, demo)
AI startups -> radar
ML Datasets & Stuff
Tips? Suggestions? Feedback? email Carlos
Curated by @ds_ldn in the middle of the night.