Data Machina

Data Machina #185

SoTA massive text embeddings. SoTA universal image segmentation. A transformers catalog, 2023. Live-coding a GPT from scratch. Google Deep Learning Tuning Playbook. Meta AI RoBERTa

Carlos
Jan 22, 2023
Massive Text Embeddings: What’s the Latest? It feels like ages since word2vec was the most popular model for word embeddings. Since the Cambrian explosion of Transformers and LLMs, the focus is on massive text embeddings models.

OpenAI’s text embeddings model is now one of the most popular. Here’s OpenAI’s intro on What are Text Embeddings? and Use Cases

See also these 4 interesting links:

  • OpenAI embeddings cookbook & example notebooks

  • How to use OpenAI's new text-embedding-ada-002 model

  • Build a semantic search engine for financial docs with OpenAI embeddings

  • How to build a Q&A engine with GPT3, embeddings & Datasette
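The semantic search use case in the links above comes down to ranking documents by how close their embedding vectors are to the query's vector. Here is a minimal sketch of that ranking step using cosine similarity. The 3-dimensional vectors and document names are made up for illustration; in practice the vectors would come from an embeddings model such as text-embedding-ada-002 and would be far longer.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_documents(query_vec, doc_vecs):
    """Return (doc_id, score) pairs sorted from most to least similar."""
    scored = [
        (doc_id, cosine_similarity(query_vec, vec))
        for doc_id, vec in doc_vecs.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy vectors, invented for this sketch (not produced by any real model).
DOC_VECS = {
    "quarterly_report": [0.9, 0.1, 0.0],
    "cooking_recipe": [0.0, 0.2, 0.9],
    "earnings_call": [0.8, 0.3, 0.1],
}
QUERY_VEC = [1.0, 0.2, 0.0]  # hypothetical embedding of a finance question

RANKING = rank_documents(QUERY_VEC, DOC_VECS)
for doc_id, score in RANKING:
    print(f"{doc_id}: {score:.3f}")
```

With these toy numbers the two finance documents rank above the recipe. A real search engine would swap the dictionary for a vector index, but the scoring idea stays the same.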

When OpenAI released embeddings v.1 in Jan 2022, some NLP researchers challenged OpenAI’s SoTA claims.

After OpenAI released text-embedding-ada-002 in Dec 2022, those researchers published An Update: Is OpenAI’s text embedding really a new SoTA in dense text embeddings?

Many researchers say that running OpenAI’s text embeddings model is expensive. Others say we’ll have to wait and see until the text-embedding-ada-002 model is properly benchmarked against other leading models.

Another powerful embeddings model is Cohere AI’s multilingual-22-12 model. They claim that the model delivers 3X better performance than existing open-source models, and that it outperforms the next-best model in search tasks by 230%.

Until recently, SoTA for massive text embeddings had been achieved by Microsoft’s e5-large (which uses contrastive learning), and very notably by Google’s GTR and sentence-t5 models, which use the T5 Transformer model.

Mukilan @weights_biases wrote a great post on Exploring Google’s T5 Text-To-Text Transformer Model

Well, it seems that as of Jan 2023, INSTRUCTOR is the new SoTA model for massive text embeddings. In the paper One Embedder, Any Task: Instruction-Finetuned Text Embeddings (paper, code), the researchers claim that INSTRUCTOR beats all leading models at any embeddings task in any domain simply by being given the task instruction, without any further finetuning.
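To make the "task instruction" idea concrete, here is a toy sketch of instruction-conditioned embedding: each text is paired with an instruction, and the pair is embedded together, so the same text gets a different vector per task. Everything here is an assumption for illustration. `toy_embed` is a word-count stand-in and not the INSTRUCTOR model, the instruction wording is hypothetical, and `build_inputs` is a helper invented for this sketch.

```python
from collections import Counter


def build_inputs(instruction, texts):
    """Pair every text with the same task instruction."""
    return [[instruction, text] for text in texts]


def toy_embed(pair, vocab):
    """Stand-in embedder (NOT INSTRUCTOR): counts vocabulary words across
    the instruction and the text, so the instruction shapes the vector."""
    instruction, text = pair
    counts = Counter((instruction + " " + text).lower().split())
    return [counts[word] for word in vocab]


TEXT = "interest rates rose again"
# Hypothetical instruction wording, chosen only for this example.
RETRIEVAL = "Represent the finance question for retrieval:"
CLASSIFY = "Represent the finance sentence for classification:"

# Toy vocabulary built from the example inputs themselves.
VOCAB = sorted(set((RETRIEVAL + " " + CLASSIFY + " " + TEXT).lower().split()))

VEC_RETRIEVAL = toy_embed(build_inputs(RETRIEVAL, [TEXT])[0], VOCAB)
VEC_CLASSIFY = toy_embed(build_inputs(CLASSIFY, [TEXT])[0], VOCAB)

# Same text, different instruction, different vector.
print(VEC_RETRIEVAL != VEC_CLASSIFY)
```

A real instruction-finetuned model learns how the instruction should change the representation; the word counts here only show where the instruction enters the pipeline.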

Have a nice week.


10 Link-o-Troned

  1. Google’s View on Language, Vision & Generative Models

  2. Karpathy - Let’s Live-Code a GPT from Scratch

  3. Transformer Models: An Intro & Catalog, 2023 Edition

  4. Survival Analysis Meets Reinforcement Learning @Spotify

  5. [witty] Videochat with the Author of Broken Neural Scaling Laws

  6. Interactive DataViz Code Generation with ChatGPT

  7. Large Transformer Model Inference Optimization

  8. Building a Chatbot for Q&A + Search with LangChain

  9. Fixing YouTube Search with OpenAI's Whisper

  10. a16z VC Fund on Who Owns the Generative AI Platform?



A Pythonista *Experience*

  1. [Unofficial] MS VALL-E (SoTA Text2Speech) in PyTorch

  2. Pre-training MetaAI FairSeq RoBERTa on Cloud TPU in PyTorch

  3. Deploying ML Web Apps with Gradio and Model-as-a-Service

Scripting aRt

  1. The Causality Revolver - Answers at Gunpoint (paper & code)

  2. [free e-book] Causal Inference in R (Jan 2023)

  3. [free e-book] Geocomputation Modelling in R

Deep & Other Learning Bits

  1. SoTA Universal Image Segmentation with Transformers

  2. Google Research - Deep Learning Tuning Playbook (2023)

  3. Which GPU(s) to Get for DL: My Experience & Advice

ResearchDocs

  1. Patches Are All You Need? A Simple CNN Beats the ViT

  2. Open-Set Grounded Text2Image Generation (paper, code, demo)

  3. Robust Blind Face Restoration with CodeFormer (paper, code, demo)

El Robótico

  1. NVIDIA DexTreme: Human-like In-Hand Manipulation

  2. Pretraining Quadrupeds and RL

  3. ICRA 2023 Humanoid Robot Wrestling Competition

data v-i-s-i-o-n-s

  1. An Interactive Browser of 400 Text Visualisations

  2. [Interactive] Analysing UK Politicians’ Financial Records

  3. Rise & Fall of Music Sales by Format (1973-2021)

DataEng Wranglings

  1. Copying Tesla’s Data Engine for a Food ML App

  2. [Free] Modern Data Engineering Zoomcamp

  3. Sketch - An AI code-assistant for pandas

AI startups -> radar

  1. MosaicML - An AI Platform Designed for LLMs

  2. Zest - AI for Better Credit & Lending Models

  3. Beam - Serverless Runtimes for AI Projects

ML Datasets & Stuff

  1. The European Social Survey Dataset, 25 Countries

  2. Dataset Distillation for Deep Learning: An In-Depth Review

  3. Visionner - A Python Toolkit for Image Datasets

Postscript, etc

Thanks for reading Data Machina! Subscribe for free to receive new posts every week

Tips? Suggestions? Feedback? email Carlos

Curated by @ds_ldn in the middle of the night.
