Data Machina #210
Improving Embeddings. Custom Embeddings. Create Your AI Companion. Google NotebookLM. Pinokio AI Browser. New Keras Core. Curated Transformers. HyperDreamBooth. Chameleon SOTA Txt2Img
Improving Embeddings. What’s the embedding distance between sparkling water and seltzer?… I successfully convince my colleagues to escape the beer o’clock trap at the Flying Horse Pub. We end up sitting on The Aviary’s fabulous rooftop terrace, overlooking The City skyline. It’s sunset, still really warm.
People are ordering sophisticated cocktails I’ve never heard of. I challenge the humans: “Can anyone tell me the similarities and differences among sparkling water, club soda, seltzer, and tonic water?” Nobody has a clue. Irina, a brilliant ML engineer, shouts: “Why bother? Let’s just try text-embedding-ada-002!” Everyone LOLs. Embeddings it is!
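Since Irina name-checked text-embedding-ada-002, here’s a minimal sketch of her idea using the pre-v1 openai Python client; the drink list and the cosine helper are my own, not any official recipe:

```python
# Hedged sketch: pairwise similarity of drink terms via text-embedding-ada-002.
# Assumes the pre-v1 openai client (openai.Embedding) and OPENAI_API_KEY set.
import numpy as np
import openai

drinks = ["sparkling water", "club soda", "seltzer", "tonic water"]
resp = openai.Embedding.create(model="text-embedding-ada-002", input=drinks)
vecs = np.array([item["embedding"] for item in resp["data"]])

def cosine(a, b):
    # ada-002 embeddings come unit-normalised, so a plain dot product would
    # do, but we normalise anyway to be explicit.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

for i in range(len(drinks)):
    for j in range(i + 1, len(drinks)):
        print(f"{drinks[i]:>15} vs {drinks[j]:<15} {cosine(vecs[i], vecs[j]):.3f}")
```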
So, what are embeddings? Basic to medium level. A nice mini book with code, going from the history of embeddings in recommender systems all the way to embeddings in production and embeddings as an engineering problem.
Free intro to embeddings mini course. A bit of a refresher on embeddings and the intuition behind them, with concise, clear explanations. At the end there’s a teaser on embetter, a new scikit-learn compatible lib for embeddings; a tiny usage sketch follows below.
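To give a flavour of that sklearn-compatible idea, here’s a minimal sketch; the toy texts, labels, and pipeline are my own illustration, assuming embetter’s SentenceEncoder wrapper around sentence-transformers:

```python
# Hedged sketch: embeddings as just another scikit-learn transformer via embetter.
# Assumes `pip install "embetter[text]"` (pulls in sentence-transformers).
from embetter.text import SentenceEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = ["great pub", "awful queue", "lovely terrace", "terrible beer"]
labels = [1, 0, 1, 0]  # toy sentiment labels, purely illustrative

# The encoder turns raw strings into dense vectors inside the pipeline.
pipe = make_pipeline(SentenceEncoder("all-MiniLM-L6-v2"), LogisticRegression())
pipe.fit(texts, labels)
print(pipe.predict(["what a view"]))
```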
Embeddings and vector search. My friend says that embeddings and vector DBs are a match made in ML heaven. Well, whatever it is, here are a few interesting links:
Explaining vector DBs in 3 levels of difficulty. A post on demystifying vector DBs across different backgrounds, and the intuition behind them.
A complete introduction to vector similarity search, including a review of vector search strategies and some common distance metrics (see the quick sketch after this list)
OpenAI Embeddings and Vector DB Crash Course using Postman API and SingleStore
Using Pinecone vector DB and embeddings to rapidly develop a semantic search app
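As a quick companion to the distance-metrics link above, here’s a small sketch of the three measures you’ll meet most often in vector DBs; the vectors are toy data:

```python
# Hedged sketch: the three similarity/distance measures most vector DBs offer.
import numpy as np

a = np.array([0.9, 0.1, 0.3])
b = np.array([0.8, 0.2, 0.5])

dot = a @ b                                              # inner product
cosine = dot / (np.linalg.norm(a) * np.linalg.norm(b))   # angle only, ignores magnitude
euclidean = np.linalg.norm(a - b)                        # straight-line distance

print(f"dot={dot:.3f} cosine={cosine:.3f} euclidean={euclidean:.3f}")
# For unit-normalised vectors the three agree on ranking:
# cosine == dot, and euclidean**2 == 2 - 2*dot.
```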
Customising embeddings. Depending on the use case and the domain data, using embedding models as-is may not produce good enough results. Customising embeddings is a handy trick for improving the output quality of an embedding model.
OpenAI customising embeddings. According to OpenAI researchers, "by using customised embeddings in binary classification use cases, we've seen error rates drop by as much as 50%.” Dive into the OpenAI Cookbook repo with this recipe for customising embeddings; a sketch of the core idea follows after this list.
Frozen models and custom embeddings. Another approach to custom embeddings is to reduce the embedding lifecycle complexity by using frozen models that output frozen base embeddings. These frozen embeddings can be reused and customised for different tasks. Check out: Customising reusable frozen ML-embeddings with Vespa.
A tutorial on custom vectors. In this tutorial, you’ll go through the process of using Weaviate with your own custom vectors. The tutorial uses 10 pre-vectorised Jeopardy questions in JSON format.
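The core idea behind the OpenAI cookbook recipe is simple: keep the base embeddings frozen and learn a small matrix that reshapes them so similarity scores match your labels. A minimal sketch below, with synthetic data and dimensions of my own choosing, not the cookbook’s exact code:

```python
# Hedged sketch: customise frozen embeddings by learning a linear transform M
# so that cosine similarity of (e @ M) matches pair labels. Toy data only.
import torch
import torch.nn.functional as F

n, d, k = 200, 1536, 256                 # pairs, base dim, customised dim (assumed)
e1, e2 = torch.randn(n, d), torch.randn(n, d)        # stand-ins for frozen embeddings
labels = torch.randint(0, 2, (n,)).float() * 2 - 1   # +1 similar, -1 dissimilar

M = torch.randn(d, k, requires_grad=True)
opt = torch.optim.Adam([M], lr=1e-3)

for _ in range(100):
    sim = F.cosine_similarity(e1 @ M, e2 @ M, dim=-1)
    loss = ((sim - labels) ** 2).mean()  # push similar pairs to +1, dissimilar to -1
    opt.zero_grad()
    loss.backward()
    opt.step()

print(f"final MSE loss: {loss.item():.3f}")
```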
No OpenAI embeddings here. I know some s/w engineers just boarding the AI gravy train who immediately assume that OpenAI’s embedding model is the only model in town. There are quite a few good alternatives out there, some as cheap as free:
Embeddings with open-source Chroma. Chroma is an embeddings DB. IMO it’s your best option if you want alternatives to OpenAI embeddings, like all-MiniLM-L6-v2, Sentence Transformers, instructor-embeddings, or the embeddings from the Cohere and Google PaLM APIs. A minimal Chroma sketch follows after this list.
Should you use OpenAI's embeddings? Probably not, and here's why. An opinionated post in which Diego makes some good points and discusses alternatives built around Sentence Transformers and Chroma.
GPT4All embeddings: contrastive learning + Sentence Transformers. The team behind GPT4All just published a new embedding model that uses a CPU-optimised, contrastively trained Sentence Transformer. The model performs on par with OpenAI embeddings. You can generate text embeddings for docs of arbitrary length for free, *without* an API key, on CPU at 8,000 tok/second.
How to use sentence-transformers to generate text embeddings locally. In this post, Saeed explores alternatives that allow him to run similar models locally instead of relying on OpenAI’s API. He shares his experience using Sentence Transformers and discusses the pros and cons.
Instructor: one embedding model for any task. In May, Meta AI et al. introduced an instruction-finetuned text embedding model that can generate text embeddings tailored to any task and domain without further training. Check out the paper, repo, and data here: One Embedder, Any Task: Instruction-Finetuned Text Embeddings
This is a great post in which Eric explains how to use the Instructor model for building a semantic vector search app with open-source Instructor, pgvector, and Flask; a minimal Instructor sketch follows below.
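For a feel of the Instructor API, here’s a minimal sketch; the instruction strings and toy texts are my own, following the pattern in the repo’s README:

```python
# Hedged sketch: task-specific embeddings with Instructor.
# Assumes `pip install InstructorEmbedding sentence-transformers`.
from InstructorEmbedding import INSTRUCTOR

model = INSTRUCTOR("hkunlp/instructor-large")

# Each input is an [instruction, text] pair; the instruction steers the embedding
# towards your task and domain, with no further training.
pairs = [
    ["Represent the drink description for retrieval:", "carbonated water with added minerals"],
    ["Represent the drink description for retrieval:", "quinine-flavoured sweetened soda"],
]
embeddings = model.encode(pairs)
print(embeddings.shape)  # (2, 768) for instructor-large
```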
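And for the Chroma option above, a minimal sketch; by default Chroma embeds documents locally with all-MiniLM-L6-v2 via sentence-transformers, so no OpenAI key is needed (the collection name and docs are mine):

```python
# Hedged sketch: a tiny semantic search with Chroma's default local embeddings.
import chromadb

client = chromadb.Client()                   # in-memory instance
drinks = client.create_collection("drinks")  # default: all-MiniLM-L6-v2, run locally

drinks.add(
    ids=["1", "2", "3"],
    documents=[
        "sparkling water: plain carbonated water",
        "tonic water: carbonated water with quinine and sugar",
        "club soda: carbonated water with added minerals",
    ],
)
print(drinks.query(query_texts=["fizzy water with no sugar"], n_results=2))
```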
Enhancing sentence embeddings. Pre-trained, fine-tuned language models are good at generalisation but vary in how sensitive they are to individual sentence components. SIFTER is a new method for improving sentence embeddings in tasks like semantic similarity and sentiment analysis. Paper: SIFTER: A Task-specific Alignment Strategy for Enhancing Sentence Embeddings.
Before I leave, let me share some cool stuff:
Image prompt injection. So funny… In its latest update, Google claims that Bard has become more powerful at coding, language, and image understanding… Well, it’s so good that some clever dude has come up with an image-to-prompt injection. I love the interweb LOL!
Hands-on live training session on NLP with LLMs. An excellent 2h17m session on NLP & LLMs using Hugging Face and PyTorch Lightning. Check out the video recording and repo here: NLP with GPT-4 and other LLMs.
Fine-tuning LLMs is a PITA. Fine-tuning models is still a pain in the ass: it requires complex setups, it’s expensive, and it demands expertise that most engineers don’t have yet. To address these challenges, the team at Monster API just introduced no-code LLM FineTuning.
Have a nice week.
10 Link-o-Troned
the ML Pythonista
Deep & Other Learning Bits
AI/ DL ResearchDocs
data v-i-s-i-o-n-s
MLOps Untangled
AI startups -> radar
ML Datasets & Stuff
Postscript, etc
Tips? Suggestions? Feedback? email Carlos
Curated by @ds_ldn in the middle of the night.