Data Machina

Data Machina #192

datamachina.substack.com

Multimodal ML. ChatKit OSS alt to ChatGPT. OpenXLA OSS ML compiler. Vision Transformer from scratch. MetaAI LLaMA on CPU. In-context Learning, SoTA methods. SoTA Universal Speech Model.

Carlos
Mar 12, 2023

The Latest Research in Multimodal ML. In 2022, the floodgates of multimodal ML research opened, and now we’re starting to see actual implementations and many releases of multimodal ML models. After the pandemonium unleashed by the collapse of Silicon Valley Bank, the second hottest topic in the Valley is that GPT-4 will be released next week, and it will be a multimodal model. Let’s talk Multimodal ML.

First, let me suggest three great resources on Multimodal ML:

  • Foundations and Trends in Multimodal ML

  • CMU Tutorial on Multimodal ML (slides, videos)

  • Multimodal Learning with Transformers: A Survey

Google probably popularised Multimodal ML at the CVPR 2016 Multimodal ML Tutorial, and later when they introduced MultiModel: Multi-Task ML Across Domains back in 2017.

Then Google published the Attention Is All You Need paper. The language transformer, and later the vision transformer, triggered a massive research trend. This eventually evolved into multimodal ML, combining large language models with vision transformers to solve multiple tasks.

Although less publicised, years later Salesforce BLIP (code, demo, paper) was one of the earlier unified vision-language understanding & generation models. At the time, the model achieved SoTA results on a wide range of vision-language tasks.

In parallel, Microsoft Research unified several speech, language, and text tasks into one single model: SpeechT5. @Matthijs wrote a nice post on Speech Synthesis, Recognition, and More with SpeechT5.

DeepMind’s Flamingo is another pioneering multimodal model that combined a pretrained vision encoder and a pretrained language model. A team @HuggingFace posted what they learned in the last 3 months reproducing Flamingo.

Facebook AI Research has been investing in multimodal ML for several years. Two great examples are MMF, a modular framework for vision & language multimodal research, and TorchMultimodal, a PyTorch library for training SoTA multimodal, multi-task models at scale.

In February, MS Research introduced Kosmos-1, a Multimodal Large Language Model (MLLM). The model can perceive general modalities, learn in context, and follow instructions. MS researchers say that Kosmos-1 achieves impressive performance on language understanding, multimodal dialogue, visual Q&A, and many more tasks.

Last week, a team @ICL_Uni & NVIDIA open-sourced Prismer: A Vision-Language Model with Multi-Modal Experts (paper, code). The researchers claim the model is competitive with current SoTA whilst requiring up to two orders of magnitude less training data. You don’t need forms or approvals to get the code ;-)

Earlier this week, Google published PaLM-E: An Embodied Multimodal Language Model (paper, demo), which is an embodied robotics model that combines vision and perception models with the PaLM language model; the largest variant is PaLM-E-562B. Google says that PaLM-E is a visual-language generalist with SoTA performance.

And three days ago, Microsoft introduced Visual ChatGPT. The model combines ChatGPT and a series of Visual Foundation Models to enable sending and receiving images during a chat. See the demo and examples in the link above. It’s pretty amazing. GPT-4 will be multimodal, and Kosmos-1 and Visual ChatGPT may be behind it.

Enduring the snowy and icy weather? Stay inside, I have some suggestions:

  • The AI long read - An increasing number of journalists and certain researchers are engaged in a campaign on AI misalignment and AGI existential threat. I enjoyed reading: The Hot Mess Theory of AI Misalignment: More Intelligent Agents Behave Less Coherently

  • The AI-generated puzzle - After a few hours of prompting, Daniel created this unique puzzle with ChatGPT

  • The AI experiment - See if you can run a minimal instance of OpenAI Whisper fully in the browser

Have a nice week.

10 Link-o-Troned

  1. Introducing ChatKit: The 1st Open Source Alt to ChatGPT

  2. OpenAI's Whisper Speech model - An Overview

  3. Five ML Patterns in Fraud Detection & Content Moderation

  4. Automated Data Drift Detection @Uber

  5. OpenXLA - A Fully Open Source ML Compiler Stack

  6. [Hello?] Online Gradient Descent in SQL

  7. ML Papers Explained: A Long Reading List

  8. A Review of the New Cohere Summarisation Endpoint

  9. The State of Competitive ML

  10. phind - A Free AI Search Engine for Developers


the ML Pythonista

  1. The Vision Transformer (ViT) in PyTorch from Scratch

  2. Run LLaMA (Large Language Model Meta AI) on CPU

  3. Model Too Big? Shard Large Models with Tensor Parallelism

the ML codeR

  1. AI Pair Programming in R with Copilot

  2. tidypredict - Run Predictions Inside SQL DBs

  3. A ChatGPT Coding Assistant for RStudio

Deep & Other Learning Bits

  1. An Easy Interface for In-Context Learning, with SoTA Methods

  2. [Tutorial] In-Context Learning with LlamaIndex & Unstructured

  3. Scaling Up GANs for Text-to-Image Synthesis

AI/ DL ResearchDocs

  1. Google Brain - Foundation Models for Decision Making

  2. Large Language Models Do In-Context Learning Differently

  3. Universal Speech Model (USM): SoTA Speech AI for 100+ languages

El Robótico

  1. Google PaLM-E: An Embodied Multimodal Language Model

  2. Inverse Reinforcement Learning: Learning from Human Experts

  3. Robot Imitation from 1 Minute of Demonstrations (paper, code)

data v-i-s-i-o-n-s

  1. Sorry Chelsea, Money Doesn't Buy Success

  2. An Advanced Data Visualisation Generator

  3. Stata Viz: A Browsable Portfolio of Data Visualisations

MLOps Untangled

  1. Optimise Model Performance with MLflow & Hyperopt

  2. Detecting Data Drift to Monitor ML Models in Prod

  3. MLOps Orchestration: Kubeflow vs. Airflow vs. Prefect

AI startups -> radar

  1. Uptrain - Track ML Models Performance in Real-Time

  2. AIBerry - AI for Mental Health Screening

  3. PlusOne - Robotics for Advanced Parcel Handling

ML Datasets & Stuff

  1. The Office Lines Dataset

  2. Meta AI Casual Conversations Dataset v.2

  3. FSVVD: A Dataset of Full Scene Volumetric Video

Postscript, etc

Enjoyed this post? Tell your friends about Data Machina. Thanks for reading.

Tips? Suggestions? Feedback? email Carlos

Curated by @ds_ldn in the middle of the night.
