Data Machina #192
Multimodal ML. ChatKit OSS alt to ChatGPT. OpenXLA OSS ML compiler. Vision Transformer from scratch. MetaAI LLaMA on CPU. In-context Learning, SoTA methods. SoTA Universal Speech Model.
The Latest Research in Multimodal ML. In 2022 the floodgates of Multimodal ML research opened. And now we’re starting to see actual implementations and many releases of multimodal ML models. Other than the pandemonium unleashed by the collapse of Silicon Valley Bank, the second hottest topic in the Valley is the rumour that GPT-4 will be released next week, and that it will be a multimodal model. Let’s talk Multimodal ML.
First, let me suggest three great resources on Multimodal ML:
Then Google published the “Attention Is All You Need” paper. The language transformer, and later the vision transformer, triggered a massive research trend that eventually evolved into multimodal ML: combining Large Language Models with vision transformers to solve multiple tasks.
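The core of that paper is scaled dot-product attention. Here is a minimal, single-head NumPy sketch of the idea (a toy illustration, not the paper’s full multi-head, learned-projection version):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Toy single-head attention: softmax(Q K^T / sqrt(d)) V."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                   # pairwise similarity scores
    scores -= scores.max(axis=-1, keepdims=True)    # subtract max for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax: each row sums to 1
    return weights @ V, weights                     # weighted mix of values + the weights

# 4 tokens with 8-dim embeddings; using x for Q, K and V gives self-attention
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
out, w = scaled_dot_product_attention(x, x, x)
print(out.shape)  # (4, 8): one mixed representation per token
```

Each output row is a convex combination of the value rows, with weights given by query–key similarity; that is the whole trick the transformer family builds on.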
A few years later, and although less publicised, Salesforce BLIP (code, demo, paper) was one of the earliest unified vision-language understanding & generation models. At the time, it achieved SoTA results on a wide range of vision-language tasks.
In parallel, Microsoft Research unified several speech, language, and text tasks into a single model: SpeechT5. @Matthijs wrote a nice post on Speech Synthesis, Recognition, and More with SpeechT5.
DeepMind’s Flamingo is another pioneering multimodal model, combining a pretrained vision encoder with a pretrained language model. A team @HuggingFace posted what they learned over the last 3 months reproducing Flamingo.
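The fusion idea behind Flamingo can be sketched in a few lines of NumPy: text tokens act as queries that cross-attend over frozen image features, and the result is added back into the text stream. This is a toy stand-in for Flamingo’s learned, gated cross-attention layers — the projections here are random placeholders, whereas in the real model they are trained while both pretrained encoders stay frozen:

```python
import numpy as np

rng = np.random.default_rng(42)
d = 16                          # shared embedding width (toy size)
img = rng.normal(size=(9, d))   # 9 "frozen" image-patch features from a vision encoder
txt = rng.normal(size=(5, d))   # 5 text-token embeddings from a language model

# Projections would be learned in a real model; random placeholders here.
Wq, Wk, Wv = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))

Q, K, V = txt @ Wq, img @ Wk, img @ Wv
scores = Q @ K.T / np.sqrt(d)                # (5, 9): each text token scores every patch
scores -= scores.max(axis=-1, keepdims=True) # numerical stability before softmax
A = np.exp(scores)
A /= A.sum(axis=-1, keepdims=True)           # attention over image patches
fused = txt + A @ V                          # residual add of visual context into text
print(fused.shape)  # (5, 16): text tokens, now visually conditioned
```

Because the visual information enters only through these inserted cross-attention blocks, the expensive pretrained vision and language backbones never need to be retrained — which is exactly why this recipe became so popular.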
Facebook AI Research has been investing in multimodal ML for several years. Two great examples: MMF, a modular framework for vision & language multimodal research, and TorchMultimodal, a PyTorch library for training SoTA multimodal, multi-task models at scale.
In February, MS Research introduced Kosmos-1, a Multimodal Large Language Model (MLLM). The model can perceive general modalities, learn in context, and follow instructions. MS researchers say that Kosmos-1 achieves impressive performance on language understanding, multimodal dialogue, visual Q&A, and many other tasks.
Last week, a team @ICL_Uni & NVIDIA open-sourced Prismer: A Vision-Language Model with Multi-Modal Experts (paper, code). The researchers claim the model is competitive with the current SoTA whilst requiring up to two orders of magnitude less training data. You don’t need forms or approvals to get the code ;-)
Earlier this week, Google published PaLM-E: An Embodied Multimodal Language Model (paper, demo), an embodied robotics model that combines vision and perception models with the PaLM LM (the largest variant being PaLM-E-562B). Google says that PaLM-E is a visual-language generalist with SoTA performance.
And three days ago, Microsoft introduced Visual ChatGPT. The model combines ChatGPT with a series of Visual Foundation Models to enable sending and receiving images during a chat. See the demo and examples in the link above; it’s pretty amazing. If GPT-4 does turn out to be multimodal, Kosmos-1 and Visual ChatGPT may hint at what’s behind it.
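The Visual ChatGPT architecture is essentially a “prompt manager” that routes each chat turn either to the language model or to one of the visual tools. Here is a heavily simplified toy sketch of that routing idea — the tool names, the routing rule, and both stand-in functions are invented for illustration, not taken from Microsoft’s code:

```python
# Toy sketch of the Visual ChatGPT idea: a prompt manager decides whether
# a turn goes to the chat LLM or to a visual foundation model (a "tool").

def fake_llm(text):
    """Stand-in for the chat LLM (hypothetical)."""
    return f"LLM answer to: {text}"

def fake_image_captioner(image_path):
    """Stand-in for a visual foundation model, e.g. an image captioner (hypothetical)."""
    return f"a caption for {image_path}"

TOOLS = {"caption": fake_image_captioner}

def prompt_manager(turn):
    """Route a turn like 'caption: cat.png' to a tool, everything else to the LLM."""
    if ":" in turn:
        tool, _, arg = turn.partition(":")
        if tool.strip() in TOOLS:
            return TOOLS[tool.strip()](arg.strip())
    return fake_llm(turn)

print(prompt_manager("caption: cat.png"))  # routed to the visual tool
print(prompt_manager("hello there"))       # routed to the LLM
```

The real system replaces this string-matching rule with the LLM itself deciding, via its prompt, which tool to invoke — but the dispatch structure is the same.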
Enduring the snowy and icy weather? Stay inside, I have some suggestions:
The AI long read - An increasing number of journalists, and certain researchers, are engaged in a campaign about AI misalignment and the existential threat of AGI. I enjoyed reading: The Hot Mess Theory of AI Misalignment: More Intelligent Agents Behave Less Coherently
The AI-generated puzzle - After a few hours of prompting, Daniel created this unique puzzle with ChatGPT.
The AI experiment: See if you can run a minimal instance of OpenAI Whisper fully in the browser.
Have a nice week.
the ML Pythonista
the ML codeR
Deep & Other Learning Bits
AI/ DL ResearchDocs
AI startups -> radar
ML Datasets & Stuff
Tips? Suggestions? Feedback? email Carlos
Curated by @ds_ldn in the middle of the night.