Data Machina #254
State of AI Coding Agents. SWE-Agent. Amazon Q. Devin. OpenDevin. Devika. Blackbox AI. GPT-Engineer. ChatDev. KHOJ Personal AI Agents. Perplexica. CogVLM2. World Models.
On the State of AI Coding Agents. “How could we start using AI to migrate years of messy, flimsy legacy code to a modern stack? ... Perhaps an AI Code Migration Agent ???”
We’re doing AI chat & espresso at Level 39, One Canada Square. James, a veteran CTO with all the scars, is asking these rather funny, rhetorical questions. There is a deep silence in the room, pensive faces around. Everyone is staring through the massive windows overlooking The City skyline as the sunset strikes. We wonder in perplexity, in the very philosophical and information theory sense, whether AI Coding Agents are fully ready for such tasks in prod and, if not, when they will be…
Are AI Coding agents any good at solving real-world coding issues autonomously? The team at Princeton Language & Intelligence (PL&I) has come up with SWE-bench, a benchmark for evaluating AI coding agents (paper, code, benchmark). It turns out that current AI Agents are not achieving very good scores in this benchmark yet.
The PL&I team also open sourced SWE-agent, an agent that turns LMs (e.g. GPT-4) into software engineering agents that can fix bugs and issues in real GitHub repositories. Check out the video below for a good hands-on deep dive.
Amazon Q. The SWE-bench leaderboard is constantly changing but it seems that Amazon Q Developer Agent is for now leading the pack. Amazon Q is a closed model and not very popular in the AI community. It was able to successfully solve only 13.8% out of 2294 tasks. Not a lot really! Here is a vid with a deep dive on Amazon Q.
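To put that percentage into absolute terms, here is a quick back-of-the-envelope sketch. It only uses the two figures quoted above (13.8% and 2294 tasks); since the percentage is rounded, the resulting count is an approximation, not an official leaderboard number.

```python
# Figures quoted in the text: 13.8% of 2294 SWE-bench tasks.
TOTAL_TASKS = 2294
RESOLVED_RATE = 0.138  # rounded percentage, so the count below is approximate


def resolved_count(total: int, rate: float) -> int:
    """Approximate number of resolved tasks implied by a rounded rate."""
    return round(total * rate)


solved = resolved_count(TOTAL_TASKS, RESOLVED_RATE)
unsolved = TOTAL_TASKS - solved
print(f"roughly {solved} solved, roughly {unsolved} unsolved")
```

In other words, the leading agent still leaves close to two thousand of the benchmark's real GitHub issues unresolved.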
Devin the 1st Autonomous AI Engineer? In March, Cognition Labs announced Devin, the world’s first fully autonomous AI software engineer. Cognition Labs claimed they were setting a new state of the art on the SWE-bench coding benchmark. Devin went viral but then people in the AI community exposed some tricks used in Devin’s demo. Watch the video below to understand the good, the bad and the ugly of Devin.
Open source AI community to the rescue: OpenDevin. This started as a small side project and has quickly become one of the most popular AI coding agent projects. OpenDevin agents collaborate with human developers to write code, fix bugs, and ship features. It is probably one of the best open source AI software engineers for developing apps. OpenDevin now reports 21% on SWE-bench, claimed as the highest score. Check out the overview below:
Devika, MIT licensed. Devika is an Agentic AI Software Engineer that can understand high-level human instructions, break them down into steps, research relevant information, and write code to achieve the given objective. Check out the vid below.
Blackbox AI Coding Agents IDE. This is a (still free) pretty amazing AI Coding Agents IDE that comes packed with agents specialised in 30+ programming languages. The agents can perform natural language to code, chat to code, image to code, plus many s/w engineering tasks like: bug fixing, unit testing, code translation, API integration, code documentation, code optimisation… Apparently millions of devs use it. Check out Blackbox AI’s playground, agents and features here.
GPT-Engineer. Another popular open source AI engineer, gpt-engineer lets you: 1) specify software in natural language, 2) sit back and watch as an AI writes and executes the code, and 3) ask the AI to implement improvements. It seems, though I am not sure, that this project is starting to fall behind other similar projects. In this video Arjan asks: “Is GPT Engineer Actually Useful?”
ChatDev Virtual Software Company. This is an amazing OSS virtual software company that operates through various intelligent agents holding different roles, including CEO, CPO, CTO, programmer, reviewer, tester, art designer… These agents form a multi-agent organisational structure and are united by a mission to "revolutionise the digital world through programming." The agents within ChatDev collaborate by participating in specialised functional seminars, including tasks such as designing, coding, testing, and documenting. Check out the repo and paper: ChatDev Multi-Agent Collaborative Software Development.
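To make the "roles plus functional seminars" idea a bit more concrete, here is a toy sketch of a role-based, phase-by-phase pipeline. It is not ChatDev's code or API: the `Phase`, `Project`, `run_phase` and `run_company` names are hypothetical, the assignment of roles to each phase is an assumption for illustration, and the LLM conversation is replaced by a stub that just logs who took part.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Phase:
    name: str          # e.g. "designing"
    roles: List[str]   # roles assumed to take part in this phase


@dataclass
class Project:
    idea: str
    log: List[str] = field(default_factory=list)


# The phase names come from the text; which roles join which phase is assumed.
PHASES = [
    Phase("designing", ["CEO", "CPO", "CTO"]),
    Phase("coding", ["CTO", "programmer", "art designer"]),
    Phase("testing", ["programmer", "reviewer", "tester"]),
    Phase("documenting", ["CEO", "CPO", "programmer"]),
]


def run_phase(project: Project, phase: Phase) -> Project:
    """Stub for one seminar: a real system would have the roles chat via an LLM."""
    for role in phase.roles:
        project.log.append(f"[{phase.name}] {role} contributed")
    return project


def run_company(idea: str) -> Project:
    """Run every phase in order, passing the same project along the chain."""
    project = Project(idea)
    for phase in PHASES:
        run_phase(project, phase)
    return project


if __name__ == "__main__":
    for line in run_company("a simple todo app").log:
        print(line)
```

The point of the sketch is only the shape: a fixed sequence of phases, each owned by a small group of roles, with the output of one phase feeding the next.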
Other less ambitious AI Coding Agents for specific s/w engineering tasks.
PR-Agent automates the review and analysis of pull requests, and generates feedback and suggestions.
What The Diff automatically writes pull request descriptions, sends out summarised notifications to non-technical stakeholders in the loop, and helps you to refactor minor issues during the review.
Cover Agent automatically generates qualified tests to enhance existing test suites and help increase code coverage efficiently.
Have a nice week.
10 Link-o-Troned
CogVLM2 - An OSS VL Model with Chat Skills that Beats GPT-4V
Meta Tutorials: How to Run Llama-3 on Linux, Windows & MacOS
the ML Pythonista
Deep & Other Learning Bits
Extracting Millions of Interpretable Features from AI Models in Prod
[tutorial] PyCon US2024 The Fundamentals of Modern DL with PyTorch
AI/ DL ResearchDocs
Is Sora a World Simulator? A Survey on General World Models and Beyond
Pandora: On-the-fly World Model VideoGen with NL (paper, code, demo)
An Atari RL Agent Trained in a Diffusion World Model (paper, code, game)
MLOps Untangled
OSS Netflix Metaflow v 2.11 - Easily Build & Manage AI/ML Projects
Breaking Down Workflow Orchestration and Pipeline Authoring in MLOps
ML Datasets & Stuff
BEHAVIOR Vision Suite: Customizable Dataset Generation via Simulation
Curating Custom Datasets for LLM Training with OSS NVIDIA NeMo Curator
Postscript, etc
Tips? Suggestions? Feedback? email Carlos
Curated by @ds_ldn in the middle of the night.