uv tool list
# yt2doc v0.3.0
# - yt2doc
yt2doc --video https://www.youtube.com/watch\?v\=nL9jEy99Nh0 --whisper-backend whisper_cpp --whisper-cpp-executable $HOME/Development/whisper.cpp/main --whisper-cpp-model $HOME/Development/whisper.cpp/models/ggml-large-v3-turbo.bin --sat-model sat-9l -o examples --segment-unchaptered --llm-model gemma2 --timestamp-paragraphs
https://www.youtube.com/watch?v=nL9jEy99Nh0
(0:00:00) Okay, thank you, Mathieu, for the very kind introduction. Yeah, like, before we start building AGI, we need to ask ourselves the hard questions. What is intelligence? How can we measure it? Benchmark progress. And what are the directions that we should follow to build it? So I just want to give you my take on these questions. And I'm going to start by taking you back to peak AGI hype, which was early last year. Do you remember what February 2023 felt like? ChatGPT had been released just a couple months prior. GPT-4 just came out. Bing Chat just came out. It was the Google killer. Anyone remember Bing Chat? And we were told that ChatGPT would make us 100x more productive, 1000x more productive. That it would outright replace us. The existential risk of AI was becoming front page news. And AGI was just around the corner. It was no longer 10 years away, not even five years away. It was just a couple years away. You could start the countdown in months. And that was one and a half years ago.
(0:01:21) And clearly back then, AI was coming for your job right away. It could do anything you could, but faster and cheaper.
(0:01:21) And how did we know this? Well, it could pass exams. And these exams are the way that we tell whether other humans are fit to perform a certain job. If AI passes the bar exam, then it can be a lawyer. If it can solve programming puzzles, then it can be a software engineer, and so on. So many people were saying that all lawyers, all software engineers, all doctors and so on were going to be out of a job. Maybe even within the next year, which would have been today. In fact, most desk jobs were going to disappear. And we faced mass unemployment. So it's very funny to think about it, because today the employment rate in the U.S. is actually higher than it was at the time.
(0:02:16) So was that really true? You know, was it really what the benchmarks were telling us back then?
(0:02:16) So if you go back to the real world, you know, away from the headlines, away from the February 2023 hype, it seems that LLMs might be a little bit short of general AI. I'm sure most of you in this room would agree with that. They suffer from some problems. And these limitations are inherent to curve fitting. They're inherent to the approach that we are using to build these models. So they're not easy to patch. And in fact, there's been basically no progress on these limitations since day one. And day one was not last year. It was actually when we started using these transformer-based large language models over five years ago. So over five years ago, and we've not really made any progress on these problems because the models we are using are still the same. They are parametric curves fitted to a dataset via gradient descent, and they're still using the same transformer architecture. So I'm going to cover these limitations. I'm not actually going to cover hallucinations because all of you are probably very familiar with them. But let's take a look at the other ones.
(0:03:30) So to start with, an interesting issue with LLMs is that because they are autoregressive models, they will always output something that seems likely to follow your question without necessarily looking at the contents of your question. So for instance, for a few months after the original release of ChatGPT, if you asked what's heavier, ten kilos of steel or one kilo of feathers, it would answer that they weigh the same. And it would answer that because the trick question, what's heavier, one kilo of steel or one kilo of feathers, is found all over the Internet. And the answer is, of course, that they weigh the same. And so the model would just pattern match the question without actually looking at the numbers, without parsing the actual question you're asking. And the same if you provide a variation of the Monty Hall problem, which is the screenshot right here. The LLM has memorized perfectly the canonical answer to the actual Monty Hall problem. So if you ask a modified variation, it's just going to go right through it and output the answer to the original problem.
(0:04:39) So to be clear, these two specific problems, they've already been patched via RLHF, but they've been patched by special-casing them. And it's very, very easy to find new problems that still fit this failure mode. So you may say, well, you know, these examples are from last year. So surely today we are doing much better. And in fact, no, we are not. The issues have not changed since day one. We've not made any progress towards addressing them. They still plague the latest state-of-the-art models, like Claude 3.5, for instance. This is a paper from just a few days ago, from last month, that actually investigates some of these examples in state-of-the-art models, including Claude 3.5.
(0:05:24) So a closely related issue is the extreme sensitivity of LLMs to phrasing. If you change names or places, variable names, in any text paragraph, it can break LLM performance. Or if you change numbers in a formula. There's an interesting paper that investigates this, so you can check it out. It's called Embers of Autoregression. And people who are very optimistic, they would say that this brittleness is actually a great thing. Because it means that your models are more performant than you know. You just need to query them in the right way and you will see better performance. You just need prompt engineering. And the counterpart to that statement is that for any LLM, for any query that seems to work, there is an equivalent rephrasing of the query that a human would readily understand that will break it. And to what extent do LLMs actually understand something if you can break their understanding with very simple renaming and rephrasing? It looks a lot more like superficial pattern matching than robust understanding. For instance, LLMs seem to be able to solve a Caesar cipher. But as it turns out, they can only solve it for very specific values of the key size, the specific values like 3 and 5 that you find commonly in online examples. If you show it an example of a cipher with a key size like 13, for instance, it will fail. So it has no actual understanding of the algorithm for solving the cipher. It has only memorized it for very specific values of the key size, right?
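As an aside (not from the talk): the "algorithm for solving the cipher" is just a shift that works for any key, which is exactly what memorizing the key-3 and key-5 cases does not give you. A minimal sketch:

```python
# General Caesar-cipher decoder: works for any key size (3, 5, 13, ...),
# unlike the memorized special cases described above.
import string

def caesar_decode(ciphertext: str, key: int) -> str:
    alphabet = string.ascii_lowercase
    out = []
    for ch in ciphertext:
        low = ch.lower()
        if low in alphabet:
            shifted = alphabet[(alphabet.index(low) - key) % 26]
            out.append(shifted.upper() if ch.isupper() else shifted)
        else:
            out.append(ch)  # leave punctuation and spaces untouched
    return "".join(out)

print(caesar_decode("Uryyb, jbeyq!", 13))  # -> Hello, world!
```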
(0:08:17) So my hypothesis is that LLM performance depends purely on task familiarity and not at all on task complexity. There's not really any complexity ceiling to what you can get LLMs to solve, to memorize, as long as you give them the opportunity to memorize the solution or the problem solving template, the program, right, that you have to run to get the answer. Instead, LLM performance depends entirely on task familiarity. And so even very simple problems, if they are unfamiliar, will be out of reach.
(0:08:53) Okay, and here's the kicker. Lastly, LLMs suffer from a generalization issue even with the programs that they did, in fact, memorize. So some examples include the fact that LLMs have trouble with number multiplication, as you probably know, with list sorting and so on, even though they've seen millions of examples of these problems. So they typically have to be aided by external symbolic systems in order to handle these things.
(0:09:23) There's an interesting paper that investigates how LLMs handle composition. So it's titled "The Limits of Transformers on Compositionality." And the main finding is that LLMs do not actually handle composition at all. What they're doing instead is linearized subgraph matching.
(0:09:44) There's another paper that's also very intriguing. It's "The Reversal Curse." So the authors found that if you train an LLM on content like "A is B," it cannot actually infer the reverse, "B is A." So that's really, you know, a breakdown of generalization on a deep level. And that's actually quite surprising. Like, even I, I'm typically pretty skeptical of LLMs, and I was very surprised by this result.
(0:10:11) So one thing to keep in mind about these failure cases is that specific queries will tend to get fixed relatively quickly, because the models are being constantly fine-tuned on new data collected from human contractors based on past query history. And so many of the examples that I show in my slides are probably already working with some of the state-of-the-art models, because they failed in the past and so they've been manually addressed since. But that's a very brittle way of making progress because you're only addressing one query at a time. And even for a query that you patched, if you rephrase it or if you change names and variables, it's going to start failing again. So it's a constant game of whack-a-mole. And it's very, very heavily reliant on human labor.
(0:10:56) Today there are probably, you know, between 10,000 and 30,000 humans that work full-time on creating annotated data to train LLMs. So, you know, on balance, it seems a little bit contradictory. Like, on one hand, LLMs are beating every human benchmark that you throw at them. And on the other hand, they're not really demonstrating a robust understanding of the things they are doing. So, to solve this paradox, you have to understand that skill and benchmarks are not the primary lens through which you should look at these systems.
(0:11:33) So, let's zoom out by a lot. There have been historically two currents of thought to define the goals of AI. First, there's the Minsky-style view, which echoes the current big tech view, that AGI would be a system that can perform most economically valuable tasks. So, Minsky said AI is the science of making machines capable of performing tasks that would require intelligence if done by humans. So, it's very task-centric. You care about whether the AI does well on a fixed set of tasks. And then there's the McCarthy view. So, he didn't exactly say what I'm quoting here, but he was a big proponent of these ideas, that generality in AI is not task-specific performance scaled to many tasks. It's actually about getting machines to handle problems they have not been prepared for. And that difference, it echoes the Locke view of intelligence versus the Darwin view of intelligence. So, intelligence as a general-purpose learning mechanism versus intelligence as a collection of task-specific skills imparted to you by evolution. And my view is more like the Locke and McCarthy view. I see intelligence as a process. And skill, task-specific skill, is the output of that process. This is a really important point. If there's just one point you take away from this talk, it should be this. Skill is not intelligence. And displaying skill at any number of tasks does not show intelligence. It's always possible to be skillful at any given task without requiring any intelligence. And this is like the difference between having a road network versus having a road-building company. If you have a road network, then you can go from A to B for a very specific set of A's and B's that are defined in advance. But if you have a road-building company, then you can start connecting arbitrary A's and B's on the fly as your needs evolve. So, attributing intelligence to a crystallized behavior program is a category error. You are confusing the output of the process with the process itself. Intelligence is the ability to deal with new situations, the ability to blaze fresh trails and build new roads. It's not the road. So, don't confuse the road and the process that created it. And all the issues that we are facing today with LLMs, they are a direct result of this misguided conceptualization of intelligence. The way we define and measure intelligence is not a technical detail that you can leave to externally provided benchmarks. It reflects our understanding of cognition. So, it reflects the questions that you're asking. And through that, you know, it also kind of limits the answers that you could be getting. The way that you measure progress is really the feedback signal that you use to get closer to your goals. If you have a bad feedback signal, you're not going to make progress towards actual generality.
(0:14:47) So, here are some key concepts that you have to take into account if you want to define and measure intelligence.
(0:14:53) The first thing to keep in mind is the distinction between static skill and fluid intelligence. So, between having access to a large collection of static programs to solve known problems, like what an LLM would do, versus being able to synthesize brand new programs on the fly to solve a problem you've never seen before. So, it's not a binary, right, where either you have fluidity or you don't. It's more like a spectrum, but there's higher intelligence on the right side of the spectrum.
(0:15:24) And the second concept is operational area. There's a big difference between being skilled only in situations that are very close to what you're familiar with, versus being skilled in any situation within a broad scope. So, for instance, if you know how to add numbers, then you should be able to add any two numbers, not just specific numbers that you've seen before or numbers close to them. If you know how to drive, then you should be able to drive in any city. You should even be able to, you know, learn to drive in the US and then move to London and drive in London, where you're driving on the other side of the road. If you know how to drive, but only in very specific geofenced areas, you know, that's less intelligent. So, again, there's a spectrum here. It's not a binary, but there's higher intelligence on the higher generalization side of the spectrum.
(0:16:18) And lastly, the last concept is information efficiency. How much information, how much data was required for your system to acquire a new skill program? If you're more information efficient, you are more intelligent. And so, all these three concepts, these three quantities, they're linked by the concept of generalization. And generalization is really the central question in AI, not skill. Forget about skill, forget about benchmarks. And that's really the reason why using human exams to evaluate AI models is a terrible idea. Because exams were not designed with generalization in mind. Or rather, you know, they were designed with generalization assumptions that are appropriate for human beings, but are not appropriate for machines. You know, most exams assume that humans haven't read the exam questions and the answers beforehand. They assume that the questions you're going to be asking are going to be at least somewhat unfamiliar to the test taker. Unless it's a pure memorization exam, in which case, it would make sense that LLMs could ace it since they've memorized the entire internet.
(0:17:28) So, to get to the next level of capabilities, we've seen that we want AI to have the ability to adapt, to generalize to new situations that it has not been prepared for. And to get there, we need a better way to measure this ability, because it's by measuring it that we'll be able to make progress. We need a feedback signal. So, in order to get there, we need a clear understanding of what generalization means. Generalization is the relationship between the information you have, like the priors that you're born with, and the experience that you've acquired over the course of your lifetime. And your operational area over the space of potential future situations that you might encounter as an agent. And they are going to feature uncertainty, they're going to feature novelty, they're not going to be like the past. And generalization is basically the efficiency with which you operationalize past information in order to deal with the future. So, you can interpret it as a conversion ratio. If you enjoy math, you can, in fact, use algorithmic information theory to try to characterize and quantify precisely this ratio. So, I have a paper about it. If that's interesting to you, you can check it out.
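To make the "conversion ratio" framing concrete, here is a deliberately simplified schematic (the paper mentioned above states its actual definition in algorithmic-information-theoretic terms, not this loose form):

$$
\text{intelligence} \;\approx\; \frac{\text{skill demonstrated over the scope of possible future situations}}{\text{priors} + \text{experience}}
$$

The higher the skill reached for a fixed budget of priors and experience, the more intelligent the system under this view.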
(0:18:52) One of the things I talk about in the paper is that to measure generalization power, to measure intelligence, you should control for priors and experience, since intelligence is a conversion ratio, you need to know what you're dividing by. And if you're interested specifically in comparing AI to human intelligence, then you should standardize on a shared set of cognitive priors, which should be, of course, human cognitive priors, what we call core knowledge. So, as an attempt to fulfill these requirements for a good benchmark of intelligence, I've put together a dataset. It's called the Abstraction and Reasoning Corpus for Artificial General Intelligence, or ARC-AGI for short. And you can think of it as a kind of IQ test. So, it's the kind of intelligence test that can be taken by humans, it's actually very easy for humans, or AI agents. And you can think of it as a program synthesis dataset as well. So, a key idea is that in ARC-AGI, every task that you see, every task you get, is novel. It's different from any other task in the dataset. It's also different from anything you may find online, for instance. So, you cannot prepare in advance for ARC. You cannot just solve ARC by memorizing the solutions in advance. That just doesn't work. And ARC tries to control for experience because you're doing few-shot program learning. You're seeing, you know, two or three examples of a thing, and then you must infer from that the program that links the input to the output. And we're also controlling for priors in the sense that ARC-AGI is grounded purely in core knowledge priors. So, it's not going to be using any sort of acquired knowledge, like the English language, for instance. It's only built on four core knowledge systems. There's objectness, there's basic geometry and topology, there's numbers, and there's agentness. So, we first ran a Kaggle competition on this dataset in early 2020 that produced several very interesting solutions, all based on program synthesis. And right now, the state of the art is about 40% of tasks solved. And that's very much baby steps. And that's a dataset that's from before the age of LLMs, but actually, it has become even more relevant in the age of LLMs. Because most benchmarks based on human exams and so on, they've already saturated in the age of LLMs, but not ARC-AGI. And that's because ARC-AGI is designed to be resistant to memorization, and all the other benchmarks can be hacked by memorization alone. So, in June this year, we launched a much more ambitious competition around ARC-AGI. We call it the ARC Prize. So, together with Mike Knoop, the co-founder of Zapier, we're offering over a million dollars in prizes to get researchers to solve ARC-AGI and open-source a solution. The competition has two tracks. There's a private track that takes place on Kaggle. It's the largest Kaggle competition at the moment. You get evaluated on 100 hidden tasks. And your solution must be self-contained. So, it must be able to run on a GPU VM within 12 hours, or on a good CPU as well. There's a big prize if you get over 85%. And then there are prizes for the top scores as well. And there's even a best paper prize, 45k. So, even if you don't have a top result, but you have good ideas, just write a paper, submit it, you can win some money. And there's also a public track which we added because people kept asking, okay, but, you know, how do the state-of-the-art LLMs, like GPT-4, Claude, and so on, how do they do on the dataset?
So, we launched this sort of like semi-private eval where the tasks are not public, but they're also not quite private because they're being queried by this remote API. And surprisingly, the state-of-the-art on this track is pretty much the same as on the private track. It's actually quite interesting.
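For reference, the few-shot setup described above can be pictured as a handful of input/output grid pairs plus a test input, with grids of small integers standing for colors. The toy task and solver below are made up for illustration; they are not an actual ARC-AGI task or an official harness:

```python
# Toy ARC-style task: infer the transformation from two train pairs, then
# apply it to the test input. The hidden rule here: recolor every non-zero
# cell to 2. (Made-up example.)
toy_task = {
    "train": [
        {"input": [[0, 1], [1, 0]], "output": [[0, 2], [2, 0]]},
        {"input": [[3, 0], [0, 3]], "output": [[2, 0], [0, 2]]},
    ],
    "test": [
        {"input": [[0, 5], [5, 5]]},  # expected output: [[0, 2], [2, 2]]
    ],
}

def recolor_nonzero(grid):
    # The "program" a solver would have to infer from the train pairs.
    return [[2 if cell else 0 for cell in row] for row in grid]

assert all(recolor_nonzero(p["input"]) == p["output"] for p in toy_task["train"])
print(recolor_nonzero(toy_task["test"][0]["input"]))  # [[0, 2], [2, 2]]
```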
(0:22:59) So, what's LLM performance on ARC-AGI exactly?
(0:23:02) It's not very good. The state-of-the-art LLMs are doing, most of them are doing between 5% and 9%. And then there's one that's doing better. It's Claude 3.5. Claude 3.5 is a big jump. It's at 21%. And meanwhile, you know, basic program search should be able to get you at least 50%. So, how do we know this? 50% is what you get if you assemble all of the submissions that were made in the 2020 competition, which were all brute-force program search. So, basically, if you scale up brute-force program search to more compute, you should get at least 50%. And meanwhile, humans do like easily over 90%. The private test set was verified by two people, and each of them scored 97 to 98%, right? And together, they solve 100%. So, you know, 5% to 21% is not great, but it's also not zero. So, it implies that LLMs have non-zero intelligence according to the benchmark. And that's intriguing. But one thing you have to keep in mind is that the benchmark is far from perfect. There's a chance that you could achieve this score by purely memorizing patterns and reciting them. It's possible. So, we have to investigate where this comes from. Because if it comes from a kernel of reasoning, then you could scale up the approach to become more general over time and eventually get to general AI. But if that performance actually comes from memorization, then you'll never reach generality. You will always have to keep applying these one-time human-guided point-wise fixes to acquire new skills. So, it's going to be this perpetual game of whack-a-mole and it's not going to scale to generality.
(0:24:44) So, to better understand what LLMs are doing, we have to talk about abstraction. Abstraction and generalization are closely tied because abstraction is the engine through which you produce generalization. So, let's take a look at abstraction in general and then we'll look at abstraction in LLMs. To understand abstraction, you have to start by looking around, zoom out, look at the universe. An interesting observation about the universe is that it's made of many different things that are all similar. They're all analogous to each other. Like one human is similar to other humans because they have a shared origin. Electromagnetism is analogous to hydrodynamics. It's also analogous to gravity and so on. So, everything is similar to everything else. We are surrounded by isomorphisms. I call this the kaleidoscope hypothesis. So, you know what a kaleidoscope is. It's a tube with a few bits of colored glass that are repeated and amplified by a set of mirrors. And that creates this remarkable richness of complex patterns out of just a few kernels of information. And the universe is like that. And in this context, intelligence is the ability to mine the experience that you have to identify bits that are reusable. And you extract these bits and you call them abstractions. And they take the form of programs, patterns, representations. And then you're going to recombine these bits together to make sense of novel situations. So, intelligence is sensitivity to abstract analogies. And in fact, that's pretty much all there is to it. If you have a high sensitivity to analogies, then you will extract powerful abstractions from little experience. And you will be able to use these abstractions to make sense of the maximally large area of future experience space. And one really important thing to understand about abstraction ability is that it's not a binary thing where either you're capable of abstraction or you're not. It's actually a matter of degree. There's a spectrum from factoids, to organized knowledge, to abstract models that can generalize broadly and accurately, to meta-models that enable you to generate new models on the fly, given a new situation. And the degree zero is when you purely memorize pointwise factoids. There's no abstraction involved. It doesn't generalize at all past what you memorized. So, here we are representing our factoids as functions with no arguments. You're going to see why in a bit. The fact that they have no arguments means that they're not abstract at all. Once you have lots of related factoids, you can organize them into something that's more like an abstract function that encodes knowledge. So, here this function has a variable x, so it's abstract with respect to x. And this type of thing, this type of organized knowledge based on pointwise factoids or interpolations between pointwise factoids, it doesn't generalize very well. You know, kind of like the way LLMs add numbers. It looks like abstraction, but it's a relatively weak form of abstraction. It may be inaccurate. It may not work on data points that are far from the data points that you've seen before. And the next degree of abstraction is to turn your organized knowledge into models that generalize strongly. A model is not an interpolation between factoids anymore. It's a concise and causal way of processing inputs to obtain the correct output. So, here a model of addition using just binary operations is going to look like this. This returns the correct result 100% of the time. It's not approximate.
And it will work with any input whatsoever, like regardless of how large they might be. So, this is strong abstraction. LLMs, as we know, you know, they still fall short of that.
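To make the spectrum above concrete, here is an illustrative sketch (mine, not the speaker's slides) of the three levels: a pointwise factoid, organized knowledge that looks things up near memorized points, and an actual model of addition built from binary operations:

```python
# Degree zero -- a pointwise factoid: a function with no arguments.
def two_plus_three():
    return 5  # memorized fact; generalizes to nothing else

# Organized knowledge -- lookup over memorized points: abstract in x and y,
# but only roughly right near the points it has seen.
KNOWN_SUMS = {(2, 3): 5, (2, 4): 6, (10, 7): 17}
def add_by_nearest_memory(x, y):
    nearest = min(KNOWN_SUMS, key=lambda k: abs(k[0] - x) + abs(k[1] - y))
    return KNOWN_SUMS[nearest]  # brittle: wrong away from memorized points

# A model -- addition from binary operations: exact for any non-negative
# integers, however large, not an approximation.
def add(x: int, y: int) -> int:
    while y:
        carry = x & y
        x, y = x ^ y, carry << 1
    return x

assert add(123456789, 987654321) == 1111111110
print(add_by_nearest_memory(100, 100))  # 17 -- "knowledge" failing far from its data
```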
(0:28:49) The next stage would be the ability to generate abstraction autonomously. That's how you are going to be able to handle novel problems, like things you've not been prepared for. And that's what intelligence is. Everything up to this point was not actually intelligence. It was just crystallized skill. And the last stage is going to be to be able to do so in a way that's maximally information efficient. That would be AGI. So, it means you should be able to master new tasks using very little experience, very little information about the task. So, not only are you going to display high skill at the task, meaning that the model you're applying can generalize strongly, but you will only have to look at a few examples, a few situations to produce that model. So, that's the holy grail of AI. And if you want to situate LLMs on the spectrum of abstraction, they're somewhere between organized knowledge and generalizable models. They're clearly not quite at the model stage as per the limitations that we have discussed. You know, if LLMs were at the model stage, they could actually add numbers or sort lists. But they have a lot of knowledge. And that knowledge is structured in such a way that it can generalize to some distance from previously seen situations. It's not just a collection of point-wise factoids. And if you solve all the limitations of LLMs, like hallucinations and brittleness and so on, you would get to the next stage. But in order to get to actual intelligence, to on-the-fly model synthesis, there's still a massive jump. You cannot just purely scale the current approach and get there. You actually need brand new directions. And of course, AGI after that is still a pretty long way off.
(0:30:32) So, how do we build abstraction in machines? Let's take a look at how abstraction works. I said that intelligence is sensitivity to analogies. But there's actually more than one way to draw analogies. There's two ways. There are two key categories of analogies from which arise two categories of abstraction. There's value-centric abstraction, and there's program-centric abstraction. And they're pretty similar to each other. They mirror each other. They're both about comparing things and then merging individual instances into common abstractions by erasing certain details about the instances that don't matter. So, you take a bunch of things, you compare them among each other. You erase the stuff that doesn't matter. What you're left with is an abstraction. And the key difference between the two is that the first one operates in the continuous domain, and the other one operates in the discrete domain. So, value-centric abstraction is about comparing things via a continuous distance function, like dot products in LLMs, for instance, or the L2 distance. And this is basically what powers human perception, intuition, and pattern recognition. And meanwhile, program-centric abstraction is about comparing discrete programs, which are graphs, obviously. And instead of computing a distance between graphs, you are looking for exact subgraph isomorphisms, for exact subgraph matching. So, if you ever hear, like, a software engineer talk about abstraction, for instance, this is actually what they mean. When you're refactoring something to make it more abstract, that's what you're doing. And both of these forms of abstraction, you know, they're really driven by analogy-making. It's just different ways to make analogies. Analogy-making is the engine that produces abstraction. And value-centric analogy is grounded in geometry. You compare instances via a distance function. And program-centric analogy is grounded in topology. You're doing exact subgraph matching. And all cognition arises from an interplay between these two forms of abstraction. And you can also remember them via the left-brain versus right-brain analogy, or the Type 1 versus Type 2 thinking distinction from Kahneman. So, of course, you know, the left-brain versus right-brain stuff, it's actually an image. It's not how lateralization of cognitive function actually works in the brain. But, you know, it's a fun way to remember it. And transformers are actually great at Type 1, at value-centric abstraction. They do everything that Type 1 is effective for, like perception, intuition, pattern recognition. So, in that sense, transformers represent a major breakthrough in AI. But they are not a good fit for Type 2 abstraction. And that is where all the limitations we listed came from. This is why they cannot add numbers or why they cannot infer from A is B that B is A as well. Even with a transformer that's trained on all the data on the internet.
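As an illustration of the two kinds of analogy just described (my sketch, not the talk's), the value-centric side compares instances with a continuous similarity in a geometric space, while the program-centric side compares discrete programs by looking for exactly shared substructure:

```python
import math

# Value-centric: continuous comparison of embeddings (geometry).
def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Program-centric: exact shared substructure between two expression trees
# (a stand-in for subgraph matching; programs here are nested tuples).
def subexpressions(expr):
    yield expr
    if isinstance(expr, tuple):
        for child in expr[1:]:
            yield from subexpressions(child)

def shared_subexpressions(p1, p2):
    trees1 = {e for e in subexpressions(p1) if isinstance(e, tuple)}
    trees2 = {e for e in subexpressions(p2) if isinstance(e, tuple)}
    return trees1 & trees2

print(cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))   # 1.0: same direction
print(shared_subexpressions(("add", ("mul", "x", 2), "y"),
                            ("sub", ("mul", "x", 2), 5)))     # {('mul', 'x', 2)}
```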
(0:33:47) So, how do you go forward from here? How do you get to Type 2? How do you solve problems like, you know, ARC-AGI, like any reasoning or planning problem? The answer is that you have to leverage discrete program search. As opposed to purely manipulating continuous, interpolative embedding spaces learned with gradient descent. And there's an entirely separate branch of computer science that deals with this. And in order to get to AGI, we have to merge discrete program search with deep learning. So, quick intro, you know, what's discrete program search exactly? It's basically combinatorial search over graphs of operators taken from a domain-specific language, a DSL. And there are many flavors of that idea, like genetic programming, for instance. And to better understand it, you can sort of like draw a side-by-side analogy between what machine learning does and what program synthesis does. So, in machine learning, your model is a differentiable parametric function. In PS, it's a graph of operators taken from a DSL. In ML, the learning engine is gradient descent. In PS, it's combinatorial search. And in ML, you know, you're looking at a continuous loss function. Whereas in PS, you only have this binary correctness check as a feedback signal. And the big hurdle in ML is data density. To learn a model, your model is a curve. So, to fit it, you need a dense sampling of the problem space. Meanwhile, PS is extremely data efficient. You can find the program using just a couple of examples. But the key hurdle is combinatorial explosion. The size of the program space you have to look at to find the correct program is immense. And it increases combinatorially with DSL size or program length. So, program synthesis has been very successful on ARC-AGI so far, even though these are just baby steps. And so, all program synthesis solutions on ARC-AGI, they follow the same template, you know, basically doing brute-force program search. And even though that's very primitive, it still outperforms state-of-the-art LLMs with much less compute, by the way.
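To show what that template looks like in miniature, here is a hedged sketch of brute-force program search over a toy DSL (not any actual ARC Prize submission): programs are short sequences of grid operators, and the only feedback is the binary correctness check mentioned above.

```python
from itertools import product

# A tiny DSL of grid -> grid operators (made up for illustration).
DSL = {
    "flip_h":  lambda g: [row[::-1] for row in g],
    "flip_v":  lambda g: g[::-1],
    "recolor": lambda g: [[2 if c else 0 for c in row] for row in g],
}

def run(program, grid):
    for op in program:
        grid = DSL[op](grid)
    return grid

def brute_force_search(examples, max_length=3):
    # Enumerate every operator sequence up to max_length: cost grows
    # combinatorially with DSL size and program length, which is exactly
    # the explosion described above.
    for length in range(1, max_length + 1):
        for program in product(DSL, repeat=length):
            if all(run(program, ex["input"]) == ex["output"] for ex in examples):
                return program  # first program consistent with every example
    return None

examples = [
    {"input": [[1, 0], [0, 0]], "output": [[0, 0], [0, 2]]},
    {"input": [[0, 3], [0, 0]], "output": [[0, 0], [2, 0]]},
]
print(brute_force_search(examples))  # e.g. ('flip_h', 'flip_v', 'recolor')
```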
(0:36:05) So, now we know what the limitations of LLMs are. We know what they are good at. They are good at Type 1. We know what they are not good at, Type 2. So, where do we go next? And the answer is that we have to merge machine learning, like this sort of like Type 1 thinking, with the Type 2 thinking provided by program synthesis. And I think that's really how intelligence works in humans. That's what human intelligence is really good at. That's what makes us special. It's that we combine perception and intuition together with explicit step-by-step reasoning. We combine, really, both forms of abstraction. So, for instance, if you're playing chess, you're using Type 2 when you calculate, step-by-step. You unfold specific, interesting moves. But you're not doing this for every possible move, because there are lots of them. You know, it's combinatorial explosion. You're only doing this for a handful of different options. So, you use your intuition, which you build up by playing lots of games, in order to narrow down the sort of discrete search that you perform when you're calculating. So you're merging Type 1 and Type 2, and that's why you can actually play chess using very, very small cognitive resources compared to what a computer can do. And this blending of Type 1 and Type 2 is where we should take AI next. So we can combine deep learning and discrete search into a unified approach.
(0:37:48) So, how does that work? Well, the key System 2 technique is discrete search over a space of programs. And the key wall that you run into is combinatorial explosion. And meanwhile, the key System 1 technique is curve fitting and generalization via interpolation. So you embed lots of data into an interpolative manifold. And this manifold can do fast but approximate judgment calls about the target space. So the big idea is to leverage these fast but approximate judgment calls to fight combinatorial explosion. So you use them as a form of intuition about the structure of the program space that you're trying to search over, that you're navigating. And you use that to make search tractable.
(0:38:36) So, a simple analogy, in case that sounds a little bit too abstract, is drawing a map. You take a space of discrete objects with discrete relationships that would normally require combinatorial search, like pathfinding in the Paris Metro is a good example. That's a combinatorial problem. And you embed these discrete objects and their relationships into a geometric manifold where you can compare things via a continuous distance function. And you can use that to make fast but approximate inferences about relationships. Right?
(0:39:13) I can pretty much draw a line on this map, look at what it intersects. And this gives you a sort of like candidate path that restricts the set of discrete paths that you're going to have to look at one by one. So, this enables you to keep combinatorial explosion in check. But, of course, you cannot draw maps of any kind of space. Right? Like program space is actually very, very non-linear. So, in order to use deep learning for a problem, you need two things. You need an interpolative problem. Like it needs to follow the manifold hypothesis. And you need lots of data. If you have only two examples, it doesn't work. So, if you look at a single ARC-AGI task, for instance, it's not interpolative and you only have like two to four examples. So, you cannot use deep learning. You cannot solve it purely by reapplying a memorized pattern either. So, you cannot use an LLM. But, if you take a step down, lower down the scale of abstraction, and you look at core knowledge, the core knowledge systems that ARC-AGI is built upon. Each core knowledge system is interpolative in nature and could be learned from data. And, of course, you can collect lots of data about them. So, you could use deep learning at that level to serve as a perception layer that parses the ARC world into discrete objects. And, likewise, if you take a step higher up the scale of abstraction and you look at the space of all possible ARC-AGI tasks and all possible programs that solve them. Then, again, you will find continuous dimensions of variation. So, you can actually leverage interpolation in that space to some extent. So, you can use deep learning there to produce intuition over the structure of the space of ARC-AGI tasks and the programs that solve them.
(0:41:09) So, based on that, you know, I think there are two exciting research areas to combine deep learning and program synthesis. The first one would be leveraging discrete programs that incorporate deep learning components. So, for instance, use deep learning as a perception layer to parse the real world into discrete building blocks that you can feed into a program synthesis engine. You can also add symbolic add-ons to deep learning systems, which is, you know, something I've been talking about for a very long time. But it's actually starting to happen now with things like external verifiers, tool use for LLMs and so on. And the other direction would be deep learning models used to inform discrete search and improve its efficiency. So, you use deep learning as a driver, as a guide for program synthesis. So, for instance, it can give you intuitive program sketches to guide your program search. And it can reduce the space of possible branching decisions that you should consider at each node and so on.
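A hedged sketch of that second direction (deep learning guiding discrete search): `score_next_op` below is a hand-written stand-in for a trained neural model that scores which operator looks promising next, and the search only expands the few best-scored branches at each node instead of enumerating everything. The operator names and priors are made up for illustration.

```python
import heapq

OPS = ["flip_h", "flip_v", "recolor", "transpose", "crop"]  # hypothetical DSL

def score_next_op(partial_program, op, op_priors):
    # Placeholder for a neural scorer: a real system would embed the task and
    # the partial program and predict how promising each next operator is.
    return op_priors.get(op, 0.1)

def guided_search(op_priors, is_solution, max_depth=4, beam_width=2):
    beam = [((), 0.0)]  # (partial program, accumulated score)
    for _ in range(max_depth):
        candidates = []
        for program, score in beam:
            for op in OPS:
                candidates.append((program + (op,),
                                   score + score_next_op(program, op, op_priors)))
        # Intuition prunes the combinatorial explosion: keep only the most
        # promising branches rather than every possible program.
        beam = heapq.nlargest(beam_width, candidates, key=lambda c: c[1])
        for program, _ in beam:
            if is_solution(program):
                return program
    return None

# Toy usage: the "neural" prior hints that flips are relevant, and the checker
# accepts any program ending in a horizontal-then-vertical flip.
priors = {"flip_h": 1.0, "flip_v": 0.9}
print(guided_search(priors, lambda p: p[-2:] == ("flip_h", "flip_v")))  # e.g. ('flip_h', 'flip_v')
```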
(0:42:13) So, what would that look like on ARC-AGI?
(0:42:18) So, I'm going to spell out for you how you can crack ARC-AGI and win a million dollars, maybe.
(0:42:25) So, there are two directions you can go. First, you can use deep learning to draw a map, in a sense, of grid state space, grid space. So, in the limit, this solves program synthesis, because you take your initial grid input, you embed it on your manifold. Then you look at the grid output, you embed it, and then you draw a line between the two points on your manifold. And you look at the grids that it interpolates. And this gives you approximately the series of transformations to go from input to output, right? You still have to do local search around them, of course, because this is fast, but this is very approximate. It may not be correct, but it's a very good starting point. You are turning program synthesis into a pure interpolation problem. And the other direction you can go is program embedding. You can use deep learning to draw a map of program space, this time, instead of grid space. And you can use this map to generate discrete programs and make your search process more efficient. So, a very good example of how you can combine LLMs with discrete program search is this paper, "Hypothesis Search: Inductive Reasoning with Language Models." So, it uses an LLM to first generate a number of hypotheses about an ARC task in natural language. And then it uses another LLM to implement candidate programs corresponding to each hypothesis in Python. And so, by doing this, they are actually getting a 2x improvement on ARC-AGI. So, that's very promising.
(0:44:10) Another example that's very promising is the submission from Ryan Greenblatt on the ARC-AGI leaderboard, the public leaderboard. So, it's using a very sophisticated prompting pipeline based on GPT-4o, where it uses GPT-4o to generate candidate Python programs to solve ARC tasks. And then he has an external verifier. So, it's generating thousands of programs per task. It also has a way to refine programs that seem to be doing close to what you want. So, this scores 42% on the public leaderboard. And that's the current state of the art.
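The outer loop of that kind of pipeline is easy to sketch (hypothetical helper names; this is not Ryan Greenblatt's actual code or the Hypothesis Search implementation): a model proposes many candidate Python programs, and an external verifier keeps only those that reproduce every training pair.

```python
def llm_generate_candidates(task, n: int) -> list[str]:
    # Stand-in for a call to a code-generating model prompted with the task's
    # train pairs. Here it just returns one hard-coded candidate n times.
    return ["def solve(grid):\n"
            "    return [[2 if c else 0 for c in row] for row in grid]"] * n

def verify(candidate_source: str, train_pairs) -> bool:
    namespace = {}
    try:
        exec(candidate_source, namespace)   # compile and define solve()
        solve = namespace["solve"]
        return all(solve(p["input"]) == p["output"] for p in train_pairs)
    except Exception:
        return False                        # broken candidates simply fail verification

def solve_task(task, n_candidates=1000):
    for source in llm_generate_candidates(task, n_candidates):
        if verify(source, task["train"]):
            return source                   # first program consistent with the examples
    return None

task = {"train": [{"input": [[0, 1]], "output": [[0, 2]]}],
        "test":  [{"input": [[7, 0]]}]}
print(solve_task(task) is not None)  # True with the stub above
```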
(0:44:54) So, again, here's where we are. We know that LLMs fall short of AGI. They're great at System 1 thinking. They lack System 2. And meanwhile, progress towards System 2 has stalled. Progress with LLMs has stalled. The limitations that we are dealing with, with LLMs, they are still the exact same ones we were dealing with five years ago. And we need new ideas. We need new breakthroughs. And my bet is that the next breakthrough will likely come from an outsider while, you know, all the big labs are busy training big LLMs. So, maybe it could even be someone in this room. Maybe you have the new ideas that I want to see. So, see you on the leaderboard for ARC-AGI. Thank you.
(0:45:40) Questions? Yeah. So, if you follow the public state of the art, the approach appears to be kind of using an LLM to sample really lots and lots of programs, like thousands of samples based on one prompt, tens of thousands of samples. And it seems that the performance goes up, you know, maybe if you go from 10,000 to 20,000 program samples, maybe it goes up by 2%. And it didn't seem like there was necessarily a limit to that improvement, other than like the massive amount of compute that you throw at it. Well, the improvement is not linear. So, you know, ARC-AGI is not difficult. I mean, for humans, it's extremely easy. If you have infinite compute, you can in fact solve ARC-AGI with brute-force program search, right? If you have data-center-scale compute, that's actually doable. We already know that you can get at least 50% with brute-force search. And of course, you know, brute-force search performance improves as you throw more compute at it. Part of the challenge is to be able to solve it with limited compute resources, right? On the public track, we don't have hard compute limits. So, for instance, Ryan Greenblatt's solution is actually using a tremendous amount of compute. It's costing, I think, about 10K to 20K to actually run it in terms of API fees. So, it's enormous. And it's only doing 42%. If you were to just apply large-scale brute-force program search, you could spend hundreds of dollars and get the same results, right? So, you think that performance, the nature of that, is simply that you're just approaching brute-force search in the limit? Yeah. And by the way, because the improvement is going to be logarithmic, I don't think you're going to get to like 85%, 90%, like human-level performance, just by scaling compute. You'd need more compute than is available in the universe.
(0:47:50) Quick question, Francois. You said you think that the next breakthrough will come from an outsider, not from the big labs. But don't big labs like Google employ outside thinkers like yourself?
(0:48:03) Yeah. So, the way I see it, you know, we used to have a pretty good amount of intellectual diversity in AI a few years ago. But the rise of LLMs and ChatGPT and, you know, this sort of arms race between the big tech companies to build the best possible LLM, that has kind of led to an intellectual collapse where everyone is now working on the same things. And I think this is actually making progress towards AGI slower than it would have been otherwise. And I'll tell you, you know, all the top entrants in ARC-AGI so far, they've been outsiders. They've not been like big tech labs. They've been, you know, random people coming up with really interesting novel ideas. Sometimes these novel ideas actually echo what some of the big labs are doing. Like for instance, AlphaProof versus the currently leading solution from Jack Cole and his team. There's actually pretty strong parallels there.
(0:49:08) Do you think there will be like an ARC 2 challenge that measures some more ambitious form of generalization? Or is this going to be enough?
(0:49:17) Yes. These are very, very good questions. So I don't think the first system to crack ARC-AGI is going to be an actual AGI. I think we're going to get there before we get to AGI. So you can think of ARC-AGI not as a sufficient condition for AGI, but as a necessary condition. As long as we don't solve ARC-AGI, we do not have AGI. It's a rebuttal to all the people who say AGI is already here.
(0:49:42) And another thing, you know, you mentioned ARC 2. There is going to be ARC 2. ARC is a work in progress. Like I came up with it in 2019. I started working on it in 2017. It's a very crude attempt. It's the first attempt. As, you know, it came out and people started trying to crack it, we learned from that. And we want to reinvest these learnings into version 2. And then there's going to be version 3. And as time goes by, it's going to get more sophisticated, hopefully closer to real-world problems. And also more dynamic and open-ended. Like I don't think a good AGI test in the long term can just be this sort of like static input-output pairs in a dataset. I think it's going to go much, much beyond that. And I want to tackle the problem of open-endedness as well. So I would say one thing that I'm reasonably confident about is that if you solve ARC 3, it means that you have at least a decent answer to the question of how do we make machines that can approach problems they've not seen before and figure them out. So you're solving this specific sub-problem of AGI. And I think that's going to lead to tremendous progress. Like that's, I would say, the most important problem that's blocking us from getting to AGI today. So, and kind of in transition to our third contributed paper session, which is on causality, I know the first paper talks a little bit about experiential learning as, you know, when babies learn and they begin abstracting concepts just by exploring the world. And I was wondering how you see that integrating into your ideas there.
(0:51:39) Yeah. I have two kids. They're three years old and three months old, respectively. And I think, you know, watching kids grow up is a tremendous source of inspiration for coming up with theories of the human mind. I think that's especially true between ages one and two, I would say, because that's when cognition is still sort of like simple enough that you can see it in action. Past that point, you know, it gets so sophisticated that what you're seeing is very, very difficult to see through. And before that point, it's also sort of like, you know, some sensorimotor affordances are not fully controlled. It's not quite like interesting enough, but at that point in time, it's really fascinating. And I do, personally, I do see things in my kids that have kind of like inspired me to have certain solutions that are applicable to ARC. I think humans actually, it's not like a controversial statement, but I do think humans are capable of performing few-shot program synthesis. The sort of process that you would need to solve ARC-AGI. But instead of applying it to these sort of like IQ-like puzzles, we are just applying it to completely everyday situations. Like we set sort of like short-term goals. Like you're a baby, you're like crawling on the floor, you want to get to an object. You set that goal. Then you enter this sort of like feedback loop where you know your current state. You can imagine the state you want to get to, right? You have this sort of like latent representation of your target. And you're trying to do planning between the two. So that's few-shot program synthesis. And you're trying to execute, then you kind of like restart the loop. So I think this is very much how human cognition works on a fundamental level. Thank you, and thank you so much.