```sh
uv tool list
# yt2doc v0.2.8
# - yt2doc
yt2doc --video https://www.youtube.com/live/pjqKHOeykp8 --segment-unchaptered --llm-model gemma2 --whisper-model deepdml/faster-whisper-large-v3-turbo-ct2 --sat-model sat-12l -o examples
```
https://www.youtube.com/live/pjqKHOeykp8
Thank you for your cooperation. I now invite Professor Kamakoti Veezhinathan, Director of the Indian Institute of Technology Madras, to address the gathering. Hearty welcome to you, Yann, to Bharat, to Tamil Nadu, to Chennai. Thank you, Subra and Kris, for enabling this, and a very, very warm welcome to all the attendees of this third Subra Suresh Public Lecture.
We are very fortunate to have you address us on one of the most fascinating topics, one that explores the border where machine intelligence ends and human intelligence takes over. The R&D community has been pushing this border, trying to make machines more and more intelligent. What we may term Kasparov's challenge, from his book Deep Thinking: where machine intelligence ends. He asked this wonderful question: when he played a rapid round against a hundred computers, can a robot play a rapid round against a hundred Kasparovs? I think that is a very important, thought-provoking question to which many computer scientists and engineers aspire to find an answer. In this lecture, we look forward to answers to two major challenges. One, a very famous engineering quote says that optimization hinders evolution. Today, when we look at large language models and generative AI platforms, we are looking at a lot of power consumption and a lot of questions that arise from a sustainability point of view. That demands optimization. So, in your opinion, does this optimization hinder evolution?
The next important aspect is responsible AI. You are now providing a platform; Meta provides a platform, and it is considered a knowledge repository. India, as a country, has diverse views about many things, and our strength is unity in diversity. When questions are posed to Meta, when somebody learns knowledge from Meta, I think there comes a responsibility to ensure that all these diverse views are properly protected. I am sure India will be a great bet for Meta. And I look forward to this visit of yours enabling more collaboration between the Indian knowledge system and Meta.
Last night, I asked Meta what three tough questions I could ask its Vice President, and Meta came out with three interesting questions. Number one, can AI truly be creative, or is it limited to pattern recognition? Two, do you believe AI can possess consciousness or self-awareness? Three, should AI development prioritize human well-being or technological advancement? I am sure your lecture will throw light on many of these aspects. Again, a hearty welcome to all the audience here, and we look forward to listening to you. Thank you very much.
Thank you, Professor. I think I speak for everyone when I say that you have truly set the stage for an evening of intellectual exploration and meaningful dialogue that will resonate with us long after tonight. On that note, I invite you to direct your attention to the screen for a video introduction of Professor Subra Suresh, the eponym of this distinguished lecture series. Professor Subra Suresh is a professor at large at Brown University and the Vannevar Bush Professor of Engineering Emeritus at MIT. Your scientific accomplishments have been recognized by two United States presidents and described by one colleague as nothing short of breathtaking. The numerous awards you have received include a Franklin Institute Medal, which joins you with the most influential scientists of the last two centuries, including Marie Curie, Thomas Edison, and Albert Einstein. You are the ultimate exemplar of the three cornerstones of 21st century science. Innovation, interdisciplinarity, and global-minded collaboration. Your work applying nanomechanics to malaria research has shaped entirely new fields of cross-disciplinary study, while illuminating our understanding of the cellular and molecular basis of disease.
Subra, you are one of the most visionary, impactful leaders in the global scientific community. For your extraordinary contributions to science and engineering, Dartmouth is proud to award you the honorary degree Doctor of Science.
Subra was incredibly creative and effective in his role as director of the NSF, as he had previously been at MIT as a professor in four different departments, chairman of the Department of Materials Science and Engineering, and dean of engineering. Hand-picked by U.S. President Barack Obama, Professor Suresh's accomplishments as head of the NSF included the creation of the Innovation Corps, or what is now known as NSF I-Corps, a highly successful nationwide commercialization accelerator program. He also took NSF principles of politically unbiased, scientifically based, and transparent merit review to the rest of the world, inspiring the creation of the Global Research Council.
What's interesting about our energetic new leader is that he's also a testament to why we need to be an open country: to get extraordinarily brilliant people to come into our country, change our country, and build our future. Dr. Suresh.
Dr. Suresh, brilliant scientist, a visionary leader, a truly wonderful man and a friend. Congratulations.
Next, we turn to another video introduction, of another esteemed IITM alumnus, Mr. Senapathy (Kris) Gopalakrishnan, whose generous support has made this lecture series possible. Mr. Kris Gopalakrishnan is the chairman of Axilor Ventures in Bengaluru and co-founder of Infosys. I did my MSc Physics, and while doing my MSc Physics, I took a course on the campus; this was not part of my curriculum. It was a non-credit course on Fortran programming, which introduced me to computers, and I got hooked onto it. Later, I got into the MTech Computer Science program, and from that point onwards, I was working with computers.
A star engineer in Kris Gopalakrishnan. A pioneer in sustainability initiatives. We have been involved in various international forums like the WBCSD, through CII, through TERI, etc. And when the request came for me to chair, I believe it was a recognition of Infosys's pioneering efforts over the last seven years. We need to innovate to stay ahead of competition. Even in the services we have today, we need to make sure that we are bringing in more efficiency, bringing in new techniques, technology, tools, etc. So innovation has to be there in everything we do. Being entrepreneurial is actually with transparency, in spite of transparency.
Desh Deshpande spoke about setting up a centre to bring ideas from lab to market, but with faculty at the core of creating new start-ups. We are very grateful to all the people, from Sikholy, Ramadurai, Chandra, Azim Premji, everybody who has been instrumental in creating this industry. They all spent, in some cases, three or four hours with us, interviewing, reminiscing about their stories, etc. And so we got these stories, in some sense, from the horse's mouth or the leader's mouth. And that was what I intended this book to be. That was truly inspiring. Just a general reminder, everyone: please keep your phones on silent. It is now my pleasure to invite Professor Ashwin Mahalingam, Dean ACR, to introduce the speaker of the evening, Dr. Yann LeCun. Good evening, everyone. And, Yann, welcome to Chennai. Just judging by how quickly this large auditorium has filled up, I don't think you need an introduction, but I'll go ahead and quickly introduce you anyway. Yann LeCun is the VP and Chief AI Scientist at Meta, and Silver Professor at NYU, affiliated with the Courant Institute of Mathematical Sciences and the Center for Data Science. We just caught up earlier; he continues to actually teach courses in artificial intelligence and guide PhD and postdoctoral students. He's still a very active faculty member. He was the founding director of FAIR and of the NYU Center for Data Science. He received an engineering diploma from ESIEE Paris and a PhD from Sorbonne University. After a postdoc in Toronto, he joined AT&T Bell Labs in 1988 and AT&T Labs in 1996 as Head of Image Processing Research. He then joined NYU as a professor in 2003 and Meta, or what was then known as Facebook, in 2013. His interests include AI, machine learning, computer perception, robotics, and computational neuroscience. He is the recipient of the 2018 ACM Turing Award, with Geoffrey Hinton and Yoshua Bengio, for conceptual and engineering breakthroughs that have made deep neural networks a critical component of computing. He is also a member of the National Academy of Sciences, the National Academy of Engineering, and the French Académie des Sciences.
So, Yann, welcome again to Chennai, and we really look forward to your talk. I invite Professor Kamakoti Veezhinathan to felicitate our esteemed speaker. Thank you very much for this kind introduction. I think I'm going to keep this on to keep me warm.
Okay, it's a real pleasure to be here. This is my third time in India. It's always amazing to see how much enthusiasm there is for AI technology in general. And I hope to inspire perhaps some of you to keep this enthusiasm.
Okay, big question. It's the scientific question of my life: can machines reach human-level intelligence? In fact, the scientific question really is, what is intelligence? It's one of the three big questions that, you know, you are entitled to ask yourself when you are a naive, aspiring scientist. What's the universe made of? What is life all about? And how does the brain work? Those are the three big questions. Everything else is kind of a side question to those three big ones. So, as an engineer, I don't think we can answer the question of what intelligence is unless we are able to build an intelligent system ourselves. And, you know, I'm not going to give you the solution to how we could reach human intelligence, but perhaps give you some elements of how we could get there, perhaps a path towards that. And I've written a long paper about this that I'll mention later.
Okay, so the first question is, should we try to build machines that have human-level intelligence? And I think yes. And the reason is, the commodity that we're missing the most in the world today is intelligence. And so, if we can augment our own intelligence with that of machines, I think it can have a big positive effect on society at large, on humanity. And so, the best way to envision a future with intelligent machines is one in which we carry systems at all times that allow us to ask any question. AI systems could help us in our daily lives, solve problems for us. It would be like walking around with a staff of really smart people working for us. We shouldn't feel threatened by the idea that we're going to have machines that are more intelligent than us, because we should not feel threatened by working with people who are smarter than us. I am certainly very familiar with the concept of working with people who are smarter than me. So, this will empower and amplify our intelligence. But for that, we need systems that have intelligence that resembles human intelligence, so that we can interact with them easily. And that means we need machines that understand the world, understand how the world works, can remember things, and can reason and plan. Okay, so those are the desiderata for what we call AMI, Advanced Machine Intelligence. This is our internal code name at Meta for what other people call AGI. I don't like that term, because human intelligence is not general at all; it's very specialized. So, designating human-level intelligence as general intelligence is, I think, a mistake. So, we prefer AMI, and we pronounce it ami, which in French means friend. I think it's appropriate. So: systems that understand how the world works and construct mental models of the world that allow them to understand it but also plan and reason; systems that have persistent memory; systems that can plan action sequences so as to fulfill an objective; systems that can reason and invent new solutions to unseen problems, one-shot, without having to be trained; and systems that are controllable and safe by design.
So, let's start with an essential characteristic that I think intelligent systems should have. And it's the ability to perform inference, which means computing their outputs not by propagating a signal through a bunch of layers in a neural net, but by optimizing an objective function. Okay? So, you have those two models of computation. In one, you have a fixed number of steps in the computation and you produce an output. In the other, you have some objective function that produces a scalar output, and you search for a value of the output that minimizes this objective function. That objective function measures the degree of compatibility of the output with the observed input. Now, that process is intrinsically more powerful than the one that proceeds by forward propagation through a network, through a fixed number of steps, because every computational problem can be reduced to an optimization problem, essentially. So, that's the first characteristic, I think, of intelligent systems. Note that the dominant models of the day, LLMs, are of the first type. They produce one token after the other, and each token takes the exact same amount of computation to produce. So, current models are not appropriate to take us to human-level intelligence, just on the first slide. Okay?
Now, this concept of minimizing an objective with respect to a set of variables to compute an output: I call these energy-based models. So, think of this function as an energy function. It takes a value of, let's say, zero if the output is compatible with the input, if it's a good output for that input. And it takes larger values if the output is not compatible with the input. Okay?
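To make the contrast concrete, here is a minimal sketch of inference by optimization: a toy quadratic energy E(x, y) and a gradient-descent search over y. The energy function, the dimensions, and the optimizer settings are illustrative assumptions, not anything specific from the talk.

```python
# Minimal sketch of inference by optimization in an energy-based model.
# The quadratic toy energy and the plain gradient-descent loop are
# illustrative assumptions, not the specific model described in the talk.
import torch

def energy(x, y, W):
    # E(x, y): small when y is compatible with x, larger otherwise.
    return ((y - W @ x) ** 2).sum()

def infer(x, W, steps=200, lr=0.1):
    # Feedforward inference would compute y in a fixed number of steps;
    # here we instead *search* for the y that minimizes the energy.
    y = torch.zeros(W.shape[0], requires_grad=True)
    opt = torch.optim.SGD([y], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        e = energy(x, y, W)
        e.backward()
        opt.step()
    return y.detach()

W = torch.randn(3, 5)
x = torch.randn(5)
y_star = infer(x, W)          # approximately argmin_y E(x, y)
print(energy(x, y_star, W))   # should be close to zero
```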
So, the advantage of having a system like this is that it allows the system to search for a good answer, and there could be multiple good answers. If you're propagating through a bunch of layers, you only produce one answer. A system that proceeds by optimization could, in principle, produce multiple answers for a given input. This type of search is very classical in classical AI, in good old-fashioned AI. The whole idea of AI in the old days, before machine learning, was to actually search through a solution space for a solution that satisfied a criterion, an objective. Okay? That was the whole thing. That's what reasoning is all about. That's what path planning is all about. That's what SAT problems are all about, logical inference even. So, it's not really a new idea at all. It's an idea that we've somewhat forgotten because of machine learning.
The second thing is that this idea of inference by optimization is more akin to what psychologists call system two: the type of reasoning that we do when we consciously try to solve a problem and think about it hard for a while. It's different from system one, which allows us to act subconsciously, without thinking. So, current AI systems are essentially system one, and we need them to be system two.
Now, this idea of an energy-based model basically captures the dependencies between two sets of variables: X, the set of variables you observe, and Y, the set of variables you're supposed to produce. And the dependency between the two is captured by an energy function that takes low values on the data points. So, if you have training data, your energy function can be trained to take low values around the training points. The training points are the black dots in the diagram on the right. And then it takes larger values outside of those black dots, outside of the region of high data density.
Okay, and the problem of training an energy-based model is to shape this energy function so that it gives low energy to the stuff you train it on and high energy to the stuff you don't train it on.
Okay, so before I explain how you do this and why this is an important concept, let me talk about autoregressive LLMs. Large language models should really be called autoregressive large language models, because that's what a language model really is. They use feedforward inference, as I mentioned earlier. And the way they operate is that you train them to predict a word from the words that precede it, using tons of data. And then, once it's trained, you show it a sequence of words and ask it to produce the next word. You shift that into the input, then produce the second word, shift that into the input, then the third word, et cetera. It's called autoregressive prediction. It's a very old concept in statistics and signal processing. But it's got some issues, which is that, again, there is no system two. There is no reflection. There is no kind of thinking. There is no reasoning. There is no world model. And in fact, there's a bunch of papers that have been published, both by computer scientists on one side and by people coming from classical AI on the other side, saying LLMs really cannot reason and cannot plan. They are not reproducing the essential mechanisms that are present in living organisms, or in classical AI systems, that enable reasoning, okay? So they sometimes appear to be doing reasoning, but really what they're doing is intelligent retrieval. You pose a problem to an LLM, and if it's a standard puzzle, the system will just give you the answer, because it's been trained on that puzzle and it knows the answer. If you change the statement of the problem a little bit, it will still regurgitate the answer to the previous problem, because it doesn't really have any kind of sense of what a solution to a problem is, beyond what it has memorized. That doesn't mean LLMs are not useful. They're very useful. They can be used for all kinds of stuff, and a lot of people around the world, including in India, are using or developing LLMs for all kinds of wonderful applications for which you don't need human-level intelligence, okay? But they are not a path towards human-level intelligence, despite what you might hear from some people, mostly on the west coast of the US.
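For contrast with inference by optimization, here is a toy sketch of the autoregressive loop described above: each new token is produced by a fixed computation and shifted back into the input. The bigram counts are a stand-in assumption for a trained neural network.

```python
# Toy sketch of autoregressive prediction: each new token is produced by a
# fixed forward computation and shifted back into the input. The bigram
# "model" here is a stand-in assumption for a trained neural network.
from collections import Counter, defaultdict
import random

corpus = "the cat sat on the mat and the cat sat".split()
bigrams = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigrams[prev][nxt] += 1          # "training": count word transitions

def next_token(context):
    counts = bigrams[context[-1]]    # condition only on the last token
    words, weights = zip(*counts.items())
    return random.choices(words, weights=weights)[0]

tokens = ["the"]
for _ in range(6):                   # generate, one token at a time
    tokens.append(next_token(tokens))
print(" ".join(tokens))
```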
Okay, so, by the way, there's a good series of papers by a gentleman called Subbarao Kambhampati, who is a professor at Arizona State and a former president of AAAI. And he has a series of papers about, you know, LLMs can't plan, LLMs still cannot plan, LLMs really cannot plan; even LLMs that claim they can plan, like o1, cannot plan. And he has a tutorial on the whole idea of planning, essentially, for LLMs.
Okay, so, how are we going to build those systems that perform inference by optimization? A very simple concept is that those systems would possess a world model. What is a world model? A world model is something that, given a current estimate of the state of the world, and given an action or a sequence of actions that you imagine taking, will give you a prediction of what the resulting state of the world is going to be, okay? So: state of the world at time t, action I might take, state of the world at time t plus one, predicted through the world model. Now, of course, when you perceive the world, you don't perceive the entire state of the world. So, you might have to combine your current observation with the contents of a memory that contains your idea about the rest of the state of the world, right? You might be perceiving the contents of this room right now. But, you know, if I remove this beautiful foulard from IIT Madras, it doesn't change your idea about the state of the world outside of this room, okay? And so, it doesn't make sense to recompute your entire idea of the state of the world every time. You have to combine this with some contents of your memory. Okay, so you have the state of the world as you perceive it, and the contents of the memory that give you some idea of the rest of the state of the world. Feed this to your world model. Feed in the sequence of actions that you imagine taking. Your world model predicts the outcome. Now, you can feed the outcome to a task objective: something that measures to what extent a task has been accomplished. So, it gives you an output of zero, let's say, if the task is accomplished, and a larger number as a function of how unaccomplished the task is. And perhaps there's a bunch of other objectives that are guardrails. They guarantee that whatever action sequence the system takes is not going to hurt anyone, for example. So, if you have a system like this, an inference episode consists in showing the input to the system, and then searching through sets of action sequences for one that minimizes the objectives. Okay? So, inference by optimization, through search. If all of those modules are differentiable, that search can be done using gradient-based optimization techniques. Okay? Gradient descent, essentially, or something slightly more sophisticated. I'm not talking about learning here. I'm only talking about inference, right? We've not talked about learning yet. Now, of course, an action is generally not a single action. It's a sequence of actions. And perhaps the same model of the world applies at every step of the sequence of actions. So, perhaps what you need to do is run your model multiple times, for multiple actions, and predict the sequence of successive states that the world will go through as a consequence of taking the sequence of actions. You might put guardrails along the trajectory, and then the task objective at the end. And again, infer the optimal sequence of actions through minimization, perhaps gradient-based optimization.
Now, this model is essentially what people in control theory call model predictive control. It's a very classical set of methods going back to the 1960s, 1950s if you were in the Soviet Union. And very classical, except for the fact that here we're not going to build the model of the system we're trying to control by hand, but we're just going to train it from data. So, I'm not specifying which optimization algorithm we're going to use for inference, but it doesn't matter at this point.
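As a rough illustration of the planning loop described above, here is a minimal sketch of model-predictive-control-style inference through a differentiable world model: a candidate action sequence is rolled forward, scored by a task objective plus a guardrail, and refined by gradient descent. The tiny MLP world model and the quadratic costs are assumptions made for the sake of the example.

```python
# Minimal sketch of planning with a world model (model predictive control):
# roll a differentiable world model forward over a candidate action sequence,
# score the predicted trajectory with a task cost plus a guardrail cost, and
# optimize the actions by gradient descent. The tiny MLP world model and the
# quadratic costs are illustrative assumptions, not the talk's actual models.
import torch
import torch.nn as nn

state_dim, action_dim, horizon = 4, 2, 10
world_model = nn.Sequential(nn.Linear(state_dim + action_dim, 64),
                            nn.Tanh(), nn.Linear(64, state_dim))

def task_cost(state, goal):
    return ((state - goal) ** 2).sum()          # 0 when the task is done

def guardrail_cost(state):
    return torch.relu(state.abs() - 3.0).sum()  # penalize "unsafe" states

def plan(s0, goal, steps=100, lr=0.05):
    actions = torch.zeros(horizon, action_dim, requires_grad=True)
    opt = torch.optim.Adam([actions], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        s, cost = s0, 0.0
        for t in range(horizon):                # imagined rollout
            s = world_model(torch.cat([s, actions[t]]))
            cost = cost + guardrail_cost(s)
        cost = cost + task_cost(s, goal)        # task objective at the end
        cost.backward()
        opt.step()
    return actions.detach()

plan(torch.zeros(state_dim), goal=torch.ones(state_dim))
```

Note that, as in the talk, this is inference only: the world model here is untrained, and only the action sequence is optimized.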
Now, the sad news is that the world is generally not entirely predictable. It might be stochastic for various reasons: perhaps because it's intrinsically stochastic, or perhaps because we have only partial information about the state of the world, and so we can only make predictions that have a certain level of uncertainty. And this would be represented by a latent variable. So, in the symbolism I'm using here, the round shapes are variables: latent variables, actions, inputs such as the observation. The modules that are rounded at one end are deterministic functions, let's say a neural net, okay? And a latent variable is a variable whose value is not determined and is not necessarily obtained by optimization, but perhaps is sampled from a distribution or is swept over a set. And so it represents everything you don't know about the world. So, if I toss a coin, I can't predict whether it's going to come up heads or tails, but I can predict it's going to fall on one of the two. So, the latent variable would be a binary variable that indicates whether it's one or the other, and the world model would make two predictions, right? Now, there's some difficulty in handling this, because if you have lots of discrete variables that can take values in a sequence, it might lead to an exponential explosion of the number of possible states that you might want to predict. But in most situations, you don't actually need to do this.
Now, ultimately though, this is not the kind of model that we want. What we want is a model that can do hierarchical planning, because all of us do hierarchical planning. Animals do hierarchical planning. Before I take an example, what that means is that we need models of the world at several different levels of abstraction, depending on the type of prediction we're doing. Let me take an example. Suppose I'm sitting in my office at NYU and I decide to fly to Paris, let's say. Okay? At a very high, abstract level, I can say that if it's the afternoon and I decide to be in Paris tomorrow morning, I'll be in Paris tomorrow morning; I just need to go to the airport and catch a plane. Okay? That's a very high-level plan with a very abstract model of the world that tells me that, you know, I can fly overnight to Paris. So now that I have a high-level plan that says go to the airport, catch a plane, I have a sub-goal: go to the airport. I can plan how to go to the airport. I'm in New York City, so that means going down to the street and hailing a taxi. So now I have another sub-goal: going down to the street. I need to go to the elevator, push the button, and walk out of the building. How do I go to the elevator? I need to stand up from my chair, pick up my bag, open the door, walk to the elevator. How do I stand up from my chair? Well, I need to push with certain muscles and use feedback to stay up, and things like that, right? And then there is a level below which language is insufficient to describe what's going on. So this ability to decompose a complex task into sub-tasks, down to the level where accomplishing the task does not require planning anymore, it just requires acting: that's an essential characteristic of intelligence. It's a problem that is completely unsolved. No one has any idea how to do this. I mean, people do this. Roboticists do hierarchical planning, but they design the levels by hand. What we're talking about here is training a system, including a world model, that will learn all the multiple levels of abstraction that allow it to do hierarchical planning. Okay? That sounds like a tough thing to do, but your cat does it really well. Okay? A cat has only about 800 million neurons, about 100 times fewer than humans. Okay?
So this whole picture led me to an architecture that I call objective-driven AI. And I described this in a paper that I published two and a half years ago on OpenReview; it's not on arXiv, it's on OpenReview, so that people can make comments and tell me how stupid I am. And I invite you to make some comments if you think there are ways to improve this idea. But essentially, it led to an architecture that you could call a cognitive architecture, which is composed of a perception module, similar to the one I showed earlier, a world model, a cost module that contains costs defined by the task, or perhaps sub-goals if I define a sub-goal for the system, a short-term memory, and an actor. And the role of the actor is to perform this optimization procedure to find the good sequence of actions that will optimize the cost module. There's a mysterious module at the top here called the configurator, and the role of this module (I don't know how to build it, it's just a concept at the moment) is to configure the entire system for a particular task, a particular goal, right? So it configures the cost function to accomplish the task, configures the world model for the situation at hand, configures the perception module to pay attention to the right details, things like that. So perhaps if we succeed in building a system of this type, we might be able to fulfill all the desiderata. But we're still missing something big at the moment with current architectures. So how could a machine learn world models from sensory input? And the answer is, well, with self-supervised learning, but let me come to that. So, you know, never mind humans, even cats and dogs can do amazing feats. And whereas, you know, we have LLMs that can pass the bar exam, write essays for us, answer questions by smart retrieval, we don't have domestic robots. We don't have level-five self-driving cars. We certainly don't have a self-driving car that can learn to drive by itself in about 20 hours of practice, okay? Something a 17-year-old can do. We don't have a domestic robot that can learn, zero-shot, to clear the dinner table and fill up the dishwasher, a task that a 10-year-old can accomplish, you know, the first time he or she tries. So, you know, clearly we're missing something really big. We keep bumping into this thing called the Moravec paradox, which is that tasks we attribute to high-level intelligence, like playing chess and writing poems and stuff like that, turn out to be not that hard, while the things that we take for granted, that a cat can accomplish, we still cannot do with computers. So we need computers to really understand the physical world. And that tells you perhaps we're never going to reach human-level intelligence unless we build systems that can understand the physical world. And there is supporting evidence for this. It's the fact that the biggest LLMs today are trained typically on 20 trillion tokens, okay? 2 × 10^13 tokens. A token is like a word, a little bit smaller than a word. Each token is three bytes. So that's 6 × 10^13 bytes. That represents the entirety of all the text publicly available on the internet. It would take any of us here a few hundred thousand years to read. Okay? An enormous amount of data. But in fact, it's not that much data. Take a four-year-old. A four-year-old has been awake a total of 16,000 hours.
At least that's what developmental psychologists tell us. Which, by the way, corresponds to about 30 minutes of YouTube uploads. So not that much data. We have two million optic nerve fibers going from our retinas to our brain. You know, I could just put a number on the amount of information captured by the retina, but that would be cheating, because it gets compressed to squeeze down the optic nerve. So we've got two million fibers, one million for each eye, each carrying about one byte per second, probably even a bit less. Do the arithmetic, and that's about 10^14 bytes. So it's the same order of magnitude as the amount of data the biggest LLMs are trained on. In just four years. In four years, the child has learned an enormous amount of background knowledge about how the world works. Way more than, you know, the smartest AI systems that we have today. So what we need is a way to get a system to understand how the world works by observation. So in the first four months of life, babies basically just observe the world. They can't really do anything, right? They can't grab objects and stuff; that starts after that. But in the first few months, they really can't.
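The back-of-the-envelope numbers from this argument, written out:

```python
# Back-of-the-envelope arithmetic from the talk: text seen by a large LLM
# versus visual data reaching a four-year-old's brain through the optic nerve.
llm_tokens = 2e13                      # ~20 trillion tokens of training text
llm_bytes = llm_tokens * 3             # ~3 bytes per token -> ~6e13 bytes

hours_awake = 16_000                   # a four-year-old's waking hours
optic_fibers = 2e6                     # ~1 million fibers per eye
bytes_per_fiber_per_s = 1              # rough estimate from the talk
visual_bytes = hours_awake * 3600 * optic_fibers * bytes_per_fiber_per_s

print(f"LLM text:   {llm_bytes:.1e} bytes")    # ~6e13
print(f"Child eyes: {visual_bytes:.1e} bytes") # ~1e14, same order of magnitude
```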
This chart comes from my colleague Emmanuel Dupoux, in Paris.
It represents at what age, in months, babies learn basic concepts. Things like object permanence are learned between two and six months; it's not entirely clear when. The difference between animate and inanimate objects, biological motion, around three months. The fact that objects fall into natural categories, that a chair is different from a table, from a car, and so on, emerges between four or five months and nine months. And then basic concepts of intuitive physics, like gravity and inertia, pop up around nine months in babies. So if you show a six-month-old infant the scenario shown at the bottom left, where there is a little car, a toy, on a platform, and you push the car off the platform and it appears to float in the air, a six-month-old child will barely pay attention to it. A ten-month-old will stare at it and be really surprised, because the car is supposed to fall and it's not falling. We are hardwired to pay attention to stuff that violates our internal world model, because that's how we learn. Also, if something violates your world model, it might just kill you. So you've got to pay attention.
So what psychologists do is they measure how long the babies stare at something, and that's an indication as to whether the scenario they're looking at violates their internal world model.
Okay, so, you know, when we train natural language processing systems, the way we do it is that we take a piece of text and we corrupt it in some way. The corruption sometimes consists in masking. Sometimes the masking is only implicit because of the architecture of the network, but let's take masking as the example. So take a sequence of words, remove some of the words, replace them by blank markers, and then train some gigantic neural net to predict the words that are missing. That's the basic principle of LLMs and of basically every NLP system and translation system of the last five or six years. Okay, that paradigm of learning is called self-supervised because you don't have a differentiated input and output. The input is a corrupted version of the output, right? You're training the system not for any particular task other than recovering the full input from a corrupted version of it. So this works great for text, right?
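Here is a toy sketch of that self-supervised objective: corrupt a token sequence by masking a couple of positions and train a model to recover the missing tokens. The vocabulary, the mask positions, and the tiny embedding-plus-linear model are illustrative stand-ins for the gigantic neural nets mentioned here.

```python
# Minimal sketch of the self-supervised text objective: corrupt a token
# sequence by masking some positions, then train a model to recover the
# missing tokens. The tiny embedding-plus-linear "model" is an illustrative
# stand-in for the large neural nets mentioned in the talk.
import torch
import torch.nn as nn

vocab = {w: i for i, w in enumerate(["[MASK]", "the", "cat", "sat", "on", "mat"])}
sentence = torch.tensor([vocab[w] for w in ["the", "cat", "sat", "on", "the", "mat"]])

mask_pos = torch.tensor([2, 5])                 # positions to blank out
corrupted = sentence.clone()
corrupted[mask_pos] = vocab["[MASK]"]           # input: corrupted sequence

model = nn.Sequential(nn.Embedding(len(vocab), 32), nn.Flatten(0),
                      nn.Linear(32 * len(sentence), len(vocab) * len(sentence)))
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()

for _ in range(200):                            # train to fill in the blanks
    opt.zero_grad()
    logits = model(corrupted).view(len(sentence), len(vocab))
    loss = loss_fn(logits[mask_pos], sentence[mask_pos])
    loss.backward()
    opt.step()

print(logits[mask_pos].argmax(dim=-1))          # should recover "sat", "mat"
```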
It totally revolutionized natural language understanding systems. So why not apply this to video? What a smart idea. Take a video, mask a piece of it, for example the second half of the video, then show the first half of the video to a gigantic neural net and train this neural net to predict the parts of the video that are missing: predict what's going to happen next in the video. It doesn't work. Unlike with text, where it works beautifully, it does not work for video. It does not even work for images. I'll show some examples later.
So what happens with video? What happens with video is that the system makes blurry predictions. So a big neural net here was trained to predict what happens in those kinds of synthetic videos at the bottom, which represent cars on a highway. And the second column represents a prediction made by a neural net just, you know, trained to predict pixels. It makes blurry predictions. The one you see at the top, same thing. The first four frames of this six-frame video are observed, and the last two are predicted, and the last two are blurry. Why? Because the system can only predict the average of all the possible futures, and it doesn't know whether the little girl is going to move forward or backward or whatever. So it predicts a blurry mess.
Okay, so this idea of reconstructing pixels basically doesn't work. And you can understand why, right? If I take a video of this room, point the camera here, and slowly pan towards another part of the room, the system may predict that this is a conference room, that there are people sitting in seats, and that at some point the room is going to end. But there is no way it's going to predict what every one of you looks like. It's not possible. The information is just not there. And so it's going to predict a blurry mess, an average of what you could look like, right? And, of course, the solution to this would be to predict not a single point but a distribution. The problem is that we do not know how to appropriately represent distributions in high-dimensional continuous spaces like images. So the idea that we're going to use probabilistic prediction for that doesn't work.
Now, this idea of doing video prediction to get a system, a machine, to learn the nature of the world is not recent. I've been working on this for the better part of the last 15 years. The little video you see at the top is from a 2016 paper, eight years old. But I have a solution to that problem, which popped up fairly recently, actually, in the last five years or so. Well, in fact, the early version of this is an old paper of mine from 1993. So it's not recent, but 30 years old. And the solution I'm proposing is what I call joint embedding, specifically something called a JEPA, Joint Embedding Predictive Architecture. And the references you see here are papers that kind of build on this idea of JEPA.
So what is a JEPA? A JEPA is an architecture which is not generative, okay? Instead of predicting all the pixels in a video, it does not do that. It takes the full video and runs it through an encoder, which produces an abstract representation of the content of that video. Then you take the corrupted video, or the partially masked video, and also run it through an encoder, perhaps the same encoder. And then you train a predictor to predict the representation of the full video from the representation of the partial video. Now, how you train this, I'm going to come to in a second. But this presents a huge advantage, which is that all the details in the video that are not really predictable can be eliminated by the encoder, okay? So the details of what everyone here looks like might be eliminated, but the system, at some level of abstraction, can predict that there are going to be people sitting in chairs, without representing the details, okay? In the same way, at an abstract level, if I'm sitting in my office in New York, I can predict that I can be in Paris the next morning without knowing all the details. And I certainly cannot plan my entire trip from New York to Paris in terms of millisecond-by-millisecond muscle control. I have to use hierarchical planning. So I think an essential component of a world model is the ability to lift the observations that we get from the world into an abstract representation that allows us to make predictions.
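A minimal sketch of that joint embedding predictive architecture, assuming tiny MLP encoders and a plain mean-squared prediction loss in representation space. On its own, this loss can collapse, which is the issue addressed a bit further on.

```python
# Minimal sketch of a joint embedding predictive architecture (JEPA): encode
# both the full input and the corrupted input, and predict the representation
# of the full input from the representation of the corrupted one. The tiny
# MLPs and the plain MSE prediction loss are illustrative assumptions; by
# itself this loss can collapse, which the regularizers discussed later address.
import torch
import torch.nn as nn

dim, rep = 128, 32
encoder = nn.Sequential(nn.Linear(dim, 64), nn.ReLU(), nn.Linear(64, rep))
predictor = nn.Sequential(nn.Linear(rep, 64), nn.ReLU(), nn.Linear(64, rep))

def jepa_prediction_loss(x_full, x_corrupted):
    s_y = encoder(x_full)                   # representation of the full input
    s_x = encoder(x_corrupted)              # representation of the corrupted input
    s_y_hat = predictor(s_x)                # predict in representation space,
    return ((s_y_hat - s_y) ** 2).mean()    # never in pixel space

x = torch.randn(16, dim)                    # a batch of "videos"
x_masked = x * (torch.rand_like(x) > 0.5)   # crude masking corruption
print(jepa_prediction_loss(x, x_masked))
```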
Let me take a concrete example. There is an infinite amount of information that we can gather about Jupiter. It's not really infinite, but you know what I mean. Okay? Jupiter is an incredibly complex object. It's got weather, you know. It's got a funny shape. It's got complicated composition. It's got, like, nuclear reactions in the core. It's got all kinds of complex stuff going on. It's got satellites. But if I'm interested in predicting its trajectory for the next century, I only need six numbers. I need three positions and three velocities, and I'm done. And so, you know, that's Newtonian physics basically, right? So the complexity here is not necessarily to learn how to predict, but it's to learn the appropriate representation that allows you to predict and eliminate the irrelevant stuff for any task.
Okay, so now you have the difference between the two architectures: the generative architecture on the left, which basically tries to predict or reconstruct observations, and the one on the right, the JEPA, which only predicts in representation space. And the challenge is how to train those JEPA architectures on observational data. And I'll tell you why it's an issue. It's an issue because those architectures, if you train the encoder and the predictor simultaneously, can collapse. In other words, the only thing you train the system to do is to minimize the prediction error: the D module here, which measures the distance between the predicted representation and the actual representation Sy computed from Y. If you just minimize that, the system is really happy to completely ignore X and Y and produce constant Sx and Sy, and then the prediction task is trivial. Okay? That's called a collapse. Okay?
So to prevent this from happening, we have to appeal to this idea of energy-based models. I told you the way you capture dependencies between variables with an energy-based model: you build a neural net that has a single scalar output, and you train it to produce, let's say, zero if the input and the output come from the training set, so you know they are compatible, and to produce a larger value for everything else. Now, training a neural net to produce a small value for a given sample is very simple: you show the sample, then you tweak the weights so that the output goes down. Super easy. The hard part is making sure that the energy is higher outside of the training set. And for this, there are two categories of methods that I know of. Okay? So you want to prevent this collapse where the energy function is zero for every pair X, Y, right? If the encoders ignore X and Y and produce constant outputs, the prediction error is zero all the time. And if you think of the prediction error as the energy, then that's an energy surface that is zero everywhere. It's not a good model. You want it to be small on the data points but higher outside. Okay?
Two sets of methods. Contrastive methods consist in generating contrastive points, those flashing green points that you see, and then pushing their energy up. So as you keep pushing the energy of the green points up, and you keep pushing the energy of the blue points down, then perhaps the energy surface will take the right shape. But that turns out to be inefficient in many situations, because in a high-dimensional space (imagine that the Y variable is high-dimensional) the number of different points whose energy you have to push up grows exponentially with the dimension, for an infinitely flexible energy surface. So it doesn't scale very well. In fact, in practice, when we train those systems using contrastive methods, the intrinsic dimension of the representation we get is relatively small. So I prefer regularized methods. What does that mean? A regularized method typically has some term in the energy or the loss that minimizes the volume of space that can take low energy. Okay? So that if you push down the energy in one particular region, it has to go up in other regions, because there is only a small volume of low-energy stuff to go around. Now, that sounds a little mysterious, how you do this, but I'll give you a couple of methods for how you do this.
Okay, so here is a particular instance of this. And this is really the type of experiment that convinced me that those joint embedding architectures were vastly superior to the generative ones. If you want to train a system to learn good representations of images, not video, you take an image, you corrupt it, and, in the case of a joint embedding architecture, you run the images through encoders, the full one and the corrupted one. And again, you train the encoder and the predictor to predict the representation of the full image from the representation of the corrupted one. Once the system has been trained, you just take the encoder as a universal feature extractor, if you want, apply it to an image, and then you can train a classifier in supervised mode on top of it to do, for example, image recognition. And this idea goes back to 1993, to a paper of mine on so-called Siamese networks. It's the idea that you can have two identical neural nets, and you make their outputs as similar as possible for inputs that you know are semantically similar, and you push them away from each other for inputs that you know are different.
There was another set of papers by Sumit Chopra and Raia Hadsell, two of my students at NYU in the mid-2000s. And then these kinds of techniques were revived by Ting Chen at Google in 2020, with Geoffrey Hinton as a co-author on that paper, SimCLR, which is a contrastive method for training vision systems. Those methods work okay, but the embedding dimension of the representations they can learn is fairly limited. So I prefer another set of methods, the regularized methods. And one way to prevent the system from collapsing is to have some measure of the information content of what comes out of the encoder and try to maximize that.
Okay, so let's denote by I(Sx) the information content of Sx. You have to measure this over a batch of samples, because the information content of a single vector is really not an interesting quantity. So take a batch of samples and measure how much information is contained in those vectors, and then try to maximize that. Have a criterion that you can differentiate, so that you can maximize it with respect to the parameters of the encoder.
Now, here is the bad news. You want to maximize an information content. For that, you would need a way to compute the information content, or at least a lower bound on it, so that by pushing up on the lower bound, you would maximize the information content. The bad news is that we don't have lower bounds on information content. We only have upper bounds. And that's because we don't know how much dependency there is between groups of variables, so we just assume there isn't any, and that overestimates the information content.
So here is a really simple, but not so great, way to estimate information content. Again, it's an upper bound, and it uses the covariance matrix. You take a bunch of representation vectors coming out of your encoder from a batch of samples, and you compute the covariance matrix of that: the product of this matrix transposed by itself. Okay? So now you have a matrix whose dimension is the dimension of the representation vector, squared. The diagonal terms of that matrix are the variances of the variables, and you use a criterion that says, I want those variances to be larger than one. Okay? So I don't want the thing to just collapse and be constant; I want the variables to change from one sample to the next. And then the second criterion is a covariance term. It says, I want different variables in my representation to represent different things, to be uncorrelated. Okay? And the correlations are represented by the off-diagonal terms of the covariance matrix. So you just have a differentiable criterion that tries to make the covariance matrix as close to the identity as possible.
Now, there's a number of methods that people have proposed to do this. The particular one on this slide is called VICReg, Variance-Invariance-Covariance Regularization. It's a particular set of criteria to get the covariance matrix to be close to the identity. But other people have proposed similar methods. There's something called MCR² by Yi Ma at Berkeley, and another one called MMCR by some of my colleagues at NYU, SueYeon Chung and Eero Simoncelli. It doesn't really matter exactly which criterion you use. We use a trick to expand the dimension of the representation and decorrelate the variables in that expanded dimension, which makes the original variables more independent of each other, not just uncorrelated. But I'm not going to go into the details of this. There's a little bit of theory about this, which I'm not going to go into either.
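A sketch of a VICReg-style criterion as described here, with a variance term, a covariance term, and the usual prediction (invariance) term. Coefficients and details of the published method differ, so treat this as an illustration only.

```python
# Sketch of a VICReg-style regularized criterion, following the description
# in the talk: a variance term that keeps each representation dimension from
# collapsing to a constant, a covariance term that decorrelates dimensions,
# plus the usual prediction (invariance) term. Exact coefficients and details
# of the published VICReg method differ; this is an illustration only.
import torch

def variance_covariance_terms(s):
    # s: (batch, dim) batch of representation vectors from the encoder
    s = s - s.mean(dim=0)
    var = s.var(dim=0)
    variance_loss = torch.relu(1.0 - torch.sqrt(var + 1e-4)).mean()
    cov = (s.T @ s) / (s.shape[0] - 1)          # covariance matrix
    off_diag = cov - torch.diag(torch.diag(cov))
    covariance_loss = (off_diag ** 2).sum() / s.shape[1]
    return variance_loss, covariance_loss

def vicreg_style_loss(s_x, s_y_hat, s_y):
    pred = ((s_y_hat - s_y) ** 2).mean()        # invariance / prediction term
    v1, c1 = variance_covariance_terms(s_x)
    v2, c2 = variance_covariance_terms(s_y)
    return pred + (v1 + v2) + (c1 + c2)

s = torch.randn(8, 32)
print(variance_covariance_terms(s))
```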
Okay, there's another set of methods that you could think of as regularized methods. They are not contrastive, and you could call them distillation methods. And there's a number of papers about this: one from DeepMind called BYOL, a bunch from my colleagues at Meta, such as SimSiam and DINOv2, and then a couple that I participated in, I-JEPA and V-JEPA. The advantage of those methods is that they seem to learn quickly, and they don't need negative samples like contrastive methods do. They might have some advantages compared to VICReg, but some disadvantages as well.
So the basic idea of this is that you still have those two encoders. They sort of share the same weights, but not exactly: the encoder on the right uses an exponential moving average of the weights of the network on the left. Why? I don't know. It just works. Which is why I have some hesitations about this method. It works, but it's kind of hard to understand why. There are some theoretical papers that show why it works in simple cases, when the encoder is linear. This is a paper by Yuandong Tian, Surya Ganguli, and colleagues at FAIR. But I'm not entirely convinced; it's a pretty hard paper to read. They are convinced that there are good reasons for this thing to work; at least, there are fixed points of the method that are not collapsed. And the red cross that you see on the right of the encoder means you're not propagating gradients into the right copy of the encoder, because its weights are obtained by computing an exponential moving average of the weights of the left one. So there's a slight difficulty with this method, which is that you don't know which cost function you're minimizing. You measure the prediction error, the D function, as time goes by, and it doesn't necessarily go down. The reason is that you're changing parameters, but you're not computing the gradient of a cost function with respect to the parameters, because of this exponential moving average trick. And so it's a bit of a strange algorithm. But it works really well.
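A minimal sketch of that exponential-moving-average trick, assuming tiny MLP encoders and omitting the predictor: gradients flow only into the online branch, and the target branch is dragged along by the EMA update.

```python
# Minimal sketch of the distillation-style trick described here: the target
# encoder is not trained by gradient descent; its weights are an exponential
# moving average (EMA) of the online encoder's weights, and no gradient is
# propagated through it. Architectures and coefficients are illustrative.
import copy
import torch
import torch.nn as nn

online = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 32))
target = copy.deepcopy(online)
for p in target.parameters():
    p.requires_grad_(False)            # the "red cross": no gradient here

@torch.no_grad()
def ema_update(online, target, tau=0.99):
    for p_o, p_t in zip(online.parameters(), target.parameters()):
        p_t.mul_(tau).add_(p_o, alpha=1.0 - tau)

opt = torch.optim.Adam(online.parameters(), lr=1e-3)
x_full, x_corrupt = torch.randn(16, 128), torch.randn(16, 128)
for _ in range(10):
    opt.zero_grad()
    loss = ((online(x_corrupt) - target(x_full)) ** 2).mean()
    loss.backward()                    # gradients reach only the online branch
    opt.step()
    ema_update(online, target)         # then slowly drag the target along
```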
So in particular, there's a set of methods called DINOv2, produced by some of my colleagues at FAIR Paris, Maxime Oquab and collaborators. And you can download it.
So this is a generic feature extractor, trained on lots and lots of different images. It's used by a lot of people across the world, including by people here in one of the brain projects, or at least they experimented with it; they tell me it doesn't work so well for them. But you can just download the thing, and what you have is basically a generic feature extractor for image recognition. You can train a head on top of it to solve any task. I'm not going to bore you with tables of results, but it works really well in self-supervised learning scenarios for object recognition, transfer learning, all kinds of stuff.
One of our colleagues, Camille Couprie at FAIR Paris, used the DINO features to solve an interesting problem, which is to estimate the height of the canopy of trees and vegetation all around the world. We have satellite data for the entire Earth, and for small parts of the Earth we also have LIDAR data: someone flew a plane and shot a radar or LIDAR and could estimate the height of the canopy of the trees. We don't have a lot of that data. So what Camille did was run satellite images through the DINO feature extractor and train a very small head on top of the DINO features to estimate the height of the canopy from whatever training data she had available. Then she applied this to the entire Earth, and that gives us an interesting number: an estimate of the total amount of carbon captured in vegetation. That's a really interesting quantity to have for climate change prediction. So that's one example. People use this for X-ray analysis, for biological image analysis, for all kinds of stuff.
Another technique is called I-JEPA, Image JEPA. It's very similar in many ways; DINO uses a trick that I-JEPA doesn't need. And I-JEPA can learn really, really good features for images. For video, there is a technique called V-JEPA. Here we take a video and corrupt it by masking a whole segment, a temporal tube in that video. Then we train some big neural net to learn an encoder of the video as well as a predictor of the representation of that video. And when we train a system like this and then use the encoder as a feature extractor, for example to identify the action that takes place in a video, it works really well, better than most other methods. What we compare it to are methods that use self-supervised learning by reconstruction, so generative architectures. Okay? There is a set of methods of this type based on autoencoder ideas: autoencoders, regularized autoencoders, sparse autoencoders, variational autoencoders, VQ-VAE (vector-quantized variational autoencoders), and MAE (masked autoencoder), which you see on the chart. Those are all generative architectures that try to predict pixels, either by reconstructing a corrupted image or by reconstructing a corrupted video. None of those techniques work nearly as well as joint embedding. Okay? So several years ago, we realized that this was the case. All the attempts that we made to train self-supervised systems by reconstruction were not giving results nearly as good as the ones trained by joint embedding. And so that had me thinking that there was something really fundamental about the joint embedding architecture, and that we had to abandon the idea of using generative models for video and images. Okay? And it's pretty radical, because generative models are pretty popular right now. So when I tell people, abandon generative models, I have a hard time convincing them. But anyway, I'm not going to bore you with tables of results; you can look at the papers. What I'm going to talk about is a research experiment: training a world model from those DINO features and using model predictive control, so planning, with the trained predictor. This is work by a student co-advised by myself and Lerrel Pinto, who's a roboticist at NYU, and a collaborator, Hengkai Pan. This was partially done while Gaoyue Zhou was an intern at Meta in New York. The idea is that you take a bunch of frames from a video, you run them through the DINO feature extractor, and then you train a predictor on top of it, so that given the DINO feature representation of a frame and an action being taken in the world you are observing, it predicts the representation of the next state of the world. And the actions here depend on the task. These are simulated robotics tasks, and the sort of early preview of some results here on the right shows that this technique works better than some previous methods that people have tried.
Okay, so the way this works is very much the type of picture I showed early on in the talk, where you have this world model that predicts the successive states of the world in the form of DINO features, basically. And the actions that you feed to the system are a sequence of a certain number of actions of the robot in the world.
Okay, so what are the tasks that we're talking about here?
Okay, and the way you solve the task here: there's no learning. Once you've trained the world model, there's no learning to do; it's just planning, okay? So it's zero-shot task solving, essentially, where you measure the distance in representation space between the predicted state and the target state, right? You take a target state, which is an image, and run it through the DINO encoder. That gives you a representation. You compute the Euclidean distance between this representation of the target state and the state predicted by your world model from the sequence of actions. And that's your objective function. You want to minimize it: you want to figure out a sequence of actions that will minimize that cost. So you can do this with a number of tasks. One task is the point maze, which is, you know, moving a dot to a particular location. So let me play this again. Another one is to move a T-shaped object to a particular target location, okay? There's a number of actions that you have to take to get to this. Another set of tasks is, again, moving a dot, planning a trajectory for a dot to move from one side of a wall to another, through a door, and the door moves. And then another set of tasks where you are supposed to shape a deformable object into a particular shape. So it's easy to generate training data for a world model: you basically set the world in a random configuration, take a random action, observe the result, and that's how you train your world model. And this works pretty well.
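A rough sketch of that zero-shot planning loop, using random-shooting search over action sequences and a Euclidean distance to the goal representation. The toy encoder and predictor stand in for the DINO encoder and the trained world model, and the search strategy is an assumption made for the example.

```python
# Sketch of the zero-shot planning loop described here: encode the goal image,
# roll the trained world model forward over candidate action sequences, and
# pick the sequence whose predicted final representation is closest (Euclidean
# distance) to the goal representation. Random-shooting search, the toy
# encoder, and the toy predictor are illustrative stand-ins for the DINO
# encoder and the trained predictor.
import torch
import torch.nn as nn

rep_dim, action_dim, horizon = 32, 2, 8
encoder = nn.Sequential(nn.Linear(64, rep_dim))          # stand-in for DINO
predictor = nn.Sequential(nn.Linear(rep_dim + action_dim, rep_dim))

@torch.no_grad()
def plan(obs_img, goal_img, n_candidates=256):
    s0, s_goal = encoder(obs_img), encoder(goal_img)
    candidates = torch.randn(n_candidates, horizon, action_dim)
    best_cost, best_actions = float("inf"), None
    for actions in candidates:
        s = s0
        for t in range(horizon):                          # imagined rollout
            s = predictor(torch.cat([s, actions[t]]))
        cost = torch.linalg.norm(s - s_goal)              # distance to goal
        if cost < best_cost:
            best_cost, best_actions = cost.item(), actions
    return best_actions

plan(torch.randn(64), torch.randn(64))
```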
So these are open-loop plans, or what's sometimes called rollouts, where you start from a visual condition at the top, you run a sequence of actions through the environment, and you get the top line. That's the ground truth, if you want. Then you run the same sequence of actions through a bunch of different models that are trained in different ways, which I'm not going to go into the details of. At the bottom is the system I'm talking about. So you're giving the system the sequence of actions, the same sequence as the one at the top, and you let the system imagine the succession of internal states that the world is going to go through, and then you train a decoder to produce an image of that internal state. Okay, the decoder is trained separately. It's not used to train the world model at all; it's just for visualization. And what you get is the line at the bottom, and it does a pretty good job at predicting what the final state of the world is going to be, right? The bottom-right square is pretty much identical to the top-right square. That's not the case for the other techniques. This is an interesting task. This is a task of basically throwing a bunch of chips on a platter, and the action that the robot can take is to come down on the platter at coordinate x, y, then move by delta x, delta y, and then lift. Okay, that's one action. That's four numbers. And so you can train the system by taking random actions from random configurations. And again, you can see the predictions that the system makes: compare the configuration, the image at the bottom right, with the one at the top right. And, you know, they're not exactly identical, but they're pretty similar, whereas other models don't do nearly as good a job. In fact, experimentally, these techniques can actually do planning pretty well. So it can plan a sequence of actions to bring the chips into a particular configuration that we decided in advance. In the chart on the right, the granular task is this task of moving the chips, and it's measured by Chamfer distance, which means lower is better. The blue is this DINO world model, and the other ones are other methods that people have proposed in the past. So let me show you the results. So let's play that again. What you see at the bottom is, you know, a sort of jerky action, because we're sending an action every five frames; the action is constant between the frames. But what you can see is that the system is able to make a simple plan to get the dot to the target, which is shown in the right column. Same for the T. Okay?
So at the bottom, the blue dot moves the T to the desired position, the target position, which is represented on the right. This is the more interesting one here. We don't see the robot moving, because an action, again, is going down on the platter, moving, and lifting, and the robot comes back to its original position, so you don't see it in the frames. Let me show this again. So we start from a random configuration. What you see at the bottom is what the system predicts is going to happen over a sequence of five actions, and what actually occurs, given those actions, is at the top. The target is to put all the chips in a square, and of course we start from something that's not a square. But it does a pretty good job at corralling the chips into a kind of compact set, if you want. Not quite the square. So that works pretty well.
So I'm going to give you a number of recommendations resulting from those years of experiments. The first one: abandon generative models. All right, that's a tough one. If you are interested in human-level intelligence, okay? If you're interested in LLMs, do LLMs. But if you're interested in human-level intelligence, which I think you should be, abandon generative models. Use those JEPAs. Abandon probabilistic modeling, because we can't represent distributions accurately in high-dimensional continuous spaces; it's intractable. So use energy-based models. You don't need to normalize anything.
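A toy illustration of that last point, assuming nothing about the actual models in the talk: an energy-based model only needs an unnormalized compatibility score E(x, y), and inference is just comparing energies, so the intractable normalization over a high-dimensional continuous y never has to be computed. The quadratic energy below is purely a placeholder.

```python
# Sketch of the energy-based view: lower energy means x and y are more compatible.
# No partition function (integral over all y) is ever computed.
import numpy as np

def energy(x, y, W):
    """Unnormalized compatibility score between x and a candidate y."""
    return float(np.sum((y - W @ x) ** 2))

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 3))                      # placeholder parameters
x = rng.normal(size=3)
candidates = [rng.normal(size=3) for _ in range(4)]
best = min(candidates, key=lambda y: energy(x, y, W))   # pick the lowest-energy y
print(best)
```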
Abandon contrastive methods. Contrastive learning is kind of a buzzword in machine learning as well. Even though I contributed to inventing them, I've become very pessimistic about them; I prefer those regularized methods. And abandon reinforcement learning, but I've said that for 10 years now, so that's not new. Those four things are four of the most popular topics in machine learning at the moment, so I'm not very popular. But the main recommendation is that if you are interested in human-level AI, do not work on LLMs. In fact, do not work on LLMs at all, because you're going to be competing with people who have teams of hundreds with enormous computing resources. And there's one of them in the room who has such a team. Where is he? Okay, I'm not seeing him. So you don't want to be competing with these people. If you are in academia, you just don't have the GPU resources. You don't have dozens of engineers working with you. You don't have hundreds of scientists working with you. Just don't work on LLMs. Come up with new ideas, okay? You don't have to train them on gigantic data sets; you just need to show a principle, a little bit like the example I just showed.
Okay, so we still have a lot of problems to solve before we can put together a system that might have a chance of reaching animal-level intelligence, like a cat or something. So: training large-scale world models. Coming up with good planning algorithms; it turns out gradient-based methods get stuck in local minima, and if you have complex neural nets in your world model, it becomes difficult to optimize, so there's a lot of research to do there, probably in the context of applied math. Dealing with uncertainty, with latent variables, and planning in the presence of uncertainty. And then the idea of hierarchical planning, as I said, is completely unsolved; it's very important to solve that problem. Coming up with systems that can be used as large associative memories, which are needed to remember what the state of the world is. And then slightly more theoretical issues: mathematical foundations for energy-based models, inverse RL to learn cost modules, exploration to adjust your world model, and things like that.
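For the planning difficulty mentioned above, here is a hedged sketch of gradient-based planning: the action sequence is treated as a tensor and optimized by backpropagating the latent distance to the target through a differentiable world model. With a trivial placeholder model like the one below this converges easily; with a complex neural world model, this same procedure can get stuck in local minima, which is why sampling-based planners are also studied. The tiny linear dynamics are an assumption for illustration only.

```python
# Sketch of gradient-based planning: optimize the plan (the actions), not the model.
import torch

torch.manual_seed(0)
state_dim, action_dim, horizon = 4, 2, 5
A = torch.randn(state_dim, state_dim) * 0.1      # placeholder world-model parameters
B = torch.randn(state_dim, action_dim) * 0.1

def predict_next(z, a):
    return z + torch.tanh(A @ z + B @ a)

z0 = torch.zeros(state_dim)
z_target = torch.ones(state_dim)
actions = torch.zeros(horizon, action_dim, requires_grad=True)
opt = torch.optim.Adam([actions], lr=0.1)

for step in range(200):
    z = z0
    for t in range(horizon):                      # roll the plan through the world model
        z = predict_next(z, actions[t])
    loss = torch.sum((z - z_target) ** 2)         # distance to the target in latent space
    opt.zero_grad()
    loss.backward()                               # gradients flow back into the action sequence
    opt.step()

print(loss.item())
```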
Okay, but if we succeed in doing all this, perhaps within five, six, seven years, we'll have a good handle on a path towards human-level intelligence, and we may be able to build virtual AI systems that would be considerably more useful than the current ones. And those systems might eventually constitute the repository of all human knowledge, if we do it right. We're going to need a distributed architecture that trains on all the data in the world, but perhaps without copying the data. So, if we want future AI systems to speak all the languages of India, or at least a good portion of them, let's start with 20 or something, we need a lot of data from India. And it might be difficult to actually get that: the government of India may not be willing to just give that data to Meta or OpenAI, obviously. So we need a way to do distributed training so that we can have systems that are trained on all the data in the world, but without copying the data. That's an interesting technical challenge. But it also means that open-source AI platforms are necessary. In the future, AI platforms will basically be infrastructure, and infrastructure must be open source. That's the option that Meta has adopted in its strategy and that other AI companies, I'm sorry to say, have not. So open research is really a must, and it should not be regulated out of existence. Some jurisdictions around the world are not doing this completely right, so that's a big danger. But, again, if we succeed, human intelligence would be amplified; people would be smarter for it. This may bring a change in human society, perhaps similar to what happened in the 15th century when the printing press was widely disseminated and it made it worth people's time to learn to read. That brought about the Enlightenment, the destruction of the feudal system, and a completely different modus operandi for humanity. We might see something like this over the next few decades, and we need to do it right. And we have an important role to play here in India. Thank you very much. So, I went way over time, so I'm not sure what the organizers are going to say next.

On behalf of everyone present here, I thank you, Doctor, for your engaging lecture. Your passion for AI was evident, and your insight and expertise truly enriched our understanding. We look forward to learning further from you.
We will now begin the Q&A session. For the same, please join me in welcoming our moderator for the evening, Professor Balaraman Ravindran, Head of the Wadhwani School of Data Science and AI. The professor will be moderating the questions that we already received during registration.
Thank you, Yann, once again for an amazing, amazing lecture. So, I'm going to just go through a couple of questions that we thought might clarify some of the things that you spoke about.
So, one of the questions that people have been asking is: what is the role of embodiment in these kinds of AI models in getting to this level of intelligence?
Right. So, I think it's crucial, in the sense that the people in the last five, six years who have made interesting contributions to this idea of world models and things like this are roboticists, people who are trying to apply machine learning to robotics, because you can't cheat with robots. You really have to have a system that understands how the physical world works. A story I might tell is that quite a few years ago, when I was discussing with Mark Zuckerberg whether to start an AI research lab at Meta, I asked him, is there any area of AI that you think we should not be working on? And his answer was, I can't think of any good reason for Facebook, at the time, to work on robotics. And I said, okay. So, for two years we didn't do anything about robotics. And then after two or three years, I realized that there's a lot of interesting things to do in robotics just to make progress in AI. So we started a small group in robotics, and then we built it up, and now we have a whole group called Embodied AI, basically. And now robotics is kind of a hot topic in the industry. So it's the role of a research lab to anticipate trends of this type five years in advance, so that when your boss comes to you and says, so, what are we doing about X, you can say, well, we've been working on X for the last five years.

So, you had actually mentioned Subbarao Kambhampati in the beginning. I just wanted to point out that Subbarao is an alumnus of IIT Madras. Oh, great. And I have a question: would Subbarao say that your model plans? The world model that you're showing at the end, right? Would he accept that this is a model that actually plans?

I think so, yes. I think he would. Now, his idea of planning, though, is in the context of classical AI, where planning is, you know, stacking blocks on top of each other, and the actions are discrete, because that's the kind of combinatorial search you can do in classical AI. What I'm interested in is more planning in continuous space, like motion planning, controlling arms and legs and things like that. So I think it's more connected with optimal control and robotics than it is to classical AI. In fact, I know that Rao had something to do with this place, because he told me, well, actually, you're giving a talk in a concert hall which is known for Carnatic music and jazz, and I said, well, I love the mixture of the two, jazz with inspiration from Carnatic music.
One last question, because we're really out of time. So, how far do you think we are from a theory of how these networks work? You said there are things that happen where we don't really understand what's happening. And Surya Ganguli actually predicted that this is going to be the century when we truly understand AI and build the theory of AI, like we built the theory of communication in the previous century. What would you say on that?
Okay. So, I don't have a single answer for this. I think there are different ways to approach the understanding of deep learning, particularly self-supervised learning and things like this, from different angles. One of my postdocs is an information theorist, as a matter of fact. So he's trying to analyze the models and theorize about self-supervised learning using quantities like information content and mutual information and stuff like that. There are a number of papers that he and I wrote together, mostly him. On that topic, there's another angle, which is from statistical physics. It used to be that statistical physicists were interested in neural nets back in the 1980s. Then their interest waned a little bit in the 90s, and now they're coming back to it because of deep learning and the mathematics of it, like this energy-based model framework I talked about. A lot of the underlying mathematics comes from statistical physics, so maybe that's where the next thing will come from. I don't have a huge amount of hope for, like, classical theoretical computer science.

Especially since you brought it up: so, what's your opinion on the recent Nobel in physics?
Yeah. Okay. So, my impression is that the Nobel Committee was under some pressure to reward deep learning. You could see this because there were documentaries on Swedish TV and segments on the Swedish news programs saying, you know, the Nobel goes to obscure contributions in physics; why not to people who are revolutionizing the world with AI? So I think it was pretty clear at some point that they were going to give it to people doing protein structure prediction, including the AlphaFold team and David Baker's lab and perhaps some other people. They probably decided to do this, but then they probably also decided that they should give it as well to people who contributed to the fundamental ideas. And they couldn't pack more than three people into chemistry, so they had to pick physics. And then they had to pick a physicist: John Hopfield, who's a physicist, a biophysicist. And Geoff Hinton made complete sense in that context, even though he's not a physicist; his model is called the Boltzmann machine, and Boltzmann is a legend in statistical physics. So this whole thing kind of makes sense. But the thing you have to realize, and I'm super pumped up and excited by the fact that a Nobel was given to people working on neural nets, I think it's great, but you have to realize that neither Hopfield nets nor Boltzmann machines are used anymore. They are very conceptually interesting models, but they are completely useless in practice; nobody uses them. What we use is backpropagation. So that's an interesting concept. But, you know, I have no criticism to make.

Sure. So, thank you very much, Yann, for answering these questions. And let's thank Yann again. Thank you. Thank you. And if you are curious, the pictures that I'm showing here are pictures I took from my backyard in New Jersey.

I would now like to invite on stage Professor Raghunathan Rangaswamy, Dean of the Office of Global Engagement, to present a memento to our esteemed speaker as a token of our appreciation. Okay, I feel like I should put this back on. Thank you, Professor.
Finally, I would like to invite Mr. Vinod from the Office of Global Engagement on stage to propose the vote of thanks on behalf of everyone here today.
Good evening, everyone. I hope you all had a wonderful session that was both interesting and informative. As we conclude today's lecture, I would like to express our sincere gratitude to those who made this event possible. First, our heartfelt thanks to our esteemed speaker, Professor Yann LeCun, for his invaluable insights on artificial intelligence. We also appreciate the support from the entire Meta team. Thank you to Dr. Subra Suresh for gracing us with his presence, and to Mr. Kris Gopalakrishnan for his generous support in organizing this program. I also extend my gratitude to Professor V. Kamakoti, Director of IIT Madras, for his participation, as well as to Professor Raghunathan Rangaswamy, Dean of the Office of Global Engagement, and Professor Ashwin Mahalingam, Dean of Alumni and Corporate Relations, for their guidance. My special thanks to Professor B. Ravindran for moderating the Q&A session, and to his team from the Department of Data Science and Artificial Intelligence for their support. I also appreciate my colleagues from the Office of Global Engagement and Alumni and Corporate Relations for their hard work. My sincere thanks to the IIT Madras alumni team, and to our audience for your active participation; your presence underscores the importance of today's discussion. Thanks also to the media team for the coverage. I thank everyone watching the lecture on YouTube, and my apologies to those who wished to register but could not due to the overwhelming response. Once again, thank you all for being here. We look forward to future events. Have a wonderful evening. Thank you.
Thank you, everyone. Kindly rise for the National Anthem.