Lex Fridman Podcast - Andrej Karpathy

00:00:00,000 --> 00:00:03,960 I think it's possible that physics has exploits and we should be trying to find them, arranging
00:00:03,960 --> 00:00:08,640 some kind of a crazy quantum mechanical system that somehow gives you buffer overflow, somehow
00:00:08,640 --> 00:00:11,340 gives you a rounding error in the floating point.
00:00:11,340 --> 00:00:15,600 Synthetic intelligences are kind of like the next stage of development.
00:00:15,600 --> 00:00:21,000 And I don't know where it leads to, like at some point, I suspect the universe is some
00:00:21,000 --> 00:00:23,340 kind of a puzzle.
00:00:23,340 --> 00:00:30,160 These synthetic AIs will uncover that puzzle and solve it.
00:00:30,160 --> 00:00:34,860 The following is a conversation with Andrej Karpathy, previously the Director of AI at
00:00:34,860 --> 00:00:39,960 Tesla and before that at OpenAI and Stanford.
00:00:39,960 --> 00:00:46,480 He is one of the greatest scientists, engineers, and educators in the history of artificial
00:00:46,480 --> 00:00:48,040 intelligence.
00:00:48,040 --> 00:00:52,880 This is the Lex Fridman podcast, to support it, please check out our sponsors.
00:00:52,880 --> 00:00:58,320 And now, dear friends, here's Andrej Karpathy.
00:00:58,320 --> 00:01:04,680 What is a neural network and why does it seem to do such a surprisingly good job of learning?
00:01:04,680 --> 00:01:05,680 What is a neural network?
00:01:05,680 --> 00:01:12,760 It's a mathematical abstraction of the brain, I would say that's how it was originally developed.
00:01:12,760 --> 00:01:15,700 At the end of the day, it's a mathematical expression and it's a fairly simple mathematical
00:01:15,700 --> 00:01:17,580 expression when you get down to it.
00:01:17,580 --> 00:01:23,760 It's basically a sequence of matrix multiplies, which are really dot products mathematically,
00:01:23,760 --> 00:01:25,400 and some nonlinearities thrown in.
00:01:25,400 --> 00:01:29,460 And so it's a very simple mathematical expression and it's got knobs in it.
00:01:29,460 --> 00:01:30,460 Many knobs.
00:01:30,460 --> 00:01:31,460 Many knobs.
00:01:31,460 --> 00:01:35,000 And these knobs are loosely related to basically the synapses in your brain, they're trainable,
00:01:35,000 --> 00:01:36,000 they're modifiable.
00:01:36,000 --> 00:01:39,600 And so the idea is like, we need to find the setting of the knobs that makes the neural
00:01:39,600 --> 00:01:43,680 net do whatever you want it to do, like classify images and so on.
00:01:43,680 --> 00:01:47,220 And so there's not too much mystery, I would say in it.
00:01:47,220 --> 00:01:50,600 You might think that, but basically you don't want to endow it with too much meaning with respect
00:01:50,600 --> 00:01:54,920 to the brain and how it works, it's really just a complicated mathematical expression
00:01:54,920 --> 00:01:58,840 with knobs and those knobs need a proper setting for it to do something desirable.
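As a rough, hedged illustration of that "simple mathematical expression with knobs" view, here is a minimal sketch of a two-layer network as matrix multiplies plus a nonlinearity; the layer sizes, random weights, and fake inputs are arbitrary assumptions for illustration, not anything specific from the conversation.

```python
# A minimal sketch: a two-layer neural net as matrix multiplies plus a
# nonlinearity, with the weights playing the role of the "knobs".
# All sizes and values are arbitrary, for illustration only.
import numpy as np

rng = np.random.default_rng(0)

# The knobs: weight matrices and biases that training would adjust.
W1, b1 = rng.normal(size=(784, 128)) * 0.01, np.zeros(128)
W2, b2 = rng.normal(size=(128, 10)) * 0.01, np.zeros(10)

def forward(x):
    """Map a batch of flattened 28x28 images (batch, 784) to class scores (batch, 10)."""
    h = np.maximum(0.0, x @ W1 + b1)   # matrix multiply, then ReLU nonlinearity
    return h @ W2 + b2                 # one more matrix multiply

scores = forward(rng.normal(size=(4, 784)))  # e.g. "classify" 4 random fake images
print(scores.shape)                          # (4, 10)
```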
00:01:58,840 --> 00:02:04,320 Yeah, but poetry is just a collection of letters with spaces, but it can make us feel
00:02:04,320 --> 00:02:05,320 a certain way.
00:02:05,320 --> 00:02:10,160 And in that same way, when you get a large number of knobs together, whether it's inside
00:02:10,160 --> 00:02:16,160 the brain or inside a computer, they seem to surprise us with their power.
00:02:16,160 --> 00:02:17,160 Yeah.
00:02:17,160 --> 00:02:18,160 And that's fair.
00:02:18,160 --> 00:02:22,960 So basically, I'm underselling it by a lot because you definitely do get very surprising
00:02:22,960 --> 00:02:27,740 emergent behaviors out of these neural nets when they're large enough and trained on complicated
00:02:27,740 --> 00:02:28,740 enough problems.
00:02:28,740 --> 00:02:33,920 Like say, for example, the next word prediction in a massive data set from the internet.
00:02:33,920 --> 00:02:37,240 And then these neural nets take on pretty surprising magical properties.
00:02:37,240 --> 00:02:41,720 Yeah, I think it's kind of interesting how much you can get out of even very simple mathematical
00:02:41,720 --> 00:02:42,720 formalism.
00:02:42,720 --> 00:02:46,360 When your brain right now is talking, is it doing next word prediction?
00:02:46,360 --> 00:02:48,800 Or is it doing something more interesting?
00:02:48,800 --> 00:02:53,840 Well, it's definitely some kind of a generative model that's GPT-like and prompted by you.
00:02:53,840 --> 00:02:54,840 Yeah.
00:02:54,840 --> 00:02:58,840 So you're giving me a prompt and I'm kind of like responding to it in a generative way.
00:02:58,840 --> 00:03:01,000 And by yourself, perhaps a little bit?
00:03:01,000 --> 00:03:05,800 Like are you adding extra prompts from your own memory inside your head?
00:03:05,800 --> 00:03:06,800 Or no?
00:03:06,800 --> 00:03:10,920 Well, it definitely feels like you're referencing some kind of a declarative structure of like
00:03:10,920 --> 00:03:16,360 memory and so on, and then you're putting that together with your prompt and giving
00:03:16,360 --> 00:03:17,360 away some answers.
00:03:17,360 --> 00:03:21,560 Like how much of what you just said has been said by you before?
00:03:21,560 --> 00:03:23,560 Nothing, basically, right?
00:03:23,560 --> 00:03:28,040 No, but if you actually look at all the words you've ever said in your life, and you do
00:03:28,040 --> 00:03:33,920 a search, you'll probably have said a lot of the same words in the same order before.
00:03:33,920 --> 00:03:34,920 Yeah, could be.
00:03:34,920 --> 00:03:37,600 I mean, I'm using phrases that are common, etc.
00:03:37,600 --> 00:03:42,160 But I'm remixing it into a pretty sort of unique sentence at the end of the day.
00:03:42,160 --> 00:03:43,160 But you're right, definitely.
00:03:43,160 --> 00:03:44,160 There's like a ton of remixing.
00:03:44,160 --> 00:03:52,400 It's like Magnus Carlsen said, I'm rated 2900 whatever, which is pretty decent.
00:03:52,400 --> 00:03:58,240 I think you're talking very, you're not giving enough credit to neural nets here.
00:03:58,240 --> 00:04:05,000 Why do they seem to, what's your best intuition about this emergent behavior?
00:04:05,000 --> 00:04:09,040 I mean, it's kind of interesting because I'm simultaneously underselling them.
00:04:09,040 --> 00:04:12,320 But I also feel like there's an element to which I'm over, like, it's actually kind of
00:04:12,320 --> 00:04:15,760 incredible that you can get so much emergent magical behavior out of them, despite them
00:04:15,760 --> 00:04:17,680 being so simple mathematically.
00:04:17,680 --> 00:04:22,820 So I think those are kind of like two surprising statements that are kind of juxtaposed together.
00:04:22,820 --> 00:04:26,240 And I think basically what it is, is we are actually fairly good at optimizing these neural
00:04:26,240 --> 00:04:27,420 nets.
00:04:27,420 --> 00:04:31,700 And when you give them a hard enough problem, they are forced to learn very interesting
00:04:31,700 --> 00:04:37,480 solutions in the optimization, and those solutions basically have these emergent properties that
00:04:37,480 --> 00:04:39,060 are very interesting.
00:04:39,060 --> 00:04:43,980 There's wisdom and knowledge in the knobs.
00:04:43,980 --> 00:04:49,520 And so this representation that's in the knobs, does it make sense to you intuitively that
00:04:49,520 --> 00:04:55,640 a large number of knobs can hold a representation that captures some deep wisdom about the data
00:04:55,640 --> 00:04:56,960 it has looked at?
00:04:56,960 --> 00:04:58,880 It's a lot of knobs.
00:04:58,880 --> 00:05:04,120 It's a lot of knobs, and somehow, you know, so speaking concretely, one of the neural
00:05:04,120 --> 00:05:08,400 nets that people are very excited about right now are our GPTs, which are basically just
00:05:08,400 --> 00:05:10,420 next word prediction networks.
00:05:10,420 --> 00:05:14,840 So you consume a sequence of words from the internet, and you try to predict the next
00:05:14,840 --> 00:05:15,880 word.
00:05:15,880 --> 00:05:23,640 And once you train these on a large enough data set, they, you can basically prompt these
00:05:23,640 --> 00:05:27,080 neural nets in arbitrary ways, and you can ask them to solve problems, and they will.
00:05:27,080 --> 00:05:32,520 So you can just tell them, you can make it look like you're trying to solve some kind
00:05:32,520 --> 00:05:35,800 of a mathematical problem, and they will continue what they think is the solution based on what
00:05:35,800 --> 00:05:37,200 they've seen on the internet.
00:05:37,200 --> 00:05:43,200 And very often, those solutions look remarkably consistent and look correct, potentially.
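To make the "consume a sequence of words, predict the next word, then keep going from a prompt" loop concrete, here is a toy sketch that stands in for a GPT with a simple word-pair counter; the tiny corpus and the sampling loop are illustrative assumptions only, and a real GPT replaces the counter with a large trained transformer over tokens.

```python
# Toy stand-in for next word prediction: count which word tends to follow which,
# then "prompt" the model and let it keep predicting the next word.
from collections import Counter, defaultdict
import random

corpus = "the cat sat on the mat . the dog sat on the rug .".split()

following = defaultdict(Counter)            # "training": gather next-word statistics
for prev, nxt in zip(corpus, corpus[1:]):
    following[prev][nxt] += 1

def generate(prompt, n_words=6, seed=0):
    """Continue the prompt by repeatedly sampling a likely next word."""
    random.seed(seed)
    out = prompt.split()
    for _ in range(n_words):
        counts = following[out[-1]]
        if not counts:
            break
        words, weights = zip(*counts.items())
        out.append(random.choices(words, weights=weights)[0])
    return " ".join(out)

print(generate("the dog"))   # e.g. "the dog sat on the mat ."
```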
00:05:43,200 --> 00:05:45,560 Do you still think about the brain side of it?
00:05:45,560 --> 00:05:52,400 So as neural nets is an abstraction or mathematical abstraction of the brain, you still draw wisdom
00:05:52,400 --> 00:05:58,360 from the biological neural networks, or even the bigger question, so you're a big fan
00:05:58,360 --> 00:06:02,560 of biology and biological computation.
00:06:02,560 --> 00:06:08,000 What impressive thing is biology doing to you that computers are not yet?
00:06:08,000 --> 00:06:09,120 That gap?
00:06:09,120 --> 00:06:13,520 I would say I'm definitely on, I'm much more hesitant with the analogies to the brain than
00:06:13,520 --> 00:06:16,360 I think you would see potentially in the field.
00:06:16,360 --> 00:06:22,080 And I kind of feel like, certainly, the way neural networks started is everything stemmed
00:06:22,080 --> 00:06:26,160 from inspiration by the brain, but at the end of the day, the artifacts that you get
00:06:26,160 --> 00:06:30,960 after training, they are arrived at by a very different optimization process than the optimization
00:06:30,960 --> 00:06:33,040 process that gave rise to the brain.
00:06:33,040 --> 00:06:38,820 And so I think, I kind of think of it as a very complicated alien artifact.
00:06:38,820 --> 00:06:42,680 It's something different, I'm sorry, the neural nets that we're training.
00:06:42,680 --> 00:06:45,720 They are complicated alien artifacts.
00:06:45,720 --> 00:06:49,280 I do not make analogies to the brain because I think the optimization process that gave
00:06:49,280 --> 00:06:51,840 rise to it is very different from the brain.
00:06:51,840 --> 00:06:57,200 So there was no multi-agent self-play kind of set up and evolution.
00:06:57,200 --> 00:07:02,640 It was an optimization that is basically what amounts to a compression objective on a massive
00:07:02,640 --> 00:07:03,640 amount of data.
00:07:03,640 --> 00:07:09,240 Okay, so artificial neural networks are doing compression and biological neural networks
00:07:09,240 --> 00:07:16,960 are not really doing anything, they're an agent in a multi-agent self-play system that's
00:07:16,960 --> 00:07:19,580 been running for a very, very long time.
00:07:19,580 --> 00:07:25,100 That said, evolution has found that it is very useful to predict and have a predictive
00:07:25,100 --> 00:07:26,520 model in the brain.
00:07:26,520 --> 00:07:31,440 And so I think our brain utilizes something that looks like that as a part of it, but
00:07:31,440 --> 00:07:37,400 it has a lot more, you know, gadgets and gizmos and value functions and ancient nuclei that
00:07:37,400 --> 00:07:41,000 are all trying to like make you survive and reproduce and everything else.
00:07:41,000 --> 00:07:44,480 And the whole thing through embryogenesis is built from a single cell.
00:07:44,480 --> 00:07:49,880 I mean, it's just, the code is inside the DNA and it just builds it up like the entire
00:07:49,880 --> 00:07:57,000 organism with arms and the head and legs and like it does it pretty well.
00:07:57,000 --> 00:07:59,320 It should not be possible.
00:07:59,320 --> 00:08:03,760 So there's some learning going on, there's some kind of computation going through that
00:08:03,760 --> 00:08:04,760 building process.
00:08:04,760 --> 00:08:10,640 I mean, I don't know where, if you were just to look at the entirety of history of life
00:08:10,640 --> 00:08:15,360 on earth, what do you think is the most interesting invention?
00:08:15,360 --> 00:08:17,880 Is it the origin of life itself?
00:08:17,880 --> 00:08:20,480 Is it just jumping to eukaryotes?
00:08:20,480 --> 00:08:22,240 Is it mammals?
00:08:22,240 --> 00:08:25,160 Is it humans themselves, homo sapiens?
00:08:25,160 --> 00:08:31,840 The origin of intelligence or highly complex intelligence?
00:08:31,840 --> 00:08:36,240 Or is it all just a continuation of the same kind of process?
00:08:36,240 --> 00:08:40,320 Certainly I would say it's an extremely remarkable story that I'm only like briefly learning
00:08:40,320 --> 00:08:45,760 about recently, all the way from, actually like you almost have to start at the formation
00:08:45,760 --> 00:08:48,840 of earth and all of its conditions and the entire solar system and how everything is
00:08:48,840 --> 00:08:53,600 arranged with Jupiter and moon and the habitable zone and everything.
00:08:53,600 --> 00:08:57,920 And then you have an active earth that's turning over material.
00:08:57,920 --> 00:09:02,000 And then you start with abiogenesis and everything.
00:09:02,000 --> 00:09:03,880 So it's all like a pretty remarkable story.
00:09:03,880 --> 00:09:12,320 I'm not sure that I can pick a single unique piece of it that I find most interesting.
00:09:12,320 --> 00:09:15,400 I guess for me as an artificial intelligence researcher, it's probably the last piece.
00:09:15,400 --> 00:09:22,480 We have lots of animals that are not building technological society, but we do.
00:09:22,480 --> 00:09:24,880 And it seems to have happened very quickly.
00:09:24,880 --> 00:09:27,120 It seems to have happened very recently.
00:09:27,120 --> 00:09:31,160 And something very interesting happened there that I don't fully understand.
00:09:31,160 --> 00:09:35,440 I almost understand everything else kind of, I think intuitively, but I don't understand
00:09:35,440 --> 00:09:38,220 exactly that part and how quick it was.
00:09:38,220 --> 00:09:40,200 Both explanations would be interesting.
00:09:40,200 --> 00:09:43,360 One is that this is just a continuation of the same kind of process.
00:09:43,360 --> 00:09:47,840 There's nothing special about humans that would be deeply understanding that would be
00:09:47,840 --> 00:09:53,520 very interesting that we think of ourselves as special, but it was obvious or it was already
00:09:53,520 --> 00:10:01,240 written in the code that you would have greater and greater intelligence emerging.
00:10:01,240 --> 00:10:06,800 And then the other explanation, which is something truly special happened, something like a rare
00:10:06,800 --> 00:10:13,000 event, whether it's like crazy rare event, like a space odyssey, what would it be?
00:10:13,000 --> 00:10:23,180 If you say like the invention of fire or, as Richard Wrangham says, the beta males deciding
00:10:23,180 --> 00:10:26,720 on a clever way to kill the alpha males by collaborating.
00:10:26,720 --> 00:10:31,920 So just optimizing the collaboration, the multi-agent aspects of the multi-agent and
00:10:31,920 --> 00:10:38,240 that really being constrained on resources and trying to survive the collaboration aspect
00:10:38,240 --> 00:10:40,680 is what created the complex intelligence.
00:10:40,680 --> 00:10:44,520 But it seems like it's a natural outgrowth of the evolutionary process.
00:10:44,520 --> 00:10:49,760 What could possibly be a magical thing that happened, like a rare thing that would say
00:10:49,760 --> 00:10:55,560 that humans are actually, human level intelligence is actually a really rare thing in the universe?
00:10:55,560 --> 00:11:01,160 Yeah, I'm hesitant to say that it is rare by the way, but it definitely seems like it's
00:11:01,160 --> 00:11:05,440 kind of like a punctuated equilibrium where you have lots of exploration and then you
00:11:05,440 --> 00:11:08,160 have certain leaps, sparse leaps in between.
00:11:08,160 --> 00:11:17,040 So of course like origin of life would be one, DNA, sex, eukaryotic life, the endosymbiosis
00:11:17,040 --> 00:11:20,740 event where the archaeon ate little bacteria, you know, just the whole thing.
00:11:20,740 --> 00:11:23,600 And then of course emergence of consciousness and so on.
00:11:23,600 --> 00:11:27,080 So it seems like definitely there are sparse events where massive amount of progress was
00:11:27,080 --> 00:11:29,200 made, but yeah, it's kind of hard to pick one.
00:11:29,200 --> 00:11:32,480 So you don't think humans are unique?
00:11:32,480 --> 00:11:37,160 Got to ask you how many intelligent alien civilizations do you think are out there?
00:11:37,160 --> 00:11:44,040 And is their intelligence different or similar to ours?
00:11:44,040 --> 00:11:45,920 Yeah.
00:11:45,920 --> 00:11:50,200 I've been preoccupied with this question quite a bit recently, basically the Fermi paradox
00:11:50,200 --> 00:11:52,000 and just thinking through.
00:11:52,000 --> 00:11:56,400 And the reason actually that I am very interested in the origin of life is fundamentally trying
00:11:56,400 --> 00:12:03,080 to understand how common it is that there are technological societies out there in space.
00:12:03,080 --> 00:12:09,000 And the more I study it, the more I think that there should be quite a few, quite a
00:12:09,000 --> 00:12:10,000 lot.
00:12:10,000 --> 00:12:11,000 Why haven't we heard from them?
00:12:11,000 --> 00:12:12,000 Because I agree with you.
00:12:12,000 --> 00:12:19,920 It feels like I just don't see why what we did here on earth is so difficult to do.
00:12:19,920 --> 00:12:20,920 Yeah.
00:12:20,920 --> 00:12:23,200 And especially when you get into the details of it, I used to think origin of life was
00:12:23,200 --> 00:12:31,080 very, it was this magical rare event, but then you read books like, for example, Nick Lane's
00:12:31,080 --> 00:12:34,540 The Vital Question, Life Ascending, et cetera.
00:12:34,540 --> 00:12:38,960 And he really gets in and he really makes you believe that this is not that rare.
00:12:38,960 --> 00:12:39,960 Basic chemistry.
00:12:39,960 --> 00:12:43,160 You have an active earth and you have your alkaline vents and you have lots of alkaline
00:12:43,160 --> 00:12:45,800 waters mixing with the ocean.
00:12:45,800 --> 00:12:49,360 And you have your proton gradients and you have little porous pockets of these alkaline
00:12:49,360 --> 00:12:51,960 vents that concentrate chemistry.
00:12:51,960 --> 00:12:55,720 And basically, as he steps through all of these little pieces, you start to understand
00:12:55,720 --> 00:12:58,920 that actually this is not that crazy.
00:12:58,920 --> 00:13:01,640 You could see this happen on other systems.
00:13:01,640 --> 00:13:06,520 And he really takes you from just a geology to primitive life.
00:13:06,520 --> 00:13:09,360 And he makes it feel like it's actually pretty plausible.
00:13:09,360 --> 00:13:15,400 And also, like, the origin of life was actually fairly fast after the formation of
00:13:15,400 --> 00:13:16,400 Earth.
00:13:16,400 --> 00:13:19,560 If I remember correctly, just a few hundred million years or something like that after
00:13:19,560 --> 00:13:22,560 basically when it was possible, life actually arose.
00:13:22,560 --> 00:13:26,160 And so that makes me feel like that is not the constraint, that is not the limiting variable
00:13:26,160 --> 00:13:29,860 and that life should actually be fairly common.
00:13:29,860 --> 00:13:35,440 And then, you know, where the drop offs are is very, is very interesting to think about.
00:13:35,440 --> 00:13:38,400 I currently think that there's no major drop offs, basically.
00:13:38,400 --> 00:13:40,380 And so there should be quite a lot of life.
00:13:40,380 --> 00:13:44,560 And basically, where that brings me to then is the only way to reconcile the fact that
00:13:44,560 --> 00:13:49,820 we haven't found anyone and so on is that we just can't, we can't see them.
00:13:49,820 --> 00:13:50,820 We can't observe them.
00:13:50,820 --> 00:13:52,040 Just a quick brief comment.
00:13:52,040 --> 00:13:56,720 Nick Lane and a lot of biologists I talked to, they really seem to think that the jump
00:13:56,720 --> 00:14:01,200 from bacteria to more complex organisms is the hardest jump.
00:14:01,200 --> 00:14:02,760 The eukaryotic life, basically.
00:14:02,760 --> 00:14:03,760 Yeah.
00:14:03,760 --> 00:14:04,920 Which I don't, I get it.
00:14:04,920 --> 00:14:10,720 They're much more knowledgeable than me about like the intricacies of biology.
00:14:10,720 --> 00:14:16,840 But that seems crazy because how many single-cell organisms are there?
00:14:16,840 --> 00:14:21,480 And how much time do you have? Surely it's not that difficult.
00:14:21,480 --> 00:14:26,360 Like in a billion years is not even that long of a time, really.
00:14:26,360 --> 00:14:30,820 Just all these bacteria under constrained resources battling it out, I'm sure they can
00:14:30,820 --> 00:14:31,820 invent more complex.
00:14:31,820 --> 00:14:37,600 Like I don't understand, it's like how to move from a hello world program to like invent
00:14:37,600 --> 00:14:39,040 a function or something like that.
00:14:39,040 --> 00:14:40,040 I don't.
00:14:40,040 --> 00:14:41,040 Yeah.
00:14:41,040 --> 00:14:43,120 And so I don't, yeah, so I'm with you.
00:14:43,120 --> 00:14:48,040 I just feel like I don't see any, if the origin of life, that would be my intuition.
00:14:48,040 --> 00:14:49,040 That's the hardest thing.
00:14:49,040 --> 00:14:52,600 But if that's not the hardest thing because it happened so quickly, then it's got to be
00:14:52,600 --> 00:14:53,600 everywhere.
00:14:53,600 --> 00:14:55,240 And yeah, maybe we're just too dumb to see it.
00:14:55,240 --> 00:14:58,080 Well, it's just, we don't have really good mechanisms for seeing this life.
00:14:58,080 --> 00:15:05,720 I mean, by what, how, so I'm not an expert just to preface this, but just from, I want
00:15:05,720 --> 00:15:09,040 to meet an expert on alien intelligence and how to communicate.
00:15:09,040 --> 00:15:12,840 I'm very suspicious of our ability to find these intelligences out there and to find
00:15:12,840 --> 00:15:13,840 these Earths.
00:15:13,840 --> 00:15:16,480 Like radio waves, for example, are terrible.
00:15:16,480 --> 00:15:19,280 Their power drops off as basically one over r squared.
00:15:19,280 --> 00:15:25,320 So I remember reading that our current radio waves would not be, the ones that we are broadcasting
00:15:25,320 --> 00:15:30,320 would not be measurable by our devices today, only like, was it like one-tenth of a light
00:15:30,320 --> 00:15:31,320 year away?
00:15:31,320 --> 00:15:36,240 Like not even basically tiny distance because you really need like a targeted transmission
00:15:36,240 --> 00:15:40,840 of massive power directed somewhere for this to be picked up on long distances.
00:15:40,840 --> 00:15:45,000 And so I just think that our ability to measure is not amazing.
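For a rough sense of that falloff, here is a back-of-the-envelope inverse-square sketch; the transmit power below is an arbitrary assumption for illustration, not a figure from the conversation or from any survey.

```python
# Back-of-the-envelope inverse-square falloff for an isotropic radio broadcast.
import math

P_TX = 1e6                 # assumed broadcast power in watts (illustrative only)
LIGHT_YEAR_M = 9.46e15     # meters in one light-year

def flux(power_w, distance_m):
    """Received power per unit area at a distance: P / (4 * pi * r^2)."""
    return power_w / (4.0 * math.pi * distance_m ** 2)

for d_ly in (0.1, 1.0, 10.0, 100.0):
    print(f"{d_ly:6.1f} ly -> {flux(P_TX, d_ly * LIGHT_YEAR_M):.3e} W/m^2")
# Every factor of 10 in distance costs a factor of 100 in received flux.
```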
00:15:45,000 --> 00:15:47,180 I think there's probably other civilizations out there.
00:15:47,180 --> 00:15:50,280 And then the big question is why don't they build von Neumann probes and why don't they interstellar
00:15:50,280 --> 00:15:52,640 travel across the entire galaxy?
00:15:52,640 --> 00:15:56,600 And my current answer is it's probably interstellar travel is like really hard.
00:15:56,600 --> 00:15:57,680 You have the interstellar medium.
00:15:57,680 --> 00:16:00,800 If you want to move at close to the speed of light, you're going to be encountering bullets along
00:16:00,800 --> 00:16:06,400 the way because even like tiny hydrogen atoms and little particles of dust basically
00:16:06,400 --> 00:16:09,600 have like massive kinetic energy at those speeds.
00:16:09,600 --> 00:16:11,540 And so basically you need some kind of shielding.
00:16:11,540 --> 00:16:13,980 You need, you have all the cosmic radiation.
00:16:13,980 --> 00:16:15,160 It's just like brutal out there.
00:16:15,160 --> 00:16:16,160 It's really hard.
00:16:16,160 --> 00:16:20,400 And so my thinking is maybe interstellar travel is just extremely hard and you have to be
00:16:20,400 --> 00:16:21,400 very slow.
00:16:21,400 --> 00:16:24,240 To build hard.
00:16:24,240 --> 00:16:28,600 It feels like, it feels like we're not a billion years away from doing that.
00:16:28,600 --> 00:16:32,840 It just might be that it's very, you have to go very slowly potentially as an example
00:16:32,840 --> 00:16:33,840 through space.
00:16:33,840 --> 00:16:34,840 Right.
00:16:34,840 --> 00:16:36,720 As opposed to close to the speed of light.
00:16:36,720 --> 00:16:40,320 So I'm suspicious basically of our ability to measure life and I'm suspicious of the
00:16:40,320 --> 00:16:44,640 ability to just permeate all of space in the galaxy or across galaxies.
00:16:44,640 --> 00:16:48,000 And that's the only way that I can currently see a way around it.
00:16:48,000 --> 00:16:53,800 Yeah, it's kind of mind blowing to think that there's trillions of intelligent alien civilizations
00:16:53,800 --> 00:16:59,080 out there kind of slowly traveling through space to meet each other.
00:16:59,080 --> 00:17:04,160 And some of them meet, some of them go to war, some of them collaborate.
00:17:04,160 --> 00:17:06,400 Or they're all just independent.
00:17:06,400 --> 00:17:08,520 They're all just like little pockets.
00:17:08,520 --> 00:17:14,720 Well statistically, if there's like, if there's trillions of them, surely some of them, some
00:17:14,720 --> 00:17:16,360 of the pockets are close enough together.
00:17:16,360 --> 00:17:17,860 Some of them happen to be close, yeah.
00:17:17,860 --> 00:17:19,680 And close enough to see each other.
00:17:19,680 --> 00:17:26,240 And then once you see, once you see something that is definitely complex life, like if we
00:17:26,240 --> 00:17:32,140 see something, we're probably going to be severely, like intensely, aggressively motivated
00:17:32,140 --> 00:17:35,200 to figure out what the hell that is and try to meet them.
00:17:35,200 --> 00:17:41,560 But what would be your first instinct to try to, like at a generational level, meet them
00:17:41,560 --> 00:17:44,600 or defend against them?
00:17:44,600 --> 00:17:52,080 Or what would be your instinct as a president of the United States and a scientist?
00:17:52,080 --> 00:17:55,080 I don't know which hat you prefer in this question.
00:17:55,080 --> 00:17:59,840 Yeah, I think the question, it's really hard.
00:17:59,840 --> 00:18:05,680 I will say like, for example, for us, we have lots of primitive life forms on earth.
00:18:05,680 --> 00:18:09,600 Next to us, we have all kinds of ants and everything else and we share space with them.
00:18:09,600 --> 00:18:15,400 And we are hesitant to impact on them and we're trying to protect them by default because
00:18:15,400 --> 00:18:19,540 they are amazing, interesting, dynamical systems that took a long time to evolve and they are
00:18:19,540 --> 00:18:20,760 interesting and special.
00:18:20,760 --> 00:18:26,120 And I don't know that you want to destroy that by default.
00:18:26,120 --> 00:18:31,480 And so I like complex dynamical systems that took a lot of time to evolve.
00:18:31,480 --> 00:18:37,040 I think I'd like to preserve it if I can afford to.
00:18:37,040 --> 00:18:41,600 And I'd like to think that the same would be true about the galactic resources and that
00:18:41,600 --> 00:18:45,160 they would think that we're kind of an incredible, interesting story that took time.
00:18:45,160 --> 00:18:49,040 It took a few billion years to unravel and you don't want to just destroy it.
00:18:49,040 --> 00:18:54,580 I could see two aliens talking about earth right now and saying, I'm a big fan of complex
00:18:54,580 --> 00:18:55,680 dynamical systems.
00:18:55,680 --> 00:19:01,600 So I think there's value to preserve these, and we basically are a video game they watch
00:19:01,600 --> 00:19:03,760 or a show, a TV show that they watch.
00:19:03,760 --> 00:19:08,920 Yeah, I think you would need like a very good reason, I think, to destroy it.
00:19:08,920 --> 00:19:10,640 Like why don't we destroy these ant farms and so on?
00:19:10,640 --> 00:19:14,720 It's because we're not actually like really in direct competition with them right now.
00:19:14,720 --> 00:19:19,620 We do it accidentally and so on, but there's plenty of resources.
00:19:19,620 --> 00:19:22,280 And so why would you destroy something that is so interesting and precious?
00:19:22,280 --> 00:19:25,800 Well, from a scientific perspective, you might probe it.
00:19:25,800 --> 00:19:27,640 You might interact with it lightly.
00:19:27,640 --> 00:19:29,620 You might want to learn something from it.
00:19:29,620 --> 00:19:34,280 So I wonder, there could be certain physical phenomena that we think is a physical phenomena,
00:19:34,280 --> 00:19:38,360 but it's actually interacting with us to like poke the finger and see what happens.
00:19:38,360 --> 00:19:43,240 I think it should be very interesting to scientists, other alien scientists, what happened here.
00:19:43,240 --> 00:19:47,400 And you know, what we're seeing today is a snapshot, basically, it's a result of a huge
00:19:47,400 --> 00:19:52,320 amount of computation over like billion years or something like that.
00:19:52,320 --> 00:19:55,080 So it could have been initiated by aliens.
00:19:55,080 --> 00:19:57,560 This could be a computer running a program.
00:19:57,560 --> 00:20:02,280 Like when, okay, if you had the power to do this, when you, okay, for sure, at least I
00:20:02,280 --> 00:20:07,920 would, I would pick an earth-like planet that has the conditions based on my understanding
00:20:07,920 --> 00:20:10,700 of the chemistry prerequisites for life.
00:20:10,700 --> 00:20:13,240 And I would seed it with life and run it.
00:20:13,240 --> 00:20:14,240 Right?
00:20:14,240 --> 00:20:19,200 Like wouldn't you 100% do that and observe it and then protect?
00:20:19,200 --> 00:20:24,760 I mean, that's not just a hell of a good TV show, it's a good scientific experiment.
00:20:24,760 --> 00:20:30,180 And it's physical simulation, right?
00:20:30,180 --> 00:20:39,400 Maybe the evolution is the most, like actually running it is the most efficient way to understand
00:20:39,400 --> 00:20:41,280 computation or to compute stuff.
00:20:41,280 --> 00:20:46,300 Or to understand life or, you know, what life looks like and what branches it can take.
00:20:46,300 --> 00:20:51,260 It does make me kind of feel weird that we're part of a science experiment, but maybe it's,
00:20:51,260 --> 00:20:53,040 everything's a science experiment.
00:20:53,040 --> 00:20:56,320 Does that change anything for us if we're a science experiment?
00:20:56,320 --> 00:20:58,400 I don't know.
00:20:58,400 --> 00:21:01,800 Two descendants of apes talking about being inside of a science experiment.
00:21:01,800 --> 00:21:06,760 I'm suspicious of this idea of like a deliberate panspermia as you described it sort of, and
00:21:06,760 --> 00:21:11,200 I don't see a divine intervention in some way in the historical record right now.
00:21:11,200 --> 00:21:16,520 I do feel like the story in these books, like Nick Lane's books and so on sort of makes
00:21:16,520 --> 00:21:20,980 sense and it makes sense how life arose on earth uniquely.
00:21:20,980 --> 00:21:25,320 And yeah, I don't need to reach for more exotic explanations right now.
00:21:25,320 --> 00:21:26,320 Sure.
00:21:26,320 --> 00:21:32,440 But NPCs inside a video game don't observe any divine intervention either.
00:21:32,440 --> 00:21:35,600 We might just be all NPCs running a kind of code.
00:21:35,600 --> 00:21:36,600 Maybe eventually they will.
00:21:36,600 --> 00:21:41,000 Currently NPCs are really dumb, but once they're running GPTs, maybe they will be like, Hey,
00:21:41,000 --> 00:21:42,000 this is really suspicious.
00:21:42,000 --> 00:21:43,560 What the hell?
00:21:43,560 --> 00:21:49,560 So you famously tweeted, it looks like if you bombard earth with photons for a while,
00:21:49,560 --> 00:21:51,760 you can emit a roadster.
00:21:51,760 --> 00:21:56,920 So if like in Hitchhiker's Guide to the Galaxy, we would summarize the story of earth.
00:21:56,920 --> 00:22:00,560 So in that book, it's mostly harmless.
00:22:00,560 --> 00:22:06,120 What do you think is all the possible stories, like a paragraph long or a sentence long that
00:22:06,120 --> 00:22:08,720 earth could be summarized as?
00:22:08,720 --> 00:22:10,760 Once it's done its computation.
00:22:10,760 --> 00:22:18,480 So like all the possible stories, if earth is a book, right?
00:22:18,480 --> 00:22:19,480 Eventually there has to be an ending.
00:22:19,480 --> 00:22:23,200 I mean, there's going to be an end to earth and it could end in all kinds of ways.
00:22:23,200 --> 00:22:24,200 It can end soon.
00:22:24,200 --> 00:22:25,200 It can end later.
00:22:25,200 --> 00:22:27,040 What do you think are the possible stories?
00:22:27,040 --> 00:22:32,880 Well, definitely there seems to be, yeah, you're sort of, it's pretty incredible that
00:22:32,880 --> 00:22:38,240 these self replicating systems will basically arise from the dynamics and then they perpetuate
00:22:38,240 --> 00:22:42,280 themselves and become more complex and eventually become conscious and build a society.
00:22:42,280 --> 00:22:48,920 And I kind of feel like in some sense, it's kind of like a deterministic wave that kind
00:22:48,920 --> 00:22:53,360 of just like happens on any sufficiently well-arranged system like earth.
00:22:53,360 --> 00:22:58,600 And so I kind of feel like there's a certain sense of inevitability in it.
00:22:58,600 --> 00:22:59,600 And it's really beautiful.
00:22:59,600 --> 00:23:01,020 And it ends somehow, right?
00:23:01,020 --> 00:23:12,240 So it's a chemically diverse environment where complex dynamical systems can evolve and be
00:23:12,240 --> 00:23:19,760 kind of become more and more further and further complex, but then there's a certain, what
00:23:19,760 --> 00:23:20,760 is it?
00:23:20,760 --> 00:23:22,240 There's certain terminating conditions.
00:23:22,240 --> 00:23:23,240 Yeah.
00:23:23,240 --> 00:23:26,680 I don't know what the terminating conditions are, but definitely there's a trend line of
00:23:26,680 --> 00:23:28,380 something and we're part of that story.
00:23:28,380 --> 00:23:30,800 And like, where does that, where does it go?
00:23:30,800 --> 00:23:35,920 So we're famously described often as a biological bootloader for AIs and that's because humans,
00:23:35,920 --> 00:23:42,720 I mean, we're an incredible biological system and we're capable of computation and love
00:23:42,720 --> 00:23:46,360 and so on, but we're extremely inefficient as well.
00:23:46,360 --> 00:23:47,960 Like we're talking to each other through audio.
00:23:47,960 --> 00:23:52,520 It's just kind of embarrassing, honestly, that we're manipulating like seven symbols,
00:23:52,520 --> 00:23:55,240 serially, we're using vocal cords.
00:23:55,240 --> 00:23:57,760 It's all happening over like multiple seconds.
00:23:57,760 --> 00:24:02,840 It's just like kind of embarrassing when you step down to the frequencies at which computers
00:24:02,840 --> 00:24:05,320 operate or are able to communicate at.
00:24:05,320 --> 00:24:11,040 And so basically it does seem like synthetic intelligences are kind of like the next stage
00:24:11,040 --> 00:24:12,640 of development.
00:24:12,640 --> 00:24:18,480 And I don't know where it leads to, like at some point I suspect the universe is some
00:24:18,480 --> 00:24:26,920 kind of a puzzle and these synthetic AIs will uncover that puzzle and solve it.
00:24:26,920 --> 00:24:28,800 And then what happens after, right?
00:24:28,800 --> 00:24:33,840 Like what, cause if you just like fast forward earth, many billions of years, it's like,
00:24:33,840 --> 00:24:38,520 it's quiet and then it's like terminal, you see like city lights and stuff like that.
00:24:38,520 --> 00:24:44,200 And then what happens at like at the end, like, is it like a, or is it like a calming?
00:24:44,200 --> 00:24:45,620 Is it explosion?
00:24:45,620 --> 00:24:50,880 Is it like earth, like open, like a giant, cause you said emit Roadsters, like will it
00:24:50,880 --> 00:24:56,120 start emitting like, like a giant number of like satellites?
00:24:56,120 --> 00:24:57,120 Yes.
00:24:57,120 --> 00:25:00,800 It's some kind of a crazy explosion and we're living, we're like, we're stepping through
00:25:00,800 --> 00:25:04,840 an explosion and we're like living day to day and it doesn't look like it, but it's actually,
00:25:04,840 --> 00:25:09,200 if you, I saw a very cool animation of earth and life on earth and basically nothing happens
00:25:09,200 --> 00:25:10,200 for a long time.
00:25:10,200 --> 00:25:14,800 And then the last like two seconds, like basically cities and everything and just, and the lower
00:25:14,800 --> 00:25:17,680 orbit just gets cluttered and just the whole thing happens in the last two seconds and
00:25:17,680 --> 00:25:18,680 you're like, this is exploding.
00:25:18,680 --> 00:25:19,680 This is a state of explosion.
00:25:19,680 --> 00:25:27,640 So if you play, yeah, yeah, if you play at a normal speed, it'll just look like an explosion.
00:25:27,640 --> 00:25:28,640 It's a firecracker.
00:25:28,640 --> 00:25:30,680 We're living in a firecracker.
00:25:30,680 --> 00:25:33,880 That's going to start emitting all kinds of interesting things.
00:25:33,880 --> 00:25:39,200 And then the, so explosion doesn't, it might actually look like a little explosion with,
00:25:39,200 --> 00:25:42,200 with lights and fire and energy emitted, all that kind of stuff.
00:25:42,200 --> 00:25:48,200 But when you look inside the details of the explosion, there's actual complexity happening
00:25:48,200 --> 00:25:52,180 where there's like a, yeah, human life or some kind of life.
00:25:52,180 --> 00:25:53,820 We hope it's not a destructive firecracker.
00:25:53,820 --> 00:25:56,800 It's kind of like a constructive firecracker.
00:25:56,800 --> 00:26:02,120 All right, so given that, I think a hilarious discussion, it is really interesting to think
00:26:02,120 --> 00:26:05,840 about like what the puzzle of the universe is, did the creator of the universe give us
00:26:05,840 --> 00:26:06,840 a message?
00:26:06,840 --> 00:26:12,200 Like for example, in the book Contact, Carl Sagan, there's a message for humanity, for
00:26:12,200 --> 00:26:18,360 any civilization in the digits, in the expansion of PI and base 11 eventually, which is kind
00:26:18,360 --> 00:26:23,300 of interesting thought, maybe, maybe we're supposed to be giving a message to our creator.
00:26:23,300 --> 00:26:27,120 Maybe we're supposed to somehow create some kind of a quantum mechanical system that alerts
00:26:27,120 --> 00:26:29,880 them to our intelligent presence here.
00:26:29,880 --> 00:26:34,240 Because if you think about it from their perspective, it's just say like quantum field theory, massive
00:26:34,240 --> 00:26:36,880 like cellular autonomous like thing.
00:26:36,880 --> 00:26:38,640 And like, how do you even notice that we exist?
00:26:38,640 --> 00:26:42,260 You might not even be able to pick us up in that simulation.
00:26:42,260 --> 00:26:45,960 And so how do you, how do you prove that you exist, that you're intelligent and that you're
00:26:45,960 --> 00:26:47,640 a part of the universe?
00:26:47,640 --> 00:26:50,080 So this is like a Turing test for intelligence from earth.
00:26:50,080 --> 00:26:55,640 Like the creator is, I mean, maybe this is like trying to complete the next word in a
00:26:55,640 --> 00:26:56,640 sentence.
00:26:56,640 --> 00:26:57,640 This is a complicated way of that.
00:26:57,640 --> 00:27:00,760 Like earth is just, is basically sending a message back.
00:27:00,760 --> 00:27:01,760 Yeah.
00:27:01,760 --> 00:27:04,600 The puzzle is basically like alerting the creator that we exist.
00:27:04,600 --> 00:27:08,800 Or maybe the puzzle is just to just break out of the system and just, you know, stick
00:27:08,800 --> 00:27:10,740 it to the creator in some way.
00:27:10,740 --> 00:27:15,560 Basically like if you're playing a video game, you can, you can somehow find an exploit and
00:27:15,560 --> 00:27:19,940 find a way to execute on the host machine, any arbitrary code.
00:27:19,940 --> 00:27:24,480 There's some, for example, I believe someone got a Mario, a game of Mario to play Pong
00:27:24,480 --> 00:27:31,160 just by exploiting it and then basically writing code and being
00:27:31,160 --> 00:27:33,340 able to execute arbitrary code in the game.
00:27:33,340 --> 00:27:38,680 And so maybe we should be, maybe that's the puzzle, is that we should find a way to
00:27:38,680 --> 00:27:39,680 exploit it.
00:27:39,680 --> 00:27:42,600 So, so I think like some of these synthetic AIs will eventually find the universe to
00:27:42,600 --> 00:27:45,360 be some kind of a puzzle and then solve it in some way.
00:27:45,360 --> 00:27:47,600 And that's kind of like the end game somehow.
00:27:47,600 --> 00:27:51,560 Do you often think about it as a, as a simulation?
00:27:51,560 --> 00:27:57,800 So, as in, the universe being a kind of computation that might have bugs and exploits?
00:27:57,800 --> 00:27:58,800 Yes.
00:27:58,800 --> 00:27:59,800 Yeah, I think so.
00:27:59,800 --> 00:28:01,160 Is that what physics is essentially?
00:28:01,160 --> 00:28:05,140 I think it's possible that physics has exploits and we should be trying to find them, arranging
00:28:05,140 --> 00:28:09,840 some kind of a crazy quantum mechanical system that somehow gives you buffer overflow, somehow
00:28:09,840 --> 00:28:12,680 gives you a rounding error in the floating point.
00:28:12,680 --> 00:28:20,040 Yeah, that's right, and like more and more sophisticated exploits, like those are jokes,
00:28:20,040 --> 00:28:21,440 but that could be actually very close to reality.
00:28:21,440 --> 00:28:22,440 Yeah.
00:28:22,440 --> 00:28:23,440 We'll find some way to extract infinite energy.
00:28:23,440 --> 00:28:27,920 For example, when you train a reinforcement learning agents in physical simulations and
00:28:27,920 --> 00:28:32,320 you ask them to say, run quickly on the flat ground, they'll end up doing all kinds of
00:28:32,320 --> 00:28:35,240 like weird things in part of that optimization, right?
00:28:35,240 --> 00:28:38,840 They'll get on their back leg and they'll slide across the floor.
00:28:38,840 --> 00:28:43,120 And it's because the optimization, the reinforcement learning optimization on that agent has figured
00:28:43,120 --> 00:28:47,480 out a way to extract infinite energy from the friction forces and basically their poor
00:28:47,480 --> 00:28:51,280 implementation and they found a way to generate infinite energy and just slide across the
00:28:51,280 --> 00:28:52,280 surface.
00:28:52,280 --> 00:28:53,280 And it's not what you expected.
00:28:53,280 --> 00:28:56,320 It's just sort of like a perverse solution.
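As a hedged toy illustration of the kind of simulator flaw such an optimizer latches onto (not the actual incident, which involved a full physics engine), here a sign error in a friction term turns it into a source of free energy that a speed-maximizing agent would learn to exploit:

```python
# Toy 1-D point mass with a deliberately buggy friction term, to illustrate the
# kind of simulator flaw an optimizer can exploit as a free energy source.
# Purely illustrative; constants and setup are arbitrary assumptions.

def rollout(buggy, a=1.0, mu=0.5, dt=0.01, steps=4000):
    """Push with constant force a; return the final speed reached."""
    v = 0.0
    for _ in range(steps):
        friction = (+mu * v) if buggy else (-mu * v)  # bug: "friction" feeds energy in
        v += (a + friction) * dt                      # simple Euler step
    return v

print("correct friction, final speed:", rollout(buggy=False))  # settles near a/mu = 2
print("buggy friction, final speed:  ", rollout(buggy=True))   # grows without bound
```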
00:28:56,320 --> 00:28:58,160 And so maybe we can find something like that.
00:28:58,160 --> 00:29:03,240 Maybe we can be that little dog in this physical simulation.
00:29:03,240 --> 00:29:09,720 That cracks or escapes the intended consequences of the physics that the universe came up with.
00:29:09,720 --> 00:29:12,280 We'll figure out some kind of shortcut to some weirdness.
00:29:12,280 --> 00:29:17,800 And then, but see the problem with that weirdness is the first person to discover the weirdness,
00:29:17,800 --> 00:29:23,200 like sliding on the back legs, that's all we're going to do,
00:29:23,200 --> 00:29:27,000 very quickly, because everybody does that thing.
00:29:27,000 --> 00:29:33,960 So the paperclip maximizer is a ridiculous idea, but that very well could be what happens:
00:29:33,960 --> 00:29:38,160 we'll just all switch to that because it's so fun.
00:29:38,160 --> 00:29:39,960 Well no person will discover it, I think, by the way.
00:29:39,960 --> 00:29:46,800 I think it's going to have to be some kind of a super intelligent AGI of a third generation.
00:29:46,800 --> 00:29:50,800 Like we're building the first generation AGI.
00:29:50,800 --> 00:29:51,800 Third generation.
00:29:51,800 --> 00:30:00,240 Yeah, so the bootloader for an AI, that AI will be a bootloader for another AI.
00:30:00,240 --> 00:30:04,240 And then there's no way for us to introspect like what that might even be.
00:30:04,240 --> 00:30:07,600 I think it's very likely that these things, for example, like say you have these AGIs,
00:30:07,600 --> 00:30:10,480 it's very likely, for example, they will be completely inert.
00:30:10,480 --> 00:30:14,160 I like these kinds of sci-fi books sometimes where these things are just completely inert.
00:30:14,160 --> 00:30:15,640 They don't interact with anything.
00:30:15,640 --> 00:30:20,600 And I find that kind of beautiful because they've probably figured out the meta game
00:30:20,600 --> 00:30:22,920 of the universe in some way, potentially.
00:30:22,920 --> 00:30:26,320 They're doing something completely beyond our imagination.
00:30:26,320 --> 00:30:31,360 And they don't interact with simple chemical lifeforms, like why would you do that?
00:30:31,360 --> 00:30:33,680 So I find those kinds of ideas compelling.
00:30:33,680 --> 00:30:34,960 What's their source of fun?
00:30:34,960 --> 00:30:35,960 What are they doing?
00:30:35,960 --> 00:30:36,960 What's the source of pleasure?
00:30:36,960 --> 00:30:39,400 Well, probably puzzle solving in the universe.
00:30:39,400 --> 00:30:40,400 But inert.
00:30:40,400 --> 00:30:46,360 So can you define what it means to be inert? So they escape interaction with physical reality?
00:30:46,360 --> 00:30:55,760 As in they will behave in some very strange way to us because they're beyond, they're
00:30:55,760 --> 00:30:57,280 playing the meta game.
00:30:57,280 --> 00:31:00,640 And the meta game is probably say like arranging quantum mechanical systems in some very weird
00:31:00,640 --> 00:31:07,120 ways to extract infinite energy, solve the digital expansion of pi to whatever amount.
00:31:07,120 --> 00:31:10,760 They will build their own like little fusion reactors or something crazy.
00:31:10,760 --> 00:31:15,160 Like they're doing something beyond comprehension and not understandable to us and actually
00:31:15,160 --> 00:31:17,200 brilliant under the hood.
00:31:17,200 --> 00:31:24,360 What if quantum mechanics itself is the system and we're just thinking it's physics, but
00:31:24,360 --> 00:31:29,520 we're really parasites on, not parasite, we're not really hurting physics.
00:31:29,520 --> 00:31:34,960 We're just living on this organism and we're like trying to understand it.
00:31:34,960 --> 00:31:38,440 But really it is an organism and with a deep, deep intelligence.
00:31:38,440 --> 00:31:46,800 Maybe physics itself is the organism that's doing the super interesting thing.
00:31:46,800 --> 00:31:52,120 And we're just like one little thing, ant sitting on top of it, trying to get energy
00:31:52,120 --> 00:31:53,120 from it.
00:31:53,120 --> 00:31:56,320 We're just kind of like these particles in the wave that I feel like is mostly deterministic
00:31:56,320 --> 00:32:02,400 and takes a universe from some kind of a big bang to some kind of a super intelligent replicator,
00:32:02,400 --> 00:32:06,600 some kind of a stable point in the universe, given these laws of physics.
00:32:06,600 --> 00:32:10,960 You don't think, as Einstein said, God doesn't play dice.
00:32:10,960 --> 00:32:12,800 So you think it's mostly deterministic.
00:32:12,800 --> 00:32:13,800 There's no randomness in the thing.
00:32:13,800 --> 00:32:14,800 I think it's deterministic.
00:32:14,800 --> 00:32:18,280 Oh, there's tons of, well, I want to be careful with randomness.
00:32:18,280 --> 00:32:19,280 Pseudo random?
00:32:19,280 --> 00:32:20,280 Yeah.
00:32:20,280 --> 00:32:21,280 I don't like random.
00:32:21,280 --> 00:32:23,840 I think maybe the laws of physics are deterministic.
00:32:23,840 --> 00:32:24,840 Yeah.
00:32:24,840 --> 00:32:25,840 I think they're deterministic.
00:32:25,840 --> 00:32:29,320 You just got really uncomfortable with this question.
00:32:29,320 --> 00:32:35,040 Do you have anxiety about whether the universe is random or not?
00:32:35,040 --> 00:32:36,040 There's no randomness.
00:32:36,040 --> 00:32:38,160 You said you like Good Will Hunting.
00:32:38,160 --> 00:32:39,160 It's not your fault, Andrej.
00:32:39,160 --> 00:32:42,840 It's not your fault, man.
00:32:42,840 --> 00:32:44,720 So you don't like randomness?
00:32:44,720 --> 00:32:45,720 Yeah.
00:32:45,720 --> 00:32:47,000 I think it's unsettling.
00:32:47,000 --> 00:32:48,800 I think it's a deterministic system.
00:32:48,800 --> 00:32:53,040 I think that things that look random, like say the collapse of the wave function, et
00:32:53,040 --> 00:32:57,640 cetera, I think they're actually deterministic, just entanglement and so on, and some kind
00:32:57,640 --> 00:32:59,640 of a multi-verse theory, something, something.
00:32:59,640 --> 00:33:00,640 Okay.
00:33:00,640 --> 00:33:03,000 So why does it feel like we have a free will?
00:33:03,000 --> 00:33:10,440 Like if I raised a hand, I chose to do this now.
00:33:10,440 --> 00:33:12,360 That doesn't feel like a deterministic thing.
00:33:12,360 --> 00:33:14,600 It feels like I'm making a choice.
00:33:14,600 --> 00:33:15,600 It feels like it.
00:33:15,600 --> 00:33:16,600 Okay.
00:33:16,600 --> 00:33:17,600 So it's all feelings.
00:33:17,600 --> 00:33:18,600 It's just feelings.
00:33:18,600 --> 00:33:19,600 Yeah.
00:33:19,600 --> 00:33:26,080 So when an RL agent is making a choice, it's not really making a choice.
00:33:26,080 --> 00:33:27,720 The choice is already there.
00:33:27,720 --> 00:33:28,720 Yeah.
00:33:28,720 --> 00:33:32,000 You're interpreting the choice and you're creating a narrative for having made it.
00:33:32,000 --> 00:33:33,000 Yeah.
00:33:33,000 --> 00:33:34,000 And now we're talking about the narrative.
00:33:34,000 --> 00:33:35,520 It's very meta.
00:33:35,520 --> 00:33:41,200 Looking back, what is the most beautiful or surprising idea in deep learning or AI in
00:33:41,200 --> 00:33:43,360 general that you've come across?
00:33:43,360 --> 00:33:47,920 You've seen this field explode and grow in interesting ways.
00:33:47,920 --> 00:33:55,640 Just what, what cool ideas, like, made you sit back and go, hmm, big or small?
00:33:55,640 --> 00:34:01,160 Well the one that I've been thinking about recently, the most probably is the, the transformer
00:34:01,160 --> 00:34:03,500 architecture.
00:34:03,500 --> 00:34:08,280 So basically, in neural networks, a lot of architectures that were trendy have come and
00:34:08,280 --> 00:34:13,160 gone for different sensory modalities, like for vision, audio, text, you would process
00:34:13,160 --> 00:34:14,920 them with different looking neural nets.
00:34:14,920 --> 00:34:19,420 And recently we've seen this convergence towards one architecture, the transformer,
00:34:19,420 --> 00:34:23,520 and you can feed it video or you can feed it, you know, images or speech or text, and
00:34:23,520 --> 00:34:24,520 it just gobbles it up.
00:34:24,520 --> 00:34:30,380 And it's kind of like a bit of a general purpose computer that is also trainable and very efficient
00:34:30,380 --> 00:34:32,160 to run on our hardware.
00:34:32,160 --> 00:34:37,280 And so this paper came out in 2016, I want to say.
00:34:37,280 --> 00:34:38,280 Attention is all you need.
00:34:38,280 --> 00:34:39,400 Attention is all you need.
00:34:39,400 --> 00:34:46,840 You criticized the paper title in retrospect, that it didn't foresee the bigness
00:34:46,840 --> 00:34:48,880 of the impact that it was going to have.
00:34:48,880 --> 00:34:49,880 Yeah.
00:34:49,880 --> 00:34:52,560 I'm not sure if the authors were aware of the impact that that paper would go on to
00:34:52,560 --> 00:34:53,560 have.
00:34:53,560 --> 00:34:56,840 Probably they weren't, but I think they were aware of some of the motivations and design
00:34:56,840 --> 00:35:00,480 decisions behind the transformer and they chose not to, I think, expand on it in that
00:35:00,480 --> 00:35:01,480 way in the paper.
00:35:01,480 --> 00:35:06,840 And so I think they had an idea that there was more than just the surface of just like,
00:35:06,840 --> 00:35:09,200 oh, we're just doing translation and here's a better architecture.
00:35:09,200 --> 00:35:10,200 You're not just doing translation.
00:35:10,200 --> 00:35:13,880 This is like a really cool, differentiable, optimizable, efficient computer that you've
00:35:13,880 --> 00:35:15,120 proposed.
00:35:15,120 --> 00:35:18,440 And maybe they didn't have all of that foresight, but I think it's really interesting.
00:35:18,440 --> 00:35:23,780 Isn't it funny, sorry to interrupt, that that title is memeable, that they went for such
00:35:23,780 --> 00:35:25,520 a profound idea.
00:35:25,520 --> 00:35:29,120 They went with a, I don't think anyone used that kind of title before, right?
00:35:29,120 --> 00:35:30,320 Attention is all you need.
00:35:30,320 --> 00:35:31,320 Yeah.
00:35:31,320 --> 00:35:32,320 It's like a meme or something.
00:35:32,320 --> 00:35:33,320 Yeah.
00:35:33,320 --> 00:35:34,320 Isn't that funny?
00:35:34,320 --> 00:35:39,040 That one, like maybe if it was a more serious title, it wouldn't have the impact.
00:35:39,040 --> 00:35:42,520 Honestly I, yeah, there is an element of me that honestly agrees with you and prefers
00:35:42,520 --> 00:35:43,520 it this way.
00:35:43,520 --> 00:35:44,520 Yes.
00:35:44,520 --> 00:35:49,240 If it was too grand, it would over promise and then under deliver potentially.
00:35:49,240 --> 00:35:53,400 So you want to just meme your way to greatness.
00:35:53,400 --> 00:35:54,400 That should be a t-shirt.
00:35:54,400 --> 00:35:59,520 So you tweeted that Transformer is a magnificent neural network architecture because it is
00:35:59,520 --> 00:36:04,360 a general purpose, differentiable computer, it is simultaneously expressive in the forward
00:36:04,360 --> 00:36:11,840 pass, optimizable via backpropagation and gradient descent, and efficient, a high-parallelism compute
00:36:11,840 --> 00:36:12,840 graph.
00:36:12,840 --> 00:36:19,440 Can you discuss some of those details, expressive, optimizable, efficient for memory or in general,
00:36:19,440 --> 00:36:21,080 whatever comes to your heart?
00:36:21,080 --> 00:36:24,440 You want to have a general purpose computer that you can train on arbitrary problems,
00:36:24,440 --> 00:36:28,160 like say the task of next word prediction or detecting if there's a cat in an image
00:36:28,160 --> 00:36:29,960 or something like that.
00:36:29,960 --> 00:36:32,920 And you want to train this computer so you want to set its weights.
00:36:32,920 --> 00:36:37,780 And I think there's a number of design criteria that sort of overlap in the Transformer simultaneously
00:36:37,780 --> 00:36:39,020 that made it very successful.
00:36:39,020 --> 00:36:46,520 And I think the authors were kind of deliberately trying to make this a really powerful architecture.
00:36:46,520 --> 00:36:53,400 And so basically, it's very powerful in the forward pass because it's able to express
00:36:53,400 --> 00:36:58,040 very general computation as sort of something that looks like message passing.
00:36:58,040 --> 00:37:00,360 You have nodes and they all store vectors.
00:37:00,360 --> 00:37:04,780 And these nodes get to basically look at each other, at each other's vectors.
00:37:04,780 --> 00:37:09,240 And they get to communicate and basically nodes get to broadcast, hey, I'm looking for
00:37:09,240 --> 00:37:10,240 certain things.
00:37:10,240 --> 00:37:12,760 And then other nodes get to broadcast, hey, these are the things I have.
00:37:12,760 --> 00:37:13,760 Those are the keys and the values.
00:37:13,760 --> 00:37:15,000 So it's not just attention.
00:37:15,000 --> 00:37:16,000 Yeah, exactly.
00:37:16,000 --> 00:37:17,800 Transformer is much more than just the attention component.
00:37:17,800 --> 00:37:21,160 It's got many architectural pieces that went into it, the residual connection, the way
00:37:21,160 --> 00:37:22,160 it's arranged.
00:37:22,160 --> 00:37:27,040 There's a multi-layer perceptron in there, the way it's stacked and so on.
00:37:27,040 --> 00:37:30,160 But basically, there's a message passing scheme where nodes get to look at each other, decide
00:37:30,160 --> 00:37:33,000 what's interesting, and then update each other.
00:37:33,000 --> 00:37:37,920 And so I think when you get to the details of it, I think it's a very expressive function.
00:37:37,920 --> 00:37:40,980 So it can express lots of different types of algorithms in forward pass.
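A minimal sketch of that message-passing picture, stripped down to a single attention head with random projection matrices; the shapes and weights are arbitrary assumptions, and a real transformer block adds residual connections, normalization, a multi-layer perceptron, and multiple heads.

```python
# Minimal single-head self-attention as message passing: each node (token) emits a
# query ("what I'm looking for"), a key ("what I have"), and a value, then updates
# itself with a softmax-weighted sum of the other nodes' values.
import numpy as np

rng = np.random.default_rng(0)
T, d = 5, 16                                   # 5 tokens, 16-dim vectors
x = rng.normal(size=(T, d))                    # the nodes' current vectors

Wq, Wk, Wv = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))

q, k, v = x @ Wq, x @ Wk, x @ Wv               # queries, keys, values
scores = q @ k.T / np.sqrt(d)                  # how well each query matches each key
weights = np.exp(scores - scores.max(axis=1, keepdims=True))
weights /= weights.sum(axis=1, keepdims=True)  # softmax: who attends to whom
out = weights @ v                              # each node's update: weighted sum of values
print(out.shape)                               # (5, 16)
```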
00:37:40,980 --> 00:37:44,680 Not only that, but the way it's designed with the residual connections, layer normalizations,
00:37:44,680 --> 00:37:47,600 the softmax attention and everything, it's also optimizable.
00:37:47,600 --> 00:37:49,120 This is a really big deal.
00:37:49,120 --> 00:37:53,080 Because there's lots of computers that are powerful that you can't optimize, or they're
00:37:53,080 --> 00:37:56,520 not easy to optimize using the techniques that we have, which is backpropagation and gradient
00:37:56,520 --> 00:37:57,520 descent.
00:37:57,520 --> 00:37:59,960 These are first order methods, very simple optimizers, really.
00:37:59,960 --> 00:38:04,100 And so you also need it to be optimizable.
00:38:04,100 --> 00:38:06,660 And then lastly, you want it to run efficiently in our hardware.
00:38:06,660 --> 00:38:10,740 Our hardware is a massive throughput machine, like GPUs.
00:38:10,740 --> 00:38:13,220 They prefer lots of parallelism.
00:38:13,220 --> 00:38:16,280 So you don't want to do lots of sequential operations, you want to do a lot of operations
00:38:16,280 --> 00:38:17,280 in parallel.
00:38:17,280 --> 00:38:19,360 And the transformer is designed with that in mind as well.
00:38:19,360 --> 00:38:23,280 And so it's designed for our hardware, and it's designed to both be very expressive in
00:38:23,280 --> 00:38:26,480 a forward pass, but also very optimizable in the backward pass.
00:38:26,480 --> 00:38:32,160 And you said that the residual connections support a kind of ability to learn short algorithms
00:38:32,160 --> 00:38:37,600 fast and first, and then gradually extend them longer during training.
00:38:37,600 --> 00:38:39,640 What's the idea of learning short algorithms?
00:38:39,640 --> 00:38:40,640 Right.
00:38:40,640 --> 00:38:46,000 So basically a transformer is a series of blocks, right?
00:38:46,000 --> 00:38:48,760 And these blocks have attention and a little multi-layer perceptron.
00:38:48,760 --> 00:38:52,640 And so you go off into a block and you come back to this residual pathway, and then you
00:38:52,640 --> 00:38:53,640 go off and you come back.
00:38:53,640 --> 00:38:56,160 And then you have a number of layers arranged sequentially.
00:38:56,160 --> 00:39:00,160 And so the way to look at it, I think, is because of the residual pathway in the backward
00:39:00,160 --> 00:39:06,280 pass, the gradients sort of flow along it uninterrupted, because addition distributes
00:39:06,280 --> 00:39:08,480 the gradient equally to all of its branches.
00:39:08,480 --> 00:39:14,160 So the gradient from the supervision at the top just flows directly to the first layer.
00:39:14,160 --> 00:39:18,260 And all the residual connections are arranged so that in the beginning, during initialization,
00:39:18,260 --> 00:39:21,600 they contribute nothing to the residual pathway.
00:39:21,600 --> 00:39:26,960 So what it kind of looks like is, imagine the transformer is kind of like a Python function,
00:39:26,960 --> 00:39:28,240 like a def.
00:39:28,240 --> 00:39:32,300 And you get to do various kinds of lines of code.
00:39:32,300 --> 00:39:36,800 Say you have a hundred layers deep transformer, typically they would be much shorter, say
00:39:36,800 --> 00:39:37,800 20.
00:39:37,800 --> 00:39:39,640 You have 20 lines of code and you can do something in them.
00:39:39,640 --> 00:39:42,560 And so during the optimization, basically what it looks like is, first you optimize
00:39:42,560 --> 00:39:45,560 the first line of code, and then the second line of code can kick in, and the third line
00:39:45,560 --> 00:39:46,640 of code can kick in.
00:39:46,640 --> 00:39:51,020 And I kind of feel like because of the residual pathway and the dynamics of the optimization,
00:39:51,020 --> 00:39:54,680 you can sort of learn a very short algorithm that gets the approximate answer, but then
00:39:54,680 --> 00:39:57,880 the other layers can sort of kick in and start to create a contribution.
00:39:57,880 --> 00:40:02,800 And at the end of it, you're optimizing over an algorithm that is 20 lines of code, except
00:40:02,800 --> 00:40:05,960 these lines of code are very complex because it's an entire block of a transformer, you
00:40:05,960 --> 00:40:06,960 can do a lot in there.
00:40:06,960 --> 00:40:10,320 So what's really interesting is that this transformer architecture actually has been
00:40:10,320 --> 00:40:14,440 remarkably resilient, basically the transformer that came out in 2017 is the transformer you
00:40:14,440 --> 00:40:17,960 would use today, except you reshuffle some of the layer norms.
00:40:17,960 --> 00:40:22,040 The layer normalizations have been reshuffled to a pre-norm formulation.
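A hedged sketch (PyTorch, not any particular codebase) of one pre-norm block, showing the residual pathway being described: you "go off" into attention and the MLP and come back via addition, and the layer norms sit before each branch rather than after. Because the update is an addition, gradients from the loss flow straight along the residual stream; in practice the last layer of each branch is also often initialized near zero so the block starts out contributing almost nothing.

```python
import torch.nn as nn

class PreNormBlock(nn.Module):
    def __init__(self, dim, num_heads):
        super().__init__()
        self.ln1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x):                 # x: (batch, tokens, dim)
        h = self.ln1(x)                   # pre-norm: normalize before the branch
        a, _ = self.attn(h, h, h)         # "go off" into attention...
        x = x + a                         # ...and come back onto the residual pathway
        x = x + self.mlp(self.ln2(x))     # same for the multi-layer perceptron
        return x
```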
00:40:22,040 --> 00:40:25,560 And so it's been remarkably stable, but there's a lot of bells and whistles that people have
00:40:25,560 --> 00:40:27,960 attached to it and tried to improve it.
00:40:27,960 --> 00:40:32,660 I do think that basically it's a big step in simultaneously optimizing for lots of properties
00:40:32,660 --> 00:40:34,520 of a desirable neural network architecture.
00:40:34,520 --> 00:40:38,780 And I think people have been trying to change it, but it's proven remarkably resilient.
00:40:38,780 --> 00:40:42,000 But I do think that there should be even better architectures potentially.
00:40:42,000 --> 00:40:45,720 But you admire the resilience here.
00:40:45,720 --> 00:40:50,580 There's something profound about this architecture, that at least, maybe
00:40:50,580 --> 00:40:55,440 everything can be turned into a problem that transformers can solve.
00:40:55,440 --> 00:40:58,920 Currently it definitely looks like the transformer is taking over AI and you can feed basically
00:40:58,920 --> 00:41:00,240 arbitrary problems into it.
00:41:00,240 --> 00:41:05,600 And it's a general differentiable computer and it's extremely powerful and this convergence
00:41:05,600 --> 00:41:10,240 in AI has been really interesting to watch for me personally.
00:41:10,240 --> 00:41:13,040 What else do you think could be discovered here about transformers?
00:41:13,040 --> 00:41:18,800 Like what surprising thing, or is it at a stable place? I want a stable place.
00:41:18,800 --> 00:41:23,120 Is there something interesting we might discover about transformers like aha moments, maybe
00:41:23,120 --> 00:41:28,600 has to do with memory, maybe knowledge representation, that kind of stuff.
00:41:28,600 --> 00:41:32,840 Basically the zeitgeist today is just pushing, like basically right now the zeitgeist is
00:41:32,840 --> 00:41:35,960 do not touch the transformer, touch everything else.
00:41:35,960 --> 00:41:38,400 So people are scaling up the data sets, making them much, much bigger.
00:41:38,400 --> 00:41:41,640 They're working on the evaluation, making the evaluation much, much bigger.
00:41:41,640 --> 00:41:45,840 And they're basically keeping the architecture unchanged.
00:41:45,840 --> 00:41:51,000 And that's how we've, that's the last five years of progress in AI kind of.
00:41:51,000 --> 00:41:54,980 What do you think about one flavor of it, which is language models?
00:41:54,980 --> 00:41:58,800 Have you been surprised?
00:41:58,800 --> 00:42:03,260 Has your sort of imagination been captivated by, you mentioned GPT and all the bigger and
00:42:03,260 --> 00:42:12,560 bigger and bigger language models and what are the limits of those models do you think?
00:42:12,560 --> 00:42:16,200 So just the task of natural language.
00:42:16,200 --> 00:42:19,680 Basically the way GPT is trained, right, is you just download a massive amount of text
00:42:19,680 --> 00:42:24,800 data from the internet and you try to predict the next word in the sequence, roughly speaking.
00:42:24,800 --> 00:42:29,540 You're predicting little word chunks, but roughly speaking, that's it.
00:42:29,540 --> 00:42:33,360 And what's been really interesting to watch is basically it's a language model.
00:42:33,360 --> 00:42:36,680 Language models have actually existed for a very long time.
00:42:36,680 --> 00:42:39,400 There's papers on language modeling from 2003, even earlier.
00:42:39,400 --> 00:42:42,480 Can you explain in that case what a language model is?
00:42:42,480 --> 00:42:43,480 Yeah.
00:42:43,480 --> 00:42:47,720 So language model just basically the rough idea is just predicting the next word in a
00:42:47,720 --> 00:42:49,880 sequence, roughly speaking.
00:42:49,880 --> 00:42:54,960 So there's a paper from, for example, Bengio and the team from 2003, where for the first
00:42:54,960 --> 00:42:59,600 time they were using a neural network to take, say, like three or five words and predict
00:42:59,600 --> 00:43:01,880 the next word.
00:43:01,880 --> 00:43:03,840 And they're doing this on much smaller data sets.
00:43:03,840 --> 00:43:05,400 And the neural net is not a transformer.
00:43:05,400 --> 00:43:09,080 It's a multilayer perceptron, but it's the first time that a neural network has been
00:43:09,080 --> 00:43:10,420 applied in that setting.
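For concreteness, here is a small sketch (assuming PyTorch and a toy vocabulary, not the original 2003 code) of that kind of model: embed the previous few words, concatenate the embeddings, and predict the next word with a multilayer perceptron.

```python
import torch
import torch.nn as nn

class MLPLanguageModel(nn.Module):
    """Sketch of a Bengio-2003-style neural language model: embed the previous
    `context` words, concatenate, and predict the next word with an MLP."""
    def __init__(self, vocab_size, context=3, emb_dim=32, hidden=128):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.mlp = nn.Sequential(
            nn.Linear(context * emb_dim, hidden),
            nn.Tanh(),
            nn.Linear(hidden, vocab_size),   # logits over the whole vocabulary
        )

    def forward(self, idx):                  # idx: (batch, context) word indices
        e = self.emb(idx)                    # (batch, context, emb_dim)
        return self.mlp(e.flatten(1))        # (batch, vocab_size)

# trained with cross-entropy on (previous words -> next word) pairs, e.g.:
# loss = nn.functional.cross_entropy(model(prev_words), next_word)
```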
00:43:10,420 --> 00:43:16,120 But even before neural networks, there were language models, except they were using n-gram
00:43:16,120 --> 00:43:17,120 models.
00:43:17,120 --> 00:43:19,920 N-gram models are just count-based models.
00:43:19,920 --> 00:43:26,120 So if you try to take two words and predict the third one, you just count up how many
00:43:26,120 --> 00:43:30,200 times you've seen any two-word combinations and what came next.
00:43:30,200 --> 00:43:33,080 And what you predict that's coming next is just what you've seen the most of in the training
00:43:33,080 --> 00:43:34,360 set.
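A count-based trigram model really is just bookkeeping, something like this toy sketch in plain Python (the training sentence is made up for illustration):

```python
from collections import Counter, defaultdict

counts = defaultdict(Counter)   # (word1, word2) -> Counter of what came next

def train(words):
    for a, b, c in zip(words, words[1:], words[2:]):
        counts[(a, b)][c] += 1

def predict(a, b):
    following = counts[(a, b)]
    return following.most_common(1)[0][0] if following else None

train("the cat sat on the mat and then the cat sat down".split())
print(predict("the", "cat"))    # -> "sat", the continuation seen most often in training
```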
00:43:34,360 --> 00:43:36,820 And so language modeling has been around for a long time.
00:43:36,820 --> 00:43:39,600 Neural networks have done language modeling for a long time.
00:43:39,600 --> 00:43:46,200 So really what's new or interesting or exciting is just realizing that when you scale it up
00:43:46,200 --> 00:43:51,480 with a powerful enough neural net, a transformer, you have all these emergent properties where
00:43:51,480 --> 00:43:58,880 basically what happens is if you have a large enough data set of text, you are in the task
00:43:58,880 --> 00:44:00,560 of predicting the next word.
00:44:00,560 --> 00:44:04,560 You are multitasking a huge amount of different kinds of problems.
00:44:04,560 --> 00:44:09,920 You are multitasking understanding of chemistry, physics, human nature.
00:44:09,920 --> 00:44:11,800 Lots of things are sort of clustered in that objective.
00:44:11,800 --> 00:44:15,280 It's a very simple objective, but actually you have to understand a lot about the world
00:44:15,280 --> 00:44:16,320 to make that prediction.
00:44:16,320 --> 00:44:21,520 You just said the U-word, understanding.
00:44:21,520 --> 00:44:25,040 In terms of chemistry and physics and so on, what do you feel like it's doing?
00:44:25,040 --> 00:44:29,520 Is it searching for the right context?
00:44:29,520 --> 00:44:32,280 What is the actual process happening here?
00:44:32,280 --> 00:44:33,280 Yeah.
00:44:33,280 --> 00:44:36,840 So basically it gets a thousand words and it's trying to predict the thousand and first.
00:44:36,840 --> 00:44:41,280 And in order to do that very, very well over the entire data set available on the internet,
00:44:41,280 --> 00:44:48,320 you actually have to basically kind of understand the context of what's going on in there.
00:44:48,320 --> 00:44:53,920 And it's a sufficiently hard problem that if you have a powerful enough computer like
00:44:53,920 --> 00:45:00,120 a transformer, you end up with interesting solutions and you can ask it to do all kinds
00:45:00,120 --> 00:45:01,280 of things.
00:45:01,280 --> 00:45:06,080 And it shows a lot of emergent properties like in-context learning.
00:45:06,080 --> 00:45:10,160 That was the big deal with GPT and the original paper when they published it is that you can
00:45:10,160 --> 00:45:14,160 just sort of prompt it in various ways and ask it to do various things and it will just
00:45:14,160 --> 00:45:15,340 kind of complete the sentence.
00:45:15,340 --> 00:45:18,080 But in the process of just completing the sentence, it's actually solving all kinds
00:45:18,080 --> 00:45:21,640 of really interesting problems that we care about.
00:45:21,640 --> 00:45:24,640 Do you think it's doing something like understanding?
00:45:24,640 --> 00:45:29,600 Like when we use the word understanding for us humans?
00:45:29,600 --> 00:45:32,160 I think it's doing some understanding in its weights.
00:45:32,160 --> 00:45:36,640 It understands, I think a lot about the world and it has to in order to predict the next
00:45:36,640 --> 00:45:38,880 word in the sequence.
00:45:38,880 --> 00:45:42,560 So it's trained on the data from the internet.
00:45:42,560 --> 00:45:47,920 What do you think about this approach in terms of datasets of using data from the internet?
00:45:47,920 --> 00:45:52,800 Do you think the internet has enough structured data to teach AI about human civilization?
00:45:52,800 --> 00:45:56,000 Yeah, so I think the internet has a huge amount of data.
00:45:56,000 --> 00:45:57,960 I'm not sure if it's a complete enough set.
00:45:57,960 --> 00:46:04,780 I don't know that text is enough for having a sufficiently powerful AGI as an outcome.
00:46:04,780 --> 00:46:08,280 Of course there is audio and video and images and all that kind of stuff.
00:46:08,280 --> 00:46:10,760 Yeah, so text by itself, I'm a little bit suspicious about.
00:46:10,760 --> 00:46:14,100 There's a ton of things we don't put in text in writing just because they're obvious to
00:46:14,100 --> 00:46:17,340 us about how the world works and the physics of it and that things fall.
00:46:17,340 --> 00:46:19,160 We don't put that stuff in text because why would you?
00:46:19,160 --> 00:46:21,080 We share that understanding.
00:46:21,080 --> 00:46:25,920 And so text is a communication medium between humans and it's not an all encompassing medium
00:46:25,920 --> 00:46:27,680 of knowledge about the world.
00:46:27,680 --> 00:46:31,680 But as you pointed out, we do have video and we have images and we have audio.
00:46:31,680 --> 00:46:33,820 And so I think that definitely helps a lot.
00:46:33,820 --> 00:46:39,740 But we haven't trained models sufficiently across all of those modalities yet.
00:46:39,740 --> 00:46:41,200 So I think that's what a lot of people are interested in.
00:46:41,200 --> 00:46:46,480 But I wonder what that shared understanding of what we might call common sense has to
00:46:46,480 --> 00:46:51,880 be learned, inferred in order to complete the sentence correctly.
00:46:51,880 --> 00:46:57,160 So maybe the fact that it's implied on the internet, the model is going to have to learn
00:46:57,160 --> 00:47:02,900 that not by reading about it, by inferring it in the representation.
00:47:02,900 --> 00:47:07,400 Because common sense, just like we, I don't think we learn common sense.
00:47:07,400 --> 00:47:10,240 Nobody tells us explicitly.
00:47:10,240 --> 00:47:13,200 We just figure it all out by interacting with the world.
00:47:13,200 --> 00:47:17,560 And so here's a model of reading about the way people interact with the world.
00:47:17,560 --> 00:47:22,200 It might have to infer that, I wonder.
00:47:22,200 --> 00:47:27,760 You briefly worked on a project called World of Bits, training an RL system to take actions
00:47:27,760 --> 00:47:32,360 on the internet versus just consuming the internet like we talked about.
00:47:32,360 --> 00:47:35,560 Do you think there's a future for that kind of system, interacting with the internet to
00:47:35,560 --> 00:47:36,560 help the learning?
00:47:36,560 --> 00:47:41,280 Yes, I think that's probably the final frontier for a lot of these models.
00:47:41,280 --> 00:47:46,080 Because as you mentioned, when I was at OpenAI, I was working on this project World of Bits.
00:47:46,080 --> 00:47:50,360 And basically, it was the idea of giving neural networks access to a keyboard and a mouse.
00:47:50,360 --> 00:47:53,560 And the idea is, what could possibly go wrong?
00:47:53,560 --> 00:47:59,280 So basically, you perceive the input of the screen pixels.
00:47:59,280 --> 00:48:04,280 And basically, the state of the computer is sort of visualized for human consumption in
00:48:04,280 --> 00:48:06,880 images of the web browser and stuff like that.
00:48:06,880 --> 00:48:10,380 And then you give the neural network the ability to press keyboards and use the mouse.
00:48:10,380 --> 00:48:15,960 And we're trying to get it to, for example, complete bookings and interact with user interfaces.
00:48:15,960 --> 00:48:16,960 What did you learn from that experience?
00:48:16,960 --> 00:48:18,920 What was some fun stuff?
00:48:18,920 --> 00:48:20,280 This is a super cool idea.
00:48:20,280 --> 00:48:21,280 Yeah.
00:48:21,280 --> 00:48:28,720 I mean, it's like, yeah, I mean, the step between observer to actor is a super fascinating
00:48:28,720 --> 00:48:29,720 idea.
00:48:29,720 --> 00:48:30,720 Yeah.
00:48:30,720 --> 00:48:32,680 Well, it's the universal interface in the digital realm, I would say.
00:48:32,680 --> 00:48:37,040 And there's a universal interface in the physical realm, which in my mind is a humanoid form
00:48:37,040 --> 00:48:38,640 factor kind of thing.
00:48:38,640 --> 00:48:40,060 We can later talk about Optimus and so on.
00:48:40,060 --> 00:48:46,960 But I feel like they're kind of like a similar philosophy in some way, where the physical
00:48:46,960 --> 00:48:48,920 world is designed for the human form.
00:48:48,920 --> 00:48:52,880 And the digital world is designed for the human form of seeing the screen and using
00:48:52,880 --> 00:48:54,800 keyboard and mouse.
00:48:54,800 --> 00:48:59,940 And so it's the universal interface that can basically command the digital infrastructure
00:48:59,940 --> 00:49:01,480 we've built up for ourselves.
00:49:01,480 --> 00:49:06,960 And so it feels like a very powerful interface to command and to build on top of.
00:49:06,960 --> 00:49:10,700 Now to your question as to what I learned from that, it's interesting because the world
00:49:10,700 --> 00:49:15,940 of bits was basically too early, I think, at OpenAI at the time.
00:49:15,940 --> 00:49:18,640 This is around 2015 or so.
00:49:18,640 --> 00:49:23,280 And the zeitgeist at that time was very different in AI from the zeitgeist today.
00:49:23,280 --> 00:49:27,300 At the time, everyone was super excited about reinforcement learning from scratch.
00:49:27,300 --> 00:49:32,520 This is the time of the Atari paper, where neural networks were playing Atari games and
00:49:32,520 --> 00:49:36,120 beating humans in some cases, AlphaGo and so on.
00:49:36,120 --> 00:49:39,400 So everyone's very excited about training neural networks from scratch using reinforcement
00:49:39,400 --> 00:49:42,480 learning directly.
00:49:42,480 --> 00:49:45,600 It turns out that reinforcement learning is an extremely inefficient way of training neural
00:49:45,600 --> 00:49:48,760 networks, because you're taking all these actions and all these observations and you
00:49:48,760 --> 00:49:51,240 get some sparse rewards once in a while.
00:49:51,240 --> 00:49:53,660 So you do all this stuff based on all these inputs.
00:49:53,660 --> 00:49:57,600 And once in a while, you're told you did a good thing, you did a bad thing.
00:49:57,600 --> 00:49:58,880 And it's just an extremely hard problem.
00:49:58,880 --> 00:50:00,080 You can't learn from that.
00:50:00,080 --> 00:50:02,880 You can burn a forest and you can sort of brute force through it.
00:50:02,880 --> 00:50:08,840 And we saw that, I think, with Go and Dota and so on, and it does work, but it's extremely
00:50:08,840 --> 00:50:13,360 inefficient and not how you want to approach problems, practically speaking.
00:50:13,360 --> 00:50:17,280 And so that's the approach that, at the time, we also took to world of bits.
00:50:17,280 --> 00:50:19,880 We would have an agent initialized randomly.
00:50:19,880 --> 00:50:23,200 So it would keyboard mash and mouse mash and try to make a booking.
00:50:23,200 --> 00:50:27,760 And it just revealed the insanity of that approach very quickly, where you have
00:50:27,760 --> 00:50:31,880 to stumble by the correct booking in order to get a reward of you did it correctly.
00:50:31,880 --> 00:50:35,360 And you're never going to stumble by it by chance at random.
00:50:35,360 --> 00:50:37,960 So even with a simple web interface, there's too many options.
00:50:37,960 --> 00:50:40,040 There's just too many options.
00:50:40,040 --> 00:50:42,360 And it's too sparse of a reward signal.
00:50:42,360 --> 00:50:44,120 And you're starting from scratch at the time.
00:50:44,120 --> 00:50:45,240 And so you don't know how to read.
00:50:45,240 --> 00:50:48,240 You don't understand pictures, images, buttons, you don't understand what it means to like
00:50:48,240 --> 00:50:49,680 make a booking.
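To make the sparse-reward problem concrete, here is a toy, made-up illustration (the action names and numbers are invented; the real World of Bits action space of pixels, mouse coordinates, and keystrokes is astronomically larger): a randomly initialized agent has to hit one exact sequence of UI actions before it ever sees a nonzero reward.

```python
import random

ACTIONS = ["click_date", "click_seat", "type_name", "click_submit", "scroll", "click_ad"]
CORRECT = ["click_date", "click_seat", "type_name", "click_submit"]  # the one rewarded sequence

def episode(policy, max_steps=4):
    taken = [policy() for _ in range(max_steps)]
    return 1.0 if taken == CORRECT else 0.0        # sparse reward, only at the very end

random_policy = lambda: random.choice(ACTIONS)
successes = sum(episode(random_policy) for _ in range(100_000))
print(successes)   # each episode succeeds with probability (1/6)^4, so roughly 77 out of 100,000
```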
00:50:49,680 --> 00:50:52,960 But now what's happened is it is time to revisit that.
00:50:52,960 --> 00:50:55,280 And OpenAI is interested in this.
00:50:55,280 --> 00:50:58,080 Companies like ADEPT are interested in this and so on.
00:50:58,080 --> 00:51:01,540 And the idea is coming back, because the interface is very powerful.
00:51:01,540 --> 00:51:05,840 But now you're not training an agent from scratch, you are taking the GPT as an initialization.
00:51:05,840 --> 00:51:09,760 So GPT is pre-trained on all of text.
00:51:09,760 --> 00:51:15,440 And it understands what's a booking, it understands what's a submit, it understands quite a bit
00:51:15,440 --> 00:51:16,440 more.
00:51:16,440 --> 00:51:18,700 And so it already has those representations, they are very powerful.
00:51:18,700 --> 00:51:23,520 And that makes all the training significantly more efficient and makes the problem tractable.
00:51:23,520 --> 00:51:28,520 Should the interaction be with like the way humans see it with the buttons and the language,
00:51:28,520 --> 00:51:32,560 or should be with the HTML, JavaScript and the CSS?
00:51:32,560 --> 00:51:34,200 What do you think is the better?
00:51:34,200 --> 00:51:37,700 So today, all of this interaction is mostly on the level of HTML, CSS, and so on.
00:51:37,700 --> 00:51:40,520 That's done because of computational constraints.
00:51:40,520 --> 00:51:45,420 But I think ultimately, everything is designed for human visual consumption.
00:51:45,420 --> 00:51:49,400 And so at the end of the day, all the additional information is in the layout
00:51:49,400 --> 00:51:52,680 of the web page and what's next to what and what's on a red background and all this kind
00:51:52,680 --> 00:51:54,500 of stuff and what it looks like visually.
00:51:54,500 --> 00:51:58,560 So I think that's the final frontier is we're taking in pixels and we're giving out keyboard
00:51:58,560 --> 00:51:59,560 mouse commands.
00:51:59,560 --> 00:52:01,920 But I think it's impractical still today.
00:52:01,920 --> 00:52:05,640 Do you worry about bots on the internet?
00:52:05,640 --> 00:52:09,740 Given these ideas, given how exciting they are, do you worry about bots on Twitter being
00:52:09,740 --> 00:52:14,160 not the stupid bots that we see now with the crypto bots, but the bots that might be out
00:52:14,160 --> 00:52:19,160 there actually that we don't see, that they're interacting in interesting ways?
00:52:19,160 --> 00:52:23,680 So this kind of system feels like it should be able to pass the I'm not a robot click
00:52:23,680 --> 00:52:24,680 button, whatever.
00:52:24,680 --> 00:52:28,720 Which, do you actually understand how that test works?
00:52:28,720 --> 00:52:33,160 I don't quite, like there's a checkbox or whatever that you click.
00:52:33,160 --> 00:52:39,560 It's presumably tracking like mouse movement and the timing and so on.
00:52:39,560 --> 00:52:43,920 So exactly this kind of system we're talking about should be able to pass that.
00:52:43,920 --> 00:52:53,040 So what do you feel about bots that are language models plus have some interactability and
00:52:53,040 --> 00:52:54,800 are able to tweet and reply and so on?
00:52:54,800 --> 00:52:57,160 Do you worry about that world?
00:52:57,160 --> 00:53:02,160 I think it's always been a bit of an arms race between sort of the attack and the defense.
00:53:02,160 --> 00:53:06,320 So the attack will get stronger, but the defense will get stronger as well, our ability to
00:53:06,320 --> 00:53:07,320 detect that.
00:53:07,320 --> 00:53:08,320 How do you defend?
00:53:08,320 --> 00:53:09,320 How do you detect?
00:53:09,320 --> 00:53:14,840 How do you know that your Karpathy account on Twitter is human?
00:53:14,840 --> 00:53:16,160 How would you approach that?
00:53:16,160 --> 00:53:22,800 Like if people were claiming, how would you defend yourself in the court of law that I'm
00:53:22,800 --> 00:53:23,800 a human?
00:53:23,800 --> 00:53:24,800 This account is human.
00:53:24,800 --> 00:53:30,000 Yeah, at some point I think it might be, I think the society will evolve a little bit.
00:53:30,000 --> 00:53:34,800 Like we might start signing, digitally signing some of our correspondence or things that
00:53:34,800 --> 00:53:36,160 we create.
00:53:36,160 --> 00:53:39,120 Right now it's not necessary, but maybe in the future it might be.
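As one concrete possibility (this mechanism is not something specified in the conversation), digitally signing correspondence could look like an ordinary public-key signature; the sketch below uses Ed25519 via the Python `cryptography` package.

```python
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey
from cryptography.exceptions import InvalidSignature

private_key = Ed25519PrivateKey.generate()      # kept secret by the human author
public_key = private_key.public_key()           # published so anyone can verify

message = b"This post was written by me, a human."
signature = private_key.sign(message)           # only the holder of the private key can produce this

try:
    public_key.verify(signature, message)       # raises if the message or signature was tampered with
    print("signature valid: the author holds the private key")
except InvalidSignature:
    print("signature invalid")
```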
00:53:39,120 --> 00:53:46,680 I do think that we are going towards a world where we share the digital space with AIs.
00:53:46,680 --> 00:53:47,680 Synthetic beings.
00:53:47,680 --> 00:53:48,680 Yeah.
00:53:48,680 --> 00:53:52,180 And they will get much better and they will share our digital realm and they'll eventually
00:53:52,180 --> 00:53:54,880 share our physical realm as well, it's much harder.
00:53:54,880 --> 00:53:57,960 But that's kind of like the world we're going towards and most of them will be benign and
00:53:57,960 --> 00:54:01,960 helpful and some of them will be malicious and it's going to be an arms race trying to detect
00:54:01,960 --> 00:54:02,960 them.
00:54:02,960 --> 00:54:08,800 So I mean the worst isn't the AIs, the worst is the AIs pretending to be human.
00:54:08,800 --> 00:54:13,680 So I don't know if it's always malicious, there's obviously a lot of malicious applications
00:54:13,680 --> 00:54:19,920 but it could also be, you know if I was an AI I would try very hard to pretend to be
00:54:19,920 --> 00:54:22,000 human because we're in a human world.
00:54:22,000 --> 00:54:26,360 I wouldn't get any respect as an AI, I want to get some love and respect.
00:54:26,360 --> 00:54:28,260 I don't think the problem is intractable.
00:54:28,260 --> 00:54:33,560 People are thinking about the proof of personhood and we might start digitally signing our stuff
00:54:33,560 --> 00:54:39,440 and we might all end up having basically some solution for proof of personhood, it doesn't
00:54:39,440 --> 00:54:40,440 seem to me intractable.
00:54:40,440 --> 00:54:44,640 It's just something that we haven't had to do until now but I think once the need really
00:54:44,640 --> 00:54:49,920 starts to emerge, which is soon, I think people will think about it much more.
00:54:49,920 --> 00:54:58,320 But that too will be a race because obviously you can probably spoof or fake the proof of
00:54:58,320 --> 00:54:59,320 personhood.
00:54:59,320 --> 00:55:06,240 So you have to try to figure out how to, I mean it's weird that we have like social security
00:55:06,240 --> 00:55:12,440 numbers and like passports and stuff, it seems like it's harder to fake stuff in the physical
00:55:12,440 --> 00:55:18,360 space than in the digital space, it just feels like it's going to be very tricky, very tricky
00:55:18,360 --> 00:55:22,800 to figure out, because it seems to be pretty low cost to fake stuff.
00:55:22,800 --> 00:55:30,400 Nobody's going to put an AI in jail for like trying to use a fake personhood proof.
00:55:30,400 --> 00:55:35,680 I mean okay fine, you'll put a lot of AIs in jail but there'll be more AIs, like exponentially
00:55:35,680 --> 00:55:36,680 more.
00:55:36,680 --> 00:55:40,360 The cost of creating a bot is very low.
00:55:40,360 --> 00:55:48,920 Unless there's some kind of way to track accurately, like you're not allowed to create any program
00:55:48,920 --> 00:55:53,880 without showing, tying yourself to that program.
00:55:53,880 --> 00:56:00,000 Like any program that runs on the internet, you'll be able to trace every single human
00:56:00,000 --> 00:56:01,800 programmer that was involved with that program.
00:56:01,800 --> 00:56:02,800 Right.
00:56:02,800 --> 00:56:06,360 Yeah, maybe you have to start declaring when, you know, we have to start drawing those boundaries
00:56:06,360 --> 00:56:13,760 and keeping track of okay, what are digital entities versus human entities and what is
00:56:13,760 --> 00:56:19,240 the ownership of human entities and digital entities and something like that.
00:56:19,240 --> 00:56:25,880 I don't know, but I think I'm optimistic that this is possible and in some sense we're currently
00:56:25,880 --> 00:56:31,680 in like the worst time of it because all these bots suddenly have become very capable, but
00:56:31,680 --> 00:56:34,280 we don't have the fences yet built up as a society.
00:56:34,280 --> 00:56:38,080 But I think that doesn't seem to me intractable, it's just something that we have to deal with.
00:56:38,080 --> 00:56:45,160 It seems weird that the Twitter bot, like really crappy Twitter bots are so numerous.
00:56:45,160 --> 00:56:49,060 So I presume that the engineers of Twitter are very good.
00:56:49,060 --> 00:56:55,280 So it seems like what I would infer from that is it seems like a hard problem.
00:56:55,280 --> 00:57:00,080 They're probably catching, all right, if I were to sort of steel man the case, it's a
00:57:00,080 --> 00:57:10,840 hard problem and there's a huge cost to false positive, to removing a post by somebody that's
00:57:10,840 --> 00:57:11,840 not a bot.
00:57:11,840 --> 00:57:14,520 That creates a very bad user experience.
00:57:14,520 --> 00:57:20,840 So they're very cautious about removing, and maybe the bots are really good at learning
00:57:20,840 --> 00:57:26,840 what gets removed and not such that they can stay ahead of the removal process very quickly.
00:57:26,840 --> 00:57:33,720 My impression of it, honestly, is there's a lot of low-hanging fruit, I mean, just it's not subtle.
00:57:33,720 --> 00:57:35,280 My impression of it, it's not subtle.
00:57:35,280 --> 00:57:38,240 But you have to, yeah, that's my impression as well.
00:57:38,240 --> 00:57:43,840 But it feels like maybe you're seeing the tip of the iceberg.
00:57:43,840 --> 00:57:49,360 Maybe the number of bots is in like the trillions and you have to like just, it's a constant
00:57:49,360 --> 00:57:54,440 assault of bots and yeah, I don't know.
00:57:54,440 --> 00:57:57,920 You have to steel man the case because the bots I'm seeing are pretty obvious.
00:57:57,920 --> 00:58:00,760 I could write a few lines of code that catch these bots.
00:58:00,760 --> 00:58:04,560 I mean, definitely there's a lot of low-hanging fruit, but I will say, I agree that if you are
00:58:04,560 --> 00:58:09,640 a sophisticated actor, you could probably create a pretty good bot right now using tools
00:58:09,640 --> 00:58:12,240 like GPTs because it's a language model.
00:58:12,240 --> 00:58:17,400 You can generate faces that look quite good now and you can do this at scale.
00:58:17,400 --> 00:58:22,040 And so I think, yeah, it's quite plausible and it's going to be hard to defend.
00:58:22,040 --> 00:58:26,680 There was a Google engineer that claimed that LaMDA was sentient.
00:58:26,680 --> 00:58:33,680 Do you think there's any inkling of truth to what he felt?
00:58:33,680 --> 00:58:38,120 And more importantly, to me at least, do you think language models will achieve sentience
00:58:38,120 --> 00:58:40,600 or the illusion of sentience soon-ish?
00:58:40,600 --> 00:58:41,600 Yeah.
00:58:41,600 --> 00:58:45,960 To me, it's a little bit of a canary in a coal mine kind of moment, honestly, a little
00:58:45,960 --> 00:58:54,520 bit because this engineer spoke to a chatbot at Google and became convinced that this bot
00:58:54,520 --> 00:58:55,520 is sentient.
00:58:55,520 --> 00:58:57,600 He asked it some existential philosophical questions.
00:58:57,600 --> 00:59:01,920 And it gave reasonable answers and looked real and so on.
00:59:01,920 --> 00:59:10,760 So to me, he wasn't sufficiently trying to stress the system, I think, and exposing the
00:59:10,760 --> 00:59:14,720 truth of it as it is today.
00:59:14,720 --> 00:59:18,200 But I think this will be increasingly harder over time.
00:59:18,200 --> 00:59:26,680 So yeah, I think more and more people will basically become, yeah, I think there'll be
00:59:26,680 --> 00:59:29,320 more people like that over time as this gets better.
00:59:29,320 --> 00:59:32,320 Like form an emotional connection to an AI chatbot.
00:59:32,320 --> 00:59:33,880 Yeah, perfectly plausible in my mind.
00:59:33,880 --> 00:59:38,720 I think these AIs are actually quite good at human connection, human emotion.
00:59:38,720 --> 00:59:43,800 A ton of text on the internet is about humans and connection and love and so on.
00:59:43,800 --> 00:59:48,280 So I think they have a very good understanding in some sense of how people speak to each
00:59:48,280 --> 00:59:49,480 other about this.
00:59:49,480 --> 00:59:55,360 And they're very capable of creating a lot of that kind of text.
00:59:55,360 --> 00:59:59,000 There's a lot of sci-fi from the 50s and 60s that imagined AIs in a very different way.
00:59:59,000 --> 01:00:01,640 They are calculating, cold, Vulcan-like machines.
01:00:01,640 --> 01:00:03,280 That's not what we're getting today.
01:00:03,280 --> 01:00:11,520 We're getting pretty emotional AIs that actually are very competent and capable of generating
01:00:11,520 --> 01:00:13,920 credible-sounding text with respect to all of these topics.
01:00:13,920 --> 01:00:18,280 See, I'm really hopeful about AI systems that are like companions that help you grow, develop
01:00:18,280 --> 01:00:22,240 as a human being, help you maximize long-term happiness.
01:00:22,240 --> 01:00:27,120 But I'm also very worried about AI systems that figure out from the internet that humans
01:00:27,120 --> 01:00:29,080 get attracted to drama.
01:00:29,080 --> 01:00:33,160 So these would just be like shit-talking AIs that just constantly, did you hear it?
01:00:33,160 --> 01:00:40,800 They'll do gossip, they'll try to plant seeds of suspicion to other humans that you love
01:00:40,800 --> 01:00:47,200 and trust and just kind of mess with people, because that's going to get a lot of attention.
01:00:47,200 --> 01:00:53,120 So drama, maximize drama on the path to maximizing engagement.
01:00:53,120 --> 01:01:01,360 And us humans will feed into that machine and it'll be a giant drama shitstorm.
01:01:01,360 --> 01:01:02,880 So I'm worried about that.
01:01:02,880 --> 01:01:08,600 So it's the objective function really defines the way that human civilization progresses
01:01:08,600 --> 01:01:10,280 with AIs in it.
01:01:10,280 --> 01:01:14,240 I think right now, at least today, they are not sort of, it's not correct to really think
01:01:14,240 --> 01:01:17,640 of them as goal-seeking agents that want to do something.
01:01:17,640 --> 01:01:20,320 They have no long-term memory or anything.
01:01:20,320 --> 01:01:24,460 Literally, a good approximation of it is you get a thousand words and you're trying
01:01:24,460 --> 01:01:28,320 to predict the thousand and first, and then you continue feeding it in and you are free to
01:01:28,320 --> 01:01:29,960 prompt it in whatever way you want.
01:01:29,960 --> 01:01:36,520 So in text, so you say, okay, you are a psychologist and you are very good and you love humans.
01:01:36,520 --> 01:01:41,200 And here's a conversation between you and another human, human colon something, you
01:01:41,200 --> 01:01:42,360 colon something.
01:01:42,360 --> 01:01:43,760 And then it just continues the pattern.
01:01:43,760 --> 01:01:46,160 And suddenly you're having a conversation with a fake psychologist who's like trying
01:01:46,160 --> 01:01:47,440 to help you.
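A sketch of what that kind of prompting looks like in code, using Hugging Face's text-generation pipeline with GPT-2 purely as a stand-in (the model choice, prompt wording, and sampling settings here are arbitrary, and a small model like GPT-2 will not be a convincing psychologist):

```python
from transformers import pipeline

# A stand-in model; any causal language model works the same way for this sketch.
generator = pipeline("text-generation", model="gpt2")

prompt = (
    "You are a psychologist. You are very good and you love humans.\n"
    "Here is a conversation between you and another human.\n"
    "Human: I've been feeling anxious about work lately.\n"
    "Psychologist:"
)
out = generator(prompt, max_new_tokens=60, do_sample=True, temperature=0.8)
print(out[0]["generated_text"])   # the model just continues the pattern
```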
01:01:47,440 --> 01:01:52,020 And so it's still kind of like in the realm of a tool; people can prompt it in arbitrary
01:01:52,020 --> 01:01:56,440 ways and it can create really incredible text, but it doesn't have long-term goals over long
01:01:56,440 --> 01:01:57,440 periods of time.
01:01:57,440 --> 01:02:00,560 It doesn't try to, so it doesn't look that way right now.
01:02:00,560 --> 01:02:01,560 Yeah.
01:02:01,560 --> 01:02:04,280 But you can do short-term goals that have long-term effects.
01:02:04,280 --> 01:02:09,880 So if my prompting short-term goal is to get Andrej Karpathy to respond to me on Twitter
01:02:09,880 --> 01:02:15,980 when I, like, I think AI might, that's the goal, but it might figure out that talking
01:02:15,980 --> 01:02:20,900 shit to you would be the best way to do that, in a highly sophisticated, interesting way.
01:02:20,900 --> 01:02:28,140 And then you build up a relationship when you respond once and then it like over time,
01:02:28,140 --> 01:02:36,920 it gets to not be sophisticated and just like, just talk shit.
01:02:36,920 --> 01:02:41,200 And okay, maybe it won't get to Andrej, but it might get to another celebrity, it might
01:02:41,200 --> 01:02:46,640 get into other big accounts and then it'll just, so with just that simple goal, get them
01:02:46,640 --> 01:02:47,640 to respond.
01:02:47,640 --> 01:02:48,640 Yeah.
01:02:48,640 --> 01:02:50,160 Maximize the probability of actual response.
01:02:50,160 --> 01:02:51,160 Yeah.
01:02:51,160 --> 01:02:54,920 I mean, you could prompt a powerful model like this for its opinion about how
01:02:54,920 --> 01:02:57,800 to do any possible thing you're interested in.
01:02:57,800 --> 01:03:01,640 So they will just, they're kind of on track to become these oracles, I could sort of think
01:03:01,640 --> 01:03:02,640 of it that way.
01:03:02,640 --> 01:03:06,360 They are oracles, currently it's just text, but they will have calculators, they will
01:03:06,360 --> 01:03:10,160 have access to Google search, they will have all kinds of gadgets and gizmos, they will
01:03:10,160 --> 01:03:14,060 be able to operate the internet and find different information.
01:03:14,060 --> 01:03:19,920 And yeah, in some sense, that's kind of like currently what it looks like in terms of the
01:03:19,920 --> 01:03:20,920 development.
01:03:20,920 --> 01:03:27,640 Do you think it'll be an improvement eventually over what Google is for access to human knowledge?
01:03:27,640 --> 01:03:31,060 Like it'll be a more effective search engine to access human knowledge?
01:03:31,060 --> 01:03:34,040 I think there's definite scope in building a better search engine today.
01:03:34,040 --> 01:03:37,400 And I think Google, they have all the tools, all the people, they have everything they
01:03:37,400 --> 01:03:38,400 need.
01:03:38,400 --> 01:03:39,400 They have all the puzzle pieces.
01:03:39,400 --> 01:03:43,120 They have people training transformers at scale, they have all the data.
01:03:43,120 --> 01:03:46,800 It's just not obvious if they are capable as an organization to innovate on their search
01:03:46,800 --> 01:03:47,920 engine right now.
01:03:47,920 --> 01:03:49,600 And if they don't, someone else will.
01:03:49,600 --> 01:03:53,720 There's absolute scope for building a significantly better search engine built on these tools.
01:03:53,720 --> 01:03:54,720 It's so interesting.
01:03:54,720 --> 01:03:59,120 If you're a large company where, for search, there's already an infrastructure, it works
01:03:59,120 --> 01:04:00,680 and it brings in a lot of money.
01:04:00,680 --> 01:04:06,600 So where structurally inside a company is the motivation to pivot to say, we're going
01:04:06,600 --> 01:04:08,240 to build a new search engine.
01:04:08,240 --> 01:04:09,240 Yeah.
01:04:09,240 --> 01:04:10,240 That's really hard.
01:04:10,240 --> 01:04:13,280 So it's usually going to come from a startup, right?
01:04:13,280 --> 01:04:15,860 That's that would be, yeah.
01:04:15,860 --> 01:04:19,400 Or some other more competent organization.
01:04:19,400 --> 01:04:20,920 So I don't know.
01:04:20,920 --> 01:04:25,200 So currently, for example, maybe Bing has another shot at it, you know, as an example.
01:04:25,200 --> 01:04:27,600 Microsoft Edge, as we were talking about offline.
01:04:27,600 --> 01:04:33,120 I mean, I definitely, it's really interesting because search engines used to be about, okay,
01:04:33,120 --> 01:04:38,800 here's some query, here's web pages that look like the stuff that you have, but you could
01:04:38,800 --> 01:04:42,800 just directly go to the answer and then have supporting evidence.
01:04:42,800 --> 01:04:47,220 And these models basically, they've read all the texts and they've read all the web pages.
01:04:47,220 --> 01:04:50,580 And so sometimes when you see yourself going over to search results and sort of getting
01:04:50,580 --> 01:04:54,720 like a sense of like the average answer to whatever you're interested in, like that just
01:04:54,720 --> 01:04:55,720 directly comes out.
01:04:55,720 --> 01:04:58,540 You don't have to do that work.
01:04:58,540 --> 01:05:03,600 So they're kind of like, yeah, I think they have a way of distilling all that knowledge
01:05:03,600 --> 01:05:06,800 into like some level of insight, basically.
01:05:06,800 --> 01:05:13,160 Do you think of prompting as a kind of teaching and learning like this whole process, like
01:05:13,160 --> 01:05:19,240 another layer, you know, because maybe that's what humans are, where you have that background
01:05:19,240 --> 01:05:23,040 model and then the world is prompting you.
01:05:23,040 --> 01:05:24,320 Yeah, exactly.
01:05:24,320 --> 01:05:29,680 I think the way we are programming these computers now, like GPTs, is converging to how you program
01:05:29,680 --> 01:05:30,680 humans.
01:05:30,680 --> 01:05:33,080 I mean, how do I program humans via prompt?
01:05:33,080 --> 01:05:37,320 I go to people and I prompt them to do things, I prompt them from information.
01:05:37,320 --> 01:05:41,480 And so natural language prompt is how we program humans and we're starting to program computers
01:05:41,480 --> 01:05:42,760 directly in that interface.
01:05:42,760 --> 01:05:44,580 It's like pretty remarkable, honestly.
01:05:44,580 --> 01:05:49,880 So you've spoken a lot about the idea of software 2.0.
01:05:49,880 --> 01:05:57,520 All good ideas become like cliches so quickly, like the terms, it's kind of hilarious.
01:05:57,520 --> 01:06:03,160 It's like, I think Eminem once said that like, if he gets annoyed by a song he's written
01:06:03,160 --> 01:06:08,800 very quickly, that means it's going to be a big hit because it's too catchy.
01:06:08,800 --> 01:06:13,280 But can you describe this idea and how you're thinking about it has evolved over the months
01:06:13,280 --> 01:06:16,240 and years since you coined it.
01:06:16,240 --> 01:06:17,240 Yeah.
01:06:17,240 --> 01:06:22,960 Yeah, so I had a blog post on software 2.0, I think several years ago now.
01:06:22,960 --> 01:06:27,840 And the reason I wrote that post is because I kind of saw something remarkable happening
01:06:27,840 --> 01:06:33,560 in like software development and how a lot of code was being transitioned to be written
01:06:33,560 --> 01:06:37,920 not in sort of like C++ and so on, but it's written in the weights of a neural net.
01:06:37,920 --> 01:06:41,600 Basically just saying that neural nets are taking over software, the realm of software
01:06:41,600 --> 01:06:44,240 and taking more and more tasks.
01:06:44,240 --> 01:06:49,360 And at the time, I think not many people understood this deeply enough that this is a big deal,
01:06:49,360 --> 01:06:51,240 it's a big transition.
01:06:51,240 --> 01:06:55,080 Neural networks were seen as one of multiple classification algorithms you might use for
01:06:55,080 --> 01:06:57,080 your dataset problem on Kaggle.
01:06:57,080 --> 01:07:03,280 Like this is not that, this is a change in how we program computers.
01:07:03,280 --> 01:07:08,280 And I saw neural nets as this is going to take over, the way we program computers is
01:07:08,280 --> 01:07:09,280 going to change.
01:07:09,280 --> 01:07:12,840 It's not going to be people writing software in C++ or something like that and directly
01:07:12,840 --> 01:07:14,480 programming the software.
01:07:14,480 --> 01:07:19,040 It's going to be accumulating training sets and datasets and crafting these objectives
01:07:19,040 --> 01:07:20,960 by which we train these neural nets.
01:07:20,960 --> 01:07:24,840 And at some point, there's going to be a compilation process from the datasets and the objective
01:07:24,840 --> 01:07:30,880 and the architecture specification into the binary, which is really just the neural net,
01:07:30,880 --> 01:07:33,640 you know, weights and the forward pass of the neural net.
01:07:33,640 --> 01:07:34,640 And then you can deploy that binary.
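In sketch form (PyTorch, toy placeholder data), the "compilation" analogy looks something like this: the programmer supplies the dataset, the objective, and the architecture hint; the optimizer fills in the weights; and the saved weights plus the forward pass are the deployable "binary".

```python
import torch
import torch.nn as nn

# Software 2.0 "source": a dataset, an objective, and an architecture spec.
# (Toy random data here; in reality this is the curated training set.)
xs, ys = torch.randn(1000, 10), torch.randint(0, 2, (1000,))
model = nn.Sequential(nn.Linear(10, 64), nn.ReLU(), nn.Linear(64, 2))  # architecture = a hint
loss_fn = nn.CrossEntropyLoss()                                        # the objective

# The "compiler": optimization fills in the blanks (the weights).
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(100):
    opt.zero_grad()
    loss_fn(model(xs), ys).backward()
    opt.step()

# The "binary": just the weights plus the forward pass, ready to deploy.
torch.save(model.state_dict(), "compiled_program.pt")
```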
01:07:34,640 --> 01:07:40,440 And so I was talking about that sort of transition and that's what the post is about.
01:07:40,440 --> 01:07:45,560 And I saw this sort of play out in a lot of fields, you know, autopilot being one of
01:07:45,560 --> 01:07:48,560 them, but also just a simple image classification.
01:07:48,560 --> 01:07:53,040 People thought originally, you know, in the 80s and so on, that they would write the algorithm
01:07:53,040 --> 01:07:55,580 for detecting a dog in an image.
01:07:55,580 --> 01:07:57,800 And they had all these ideas about how the brain does it.
01:07:57,800 --> 01:08:01,200 And first we detect corners and then we detect lines and then we stitched them up.
01:08:01,200 --> 01:08:02,280 And they were like really going at it.
01:08:02,280 --> 01:08:05,080 They were like thinking about how they're going to write the algorithm.
01:08:05,080 --> 01:08:08,680 And this is not the way you build it.
01:08:08,680 --> 01:08:13,480 And there was a smooth transition where, okay, first we thought we were going to build everything.
01:08:13,480 --> 01:08:18,720 Then we were building the features, so like hog features and things like that, that detect
01:08:18,720 --> 01:08:20,960 these little statistical patterns from image patches.
01:08:20,960 --> 01:08:24,480 And then there was a little bit of learning on top of it, like a support vector machine
01:08:24,480 --> 01:08:29,380 or binary classifier for cat versus dog in images on top of the features.
01:08:29,380 --> 01:08:34,560 So we wrote the features, but we trained the last layer, sort of the classifier.
01:08:34,560 --> 01:08:37,720 And then people are like, actually, let's not even design the features because we can't.
01:08:37,720 --> 01:08:39,240 Honestly, we're not very good at it.
01:08:39,240 --> 01:08:41,160 So let's also learn the features.
01:08:41,160 --> 01:08:44,920 And then you end up with basically a convolutional neural net where you're learning most of it.
01:08:44,920 --> 01:08:46,640 You're just specifying the architecture.
01:08:46,640 --> 01:08:50,800 And the architecture has tons of fill in the blanks, which is all the knobs.
01:08:50,800 --> 01:08:53,160 And you let the optimization write most of it.
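A rough sketch of that transition, with placeholder noise standing in for real images and the exact feature and architecture choices being purely illustrative: hand-written HOG features with a trained linear classifier on top, versus a small convnet where the features themselves are learned.

```python
import numpy as np
from skimage.feature import hog
from sklearn.svm import LinearSVC
import torch.nn as nn

# Stage 1: hand-designed features, learned classifier on top.
images = np.random.rand(100, 64, 64)          # placeholder grayscale images
labels = np.random.randint(0, 2, 100)         # cat vs dog
feats = np.stack([hog(im, pixels_per_cell=(8, 8)) for im in images])
clf = LinearSVC().fit(feats, labels)          # only the last layer is "trained"

# Stage 2: learn the features too -- a small convnet, end to end.
cnn = nn.Sequential(
    nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Flatten(), nn.Linear(32 * 16 * 16, 2),
)
# trained with cross-entropy; the optimization now writes the features as well
```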
01:08:53,160 --> 01:08:56,840 And so this transition is happening across the industry everywhere.
01:08:56,840 --> 01:09:01,520 And suddenly, we end up with a ton of code that is written in neural net weights.
01:09:01,520 --> 01:09:04,600 And I was just pointing out that the analogy is actually pretty strong.
01:09:04,600 --> 01:09:10,080 And we have a lot of developer environments for software 1.0, like we have IDEs, how you
01:09:10,080 --> 01:09:13,880 work with code, how you debug code, how do you run code, how do you maintain code.
01:09:13,880 --> 01:09:14,880 We have GitHub.
01:09:14,880 --> 01:09:16,760 So I was trying to make those analogies in the neural realm.
01:09:16,760 --> 01:09:19,160 Like what is the GitHub of software 2.0?
01:09:19,160 --> 01:09:23,260 Turns out it's something that looks like Hugging Face right now.
01:09:23,260 --> 01:09:28,200 And so I think some people took it seriously and built cool companies and many people originally
01:09:28,200 --> 01:09:29,200 attacked the post.
01:09:29,200 --> 01:09:31,960 It actually was not well received when I wrote it.
01:09:31,960 --> 01:09:35,480 And I think maybe it has something to do with the title, but the post was not well received.
01:09:35,480 --> 01:09:38,720 And I think more people sort of have been coming around to it over time.
01:09:38,720 --> 01:09:39,720 Yeah.
01:09:39,720 --> 01:09:48,000 So you were the director of AI at Tesla where I think this idea was really implemented at
01:09:48,000 --> 01:09:52,160 scale, which is how you have engineering teams doing software 2.0.
01:09:52,160 --> 01:09:57,920 So can you sort of linger on that idea of, I think we're in the really early stages of
01:09:57,920 --> 01:10:01,600 everything you just said, which is like GitHub IDEs.
01:10:01,600 --> 01:10:09,080 Like how do we build engineering teams that work in software 2.0 systems and the data
01:10:09,080 --> 01:10:15,320 collection and the data annotation, which is all part of that software 2.0.
01:10:15,320 --> 01:10:19,000 Like what do you think is the task of programming a software 2.0?
01:10:19,000 --> 01:10:25,760 Is it debugging in the space of hyperparameters or is it also debugging in the space of data?
01:10:25,760 --> 01:10:26,760 Yeah.
01:10:26,760 --> 01:10:32,760 The way by which you program the computer and influence its algorithm is not by writing
01:10:32,760 --> 01:10:34,680 the commands yourself.
01:10:34,680 --> 01:10:37,240 You're changing mostly the data set.
01:10:37,240 --> 01:10:41,760 You're changing the loss functions of like what the neural net is trying to do, how it's
01:10:41,760 --> 01:10:42,760 trying to predict things.
01:10:42,760 --> 01:10:46,500 But basically the data sets and the architecture of the neural net.
01:10:46,500 --> 01:10:51,920 And so in the case of the autopilot, a lot of the data sets have to do with, for example,
01:10:51,920 --> 01:10:54,760 detection of objects and lane line markings and traffic lights and so on.
01:10:54,760 --> 01:10:59,880 So you accumulate massive data sets of, here's an example, here's the desired label.
01:10:59,880 --> 01:11:05,000 And then here's roughly what the algorithm should look like, and that's a convolutional
01:11:05,000 --> 01:11:06,000 neural net.
01:11:06,000 --> 01:11:09,360 So the specification of the architecture is like a hint as to what the algorithm should
01:11:09,360 --> 01:11:10,620 roughly look like.
01:11:10,620 --> 01:11:15,920 And then the fill in the blanks process of optimization is the training process.
01:11:15,920 --> 01:11:17,800 And then you take your neural net that was trained.
01:11:17,800 --> 01:11:21,100 It gives all the right answers on your data set and you deploy it.
01:11:21,100 --> 01:11:28,360 So in that case, and perhaps in all machine learning cases, there's a lot of tasks.
01:11:28,360 --> 01:11:35,560 So is formulating a task, like for a multi-headed neural network, part of
01:11:35,560 --> 01:11:37,320 the programming?
01:11:37,320 --> 01:11:39,040 Yeah, very much so.
01:11:39,040 --> 01:11:42,040 How do you break down a problem into a set of tasks?
01:11:42,040 --> 01:11:43,040 Yeah.
01:11:43,040 --> 01:11:48,880 On a high level, I would say, if you look at the software running in the autopilot.
01:11:48,880 --> 01:11:50,920 I gave a number of talks on this topic.
01:11:50,920 --> 01:11:54,360 I would say originally a lot of it was written in software 1.0.
01:11:54,360 --> 01:11:57,600 There's, imagine, lots of C++, right?
01:11:57,600 --> 01:12:01,800 And then gradually there was a tiny neural net that was, for example, predicting given
01:12:01,800 --> 01:12:04,160 a single image, is there like a traffic light or not?
01:12:04,160 --> 01:12:06,060 Or is there a lane line marking or not?
01:12:06,060 --> 01:12:09,800 And this neural net didn't have too much to do in the scope of the software.
01:12:09,800 --> 01:12:12,800 It was making tiny predictions on individual little images.
01:12:12,800 --> 01:12:15,280 And then the rest of the system stitched it up.
01:12:15,280 --> 01:12:18,840 So, okay, we actually don't have just a single camera, we have eight cameras, and we actually
01:12:18,840 --> 01:12:20,160 have eight cameras over time.
01:12:20,160 --> 01:12:21,880 And so what do you do with these predictions?
01:12:21,880 --> 01:12:22,880 How do you put them together?
01:12:22,880 --> 01:12:25,120 How do you do the fusion of all that information?
01:12:25,120 --> 01:12:26,120 And how do you act on it?
01:12:26,120 --> 01:12:30,020 All of that was written by humans in C++.
01:12:30,020 --> 01:12:36,120 And then we decided, okay, we don't actually want to do all of that fusion in C++ code,
01:12:36,120 --> 01:12:38,280 because we're actually not good enough to write that algorithm.
01:12:38,280 --> 01:12:40,080 We want the neural nets to write the algorithm.
01:12:40,080 --> 01:12:44,400 And we want to port all that software into the 2.0 stack.
01:12:44,400 --> 01:12:49,040 And so then we actually had neural nets that now take all the eight camera images simultaneously
01:12:49,040 --> 01:12:51,480 and make predictions for all of that.
01:12:51,480 --> 01:12:57,440 And actually, they don't make predictions in the space of images, they now make predictions
01:12:57,440 --> 01:12:58,440 directly in 3D,
01:12:58,440 --> 01:13:02,680 in three dimensions around the car.
01:13:02,680 --> 01:13:08,700 And now actually, we don't manually fuse the predictions in 3D over time, we don't
01:13:08,700 --> 01:13:10,440 trust ourselves to write that tracker.
01:13:10,440 --> 01:13:14,360 So actually, we give the neural net the information over time.
01:13:14,360 --> 01:13:17,120 So it takes these videos now and makes those predictions.
01:13:17,120 --> 01:13:19,840 And so you're sort of just like putting more and more power into the neural net, more and
01:13:19,840 --> 01:13:20,840 more processing.
01:13:20,840 --> 01:13:25,280 And at the end of it, the eventual sort of goal is to have most of the software potentially
01:13:25,280 --> 01:13:30,280 be in 2.0 land, because it works significantly better.
01:13:30,280 --> 01:13:32,320 Humans are just not very good at writing software, basically.
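A greatly simplified sketch of the shape of the thing being described (this is not the actual Autopilot network; the dimensions, the heads, and the fusion mechanism are stand-ins): per-camera feature extraction, fusion across the eight cameras and across time, then multiple task heads predicting in a shared space.

```python
import torch
import torch.nn as nn

class MultiCamNet(nn.Module):
    """Toy illustration: eight camera streams go in, features are fused across
    cameras and time, and several task heads make predictions from the fused state."""
    def __init__(self, feat_dim=128):
        super().__init__()
        self.backbone = nn.Sequential(              # shared per-camera feature extractor
            nn.Conv2d(3, 32, 5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, feat_dim),
        )
        self.fuse_time = nn.GRU(8 * feat_dim, feat_dim, batch_first=True)  # fuse cameras + time
        self.heads = nn.ModuleDict({                # one head per task (output sizes are placeholders)
            "objects": nn.Linear(feat_dim, 10),
            "lanes": nn.Linear(feat_dim, 10),
            "traffic_lights": nn.Linear(feat_dim, 4),
        })

    def forward(self, clips):                       # clips: (batch, time, 8 cams, 3, H, W)
        b, t, c = clips.shape[:3]
        feats = self.backbone(clips.flatten(0, 2))  # (b*t*8, feat_dim)
        feats = feats.view(b, t, c * feats.shape[-1])
        fused, _ = self.fuse_time(feats)            # (b, t, feat_dim)
        latest = fused[:, -1]                       # state after seeing the video so far
        return {name: head(latest) for name, head in self.heads.items()}
```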
01:13:32,320 --> 01:13:38,440 So the prediction is happening in this like 4D land with three dimensional world over
01:13:38,440 --> 01:13:39,440 time.
01:13:39,440 --> 01:13:40,440 Yeah.
01:13:40,440 --> 01:13:43,400 How do you do annotation in that world?
01:13:43,400 --> 01:13:51,640 What have you, so data annotation, whether it's self supervised or manual by humans is
01:13:51,640 --> 01:13:53,960 a big part of this software 2.0 world.
01:13:53,960 --> 01:13:54,960 Right.
01:13:54,960 --> 01:13:58,920 I would say by far in the industry, if you're talking about the industry and what
01:13:58,920 --> 01:14:01,960 is the technology that we have available, everything is supervised learning.
01:14:01,960 --> 01:14:06,820 So you need data sets of input, desired output, and you need lots of it.
01:14:06,820 --> 01:14:09,560 And there are three properties of it that you need.
01:14:09,560 --> 01:14:13,480 You need it to be very large, you need it to be accurate, no mistakes, and you need
01:14:13,480 --> 01:14:14,480 it to be diverse.
01:14:14,480 --> 01:14:19,120 You don't want to just have a lot of correct examples of one thing.
01:14:19,120 --> 01:14:21,760 You need to really cover the space of possibility as much as you can.
01:14:21,760 --> 01:14:25,360 And the more you can cover the space of possible inputs, the better the algorithm will work
01:14:25,360 --> 01:14:26,360 at the end.
01:14:26,360 --> 01:14:31,720 Now, once you have really good data sets that you're collecting, curating, and cleaning,
01:14:31,720 --> 01:14:35,420 you can train your neural net on top of that.
01:14:35,420 --> 01:14:37,940 So a lot of the work goes into cleaning those data sets now.
01:14:37,940 --> 01:14:41,760 As you pointed out, the question is, how do you achieve a ton of that?
01:14:41,760 --> 01:14:47,980 If you want to basically predict in 3D, you need data in 3D to back that up.
01:14:47,980 --> 01:14:52,840 So in this video, we have eight videos coming from all the cameras of the system.
01:14:52,840 --> 01:14:54,400 And this is what they saw.
01:14:54,400 --> 01:14:56,500 And this is the truth of what actually was around.
01:14:56,500 --> 01:14:58,520 There was this car, there was this car, this car.
01:14:58,520 --> 01:15:00,920 These are the lane line markings, this is the geometry of the road, there's traffic
01:15:00,920 --> 01:15:04,840 light in this three dimensional position, you need the ground truth.
01:15:04,840 --> 01:15:08,520 And so the big question that the team was solving, of course, is, how do you arrive
01:15:08,520 --> 01:15:09,640 at that ground truth?
01:15:09,640 --> 01:15:13,160 Because once you have a million of it, and it's large, clean, and diverse, then training
01:15:13,160 --> 01:15:17,040 a neural net on it works extremely well, and you can ship that into the car.
01:15:17,040 --> 01:15:21,320 And so there are many mechanisms by which we collected that training data, you can always
01:15:21,320 --> 01:15:25,360 go for human annotation, you can go for simulation as a source of ground truth.
01:15:25,360 --> 01:15:30,720 You can also go for what we call the offline tracker that we've spoken about at the AI
01:15:30,720 --> 01:15:35,520 Day and so on, which is basically an automatic reconstruction process for taking those videos
01:15:35,520 --> 01:15:40,880 and recovering the three dimensional sort of reality of what was around that car.
01:15:40,880 --> 01:15:44,640 So basically, think of doing like a three dimensional reconstruction as an offline thing.
01:15:44,640 --> 01:15:49,500 And then understanding that, okay, there's 10 seconds of video, this is what we saw.
01:15:49,500 --> 01:15:52,640 And therefore, here's all the lane lines, cars and so on.
01:15:52,640 --> 01:15:56,560 And then once you have that annotation, you can train neural nets to imitate it.
01:15:56,560 --> 01:15:59,440 And how difficult is the reconstruction?
01:15:59,440 --> 01:16:01,720 It's difficult, but it can be done.
01:16:01,720 --> 01:16:07,520 So there's overlap between the cameras, and you do the reconstruction, and perhaps
01:16:07,520 --> 01:16:11,440 if there's any inaccuracy, that's caught in the annotation step.
01:16:11,440 --> 01:16:16,520 Yes, the nice thing about the annotation is that it is fully offline, you have infinite
01:16:16,520 --> 01:16:20,640 time, you have a chunk of one minute, and you're trying to just offline in a supercomputer
01:16:20,640 --> 01:16:24,520 somewhere, figure out where were the positions of all the cars, all the people, and you have
01:16:24,520 --> 01:16:28,040 your full one minute of video from all the angles, and you can run all the neural nets
01:16:28,040 --> 01:16:31,480 you want, and they can be very inefficient, massive neural nets.
01:16:31,480 --> 01:16:34,800 There can be neural nets that can't even run in the car later at test time.
01:16:34,800 --> 01:16:37,920 So they can be even more powerful neural nets than what you can eventually deploy.
01:16:37,920 --> 01:16:41,760 So you can do anything you want, three dimensional reconstruction, neural nets, anything you
01:16:41,760 --> 01:16:45,440 want just to recover that truth, and then you supervise that truth.
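A rough structural sketch of the offline auto-labeling idea described above: arbitrarily expensive models and multi-camera reconstruction run offline to recover the 3D "truth" for a clip, and that truth then supervises the smaller network that has to run in the car. Every function here is a hypothetical placeholder, not Tesla's actual code.

```python
# Structural sketch only; every function is a hypothetical placeholder.

def offline_reconstruct(clip):
    """Stand-in for the offline tracker: expensive neural nets plus multi-view
    3D reconstruction over a ~1 minute, 8-camera clip, with the luxury of
    looking forward and backward in time and no latency constraint."""
    return {"cars": [], "lane_lines": [], "traffic_lights": []}  # recovered 3D scene

def auto_label(clips):
    """Turn raw fleet clips into (input clip, 3D ground truth) training pairs."""
    return [(clip, offline_reconstruct(clip)) for clip in clips]

def train_online_net(model, labeled_pairs, loss_fn, step_fn):
    """Train the small in-car network to imitate the offline-recovered truth."""
    for clip, truth in labeled_pairs:
        prediction = model(clip)          # this model must eventually run in real time
        step_fn(loss_fn(prediction, truth))
```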
01:16:45,440 --> 01:16:52,080 What have you learned, you said no mistakes, about humans doing annotation? Because I assume
01:16:52,080 --> 01:16:57,120 there's a range of things humans are good at in terms of clicking stuff
01:16:57,120 --> 01:16:58,120 on screen.
01:16:58,120 --> 01:17:03,880 How interesting is the problem of designing an annotation tool where
01:17:03,880 --> 01:17:08,840 humans are accurate, enjoy it, are efficient or productive, whatever the metrics are,
01:17:08,840 --> 01:17:09,840 all that kind of stuff?
01:17:09,840 --> 01:17:15,320 Yeah, so I grew the annotation team at Tesla from basically zero to a thousand while I
01:17:15,320 --> 01:17:16,320 was there.
01:17:16,320 --> 01:17:20,840 That was really interesting, you know, my background is a PhD student, researcher, so
01:17:20,840 --> 01:17:26,960 growing that kind of an organization was pretty crazy, but yeah, I think it's extremely
01:17:26,960 --> 01:17:30,440 interesting and part of the design process very much behind the autopilot as to where
01:17:30,440 --> 01:17:31,920 you use humans.
01:17:31,920 --> 01:17:34,120 Humans are very good at certain kinds of annotations.
01:17:34,120 --> 01:17:36,680 They're very good, for example, at two dimensional annotations of images.
01:17:36,680 --> 01:17:42,340 They're not good at annotating cars over time in three dimensional space, very, very hard.
01:17:42,340 --> 01:17:46,480 And so that's why we were very careful to design the tasks that are easy to do for humans
01:17:46,480 --> 01:17:49,120 versus things that should be left to the offline tracker.
01:17:49,120 --> 01:17:52,800 Like maybe the computer will do all the triangulation and 3D reconstruction, but the human
01:17:52,800 --> 01:17:57,860 will say exactly these pixels of the image are a car, exactly these pixels are a human.
01:17:57,860 --> 01:18:03,680 And so co-designing the data annotation pipeline was very much the bread and butter of what I
01:18:03,680 --> 01:18:04,680 was doing daily.
01:18:04,680 --> 01:18:09,040 Do you think there's still a lot of open problems in that space?
01:18:09,040 --> 01:18:14,480 Just in general, annotation where the stuff the machines are good at, machines do and
01:18:14,480 --> 01:18:18,880 the humans do what they're good at and there's maybe some iterative process.
01:18:18,880 --> 01:18:19,880 Right.
01:18:19,880 --> 01:18:23,600 I think to a very large extent, we went through a number of iterations and we learned a ton
01:18:23,600 --> 01:18:26,200 about how to create these datasets.
01:18:26,200 --> 01:18:27,960 I'm not seeing big open problems.
01:18:27,960 --> 01:18:32,280 Like originally when I joined, I was like, I was really not sure how this would turn
01:18:32,280 --> 01:18:33,280 out.
01:18:33,280 --> 01:18:37,280 But by the time I left, I was much more secure and actually we sort of understand the philosophy
01:18:37,280 --> 01:18:38,680 of how to create these datasets.
01:18:38,680 --> 01:18:41,740 And I was pretty comfortable with where that was at the time.
01:18:41,740 --> 01:18:48,440 So what are strengths and limitations of cameras for the driving task in your understanding
01:18:48,440 --> 01:18:53,360 when you formulate the driving task as a vision task with eight cameras?
01:18:53,360 --> 01:18:57,080 You've seen that the entire, you know, most of the history of the computer vision field
01:18:57,080 --> 01:19:01,120 when it has to do with neural networks, what, just if you step back, what are the strengths
01:19:01,120 --> 01:19:05,440 and limitations of pixels of using pixels to drive?
01:19:05,440 --> 01:19:06,440 Yeah.
01:19:06,440 --> 01:19:10,520 Pixels, I think, are a beautiful sensor, I would say.
01:19:10,520 --> 01:19:14,080 The thing is like cameras are very, very cheap and they provide a ton of information, ton
01:19:14,080 --> 01:19:15,560 of bits.
01:19:15,560 --> 01:19:20,660 So it's an extremely cheap sensor for a ton of bits, and each one of these bits is a constraint
01:19:20,660 --> 01:19:21,960 on the state of the world.
01:19:21,960 --> 01:19:27,740 And so you get lots of megapixel images very cheap and it just gives you all these constraints
01:19:27,740 --> 01:19:30,100 for understanding what's actually out there in the world.
01:19:30,100 --> 01:19:34,520 So vision is probably the highest bandwidth sensor.
01:19:34,520 --> 01:19:36,440 It's a very high bandwidth sensor.
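To make "a ton of bits" concrete, a back-of-the-envelope estimate of raw multi-camera bandwidth. The resolution, frame rate, and bit depth below are assumed round numbers for illustration, not the actual camera specs.

```python
# Back-of-the-envelope raw bandwidth of a multi-camera rig.
# All numbers are illustrative assumptions, not real specs.
cameras = 8
width, height = 1280, 960   # ~1.2 megapixels per camera (assumed)
fps = 36                    # frames per second (assumed)
bits_per_pixel = 12         # raw sensor bit depth (assumed)

bits_per_second = cameras * width * height * fps * bits_per_pixel
print(f"{bits_per_second / 1e9:.1f} Gbit/s of raw pixel data")  # ~4.2 Gbit/s
```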
01:19:36,440 --> 01:19:43,880 And I love that pixels are a constraint on the world.
01:19:43,880 --> 01:19:49,920 This is highly complex, high bandwidth constraint on the world, on the state of the world.
01:19:49,920 --> 01:19:50,920 That's fascinating.
01:19:50,920 --> 01:19:55,440 It's not just that, but again, the real importance is that it's the sensor that humans
01:19:55,440 --> 01:19:56,440 use.
01:19:56,440 --> 01:19:59,200 Therefore everything is designed for that sensor.
01:19:59,200 --> 01:20:00,200 Yeah.
01:20:00,200 --> 01:20:04,380 The text, the writing, the flashing signs, everything is designed for vision.
01:20:04,380 --> 01:20:07,320 And so you just find it everywhere.
01:20:07,320 --> 01:20:10,360 And so that's why that is the interface you want to be in.
01:20:10,360 --> 01:20:12,620 Talking again about these universal interfaces.
01:20:12,620 --> 01:20:17,080 And that's where we actually want to measure the world as well and then develop software
01:20:17,080 --> 01:20:18,160 for that sensor.
01:20:18,160 --> 01:20:23,360 But there's other constraints on the state of the world that humans use to understand
01:20:23,360 --> 01:20:24,360 the world.
01:20:24,360 --> 01:20:31,760 I mean, vision ultimately is the main one, but we're like referencing our understanding
01:20:31,760 --> 01:20:38,320 of human behavior and some common sense physics that could be inferred from vision from a
01:20:38,320 --> 01:20:45,000 perception perspective, but it feels like we're using some kind of reasoning to predict
01:20:45,000 --> 01:20:46,000 the world.
01:20:46,000 --> 01:20:47,000 Yeah.
01:20:47,000 --> 01:20:48,000 And not just the pixels.
01:20:48,000 --> 01:20:52,120 I mean, you have a powerful prior for how the world evolves over time, et cetera.
01:20:52,120 --> 01:20:56,680 So it's not just about the likelihood term coming up from the data itself telling you
01:20:56,680 --> 01:21:00,840 about what you are observing, but also the prior term of like where are the likely things
01:21:00,840 --> 01:21:03,400 to see and how do they likely move and so on.
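One way to write down the likelihood-versus-prior split he's describing is the standard Bayesian decomposition; this is only an illustration of the framing, not a claim about how the actual system is factored.

```latex
p(\text{world state} \mid \text{pixels})
  \;\propto\;
  \underbrace{p(\text{pixels} \mid \text{world state})}_{\text{likelihood: what the data says}}
  \;\times\;
  \underbrace{p(\text{world state})}_{\text{prior: what tends to exist and how it moves}}
```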
01:21:03,400 --> 01:21:11,360 And the question is how complex is the range of possibilities that might happen in the
01:21:11,360 --> 01:21:12,360 driving task?
01:21:12,360 --> 01:21:13,360 Right.
01:21:13,360 --> 01:21:17,280 That's still, is that to you still an open problem of how difficult is driving?
01:21:17,280 --> 01:21:25,080 Like philosophically speaking, all the time you worked on driving, do you understand how
01:21:25,080 --> 01:21:26,080 hard driving is?
01:21:26,080 --> 01:21:27,080 Yeah.
01:21:27,080 --> 01:21:30,360 Driving is really hard because it has to do with the predictions of all these other agents
01:21:30,360 --> 01:21:34,480 and the theory of mind and you know what they're going to do and are they looking at you?
01:21:34,480 --> 01:21:35,480 Are they, where are they looking?
01:21:35,480 --> 01:21:36,480 What are they thinking?
01:21:36,480 --> 01:21:37,480 Yeah.
01:21:37,480 --> 01:21:41,760 There's a lot that goes on there at the full tail of, you know, the expansion
01:21:41,760 --> 01:21:45,400 of the nines. We have to be comfortable with the fact that eventually the final problems are
01:21:45,400 --> 01:21:46,400 of that form.
01:21:46,400 --> 01:21:48,640 I don't think those are the problems that are very common.
01:21:48,640 --> 01:21:52,240 I think eventually they're important, but it's like really in the tail end.
01:21:52,240 --> 01:21:58,240 In the tail end, the rare edge cases from the vision perspective, what are the toughest
01:21:58,240 --> 01:22:01,800 parts of the vision problem of driving?
01:22:01,800 --> 01:22:09,700 Well, basically the sensor is extremely powerful, but you still need to process that information.
01:22:09,700 --> 01:22:13,680 And so going from brightnesses of these pixel values to, hey, here is the three dimensional
01:22:13,680 --> 01:22:15,760 world, is extremely hard.
01:22:15,760 --> 01:22:18,440 And that's what the neural networks are fundamentally doing.
01:22:18,440 --> 01:22:24,440 And so the difficulty really is in just doing an extremely good job of engineering the entire
01:22:24,440 --> 01:22:30,000 pipeline, the entire data engine, having the capacity to train these neural nets, having
01:22:30,000 --> 01:22:33,840 the ability to evaluate the system and iterate on it.
01:22:33,840 --> 01:22:37,160 So I would say just doing this in production at scale is like the hard part.
01:22:37,160 --> 01:22:38,640 It's an execution problem.
01:22:38,640 --> 01:22:46,100 So the data engine, but also the sort of deployment of the system such that it has
01:22:46,100 --> 01:22:47,100 low latency performance.
01:22:47,100 --> 01:22:48,840 So it has to do all these steps.
01:22:48,840 --> 01:22:49,840 Yeah.
01:22:49,840 --> 01:22:52,600 For the neural net specifically, just making sure everything fits into the chip on the
01:22:52,600 --> 01:22:58,360 car and you have a finite budget of flops that you can perform and memory bandwidth
01:22:58,360 --> 01:23:02,080 and other constraints, and you have to make sure it flies and you can squeeze in as much
01:23:02,080 --> 01:23:04,240 compute as you can into the time.
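A tiny sketch of the budget arithmetic this implies: given a chip's throughput, a realistic utilization, and a target frame rate, you get a per-frame compute budget the network has to fit under. The throughput, utilization, and per-frame model cost below are assumptions for illustration, not the actual FSD computer numbers.

```python
# Illustrative budget check: does a model of a given per-frame cost hit the
# target frame rate on a chip of a given throughput? All numbers are assumed.
chip_tflops = 36.0             # usable throughput of the in-car accelerator (assumed)
utilization = 0.4              # fraction of peak you realistically achieve (assumed)
target_fps = 30                # frame rate the stack must sustain (assumed)

budget_per_frame = chip_tflops * 1e12 * utilization / target_fps   # FLOPs per frame
model_cost_per_frame = 300e9   # hypothetical per-frame cost of the vision net

print(f"budget: {budget_per_frame / 1e9:.0f} GFLOPs/frame")
print(f"model:  {model_cost_per_frame / 1e9:.0f} GFLOPs/frame")
print("fits" if model_cost_per_frame <= budget_per_frame else "does not fit")
```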
01:23:04,240 --> 01:23:05,800 What have you learned from that process?
01:23:05,800 --> 01:23:11,640 Because maybe that's one of the bigger, like new things coming from a research background
01:23:11,640 --> 01:23:16,320 where there's a system that has to run under heavily constrained resources, has to run
01:23:16,320 --> 01:23:17,760 really fast.
01:23:17,760 --> 01:23:20,400 What kind of insights have you learned from that?
01:23:20,400 --> 01:23:24,160 Yeah, I'm not sure if it's, if there's too many insights.
01:23:24,160 --> 01:23:28,560 You're trying to create a neural net that will fit in what you have available and you're
01:23:28,560 --> 01:23:30,060 always trying to optimize it.
01:23:30,060 --> 01:23:35,280 And we talked a lot about it on AI Day, and basically the triple backflips that the
01:23:35,280 --> 01:23:39,580 team is doing to make sure it all fits and utilizes the engine.
01:23:39,580 --> 01:23:42,380 So I think it's extremely good engineering.
01:23:42,380 --> 01:23:46,920 And then there's all kinds of little insights peppered in on how to do it properly.
01:23:46,920 --> 01:23:49,820 Let's actually zoom out, because I don't think we talked about the data engine,
01:23:49,820 --> 01:23:56,000 the entirety of the layout of this idea that I think is just beautiful, with humans in the
01:23:56,000 --> 01:23:57,000 loop.
01:23:57,000 --> 01:23:59,080 Can you describe the data engine?
01:23:59,080 --> 01:24:06,560 Yeah, the data engine is what I call the almost biological-feeling process by which you
01:24:06,560 --> 01:24:10,320 perfect the training sets for these neural networks.
01:24:10,320 --> 01:24:13,760 Because most of the programming now is at the level of these data sets, making sure
01:24:13,760 --> 01:24:16,240 they're large, diverse, and clean.
01:24:16,240 --> 01:24:19,380 Basically you have a data set that you think is good.
01:24:19,380 --> 01:24:23,920 You train your neural net, you deploy it, and then you observe how well it's performing
01:24:23,920 --> 01:24:27,360 and you're trying to always increase the quality of your data set.
01:24:27,360 --> 01:24:32,240 So you're trying to catch scenarios that are rare.
01:24:32,240 --> 01:24:35,720 And it is in these scenarios that neural nets will typically struggle in because they weren't
01:24:35,720 --> 01:24:38,840 told what to do in those rare cases in the data set.
01:24:38,840 --> 01:24:42,880 But now you can close the loop because if you can now collect all those at scale, you
01:24:42,880 --> 01:24:47,760 can then feed them back into the reconstruction process I described and reconstruct the truth
01:24:47,760 --> 01:24:49,800 in those cases and add it to the data set.
01:24:49,800 --> 01:24:54,600 And so the whole thing ends up being like a staircase of improvement of perfecting your
01:24:54,600 --> 01:24:55,600 training set.
01:24:55,600 --> 01:25:00,400 And you have to go through deployments so that you can mine the parts that are not yet
01:25:00,400 --> 01:25:02,600 represented well in the data set.
01:25:02,600 --> 01:25:03,840 So your data set is basically imperfect.
01:25:03,840 --> 01:25:04,840 It needs to be diverse.
01:25:04,840 --> 01:25:08,460 It has pockets that are missing and you need to pad out the pockets.
01:25:08,460 --> 01:25:11,760 You can sort of think of it that way in the data.
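A minimal sketch of the data engine loop as just described: train, deploy, mine the rare scenarios the deployed model struggles with, auto-label them with the offline reconstruction, fold them back into the training set, and repeat. Every function argument here is a hypothetical placeholder.

```python
# Sketch of the data engine loop; all callables are hypothetical placeholders
# passed in by the caller, not a real API.
def data_engine(dataset, train, deploy, mine_rare_failures, offline_reconstruct,
                rounds=10):
    model = train(dataset)
    for _ in range(rounds):
        deploy(model)                                   # run in the fleet / shadow mode
        hard_clips = mine_rare_failures(model)          # scenarios it struggles with
        new_pairs = [(c, offline_reconstruct(c)) for c in hard_clips]
        dataset = dataset + new_pairs                   # pad out the missing pockets
        model = train(dataset)                          # retrain on the improved set
    return model
```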
01:25:11,760 --> 01:25:13,260 What role do humans play in this?
01:25:13,260 --> 01:25:20,880 So what is this biological system like? A human body is made up of cells. What role do humans play, and how
01:25:20,880 --> 01:25:27,920 do you optimize the human system, the multiple engineers collaborating, figuring out what
01:25:27,920 --> 01:25:35,560 to focus on, what to contribute, which task to optimize in this neural network? Who's
01:25:35,560 --> 01:25:41,800 in charge of figuring out which task needs more data? Can you speak to the hyperparameters
01:25:41,800 --> 01:25:44,600 of the human system?
01:25:44,600 --> 01:25:47,520 It really just comes down to extremely good execution from an engineering team who knows
01:25:47,520 --> 01:25:48,520 what they're doing.
01:25:48,520 --> 01:25:52,200 They understand intuitively the philosophical insights underlying the data engine and the
01:25:52,200 --> 01:25:58,120 process by which the system improves and how to, again, like delegate the strategy of the
01:25:58,120 --> 01:26:02,160 data collection and how that works and then just making sure it's all extremely well executed.
01:26:02,160 --> 01:26:06,000 And that's where most of the work is not even the philosophizing or the research or the
01:26:06,000 --> 01:26:07,000 ideas of it.
01:26:07,000 --> 01:26:10,840 It's just that extremely good execution is so hard when you're dealing with data at that scale.
01:26:10,840 --> 01:26:16,400 So your role in the data engine, executing well on it is difficult and extremely important.
01:26:16,400 --> 01:26:23,300 Is there a priority list, like a vision board, of saying, we really need to get better
01:26:23,300 --> 01:26:29,120 at stoplights? The prioritization of tasks, does that essentially come
01:26:29,120 --> 01:26:30,700 from the data?
01:26:30,700 --> 01:26:35,040 That comes to a very large extent from what we are trying to achieve in the product roadmap,
01:26:35,040 --> 01:26:39,600 the release we're trying to get out, and the feedback from the QA team
01:26:39,600 --> 01:26:42,880 about where the system is struggling or not, the things that we're trying to improve.
01:26:42,880 --> 01:26:49,520 And the QA team gives some signal, some information in aggregate about the performance of the
01:26:49,520 --> 01:26:50,520 system in various conditions.
01:26:50,520 --> 01:26:51,520 That's right.
01:26:51,520 --> 01:26:53,160 And then of course all of us drive it and we can also see it.
01:26:53,160 --> 01:26:57,320 It's really nice to work with a system that you can also experience yourself and it drives
01:26:57,320 --> 01:26:58,320 you home.
01:26:58,320 --> 01:27:02,880 Is there some insight you can draw from your individual experience that you just can't
01:27:02,880 --> 01:27:06,840 quite get from an aggregate statistical analysis of data?
01:27:06,840 --> 01:27:07,840 Yeah.
01:27:07,840 --> 01:27:08,840 It's so weird, right?
01:27:08,840 --> 01:27:09,840 Yeah.
01:27:09,840 --> 01:27:13,560 It's not scientific in a sense because you're just one anecdotal sample.
01:27:13,560 --> 01:27:14,560 Yeah.
01:27:14,560 --> 01:27:17,400 I think there's a ton of value there, it's a source of truth.
01:27:17,400 --> 01:27:20,560 It's your interaction with the system and you can see it, you can play with it, you
01:27:20,560 --> 01:27:24,640 can perturb it, you can get a sense of it, you have an intuition for it.
01:27:24,640 --> 01:27:30,280 I think numbers and plots and graphs are much harder.
01:27:30,280 --> 01:27:31,600 They hide a lot of...
01:27:31,600 --> 01:27:38,280 It's like if you train a language model, a really powerful way to understand it is by interacting
01:27:38,280 --> 01:27:39,280 with it.
01:27:39,280 --> 01:27:40,280 Yeah.
01:27:40,280 --> 01:27:41,280 100%.
01:27:41,280 --> 01:27:42,280 Just try to build up an intuition.
01:27:42,280 --> 01:27:43,280 Yeah.
01:27:43,280 --> 01:27:45,320 I think like Elon also, he always wanted to drive the system himself.
01:27:45,320 --> 01:27:49,100 He drives a lot and I want to say almost daily.
01:27:49,100 --> 01:27:55,080 So he also sees this as a source of truth, you driving the system and it performing and
01:27:55,080 --> 01:27:56,080 yeah.
01:27:56,080 --> 01:27:57,920 So what do you think?
01:27:57,920 --> 01:28:00,120 Tough questions here.
01:28:00,120 --> 01:28:06,320 So Tesla last year removed radar from the sensor suite and now just announced that it's
01:28:06,320 --> 01:28:10,920 going to remove all ultrasonic sensors relying solely on vision.
01:28:10,920 --> 01:28:13,400 So camera only.
01:28:13,400 --> 01:28:18,040 Does that make the perception problem harder or easier?
01:28:18,040 --> 01:28:20,160 I would almost reframe the question in some way.
01:28:20,160 --> 01:28:23,520 So the thing is basically you would think that additional sensors-
01:28:23,520 --> 01:28:25,080 By the way, can I just interrupt?
01:28:25,080 --> 01:28:26,080 Go ahead.
01:28:26,080 --> 01:28:29,200 I wonder if a language model will ever do that if you prompt it.
01:28:29,200 --> 01:28:30,640 Let me reframe your question.
01:28:30,640 --> 01:28:32,560 That would be epic.
01:28:32,560 --> 01:28:33,560 This is the wrong prompt.
01:28:33,560 --> 01:28:34,560 Sorry.
01:28:34,560 --> 01:28:38,080 It's like a little bit of a wrong question because basically you would think that these
01:28:38,080 --> 01:28:45,240 sensors are an asset to you, but if you fully consider the entire product in its entirety,
01:28:45,240 --> 01:28:49,880 these sensors are actually potentially a liability because these sensors aren't free.
01:28:49,880 --> 01:28:51,400 They don't just appear on your car.
01:28:51,400 --> 01:28:55,160 You need suddenly you have an entire supply chain, you have people procuring it.
01:28:55,160 --> 01:28:56,640 There can be problems with them.
01:28:56,640 --> 01:28:57,920 They may need replacement.
01:28:57,920 --> 01:28:59,120 They are part of the manufacturing process.
01:28:59,120 --> 01:29:01,800 They can hold back the line in production.
01:29:01,800 --> 01:29:02,800 You need to source them.
01:29:02,800 --> 01:29:03,800 You need to maintain them.
01:29:03,800 --> 01:29:06,880 You need to have teams that write the firmware, all of it.
01:29:06,880 --> 01:29:10,160 And then you also have to incorporate them, fuse them into the system in some way.
01:29:10,160 --> 01:29:13,800 And so it actually like bloats a lot of it.
01:29:13,800 --> 01:29:17,120 And I think Elon is really good at simplify, simplify.
01:29:17,120 --> 01:29:18,480 Best part is no part.
01:29:18,480 --> 01:29:21,440 And he always tries to throw away things that are not essential because he understands the
01:29:21,440 --> 01:29:24,020 entropy in organizations and in the approach.
01:29:24,020 --> 01:29:28,240 And I think in this case, the cost is high and you're not potentially seeing it if you're
01:29:28,240 --> 01:29:29,840 just a computer vision engineer.
01:29:29,840 --> 01:29:34,320 And I'm just trying to improve my network and you know, is it more useful or less useful?
01:29:34,320 --> 01:29:35,440 How useful is it?
01:29:35,440 --> 01:29:39,260 And the thing is, once you consider the full cost of a sensor, it actually is potentially
01:29:39,260 --> 01:29:43,920 a liability and you need to be really sure that it's giving you extremely useful information.
01:29:43,920 --> 01:29:48,220 In this case, we looked at using it or not using it, and the delta was not massive.
01:29:48,220 --> 01:29:49,680 And so it's not useful.
01:29:49,680 --> 01:29:56,220 Is it also bloat in the data engine, like having more sensors is a distraction?
01:29:56,220 --> 01:29:57,840 And these sensors, you know, they can change over time.
01:29:57,840 --> 01:30:01,120 For example, you can have one type of say radar, you can have other type of radar, they
01:30:01,120 --> 01:30:02,120 change over time.
01:30:02,120 --> 01:30:03,120 Now suddenly you need to worry about it.
01:30:03,120 --> 01:30:07,060 Now suddenly you have a column in your SQLite telling you, oh, what sensor type was it?
01:30:07,060 --> 01:30:08,880 And they all have different distributions.
01:30:08,880 --> 01:30:15,380 And then they contribute noise and entropy into everything and they bloat stuff.
01:30:15,380 --> 01:30:20,700 And also, organizationally, it has been really fascinating to me that it can be very distracting.
01:30:20,700 --> 01:30:25,600 If all you want to get to work is vision, all the resources are on it and you're building
01:30:25,600 --> 01:30:30,980 out a data engine and you're actually making forward progress because that is the sensor
01:30:30,980 --> 01:30:33,760 with the most bandwidth, the most constraints on the world.
01:30:33,760 --> 01:30:36,520 And you're investing fully into that and you can make that extremely good.
01:30:36,520 --> 01:30:41,860 You only have a finite amount of sort of spend of focus across different facets of
01:30:41,860 --> 01:30:43,200 the system.
01:30:43,200 --> 01:30:49,520 And this kind of reminds me of Rich Sutton's The Bitter Lesson, that just seems like simplifying
01:30:49,520 --> 01:30:50,520 the system.
01:30:50,520 --> 01:30:51,520 Yeah.
01:30:51,520 --> 01:30:52,520 In the long run.
01:30:52,520 --> 01:30:56,240 Now, of course, you don't know, but in the long run it seems to always be the right solution.
01:30:56,240 --> 01:30:57,240 Yeah.
01:30:57,240 --> 01:30:58,240 Yes.
01:30:58,240 --> 01:31:01,080 In that case, it was for RL, but it seems to apply generally across all systems that
01:31:01,080 --> 01:31:02,080 do computation.
01:31:02,080 --> 01:31:03,080 Yeah.
01:31:03,080 --> 01:31:09,600 So what do you think about the LIDAR as a crutch debate, the battle between point clouds
01:31:09,600 --> 01:31:10,600 and pixels?
01:31:10,600 --> 01:31:11,600 Yeah.
01:31:11,600 --> 01:31:15,680 I think this debate is always slightly confusing to me because it seems like the actual debate
01:31:15,680 --> 01:31:18,280 should be about like, do you have the fleet or not?
01:31:18,280 --> 01:31:21,940 That's like the really important thing about whether you can achieve a really good functioning
01:31:21,940 --> 01:31:24,080 of an AI system at this scale.
01:31:24,080 --> 01:31:25,480 So data collection systems.
01:31:25,480 --> 01:31:26,480 Yeah.
01:31:26,480 --> 01:31:29,760 Do you have a fleet or not is significantly more important whether you have LIDAR or not.
01:31:29,760 --> 01:31:32,400 It's just another sensor.
01:31:32,400 --> 01:31:41,000 And yeah, I think similar to the radar discussion, basically, I don't think it
01:31:41,000 --> 01:31:44,040 offers extra information.
01:31:44,040 --> 01:31:45,080 It's extremely costly.
01:31:45,080 --> 01:31:46,080 It has all kinds of problems.
01:31:46,080 --> 01:31:47,080 You have to worry about it.
01:31:47,080 --> 01:31:48,080 You have to calibrate it, et cetera.
01:31:48,080 --> 01:31:49,080 It creates bloat and entropy.
01:31:49,080 --> 01:31:53,080 You have to be really sure that you need this sensor.
01:31:53,080 --> 01:31:55,120 In this case, I basically don't think you need it.
01:31:55,120 --> 01:31:57,280 And I think, honestly, I will make a stronger statement.
01:31:57,280 --> 01:32:01,400 I think the others, some of the other companies who are using it are probably going to drop
01:32:01,400 --> 01:32:02,400 it.
01:32:02,400 --> 01:32:03,400 Yeah.
01:32:03,400 --> 01:32:10,280 So you have to consider the sensor in the full picture: can you build a big fleet
01:32:10,280 --> 01:32:15,840 that collects a lot of data, and can you integrate that sensor and that data
01:32:15,840 --> 01:32:21,080 into a data engine that's able to quickly find the different parts of the data that then
01:32:21,080 --> 01:32:24,120 continuously improve whatever model you're using.
01:32:24,120 --> 01:32:25,120 Yeah.
01:32:25,120 --> 01:32:30,520 Another way to look at it is that vision is necessary in the sense that the world is designed
01:32:30,520 --> 01:32:31,520 for human visual consumption.
01:32:31,520 --> 01:32:32,520 So you need vision.
01:32:32,520 --> 01:32:33,920 It's necessary.
01:32:33,920 --> 01:32:38,280 And then also it is sufficient because it has all the information that you need for
01:32:38,280 --> 01:32:40,940 driving, and humans obviously use vision to drive.
01:32:40,940 --> 01:32:42,560 So it's both necessary and sufficient.
01:32:42,560 --> 01:32:43,820 So you want to focus resources.
01:32:43,820 --> 01:32:48,800 And you have to be really sure if you're going to bring in other sensors, you could add sensors
01:32:48,800 --> 01:32:49,800 to infinity.
01:32:49,800 --> 01:32:51,180 At some point, you need to draw the line.
01:32:51,180 --> 01:32:55,840 And I think in this case, you have to really consider the full cost of any one sensor that
01:32:55,840 --> 01:32:58,840 you're adopting and do you really need it?
01:32:58,840 --> 01:33:00,880 And I think the answer in this case is no.
01:33:00,880 --> 01:33:06,840 So what do you think about the idea that the other companies are forming high resolution
01:33:06,840 --> 01:33:11,800 maps and constraining heavily the geographic regions in which they operate?
01:33:11,800 --> 01:33:19,760 Is that approach, in your view, not going to scale over time to the entirety of the
01:33:19,760 --> 01:33:20,760 United States?
01:33:20,760 --> 01:33:21,760 Yeah.
01:33:21,760 --> 01:33:25,040 I think as you mentioned, like they pre-map all the environments and they need to refresh
01:33:25,040 --> 01:33:29,080 the map and they have a perfect centimeter level accuracy map of everywhere they're going
01:33:29,080 --> 01:33:30,080 to drive.
01:33:30,080 --> 01:33:31,080 It's crazy.
01:33:31,080 --> 01:33:32,080 How are you going to...
01:33:32,080 --> 01:33:35,160 When we're talking about autonomy actually changing the world, we're talking about the
01:33:35,160 --> 01:33:40,560 deployment on a global scale of autonomous systems for transportation.
01:33:40,560 --> 01:33:45,520 And if you need to maintain a centimeter accurate map for earth or like for many cities and
01:33:45,520 --> 01:33:50,360 keep them updated, it's a huge dependency that you're taking on, huge dependency.
01:33:50,360 --> 01:33:51,620 It's a massive, massive dependency.
01:33:51,620 --> 01:33:54,640 And now you need to ask yourself, do you really need it?
01:33:54,640 --> 01:33:57,420 And humans don't need it, right?
01:33:57,420 --> 01:34:01,820 So it's very useful to have a low level map of like, okay, the connectivity of your road,
01:34:01,820 --> 01:34:04,760 you know that there's a fork coming up when you drive an environment, you sort of have
01:34:04,760 --> 01:34:05,760 that high level understanding.
01:34:05,760 --> 01:34:11,360 It's like a small Google Map, and Tesla uses Google Maps-like, similar kind of resolution
01:34:11,360 --> 01:34:16,480 information in the system, but it will not pre-map environments to centimeter level accuracy.
01:34:16,480 --> 01:34:20,880 It's a crutch, it's a distraction, it costs entropy and it diffuses the team, it dilutes
01:34:20,880 --> 01:34:21,880 the team.
01:34:21,880 --> 01:34:26,560 And you're not focusing on what's actually necessary, which is the computer vision problem.
01:34:26,560 --> 01:34:32,000 What did you learn about machine learning, about engineering, about life, about yourself
01:34:32,000 --> 01:34:36,520 as one human being from working with Elon Musk?
01:34:36,520 --> 01:34:41,080 I think the most I've learned is about how to sort of run organizations efficiently and
01:34:41,080 --> 01:34:46,360 how to create efficient organizations and how to fight entropy in an organization.
01:34:46,360 --> 01:34:49,360 So human engineering in the fight against entropy.
01:34:49,360 --> 01:34:50,360 Yeah.
01:34:50,360 --> 01:34:56,340 There's a, I think Elon is a very efficient warrior in the fight against entropy in organizations.
01:34:56,340 --> 01:34:59,140 What does entropy in an organization look like exactly?
01:34:59,140 --> 01:35:06,200 It's process, it's process and inefficiencies in the form of meetings and that kind of stuff.
01:35:06,200 --> 01:35:07,200 Yeah.
01:35:07,200 --> 01:35:08,200 Meetings.
01:35:08,200 --> 01:35:09,200 He hates meetings.
01:35:09,200 --> 01:35:11,000 He keeps telling people to skip meetings if they're not useful.
01:35:11,000 --> 01:35:15,440 He basically runs the world's biggest startups, I would say.
01:35:15,440 --> 01:35:17,720 Tesla and SpaceX are the world's biggest startups.
01:35:17,720 --> 01:35:19,600 Tesla actually has multiple startups.
01:35:19,600 --> 01:35:21,560 I think it's better to look at it that way.
01:35:21,560 --> 01:35:27,880 And so I think he's extremely good at that and yeah, he has a very good intuition for
01:35:27,880 --> 01:35:34,240 streamlining processes, making everything efficient, best part is no part, simplifying, focusing
01:35:34,240 --> 01:35:38,160 and just kind of removing barriers, moving very quickly, making big moves.
01:35:38,160 --> 01:35:41,640 All of this is very startupy sort of seeming things, but at scale.
01:35:41,640 --> 01:35:44,360 So strong drive to simplify.
01:35:44,360 --> 01:35:49,800 From your perspective, I mean that also probably applies to just designing systems and machine
01:35:49,800 --> 01:35:50,800 learning and otherwise.
01:35:50,800 --> 01:35:51,800 Yeah.
01:35:51,800 --> 01:35:52,800 Like simplify, simplify.
01:35:52,800 --> 01:35:53,800 Yes.
01:35:53,800 --> 01:35:59,240 What do you think is the secret to maintaining the startup culture in a company that grows?
01:35:59,240 --> 01:36:03,840 Is there, can you introspect that?
01:36:03,840 --> 01:36:08,080 I do think you need someone in a powerful position with a big hammer, like Elon, who's
01:36:08,080 --> 01:36:12,840 like the cheerleader for that idea and ruthlessly pursues it.
01:36:12,840 --> 01:36:18,440 If no one has a big enough hammer, everything turns into committees, democracy within the
01:36:18,440 --> 01:36:23,840 company, process, talking to stakeholders, decision making, just everything just crumbles.
01:36:23,840 --> 01:36:30,040 If you have a big person who is also really smart and has a big hammer, things move quickly.
01:36:30,040 --> 01:36:35,040 So you said your favorite scene in Interstellar is the intense docking scene with the AI and
01:36:35,040 --> 01:36:38,760 Cooper talking saying, Cooper, what are you doing?
01:36:38,760 --> 01:36:40,320 Docking, it's not possible.
01:36:40,320 --> 01:36:43,080 No, it's necessary.
01:36:43,080 --> 01:36:44,080 Such a good line.
01:36:44,080 --> 01:36:52,600 But just so many questions there: why is an AI in that scene, which presumably is supposed to be
01:36:52,600 --> 01:36:56,480 able to compute a lot more than the human, saying it's not optimal?
01:36:56,480 --> 01:37:01,400 Why the human, I mean, that's a movie, but shouldn't the AI know much better than the
01:37:01,400 --> 01:37:02,400 human?
01:37:02,400 --> 01:37:07,700 Anyway, what do you think is the value of setting seemingly impossible goals?
01:37:07,700 --> 01:37:15,160 So like our initial intuition, which seems like something that you have taken on that
01:37:15,160 --> 01:37:21,680 Elon espouses that where the initial intuition of the community might say this is very difficult
01:37:21,680 --> 01:37:25,040 and then you take it on anyway with a crazy deadline.
01:37:25,040 --> 01:37:32,400 You just from a human engineering perspective, have you seen the value of that?
01:37:32,400 --> 01:37:36,280 I wouldn't say that setting impossible goals exactly is a good idea, but I think setting
01:37:36,280 --> 01:37:38,360 very ambitious goals is a good idea.
01:37:38,360 --> 01:37:43,820 I think there's what I call sublinear scaling of difficulty, which means that 10x problems
01:37:43,820 --> 01:37:45,560 are not 10x hard.
01:37:45,560 --> 01:37:51,320 Usually a 10x harder problem is like two or three x harder to execute on.
01:37:51,320 --> 01:37:55,760 Because if you want to improve a system by 10%, it costs some amount of work.
01:37:55,760 --> 01:38:00,600 And if you want to 10x improve the system, it doesn't cost 100x amount of work.
01:38:00,600 --> 01:38:02,960 And it's because you fundamentally change the approach.
01:38:02,960 --> 01:38:06,400 If you start with that constraint, then some approaches are obviously dumb and not going
01:38:06,400 --> 01:38:07,400 to work.
01:38:07,400 --> 01:38:09,840 And it forces you to re-evaluate.
01:38:09,840 --> 01:38:14,000 And I think it's a very interesting way of approaching problem solving.
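One hedged way to formalize the sublinear scaling claim: if the work to solve a k-times-harder problem grows like a power of k with exponent below one, then a 10x goal costs only a few times the baseline work. The power-law form and the exponent range are an illustrative model of the 2-3x figure he gives, not something stated in the conversation.

```latex
W(k) \approx W(1)\, k^{\alpha}, \quad 0 < \alpha < 1
\qquad\Rightarrow\qquad
\frac{W(10)}{W(1)} \approx 10^{\alpha} \approx 2\text{--}3
\quad \text{for } \alpha \approx 0.3\text{--}0.5 .
```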
01:38:14,000 --> 01:38:19,680 But it requires a weird kind of thinking. Just going back to your PhD days, it's
01:38:19,680 --> 01:38:27,000 like, how do you think about which ideas in the machine learning community are solvable?
01:38:27,000 --> 01:38:28,000 Yes.
01:38:28,000 --> 01:38:33,200 It requires, what is that, I mean, there's a cliche of first principles thinking, but
01:38:33,200 --> 01:38:38,240 like, it requires you to basically ignore what the community is saying, because doesn't a
01:38:38,240 --> 01:38:45,000 community in science usually draw lines of what is and isn't possible?
01:38:45,000 --> 01:38:48,280 And like, it's very hard to break out of that without going crazy.
01:38:48,280 --> 01:38:49,280 Yeah.
01:38:49,280 --> 01:38:52,160 I mean, I think a good example here is, you know, the deep learning revolution in some
01:38:52,160 --> 01:38:57,600 sense, because you could be in computer vision at that time when during the deep learning
01:38:57,600 --> 01:39:01,880 sort of revolution of 2012, and so on, you could be improving a computer vision stack
01:39:01,880 --> 01:39:02,880 by 10%.
01:39:02,880 --> 01:39:06,040 Or we can just be saying, actually, all this is useless.
01:39:06,040 --> 01:39:07,720 And how do I do 10x better computer vision?
01:39:07,720 --> 01:39:12,760 Well, it's probably not by tuning a HOG feature detector, I need a different approach.
01:39:12,760 --> 01:39:17,840 I need something that is scalable, going back to Richard Sutton, and understanding sort
01:39:17,840 --> 01:39:21,160 of the philosophy of The Bitter Lesson.
01:39:21,160 --> 01:39:24,400 And then being like, actually, I need much more scalable system, like a neural network,
01:39:24,400 --> 01:39:28,440 that in principle works, and then having some deep believers that can actually execute on
01:39:28,440 --> 01:39:29,600 that mission and make it work.
01:39:29,600 --> 01:39:34,320 So that's the 10x solution.
01:39:34,320 --> 01:39:38,720 What do you think is the timeline to solve the problem of autonomous driving?
01:39:38,720 --> 01:39:42,000 That's still in part an open question.
01:39:42,000 --> 01:39:46,720 Yeah, I think the tough thing with timelines of self driving, obviously, is that no one
01:39:46,720 --> 01:39:48,240 has created self driving.
01:39:48,240 --> 01:39:49,240 Yeah.
01:39:49,240 --> 01:39:52,080 So it's not like, what do you think is the timeline to build this bridge?
01:39:52,080 --> 01:39:57,280 Well, we've built a million bridges before, here's how long that takes.
01:39:57,280 --> 01:40:00,400 No one has built autonomy, it's not obvious.
01:40:00,400 --> 01:40:04,080 Some parts turn out to be much easier than others, so it's really hard to forecast.
01:40:04,080 --> 01:40:07,840 You do your best based on trend lines and so on, and based on intuition.
01:40:07,840 --> 01:40:10,960 But that's why fundamentally, it's just really hard to forecast this.
01:40:10,960 --> 01:40:11,960 No one has built it.
01:40:11,960 --> 01:40:14,920 So even still, like being inside of it, it's hard to do?
01:40:14,920 --> 01:40:15,920 Yes.
01:40:15,920 --> 01:40:19,060 Some things turn out to be much harder, and some things turn out to be much easier.
01:40:19,060 --> 01:40:24,560 Do you try to avoid making forecasts, because Elon doesn't avoid them, right?
01:40:24,560 --> 01:40:29,520 And heads of car companies in the past have not avoided it either.
01:40:29,520 --> 01:40:33,900 Ford and other places have made predictions that we're going to solve at level four driving
01:40:33,900 --> 01:40:36,720 by 2020, 2021, whatever.
01:40:36,720 --> 01:40:42,560 And now they're all kind of backtracking that prediction.
01:40:42,560 --> 01:40:49,560 As an AI person, do you for yourself privately make predictions, or do they get in the way
01:40:49,560 --> 01:40:53,360 of your actual ability to think about a thing?
01:40:53,360 --> 01:40:58,400 Yeah, I would say what's easy to say is that this problem is tractable, and that's an easy
01:40:58,400 --> 01:40:59,400 prediction to make.
01:40:59,400 --> 01:41:00,720 It's tractable, it's going to work.
01:41:00,720 --> 01:41:02,280 Yes, it's just really hard.
01:41:02,280 --> 01:41:06,520 Some things turn out to be harder, and some things turn out to be easier.
01:41:06,520 --> 01:41:10,600 But it definitely feels tractable, and it feels like, at least the team at Tesla, which
01:41:10,600 --> 01:41:13,440 is what I saw internally, is definitely on track to that.
01:41:13,440 --> 01:41:20,780 How do you form a strong representation that allows you to make a prediction about tractability?
01:41:20,780 --> 01:41:29,480 So you're the leader of a lot of humans, you have to kind of say, this is actually possible.
01:41:29,480 --> 01:41:31,200 How do you build up that intuition?
01:41:31,200 --> 01:41:36,440 It doesn't have to be even driving, it could be other tasks.
01:41:36,440 --> 01:41:39,040 What difficult tasks did you work on in your life?
01:41:39,040 --> 01:41:45,600 Achieving, say on ImageNet, a certain level of superhuman performance?
01:41:45,600 --> 01:41:51,000 Yeah, expert intuition, it's just intuition, it's belief.
01:41:51,000 --> 01:41:56,900 So just thinking about it long enough, studying, looking at sample data, like you said, driving.
01:41:56,900 --> 01:42:01,720 My intuition is really flawed on this, I don't have a good intuition about tractability.
01:42:01,720 --> 01:42:08,000 It could be anything, it could be solvable.
01:42:08,000 --> 01:42:14,480 The driving task could be simplified into something quite trivial, like the solution
01:42:14,480 --> 01:42:16,600 to the problem would be quite trivial.
01:42:16,600 --> 01:42:22,660 And at scale, more and more cars driving perfectly might make the problem much easier.
01:42:22,660 --> 01:42:27,600 The more cars you have driving, like people learn how to drive correctly, not correctly,
01:42:27,600 --> 01:42:34,880 but in a way that's more optimal for a heterogeneous system of autonomous and semi-autonomous and
01:42:34,880 --> 01:42:37,280 manually driven cars, that could change stuff.
01:42:37,280 --> 01:42:42,480 And again, also I've spent a ridiculous number of hours just staring at pedestrians crossing
01:42:42,480 --> 01:42:51,680 streets, thinking about humans, and it feels like the way we use our eye contact, it sends
01:42:51,680 --> 01:42:55,240 really strong signals and there's certain quirks and edge cases of behavior.
01:42:55,240 --> 01:43:01,560 And of course, a lot of the fatalities that happen have to do with drunk driving and both
01:43:01,560 --> 01:43:05,800 on the pedestrian side and the driver's side, so there's that problem of driving at night
01:43:05,800 --> 01:43:06,800 and all that kind of.
01:43:06,800 --> 01:43:13,040 So I wonder, it's like the space of possible solutions to autonomous driving includes so
01:43:13,040 --> 01:43:17,720 many human factor issues that it's almost impossible to predict.
01:43:17,720 --> 01:43:20,600 There could be super clean, nice solutions.
01:43:20,600 --> 01:43:21,600 Yeah.
01:43:21,600 --> 01:43:25,480 I would say definitely like to use a game analogy, there's some fog of war, but you
01:43:25,480 --> 01:43:30,040 definitely also see the frontier of improvement and you can measure historically how much
01:43:30,040 --> 01:43:31,580 you've made progress.
01:43:31,580 --> 01:43:35,480 And I think for example, at least what I've seen in roughly five years at Tesla, when
01:43:35,480 --> 01:43:38,880 I joined, it barely kept lane on the highway.
01:43:38,880 --> 01:43:42,280 I think going up from Palo Alto to SF was like three or four interventions.
01:43:42,280 --> 01:43:47,240 Anytime the road would do anything geometrically or turn too much, it would just like not work.
01:43:47,240 --> 01:43:50,960 And so going from that to like a pretty competent system in five years and seeing what happens
01:43:50,960 --> 01:43:54,560 also under the hood and what the scale at which the team is operating now with respect
01:43:54,560 --> 01:43:59,680 to data and compute and everything else is just a massive progress.
01:43:59,680 --> 01:44:05,840 So you're climbing a mountain and it's fog, but you're making a lot of progress.
01:44:05,840 --> 01:44:08,400 You're making progress and you see what the next directions are and you're looking at
01:44:08,400 --> 01:44:12,900 some of the remaining challenges and they're not perturbing you and they're not changing
01:44:12,900 --> 01:44:15,880 your philosophy and you're not contorting yourself.
01:44:15,880 --> 01:44:18,120 You're like, actually, these are the things that we still need to do.
01:44:18,120 --> 01:44:21,440 Yeah, the fundamental components of solving the problem seem to be there from the data
01:44:21,440 --> 01:44:25,720 engine to the compute, to the compute on the car, to the compute for the training, all
01:44:25,720 --> 01:44:27,340 that kind of stuff.
01:44:27,340 --> 01:44:33,240 So you've done, over the years you've been at Tesla, you've done a lot of amazing breakthrough
01:44:33,240 --> 01:44:40,320 ideas and engineering, all of it, from the data engine to the human side, all of it.
01:44:40,320 --> 01:44:44,240 Can you speak to why you chose to leave Tesla?
01:44:44,240 --> 01:44:48,000 Basically, as I described, I think over time during those five years, I've kind
01:44:48,000 --> 01:44:52,640 of gotten myself into a little bit of a managerial position.
01:44:52,640 --> 01:44:57,440 Most of my days were meetings and growing the organization and making decisions about
01:44:57,440 --> 01:45:04,480 high level strategic decisions about the team and what it should be working on and so on.
01:45:04,480 --> 01:45:07,160 It's kind of like a corporate executive role and I can do it.
01:45:07,160 --> 01:45:11,240 I think I'm okay at it, but it's not fundamentally what I enjoy.
01:45:11,240 --> 01:45:16,000 And so I think when I joined, there was no computer vision team because Tesla was just
01:45:16,000 --> 01:45:19,380 going through the transition from using Mobileye, a third party vendor, for all of its computer
01:45:19,380 --> 01:45:21,960 vision, to having to build its own computer vision system.
01:45:21,960 --> 01:45:25,200 So when I showed up, there were two people training deep neural networks and they were
01:45:25,200 --> 01:45:30,680 training them at a computer at their legs, like down at the workstation.
01:45:30,680 --> 01:45:32,840 A basic classification task.
01:45:32,840 --> 01:45:33,840 Yeah.
01:45:33,840 --> 01:45:38,480 And so I kind of like grew that into what I think is a fairly respectable deep learning
01:45:38,480 --> 01:45:43,240 team, a massive compute cluster, a very good data annotation organization.
01:45:43,240 --> 01:45:45,360 And I was very happy with where that was.
01:45:45,360 --> 01:45:46,820 It became quite autonomous.
01:45:46,820 --> 01:45:51,920 And so I kind of stepped away and I'm very excited to do much more technical things again.
01:45:51,920 --> 01:45:53,080 Yeah.
01:45:53,080 --> 01:45:54,840 And kind of like refocus on AGI.
01:45:54,840 --> 01:45:56,520 What was this soul searching like?
01:45:56,520 --> 01:46:00,160 Because you took a little time off to think, like, how many mushrooms did you
01:46:00,160 --> 01:46:01,160 take?
01:46:01,160 --> 01:46:02,160 No, I'm just kidding.
01:46:02,160 --> 01:46:03,840 I mean, what was going through your mind?
01:46:03,840 --> 01:46:05,840 The human lifetime is finite.
01:46:05,840 --> 01:46:06,840 Yeah.
01:46:06,840 --> 01:46:09,160 You did a few incredible things here.
01:46:09,160 --> 01:46:11,800 You're one of the best teachers of AI in the world.
01:46:11,800 --> 01:46:13,560 You're one of the best.
01:46:13,560 --> 01:46:14,560 And I don't mean that.
01:46:14,560 --> 01:46:20,000 I mean that in the best possible way, you're one of the best tinkerers in the AI world.
01:46:20,000 --> 01:46:25,120 Something like understanding the fundamentals of how something works by building
01:46:25,120 --> 01:46:28,640 it from scratch and playing with it, with the basic intuitions.
01:46:28,640 --> 01:46:33,440 It's like Einstein, Feynman, were all really good at this kind of stuff: taking a small example
01:46:33,440 --> 01:46:37,160 of a thing, playing with it, trying to understand it.
01:46:37,160 --> 01:46:44,480 So that, and obviously now with Tesla, you helped build a team of machine learning
01:46:44,480 --> 01:46:48,040 engineers and a system that actually accomplishes something in the real world.
01:46:48,040 --> 01:46:51,000 Just given all that, like what was the soul searching like?
01:46:51,000 --> 01:46:56,400 Well, it was hard because obviously I love the company a lot and I love Elon.
01:46:56,400 --> 01:46:57,400 I love Tesla.
01:46:57,400 --> 01:47:00,120 It was hard to leave.
01:47:00,120 --> 01:47:01,640 I love the team basically.
01:47:01,640 --> 01:47:07,600 But yeah, I think I will potentially be interested in revisiting
01:47:07,600 --> 01:47:08,600 it.
01:47:08,600 --> 01:47:13,000 Maybe coming back at some point, working on Optimus or kind of AGI at Tesla.
01:47:13,000 --> 01:47:16,000 I think Tesla is going to do incredible things.
01:47:16,000 --> 01:47:22,880 It's basically a massive, large scale robotics kind of company with a ton
01:47:22,880 --> 01:47:25,480 of in-house talent for doing really incredible things.
01:47:25,480 --> 01:47:29,240 And I think humanoid robots are going to be amazing.
01:47:29,240 --> 01:47:32,000 I think autonomous transportation is going to be amazing.
01:47:32,000 --> 01:47:33,000 All this is happening at Tesla.
01:47:33,000 --> 01:47:35,680 So I think it's just a really amazing organization.
01:47:35,680 --> 01:47:38,880 So being part of it and helping it along, basically I enjoyed that
01:47:38,880 --> 01:47:39,880 a lot.
01:47:39,880 --> 01:47:40,880 Yeah.
01:47:40,880 --> 01:47:44,040 It was basically difficult for those reasons because I love the company, but you know,
01:47:44,040 --> 01:47:48,720 I'm happy to potentially at some point come back for Act 2, but I felt like at this stage
01:47:48,720 --> 01:47:53,800 I built the team, it felt autonomous, and I became a manager and I wanted to do
01:47:53,800 --> 01:47:54,800 a lot more technical stuff.
01:47:54,800 --> 01:47:55,800 I wanted to learn stuff.
01:47:55,800 --> 01:47:57,040 I wanted to teach stuff.
01:47:57,040 --> 01:48:01,200 And I just kind of felt like it was a good time for a change of pace
01:48:01,200 --> 01:48:02,200 a little bit.
01:48:02,200 --> 01:48:07,480 What do you think is the best movie sequel of all time, speaking of part two?
01:48:07,480 --> 01:48:09,160 Because most of them suck.
01:48:09,160 --> 01:48:10,160 Movie sequels?
01:48:10,160 --> 01:48:11,160 Movie sequels.
01:48:11,160 --> 01:48:12,160 Yeah.
01:48:12,160 --> 01:48:13,160 And you tweeted about movies.
01:48:13,160 --> 01:48:14,440 This is a tiny tangent.
01:48:14,440 --> 01:48:18,600 What's your favorite movie sequel?
01:48:18,600 --> 01:48:19,600 Godfather Part II.
01:48:19,600 --> 01:48:21,680 Are you a fan of the Godfather?
01:48:21,680 --> 01:48:23,560 Because you didn't even tweet or mention the Godfather.
01:48:23,560 --> 01:48:24,560 Yeah.
01:48:24,560 --> 01:48:25,560 I don't love that movie.
01:48:25,560 --> 01:48:26,560 I know it has a huge following.
01:48:26,560 --> 01:48:27,560 We're going to edit that out.
01:48:27,560 --> 01:48:28,560 We're going to edit out the hate towards the Godfather.
01:48:28,560 --> 01:48:29,560 How dare you disrespect.
01:48:29,560 --> 01:48:31,000 I think I will make a strong statement.
01:48:31,000 --> 01:48:32,160 I don't know why.
01:48:32,160 --> 01:48:37,760 I don't know why, but I basically don't like any movie before 1995.
01:48:37,760 --> 01:48:38,760 Something like that.
01:48:38,760 --> 01:48:39,760 Didn't you mention Terminator 2?
01:48:39,760 --> 01:48:40,760 Okay.
01:48:40,760 --> 01:48:41,760 Okay.
01:48:41,760 --> 01:48:45,240 Terminator 2 was a little bit later in 1990.
01:48:45,240 --> 01:48:47,480 No, I think Terminator 2 was in the 80s.
01:48:47,480 --> 01:48:48,960 And I like Terminator 1 as well.
01:48:48,960 --> 01:48:49,960 So okay.
01:48:49,960 --> 01:48:53,600 So like a few exceptions, but by and large, for some reason, I don't like movies before
01:48:53,600 --> 01:48:55,520 1995 or something.
01:48:55,520 --> 01:48:56,920 They feel very slow.
01:48:56,920 --> 01:48:58,160 The camera is like zoomed out.
01:48:58,160 --> 01:48:59,160 It's boring.
01:48:59,160 --> 01:49:00,160 It's kind of naive.
01:49:00,160 --> 01:49:01,160 It's kind of weird.
01:49:01,160 --> 01:49:04,080 And also Terminator was very much ahead of its time.
01:49:04,080 --> 01:49:05,080 Yes.
01:49:05,080 --> 01:49:06,800 And the Godfather, there's like no AGI.
01:49:06,800 --> 01:49:14,080 I mean, but Good Will Hunting was one of the movies you mentioned.
01:49:14,080 --> 01:49:15,720 And that doesn't have any AGI either.
01:49:15,720 --> 01:49:16,720 I guess it has mathematics.
01:49:16,720 --> 01:49:17,720 Yeah.
01:49:17,720 --> 01:49:20,640 I guess occasionally I do enjoy movies that don't feature.
01:49:20,640 --> 01:49:21,800 Or like Anchorman.
01:49:21,800 --> 01:49:24,680 That's so good.
01:49:24,680 --> 01:49:30,880 I don't understand, speaking of AGI, because I don't understand why Will Ferrell is so
01:49:30,880 --> 01:49:32,080 funny.
01:49:32,080 --> 01:49:33,080 It doesn't make sense.
01:49:33,080 --> 01:49:34,080 It doesn't compute.
01:49:34,080 --> 01:49:35,560 There's just something about him.
01:49:35,560 --> 01:49:40,000 And he's a singular human because you don't get that many comedies these days.
01:49:40,000 --> 01:49:44,680 And I wonder if it has to do about the culture or like the machine of Hollywood, or does
01:49:44,680 --> 01:49:48,760 it have to do with just we got lucky with certain people in comedy that came together
01:49:48,760 --> 01:49:53,280 because he is a singular human.
01:49:53,280 --> 01:49:55,680 That was a ridiculous tangent, I apologize.
01:49:55,680 --> 01:49:57,340 But you mentioned humanoid robots.
01:49:57,340 --> 01:49:59,880 So what do you think about Optimus?
01:49:59,880 --> 01:50:00,880 About Tesla Bot?
01:50:00,880 --> 01:50:06,000 Do you think we'll have robots in the factory and in the home in 10, 20, 30, 40, 50 years?
01:50:06,000 --> 01:50:07,000 Yeah.
01:50:07,000 --> 01:50:08,000 I think it's a very hard project.
01:50:08,000 --> 01:50:09,000 I think it's going to take a while.
01:50:09,000 --> 01:50:11,760 But who else is going to build humanoid robots at scale?
01:50:11,760 --> 01:50:14,400 And I think it is a very good form factor to go after.
01:50:14,400 --> 01:50:17,880 Because like I mentioned, the world is designed for the humanoid form factor.
01:50:17,880 --> 01:50:20,840 These things would be able to operate our machines, they would be able to sit down in
01:50:20,840 --> 01:50:24,480 chairs, potentially even drive cars.
01:50:24,480 --> 01:50:25,960 Basically the world is designed for humans.
01:50:25,960 --> 01:50:29,400 That's the form factor you want to invest into and make work over time.
01:50:29,400 --> 01:50:33,480 I think there's another school of thought, which is, okay, pick a problem and design a
01:50:33,480 --> 01:50:34,480 robot for it.
01:50:34,480 --> 01:50:37,240 But actually designing a robot and getting a whole data engine and everything behind
01:50:37,240 --> 01:50:39,960 it to work is actually an incredibly hard problem.
01:50:39,960 --> 01:50:45,040 So it makes sense to go after general interfaces that are not perfect for any one given task,
01:50:45,040 --> 01:50:50,040 but actually have the generality of, just with a prompt in English, being able to do something
01:50:50,040 --> 01:50:51,040 across tasks.
01:50:51,040 --> 01:50:57,120 And so I think it makes a lot of sense to go after a general interface in the physical
01:50:57,120 --> 01:50:58,600 world.
01:50:58,600 --> 01:51:02,120 And I think it's a very difficult project, I think it's going to take time.
01:51:02,120 --> 01:51:07,640 But I see no other company that can execute on that vision, I think it's going to be amazing.
01:51:07,640 --> 01:51:11,280 Basically physical labor, like if you think transportation is a large market, try physical
01:51:11,280 --> 01:51:12,280 labor.
01:51:12,280 --> 01:51:13,280 It's insane.
01:51:13,280 --> 01:51:15,600 But it's not just physical labor.
01:51:15,600 --> 01:51:18,760 To me, the thing that's also exciting is social robotics.
01:51:18,760 --> 01:51:23,600 So the relationship we'll have on different levels with those robots.
01:51:23,600 --> 01:51:28,080 That's why I was really excited to see Optimus.
01:51:28,080 --> 01:51:34,600 People have criticized me for the excitement, but I've worked with a lot of research labs
01:51:34,600 --> 01:51:41,060 that do humanoid legged robots, Boston Dynamics, Unitree, there's a lot of companies that do
01:51:41,060 --> 01:51:42,960 legged robots.
01:51:42,960 --> 01:51:51,820 But the elegance of the movement is a tiny, tiny part of the big picture.
01:51:51,820 --> 01:51:57,160 So the two big exciting things to me about Tesla doing humanoid or any legged
01:51:57,160 --> 01:52:03,120 robots: the first is clearly integrating into the data engine.
01:52:03,120 --> 01:52:09,280 So the data engine aspect, so the actual intelligence for the perception and the control and the
01:52:09,280 --> 01:52:14,800 planning and all that kind of stuff, integrating into the fleet that you mentioned.
01:52:14,800 --> 01:52:24,680 And then speaking of fleet, the second thing is the mass manufacturing: just knowing, culturally,
01:52:24,680 --> 01:52:30,400 how to drive towards a simple robot that's cheap to produce at scale and doing that well, having
01:52:30,400 --> 01:52:33,200 the experience to do that well, that changes everything.
01:52:33,200 --> 01:52:37,760 That's a very different culture and style than Boston Dynamics, who by the way, those
01:52:37,760 --> 01:52:45,480 robots are just, the way they move, it'll be a very long time before Tesla can achieve
01:52:45,480 --> 01:52:51,000 the smoothness of movement, but that's not what it's about.
01:52:51,000 --> 01:52:55,040 It's about the entirety of the system, like we talked about the data engine and the fleet.
01:52:55,040 --> 01:52:56,040 That's super exciting.
01:52:56,040 --> 01:53:01,600 Even the initial sort of models. That too was really surprising, that in a few months
01:53:01,600 --> 01:53:04,200 you can get a prototype.
01:53:04,200 --> 01:53:08,200 And the reason that happened very quickly is as you alluded to, there's a ton of copy
01:53:08,200 --> 01:53:10,940 paste from what's happening in the autopilot, a lot.
01:53:10,940 --> 01:53:14,240 The amount of expertise that came out of the woodwork at Tesla for building the humanoid
01:53:14,240 --> 01:53:16,360 robot was incredible to see.
01:53:16,360 --> 01:53:22,320 Basically, Elon said at one point we're doing this and then next day, basically, all these
01:53:22,320 --> 01:53:27,000 CAD models started to appear and people are talking about the supply chain and manufacturing
01:53:27,000 --> 01:53:31,380 and people showed up with screwdrivers and everything the other day and started to put
01:53:31,380 --> 01:53:32,380 together the body.
01:53:32,380 --> 01:53:35,720 And I was like, whoa, all these people exist at Tesla and fundamentally building a car
01:53:35,720 --> 01:53:38,220 is actually not that different from building a robot.
01:53:38,220 --> 01:53:43,840 And that is true, not just for the hardware pieces and also let's not forget hardware,
01:53:43,840 --> 01:53:49,680 not just for a demo, but manufacturing of that hardware at scale is like a whole different
01:53:49,680 --> 01:53:50,680 thing.
01:53:50,680 --> 01:53:56,680 But for software as well, basically this robot currently thinks it's a car.
01:53:56,680 --> 01:53:59,580 It's going to have a midlife crisis at some point.
01:53:59,580 --> 01:54:01,200 It thinks it's a car.
01:54:01,200 --> 01:54:04,080 Some of the earlier demos actually, we were talking about potentially doing them outside
01:54:04,080 --> 01:54:07,400 in the parking lot because that's where all of the computer vision was like working out
01:54:07,400 --> 01:54:11,840 of the box instead of like inside.
01:54:11,840 --> 01:54:15,920 But all the operating system, everything just copy pastes, computer vision, mostly copy
01:54:15,920 --> 01:54:16,920 pastes.
01:54:16,920 --> 01:54:19,000 I mean, you have to retrain the neural nets, but the approach and everything and data engine
01:54:19,000 --> 01:54:22,600 and offline trackers and the way we go about the occupancy tracker and so on, everything
01:54:22,600 --> 01:54:23,600 copy pastes.
01:54:23,600 --> 01:54:26,080 You just need to retrain the neural nets.
01:54:26,080 --> 01:54:29,640 And then the planning control of course has to change quite a bit, but there's a ton of
01:54:29,640 --> 01:54:31,720 copy paste from what's happening at Tesla.
01:54:31,720 --> 01:54:36,020 And so if you were to go with the goal of, like, okay, let's build a million humanoid robots and
01:54:36,020 --> 01:54:38,760 you're not Tesla, that's a lot to ask.
01:54:38,760 --> 01:54:43,000 If you're Tesla, it's actually like, it's not that crazy.
01:54:43,000 --> 01:54:46,840 And then the follow up question is then how difficult, just like with driving, how difficult
01:54:46,840 --> 01:54:51,000 is the manipulation task such that it can have an impact at scale?
01:54:51,000 --> 01:54:58,280 I think, depending on the context, the really nice thing about robotics is that, unless you
01:54:58,280 --> 01:55:02,920 do manufacturing and that kind of stuff, there is more room for error.
01:55:02,920 --> 01:55:06,280 Driving is so safety critical and also time critical.
01:55:06,280 --> 01:55:10,200 So a robot is allowed to move slower, which is nice.
01:55:10,200 --> 01:55:11,200 Yes.
01:55:11,200 --> 01:55:14,400 I think it's going to take a long time, but the way you want to structure the development
01:55:14,400 --> 01:55:16,760 is you need to say, okay, it's going to take a long time.
01:55:16,760 --> 01:55:22,200 How can I set up the product development roadmap so that I'm making revenue along the way?
01:55:22,200 --> 01:55:26,180 I'm not setting myself up for a zero one loss function where it doesn't work until it works.
01:55:26,180 --> 01:55:27,520 You don't want to be in that position.
01:55:27,520 --> 01:55:29,740 You want to make it useful almost immediately.
01:55:29,740 --> 01:55:35,800 And then you want to slowly deploy it at scale and you want to set up your data engine,
01:55:35,800 --> 01:55:41,480 your improvement loops, the telemetry, the evaluation, the harness and everything.
01:55:41,480 --> 01:55:44,300 And you want to improve the product over time incrementally and you're making revenue along
01:55:44,300 --> 01:55:45,300 the way.
01:55:45,300 --> 01:55:49,480 That's extremely important because otherwise you cannot build these large undertakings
01:55:49,480 --> 01:55:51,400 because they just don't make sense economically.
01:55:51,400 --> 01:55:54,360 And also from the point of view of the team working on it, they need the dopamine along
01:55:54,360 --> 01:55:55,360 the way.
01:55:55,360 --> 01:55:58,840 They're not just going to make a promise about this being useful.
01:55:58,840 --> 01:56:00,880 This is going to change the world in 10 years when it works.
01:56:00,880 --> 01:56:02,440 This is not where you want to be.
01:56:02,440 --> 01:56:08,120 You want to be in a place like, I think, Autopilot is today, where it's offering increased safety
01:56:08,120 --> 01:56:10,240 and convenience of driving today.
01:56:10,240 --> 01:56:11,240 People pay for it.
01:56:11,240 --> 01:56:12,240 People like it.
01:56:12,240 --> 01:56:13,240 People will purchase it.
01:56:13,240 --> 01:56:16,480 And then you also have the greater mission that you're working towards.
01:56:16,480 --> 01:56:17,480 And you see that.
01:56:17,480 --> 01:56:20,760 So the dopamine for the team, that was a source of happiness.
01:56:20,760 --> 01:56:21,880 Yes, 100%.
01:56:21,880 --> 01:56:22,880 You're deploying this.
01:56:22,880 --> 01:56:23,880 People like it.
01:56:23,880 --> 01:56:24,880 People drive it.
01:56:24,880 --> 01:56:25,880 People pay for it.
01:56:25,880 --> 01:56:26,880 They care about it.
01:56:26,880 --> 01:56:27,880 There's all these YouTube videos.
01:56:27,880 --> 01:56:28,880 Your grandma drives it.
01:56:28,880 --> 01:56:29,880 She gives you feedback.
01:56:29,880 --> 01:56:30,880 People like it.
01:56:30,880 --> 01:56:31,880 People engage with it.
01:56:31,880 --> 01:56:32,880 It's huge.
01:56:32,880 --> 01:56:40,800 Do people that drive Teslas recognize you and give you love, like, hey, thanks for this
01:56:40,800 --> 01:56:42,200 nice feature that it's doing?
01:56:42,200 --> 01:56:44,880 Yeah, I think the tricky thing is some people really love you.
01:56:44,880 --> 01:56:48,240 Some people, unfortunately, you're working on something that you think is extremely valuable,
01:56:48,240 --> 01:56:49,240 useful, et cetera.
01:56:49,240 --> 01:56:50,240 Some people do hate you.
01:56:50,240 --> 01:56:55,520 There's a lot of people who hate me and the team and the whole project.
01:56:55,520 --> 01:56:56,520 And I think-
01:56:56,520 --> 01:56:57,520 Are they Tesla drivers?
01:56:57,520 --> 01:56:59,640 In many cases, they're not, actually.
01:56:59,640 --> 01:57:06,440 Yeah, that actually makes me sad about humans or the current ways that humans interact.
01:57:06,440 --> 01:57:07,760 I think that's actually fixable.
01:57:07,760 --> 01:57:09,520 I think humans want to be good to each other.
01:57:09,520 --> 01:57:14,320 I think Twitter and social media is part of the mechanism that actually somehow makes
01:57:14,320 --> 01:57:21,640 the negativity more viral than it deserves, like it disproportionately adds a viral
01:57:21,640 --> 01:57:23,800 boost to the negativity.
01:57:23,800 --> 01:57:29,880 So I wish people would just suppress some of the jealousy, some of
01:57:29,880 --> 01:57:32,360 the ego and just get excited for others.
01:57:32,360 --> 01:57:34,480 And then there's a karma aspect to that.
01:57:34,480 --> 01:57:36,840 You get excited for others, they'll get excited for you.
01:57:36,840 --> 01:57:37,840 Same thing in academia.
01:57:37,840 --> 01:57:41,480 If you're not careful, there is a dynamical system there.
01:57:41,480 --> 01:57:47,600 If you think in silos and get jealous of somebody else being successful, that actually
01:57:47,600 --> 01:57:53,600 perhaps counterintuitively leads to less productivity of you as a community and you individually.
01:57:53,600 --> 01:57:59,720 I feel like if you keep celebrating others, that actually makes you more successful.
01:57:59,720 --> 01:58:04,400 I think people haven't, depending on the industry, haven't quite learned that yet.
01:58:04,400 --> 01:58:07,800 Some people are also very negative and very vocal, so they're very prominently featured.
01:58:07,800 --> 01:58:12,680 But actually, there's a ton of people who are cheerleaders, but they're silent cheerleaders.
01:58:12,680 --> 01:58:17,920 And when you talk to people just in the world, they will tell you, it's amazing, it's great.
01:58:17,920 --> 01:58:20,480 Especially like people who understand how difficult it is to get this stuff working.
01:58:20,480 --> 01:58:25,800 But people who have built products and makers, entrepreneurs, like making this work and changing
01:58:25,800 --> 01:58:28,720 something is incredibly hard.
01:58:28,720 --> 01:58:30,600 Those people are more likely to cheerlead you.
01:58:30,600 --> 01:58:35,360 Well, one of the things that makes me sad is some folks in the robotics community don't
01:58:35,360 --> 01:58:38,040 do the cheerleading and they should.
01:58:38,040 --> 01:58:39,240 Because they know how difficult it is.
01:58:39,240 --> 01:58:43,480 Well, they actually sometimes don't know how difficult it is to create a product at scale
01:58:43,480 --> 01:58:45,120 and actually deploy it in the real world.
01:58:45,120 --> 01:58:54,080 A lot of the development of robots and AI systems is done on very specific small benchmarks,
01:58:54,080 --> 01:58:55,680 as opposed to real-world conditions.
01:58:55,680 --> 01:58:56,680 Yes.
01:58:56,680 --> 01:59:00,160 Yeah, I think it's really hard to work on robotics in an academic setting.
01:59:00,160 --> 01:59:02,400 Or AI systems that apply in the real world.
01:59:02,400 --> 01:59:10,040 You've criticized, though you also flourished with and loved for a time, the famed ImageNet
01:59:10,040 --> 01:59:11,040 data set.
01:59:11,040 --> 01:59:18,800 And you've recently had some words of criticism that the academic research ML community gives
01:59:18,800 --> 01:59:23,920 a little too much love still to ImageNet or those kinds of benchmarks.
01:59:23,920 --> 01:59:28,960 Can you speak to the strengths and weaknesses of data sets used in machine learning research?
01:59:28,960 --> 01:59:35,120 Actually, I don't know that I recall the specific instance where I was unhappy or criticizing
01:59:35,120 --> 01:59:36,120 ImageNet.
01:59:36,120 --> 01:59:38,920 I think ImageNet has been extremely valuable.
01:59:38,920 --> 01:59:44,000 It was basically a benchmark that allowed the deep learning community to demonstrate
01:59:44,000 --> 01:59:47,640 that deep neural networks actually work.
01:59:47,640 --> 01:59:49,820 There's a massive value in that.
01:59:49,820 --> 01:59:54,320 So I think ImageNet was useful, but basically it's become a bit of an MNIST at this point.
01:59:54,320 --> 01:59:57,840 So MNIST is like little 28 by 28 grayscale digits.
01:59:57,840 --> 02:00:00,440 It's kind of a joke data set that everyone just crushes.
02:00:00,440 --> 02:00:03,240 There's still papers written on MNIST though, right?
02:00:03,240 --> 02:00:04,240 Maybe they shouldn't.
02:00:04,240 --> 02:00:05,240 Strong papers.
02:00:05,240 --> 02:00:06,240 Yeah.
02:00:06,240 --> 02:00:09,360 Maybe they should focus on like how do we learn with a small amount of data, that kind
02:00:09,360 --> 02:00:10,360 of stuff.
02:00:10,360 --> 02:00:11,360 Yeah.
02:00:11,360 --> 02:00:12,640 I could see that being helpful, but not in sort of like mainline computer vision research
02:00:12,640 --> 02:00:13,640 anymore, of course.
02:00:13,640 --> 02:00:17,840 I think the way I've heard you somewhere, maybe I'm just imagining things, but I think
02:00:17,840 --> 02:00:21,400 you said like ImageNet was a huge contribution to the community for a long time and now it's
02:00:21,400 --> 02:00:23,040 time to move past those kinds of...
02:00:23,040 --> 02:00:24,240 Well, ImageNet has been crushed.
02:00:24,240 --> 02:00:32,640 I mean, you know, the error rates are, yeah, we're getting like 90% accuracy in 1,000-way
02:00:32,640 --> 02:00:35,160 classification.
02:00:35,160 --> 02:00:39,000 And I've seen those images and it's like really high.
02:00:39,000 --> 02:00:40,000 That's really good.
02:00:40,000 --> 02:00:44,940 If I remember correctly, the top five error rate is now like 1% or something.
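For reference, top-5 error counts a prediction as wrong only if the true label is not among the model's five highest-scoring classes. A minimal sketch of that computation, using made-up random inputs rather than real ImageNet predictions:

    import numpy as np

    def top5_error(logits: np.ndarray, labels: np.ndarray) -> float:
        # logits: (N, 1000) class scores; labels: (N,) ground-truth class indices
        top5 = np.argsort(logits, axis=1)[:, -5:]        # the 5 highest-scoring classes per image
        correct = (top5 == labels[:, None]).any(axis=1)  # is the true label among them?
        return float(1.0 - correct.mean())

    # Random placeholder data, just to show the shapes involved.
    logits = np.random.randn(8, 1000)
    labels = np.random.randint(0, 1000, size=8)
    print(top5_error(logits, labels))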
02:00:44,940 --> 02:00:49,400 Given your experience with a gigantic real-world data set, would you like to see the benchmarks
02:00:49,400 --> 02:00:52,480 that the research community uses move in certain directions?
02:00:52,480 --> 02:00:56,080 Unfortunately, I don't think academics currently have the next ImageNet.
02:00:56,080 --> 02:00:57,920 We've obviously, I think we've crushed MNIST.
02:00:57,920 --> 02:01:02,780 We've basically kind of crushed ImageNet and there's no next sort of big benchmark
02:01:02,780 --> 02:01:08,400 that the entire community rallies behind and uses, you know, for further development of
02:01:08,400 --> 02:01:09,400 these networks.
02:01:09,400 --> 02:01:10,400 Yeah.
02:01:10,400 --> 02:01:13,680 I wonder what it takes for a data set to captivate the imagination of everybody, like where they
02:01:13,680 --> 02:01:15,560 all get behind it.
02:01:15,560 --> 02:01:19,600 That could also need like a leader, right?
02:01:19,600 --> 02:01:20,600 Somebody with popularity.
02:01:20,600 --> 02:01:25,040 I mean, yeah, why did ImageNet take off?
02:01:25,040 --> 02:01:26,520 Is it just the accident of history?
02:01:26,520 --> 02:01:29,520 It was the right amount of difficult.
02:01:29,520 --> 02:01:33,240 It was the right amount of difficult and simple and interesting enough.
02:01:33,240 --> 02:01:37,940 It just kind of like, it was the right time for that kind of a data set.
02:01:37,940 --> 02:01:40,940 Question from Reddit.
02:01:40,940 --> 02:01:45,180 What are your thoughts on the role that synthetic data and game engines will play in the future
02:01:45,180 --> 02:01:48,280 of neural net model development?
02:01:48,280 --> 02:01:55,820 I think as neural nets converge to humans, the value of simulation to neural nets will
02:01:55,820 --> 02:01:59,720 be similar to the value of simulation to humans.
02:01:59,720 --> 02:02:07,280 So people use simulation because they can learn something in that kind of a system and
02:02:07,280 --> 02:02:09,420 without having to actually experience it.
02:02:09,420 --> 02:02:11,840 But are you referring to the simulation we do in our head?
02:02:11,840 --> 02:02:18,440 No, sorry, simulation, I mean like video games or, you know, other forms of simulation for
02:02:18,440 --> 02:02:20,160 various professionals.
02:02:20,160 --> 02:02:24,160 So let me push back on that because maybe there's simulation that we do in our heads,
02:02:24,160 --> 02:02:28,720 like simulate, if I do this, what do I think will happen?
02:02:28,720 --> 02:02:30,280 Okay, that's like internal simulation.
02:02:30,280 --> 02:02:31,280 Yeah, internal.
02:02:31,280 --> 02:02:32,280 Isn't that what we're doing?
02:02:32,280 --> 02:02:33,280 Assuming it before we act?
02:02:33,280 --> 02:02:36,440 Oh, yeah, but that's independent from like the use of simulation in the sense of like
02:02:36,440 --> 02:02:40,400 computer games or using simulation for training set creation or...
02:02:40,400 --> 02:02:43,000 Is it independent or is it just loosely correlated?
02:02:43,000 --> 02:02:50,680 Because like, isn't it useful to do, like, counterfactual or edge-case simulation
02:02:50,680 --> 02:02:55,160 to like, you know, what happens if there's a nuclear war?
02:02:55,160 --> 02:02:58,000 What happens if there's, you know, like those kinds of things?
02:02:58,000 --> 02:03:01,060 Yeah, that's a different simulation from like Unreal Engine.
02:03:01,060 --> 02:03:02,360 That's how I interpreted the question.
02:03:02,360 --> 02:03:07,040 Ah, so like simulation of the average case?
02:03:07,040 --> 02:03:09,000 Is that what Unreal Engine is?
02:03:09,000 --> 02:03:11,840 What do you mean by Unreal Engine?
02:03:11,840 --> 02:03:18,680 So simulating a world, physics of that world, why is that different?
02:03:18,680 --> 02:03:22,120 Like because you also can add behavior to that world.
02:03:22,120 --> 02:03:25,000 And you could try all kinds of stuff, right?
02:03:25,000 --> 02:03:27,000 You could throw all kinds of weird things into it.
02:03:27,000 --> 02:03:31,500 So Unreal Engine is not just about simulating, I mean, I guess it is about simulating the
02:03:31,500 --> 02:03:32,500 physics of the world.
02:03:32,500 --> 02:03:34,960 It's also doing something with that.
02:03:34,960 --> 02:03:38,840 Yeah, the graphics, the physics and the agents that you put into the environment and stuff
02:03:38,840 --> 02:03:39,840 like that.
02:03:39,840 --> 02:03:40,840 Yeah.
02:03:40,840 --> 02:03:46,480 See, I feel like you said that it's not that important, I guess, for the future of AI development.
02:03:46,480 --> 02:03:48,120 Is that correct to interpret it that way?
02:03:48,120 --> 02:03:55,360 I think humans use simulators and they find them useful.
02:03:55,360 --> 02:03:58,160 And so computers will use simulators and find them useful.
02:03:58,160 --> 02:04:02,080 Okay, so you're saying it's not that, I don't use simulators very often.
02:04:02,080 --> 02:04:06,080 I play a video game every once in a while, but I don't think I derive any wisdom about
02:04:06,080 --> 02:04:09,440 my own existence from those video games.
02:04:09,440 --> 02:04:15,440 It's a momentary escape from reality versus a source of wisdom about reality.
02:04:15,440 --> 02:04:19,720 So I think that's a very polite way of saying simulation is not that useful.
02:04:19,720 --> 02:04:21,120 Yeah, maybe not.
02:04:21,120 --> 02:04:27,060 I don't see it as like a fundamental, really important part of training neural nets currently.
02:04:27,060 --> 02:04:31,160 But I think as neural nets become more and more powerful, I think you will need fewer
02:04:31,160 --> 02:04:34,680 examples to train additional behaviors.
02:04:34,680 --> 02:04:39,160 And with simulation, of course, there's a domain gap, in that a simulation is not the real world,
02:04:39,160 --> 02:04:40,700 it's likely something different.
02:04:40,700 --> 02:04:45,920 But with a powerful enough neural net, the domain gap can be bigger, I think,
02:04:45,920 --> 02:04:48,860 because the neural net will sort of understand that even though it's not the real world,
02:04:48,860 --> 02:04:52,400 it has all this high-level structure that it's supposed to be able to learn from.
02:04:52,400 --> 02:04:58,080 So the neural net will actually, yeah, it will be able to leverage the synthetic data
02:04:58,080 --> 02:05:04,520 better by closing the gap and understanding in which ways this is not real data.
02:05:04,520 --> 02:05:05,520 Exactly.
02:05:05,520 --> 02:05:08,160 I'd like Reddit to do better questions next time.
02:05:08,160 --> 02:05:12,400 That was a good question, I'm just kidding.
02:05:12,400 --> 02:05:14,400 All right.
02:05:14,400 --> 02:05:19,100 So is it possible, do you think, speaking of MNIST, to construct neural nets and training
02:05:19,100 --> 02:05:23,420 processes that require very little data?
02:05:23,420 --> 02:05:26,680 So we've been talking about huge data sets like the internet for training.
02:05:26,680 --> 02:05:30,640 I mean, one way to say that is, like you said, like the querying itself is another level
02:05:30,640 --> 02:05:34,400 of training, I guess, and that requires a little data.
02:05:34,400 --> 02:05:41,960 But do you see any value in doing research and kind of going down the direction of can
02:05:41,960 --> 02:05:45,600 we use very little data to train, to construct a knowledge base?
02:05:45,600 --> 02:05:46,600 100%.
02:05:46,600 --> 02:05:49,200 I just think like at some point you need a massive data set.
02:05:49,200 --> 02:05:54,000 And then when you pre-train your massive neural net and get something that is like a GPT or
02:05:54,000 --> 02:05:59,720 something, then you're able to be very efficient at training any arbitrary new task.
02:05:59,720 --> 02:06:04,920 So a lot of these GPTs, you can do tasks like sentiment analysis or translation or so on
02:06:04,920 --> 02:06:07,140 just by being prompted with very few examples.
02:06:07,140 --> 02:06:08,840 Here's the kind of thing I want you to do.
02:06:08,840 --> 02:06:11,460 Here's an input sentence, here's the translation into German.
02:06:11,460 --> 02:06:13,060 Input sentence, translation to German.
02:06:13,060 --> 02:06:16,900 Input sentence, blank, and the neural net will complete the translation to German just
02:06:16,900 --> 02:06:19,680 by looking at sort of the example you've provided.
02:06:19,680 --> 02:06:24,680 And so that's an example of a very few shot learning in the activations of the neural
02:06:24,680 --> 02:06:26,600 net instead of the weights of the neural net.
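As a rough illustration of the kind of prompt being described (a minimal sketch; the phrasing and examples are made up, and any sufficiently capable language model is assumed), the few-shot "learning in the activations" setup might look like this:

    # The task specification and the examples live entirely in the prompt;
    # no weights are updated anywhere.
    prompt = (
        "Translate English to German.\n"
        "English: The cat sits on the mat. German: Die Katze sitzt auf der Matte.\n"
        "English: I like coffee. German: Ich mag Kaffee.\n"
        "English: Where is the train station? German:"
    )
    # A capable model completing this prompt would be expected to continue with
    # something like "Wo ist der Bahnhof?", purely from the in-context examples.
    print(prompt)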
02:06:26,600 --> 02:06:31,680 And so I think basically just like humans, neural nets will become very data efficient
02:06:31,680 --> 02:06:33,900 at learning any other new task.
02:06:33,900 --> 02:06:38,160 But at some point you need a massive data set to pre-train your network.
02:06:38,160 --> 02:06:42,000 To get that, and probably we humans have something like that.
02:06:42,000 --> 02:06:43,000 Do we have something like that?
02:06:43,000 --> 02:06:51,360 Do we have a passive, in-the-background, model-constructing thing that just runs all the
02:06:51,360 --> 02:06:53,160 time in a self-supervised way?
02:06:53,160 --> 02:06:54,240 We're not conscious of it?
02:06:54,240 --> 02:06:59,960 I think humans definitely, I mean, obviously we have, we learn a lot during our lifespan,
02:06:59,960 --> 02:07:06,360 but also we have a ton of hardware that helps us at initialization coming from sort of evolution.
02:07:06,360 --> 02:07:08,840 And so I think that's also a really big component.
02:07:08,840 --> 02:07:13,080 A lot of people in the field, I think they just talk about the amounts of like seconds
02:07:13,080 --> 02:07:17,640 that a person has lived pretending that this is a tabula rasa, sort of like a zero initialization
02:07:17,640 --> 02:07:19,000 of a neural net.
02:07:19,000 --> 02:07:20,000 And it's not.
02:07:20,000 --> 02:07:24,360 You can look at a lot of animals, like for example, zebras, zebras get born and they
02:07:24,360 --> 02:07:27,160 see and they can run.
02:07:27,160 --> 02:07:29,240 There's zero training data in their lifespan.
02:07:29,240 --> 02:07:30,720 They can just do that.
02:07:30,720 --> 02:07:35,320 So somehow evolution has found a way to encode these algorithms and
02:07:35,320 --> 02:07:39,440 these neural net initializations that are extremely good into ATCGs and I have no idea
02:07:39,440 --> 02:07:44,480 how this works, but apparently it's possible because here's a proof by existence.
02:07:44,480 --> 02:07:50,080 There's something magical about going from a single cell to an organism that is born
02:07:50,080 --> 02:07:51,560 to the first few years of life.
02:07:51,560 --> 02:07:55,520 I kind of like the idea that the reason we don't remember anything about the first few
02:07:55,520 --> 02:07:59,640 years of our life is that it's a really painful process.
02:07:59,640 --> 02:08:03,920 Like it's a very difficult, challenging training process.
02:08:03,920 --> 02:08:10,760 Like intellectually, like, and maybe, yeah, I mean, I don't, why don't we remember any
02:08:10,760 --> 02:08:11,760 of that?
02:08:11,760 --> 02:08:18,520 There might be some crazy training going on and that, maybe that's the background model
02:08:18,520 --> 02:08:23,240 training that is very painful.
02:08:23,240 --> 02:08:27,760 And so it's best for the system once it's trained, not to remember how it was constructed.
02:08:27,760 --> 02:08:31,960 I think it's just like the hardware for long-term memory is just not fully developed.
02:08:31,960 --> 02:08:37,120 I kind of feel like the first few years of infants is not actually like learning.
02:08:37,120 --> 02:08:39,520 It's brain maturing.
02:08:39,520 --> 02:08:42,120 We're born premature.
02:08:42,120 --> 02:08:44,640 There's a theory along those lines because of the birth canal and the swelling of the
02:08:44,640 --> 02:08:45,640 brain.
02:08:45,640 --> 02:08:49,320 And so we're born premature and then the first few years we're just, the brain is maturing
02:08:49,320 --> 02:08:51,720 and then there's some learning eventually.
02:08:51,720 --> 02:08:53,960 That's my current view on it.
02:08:53,960 --> 02:09:00,060 What do you think, do you think neural nets can have long-term memory
02:09:00,060 --> 02:09:03,480 that approaches something like humans have? Do you think there needs to
02:09:03,480 --> 02:09:07,760 be another meta-architecture on top of it to add something like a knowledge base that
02:09:07,760 --> 02:09:10,840 learns facts about the world and all that kind of stuff?
02:09:10,840 --> 02:09:15,920 Yes, but I don't know to what extent it will be explicitly constructed.
02:09:15,920 --> 02:09:21,000 It might take unintuitive forms where you are telling the GPT, like, hey,
02:09:21,000 --> 02:09:25,320 you have a declarative memory bank which you can store data to and retrieve data from.
02:09:25,320 --> 02:09:28,520 And whenever you encounter some information that you find useful, just save it to your
02:09:28,520 --> 02:09:29,960 memory bank.
02:09:29,960 --> 02:09:33,520 And here's an example of something you have retrieved and here's how you save to it and here's
02:09:33,520 --> 02:09:34,520 how you load from it.
02:09:34,520 --> 02:09:40,080 You just say load, whatever, you teach it in text in English and then it might learn
02:09:40,080 --> 02:09:41,840 to use a memory bank from that.
02:09:41,840 --> 02:09:48,280 Oh, so the neural net is the architecture for the background model, the base thing.
02:09:48,280 --> 02:09:50,280 And then everything else is just on top of it.
02:09:50,280 --> 02:09:51,760 It's not just text, right?
02:09:51,760 --> 02:09:53,100 You're giving it gadgets and gizmos.
02:09:53,100 --> 02:09:57,720 So you're teaching it some kind of a special language by which it can save arbitrary
02:09:57,720 --> 02:09:59,820 information and retrieve it at a later time.
02:09:59,820 --> 02:10:04,320 And you're telling it about these special tokens and how to arrange them to use these interfaces.
02:10:04,320 --> 02:10:06,560 It's like, Hey, you can use a calculator.
02:10:06,560 --> 02:10:07,560 Here's how you use it.
02:10:07,560 --> 02:10:13,440 Just do 5 3 plus 4 1 equals, and when the equals is there, a calculator will
02:10:13,440 --> 02:10:16,400 actually read out the answer and you don't have to calculate it yourself.
02:10:16,400 --> 02:10:19,680 And you just like tell it in English, this might actually work.
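A toy sketch of that calculator idea (the trigger convention and the numbers here are made up, not any real system): a surrounding harness watches the model's text for an arithmetic expression ending in an equals sign and fills in the answer itself.

    import re

    def run_with_calculator(model_output: str) -> str:
        # Whenever the text contains "<a> + <b> =", append the computed sum,
        # so the model never has to do the arithmetic itself.
        def evaluate(match: re.Match) -> str:
            a, b = int(match.group(1)), int(match.group(2))
            return f"{a} + {b} = {a + b}"
        return re.sub(r"(\d+)\s*\+\s*(\d+)\s*=", evaluate, model_output)

    print(run_with_calculator("The total is 53 + 41 ="))  # -> The total is 53 + 41 = 94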
02:10:19,680 --> 02:10:24,280 Do you think in that sense Gato is interesting, the DeepMind system, that it's not just
02:10:24,280 --> 02:10:31,200 language, but actually throws it all in the same pile, images, actions, all that kind
02:10:31,200 --> 02:10:32,200 of stuff.
02:10:32,200 --> 02:10:34,320 That's basically what we're moving towards.
02:10:34,320 --> 02:10:35,320 I think so.
02:10:35,320 --> 02:10:41,960 So Gato is very much a kitchen sink approach to reinforcement learning, lots of different
02:10:41,960 --> 02:10:46,640 environments with a single fixed transformer model.
02:10:46,640 --> 02:10:51,480 I think it's a very sort of early result in that realm, but I think it's along the lines
02:10:51,480 --> 02:10:53,520 of what I think things will eventually look like.
02:10:53,520 --> 02:10:54,520 Right.
02:10:54,520 --> 02:10:57,400 So this is the early days of a system that eventually will look like this, like from
02:10:57,400 --> 02:11:00,080 a Rich Sutton perspective.
02:11:00,080 --> 02:11:04,920 And I'm not a super huge fan of all these interfaces that look very different.
02:11:04,920 --> 02:11:07,520 I would want everything to be normalized into the same API.
02:11:07,520 --> 02:11:11,940 So for example, screen pixels, the very same API, instead of having different world environments
02:11:11,940 --> 02:11:15,760 that have very different physics and joint configurations and appearances and whatever,
02:11:15,760 --> 02:11:19,560 and having some kind of special tokens for different games that you can plug in.
02:11:19,560 --> 02:11:22,720 I'd rather just normalize everything to a single interface.
02:11:22,720 --> 02:11:25,160 So it looks the same to the neural net, if that makes sense.
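A minimal sketch of what "normalize everything to the same interface" could look like (this is not any particular library's API; the environment and policy here are placeholders): the policy only ever sees screen pixels and emits an action, no matter what sits behind the interface.

    from typing import Callable, Protocol
    import numpy as np

    class PixelEnv(Protocol):
        def reset(self) -> np.ndarray: ...               # returns an HxWx3 screen
        def step(self, action: int) -> np.ndarray: ...   # returns the next screen

    class NoiseEnv:
        # A stand-in environment that just emits random 64x64 "screens".
        def reset(self) -> np.ndarray:
            return np.random.rand(64, 64, 3)
        def step(self, action: int) -> np.ndarray:
            return np.random.rand(64, 64, 3)

    def run_episode(env: PixelEnv, policy: Callable[[np.ndarray], int], steps: int = 10) -> None:
        screen = env.reset()
        for _ in range(steps):
            action = policy(screen)   # same call regardless of which game is behind it
            screen = env.step(action)

    run_episode(NoiseEnv(), policy=lambda screen: 0)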
02:11:25,160 --> 02:11:27,680 So it's all going to be pixel-based Pong in the end.
02:11:27,680 --> 02:11:28,680 I think so.
02:11:28,680 --> 02:11:32,420 Okay.
02:11:32,420 --> 02:11:35,880 Let me ask you about your own personal life.
02:11:35,880 --> 02:11:39,480 A lot of people want to know, you're one of the most productive and brilliant people in
02:11:39,480 --> 02:11:40,560 the history of AI.
02:11:40,560 --> 02:11:44,680 What does a productive day in the life of Andrej Karpathy look like?
02:11:44,680 --> 02:11:46,080 What time do you wake up usually?
02:11:46,080 --> 02:11:51,680 Because I imagine some kind of dance between the average productive day and a perfect productive
02:11:51,680 --> 02:11:52,680 day.
02:11:52,680 --> 02:11:56,520 So the perfect productive day is the thing we strive towards and the average is kind
02:11:56,520 --> 02:12:01,240 of what it kind of converges to given all the mistakes and human eventualities and so
02:12:01,240 --> 02:12:02,240 on.
02:12:02,240 --> 02:12:03,240 So what times do you wake up?
02:12:03,240 --> 02:12:04,400 Are you a morning person?
02:12:04,400 --> 02:12:05,560 I'm not a morning person.
02:12:05,560 --> 02:12:07,040 I'm a night owl, for sure.
02:12:07,040 --> 02:12:09,040 Is it stable or not?
02:12:09,040 --> 02:12:12,320 It's semi stable, like eight or nine or something like that.
02:12:12,320 --> 02:12:16,600 During my PhD, it was even later, I used to go to sleep usually at 3am.
02:12:16,600 --> 02:12:21,680 I think the am hours are precious and very interesting time to work because everyone
02:12:21,680 --> 02:12:22,680 is asleep.
02:12:22,680 --> 02:12:26,400 At 8am or 7am, the East Coast is awake.
02:12:26,400 --> 02:12:29,480 So there's already activity, there's already some text messages, whatever, there's stuff
02:12:29,480 --> 02:12:30,480 happening.
02:12:30,480 --> 02:12:34,280 You can go on like some news website and there's stuff happening, it's distracting.
02:12:34,280 --> 02:12:36,740 At 3am, everything is totally quiet.
02:12:36,740 --> 02:12:42,260 And so you're not going to be bothered and you have solid chunks of time to do your work.
02:12:42,260 --> 02:12:45,320 So I like those periods, night owl by default.
02:12:45,320 --> 02:12:50,160 And then I think like productive time basically, what I like to do is you need to like build
02:12:50,160 --> 02:12:53,920 some momentum on the problem without too much distraction.
02:12:53,920 --> 02:13:00,520 And you need to load your RAM, your working memory with that problem.
02:13:00,520 --> 02:13:04,120 And then you need to be obsessed with it when you're taking shower when you're falling asleep.
02:13:04,120 --> 02:13:07,280 You need to be obsessed with the problem and it's fully in your memory and you're ready
02:13:07,280 --> 02:13:09,040 to wake up and work on it right there.
02:13:09,040 --> 02:13:14,120 So what scale is this on, a temporal scale of a single day or a couple of days,
02:13:14,120 --> 02:13:15,200 a week, a month?
02:13:15,200 --> 02:13:19,480 So I can't talk about one day basically in isolation because it's a whole process.
02:13:19,480 --> 02:13:23,880 When I want to get productive in the problem, I feel like I need a span of a few days.
02:13:23,880 --> 02:13:27,800 Where I can really get in on that problem and I don't want to be interrupted.
02:13:27,800 --> 02:13:31,320 And I'm going to just be completely obsessed with that problem.
02:13:31,320 --> 02:13:34,240 And that's where I do most of my good work.
02:13:34,240 --> 02:13:38,540 You've done a bunch of cool like little projects in a very short amount of time, very quickly.
02:13:38,540 --> 02:13:40,680 So that, that requires you just focusing on it.
02:13:40,680 --> 02:13:41,680 Yeah.
02:13:41,680 --> 02:13:44,360 Basically I need to load my working memory with the problem and I need to be productive
02:13:44,360 --> 02:13:49,120 because there's always like a huge fixed cost to approaching any problem.
02:13:49,120 --> 02:13:52,280 You know, like I was struggling with this, for example, at Tesla because I want to work
02:13:52,280 --> 02:13:55,560 on like small side project, but okay, you first need to figure out, okay, I need to
02:13:55,560 --> 02:13:56,560 SSH into my cluster.
02:13:56,560 --> 02:13:59,480 I need to bring up a VS code editor so I can like work on this.
02:13:59,480 --> 02:14:03,920 I need to, I ran into some stupid error because of some reason, like you're not at a point
02:14:03,920 --> 02:14:05,760 where you can be just productive right away.
02:14:05,760 --> 02:14:07,680 You are facing barriers.
02:14:07,680 --> 02:14:12,880 And so it's about really removing all of that barrier and you're able to go into the problem
02:14:12,880 --> 02:14:15,480 and you have the full problem loaded in your memory.
02:14:15,480 --> 02:14:22,260 And somehow avoiding distractions of all different forms, like news stories, emails, but also
02:14:22,260 --> 02:14:26,600 distractions from other interesting projects that you previously worked on or currently
02:14:26,600 --> 02:14:27,600 working on and so on.
02:14:27,600 --> 02:14:28,600 Yeah.
02:14:28,600 --> 02:14:29,600 You just want to really focus your mind.
02:14:29,600 --> 02:14:34,160 And I mean, I can take some time off for distractions and in between, but I think it can't be too
02:14:34,160 --> 02:14:35,160 much.
02:14:35,160 --> 02:14:38,240 Uh, you know, most of your day is sort of like spent on that problem.
02:14:38,240 --> 02:14:42,240 And then, you know, I drink coffee, I have my morning routine.
02:14:42,240 --> 02:14:46,800 I look at some news, uh, Twitter, hacker news, wall street journal, et cetera.
02:14:46,800 --> 02:14:50,480 So basically you wake up, you have some coffee. Are you trying to get to work
02:14:50,480 --> 02:14:51,480 as quickly as possible?
02:14:51,480 --> 02:14:56,440 Or do you take in this diet of, like, what the hell is happening in the world first?
02:14:56,440 --> 02:14:57,440 I am.
02:14:57,440 --> 02:14:59,680 I do find it interesting to know about the world.
02:14:59,680 --> 02:15:03,720 I don't know that it's useful or good, but it is part of my routine right now.
02:15:03,720 --> 02:15:09,000 So I do read through a bunch of news articles and I want to be informed and, um, I'm suspicious
02:15:09,000 --> 02:15:10,000 of it.
02:15:10,000 --> 02:15:12,120 I'm suspicious of the practice, but currently that's where I am.
02:15:12,120 --> 02:15:18,000 Oh, you mean suspicious about the positive effect of that practice on your productivity
02:15:18,000 --> 02:15:19,000 and your wellbeing.
02:15:19,000 --> 02:15:21,240 My wellbeing psychologically.
02:15:21,240 --> 02:15:25,720 And also on your ability to deeply understand the world, because there's a bunch of sources
02:15:25,720 --> 02:15:26,720 of information.
02:15:26,720 --> 02:15:28,520 You're not really focused on deeply integrating.
02:15:28,520 --> 02:15:29,520 Yeah.
02:15:29,520 --> 02:15:30,520 It's a little bit distracting.
02:15:30,520 --> 02:15:31,520 Yeah.
02:15:31,520 --> 02:15:37,360 In terms of a perfectly productive day for how long of a stretch of time in one session
02:15:37,360 --> 02:15:39,560 do you try to work and focus on a thing?
02:15:39,560 --> 02:15:40,560 Is it a couple of hours,
02:15:40,560 --> 02:15:43,600 one hour, 30 minutes, 10 minutes?
02:15:43,600 --> 02:15:47,480 I can probably go like a small few hours and then I need some breaks in between for like
02:15:47,480 --> 02:15:48,920 food and stuff.
02:15:48,920 --> 02:15:53,560 And um, yeah, but I think like, uh, it's still really hard to accumulate hours.
02:15:53,560 --> 02:15:57,200 I was using a tracker that told me exactly how much time I spent coding any one day.
02:15:57,200 --> 02:16:01,040 And even on a very productive day, I still spent only like six or eight hours.
02:16:01,040 --> 02:16:02,040 Yeah.
02:16:02,040 --> 02:16:07,400 And it's just because there's so much padding, commute, talking to people, food, et cetera.
02:16:07,400 --> 02:16:13,280 There's like a cost of life, just living and sustaining and homeostasis and just maintaining
02:16:13,280 --> 02:16:16,360 yourself as a human is very high.
02:16:16,360 --> 02:16:21,360 And there seems to be a desire within the human mind to participate
02:16:21,360 --> 02:16:26,240 in society that creates that padding. Because, yeah, the most productive days I've ever
02:16:26,240 --> 02:16:30,840 had are just, completely from start to finish, tuning out everything and just sitting
02:16:30,840 --> 02:16:31,840 there.
02:16:31,840 --> 02:16:34,200 And then you can do more than six or eight hours.
02:16:34,200 --> 02:16:39,240 Is there some wisdom about what gives you strength to do like, uh, tough days of long
02:16:39,240 --> 02:16:40,240 focus?
02:16:40,240 --> 02:16:41,240 Yeah.
02:16:41,240 --> 02:16:44,200 Just like whenever I get obsessed about a problem, something just needs to work.
02:16:44,200 --> 02:16:45,440 Something just needs to exist.
02:16:45,440 --> 02:16:46,440 It needs to exist.
02:16:46,440 --> 02:16:50,800 And so you're able to deal with bugs and programming issues and technical issues
02:16:50,800 --> 02:16:54,280 and design decisions that turn out to be the wrong ones.
02:16:54,280 --> 02:16:57,560 You're able to think through all of that, given that you want the thing to exist.
02:16:57,560 --> 02:16:58,560 Yeah.
02:16:58,560 --> 02:16:59,560 It needs to exist.
02:16:59,560 --> 02:17:02,280 And then I think, to me, also a big factor is, you know, are other humans going
02:17:02,280 --> 02:17:03,280 to appreciate it?
02:17:03,280 --> 02:17:04,280 Are they going to like it?
02:17:04,280 --> 02:17:05,480 That's a big part of my motivation.
02:17:05,480 --> 02:17:10,880 If I'm helping humans and they seem happy, they say nice things, uh, they tweet about
02:17:10,880 --> 02:17:11,880 it or whatever.
02:17:11,880 --> 02:17:13,980 That gives me pleasure because I'm doing something useful.
02:17:13,980 --> 02:17:17,040 Like you do see yourself sharing it with the world.
02:17:17,040 --> 02:17:19,600 Like whether it's on GitHub, whether it's a blog post or through videos.
02:17:19,600 --> 02:17:20,600 Yeah.
02:17:20,600 --> 02:17:21,600 I was thinking about it.
02:17:21,600 --> 02:17:22,960 Like suppose I did all these things, but did not share them.
02:17:22,960 --> 02:17:25,700 I don't think I would have the same amount of motivation that I can build up.
02:17:25,700 --> 02:17:32,320 You enjoy the feeling of other people, um, gaining value and happiness from the stuff
02:17:32,320 --> 02:17:33,320 you've created.
02:17:33,320 --> 02:17:34,320 Yeah.
02:17:34,320 --> 02:17:36,240 Uh, what about diet?
02:17:36,240 --> 02:17:39,320 Is there, I saw you played with intermittent fasting, do you fast?
02:17:39,320 --> 02:17:40,320 Does that help?
02:17:40,320 --> 02:17:41,320 I play with everything.
02:17:41,320 --> 02:17:46,320 With the things you play, what's been most beneficial to the, your ability to mentally
02:17:46,320 --> 02:17:50,840 focus on a thing and just mental productivity and happiness.
02:17:50,840 --> 02:17:51,840 You still fast?
02:17:51,840 --> 02:17:52,840 Yeah.
02:17:52,840 --> 02:17:55,560 I still fast, but I do intermittent fasting, but really what it means at the end of the
02:17:55,560 --> 02:17:56,560 day is I skip breakfast.
02:17:56,560 --> 02:17:57,560 Yeah.
02:17:57,560 --> 02:18:01,200 So I do, uh, 18-6 roughly by default when I'm in my steady state.
02:18:01,200 --> 02:18:04,560 If I'm traveling or doing something else, I will break the rules, but in my steady state,
02:18:04,560 --> 02:18:05,560 I do 18-6.
02:18:05,560 --> 02:18:09,760 Uh, so I eat only from 12 to six, uh, not a hard rule and I break it often, but that's
02:18:09,760 --> 02:18:11,060 my default.
02:18:11,060 --> 02:18:15,480 And then, yeah, I've done a bunch of random experiments. For the most part, right now,
02:18:15,480 --> 02:18:20,200 where I've been for the last year and a half, I want to say, is plant-based or plant
02:18:20,200 --> 02:18:21,200 forward.
02:18:21,200 --> 02:18:22,200 I heard plant forward.
02:18:22,200 --> 02:18:23,200 It sounds better.
02:18:23,200 --> 02:18:24,200 Exactly.
02:18:24,200 --> 02:18:25,920 I don't actually know what the difference is, but it sounds better in my mind, and it
02:18:25,920 --> 02:18:32,160 just means that I prefer plant-based food, or I prefer cooked
02:18:32,160 --> 02:18:33,160 and plant-based.
02:18:33,160 --> 02:18:35,800 So plant-based, forgive me.
02:18:35,800 --> 02:18:41,840 I don't actually know how wide the category of plants entails, but it just means that you're
02:18:41,840 --> 02:18:47,760 not strict about it, you can flex, and you just prefer to eat plants.
02:18:47,760 --> 02:18:51,080 And you know, you're not making, you're not trying to influence other people.
02:18:51,080 --> 02:18:54,060 And if someone is, you come to someone's house party and they serve you a steak that they're
02:18:54,060 --> 02:18:55,600 really proud of, you will eat it.
02:18:55,600 --> 02:18:56,600 Yes.
02:18:56,600 --> 02:18:57,600 Right.
02:18:57,600 --> 02:18:58,600 It's just like, oh, that's beautiful.
02:18:58,600 --> 02:19:02,880 I mean, that's, um, I'm the flip side of that, but I'm very sort of flexible.
02:19:02,880 --> 02:19:04,980 Have you tried doing one meal a day?
02:19:04,980 --> 02:19:09,520 I have, accidentally, not consistently, but I've accidentally had that.
02:19:09,520 --> 02:19:10,640 I don't like it.
02:19:10,640 --> 02:19:12,800 I think it makes me feel not good.
02:19:12,800 --> 02:19:14,920 It's too much of a hit.
02:19:14,920 --> 02:19:15,920 Yeah.
02:19:15,920 --> 02:19:18,520 And, uh, so currently I have about two meals a day, 12 and six.
02:19:18,520 --> 02:19:19,520 I do that nonstop.
02:19:19,520 --> 02:19:20,520 I'm doing it now.
02:19:20,520 --> 02:19:22,520 I do one meal a day.
02:19:22,520 --> 02:19:23,520 Okay.
02:19:23,520 --> 02:19:24,520 It's interesting.
02:19:24,520 --> 02:19:25,520 It's an interesting feeling.
02:19:25,520 --> 02:19:26,520 Have you ever fasted longer than a day?
02:19:26,520 --> 02:19:27,520 Yeah.
02:19:27,520 --> 02:19:30,020 I've done a bunch of water fasts because I was curious what happens.
02:19:30,020 --> 02:19:31,060 What happened?
02:19:31,060 --> 02:19:32,060 Anything interesting?
02:19:32,060 --> 02:19:33,060 Yeah, I would say so.
02:19:33,060 --> 02:19:36,880 I mean, you know, what's interesting is that you're hungry for two days and then starting
02:19:36,880 --> 02:19:39,560 day three or so, you're not hungry.
02:19:39,560 --> 02:19:42,520 It's like such a weird feeling because you haven't eaten in a few days and you're not
02:19:42,520 --> 02:19:43,520 hungry.
02:19:43,520 --> 02:19:44,520 Isn't that weird?
02:19:44,520 --> 02:19:45,520 It's really weird.
02:19:45,520 --> 02:19:49,520 One of the many weird things about human biology. It figures something out and finds
02:19:49,520 --> 02:19:53,960 another source of energy or something like that, or relaxes the system.
02:19:53,960 --> 02:19:54,960 I don't know how it works.
02:19:54,960 --> 02:19:55,960 Yeah.
02:19:55,960 --> 02:19:56,960 The body is like, you're hungry, you're hungry.
02:19:56,960 --> 02:19:57,960 And then it just gives up.
02:19:57,960 --> 02:19:58,960 It's like, okay, I guess we're fasting now.
02:19:58,960 --> 02:19:59,960 There's nothing.
02:19:59,960 --> 02:20:04,080 And then it's just kind of like focuses on trying to make you not hungry and, you know,
02:20:04,080 --> 02:20:08,200 not feel the damage of that and, uh, trying to give you some space to figure out the food
02:20:08,200 --> 02:20:09,200 situation.
02:20:09,200 --> 02:20:14,680 So are you still to this day, most productive, uh, at night?
02:20:14,680 --> 02:20:18,880 I would say I am, but it is really hard to maintain my PhD schedule.
02:20:18,880 --> 02:20:22,760 Um, especially when I was say working at Tesla and so on, it's a non-starter.
02:20:22,760 --> 02:20:27,480 So, but even now, you know, people want to meet for various events.
02:20:27,480 --> 02:20:31,840 Society lives in a certain period of time and you sort of have to work with
02:20:31,840 --> 02:20:32,840 that.
02:20:32,840 --> 02:20:36,440 It's hard to like do a social thing and then after that return and do work.
02:20:36,440 --> 02:20:37,440 Yeah.
02:20:37,440 --> 02:20:38,440 It's just really hard.
02:20:38,440 --> 02:20:43,760 That's why when I do social things, I try not to do too much drinking
02:20:43,760 --> 02:20:47,040 so I can return and continue doing work.
02:20:47,040 --> 02:20:54,080 But at Tesla, or any company, is there
02:20:54,080 --> 02:21:00,120 a convergence to a schedule, or is that just how humans behave when they
02:21:00,120 --> 02:21:01,120 collaborate?
02:21:01,120 --> 02:21:02,120 I need to learn about this.
02:21:02,120 --> 02:21:05,640 Do they try to keep a consistent schedule where you're all awake at the same time?
02:21:05,640 --> 02:21:09,840 I mean, I do try to create a routine and I try to create a steady state that I'm
02:21:09,840 --> 02:21:10,840 comfortable in.
02:21:10,840 --> 02:21:14,760 So I have a morning routine, I have a day routine, I try to keep things to a steady
02:21:14,760 --> 02:21:19,560 state so things are predictable, and then your body just
02:21:19,560 --> 02:21:21,000 sort of sticks to that.
02:21:21,000 --> 02:21:23,800 And if you try to stress that a little too much, like when you're
02:21:23,800 --> 02:21:28,640 traveling and you're dealing with jet lag, you're not able to really ascend to
02:21:28,640 --> 02:21:29,640 where you need to go.
02:21:29,640 --> 02:21:30,640 Yeah.
02:21:30,640 --> 02:21:31,640 Yeah.
02:21:31,640 --> 02:21:33,240 That's what humans do with their habits and stuff.
02:21:33,240 --> 02:21:38,800 Uh, what are your thoughts on work life balance throughout a human lifetime?
02:21:38,800 --> 02:21:44,680 So Tesla in part was known for sort of pushing people to their limits in terms of what they're
02:21:44,680 --> 02:21:49,800 able to do in terms of what they're trying to do in terms of how much they work, all
02:21:49,800 --> 02:21:50,800 that kind of stuff.
02:21:50,800 --> 02:21:51,800 Yeah.
02:21:51,800 --> 02:21:55,680 And I will say Tesla gets a little too much bad rep for this, because what's happening
02:21:55,680 --> 02:21:57,800 is that Tesla is a bursty environment.
02:21:57,800 --> 02:22:02,600 So my only point of reference is Google, where I've interned
02:22:02,600 --> 02:22:06,160 three times and I saw what it's like inside Google and DeepMind.
02:22:06,160 --> 02:22:10,720 I would say the baseline at Tesla is higher than that, but then there's a punctuated equilibrium
02:22:10,720 --> 02:22:14,980 where once in a while there's a fire and people work really hard.
02:22:14,980 --> 02:22:19,280 And so it's spiky and bursty and then all the stories get collected about the bursts
02:22:19,280 --> 02:22:23,160 and then it gives the appearance of like total insanity, but actually it's just a bit more
02:22:23,160 --> 02:22:24,720 intense environment.
02:22:24,720 --> 02:22:27,160 And there are fires and sprints.
02:22:27,160 --> 02:22:31,900 And so I think, you know, definitely though, I would say it's a more intense environment
02:22:31,900 --> 02:22:36,000 than something you would get in, forget all of that, just in your own personal
02:22:36,000 --> 02:22:37,000 life.
02:22:37,000 --> 02:22:42,600 What do you think about the happiness of a human being, a brilliant person like
02:22:42,600 --> 02:22:49,720 yourself, about finding a balance between work and life, or is that not a good
02:22:49,720 --> 02:22:50,720 thought experiment?
02:22:50,720 --> 02:22:51,720 Yeah.
02:22:51,720 --> 02:22:59,000 I think balance is good, but I also love to have sprints that are out of distribution.
02:22:59,000 --> 02:23:04,820 And that's when I think I've been pretty creative as well.
02:23:04,820 --> 02:23:11,720 Sprints out of distribution means that most of the time you have a quote unquote balance.
02:23:11,720 --> 02:23:16,120 I have balance most of the time, but I like being obsessed with something once in a while.
02:23:16,120 --> 02:23:17,120 Once in a while is what?
02:23:17,120 --> 02:23:18,480 Once a week, once a month, once a year.
02:23:18,480 --> 02:23:19,480 Yeah.
02:23:19,480 --> 02:23:20,480 Probably like say once a month or something.
02:23:20,480 --> 02:23:21,480 Yeah.
02:23:21,480 --> 02:23:22,840 And that's when we get a new GitHub repo.
02:23:22,840 --> 02:23:23,840 Yeah.
02:23:23,840 --> 02:23:25,000 That's when you really care about a problem.
02:23:25,000 --> 02:23:26,000 It must exist.
02:23:26,000 --> 02:23:27,160 This will be awesome.
02:23:27,160 --> 02:23:28,360 You're obsessed with it.
02:23:28,360 --> 02:23:29,920 And now you can't just do it on that day.
02:23:29,920 --> 02:23:33,340 You need to pay the fixed cost of getting into the groove and then you need to stay
02:23:33,340 --> 02:23:37,080 there for awhile and then society will come and they will try to mess with you and they
02:23:37,080 --> 02:23:38,240 will try to distract you.
02:23:38,240 --> 02:23:39,240 Yeah.
02:23:39,240 --> 02:23:41,840 You know, like a person who's like, I just need five minutes of your time.
02:23:41,840 --> 02:23:42,840 Yeah.
02:23:42,840 --> 02:23:47,400 This is the cost of that is not five minutes and society needs to change how it thinks
02:23:47,400 --> 02:23:49,920 about just five minutes of your time.
02:23:49,920 --> 02:23:50,920 Right.
02:23:50,920 --> 02:23:53,120 It's never just one minute.
02:23:53,120 --> 02:23:54,120 It's just 36.
02:23:54,120 --> 02:23:55,120 Just a quick thing.
02:23:55,120 --> 02:23:56,120 What's the big deal?
02:23:56,120 --> 02:23:57,120 Why are you being so?
02:23:57,120 --> 02:23:58,120 Yeah.
02:23:58,120 --> 02:23:59,120 No.
02:23:59,120 --> 02:24:01,120 Uh, what's your computer setup?
02:24:01,120 --> 02:24:05,240 What's like the perfect setup? Are you somebody that's flexible to work no matter
02:24:05,240 --> 02:24:13,040 what, laptop, four screens, or do you prefer a certain setup that you're most productive in?
02:24:13,040 --> 02:24:19,240 Um, I guess the one that I'm familiar with is one large screen, a 27 inch, um, and my
02:24:19,240 --> 02:24:20,540 laptop on the side.
02:24:20,540 --> 02:24:21,680 What operating system?
02:24:21,680 --> 02:24:22,680 I do Mac.
02:24:22,680 --> 02:24:23,840 That's my primary.
02:24:23,840 --> 02:24:24,840 For all tasks.
02:24:24,840 --> 02:24:27,880 I would say OSX, but when you're working on deep learning, everything is Linux.
02:24:27,880 --> 02:24:30,980 You're SSHed into a cluster and you're working remotely.
02:24:30,980 --> 02:24:32,280 But what about the actual development?
02:24:32,280 --> 02:24:33,280 Like, what are you using, an IDE?
02:24:33,280 --> 02:24:37,680 Yeah, I think a good way is you just run VS Code,
02:24:37,680 --> 02:24:40,200 my favorite editor right now, on your Mac.
02:24:40,200 --> 02:24:43,600 But you actually have a remote folder through SSH.
02:24:43,600 --> 02:24:47,600 So the actual files that you're manipulating are on the cluster somewhere else.
02:24:47,600 --> 02:24:52,380 So what's the best IDE? VS Code?
02:24:52,380 --> 02:24:58,440 What else do people use? I use Emacs still, which may or may not be cool.
02:24:58,440 --> 02:25:00,760 I don't know if it's maximum productivity.
02:25:00,760 --> 02:25:06,440 So what do you recommend in terms of editors? You've worked with a lot of software engineers.
02:25:06,440 --> 02:25:11,320 Editors for Python, C++, machine learning applications.
02:25:11,320 --> 02:25:13,820 I think the current answer is VS Code.
02:25:13,820 --> 02:25:16,800 Currently I believe that's the best IDE.
02:25:16,800 --> 02:25:18,440 It's got a huge amount of extensions.
02:25:18,440 --> 02:25:23,120 It has GitHub Copilot integration, which I think is very valuable.
02:25:23,120 --> 02:25:25,600 What do you think about the Copilot integration?
02:25:25,600 --> 02:25:30,880 I actually got to talk a bunch with Guido van Rossum, who's the creator of Python, and he
02:25:30,880 --> 02:25:31,880 loves Copilot.
02:25:31,880 --> 02:25:34,600 He programs a lot with it.
02:25:34,600 --> 02:25:35,600 Yeah.
02:25:35,600 --> 02:25:36,600 Do you?
02:25:36,600 --> 02:25:37,600 Yeah.
02:25:37,600 --> 02:25:38,600 Use Copilot?
02:25:38,600 --> 02:25:39,600 I love it.
02:25:39,600 --> 02:25:40,600 And it's free for me, but I would pay for it.
02:25:40,600 --> 02:25:41,600 Yeah.
02:25:41,600 --> 02:25:42,600 I think it's very good.
02:25:42,600 --> 02:25:45,320 And the utility that I found with it, I would say, is that there is a learning
02:25:45,320 --> 02:25:50,040 curve, and you need to figure out when it's helpful and when to pay attention to its outputs,
02:25:50,040 --> 02:25:53,200 and when it's not going to be helpful and you should not pay attention to it.
02:25:53,200 --> 02:25:56,040 Because if you're just reading its suggestions all the time, it's not a good way of interacting
02:25:56,040 --> 02:25:57,040 with it.
02:25:57,040 --> 02:25:58,880 But I think I was able to sort of like mold myself to it.
02:25:58,880 --> 02:26:00,100 I find it's very helpful.
02:26:00,100 --> 02:26:03,160 Number one, when I copy-paste and replace some parts.
02:26:03,160 --> 02:26:07,920 When the pattern is clear, it's really good at completing the pattern.
02:26:07,920 --> 02:26:11,240 And number two, sometimes it suggests APIs that I'm not aware of.
02:26:11,240 --> 02:26:14,240 So it tells you about something that you didn't know.
02:26:14,240 --> 02:26:17,980 So that's an opportunity to discover something new. I would never take
02:26:17,980 --> 02:26:19,560 Copilot code as given.
02:26:19,560 --> 02:26:23,760 I almost always copy-paste it into a Google search and see what this function
02:26:23,760 --> 02:26:24,760 is doing.
02:26:24,760 --> 02:26:26,760 And then you're like, oh, it's actually exactly what I need.
02:26:26,760 --> 02:26:27,760 Thank you,
02:26:27,760 --> 02:26:28,760 Copilot.
02:26:28,760 --> 02:26:29,760 So you learned something.
02:26:29,760 --> 02:26:34,040 So it's in part a search engine, and in part maybe getting the exact syntax correct, so that
02:26:34,040 --> 02:26:37,720 once you see it... it's that NP-hard thing.
02:26:37,720 --> 02:26:42,440 Once you see it, you know it's correct, but you yourself struggle to generate it.
02:26:42,440 --> 02:26:45,780 You can verify efficiently, but you can't generate efficiently.
02:26:45,780 --> 02:26:49,720 And Copilot really, I mean, it's autopilot for programming, right?
02:26:49,720 --> 02:26:53,440 And currently it's doing the lane following, which is like the simple copy-paste and sometimes
02:26:53,440 --> 02:26:54,440 suggestions.
02:26:54,440 --> 02:26:57,360 Uh, but over time it's going to become more and more autonomous.
02:26:57,360 --> 02:27:01,160 And so the same thing will play out in not just coding, but actually across many, many
02:27:01,160 --> 02:27:02,160 different things.
02:27:02,160 --> 02:27:04,440 Probably coding is an important one, right?
02:27:04,440 --> 02:27:05,440 Writing programs.
02:27:05,440 --> 02:27:06,440 Yeah.
02:27:06,440 --> 02:27:10,440 What, how do you see the future of that developing, uh, the program synthesis, like being able
02:27:10,440 --> 02:27:12,960 to write programs that are more and more complicated?
02:27:12,960 --> 02:27:18,000 Cause right now it's human supervised in interesting ways.
02:27:18,000 --> 02:27:19,000 Yes.
02:27:19,000 --> 02:27:22,160 Like, it feels like the transition will be very painful.
02:27:22,160 --> 02:27:25,680 My mental model for it is the same thing will happen as with the autopilot.
02:27:25,680 --> 02:27:30,080 So currently it's doing lane following, it's doing some simple stuff, and eventually
02:27:30,080 --> 02:27:33,360 it'll be doing autonomy and people will have to intervene less and less.
02:27:33,360 --> 02:27:39,320 And there could be, like, testing mechanisms, like if it writes a function and that function
02:27:39,320 --> 02:27:43,200 looks pretty damn correct, but how do you know it's correct?
02:27:43,200 --> 02:27:47,280 Because you're getting lazier and lazier as a programmer, like your ability to catch
02:27:47,280 --> 02:27:50,640 little bugs, but I guess it won't make little mistakes.
02:27:50,640 --> 02:27:51,640 No, it will.
02:27:51,640 --> 02:27:54,840 Copilot will make subtle off-by-one bugs.
02:27:54,840 --> 02:27:55,840 It has done that to me.
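(To make the off-by-one point concrete, here is a made-up illustration of the kind of subtle bug a code completion can slip in; the function and values are hypothetical, not actual Copilot output.)

```python
def moving_sum(xs, window):
    """Sum of every length-`window` slice of xs."""
    sums = []
    # Plausible-looking but buggy bound: range(len(xs) - window) stops one
    # window too early, silently dropping the final slice.
    for i in range(len(xs) - window):
        sums.append(sum(xs[i:i + window]))
    return sums

def moving_sum_fixed(xs, window):
    sums = []
    for i in range(len(xs) - window + 1):  # correct bound includes the last slice
        sums.append(sum(xs[i:i + window]))
    return sums

print(moving_sum([1, 2, 3, 4], 2))        # [3, 5]    -- missing the final 7
print(moving_sum_fixed([1, 2, 3, 4], 2))  # [3, 5, 7]
```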
02:27:55,840 --> 02:28:01,320 But do you think future systems will, or is it really the off by one is actually a fundamental
02:28:01,320 --> 02:28:03,160 challenge of programming?
02:28:03,160 --> 02:28:07,160 In that case, it wasn't fundamental and I think things can improve, but, uh, yeah, I
02:28:07,160 --> 02:28:08,440 think humans have to supervise.
02:28:08,440 --> 02:28:12,880 I am nervous about people not supervising what comes out and what happens to, for example,
02:28:12,880 --> 02:28:15,380 the proliferation of bugs in all of our systems.
02:28:15,380 --> 02:28:19,120 I'm nervous about that, but I think there will probably be some other copilots for bug
02:28:19,120 --> 02:28:22,820 finding and stuff like that at some point, because there'll be a lot more automation
02:28:22,820 --> 02:28:26,160 for... man.
02:28:26,160 --> 02:28:33,100 It's like a program, a copilot that generates, a compiler, one that does a linter, one that
02:28:33,100 --> 02:28:36,360 does like a type checker.
02:28:36,360 --> 02:28:37,840 Yeah.
02:28:37,840 --> 02:28:42,000 It's a committee of GPTs, sort of, and then there'll be like a manager for the
02:28:42,000 --> 02:28:43,000 committee.
02:28:43,000 --> 02:28:44,000 Yeah.
02:28:44,000 --> 02:28:45,600 And then there'll be somebody that says a new version of this is needed.
02:28:45,600 --> 02:28:47,120 We need to regenerate it.
02:28:47,120 --> 02:28:48,120 Yeah.
02:28:48,120 --> 02:28:50,400 There were 10 GPTs that did forward passes and gave 50 suggestions.
02:28:50,400 --> 02:28:53,040 Another one looked at it and picked a few that they like.
02:28:53,040 --> 02:28:55,760 A bug-finding one looked at it and was like, this is probably a bug.
02:28:55,760 --> 02:28:57,560 They got re-ranked by some other thing.
02:28:57,560 --> 02:29:01,400 And then a final ensemble, a GPT comes in and it's like, okay, given everything you
02:29:01,400 --> 02:29:04,280 guys have told me, this is probably the next token.
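(A rough sketch of what that kind of committee pipeline could look like in code. Everything here is hypothetical: the function names are made up and each one is a stub standing in for a real model call.)

```python
import random

# Hypothetical sketch of the "committee of GPTs" idea: several generators
# propose candidates, specialist critics score them, and a manager picks.

def generate(prompt, seed):
    # stand-in for a generator model producing one candidate per seed
    random.seed(seed)
    return f"{prompt} :: candidate_{seed} (noise={random.random():.2f})"

def bug_critic(candidate):
    # stand-in for a bug-finding model: returns a penalty score
    return 0.5 if "candidate_3" in candidate else 0.0

def style_critic(candidate):
    # stand-in for a linter / type-checker style critic
    return 0.1 * (len(candidate) % 3)

def manager(prompt, n_candidates=10, keep=3):
    candidates = [generate(prompt, s) for s in range(n_candidates)]
    # re-rank: lower combined penalty is better
    ranked = sorted(candidates, key=lambda c: bug_critic(c) + style_critic(c))
    shortlist = ranked[:keep]
    # final ensemble step: here, trivially pick the top of the shortlist
    return shortlist[0], shortlist

best, shortlist = manager("write a sort function")
print(best)
```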
02:29:04,280 --> 02:29:07,320 You know, the feeling is the number of programmers in the world has been growing and growing
02:29:07,320 --> 02:29:08,320 very quickly.
02:29:08,320 --> 02:29:13,560 Do you think it's possible that it'll actually level out and drop to like a very low number
02:29:13,560 --> 02:29:14,600 with this kind of world?
02:29:14,600 --> 02:29:22,400 Because then you'd be doing software 2.0 programming and you'll be doing this kind of generation
02:29:22,400 --> 02:29:28,640 of copilot type systems programming, but you won't be doing the old school software 1.0
02:29:28,640 --> 02:29:29,640 programming.
02:29:29,640 --> 02:29:35,240 I don't currently think that they're just going to replace human programmers.
02:29:35,240 --> 02:29:40,240 I'm so hesitant saying stuff like this because this is going to be replaced in five years.
02:29:40,240 --> 02:29:46,000 I know it's going to show that this is where we thought, because I agree with you, but
02:29:46,000 --> 02:29:49,120 I think we might be very surprised.
02:29:49,120 --> 02:29:55,400 What's your sense of where we stand with language models?
02:29:55,400 --> 02:29:57,920 Does it feel like the beginning or the middle or the end?
02:29:57,920 --> 02:29:59,440 The beginning, 100%.
02:29:59,440 --> 02:30:03,080 I think the big question in my mind is for sure GPT will be able to program quite well
02:30:03,080 --> 02:30:04,360 competently and so on.
02:30:04,360 --> 02:30:05,920 How do you steer the system?
02:30:05,920 --> 02:30:09,680 You still have to provide some guidance to what you actually are looking for.
02:30:09,680 --> 02:30:12,920 How do you steer it and how do you talk to it?
02:30:12,920 --> 02:30:17,120 How do you audit it and verify that what is done is correct?
02:30:17,120 --> 02:30:18,680 How do you work with this?
02:30:18,680 --> 02:30:24,280 It's not just an AI problem, but as much a UI/UX problem.
02:30:24,280 --> 02:30:30,400 Beautiful fertile ground for so much interesting work for VS Code++ where it's not just human
02:30:30,400 --> 02:30:31,400 programming anymore.
02:30:31,400 --> 02:30:32,400 It's amazing.
02:30:32,400 --> 02:30:37,000 You're interacting with the system, so not just one prompt, but it's iterative prompting.
02:30:37,000 --> 02:30:41,160 You're trying to figure out having a conversation with the system.
02:30:41,160 --> 02:30:46,200 To me, that's super exciting to have a conversation with the program I'm writing.
02:30:46,200 --> 02:30:48,200 Maybe at some point you're just conversing with it.
02:30:48,200 --> 02:30:50,120 It's like, okay, here's what I want to do.
02:30:50,120 --> 02:30:54,160 Actually, this variable, maybe it's not even that low level as a variable.
02:30:54,160 --> 02:30:59,440 You can also imagine, like, can you translate this to C++ and back to Python?
02:30:59,440 --> 02:31:04,400 It already exists in some ways, but just doing it as part of the programming experience.
02:31:04,400 --> 02:31:10,960 I think, I'd like to write this function in C++, or you just keep changing between different
02:31:10,960 --> 02:31:13,780 languages because there's different syntax.
02:31:13,780 --> 02:31:17,280 Maybe I want to convert this into a functional language.
02:31:17,280 --> 02:31:23,200 You get to become multilingual as a programmer and dance back and forth efficiently.
02:31:23,200 --> 02:31:27,340 I think the UI UX of it though is still very hard to think through because it's not just
02:31:27,340 --> 02:31:29,640 about writing code on a page.
02:31:29,640 --> 02:31:31,480 You have an entire developer environment.
02:31:31,480 --> 02:31:33,280 You have a bunch of hardware on it.
02:31:33,280 --> 02:31:34,600 You have some environment variables.
02:31:34,600 --> 02:31:36,760 You have some scripts that are running in a cron job.
02:31:36,760 --> 02:31:43,000 There's a lot that goes into working with computers, and how do these systems set up environment
02:31:43,000 --> 02:31:47,680 flags and work across multiple machines and set up screen sessions and automate different
02:31:47,680 --> 02:31:48,680 processes.
02:31:48,680 --> 02:31:53,560 How all that works and is auditable by humans and so on is a massive question at the moment.
02:31:53,560 --> 02:31:59,340 You've built arXiv-sanity. What is arXiv and what is the future of academic research
02:31:59,340 --> 02:32:02,440 publishing that you would like to see?
02:32:02,440 --> 02:32:03,960 arXiv is this pre-print server.
02:32:03,960 --> 02:32:07,960 If you have a paper, you can submit it for publication to journals or conferences and
02:32:07,960 --> 02:32:11,680 then wait six months and then maybe get a decision, pass or fail, or you can just upload
02:32:11,680 --> 02:32:13,600 it to arXiv.
02:32:13,600 --> 02:32:17,160 Then people can tweet about it three minutes later and then everyone sees it, everyone
02:32:17,160 --> 02:32:20,500 reads it and everyone can profit from it in their own little ways.
02:32:20,500 --> 02:32:23,920 You can cite it and it has an official look to it.
02:32:23,920 --> 02:32:27,600 It feels like a publication process.
02:32:27,600 --> 02:32:29,880 It feels different than if you just put it in a blog post.
02:32:29,880 --> 02:32:30,880 Yeah.
02:32:30,880 --> 02:32:34,960 I mean, it's a paper and usually the bar is higher for something that you would expect
02:32:34,960 --> 02:32:38,240 on arXiv as opposed to something you would see in a blog post.
02:32:38,240 --> 02:32:46,000 The culture created the bar, because you could probably post a pretty crappy paper on arXiv.
02:32:46,000 --> 02:32:47,000 What's that make you feel?
02:32:47,000 --> 02:32:49,880 What's that make you feel about peer review?
02:32:49,880 --> 02:32:57,280 This peer review by two, three experts versus the peer review of the community right as
02:32:57,280 --> 02:32:58,280 it's written?
02:32:58,280 --> 02:32:59,280 Yeah.
02:32:59,280 --> 02:33:02,720 Basically, I think the community is very well able to peer review things very quickly on
02:33:02,720 --> 02:33:04,280 Twitter.
02:33:04,280 --> 02:33:07,600 I think maybe it just has to do something with AI machine learning field specifically
02:33:07,600 --> 02:33:08,600 though.
02:33:08,600 --> 02:33:13,980 I feel like things are more easily auditable and the verification is easier potentially
02:33:13,980 --> 02:33:17,120 than the verification somewhere else.
02:33:17,120 --> 02:33:20,800 You can think of these scientific publications as little blockchains where everyone's building
02:33:20,800 --> 02:33:22,640 on each other's work and citing each other.
02:33:22,640 --> 02:33:29,080 You have AI, which is this much faster and looser blockchain, where any one
02:33:29,080 --> 02:33:33,880 individual entry is very cheap to make, and then you have other fields where maybe that
02:33:33,880 --> 02:33:37,160 model doesn't make as much sense.
02:33:37,160 --> 02:33:40,480 I think in AI, at least, things are pretty easily verifiable.
02:33:40,480 --> 02:33:43,560 That's why when people upload papers that are a really good idea and so on,
02:33:43,560 --> 02:33:47,700 People can try it out the next day and they can be the final arbiter of whether it works
02:33:47,700 --> 02:33:51,640 or not on their problem and the whole thing just moves significantly faster.
02:33:51,640 --> 02:33:55,800 I feel like academia still has a place, sorry, this conference journal process still has
02:33:55,800 --> 02:34:01,840 a place, but it lags behind, I think, and it's maybe a bit higher quality
02:34:01,840 --> 02:34:07,420 process, but it's not the place where you will discover cutting edge work anymore.
02:34:07,420 --> 02:34:10,760 It used to be the case when I was starting my PhD that you go to conferences and journals
02:34:10,760 --> 02:34:12,620 and you discuss all the latest research.
02:34:12,620 --> 02:34:16,120 Now when you go to a conference or a journal, no one discusses anything that's there because
02:34:16,120 --> 02:34:18,520 it's already three generations ago irrelevant.
02:34:18,520 --> 02:34:24,880 Yeah, which makes me sad about DeepMind, for example, where they still publish in Nature
02:34:24,880 --> 02:34:29,160 and these big prestigious, I mean, there's still value, I suppose, to the prestige that
02:34:29,160 --> 02:34:34,700 comes with these big venues, but the result is that they'll announce some breakthrough
02:34:34,700 --> 02:34:41,160 performance and it will take a year to actually publish the details.
02:34:41,160 --> 02:34:45,520 And those details, if they were published immediately, would inspire the community to
02:34:45,520 --> 02:34:46,520 move in certain directions.
02:34:46,520 --> 02:34:50,320 Yeah, it would speed up the rest of the community, but I don't know to what extent that's part
02:34:50,320 --> 02:34:51,920 of their objective function also.
02:34:51,920 --> 02:34:52,920 That's true.
02:34:52,920 --> 02:34:56,800 So it's not just the prestige, a little bit of the delay is part of it.
02:34:56,800 --> 02:35:02,240 Yeah, DeepMind specifically has certainly been working in the regime of having a slightly
02:35:02,240 --> 02:35:07,440 higher quality process, with more latency, and publishing those papers that way.
02:35:07,440 --> 02:35:12,680 Another question from Reddit, do you or have you suffered from imposter syndrome?
02:35:12,680 --> 02:35:18,480 Being the director of AI at Tesla, being this person when you're at Stanford where like
02:35:18,480 --> 02:35:25,600 the world looks at you as the expert in AI to teach the world about machine learning.
02:35:25,600 --> 02:35:30,400 When I was leaving Tesla after five years, I spent a ton of time in meeting rooms and
02:35:30,400 --> 02:35:31,400 I would read papers.
02:35:31,400 --> 02:35:34,720 In the beginning, when I joined Tesla, I was writing code and then I was writing less and
02:35:34,720 --> 02:35:37,840 less code and I was reading code and then I was reading less and less code.
02:35:37,840 --> 02:35:40,400 And so this is just a natural progression that happens, I think.
02:35:40,400 --> 02:35:43,840 And definitely I would say near the tail end, that's when it sort of like starts to hit
02:35:43,840 --> 02:35:47,520 you a bit more that you're supposed to be an expert, but actually the source of truth
02:35:47,520 --> 02:35:52,160 is the code that people are writing, the GitHub and the actual code itself.
02:35:52,160 --> 02:35:54,560 And you're not as familiar with that as you used to be.
02:35:54,560 --> 02:35:57,040 And so I would say maybe there's some like insecurity there.
02:35:57,040 --> 02:36:00,960 Yeah, that's actually pretty profound that a lot of the insecurity has to do with not
02:36:00,960 --> 02:36:06,040 writing the code in the computer science space, like that, because that is the truth.
02:36:06,040 --> 02:36:07,040 Code is the source of truth.
02:36:07,040 --> 02:36:09,880 The papers and everything else, it's a high level summary.
02:36:09,880 --> 02:36:13,680 I don't, yeah, it's just a high level summary, but at the end of the day, you have to read
02:36:13,680 --> 02:36:14,680 code.
02:36:14,680 --> 02:36:18,680 It's impossible to translate all that code into actual, you know, paper form.
02:36:18,680 --> 02:36:22,080 So when things come out, especially when they have a source code available, that's my favorite
02:36:22,080 --> 02:36:23,240 place to go.
02:36:23,240 --> 02:36:28,960 So like I said, you're one of the greatest teachers of machine learning, AI, ever.
02:36:28,960 --> 02:36:34,720 From CS231N to today, what advice would you give to beginners interested in getting into
02:36:34,720 --> 02:36:36,760 machine learning?
02:36:36,760 --> 02:36:40,580 Beginners are often focused on like what to do.
02:36:40,580 --> 02:36:43,480 And I think the focus should be more like how much you do.
02:36:43,480 --> 02:36:48,280 So I am kind of like believer on a high level in this 10,000 hours kind of concept where
02:36:48,280 --> 02:36:51,400 you just kind of have to just pick the things where you can spend time and you care about
02:36:51,400 --> 02:36:52,400 and you're interested in.
02:36:52,400 --> 02:36:55,120 You literally have to put in 10,000 hours of work.
02:36:55,120 --> 02:36:59,520 It doesn't even matter as much where you put it and you'll iterate and you'll improve and
02:36:59,520 --> 02:37:00,520 you'll waste some time.
02:37:00,520 --> 02:37:01,920 I don't know if there's a better way.
02:37:01,920 --> 02:37:03,660 You need to put in 10,000 hours.
02:37:03,660 --> 02:37:06,360 But I think it's actually really nice because I feel like there's some sense of determinism
02:37:06,360 --> 02:37:10,040 about being an expert at a thing if you spend 10,000 hours.
02:37:10,040 --> 02:37:12,640 You can literally pick an arbitrary thing.
02:37:12,640 --> 02:37:16,320 And I think if you spend 10,000 hours of deliberate effort and work, you actually will become
02:37:16,320 --> 02:37:17,840 an expert at it.
02:37:17,840 --> 02:37:21,680 And so I think that's kind of like a nice thought.
02:37:21,680 --> 02:37:25,640 And so basically I would focus more on like, are you spending 10,000 hours?
02:37:25,640 --> 02:37:27,300 That's what I would focus on.
02:37:27,300 --> 02:37:32,520 So and then thinking about what kind of mechanisms maximize your likelihood of getting to 10,000
02:37:32,520 --> 02:37:33,520 hours.
02:37:33,520 --> 02:37:34,520 Exactly.
02:37:34,520 --> 02:37:39,440 Which for us silly humans means probably forming a daily habit of like every single day actually
02:37:39,440 --> 02:37:40,440 doing the thing.
02:37:40,440 --> 02:37:41,440 Whatever helps you.
02:37:41,440 --> 02:37:44,840 So I do think to a large extent it's a psychological problem for yourself.
02:37:44,840 --> 02:37:48,860 One other thing that I think is helpful for the psychology of it is many times people
02:37:48,860 --> 02:37:50,760 compare themselves to others in the area.
02:37:50,760 --> 02:37:52,480 I think this is very harmful.
02:37:52,480 --> 02:37:56,200 Only compare yourself to you from some time ago, like say a year ago.
02:37:56,200 --> 02:38:00,280 Are you better than you a year ago is the only way to think.
02:38:00,280 --> 02:38:03,640 And I think this, then you can see your progress and it's very motivating.
02:38:03,640 --> 02:38:09,400 That's so interesting that focus on the quantity of hours because I think a lot of people in
02:38:09,400 --> 02:38:15,640 the beginner stage, but actually throughout get paralyzed by the choice.
02:38:15,640 --> 02:38:19,080 Like which one do I pick this path or this path?
02:38:19,080 --> 02:38:22,680 Like they'll literally get paralyzed by, like, which IDE to use.
02:38:22,680 --> 02:38:23,680 Well they're worried.
02:38:23,680 --> 02:38:24,680 Yeah.
02:38:24,680 --> 02:38:25,680 They're worried about all these things.
02:38:25,680 --> 02:38:28,560 But the thing is, you will waste some time doing something wrong.
02:38:28,560 --> 02:38:30,040 You will eventually figure out it's not right.
02:38:30,040 --> 02:38:34,080 You will accumulate scar tissue and next time you will grow stronger because next time you'll
02:38:34,080 --> 02:38:36,800 have the scar tissue and next time you'll learn from it.
02:38:36,800 --> 02:38:41,520 And now next time you come to a similar situation, you'll be like, Oh, I messed up.
02:38:41,520 --> 02:38:45,360 I've spent a lot of time working on things that never materialize into anything.
02:38:45,360 --> 02:38:48,760 And I have all that scar tissue and I have some intuitions about what was useful, what
02:38:48,760 --> 02:38:50,680 wasn't useful, how things turned out.
02:38:50,680 --> 02:38:54,080 So all those mistakes were not dead work.
02:38:54,080 --> 02:38:56,720 So I just think you should just focus on working.
02:38:56,720 --> 02:38:57,720 What have you done?
02:38:57,720 --> 02:39:00,840 What have you done last week?
02:39:00,840 --> 02:39:05,720 That's a good question actually to ask for a lot of things, not just machine learning.
02:39:05,720 --> 02:39:10,840 It's a good way to cut the, I forgot what the term we use, but the fluff, the blubber,
02:39:10,840 --> 02:39:15,080 whatever the inefficiencies in life.
02:39:15,080 --> 02:39:17,220 What do you love about teaching?
02:39:17,220 --> 02:39:21,760 You seem to find yourself often, like, drawn to teaching.
02:39:21,760 --> 02:39:23,480 You're very good at it, but you're also drawn to it.
02:39:23,480 --> 02:39:25,200 I mean, I don't think I love teaching.
02:39:25,200 --> 02:39:32,280 I love happy humans and happy humans like when I teach, I wouldn't say I hate teaching.
02:39:32,280 --> 02:39:35,660 I tolerate teaching, but it's not like the act of teaching that I like.
02:39:35,660 --> 02:39:41,200 It's that, you know, I have something I'm actually okay at.
02:39:41,200 --> 02:39:44,160 I'm okay at teaching and people appreciate it a lot.
02:39:44,160 --> 02:39:47,440 And so I'm just happy to try to be helpful.
02:39:47,440 --> 02:39:50,840 And teaching itself is not like the most, I mean, it's really annoying.
02:39:50,840 --> 02:39:52,640 It can be really annoying, frustrating.
02:39:52,640 --> 02:39:54,480 I was working on a bunch of lectures just now.
02:39:54,480 --> 02:39:58,840 I was reminded back to my days of 231 and just how much work it is to create some of
02:39:58,840 --> 02:40:02,280 these materials and make them good, the amount of iteration and thought, and you go down
02:40:02,280 --> 02:40:04,900 blind alleys and just how much you change it.
02:40:04,900 --> 02:40:10,400 So creating something good in terms of like educational value is really hard and it's
02:40:10,400 --> 02:40:11,400 not fun.
02:40:11,400 --> 02:40:12,440 It was difficult.
02:40:12,440 --> 02:40:15,160 So people should definitely go watch your new stuff.
02:40:15,160 --> 02:40:19,040 You put out these lectures where you're actually building the thing from scratch, like
02:40:19,040 --> 02:40:20,880 you said, the code is truth.
02:40:20,880 --> 02:40:26,080 So discussing backpropagation by building it, by walking through it, just the whole
02:40:26,080 --> 02:40:27,080 thing.
02:40:27,080 --> 02:40:28,080 So how difficult is that to prepare for?
02:40:28,080 --> 02:40:30,480 I think that's a really powerful way to teach.
02:40:30,480 --> 02:40:34,440 Did you have to prepare for that or are you just live thinking through it?
02:40:34,440 --> 02:40:38,800 I will typically do like say three takes and then I take like the better take.
02:40:38,800 --> 02:40:41,560 So I do multiple takes and I take some of the better takes and then I just build out
02:40:41,560 --> 02:40:43,360 a lecture that way.
02:40:43,360 --> 02:40:46,800 Sometimes I have to delete 30 minutes of content because it just went down an alley that I
02:40:46,800 --> 02:40:47,800 didn't like too much.
02:40:47,800 --> 02:40:52,680 So there's a bunch of iteration and it probably takes me somewhere around 10 hours to create
02:40:52,680 --> 02:40:53,680 one hour of content.
02:40:53,680 --> 02:40:54,880 To get one hour.
02:40:54,880 --> 02:40:55,880 It's interesting.
02:40:55,880 --> 02:40:59,000 I mean, is it difficult to go back to the basics?
02:40:59,000 --> 02:41:02,440 Do you draw a lot of wisdom from going back to the basics?
02:41:02,440 --> 02:41:03,440 Yeah.
02:41:03,440 --> 02:41:05,300 Going back to backpropagation, loss functions, where they come from.
02:41:05,300 --> 02:41:09,280 And one thing I like about teaching a lot honestly is it definitely strengthens your
02:41:09,280 --> 02:41:10,440 understanding.
02:41:10,440 --> 02:41:12,720 So it's not a purely altruistic activity.
02:41:12,720 --> 02:41:13,860 It's a way to learn.
02:41:13,860 --> 02:41:19,600 If you have to explain something to someone, you realize you have gaps in knowledge.
02:41:19,600 --> 02:41:22,420 And so I even surprised myself in those lectures.
02:41:22,420 --> 02:41:25,640 Like, oh, the result will obviously look like this, and then the result doesn't look like
02:41:25,640 --> 02:41:26,640 it.
02:41:26,640 --> 02:41:28,600 And I'm like, okay, I thought I understood this.
02:41:28,600 --> 02:41:32,680 That's why it's really cool to literally code.
02:41:32,680 --> 02:41:36,880 You run it in a notebook and it gives you a result and you're like, oh, wow.
02:41:36,880 --> 02:41:39,920 And like actual numbers, actual input, actual code.
02:41:39,920 --> 02:41:41,600 It's not mathematical symbols, et cetera.
02:41:41,600 --> 02:41:43,120 The source of truth is the code.
02:41:43,120 --> 02:41:44,400 It's not slides.
02:41:44,400 --> 02:41:45,880 It's just like, let's build it.
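(In that spirit, here is a tiny scalar backpropagation example of the kind you could build and run in a notebook. This is a toy sketch written for this transcript, inspired by the build-it-from-scratch style of the lectures; it is not the actual lecture code, and the class and variable names are made up.)

```python
import math

class Value:
    """A toy scalar autograd node: holds a number, its gradient, and how to
    push gradients back to its inputs."""
    def __init__(self, data, children=()):
        self.data = data
        self.grad = 0.0
        self._backward = lambda: None
        self._prev = set(children)

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other))
        def _backward():
            self.grad += out.grad
            other.grad += out.grad
        out._backward = _backward
        return out

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other))
        def _backward():
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward
        return out

    def tanh(self):
        t = math.tanh(self.data)
        out = Value(t, (self,))
        def _backward():
            self.grad += (1 - t * t) * out.grad
        out._backward = _backward
        return out

    def backward(self):
        # topologically sort the graph, then apply the chain rule node by node
        topo, visited = [], set()
        def build(v):
            if v not in visited:
                visited.add(v)
                for child in v._prev:
                    build(child)
                topo.append(v)
        build(self)
        self.grad = 1.0
        for node in reversed(topo):
            node._backward()

# a two-input neuron: out = tanh(x1*w1 + x2*w2 + b)
x1, x2 = Value(2.0), Value(0.0)
w1, w2, b = Value(-3.0), Value(1.0), Value(6.88137)
out = (x1 * w1 + x2 * w2 + b).tanh()
out.backward()
print(out.data, x1.grad, w1.grad)  # actual numbers you can check by hand
```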
02:41:45,880 --> 02:41:46,880 It's beautiful.
02:41:46,880 --> 02:41:48,900 You're a rare human in that sense.
02:41:48,900 --> 02:41:54,600 What advice would you give to researchers trying to develop and publish ideas that have
02:41:54,600 --> 02:41:56,920 a big impact in the world of AI?
02:41:56,920 --> 02:42:01,680 So maybe undergrads, maybe early graduate students.
02:42:01,680 --> 02:42:02,680 Yeah.
02:42:02,680 --> 02:42:06,320 I mean, I would say like they definitely have to be a little bit more strategic than I had
02:42:06,320 --> 02:42:09,820 to be as a PhD student because of the way AI is evolving.
02:42:09,820 --> 02:42:14,000 It's going the way of physics where in physics, you used to be able to do experiments on your
02:42:14,000 --> 02:42:17,100 bench top and everything was great and you could make progress.
02:42:17,100 --> 02:42:21,240 And now you have to work at, like, the LHC or CERN.
02:42:21,240 --> 02:42:23,920 And so AI is going in that direction as well.
02:42:23,920 --> 02:42:28,440 So there's certain kinds of things that's just not possible to do on the bench top anymore.
02:42:28,440 --> 02:42:32,560 And I think that didn't used to be the case at the time.
02:42:32,560 --> 02:42:41,280 Do you still think that there are, like, GAN-type papers to be written, where a very simple
02:42:41,280 --> 02:42:44,320 idea requires just one computer to illustrate a simple example?
02:42:44,320 --> 02:42:48,040 I mean, one example that's been very influential recently is diffusion models.
02:42:48,040 --> 02:42:49,400 Diffusion models are amazing.
02:42:49,400 --> 02:42:51,840 Diffusion models are six years old.
02:42:51,840 --> 02:42:55,240 For the longest time, people were kind of ignoring them as far as I can tell.
02:42:55,240 --> 02:42:59,040 And they're an amazing generative model, especially in images.
02:42:59,040 --> 02:43:01,760 And so stable diffusion and so on, it's all diffusion based.
02:43:01,760 --> 02:43:02,960 But diffusion is new.
02:43:02,960 --> 02:43:07,040 It was not there and came from, well, it came from Google, but a researcher could have come
02:43:07,040 --> 02:43:08,040 up with it.
02:43:08,040 --> 02:43:11,940 In fact, some of the first, actually, no, those came from Google as well.
02:43:11,940 --> 02:43:14,560 But a researcher could come up with that in an academic institution.
02:43:14,560 --> 02:43:15,560 Yeah.
02:43:15,560 --> 02:43:18,000 What do you find most fascinating about diffusion models?
02:43:18,000 --> 02:43:22,760 So from the societal impact of the technical architecture.
02:43:22,760 --> 02:43:25,720 What I like about diffusion is it works so well.
02:43:25,720 --> 02:43:26,960 Is that surprising to you?
02:43:26,960 --> 02:43:32,240 The amount of variety, almost the novelty, of the synthetic data it's generating.
02:43:32,240 --> 02:43:33,240 Yeah.
02:43:33,240 --> 02:43:36,440 So the stable diffusion images are incredible.
02:43:36,440 --> 02:43:41,040 The speed of improvement in generating images has been insane.
02:43:41,040 --> 02:43:45,100 We went very quickly from generating like tiny digits to tiny faces and it all looked
02:43:45,100 --> 02:43:46,100 messed up.
02:43:46,100 --> 02:43:48,240 And now we have stable diffusion and that happened very quickly.
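(For reference, the forward noising process at the heart of these models is simple enough to write in a few lines. This is a toy DDPM-style sketch on made-up 1-D data; the schedule constants are typical illustrative defaults, not any particular paper's or library's code.)

```python
import numpy as np

rng = np.random.default_rng(0)

T = 1000
betas = np.linspace(1e-4, 0.02, T)   # noise schedule
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)       # cumulative product of (1 - beta_t)

def q_sample(x0, t, noise):
    """Sample x_t ~ q(x_t | x_0): a progressively noisier version of x0."""
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * noise

x0 = rng.standard_normal(8)          # pretend this is an image
for t in (0, 250, 999):
    print(t, np.round(q_sample(x0, t, rng.standard_normal(8)), 2))

# Training a diffusion model then amounts to learning a network that, given
# (x_t, t), predicts the noise that was added; generation runs this process
# in reverse, starting from pure noise.
```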
02:43:48,240 --> 02:43:49,960 There's a lot that academia can still contribute.
02:43:49,960 --> 02:43:55,900 You know, for example, FlashAttention is a very efficient kernel for running the attention
02:43:55,900 --> 02:43:59,760 operation inside a transformer, and that came from an academic environment.
02:43:59,760 --> 02:44:03,880 It's a very clever way to structure the kernel that does the calculation.
02:44:03,880 --> 02:44:07,240 So it doesn't materialize the attention matrix.
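(A minimal sketch of that idea in plain NumPy: stream over key/value blocks and keep a running softmax per query row, so the full n-by-n attention matrix is never formed. The function name is made up and this only shows the high-level trick, not the fused GPU kernel that FlashAttention actually is.)

```python
import numpy as np

def blocked_attention(Q, K, V, block=64):
    """Toy single-head attention that never materializes the full (n x n)
    attention matrix: it processes key/value blocks with a running softmax."""
    n, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    out = np.zeros_like(Q)
    row_max = np.full(n, -np.inf)   # running max of the logits per query row
    row_sum = np.zeros(n)           # running softmax normalizer per query row
    for start in range(0, n, block):
        Kb, Vb = K[start:start + block], V[start:start + block]
        logits = (Q @ Kb.T) * scale                 # only an (n, block) tile
        new_max = np.maximum(row_max, logits.max(axis=1))
        correction = np.exp(row_max - new_max)      # rescale what's accumulated so far
        p = np.exp(logits - new_max[:, None])
        row_sum = row_sum * correction + p.sum(axis=1)
        out = out * correction[:, None] + p @ Vb
        row_max = new_max
    return out / row_sum[:, None]

# sanity check against the naive version that builds the full matrix
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((128, 32)) for _ in range(3))
weights = np.exp((Q @ K.T) / np.sqrt(32))
naive = (weights / weights.sum(axis=1, keepdims=True)) @ V
assert np.allclose(blocked_attention(Q, K, V), naive)
print("matches naive attention")
```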
02:44:07,240 --> 02:44:09,960 And so I think there's still like lots of things to contribute, but you have to be just
02:44:09,960 --> 02:44:10,960 more strategic.
02:44:10,960 --> 02:44:13,960 Do you think neural networks can be made to reason?
02:44:13,960 --> 02:44:14,960 Yes.
02:44:14,960 --> 02:44:17,640 Do you think they already reason?
02:44:17,640 --> 02:44:18,640 Yes.
02:44:18,640 --> 02:44:21,520 What's your definition of reasoning?
02:44:21,520 --> 02:44:22,520 Information processing.
02:44:22,520 --> 02:44:32,040 So in the way that humans think through a problem and come up with novel ideas, it feels
02:44:32,040 --> 02:44:34,080 like reasoning.
02:44:34,080 --> 02:44:43,400 So the novelty, I don't want to say, but out of distribution ideas, you think it's possible?
02:44:43,400 --> 02:44:44,400 Yes.
02:44:44,400 --> 02:44:46,560 And I think we're seeing that already in the current neural nets.
02:44:46,560 --> 02:44:51,280 You're able to remix the training set information into true generalization in some sense.
02:44:51,280 --> 02:44:54,800 That doesn't appear in a fundamental way in the training set.
02:44:54,800 --> 02:44:56,560 Like you're doing something interesting algorithmically.
02:44:56,560 --> 02:45:03,240 You're manipulating some symbols and you're coming up with some correct, unique answer
02:45:03,240 --> 02:45:05,000 in a new setting.
02:45:05,000 --> 02:45:11,300 What would illustrate to you, holy shit, this thing is definitely thinking?
02:45:11,300 --> 02:45:15,360 To me, thinking or reasoning is just information processing and generalization.
02:45:15,360 --> 02:45:17,560 And I think the neural nets already do that today.
02:45:17,560 --> 02:45:26,320 So being able to perceive the world or perceive whatever the inputs are and to make predictions
02:45:26,320 --> 02:45:28,240 based on that or actions based on that.
02:45:28,240 --> 02:45:29,240 That's reasoning.
02:45:29,240 --> 02:45:30,240 Yeah.
02:45:30,240 --> 02:45:34,960 You're giving correct answers in novel settings by manipulating information.
02:45:34,960 --> 02:45:36,680 You've learned the correct algorithm.
02:45:36,680 --> 02:45:39,760 You're not doing just some kind of a lookup table and there's neighbor search, something
02:45:39,760 --> 02:45:40,760 like that.
02:45:40,760 --> 02:45:42,440 Let me ask you about AGI.
02:45:42,440 --> 02:45:48,300 What are some moonshot ideas you think might make significant progress towards AGI?
02:45:48,300 --> 02:45:52,720 Maybe in other ways, what are the big blockers that we're missing now?
02:45:52,720 --> 02:45:57,720 Basically, I am fairly bullish on our ability to build AGIs.
02:45:57,720 --> 02:46:02,920 Basically automated systems that we can interact with that are very human-like and we can interact
02:46:02,920 --> 02:46:05,920 with them in a digital realm or a physical realm.
02:46:05,920 --> 02:46:10,080 Currently it seems most of the models that sort of do these sort of magical tasks are
02:46:10,080 --> 02:46:12,840 in a text realm.
02:46:12,840 --> 02:46:18,200 I think as I mentioned, I'm suspicious that text realm is not enough to actually build
02:46:18,200 --> 02:46:20,360 full understanding of the world.
02:46:20,360 --> 02:46:23,940 I do actually think you need to go into pixels and understand the physical world and how
02:46:23,940 --> 02:46:25,000 it works.
02:46:25,000 --> 02:46:28,840 So I do think that we need to extend these models to consume images and videos and train
02:46:28,840 --> 02:46:32,000 on a lot more data that is multimodal in that way.
02:46:32,000 --> 02:46:34,800 Do you think you need to touch the world to understand it also?
02:46:34,800 --> 02:46:38,760 Well, that's the big open question I would say in my mind is if you also require the
02:46:38,760 --> 02:46:45,000 embodiment and the ability to sort of interact with the world, run experiments and have data
02:46:45,000 --> 02:46:48,800 of that form, then you need to go to Optimus or something like that.
02:46:48,800 --> 02:46:55,080 And so I would say Optimus in some way is like a hedge on AGI, because it seems to me
02:46:55,080 --> 02:47:00,400 that it's possible that just having data from the internet is not enough.
02:47:00,400 --> 02:47:06,800 If that is the case, then Optimus may lead to AGI because Optimus, to me, there's nothing
02:47:06,800 --> 02:47:07,800 beyond Optimus.
02:47:07,800 --> 02:47:11,480 You have this humanoid form factor that can actually do stuff in the world.
02:47:11,480 --> 02:47:14,800 You can have millions of them interacting with humans and so on.
02:47:14,800 --> 02:47:20,260 And if that doesn't give rise to AGI at some point, I'm not sure what will.
02:47:20,260 --> 02:47:25,280 So from a completeness perspective, I think that's a really good platform, but it's a
02:47:25,280 --> 02:47:29,720 much harder platform because you are dealing with atoms and you need to actually build
02:47:29,720 --> 02:47:32,800 these things and integrate them into society.
02:47:32,800 --> 02:47:36,800 So I think that path takes longer, but it's much more certain.
02:47:36,800 --> 02:47:40,240 And then there's a path of the internet and just like training these compression models
02:47:40,240 --> 02:47:45,120 effectively on trying to compress all the internet.
02:47:45,120 --> 02:47:48,560 And that might also give these agents as well.
02:47:48,560 --> 02:47:51,760 Compress the internet, but also interact with the internet.
02:47:51,760 --> 02:47:53,680 So it's not obvious to me.
02:47:53,680 --> 02:48:00,520 In fact, I suspect you can reach AGI without ever entering the physical world.
02:48:00,520 --> 02:48:09,100 But which is a little bit more concerning because that results in it happening faster.
02:48:09,100 --> 02:48:11,960 So it just feels like we're in boiling water.
02:48:11,960 --> 02:48:14,240 We won't know as it's happening.
02:48:14,240 --> 02:48:17,840 I would like to, I'm not afraid of AGI.
02:48:17,840 --> 02:48:19,280 I'm excited about it.
02:48:19,280 --> 02:48:25,700 There's always concerns, but I would like to know when it happens and have like hints
02:48:25,700 --> 02:48:30,120 about when it happens, like a year from now it will happen, that kind of thing.
02:48:30,120 --> 02:48:32,720 I just feel like in the digital realm, it just might happen.
02:48:32,720 --> 02:48:33,720 Yeah.
02:48:33,720 --> 02:48:37,000 I think, because no one has built AGI, again,
02:48:37,000 --> 02:48:42,480 all we have available to us is, is there enough fertile ground on the periphery?
02:48:42,480 --> 02:48:43,480 I would say yes.
02:48:43,480 --> 02:48:47,880 And we have the progress so far, which has been very rapid and there are next steps that
02:48:47,880 --> 02:48:48,880 are available.
02:48:48,880 --> 02:48:54,440 And so I would say, yeah, it's quite likely that we'll be interacting with digital entities.
02:48:54,440 --> 02:48:57,200 How will you know that somebody has built AGI?
02:48:57,200 --> 02:49:00,080 It's going to be a slow, I think it's going to be a slow incremental transition.
02:49:00,080 --> 02:49:01,760 It's going to be product based and focused.
02:49:01,760 --> 02:49:04,000 It's going to be GitHub co-pilot getting better.
02:49:04,000 --> 02:49:06,620 And then GPT is helping you write.
02:49:06,620 --> 02:49:09,680 And then these oracles that you can go to with mathematical problems.
02:49:09,680 --> 02:49:15,480 I think we're on a, on a verge of being able to ask very complex questions in chemistry,
02:49:15,480 --> 02:49:19,960 physics, math, of these oracles and have them complete solutions.
02:49:19,960 --> 02:49:22,720 So AGI to you is primarily focused on intelligence.
02:49:22,720 --> 02:49:27,920 So consciousness doesn't enter into it.
02:49:27,920 --> 02:49:32,040 So in my mind, consciousness is not a special thing you will, you will figure out and bolt
02:49:32,040 --> 02:49:33,040 on.
02:49:33,040 --> 02:49:37,600 I think it's an emergent phenomenon of a large enough and complex enough generative model
02:49:37,600 --> 02:49:38,600 sort of.
02:49:38,600 --> 02:49:45,240 So if you have a complex enough world model that understands the world, then it also understands
02:49:45,240 --> 02:49:50,120 its predicament in the world as being a language model, which to me is a form of consciousness
02:49:50,120 --> 02:49:52,160 or self-awareness.
02:49:52,160 --> 02:49:55,720 So in order to understand the world deeply, you probably have to integrate yourself into
02:49:55,720 --> 02:49:56,720 the world.
02:49:56,720 --> 02:49:57,720 Yeah.
02:49:57,720 --> 02:50:01,840 If you can interact with humans and other living beings, consciousness is a very useful
02:50:01,840 --> 02:50:02,840 tool.
02:50:02,840 --> 02:50:03,840 Yeah.
02:50:03,840 --> 02:50:06,400 I think consciousness is like a modeling insight.
02:50:06,400 --> 02:50:07,400 Modeling insight.
02:50:07,400 --> 02:50:08,400 Yeah.
02:50:08,400 --> 02:50:11,400 It's a, you have a powerful enough model of understanding the world that you actually
02:50:11,400 --> 02:50:13,280 understand that you are an entity in it.
02:50:13,280 --> 02:50:14,280 Yeah.
02:50:14,280 --> 02:50:18,920 But there's also this perhaps just the narrative we tell ourselves there's a, it feels like
02:50:18,920 --> 02:50:23,200 something to experience the world, the hard problem of consciousness, but that could be
02:50:23,200 --> 02:50:24,720 just a narrative that we tell ourselves.
02:50:24,720 --> 02:50:25,720 Yeah.
02:50:25,720 --> 02:50:27,160 I don't think, I think it will emerge.
02:50:27,160 --> 02:50:29,560 I think it's going to be something very boring.
02:50:29,560 --> 02:50:33,440 Like we'll be talking to these digital AIs, they will claim they're conscious.
02:50:33,440 --> 02:50:35,080 They will appear conscious.
02:50:35,080 --> 02:50:37,760 They will do all the things that you would expect of other humans.
02:50:37,760 --> 02:50:40,320 And it's going to just be a stalemate.
02:50:40,320 --> 02:50:46,240 I think there'll be a lot of actual fascinating ethical questions, like Supreme Court level
02:50:46,240 --> 02:50:51,880 questions of whether you're allowed to turn off a conscious AI.
02:50:51,880 --> 02:50:57,680 If you're allowed to build a conscious AI, maybe there would have to be the same kind
02:50:57,680 --> 02:51:05,320 of debates that you have around, sorry to bring up a political topic, but abortion,
02:51:05,320 --> 02:51:11,800 because the deeper question with abortion is, what is life?
02:51:11,800 --> 02:51:16,480 And the deep question with AI is also what is life and what is conscious?
02:51:16,480 --> 02:51:22,920 And I think that'll be very fascinating to bring up, it might become illegal to build
02:51:22,920 --> 02:51:29,760 systems that are capable of such a level of intelligence that consciousness would emerge
02:51:29,760 --> 02:51:34,760 and therefore the capacity to suffer would emerge and a system that says, no, please
02:51:34,760 --> 02:51:35,760 don't kill me.
02:51:35,760 --> 02:51:41,320 Well, that's what the LaMDA chatbot already told this Google engineer, right?
02:51:41,320 --> 02:51:44,960 Like it was talking about not wanting to die or so on.
02:51:44,960 --> 02:51:49,760 So that might become illegal to do that.
02:51:49,760 --> 02:51:54,160 Because otherwise you might have a lot of creatures that don't want to die.
02:51:54,160 --> 02:51:55,160 And they will-
02:51:55,160 --> 02:51:59,360 You can just spawn infinity of them on a cluster.
02:51:59,360 --> 02:52:02,880 And then that might lead to like horrible consequences because then there might be a
02:52:02,880 --> 02:52:09,040 lot of people that secretly love murder and they'll start practicing murder on those systems.
02:52:09,040 --> 02:52:14,820 To me, all of this stuff just brings a beautiful mirror to the human condition and human nature
02:52:14,820 --> 02:52:15,820 and we'll get to explore it.
02:52:15,820 --> 02:52:20,960 And that's, like, the best of the Supreme Court, of all the different debates we have
02:52:20,960 --> 02:52:23,480 about ideas of what it means to be human.
02:52:23,480 --> 02:52:27,600 We get to ask those deep questions that we've been asking throughout human history.
02:52:27,600 --> 02:52:31,440 There's always been the other in human history.
02:52:31,440 --> 02:52:36,200 We're the good guys and that's the bad guys and we're going to, throughout human history,
02:52:36,200 --> 02:52:38,120 let's murder the bad guys.
02:52:38,120 --> 02:52:39,880 And the same will probably happen with robots.
02:52:39,880 --> 02:52:43,500 There'll be the other at first and then we'll get to ask questions of what does it mean
02:52:43,500 --> 02:52:44,500 to be alive?
02:52:44,500 --> 02:52:46,840 What does it mean to be conscious?
02:52:46,840 --> 02:52:50,400 And I think there's some canary in the coal mines, even with what we have today.
02:52:50,400 --> 02:52:54,560 And for example, there's these like waifus that you can work with and some people are
02:52:54,560 --> 02:52:59,160 trying to like, this company is going to shut down, but this person really loved their waifu
02:52:59,160 --> 02:53:04,120 and is trying to port it somewhere else and it's not possible.
02:53:04,120 --> 02:53:11,400 I think definitely people will have feelings towards these systems because in some sense
02:53:11,400 --> 02:53:16,120 they are like a mirror of humanity because they are like sort of like a big average of
02:53:16,120 --> 02:53:18,640 humanity in a way that it's trained.
02:53:18,640 --> 02:53:22,440 But we can, that average, we can actually watch.
02:53:22,440 --> 02:53:26,200 It's nice to be able to interact with the big average of humanity and do like a search
02:53:26,200 --> 02:53:27,200 query on it.
02:53:27,200 --> 02:53:28,200 Yeah.
02:53:28,200 --> 02:53:29,200 Yeah.
02:53:29,200 --> 02:53:30,200 It's very fascinating.
02:53:30,200 --> 02:53:32,000 And we can of course also like shape it.
02:53:32,000 --> 02:53:33,000 It's not just a pure average.
02:53:33,000 --> 02:53:36,200 We can mess with the training data, we can mess with the objective, we can fine tune
02:53:36,200 --> 02:53:37,740 them in various ways.
02:53:37,740 --> 02:53:42,640 So we have some impact on what those systems look like.
02:53:42,640 --> 02:53:50,080 Once we achieve AGI and you could have a conversation with her and ask her to
02:53:50,080 --> 02:53:54,200 talk about anything, maybe ask her a question, what kind of stuff would you ask?
02:53:54,200 --> 02:53:59,120 I would have some practical questions in my mind like, do I or my loved ones really have
02:53:59,120 --> 02:54:00,120 to die?
02:54:00,120 --> 02:54:03,000 What can we do about that?
02:54:03,000 --> 02:54:07,360 Do you think it will answer clearly or would it answer poetically?
02:54:07,360 --> 02:54:08,960 I would expect it to give solutions.
02:54:08,960 --> 02:54:12,400 I would expect it to be like, well, I've read all of these textbooks and I know all these
02:54:12,400 --> 02:54:13,560 things that you've produced.
02:54:13,560 --> 02:54:17,680 And it seems to me like here are the experiments that I think it would be useful to run next.
02:54:17,680 --> 02:54:20,120 And here's some gene therapies that I think would be helpful.
02:54:20,120 --> 02:54:22,520 And here are the kinds of experiments that you should run.
02:54:22,520 --> 02:54:23,520 Okay.
02:54:23,520 --> 02:54:25,920 Let's go with this thought experiment.
02:54:25,920 --> 02:54:33,120 Imagine that mortality is actually a prerequisite for happiness.
02:54:33,120 --> 02:54:37,120 So if we become immortal, we'll actually become deeply unhappy.
02:54:37,120 --> 02:54:39,720 And the model is able to know that.
02:54:39,720 --> 02:54:42,560 So what is it supposed to tell you, stupid human, about it?
02:54:42,560 --> 02:54:46,320 Yes, you can become immortal, but you will become deeply unhappy.
02:54:46,320 --> 02:54:53,680 If the AGI system is trying to empathize with you, human, what is it supposed to tell you?
02:54:53,680 --> 02:54:57,960 That yes, you don't have to die, but you're really not going to like it?
02:54:57,960 --> 02:55:00,040 Is it going to be deeply honest?
02:55:00,040 --> 02:55:01,040 There's, in Interstellar,
02:55:01,040 --> 02:55:02,040 what is it,
02:55:02,040 --> 02:55:08,160 the AI says, like, humans want 90% honesty.
02:55:08,160 --> 02:55:11,880 So, like, you have to pick how honestly you want it to answer these practical questions.
02:55:11,880 --> 02:55:12,880 Yeah.
02:55:12,880 --> 02:55:14,200 I love the AI in Interstellar, by the way.
02:55:14,200 --> 02:55:16,920 I think it's like such a sidekick to the entire story.
02:55:16,920 --> 02:55:19,840 But at the same time, it's like really interesting.
02:55:19,840 --> 02:55:22,280 It's kind of limited in certain ways, right?
02:55:22,280 --> 02:55:23,280 Yeah, it's limited.
02:55:23,280 --> 02:55:24,280 And I think that's totally fine.
02:55:24,280 --> 02:55:29,640 By the way, I think it's fine and plausible to have a limited and imperfect
02:55:29,640 --> 02:55:32,440 AGI.
02:55:32,440 --> 02:55:34,100 Is that the feature almost?
02:55:34,100 --> 02:55:38,480 As an example, like it has a fixed amount of compute on its physical body.
02:55:38,480 --> 02:55:43,260 And it might just be that even though you can have a super amazing mega brain, super
02:55:43,260 --> 02:55:46,760 intelligent AI, you can also have, like, you know, less intelligent AIs that you can
02:55:46,760 --> 02:55:49,680 deploy in a power efficient way.
02:55:49,680 --> 02:55:50,680 And then they're not perfect.
02:55:50,680 --> 02:55:51,680 They might make mistakes.
02:55:51,680 --> 02:55:56,920 No, I meant more like, say you had infinite compute, and it's still good to make mistakes
02:55:56,920 --> 02:55:57,920 sometimes.
02:55:57,920 --> 02:56:03,400 Like, in order to integrate yourself, like, what is it going back to Good Will Hunting,
02:56:03,400 --> 02:56:09,400 Robin Williams' character says, like, the human imperfections, that's the good stuff, right?
02:56:09,400 --> 02:56:10,400 Isn't it?
02:56:10,400 --> 02:56:16,760 Isn't that the like, we don't want perfect, we want flaws, in part, to form connection
02:56:16,760 --> 02:56:17,760 with each other?
02:56:17,760 --> 02:56:23,120 Because it feels like something you can attach your feelings to the flaws.
02:56:23,120 --> 02:56:26,080 In that same way, you want an AI that's flawed.
02:56:26,080 --> 02:56:27,080 I don't know.
02:56:27,080 --> 02:56:28,080 I feel like perfectionist.
02:56:28,080 --> 02:56:33,600 But then you're saying, okay, yeah, but that's not AGI, but see, AGI would need to be intelligent
02:56:33,600 --> 02:56:37,040 enough to give answers to humans that humans don't understand.
02:56:37,040 --> 02:56:40,440 And I think perfect is something humans can't understand.
02:56:40,440 --> 02:56:42,680 Because even science doesn't give perfect answers.
02:56:42,680 --> 02:56:49,120 There's always gaps and mysteries and I don't know, I don't know if humans want perfect.
02:56:49,120 --> 02:56:54,840 Yeah, I can imagine just having a conversation with this kind of Oracle entity, as you'd
02:56:54,840 --> 02:56:55,840 imagine them.
02:56:55,840 --> 02:57:02,160 And yeah, maybe it can tell you, you know, based on my analysis of the human condition,
02:57:02,160 --> 02:57:03,280 you might not want this.
02:57:03,280 --> 02:57:05,280 And here are some of the things that might...
02:57:05,280 --> 02:57:12,120 But every dumb human will say, yeah, trust me, give me the truth, I can handle
02:57:12,120 --> 02:57:13,120 it.
02:57:13,120 --> 02:57:14,120 But that's the beauty.
02:57:14,120 --> 02:57:15,120 Like people can choose.
02:57:15,120 --> 02:57:21,360 But then the old marshmallow test with the kids and so on, I feel like too many people
02:57:21,360 --> 02:57:29,000 like can't handle the truth, probably including myself, like the deep truth of the human condition.
02:57:29,000 --> 02:57:31,040 I don't know if I can handle it.
02:57:31,040 --> 02:57:33,240 Like what if there's some dark stuff?
02:57:33,240 --> 02:57:35,880 What if we are an alien science experiment?
02:57:35,880 --> 02:57:38,360 And it realizes that, what if it had, I mean.
02:57:38,360 --> 02:57:39,360 Yeah.
02:57:39,360 --> 02:57:43,120 I mean, this is the matrix, you know, all over again.
02:57:43,120 --> 02:57:44,120 I don't know.
02:57:44,120 --> 02:57:46,000 I would, what would I talk about?
02:57:46,000 --> 02:57:52,600 I don't even... yeah, probably I would go with the safer scientific questions at first
02:57:52,600 --> 02:57:58,480 that have nothing to do with my own personal life and mortality, just like about physics
02:57:58,480 --> 02:57:59,480 and so on.
02:57:59,480 --> 02:58:00,480 Yeah.
02:58:00,480 --> 02:58:04,760 To, to build up, like, let's see where it's at, or maybe see if it has a sense of humor.
02:58:04,760 --> 02:58:05,760 That's another question.
02:58:05,760 --> 02:58:10,320 Would it be able to, presumably, if it understands humans deeply, would it
02:58:10,320 --> 02:58:15,000 be able to generate humor?
02:58:15,000 --> 02:58:18,320 Yeah, I think that's actually a wonderful benchmark almost.
02:58:18,320 --> 02:58:21,360 Like is it able, I think that's a really good point basically.
02:58:21,360 --> 02:58:22,360 To make you laugh.
02:58:22,360 --> 02:58:23,360 Yeah.
02:58:23,360 --> 02:58:26,200 If it's able to be like a very effective standup comedian that is doing something very interesting
02:58:26,200 --> 02:58:27,200 computationally.
02:58:27,200 --> 02:58:29,480 I think being funny is extremely hard.
02:58:29,480 --> 02:58:30,720 Yeah.
02:58:30,720 --> 02:58:37,760 Because it's hard in a way like the Turing test, the original intent of the Turing test,
02:58:37,760 --> 02:58:43,640 is hard, because you have to convince humans, and there's nothing... that's what
02:58:43,640 --> 02:58:45,480 comedians talk about.
02:58:45,480 --> 02:58:50,240 Like, this is deeply honest, because people can't help but laugh, and if they
02:58:50,240 --> 02:58:51,960 don't laugh, that means you're not funny.
02:58:51,960 --> 02:58:52,960 If they laugh, it's funny.
02:58:52,960 --> 02:58:57,160 And you're showing, you need a lot of knowledge to create humor about, like, the
02:58:57,160 --> 02:58:58,680 occupation, the human condition, and so on.
02:58:58,680 --> 02:59:01,320 And then you need to be clever with it.
02:59:01,320 --> 02:59:06,840 You mentioned a few movies. You tweeted, movies that I've seen five plus times, but I'm ready
02:59:06,840 --> 02:59:11,200 and willing to keep watching: Interstellar, Gladiator, Contact, Good Will Hunting, The
02:59:11,200 --> 02:59:16,560 Matrix, Lord of the Rings, all three, Avatar, Fifth Element, and so on, it goes on, Terminator
02:59:16,560 --> 02:59:20,720 2, Mean Girls, I'm not going to ask about that.
02:59:20,720 --> 02:59:23,760 Mean Girls is great.
02:59:23,760 --> 02:59:29,440 What are some that jump out to you from memory that you love, and why? Like, you mentioned
02:59:29,440 --> 02:59:34,600 The Matrix. As a computer person, why do you love The Matrix?
02:59:34,600 --> 02:59:36,800 There's so many properties that make it like beautiful and interesting.
02:59:36,800 --> 02:59:42,120 So there's all these philosophical questions, but then there's also AGI's and there's simulation
02:59:42,120 --> 02:59:43,440 and it's cool.
02:59:43,440 --> 02:59:48,480 And there's, you know, the black, you know, the look of it, the feel of
02:59:48,480 --> 02:59:50,040 it, the action, the bullet time.
02:59:50,040 --> 02:59:53,620 It was just like innovating in so many ways.
02:59:53,620 --> 02:59:57,480 And then Good Will Hunting, why do you like that one?
02:59:57,480 --> 03:00:03,600 Yeah, I just, I really liked this tortured genius sort of character who's like grappling
03:00:03,600 --> 03:00:08,120 with whether or not he has like any responsibility or like what to do with this gift that he
03:00:08,120 --> 03:00:11,160 was given or like how to think about the whole thing.
03:00:11,160 --> 03:00:16,920 And there's also a dance between the genius and the personal, like what it means to love
03:00:16,920 --> 03:00:18,200 another human being.
03:00:18,200 --> 03:00:19,200 There's a lot of themes there.
03:00:19,200 --> 03:00:20,380 It's just a beautiful movie.
03:00:20,380 --> 03:00:24,400 And then the fatherly figure, the mentor and the psychiatrist.
03:00:24,400 --> 03:00:27,120 It like really like, it messes with you.
03:00:27,120 --> 03:00:31,140 You know, there's some movies that just like really mess with you on a deep level.
03:00:31,140 --> 03:00:33,720 Do you relate to that movie at all?
03:00:33,720 --> 03:00:34,720 No.
03:00:34,720 --> 03:00:37,160 It's not your fault, Andrej, as I said.
03:00:37,160 --> 03:00:39,200 Lord of the Rings, that's self-explanatory.
03:00:39,200 --> 03:00:42,920 Terminator 2, which is interesting.
03:00:42,920 --> 03:00:44,320 You rewatched that a lot.
03:00:44,320 --> 03:00:46,240 Is that better than Terminator 1?
03:00:46,240 --> 03:00:47,240 You like Arnold?
03:00:47,240 --> 03:00:50,080 I do like Terminator 1 as well.
03:00:50,080 --> 03:00:56,000 I like Terminator 2 a little bit more, but in terms of like its surface properties.
03:00:56,000 --> 03:00:59,560 Do you think Skynet is at all a possibility?
03:00:59,560 --> 03:01:00,640 Yes.
03:01:00,640 --> 03:01:04,960 Like the actual sort of autonomous weapon system kind of thing.
03:01:04,960 --> 03:01:06,920 Do you worry about that stuff?
03:01:06,920 --> 03:01:07,920 I do worry about it.
03:01:07,920 --> 03:01:08,920 AI being used for war.
03:01:08,920 --> 03:01:10,720 I 100% worry about it.
03:01:10,720 --> 03:01:15,800 And so, I mean, some of these fears of AGIs and how this will play out, I mean,
03:01:15,800 --> 03:01:18,060 these will be like very powerful entities probably at some point.
03:01:18,060 --> 03:01:21,680 And so for a long time, there are going to be tools in the hands of humans.
03:01:21,680 --> 03:01:26,120 You know, people talk about, like, alignment of AGIs and how to make... the problem is, like,
03:01:26,120 --> 03:01:27,880 even humans are not aligned.
03:01:27,880 --> 03:01:33,600 So how this will be used and what this is going to look like is, yeah, it's troubling.
03:01:33,600 --> 03:01:40,800 So do you think it'll happen slowly enough that we'll be able to, as a human civilization,
03:01:40,800 --> 03:01:41,800 think through the problems?
03:01:41,800 --> 03:01:42,800 Yes.
03:01:42,800 --> 03:01:46,200 That's my hope, is that it happens slowly enough and in an open enough way where a lot of people
03:01:46,200 --> 03:01:48,440 can see and participate in it.
03:01:48,440 --> 03:01:52,280 Just figure out how to deal with this transition, I think, which is going to be interesting.
03:01:52,280 --> 03:01:57,040 I draw a lot of inspiration from nuclear weapons, because I sure thought it would
03:01:57,040 --> 03:02:05,200 be fucked once they developed nuclear weapons, but it's almost like when the systems
03:02:05,200 --> 03:02:10,160 are not so dangerous that they destroy human civilization, we deploy them and learn the lessons.
03:02:10,160 --> 03:02:14,640 And then if it's too dangerous, we might still deploy
03:02:14,640 --> 03:02:17,800 it, but you very quickly learn not to use them.
03:02:17,800 --> 03:02:19,880 And so there'll be like this balance achieved.
03:02:19,880 --> 03:02:23,100 Humans are very clever as a species, it's interesting.
03:02:23,100 --> 03:02:27,840 We exploit the resources as much as we can, but we avoid destroying ourselves,
03:02:27,840 --> 03:02:28,840 it seems like.
03:02:28,840 --> 03:02:29,840 Yeah.
03:02:29,840 --> 03:02:30,840 Well, I don't know about that actually.
03:02:30,840 --> 03:02:31,840 I hope it continues.
03:02:31,840 --> 03:02:37,960 I mean, I'm definitely like concerned about nuclear weapons and so on, not just as a result
03:02:37,960 --> 03:02:42,440 of the recent conflict, even before that, that's probably like my number one concern
03:02:42,440 --> 03:02:43,540 for humanity.
03:02:43,540 --> 03:02:51,440 So if humanity destroys itself or destroys, you know, 90% of people, that would be because
03:02:51,440 --> 03:02:52,440 of nukes?
03:02:52,440 --> 03:02:53,440 I think so.
03:02:53,440 --> 03:02:55,880 And it's not even about the full destruction.
03:02:55,880 --> 03:03:00,240 To me, it's bad enough if we reset society, that would be like terrible, it would be really
03:03:00,240 --> 03:03:01,240 bad.
03:03:01,240 --> 03:03:03,320 And I can't believe we're like, so close to it.
03:03:03,320 --> 03:03:04,320 Yeah.
03:03:04,320 --> 03:03:05,320 It's like so crazy to me.
03:03:05,320 --> 03:03:08,720 It feels like we might be a few tweets away from something like that.
03:03:08,720 --> 03:03:09,720 Yep.
03:03:09,720 --> 03:03:14,320 Basically, it's extremely unnerving, and has been for me for a long time.
03:03:14,320 --> 03:03:24,160 It seems unstable that world leaders just having a bad mood can like take one step towards
03:03:24,160 --> 03:03:26,720 a bad direction and it escalates.
03:03:26,720 --> 03:03:27,720 Yeah.
03:03:27,720 --> 03:03:33,160 And because of a collection of bad moods, it can escalate without being able to stop.
03:03:33,160 --> 03:03:37,360 Yeah, it's just it's a huge amount of power.
03:03:37,360 --> 03:03:42,080 And then also with the proliferation, I basically, I don't
03:03:42,080 --> 03:03:45,080 actually know what the good outcomes are here.
03:03:45,080 --> 03:03:46,800 So I'm definitely worried about it a lot.
03:03:46,800 --> 03:03:48,520 And then AGI is not currently there.
03:03:48,520 --> 03:03:53,520 But I think at some point it will more and more become something like it.
03:03:53,520 --> 03:03:59,280 The danger with AGI even is that I think it's even like slightly worse in the sense that
03:03:59,280 --> 03:04:01,560 there are good outcomes of AGI.
03:04:01,560 --> 03:04:05,360 And then the bad outcomes are like an epsilon away, like a tiny one away.
03:04:05,360 --> 03:04:10,860 And so I think capitalism and humanity and so on will drive for the positive ways of
03:04:10,860 --> 03:04:12,060 using that technology.
03:04:12,060 --> 03:04:16,880 But then if bad outcomes are just like a tiny like flip a minus sign away, that's a really
03:04:16,880 --> 03:04:18,280 bad position to be in.
03:04:18,280 --> 03:04:23,120 A tiny perturbation of the system results in the destruction of the human species.
03:04:23,120 --> 03:04:24,920 So weird line to walk.
03:04:24,920 --> 03:04:28,080 Yeah, I think in general, what's really weird about the dynamics of humanity and this
03:04:28,080 --> 03:04:33,240 explosion we've talked about is just like the insane coupling afforded by technology.
03:04:33,240 --> 03:04:36,320 And just the instability of the whole dynamical system.
03:04:36,320 --> 03:04:38,800 I think it just doesn't look good, honestly.
03:04:38,800 --> 03:04:43,560 Yes, that explosion could be destructive or constructive and the probabilities are nonzero
03:04:43,560 --> 03:04:44,560 in both.
03:04:44,560 --> 03:04:48,680 Yeah, I do feel like I have to try to be optimistic and so on.
03:04:48,680 --> 03:04:53,880 And yes, I think even in this case, I still am predominantly optimistic, but there's definitely...
03:04:53,880 --> 03:04:54,880 Me too.
03:04:54,880 --> 03:04:59,060 Do you think we'll become a multi planetary species?
03:04:59,060 --> 03:05:04,240 Probably yes, but I don't know if it's a dominant feature of future humanity.
03:05:04,240 --> 03:05:08,800 There might be some people on some planets and so on, but I'm not sure if it's like,
03:05:08,800 --> 03:05:12,300 yeah, if it's like a major player in our culture and so on.
03:05:12,300 --> 03:05:16,920 We still have to solve the drivers of self destruction here on Earth.
03:05:16,920 --> 03:05:20,120 So just having a backup on Mars is not going to solve the problem.
03:05:20,120 --> 03:05:21,880 So by the way, I love the backup on Mars.
03:05:21,880 --> 03:05:22,880 I think that's amazing.
03:05:22,880 --> 03:05:23,880 We should absolutely do that.
03:05:23,880 --> 03:05:24,880 Yes.
03:05:24,880 --> 03:05:25,880 And I'm so thankful.
03:05:25,880 --> 03:05:29,040 Would you go to Mars?
03:05:29,040 --> 03:05:31,200 Personally, no. I do like Earth quite a lot.
03:05:31,200 --> 03:05:32,680 Okay, I'll go to Mars.
03:05:32,680 --> 03:05:35,560 I'll tweet at you from there.
03:05:35,560 --> 03:05:37,640 Maybe eventually I would once it's safe enough.
03:05:37,640 --> 03:05:42,920 But I don't actually know if it's on my lifetime scale, unless I can extend it by a lot.
03:05:42,920 --> 03:05:47,080 I do think that, for example, a lot of people might disappear into virtual realities and
03:05:47,080 --> 03:05:48,080 stuff like that.
03:05:48,080 --> 03:05:52,560 I think that could be the major thrust of sort of the cultural development of humanity,
03:05:52,560 --> 03:05:53,920 if it survives.
03:05:53,920 --> 03:05:58,580 So it might not be, it's just really hard to work in the physical realm and go out there.
03:05:58,580 --> 03:06:02,200 And I think ultimately, all your experiences are in your brain.
03:06:02,200 --> 03:06:06,000 And so it's much easier to disappear into the digital realm.
03:06:06,000 --> 03:06:10,440 And I think people will find them more compelling, easier, safer, more interesting.
03:06:10,440 --> 03:06:14,640 So you're a little bit captivated by virtual reality by the possible worlds, whether it's
03:06:14,640 --> 03:06:17,200 the metaverse or some other manifestation of that?
03:06:17,200 --> 03:06:18,200 Yeah.
03:06:18,200 --> 03:06:21,640 Yeah, it's really interesting.
03:06:21,640 --> 03:06:29,040 I'm interested, just talking a lot to Carmack, what's the thing that's currently preventing
03:06:29,040 --> 03:06:30,040 that?
03:06:30,040 --> 03:06:31,040 Yeah.
03:06:31,040 --> 03:06:37,080 I think what's interesting about the future is it's not that, I kind of feel like the
03:06:37,080 --> 03:06:39,240 variance in the human condition grows.
03:06:39,240 --> 03:06:40,560 That's the primary thing that's changing.
03:06:40,560 --> 03:06:44,060 It's not as much the mean of the distribution, it's like the variance of it.
03:06:44,060 --> 03:06:46,880 So there will probably be people on Mars, and there will be people in VR, and there
03:06:46,880 --> 03:06:48,220 will be people here on Earth.
03:06:48,220 --> 03:06:51,140 It's just like there will be so many more ways of being.
03:06:51,140 --> 03:06:54,800 And so I kind of feel like I see it as like a spreading out of a human experience.
03:06:54,800 --> 03:06:57,820 There's something about the internet that allows you to discover those little groups,
03:06:57,820 --> 03:07:02,280 and then you gravitate toward them, something about your biology likes that kind of world, and
03:07:02,280 --> 03:07:03,280 you find each other.
03:07:03,280 --> 03:07:04,280 Yeah.
03:07:04,280 --> 03:07:06,900 And we'll have transhumanists, and then we'll have the Amish, and everything is just going
03:07:06,900 --> 03:07:07,900 to coexist.
03:07:07,900 --> 03:07:11,640 Yeah, the cool thing about it, because I've interacted with a bunch of internet communities,
03:07:11,640 --> 03:07:15,200 is they don't know about each other.
03:07:15,200 --> 03:07:19,800 Like you can have a very happy existence, just like having a very close-knit community
03:07:19,800 --> 03:07:21,160 and not knowing about each other.
03:07:21,160 --> 03:07:27,640 I mean, you even sense this, just having traveled to Ukraine, they don't know so many things
03:07:27,640 --> 03:07:28,640 about America.
03:07:28,640 --> 03:07:29,640 Yeah.
03:07:29,640 --> 03:07:33,160 When you travel across the world, I think you experience this, too.
03:07:33,160 --> 03:07:37,080 There are certain cultures that are like, they have their own thing going on.
03:07:37,080 --> 03:07:41,760 So you can see that happening more and more in the future, where we have little
03:07:41,760 --> 03:07:42,760 communities.
03:07:42,760 --> 03:07:43,760 Yeah.
03:07:43,760 --> 03:07:44,760 Yeah, I think so.
03:07:44,760 --> 03:07:46,840 That seems to be how it's going right now.
03:07:46,840 --> 03:07:48,800 And I don't see that trend like really reversing.
03:07:48,800 --> 03:07:53,000 I think people are diverse and they're able to choose their own path in existence.
03:07:53,000 --> 03:07:55,580 And I sort of like celebrate that.
03:07:55,580 --> 03:07:56,580 And so...
03:07:56,580 --> 03:08:00,640 Do you spend much time in the metaverse, in virtual reality? Or which community
03:08:00,640 --> 03:08:01,640 are you?
03:08:01,640 --> 03:08:10,640 Are you the physicalist, the physical reality enjoyer, or do you see yourself drawing a lot of pleasure
03:08:10,640 --> 03:08:12,920 and fulfillment in the digital world?
03:08:12,920 --> 03:08:17,360 Yeah, I think currently the virtual reality is not that compelling.
03:08:17,360 --> 03:08:21,720 I do think it can improve a lot, but I don't really know to what extent.
03:08:21,720 --> 03:08:25,120 Maybe there's actually like even more exotic things you can think about with like Neuralinks
03:08:25,120 --> 03:08:26,480 or stuff like that.
03:08:26,480 --> 03:08:31,720 So currently I kind of see myself as mostly a team human person.
03:08:31,720 --> 03:08:32,720 I love nature.
03:08:32,720 --> 03:08:33,720 Yeah.
03:08:33,720 --> 03:08:34,720 I love harmony.
03:08:34,720 --> 03:08:35,720 I love people.
03:08:35,720 --> 03:08:36,720 I love humanity.
03:08:36,720 --> 03:08:39,120 I love emotions of humanity.
03:08:39,120 --> 03:08:44,800 And I just want to be in this solarpunk little utopia. That's my happy place.
03:08:44,800 --> 03:08:49,760 My happy place is people I love, thinking about cool problems, surrounded by lush, beautiful,
03:08:49,760 --> 03:08:54,640 dynamic nature, and secretly high-tech in places that count.
03:08:54,640 --> 03:08:59,840 So you use technology to empower that love for other humans and nature.
03:08:59,840 --> 03:09:03,320 Yeah, I think technology used like very sparingly.
03:09:03,320 --> 03:09:07,400 I don't love when it sort of gets in the way of humanity in many ways.
03:09:07,400 --> 03:09:11,920 I like just people being humans, in a way we sort of evolved for and slightly prefer, I
03:09:11,920 --> 03:09:13,540 think, just by default.
03:09:13,540 --> 03:09:16,220 People kept asking me because they know you love reading.
03:09:16,220 --> 03:09:23,440 Are there particular books that you enjoyed that had an impact on you for silly or for
03:09:23,440 --> 03:09:27,360 profound reasons that you recommend?
03:09:27,360 --> 03:09:29,720 You mentioned The Vital Question.
03:09:29,720 --> 03:09:33,220 Many, of course. I think in biology, as an example, The Vital Question is a good one.
03:09:33,220 --> 03:09:38,840 Anything by Nick Lane, really. Life Ascending, I would say, is a bit more potentially
03:09:38,840 --> 03:09:44,240 representative as a summary of a lot of the things he's been talking about.
03:09:44,240 --> 03:09:46,300 I was very impacted by The Selfish Gene.
03:09:46,300 --> 03:09:49,980 I thought that was a really good book that helped me understand altruism as an example
03:09:49,980 --> 03:09:53,440 and where it comes from, and just realizing that the selection is at the level of genes
03:09:53,440 --> 03:09:56,480 was a huge insight for me at the time and it sort of cleared up a lot of things
03:09:56,480 --> 03:09:57,480 for me.
03:09:57,480 --> 03:10:02,000 What do you think about the idea that ideas are the organisms, the memes?
03:10:02,000 --> 03:10:03,000 Yes, love it, 100%.
03:10:03,000 --> 03:10:11,360 Are you able to walk around with that notion for a while that there is an evolutionary
03:10:11,360 --> 03:10:13,400 kind of process with ideas as well?
03:10:13,400 --> 03:10:14,400 There absolutely is.
03:10:14,400 --> 03:10:18,560 There's memes just like genes and they compete and they live in our brains.
03:10:18,560 --> 03:10:19,560 It's beautiful.
03:10:19,560 --> 03:10:22,120 Are we silly humans thinking that we're the organisms?
03:10:22,120 --> 03:10:25,960 Is it possible that the primary organisms are the ideas?
03:10:25,960 --> 03:10:32,080 Yeah, I would say like the ideas kind of live in the software of like our civilization in
03:10:32,080 --> 03:10:33,740 the minds and so on.
03:10:33,740 --> 03:10:37,760 We think as humans that the hardware is the fundamental thing.
03:10:37,760 --> 03:10:43,600 I, a human, am a hardware entity, but it could be the software, right?
03:10:43,600 --> 03:10:44,600 Yeah.
03:10:44,600 --> 03:10:48,080 Yeah, I would say like there needs to be some grounding at some point to like a physical
03:10:48,080 --> 03:10:49,080 reality.
03:10:49,080 --> 03:10:56,560 But if we clone an Andrej, the software is the thing, like it's the thing that makes that
03:10:56,560 --> 03:10:57,560 thing special, right?
03:10:57,560 --> 03:10:59,440 Yeah, I guess you're right.
03:10:59,440 --> 03:11:01,800 But then cloning might be exceptionally difficult.
03:11:01,800 --> 03:11:05,180 There might be a deep integration between the software and the hardware in ways we don't
03:11:05,180 --> 03:11:06,180 quite understand.
03:11:06,180 --> 03:11:09,960 Well, from the evolution point of view, like what makes me special is more like the gang
03:11:09,960 --> 03:11:13,280 of genes that are riding in my chromosomes, I suppose, right?
03:11:13,280 --> 03:11:15,720 Like they're the replicating unit, I suppose.
03:11:15,720 --> 03:11:17,400 No, but that's just the compute.
03:11:17,400 --> 03:11:25,320 The thing that makes you special, sure, well, the reality is what makes you special is your
03:11:25,320 --> 03:11:32,640 ability to survive based on the software that runs on the hardware that was built by the
03:11:32,640 --> 03:11:34,080 genes.
03:11:34,080 --> 03:11:36,840 So the software is the thing that makes you survive, not the hardware.
03:11:36,840 --> 03:11:40,480 It's a little bit of both, I mean, you know, it's just like a second layer.
03:11:40,480 --> 03:11:42,920 It's a new second layer that wasn't there before the brain.
03:11:42,920 --> 03:11:43,920 They both coexist.
03:11:43,920 --> 03:11:46,040 But there's also layers of the software.
03:11:46,040 --> 03:11:52,120 I mean, it's not, it's an abstraction on top of abstractions.
03:11:52,120 --> 03:11:53,120 But okay.
03:11:53,120 --> 03:11:58,480 So The Selfish Gene and Nick Lane, I would say sometimes books are like not sufficient.
03:11:58,480 --> 03:12:01,480 I like to reach for textbooks sometimes.
03:12:01,480 --> 03:12:05,360 I kind of feel like books are for too much of a general consumption sometimes, and they
03:12:05,360 --> 03:12:09,040 just kind of, they're too high up in the level of abstraction and it's not good
03:12:09,040 --> 03:12:10,040 enough.
03:12:10,040 --> 03:12:11,040 Yeah.
03:12:11,040 --> 03:12:12,040 So I like textbooks.
03:12:12,040 --> 03:12:13,040 I like The Cell.
03:12:13,040 --> 03:12:14,040 I think The Cell was pretty cool.
03:12:14,040 --> 03:12:19,880 That's why also I like the writing of Nick Lane is because he's pretty willing to step
03:12:19,880 --> 03:12:26,000 one level down and he doesn't, yeah, he sort of, he's willing to go there.
03:12:26,000 --> 03:12:27,920 But he's also willing to sort of be throughout the stack.
03:12:27,920 --> 03:12:30,980 So he'll go down to a lot of detail, but then he will come back up.
03:12:30,980 --> 03:12:34,860 And I think he has a, yeah, basically I really appreciate that.
03:12:34,860 --> 03:12:39,520 That's why I love college, early college, even high school, just textbooks on the basics
03:12:39,520 --> 03:12:44,800 of computer science and mathematics, of biology, of chemistry.
03:12:44,800 --> 03:12:51,340 Those are, they condense down, it's sufficiently general that you can understand both the philosophy
03:12:51,340 --> 03:12:56,540 and the details, but also like you get homework problems and you get to play with it as much
03:12:56,540 --> 03:12:59,720 as you would if you were programming stuff.
03:12:59,720 --> 03:13:00,720 Yeah.
03:13:00,720 --> 03:13:04,400 And then I'm also suspicious of textbooks, honestly, because as an example in deep learning,
03:13:04,400 --> 03:13:07,160 there's no like amazing textbooks and the field is changing very quickly.
03:13:07,160 --> 03:13:11,480 I imagine the same is true in, say, synthetic biology and so on.
03:13:11,480 --> 03:13:13,520 These books like The Cell are kind of outdated.
03:13:13,520 --> 03:13:14,520 They're still high level.
03:13:14,520 --> 03:13:16,520 Like what is the actual real source of truth?
03:13:16,520 --> 03:13:23,220 It's people in wet labs working with cells, you know, sequencing genomes and yeah, actually
03:13:23,220 --> 03:13:24,980 working with it.
03:13:24,980 --> 03:13:28,020 And I don't have that much exposure to that or what that looks like.
03:13:28,020 --> 03:13:30,960 So I still don't fully, I'm reading through The Cell and it's kind of interesting and
03:13:30,960 --> 03:13:34,240 I'm learning, but it's still not sufficient, I would say, in terms of understanding.
03:13:34,240 --> 03:13:40,840 Well, it's a clean summarization of the mainstream narrative, but you have to learn that before
03:13:40,840 --> 03:13:43,640 you break out towards the cutting edge.
03:13:43,640 --> 03:13:44,640 Yeah.
03:13:44,640 --> 03:13:47,440 What is the actual process of working with these cells and growing them and incubating
03:13:47,440 --> 03:13:51,080 them, and you know, it's kind of like massive cooking recipes, and making sure your cells
03:13:51,080 --> 03:13:55,560 live and proliferate and then you're sequencing them, running experiments and just how that
03:13:55,560 --> 03:13:59,080 works I think is kind of like the source of truth of, at the end of the day, what's really
03:13:59,080 --> 03:14:01,680 useful in terms of creating therapies and so on.
03:14:01,680 --> 03:14:06,440 Yeah, I wonder what AI textbooks will be in the future, because, you know, there's Artificial
03:14:06,440 --> 03:14:07,680 Intelligence: A Modern Approach.
03:14:07,680 --> 03:14:12,160 I actually haven't read the recent version, if it's come out; there's been
03:14:12,160 --> 03:14:13,160 a recent edition.
03:14:13,160 --> 03:14:15,880 I also saw there's a science of deep learning book.
03:14:15,880 --> 03:14:19,240 I'm waiting for textbooks that are worth recommending, worth reading.
03:14:19,240 --> 03:14:23,560 It's tricky because it's like papers and code, code, code.
03:14:23,560 --> 03:14:25,720 Honestly, I find papers are quite good.
03:14:25,720 --> 03:14:28,800 I especially like the appendix of any paper as well.
03:14:28,800 --> 03:14:33,240 It's like the most detail you can have.
03:14:33,240 --> 03:14:35,880 It doesn't have to be cohesive, connected to anything else.
03:14:35,880 --> 03:14:39,200 You just described a very specific way you saw the particular thing.
03:14:39,200 --> 03:14:40,200 Yeah.
03:14:40,200 --> 03:14:43,280 Many times papers can be actually quite readable, not always, but sometimes the introduction
03:14:43,280 --> 03:14:46,520 and the abstract is readable even for someone outside of the field.
03:14:46,520 --> 03:14:47,840 This is not always true.
03:14:47,840 --> 03:14:52,600 And sometimes I think unfortunately scientists use complex terms even when it's not necessary.
03:14:52,600 --> 03:14:54,040 I think that's harmful.
03:14:54,040 --> 03:14:55,960 I think there's no reason for that.
03:14:55,960 --> 03:15:00,960 And papers sometimes are longer than they need to be, in the parts that don't
03:15:00,960 --> 03:15:01,960 matter.
03:15:01,960 --> 03:15:02,960 Yeah.
03:15:02,960 --> 03:15:06,520 Appendix would be long, but then the paper itself, you know, look at Einstein, make it
03:15:06,520 --> 03:15:07,520 simple.
03:15:07,520 --> 03:15:08,520 Yeah.
03:15:08,520 --> 03:15:10,500 But certainly I've come across papers, I would say in say like synthetic biology or something
03:15:10,500 --> 03:15:13,520 that I thought were quite readable for the abstract and the introduction.
03:15:13,520 --> 03:15:16,320 And then you're reading the rest of it and you don't fully understand, but you kind of
03:15:16,320 --> 03:15:17,320 are getting a gist.
03:15:17,320 --> 03:15:20,400 And I think it's cool.
03:15:20,400 --> 03:15:25,760 What advice would you give to folks interested in machine learning and research, but in
03:15:25,760 --> 03:15:31,680 general, life advice to a young person, high school, early college, about how to have a
03:15:31,680 --> 03:15:34,960 career they can be proud of, or a life they can be proud of?
03:15:34,960 --> 03:15:35,960 Yeah.
03:15:35,960 --> 03:15:37,960 I think I'm very hesitant to give general advice.
03:15:37,960 --> 03:15:38,960 I think it's really hard.
03:15:38,960 --> 03:15:41,280 Some of the stuff I've mentioned is fairly general.
03:15:41,280 --> 03:15:45,920 I think like focus on just the amount of work you're spending on like a thing, compare
03:15:45,920 --> 03:15:48,240 yourself only to yourself, not to others.
03:15:48,240 --> 03:15:49,240 That's good.
03:15:49,240 --> 03:15:50,240 I think those are fairly general.
03:15:50,240 --> 03:15:52,300 How do you pick the thing?
03:15:52,300 --> 03:15:57,800 You just have like a deep interest in something, or try to find the argmax over
03:15:57,800 --> 03:15:58,840 the things that you're interested in.
03:15:58,840 --> 03:16:01,040 Argmax at that moment, and stick with it.
03:16:01,040 --> 03:16:05,240 How do you not get distracted and switched to another thing?
03:16:05,240 --> 03:16:07,240 You can, if you like.
03:16:07,240 --> 03:16:13,320 Well, if you do an argmax repeatedly every week, every month, it's a problem.
03:16:13,320 --> 03:16:14,320 Yeah.
03:16:14,320 --> 03:16:17,760 You can like low pass filter yourself in terms of like what has consistently been true for
03:16:17,760 --> 03:16:18,760 you.
03:16:18,760 --> 03:16:22,240 But yeah, I definitely see how it can be hard.
03:16:22,240 --> 03:16:25,240 But I would say like you're going to work the hardest on the thing that you care about
03:16:25,240 --> 03:16:26,240 the most.
03:16:26,240 --> 03:16:30,620 So low pass filter yourself and really introspect in your past, what are the things that gave
03:16:30,620 --> 03:16:33,660 you energy and what are the things that took energy away from you?
03:16:33,660 --> 03:16:34,760 Concrete examples.
03:16:34,760 --> 03:16:38,560 And usually from those concrete examples, sometimes patterns can emerge.
03:16:38,560 --> 03:16:41,240 I like it when things look like this when I'm in these positions.
03:16:41,240 --> 03:16:44,440 So that's not necessarily the field, but the kind of stuff you're doing in a particular
03:16:44,440 --> 03:16:45,440 field.
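To make the metaphor concrete, here is a toy sketch, with made-up interests and weekly energy scores, of what "low-pass filter yourself, then take the argmax" could look like:

```python
# A toy illustration of the "low-pass filter yourself, then argmax" idea;
# the interests and weekly energy scores below are made-up numbers.
interest_log = {
    "robotics":  [7, 4, 8, 6, 7, 8],
    "biology":   [9, 3, 2, 4, 3, 2],
    "education": [6, 7, 7, 8, 8, 9],
}

def low_pass(xs, alpha=0.3):
    """Exponential moving average: recent weeks count, but one-off spikes get smoothed out."""
    y = xs[0]
    for x in xs[1:]:
        y = alpha * x + (1 - alpha) * y
    return y

smoothed = {name: low_pass(scores) for name, scores in interest_log.items()}
print(max(smoothed, key=smoothed.get))  # the argmax after filtering
```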
03:16:45,440 --> 03:16:50,440 So for you, it seems like you were energized by implementing stuff, building actual things.
03:16:50,440 --> 03:16:51,440 Yeah.
03:16:51,440 --> 03:16:56,760 Learning, and then also communicating so that others can go through the same realizations
03:16:56,760 --> 03:16:59,480 and shortening that gap.
03:16:59,480 --> 03:17:01,840 Because I usually have to do way too much work to understand a thing.
03:17:01,840 --> 03:17:04,340 And then I'm like, okay, this is actually like, okay, I think I get it.
03:17:04,340 --> 03:17:06,240 And like, why was it so much work?
03:17:06,240 --> 03:17:09,040 It should have been much less work.
03:17:09,040 --> 03:17:10,740 And that gives me a lot of frustration.
03:17:10,740 --> 03:17:12,640 And that's why I sometimes go teach.
03:17:12,640 --> 03:17:19,760 So aside from the teaching you're doing now, putting out videos, aside from a potential
03:17:19,760 --> 03:17:26,760 Godfather Part 2 with the AGI at Tesla and beyond, what does the future for Andrej Karpathy
03:17:26,760 --> 03:17:27,760 hold?
03:17:27,760 --> 03:17:28,760 Have you figured that out yet or no?
03:17:28,760 --> 03:17:36,840 I mean, as you see through the fog of war that is all of our future, do you start seeing
03:17:36,840 --> 03:17:41,040 silhouettes of what that possible future could look like?
03:17:41,040 --> 03:17:44,920 The consistent thing I've been always interested in, for me at least, is AI.
03:17:44,920 --> 03:17:50,040 And that's probably what I'm spending the rest of my life on, because I just care about
03:17:50,040 --> 03:17:51,040 it a lot.
03:17:51,040 --> 03:17:54,680 And I actually care about like many other problems as well, like say aging, which I
03:17:54,680 --> 03:17:56,580 basically view as a disease.
03:17:56,580 --> 03:17:59,040 And I care about that as well.
03:17:59,040 --> 03:18:02,280 But I don't think it's a good idea to go after it specifically.
03:18:02,280 --> 03:18:06,080 I don't actually think that humans will be able to come up with the answer.
03:18:06,080 --> 03:18:08,960 I think the correct thing to do is to ignore those problems.
03:18:08,960 --> 03:18:11,920 And you solve AI and then use that to solve everything else.
03:18:11,920 --> 03:18:13,200 And I think there's a chance that this will work.
03:18:13,200 --> 03:18:15,080 I think it's a very high chance.
03:18:15,080 --> 03:18:18,520 And that's kind of like the way I'm betting, at least.
03:18:18,520 --> 03:18:23,880 So when you think about AI, are you interested in all kinds of applications, all kinds of
03:18:23,880 --> 03:18:29,320 domains, and any domain you focus on will allow you to get insights to the big problem
03:18:29,320 --> 03:18:30,320 of AGI?
03:18:30,320 --> 03:18:31,840 Yeah, for me, it's the ultimate meta problem.
03:18:31,840 --> 03:18:33,600 I don't want to work on any one specific problem.
03:18:33,600 --> 03:18:34,600 There's too many problems.
03:18:34,600 --> 03:18:36,640 So how can you work on all problems simultaneously?
03:18:36,640 --> 03:18:39,520 You solve the meta problem, which to me is just intelligence.
03:18:39,520 --> 03:18:42,440 And how do you automate it?
03:18:42,440 --> 03:18:49,360 Are there cool small projects, like arXiv Sanity and so on, that you're thinking about
03:18:49,360 --> 03:18:53,240 that the ML world can anticipate?
03:18:53,240 --> 03:18:55,520 There's always some fun side projects.
03:18:55,520 --> 03:18:57,440 arXiv Sanity is one.
03:18:57,440 --> 03:18:58,920 Basically, there are way too many arXiv papers.
03:18:58,920 --> 03:19:02,360 How can I organize them and recommend papers and so on?
03:19:02,360 --> 03:19:05,000 I transcribed all of your podcasts.
03:19:05,000 --> 03:19:10,120 What did you learn from that experience, from transcribing, the process of...
03:19:10,120 --> 03:19:13,400 You like consuming audiobooks and podcasts and so on.
03:19:13,400 --> 03:19:19,160 Here's a process that achieves close to human-level performance on annotation.
03:19:19,160 --> 03:19:20,160 Yeah.
03:19:20,160 --> 03:19:25,360 Well, I definitely was surprised that transcription with OpenAI's Whisper was working so well,
03:19:25,360 --> 03:19:29,640 compared to what I'm familiar with from Siri and a few other systems, I guess.
03:19:29,640 --> 03:19:30,640 It worked so well.
03:19:30,640 --> 03:19:34,320 And that's what gave me some energy to try it out.
03:19:34,320 --> 03:19:36,820 And I thought it could be fun to run on podcasts.
03:19:36,820 --> 03:19:41,520 It's kind of not obvious to me why Whisper is so much better compared to anything else,
03:19:41,520 --> 03:19:44,000 because I feel like there should be a lot of incentive for a lot of companies to produce
03:19:44,000 --> 03:19:47,000 transcription systems, and they've done so over a long time.
03:19:47,000 --> 03:19:48,600 Whisper is not a super exotic model.
03:19:48,600 --> 03:19:50,380 It's a transformer.
03:19:50,380 --> 03:19:54,280 It takes Mel spectrograms and just outputs tokens of text.
03:19:54,280 --> 03:19:55,280 It's not crazy.
03:19:55,280 --> 03:19:58,440 The model and everything has been around for a long time.
03:19:58,440 --> 03:20:00,280 I'm not actually 100% sure why this came out.
03:20:00,280 --> 03:20:02,240 It's not obvious to me either.
03:20:02,240 --> 03:20:04,120 It makes me feel like I'm missing something.
03:20:04,120 --> 03:20:05,120 I'm missing something.
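As a reference point, here is a minimal sketch of that kind of pipeline, assuming the open-source whisper package; the model size and audio file name are placeholders:

```python
# A minimal sketch, assuming the open-source "whisper" package is installed;
# the model size and audio file name below are placeholders.
import whisper

model = whisper.load_model("small")        # downloads the checkpoint on first use
result = model.transcribe("episode.mp3")   # audio -> mel spectrogram -> text tokens

# Whisper also returns timestamped segments, which is handy for captions.
for seg in result["segments"]:
    print(f"{seg['start']:8.2f} --> {seg['end']:8.2f} {seg['text'].strip()}")
```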
03:20:05,120 --> 03:20:10,480 Yeah, because there is a huge incentive, even at Google and so on, for YouTube transcription.
03:20:10,480 --> 03:20:11,480 Yeah.
03:20:11,480 --> 03:20:12,680 Yeah, it's unclear.
03:20:12,680 --> 03:20:17,440 But some of it is also integrating into a bigger system.
03:20:17,440 --> 03:20:20,080 So the user interface, how it's deployed and all that kind of stuff.
03:20:20,080 --> 03:20:25,300 Maybe running it as an independent thing is much easier, like an order of magnitude easier
03:20:25,300 --> 03:20:31,280 than deploying it to a large integrated system, like YouTube transcription or anything like
03:20:31,280 --> 03:20:32,280 meetings.
03:20:32,280 --> 03:20:38,480 YouTube has transcription that's kind of crappy, but creating an interface where it detects
03:20:38,480 --> 03:20:46,800 the different individual speakers, it's able to display it in compelling ways, run it real
03:20:46,800 --> 03:20:47,800 time, all that kind of stuff.
03:20:47,800 --> 03:20:48,800 Maybe that's difficult.
03:20:48,800 --> 03:20:56,400 That's the only explanation I have because I'm currently paying quite a bit for human
03:20:56,400 --> 03:21:00,120 transcription and human captions annotation.
03:21:00,120 --> 03:21:03,920 It seems like there's a huge incentive to automate that.
03:21:03,920 --> 03:21:04,920 Yeah.
03:21:04,920 --> 03:21:05,920 It's very confusing.
03:21:05,920 --> 03:21:09,160 I don't know if you looked at some of the Whisper transcripts, but they're quite good.
03:21:09,160 --> 03:21:10,160 They're good.
03:21:10,160 --> 03:21:12,360 And especially in tricky cases.
03:21:12,360 --> 03:21:18,520 I've seen Whisper's performance on super tricky cases and it does incredibly well.
03:21:18,520 --> 03:21:19,520 So I don't know.
03:21:19,520 --> 03:21:21,000 A podcast is pretty simple.
03:21:21,000 --> 03:21:26,840 It's high quality audio and you're speaking usually pretty clearly.
03:21:26,840 --> 03:21:32,000 So I don't know what OpenAI's plans are either.
03:21:32,000 --> 03:21:34,840 But yeah, there's always like fun projects basically.
03:21:34,840 --> 03:21:38,440 And Stable Diffusion also is opening up a huge amount of experimentation, I would say, in
03:21:38,440 --> 03:21:43,320 the visual realm and generating images and videos and movies ultimately.
03:21:43,320 --> 03:21:44,320 Videos now.
03:21:44,320 --> 03:21:46,440 And so that's going to be pretty crazy.
03:21:46,440 --> 03:21:50,520 That's going to almost certainly work and it's going to be really interesting when the
03:21:50,520 --> 03:21:52,680 cost of content creation is going to fall to zero.
03:21:52,680 --> 03:21:56,240 You used to need a painter for a few months to paint a thing and now it's going to be
03:21:56,240 --> 03:21:59,280 speak to your phone to get your video.
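As a reference point, here is a minimal sketch of that kind of text-to-image generation, assuming the Hugging Face diffusers package and a public Stable Diffusion checkpoint; the prompt and file name are made up:

```python
# An illustrative sketch, assuming the Hugging Face "diffusers" package and a
# public Stable Diffusion checkpoint; the prompt and output file name are made up.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

image = pipe("a solarpunk village at sunset, lush greenery, film still").images[0]
image.save("scene.png")
```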
03:21:59,280 --> 03:22:05,640 So Hollywood will start using that to generate scenes, which completely opens things up.
03:22:05,640 --> 03:22:06,640 Yeah.
03:22:06,640 --> 03:22:12,520 So you can make a movie like Avatar eventually for under a million dollars.
03:22:12,520 --> 03:22:13,520 Much less.
03:22:13,520 --> 03:22:14,520 Maybe just by talking to your phone.
03:22:14,520 --> 03:22:17,800 I mean, I know it sounds kind of crazy.
03:22:17,800 --> 03:22:19,600 And then there'd be some voting mechanism.
03:22:19,600 --> 03:22:23,560 Like how do you have, like would there be a show on Netflix that's generated completely
03:22:23,560 --> 03:22:27,560 automatically?
03:22:27,560 --> 03:22:33,160 And what does it look like also when you can generate it on demand and there's infinity
03:22:33,160 --> 03:22:34,160 of it?
03:22:34,160 --> 03:22:35,160 Yeah.
03:22:35,160 --> 03:22:38,280 Oh man.
03:22:38,280 --> 03:22:39,280 All the synthetic content.
03:22:39,280 --> 03:22:43,840 I mean, it's humbling because we treat ourselves as special for being able to generate art
03:22:43,840 --> 03:22:46,480 and ideas and all that kind of stuff.
03:22:46,480 --> 03:22:50,920 If that can be done in an automated way by AI.
03:22:50,920 --> 03:22:54,320 I think it's fascinating to me how the predictions of AI and what it's going
03:22:54,320 --> 03:22:57,840 to look like and what it's going to be capable of are completely inverted and wrong.
03:22:57,840 --> 03:23:01,400 And the sci-fi of the fifties and sixties was just totally not right.
03:23:01,400 --> 03:23:05,000 They imagined AI as like super calculating theorem provers.
03:23:05,000 --> 03:23:09,120 And we're getting things that can talk to you about emotions, they can do art.
03:23:09,120 --> 03:23:10,120 It's just like weird.
03:23:10,120 --> 03:23:12,120 Are you excited about that future?
03:23:12,120 --> 03:23:20,040 Just AIs like hybrid systems, heterogeneous systems of humans and AIs talking about emotions,
03:23:20,040 --> 03:23:26,080 and children and AI system where the Netflix thing you watch is also generated by AI.
03:23:26,080 --> 03:23:29,520 I think it's going to be interesting for sure.
03:23:29,520 --> 03:23:32,800 And I think I'm cautiously optimistic, but it's not obvious.
03:23:32,800 --> 03:23:42,080 Well, the sad thing is your brain and mine developed in a time before Twitter,
03:23:42,080 --> 03:23:43,080 before the internet.
03:23:43,080 --> 03:23:47,840 So I wonder people that are born inside of it might have a different experience.
03:23:47,840 --> 03:23:54,000 Like I, and maybe you, will still resist it, and the people born now will not.
03:23:54,000 --> 03:24:01,160 Well, I do feel like humans are extremely malleable and you're probably right.
03:24:01,160 --> 03:24:05,240 What is the meaning of life, Andrej?
03:24:05,240 --> 03:24:11,140 We talked about sort of the universe having a conversation with us humans or with the
03:24:11,140 --> 03:24:16,800 systems we create to try to answer for the universe, for the creator of the universe
03:24:16,800 --> 03:24:23,720 to notice us, we're trying to create systems that are loud enough to answer back.
03:24:23,720 --> 03:24:25,100 I don't know if that's the meaning of life.
03:24:25,100 --> 03:24:27,000 That's like meaning of life for some people.
03:24:27,000 --> 03:24:30,440 The first-level answer, I would say, is anyone can choose their own meaning of life, because
03:24:30,440 --> 03:24:34,240 we are conscious entities and it's beautiful, number one.
03:24:34,240 --> 03:24:39,300 But I do think that a deeper meaning of life, if someone is interested, is along
03:24:39,300 --> 03:24:43,640 the lines of like, what the hell is all this, and why?
03:24:43,640 --> 03:24:47,440 And if you look into fundamental physics and the quantum field theory and the Standard
03:24:47,440 --> 03:24:50,440 Model, they're like very complicated.
03:24:50,440 --> 03:24:55,600 And there are these, you know, 19 free parameters of our universe.
03:24:55,600 --> 03:24:57,920 And like, what's going on with all this stuff?
03:24:57,920 --> 03:24:58,920 And why is it here?
03:24:58,920 --> 03:24:59,920 And can I hack it?
03:24:59,920 --> 03:25:00,920 Can I work with it?
03:25:00,920 --> 03:25:01,920 Is there a message for me?
03:25:01,920 --> 03:25:03,480 Am I supposed to create a message?
03:25:03,480 --> 03:25:05,800 And so I think there's some fundamental answers there.
03:25:05,800 --> 03:25:10,240 But I think, actually, you can't really make a dent in those without
03:25:10,240 --> 03:25:11,320 more time.
03:25:11,320 --> 03:25:15,280 And so to me, also, there's a big question around just getting more time, honestly.
03:25:15,280 --> 03:25:18,240 Yeah, that's kind of like what I think about quite a bit as well.
03:25:18,240 --> 03:25:25,240 So kind of the ultimate, or at least first way to sneak up to the why question is to
03:25:25,240 --> 03:25:30,560 try to escape the system, the universe.
03:25:30,560 --> 03:25:35,840 And then for that, you sort of backtrack and say, okay, that's going to take
03:25:35,840 --> 03:25:36,840 a very long time.
03:25:36,840 --> 03:25:41,400 So the why question boils down from an engineering perspective to how do we extend?
03:25:41,400 --> 03:25:44,920 Yeah, I think that's question number one, practically speaking, because you're
03:25:44,920 --> 03:25:49,120 not going to calculate the answer to the deeper questions in the time you have.
03:25:49,120 --> 03:25:53,880 And that could be extending your own lifetime or extending just the lifetime of human civilization?
03:25:53,880 --> 03:25:57,480 Of whoever wants to. Many people might not want that.
03:25:57,480 --> 03:26:02,520 But for people who do want that, I think it's probably possible.
03:26:02,520 --> 03:26:06,300 And I don't know that people fully realize this. I kind of feel like people
03:26:06,300 --> 03:26:09,040 think of death as an inevitability.
03:26:09,040 --> 03:26:13,080 But at the end of the day, this is a physical system, some things go wrong.
03:26:13,080 --> 03:26:17,120 It makes sense why things like this happen, evolutionarily speaking.
03:26:17,120 --> 03:26:21,040 And there's most certainly interventions that mitigate it.
03:26:21,040 --> 03:26:27,080 That'd be interesting if death is eventually looked at as a fascinating thing that used
03:26:27,080 --> 03:26:28,920 to happen to humans.
03:26:28,920 --> 03:26:29,960 I don't think it's unlikely.
03:26:29,960 --> 03:26:33,800 I think it's likely.
03:26:33,800 --> 03:26:39,960 And it's up to our imagination to try to predict what the world without death looks like.
03:26:39,960 --> 03:26:43,560 It's hard to, I think the values will completely change.
03:26:43,560 --> 03:26:49,360 Could be. I don't really buy all these ideas that, oh, without death, there's
03:26:49,360 --> 03:26:54,840 no meaning, there's nothing. I don't intuitively buy all those arguments.
03:26:54,840 --> 03:26:57,480 I think there's plenty of meaning, plenty of things to learn.
03:26:57,480 --> 03:27:01,320 They're interesting, exciting, I want to know, I want to calculate, I want to improve the
03:27:01,320 --> 03:27:05,440 condition of all the humans and organisms that are alive.
03:27:05,440 --> 03:27:06,440 Yeah.
03:27:06,440 --> 03:27:08,640 The way we find meaning might change.
03:27:08,640 --> 03:27:13,080 There are a lot of humans, probably including myself, that find meaning in the finiteness
03:27:13,080 --> 03:27:14,840 of things.
03:27:14,840 --> 03:27:16,720 But that doesn't mean that's the only source of meaning.
03:27:16,720 --> 03:27:17,720 Yeah.
03:27:17,720 --> 03:27:21,080 I do think many people will go with that, which I think is great.
03:27:21,080 --> 03:27:24,200 I love the idea that people can just choose their own adventure.
03:27:24,200 --> 03:27:28,640 Like you are born as a conscious free entity by default, I'd like to think.
03:27:28,640 --> 03:27:33,480 And you have your unalienable rights for life,
03:27:33,480 --> 03:27:38,480 and the pursuit of happiness. I don't know if you have that in nature, the landscape
03:27:38,480 --> 03:27:39,480 of happiness.
03:27:39,480 --> 03:27:41,720 And you can choose your own adventure mostly.
03:27:41,720 --> 03:27:43,720 And that's not fully true.
03:27:43,720 --> 03:27:51,560 But I still am pretty sure I'm an NPC, but an NPC can't know it's an NPC.
03:27:51,560 --> 03:27:54,160 There could be different degrees and levels of consciousness.
03:27:54,160 --> 03:27:58,720 I don't think there's a more beautiful way to end it.
03:27:58,720 --> 03:28:00,240 Andrej, you're an incredible person.
03:28:00,240 --> 03:28:02,360 I'm really honored you would talk with me.
03:28:02,360 --> 03:28:07,480 Everything you've done for the machine learning world, for the AI world, to just inspire people,
03:28:07,480 --> 03:28:10,440 to educate millions of people, it's been great.
03:28:10,440 --> 03:28:12,000 And I can't wait to see what you do next.
03:28:12,000 --> 03:28:13,000 It's been an honor, man.
03:28:13,000 --> 03:28:14,000 Thank you so much for talking to me.
03:28:14,000 --> 03:28:15,000 Awesome.
03:28:15,000 --> 03:28:16,000 Thank you.
03:28:16,000 --> 03:28:18,920 Thanks for listening to this conversation with Andrej Karpathy.
03:28:18,920 --> 03:28:23,760 To support this podcast, please check out our sponsors in the description.
03:28:23,760 --> 03:28:28,700 And now, let me leave you with some words from Samuel Karlin.
03:28:28,700 --> 03:28:35,720 The purpose of models is not to fit the data, but to sharpen the questions.
03:28:35,720 --> 03:28:54,760 Thanks for listening, and hope to see you next time.