Back in the 80' s we used to do this sort of thing all the time, to extend languages. You would write a compiler that understood the new language extensions, compile it under your existing compiler, then write another compiler that used the language extensions, and compile that using the extended compiler.
e.g. if your language does not have constants, you write a compiler that can lex/parse/codegen the keyword "CONST".
Then write a new compiler that uses "CONST"s and compile it with the CONST-aware compiler (you could even re-feed this back through itself to generate a better compiler).
A few simple tweaks to the compiler code and you can now have ENUMs, and so on.
The most fun was writing optimising compilers and feeding them through iterations of themselves :)
FORTH was an especially great language to do this process on
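A minimal sketch of that CONST idea, assuming a toy whitespace-separated source format rather than real C (the file format, the 64-constant limit and the function name are made up purely for illustration, and it is deliberately naive): a tiny pre-pass records "CONST NAME VALUE" lines and substitutes the name in everything that follows, which is roughly the job the extended compiler's lexer/codegen would take over.

#include <stdio.h>
#include <string.h>

#define MAX_CONSTS 64

static char names[MAX_CONSTS][32];
static char values[MAX_CONSTS][32];
static int nconsts;

/* Record a "CONST NAME VALUE" definition; returns 1 if the line was one. */
static int handle_const(const char *line)
{
    char name[32], value[32];
    if (sscanf(line, "CONST %31s %31s", name, value) == 2 && nconsts < MAX_CONSTS) {
        strcpy(names[nconsts], name);
        strcpy(values[nconsts], value);
        nconsts++;
        return 1;
    }
    return 0;
}

int main(void)
{
    char line[256];
    while (fgets(line, sizeof line, stdin)) {
        if (handle_const(line))
            continue;                      /* definition consumed, emit nothing */
        /* naive token-by-token substitution of known constants */
        char *tok = strtok(line, " \t\n");
        while (tok) {
            const char *out = tok;
            for (int i = 0; i < nconsts; i++)
                if (strcmp(tok, names[i]) == 0) { out = values[i]; break; }
            printf("%s ", out);
            tok = strtok(NULL, " \t\n");
        }
        printf("\n");
    }
    return 0;
}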
Just realized this is why gentoo compiles gcc twice when doing an upgrade (or used to at least). One of those "it's glaringly obvious but I had to have someone else point it out to me before I noticed" moments.
it's not glaringly obvious. 99.99% of people would have never guessed that.
It's not Gentoo in particular that compiles gcc twice; gcc's own build process compiles its source code twice. It also at least used to leave out support for features its own source didn't use, such as Fortran or floating point math, since the first build was only going to ever compile one program.
Three times, I thought. Once with the basic compiler, once to optimize the compiler, and once to build itself with the optimized compiler.
Same for OpenJDK, as long as you have the "jbootstrap" use flag enabled.
Why is it not possible to fully optimize with -O2 or -O3 the first time?
I once read a warning in a book about this: a malicious developer could insert a piece of code that detects when the compiler is compiling itself and injects the source code for injecting that backdoor again. And when the compiler compiles something else, it could just install the backdoor into that program. After the compiler has compiled itself with the malicious code once, the developer can even purge his code from version control to cover his tracks. This is the main reason why you must keep all the versions of your compiler. If the backdoor is detected only several versions later, you have to go back to the last non-backdoored version and start the bootstrapping process again.
That would be "Reflections on Trusting Trust", Ken Thompson's ACM Turing Award lecture.
@@AlanCanon2222 Thank you. That was a very interesting read.
@@oleh6738 It is. I read it in 1996, standing in the Stanford University campus bookstore for two hours, my legs aching. I'm from Louisville, Kentucky, but have visited every landmark of computing I can afford, and my experience of Stanford University consists of 45 minutes of walking the incredibly lovely campus and 4.5 hours standing reading books I couldn't afford in the bookstore.
@@oleh6738 The thing that fascinates me most about it, and which I struggle to understand the mechanics of, is the task of getting a computer program to output its own source code.... that for me is a more deeply interesting aspect of his lecture than the horrifyingly obvious-in-retrospect light it casts on computer security.
The paper is doubly interesting because it has deep results in it from pure computer science, and also this vastly troubling real-world implication to it. Perfect example of the math-meets-engineering nature of the field, with all its thousands of years of political and social consequences. Computer Science is a subset of the Humanities.
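For anyone else fascinated by the "program that outputs its own source code" part: here is a classic minimal C quine, assuming an ASCII/UTF-8 environment (10 is the newline character, 34 is the double quote) and a trailing newline at the end of the file. The trick is that the string holds a template of the whole program, which printf then emits twice: once as the surrounding code and once as the quoted string itself.

#include <stdio.h>
int main(){char*s="#include <stdio.h>%cint main(){char*s=%c%s%c;printf(s,10,34,s,34,10);return 0;}%c";printf(s,10,34,s,34,10);return 0;}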
I've heard from others about the Linus version of this
Fun fact: the C# compiler is written in C#, and the TypeScript transpiler is likewise written in TypeScript. The idea of a self-compiling compiler has many examples.
Indeed, it's quite a fascinating process and a rite of passage for many languages. The Rust compiler is written in Rust, but there are two different compilers used in the bootstrapping process. The original compiler was written in OCaml and was once used for bootstrapping, but it has been replaced by a more advanced one called *mrustc*, written in C++, which is the one typically used for bootstrapping *rustc* these days.
Also: pypy is written in a subset of python
Really...? And nobody mentions Rust-written Rust compiler?! O.o
I mean of all those mentioned, this is the most revolutionary, don't u think? Hh
Golang is also "self hosted" I believe.
The Oracle Java Compiler is a Java Library.
I remember doing similar when I was 14... Had an assembler running in BASIC that was extremely slow, so wrote a new one in Assembler, got the BASIC assembler to assemble that, and then used the binary version of the assembler to improve and complete itself. Was back in the day, when 6502 assembler was a thing. Was happy with the result, assembled itself something like 100 times faster than the BASIC version in the end.
Thank you - and also to a few others on here - who've contributed some nice first-hand bootstrapping tales. Your speed-up factor of 100 must have made it all seem worthwhile ...
The machining tools analogy really unlocked this for me! 👍
I appreciate that he's using up his remaining stock of tractor feed green bar by pseudocoding on it.
The best videos on this channel are those with Professor Brailsford.
I guess the compiler series also has some episodes about disassembly and then decompilation. (Decompilation is hard, because of UB (undefined behavior) in C (the target language))
Edit: the series should also contain a video about undefined behavior in C and C++, because it is a serious topic.
Try Decompiling Visual C++.... 😄
Decompilation is hard because it's akin to unscrambling an egg. Not sure why you think it's hard because of UB; that's a small part of the equation. UB is simply a consequence of low-level code: not every use case can be known at compile time. If you look at a language like Rust, there is very little UB in the safe part of the language, but if you use the unsafe escape hatch you can produce the same kinds of UB as C.
@@noxabellus If UB were ignored, the decompiled code might be incorrect; the generated code has to either use compiler builtins to correctly reflect the binary or insert checks everywhere.
My Love & Hate :: C++
Try also decompiling an app's APK into Java with this app, etc.
I remember a story from "the Jargon File" about someone modifying the compiler to install a backdoor in the login software and a backdoor seed in the compiler, then compiling itself and removing all traces.
Probably "Reflections on Trusting Trust" by Ken Thompson... Terrifying thoughts.
So if I understand correctly:
Have a compiler v1
Write a compiler v2
Compile v2 with v1 to produce v2_slow
Use v2_slow to compile v2 to get v2_fast
For compilers that aren't C, the first step is probably going to involve a simple compiler for the new language written in C rather than in ASM. We can do that because someone has already written a working C compiler for almost every platform, and C is much easier to use than ASM.
Another way of dealing with the bugs in the bootstrapping process is to write a really simple C compiler that BinA can compile and that can compile itself, then write a second, more advanced C compiler and work on getting the first compiler to be able to compile that. A third C compiler could even be needed, and for a very complicated language I could see even more stages being helpful.
Very interesting angle on self compiling compilers - improving quality of code and compiler. My own career experience (forty years!) was doing this as a technique to achieve portability - to run software on new machines / chipsets etc. Excellent - thanks!
Quite interesting. I recently started writing my own compiler (mitsy, if anyone is interested) and right now I'm working on the most essential part of it, namely the bit that takes C literals and transforms them into hard values. The part I'm struggling with is the exponent on floats; everything else has been a relative breeze as far as literals go.
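In case it helps with the exponent bit: a minimal sketch of a decimal float-literal scanner in C. The function name is made up, and the simplifications (no hex floats, no suffixes, no overflow handling, and a repeated-multiply fraction that isn't correctly rounded) are mine; production compilers use strtod-style algorithms for exact rounding.

#include <ctype.h>
#include <math.h>
#include <stdio.h>

/* Parse a C-style decimal float literal like "12.5e-3". */
double parse_float_literal(const char *s)
{
    double value = 0.0;

    while (isdigit((unsigned char)*s))            /* integer part */
        value = value * 10.0 + (*s++ - '0');

    if (*s == '.') {                              /* fractional part */
        double scale = 0.1;
        s++;
        while (isdigit((unsigned char)*s)) {
            value += (*s++ - '0') * scale;
            scale *= 0.1;
        }
    }

    if (*s == 'e' || *s == 'E') {                 /* exponent part */
        int sign = 1, exp = 0;
        s++;
        if (*s == '+' || *s == '-')
            sign = (*s++ == '-') ? -1 : 1;
        while (isdigit((unsigned char)*s))
            exp = exp * 10 + (*s++ - '0');
        value *= pow(10.0, sign * exp);
    }
    return value;
}

int main(void)
{
    printf("%g\n", parse_float_literal("12.5e-3"));   /* prints 0.0125 */
    return 0;
}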
Great to see that the stock of dot matrix printing paper is still going strong.
Ever since I learned that self compiling compilers exist I live in fear
Cf. Ken Thompson's 1984 Turing Award lecture "Reflections on Trusting Trust"
if there is _one_ paper that anybody talking about security of computer systems should really read it's that one
(especially if we're talking about any kind of electronic voting systems)
That is a beautiful paper. Not the most practical, but absolutely beautiful in its paranoia!
@@okuno54 If you are talking about elections in a large country, spending a few million to get the result you want _is exactly what big corporations are already doing_.
@@666Tomato666 I was not... where even...?
I mean also, try a few billion if we're totaling it all up, and congrats on the quadruple post.
@@okuno54 "try a few billion if we're totaling it all up"
go to opensecrets, you'd be surprised how cheaply politicians can be bought
This greatly reminds me of the RepRap Project, which you may or may not have knowingly referred to when talking about self-replicating 3D printers. They very much exist, and they are commercially viable given how they've driven the desktop 3D printing industry to drop prices like mad.
Brings back difficult memories from university. I had to write a Scheme interpreter in Scheme. Compilation was the only class I abandoned.
On top of them being an excellent teacher, something about their voice makes learning so much more enjoyable and easy to understand. It's so unthreatening and intriguing! It's the comp sci grandfather we always wanted and not uh.... yeah.
The original executable format for DOS was the COM program format, which was a "bare" (no headers, single code and data segment, meaning up to 64kb for code+data) 16-bit executable. DOS would put that single section at a fixed location in memory and then execute from a fixed address. The DOS EXE "MZ" file format (MZ being the initials of Mark Zbikowski, who defined the format) was the advanced format that would allow multiple segments, separate code and data, etc. 16-bit Windows EXE files use a newer format called NE (New Executable), and 32-bit/64-bit Windows EXE files use PE (Portable Executable -- the "portable" came from how the same file format could be used on all the platforms Windows used to work on at the time).
EDIT: Added some details.
Addendum: If you have ever opened an NE or PE EXE file in a hex editor (or even in notepad, but don't do that, it would corrupt the file if you save by mistake), you'd see it has MZ at the beginning, which is there for anyone who tried to run a windows program in DOS, who would receive a message saying the program can't be run in DOS.
DOS itself thinks that Windows EXEs are just fine to run, because it ignores all the Windows-specific code and data. The message saying the EXE cannot be run in DOS is output by a small DOS-only code segment in the EXE that DOS sees and executes.
@@MonochromeWench Yes. The "dos segment" you speak of is the beginning of the file, the header contains a number of reserved bytes before the actual data starts, and those bytes are used to create the little dos program that prints an error message. Both NE and PE executables have this.
MZ is not "there for anyone who tried to run a windows program in DOS". DOS .exe files have had the MZ preamble well before Windows. The thing they did with Windows executables was to include a .TEXT stub which does nothing else than writing "This program cannot be run in DOS mode." to standard output, at the same time keeping the format backwards compatible and exit gracefully when started from DOS. the Windows executable loader would look for the NE/PE after that stub, and ignore it.
When i see prof. Brailsford i click faster than Thanos can snap.
Then how fast do you click when it's Pound?
But you watched Nerdist news first
I've watched end game, and can say he doesn't snap very fast.
my click was stored in cache
Please make a video on "Reflections on Trusting Trust" by Ken Thompson.
I wrote an assembler using a monitor on the vic-20 (6502) then used the assembler to write a better assembler with labels, then used that to write a small-c compiler, then used that to write a c-compiler. That's how I learned to program.
HESMon? I used it to reverse engineer the C64 Kernel, stripped out all the cassette routines, and replaced them with bank swapping (using the cassette control circuit) on a 4x sized ROM. Built in all the tools I wanted.
I was asked this question so often, for every language you can imagine...
The second question is always: "But why would you write it again, if it can already be compiled?"
There is one thing you could have mentioned, because it's always the third question. "If a compiler can make code executable for machine A, how can it make the code executable for machine B?"
I apologize if I just didn't hear you when you said it.
The second question can be answered as: because writing a C compiler in C is easier than in ASM, the final product tends to be better, and it's kind of a self "unit test".
The third question can be answered as: if machine B has the same architecture, you don't have a problem, just copy the C-MK2 compiler over. If it is a different architecture, then you need to use machine B's ASM to write a C-MK1, and then feed your already-written C-MK2 into this B-C-MK1, to get the same C-MK2 as on machine A.
Referring back to the first video of this series: you only ever need to rewrite this small C-MK1 for each new architecture in order to generate MK2.
@@wfzyx The answers seemed too obvious to me, so I did not bother. This was really ignorant of me.
I learned all of this years ago and have been writing my own compiler for some very specific stuff. This made me forget how hard it was to learn all of this from zero.
Anyways, thanks for answering :)
Professor Brailsford please take care of yourself. You are a treasure to humanity
The Hamlet of computing, To-C or not To C ;)
legendary professor this one
Yoda of Computer Science. Yup.
Fun fact: the self-replicating rapid prototyping machine, or RepRap for short, is the basis of modern desktop 3D printing. The idea behind the RepRap project was to make a machine that you can use to replicate the machine itself, given enough time and money.
Where does something like yacc (yet another compiler compiler) fit into this discussion?
@@toddmarshall7573 Yacc isn't really a compiler compiler as advertised; it only compiles a parser. Handy, but not nearly the interesting bit.
I'm glad you made that video, because I read somewhere that when compiling a compiler, it would need to be compiled by itself again and again to make sure it wouldn't break or introduce bugs, but it was not explained that the quality of the later versions would be better, so I thought it would become worse. Your explanation was very clear as to why not :)
I was going to write a spoof on Monty Python's "Decomposing Composers" with "Decompiling Compilers" but there aren't enough abandonware compilers to fit the bill.
Ouroboros compiling...
I learned this with Go 1.5+, still a "pain" to build from source
Thank you Professor Brailsford!
What I don't understand is how something made with the quality-0 binary can produce better (let alone faster) code. I mean, if I wrote a C compiler in assembly, how could my self-hosting version possibly be faster? It can't be faster than the original version written in assembly from scratch (let's call it version 0), right? Or is the point that during bootstrapping, optimization possibilities open up which are hard / impossible to do directly in assembly in version 0?
the way i understand it, the efficiency of the compiler as a program itself is separate from the efficiency of the binary it produces. so compiler A could be very inefficient, e.g. it takes 2 days to compile compiler B, which you feed into it. but, once that’s done, compiler B itself is a different binary than compiler A, so the way it works is different as well
I would so wish for prof. Brailsford to compare his ever so loving C with the Rust programming language!
Really curious what his thoughts are on it...
He probably doesn't care much for all the hype. If you've been working with manual memory management for a few years it doesn't really bother you anymore so the tradeoff of huge compile times doesn't have so much appeal
The same is true of compiler-compiler generators (e.g. yacc written in yacc)
The creation of a compiler that can compile itself is called "self-hosting." All assemblers started this way. This is common for Pascal compilers; I know of at least 5 (XDPascal, Stanford Pascal, AAEC 8000 Pascal, UCSD Pascal, and Free Pascal, all of which I've had reason to read the sources of). As of now, all "world-class" C++ compilers, including LLVM's Clang and GNU GCC, are self-hosting. So is FreeBASIC. And there are others. Any programming language that can read text and write to a disk file can be self-hosting, even if it started from something else.
We called it bootstrap compiler.
You don't necessarily have to write the bootstrap compiler in assembler. Niklaus Wirth wrote the first Pascal compiler using FORTRAN. He wrote the compiler originally in itself, then hand-translated it into FORTRAN. Once he built that, the compiler could be used to compile itself, and once all bugs were fixed, all newer versions. Another possibility is to write a subset compiler, just enough to compile the smallest subset that can do something; then, once that is compiled, add more of the compiler to itself until you get it to where it can compile the full language.
I'm not sure why all these complicated diagrams and notations are needed to describe something quite straightforward.
It all started with a piece of software written in machine code to make writing and running programs easier ("Initial Orders").
Then a compiler was developed using the above to convert assembler instructions to machine code.
Then that compiler was rewritten in assembler, compiled and became the new version of the compiler.
The process of enhancing the compiler to introduce new language features continued, using the previous version of the compiler.
New languages were developed and compilers written in languages for which there were working compilers, including producing code for other types of computer.
So it is no different to any other technological process - we create new and improved tools using existing tools.
Miss him.
It makes me think about the human mind actually. Programming, quantum physics, brain surgery is absolutely nothing compared to the thing that produced it which is the human mind. We use our mind to try understand our mind and things obviously go wonky sometimes :D To improve yourself you have to think in a way that is "computable" for your current thinking capacity. You have to think about problems you are having and work out problems in your thinking with the thing that has problems and hopefully make your new thinking a bit better. You have to start at something simple and build on that. The human mind is always building on itself which is fascinating to me :)
thank you so much for the video.
finally it answered the many questions that I had
Back in the day, we wrote the FORTRAN VII compiler in FORTRAN VII
I was accused of being a computerphile
And no mention of Forth? Sun's SPARCstation bios is Forth. PostScript looks like it's Forth to me.
Wow! Bootstrapping in a way uses pointers to get rid of the unwanted product.
I am making a compiled language, but I am still not sure how I would allow it to work on other CPUs if I bootstrap.
I believe that these days the GNU C compiler needs a C++ compiler to compile it. I suppose that was seen as inevitable, like eventually writing the assembler in C, ...
What an enticing title!
Wait a second Computerphile = Computer file OMFG
The visualizations for this T-diagram series are still confused about the concept being explained. 5:53 does not show the right thing at all.
I was wondering the same; I couldn't wrap my brain around it.
@@SimonBastienFiliatrault In my opinion (and I said this in my comment on the last video) the critical part of this visualization should be to show how code-as-instructions and code-as-data are *interchangeable.* That is, the code when it acts as a running program, and the code when it acts as the data being transformed by that program.
The T-diagram ought to represent the code as a program that transforms its input into its output.
What they should be doing is taking another T-diagram, *shrinking it down and passing it through* the first T-diagram to be transformed into a different one - to show how the C-code representation of that program gets turned into a binary-code representation of it.
@@AexisRai I can't agree more. That would've been clearer.
You could say it's "machine reproduction"
Golang compiler is also rewritten in Golang
Believe me, I'm a PC enthusiast, but I'm no scientist, PhD holder, or engineer in the computer world. My question is: why would we want a new compiler for our current language if the current compiler is working as expected? I mean, it can translate English-like source (the computer language) to binary, and as long as it's translated to binary (executable code) with no intermediate layer or algorithm that needs to run during run-time (like, I suppose, Java), it is good enough for any program that we write. So why would we want to change how the compiler works?
To create more efficient binary and also to develop the language.
I must not have been following so well; I am still confused, as I can't think of a system that only builds on itself and is able to run on a specific architecture without that architecture translation layer.
That is what the compiler is. It translates from a high level language to a low level language. Once you have the compiler working, you no longer need the low level references. For instance when installing Linux you go through a tool chain. You start by downloading a working image with a working compiler. Then you use the image compiler to cross compile a compiler for your system. Then you boot your system and recompile your compiler with the cross compiled compiler to create a natively compiled compiler which you use for the rest of your installations. At no point in this process did you use a program written in assembly or in binary because you were able to start with a working compiler.
It still remains. The C compiler will describe how to create the binary file; the difference is that it's described in C instead of assembler.
Soulthym there is nothing to follow. This was just random rambling rantings
@@codycast you're wrong. this video makes sense
Pretty much all high level languages compile to C and let the hardware vendors do the work from there. They still face the problem of getting C code to run well on their new hardware, we just don't see it.
Could you have a bug in BinA that doesn't show up until say BinC? In that case would you have to fix it and iterate through A, B, and C again or could you somehow fix it in just your next iteration?
Not really. BinA shouldn't change what BinB does. Theoretically BinB on BinB should do the exact same thing as BinB on BinA but better.
Can anyone remember a program called 'The Last One'? It was really hyped about 30 years ago as the last compiler program you would ever need. The program would produce another runnable program from your input statements. It was a spectacular failure. Did you buy a copy, and what was it really like?
Sounds like "esolangs" we have today that try to emulate natural language
How about designing a compiler that at its core is based on machine learning, pattern recognition and AI, where the structure of the program that compiles your language of choice is a self-contained polymorphic mutating algorithm? A compiler that learns, becomes self-aware the more you use it, and adjusts itself accordingly. So if person A writes a series of programs in some language and compiles them, their compiler will adapt to the habits of that programmer, while person B using the same compiler writes their own sets of programs and it adapts to their habits. The only thing that stays unified between the two is the syntax and rules of the language itself. At the end of the day, the same compiler on two different machines, used by two different people, ends up as two completely different entities, so to speak.

Would such a construct be feasible? As I stated, the off-the-shelf compilers for persons A & B are exactly the same at installation. The compilers on both machines will accept language L with its alphabet, identifiers, keywords, syntax and rules, the same for both parties, so they have to write coherent code in that language. However, over time and through use, the compiler will adjust itself, or self-modify, to the behaviour of each user based on the source code it is asked to compile. This way the compiler can optimize the code more efficiently for that user.

For example: user A programs 3D applications using Vulkan over and over again, while user B writes embedded driver code for microcontrollers and user C writes source code for an operating system on a specific architecture. Each of the three developers starts off with the same off-the-shelf compiler, but through time and use the compiler begins to self-mutate as it learns from the code it is processing. In user A's case it will efficiently compile and optimize Vulkan API calls, for user B it will optimize driver, kernel or library code, and user C's version will effectively optimize code for an operating system on a specific architecture.

Of course this can also be fine-tuned by the user, since a user can change the default settings of the compiler to their desired settings, but much of this could be automated through machine learning / AI with the use of polymorphic mutating algorithms, which would allow the compiler itself to evolve over time. I know this is different from what is proposed in this video, but it's just something to think about.
I would think cross compiling is used more now. Maybe that's an idea for an episode.
Weird to compare which is used more since they are totally different things
@@noxabellus They are both ways of getting new opcode-based systems off the ground. If you don't know why that's directly relevant to the subject at hand, then that shows there is a need for an episode on the topic.
@@noxabellus Just to be explicit about what makes them similar: imagine that the Bin_A and Bin_B didn't represent two qualities of binary on the same architecture as in Brailsford's example, but two different architectures' binary formats.
"feeding me with myself". Isn't that auto-cannibalism? Dr. Hannibal Lector would be proud.
There are psychological horror games that are premised on the game itself seemingly changing its fundamental behavior, but an examination of the code reveals that the "aberrant" behavior was always programmed in, and can be de/activated by changing one variable.
Could you make a game that actually rewrites its own code, to mess with even the players that snoop around in the game's files?
like Doki Doki Literature Club?
Certainly. That would be a fun challenge although I'm not sure how accessible/marketable it would be as a horror gimmick, I doubt it would sell well by itself.
@@davidmcgill1000 DDLC doesn't rewrite its code though
Doublefine's game "Hack 'n' Slash" has you at first edit logical properties of objects in memory, then to edit the raw memory of the scripting runtime, and eventually edit the raw, on disk byte code (like assembler for a scripting language) of the game. All in a Zelda-style game interface.
You would need to install a compiler inside the game itself.
Sounds like the von Neumann machine... Preferentially selecting for the creation of the new-best version of itself.
A 3d printer that prints a 3d printer
The industrial revolution is largely based on lathes making better lathes.
Look up a video by Prusa research about their factory. They make all the plastic parts of their printers on their printers, as a combination test environment and factory floor. It's part of the original reprap project ethos, it's a family of machines that can be made out of only parts they make themselves and off the shelf parts not specific to 3D printers. They haven't reached that ideal yet, since the actual print head involves custom metal parts specific to 3d printers but they're getting closer.
Consumer 3d printing is basically a thing because of this.
@Zero Cool so first human was created by Allah.
@@ekrem_dincel there's no thing called "first human" because **evolution**
When I see wise men with a tutorial, I know I'm not being catfished.
These T-diagrams over-complicate the explanation in my opinion. Bootstrapping is simply:
./bina binb.c > binb
./binb binc.c > binc
and so on.
With the most tedious step being getting bina in the first place, which can be done in any language you have available to you, or by hand.
Just now from the next room: “... it drinks in semen! . . . What the F@ are you watching‽" lolz
@Computerphile In the future, when a host wants to cover up part of the paper, don't let them use a flipped-over piece of used paper. It's really visually confusing because you can pretty clearly see through the paper. Just use a clean sheet.
EDIT: eg @ 7:05
thanks. great video. i had to rewatch the previous one to get back up to speed, but it was worth it
How was the assembler written which produced BIN A?
badly
*assembly
When I first learned of this in the mid-seventies, it dawned on me that computers would eventually take over the world. Not like now with our PCs, tablets and smartphones, but an absolute, complete takeover.
Self-modifying AGI is the future, no doubt about that. Though, not the future of humanity, but _the future_, since biology will inevitably become obsolete with all shortcomings it has.
@@bytefu Luckily computers, and family, have their shortcomings too :)
why is it hard writing a self-compiling compiler?
Because if you wrote a very basic compiler for C in assembly (BinA in the video) and then again in C (BinB), you had to write it twice. After that, it gets easier and more readable with later versions.
Is he talking about cross compiling?
There's this subtle movement of the cam that makes it very computerphile-ish. Is it an intentional aesthetic you stick with? or is it because you just don't bother setting up tripods and stuff?
Also the way he zooms in to the face sometimes.
0:31 - "It drinks in C program statements and it spits out at the other end [...]"
That's not how it works at the other end.
If you are thinking anatomically then that is exactly how it works.
I need to buy a box of continuous feed green stripe.
Compiled compilers compiling compilers
Compiled compilers compiling compiler compilers
I need a Japanese c compiler.
I love this dude :D
Throwing away the previous mess that got you there.... and then, out of the blue, you find yourself working around a bug in the latest and greatest. If only you had kept the entire chain.... better to use a versioned naming convention and keep them all...
Thank me later.
Has he retired now?
I hope he doesn't retire from computerphile.
Don Henley - Actual miles…
Yes - just occasionally Mike Campbell's plaintive guitar riff on "Boys of Summer" seems to be the ideal track to bring memories of California flooding back ....
@@profdaveb6384 Impeccable taste in profession, impeccable taste in music. What are you, the definition of "man of culture"???
Don't forget the classic put-down for when someone won't shut up about their favourite new language: "Has it been used for anything other than its own compiler?"
Compilerphile
Terry Davis wants to know your location
Me head exploded
This seemed like a mildly hard concept (I mean to grasp, not to execute) explained in a way that made it sound more challenging. Poor visuals and instructional planning I think.
I feel like the concept itself is straight forward, only complicated by the idea that it feels like a monstrous undertaking at the very least in C. Maybe it might be more straight forward in more modern languages but by golly I feel like it will be forever above my patience and (consequently) paygrade.
Crazy!!
But shouldn't the next version of a C compiler be "A++" rather then B?
Wat
B preceeded C not the other way around
@@noxabellus That joke went way over your head.
03:45 Ouroboric Compilation? :-)
Cool
It sounds like software cannibalism.
Self-compiling compiler: I thought you were talking about jumping the gun, slipping the sear; turns out you're just talking about hacking machines, really.
I've bootstrapped 3D printer repairs before...it's an eye opening experience...
Too late, evolution already invented this.
bootstrapping
What a clever people the English are.
I can see how it's an advantage to not depend on a compiler of a different language that might not be maintained in the long run. But why is the compiler written in its own language automatically faster and better, as Professor Brailsford claims?
Well, the initial premise of the video is that there is no compiler for any language, really, so writing a C compiler directly in assembly and then a C compiler in C is the sensible thing to do, in that it requires the least work
It's not. You could write a horrible C compiler in C.
However, assembly can be tedious to write, so you might skip out on faster but more complex algorithms or compile-time optimizations; instead, including them in your compiler written in C.
I'm a little worried about how he said "amd"
ARM, not AMD. It's a different *type* of CPU, that runs different instructions. It's used in phones and the like, since it can use much less power for the same amount of computation.
@@SimonBuchanNz 😯😮
The first C++14 compiler could not be written in C++14. But it could be written in C++11.