Rearchitecting the 6502

  • Published: 30 Jun 2024
  • The 6502 is a CISC 8-bit CPU. We'll be implementing it as a synchronous-bus FPGA module. It was wildly successful, including two implementations by this very project.
    And yet, I'm rebuilding it. In fact, I'm fitting a whole new architecture to it. Why?
    I'd like to thank my Patreon BBC Micro level supporter, Yehuda T. Deutsch.
    You, too, can support my work on Patreon: / compusar
    Discord server invite: / discord
    The code is available at github.com/CompuSAR/sar6502-sync
    6502 block diagram is at www.witwright.com/DonPub/6502...
    Table of contents:
    00:00 - Rewriting the 6502, again
    01:42 - Synchronous vs. Asynchronous bus
    03:59 - Needs too slow a clock
    04:46 - Should need too many cycles/cycle
    07:21 - Constant cycles/cycle implementation
    08:50 - Clock speed keeps changing
    10:22 - Variable cycles/cycle implementation
    12:03 - Moderator implementation
    15:04 - Other advantages
  • Science

Comments • 68

  • @mikepartin571
    @mikepartin571 10 days ago +3

    This was my introduction to your channel. You earned the sub within the first 30 seconds! But just so you know, if my boss complains about my productivity for the next few days I'm blaming you, as I will likely be binge watching these videos for the foreseeable future.

    • @CompuSAR
      @CompuSAR  10 days ago

      Thank you.
      With that said, my older videos give me the cringe. I hope you survive the experience.

  • @otzmaanalytics4679
    @otzmaanalytics4679 12 days ago +2

    Your best video yet! I love how far your production quality has come. And the content remains top notch.

  • @zxborg9681
    @zxborg9681 12 days ago +11

    A reasonable approach, given that your target speed is two orders of magnitude lower than the native FPGA clock, LOL. Another trick I saw in an old 1972-era minicomputer was to generate the instruction clock from an 8x oscillator (1 MHz insn cycle from an 8 MHz clock), and use a counter and decoder (74LS138 idea) to generate eight distinct phases of the clock. The designers sequenced (or micro-sequenced? not quite the same as the modern concept) the flow within an insn by clocking flops or enabling latches on each of the distinct phases when they needed it. So you might increment the PC on phase 1 but load the instruction from memory in phase 2, load the register read data indexed by the opcode on phase 3, and so on. It worked quite well in practice for the mostly 74xx TTL design. But it would be a nightmare in an FPGA unless you use them strictly as clock enables in a natively 8x fully synchronous clock design, which is sort of what you're doing.
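
    A minimal Python model of that counter-plus-decoder scheme (the 8 MHz figure is from the comment above; the phase assignments are illustrative):

```python
# Model of the counter + decoder phase generator: an 8 MHz oscillator drives
# a 3-bit counter, and a 3-to-8 decoder (the 74LS138's role) asserts exactly
# one of eight phase strobes per fast-clock tick, so one 1 MHz instruction
# cycle spans phases 0..7.

def phase_stream(fast_ticks):
    """Yield (tick, phase, one_hot_strobe) for each 8 MHz tick."""
    for t in range(fast_ticks):
        phase = t % 8            # the 3-bit counter
        strobe = 1 << phase      # the 3-to-8 decoder output
        yield t, phase, strobe

for t, phase, strobe in phase_stream(16):
    # e.g. increment the PC on phase 1, fetch from memory on phase 2, ...
    print(f"t={t:2d}  phase={phase}  strobe={strobe:08b}")
```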

    • @CompuSAR
      @CompuSAR  12 days ago +6

      Actually, there's an important difference between that and what I'm doing. In control theory, it's called open-loop vs. closed-loop feedback.
      Doing that would generate the correct clock out of the too-fast clock. The execution rate would be constant and predictable, and you would only be able to divide the original clock by a whole number. If you don't want to muck around with DDR modules, you might even be limited to dividing by an even number.
      Aside from allowing division by an odd number (and, in fact, by any rational fraction of the original clock), this technique is closed-loop feedback control. If there is an external source of delay, such as the DDR being unavailable or bus contention with another unit (the RISC-V, HDMI and SPI all also need to access the memory), then your way is pretty much resigned to losing the whole cycle. My way makes up for the lost time, and does so at a rate adapted to the high-frequency clock.
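
      A sketch of the closed-loop moderator idea in Python; the 75 MHz fast clock and 1 MHz target match figures quoted elsewhere in this thread, and the stall window is invented for illustration:

```python
# Closed-loop rate moderator sketch: earn "credit" at the target rate on
# every fast-clock tick, spend one fast-clock's worth per 6502 cycle, and
# carry any surplus forward. A stall (DDR refresh, bus contention) just lets
# credit build up, so the average rate converges back to the target instead
# of losing the cycle outright.

FAST_HZ = 75_000_000      # illustrative fast clock
TARGET_HZ = 1_000_000     # illustrative 6502 target rate

def moderator(stalled_at=frozenset(), fast_ticks=300):
    credit = 0
    executed = 0
    for t in range(fast_ticks):
        credit += TARGET_HZ                      # earn per fast tick
        if credit >= FAST_HZ and t not in stalled_at:
            credit -= FAST_HZ                    # spend one 6502 cycle
            executed += 1
    return executed

print(moderator())                               # 4 cycles in 300 ticks
print(moderator(stalled_at=set(range(70, 90))))  # still 4: it catches up
```

      After the stall, the surplus credit lets the 6502 step again on the first free fast tick, which is the "make up for the lost time" behaviour described above; an open-loop divider would simply have lost the cycle.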

    • @ArneChristianRosenfeldt
      @ArneChristianRosenfeldt 12 days ago +1

      It turned out that all those latches are expensive. The i386 had two phases. Today, most designs are single phase. Each pipeline stage does its work, the result is captured on the edge of one clock signal, and then the input is gated to new values for the next “task” of this block of combinational logic.
      More phases allow you to model different gate delays through a circuit. Yeah, and overlap: pass through preliminary results and transients so as not to lose any time in the latch.

    • @CompuSAR
      @CompuSAR  12 days ago +4

      I don't think this was about latches being expensive so much as about the pipeline architecture proving superior, making everyone switch. Pipelining is very flip-flop oriented (though it is not immune to cross-stage communication).

    • @ArneChristianRosenfeldt
      @ArneChristianRosenfeldt 12 days ago

      @@CompuSAR Then explain to me why the deep Pentium 4 pipeline failed. A latch needs 7 transistors per bit. I think that the ALU in the 6502 only has something like 20 transistors per bit. With 8 phases you pay more (in terms of area and power) for latches than for any real work.

    • @CompuSAR
      @CompuSAR  12 days ago +1

      @@ArneChristianRosenfeldt I will readily admit ignorance at those levels of analysis. With that said, *as far as I understand*, the main motivator to switch was the promise of more instructions/cycle, rather than power.
      The Intel line are the only modern CPUs that still carry machine language defined in the CISC era, and they pay a huge price to convert it to a pipeline, in terms of pre-execution processing and, in trying to save on that, instruction cache size. And it's still worth it to them, because they had no hope of achieving super-scalar execution with a CISC architecture.
      With that said, all of the above is my understanding of things. It's not my main area of expertise, so if I'm wrong, I'm more than happy to learn.

  • @Dinnye01
    @Dinnye01 12 days ago +4

    Good. You have seen a flaw and addressed it. In for a penny, in for a pound!

    • @CompuSAR
      @CompuSAR  12 days ago +1

      That could have been this channel's name.

  • @byteme6346
    @byteme6346 8 days ago

    The venerable 6502, the original RISC design, should be the first architecture implemented in graphene.

  • @fronbasal
    @fronbasal 11 days ago +1

    Beautiful!

  • @tenminutetokyo2643
    @tenminutetokyo2643 8 days ago

    That is nuts!

    • @CompuSAR
      @CompuSAR  8 days ago

      I explicitly call this crazy, after all.🤣

  • @sabriath
    @sabriath 9 days ago +1

    This is basically how I handle load-balancing within single-threaded applications to tie the fps rate to the desired position.....but I adjust the individual load timing on-the-fly in order to maintain it rather than gate the loads. This way it attempts to maximize calculations until the core starts to suffer, then just raises the wait time for the following cycles (since the individual calculations performed are never known ahead of time; they are just added in as class callbacks into a list).
    soooo....in short, I create my own multi-threading using count triggers lol.
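
    A rough Python reconstruction of what this seems to describe; every name, constant, and tuning factor here is hypothetical rather than taken from the commenter's actual code:

```python
import time

def run_frames(callbacks, target_fps=60.0, frames=10):
    """Run registered callback 'loads' each frame; when a frame overruns its
    budget, raise the wait applied to the following cycles instead of gating
    individual loads."""
    budget = 1.0 / target_fps
    extra_wait = 0.0                      # raised once the core suffers
    for _ in range(frames):
        start = time.perf_counter()
        for cb in callbacks:              # loads registered as callbacks
            cb()
        elapsed = time.perf_counter() - start
        if elapsed > budget:              # the core started to suffer
            extra_wait += 0.25 * (elapsed - budget)
        else:                             # relax back toward full speed
            extra_wait = max(0.0, extra_wait - 0.1 * budget)
        time.sleep(max(0.0, budget - elapsed) + extra_wait)

run_frames([lambda: sum(range(10_000))])
```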

  • @esra_erimez
    @esra_erimez 11 days ago +1

    Wow, I was totally riveted to this video and I have nothing to do with hardware. Subbed!

    • @CompuSAR
      @CompuSAR  11 days ago

      Welcome! I hope you find the rest of my content as interesting.

  • @proxy1035
    @proxy1035 6 days ago

    Personally I would've gone exactly the opposite way: adjust the RISC-V CPU to have a 6502-compatible async 8-bit bus, and then have the RISC-V CPU's clock just be the 6502's run through a PLL (multiplied by 50 or something). That way they're always in sync with each other, and you can use much, much simpler logic to connect them to the same bus (before the width adjustment/cache).
    Plus it allows you to adjust the 6502's clock to whatever system you're emulating, and the RISC-V's clock will automatically follow.
    Though overall, your idea of using a large master clock for both CPUs and just letting it get 1 cycle out of a set amount of cycles is way better.

    • @CompuSAR
      @CompuSAR  6 days ago

      Neither adjustment is particularly easy, but I think writing a synchronous 6502 is easier than writing an async RISC-V. What's more, it's not just the RISC-V: you'd also need to adjust the DDR controller, the SPI controller, the interconnect, and any other component in the system. The whole system would have to be async.

  • @ryanbrooks1671
    @ryanbrooks1671 12 days ago

    I was going to upvote this, but it was at 42- so I left it as is.... oh wait! Now I can. Great discussion.

    • @CompuSAR
      @CompuSAR  12 days ago

      Just like all Internet discussions, you can only talk when things are negative.

  • @anon_y_mousse
    @anon_y_mousse 17 hours ago

    Personally, I don't care about accuracy. I'm not even sure if I would start with a 6502, but definitely I'd make refinements until it was basically a whole new chip anyway. I understand that most who pick up an FPGA are looking to make a completely accurate simulation of an old chip, and if that usage makes them happy then great. I suppose they'd label me a heretic for thinking emulation is a better option if you just want to play games, but I'd really like to design my own processor and eventually produce a physical piece of hardware from my design. I figure an FPGA is great for experimentation in that regard, especially as you scale up beyond what an emulator can adequately do.

  • @SianaGearz
    @SianaGearz 11 days ago +2

    So the 6502 is running with an effective clock jitter, to allow for other bus activity or the SDRAM being unresponsive. It doesn't hurt your project, since all of your other subsystems have flexible timing as well and might as well go to sleep and catch up as needed.
    But I wonder how it would affect interfacing with legacy hardware, say the Commodore IEC bus and a 1541 drive, or emulating the 1541. The basic IEC protocol is explicitly clocked, so an endpoint can cycle-stretch, but wouldn't fastloader compatibility be lost?

    • @CompuSAR
      @CompuSAR  11 days ago +1

      It's an excellent question, and one I don't have a ready answer to. I'm guessing no, but I might be wrong.
      In more detail: a DDR refresh takes several hundred nanoseconds, so it is unlikely to cause even a single missed cycle (1 MHz translates to 1 µs cycles, i.e. 1000 ns). The same goes for cache evictions and HDMI access, which pretty much sums up the potential causes of outside delays.
      Even if the delay is longer, I can't see it affecting IEC. Even with a fast loader, the IEC is driven by the 6502 on the C1541. Its maximum granularity is 3-4 cycles assuming no loops, closer to 10 cycles with a loop. Even a 3-cycle jitter cannot possibly affect it.
      Where things are a little more touch and go is with devices that sit on the actual 8-bit bus. If you hook up an Apple II expansion card or a C64 cartridge (which this project totally aims to support), those expect to see a 1 MHz clock with bus operations. I'm still optimistic that it'll be possible to give them what they need (or, at least, close enough for things to work), but time will tell.
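
      The arithmetic behind the refresh claim, spelled out; the tRFC figure below is a typical order-of-magnitude value for DDR3, not one measured on this project's controller:

```python
# A 1 MHz 6502 cycle is 1000 ns; a DDR refresh (tRFC) is a few hundred ns,
# so even a worst-case refresh stall fits inside a single 6502 cycle.

CYCLE_NS = 1e9 / 1_000_000     # 1000 ns per 6502 cycle at 1 MHz
FAST_NS = 1e9 / 75_000_000     # ~13.3 ns per 75 MHz fast-clock tick
TRFC_NS = 350                  # ballpark DDR3 refresh time (assumed)

print(f"refresh = {TRFC_NS / CYCLE_NS:.0%} of one 6502 cycle")  # 35%
print(f"refresh = {TRFC_NS / FAST_NS:.1f} fast-clock ticks")    # 26.2
```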

    • @CompuSAR
      @CompuSAR  11 days ago +1

      On a completely different note, my channel stats insist that my viewers are 100% male. Please tell me YT is wrong on that front.

    • @SianaGearz
      @SianaGearz 11 days ago +1

      @@CompuSAR Yes. Hector Martin already noticed that someone like myself, and several of his female friends who largely watch engineering-related content, won't count as female viewers, and that the self-specified gender is completely ignored. Don't trust the stat.

    • @CompuSAR
      @CompuSAR  11 days ago +1

      My theory is worse. I suspect you don't count *because* you watch engineering content, which is really depressing if true.

    • @SianaGearz
      @SianaGearz 11 days ago

      @@CompuSAR Yes that was exactly the implication. Luckily there was never such a stigma in my family when i grew up, but then that's a family with Bulgarian roots, so that's a little different from most of the rest of the world.

  • @bgone5520
    @bgone5520 12 days ago

    Would that screw up all non-Apple II usage of your FPGA 6502?

    • @CompuSAR
      @CompuSAR  12 days ago

      I'm sorry, I don't understand the question.
      We'll have a RISC-V running at whatever clock speed it can (currently 75 MHz), and a 6502 running at the same clock speed but throttled down to the original Apple II speed, so effectively it runs at 1 MHz.
      Since all internal buses are 75 MHz, no separate clock domains need to be maintained.
      The only place this requires adapters is if you want the project to allow an external enhancement card or a Commodore 64 cartridge. That requires a 1 MHz async bus, but like I said in the video, I _can_ write an adapter.

    • @CompuSAR
      @CompuSAR  12 days ago

      After re-reading your question, I think I understand it.
      Yes, it is highly unlikely that any project other than CompuSar will find a use for a 6502 designed this way. With that said:
      a. CompuSar wants to implement more than just the Apple II
      b. The same can be said of pretty much all other components in the system.
      At 08:05 I mention that CompuSar isn't a stranger to crazy ideas. This is precisely what I meant. The whole project revolves around building custom modules for tasks where standard modules already exist, so that they are a more precise match and take fewer FPGA resources, allowing the use of cheaper FPGAs and lowering the end product's BoM cost.

    • @bgone5520
      @bgone5520 11 days ago

      @@CompuSAR I was referring to the halting of the clock that was done by the Apple ][. I thought that would affect compatibility with other systems, for example the Atari 800, VIC-20, or BBC Micro.

    • @CompuSAR
      @CompuSAR  11 days ago

      I am really hard-pressed to think of a scenario where two 8-bit systems, whether homogeneous or heterogeneous, would communicate with each other at those clock speeds.

    • @hanelyp1
      @hanelyp1 10 days ago

      @@CompuSAR I think he's asking about emulating other 6502-based computers.
      As an example, Atari systems ran at a (roughly) 1.79 MHz clock which served both the CPU and video signal generation. The exact clock speed derives from NTSC signal generation. RAM was also shared, with signals mediating whether the CPU or the video chips accessed memory. Some advanced techniques on these machines required the CPU to update memory-mapped hardware registers in pixel sync with video generation.
      It looks like the method you're using will produce some phase dithering in the emulated 6502 clock, but unlikely to be more than a few percent.
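
      For reference, the exact figure behind that "roughly 1.79 MHz": NTSC machines derive the CPU clock from the color subcarrier (315/88 MHz colorburst) divided by two:

```python
from fractions import Fraction

colorburst_hz = Fraction(315, 88) * 1_000_000   # NTSC colorburst, ~3.5795 MHz
cpu_hz = colorburst_hz / 2                      # Atari 8-bit NTSC CPU clock
print(float(cpu_hz))                            # 1789772.72... Hz, ~1.79 MHz
```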

  • @destroyer2973
    @destroyer2973 11 days ago +2

    I was thinking of how DEC made the Virtual Address Extension to turn the 16-bit PDP-11 into a (modern for the time) 32-bit CPU while making sure PDP-11 code would still work. I think something similar could be done for the Zilog eZ80F917, Motorola 68060, WDC 65C832 and Vortex86DX3 to allow them to run modern 64-bit code with 256-bit vector extensions, a 64-bit address bus, cryptography acceleration, 4-way SIMD, EPIC instructions, and a hardware PCI Express controller without breaking backwards compatibility. To escalate from the original instruction set to the 64-bit expanded instruction set, there would be a VAX instruction used by the bootloader to switch to the expanded instruction set and address bus.

    • @mal2ksc
      @mal2ksc 11 days ago

      I also think of the real mode/protected mode divide introduced with the 80286 and really only rendered fully baked in the 80386.

    • @destroyer2973
      @destroyer2973 11 days ago

      @@mal2ksc I am aware of that, but I favor DEC's approach with the VAX/PDP-11 architecture over Intel's protected/real mode architecture. The reason is backwards compatibility. What my proposed virtual address extension would do is allow 8-bit eZ80 code to escalate to 64-bit VAX code and take advantage of an expanded instruction set. Other than that the instruction set is the same, and if programmers want to they could use the AVX2, EPIC, 4-way SIMD and the hardware PCIE controller in 8-bit mode; all the VAX instruction does is allow 64-bit code to run, while 8-bit code that already exists does not need any modifications.

  • @gpisic
    @gpisic 10 days ago +1

    Actually, the 6502 is considered one of the first RISC processors, not CISC as your video description states.

    • @CompuSAR
      @CompuSAR  10 days ago +2

      I'm sorry, but that's just not true. While there are some RISCish traits to the 6502, it is still very clearly a CISC CPU with CISC machine code.
      The three main RISC characteristics are a pipeline, with all instructions taking the same number of cycles and having the same length. The 6502 has none of those.

    • @gpisic
      @gpisic 10 days ago

      @@CompuSAR Actually there is a little bit of true pipelining, and a lot of instructions do finish up while the next one is being fetched. An example given in WDC's programming manual is ADC#, which requires 5 distinct steps, but only two clocks' time:
      Step 1: Fetch the instruction opcode ADC.
      Step 2: Interpret the opcode to be ADC of a constant.
      Step 3: Fetch the operand, the constant to be added.
      Step 4: Add the constant to the accumulator contents.
      Step 5: Store the result back to the accumulator.
      Steps 2 and 3 both happen in a single clock. The processor fetches the next byte not knowing yet if it will need it or what it will be for. Steps 4 and 5 occur during the next instruction's step 1, eliminating the need for two more clocks. It cannot do steps 3 and 4 in one clock because the memory being read may not have the data valid and stable any more than a small set-up time before phase 2 falls and the data actually gets taken into the processor; so step 4 cannot begin until after step 3 is totally finished. But doing 2 and 3 simultaneously, and then doing 4 and 5 simultaneous with step 1 of the next instruction makes the whole 5-step process appear to take only 2 clocks.
      Another part of the pipelining is the reason why operands are low-byte-first. The processor starts fetching the operand's low byte before the instruction decode has figured out how many bytes the instruction will have (1, 2, or 3). In the case of indexing before or without any indirection, the low byte needs to be added to the index register first anyway, so the 6502 gets that going before the high byte has finished arriving at the processor. In the case of something like LDA(abs), the first indirect address is fetched before the carry from the low-byte addition is added to the high byte. Then if it finds out there was no carry generated, it already has what it needs, and there's no need to add another cycle to read another address 256 bytes higher in the memory map. This way the whole 7-step instruction process requires only 4 clocks. (This is from the next page of the same programming manual.)
      While I agree it is not a full RISC CPU by today's standards, I also cannot agree that it is a full CISC CPU as you claim.
      It's somewhere between the two, but it definitely has a smaller instruction set than other CPUs of that time.
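
      Laying the manual's ADC # example out clock by clock makes the overlap easier to see (a plain restatement of the steps quoted above, nothing added):

```python
# Five steps in two visible clocks: decode overlaps the operand fetch, and
# the add plus write-back overlap the next instruction's opcode fetch.
timeline = [
    ("clock 1", ["step 1: fetch the ADC opcode"]),
    ("clock 2", ["step 2: decode ADC of a constant",
                 "step 3: fetch the operand (the constant)"]),
    ("clock 3", ["step 4: add the constant to the accumulator",
                 "step 5: store the result to the accumulator",
                 "(this clock is also the next instruction's step 1)"]),
]
for clock, steps in timeline:
    print(f"{clock}: " + "; ".join(steps))
```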

    • @CompuSAR
      @CompuSAR  10 days ago +1

      @gpisic If you want to paint with a broad brush, here is the division:
      CISC: internal buses, microcode
      RISC: everything is done through one or more pipelines
      The 6502 does not have a pipeline. It is true that it has some things that are characteristic of RISC, but nowhere near enough to justify the RISC label.
      In particular:
      The 6502 fetches at the PC on the second cycle of an instruction, regardless of what that instruction is (your steps 2 and 3). So in a way, it fetches before it decodes, which is similar to what a pipeline does. The 6502, however, does not do that as part of a pipeline, not in the RISC sense of the word.
      There are a few commands (ADC isn't one of them) that indeed take effect after the next command has already started executing. Again, it's something a RISC machine does too, but using a different mechanism.
      So, no, saying that the 6502 is a RISC CPU is, to me, stretching the truth beyond its breaking point. If you want, we can agree that it's a precursor to the RISC CPUs.

    • @JohnBayko
      @JohnBayko 9 days ago

      @@gpisic RISC was a strategy to break up complex instructions into smaller parts that can be implemented more simply, allowing redundancy to be removed and faster execution via a higher clock speed and more efficient code, at the expense of larger code. But to simplify complex instructions, the CPU must be complex enough to have them in the first place.
      The earliest CPUs were accumulator based, in which arithmetic and logic operations were carried out in a dedicated register (later several), with data from memory. Early microprocessors like the 6800 and 6502 were also accumulator based, while mini and mainframe computers had general purpose register sets with operands in registers, memory, or indirect memory. It took a while for those types of designs to become microprocessors, leading to RISC, well after the 6502’s time.
      Oddly, there was an 8 bit microprocessor which did have a very RISC-like design, the RCA 1802. It had a large flexible register set, higher than average clock speed, and simplified instruction set, along with larger code size. It didn’t have much advantage over the competition, but was used in space (Galileo space probe to Jupiter’s moons).

  • @markramsell454
    @markramsell454 10 days ago +3

    When the enhanced 6502s came out with higher clock rates and every instruction operating in a single cycle, we ditched the original. The new models had PHX and PHY, so IRQs had less overhead. We moved more data that way. Why even use the original? It was obsolete when the enhanced versions came out, early '90s? Make something that runs at a higher speed than the RISC-V and run it with the RISC-V clock. Then people can fix the code to deal with the higher clock rate. Innovate forward.

    • @CompuSAR
      @CompuSAR  10 days ago +1

      Please do spend one minute on the channel's trailer to see what the end goal is. What you suggest is literally the opposite of what I'm trying to achieve.

  • @SirHackaL0t.
    @SirHackaL0t. 10 days ago

    Fyi, the background music seems a bit too much foreground.

    • @CompuSAR
      @CompuSAR  10 days ago

      Yeah, I know. It's my first time trying to integrate music.

    • @SirHackaL0t.
      @SirHackaL0t. 10 days ago

      @@CompuSAR 👍 Low levels do work; it stops these types of videos from being too dry.

  • @thanatosor
    @thanatosor 10 days ago

    Someone already made a 100 MHz 6502 on an FPGA 😂

    • @CompuSAR
      @CompuSAR  10 days ago +2

      You do realize that's not what I'm trying to do here, right?

    • @thanatosor
      @thanatosor 10 days ago

      @@CompuSAR I wonder if their work may help you somehow 🤷‍♂️

    • @CompuSAR
      @CompuSAR  10 days ago +1

      Link?

  • @RalphDratman
    @RalphDratman 10 days ago

    Why?

    • @CompuSAR
      @CompuSAR  10 days ago +1

      Ooh, I know that one!
      Because.

  • @vanhetgoor
    @vanhetgoor 10 days ago

    Some things in the MOS 6502 are not that great, but this processor was one of the early ones. The MOS 6510 was a bit better, still not super, but millions of them were made and people loved it. The 6502 is a bit basic; it is a processor and not much more. For a long time I did not see the strong points of the MOS 6510 in comparison with the MOS 6502; the extra bits were so meagre, so little, so tiny. Why did they ever bother to produce it?
    The MOS 6502 had mountains of limitations and piles of shortcomings. Yeah, it could have been done better, but it wasn't done back then, and now it is too late to do something about it. History is written down already; no use reinventing the wheel. The MOS 6502 is a bit special, like a diversity kid that has to be in a commercial on TV; his parents must be proud that during the whole commercial he did not drool rainbows on something. That kind of special is the MOS 6502 processor. It is not loved for being fantastic but loved for getting the work done despite its huge stack of quirkinesses.

  • @bobweiram6321
    @bobweiram6321 5 days ago

    Impressive, but we need to stop canonizing technology. We did that with UNIX, and look at the mess it made. It's time to move forward by exploring new ideas, not old ones.

    • @CompuSAR
      @CompuSAR  5 days ago +2

      I don't think anybody has ever tried to do this in this way before. Doesn't that make this, by definition, a new idea?

    • @jbucata
      @jbucata 3 days ago

      @@CompuSAR Clearly your mental CPU includes the PWN instruction

  • @kepeb1
    @kepeb1 4 days ago

    The music is SO ANNOYING! When will people learn???

    • @CompuSAR
      @CompuSAR  4 days ago +1

      Hopefully, next video. This was *literally* the first time I tried integrating music, and mistakes were made.

  • @helmutzollner5496
    @helmutzollner5496 11 days ago +3

    Interesting content, but why this brainfucking soundtrack?
    Your voiceover is what we come for; why put an annoying and distracting wall of sound underneath it?

    • @CompuSAR
      @CompuSAR  11 days ago +2

      Before anything else: I'm glad you like the content.
      The short answer is that most people seem to enjoy the video more this way. With that said, this is my first time experimenting with adding music, and mistakes were made. I should definitely have done a better job of making sure it doesn't interfere with the narration. I'll try to do better next time.