So technically, if memory is using a clock signal of say 2.2 GHz and it's branded as 4400 MHz, chances are somewhere in that memory controller there's a clock running at 4.4 GHz to generate those two phase-shifted 2.2 GHz signals. Meaning, _technically_ they're not lying 😅
You don't need a 4400 MHz clock to generate two 2200 MHz phase-shifted signals. In this case it's quite simple to do it with one 2200 MHz clock, because when measuring the voltage on the positive and negative terminals they are always going to be of equal absolute value and opposite sign, which is a 180 degree shift.
@@unvergebeneid DDR simply takes advantage of the fact that the clock changes twice per cycle. The second clock can be at any offset, and isn't even required since you could just measure the signal every time it hits 0. The issue with not having a second clock would be that any interference would offset your clock, so you use two inverted signals, expecting they would suffer mostly identical interference, and check when they're equal instead. And you want a 180 shift between the two clock signals because it's easier to generate two inverted signals from a single clock, rather than having to synchronize two or more. The issue with going beyond DDR is that even if you can synchronize an arbitrary number of clocks with a constant offset between them, you will still end up with a clock signal identical to one from a clock running at the sum of all frequencies, so you face the same signal integrity issues.
@@unvergebeneid I reread your reply, and I think I understand. The 90 degree shift is from the "peaks" of the clock signal to the point where you do the read/write operation, while the 180 degree phase shift is between the two clock signals. But why would you need 4x your frequency?
Most chips are built with CMOS technology these days. Energy is dissipated as heat when signals change, so frequent switching between 0 and 1 uses more power. The data lines here are just one part where levels change, so the difference is minuscule. But there are instances where crypto keys were compromised by doing thousands of operations and then statistically analysing the variations.
My issue is that you didn't really explain how DDR works, beyond the "sampling on both edges", which is quite basic. Everything you said about sampling can also be said for SDR memory, which works the same way, but only samples on the positive edge of the clock. So the really interesting question would be, why not just use a clock with twice the frequency and sample only on the rising edge? The answer to this question is the whole point of DDR.
The answer is probably "Physics." There is probably something which makes doubling that clock physically impractical relative to the solution chosen. I'm guessing at some point someone who was trying to make the clock faster realized they could get far more out of a smarter design than a better one due to having to engineer around physical limitations.
@@PwadigytheOddity Raising the wave frequency (clock) in a bus is difficult for a variety of physical reasons. Those reasons can be simplified down to the fact that the line must be significantly shorter than a wavelength to ensure signal synchrony and integrity at both ends of the transmission line.
10:50 to about 11:12 is really all that makes DDR DDR. You read/write twice per clock cycle, and that's all there is to it. It is being used instead of SDR because it's easier to maintain signal integrity at a lower clock speed; any delay in voltage change you'd get from things like inductance, capacitance, or certain active components would just make up a smaller portion of a longer clock cycle. Though I wouldn't expect it to be mentioned because this isn't what the video is about.
In SDR you mostly don't get that 90° shift in the actual data; you get a 180° shift instead (at the 'other' edge of the clock), which makes the memory controller a bit simpler. But then you have to generate and transmit a stable and clean clock signal of twice the frequency across the motherboard, which consumes more power, needs better drive transistors and has a few more issues. So doing DDR is preferable.
To all the people replying: yes, I know why it makes sense to have a slower clock and sample on both edges, that was my point. I just think it would have been more interesting to discuss the issues that led to the usage of DDR instead of doubling the clock speed, rather than taking 20 minutes just to show signal sampling on the rising and falling edges, which can be shown in 30 seconds.
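A toy sketch of the point this thread is debating: sampling on both edges of a clock at frequency f captures the same number of transfers as rising-edge-only sampling of a clock at 2f. The waveforms and helper names here are illustrative only, not from the video:

```python
# Toy model: clocks and data as lists of levels, one entry per time step.
# DDR captures one bit per clock edge (rising and falling), so a clock at
# f gives the same transfer count as SDR sampling of a clock at 2f.

def edges(clock):
    """Indices where the clock changes level (both edges)."""
    return [i for i in range(1, len(clock)) if clock[i] != clock[i - 1]]

def sample(data, clock, mode):
    """Sample on rising edges only ("sdr") or on every edge ("ddr")."""
    picks = []
    for i in edges(clock):
        rising = clock[i] == 1
        if mode == "ddr" or (mode == "sdr" and rising):
            picks.append(data[i])
    return picks

clk_1x = [0, 1, 1, 0, 0, 1, 1, 0]   # slow clock: 2 full cycles
clk_2x = [0, 1, 0, 1, 0, 1, 0, 1]   # fast clock: 4 full cycles
data   = [1, 0, 1, 1, 0, 1, 0, 0]   # one bit per time step

ddr_bits = sample(data, clk_1x, "ddr")   # both edges of the slow clock
sdr_bits = sample(data, clk_2x, "sdr")   # rising edges of the fast clock

print(len(ddr_bits), len(sdr_bits))      # same number of transfers: 4 4
```

The interesting engineering question, as the thread says, is why the slower clock is preferable at all (signal integrity), which this toy model deliberately ignores.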
Maybe it would be worth noticing the difference between the basic DDR(1), with memory and bus clocks being the same, and DDR2 running the bus 2x faster than the internal clock, and DDR3 and DDR4 running the bus 4x faster. Also worth mentioning that DDR is used only on the DQ lines. Commands etc. are sent only at the rising edge.
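The generation-to-generation ratios described in this comment can be sketched numerically. The 200 MHz core clock is just an example figure, and the multipliers follow the comment's description rather than any particular datasheet:

```python
# Core-clock-to-bus-clock multiplier per DDR generation, as described in
# the comment above. Transfers happen on both bus-clock edges, so the
# transfer rate is 2x the bus clock.
bus_multiplier = {"DDR": 1, "DDR2": 2, "DDR3": 4, "DDR4": 4}

core_mhz = 200  # example internal array clock (hypothetical)
for gen, mult in bus_multiplier.items():
    bus = core_mhz * mult
    print(f"{gen}: core {core_mhz} MHz, bus {bus} MHz, {2 * bus} MT/s")
```

So the same slow DRAM array can feed a faster and faster interface simply by widening the internal prefetch and multiplying up the bus clock.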
Hi, my question is: based on whose clock signals does the memory controller work? Does it work based on the CPU clock speed (generally in terms of GHz) or the DRAM's clock speed (in terms of MHz)? It's confusing since it's located on the processor chip in modern systems. At the same time, I couldn't find any clear clarification online. Thanks.
They are. But if you use the main clock, you start charging and discharging parts of the silicon that should be left alone. It's basically the main clock masked with an "enable" signal.
so actually during a clock cycle a bit of data can change hands at 4 different times (90 degrees). How much longer until jedec advertises this as quad data rate?
I can appreciate the detail and that you cut to the chase. I just wanted to make a suggestion that you give more context in future educational videos. I had no problem following, but I also have a CS degree and already know the subject. Once again, just a suggestion, but I think it would be more effective if you first explain what a bus is, and its relationship to the clock.
You mentioned that it is beyond the scope of this video to specify the reason for the clock voltage to be differential. Could you give me something to google?
Might be related to how audio signals do it in balanced connections. If there is a noise source, it will be on both signals; if you sum the signals against each other, the noise will be the only thing left, which you can then subtract from the signal to get a clean version without noise. If the signals are just flipped, so 2V and -2V, you sum them up, you get zero. Everything not zero is noise.
@@TV4ELP 1. Using a comparator essentially does the "flipping" and noise cancelling in one step AIUI 2. Differential signalling only eliminates *common mode* noise. ie. noise that is the same on both lines. Common mode noise is the primary source of noise in conductors that are arbitrarily close together and parallel but as frequency gets higher (with respect to line length and/or voltage), other sources of noise start to become significant too eg. capacitive coupling
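A minimal numeric sketch of the common-mode rejection being described in this exchange, with made-up signal levels:

```python
# Common-mode noise rejection on a differential pair: the same noise
# lands on both lines, and the receiver looks only at the difference,
# so the noise cancels out.

signal = [1.0, -1.0, 1.0, 1.0, -1.0]    # ideal levels on the "true" line
noise  = [0.3, -0.2, 0.4, 0.1, -0.3]    # common-mode noise, same on both

line_t = [s + n for s, n in zip(signal, noise)]    # true line
line_c = [-s + n for s, n in zip(signal, noise)]   # complement line

# Receiver: difference of the two lines, halved to restore amplitude.
recovered = [(t - c) / 2 for t, c in zip(line_t, line_c)]

print(recovered)  # the original signal, common-mode noise removed
```

As the reply notes, only noise that is identical on both lines cancels this perfectly; differential-mode disturbances would survive the subtraction.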
I wonder how this works when a ramdisk is mounted and used as a swap partition on Linux lol would that screw up the ram read-write timings on clock edges or would it just delay the phase by a cycle when the ram is mounted
Quick question: Does anyone know why the differential clock signals have the subscripts c and t? CK_c and CK_t ... I am guessing it has to do with some opamp standard that I am unfamiliar with.
So if I understand it correctly QDR would require a Read Clock offset by 45 degrees or 2 read clocks? I will have to watch video one more time, to understand the basics of that (I understood - probably - DDR, it's just that I started thinking about QDR). It's a good video. And yeah it will be useful - once the damn heatwave goes away, maybe I will build the damn computer that I bought parts for.
This is awesome. With this diagram could you explain what the rest of the timing numbers mean and how they relate to wall-clock time? The timing information is listed as a set of 4 numbers, CL-tRCD-tRP-tRAS, correct?

Given this information, a CL of 10 at 1000 MHz is 10ns (because 1 second / 1,000 / 1,000,000 (MHz) = 0.000000001 seconds, × 10 = 10ns). So every transaction that happens incurs this 10ns preamble cost. But in the same way, a CL of 20 at 2000 MHz is 10ns also (because 1 second / 2,000 / 1,000,000 (MHz) = 0.0000000005 seconds, × 20 = 10ns again). So a lot of things play into wall-clock time, and when we are doing billions of memory reads and writes these tiny nanoseconds really add up into perceivable time.

CL (CAS) is already shown on the diagram directly, where CL=11 means 11 clock cycles go by before the read operation starts. tRCD (= CL + tRCD) is harder to see on here. When does a new row open up in memory? As a software engineer, what should we be looking for to avoid this if we can? tRP (= tRP + tRCD + CL) is the precharge cycle count from issuing the precharge command to opening the next row. What is a precharge command? When does one open the next row? How does that play into data-bus width, if at all? tRAS (= tRCD + CL | tRCD + 2×CL) is the minimum number of clock cycles required between a row active command and issuing the precharge command. When is it tRCD + CL and when is it tRCD + 2×CL?

In general it's easy to see now why CL numbers are usually the thing that's advertised up front. It plays into all of the calculations and is an immediate cost that must be paid for any command execution.
>tRCD (= CL + tRCD) is harder to see on here. When does a new row open up in memory? As a software engineer, what should we be looking for to avoid this if we can?
You get something like 2048 rows on a stick and 1024 columns in each row AFAIK - some 8MB per row. If you want to read a chunk of data which is all together, you can read all 1024 columns on the back of a single Row Access Strobe (what RCD, RAS-to-CAS delay, is referring to). All you have to do is open it and then spam CL, CL, CL as many times as you want. If your data is split into little fragments all over the memory, though, you cannot do this and access becomes much slower - you'll often have to wait for Precharge to close a row and then RCD to open another one and get ready to CAS again. This is somewhat analogous to fragmentation on a hard drive, where it may read a single chunk of data at 100MB/s but slow below 1MB/s if it has to seek all over the platter after transferring a few kilobytes. I don't think that there's much you can do to optimise this from a programming standpoint unless you're writing in a low-level language like assembly and you can be very explicit about exactly where, when and how you're reading and writing with DRAM. I believe that a lot of it these days is automatic on the memory controller side in hardware and not modifiable even if you wanted to and had the know-how.
>What is a pre-charge command? When does one open the next row?
Precharge is the delay to close a memory row before you can open another one. If you have an open row and you want to read data which is in a different row of that bank, you must wait for Precharge to close the current row and then RCD to open the other one before you can CAS with that new row.
>tRAS - When is it tRCD + CL and when is it tRCD + 2×CL?
It's never explicitly either of those, nor do they make that much sense as a guide.
The memory controller will try to open a row, do something inside of it, and only afterwards may it try to close it again - closing a row may allow us to avoid waiting for Precharge later on if we need to access a different row, so an IMC will sometimes choose to gamble on closing a row in the hope that the next requests are for a different row. If tRAS hasn't elapsed then it can't close the row and has to wait. Even if you have a tRAS of 20, the IMC may choose to keep a row open for 100 or 1000 cycles if it's doing something useful in that row or thinks that it can soon. The fastest operations (where you can open a row, do something and then even want to close it again, thus potentially triggering the tRAS limit) are probably writes, because writing is generally faster than reading. For example my B-die at 3800 MT/s runs stable with RCDRD and CL of 14, but RCDWR of 8 and CWL of 10. CL or CWL do apply to every read or write operation, but because of their headline importance they are also tuned pretty well on virtually all memory. You'll never find a stick which has CL running 2x or 3x slower than it could be. The biggest speedups in memory timings are generally in the secondary timings, because you *do* find these 2-3x improvements there - like a FAW of 49 which can run perfectly stable at 16, for example. That's much more important than improving CL - even from e.g. 18 to 14.
@@AerynGaming Woah, great answer. Now I need to know more about secondary timings. Is there a place where you learned about all of these things, some resources you have?
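The cycles-to-nanoseconds arithmetic and the row-hit/row-miss distinction from this thread can be sketched in a few lines. This is a deliberately simplified model using the conventional tRCD/tRP names; real controllers pipeline and reorder requests, so actual latencies differ:

```python
# Simplified DRAM latency arithmetic from the thread above.

def cycle_ns(clock_mhz):
    """Duration of one clock cycle in nanoseconds."""
    return 1000.0 / clock_mhz

def read_latency_ns(clock_mhz, cl, trcd, trp, row_open):
    """Row hit: pay only CL.  Row miss: Precharge + RAS-to-CAS + CL."""
    cycles = cl if row_open else (trp + trcd + cl)
    return cycles * cycle_ns(clock_mhz)

# Same absolute latency despite different CL numbers:
print(read_latency_ns(1000, 10, 10, 10, row_open=True))   # 10.0 ns
print(read_latency_ns(2000, 20, 20, 20, row_open=True))   # 10.0 ns

# A row miss pays the full tRP + tRCD + CL penalty:
print(read_latency_ns(1000, 10, 10, 10, row_open=False))  # 30.0 ns
```

This is why fragmented access patterns hurt: every row miss triples the latency in this toy example, no matter how well CL itself is tuned.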
I mean, technically a lot is possible. You could generate a clock with doubled frequency on the memory chips and use that in the same way as in DDR, but then you're probably better off just using double the clock frequency to begin with and keeping the memory simple.
@@SwordQuake2 Well, we are talking about digital circuits; something always has to change state when you want to do anything. 1 clock cycle only ever has 2 state changes, not 4 or more, so you need to get the extra state changes somehow. You could use some variation of what I described, for example using 2 90°-shifted clock signals to clock the RAM and alternating between them. That's basically still the same thing. So if something like that doesn't satisfy your idea of QDR, then I guess it's impossible.
@@Basement-Science well yes, the cycle has 2 edges so that's why it was a question whether it would be possible. I guess with 2 clocks that are shifted like you mentioned it would be possible to get the 4 edges in a cycle.
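A quick sketch of the two-shifted-clocks idea from this exchange: taking the edges of two clocks 90° apart yields four evenly spaced sample points per cycle of the base clock. The step counts here are arbitrary:

```python
# Two clocks 90 degrees apart give four edges per base-clock cycle --
# the usual way QDR-style signaling is described.

steps = 16                      # time steps; one base-clock cycle = 8 steps
clk_i = [1 if (t % 8) < 4 else 0 for t in range(steps)]        # in-phase
clk_q = [1 if ((t - 2) % 8) < 4 else 0 for t in range(steps)]  # +90 degrees

def edge_times(clock):
    """Set of time steps at which the clock changes level."""
    return {t for t in range(1, len(clock)) if clock[t] != clock[t - 1]}

all_edges = sorted(edge_times(clk_i) | edge_times(clk_q))
print(all_edges)   # four evenly spaced edges per 8-step cycle
```

The catch, as noted earlier in the thread, is that the combined edge stream toggles just as often as a clock at twice the frequency, so the signal-integrity problem doesn't go away.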
Depending on the memory controller, all memory channels can be completely independent or behave like 1 giant memory channel. Fun fact: Intel 6th-11th gen CPUs actually behave like a 128-bit single channel rather than 2x 64-bit channels. 12th gen works at 2x64 or 4x32.
@@ActuallyHardcoreOverclocking Fun fact, I just installed "128 GB DDR5" 6000 MT/s on a ROG Strix X670-E with a Ryzen 9 7950X3D, in two 64GB kits (2x32) f5-600j3040g32gx2-tz5nr, and it only recognizes 3200 MHz :(
Is there a way you recommend for someone, who isn't beyond clicking OC is bios, to learn how to overclock their system? Or is that strictly learning computer science?
Still, the lack of experience playing around with hardware will be a limiting factor. There are tons of guides, info and videos, but one never stops learning. I started overclocking in 2005 with Socket A, and having passed through a lot of hardware and configurations, only this year did I start to understand DDR4 RAM overclocking at its fullest (only because I never bothered, tbh).
There are a lot of online resources; even in BZ's playlists he has a lot of informative videos on how to approach certain aspects of overclocking. Also, this area of computers is actually closer to electrical/computer engineering than computer science, which might help push you in the right direction. The only thing that might be hard to find are some of the datasheets BZ has for components or processes that are not readily available from the manufacturer. They don't make all of them public and/or take them down from public access.
this is very well explained, I don't know much about how ram works and how chips work internally but I could still understand almost everything in the video! great video, keep it up
I'd like to see a logic flow chart for all the timing parameters listed in a motherboard and when they occur, in what order, and the dependencies. Command issued, Column access strobe, Row access strobe, Read to Write delay, etc. That would be fascinating, especially with QDR/PAM4 on its way.
Yes please!
Well, use the Micron datasheet.
The sequence of command and address signals does not change when you go from DDR to PAM4, that change is isolated basically entirely on the data signal lines.
@@wewillrockyou1986 I assume PAM4 would require an additional encoding step though? As far as I'm aware, the data isn't stored in a PAM4 format.
Thanks for bringing up this topic. Never looked at DDR systems.
Regarding phase in Write and Read operations. I am pretty sure you know this, but it wasn't that clear in the video: This is standard for every logic interface I have seen so far.
When writing you need to drive the line to the desired level early enough so it has time to settle (since the change cannot be instantaneous due to capacitance, etc.). This is called setup time (tSU).
Then you need to keep it at this level so that the target can read it (potentially removing some charge). This is called hold time (tH).
When reading, the clock initiates the read, thus the data needs to follow the clock. The time needed is called clock-to-output time (tCO).
Point is, this is less connected to the memory being dumb and more part of the structure of who sends the clock. So data sent with the clock normally respects setup and hold. Data requested with the clock takes tCO.
When combining elements in SDR land, you normally make sure that tCO is longer than tH. That way the output always changes only after the hold time, ensuring no timing violations.
I assume DDR RAM uses a multiple of the clock internally and the DDR interface is only there to limit the frequencies on the channel, simplifying connections and routing on the boards.
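The tSU/tH/tCO relationships this comment describes can be sketched as simple inequality checks. The nanosecond figures are hypothetical, chosen only to satisfy the SDR rule that tCO should exceed tH:

```python
# Sketch of the SDR timing rule above: if the driver's clock-to-output
# time (tCO) exceeds the receiver's hold time (tH), the data can never
# change while the receiver still needs it held stable.

def hold_satisfied(t_co_ns, t_h_ns):
    """Output changes tCO after the edge; receiver needs data until tH."""
    return t_co_ns > t_h_ns

def setup_satisfied(t_co_ns, period_ns, t_su_ns):
    """New data must also settle tSU before the *next* edge arrives."""
    return t_co_ns <= period_ns - t_su_ns

# Hypothetical numbers for illustration:
period, t_co, t_su, t_h = 10.0, 3.0, 2.0, 1.0
print(hold_satisfied(t_co, t_h))            # True: no hold violation
print(setup_satisfied(t_co, period, t_su))  # True: no setup violation
```

Note how shortening the clock period squeezes the setup margin from both sides, which is exactly the pressure that makes a slower bus clock attractive.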
Nice one Mr. Buildzoid! When you put your mind to it you can actually structure education very well, while keeping it entertaining :)
I guess BZ was just having a good day!🙃
"The memory is just kinda there." Basically the story of how DDR won over every other memory system. It's also becoming one of the biggest downsides to this type of memory system and why the likes of HBM and IBMs Centaur buffered memory exist.
Your videos are wine that ages gracefully with time. For example, when doing RX Vega water cooling/tuning, your videos held much greater relevance and insight once you get deep into the topic presented, which may be missed by some upon initial viewing. Keep up the great content.
Would love to see an overclocking guide - basically going over all the timings and sub-timings - talking briefly about what they do, how important they are for performance, how to determine stable operational values, and how voltage plays into all this. Because say I'm going for 3600 MT/s - how do I know where to aim with each sub-timing for this memory speed, etc.
Informative. I did not know about the shifted sampling clock for the read data timings. However, to me as a digital designer from the 1980s and as a multi-layer PCB designer from the 1990s, the really hard bit to get my head around is how they maintain data integrity with up to 64 single-ended data lines on a module all switching at up to 3600 mega-transitions per second or more. Differential is easy (relatively), but single-ended, bi-directional signals...? sheeesh...
Thank you for this. Just got an Aqua Z690 OC, and I never overclocked memory before. This was very timely; should be fully built in a few days.
In fact, it's mandatory to set up data some ps ahead so the target chip is sure to get valid data on its input buffer for sampling. This is always specified in the data sheet.
On RS232 the appropriate time to sample is right in the middle of the clock.
The functioning is well explained, great video!
Very interesting video BZ! Somewhat different than other videos you have put out. Keep us on our toes BZ! Appreciate it!😃
I have greatly enjoyed all the deep dives into ram
For those wondering 01011100 (0x5c) is ASCII for '\'. Don't ask why I didn't need to look this up.
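The claim is easy to verify:

```python
# Binary 01011100 is 0x5C, the ASCII code for backslash.
code = 0b01011100
print(hex(code), chr(code))   # 0x5c \
```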
Finally the video I've been wanting for quite some time
I learned so much! Thanks dude!
I love how DDR memory internally behaves how you would expect QDR memory to behave from a high-level description.
QDR?
@@mitlanderson Quad Data Rate, like GDDR memory
@@LordSaliss damn you just blew my mind, cause I didn't even know GDDR was QDR.
@@mitlanderson i thought the g-ddr meant it was double data rate
Just a quick note: DQ is D (an input) during a write operation, or Q (an output) during a read operation. Saying data queue is not technically accurate, although the data is either going into a pipeline (writes) or coming from a pipeline (reads). The pipeline is why the differential clock CK_c/CK_t is required. The DQS_c/DQS_t pair is responsible for transferring the input data D to the pipeline, or transferring the output data Q to the output buffers.
Edit: The 90 degree phase shift for the read clock is there because it is necessary to allow the data enough time to travel from the output buffers of the RAM chips to the inputs of the memory controller.
I'd love to see this in relation to drive strengths, on-die termination and setup times. What happens to the signals when things are not configured properly in bios.
Quality video. Thank you. Amazing very useful and extremely interesting. Thank you
Good 1st lesson. Go ahead and do them all (ram timing charts). Historical ram timings may be helpful in explaining as fewer timing parameters.
Hey! Thank you for all the good information!
13:20 OH NO!!! THE FANCY ONES!!! everybody, find cover!
thanks, thats a very informative video, bz, cheers, have a great day!
Hey Buildzoid. Are you an electrical engineer. I am hoping to study elec. hardware engineering in college and find your pcb breakdowns amazing. Your knowledge is unrivaled.
Way too bad at math to get into an EE course. I failed out of my 2nd year of computer systems engineering.
Speaking of memory chips, I had an old Dell server from 2010; it had 18 RAM slots, and each slot could support these 16GB DIMMs. Each DIMM I believe had 17 chips on either side (I think one of them was for parity ECC or might not have been memory).
That's at least 576 memory chips, on a server from 2010 that only supported 3 channels per socket.
Now we have 12-channel processors, and there are now LRDIMMs that I think have 24-32 chips per side, meaning you could see as many as 4608 memory chips in a single 2-socket server.
Thanks for this explanation. From this I pondered whether quad data rate was possible, since why can't you combine transmission on the rising/falling edges AND at the highs/lows of the clock? I didn't realise that Intel has already used it in their CPUs.
Loved this video. Gonna watch the rest of the series tomorrow
i normally only click like on music i dont mind hearing twice. i clicked like for this.
Hey Buildzoid! Any chance you'll also explain some of the timings, including why they're needed/what they do ? Things like CAS Latency.
I remember vaguely that in order to read or write you need some latency, during which the memory sort of "prepares" the exact address, that is, the exact row and column of the physical memory location. Also, at some interval (10ms? 100ms?) the whole memory needs to be refreshed with electricity (think of it like watering the plants), during which the memory is effectively unusable for anything. If I'm not mistaken, the interval between refreshes is that 4th/last timing of the usual 4 primary timings (the big one). And that's why you want it to be as high as possible: a bigger interval means the memory is offline fewer times per second, so effectively less time offline. But make the interval too big and you risk memory errors, as some cells/areas might lose too much charge and a 1 might become a 0. Or something like that.
I always wanted to know more about this area. I'd also like to know the best- and worst-case scenarios for when a CPU wants or needs to read or write data from memory. Knowing how the memory works in detail would allow someone to compute these timings. For example, a worst case would be waiting for it to finish another operation, then waiting out CAS latency, then probably waiting for its periodic charge refresh, and then... CAS latency again? And THEN finally the actual data transfer. I don't know.
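The refresh overhead described above can be put into rough numbers. A back-of-envelope sketch in Python — the tREFI and tRFC values below are typical datasheet-style figures for a DDR4 8Gb die, used purely as illustrative assumptions:

```python
# Rough sketch of DRAM refresh overhead. tREFI is the average interval
# between refresh commands; tRFC is how long the memory is busy per
# refresh. Values are illustrative, not from any specific part.
tREFI_ns = 7800.0   # ~7.8 us between refresh commands (typical DDR4)
tRFC_ns = 350.0     # ~350 ns unusable per refresh (typical 8Gb die)

overhead = tRFC_ns / tREFI_ns
print(f"Memory unavailable ~{overhead * 100:.1f}% of the time")
```

This is why raising the refresh interval (or lowering tRFC) helps performance a little: it shrinks the fraction of time the memory spends "watering the plants", at the cost of letting cells leak charge longer.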
I am personally a bit surprised that the data lines aren't differential. Main reason with differential signaling is that it isn't easily disturbed by EMI from other nearby signals, or even power rail noise.
Also, in regards to generating a clock signal at an arbitrary offset from another, this isn't too hard to implement in practice. One simple solution is just a few selectable binary-weighted delay lines; it barely takes anything to build. And DRAM chips don't change their distance from the memory controller, i.e. when we talk to a given module we know its data comes back at a certain skew. To a large degree, one can do the same to de-skew the data lines from each other.
However, a 90-degree offset is the simplest, since the data is clocked at whatever "transfer rate" the memory channel operates at. I.e., a 3200 kit does clock its data at 3200 MHz; however, the reference clock sent out to the DRAM chip is halved for signal integrity reasons. When we halve a frequency, we transition on every second transition of the frequency we want to halve. But if we have a second frequency divider that transitions on the unused transitions, then we get two halved versions of the frequency that are 90 degrees apart.
There are also other ways to generate phase offsets, for example through quadrature modulation, but that isn't particularly trivial. It more or less offers the ability to choose any arbitrary phase depending on the input amplitudes of two 90-degree offset references. With a pair of good variable-gain amps we can cover the whole 360 degrees.
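The divide-by-two trick in the comment above can be sketched numerically: one divider toggles on the reference clock's rising edges, the other on its falling edges, and both outputs run at half the reference frequency but a quarter of their own period apart. A toy model (not real silicon, just the edge bookkeeping):

```python
# Toy model of generating two 90-degree-offset half-rate clocks from one
# reference clock. Even-numbered edges are rising edges of the reference,
# odd-numbered edges are falling edges.
ref_edges = list(range(16))

div_rise, div_fall = 0, 0      # the two divide-by-two flip-flop states
wave_rise, wave_fall = [], []
for edge in ref_edges:
    if edge % 2 == 0:          # toggle this divider on rising edges only
        div_rise ^= 1
    else:                      # toggle this one on falling edges only
        div_fall ^= 1
    wave_rise.append(div_rise)
    wave_fall.append(div_fall)

# wave_fall lags wave_rise by one reference edge = a quarter of the
# divided clock's period, i.e. a 90-degree phase offset.
print(wave_rise)
print(wave_fall)
```

Each output repeats every four reference edges (half the reference frequency), and the second waveform is shifted by exactly one edge relative to the first — the 90-degree offset the comment describes.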
very helpful ,thank you!
Awesome thanks for the video!
So technically, if memory is using a clock signal of say 2.2 GHz and it's branded as 4400 MHz, chances are somewhere in that memory controller there's a clock running at 4.4 GHz to generate those two phase-shifted 2.2 GHz signals. Meaning, _technically_ they're not lying 😅
You don't need a 4400MHz clock to generate two 2200MHz phase-shifted signals. In this case it's quite simple to do with one 2200MHz clock, because when measuring the voltage on the positive and negative terminals they are always going to be of equal absolute value and opposite sign, which is a 180-degree shift.
@@guycxz I thought the point was to get a 90-degree phase shift? But you're right, for this you might need 4x the frequency, not 2x.
@@unvergebeneid
DDR simply takes advantage of the fact that the clock changes twice per cycle. The second clock can be at any offset, and isn't even required since you could just measure the signal every time it hits 0. The issue with not having a second clock is that any interference would offset your clock. So you use two inverted signals, expecting them to suffer mostly identical interference, and check when they're equal instead.
And you want a 180 shift between the two clock signals because it's easier to generate two inverted signals from a single clock, rather than having to synchronize two or more.
The issue with going beyond DDR is that even if you can synchronize an arbitrary number of clocks with a constant offset between them you will still end up with a clock signal identical to one from a clock running at the sum of all frequencies, so you face the same signal integrity issues.
@@guycxz that's not what we're talking about here though.
@@unvergebeneid I reread your reply, and I think I understand.
The 90 degree shift is from the "peaks" of the clock signal to the point where you do the read/write operation, while the 180 degree phase shift is between the two clock signals.
So why would you need 4x your frequency?
Weird question, but if you were transferring a lot of 0s, would your power draw be less than with a lot of 1s, comparatively?
Most chips are built with CMOS technology these days. Energy is dissipated as heat when signals change, so frequent switching between 0 and 1 uses more power. The data lines here are just one part where levels change, so the difference is minuscule.
But there have been instances where crypto keys were compromised by doing thousands of operations and then statistically analysing the power variations.
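A toy illustration of the point above — dynamic power in CMOS tracks signal transitions, not how many 1s you send, so counting toggles on a line is a crude proxy for switching energy:

```python
# Count 0<->1 transitions on a data line. In CMOS, each transition
# charges or discharges the line's capacitance, dissipating energy;
# a steady level (all 1s or all 0s) costs almost nothing dynamically.
def toggle_count(bits):
    return sum(a != b for a, b in zip(bits, bits[1:]))

all_ones = [1] * 8          # steady high: no switching energy
alternating = [0, 1] * 4    # worst case: toggles on every bit

print(toggle_count(all_ones))     # 0 toggles
print(toggle_count(alternating))  # 7 toggles
```

So a stream of all 0s and a stream of all 1s draw roughly the same power; it's the pattern's transition density that matters.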
My issue is that you didn't really explain how DDR works beyond the "sampling on both edges", which is quite basic. Everything you said about sampling can also be said for SDR memory, which works the same way but only samples on the positive edge of the clock. So the really interesting question would be: why not just use a clock with twice the frequency and sample only on the rising edge? The answer to this question is the whole point of DDR.
The answer is probably "Physics." There is probably something which makes doubling that clock physically impractical relative to the solution chosen.
I'm guessing at some point someone who was trying to make the clock faster realized they could get far more out of a smarter design than a better one due to having to engineer around physical limitations.
@@PwadigytheOddity Raising the wave frequency (clock) on a bus is difficult for a variety of physical reasons. Those reasons can be simplified down to the fact that the line must be significantly shorter than a wavelength to ensure signal synchrony and integrity at both ends of the transmission line.
10:50 to about 11:12 is really all that makes DDR DDR. You read/write twice per clock cycle, and that's all there is to it.
It is being used instead of SDR because it's easier to maintain signal integrity on a lower clock speed; Any delay in voltage change you'd get from things like inductance, capacitance, or certain active components would just make up a smaller portion of a longer clock cycle. Though I wouldn't expect it to be mentioned because this isn't what the video is about.
In SDR you mostly don't get that 90° shift in the actual data; you get a 180° shift instead (at the 'other' edge of the clock), which makes the memory controller a bit simpler.
But then you have to generate and transmit a stable, clean clock signal of twice the frequency across the motherboard, which consumes more power, needs better drive transistors and has a few more issues. So doing DDR is preferable.
To all the people replying: yes, I know why it makes sense to have a slower clock and sample on both edges; that was my point. I just think it would have been more interesting to discuss the issues that led to the use of DDR instead of doubling the clock speed, rather than taking 20 minutes just to show signal sampling on the rising and falling edges, which can be shown in 30 seconds.
Thanks dude.
I would love to see you do a segment like this with PCIe
Thanks for the nice video. It is truly amazing! Could you also attach the link to the reference document you are using in the video?
Maybe it would be worth noting the difference between basic DDR(1), where the memory and bus clocks are the same, DDR2, which runs the bus 2x faster than the internal clock, and DDR3 and DDR4, which run the bus 4x faster. Also worth mentioning that the double data rate is used only on the DQ lines... commands etc. are sent only on the rising edge.
Hi, my question is: whose clock does the memory controller work from? Does it run off the CPU clock (generally in GHz) or the DRAM clock (in MHz)? It's confusing since it's located on the processor chip in modern systems, and I couldn't find any clear answer online. Thanks.
So what's the point of the data strobe (DQS)? Why not use the main clock? In the diagram they look perfectly in sync 🤔
They are. But if you use the main clock, you start charging and discharging parts of the silicon that should be left alone. It's basically the main clock masked with an "enable" signal.
So actually, during a clock cycle, a bit of data can change hands at 4 different points (every 90 degrees). How much longer until JEDEC advertises this as quad data rate?
Nicely explained
I can appreciate the detail and that you cut to the chase. I just wanted to suggest that you give more context in future educational videos. I had no problem following, but I also have a CS degree and already know the subject. Once again, just a suggestion, but I think it would be more effective if you first explained what a bus is and its relationship to the clock.
You mentioned that it's beyond the scope of this video to explain the reason the clock signal is differential. Could you give me something to google?
Might be related to how audio signals do it in balanced connections.
If there is a noise source, it will appear on both signals. If you take the difference between the two signals, the noise (being identical on both lines) cancels out and you're left with a clean version of the signal.
If the signals are just flipped, so 2V and -2V, and you sum them up, you get zero. Everything not zero is noise.
@@TV4ELP
1. Using a comparator essentially does the "flipping" and noise cancelling in one step AIUI
2. Differential signalling only eliminates *common mode* noise. ie. noise that is the same on both lines.
Common mode noise is the primary source of noise in conductors that are arbitrarily close together and parallel but as frequency gets higher (with respect to line length and/or voltage), other sources of noise start to become significant too eg. capacitive coupling
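The common-mode rejection this thread describes can be demonstrated with a few lines of arithmetic. The voltage levels and noise amplitudes below are made up purely for illustration:

```python
# Differential signalling sketch: the "true" line carries the signal,
# the "complement" line carries its inverse, and the same common-mode
# noise lands on both. Subtracting the lines cancels the noise exactly.
import random

random.seed(0)
bits = [1, 0, 1, 1, 0, 0, 1, 0]
recovered = []
for b in bits:
    level = 2.0 if b else -2.0                 # e.g. CK_t at +/-2 V
    common_noise = random.uniform(-1.0, 1.0)   # hits both lines equally
    line_t = level + common_noise              # true line
    line_c = -level + common_noise             # complement line
    diff = line_t - line_c                     # noise cancels -> 2*level
    recovered.append(1 if diff > 0 else 0)

print(recovered == bits)
```

Even with noise amplitude half the signal swing, every bit is recovered, because the subtraction removes anything common to both lines — which is exactly why only *non*-common noise (e.g. capacitive coupling hitting one line harder) survives.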
@Alireza Same "Differential Signal" should find what you're looking for.
How long do you think til PAM4 on consumer RAM modules? How do you think it would affect overclocking?
Good job.
I wonder how this works when a ramdisk is mounted and used as a swap partition on Linux, lol. Would that screw up the RAM read/write timings on clock edges, or would it just delay the phase by a cycle while the ramdisk is mounted?
I have a question about DDR3: what if I forgot to connect the ZQ pin and it still works properly?
Quick question: Does anyone know why the differential clock signals have the subscripts c and t? CK_c and CK_t ... I am guessing it has to do with some opamp standard that I am unfamiliar with.
So if I understand it correctly QDR would require a Read Clock offset by 45 degrees or 2 read clocks? I will have to watch video one more time, to understand the basics of that (I understood - probably - DDR, it's just that I started thinking about QDR).
It's a good video. And yeah it will be useful - once the damn heatwave goes away, maybe I will build the damn computer that I bought parts for.
More DDR datasheet videos?
This is awesome. With this diagram could you explain what the rest of the timing numbers mean and how they translate into wall-clock time? The timing information is listed as a set of 4 numbers: CL-tRCD-tRP-tRAS, correct? Given this, a CL of 10 at 1000 MHz is 10ns (because 1 second / 1,000,000,000 Hz = 1ns per cycle, × 10 cycles = 10ns). So every transaction that happens incurs this 10ns preamble cost. But in the same way, a CL of 20 at 2000 MHz is also 10ns (because 0.5ns per cycle × 20 cycles = 10ns again). So a lot of things play into wall-clock time, and when we are doing billions of memory reads and writes these tiny nanoseconds really add up into perceivable time.
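The cycles-to-nanoseconds arithmetic above can be checked in a couple of lines (the function name is just for illustration):

```python
# Convert a CAS latency in clock cycles to wall-clock nanoseconds.
# The period of a clock in MHz is 1000/MHz nanoseconds.
def cas_ns(cl_cycles: int, clock_mhz: float) -> float:
    period_ns = 1_000.0 / clock_mhz
    return cl_cycles * period_ns

print(cas_ns(10, 1000))  # CL 10 at 1000 MHz
print(cas_ns(20, 2000))  # CL 20 at 2000 MHz -> same 10 ns
```

Both come out to 10 ns, which is why comparing raw CL numbers across kits at different frequencies is misleading: the latency in time is what matters.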
CL (CAS) is already shown on the diagram directly where CL=11 meaning 11 clock cycles go by before the read operation starts.
tRCD (= CL + tRCD) is harder to see on here. When does a new row open up in memory? As software engineers, what should we be looking for to avoid this, if we can?
tRP (= tRP + tRCD + CL) is the precharge cycle count between issuing the precharge command and opening the next row. What is a precharge command? When does one open the next row? How does that play into data-bus width, if at all?
tRAS (= tRCD + CL, or tRCD + 2×CL) is the minimum number of clock cycles required between a row-active command and issuing the precharge command. When is it tRCD + CL and when is it tRCD + 2×CL?
In general it's easy to see now why CL numbers are usually the thing that's advertised up front. It plays into all of the calculations and is an unavoidable cost that must be paid for any command execution.
>tRCD (= CL + tRCD) is harder to see on here. When does a new row open up in memory? As a software engineer what should we be looking for to avoid this if we can?
You get something like 2048 rows on a stick and 1024 columns in each row AFAIK - some 8MB per row.
If you want to read a chunk of data which is all together, you can read all 1024 columns on the back of a single Row Access Strobe (what RCD, RAS-to-CAS delay, is referring to). All you have to do is open the row and then spam CL, CL, CL as many times as you want. If your data is split into little fragments all over the memory, though, you cannot do this and access becomes much slower - you'll often have to wait for Precharge to close a row and then RCD to open another one and get ready to CAS again. This is somewhat analogous to fragmentation on a hard drive, where it may read a single chunk of data at 100MB/s but slow to below 1MB/s if it has to seek all over the platter after transferring a few kilobytes.
I don't think there's much you can do to optimise this from a programming standpoint unless you're writing in a low-level language like assembly and can be very explicit about exactly where, when and how you're reading and writing DRAM. I believe a lot of it these days is automatic on the memory-controller side in hardware and not modifiable even if you wanted to and had the know-how.
>What is a pre-charge command? When does one open the next row?
Precharge is the delay to close a memory row before you can open another one. If you have an open row and you want to read data which is in a different row of that bank, you must wait for Precharge to close the current row and then RCD to open the other one before you can CAS with that new row.
>tRAS - When is it tRCD + CL and when is it tRCD + 2×CL?
It's never explicitly either of those, nor do they make that much sense as a guide. The memory controller will try to open a row, do something inside of it and only afterwards it may try to close it again - closing a row may allow us to avoid waiting for Precharge later on if we need to access a different row, so an IMC will sometimes choose to gamble on closing a row in the hopes of the next requests being for a different row.
If tRAS hasn't elapsed then it can't close the row and has to wait. Even if you have a tRAS of 20, the IMC may choose to keep a row open for 100 or 1000 cycles if it's doing something useful in that row or thinks that it can do soon.
The fastest operations (where you can open a row, do something and then even want to close it again, thus potentially hitting the tRAS limit) are probably writes, because writing is generally faster than reading. For example, my B-die at 3800MT/s runs stable with RCDRD and CL of 14, but RCDWR of 8 and CWL of 10.
CL or CWL do happen for every read or write operation, but because of their headline importance they are also tuned pretty well on virtually all memory. You'll never find a stick which has CL running 2x or 3x slower than it could be. The biggest speedups in memory timings are generally in the secondary timings because you *do* find these 2-3x improvements - like a FAW of 49 which can run perfectly stable at 16, for example. That's much more important than improving CL - even from e.g. 18 to 14.
@@AerynGaming Woah, great answer. Now I need to know more about secondary timings. Is there a place where you learned about all of these things, some resources you could share?
Is the CPU memory controller the one that regulates the voltage?
hi Can you do videos on GDDR memory as well please?
THX a lot👍👍
Would QDR be possible and if yes how would it look like?
I mean, technically a lot is possible. You could generate a clock with doubled frequency on the memory chips and use it the same way as in DDR, but then you're probably better off just doubling the clock frequency to begin with and keeping the memory simple.
@@Basement-Science that would still be DDR but with a doubled clock. Just as you say at the end.
@@SwordQuake2 Well we are talking about digital circuits, something always has to change state when you want to do anything. 1 clock cycle always only has 2 state changes, not 4 or more, so you need to get the extra state changes somehow.
You could use some variations of what I described. For example using 2 90° shifted clock signals to clock the RAM, and alternate between them. That's basically still the same thing.
So if something like that doesn't satisfy your idea of QDR, then I guess it's impossible.
@@Basement-Science well yes, the cycle has 2 edges so that's why it was a question whether it would be possible. I guess with 2 clocks that are shifted like you mentioned it would be possible to get the 4 edges in a cycle.
How is Linus still going to get this wrong, jk I love Linus dropping stuff
Thank you, that's a good explanation.
Is this a video he posted he wasn't going to make?
snore
Nice.
Are the 32-bit DDR5 subchannels in sync with each other like the 64-bit channel in DDR4?
Depending on the memory controller, all memory channels can be completely independent or behave like 1 giant memory channel. Fun fact: Intel 6-11th gen CPUs actually behave like 128-bit single channel rather than 2x 64-bit channels. 12th gen works at 2x64 or 4x32.
@@ActuallyHardcoreOverclocking DDR5 is 80 bits on a single stick: two 32-bit subchannels plus two 8-bit ECC lanes.
@@ActuallyHardcoreOverclocking Fun fact, I just installed "128 GB DDR5" 6000MT/s (two 64GB kits, 2x32, f5-600j3040g32gx2-tz5nr) on a ROG Strix X670-E with a Ryzen 9 7950X3D and it only recognizes 3200MHz :(
@@ActuallyHardcoreOverclocking For 12th gen, is it because the frequencies it handles are now higher, or what would be the reason?
Wonderful explainer on DDR technology
Is there a way you recommend for someone, who isn't beyond clicking OC is bios, to learn how to overclock their system? Or is that strictly learning computer science?
Still, the lack of experience playing around with hardware will be a limiting factor. There are tons of guides, info and videos, but one never stops learning. I started overclocking in 2005 with Socket A and have passed through a lot of hardware and configurations, yet only this year did I start to understand DDR4 RAM overclocking at its fullest (only because I never bothered, tbh).
There are a lot of online resources, even on BZ's playlists he has a lot of informative videos on how to approach certain aspects of overclocking. Also this area of computers is actually closer to electrical/computer engineering than computer science, which might help push you in the right direction.
Only thing that might be hard to find that BZ has are some of the datasheets to components or processes that are not readily available by the manufacturer. They don't make all of them public and/or take them down from public access.
👍🏆
It's hynix high speed time
bye bye Sam.!
Excellent explanation (for which i’m still too dumb but nevertheless i still understood something), thank you
20 minutes of bs. everybody knows the mechanics, functions and performance of RAM is in direct correlation with how much RGB it has.
nice1