This is absolutely the future
For some reason I lost my shit when I saw the 10G management port haha
yeah that's really a 'because we can' thing 'eh... lol
Eh, it sounds "crazy" when looked at from a *bandwidth* perspective; but from a *latency* perspective, the time a packet takes to serialize onto the wire also sees a tenfold reduction. Moving from standard 1518-octet Ethernet frames to 9000-octet jumbo frames (widely supported on GigE and 10GbE gear, though never formally standardized in IEEE 802.3) is also an option for internally routed traffic that doesn't traverse the gateway out to the WAN. And some of the new Redfish management responses get chucked into time-series databases behind Grafana, where the finer granularity of the timestamp deltas makes the cluster-wide order of events easier to determine accurately. This is one of the primary reasons I moved all my colocated machines onto 10G switch ports but retained my low transit cap. I don't use a lot of bandwidth, but I frequently make interactive use of it (remote management) for control-plane manipulation and VM setup, so lowering the round-trip times can make a massive difference in quality of life, even for something as simple as VNC.
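To put rough numbers on the serialization part (a sketch of wire rates only, ignoring propagation delay, switching latency, and preamble/inter-frame gap overhead):

```shell
# Serialization delay = frame bits / line rate.
# Compare a 1500-byte frame and a 9000-byte jumbo at 1 Gbps vs 10 Gbps.
for rate_gbps in 1 10; do
  awk -v r="$rate_gbps" 'BEGIN {
    printf "%2dGbE: 1500B = %5.2f us, 9000B jumbo = %5.2f us\n",
           r, 1500 * 8 / (r * 1000), 9000 * 8 / (r * 1000)
  }'
done
```

At 1 GbE a full-size frame takes about 12 µs just to clock onto the wire; at 10 GbE it is about 1.2 µs, which is the tenfold reduction described above.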
@@SLLabsKamilion Yeah, I am aware of the latency reduction (not really sure why you went into jumbo frames, though, since they /can/ be present even with GigE), but still: for a management interface that's likely just serving an HTML5 configuration page, a latency improvement of a few microseconds will be completely unnoticeable, so I'm not really sure I get your angle here.
It's more likely that the CPU (DPU?) just doesn't have the hardware for 1GbE, instead having 10GbE as a minimum, which is fairly reasonable for a 100GbE processor.
There's a reason the BMC only has 1GbE, and that's because nothing more is needed; but unlike the DPU, it doesn't natively support anything over 1GbE.
The biggest problem that I've run into with ultra-fast SSDs is that the faster they are, the faster they will die.
(Because you will want to use the blazing-fast system ALL. THE. TIME. And because even enterprise-class SSDs that can sustain 10 DWPD still use NAND flash memory cells with a finite number of erase/program cycles, you WILL burn through that write endurance limit really, really quickly.)
What I would be MOST curious and MOST interested in is what happens if you yank one or a few of the drives out of each node: what happens to the system and to the robustness and integrity of the service it provides? Or when you burn through the erase/program cycles of the NVMe U.2 SSDs and the drives either mark themselves read-only or just die?
I'd be really interested to see how the system reacts to that, because it WILL happen with SSDs of any kind.
Can we just appreciate how easy it is to mount an NVMeoF device? It's literally just
# nvme connect-all
I love that their Web GUI gives you the command to copy/paste.
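For context, a minimal nvme-cli sketch of what that one-liner expands to on a typical NVMe/TCP setup (192.0.2.10:4420 is a placeholder address/port, not anything Fungible-specific; an RDMA fabric would use -t rdma instead):

```shell
# Ask the remote discovery controller what subsystems it exports
nvme discover -t tcp -a 192.0.2.10 -s 4420

# Connect to every subsystem the discovery log reports; namespaces
# then appear as ordinary local block devices (/dev/nvmeXnY)
nvme connect-all -t tcp -a 192.0.2.10 -s 4420

# Confirm the new devices are visible
nvme list
```

From there the remote namespaces behave like local NVMe drives: you can partition them, mkfs them, or hand them to a VM.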
Redfish/swordfish not up yet?
How is this easier than FC?
In FC you don't even have to do a single thing on the initiator; you just expose the volume on the target and you're DONE.
But it still is easy.
Thanks for this one.
Crispy video for the interview, well done. The split interviewer/interviewee is kinda odd but not bad. All these DPUs are cool, but I fear much for the whitebox/open source with these DPUs
We were lucky we had what we had. With interference in the lab mix, we had to pull audio from the shotgun pointed away from me. It was easier to re-record since I was not on screen due to distancing.
@@ServeTheHomeVideo I didn't notice any drop in quality for the interview audio, good thing to have a backup
That camera at 2:11! No wonder the videos look good.
Ha! Just a little Canon C200 rigged out. That is likely going to become a B-cam by December.
Fantastic read performance! Relevant for zero workloads...
Not a single question on write performance? C'mon, you guys can be a little more critical; you know this stuff.
We have some more on performance on the STH main site. Actually, they can get closer to 15M IOPS / 60GB/s per node, but we are only giving credit for what was demonstrated. We only used up to 5 client servers, and the client servers topped out around 2M IOPS and 60GB/s. That was a 15-20 minute demo that we had to cut down. NVDIMM is there to help absorb write workloads and sync writes. Good point; we probably should have left more of that in there.
Can multiple of these systems be pooled for added capacity, performance, redundancy, etc.? It's already two entirely separate nodes in one box.
Yes, we have a bit on this on the linked main site article where, for example, we have six systems for IBM Spectrum Scale. This is 100% designed to be a scale-out storage solution.
I wonder how much the performance suffers when encryption is added for both data-at-rest and data-in-flight protection.
Encryption was on in the demos. It is sped up a bit, but you can see the at-rest version being turned on. The DPUs have line rate accelerators for crypto/compression functions.
No one:
Fungible: Here's an NVMe array with 10 million IOPS
just imagine how many VMs you could run on this bad boy
I would (ask for one) if it weren't for the fact that NVMe SSDs are such data death traps.
@@ewenchan1239 What is the problem with NVMe SSDs? Please elaborate...
@@prashanthb6521
It isn't only NVMe SSDs. It's ALL SSDs, in that the NAND flash memory cells have a finite number of erase/program cycles.
Therefore, even with overprovisioning, once you've consumed those finite erase/program cycles, the SSD is pretty much trash at that point, unless you specifically set the SSD up as a WORM device, which, let's face it, nobody does with an SSD.
As a result, SSDs are like the brake pads on a car: a wear item. Once you've worn through the write endurance limit, you throw it out and buy a new one.
All NAND flash based SSDs work this way.
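To put rough numbers on that wear-out argument (a hypothetical drive: 3.84 TB capacity, rated 1 DWPD over a 5-year warranty, hammered at a sustained 3 GB/s of writes):

```shell
# Rated endurance (TBW) = capacity x DWPD x days of warranty.
# Days to consume it = TBW / (sustained write rate x seconds per day).
awk 'BEGIN {
  cap_tb = 3.84; dwpd = 1; years = 5
  tbw    = cap_tb * dwpd * 365 * years          # rated terabytes written
  rate   = 3                                    # GB/s sustained writes
  days   = tbw * 1000 / (rate * 86400)
  printf "Rated for %.0f TBW; gone in ~%.0f days at %d GB/s\n",
         tbw, days, rate
}'
```

A drive warranted for five years of 1 DWPD can, in principle, be worn out in under a month if you saturate it, which is the point being made above.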
@@prashanthb6521
I've burn through six the write endurance limit of 6 SSDs over the last four years because of this fact about the underlying technology behind it and that it has a finite number of erase/program cycles.
So what am I missing here? Maybe I'm super old school, but how is this different from an all-flash FC SAN like Pure Storage?
10GbE mgmt interface... yeah, because that is surely needed!
@starshipeleven Well flexing comes at a price for the customer.
@@savagedk Probably no more than $30; just put in a 10Gbase-T SFP+ module.
They already had the silicon designed, so this was probably very low cost for that.
Also it probably allows you to run custom software on the DPU and provides control plane access to this software.
The more, the more!
@@varno How often do you run around tossing $30 away and getting nothing in return?
@starshipeleven exactly
Can you connect a Proxmox server to this server, or a single-DPU version, and do a local NVMe test, then a network NVMe test?
- What's the cost of that server without the drives?
- Can you connect directly, or do you need a switch?
- If you have two of these storage servers, can you do HA without a switch?
- What about clustering: will total storage show in one interface, or do you have to go to each server's interface?
STHstore should be the new meme
I did not completely understand the scalability argument; it was 8 servers to 8 shelves, so why would that not scale?
Quick comment on your recent Quadro RTX 4000 review: the article states that the VRAM is ECC-enabled. This is not true, only the 5000, 6000 and 8000 run ECC in the RTX lineup.
Only the 8000.
Why so many 100GbE ports? Does it have to do with their fabric? Were they using RDMA or TCP as the transport?
They have 12x SSDs. With PCIe Gen3 NVMe SSDs they need around 3x 100Gbps to match SSD bandwidth. With Gen4 SSDs they need around 6x 100Gbps lanes. A rough mental model (this is far from exact, but an easy way to remember) is that a PCIe Gen3 x4 NVMe SSD takes around a 25GbE port worth of traffic. We covered a bit more on TrueFabric and what they are announcing next in terms of the adapter in the main site article.
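The arithmetic behind that rule of thumb (approximate; raw link rates, ignoring protocol overhead on both the PCIe and Ethernet sides):

```shell
# PCIe Gen3 runs at 8 GT/s per lane with 128b/130b encoding, so a
# Gen3 x4 SSD tops out near 31.5 Gbps: roughly one 25GbE port.
awk 'BEGIN {
  per_ssd = 4 * 8 * (128 / 130)     # Gen3 x4 bandwidth in Gbps
  total   = 12 * per_ssd            # twelve SSDs per node
  printf "Per SSD: %.1f Gbps; 12 SSDs: %.0f Gbps (~%.1f x 100GbE)\n",
         per_ssd, total, total / 100
}'
```

Gen4 doubles the per-lane rate, so the same twelve drives would saturate roughly twice as many 100Gbps links, hence the ~6x figure above.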
At 3:46: so Fungible has some serious EMC problems? ;)
Wait, no car drive with mask explanation? :) well ok, at least some merch :)
No time. That car scene took longer than you would expect, and we had this live.
Really cool storage system, but a bit confused on MIPS and not RISC-V. Is it just too new or not tested enough?
When you have networking folks who have used MIPS for years, you get products based on MIPS :-)
@starshipeleven thanks for the info. It may happen at some point but not now. (I just realized how badly I was spelling RISC-V)
@@ServeTheHomeVideo a very mature platform then.
MIPS has been used for decades, and it's well known and mature. Not as power efficient, but in a system like this, power is not the problem.
What is the power consumption? And even more important, the price?
At 5:00 that camera angle needs some work: slightly lower, less headroom. Maybe pack some sound/moving blankets to help make spaces echo less.
What is the latency, and the link length?
At this point, this is not a "DPU"; it's just a custom-CPU-powered system with a custom software stack.
Not really. A conventional CPU system would rely much more on memory accesses. These are highly custom chips that more or less directly connect NVMe to the 100Gbit/s links on the same silicon.
@starshipeleven The HBM is likely for deep packet buffers on the integrated 100Gb NICs. When you're throwing around that much traffic and some congestion hits, you need to make sure you can temporarily buffer packets somewhere for QoS purposes, especially when you're talking about NVMeoF. A single dropped packet would mean an entire reinit of the NVMeoF connection, which is slow and costly. We already see HBM in use on modern merchant silicon switches.
Like others have said, this is likely a direct DMA implementation, with the built-in MIPS CPUs simply orchestrating the DMA transfers from the NVMe PCIe address space to the integrated NICs' address space. The data they're transferring won't be hitting the CPU, nor its memory.
@starshipeleven So I'm not sure what point you're trying to make anymore.
Are you arguing that anything with a CPU in it isn't a true DPU? In this day and age, it would be foolish to build a DPU using purely state machines instead of some form of reprogrammable logic, not only for flexibility but also for the ability to fix bugs in your product after it's been manufactured.
At some point, everything is technically a CPU. The hardware video encoding and decoding block inside your phone's SoC? That's a CPU, just a custom one with an instruction set designed specifically for processing video. It even has reprogrammable firmware. Does that make it not a "highly custom chip"?
My point, I guess, is that most things these days are some form of CPU. What differentiates a DPU from a standard x86 machine is the same thing that differentiates a GPU from an x86 processor: one is built with a specific purpose in mind, whilst the other is a general-purpose platform. Just because something uses reprogrammable logic (i.e. a CPU) doesn't mean that it's not a highly custom implementation.
Without any knowledge whatsoever of their specific product, I'd wager that less than 10% of the entire die size of their chip is used by the CPU core complex. The rest will be PCIe controllers, DMA controllers, high-speed SERDES for PCIe and 25Gb channels, crypto and compression accelerators, custom NICs, etc. When you look at it that way, 90% of the entire thing is custom.
Very impressive! Does anyone knows what network fabric is being used here? FC, infiniband...
100GbE. They also have a TrueFabric solution including adapters for cloud servers. More on that in the STH main site article although we mention it here
@@ServeTheHomeVideo 100GbE RoCEv2, iWARP, or TCP?
@@andrewmoch8107 You would think they would support all of the above.
I'd love to know more about the redundancy model employed; these shouldn't be run standalone.
And how about continuous writes? Does performance tank when the NVDIMMs are full?
If it only has this performance because it is writing to RAM, it's like a Samsung EVO all over again; not that impressive.
Do they have 2x 48 PCIe lanes to the disks, or is that switched?
No PCIe switches. They have plenty of PCIe lanes for this.
History rhymes. CPUs fail to keep up, so someone has to build a purpose-built CPU and firmware to keep up. But in a few years CPUs will have a breakthrough, and everyone will stop doing this because of the complexity of maintaining your own CPU and firmware code. It's awesome if you need this right now, but it will be short-lived.
No matter how much better CPUs get, custom solutions will always be better and faster, for the people who need them, of course.
better use fungible than lumaforge
Question: 2.2M IOPS limit from a single server... what are the specs? A single DPU server with 24 drives? All drives in a single array?
ruclips.net/video/NjhTTMNGBBw/видео.html
Someone needs to tell @LinusTechTips about this thing lol...
x86 is old and full of bloatware, time to replace it
Isn't MIPS as well? They should try to use RISC-V for a modern platform.
@@thomasb1521 MIPS isn't bloated; MIPS and RISC-V are actually quite similar, as both descend from the classic RISC designs of the 1980s.
The architectural advantages of RISC-V over MIPS might not be that relevant for many workloads.
@starshipeleven So what? That's a problem for the CPU designers to solve, not for the computer users to worry about. It's just that for almost a decade Intel didn't have real competition and they got lazy. Once AMD got their act together, it became obvious how much we had been missing. And, as usual, the best is yet to come. The rumors of x86's death have been grossly exaggerated. Many times before.
@starshipeleven That is true. Also, it is too general for this application to be done efficiently.