Highly Available Storage in Proxmox - Ceph Guide

  • Published: 6 Feb 2025

Comments • 111

  • @TechnoTim
    @TechnoTim 8 months ago +32

    Nice work! Thanks for making this easy! I need to try it out someday!

    • @Jims-Garage
      @Jims-Garage 7 months ago +5

      Thanks, Tim. I'm finding it particularly useful for K3S Servers and my firewall. Having the VMs failover automatically means there's no disruption to the cluster, no pulling pods etc.

  • @Layer2Clouds
    @Layer2Clouds 5 months ago +4

    Great Video - we support Hosted Proxmox clusters in the US and your guides are a go-to for our clients! Thank you Jim.

    • @Jims-Garage
      @Jims-Garage 5 months ago

      @@Layer2Clouds wow, thanks for sharing. That's great to hear.

  • @ewenchan1239
    @ewenchan1239 8 months ago +16

    1) You don't TECHNICALLY need a separate drive, you just need a separate PARTITION that Ceph can take over and have full control over.
    For example, in my OASLOA Mini PC (N95, 16 GB, 512 GB NVMe 2242 M.2 SSD), I partitioned the 512 GB NVMe SSD on each of my 3 nodes such that 128 GB is given for the Proxmox install, and the local-lvm, and then the rest is a separate partition that is given to Ceph to have dominion over.
    (My OASLOA Mini PC doesn't HAVE another slot where I can add additional storage devices, so I had to make do with what it has.)
    Once you have it partitioned like that, you can proceed with putting the 3 nodes into a Proxmox HA cluster, per usual, and you can then set up the Ceph cluster as well, also via the Proxmox GUI to perform the initial install, and also to set up your first monitor.
    2) re: iGPU passthrough
    This is why I DON'T recommend you install any VMs/CTs until the infrastructure has been set up to be what you want it to be.
    Set up the clustering and Ceph first, THEN set up your VMs/CTs.
    That way, the IOMMU groups will stabilise, such that it will be USABLE for what you're trying to do with it before deploying VMs/CTs/services.

    • @Jims-Garage
      @Jims-Garage 8 months ago +1

      Thanks for the tips, I'll consider that on the next deployment.

    • @ewenchan1239
      @ewenchan1239 8 months ago

      @@Jims-Garage
      No problem.
      In my case, my storage depended on Ceph RBD/CephFS being up and running before I could store the VM/CT disks, so the clustering and Ceph had to be up and running first before I could do anything else.
      I know that you are storing the VM/CT disks on local storage, rather than storing it on the Ceph storage system, so you were able to start installing VMs/CTs before your Ceph system was set up.

    • @0xKruzr
      @0xKruzr 8 months ago

      yeah, but you don't want to write-exhaust the device if it's also booting the node.

    • @ewenchan1239
      @ewenchan1239 8 months ago

      @@0xKruzr
      Depends on how much traffic you're putting on the system/cluster.
      For my case, my 3-node HA Proxmox cluster running Ceph exists only to serve Windows AD DC, DNS, and AdGuard Home.
      So none of that is intensive.
      The monthly backups are probably more write intensive than anything else that happens for the rest of the month.
      (My N95 Mini PC, with only 16 GB of RAM, is too slow to really do much of anything else.)

    • @MrNGm
      @MrNGm 6 months ago

      In the constrained setup ewenchan1239 describes, using a separate partition on a single drive may be acceptable. Readers with other setups and/or reliability requirements should take into account that Ceph's reliability stems from (among other things) being able to spread data chunks across a larger number of OSDs (object storage daemons), such that the unavailability of 1, 2, or 10 OSDs doesn't impact the cluster. How many failures it tolerates depends on the configured rules regarding failure domains (further reading in the Ceph documentation: CRUSH maps). I would always advise reading a bit more on Ceph, its high-level architecture, and the failure modes.
      In the setup ewenchan1239 describes (3-replicated Ceph with Proxmox), the cluster will become unavailable if, for example, you're performing maintenance on 1 host and the disk of another one fails. Nevertheless, with a setup where VM data is accessible on all hypervisors through shared (network) storage, maintenance on a single hypervisor becomes a lot simpler.
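
The maintenance-plus-failure scenario described in the reply above can be sketched as a small calculation. This is a minimal sketch, assuming a replicated pool with size=3 and min_size=2 (check your own pool settings), one replica per host, and host as the failure domain; the function name is made up for illustration.

```python
def pool_accepts_io(size: int, min_size: int, hosts_down: int) -> bool:
    """Return True if a replicated pool can still serve I/O.

    Assumes one replica per host (failure domain = host) and that every
    placement group has a replica on each of the `size` hosts, which is
    the case in a 3-node cluster with size=3.
    """
    replicas_available = max(size - hosts_down, 0)
    return replicas_available >= min_size


# Assumed settings: size=3, min_size=2 on a 3-node cluster.
for down in range(4):
    state = "serving I/O" if pool_accepts_io(3, 2, down) else "I/O blocked"
    print(f"{down} host(s) down: {state}")
```

Under these assumed settings, one host down keeps the pool writable, while two hosts down (one in maintenance, one failed) leaves a single replica, which is below min_size, so I/O pauses until a host returns.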

  • @muhammadabidsaleem7048
    @muhammadabidsaleem7048 7 months ago +4

    Thank You Jim
    Keep posting new videos, especially on SDN please

  • @davidbuchaca
    @davidbuchaca 6 months ago +2

    Very nice and detailed tutorial! abbadon, sanguinius, dorn, proposing names for the following nodes: lion, khan, corax

    • @Jims-Garage
      @Jims-Garage 6 months ago

      @@davidbuchaca awesome! Sage choices too!

  • @Chris-rm1pn
    @Chris-rm1pn 8 months ago +14

    MS-01s also have vPro which supports Serial over Lan, so if you lock yourself out and don't have GPU used by host you can use that to fix issues

    • @Jims-Garage
      @Jims-Garage 8 months ago +6

      Thanks, I'm still to get that working. It's quite buggy from my limited trialling.

    • @Chris-rm1pn
      @Chris-rm1pn 8 months ago

      @@Jims-Garage I recommend using meshcentral and their guides if you haven't tried it; it's the best working solution I've found so far

    • @Andy-fd5fg
      @Andy-fd5fg 8 months ago

      Long live the serial port!
      Tis a shame they don't have a physical 9-pin serial connector

    • @cschwartz
      @cschwartz 8 months ago

      @@Jims-Garage agreed. The implementation unfortunately is lacking and quirky. I loaded the meshcommander firmware on it to get web based kvm without needing meshcommander sw running on client or hosted app. However even that had quirks but enhanced functionality. I ended up giving up and going lacp with the 2.5 ports and reverted back to a trusty raritan ipkvm and a usb tty console. I never could get the wol aspect of it working, and it had to be in a booted state for it to function.

    • @cschwartz
      @cschwartz 8 months ago

      @@Andy-fd5fg tty to usb… No need for a db9

  • @Insightfill
    @Insightfill 8 months ago +1

    Oh! I've been looking forward to this one!

  • @johnwalshaw
    @johnwalshaw 8 months ago +5

    I opted for 3x Nextorage NEM-PA2TB for 2GB DDR4 SDRAM. Very happy so far. It's great having a 3 node CEPH cluster.

    • @Jims-Garage
      @Jims-Garage 8 months ago +1

      That's great, sounds like a solid setup.

  • @DS-ou7xm
    @DS-ou7xm 7 months ago +1

    It's OK, mate, nothing wrong with having cold and flu symptoms... And awesome video... thanks

  • @NickS34252
    @NickS34252 4 months ago +1

    Excellent video - I've been following along while tinkering with my own cluster. When it comes to fast nodes like the MS-01, it's a bit tricky to figure out what to put into ceph vs local storage given the performance limitations.

    • @Jims-Garage
      @Jims-Garage 4 months ago

      @@NickS34252 thanks. I totally agree! I'm often scratching my head thinking which should I use.

  • @georgelza
    @georgelza 1 month ago +2

    ... have you done a video where you expose ceph storage to a K8S cluster via a csi driver? I have a Proxmox cluster with Ceph configured over it, running a K8S cluster, and would like to place my shared block storage for the EBS onto my ceph pool.

  • @nadtz
    @nadtz 8 months ago +2

    If I hadn't already built a new proxmox host before the MS01 came out I might have gone this route (though with dedicated hardware for opnsense), it's kind of crazy what minisforum was able to pack into the MS01 for the price and that ceph + proxmox HA is available for home users for free.

    • @Jims-Garage
      @Jims-Garage 8 months ago

      I agree. There are quirks but it's impressive.

    • @Carlos-Rodrigues
      @Carlos-Rodrigues 2 months ago

      I was waiting for this machine for so many years. Now I have 4 of MS-01. 3 for the cluster and another just for OPNSense. It's fast. It's stable. It's amazing. I just wonder if I can create a network with the MS-A1 through Thunderbolt so I can use it as a backup server with PBS.

  • @zxxz-ob7ll
    @zxxz-ob7ll 5 months ago +1

    The grim reality of the universe requires a grim order. The machine requires perfection. Any error can become a catastrophe

  • @amateurwizard
    @amateurwizard 1 month ago +1

    Only the Warhammer nerds noticed... Nice! 😼

    • @amateurwizard
      @amateurwizard 1 month ago +1

      Cluster: Grimdarkfuture

    • @Jims-Garage
      @Jims-Garage 1 month ago

      @@amateurwizard haha, have to let the inner nerd out occasionally

  • @rodneykahane4994
    @rodneykahane4994 5 months ago +1

    not sure what the performance implications are, but the nvme osds that were created were classified as ssd. in the advanced tab, you can manually select the drive type (hdd,ssd, or nvme).

    • @Jims-Garage
      @Jims-Garage 5 months ago +1

      @@rodneykahane4994 thanks, let me check that!
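
For the device-class point above, an existing OSD can also be relabelled from the CLI. A minimal sketch, assuming the standard `ceph osd crush rm-device-class` and `ceph osd crush set-device-class` commands and a hypothetical OSD id of 0; the helper name is made up, and it only builds and prints the commands rather than running them, so review them before executing anything on a node.

```python
import shlex
from typing import List

# Classes mentioned in the comment; Ceph itself accepts other labels too.
KNOWN_CLASSES = {"hdd", "ssd", "nvme"}


def reclassify_osd_commands(osd_id: int, device_class: str) -> List[List[str]]:
    """Build the two commands that change an OSD's CRUSH device class.

    The existing class has to be removed before a new one can be set.
    """
    if device_class not in KNOWN_CLASSES:
        raise ValueError(f"unexpected device class: {device_class!r}")
    osd = f"osd.{osd_id}"
    return [
        ["ceph", "osd", "crush", "rm-device-class", osd],
        ["ceph", "osd", "crush", "set-device-class", device_class, osd],
    ]


# Hypothetical example: relabel osd.0 from ssd to nvme.
for cmd in reclassify_osd_commands(0, "nvme"):
    print(shlex.join(cmd))
```

Whether the label matters depends on whether any CRUSH rule selects by device class; if every pool uses the default replicated rule, the difference is likely cosmetic.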

  • @orgind7778
    @orgind7778 8 months ago +1

    Thanks great video

  • @hyperprotagonist
    @hyperprotagonist 8 months ago +6

    He’s only gone and bloody done it 👏

    • @Jims-Garage
      @Jims-Garage 8 months ago +4

      Haha, thanks. A lot of late nights behind this one for something that on the surface is quite straightforward!

    • @hyperprotagonist
      @hyperprotagonist 8 months ago +1

      @@Jims-Garage kudos for persevering. On twitter you highlighted the setbacks, on discord you kept everyone reassured, and in the video your demeanour was as if it was merely a hiccup. You weren’t lying when you said I didn’t know half of it 😂

  • @MarkConstable
    @MarkConstable 8 months ago +4

    I'm pretty sure if you used the gui to set up Ceph you would have had fewer problems. I've done it a number of times and did not have to use the cli at all.

    • @Jims-Garage
      @Jims-Garage 7 months ago

      The cli is necessary for the backhaul network. If it was simply the vmbr0 route then you're right, GUI would be a good choice.

  • @JonatanCastro
    @JonatanCastro 8 months ago

    This is amazing, I just got the MS-01 to create some content for my channel, but definitely would love to have the needed hardware to do a CEPH setup. Anyway, I digress; just want to ask you how quick it is to move a CT, considering you can't live migrate them, but on the other hand, the storage is already shared!

  • @johnvandenhurk8650
    @johnvandenhurk8650 2 months ago +1

    First of all, I love your videos and have watched many of them.
    I have had a similar CEPH configuration on an MSI Cubi Proxmox cluster using Samsung 990 Pro NVMe SSDs. I was pretty happy with this until I noticed that, less than six months in, SMART monitoring is failing on two of the NVMe drives. Wearout for the three 990 Pros is 150%, 255%, and 6%. On the Proxmox forum I'm told that this is due to consumer-grade SSDs.
    The 255% is from the node that does the most IO, but by no means are these heavily loaded systems.
    I wonder what your experience is so far on wearout because of Ceph?

    • @Jims-Garage
      @Jims-Garage 2 months ago

      @@johnvandenhurk8650 thanks. It does chew through consumer SSDs. Mine is on about 40%, I think it's good for about 4 years in total.

    • @johnvandenhurk8650
      @johnvandenhurk8650 2 months ago

      @@Jims-Garage Thanks for the swift response!
      perhaps it is only mine that have an issue, but mine are failing within a year. I will reach out to my vendor and create a ticket.
      I hope yours are better!
      How happy are you with your MS-01's? I'm considering an upgrade to an MS01 (i9-12900) cluster for the SFP+
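
The wear figures in this thread can be turned into a rough lifetime estimate. A minimal sketch, assuming wear accumulates linearly and taking six months in service as an illustrative figure (the comment only says "less than six months"); the function name is hypothetical. Note that the NVMe "Percentage Used" health field is capped at 255, so a 255% reading may understate the real wear.

```python
def projected_lifetime_months(wear_percent: float, months_in_service: float) -> float:
    """Linearly extrapolate total rated drive life from the SMART wear figure.

    Assumes wear grows at a constant rate, which is a simplification:
    real workloads and write amplification vary over time.
    """
    if wear_percent <= 0 or months_in_service <= 0:
        raise ValueError("wear_percent and months_in_service must be positive")
    return months_in_service * 100.0 / wear_percent


# Wear readings quoted in the thread; 6 months in service is an assumption.
for wear in (150, 255, 6):
    months = projected_lifetime_months(wear, 6)
    print(f"{wear}% after 6 months: rated life of roughly {months:.1f} months")
```

A result shorter than the time already in service simply means the drive is past its rated endurance; it may keep working, but it is outside what the vendor specifies.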

  • @cschwartz
    @cschwartz 8 months ago +4

    If you are going to continue to do iGPU passthrough, have you thought of passing a TTY console via USB to serial, that way you can connect up should HW change and pve wants to move around your NIC naming.

    • @Jims-Garage
      @Jims-Garage 8 months ago

      Good idea, I'll look into that. Thanks

  • @monish05m
    @monish05m 8 months ago

    May I ask for a video on how to set up that virtual nic you have running on your opnsense?
    Thanks and really loved your video.

  • @dimitristsoutsouras2712
    @dimitristsoutsouras2712 7 months ago

    Nice presentation of the procedure and your special case scenario as well.
    At the part where you created a cephfs (after you created individual ceph managers), where is that fs created? On the same 1TB NVMe storage? If yes, shouldn't it have some kind of partition separation between VM storage and ISOs, or do those object storage services arrange that automatically (what goes where)?

  • @sku2007
    @sku2007 8 months ago +3

    there's some pcie passthrough translation in pve8. meaning you can set the hw for each node and in the vm the "friendly name" (don't know their wording right now, it's in datacenter somewhere)

    • @Jims-Garage
      @Jims-Garage 8 months ago +1

      Thanks, wasn't aware of that. I'll take a look

    • @sku2007
      @sku2007 8 months ago +2

      it's called resource mappings, right below metric server

    • @Jims-Garage
      @Jims-Garage 8 months ago +1

      @@sku2007 thanks, I took a look just now and the i226-v isn't on the node. Very odd!

    • @sku2007
      @sku2007 8 months ago +1

      @@Jims-Garage very odd! even when forwarded, the HW gets listed with lspci in host shell. with lspci -v you'll see a line with Kernel driver in use: vfio-pci

    • @Jims-Garage
      @Jims-Garage 8 months ago +1

      @@sku2007 I've tried all of those to no avail. I'm going to load a live Linux installation. If I don't see it I'll rma

  • @Eli-q5z9h
    @Eli-q5z9h 2 months ago

    in the system file /etc/hosts, I put the ip addresses of the public network or the ceph network?

  • @fbifido2
    @fbifido2 7 months ago +1

    @4:33 - the thunderbolt backhaul does not show up as a network bridge inside Proxmox ???

    • @Jims-Garage
      @Jims-Garage 7 months ago

      Eno5 and eno6 are the thunderbolt adapters. You could create a bridge if you wanted.

  • @DavidC-rt3or
    @DavidC-rt3or 7 months ago

    After having setup somewhat of a test PBS server and backing up the nodes of the cluster, trying to find the steps of how to do a restore of a node that is in a cluster and has ceph.. just to make sure all of the needed information was backed up and how to restore (ahead of time :) ) Ideas?

  • @vonwerderc
    @vonwerderc 8 months ago +2

    Very interesting. I'm curious how HA with OPNsense would work. Wouldn't the WAN connection from your Modem only go into one node? If that one dies, how would the other nodes be connected?

    • @Jims-Garage
      @Jims-Garage 8 months ago +3

      The WAN connection goes into a switch that splits the internet to the nodes via a vLAN. They are all members.

    • @headlibrarian1996
      @headlibrarian1996 8 months ago +1

      How does routing work then? Only one member of the cluster should get the traffic and the switch wouldn’t know which one that is.

    • @Jims-Garage
      @Jims-Garage 8 months ago

      @@headlibrarian1996 well there's only one firewall at a time.

  • @majoryoshi
    @majoryoshi 8 months ago +2

    I could be mistaken on this, but in regards to your HA OPNsense, is there any reason why you couldn't plug your WAN into a switch (even an unmanaged one would do the trick) and plug the WAN ports on your nodes into said switch? Since you're doing HA through Proxmox/Ceph and not through OPNsense, I see no reason why that wouldn't work. Please correct me if I'm wrong though.

    • @Jims-Garage
      @Jims-Garage 8 months ago

      That's what I'm going to try.

  • @RoiskiaFilms
    @RoiskiaFilms 7 months ago

    I just noticed that naming scheme and I am confused. Failbaddon the Harmless and then the two primarchs? Anyway, great video. Looking forward to trying this myself in the future.

    • @Jims-Garage
      @Jims-Garage 7 months ago

      Thanks 👍 Cadia stands (oh wait!) 😲

  • @jeffersonsantos4603
    @jeffersonsantos4603 8 months ago +2

    Great job, man. Do you have full network performance for Opnsense via the VirtIO bridges?

    • @Jims-Garage
      @Jims-Garage 8 months ago +1

      Yeah, it maxes out 10Gb via iperf3 and full 2Gb up/down via speedtest.net

    • @romseaaccthree1448
      @romseaaccthree1448 8 months ago

      ​@@Jims-Garage i'm assuming this is for the same VLAN iperf test. Would you also be able to test iperf results for inter VLAN traffic?

  • @Jayroglyph
    @Jayroglyph 5 months ago +1

    How would one tap into this Ceph cluster from a Kubernetes cluster running on VMs in the HA Proxmox cluster?

    • @Jims-Garage
      @Jims-Garage 5 months ago

      @@Jayroglyph you'd simply select the storage volume on the ceph as the storage volume for the VM. You can see that in my OPNSense video afterwards whereby the OPNSense uses the ceph storage to make it HA with a single node.

  • @Copernicus22
    @Copernicus22 8 months ago +1

    Hi, very impressive work! are those ceph benchmark speeds normal though? I was expecting more given 25gbit/NVMe?

    • @Jims-Garage
      @Jims-Garage 8 months ago

      Normal for consumer devices. Ceph isn't about performance, it's about reliability. It's perfectly fine from my experience. Anything super heavy you want local.

    • @Copernicus22
      @Copernicus22 8 months ago

      @@Jims-Garage ok thanks, yeah I did it once years ago, I think I had similar results with ceph using microk8s.

  • @simuman
    @simuman 7 months ago

    Hey jim, really like your videos. I tried this a few months back and not sure if I got this ceph system wrong or not, but couldn't get it to work with a connected external NAS storage through mapped CIFS mount as the HA did not recognize the IP address for media for plex on fail over. Do you know if this is possible or have I got the wrong end of the stick about HA and how it works?

  • @randallsalyer
    @randallsalyer 5 months ago +2

    the fix for your ipv4 is now in the setup documentation, you have it after your source line, just fyi hope you see this
    also add this as the last line to the interfaces file (/etc/network/interfaces), unless there is a sources line, in which case put it immediately before the sources line (or delete the sources line)
    # This must be the last line in the file unless there is a sources line in which case put this immediately above the sources line (or delete the sources line)
    post-up /usr/bin/systemctl restart frr.service

    • @Jims-Garage
      @Jims-Garage 5 months ago +1

      @@randallsalyer thanks, I will look at that!

  • @janstasik9094
    @janstasik9094 4 months ago +1

    Hello, may I ask you about the stability of the ms-01 since you deployed th4 and ceph? I've ordered boxes but meanwhile I've read horrible stories about the ms-01: how hard it is to deploy vPRO, that proxmox installation is a nightmare, that bios upgrade and microcode deployment are nearly unrealistic, how impossible it is to configure and run TH4 ports, and that overall ceph and box stability is a nightmare, with a reboot needed every 3 days etc. What is your real life experience? Is it worth buying them? From my side, the best hardware for homelab. Thank you.

    • @Jims-Garage
      @Jims-Garage 4 months ago +1

      I haven't had a single issue since buying about 3 months ago. They've been on all that time, are on stock bios and are running ceph via TB4. Proxmox installation is the same as any other device. I don't use vpro as I don't have a need to, but I've heard it's a nightmare. Only issue I had was to disable ASPM in the BIOS.

    • @janstasik9094
      @janstasik9094 4 months ago

      @@Jims-Garage Thanks...

  •  3 months ago

    @9:33 Try to _ALWAYS_ have a serial console. That never fails.

  • @snowballeffects
    @snowballeffects 6 months ago +1

    SO... that lock out problem when you pass through the GPU - I have a standby PCI (yup PCI 😂) GPU that I popped into that previously annoyingly unused slot - leaving the original gpu in place. plug in the SVGA monitor 😂 and boom - hello cli 😅

    • @Jims-Garage
      @Jims-Garage 6 months ago

      @@snowballeffects nice, that's a good failsafe!

  • @Irish2086
    @Irish2086 8 months ago

    I have been looking for this answer for a while... How would one figure out the right number for a 5-7-9 node CEPH configuration... I just found information about a 3-node config

    • @headlibrarian1996
      @headlibrarian1996 8 months ago

      I like 5 more than 3, but 5 MS-01s is fairly pricey and you can’t do a full-mesh thunderbolt network with 5. With five, shutting down a node for maintenance doesn’t completely degrade the cluster, and erasure coding works better with more nodes.
      A 5-node Qotom cluster is interesting because they have 2 SFP+ 10G ports, but I don’t know how well it would actually perform. You could have one set of SFP interfaces on a dumb switch for the private backhaul network, and you need 5 ports on your main switch for the public facing interfaces.
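
The node-count question and the full-mesh remark above come down to two bits of arithmetic. A minimal sketch, assuming that Ceph monitors need a strict majority for quorum and that a full mesh needs one direct link between every pair of nodes; the function names are made up for illustration.

```python
def monitor_failures_tolerated(monitors: int) -> int:
    """Monitors that can fail while a strict majority still remains."""
    return (monitors - 1) // 2


def full_mesh_links(nodes: int) -> int:
    """Total point-to-point cables needed to connect every pair of nodes."""
    return nodes * (nodes - 1) // 2


def full_mesh_ports_per_node(nodes: int) -> int:
    """Each node needs one port for every other node in the mesh."""
    return nodes - 1


for n in (3, 5, 7, 9):
    print(
        f"{n} nodes: tolerates {monitor_failures_tolerated(n)} monitor failure(s), "
        f"full mesh needs {full_mesh_links(n)} links "
        f"and {full_mesh_ports_per_node(n)} ports per node"
    )
```

A 5-node full mesh needs 4 ports on every node, so machines with only two Thunderbolt ports (as I understand the MS-01 to have) top out at a 3-node mesh; beyond that a switched backhaul, as suggested in the reply, is the practical option.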

  • @lsimsdj
    @lsimsdj 3 months ago

    My mini PCs have one 512GB NVMe SSD each... Will this not work? Does it mean I need to buy one additional NVMe SSD for each mini PC in the cluster?

    • @Jims-Garage
      @Jims-Garage 3 months ago

      Correct, CEPH requires a dedicated drive.

  • @kienanvella
    @kienanvella 7 months ago

    You can absolutely run with spinning disks with ceph, but you need quite a few of them, and definitely want some SSD DB/WAL devices.
    I'm running a cluster of 4 nodes, with 24 spinning disks, 6 per node. 3:1 OSD to DB/WAL drive ratio (3 OSDs share one DB/WAL SSD).
    Having said that, it's not stupendously fast - especially for my write-heavy workload, but it's fast 'enough'. I've got about 35 guests, which includes a Zabbix server with DB, 3x elasticsearch, and a graylog system.
    It was quite affordable however, buying used drives in bulk.

    • @Jims-Garage
      @Jims-Garage 7 months ago

      That's awesome, thanks for sharing. I'll do some more testing.

  • @BenjaminBenStein
    @BenjaminBenStein 8 months ago +1

    🎉

  • @voldllc9621
    @voldllc9621 8 months ago +1

    I did not see you creating a shared storage for vm and ct disks. Cephfs cannot host these because that gives you posix file storage only, not block storage. You need RADOS block storage.

    • @Jims-Garage
      @Jims-Garage 8 months ago +1

      Thanks, as mentioned that was in the previous video.

    • @voldllc9621
      @voldllc9621 8 months ago

      Sorry, I missed that, probably since I saw you installing Ceph from scratch, and after creating a replicated pool, going straight to Cephfs for ISO and CT template file storage. ISO and CT templates are not crucial for HA.

    • @DavidC-rt3or
      @DavidC-rt3or 8 months ago

      In my setup I've got one crush rule and pool setup for ssd's for the vm disk and another with hdd's for data virtual disk of the vms. Not a high volume/performance need

  • @cberthe067
    @cberthe067 8 months ago +1

    There is no Erasure Coding in Crush Rule ?

    • @Jims-Garage
      @Jims-Garage 8 months ago

      It's a trade off from my understanding. Erasure coding ensures better replication (data loss prevention) but impacts on performance. As I always abstract my data I'm less worried about it as a long term storage mechanism (more for failover capability).
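
To put rough numbers on the replication versus erasure coding trade-off: a minimal sketch comparing usable capacity and the number of lost copies or chunks each scheme survives. The k and m values are illustrative assumptions, not settings from the video, and the function names are hypothetical.

```python
from typing import Tuple


def replicated_profile(size: int) -> Tuple[float, int]:
    """Return (usable fraction of raw capacity, copies that can be lost)."""
    return 1.0 / size, size - 1


def erasure_coded_profile(k: int, m: int) -> Tuple[float, int]:
    """Return (usable fraction of raw capacity, chunks that can be lost).

    k is the number of data chunks and m the number of coding chunks.
    """
    return k / (k + m), m


schemes = {
    "replicated size=3": replicated_profile(3),
    "erasure coded k=2, m=1": erasure_coded_profile(2, 1),
    "erasure coded k=4, m=2": erasure_coded_profile(4, 2),
}
for name, (usable, losses) in schemes.items():
    print(f"{name}: {usable:.0%} usable, survives {losses} loss(es)")
```

Seen this way, erasure coding mainly buys space efficiency, at the cost of extra CPU work and latency. With host as the failure domain it also needs at least k+m hosts, which limits the options on a 3-node cluster.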

  • @mridulranjan1069
    @mridulranjan1069 7 months ago +1

    You didn't show or guide through the setup of anything, just talked, showed your face and a couple of screenshots. Seriously man, what CRAP!

    • @Jims-Garage
      @Jims-Garage 7 months ago

      @@mridulranjan1069 did you ensure that your monitor was on and that the sound wasn't muted?

  • @MelroyvandenBerg
    @MelroyvandenBerg 7 months ago +2

    is covid back again in the country? blehh.

    • @Jims-Garage
      @Jims-Garage 7 months ago +3

      @@MelroyvandenBerg yeah, I think there has been a summer spike

    • @dazealex
      @dazealex 7 months ago

      @@Jims-Garage Even here in California.