9 Steps to Recover from a Proxmox VE Boot Disk Failure

  • Published on 25 Nov 2024

Comments • 62

  • @PhilippHaussleiter
    @PhilippHaussleiter 4 years ago +17

    I actually have a cron job running that commits the config from every Proxmox node to a separate git repository.
    Then you also have a clear history of the changes made to that system. Already saved me days :-).
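
The cron-plus-git approach described above could look roughly like the sketch below. It is not the commenter's actual script: the function name `snapshot_config`, the per-node subdirectory layout, and the `pve-backup` commit identity are all assumptions made for illustration. Note that /etc/pve also holds private keys, so the target repository should be treated as sensitive.

```python
import shutil
import subprocess
from datetime import datetime, timezone
from pathlib import Path


def snapshot_config(src: Path, repo: Path, node: str) -> bool:
    """Copy src into repo/<node>/ and commit it; return True if a commit was made.

    `repo` must already be an initialized git repository. The layout
    (one subdirectory per node) is a hypothetical convention.
    """
    dest = repo / node
    if dest.exists():
        shutil.rmtree(dest)  # drop stale files so deletions show up in history
    shutil.copytree(src, dest, symlinks=True)  # keep symlinks as links

    def git(*args: str) -> subprocess.CompletedProcess:
        # The commit identity below is a placeholder, not a Proxmox default.
        return subprocess.run(
            ["git", "-C", str(repo),
             "-c", "user.name=pve-backup",
             "-c", "user.email=pve-backup@localhost",
             *args],
            check=True, capture_output=True, text=True,
        )

    git("add", "-A")
    # `git status --porcelain` prints nothing when there is nothing to commit.
    if not git("status", "--porcelain").stdout.strip():
        return False
    stamp = datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M:%SZ")
    git("commit", "-m", f"{node}: config snapshot {stamp}")
    return True
```

A cron entry would then call something like `snapshot_config(Path("/etc/pve"), Path("/srv/pve-config"), "node1")` on each node, where both the repository path and the node name are placeholders.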

  • @mokhalifa3528
    @mokhalifa3528 4 years ago +3

    Just had this happen to me yesterday! I managed to recover everything and am operational again now, but funny timing with the video being uploaded! 😛 In my case it was a root logical volume that kept growing until it filled its expanded space, which caused unresponsiveness due to one CPU thread hanging, and cue the kernel panic.
    I ended up renaming my volume group manually, adding a small SSD, and reinstalling Proxmox fresh on the alternative drive. Then I changed the local LV thin pool in the storage config file and copied all the config files into the new directory, which automatically points the VMs to the appropriate LVs for their disks.
    Great video and awesome work from you and your team as always! 🙌

  • @andarvidavohits4962
    @andarvidavohits4962 4 years ago +22

    I love Proxmox, but I have noticed that there's really not much emphasis on backing up the host machine/boot drive(s). You can make ZFS snapshots, but what if you're not using ZFS on your boot drive? dd? The host might not be as important as some of the VMs running on it, but it's clearly not trivial to reconstruct when it dies. I, for one, would appreciate it if there were a tool, maybe not much more than a customizable script, inside the GUI or CLI, which would let you gather the important network, VM, storage, etc. config files of the host as well as any custom drivers and pack them into a tarball that you could then extract onto a newly installed host to restore most if not all of the settings. I'd write a script myself, but since this would be something people (including myself) would entrust their lives and livelihoods to, I'd feel much better if it came from the developers, who know what goes on under the hood on a much deeper level than I do.
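
As a rough illustration of the "pack the host config into a tarball" idea, here is a minimal sketch using Python's standard `tarfile` module. It is not an official Proxmox tool. The path list in `DEFAULT_PATHS` is an assumption about which files matter on a given host, and the function names are made up for this example.

```python
import tarfile
from pathlib import Path

# Assumed set of host files worth keeping; adjust per host.
DEFAULT_PATHS = [
    "/etc/pve",
    "/etc/network/interfaces",
    "/etc/hosts",
    "/etc/hostname",
]


def pack_host_config(paths, archive: Path) -> list[str]:
    """Write every existing path into a gzip tarball.

    Returns the paths that were skipped because they do not exist,
    so a missing file is reported instead of silently ignored.
    """
    skipped = []
    with tarfile.open(archive, "w:gz") as tar:
        for p in map(Path, paths):
            if p.exists():
                # Store relative names so extraction never writes to "/"
                # unless the operator chooses that target explicitly.
                tar.add(p, arcname=str(p).lstrip("/"))
            else:
                skipped.append(str(p))
    return skipped


def list_archive(archive: Path) -> list[str]:
    """Return the member names, useful for checking a backup before relying on it."""
    with tarfile.open(archive, "r:gz") as tar:
        return tar.getnames()
```

Listing the archive after writing it is a cheap first check, though only a test restore onto a fresh install shows whether the backup is actually usable.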

  • @pshubert21
    @pshubert21 4 years ago +1

    I am glad I watched this video. I am thinking of running pfSense in a VM, and I now know more about how to set up Proxmox to prevent failure so I don't lose my network with a drive failure. If I do, I know how to get it back, hopefully within an hour. I have only been using Linux for about 2 years, so I am still learning a lot about the CLI.

  • @jamiemcparland
    @jamiemcparland 4 years ago +1

    Great video for someone who has a lot of Proxmox hosts! Thanks so much!

  • @timrightnour4022
    @timrightnour4022 4 years ago +4

    Good stuff! I will say, if you are running fully clustered plus Ceph, the VM config part becomes irrelevant. You just import the drives and rejoin. I actually have an Ansible play that I use to set up Proxmox, so a node or root drive replacement is much easier. Boot off USB, install, run Ansible, join cluster, done.

    • @Darkk6969
      @Darkk6969 1 year ago

      Can you share the Ansible playbook with us?

  • @diavuno3835
    @diavuno3835 4 years ago +1

    Good video... you've convinced me to give Proxmox another shot.
    I tried it a few years ago but have been on VMware since... After considering purchasing VMware 7, and watching this video, I think I will re-evaluate Proxmox!

  • @francisphillipeck4272
    @francisphillipeck4272 4 years ago +1

    Good timing. I use your Proxmox quick start guide to get nodes up and running. I have one I've blown up and reinstalled 12 times in the last couple of weeks trying things out... Currently figuring out how this FC SAN setup I have works... haven't quite got it working yet... lol

  • @TheFlatronify
    @TheFlatronify 4 years ago +4

    Really good video with important points. Pretty much everything you could need in an emergency is in there. A big point for Proxmox VE compared to VMware is that you CAN actually recover from such an error pretty easily due to its open nature. In most cases you could also recover data from failed SSDs without too much trouble, as everything PVE is built on is common technology.

    • @EarthStarz
      @EarthStarz 1 year ago +1

      So true, really good points. I even hear those crazy VMware peeps don't even have the ability to back up because, you know, it's all proprietary.

  • @AaronPace93
    @AaronPace93 4 years ago +5

    On the VM config, I personally back up /etc/pve with a quick Python script to other storage, just so I have the files in case of complete failure as well. I haven't had one yet, knock on wood, but I have tested a restore to the best of my ability and it seemed to work. Curious whether that is a good idea as well.
    Edit: I see you covered that in the article. Good stuff as always!

  • @reneb5222
    @reneb5222 4 years ago +2

    Thanks for the great video. Learned some new tricks :). Just about to rip out my Hyper-V and move it over to Proxmox.

    • @ServeTheHomeVideo
      @ServeTheHomeVideo 4 years ago

      Do it! The STH forums have plenty of folks who have done the same.

    • @clausdk6299
      @clausdk6299 4 years ago +1

      Hyper-V without AD is a pain to manage... Especially considering the horrible and annoying security aspect...

    • @reneb5222
      @reneb5222 4 years ago

      @ServeTheHomeVideo You mention a naming convention in the video. Can you point me to it? I cannot find it on the site.
      Also, would it not be easier to back up the /etc/pve/nodes directory, especially when you have LXC and QEMU running?
      I got a backup/restore script running for my dev Proxmox from DerDanilo at github.com/DerDanilo/proxmox-stuff/

  • @escarina781
    @escarina781 4 years ago

    I learned in my IT school: No Backup, No Pity! :D
    Great vid, thx for the content.

    • @YeOldeTraveller
      @YeOldeTraveller 4 years ago

      One of my favorite clients had the following under the glass covering the top of her desk:
      Blessed be the Pessimists, for they hath made Backups.

  • @YeOldeTraveller
    @YeOldeTraveller 4 years ago +1

    I have a friend who made the following statement:
    You don't want a Backup Solution.
    You want a Restore Solution where Backup is but the first step.
    And if it has not been tested, you don't have a Solution.

  • @B4dD0GGy
    @B4dD0GGy 4 years ago +1

    You are on another level for advice. THANKS!

  • @Mikesco3
    @Mikesco3 4 years ago +1

    Loved this episode. You won my subscription.

  • @Trooper_Ish
    @Trooper_Ish 4 years ago +3

    So... the lesson is don't wait 3 months to replace a failed disk?

  • @AndrewMerts
    @AndrewMerts 4 years ago +1

    It'd be nice to have a video going over pve-efiboot-tool for software raid on EFI. IIRC Proxmox doesn't use it by default unless you're on ZFS for the root volume but it's really quite useful even outside of that to give you complete freedom on how you set up your storage and get redundancy for your /boot volume on EFI with software RAID. Personally I run all of my Proxmox servers at work on LVM for all but a short /boot partition on the two OS drives, LVM Mirroring for the root partition, pve-efiboot-tool to handle mounting /boot for updates, etc. As an added benefit /boot with pve-efiboot-tool isn't even mounted until it's being updated so if there's some system crash, the vfat partition is already unmounted and there's basically no risk of it getting corrupted.

  • @charlieday6341
    @charlieday6341 3 years ago

    9 minutes in and I'm still waiting for step 1

  • @BloodyIron
    @BloodyIron 4 years ago +4

    Here's what I don't understand. It seems that the failed proxmox node was in a cluster (if it wasn't, what are you doing???), so that means the other nodes already have the configs for the VM running the forums, and other stuff. Furthermore, I'm willing to bet the storage for the VM is NAS-backed, and if it isn't then seriously WTF are you doing??? Presuming both are true, a cluster and NAS storage for the VMs, then you had the means to actually migrate the VM to other nodes in the cluster, or reboot the VM on the other nodes, with minimal/no downtime. Yet somehow you guys follow a really janky practice of slamming replacement OS drives into an in-memory OS while the full node is still "running"? That's really not good advice, I have to say. Furthermore, you saw drastic issues reported by ZFS for that drive, yet didn't rigorously scrub and retest it, those images posted about it show a lot more than just "SMART errors". You should have known better in so many regards here, and as I watch it, I really see a lot of bad advice here.

  • @Darkk6969
    @Darkk6969 1 year ago

    I know this is an old video, but heads up: it's on the roadmap for Proxmox Backup Server to also back up the host. I've already seen a "host" container on the PBS server, which tells me they're working on it.

  • @84Actionjack
    @84Actionjack 4 years ago

    I yield to your experience and knowledge, but the only SSD I've ever had fail was an Intel which hosted the OS on my Windows Server. Never purchased an Intel drive since.

  • @azop1234
    @azop1234 4 years ago

    pve-zsync is all you need... it backs up the config files and takes snapshots for you. Throw that to a FreeNAS box or another Proxmox ZFS pool and then restore if you need to.

  • @JonathanSwiftUK
    @JonathanSwiftUK 4 years ago

    It is very easy, and now highly likely, to have multiple disk failures due to faulty firmware. Check the latest HPE MSA firmware update: some SSDs stop at precisely 3 years, 270 days, 8 hours.

  • @salmiakki5638
    @salmiakki5638 4 years ago +4

    Are you guys going to receive a DGX A100 to review? :p

    • @ServeTheHomeVideo
      @ServeTheHomeVideo 4 years ago +4

      We will probably do an HGX A100 based system when the supply of A100s on the market increases. Usually customers get them first, then we get to do reviews.

    • @salmiakki5638
      @salmiakki5638 4 years ago +1

      @ServeTheHomeVideo Thank you for the info and the videos in general. Always really interesting.

  • @fuseteam
    @fuseteam 4 years ago

    So if it doesn't boot, there's no way to actually get the VMs off of it? I managed to see the LVM volumes, but I have no idea how to grab them off the server.

  • @xdg-a
    @xdg-a 4 years ago

    Did you not have protective equipment against power inrush events?

  • @moshet842
    @moshet842 4 years ago

    R1 double disk failure is stuff from nightmares.

  • @luukkuwet
    @luukkuwet 4 years ago

    Do you have a redundant, highly available NFS cluster?

  • @CharlesM236
    @CharlesM236 1 year ago

    Why does PBS not do bare-metal backups?

  • @evertonhaisetaques4716
    @evertonhaisetaques4716 4 years ago

    What happens if I have two disks holding virtual machines, /dev/sdb1 and /dev/sdc1, and I empty the first one and remove that storage from Proxmox? When I reboot, the old sdc1 will probably become sdb1 because the other disk is out of the server. Could this make me lose my virtual machines, or will everything work fine?

  • @fbifido2
    @fbifido2 4 years ago

    From this video it seems you are not running the server in a cluster???
    Nor are you using Ceph or any type of shared storage for VMs???
    What backup software do you recommend for Proxmox? Can Veeam for Linux be used???

  • @lennutrajektoor
    @lennutrajektoor 3 years ago +1

    When you showed the Intel DC series SSDs, I recalled the bad experience My PlayHome had in his production environment with HPE servers, where 24 of 50 Intel DC series SSDs died just like that. In one instance 2 SSDs died simultaneously, leaving the production environment without a working server. They had good failover that could bear the failure of two servers, but it was too much and happened too often. It seems the issue lies in the Intel DC series chipset or the build quality. The issue has to be with Intel QA. th-cam.com/video/lZ8D2APljR0/w-d-xo.html

  • @KHITTutorials
    @KHITTutorials 4 years ago

    What do you do for backups of Proxmox VMs? It takes a full backup every time, and there seems to be no way to handle incremental backups. For small VMs it is not a problem, but with larger ones it quickly takes a lot of space for even a week's worth of backups.

    • @andarvidavohits4962
      @andarvidavohits4962 4 years ago +2

      If the underlying storage of your backup location is ZFS, you can turn on compression and that should make sure you're not needlessly storing multiple copies of identical data. In Proxmox, ZFS compression is set to 'off' by default; you can enable it from the host's CLI using the 'zfs set' command. You can also use ZFS deduplication, just make sure you understand it completely and have plenty of RAM beforehand. Also note that compression and dedup will struggle if the VMs use full disk encryption. Remember that if you do decide to use ZFS compression, changing the setting will only apply to newly written data.
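
For reference, the `zfs set` step mentioned above can be sketched as a small helper that only builds the commands, so they can be reviewed before anything runs. The dataset name and the `lz4` default are assumptions, and the helper name is hypothetical. It may also help to keep in mind that compression shrinks each block on its own, while collapsing identical blocks across backups is what deduplication does.

```python
import shlex


def compression_commands(dataset: str, algorithm: str = "lz4") -> list[list[str]]:
    """Build (but do not run) the zfs commands to inspect and enable compression.

    `dataset` is whatever pool/dataset holds the backups; `lz4` is only
    an assumed choice of algorithm.
    """
    return [
        # Show the current setting first, rather than assuming a default.
        ["zfs", "get", "-H", "-o", "value", "compression", dataset],
        # Enable compression; this affects newly written data only.
        ["zfs", "set", f"compression={algorithm}", dataset],
        # Report the ratio achieved so far on this dataset.
        ["zfs", "get", "-H", "-o", "value", "compressratio", dataset],
    ]


def as_shell(commands: list[list[str]]) -> str:
    """Render the commands as copy-pasteable shell lines."""
    return "\n".join(shlex.join(cmd) for cmd in commands)
```

Printing `as_shell(compression_commands("tank/backup"))` gives three lines to paste into the host's shell, with `tank/backup` standing in for the real dataset.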

  • @saswatasarkar7434
    @saswatasarkar7434 4 years ago

    How about XCP-ng?

  • @johnTheSmith
    @johnTheSmith 2 years ago +1

    Update: if you have a cluster, the configs of your dead node can be found at
    /etc/pve/nodes//qemu-server
    /etc/pve/nodes//lxc
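
To see which guests a dead node was carrying, a surviving cluster member could list the config files under those two directories. The sketch below assumes the node name sits between `nodes/` and the guest directory (it appears to have been dropped from the paths as posted) and that guest configs are named by numeric ID with a `.conf` suffix. The function name and the `pve_root` parameter are made up for illustration.

```python
from pathlib import Path


def guest_configs(node: str, pve_root: Path = Path("/etc/pve")) -> dict[str, list[int]]:
    """Return the guest IDs that have a config file under the given node.

    `node` is the name of the dead node as the cluster knows it
    (a hypothetical parameter; nothing here guesses it for you).
    """
    node_dir = pve_root / "nodes" / node
    found: dict[str, list[int]] = {}
    for kind in ("qemu-server", "lxc"):
        kind_dir = node_dir / kind
        ids = []
        if kind_dir.is_dir():
            # Keep only files like 100.conf, skipping anything non-numeric.
            ids = [int(p.stem) for p in kind_dir.glob("*.conf") if p.stem.isdigit()]
        found[kind] = sorted(ids)
    return found
```

Calling `guest_configs("node2")` on a healthy node would then return something shaped like `{"qemu-server": [...], "lxc": [...]}`, with `node2` standing in for the failed node's name.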

  • @luukkuwet
    @luukkuwet 4 years ago

    Those steps are like in AA-club. Just don't remember how many...

  • @clausdk6299
    @clausdk6299 4 years ago +2

    And that's why you build a cheap storage server that acts as both storage server and backup server with ZFS RAID-Z2 😁

    • @clausdk6299
      @clausdk6299 4 years ago

      BTW why do you use local storage? 🤔

    • @ServeTheHomeVideo
      @ServeTheHomeVideo 4 years ago +3

      If we are being frank here, the Optane SSDs running all of our databases locally in each node makes the setup absolutely scream performance wise. We also had a switch port fail a few years ago so it reduces variables that can take the service down. We even had an all-flash Ceph cluster early on which we decommissioned in favor of local mirrored SSDs for performance reasons.
      Good point for a future article though. That may be worth going into more detail on.

    • @fbifido2
      @fbifido2 4 years ago

      @ServeTheHomeVideo Have you tried Ceph with the latest version of Proxmox, v6.2??? Or is Ceph not production ready in your cluster???

    • @WilliamEllwood
      @WilliamEllwood 4 years ago +1

      Completely agree with STH, local gives you wicked performance, just be sure to replicate regularly. Sanoid / syncoid is useful. Not the right solution for all workloads, but consider it.

  • @christophertstone
    @christophertstone 4 years ago +2

    5:50 Installing software isn't the same as ongoing administration - I think I can hear the collective facepalm of every SysAdmin and SysEngineer in the world.
    If you own/manage a business, run it like a business. Hire/Outsource someone to do the things you aren't good at.
    Don't assume because you "made it work" that you know a job well enough to put that business on the line.

    • @ServeTheHomeVideo
      @ServeTheHomeVideo 4 years ago +1

      That is somewhat the point of this video. There are tons of resources on getting an installation up. There are far fewer resources on the tips and tricks that get you through when unexpected failures happen. This, of course, is not comprehensive in that regard, but wanted to share some knowledge learned over 7+ years of running PVE.

    • @malventano
      @malventano 4 years ago +1

      Outsourced workers are not immune to screwing up and causing downtime. Sometimes it's the owner that by default has the greatest care and attention to detail for their own site (even if they lack the breadth of experience of the pros).

  • @marcoaurelio6941
    @marcoaurelio6941 3 years ago

    Mate, 10 minutes in and you haven't said a thing. Could you call it an introduction? Thank you.

  • @eseseis7251
    @eseseis7251 4 years ago

    It has to fail so that you buy the support. Nothing free is really free. Proxmox also used to kill drives.

  • @gullivergimeno7586
    @gullivergimeno7586 9 months ago

    no good video

  • @ewenchan1239
    @ewenchan1239 1 year ago +1

    I am currently in the middle of a mass migration project where I am moving 5 of my NAS servers all into a single Proxmox host server (including all of my desktops, etc.).
    They say not to put all your eggs in one basket, and yet, here I am, doing exactly and precisely that (to save on power). And I mean, I can cut my power down possibly by 1/2 to 2/3rds as a result of this mass migration project when it is complete.
    I don't use a ZFS root because trying to run GPU passthrough with a ZFS root drive was more complicated than I would've liked (i.e. the instructions online for GPU passthrough didn't really work for a ZFS root drive), so I am using hardware RAID for that instead.
    But yeah, Proxmox is pretty awesome so far.
    It's too bad that you can't like "export" the VM (config file and any hard drives) into a single package (i.e. .ova file). (At least I haven't seen/tried it yet, but maybe there's a way to do that.)
    What I would LOVE would be a way to export the host config itself, so that all of your settings are saved and, if a host dies, you can just import all of the host-level config files and run apt to reinstall everything you originally installed post-OS install. The new host would then bring most, if not all, of your system back online for you in a highly automated fashion.
    That would be nice if it can do that.
    There's a LOT of really nice things that I like about Proxmox so far.
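
The "run apt and have it reinstall everything" part of that wish can be approximated by saving a package list on the old host and diffing it on the new one. The sketch below assumes the saved list is one package name per line, such as the output of `apt-mark showmanual`. The function names are made up for this example, and it only builds the command rather than running it.

```python
def parse_package_list(text: str) -> set[str]:
    """Parse one-package-per-line text, ignoring blank lines and # comments."""
    return {
        line.strip()
        for line in text.splitlines()
        if line.strip() and not line.lstrip().startswith("#")
    }


def reinstall_command(saved: str, current: str) -> list[str]:
    """Build an apt-get command for packages in `saved` that `current` lacks.

    `saved` is the list captured on the old host and `current` is the same
    listing taken on the fresh install. Returns an empty list when nothing
    is missing.
    """
    missing = sorted(parse_package_list(saved) - parse_package_list(current))
    return ["apt-get", "install", "--yes", *missing] if missing else []
```

Keeping the saved list next to the host-level config files means a single restore step brings back both, though packages from third-party repositories would still need those repositories configured first.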