Great video. My single node docker container died a death last week and I made rookie mistakes in it so have had a bit of a crash course in swarm after putting it off for ages!
Wonderful video. I agree that Docker Swarm is grossly overshadowed by Kubernetes, *especially* in the self-hosted community. For the small home-labs, it is perfectly adequate.
This is exactly the video I needed maybe 3 weeks ago lol. I managed to figure it out with some effort. 6 nodes, 3 master nodes, GlusterFS, keepalived, Traefik, all on 3 Proxmox hosts. It's a slick setup, but not without weird issues that I can't explain. Portainer would randomly lose communication with the cluster, and when it came back I'd see that all of the services from 2 nodes on the same Proxmox host had been rescheduled to the other 4 nodes. It may very well be an issue with the host itself, but I have LXC containers and VMs (some running vanilla Docker) on the same hosts that don't have any issues at all. There are also some containers that just don't play nice with Swarm. Notably, some of the *arr containers would repeatedly get corrupted databases. I've scaled back on Swarm and only keep critical services running on there, like my reverse proxy, Gitea, and some other stuff that I'd like to have accessible even when I'm taking Proxmox hosts offline for updates/maintenance.
Awesome video. This is something I've been wanting to try for awhile so I'll be back to reference this :)
Nice video, good job. I have 2 questions. Why do you promote all nodes to manager nodes? And can you show how to deploy containers to the swarm using the ceph fs? Like portainer, netbox, any dashboard container, etc.
Nice video!! can you expand a bit more on how you deploy Portainer into this cluster ? and also how you deploy stacks and how the failover works?
yes, please also show how you setup Portainer in this situation
I have worked with Swarm in the enterprise a lot and it has serious reliability issues. For smaller cluster sizes like 3 nodes at home it might be fine, but at larger deployments of a couple of dozen nodes the problems start to show, and when you approach 50 or 60 nodes it just bursts at the seams.
But for homelabbing I'm sure it's fine.
Avoid Yacht for docker, it cannot reliably start and stop containers.
@c0p0n I am really interested in your feedback on this subject since you have worked with this quite a bit in production. What have you seen to be the failure point? Where do you see issues start to creep in at what layer with the Swarm cluster once you start scaling it up?
@@VirtualizationHowto A few issues, all related to capacity. Adding extra leader nodes is somewhat unreliable; the networking model is brittle and can get overwhelmed relatively easily; and at higher container counts (200+), leaders sometimes fail to fully reap containers, leaving running processes behind until nodes die.
But at these smaller scales I don't expect any of that to ever become a problem. It's always been a function of scale as far as I can tell. I've done a few Swarm-to-Kubernetes migrations over the past 4 or 5 years.
I like Nomad for smaller production projects.
I've had a Swarm cluster (2 managers and 3 workers) running for a couple of months in my homelab, and I experienced a lot of instability with containers that suddenly stop. I tried putting resource limitations in place, but it didn't make a big improvement.
Sadly, I did not succeed in getting a stable cluster, which would have been the perfect sweet spot for my homelab with respect to the resources (time and money) I can spend for the time being.
Awesome video!! Looking forward to giving this a try. Still new and learning proxmox and wanting to make my docker containers HA. Can this setup for example support containers like the *arr stacks and Plex/Jellyfin that have databases?
Really nice video. Thanks so much
Awesome tutorial. I am running a Docker Swarm as well, but I set up an external VM to host an NFS share for the swarm. It works, but I worry about hosting a game server as a swarm service. Don't want to lose my server progress.
@Glitch_860 thanks for the comment. I wouldn't think you should worry if you lose the Docker service. Is the game data stored on a persistent volume on your NFS share?
@@VirtualizationHowto yes I specify it as a volume in the docker compose file.
Well, in doing some testing, specifically with Factorio, it seems that when I force a reboot of the host running the server, Docker Swarm does move the container, but it spins up a new save file, so it is not keeping it persistent. Seems I am missing something.
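If the save data lives only inside the container's filesystem, a rescheduled task starts fresh from the image. A minimal sketch of pinning the save directory to the shared CephFS mount from the video so a rescheduled task sees the same saves; the image name and its /factorio data path are assumptions to verify against the image's docs:

```bash
# Rough sketch, not the video's exact steps: keep the game's save directory on the
# shared CephFS path so a task rescheduled to another node reuses the same saves.
sudo mkdir -p /mnt/cephfs/factorio        # CephFS is mounted at the same path on every node

docker service create \
  --name factorio \
  --publish 34197:34197/udp \
  --mount type=bind,source=/mnt/cephfs/factorio,target=/factorio \
  factoriotools/factorio:stable           # image and /factorio data path are assumptions
```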
Excellent video! Congratulations!
Your videos are the best. 👍👍
Thank you @fakebizPrez !
Nice, did you manage to find a ceph volume driver for docker instead of mounting ceph?
Hi. This is a great video. Can you please guide me to configure CephFS and integrate it into MicroK8s cluster.
Great tutorial! What's the SSH terminal you're using? Looks cool!
Remote Desktop Connection Manager and there is a free version as well. Very good product and I use it daily.
Yep, I've been running it for years too, but there is one problem: Swarm doesn't work with an IPv6 overlay network, so it's a deal breaker for me now as I'm moving away from the deprecated protocol (IPv4).
Great video!!!!! Tks a lot!
So... after all the configuration of Swarm and Ceph, do I just install Portainer normally on the first node? That's it?
You can deploy it from any manager node. Once deployed, it runs on the nodes you have configured for it. Start thinking in terms of a cluster.
@@tenekevi did you change the volume in the docker-compose file for Portainer ? (/mnt/cephfs/portainer_data/)
Can you show how you did your Portainer installation in this setup?
So, how do you actually manage storage for a container with that setup? Are you using all bind mounts or using volumes?
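To the Portainer and bind-mount questions above, one common pattern (not necessarily the exact steps from the video) is to deploy Portainer as a stack with its /data directory bind-mounted under /mnt/cephfs and pinned to a manager. The /mnt/cephfs/portainer_data path echoes the comment above; the replica count and constraint are illustrative, and Portainer's official Swarm deployment additionally runs an agent service on every node:

```bash
# Sketch of a Portainer stack keeping its data on the shared CephFS mount.
sudo mkdir -p /mnt/cephfs/portainer_data

cat > portainer-stack.yml <<'EOF'
version: "3.8"
services:
  portainer:
    image: portainer/portainer-ce:latest
    ports:
      - "9443:9443"
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock   # works because the replica is pinned to a manager
      - /mnt/cephfs/portainer_data:/data            # bind mount on shared storage instead of a local volume
    deploy:
      replicas: 1
      placement:
        constraints:
          - node.role == manager
EOF

docker stack deploy -c portainer-stack.yml portainer
```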
Hi, great video and a nice idea with keepalived. I'm looking to improve my cloud hosting and came very close to using Docker Swarm a few months back, but got stuck on the distributed filesystem. So, MicroCeph looks to fix that hurdle. It would have been good to show an actual deployment of an app, but this is really useful, many thanks :-)
@PaulLittlefield awesome! So glad it was helpful!
Great video, i was wondering if i can setup a (bridged) firewalls (OPNsense of Pfsense) for high availability with this method
I think the problem you will run into is that each firewall won't have a record of the live sessions from the 'now' failed node, so you are most definitely going to see an outage. Both pfSense and OPNsense have their own failover mechanisms that are more suitable for this.
@ thnx a lot for the reply. I will search if I can find something about it👍
How is this different from a Proxmox cluster? After all, the whole point of hosting anything is high availability... so physically, if the power goes down, all nodes go with it unless you have a UPS, but then there's no internet if the power goes down, etc. So this is all fine and great if you have a 3-phase connection and set up your physical machines in such a way that they are powered by L1, L2, L3... but even then, in some power setups the phases are in sync, and if one goes down they all go down...
Yeah... nice vid... but I'll stick to Proxmox and my 3 MS-01s.
Sounds like a good solution for Docker in a Proxmox cluster. If you put a VM with Docker inside on each node and don't check "high availability" for these VMs, but instead use Docker Swarm to cluster the Docker hosts inside together as a swarm, then in case of a node failure, the containers of that node should migrate to the other nodes with load balancing.
I have my swarm node vms flagged for HA as well - and I move them around between Proxmox hosts as I need to.
I will probably avoid putting the MicroCEPH storage on host storage that is being replicated with Proxmox CEPH - that just sounds like asking for issues ;)
@@pschulte What would be the value of having HA and Swarm at the same time? Secondly, what are the specs of your environment (CPU, memory, and storage)? Finally, how did you set up your HA environment in Proxmox? (I have the same ZFS storage name and amount on each of my nodes.)
@@jenswymeersch9653
Using HA in Proxmox really covers availability of the VMs (Swarm nodes); the swarm cluster handles availability/restarts of the container services running there.
They run at different layers of the infrastructure.
Note: By choice, I don't run any containers directly on Proxmox - I decided to use Swarm for Microservices, and have played with Kubernetes at times as well. I just find Swarm to be a very good fit for my homelab.
When I want to update or perform maintenance on a Proxmox physical PC, I just "live migrate" all the VMs to another Proxmox node, and none of the VMs or container services are negatively affected.
3 Proxmox nodes:
Intel i5-12500 cpu (6 p-cores, no E-cores), 64GB RAM
Storage: each node has 2TB of NVME configured into a single CEPH-pool, but I also use NFS mounts for shared VM storage
HA in Proxmox:
Not completely sure what you are asking?
I keep things pretty simple in the Proxmox cluster HA configuration area.
I set up one group for each host, which I use to "soft pin" a VM to that node. If everything is running, the VMs go to the node they are "pinned" to, but for failure or maintenance, they will restart or live-migrate to another node until their preferred/pinned node comes back online.
HA works great as long as storage and networks are all well defined and fully functional across all nodes.
Hope this helps.
Great video!
But isn't having an orchestrator (Swarm in your case) overkill for a home lab?
In most cases, you will be the only consumer of the services you host there.
@igorshubovych you are right! It is probably overkill for most in their home labs. However, like most things in the home lab, we all do things that are overkill to learn and just have fun. However, on a more serious note, I actually do benefit from what orchestration gives me in my lab. It now allows me to drain a node for updates, patches, or other maintenance without disrupting containerized services, like Pi-hole, which runs DNS for my non-production LAN that my family will complain about if it's down, haha.
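For reference, the drain workflow mentioned here is just two commands; the node name below is a placeholder:

```bash
# Drain a node before patching; its tasks reschedule onto the remaining nodes.
# "swarm-node-02" is just an example name; use `docker node ls` to see yours.
docker node update --availability drain swarm-node-02
# ...patch/reboot the node, then bring it back into scheduling...
docker node update --availability active swarm-node-02
```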
Curious that you didn't read the line re: "To add a manager node to this swarm..." and instead chose to join nodes and then promote them. What do you think is the difference? Even if the results "appear" to be the same.
@JamesGreen-gv4yn I believe it to be the same. The reason I go this route is to show how to join with workers, and then, if you want more managers, this helps show that progression.
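The two paths end up in the same place; a quick sketch of both (the node name is an example):

```bash
# Run on an existing manager. Path shown in the video: join as a worker, promote later.
docker swarm join-token worker        # prints the `docker swarm join ...` command for workers
docker node promote swarm-node-02     # later, promote the joined worker

# Equivalent end state: print a join command that adds the node as a manager directly.
docker swarm join-token manager
```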
I have three Debian servers without snap; how do I install MicroCeph on them? Is it also possible to run MicroCeph inside a container?
How well does Ceph handle file locking? Every time I think I have highly available storage "figured out", I try a container that needs SQLite and it won't work for whatever reason. It seems like a lot of containers are built assuming they'll have direct access to fast storage. That makes it very hard to do HA anything. I've tried SMB, NFS v3, v4. Now using GlusterFS and it's better, but some apps still don't like it. Jellyfin, for example: the database wants fast local storage and the app crashes after a few minutes with errors about the database.
Can you put your storage on an external TrueNAS ZFS server and share it from there?
SMB, NFS v3, v4... Yeah you'll run into file locks with those while they are in use or being viewed by a session.
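A rough way to smoke-test whether a shared mount coordinates locks across nodes is to run flock from two different nodes. Note that SQLite actually uses POSIX byte-range locks (fcntl), so this is only indicative, and the file path is just an example on the shared mount:

```bash
# On node 1: hold an advisory lock on a file that lives on the shared mount.
flock /mnt/cephfs/locktest -c 'echo "holding lock"; sleep 60' &

# On node 2: a non-blocking attempt should fail while node 1 holds the lock,
# if locking is actually coordinated cluster-wide.
flock -n /mnt/cephfs/locktest -c 'echo "got the lock"' \
  || echo "lock is held elsewhere - cross-node locking works"
```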
I've tried to make this work for the last couple of days, but I'm getting stuck on the implementation of Portainer. Did anyone get this to work?
Hey Brandon, what is the app you are using in this video with the SSH sessions?
This is Remote Desktop Connection Manager from Devolutions
@@VirtualizationHowto Thank you so much for the info, I'm looking forward to to trying it out!
When you say best, why isn't Dockge included? I moved on from Portainer a while back and don't regret it.
Hi, thanks for this video. One question: is this a possible way to get a kind of "HA" across datacenters? (Yes, latency and speed might be a problem, but is it possible?) Do I need to open all the ports for Ceph and Swarm? Thanks!
Should be fine across a private WAN, but you might need to edit some configurations that set the dead timers, depending on how much latency exists between the nodes.
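One knob worth knowing about for higher-latency links is the Swarm dispatcher heartbeat; the value below is only an example, and Ceph has its own mon/OSD timeouts that would need separate tuning:

```bash
# Default heartbeat is 5 seconds; relaxing it makes managers more tolerant of latency,
# at the cost of slower detection of genuinely dead nodes. Run on a manager node.
docker swarm update --dispatcher-heartbeat 15s
```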
Maybe a silly question, but what if you have your three Docker nodes running on Proxmox with Ceph? Can you use the Ceph pool in Proxmox in any way for Docker Swarm?
@matthewfiles6913, I believe you sure can with CephFS. I have created CephFS on top of Ceph in Proxmox and presented it to Windows, etc. So, for your Docker Swarm nodes, you would just present the manager IPs to the nodes just like I did in the vid, except it wouldn't be running locally, which wouldn't matter.
@@VirtualizationHowto Thank you, I'll give it a try. Great video!!
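For anyone trying this, a hypothetical sketch of mounting a CephFS served by a Proxmox Ceph cluster on each Swarm node; the monitor addresses, client name, and secret file are placeholders to replace with values from your own cluster:

```bash
# On each Docker Swarm node (not the Proxmox hosts). Placeholders throughout.
sudo apt install -y ceph-common
sudo mkdir -p /mnt/cephfs

# Monitor addresses come from the Proxmox Ceph cluster; the secret file holds the client key.
sudo mount -t ceph 192.168.1.21,192.168.1.22,192.168.1.23:/ /mnt/cephfs \
  -o name=swarmclient,secretfile=/etc/ceph/swarmclient.secret

# Add an equivalent /etc/fstab entry with the _netdev option so the mount survives reboots.
```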
Is it possible to do the same thing with NFS?
Yes
Yes NFS is a viable shared storage as well.
@@VirtualizationHowto Careful with NFS... Things that rely on a database (like the *arr stack and Jellyfin) will randomly get corrupted on network shares. They require true block storage.
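To illustrate the NFS option (keeping the database caveat above in mind), a sketch of an NFS-backed named volume in a stack file; the server address, export path, and service are placeholders:

```bash
cat > nginx-nfs.yml <<'EOF'
version: "3.8"
services:
  web:
    image: nginx:alpine
    ports:
      - "8080:80"
    volumes:
      - webdata:/usr/share/nginx/html
volumes:
  webdata:
    driver: local
    driver_opts:
      type: nfs
      o: "addr=192.168.1.50,rw,nfsvers=4"   # NFS server address is a placeholder
      device: ":/export/swarm/webdata"      # export path is a placeholder
EOF

# Each node that runs the task mounts the same NFS export for the named volume.
docker stack deploy -c nginx-nfs.yml web
```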
What is the name of the SSH terminal? It's great!
@joseph7jk, thank you for the comment, this is Remote Desktop Connection Manager...I love it...maybe a video coming soon on that front, stay tuned.
Thanks for the info, I'll catch around to see it! @@VirtualizationHowto
What about Plex and/or other hardware-encoding apps that need NVIDIA?
noob question: how did you find the IPs of the VMs?
ip a from the console
Has anybody managed to do this with an SSH key instead of a password, and does anyone know how to add a fresh new node to the cluster plus remove a node?
While this how to video is great, I think a moment spent on use cases might be in order first. This is for us viewers who don’t really care much about dabbling in home lab Docker just for the sake of nerding out on what’s possible, but what’s actually useful and why. For 99.99% of us, a cluster isn’t needed and it just adds a lot of useless complexity.
I don't totally agree. The more services you have running in containers, the more critical it becomes. Although I have set up HA on Proxmox, I noticed that my Docker environment went down without the full machine going down. Building a cluster on the Docker side should solve this problem.
@ You don’t agree that some discussion on use cases is in order? Because that’s all I was asking for. Why and when would you want to use clustering?
@@BrentLeVasseur Fully agree with the point that you need to look at the use cases, but that is applicable to everything you do.
In my case, I am running multiple Docker containers for z2m. If these go down, some of the lights in the house don't work anymore. In my view this is critical.
I currently have standalone Docker running in an LXC in HA under PVE, but I've seen that this is not fail-proof. Hence clustering.
That said, I am stuck with Portainer for the moment and can't move forward.
Shared storage is a requirement? My Kubernetes clusters do not have any shared storage. So is this just a requirement for Docker Swarm? Can it be configured without using shared storage and if so, what limitations would it impose?
I believe it's not strictly required, but you would need a volume driver for Docker that could mount the data on the fly, like a CSI driver in Kubernetes. What's done here is essentially a "hostPath"-style bind mount, which is why the storage needs to be available on all nodes at the same path.
@JamesGreen-gv4yn shared storage is a requirement if you have/need persistent data (stateful apps); think databases as an example. If one Swarm or Kubernetes node fails, you would want that data to be available to the other nodes to re-spin that service and still have access to your data. But you are right: if you are running all stateless apps, there is no need for shared storage.
I expected more angry comments towards Docker Swarm.
Honestly, I don't really get why you'd want to use Swarm; it feels like it's on life support and was effectively deprecated until Mirantis bought Docker. Almost everyone who wants high availability in their homelab gains so much more knowledge from running k8s that they can apply at their job or in job interviews. And everyone else should just use the HA feature of their hypervisor to fail over or live-migrate the Docker VM in case of failure or maintenance.
@LampJustin I really appreciate your feedback. Honestly, I think Swarm is alive and well. For me, it is about using the right tool for the right job. Is Swarm as good as Kubernetes for some use cases? No. But is Kubernetes always the answer for highly available containers? No, I don't think it is on that front either. I really like that Swarm is a middle ground between a standalone Docker host and full-blown Kubernetes.
sudo snap install microceph was the moment I stopped watching.
Why ?
Lost me at keepalived. The vast majority of people would want to run this in a test/production environment with a gateway and internal static IPs in a load-balanced scenario instead of failover.
@ericneo2 yes, that would be true as well. Keepalived is a tool for certain use cases and it is simple. For more robust configurations, you would want to stick a load balancer in front.
keepalived -> load balancer -> everything else :)
First
You never show how you used /mnt/cephfs for persistent storage; you skipped over many important steps. Even the keepalived virtual IP: you never mentioned anything about how it is being used, so we could see the use case.
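On the keepalived question: a minimal sketch of how a floating virtual IP is typically configured, with one node as MASTER and the others as BACKUP at a lower priority so the IP fails over automatically; the interface name, router ID, password, and addresses are placeholders, not the video's exact values:

```bash
sudo apt install -y keepalived

# /etc/keepalived/keepalived.conf on the first node; the other nodes use
# state BACKUP and a lower priority (e.g. 100) so the VIP moves on failure.
sudo tee /etc/keepalived/keepalived.conf >/dev/null <<'EOF'
vrrp_instance VI_1 {
    state MASTER
    interface ens18            # placeholder NIC name
    virtual_router_id 51
    priority 150
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass changeme
    }
    virtual_ipaddress {
        192.168.1.100/24       # the floating IP clients use to reach the swarm
    }
}
EOF

sudo systemctl enable --now keepalived
```

Since Swarm's ingress routing mesh publishes a service's port on every node, pointing clients (or DNS) at this single floating IP means requests keep working even if the node currently holding the VIP goes down.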