Thanks a lot for creating this helpful content.
Hey Magnus.. FYI, an SMS upgrade can create an outage. If you are using cert-based VPNs with the internal CA, the gateways check the CRL, which lives on the SMS. If the SMS is down and the gateways can't check the CRL, the VPNs drop.
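A rough way to sanity-check this before the window - a sketch only, assuming the default ICA CRL distribution point on TCP 18264 and a placeholder SMS address (check the real CDP URL inside your gateway certificate):
# run from a gateway while the SMS is still up; confirms the CRL is fetchable
curl_cli -v http://192.0.2.10:18264/ICA_CRL1.crl -o /dev/null
If that URL stops answering during the upgrade, that's the point where cert-based tunnels are at risk as described above.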
Generally I do clean installs of GW & Mgmt whenever possible. I come from an outsourcing environment, and depending on who had 'control' over the system(s) before I got there, you can never *assume* they followed good practice or didn't modify files they shouldn't have (swapping drivers, editing something stupid from bad advice and never reverting it, sloppy troubleshooting calls that weren't cleaned up properly, etc.). Basically I treat it like sweeping the field for land mines so the *NEXT* person who needs to handle the system doesn't walk in and get blown up.
As for other steps, basically treat everything like it's a BCDR exercise. It's like insurance: it costs a little (time), but in the off chance it's needed, you have what you need.
- Verify all network cables/ports are working and note which ports they go to.
- Verify the hardware is good (check iDRAC, iLO, BMC, whatever).
- Verify all PSUs are working and have power.
- Verify there are no drive errors, RAID errors/degraded status, RAM/CPU errors, chassis errors, etc.
- Verify iLO/iDRAC/OOB connectivity to the system.
- Update iLO/iDRAC firmware to the current version.
- Verify TTY/serial access (if your site has it); useful in some remote-access scenarios for fixing things from the CLI.
- Verify that you can ping across *ALL* interfaces and sub-interfaces (VLANs) between members; see the ping-sweep sketch after this list. Note this assumes your policy allows it: generally I reserve the top ~5-10 rules in the policy for member-to-member and mgmt-to-gateway communication and have disabled all implied rules (i.e. I code everything into the policy so it's always visible and auditable).
- Verify that you can ping between mgmt & gateway.
- Check $FWDIR/boot/ha_boot.conf (I've seen many cases where people put crap in there from following suggestions for much older versions, or without knowing what they are doing). Clean it back to factory defaults if there is no reason to keep the changes.
- Pull/store environmental information (see the collection sketch after this list). This gives you everything you need *on hand* to handle various scenarios. Granted, this is mainly for gateways, as mgmt stations generally don't have multiple routed paths - though they could if, say, a mgmt straddles VRFs/zones as in a BCDR setup, or you're in the middle of merging/splitting companies, or something else complex.
- clish -c "show configuration" | sort > {membername}.{old|new}.txt
- ifconfig -a
- netstat -rnv | sort -k8
- netstat -rnv | sort -t. -k1,1n -k2,2n -k3,3n -k4,4n
- cat /etc/udev/rules.d/00-OS-XX.rules
- cat /etc/sysconfig/00-ANACONDA.rules
- cplic print
- cat $FWDIR/boot/modules/fwkern.conf (normally this file should not exist, but capture it in case it does)
- Get copies/screen prints of the cluster object pages (GUI). Mainly needed only if you are (or may end up) rebuilding the cluster object - normally NOT needed for a simple upgrade, but good to have if *crap* hits the wall (i.e., as I said, treat it as a BCDR).
- Check/get copies of /etc/ssh/templates/sshd_config.templ. NOTE: this is mainly for environments that lock down the 'default' SSH config settings from Check Point, which are pretty weak for a firewall - you can remove ciphers/KEX/host-key algos and similar to harden Check Point beyond just the policy; see the example lines right after this list.
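On that last point, the hardening lines in the template end up looking roughly like this - a sketch only; the algorithm lists are examples and need to match what your clients and the OpenSSH build Check Point ships actually support:
Ciphers aes256-gcm@openssh.com,aes128-gcm@openssh.com,aes256-ctr
KexAlgorithms curve25519-sha256,diffie-hellman-group16-sha512
MACs hmac-sha2-512-etm@openssh.com,hmac-sha2-256-etm@openssh.com
HostKeyAlgorithms ssh-ed25519,rsa-sha2-512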
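Back on the interface/VLAN ping check: a minimal sweep sketch, assuming you keep a plain-text list of the peer member's per-interface/per-VLAN addresses yourself (the file name and addresses are placeholders, not something Check Point provides):
# peer_ips.txt: one peer-member IP per interface/VLAN
while read ip; do
  ping -c 2 -W 1 "$ip" >/dev/null 2>&1 && echo "OK   $ip" || echo "FAIL $ip"
done < peer_ips.txt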
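And for the environmental-information pull, a quick wrapper that captures the commands above in one go (the output directory is just an example; run it in expert mode on each member before and again after the change):
OUT=/var/log/precheck-$(hostname)-$(date +%Y%m%d)
mkdir -p "$OUT"
clish -c "show configuration" | sort > "$OUT/clish-config.txt"
ifconfig -a > "$OUT/ifconfig.txt"
netstat -rnv | sort -k8 > "$OUT/routes.txt"
cplic print > "$OUT/licenses.txt"
cp /etc/udev/rules.d/00-OS-XX.rules "$OUT/" 2>/dev/null
cp /etc/sysconfig/00-ANACONDA.rules "$OUT/" 2>/dev/null
cp "$FWDIR/boot/modules/fwkern.conf" "$OUT/" 2>/dev/null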
Then, as you mentioned, do backups. For mgmt I always do a migrate export and keep it (in case I need to rebuild from scratch, i.e. *assume BCDR*), and take a snapshot (VMware) with the VM powered off if it is running inside a VM farm (a snapshot at rest is much safer than a running one, since Check Point has a LOT of database tables open).
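For reference, the export itself is roughly this - a sketch, since the tool depends on the release (older managements ship migrate under upgrade_tools, newer R81.x releases use migrate_server with a target version instead; the output path here is just an example):
# classic export on older management versions
cd $FWDIR/bin/upgrade_tools && ./migrate export /var/log/mgmt_export.tgz
On newer releases, check the upgrade guide for the exact migrate_server syntax for your target version.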
Lastly, I write up a simple text/Word document with bullet points listing each step, one at a time, of what I'm doing on each member and the mgmt station: what files, what I am touching, etc. - AND ALSO A SECTION ON HOW TO BACK OUT/REVERT TO WHERE I WAS. Never assume that things will go well; if you're doing this at 02:00 with lack of sleep and added stress (time/window pressure, or 10+ people yelling at you that their environment is having an issue - mostly for gateway upgrades), having it written down helps. Even though I have done literally *thousands* of upgrades and can do them blindfolded, the act of going through the above process helps focus your mind and takes only 30 minutes or so at most.
I would do the same if I were taking over an environment.
Clean installation, and make sure it's up to your own standard.
There have been many tricks over the years to "fix" things, and it can be done
multiple ways, either by changing the design or by tweaking some files,
such as proxy ARP or by doing link networks and routing the traffic.
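For anyone reading along, the proxy-ARP variant of that trick is roughly one Gaia clish line like this (the addresses and interface are placeholders, and the exact syntax should be double-checked for your version; editing local.arp was the older way to do the same thing):
add arp proxy ipv4-address 198.51.100.10 interface eth1 real-ipv4-address 10.0.0.10
save config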
You have a very similar list to how I do production upgrades.
However, I wasn't aware of the cat /etc/sysconfig/00-ANACONDA.rules command before :)
Having a step-by-step guide for upgrades is key.
Also remember to have screenshots for the different steps, so it's possible to see how things should look when they're correct.
Not sure how many times I have been sitting in an upgrade saying "is it supposed to look like this now?"
This is really important when the time starts ticking over to 02:00-03:00.
Lately, when doing VSX upgrades with some specific jumbos, it takes 45 minutes just to get the VSs started,
so there is no real time for any issues.
In our company we are always two people when doing major upgrades,
just in case something happens, and to actually follow the step-by-step list -
not only for the upgrade but also for internal systems that need to be changed and updated (such as monitoring).
We actually make sure to do a screen recording of the upgrades as well,
because if something goes bad, it's so much easier to write an incident report and check whether we made any mistakes.
Then you are able to put down accurate times and screenshots, which are hard to think about while troubleshooting.
Not sure if I have talked about it before, but we use NetBox extensively.
It's a great way to keep track of changes done on boxes, by adding journal entries and tagging objects for "special things".
Super handy when doing upgrades or changes later, since it's easy to go back and see if it has happened on this node before.
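To give an idea, adding a journal entry programmatically is basically one API call against NetBox's extras/journal-entries endpoint - a sketch; the URL, token, and device ID are placeholders:
curl -s -X POST https://netbox.example.com/api/extras/journal-entries/ \
  -H "Authorization: Token $NETBOX_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"assigned_object_type": "dcim.device", "assigned_object_id": 123, "kind": "info", "comments": "Upgraded to R81.20 during change window"}'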
@@MagnusHolmberg-NetSec Yeah, the ANACONDA & OS-XX udev rules files are important, as that's how RedHat/Check Point record which interface gets which assignment and keep it. So if you change hardware (or load from scratch on the same hardware), the way the 'new' kernel first sees/polls the interfaces and MAC addresses is how the interfaces get assigned on open servers. You could 'not touch anything physical', do a fresh install, and have the interfaces assigned differently so nothing works. You can fix that manually by editing those files. This also happens if, say, you deployed a system and then added network cards over time: the 'newer' cards may have their kernel module load before the others and show up with a different assignment. Not too bad if you're expecting it, but if not, it can be a long night. It's also why I try to get TTY/serial and iLO/iDRAC to every system, so I can fix errors like that when deploying remotely (or at a DR site).
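For reference, the entries in those files are basically MAC-to-name pins, something like the classic persistent-net rule below (the MAC and name are placeholders, and the exact attributes Check Point writes may differ a bit):
SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", ATTR{address}=="00:11:22:33:44:55", NAME="eth0"
So fixing a shuffled assignment is usually just correcting the MAC/NAME pairs and rebooting.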
I haven't looked into NetBox myself; I'll have to give it a gander. Over the years/decades I mainly just came up with scripts to pull information and create 'run books' or 'BCDR' books. But as you can see, that's not 'automated' and can get out of date. Likewise, I would *never* trust that information in a BCDR context (i.e. always pull it yourself) - probably just my long-beard attitude of self-preservation. :)
Thanks for this video ❤
You're welcome :)
Sorry, but many of the actions you did, like the manual download etc., aren't needed. You just have the error "Failed to get update from Checkpoint..." in CPUSE. You can actually upgrade more simply.
However, it is good that you paid attention to the documentation and limitations.
@@alexeyrepin3192 The error is due to a missing contract on the license. That's the reason CPUSE etc. was not updated.
I removed the explanation of this from the video; it's added in the next video, during the gateway update.
The reason the contract was missing is that I make videos on the same box about how to manually update CPUSE, the web console, upgrade tools and more (more or less simulating what it's like without internet).