Best Practice for OS patching

Hi

We have a Private Cloud install of Edge, with all components running on separate nodes.

We need to patch our OS each month (Red Hat 7).

What is the best practice for patching and rebooting the servers? We are trying to make this something that can be done out of hours and scripted.

Do we need to take the servers down and bring them up in a certain order?

1 ACCEPTED SOLUTION

Hi @Paul Stilgoe, assuming you are running a highly available topology, it won't matter what order you do the OS patching in. You maintain full service availability by doing the rollout progressively (a scripted sketch of this loop follows the list):

  1. Make one instance unavailable.
  2. Wait until it has quiesced.
  3. Apply the OS update and reboot.
  4. Confirm the instance is operational after the reboot.
  5. Mark the instance as available and observe that it is serving traffic with no unexpected errors.
  6. Repeat steps 1-5 for the remaining components.
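
Since you are looking to script this out of hours, here is a minimal sketch of what steps 1-5 could look like for a single node. It assumes passwordless SSH from the machine running the script and components managed with the standard apigee-service wrapper; the drain/undrain steps and the wait_for_ssh helper are placeholders you would replace with whatever fits your load balancer and tooling.

```python
#!/usr/bin/env python3
"""Rolling OS patch for a single node: a minimal sketch, not a drop-in script.

Assumptions: passwordless SSH from the host running this script, components
managed with the standard apigee-service wrapper, and a site-specific way to
drain/undrain traffic (left as comments below).
"""
import subprocess
import time

APIGEE_SERVICE = "/opt/apigee/apigee-service/bin/apigee-service"

def ssh(host, command, check=True):
    """Run a command on the remote host over SSH."""
    return subprocess.run(["ssh", host, command], check=check,
                          capture_output=True, text=True)

def wait_for_ssh(host, timeout=900):
    """Placeholder helper: poll until the host answers SSH again after reboot."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        probe = subprocess.run(["ssh", "-o", "ConnectTimeout=10", host, "true"])
        if probe.returncode == 0:
            return
        time.sleep(15)
    raise TimeoutError(f"{host} did not come back within {timeout}s")

def patch_node(host, component):
    # 1. make the instance unavailable (add your own drain step here, e.g.
    #    pulling the node out of the load balancer, before stopping it)
    ssh(host, f"{APIGEE_SERVICE} {component} stop")

    # 2./3. apply the OS update and reboot (RHEL 7, so yum)
    ssh(host, "sudo yum -y update")
    ssh(host, "sudo reboot", check=False)   # the SSH session drops here

    # 4. confirm the instance is operational after the reboot
    wait_for_ssh(host)
    ssh(host, f"{APIGEE_SERVICE} {component} start")
    ssh(host, f"{APIGEE_SERVICE} {component} status")

    # 5. mark it available again (undrain) and watch traffic and error rates
    #    before moving on to the next node
```

Looping patch_node over one node at a time, and only moving on once the previous node is confirmed healthy, gives you the progressive rollout described above.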

I also wanted to offer an alternative approach, particularly as you are talking about scripting, commonly referred to as the pets vs. cattle model. Instead of the above, where each instance is upgraded in place (the pets model), you could implement something along the lines of the so-called cattle model.

In this model, you don't worry about maintaining the currently running instances. You create machine images for each type of instance and then, when needed, you quickly instantiate a new instance using the image.

In the above example, you would (a sketch of this replacement loop follows the list):

  1. Start with a machine image that is fully patched to the latest OS patch level.
  2. Create a new instance based on that image and connect it into whatever relevant group it needs to be in (e.g. new Message Processor, new Router, new Cassandra node, etc.).
  3. You now have one more instance than you need; monitor it closely as it starts to take traffic to make sure it is working as expected.
  4. If there are any issues, remove the instance you just added, work out what caused the issue, and fix it before repeating. If there are no issues, remove and terminate one of the old instances.
  5. Repeat steps 2-4, progressively adding new and removing old instances, until all instances are running on the patched OS.
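
To make that loop concrete, here is a rough sketch of the replacement cycle in steps 2-5. Every function marked site-specific is a hypothetical stub, because creating, joining and destroying instances depends entirely on your private cloud tooling (and, for Cassandra or ZooKeeper nodes, on following the documented add/remove node procedures).

```python
"""Cattle-style replacement loop: a sketch only. Every function marked
'site-specific' is a hypothetical stub; how you create, join and destroy
instances depends on your private cloud tooling."""
import time

def create_instance_from_image(image_id, role):
    """Site-specific: boot a VM from the fully patched image and join it to
    the right group (Message Processor, Router, Cassandra ring, ...)."""
    raise NotImplementedError

def is_healthy(instance):
    """Site-specific: component status plus traffic and error-rate checks."""
    raise NotImplementedError

def terminate(instance):
    """Site-specific: drain and destroy an instance."""
    raise NotImplementedError

def replace_all(old_instances, image_id, role, soak_seconds=600):
    """Replace every old instance with a freshly imaged one, one at a time."""
    for old in list(old_instances):
        new = create_instance_from_image(image_id, role)  # step 2: one extra instance
        time.sleep(soak_seconds)                          # step 3: let it take traffic
        if not is_healthy(new):                           # step 4: back out on problems
            terminate(new)
            raise RuntimeError(f"new {role} instance failed its health checks")
        terminate(old)                                    # step 4: retire an old instance
        old_instances.remove(old)                         # step 5: repeat until done
```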

The advantage of this model is that you can reuse the same approach to address other issues, for example underlying hardware failures or anything else that is impacting a single node.

If I've answered your question, please click the Accept link, or alternatively let us know how we can further help.

3 REPLIES

Is there any impact to the analytics/data (PostgreSQL/Cassandra/ZooKeeper) nodes using the cattle approach?

It should be fine. When you do the first one and bring it back in, it will resync with the other two and come back to its equal share. You can check the status of the ring to confirm what state it's in (a small check is sketched below) 🙂

You don't need to restore any data if it's just one node.
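
For checking the ring, nodetool status is the standard Cassandra command to confirm every node is back to UN (Up/Normal). A small sketch of scripting that check; the nodetool path below assumes the Apigee-bundled Cassandra install, so adjust it if yours differs:

```python
"""Quick check that the Cassandra ring is healthy after a node rejoins.
A sketch: the nodetool path assumes the Apigee-bundled Cassandra install;
adjust it if your nodetool lives elsewhere."""
import subprocess

NODETOOL = "/opt/apigee/apigee-cassandra/bin/nodetool"  # assumed path

def ring_is_healthy():
    out = subprocess.run([NODETOOL, "status"], check=True,
                         capture_output=True, text=True).stdout
    # Node lines in the `nodetool status` output start with a two-letter state
    # code; "UN" means Up/Normal, anything else (DN, UJ, UL, ...) needs a look.
    states = [line[:2] for line in out.splitlines()
              if line[:2] in ("UN", "DN", "UJ", "DJ", "UL", "DL", "UM", "DM")]
    return bool(states) and all(state == "UN" for state in states)

if __name__ == "__main__":
    print("ring healthy" if ring_is_healthy() else "ring NOT healthy yet")
```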