Differences

This shows you the differences between two versions of the page.

--- tutorial:adm:server_os_updates [2019/12/16 15:17]
fiserp [Performing the OS update]
+++ tutorial:adm:server_os_updates [2019/12/17 07:44]
fiserp [Performing the OS update]
@@ Line 10: / Line 10: @@
   * Impact on users
     * IdM is often deployed as a self-service portal for users. You should plan the downtime so that as few users as possible are affected.
-    * Users may make changes in the IdM that start some long running tasks (e.g. automatic roles changes). Those tasks are executed asynchronously and may be running even if the user who started the task has already logged off.
+    * Users may make changes in the IdM that start some long running tasks (e.g. automatic roles changes, bulk role assignments, etc.). Those tasks are executed asynchronously and may be running even if the user who started the task has already logged off.
-  * Impact on IdM batch jobs (long running tasks - LRT)
+  * Impact on long running tasks (LRT)
     * IdM has an internal cron that schedules LRT jobs. To be safe, no job should be running while you perform the update. The safest way to achieve this is to stop the IdM service before applying updates.
     * LRTs usually run at night, so it is not strictly necessary to stop the IdM, but you must make sure you have enough time to perform the patching (and a possible rollback) before the jobs start to execute.
@@ Line 39: / Line 39: @@
     - Prepare backup and restore procedures.
     - Identify which LRTs can be safely killed when running.
-    - Make a checklist with timing information to determine the length of the maintenance.
+    - Make a checklist with timing estimates to determine the length of the maintenance.
   - Perform the update
     - Begin the maintenance.
@@ Line 49: / Line 49: @@
     - Stop the IdM.
     - Disable automatic start of the IdM on OS start.
-    - (If you use cold snapshots, turn of the machine and make one.)
+    - (If you use cold snapshots, turn off the machine and make one.)
     - (If you do not use snapshots, make a backup of the IdM database and store it off-machine.)
     - Make a backup of ``/boot`` and ``/etc``, and save the list of processes (``ps -ef``) and the list of network services (``netstat -tulnp`` or ``ss -tulnp``). These dumps will help you check whether all the services started. You can also recover some settings from the backups if something minor goes wrong, so you will not need to roll back the whole snapshot.
     - Perform the update (e.g. ``yum update``).
-    - Reboot the affected services or the whole machine if necessary.
+    - Restart affected services or reboot the whole machine if necessary.
     - When the machine is up, check ``dmesg`` and ``/var/log/{messages,syslog}`` or analogous files for your OS.
     - Check the running processes and network services to confirm that everything started properly.
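
The check of network services against the dump saved before the update can be sketched as follows. This is a minimal illustration, not part of the documented procedure: it assumes the usual ``ss -tulnp`` column layout (protocol in the first column, local address in the fifth), and the sample dumps, the function names and the port 8080 service are made up for the example.

```python
def listening_sockets(ss_output):
    """Extract (protocol, local address) pairs from `ss -tulnp` text.

    Assumes the common column order:
    Netid State Recv-Q Send-Q Local-Address:Port Peer-Address:Port Process
    """
    sockets = set()
    for line in ss_output.splitlines():
        fields = line.split()
        # Skip blank or short lines and the header row (starts with "Netid").
        if len(fields) < 5 or fields[0] == "Netid":
            continue
        sockets.add((fields[0], fields[4]))
    return sockets


def missing_after_update(before, after):
    """Return sockets that were listening before the update but not after."""
    return sorted(listening_sockets(before) - listening_sockets(after))


# Hypothetical dumps, invented for illustration only.
BEFORE = """\
Netid State  Recv-Q Send-Q Local Address:Port Peer Address:Port Process
tcp   LISTEN 0      128    0.0.0.0:22         0.0.0.0:*         users:(("sshd",pid=900,fd=3))
tcp   LISTEN 0      100    127.0.0.1:8080     0.0.0.0:*         users:(("java",pid=1200,fd=50))
udp   UNCONN 0      0      127.0.0.1:323      0.0.0.0:*         users:(("chronyd",pid=800,fd=5))
"""

AFTER = """\
Netid State  Recv-Q Send-Q Local Address:Port Peer Address:Port Process
tcp   LISTEN 0      128    0.0.0.0:22         0.0.0.0:*         users:(("sshd",pid=950,fd=3))
udp   UNCONN 0      0      127.0.0.1:323      0.0.0.0:*         users:(("chronyd",pid=810,fd=5))
"""

for proto, address in missing_after_update(BEFORE, AFTER):
    print(f"not listening after update: {proto} {address}")
```

Process IDs change across a reboot, so the sketch compares only the protocol and the local address. A socket reported as missing is a hint to look at the corresponding service, not proof of a failure.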
@@ Line 73: / Line 73: @@
 ==== Solving issues ====
-During maintenance actions, it is necessary to know:
+For maintenance actions, it is necessary to:
-  * How long each task will take **TT**.
+  * Know how long each task will take, and measure the actual duration of each task as you perform it.
-  * How long the whole maintenance will take **MT**.
+    * If tasks take longer than expected, you will know whether you can still fit into the maintenance window.
-    * This is not simply a sum od **TT**s, you should add some extra time **ET** to have a proper cushion.
+  * Know how long the whole maintenance will take (maintenance time **MT**).
-  * How long (at worst) the whole rollback will take **RT**.
+    * This is not simply a sum of task times; you should add some extra time (**ET**) to have a proper cushion.
-  * To have a maintenance window that spans at least **MT**+**RT** with some extra time **ET**.
+  * Know how long (at worst) the whole rollback will take (rollback time **RT**).
-    * You are not able to safely perform the maintenance in shorter window - if something goes wrong, you need at most **RT** time to perform the rollback!
+  * Have a maintenance window that spans at least **MT**+**RT** with some extra time **ET**.
-    * If you do not have any **ET**, if anything goes wrong you have to perform rollback procedure. Therefore, **ET** gives you some time you can spend on solving the issue so you can continue with updates.
+    * You cannot safely perform the maintenance in a shorter window because there is simply not enough time. If something goes wrong, you need up to **RT** to perform the rollback!
+    * If you do not have any **ET** and anything goes wrong, you have to perform the rollback procedure. **ET** gives you some time to spend on solving the issue so that you can carry on with the updates.
+You should have a rollback procedure that can safely restore the deployment. This depends on your environment.
+Fortunately, in most cases it simply means restoring the snapshot of the virtual machine. After restoring the snapshot, you have to perform tests (with test use-cases) to confirm the rollback was performed correctly.
+Minor issues can generally be resolved with the help of the ``/boot`` and ``/etc`` backups you created before updating the OS.
+If the IdM installation itself is affected, you can debug the configuration or restore it from a periodic backup. Since IdM is not installed from OS packages, this almost never happens.
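
The timing rules above can be turned into a small calculation. The sketch below is one possible reading of them, not a prescribed method: it treats **MT** as the sum of task times plus **ET**, and the minimum window as **MT**+**RT**. The function names and all durations are invented for illustration.

```python
from datetime import timedelta


def maintenance_time(task_times, extra_time):
    """MT: the sum of the individual task times plus the cushion ET."""
    return sum(task_times, timedelta()) + extra_time


def minimum_window(task_times, extra_time, rollback_time):
    """Shortest safe maintenance window: MT + RT."""
    return maintenance_time(task_times, extra_time) + rollback_time


def rollback_still_fits(elapsed, window, rollback_time):
    """True if a worst-case rollback started now would end inside the window."""
    return elapsed + rollback_time <= window


# Hypothetical estimates taken from a maintenance checklist.
tasks = [
    timedelta(minutes=10),  # stop the IdM, disable autostart
    timedelta(minutes=15),  # snapshot or database backup
    timedelta(minutes=30),  # OS update and reboot
    timedelta(minutes=20),  # post-update checks
]
extra = timedelta(minutes=30)     # ET
rollback = timedelta(minutes=45)  # RT, worst case

window = minimum_window(tasks, extra, rollback)
print("MT:", maintenance_time(tasks, extra))
print("minimum window:", window)
print("rollback fits at 100 min:",
      rollback_still_fits(timedelta(minutes=100), window, rollback))
```

Measuring the real duration of each task during the maintenance and feeding the elapsed time into ``rollback_still_fits`` shows when the point of no return is approaching.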