Differences

This shows you the differences between two versions of the page.

--- tutorial:adm:server_os_updates [2019/12/16 14:06]
fiserp [Things to consider]
+++ tutorial:adm:server_os_updates [2019/12/16 15:33]
fiserp [Solving issues]
@@ Line 15: / Line 15: @@
     * LRTs run usually at night so it is not entirely necessary to stop the IdM, but you have to make sure you have enough time to perform the patching (and possible rollback) before jobs start to execute.
     * Restarting IdM cancels any LRT that is running at that moment; the LRT **will not resume automatically** after IdM comes back up.
+    * Nightly LRTs usually read HR system data. This means there are dependencies between them (e.g. synchronize identities, then contracts and/or time slices, then run recompute on them and finally run HR processes which enable/disable identities based on freshly synchronized data). Depending on the deployment, those dependencies may be "hard", and it may be dangerous to skip some LRTs or run them in a different order.
   * Impact on end systems connected to IdM
     * There is no direct impact on other systems.
@@ Line 21: / Line 22: @@
       * Some end systems are connected via WinRM. The WinRM library uses Python and some of Python's libraries come from the OS packages. Upgrading those packages system-wide **possibly** has an impact on the way WinRM/Python works.
   * Impact on OS
-    * OS may seemingly not boot after the updates (boot or network issues, SSHd/RDP daemon issues). We recommend to have complete backup of ``/boot`` and ``/etc`` directories. Out-of-band access to a machine is a must.
+    * OS may seemingly not boot after the updates (boot or network issues, SSHd/RDP daemon issues). We recommend having a complete backup of the ``/boot`` and ``/etc`` directories. Out-of-band access to the machine is a must. In a virtualized environment, making a snapshot is the way to go.
-    * In our deployments, we use mainly RHEL/CentOS (sometimes Debian) and Windows OSes. If you deploy IdM accordingly (tutorials [[https://wiki.czechidm.com/doku.php?id=start&do=search&q=server+prep|here]] and [[https://wiki.czechidm.com/doku.php?id=start&do=search&q=idm+installation|here]], OS updates are generally painless.
+    * In our deployments, we use mainly RHEL/CentOS (sometimes Debian) and Windows OSes. If you deploy IdM accordingly (tutorials [[https://wiki.czechidm.com/doku.php?id=start&do=search&q=server+prep|here]] and [[https://wiki.czechidm.com/doku.php?id=start&do=search&q=idm+installation|here]]), OS updates are generally painless.
     * Packages from OS that IdM deployment uses
-      * Java (openjdk package referenced through ``/usr/lib/...`` and therefore through ``/etc/alternatives/...``). Java patchset may be updated, bud the version should stay the same (e.g. update ``1.8u27->1.8u90`` is OK, but update ``Java8->Java9`` is not).
+      * Java (java binary referenced through ``/usr/lib/...`` and therefore through ``/etc/alternatives/...``). Java patchset may be updated, but the version should stay the same (e.g. update ``1.8u27->1.8u90`` is OK, but update ``Java8->Java9`` is not).
       * PostgreSQL is generally installed from OS or PGDG repositories and is considered pretty stable. Updating the package when the PostgreSQL version stays the same is OK. Updating the PostgreSQL version (e.g. ``9.6->10``) should be OK, but we recommend at least making a backup of the IdM database (in case you have to roll back to the previous PostgreSQL version).
       * Apache HTTPD. Deployment should be stable and no special care is needed. We recommend having a backup of the vhost configuration.
     * Windows-based installations have all deployment components installed by hand and are therefore unlikely to be broken by OS updates. But this also means you have to update all deployment components manually.
+  * Finding bugs
+    * It is best to have at least two environments: test and production.
+    * Update the test environment first, then leave it running for at least one week. If no bugs are found by then, you can update the production environment. One week is the minimum safe time frame in which some bugs (e.g. memory leaks) can manifest.
+    * Define use-cases that are important for your deployment. Before and after the update, test whether those use-cases work.
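
The Java rule from the package list above (a patchset update is fine, a major version change is not) can be checked mechanically before and after the update. The following is a minimal sketch, not part of the original tutorial: the helper names ``java_major`` and ``same_java_major`` are hypothetical, and the version strings are assumed to follow the usual ``1.8.0_27`` (legacy) or ``9.0.1`` (newer) schemes.

```shell
#!/bin/sh
# Hypothetical helper: print the Java major version for a version string.
# Legacy scheme "1.8.0_27" yields 8; newer scheme "9.0.1" yields 9.
java_major() {
  v="$1"
  case "$v" in
    1.*) v="${v#1.}" ;;        # legacy "1.x" scheme: drop the leading "1."
  esac
  echo "${v%%[._-]*}"          # keep only the part before the first ".", "_" or "-"
}

# Hypothetical helper: succeed only if both version strings share a major version.
same_java_major() {
  [ "$(java_major "$1")" = "$(java_major "$2")" ]
}

# Example (assumption: "java -version" prints a quoted version string on stderr):
#   before=$(java -version 2>&1 | awk -F'"' '/version/ {print $2; exit}')
#   ... perform the update ...
#   after=$(java -version 2>&1 | awk -F'"' '/version/ {print $2; exit}')
#   same_java_major "$before" "$after" || echo "Java major version changed!"
```

With the example from the list, ``same_java_major 1.8.0_27 1.8.0_90`` succeeds, while ``same_java_major 1.8.0_90 9.0.1`` fails.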
+==== Performing the OS update ====
+  - Preparations
+    - Prepare testing use-cases.
+    - Prepare backup and restore procedures.
+    - Identify which LRTs can be safely killed when running.
+    - Make a checklist with timing information to determine the length of the maintenance.
+  - Perform the update
+    - Begin the maintenance.
+    - (If you use hot snapshots, make one.)
+    - Make sure no user or external application can access the IdM.
+    - Log into the IdM as administrator and check whether any LRTs are running.
+      - If none are running, continue.
+      - If some are running, either stop those LRTs or let them finish. This depends on your deployment.
+    - Stop the IdM.
+    - Disable automatic start of the IdM on OS start.
+    - (If you use cold snapshots, turn off the machine and make one.)
+    - (If you do not use snapshots, make a backup of the IdM database and store it off-machine.)
+    - Make a backup of ``/boot`` and ``/etc``, and dump the list of processes (``ps -ef``) and the list of network services (``netstat -tulnp`` or ``ss -tulnp``). Those dumps will help you check that all the services started. You can also recover some settings from the backups in case something goes wrong in a minor way, so you will not need to roll back the whole snapshot.
+    - Perform the update (e.g. ``yum update``).
+    - Restart the affected services, or reboot the whole machine if necessary.
+    - When the machine is up, check ``dmesg`` and ``/var/log/{messages,syslog}`` or analogous files for your OS.
+    - Check the running processes and network services to verify that everything started properly.
+      - In particular, PostgreSQL and HTTPd should be up and running. They are parts of the IdM deployment.
+    - If everything is OK, start the IdM service.
+    - Enable autostart of IdM service upon OS start.
+    - Check the IdM logs to verify that it started successfully.
+    - Log into the IdM and test connection to end systems (configuration form for the system, green button "Test connector").
+    - Check your testing use-cases.
+    - Allow users to access the IdM.
+    - End the maintenance.
+  - Wrap-up
+    - Update documentation if necessary.
+    - Perform maintenance analysis and update your procedures if necessary.
+    - Update your test cases if necessary.
+    - After about a week, check system logs to make sure all components work as expected.
+<note>For Windows OSes, the update process is roughly the same. For checking services, status of the system and system logs, use the Event Viewer and Server Manager.</note>
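
The "dump before, compare after" steps of the procedure above can be sketched as a pair of shell helpers. This is an illustrative sketch only: the function names ``capture_state`` and ``missing_listeners`` and the file layout are assumptions, not part of the tutorial. It records listeners with ``ss -tuln`` (without ``-p``), because process IDs change across a reboot and would make every line look different.

```shell
#!/bin/sh
# Hypothetical helper: store the process list and the listening sockets in DIR.
capture_state() {
  dir="$1"
  mkdir -p "$dir"
  ps -ef > "$dir/ps.txt"
  # -H suppresses the header; keep only protocol and local address:port
  ss -H -tuln | awk '{print $1, $5}' | sort -u > "$dir/listeners.txt"
}

# Hypothetical helper: print every line of BEFORE that is absent from AFTER,
# i.e. listeners that existed before the update but did not come back.
missing_listeners() {
  before="$1"
  after="$2"
  # -F fixed strings, -x whole-line match, -v invert, -f patterns from file
  grep -Fxv -f "$after" "$before" || true
}

# Example usage (paths are placeholders):
#   capture_state /root/os-update/before     # before "yum update"
#   ... update and reboot ...
#   capture_state /root/os-update/after
#   missing_listeners /root/os-update/before/listeners.txt \
#                     /root/os-update/after/listeners.txt
```

An empty result means every listener seen before the update is present again; a line such as the PostgreSQL port appearing in the output points at a service that did not start.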
+==== Solving issues ====
+For maintenance actions, it is necessary to:
+  * Know how long each task will take, and measure the actual duration of each task while performing it.
+    * If tasks take longer than expected, you will know whether you can still fit into the maintenance window.
+  * Know how long the whole maintenance will take (maintenance time **MT**).
+    * This is not simply the sum of the task times; you should add some extra time (**ET**) to have a proper cushion.
+  * Know how long (at worst) the whole rollback will take (rollback time **RT**).
+  * Have a maintenance window that spans at least **MT**+**RT**, plus some extra time **ET**.
+    * You cannot safely perform the maintenance in a shorter window; there is simply not enough time. If something goes wrong, you need up to **RT** to perform the rollback!
+    * If you do not have any **ET**, you have to start the rollback procedure as soon as anything goes wrong. **ET** gives you some time to spend on solving the issue so you can carry on with the updates.
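
The window rule from the list above is plain arithmetic and can be sketched as follows. This is a minimal illustration under stated assumptions: all values are in minutes, **ET** is counted as a separate term on top of **MT** and **RT**, and the function names are hypothetical.

```shell
#!/bin/sh
# required_window MT RT ET: minimal window length, MT + RT + ET (minutes).
required_window() {
  echo $(( $1 + $2 + $3 ))
}

# window_is_sufficient MT RT ET WINDOW: succeed if WINDOW covers MT + RT + ET.
window_is_sufficient() {
  [ "$4" -ge "$(required_window "$1" "$2" "$3")" ]
}

# latest_rollback_start WINDOW RT: the last minute of the window at which a
# full rollback (taking at most RT) can still be started and finish in time.
latest_rollback_start() {
  echo $(( $1 - $2 ))
}

# Example: MT=90, RT=45, ET=30, window=180 minutes
#   required_window 90 45 30          # prints 165
#   window_is_sufficient 90 45 30 180 # succeeds
#   latest_rollback_start 180 45      # prints 135
```

In the example, the decision to roll back has to be made no later than minute 135 of the 180-minute window; after that point a worst-case rollback would overrun the window.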
+You should have a rollback procedure that can safely restore the deployment. This depends on your environment.
+Fortunately, in most cases it simply means restoring the snapshot of the virtual machine. After restoring the snapshot, you have to perform tests (with test use-cases) to confirm the rollback was performed correctly.
+Minor issues can generally be resolved with the help of the ``/boot`` and ``/etc`` backups you created before updating the OS.
+If the IdM installation itself is affected, you can debug the configuration or restore it from a periodic backup. Since IdM is not installed from OS packages, this almost never happens.