path: root/playbooks/common/openshift-cluster/upgrades/upgrade_nodes.yml
Commit history (most recent first; each entry lists author, date, and diffstat):
* Label masters with node-role.kubernetes.io/master (Vadim Rutkovsky, 2018-01-24; 1 file, -10/+5)
  This PR also sets these labels and scheduling status during upgrades.
  Signed-off-by: Vadim Rutkovsky <vrutkovs@redhat.com>
* Merge pull request #5080 from sdodson/drain-timeouts (OpenShift Merge Robot, 2018-01-10; 1 file, -3/+9)
  Automatic merge from submit-queue.

  Add the ability to specify a timeout for node drain operations. A timeout to
  wait for nodes to drain pods can be specified to ensure that the upgrade
  continues even if nodes fail to drain pods in the allowed time. The default
  value of 0 will wait indefinitely, allowing the admin to investigate the root
  cause and ensuring that disruption budgets are respected. In practice the
  `oc adm drain` command will eventually error out, at least that's what we've
  seen in our large online clusters. When that happens a second attempt will be
  made to drain the nodes; if it fails again, the upgrade is aborted for that
  node or for the entire cluster, depending on your defined
  `openshift_upgrade_nodes_max_fail_percentage`.

  `openshift_upgrade_nodes_drain_timeout=0` is the default and will wait until
  all pods have been drained successfully.
  `openshift_upgrade_nodes_drain_timeout=600` would wait 600s before moving on
  to the tasks which forcefully stop pods, such as stopping docker, node, and
  openvswitch.

  * Add the ability to specify a timeout for node drain operations (Scott Dodson, 2018-01-10; 1 file, -3/+9)
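The two variables described in the merge above are typically set in the inventory or group variables; a minimal sketch (the variable names come from the merge description, the file location and values are illustrative):

```yaml
# group_vars/OSEv3.yml (sketch); values are examples, not recommendations
# Wait up to 600s for `oc adm drain` before moving on to the tasks that
# forcefully stop pods; 0 (the default) waits indefinitely.
openshift_upgrade_nodes_drain_timeout: 600
# Abort the whole upgrade if more than 10% of nodes fail to drain/upgrade.
openshift_upgrade_nodes_max_fail_percentage: 10
```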
* Merge pull request #6549 from mgugino-upstream-stage/node-meta-depends2 (OpenShift Merge Robot, 2018-01-08; 1 file, -4/+0)
  Automatic merge from submit-queue. Remove last of openshift_node role
  meta-depends: remove the last non-taskless meta-depends from the
  openshift_node role.

  * Remove last of openshift_node role meta-depends (Michael Gugino, 2018-01-02; 1 file, -4/+0)
    Remove last non-taskless meta-depends from openshift_node role. Remove
    variable 'openshift_node_upgrade_in_progress' as it is no longer used.
* Migrate to import_role for static role inclusion (Scott Dodson, 2018-01-05; 1 file, -3/+3)
  In Ansible 2.2, the include_role directive came into existence as a Tech
  Preview. It is still a Tech Preview through Ansible 2.4 (and in the current
  devel branch), but with a notable change: the default behavior switched from
  static: true to static: false because that functionality moved to the newly
  introduced import_role directive (in order to stay consistent with
  `include*` being dynamic in nature and `import*` being static in nature).

  The dynamic include is considerably more memory intensive, as it will
  dynamically create a role import for every host in the inventory list to be
  used. (Also worth noting, at the time of this writing there is an object
  allocation inefficiency in the dynamic include that can in certain
  situations amplify this effect considerably.)

  This change is meant to mitigate the pressure on memory for the Ansible
  control host. We need to evaluate where it makes sense to dynamically
  include roles and revert back to dynamic inclusion if and where it makes
  sense to do so.
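The static/dynamic distinction described above can be sketched as follows (the role name is an illustrative usage site, not a line taken from this playbook):

```yaml
# Static import: resolved at parse time; the tasks exist once regardless of
# how many hosts are in the play. This is what the commit migrates to.
- import_role:
    name: openshift_node   # illustrative role name

# Dynamic include: resolved at run time, creating a per-host role import and
# therefore considerably more memory use on the Ansible control host.
- include_role:
    name: openshift_node
```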
* Remove openshift.common.{is_atomic|is_containerized} (Michael Gugino, 2017-12-20; 1 file, -1/+1)
  We set these variables using facts in init; no need to duplicate the logic
  all around the codebase.

* Deprecate using Ansible tests as filters (Russell Teague, 2017-12-14; 1 file, -4/+4)
* Refactor node upgrade to include less serial tasks (Michael Gugino, 2017-12-12; 1 file, -11/+22)
  This commit moves the pulling of images and packages and the updating of
  config files into a non-serialized play. The serialized play is now in
  charge of marking unschedulable, draining, stopping and restarting services,
  and marking schedulable. If rpm install / container download takes 60s per
  host, this will save 3 hours and 10 minutes at 200 hosts per cluster and
  forks of 20 hosts.
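The split described in this commit can be sketched as two plays (the host group and task names are illustrative, not taken from the playbook):

```yaml
# Play 1: expensive but non-disruptive prep runs on all nodes in parallel.
- hosts: oo_nodes_to_upgrade   # illustrative group name
  serial: "100%"
  tasks:
    - name: Pre-pull images, install packages, update config files
      debug:
        msg: "non-disruptive prep, fully parallel"

# Play 2: only the disruptive steps run in small serialized batches.
- hosts: oo_nodes_to_upgrade
  serial: 1
  tasks:
    - name: Mark unschedulable, drain, restart services, mark schedulable
      debug:
        msg: "disruptive steps, one node (or batch) at a time"
```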
* Remove openshift.common.service_type (Michael Gugino, 2017-12-07; 1 file, -1/+0)
  This commit removes openshift.common.service_type in favor of
  openshift_service_type. It also removes r_openshift_excluder_service_type
  from plays in favor of using the role's defaults.

* Remove all uses of openshift.common.admin_binary (Scott Dodson, 2017-12-07; 1 file, -1/+1)
  Replace with `oc adm`.
* Correct usage of include_role (Russell Teague, 2017-11-27; 1 file, -1/+1)
  Switch to import_role for some required roles.

* Combine openshift_node and openshift_node_upgrade (Michael Gugino, 2017-11-16; 1 file, -8/+10)
  Having openshift_node and openshift_node_upgrade as two distinct roles
  created duplication across handlers, templates, and some tasks. This commit
  combines the roles to reduce that duplication and the bugs that arose when a
  change was not made in both places.
* Merge pull request #4778 from jkaurredhat/drain_upgrade-1.6 (Scott Dodson, 2017-07-18; 1 file, -1/+1)
  drain still pending in below files without fix:

  * drain still pending in below files without fix: (jkaurredhat, 2017-07-18; 1 file, -1/+1)
    playbooks/common/openshift-cluster/upgrades/docker/docker_upgrade.yml
    playbooks/common/openshift-cluster/upgrades/upgrade_nodes.yml
    Signed-off-by: jkaurredhat <jkaur@redhat.com>
* Add drain retries after 60 second delay (Scott Dodson, 2017-07-18; 1 file, -0/+4)
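A hedged sketch of what "retries after a 60 second delay" typically looks like as an Ansible task (the actual task in the playbook may differ; the node name variable is illustrative):

```yaml
- name: Drain node, retrying after a 60s delay on failure
  command: >
    oc adm drain {{ openshift.node.nodename }}
    --force --delete-local-data --ignore-daemonsets
  register: l_drain_result
  until: l_drain_result.rc == 0
  retries: 1
  delay: 60
```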
* Run dns on the node and use that for dnsmasq (Scott Dodson, 2017-06-30; 1 file, -1/+1)

* Add openshift_node_dnsmasq role to upgrade (Scott Dodson, 2017-06-18; 1 file, -0/+1)

* Tolerate failures in the node upgrade playbook (Scott Dodson, 2017-05-19; 1 file, -1/+1)

* Rework openshift_excluders role (Russell Teague, 2017-05-16; 1 file, -10/+3)
* Run excluders over selected set of hosts during control_plane/node upgrade (Jan Chaloupka, 2017-05-03; 1 file, -1/+7)
  Disable/reset excluders over requested hosts.

* Renaming oadm_manage_node to oc_adm_manage_node (Russell Teague, 2017-03-10; 1 file, -2/+2)

* Fixed issue where upgrade fails when using daemon sets (e.g. aggregated logging) (Andrew Baldi, 2017-02-15; 1 file, -1/+1)

* Modify playbooks to use oadm_manage_node module (Russell Teague, 2017-02-13; 1 file, -25/+20)

* Add excluder management to upgrade and config playbooks (Scott Dodson, 2017-02-06; 1 file, -0/+4)
* Run node upgrade if master is a node, as part of the control plane upgrade only (Jan Chaloupka, 2017-02-02; 1 file, -2/+2)
* Move current node upgrade tasks under openshift_node_upgrade role (Jan Chaloupka, 2017-02-01; 1 file, -80/+7)

* During node upgrade, upgrade openvswitch rpms (Scott Dodson, 2017-01-31; 1 file, -0/+15)
  Containerized upgrades of openvswitch are already handled by updating the
  container images and pulling them again.

* Correct usage of draining nodes (Russell Teague, 2017-01-26; 1 file, -1/+1)
* Merge pull request #2981 from dgoodwin/upgrade-wait-for-node (Jason DeTiberus, 2017-01-24; 1 file, -0/+13)
  Wait for nodes to be ready before proceeding with upgrade.

  * Wait for nodes to be ready before proceeding with upgrade. (Devan Goodwin, 2016-12-15; 1 file, -0/+13)
    Near the end of node upgrade, we now wait for the node to report Ready
    before marking it schedulable again. This should help eliminate delays
    when pods need to relocate as the next node in line is evacuated. Because
    this happens near the end of the process, the only remaining task is to
    mark the node schedulable again, so it is easy for admins to detect and
    recover from.
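The wait described above is commonly implemented as a polling loop on the node's Ready condition; a sketch under the assumption that `oc` is available and the node name variable is illustrative (the jsonpath filter is standard `oc get` syntax):

```yaml
- name: Wait for node to report Ready before marking it schedulable again
  command: >
    oc get node {{ openshift.node.nodename }}
    -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}'
  register: l_node_ready
  until: l_node_ready.stdout == "True"
  retries: 30
  delay: 5
```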
* Add a fact to select --evacuate or --drain based on your OCP version (Tim Bielawa, 2017-01-11; 1 file, -1/+1)
  Closes #3070

* Deprecate node 'evacuation' with 'drain' (Tim Bielawa, 2016-12-16; 1 file, -3/+3)
  * https://trello.com/c/TeaEB9fX/307-3-deprecate-node-evacuation

* YAML Linting (Russell Teague, 2016-12-12; 1 file, -10/+8)
  * Added checks to make ci for yaml linting
  * Modified y(a)ml files to pass lint checks
* Cleanup ovs file and restart docker on every upgrade. (Devan Goodwin, 2016-11-30; 1 file, -1/+26)
  In 3.3 one of our services lays down a systemd drop-in for configuring
  Docker networking to use lbr0. In 3.4, this has been changed but the file
  must be cleaned up manually by us. However, after removing the file docker
  requires a restart. This had big implications particularly in containerized
  environments where upgrade is a very fragile series of upgrading and service
  restarts.

  To avoid double docker restarts, and thus double service restarts in
  containerized environments, this change does the following:
  - Skip restart during docker upgrade, if it is required. We will restart on
    our own later.
  - Skip containerized service restarts when we upgrade the services
    themselves.
  - Clean shutdown of all containerized services.
  - Restart Docker. (always; previously this only happened if it needed an
    upgrade)
  - Ensure all containerized services are restarted.
  - Restart rpm node services. (always)
  - Mark node schedulable again.

  At the end of this process, docker0 should be back on the system.
* Reference master binaries when delegating from node hosts which may be containerized (Andrew Butcher, 2016-11-22; 1 file, -4/+4)

* Revert "Revert openshift.node.nodename changes" (Scott Dodson, 2016-11-08; 1 file, -4/+6)

* Revert "Fix OpenStack cloud provider" (Scott Dodson, 2016-11-07; 1 file, -6/+4)
  This reverts commit 1f2276fff1e41c1d9440ee8b589042ee249b95d7.

* Switch from "oadm" to "oc adm" and fix bug in binary sync (Devan Goodwin, 2016-10-19; 1 file, -3/+3)
  Found a bug syncing binaries to containerized hosts: if a symlink was
  pre-existing but pointing to the wrong destination, it would not be
  corrected. Switched to using oc adm instead of oadm.
* Allow a couple retries when unscheduling/rescheduling nodes in upgrade (Devan Goodwin, 2016-09-29; 1 file, -0/+12)
  This can fail with a transient "object has been modified" error asking you
  to re-try your changes on the latest version of the object. Allow up to
  three retries to see if we can get the change to take effect.
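The "object has been modified" error is an ordinary optimistic-concurrency conflict, so simply repeating the same call usually succeeds. A sketch of the retry pattern under the assumption that the era-appropriate `oadm manage-node` command is used (the node name variable is illustrative):

```yaml
- name: Mark node unschedulable, retrying transient conflict errors
  command: oadm manage-node {{ openshift.node.nodename }} --schedulable=false
  register: l_sched_result
  until: l_sched_result.rc == 0
  retries: 3
  delay: 1
```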
* Skip the docker role in early upgrade stages (Devan Goodwin, 2016-09-29; 1 file, -2/+3)
  This improves the situation further and prevents configuration changes from
  accidentally triggering docker restarts before we've evacuated nodes. Now in
  two places we skip the role entirely, instead of the previous implementation
  which only skipped upgrading the installed version (which did not catch
  config issues).

* Allow filtering nodes to upgrade by label (Devan Goodwin, 2016-09-29; 1 file, -9/+9)
* Allow customizing node upgrade serial value (Devan Goodwin, 2016-09-29; 1 file, -1/+3)
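The label filter and serial value from the two commits above are exposed as inventory variables; a sketch with illustrative values (the variable names are believed to match the ones these commits introduce, but treat them as an assumption):

```yaml
# group_vars sketch (values are examples)
openshift_upgrade_nodes_serial: "20%"          # batch size for the serialized node plays
openshift_upgrade_nodes_label: "region=infra"  # only upgrade nodes matching this label
```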
* Split upgrade for control plane/nodes (Devan Goodwin, 2016-09-29; 1 file, -0/+60)