General overview
=================
 - etcd services (worth checking both ports)
    etcdctl3 --endpoints="192.168.213.1:2379" member list       - only lists members, it does not check their health
    oc get cs                                                   - only the etcd entries matter (the other components report failures on OpenShift)
 - All nodes are ready, all pods are running, and all PVCs are bound (a combined check sketch follows this list)
    oc get nodes
    oc get pods --all-namespaces -o wide                        - also check that no pods are stuck in Terminating/Pending status for a long time
    oc get pvc --all-namespaces -o wide
 - API health check
    curl -k https://apiserver.kube-service-catalog.svc/healthz
 - Docker status (at each node)
    docker info
    * Enough data and metadata space is available
    * The number of resident images is under control (>500-1000 is bad, >2000-3000 is terrible)
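
 A minimal sketch of the checks above, run from the first master. The etcd endpoint matches the
 one listed above; the 'endpoint health' subcommand assumes the 'etcdctl3' wrapper forwards
 arguments to the v3 client, and the 'Data/Metadata Space' lines only exist with the devicemapper
 storage driver:

    # etcd: membership and actual health
    etcdctl3 --endpoints="192.168.213.1:2379" member list
    etcdctl3 --endpoints="192.168.213.1:2379" endpoint health

    # nodes, pods, and PVCs (the greps only surface problematic entries)
    oc get nodes
    oc get pods --all-namespaces -o wide | grep -E 'Terminating|Pending'
    oc get pvc --all-namespaces -o wide | grep -v Bound

    # API health
    curl -k https://apiserver.kube-service-catalog.svc/healthz

    # docker storage pressure (run on each node)
    docker info 2>/dev/null | grep -E 'Data Space|Metadata Space'
    docker images -q | wc -l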

Nodes
=====
 - All required systemd services (e.g. 'origin-node', 'docker', 'dnsmasq') are running; see the sketch below
    * Communication with the docker daemon actually works (the daemon may be up but unresponsive)
 - Replicas of mandatory pods (GlusterFS, Router) are running on all nodes
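
 A per-node sketch (the service names and the 'glusterfs'/'default' namespaces are assumptions
 that should match this deployment):

    # systemd units and docker responsiveness
    systemctl is-active origin-node docker dnsmasq
    timeout 10 docker ps > /dev/null && echo "docker responds" || echo "docker is hung"

    # mandatory pods should have a running replica on every relevant node
    oc get pods -n glusterfs -o wide | grep glusterfs
    oc get pods -n default -o wide | grep router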

Storage
=======
 - Heketi status 
    heketi-cli -s  http://heketi-storage.glusterfs.svc.cluster.local:8080 --user admin --secret "$(oc get secret heketi-storage-admin-secret -n glusterfs -o jsonpath='{.data.key}' | base64 -d)" topology info
 - Status of the Gluster volumes (and their bricks, which often fail on heketi-managed volumes); see the sketch at the end of this section
    gluster volume info
    ./gluster.sh info all_heketi
 - Check the available storage space on the system partition and the LVM volumes (docker, heketi, ands)
    Run 'df -h' and 'lvdisplay' on each node
 - Check status of hardware raids
    /opt/MegaRAID/storcli/storcli64 /c0/v0 show all
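
 A sketch for spotting offline bricks and degraded RAID volumes ('gluster' must be run on a
 storage node or inside a glusterfs pod; the 'detail' output format is assumed from Gluster 3.x):

    # every brick should report 'Online : Y'
    gluster volume status all detail | grep -E '^(Brick|Online)'

    # the RAID virtual-drive state should be optimal
    /opt/MegaRAID/storcli/storcli64 /c0/v0 show all | grep -i state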

Networking
==========
 - Check that the correct upstream name servers are listed for both DNSMasq (host) and SkyDNS (pods).
 If not, fix the configuration and restart 'origin-node' and 'dnsmasq' (it happens that DNSMasq is simply stuck); see the sketch below.
    * '/etc/dnsmasq.d/origin-upstream-dns.conf'
    * '/etc/origin/node/resolv.conf'
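
 A sketch of the DNS check (the paths and service names are the ones listed above):

    # upstream servers used by dnsmasq on the host and by SkyDNS inside pods
    cat /etc/dnsmasq.d/origin-upstream-dns.conf
    cat /etc/origin/node/resolv.conf

    # only needed if the configuration had to be fixed or dnsmasq is stuck
    systemctl restart dnsmasq origin-node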

 - Check that both internal and external addresses are resolvable from all hosts.
    * I.e. we should be able to resolve 'google.com'
    * And we should be able to resolve 'heketi-storage.glusterfs.svc.cluster.local'
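
 A quick check to run on every host ('getent' is used to avoid depending on bind-utils):

    getent hosts google.com
    getent hosts heketi-storage.glusterfs.svc.cluster.local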
    
 - Check that the keepalived service is up and that the corresponding IPs are actually assigned to one
 of the nodes (the vagrant provisioner may remove the IPs tracked by keepalived, and keepalived will
 keep running without noticing it); a sketch follows below.
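
 A sketch (the VIP value below is a placeholder and must be replaced with the address keepalived manages):

    VIP="192.168.213.x"                  # placeholder, set to the real virtual IP
    systemctl is-active keepalived
    # the VIP should be assigned on exactly one of the nodes
    ip -o addr show | grep -F "$VIP"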
 
 - Ensure that 'cluster_name' is not still overridden to the first master (which we do during the
 provisioning of the OpenShift plays)

 - Sometimes OpenShift fails to clean up properly after a terminated pod (this problem is particularly
 likely on systems with a huge number of resident docker images). This leaves rogue network
 interfaces behind in the OpenVSwitch fabric. It can be detected by errors like:
    could not open network device vethb9de241f (No such device)
 reported by 'ovs-vsctl show' or present in the log '/var/log/openvswitch/ovs-vswitchd.log',
 which may quickly grow beyond 100MB. If the number of rogue interfaces grows too large,
 pod scheduling gets even worse (compared to the delays caused by docker images alone) and
 will start timing out on the affected node.
  * The work-around is to delete the rogue interfaces with
    ovs-vsctl del-port br0 <iface>
 This does not solve the underlying problem, however: new interfaces will keep being abandoned by OpenShift.
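
 A clean-up sketch, run on the affected node. It only removes ports that OVS itself reports as
 missing, but review the list produced by the first command before deleting anything:

    # list ports whose backing network device no longer exists
    ovs-vsctl show | grep "No such device"

    # remove them from the br0 bridge
    for iface in $(ovs-vsctl show | sed -n 's/.*could not open network device \([^ ]*\) (No such device).*/\1/p'); do
        ovs-vsctl del-port br0 "$iface"
    done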

ADEI
====
 - MySQL replication is working
 - No caching pods or maintenance pods are hung (for whatever reason)
    * Check that no ADEI pods are stuck in Deleting/Pending status
    * Check the logs of the 'cacher' and 'maintenance' scripts and ensure none is stuck on an ages-old time-stamp (unless we are re-caching something huge)
    * Ensure there are no old pending scripts in '/adei/tmp/adminscripts'
 Possible reasons:
    * Stale 'flock' locks (can be found by analyzing the backtraces in the corresponding /proc/<pid>/stack)
    * Hung connections to MySQL (can be found by executing 'SHOW PROCESSLIST' on the MySQL servers); a diagnostic sketch follows below
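
 A minimal diagnostic sketch (the MySQL credentials and the 'adei' namespace are assumptions
 and should be adjusted to the actual deployment):

    # replication state on the slave: Slave_IO_Running / Slave_SQL_Running should both be 'Yes'
    mysql -u root -p -e 'SHOW SLAVE STATUS\G' | grep -E 'Running|Seconds_Behind_Master'

    # long-running or sleeping connections
    mysql -u root -p -e 'SHOW PROCESSLIST'

    # ADEI pods stuck in a bad phase
    oc get pods -n adei -o wide | grep -E 'Deleting|Terminating|Pending'

    # stale flock: kernel backtraces of processes holding or waiting on locks
    for pid in $(pgrep -f flock); do echo "== $pid"; cat /proc/$pid/stack; done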