docs/consistency.txt


General overview
================
 - etcd services (worth checking both ports)
    etcdctl3 --endpoints="192.168.213.1:2379" member list       - only lists members, does not check their health
    oc get cs                                                   - only the etcd entries are meaningful (the other components report failures on OpenShift)
 - All nodes are Ready, all pods are Running, and all PVCs are Bound
    oc get nodes
    oc get pods --all-namespaces -o wide
    oc get pvc --all-namespaces -o wide
 - API health check
    curl -k https://apiserver.kube-service-catalog.svc/healthz
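
The pod and PVC listings above get long on a busy cluster. As a minimal
sketch, the two filters below keep only the entries that need attention. The
function names are made up for illustration, and the column positions assume
the default 'oc get ... --all-namespaces --no-headers' layout. Pods that are
Running but not ready (e.g. 0/1 in the READY column) are not caught.

```shell
# Print pods that are neither Running nor Completed.
# Reads 'oc get pods --all-namespaces --no-headers' on stdin;
# with --all-namespaces the STATUS column is the 4th field.
not_running_pods() {
    awk '$4 != "Running" && $4 != "Completed" { print $1 "/" $2 " " $4 }'
}

# Print PVCs that are not Bound.
# Reads 'oc get pvc --all-namespaces --no-headers' on stdin;
# with --all-namespaces the STATUS column is the 3rd field.
unbound_pvcs() {
    awk '$3 != "Bound" { print $1 "/" $2 " " $3 }'
}

# Usage (hypothetical, on a master):
#   oc get pods --all-namespaces --no-headers | not_running_pods
#   oc get pvc --all-namespaces --no-headers | unbound_pvcs
```

Empty output from both filters means every pod and claim is in the expected
state.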

Storage
=======
 - Heketi status 
    heketi-cli -s  http://heketi-storage.glusterfs.svc.cluster.local:8080 --user admin --secret "$(oc get secret heketi-storage-admin-secret -n glusterfs -o jsonpath='{.data.key}' | base64 -d)" topology info
 - Status of Gluster volumes (and their bricks, which often fail when managed by heketi)
    gluster volume info
    ./gluster.sh info all_heketi
 - Check available storage space on the system partition and LVM volumes (docker, heketi, ands)
    Run 'df -h' and 'lvdisplay' on each node
 - Check status of hardware RAID arrays
    /opt/MegaRAID/storcli/storcli64 /c0/v0 show all
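
Reading 'df' output by eye on every node is easy to get wrong. The sketch
below flags file systems above a usage threshold. It assumes the POSIX 'df -P'
output format and mount points without spaces; the function name and the
default limit of 90 percent are arbitrary choices, not project settings.

```shell
# Print mount points whose usage is at or above a limit (in percent).
# Reads 'df -P' output on stdin: column 5 is the capacity, column 6 the
# mount point. The first line is the header and is skipped.
full_filesystems() {
    awk -v limit="${1:-90}" 'NR > 1 {
        use = $5
        sub(/%/, "", use)
        if (use + 0 >= limit) print $6 " " $5
    }'
}

# Usage (hypothetical, on each node):
#   df -P | full_filesystems 85
```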

Networking
==========
 - Check that the correct upstream name servers are listed for both dnsmasq (host) and SkyDNS (pods).
 If not, fix the files below and restart 'origin-node' and 'dnsmasq'.
    * '/etc/dnsmasq.d/origin-upstream-dns.conf'
    * '/etc/origin/node/resolv.conf'

 - Check that both internal and external addresses are resolvable from all hosts.
    * I.e. we should be able to resolve 'google.com'
    * And we should be able to resolve 'heketi-storage.glusterfs.svc.cluster.local'
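
The resolution check can be scripted and run on every host. This is only a
sketch: it assumes 'getent' is available on the hosts, and both the function
name and the RESOLVE_CMD override (handy for trying a different lookup tool)
are made up for this example.

```shell
# Try to resolve every name given as an argument.
# Prints one line per name and returns non-zero if any lookup failed.
# RESOLVE_CMD is a hypothetical override; the default is 'getent hosts'.
check_resolution() {
    resolve="${RESOLVE_CMD:-getent hosts}"
    rc=0
    for name in "$@"; do
        # $resolve is left unquoted on purpose so 'getent hosts' splits
        # into a command and its argument.
        if $resolve "$name" > /dev/null 2>&1; then
            echo "ok   $name"
        else
            echo "FAIL $name"
            rc=1
        fi
    done
    return "$rc"
}

# Usage:
#   check_resolution google.com heketi-storage.glusterfs.svc.cluster.local
```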
    
 - Check that the keepalived service is up and that the corresponding IPs are actually assigned
 to one of the nodes (the Vagrant provisioner would remove keepalived-tracked IPs, but keepalived
 will continue running without noticing it)
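
Since a running keepalived does not prove the address is present, both
conditions have to be checked separately. A possible sketch follows; the
function name is invented, and 192.0.2.10 is a placeholder from the
documentation address range, not one of the cluster's real virtual IPs.

```shell
# Succeed if the given IPv4 address is assigned to some interface.
# Reads 'ip -o -4 addr show' output on stdin, where field 3 is "inet"
# and field 4 is the address with its prefix length (e.g. 192.0.2.10/24).
has_address() {
    awk -v ip="$1" '
        $3 == "inet" { split($4, a, "/"); if (a[1] == ip) found = 1 }
        END { exit !found }'
}

# Usage (hypothetical, run on every node; exactly one should report the IP):
#   systemctl is-active keepalived
#   ip -o -4 addr show | has_address 192.0.2.10 && echo "VIP is here"
```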
 
 - Ensure that cluster_name is no longer overridden to point to the first master (an override
 we apply while the OpenShift provisioning plays run)
 

ADEI
====
 - MySQL replication is working
 - No caching pods are hung (for whatever reason)
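
For the replication item, one way to check is to look at the two thread
states reported by the replica. The sketch below assumes a MySQL version that
still uses 'SHOW SLAVE STATUS' and its 'Slave_IO_Running' /
'Slave_SQL_Running' fields (newer releases rename these); the function name
and how the 'mysql' client is reached inside the ADEI pods are assumptions.

```shell
# Succeed only if both replication threads report "Yes".
# Reads the output of: mysql -e 'SHOW SLAVE STATUS\G' on stdin.
replication_ok() {
    awk -F': *' '
        { gsub(/^ +/, "", $1) }                 # keys are right-aligned
        $1 == "Slave_IO_Running"  { io  = $2 }
        $1 == "Slave_SQL_Running" { sql = $2 }
        END { exit !(io == "Yes" && sql == "Yes") }'
}

# Usage (hypothetical, on the replica):
#   mysql -e 'SHOW SLAVE STATUS\G' | replication_ok || echo "replication broken"
```

Empty input (e.g. a server that is not configured as a replica) is also
treated as a failure.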