Divergence from the best practices
==================================
 Due to various constraints, I had to take some decisions that contradict the best practices. There were also some
 hardware limitations resulting in a suboptimal configuration.
 
 Storage
 -------
 - RedHat documentation strongly discourages running Gluster over large RAID-60 arrays. The best performance is achieved
 if disks are organized as JBOD and each disk is assigned a brick. The problem is that heketi is not really ready for
 production yet; I ran into numerous problems while testing it, and managing '3 x 24' Gluster bricks manually would be a
 nightmare. Consequently, I opted for RAID-60 to simplify maintenance and to ensure no data is lost due to mismanagement
 of Gluster volumes.
 
 - In general, the architecture is more suitable for many small servers, not just a couple of fat storage servers. Then,
 the disk load would be distributed between multiple nodes. Furthermore, we can't use all of the storage with only 3 nodes.
 We need 3 nodes to ensure arbitration in case of failures (or network outages). Even if the 3rd node only stores
 checksums, we can't easily use it to store data. Technically, we could create 3 sets of 3 bricks and put the arbiter
 brick of each set on a different node (as sketched below), but this again would complicate maintenance: unless proper
 brick ordering is maintained, replication may happen between bricks on the same node, etc. So, again, I decided to favor
 fault tolerance over performance. We can still use the space once the cluster is scaled.
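
 For reference, a minimal sketch of such a '3 sets of 3 bricks' layout with rotated arbiters (hostnames, volume name and
 brick paths are hypothetical). The brick order defines the replica sets, so every third brick becomes the arbiter and
 each node hosts the arbiter of a different set:

     # distributed replica-3 volume with one arbiter brick per replica set
     gluster volume create openshift-data replica 3 arbiter 1 \
         node1:/bricks/b1 node2:/bricks/b1 node3:/bricks/arb1 \
         node2:/bricks/b2 node3:/bricks/b2 node1:/bricks/arb2 \
         node3:/bricks/b3 node1:/bricks/b3 node2:/bricks/arb3
     gluster volume start openshift-data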

 Network
 -------
 - To ensure high-speed communication between pods running on different nodes, RedHat recommends enabling Container Native
 Routing. This is done by creating a bridge for Docker containers on the hardware network device instead of the OpenVSwitch
 fabric (see the sketch below). Unfortunately, IPoIB does not provide Ethernet L2/L3 capabilities and it is impossible to
 use IB devices for bridging. It may still be possible to solve this somehow, but further research is required. The easier
 solution is to just switch the OpenShift fabric to Ethernet. Anyway, we had the idea to separate the storage and OpenShift
 networks.
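
 For illustration, Container Native Routing roughly boils down to something like the sketch below (interface names and the
 subnet are placeholders). The 'ip link set ... master' step is exactly what fails with IPoIB, since an IB interface cannot
 be enslaved to an Ethernet bridge:

     # create a bridge on the hardware NIC and point Docker at it
     ip link add name br0 type bridge
     ip addr add 10.1.1.1/24 dev br0
     ip link set eth0 master br0
     ip link set br0 up
     # /etc/docker/daemon.json:  { "bridge": "br0", "fixed-cidr": "10.1.1.0/24" }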
 
 Memory
 ------
 - There are multiple Docker storage engines. We are currently using the LVM-based 'devicemapper'. To build a container,
 the data is copied from all image layers. The newer 'overlay2' provides a virtual file system (overlayfs) joining all
 layers and performing copy-on-write (COW) when data is modified. It saves space, but more importantly it also enables
 page cache sharing, reducing the memory footprint if multiple containers share the same layers (and they share the
 CentOS base image at minimum). Another advantage is a slightly faster startup of containers with large images (as we
 don't need to copy all files). On the negative side, overlayfs is not fully POSIX compliant and some applications may
 have problems because of that. For major applications there are work-arounds provided by RedHat. But again, I opted for
 the more standard 'devicemapper' to avoid hard-to-debug problems. A sketch of the configuration change is given below.
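
 If we later decide to switch, the change itself is small (a simplified sketch, assuming the RHEL/CentOS
 docker-storage-setup tooling; existing local images would have to be re-pulled or rebuilt after the switch):

     # /etc/sysconfig/docker-storage-setup
     STORAGE_DRIVER=overlay2
     # or directly in /etc/docker/daemon.json:  { "storage-driver": "overlay2" }
     # the currently active engine can be checked with:
     docker info | grep -i 'storage driver'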


What is required
================
 - We need to add at least one more node. It will double the available storage and I expect a significant improvement in
 storage performance. It would be even better to have 5-6 nodes to split the load.
 - We need to switch to an Ethernet fabric for the OpenShift network. Currently, this is not critical and would only add
 about 20% to ADEI performance. However, it may become an issue if we optimize ADEI database handling or get more
 network-intensive applications in the cluster.
 - We need to re-evaluate RDMA support in GlusterFS. Currently, it is unreliable and causes pods to hang indefinitely. If
 this is fixed, we can re-enable RDMA support for our volumes (see the sketch after this list), which hopefully may further
 improve storage performance. Similarly, Gluster block storage is significantly faster for the single-pod use case, but has
 significant stability issues at the moment.
 - We need to check whether OverlayFS causes any problems for the applications we plan to run. Enabling overlayfs should
 be good for our cron services and may reduce the memory footprint.
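
 Regarding the RDMA point above, re-enabling it per volume should amount to something like the following sketch (the
 volume name is a placeholder; the volume has to be stopped while changing the transport):

     gluster volume stop adei-db
     gluster volume set adei-db config.transport tcp,rdma
     gluster volume start adei-db
     # clients then mount using the rdma transport, e.g.
     mount -t glusterfs -o transport=rdma node1:/adei-db /mnt/adei-db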