Terminology
===========
 - Send/Receive Queues
    QP (Queue Pair): Combines an RQ and an SQ; generally irrelevant for the following
    RQ (Receive Queue): Receive work requests (buffers for incoming data) are posted here
    SQ (Send Queue): Send work requests are posted here
    CQ (Completion Queue): Completed operations are reported here
    EQ (Event Queue): Completions generate events (at a specified rate) which in turn generate IRQs
    WR/WQ (Work Request / Work Queue): Essentially buffers (SG-lists) which should be either sent or used for data reception
    *QE (* Queue Entry): a single element of the corresponding queue (WQE, CQE, EQE)

    Flow: WQE --submit work--> WQ --execute--> SQ/RQ --on completion--> CQ --signal--> EQ --> IRQ
    (a libibverbs sketch of this flow is given at the end of this section)
    * Completion Event Moderation: Reduce the amount of reported events (EQ)

 - Offloads
    RSS (Receive Side Scaling): Distribute receive processing across CPU cores (packets are hashed onto multiple receive queues)
    LRO (Large Receive Offload): Aggregate received packets and deliver them to the stack as a single large packet [ ethtool -k shows if LRO is on/off, ethtool -K toggles it ]

 - Various
    AEV (Asynchronous Event): Errors, etc.
    SRQ (Shared Receive Queue): A receive queue shared by multiple QPs
    ICM (Interconnect Context Memory): Address Translation Tables, Control Objects, User Access Region (registers)
    MPT (Memory Protection Table): Translation and access-rights entries for registered memory regions
    RMP (Receive Memory Pool): A pool of receive buffers shared by multiple RQs
    TIR (Transport Interface Receive): Delivers/steers received traffic to an RQ or an RQT
    RQT (RQ Table): A table of RQs, used e.g. as an RSS target
    MCG (Multicast Group): Maps a multicast address to the set of QPs that should receive it
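
    To make the flow above concrete, here is a minimal libibverbs sketch (plain verbs, not the
    VMA API); it assumes a QP, a CQ and a completion channel were created earlier and only
    illustrates how a work request travels from the SQ to a CQE and an interrupt-driven event:

        #include <infiniband/verbs.h>

        /* Post one send WQE, then wait for its completion via the CQ's event channel. */
        int post_and_wait(struct ibv_qp *qp, struct ibv_cq *cq,
                          struct ibv_comp_channel *ch, struct ibv_sge *sge)
        {
            struct ibv_send_wr wr = {0}, *bad = NULL;
            wr.sg_list    = sge;                  /* the SG-list (the buffer part of the WQE) */
            wr.num_sge    = 1;
            wr.opcode     = IBV_WR_SEND;
            wr.send_flags = IBV_SEND_SIGNALED;    /* request a CQE for this WQE */

            if (ibv_req_notify_cq(cq, 0))         /* arm the CQ: next CQE -> event (EQ) -> IRQ */
                return -1;
            if (ibv_post_send(qp, &wr, &bad))     /* the WQE is placed on the SQ */
                return -1;

            struct ibv_cq *ev_cq; void *ev_ctx;
            if (ibv_get_cq_event(ch, &ev_cq, &ev_ctx))   /* blocks until the event arrives */
                return -1;
            ibv_ack_cq_events(ev_cq, 1);

            struct ibv_wc wc;                     /* drain the completion from the CQ */
            int n;
            do { n = ibv_poll_cq(cq, 1, &wc); } while (n == 0);
            return (n > 0 && wc.status == IBV_WC_SUCCESS) ? 0 : -1;
        }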

Driver
======
 - Network packets are streamed to ring buffers (with all Ethernet, IP, UDP/TCP headers).
 The number of ring buffers depends on the VMA_RING_ALLOCATION_LOGIC (RX/TX) parameter:
     0 - per network interface
     1 - per IP
 => 10 - per socket
    20 - per thread (which was used to create the socket)
    30 - per core
    31 - per core (with some affinity of threads to cores)

 - The memory for ring buffers is allocated based on VMA_MEM_ALLOC_TYPE:
    0 - malloc (this will be very slow if large buffers are requested)
    1 - contiguous
 => 2 - HugePages

 - The total number of receive buffers across all rings is controlled with VMA_RX_BUFS
    * Each buffer is VMA_MTU bytes
    * Recommended: VMA_RX_BUFS ~ #rings * VMA_RX_WRE (VMA_RX_WRE is the number of receive WREs allocated per interface/ring)
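    * Example (illustrative numbers only): with 4 rings and VMA_RX_WRE=16000, VMA_RX_BUFS should be at least 4 * 16000 = 64000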

LibVMA
======
 There are 3 interfaces:
 - MP-RQ (Multi-packet Receive Queue): vma_cyclic_buffer_read
    This is useful for processing data streams where the packet size stays constant and the packet flow doesn't change
    drastically over time. Requires ConnectX-5 or newer.

    * Use 'vma_add_ring_profile' to configure the ring (specifies the buffer size & the packet size)
    * Set the per-socket SO_VMA_RING_ALLOC_LOGIC using setsockopt
    * Call 'vma_cyclic_buffer_read' to access the raw ring buffer, specifying the minimum and maximum number of packets to return

    * The returned 'completion' structure references the position in the ring buffer. Packets in the ring buffer
    include all headers (ethernet - 14 bytes, ip - 20 bytes, udp - 8 bytes).
    * New packets are meanwhile written into the remaining part of the ring buffer (up to the linear end of the
    buffer - consequently the returned data is not overwritten).
    * The buffer is rewound only on a call to 'vma_cyclic_buffer_read'. Fewer than the specified minimum number of
    packets may be returned if the read position is currently near the end of the buffer and there is not enough
    space to fulfil the minimum requirement.

    * To ensure enough space for the follow-up packets, the buffer size and the min/max packet counts must be kept
    in sync. It should never happen that space for only a few packets is left as the end of the buffer approaches
    (see the sketch below).
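
    * A minimal receive-loop sketch. Assumptions to verify against the installed 'mellanox/vma_extra.h':
    the extra API is obtained via vma_get_api(), and 'vma_completion_cb_t' exposes payload_ptr,
    payload_length and packets fields; the ring-profile and SO_VMA_RING_ALLOC_LOGIC setup is omitted:

        #include <sys/types.h>
        #include <stddef.h>
        #include <mellanox/vma_extra.h>          /* VMA extra (non-POSIX) API */

        extern void process_region(void *data, size_t len, size_t packets); /* user handler (hypothetical) */

        /* Read between min_pkts and max_pkts packets from the socket's cyclic (MP-RQ) ring. */
        ssize_t drain_mprq(int fd, size_t min_pkts, size_t max_pkts)
        {
            struct vma_api_t *api = vma_get_api();
            if (!api)
                return -1;                       /* not running under libvma */

            struct vma_completion_cb_t comp = {0};
            if (api->vma_cyclic_buffer_read(fd, &comp, min_pkts, max_pkts, 0) < 0)
                return -1;

            /* The completion points into the ring buffer itself (zero copy); packets still
             * carry their Ethernet(14) + IP(20) + UDP(8) headers and stay valid until a
             * later vma_cyclic_buffer_read() rewinds the ring. */
            process_region(comp.payload_ptr, comp.payload_length, comp.packets);
            return (ssize_t)comp.packets;
        }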

 - SocketXtreme: socketxtreme_poll
    A more complex interface allowing more control over processing, in particular for packets of varying size.
    Requires ConnectX-5 or newer.

    * Get the ring buffers associated with a socket using 'get_socket_rings_num' and 'get_socket_rings_fds'
    * Get ready completions on the specified ring buffer with 'socketxtreme_poll' (pass the 'fd' returned by 'get_socket_rings_fds')
    * Two types of completions: 'VMA_SOCKETXTREME_NEW_CONNECTION_ACCEPTED' and 'VMA_SOCKETXTREME_PACKET'.
    * For the second type, process the associated list of buffers and keep reference counts with 'socketxtreme_ref_vma_buff',
    'socketxtreme_free_vma_buff'.
    * Clean/unreference received packets with 'socketxtreme_free_vma_packets' (see the sketch below)
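
    * A polling-loop sketch. Assumptions to verify against the installed 'mellanox/vma_extra.h':
    'vma_completion_t' exposes events, user_data and a packet descriptor whose 'buff_lst' is a linked
    list of buffers with payload/len/next fields, and user_data carries the fd of an auto-accepted
    connection; the ring fds are assumed to have been fetched once per socket with
    get_socket_rings_num()/get_socket_rings_fds():

        #include <stddef.h>
        #include <mellanox/vma_extra.h>          /* VMA extra (non-POSIX) API */

        extern void handle_payload(void *data, size_t len);   /* user handler (hypothetical) */

        /* Poll one VMA ring fd and handle whatever completions are ready. */
        void poll_ring_once(struct vma_api_t *api, int ring_fd)
        {
            struct vma_completion_t comps[32];
            int n = api->socketxtreme_poll(ring_fd, comps, 32, 0);

            for (int i = 0; i < n; i++) {
                if (comps[i].events & VMA_SOCKETXTREME_NEW_CONNECTION_ACCEPTED) {
                    int new_fd = (int)comps[i].user_data;    /* the auto-accepted socket */
                    (void)new_fd;                            /* register it with the application */
                } else if (comps[i].events & VMA_SOCKETXTREME_PACKET) {
                    /* walk the packet's buffer list; data points into the ring buffer */
                    struct vma_buff_t *buf = comps[i].packet.buff_lst;
                    while (buf) {
                        handle_payload(buf->payload, buf->len);
                        buf = buf->next;
                    }
                    /* release the packet when done (or keep individual buffers alive with
                     * socketxtreme_ref_vma_buff / socketxtreme_free_vma_buff) */
                    api->socketxtreme_free_vma_packets(&comps[i].packet, 1);
                }
            }
        }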

 - Zero Copy: recvfrom_zcopy
    The simplest interface; it also works with ConnectX-3 cards. Packets are still written to the ring buffers and
    the data is not copied out of them; the interface returns pointers to the packet locations inside the ring
    buffer. Compared to the MP-RQ approach there is a slight overhead for preparing the list of packet pointers
    (see the sketch below).
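
    * A receive sketch. Assumptions to verify against the installed 'mellanox/vma_extra.h': on a zero-copy
    return the MSG_VMA_ZCOPY flag is set and the user buffer holds a 'vma_packets_t' descriptor (a packet
    count followed by variable-sized 'vma_packet_t' entries, each an iovec SG-list pointing into the ring
    buffer), released afterwards with free_packets():

        #include <stddef.h>
        #include <sys/uio.h>
        #include <mellanox/vma_extra.h>          /* VMA extra (non-POSIX) API */

        extern void handle_payload(void *data, size_t len);   /* user handler (hypothetical) */

        /* Receive once without copying the data out of the ring buffer. */
        void recv_zcopy_once(struct vma_api_t *api, int fd)
        {
            char ctl[4096];                      /* holds packet descriptors, not packet data */
            int flags = 0;
            int ret = api->recvfrom_zcopy(fd, ctl, sizeof(ctl), &flags, NULL, NULL);
            if (ret <= 0)
                return;

            if (flags & MSG_VMA_ZCOPY) {
                struct vma_packets_t *pkts = (struct vma_packets_t *)ctl;
                struct vma_packet_t *pkt = &pkts->pkts[0];
                for (size_t i = 0; i < pkts->n_packet_num; i++) {
                    for (size_t j = 0; j < pkt->sz_iov; j++)
                        handle_payload(pkt->iov[j].iov_base, pkt->iov[j].iov_len);
                    /* entries are variable-sized: step over this packet's iovec array */
                    pkt = (struct vma_packet_t *)((char *)pkt + sizeof(*pkt)
                                                  + pkt->sz_iov * sizeof(struct iovec));
                }
                api->free_packets(fd, pkts->pkts, pkts->n_packet_num);   /* return buffers to the ring */
            } else {
                /* data was copied into ctl as with a plain recvfrom() */
            }
        }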