200GbE RoCE (RDMA over Converged Ethernet) latency + bandwidth measurements

Test machines (x2): 2x AMD EPYC 7452 (2x32 cores), 1 TB RAM, Mellanox ConnectX-6 2x200 Gbit NIC, connected with a DAC cable

ib_read_lat:

 #bytes #iterations    t_min[usec]    t_max[usec]  t_typical[usec]    t_avg[usec]    t_stdev[usec]   99% percentile[usec]   99.9% percentile[usec]
 2       1000          2.58           4.69         2.65     	       2.65        	0.02   		2.70    		4.69
 4       1000          2.59           2.77         2.64     	       2.65        	0.01   		2.69    		2.77
 8       1000          2.60           2.73         2.65     	       2.66        	0.02   		2.71    		2.73
 16      1000          2.60           2.93         2.65     	       2.65        	0.01   		2.70    		2.93
 32      1000          2.60           2.85         2.65     	       2.66        	0.02   		2.70    		2.85
 64      1000          2.61           2.74         2.66     	       2.66        	0.01   		2.71    		2.74
 128     1000          2.65           2.86         2.70     	       2.70        	0.02   		2.75    		2.86
 256     1000          2.70           3.10         2.74     	       2.74        	0.01   		2.79    		3.10
 512     1000          2.77           2.87         2.80     	       2.80        	0.01   		2.85    		2.87
 1024    1000          2.84           2.99         2.88     	       2.88        	0.02   		2.93    		2.99
 2048    1000          3.03           3.19         3.07     	       3.07        	0.01   		3.12    		3.19
 4096    1000          3.35           3.64         3.38     	       3.39        	0.02   		3.46    		3.64
 8192    1000          3.92           4.25         3.98     	       3.98        	0.03   		4.12    		4.25
 16384   1000          4.88           5.72         5.38     	       5.40        	0.11   		5.64    		5.72
 32768   1000          5.60           9.36         7.64     	       7.05        	1.02   		8.14    		9.36
 65536   1000          7.67           196.58       7.97     	       9.11        	1.83   		11.99   		196.58
 131072  1000          11.75          205.21       12.19    	       13.68       	6.40   		19.66   		205.21
 262144  1000          20.83          214.43       21.27    	       22.78       	7.97   		35.77   		214.43
 524288  1000          38.80          234.33       39.06    	       40.95       	10.55  		67.68   		234.33
 1048576 1000          73.80          268.39       74.13    	       76.08       	12.34  		130.75  		268.39
 2097152 1000          146.71         343.53       147.18   	       148.97      	14.82  		259.94  		343.53
 4194304 1000          287.85         532.37       288.40   	       290.16      	19.03  		299.61  		532.37
 8388608 1000          588.84         1080.85      589.40   	       590.54      	17.89  		602.27  		1080.85

ib_write_lat:

 #bytes #iterations    t_min[usec]    t_max[usec]  t_typical[usec]    t_avg[usec]    t_stdev[usec]   99% percentile[usec]   99.9% percentile[usec]
 2       1000          1.29           1.88         1.30     	       1.30        	0.01   		1.36    		1.88
 4       1000          1.29           2.53         1.30     	       1.30        	0.01   		1.32    		2.53
 8       1000          1.29           1.57         1.30     	       1.30        	0.01   		1.38    		1.57
 16      1000          1.29           2.28         1.30     	       1.30        	0.01   		1.31    		2.28
 32      1000          1.31           1.47         1.32     	       1.32        	0.01   		1.34    		1.47
 64      1000          1.32           1.48         1.33     	       1.33        	0.00   		1.35    		1.48
 128     1000          1.37           2.98         1.38     	       1.38        	0.01   		1.40    		2.98
 256     1000          2.06           2.50         2.08     	       2.08        	0.01   		2.10    		2.50
 512     1000          2.11           2.29         2.12     	       2.12        	0.01   		2.16    		2.29
 1024    1000          2.19           2.72         2.22     	       2.21        	0.01   		2.26    		2.72
 2048    1000          2.34           2.85         2.36     	       2.36        	0.02   		2.43    		2.85
 4096    1000          2.63           3.81         2.66     	       2.66        	0.01   		2.74    		3.81
 8192    1000          2.68           3.29         3.12     	       3.07        	0.13   		3.20    		3.29
 16384   1000          3.44           5.00         3.61     	       3.76        	0.25   		4.11    		5.00
 32768   1000          4.44           8.25         4.80     	       4.88        	0.40   		5.60    		8.25
 65536   1000          5.96           101.62       6.21     	       6.57        	2.99   		7.81    		101.62
 131072  1000          9.07           105.75       9.20     	       9.68        	3.10   		11.90   		105.75
 262144  1000          15.50          113.05       15.62    	       16.18       	4.38   		20.35   		113.05
 524288  1000          28.17          126.69       28.45    	       28.98       	4.48   		37.00   		126.69
 1048576 1000          53.09          152.33       53.66    	       54.23       	4.79   		70.24   		152.33
 2097152 1000          104.59         207.65       105.32   	       105.79      	4.33   		137.07  		207.65
 4194304 1000          208.80         314.43       210.00   	       210.47      	4.95   		213.75  		314.43
 8388608 1000          417.80         549.21       419.44   	       419.90      	5.93   		423.85  		549.21

ib_read_bw:

 #bytes     #iterations    BW peak[MB/sec]    BW average[MB/sec]   MsgRate[Mpps]
 2          1000             3.97               3.96   		   2.076103
 4          1000             7.94               7.91   		   2.074370
 8          1000             15.84              15.80  		   2.071083
 16         1000             31.79              31.76  		   2.081300
 32         1000             63.31              63.26  		   2.072895
 64         1000             127.06             126.97 		   2.080216
 128        1000             253.89             253.79 		   2.079035
 256        1000             507.34             507.19 		   2.077434
 512        1000             1010.21            1009.97		   2.068416
 1024       1000             1997.58            1996.38		   2.044292
 2048       1000             3964.07            3963.47		   2.029296
 4096       1000             7657.03            7644.94		   1.957104
 8192       1000             7231.78            7230.55		   0.925510
 16384      1000             7795.25            7795.21		   0.498893
 32768      1000             10679.90            7886.62		   0.252372
 65536      1000             10918.97            9383.91		   0.150143
 131072     1000             11416.63            10623.78		   0.084990
 262144     1000             12012.13            11380.44		   0.045522
 524288     1000             12063.75            11959.29		   0.023919
 1048576    1000             12865.39            12865.34		   0.012865
 2097152    1000             13303.87            13303.86		   0.006652
 4194304    1000             13658.86            13658.85		   0.003415
 8388608    1000             13738.83            13738.83		   0.001717

ib_write_bw:

 #bytes     #iterations    BW peak[MB/sec]    BW average[MB/sec]   MsgRate[Mpps]
 2          5000             3.98               3.98   		   2.085922
 4          5000             7.96               7.96   		   2.086743
 8          5000             15.97              15.96  		   2.091558
 16         5000             31.82              31.78  		   2.082933
 32         5000             63.86              63.67  		   2.086426
 64         5000             127.73             127.40 		   2.087342
 128        5000             255.68             255.13 		   2.090065
 256        5000             511.82             511.30 		   2.094275
 512        5000             1019.10            1019.00		   2.086910
 1024       5000             2022.04            2020.74		   2.069241
 2048       5000             3991.29            3990.96		   2.043372
 4096       5000             7927.51            7915.52		   2.026373
 8192       5000             13888.11            13881.77		   1.776866
 16384      5000             20923.12            15429.73		   0.987503
 32768      5000             22652.55            18441.30		   0.590121
 65536      5000             22197.21            20016.91		   0.320271
 131072     5000             22400.50            21261.37		   0.170091
 262144     5000             22288.38            21673.38		   0.086694
 524288     5000             22310.64            22116.69		   0.044233
 1048576    5000             22200.54            22160.88		   0.022161
 2097152    5000             22058.83            22058.75		   0.011029
 4194304    5000             22102.46            22098.02		   0.005525
 8388608    5000             21915.75            21912.78		   0.002739

That 13.7 GB/s for ib_read_bw is odd; otherwise the results look good.
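To put that remark in context, a small Python sketch that converts the 8 MiB rows of the bandwidth tables to Gbit/s and compares them with the 200 Gbit/s line rate. Assumption: the post does not say whether perftest's "MB" means 2^20 or 10^6 bytes, so both readings are printed; the helper name is hypothetical.

```python
# Hypothetical helper: convert perftest "BW average[MB/sec]" figures to Gbit/s.
# Assumption: MB = 2**20 bytes (binary); the decimal 10**6 reading is printed too.
LINE_RATE_GBIT = 200.0


def mb_per_s_to_gbit(mb_per_s: float, binary: bool = True) -> float:
    """Convert a MB/s figure to Gbit/s."""
    bytes_per_mb = 2**20 if binary else 10**6
    return mb_per_s * bytes_per_mb * 8 / 1e9


# BW average at 8388608 bytes, copied from the tables above
results = {"ib_read_bw": 13738.83, "ib_write_bw": 21912.78}

for name, mb in results.items():
    gbit = mb_per_s_to_gbit(mb)
    gbit_dec = mb_per_s_to_gbit(mb, binary=False)
    print(f"{name}: {gbit:.1f} Gbit/s (binary MB), {gbit_dec:.1f} Gbit/s (decimal MB), "
          f"{gbit / LINE_RATE_GBIT:.0%} of the 200 Gbit/s line rate")
```

On either reading, write gets close to the line rate while read stays well below it, which is the gap the remark points at.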

Comments

Obviously ib_read_bw will be lower than ib_write_bw, but the ratio (13 GB/s vs. 22 GB/s) is surprising. What is the page size? Could it be too small? Is some kind of readahead involved here?
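The page-size question can be checked on Linux with a few lines of Python: the base page size via `mmap.PAGESIZE` / `os.sysconf`, and the default huge page size from `/proc/meminfo` (Linux-specific). This is only a way to read the values; whether page size explains the read/write gap is not established in the thread.

```python
import mmap
import os
from typing import Optional


def huge_page_size_kib(path: str = "/proc/meminfo") -> Optional[int]:
    """Return the default huge page size in KiB from /proc/meminfo (Linux only), or None."""
    try:
        with open(path) as meminfo:
            for line in meminfo:
                if line.startswith("Hugepagesize:"):
                    return int(line.split()[1])  # line looks like "Hugepagesize:  2048 kB"
    except OSError:
        pass
    return None


print("base page size (mmap.PAGESIZE):", mmap.PAGESIZE, "bytes")
if hasattr(os, "sysconf"):  # os.sysconf is only available on Unix
    print("base page size (SC_PAGE_SIZE):", os.sysconf("SC_PAGE_SIZE"), "bytes")
print("default huge page size:", huge_page_size_kib(), "KiB")
```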

Interesting things can happen, though:

$ dd if=/dev/zero of=/dev/shm/x.tmp bs=1M count=1024 conv=sync
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 0.575274 s, 1.9 GB/s
$ dd if=/dev/shm/x.tmp of=/dev/null conv=sync
2097152+0 records in
2097152+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 1.8693 s, 574 MB/s
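The two dd runs differ in block size: the first uses bs=1M, while the second falls back to dd's default 512-byte blocks (hence 2097152 records), so it issues 2048 times as many read/write calls. A Python sketch of the arithmetic from the output above; attributing the slowdown to per-call overhead is an interpretation, not something measured here.

```python
# Reproduce the dd figures: 1 GiB written with bs=1M, then read back with dd's
# default block size of 512 bytes (which matches the 2097152 records reported).
TOTAL_BYTES = 1073741824


def dd_stats(block_size: int, seconds: float):
    """Return (records, bytes per second, microseconds per record) for one dd run."""
    records = TOTAL_BYTES // block_size
    throughput = TOTAL_BYTES / seconds
    per_record_us = seconds / records * 1e6
    return records, throughput, per_record_us


for label, bs, secs in [("write bs=1M", 1024 * 1024, 0.575274),
                        ("read default bs", 512, 1.8693)]:
    records, tput, per_rec = dd_stats(bs, secs)
    print(f"{label}: {records} records, {tput / 1e9:.2f} GB/s, {per_rec:.2f} us/record")
```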

I'm out of date. How does this 200GbE physically come out of the card? QSFP56?

Yes, looks like you're right :) From Mellanox's official connector definition page:

QSFP+ denotes cables/transceivers for 4 x (10 – 14) Gb/s applications, while QSFP28 denotes the 4 x (24…28) = 100 Gb/s product range with QSFP form factor, used for InfiniBand EDR 100Gb/s ports and 100Gb/s Ethernet (100GbE) ports. The QSFP28 interface is specified in SFF-8679. QSFP56 denotes 4 x (50…56) Gb/s in a QSFP form factor, but the term with the ‘56’ is not defined in the SFF standards at the time of publishing this paper. This form factor is used for InfiniBand HDR 200Gb/s ports and 200Gb/s Ethernet (200GbE) ports.

QSFP-DD refers to a double-density QSFP transceiver that support 200 GbE and 400 GbE Ethernet. It employs 8 lanes that operate up to 25Gb/s NRZ modulation or 50Gb/s PAM4 modulation. 4 lanes will be active when plugged into a standard QSFP cage.

So QSFP56 is just a change of encoding, and QSFP-DD is needed for 400G. Unfortunately, as I wrote, I don't have a 400G switch yet, but I definitely want to build the new backbone on it.
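The lane arithmetic behind that conclusion: QSFP56 keeps four lanes but carries two bits per symbol (PAM4 instead of NRZ), while QSFP-DD doubles the lane count. A simplified Python sketch with nominal rates; real lanes run slightly faster (roughly 25.78 GBd for 25G NRZ, 26.56 GBd for 50G PAM4) to carry encoding and FEC overhead, which is ignored here.

```python
# Nominal per-lane rate = symbol rate (GBd) x bits per symbol.
# Simplification: encoding/FEC overhead and exact baud rates are ignored.
BITS_PER_SYMBOL = {"NRZ": 1, "PAM4": 2}


def port_rate_gbit(lanes: int, gbaud: float, modulation: str) -> float:
    """Aggregate nominal port rate in Gbit/s."""
    return lanes * gbaud * BITS_PER_SYMBOL[modulation]


ports = {
    "QSFP28 (4 x 25G NRZ)": port_rate_gbit(4, 25, "NRZ"),
    "QSFP56 (4 x 50G PAM4)": port_rate_gbit(4, 25, "PAM4"),
    "QSFP-DD (8 x 50G PAM4)": port_rate_gbit(8, 25, "PAM4"),
}
for name, rate in ports.items():
    print(f"{name}: {rate:.0f} Gbit/s")
```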

Probably a newbie question, but why is ib_read_lat slower than ib_write_lat (the write takes roughly half the time)? Does it measure the write only until the write happens, rather than until the acknowledgment comes back? Or is there some other reason?
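One way to check the "about half" observation against the latency tables in this thread: the small-message ratio of read to write latency is close to 2 on both setups. This is consistent with (but does not prove) the explanation that the write test is a ping-pong reporting half of the round trip, while a single RDMA read is itself a full round trip; that reading of perftest's reporting is an assumption here.

```python
# t_typical [usec] for 2-byte messages, copied from the tables in this thread
small_msg_latency = {
    "200GbE RoCE (ConnectX-6)": {"read": 2.65, "write": 1.30},
    "40G QDR (ConnectX/ConnectX-3)": {"read": 1.75, "write": 0.88},
}

for setup, lat in small_msg_latency.items():
    ratio = lat["read"] / lat["write"]
    print(f"{setup}: read/write latency ratio = {ratio:.2f}")
```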

Nice numbers, by the way; you can have a lot of fun with gear like this.

Nice, nice, but what's driving it? :) I'm jealous again. How long would it take to transfer all of Columbo? :D

Edited: 2021-02-04, Thu, 10:42

QSFP+ / QDR, ConnectX-2/3, PCIe gen2/3:

This one doesn't use FEC yet; FEC increases latency.
How long is your DAC cable? Is FEC already mandatory on it?

With a decent 2 m cable you don't need FEC.
Some claim it can be turned off even at 3 m:
https://blog.mellanox.com/2018/04/mellanox-ofc-2018-mellanox-live-demo-…
and that the RS-FEC latency may be 2x250 ns.
Mellanox gear typically supports other FEC modes too, e.g. Fire Code, which is somewhat faster.
For QSFP28, several people claim that 3 m without FEC (26 AWG) is OK.
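A rough consistency check of the 2x250 ns figure against the ib_write_lat tables in this thread: the small-message difference between the 200GbE setup (with RS-FEC) and the 40G setup (without FEC). The two setups also differ in NIC generation, PCIe, CPU and protocol, so any match is only suggestive.

```python
RS_FEC_ESTIMATE_US = 2 * 0.250  # the "2x250 ns" RS-FEC figure quoted above

write_lat_200g = 1.30  # ib_write_lat t_typical, 2 bytes, ConnectX-6 with RS-FEC
write_lat_40g = 0.88   # ib_write_lat t_typical, 2 bytes, QDR setup without FEC

delta = write_lat_200g - write_lat_40g
share = RS_FEC_ESTIMATE_US / write_lat_200g
print(f"measured difference: {delta:.2f} us, RS-FEC estimate: {RS_FEC_ESTIMATE_US:.2f} us")
print(f"RS-FEC estimate as a share of the 200G write latency: {share:.0%}")
```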

# ib_write_lat

---------------------------------------------------------------------------------------
 #bytes #iterations    t_min[usec]    t_max[usec]  t_typical[usec]    t_avg[usec]    t_stdev[usec]   99% percentile[usec]   99.9% percentile[usec] 
 2       1000          0.81           5.94         0.88     	       0.91        	0.16   		1.10    		5.94   
 4       1000          0.81           6.51         0.88     	       0.90        	0.17   		1.11    		6.51   
 8       1000          0.83           7.21         0.88     	       0.91        	0.20   		1.10    		7.21   
 16      1000          0.82           8.42         0.89     	       0.91        	0.18   		1.14    		8.42   
 32      1000          0.85           6.67         0.94     	       0.96        	0.21   		1.15    		6.67   
 64      1000          0.90           7.93         0.97     	       0.99        	0.24   		1.18    		7.93   
 128     1000          1.03           8.39         1.09     	       1.11        	0.25   		1.30    		8.39   
 256     1000          1.96           8.65         2.03     	       2.08        	0.26   		2.35    		8.65   
 512     1000          2.19           7.94         2.28     	       2.33        	0.31   		2.84    		7.94   
 1024    1000          2.58           8.75         2.67     	       2.71        	0.30   		3.67    		8.75   
 2048    1000          3.40           8.41         3.49     	       3.54        	0.29   		5.37    		8.41   
 4096    1000          4.04           10.03        4.17     	       4.25        	0.37   		5.94    		10.03  
 8192    1000          5.13           10.09        5.46     	       5.55        	0.30   		6.84    		10.09  
 16384   1000          7.58           14.16        7.91     	       7.99        	0.49   		11.23   		14.16  
 32768   1000          12.57          18.12        12.96    	       13.00       	0.38   		15.29   		18.12  
 65536   1000          22.87          32.33        23.05    	       23.09       	0.20   		23.63   		32.33  
 131072  1000          43.04          48.95        43.28    	       43.45       	0.66   		47.03   		48.95  
 262144  1000          83.42          87.29        83.64    	       83.69       	0.26   		84.20   		87.29  
 524288  1000          164.14         169.87       164.45   	       164.49      	0.27   		165.15  		169.87 
 1048576 1000          325.65         329.35       326.03   	       326.08      	0.30   		327.62  		329.35 
 2097152 1000          648.70         656.20       649.54   	       649.67      	0.72   		653.58  		656.20 
 4194304 1000          1294.62        1304.04      1296.07  	       1296.17     	0.71   		1298.89 		1304.04
 8388608 1000          2613.64        2642.17      2631.10  	       2631.24     	1.44   		2636.03 		2642.17

# ib_read_lat

---------------------------------------------------------------------------------------
 #bytes #iterations    t_min[usec]    t_max[usec]  t_typical[usec]    t_avg[usec]    t_stdev[usec]   99% percentile[usec]   99.9% percentile[usec] 
 2       1000          1.70           5.88         1.75     	       1.79        	0.16   		2.23    		5.88   
 4       1000          1.71           5.95         1.76     	       1.81        	0.14   		2.24    		5.95   
 8       1000          1.72           4.62         1.78     	       1.81        	0.10   		2.23    		4.62   
 16      1000          1.70           7.95         1.73     	       1.77        	0.12   		2.19    		7.95   
 32      1000          1.71           5.31         1.74     	       1.78        	0.12   		2.15    		5.31   
 64      1000          1.67           7.11         1.76     	       1.80        	0.12   		2.19    		7.11   
 128     1000          1.78           9.94         1.82     	       1.86        	0.13   		2.34    		9.94   
 256     1000          1.88           6.63         1.93     	       1.97        	0.11   		2.37    		6.63   
 512     1000          2.13           6.20         2.18     	       2.23        	0.14   		2.64    		6.20   
 1024    1000          2.53           9.66         2.57     	       2.62        	0.22   		3.01    		9.66   
 2048    1000          3.33           10.38        3.38     	       3.42        	0.18   		3.86    		10.38  
 4096    1000          3.94           11.33        4.01     	       4.05        	0.33   		4.49    		11.33  
 8192    1000          5.16           12.39        5.23     	       5.33        	0.35   		6.05    		12.39  
 16384   1000          7.43           14.33        7.68     	       7.81        	0.25   		8.56    		14.33  
 32768   1000          12.30          18.21        12.79    	       12.74       	0.37   		13.57   		18.21  
 65536   1000          22.47          27.71        22.55    	       22.61       	0.17   		23.06   		27.71  
 131072  1000          42.44          50.41        42.58    	       42.67       	0.40   		43.32   		50.41  
 262144  1000          82.30          88.90        82.45    	       82.52       	0.37   		83.19   		88.90  
 524288  1000          162.13         168.88       162.28   	       162.36      	0.39   		163.20  		168.88 
 1048576 1000          321.65         328.88       321.93   	       321.99      	0.38   		322.74  		328.88 
 2097152 1000          640.69         646.62       641.29   	       641.33      	0.36   		643.05  		646.62 
 4194304 1000          1278.99        1307.72      1279.90  	       1279.99     	0.92   		1282.17 		1307.72
 8388608 1000          2555.37        2567.06      2557.46  	       2557.62     	0.95   		2560.52 		2567.06

The response times look better for small packets, but for larger packets bandwidth is what counts.
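The claim can be made concrete with the t_typical columns: a short Python sketch that finds the first message size at which the 200GbE setup's latency drops below the 40G setup's. It only compares the two specific setups in this thread, which also differ in NIC generation, CPU and FEC.

```python
# t_typical [usec] per message size, copied from the tables in this thread
sizes = [2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096]
write_200g = [1.30, 1.30, 1.30, 1.30, 1.32, 1.33, 1.38, 2.08, 2.12, 2.22, 2.36, 2.66]
write_40g = [0.88, 0.88, 0.88, 0.89, 0.94, 0.97, 1.09, 2.03, 2.28, 2.67, 3.49, 4.17]
read_200g = [2.65, 2.64, 2.65, 2.65, 2.65, 2.66, 2.70, 2.74, 2.80, 2.88, 3.07, 3.38]
read_40g = [1.75, 1.76, 1.78, 1.73, 1.74, 1.76, 1.82, 1.93, 2.18, 2.57, 3.38, 4.01]


def crossover(high_bw, low_bw):
    """First message size at which the higher-bandwidth link has the lower latency."""
    for size, high, low in zip(sizes, high_bw, low_bw):
        if high < low:
            return size
    return None


print("write crossover:", crossover(write_200g, write_40g), "bytes")
print("read crossover:", crossover(read_200g, read_40g), "bytes")
```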

ib_read_bw

---------------------------------------------------------------------------------------
 #bytes     #iterations    BW peak[MB/sec]    BW average[MB/sec]   MsgRate[Mpps]
 2          1000             6.04               5.91              3.099655
 4          1000             12.39              12.23             3.206630
 8          1000             24.56              24.11             3.160711
 16         1000             49.51              48.64             3.187653
 32         1000             97.76              95.79             3.138737
 64         1000             208.36             194.84            3.192182
 128        1000             379.81             375.05            3.072390
 256        1000             776.82             766.43            3.139281
 512        1000             1529.26            1500.69           3.073417
 1024       1000             2907.27            2857.69           2.926272
 2048       1000             3099.51            3092.07           1.583142
 4096       1000             3118.43            3116.47           0.797816
 8192       1000             3124.32            3122.52           0.399682
 16384      1000             3125.64            3125.57           0.200036
 32768      1000             3131.76            3131.70           0.100214
 65536      1000             3131.81            3131.81           0.050109
 131072     1000             3132.74            3132.74           0.025062
 262144     1000             3132.77            3132.66           0.012531
 524288     1000             3132.43            3132.43           0.006265
 1048576    1000             3130.04            3130.01           0.003130
 2097152    1000             3130.06            3129.72           0.001565
 4194304    1000             3133.08            3132.71           0.000783
 8388608    1000             3132.58            3132.58           0.000392

ib_write_bw

---------------------------------------------------------------------------------------
 #bytes     #iterations    BW peak[MB/sec]    BW average[MB/sec]   MsgRate[Mpps]
 2          5000             6.44               6.39   		   3.350616
 4          5000             12.84              12.75  		   3.341600
 8          5000             25.91              25.81  		   3.382843
 16         5000             52.09              51.45  		   3.371902
 32         5000             103.72             102.88 		   3.371066
 64         5000             207.24             204.10 		   3.343947
 128        5000             412.64             408.28 		   3.344640
 256        5000             831.96             823.11 		   3.371450
 512        5000             1627.38            1610.81		   3.298936
 1024       5000             3048.47            3041.68		   3.114685
 2048       5000             3059.80            3057.74		   1.565565
 4096       5000             3062.97            3062.34		   0.783960
 8192       5000             3064.21            3064.10		   0.392205
 16384      5000             3064.67            3064.62		   0.196136
 32768      5000             3064.99            3064.96		   0.098079
 65536      5000             3065.13            3065.09		   0.049042
 131072     5000             3065.12            3065.11		   0.024521
 262144     5000             3065.24            3065.20		   0.012261
 524288     5000             3065.25            3065.22		   0.006130
 1048576    5000             3065.12            3064.82		   0.003065
 2097152    5000             3065.22            3065.21		   0.001533
 4194304    5000             3065.28            3065.24		   0.000766
 8388608    5000             3065.25            3065.25		   0.000383

Home-lab gear from eBay ;-)
Machine 1:
Mellanox Technologies MT25408A0-FCC-QI ConnectX, Dual Port 40Gb/s InfiniBand / 10GigE Adapter IC with PCIe 2.0 x8 5.0GT/s In... (rev b0)
Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz

Machine 2:
Mellanox Technologies MT27500 Family [ConnectX-3]
AMD Ryzen 9 3900X 12-Core Processor
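For context on the ~3.1 GB/s plateau in the 40G bandwidth tables, a Python sketch of the link ceilings for this hardware. The encoding ratios (8b/10b for PCIe 2.0 and QDR InfiniBand, 128b/130b for PCIe 3.0) are standard; packet and TLP overhead is ignored, the ConnectX-3 is assumed to sit in a PCIe 3.0 x8 slot, and whether perftest's "MB" means 2^20 or 10^6 bytes is an assumption, so both are shown.

```python
def pcie_gbit(gt_per_s: float, lanes: int, payload_bits: int, line_bits: int) -> float:
    """Per-direction PCIe data rate after line encoding (TLP/DLLP overhead ignored)."""
    return gt_per_s * lanes * payload_bits / line_bits


pcie2_x8 = pcie_gbit(5.0, 8, 8, 10)      # machine 1: PCIe 2.0 x8, 8b/10b
pcie3_x8 = pcie_gbit(8.0, 8, 128, 130)   # machine 2 (assumed PCIe 3.0 x8), 128b/130b
qdr_4x = 4 * 10.0 * 8 / 10               # QDR InfiniBand: 4 lanes x 10 Gbit/s, 8b/10b

ceiling = min(pcie2_x8, pcie3_x8, qdr_4x)
measured_mb_s = 3065.25                  # 40G ib_write_bw average at 8388608 bytes

for label, bytes_per_mb in [("binary MB", 2**20), ("decimal MB", 10**6)]:
    measured_gbit = measured_mb_s * bytes_per_mb * 8 / 1e9
    print(f"{label}: {measured_gbit:.1f} of {ceiling:.0f} Gbit/s ({measured_gbit / ceiling:.0%})")
```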

NVMe-oF and NFS over RDMA run at close to the bandwidth shown here, even with an ancient Q9300 CPU.

Dual-port cards, 3 machines connected directly (DAC).

(During the test, CPU-intensive BOINC/Rosetta was running in parallel.)

What can't be written in assembly can't be written.

100GbE Mellanox-to-Mellanox Ethernet connections always enable standard Reed Solomon (RS) FEC on all cables.

So there is FEC, since it's mandatory anyway. This is a 2-meter cable.

Better latency on paper is no use if it doesn't matter in practice: on your setup 4K takes about 4 us, and none of our actual devices can do that. The Optanes deliver around 10-12 us and our Samsung NVMe drives several times that, so even if the network could go faster, it wouldn't help.
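The argument in numbers: a rough Python estimate of a remote 4 KiB read as storage device time plus network time. Assumptions: the Optane figure is the 11 us midpoint of the quoted 10-12 us, the network figures are the 4096-byte ib_read_lat t_typical values from this thread, and device and network times simply add.

```python
# Rough end-to-end latency of a remote 4 KiB read = storage device + network.
OPTANE_US = 11.0  # assumption: midpoint of the quoted 10-12 us
network_4k_read_us = {"40G QDR": 4.01, "200GbE RoCE": 3.38}  # ib_read_lat t_typical, 4096 B

for name, net in network_4k_read_us.items():
    total = OPTANE_US + net
    print(f"{name}: {total:.2f} us total, network share {net / total:.0%}")

saving = network_4k_read_us["40G QDR"] - network_4k_read_us["200GbE RoCE"]
baseline = OPTANE_US + network_4k_read_us["40G QDR"]
print(f"the faster link saves {saving:.2f} us ({saving / baseline:.1%} of the end-to-end time)")
```

With slower NVMe drives the device term grows and the network's share shrinks further.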

https://docs.mellanox.com/display/MLNXOFEDv461000/Ethtool

ethtool --set-fec <dev> encoding off

I think that "always" is just the default. Here they show 100GbE under 1 usec:
https://www.mellanox.com/related-docs/whitepapers/WP_RoCE_vs_iWARP.pdf

IB still has lower latency than Ethernet/IP when switches are involved,
but it looks like with a direct connection RoCE can get fairly close to IB.
An Ethernet switch's TCAM lookup is slower and more complex than a lookup on a 16-bit subnet address.
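To illustrate the difference that last sentence points at, a deliberately simplified Python model with hypothetical classes (not any switch's actual design): an InfiniBand-style linear forwarding table indexed directly by the 16-bit destination LID, versus a TCAM-style ternary match over IP prefixes, where entry order stands in for longest-prefix priority. A hardware TCAM compares all entries in parallel; the loop here only models the matching semantics.

```python
from typing import List, Optional, Tuple


class LinearForwardingTable:
    """Hypothetical IB-style table: the 16-bit destination LID is used directly as an index."""

    def __init__(self) -> None:
        self.ports: List[Optional[int]] = [None] * 65536

    def set_route(self, lid: int, port: int) -> None:
        self.ports[lid] = port

    def lookup(self, lid: int) -> Optional[int]:
        return self.ports[lid]  # a single indexed read


class TernaryTable:
    """Hypothetical TCAM model: (value, mask, port) entries, first match wins."""

    def __init__(self) -> None:
        self.entries: List[Tuple[int, int, int]] = []

    def add(self, value: int, mask: int, port: int) -> None:
        self.entries.append((value & mask, mask, port))

    def lookup(self, key: int) -> Optional[int]:
        for value, mask, port in self.entries:
            if (key & mask) == value:
                return port
        return None


lft = LinearForwardingTable()
lft.set_route(0x0012, 3)

tcam = TernaryTable()
tcam.add(0x0A000100, 0xFFFFFF00, 5)  # 10.0.1.0/24, longer prefix inserted first
tcam.add(0x0A000000, 0xFFFF0000, 7)  # 10.0.0.0/16

print(lft.lookup(0x0012), tcam.lookup(0x0A000105), tcam.lookup(0x0A00FF01))
```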
