How worried should I be? LSI MegaRAID SAS 1078

Hi,

I have a Dell PowerEdge 2900 server running Ubuntu Hardy:
Linux zeus 2.6.24-21-server #1 SMP Tue Oct 21 23:40:13 UTC 2008 x86_64 GNU/Linux

Lately, messages like these have been showing up in dmesg:

[6635057.902060] dsm_sa_datamgr3: page allocation failure. order:4, mode:0x10d4
[6635057.902067] Pid: 5327, comm: dsm_sa_datamgr3 Not tainted 2.6.24-21-server #1
[6635057.902068]
[6635057.902069] Call Trace:
[6635057.902093] [] __alloc_pages+0x2fd/0x3d0
[6635057.902102] [] dma_alloc_pages+0xb1/0x100
[6635057.902106] [] dma_alloc_coherent+0x92/0x200
[6635057.902114] [] :megaraid_sas:megasas_mgmt_ioctl_fw+0x1e4/0x440
[6635057.902126] [] :megaraid_sas:megasas_mgmt_compat_ioctl+0x195/0x1d0
[6635057.902131] [] compat_sys_ioctl+0x17d/0x3f0
[6635057.902136] [] dput+0x30/0x130
[6635057.902142] [] __down_read+0x12/0xb1
[6635057.902150] [] sysenter_do_call+0x1b/0x67
[6635057.902153] [] dummy_sem_semctl+0x0/0x10
[6635057.902158]
[6635057.902160] Mem-info:
[6635057.902162] Node 0 DMA per-cpu:
[6635057.902165] CPU 0: Hot: hi: 0, btch: 1 usd: 0 Cold: hi: 0, btch: 1 usd: 0
[6635057.902168] CPU 1: Hot: hi: 0, btch: 1 usd: 0 Cold: hi: 0, btch: 1 usd: 0
[6635057.902170] CPU 2: Hot: hi: 0, btch: 1 usd: 0 Cold: hi: 0, btch: 1 usd: 0
[6635057.902172] CPU 3: Hot: hi: 0, btch: 1 usd: 0 Cold: hi: 0, btch: 1 usd: 0
[6635057.902176] CPU 4: Hot: hi: 0, btch: 1 usd: 0 Cold: hi: 0, btch: 1 usd: 0
[6635057.902178] CPU 5: Hot: hi: 0, btch: 1 usd: 0 Cold: hi: 0, btch: 1 usd: 0
[6635057.902181] CPU 6: Hot: hi: 0, btch: 1 usd: 0 Cold: hi: 0, btch: 1 usd: 0
[6635057.902184] CPU 7: Hot: hi: 0, btch: 1 usd: 0 Cold: hi: 0, btch: 1 usd: 0
[6635057.902186] Node 0 DMA32 per-cpu:
[6635057.902188] CPU 0: Hot: hi: 186, btch: 31 usd: 0 Cold: hi: 62, btch: 15 usd: 0
[6635057.902191] CPU 1: Hot: hi: 186, btch: 31 usd: 0 Cold: hi: 62, btch: 15 usd: 0
[6635057.902194] CPU 2: Hot: hi: 186, btch: 31 usd: 0 Cold: hi: 62, btch: 15 usd: 0
[6635057.902196] CPU 3: Hot: hi: 186, btch: 31 usd: 0 Cold: hi: 62, btch: 15 usd: 0
[6635057.902200] CPU 4: Hot: hi: 186, btch: 31 usd: 0 Cold: hi: 62, btch: 15 usd: 0
[6635057.902202] CPU 5: Hot: hi: 186, btch: 31 usd: 0 Cold: hi: 62, btch: 15 usd: 0
[6635057.902206] CPU 6: Hot: hi: 186, btch: 31 usd: 0 Cold: hi: 62, btch: 15 usd: 0
[6635057.902208] CPU 7: Hot: hi: 186, btch: 31 usd: 0 Cold: hi: 62, btch: 15 usd: 0
[6635057.902211] Node 0 Normal per-cpu:
[6635057.902213] CPU 0: Hot: hi: 186, btch: 31 usd: 0 Cold: hi: 62, btch: 15 usd: 0
[6635057.902216] CPU 1: Hot: hi: 186, btch: 31 usd: 0 Cold: hi: 62, btch: 15 usd: 0
[6635057.902219] CPU 2: Hot: hi: 186, btch: 31 usd: 0 Cold: hi: 62, btch: 15 usd: 0
[6635057.902226] CPU 3: Hot: hi: 186, btch: 31 usd: 0 Cold: hi: 62, btch: 15 usd: 0
[6635057.902228] CPU 4: Hot: hi: 186, btch: 31 usd: 0 Cold: hi: 62, btch: 15 usd: 0
[6635057.902231] CPU 5: Hot: hi: 186, btch: 31 usd: 0 Cold: hi: 62, btch: 15 usd: 0
[6635057.902234] CPU 6: Hot: hi: 186, btch: 31 usd: 0 Cold: hi: 62, btch: 15 usd: 0
[6635057.902237] CPU 7: Hot: hi: 186, btch: 31 usd: 0 Cold: hi: 62, btch: 15 usd: 0
[6635057.902241] Active:3258321 inactive:1691740 dirty:1709 writeback:0 unstable:0
[6635057.902242] free:32132 slab:117396 mapped:476690 pagetables:8598 bounce:0
[6635057.902245] Node 0 DMA free:10856kB min:8kB low:8kB high:12kB active:0kB inactive:0kB present:10412kB pages_scanned:0 all_unreclaimable? no
[6635057.902249] lowmem_reserve[]: 0 2995 20165 20165
[6635057.902253] Node 0 DMA32 free:84456kB min:2696kB low:3368kB high:4044kB active:1528232kB inactive:1302304kB present:3067424kB pages_scanned:0 all_unreclaimable? no
[6635057.902258] lowmem_reserve[]: 0 0 17170 17170
[6635057.902262] Node 0 Normal free:33216kB min:15468kB low:19332kB high:23200kB active:11505052kB inactive:5464656kB present:17582080kB pages_scanned:0 all_unreclaimable? no
[6635057.902268] lowmem_reserve[]: 0 0 0 0
[6635057.902274] Node 0 DMA: 2*4kB 4*8kB 2*16kB 3*32kB 5*64kB 1*128kB 4*256kB 2*512kB 2*1024kB 1*2048kB 1*4096kB = 10856kB
[6635057.902285] Node 0 DMA32: 19092*4kB 880*8kB 75*16kB 7*32kB 0*64kB 0*128kB 0*256kB 1*512kB 0*1024kB 0*2048kB 0*4096kB = 85344kB
[6635057.902293] Node 0 Normal: 4276*4kB 954*8kB 83*16kB 0*32kB 0*64kB 1*128kB 3*256kB 0*512kB 1*1024kB 1*2048kB 1*4096kB = 34128kB
[6635057.902306] Swap cache: add 230046, delete 96365, find 233235/260611, race 0+23
[6635057.902309] Free swap = 11036384kB
[6635057.902310] Total swap = 11595768kB
[6635057.902311] Free swap: 11036384kB
[6635058.008584] 5505024 pages of RAM
[6635058.008589] 356610 reserved pages
[6635058.008590] 2769428 pages shared
[6635058.008591] 133681 pages swap cached
[6635058.008593] megasas: Failed to alloc kernel SGL buffer for IOCTL
[6635058.009709] dsm_sa_datamgr3: page allocation failure. order:4, mode:0x10d4
[6635058.009716] Pid: 5327, comm: dsm_sa_datamgr3 Not tainted 2.6.24-21-server #1
[6635058.009718]
[6635058.009719] Call Trace:
[6635058.009745] [] __alloc_pages+0x2fd/0x3d0
[6635058.009756] [] dma_alloc_pages+0xb1/0x100
[6635058.009760] [] dma_alloc_coherent+0x92/0x200
[6635058.009768] [] :megaraid_sas:megasas_mgmt_ioctl_fw+0x1e4/0x440
[6635058.009781] [] :megaraid_sas:megasas_mgmt_compat_ioctl+0x195/0x1d0
[6635058.009787] [] compat_sys_ioctl+0x17d/0x3f0
[6635058.009791] [] dput+0x30/0x130
[6635058.009797] [] __down_read+0x12/0xb1
[6635058.009804] [] sysenter_do_call+0x1b/0x67
[6635058.009809] [] dummy_sem_semctl+0x0/0x10
[6635058.009815]
[6635058.009816] Mem-info:
[6635058.009817] Node 0 DMA per-cpu:
[6635058.009819] CPU 0: Hot: hi: 0, btch: 1 usd: 0 Cold: hi: 0, btch: 1 usd: 0
[6635058.009823] CPU 1: Hot: hi: 0, btch: 1 usd: 0 Cold: hi: 0, btch: 1 usd: 0
[6635058.009826] CPU 2: Hot: hi: 0, btch: 1 usd: 0 Cold: hi: 0, btch: 1 usd: 0
[6635058.009829] CPU 3: Hot: hi: 0, btch: 1 usd: 0 Cold: hi: 0, btch: 1 usd: 0
[6635058.009832] CPU 4: Hot: hi: 0, btch: 1 usd: 0 Cold: hi: 0, btch: 1 usd: 0
[6635058.009835] CPU 5: Hot: hi: 0, btch: 1 usd: 0 Cold: hi: 0, btch: 1 usd: 0
[6635058.009838] CPU 6: Hot: hi: 0, btch: 1 usd: 0 Cold: hi: 0, btch: 1 usd: 0
[6635058.009842] CPU 7: Hot: hi: 0, btch: 1 usd: 0 Cold: hi: 0, btch: 1 usd: 0
[6635058.009844] Node 0 DMA32 per-cpu:
[6635058.009847] CPU 0: Hot: hi: 186, btch: 31 usd: 0 Cold: hi: 62, btch: 15 usd: 0
[6635058.009850] CPU 1: Hot: hi: 186, btch: 31 usd: 0 Cold: hi: 62, btch: 15 usd: 0
[6635058.009853] CPU 2: Hot: hi: 186, btch: 31 usd: 0 Cold: hi: 62, btch: 15 usd: 0
[6635058.009857] CPU 3: Hot: hi: 186, btch: 31 usd: 0 Cold: hi: 62, btch: 15 usd: 49
[6635058.009860] CPU 4: Hot: hi: 186, btch: 31 usd: 0 Cold: hi: 62, btch: 15 usd: 0
[6635058.009863] CPU 5: Hot: hi: 186, btch: 31 usd: 0 Cold: hi: 62, btch: 15 usd: 0
[6635058.009866] CPU 6: Hot: hi: 186, btch: 31 usd: 0 Cold: hi: 62, btch: 15 usd: 0
[6635058.009873] CPU 7: Hot: hi: 186, btch: 31 usd: 0 Cold: hi: 62, btch: 15 usd: 0
[6635058.009875] Node 0 Normal per-cpu:
[6635058.009878] CPU 0: Hot: hi: 186, btch: 31 usd: 0 Cold: hi: 62, btch: 15 usd: 0
[6635058.009881] CPU 1: Hot: hi: 186, btch: 31 usd: 0 Cold: hi: 62, btch: 15 usd: 0
[6635058.009884] CPU 2: Hot: hi: 186, btch: 31 usd: 0 Cold: hi: 62, btch: 15 usd: 0
[6635058.009888] CPU 3: Hot: hi: 186, btch: 31 usd: 0 Cold: hi: 62, btch: 15 usd: 0
[6635058.009891] CPU 4: Hot: hi: 186, btch: 31 usd: 0 Cold: hi: 62, btch: 15 usd: 0
[6635058.009894] CPU 5: Hot: hi: 186, btch: 31 usd: 0 Cold: hi: 62, btch: 15 usd: 0
[6635058.009897] CPU 6: Hot: hi: 186, btch: 31 usd: 0 Cold: hi: 62, btch: 15 usd: 0
[6635058.009899] CPU 7: Hot: hi: 186, btch: 31 usd: 0 Cold: hi: 62, btch: 15 usd: 0
[6635058.009903] Active:3258085 inactive:1691458 dirty:1710 writeback:14 unstable:0
[6635058.009904] free:32493 slab:117396 mapped:476669 pagetables:8598 bounce:0
[6635058.009906] Node 0 DMA free:10856kB min:8kB low:8kB high:12kB active:0kB inactive:0kB present:10412kB pages_scanned:0 all_unreclaimable? yes
[6635058.009915] lowmem_reserve[]: 0 2995 20165 20165
[6635058.009919] Node 0 DMA32 free:86004kB min:2696kB low:3368kB high:4044kB active:1527288kB inactive:1301176kB present:3067424kB pages_scanned:0 all_unreclaimable? no
[6635058.009922] lowmem_reserve[]: 0 0 17170 17170
[6635058.009924] Node 0 Normal free:33308kB min:15468kB low:19332kB high:23200kB active:11505052kB inactive:5464656kB present:17582080kB pages_scanned:0 all_unreclaimable? no
[6635058.009927] lowmem_reserve[]: 0 0 0 0
[6635058.009929] Node 0 DMA: 2*4kB 4*8kB 2*16kB 3*32kB 5*64kB 1*128kB 4*256kB 2*512kB 2*1024kB 1*2048kB 1*4096kB = 10856kB
[6635058.009935] Node 0 DMA32: 19153*4kB 939*8kB 97*16kB 14*32kB 3*64kB 0*128kB 0*256kB 1*512kB 0*1024kB 0*2048kB 0*4096kB = 86828kB
[6635058.009940] Node 0 Normal: 4270*4kB 954*8kB 83*16kB 0*32kB 0*64kB 1*128kB 3*256kB 0*512kB 1*1024kB 1*2048kB 1*4096kB = 34104kB
[6635058.009956] Swap cache: add 230159, delete 96404, find 233235/260611, race 0+23
[6635058.009959] Free swap = 11035932kB
[6635058.009960] Total swap = 11595768kB
[6635058.009962] Free swap: 11035932kB
[6635058.119257] 5505024 pages of RAM
[6635058.119265] 356610 reserved pages
[6635058.119266] 2767498 pages shared
[6635058.119267] 134110 pages swap cached
[6635058.119269] megasas: Failed to alloc kernel SGL buffer for IOCTL

What could this be? How worried do you think I should be?
Thanks,
Sz.

Comments

How long has the machine been up? Maybe try a reboot?
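The dmesg timestamps give a rough answer to the uptime question: the number in square brackets is seconds since boot, so `[6635057.902060]` works out to about 76 days at the time of the first failure. A minimal sketch of the conversion (the helper name `dmesg_uptime` is made up for illustration):

```python
from datetime import timedelta


def dmesg_uptime(stamp: str) -> timedelta:
    """Convert a '[seconds.micros]' dmesg prefix to time elapsed since boot."""
    return timedelta(seconds=float(stamp.strip("[]")))


# Timestamp taken from the first "page allocation failure" line in the post.
uptime = dmesg_uptime("[6635057.902060]")
print(uptime.days)  # 76
```

This is only an approximation, since the kernel's log clock is not guaranteed to track wall-clock time exactly.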

My other suggestion: could you try a newer kernel?

No, everything is in optimal state.
megasasctl
a0 PERC 6/i Integrated encl:1 ldrv:1 batt:good
a0d0 272GiB RAID 5 1x3 optimal
a0e32s0 136GiB a0d0 online
a0e32s1 136GiB a0d0 online
a0e32s2 136GiB a0d0 online

megaraidsas-status
-- Arrays informations --
-- ID | Type | Size | Status
a0d0 | RAID 5 | 272GiB | optimal

-- Disks informations
-- ID | Model | Status | Warnings
a0e32s0 | HITACHI HUS153014VLS300 136GiB | online
a0e32s1 | HITACHI HUS153014VLS300 136GiB | online
a0e32s2 | HITACHI HUS153014VLS300 136GiB | online

Then I would reboot with the current kernel (after taking a backup, which can't be stressed enough), to rule out the system having gone haywire during long use. If the problem comes back after roughly the same uptime as now, switch kernels; if it comes back almost immediately, phone support and ask them wtf.

Well, a backup never hurts.

I haven't seen this exact thing before.

Some memory allocation failed:
megasas: Failed to alloc kernel SGL buffer for IOCTL

But whether it is trying to allocate 3 bytes or 3 gigabytes, I don't know.

--
Even an infinite loop ends eventually, you just need powerful enough hardware!
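The size of the failed request can be read from the `order:4` field in the first line of the trace. The page allocator hands out blocks of 2^order contiguous pages, so assuming the usual 4 KiB page size on x86_64, this is a 64 KiB physically contiguous allocation. The per-zone "Node 0 ...:" lines list how many free blocks of each size exist. A rough sketch for reading them (the function names are mine, not from any tool):

```python
import re

PAGE_KIB = 4  # assumption: 4 KiB pages, the normal case on x86_64


def order_to_kib(order: int) -> int:
    """Size in KiB of one contiguous block of the given buddy order."""
    return (2 ** order) * PAGE_KIB


def free_blocks_at_least(buddy_line: str, order: int) -> int:
    """Count free blocks of at least the given order in one per-zone line."""
    need_kib = order_to_kib(order)
    pairs = re.findall(r"(\d+)\*(\d+)kB", buddy_line)
    return sum(int(count) for count, size in pairs if int(size) >= need_kib)


# DMA32 line copied from the first trace in the post.
dma32 = ("Node 0 DMA32: 19092*4kB 880*8kB 75*16kB 7*32kB 0*64kB 0*128kB "
         "0*256kB 1*512kB 0*1024kB 0*2048kB 0*4096kB = 85344kB")

print(order_to_kib(4))                 # 64
print(free_blocks_at_least(dma32, 4))  # 1
```

So almost all free memory in that zone is in 4 to 32 kB fragments, with a single larger block left. As far as I know, a free block of the right size does not by itself guarantee the allocation succeeds, because the kernel also keeps per-zone reserves, so treat this as a way to see the fragmentation rather than a full explanation of the failure.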

Interesting... I'm in favor of backup + reboot + then a kernel change.

This post prompted me to look at megasasctl as well, and this is what I found:

# megasasctl
a0 PERC 6/i Integrated encl:1 ldrv:1 batt:good
a0d0 930GiB RAID 5 1x3 optimal
a0e32s0 465GiB a0d0 online errs: media:71 other:0
a0e32s1 465GiB a0d0 online
a0e32s2 465GiB a0d0 online

# megasasctl -e
a0 PERC 6/i Integrated encl:1 ldrv:1 batt:good
a0d0 930GiB RAID 5 1x3 optimal
a0e32s0 465GiB a0d0 online errs: media:71 other:0
write errors: corr: 0 delay: 0 rewrit: 0 tot/corr: 0 tot/uncorr: 0
read errors: corr: 21Mi delay: 17 reread: 0 tot/corr: 21Mi tot/uncorr:124
verify errors: corr: 56Mi delay: 3 revrfy: 0 tot/corr: 56Mi tot/uncorr: 71
a0e32s1 465GiB a0d0 online
write errors: corr: 0 delay: 0 rewrit: 0 tot/corr: 0 tot/uncorr: 0
read errors: corr: 5Mi delay: 0 reread: 0 tot/corr: 5Mi tot/uncorr: 0
verify errors: corr: 11Mi delay: 0 revrfy: 0 tot/corr: 11Mi tot/uncorr: 0
a0e32s2 465GiB a0d0 online
write errors: corr: 0 delay: 0 rewrit: 0 tot/corr: 0 tot/uncorr: 0
read errors: corr: 8Mi delay: 1 reread: 0 tot/corr: 8Mi tot/uncorr: 0
verify errors: corr: 19Mi delay: 0 revrfy: 0 tot/corr: 19Mi tot/uncorr: 0

Dell OMSA doesn't report any problem.

My question is the same as the thread starter's. How worried should I be? =) And what, if anything, should I do?

Back up! ;)
Which module are you using?
# modinfo megaraid_sas
filename: /lib/modules/2.6.24-21-server/kernel/drivers/scsi/megaraid/megaraid_sas.ko
description: LSI Logic MegaRAID SAS Driver
author: megaraidlinux@lsi.com
version: 00.00.03.10-rc5
license: GPL
srcversion: 134488BAF94ABCC2CC734BC
alias: pci:v00001028d00000015sv*sd*bc*sc*i*
alias: pci:v00001000d00000413sv*sd*bc*sc*i*
alias: pci:v00001000d00000060sv*sd*bc*sc*i*
alias: pci:v00001000d00000411sv*sd*bc*sc*i*
depends: scsi_mod
vermagic: 2.6.24-21-server SMP mod_unload

I think in your case "just" replacing a disk is enough.

I have backups. ;)

# modinfo megaraid_sas
filename: /lib/modules/2.6.32-bpo.5-amd64/kernel/drivers/scsi/megaraid/megaraid_sas.ko
description: LSI MegaRAID SAS Driver
author: megaraidlinux@lsi.com
version: 00.00.04.01
license: GPL
srcversion: 43855ED5089F96717EB7F4E
alias: pci:v00001028d00000015sv*sd*bc*sc*i*
alias: pci:v00001000d00000413sv*sd*bc*sc*i*
alias: pci:v00001000d00000079sv*sd*bc*sc*i*
alias: pci:v00001000d00000078sv*sd*bc*sc*i*
alias: pci:v00001000d0000007Csv*sd*bc*sc*i*
alias: pci:v00001000d00000060sv*sd*bc*sc*i*
alias: pci:v00001000d00000411sv*sd*bc*sc*i*
depends: scsi_mod
vermagic: 2.6.32-bpo.5-amd64 SMP mod_unload modversions
parm: poll_mode_io:Complete cmds from IO path, (default=0) (int)

But it says optimal, so is it optimal or not? =D

I've kicked off a Consistency Check on it; maybe it will tell us something.

It may also be worth running Dell PowerEdge Diagnostics. It includes diagnostics for all sorts of hardware devices.
You can download it from here: http://support.dell.com/ (on the support page for the specific machine)
There are builds for SUSE and RHEL; I don't know whether they run properly on Ubuntu. If not, you could put the packages on a CentOS Live CD and run them from there. I think the OMSA CD also has some diagnostics on it; if not, you can install these onto that as well.

kohinoor:
maybe you should use a supported operating system...

Yes, that is useful, because then it's easy to get support from Dell, even if the fault shows up in the operating system.

The other side of it, though, is that the supported operating systems lack a lot of packages, and the packages they do have are old. In some situations they simply aren't suitable, and on top of that they cost money.

This is what I installed (the server runs Debian, by the way):
http://ftp.us.dell.com/diags/dell-onlinediags-linux-2.16.0.165.tar.gz

But it prints this:
# ./pediags sasdevdiag --show test
The device class requested is not present/enumerated.

(I could query the memory and the network card just fine)

Instead I ran a SMART test, which didn't report any errors.

The long test has finished too:

# megasasctl -s -vv a0e32s0
a0e32s0 SEAGATE ST3500620SS rev:MS0A s/n:xxxxxxxx 465GiB a0d0 online errs: media:71 other:1
0: timestamp 305d17h: bg long completed without error seg:16 lba:-1 sk:0 asc:0 ascq:0 vs:0
1: timestamp 305d15h: bg short completed without error seg:16 lba:-1 sk:0 asc:0 ascq:0 vs:0

# smartctl -a -d megaraid,0 /dev/sda
smartctl 5.40 2010-02-03 r3060 [x86_64-unknown-linux-gnu] (local build)
....
Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:   23727170       17         0  23727187   23727311       2544.461         124
write:         0        0         0         0          0      16347.358           0
verify: 59059057        3         0  59059060   59059131      21495.947          71
Non-medium error count: 0

We have a spare HDD, so we'll replace the disk to be on the safe side.
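For anyone who wants to keep an eye on this automatically: in the smartctl error counter log above, the last column ("Total uncorrected errors") is the one that stands out, and the 71 uncorrected verify errors match the `media:71` that megasasctl shows for the same disk. Below is a small sketch that pulls that column out of `smartctl -a` text. The function name and the 8-field row layout are assumptions based on the output pasted in this thread, not a documented format.

```python
def uncorrected_errors(smartctl_text: str) -> dict:
    """Map read/write/verify to the last column of the error counter log."""
    result = {}
    for line in smartctl_text.splitlines():
        fields = line.split()
        # Assumed row shape: label plus 7 numeric columns, as in the post.
        if len(fields) == 8 and fields[0] in ("read:", "write:", "verify:"):
            result[fields[0].rstrip(":")] = int(fields[-1])
    return result


# Rows copied from the smartctl output in the post.
sample = """\
read: 23727170 17 0 23727187 23727311 2544.461 124
write: 0 0 0 0 0 16347.358 0
verify: 59059057 3 0 59059060 59059131 21495.947 71
"""

counts = uncorrected_errors(sample)
print(counts)  # {'read': 124, 'write': 0, 'verify': 71}
print([op for op, n in counts.items() if n > 0])  # ['read', 'verify']
```

In practice the text would come from running `smartctl -a -d megaraid,0 /dev/sda` as shown above, and a non-zero count could be fed into whatever alerting is already in place.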