I'm seeing some rather worrying entries in the log:
Apr 20 12:21:59 ****** kernel: sd 0:2:0:0: [sda] tag#0 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Apr 20 12:21:59 ****** kernel: sd 0:2:0:0: [sda] tag#0 Sense Key : Medium Error [current]
Apr 20 12:21:59 ****** kernel: sd 0:2:0:0: [sda] tag#0 Add. Sense: No additional sense information
Apr 20 12:21:59 ****** kernel: sd 0:2:0:0: [sda] tag#0 CDB: Read(10) 28 00 96 6d 34 60 00 00 08 00
Apr 20 12:21:59 ****** kernel: blk_update_request: I/O error, dev sda, sector 2523739232
Apr 20 12:21:59 ****** kernel: sd 0:2:0:0: [sda] tag#0 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Apr 20 12:21:59 ****** kernel: sd 0:2:0:0: [sda] tag#0 Sense Key : Medium Error [current]
Apr 20 12:21:59 ****** kernel: sd 0:2:0:0: [sda] tag#0 Add. Sense: No additional sense information
Apr 20 12:21:59 ****** kernel: sd 0:2:0:0: [sda] tag#0 CDB: Read(10) 28 00 96 6d 34 60 00 00 08 00
Apr 20 12:21:59 ****** kernel: blk_update_request: I/O error, dev sda, sector 2523739232
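The CDB bytes in these kernel messages can be decoded by hand to check that the failing command and the reported sector agree. A minimal Python sketch, assuming the usual 10-byte READ(10)/WRITE(10) layout (opcode, flags, 4-byte big-endian LBA, group number, 2-byte transfer length, control); the function name `decode_rw10` is made up for illustration:

```python
def decode_rw10(cdb_hex: str) -> dict:
    """Decode a 10-byte SCSI READ(10)/WRITE(10)-style CDB given as hex bytes."""
    cdb = bytes.fromhex(cdb_hex)  # whitespace between bytes is accepted
    if len(cdb) != 10:
        raise ValueError("expected a 10-byte CDB")
    # Only the opcodes that appear in this thread (plus plain WRITE(10)).
    opcodes = {0x28: "READ(10)", 0x2A: "WRITE(10)", 0x2E: "WRITE AND VERIFY(10)"}
    return {
        "opcode": opcodes.get(cdb[0], hex(cdb[0])),
        "lba": int.from_bytes(cdb[2:6], "big"),     # bytes 2..5
        "blocks": int.from_bytes(cdb[7:9], "big"),  # bytes 7..8
    }


info = decode_rw10("28 00 96 6d 34 60 00 00 08 00")
print(info["opcode"], info["lba"], hex(info["lba"]), info["blocks"])
```

The decoded LBA is 2523739232 (0x966d3460), the same number as the `sector` in the `blk_update_request` line, with a transfer length of 8 blocks.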
The array, if I understand correctly, is still doing fine:
# megacli -LDInfo -L0 -a0
Adapter 0 -- Virtual Drive Information:
Virtual Drive: 0 (Target Id: 0)
Name :
RAID Level : Primary-1, Secondary-0, RAID Level Qualifier-0
Size : 1.817 TB
Sector Size : 512
Is VD emulated : Yes
Mirror Data : 1.817 TB
State : Optimal
Strip Size : 64 KB
Number Of Drives : 2
Span Depth : 1
Default Cache Policy: WriteBack, ReadAhead, Cached, Write Cache OK if Bad BBU
Current Cache Policy: WriteBack, ReadAhead, Cached, Write Cache OK if Bad BBU
Default Access Policy: Read/Write
Current Access Policy: Read/Write
Disk Cache Policy : Disk's Default
Encryption Type : None
Bad Blocks Exist: Yes
Is VD Cached: No
What actually worries me is that the errors have been piling up and I'm not familiar with the hardware. At this stage, is the disk still happily repairing itself so there's no point in worrying, or should I tell the people in charge now that an HDD replacement is due soon (immediately?)?
Comments
I'd suggest pulling more detailed info:
MegaCli64 -ldpdinfo -a0
MegaCli64 -AdpAlILog -a0
The adapter log shows whether the controller has recently corrected a disk error ("FATAL:Puncturing bad block").
You can also query the disks directly at the SMART level (where X is the "Device Id"):
smartctl -a -d sat+megaraid,X /dev/sda
Ah, thanks. This is more telling, although it hasn't made me any less clueless :D
39593: 18-03-10,07:26:38 BG Work:Consistency Check progress on VD 00/0 is 99.73%(15998s)
39594: 18-03-10,07:27:40 BG Work:Consistency Check progress on VD 00/0 is 99.98%(16060s)
39595: 18-03-10,07:27:43 Info:Consistency Check done on VD 00/0
39596: 18-03-11,21:04:14 Info:Unexpected sense: PD 00(e0xfc/s0) Path 4433221100000000, CDB: 28 00 96 6d 34 60 00 00 08 00, Sense: F0 00 03 96 6D 34 60 0A 00 00 00 00 11 00 00 00 00 00
39597: 18-03-11,21:04:21 Info:Unexpected sense: PD 00(e0xfc/s0) Path 4433221100000000, CDB: 28 00 91 bf f9 68 00 00 08 00, Sense: F0 00 03 91 BF F9 68 0A 00 00 00 00 11 00 00 00 00 00
39598: 18-03-11,21:04:28 Info:Unexpected sense: PD 00(e0xfc/s0) Path 4433221100000000, CDB: 28 00 96 6d 34 60 00 00 08 00, Sense: F0 00 03 96 6D 34 60 0A 00 00 00 00 11 00 00 00 00 00
39599: 18-03-11,21:04:35 Info:Unexpected sense: PD 01(e0xfc/s1) Path 4433221101000000, CDB: 28 00 96 6d 34 60 00 00 08 00, Sense: F0 00 03 96 6D 34 60 0A 00 00 00 00 11 00 00 00 00 00
39600: 18-03-11,21:04:35 FATAL:Unrecoverable medium error during recovery on PD 00(e0xfc/s0) at 966d3460
39601: 18-03-11,21:04:35 Info:Unexpected sense: PD 00(e0xfc/s0) Path 4433221100000000, CDB: 28 00 91 bf f9 68 00 00 08 00, Sense: F0 00 03 91 BF F9 68 0A 00 00 00 00 11 00 00 00 00 00
39602: 18-03-11,21:04:42 Info:Unexpected sense: PD 01(e0xfc/s1) Path 4433221101000000, CDB: 28 00 02 d7 83 68 00 00 08 00, Sense: F0 00 03 02 D7 83 68 0A 00 00 00 00 11 00 00 00 00 00
39603: 18-03-11,21:04:42 Info:Unexpected sense: PD 00(e0xfc/s0) Path 4433221100000000, CDB: 2e 00 96 6d 34 60 00 00 01 00, Sense: 70 00 07 00 00 00 00 0A 00 00 00 00 27 00 00 00 00 00
39604: 18-03-11,21:04:45 FATAL:Uncorrectable medium error logged for VD 00/0 at 966d3460 (on PD 01(e0xfc/s1) at 966d3460)
39605: 18-03-11,21:04:49 Info:Unexpected sense: PD 00(e0xfc/s0) Path 4433221100000000, CDB: 28 00 92 cf 38 e0 00 00 08 00, Sense: F0 00 03 92 CF 38 E0 0A 00 00 00 00 11 00 00 00 00 00
39606: 18-03-11,21:04:55 Info:Corrected medium error during recovery on PD 00(e0xfc/s0) at 91bff968
From this it looks to me as if something is broken somewhere.
Searching for "fail" also turns up entries like these:
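The `Sense:` bytes in the controller log can be unpacked to see what each disk actually reported. A rough Python sketch, assuming the bytes are SCSI fixed-format sense data (response code 0x70/0x71, sense key in the low nibble of byte 2, information in bytes 3..6, ASC/ASCQ in bytes 12 and 13); the lookup tables cover only the codes that occur in the log above, and `decode_fixed_sense` is an illustrative name:

```python
# Only the values seen in this thread; the full tables are in the SCSI SPC standard.
SENSE_KEYS = {0x3: "MEDIUM ERROR", 0x7: "DATA PROTECT"}
ASC_ASCQ = {
    (0x11, 0x00): "UNRECOVERED READ ERROR",
    (0x27, 0x00): "WRITE PROTECTED",
}


def decode_fixed_sense(sense_hex: str) -> dict:
    """Decode SCSI fixed-format sense data given as space-separated hex bytes."""
    s = bytes.fromhex(sense_hex)
    if len(s) < 14 or (s[0] & 0x7F) not in (0x70, 0x71):
        raise ValueError("not fixed-format sense data")
    key = s[2] & 0x0F
    asc, ascq = s[12], s[13]
    return {
        "valid": bool(s[0] & 0x80),  # set when the information field is meaningful
        "sense_key": SENSE_KEYS.get(key, hex(key)),
        "information": int.from_bytes(s[3:7], "big"),
        "asc_ascq": ASC_ASCQ.get((asc, ascq), f"{asc:02x}/{ascq:02x}"),
    }


read_err = decode_fixed_sense("F0 00 03 96 6D 34 60 0A 00 00 00 00 11 00 00 00 00 00")
write_err = decode_fixed_sense("70 00 07 00 00 00 00 0A 00 00 00 00 27 00 00 00 00 00")
print(read_err)
print(write_err)
```

Under these assumptions the read failures decode as a medium error (unrecovered read error) with the information field pointing at 0x966d3460, while the failed `2e` command on PD 00 decodes as data protect / write protected rather than a medium error.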
04/17/18 18:05:22: GroupCmds: Fail cmd c=c0599a60 ld=0 start_block=966d3460 num_blocks=8 cmd=1
04/17/18 18:05:22: GroupCmds: Fail cmd c=c0599a60 ld=0 start_block=966d3460 num_blocks=8 cmd=1
04/17/18 18:05:22: GroupCmds: Fail cmd c=c0599a60 ld=0 start_block=966d3460 num_blocks=8 cmd=1
04/17/18 18:05:22: GroupCmds: Fail cmd c=c0599a60 ld=0 start_block=966d3460 num_blocks=8 cmd=1
04/17/18 18:12:03: prDiskStart: starting Patrol Read on PD=00
04/17/18 18:12:03: prDiskStart: starting Patrol Read on PD=01
04/17/18 18:12:03: EVT#41642-04/17/18 18:12:03: 39=Patrol Read started
04/17/18 18:39:33: EVT#41643-04/17/18 18:39:33: 94=Patrol Read progress on PD 00(e0xfc/s0) is 9.99%(1202s)
04/17/18 18:49:19: EVT#41644-04/17/18 18:49:19: 94=Patrol Read progress on PD 01(e0xfc/s1) is 9.99%(1196s)
04/17/18 19:09:27: EVT#41645-04/17/18 19:09:27: 94=Patrol Read progress on PD 00(e0xfc/s0) is 19.99%(2444s)
04/17/18 19:20:03: EVT#41646-04/17/18 19:20:03: 94=Patrol Read progress on PD 01(e0xfc/s1) is 19.99%(2472s)
04/17/18 19:40:00: EVT#41647-04/17/18 19:40:00: 94=Patrol Read progress on PD 00(e0xfc/s0) is 29.99%(3725s)
04/17/18 19:52:46: EVT#41648-04/17/18 19:52:46: 94=Patrol Read progress on PD 01(e0xfc/s1) is 29.99%(3787s)
04/17/18 20:13:24: EVT#41649-04/17/18 20:13:24: 94=Patrol Read progress on PD 00(e0xfc/s0) is 39.99%(5065s)
04/17/18 20:20:14: GroupCmds: Fail cmd c=c0597660 ld=0 start_block=966d3460 num_blocks=8 cmd=1
04/17/18 20:20:14: GroupCmds: Fail cmd c=c0597660 ld=0 start_block=966d3460 num_blocks=8 cmd=1
04/17/18 20:20:14: GroupCmds: Fail cmd c=c0597660 ld=0 start_block=966d3460 num_blocks=8 cmd=1
04/17/18 20:20:14: GroupCmds: Fail cmd c=c0597660 ld=0 start_block=966d3460 num_blocks=8 cmd=1
04/17/18 20:20:14: GroupCmds: Fail cmd c=c0597660 ld=0 start_block=966d3460 num_blocks=8 cmd=1
04/17/18 20:20:14: GroupCmds: Fail cmd c=c0597660 ld=0 start_block=966d3460 num_blocks=8 cmd=1
04/17/18 20:27:35: EVT#41650-04/17/18 20:27:35: 94=Patrol Read progress on PD 01(e0xfc/s1) is 39.99%(5164s)
04/17/18 20:47:30: EVT#41651-04/17/18 20:47:30: 94=Patrol Read progress on PD 00(e0xfc/s0) is 49.99%(6463s)
04/17/18 21:01:15: EVT#41652-04/17/18 21:01:15: 94=Patrol Read progress on PD 01(e0xfc/s1) is 49.99%(6608s)
04/17/18 21:26:06: EVT#41653-04/17/18 21:26:06: 94=Patrol Read progress on PD 00(e0xfc/s0) is 59.99%(7931s)
04/17/18 21:41:37: EVT#41654-04/17/18 21:41:37: 94=Patrol Read progress on PD 01(e0xfc/s1) is 59.99%(8126s)
04/17/18 21:53:12: GroupCmds: Fail cmd c=c0588360 ld=0 start_block=966d3460 num_blocks=8 cmd=1
04/17/18 21:53:12: GroupCmds: Fail cmd c=c0588360 ld=0 start_block=966d3460 num_blocks=8 cmd=1
04/17/18 21:53:12: GroupCmds: Fail cmd c=c0588360 ld=0 start_block=966d3460 num_blocks=8 cmd=1
04/17/18 21:53:12: GroupCmds: Fail cmd c=c0588360 ld=0 start_block=966d3460 num_blocks=8 cmd=1
04/17/18 21:53:12: GroupCmds: Fail cmd c=c0588360 ld=0 start_block=966d3460 num_blocks=8 cmd=1
04/17/18 21:53:12: GroupCmds: Fail cmd c=c0588360 ld=0 start_block=966d3460 num_blocks=8 cmd=1
04/17/18 22:08:43: EVT#41655-04/17/18 22:08:43: 94=Patrol Read progress on PD 00(e0xfc/s0) is 69.99%(9496s)
04/17/18 22:24:34: EVT#41656-04/17/18 22:24:34: 94=Patrol Read progress on PD 01(e0xfc/s1) is 69.99%(9759s)
04/17/18 22:51:07: EVT#41657-04/17/18 22:51:07: 94=Patrol Read progress on PD 00(e0xfc/s0) is 79.99%(11176s)
04/17/18 23:11:00: EVT#41658-04/17/18 23:11:00: 94=Patrol Read progress on PD 01(e0xfc/s1) is 79.99%(11521s)
04/17/18 23:40:50: EVT#41659-04/17/18 23:40:50: 94=Patrol Read progress on PD 00(e0xfc/s0) is 89.99%(13015s)
04/18/18 0:06:19: EVT#41660-04/18/18 0:06:19: 94=Patrol Read progress on PD 01(e0xfc/s1) is 89.99%(13456s)
04/18/18 0:39:11: EVT#41661-04/18/18 0:39:11: 94=Patrol Read progress on PD 00(e0xfc/s0) is 99.99%(15108s)
04/18/18 0:39:13: prCallback: PR completed for pd=00
04/18/18 1:07:31: EVT#41662-04/18/18 1:07:31: 94=Patrol Read progress on PD 01(e0xfc/s1) is 99.99%(15680s)
04/18/18 1:07:33: prCallback: PR completed for pd=01
04/18/18 1:07:33: PR cycle complete
04/18/18 1:07:33: EVT#41663-04/18/18 1:07:33: 35=Patrol Read complete
04/18/18 1:07:33: Next PR scheduled to start at 04/21/18 3:00:00
04/18/18 2:15:33: GroupCmds: Fail cmd c=c0590ca0 ld=0 start_block=966d3460 num_blocks=8 cmd=1
04/18/18 2:15:33: GroupCmds: Fail cmd c=c0590ca0 ld=0 start_block=966d3460 num_blocks=8 cmd=1
On the other hand, the patrol read runs all the way to 100%, from which I'd conclude the problem isn't serious, but as I wrote, I don't know this hardware.
Apart from the battery on it having already dried out, the thing looks healthy (though I still need to clarify the battery situation with the hosting provider).
smartctl shows nothing nasty.
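To see whether the `GroupCmds: Fail cmd` lines point at many places or just one, the log can be tallied per `start_block`. A small Python sketch, assuming `start_block` is printed in hexadecimal (it matches the `966d3460` address in the other log excerpts); `count_failed_blocks` and the shortened sample are illustrative only:

```python
import re
from collections import Counter

FAIL_RE = re.compile(r"GroupCmds: Fail cmd .*start_block=([0-9a-f]+) num_blocks=(\d+)")


def count_failed_blocks(log_text: str) -> Counter:
    """Count failed commands per start block (parsed as hex, an assumption)."""
    counts = Counter()
    for line in log_text.splitlines():
        m = FAIL_RE.search(line)
        if m:
            counts[int(m.group(1), 16)] += 1
    return counts


# A few lines copied from the excerpt above.
sample = """\
04/17/18 18:05:22: GroupCmds: Fail cmd c=c0599a60 ld=0 start_block=966d3460 num_blocks=8 cmd=1
04/17/18 18:12:03: prDiskStart: starting Patrol Read on PD=00
04/17/18 20:20:14: GroupCmds: Fail cmd c=c0597660 ld=0 start_block=966d3460 num_blocks=8 cmd=1
04/18/18 2:15:33: GroupCmds: Fail cmd c=c0590ca0 ld=0 start_block=966d3460 num_blocks=8 cmd=1
"""

counts = count_failed_blocks(sample)
for lba, n in counts.most_common():
    print(f"LBA 0x{lba:x} ({lba}): {n} failed commands")
```

In the quoted excerpt every failed command targets the same 8-block range starting at 0x966d3460, which fits a single bad spot being read repeatedly rather than errors spreading across the disk.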
Yet according to these two lines it is not fine, and smartctl should show it as well (Reallocated_Event_Count, Current_Pending_Sector, Offline_Uncorrectable):
FATAL:Unrecoverable medium error during recovery on PD 00(e0xfc/s0) at 966d3460
FATAL:Uncorrectable medium error logged for VD 00/0 at 966d3460 (on PD 01(e0xfc/s1) at 966d3460)
It's no coincidence that you see the errors at the OS level, because I think both drives are bad at the same spot (at least that's what I think the second line indicates).
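Whether both drives report a medium error at the same address can be checked mechanically from the "Unexpected sense" lines. A Python sketch, assuming the log format shown earlier in the thread, a 10-byte CDB with the LBA in bytes 2..5, and fixed-format sense data with the sense key in byte 2; `medium_error_lbas` is a hypothetical helper name:

```python
import re
from collections import defaultdict

SENSE_RE = re.compile(
    r"Unexpected sense: PD (\d+)\(.*?CDB: ([0-9a-f ]+), Sense: ([0-9A-F ]+)"
)


def medium_error_lbas(log_text: str) -> dict:
    """Map each physical drive to the set of LBAs that returned a medium error."""
    by_pd = defaultdict(set)
    for m in SENSE_RE.finditer(log_text):
        pd = m.group(1)
        cdb = bytes.fromhex(m.group(2))
        sense = bytes.fromhex(m.group(3))
        if sense[2] & 0x0F == 0x3:  # sense key 3 = MEDIUM ERROR
            by_pd[pd].add(int.from_bytes(cdb[2:6], "big"))
    return dict(by_pd)


# Lines copied from the controller log quoted earlier.
sample = """\
39596: 18-03-11,21:04:14 Info:Unexpected sense: PD 00(e0xfc/s0) Path 4433221100000000, CDB: 28 00 96 6d 34 60 00 00 08 00, Sense: F0 00 03 96 6D 34 60 0A 00 00 00 00 11 00 00 00 00 00
39597: 18-03-11,21:04:21 Info:Unexpected sense: PD 00(e0xfc/s0) Path 4433221100000000, CDB: 28 00 91 bf f9 68 00 00 08 00, Sense: F0 00 03 91 BF F9 68 0A 00 00 00 00 11 00 00 00 00 00
39599: 18-03-11,21:04:35 Info:Unexpected sense: PD 01(e0xfc/s1) Path 4433221101000000, CDB: 28 00 96 6d 34 60 00 00 08 00, Sense: F0 00 03 96 6D 34 60 0A 00 00 00 00 11 00 00 00 00 00
39602: 18-03-11,21:04:42 Info:Unexpected sense: PD 01(e0xfc/s1) Path 4433221101000000, CDB: 28 00 02 d7 83 68 00 00 08 00, Sense: F0 00 03 02 D7 83 68 0A 00 00 00 00 11 00 00 00 00 00
39603: 18-03-11,21:04:42 Info:Unexpected sense: PD 00(e0xfc/s0) Path 4433221100000000, CDB: 2e 00 96 6d 34 60 00 00 01 00, Sense: 70 00 07 00 00 00 00 0A 00 00 00 00 27 00 00 00 00 00
"""

by_pd = medium_error_lbas(sample)
common = set.intersection(*by_pd.values())
print({pd: sorted(hex(lba) for lba in lbas) for pd, lbas in by_pd.items()})
print("reported on every drive:", sorted(hex(lba) for lba in common))
```

On these sample lines the only address reported by both PD 00 and PD 01 is 0x966d3460, which is the block the kernel keeps failing on. With a two-disk mirror, a bad read at the same address on both members leaves the controller nothing to recover from.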
Sorry, a lot of other things happened in the meantime. I ran a long self-test on it:
smartctl -t long -d megaraid,0 /dev/sda
smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.9.0-0.bpo.2-amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Model Family: Western Digital Se
Device Model: WDC WD2000F9YZ-09N20L1
Serial Number: WD-WMC1P0F9XDXF
LU WWN Device Id: 5 0014ee 65b518034
Firmware Version: 01.01A02
User Capacity: 2,000,398,934,016 bytes [2.00 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Rotation Rate: 7200 rpm
Device is: In smartctl database [for details use: -P show]
ATA Version is: ATA8-ACS (minor revision not indicated)
SATA Version is: SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Mon Apr 23 19:40:53 2018 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART Status not supported: ATA return descriptor not supported by controller firmware
SMART overall-health self-assessment test result: PASSED
Warning: This result is based on an Attribute check.
General SMART Values:
Offline data collection status: (0x82) Offline data collection activity
was completed without error.
Auto Offline Data Collection: Enabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: (23400) seconds.
Offline data collection
capabilities: (0x7b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: ( 255) minutes.
Conveyance self-test routine
recommended polling time: ( 5) minutes.
SCT capabilities: (0x70bd) SCT Status supported.
SCT Error Recovery Control supported.
SCT Feature Control supported.
SCT Data Table supported.
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x002f 200 200 051 Pre-fail Always - 0
3 Spin_Up_Time 0x0027 167 167 021 Pre-fail Always - 6650
4 Start_Stop_Count 0x0032 100 100 000 Old_age Always - 12
5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 0
7 Seek_Error_Rate 0x002e 200 200 000 Old_age Always - 0
9 Power_On_Hours 0x0032 081 081 000 Old_age Always - 14317
10 Spin_Retry_Count 0x0032 100 253 000 Old_age Always - 0
11 Calibration_Retry_Count 0x0032 100 253 000 Old_age Always - 0
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 12
16 Unknown_Attribute 0x0022 255 000 000 Old_age Always - 19768963193801
183 Runtime_Bad_Block 0x0032 100 100 000 Old_age Always - 0
192 Power-Off_Retract_Count 0x0032 200 200 000 Old_age Always - 10
193 Load_Cycle_Count 0x0032 200 200 000 Old_age Always - 4
194 Temperature_Celsius 0x0022 121 117 000 Old_age Always - 29
196 Reallocated_Event_Count 0x0032 200 200 000 Old_age Always - 0
197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 200 200 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x0032 200 200 000 Old_age Always - 0
200 Multi_Zone_Error_Rate 0x0008 200 200 000 Old_age Offline - 0
SMART Error Log Version: 1
No Errors Logged
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Extended offline Completed without error 00% 14310 -
SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
Based on this it looks OK to me, although I didn't run a surface test (in theory that's what patrol read is for, isn't it?); the self-test is now running on the other disk as well.
I contacted the hosting provider, and they recommended replacing both disks. For now I'm working on getting everything backed up properly and in several copies, so that if the RAID falls apart during the swap, there is something to rebuild the machine from.
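The attributes mentioned earlier in the thread (Reallocated_Event_Count, Current_Pending_Sector, Offline_Uncorrectable, plus Reallocated_Sector_Ct) can be pulled out of the attribute table programmatically. A minimal Python sketch, assuming the ten-column table layout printed above and plain integer raw values for these attributes; `watched_raw_values` is an illustrative name:

```python
WATCHED = {
    "Reallocated_Sector_Ct",
    "Reallocated_Event_Count",
    "Current_Pending_Sector",
    "Offline_Uncorrectable",
}


def watched_raw_values(smart_text: str) -> dict:
    """Return the RAW_VALUE of the watched attributes from a SMART attribute table."""
    values = {}
    for line in smart_text.splitlines():
        fields = line.split()
        # Columns: ID NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
        if len(fields) >= 10 and fields[0].isdigit() and fields[1] in WATCHED:
            values[fields[1]] = int(fields[9])
    return values


# Rows copied from the smartctl output above.
sample = """\
5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 0
9 Power_On_Hours 0x0032 081 081 000 Old_age Always - 14317
196 Reallocated_Event_Count 0x0032 200 200 000 Old_age Always - 0
197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 200 200 000 Old_age Offline - 0
"""

values = watched_raw_values(sample)
suspicious = {name: raw for name, raw in values.items() if raw != 0}
print(values)
print("non-zero:", suspicious or "none")
```

For the disk shown here all four raw values are zero, which is why the drive itself looks clean even though the controller keeps logging medium errors.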
This looks fine; is there no error on the other disk either? Then again, a host-level error shouldn't have shown up in that case, so I'm stumped. Maybe it's a different kind of problem, e.g. the power supply.
Roughly the same.
In the meantime I contacted the hosting provider, and a few previously unknown details came to light. Anyway, the plan is that I'll bring in a machine, back up whatever I can to it, and then the disks get replaced. Although I think there is basically nothing wrong with the disks, so first a power cycle, and if there are no errors for a month, it stays as it is. Or not, I'm still mulling it over.
Well, after some disk swapping and so on, the error still persisted. After a more determined search I found this:
https://medium.com/@george.shuklin/enterprise-grade-fuckups-from-lsi-an…
Which explains quite a lot. Things have been quiet since the BBMClr.
Keywords for searching:
State : Optimal
Bad Blocks Exist: Yes
Dell server PERC H730, LSI Megaraid
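The telltale combination in these keywords, a virtual drive that is `Optimal` yet reports `Bad Blocks Exist: Yes`, can be detected from the `-LDInfo` text. A small Python sketch, assuming the `Key : Value` line format of the output pasted at the top of the thread; `parse_ldinfo` and `optimal_but_bad_blocks` are hypothetical helper names, and the sketch only flags the condition without deciding what to do about it:

```python
def parse_ldinfo(text: str) -> dict:
    """Parse 'Key : Value' lines of MegaCli -LDInfo output into a dict."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def optimal_but_bad_blocks(fields: dict) -> bool:
    """True when the VD is Optimal but the controller still lists bad blocks."""
    return fields.get("State") == "Optimal" and fields.get("Bad Blocks Exist") == "Yes"


# Lines copied from the -LDInfo output in the original post.
sample = """\
Virtual Drive: 0 (Target Id: 0)
RAID Level : Primary-1, Secondary-0, RAID Level Qualifier-0
State : Optimal
Number Of Drives : 2
Bad Blocks Exist: Yes
"""

fields = parse_ldinfo(sample)
flagged = optimal_but_bad_blocks(fields)
print(fields["State"], fields["Bad Blocks Exist"], flagged)
```

A check like this could run from monitoring so that the condition is noticed before the kernel starts logging I/O errors on the affected sectors.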