sd 0:2:0:0: [sda] tag#0 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE


I'm seeing some rather worrying entries in the log:


Apr 20 12:21:59 ****** kernel: sd 0:2:0:0: [sda] tag#0 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Apr 20 12:21:59 ****** kernel: sd 0:2:0:0: [sda] tag#0 Sense Key : Medium Error [current] 
Apr 20 12:21:59 ****** kernel: sd 0:2:0:0: [sda] tag#0 Add. Sense: No additional sense information
Apr 20 12:21:59 ****** kernel: sd 0:2:0:0: [sda] tag#0 CDB: Read(10) 28 00 96 6d 34 60 00 00 08 00
Apr 20 12:21:59 ****** kernel: blk_update_request: I/O error, dev sda, sector 2523739232
Apr 20 12:21:59 ****** kernel: sd 0:2:0:0: [sda] tag#0 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Apr 20 12:21:59 ****** kernel: sd 0:2:0:0: [sda] tag#0 Sense Key : Medium Error [current] 
Apr 20 12:21:59 ****** kernel: sd 0:2:0:0: [sda] tag#0 Add. Sense: No additional sense information
Apr 20 12:21:59 ****** kernel: sd 0:2:0:0: [sda] tag#0 CDB: Read(10) 28 00 96 6d 34 60 00 00 08 00
Apr 20 12:21:59 ****** kernel: blk_update_request: I/O error, dev sda, sector 2523739232
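
As a cross-check, the sector number reported by blk_update_request can be derived from the CDB bytes in the same log. The sketch below assumes the standard READ(10) layout (bytes 2-5 are the big-endian logical block address, bytes 7-8 the transfer length in blocks); the function name decode_read10 is only illustrative.

```python
def decode_read10(cdb_hex):
    """Return (lba, block_count) from a 10-byte READ(10)/WRITE(10) CDB."""
    cdb = bytes.fromhex(cdb_hex.replace(" ", ""))
    if len(cdb) != 10:
        raise ValueError("expected a 10-byte CDB")
    lba = int.from_bytes(cdb[2:6], "big")    # bytes 2-5: logical block address
    count = int.from_bytes(cdb[7:9], "big")  # bytes 7-8: transfer length in blocks
    return lba, count


lba, count = decode_read10("28 00 96 6d 34 60 00 00 08 00")
print(hex(lba), lba, count)  # 0x966d3460 2523739232 8
```

So the failing request is an 8-block read starting at LBA 0x966d3460, which is the same sector 2523739232 the kernel reports.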

The array, if I read this correctly, still seems to be fine:



# megacli -LDInfo -L0 -a0
                                     

Adapter 0 -- Virtual Drive Information:
Virtual Drive: 0 (Target Id: 0)
Name                :
RAID Level          : Primary-1, Secondary-0, RAID Level Qualifier-0
Size                : 1.817 TB
Sector Size         : 512
Is VD emulated      : Yes
Mirror Data         : 1.817 TB
State               : Optimal
Strip Size          : 64 KB
Number Of Drives    : 2
Span Depth          : 1
Default Cache Policy: WriteBack, ReadAhead, Cached, Write Cache OK if Bad BBU
Current Cache Policy: WriteBack, ReadAhead, Cached, Write Cache OK if Bad BBU
Default Access Policy: Read/Write
Current Access Policy: Read/Write
Disk Cache Policy   : Disk's Default
Encryption Type     : None
Bad Blocks Exist: Yes
Is VD Cached: No

What actually worries me is that the errors have been piling up, and I'm not familiar with this hardware. Is this a stage where the disk is still happily repairing itself and there is no reason to worry, or should I tell the people responsible now that an HDD replacement is due soon (or immediately)?

Comments

I'd suggest pulling more detailed information:
MegaCli64 -ldpdinfo -a0
MegaCli64 -AdpAlILog -a0

The adapter log shows whether the controller has repaired a disk error recently ("FATAL:Puncturing bad block").

You can also query the disks directly at the SMART level (where X is the "Device Id"):
smartctl -a -d sat+megaraid,X /dev/sda

Ah, thanks. That's more informative, although it hasn't made me any less puzzled :D


39593: 18-03-10,07:26:38 BG Work:Consistency Check progress on VD 00/0 is 99.73%(15998s)
39594: 18-03-10,07:27:40 BG Work:Consistency Check progress on VD 00/0 is 99.98%(16060s)
39595: 18-03-10,07:27:43 Info:Consistency Check done on VD 00/0
39596: 18-03-11,21:04:14 Info:Unexpected sense: PD 00(e0xfc/s0) Path 4433221100000000, CDB: 28 00 96 6d 34 60 00 00 08 00, Sense: F0 00 03 96 6D 34 60 0A 00 00 00 00 11 00 00 00 00 00 
39597: 18-03-11,21:04:21 Info:Unexpected sense: PD 00(e0xfc/s0) Path 4433221100000000, CDB: 28 00 91 bf f9 68 00 00 08 00, Sense: F0 00 03 91 BF F9 68 0A 00 00 00 00 11 00 00 00 00 00 
39598: 18-03-11,21:04:28 Info:Unexpected sense: PD 00(e0xfc/s0) Path 4433221100000000, CDB: 28 00 96 6d 34 60 00 00 08 00, Sense: F0 00 03 96 6D 34 60 0A 00 00 00 00 11 00 00 00 00 00 
39599: 18-03-11,21:04:35 Info:Unexpected sense: PD 01(e0xfc/s1) Path 4433221101000000, CDB: 28 00 96 6d 34 60 00 00 08 00, Sense: F0 00 03 96 6D 34 60 0A 00 00 00 00 11 00 00 00 00 00 
39600: 18-03-11,21:04:35 FATAL:Unrecoverable medium error during recovery on PD 00(e0xfc/s0) at 966d3460
39601: 18-03-11,21:04:35 Info:Unexpected sense: PD 00(e0xfc/s0) Path 4433221100000000, CDB: 28 00 91 bf f9 68 00 00 08 00, Sense: F0 00 03 91 BF F9 68 0A 00 00 00 00 11 00 00 00 00 00 
39602: 18-03-11,21:04:42 Info:Unexpected sense: PD 01(e0xfc/s1) Path 4433221101000000, CDB: 28 00 02 d7 83 68 00 00 08 00, Sense: F0 00 03 02 D7 83 68 0A 00 00 00 00 11 00 00 00 00 00 
39603: 18-03-11,21:04:42 Info:Unexpected sense: PD 00(e0xfc/s0) Path 4433221100000000, CDB: 2e 00 96 6d 34 60 00 00 01 00, Sense: 70 00 07 00 00 00 00 0A 00 00 00 00 27 00 00 00 00 00 
39604: 18-03-11,21:04:45 FATAL:Uncorrectable medium error logged for VD 00/0 at 966d3460 (on PD 01(e0xfc/s1) at 966d3460)
39605: 18-03-11,21:04:49 Info:Unexpected sense: PD 00(e0xfc/s0) Path 4433221100000000, CDB: 28 00 92 cf 38 e0 00 00 08 00, Sense: F0 00 03 92 CF 38 E0 0A 00 00 00 00 11 00 00 00 00 00 
39606: 18-03-11,21:04:55 Info:Corrected medium error during recovery on PD 00(e0xfc/s0) at 91bff968
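
The "Sense:" byte strings in these events can be read by hand. The following is a minimal sketch, assuming they are SCSI fixed-format sense data (response code 0x70/0x71, with the top bit of byte 0 meaning the information field is valid). The lookup tables only cover the codes that occur in this log, with names taken from the SCSI sense key and ASC/ASCQ assignments; decode_fixed_sense is an illustrative name, not part of any tool.

```python
# Only the codes that actually appear in the log above.
SENSE_KEYS = {0x3: "Medium Error", 0x7: "Data Protect"}
ASC_NAMES = {(0x11, 0x00): "Unrecovered read error",
             (0x27, 0x00): "Write protected"}


def decode_fixed_sense(sense_hex):
    """Decode the fields of interest from fixed-format SCSI sense data."""
    b = bytes.fromhex(sense_hex.replace(" ", ""))
    if len(b) < 14 or (b[0] & 0x7F) not in (0x70, 0x71):
        raise ValueError("not fixed-format sense data")
    return {
        "info_valid": bool(b[0] & 0x80),               # top bit of byte 0
        "sense_key": b[2] & 0x0F,                      # low nibble of byte 2
        "information": int.from_bytes(b[3:7], "big"),  # LBA for medium errors
        "asc": b[12],
        "ascq": b[13],
    }


read_err = decode_fixed_sense("F0 00 03 96 6D 34 60 0A 00 00 00 00 11 00 00 00 00 00")
write_err = decode_fixed_sense("70 00 07 00 00 00 00 0A 00 00 00 00 27 00 00 00 00 00")

for d in (read_err, write_err):
    print(SENSE_KEYS.get(d["sense_key"], "?"),
          ASC_NAMES.get((d["asc"], d["ascq"]), "?"),
          hex(d["information"]))
```

Read this way, the 28 (read) commands fail with Medium Error / unrecovered read error at the LBA given in the information field, while the single 2e command on PD 00 returns a different sense key (7) with ASC 0x27.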

From this it looks to me like something is damaged somewhere.

Searching for "fail" also turns up entries like these:


04/17/18 18:05:22: GroupCmds: Fail cmd c=c0599a60 ld=0 start_block=966d3460 num_blocks=8 cmd=1
04/17/18 18:05:22: GroupCmds: Fail cmd c=c0599a60 ld=0 start_block=966d3460 num_blocks=8 cmd=1
04/17/18 18:05:22: GroupCmds: Fail cmd c=c0599a60 ld=0 start_block=966d3460 num_blocks=8 cmd=1
04/17/18 18:05:22: GroupCmds: Fail cmd c=c0599a60 ld=0 start_block=966d3460 num_blocks=8 cmd=1
04/17/18 18:12:03: prDiskStart: starting Patrol Read on PD=00
04/17/18 18:12:03: prDiskStart: starting Patrol Read on PD=01
04/17/18 18:12:03: EVT#41642-04/17/18 18:12:03:  39=Patrol Read started
04/17/18 18:39:33: EVT#41643-04/17/18 18:39:33:  94=Patrol Read progress on PD 00(e0xfc/s0) is 9.99%(1202s)
04/17/18 18:49:19: EVT#41644-04/17/18 18:49:19:  94=Patrol Read progress on PD 01(e0xfc/s1) is 9.99%(1196s)
04/17/18 19:09:27: EVT#41645-04/17/18 19:09:27:  94=Patrol Read progress on PD 00(e0xfc/s0) is 19.99%(2444s)
04/17/18 19:20:03: EVT#41646-04/17/18 19:20:03:  94=Patrol Read progress on PD 01(e0xfc/s1) is 19.99%(2472s)
04/17/18 19:40:00: EVT#41647-04/17/18 19:40:00:  94=Patrol Read progress on PD 00(e0xfc/s0) is 29.99%(3725s)
04/17/18 19:52:46: EVT#41648-04/17/18 19:52:46:  94=Patrol Read progress on PD 01(e0xfc/s1) is 29.99%(3787s)
04/17/18 20:13:24: EVT#41649-04/17/18 20:13:24:  94=Patrol Read progress on PD 00(e0xfc/s0) is 39.99%(5065s)
04/17/18 20:20:14: GroupCmds: Fail cmd c=c0597660 ld=0 start_block=966d3460 num_blocks=8 cmd=1
04/17/18 20:20:14: GroupCmds: Fail cmd c=c0597660 ld=0 start_block=966d3460 num_blocks=8 cmd=1
04/17/18 20:20:14: GroupCmds: Fail cmd c=c0597660 ld=0 start_block=966d3460 num_blocks=8 cmd=1
04/17/18 20:20:14: GroupCmds: Fail cmd c=c0597660 ld=0 start_block=966d3460 num_blocks=8 cmd=1
04/17/18 20:20:14: GroupCmds: Fail cmd c=c0597660 ld=0 start_block=966d3460 num_blocks=8 cmd=1
04/17/18 20:20:14: GroupCmds: Fail cmd c=c0597660 ld=0 start_block=966d3460 num_blocks=8 cmd=1
04/17/18 20:27:35: EVT#41650-04/17/18 20:27:35:  94=Patrol Read progress on PD 01(e0xfc/s1) is 39.99%(5164s)
04/17/18 20:47:30: EVT#41651-04/17/18 20:47:30:  94=Patrol Read progress on PD 00(e0xfc/s0) is 49.99%(6463s)
04/17/18 21:01:15: EVT#41652-04/17/18 21:01:15:  94=Patrol Read progress on PD 01(e0xfc/s1) is 49.99%(6608s)
04/17/18 21:26:06: EVT#41653-04/17/18 21:26:06:  94=Patrol Read progress on PD 00(e0xfc/s0) is 59.99%(7931s)
04/17/18 21:41:37: EVT#41654-04/17/18 21:41:37:  94=Patrol Read progress on PD 01(e0xfc/s1) is 59.99%(8126s)
04/17/18 21:53:12: GroupCmds: Fail cmd c=c0588360 ld=0 start_block=966d3460 num_blocks=8 cmd=1
04/17/18 21:53:12: GroupCmds: Fail cmd c=c0588360 ld=0 start_block=966d3460 num_blocks=8 cmd=1
04/17/18 21:53:12: GroupCmds: Fail cmd c=c0588360 ld=0 start_block=966d3460 num_blocks=8 cmd=1
04/17/18 21:53:12: GroupCmds: Fail cmd c=c0588360 ld=0 start_block=966d3460 num_blocks=8 cmd=1
04/17/18 21:53:12: GroupCmds: Fail cmd c=c0588360 ld=0 start_block=966d3460 num_blocks=8 cmd=1
04/17/18 21:53:12: GroupCmds: Fail cmd c=c0588360 ld=0 start_block=966d3460 num_blocks=8 cmd=1
04/17/18 22:08:43: EVT#41655-04/17/18 22:08:43:  94=Patrol Read progress on PD 00(e0xfc/s0) is 69.99%(9496s)
04/17/18 22:24:34: EVT#41656-04/17/18 22:24:34:  94=Patrol Read progress on PD 01(e0xfc/s1) is 69.99%(9759s)
04/17/18 22:51:07: EVT#41657-04/17/18 22:51:07:  94=Patrol Read progress on PD 00(e0xfc/s0) is 79.99%(11176s)
04/17/18 23:11:00: EVT#41658-04/17/18 23:11:00:  94=Patrol Read progress on PD 01(e0xfc/s1) is 79.99%(11521s)
04/17/18 23:40:50: EVT#41659-04/17/18 23:40:50:  94=Patrol Read progress on PD 00(e0xfc/s0) is 89.99%(13015s)
04/18/18  0:06:19: EVT#41660-04/18/18  0:06:19:  94=Patrol Read progress on PD 01(e0xfc/s1) is 89.99%(13456s)
04/18/18  0:39:11: EVT#41661-04/18/18  0:39:11:  94=Patrol Read progress on PD 00(e0xfc/s0) is 99.99%(15108s)
04/18/18  0:39:13: prCallback: PR completed for pd=00
04/18/18  1:07:31: EVT#41662-04/18/18  1:07:31:  94=Patrol Read progress on PD 01(e0xfc/s1) is 99.99%(15680s)
04/18/18  1:07:33: prCallback: PR completed for pd=01
04/18/18  1:07:33: PR cycle complete
04/18/18  1:07:33: EVT#41663-04/18/18  1:07:33:  35=Patrol Read complete
04/18/18  1:07:33: Next PR scheduled to start at 04/21/18  3:00:00
04/18/18  2:15:33: GroupCmds: Fail cmd c=c0590ca0 ld=0 start_block=966d3460 num_blocks=8 cmd=1
04/18/18  2:15:33: GroupCmds: Fail cmd c=c0590ca0 ld=0 start_block=966d3460 num_blocks=8 cmd=1

However, the patrol read runs all the way to 100%, which would make me think the problem isn't serious, but as I said, I don't know this hardware.
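
To see whether the "Fail cmd" entries are spread over the disk or concentrated on one spot, the log can be tallied by start_block. This is a rough sketch: it assumes the start_block value is hexadecimal (it contains the digit "d" in the lines above) and that the log has been saved as text; the function name and the regular expression are illustrative only.

```python
import re
from collections import Counter

# Matches e.g. "GroupCmds: Fail cmd c=c0599a60 ld=0 start_block=966d3460 num_blocks=8 cmd=1"
FAIL_RE = re.compile(r"GroupCmds: Fail cmd .*?start_block=([0-9a-fA-F]+)")


def count_failed_blocks(lines):
    """Count 'Fail cmd' entries per start block (parsed as hex)."""
    counts = Counter()
    for line in lines:
        m = FAIL_RE.search(line)
        if m:
            counts[int(m.group(1), 16)] += 1
    return counts


sample = [
    "04/17/18 18:05:22: GroupCmds: Fail cmd c=c0599a60 ld=0 start_block=966d3460 num_blocks=8 cmd=1",
    "04/17/18 18:12:03: prDiskStart: starting Patrol Read on PD=00",
    "04/17/18 20:20:14: GroupCmds: Fail cmd c=c0597660 ld=0 start_block=966d3460 num_blocks=8 cmd=1",
]
for block, n in count_failed_blocks(sample).items():
    print(hex(block), n)
```

In the excerpt above every failed command points at the same start block, 966d3460.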

Apart from the battery having dried out, the controller looks healthy (though I still need to check with the hosting provider what the situation with the battery actually is).

smartctl doesn't show anything nasty.

Yet according to these two lines things are not OK, and smartctl should be showing it too (Reallocated_Event_Count, Current_Pending_Sector, Offline_Uncorrectable):
FATAL:Unrecoverable medium error during recovery on PD 00(e0xfc/s0) at 966d3460
FATAL:Uncorrectable medium error logged for VD 00/0 at 966d3460 (on PD 01(e0xfc/s1) at 966d3460)

It's no accident that you see the errors at the OS level: I think both drives are bad at the same location (at least that is what I think the second line indicates).
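
One way to check the "same location on both drives" reading is to group the "Unexpected sense" events by LBA and see which physical disks reported each one. The sketch below assumes the event-log line format shown earlier in the thread and only looks at READ(10) commands (opcode 28); pds_per_lba is an illustrative helper, not a MegaCli feature.

```python
import re
from collections import defaultdict

SENSE_RE = re.compile(r"Unexpected sense: PD (\d+)\(.*?CDB: ([0-9a-fA-F ]+),")


def pds_per_lba(lines):
    """Map each failing READ(10) LBA to the set of PD numbers that reported it."""
    hits = defaultdict(set)
    for line in lines:
        m = SENSE_RE.search(line)
        if not m:
            continue
        cdb = bytes.fromhex(m.group(2).replace(" ", ""))
        if len(cdb) != 10 or cdb[0] != 0x28:  # keep READ(10) only
            continue
        hits[int.from_bytes(cdb[2:6], "big")].add(int(m.group(1)))
    return hits


sample = [
    "39596: 18-03-11,21:04:14 Info:Unexpected sense: PD 00(e0xfc/s0) Path 4433221100000000, CDB: 28 00 96 6d 34 60 00 00 08 00, Sense: F0 00 03 96 6D 34 60 0A 00 00 00 00 11 00 00 00 00 00",
    "39597: 18-03-11,21:04:21 Info:Unexpected sense: PD 00(e0xfc/s0) Path 4433221100000000, CDB: 28 00 91 bf f9 68 00 00 08 00, Sense: F0 00 03 91 BF F9 68 0A 00 00 00 00 11 00 00 00 00 00",
    "39599: 18-03-11,21:04:35 Info:Unexpected sense: PD 01(e0xfc/s1) Path 4433221101000000, CDB: 28 00 96 6d 34 60 00 00 08 00, Sense: F0 00 03 96 6D 34 60 0A 00 00 00 00 11 00 00 00 00 00",
    "39602: 18-03-11,21:04:42 Info:Unexpected sense: PD 01(e0xfc/s1) Path 4433221101000000, CDB: 28 00 02 d7 83 68 00 00 08 00, Sense: F0 00 03 02 D7 83 68 0A 00 00 00 00 11 00 00 00 00 00",
]
hits = pds_per_lba(sample)
shared = [lba for lba, pds in hits.items() if len(pds) > 1]
print([hex(lba) for lba in shared])  # ['0x966d3460']
```

On a two-disk mirror, an LBA that fails on both PDs is one the controller cannot rebuild from the other copy, which would fit the "Uncorrectable medium error logged for VD" line.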

Sorry, a lot of other things happened in the meantime. I ran a long self-test on it:


smartctl -t long -d megaraid,0  /dev/sda
smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.9.0-0.bpo.2-amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Se
Device Model:     WDC WD2000F9YZ-09N20L1
Serial Number:    WD-WMC1P0F9XDXF
LU WWN Device Id: 5 0014ee 65b518034
Firmware Version: 01.01A02
User Capacity:    2,000,398,934,016 bytes [2.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    7200 rpm
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Mon Apr 23 19:40:53 2018 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART Status not supported: ATA return descriptor not supported by controller firmware
SMART overall-health self-assessment test result: PASSED
Warning: This result is based on an Attribute check.

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (23400) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 255) minutes.
Conveyance self-test routine
recommended polling time:        (   5) minutes.
SCT capabilities:              (0x70bd) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   167   167   021    Pre-fail  Always       -       6650
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       12
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   081   081   000    Old_age   Always       -       14317
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       12
 16 Unknown_Attribute       0x0022   255   000   000    Old_age   Always       -       19768963193801
183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       10
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       4
194 Temperature_Celsius     0x0022   121   117   000    Old_age   Always       -       29
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%     14310         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

Based on this it looks OK to me, although I didn't run a surface test (in principle that's what the patrol read is for, isn't it?). The self-test is now running on the other disk as well.
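
For repeated checks on both disks, the three counters named earlier in the thread can be pulled out of the attribute table automatically. This is a small sketch that assumes the classic ten-column smartctl attribute table layout shown above, with the raw value in the last column; the helper name nonzero_watched is made up for illustration.

```python
WATCHED = {"Reallocated_Event_Count", "Current_Pending_Sector", "Offline_Uncorrectable"}


def nonzero_watched(smart_text):
    """Return {attribute_name: raw_value} for watched attributes with a non-zero raw value."""
    found = {}
    for line in smart_text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns.
        if len(fields) >= 10 and fields[0].isdigit() and fields[1] in WATCHED:
            if fields[9].isdigit() and int(fields[9]) != 0:
                found[fields[1]] = int(fields[9])
    return found


sample = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
"""
print(nonzero_watched(sample))  # {}
```

An empty result matches what the output above shows: none of the three counters has moved on this disk, even though the controller logged medium errors.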

I contacted the hosting provider, and they recommended replacing both disks. For now I'm working on getting everything backed up properly and more than once, so that if the RAID falls apart during the swap there is something to rebuild the machine from.

Roughly the same.

Meanwhile I got in touch with the hosting provider and a few previously unknown details came to light. Anyway, the plan is that I bring in a machine, back up whatever I can onto it, and then the disks get replaced. Although I think there is fundamentally nothing wrong with the disks, so first a power cycle, and if no errors show up for a month, it stays as it is. Or not; I'm still mulling it over.