Replies: 4 comments
-
Check the PSU; errors on multiple disks at once often originate from power issues.
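A quick way to test that theory is to compare SMART counters across all the disks; a minimal sketch, assuming the pool members are /dev/sda through /dev/sdd (list yours with lsblk):

  for d in /dev/sda /dev/sdb /dev/sdc /dev/sdd; do
    echo "== $d =="
    # CRC errors climbing on several disks at once usually point at
    # cabling or power rather than the platters themselves
    smartctl -A "$d" | grep -Ei 'crc|reallocated|pending|uncorrect'
  done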
On Sat, Nov 9, 2024, 09:33 Ast0e wrote:
Hi guys,
I ran into an issue when I tried to replace a failed hard drive.
linux 6.8.0-48-generic #48-Ubuntu SMP PREEMPT_DYNAMIC Fri Sep 27 14:04:52 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
zfs-2.2.2-0ubuntu9.1
zfs-kmod-2.2.2-0ubuntu9
RAID: raidz1
HDD: 4 × 4 TB
Initial situation:
Weird behavior on my VM: I was watching Plex and suddenly there was no data. I rebooted the server and everything worked again. I checked Proxmox and noticed the ZFS raid was degraded, with one hard drive marked DEGRADED (too many errors).
After that (maybe my first mistake), I pulled the faulty disk and added a new one. The server never finished rebooting; it was stuck on a journal scan or something like that.
I rebooted into rescue mode and tried to import the raid. The import process was stuck for a day with nothing happening, only some kernel messages about read/write errors.
I tried again read-only; the pool imported, but no data.
For further investigation I moved the 4 disks to my Ubuntu machine and tried to import again. I let the OS try for a day, without success.
I also tried the rescue commands from the docs, without success:
zpool import -F
zpool import -FX (something like that)
Same issue: the import process got stuck.
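(For reference, the usual escalation for a damaged pool looks something like this; the pool name is taken from the status output below, and -X is a last resort that can throw away recent transactions:)

  zpool import -o readonly=on RAID_HDD   # safest first attempt: no writes to the pool
  zpool import -F RAID_HDD               # recovery mode: roll back the last few transactions
  zpool import -FX RAID_HDD              # extreme rewind; last resort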
While the import was running I checked with iostat whether anything was happening and saw zero drive activity.
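(To watch disk activity and kernel errors while an import hangs, something like this in a second terminal is enough; iostat here is the one from the sysstat package:)

  iostat -x 5      # extended per-disk statistics every 5 seconds
  dmesg --follow   # watch kernel read/write errors as they appear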
  pool: RAID_HDD
 state: DEGRADED
status: One or more devices is currently being resilvered. The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Sun Nov 3 20:06:56 2024
        0B / 12.3T scanned, 0B / 12.3T issued
        0B resilvered, 0.00% done, no estimated completion time
config:

        NAME                                          STATE     READ WRITE CKSUM
        RAID_HDD                                      DEGRADED     0     0     0
          raidz1-0                                    DEGRADED     0     0     0
            ata-ST4000VN008-2DR166_ZGY93K8Q           ONLINE       0     0    52
            ata-ST4000VN008-2DR166_ZGY906MN           DEGRADED     0     0    52  too many errors
            ata-ST4000VN006-3CW104_ZW628HH4           ONLINE       0     0    58
            ata-WDC_WD40EZRZ-00GXCB0_WD-WCC7K3ELTCJY  ONLINE       0     0    52

errors: Permanent errors have been detected in the following files:

        <metadata>:<0x744f>
        <metadata>:<0x7b9b>
        <metadata>:<0x79dc>
        RAID_HDD/vm-107-disk-0:<0x1>

dedup: no DDT entries
If someone has an idea how I can get my data back, I'll take it :/
Not sure what can be done, because everything I find requires importing the pool, and in my case that's not possible.
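(One read-only avenue that does not require a successful import is zdb, which can inspect an exported pool directly from the disks; a sketch, assuming the pool members show up under /dev/disk/by-id:)

  zdb -e -p /dev/disk/by-id RAID_HDD       # dump the pool configuration
  zdb -e -p /dev/disk/by-id -d RAID_HDD    # list datasets without importing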
-
What do you mean?
-
He means it might not be a drive failure, or not only a drive failure. It might be something system-wide: a power fluctuation, a misbehaving disk controller, memory corruption, a software bug, etc. Whatever it was could have happened before you spotted the original problem. The inability to import the pool after removing the disk might be only a consequence: removing the disk reduced data redundancy below some critical level, exposing actual corruption. That is why it is better to replace a disk without removing the old one first, if the hardware allows it and the old disk is still somehow alive.
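For example, with a free port you can attach the new disk alongside the failing one and replace it in place, so ZFS can still read from the old disk during the resilver; a sketch using the degraded disk's name from the status above and a placeholder for the new disk:

  # <new-disk-id> is a placeholder; use the real /dev/disk/by-id name
  zpool replace RAID_HDD ata-ST4000VN008-2DR166_ZGY906MN /dev/disk/by-id/<new-disk-id>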
-
Are there any updates on this?