Hi all,


UPDATE: I closed the post (the timebox I gave myself to understand the issue is now over). Thank you all for the help ^^


DISCLAIMER: The objective of this post is to understand how people would debug issues like these when real data is involved and get to the bottom of the problem. The objective is NOT to “restore service” but to understand what failed. The tone of the post is voluntarily not serious to keep it light.


I am playing a little with TrueNAS SCALE and ZFS. I was trying to use a second NVMe disk, attached via USB, as the target of a once-a-day replication of the main pool. However, this secondary pool kept ending up SUSPENDED for “too many errors”. The pool is not read or written directly by users or apps; it is just there to be “replicated on” once a day.

Now, please, I know that using disks via USB is not advised. Also I am not interested in recovering the data, since there is nothing real on it. What I am doing is testing to see if the system is brittle, and if it is, how to debug if there is a real issue.

Now to the point. The pool is SUSPENDED. Good. Why? I mean, the real reason why. To see if the system can be used in real life it needs to be debuggable.

Let’s start. The pool is SUSPENDED:

pool: tank-02
state: SUSPENDED
status: One or more devices are faulted in response to IO failures.
action: Make sure the affected devices are connected, then run 'zpool clear'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-JQ
config:

        NAME                   STATE     READ WRITE CKSUM
        tank-02                UNAVAIL      0     0     0  insufficient replicas
          xxx-xxx-xxx-xxx-xxx  FAULTED      3     0     0  too many errors

errors: 4 data errors, use '-v' for a list
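
For what it is worth, the only hard numbers in that output are the per-vdev counters: 3 read errors, 0 write errors and 0 checksum errors on the single device. Here is a rough sketch of pulling the non-ONLINE rows out of a `zpool status` dump so they can be logged over time. The function name `unhealthy_vdevs` and the regular expression are my own assumptions, not anything ZFS ships, and the parser only handles plain integer counters as shown in this post.

```python
import re

# Config table copied from the status output above (device name is redacted in the post).
STATUS = """\
state: SUSPENDED
config:

        NAME                   STATE     READ WRITE CKSUM
        tank-02                UNAVAIL      0     0     0  insufficient replicas
          xxx-xxx-xxx-xxx-xxx  FAULTED      3     0     0  too many errors

errors: 4 data errors, use '-v' for a list
"""

# name, state, three integer counters, optional trailing note
ROW = re.compile(r"^\s*(\S+)\s+([A-Z]+)\s+(\d+)\s+(\d+)\s+(\d+)\s*(.*)$")


def unhealthy_vdevs(status_text):
    """Return one dict per table row whose state is not ONLINE."""
    rows = []
    for line in status_text.splitlines():
        match = ROW.match(line)
        if not match:
            continue  # header line and everything outside the table
        name, state, read, write, cksum, note = match.groups()
        if state != "ONLINE":
            rows.append({
                "name": name,
                "state": state,
                "read": int(read),
                "write": int(write),
                "cksum": int(cksum),
                "note": note.strip(),
            })
    return rows


for row in unhealthy_vdevs(STATUS):
    print(row)
```

Read errors with zero checksum errors suggest the device did not answer at all, rather than answering with bad data, but that is an interpretation and not something the output states.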

To which you may ask: why? Too many errors (the -v says nothing more). Well that doesn’t help, does it. When you run zpool clear:

# zpool clear tank-02   	 
cannot clear errors for tank-02: I/O error

Incredibly useful as you can see. dmesg to the rescue?

WARNING: Pool 'tank-02' has encountered an uncorrectable I/O failure and has been suspended.

Thanks? I guess. I know it is trying to safeguard data but again… why?

Before you ask:

  • SMART checks are good
  • Yes, I restarted the device. As soon as you try to use/mount/import you get to the same issues.
  • Nothing else peculiar in dmesg. Well, there was a “usb 2-4: USB disconnect, device number 12”, whatever the reason for it. And beats me why TrueNAS SCALE decided that setting /sys/module/usbcore/parameters/autosuspend to 2 is a good idea, but again, that is not the point. I need ZFS to tell me what the issue is from its point of view.
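
Since the kernel log contains both a USB disconnect and the ZFS suspension, one thing that can be checked mechanically is whether the two always show up together. Below is a minimal sketch that tags those two kinds of lines and reads the autosuspend setting. The patterns are built only from the two messages quoted in this post, and the names `classify_kernel_lines` and `read_autosuspend` are hypothetical helpers, not part of any tool.

```python
import re
from pathlib import Path

# Patterns derived from the two kernel messages quoted in this post.
PATTERNS = {
    "usb_disconnect": re.compile(r"usb \S+: USB disconnect"),
    "zfs_suspended": re.compile(
        r"Pool '[^']+' has encountered an uncorrectable I/O failure"
    ),
}


def classify_kernel_lines(lines):
    """Return (label, line) pairs for lines matching a known pattern."""
    hits = []
    for line in lines:
        for label, pattern in PATTERNS.items():
            if pattern.search(line):
                hits.append((label, line.strip()))
                break
    return hits


def read_autosuspend(path="/sys/module/usbcore/parameters/autosuspend"):
    """Return the autosuspend delay in seconds, or None if it cannot be read."""
    try:
        return int(Path(path).read_text().strip())
    except (OSError, ValueError):
        return None


# The two messages as quoted in the post (timestamps were not given).
SAMPLE = [
    "usb 2-4: USB disconnect, device number 12",
    "WARNING: Pool 'tank-02' has encountered an uncorrectable I/O failure "
    "and has been suspended.",
]

for label, line in classify_kernel_lines(SAMPLE):
    print(label, "|", line)
print("autosuspend:", read_autosuspend())
```

On a live system the input could come from `dmesg -T`, so that the human-readable timestamps of the disconnect and of the suspension can be compared. As far as I know, 2 is the upstream kernel default for this parameter and a negative value disables autosuspend, so the setting may not be a TrueNAS decision at all.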

I have read a lot online. Maybe it is the temperature (USB enclosure heating up), maybe it is the cable, the power, “it is the USB controller”, or the chipset doing the USB -> NVMe conversion… However, they are not saying what to check. People are guessing. I have seen more tech behind reading tea leaves.

My question for you all is this: ZFS SUSPENDED one of my pools. It (seems to me) is refusing to fix it, refusing to do anything with it, and refusing to tell me why. So, in a real-world case, how do I debug it? If I have to trust my data to it, I don’t want the only option to be “use many disks and just replace one and the cable when ZFS goes poo-poo”.

How do I find out the cause?

Thank you for the help.

PS: I am sure I am missing some very basic ZFS knowledge on the topic, so please let me know what else I can do to make ZFS talk to me.

  • farcaller
    21 days ago

    Is there anything interesting at all reported in /proc/spl/kstat/zfs/dbgmsg?

    • justpassingby@sh.itjust.works (OP)
      20 days ago

      Thank you! A new path to check :) I had not come across this in my searches so far, so I added it to my documentation.


      Unfortunately it doesn’t tell me much, but I am really happy there is some new info here. I can see some FAILED steps, but they may just be related to the fact that it is a striped volume?

      1717612906   spa.c:6623:spa_import(): spa_import: importing tank-02
      1717612906   spa_misc.c:418:spa_load_note(): spa_load(tank-02, config trusted): LOADING
      1717612906   vdev.c:161:vdev_dbgmsg(): disk vdev '/dev/disk/by-partuuid/xxx-xxx-xxx-xxx-xxxx': best uberblock found for spa tank-02. txg 6462
      1717612906   spa_misc.c:418:spa_load_note(): spa_load(tank-02, config untrusted): using uberblock with txg=6462
      1717612906   spa.c:8925:spa_async_request(): spa=tank-02 async request task=4
      1717612906   spa_misc.c:404:spa_load_failed(): spa_load(tank-02, config trusted): FAILED: cannot open vdev tree after invalidating some vdevs
      1717612906   spa_misc.c:418:spa_load_note(): spa_load(tank-02, config trusted): UNLOADING
      1717612906   spa_misc.c:418:spa_load_note(): spa_load(tank-02, config trusted): spa_load_retry: rewind, max txg: 6461
      1717612906   spa_misc.c:418:spa_load_note(): spa_load(tank-02, config trusted): LOADING
      1717612907   spa_misc.c:418:spa_load_note(): spa_load(tank-02, config untrusted): vdev tree has 1 missing top-level vdevs.
      1717612907   spa_misc.c:418:spa_load_note(): spa_load(tank-02, config untrusted): current settings allow for maximum 0 missing top-level vdevs at this stage.
      1717612907   spa_misc.c:404:spa_load_failed(): spa_load(tank-02, config untrusted): FAILED: unable to open vdev tree [error=2]
      1717612907   spa_misc.c:418:spa_load_note(): spa_load(tank-02, config untrusted): UNLOADING
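
To make these lines a bit easier to read, here is a small sketch that keeps only the FAILED entries, turns the epoch timestamps into UTC times and decodes the `[error=N]` suffix. It assumes that the number is a standard errno value, which I believe is the case on Linux but have not confirmed in the ZFS source. The line layout is inferred from the excerpt above, and `failed_steps` is a made-up helper name.

```python
import errno
import os
import re
from datetime import datetime, timezone

# Lines copied from the dbgmsg excerpt above.
SAMPLE = """\
1717612906   spa.c:6623:spa_import(): spa_import: importing tank-02
1717612906   spa_misc.c:404:spa_load_failed(): spa_load(tank-02, config trusted): FAILED: cannot open vdev tree after invalidating some vdevs
1717612907   spa_misc.c:418:spa_load_note(): spa_load(tank-02, config untrusted): vdev tree has 1 missing top-level vdevs.
1717612907   spa_misc.c:404:spa_load_failed(): spa_load(tank-02, config untrusted): FAILED: unable to open vdev tree [error=2]
"""

# <epoch> <file>:<line>:<function>(): <message>
ENTRY = re.compile(r"^\s*(\d+)\s+(\S+?):(\d+):(\w+)\(\):\s+(.*)$")
ERRNO = re.compile(r"\[error=(\d+)\]")


def failed_steps(dbgmsg_text):
    """Return the FAILED entries with a UTC time and a decoded errno, if any."""
    steps = []
    for line in dbgmsg_text.splitlines():
        match = ENTRY.match(line)
        if not match or "FAILED" not in match.group(5):
            continue
        when = datetime.fromtimestamp(int(match.group(1)), tz=timezone.utc)
        step = {
            "time": when.isoformat(),
            "function": match.group(4),
            "message": match.group(5),
            "errno": None,
        }
        code = ERRNO.search(match.group(5))
        if code:
            number = int(code.group(1))
            # Assumption: the value is a plain errno number.
            step["errno"] = (errno.errorcode.get(number, "?"), os.strerror(number))
        steps.append(step)
    return steps


for step in failed_steps(SAMPLE):
    print(step["time"], step["function"], step["errno"], step["message"])
```

Under that assumption, error=2 is ENOENT ("No such file or directory"), which would be consistent with the device node being gone at the time of the import rather than with bad data on the disk. On the real system the input would be the contents of /proc/spl/kstat/zfs/dbgmsg.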
      

      It goes on and after a while:

      1717614235   spa_misc.c:2311:spa_import_progress_set_notes_impl(): 'tank-02' Finished importing
      1717614235   spa.c:8925:spa_async_request(): spa=tank-02 async request task=2048
      1717614235   spa_misc.c:418:spa_load_note(): spa_load(tank-02, config trusted): LOADED
      1717614235   metaslab.c:2445:metaslab_load_impl(): metaslab_load: txg 6464, spa tank-02, vdev_id 0, ms_id 95, smp_length 0, unflushed_allocs 0, unflushed_frees 0, freed 0, defer 0 + 0, unloaded time 1362018 ms, loading_time 0 ms, ms_max_size 8589934592, max size error 8589934592, old_weight 840000000000001, new_weight 840000000000001
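
A quick bit of arithmetic on the numbers in the two excerpts, as a sanity check. It assumes both excerpts come from the same boot and the same import sequence, which I have not verified; the constant names are mine.

```python
from datetime import timedelta

# Epoch timestamps taken from the two dbgmsg excerpts above.
IMPORT_STARTED = 1717612906   # "spa_import: importing tank-02"
IMPORT_FINISHED = 1717614235  # "'tank-02' Finished importing"

elapsed = timedelta(seconds=IMPORT_FINISHED - IMPORT_STARTED)
print("time between the two excerpts:", elapsed)

# ms_max_size from the metaslab_load line, converted to GiB.
MS_MAX_SIZE = 8589934592
print("ms_max_size:", MS_MAX_SIZE / 2**30, "GiB")
```

So roughly 22 minutes pass between the first failed load and the line saying the import finished, and the metaslab that gets loaded afterwards reports a maximum size of 8 GiB.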
      

      But I see no other issues. Are there any other paths/logs/ways I can query the system?