Recover Data from SHR Array Failing During Rebuild Process

You replaced a 4 TB drive in your Synology SHR array with an 8 TB drive. The rebuild started, ran for a while — and then stopped. DSM shows Degraded, Crashed, or the progress bar has not moved in hours. The NAS may have become unresponsive entirely. This article covers data recovery after an SHR rebuild crash: what actually happened inside the array, how to read the current state without making it worse, and how to get your files back.

Before Touching Anything

Six actions feel logical when a rebuild freezes, and every one of them can turn a recoverable situation into a permanent loss:

🔄

Rebooting the NAS

A degraded array holds its member state in memory. A reboot forces mdadm to re-read superblocks from disk — if those superblocks are out of sync due to the interrupted rebuild, the array may fail to reassemble at all after the reboot.

💽

Pulling drives

Even the drive DSM marks as Faulty. Removing a member changes the array count and triggers mdadm to update superblocks on remaining drives, recording the removal as a permanent event. This can permanently exclude a drive that was actually readable.

🔧

Clicking Repair in Storage Manager

Repair starts a new rebuild attempt. If the original rebuild failed due to read errors on an existing drive, a second attempt reads the same sectors again — extending the damage on an already stressed drive and risking a second failure.

Force-powering off

A hard power cut during an active (even frozen) rebuild can write partial parity blocks to the new drive, leaving the array in a state where neither the old nor new data is consistent. Always use DSM’s shutdown procedure if the interface is still accessible.

🔬

Running fsck or btrfs check

Filesystem repair tools operate on the volume layer — one level above the array. Running them on a degraded array means they read reconstructed data that may contain parity errors, and can write corrupted metadata back to disk.

🔀

Adding another drive

Inserting a spare drive into a Crashed array causes DSM to attempt an automatic rebuild. Without understanding why the first rebuild failed, a second attempt runs into the same problem — and now with one more round of full-array reads on already stressed hardware.

Why the Rebuild Failed

When SHR replaces a drive of a different capacity, it does significantly more than copy data. The sequence looks like this:

  1. mdadm reads all data partitions from the remaining drives at sustained sequential throughput — for hours or days on multi-terabyte arrays.

  2. The missing blocks, data and parity alike, are recomputed from the surviving members and written to the new drive. For SHR with mixed drive sizes, mdadm uses multiple md devices of different sizes stacked together, so the rebuild is more complex than in a fixed-geometry RAID 5.

  3. LVM recalculates Physical Extent allocation across the expanded storage pool. If the new drive is larger, this means remapping the Volume Group layout — a separate operation that runs in parallel with or after the mdadm rebuild.

Any error at any stage aborts the sequence. Three root causes account for the majority of SHR rebuild failures:

🧱

Unrecoverable Read Errors (URE)

Consumer drives are typically rated at roughly one unrecoverable read error per 10^14 bits read. A full sequential pass over a 4 TB drive reads about 3.2 × 10^13 bits, so a read error somewhere along the way is a realistic risk, and that risk grows with every additional drive the rebuild has to read in full. During normal operation most of these sectors are rarely touched. During a rebuild, every sector is read, and a single unreadable sector stops parity computation for the entire stripe. The drive does not have to fail; it just has to produce one read error at the wrong moment.
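As a rough back-of-the-envelope check, the chance of hitting at least one URE during a full read pass can be estimated from the quoted error rate. The sketch below assumes independent bit errors at a uniform datasheet rate of one per 10^14 bits, which real drives do not strictly follow; the function name `ure_probability` is purely illustrative.

```python
import math

def ure_probability(capacity_tb: float, bits_per_error: float = 1e14) -> float:
    """Estimate the chance of at least one unrecoverable read error in one full pass.

    Assumes independent bit errors at the datasheet rate (a simplification).
    """
    bits_read = capacity_tb * 1e12 * 8      # decimal terabytes -> bits
    per_bit_error = 1.0 / bits_per_error
    # 1 - (1 - p)^n, computed stably for a tiny p and a huge n
    return -math.expm1(bits_read * math.log1p(-per_bit_error))

if __name__ == "__main__":
    for size in (4, 8, 12):
        print(f"{size} TB read in full: {ure_probability(size):.0%} chance of a URE")
```

Under these assumptions a single 4 TB pass comes out at a little over one in four, and the figure climbs as more terabytes have to be read to complete the rebuild.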

🔌

SATA timeouts under load

A cable or backplane connection that is marginal under normal workloads can fail consistently under the sustained high-throughput reads of a rebuild. The kernel logs a SATA error, mdadm interprets the drive as unreachable, and marks it as Faulty — even though the drive itself is physically healthy. The drive comes back after a reconnect, but mdadm has already removed it from the array.

⚙️

Background DSM tasks

Synology schedules S.M.A.R.T. tests, media indexing (for Photo Station, Video Station), and Btrfs scrubs automatically. Any of these running concurrently with a rebuild competes for the same disk I/O bandwidth. On a system already running a sustained full-disk read, additional I/O can push read latency high enough to trigger drive timeouts — producing the same outcome as a physical connection problem.

For a broader comparison of rebuild risk versus direct data recovery, see our article on RAID rebuild vs. software recovery.

Read the Current State of the Array

Before any recovery attempt, determine exactly what mdadm is reporting. If SSH access is available, two commands give you the full picture. For a complete walkthrough of interpreting mdadm output, see our mdadm RAID recovery guide. Below are the specific patterns to look for in this scenario.

cat /proc/mdstat — shows assembly status and, if a rebuild is running, the current progress and speed.

A frozen rebuild looks like this:

Rebuild stuck — progress not advancing
Personalities : [raid5] [raid6] [raid1]
md3 : active raid5 sdb3[0] sdc3[1] sdd3[2]
      5860468736 blocks super 1.2 level 5, 64k chunk, algorithm 2 [3/2] [UU_]
      [================>....]  recovery = 83.2% (2436352/2930234) finish=∞ speed=0K/sec
unused devices: <none>

finish=∞ and speed=0K/sec confirm the rebuild has stalled — mdadm is blocked waiting for a sector it cannot read.
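The same check can be scripted. The following is a minimal sketch that scans /proc/mdstat text for a recovery or resync line reporting zero speed; the function name and the returned dictionary layout are assumptions made for illustration, and the sample text is the frozen-rebuild output shown above.

```python
import re

# Progress line printed during a rebuild or resync, e.g. "recovery = 83.2% ... speed=0K/sec"
PROGRESS = re.compile(r"(recovery|resync)\s*=\s*([\d.]+)%.*?speed=(\d+)K/sec")
# Array header line, e.g. "md3 : active raid5 ..."
DEVICE = re.compile(r"^(md\d+)\s*:")

def find_stalled_rebuilds(mdstat_text: str) -> list[dict]:
    """Return one entry per md device whose rebuild reports zero speed."""
    stalled = []
    current = None
    for line in mdstat_text.splitlines():
        dev = DEVICE.match(line)
        if dev:
            current = dev.group(1)
            continue
        progress = PROGRESS.search(line)
        if progress and current and int(progress.group(3)) == 0:
            stalled.append({
                "device": current,
                "operation": progress.group(1),
                "percent": float(progress.group(2)),
            })
    return stalled

SAMPLE = """Personalities : [raid5] [raid6] [raid1]
md3 : active raid5 sdb3[0] sdc3[1] sdd3[2]
      5860468736 blocks super 1.2 level 5, 64k chunk, algorithm 2 [3/2] [UU_]
      [================>....]  recovery = 83.2% (2436352/2930234) finish=∞ speed=0K/sec
unused devices: <none>
"""

if __name__ == "__main__":
    print(find_stalled_rebuilds(SAMPLE))
```

On the NAS itself, the text would come from reading /proc/mdstat, which is a read-only operation. A single zero-speed reading is only a snapshot, so compare two samples taken several minutes apart before concluding that the rebuild has stalled.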

A crashed array looks like this:

Array inactive — not assembling
Personalities : [raid5] [raid6] [raid1]
md3 : inactive sdb3[0](S) sdc3[1](S)
      5860468736 blocks super 1.2
unused devices: <none>

inactive with (S) (spare) flags means mdadm has no active array — the devices are present but not assembled. Data is physically on the disks but inaccessible.
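A short sketch of how this pattern could be detected programmatically, assuming the mdstat layout shown above; the function name and return shape are illustrative, not part of any existing tool.

```python
import re

# "md3 : inactive sdb3[0](S) sdc3[1](S)" -> device name, state, member list
ARRAY_LINE = re.compile(r"^(md\d+)\s*:\s*(active|inactive)\s*(.*)$")
# "sdb3[0](S)" -> partition name, slot number, optional flags such as (S) or (F)
MEMBER = re.compile(r"(\w+)\[(\d+)\]((?:\([A-Z]\))*)")

def find_inactive_arrays(mdstat_text: str) -> dict[str, list[str]]:
    """Map each inactive md device to the member partitions still listed for it."""
    inactive = {}
    for line in mdstat_text.splitlines():
        match = ARRAY_LINE.match(line)
        if match and match.group(2) == "inactive":
            members = [m.group(1) for m in MEMBER.finditer(match.group(3))]
            inactive[match.group(1)] = members
    return inactive

CRASHED_SAMPLE = """Personalities : [raid5] [raid6] [raid1]
md3 : inactive sdb3[0](S) sdc3[1](S)
      5860468736 blocks super 1.2
unused devices: <none>
"""

if __name__ == "__main__":
    print(find_inactive_arrays(CRASHED_SAMPLE))
```

The returned member list shows which partitions the kernel still sees, which is useful to note down before any drive is moved to a recovery machine.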

The table below maps the DSM status you see to what is actually happening and what to do next:

| What DSM shows | Status | What it means | Do not do | Next step |
|---|---|---|---|---|
| Rebuild frozen, speed = 0 | Degraded | URE on an existing drive is blocking parity writes. Array degraded but intact. | Do not wait; do not restart rebuild | RS RAID Retrieve |
| One drive marked Faulty, rebuild stopped | Degraded | mdadm dropped a drive after repeated read or SATA errors. Running without redundancy. | Do not remove the Faulty drive | S.M.A.R.T. check, then RS RAID Retrieve |
| Storage Pool: Crashed | Crashed | mdadm could not maintain quorum. Array inactive; data present but inaccessible. | Do not click Repair; do not reboot | RS RAID Retrieve |
| NAS unresponsive, DSM not loading | Unknown | Possible kernel hang during rebuild I/O. Array state unknown. | Do not force power off if avoidable | Clean shutdown via power button hold, then RS RAID Retrieve |

Data Recovery with RS RAID Retrieve

💻 RS RAID Retrieve (Windows · Linux · macOS). Difficulty: Low

RS RAID Retrieve reconstructs the SHR array configuration from mdadm superblocks on the remaining drives, works with degraded arrays where one member is missing or marked Faulty, and provides read-only access to the volume for selective file recovery — without initiating another rebuild attempt.

Step 1 — Connect drives and check S.M.A.R.T.

Shut down the NAS cleanly if possible. Connect all drives, including the one DSM flagged as Faulty, to a recovery machine and open the built-in S.M.A.R.T. monitor in RS RAID Retrieve. Check every drive, not just the one that failed. A rebuild failure is often caused by read errors on an existing member that still appears healthy, not by the drive DSM flagged.

Step 2 — Image any drive with elevated S.M.A.R.T. values

If any drive shows non-zero Reallocated Sector Count, Pending Sectors, or Uncorrectable Errors, create a sector-level image of that drive using RS RAID Retrieve’s built-in imaging function before scanning. All subsequent recovery work is performed on the image. This protects the source drive from additional reads during the scan and prevents further degradation on a drive that is already stressed.
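If you prefer to cross-check the three counters outside the program, the decision rule can be sketched as a small parser. This assumes the attribute table printed by smartmontools (`smartctl -A`), which the article itself does not use, and the conventional attribute IDs 5, 197, and 198; vendors occasionally deviate from these, and the function name is hypothetical.

```python
# Conventional SMART attribute IDs for the three counters named above (vendor-dependent)
WATCHED = {
    5: "Reallocated_Sector_Ct",
    197: "Current_Pending_Sector",
    198: "Offline_Uncorrectable",
}

def attributes_needing_attention(smartctl_output: str) -> dict[str, int]:
    """Return the watched attributes whose raw value is non-zero."""
    flagged = {}
    for line in smartctl_output.splitlines():
        parts = line.split()
        # Attribute rows have at least 10 columns and start with a numeric ID
        if len(parts) < 10 or not parts[0].isdigit():
            continue
        attr_id = int(parts[0])
        if attr_id not in WATCHED:
            continue
        try:
            raw = int(parts[9])          # RAW_VALUE is the tenth column
        except ValueError:
            continue                     # unexpected raw format; review by hand
        if raw > 0:
            flagged[WATCHED[attr_id]] = raw
    return flagged

SMART_SAMPLE = """ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       8
  9 Power_On_Hours          0x0032   071   071   000    Old_age   Always       -       25412
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
"""

if __name__ == "__main__":
    print(attributes_needing_attention(SMART_SAMPLE))
```

Any non-empty result is a signal to image that drive first and work from the image.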

Step 3 — Automatic array reconstruction

RS RAID Retrieve reads the mdadm superblock from each connected drive or image, identifies the array UUID, member roles, RAID level, and stripe parameters, and reconstructs the SHR volume structure. For a degraded array with one missing or failed member, the program can reconstruct using the remaining drives — computing missing data from parity, the same way mdadm would in degraded mode, but without writing anything back to disk.
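The principle behind computing missing data from parity can be shown on a single stripe. This is a toy illustration of single-parity (RAID 5 style) XOR reconstruction, not a description of how RS RAID Retrieve or mdadm is implemented; it ignores chunk rotation and on-disk layout, and the sample byte strings are made up.

```python
from functools import reduce

def xor_blocks(blocks: list[bytes]) -> bytes:
    """XOR equal-length blocks byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# One stripe of a three-member array: two data chunks plus their parity chunk
data_a = b"family-photos-01"
data_b = b"tax-records-2023"
parity = xor_blocks([data_a, data_b])

# The member holding data_b is missing: rebuild its chunk from the survivors.
# Nothing is written anywhere; the result exists only in memory.
recovered_b = xor_blocks([data_a, parity])

if __name__ == "__main__":
    print(recovered_b == data_b)
```

The same identity explains why a single unreadable sector on a surviving member is fatal for that stripe: with one member already missing, there is no second source to XOR against.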

Step 4 — Browse and recover files

Once the volume is reconstructed, RS RAID Retrieve presents it in read-only mode. Browse the folder tree, select the files you need, and save them to a separate destination, never to one of the array members.

🔍

Works on degraded and crashed arrays

Reconstructs SHR volumes from remaining members without requiring a complete, healthy array — including inactive arrays that mdadm refuses to assemble.

📊

S.M.A.R.T. monitor

Review drive health before scanning. Identifies which drive caused the rebuild failure and whether imaging is needed before recovery.

💾

Drive imaging

Create a sector-level image of a stressed drive before recovery. All work uses the image — protecting the original from additional read cycles.

🔗

SSH connection

If the NAS is still powered on and reachable over the network, RS RAID Retrieve can connect via SSH — without physically removing drives from the chassis.

When Software Recovery Is Not Enough

If multiple drives fail to appear when connected to the recovery machine, or if S.M.A.R.T. shows critical values across more than one array member, the situation has moved beyond software. An SHR-1 array with two failed drives has no parity to reconstruct from — there is no mathematical path to the missing data through software alone.

Stop and contact a recovery lab if you observe

  • Two or more drives not detected, or showing immediate S.M.A.R.T. failure on power-up
  • Clicking, grinding, or repeated failed spin-up on any drive
  • RS RAID Retrieve cannot reconstruct the array even in manual mode
  • Drives are hot to the touch within minutes of connection

Physical recovery — head replacement, platter transfer — requires a cleanroom environment. Each additional power cycle on a mechanically failing drive reduces recovery probability.

After Recovery: Preventing the Next Rebuild Failure

A rebuild crash during drive replacement is not random. It targets a specific vulnerability: all remaining drives are under maximum sustained read load at the exact moment the array has no redundancy. The following steps reduce the probability of repeating this scenario.

📋

Check S.M.A.R.T. before replacing a drive

Run a full S.M.A.R.T. extended test on all remaining drives before pulling the old one. A drive with reallocated sectors or pending errors will likely produce a URE during the rebuild that follows.

🔕

Disable background DSM tasks during rebuild

Go to Control Panel → Task Scheduler and suspend scheduled S.M.A.R.T. tests, Btrfs scrubs, and media library scans for the duration of the rebuild. Competing I/O is one of the most preventable causes of rebuild failure.

🔌

Reseat SATA cables before starting

A marginal connection that works under light load will fail under the sustained throughput of a multi-day rebuild. Unplug and reseat all SATA data and power cables before initiating the replacement process.

🗂️

Do not rely on drives from a single batch

Drives purchased at the same time from the same production run accumulate wear at the same rate. When one fails, its peers are statistically close behind. Source replacement drives from a different manufacturer or production batch.

🔔

Enable email notifications in DSM

Control Panel → Notification → Email. DSM can alert you the moment a drive is marked Faulty or a storage pool degrades. Catching the failure early — before the rebuild has run for 60 hours — preserves more options for recovery.

💾

Keep an independent backup

SHR provides fault tolerance, not backup. A degraded array during rebuild has no protection against a second failure. Hyper Backup to an external drive or cloud destination is the only guarantee that a rebuild crash does not become permanent data loss.

A rebuild failure during drive replacement is one of the more common SHR data loss scenarios precisely because it strikes at the worst possible moment: maximum I/O load on the oldest hardware in the array, with zero redundancy margin. Once your data is recovered, treat the event as a signal — not just about the drive that failed, but about the health of everything it was running alongside.

Frequently Asked Questions

Not necessarily, and this is one of the more dangerous misconceptions about RAID rebuilds. In RAID 5 and SHR, data is not written sequentially from drive to drive; parity is distributed across all members in stripes. A rebuild at 97% means 97% of the stripes have been recalculated and written, but the array is not consistent until the full 100% completes. An interrupted rebuild leaves the parity table partially updated, which means any stripe that spans the boundary between rebuilt and non-rebuilt regions is in an undefined state. You cannot selectively access "the part that finished": the volume either mounts cleanly as a whole or it does not mount at all.
Yes, but probably not because the new drive is defective. When mdadm marks a drive Faulty during a rebuild, it is responding to an event — a read error, a SATA timeout, or a command that did not complete within the kernel's timeout window. The new drive is a write target during the rebuild, not a read source. If it appears as Faulty, the most likely cause is a SATA connection issue (cable, backplane port, or controller slot) that manifested under the sustained write load of the rebuild. Before assuming the drive is dead, try reseating it in a different bay and connecting it with a different cable. The drive's S.M.A.R.T. data will be near-zero on a new unit and should not show errors — if it does, then the drive itself is the issue.
This is a reasonable option if the replacement NAS is identical or compatible, but it carries the same risk as any rebuild: if any of the existing drives has a URE or marginal health, the rebuild on the new hardware will hit the same problem. Before migrating, check S.M.A.R.T. on every drive. If all drives are healthy, Synology's HDD Migration procedure preserves the storage pool and volume configuration — DSM on the new unit will recognize the existing array and resume rather than rebuild from scratch. However, if the original rebuild crash was caused by a read error on an existing drive, migration does not fix that. The problematic drive goes with the array regardless of which chassis it sits in.
Rebuild speed on Synology hardware typically runs between 50–120 MB/s under ideal conditions — no competing I/O, healthy drives, stable connections. At 60 MB/s, a 4 TB rebuild takes roughly 18–19 hours; an 8 TB rebuild around 37 hours. Speed naturally fluctuates, and DSM throttles rebuild priority to keep the NAS usable, so a slow rebuild is not automatically a problem. A stalled rebuild is different: /proc/mdstat will show speed=0K/sec and finish=∞, and the percentage will not advance over 15–30 minutes. That specific combination — zero speed plus infinite estimated finish time — means mdadm is blocked on an unreadable sector and is retrying indefinitely. Waiting longer does not help; the sector will not become readable on its own.
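The duration figures above follow from simple division. A minimal sketch, assuming decimal units (1 TB = 1,000,000 MB) and a constant rebuild speed, which real rebuilds do not maintain; the function name is illustrative.

```python
def rebuild_hours(capacity_tb: float, speed_mb_s: float) -> float:
    """Hours needed to process capacity_tb (decimal TB) at a constant speed in MB/s."""
    seconds = (capacity_tb * 1_000_000) / speed_mb_s   # 1 TB = 1,000,000 MB
    return seconds / 3600

if __name__ == "__main__":
    for size in (4, 8):
        print(f"{size} TB at 60 MB/s: {rebuild_hours(size, 60):.1f} h")
```

Treat the result as a lower bound: throttling and competing I/O only make a real rebuild slower than this estimate.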
