For years, ECC (Error Correction Code) has been used to detect and correct data errors caused by flaws in magnetic and optical disk media. Now ECC is being used to detect and correct data errors over the SCSI bus. With faster data rates, demands for longer cables, and hot-pluggable drives in RAID subsystems, ECC has become a requisite in fault-tolerant disk storage systems.
When Fast SCSI was first proposed, it was envisioned for use with differential bus transceivers only, since many sources of data corruption are present on all busses, including noise, timing skews and signal reflections. It was well known that increasing the data rate from 5MHz to 10MHz would greatly exacerbate potential data integrity problems on the SCSI bus. Today, Ultra SCSI increases the data rate from 10MHz to 20MHz. Unfortunately, the trends toward smaller form-factor drives and reduced cost make differential SCSI undesirable. As a result, the industry has imposed greater restrictions on SCSI cabling schemes. The maximum "legal" cable length was decreased from 6 to 3 meters for Fast SCSI and to 1.5 meters for Ultra SCSI, while additional constraints were placed on the methods of connecting devices to the cable. Even with these restrictions, noise margins in Fast SCSI and Ultra SCSI systems are not as great as before.
The popularity of hot-pluggable SCSI devices in RAID disk arrays has greatly aggravated this problem by creating another source of noise on the SCSI bus. Hot-plugging drives on a live SCSI bus creates "glitches" due to the sudden change in bus capacitance. Although the use of low-glitch connectors minimizes the magnitude of these glitches, all sources of noise are additive and can sometimes exceed noise thresholds, resulting in SCSI bus data errors.
Most SCSI bus data errors go unnoticed by the user. The reason lies in the error detection mechanism used. A single parity bit is transferred for every eight data bits transferred across the SCSI bus. This parity scheme allows single-bit errors to be detected. In fact, any change to an odd number of bits will be detected, resulting in a SCSI bus parity error. In most systems, however, there is no way for this error to be reported to the user. The host adapter simply retries the transmission until it succeeds. Errors that change an even number of bits are not detected at all and go unnoticed until data or programs are affected. Marginal data integrity is often not apparent to the user until the data corruption has reached a level sufficient to hang the system or produce obviously incorrect results.
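The blind spot of a single parity bit can be illustrated with a short Python sketch. It assumes odd parity, which is what SCSI uses for each data byte. The function names and the sample byte are hypothetical, chosen only for illustration.

```python
def odd_parity_bit(byte: int) -> int:
    """Return the parity bit that makes the count of 1 bits (data + parity) odd."""
    ones = bin(byte & 0xFF).count("1")
    return 0 if ones % 2 == 1 else 1


def parity_ok(byte: int, parity: int) -> bool:
    """Receiver-side check: data bits plus parity bit must hold an odd number of 1s."""
    ones = bin(byte & 0xFF).count("1") + (parity & 1)
    return ones % 2 == 1


def flip_bits(byte: int, positions) -> int:
    """Simulate bus noise by inverting the given bit positions (0 = least significant)."""
    for pos in positions:
        byte ^= 1 << pos
    return byte


sent = 0b10110010                       # sample data byte (arbitrary)
p = odd_parity_bit(sent)                # parity bit sent alongside it

one_bit = flip_bits(sent, [3])          # single-bit error
two_bit = flip_bits(sent, [3, 6])       # double-bit error
three_bit = flip_bits(sent, [0, 3, 6])  # triple-bit error

for label, received in [("no error", sent), ("1 bit", one_bit),
                        ("2 bits", two_bit), ("3 bits", three_bit)]:
    status = "passes parity" if parity_ok(received, p) else "parity error"
    print(f"{label}: {status}")
```

The two-bit case passes the parity check even though the byte is wrong, which is exactly the kind of error that reaches data or programs without being noticed.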
ECC protection greatly reduces the chance of data corruption through two mechanisms. First, unlike parity, ECC detects practically all data errors. Second, most data errors can be corrected on-the-fly. This results in a level of fault-tolerance across the SCSI cable and data paths comparable with the levels expected from the rest of the storage system and server.
How does the ECC on DPT's SmartRAID array controllers and storage cabinets operate? As data is read from the computer's memory by the adapter, four 32-bit ECC words (16 bytes in all) are generated by custom ASIC circuitry and appended to each 512-byte block. SmartRAID disk drives are formatted to store 528-byte sectors rather than the typical 512 bytes, allowing the additional ECC words to be stored along with each sector. Later, when the data is read from the disk, the ECC is transferred back across the SCSI bus and checked by the adapter. Any detected data corruption, whatever its cause (the SCSI bus, drive or adapter electronics, or drive or adapter cache), will be logged, reported to the user and, in most cases, automatically corrected.
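The sector arithmetic and the end-to-end check can be sketched in Python. The ECC algorithm computed by the ASIC is not described here, so this sketch substitutes a stand-in: one CRC-32 (from Python's zlib module) over each 128-byte quarter of the block. That substitution, the quartering, and all names below are assumptions made for illustration. A CRC only detects errors, whereas a true ECC can also locate and correct them.

```python
import zlib

DATA_BYTES = 512        # payload per block
ECC_WORDS = 4           # four check words per block
ECC_WORD_BYTES = 4      # each word is 32 bits
SECTOR_BYTES = DATA_BYTES + ECC_WORDS * ECC_WORD_BYTES  # 512 + 16 = 528


def check_words(block: bytes) -> bytes:
    """Stand-in for the ASIC: one CRC-32 per 128-byte quarter (hypothetical scheme)."""
    quarter = DATA_BYTES // ECC_WORDS
    words = b""
    for i in range(ECC_WORDS):
        chunk = block[i * quarter:(i + 1) * quarter]
        words += zlib.crc32(chunk).to_bytes(ECC_WORD_BYTES, "big")
    return words


def encode_sector(block: bytes) -> bytes:
    """Append the check words to a 512-byte block, giving a 528-byte sector."""
    if len(block) != DATA_BYTES:
        raise ValueError("block must be exactly 512 bytes")
    return block + check_words(block)


def verify_sector(sector: bytes) -> bool:
    """Recompute the check words on the way back and compare with the stored ones."""
    if len(sector) != SECTOR_BYTES:
        raise ValueError("sector must be exactly 528 bytes")
    data, stored = sector[:DATA_BYTES], sector[DATA_BYTES:]
    return check_words(data) == stored


block = bytes(range(256)) * 2            # sample 512-byte payload
sector = encode_sector(block)

damaged = bytearray(sector)
damaged[10] ^= 0b00000011                # two bits flipped in one byte
damaged = bytes(damaged)

print(len(sector))
print(verify_sector(sector))
print(verify_sector(damaged))
```

The damaged sector carries a two-bit error within a single byte, which byte parity alone would pass, yet the per-block check words flag it because they travel with the data across every stage of the path.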
With ECC becoming increasingly common in system RAM for high-end fileservers and used universally on disk drives, the SCSI bus stands out today as by far the most likely cause of data corruption. By ECC-protecting each data block, DPT's SmartRAID controllers allow truly fault-tolerant storage subsystems to be built which meet the demanding requirements of today's enterprise-wide server environments.