Iterative message-passing decoding with global code embedded with local code in time-division manner for fault tolerance improvement Patent Grant Li J [Alibaba Group Holding Limited]

Li J

Patent Grant 10530395

U.S. patent number 10,530,395 [Application Number 15/889,458] was granted by the patent office on 2020-01-07 for iterative message-passing decoding with global code embedded with local code in time-division manner for fault tolerance improvement. This patent grant is currently assigned to ALIBABA GROUP HOLDING LIMITED. The grantee listed for this patent is Alibaba Group Holding Limited. Invention is credited to Shu Li.

United States Patent	10,530,395
Li	January 7, 2020

Iterative message-passing decoding with global code embedded with local code in time-division manner for fault tolerance improvement

Abstract

The disclosed embodiments are directed to systems, devices, and methods for iterative message-passing decoding. In one embodiment, a method is disclosed comprising decoding a first codeword at a storage device using a detector and a decoder, the first codeword comprising a set of symbols from a first set of codewords; assigning, via the decoding, a set of confidence levels for each symbol in the first codeword; transmitting, by the storage device, the confidence levels to an iterative decoder; generating, by the iterative decoder, a second codeword based on the set of confidence levels, the second codeword excluding at least one symbol in the set of symbols; and iteratively decoding, by the iterative decoder, the second codeword using an erasure decoder; and transmitting, by the iterative decoder, soft information generated by the erasure decoder to the storage device for subsequent decoding by the storage device.

Inventors:

Li; Shu (Santa Clara, CA)

Applicant:

Name	City	State	Country	Type
Alibaba Group Holding Limited	Grand Cayman	N/A	KY

Assignee:

ALIBABA GROUP HOLDING LIMITED (Grand Cayman, KY)

Family ID:

67475753

Appl. No.:

15/889,458

Filed:

February 6, 2018

Prior Publication Data


	Document Identifier	Publication Date
	US 20190245563 A1	Aug 8, 2019

Current U.S. Class:	1/1
Current CPC Class:	H03M 13/6325 (20130101); H03M 13/45 (20130101); H03M 13/2957 (20130101); H03M 13/4138 (20130101); H03M 13/2906 (20130101); H03M 13/293 (20130101); H03M 13/154 (20130101); H03M 13/1111 (20130101); H03M 13/1515 (20130101); H03M 13/27 (20130101); H03M 13/1125 (20130101); H03M 13/2948 (20130101)
Current International Class:	H03M 13/00 (20060101); H03M 13/27 (20060101); H03M 13/41 (20060101); H03M 13/45 (20060101); H03M 13/29 (20060101); H03M 13/15 (20060101); H03M 13/11 (20060101)

References Cited

U.S. Patent Documents


7447981	November 2008	Cameron et al.
7536629	May 2009	Shen et al.
7549105	June 2009	Shen et al.
7716553	May 2010	Kim et al.
8166364	April 2012	Patapoutian et al.
8707144	April 2014	Wang et al.
8775896	July 2014	Li et al.
8887020	November 2014	Shokrollahi
9048867	June 2015	Liu et al.
9048870	June 2015	Li et al.
9077378	July 2015	Wu
9112530	August 2015	Gunnam et al.
9130589	September 2015	Li et al.
9559725	January 2017	Shao et al.
2008/0235561	September 2008	Yang
2009/0249163	October 2009	Ovchinnikov et al.
2010/0241923	September 2010	Wang et al.
2016/0197703	July 2016	Park et al.

Primary Examiner: Merant; Guerrier
Attorney, Agent or Firm: DeCarlo; James J. Greenberg Traurig, LLP

Claims

What is claimed is:

1. A method comprising: decoding a first codeword at a storage device using a detector and a decoder, the first codeword comprising a set of symbols from a first set of codewords; assigning, via the decoding, a set of confidence levels for each symbol in the first codeword; transmitting, by the storage device, the confidence levels to an iterative decoder; generating, by the iterative decoder, a second codeword based on the set of confidence levels, the second codeword excluding at least one symbol in the set of symbols; iteratively decoding, by the iterative decoder, the second codeword using an erasure decoder; and transmitting, by the iterative decoder, soft information generated by the erasure decoder to the storage device for subsequent decoding by the storage device.

2. The method of claim 1, further comprising: encoding a plurality of user data items using a one-pass Reed-Solomon decoder to obtain a set of encoded symbols representing the user data items.

3. The method of claim 2, further comprising: interleaving the set of encoded symbols to generate a plurality of interleaved codewords, the plurality of interleaved codewords including the first codeword; and transmitting the first codeword to the storage device prior to decoding the first codeword.

4. The method of claim 1, the decoding a first codeword at a storage device using a detector and a decoder comprising executing multiple iterations of decoding by the detector and the decoder, an output of the decoder providing an input to the detector and an output of the detector providing an input to the decoder.

5. The method of claim 4, the detector comprising a maximum a posteriori or soft output Viterbi algorithm detector, the decoder comprising a low-density parity check decoder.

6. The method of claim 5, the detector and decoder operating in parallel.

7. The method of claim 1, the generating a second codeword based on the set of confidence levels comprising: interleaving the set of symbols to form the second codeword; and sorting the set of symbols based on the set of confidence levels, the set of confidence levels comprising log likelihood ratio values generated by the detector and decoder.

8. The method of claim 1, further comprising updating the confidence levels associated with the symbols of the second codeword after iteratively decoding the second codeword; transmitting the updated confidence levels to the detector and decoder; and using the updated confidence levels during a subsequent decoding process executed by the detector and decoder.

9. The method of claim 1, the iteratively decoding the second codeword using an erasure decoder comprising executing multiple trials of decoding, each trial corresponding to a permutation of erased symbols in the second codeword.

10. A system comprising: a storage device comprising a detector and a decoder, the storage device configured to decode a first codeword using the detector and decoder, the first codeword comprising a set of symbols from a first set of codewords, and assign, via the decoding, a set of confidence levels for each symbol in the first codeword; a time-division scheduler configured to transmit the confidence levels to an iterative decoder; and an iterative decoder configured to generate a second codeword based on the set of confidence levels, the second codeword excluding at least one symbol in the set of symbols, iteratively decode the second codeword using an erasure decoder, and transmit soft information generated by the erasure decoder to the storage device for subsequent decoding by the storage device.

11. The system of claim 10, further comprising: a one-pass Reed-Solomon encoder configured to encode a plurality of user data items to obtain a set of encoded symbols representing the user data items.

12. The system of claim 11, the time-division scheduler further configured to: interleave the set of encoded symbols to generate a plurality of interleaved codewords, the plurality of interleaved codewords including the first codeword; and transmit the first codeword to the storage device prior to decoding the first codeword.

13. The system of claim 10, the storage device further configured to execute multiple iterations of decoding by the detector and the decoder, an output of the decoder providing an input to the detector and an output of the detector providing an input to the decoder.

14. The system of claim 13, the detector comprising a maximum a posteriori or soft output Viterbi algorithm detector and the decoder comprising a low-density parity check decoder.

15. The system of claim 14, the detector and decoder operating in parallel.

16. The system of claim 10, the iterative decoder further configured to: interleave the set of symbols to form the second codeword; sort the set of symbols based on the set of confidence levels, the set of confidence levels comprising log likelihood ratio values generated by the detector and decoder.

17. The system of claim 10, the time-division scheduler further configured to: update the confidence levels associated with the symbols of the second codeword after iteratively decoding the second codeword; transmit the updated confidence levels to the detector and decoder, the updated confidence levels used by the storage device during a subsequent decoding process executed by the detector and decoder.

18. The system of claim 10, the iterative decoder further configured to execute multiple trials of decoding, each trial corresponding to a permutation of erased symbols in the second codeword.

Description

COPYRIGHT NOTICE

This application includes material that may be subject to copyright protection. The copyright owner has no objection to the facsimile reproduction by anyone of the patent disclosure, as it appears in the Patent and Trademark Office files or records, but otherwise reserves all copyright rights whatsoever.

BACKGROUND

The disclosed embodiments are directed toward digital decoding and, specifically, to cooperative decoding in hyperscale data clusters.

A hyperscale data center employs storage nodes in the form of storage clusters. A storage cluster may include one or more storage devices organized into storage pools. These storage devices are used to support storage requirements of, for example, network applications.

To support high performance applications, a level of redundancy is needed to ensure that the failure of a single drive does not negatively impact downstream applications. One approach is to replicate data across drives. Thus, a single item of data is stored in separate, non-overlapping storage devices. The deficiency of this approach is that the amount of storage increases linearly with the amount of data stored. Since the amount of data used by network applications grows exponentially, the amount of storage devices needed to support this scheme grows exponentially and is thus impractical in terms of energy needed, costs of storage devices, and scalability.

Another approach is to organize storage devices in erasure coded pools. Erasure coded pools have the advantage of only storing data once. To support this reduced storage, additional computational complexity is required. However, the tradeoff is generally preferable to replicated systems. In an erasure coded pool, data is segmented into individual symbols. These symbols are then distributed to different storage devices. For example, a data word (ABCD) may be split into separate symbols (A, B, C, D) and stored in four separate drives.

To handle drive failures, a level of redundancy is needed to ensure that if one symbol is lost, it can be recovered. To accomplish this, many systems utilize Reed-Solomon (RS) encoding to add parity symbols to a given item of data (e.g., ABCD12, where 1 and 2 are parity symbols). The number of parity symbols dictates how many symbols can be recovered. For example, with two parity symbols, a system can detect two errors and correct one. These symbols are then distributed to different storage devices, where oftentimes dedicated storage devices are used to store parity symbols. Current systems generally use hard-decision RS decoding in order to detect and, if possible, correct erasures.
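The relationship between the parity count and what can be recovered can be made concrete. The following is a minimal sketch, assuming a maximum-distance-separable code such as RS, whose minimum distance is one more than the number of parity symbols; the function name `rs_capability` is hypothetical and is not part of the disclosure.

```python
def rs_capability(num_parity: int) -> dict:
    """Standard recovery bounds for an MDS code (e.g., Reed-Solomon).

    Assumes minimum distance d = num_parity + 1.
    """
    if num_parity < 0:
        raise ValueError("num_parity must be non-negative")
    return {
        # Errors at unknown positions that can be noticed but not fixed.
        "detectable_errors": num_parity,
        # Errors at unknown positions that can be fixed.
        "correctable_errors": num_parity // 2,
        # Losses at known positions (erasures) that can be fixed.
        "correctable_erasures": num_parity,
    }


print(rs_capability(2))
# {'detectable_errors': 2, 'correctable_errors': 1, 'correctable_erasures': 2}
```

Knowing where a symbol was lost doubles the number of symbols that can be restored, which is why marking an unreliable symbol as an erasure can be more effective than leaving it in place as a possible error.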

Current systems additionally utilize further encoding to protect against errors at various levels of the storage hierarchy. One current technique is to use low-density parity check (LDPC) algorithms to perform error correction on the data at the drive-level. In general, these algorithms are not concerned with the form of data encoded and decoded. Rather, LDPC codes are used simply to correct drive-level errors or channel errors. The data is then simply returned to the RS decoder and the RS decoder performs a hard decoding of the returned data. In some systems, RS decoding is skipped if all drives produced error-free data. Thus, the RS decoder is frequently unused. This results in hardware that consumes power and clock cycles while performing no useful work. Additionally, in current systems, the LDPC decoding employed is a hard-decision decoding. As is known, hard-decision decoding is time-consuming and results in slower responses to read requests.

BRIEF SUMMARY

The disclosed embodiments remedy the above and other deficiencies of existing decoding systems. Specifically, the disclosed embodiments describe an improved decoding architecture for distributing decoding work between global level erasure coding and drive-level error correction. To support this, the disclosed embodiments first utilize a soft-decision erasure decoder (e.g., a soft-decision RS decoder). The disclosed embodiments additionally utilize both an LDPC decoder and a MAP (maximum a posteriori)/SOVA (soft output Viterbi algorithm) detector to iteratively detect errors at the drive-level and assign associated log likelihood ratio (LLR) values for symbols stored on a given drive. These LLRs are fed to the erasure decoder which may then proactively generate a series of trials to execute with the soft-decision erasure decoder. Specifically, the LLRs are used to predict which symbols should be treated as erasures and the modified codeword is fed to the erasure decoder. In this manner, the system predicts the likely erasures. The erasure decoder may then feed the resulting decoded codeword back to each drive such that the confidence of the decoded symbols may be used by other drives in future iterative drive-level decoding.
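The idea of using LLRs to predict erasures and to schedule a series of trials can be sketched as follows. This is an illustrative sketch only: it assumes one LLR per symbol, treats a small LLR magnitude as low confidence, and enumerates trials smallest-first. The function name `erasure_trials` and its parameters are hypothetical, and the disclosure does not prescribe this ordering.

```python
from itertools import combinations


def erasure_trials(llrs, num_candidates, max_erasures):
    """Yield tuples of symbol positions to mark as erased, one tuple per trial."""
    # Rank positions by |LLR|; the least reliable symbols come first.
    ranked = sorted(range(len(llrs)), key=lambda i: abs(llrs[i]))
    candidates = ranked[:num_candidates]
    # Try single erasures first, then pairs, and so on.
    for size in range(1, max_erasures + 1):
        for positions in combinations(candidates, size):
            yield positions


llrs = [4.1, -0.3, 2.7, 0.9, -5.0, 3.2]  # made-up values for illustration
trials = list(erasure_trials(llrs, num_candidates=3, max_erasures=2))
print(trials)
# [(1,), (3,), (2,), (1, 3), (1, 2), (3, 2)]
```

Each tuple names the positions that a trial would hand to the erasure decoder as unknown; `max_erasures` would normally not exceed the number of parity symbols in the global code.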

In one embodiment, a method is disclosed comprising decoding a first codeword at a storage device using a detector and a decoder, the first codeword comprising a set of symbols from a first set of codewords; assigning, via the decoding, a set of confidence levels for each symbol in the first codeword; transmitting, by the storage device, the confidence levels to an iterative decoder; generating, by the iterative decoder, a second codeword based on the set of confidence levels, the second codeword excluding at least one symbol in the set of symbols; and iteratively decoding, by the iterative decoder, the second codeword using an erasure decoder; and transmitting, by the iterative decoder, soft information generated by the erasure decoder to the storage device for subsequent decoding by the storage device.

In another embodiment, a system is disclosed comprising a storage device comprising a detector and a decoder, the storage device configured to decode a first codeword using the detector and decoder, the first codeword comprising a set of symbols from a first set of codewords, and assign, via the decoding, a set of confidence levels for each symbol in the first codeword; a time-division scheduler configured to transmit the confidence levels to an iterative decoder; and an iterative decoder configured to generate a second codeword based on the set of confidence levels, the second codeword excluding at least one symbol in the set of symbols, iteratively decode the second codeword using an erasure decoder, and transmit soft information generated by the erasure decoder to the storage device for subsequent decoding by the storage device.

In another embodiment, a storage device is disclosed comprising a detector configured to: identify at least one path through a Viterbi trellis to decode a received codeword, and calculate soft outputs for each step along the path; and a decoder configured to: receive, from the detector, the soft outputs, execute an iterative decoding procedure using the codeword and the soft outputs generated by the detector, generate a set of confidence values associated with each symbol of the codeword, and transmit the set of confidence values to an iterative decoder.

BRIEF DESCRIPTION OF THE FIGURES

The foregoing and other objects, features, and advantages of the disclosure will be apparent from the following description of embodiments as illustrated in the accompanying drawings, in which reference characters refer to the same parts throughout the various views. The drawings are not necessarily to scale, emphasis instead being placed upon illustrating principles of the disclosure.

FIG. 1 is a block diagram illustrating an erasure-coded storage cluster according to some embodiments of the disclosure.

FIGS. 2A and 2B are flow diagrams illustrating write and read operations of an erasure-coded pool utilized by existing systems.

FIG. 3 is a block diagram illustrating an erasure-coded partitioning system according to some embodiments of the disclosure.

FIG. 4 is a flow diagram illustrating a method for iterative message-passing decoding with global code embedded with local code according to some embodiments of the disclosure.

FIG. 5 is a block diagram illustrating a global iterative decoding system according to some embodiments of the disclosure.

FIG. 6 is a diagram illustrating the swapping of decoding results between a drive-level decoder and a global erasure decoder according to some embodiments of the disclosure.

FIG. 7 illustrates a method for performing erasure decoding using soft-information decoding outputs according to some embodiments of the disclosure.

DETAILED DESCRIPTION

The present disclosure will now be described more fully hereinafter with reference to the accompanying drawings, which form a part hereof, and which show, by way of illustration, certain example embodiments. Subject matter may, however, be embodied in a variety of different forms and, therefore, covered or claimed subject matter is intended to be construed as not being limited to any example embodiments set forth herein; example embodiments are provided merely to be illustrative. Likewise, a reasonably broad scope for claimed or covered subject matter is intended. Among other things, for example, subject matter may be embodied as methods, devices, components, or systems. Accordingly, embodiments may, for example, take the form of hardware, software, firmware or any combination thereof (other than software per se). The following detailed description is, therefore, not intended to be taken in a limiting sense.

Throughout the specification and claims, terms may have nuanced meanings suggested or implied in context beyond an explicitly stated meaning. Likewise, the phrase "in one embodiment" as used herein does not necessarily refer to the same embodiment and the phrase "in another embodiment" as used herein does not necessarily refer to a different embodiment. It is intended, for example, that claimed subject matter include combinations of example embodiments in whole or in part.

In general, terminology may be understood at least in part from usage in context. For example, terms, such as "and", "or", or "and/or," as used herein may include a variety of meanings that may depend at least in part upon the context in which such terms are used. Typically, "or" if used to associate a list, such as A, B or C, is intended to mean A, B, and C, here used in the inclusive sense, as well as A, B or C, here used in the exclusive sense. In addition, the term "one or more" as used herein, depending at least in part upon context, may be used to describe any feature, structure, or characteristic in a singular sense or may be used to describe combinations of features, structures or characteristics in a plural sense. Similarly, terms, such as "a," "an," or "the," again, may be understood to convey a singular usage or to convey a plural usage, depending at least in part upon context. In addition, the term "based on" may be understood as not necessarily intended to convey an exclusive set of factors and may, instead, allow for existence of additional factors not necessarily expressly described, again, depending at least in part on context.

The present disclosure is described below with reference to block diagrams and operational illustrations of methods and devices. It is understood that each block of the block diagrams or operational illustrations, and combinations of blocks in the block diagrams or operational illustrations, can be implemented by means of analog or digital hardware and computer program instructions. These computer program instructions can be provided to a processor of a general-purpose computer to alter its function as detailed herein, a special purpose computer, ASIC, or other programmable data processing apparatus, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, implement the functions/acts specified in the block diagrams or operational block or blocks. In some alternate implementations, the functions/acts noted in the blocks can occur out of the order noted in the operational illustrations. For example, two blocks shown in succession can in fact be executed substantially concurrently or the blocks can sometimes be executed in the reverse order, depending upon the functionality/acts involved.

These computer program instructions can be provided to a processor of: a general purpose computer to alter its function to a special purpose; a special purpose computer; ASIC; or other programmable digital data processing apparatus, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, implement the functions/acts specified in the block diagrams or operational block or blocks, thereby transforming their functionality in accordance with embodiments herein.

For the purposes of this disclosure a computer readable medium (or computer-readable storage medium/media) stores computer data, which data can include computer program code (or computer-executable instructions) that is executable by a computer, in machine readable form. By way of example, and not limitation, a computer readable medium may comprise computer readable storage media, for tangible or fixed storage of data, or communication media for transient interpretation of code-containing signals. Computer readable storage media, as used herein, refers to physical or tangible storage (as opposed to signals) and includes without limitation volatile and non-volatile, removable and non-removable media implemented in any method or technology for the tangible storage of information such as computer-readable instructions, data structures, program modules or other data. Computer readable storage media includes, but is not limited to, RAM, ROM, EPROM, EEPROM, flash memory or other solid state memory technology, CD-ROM, DVD, or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other physical or material medium which can be used to tangibly store the desired information or data or instructions and which can be accessed by a computer or processor.

FIG. 1 is a block diagram illustrating an erasure-coded storage cluster according to some embodiments of the disclosure.

In the illustrated embodiment, a storage cluster (102) may include one or more erasure-coded storage pools such as pool (104). In some embodiments, the cluster (102) may comprise a plurality of pools geographically distributed. In some embodiments, the pools may be distributed over a wide area, over a local area, or over a combination of local areas and wide areas. In general, a storage cluster (102) may be accessed by client devices (e.g., over a network) and store user data transmitted to the cluster.

The illustrated storage cluster (102) distributes data into one or more erasure-coded pools. Although illustrated as a single pool (104), the system may include multiple pools and no limit is placed on the number of pools.

The pool (104) includes a plurality of storage devices (106A-106F and 108A-108B). In some embodiments, each of the storage devices comprises a solid-state device (SSD) or similar persistent storage device. In some embodiments, the storage devices may comprise object storage devices (OSDs). The particular type of storage device is not intended to be limited.

As illustrated, the storage devices include user storage devices (106A-106F) and parity drives (108A-108B). In one embodiment, user storage devices (106A-106F) are configured to store data provided by users (or applications) whereas parity drives (108A-108B) are used to store the parity bits generated by an erasure encoding algorithm. The procedure is depicted graphically in the lower portion of FIG. 1. In that portion, a piece of user data is received and segmented into portions (110A-110C). Additionally, the user data comprising portions (110A-110C) is encoded using, for example, a Reed-Solomon ("RS") encoder to generate a parity portion (112). The portions (110A-110C) are distributed among separate user drives (106A-106C) and the parity portion (112) is distributed to a parity drive (108A).

The above description of partitioning and encoding allows the system to recover from drive failures. For example, if drive (106A) fails, portion (110A) would be lost or corrupted. However, the system can recover the portion (110A) using the other portions (110B-110C) and the parity portion (112).
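The recovery described above can be illustrated with the simplest possible erasure code. The sketch below substitutes a single XOR parity portion for the RS code named in the text, purely to keep the example short; it assumes all portions have equal length, and the helper names are hypothetical.

```python
from functools import reduce


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def make_parity(portions):
    """Parity portion: XOR of all user-data portions."""
    return reduce(xor_bytes, portions)


def recover(surviving, parity):
    """Rebuild the single missing portion from the survivors and the parity."""
    return reduce(xor_bytes, surviving, parity)


portions = [b"AAAA", b"BBBB", b"CCCC"]   # stand-ins for portions 110A-110C
parity = make_parity(portions)            # stand-in for parity portion 112
rebuilt = recover(portions[1:], parity)   # the drive holding 110A has failed
print(rebuilt == portions[0])             # True
```

A single XOR parity tolerates one lost portion; an RS code with more parity portions tolerates correspondingly more.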

In addition to the RS encoding, each drive (106A-106F and 108A-108B) may employ an independent error correction encoding system. In some embodiments, this system may include a low-density parity check (LDPC) encoding and decoding algorithm (and associated hardware). In this embodiment, each individual drive (106A-106F and 108A-108B) encodes and decodes incoming and outgoing data, respectively, using a separate error-correction routine (e.g., an LDPC routine). In some embodiments, the devices use soft information evolution to decode the data stored on a given drive and ensure error-free operations.

FIGS. 2A and 2B are flow diagrams illustrating write and read operations of an erasure-coded pool utilized by existing systems.

In step 202, the method receives user data. As described above, user data may comprise any data supplied for storage in an erasure-coded pool.

In step 204, the method encodes the user data with, for example, an RS encoder to obtain a global erasure-coding (GEC) parity portion. For purposes of illustration, a piece of user data may include symbols A₁A₂A₃. After RS encoding, the method generates a codeword containing symbols A₁A₂A₃A₄, where A₄ comprises a parity check portion.

In step 206, the method distributes the codeword to the erasure-coded pool for storage. As described above, the method may split the codeword into separate symbols (A₁, A₂, A₃, and A₄) and transmit each symbol to a separate drive. In one embodiment, symbol A₄ may be transmitted to a designated parity drive.

In step 208, the method encodes the received codeword using an LDPC encoder. In one embodiment, step 208 is performed at each drive on the received codeword. As described previously, a received codeword may include portions of various codewords (e.g., portions A₁, B₁, C₁, D₁, where B₁, C₁, D₁ correspond to portions of other RS-encoded codewords). In one embodiment, the result of step 208 is an LDPC codeword comprising RS-encoded portions A₁, B₁, C₁, D₁ and a parity portion P₁.

In step 210, the method begins decoding a codeword stored on a given drive by retrieving the codeword from the drive. In one embodiment, step 210 may be executed in response to a file system command, network request, or other instruction to retrieve data at a given address.

In step 212, the method decodes the codeword using a hard LDPC decoder. As known in the art, a "hard" decoder is one that produces a binary output (e.g., a symbol is either correct or incorrect). In general, a hard decoder operates on binary values (a one or zero). In contrast, a soft decoder (also referred to as a soft-decision decoder) uses a continuous range of values as inputs, in the form of log likelihood ratios (LLRs). While soft-decision decoding provides improved decoding performance, the use of hard decoders is common due to the reduced complexity of designing a hard decoder.
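The difference between the two kinds of decoder input can be shown numerically. The sketch below assumes the common convention that an LLR is ln(P(bit=0)/P(bit=1)), so a positive value favors 0; the sign convention, the function names, and the probabilities are assumptions made for illustration.

```python
import math


def llr(p_zero: float) -> float:
    """Log likelihood ratio ln(P(bit=0) / P(bit=1)) given P(bit=0)."""
    if not 0.0 < p_zero < 1.0:
        raise ValueError("p_zero must lie strictly between 0 and 1")
    return math.log(p_zero / (1.0 - p_zero))


def hard_decision(llr_value: float) -> int:
    """Collapse an LLR to a single bit, discarding how confident it was."""
    return 0 if llr_value >= 0 else 1


soft = [llr(p) for p in (0.99, 0.55, 0.10)]   # what a soft decoder consumes
hard = [hard_decision(v) for v in soft]       # what a hard decoder consumes
print([round(v, 3) for v in soft])
print(hard)  # [0, 0, 1]
```

The first two bits look identical to a hard decoder, although the first is far more trustworthy than the second; a soft decoder keeps that distinction.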

In step 214, the method determines if the decoding of the LDPC codeword was successful. If so, the method returns the decoded codeword (step 218). Alternatively, if the method detects an error in the LDPC decoding, the method proceeds to perform RS decoding (step 216) to correct the errors using the parity portions (or user data portions) stored on other devices.
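The control flow of this existing read path can be summarized in a few lines. This is a sketch only: `read_codeword`, `fake_ldpc`, and `fake_rs` are hypothetical stand-ins, and no real LDPC or RS decoding is performed.

```python
def read_codeword(raw, ldpc_decode, rs_decode):
    """Existing-system read path: local LDPC first, global RS only on failure."""
    ok, data = ldpc_decode(raw)       # step 212: drive-level decode
    if ok:                            # step 214: did it succeed?
        return data, "ldpc"           # step 218: return the decoded codeword
    return rs_decode(), "rs"          # step 216: fall back to RS decoding


# Hypothetical stand-ins for the two decoders.
def fake_ldpc(raw):
    return (raw is not None, raw)


def fake_rs():
    return "rebuilt-from-other-drives"


print(read_codeword("A1B1C1D1", fake_ldpc, fake_rs))  # ('A1B1C1D1', 'ldpc')
print(read_codeword(None, fake_ldpc, fake_rs))        # ('rebuilt-from-other-drives', 'rs')
```

In this flow the RS decoder is reached only on the failure branch, so it does no work on a read that succeeds locally.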

As discussed previously, having each drive run an independent hard-decision LDPC decoding routine results in a time-consuming decoding process. This time consumption necessarily impacts the performance of the drive, and when combined, the overall erasure-coded pool and storage cluster. On the other hand, the erasure-coding decoder (e.g., the RS decoder) is idle for most of a read operation. Thus, the erasure-coding decoder is underutilized and inefficient.

FIG. 3 is a block diagram illustrating an erasure-coded partitioning system according to some embodiments of the disclosure.

As illustrated, user data (A₁A₂A₃, B₁B₂B₃, C₁C₂C₃, D₁D₂D₃) is received by the RS encoder (302) and is RS-encoded to generate codewords A₁A₂A₃A₄, B₁B₂B₃B₄, C₁C₂C₃C₄, D₁D₂D₃D₄, which comprise the original user data plus parity portions generated by the RS encoder. These codewords are partitioned into separate symbols and each symbol is distributed among drives (304A-304D). Thus, drive 1 (304A) receives portions A₁B₁C₁D₁, drive 2 (304B) receives portions A₂B₂C₂D₂, drive 3 (304C) receives portions A₃B₃C₃D₃, and drive 4 (304D) receives portions A₄B₄C₄D₄. The portions received by each drive are referred to as partitioned codewords.

Each partitioned codeword is then input into an LDPC encoder to generate a final codeword to be stored on the drive. For example, drive 1 (304A) encodes the partitioned codeword A₁B₁C₁D₁ to generate the final codeword A₁B₁C₁D₁P₁ where P₁ represents a parity portion of the final codeword.
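The partitioning amounts to a transpose: the i-th symbol of every RS codeword goes to drive i. The sketch below uses text labels in place of real symbols, and the appended "P" label is only a placeholder for the LDPC parity portion, not an actual LDPC encoding; the function name `partition` is hypothetical.

```python
def partition(codewords):
    """Transpose so that drive i receives the i-th symbol of every codeword."""
    return [list(column) for column in zip(*codewords)]


rs_codewords = [
    ["A1", "A2", "A3", "A4"],
    ["B1", "B2", "B3", "B4"],
    ["C1", "C2", "C3", "C4"],
    ["D1", "D2", "D3", "D4"],
]
partitioned = partition(rs_codewords)

# Placeholder for drive-level LDPC encoding: tag each drive's data with a parity label.
stored = [symbols + [f"P{i + 1}"] for i, symbols in enumerate(partitioned)]

print(partitioned[0])  # ['A1', 'B1', 'C1', 'D1']
print(stored[0])       # ['A1', 'B1', 'C1', 'D1', 'P1']
```

Because each drive holds one symbol from every RS codeword, a failure of one drive costs each codeword only a single symbol.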

In one embodiment, the RS encoder (302) may comprise a standalone processing device. The RS encoder (302) may include processing units (CPUs) in communication with a mass memory. Mass memory may include a RAM module and other storage means. Mass memory illustrates another example of computer storage media for storage of information such as computer-readable instructions, data structures, program modules or other data. Mass memory stores a basic input/output system ("BIOS") for controlling low-level operation of client device. The mass memory may also store an operating system for controlling the operation of RS encoder (302). It will be appreciated that this component may include a general purpose operating system such as a version of UNIX, or LINUX™, or a specialized client communication operating system such as Windows Client™, or the Symbian® operating system. The operating system may include, or interface with, a Java virtual machine module that enables control of hardware components and operating system operations via Java application programs.

Similarly, each storage device/drive (304A-304D) may include one or more processors or microcontrollers controlling access to the underlying disk or disks. For example, a given drive may include a physical storage medium (e.g., hard disk platters, solid-state NAND Flash storage, etc.) and one or more components (e.g., microcontrollers, microprocessors, etc.) controlling access to and from the physical medium. In some embodiments, the decoders and detectors may be implemented in the microcontrollers or microprocessors. In some embodiments, the detectors and decoders may be implemented as standalone application-specific integrated circuits (ASICs) or as field-programmable gate arrays (FPGAs). In this embodiment, a microcontroller or processor at a drive may utilize the ASIC/FPGA to perform the decoding operations discussed herein.

FIG. 4 is a flow diagram illustrating a method for iterative message-passing decoding with global code embedded with local code according to some embodiments of the disclosure.

In step 402, the method retrieves a codeword from a given drive. As described above, the codeword may comprise a codeword encoded using a one-pass RS encoding procedure. In this procedure data enters the RS encoding system and is processed stage-by-stage by the RS encoder sequentially until written to the drive.

In contrast to current systems, the method in FIG. 4 does not use a hard-decision decoding scheme but rather utilizes a soft-information, iterative decoding mechanism, described in more detail herein.

In step 404, the method initializes soft information for the decoding process. In one embodiment, initializing soft information comprises extracting an initial confidence level associated with the codeword read in step 402. In one embodiment, the codeword read in step 402 may include an LLR associated with each returned symbol. That is, the method receives a confidence level (between 0 and 1) for each symbol (e.g., A.sub.1B.sub.1C.sub.1D.sub.1P.sub.1) in the retrieved codeword.

In step 406, the method runs a local decoding procedure to evolve the initial soft information. In one embodiment, the local decoding procedure comprises an iterative message-passing algorithm used to perform belief propagation on the initial soft information. As known in the art, the method may utilize Tanner graphs or other structures to pass messages between bit nodes and check nodes of the parity-check matrix associated with a given LDPC code used by the method.

In one embodiment, the method may execute step 406 until the soft information for each symbol exceeds a pre-defined threshold. Alternatively, the method may re-execute step 406 until the soft information for each symbol equals one. Alternatively, or in conjunction with the foregoing, the method may execute step 406 until a pre-defined period or number of iterations has been reached.
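The local decoding loop of step 406 and its stopping conditions can be sketched as follows. This is a minimal illustration under stated assumptions, not the decoder of the disclosure: it assumes the min-sum variant of belief propagation, a signed LLR convention (a positive value favors bit 0 and the magnitude is the confidence) rather than the 0-to-1 confidence scale mentioned above, and a toy (7,4) Hamming parity-check matrix standing in for a real LDPC matrix. The names H, min_sum_decode, threshold, and max_iters are hypothetical, and the additional parity-check test before exiting is likewise an assumption.

```python
# Toy parity-check matrix of a (7,4) Hamming code (assumption; a real
# LDPC matrix is much larger and sparser).
H = [
    [1, 1, 0, 1, 1, 0, 0],
    [1, 0, 1, 1, 0, 1, 0],
    [0, 1, 1, 1, 0, 0, 1],
]


def min_sum_decode(channel_llrs, H, threshold=0.0, max_iters=20):
    """Min-sum message passing on the Tanner graph of H.

    Returns (hard_bits, posterior_llrs, iterations_used).
    """
    n = len(H[0])
    # Bit nodes attached to each check node.
    checks = [[j for j in range(n) if row[j]] for row in H]
    # Check-to-bit messages, one per Tanner-graph edge, initially neutral.
    c2b = {(i, j): 0.0 for i, members in enumerate(checks) for j in members}
    posterior = list(channel_llrs)
    bits = [0 if llr >= 0 else 1 for llr in posterior]

    for iteration in range(1, max_iters + 1):
        # Bit-to-check: posterior minus the message that came from that check.
        b2c = {edge: posterior[edge[1]] - msg for edge, msg in c2b.items()}
        # Check-to-bit: product of signs and minimum magnitude of the others.
        for i, members in enumerate(checks):
            for j in members:
                others = [b2c[(i, k)] for k in members if k != j]
                sign = -1.0 if sum(v < 0 for v in others) % 2 else 1.0
                c2b[(i, j)] = sign * min(abs(v) for v in others)
        # Updated soft information for every bit.
        posterior = list(channel_llrs)
        for (i, j), msg in c2b.items():
            posterior[j] += msg
        bits = [0 if llr >= 0 else 1 for llr in posterior]

        parity_ok = all(sum(bits[j] for j in members) % 2 == 0 for members in checks)
        confident = all(abs(llr) >= threshold for llr in posterior)
        if parity_ok and confident:
            return bits, posterior, iteration  # threshold reached, stop early
    return bits, posterior, max_iters  # iteration budget exhausted


# All-zero codeword in which bit 3 was received weakly and with the wrong sign.
bits, llrs, used = min_sum_decode(
    [3.0, 3.0, 3.0, -1.0, 3.0, 3.0, 3.0], H, threshold=1.0
)
```

Here the loop exits as soon as every posterior magnitude reaches the threshold, and otherwise falls back to the iteration limit, mirroring the two alternatives described above.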

At this point, the method has generated a codeword A.sub.1B.sub.1C.sub.1D.sub.1, and a set of LLRs for each symbol in the codeword. Each drive may then transmit the codeword and the LLRs to an RS decoder for further processing.

In step 408, the method concatenates the portions of each codeword to reform the original RS-encoded codewords (e.g., A.sub.1A.sub.2A.sub.3A.sub.4 as illustrated in FIG. 3).
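As a rough sketch of step 408, the re-concatenation can be viewed as a transpose: the k-th symbol from every drive is gathered into the k-th RS codeword. The function name reconcatenate, the (symbol, confidence) tuple layout, and the placeholder confidence of 0.9 are assumptions made only for illustration.

```python
def reconcatenate(drive_codewords):
    """Turn per-drive data portions back into the global (RS) codewords.

    drive_codewords[d][k] is the k-th (symbol, confidence) pair decoded on
    drive d; entry [k][d] of the result is that same pair.
    """
    return [list(column) for column in zip(*drive_codewords)]


# Four drives, each holding one symbol of codewords A, B, C and D.
drives = [[(f"{letter}{d}", 0.9) for letter in "ABCD"] for d in range(1, 5)]
rs_codewords = reconcatenate(drives)
first_codeword = [symbol for symbol, _ in rs_codewords[0]]
```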

As illustrated, steps 402-406 are performed at a given drive and may be performed in parallel on multiple drives. The results of processing in steps 402-406 are transmitted to a global RS decoder which executes steps 408-412.

In step 410, the method sorts the codeword symbols of the re-concatenated codewords by the received soft information. As described above, in one embodiment, the soft information comprises a set of LLRs between zero and one. Thus, in step 410, the method sorts the decimal LLR values.

In step 412, the method erases the symbols having low soft information values.

As described above, a low soft information value (e.g., a low LLR) indicates that the symbol produced by the LDPC decoders in the individual drives may be incorrect. Thus, in step 412, the method removes those symbols having low soft information values, retaining only those symbols having high LLR values. In the illustrated embodiment, the erasure of low soft information values is similar to the erasure of a value in a traditional erasure decoding scheme. Thus, in some embodiments, the method may attempt to rebuild the erased codeword symbols using the RS decoder and the bits of the codeword (including parity bits).
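Steps 410 and 412 can be illustrated with a short sketch. It assumes confidences on the 0-to-1 scale described above and a hypothetical cut-off named threshold; the disclosure does not specify how "low" is determined, so the cut-off and the ERASED marker are illustrative only.

```python
ERASED = None  # hypothetical marker for an erased symbol position


def sort_and_erase(symbols, confidences, threshold):
    """Rank positions by confidence and erase those below the threshold."""
    # Step 410: positions ordered from least to most reliable.
    order = sorted(range(len(symbols)), key=lambda i: confidences[i])
    # Step 412: erase the unreliable positions, keep the rest in place.
    erased = [i for i in order if confidences[i] < threshold]
    kept = [ERASED if confidences[i] < threshold else s
            for i, s in enumerate(symbols)]
    return order, erased, kept


order, erased, kept = sort_and_erase(
    ["A1", "A2", "A3", "A4"], [0.95, 0.2, 0.9, 0.4], threshold=0.5
)
```

Because the erased positions are known, the downstream decoder only has to recover values, not locate errors.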

As illustrated in FIG. 4, the method may repeat steps 402-412 in multiple iterations. That is, in one embodiment, the RS-decoded codeword may be transmitted back to the LDPC decoder in each drive. In one embodiment, the method may again segment the symbols of the codeword. Details of the iterative nature of the method illustrated in FIG. 4 are described further herein.

FIG. 5 is a block diagram illustrating a global iterative decoding system according to some embodiments of the disclosure. The system includes a plurality of drives (502A-502D), a time-division scheduler (506), and an iterative decoder (508).

In one embodiment, a given drive (e.g., 502A) comprises a storage device as described previously. In addition to the hardware supporting the storage of data, the drives include hardware and/or software to support local decoding of encoded data stored on the drives. In the illustrated embodiment, a given drive (502A) includes an LDPC decoder (504A) and a MAP or SOVA detector (504B). As used herein, a "MAP" detector refers to maximum a posteriori detector and a "SOVA" detector refers to a soft output Viterbi algorithm detector.

In some embodiments, each drive may only use a SOVA detector. As known in the art, a SOVA detector utilizes a soft-information Viterbi algorithm to identify paths through a Viterbi trellis to decode codewords. The SOVA detector calculates the path with a maximum probability of correctness. Soft outputs are calculated for each step along the path. In alternative embodiments, a MAP detector may be used. In general, a MAP detector is similar to a SOVA detector, with the primary difference being that a MAP detector utilizes a backward state processor. In general, a MAP processor selects an index that maximizes the a posteriori probability that a received codeword is correct.

In some embodiments, the SOVA and MAP detectors may be used together. In this embodiment, codewords are first processed by the SOVA detector in multiple iterations. After processing, frames with errors are forwarded to a MAP detector for further processing. In some embodiments, the soft output of the SOVA detector is used as the a priori input to the MAP detector.

As illustrated, the MAP/SOVA detector (504B) and LDPC decoder (504A) are connected in a loop. In one embodiment, the output of the MAP/SOVA detector (504B) is fed as input to the LDPC decoder (504A), and vice versa. As one example, during a first half of an iteration, the MAP/SOVA detector (504B) receives a codeword and attempts to calculate LLRs for each symbol in the codeword. In a first iteration, these LLRs may be based on channel observations. During a second half of the iteration, the MAP/SOVA detector (504B) forwards the codeword and LLRs to the LDPC decoder (504A), which performs a number of sub-iterations to decode the codeword. If the decoding succeeds, the drive may end the decoding process. Alternatively, in a second iteration, the LDPC decoder (504A) forwards the partially decoded codeword back to the MAP/SOVA detector (504B) for further processing. The LLRs calculated by the LDPC decoder (504A) may be used as the channel observations during the second detection phase.

As illustrated, the use of a MAP/SOVA detector (504B) and LDPC decoder (504A) may be exploited to double the throughput of the system. Specifically, each of the MAP/SOVA detector (504B) and LDPC decoder (504A) may operate on a unique processing object during a given iteration. Thus, each iteration involves the processing of two processing objects simultaneously.
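The alternation that keeps both units busy can be sketched as a simple schedule. This assumes two processing objects per drive and one pipeline-fill step in which the LDPC decoder has nothing to work on yet; the function name ping_pong_schedule and the object labels are hypothetical.

```python
def ping_pong_schedule(first, second, half_iterations):
    """Return (detector_object, decoder_object) for each half-iteration."""
    schedule = []
    for step in range(half_iterations):
        on_detector = first if step % 2 == 0 else second
        if step == 0:
            on_decoder = None  # pipeline fill: nothing detected yet
        else:
            # The decoder takes whatever the detector finished last step.
            on_decoder = second if step % 2 == 0 else first
        schedule.append((on_detector, on_decoder))
    return schedule


schedule = ping_pong_schedule("Y_1_p1", "Y_1_p2", half_iterations=4)
```

After the first step, each half-iteration has one object at the detector and the other at the decoder, which is the source of the throughput gain described above.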

Time-division scheduler (506) coordinates data transfer of decoded codewords from the drives (502A-502D) to iterative decoder (508). The details of the precise timing of transfers are discussed more fully herein and are incorporated herein by reference in its entirety. In general, as discussed above, the input to the decoding system in each drive is a received codeword with potential errors. The output of the drive-level decoding subsystem is a decoded codeword with LLRs for each symbol.

In one embodiment, the system illustrated in FIG. 5 may further include a one-pass Reed-Solomon encoder configured to encode a plurality of user data items to obtain a set of encoded symbols representing the user data items. In this embodiment, the time-division scheduler (506) may be further configured to interleave encoded symbols of the user data items to generate a plurality of interleaved codewords, the plurality of interleaved codewords including the first codeword. The time-division scheduler (506) may further be configured to transmit the first codeword to the storage prior to decoding the first codeword.

In one embodiment, time-division scheduler (506) is configured to monitor the decoding progress of the drives (502A-502D) and iterative decoder (508). In one embodiment, once the time-division scheduler (506) detects that one of the drives (502A-502D) or the iterative decoder (508) has completed decoding (or a pre-defined decoding window has expired), the time-division scheduler (506) swaps the data groups being processed by each respective component. That is, the time-division scheduler (506) swaps the group of data processed by iterative decoder (508) with the data group processed by drives (502A-502D).
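The swap condition can be sketched as a small state holder. The class name TimeDivisionScheduler, its fields, and the use of seconds for the decoding window are assumptions for illustration; the disclosure does not define a software interface for the scheduler (506).

```python
from dataclasses import dataclass


@dataclass
class TimeDivisionScheduler:
    drive_group: str       # data group currently at the drive-level decoders
    global_group: str      # data group currently at the iterative decoder
    window_seconds: float  # assumed length of the decoding window

    def maybe_swap(self, drives_done, global_done, elapsed_seconds):
        """Swap the data groups if either side finished or the window expired."""
        finished = any(drives_done) or global_done
        if finished or elapsed_seconds >= self.window_seconds:
            self.drive_group, self.global_group = (
                self.global_group,
                self.drive_group,
            )
            return True
        return False


scheduler = TimeDivisionScheduler(drive_group="Y", global_group="X", window_seconds=1.0)
still_running = scheduler.maybe_swap([False] * 4, False, elapsed_seconds=0.2)
swapped = scheduler.maybe_swap([False] * 4, True, elapsed_seconds=0.4)
```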

Additionally, time-division scheduler (506), via saturator (506A), may be configured to update confidence saturations for symbols passed in messages between the drives (502A-502D) and iterative decoder (508). In the illustrated embodiment, the storage cluster knows which groups of data are associated with each drive (502A-502D), and consequently knows the locations of any errors if the corresponding drive malfunctions or the codeword is determined to include a decoding failure. When the failed codeword is spread into multiple other partner codewords (e.g., one LDPC codeword is spread into multiple RS codewords, or one RS codeword is spread into multiple LDPC codewords), certain symbols can be identified as correct or incorrect from the origins of the spread fractions of the codeword and the result of the last decoding step. As described above and herein, a correct symbol is assigned maximal confidence to accelerate the next decoding iteration. For an incorrect symbol, the location is known but the correct value at that location is not yet known. In this case, the LLR denoting the confidence of the current sign is set to zero, which means the symbol is equally likely to take any value in its range. Resetting the LLR in this way also accelerates the iterative decoding by removing the negative influence of the incorrect symbols. When the soft information is passed between the drives (502A-502D) and the iterative decoder (508), the symbols are adjusted during the message passing based on the former decoding result.
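The confidence adjustment performed by the saturator (506A) can be illustrated as follows. The constants MAX_CONFIDENCE and NO_CONFIDENCE, the function name saturate, and the 0-to-1 scale are assumptions; the sketch only shows the two cases described above, leaving all other symbols untouched.

```python
MAX_CONFIDENCE = 1.0  # assumed top of the confidence scale
NO_CONFIDENCE = 0.0   # symbol is equally likely to take any value


def saturate(confidences, known_correct, known_incorrect):
    """Adjust per-symbol confidence using the previous decoding result."""
    updated = list(confidences)
    for position in known_correct:
        updated[position] = MAX_CONFIDENCE  # trust fully in the next pass
    for position in known_incorrect:
        updated[position] = NO_CONFIDENCE   # location known, value unknown
    return updated


adjusted = saturate([0.6, 0.7, 0.55, 0.9], known_correct={0, 3}, known_incorrect={2})
```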

Iterative decoder (508) receives the symbols and LLRs from the drives (502A-502D). The iterative decoder (508) includes a sorting reshuffle (510A) and a soft information decoder (510B). In the illustrated embodiment, soft information decoder (510B) may comprise a soft-information RS decoder or similar decoder. In general, the soft information decoder (510B) receives the symbols from time-division scheduler (506) and decodes the symbols using the soft-information generated by the drives (502A-502D). The results of the soft decoding are transmitted to the sorting reshuffle (510A), which reorders the symbols based on the soft information and identifies symbols that should be erased based on the low confidence level represented by the soft information (i.e., the LLRs).

In general, the arrangement of the soft information decoder (510B) and the reshuffle (510A) allows for a decoding procedure with a higher confidence than a hard decoder. The confidence is especially higher when the number of symbols to be erased exceeds the minimum distance of the code (e.g., RS code) used by the iterative decoder (508). Specifically, the soft information from the drives (502A-502D) assists in sorting the most likely candidates to be erased and then decoded. For example, if there are m.sub.1 symbols that are potentially incorrect, but the erasure decoding capacity is limited to correcting m.sub.2 (where m.sub.2<m.sub.1) wrong symbols, the reshuffle (510A) is configured to pick the m.sub.2 symbols with the lowest confidence, erase them, and only then perform the soft information decoding via decoder (510B).
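The m.sub.1 versus m.sub.2 selection can be made concrete with a small worked example. The function name choose_erasures and the dictionary layout are hypothetical; the example uses m.sub.1 = 4 suspect symbols and a capacity of m.sub.2 = 2.

```python
def choose_erasures(suspect_confidences, capacity):
    """Pick the `capacity` least reliable positions among the suspects.

    suspect_confidences maps symbol position -> confidence for the m1
    potentially incorrect symbols; capacity is m2.
    """
    ranked = sorted(suspect_confidences, key=suspect_confidences.get)
    return ranked[:capacity]


# m1 = 4 suspects, but only m2 = 2 erasures can be corrected.
to_erase = choose_erasures({0: 0.4, 2: 0.1, 5: 0.3, 6: 0.2}, capacity=2)
```

The two suspects that are not erased (positions 0 and 5 here) are passed to the decoder with their existing soft information.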

In some embodiments, there may exist multiple erasure candidates. In this embodiment, the system is configured to identify multiple permutations of erasures and try each permutation for decoding. In some embodiments, the system may set a predefined time window to try permutations.

FIG. 6 is a diagram illustrating the swapping of decoding results between a drive-level decoder and a global erasure decoder according to some embodiments of the disclosure.

As illustrated, iterative decoder (602) and drive decoder (604) operate in parallel. The illustrated embodiment illustrates the processing of each decoder at time.sub.1 and time.sub.2. In operation, the illustrated process continues until decoding is successful or a codeword cannot be decoded.

Iterative decoder-side codewords are denoted with alphabetical subscripts (A, B, C, D) whereas drive-side codewords are indicated by numerical subscripts (1, 2, 3, and 4). As described previously, codewords processed by iterative decoder (602) comprise erasure-coded versions of user data. In contrast, drive-side codewords comprise interleaved codewords synthesized from the erasure-coded versions. Details of the interleaving are illustrated in FIG. 3 and the disclosure accompanying FIG. 3 is incorporated herein by reference in its entirety.

In the illustrated embodiment, "X" and "Y" are utilized to illustrate groupings of data processed by the system. Thus, the iterative decoder-side codewords X_A, X_B, X_C, X_D are non-interleaved versions of the drive-side codewords X_1, X_2, X_3, X_4. Similarly, drive-side codewords Y_1, Y_2, Y_3, and Y_4 are interleaved versions of iterative decoder-side codewords Y_A, Y_B, Y_C, and Y_D. As illustrated, decoding by the iterative decoder (602) occurs in parallel with decoding by the drive-side decoder (604); however, each side operates on a different grouping of data.

Within each drive 1-4, the LDPC turbo decoding routine (discussed in more detail supra) works on two codewords (denoted as Y_1_p1 and Y_1_p2) using, for example, the MAP/SOVA detector and LDPC local decoder subsystem discussed previously. As described above, this subsystem operates iteratively as illustrated in sequence (606). In this sequence (606), only six decoding iterations are illustrated; however, more or fewer iterations may be utilized in actual practice. In general, the number of iterations of sequence (606) may be configured to match the number of iterations of sequence (608). Notably, the sequence (606) includes two rows of operations. The top row may correspond to processing by a MAP/SOVA detector while the bottom row corresponds to processing by an LDPC decoder. As illustrated, the codewords (Y_1_p1, Y_1_p2) are repeatedly alternated for processing as described in FIG. 5.

The Figure additionally illustrates sequence (608), which represents the iterative decoding trials performed in parallel with sequence (606). In one embodiment, the sequence (608) comprises a number of trials. Each trial represents a codeword (X_A) wherein one or more symbols in the codeword are erased (e.g., via a reshuffle) and passed to a soft-information decoder (e.g., a soft-information RS decoder). The results of the decoding are stored and the trial having the highest confidence level across the symbols is used as the output of the iterative decoder (602). As illustrated, processing on the iterative decoder-side may be performed in series while processing on the drive-side is performed in parallel. After processing is completed in time.sub.1, the system swaps data groups X and Y as illustrated in time.sub.2.

FIG. 7 illustrates a method for performing erasure decoding using soft-information decoding outputs according to some embodiments of the disclosure.

In step 702, the method receives LLR data for a plurality of code symbols in a given codeword. In step 702, the method may also receive the codeword symbols along with the LLRs. For example, the method may receive a codeword having symbols S1 through S8 and a set of LLRs represented by LLR1 through LLR8. As described above, these symbols were previously decoded by local LDPC decoding at individual drives. Since the symbols were interleaved, the method may receive symbols S1 and S2 (and LLRs LLR1 and LLR2) from a first drive, S3 and S4 (and LLRs LLR3 and LLR4) from a second drive, S5 and S6 (and LLRs LLR5 and LLR6) from a third drive, and S7 and S8 (and LLRs LLR7 and LLR8) from a fourth drive.

In one embodiment, in step 702 the method reconstructs the codewords from the individual symbols received from each drive. In one embodiment, the storage cluster operating the method includes an index of symbols stored on each drive and uses this index to re-concatenate the symbols.
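A minimal sketch of index-based reconstruction, using the S1 through S8 example above, might look as follows. The index layout (drive name mapped to zero-based codeword positions), the drive names, the LLR values, and the function name reassemble are assumptions for illustration; the disclosure does not define the index format.

```python
def reassemble(index, per_drive):
    """Rebuild one codeword from the pieces returned by each drive.

    index[drive] lists the codeword positions stored on that drive;
    per_drive[drive] lists the (symbol, llr) pairs in the same order.
    """
    length = sum(len(positions) for positions in index.values())
    codeword = [None] * length
    for drive, positions in index.items():
        for position, pair in zip(positions, per_drive[drive]):
            codeword[position] = pair
    return codeword


index = {"drive1": [0, 1], "drive2": [2, 3], "drive3": [4, 5], "drive4": [6, 7]}
per_drive = {
    "drive1": [("S1", 0.9), ("S2", 0.8)],
    "drive2": [("S3", 0.4), ("S4", 0.95)],
    "drive3": [("S5", 0.7), ("S6", 0.2)],
    "drive4": [("S7", 0.85), ("S8", 0.6)],
}
codeword = reassemble(index, per_drive)
```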

In step 704, the method sorts the received symbols based on the LLRs. As known in the art, a given LLR comprises a continuous value. Thus, in step 704 the method sorts the symbols by the LLRs in ascending or descending order.

In step 706, the method identifies a number of symbols having high LLRs and temporarily grants the sign of the LLR to the symbols identified. In one embodiment, granting the sign of the LLR comprises identifying that the symbol does not include an error. Thus, in step 706 the method temporarily sets a symbol with a high LLR as being properly decoded, thus having maximum confidence.

In step 708, the method erases those symbols having low LLR values. In one embodiment, erasing the symbols comprises setting the confidence level of the low-LLR symbols to a minimal confidence level. In other embodiments, the method may simply discard the symbols for decoding. In the illustrated embodiment, the erasure of symbols may be temporary.

In step 710, the method decodes the codeword. As described above, the method may utilize an iterative RS decoder to decode the codeword that contains one or more erased symbols. In one embodiment, the decoding may succeed. In that case the method may proceed directly to step 714.
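To show why an erased symbol at a known position can be rebuilt, the following sketch substitutes a single-parity (XOR) erasure code for the RS code. This is a deliberately simpler stand-in, not the RS decoder of the disclosure: it recovers at most one erasure, whereas an RS(n, k) code can recover up to n - k erasures. The function names and the use of None as the erasure marker are hypothetical.

```python
from functools import reduce
from operator import xor


def encode_single_parity(data):
    """Append one parity symbol equal to the XOR of all data symbols."""
    return list(data) + [reduce(xor, data, 0)]


def decode_erasures(codeword):
    """Repair a codeword in which None marks an erased symbol.

    Returns the repaired list, or None if more than one symbol is erased
    (beyond the capacity of this toy code).
    """
    erased = [i for i, s in enumerate(codeword) if s is None]
    if len(erased) > 1:
        return None
    repaired = list(codeword)
    if erased:
        survivors = (s for s in codeword if s is not None)
        repaired[erased[0]] = reduce(xor, survivors, 0)
    return repaired


original = encode_single_parity([0x41, 0x42, 0x43, 0x44])
damaged = list(original)
damaged[1] = None  # low-LLR symbol erased in step 708
repaired = decode_erasures(damaged)
```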

In step 712, the method determines if one or more trials remain to be executed. In some embodiments, the number of trials may be bounded only by a successful decoding. In other embodiments, the method may utilize a fixed number of trials to execute. In general, a trial refers to a permutation of symbols with signs granted in step 706 and symbols erased in step 708. In this manner, the method attempts to decode various permutations of codewords containing erased symbols. As known in the art, the LLR values represent a likely confidence. Thus, depending on the state of decoding, a given high or low LLR may be in error. Thus, the method attempts to try as many permutations of low and high LLRs as possible within a given time window. In this manner, the method can compensate for a partial decoding at the drive-level, thus reducing the number of iterations at each drive.
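The trial loop of steps 706 through 712 can be sketched as below. It assumes the candidate positions are already ranked from lowest to highest confidence, that each trial erases a fixed number of them, and that try_decode is a hypothetical callback returning the decoded codeword or None. The sketch enumerates erasure sets with itertools.combinations, since the order of erasures within one trial does not matter; this reading of "permutation" is an assumption.

```python
import itertools
import time


def run_trials(ranked_positions, erasures_per_trial, try_decode, window_seconds):
    """Try erasure patterns until one decodes or the time window expires.

    Returns (decoded, pattern, trials_run); decoded is None on failure.
    """
    deadline = time.monotonic() + window_seconds
    trials = 0
    # combinations() yields patterns built from the earliest (least
    # reliable) positions first.
    for pattern in itertools.combinations(ranked_positions, erasures_per_trial):
        if time.monotonic() >= deadline:
            break  # time window exhausted
        trials += 1
        decoded = try_decode(frozenset(pattern))
        if decoded is not None:
            return decoded, pattern, trials
    return None, None, trials


# Stub decoder: succeeds only when both truly wrong positions are erased.
true_errors = {1, 5}
stub = lambda erased: "decoded" if true_errors <= erased else None
decoded, pattern, trials = run_trials([3, 1, 5], 2, stub, window_seconds=5.0)
```

In this example the lowest-confidence position (3) is actually correct, so the first two trials fail and the third succeeds, which illustrates why more than one pattern is attempted.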

Once the method determines that decoding is successful, or no more iterations remain, the method proceeds to step 714.

In step 714, the method sets a maximum confidence value for the symbols decoded in step 710. In some embodiments, if the method is unable to decode all symbols of the given codeword, the method may update the received LLR with an LLR generated during the iterative decoding in step 710.

In step 716, the method transmits the decoded symbols and the LLRs to the drive-side decoders. As described previously, the drive-side decoders may utilize the symbols decoded in step 710 and use the updated LLRs to re-execute the MAP/SOVA detector and LDPC decoding procedure discussed previously.

As illustrated in previous Figures, the codewords on the iterative decoder-side are orthogonal to the drive-side codewords. Thus, any convergence on either side improves the performance of the other side. For example, if a drive-side decoding system cannot reliably decode symbol A.sub.1 (as part of codeword A.sub.1B.sub.1C.sub.1D.sub.1), the iterative decoder-side decoding system may be capable of decoding the codeword A.sub.1A.sub.2A.sub.3A.sub.4. In this case, the confidence of symbol A.sub.1 may be set to a maximum value. Thus, when the drive-side decoding system attempts to decode A.sub.1B.sub.1C.sub.1D.sub.1 again, the drive-side decoding system may set the confidence of A.sub.1 to a maximum value. Conversely, if the drive-side decoding system converges on a drive-side codeword, those symbols may be set to a maximum confidence level during processing by the iterative decoder.

For the purposes of this disclosure a module is a software, hardware, or firmware (or combinations thereof) system, process or functionality, or component thereof, that performs or facilitates the processes, features, and/or functions described herein (with or without human interaction or augmentation). A module can include sub-modules. Software components of a module may be stored on a computer readable medium for execution by a processor. Modules may be integral to one or more servers, or be loaded and executed by one or more servers. One or more modules may be grouped into an engine or an application.

Those skilled in the art will recognize that the methods and systems of the present disclosure may be implemented in many manners and as such are not to be limited by the foregoing exemplary embodiments and examples. In other words, functional elements being performed by single or multiple components, in various combinations of hardware and software or firmware, and individual functions, may be distributed among software applications at either the client level or server level or both. In this regard, any number of the features of the different embodiments described herein may be combined into single or multiple embodiments, and alternate embodiments having fewer than, or more than, all of the features described herein are possible.

Functionality may also be, in whole or in part, distributed among multiple components, in manners now known or to become known. Thus, myriad software/hardware/firmware combinations are possible in achieving the functions, features, interfaces and preferences described herein. Moreover, the scope of the present disclosure covers conventionally known manners for carrying out the described features and functions and interfaces, as well as those variations and modifications that may be made to the hardware or software or firmware components described herein as would be understood by those skilled in the art now and hereafter.

Furthermore, the embodiments of methods presented and described as flowcharts in this disclosure are provided by way of example in order to provide a more complete understanding of the technology. The disclosed methods are not limited to the operations and logical flow presented herein. Alternative embodiments are contemplated in which the order of the various operations is altered and in which sub-operations described as being part of a larger operation are performed independently.

While various embodiments have been described for purposes of this disclosure, such embodiments should not be deemed to limit the teaching of this disclosure to those embodiments. Various changes and modifications may be made to the elements and operations described above to obtain a result that remains within the scope of the systems and processes described in this disclosure.

* * * * *
