U.S. patent application number 11/666408 was published by the patent office on 2009-10-08 for method, operating system and computing hardware for running a computer program.
Invention is credited to Ralf Angerbauer, Rainer Gmehlich, Werner Harter, Thomas Kottke, Bernd Mueller, Yorck von Collani, Reinhard Weiberle.
Application Number | 20090254773 11/666408 |
Family ID | 35462117 |
Filed Date | 2009-10-08 |
United States Patent
Application |
20090254773 |
Kind Code |
A1 |
Weiberle; Reinhard ; et
al. |
October 8, 2009 |
Method, operating system and computing hardware for running a
computer program
Abstract
A method for running a computer program on computing hardware,
in particular on a microprocessor, is described, the computer
program including multiple program objects and errors being
detected in the method while running the computer program on the
computing hardware, the program objects being subdivided into at
least two classes and program objects of the first class being
repeated when an error is detected and, when an error is detected
in one program object of the first class, which has already been
sent for execution, this program object of the first class being
restarted instead of a program object of the second class.
Inventors: |
Weiberle; Reinhard;
(Vaihingen/Enz, DE) ; Mueller; Bernd; (Gerlingen,
DE) ; Harter; Werner; (Illingen, DE) ;
Angerbauer; Ralf; (Schwieberdingen, DE) ; Kottke;
Thomas; (Ehningen, DE) ; von Collani; Yorck;
(Beilstein, DE) ; Gmehlich; Rainer; (Ditzingen,
DE) |
Correspondence
Address: |
KENYON & KENYON LLP
ONE BROADWAY
NEW YORK
NY
10004
US
|
Family ID: |
35462117 |
Appl. No.: |
11/666408 |
Filed: |
October 19, 2005 |
PCT Filed: |
October 19, 2005 |
PCT NO: |
PCT/EP05/55392 |
371 Date: |
June 5, 2009 |
Current U.S.
Class: |
714/2 ;
714/E11.137 |
Current CPC
Class: |
G06F 11/0751 20130101;
G06F 11/0715 20130101; G06F 11/1438 20130101 |
Class at
Publication: |
714/2 ;
714/E11.137 |
International
Class: |
G06F 11/14 20060101
G06F011/14 |
Foreign Application Data
Date |
Code |
Application Number |
Oct 25, 2004 |
DE |
10 2004 051 991.9 |
Claims
1-15. (canceled)
16. A method for running a computer program on computing hardware
corresponding to a microprocessor, the computer program including
multiple program objects, comprising: detecting an error while the
computer program is running on the computing hardware; subdividing
the program objects into at least two classes; repeating program
objects of a first class when the error is detected; when the error
is detected in one program object of the first class, which has
already been sent for execution, this program object of the first
class is restarted instead of a program object of the second
class.
17. The method as recited in claim 16, wherein error detection is
performed only in classes which are restarted.
18. The method as recited in claim 16, wherein a total computing
time in one run is provided for all program objects and the total
computing time is divided in such a way that the program objects,
which are restarted when an error is detected, are allocated in the
error-free case a maximum of 50% of the total computing time in one
run.
19. The method as recited in claim 16, wherein the program objects
of the different classes are executed alternately.
20. The method as recited in claim 16, wherein the faulty program
object of the first class is executed instead of the directly
subsequent program object of the second class.
21. The method as recited in claim 16, wherein the program objects
are designed as tasks of the computer program, and at least one
task is executed again when an error is detected.
22. The method as recited in claim 16, wherein a program object
being executed at the point in time of the detection of the error
is executed again.
23. The method as recited in claim 16, wherein at least one defined
state of the program objects is created and saved during the
execution of the program objects at the start of execution of the
program objects.
24. The method as recited in claim 16, wherein the method is
used in a motor vehicle control unit.
25. The method as recited in claim 16, wherein the values of
quantities required for executing the program object are saved
during running of the computer program before execution of a
program object.
26. The method as recited in claim 16, wherein the system jumps
back to a certain program object at a preselectable return point in
the period of the computer program when an error is detected in a
computer program to be run periodically in a period.
27. The method as recited in claim 16, wherein the method is
executed on an operating system capable of running on computing
hardware.
28. A computing hardware for running a computer program including
multiple program objects, comprising: an error detection mechanism
for detecting an error during the running of the computer program
on the computing hardware, wherein: the error detection mechanism
causes at least one program object that had already been sent for
execution to be restarted, the program objects are subdivided into
at least two classes, and when an error is detected in one program
object of the first class, the program object of the first class is
restarted instead of a program object of the second class.
29. The computing hardware as recited in claim 28, wherein the
error handling mechanism has a trigger logic which restarts the at
least one program object on detection of an error.
30. The computing hardware as recited in claim 28, wherein a
real-time operating system runs on the computing hardware.
Description
FIELD OF THE INVENTION
[0001] The present invention relates to a method for running a
computer program on computing hardware, in particular on a
microprocessor. The computer program includes multiple program
objects. In this method, errors are detected during the running of
the computer program on the computing hardware.
[0002] The present invention also relates to an operating system
capable of running on computing hardware, in particular on a
microprocessor.
[0003] Finally, the present invention also relates to a computing
hardware for running a computer program including multiple program
objects. The computing hardware has an error detection mechanism
for detecting an error during the running of the computer program
on the computing hardware.
BACKGROUND INFORMATION
[0004] So-called transient errors may occur in running a computer
program on computing hardware. Since the structures on
semiconductor modules (so-called chips) are becoming progressively
smaller, but the clock rates of the signals are becoming
progressively higher and the signal voltages are becoming
progressively lower, there is an increased incidence of transient
errors. Transient errors occur only temporarily, in contrast with
permanent errors, and usually disappear spontaneously after a
period of time. In transient errors, only individual bits are
faulty and there is no permanent damage to the computing hardware.
Transient errors may have various causes such as electromagnetic
influences, alpha-particles or neutrons.
[0005] The emphasis in error handling in communications systems is
already on transient errors. It is known that when an error
is detected in communications systems (e.g., in a controller area
network, CAN), the erroneously transmitted data are resent.
Furthermore, the use of an error counter is known in communications
systems, which is incremented on detection of an error, is
decremented when there is a correct transmission, and prevents
transmission of data as soon as it exceeds a certain value.
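The counter behavior described for communications systems may be sketched as follows. This is a minimal illustration only: the class name `TransmitErrorCounter`, the step sizes, and the limit are assumptions chosen for the example and are not taken from the CAN specification or from this application.

```python
class TransmitErrorCounter:
    """Counter that rises on errors, falls on successes, and gates sending."""

    def __init__(self, limit, step_up=1, step_down=1):
        self.limit = limit          # hypothetical threshold
        self.step_up = step_up      # hypothetical increment per error
        self.step_down = step_down  # hypothetical decrement per success
        self.value = 0

    def record(self, ok):
        """Update the counter after one transmission attempt."""
        if ok:
            self.value = max(0, self.value - self.step_down)
        else:
            self.value += self.step_up

    def may_transmit(self):
        """Transmission is prevented once the counter exceeds the limit."""
        return self.value <= self.limit


counter = TransmitErrorCounter(limit=2)
for ok in (False, False, True, False, False):
    counter.record(ok)
```

After this sequence the counter has exceeded the assumed limit, so `may_transmit()` returns `False`.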
[0006] In the case of computing hardware for running computer
programs, however, error handling is performed essentially only for
permanent errors. Taking transient errors into account is limited
to incrementing and, if necessary, decrementing an error counter.
This counter reading is stored in a memory and may be read out
off-line, i.e., as diagnostic or error information during a visit
to a repair shop, e.g., in the case of computing hardware designed
as a vehicle control unit. Only then is it possible to respond
appropriately to the error.
[0007] Error handling via error counters thus, on the one hand,
does not allow error handling within a short error tolerance time,
which is necessary in particular for safety-relevant systems, and
also, on the other hand, does not allow constructive error handling
in the sense that the computer program is being run again properly
within the error tolerance time. Instead, in the related art, the
computer program is switched to emergency operation after exceeding
a certain value on the error counter. This means that a different
part of the computer program is run instead of the part containing
the error, and the substitute values determined in this way are
used for further computation. The substitute values may be modeled
on the basis of other quantities, for example. Alternatively, the
results calculated using the part of the computer program
containing the error may be discarded as defective and replaced by
standard values that are provided for emergency operation for
further calculation. The known methods for handling a transient
error of a computer program running on computing hardware thus do
not allow any systematic constructive handling of the transient
nature of most errors.
[0008] It is also known from the related art that transient errors
occurring in running a computer program on computing hardware may
be eliminated by completely restarting the computing hardware. This
approach is likewise unsatisfactory, because quantities
obtained in processing of the computer program to that point are
lost and the computing hardware is unable to fulfill its intended
function for the duration of the restart. This is unacceptable in
the case of safety-relevant systems in particular.
[0009] Finally, it is also known that, for error handling for
transient errors of a computer program run on computing hardware,
the computer program may be set back by a few clock pulses and
individual machine instructions of the computer program may be
repeated. This method is also known as micro-rollback. With the
known method, the system only returns by objects on a machine level
(clock pulses, machine instructions). This requires appropriate
hardware support on a machine level, which is associated with a
considerable complexity in the area of the computing hardware. It
is impossible for the known method to be executed exclusively under
software control.
[0010] The error handling mechanisms known from the related art are
unable to respond in a suitable manner to transient errors
occurring in running a computer program on computing hardware.
[0011] However, transient errors are especially frequent in future
technologies. If they are detected, e.g., via dual core mechanisms,
the question of error localization still remains to be answered in
order to identify the correct result. This is true even more so if
one has the goal that a transient error does not always result in
restarting the computer. As described, error localization can
typically only be achieved via comparatively complex methods.
[0012] The object of the present invention is to provide a
constructive means of handling transient errors in running a
computer program on computing hardware in such a way that the full
functionality and functional reliability of the computer system are
restored within the shortest possible error tolerance time.
[0013] To achieve this object, starting from a method of the type
mentioned at the outset, when an error is detected, at least one
program object that has already been sent for execution is set to a
defined state and started up again from this state.
[0014] On a system level, the question nevertheless remains how to
sensibly implement such a concept of task repetition. As a rule, it
is not the case that any erroneous task can simply be re-computed
since the additionally required computing time and also the point
in time used therefor are planned from the system viewpoint to be
used differently. If the workload of the processor is already close
to 100% (and this is generally the case), such an unscheduled
additional load (which a task repetition represents) generates a
system overload which typically may result in a crash. This is even
more pronounced when time-controlled systems are considered (which,
as it becomes apparent, will prevail at least to some extent). A
deadline violation is not tolerable in these systems, any more than
in most other hard real-time concepts.
[0015] From the system viewpoint, the consequence arises that the
additional load, which may result from a potential task repetition,
must be scheduled. If the computing time needed for a task
repetition is reserved after each task, then this may certainly
work; however, 100% additional performance must be paid for
compared to a system which does not handle errors. This is
unacceptable from the cost point of view.
[0016] Furthermore, it is the object of the present invention to
provide an optimum system strategy, which does not always schedule
the double computation of a task (thus generating a permanent and
very large overhead), and which at the same time solves the issue
of how to combine that with time-controlled approaches.
SUMMARY OF THE INVENTION
[0017] A method, an operating system, and computing hardware are
proposed within the scope of a system strategy according to the
present invention which make it possible to integrate the concept
of task repetition with a minimum of, or even without any,
performance overhead. Such a system strategy is an ancillary
condition for scheduling methods of tasks, functions, programs, or
program parts which are referred to in the following as a program
object or program objects in particular.
[0018] It is not necessary for the program object that is restarted
to have been executed completely on detection of the error. In the
sense of the present invention, program objects whose execution has
already begun but is not yet complete at the point in time of error
detection may also be restarted when an error occurs. According to
the present invention, when a transient or permanent error occurs,
at least one operating system object is executed again. The
advantages compared to the micro-rollback include in particular the
fact that repetition of a program object is implementable using
very little hardware support. At most, additional memory capacity
is needed to be able to store some information (e.g., input
quantities of the program object) required for renewed execution of
the program object. The actual administration of the method
according to the present invention may be performed by the
operating system of the computing hardware. In other words, the
method according to the present invention is implementable using
traditional commercial processors without requiring additional
hardware. However, it is of course also possible to implement the
method according to the present invention by using hardware
support.
[0019] Compared to the related art, it is thus more favorable to
re-compute the erroneous task, i.e., the erroneous function or the
erroneous program or program part or program object, or at least
the operating system object. If the error was originally transient,
the two outputs are equal in a re-computation, i.e., the error has
disappeared and it was handled by re-computing.
[0020] A method for running a computer program on computing
hardware, on a microprocessor in particular, is advantageously
described, the computer program including multiple program objects
and errors being detected while running the computer program on the
computing hardware using the method, the program objects being
subdivided into at least two classes, program objects of the first
class being repeated when an error is detected and, when an error
is detected in one program object of the first class, which has
already been sent for execution, this program object of the first
class being restarted instead of a program object of the second
class.
[0021] For this purpose, the detection of an error is performed
only in classes which are restarted.
[0022] A total computing time in one run is advantageously provided
for all program objects and the total computing time is divided in
such a way that the program objects, which are restarted when an
error is detected, are allocated a maximum of 50% of the total
computing time in one run when there is no error, the program
objects of the different classes being executed alternately and the
erroneous program object of the first class being executed or
restarted instead of the directly subsequent program object of the
second class.
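The strategy of paragraphs [0020] to [0022] may be illustrated by a small sketch. It assumes a strictly alternating schedule, a hypothetical `execute` callback that reports whether an error was detected, and task durations chosen so that each class-1 task fits into the slot of the class-2 task that follows it. None of these names appear in the application.

```python
def run_period(schedule, execute):
    """Run one period of an alternating class-1 / class-2 schedule.

    schedule: list of (name, task_class) tuples.
    execute:  callable(name) -> True if no error was detected.
    Returns the task names in the order they actually occupied slots.
    """
    trace = []
    retry = None  # class-1 task waiting to be repeated
    for name, task_class in schedule:
        if retry is not None and task_class == 2:
            # The faulty class-1 task takes the slot of the directly
            # following class-2 task, which is skipped in this period.
            # A second failure of the repeat is not handled in this sketch.
            trace.append(retry + " (repeat)")
            execute(retry)
            retry = None
            continue
        trace.append(name)
        ok = execute(name)
        # Error detection is evaluated only for the restartable class.
        if task_class == 1 and not ok:
            retry = name
    return trace


def class1_share(schedule, duration):
    """Fraction of the error-free computing time used by class-1 tasks."""
    total = sum(duration[name] for name, _ in schedule)
    class1 = sum(duration[name] for name, c in schedule if c == 1)
    return class1 / total


failures = {"A": 1}  # task A fails once (simulated transient error)

def execute(name):
    if failures.get(name, 0) > 0:
        failures[name] -= 1
        return False
    return True

schedule = [("A", 1), ("X", 2), ("B", 1), ("Y", 2)]
duration = {"A": 2, "X": 2, "B": 1, "Y": 3}  # hypothetical time units
trace = run_period(schedule, execute)
share = class1_share(schedule, duration)
```

In this example, task A is repeated in the slot planned for X, so no additional computing time is consumed, and the class-1 share of the error-free schedule stays below 50%.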
[0023] Error detection itself may be accomplished by any method. It
is conceivable to use any type of error detection mechanism capable
of detecting errors during running of a computer program (known as
concurrent checking). In a dual-core architecture, for example, the
entire computer core may be designed in duplicate. If the computer
cores are operated in a lock-step mode, it is possible to determine
by comparison for each instruction whether both computer cores are
supplying the same result. A difference in results would then
definitely indicate an error. This error detection mechanism thus
detects errors in real time during the execution of program
objects. The situation is similar with error detecting codes which
are used throughout the processor architecture or with duplicated
subcomponents of the computing hardware. All these error detection
mechanisms have in common the fact that they detect transient
errors very rapidly and supply an error signal when an error has
been detected.
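The comparison principle of lock-step operation may be sketched in software as follows. This is an illustration under stated assumptions: a real lock-step architecture compares results in hardware for each instruction, whereas the hypothetical function `lockstep` below compares the results of two callables for each input.

```python
def lockstep(step_a, step_b, inputs):
    """Run two redundant implementations side by side and compare results.

    Returns (results, error_index). error_index is None when every
    comparison matched, otherwise the index of the first mismatch.
    """
    results = []
    for i, x in enumerate(inputs):
        a, b = step_a(x), step_b(x)
        if a != b:
            return results, i  # error signal: the two "cores" disagree
        results.append(a)
    return results, None


def core(x):
    return x * 2

def faulty_core(x):
    y = x * 2
    # Simulated single-bit flip for one particular input.
    return y ^ 0b100 if x == 3 else y

results, error_index = lockstep(core, faulty_core, [1, 2, 3, 4])
```

The comparison signals that an error occurred at the third input, but it does not show which of the two results is the correct one.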
[0024] In response to such an error signal, an error handling
mechanism that repeats the program object may be triggered. If the
same error occurs again when execution is repeated, a permanent
error may be inferred or an error counter may be incremented, a
permanent error being inferred only when a certain value is
exceeded. However, if the error no longer occurs in renewed
execution of the program object, it may be assumed that the error
was a transient error. Even during renewed error-free execution of
the program object, the computer program is again ready for its
intended function. Availability is thus restored again after a very
short period of time. Repetition of at least one program object is
thus a good means for handling transient errors.
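The inference described above may be sketched as follows. The function name `run_with_repetition`, the return values, and the default of two attempts are assumptions made for illustration; the application leaves the concrete limits open.

```python
def run_with_repetition(task, max_attempts=2):
    """Execute task() and repeat it when an error is detected.

    task: callable returning (ok, value); ok is False when the error
          detection mechanism has signaled an error.
    Returns (classification, value, attempts).
    """
    for attempt in range(1, max_attempts + 1):
        ok, value = task()
        if ok:
            kind = "no error" if attempt == 1 else "transient"
            return kind, value, attempt
    # The error persisted through every permitted repetition.
    return "permanent", None, max_attempts


def make_flaky(fail_times, result):
    """Build a task that reports an error on its first fail_times calls."""
    state = {"left": fail_times}

    def task():
        if state["left"] > 0:
            state["left"] -= 1
            return False, None
        return True, result

    return task


transient_case = run_with_repetition(make_flaky(1, 42))
permanent_case = run_with_repetition(make_flaky(5, 42))
```

An error that disappears on the first repetition is classified as transient, while one that persists through all permitted attempts is classified as permanent.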
[0025] According to an advantageous refinement of the present
invention, the program objects are designed as runtime objects
(hereinafter specifically referred to as tasks) of the computer
program, at least one task being executed again on detection of an
error. A task is a typical object in particular on the operating
system level. A task may be repeated with minimal effort or even
under the exclusive control of the software, if desired.
[0026] According to a preferred specific embodiment of the present
invention, a program object, which is executed at the point in time
of the detection of the error, is restarted. Alternatively or
additionally, however, program objects that had already been
executed completely at the point in time of the detection of the
error may also be restarted and executed again.
[0027] According to the present invention, during execution of the
program objects, in particular at the start of execution of the
program objects, at least one defined status of the program objects
is generated and saved. This may be accomplished, for example, by
saving the values of all variables relevant for the status of the
program object.
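Saving a defined status at the start of execution may be sketched as follows. The class `Checkpoint` and the example task `accumulate` are hypothetical; the sketch assumes that the status of the program object can be captured as a dictionary of its relevant variables.

```python
import copy


class Checkpoint:
    """Holds a copy of all variables relevant to a program object's status."""

    def __init__(self):
        self._saved = None

    def save(self, variables):
        # Deep copy so later changes by the task cannot alter the saved state.
        self._saved = copy.deepcopy(variables)

    def restore(self):
        return copy.deepcopy(self._saved)


def accumulate(variables, inject_error=False):
    """Example task that modifies its own state while running."""
    variables["total"] += variables["delta"]
    if inject_error:
        variables["total"] ^= 1 << 4  # simulated bit flip
    return variables["total"]


state = {"total": 10, "delta": 5}
checkpoint = Checkpoint()
checkpoint.save(state)                         # defined status at task start
first = accumulate(state, inject_error=True)   # erroneous run
state = checkpoint.restore()                   # back to the defined status
second = accumulate(state)                     # renewed execution
```

Because the task overwrites `total`, a renewed execution without restoring the saved values would start from the corrupted value instead of the defined status.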
[0028] In addition, another piece of computing hardware, operating
as a redundant system to the computing hardware on which the
computer program having the multiple program objects is executed,
is used for error detection. More than one piece of redundant
computing hardware may of course also be used for error
detection.
[0029] The method according to the present invention is
advantageously used in a motor vehicle, in particular a motor
vehicle control unit, to ensure a secure and reliable run of the
computer program despite unavoidable transient errors in running a
computer program. This is important in particular in running
control programs and/or regulating programs in safety-critical
applications in a motor vehicle.
[0030] It is also proposed that a permanent error be inferred if
the same error occurs again in renewed execution of the at least
one program object. It is also conceivable for a permanent error to
be inferred only when the error occurs repeatedly after a
preselectable number of repetitions of the program object. In this
case, a transient error is inferred even if it fails to occur only
after a third repetition or an even later repetition of the program
object. In this refinement of the present invention, important
program objects may thus be repeated, for example, three times
instead of only twice.
[0031] According to another advantageous refinement of the present
invention, the number of repetitions of the at least one program
object is limited to a preselectable value. This prevents the same
program object from being repeated indefinitely in the case of a
permanent error. The number of repetitions of the at least one
program object may be limited, for example, by using a counter or
based on time limits. Furthermore, the specification of the
task-dependent repetition value allows for important tasks to be
repeated more often than less important tasks and thus important
tasks are given an opportunity more often and/or for a longer
period of time to run error-free without transient errors, whereas
a permanent error may be inferred relatively rapidly in the case of
less important tasks, and another system response may be
initiated.
[0032] According to another preferred specific embodiment of the
present invention, the number of repetitions of the at least one
program object is limited dynamically to a preselectable value. The
number of repetitions of the at least one program object is
advantageously limited dynamically to a preselectable value as a
function of a remaining time for scheduling. In this way, a first
task and a second task, for example, may be run through while a
third task may be repeated several times.
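A task-dependent limit combined with a dynamic limit based on the remaining scheduling time may be sketched as follows. The task names, the table `STATIC_LIMIT`, and the function `allowed_repetitions` are assumptions made for illustration.

```python
# Hypothetical task-dependent repetition values: important tasks
# may be repeated more often than less important ones.
STATIC_LIMIT = {"brake_control": 3, "comfort_function": 1}


def allowed_repetitions(task, remaining_time, task_duration):
    """Number of repetitions permitted for a task at this moment.

    The result is the smaller of the task-dependent limit and the
    number of complete runs that still fit into the remaining time.
    """
    if task_duration <= 0:
        raise ValueError("task_duration must be positive")
    fits = int(remaining_time // task_duration)
    return max(0, min(STATIC_LIMIT[task], fits))


important = allowed_repetitions("brake_control", 7.0, 2.0)
unimportant = allowed_repetitions("comfort_function", 7.0, 2.0)
too_late = allowed_repetitions("brake_control", 1.0, 2.0)
```

With 7 time units remaining and a task duration of 2, three repetitions fit, so the important task may use all three, while the less important task is capped by its own limit of one.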
[0033] To implement the method according to the present invention,
during running of the computer program before the execution of a
program object, the values of the quantities required for execution
of the program object and/or the quantities that define the status
of the program object are saved. According to this embodiment, the
quantities of all program objects are saved.
[0034] Alternatively, in the case of a computer program to be
processed periodically in a period, on detection of an error in a
certain program object, the processing jumps back to a
preselectable return point in the period of the computer program.
Thus, according to this specific embodiment, in the case of an
error, the program always jumps back to the same point within the
period. Then preferably the values of all quantities relevant for
the status of the program object are saved during running of the
computer program only before the execution of a program object at
the return point. In this way, only the values of the relevant
quantities of the program object at the return point need be stored
only once per cycle or period. Therefore, this allows savings in
time for storing and memory capacity.
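The single return point per period may be sketched as follows. The function `run_period_from_return_point` and the `error_after` callback are hypothetical; the sketch assumes that the return point lies at the start of the period and that the state is saved exactly once there.

```python
import copy


def run_period_from_return_point(tasks, state, error_after, max_restarts=1):
    """Run all tasks of one period; on an error, jump back to the return point.

    tasks:       list of callables(state) that update state in place.
    error_after: callable(task_index, run) -> True if an error is detected.
    Returns (final_state, restarts), or (None, max_restarts) if every
    permitted run ended with an error.
    """
    saved = copy.deepcopy(state)  # saved once per period, at the return point
    for run in range(max_restarts + 1):
        state = copy.deepcopy(saved)
        for index, task in enumerate(tasks):
            task(state)
            if error_after(index, run):
                break  # jump back to the return point
        else:
            return state, run  # period completed without an error
    return None, max_restarts


def add_one(s):
    s["x"] += 1

def triple(s):
    s["x"] *= 3

result, restarts = run_period_from_return_point(
    [add_one, triple], {"x": 1},
    lambda index, run: (index, run) == (1, 0),  # one simulated error
)
```

Only one copy of the state is stored per period, at the cost of repeating every task between the return point and the error.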
[0035] In renewed execution of a program object after detection of
an error, the saved input quantities are then retrieved and made
available as input quantities to the program object to be executed
again.
[0036] As another specific embodiment of the present invention,
multiple return points may be provided for one program object. When
an error occurs, it is not necessary to execute the entire program
object again, but instead only a portion of the program object need
be executed. When an error occurs, the execution simply jumps back
to the previous return point up to which the execution of the
program object was free of errors. For example, in error-free
running of the program object up to the n-th return point, it is
possible to roll back to the n-th return point when an error occurs
between the n-th return point and the (n+1)-th return point. The
program object is then executed anew starting with the n-th return
point. This permits time savings. When each return point is passed,
at least one defined state is generated and saved, preferably
during the execution of the program object.
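Multiple return points within one program object may be sketched as follows. The sketch assumes that the program object can be split into segments and that a hypothetical detector reports errors for given (segment, attempt) pairs; the names are illustrative only.

```python
import copy


def run_with_return_points(segments, state, fault_at=None):
    """Execute segments in order, saving a return point before each one.

    segments: list of callables(state) that update state in place.
    fault_at: (segment_index, attempt) pairs at which an error is reported.
    Returns (final_state, log).
    """
    fault_at = set(fault_at or ())
    log = []
    for i, segment in enumerate(segments):
        saved = copy.deepcopy(state)  # defined state at return point i
        attempt = 0
        while True:
            segment(state)
            if (i, attempt) in fault_at:
                log.append(f"error in segment {i}, rollback to return point {i}")
                state = copy.deepcopy(saved)  # only this segment is repeated
                attempt += 1
                continue
            log.append(f"segment {i} ok")
            break
    return state, log


def add_one(s):
    s["x"] += 1

def double(s):
    s["x"] *= 2

final, log = run_with_return_points(
    [add_one, double, add_one], {"x": 1}, fault_at={(1, 0)}
)
```

Only the segment between the n-th and the (n+1)-th return point is repeated, and restoring the saved state keeps the doubling from being applied twice.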
[0037] Implementation of the method according to the present
invention in the form of an operating system is of particular
importance. The operating system here is capable of running on
computing hardware, in particular on a microprocessor, and is
programmed for execution of the method according to the present
invention when it is running on the computing hardware. In this
case, the present invention is thus implemented by the operating
system, so this operating system constitutes the present invention
in the same way as does the method for the execution of which the
operating system is suitable. The operating system is preferably
stored in a memory element and is transmitted to the computing
hardware for processing. In particular any data medium or an
electronic memory medium may be used as the memory element, e.g., a
random access memory (RAM), a read-only memory (ROM) or a flash
memory.
[0038] Based on the computing hardware of the type mentioned at the
outset, another means of achieving the object of the present
invention is for the computing hardware to have an error handling
mechanism which prompts renewed execution of at least one program
object on detection of an error by the error detection
mechanism.
[0039] According to an advantageous refinement of the present
invention, the error handling mechanism has a trigger logic which
restarts the at least one program object on detection of an
error.
[0040] According to a preferred specific embodiment, a real-time
operating system, e.g., OSEKtime, is running on the computing
hardware. Finally, the present invention provides for the computing
hardware to include a microprocessor.
BRIEF DESCRIPTION OF THE DRAWINGS
[0041] Other features, possible applications and advantages of the
present invention are derived from the following description of
exemplary embodiments of the present invention, as illustrated in
the drawing. All the features described or illustrated here
constitute the subject matter of the present invention.
[0042] FIG. 1 shows a flow chart of a method according to the
present invention according to a preferred specific embodiment.
[0043] FIG. 2 shows a schematic diagram of a preferred specific
embodiment of a computing hardware according to the present
invention.
[0044] FIG. 3, made up of FIGS. 3a and 3b, shows an approach
according to the present invention for integrating the task
repetition.
DETAILED DESCRIPTION
[0045] The present invention relates to a method for running a
computer program on computing hardware, in particular on a
microprocessor. The computer program includes a plurality of
program objects preferably designed as tasks. In this method,
errors are detected during the running of the computer program on
the computing hardware. The detected errors may be transient or
permanent.
[0046] Transient errors may occur in running a computer program on
computing hardware. Since structures on the semiconductor modules
(so-called chips) in computing hardware are becoming progressively
smaller, but the clock rate of the signals is becoming
progressively higher and the signal voltages are becoming
progressively lower, transient errors are occurring with ever
greater frequency in running a computer program on computing
hardware. In contrast with permanent errors, they occur only
temporarily and they usually disappear spontaneously after a period
of time. In transient errors, only individual bits are faulty and
there is no permanent damage to the computing hardware. Transient
errors may have various causes, e.g., electromagnetic effects,
alpha-particles or neutrons.
[0047] Because transient errors occur almost unforeseeably and are
therefore not reproducible, error handling is performed essentially
only for permanent errors in the computing hardware known from the
related art. Taking into account
transient errors is limited to incrementation and, possibly,
decrementation of an error counter. This count is stored in a
memory and may be read out off-line, i.e., as diagnostic
information or error information during a visit to the repair shop,
for example. Only then is it possible to respond appropriately to
the error. The known error handling thus does not allow any error
handling within a short error tolerance time, which is necessary
for safety-relevant systems in particular and also does not allow
constructive error handling in the sense that the computer program
resumes proper processing within the error tolerance time and the
computing hardware is able to fulfill its intended purpose.
[0048] In contrast, the method according to the present invention
allows handling of a transient error of a computer program running
on computing hardware using a systematic constructive means of
handling the transient nature of most errors. FIG. 1 shows a flow
chart of the method according to the present invention using the
example of a runtime object, also called a task. The existence of
other tasks does not affect the basic workflow, so they need not be
taken into account. Thus according to the present invention,
multiple tasks may be handled just as one task is handled according
to the workflow depicted in FIG. 1. An error detection mechanism
(so-called concurrent checking) operating in parallel is
particularly advantageous. However, parallel checking cannot be
represented as such in a flow chart and is instead shown as a serial
module at the corresponding location.
[0049] The method according to the present invention begins in a
function block 1. The execution of the task on the computing
hardware starts in function block 1. The task is called. In a
function block 2, a return point is generated. To this end,
reliable, relevant task input quantities sufficient for setting the
task in a defined state for a restart and restarting the task are
saved in a memory element in the computing hardware. All the input
quantities of the task are preferably saved. The task is then
executed further in a function block 3. The execution may proceed
either to another return point or to the end of the task. An error
detection mechanism is then executed. Error detection may occur
according to any method. Errors are detected during the running of
the computer program (so-called concurrent checking). Thus, with a
so-called dual-core architecture, for example, the entire computer
core is designed in duplicate. If the computer cores are operated
in a so-called lock-step mode, it is possible to compare whether
both computer cores deliver the same results for each instruction.
A difference in results then allows an error to be inferred with
certainty. Such an error detection mechanism thus detects errors
during execution of the task in real time. The situation is also
similar for error detecting codes which are used throughout the
processor architecture or even for duplicated subcomponents of the
computing hardware. Such error detection mechanisms that detect
transient errors very rapidly and supply an error signal when an
error has been detected are preferred.
[0050] A query block 4 checks on whether an error, i.e., a
transient error or a permanent error, has been detected. If an
error has been detected, the system branches off into another query
block 7, where the current value of an error counter logic is
checked. If the
error counter has not yet fallen below a preselectable counter
reading (in the case of a decrementing error counter) or exceeded
it (in the case of an incrementing error counter), then it is
possible to repeat the execution of the task during the execution
of which the error has occurred and/or a certain number of tasks
executed before the error occurred. If a renewed startup of the
execution of the task is possible, then the system branches off
into a function block 8, where the status of the error counter
logic is updated (decremented or incremented) with the information
that another error has occurred. From there, the system branches
off into a function block 5, in which the quantities stored in
function block 2 are loaded and sent to the task for generating a
defined status at the beginning of the execution. The system then
branches off into function block 3, where the task to be repeated
is executed again in part, i.e., starting from a return point that
has already been executed, for example, or as a whole, i.e., the
task is restarted from the beginning.
[0051] If it is found in query block 4 that no error has occurred
during execution of the task in function block 3, then the system
branches off into a function block 9 in which the status of the
error counter logic is updated with the information that no error
has been detected. From there, the system branches off
into a query block 11 which verifies whether the computer program
has been run to the end. If this is the case, then the system
branches off to the end of the computer program in function block
6. Otherwise it branches off into a function block 12, where
another return point is generated, depending on the current task
status, by defining and saving reliable, relevant task input
quantities sufficient for restarting the task. From there, the
system branches back into function block 3, where
the task to be repeated is restarted and executed again either in
part or in its entirety.
[0052] If it is found in query block 7 that another attempt at
renewed execution of the task is no longer possible because of the
reading on the error counter logic, then the system branches off
into a function block 10. Query block 7 verifies whether the error
counter logic value for this task is greater than a task-dependent
repetition value. This task-dependent repetition value may be
predetermined either as the same for different tasks or
individually for each task. It is possible in this way for
particularly important tasks, for example, to first be repeated
several times before a permanent error is reported. If the
task-dependent repetition value is preselected as 1, the task is
repeated only once before a permanent error is detected. If the
task-dependent repetition value is preselected as 2 or 3, the task
is repeated two or three times before a permanent error is
detected. In this case, the task thus has a longer period of time
or more run-throughs available, until the transient error no longer
occurs. In function block 10, a permanent error is then detected
and a corresponding measure is initiated. This measure may include,
for example, setting the computer program to an emergency mode or
initially not undertaking anything and then terminating the run of
the computer program.
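The sequence of FIG. 1 described above (return point, execution, error query, error counter, permanent error) can be sketched as follows. This is a simplified illustration under stated assumptions, not the implementation of the invention: the task is assumed to report a detected error as a boolean, the error counter is assumed to be incrementing, and what function block 9 does to the counter on success is not specified in the text, so the sketch leaves the counter unchanged. The names run_with_return_point, PermanentError, and make_flaky_task are hypothetical.

```python
import copy


class PermanentError(Exception):
    """Raised when no further repetition is allowed (function block 10)."""


def run_with_return_point(task, inputs, repetition_value):
    """Run task(inputs) -> (result, error_detected) with task repetition.

    Returns (result, number_of_repetitions) or raises PermanentError.
    """
    saved_inputs = copy.deepcopy(inputs)      # function block 2: return point
    error_count = 0                           # incrementing error counter
    while True:
        # Function blocks 5 and 3: reload the saved inputs and execute.
        result, error_detected = task(copy.deepcopy(saved_inputs))
        if not error_detected:                # query block 4
            return result, error_count
        if error_count >= repetition_value:   # query block 7
            raise PermanentError(
                f"task still faulty after {error_count} repetition(s)")
        error_count += 1                      # function block 8


def make_flaky_task(transient_failures):
    """Hypothetical task that reports an error on its first few runs."""
    calls = 0

    def task(values):
        nonlocal calls
        calls += 1
        return sum(values), calls <= transient_failures

    return task
```

With repetition_value set to 1, the task is repeated once before a permanent error is reported; with 2 or 3, it is repeated two or three times, which matches the behavior described for the task-dependent repetition value.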
[0053] The method according to the present invention need not
necessarily include all the function blocks and query blocks shown
in FIG. 1 and explained above. Thus, for example, blocks 7 through
9, which involve the error counter logic, may be omitted. On
detection of an error, the task(s) to be restarted and executed
would then be repeated until the error no longer occurs. A
permanent error would not be detected, so function block 10 could
also be omitted. Alternatively, the task-dependent repetition value
may be preselected as 1, so that function blocks 8 and 9 for
updating the error counter could be omitted. Finally, it is also
possible to omit blocks 11 and 12 if only a single task having a
single return point is executed.
[0054] FIG. 2 shows a piece of computing hardware according to the
present invention for running a computer program according to a
preferred specific embodiment. The computing hardware is labeled as
20 in its entirety. The computing hardware includes a memory
element 21, designed as an electronic memory, for example a flash
memory. Furthermore, computing hardware 20 includes a
microprocessor 22 on which a computer program may be run. The
computer program is stored in memory element 21 and is
labeled as 23. For running the computer program on microprocessor
22, the computer program is transmitted either as a whole or in
segments, e.g., command by command, to microprocessor 22 over a data
link 24. Data link 24 may be designed as one or more data lines or
as a bus system for data transmission. Furthermore, an operating
system is stored in memory element 21; it is transmitted at least
partially from memory element 21 to microprocessor 22 and run there
when computing hardware 20 is booted up. The operating system is labeled
as 25. It has the function of controlling and managing the running
of computer program 23 on microprocessor 22 and the peripheral
equipment connected to computing hardware 20. According to the
present invention, operating system 25 is designed in a particular
manner, so it is programmed for executing the method according to
the present invention and executes the method according to the
present invention when it runs on microprocessor 22. In particular,
operating system 25 has access to an error detection mechanism
for detecting an error during running of computer program 23 on
microprocessor 22. Furthermore, operating system 25 includes an
error handling mechanism which prompts renewed execution of at
least one program object (one task) of computer program 23 on
detection of an error.
[0055] A method, an operating system, and computing hardware are
proposed within the scope of a system strategy according to the
present invention which make it possible to integrate this concept
of task repetition with a minimum of, or even without any,
performance overhead.
[0056] The proposal is based on a system strategy which minimizes the
performance overhead, and thus the costs, under various assumptions.
It is generally assumed that an error detection mechanism, in
particular as described above, is available which is able to detect
errors while the task is running (e.g., a dual-core mechanism
having redundant processing). Furthermore, primarily transient
errors are discussed here. In order to cover permanent errors, an
extension is necessary, e.g., an error counter as mentioned
above.
[0057] Several preconditions have to be observed in the strategy
according to the present invention for integrating the task
repetition as described in FIG. 3:
[0058] A distinction is made between at least two classes of tasks
(e.g., critical and noncritical tasks). The error detection
mechanisms are not used for all task classes and/or the task
repetition is not carried out for all task classes.
[0059] In the example of FIG. 3, a distinction is made between two
classes of tasks in which a task repetition is carried out for only
one class and/or an error detection mechanism is started for only
one class. When distinguishing between critical and noncritical
tasks, only errors in critical tasks are intercepted according to
the present invention; the error detection mechanism is used only
for the first class in particular. Critical tasks are tasks whose
correct execution is necessary for the overall function or the basic
functionality of the system and must have taken place by a certain
point in time at the latest in order to preserve this function.
Noncritical tasks are tasks whose failure either does not affect the
overall system function or the basic function, or does not
significantly restrict it. In particular, in a system a
distinction is made in this regard between safety-critical and
non-safety-critical functions.
[0060] Transient errors in a task of the second class, i.e.,
noncritical class 2, may be ignored. Furthermore, as mentioned, it
holds true that a task of the second class occasionally may not
"get a turn," i.e., that there are no serious consequences from the
perspective of the system when a task of this second class is not
called up during a task processing run. In addition, the total run
time of tasks of the first, critical class 1 should preferably not
take up more than a certain system-dependent percentage (e.g., 50%)
of the total computing time of one run or task processing round. In
the case of tasks being divided into critical
and noncritical tasks, the critical tasks are allocated a total
maximum of 50% of the total computing time so that in the worst
case, when all critical tasks have errors, they may be restarted or
re-computed.
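The budget condition above can be checked with simple arithmetic: if the critical tasks together use at most half of a round, then even when every critical task has to be computed twice, the round is not exceeded. The following sketch assumes that worst-case execution times and the round length are given in the same time unit; the function name critical_share_ok and the example values are hypothetical.

```python
def critical_share_ok(critical_wcets, round_time, max_share=0.5):
    """Check that the critical tasks stay within their share of one round.

    critical_wcets: worst-case execution times of all class 1 tasks.
    round_time:     total computing time of one task processing round.
    max_share:      system-dependent limit, 0.5 in the example of the text.
    """
    if round_time <= 0:
        raise ValueError("round_time must be positive")
    return sum(critical_wcets) <= max_share * round_time
```

For example, critical tasks of 2, 3, and 1 time units in a round of 12 units use exactly 50%, so each of them could be re-computed once within the same round.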
[0061] A system approach according to FIG. 3 is then possible in
which the tasks of the different classes alternate in such a way
that the "successor" S1 (class 2) of a task T1 (class 1) is
allocated at least as much time in the time schedule as the WCET
(worst case execution time) of T1.
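The condition on the successor's time allocation can be verified when the time schedule is built. The sketch below is one possible check, under the assumption that the schedule is available as a list of tuples; the tuple layout and the name undersized_successor_slots are hypothetical.

```python
def undersized_successor_slots(schedule):
    """Return the critical tasks whose successor slot is too short.

    schedule: list of (critical_task, wcet, successor_task, successor_slot)
              tuples, with wcet and successor_slot in the same time unit.
    A successor slot is sufficient if it is at least as long as the WCET
    of the preceding critical task, so that task can be re-computed in it.
    """
    return [critical
            for critical, wcet, _successor, slot in schedule
            if slot < wcet]
```

An empty result means that every class 1 task can be re-computed in the slot of its class 2 successor.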
[0062] When a transient error occurs in T1, the basic idea is then
that T1 is re-computed instead of S1. This ensures that the error
in T1 is corrected before T2 is computed. Error detection and error
handling (including a very probable healing and retaining of the
error tolerance characteristics) thus take place within the
computing time of one task.
[0063] A distinction is made in FIG. 3 between class 1 tasks T1,
T2, and T3 and class 2 tasks S1, S2, and S3. In the event of an
error-free execution, FIG. 3a shows task sequence T1, S1, T2, S2,
T3, S3 as an example. According to FIG. 3b, if a transient error
occurs in T1, T1 is re-computed in the time slot of S1, followed by
T2, S2, T3, S3, so that the transient error is corrected.
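The two sequences of FIG. 3 can be reproduced with a short simulation. The sketch assumes that an error is transient, i.e., that the re-computation in the successor's slot succeeds; the function name dispatch and the representation of the schedule as pairs are hypothetical.

```python
def dispatch(pairs, faulty):
    """Return the executed task sequence for one processing round.

    pairs:  list of (critical_task, successor_task) in schedule order.
    faulty: set of critical tasks in which a transient error is detected
            during their first execution.
    """
    executed = []
    for critical, successor in pairs:
        executed.append(critical)
        if critical in faulty:
            # Re-compute the critical task instead of its class 2 successor.
            executed.append(critical)
        else:
            executed.append(successor)
    return executed


PAIRS = [("T1", "S1"), ("T2", "S2"), ("T3", "S3")]
```

Calling dispatch(PAIRS, set()) yields the error-free sequence of FIG. 3a, while dispatch(PAIRS, {"T1"}) yields T1, T1, T2, S2, T3, S3 as in FIG. 3b; S1 is simply skipped in that round.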
[0064] This system approach is implemented according to the
explanations of FIGS. 1 and 2 using the corresponding device in
such a way that, with regard to the disclosure, each specific
embodiment is explicitly combinable with any other according to the
present invention.
[0065] In this manner, according to the present invention, an
optimum FO characteristic (FO: fail operation(al), fault
operation(al)) regarding transient errors is achievable, even with
a very short healing time which heals the FO characteristic itself
again. This approach is also very well usable in time-controlled
systems and is optimizable to that effect.
* * * * *