U.S. patent application number 13/065583 was filed with the patent office on 2011-03-24 and published on 2011-09-29 for adaptive slip double buffer.
Invention is credited to Kishan Shenoi.
Application Number | 20110234200 13/065583 |
Document ID | / |
Family ID | 44655644 |
Filed Date | 2011-03-24 |
United States Patent
Application |
20110234200 |
Kind Code |
A1 |
Shenoi; Kishan |
September 29, 2011 |
Adaptive slip double buffer
Abstract
A method includes monitoring a fill in an adaptive slip buffer
of a digital to analog convertor; adjusting a number of samples
that are read from the adaptive slip buffer per page as a function
of the fill; and reading the number of samples from the adaptive
slip buffer. An apparatus includes a digital to analog convertor
including an adaptive slip buffer and a read address generator
coupled to the adaptive slip buffer, wherein the read address
generator includes an increment control that adjusts a number of
samples that are read from the adaptive slip buffer per page as a
function of fill of the adaptive slip buffer.
Inventors: |
Shenoi; Kishan; (Saratoga,
CA) |
Family ID: |
44655644 |
Appl. No.: |
13/065583 |
Filed: |
March 24, 2011 |
Related U.S. Patent Documents
Application Number | Filing Date | Patent Number |
61340923 | Mar 24, 2010 | |
61340906 | Mar 24, 2010 | |
61340922 | Mar 24, 2010 | |
Current U.S.
Class: |
324/76.38 |
Current CPC
Class: |
H04J 3/0632 20130101;
H04L 7/005 20130101 |
Class at
Publication: |
324/76.38 |
International
Class: |
G01R 13/34 20060101
G01R013/34 |
Claims
1. A method, comprising: monitoring a fill in an adaptive slip
buffer of a digital to analog convertor; adjusting a number of
samples that are read from the adaptive slip buffer per page as a
function of the fill; and reading the number of samples from the
adaptive slip buffer.
2. The method of claim 1, wherein the number of samples defines an
apparent frame interval as a function of a clock frequency of the
digital to analog convertor.
3. The method of claim 2, wherein the number of samples is
increased when the fill is decreasing and the number of samples is
decreased when the fill is increasing.
4. The method of claim 3, wherein the number of samples is
decreased when a sample is flagged as actionable.
5. The method of claim 3, wherein the number of samples is changed
when a minimum slip time interval has been exceeded.
6. The method of claim 3, wherein the number of samples is changed
when an apparent frequency change threshold has not been
exceeded.
7. A computer program, comprising computer or machine readable
program elements translatable for implementing the method of claim
1.
8. A machine readable medium, comprising a program for performing
the method of claim 1.
9. An apparatus, comprising: a digital to analog convertor
including an adaptive slip buffer and a read address generator
coupled to the adaptive slip buffer, wherein the read address
generator includes an increment control that adjusts a number of
samples that are read from the adaptive slip buffer per page as a
function of fill of the adaptive slip buffer.
10. The apparatus of claim 9, wherein the number of samples
controls an apparent frame interval as a function of a clock
frequency of the digital to analog convertor.
11. The apparatus of claim 9, wherein the adaptive slip buffer
includes a circular buffer.
12. The apparatus of claim 9, wherein the adaptive slip buffer
includes a double buffer.
13. The apparatus of claim 9, wherein the adaptive slip buffer
includes a linear buffer.
14. A digital switched network integrated access device, comprising
the apparatus of claim 11.
Description
CROSS-REFERENCES TO RELATED APPLICATIONS
[0001] This application claims a benefit of priority under 35
U.S.C. 119(e) from copending provisional patent applications U.S.
Ser. No. 61/340,923, filed Mar. 24, 2010, U.S. Ser. No. 61/340,922,
filed Mar. 24, 2010 and U.S. Ser. No. 61/340,906, filed Mar. 24,
2010, the entire contents of all of which are hereby expressly
incorporated herein by reference for all purposes.
BACKGROUND INFORMATION
[0002] 1. Field of the Invention
[0003] Embodiments of the invention relate generally to the field
of digital networking communications. More particularly, an
embodiment of the invention relates to methods and systems for
packet (and/or frame) switched networking that include an adaptive
slip double buffer.
[0004] 2. Discussion of the Related Art
[0005] With the advent of Internet Protocol ("IP"), packet-based
transmission and routing schemes are becoming ever more popular. It
is well accepted that Next Generation Networks ("NGN"s) will be
built upon these principles. However, several services, such as
real-time voice and voice-band communication, that are well suited
for circuit-switched ("TDM") transmission and switching, have to be
supported by this new architecture. VoIP ("voice over IP") is one
such example. The underlying premise of VoIP is that speech, after
conversion from analog to digital format, can be packetized and
several protocols such as RTP and RTCP (see Ref. [1,2]) have been
developed to support the ability of IP networks to provide such
real-time services.
[0006] One of the premises of NGNs is that the Quality of
Experience (QoE) should be at least as good as, or even better
than, that provided by the legacy circuit-switched network or PSTN
(Public Switched Telephone Network). It is clear that delay is an
important parameter in determining the QoE. It is well known that
one-way delays that are very large (of the order of 400 ms or
larger) are extremely detrimental from the view of subjective
quality, making regular full-duplex conversation difficult. At
lower one-way delays, the impact of echo is important. The Quality
of Experience, for a given level of Echo Return Loss (ERL) drops
rapidly with increasing delay.
[0007] The overall delay has four principal components. The process
of packetization involves buffering information to fill the packet
payload and thus introduces delay. The encoding and decoding
algorithms, especially in the case of source codecs, require
buffering as well. These two delays are often known quantities. The
third component is the delay through the network. This delay is
difficult to predict a priori since it depends on the physical
distance, the number of intermediate packet switches involved in
the end-to-end transport of a packet, and the bandwidth of the
links between switches (routers). However, for two given end-points there
is, in principle, a minimal network delay corresponding to the
transit time of the fastest possible packet transmission.
Considering that in a pure IP network the transmission path could
be different for different packets, and the queuing delay in
intermediate nodes is a function of congestion, the delay
experienced by packets will be variable, ranging from the minimal
delay to infinity (a packet lost in the network is construed as an
instance of infinite delay). Some maximum delay threshold must be
determined and packets with delay greater than this maximum are
discarded. Received packets are stored in a buffer whose size
corresponds to the difference between minimum and maximum delays
and so, practically speaking, fast packets are delayed so that the
packets can be decoded and converted back to analog signals in a
smooth fashion. The notion of play-out, or dejittering, whereby
some delay is introduced via a jitter buffer constitutes the fourth
delay component. Clearly, in order to maximize the subjective
quality of the call, the play-out buffer, also referred to as the
jitter buffer, should be as small as possible.
SUMMARY OF THE INVENTION
[0008] There is a need for the following embodiments of the
invention. Of course, the invention is not limited to these
embodiments.
[0009] According to an embodiment of the invention, a process
comprises: monitoring a fill in an adaptive slip buffer of a
digital to analog convertor; adjusting a number of samples that are
read from the adaptive slip buffer per page as a function of the
fill; and reading the number of samples from the adaptive slip
buffer. According to another embodiment of the invention, a machine
comprises: a digital to analog convertor including an adaptive slip
buffer and a read address generator coupled to the adaptive slip
buffer, wherein the read address generator includes an increment
control that adjusts a number of samples that are read from the
adaptive slip buffer per page as a function of fill of the adaptive
slip buffer.
[0010] These, and other, embodiments of the invention will be
better appreciated and understood when considered in conjunction
with the following description and the accompanying drawings. It
should be understood, however, that the following description,
while indicating various embodiments of the invention and numerous
specific details thereof, is given for the purpose of illustration
and does not imply limitation. Many substitutions, modifications,
additions and/or rearrangements may be made within the scope of an
embodiment of the invention without departing from the spirit
thereof, and embodiments of the invention include all such
substitutions, modifications, additions and/or rearrangements.
BRIEF DESCRIPTION OF THE DRAWINGS
[0011] The drawings accompanying and forming part of this
specification are included to depict certain embodiments of the
invention. A clearer concept of embodiments of the invention, and
of components combinable with embodiments of the invention, and
operation of systems provided with embodiments of the invention,
will be readily apparent by referring to the exemplary, and
therefore nonlimiting, embodiments illustrated in the drawings
(wherein identical reference numerals (if they occur in more than
one view) designate the same elements). Embodiments of the
invention may be better understood by reference to one or more of
these drawings in combination with the following description
presented herein. It should be noted that the features illustrated
in the drawings are not necessarily drawn to scale.
[0012] FIG. 1 is a functional block view of a simplified depiction
of a VoIP call (only one direction shown), appropriately labeled
"PRIOR ART."
[0013] FIG. 2 is a functional block view of a circular buffering
action separating ADC and DAC clocks, appropriately labeled "PRIOR
ART."
[0014] FIG. 3 is a functional block view of a simplified model of
VoIP over an IP network, appropriately labeled "PRIOR ART."
[0015] FIG. 4 is a functional block view of transmission of
voice-band signals over a packet network, appropriately labeled
"PRIOR ART."
[0016] FIG. 5 is a functional block view depicting the functions
involved in generating the received speech signal, representing an
embodiment of the invention.
[0017] FIG. 6 is a functional block view of an underlying principle
of a retiming FIFO buffer (play-out buffer), representing an
embodiment of the invention.
[0018] FIG. 7 is a functional block view of a double buffer
arrangement for delivering samples to the DAC, representing an
embodiment of the invention.
[0019] FIG. 8 is a functional block view of a simplified circular
buffer arrangement, representing an embodiment of the
invention.
[0020] FIG. 9 is a functional block view in more detail of "Read
Add. Gen." (433 in FIG. 8), representing an embodiment of the
invention.
DESCRIPTION OF PREFERRED EMBODIMENTS
[0021] Embodiments of the invention and the various features and
advantageous details thereof are explained more fully with
reference to the nonlimiting embodiments that are illustrated in
the accompanying drawings and detailed in the following
description. Descriptions of well known starting materials,
processing techniques, components and equipment are omitted so as
not to unnecessarily obscure the embodiments of the invention in
detail. It should be understood, however, that the detailed
description and the specific examples, while indicating preferred
embodiments of the invention, are given by way of illustration only
and not by way of limitation. Various substitutions, modifications,
additions and/or rearrangements within the spirit and/or scope of
the underlying inventive concept will become apparent to those
skilled in the art from this disclosure.
[0022] Within this application several publications are referenced
by Arabic numerals, or principal author's name followed by year of
publication, within parentheses or brackets. Full citations for
these, and other, publications may be found at the end of the
specification immediately preceding the claims after the section
heading References. The disclosures of all these publications in
their entireties are hereby expressly incorporated by reference
herein for the purpose of indicating the background of embodiments
of the invention and illustrating the state of the art.
[0023] The invention described herein presents a novel approach to
the play-out buffer, providing a method to maintain optimal
performance even in situations where the analog-to-digital
converter (ADC) and digital-to-analog converter (DAC) have
different underlying time-bases. In particular, a method based on
controlled slips, a technique that is well known as being efficient
in TDM architectures for addressing clock offset, is presented. The
invention is an extension of controlled slip behavior. In
particular, the slip mechanism is invoked primarily when the speech
segment represents a synthetic signal such as during periods of
silence or if the characteristics of the speech segment are such
that the repetition/deletion of a speech sample will have minimal
subjective annoyance. It will be seen that an adaptive play-out
buffer of the manner described here can form an integral part of an
adaptive jitter buffer mechanism. Extensions of the invention
include methods to implement adaptive clock control with minimal
impact on subjective quality.
1. The Inherent Need for SYNCHRONIZATION
[0024] Strictly speaking, the term synchronization applies to
alignment of time and the term syntonization applies to alignment
of frequency, but in the telecommunication environment we often use
the term synchronization to refer to either time-alignment, or
frequency-alignment, or both. It is generally clear from the
context which meaning is appropriate. All real-time communication
carried over a digital network requires synchronization to some
degree. This can be illustrated by considering the example of
delivering a real-time voice signal between two geographically
disparate points across a network.
[0025] The situation is depicted in FIG. 1. The analog source is
converted into digital format by an analog-to-digital converter
(ADC or A/D) operating at a sampling clock rate of nominally 8 kHz.
Each sample is, conventionally, quantized to 8 bits so that the
digital stream carrying the voice information is 8
kilo-octets-per-second or 64 kbps (see ITU-T Rec. G.711, Ref. [3],
and Ref. [4]). This is regarded as a DS0 and represents
"uncompressed" voice. In a conventional circuit-switched or TDM
(Time Division Multiplexed) architecture, this DS0 is delivered "as
is" to the destination for conversion back to analog format. In a
packet-switched environment, exemplified by Voice-over-IP (VoIP),
the DS0 is, possibly, compressed and organized into packets. These
packets are delivered to the destination where the expansion to DS0
format is performed prior to conversion back to analog. Whereas the
schemes described here are applicable regardless of the word-length
employed for A/D conversion or D/A conversion, we shall henceforth
assume here that these are done with a word-length of 8 bits (1
octet) (representative of μ-law and A-law formats
provided in ITU-T Recommendation G.711) for specificity.
[0026] It is important to recognize that at each end the
digital-to-analog converter (DAC or D/A) and analog-to-digital
converter (ADC or A/D) are usually in the same integrated circuit
chip or on the same circuit board and thus the same clock is used
for both functions at any one end. In the event that the (digital)
signal processing includes echo cancellation, it is mandatory that
the same clock be used for both functions else the echo canceller
will exhibit sub-par performance and there will be instances of
echo leakage and other phenomena that negatively impact the quality
of experience. In FIG. 1 we show a single direction of transmission
solely for convenience in representation and explanation.
[0027] The rate at which packets are generated (in the encoder) is
determined by the A/D clock, shown as f_A in FIG. 1. In most
VoIP schemes, one packet is generated for every 80 samples from the
A/D converter. That is, using the conventional sampling rate of 8
kHz (nominal), each packet represents 10 ms (ms=millisecond) of
speech (there are variants that use block sizes other than 10 ms,
such as 5 ms, 20 ms, 30 ms, etc. but the principles described here
are applicable in all cases). The nominal word-length associated
with each sample is 8 bits, following G.711 (see Ref. [3]) so the
"uncompressed" signal represents a bit-rate of 64 kbps (or DS0).
Compression algorithms are employed to reduce the effective
bit-rate. For example, ADPCM (adaptive differential pulse code
modulation) following ITU-T Recommendation G.726 (see Ref. [5])
reduces the word-length associated with each sample to 4,
effectively reducing the data rate to 32 kbps. ITU-T Recommendation
G.727 (see Ref. [5]) describes methods for reducing the bits/sample
from 8 down to 5, 4, 3, or 2, corresponding to bit-rates of 40,
32, 24, and 16 kbps, respectively. More sophisticated schemes, such
as those described in ITU-T Recommendation G.723 and G.729 (see
Ref. [5]) are even more effective in reducing the bit-rate. The
notion of a "10-ms-packet" is the collection of information
produced by the coder that permits the decoder at the far end to
synthesize a 10-ms block of speech. Depending on the coding
algorithm it is possible that information from previous packets is
necessary as well. At the receiving end the decoder recreates the
appropriate digital signal (DS0) for conversion back into analog
format. The D/A clock is shown as f_D in FIG. 1.
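The packet and bit-rate arithmetic in this paragraph can be checked with a short Python sketch. The function names are illustrative assumptions and do not come from any codec library; the 8 kHz sampling rate, the 80-sample packet, and the bit-rates of 64, 40, 32, 24, and 16 kbps are the values quoted in the text.

```python
SAMPLE_RATE_HZ = 8000  # nominal sampling rate quoted in the text


def packet_duration_ms(samples_per_packet: int,
                       sample_rate_hz: int = SAMPLE_RATE_HZ) -> float:
    """Duration of speech carried by one packet, in milliseconds."""
    return 1000.0 * samples_per_packet / sample_rate_hz


def bit_rate_kbps(bits_per_sample: int,
                  sample_rate_hz: int = SAMPLE_RATE_HZ) -> float:
    """Bit rate of a sample-by-sample coder, in kbps."""
    return bits_per_sample * sample_rate_hz / 1000.0


if __name__ == "__main__":
    # 80 samples at 8 kHz make up one packet.
    print(packet_duration_ms(80), "ms per packet")
    # G.711 uses 8 bits per sample; the ADPCM variants use 5, 4, 3, or 2.
    for bits in (8, 5, 4, 3, 2):
        print(bits, "bits/sample ->", bit_rate_kbps(bits), "kbps")
```

The same helpers cover the other block sizes mentioned above (5 ms, 20 ms, 30 ms) by changing the sample count to 40, 160, or 240.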
[0028] If the frequencies of the A/D clock (f_A) and the D/A
clock (f_D) are not equal, then slips will occur. The notion of
a slip is simple. If f_A > f_D then the DAC will
experience a surfeit of samples; if f_A < f_D then the DAC
will experience a shortage of samples. Rate-adaptation then
requires that samples be deleted or inserted. In the
circuit-switched architecture of the legacy PSTN, every
transmission boundary element is required to extract DS0s from an
incoming digital signal (typically a DS1) and reinsert the
information into an outgoing digital signal (typically a DS1) that
may, potentially, have a different time-base. Therefore slip
buffers are very common. To minimize the occurrence of slips, the
circuit-switched network is well synchronized and this approach to
network synchronization has the derivative benefit that the clock
offset between the end points is minimized. In an NGN, where
asynchronous transport is employed, there is no guarantee that the
clock offset between the end points is negligible.
[0029] This phenomenon is not necessarily catastrophic, but
the DAC must either insert or delete a sample to
account for the difference in sampling rates. This insertion or
deletion of a block of information, such as a sample, is referred
to as a slip. Note that a slip is the result of the difference in
sampling rates and is independent of the word length associated
with the quantization and compression. The degradation of
perceptual quality caused by slips is in addition to any
degradation caused by other factors. In conventional
circuit-switched telephony, the unit of information inserted or
deleted is one sample (or octet). Considering the nominal sampling
rate is 8 kHz (one sample every 125 μs), a slip occurs
when the accumulated phase difference, expressed in time units,
caused by the aforementioned frequency difference, crosses 125
μs. In a packetized scenario, the unit could be as large as
a block of speech, typically of duration 20 ms, and thus slips would
have an impact similar to packet loss. Note that 20-ms slips occur
much less frequently than 125-μs slips but have a
greater impact each time they occur. The thrust of the current
invention is to get the benefits of single-octet (single-sample)
slips in a packet environment. Furthermore, the thrust of the
current invention is to get the benefits of a single-octet slip in
low-cost implementations such as in customer-premises-equipment
(CPE) integrated-access-devices (IADs) and residential
gateways.
[0030] In the following table we provide the slip rate assuming
that the D/A conversion clock uses a free-running oscillator and
that the A/D clock is accurate (relative to a Primary Reference
Source). Also provided is the typical technology used for that
accuracy and a budgetary estimate (order of magnitude) of the cost
of the oscillator. The last two columns provide an approximate
time between slip occurrences for different block sizes. In
generating this table it was assumed that the transmission link
between the A/D and D/A is equivalent to a "null" link that adds no
impairments such as excessive time-delay variation or transmission
errors. The intent is to lay the baseline for the minimum
impairment that is introduced by the lack of synchronization
between the end-points.
[0031] With regard to Table 1 as shown below, the terminology used
includes: XO: Crystal Oscillator; TCXO: Temperature-Compensated
Crystal Oscillator; and OCXO: Oven-Controlled Crystal Oscillator.
TABLE 1 Relationship between frequency offset and interval between
buffer overflow/underflow events
Accuracy | Technology | Cost | 125-μs slip | 10-ms slip |
1 × 10^-10 | Rubidium | ~$1000 | 1.25 × 10^6 sec. (14.5 days) | 1 × 10^8 sec. (3.2 years) |
50 × 10^-9 (50 ppb) | Hi-Quality OCXO | ~$500 | 2.5 × 10^3 sec. (41.7 min) | 2 × 10^5 sec. (2.3 days) |
5 × 10^-6 (5 ppm) | OCXO | ~$50 | 25 sec. | 2 × 10^3 sec. (33.3 min) |
50 × 10^-6 (50 ppm) | TCXO | ~$10 | 2.5 sec. | 200 sec. |
1 × 10^-3 (0.1%) | XO | ~$1 | 0.125 sec. (8 per sec.) | 10 sec. |
1 × 10^-2 (1%) | XO | ~$0.1 | 12.5 msec. (80 per sec.) | 1 sec. |
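Each interval in Table 1 follows from dividing the block duration by the fractional frequency offset, since that is the time the offset needs to accumulate one block of phase error. A minimal Python sketch of that arithmetic is given here; the helper name `slip_interval_s` is a hypothetical choice for illustration, and the null-link assumption of paragraph [0030] applies.

```python
def slip_interval_s(block_duration_s: float, fractional_offset: float) -> float:
    """Time for a clock offset to accumulate one block of phase error."""
    return block_duration_s / fractional_offset


SAMPLE_S = 125e-6  # one sample at the nominal 8 kHz rate
BLOCK_S = 10e-3    # one 10-ms block of speech

if __name__ == "__main__":
    # Fractional offsets taken from the Accuracy column of Table 1.
    for offset in (1e-10, 50e-9, 5e-6, 50e-6, 1e-3, 1e-2):
        per_sample = slip_interval_s(SAMPLE_S, offset)
        per_block = slip_interval_s(BLOCK_S, offset)
        print(f"{offset:8.1e}  {per_sample:12.4g} s  {per_block:12.4g} s")
```

Dividing the result by 86400 converts seconds to days, which gives the parenthetical figures for the more accurate oscillators.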
[0032] It should be noted that in carrier-grade equipment such as
that used in large telecom service provider networks, the higher
quality clock sources (oscillators) are appropriate. For
customer-premise equipment, including cases where the application
runs on a personal computer, the quality of the oscillator is
likely to be of the XO or, at best, TCXO class.
[0033] The perceptual degradation in quality caused by slips is
very subjective. The impact of an isolated slip in conventional
telephony using uncompressed signals (G.711) is typically a "click"
that could well be imperceptible, especially if it occurs during a
silent interval. However, the perceived quality degrades rapidly as
the slip-rate increases. The various digital switches in the PSTN
are all provided a PRS (Primary Reference Source) traceable
reference and thus have an absolute accuracy of better than
1 × 10^-11. A call traversing two distinct timing domains
may experience slips corresponding to a worst-case frequency
difference of 2 × 10^-11. Considering that this equates to
one slip every 72 days, we can, for all practical purposes, ignore
the phenomenon of slips in the traditional circuit-switched
network. In VoIP applications, the end points are quite cost
sensitive and therefore it is likely that the quality of oscillator
deployed will be represented by one of the last three rows of Table
1 and clearly slips may play an important role in determining the
quality of experience (or lack thereof).
[0034] Most studies for evaluating the perceptual quality of
compressed voice are done in a controlled environment and consider
only a single compression/expansion. Additional study is required
to assess the impact of tandem connections wherein there may be
multiple conversions of format. Furthermore, the impact of an
isolated slip may have a different perceptual effect on synthetic
speech, such as that inherent in CELP (Code Excited Linear
Prediction) methods for compression, such as G.729 (see Ref. [5]).
However, it is quite well accepted that the controlled slip method,
where one sample (octet) is deleted/inserted in an "uncompressed"
stream, works very well provided that slips do not manifest
themselves too often.
[0035] If the size of the buffer is large, then the relative
frequency of occurrence of buffer overflow/underflow events will be
small. However, large buffers imply the introduction of delay and
the decrease in quality of experience. Nevertheless, even with
large buffers deployed to mitigate the occurrence of buffer
overflow/underflow, there are other impairments that arise because
of a difference in clock between the end-points. Note that if there
is a long-term-average difference in the clock (frequency) at the
two end-points then buffer overflow/underflow will occur--the size
of the buffer will just determine the interval between these
catastrophic events.
[0036] The analog signal from the source enters the network and is
converted into a digital signal by the analog-to-digital converter
(ADC). The network acts as a pipe for these digital words (samples)
that are delivered to the far-end digital-to-analog converter (DAC)
for conversion back to analog. The conversion points could be in
equipment, such as a customer-premise located IAD or PBX or even a
Class-5 switch operated by the local telephone company. It is
important to recognize that the time-base governing the A/D clock
could be different from the time-base governing the D/A clock and
thus there could be a difference in the sampling rates associated
with these two conversions. That is, in every digital network there
is the potential of encountering the pitch modification effect. The
frequency difference could be small, of the order of 2 parts in
10^11, if the conversion clocks are traceable to a Stratum-1
source (or sources); the frequency difference could be significant,
of the order of 64 parts in 10^6 (64 parts per million or 64
ppm), if the only guarantee given is that the conversion clocks are
Stratum-4 quality (Stratum-4 implies an accuracy of no worse than
±32 ppm). {The notions of clock strata and the frequency
accuracy of different classes of clocks are available in Ref.
[6,7].}
[0037] Clearly, if the conversion rates are different, then the DAC
will experience a surfeit of samples if the ADC clock is higher
than the DAC clock, or a dearth of samples if the situation is
reversed. In fact, such a phenomenon could be manifested at
multiple places in the network where there is a connection between
two Network Elements with different clock references. Clock offsets
of this type are accommodated by the use of slip-buffers. Whereas
buffers are always required to compensate for accumulated jitter
and wander, it is the effect of a frequency offset that is the
primary focus here.
[0038] Again for simplicity, we shall assume that there is just one
buffer, and that this buffer is associated with the DAC. This
buffer will be of a FIFO (first-in-first-out) nature where the data
is written into the buffer under control of the ADC clock and read
out of the buffer under control of the DAC clock. Clearly, if there
is a frequency offset between the two clocks, the buffer will,
eventually, either overflow (ADC clock is higher) or underflow (DAC
clock is higher). In practice the buffering method is called
"double buffering" wherein there are two pages, say A and B, and
while data is being written into page A, data is being read out of
page B. If there is no frequency offset, then the opposite-page
nature of read and write will, for the most part, be preserved.
Such a buffer needs to be just big enough to accommodate any
relative wander or jitter between the two clocks. It is convenient
to describe the size of the buffer in terms of time. For example,
if each page is "10 ms", then each page has 80 octets, assuming a
nominal sampling rate of 8 kHz and one octet per sample (e.g.
G.711; see Ref. [3] or [4]). The overall buffer is then 20 ms deep,
introduces a nominal delay of 10 ms and can accommodate ±10 ms
of wander.
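The two-page arrangement described above can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the implementation of any embodiment: the class name `DoubleBuffer` is hypothetical, and the page swap is tied to the writer completing a page, which is only one of several possible policies.

```python
class DoubleBuffer:
    """Two-page buffer: one page is written while the other is read."""

    def __init__(self, page_size: int = 80):
        # 80 octets per page corresponds to a "10 ms" page at 8 kHz.
        self.page_size = page_size
        self.pages = [[0] * page_size, [0] * page_size]
        self.write_page = 0   # index of the page currently being filled
        self.write_index = 0
        self.read_index = 0

    def write(self, octet: int) -> None:
        self.pages[self.write_page][self.write_index] = octet
        self.write_index += 1
        if self.write_index == self.page_size:
            # Page full: swap roles so the reader moves to the fresh page.
            self.write_page ^= 1
            self.write_index = 0
            self.read_index = 0

    def read(self) -> int:
        # Always read from the page that is NOT being written.
        page = self.pages[self.write_page ^ 1]
        # Wrapping the index repeats octets if the reader runs fast.
        octet = page[self.read_index % self.page_size]
        self.read_index += 1
        return octet


if __name__ == "__main__":
    db = DoubleBuffer(page_size=4)
    for v in range(4):          # prime page A
        db.write(v)
    out = []
    for v in range(4, 8):       # read page A while page B is filled
        out.append(db.read())
        db.write(v)
    print(out)
```

With equal read and write rates the reader always finishes a page just as the writer completes the other one. A faster reader wraps and repeats octets; a slower reader loses the unread tail of its page at the swap.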
[0039] A good way of visualizing the double-buffer action is to
consider a circular buffer as depicted in FIG. 2. The memory is
organized in a circular manner with address calculations done
Modulo-2N, where 2N is the total number of memory locations. From
the viewpoint of the DS0 channel under consideration, each location
holds one octet (corresponding to one octet per sample), the buffer
has a "length" of (2N/8) ms, introduces a nominal delay of (N/8)
ms, and can accommodate ±(N/8) ms of wander. The operation is
quite simple. With each write operation the write pointer moves one
location counter-clockwise and likewise the read pointer moves one
location counter-clockwise with each read operation. If the
relative time error between the read and write clocks is zero, then
the pointers remain a fixed distance apart. A frequency offset will
result in one pointer catching up to the other, resulting in an
overflow/underflow. The reset position is when the pointers access
diametrically opposite locations. When an overflow/underflow
occurs, one pointer is forcibly moved to be diametrically opposite
to the other. This action causes data corruption in the sense that
N octets will be either lost or repeated. It should be emphasized
that allowing large buffers to overflow/underflow results in losses
of large amounts of data when such events occur and this could have
a much more deleterious impact on end-user (human) quality of
experience than losses of small amounts of data that may occur more
frequently.
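The pointer behavior described above, including the forced reset to the diametrically opposite location, can be sketched in Python. The class and attribute names are hypothetical, and treating "pointers equal after an access" as the overflow/underflow condition is a simplifying assumption of this sketch.

```python
class CircularSlipBuffer:
    """Circular buffer of 2N locations with modulo-2N addressing."""

    def __init__(self, n: int):
        self.n = n
        self.size = 2 * n
        self.mem = [0] * self.size
        # Reset position: pointers diametrically opposite each other.
        self.write_ptr = n
        self.read_ptr = 0
        self.resets = 0  # count of overflow/underflow events

    def _recentre(self) -> None:
        # Force the read pointer diametrically opposite the write pointer.
        # N octets are thereby either skipped or repeated.
        self.read_ptr = (self.write_ptr + self.n) % self.size
        self.resets += 1

    def write(self, octet: int) -> None:
        self.mem[self.write_ptr] = octet
        self.write_ptr = (self.write_ptr + 1) % self.size
        if self.write_ptr == self.read_ptr:  # writer caught the reader
            self._recentre()

    def read(self) -> int:
        octet = self.mem[self.read_ptr]
        self.read_ptr = (self.read_ptr + 1) % self.size
        if self.read_ptr == self.write_ptr:  # reader caught the writer
            self._recentre()
        return octet


if __name__ == "__main__":
    buf = CircularSlipBuffer(n=4)
    for i in range(4):   # writer runs ahead with no reads at all
        buf.write(i)
    print(buf.resets, buf.write_ptr, buf.read_ptr)
```

When reads and writes alternate one for one, the pointers stay a fixed distance apart and `resets` remains zero. Any sustained rate difference eventually triggers a reset, and each reset costs N octets.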
[0040] One special case is when the buffer is 250 μs
deep. This is the notion of a conventional slip buffer. Considering
the sampling rate is 8 kHz (125 μs period), a slip
buffer has two octets and the overflow/underflow results in either
the deletion of an octet or the repetition of an octet. This is
called a controlled slip. A slip occurs when the relative time
interval error between read and write clocks exceeds 125
μs. For example, if the relative frequency offset
between the two clocks is 64 ppm, then a slip will occur
approximately every 2 seconds.
[0041] In packet-switched networks the delay through the network is
not steady as is the case of circuit-switched networks. Therefore,
even if the rates of the ADC and DAC are equal, the write clock
may, on a short-term basis, appear to be faster (or slower) than
the read clock. This requires the use of a buffer that is called a
jitter buffer because the term used in the industry for variable
transit delay is "jitter".
[0042] Now suppose that the buffer is 200 ms deep. The buffer will
overflow (underflow) when the relative time interval error between
the two clocks exceeds 100 ms. A 64 ppm offset will thus result in
overflows (underflows) approximately every 3000 seconds.
Considering that a telephone call rarely lasts 50 minutes, it is
clear that overflows (underflows) that are a result of a clock
offset may be ignored for all practical purposes. This is one of
the (incorrect) reasons given by proponents of IP networks that
frequency synchronization is not required because free-running
clocks can support VoIP considering that buffer overflows and
underflows can be made rare by increasing the size of the
buffer.
[0043] It should be recognized that:
[0044] if a frequency offset exists then there will be occurrences of
buffer overflow/underflow.
[0045] the relative rate of such events will be smaller for larger
buffer sizes.
[0046] the larger the buffer size the greater is the loss of data when
such an event occurs.
[0047] the larger the buffer size the larger the delay (important for
human quality of experience).
[0048] The thrust of this invention is to use multiple buffers. One
buffer is similar to a traditional jitter buffer. The incoming
packets are written into the jitter buffer upon arrival. Note that
this write operation is tied, effectively, to the ADC clock (of the
far end) with additional jitter introduced by the packet delay
variation in the network. The packets are extracted (read out) from
the jitter buffer by the DSP block (explained later) at a rate that is
nominally uniform. The rate of packet extraction by the DSP block
is determined by the rate of the DAC clock. The second buffer is a
double buffer whose size is altered occasionally to adjust the rate
at which the jitter buffer data is extracted by the DSP block.
2. A Simplified Model for a Next Generation Network (VoIP
Environment)
[0049] A network based on packet switching and transmission can be
quite complex, but the simple model depicted in FIG. 3 is
sufficient to illustrate how synchronization and adaptive play-out
buffers play a role. We consider an IAD (Integrated Access Device)
at the customer premises as the traffic aggregator. All the various
services are provided from the IAD to which all the customer
equipment is connected. To allow for attachment of legacy devices
such as telephones and Fax machines, the IAD will provide an FXS
port to which the Fax machine (telephone) is connected. To the Fax
machine (telephone), the FXS port appears, for all intents and
purposes, as the line circuit of a traditional Class-5 switch. The
IAD contains the codec where the conversion between analog and
digital is accomplished. The information, however, is not
transported as a conventional DS0 would in a TDM (time division
multiplexed) or circuit-switched scenario. The data is packetized
and encapsulated in the appropriate "wrappers" for transmission
over the packet network.
[0050] In terms of the important processes involved after call
set-up, a simple, though accurate, view is depicted in FIG. 4. For
convenience only one direction of transmission is shown. The analog
signal from the source Fax machine or telephone ("srce") is
converted into digital format using an A/D converter. It is quite
common to use a conventional telephony codec that uses a
sampling rate of 8 kHz and encodes the sample value in an octet
(G.711 coding) though there are implementations described in the
literature where a higher sampling rate and a higher word-length
are used for improved fidelity. These samples are assembled into
packets. For speech applications there may be some signal
processing involved for purposes of echo cancellation and data
compression; for Fax calls the samples are generally used without
modification. The packets are delivered to the destination by the
packet network.
[0051] Speech implementations also allow for voice activity
detection (VAD) whereby intervals of silence are detected and
transmission bandwidth conserved by just transmitting an indication
of silence rather than (encoded) speech sample information. At the
receiving end intervals of silence are synthesized using comfort
noise.
[0052] Whereas packet architectures are superior to
circuit-switched architectures in terms of efficiency of bandwidth
utilization (because of statistical multiplexing), they have some
drawbacks, comparatively speaking. Packet architectures tend to
increase latency (average delay) and introduce time delay
variations. In order to accommodate time delay variations, jitter
buffers are required. That is, buffers of an "elastic" nature are
used to account for the burstiness of the packet arrival pattern.
In order to avoid loss of data the depth of these buffers must be
large enough to span the peak-to-peak time delay variation over the
network. Put another way, the size (depth) of the jitter buffer
determines the peak-to-peak time delay variation that is allowed
for the network and a variation greater than this maximum value
will result in packets being lost or used incorrectly.
[0053] If the jitter buffer is too small, time delay variation can
be the primary cause of packet loss. For normal voice (speech)
calls, packet loss concealment ("PLC") algorithms are available to
mitigate the impact of lost packets. However, it should be
emphasized that the mitigation of the deleterious impact does not
mean that the problem is eliminated. In Ref. [8] a general picture
of the impact of packet loss on Quality of Experience is provided.
One way to reduce packet loss is to increase the size of the jitter
buffer. However, this approach, too, has its drawbacks since the
increase in delay caused by increasing the depth of the jitter
buffer has a negative impact on the Quality of Experience for voice
calls for several reasons (see Ref. [8]). Consequently most prior
art VoIP implementations utilize what is referred to as an adaptive
jitter buffer: algorithms have been developed to make the jitter
buffer size dynamic, the intent being to keep the buffer just large
enough that the loss of packets due to time delay variation is
within an acceptable limit, which the ITU-T Recommendations specify
as 0.05%. However, adaptive jitter buffer operation in the prior
art has a major problem because the proponents of VoIP and adaptive
jitter buffers have ignored the effects of lack of clock
synchronization.
[0054] With the jitter buffer set at its "optimum" size, and
provided adequate traffic engineering is in place to give the
real-time services (such as VoIP) the appropriate priority, it is
assumed that time delay variation will not cause packet loss except
in situations of high traffic congestion. However, the frequency
offset between source and destination has two deleterious effects.
One is the pitch modification effect that has been described
elsewhere (see Ref. [12], for example) and while important, is not
the thrust of this invention. The other is a "buffer shrink"
effect. If the DAC clock is faster than the ADC clock, the jitter
buffer will empty faster than it is being filled. Suppose for
example the buffer size is 200 ms. Then, whereas at the start of
the call a 200 ms buffer will, theoretically, allow a .+-.100 ms
time delay variation, the emptying of the buffer will affect the
lower threshold. Similarly, if the ADC clock is faster than the DAC
clock, the buffer will fill faster than it is being emptied and
this will affect the upper threshold. For example, a frequency
difference of 50 ppm will cause a threshold reduction (either the
upper or the lower) of 50 .mu.s every second or 1 ms
every 20 seconds. Therefore, whereas the probability of losing a
packet due to time delay variation may have been small to
nonexistent at the start of the call, the probability increases
with the duration of the call and, for calls of long duration, could
become appreciable.
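The "buffer shrink" effect can be put in numbers with a short sketch. It assumes a constant frequency offset and a buffer that starts centred; the function name and parameters are illustrative only.

```python
def remaining_margin_ms(initial_margin_ms: float, offset_ppm: float,
                        elapsed_s: float) -> float:
    """Margin left on the affected threshold after elapsed_s seconds of drift."""
    drift_ms = offset_ppm * 1e-6 * elapsed_s * 1000.0
    return max(initial_margin_ms - drift_ms, 0.0)


# 200 ms buffer centred at call start: 100 ms of margin on each side.
for minutes in (0, 10, 30):
    margin = remaining_margin_ms(100.0, 50.0, minutes * 60)
    print(f"after {minutes:2d} min at 50 ppm: {margin:.1f} ms of margin left")
```

The margin that protects against time delay variation erodes linearly with call duration, which is why the probability of packet loss grows for long calls.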
[0055] For voice calls there have been several methods described in
the literature to handle such problems. The notion of an adaptive
jitter buffer is to modify the size of the jitter buffer to match
the existing time-delay variation condition being experienced.
Silence-stretching and silence-compressing algorithms have been
proposed to delete or expand sections (sub-intervals) of silence.
Packet loss concealment algorithms have been developed to insert or
delete sections of "non-silence" in such a manner as to reduce
(subjectively) any annoying effects of packet loss. The interested
reader is pointed to Refs. [9, 10] for further information on these
methods.
[0056] In the context of this invention, silence-manipulation and
packet loss concealment will be designated as extreme measures.
Such measures are necessary because the general behaviour of IP
networks is such that packets will be lost in the network for a
variety of reasons, including excessive time-delay variation that
could lead to jitter buffer overflow or underflow. In the context
of this invention, the block labeled "Depacketization, Jitter
Buffer, and Signal Processing" in FIG. 4 will be, logically, split
into multiple entities:
a. Depacketization. The packets received from the IP network are
processed and the information content required for synthesis of the
speech signal extracted. As part of the depacketization process,
the protocol wrappers are examined to detect whether a packet was
lost in the network. If a packet is detected as "lost", then the
packet loss concealment algorithm must be invoked. The current
invention does not relate in particular to depacketization
algorithms and implementations and most methods prevalent in the
state-of-the-art can be employed. Packets contain both time-stamps
and sequence numbers (also called frame numbers) and between these
two it is straightforward to decide whether there was a missing
packet or whether the apparent missing packet was actually a
"no_transmission" corresponding to a silence packet. Basically the
block labeled "Extract Frames" in FIG. 4 extracts the (encoded)
speech frames from the packets. Note that each IP packet may
contain more than one speech frame. That is, each IP packet may
contain the information for multiple (1 or more) blocks of speech.
For example, if the block size used is 10 ms, the IP packet may
contain 20 ms or two blocks worth of speech information in encoded
form. For convenience, the unit of storage in the jitter buffer
(see below) is the speech frame since this is the most convenient
and useful unit of storage and can be either in the form of encoded
speech or even decoded speech (see the notion of processing,
below).
b. Jitter Buffer. The jitter buffer in prior art VoIP
decoders comprised a first-in first-out (FIFO) buffer that was
large enough to accommodate the time delay variation encountered by
packets as they traverse the IP network from source
(encoder/packetization) to the destination decoder. In one possible
first implementation, the incoming packets are written in as they
arrive and read out by the signal processing entity at the play-out
rate. That is, the jitter buffer contains the actual received
packets with, possibly, the protocol wrappers removed. In a second
possible implementation, the incoming packets are treated by the
signal processing entity as they arrive and the synthesized speech
samples written into the FIFO. In this second implementation the
FIFO contains actual speech samples destined for the DAC and is
emptied based on the clock of the DAC. The invention disclosed
herein applies to both modes of operation. The reason for the first
mode of operation is that the jitter buffer module includes the
logic required to handle missing packets as well as "silence" when
there are really no packets available and the missing packets are
synthesized as "silence" based on other information such as
time-stamps available in the packets. Specifically, if the sequence
numbers of consecutive packets are in correct sequence but the
time-stamps indicate a time gap greater than the unit (frames or
packets) then it is deemed that there were silent frames/packets
between the two correctly sequenced packets. In the
second mode of operation there must be logic to determine silence
packets. The invention described here is applicable to both
implementations though, for specificity, the first implementation
scheme is assumed.
c. Signal Processing. The information extracted
from the received packet is processed with the appropriate
algorithms to generate the speech segment. This includes the codec
function, echo treatment (if any), comfort noise generation to
synthesize silence, and packet loss concealment. The current
invention does not relate in particular to the signal processing
algorithms and implementation and just about any methods prevalent
in the state-of-the-art can be employed.
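The rule in item b for telling a lost packet from suppressed silence can be sketched as follows. This is a simplified illustration, not the disclosed implementation: the `PacketHeader` type, the field names, and the assumptions that counters never wrap and that packets arrive in order are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class PacketHeader:
    sequence: int    # sequence (frame) number carried in the packet
    timestamp: int   # time-stamp, here assumed to be in sample units


def classify_gap(prev: PacketHeader, curr: PacketHeader,
                 samples_per_packet: int) -> str:
    """Return 'lost', 'silence', or 'contiguous' for two consecutive arrivals."""
    if curr.sequence != prev.sequence + 1:
        return "lost"        # sequence numbers skipped: conceal the loss
    if curr.timestamp - prev.timestamp > samples_per_packet:
        return "silence"     # in sequence, but a time gap: no_transmission
    return "contiguous"


first = PacketHeader(sequence=10, timestamp=1600)
print(classify_gap(first, PacketHeader(11, 2400), samples_per_packet=160))
```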
[0057] There is one additional (though optional) requirement on the
signal processing implementation arising from the current
invention. That is, a flag is associated with each sample (octet)
of speech signal recreated/synthesized. This flag is asserted
("true") if the speech sample generated was part of a silence
segment or a segment of signal artificially created via the packet
loss concealment algorithm or had some particular characteristic as
will be described later. The intent in this flag is to indicate
that the sample is "actionable" and will have a minimal subjective
annoyance in the event that the sample was deleted (or repeated) as
part of the adaptive slip double buffer that is the crux of the
invention disclosed herein. If the signal processing entity is
incapable of providing such a flag for any reason, then the
play-out buffer will, in essence, ignore the flag and assume that
all samples are "actionable".
[0058] The notion of "actionable" is that the frame of speech is
either representative of silence or is representative of a
synthetic frame of speech used for packet loss concealment. In the
case where the speech is compressed, the nominal short-term power
of the speech is computed by the encoding function (at the
analog-to-digital converter side) and communicated to the decoding
side (the digital to analog converter side). In the case where
there is no compression, the decoding side must compute the
short-term power of the signal and invoke suitable algorithms to
determine whether the current decoded speech is part of a silence
interval. Implementing slips introduces degradation but the
degradation is much less consequential if the slip is invoked during
periods of silence.
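For the uncompressed case, one simple way to derive the "actionable" flag is to threshold the mean-square value of each decoded frame. The sketch below is only an assumption about what a "suitable algorithm" might look like. The threshold value, the function names, and the use of linear PCM samples are hypothetical.

```python
def short_term_power(frame):
    """Mean-square value of one frame of linear PCM samples."""
    if not frame:
        raise ValueError("frame must contain at least one sample")
    return sum(sample * sample for sample in frame) / len(frame)


def is_actionable(frame, silence_power_threshold=1.0e4, concealed=False):
    """True if the frame was synthesized for concealment or looks like silence."""
    return concealed or short_term_power(frame) < silence_power_threshold


quiet_frame = [3, -2, 1, 0] * 20       # 80 low-level samples
loud_frame = [1000, -1000] * 40        # 80 high-level samples
print(is_actionable(quiet_frame), is_actionable(loud_frame))
```

A practical detector would normally add hangover and noise-floor tracking; those refinements are outside what the text describes.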
3. The Adaptive Play-Out Buffer
[0059] The invention disclosed here deals with an adaptive play-out
buffer that is also called an adaptive slip double buffer. This is
described below by considering the fundamentals of prior-art and
the extensions that comprise the invention.
3.1 The Play-Out Buffer Viewed as a Retimer
[0060] The underlying principle of retiming is quite
straightforward. The play-out buffer can be viewed as a retimer as
described here. Fundamentally, the data (speech samples or octets)
as well as a clock ("recovered clock") are recovered from the
incoming packet stream. The "recovered clock" is used to write the
incoming packets into a buffer that is operated in a FIFO
("first-in-first-out") mode. The recovered clock in this scenario
is a burst mode clock corresponding to packet arrival instants. The
data is read out of the buffer using, effectively, the DAC clock
(the retiming function generally involves inserting the "reference"
clock), and then packets read out from the FIFO can be applied to
the signal processing function to generate the digital speech
samples for the DAC. The function of "retiming" is illustrated in
FIG. 6.
[0061] Referring to FIG. 6, a FIFO buffer 412 is coupled to a
depacketization block 411. A digital signal processor 413 is
coupled to the FIFO buffer 412.
[0062] In FIG. 6, the block labeled "DeP" refers to the circuitry
used to implement the depacketization functions. The block labeled
DSP represents the DSP functions that generate the speech samples
for handing off to the digital-to-analog converter (DAC). The FIFO
buffer represents the jitter buffer. The DSP block reads out of the
jitter buffer based on the DAC clock. The DeP writes into the
jitter buffer when packets arrive and this can be viewed as a
jittered version of the encoder clock from the far end.
[0063] For illustrative purposes, the FIFO can be viewed as a
"pipe" with the receive data that is written into the FIFO viewed
as being pushed into the pipe. The transmit data that is read out
of the FIFO is viewed as being pulled out of the pipe. The arrow
designated as "fill position" indicates where the next frame/packet
that must be read out is located within the pipe. The action of
"write" moves the fill position to the right and each read
operation moves the fill position to the left. At the beginning or
"reset" situation, the fill position, arbitrarily, points to the
middle of the FIFO buffer. With such an arrangement, if the size of
the FIFO buffer is 2N units (typically frames), short-term
frequency variations, referred to as wander, can be accommodated
without loss of data. In particular, up to N unit intervals ("UI")
of time-delay variation in the packet network (2N UI, peak-to-peak)
can be absorbed (1 UI is equivalent to 1 frame-time, 10 ms for a
frame size of 80 samples if the underlying sample rate is 8 kHz).
Needless to say, the arrangement adds transmission delay of, on the
average, N UI. A FIFO of this nature can serve as a jitter buffer
accommodating up to .+-.N UI of time-delay variation. For
reference, if N is 10, up to .+-.100 ms of time-delay variation
(wander) can be absorbed.
[0064] If the (long-term) average frequencies of the write clock
and read clock are different, then the buffer will either overflow
or underflow. With respect to FIG. 6, the fill position will move
all the way to the right if the write clock is fast or all the way
to the left if the write clock is slow. In this situation data will
be corrupted; either some data is lost ("overflow"), or some
"garbage" data must be inserted ("underflow"). In a generic
retiming application, the appropriate way to handle such frequency
offsets is to force the fill position to the center (the equivalent
of "reset") whenever the fill position rails at either extreme. In
such a situation, either N frames are discarded ("lost") or N
frames are repeated ("garbage"). In a VoIP scenario, where the
signal processing entity is capable of packet loss concealment, the
advent of underflow can be anticipated and instead of "garbage",
speech segments can be synthesized that have much less subjective
annoyance. Likewise, the advent of overflow can be detected and
packet loss concealment methods applied to "delete" packets in a
manner that is not arbitrary but introduces less impairment from a
subjective standpoint.
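The generic retimer behaviour described above (fill position starting at the centre of a 2N-frame FIFO and forced back to the centre when it rails) can be modelled in a few lines. This is a simplified sketch: the event encoding and the function name are assumptions, and packet loss concealment is not modelled.

```python
def simulate_fill(events, n):
    """Track the fill of a 2N-frame FIFO.

    events is an iterable of 'w' (write one frame) or 'r' (read one frame).
    Returns (final_fill, frames_discarded, frames_repeated).
    """
    fill = n                       # reset state: fill position at the centre
    discarded = 0
    repeated = 0
    for event in events:
        if event == "w":
            if fill == 2 * n:      # overflow: N frames lost, back to centre
                discarded += n
                fill = n
            fill += 1
        elif event == "r":
            if fill == 0:          # underflow: N frames repeated, back to centre
                repeated += n
                fill = n
            fill -= 1
        else:
            raise ValueError(f"unknown event {event!r}")
    return fill, discarded, repeated


# N = 10 frames of 10 ms each absorbs up to +/-100 ms of delay variation.
print(simulate_fill("w" * 10, 10))   # burst that just fits
print(simulate_fill("w" * 11, 10))   # one frame more forces a reset
```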
[0065] One key element of the disclosed invention is the
anticipation of overflow/underflow events.
[0066] This will be described shortly.
[0067] Another key element of the disclosed invention is the manner
in which the clock used by the DSP to read frames out of the jitter
buffer is derived from the DAC and adjusted to minimize the impact
of clock offset between the local DAC and the far-end ADC.
[0068] This is described next.
3.2 Double Buffer Arrangement for Delivering the Samples to the
DAC
[0069] The arrangement for delivering samples to the
digital-to-analog converter generally involves a double buffer
arrangement. The reason for this buffering is that the actual
conversion is done on a sample by sample basis using a "continuous"
clock. The DSP unit will usually generate the samples as a block of
samples. Thus while the DSP unit generates the correct number of
samples per unit time on the average, it generates the samples in
bursts.
[0070] The most common arrangement for implementing the
double-buffer function involves the use of two buffers of equal
size, say N octets, and referred to as "Page-A" and "Page-B". One
of the sides (we shall assume the "write" side for specificity and
ease of explanation) accesses the buffer(s) sequentially. That is,
the write operation first fills buffer Page-A, moves to buffer
Page-B, fills it, and returns to filling buffer Page-A. The read
operation empties the buffers. Under "normal" conditions, the read
side is accessing buffer Page-B while the write side is accessing
buffer Page-A, and vice-versa. If the average (long-term)
frequencies of the read and write operations are equal, then the
accesses will, substantially, remain in opposite buffers. This
arrangement is sometimes referred to as a linear buffer arrangement
to distinguish it from a circular buffer arrangement. The advantage
of a linear buffer arrangement is that the memory allocation for
the buffer can be slightly more than the actual page size.
[0071] In FIG. 7 a simplified depiction of a double buffer
arrangement for implementing the interface to the DAC is shown. A
first buffer 421 (Page-A) is coupled to a second buffer 422
(Page-B). The actual DAC converts samples that are read out of the
appropriate page. The two buffers are often referred to as Page-A and
Page-B. The trajectory of the write pointer ("WP") (the address to
which the next write operation will pertain) is shown. In
particular, after filling Page-A, the pointer moves to the bottom
of Page-B and commences filling Page-B. The trajectory of the read
pointer ("RP") follows the same principle and is implied. At the
beginning (or "reset"), the WP and RP point to different pages. It
is especially pertinent to make the page size equal to one frame.
For example, in implementations using a 10 ms frame with an 8 kHz
sampling rate the frame size is nominally 80 samples. Also, in this
situation the DSP writes into the buffer the entire frame
(nominally 80 samples) almost instantaneously. That is, it computes
the appropriate sample values and fills the buffer in one "write
statement". The pseudo code for this operation will appear as:
Get initial value for write_pointer (establish whether it is page A or page B)
For l = 0,1,2,...,N1 {
    Write X[l] into address defined by write_pointer [write instruction "W1"]
    Increment write_pointer
}
[0072] Write X[N1] into address defined by write_pointer [write
instruction "W2"]
[0073] Switch page designation for next block (from page A to page
B and vice versa)
[0074] In the above block of code N1 is 79 if the block size is 80
since the range of the index l starts at 0. It is assumed that the
DSP has computed the requisite sample values and these are
available in the array {X[l]; l=0, 1, 2, . . . , N1}. At the start,
write_pointer identifies the memory address of the first element in
the appropriate page (page A or page B). The instruction following
the loop (write instruction "W2") is important. What this achieves is the
replication of sample value X[N1] into page-A/B location N1 as well
as (N1+1). Thus for the case of an 80-sample frame, the same value
is placed in the 80th as well as the 81st location of the
buffer. Note that this approach is suitable for a linear buffer
arrangement; slight modifications are required for circular buffer
operation.
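The page-write procedure for the linear double buffer can be sketched in runnable form as follows. The helper names `fill_page` and `DoubleBuffer` are hypothetical, and Python lists stand in for the memory pages.

```python
def fill_page(samples):
    """Write one frame into a page, duplicating the last sample in a spare slot."""
    if not samples:
        raise ValueError("a frame must contain at least one sample")
    n1 = len(samples) - 1               # last valid index (79 for 80 samples)
    page = [0] * (len(samples) + 1)     # page is one slot longer than the frame
    write_pointer = 0                   # first address of the page
    for idx in range(n1 + 1):           # write instruction "W1", indices 0..N1
        page[write_pointer] = samples[idx]
        write_pointer += 1
    page[write_pointer] = samples[n1]   # write instruction "W2": repeat X[N1]
    return page


class DoubleBuffer:
    """Two pages written alternately (illustrative helper)."""

    def __init__(self):
        self.pages = {"A": [], "B": []}
        self.next_page = "A"

    def write_frame(self, samples):
        written = self.next_page
        self.pages[written] = fill_page(samples)
        self.next_page = "B" if written == "A" else "A"   # switch page
        return written


frame = list(range(100, 180))           # 80 placeholder sample values
page = fill_page(frame)
print(len(page), page[79], page[80])
```

Because locations N1 and N1+1 hold the same value, a reader that consumes one extra location repeats the last sample rather than playing stale data.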
[0075] Note that the speed of the write operation is determined by
the speed at which the DSP operates and not by the rate of the DAC
clock. Generally speaking the machine-cycle time of the DSP will be
very small and the entire process of writing 81 samples will be a
very small fraction of the 10 ms frame duration.
[0076] In common implementations in customer-premises equipment
such as the Integrated Access Device (IAD), the DAC clock is
locally generated and may or may not be locked to a network
reference. That is, it may be derived from a free-running
oscillator. In either case it is not controlled by the DSP module
that is reading out from the jitter buffer because implementing a
clock synchronization method based on jitter-buffer fill (also
referred to as adaptive clock recovery) requires expensive
oscillators to smooth out the jitter introduced by packet network
that can be quite large (see Ref. [11] for example).
[0077] A key aspect of the invention is to allow the DAC clock to
run asynchronously with respect to the far-end ADC clock but yet
account for the frequency offset using a slip mechanism that is
based on single-sample slips while simulating clock synchronization
as applied to the jitter buffer read/write. That is, the intent of
this simulated synchronization is to avoid the "buffer shrink"
effect and keeping the data corrupted due to a slip small (one
sample) minimizes the deleterious effect on end-user quality of
experience.
[0078] The typical manner in which the "read clock" (the
"DAC-derived clock" in FIG. 6) for the DSP unit is generated is
based on the premise that the DAC unit will provide a marker
(generally implemented using the notion of a "software interrupt")
every 80 samples. In most implementations the DAC will empty one
page, say Page-A, and provide a software interrupt signal to
initiate the DSP unit operation so that the DSP will read one
frame's worth of information from the jitter buffer and complete the
signal processing required to fill Page-A while the DAC is reading
from Page-B. The operation of the DAC unit will involve a counter
that starts from 0 for each page and is incremented by 1 when a
sample is extracted from that page. When the counter reaches a
"final value" the page is deemed to have been emptied and the DAC
unit flags the DSP unit that one frame interval has transpired. If
this "final value" is 80 then the frame interval is 10 ms (assuming
that the sampling rate is nominally 8 kHz) according to the DAC
clock. One key aspect of the invention is to allow the controlling
entity to adjust the "final value". Thus if the final value is set
at 79, the DAC will interrupt the DSP unit in less than 10 ms (10
ms minus 125 .mu.s) and if the final value is set at 81, the DAC
will interrupt the DSP unit in more than 10 ms (10 ms plus 125
.mu.s) where these time intervals are based on the DAC clock.
[0079] That is, the method of changing the "final value" provides
the means to either shorten or lengthen the apparent frame interval
corresponding to an apparent increase or decrease of the apparent
DAC clock frequency from the viewpoint of the read clock.
[0080] Some important points associated with this method:
a. If the final value is set at 79 (N1 set to 78) then the DAC will
extract only 79 out of the 80 valid samples from the page. That is,
effectively we have deleted one sample.
b. If the final value is set at 81 (N1 set to 80) then the DAC will
attempt to extract 81 samples from the page though there are only 80
valid samples. To ensure that this is done in a reasonable manner, the
buffer size should be 81 and when the DSP writes 80 samples into the
buffer it repeats the last sample to get the 81st sample. That is,
samples 80 and 81 are the same. Consequently the DAC is repeating one
sample.
c. The controlling entity should change the final value occasionally,
and only when necessary. At all other times it should be left at the
nominal value of 80 (N1 set to 79).
Note that the example cited above assumed a frame size of 10 ms and a
sampling rate of 8 kHz. The same technique is applicable for different
frame sizes and different sampling rates though the specific values
such as 79, 80, and 81 for the "final value" will depend on the
sampling rate and chosen frame size.
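The effect of the adjustable "final value" on one page can be illustrated with a short sketch. Here the final value is taken as a sample count (80 nominal), the page is assumed to hold 81 locations with the last sample duplicated, and the function name and constant are illustrative.

```python
SAMPLE_RATE_HZ = 8000


def read_page(page, final_value):
    """Play final_value samples from a page.

    Returns the samples handed to the DAC and the apparent frame interval
    in microseconds, measured in DAC clock time.
    """
    if not 1 <= final_value <= len(page):
        raise ValueError("final_value must lie within the page")
    played = []
    counter = 0                          # restarts from 0 for every page
    while counter < final_value:
        played.append(page[counter])
        counter += 1
    interval_us = final_value * 1_000_000 // SAMPLE_RATE_HZ
    return played, interval_us


page = list(range(80)) + [79]            # 80 valid samples plus the repeat
for final_value in (79, 80, 81):
    played, interval_us = read_page(page, final_value)
    print(final_value, len(played), interval_us)
```

Setting the count to 79 drops the last sample and shortens the interval to 9875 .mu.s; setting it to 81 plays the duplicated sample and lengthens the interval to 10125 .mu.s.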
3.3 Circular Buffer Implementation for the Packet Jitter Buffer
[0081] The overall adaptive jitter double buffer arrangement can be
viewed as a combination of the linear double buffer between the DSP
block and the DAC and a "traditional" jitter buffer that stores
packets between the depacketization block and the DSP block (as
depicted by the FIFO in FIG. 6). The FIFO is advantageously
implemented as a circular buffer.
[0082] A simplified view of the circular buffer arrangement is
depicted in FIG. 8. A buffer 432 is coupled to a write address
generator 431. A read address generator 433 is coupled to the
buffer 432. A page control block 434 is coupled between the write
address generator 431 and the read address generator 433. A control
signals block 435 is coupled to the read address generator 433. The
data written into the buffer comprises the packets extracted from
the IP stream by the depacketization block. The size of the
circular buffer is 2N "locations", each "location" containing the
data associated with the packet. The data read out of the buffer
comprises the packet data that is used by the DSP to extract the
speech samples. As mentioned before, based on each read access the
DSP block gets enough information to synthesize one block/frame
worth of samples that will be fed to the DAC. In this
implementation it is assumed that the nuances of the method are
implemented in the "Read Add. Gen." block and thus the "Write Add.
Gen." block where the write address ("WR_ADD") is generated can be
quite simple. The block labeled ".DELTA." generates the difference
between the read and write addresses ["RD_ADD"-"WR_ADD"] where the
B-bit numbers are interpreted as 2's-complement represented
integers. The block labeled "Control Signals" represents the
circuitry implementing the logic associated with the control
signals required by the "Read Add. Gen." block. The functions
associated with the various blocks are elaborated upon next. These
functions have a direct counterpart for a linear buffer
arrangement.
[0083] The "Write Add. Gen." block is quite straightforward. The
starting address is provided as the initial value of the
write_pointer and then for every write operation the write_pointer
is incremented. Since a circular buffer operation is used,
modulo-2N arithmetic provides the wrap-around feature. When the
write instruction is asserted (see write instruction W1 in
pseudo-code; this applies for the jitter buffer as well), the input
data is written into the buffer in the location pointed to by the
counter contents, "WR_ADD", and the write_pointer incremented by
one. In the case of a linear buffer arrangement software
instructions are needed to determine the suitable memory address of
the start of the page.
[0084] The "page ctrl" block represents a function that monitors
whether the read operation as well as the write operation are
happening in the same "location". If so then the buffer has
overflowed/underflowed and the correct action is to forcibly move
one or the other side to the opposite part of the circular buffer.
This is achieved by adding "N" (modulo-2N) to write_address or to
the read_address (depending which is to be forcibly moved to the
other page). Minor modifications are required in the case of a
linear buffer arrangement.
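The forced move performed by the "page ctrl" function can be sketched as below. The text does not say which side is moved in a given situation, so the `move` parameter is a hypothetical way to leave that choice to the caller.

```python
def resolve_collision(rd_add, wr_add, n, move="read"):
    """If both addresses hit the same location, push one side N locations away.

    Addresses are taken modulo 2N, so adding N lands on the diametrically
    opposite part of the circular buffer.
    """
    if rd_add != wr_add:
        return rd_add, wr_add            # no overflow/underflow: nothing to do
    if move == "read":
        return (rd_add + n) % (2 * n), wr_add
    if move == "write":
        return rd_add, (wr_add + n) % (2 * n)
    raise ValueError("move must be 'read' or 'write'")


print(resolve_collision(5, 5, 8))              # read side is moved
print(resolve_collision(12, 12, 8, "write"))   # write side wraps around
```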
[0085] The block labeled ".DELTA." generates the difference
["RD_ADD"-"WR_ADD"]=.DELTA.n. This difference is done modulo-2N;
when the memory addresses are at diametrically opposite parts of
the circular buffer the difference will be N; when the addresses
are close to each other the difference is small in magnitude; when
they coincide the difference is zero. Considering the circular
nature of the buffer, defining which is "ahead" is somewhat moot.
For our purposes, if .DELTA.n is positive the write pointer is
"catching up" to the read pointer; if .DELTA.n is negative the read
pointer is catching up to the write pointer.
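One possible reading of the modulo-2N difference is sketched below. It assumes the result is wrapped into the signed range [-N, N), in the manner of a two's-complement number, so that diametrically opposite addresses give a magnitude of N and coincident addresses give zero. The function name is illustrative.

```python
def delta_n(rd_add, wr_add, n):
    """RD_ADD - WR_ADD taken modulo 2N and wrapped into the range [-N, N)."""
    diff = (rd_add - wr_add) % (2 * n)
    return diff - 2 * n if diff >= n else diff


N = 8
print(delta_n(3, 1, N))    # small positive: write is just behind read
print(delta_n(1, 3, N))    # small negative: read is just behind write
print(delta_n(0, 8, N))    # diametrically opposite: magnitude N
```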
[0086] Assigning appropriate actions based on the value of .DELTA.n is a
key aspect of the invention.
[0087] To this end, three "threshold values",
T.sub.3>T.sub.2>T.sub.1 are predetermined. Suitable choices
for these thresholds and the underlying rationale are provided
later. Comparison of .DELTA.n with these determines the "state" of
the adaptive play-out buffer; the state then determines the
appropriate action.
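The threshold comparison can be sketched as a small classifier over the quantity |.DELTA.n-N| used in the state definitions that follow. The threshold values in the example are hypothetical, and two boundary cases are assumptions of this sketch: a deviation exactly equal to T1 is treated as "green" and a deviation exactly equal to T3 as "red".

```python
def classify_state(deviation, t1, t2, t3):
    """Map a deviation |delta_n - N| to 'green', 'yellow', 'orange' or 'red'."""
    if not t1 < t2 < t3:
        raise ValueError("thresholds must satisfy T3 > T2 > T1")
    if deviation <= t1:
        return "green"      # pointers far apart: normal read increment
    if deviation < t2:
        return "yellow"     # slip allowed if flag is true and timer expired
    if deviation < t3:
        return "orange"     # slip allowed once the timer has expired
    return "red"            # pointers very close: extreme action


T1, T2, T3 = 2, 4, 6        # hypothetical thresholds, in buffer locations
for deviation in range(0, 8):
    print(deviation, classify_state(deviation, T1, T2, T3))
```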
a. If |.DELTA.n-N|.ltoreq.T.sub.1, the state is "green". The
implication of the "green" state is that the read and write
pointers are far apart and no special action is taken. Note that
the furthest they can be apart is, essentially, N, implying that
the read and write operations are occurring in diametrically
opposite parts of the circular buffer. The "increment" applied to
the read address pointer (discussed shortly) is unity implying the
read function operates in a normal manner.
b. If T.sub.2>|.DELTA.n-N|.gtoreq.T.sub.1, the state is "yellow". The
implication of the "yellow" state is that the read and write
pointers are possibly coming closer and some action is required.
This takes the form of a controlled slip provided some other
conditions are met. A controlled slip involves repeating or
deleting one signal sample by changing the final_value in the
linear double-buffer arrangement between the DSP and the DAC.
[0088] A repetition is achieved by increasing the final_value by one, so
that the last index read is (N1+1), as described earlier. This implies
that we are essentially repeating a sample. This is done if .DELTA.n is
negative (read catching up with write). What this accomplishes is
artificially increasing the duration of a "frame" from the
viewpoint of accessing the jitter buffer, slowing down the rate at
which the read is catching up with the write.
[0089] Making the final_value equal to (N1-1) means the read
address reads one less location from the page, essentially deleting
a sample. This is done if .DELTA.n is positive (write catching up
with read). What this accomplishes is artificially decreasing the
duration of a "frame" from the viewpoint of accessing the jitter
buffer, slowing down the rate at which the write is catching up
with the read.
[0090] The aforementioned conditions for allowing a slip operation
to take place are the following:
1) The flag associated with the current read data should be true.
The flag will be set true by the signal processing block if the
sample is part of an "actionable" signal segment.
2) The timer has expired. The timer is essentially a counter that is
reset (to zero) when a slip event (repetition/deletion) has occurred.
The timer counter is incremented by the DAC clock and saturates at a
(pre-determined) maximum value. Until it reaches this maximum
count, slip events are inhibited. The intent is to ensure that slip
events are not allowed to occur too close together.
c. If T.sub.3>|.DELTA.n-N|.gtoreq.T.sub.2, the state is "orange". The
implication of the "orange" state is that the read and write
pointers are very likely coming closer and some action is
definitely required. This takes the form of a controlled slip
provided some other conditions are met. This is similar to the
yellow state with relaxed conditions. In particular, the flag is
ignored. The timer constraint is the same as for the yellow state.
d. If |.DELTA.n-N|>T.sub.3, the state is "red". The implication
of the "red" state is that the read and write pointers are very
close to each other and some extreme action is required. This takes
the form of a controlled slip provided the timer constraint is met
(as in the orange state) as well as a request to the signal
processing entity that packet loss concealment must be initiated.
If .DELTA.n is negative a segment of synthetic speech must be
inserted; if .DELTA.n is positive a segment of speech must be
deleted. In the red state we invoke not just effective change of
frame duration by 1 DAC sample interval, but an entire frame in
addition.
[0091] Traditional "adaptive" jitter buffers adjust the size of the
jitter buffer to mitigate the occurrence of such overflow/underflow
events. That is, the size of the jitter buffer is increased if the
trend is seen to be towards such overflow/underflow events.
Traditional adaptive algorithms for jitter buffers malfunction
because they make no distinction between overflow/underflow caused
by packet delay variation and overflow/underflow caused by a clock
offset. The slip function implemented in this algorithm addresses
the clock offset issue and therefore if overflow/underflow does
occur it is because the jitter buffer is not large enough to
accommodate the packet delay variation in the network. Consequently
the invention disclosed here will improve and enhance the behavior
of conventional adaptive jitter buffer algorithms.
e. If .DELTA.n=0, the state is "catastrophic" implying that the
write pointer and read pointer are coincident. This requires very
drastic action. This is achieved by re-centering the jitter buffer.
That is, the read pointer is "reset" to be diametrically opposite
to the write pointer. N packets will be lost or repeated by this
action that is equivalent to jitter buffer overflow/underflow.
Suitable values for the thresholds are T.sub.3=(3/4)N;
T.sub.2=(1/2)N; T.sub.1=(1/4)N, where the size of the overall
jitter buffer is 2N. If the packet loss concealment algorithm is
not very sophisticated and thus should be minimally invoked, an
alternate set of threshold values is T.sub.3=(7/8)N;
T.sub.2=(3/4)N; T.sub.1=(1/8)N. These choices are well suited for
efficient implementation and it is unlikely that "optimum" values
for these thresholds, derived by any sophisticated means, will
provide an efficacy sufficiently greater than this particular set to
warrant an increase in implementation complexity. The value for N,
the buffer size, depends on the expected time-delay variation. If
we assume a packet size of 10 ms (80 speech samples) a "typical"
time-delay variation will be .+-.10 ms, corresponding to .+-.0.5
packet duration.
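By way of illustration only, the state determination of items "a"
through "e" can be sketched in Python as follows. The function names
and the example value N=160 are hypothetical and do not appear in the
specification. The sketch assumes that the comparison is made on the
magnitude of .DELTA.n (that is, on ||.DELTA.n|-N|), since .DELTA.n may
be negative, and it assumes a particular treatment of the boundary
values: a distance exactly equal to T.sub.1 is treated as green and a
distance exactly equal to T.sub.3 is treated as red.

```python
def default_thresholds(n: int) -> tuple[int, int, int]:
    """Return (T1, T2, T3) = (N/4, N/2, 3N/4), the first set suggested in the text."""
    return n // 4, n // 2, (3 * n) // 4


def classify_state(delta_n: int, n: int, t1: int, t2: int, t3: int) -> str:
    """Map the pointer difference delta_n to a state name.

    Assumption: the distance from the ideal separation N is taken on
    the magnitude of delta_n, so that the sign only selects the slip
    direction and not the state.
    """
    if delta_n == 0:
        return "catastrophic"      # pointers coincide (item "e")
    distance = abs(abs(delta_n) - n)
    if distance <= t1:
        return "green"             # pointers far apart, no action
    if distance < t2:
        return "yellow"            # slip if flag is true and timer expired
    if distance < t3:
        return "orange"            # slip if timer expired, flag ignored
    return "red"                   # slip plus packet loss concealment


if __name__ == "__main__":
    n = 160                        # hypothetical half-size of the jitter buffer
    t1, t2, t3 = default_thresholds(n)
    for delta_n in (160, 100, -60, 20, 0):
        print(delta_n, classify_state(delta_n, n, t1, t2, t3))
```

The catastrophic case is tested first because .DELTA.n=0 would
otherwise also satisfy the red condition.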
[0092] A suitable value for the timer is the closest power of 2
less than the packet size and in this case is 64. With this choice
of timer, the slip events will be constrained to no more than twice
per packet duration.
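As a non-limiting sketch, the timer value and the saturating timer
behavior described above can be modeled in Python as follows. The
names largest_power_of_two_below, SlipTimer and simulate_slips are
hypothetical, and the simulation assumes the worst case in which a
slip is requested on every DAC clock tick.

```python
def largest_power_of_two_below(packet_size: int) -> int:
    """Largest power of 2 strictly less than packet_size (80 -> 64)."""
    if packet_size < 2:
        raise ValueError("packet_size must be at least 2")
    return 1 << ((packet_size - 1).bit_length() - 1)


class SlipTimer:
    """Counter that is reset on a slip and saturates at max_count."""

    def __init__(self, max_count: int) -> None:
        self.max_count = max_count
        self.count = 0

    def tick(self) -> None:
        # Incremented by the DAC clock; saturates at the maximum value.
        if self.count < self.max_count:
            self.count += 1

    def expired(self) -> bool:
        return self.count >= self.max_count

    def reset(self) -> None:
        self.count = 0


def simulate_slips(max_count: int, total_ticks: int) -> list[int]:
    """Return the tick indices at which a slip is allowed to occur."""
    timer = SlipTimer(max_count)
    slips = []
    for tick in range(1, total_ticks + 1):
        timer.tick()
        if timer.expired():        # slip events are inhibited until expiry
            slips.append(tick)
            timer.reset()
    return slips


if __name__ == "__main__":
    timer_value = largest_power_of_two_below(80)
    print(timer_value)
    print(simulate_slips(timer_value, 160))
```

With a timer value of 64 and a packet of 80 samples, consecutive
slips in this model are at least 64 DAC samples apart, so no window
of 80 samples contains more than two slips.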
[0093] The block labeled "Read Add. Gen." is important since this
is a key aspect of the invention. A simplified view of this block
is shown in FIG. 9. A timer 447 is coupled to an increment control
block 443. An increment generator 442 is coupled to the increment
control 443 and generates a final_value 441. The increment
generator 442 is coupled to an adder block 444, which in-turn is
coupled to a select block 445, which in-turn is coupled to a
register block 446, which in-turn is coupled to a read address
block 448.
[0094] The entity M-WR_ADD represents the WR_ADD modified to
represent the address diametrically opposite the current location
that is being written into. If .DELTA.n=0, the drastic action taken
is to make the select control choose M-WR_ADD to load into the read
address register (see item "e" above). The read address counter is
implemented as an accumulator that is updated based on the
DAC-derived clock ("Read_Clock"). Under normal operation the
increment is one unit (corresponding to packet size). That is, the
read operation will sequence through the jitter buffer in a normal
manner. The adjustment of the "Read_Clock" interval based on the
slip buffer mechanism between DSP and DAC will account for
frequency offset between DAC and far-end ADC clock. If the
condition is "red" (see item "d" above) then the increment is
either 0 units (the packet loss concealment algorithm is invoked)
or 2 units (one packet is effectively deleted).
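The accumulator behavior of the read address generator can be
sketched, purely for illustration, in Python as follows. The class
name ReadAddressGenerator and its method names are hypothetical. The
sketch assumes that addresses are counted in units of one packet,
that the jitter buffer holds 2N such units, and that M-WR_ADD is
obtained by adding N to WR_ADD modulo 2N.

```python
class ReadAddressGenerator:
    """Accumulator-style read address counter for a circular buffer of 2N units."""

    def __init__(self, n: int, start: int = 0) -> None:
        self.n = n
        self.size = 2 * n
        self.rd_add = start % self.size

    def modified_write_address(self, wr_add: int) -> int:
        # M-WR_ADD: the address diametrically opposite the write address.
        return (wr_add + self.n) % self.size

    def clock(self, wr_add: int, increment: int = 1, recenter: bool = False) -> int:
        """Advance on one Read_Clock event and return the new RD_ADD.

        increment is 1 in normal operation, and 0 or 2 in the red state.
        recenter selects M-WR_ADD instead of the adder output
        (catastrophic state).
        """
        if increment not in (0, 1, 2):
            raise ValueError("increment must be 0, 1 or 2 units")
        if recenter:
            self.rd_add = self.modified_write_address(wr_add)
        else:
            self.rd_add = (self.rd_add + increment) % self.size
        return self.rd_add


if __name__ == "__main__":
    gen = ReadAddressGenerator(n=4)
    print(gen.clock(wr_add=5))                 # normal read
    print(gen.clock(wr_add=5, increment=2))    # one packet deleted
    print(gen.clock(wr_add=5, increment=0))    # concealment, address held
    print(gen.clock(wr_add=6, recenter=True))  # re-centering
```

Selecting between the adder output and M-WR_ADD in a single clocked
update mirrors the select block and register block of FIG. 9.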
[0095] The notion of "Final_value" is the control value for the
double buffer between the DSP block and the DAC. The nominal value
will be called "N" in the following. (N-1) and (N+1) are the values
for Final_value that will delete or repeat a (DAC) sample,
respectively.
[0096] The block labeled "Increment Control" is one aspect of the
invention of the adaptive play-out buffer. The actions have been
described before but are summarized here for completeness. Based on
the various state conditions this block controls the generation of
the increment used by the read address counter:
1. If State is catastrophic (.DELTA.n=0):
i. Assert reset (forcing read pointer to be diametrically opposite
to write pointer).
ii. Reset timer. This is optional. Included for specificity.
iii. Set increment to one unit. This is optional since counter action
is overridden by reset action. Set Final_value to "N".
2. If State is red:
[0097] i. Deliver message to signal processing entity that packet
loss concealment (deletion or synthesis, based on sign of .DELTA.n)
is required. FIG. 9 does not show this control signal explicitly
but it is implied. Set increment to 0 or 2 units.
ii. If timer has not expired, set Final_value to "N".
iii. If timer has expired, set Final_value to (N-1) or (N+1)
depending on sign of .DELTA.n and reset timer.
3. If State is orange:
i. If timer has not expired, set Final_value to "N".
ii. If timer has expired, set Final_value to (N-1) or (N+1)
depending on sign of .DELTA.n and reset timer.
4. If State is yellow:
i. If timer has not expired, or flag is false, set Final_value to
"N".
ii. If timer has expired, and flag is true, set Final_value to (N-1)
or (N+1) depending on sign of .DELTA.n and reset timer.
iii. Note: If the signal processing entity does not provide the flag
it is deemed to be always true.
5. If State is green:
i. Set Final_value to "N". (Normal slip buffer operation)
Note: In states orange, yellow, and green the increment for the read
address for the jitter buffer (i.e. RD_ADD in FIG. 9) is set to one
unit.
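The summary above can be expressed, as one possible sketch and not
as a definitive implementation, by the following Python function.
The names Decision and increment_control, the field names, and the
example nominal value of 80 are hypothetical. The sign convention
follows the earlier description: a negative .DELTA.n selects
repetition (N+1) and, in the red state, an increment of 0 units; a
positive .DELTA.n selects deletion (N-1) and, in the red state, an
increment of 2 units.

```python
from dataclasses import dataclass


@dataclass
class Decision:
    reset_read_pointer: bool   # force RD_ADD to M-WR_ADD
    reset_timer: bool
    increment: int             # units added to the read address
    final_value: int           # control value for the DSP/DAC double buffer
    request_plc: bool          # ask for packet loss concealment


def increment_control(state: str, delta_n: int, timer_expired: bool,
                      nominal: int, flag: bool = True) -> Decision:
    """Return the control outputs for one evaluation of the state.

    flag defaults to True, matching the note that an absent flag is
    deemed to be always true.
    """
    # Repeat a sample if read is catching up with write, else delete one.
    slip_value = nominal + 1 if delta_n < 0 else nominal - 1

    if state == "catastrophic":
        return Decision(True, True, 1, nominal, False)
    if state == "red":
        increment = 0 if delta_n < 0 else 2
        if timer_expired:
            return Decision(False, True, increment, slip_value, True)
        return Decision(False, False, increment, nominal, True)
    if state == "orange":
        slip_allowed = timer_expired            # flag is ignored
    elif state == "yellow":
        slip_allowed = timer_expired and flag
    elif state == "green":
        slip_allowed = False
    else:
        raise ValueError(f"unknown state: {state}")

    if slip_allowed:
        return Decision(False, True, 1, slip_value, False)
    return Decision(False, False, 1, nominal, False)


if __name__ == "__main__":
    print(increment_control("yellow", -5, True, nominal=80))
    print(increment_control("red", 3, False, nominal=80))
```

Returning all control outputs in a single record keeps the timer
reset tied to the same decision that changes Final_value, so a slip
cannot occur without restarting the timer.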
SUMMARY
[0098] One of the problems associated with communication of
real-time information over packet networks is the time-delay
variation introduced. A second problem is that the transport is
asynchronous and therefore the receiving end may be operating at a
different timing-base from the sending end. The packetized nature
of VoIP necessitates the use of a jitter buffer and, possibly, a
second buffer to interface to the actual digital to analog
converter (DAC). The invention described herein deals with simple
and efficient methods to address the jitter buffer and clock offset
issues.
[0099] Salient points of the invention are:
1) The DAC double buffer is made adaptive in the sense that
controlled slips are implemented.
2) The signal-processing entity can flag samples from segments of
speech that are considered "actionable".
3) The slip action can, optionally, be inhibited if the sample
affected has been flagged as "nonactionable".
4) The controlled slip action is instantiated by monitoring the fill
of the jitter buffer.
5) The jitter buffer FIFO is implemented as a circular buffer and the
difference between the read and write pointers is used as a measure
of buffer fill.
6) A timer is used to ensure that slip events do not occur too close
to each other.
7) A timer is used to ensure that the frequency control is not too
rapid.
DEFINITIONS
[0100] The term program and/or the phrase computer program are
intended to mean a sequence of instructions designed for execution
on a computer system (e.g., a program and/or computer program, may
include a subroutine, a function, a procedure, an object method, an
object implementation, an executable application, an applet, a
servlet, a source code, an object code, a shared library/dynamic
load library and/or other sequence of instructions designed for
execution on a computer or computer system).
[0101] The term substantially is intended to mean largely but not
necessarily wholly that which is specified. The term approximately
is intended to mean at least close to a given value (e.g., within
10% of). The term generally is intended to mean at least
approaching a given state. The term coupled is intended to mean
connected, although not necessarily directly, and not necessarily
mechanically. The term proximate, as used herein, is intended to
mean close, near, adjacent and/or coincident; and includes spatial
situations where specified functions and/or results (if any) can be
carried out and/or achieved. The term distal, as used herein, is
intended to mean far, away, spaced apart from and/or
non-coincident, and includes spatial situations where specified
functions and/or results (if any) can be carried out and/or
achieved. The term deploying is intended to mean designing,
building, shipping, installing and/or operating.
[0102] The terms first or one, and the phrases at least a first or
at least one, are intended to mean the singular or the plural
unless it is clear from the intrinsic text of this document that it
is meant otherwise. The terms second or another, and the phrases at
least a second or at least another, are intended to mean the
singular or the plural unless it is clear from the intrinsic text
of this document that it is meant otherwise. Unless expressly
stated to the contrary in the intrinsic text of this document, the
term or is intended to mean an inclusive or and not an exclusive
or. Specifically, a condition A or B is satisfied by any one of the
following: A is true (or present) and B is false (or not present),
A is false (or not present) and B is true (or present), and both A
and B are true (or present). The terms a and/or an are employed for
grammatical style and merely for convenience.
[0103] The term plurality is intended to mean two or more than two.
The term any is intended to mean all applicable members of a set or
at least a subset of all applicable members of the set. The phrase
any integer derivable therein is intended to mean an integer
between the corresponding numbers recited in the specification. The
phrase any range derivable therein is intended to mean any range
within such corresponding numbers. The term means, when followed by
the term "for" is intended to mean hardware, firmware and/or
software for achieving a result. The term step, when followed by
the term "for" is intended to mean a (sub)method, (sub)process
and/or (sub)routine for achieving the recited result. Unless
otherwise defined, all technical and scientific terms used herein
have the same meaning as commonly understood by one of ordinary
skill in the art to which this invention belongs. In case of
conflict, the present specification, including definitions, will
control.
CONCLUSION
[0104] The described embodiments and examples are illustrative only
and not intended to be limiting. Although embodiments of the
invention can be implemented separately, embodiments of the
invention may be integrated into the system(s) with which they are
associated. All the embodiments of the invention disclosed herein
can be made and used without undue experimentation in light of the
disclosure. Although the best mode of the invention contemplated by
the inventor(s) is disclosed, embodiments of the invention are not
limited thereto. Embodiments of the invention are not limited by
theoretical statements (if any) recited herein. The individual
steps of embodiments of the invention need not be performed in the
disclosed manner, or combined in the disclosed sequences, but may
be performed in any and all manner and/or combined in any and all
sequences.
[0105] Various substitutions, modifications, additions and/or
rearrangements of the features of embodiments of the invention may
be made without deviating from the spirit and/or scope of the
underlying inventive concept. All the disclosed elements and
features of each disclosed embodiment can be combined with, or
substituted for, the disclosed elements and features of every other
disclosed embodiment except where such elements or features are
mutually exclusive. The spirit and/or scope of the underlying
inventive concept as defined by the appended claims and their
equivalents cover all such substitutions, modifications, additions
and/or rearrangements.
[0106] The appended claims are not to be interpreted as including
means-plus-function limitations, unless such a limitation is
explicitly recited in a given claim using the phrase(s) "means for"
and/or "step for." Subgeneric embodiments of the invention are
delineated by the appended independent claims and their
equivalents. Specific embodiments of the invention are
differentiated by the appended dependent claims and their
equivalents.
REFERENCES
[0107] [1] RFC 3550, RTP: A Transport Protocol for Real-Time
Applications, Internet Engineering Task Force Request for Comments.
[0108] [2] RFC 3551, RTP Profile for Audio and Video Conferences
with Minimal Control, Internet Engineering Task Force Request for
Comments.
[0109] [3] ITU-T Recommendation G.711, Pulse Code Modulation (PCM)
of Voice Frequencies, Geneva, 1989.
[0110] [4] Kishan Shenoi, Digital Signal Processing in
Telecommunications, Prentice-Hall, 1995. ISBN 0-13-096751-3.
[0111] [5] ITU-T Recommendations series G, Transmission systems and
media, digital systems and networks.
[0112] [6] Stefano Bregni, Synchronization of Digital
Telecommunications Networks, John Wiley & Sons, 2002.
ISBN 0 471 61550 1.
[0113] [7] P. K. Bhatnagar, Engineering Networks for
Synchronization, CCS 7, and ISDN, IEEE Press, 1997.
ISBN 0-7803-1158-2.
[0114] [8] Danny De Vleeschauwer and Jan Janssen, Voice Performance
over packet-based networks, An Alcatel White Paper.
[0115] [9] Ramachandran Ramjee, Jim Kurose, Don Towsley, and Henning
Schulzrinne, Adaptive playout mechanisms for packetized audio
applications in wide-area networks, Proceedings of the Conference on
Computer Communication (IEEE INFOCOM), Toronto, Canada, June 1994.
[0116] [10] Aman Kansal and Abhay Karandikar, Jitter-free audio
playout over Best Effort packet networks, in ATM
Forum--International Symposium on Broadband Communication in the
New Millennium, August 2001.
[0117] [11] Kishan Shenoi, Synchronization implications of providing
Circuit Emulation Services in an IP Network, NFOEC/OFC, Anaheim,
Calif., March 2005.
[0118] [12] Kishan Shenoi, Synchronization Implications in VoIP,
NIST-ATIS Workshop on Synchronization in Telecommunications Systems
(WSTS), February 2004.
* * * * *