U.S. patent application number 13/267,738, filed with the patent office on 2011-10-06, was published on 2012-12-06 as publication number 20120310642, for automatically creating a mapping between text data and audio data. This patent application is currently assigned to APPLE INC. The invention is credited to Alan C. Cannistraro, Xiang Cao, Casey M. Dougherty, and Gregory S. Robbin.
Publication Number: 20120310642
Application Number: 13/267,738
Family ID: 47262337
Publication Date: 2012-12-06

United States Patent Application 20120310642
Kind Code: A1
Cao; Xiang; et al.
December 6, 2012
AUTOMATICALLY CREATING A MAPPING BETWEEN TEXT DATA AND AUDIO
DATA
Abstract
Techniques are provided for creating a mapping that maps
locations in audio data (e.g., an audio book) to corresponding
locations in text data (e.g., an e-book). Techniques are provided
for using a mapping between audio data and text data, whether the
mapping is created automatically or manually. A mapping may
be used for bookmark switching where a bookmark established in one
version of a digital work is used to identify a corresponding
location within another version of the digital work. Alternatively,
the mapping may be used to play audio that corresponds to text
selected by a user. Alternatively, the mapping may be used to
automatically highlight text in response to audio that corresponds
to the text being played. Alternatively, the mapping may be used to
determine where an annotation created in one media context (e.g.,
audio) will be consumed in another media context (e.g., text).
Inventors: Cao; Xiang (Sunnyvale, CA); Cannistraro; Alan C. (San Francisco, CA); Robbin; Gregory S. (Mountain View, CA); Dougherty; Casey M. (San Francisco, CA)
Assignee: APPLE INC. (Cupertino, CA)
Family ID: 47262337
Appl. No.: 13/267,738
Filed: October 6, 2011
Related U.S. Patent Documents

Application Number   Filing Date   Patent Number
61/493,372           Jun 3, 2011
61/494,375           Jun 7, 2011
Current U.S. Class: 704/235; 704/260; 704/E13.011; 704/E15.043
Current CPC Class: G06F 16/685 20190101; G06F 40/169 20200101; G10L 15/19 20130101; G10L 15/26 20130101; G10L 13/00 20130101
Class at Publication: 704/235; 704/260; 704/E15.043; 704/E13.011
International Class: G10L 15/26 20060101 G10L015/26; G10L 13/08 20060101 G10L013/08
Claims
1. A method comprising: receiving audio data that reflects an
audible version of a work for which a textual version exists;
performing a speech-to-text analysis of the audio data to generate
text for portions of the audio data; and based on the text
generated for the portions of the audio data, generating a mapping
between a plurality of audio locations in the audio data and a
corresponding plurality of text locations in the textual version of
the work; wherein the method is performed by one or more computing
devices.
2. The method of claim 1 wherein generating text for portions of
the audio data includes generating text for portions of the audio
data based, at least in part, on textual context of the work.
3. The method of claim 2, wherein generating text for portions of
the audio data based, at least in part, on textual context of the
work includes generating text based, at least in part, on one or
more rules of grammar used in the textual version of the work.
4. The method of claim 2, wherein generating text for portions of
the audio data based, at least in part, on textual context of the
work includes limiting which words the portions can be translated
to based on which words are in the textual version of the work, or
a subset thereof.
5. The method of claim 4, wherein limiting which words the portions
can be translated to based on which words are in the textual
version of the work includes, for a given portion of the audio
data, identifying a sub-section of the textual version of the work
that corresponds to the given portion and limiting the words to
only those words in the sub-section of the textual version of the
work.
6. The method of claim 5, wherein: identifying the sub-section of
the textual version of the work includes maintaining a current text
location in the textual version of the work that corresponds to a
current audio location, in the audio data, of the speech-to-text
analysis; and the sub-section of the textual version of the work is
a section associated with the current text location.
7. The method of claim 1, wherein the portions include portions
that correspond to individual words, and the mapping maps the
locations of the portions that correspond to individual words to
individual words in the textual version of the work.
8. The method of claim 1, wherein the portions include portions
that correspond to individual sentences, and the mapping maps the
locations of the portions that correspond to individual sentences
to individual sentences in the textual version of the work.
9. The method of claim 1, wherein the portions include portions
that correspond to fixed amounts of data, and the mapping maps the
locations of the portions that correspond to fixed amounts of data
to corresponding locations in the textual version of the work.
10. The method of claim 1, wherein generating the mapping includes:
(1) embedding anchors in the audio data; (2) embedding anchors in
the textual version of the work; or (3) storing the mapping in a
media overlay that is stored in association with the audio data or
the textual version of the work.
11. The method of claim 1, wherein each of one or more text
locations of the plurality of text locations indicates a relative
location in the textual version of the work.
12. The method of claim 1, wherein one text location, of the
plurality of text locations, indicates a relative location in the
textual version of the work and another text location, of the
plurality of text locations, indicates an absolute location from
the relative location.
13. The method of claim 1, wherein each of one or more text
locations of the plurality of text locations indicates an anchor
within the textual version of the work.
14. A method comprising: receiving a textual version of a work;
performing a text-to-speech analysis of the textual version to
generate first audio data; based on the first audio data and the
textual version, generating a first mapping between a first
plurality of audio locations in the first audio data and a
corresponding plurality of text locations in the textual version of
the work; receiving second audio data that reflects an audible
version of the work for which the textual version exists; and based
on (1) a comparison of the first audio data and the second audio
data and (2) the first mapping, generating a second mapping between
a second plurality of audio locations in the second audio data and
the plurality of text locations in the textual version of the work;
wherein the method is performed by one or more computing
devices.
15. A method comprising: receiving audio input; performing a
speech-to-text analysis of the audio input to generate text for
portions of the audio input; determining whether the text generated
for portions of the audio input matches text that is currently
displayed; and in response to determining that the text matches
text that is currently displayed, causing the text that is
currently displayed to be highlighted; wherein the method is
performed by one or more computing devices.
16. One or more storage media storing instructions which, when
executed by one or more processors, causes performance of the
method recited in claim 1.
17. One or more storage media storing instructions which, when
executed by one or more processors, causes performance of the
method recited in claim 2.
18. One or more storage media storing instructions which, when
executed by one or more processors, causes performance of the
method recited in claim 3.
19. One or more storage media storing instructions which, when
executed by one or more processors, causes performance of the
method recited in claim 4.
20. One or more storage media storing instructions which, when
executed by one or more processors, causes performance of the
method recited in claim 5.
21. One or more storage media storing instructions which, when
executed by one or more processors, causes performance of the
method recited in claim 6.
22. One or more storage media storing instructions which, when
executed by one or more processors, causes performance of the
method recited in claim 7.
23. One or more storage media storing instructions which, when
executed by one or more processors, causes performance of the
method recited in claim 8.
24. One or more storage media storing instructions which, when
executed by one or more processors, causes performance of the
method recited in claim 9.
25. One or more storage media storing instructions which, when
executed by one or more processors, causes performance of the
method recited in claim 10.
26. One or more storage media storing instructions which, when
executed by one or more processors, causes performance of the
method recited in claim 11.
27. One or more storage media storing instructions which, when
executed by one or more processors, causes performance of the
method recited in claim 12.
28. One or more storage media storing instructions which, when
executed by one or more processors, causes performance of the
method recited in claim 13.
29. One or more storage media storing instructions which, when
executed by one or more processors, causes performance of the
method recited in claim 14.
30. One or more storage media storing instructions which, when
executed by one or more processors, causes performance of the
method recited in claim 15.
Description
CROSS-REFERENCE TO RELATED APPLICATIONS
[0001] The present application claims priority to U.S. Provisional
Patent Application No. 61/493,372, entitled "Automatically Creating
A Mapping Between Text Data And Audio Data And Switching Between
Text Data And Audio Data Based On A Mapping," filed on Jun. 3,
2011, invented by Alan C. Cannistraro, et al., the entire
disclosure of which is incorporated by reference for all purposes
as if fully set forth herein.
[0002] The present application claims priority to U.S. Provisional
Patent Application No. 61/494,375, entitled "Automatically Creating
A Mapping Between Text Data And Audio Data And Switching Between
Text Data And Audio Data Based On A Mapping," filed on Jun. 7,
2011, invented by Alan C. Cannistraro, et al., the entire
disclosure of which is incorporated by reference for all purposes
as if fully set forth herein.
[0003] The present application is related to U.S. patent
application Ser. No. ______ entitled "Switching Between Text Data
and Audio Data Based on a Mapping," filed on the same day herewith,
the entire disclosure of which is incorporated by reference for all
purposes as if fully set forth herein.
FIELD OF THE INVENTION
[0004] The present invention relates to automatically creating a
mapping between text data and audio data by analyzing the audio
data to detect words reflected therein and compare those words to
words in the document.
BACKGROUND
[0005] With the cost of handheld electronic devices decreasing and
large demand for digital content, creative works that have once
been published on printed media are increasingly becoming available
as digital media. For example, digital books (also known as
"e-books") are increasingly popular, along with specialized
handheld electronic devices known as e-book readers (or
"e-readers"). Also, other handheld devices, such as tablet
computers and smart phones, although not designed solely as
e-readers, have the capability to be operated as e-readers.
[0006] A common standard by which e-books are formatted is the EPUB
standard (short for "electronic publication"), which is a free and
open e-book standard by the International Digital Publishing Forum
(IDPF). An EPUB file uses XHTML 1.1 (or DTBook) to construct the
content of a book. Styling and layout are performed using a subset
of CSS, referred to as OPS Style Sheets.
[0007] For some written works, especially those that become
popular, an audio version of the written work is created. For
example, a recording of a famous individual (or one with a pleasant
voice) reading a written work is created and made available for
purchase, whether online or in a brick and mortar store.
[0008] It is not uncommon for consumers to purchase both an e-book
and an audio version (or "audio book") of the e-book. In some
cases, a user reads the entirety of an e-book and then desires to
listen to the audio book. In other cases, a user transitions
between reading and listening to the book, based on the user's
circumstances. For example, while engaging in sports or driving
during a commute, the user will tend to listen to the audio version
of the book. On the other hand, when lounging in a sofa-chair prior
to bed, the user will tend to read the e-book version of the book.
Unfortunately, such transitions can be painful, since the user must
remember where she stopped in the e-book and manually locate where
to begin in the audio book, or vice versa. Even if the user
remembers clearly what was happening in the book where the user
left off, such transitions can still be painful because knowing
what is happening does not necessarily make it easy to find the
portion of an eBook or audio book that corresponds to those
happenings. Thus, switching between an e-book and an audio book may
be extremely time-consuming.
[0009] The specification "EPUB Media Overlays 3.0" defines a usage
of SMIL (Synchronized Multimedia Integration Language), the Package
Document, the EPUB Style Sheet, and the EPUB Content Document for
representation of synchronized text and audio publications. A
pre-recorded narration of a publication can be represented as a
series of audio clips, each corresponding to part of the text. Each
single audio clip, in the series of audio clips that make up a
pre-recorded narration, typically represents a single phrase or
paragraph, but infers no order relative to the other clips or to
the text of a document. Media Overlays solve this problem of
synchronization by tying the structured audio narration to its
corresponding text in the EPUB Content Document using SMIL markup.
Media Overlays are a simplified subset of SMIL 3.0 that allow the
playback sequence of these clips to be defined.
[0010] Unfortunately, creating Media Overlay files is largely a
manual process. Consequently, the granularity of the mapping
between audio and textual versions of a work is very coarse. For
example, a media overlay file may associate the beginning of each
paragraph in an e-book with a corresponding location in an audio
version of the book. The reason that media overlay files,
especially for novels, do not contain a mapping at any finer level
of granularity, such as on a word-by-word basis, is that creating
such a highly granular media overlay file might take countless
hours in human labor.
[0011] The approaches described in this section are approaches that
could be pursued, but not necessarily approaches that have been
previously conceived or pursued. Therefore, unless otherwise
indicated, it should not be assumed that any of the approaches
described in this section qualify as prior art merely by virtue of
their inclusion in this section.
BRIEF DESCRIPTION OF THE DRAWINGS
[0012] In the drawings:
[0013] FIG. 1 is a flow diagram that depicts a process for
automatically creating a mapping between text data and audio data,
according to an embodiment of the invention;
[0014] FIG. 2 is a block diagram that depicts a process that
involves an audio-to-text correlator in generating a mapping
between text data and audio data, according to an embodiment of the
invention;
[0015] FIG. 3 is a flow diagram that depicts a process for using a
mapping in one or more of these scenarios, according to an
embodiment of the invention;
[0016] FIG. 4 is a block diagram that depicts an example system 400 that
may be used to implement some of the processes described herein,
according to an embodiment of the invention.
[0017] FIGS. 5A-B are flow diagrams that depict processes for
bookmark switching, according to an embodiment of the
invention;
[0018] FIG. 6 is a flow diagram that depicts a process for causing
text, from a textual version of a work, to be highlighted while an
audio version of the work is being played, according to an
embodiment of the invention;
[0019] FIG. 7 is a flow diagram that depicts a process of
highlighting displayed text in response to audio input from a user,
according to an embodiment of the invention;
[0020] FIGS. 8A-B are flow diagrams that depict processes for
transferring an annotation from one media context to another,
according to an embodiment of the invention; and
[0021] FIG. 9 is a block diagram that illustrates a computer system
upon which an embodiment of the invention may be implemented.
DETAILED DESCRIPTION
[0022] In the following description, for the purposes of
explanation, numerous specific details are set forth in order to
provide a thorough understanding of the present invention. It will
be apparent, however, that the present invention may be practiced
without these specific details. In other instances, well-known
structures and devices are shown in block diagram form in order to
avoid unnecessarily obscuring the present invention.
Overview of Automatic Generation of Audio-to-Text Mapping
[0023] According to one approach, a mapping is automatically
created where the mapping maps locations within an audio version of
a work (e.g., an audio book) with corresponding locations in a
textual version of the work (e.g., an e-book). The mapping is
created by performing a speech-to-text analysis on the audio
version to identify words reflected in the audio version. The
identified words are matched up with the corresponding words in the
textual version of the work. The mapping associates locations
(within the audio version) of the identified words with locations
in the textual version of the work where the identified words are
found.
Audio Version Formats
[0024] The audio data reflects an audible reading of text of a
textual version of a work, such as a book, web page, pamphlet,
flyer, etc. The audio data may be stored in one or more audio
files. The one or more audio files may be in one of many file
formats. Non-limiting examples of audio file formats include AAC,
MP3, WAV, and PCM.
Textual Version Formats
[0025] Similarly, the text data to which the audio data is mapped
may be stored in one of many document file formats. Non-limiting
examples of document file formats include DOC, TXT, PDF, RTF, HTML,
XHTML, and EPUB.
[0026] A typical EPUB document is accompanied by a file that (a)
lists each XHTML content document, and (b) indicates an order of
the XHTML content documents. For example, if a book comprises 20
chapters, then an EPUB document for that book may have 20 different
XHTML documents, one for each chapter. A file that accompanies the
EPUB document identifies an order of the XHTML documents that
corresponds to the order of the chapters in the book. Thus, a
single (logical) document (whether an EPUB document or another type
of document) may comprise multiple data items or files.
[0027] The words or characters reflected in the text data may be in
one or multiple languages. For example, one portion of the text
data may be in English while another portion of the text data may
be in French. Although examples of English words are provided
herein, embodiments of the invention may be applied to other
languages, including character-based languages.
Audio and Text Locations in Mapping
[0028] As described herein, a mapping comprises a set of mapping
records, where each mapping record associates an audio location
with a text location.
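As an illustrative sketch (not part of the disclosure; the field names and types are assumptions), such a mapping record might be modeled as follows:

```python
from dataclasses import dataclass

@dataclass
class MappingRecord:
    """Associates one location in the audio data with one location in the text data."""
    audio_location: float   # e.g., a time offset into the audio, in seconds
    text_location: int      # e.g., a byte offset into the textual version

# A mapping is simply an ordered collection of such records.
mapping = [
    MappingRecord(audio_location=23.0, text_location=0),
    MappingRecord(audio_location=45.0, text_location=118),
]

# Records are kept sorted by audio location so that either side can be
# searched efficiently (e.g., by binary search) at playback time.
assert mapping[0].audio_location < mapping[1].audio_location
```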
[0029] Each audio location identifies a location in audio data. An
audio location may indicate an absolute location within the audio
data, a relative location within the audio data, or a combination
of an absolute location and a relative location. As an example of
an absolute location, an audio location may indicate a time offset
(e.g., 04:32:24 indicating 4 hours, 32 minutes, 24 seconds) into
the audio data, or a time range, as indicated below in Example A.
As an example of a relative location, an audio location may
indicate a chapter number, a paragraph number, and a line number.
As an example of a combination of an absolute location and a
relative location, the audio location may indicate a chapter number
and a time offset into the chapter indicated by the chapter
number.
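An absolute time offset of the form 04:32:24 can be converted into a total number of seconds; the helper below is an illustrative sketch, not part of the disclosure:

```python
def parse_time_offset(offset: str) -> int:
    """Convert an absolute audio location like '04:32:24'
    (hours:minutes:seconds) into a total number of seconds."""
    hours, minutes, seconds = (int(part) for part in offset.split(":"))
    return hours * 3600 + minutes * 60 + seconds

# 4 hours, 32 minutes, 24 seconds into the audio data:
assert parse_time_offset("04:32:24") == 16344
```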
[0030] Similarly, each text location identifies a location in text
data, such as a textual version of a work. A text location may
indicate an absolute location within the textual version of the
work, a relative location within the textual version of the work,
or a combination of an absolute location and a relative location.
As an example of an absolute location, a text location may indicate
a byte offset into the textual version of the work and/or an
"anchor" within the textual version of the work. An anchor is
metadata within the text data that identifies a specific location
or portion of text. An anchor may be stored separate from the text
in the text data that is displayed to an end-user or may be stored
among the text that is displayed to an end-user. For example, text
data may include the following sentence: "Why did the chicken <i
name="123"/>cross the road?" where "<i name="123"/>" is
the anchor. When that sentence is displayed to a user, the user
only sees "Why did the chicken cross the road?" Similarly, the same
sentence may have multiple anchors as follows: "<i
name="123"/>Why <i name="124"/>did <i
name="125"/>the <i name="126"/>chicken <i
name="127"/>cross <i name="128"/>the <i
name="129"/>road?" In this example, there is an anchor prior to
each word in the sentence.
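Anchors of the kind shown above can be located, and stripped for display, with a simple pattern. This sketch assumes the exact `<i name="..."/>` form of the example; real anchor syntax may vary:

```python
import re

# Matches anchor tags of the form <i name="123"/> used in the example above.
ANCHOR = re.compile(r'<i name="(\d+)"/>')

def strip_anchors(text: str) -> str:
    """Return the text as the end-user sees it, with anchors removed."""
    return ANCHOR.sub("", text)

def anchor_positions(text: str) -> dict:
    """Map each anchor name to its character offset in the displayed text."""
    positions, removed = {}, 0
    for match in ANCHOR.finditer(text):
        positions[match.group(1)] = match.start() - removed
        removed += len(match.group(0))
    return positions

sentence = 'Why did the chicken <i name="123"/>cross the road?'
assert strip_anchors(sentence) == "Why did the chicken cross the road?"
assert anchor_positions(sentence) == {"123": 20}
```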
[0031] As an example of a relative location, a text location may
indicate a page number, a chapter number, a paragraph number,
and/or a line number. As an example of a combination of an absolute
location and a relative location, a text location may indicate a
chapter number and an anchor into the chapter indicated by the
chapter number.
[0032] Examples of how to represent a text location and an audio
location are provided in the specification entitled "EPUB Media
Overlays 3.0," which defines a usage of SMIL (Synchronized
Multimedia Integration Language), an EPUB Style Sheet, and an EPUB
Content Document. An example of an association that associates a
text location with an audio location and that is provided in the
specification is as follows:
<par>
  <text src="chapter1.xhtml#sentence1"/>
  <audio src="chapter1_audio.mp3" clipBegin="23s" clipEnd="45s"/>
</par>
Example A
[0033] In Example A, the "par" element includes two child elements:
a "text" element and an "audio" element. The text element comprises
an attribute "src" that identifies a particular sentence within an
XHTML document that contains content from the first chapter of a
book. The audio element comprises a "src" attribute that identifies
an audio file that contains an audio version of the first chapter
of the book, a "clipBegin" attribute that identifies where an audio
clip within the audio file begins, and a "clipEnd" attribute that
identifies where the audio clip within the audio file ends. Thus,
seconds 23 through 45 in the audio file correspond to the first
sentence in Chapter 1 of the book.
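The markup in Example A can be read with a standard XML parser. The following sketch (illustrative only; the element and attribute names come from the example above) extracts the text location and the audio clip boundaries:

```python
import xml.etree.ElementTree as ET

smil = '''<par>
  <text src="chapter1.xhtml#sentence1"/>
  <audio src="chapter1_audio.mp3" clipBegin="23s" clipEnd="45s"/>
</par>'''

par = ET.fromstring(smil)
text_src = par.find("text").get("src")
audio = par.find("audio")

# Seconds 23 through 45 of the audio file map to the first sentence of chapter 1.
clip_begin = int(audio.get("clipBegin").rstrip("s"))
clip_end = int(audio.get("clipEnd").rstrip("s"))

assert text_src == "chapter1.xhtml#sentence1"
assert (clip_begin, clip_end) == (23, 45)
```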
Creating a Mapping Between Text and Audio
[0034] According to an embodiment, a mapping between a textual
version of a work and an audio version of the same work is
automatically generated. Because the mapping is generated
automatically, the mapping may use much finer granularity than
would be practical using manual text-to-audio mapping techniques.
Each automatically-generated text-to-audio mapping includes
multiple mapping records, each of which associates a text location
in the textual version with an audio location in the audio
version.
[0035] FIG. 1 is a flow diagram that depicts a process 100 for
automatically creating a mapping between a textual version of a
work and an audio version of the same work, according to an
embodiment of the invention. At step 110, a speech-to-text analyzer
receives audio data that reflects an audible version of the work.
At step 120, while the speech-to-text analyzer performs an analysis
of the audio data, the speech-to-text analyzer generates text for
portions of the audio data. At step 130, based on the text
generated for the portions of the audio data, the speech-to-text
analyzer generates a mapping between a plurality of audio locations
in the audio data and a corresponding plurality of text locations
in the textual version of the work.
[0036] Step 130 may involve the speech-to-text analyzer comparing
the generated text with text in the textual version of the work to
determine where, within the textual version of the work, the
generated text is located. For each portion of generated text that
is found in the textual version of the work, the speech-to-text
analyzer associates (1) an audio location that indicates where,
within the audio data, the corresponding portion of audio data is
found with (2) a text location that indicates where, within the
textual version of the work, the portion of text is found.
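The comparison in step 130 can be sketched as a forward scan over the tokenized text. Here `recognized` stands in for the output of a hypothetical speech-to-text analyzer (word, time-offset pairs); the patent does not specify this interface, so all names are illustrative:

```python
def build_mapping(recognized, text_words):
    """Sketch of step 130: match each recognized word against the textual
    version, scanning forward from the last match, and record an
    (audio location, text location) pair for each word that is found."""
    mapping, cursor = [], 0
    for word, audio_seconds in recognized:
        for i in range(cursor, len(text_words)):
            if text_words[i].lower() == word.lower():
                mapping.append((audio_seconds, i))  # (audio location, text location)
                cursor = i + 1
                break
    return mapping

recognized = [("why", 0.0), ("did", 0.4), ("chicken", 1.1)]
text_words = "Why did the chicken cross the road".split()
assert build_mapping(recognized, text_words) == [(0.0, 0), (0.4, 1), (1.1, 3)]
```

The forward-moving cursor reflects the assumption, stated later in this description, that the narration follows the order of the text.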
Textual Context
[0037] Every document has a "textual context". The textual context
of a textual version of a work includes intrinsic characteristics
of the textual version of the work (e.g. the language the textual
version of the work is written in, the specific words that textual
version of the work uses, the grammar and punctuation that textual
version of the work uses, the way the textual version of the work
is structured, etc.) and extrinsic characteristics of the work
(e.g. the time period in which the work was created, the genre to
which the work belongs, the author of the work, etc.).
[0038] Different works may have significantly different textual
contexts. For example, the grammar used in a classic English novel
may be very different from the grammar of modern poetry. Thus,
while a certain word order may follow the rules of one grammar,
that same word order may violate the rules of another grammar.
Similarly, the grammar used in both a classic English novel and
modern poetry may differ from the grammar (or lack thereof)
employed in a text message sent from one teenager to another.
[0039] As mentioned above, one technique described herein
automatically creates a fine granularity mapping between the audio
version of a work and the textual version of the same work by
performing a speech-to-text conversion of the audio version of the
work. In an embodiment, the textual context of a work is used to
increase the accuracy of the speech-to-text analysis that is
performed on the audio version of the work. For example, in order
to determine the grammar employed in a work, the speech-to-text
analyzer (or another process) may analyze the textual version of
the work prior to performing a speech-to-text analysis. The
speech-to-text analyzer may then make use of the grammar
information thus obtained to increase the accuracy of the
speech-to-text analysis of the audio version of the work.
[0040] Instead of or in addition to automatically determining the
grammar of a work based on the textual version of the work, a user
may provide input that identifies one or more rules of grammar that
are followed by the author of the work. The rules associated with
the identified grammar are input to the speech-to-text analyzer to
assist the analyzer in recognizing words in the audio version of
the work.
Limiting the Candidate Dictionary Based on Textual Version
[0041] Typically, speech-to-text analyzers must be configured or
designed to recognize virtually every word in the English language
and, optionally, some words in other languages. Therefore,
speech-to-text analyzers must have access to a large dictionary of
words. The dictionary from which a speech-to-text analyzer selects
words during a speech-to-text operation is referred to herein as
the "candidate dictionary" of the speech-to-text analyzer. The
number of unique words in a typical candidate dictionary is
approximately 500,000.
[0042] In an embodiment, text from the textual version of a work is
taken into account when performing the speech-to-text analysis of
the audio version of the work. Specifically, in one embodiment,
during the speech-to-text analysis of an audio version of a work,
the candidate dictionary used by the speech-to-text analyzer is
restricted to the specific set of words that are in the text
version of the work. In other words, the only words that are
considered to be "candidates" during the speech-to-text operation
performed on an audio version of a work are those words that
actually appear in the textual version of the work.
[0043] By limiting the candidate dictionary used in the
speech-to-text translation of a particular work to those words that
appear in the textual version of the work, the speech-to-text
operation may be significantly improved. For example, assume that
the number of unique words in a particular work is 20,000. A
conventional speech-to-text analyzer may have difficulty
determining to which specific word, of a 500,000 word candidate
dictionary, a particular portion of audio corresponds. However,
that same portion of audio may unambiguously correspond to one
particular word when only the 20,000 unique words that are in the
textual version of the work are considered. Thus, with such a much
smaller dictionary of possible words, the accuracy of the
speech-to-text analyzer may be significantly improved.
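A minimal sketch of this restriction, assuming the textual version is available as a plain string (the tokenization is illustrative):

```python
import re

def candidate_dictionary(textual_version: str) -> set:
    """Restrict the candidate dictionary to the unique words that
    actually appear in the textual version of the work."""
    return set(re.findall(r"[a-z']+", textual_version.lower()))

full_dictionary_size = 500_000            # typical general-purpose dictionary
text = "Why did the chicken cross the road? The chicken crossed the road."
candidates = candidate_dictionary(text)

# Only words from the work are candidates during the speech-to-text operation.
assert "chicken" in candidates
assert "pterodactyl" not in candidates
assert len(candidates) < full_dictionary_size
```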
Limiting the Candidate Dictionary Based on Current Position
[0044] To improve accuracy, the candidate dictionary may be
restricted to even fewer words than all of the words in the textual
version of the work. In one embodiment, the candidate dictionary is
limited to those words found in a particular portion of the textual
version of the work. For example, during a speech-to-text
translation of a work, it is possible to approximately track the
"current translation position" of the translation operation
relative to the textual version of the work. Such tracking may be
performed, for example, by comparing (a) the text that has been
generated during the speech-to-text operation so far, against (b)
the textual version of the work.
[0045] Once the current translation position has been determined,
the candidate dictionary may further restricted based on the
current translation position. For example, in one embodiment, the
candidate dictionary is limited to only those words that appear,
within the textual version of the work, after the current
translation position. Thus, words that are found prior to the
current translation position, but not thereafter, are effectively
removed from the candidate dictionary. Such removal may increase
the accuracy of the speech-to-text analyzer, since the smaller the
candidate dictionary, the less likely the speech-to-text analyzer
will translate a portion of audio data to the wrong word.
[0046] As another example, prior to a speech-to-text analysis, an
audio book and a digital book may be divided into a number of
segments or sections. The audio book may be associated with an
audio section mapping and the digital book may be associated with a
text section mapping. For example, the audio section mapping and
the text section mapping may identify where each chapter begins or
ends. These respective mappings may be used by a speech-to-text
analyzer to limit the candidate dictionary. For example, if the
speech-to-text analyzer determines, based on the audio section
mapping, that the speech-to-text analyzer is analyzing the 4th
chapter of the audio book, then the speech-to-text analyzer uses
the text section mapping to identify the 4th chapter of the
digital book and limit the candidate dictionary to the words found
in the 4th chapter.
[0047] In a related embodiment, the speech-to-text analyzer employs
a sliding window that moves as the current translation position
moves. As the speech-to-text analyzer is analyzing the audio data,
the speech-to-text analyzer moves the sliding window "across" the
textual version of the work. The sliding window indicates two
locations within the textual version of the work. For example, the
boundaries of the sliding window may be (a) the start of the
paragraph that precedes the current translation position and (b)
the end of the third paragraph after the current translation
position. The candidate dictionary is restricted to only those
words that appear between those two locations.
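The sliding window can be sketched in a few lines. The window bounds (one paragraph back, three paragraphs ahead) follow the example in the text; the paragraph data is invented for illustration.

```python
# Illustrative sliding-window candidate dictionary. The window moves with the
# current translation position (here, a paragraph index).

def window_dictionary(paragraphs, current_paragraph_index, back=1, ahead=3):
    """Restrict candidates to the span from `back` paragraphs before the
    current translation position to `ahead` paragraphs after it."""
    start = max(0, current_paragraph_index - back)
    end = min(len(paragraphs), current_paragraph_index + ahead + 1)
    words = set()
    for paragraph in paragraphs[start:end]:
        words.update(paragraph.lower().split())
    return words

paras = ["alpha bravo", "charlie delta", "echo foxtrot",
         "golf hotel", "india juliet", "kilo lima", "mike november"]
candidates = window_dictionary(paras, current_paragraph_index=2)
```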
[0048] While a specific example was given above, the window may
span any amount of text within the textual version of the work. For
example, the window may span an absolute amount of text, such as 60
characters. As another example, the window may span a relative
amount of text from the textual version of the work, such as ten
words, three "lines" of text, two sentences, or one "page" of text. In
</gr-replace>
the relative amount scenario, the speech-to-text analyzer may use
formatting data within the textual version of the work to determine
how much of the textual version of the work constitutes a line or a
page. For example, the textual version of a work may comprise a
page indicator (e.g., in the form of an HTML or XML tag) that
indicates, within the content of the textual version of the work,
the beginning of a page or the ending of a page.
[0049] In an embodiment, the start of the window corresponds to the
current translation position. For example, the speech-to-text
analyzer maintains a current text location that indicates the most
recently-matched word in the textual version of the work and
maintains a current audio location that indicates the most
recently-identified word in the audio data. Unless the narrator
(whose voice is reflected in the audio data) misreads text of the
textual version of the work, adds his/her own content, or skips
portions of the textual version of the work during the recording,
the next word that the speech-to-text analyzer detects in the audio
data (i.e., after the current audio location) is most likely the
next word in the textual version of the work (i.e., after the
current text location). Maintaining both locations may
significantly increase the accuracy of the speech-to-text
translation.
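The dual-position tracking described above can be sketched as a simple alignment loop. The recognizer output, the small look-ahead tolerance for narrator skips, and the word timings are all assumptions made for this illustration.

```python
# Minimal sketch: advance a current text location and a current audio
# location together as recognized words are matched against the text.

def align(text_words, recognized):
    """recognized: list of (word, audio_time_sec) pairs from the analyzer.
    Returns mapping records (text_word_index, audio_time_sec) for matches."""
    mapping = []
    text_pos = 0  # current text location (index of next expected word)
    for word, audio_time in recognized:  # current audio location advances
        # The next recognized word is most likely the next text word, but
        # allow a small look-ahead in case the narrator skipped a word.
        for look in range(text_pos, min(text_pos + 3, len(text_words))):
            if text_words[look] == word:
                mapping.append((look, audio_time))
                text_pos = look + 1
                break
    return mapping

records = align(["it", "was", "a", "dark", "night"],
                [("it", 0.0), ("was", 0.4), ("dark", 1.1), ("night", 1.6)])
```

Note how the skipped word "a" does not derail the alignment: the look-ahead re-synchronizes at "dark".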
Creating a Mapping Using Audio-to-Audio Correlation
[0050] In an embodiment, a text-to-speech generator and an
audio-to-text correlator are used to automatically create a mapping
between the audio version of a work and the textual version of a
work. FIG. 2 is a block diagram that depicts these analyzers and
the data used to generate the mapping. Textual version 210 of a
work (such as an EPUB document) is input to text-to-speech
generator 220. Text-to-speech generator 220 may be implemented in
software, hardware, or a combination of hardware and software.
Whether implemented in software or hardware, text-to-speech
generator 220 may be implemented on a single computing device or
may be distributed among multiple computing devices.
[0051] Text-to-speech generator 220 generates audio data 230 based
on document 210. During the generation of the audio data 230,
text-to-speech generator 220 (or another component not shown)
creates an audio-to-document mapping 240. Audio-to-document mapping
240 maps multiple text locations within document 210 to
corresponding audio locations within generated audio data 230.
[0052] For example, assume that text-to-speech generator 220
generates audio data for a word located at location Y within
document 210. Further assume that the audio data that was generated
for the word is located at a location X within audio data 230. To
reflect the correlation between the location of the word within the
document 210 and the location of the corresponding audio in the
audio data 230, a mapping would be created between location X and
location Y.
[0053] Because text-to-speech generator 220 knows where a word or
phrase occurs in document 210 when a corresponding word or phrase
of audio is generated, each mapping between the corresponding words
or phrases can be easily generated.
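Because the generator knows which text location it is voicing, the mapping falls out as a by-product, as this hedged sketch shows. The per-word duration model is a stand-in for a real text-to-speech engine; all values are illustrative.

```python
# Sketch of a text-to-speech pass emitting an audio-to-document mapping:
# each synthesized word yields one (audio location X, text location Y) record.

def generate_audio_with_mapping(document_words, seconds_per_word=0.5):
    """Returns (total_audio_seconds, audio_to_document_mapping), where the
    mapping is a list of (audio_time_sec, text_word_index) records."""
    mapping = []
    audio_time = 0.0
    for text_index, _word in enumerate(document_words):
        # Location X (audio_time) corresponds to location Y (text_index).
        mapping.append((audio_time, text_index))
        audio_time += seconds_per_word  # stand-in for actual synthesis
    return audio_time, mapping

duration, a2d = generate_audio_with_mapping(["once", "upon", "a", "time"])
```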
[0054] Audio-to-text correlator 260 accepts, as input, generated
audio data 230, audio book 250, and audio-to-document mapping 240.
Audio-to-text correlator 260 performs two main steps: an
audio-to-audio correlation step and a look-up step. For the
audio-to-audio correlation step, audio-to-text correlator 260
compares generated audio data 230 with audio book 250 to determine
the correlation between portions of audio data 230 and portions of
audio book 250. For example, audio-to-text correlator 260 may
determine, for each word represented in audio data 230, the
location of the corresponding word in audio book 250.
[0055] The granularity at which audio data 230 is divided, for the
purpose of establishing correlations, may vary from implementation
to implementation. For example, a correlation may be established
between each word in audio data 230 and each corresponding word in
audio book 250. Alternatively, a correlation may be established
based on fixed-duration time intervals (e.g., one mapping for every
1 minute of audio). In yet another alternative, a correlation may
be established for portions of audio established based on other
criteria, such as at paragraph or chapter boundaries, significant
pauses (e.g., silence of greater than 3 seconds), or other
locations based on data in audio book 250, such as audio markers
within audio book 250.
[0056] After a correlation between a portion of audio data 230 and
a portion of audio book 250 is identified, audio-to-text correlator
260 uses audio-to-document mapping 240 to identify a text location
(indicated in mapping 240) that corresponds to the audio location
within generated audio data 230. Audio-to-text correlator 260 then
associates the text location with the audio location within audio
book 250 to create a mapping record in document-to-audio mapping
270.
[0057] For example, assume that a portion of audio book 250
(located at location Z) matches the portion of generated audio data
230 that is located at location X. Based on a mapping record (in
audio-to-document mapping 240) that correlates location X to
location Y within document 210, a mapping record in
document-to-audio mapping 270 would be created that correlates
location Z of the audio book 250 with location Y within document
210.
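The look-up step is, in effect, a composition of two mappings: Z-to-X correlations from the audio-to-audio step, and X-to-Y records from audio-to-document mapping 240, yielding Z-to-Y records. The location values below are hypothetical.

```python
# Sketch of composing the audio-to-audio correlation with the
# audio-to-document mapping to produce document-to-audio records.

def compose_mappings(audio_to_audio, audio_to_document):
    """audio_to_audio: list of (book_location_Z, generated_location_X).
    audio_to_document: dict of generated_location_X -> text_location_Y.
    Returns document-to-audio records as (book_location_Z, text_location_Y)."""
    document_to_audio = []
    for z, x in audio_to_audio:
        if x in audio_to_document:
            document_to_audio.append((z, audio_to_document[x]))
    return document_to_audio

d2a = compose_mappings([(10.0, 9.5), (20.0, 19.2)],
                       {9.5: 120, 19.2: 245})
```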
[0058] Audio-to-text correlator 260 repeatedly performs the
audio-to-audio correlation and look-up steps for each portion of
audio data 230. Therefore, document-to-audio mapping 270 comprises
multiple mapping records, each mapping record mapping a location
within document 210 to a location within audio book 250.
[0059] In an embodiment, the audio-to-audio correlation for each
portion of audio data 230 is immediately followed by the look-up
step for that portion of audio. Thus, document-to-audio mapping 270
may be created for each portion of audio data 230 prior to
proceeding to the next portion of audio data 230. Alternatively,
the audio-to-audio correlation step may be performed for many or
for all of the portions of audio data 230 before any look-up step
is performed. The look-up steps for all portions can be performed
in a batch, after all of the audio-to-audio correlations have been
established.
Mapping Granularity
[0060] A mapping has a number of attributes, one of which is the
mapping's size, which refers to the number of mapping records in
the mapping. Another attribute of a mapping is the mapping's
"granularity." The "granularity" of a mapping refers to the number
of mapping records in the mapping relative to the size of the
digital work. Thus, the granularity of a mapping may vary from one
digital work to another digital work. For example, a first mapping
for a digital book that comprises 200 "pages" includes a mapping
record only for each paragraph in the digital book. Thus, the first
mapping may comprise 1000 mapping records. On the other hand, a
second mapping for a digital "children's" book that comprises 20
pages includes a mapping record for each word in the children's
book. Thus, the second mapping may comprise 800 mapping records.
Even though the first mapping comprises more mapping records than
the second mapping, the granularity of the second mapping is finer
than the granularity of the first mapping.
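The arithmetic behind this example can be made concrete: granularity compares record count to work size, so the smaller mapping can still be the finer-grained one. The page and record counts are taken directly from the example above.

```python
# Granularity expressed as mapping records per page of the digital work:
# a higher value means a finer granularity.

def granularity(record_count, page_count):
    """Mapping records per page."""
    return record_count / page_count

novel = granularity(1000, 200)         # paragraph-level mapping, 200 pages
childrens_book = granularity(800, 20)  # word-level mapping, 20 pages
```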
[0061] In an embodiment, the granularity of a mapping may be
dictated based on input to a speech-to-text analyzer that generates
the mapping. For example, a user may specify a specific granularity
before causing a speech-to-text analyzer to generate a mapping.
Non-limiting examples of specific granularities include:
[0062] word granularity (i.e., an association for each word),
[0063] sentence granularity (i.e., an association for each
sentence),
[0064] paragraph granularity (i.e., an association for each
paragraph),
[0065] 10-word granularity (i.e., a mapping for each 10 word
portion in the digital work), and
[0066] 10-second granularity (i.e., a mapping for each 10 seconds
of audio).
[0067] As another example, a user may specify the type of digital
work (e.g., novel, children's book, short story) and the
speech-to-text analyzer (or another process) determines the
granularity based on the work's type. For example, a children's
book may be associated with word granularity while a novel may be
associated with sentence granularity.
[0068] The granularity of a mapping may even vary within the same
digital work. For example, a mapping for the first three chapters
of a digital book may have sentence granularity while a mapping for
the remaining chapters of the digital book have word
granularity.
On-The-Fly Mapping Generation During Text-to-Audio Transitions
[0069] While an audio-to-text mapping will, in many cases, be
generated prior to a user needing to rely on one, in one
embodiment, an audio-to-text mapping is generated at runtime or
after a user has begun to consume the audio data and/or the text
data on the user's device. For example, a user reads a textual
version of a digital book using a tablet computer. The tablet
computer keeps track of the most recent page or section of the
digital book that the tablet computer has displayed to the user.
The most recent page or section is identified by a "text
bookmark."
[0070] Later, the user selects to play an audio book version of the
same work. The playback device may be the same tablet computer on
which the user was reading the digital book or another device.
Regardless of the device upon which the audio book is to be played,
the text bookmark is retrieved, and a speech-to-text analysis is
performed relative to at least a portion of the audio book. During
the speech-to-text analysis, "temporary" mapping records are
generated to establish a correlation between the generated text and
the corresponding locations within the audio book.
[0071] Once the text and correlation records have been generated, a
text-to-text comparison is used to determine the generated text
that corresponds to the text bookmark. Then, the temporary mapping
records are used to identify the portion of the audio book that
corresponds to the portion of generated text that corresponds to
the text bookmark. Playback of the audio book is then initiated
from that position.
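The text-to-text comparison and the use of temporary mapping records can be sketched as below. The recognizer output format (word, audio time) and the sample values are assumptions for illustration; a real comparison would tolerate recognition errors rather than require an exact match.

```python
# Sketch: find the audio playback position for a text bookmark by searching
# the speech-to-text output for the bookmarked phrase.

def audio_position_for_text_bookmark(bookmark_words, temporary_records):
    """temporary_records: ordered list of (produced_word, audio_time_sec).
    Returns the audio time at which the bookmarked phrase begins, or None."""
    produced = [w for w, _t in temporary_records]
    n = len(bookmark_words)
    for i in range(len(produced) - n + 1):
        if produced[i:i + n] == bookmark_words:  # text-to-text comparison
            return temporary_records[i][1]
    return None

records = [("the", 300.0), ("whale", 300.5), ("rose", 301.0), ("again", 301.4)]
start = audio_position_for_text_bookmark(["whale", "rose"], records)
```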
[0072] The portion of the audio book on which the speech-to-text
analysis is performed may be limited to the portion that
corresponds to the text bookmark. For example, an audio section
mapping may already exist that indicates where certain portions of
the audio book begin and/or end. For example, an audio section
mapping may indicate where each chapter begins, where one or more
pages begin, etc. Such an audio section mapping may be helpful to
determine where to begin the speech-to-text analysis so that a
speech-to-text analysis on the entire audio book is not required to
be performed. For example, if the text bookmark indicates a
location within the 12th chapter of the digital book and an
audio section mapping associated with the audio data identifies
where the 12th chapter begins in the audio data, then a
speech-to-text analysis is not required to be performed on any of
the first 11 chapters of the audio book. For example, the audio
data may consist of 20 audio files, one audio file for each
chapter. Therefore, only the audio file that corresponds to the
12th chapter is input to a speech-to-text analyzer.
On-the-Fly Mapping Generation During Audio-to-Text Transitions
[0073] Mapping records can be generated on-the-fly to facilitate
audio-to-text transitions, as well as text-to-audio transitions.
For example, assume that a user is listening to an audio book using
a smart phone. The smart phone keeps track of the current location
within the audio book that is being played. The current location is
identified by an "audio bookmark." Later, the user picks up a
tablet computer and selects a digital book version of the audio
book to display. The tablet computer receives the audio bookmark
(e.g., from a central server that is remote relative to the tablet
computer and the smart phone), performs a speech-to-text analysis
of at least a portion of the audio book, and identifies, within a
textual version of the work, the portion of text that corresponds
to the audio bookmark. The tablet computer then begins displaying
the identified portion within the textual version.
[0074] The portion of the audio book on which the speech-to-text
analysis is performed may be limited to the portion that
corresponds to the audio bookmark. For example, a speech-to-text
analysis is performed on a portion of the audio book that spans one
or more time segments (e.g., seconds) prior to the audio bookmark
in the audio book and/or one or more time segments after the audio
bookmark in the audio book. The text produced by the speech-to-text
analysis on that portion is compared to text in the textual version
to locate where the series of words or phrases in the produced text
match text in the textual version.
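The reverse transition can be sketched the same way: transcribe only a small window around the audio bookmark, then search the textual version for the produced phrase. The sample text and phrase are illustrative, and the exact-match search stands in for a fuzzier real-world comparison.

```python
# Sketch: find the display position for an audio bookmark by locating the
# transcribed window's words within the textual version.

def text_position_for_audio_bookmark(produced_words, text_words):
    """Return the index in text_words where the produced phrase matches,
    or None if the transcribed window is not found."""
    n = len(produced_words)
    for i in range(len(text_words) - n + 1):
        if text_words[i:i + n] == produced_words:
            return i
    return None

text = "call me ishmael some years ago never mind how long".split()
index = text_position_for_audio_bookmark(["some", "years", "ago"], text)
```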
[0075] If there exists a text section mapping that indicates where
certain portions of the textual version begin or end and the audio
bookmark can be used to identify a section in the text section
mapping, then much of the textual version need not be analyzed in
order to locate where the series of words or phrases in the
produced text match text in the textual version. For example, if
the audio bookmark indicates a location within the 3rd
chapter of the audio book and a text section mapping associated
with the digital book identifies where the 3rd chapter begins
in the textual version, then a speech-to-text analysis is not
required to be performed on any of the first two chapters of the
audio book or on any of the chapters of the audio book after the
3rd chapter.
Overview of Use of Audio-to-Text Mappings
[0076] According to one approach, a mapping (whether created
manually or automatically) is used to identify the locations within
an audio version of a digital work (e.g., an audio book) that
correspond to locations within a textual version of the digital
work (e.g., an e-book). For example, a mapping may be used to
identify a location within an e-book based on a "bookmark"
established in an audio book. As another example, a mapping may be
used to identify which displayed text corresponds to an audio
recording of a person reading the text as the audio recording is
being played and cause the identified text to be highlighted. Thus,
while an audio book is being played, a user of an e-book reader may
follow along as the e-book reader highlights the corresponding
text. As another example, a mapping may be used to identify a
location in audio data and play audio at that location in response
to input that selects displayed text from an e-book. Thus, a user
may select a word in an e-book, which selection causes audio that
corresponds to that word to be played. As another example, a user
may create an annotation while "consuming" (e.g., reading or
listening to) one version of a digital work (e.g., an e-book) and
cause the annotation to be consumed while the user is consuming
another version of the digital work (e.g., an audio book). Thus, a
user can make notes on a "page" of an e-book and may view those
notes while listening to an audio book of the e-book. Similarly, a
user can make a note while listening to an audio book and then can
view that note when reading the corresponding e-book.
[0077] FIG. 3 is a flow diagram that depicts a process for using a
mapping in one or more of these scenarios, according to an
embodiment of the invention.
[0078] At step 310, location data that indicates a specified
location within a first media item is obtained. The first media
item may be a textual version of a work or audio data that
corresponds to a textual version of the work. This step may be
performed by a device (operated by a user) that consumes the first
media item. Alternatively, the step may be performed by a server
that is located remotely relative to the device that consumes the
first media item. Thus, the device sends the location data to the
server over a network using a communication protocol.
[0079] At step 320, a mapping is inspected to determine a first
media location that corresponds to the specified location.
Similarly, this step may be performed by a device that consumes the
first media item or by a server that is located remotely relative
to the device.
[0080] At step 330, a second media location that corresponds to the
first media location and that is indicated in the mapping is
determined. For example, if the specified location is an audio
"bookmark", then the first media location is an audio location
indicated in the mapping and the second media location is a text
location that is associated with the audio location in the mapping.
Similarly, if the specified location is a text
"bookmark", then the first media location is a text location
indicated in the mapping and the second media location is an audio
location that is associated with the text location in the
mapping.
[0081] At step 340, the second media item is processed based on the
second media location. For example, if the second media item is
audio data, then the second media location is an audio location and
is used as a current playback position in the audio data. As
another example, if the second media item is a textual version of a
work, then the second media location is a text location and is used
to determine which portion of the textual version of the work to
display.
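Steps 310 through 330 can be sketched as one lookup over the mapping. Representing each mapping record as a (text location, audio location) pair and choosing the nearest record are assumptions for this illustration; all values are invented.

```python
# Sketch of process 300: given a specified location in one medium, find the
# nearest mapped location in that medium and return its counterpart.

def corresponding_location(mapping, specified, from_media):
    """mapping: list of (text_location, audio_location) records.
    Returns the counterpart of the record nearest the specified location."""
    src = 0 if from_media == "text" else 1
    dst = 1 - src
    best = min(mapping, key=lambda record: abs(record[src] - specified))
    return best[dst]

mapping = [(100, 12.0), (200, 25.5), (300, 40.0)]  # (text offset, seconds)
audio_pos = corresponding_location(mapping, 210, "text")   # text bookmark
text_pos = corresponding_location(mapping, 39.0, "audio")  # audio bookmark
```

The same function serves both bookmark directions; only which field is treated as the "first media location" changes.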
[0082] Examples of using process 300 in specific scenarios are
provided below.
Architecture Overview
[0083] Each of the example scenarios mentioned above and described
in detail below may involve one or more computing devices. FIG. 4
is a block diagram that depicts an example system 400 that may be
used to implement some of the processes described herein, according
to an embodiment of the invention. System 400 includes end-user
device
implement some of the processes described herein, according to an
embodiment of the invention. System 400 includes end-user device
410, intermediary device 420, and end-user device 430. Non-limiting
examples of end-user devices 410 and 430 include desktop computers,
laptop computers, smart phones, tablet computers, and other
handheld computing devices.
[0084] As depicted in FIG. 4, device 410 stores a digital media
item 402 and executes a text media player 412 and an audio media
player 414. Text media player 412 is configured to process
electronic text data and cause device 410 to display text (e.g., on
a touch screen of device 410, not shown). Thus, if digital media
item 402 is an e-book, then text media player 412 may be configured
to process digital media item 402, as long as digital media item
402 is in a text format that text media player 412 is configured to
process. Device 410 may execute one or more other media players
(not shown) that are configured to process other types of media,
such as video.
[0085] Similarly, audio media player 414 is configured to process
audio data and cause device 410 to generate audio (e.g., via
speakers on device 410, not shown). Thus, if digital media item 402
is an audio book, then audio media player 414 may be configured to
process digital media item 402, as long as digital media item 402
is in an audio format that audio media player 414 is configured to
process. Whether item 402 is an e-book or an audio book, item 402
may comprise multiple files, whether audio files or text files.
[0086] Device 430 similarly stores a digital media item 404 and
executes an audio media player 432 that is configured to process
audio data and cause device 430 to generate audio. Device 430 may
execute one or more other media players (not shown) that are
configured to process other types of media, such as video and
text.
[0087] Intermediary device 420 stores a mapping 406 that maps audio
locations within audio data to text locations in text data. For
example, mapping 406 may map audio locations within digital media
item 404 to text locations within digital media item 402. Although
not depicted in FIG. 4, intermediary device 420 may store many
mappings, one for each corresponding set of audio data and text
data. Also, intermediary device 420 may interact with many end-user
devices not shown.
[0088] Also, intermediary device 420 may store digital media items
that users may access via their respective devices. Thus, instead
of storing a local copy of a digital media item, a device (e.g.,
device 430) may request the digital media item from intermediary
device 420.
[0089] Additionally, intermediary device 420 may store account data
that associates one or more devices of a user with a single
account. Thus, such account data may indicate that devices 410 and
430 are registered by the same user under the same account.
Intermediary device 420 may also store account-item association
data that associates an account with one or more digital media
items owned (or purchased) by a particular user. Thus, intermediary
device 420 may verify that device 430 may access a particular
digital media item by determining whether the account-item
association data indicates that device 430 and the particular
digital media item are associated with the same account.
[0090] Although only two end-user devices are depicted, an end-user
may own and operate more or fewer devices that consume digital media
items, such as e-books and audio books. Similarly, although only a
single intermediary device 420 is depicted, the entity that owns
and operates intermediary device 420 may operate multiple devices,
each of which provides the same service or may operate together to
provide a service to the user of end-user devices 410 and 430.
[0091] Communication between intermediary device 420 and end-user
devices 410 and 430 is made possible via network 440. Network 440
may be implemented by any medium or mechanism that provides for the
exchange of data between various computing devices. Examples of
such a network include, without limitation, a network such as a
Local Area Network (LAN), Wide Area Network (WAN), Ethernet or the
Internet, or one or more terrestrial, satellite, or wireless links.
The network may include a combination of networks such as those
described. The network may transmit data according to Transmission
Control Protocol (TCP), User Datagram Protocol (UDP), and/or
Internet Protocol (IP).
Storage Location of Mapping
[0092] A mapping may be stored separate from the text data and the
audio data from which the mapping was generated. For example, as
depicted in FIG. 4, mapping 406 is stored separate from digital
media items 402 and 404 even though mapping 406 may be used to
identify a media location in one digital media item based on a
media location in the other digital media item. In fact, mapping
406 is stored on a separate computing device (intermediary device
420) than devices 410 and 430 that store, respectively, digital
media items 402 and 404.
[0093] Additionally or alternatively, a mapping may be stored as
part of the corresponding text data. For example, mapping 406 may
be stored in digital media item 402. However, even if the mapping
is stored as part of the text data, the mapping may not be
displayed to an end-user that consumes the text data. Additionally
or alternatively still, a mapping may be stored as part of the
audio data. For example, mapping 406 may be stored in digital media
item 404.
Bookmark Switching
[0094] "Bookmark switching" refers to establishing a specified
location (or "bookmark") in one version of a digital work and using
the bookmark to find the corresponding location within another
version of the digital work. There are two types of bookmark
switching: text-to-audio (TA) bookmark switching and audio-to-text
(AT) bookmark switching. TA bookmark switching involves using a
text bookmark established in an e-book to identify a corresponding
audio location in an audio book. Conversely, AT bookmark switching
involves using an audio bookmark established in an audio book to
identify a corresponding text location within an e-book.
Text-to-Audio Bookmark Switching
[0095] FIG. 5A is a flow diagram that depicts a process 500 for TA
bookmark switching, according to an embodiment of the invention.
FIG. 5A is described using elements of system 400 depicted in FIG.
4.
[0096] At step 502, a text media player 412 (e.g., an e-reader)
determines a text bookmark within digital media item 402 (e.g., a
digital book). Device 410 displays content from digital media item
402 to a user of device 410.
[0097] The text bookmark may be determined in response to input
from the user. For example, the user may touch an area on a touch
screen of device 410. The display, at or near that area, shows
one or more words. In response to the input, the text
media player 412 determines the one or more words that are closest
to the area. The text media player 412 determines the text bookmark
based on the determined one or more words.
[0098] Alternatively, the text bookmark may be determined based on
the last text data that was displayed to the user. For example, the
digital media item 402 may comprise 200 electronic "pages" and page
110 was the last page that was displayed. Text media player 412
determines that page 110 was the last page that was displayed. Text
media player 412 may establish page 110 as the text bookmark or may
establish a point at the beginning of page 110 as the text
bookmark, since there may be no way to know where the user stopped
reading. It may be safe to assume that the user at least read the
last sentence on page 109, which sentence may have ended on page
109 or on page 110. Therefore, the text media player 412 may
establish the beginning of the next sentence (which begins on page
110) as the text bookmark. However, if the granularity of the
mapping is at the paragraph level, then text media player 412 may
establish the beginning of the last paragraph on page 109.
Similarly, if the granularity of the mapping is at the chapter
level, then text media player 412 may establish the beginning of
the chapter that includes page 110 as the text bookmark.
[0099] At step 504, text media player 412 sends, over network 440
to intermediary device 420, data that indicates the text bookmark.
Intermediary device 420 may store the text bookmark in association
with device 410 and/or an account of the user of device 410.
Previous to step 502, the user may have established an account with
an operator of intermediary device 420. The user then registered
one or more devices, including device 410, with the operator. The
registration caused each of the one or more devices to be
associated with the user's account.
[0100] One or more factors may cause the text media player 412 to
send the text bookmark to intermediary device 420. Such factors may
include the exiting (or closing down) of text media player 412, the
establishment of the text bookmark by the user, or an explicit
instruction by the user to save the text bookmark for use when
listening to the audio book that corresponds to the textual version
of the work for which the text bookmark is established.
[0101] As noted previously, intermediary device 420 has access to
(e.g., stores) mapping 406, which, in this example, maps multiple
audio locations in digital media item 404 with multiple text
locations within digital media item 402.
[0102] At step 506, intermediary device 420 inspects mapping 406 to
determine a particular text location, of the multiple text
locations, that corresponds to the text bookmark. The text bookmark
may not exactly match any of the multiple text locations in mapping
406. However, intermediary device 420 may select the text location
that is closest to the text bookmark. Alternatively, intermediary
device 420 may select the text location that is immediately before
the text bookmark, which text location may or may not be the
closest text location to the text bookmark. For example, if the
text bookmark indicates the 5th chapter, 3rd paragraph,
5th sentence, and the closest text locations in mapping 406 are
(1) the 5th chapter, 3rd paragraph, 1st sentence and
(2) the 5th chapter, 3rd paragraph, 6th sentence, then
text location (1) is selected.
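Both selection rules described above (the closest record, or the record immediately before the bookmark) reduce to a binary search over the sorted text locations in the mapping. Encoding text locations as flat word offsets is an assumption made to keep the sketch simple.

```python
# Sketch of the two text-location selection rules used in step 506.
import bisect

def preceding_location(sorted_locations, bookmark):
    """The mapped text location at or immediately before the bookmark
    (falls back to the first record if the bookmark precedes them all)."""
    i = bisect.bisect_right(sorted_locations, bookmark)
    return sorted_locations[max(0, i - 1)]

def closest_location(sorted_locations, bookmark):
    """The mapped text location nearest the bookmark, on either side."""
    return min(sorted_locations, key=lambda loc: abs(loc - bookmark))

locations = [100, 450, 520, 900]  # text locations as flat word offsets
before = preceding_location(locations, 500)  # the "immediately before" rule
nearest = closest_location(locations, 500)   # the "closest" rule
```

Note the two rules can disagree: for a bookmark at offset 500, the preceding record is 450 while the closest record is 520.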
[0103] At step 508, once the particular text location in the
mapping is identified, intermediary device 420 determines a
particular audio location, in mapping 406, that corresponds to the
particular text location.
[0104] At step 510, intermediary device 420 sends the particular
audio location to device 430, which, in this example, is different
than device 410. For example, device 410 may be a tablet computer
and the device 430 may be a smart phone. In a related embodiment,
device 430 is not involved. Thus, intermediary device 420 may send
the particular audio location to device 410.
[0105] Step 510 may be performed automatically, i.e., in response
to intermediary device 420 determining the particular audio
location. Alternatively, step 510 (or step 506) may be performed in
response to receiving, from device 430, an indication that device
430 is about to process digital media item 404. The indication may
be a request for an audio location that corresponds to the text
bookmark.
[0106] At step 512, audio media player 432 establishes the
particular audio location as a current playback position of the
audio data in digital media item 404. This establishment may be
performed in response to receiving the particular audio location
from intermediary device 420. Because the current playback position
becomes the particular audio location, audio media player 432 is
not required to play any of the audio that precedes the particular
audio location in the audio data. For example, if the particular
audio location indicates 2:56:03 (2 hours, 56 minutes, and 3
seconds), then audio media player 432 establishes that time in the
audio data as the current playback position. Thus, if the user of
device 430 selects a "play" button (whether graphical or physical)
on device 430, then audio media player 432 begins processing the
audio data at that 2:56:03 mark.
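The playback-position arithmetic in this example is simple to make explicit. Representing the audio location as an "H:MM:SS" string is an assumption for this sketch; a mapping could just as well store raw seconds or byte offsets.

```python
# Converting an "H:MM:SS" audio location (e.g., 2:56:03) into the seconds
# offset that an audio player would seek to.

def to_seconds(timestamp):
    """Parse 'H:MM:SS' into a total number of seconds."""
    hours, minutes, seconds = (int(part) for part in timestamp.split(":"))
    return hours * 3600 + minutes * 60 + seconds

playback_position = to_seconds("2:56:03")  # 2 hours, 56 minutes, 3 seconds
```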
[0107] In an alternative embodiment, device 410 stores mapping 406
(or a copy thereof). Therefore, in place of steps 504-508, text
media player 412 inspects mapping 406 to determine a particular
text location, of the multiple text locations, that corresponds to
the text bookmark. Then, text media player 412 determines a
particular audio location, in mapping 406, that corresponds to the
particular text location. The text media player 412 may then cause
the particular audio location to be sent to intermediary device 420
to allow device 430 to retrieve the particular audio location and
establish a current playback position in the audio data to be the
particular audio location. Text media player 412 may also cause the
particular text location (or text bookmark) to be sent to
intermediary device 420 to allow device 410 (or another device, not
shown) to later retrieve the particular text location to allow
another text media player executing on the other device to display
a portion (e.g., a page) of another copy of digital media item 402,
where the portion corresponds to the particular text location.
[0108] In another alternative embodiment, intermediary device 420
and device 430 are not involved. Thus, steps 504 and 510 are not
performed, and device 410 performs all other steps in FIG. 5A,
including steps 506 and 508.
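The text-to-audio lookup described in steps 506-508 and 512 can be sketched as follows. The mapping record structure, the field names, and the sample offsets are illustrative assumptions for this sketch, not the patent's actual data format.

```python
# Each mapping record pairs a text location (chapter, paragraph,
# sentence) with an audio offset in seconds; the structure is an
# illustrative assumption.
MAPPING = [
    {"text": (1, 1, 1), "audio": 0},
    {"text": (3, 2, 4), "audio": 10563},  # the 2:56:03 mark
    {"text": (4, 1, 1), "audio": 12000},
]

def audio_for_text_bookmark(bookmark, mapping):
    """Steps 506-508: find the mapping record whose text location is
    closest at or before the bookmark and return its audio offset."""
    prior = [r for r in mapping if r["text"] <= bookmark]
    record = max(prior, key=lambda r: r["text"]) if prior else mapping[0]
    return record["audio"]

# A text bookmark at chapter 3, paragraph 2, sentence 4 maps to the
# audio offset that becomes the current playback position (step 512).
playback_position = audio_for_text_bookmark((3, 2, 4), MAPPING)
```

Under this sketch, the returned offset is simply handed to the audio media player as its new current playback position.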
Audio-to-Text Bookmark Switching
[0109] FIG. 5B is a flow diagram that depicts a process 550 for
audio-to-text bookmark switching, according to an embodiment of the
invention.
Similarly to FIG. 5A, FIG. 5B is described using elements of system
400 depicted in FIG. 4.
[0110] At step 552, audio media player 432 determines an audio
bookmark within digital media item 404 (e.g., an audio book).
[0111] The audio bookmark may be determined in response to input
from the user. For example, the user may stop the playback of the
audio data, for example, by selecting a "stop" button that is
displayed on a touch screen of device 430. Audio media player 432
determines the location within audio data of digital media item 404
that corresponds to where playback stopped. Thus, the audio
bookmark may simply be the last place where the user stopped
listening to the audio generated from digital media item 404.
Additionally or alternatively, the user may select one or more
graphical buttons on the touch screen of device 430 to establish a
particular location within digital media item 404 as the audio
bookmark. For example, device 430 displays a timeline that
corresponds to the length of the audio data in digital media item
404. The user may select a position on the timeline and then
provide one or more additional inputs that are used by audio media
player 432 to establish the audio bookmark.
[0112] At step 554, device 430 sends, over network 440 to
intermediary device 420, data that indicates the audio bookmark.
The intermediary device 420 may store the audio bookmark in
association with device 430 and/or an account of the user of device
430. Prior to step 552, the user established an account with an
operator of intermediary device 420. The user then registered one
or more devices, including device 430, with the operator. The
registration caused each of the one or more devices to be
associated with the user's account.
[0113] Intermediary device 420 also has access to (e.g., stores)
mapping 406. Mapping 406 maps multiple audio locations in the audio
data of digital media item 404 with multiple text locations within
text data of digital media item 402.
[0114] One or more factors may cause audio media player 432 to send
the audio bookmark to intermediary device 420. Such factors may
include the exiting (or closing down) of audio media player 432,
the establishment of the audio bookmark by the user, or an explicit
instruction by the user to save the audio bookmark for use when
displaying portions of the textual version of the work (reflected
in digital media item 402) that corresponds to digital media item
404, for which the audio bookmark is established.
[0115] At step 556, intermediary device 420 inspects mapping 406 to
determine a particular audio location, of the multiple audio
locations, that corresponds to the audio bookmark. The audio
bookmark may not exactly match any of the multiple audio locations
in mapping 406. However, intermediary device 420 may select the
audio location that is closest to the audio bookmark.
Alternatively, intermediary device 420 may select the audio
location that is immediately before the audio bookmark, which audio
location may or may not be the closest audio location to the audio
bookmark. For example, if the audio bookmark indicates 02:43:19 (or
2 hours, 43 minutes, and 19 seconds) and the closest audio
locations in mapping 406 are (1) 02:41:07 and (2) 02:43:56, then
audio location (1) is selected, even though audio location (2) is
closer to the audio bookmark.
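The closest-prior selection described above can be sketched as a search over sorted mapping offsets. The helper names and the list-of-seconds representation are illustrative assumptions.

```python
import bisect

def hms(h, m, s):
    """Convert an hours/minutes/seconds timestamp to seconds."""
    return h * 3600 + m * 60 + s

def closest_prior_audio_location(bookmark, locations):
    """Select the mapping audio location at or immediately before the
    bookmark; fall back to the first location if none precedes it.
    `locations` must be a sorted list of offsets in seconds."""
    i = bisect.bisect_right(locations, bookmark)
    return locations[max(i - 1, 0)]

# Bookmark at 02:43:19 with mapping entries at 02:41:07 and 02:43:56:
# the earlier entry is selected even though the later one is nearer.
locations = [hms(2, 41, 7), hms(2, 43, 56)]
chosen = closest_prior_audio_location(hms(2, 43, 19), locations)
```

Selecting the entry immediately before the bookmark, rather than the nearest one, guarantees the reader never skips past content they have not yet heard.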
[0116] At step 558, once the particular audio location in the
mapping is identified, intermediary device 420 determines a
particular text location, in mapping 406, that corresponds to the
particular audio location.
[0117] At step 560, intermediary device 420 sends the particular
text location to device 410, which, in this example, is different
than device 430. For example, device 410 may be a tablet computer
and device 430 may be a smart phone that is configured to process
audio data and generate audible sounds.
[0118] Step 560 may be performed automatically, i.e., in response
to intermediary device 420 determining the particular text
location. Alternatively, step 560 (or step 556) may be performed in
response to receiving, from device 410, an indication that device
410 is about to process the digital media item 402. The indication
may be a request for a text location that corresponds to the audio
bookmark.
[0119] At step 562, text media player 412 displays information
about the particular text location. Step 562 may be performed in
response to receiving the particular text location from
intermediary device 420. Device 410 is not required to display any
of the content that precedes the particular text location in the
textual version of the work reflected in digital media item 402.
For example, if the particular text location indicates Chapter 3,
paragraph 2, sentence 4, then device 410 displays a page that
includes that sentence. Text media player 412 may cause a marker to
be displayed at the particular text location in the page that
visually indicates, to a user of device 410, where to begin reading
in the page. Thus, the user is able to immediately read the textual
version of the work beginning at a location that corresponds to the
last words spoken by a narrator in the audio book.
[0120] In an alternative embodiment, the device 410 stores mapping
406. Therefore, in place of steps 556-560, after step 554 (wherein
the device 430 sends data that indicates the audio bookmark to
intermediary device 420), intermediary device 420 sends the audio
bookmark to device 410. Then, text media player 412 inspects
mapping 406 to determine a particular audio location, of the
multiple audio locations, that corresponds to the audio bookmark.
Then, text media player 412 determines a particular text location,
in mapping 406, that corresponds to the particular audio location.
This alternative process then proceeds to step 562, described
above.
[0121] In another alternative embodiment, intermediary device 420
is not involved. Thus, steps 554 and 560 are not performed, and
device 430 performs all other steps in FIG. 5B, including steps 556
and 558.
Highlight Text in Response to Playing Audio
[0122] In an embodiment, text from a portion of a textual version
of a work is highlighted or "lit up" while audio data that
corresponds to the textual version of the work is played. As noted
previously, the audio data is an audio version of a textual version
of the work and may reflect a reading, of text from the textual
version, by a human user. As used herein, "highlighting" text
refers to a media player (e.g., an "e-reader") visually
distinguishing that text from other text that is concurrently
displayed with the highlighted text. Highlighting text may involve
changing the font of the text, changing the font style of the text
(e.g., italicize, bold, underline), changing the size of the text,
changing the color of the text, changing the background color of
the text, or creating an animation associated with the text. An
example of creating an animation is causing the text (or background
of the text) to blink on and off or to change colors. Another
example of creating an animation is creating a graphic to appear
above, below, or around the text. For example, in response to the
word "toaster" being played and detected by a media player, the
media player displays a toaster image above the word "toaster" in
the displayed text. Another example of an animation is a bouncing
ball that "bounces" on a portion of text (e.g., word, syllable, or
letter) when that portion is detected in audio data that is
played.
[0123] FIG. 6 is a flow diagram that depicts a process 600 for
causing text, from a textual version of a work, to be highlighted
while an audio version of the work is being played, according to an
embodiment of the invention.
[0124] At step 610, the current playback position (which is
constantly changing) of audio data of the audio version is
determined. This step may be performed by a media player executing
on a user's device. The media player processes the audio data to
generate audio for the user.
[0125] At step 620, based on the current playback position, a
mapping record in a mapping is identified. The current playback
position may match or nearly match the audio location identified in
the mapping record.
[0126] Step 620 may be performed by the media player if the media
player has access to a mapping that maps multiple audio locations
in the audio data with multiple text locations in the textual
version of the work. Alternatively, step 620 may be performed by
another process executing on the user's device or by a server that
receives the current playback position from the user's device over
a network.
[0127] At step 630, the text location identified in the mapping
record is identified.
[0128] At step 640, a portion of the textual version of the work
that corresponds to the text location is caused to be highlighted.
This step may be performed by the media player or another software
application executing on the user's device. If a server performs
the look-up steps (620 and 630), then step 640 may further involve
the server sending the text location to the user's device. In
response, the media player, or another software application,
accepts the text location as input and causes the corresponding
text to be highlighted.
[0129] In an embodiment, different text locations that are
identified, by the media player, in the mapping are associated with
different types of highlighting. For example, one text location in
the mapping may be associated with the changing of the font color
from black to red while another text location in the mapping may be
associated with an animation, such as a toaster graphic that shows
a piece of toast "popping" out of a toaster. Therefore, each mapping
record in the mapping may include "highlighting data" that
indicates how the text identified by the corresponding text
location is to be highlighted. Thus, for each mapping record in the
mapping that the media player identifies and that includes
highlighting data, the media player uses the highlighting data to
determine how to highlight the text. If a mapping record does not
include highlighting data, then the media player may not highlight
the corresponding text. Alternatively, if a mapping record in the
mapping does not include highlighting data, then the media player
may use a "default" highlight technique (e.g., bolding the text) to
highlight the text.
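The per-record highlighting behavior in paragraph [0129] amounts to a lookup with a default. The record field names and style strings below are illustrative assumptions.

```python
def highlight_style(mapping_record, default="bold"):
    """Return the highlight technique for a mapping record: use its
    "highlighting data" when present, otherwise fall back to a
    default technique such as bolding."""
    return mapping_record.get("highlighting", default)

records = [
    {"text_loc": 10, "audio_loc": 5.0, "highlighting": "font-color:red"},
    {"text_loc": 20, "audio_loc": 9.5},  # no highlighting data
]
styles = [highlight_style(r) for r in records]
# -> ["font-color:red", "bold"]
```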
Highlighting Text Based on Audio Input
[0130] FIG. 7 is a flow diagram that depicts a process 700 of
highlighting displayed text in response to audio input from a user,
according to an embodiment of the invention. In this embodiment, a
mapping is not required. The audio input is used to highlight text
in a portion of a textual version of a work that is concurrently
displayed to the user.
[0131] At step 710, audio input is received. The audio input may be
based on a user reading aloud text from a textual version of a
work. The audio input may be received by a device that displays a
portion of the textual version. The device may prompt the user to
read aloud a word, phrase, or entire sentence. The prompt may be
visual or audio. As an example of a visual prompt, the device may
cause the following text to be displayed: "Please read the
underlined text" while or immediately before the device displays a
sentence that is underlined. As an example of an audio prompt, the
device may cause a computer-generated voice to read "Please read
the underlined text" or cause a pre-recorded human voice to be
played, where the pre-recorded human voice provides the same
instruction.
[0132] At step 720, a speech-to-text analysis is performed on the
audio input to detect one or more words reflected in the audio
input.
[0133] At step 730, for each detected word reflected in the audio
input, that detected word is compared to a particular set of words.
The particular set of words may be all the words that are currently
displayed by a computing device (e.g., an e-reader). Alternatively,
the particular set of words may be all the words that the user was
prompted to read.
[0134] At step 740, for each detected word that matches a word in
the particular set, the device causes that matching word to be
highlighted.
[0135] The steps depicted in process 700 may be performed by a
single computing device that displays text from a textual version
of a work. Alternatively, the steps depicted in process 700 may be
performed by one or more computing devices that are different than
the computing device that displays text from the textual version.
For example, the audio input from a user in step 710 may be sent
from the user's device over a network to a network server that
performs the speech-to-text analysis. The network server may then
send highlight data to the user's device to cause the user's device
to highlight the appropriate text.
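The comparison in steps 730-740 can be sketched as a set-membership test against the displayed words. The tokenization and function name are illustrative assumptions.

```python
import re

def words_to_highlight(detected_words, displayed_text):
    """Steps 730-740: compare each word detected by speech-to-text
    against the set of currently displayed words; only the matching
    words are highlighted."""
    displayed = set(re.findall(r"[a-z']+", displayed_text.lower()))
    return [w for w in detected_words if w.lower() in displayed]

page = "Please read the underlined text"
detected = ["please", "reed", "the", "text"]  # "reed" is a misrecognition
matches = words_to_highlight(detected, page)
# -> ["please", "the", "text"]; the misrecognized word is not highlighted
```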
Playing Audio in Response to Text Selection
[0136] In an embodiment, a user of a media player that displays
portions of a textual version of a work may select portions of
displayed text and cause the corresponding audio to be played. For
example, if a displayed word from the digital book is "donut" and
the user selects that word (e.g., by touching a portion of the
media player's touch screen that displays that word), then the
audio of "donut" may be played.
[0137] A mapping that maps text locations in a textual version of
the work with audio locations in audio data is used to identify the
portion of the audio data that corresponds to the selected text.
The user may select a single word, a phrase, or even one or more
sentences. In response to input that selects a portion of the
displayed text, the media player may identify one or more text
locations. For example, the media player may identify a single text
location that corresponds to the selected portion, even if the
selected portion comprises multiple lines or sentences. The
identified text location may correspond to the beginning of the
selected portion. As another example, the media player may identify
a first text location that corresponds to the beginning of the
selected portion and a second text location that corresponds to the
ending of the selected portion.
[0138] The media player uses the identified text location to look
up a mapping record in the mapping that indicates a text location
that is closest (or closest prior) to the identified text location.
The media player uses the audio location indicated in the mapping
record to identify where, in the audio data, to begin processing
the audio data in order to generate audio. If only a single text
location is identified, then only the word or sounds at or near the
audio location may be played. Thus, after the word or sounds are
played, the media player ceases to play any more audio.
Alternatively, the media player begins playing at or near the audio
location and does not cease playing the audio that follows the
audio location until (a) the end of the audio data is reached, (b)
further input from the user (e.g., selection of a "stop" button),
or (c) a pre-designated stopping point in the audio data (e.g., end
of a page or chapter that requires further input to proceed).
[0139] If the media player identifies two text locations based on
the selected portion, then two audio locations are identified and
may be used to identify where to begin playing and where to stop
playing the corresponding audio.
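The two-location case in paragraphs [0138]-[0139] can be sketched as a pair of closest-prior lookups. The flat (text location, audio offset) structure is an illustrative assumption.

```python
def audio_span_for_selection(start_loc, end_loc, mapping):
    """Map the start and end text locations of a selected passage to an
    audio start/stop pair, using the closest-prior mapping record for
    each. `mapping` is a list of (text_location, audio_offset) pairs
    sorted by text location."""
    def closest_prior(loc):
        prior = [(t, a) for t, a in mapping if t <= loc]
        return (prior[-1] if prior else mapping[0])[1]
    return closest_prior(start_loc), closest_prior(end_loc)

mapping = [(0, 0.0), (100, 12.5), (200, 27.0)]
start, stop = audio_span_for_selection(105, 210, mapping)
# -> play the audio from 12.5 s and stop at 27.0 s
```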
[0140] In an embodiment, the audio data identified by the audio
location may be played slowly (i.e., at a slow playback speed) or
continuously without advancing the current playback position in the
audio data. For example, if a user of a tablet computer selects the
displayed word "two" by touching a touch screen of the tablet
computer with his finger and continuously touches the displayed
word (i.e., without lifting his finger and without moving his
finger to another displayed word), then the tablet computer plays
the corresponding audio continuously, creating a sound as if the
reader were pronouncing the word as "twoooooooooooooooo".
[0141] In a similar embodiment, the speed at which a user drags her
finger across displayed text on a touch screen of a media player
causes the corresponding audio to be played at the same or similar
speed. For example, a user selects the letter "d" of the displayed
word "donut" and then slowly moves his finger across the displayed
word. In response to this input, the media player identifies the
corresponding audio data (using the mapping) and plays the
corresponding audio at the same speed at which the user moves his
finger. Therefore, the media player creates audio that sounds as if
the reader of the text of the textual version of the work
pronounced the word "donut" as "dooooooonnnnnnuuuuuut."
[0142] In a similar embodiment, the time that a user "touches" a
word displayed on a touch screen dictates how quickly or slowly the
audio version of the word is played. For example, a quick tap of a
displayed word by the user's finger causes the corresponding audio
to be played at a normal speed, whereas the user holding down his
finger on the selected word for more than 1 second causes the
corresponding audio to be played at 1/2 the normal speed.
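The touch-duration behavior in paragraph [0142] reduces to a simple threshold rule. The function name is an assumption; the 1-second threshold and half-speed rate are the example values from the text.

```python
def playback_rate(touch_duration, threshold=1.0, slow_rate=0.5):
    """Choose a playback speed from how long the user holds a displayed
    word: a quick tap plays at normal speed, while holding longer than
    the threshold plays at the slower rate."""
    return slow_rate if touch_duration > threshold else 1.0

rates = [playback_rate(0.2), playback_rate(1.5)]
# -> [1.0, 0.5]
```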
Transferring User Annotations
[0143] In an embodiment, a user initiates the creation of
annotations to one media version (e.g., audio) of a digital work
and causes the annotations to be associated with another media
version (e.g., text) of the digital work. Thus, while an annotation
may be created in the context of one type of media, the annotation
may be consumed in the context of another type of media. The
"context" in which an annotation is created or consumed refers to
whether text is being displayed or audio is being played when the
creation or consumption occurs.
[0144] Although the following examples involve determining an audio
or text location when an annotation is
created, some embodiments of the invention are not so limited. For
example, the current playback position within an audio file when an
annotation is created in the audio context is not used when
consuming the annotation in the text context. Instead, an
indication of the annotation may be displayed, by a device, at the
beginning or the end of the corresponding textual version or on
each "page" of the corresponding textual version. As another
example, the text that is displayed when an annotation is created
in the text context is not used when consuming the annotation in
the audio context. Instead, an indication of the annotation may be
displayed, by a device, at the beginning or end of the
corresponding audio version or continuously while the corresponding
audio version is being played. Additionally or alternatively to a
visual indication, an audio indication of the annotation may be
played. For example, a "beep" is played simultaneously with the
audio track in such a way that both the beep and the audio track
can be heard.
[0145] FIGS. 8A-B are flow diagrams that depict processes for
transferring an annotation from one context to another, according
to an embodiment of the invention. Specifically, FIG. 8A is a flow
diagram that depicts a process 800 for creating an annotation in the
"text" context and consuming the annotation in the "audio" context,
while FIG. 8B is a flow diagram that depicts a process 850 for
creating an annotation in the "audio" context and consuming the
annotation in the "text" context. The creation and consumption of
an annotation may occur on the same computing device (e.g., device
410) or on separate computing devices (e.g., devices 410 and 430).
FIG. 8A describes a scenario where the annotation is created and
consumed on device 410, while FIG. 8B describes a scenario where the
annotation is created on device 430 and later consumed on device
410.
[0146] At step 802 in FIG. 8A, text media player 412, executing on
device 410, causes text (e.g., in the form of a page) from digital
media item 402 to be displayed.
[0147] At step 804, text media player 412 determines a text
location within a textual version of the work reflected in digital
media item 402. The text location is eventually stored in
association with an annotation. The text location may be determined
in a number of ways. For example, text media player 412 may receive
input that selects the text location within the displayed text. The
input may be a user touching a touch screen (that displays the
text) of device 410 for a period of time. The input may select a
specific word, a number of words, the beginning or ending of a
page, before or after a sentence, etc. The input may also include
first selecting a button, which causes text media player 412 to
change to a "create annotation" mode where an annotation may be
created and associated with the text location.
[0148] As another example of determining a text location, text
media player 412 determines the text location automatically
(without user input) based on which portion of the textual version
of the work (reflected in digital media item 402) is being
displayed. For example, if device 410 is displaying page 20 of the
textual version of the work, then the annotation will be associated
with page 20.
[0149] At step 806, text media player 412 receives input that
selects a "Create Annotation" button that may be displayed on the
touch screen. Such a button may be displayed in response to input
in step 804 that selects the text location, where, for example, the
user touches the touch screen for a period of time, such as one
second.
[0150] Although step 804 is depicted as occurring before step 806,
alternatively, the selection of the "Create Annotation" button may
occur prior to the determination of the text location.
[0151] At step 808, text media player 412 receives input that is
used to create annotation data. The input may be voice data (such
as the user speaking into a microphone of device 410) or text data
(such as the user selecting keys on a keyboard, whether physical or
graphical). If the annotation data is voice data, text media player
412 (or another process) may perform speech-to-text analysis on the
voice data to create a textual version of the voice data.
[0152] At step 810, text media player 412 stores the annotation
data in association with the text location. Text media player 412
uses a mapping (e.g., a copy of mapping 406) to identify a
particular text location, in the mapping, that is closest to the
text location. Then, using the mapping, text media player 412
identifies an audio location that corresponds to the particular
text location.
[0153] Alternatively to step 810, text media player 412 sends, over
network 440 to intermediary device 420, the annotation data and the
text location. In response, intermediary device 420 stores the
annotation data in association with the text location. Intermediary
device 420 uses a mapping (e.g., mapping 406) to identify a
particular text location, in mapping 406, that is closest to the
text location. Then, using mapping 406, intermediary device 420
identifies an audio location that corresponds to the particular
text location. Intermediary device 420 sends the identified audio
location over network 440 to device 410. Intermediary device 420
may send the identified audio location in response to a request,
from device 410, for certain audio data and/or for annotations
associated with certain audio data. For example, in response to a
request for an audio book version of "A Tale of Two Cities",
intermediary device 420 determines whether there is any annotation
data associated with that audio book and, if so, sends the
annotation data to device 410.
[0154] Step 810 may also comprise storing date and/or time
information that indicates when the annotation was created. This
information may be displayed later when the annotation is consumed
in the audio context.
[0155] At step 812, audio media player 414 plays audio by
processing audio data of digital media item 404, which, in this
example (although not shown), may be stored on device 410 or may be
streamed to device 410 from intermediary device 420 over network
440.
[0156] At step 814, audio media player 414 determines when the
current playback position in the audio data matches or nearly
matches the audio location identified in step 810 using mapping
406. Alternatively, audio media player 414 may cause data that
indicates that an annotation is available to be displayed,
regardless of where the current playback position is located and
without having to play any audio, as indicated in step 812. In this
alternative, step 812 is unnecessary. For example, a user may
launch audio media player 414 and cause audio media player 414 to
load the audio data of digital media item 404. Audio media player
414 determines that annotation data is associated with the audio
data. Audio media player 414 causes information about the audio
data (e.g., title, artist, genre, length, etc.) to be displayed
without generating any audio associated with the audio data. The
information may include a reference to the annotation data and
information about a location within the audio data that is
associated with the annotation data, where the location corresponds
to the audio location identified in step 810.
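The match test in step 814 can be sketched as a tolerance comparison between the current playback position and the annotation's audio location. The half-second tolerance window is an illustrative assumption.

```python
def annotation_due(current_pos, annotation_pos, tolerance=0.5):
    """Step 814: report whether the current playback position matches,
    or nearly matches, the audio location stored with an annotation.
    Positions are offsets in seconds."""
    return abs(current_pos - annotation_pos) <= tolerance

due = [annotation_due(t, 120.0) for t in (118.0, 119.8, 120.3)]
# -> [False, True, True]
```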
[0157] At step 816, audio media player 414 consumes the annotation
data. If the annotation data is voice data, then consuming the
annotation data may involve processing the voice data to generate
audio or converting the voice data to text data and displaying the
text data. If the annotation data is text data, then consuming the
annotation data may involve displaying the text data, for example,
in a side panel of a GUI that displays attributes of the audio data
that is played or in a new window that appears separate from the
GUI. Non-limiting examples of attributes include time length of the
audio data, the current playback position, which may indicate an
absolute location within the audio data (e.g., a time offset) or a
relative position within the audio data (e.g., chapter or section
number), a waveform of the audio data, and title of the digital
work.
[0158] FIG. 8B describes a scenario, as noted previously, where an
annotation is created on device 430 and later consumed on device
410.
[0159] At step 852, audio media player 432 processes audio data
from digital media item 404 to play audio.
[0160] At step 854, audio media player 432 determines an audio
location within the audio data. The audio location is eventually
stored in association with an annotation. The audio location may be
determined in a number of ways. For example, audio media player 432
may receive input that selects the audio location within the audio
data. The input may be a user touching a touch screen (that
displays attributes of the audio data) of device 430 for a period
of time. The input may select an absolute position within a
timeline that reflects the length of the audio data or a relative
position within the audio data, such as a chapter number and a
paragraph number. The input may also comprise first selecting a
button, which causes audio media player 432 to change to a "create
annotation" mode where an annotation may be created and associated
with the audio location.
[0161] As another example of determining an audio location, audio
media player 432 determines the audio location automatically
(without user input) based on which portion of the audio data is
being processed. For example, if audio media player 432 is
processing a portion of the audio data that corresponds to chapter
20 of a digital work reflected in digital media item 404, then
audio media player 432 determines that the audio location is at
least somewhere within chapter 20.
[0162] At step 856, audio media player 432 receives input that
selects a "Create Annotation" button that may be displayed on the
touch screen of device 430. Such a button may be displayed in
response to input in step 854 that selects the audio location,
where, for example, the user touches the touch screen continuously
for a period of time, such as one second.
[0163] Although step 854 is depicted as occurring before step 856,
alternatively, the selection of the "Create Annotation" button may
occur prior to the determination of the audio location.
[0164] At step 858, the first media player receives input that is
used to create annotation data, similar to step 808.
[0165] At step 860, audio media player 432 stores the annotation
data in association with the audio location. Audio media player 432
uses a mapping (e.g., mapping 406) to identify a particular audio
location, in the mapping, that is closest to the audio location
determined in step 854. Then, using the mapping, audio media player
432 identifies a text location that corresponds to the particular
audio location.
[0166] Alternatively to step 860, audio media player 432 sends,
over network 440 to intermediary device 420, the annotation data
and the audio location. In response, intermediary device 420 stores
the annotation data in association with the audio location.
Intermediary device 420 uses mapping 406 to identify a particular
audio location, in the mapping, that is closest to the audio
location determined in step 854. Then, using mapping 406,
intermediary device 420 identifies a text location that corresponds
to the particular audio location. Intermediary device 420 sends the
identified text location over network 440 to device 410.
Intermediary device 420 may send the identified text location in
response to a request, from device 410, for certain text data
and/or for annotations associated with certain text data. For
example, in response to a request for a digital book of "The Grapes
of Wrath", intermediary device 420 determines whether there is any
annotation data associated with that digital book and, if so, sends
the annotation data to device 410.
[0167] Step 860 may also comprise storing date and/or time
information that indicates when the annotation was created. This
information may be displayed later when the annotation is consumed
in the text context.
[0168] At step 862, device 410 displays text data associated with
digital media item 402, which is a textual version of digital media
item 404. Device 410 displays the text data of digital media item
402 based on a locally-stored copy of digital media item 402 or, if
a locally-stored copy does not exist, may display the text data
while the text data is streamed from intermediary device 420.
[0169] At step 864, device 410 determines when a portion of the
textual version of the work (reflected in digital media item 402)
that includes the text location (identified in step 860) is
displayed. Alternatively, device 410 may display data that
indicates that an annotation is available regardless of what
portion of the textual version of the work, if any, is
displayed.
[0170] At step 866, text media player 412 consumes the annotation
data. If the annotation data is voice data, then consuming the
annotation data may comprise playing the voice data or converting
the voice data to text data and displaying the text data. If the
annotation data is text data, then consuming the annotation data
may comprise displaying the text data, for example, in a side
panel of a GUI that displays a portion of the textual version of
the work or in a new window that appears separate from the GUI.
Hardware Overview
[0171] According to one embodiment, the techniques described herein
are implemented by one or more special-purpose computing devices.
The special-purpose computing devices may be hard-wired to perform
the techniques, or may include digital electronic devices such as
one or more application-specific integrated circuits (ASICs) or
field programmable gate arrays (FPGAs) that are persistently
programmed to perform the techniques, or may include one or more
general purpose hardware processors programmed to perform the
techniques pursuant to program instructions in firmware, memory,
other storage, or a combination. Such special-purpose computing
devices may also combine custom hard-wired logic, ASICs, or FPGAs
with custom programming to accomplish the techniques. The
special-purpose computing devices may be desktop computer systems,
portable computer systems, handheld devices, networking devices or
any other device that incorporates hard-wired and/or program logic
to implement the techniques.
[0172] For example, FIG. 9 is a block diagram that illustrates a
computer system 900 upon which an embodiment of the invention may
be implemented. Computer system 900 includes a bus 902 or other
communication mechanism for communicating information, and a
hardware processor 904 coupled with bus 902 for processing
information. Hardware processor 904 may be, for example, a general
purpose microprocessor.
[0173] Computer system 900 also includes a main memory 906, such as
a random access memory (RAM) or other dynamic storage device,
coupled to bus 902 for storing information and instructions to be
executed by processor 904. Main memory 906 also may be used for
storing temporary variables or other intermediate information
during execution of instructions to be executed by processor 904.
Such instructions, when stored in non-transitory storage media
accessible to processor 904, render computer system 900 into a
special-purpose machine that is customized to perform the
operations specified in the instructions.
[0174] Computer system 900 further includes a read only memory
(ROM) 908 or other static storage device coupled to bus 902 for
storing static information and instructions for processor 904. A
storage device 910, such as a magnetic disk or optical disk, is
provided and coupled to bus 902 for storing information and
instructions.
[0175] Computer system 900 may be coupled via bus 902 to a display
912, such as a cathode ray tube (CRT), for displaying information
to a computer user. An input device 914, including alphanumeric and
other keys, is coupled to bus 902 for communicating information and
command selections to processor 904. Another type of user input
device is cursor control 916, such as a mouse, a trackball, or
cursor direction keys for communicating direction information and
command selections to processor 904 and for controlling cursor
movement on display 912. This input device typically has two
degrees of freedom in two axes, a first axis (e.g., x) and a second
axis (e.g., y), that allow the device to specify positions in a
plane.
[0176] Computer system 900 may implement the techniques described
herein using customized hard-wired logic, one or more ASICs or
FPGAs, firmware and/or program logic which in combination with the
computer system causes or programs computer system 900 to be a
special-purpose machine. According to one embodiment, the
techniques herein are performed by computer system 900 in response
to processor 904 executing one or more sequences of one or more
instructions contained in main memory 906. Such instructions may be
read into main memory 906 from another storage medium, such as
storage device 910. Execution of the sequences of instructions
contained in main memory 906 causes processor 904 to perform the
process steps described herein. In alternative embodiments,
hard-wired circuitry may be used in place of or in combination with
software instructions.
[0177] The term "storage media" as used herein refers to any
non-transitory media that store data and/or instructions that cause
a machine to operate in a specific fashion. Such storage media
may comprise non-volatile media and/or volatile media. Non-volatile
media includes, for example, optical or magnetic disks, such as
storage device 910. Volatile media includes dynamic memory, such as
main memory 906. Common forms of storage media include, for
example, a floppy disk, a flexible disk, hard disk, solid state
drive, magnetic tape, or any other magnetic data storage medium, a
CD-ROM, any other optical data storage medium, any physical medium
with patterns of holes, a RAM, a PROM, an EPROM, a FLASH-EPROM,
NVRAM, any other memory chip or cartridge.
[0178] Storage media is distinct from but may be used in
conjunction with transmission media. Transmission media
participates in transferring information between storage media. For
example, transmission media includes coaxial cables, copper wire
and fiber optics, including the wires that comprise bus 902.
Transmission media can also take the form of acoustic or light
waves, such as those generated during radio-wave and infra-red data
communications.
[0179] Various forms of media may be involved in carrying one or
more sequences of one or more instructions to processor 904 for
execution. For example, the instructions may initially be carried
on a magnetic disk or solid state drive of a remote computer. The
remote computer can load the instructions into its dynamic memory
and send the instructions over a telephone line using a modem. A
modem local to computer system 900 can receive the data on the
telephone line and use an infra-red transmitter to convert the data
to an infra-red signal. An infra-red detector can receive the data
carried in the infra-red signal and appropriate circuitry can place
the data on bus 902. Bus 902 carries the data to main memory 906,
from which processor 904 retrieves and executes the instructions.
The instructions received by main memory 906 may optionally be
stored on storage device 910 either before or after execution by
processor 904.
[0180] Computer system 900 also includes a communication interface
918 coupled to bus 902. Communication interface 918 provides a
two-way data communication coupling to a network link 920 that is
connected to a local network 922. For example, communication
interface 918 may be an integrated services digital network (ISDN)
card, cable modem, satellite modem, or a modem to provide a data
communication connection to a corresponding type of telephone line.
As another example, communication interface 918 may be a local area
network (LAN) card to provide a data communication connection to a
compatible LAN. Wireless links may also be implemented. In any such
implementation, communication interface 918 sends and receives
electrical, electromagnetic or optical signals that carry digital
data streams representing various types of information.
[0181] Network link 920 typically provides data communication
through one or more networks to other data devices. For example,
network link 920 may provide a connection through local network 922
to a host computer 924 or to data equipment operated by an Internet
Service Provider (ISP) 926. ISP 926 in turn provides data
communication services through the world wide packet data
communication network now commonly referred to as the "Internet"
928. Local network 922 and Internet 928 both use electrical,
electromagnetic or optical signals that carry digital data streams.
The signals through the various networks and the signals on network
link 920 and through communication interface 918, which carry the
digital data to and from computer system 900, are example forms of
transmission media.
[0182] Computer system 900 can send messages and receive data,
including program code, through the network(s), network link 920
and communication interface 918. In the Internet example, a server
930 might transmit a requested code for an application program
through Internet 928, ISP 926, local network 922 and communication
interface 918.
[0183] The received code may be executed by processor 904 as it is
received, and/or stored in storage device 910, or other
non-volatile storage for later execution.
[0184] In the foregoing specification, embodiments of the invention
have been described with reference to numerous specific details
that may vary from implementation to implementation. The
specification and drawings are, accordingly, to be regarded in an
illustrative rather than a restrictive sense. The sole and
exclusive indicator of the scope of the invention, and what is
intended by the applicants to be the scope of the invention, is the
literal and equivalent scope of the set of claims that issue from
this application, in the specific form in which such claims issue,
including any subsequent correction.
* * * * *