U.S. patent application number 15/437297 was filed with the patent office on 2017-02-20 and published on 2017-06-08 for system and method for linguistic term differentiation.
The applicant listed for this patent is Athena Ann Smyros, Constantine Smyros. Invention is credited to Athena Ann Smyros, Constantine Smyros.
Application Number | 20170161257 15/437297
Family ID | 58017628
Publication Date | 2017-06-08
United States Patent Application | 20170161257
Kind Code | A1
Smyros; Athena Ann; et al. | June 8, 2017
SYSTEM AND METHOD FOR LINGUISTIC TERM DIFFERENTIATION
Abstract
A representative system and method for linguistic
differentiation comprises a computing device: receiving input data
from a requestor; generating a plurality of term units from the
input data, where the plurality of term units comprise a first
number of term units; identifying a plurality of differentiable
terms of the plurality of term units, where the plurality of
differentiable terms comprise a second number of term units;
determining a differentiability score for each term unit of the
plurality of term units; determining an input data score for the
input data by evaluating a ratio of the second number of term units
to the first number of term units; and transmitting a plurality of
differentiable term units to the requestor in order of their
differentiability scores.
Inventors: | Smyros; Athena Ann (Richardson, TX); Smyros; Constantine (Richardson, TX)
Applicant:
Name | City | State | Country | Type
Smyros; Athena Ann | Richardson | TX | US |
Smyros; Constantine | Richardson | TX | US |
Family ID: | 58017628
Appl. No.: | 15/437297
Filed: | February 20, 2017
Related U.S. Patent Documents
Application Number | Filing Date | Patent Number
14268581 | May 2, 2014 | 9575958
15437297 | |
61818904 | May 2, 2013 |
Current U.S. Class: | 1/1
Current CPC Class: | G06F 40/279 20200101; G06F 40/253 20200101; G06F 40/10 20200101; G06F 40/35 20200101
International Class: | G06F 17/27 20060101 G06F017/27; G06F 17/21 20060101 G06F017/21
Claims
1. A method comprising: receiving, by a computing device, an input;
generating, by the computing device, a plurality of term units
(TUs) from the input, the plurality of TUs consisting of a first
number of TUs; identifying, by the computing device, a plurality of
differentiable TUs of the plurality of TUs, the plurality of
differentiable TUs consisting of a second number of TUs;
determining, by the computing device, a differentiability score for
each term unit (TU) of the plurality of TUs, wherein the
differentiability score indicates a substantially fixed meaning or
a non-fixed meaning; determining, by the computing device, an input
score for the input by dividing the second number of TUs by the
first number of TUs; and transmitting, by the computing device, the
plurality of differentiable TUs in differentiability-scored
order.
2. The method of claim 1, wherein the input comprises a document,
and the method is repeated for a plurality of documents.
3. The method of claim 2, further comprising prioritizing, by the
computing device, the plurality of documents based on input
score.
4. The method of claim 2, wherein the plurality of TUs comprises a
set of topics, and further comprising prioritizing, by the
computing device, the set of topics based on differentiable TUs of
the set of topics.
5. The method of claim 1, wherein determining the differentiability
score is based on a grammatical scheme or a functional scheme.
6. The method of claim 5, wherein: the grammatical scheme comprises
classification as a noun, a modifier, an adverb, or a verb; and the
functional scheme comprises classification based on a type of
writing.
7. The method of claim 6 further comprising, producing, by the
computing device, a topical analysis of the input based on the
plurality of differentiable TUs.
8. The method of claim 6 further comprising, performing, by the
computing device, a search of the input based on the plurality of
differentiable TUs.
9. The method of claim 6, wherein the type of writing comprises a
style of writing or a linguistic functional scope of the input.
10. The method of claim 1, wherein the computing device comprises a
computer, a laptop computer, a personal computer, a server
computer, a personal data assistant, a camera, a phone, a cell
phone, a mobile phone, a smart phone, a tablet, a media server, a
music player, a game box, a data storage device, a measuring
device, a handheld scanner, a scanning device, a barcode reader, a
point-of-sale (POS) device, a digital assistant, a desk phone, an
Internet Protocol (IP) phone, a solid-state memory device, or a
memory card.
11. A method comprising: receiving, by a computing device, a
document; generating, by the computing device, a plurality of term
units (TUs) from the document, wherein each term unit (TU) of the
plurality of TUs comprises a word, a multi-word, a number, or a
symbol, wherein the plurality of TUs consist of a plurality of
differentiable TUs and a plurality of non-differentiable TUs, and
the plurality of TUs consists of a first number of TUs;
identifying, by the computing device, the plurality of
non-differentiable TUs, wherein each of the plurality of
non-differentiable TUs comprises non-fixed meaning; identifying, by
the computing device, the plurality of differentiable TUs, wherein
each of the plurality of differentiable TUs comprises substantially
fixed meaning, and the plurality of differentiable TUs consists of
a second number of TUs; determining, by the computing device, a
differentiability score for each TU of the plurality of TUs;
determining, by the computing device, a document score for the
document by dividing the second number of TUs by the first number
of TUs; and transmitting, by the computing device, the plurality of
differentiable TUs in order of differentiability score for each TU
of the plurality of differentiable TUs.
12. The method of claim 11, wherein the method is repeated for a
plurality of documents, and further comprising prioritizing, by the
computing device, the plurality of documents based on document
score.
13. The method of claim 12, wherein the plurality of TUs comprises
a set of topics, and further comprising prioritizing, by the
computing device, the set of topics based on differentiability
scores of the plurality of TUs of the set of topics.
14. The method of claim 13, wherein determining the
differentiability score is based on a grammatical scheme or a
functional scheme.
15. The method of claim 14, wherein: the grammatical scheme
comprises classification as a noun, an adjective, a modifier, an
adverb, or a verb; and the functional scheme comprises
classification based on a linguistic scope comprising at least one
of a type of writing or a style of writing.
16. The method of claim 15, wherein the determining the
differentiability score comprises providing a plurality of
differentiability-scored TUs configured for use in a topical
analysis of the plurality of documents.
17. The method of claim 15, wherein the determining the
differentiability score comprises providing a plurality of
differentiability-scored TUs configured for use in a search of the
plurality of documents.
18. A computing device comprising: one or more processors; and a
non-transitory, computer-readable medium storing a program that is
executable by the one or more processors, the program comprising
instructions to: receive a document; generate a plurality of term
units (TUs) from the document, wherein each term unit (TU) of the
plurality of TUs comprises a word, a multi-word, a number, or a
symbol, the plurality of TUs consisting of a plurality of
differentiable TUs and a plurality of non-differentiable TUs, the
plurality of TUs consisting of a first number of TUs; identify the
plurality of non-differentiable TUs, wherein each of the plurality
of non-differentiable TUs have a non-fixed linguistic meaning;
identify a plurality of differentiable TUs of the plurality of TUs,
wherein each of the plurality of differentiable TUs have a fixed
linguistic meaning, and the plurality of differentiable TUs
consists of a second number of TUs; determine a differentiability
score for each TU of the plurality of TUs, wherein the
differentiability score indicates a substantially fixed linguistic
meaning or a non-fixed linguistic meaning; determine a document
score for the document, the document score based on a ratio of the
second number of TUs to the first number of TUs; and transmit the
plurality of differentiable TUs in order of at least one of
ascending differentiability score or descending differentiability
score.
19. The computing device of claim 18, wherein the program further
comprises instructions to prioritize a set of topics based on
differentiability scores of TUs comprising the set of topics,
wherein determination of the differentiability score is based on a
grammatical scheme or a functional scheme, wherein: the grammatical
scheme comprises classification as a noun, an adjective, a
modifier, an adverb, or a verb; and the functional scheme comprises
classification based on type of writing.
20. The computing device of claim 19, wherein: the computing device
comprises a computer, a laptop computer, a personal computer, a
server computer, a personal data assistant, a camera, a phone, a
cell phone, a mobile phone, a smart phone, a tablet, a media
server, a music player, a game box, a data storage device, a
measuring device, a handheld scanner, a scanning device, a barcode
reader, a point-of-sale (POS) device, a digital assistant, a desk
phone, an Internet Protocol (IP) phone, a solid-state memory
device, or a memory card; and the plurality of scored
differentiable TUs are transmitted to a requestor, and the
requestor is a human user, a process of the computing device, or a
second computing device.
Description
RELATED APPLICATIONS
[0001] This application is a continuation of U.S. patent
application Ser. No. 14/268,581, entitled "DIFFERENTIATION TESTING"
and filed on 2 May 2014, which application claims priority to U.S.
Provisional Patent Application No. 61/818,904, entitled
"DIFFERENTIATION TESTING" and filed on 2 May 2013, which
applications are hereby incorporated herein by reference.
BACKGROUND
[0002] Currently, various communication devices are being rapidly
introduced that need to interact with natural language in an
unstructured manner. Communication systems are finding it difficult
to keep pace with the introduction of devices, as well as the
growth of information.
BRIEF DESCRIPTION OF THE DRAWINGS
[0003] The accompanying drawings are incorporated in and are a part
of this specification. With the understanding that the drawings
illustrate only representative embodiments of the invention and are
therefore not to be considered as limiting its scope, representative
embodiments will be described and explained more fully with
reference to the accompanying drawings, in which:
[0004] FIG. 1 illustrates a differentiation workflow that is usable
with representative embodiments described herein; and
[0005] FIG. 2 representatively illustrates a block diagram of a
computer system which is adapted for use in accordance with
representative embodiments.
DETAILED DESCRIPTION
[0006] Differentiation in language starts with a set of terms used
in the language, such as words, numbers, alphabetical and/or
numerical codes, and/or the like. This forms the largest language set
of words that provides a basis for most language analysis that
concentrates on using meaning to perform useful functions. This
term list may be similar to a dictionary of terms for a language,
but may include terms normally not found in dictionaries, such as
email addresses, phone numbers, product codes, and other
information. In representative embodiments, such a list represents
a largest set of usable terms within a language. The set contains
all such terms, and all such possible combinations of terms, that
are currently usable within a data or document repository. Note
that the term list may not have all possible combinations within a
given language's alphabet, since it generally contains numbers like
integers, and such numbers may comprise an infinite set that is not
usable for language operations on a computer. For instance, a
company repository may not contain all possible combinations of the
alphabet in a given language, just those that have meaning at the
point in time that the repository exists and is being analyzed by
the system.
[0007] From a maximum set of terms, there may be, for any
language-based functions performed by a computer, ways of
distinguishing one term from another within a fixed set of
meaning(s), such as a dictionary sense of the meaning of a word.
This refers to the ability of a term to be put into a set of
meanings that does not require further modification or explanation.
Hence, such a term can be used to separate one text stream from
another. For instance, the use of the term "application" does not
indicate the fixed meaning that would allow it to be placed into a
set of meanings without some explanation of what kind of
"application" is being indicated. It could be an insurance
application, a software application, a car paint application, or
the like. Therefore, the term "application" is not useful on its
own to distinguish itself from another text stream that contains
the same term without requiring more terms to put it into a fixed
meaning that is useful to perform any analysis. This is a
nondifferentiated term. This can be contrasted with the term
"fireplace," which has a fixed meaning that is useful to perform
analysis and is a differentiated term. The type of fireplace
enhances the meaning but does not alter the fixed meaning of
fireplace, unlike the above-described example, where the word
"application" has various possible meanings and requires more terms
to arrive at a fixed meaning.
[0008] Differentiation testing may be used to determine the
usefulness of a term on a single-term basis or within a set of
terms. Note that the term may also be a separator between one text
stream and another text stream in any given language. Differentiation
testing may use any form of text as an input. For example, a
message, file, blog, document, email, as well as input directly
obtained from a user, or another request from a system, can be used
as input. Differentiation testing involves comparing the input
against one or more members of a repository or comparison set,
depending on the implementation. Note that the input may comprise a
plurality of different input types. Thus, input can be checked for
differentiation against other input, in addition to being compared
to a comparison set. The input is checked for differentiation
against the terms in each document in the undifferentiated
comparison set, or it may be compared in general after filtering.
There is no requirement that both or all operands of a comparison
be differentiated to derive benefits of differentiated input. Any
text stream within the system can be used at any time to run
differentiation testing.
[0009] FIG. 1 representatively illustrates a workflow that is
usable with the system. The differentiation list is a set of terms
101, typically a single term unit (TU) in most languages, including
English, that are considered to be differentiable in a language.
These may be restricted by part of speech (POS), such as noun or
verb. Some languages, such as English, can count all verbs as
nondifferentiable to some degree. Either a binary classification or
an n-ary classification scheme may be used, separately or for the
entire set of POS used in a given language; this may also vary
depending on implementation, since while POS is normally used to
construct a list, the list may be constructed using any metric that
causes something to be differentiated from something else. The
number of differentiable terms may be less than the total number of
terms in a given language. Implied references, such as pronoun
usage, will cause measurement problems for implementations that do
not take such uses into account, and pronouns are not necessarily
directly differentiable, depending on their antecedent. Each list that is
generated is based on a specific language, and some languages may
be difficult to classify at the term level, especially where
idiomatic-based pictographs (characters) may be used to construct
sentences, as in Chinese.
[0010] User input 102 represents input in text form or a non-text
communication method that has been converted to text form. There is
generally no restriction on the size of the input. In addition, the
output required from the system can be sent with the input, or a
default for a particular implementation can be used.
[0011] There are several variables that may be measured, depending
on the implementation and its scope of input. One of these is
writing type. Generally, from a discourse level, there are four
basic writing types: expository, argumentation, description and
narration. The writing type indicates the purpose for the
communication--such as technical manuals or textbooks (expository),
marketing collateral or job applications (argumentation), news
stories or poetry (description), and novels or biographies
(narration). There are several different types of writing that are
used throughout a given language ID, and these may occur at random
within any repository, even when the repository is limited in
scope, such as representing a single device or a single information
domain.
[0012] Another variable that may need to be measured is writing
style. Some writing styles are terse, while others contain
significant modification or extra words based on use of inversion
and other sentence constructions. This refers to the range of
expression that is possible within a language, and differentiation
can be measured without regard to such expression ranges. A
significant measure in processing the input is the apparent level
of summarization. This triggers the use of differentiation when the
text is already in a summarized or highly condensed form, and lacks
a single particular focus. The use of differentiation may be
implemented for the development of an outline/hierarchy or a
summary when various language analyzers are used, such as topical,
date, and location.
[0013] Some measurements of differentiation may involve the use of
a functional scope, which can be used to limit operation of the
measurement of differentiation, such as restricting the measure to
a certain amount of text, such as a document, message, section of a
file, a single input, or a part of, or even an entire repository.
The scope can be defined as the input range over and including the
repository itself. For a given implementation, there may be a
document scope, a section of a document scope, a paragraph scope,
or a TU scope. The TU represents a discrete word, number, or other
symbol in the language that has specific meaning within the
language. The differentiable measure can then be applied to these
different functional scopes because it is applied at the TU scope.
In addition, a scope may also include parts of a word, such as a
suffix or a prefix, or an individual character within a specific
language, such as the letter "a" in English. Scopes such as these,
below the TU scope, generally do not use differentiation since it
measures TUs for their ability to restrict an object range within a
given language.
[0014] An optional intersection 103 between the differentiated list
and the repository word list, such as a Windex, can be used if
implemented. Windexes are discussed in U.S. patent application Ser.
No. 12/192,794, entitled "SYSTEMS AND METHODS FOR INDEXING
INFORMATION FOR A SEARCH ENGINE" filed 15 Aug. 2008, the entirety
of which is incorporated herein by reference.
[0015] Lists for differentiation can be kept up-to-date by
intersection and using only those terms that are in the
term-encoding scheme. This may be especially useful in data caches,
since these have limited memory and there is generally no reason to
store an entire differentiated list for a language if the current
repository does not contain a particular term. Updating can be
triggered in real-time with new addition(s) to the repository word
list or Windex, which comprises a simple binary search of the
differentiated list.
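The intersection and update described in paragraphs [0014] and [0015] can be sketched as follows. This is a minimal illustration, not the applicants' implementation: it assumes the differentiated list is held as a sorted Python list, and the function names (contains, intersect, on_new_repository_word) and the sample words are hypothetical.

```python
from bisect import bisect_left, insort


def contains(sorted_terms, term):
    """Binary search: return True if term occurs in the sorted list."""
    i = bisect_left(sorted_terms, term)
    return i < len(sorted_terms) and sorted_terms[i] == term


def intersect(differentiation_list, repository_words):
    """Keep only the differentiable terms that the repository actually uses."""
    repository = set(repository_words)
    return sorted(t for t in set(differentiation_list) if t in repository)


def on_new_repository_word(active_list, full_sorted_list, word):
    """Called when a word is added to the repository word list.

    The word joins the active (cached) list only if a binary search finds
    it in the full differentiated list for the language.
    """
    if contains(full_sorted_list, word) and not contains(active_list, word):
        insort(active_list, word)


# Hypothetical data for illustration only.
full_list = sorted(["acoustic", "dog", "fireplace", "paint"])
active = intersect(full_list, ["paint", "application", "dog"])
on_new_repository_word(active, full_list, "fireplace")    # on the list: added
on_new_repository_word(active, full_list, "application")  # not on the list: ignored
print(active)
```

Keeping the active list sorted means each update costs one binary search plus one insertion, which suits the limited-memory data caches mentioned above.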
[0016] A test at step 104 to see if differentiable terms exist in
the input may be performed next. If a differentiated term exists,
then a term extraction process can begin. If no differentiated term
exists, then there is generally no raw material to determine
differentiability, and the system may be configured to apply an
input value of 0 (zero). For this condition, the system may be
adapted to terminate operations at step 105. A message can be
generated that indicates that follow-on processes cannot be
performed with the input because, as far as searching is concerned,
there is generally no mechanism to differentiate the terms in the
text stream input against what is used as the comparison set.
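The test at step 104 and the termination at step 105 might look like the sketch below. The tokenizer (a simple regular expression), the dictionary keys, and the message wording are assumptions made for illustration; the specification does not prescribe any of them.

```python
import re


def process_input(text, differentiation_list):
    """Step 104 sketch: continue only if the input has a differentiable term."""
    term_units = re.findall(r"[a-z0-9]+", text.lower())
    found = [tu for tu in term_units if tu in differentiation_list]
    if not found:
        # No raw material for differentiation: input value 0, then stop (step 105).
        return {
            "input_value": 0,
            "terminated": True,
            "message": "No differentiable terms; follow-on processes cannot run.",
        }
    # Otherwise hand the term units on to term extraction (step 106).
    return {"terminated": False, "term_units": term_units, "differentiable": found}
```

For example, with a differentiation list containing only "fireplace", the input "the best application" would terminate with an input value of 0, while "a stone fireplace" would proceed to extraction.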
[0017] When a differentiated term exists, then step 106 performs
term extraction. Term extraction involves finding terms that might
be used for the differentiation. This is based on what was
determined for the differentiation list in step 101. Depending on
implementation, this may involve isolation of terms that meet
similar grammatical requirements, such as all the nouns or all the
verbs found in the input. It may also be based on uses of the
input. For example, some input may require testing for a specific set of
terms, and these may not fall into a single grammatical category
like POS. Terms are extracted and any appropriate filters, such as
a grammatical categorization, applied as part of the term
extraction process to eliminate terms that will not be used in the
differentiation process. Typically, the input may use a single
differentiation test, but any number may be applied. A
differentiation test may be unary, binary or n-ary in nature, and
the differentiable list is usually substantially equivalent to the
control set when the comparison is made. However, tests can be
performed when a document has passed through the differentiation
testing process, or with one that has no such test. Binary and
n-ary tests occur when the term extraction process indicates that
the terms comprise both single word and multi-word terms. This
generally occurs in most implementations of differentiation
testing. Therefore, the extraction 106 should result in a list of
phrases as well as single words that are used by the
differentiation tests.
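One way to picture the extraction at step 106 is a grammatical filter that keeps selected parts of speech and joins adjacent kept words into multi-word terms. The sketch below is only an illustration under stated assumptions: the tiny hand-written POS lexicon (TOY_POS) stands in for whatever tagger an implementation would use, and the rule that adjacent kept words form one phrase is an assumption, not something the specification requires. The sample sentence is the one used later in this description.

```python
import re

# Hypothetical, hand-written POS lexicon; a real system would use a tagger.
TOY_POS = {
    "the": "function", "at": "function", "went": "verb",
    "dog": "noun", "shopping": "noun", "cordova": "noun", "mall": "noun",
}
KEEP = {"noun", "modifier"}  # grammatical filter: categories to extract


def extract_terms(text, pos_lexicon=TOY_POS, keep=KEEP):
    """Return single-word and multi-word terms as tuples of term units."""
    terms, run = [], []
    for token in re.findall(r"[a-z0-9]+", text.lower()):
        if pos_lexicon.get(token) in keep:
            run.append(token)          # extend the current phrase
        elif run:
            terms.append(tuple(run))   # a filtered-out word ends the phrase
            run = []
    if run:
        terms.append(tuple(run))
    return terms


print(extract_terms("The dog went shopping at the Cordova Mall"))
```

The result mixes single words ("dog", "shopping") with a phrase ("cordova", "mall"), which is the kind of list the differentiation tests consume.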
[0018] Thereafter, the process for assigning differentiation
weights/scores at step 107 can begin. A term length is generally
recorded as part of the process, so that a term has at least one TU
or word (as in English). Depending on implementation, there may be
several stages of differentiation testing at this point, set up by
the grammatical or functional boundaries in the extracted terms. A
noun that ends a multi-term set may be used, or a verb may be used,
or a set of terms found in a list may be used, depending on
implementation. For instance, a set of time indicators may be found
to be differentiators for a given implementation. These are
generated as a list, and then checked to see if any of the terms
are in a phrase or not. In English, a verb phrase has auxiliary and
main parts; generally, only the main part is tested for
differentiation, since auxiliary verbs are used more frequently. In
some cases, both verbs and nouns can be used, such as when there
is a set of functions that are to be differentiated for a device,
such as a representative differentiation list comprising "play,
run, stop, start, pause", where it is possible that different
inflections can refer to the same button or some other feature of
the device. A weight is given to each term found in the input based
on this list. In this example, the differentiation list may score a
phrase "play the DVD" higher than "stop the DVD," because the DVD
must be played before it can be stopped. Therefore, the weight in
the differentiated list would be higher for "play" and "run", but
lower for "start," "stop," and "pause."
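The device-function example can be sketched as a weighted list in which different inflections map to the same entry. The numeric weights below (2 for "play" and "run", 1 for the others) and the crude suffix-stripping in base_form are assumptions chosen only to illustrate the ordering described above; they are not taken from the specification.

```python
# Hypothetical weights: "play" and "run" rank above "start", "stop", "pause".
WEIGHTS = {"play": 2, "run": 2, "start": 1, "stop": 1, "pause": 1}


def base_form(word, vocabulary):
    """Map an inflected word to a vocabulary entry, or return None.

    Crude suffix stripping for illustration: tries the bare stem, the stem
    plus "e" ("paused" -> "pause"), and an undoubled final consonant
    ("stopped" -> "stop").
    """
    candidates = [word]
    for suffix in ("ing", "ed", "s", "d"):
        if word.endswith(suffix):
            stem = word[: -len(suffix)]
            candidates += [stem, stem + "e"]
            if len(stem) >= 2 and stem[-1] == stem[-2]:
                candidates.append(stem[:-1])
    for candidate in candidates:
        if candidate in vocabulary:
            return candidate
    return None


def score_phrase(phrase, weights=WEIGHTS):
    """Sum the weights of every term in the phrase that maps to the list."""
    total = 0
    for token in phrase.lower().split():
        base = base_form(token, weights)
        if base is not None:
            total += weights[base]
    return total


print(score_phrase("play the DVD"), score_phrase("stop the DVD"))
```

With these assumed weights, "play the DVD" scores higher than "stop the DVD", and "playing" or "stopped" resolve to the same list entries as their base forms.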
[0019] Determining a score/weight in step 107 usually takes place
at TU level, but may be performed with any character that maps to
an identifiable unit in a specific language. Each TU in input may
be intersected with the differentiation control set. Each TU may be
graded based on any linear arrangement, and may be a simple binary
value (e.g., 0 or 1) or a more complex system that takes into
account variations in differentiation. In some cases, this may be
based on a differentiable list suitable for the process. Each TU in
a grammatical scheme may be either a noun, modifier, adverb, or
other language classification based on the individual function of a
word. Typically, a noun-based test is sufficient, as verbs might
generally be classified as nondifferentiable. This can be changed
as required, depending on implementation, as when the function of a
device is used, as in the above-described example.
[0020] For some implementations, an object test may be the major
differentiator test, since it may be a significant element. An
object is usually a noun, but depending on implementation, may also
be a subset of a noun or may include verbs or other terms that are
functioning as a noun in a particular text stream input. The object
functions would get a higher score and impact the score for a
multi-word test more than a modifier test, if used. Such a
follow-on modifier test may be used to determine differentiation
when included as part of a larger set of terms, as it serves to
restrict the set of possible objects referred to by the object.
Note that these are not a required part of the measurement, as the
noun is still a more significant determiner of differentiation when
an object test is indicated. However, a more precise test will
generally require differentiation of adjectives. This may also be
measured along grammatical lines, such as which modifier has more
impact on limiting the object in question. For many
implementations, associations, substitutions, and other such
similarity tests may be used to group like terms together to weight
differentiated terms similarly. This can vary based on
implementation. Also, some prefixes and suffixes in some languages
will receive attention when compared against other terms with the
same root words. This process also may affect the outcome of a
process that has sensitivity in this area, such as sentiment
analysis, feature-based analysis, abstractions, and/or the like.
Therefore, a weight may be assigned as well to distinguish
differentiation when the differentiation list does not contain
prefix and suffix variations. These may be applied to a modifier
relation of an object, or they may be applied to the object itself,
depending on where it occurred in the input.
[0021] Several examples follow. The differentiation list for these
examples=(acoustic, paint). The weighting comprises a simple binary
scale that adds an additional weight when the object value is
differentiable, and adds a one (1) when a term is differentiable,
regardless of where the term occurs in the input. Example 1: "special
qualities"=nondifferentiable (both individually and as a
multi-word term), but still more differentiable than a
nondifferentiable single word (weight 0 or norm). Example 2:
"special (nondifferentiable) acoustics (differentiable)" and the
term is differentiable owing to the object value (acoustics).
Weight=2, since the object is higher than the nonobject. Example 3:
"acoustical (differentiable) qualities (nondifferentiable)" and is
differentiable, but less differentiable than the object that has a
differentiable value, making the weight=1 (e.g., less than the
object version). Example 4: "acoustical (differentiable) paints
(differentiable)" is differentiable and is the most differentiable
(among the examples herein) of any phrase or multi-word term since
all members are differentiable (weight=3).
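The four weights in paragraph [0021] can be reproduced with a short sketch. It assumes that a term unit counts as differentiable when it begins with a root on the differentiation list (so that "acoustics", "acoustical", and "paints" match "acoustic" and "paint"), and that the object is the last word of the phrase; both are simplifications introduced here for illustration.

```python
DIFFERENTIATION_LIST = ("acoustic", "paint")


def is_differentiable(term_unit, roots=DIFFERENTIATION_LIST):
    """Assumed matching rule: the TU starts with a root from the list."""
    return any(term_unit.startswith(root) for root in roots)


def phrase_weight(phrase):
    """One point per differentiable TU, plus one more if the object
    (taken here to be the final TU) is differentiable."""
    term_units = phrase.lower().split()
    weight = sum(1 for tu in term_units if is_differentiable(tu))
    if term_units and is_differentiable(term_units[-1]):
        weight += 1
    return weight


for example in ("special qualities", "special acoustics",
                "acoustical qualities", "acoustical paints"):
    print(example, phrase_weight(example))
```

Under these assumptions the four phrases receive weights 0, 2, 1, and 3 respectively, matching Examples 1 through 4.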
[0022] Once all terms have been differentiated, then they can be
grouped together based on some criteria in optional step 108. One
approach may comprise the creation of groups based on term length
(e.g., cardinality). There are usually at least two group types,
the single and the multi-word term. For a single TU, the score
does not involve other factors, unless the function or POS is
considered. There are implication issues and these affect the
classification system in POS-based implementations. In
multiple-unit terms, the end or terminating object may be weighted
more than other TUs in length. Modifiers or terms are measured and
scored for each multiple unit term. Each unique group has an
individual score, along with each component of a multiple-unit
term. For example, for the input "the dog went shopping at the
Cordova Mall," terms that are in the exemplary filter="dog,"
"shopping," "Cordova," "mall." No functional words or verbs are
being used in this representative basic filter measured for
differentiation. These are considered nondifferentiable for this
implementation. Each term is analyzed for differentiation as
follows: "dog" =differentiable; "shopping" =nondifferentiable;
"Cordova"=differentiable; "mall" =nondifferentiable. "Cordova
mall," in the example, is a multi-word term, so its scores are
combined. In a binary context, it gets a score of 1. This means
that it has a differentiable member. Note that the terminating
object is nondifferentiable. If it were differentiable, it would
get a special weight, and the score would be higher.
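The grouping at step 108 and the binary score for "Cordova mall" can be sketched as below. Terms are represented as tuples of term units; the size of the "special weight" for a differentiable terminating object is not given in the text, so the object_bonus value of 1 is an assumption.

```python
from collections import defaultdict


def score_term(term, differentiable, object_bonus=1):
    """Binary score for a term given as a tuple of TUs.

    1 if any member is differentiable, else 0. For multi-word terms, a
    differentiable terminating object earns an extra, assumed bonus.
    """
    score = 1 if any(tu in differentiable for tu in term) else 0
    if len(term) > 1 and term[-1] in differentiable:
        score += object_bonus
    return score


def group_by_length(terms, differentiable):
    """Group scored terms by cardinality (number of TUs in the term)."""
    groups = defaultdict(dict)
    for term in terms:
        groups[len(term)][term] = score_term(term, differentiable)
    return dict(groups)


terms = [("dog",), ("shopping",), ("cordova", "mall")]
print(group_by_length(terms, differentiable={"dog", "cordova"}))
```

Here "dog" scores 1, "shopping" scores 0, and "cordova mall" scores 1 because only its modifier is differentiable; if "mall" were also differentiable, the assumed bonus would raise the phrase to 2.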
[0023] An optional input score at step 109 can be calculated for an
entire input by using a summation, division, multiplication, or another
mathematical method based on the input size, output requirements,
and/or other implementation-specific data. Each differentiable term
of the input participates in the final score, for most purposes.
Consider the list in the above-described example of "acoustic" and
"paint," and an input: "Acoustical paint has been found to be
useful." In this input, the differentiated multi-word term
is equal to "acoustical paint." The score based on step 107 is
equal to 3, since both terms are differentiable and the term
"paint" is an object, using an object scoring system as described
above. A score can be generated for an entire input based on
implementation; for this example, a score for the input may be
based on the number of differentiated terms that have been located
within the set of object and modifier-object term sets. In this
example, the additional term (a modifier) to be considered is
"useful," which has a score of 0. Therefore, the aggregate score
for this input would be equal to 3. Depending on implementation
requirements, a deduction can be made for this term since it is
undifferentiated, such as subtracting one from the score of the initial
phrase, giving this input a score of 2. It is also possible to assign
weights based on function; since the term "useful" is not an
object, maybe only half a point is subtracted, making the score
2.5. This is advantageous to gauge the input as a whole and how
beneficial it might be in different situations. For instance, for a
search term, only the differentiated portion of the input is used,
and that should be reflected in those term(s) that comprise the
input score. In general, the other terms may be safely dropped in
certain implementations, since they will not help locate documents
that correspond to the input.
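The arithmetic in paragraph [0023] can be checked with a small sketch. The function names and the penalty parameter are hypothetical; the ratio function reflects the form recited in the claims (differentiable TUs divided by total TUs) and is shown only as one of the possible mathematical methods.

```python
def aggregate_score(term_scores, penalty=0.0):
    """Sum term scores, optionally deducting a penalty for each
    undifferentiated term (one scored 0)."""
    total = sum(score for _, score in term_scores)
    undifferentiated = sum(1 for _, score in term_scores if score == 0)
    return total - penalty * undifferentiated


def ratio_score(differentiable_count, total_count):
    """Ratio form: differentiable TUs divided by all TUs in the input."""
    return differentiable_count / total_count if total_count else 0.0


# "Acoustical paint has been found to be useful."
scored = [("acoustical paint", 3), ("useful", 0)]
print(aggregate_score(scored))               # no deduction
print(aggregate_score(scored, penalty=1))    # full point deducted for "useful"
print(aggregate_score(scored, penalty=0.5))  # half point, since it is not an object
```

The three calls give 3, 2, and 2.5, the same figures worked through above.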
[0024] A representative use of differentiation testing may be to
analyze a topical outline. A branch of the outline may be
considered to start at the top node or a chosen topic. The topic
may or may not have enough differentiation to convey a meaning that
is related to the underlying document.
[0025] In this case, the topic path may be augmented by
differentiation-based multi-word terms and single words that
provide significant information in the form of subtopics to each
topic that implements differentiation. The root of the path may or
may not be differentiable. The ability to provide this information
is important for summarized, argumentation documents, like job
descriptions, marketing communications, and even summaries of
larger documents. A quick view can be generated for such documents
based on augmented differentiation information, and may be a
smaller subset of the full topical outline. The quick view can give
significant information without having to read a lot of extraneous
words and still provide an overview of a document. This can be
implemented in outline form, giving several subtopics that contain
differentiable information about each parent topic. Any number of
topics in the chain can be augmented using differentiation, and
therefore can build any number of quick views.
[0026] For further information on topical analysis, see U.S. patent
application Ser. No. 13/689,656, entitled "SYSTEMS and METHODS OF
TOPICAL ANALYSIS" filed on 29 Nov. 2012, the entirety of which is
incorporated herein by reference.
[0027] Another representative use of differentiation testing
involves listings of product features. In some instances,
descriptions of product features use nondifferentiable labels, such
as best, perfect, etc. This may also be true of various other
marketing documents or collateral. These labels may or may not
contain actual data that is useful for a particular analysis, and
a determination of the differentiable features may be desired to
show the user (or input process) why a certain product is better
than another product. Feature extraction, in generic terms, is the
ability to ascertain characteristics about a focus object, such as
a camera or a lawn mower. Most nondifferentiable descriptors do not
impart any actual meaning with respect to the object's
characteristics; rather, they qualify such features in a judgmental
fashion that may not concur with the judgement of a current
viewer/user.
[0028] Corrections to an input in step 110 are optional and may be
performed by differentiation testing to make the input more usable.
For instance, a calling function may want to compare two documents
that contain flowery language, when the information desired concerns
the ability of a robot to perform a certain task. Most flowery
words, such as "quality," "feature," or the like, are generally not
helpful for running a search, and may need to be located and
separated from terms that are helpful, such as "line-of-sight
requirements," "range-of-motion," and "actuator arm." These
represent differentiators that will make the comparison beneficial
to the task at hand. These terms may then be isolated in each
document. A comparison is made and an initial decision can be made
if there is enough information to proceed with more in-depth
analysis. Any types of corrections to the inputs with respect to
differentiation may be used in any number of information retrieval
schemes. For instance, non sequitur portions may be removed before
an input is sent to a search engine; e.g., in "acoustical paint has
been found to be useful," only the differentiated terms "acoustical
paint" would be sent to a search engine to find documents that
relate to that kind of paint.
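By way of non-limiting illustration, such a correction may be
sketched as a filter that keeps only terms meeting a score
threshold. The function name to_search_query, the min_score
parameter, and the quoting of multi-word terms as phrases are
assumptions made for this example; an actual search engine may
expect a different query syntax.

```python
def to_search_query(scored_terms, min_score=1):
    """Build a query string from terms whose score meets the threshold.

    scored_terms: list of (text, differentiability_score) pairs.
    Multi-word terms are wrapped in double quotes (an assumed
    phrase-query convention, not part of the source description).
    """
    kept = [text for text, score in scored_terms if score >= min_score]
    return " ".join(f'"{t}"' if " " in t else t for t in kept)


# Scores taken from the "acoustical paint" example in the text.
scored = [("acoustical paint", 3), ("useful", 0)]
print(to_search_query(scored))  # "acoustical paint"
```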
[0029] The data can be returned in step 111 in any number of forms,
since there are several different ways in which an input can be
represented. The system can return the differentiated part of the
input only (as in "acoustical paint" above). This may be useful for
information retrieval tasks and other such data analysis that
depend on being able to find a distinguishable point from which to
perform a set of functions. Another return may be to indicate the
differentiated portions using any number of methods, such as
encoding the output to show differentiated portions. Any such
returns may also indicate weights and/or scores for each individual
component of the input, as well as optionally for the entire input.
The returned data may be presented to a user, e.g., via a display,
or other man-machine interface, or the returned data may be
provided to another program application that uses the returned data
as input for further processing.
[0030] The returned data may be used to determine if a secondary
process can be successfully run, or if more information is required
about the input. This may be the case when automated processes are
in place and human intervention is not feasible for a given
business environment. For instance, a set of messages may be
compared against a control message, such as by a filter that needs
to determine if a message indicates that a part of a system
needs to be shut down. If there is not enough information to
determine whether there is a problem, then the user (or input
process) needs to be informed that the message is incomplete, which
may be determined by a lack of differentiated terms in the input
message.
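By way of non-limiting illustration, the completeness check
described above may be sketched as a gate on the number of
differentiated terms. The function names, the min_differentiated
threshold, the return labels, and the sample messages with their
scores are all hypothetical and are used only to show the control
flow.

```python
def has_enough_information(scored_terms, min_differentiated=1):
    """Return True if enough terms have a differentiability score above 0."""
    count = sum(1 for _text, score in scored_terms if score > 0)
    return count >= min_differentiated


def route_message(scored_terms):
    """Decide, without human intervention, what to do with a message."""
    if has_enough_information(scored_terms):
        return "compare"      # proceed to comparison with the control message
    return "incomplete"       # inform the user (or input process)


# Hypothetical messages with assumed per-term scores.
vague_message = [("problem", 0), ("system", 0)]
specific_message = [("coolant pump", 2), ("problem", 0)]

print(route_message(vague_message))     # incomplete
print(route_message(specific_message))  # compare
```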
[0031] Another example is the use of a control document (such as
a requirements document) to filter out documents that do not
contain more detailed information on the indicated requirements.
Assuming that the documents must have required information, then
the user (or input process) may be informed that a document does
not meet the requirements. This may be accomplished by using
differentiation systems and/or processes as representatively
disclosed. Accordingly, a message can be raised to the user (or
input process) without human intervention. These scenarios can be
addressed with the use of a differentiable measure that determines
suitability of a document to perform a particular task at hand. In
order to accomplish this, differentiation testing can be used to
develop a metric that a document can be measured against.
[0032] A representative system example follows. There is a single
document in the repository, and the document contains the following
text: "The use of acoustic paint is necessary in sound sensitive
environments to remove ambient sound from test equipment for
hearing tests. This removes the expense of hearing booths."
Differentiation is used to build a set of search terms that will
allow the system to distinguish between other documents such that
documents that contain similar information can be found
automatically.
[0033] The system builds a differentiated list in step 101. This is
done by performing differentiation on the language being used for
possible inputs from the user. Depending on implementation, this
list contains all possible differentiable terms for the language at
the point in time of system invocation. This process may be
performed at different points in time based on system
requirements.
[0034] The system obtains input from the requestor in step 102,
which contains the single document in the repository as
representatively identified above. The document is then reduced to
a set of terms that are separated by POS or other such grammar
function. In this case, only nouns and modifiers will be used. The
complete set of terms ignores functional words, since they are
typically undifferentiated in a language. The list is equal to
"use, acoustic, paint, necessary, sound-sensitive, environments,
remove, ambient, sound, test, equipment, hearing, booth." This list
represents the repository word list, which is then intersected with
the differentiated list in step 103 to determine if there are any
terms that are considered differentiable in this representative
example.
[0035] Intersection produces differentiable terms in step 104. This
list is equal to "acoustic, paint, booth." Once this list has been
found, the terms are extracted in their full form, including any
phrases in the above text, by text extraction in step 106 to
produce the results: "use," "acoustic paints," "necessary,"
"sound-sensitive environments," "remove," "ambient sound," "test
equipment," "expense," "hearing booths." Note that term extraction
may treat prepositional phrases and infinitives differently based
on implementation.
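By way of non-limiting illustration, the intersection of steps 103
and 104 may be sketched as follows. The repository word list is
copied from the example above. The differentiated list shown here is
only a small stand-in for a full language-wide list, and its entry
"actuator" is a hypothetical addition included to show that entries
absent from the document are not returned.

```python
# Repository word list from the example (step 102).
repository_words = [
    "use", "acoustic", "paint", "necessary", "sound-sensitive",
    "environments", "remove", "ambient", "sound", "test",
    "equipment", "hearing", "booth",
]

# Stand-in for the differentiated list built in step 101 (assumption).
differentiated_list = {"acoustic", "paint", "booth", "actuator"}


def intersect(words, differentiated):
    """Return the words found in the differentiated list, in document order."""
    return [w for w in words if w in differentiated]


print(intersect(repository_words, differentiated_list))
# ['acoustic', 'paint', 'booth']
```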
[0036] Each term is then processed for differentiable score weights
in step 107 based on the term extraction results. For the given
example, a simple binary system is employed as representatively
described above. For this application of a search term, only those
found to have differentiable scores equal to or above 1 will be used.
The first term "use" is nondifferentiated, so it gets a score of 0
and is not used as a search term. The second multi-word term
"acoustic paints" is differentiated, where both terms are also
individually differentiated, getting the score of 3, and is
therefore used as a search term. The third term "necessary" is not
differentiated, so it gets a score of 0 and is not used as a search
term. The fourth multi-word term "sound-sensitive environments"
contains only nondifferentiable terms, but since it contains
multiple terms it gets a score of 1 and is used as a search term.
The fifth single-word term, "remove" is not differentiated, and
therefore gets a score of 0 and is not used as a search term. The
sixth term "ambient sound" is also nondifferentiated, but contains
more than one term, so it gets a score of 1 and is used as a search
term, as is the seventh term "test equipment." The eighth term
"expense" is a single term that is nondifferentiated, and gets a
score of 0 and is not used as a search term. The last term,
"hearing booths," contains a differentiated object and gets a score
of 2, and is used as a search term.
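By way of non-limiting illustration, one scoring rule that
reproduces the scores given above is one point for a multi-word term
plus one point for each individually differentiated word. This rule
is an assumption inferred from the example values and may differ
from the scoring system actually employed; the helper base_form is a
crude, hypothetical singularization used only so that "paints"
matches "paint."

```python
DIFFERENTIATED = {"acoustic", "paint", "booth"}  # from step 104 of the example


def base_form(word):
    """Crude singularization (assumption): strip one trailing 's'."""
    return word[:-1] if word.endswith("s") else word


def score_term(term, differentiated=DIFFERENTIATED):
    """Assumed rule: 1 point if multi-word, plus 1 per differentiated word."""
    words = term.lower().split()
    hits = sum(1 for w in words if base_form(w) in differentiated)
    multi_word_bonus = 1 if len(words) > 1 else 0
    return multi_word_bonus + hits


extracted_terms = [
    "use", "acoustic paints", "necessary", "sound-sensitive environments",
    "remove", "ambient sound", "test equipment", "expense", "hearing booths",
]

for term in extracted_terms:
    print(term, score_term(term))
```

Under this assumed rule a single differentiated word standing alone
would receive a score of 1; the example does not include such a
term, so that behavior is not confirmed by the text.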
[0037] In this process, there is no grouping that is required so
optional step 108 is not performed. If there is a need to calculate
an input score in step 109, it can be done by determining the
document score to determine how differentiable the document is and
how well-formed its group of search terms will be. In this example,
the document score will be the number of terms that are
differentiated over the total number of terms. There are 9 terms in
this example, and 5 are differentiated. Accordingly, it would get a
score of 5/9, which is a moderate differentiated score for a
document. In this case, no corrections to the input would be
required. The output in step 111 is the five differentiated terms:
"acoustic paints," "hearing booths," "sound-sensitive
environments," "ambient sound," and "test equipment" in order of
their differentiated scores.
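By way of non-limiting illustration, the document score of step 109
and the ordered output of step 111 may be sketched as follows. The
per-term scores are those stated in the example; the function names
and the threshold parameter are assumptions. Terms with equal scores
are kept in their original order here, which is one possible
tie-breaking choice and is not specified in the text.

```python
from fractions import Fraction

# (term, differentiability score) pairs as stated in the example.
scored_terms = [
    ("use", 0), ("acoustic paints", 3), ("necessary", 0),
    ("sound-sensitive environments", 1), ("remove", 0),
    ("ambient sound", 1), ("test equipment", 1),
    ("expense", 0), ("hearing booths", 2),
]


def document_score(scored, threshold=1):
    """Ratio of differentiated terms to the total number of terms."""
    differentiated = [t for t, s in scored if s >= threshold]
    return Fraction(len(differentiated), len(scored))


def ranked_output(scored, threshold=1):
    """Differentiated terms in descending score order (stable for ties)."""
    kept = [(t, s) for t, s in scored if s >= threshold]
    return [t for t, _s in sorted(kept, key=lambda p: p[1], reverse=True)]


print(document_score(scored_terms))  # 5/9
print(ranked_output(scored_terms))
```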
[0038] FIG. 2 representatively illustrates computer system 200
adapted to use representative embodiments. Central processing unit
(CPU) 201 is coupled to system bus 202. The CPU 201 may be any
general-purpose CPU, such as an Intel Pentium processor. However,
embodiments herein are not restricted by the architecture of CPU
201, as long as CPU 201 supports operations as described herein.
Bus 202 is coupled to random access memory (RAM) 203, which may be
SRAM, DRAM, or SDRAM. ROM 204 is also coupled to bus 202, which may
be PROM, EPROM, or EEPROM. RAM 203 and ROM 204 hold user and system
data and programs, as is appreciated in the art.
[0039] Bus 202 is also coupled to input/output (I/O) controller
card 205, communications adapter card 211, user interface card 208,
and display card 209. The I/O adapter card 205 connects storage
devices 206, such as one or more of a hard drive, a CD drive, a
floppy disk drive, or a tape drive, to the computer system. The I/O
adapter 205 may also be connected to a printer (not illustrated),
which would allow the system to print paper copies of information
such as documents, photographs, articles, etc. Note that the
printing device may comprise a printer (e.g., inkjet, laser, etc.),
a fax machine, or a copier machine. Communications card 211 is
adapted to couple computer system 200 to a network 212, which may
be one or more of a telephone network, a local-area network (LAN)
and/or a wide-area network (WAN), an Ethernet network, and/or the
Internet.
User interface card 208 couples user input devices, such as
keyboard 213, pointing device 207, and microphone (not shown), to
the computer system 200. User interface card 208 may also provide
sound output to a user via speaker(s) (not illustrated). The
display card 209 may be driven by CPU 201 to control display on
display device 210.
[0040] Note that any of the functions described herein may be
implemented in hardware, software, and/or firmware, and/or any
combination thereof. When implemented in software, elements of
representative embodiments may comprise code segments to perform
operations or tasks. The program or code segments can be stored in
a computer-readable medium. The "computer-readable medium" may
include any physical medium configured to store or transfer
information. Examples of a processor-readable medium include an
electronic circuit, a semiconductor memory device, a ROM, a flash
memory, an erasable ROM (EROM), a floppy diskette, a compact disk
(CD-ROM), an optical disk, a hard disk, a fiber optic medium, or the
like. The code segments may be downloaded via computer networks
such as the Internet, Intranet, or the like.
[0041] Embodiments described herein may operate on or in
conjunction with any network attached storage (NAS), storage area
network (SAN), blade server storage, rack server storage, jukebox
storage, cloud, storage mechanism, flash storage, solid-state
drive, magnetic disk, read only memory (ROM), random access memory
(RAM), or any other computing device, including scanners, embedded
devices, mobile, desktop, server, or the like. Such devices may
comprise one or more of: a computer, a laptop computer, a personal
computer, a personal data assistant, a camera, a phone, a cell
phone, a mobile phone, a computer server, a media server, a music
player, a game box, a smart phone, a data storage device, a
measuring device, a handheld scanner, a scanning device, a barcode
reader, a point-of-sale device, a digital assistant, a desk phone,
an IP phone, a solid-state memory device, a tablet, and/or a memory
card.
[0042] In a representative embodiment, a method comprises steps of
a computing device: receiving an input from a requestor; generating
a plurality of term units (TUs) from the input, the plurality of
TUs consisting of a first number of terms; identifying a plurality
of differentiable terms of the plurality of TUs, the plurality of
differentiable terms consisting of a second number of terms;
determining a differentiability score for each term unit (TU) of
the plurality of TUs; determining an input score for the input by
dividing the second number of terms by the first number of terms;
and transmitting a plurality of differentiable TUs to the requestor
in order of differentiability score. The input may comprise a
document or a plurality of documents. The method may further
comprise the computing device prioritizing a plurality of documents
based on their input scores. The method may further comprise the
computing device prioritizing a set of topics based on their
differentiability scores. Determination of the differentiability
scores may be based on a grammatical scheme or a functional scheme.
The grammatical scheme may comprise classification as a noun, a
modifier, an adverb, or a verb. The functional scheme may comprise
classification based on a linguistic scope. The method may further
comprise the computing device producing a topical analysis of the
input based on a plurality of differentiability-scored TUs. The
method may further comprise the computing device performing a
search of the input based on a plurality of
differentiability-scored TUs. The linguistic scope may comprise a
type of writing, a style of writing, or a linguistic functional
scope of the input. The computing device may comprise a computer, a
laptop computer, a personal computer, a server computer, a personal
data assistant, a camera, a phone, a cell phone, a mobile phone, a
smart phone, a tablet, a media server, a music player, a game box,
a data storage device, a measuring device, a handheld scanner, a
scanning device, a barcode reader, a point-of-sale (POS) device, a
digital assistant, a desk phone, an Internet Protocol (IP) phone, a
solid-state memory device, or a memory card.
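By way of non-limiting illustration, prioritizing a plurality of
documents based on their input scores may be sketched as follows.
The function names and the document labels are hypothetical; the
counts for doc_a come from the 5/9 example above, while the counts
for doc_b and doc_c are invented for illustration. Returning a score
of zero for an input with no term units is likewise an assumption.

```python
from fractions import Fraction


def input_score(second_number, first_number):
    """Divide the number of differentiable terms by the total number of terms."""
    if first_number == 0:
        return Fraction(0)  # assumed handling of an empty input
    return Fraction(second_number, first_number)


def prioritize(documents):
    """Order document names from highest to lowest input score.

    documents: dict mapping a name to (second_number, first_number).
    """
    return sorted(
        documents,
        key=lambda name: input_score(*documents[name]),
        reverse=True,
    )


docs = {"doc_a": (5, 9), "doc_b": (1, 8), "doc_c": (3, 4)}
print(prioritize(docs))  # ['doc_c', 'doc_a', 'doc_b']
```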
[0043] In another representative embodiment, a method may comprise
a computing device: receiving a document from
an originator; generating a plurality of term units (TUs) from the
document, wherein each term unit (TU) of the plurality of TUs
comprises a word, a multi-word, a number, or a symbol, the
plurality of TUs consisting of a first number of TUs; identifying a
plurality of differentiable terms of the plurality of TUs, the
plurality of differentiable terms consisting of a second number of
TUs; determining a differentiability score for each TU of the
plurality of TUs; determining a document score for the document by
dividing the second number of TUs by the first number of TUs; and
transmitting a plurality of differentiable TUs to the originator in
differentiability-scored order. The method may further comprise the
computing device prioritizing a plurality of documents based on
their document scores. The method may further comprise the
computing device prioritizing a set of topics based on their
differentiability scores. Determination of the differentiability
score may be based on a grammatical scheme or a functional scheme.
A grammatical scheme may comprise classification as a noun, a
modifier, an adverb, or a verb. A functional scheme may comprise
classification based on a linguistic scope. A plurality of
differentiability-scored TUs may be configured for the originator
to use the plurality of differentiability-scored TUs in a topical
analysis of a plurality of documents. A plurality of
differentiability-scored TUs may be configured for the originator
to use the plurality of differentiability-scored TUs in a search of
a plurality of documents.
[0044] In yet another representative embodiment, a computing device
has one or more processors and a non-transitory, computer-readable
medium storing a program that is executable by the one or more
processors. The program comprises instructions to: receive a
document from a requestor; generate a plurality of term units (TUs)
from the document, wherein each term unit (TU) of the plurality of
TUs comprises a word, a multi-word, a number, or a symbol, the
plurality of TUs consisting of a first number of TUs; identify a
plurality of differentiable terms of the plurality of TUs, the
plurality of differentiable terms consisting of a second number of
TUs; determine a differentiability score for each TU of the
plurality of TUs; determine a document score for the document, the
document score based on a ratio of the second number of TUs to the
first number of TUs; and transmit a plurality of differentiable TUs
to the requestor in order of at least one of ascending
differentiability score or descending differentiability score. The
program may further comprise instructions to prioritize a set of
topics based on differentiability scores, wherein determination of
the differentiability score is based on a grammatical scheme or a
functional scheme, wherein: the grammatical scheme may comprise
classification as a noun, a modifier, an adverb, or a verb; and the
functional scheme may comprise classification based on a linguistic
scope. The computing device may comprise a computer, a laptop
computer, a personal computer, a server computer, a personal data
assistant, a camera, a phone, a cell phone, a mobile phone, a smart
phone, a tablet, a media server, a music player, a game box, a data
storage device, a measuring device, a handheld scanner, a scanning
device, a barcode reader, a point-of-sale (POS) device, a digital
assistant, a desk phone, an Internet Protocol (IP) phone, a
solid-state memory device, or a memory card. The requestor may
comprise a human user, a process of the computing device, or a
second (different) computing device.
* * * * *