U.S. patent application number 15/823492 was filed with the patent office on 2017-11-27 and published on 2018-03-29 for mining multi-lingual data.
The applicant listed for this patent is Facebook, Inc. Invention is credited to Matthias Gerhard Eck, Alexander Waibel, Yury Andreyevich Zemlyanskiy, Ying Zhang.
Application Number | 15/823492 |
Publication Number | 20180089178 |
Family ID | 56094530 |
Publication Date | 2018-03-29 |
United States Patent Application | 20180089178 |
Kind Code | A1 |
Eck; Matthias Gerhard; et al. | March 29, 2018 |
MINING MULTI-LINGUAL DATA
Abstract
Technology is disclosed for mining training data to create
machine translation engines. Training data can be mined as
translation pairs from single content items that contain multiple
languages; multiple content items in different languages that are
related to the same or similar target; or multiple content items
that are generated by the same author in different languages.
Locating content items can include identifying potential sources of
translation pairs that fall into these categories and applying
filtering techniques to quickly gather those that are good
candidates for being actual translation pairs. When actual
translation pairs are located, they can be used to retrain a
machine translation engine as in-domain for social media content
items.
Inventors: | Eck; Matthias Gerhard (Mountain View, CA); Zhang; Ying (Turlock, CA); Zemlyanskiy; Yury Andreyevich (San Francisco, CA); Waibel; Alexander (Murrysville, PA) |
Applicant: |
Name | City | State | Country | Type |
Facebook, Inc. | Menlo Park | CA | US | |
Family ID: | 56094530 |
Appl. No.: | 15/823492 |
Filed: | November 27, 2017 |
Related U.S. Patent Documents
Application Number | Filing Date | Patent Number |
14559540 | Dec 3, 2014 | 9864744 |
15823492 | | |
Current U.S. Class: | 1/1 |
Current CPC Class: | G06F 40/45 20200101; G06F 16/951 20190101; G06F 40/58 20200101; G06F 40/44 20200101 |
International Class: | G06F 17/28 20060101 G06F017/28 |
Claims
1. A method, performed by a computing device, for mining
translation pairs for training in-domain machine translation
engines, the method comprising: obtaining one or more sources of
potential translation pairs comprising one or more content items,
wherein the one or more sources of potential translation pairs are
in an identified domain for which a machine translation engine is
to be trained; generating one or more potential translation pairs
from the obtained one or more sources of potential translation
pairs by applying one or more automated filtering techniques to the
obtained one or more sources of potential translation pairs,
wherein one of the one or more automated filtering techniques
applied to a selected obtained source of potential translation
pairs is configured based on a label of the selected obtained
source of potential translation pairs, and wherein each of the one
or more potential translation pairs comprises at least two language
snippets; selecting at least one actual translation pair from the
generated one or more potential translation pairs; and training the
machine translation engine using the selected at least one actual
translation pair, the training including assigning an in-domain
classification to the machine translation engine, according to a
type for sources of translation pairs used to train that machine
translation engine.
2. The method of claim 1, wherein: the obtained one or more sources
of potential translation pairs comprise single content items that
each contain multiple languages; each of the at least two language
snippets for each potential translation pair is a portion of one of
the single content items; and each of the at least two language
snippets for each potential translation pair comprises two or more
consecutive words for which a particular language has been
identified.
3. The method of claim 2, wherein applying the one of the one or more
automated filtering techniques comprises eliminating from
consideration an unlikely potential translation pair of the one or
more potential translation pairs by: determining a first count of
terms in a first of the at least two language snippets of the
unlikely potential translation pair; determining a second count of
terms in a second of the at least two language snippets of the
unlikely potential translation pair; computing that a ratio of
terms between the first count of terms and the second count of
terms is beyond a specified threshold value; and in response to the
computing that the ratio of terms is beyond the specified threshold
value, eliminating from consideration the unlikely potential
translation pair.
4. The method of claim 1: wherein each of the obtained one or more
sources of potential translation pairs comprise multiple content
items in different languages; wherein the multiple content items in
different languages of each individual obtained one or more sources
of potential translation pairs are related to the same target; and
wherein the at least two language snippets for each potential
translation pair are: from different ones of the multiple content
items of one of the obtained one or more sources of potential
translation pairs, and in different languages.
5. The method of claim 1, wherein the obtained one or more sources
of potential translation pairs comprise multiple content items that
are linked to the same social graph node.
6. The method of claim 1, wherein the obtained one or more sources
of potential translation pairs comprise multiple content items that
contain the same URL target.
7. The method of claim 1, wherein applying the one of the one or
more automated filtering techniques comprises eliminating from
consideration an unlikely potential translation pair by:
determining, for each of multiple content items comprised by the
unlikely potential translation pair, a corresponding time indicator
specifying when that content item was created or published;
computing that the determined time indicators are not within a
specified time window threshold; and in response to computing that
the time indicators are not within the specified time window
threshold, eliminating from consideration the unlikely potential
translation pair.
8. The method of claim 1, wherein applying the one of the one or
more automated filtering techniques comprises: dividing a first
content item comprising multiple first content items into a first
group of sentences; dividing a second content item comprising
multiple second content items into a second group of sentences;
receiving an identification of a particular segment length;
dividing each sentence of the first group of sentences into a third
group of consecutive term segments each segment of length no
greater than the particular segment length; dividing each sentence
of the second group of sentences into a fourth group of consecutive
term segments each segment of length no greater than the identified
segment length; finding at least one segment match between a
particular segment of the third group of consecutive term segments
and a particular segment of the fourth group of consecutive term
segments by determining that a specified threshold number of terms
between the particular segment of the third group and the
particular segment of the fourth group are translations of each
other; and in response to the finding of at least one segment
match, generating as the one or more potential translation pairs,
each permutation of sentence pairs where one sentence of each
sentence pair is selected from the first group of sentences and the
other sentence of each sentence pair is selected from the second
group of sentences.
9. The method of claim 8, wherein the received identification of
the particular segment length identifies a segment length of three
terms.
10. The method of claim 1: wherein the obtained one or more sources
of potential translation pairs comprise multiple content items that
are generated by the same author; and wherein the at least two
language snippets for each potential translation pair: are from
different ones of the multiple content items, are in different
languages, and were published within a time window of each
other.
11. The method of claim 1, wherein applying the one of the one or
more automated filtering techniques comprises applying smoothing to
at least one of the obtained one or more sources of potential
translation pairs by: identifying one or more language
classifications for at least one term in the one or more obtained
sources of potential translation pairs as a mistaken
classification; and changing the classification for the at least
one term to a language classification of an adjacent term.
12. The method of claim 1, wherein applying the one of the one or
more automated filtering techniques comprises: receiving an
identification of one or more desired languages; for at least one
selected language snippet of the at least two language snippets of
each potential translation pair, identifying a language for the at
least one selected language snippet; and determining that the
identified language for the at least one selected language snippet
is one of the one or more desired languages.
13-20. (canceled)
21. The method of claim 1, wherein the identified domain for which
the machine translation engine is to be trained is a social media
domain.
22. The method of claim 1, wherein the selecting the at least one
actual translation pair from the generated one or more potential
translation pairs includes determining that the two language
snippets of the at least one of the one or more potential
translation pairs are translations of each other.
23. The method of claim 22, wherein the determining that the two
language snippets are translations of each other is based on
identified characteristics from each of the two language snippets,
the identified characteristics comprising at least one of: a ratio
of a number of words; an IBM score; a number of covered words; a
length of a longest sequence of covered words; a length of a
longest sequence of not-covered words; a maximal number of
consequent source words which have corresponding consequent target
words; a maximum number of consequent not-covered words; or any
combination thereof.
24. A computer-readable storage medium storing instructions that,
when executed by a computing system, cause the computing system to
perform operations for mining translation pairs, the operations
comprising: obtaining one or more sources of potential translation
pairs comprising one or more content items; generating one or more
potential translation pairs from the obtained one or more sources
of potential translation pairs by applying one or more automated
filtering techniques to the obtained one or more sources of
potential translation pairs, wherein one of the one or more
automated filtering techniques applied to a selected obtained
source of potential translation pairs is configured based on a
characteristic of the selected obtained source of potential
translation pairs, and wherein each of the one or more potential
translation pairs comprises at least two language snippets;
selecting at least one actual translation pair from the generated
one or more potential translation pairs; and training the machine
translation engine using the selected at least one actual
translation pair, the training including assigning an in-domain
classification to the machine translation engine, according to the
type for sources of translation pairs used to train that machine
translation engine.
25. The computer-readable storage medium of claim 24: wherein each
of the obtained one or more sources of potential translation pairs
comprise multiple content items in different languages; and wherein
the at least two language snippets for each potential translation
pair are: from different ones of the multiple content items of one
of the obtained one or more sources of potential translation pairs
and are in different languages.
26. The computer-readable storage medium of claim 24, wherein
applying the one of the one or more automated filtering techniques
comprises eliminating from consideration an unlikely potential
translation pair by: determining, for each of multiple content
items comprised by the unlikely potential translation pair, a
corresponding time indicator specifying when that content item was
created or published; computing that the determined time indicators
are not within a specified time window threshold; and in response
to computing that the time indicators are not within the specified
time window threshold, eliminating from consideration the unlikely
potential translation pair.
27. A computing system for mining translation pairs, the computing
system comprising: one or more processors; and a memory storing
instructions that, when executed by the one or more processors,
cause the computing system to perform operations comprising:
obtaining one or more sources of potential translation pairs
comprising one or more content items; generating one or more
potential translation pairs from the obtained one or more sources
of potential translation pairs by applying one or more automated
filtering techniques configured based on a characteristic of the
selected obtained source of potential translation pairs, wherein
each of the one or more potential translation pairs comprises at
least two language snippets; selecting at least one actual
translation pair from the generated one or more potential
translation pairs; and training the machine translation engine
using the selected at least one actual translation pair, the
training including assigning an in-domain classification to the
machine translation engine, according to a type for sources of
translation pairs used to train that machine translation
engine.
28. The computing system of claim 27: wherein the obtained one or
more sources of potential translation pairs comprise multiple
content items that are generated by the same author; and wherein
the at least two language snippets for each potential translation
pair: were identified as being from different ones of the multiple
content items, were identified as being in different languages, and
were identified as being published within a specified time window
of each other.
Description
CROSS-REFERENCES TO RELATED APPLICATIONS
[0001] This application is a continuation of U.S. patent application
Ser. No. 14/559,540, filed on Dec. 3, 2014; the disclosure of which
is hereby incorporated herein in its entirety by reference.
BACKGROUND
[0002] The Internet has made it possible for people to connect and
share information globally in ways previously undreamt of. Social
media platforms, for example, enable people on opposite sides of
the world to collaborate on ideas, discuss current events, or just
share what they had for lunch. In the past, this spectacular
resource has been somewhat limited to communications between users
having a common natural language ("language"). In addition, users
have only been able to consume content that is in their language,
or for which a content provider is able to determine an appropriate
translation.
[0003] While communication across the many different languages used
around the world is a particular challenge, several machine
translation engines have attempted to address this concern. Machine
translation engines enable a user to select or provide a content
item (e.g., a message from an acquaintance) and quickly receive a
translation of the content item. Machine translation engines can be
created using training data that includes identical or similar
content in two or more languages. Multilingual training data is
generally obtained from news reports, parliament domains,
educational "wiki" sources, etc. In many cases, the source of the
training data that is used to create a machine translation engine
is from a considerably different domain than the content on which
that machine translation engine is used for translations. For
example, content in the social media domain often includes slang
terms, colloquial expressions, spelling errors, incorrect
diacritical marks, and other features not common in carefully
edited news sources, parliament documents, or educational wiki
sources.
BRIEF DESCRIPTION OF THE DRAWINGS
[0004] FIG. 1 is a block diagram illustrating an overview of
devices on which some embodiments of the disclosed technology can
operate.
[0005] FIG. 2 is a block diagram illustrating an overview of an
environment in which some embodiments of the disclosed technology
can operate.
[0006] FIG. 3 is a block diagram illustrating components which, in
some embodiments, can be used in a system implementing the
disclosed technology.
[0007] FIG. 4 is a flow diagram illustrating a process used in some
embodiments for mining and using translation pairs from social
media sources.
[0008] FIG. 5A is a flow diagram illustrating a process used in
some embodiments for locating potential translation pairs from a
single content item.
[0009] FIG. 5B is a flow diagram illustrating a process used in
some embodiments for locating potential translation pairs from
multiple content items corresponding to the same or similar
target.
[0010] FIG. 5C is a flow diagram illustrating a process used in
some embodiments for locating potential translation pairs from
multiple content items generated by the same author.
[0011] FIG. 6 is a flow diagram illustrating a process used in some
embodiments for selecting actual translation pairs from potential
translation pairs.
[0012] FIG. 7 is a flow diagram illustrating a process used in some
embodiments for selecting a machine translation engine based on a
content item classification.
DETAILED DESCRIPTION
[0013] Implementations of machine translation engines are described
herein that are trained using in-domain training data that are
potential translation pairs from 1) single content items that
contain multiple languages; 2) multiple content items in different
languages that are related to the same or similar target; or 3)
multiple content items that are generated by the same author in
different languages. These sources can be filtered to remove
potential sources that are unlikely to contain translations.
Remaining potential translations can be analyzed to obtain
in-domain training data. This process improves the ability of
machine translation engines to automatically translate text without
requiring significant manual input.
[0014] One of the challenges in building machine translation
engines is a lack of current, in-domain translation pair training
data. As used herein, a "translation pair" or "actual translation
pair" is a set of two language snippets where the language snippets
are in different languages and one language snippet is a
translation of the other. As used herein, a "language snippet" is a
representation of one or more words that are all in the same
language. It is impractical for humans to manually create
translation pairs for the purpose of generating current in-domain
training data. In the social media domain, for example, the volume
of content items is too massive for representative samples to be
manually translated for creating in-domain training data.
Furthermore, various subdomains can exist within any domain. For
example, there could be a London dweller subdomain where a content
item posted on a local event webpage included the text "I think the
Chelsea referee is bent," meaning the author thinks the referee is
dishonest or corrupt. A Spanish translation by an out-of-domain
machine translation engine or by a general social media domain
machine translation engine could generate the translation "Creo que
el arbitro Chelsea se dobla," the literal meaning of which is that
the author thinks the referee is contorted at an angle. In
addition, in the social media domain, the language used can change
as segments of the population adopt new slang terms and grammar, as
words are used in a manner inconsistent with standard definitions,
or as specialized punctuation is employed. However, creating
machine translation engines with training data that is not
in-domain or that is stale can significantly reduce the accuracy of
such machine translation engines.
[0015] Alternate sources of translation pairs, other than humans
generating translations for the purpose of creating training data,
can help generate current in-domain machine translation engines.
Three alternate sources of potential translation pairs are 1)
single content items that contain multiple languages; 2) multiple
content items in different languages that are related to the same
or similar target; and 3) multiple content items that are generated
by the same author in different languages. As used herein, a
"potential translation pair" is a set of two language snippets,
whether from the same or different sources, that have not been
verified as qualifying as a translation pair because one or both of
the language snippets have not been identified as in a language
desired for a translation pair, or because the language snippets
have not been verified as translations of each other. As used
herein, a "content item," "post," or "source" can be any recorded
communication including text, audio, or video. As examples, a
content item can be anything posted to a social media site such as
a "wall" post, comment, status message, fan post, news story, etc.
As used herein, a "target" of a content item is one or more of: a
part within the content item such as a URL, image, or video; an
area to which the content item is posted, such as the comment area
of another content item or a webpage for a particular topic such as
a fan page or event; or a node in a social graph to which the
content item points, links, or is otherwise related.
[0016] In social media systems where millions of content items can
be posted every hour, it is impractical to find translation pairs
by analyzing each potential translation pair source thoroughly to
find actual translation pairs if such pairs exist at all. By
locating content items that fall into one of the above three
categories, the sources that need to be thoroughly analyzed to find
actual translation pairs can be significantly reduced. Locating
content items, which can be social media posts or sub-parts
thereof, that fall into one of the above three categories can
include classifying language snippets of the selected items as
being in a particular language. Classifying language snippets can
be accomplished in numerous ways, such as by using context
classifiers, dictionary classifiers, or trained n-gram analysis
classifiers, as discussed in U.S. patent application Ser. No.
14/302,032. In addition, by applying filtering techniques described
below, the number of located potential translation pairs can be
further narrowed to quickly gather those that are good candidates
for further analysis to locate actual translation pairs.
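As an illustration only, the step of turning per-term language classifications into language snippets can be sketched as follows. The sketch assumes the per-term labels have already been produced by some classifier; the function name `language_snippets`, the `min_terms` parameter, and the sample post are hypothetical and are not part of the disclosure. Runs shorter than two terms are dropped, mirroring the "two or more consecutive words" language of claim 2.

```python
from itertools import groupby


def language_snippets(terms, labels, min_terms=2):
    """Group consecutive terms that share a language label into snippets.

    `terms` and `labels` are parallel lists; `labels` is assumed to come
    from an upstream language classifier (hypothetical here).
    """
    snippets = []
    # groupby only merges adjacent items, so each group is one consecutive run.
    for language, group in groupby(zip(terms, labels), key=lambda pair: pair[1]):
        words = [term for term, _ in group]
        if len(words) >= min_terms:
            snippets.append((language, " ".join(words)))
    return snippets


post = "happy birthday my friend feliz cumpleaños mi amigo".split()
post_labels = ["en"] * 4 + ["es"] * 4
print(language_snippets(post, post_labels))
```

A post that yields snippets in two different languages would then be a candidate single-content-item source.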
[0017] A first filtering technique can be applied for single
content items that contain multiple languages. In some
implementations, filtering for the single post source includes
eliminating from consideration posts where a ratio of the number of
terms between the language snippets of that post is beyond a
specified threshold value.
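A minimal sketch of the term-ratio filter described above, under stated assumptions: terms are approximated by whitespace-separated tokens, and the threshold value of 2.0 is an arbitrary placeholder, since the disclosure does not specify one. The function name `passes_ratio_filter` is hypothetical.

```python
def passes_ratio_filter(snippet_a: str, snippet_b: str, max_ratio: float = 2.0) -> bool:
    """Return True when the term-count ratio of two snippets is within max_ratio.

    The ratio is taken as larger count over smaller count, so the order of
    the snippets does not matter. Empty snippets never pass.
    """
    count_a = len(snippet_a.split())
    count_b = len(snippet_b.split())
    if count_a == 0 or count_b == 0:
        return False
    ratio = max(count_a, count_b) / min(count_a, count_b)
    return ratio <= max_ratio


# Four terms against four terms: kept as a potential translation pair.
print(passes_ratio_filter("I love this song", "me encanta esta canción"))
# One term against six terms: eliminated from consideration.
print(passes_ratio_filter("hello", "hola a todos mis amigos queridos"))
```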
[0018] A second filtering technique can be applied for the
potential translation pairs that are from multiple content items in
different languages and that are related to the same or similar
target. In some implementations, filtering these sources from
multiple content items in different languages includes eliminating
potential translation pairs that are not within a particular time
window of each other. In some implementations, filtering for
sources from multiple content items in different languages includes
comparing segments of content, such as three consecutive words
across different snippets, for substantial similarity, and where a
match is found, identifying the possible permutation of sentences
between the posts containing those segments as potential
translation pairs.
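The segment-comparison variant of this filter can be sketched roughly as below. This is an illustrative reading, not the disclosed implementation: it assumes sentences are already split, uses sliding windows of at most three whitespace-separated terms, and tests "substantial similarity" with a small hypothetical bilingual lexicon (`lexicon`) and a hypothetical `min_terms` threshold. When any segment match is found, every permutation of sentence pairs between the two posts is returned as potential translation pairs.

```python
from itertools import product


def segments(sentence: str, length: int = 3) -> list[tuple[str, ...]]:
    """Split a sentence into consecutive term segments of at most `length` terms."""
    terms = sentence.lower().split()
    if not terms:
        return []
    if len(terms) <= length:
        return [tuple(terms)]
    return [tuple(terms[i:i + length]) for i in range(len(terms) - length + 1)]


def segments_match(seg_a, seg_b, lexicon, min_terms: int = 2) -> bool:
    """True when at least `min_terms` terms of seg_a have a listed translation in seg_b."""
    target = set(seg_b)
    matched = sum(1 for term in seg_a if lexicon.get(term, set()) & target)
    return matched >= min_terms


def candidate_pairs(sentences_a, sentences_b, lexicon, length=3, min_terms=2):
    """Return all sentence-pair permutations if any segment match exists, else []."""
    for sent_a, sent_b in product(sentences_a, sentences_b):
        for seg_a, seg_b in product(segments(sent_a, length), segments(sent_b, length)):
            if segments_match(seg_a, seg_b, lexicon, min_terms):
                return list(product(sentences_a, sentences_b))
    return []


# Hypothetical toy lexicon mapping English terms to Spanish translations.
toy_lexicon = {"great": {"gran"}, "shot": {"tiro"}}
english = ["what a great shot", "go sounders"]
spanish = ["que gran tiro", "vamos sounders"]
print(candidate_pairs(english, spanish, toy_lexicon))
```

The returned pairs are only candidates; they would still need to be checked to decide whether they are actual translation pairs.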
[0019] A third filtering technique can be applied for the potential
translation pairs from multiple content items in different
languages that are by the same author. In some implementations,
filtering for the multiple-post, same-author source includes
eliminating potential translation pairs that were not posted
within a specified time (e.g., a sliding time window) of each
other.
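One way to picture the same-author time-window filter is the sketch below. The `Post` record, its field names, and the ten-minute window are assumptions made for illustration; the disclosure does not fix a window size or a data model.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from itertools import combinations


@dataclass(frozen=True)
class Post:
    """Hypothetical minimal record for a content item."""
    author: str
    language: str
    text: str
    posted_at: datetime


def same_author_pairs(posts, window=timedelta(minutes=10)):
    """Keep pairs by one author, in different languages, posted within `window`."""
    pairs = []
    ordered = sorted(posts, key=lambda post: post.posted_at)
    for earlier, later in combinations(ordered, 2):
        if earlier.author != later.author:
            continue
        if earlier.language == later.language:
            continue
        if later.posted_at - earlier.posted_at > window:
            continue
        pairs.append((earlier, later))
    return pairs


sample_posts = [
    Post("x", "en", "good morning everyone", datetime(2014, 12, 3, 12, 0)),
    Post("y", "es", "hola", datetime(2014, 12, 3, 12, 1)),
    Post("x", "es", "buenos días a todos", datetime(2014, 12, 3, 12, 5)),
    Post("x", "fr", "bonne nuit", datetime(2014, 12, 3, 13, 0)),
]
kept = same_author_pairs(sample_posts)
print([(a.text, b.text) for a, b in kept])
```

Comparing every pair is quadratic in the number of posts; a production system would more plausibly slide a window over posts sorted per author, but that optimization is omitted here for clarity.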
[0020] In some implementations, there can be a desired one or more
languages for the translation pair. In these implementations, each
of the filtering techniques can further eliminate potential
translation pairs that are not in those desired languages. In some
implementations, each of these filtering techniques may also apply
a smoothing technique to change term classifications that might be
mistakes. For example, in a post that read, "we went to mi house,"
the typo of "mi" instead of "my" may have caused "mi" to be
classified as Spanish. Applying smoothing in this example can cause
this single Spanish classified word, surrounded by other
non-Spanish words, to be reclassified to the surrounding
classification.
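The smoothing step in the "we went to mi house" example can be sketched as follows. This is a deliberately conservative assumption-laden version: it only relabels an interior term whose two neighbors agree with each other and disagree with it, and it leaves the first and last terms untouched. The function name `smooth_labels` is hypothetical.

```python
def smooth_labels(labels: list[str]) -> list[str]:
    """Relabel isolated terms to the language of their agreeing neighbors.

    Decisions are made against the original labels, so one correction
    never cascades into the next position.
    """
    smoothed = list(labels)
    for i in range(1, len(labels) - 1):
        prev_label, next_label = labels[i - 1], labels[i + 1]
        if prev_label == next_label and labels[i] != prev_label:
            smoothed[i] = prev_label
    return smoothed


terms = "we went to mi house".split()
term_labels = ["en", "en", "en", "es", "en"]
print(list(zip(terms, smooth_labels(term_labels))))
```

After smoothing, the post contains a single language and would no longer be treated as a multi-language content item.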
[0021] While general machine translation engines do not perform
very well in creating out-of-domain translations, they perform much
better in identifying whether two language snippets of a potential
translation pair are actual translations of each other. For
example, in a single post that contains two language snippets in
different languages, the snippets are likely to either be exact
translations of each other, or they were the result of a user
switching to a different language mid-post, in which case the two
snippets are unlikely to have significant overlapping terms.
General machine translation engines are able to reliably
distinguish between these cases. Therefore, in some
implementations, once potential translation pairs are obtained, a
general machine translation engine can be used to determine whether
they are actual translation pairs or not. However, in some
implementations, much more involved comparisons are necessary to
determine actual translation pairs from potential translation
pairs. For example, where two posts are selected as having the same
or similar target, they are likely to have similar terms but not be
translations of each other. For example, two people could post a
link to a video clip of a soccer goal by the Seattle Sounders
Football Club. The first post may include the text "OMG, what a
great shot, go Sounders!" The second post may include a Japanese
version of "OMG, what a major fail, go away Sounders!" While having
significantly different meanings, they use many of the same terms
or similar terms such as "great" and "major." Therefore, determining
whether to classify these as a translation pair may require
significantly more analysis than just comparing terms for
equivalence.
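Claim 23 lists characteristics such as a ratio of word counts, a number of covered words, and the longest sequences of covered and not-covered words. A rough sketch of how such characteristics might be computed is given below. It assumes, as an interpretation not stated in the source, that a source word is "covered" when a hypothetical bilingual lexicon lists a translation for it that appears anywhere in the target snippet. The IBM score mentioned in the claim is not computed here.

```python
def coverage_features(source: str, target: str, lexicon: dict[str, set[str]]) -> dict[str, float]:
    """Compute simple coverage characteristics for a potential translation pair."""
    source_terms = source.lower().split()
    target_terms = target.lower().split()
    target_vocab = set(target_terms)
    # Assumption: "covered" means some listed translation occurs in the target.
    covered = [bool(lexicon.get(term, set()) & target_vocab) for term in source_terms]

    longest_covered = longest_uncovered = 0
    run_covered = run_uncovered = 0
    for is_covered in covered:
        if is_covered:
            run_covered += 1
            run_uncovered = 0
        else:
            run_uncovered += 1
            run_covered = 0
        longest_covered = max(longest_covered, run_covered)
        longest_uncovered = max(longest_uncovered, run_uncovered)

    return {
        "word_ratio": len(source_terms) / max(len(target_terms), 1),
        "covered_words": sum(covered),
        "longest_covered_run": longest_covered,
        "longest_uncovered_run": longest_uncovered,
    }


# Hypothetical toy lexicon mapping English terms to Spanish translations.
feature_lexicon = {"what": {"que"}, "great": {"gran"}, "shot": {"tiro"}}
features = coverage_features("what a great shot", "que gran tiro", feature_lexicon)
print(features)
```

Features of this kind could feed a classifier or a set of thresholds that decides whether a potential translation pair is an actual translation pair; the choice of decision rule is left open here.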
[0022] When actual translation pairs are located, they can be used
to retrain a machine translation engine. The retrained machine
translation engine would then be more domain specific and can
classify input data according to a domain or subdomain identified
for the translation pair training data. Subsequently, when a
request for the translation of a content item is received, a
corresponding in-domain machine translation engine can be selected.
Furthermore, when the content item is identified as being in a
particular subdomain, a machine translation engine specialized for
that subdomain can be selected to perform the translation.
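The engine selection described in this paragraph can be sketched as a lookup with fallbacks. The registry layout (a dictionary keyed by domain and optional subdomain), the engine names, and the function `select_engine` are all hypothetical; they only illustrate preferring a subdomain engine, then a domain engine, then a general engine.

```python
from typing import Optional


def select_engine(
    engines: dict[tuple[str, Optional[str]], str],
    domain: str,
    subdomain: Optional[str] = None,
    default: str = "general-engine",
) -> str:
    """Pick the most specific engine registered for the content item's classification."""
    if subdomain is not None and (domain, subdomain) in engines:
        return engines[(domain, subdomain)]
    # Fall back to the domain-level engine, then to the general engine.
    return engines.get((domain, None), default)


# Hypothetical registry of trained engines.
registry = {
    ("social_media", None): "social-media-engine",
    ("social_media", "london"): "social-media-london-engine",
}
print(select_engine(registry, "social_media", "london"))
print(select_engine(registry, "social_media", "tokyo"))
print(select_engine(registry, "news"))
```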
[0023] Several embodiments of the described technology are
discussed below in more detail in reference to the figures. Turning
now to the figures, FIG. 1 is a block diagram illustrating an
overview of devices 100 on which some embodiments of the disclosed
technology may operate. The devices can comprise hardware
components of a device 100 that is configured to mine translation
pairs. Device 100 can include one or more input devices 120 that
provide input to the CPU (processor) 110, notifying it of actions.
The actions are typically mediated by a hardware controller that
interprets the signals received from the input device and
communicates the information to the CPU 110 using a communication
protocol. Input devices 120 include, for example, a mouse, a
keyboard, a touchscreen, an infrared sensor, a touchpad, a wearable
input device, a camera- or image-based input device, a microphone,
or other user input devices.
[0024] CPU 110 can be a single processing unit or multiple
processing units in a device or distributed across multiple
devices. CPU 110 can be coupled to other hardware devices, for
example, with the use of a bus, such as a PCI bus or SCSI bus. The
CPU 110 can communicate with a hardware controller for devices,
such as for a display 130. Display 130 can be used to display text
and graphics. In some examples, display 130 provides graphical and
textual visual feedback to a user. In some implementations, display
130 includes the input device as part of the display, such as when
the input device is a touchscreen or is equipped with an eye
direction monitoring system. In some implementations, the display
is separate from the input device. Examples of display devices are:
an LCD display screen, an LED display screen, a projected display
(such as a heads-up display device or a head-mounted device), and
so on. Other I/O devices 140 can also be coupled to the processor,
such as a network card, video card, audio card, USB, firewire or
other external device, camera, printer, speakers, CD-ROM drive, DVD
drive, disk drive, or Blu-Ray device.
[0025] In some implementations, the device 100 also includes a
communication device capable of communicating wirelessly or
wire-based with a network node. The communication device can
communicate with another device or a server through a network
using, for example, TCP/IP protocols. Device 100 can utilize the
communication device to distribute operations across multiple
network devices.
[0026] The CPU 110 has access to a memory 150. A memory includes
one or more of various hardware devices for volatile and
non-volatile storage, and can include both read-only and writable
memory. For example, a memory can comprise random access memory
(RAM), CPU registers, read-only memory (ROM), and writable
non-volatile memory, such as flash memory, hard drives, floppy
disks, CDs, DVDs, magnetic storage devices, tape drives, device
buffers, and so forth. A memory is not a propagating signal
divorced from underlying hardware; a memory is thus non-transitory.
Memory 150 includes program memory 160 that stores programs and
software, such as an operating system 162, translation pair miner
164, and any other application programs 166. Memory 150 also
includes data memory 170 that can include dictionaries and
lexicons, multi-lingual single social media posts, social media
posts with a common target, social media posts with a common
author, machine translation engines, domain and subdomain machine
translation engine classifiers, configuration data, settings, and
user options or preferences which can be provided to the program
memory 160 or any element of the device 100.
[0027] The disclosed technology is operational with numerous other
general purpose or special purpose computing system environments or
configurations. Examples of well-known computing systems,
environments, and/or configurations that may be suitable for use
with the technology include, but are not limited to, personal
computers, server computers, handheld or laptop devices, cellular
telephones, wearable electronics, tablet devices, multiprocessor
systems, microprocessor-based systems, set-top boxes, programmable
consumer electronics, network PCs, minicomputers, mainframe
computers, distributed computing environments that include any of
the above systems or devices, and the like.
[0028] FIG. 2 is a block diagram illustrating an overview of an
environment 200 in which some embodiments of the disclosed
technology may operate. Environment 200 can include one or more
client computing devices 205A-D, examples of which may include
device 100. Client computing devices 205 can operate in a networked
environment using logical connections 210 through network 230 to
one or more remote computers such as a server computing device.
[0029] In some implementations, server 210 can be an edge server
which receives client requests and coordinates fulfillment of those
requests through other servers, such as servers 220A-C. Server
computing devices 210 and 220 can comprise computing systems, such
as device 100. Though each server computing device 210 and 220 is
displayed logically as a single server, server computing devices
can each be a distributed computing environment encompassing
multiple computing devices located at the same or at geographically
disparate physical locations. In some implementations, each server
220 corresponds to a group of servers.
[0030] Client computing devices 205 and server computing devices
210 and 220 can each act as a server or client to other
server/client devices. Server 210 can connect to a database 215.
Servers 220A-C can each connect to a corresponding database 225A-C.
As discussed above, each server 220 may correspond to a group of
servers, and each of these servers can share a database or can have
their own database. Databases 215 and 225 can warehouse (e.g.
store) information such as lexicons, machine translation engines,
social media posts and other data to search for potential
translation pairs, and actual translation pairs that have been
located. Though databases 215 and 225 are displayed logically as
single units, databases 215 and 225 can each be a distributed
computing environment encompassing multiple computing devices, can
be located within their corresponding server, or can be located at
the same or at geographically disparate physical locations.
[0031] Network 230 can be a local area network (LAN) or a wide area
network (WAN), but can also be other wired or wireless networks.
Network 230 may be the Internet or some other public or private
network. The client computing devices 205 can be connected to
network 230 through a network interface, such as by wired or
wireless communication. While the connections between server 210
and servers 220 are shown as separate connections, these
connections can be any kind of local, wide area, wired, or wireless
network, including network 230 or a separate public or private
network.
[0032] FIG. 3 is a block diagram illustrating components 300 which,
in some embodiments, can be used in a system implementing the
disclosed technology. The components 300 include hardware 302,
general software 320, and specialized components 340. As discussed
above, a system implementing the disclosed technology can use
various hardware including central processing units 304, working
memory 306, storage memory 308, and input and output devices 310.
Components 300 can be implemented in a client computing device such
as client computing devices 205 or on a server computing device,
such as server computing device 210 or 220.
[0033] General software 320 can include various applications
including an operating system 322, local programs 324, and a BIOS
326. Specialized components 340 can be subcomponents of a general
software application 320, such as a local program 324. Specialized
components 340 can include single item potential pair finder 344,
multiple item potential pair finder 346, same author potential pair
finder 348, actual pair analyzer 350, machine translation engine
selector 352, and components which can be used for controlling and
receiving data from the specialized components, such as interface
342.
[0034] Single item potential pair finder 344 can be configured to
obtain and filter potential translation pairs from singular social
media content items. This can be accomplished by locating, within
the social media content items, items that contain a language
snippet classified as in a different language from another language
snippet of that item. Single item potential pair finder 344 can
determine that some of these content items are either not relevant
for a desired language or are not likely actual translation pairs
and thus eliminate these potential translation pairs from
consideration. This filtering can occur by eliminating the content
items whose language snippet classifications do not match a desired
set of languages for which training data is being gathered. In
addition, this filtering can occur by computing, for two or more
language snippets of a content item in different languages, a ratio
of terms between the language snippets. If the ratio is too high or
too low, for example outside a specified threshold window such as
3:2 to 2:3, then it is unlikely that these snippets are translations
of one another, and they can be eliminated. In some implementations, the
threshold window can be set based on the languages that are being
compared. For example, it may be determined that average German
phrases use twice as many words as the same phrases in Chinese.
Thus, the threshold ratio for German/Chinese can be set to a value
larger than two to one, such as 3:1. Such filtering, as described
here and below in relation to other filtering systems and
procedures, can comprise either selecting the initial set of
eligible content items and then removing those that are not desired
or that are not likely translation pairs, or can comprise
additional parameters for the initial search for content items,
such that only content items that are in desired languages and that
are likely translation pairs are selected as potential translation
pairs.
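By way of a non-limiting illustration, the term ratio filtering described above could be sketched in Python as follows. The names term_count, within_ratio_window, DEFAULT_WINDOW, and PAIR_WINDOWS are hypothetical, as are the language codes. The lower bound of the German/Chinese window and the use of whitespace tokenization are assumptions made only for this sketch.

```python
from fractions import Fraction

# Hypothetical windows (low, high) on the ratio
# terms(first snippet) / terms(second snippet). Values are illustrative only.
DEFAULT_WINDOW = (Fraction(2, 3), Fraction(3, 2))
PAIR_WINDOWS = {
    # Assumed window for a German first snippet and a Chinese second snippet.
    ("de", "zh"): (Fraction(1, 1), Fraction(3, 1)),
}


def term_count(snippet):
    # Whitespace tokenization is an assumption; some languages would need
    # a real word segmenter.
    return len(snippet.split())


def within_ratio_window(first, second, lang_pair=None):
    """Return True if the term ratio of the two snippets is inside the window."""
    n_first, n_second = term_count(first), term_count(second)
    if n_first == 0 or n_second == 0:
        return False
    low, high = PAIR_WINDOWS.get(lang_pair, DEFAULT_WINDOW)
    return low <= Fraction(n_first, n_second) <= high
```

Exact fractions are used here so that a ratio sitting exactly on a window boundary, such as 3:2, is not rejected because of floating-point rounding.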
[0035] Multiple item potential pair finder 346 can be configured to
obtain and filter additional types of potential translation pairs
from social media content items. This can be accomplished by
locating, within the social media content items, pairs of content
items that are for the same or similar target. A pair of content
items that are for the same or similar target, in various
implementations, are ones that contain the same or similar element
such as a URL, an image, a video, or a document, or that are posted
to or about the same or similar other content item, such as a forum
for a particular topic, a comments area of another post, within a
group chat session, a fan site, a page for reviews of an item or a
service, etc. The number of posted content items that multiple item
potential pair finder 346 can locate can be quite large in large
social media sites where the same URL, for example, could be shared
millions of times, and the permutations of different language pairs
increase the size of this set exponentially. Multiple item
potential pair finder 346 can determine that some of these content
item pairs are either not relevant for a desired language or are
not likely actual translation pairs and thus eliminate such
potential translation pairs from consideration. This filtering can
occur by only identifying content items as a potential translation
pair if the content items are within a specified threshold amount
of time of one another. In addition, the filtering can comprise
eliminating individual content items which, individually or as a
pair, do not match one or more desired languages. Furthermore,
individual content items can be split (divided) into segments of a
particular length, such as three words. Using a matching algorithm,
such as translation dictionary matching, segments from content
items in a first language can be compared to segments in other
content items for the same or similar target in another language.
The level of match required between segments can vary across
implementations. For example, in various implementations, a match
may be found when all the words of a segment match, when a
percentage such as at least 75% match, or when at least a number,
such as two, match. All potential translation pairs can be
eliminated that do not have at least one matching segment between
the pair.
[0036] Same author potential pair finder 348 can be configured to
obtain and filter potential translation pairs from social media
content items that are from the same author. These potential
translation pairs can be filtered based on being in different
languages and being within a sliding time window. As in the
filtering methods above, these potential translation pairs can also
be filtered by eliminating those that are not in a desired
language.
[0037] Potential translation pairs from single item potential pair
finder 344, multiple item potential pair finder 346, and same
author potential pair finder 348 can be passed to actual pair
analyzer 350. Actual pair analyzer 350 can be configured to analyze
potential translation pairs to determine whether they comprise an
actual translation pair. Depending on the source of the potential
translation pair, this can be accomplished in a variety of ways.
For example, when the content item source is a single "wall" post
or multiple posts by a single author, and therefore the language
snippets of resulting potential translation pairs are only likely
to be similar if they are translations of each other, a general
machine translation engine can be used to quickly determine whether
they are actual translations. However, when the source is multiple
content items that share a target, and thus are likely to be
similar without being translations, a more advanced analysis can be
performed. Such an advanced analysis can include identifying a
number of characteristics of each language snippet and using them
to perform an in-depth analysis to identify actual translation
pairs. Furthermore, determining actual translation pairs can be a
two-step process in which, first, a general machine translation
engine is used to determine whether the potential translation pair
is an actual translation, and if the results from the general
machine translation engine are inconclusive, the more advanced
analysis can be performed.
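As one hedged sketch of the two-step process described above, the following Python function runs an inexpensive check first and falls back to a detailed analysis only when the first result is inconclusive. The callables quick_score and detailed_check, and the accept and reject cutoffs, are hypothetical placeholders for a general machine translation engine comparison and the more advanced analysis, respectively.

```python
def is_actual_pair(pair, quick_score, detailed_check, accept=0.8, reject=0.3):
    """Two-step check for whether a potential pair is an actual translation pair.

    quick_score(pair) -> float in [0, 1] and detailed_check(pair) -> bool are
    hypothetical callables supplied by the caller. The cutoffs are illustrative.
    """
    score = quick_score(pair)
    if score >= accept:
        return True   # cheap check is conclusive: accept
    if score <= reject:
        return False  # cheap check is conclusive: reject
    # Inconclusive: perform the more expensive analysis.
    return detailed_check(pair)
```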
[0038] The machine translation engine selector 352 can select a
particular machine translation engine to use to fulfill a request
to translate a particular content item. In some cases, a content
item that has been requested to be translated is associated with a
particular domain or subdomain. Machine translation engine selector
352 can select a machine translation engine to translate that
content item which most closely matches the domain or subdomain of
the content item. In some implementations, domains and subdomains
may be logically organized as a tree structure, and the machine
translation engine selector 352 may select the machine translation
engine corresponding to the lowest node (i.e. closest to a leaf
node) in the tree which matches the domain or subdomain of the
content item. For example, a content item could be classified in
the subdomain Welling United, which is a soccer club in London. A
domain tree could include the branch: (Social
Media)->(England)->(London)->(Soccer Fans)->(Chelsea).
The most closely matching machine translation engine could be the one
corresponding to the (Social
Media)->(England)->(London)->(Soccer Fans) node.
[0039] Those skilled in the art will appreciate that the components
illustrated in FIGS. 1-3 described above, and in each of the flow
diagrams discussed below, may be altered in a variety of ways. For
example, the order of the logic may be rearranged, substeps may be
performed in parallel, illustrated logic may be omitted, other
logic may be included, etc.
[0040] FIG. 4 is a flow diagram illustrating a process 400 used in
some embodiments for mining and using translation pairs from social
media sources. Translation pairs found by process 400 can be used,
for example, to train machine translation engines to be in-domain
for translating social media content items. Process 400 begins at
block 402. At block 404 sources of potential translation pairs are
obtained. The sources obtained at block 404 may be for a particular
domain, such as social media generally, or for a subdomain, such as
boat enthusiasts, people in Sydney, Australia, or the Xiang Chinese
dialect. Each of the potential translation pair sources found at
block 404 can be in any one or more of the following three
categories: 1) a single content item containing language snippets
in different languages, 2) multiple content items that have the
same or similar target; and 3) multiple content items by the same
author.
[0041] At block 406, the sources of potential translation pairs can
be filtered to eliminate potential translation pairs that are
unlikely to contain actual translation pairs. Depending on the
category of each potential translation pair source, various
filtering procedures can be applied. Filtering procedures that can
be applied for a single post containing language snippets in
different languages are described in greater detail below in
relation to FIG. 5A. Filtering procedures that can be applied for
multiple posts that have the same or similar target are described
in greater detail below in relation to FIG. 5B. Filtering
procedures that can be applied for multiple posts by the same
author are described in greater detail below in relation to FIG.
5C. Filtering procedures can be automatic or automated, meaning
that, though they may or may not be configured by a human, they are
applied by a computing system without the need for further human
input.
[0042] At block 408, remaining potential translation pairs are
analyzed to select actual translation pairs. Selecting potential
translation pairs that are actual translation pairs is discussed in
greater detail below in relation to FIG. 6.
[0043] At block 410, the selected translation pairs from block 408
can be used to train one or more in-domain machine translation
engines. In some embodiments, creating an in-domain machine
translation engine comprises retraining a previously created
machine translation engine with the selected translation pair
training data. This can comprise updating a general machine
translation engine or further training an in-domain machine
translation engine. In some embodiments, creating an in-domain
machine translation engine comprises using only training data from
the domain or subdomain of content items from which the resulting
machine translation engine will translate. Once a machine
translation engine is created it can be classified according to the
source of the training data used to create it. For example, a high
level "social media" machine translation engine can be created,
such as for English->Spanish; regional or dialectal machine
translation engines can be created such as Dublin->Mandarin;
topic based machine translation engines can be created such as
United States Navy Personnel->German. In some implementations,
combinations thereof can be created such as Russian Car
Enthusiast->General English. In some implementations, machine
translation engines can be used within the same language, such as
South Boston->Northern England or Australian Soccer
Fan->American Football fan. Use of the classifications assigned
to machine translation engines for the domain or subdomain is
described in greater detail below in relation to FIG. 7.
[0044] FIG. 5A is a flow diagram illustrating a process 500 used in
some embodiments for locating potential translation pairs from a
single item. Process 500 begins at block 502 and continues to block
504. At block 504 a potential translation pair from a single item
is received. As described above, a potential translation pair from
a single item can be a post where, within the post, multiple
languages are used. Content items that comprise language snippets
in multiple languages may be a good source for potential
translation pairs because there are many content item authors that
are attempting to reach audiences across language barriers, and
thus they create posts with the same content written in multiple
languages. In some implementations, these posts are collected from
particular sources where they are likely to contain translation
pairs, such as: when the post is by a business that has a
multi-lingual clientele, when the post is to a fan page focused in
a region where multiple languages are spoken, or where the user who
authored the post is known to be multilingual or is known to
interact with other users who are facile with at least one language
other than the primary language of the post author.
[0045] At block 506 the languages in the potential translation pair
can be identified. At block 508, the languages identified in block
506 are compared to desired languages for a machine translation
engine to be generated. For example, where the machine translation
engine to be generated is a Chinese->German machine translation
engine, content items that do not contain language snippets in both
Chinese and German are eliminated by blocks 506 and 508. If the
language(s) identified is a desired language, process 500 continues
to block 510. Otherwise, process 500 continues to block 516, where
it ends. In some implementations, process 500 is not performed to
obtain specific desired languages translation pairs, and thus in
these implementations, the operations of blocks 506 and/or 508 to
eliminate potential translation pairs that do not match desired
languages may not be performed.
[0046] At block 510 the content item identified at block 504 can be
smoothed to eliminate language classifications for small snippets
which are likely to be mistakenly classified. Smoothing can include
finding snippets that have a number of terms below a specified
smoothing threshold that are also surrounded by two portions which
have the same language classification as each other and that
language classification is different from the language
classification of the snippet. Such snippets are likely
misclassified, and thus the language classification of these
snippets can be changed to that of the portions surrounding that
snippet. For example, a content item that includes the text: "It's
all por the money," could be classified as three snippets 1)
English: "It's all" 2) Spanish: "por," and 3) English: "the money."
The specified smoothing threshold could be two words, so "por," a
single word surrounded by English snippets, would be reclassified
as English, making the entire post a single English snippet:
"It's all por the money."
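The smoothing at block 510 could be sketched, for illustration only, as the following Python function. The function name smooth, the (language, text) tuple representation, and whitespace tokenization are assumptions of this sketch rather than details taken from the disclosure.

```python
def smooth(snippets, threshold=2):
    """Relabel short snippets that sit between two same-language neighbours.

    snippets: list of (language, text) tuples in document order.
    threshold: snippets with fewer terms than this are candidates.
    """
    relabeled = list(snippets)
    for i in range(1, len(snippets) - 1):
        lang, text = snippets[i]
        prev_lang, next_lang = snippets[i - 1][0], snippets[i + 1][0]
        if (len(text.split()) < threshold
                and prev_lang == next_lang
                and prev_lang != lang):
            relabeled[i] = (prev_lang, text)

    # Merge neighbouring snippets that now share a language classification.
    merged = []
    for lang, text in relabeled:
        if merged and merged[-1][0] == lang:
            merged[-1] = (lang, merged[-1][1] + " " + text)
        else:
            merged.append((lang, text))
    return merged
```

Applied to the three snippets of the example above, this sketch yields one English snippet containing the whole post.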
[0047] At block 512 the post identified at block 504 is split
according to the portions that are snippets in different languages.
The possible permutations of different language pairs from these
snippets can be created. For example, if a post includes the
snippets: <German>, <English>, and <French>, the
resulting three permutations of potential translation pairs could
be <German><English>, <German><French>, and
<English><French>. If process 500 is being performed
with one or more desired languages, it can be that only the
permutations that include a snippet in at least one of those
desired languages are created or kept. Continuing the previous
example, if process 500 is being performed to create a
German/French social media machine translation engine, the only
permutation that would be created is the
<German><French> pair.
[0048] In some implementations, at block 512, the potential
translation pairs can also only be kept (or only ever created)
where a ratio between terms of that potential translation pair is
within a specified term ratio threshold. For example, the specified
term ratio threshold could be 3:1, indicating that only pairs of
language snippets where the number of terms in a first of the
snippets is no more than three times the number of terms of the
second snippet are kept. In
some implementations, the ratio could be independent of order, for
example the 3:1 ratio can be 3:1 or 1:3.
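The pair creation of block 512 could be sketched in Python as follows. The name candidate_pairs and its parameters are hypothetical. This sketch keeps a pair only when both of its languages are desired, which matches the German/French example above, and it applies the term ratio threshold independently of order. Both choices are assumptions, and other implementations could differ.

```python
from itertools import combinations


def term_count(text):
    # Whitespace tokenization is an assumption of this sketch.
    return len(text.split())


def candidate_pairs(snippets, desired=None, max_ratio=None):
    """Create potential translation pairs from the snippets of one content item.

    snippets: list of (language, text) tuples.
    desired: optional set of languages; both languages of a pair must be in it.
    max_ratio: optional order-independent bound, e.g. 3 for 3:1 or 1:3.
    """
    pairs = []
    for (lang_a, text_a), (lang_b, text_b) in combinations(snippets, 2):
        if lang_a == lang_b:
            continue
        if desired is not None and not {lang_a, lang_b} <= set(desired):
            continue
        if max_ratio is not None:
            n_a, n_b = term_count(text_a), term_count(text_b)
            if min(n_a, n_b) == 0 or max(n_a, n_b) > max_ratio * min(n_a, n_b):
                continue
        pairs.append(((lang_a, text_a), (lang_b, text_b)))
    return pairs
```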
[0049] At block 514, if the potential translation pair source
received at block 504 resulted in one or more potential translation
pairs, the potential translation pairs identified by process 500
are returned. Process 500 then continues to block 516, where it
ends.
[0050] FIG. 5B is a flow diagram illustrating a process 530 used in
some embodiments for locating potential translation pairs from
multiple items corresponding to the same or similar target. Users
that create posts for the same or similar target are likely to say
the same thing; therefore, such posts in different languages are a
good source of potential translation pairs. Process 530 begins at
block 532 and continues to block 534. At block 534 two sources of
multi-post potential translation pairs that are in different
languages and are directed to the same or similar target are
obtained. As discussed above, content items that are directed to
the same or similar target comprise those that either A) contain
the same item such as a URL, image, video, sound, or document or B)
are for the same topic, such as being a comment on the same post, a
message or post directed to the same user, or otherwise a content
item on a page or content area dedicated to the same subject. In
some implementations, only sources of potential translation pairs
that are within a specified threshold time of each other are
obtained. Because each source of multi-post potential translation
pairs can be paired with numerous other sources of multi-post
potential pairs, process 530 can be performed multiple times with
different permutations that comprise sources of multi-post
potential translation pairs that have previously been analyzed in
other permutations.
[0051] Similarly to block 506, at block 536 the languages in the
potential translation pair sources are identified. At block 538,
the languages identified in block 536 are compared to desired
languages for a machine translation engine to be generated. If the
identified language matches a desired language, process 530
continues to block 540. Otherwise, process 530 continues to block
548, where it ends. In some implementations, process 530 is not
being performed to obtain specific desired languages translation
pairs, and thus in these implementations, the operations of blocks
536 and/or 538 to eliminate potential translation pairs that do not
match desired languages may not be performed.
[0052] At block 540 each source of multi-post potential pairs
obtained at block 534 can be split into a group of sentences. The
group of sentences from the first source is referred to herein as
group X and the group of sentences from the second source is
referred to herein as group Y. In some implementations, smoothing,
as discussed above in relation to block 510 may also be applied to
the sentences in group X and/or group Y.
[0053] At block 542 each sentence from group X and group Y is
further split into segments of no more than a particular length,
such as three, four, or five terms. The segments from group X are
referred to herein as segments X* and the segments from group Y are
referred to herein as segments Y*. Each of segments X* can be
compared to each of segments Y* to determine if there is any match.
A match between a first and a second segment means that at least
some specified threshold of words from the first segment is a
translation of the words in the second segment. In various
implementations, this threshold can be 100%, 80%, 75%, 66%, or
50%.
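A minimal sketch of the segment comparison at block 542 is given below in Python. It assumes a hypothetical translation dictionary, lexicon, that maps each word of the first language to a set of possible translations. It also assumes that segments are consecutive, non-overlapping runs of terms. The disclosure does not fix either detail, and all function names are illustrative.

```python
def segments(sentence, length=3):
    """Split a sentence into consecutive segments of at most `length` terms."""
    terms = sentence.lower().split()
    return [terms[i:i + length] for i in range(0, len(terms), length)]


def segment_matches(seg_x, seg_y, lexicon, threshold=0.75):
    """True if enough words of seg_x have a lexicon translation in seg_y."""
    if not seg_x:
        return False
    targets = set(seg_y)
    hits = sum(1 for word in seg_x if lexicon.get(word, set()) & targets)
    return hits / len(seg_x) >= threshold


def any_segment_match(sentence_x, sentence_y, lexicon, length=3, threshold=0.75):
    """Compare every segment of sentence_x with every segment of sentence_y."""
    return any(
        segment_matches(sx, sy, lexicon, threshold)
        for sx in segments(sentence_x, length)
        for sy in segments(sentence_y, length)
    )
```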
[0054] At block 544, process 530 makes a determination of whether
there are any matching segments between the segments in segments
X* and segments Y*. In some implementations, if any segment from
segments X* matches a segment from segments Y* then each permutation of
the sentences from group X and group Y is identified as a potential
translation pair. In some implementations only the pairs of
sentences, one from group X and one from group Y, containing the
matching segments are identified as a potential translation pair.
At block 546, any of the potential translation pairs identified in
block 544 are returned. Process 530 then continues to block 548,
where it ends.
[0055] FIG. 5C is a flow diagram illustrating a process 570 used in
some embodiments for locating potential translation pairs from
multiple items generated by the same author. Process 570 begins at
block 572 and continues to block 574. At block 574 two sources of
potential translation pairs that are in different languages and are
by the same author are obtained. Content items that are by the same
author that are in different languages, particularly when within a
short time frame, are likely to be direct translations of each
other, for example where a store posts an item for sale in English,
then immediately reposts the same item for sale in Spanish.
Furthermore, because the permutations of sources of potential
translation pairs that are in different languages and are by the
same author are likely to be few in number as compared to other
sources of potential translation pairs, they can be quickly
searched for being actual translation pairs.
[0056] Similarly to blocks 506 and 536, at block 576 the languages
in the potential translation pair sources are identified. At block
578, the languages identified in block 576 are compared to desired
languages for a machine translation engine to be generated. If the
identified language is a desired language, process 570 continues to
block 580. Otherwise, process 570 continues to block 584, where it
ends. In some implementations, process 570 is not being performed
to obtain specific desired languages translation pairs, and thus in
these implementations, the operations of blocks 576 and/or 578 to
eliminate potential translation pairs that do not match desired
languages may not be performed.
[0057] At block 580 the sources of potential translation pairs
obtained at block 574 are compared to determine whether they are
within a specified time threshold of each other. This time
threshold can be configured to select sources of potential
translation pairs that were posted closely so as to be likely to be
direct translations of each other. In some implementations, this
filtering of sources of potential translation pairs can be part of
the query operations performed at block 574. Pairs of sources of
potential translation pairs within the time threshold can be marked
as potential translation pairs. These marked potential translation
pairs can be returned at block 582. Process 570 then continues to
block 584, where it ends.
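For illustration, the same-author filtering of process 570 could be sketched in Python as follows. The dictionary keys, the ten-minute window, and the sample posts are hypothetical and serve only to make the sketch runnable.

```python
from datetime import datetime, timedelta
from itertools import combinations


def same_author_pairs(posts, window=timedelta(minutes=10)):
    """Mark pairs of posts by one author, in different languages, posted closely.

    posts: list of dicts with 'author', 'lang', 'time', and 'text' keys.
    """
    pairs = []
    for a, b in combinations(posts, 2):
        if (a["author"] == b["author"]
                and a["lang"] != b["lang"]
                and abs(a["time"] - b["time"]) <= window):
            pairs.append((a, b))
    return pairs


# Illustrative data only.
posts = [
    {"author": "store", "lang": "en", "time": datetime(2014, 12, 3, 9, 0),
     "text": "Bike for sale"},
    {"author": "store", "lang": "es", "time": datetime(2014, 12, 3, 9, 2),
     "text": "Bicicleta en venta"},
    {"author": "store", "lang": "es", "time": datetime(2014, 12, 4, 9, 0),
     "text": "Gracias a todos"},
]
pairs = same_author_pairs(posts)
```

This sketch compares all pairs of posts, which is quadratic in the number of posts. An implementation handling many authors would more plausibly group posts by author first, or apply the time window in the query that obtains the posts.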
[0058] FIG. 6 is a flow diagram illustrating a process 600 used in
some embodiments for selecting actual translation pairs from
potential translation pairs. Process 600 begins at block 602 and
continues to block 604. At block 604 a potential translation pair
is received. In some implementations, the potential translation
pair includes an identification of a source of the potential
translation pair. Examples of potential translation pair sources
are: 1) a single content item that contains multiple languages; 2)
multiple content items in different languages that are related to
the same or similar target; and 3) multiple content items that are
generated by the same author in different languages within a
timeframe. In some implementations, the received potential
translation pair is a pair returned from one of process 500, 530,
or 570.
[0059] At block 606 one or more characteristics are extracted for
each of the language snippets that the received potential
translation pair comprises. In some implementations extracted
characteristics comprise one or more words or phrases from the
first language snippet to compare to one or more words or phrases
from the second language snippet. A more general machine
translation engine can be used to compare the extracted words or
phrases to determine if the potential translation pair is an actual
translation pair. This type of computationally inexpensive
comparison can be highly accurate for determining if a potential
translation pair is an actual translation pair where the language
snippets are highly likely to be direct translations of each other
when they are similar. In some implementations, it can be the case
that potential translation pairs that are highly similar are not
direct translations of each other. In these implementations, more
characteristics of the language snippets can be extracted for
comparison to determine if the potential translation pair is an
actual translation pair. In some of these implementations where
more characteristics of the language snippets are extracted, the
extracted characteristics can comprise data to compute, as an
all-to-all alignment between language snippets, any one or more of:
a ratio of a number of words; an IBM score; a maximum fertility; a
number of covered words; a length of a longest sequence of covered
words; or a length of a longest sequence of not-covered words. In
some implementations, these characteristics can be normalized by
source sentence length. In some of these implementations where more
characteristics of the language snippets are extracted, the
extracted characteristics can comprise data to compute, as a
maximum alignment between language snippets, any one or more of: a
total IBM score; a set, such as three, top fertility values; a
number of covered words; a maximal number of consecutive source
words which have corresponding consecutive target words; or a
maximum number of consecutive not-covered words.
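Some of the simpler characteristics listed above could be computed as in the following Python sketch. It treats a source word as "covered" when a hypothetical translation dictionary, lexicon, maps it to at least one word in the target snippet. That definition is an assumption of the sketch. IBM scores and fertility values are omitted because they require a trained alignment model.

```python
def longest_run(flags, value):
    """Length of the longest run of consecutive entries equal to `value`."""
    best = current = 0
    for flag in flags:
        current = current + 1 if flag == value else 0
        best = max(best, current)
    return best


def coverage_features(source, target, lexicon):
    """Compute simple coverage characteristics for a potential pair.

    source, target: lists of words; lexicon: word -> set of translations.
    """
    target_words = set(target)
    covered = [bool(lexicon.get(word, set()) & target_words) for word in source]
    n = len(source)
    return {
        "word_ratio": n / len(target) if target else 0.0,
        "covered": sum(covered),
        # Normalized by source sentence length.
        "covered_norm": sum(covered) / n if n else 0.0,
        "longest_covered_run": longest_run(covered, True),
        "longest_uncovered_run": longest_run(covered, False),
    }
```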
[0060] The extent to which characteristics are extracted can be
based on a source identified with the potential translation pair.
For example, some sources can be known to produce potential
translation pairs for which a simple analysis is likely to be
highly accurate in identifying actual translation pairs. Examples
of these types of sources are single content items that contain
multiple languages and multiple content items that are generated by
the same author in different languages within a timeframe. Other
sources of potential translation pairs can be known to produce
potential translation pairs which have very similar but not direct
translation language snippets, and therefore require a more
detailed analysis using additional extracted characteristics. An
example of this type of source is multiple content items in
different languages that are related to the same or similar
target.
[0061] At block 608 extracted characteristics are compared to
determine whether the potential translation pair received at block
604 is an actual translation pair. As discussed above, this can
include a computationally inexpensive analysis, such as one based
on a comparison of term translations or using a general machine
translation engine, or can be a more expensive analysis using
additional extracted characteristics. As also discussed above, in
some implementations, the type of analysis performed is based on an
identified source of the potential translation pair. If, at block
608, the potential translation pair is determined not to be an
actual translation pair, process 600 continues to block 612, where
it ends. If, at block 608, the potential translation pair is
determined to be an actual translation pair, process 600 continues
to block 610, where it returns an identification of the potential
translation pair as an actual translation pair. Process 600 then
continues to block 612, where it ends.
[0062] FIG. 7 is a flow diagram illustrating a process 700 used in
some embodiments for selecting a machine translation engine based
on a content item classification. Process 700 begins at block 702
and continues to block 704. At block 704 a content item to be
translated can be received. In some implementations, the received
content item is associated with a classification identifying the
domain or subdomain for the content item. A classification for the
content item could be based on the terms used in the content item.
For example, using the term "blimey" could be an indication of a
British classification. The classification could be based on where
the content item was posted within a social media site, such as a
soccer classification for a post to a professional soccer player's
fan page. The classification could be based on who posted the
content item; for example, a user who has been identified as living
in Louisiana is likely to post content items that use Southern
American slang, and those content items could therefore be
classified as such.
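As a non-limiting sketch, the three classification signals described above (terms used, posting location, and author) could be combined as shown below. The lookup tables, the `classify` function, and the page identifier `pro_soccer_fan_page` are hypothetical names chosen for illustration; the disclosure does not prescribe how the signals are stored or combined.

```python
import re
from typing import List, Optional

# Hypothetical lookup tables mapping each signal to a classification.
TERM_HINTS = {"blimey": "British"}
PAGE_HINTS = {"pro_soccer_fan_page": "soccer"}
REGION_HINTS = {"Louisiana": "Southern American"}


def classify(text: str,
             page: Optional[str] = None,
             author_region: Optional[str] = None) -> List[str]:
    """Collect classifications from terms, posting location, and author."""
    labels: List[str] = []
    # Signal 1: terms used in the content item (punctuation is ignored).
    terms = set(re.findall(r"[a-z']+", text.lower()))
    for term, label in TERM_HINTS.items():
        if term in terms:
            labels.append(label)
    # Signal 2: where the content item was posted.
    if page in PAGE_HINTS:
        labels.append(PAGE_HINTS[page])
    # Signal 3: who posted the content item.
    if author_region in REGION_HINTS:
        labels.append(REGION_HINTS[author_region])
    return labels
```

Under these assumptions, a post reading "Blimey, what a match!" on the hypothetical soccer fan page would receive both the British and soccer classifications.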
[0063] At block 706 a machine translation engine matching the
classification of the content item can be selected. In some
implementations, machine translation engines are associated with a
hierarchy, such as a tree structure, of domains with a most general
domain at the root and more specific subdomains further along the
structure. For example, a social media domain could be the root of
a tree with regions, dialects, topics, etc. at the next level of
the tree, and with further subdivisions within each node as the
tree is traversed. For example, a content item could have a
classification from being a post to a social media fan page for
southern Vietnam motorbike enthusiasts. The tree hierarchy could
have a root of Social Media, a regions second level node, a Vietnam
third level node, a southern fourth level node, and a vehicles
fifth level node. In this example, while the tree hierarchy has
cars and planes nodes below the fifth level vehicles node, it does
not have a motorbike node below the vehicles node. Accordingly, the
fifth level vehicles node would be the best matching node for the
southern Vietnam motorbike enthusiasts content item. A machine
translation engine corresponding to the node could be selected at
block 706. In some implementations where the content item is not
associated with a classification, a default machine translation
engine, such as a general machine translation engine or a social
media domain machine translation engine, can be selected at block
706 to perform a translation.
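The selection at block 706 can be illustrated with a minimal sketch that walks the domain tree along the classification of the content item and stops at the deepest node that exists. The nested-dictionary representation, the engine naming scheme, and the names `DOMAIN_TREE`, `DEFAULT_ENGINE`, `best_matching_node`, and `select_engine` are assumptions made for this example only.

```python
from typing import Dict, Optional, Sequence, Tuple

# Hypothetical domain tree from the example above: each key is a node
# and each value holds that node's children.
DOMAIN_TREE: Dict[str, dict] = {
    "Social Media": {
        "regions": {
            "Vietnam": {
                "southern": {
                    "vehicles": {"cars": {}, "planes": {}},
                },
            },
        },
    },
}

# Assumed default used when no classification (or no match) is available.
DEFAULT_ENGINE = "social-media-engine"


def best_matching_node(classification: Sequence[str],
                       tree: Dict[str, dict] = DOMAIN_TREE) -> Tuple[str, ...]:
    """Return the path to the deepest tree node matching the classification."""
    matched = []
    children = tree
    for label in classification:
        if label not in children:
            break  # No more specific node exists; keep the last match.
        matched.append(label)
        children = children[label]
    return tuple(matched)


def select_engine(classification: Optional[Sequence[str]]) -> str:
    """Block 706: pick the engine for the best matching node, or a default."""
    if not classification:
        return DEFAULT_ENGINE
    node = best_matching_node(classification)
    if not node:
        return DEFAULT_ENGINE
    # Engines are assumed to be keyed by the path of their node.
    return "engine:" + "/".join(node)
```

For the southern Vietnam motorbike enthusiasts example, the classification path ends in a motorbikes label that has no node in the tree, so the walk stops at the fifth level vehicles node and the engine keyed to that node is selected.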
[0064] At block 708 the content item received at block 704 is
translated using the machine translation engine selected at block
706. At block 710 the translation of the content item is returned.
Process 700 then continues to block 712, where it ends.
[0065] Several embodiments of the disclosed technology are
described above in reference to the figures. The computing devices
on which the described technology may be implemented may include
one or more central processing units, memory, input devices (e.g.,
keyboard and pointing devices), output devices (e.g., display
devices), storage devices (e.g., disk drives), and network devices
(e.g., network interfaces). The memory and storage devices are
computer-readable storage media that can store instructions that
implement at least portions of the described technology. In
addition, the data structures and message structures can be stored
or transmitted via a data transmission medium, such as a signal on
a communications link. Various communications links may be used,
such as the Internet, a local area network, a wide area network, or
a point-to-point dial-up connection. Thus, computer-readable media
can comprise computer-readable storage media (e.g.,
"non-transitory" media) and computer-readable transmission
media.
[0066] As used herein, being above a threshold means a
determination that a value for an item under comparison is above a
specified other value, that an item under comparison is among a
certain specified number of items with the largest value, or that
an item under comparison has a value within a specified top
percentage value. As used herein, being below a threshold means a
determination that a value for an item under comparison is below a
specified other value, that an item under comparison is among a
certain specified number of items with the smallest value, or that
an item under comparison has a value within a specified bottom
percentage value. As used herein, being within a threshold means a
determination that a value for an item under comparison is between
two specified other values, that an item under comparison is among
a middle specified number of items, or that an item under
comparison has a value within a middle specified percentage
range.
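By way of example only, the three senses of being above a threshold, and the value-range sense of being within a threshold, could be expressed as the predicates below. The function names are hypothetical, and reading "top percentage" as membership among the top-ranked fraction of items is an assumption of this sketch. The below-threshold senses follow by reversing the comparison and the sort order.

```python
import math
from typing import Sequence


def above_value(value: float, cutoff: float) -> bool:
    """Sense 1: the value is above a specified other value."""
    return value > cutoff


def in_top_n(value: float, all_values: Sequence[float], n: int) -> bool:
    """Sense 2: the item is among the n items with the largest values."""
    return value in sorted(all_values, reverse=True)[:n]


def in_top_percent(value: float, all_values: Sequence[float],
                   percent: float) -> bool:
    """Sense 3 (assumed reading): the item ranks within the top percent of items."""
    # Always consider at least one item so small lists are not excluded.
    k = max(1, math.ceil(len(all_values) * percent / 100))
    return in_top_n(value, all_values, k)


def within_values(value: float, low: float, high: float) -> bool:
    """Within a threshold: the value is between two specified other values."""
    return low < value < high
```

For instance, with the scores 0.9, 0.4, 0.7, 0.2, and 0.5, the score 0.7 is above the value 0.6 and is among the top two items, but it is not within the top 20 percent, which contains only the score 0.9.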
[0067] Although the subject matter has been described in language
specific to structural features and/or methodological acts, it is
to be understood that the subject matter defined in the appended
claims is not necessarily limited to the specific features or acts
described above. Specific embodiments and implementations have been
described herein for purposes of illustration, but various
modifications can be made without deviating from the scope of the
embodiments and implementations. The specific features and acts
described above are disclosed as example forms of implementing the
claims that follow. Accordingly, the embodiments and
implementations are not limited except as by the appended
claims.
[0068] Any patents, patent applications, and other references noted
above are incorporated herein by reference. Aspects can be
modified, if necessary, to employ the systems, functions, and
concepts of the various references described above to provide yet
further implementations. If statements or subject matter in a
document incorporated by reference conflicts with statements or
subject matter of this application, then this application shall
control.
* * * * *