U.S. patent application number 10/973959 was filed with the patent office on 2004-10-25 and published on 2005-07-07 for system, method and program product for management of life sciences data and related research.
This patent application is currently assigned to INTERNATIONAL BUSINESS MACHINES CORPORATION. Invention is credited to Baek, Ock Kee; Ewig, Carl Stephen.
Application Number | 20050149566 10/973959 |
Family ID | 34468764 |
Filed Date | 2004-10-25
United States Patent Application | 20050149566 |
Kind Code | A1 |
Baek, Ock Kee ; et al. | July 7, 2005 |
System, method and program product for management of life sciences
data and related research
Abstract
System, method and program product for managing data for
researchers. A research data server receives and manages
experimental data and research data and results from the
researchers, and operates with a virtual storage device to maintain
the experimental data and research data and results. A reference
data access server receives and manages external reference data
relating to the research and operates with the virtual storage
device to maintain the external reference data. Computational
resources allow researchers to capture, process and analyze
experimental data to obtain results. A research data network
connects the virtual storage device, research data server,
reference data access server and the computational resources to
allow transfer of data there between. Security management services
authenticate and authorize access by the researchers to the
system.
Inventors: | Baek, Ock Kee; (Unionville, CA) ; Ewig, Carl Stephen; (San Diego, CA) |
Correspondence Address: | IBM CORPORATION, IPLAW IQ0A/40-3, 1701 NORTH STREET, ENDICOTT, NY 13760, US |
Assignee: | INTERNATIONAL BUSINESS MACHINES CORPORATION, ARMONK, NY |
Family ID: | 34468764 |
Appl. No.: | 10/973959 |
Filed: | October 25, 2004 |
Current U.S. Class: | 1/1 ; 707/999.107 |
Current CPC Class: | G06Q 10/06313 20130101; G06Q 10/10 20130101; G16B 50/30 20190201; G16B 50/40 20190201; G16B 50/20 20190201; G16B 50/00 20190201 |
Class at Publication: | 707/104.1 |
International Class: | G06F 007/00 |
Foreign Application Data
Date | Code | Application Number |
Oct 31, 2003 | CA | 2447963 |
Claims
1. A system for managing data for researchers, said system
comprising: means for automatically receiving research data from
laboratory instruments; means for accessing a database containing
established reference data; means for automatically obtaining
recently available reference data; means for accessing a database
containing experimental data; a plurality of applications to
process respective types of said data; means, responsive to a
request by a researcher to perform a data processing function, for
invoking one or more of said processing applications and supplying
to said one or more processing applications parameters to perform
said data processing function; and means, responsive to the
invoking and supply of parameters, for said one or more processing
applications to automatically access types of said data required to
perform the respective data processing function.
2. A system as set forth in claim 1 wherein said invoking and
supplying means determines which of said processing applications to
invoke and which parameters to supply to said processing
applications to be invoked, based on a type of function requested
by said researcher.
3. A system as set forth in claim 1 wherein said invoking and
supplying means determines identities of files containing said data
required by said one or more applications, based on a type of
function requested by said researcher.
4. A system as set forth in claim 1 further comprising an
application to analyze patterns in respective types of said data,
and wherein one of said one or more processing applications
receives from the pattern analyzing application a pattern used to
perform the requested data processing function.
5. A system as set forth in claim 1 further comprising means for
determining if available data is valid, and if not, not using said
available data for said one or more processing applications; if so,
using said available data for said one or more processing
applications.
6. A system as set forth in claim 1 further comprising means for
formatting results of said one or more data processing applications
to correspond to respective types of data processing requests.
7. A method for managing data for researchers, said method
comprising the steps of: automatically receiving research data from
laboratory instruments; maintaining a database containing
established reference data; automatically obtaining recently
available reference data; maintaining a database containing
experimental data; responsive to a request by a researcher to
perform a data processing function, invoking one or more processing
applications and supplying to said one or more processing
applications parameters to perform said data processing function;
and in response to the invoking and supply of parameters, said one
or more processing applications automatically access types of said
data required to perform the respective data processing
function.
8. A method as set forth in claim 7 wherein said invoking and
supplying step determines which of said processing applications to
invoke and which parameters to supply to said processing
applications to be invoked, based on a type of function requested
by said researcher.
9. A method as set forth in claim 7 wherein said invoking and
supplying step determines identities of files containing said data
required by said one or more applications, based on a type of
function requested by said researcher.
10. A method as set forth in claim 7 further comprising the steps
of analyzing patterns in respective types of said data, and
providing results of said analyzing pattern step to one of said one
or more processing applications to perform the requested data
processing function.
11. A method as set forth in claim 7 further comprising the step of
determining if available data is valid, and if not, not using said
available data for said one or more processing applications; if so,
using said available data for said one or more processing
applications.
12. A method as set forth in claim 7 further comprising the step of
formatting results of said one or more data processing applications
to correspond to respective types of data processing requests.
13. A system for managing data for researchers, said system
comprising: a virtual storage device including online and near line
storage, and having predefined policies for moving stored data
between the online and near line storage; a research data server
for receiving and managing experimental data, research data and
research results, and operating with the virtual storage device to
maintain the experimental data, research data and research results;
a reference data access server receiving and managing external
reference data relating to research of the researchers and
operating with the virtual storage device to maintain the external
reference data; computational resources for the researchers to
capture and process experimental data to generate the research
results; and a research data network connecting the virtual storage
device, research data server, reference data access server and
computational resources to allow transfer of data there between,
the research data network further including security management
services to authenticate and authorize access by the
researchers.
14. The system of claim 13 further comprising a data import
controller connected to the research data network and operable to
retrieve external reference data from data sources external to the
research data network according to one or more policies predefined
by the researchers for retrieving external reference data.
15. The system of claim 14 wherein the data import controller
processes retrieved reference data to determine if it is lower
quality or redundant in view of reference data already stored in
the virtual storage device.
16. The system of claim 15 wherein the data import controller
filters out redundant or lower quality retrieved reference data
from entry in the virtual storage device.
17. The system of claim 13 wherein the computational resources
include a cluster of computing resources; and the computational
results further comprise a post processor, the post processor
converts experimental data into useful forms which are relevant for
the purpose and context of the research.
18. The system of claim 13 further comprising a laboratory
information management system connected to the research data
network and to one or more laboratory instruments, the laboratory
information management system receiving experimental data from the
laboratory instruments and providing that data to the research data
server via the research data network.
19. The system of claim 18 further comprising a preprocessing
server connected to the research data network, the laboratory
information management server providing experimental data from the
laboratory instruments to the preprocessing server which converts
the experimental data into data which is useful and relevant for
the research, the preprocessing server providing the converted data
to the research data server via the research data network.
20. The system of claim 13 further including a knowledge management
server connected to the research data network and operable to
identify and provide a researcher with reference data and/or
experimental data and results from the research data server and the
reference data access server in accordance with queries made by the
researcher.
21. The system of claim 20 wherein a researcher can create policies
defining data types of interest to the researcher, and the
knowledge management server, in accordance with the defined policy,
identifies and provides reference data and experimental data and
results of interest to the researcher.
22. The system of claim 13 further including a research application
server connected to the research data network, the research
application server providing at least one software application
and/or tool required by researchers, the at least one application
and/or tool operating on data stored in said virtual storage device
in accordance with instructions from the researchers.
23. A method of managing research conducted by researchers, said
method comprising the steps of: creating a set of policies defining
external reference information relevant to the research;
retrieving, at predefined intervals, external reference information
in accordance with the policies; comparing the retrieved
information with reference data stored in a reference data server
to determine if the retrieved information is redundant or of lower
quality than data already stored in the reference data server and
storing retrieved information which was determined to be
non-redundant and/or of acceptable better quality in the reference
data server; storing experimental data from at least one laboratory
instrument in a research data server; and providing the researchers
with access to the stored information in the reference data server
and to experimental data in the research data server.
24. The method of claim 23, further comprising the step of the
researchers defining a set of data storage policies for a virtual
storage device including both online and near line storage capacity
to store the data of the reference data server and the research
data server, and moving the stored data between online storage
capacity and near line storage capacity in accordance with the data
storage policies.
25. The method of claim 23 further comprising the step of
preprocessing the experimental data from the at least one
laboratory instrument and storing the preprocessed experimental
data in the research data server.
26. The method of claim 23 further comprising the step of
publishing information to researchers by having the researchers
identify to a knowledge management server information of interest
to them and the knowledge management server examining the contents
of the reference data server and the research data server to
identify the information of interest to a researcher and the
knowledge management server making the identified information
available to the researcher.
27. The method of claim 23 further comprising the step of verifying
the authenticity and authority of each researcher to access stored
experimental data and stored reference data before providing that
access.
28. A method of managing data for research, said method comprising
the steps of: providing a plurality of researchers with access to a
research data network; creating reference data policies defining
for each of said researchers types of reference data that will be
of use to the researcher, creating experimental data policies
defining for each of said researchers types of experimental data
and results that will be of use to the researcher and storing these
policies on the research data network; retrieving, at defined
intervals, from data sources outside the research data network,
external reference data as defined by the reference policies;
examining the retrieved reference data to determine if it is
redundant in view of reference data already stored on the research
data network or if it is of better quality than reference data
already stored on the network and storing retrieved reference data
on the research data network which has been determined to be
non-redundant or of better quality than reference data already
stored; collecting experimental data from laboratory instruments
through the research data network and storing the collected data on
the research data network; and publishing new reference data and
experimental data to researchers according to the reference data
policies and experimental data policies defined for the
researchers.
29. The method of claim 28 wherein the reference data and the
experimental data are stored in a virtual storage device on the
research data network, the virtual storage device having both
online and near line data storage capabilities and the research
team having predefined a storage policy executed by the virtual
storage device to transfer stored data between the online storage
and the near line storage.
30. The method of claim 28 further comprising the step of at least
one researcher processing published experimental data to obtain
experimental results which are stored on the research data network,
the stored experimental results also being subsequently published
in the publishing step to researchers in accordance with
experimental data policies of researchers.
31. A computer program product to manage data for research
conducted by researchers, said computer program product comprising:
a computer readable medium; first program instructions executable
to provide the researchers with access to a research data network;
second program instructions to retrieve, at defined intervals, from
data sources outside the research data network, external reference
data as defined by reference policies created by said researchers;
third program instructions to examine the retrieved reference data
to determine if it is redundant in view of reference data already
stored on the research data network or if it is of better quality
than reference data already stored on the network and to store
retrieved reference data on the research data network which has
been determined to be non-redundant or of better quality than
reference data already stored; fourth program instructions to store
experimental data from at least one laboratory instrument in a
research data server; and fifth program instructions to provide the
researchers with access to the stored information in the reference
data server and to experimental data in the research data server;
and wherein said first, second, third, fourth and fifth program
instructions are recorded on said medium.
32. A computer program product according to claim 31, further
including sixth program instructions to implement a set of data
storage policies for a virtual storage device including both online
and near line storage capacity to move the stored data between
online storage capacity and near line storage capacity in
accordance with the data storage policies; and wherein said sixth
program instructions are recorded on said medium.
Description
BACKGROUND OF THE INVENTION
[0001] The present invention relates generally to computer
management of data and related research results. More specifically,
the present invention relates to computer management of data and
related research in life science fields.
[0002] Modern life sciences research, such as pharmaceutical
research, typically requires applied, iterative, parallel research
across many technical disciplines. Modern pharmaceutical research
typically involves researchers from biology, genetics, chemistry,
clinical and pathology disciplines. The research process is
typically iterative, with the results from one discipline being
supplied to another discipline, etc., with each discipline analyzing
and processing the supplied and other data. Heretofore, there have been
inadequate computer systems and methods for collaboration between
researchers in the different disciplines, and management of the
overall process. These problems are exacerbated when large amounts
of data are generated and must be transformed, translated,
reorganized, analyzed or otherwise processed as the data moves
between disciplines and/or research teams.
[0003] An object of the present invention is to provide an
improved, comprehensive system, method and program product for
collaboration among researchers and management of data and related
research results.
[0004] Another object of the present invention is to provide a
system, method and program product of the foregoing type which is
suited for development of pharmaceuticals and other medical
therapies.
SUMMARY OF THE INVENTION
[0005] The invention resides in a system, method and program
product for managing data for researchers. Research data is
automatically received from laboratory instruments. Established
reference data is accessed from a database. Recently available
reference data is automatically obtained. Experimental data is
accessed from a database. There are a plurality of applications to
process respective types of the data. In response to a request by a
researcher to perform a data processing function, one or more of
the processing applications are invoked and supplied with
parameters to perform the data processing function. The one or more
applications automatically access types of the data required to
perform the respective data processing function.
[0006] According to features of the present invention, the
determination of which of the processing applications to invoke and
which parameters to supply to these processing applications can be
based on a type of function requested by the researcher. The
determination of the identities of files containing the data
required by the one or more applications can be based on a type of
function requested by the researcher. There can also be an
application to analyze patterns in respective types of the data,
and one of the processing applications receives from the pattern
analyzing application a pattern used to perform the requested data
processing function. There can also be a program for determining if
available data is valid. If not, the available data is not used for
the one or more processing applications. If so, the available data
is used for the one or more data processing applications. There can
also be a program for formatting results of the one or more data
processing applications to correspond to respective types of data
processing requests.
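The request handling summarized above can be sketched as a lookup keyed
by the type of function requested. This is only an illustrative sketch:
the names FunctionPlan, PLANS, is_valid and handle_request, the
"sequence_alignment" entry and its parameters are all hypothetical and
do not appear in the specification.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Every name in this sketch is hypothetical; the specification defines none of them.


@dataclass
class FunctionPlan:
    applications: List[str]        # processing applications to invoke
    parameters: Dict[str, object]  # parameters supplied to those applications
    data_types: List[str]          # types of data the applications need


# Assumed table keyed by the type of function a researcher requests.
PLANS: Dict[str, FunctionPlan] = {
    "sequence_alignment": FunctionPlan(
        applications=["aligner"],
        parameters={"max_gaps": 5},
        data_types=["gene_sequence", "reference_sequence"],
    ),
}


def is_valid(record: dict) -> bool:
    """Placeholder validity check: a record is usable only if it has a payload."""
    return bool(record.get("payload"))


def handle_request(
    function_type: str,
    data_store: Dict[str, List[dict]],
    apps: Dict[str, Callable[..., object]],
) -> Dict[str, object]:
    """Invoke the applications planned for function_type on valid data only."""
    plan = PLANS[function_type]
    # Gather the required data types, skipping records that fail validation.
    inputs = {
        data_type: [r for r in data_store.get(data_type, []) if is_valid(r)]
        for data_type in plan.data_types
    }
    # Invoke each planned application with the gathered data and its parameters.
    return {name: apps[name](inputs, **plan.parameters) for name in plan.applications}
```

Keeping the plan in a table rather than in the calling code means the
researcher names only the function; the choice of applications,
parameters and data types stays in one place.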
[0007] According to another embodiment of the present invention,
there is provided another system for managing data for researchers.
This other system comprises a virtual storage device including
online and near line storage and having policies predefined for
moving stored data between the online and near line storage. The
system also comprises a research data server for receiving and
managing experimental data and research data and results from the
researchers and operating with the virtual storage device to
maintain the experimental data and research data and results. The
system also comprises a reference data access server receiving and
managing external reference data relating to the research and
operating with the virtual storage device to maintain the external
reference data. The system also comprises computational resources
for the researchers to capture, process and analyze experimental
data to obtain results. The system also comprises a research data
network connecting the virtual storage device, research data
server, reference data access server and the computational
resources to allow transfer of data there between. The research
data network further includes security management services to
authenticate and authorize access by the researchers to the
system.
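As an illustration of a predefined policy for moving stored data
between online and near line storage, the sketch below assumes an
idle-time rule. The specification does not state what the policies
contain, so the 90-day threshold and the names StoredItem and
apply_tiering_policy are assumptions made only for this example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass
class StoredItem:
    name: str
    last_accessed: datetime
    tier: str = "online"  # either "online" or "nearline"


def apply_tiering_policy(
    items: List[StoredItem],
    now: datetime,
    max_idle: timedelta = timedelta(days=90),  # assumed threshold, not from the source
) -> List[str]:
    """Move idle items to near line storage and recently used items back online.

    Returns the names of the items whose tier changed.
    """
    moved = []
    for item in items:
        target = "nearline" if now - item.last_accessed > max_idle else "online"
        if item.tier != target:
            item.tier = target
            moved.append(item.name)
    return moved
```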
[0008] This other system may also include a data import controller
connected to one or more public data networks (e.g., the Internet)
as well as to the research data network. The data import controller
is operable to retrieve external reference data from data sources
external to the research data network according to one or more
policies predefined by the researchers for retrieving external
reference data. Also, the computational resources may include a
high performance computing server comprising a cluster of
homogeneous or hybrid computing resources. Also, the system may
include a laboratory information management system connected to the
research data network and to one or more laboratory instruments.
The laboratory information management system receives experimental
data from the laboratory instruments and provides that data to the
research data server via the research data network.
[0009] According to another aspect of this other embodiment of the
present invention, there is provided a method of managing data for
research. A set of policies defining external reference information
relevant to the research program is created. At predefined
intervals, external reference information in accordance with the
policies is retrieved. The retrieved information is compared with
reference data stored in a reference data server to determine if
the retrieved information is redundant or of lower quality than
data already stored in the reference data server. The retrieved
information which was determined to be non-redundant and/or of
acceptable better quality is stored in the reference data server.
The experimental data from optionally one or more laboratory
instruments is stored in a research data server. The researchers
are provided with access to the stored information in the reference
data server and to experimental data in the research data
server.
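The comparison step of this method can be sketched as follows. The
sketch assumes that each unit of reference data carries an identifier
and a numeric quality score; neither is specified in the source, and
the names ReferenceRecord and merge_retrieved are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class ReferenceRecord:
    key: str        # hypothetical identifier, e.g. an accession number
    quality: float  # hypothetical quality score; higher is better
    content: str


def merge_retrieved(
    stored: Dict[str, ReferenceRecord],
    retrieved: Iterable[ReferenceRecord],
) -> List[str]:
    """Store retrieved records that are new or of better quality.

    A retrieved record is treated as redundant or of lower quality when a
    record with the same key and an equal or higher score is already stored.
    Returns the keys of the records that were stored.
    """
    accepted = []
    for record in retrieved:
        current = stored.get(record.key)
        if current is None or record.quality > current.quality:
            stored[record.key] = record
            accepted.append(record.key)
    return accepted
```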
[0010] According to yet another aspect of the present invention,
there is provided a method of managing data for research.
Researchers are provided with access to a research data network.
Reference data policies, defining for each researcher types of
reference data that will be of use to the researcher, and experimental
data policies, defining for each researcher types of experimental data
and results that will be of use to the researcher, are created and
stored on the research data network. At defined intervals, from
data sources outside the research data network, external reference
data as defined by the reference policies is retrieved and examined
to determine if it is redundant in view of reference data already
stored on the research data network or if it is of better quality
than reference data already stored on the network. The retrieved
reference data which has been determined to be non-redundant or of
better quality than reference data already stored is stored on the
research data network. Experimental data is collected from
laboratory instruments through the research data network and stored
on the research data network. New reference data and experimental
data are published to researchers according to the reference data
policies and experimental data policies defined for the
researchers.
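The publishing step can be pictured as matching newly stored items
against each researcher's policies. In this sketch a policy is assumed
to be simply a set of data type names, and the function name publish
and the item fields "id" and "type" are hypothetical.

```python
from typing import Dict, List, Set


def publish(
    new_items: List[dict],
    policies: Dict[str, Set[str]],
) -> Dict[str, List[str]]:
    """Map each researcher to the ids of new items matching that researcher's policy.

    Each item is assumed to be a dict with an "id" and a "type" field.
    Researchers with no matching items are omitted from the result.
    """
    notifications: Dict[str, List[str]] = {}
    for researcher, wanted_types in policies.items():
        matches = [item["id"] for item in new_items if item["type"] in wanted_types]
        if matches:
            notifications[researcher] = matches
    return notifications
```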
[0011] According to yet another aspect of the present invention,
there is provided a computer program product stored on a computer
readable medium to manage data for research. First program
instructions provide the researchers with access to a research data
network. Second program instructions retrieve, at defined
intervals, from data sources outside the research data network,
external reference data as defined by reference policies created on
the computer by researchers. Third program instructions examine the
retrieved reference data to determine if it is redundant in view of
reference data already stored on the research data network or if it
is of better quality than reference data already stored on the
network, and store retrieved reference data on the research data
network which has been determined to be non-redundant or of better
quality than reference data already stored. Fourth program
instructions store experimental data from optionally one or more
laboratory instruments in a research data server. Fifth program
instructions provide the researchers with access to the stored
information in the reference data server and to experimental data
in the research data server.
BRIEF DESCRIPTION OF THE DRAWINGS
[0012] FIG. 1 is a block diagram of a computer system in accordance
with the present invention.
[0013] FIGS. 2a, 2b, 2c, 2d and 2e show control and data flows
between components of the system of FIG. 1 when retrieving and
processing external data and publications.
[0014] FIGS. 3a, 3b, 3c and 3d show control and data flows between
components of the system of FIG. 1 when collecting experimental
data from laboratory instruments.
[0015] FIGS. 4a, 4b and 4c show control and data flows between
components of the system of FIG. 1 when performing pattern matching and
pattern recognition of experimental data.
[0016] FIGS. 5a, 5b, 5c, 5d and 5e show control and data flows
between components of the system of FIG. 1 when performing data
analysis and/or result generation of experimental data.
[0017] FIGS. 6a and 6b show control and data flows between
components of the system of FIG. 1 when publishing results to an
internal researcher.
[0018] FIGS. 7a and 7b show control and data flows between
components of the system of FIG. 1 when publishing results to an
external researcher or external sites.
[0019] FIG. 8 is a flow diagram showing an example of the present
invention.
[0020] FIG. 9 is a flow chart showing a function of a research data
management program within an application server system of FIG.
1.
[0021] FIG. 10 is a flow chart showing a function within a
reference data access server within the system of FIG. 1 to
validate new external data.
DETAILED DESCRIPTION OF THE INVENTION
[0022] While the following embodiment illustrates use of the
present invention for pharmaceutical research, the present
invention has other embodiments and uses as well. For example, the
present invention can be employed in research for pharmaceuticals,
treatments, diagnostics, non-drug treatment protocols and
preventatives, and other sciences.
[0023] Modern life sciences research, such as high level drug
discovery and development, comprises a series of steps for
acquiring and analyzing chemical and biological data, wherein
processing is performed at each step. For example, in the
high-level drug research areas, the key activities typically
include (a) the collection of "assay" data generated by laboratory
instruments, (b) searching for and obtaining external reference and
research materials, (c) analyzing the consolidated assay data and
the external reference materials, and (d) deriving knowledge
through the analysis. These tasks are typically repeated, in a
cyclical manner, by each discipline within the research team. The
present invention assists in these key activities, providing
automation, data management and multidisciplinary collaboration
facilities between researchers in different disciplines.
[0024] In FIG. 1, a system in accordance with the present invention
is indicated generally at 20. System 20 interfaces with external
researchers 24, external data sources such as external reference
data 28 and external published data 32, internal researchers 36 and
laboratory instruments 40. As explained above, modern life sciences
research often involves the exchange of information with various
external researchers 24 who are not members of the primary research
organization. Examples of such external researchers may be
government funded researchers, such as researchers funded by the
U.S. National Institutes of Health, researchers affiliated with
Contract Research Organizations (CRO), etc. While FIG. 1 only shows
a single icon for external researchers 24, the present invention
supports interaction with multiple external researchers 24 at
disparate geographic locations and at disparate organizations.
Access to and use of external reference data 28 and external
published data 32 is often an important part of a life sciences
research effort.
[0025] External reference data 28 can comprise well known or
established gene sequence and protein databases, either public or
private, clinical data, etc. External published data 32 can include
newly discovered genes or proteins, novel drug targets such as
novel chemical entities or novel molecular entities, new insights
in disease mechanisms, etc. The external published data 32 may
reside in a known database, such as the PubMed database which is
maintained by the National Center for Biotechnology Information.
The relevant published data can be periodically identified and
retrieved automatically by the data retrieval engine 56 by key word
searching or author searching, based on predefined key words and
authors.
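A minimal sketch of the key word and author selection described above
follows. It operates on publication records that have already been
fetched and does not call any PubMed interface; the record fields
("id", "title", "abstract", "authors") and the function name
select_publications are assumptions for illustration.

```python
from typing import Iterable, List


def select_publications(
    publications: Iterable[dict],
    key_words: Iterable[str],
    authors: Iterable[str],
) -> List[str]:
    """Return ids of publications matching a predefined key word or author.

    Matching is case-insensitive; key words are matched as substrings of the
    title and abstract, and authors are matched on the full name string.
    """
    wanted_words = {word.lower() for word in key_words}
    wanted_authors = {name.lower() for name in authors}
    selected = []
    for pub in publications:
        text = (pub["title"] + " " + pub["abstract"]).lower()
        by_key_word = any(word in text for word in wanted_words)
        by_author = any(name.lower() in wanted_authors for name in pub["authors"])
        if by_key_word or by_author:
            selected.append(pub["id"])
    return selected
```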
[0026] Internal researchers 36 are primary drivers of the current
research effort. While FIG. 1 only shows a single icon for internal
researchers 36, the present invention supports interaction with
multiple internal researchers 36 who can be part of different
groups involved in the research effort and/or who can be located at
disparate logical and/or geographic locations.
[0027] Laboratory instruments 40 can include any instrument useful
to the research effort. In the life sciences field, such
instruments can include gene sequencers, mass-spectrometers,
crystallographic imaging devices, tomographic equipment, etc. Many
such instruments are now robotic in nature and can directly
interface with a laboratory information management system (LIMS)
44. An example of such an instrument is an ABI DNA sequencing
system which directly interfaces to LIMS 44 and to one or more
personal computers in the laboratory which provide for control
and/or calibration of the device. Other instruments require manual
operation and/or examination of their results by a technician or
researcher, but these results are still provided to LIMS 44 for the
assays. LIMS 44 are well known in the life sciences field and can
be custom designed for a laboratory or can be purchased
commercially, as desired.
[0028] As shown in FIG. 1, LIMS 44 is connected to instruments 40
by a research data network 48 which also connects the other
components of system 20. Network 48 is not particularly limited;
all components of system 20 should be able to communicate as
necessary and, preferably, at a sufficient speed to allow effective
transfer of information between various nodes connected by the
network. Network 48 can be, for example, (a) a homogeneous network
of gigabit Ethernet links and network devices, (b) a heterogeneous
network of, for example, fiber optic, gigabit Ethernet and ATM
links with appropriate bridges, protocol converters, etc., (c) a
private network, operated and maintained for or by a research
organization, or (d) a mix of private and public network portions separated
and/or protected as required by appropriate firewalls, domain
controllers for directory, security and system management services
and encryption/decryption engines.
[0029] The data stored and used by system 20 is classified as
follows for subsequent processing by research applications, as
explained below. The external reference data 28 is classified by
type based on the data file in which the external reference data
resides. When a unit of external reference data 28 is identified as
a candidate to be included in a data file, a data manager (person)
determines the type of the candidate external reference data and
stores it in the data file that is earmarked for this type of data.
(Alternately, a program tool can search each unit of external
reference data by key word, and classify the external reference data
based on the results of the key word search.) The web retrieval
server 56 classifies each item of external published data 32 by
type based on key words found in the publication. The type of data
obtained from the LIMS is based on the type of instruments that are
generating the data. Each of the classification types is one of a
multiplicity of predetermined types. These types were predetermined
by one or more scientists with expertise in such classification and
research applications that will need the data. As explained in more
detail below, each research application will need and fetch certain
types of data to be processed on behalf of a researcher. Each data
item can also be accompanied by a header file which indicates
whether the data needs to be preprocessed (such as by preprocessing
server 92 shown in FIG. 1), whether the data needs to be routed to
a data mining application (such as data mining application 290
shown in FIG. 8), and the format of the data needed for the type of
end user/researcher.
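By way of a non-limiting illustration, the key word classification and the accompanying header described above might be sketched as follows. The type names, key words, header field names and the functions `classify_by_keywords` and `make_header` are hypothetical and are not taken from the present description; the predetermined types would in practice be chosen by the scientists mentioned above.

```python
import re
from typing import Dict, List, Optional

# Hypothetical predetermined types and key words (assumptions for illustration).
TYPE_KEYWORDS: Dict[str, List[str]] = {
    "gene_sequence": ["genome", "nucleotide", "sequence"],
    "protein": ["protein", "folding", "peptide"],
    "clinical": ["patient", "symptom", "diagnosis"],
}


def classify_by_keywords(text: str) -> Optional[str]:
    """Return the predetermined type whose key words occur most often, or None."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    counts = {
        data_type: sum(words.count(keyword) for keyword in keywords)
        for data_type, keywords in TYPE_KEYWORDS.items()
    }
    best = max(counts, key=lambda data_type: counts[data_type])
    return best if counts[best] > 0 else None


def make_header(data_type: str, needs_preprocessing: bool,
                route_to_data_mining: bool, output_format: str) -> Dict[str, object]:
    """Build a header record carrying the three indications described above."""
    return {
        "type": data_type,
        "needs_preprocessing": needs_preprocessing,
        "route_to_data_mining": route_to_data_mining,
        "output_format": output_format,
    }
```

In this sketch a publication that matches no key word is left unclassified (`None`), so that it could be referred to a data manager rather than forced into a type.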
[0030] System 20 also includes a virtual storage device 52. As
mentioned above, life sciences research can produce large
quantities (petabytes or exabytes) of data. System 20 employs
virtual storage device 52 to facilitate the handling of this data.
Virtual storage device 52 comprises a collection of online storage
devices, such as disk drives, solid state drives, etc. and a
variety of near line storage devices, such as robotic tape
libraries, etc. which can retrieve requested data within about one
minute. Virtual storage device 52 employs a set of policies to
manage the storage of research data sets between the online and
near line storage subsystems. Such policies can employ strategies
such as automatic migration of aged data from online to near line
storage, heuristic migration based upon determined usage patterns
for the data, etc. Because storage device 52 is virtual, it can be
scaled as necessary by adding more storage devices. Also, it is
transparent to a user whether desired data is stored in online or
near line storage, although in the case of data stored on near line
storage, the user may experience a slight delay in access. By way
of example, virtual storage device 52 for current research efforts
has at least several terabytes of online storage and several
petabytes of near line storage. An example of a suitable virtual
storage device 52 would be one or more IBM Enterprise Storage
Servers and one or more IBM LTO UltraScalable Tape Libraries. In
system 20, virtual storage device 52 stores all research data
relating to a research effort, although local copies of smaller
data sets can be maintained by researchers. By employing virtual
storage device 52 across system 20, quality, integrity, security,
privacy and availability of research data are assured.
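A minimal sketch of one such storage policy, the automatic migration of aged data from online to near line storage, is given below. The `DataSet` record, the tier labels and the 90-day threshold are assumptions made only for illustration; the description does not specify how aging is measured.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass
class DataSet:
    name: str
    last_access: datetime
    tier: str = "online"  # hypothetical labels: "online" or "nearline"


def migrate_aged(data_sets: List[DataSet], now: datetime,
                 max_age: timedelta = timedelta(days=90)) -> List[str]:
    """Move online data sets not accessed within max_age to near line storage.

    Returns the names of the data sets that were migrated.
    """
    moved = []
    for data_set in data_sets:
        if data_set.tier == "online" and now - data_set.last_access > max_age:
            data_set.tier = "nearline"
            moved.append(data_set.name)
    return moved
```

A heuristic policy based on usage patterns could replace the fixed age test without changing how the tiers are presented to the user.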
[0031] Data retrieval engine 56 operates with a web retrieval
engine 60 to retrieve (based on predefined key word search and
predefined author search policies) and process desired external
reference data over public networks such as the Internet 68.
Specifically, a data retrieval engine 56 in the form of a data
import controller uses these policies established by the research
team to have web retrieval engine 60 retrieve external data. Web
retrieval engine 60 processes the policy-driven requests from data
import controller 56 to automatically retrieve predefined external
reference data through the Internet 68, or other networks, via
appropriate protocol and/or data format converters. Policies for
web retrieval engine 60 can include regularly scheduled searches of
specific databases, identification and retrieval of updated
versions of previously retrieved data, searches for new data
sources, etc. Examples of suitable web retrieval engines are the
IBM WebSphere software platform or Apache Software Foundation web
server integrated with the IBM WebSphere platform. Web retrieval
engine 60 can use any appropriate computer program to retrieve the
desired external references, such as ftp for document transfers,
SQL queries for database searches, etc. In one embodiment of the
invention, web retrieval engine 60 includes local storage where
retrieved information is temporarily stored, for subsequent
processing by data import controller 56.
[0032] B2B engine 64 includes a web server and operates to make
data from system 20 available to external researchers 24. Examples
of suitable B2B engines 64 are also the IBM WebSphere
software platform or Apache Software Foundation server integrated
with IBM WebSphere platform. As discussed below, system 20 includes
security management services which operate to limit the data that
can be accessed by an external user 24.
[0033] As illustrated in FIG. 1, both web retrieval engine 60 and
B2B engine 64 are located in a "demilitarized zone" (DMZ) 70 and
are separated from network 48 and public networks such as Internet
68 by a protocol firewall 72 and a domain firewall 76. Protocol
firewall 72 operates as a first line of isolation and acts to
control the direction of data flows between network 48 and public
networks such as Internet 68 and filters the data traffic flow
based upon source addresses, destination addresses and enabled
ports. Domain firewall 76 acts as a second line of isolation and
establishes the DMZ between the trusted internal network 48 and
external networks, such as Internet 68. Protocol firewalls 72 and
domain firewalls 76 and other techniques for establishing DMZ's are
well known and will not be further discussed herein. In a similar
manner, system 20 connects internal researchers 36 through an
internal DMZ formed by a protocol firewall 80, a web server 84 and
domain firewall 88. Examples of suitable web engines are the IBM
WebSphere software platform or an Apache Software Foundation server
integrated with the IBM WebSphere platform.
[0034] System 20 also includes a preprocessing server 92 which
comprises one or more computer systems. Preprocessing server 92
operates on the raw data provided by LIMS 44 to convert that data
into data which is chemically, biologically, clinically etc. useful
and relevant for the purpose and context of the research. Depending
upon the nature of the assay and the devices performing the assay,
this preprocessing can include data filtering, data normalization,
data validation, etc. For example, preprocessing server 92 can
filter data by removing a data set with key missing values, or data
with noisy or statistically improbable values, such as long
nucleotide segments in which all bases are identical. For example,
preprocessing server 92 can normalize data by establishing a common
scale or set of units for comparing disparate data, such as by
multiplying the data by a constant to make the maximum value in
each set precisely 1.0. For example, preprocessing server 92 can
invalidate data which falls outside of certain expected data
ranges, or is inherently invalid, such as ovarian cancer found in a
man. Preprocessing server 92 can also assign a higher reliability
to data which has previously been reviewed and annotated by a
researcher.
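The three preprocessing examples given above (filtering, normalization and validation) can be illustrated by the following sketch. The function names, the run-length threshold of 20 bases and the record field names are hypothetical; only the behaviors themselves are drawn from the description.

```python
from typing import Dict, List


def has_long_homopolymer(sequence: str, max_run: int = 20) -> bool:
    """Filtering: True if the sequence has a run of identical bases longer than max_run."""
    run = 0
    previous = None
    for base in sequence:
        run = run + 1 if base == previous else 1
        previous = base
        if run > max_run:
            return True
    return False


def normalize_to_unit_max(values: List[float]) -> List[float]:
    """Normalization: scale by a constant so that the maximum value is exactly 1.0."""
    peak = max(values)
    if peak <= 0:
        raise ValueError("cannot normalize a data set whose maximum is not positive")
    return [value / peak for value in values]


def is_valid(record: Dict[str, object], low: float, high: float) -> bool:
    """Validation: reject inherently invalid records and values outside [low, high]."""
    if record.get("diagnosis") == "ovarian cancer" and record.get("sex") == "male":
        return False
    return low <= float(record["value"]) <= high
```

Dividing each value by the maximum is equivalent to multiplying by the constant 1/maximum, so the largest element of each normalized set is exactly 1.0.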
[0035] System 20 includes an application server system 96 which
includes a high performance computing server (HPCS). In a presently
preferred embodiment, the HPCS comprises a high performance
computing cluster, such as a Linux cluster of high speed
processors, as this allows a large amount of available computing
resources to be scaled appropriately, as needed, by adding or
removing computing processors to and from the cluster. Gene
multiple sequence alignment and/or protein folding are just a few
examples of research activities which can require large amounts of
computing resources. The HPCS is capable of serving as a back end
processing resource for several research applications. As
explained in more detail below, application server system 96 also
includes a data mining application 290 and research applications
282, 284 and 286.
[0036] System 20 also includes a post processor 100 to operate on
results produced in application server system 96, or elsewhere, to
convert resultant data into chemically, biologically, etc. useful
forms which are relevant for the purpose and context of the
research. This post processing can comprise data clustering,
annotation, classification, presentation, etc. and can be performed
by various applications.
[0037] System 20 also includes a knowledge management server (KMS)
104. KMS 104 provides the researchers with access to relevant
biological, chemical and/or clinical information. The functions
provided by KMS 104 can include, without limitation, data mining,
ad hoc queries, statistical analysis, report generation, decision
support, etc. An example of a suitable knowledge management server
104 can include an IBM Information Management for Scoring,
Visualization, Modeling and Mining.
[0038] System 20 also includes a research application server (RAS)
108. RAS 108 runs a number of research applications, data mining
applications and/or other tools required by researchers. These
applications, data mining applications and/or tools can include,
without limitation: an NCBI Basic Local Alignment Search Tool
("Blast") program, multiple sequence alignment tools, gene
expression applications, and applications for protein structure and
function prediction. The Blast program is a search tool to
determine the similarity of a given nucleic acid or protein
sequence to thousands of other sequences in databases, such as NCBI
databases. The multiple sequence alignment tools assist in deducing
the function of new proteins, assisting in answering other
biological questions such as the evolution and/or phylogenic
relationship of the protein. The gene expression applications
permit interactive retrieval and analysis of gene expression data
with spotted microarray, high density oligonucleotide array,
hybridization filtering, serial analysis of gene expression data
and other techniques. The applications for protein structure and
function prediction include primary sequence alignment, secondary
and tertiary structure prediction methods, homology modeling and
crystallographic diffraction pattern analysis, etc. As will be
apparent to those of skill in the art, many other applications
and/or tools can be employed in system 20, and RAS 108 can provide
for centralized maintenance and control of these tools.
[0039] System 20 also includes a reference data access server 112
and a research data server 116 which operate with virtual storage
device 52 and data import controller 56. Reference data access
server 112 allows researchers to access reference data, both
external data and internal data, for ad hoc queries against a
virtual database through federated access to the data sources. By
way of example, the virtual database system in reference data
access server 112 can be the IBM DiscoveryLink middleware
application and the IBM DB2 Universal Database, although other
suitable techniques and/or applications can be used. The
DiscoveryLink application allows an ad hoc query against multiple
data sources in a single request and provides a single response,
regardless of geographic locations of data sources, types, formats,
schemas and operating platforms, network protocols, etc. External
reference data (for example, genome, EST, protein and/or clinical
databanks) can be consolidated, using replication, into one logical
location to mitigate accessing external reference data through
slow, external links such as Internet 68. Using local replicas of
reference data provides significant advantages over using the
original external sources, although provision must be made to
maintain the currency of the external data and costs are incurred
in providing the storage space for the local replicas. However,
these issues are addressed via data import controller 56 and
virtual storage device 52, as described above.
[0040] Research data server 116 allows various research
applications to access consolidated research and/or experimental
data. Examples of such research and/or experimental data include
microarray data and serial analysis of gene expression data.
Experimental data typically results from experiments performed by
the same organization that employs the researcher who uses
application server system 96. When a computer or other device
generates the experimental data, the computer or other device
automatically populates a database with the experimental data based
on a configuration file within the computer. Examples of suitable
research data servers 116 include the IBM Enterprise Storage System
and the IBM Hierarchical Storage Management solutions.
[0041] In addition to the nodes, servers and other devices
described above, system 20 also includes the following shared
system services. Directory management provides naming services to
registered entities (e.g.--users, applications, other resources,
etc.). Security management provides services to protect assets and
resources such as user/entity identification and
validation/authentication, access control, privacy protection and
security audit functions. System management in conjunction with
client software running on managed devices/nodes, provides
management services such as problem alerts/reports, performance
monitoring, software distribution, data backup and recovery, etc.
Storage management in conjunction with virtual storage device 52,
provides integrated, consolidated and reliable data storage for
reference data access, research data and experimental data.
[0042] Examples of the operation and use of system 20 in aspects of
a research program will now be described.
[0043] FIGS. 2a through 2e show an example of system 20 retrieving
external data 28 and external published information 32 via Internet
68. In the first step, shown in FIG. 2a, data import controller 56,
in accordance with the data retrieval policies established by the
research team using system 20, instructs web retrieval engine 60 to
retrieve the information. In FIG. 2b, web retrieval engine 60,
which runs in external DMZ 70, retrieves the information from
predefined sites on the Internet, such as GenBank, etc. and stores
the retrieved information in the temporary local storage of web
retrieval engine 60. For security, the session is initiated for an
outbound session only and the data flows are inbound only through
protocol firewall 72 and domain firewall 76. The retrieved
information is not particularly limited and can include gene data,
protein data, documents, abstracts, chemical data, etc. and will
include both structured data (protein database) and unstructured
data (academic papers/journals). If the nature of the retrieved
information and data requires it, web retrieval engine 60 can scan
for the presence of computer viruses and/or otherwise check the
retrieved information and data for security issues. When the data
has been retrieved and placed into temporary storage, web retrieval
engine 60 forwards the information to data import controller 56, as
shown in FIG. 2c. Data import controller 56 parcels the data into
smaller and relevant data sets and then sends the data to the
reference data access server 112 for quality checks, as shown in
FIG. 2d.
[0044] As illustrated in FIG. 10, a program 400 running on the
reference data access server 112 compares the retrieved, partially
processed, data to the authoritative, non-redundant data sets
stored in virtual storage device 52 (step 402). If newly collected
data is already in the authoritative data sets (decision 403, yes
branch), the program 400 checks the quality of the newly collected
data to determine whether the newly collected data is superior to
the existing data. "Superior" data is typically data which is input
later in time, and not "out of bounds", i.e. is within constraints
permitted for the data. For example, if the data states that the
disease is ovarian cancer, and indicates that the patient is male,
then the data is out of bounds. As another example, if the range of
a certain biological chemical is one thousand ppm to two thousand
ppm, and the measure for the new data is three thousand ppm, then
it is out of bounds. If the newly collected data is superior
(decision 405, yes branch), the existing data is replaced in
virtual storage device 52 with the newly collected data and the
meta data managed by the reference data access server 112 is
updated accordingly (step 406). Otherwise, the newly collected data
is deleted or marked inferior (step 407). Referring again to
decision 403, no branch, where the newly collected data is not
already in the authoritative data sets, a check is made whether the
newly collected data is within bounds (decision 410). If so, then the
meta data stored in reference data access server 112 is updated
accordingly (step 406), and the new data will be stored in virtual
storage device 52, as shown in FIG. 2e, whether unique or
redundant, to allow researchers to revisit and/or confirm the
retrieved data in the future, should the need arise. If the new
data is out of bounds (decision 410, no branch), then it is
discarded (step 407).
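The decision flow of program 400 can be sketched as follows, under simplifying assumptions. The `Record` type, the integer `entered` timestamp, the numeric bounds test and the dictionary standing in for the authoritative data sets are all hypothetical; "superior" is modeled here only as "entered later and within bounds", per the typical case described above.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class Record:
    key: str
    value: float
    entered: int  # hypothetical timestamp; a larger value means input later in time


def in_bounds(record: Record, low: float, high: float) -> bool:
    return low <= record.value <= high


def reconcile(store: Dict[str, Record], new: Record, low: float, high: float) -> str:
    """Compare newly collected data with the authoritative set and update the store."""
    existing = store.get(new.key)                      # step 402, decision 403
    if existing is not None:
        superior = in_bounds(new, low, high) and new.entered > existing.entered
        if superior:                                   # decision 405, yes branch
            store[new.key] = new                       # step 406
            return "replaced"
        return "discarded"                             # step 407
    if in_bounds(new, low, high):                      # decision 410
        store[new.key] = new                           # step 406
        return "stored"
    return "discarded"                                 # step 407
```

For example, with bounds of one thousand to two thousand ppm, a new measurement of three thousand ppm is discarded whether or not an earlier record exists.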
[0045] FIGS. 3a through 3d show an example of collection of
experimental data by system 20. In FIG. 3a, an internal researcher
36 interacts, via a personal computer or other interface device
(not shown) with an appropriate web page served by web server 84 to
input experimental conditions, experimental samples and other
relevant information into LIMS 44. This session is authenticated
and authorized by the security management services in system 20. As
shown in FIG. 3b, the raw measured experimental data is provided to
LIMS 44 from instruments 40, and LIMS 44 then merges the
experimental data with the data input by the researcher 36. The
merged data is then preprocessed, by preprocessing server 92 to
filter and normalize the data to obtain a useful data set as shown
in FIG. 3c. Referring to FIG. 3d, the filtered and normalized data
set generated by reference data access server 112 is placed in
research data server 116 and stored in virtual storage device 52,
where it will be initially stored in online storage and eventually
moved to near line storage, in accordance with the storage policies
defined for virtual storage device 52 by the research team.
[0046] FIG. 8 illustrates an example of preprocessing performed by
server 92. Data is input from three sources, i.e. (1) an automated
LIMS 40 controlling automated, high-throughput production of
microarrays from a large number of fractionated tissue extracts,
(2) clinical data, particularly the presence and severity of
disease symptoms, and their associated microarray data, typically
from specific affected tissues and optionally other epidemiological
data from researchers 36, and (3) a database 28 or 32 of
biochemical pathways, optionally focusing on specific types of
pathways of interest. For sources 36, 28 or 32, the reference data
import may be either local or remote over the Internet. As
explained above, each piece of data has been classified as one of a
multiplicity of predetermined types. In this example, the three
types of data are assigned a common representation. Microarrays and
biomolecular pathways may be described by two-dimensional matrices,
clinical data by annotations to such matrices, and the combination
of the three by a relational database.
[0047] The raw microarray data 40 is collected into a local
repository 270, then sent to preprocessor server 92 to filter out
missing data, to check for errors such as smearing of the spots,
and usually to perform a cluster analysis to group rows and columns
that display similar colors and intensities. The result would
typically be the standard Clustered Image Map (CIM) representation,
which is stored in a reference repository. In parallel, clinical
data 36 may be obtained in a standard representation, such as HL7
or CDISC, and stored in a local repository 272. Also in parallel,
one or more databases of biochemical interaction data 24, such as
the STKE database, are accessed and stored in a local repository
274.
[0048] The data 24, 36 and 40 is automatically made available to
the application server system 96, research application server 108
and post processor 100, which in the example of FIG. 8, have been
consolidated into one server. These include a high performance
computer system (HPCS) 292, and act on the data (as illustrated in
FIG. 9). (The raw data 40 is provided to the application server
system 96 via the preprocessor 92.) In a typical scenario, the
research application server 108 and/or the application server
system 96 would scan the reference microarray data repository for
highly probable patterns. For example, a standard Hidden Markov
algorithm in a data mining application 290 (within research
application server 108) detects patterns ranked by their
probabilities. Data mining application 290 then compares these
patterns with the probable patterns detected in the clinical
microarrays from the clinical samples. The concordance of these two
types of data identifies what characteristic biomolecules are
associated with a disease state. Next, a research application 282,
284 and/or 286 within the research application server 108
numerically compares biomolecular pathway data connecting differing
biomolecules with the biomolecules identified from both sets of
microarray data. This provides information on the disease
mechanism, especially indicating related sets of biomolecules, any
one of which may be targeted in treating a disease.
[0049] FIG. 8 also illustrates four types of end user researchers
that use the processed output from the research application server
108, application server system 96 and post processor 100.
Occasionally, the researchers will also access the raw data 40 as
stored in data repositories 270, 272, 274 and 275. Each of these
types of researchers interfaces to the application server system 96
via a research data management program 280. Research data
management program 280 processes the queries made by the researchers
for different types of processed and raw data. In response, the
research data management program 280 invokes the corresponding
applications 282, 284 and/or 286 and supplies them with the
requisite query parameters and research data files in order to
process the query and supply the researcher with the processed data
in a form tailored to the type of researcher.
[0050] A microarray research technician 250 scans the output from
the LIMS and the performance of the microarray instrumentation, for
example by number of entries in the local repository or reference
data repository. The research technician 250 also uses microarray
pattern statistics output from the application server system 96. A
molecular biologist researcher 252 obtains pattern and probability
numerical values from the application server system 96, and then
uses them to validate or extend the pathway data, using the
identities of interacting biomolecules as obtained from the LIMS
and external sources. A physician researcher 254 analyzes
statistical correlation of microarray pattern probabilities with
known disease states as a means of diagnosis and prognosis.
Consequently, the physician uses primary trends and statistics of
microarray patterns and clinical data output from the research
application server 108 via the application server system 96 and
access to the reference data repository of patterns generated by
data mining application 290. A drug design researcher 256 uses
clinical symptoms microarray statistics and their correlation with
pathways output from research application server 108 via the
application server system 96, and the following types of data
output from the application server system 96: (a) names of
biomolecules from clinical microarray data connected to disease
state, (b) their probabilities of appearance in the microarrays
(where these biomolecules have been identified using the reference
data repository), and (c) probable molecular interactions to design
a drug that will specifically interact with the molecules that
support the disease. With this data, the pharmaceutical designer
may then use standard drug design tools to develop a novel drug
once these relevant biomolecule targets have been identified.
[0051] FIG. 9 illustrates the function of the research data
management application 280 within the combined application server
system 96, post processor 100 and research application server 108
in more detail. In step 300, research data management application
280 receives a data processing request from researcher 250, 252,
254 or 256. The request specifies one or more data processing
functions to be performed by one or more of the research
applications 282, 284 or 286. If more than one data processing
function is needed, the request specifies the order in which the
data processing functions should be performed. The request also
specifies types of data processing results of interest, type or
source of the research data that should be processed, and
optionally, the age or range of data that should be used or that
data should pass basic validity tests before being used. In
response to the data processing request and based on tables,
research data management application 280 determines what research
application(s) 282, 284 and/or 286 are needed to perform the
requested function, and the names of data files that contain the
data needed to be processed by these research application(s) to
return the processed data needed by the researcher (step 302). If
the data processing request does not specify the type of data
needed by the research application(s), the research data management
application 280 consults a configuration file for the research
application to determine the types of data needed for the research
application to perform its function (step 302). Next, research data
management application 280 determines if all of the data needed by
the research application(s) to perform the requested function is
available from data repositories within system 20 (decision 304).
This determination is made by checking the research data files
correlated to the requisite data types, to determine if they
contain valid data within the data age range, if any, specified by
the researcher's data processing request. If not, then the research
data management application 280 sends an error message to the
researcher (step 305). (In response, the researcher can change the
age, range or type parameters, and submit another data processing
request.) If so, the research data management application 280
determines the parameters needed by each research application to
perform the requested function (step 306). These parameters include
the names of the data files that contain the requisite data, the
data age range, if any, for data to be processed, identifiers for
the specific data items within the data files (such as molecule or
molecule family names), and specification of the requested function
or functions (if the research application is capable of performing
more than one function). This determination is based on information
in the researcher's data processing request and a table which lists
which data files contain which types of data. Then, the research
data management program 280 invokes the target research
application(s) and supplies the requisite parameters in a function
request (step 308). These parameters include the sequence in which
the applications are to be performed (the sequence possibly being
computed based on intermediate results and possibly including
iteration), and data to control the operation of the application,
such as minimal probabilities of sequences resulting from a Blast
search to be retained for further analysis. In response, the
research application(s) performs its function. When the research
application(s) performs its function, it may query the data mining
application 290 for pattern information and analysis, as needed.
Also, if two or more research applications are invoked and
executed, they may communicate with each other to collectively
perform the requested function. After completing its processing,
the research application(s) returns the results to the research
data management program 280. As explained above, each item or data
file may include an indication of the proper format of the report
or processed data furnished to the researcher, based on a parameter
specified in the researcher's function request. In such a case, the
research data management program 280 will format the report or
processed data accordingly (step 280). Then, the research data
management program 280 returns the reformatted resultant data to
the researcher that made the data processing request (step
310).
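A simplified sketch of the request handling of FIG. 9 follows. It is an illustration under stated assumptions rather than the actual implementation: the request layout, the `app_table` and `file_table` lookup tables, the in-memory `repository` and the convention that each research application is a callable taking data and parameters are all hypothetical.

```python
from typing import Callable, Dict, List

# Hypothetical signature for a research application: (data, parameters) -> results.
ResearchApp = Callable[[List[float], Dict[str, float]], List[float]]


def handle_request(request: Dict[str, object],
                   app_table: Dict[str, ResearchApp],
                   file_table: Dict[str, List[str]],
                   repository: Dict[str, List[float]]) -> Dict[str, object]:
    """Resolve applications and data files, check availability, then invoke in order."""
    # Step 302: determine the needed applications and the data files for the data type.
    apps = [app_table[name] for name in request["functions"]]
    files = file_table.get(request["data_type"], [])

    # Decision 304: is the needed data available in the repositories?
    data = [value for name in files for value in repository.get(name, [])]
    if not data:
        return {"error": "required data not available"}      # step 305

    # Steps 306 and 308: supply parameters and invoke each application
    # in the order specified by the request, passing results along.
    params = request.get("params", {})
    for app in apps:
        data = app(data, params)

    return {"result": data}                                  # step 310
```

In this sketch the applications run strictly in the requested order; the description also contemplates sequences computed from intermediate results, which is omitted here.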
[0052] One of the important activities in research programs is to
compare experimental data to reference data, to interpret the
experimental data to obtain results, and to store those results
with the experimental data. FIGS. 4a, 4b and 4c show an example of
such data "fitting" within system 20. It is assumed, for the
purposes of this example, that the data fitting is performed
through asynchronous interfaces/services as data fitting can be
computationally complex and is typically implemented as a batch
process which does not require user intervention. However, it is
important to note that system 20 is not limited to such batch
processes, and system 20 can support interactive data fitting
and/or examination, such as by data visualization tools, etc. with
researchers 36.
[0053] In FIG. 4a, reference data access server 112 provides RAS
108 with reference data from virtual storage device 52. This
reference data will be used to interpret and analyze experimental
data. Research data server 116 provides experimental data that has
been appropriately preprocessed to RAS 108 from virtual storage
device 52. Next, in FIG. 4b, RAS 108 employs computing services of
application server system 96 to perform high speed pattern matching
and recognition, or other techniques such as statistical analysis,
etc. to best interpret the experimental data. Finally, as shown in
FIG. 4c, the obtained results are stored in virtual storage device
52, via research data server 116.
[0054] Another important activity in research is the analysis of
data and generation of results. An example of this activity,
employing system 20, is shown in FIGS. 5a through 5e. In FIG. 5a,
an internal researcher 36 employs a Web Browser interface, provided
by web server 84, to interact with RAS 108. If researcher 36
determines that further analysis of a data set is merited, the data
set to be analyzed is provided to RAS 108 from virtual storage
device 52, via research data server 116, as shown in FIG. 5b. RAS
108 employs the HPCS to process the retrieved data set, using the
analytic tools selected by researcher 36 via the Web Browser, as
shown in FIG. 5c. If desired, or required, researcher 36 can have
RAS 108 post process any results, via post processor 100 and the
HPCS, as shown in FIG. 5d. The processed data, such as the
conclusions, researcher annotations, etc. are then stored in
virtual storage device 52, via research data server 116 with the
original data sets as shown in FIG. 5e.
[0055] In some cases, the results of the research may be published
to internal researchers 36 on the research team, and external
researchers 24 and/or external databases. As used herein, the term
"publishing" is intended to include the act of making information
available for subsequent review, and this would include placing
research results into a database which can be externally accessed,
publishing scientific articles in journals, etc. and "pushing"
results to researchers or institutions, etc. which have previously
subscribed or otherwise indicated an interest in the results.
Internal publishing can be synchronous, with internal researchers
36 accessing results and other information as it becomes available,
while external publication can be either synchronous, such as when
an external researcher 24 accesses results through knowledge
management server 104, or asynchronous, such as when an external
researcher 24 accesses an external replica of the published data
held in an external database.
[0056] FIGS. 6a, 6b, 7a and 7b show examples of the publishing of
research results and other information with system 20. Publication
to an internal researcher 36 is illustrated in FIG. 6a, wherein
researcher 36 uses a Web Browser interface, provided by web server
84, to interface with knowledge management server 104.
KMS 104 has access to all federated information and results, the
federated information and results being appropriately indexed, in
virtual storage device 52 and throughout system 20. As internal
researcher 36 commences his or her interaction with KMS 104, the
security management services authenticate the authority of the
internal researcher to access the results and/or any research
applications internal researcher 36 requests. FIG. 6b shows the
requested results being provided to internal researcher 36.
[0057] In FIG. 7a, an example of an asynchronous "push" of results
is shown. In this example, an external researcher 24 or other
external data user, such as a public or commercial database, has
previously indicated an interest in research results and this
interest has been authenticated and approved by the above-described
security management services. Accordingly, when KMS 104 determines
that results or information available in virtual storage device
52, via research data server 116, are to be pushed to an external
researcher 24 or external database, the results are packaged by KMS
104 and forwarded to the appropriate external destination via B2B
server 64. In FIG. 7b, an external researcher 24 interfaces with
KMS 104 in a synchronous manner, much like an internal researcher
would, via B2B server 64. In this case, the above-described
security management services authenticate the external researcher
24 and, on an ongoing basis, ensure the researcher's authority to
access requested information. KMS 104 retrieves properly requested
information from virtual storage device 52, via research data
server 116, and provides it to the external researcher 24 via B2B
server 64.
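The two external publication modes of FIGS. 7a and 7b can be illustrated with a short Python sketch. Everything named here is an assumption made for illustration: the `KnowledgeManager` class, its `authorized` set standing in for the security management services, and its `outbox` list standing in for delivery through the B2B server.

```python
from dataclasses import dataclass, field


@dataclass
class KnowledgeManager:
    """Hypothetical stand-in for the role played by KMS 104."""
    results: dict = field(default_factory=dict)        # topic -> published result
    authorized: set = field(default_factory=set)       # approved (user, topic) pairs
    subscriptions: dict = field(default_factory=dict)  # topic -> subscribed users
    outbox: list = field(default_factory=list)         # stand-in for B2B delivery

    def _check(self, user, topic):
        # Stand-in for the security management services.
        if (user, topic) not in self.authorized:
            raise PermissionError(f"{user} is not authorized for {topic}")

    def subscribe(self, user, topic):
        """Register a previously approved interest in a topic."""
        self._check(user, topic)
        self.subscriptions.setdefault(topic, []).append(user)

    def publish(self, topic, result):
        """FIG. 7a: package a result and push it to each subscriber."""
        self.results[topic] = result
        for user in self.subscriptions.get(topic, []):
            self.outbox.append((user, topic, result))

    def request(self, user, topic):
        """FIG. 7b: answer a direct request, re-checking authority each time."""
        self._check(user, topic)
        return self.results[topic]
```

Checking authority on every call to `request`, rather than once per session, reflects the description of authority being ensured on an ongoing basis.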
[0058] As will now be apparent, the present invention provides an
end-to-end information technology system to allow researchers from
multiple disciplines and in geographically diverse locations to
cooperate in research efforts. Management of large amounts of
experimental and reference information is provided to meet diverse
researcher and regulatory requirements, while appropriate security
of the managed information is maintained.
[0059] The above-described embodiments of the invention are
intended to be examples of the present invention and alterations
and modifications may be effected thereto, by those of skill in the
art, without departing from the scope of the invention which is
defined solely by the claims appended hereto.
* * * * *