System, method and program product for management of life sciences data and related research Baek, Ock Kee ; et al. [INTERNATIONAL BUSINESS MACHINES CORPORATION]

Patent Application Summary

U.S. patent application number 10/973959 was filed with the patent office on 2004-10-25 and published on 2005-07-07 for system, method and program product for management of life sciences data and related research. This patent application is currently assigned to INTERNATIONAL BUSINESS MACHINES CORPORATION. Invention is credited to Baek, Ock Kee; Ewig, Carl Stephen.

Application Number	20050149566 10/973959
Family ID	34468764
Filed Date	2004-10-25

United States Patent Application	20050149566
Kind Code	A1
Baek, Ock Kee ; et al.	July 7, 2005

Abstract

System, method and program product for managing data for researchers. A research data server receives and manages experimental data and research data and results from the researchers, and operates with a virtual storage device to maintain the experimental data and research data and results. A reference data access server receives and manages external reference data relating to the research and operates with the virtual storage device to maintain the external reference data. Computational resources allow researchers to capture, process and analyze experimental data to obtain results. A research data network connects the virtual storage device, research data server, reference data access server and the computational resources to allow transfer of data there between. Security management services authenticate and authorize access by the researchers to the system.

Inventors:	Baek, Ock Kee; (Unionville, CA) ; Ewig, Carl Stephen; (San Diego, CA)
Correspondence Address:	IBM CORPORATION IPLAW IQ0A/40-3 1701 NORTH STREET ENDICOTT NY 13760 US
Assignee:	INTERNATIONAL BUSINESS MACHINES CORPORATION ARMONK NY
Family ID:	34468764
Appl. No.:	10/973959
Filed:	October 25, 2004

Current U.S. Class:	1/1 ; 707/999.107
Current CPC Class:	G06Q 10/06313 20130101; G06Q 10/10 20130101; G16B 50/30 20190201; G16B 50/40 20190201; G16B 50/20 20190201; G16B 50/00 20190201
Class at Publication:	707/104.1
International Class:	G06F 007/00

Foreign Application Data

Date	Code	Application Number
Oct 31, 2003	CA	2447963

Claims

1. A system for managing data for researchers, said system comprising: means for automatically receiving research data from laboratory instruments; means for accessing a database containing established reference data; means for automatically obtaining recently available reference data; means for accessing a database containing experimental data; a plurality of applications to process respective types of said data; means, responsive to a request by a researcher to perform a data processing function, for invoking one or more of said processing applications and supplying to said one or more processing applications parameters to perform said data processing function; and means, responsive to the invoking and supply of parameters, for said one or more processing applications to automatically access types of said data required to perform the respective data processing function.

2. A system as set forth in claim 1 wherein said invoking and supplying means determines which of said processing applications to invoke and which parameters to supply to said processing applications to be invoked, based on a type of function requested by said researcher.

3. A system as set forth in claim 1 wherein said invoking and supplying means determines identities of files containing said data required by said one or more applications, based on a type of function requested by said researcher.

4. A system as set forth in claim 1 further comprising an application to analyze patterns in respective types of said data, and wherein one of said one or more processing applications receives from the pattern analyzing application a pattern used to perform the requested data processing function.

5. A system as set forth in claim 1 further comprising means for determining if available data is valid, and if not, not using said available data for said one or more processing applications; if so, using said available data for said one or more processing applications.

6. A system as set forth in claim 1 further comprising means for formatting results of said one or more data processing applications to correspond to respective types of data processing requests.

7. A method for managing data for researchers, said method comprising the steps of: automatically receiving research data from laboratory instruments; maintaining a database containing established reference data; automatically obtaining recently available reference data; maintaining a database containing experimental data; responsive to a request by a researcher to perform a data processing function, invoking one or more processing applications and supplying to said one or more processing applications parameters to perform said data processing function; and in response to the invoking and supply of parameters, said one or more processing applications automatically access types of said data required to perform the respective data processing function.

8. A method as set forth in claim 7 wherein said invoking and supplying step determines which of said processing applications to invoke and which parameters to supply to said processing applications to be invoked, based on a type of function requested by said researcher.

9. A method as set forth in claim 7 wherein said invoking and supplying step determines identities of files containing said data required by said one or more applications, based on a type of function requested by said researcher.

10. A method as set forth in claim 7 further comprising the steps of analyzing patterns in respective types of said data, and providing results of said analyzing pattern step to one of said one or more processing applications to perform the requested data processing function.

11. A method as set forth in claim 7 further comprising the step of determining if available data is valid, and if not, not using said available data for said one or more processing applications; if so, using said available data for said one or more processing applications.

12. A method as set forth in claim 7 further comprising the step of formatting results of said one or more data processing applications to correspond to respective types of data processing requests.

13. A system for managing data for researchers, said system comprising: a virtual storage device including online and near line storage, and having predefined policies for moving stored data between the online and near line storage; a research data server for receiving and managing experimental data, research data and research results, and operating with the virtual storage device to maintain the experimental data, research data and research results; a reference data access server receiving and managing external reference data relating to research of the researchers and operating with the virtual storage device to maintain the external reference data; computational resources for the researchers to capture and process experimental data to generate the research results; and a research data network connecting the virtual storage device, research data server, reference data access server and computational resources to allow transfer of data there between, the research data network further including security management services to authenticate and authorize access by the researchers.

14. The system of claim 13 further comprising a data import controller connected to the research data network and operable to retrieve external reference data from data sources external to the research data network according to one or more policies predefined by the researchers for retrieving external reference data.

15. The system of claim 14 wherein the data import controller processes retrieved reference data to determine if it is lower quality or redundant in view of reference data already stored in the virtual storage device.

16. The system of claim 15 wherein the data import controller filters out redundant or lower quality retrieved reference data from entry in the virtual storage device.

17. The system of claim 13 wherein the computational resources include a cluster of computing resources; and the computational results further comprise a post processor, the post processor converts experimental data into useful forms which are relevant for the purpose and context of the research.

18. The system of claim 13 further comprising a laboratory information management system connected to the research data network and to one or more laboratory instruments, the laboratory information management system receiving experimental data from the laboratory instruments and providing that data to the research data server via the research data network.

19. The system of claim 18 further comprising a preprocessing server connected to the research data network, the laboratory information management server providing experimental data from the laboratory instruments to the preprocessing server which converts the experimental data into data which is useful and relevant for the research, the preprocessing server providing the converted data to the research data server via the research data network.

20. The system of claim 13 further including a knowledge management server connected to the research data network and operable to identify and provide a researcher with reference data and/or experimental data and results from the research data server and the reference data access server in accordance with queries made by the researcher.

21. The system of claim 20 wherein a researcher can create policies defining data types of interest to the researcher, and the knowledge management server, in accordance with the defined policy, identifies and provides reference data and experimental data and results of interest to the researcher.

22. The system of claim 13 further including a research application server connected to the research data network, the research application server providing at least one software application and/or tool required by researchers, the at least one application and/or tool operating on data stored in said virtual storage device in accordance with instructions from the researchers.

23. A method of managing research conducted by researchers, said method comprising the steps of: creating a set of policies defining external reference information relevant to the research; retrieving, at predefined intervals, external reference information in accordance with the policies; comparing the retrieved information with reference data stored in a reference data server to determine if the retrieved information is redundant or of lower quality than data already stored in the reference data server and storing retrieved information which was determined to be non-redundant and/or of acceptable better quality in the reference data server; storing experimental data from at least one laboratory instrument in a research data server; and providing the researchers with access to the stored information in the reference data server and to experimental data in the research data server.

24. The method of claim 23, further comprising the step of the researchers defining a set of data storage policies for a virtual storage device including both online and near line storage capacity to store the data of the reference data server and the research data server, and moving the stored data between online storage capacity and near line storage capacity in accordance with the data storage policies.

25. The method of claim 23 further comprising the step of preprocessing the experimental data from the at least one laboratory instrument and storing the preprocessed experimental data in the research data server.

26. The method of claim 23 further comprising the step of publishing information to researchers by having the researchers identify to a knowledge management server information of interest to them and the knowledge management server examining the contents of the reference data server and the research data server to identify the information of interest to a researcher and the knowledge management server making the identified information available to the researcher.

27. The method of claim 23 further comprising the step of verifying the authenticity and authority of each researcher to access stored experimental data and stored reference data before providing that access.

28. A method of managing data for research, said method comprising the steps of: providing a plurality of researchers with access to a research data network; creating reference data policies defining for each of said researchers types of reference data that will be of use to the researcher, creating experimental data policies defining for each of said researchers types of experimental data and results that will be of use to the researcher and storing these policies on the research data network; retrieving, at defined intervals, from data sources outside the research data network, external reference data as defined by the reference policies; examining the retrieved reference data to determine if it is redundant in view of reference data already stored on the research data network or if it is of better quality than reference data already stored on the network and storing retrieved reference data on the research data network which has been determined to be non-redundant or of better quality than reference data already stored; collecting experimental data from laboratory instruments through the research data network and storing the collected data on the research data network; and publishing new reference data and experimental data to researchers according to the reference data policies and experimental data policies defined for the researchers.

29. The method of claim 28 wherein the reference data and the experimental data are stored in a virtual storage device on the research data network, the virtual storage device having both online and near line data storage capabilities and the research team having predefined a storage policy executed by the virtual storage device to transfer stored data between the online storage and the near line storage.

30. The method of claim 28 further comprising the step of at least one researcher processing published experimental data to obtain experimental results which are stored on the research data network, the stored experimental results also being subsequently published in the publishing step to researchers in accordance with experimental data policies of researchers.

31. A computer program product to manage data for research conducted by researchers, said computer program product comprising: a computer readable medium; first program instructions executable to provide the researchers with access to a research data network; second program instructions to retrieve, at defined intervals, from data sources outside the research data network, external reference data as defined by reference policies created by said researchers; third program instructions to examine the retrieved reference data to determine if it is redundant in view of reference data already stored on the research data network or if it is of better quality than reference data already stored on the network and to store retrieved reference data on the research data network which has been determined to be non-redundant or of better quality than reference data already stored; fourth program instructions to store experimental data from at least one laboratory instrument in a research data server; and fifth program instructions to provide the researchers with access to the stored information in the reference data server and to experimental data in the research data server; and wherein said first, second, third, fourth and fifth program instructions are recorded on said medium.

32. A computer program product according to claim 31, further including sixth program instructions to implement a set of data storage policies for a virtual storage device including both online and near line storage capacity to move the stored data between online storage capacity and near line storage capacity in accordance with the data storage policies; and wherein said sixth program instructions are recorded on said medium.

Description

BACKGROUND OF THE INVENTION

[0001] The present invention relates generally to computer management of data and related research results. More specifically, the present invention relates to computer management of data and related research in life science fields.

[0002] Modern life sciences research, such as pharmaceutical research, typically requires applied, iterative, parallel research across many technical disciplines. Modern pharmaceutical research typically involves researchers from biology, genetics, chemistry, clinical and pathology disciplines. The research process is typically iterative, with the results from one discipline being supplied to another discipline, etc., with each discipline analyzing and processing the supplied and other data. Heretofore, there have been inadequate computer systems and methods for collaboration between researchers in the different disciplines, and management of the overall process. These problems are exacerbated when large amounts of data are generated and must be transformed, translated, reorganized, analyzed or otherwise processed as the data moves between disciplines and/or research teams.

[0003] An object of the present invention is to provide an improved, comprehensive system, method and program product for collaboration among researchers and management of data and related research results.

[0004] Another object of the present invention is to provide a system, method and program product of the foregoing type which is suited for development of pharmaceuticals and other medical therapies.

SUMMARY OF THE INVENTION

[0005] The invention resides in a system, method and program product for managing data for researchers. Research data is automatically received from laboratory instruments. Established reference data is accessed from a database. Recently available reference data is automatically obtained. Experimental data is accessed from a database. There are a plurality of applications to process respective types of the data. In response to a request by a researcher to perform a data processing function, one or more of the processing applications are invoked and supplied with parameters to perform the data processing function. The one or more applications automatically access types of the data required to perform the respective data processing function.

[0006] According to features of the present invention, the determination of which of the processing applications to invoke and which parameters to supply to these processing applications can be based on a type of function requested by the researcher. The determination of the identities of files containing the data required by the one or more applications can be based on a type of function requested by the researcher. There can also be an application to analyze patterns in respective types of the data, and one of the processing applications receives from the pattern analyzing application a pattern used to perform the requested data processing function. There can also be a program for determining if available data is valid. If not, the available data is not used for the one or more processing applications. If so, the available data is used for the one or more data processing applications. There can also be a program for formatting results of the one or more data processing applications to correspond to respective types of data processing requests.
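
As an illustration only, and not part of the application as filed, the selection of processing applications, parameters and data files based on the type of function requested could be sketched as a lookup table. The function types, application names, parameters and file names below are hypothetical placeholders.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Invocation:
    """One processing application to run, with its parameters and input files."""
    application: Callable[..., str]
    parameters: Dict[str, object] = field(default_factory=dict)
    data_files: List[str] = field(default_factory=list)


def align_sequences(files, **params):
    # Hypothetical stand-in for a real sequence-alignment application.
    return f"aligned {len(files)} file(s) with {sorted(params.items())}"


def summarize_assay(files, **params):
    # Hypothetical stand-in for a real assay-summary application.
    return f"summarized {len(files)} file(s) with {sorted(params.items())}"


# Hypothetical mapping: requested function type -> applications, parameters, files.
DISPATCH_TABLE: Dict[str, List[Invocation]] = {
    "sequence_comparison": [
        Invocation(align_sequences, {"max_gaps": 5},
                   ["reference.fasta", "sample.fasta"]),
    ],
    "assay_report": [
        Invocation(summarize_assay, {"units": "nM"}, ["assay_results.csv"]),
    ],
}


def handle_request(function_type: str) -> List[str]:
    """Invoke every application registered for the requested function type."""
    if function_type not in DISPATCH_TABLE:
        raise ValueError(f"unknown function type: {function_type}")
    return [inv.application(inv.data_files, **inv.parameters)
            for inv in DISPATCH_TABLE[function_type]]
```

In this sketch the researcher names only the function type; which applications run, with which parameters and on which files, is resolved from the table.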

[0007] According to another embodiment of the present invention, there is provided another system for managing data for researchers. This other system comprises a virtual storage device including online and near line storage and having policies predefined for moving stored data between the online and near line storage. The system also comprises a research data server for receiving and managing experimental data and research data and results from the researchers and operating with the virtual storage device to maintain the experimental data and research data and results. The system also comprises a reference data access server receiving and managing external reference data relating to the research and operating with the virtual storage device to maintain the external reference data. The system also comprises computational resources for the researchers to capture, process and analyze experimental data to obtain results. The system also comprises a research data network connecting the virtual storage device, research data server, reference data access server and the computational resources to allow transfer of data there between. The research data network further includes security management services to authenticate and authorize access by the researchers to the system.

[0008] This other system may also include a data import controller connected to one or more public data networks (e.g., the Internet) as well as to the research data network. The data import controller is operable to retrieve external reference data from data sources external to the research data network according to one or more policies predefined by the researchers for retrieving external reference data. Also, the computational resources may include a high performance computing server comprising a cluster of homogeneous or hybrid computing resources. Also, the system may include a laboratory information management system connected to the research data network and to one or more laboratory instruments. The laboratory information management system receives experimental data from the laboratory instruments and provides that data to the research data server via the research data network.

[0009] According to another aspect of this other embodiment of the present invention, there is provided a method of managing data for research. A set of policies defining external reference information relevant to the research program is created. At predefined intervals, external reference information in accordance with the policies is retrieved. The retrieved information is compared with reference data stored in a reference data server to determine if the retrieved information is redundant or of lower quality than data already stored in the reference data server. The retrieved information which was determined to be non-redundant and/or of acceptable better quality is stored in the reference data server. The experimental data from optionally one or more laboratory instruments is stored in a research data server. The researchers are provided with access to the stored information in the reference data server and to experimental data in the research data server.
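
The comparison step of this method can be illustrated with a minimal sketch. It assumes, purely for illustration, that each reference record carries an identifier and a numeric quality score, that a record is redundant when a stored record with the same identifier has identical content, and that a higher score means better quality. None of these representations is specified in the application.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class ReferenceRecord:
    record_id: str   # hypothetical identifier, e.g. an accession number
    content: str
    quality: float   # hypothetical quality score; higher is better


def filter_and_store(store: Dict[str, ReferenceRecord],
                     retrieved: Iterable[ReferenceRecord]) -> List[ReferenceRecord]:
    """Store retrieved records that are non-redundant and not of lower quality.

    Returns the records that were accepted into the store.
    """
    accepted = []
    for record in retrieved:
        existing = store.get(record.record_id)
        if existing is not None:
            if record.content == existing.content:
                continue  # redundant: identical content is already stored
            if record.quality <= existing.quality:
                continue  # not better than the stored record
        store[record.record_id] = record
        accepted.append(record)
    return accepted
```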

[0010] According to yet another aspect of the present invention, there is provided a method of managing data for research. Researchers are provided with access to a research data network. Reference data policies, which define for each researcher the types of reference data that will be of use to the researcher, and experimental data policies, which define for each researcher the types of experimental data and results that will be of use to the researcher, are created and stored on the research data network. At defined intervals, from data sources outside the research data network, external reference data as defined by the reference policies is retrieved and examined to determine if it is redundant in view of reference data already stored on the research data network or if it is of better quality than reference data already stored on the network. The retrieved reference data which has been determined to be non-redundant or of better quality than reference data already stored is stored on the research data network. Experimental data is collected from laboratory instruments through the research data network and stored on the research data network. New reference data and experimental data are published to researchers according to the reference data policies and experimental data policies defined for the researchers.
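
The publishing step can be sketched as follows, assuming for illustration that each researcher's policy is simply a set of data type names and that each new item is tagged with one type. The researcher names, type names and item names used with this sketch are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def publish(new_items: Iterable[Tuple[str, str]],
            policies: Dict[str, Set[str]]) -> Dict[str, List[str]]:
    """Deliver each new item to every researcher whose policy lists its type.

    new_items: (item name, data type) pairs.
    policies:  researcher -> set of data types of interest.
    Returns researcher -> list of item names published to that researcher.
    """
    inbox: Dict[str, List[str]] = defaultdict(list)
    for name, data_type in new_items:
        for researcher, wanted_types in policies.items():
            if data_type in wanted_types:
                inbox[researcher].append(name)
    return dict(inbox)
```

A researcher whose policy matches none of the new items receives nothing and does not appear in the result.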

[0011] According to yet another aspect of the present invention, there is provided a computer program product stored on a computer readable medium to manage data for research. First program instructions provide the researchers with access to a research data network. Second program instructions retrieve, at defined intervals, from data sources outside the research data network, external reference data as defined by reference policies created on the computer by researchers. Third program instructions examine the retrieved reference data to determine if it is redundant in view of reference data already stored on the research data network or if it is of better quality than reference data already stored on the network, and store retrieved reference data on the research data network which has been determined to be non-redundant or of better quality than reference data already stored. Fourth program instructions store experimental data from optionally one or more laboratory instruments in a research data server. Fifth program instructions provide the researchers with access to the stored information in the reference data server and to experimental data in the research data server.

BRIEF DESCRIPTION OF THE DRAWINGS

[0012] FIG. 1 is a block diagram of a computer system in accordance with the present invention.

[0013] FIGS. 2a, 2b, 2c, 2d and 2e show control and data flows between components of the system of FIG. 1 when retrieving and processing external data and publications.

[0014] FIGS. 3a, 3b, 3c and 3d show control and data flows between components of the system of FIG. 1 when collecting experimental data from laboratory instruments.

[0015] FIGS. 4a, 4b and 4c show control and data flows between components of the system of FIG. 1 when performing pattern matching and pattern recognition on experimental data.

[0016] FIGS. 5a, 5b, 5c, 5d and 5e show control and data flows between components of the system of FIG. 1 when performing data analysis and/or result generation of experimental data.

[0017] FIGS. 6a and 6b show control and data flows between components of the system of FIG. 1 when publishing results to an internal researcher.

[0018] FIGS. 7a and 7b show control and data flows between components of the system of FIG. 1 when publishing results to an external researcher or external sites.

[0019] FIG. 8 is a flow diagram showing an example of the present invention.

[0020] FIG. 9 is a flow chart showing a function of a research data management program within an application server system of FIG. 1.

[0021] FIG. 10 is a flow chart showing a function within a reference data access server within the system of FIG. 1 to validate new external data.

DETAILED DESCRIPTION OF THE INVENTION

[0022] While the following embodiment illustrates use of the present invention for pharmaceutical research, the present invention has other embodiments and uses as well. For example, the present invention can be employed in research for pharmaceuticals, treatments, diagnostics, non-drug treatment protocols and preventatives, and other sciences.

[0023] Modern life sciences research, such as high level drug discovery and development, comprises a series of steps for acquiring and analyzing chemical and biological data, wherein processing is performed at each step. For example, in the high-level drug research areas, the key activities typically include (a) the collection of "assay" data generated by laboratory instruments, (b) searching for and obtaining external reference and research materials, (c) analyzing the consolidated assay data and the external reference materials, and (d) deriving knowledge through the analysis. These tasks are typically repeated, in a cyclical manner, by each discipline within the research team. The present invention assists in these key activities, providing automation, data management and multidisciplinary collaboration facilities between researchers in different disciplines.

[0024] In FIG. 1, a system in accordance with the present invention is indicated generally at 20. System 20 interfaces with external researchers 24, external data sources such as external reference data 28 and external published data 32, internal researchers 36 and laboratory instruments 40. As explained above, modern life sciences research often involves the exchange of information with various external researchers 24 who are not members of the primary research organization. Examples of such external researchers may be government funded researchers, such as researchers funded by the U.S. National Institutes of Health, researchers affiliated with Contract Research Organizations (CRO), etc. While FIG. 1 only shows a single icon for external researchers 24, the present invention supports interaction with multiple external researchers 24 at disparate geographic locations and at disparate organizations. Access to and use of external reference data 28 and external published data 32 is often an important part of a life sciences research effort.

[0025] External reference data 28 can comprise well known or established gene sequence and protein databases, either public or private, clinical data, etc. External published data 32 can include newly discovered genes or proteins, novel drug targets such as novel chemical entities or novel molecular entities, new insights in disease mechanisms, etc. The external published data 32 may reside in a known database, such as the PubMed database which is maintained by the National Center for Biotechnology Information. The relevant published data can be periodically identified and retrieved automatically by the data retrieval engine 56 by key word searching or author searching, based on predefined key words and authors.
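
A minimal sketch of the key word and author matching is given below. It operates on publication records already held in memory and makes no calls to PubMed or any other external service. The record fields and the case-insensitive substring matching rule are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, List, Set, Tuple


@dataclass(frozen=True)
class Publication:
    title: str
    abstract: str
    authors: Tuple[str, ...]


def matches_policy(pub: Publication, keywords: Set[str], authors: Set[str]) -> bool:
    """True if any predefined key word or author matches the publication."""
    text = f"{pub.title} {pub.abstract}".lower()
    keyword_hit = any(k.lower() in text for k in keywords)
    pub_authors = {a.lower() for a in pub.authors}
    author_hit = any(a.lower() in pub_authors for a in authors)
    return keyword_hit or author_hit


def select_relevant(pubs: Iterable[Publication],
                    keywords: Set[str], authors: Set[str]) -> List[Publication]:
    """Keep only publications that match the predefined key words or authors."""
    return [p for p in pubs if matches_policy(p, keywords, authors)]
```

Running such a filter on a schedule against newly listed publications would give the periodic retrieval described above; the scheduling itself is outside this sketch.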

[0026] Internal researchers 36 are primary drivers of the current research effort. While FIG. 1 only shows a single icon for internal researchers 36, the present invention supports interaction with multiple internal researchers 36 who can be part of different groups involved in the research effort and/or who can be located at disparate logical and/or geographic locations.

[0027] Laboratory instruments 40 can include any instrument useful to the research effort. In the life sciences field, such instruments can include gene sequencers, mass-spectrometers, crystallographic imaging devices, tomographic equipment, etc. Many such instruments are now robotic in nature and can directly interface with a laboratory information management system (LIMS) 44. An example of such an instrument is an ABI DNA sequencing system which directly interfaces to LIMS 44 and to one or more personal computers in the laboratory which provide for control and/or calibration of the device. Other instruments require manual operation and/or examination of their results by a technician or researcher, but these results are still provided to LIMS 44 for the assays. LIMS 44 are well known in the life sciences field and can be custom designed for a laboratory or can be purchased commercially, as desired.

[0028] As shown in FIG. 1, LIMS 44 is connected to instruments 40 by a research data network 48 which also connects the other components of system 20. Network 48 is not particularly limited; all components of system 20 should be able to communicate as necessary and, preferably, at a sufficient speed to allow effective transfer of information between various nodes connected by the network. Network 48 can be, for example, (a) a homogeneous network of gigabit Ethernet links and network devices, (b) a heterogeneous network of, for example, fiber optic, gigabit Ethernet and ATM links with appropriate bridges, protocol converters, etc., (c) a private network, operated and maintained for or by a research organization, or (d) a mix of private and public network portions separated and/or protected as required by appropriate firewalls, domain controllers for directory, security and system management services and encryption/decryption engines.

[0029] The data stored and used by system 20 is classified as follows for subsequent processing by research applications, as explained below. The external reference data 28 is classified by type based on the data file in which the external reference data resides. When a unit of external reference data 28 is identified as a candidate to be included in a data file, a data manager (person) determines the type of the candidate external reference data and stores it in the data file that is earmarked for this type of data. (Alternatively, a program tool can search each unit of external reference data by key word, and classify the external reference data based on the results of the key word search.) The web retrieval server 56 classifies each item of external published data 32 by type based on key words found in the publication. The type of data obtained from the LIMS is based on the type of instruments that are generating the data. Each of the classification types is one of a multiplicity of predetermined types. These types were predetermined by one or more scientists with expertise in such classification and in the research applications that will need the data. As explained in more detail below, each research application will need and fetch certain types of data to be processed on behalf of a researcher. Each data item can also be accompanied by a header file which indicates whether the data needs to be preprocessed (such as by preprocessing server 92 shown in FIG. 1), whether the data needs to be routed to a data mining application (such as data mining application 290 shown in FIG. 8), and the format of the data needed for the type of end user/researcher.
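By way of illustration only, the key word classification and the accompanying header file described above might be sketched as follows. The type names, the key words, the header fields and the rule that routes microarray data to data mining are all assumptions made for this example; the specification does not enumerate them.

```python
from dataclasses import dataclass

# Hypothetical table of predetermined types and their key words (illustrative only).
TYPE_KEYWORDS = {
    "microarray": {"microarray", "hybridization", "expression"},
    "pathway": {"pathway", "kinase", "signaling"},
    "clinical": {"patient", "symptom", "diagnosis"},
}


@dataclass
class Header:
    """Hypothetical header accompanying a data item."""
    data_type: str
    needs_preprocessing: bool
    route_to_data_mining: bool
    output_format: str


def classify(text: str) -> str:
    """Return the predetermined type whose key words occur most often in text."""
    words = [w.strip(".,;:()").lower() for w in text.split()]
    scores = {t: sum(w in kws for w in words) for t, kws in TYPE_KEYWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unclassified"


def make_header(text: str, raw: bool = False, fmt: str = "table") -> Header:
    """Build a header for an item; the routing rule here is an assumption."""
    data_type = classify(text)
    return Header(
        data_type=data_type,
        needs_preprocessing=raw,
        route_to_data_mining=(data_type == "microarray"),
        output_format=fmt,
    )
```

An item that matches no key word is returned as "unclassified" so that a data manager can assign its type manually.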

[0030] System 20 also includes a virtual storage device 52. As mentioned above, life sciences research can produce large quantities (petabytes or exabytes) of data. System 20 employs virtual storage device 52 to facilitate the handling of this data. Virtual storage device 52 comprises a collection of online storage devices, such as disk drives, solid state drives, etc. and a variety of near line storage devices, such as robotic tape libraries, etc. which can retrieve requested data within about one minute. Virtual storage device 52 employs a set of policies to manage the storage of research data sets between the online and near line storage subsystems. Such policies can employ strategies such as automatic migration of aged data from online to near line storage, heuristic migration based upon determined usage patterns for the data, etc. Because storage device 52 is virtual, it can be scaled as necessary by adding more storage devices. Also, it is transparent to a user whether desired data is stored in online or near line storage, although in the case of data stored on near line storage, the user may experience a slight delay in access. By way of example, virtual storage device 52 for current research efforts has at least several terabytes of online storage and several petabytes of near line storage. An example of a suitable virtual storage device 52 would be one or more IBM Enterprise Storage Servers and one or more IBM LTO UltraScalable Tape Libraries. In system 20, virtual storage device 52 stores all research data relating to a research effort, although local copies of smaller data sets can be maintained by researchers. By employing virtual storage device 52 across system 20, quality, integrity, security, privacy and availability of research data is assured.
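As a minimal sketch of the storage policies mentioned above, the following shows one age-based rule and one usage-based rule for moving a data set from online to near line storage. The class, field names and thresholds are hypothetical and are not taken from the specification.

```python
from dataclasses import dataclass


@dataclass
class DataSet:
    """Hypothetical record describing one stored research data set."""
    name: str
    age_days: int
    reads_last_30_days: int
    tier: str = "online"


def migrate_aged(ds: DataSet, max_age_days: int = 365) -> DataSet:
    """Age-based policy: move data older than max_age_days to near line storage."""
    if ds.tier == "online" and ds.age_days > max_age_days:
        ds.tier = "near line"
    return ds


def migrate_by_usage(ds: DataSet, min_reads: int = 1) -> DataSet:
    """Usage-based heuristic: move data that is rarely read to near line storage."""
    if ds.tier == "online" and ds.reads_last_30_days < min_reads:
        ds.tier = "near line"
    return ds


def apply_policies(data_sets):
    """Apply both policies to every data set and return the same list."""
    for ds in data_sets:
        migrate_by_usage(migrate_aged(ds))
    return data_sets
```

Because the tier is only an attribute of the record, a caller addresses a data set by name in the same way regardless of where it currently resides, which mirrors the transparency described above.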

[0031] Data retrieval engine 56 operates with a web retrieval engine 60 to retrieve (based on predefined key word search and predefined author search policies) and process desired external reference data over public networks such as the Internet 68. Specifically, a data retrieval engine 56 in the form of a data import controller uses these policies established by the research team to have web retrieval engine 60 retrieve external data. Web retrieval engine 60 processes the policy-driven requests from data import controller 56 to automatically retrieve predefined external reference data through the Internet 68, or other networks, via appropriate protocol and/or data format converters. Policies for web retrieval engine 60 can include regularly scheduled searches of specific databases, identification and retrieval of updated versions of previously retrieved data, searches for new data sources, etc. Examples of suitable web retrieval engines are the IBM WebSphere software platform or Apache Software Foundation web server integrated with the IBM WebSphere platform. Web retrieval engine 60 can use any appropriate computer program to retrieve the desired external references, such as ftp for document transfers, SQL queries for database searches, etc. In one embodiment of the invention, web retrieval engine 60 includes local storage where retrieved information is temporarily stored, for subsequent processing by data import controller 56.

[0032] B2B engine 64 includes a web server and operates to make data from system 20 available to external researchers 24. Examples of suitable B2B engines 64 also are the IBM WebSphere software platform or an Apache Software Foundation server integrated with the IBM WebSphere platform. As discussed below, system 20 includes security management services which operate to limit the data that can be accessed by an external user 24.

[0033] As illustrated in FIG. 1, both web retrieval engine 60 and B2B engine 64 are located in a "demilitarized zone" (DMZ) 70 and are separated from network 48 and public networks such as Internet 68 by a protocol firewall 72 and a domain firewall 76. Protocol firewall 72 operates as a first line of isolation and acts to control the direction of data flows between network 48 and public networks such as Internet 68 and filters the data traffic flow based upon source addresses, destination addresses and enabled ports. Domain firewall 76 acts as a second line of isolation and establishes the DMZ between the trusted internal network 48 and external networks, such as Internet 68. Protocol firewalls 72 and domain firewalls 76 and other techniques for establishing DMZs are well known and will not be further discussed herein. In a similar manner, system 20 connects internal researchers 36 through an internal DMZ formed by a protocol firewall 80, a web server 84 and domain firewall 88. Examples of suitable web engines are the IBM WebSphere software platform or an Apache Software Foundation server integrated with the IBM WebSphere platform.
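The filtering performed by a protocol firewall on source addresses, destination addresses and enabled ports can be illustrated with the following sketch. The rule structure and the addresses shown are assumptions chosen for the example; an actual firewall would be configured with its own rule syntax.

```python
import ipaddress
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """Hypothetical allow rule: source network, destination network, enabled ports."""
    source: str
    destination: str
    ports: frozenset


def allowed(rules, src: str, dst: str, port: int) -> bool:
    """Return True if any rule permits traffic from src to dst on port."""
    s = ipaddress.ip_address(src)
    d = ipaddress.ip_address(dst)
    return any(
        s in ipaddress.ip_network(r.source)
        and d in ipaddress.ip_network(r.destination)
        and port in r.ports
        for r in rules
    )


# Example: only hosts on an internal 10.0.0.0/8 network may open outbound web sessions.
RULES = [Rule("10.0.0.0/8", "0.0.0.0/0", frozenset({80, 443}))]
```

Traffic that matches no rule is denied, so a session that originates outside the listed source network is rejected even when the port is enabled.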

[0034] System 20 also includes a preprocessing server 92 which comprises one or more computer systems. Preprocessing server 92 operates on the raw data provided by LIMS 44 to convert that data into data which is chemically, biologically, clinically etc. useful and relevant for the purpose and context of the research. Depending upon the nature of the assay and the devices performing the assay, this preprocessing can include data filtering, data normalization, data validation, etc. For example, preprocessing server 92 can filter data by removing a data set with key missing values, or data with noisy or statistically improbable values, such as long nucleotide segments in which all bases are identical. For example, preprocessing server 92 can normalize data by establishing a common scale or set of units for comparing disparate data, such as by multiplying the data by a constant to make the maximum value in each set precisely 1.0. For example, preprocessing server 92 can invalidate data which falls outside of certain expected data ranges, or is inherently invalid, such as ovarian cancer found in a man. Preprocessing server 92 can also assign a higher reliability to data which has previously been reviewed and annotated by a researcher.
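The three preprocessing examples given above (filtering, normalization and validation) can be sketched as follows. The function names, the record fields and the run-length threshold are hypothetical; the sketch assumes non-negative measurement values.

```python
def is_low_complexity(seq: str, max_run: int = 20) -> bool:
    """Filter: True if seq contains a run of identical bases longer than max_run."""
    run = 1
    for prev, cur in zip(seq, seq[1:]):
        run = run + 1 if cur == prev else 1
        if run > max_run:
            return True
    return False


def normalize(values):
    """Normalize: multiply by a constant so that the maximum value becomes 1.0."""
    peak = max(values)
    if peak <= 0:
        raise ValueError("normalization assumes a positive maximum value")
    return [v / peak for v in values]


def is_valid(record: dict, expected_range) -> bool:
    """Validate: reject missing, out-of-range or inherently inconsistent records."""
    low, high = expected_range
    # Inherently invalid combination used as the example in the text.
    if record.get("diagnosis") == "ovarian cancer" and record.get("sex") == "male":
        return False
    value = record.get("value")
    return value is not None and low <= value <= high
```

For instance, `normalize([2.0, 4.0, 1.0])` yields `[0.5, 1.0, 0.25]`, so that data sets recorded on different scales can be compared directly.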

[0035] System 20 includes an application server system 96 which includes a high performance computing server (HPCS). In a presently preferred embodiment, the HPCS comprises a high performance computing cluster, such as a Linux cluster of high speed processors, as this allows a large amount of available computing resources to be scaled appropriately, as needed, by adding or removing computing processors to and from the cluster. Gene multiple sequence alignment and/or protein folding are just a few examples of research activities which can require large amounts of computing resources. The HPCS is capable of serving as a back end processing resource for several research applications. As explained in more detail below, application server system 96 also includes a data mining application 290 and research applications 282, 284 and 286.

[0036] System 20 also includes a post processor 100 to operate on results produced in application server system 96, or elsewhere, to convert resultant data into chemically, biologically, etc. useful forms which are relevant for the purpose and context of the research. This post processing can comprise data clustering, annotation, classification, presentation, etc. and can be performed by various applications.

[0037] System 20 also includes a knowledge management server (KMS) 104. KMS 104 provides the researchers with access to relevant biological, chemical and/or clinical information. The functions provided by KMS 104 can include, without limitation, data mining, ad hoc queries, statistical analysis, report generation, decision support, etc. An example of a suitable knowledge management server 104 can include an IBM Information Management for Scoring, Visualization, Modeling and Mining.

[0038] System 20 also includes a research application server (RAS) 108. RAS 108 runs a number of research applications, data mining applications and/or other tools required by researchers. These applications, data mining applications and/or tools can include, without limitation: an NCBI Basic Local Alignment Search Tool ("Blast") program, multiple sequence alignment tools, gene expression applications, and applications for protein structure and function prediction. The Blast program is a search tool to determine the similarity of a given nucleic acid or protein sequence to thousands of other sequences in databases, such as NCBI databases. The multiple sequence alignment tools assist in deducing the function of new proteins, assisting in answering other biological questions such as the evolution and/or phylogenic relationship of the protein. The gene expression applications permit interactive retrieval and analysis of gene expression data with spotted microarray, high density oligonucleotide array, hybridization filtering, serial analysis of gene expression data and other techniques. The applications for protein structure and function prediction include primary sequence alignment, secondary and tertiary structure prediction methods, homology modeling and crystallographic diffraction pattern analysis, etc. As will be apparent to those of skill in the art, many other applications and/or tools can be employed in system 20, and RAS 108 can provide for centralized maintenance and control of these tools.

[0039] System 20 also includes a reference data access server 112 and a research data server 116 which operate with virtual storage device 52 and data import controller 56. Reference data access server 112 allows researchers to access reference data, both external data and internal data, for ad hoc queries against a virtual database through federated access to the data sources. By way of example, the virtual database system in reference data access server 112 can be the IBM DiscoveryLink middleware application and the IBM DB2 Universal Database, although other suitable techniques and/or applications can be used. The DiscoveryLink application allows an ad hoc query against multiple data sources in a single request and provides a single response, regardless of geographic locations of data sources, types, formats, schemas and operating platforms, network protocols, etc. External reference data (for example, genome, EST, protein and/or clinical databanks) can be consolidated, using replication, into one logical location to mitigate accessing external reference data through slow, external links such as Internet 68. Using local replicas of reference data provides significant advantages over using the original external sources, although provision must be made to maintain the currency of the external data and costs are incurred in providing the storage space for the local replicas. However, these issues are addressed via data import controller 56 and virtual storage device 52, as described above.

[0040] Research data server 116 allows various research applications to access consolidated research and/or experimental data. Examples of such research and/or experimental data include microarray data and serial analysis of gene expression data. Experimental data typically results from experiments performed by the same organization that employs the researcher who uses application server system 96. When a computer or other device generates the experimental data, the computer or other device automatically populates a database with the experimental data based on a configuration file within the computer. Examples of suitable research data servers 116 include the IBM Enterprise Storage System and the IBM Hierarchical Storage Management solutions.

[0041] In addition to the nodes, servers and other devices described above, system 20 also includes the following shared system services. Directory management provides naming services to registered entities (e.g., users, applications, other resources, etc.). Security management provides services to protect assets and resources such as user/entity identification and validation/authentication, access control, privacy protection and security audit functions. System management, in conjunction with client software running on managed devices/nodes, provides management services such as problem alerts/reports, performance monitoring, software distribution, data backup and recovery, etc. Storage management, in conjunction with virtual storage device 52, provides integrated, consolidated and reliable data storage for reference data access, research data and experimental data.

[0042] Examples of the operation and use of system 20 in aspects of a research program will now be described.

[0043] FIGS. 2a through 2e show an example of system 20 retrieving external data 28 and external published information 32 via Internet 68. In the first step, shown in FIG. 2a, data import controller 56, in accordance with the data retrieval policies established by the research team using system 20, instructs web retrieval engine 60 to retrieve the information. In FIG. 2b, web retrieval engine 60, which runs in external DMZ 70, retrieves the information from predefined sites on the Internet, such as GenBank, etc. and stores the retrieved information in the temporary local storage of web retrieval engine 60. For security, the session is initiated for an outbound session only and the data flows are inbound only through protocol firewall 72 and domain firewall 76. The retrieved information is not particularly limited and can include gene data, protein data, documents, abstracts, chemical data, etc. and will include both structured data (protein database) and unstructured data (academic papers/journals). If the nature of the retrieved information and data requires it, web retrieval engine 60 can scan for the presence of computer viruses and/or otherwise check the retrieved information and data for security issues. When the data has been retrieved and placed into temporary storage, web retrieval engine 60 forwards the information to data import controller 56, as shown in FIG. 2c. Data import controller 56 parcels the data into smaller and relevant data sets and then sends the data to the reference data access server 112 for quality checks, as shown in FIG. 2d.

[0044] As illustrated in FIG. 10, a program 400 running on the reference data access server 112 compares the retrieved, partially processed, data to the authoritative, non-redundant data sets stored in virtual storage device 52 (step 402). If newly collected data is already in the authoritative data sets (decision 403, yes branch), the program 400 checks the quality of the newly collected data to determine whether the newly collected data is superior to the existing data. "Superior" data is typically data which is input later in time, and not "out of bounds", i.e. is within constraints permitted for the data. For example, if the data states that the disease is ovarian cancer, and indicates that the patient is male, then the data is out of bounds. As another example, if the range of a certain biological chemical is one thousand ppm to two thousand ppm, and the measure for the new data is three thousand ppm, then it is out of bounds. If the newly collected data is superior (decision 405, yes branch), the existing data is replaced in virtual storage device 52 with the newly collected data and the meta data managed by the reference data access server 112 is updated accordingly (step 406). Otherwise, the newly collected data is deleted or marked inferior (step 407). Referring again to decision 403, no branch, where the newly collected data is not already in the authoritative data sets, a check is made whether the newly collected data is within bounds (decision 410). If so, then the meta data stored in reference data access server 112 is accordingly updated (step 406), and the new data will be stored in virtual storage device 52, as shown in FIG. 2e, whether unique or redundant, to allow researchers to revisit and/or confirm the retrieved data in the future, should the need arise. If the new data is out of bounds (decision 410, no branch), then it is discarded (step 407).
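The decision flow of program 400 can be sketched as follows, under stated assumptions: the authoritative data sets are modeled as a dictionary keyed by item, "later in time" is modeled by a hypothetical integer timestamp, and bounds are a simple numeric range. The update of the meta data in step 406 is represented only by the dictionary assignment.

```python
from dataclasses import dataclass


@dataclass
class Item:
    """Hypothetical data item; 'received' is an increasing timestamp."""
    key: str
    value: float
    received: int


def in_bounds(item: Item, bounds) -> bool:
    """True if the value lies within the constraints permitted for the data."""
    low, high = bounds
    return low <= item.value <= high


def reconcile(new: Item, authoritative: dict, bounds) -> str:
    """Apply the FIG. 10 flow to one item and report the action taken."""
    existing = authoritative.get(new.key)              # step 402: compare
    if existing is not None:                           # decision 403, yes branch
        superior = in_bounds(new, bounds) and new.received > existing.received
        if superior:                                   # decision 405, yes branch
            authoritative[new.key] = new               # step 406: replace and update
            return "replaced"
        return "marked inferior"                       # step 407
    if in_bounds(new, bounds):                         # decision 410
        authoritative[new.key] = new                   # step 406: store and update
        return "stored"
    return "discarded"                                 # step 407
```

With bounds of one thousand to two thousand ppm, a first reading of 1500 is stored, a later reading of 3000 for the same item is marked inferior, and a still later reading of 1800 replaces the stored value.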

[0045] FIGS. 3a through 3d show an example of collection of experimental data by system 20. In FIG. 3a, an internal researcher 36 interacts, via a personal computer or other interface device (not shown), with an appropriate web page served by web server 84 to input experimental conditions, experimental samples and other relevant information into LIMS 44. This session is authenticated and authorized by the security management services in system 20. As shown in FIG. 3b, the raw measured experimental data is provided to LIMS 44 from instruments 40, and LIMS 44 then merges the experimental data with the data input by the researcher 36. The merged data is then preprocessed by preprocessing server 92 to filter and normalize the data to obtain a useful data set, as shown in FIG. 3c. Referring to FIG. 3d, the filtered and normalized data set generated by reference data access server 112 is placed in research data server 116 and stored in virtual storage device 52, where it will be initially stored in online storage and eventually moved to near line storage, in accordance with the storage policies defined for virtual storage device 52 by the research team.

[0046] FIG. 8 illustrates an example of preprocessing performed by server 92. Data is input from three sources, i.e. (1) an automated LIMS 40 controlling automated, high-throughput production of microarrays from a large number of fractionated tissue extracts, (2) clinical data, particularly the presence and severity of disease symptoms, and their associated microarray data, typically from specific affected tissues and optionally other epidemiological data from researchers 36, and (3) a database 28 or 32 of biochemical pathways, optionally focusing on specific types of pathways of interest. For sources 36, 28 or 32, the reference data import may be either local or remote over the Internet. As explained above, each piece of data has been classified as one of a multiplicity of predetermined types. In this example, the three types of data are assigned a common representation. Microarrays and biomolecular pathways may be described by two-dimensional matrices, clinical data by annotations to such matrices, and the combination of the three by a relational database.

[0047] The raw microarray data 40 is collected into a local repository 270, then sent to preprocessor server 92 to filter out missing data, to check for errors such as smearing of the spots, and usually to perform a cluster analysis to group rows and columns that display similar colors and intensities. The result would typically be the standard Clustered Image Map (CIM) representation, which is stored in a reference repository. In parallel, clinical data 36 may be obtained in a standard representation, such as HL7 and CDISC and stored in a local repository 272. Also in parallel, one or more databases of biochemical interaction data 24, such as the STKE database, are accessed and stored in a local repository 274.

[0048] The data 24, 36 and 40 is automatically made available to the application server system 96, research application server 108 and post processor 100, which in the example of FIG. 8, have been consolidated into one server. These include a high performance computer system (HPCS) 292, and act on the data (as illustrated in FIG. 9). (The raw data 40 is provided to the application server system 96 via the preprocessor 92.) In a typical scenario, the research application server 108 and/or the application server system 96 would scan the reference microarray data repository for highly probable patterns. For example, a standard Hidden Markov algorithm in a data mining application 290 (within research application server 108) detects patterns ranked by their probabilities. Data mining application 290 then compares these patterns with the probable patterns detected in the clinical microarrays from the clinical samples. The concordance of these two types of data identifies what characteristic biomolecules are associated with a disease state. Next, a research application 282, 284 and/or 286 within the research application server 108 numerically compares biomolecular pathway data connecting differing biomolecules with the biomolecules identified from both sets of microarray data. This provides information on the disease mechanism, especially indicating related sets of biomolecules, any one of which may be targeted in treating a disease.
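The concordance step described above can be illustrated with a short sketch. It assumes that pattern detection has already produced, for each data source, a mapping from a biomolecule name to a pattern probability; the Hidden Markov analysis itself is not shown, and the probability threshold and the biomolecule names are hypothetical.

```python
def concordant(reference: dict, clinical: dict, min_probability: float = 0.8):
    """Return biomolecules whose patterns are highly probable in both data sources."""
    ref = {name for name, p in reference.items() if p >= min_probability}
    clin = {name for name, p in clinical.items() if p >= min_probability}
    return sorted(ref & clin)


# Illustrative inputs only; the names and probabilities are invented.
reference_patterns = {"geneA": 0.90, "geneB": 0.50, "geneC": 0.85}
clinical_patterns = {"geneA": 0.95, "geneC": 0.60, "geneD": 0.99}
```

Here only geneA exceeds the threshold in both sources, so it alone would be carried forward for comparison against the biomolecular pathway data.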

[0049] FIG. 8 also illustrates four types of end user researchers that use the processed output from the research application server 108, application server system 96 and post processor 100. Occasionally, the researchers will also access the raw data 40 as stored in data repositories 270, 272, 274 and 275. Each of these types of researchers interfaces to the application server system 96 via a research data management program 280. Research data management program 280 processes the queries made by the researchers for different types of processed and raw data. In response, the research data management program 280 invokes the corresponding applications 282, 284 and/or 286 and supplies them with the requisite query parameters and research data files in order to process the query and supply the researcher with the processed data in a form tailored to the type of researcher.

[0050] A microarray research technician 250 scans the output from the LIMS and the performance of the microarray instrumentation, for example by number of entries in the local repository or reference data repository. The research technician 250 also uses microarray pattern statistics output from the application server system 96. A molecular biologist researcher 252 obtains pattern and probability numerical values from the application server system 96, and then uses them to validate or extend the pathway data, using the identities of interacting biomolecules as obtained from the LIMS and external sources. A physician researcher 254 analyzes statistical correlation of microarray pattern probabilities with known disease states as a means of diagnosis and prognosis. Consequently, the physician uses primary trends and statistics of microarray patterns and clinical data output from the research application server 108 via the application server system 96 and access to the reference data repository of patterns generated by data mining application 290. A drug design researcher 256 uses clinical symptoms microarray statistics and their correlation with pathways output from research application server 108 via the application server system 96, and the following types of data output from the application server system 96: (a) names of biomolecules from clinical microarray data connected to disease state, (b) their probabilities of appearance in the microarrays (where these biomolecules have been identified using the reference data repository), and (c) probable molecular interactions to design a drug that will specifically interact with the molecules that support the disease. With this data, the pharmaceutical designer may then use standard drug design tools to develop a novel drug once these relevant biomolecule targets have been identified.

[0051] FIG. 9 illustrates the function of the research data management application 280 within the combined application server system 96, post processor 100 and research application server 108 in more detail. In step 300, research data management application 280 receives a data processing request from researcher 250, 252, 254 or 256. The request specifies one or more data processing functions to be performed by one or more of the research applications 282, 284 or 286. If more than one data processing function is needed, the request specifies the order in which the data processing functions should be performed. The request also specifies types of data processing results of interest, type or source of the research data that should be processed, and optionally, the age or range of data that should be used or that data should pass basic validity tests before being used. In response to the data processing request and based on tables, research data management application 280 determines what research application(s) 282, 284 and/or 286 are needed to perform the requested function, and the names of data files that contain the data needed to be processed by these research application(s) to return the processed data needed by the researcher (step 302). If the data processing request does not specify the type of data needed by the research application(s), the research data management application 280 consults a configuration file for the research application to determine the types of data needed for the research application to perform its function (step 302). Next, research data management application 280 determines if all of the data needed by the research application(s) to perform the requested function is available from data repositories within system 20 (decision 304).
This determination is made by checking the research data files correlated to the requisite data types, to determine if they contain valid data within the data age range, if any, specified by the researcher's data processing request. If not, then the research data management application 280 sends an error message to the researcher (step 305). (In response, the researcher can change the age, range or type parameters, and submit another data processing request.) If so, the research data management application 280 determines the parameters needed by each research application to perform the requested function (step 306). These parameters include the names of the data files that contain the requisite data, the data age range, if any, for data to be processed, identifiers for the specific data items within the data files (such as molecule or molecule family names), and specification of the requested function or functions (if the research application is capable of performing more than one function). This determination is based on information in the researcher's data processing request and a table which lists which data files contain which types of data. Then, the research data management program 280 invokes the target research application(s) and supplies the requisite parameters in a function request (step 308). These parameters include the sequence in which the applications are to be performed (the sequence possibly being computed based on intermediate results and possibly including iteration), and data to control the operation of the application, such as minimal probabilities of sequences resulting from a Blast search to be retained for further analysis. In response, the research application(s) performs its function. When the research application(s) performs its function, it may query the data mining application 290 for pattern information and analysis, as needed.
Also, if two or more research applications are invoked and executed, they may communicate with each other to collectively perform the requested function. After completing its processing, the research application(s) returns the results to the research data management program 280. As explained above, each item or data file may include an indication of the proper format of the report or processed data furnished to the researcher, based on a parameter specified in the researcher's function request. In such a case, the research data management program 280 will format the report or processed data accordingly (step 280). Then, the research data management program 280 returns the reformatted resultant data to the researcher that made the data processing request (step 310).
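The flow of FIG. 9 can be sketched as follows, as a simplified illustration rather than a definitive implementation. The three lookup tables, the application names, the file names and the modeling of a repository as a mapping from file name to data age in days are all assumptions. The report formatting step and any communication between research applications are omitted.

```python
# Hypothetical lookup tables consulted by the research data management application.
APP_FOR_FUNCTION = {"align": "app_282", "expression": "app_284"}
DATA_TYPES_FOR_APP = {"app_282": ["sequence"], "app_284": ["microarray"]}
FILES_FOR_TYPE = {"sequence": "seq.dat", "microarray": "array.dat"}


def handle_request(request: dict, repository: dict, apps: dict) -> dict:
    """Process one data processing request (steps 300 through 310, simplified)."""
    # Step 302: choose applications and data files, in the requested order.
    plan = []
    for function in request["functions"]:
        app_name = APP_FOR_FUNCTION[function]
        types = request.get("data_types") or DATA_TYPES_FOR_APP[app_name]
        files = [FILES_FOR_TYPE[t] for t in types]
        plan.append((app_name, function, files))

    # Decision 304: is every required file present and within the age range?
    max_age = request.get("max_age_days")
    for _, _, files in plan:
        for name in files:
            age = repository.get(name)
            if age is None or (max_age is not None and age > max_age):
                return {"error": f"data unavailable: {name}"}   # step 305

    # Steps 306 and 308: build parameters and invoke each application in sequence.
    result = None
    for app_name, function, files in plan:
        params = {"files": files, "max_age_days": max_age,
                  "function": function, "previous": result}
        result = apps[app_name](params)

    return {"result": result}                                   # step 310
```

In this sketch `apps` maps each application name to a callable, so a researcher's request for the "align" function with a thirty-day age limit succeeds only if `seq.dat` is present in the repository and is no more than thirty days old.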

[0052] One of the important activities in research programs is to compare experimental data to reference data, to interpret the experimental data to obtain results, and to store those results with the experimental data. FIGS. 4a, 4b and 4c show an example of such data "fitting" within system 20. It is assumed, for the purposes of this example, that the data fitting is performed through asynchronous interfaces/services as data fitting can be computationally complex and is typically implemented as a batch process which does not require user intervention. However, it is important to note that system 20 is not limited to such batch processes, and system 20 can support interactive data fitting and/or examination, such as by data visualization tools, etc. with researchers 36.

[0053] In FIG. 4a, reference data access server 112 provides RAS 108 with reference data from virtual storage device 52. This reference data will be used to interpret and analyze experimental data. Research data server 116 provides experimental data that has been appropriately preprocessed to RAS 108 from virtual storage device 52. Next, in FIG. 4b, RAS 108 employs computing services of application server system 96 to perform high speed pattern matching and recognition, or other techniques such as statistical analysis, etc. to best interpret the experimental data. Finally, as shown in FIG. 4c, the obtained results are stored in virtual storage device 52, via research data server 116.

[0054] Another important activity in research is the analysis of data and generation of results. An example of this activity, employing system 20, is shown in FIGS. 5a through 5e. In FIG. 5a, an internal researcher 36 employs a Web Browser interface, provided by web server 84, to interact with RAS 108. If researcher 36 determines that further analysis of a data set is merited, the data set to be analyzed is provided to RAS 108 from virtual storage device 52, via research data server 116, as shown in FIG. 5b. RAS 108 employs the HPCS to process the retrieved data set, using the analytic tools selected by researcher 36 via the Web Browser, as shown in FIG. 5c. If desired, or required, researcher 36 can have RAS 108 post process any results, via post processor 100 and the HPCS, as shown in FIG. 5d. The processed data, such as the conclusions, researcher annotations, etc. are then stored in virtual storage device 52, via research data server 116 with the original data sets as shown in FIG. 5e.

[0055] In some cases, the results of the research may be published to internal researchers 36 on the research team, and external researchers 24 and/or external databases. As used herein, the term "publishing" is intended to include the act of making information available for subsequent review, and this would include placing research results into a database which can be externally accessed, publishing scientific articles in journals, etc. and "pushing" results to researchers or institutions, etc. which have previously subscribed or otherwise indicated an interest in the results. Internal publishing can be synchronous, with internal researchers 36 accessing results and other information as it becomes available, while external publication can be either synchronous, such as when an external researcher 24 accesses results through knowledge management server 104, or asynchronous, such as when an external researcher 24 accesses an external replica of the published data held in an external database.

[0056] FIGS. 6a, 6b, 7a and 7b show examples of the publishing of research results and other information with system 20. Publication to an internal researcher 36 is illustrated in FIG. 6a, wherein researcher 36 uses a Web Browser interface, provided by web server 84, to interface with knowledge management server 104. KMS 104 has access to all federated information and results, the federated information and results being appropriately indexed, in virtual storage device 52 and throughout system 20. As internal researcher 36 commences his or her interaction with KMS 104, the security management services authenticate the authority of the internal researcher to access the results and/or any research applications internal researcher 36 requests. FIG. 6b shows the requested results being provided to internal researcher 36.

[0057] In FIG. 7a, an example of an asynchronous "push" of results is shown. In this example, an external researcher 24 or other external data user, such as a public or commercial database, has previously indicated an interest in research results and this interest has been authenticated and approved by the above-described security management services. Accordingly, when KMS 104 determines that results or information available in virtual storage device 52, via research data server 116, are to be pushed to an external researcher 24 or external database, the results are packaged by KMS 104 and forwarded to the appropriate external destination via B2B server 64. In FIG. 7b, an external researcher 24 interfaces with KMS 104 in a synchronous manner, much like an internal researcher would, via B2B server 64. In this case, the above-described security management services authenticate the external researcher 24 and, on an ongoing basis, ensure the researcher's authority to access requested information. KMS 104 retrieves properly requested information from virtual storage device 52, via research data server 116, and provides it to the external researcher 24 via B2B server 64.
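
The two publication paths of FIGS. 7a and 7b can be contrasted in a short Python sketch. Everything here is an assumption made for illustration: the classes `SecurityServices` and `KnowledgeManagementServer`, their method names, and the `outbox` list that stands in for delivery through B2B server 64 do not appear in the described system. The sketch also re-checks authorization at push time, which is a design choice of the example; the description above only states that the subscriber's interest was approved beforehand.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple


@dataclass
class SecurityServices:
    """Hypothetical stand-in for the security management services."""
    grants: Dict[str, Set[str]] = field(default_factory=dict)  # user -> result ids

    def is_authorized(self, user: str, result_id: str) -> bool:
        return result_id in self.grants.get(user, set())


@dataclass
class KnowledgeManagementServer:
    """Hypothetical model of KMS 104 with push and on-request access."""
    security: SecurityServices
    results: Dict[str, str] = field(default_factory=dict)
    subscriptions: Dict[str, Set[str]] = field(default_factory=dict)
    # Stand-in for packages forwarded through B2B server 64.
    outbox: List[Tuple[str, Dict[str, str]]] = field(default_factory=list)

    def subscribe(self, user: str, result_id: str) -> None:
        # Interest is recorded only after it has been approved.
        if not self.security.is_authorized(user, result_id):
            raise PermissionError(f"{user} may not subscribe to {result_id}")
        self.subscriptions.setdefault(result_id, set()).add(user)

    def publish(self, result_id: str, payload: str) -> None:
        # FIG. 7a: asynchronous push to previously approved subscribers.
        self.results[result_id] = payload
        for user in sorted(self.subscriptions.get(result_id, set())):
            if self.security.is_authorized(user, result_id):  # assumed re-check
                package = {"id": result_id, "payload": payload}
                self.outbox.append((user, package))

    def fetch(self, user: str, result_id: str) -> str:
        # FIG. 7b: synchronous access, authorized on every request.
        if not self.security.is_authorized(user, result_id):
            raise PermissionError(f"{user} may not access {result_id}")
        return self.results[result_id]
```

In this sketch a subscriber receives a packaged copy in `outbox` as soon as `publish` is called, whereas `fetch` returns results only when the caller asks and passes the authorization check at that moment.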

[0058] As will now be apparent, the present invention provides an end-to-end information technology system to allow researchers from multiple disciplines and in geographically diverse locations to cooperate in research efforts. Management of large amounts of experimental and reference information is provided to meet diverse researcher and regulatory requirements, while appropriate security of the managed information is maintained.

[0059] The above-described embodiments of the invention are intended to be examples of the present invention and alterations and modifications may be effected thereto, by those of skill in the art, without departing from the scope of the invention which is defined solely by the claims appended hereto.

* * * * *