Peak Correlation And Clustering In Fluidic Sample Separation

Heinje; Gerd ; et al.

Patent Application Summary

U.S. patent application number 13/252731 was filed with the patent office on 2012-05-10 for peak correlation and clustering in fluidic sample separation. This patent application is currently assigned to AGILENT TECHNOLOGIES, INC. Invention is credited to Gerd Heinje, Rainer Jaeger.

Application Number	20120116689 13/252731
Family ID	43414327
Filed Date	2012-05-10

United States Patent Application	20120116689
Kind Code	A1
Heinje; Gerd ; et al.	May 10, 2012

PEAK CORRELATION AND CLUSTERING IN FLUIDIC SAMPLE SEPARATION

Abstract

A device (100) for analyzing measurement data having a plurality of data sets (206), each data set (206) being assigned to a respective one of a plurality of measurements, each data set (206) having multiple features (208) being indicative of different fractions of a fluidic sample, the device (100) comprising a cluster determining unit (108) configured for determining feature clusters (350) by clustering features (208) from different data sets (206) presumably relating to the same fraction, a spread determining unit (110) configured for determining for at least a part of the feature clusters (350) a spread (352) of the features (208) within a respective feature cluster (350), and a display unit (112) configured for displaying at least the part of the feature clusters (350) together with a graphical indication of the corresponding spread (352).

Inventors:	Heinje; Gerd; (Waldbronn (Baden-Wuerttemberg), DE) ; Jaeger; Rainer; (Waldbronn (Baden-Wuerttemberg), DE)
Assignee:	AGILENT TECHNOLOGIES, INC. Santa Clara CA
Family ID:	43414327
Appl. No.:	13/252731
Filed:	October 4, 2011

Current U.S. Class:	702/25
Current CPC Class:	G16C 20/80 20190201; G16C 20/70 20190201; G01N 30/8651 20130101; G06T 11/206 20130101
Class at Publication:	702/25
International Class:	G06F 19/00 20110101 G06F019/00

Foreign Application Data

Date	Code	Application Number
Nov 4, 2010	GB	10186088

Claims

1. A device for analyzing measurement data having a plurality of data sets, each data set being assigned to a respective one of a plurality of measurements, each data set having multiple features being indicative of different fractions of a fluidic sample, the device comprising a cluster determining unit configured for determining feature clusters by clustering features from different data sets presumably relating to the same fraction; a spread determining unit configured for determining for at least a part of the feature clusters a spread of the features within a respective feature cluster; a display unit configured for displaying at least the part of the feature clusters together with a graphical indication of the corresponding spread.

2. The device of claim 1, wherein each feature represents a combination of a value of a first measurement parameter with a value of a second measurement parameter, wherein the cluster determining unit is configured for: ordering at least a part of the features in accordance with the value of the first measurement parameter, particularly ordering from small to large values; determining the feature clusters by clustering features to a respective feature cluster which fulfill the clustering condition that a difference regarding the value of the first measurement parameter between adjacent features of a feature cluster in the ordered representation is below a predetermined threshold value.

3. The device of claim 2, wherein the predetermined threshold value is a time interval indicative of a difference regarding a retention time of a corresponding fraction in different ones of the measurements, particularly in different chromatographic measurements.

4. The device of claim 2, wherein the predetermined threshold value is a time interval within a range from 0.001 minutes to 0.1 minutes, particularly within a range from 0.005 minutes to 0.08 minutes.

5. The device of claim 2, wherein the cluster determining unit is configured for excluding a feature from a feature group upon determining that this feature has a value of the first measurement parameter which is larger than a value of the first measurement parameter of another feature of the same data set by less than the predetermined threshold value.

6. The device of claim 2, wherein the cluster determining unit is configured for determining the feature clusters by clustering all features to a respective feature cluster which fulfill the clustering condition among each other under consideration of the boundary condition that not more than one feature per data set may form part of the same feature cluster.

7. The device of claim 1, wherein the cluster determining unit is configured for determining whether a first and a last of the features in the ordered representation of a feature cluster have a difference regarding the value of the first measurement parameter of more than a predetermined further threshold value, and for triggering an action upon determining that the difference exceeds the predetermined further threshold value.

8. The device of claim 1, wherein the cluster determining unit is configured for determining the feature clusters using a non-recursive algorithm.

9. The device of claim 1, wherein the display unit is configured for displaying, as the graphical indication, a bar having a width corresponding to the respective spread.

10. The device of claim 1, wherein each feature represents a combination of a value of a first measurement parameter with a value of a second measurement parameter, wherein the display unit is configured for displaying a coordinate system having a first dimension along which the value of the first measurement parameter is displayable for at least a part of the features and having a second dimension along which at least a part of the data sets is displayable for at least a part of the features, wherein the value of the second measurement parameter for at least a part of the features is displayable encoded by a graphical property of a respective marker representing a corresponding feature in the coordinate system.

11. The device of claim 10, wherein the coordinate system is a Cartesian coordinate system.

12. The device of claim 10, wherein the graphical property is a size of the marker, particularly an area of a circular marker.

13. The device of claim 10, wherein the display unit is configured for displaying the graphical indication in an overlaying manner with the markers of the features of the corresponding feature cluster.

14. The device of claim 10, wherein the second dimension is a vertical coordination axis, wherein the display unit is configured for displaying the graphical indication extending along the vertical coordination axis.

15. The device of claim 2, wherein the first measurement parameter is indicative of one of the group consisting of a retention time of a chromatography measurement, and a mass to charge ratio of a coupled liquid or gaseous chromatography and mass spectroscopy measurement.

16. The device of claim 2, wherein the second measurement parameter is indicative of a detection intensity of a peak of a chromatography measurement.

17. The device of claim 1, comprising a fraction identification unit configured for identifying individual fractions assigned to features in different data sets by determining a match with preknown technical information; wherein the cluster determining unit is configured for determining feature clusters by clustering exclusively features which have not been assigned to individual fractions by the fraction identification unit.

18. The device of claim 1, configured as a graphical user interface.

19. The device of claim 1, wherein the measurement data comprises liquid or gaseous chromatography data.

20. The device of claim 1, wherein the measurement data comprises coupled liquid or gaseous chromatography and mass spectroscopy data.

21. The device of claim 1, wherein the measurement data is provided by a measurement device which comprises at least one of a sensor device, a test device for testing a device under test or a substance, a device for chemical, biological and/or pharmaceutical analysis, a fluid separation system configured for separating compounds of a fluid, a capillary electrophoresis device, a liquid chromatography device, a gas chromatography device, an electronic measurement device, and a mass spectroscopy device.

22. A method of analyzing measurement data having a plurality of data sets, each data set being assigned to a respective one of a plurality of measurements, each data set having multiple features being indicative of different fractions of a fluidic sample, the method comprising determining feature clusters by clustering features from different data sets presumably relating to the same fraction; determining for at least a part of the feature clusters a spread of the features within a respective feature cluster; displaying at least the part of the feature clusters together with a graphical indication of the corresponding spread.

23. A device for processing measurement data having a plurality of data sets, each data set being assigned to a respective one of a plurality of measurements, each data set having multiple features being indicative of different fractions of a fluidic sample, wherein each feature represents a combination of a value of a first measurement parameter with a value of a second measurement parameter, the device being configured for determining feature clusters by clustering features from different data sets presumably relating to the same fraction by: ordering at least a part of the features in accordance with the value of the first measurement parameter; determining the feature clusters by clustering features to a respective feature cluster which fulfill the condition that a difference regarding the value of the first measurement parameter between adjacent features of a feature cluster in the ordered representation is below a predetermined threshold value.

24. The device of claim 23,

25. A method of processing measurement data having a plurality of data sets, each data set being assigned to a respective one of a plurality of measurements, each data set having multiple features being indicative of different fractions of a fluidic sample, wherein each feature represents a combination of a value of a first measurement parameter with a value of a second measurement parameter, the method comprising determining feature clusters by clustering features from different data sets presumably relating to the same fraction by: ordering at least a part of the features in accordance with the value of the first measurement parameter; determining the feature clusters by clustering features to a respective feature cluster which fulfill the condition that a difference regarding the value of the first measurement parameter between adjacent features of a feature cluster in the ordered representation is below a predetermined threshold value.

26. A software program or product, stored on a non-transitory data carrier, for controlling or executing the method of claim 25, when run on a data processing system.

Description

BACKGROUND ART

[0001] The present invention relates to a data analysis system.

[0002] Measurement instruments are applied to execute various measurement tasks in order to measure any kind of physical parameter. As a result of a measurement, measurement data is output by the measurement instrument. Such measurement data may include values of physical parameters such as concentrations of components of a sample, intensity values of a fluorescence measurement, etc. This information can be displayed to a user via a graphical user interface for evaluation of the data.

[0003] An example of such a measurement instrument is a coupled liquid chromatography and mass spectroscopy device (for instance the 1200 Series LC/MSD of Agilent Technologies).

[0004] DE 10 2007 000 627 A1 discloses a device which has a processing unit, e.g. CPU, for processing of measured data of a liquid chromatography measurement and mass spectrometer measurements such that the processed data are represented in two-dimensions. Parameters such as retention time and mass spectrometer-spectrum and characterizing the measurements are represented in dimensions, where the latter parameter is correlated with the former parameter. The processing unit is arranged such that data of an original sample, i.e. fluid sample, and data of fragments of the sample are represented in two dimensions.

[0005] Niels-Peter Vest Nielsen, Jens Michael Carstensen, Jon Smedsgaard, "Aligning of single and multiple wavelength chromatographic profiles for chemometric data analysis using correlation optimized warping", Journal of Chromatography A, 805 (1998) 17-35, discloses that the use of chemometric data processing is becoming an important part of modern chromatography. Most chemometric analyses are performed on reduced data sets using areas of selected peaks detected in the chromatograms, which means a loss of data and introduces the problem of extracting peak data from the chromatographic profiles. These disadvantages shall be overcome by using the entire chromatographic data matrix in chemometric analyses, but it is necessary to align the chromatograms, as small unavoidable differences in experimental conditions cause minor changes and drift. The method uses the entire chromatographic data matrices and does not require any preprocessing, e.g. peak detection. It relies on piecewise linear correlation optimized warping (COW) using two input parameters which can be estimated from the observed peak width. COW is demonstrated on constructed single trace chromatograms and on single and multiple wavelength chromatograms obtained from HPLC diode detection analyses of fungal extracts.

[0006] WO 2005/106920 discloses a method of mass spectrometry which comprises determining a first physico-chemical property and a second physico-chemical property of components, molecules or analytes in a first sample, wherein said first physicochemical property comprises the mass or mass to charge ratio and said second physico-chemical property comprises the elution time, hydrophobicity, hydrophilicity, migration time, or chromatographic retention time. A first physico-chemical property and a second physico-chemical property of components, molecules or analytes in a second sample is determined, wherein said first physicochemical property comprises the mass or mass to charge ratio and said second physico-chemical property comprises the elution time, hydrophobicity, hydrophilicity, migration time, or chromatographic retention time. Data relating to components, molecules or analytes in said first sample is probabilistically associated, clustered or grouped with data relating to components, molecules or analytes in said second sample.

[0007] For the management of such measurement data, a user interface may be appropriate for visualizing corresponding data items to a user in a way that a technically reasonable evaluation of the measurement data is enabled. In this respect, conventional data analysis systems may be inconvenient in use.

DISCLOSURE

[0008] It is an object of the invention to provide a convenient data analysis system simplifying a technically reasonable evaluation of the measurement data for a user. The object is solved by the independent claims. Further embodiments are shown by the dependent claims.

[0009] According to an exemplary embodiment, a device for analyzing measurement data having a plurality of data sets is provided, each data set being assigned to a respective one of a plurality of measurements, each data set having multiple features being indicative of different fractions of a fluidic sample (particularly of a fluidic sample to be separated by a respective one of the plurality of measurements), the device comprising a cluster determining unit configured for determining feature clusters by clustering features from different data sets presumably relating (or assumed to relate) to the same fraction, a spreading determining unit configured for determining for at least a part of the feature clusters a spreading of the features within a respective feature cluster, and a display unit configured for displaying at least the part of the feature clusters together with a graphical indication of the corresponding spreading.

[0010] According to another exemplary embodiment, a method of analyzing measurement data having a plurality of data sets is provided, each data set being assigned to a respective one of a plurality of measurements, each data set having multiple features being indicative of different fractions of a fluidic sample, wherein the method comprises determining feature clusters by clustering features from different data sets relating to the same fraction, determining for at least a part of the feature clusters a spreading of the features within a respective feature cluster, and displaying at least the part of the feature clusters together with a graphical indication of the corresponding spreading.

[0011] According to an exemplary embodiment, a device for processing measurement data having a plurality of data sets is provided, each data set being assigned to a respective one of a plurality of measurements, each data set having multiple features being indicative of different fractions of a fluidic sample, wherein each feature represents a combination of a value of a first measurement parameter with a value of a second measurement parameter, the device being configured for determining feature clusters by clustering features from different data sets presumably relating (or assumed to relate) to the same fraction by ordering at least a part of the features in accordance with the value of the first measurement parameter, and determining the feature clusters by clustering features to a respective feature cluster which fulfill the condition that a difference regarding the value of the first measurement parameter between adjacent features of a feature cluster in the ordered representation is below a predetermined threshold value (particularly clustering all features to a respective feature cluster which fulfill the mentioned condition under consideration of the boundary condition that not more than one feature of a respective data set forms part of the same feature cluster).

[0012] According to another exemplary embodiment, a method of processing measurement data having a plurality of data sets is provided, each data set being assigned to a respective one of a plurality of measurements, each data set having multiple features being indicative of different fractions of a fluidic sample, wherein each feature represents a combination of a value of a first measurement parameter with a value of a second measurement parameter, wherein the method comprises determining feature clusters by clustering features from different data sets relating to the same fraction by ordering at least a part of the features in accordance with the value of the first measurement parameter, and determining the feature clusters by clustering features to a respective feature cluster which fulfill the condition that a difference regarding the value of the first measurement parameter between adjacent features of a feature cluster in the ordered representation is below a predetermined threshold value.

[0013] According to still another exemplary embodiment of the present invention, a software program or product is provided, preferably stored on a data carrier, for controlling or executing any of the methods having the above mentioned features, when run on a data processing system such as a computer.

[0014] Embodiments of the invention can be partly or entirely embodied or supported by one or more suitable software programs, which can be stored on or otherwise provided by any kind of data carrier, and which might be executed in or by any suitable data processing unit. Software programs or routines can be preferably applied in the context of measurement data analysis. The measurement data analysis scheme according to an embodiment of the invention can be performed or assisted by a computer program, i.e. by software, or by using one or more special electronic optimization circuits, i.e. in hardware, or in hybrid form, i.e. by means of software components and hardware components.

[0015] In the context of this application, the term "measurement data" may particularly denote experimental data obtained from a measurement regarding a sample comprising multiple fractions or components which are to be separated from one another. For example, such measurement data may be liquid or gaseous chromatography data.

[0016] The term "data set" may particularly denote a portion of the measurement data, more precisely experimental data which relate to one and the same measurement on one and the same fluidic sample. For instance, multiple measurements may be performed with multiple physically different samples, whereas the samples are preferably treated under same or comparable measurement conditions. Hence, each data set may correspond to a respective one of several experimental runs on a measurement device for separating a corresponding fluidic sample in the different fractions. It is possible to use different samples, one for each measurement relating to a corresponding data set. In another embodiment, it is possible to use the same sample and run the same experiment multiple times to capture various data sets together forming the measurement data.

[0017] The term "feature" (more particularly signal feature) may particularly denote a characteristic subsection in a measurement signal which has a special shape, value, etc., which distinguishes the subsection from surrounding portions. When referring to a "signal feature", "signal" should be understood as relating to a measurement signal of any type such as a chromatogram. For example, such a feature may be a peak, a dip, a step or the like in the signal with a dedicated pattern being indicative of a certain measurement event.

[0018] The term "fractions of a fluidic sample" may particularly denote different components (such as different chemical compounds) of a fluidic sample, i.e. of a gaseous and/or liquid sample. For example, different genes or different proteins in a biological sample can form the different fractions. By a fluid separation method performed by the measurement device, it is possible to physically and spatially separate the different fractions of the fluidic sample, for instance by liquid or gaseous chromatography or gel electrophoresis.

[0019] The term "presumably relating to the same fraction" (or assumed to relate to the same fraction) may reflect the fact that the evaluation scheme considers features to relate to the same fraction under certain circumstances, for instance if one or more decision criteria are fulfilled. Such a decision criterion may be that clustered features of a respective feature cluster fulfill the condition that a difference regarding a value of a measurement parameter between adjacent features of a feature cluster in an ordered representation is below a predetermined threshold value. Another decision criterion may be that the application of a recursive algorithm indicates that certain features in fact relate to the same fraction. However, since it cannot be ruled out completely that the evaluation scheme erroneously assigns a certain feature to a certain fraction under undesired circumstances, for instance in the presence of artifacts in the measurement signal, such an assignment is denoted here as a presumable relation to the same fraction.

[0020] The term "feature cluster" may particularly denote a group of two, three, four or more features relating to different measurements and therefore data sets, but apparently relating to the same fraction, e.g. physical, chemical or biochemical component. For simplifying evaluation of multiple measurements with multiple fractions of a fluidic sample for a user, the clustering of the features may visually ease the understanding which of the features relate to one another in a physical sense.

[0021] The term "spread of the features" (which may also be denoted as "spreading of the features", "cluster bandwidth of the features", "distribution of the features") may particularly denote a deviation or variation of the features within a feature cluster regarding a certain measurement parameter. Such a spread may be any statistical measure (particularly a reliability value) indicating by what quantitative amount the individual features of a cluster presumably relating to the same fraction differ from measurement to measurement. Hence, the spread gives a quantitative measure for the degree of reliability of the clustering.
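As a purely illustrative sketch, not taken from the application, the spread of a cluster could be computed as the range and the sample standard deviation of the first measurement parameter across the features of the cluster. The function name `cluster_spread` and the choice of these two statistics are assumptions; the text above only requires "any statistical measure".

```python
import statistics


def cluster_spread(values):
    """Return (range, sample standard deviation) for one feature cluster.

    `values` holds one value of the first measurement parameter per
    feature, for instance retention times in minutes. Both statistics
    are assumed choices for this sketch.
    """
    if not values:
        raise ValueError("a cluster needs at least one feature")
    value_range = max(values) - min(values)
    # statistics.stdev needs at least two points; a single value has no spread.
    std_dev = statistics.stdev(values) if len(values) > 1 else 0.0
    return value_range, std_dev


# Hypothetical retention times of one fraction in three measurements.
rng, sd = cluster_spread([2.00, 2.01, 2.03])
```

A larger range or standard deviation would then correspond to a less reliable grouping, in line with the interpretation given for the graphical indication.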

[0022] The term "graphical indication" may particularly denote any visualization of the correlation between the individual features of a feature cluster on the one hand and their spread on the other hand. The graphical indication shall make clear to a user how large the uncertainty of the grouping is. A large spread usually corresponds to a lower certainty or reliability of the feature grouping as compared to a small spread.

[0023] The term "value of a measurement parameter" may particularly denote a quantitative value of a measured parameter in a certain measurement. Which measurement parameter is analyzed depends on the kind of measurement being performed.

[0024] The term "adjacent features of a feature cluster in an ordered representation" may particularly denote that firstly, the features may be quantitatively ordered after a projection on a measurement parameter axis (particularly from small values to larger values), and secondly, direct neighbors in the quantitative order are regarded. In a corresponding one-dimensional representation of these features, it is possible to compare neighbored or adjacent features with regard to their distance from one another in terms of the (first) measurement parameter. Hence, the smallest and the second smallest feature are considered adjacent, the second smallest and the third smallest, . . . , and the second largest and the largest feature are considered adjacent. Thus, directly neighbored features (particularly all pairs of directly neighbored features) are pairwise compared (by a subtraction operation) with regard to the difference concerning the first measurement parameter.
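The ordering and pairwise subtraction described in this paragraph can be sketched as follows. This is a minimal illustration under the assumption that each feature is reduced to a single number (its value of the first measurement parameter); the helper name `adjacent_differences` is hypothetical.

```python
def adjacent_differences(values):
    """Order values from small to large and subtract direct neighbours.

    Returns one difference per adjacent pair, so n values give n - 1
    differences.
    """
    ordered = sorted(values)
    return [later - earlier for earlier, later in zip(ordered, ordered[1:])]


# Hypothetical retention times (minutes) pooled from several measurements.
diffs = adjacent_differences([2.03, 2.00, 2.50, 2.01])
```

For these four values the ordered list is 2.00, 2.01, 2.03, 2.50, so three differences result, and only the last one is large.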

[0025] According to a first aspect, a technical assistance system is provided for a technician such as an engineer, a chemist or a biologist, which takes a technically well-founded approach for grouping different signal features into corresponding clusters. Particularly, the occurrence of features at basically the same position on a measurement axis is considered a clear indication that they relate to the same separation/measurement conditions. However, since such an algorithm-based clustering of potentially defective measurement data carries the risk of a false clustering, a spread indicative of the reliability of this machine-based clustering is calculated and displayed to the user in combination with the result of the clustering. The user thus receives a visual indication of the reliability of the clustering performed by the system. The technically skilled user is assisted in properly evaluating multiple features in multiple measurements, while at the same time the system clearly indicates the amount of technical uncertainty of the clustering. This prevents the technician from simply accepting the clustering of the machine as always correct, and hence technically meaningful information is provided to the user as to whether the estimation is reliable to a very high degree or to a lower degree.

[0026] According to a second aspect, an accurate and numerically simple algorithm for clustering is provided which allows features to be clustered with reasonable computational burden, and therefore very quickly, for forming feature clusters in an intuitive and technically well-grounded manner. For this purpose, a simple ordering scheme is applied which orders the features of the multiple measurements in accordance with a quantitative ordering criterion, for instance in ascending order or in descending order. Particularly, it is not necessary to perform a numerically complex, time-consuming recursive algorithm for the clustering; instead, a simple comparison of the distance of (or difference between) adjacent pairs of features in terms of the first measurement parameter is sufficient. It is simply checked whether the distance of the value of the measurement parameter between adjacent features is larger or smaller than a predefined threshold value. On the basis of this check, a clustering can be performed which has turned out to be reliable and which can avoid artifacts to a large extent.

[0027] In the following, further exemplary embodiments of the devices will be explained. However, these embodiments also apply to the methods and to the software program or product.

[0028] In an embodiment, each feature represents a combination of a value of a first measurement parameter with a value of a second measurement parameter. The cluster determining unit may be configured for ordering at least a part of the features in accordance with the value of the first measurement parameter, particularly ordering from small to large values, and determining the feature clusters by clustering features to a respective feature cluster which fulfill the clustering condition that a difference regarding the value of the first measurement parameter between each pair of adjacent features of a feature cluster in the ordered representation is below a predetermined threshold value. In this context, "each" means that all features of a group are clustered to one cluster, in which group the condition is pairwise fulfilled that each two neighbors in the ordered representation have a distance in terms of the value of the first measurement parameter of less than the predetermined threshold value. This is a very simple algorithm which provides surprisingly reliable results.
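The ordering and threshold comparison of this embodiment can be sketched in a few lines. The following is an illustrative reading only, not the applicant's implementation: the `Feature` type, its field names and the function `cluster_by_threshold` are assumptions, and the example values are invented.

```python
from typing import List, NamedTuple


class Feature(NamedTuple):
    data_set: int     # index of the measurement the feature belongs to
    position: float   # first measurement parameter, e.g. retention time (min)
    intensity: float  # second measurement parameter, e.g. peak intensity


def cluster_by_threshold(features: List[Feature], threshold: float) -> List[List[Feature]]:
    """Group features whose ordered neighbours differ by less than `threshold`.

    A single pass over the sorted features is enough; no recursion is used.
    Groups with one member are returned as well, although the text reserves
    the term "feature cluster" for groups of two or more features.
    """
    ordered = sorted(features, key=lambda f: f.position)
    clusters: List[List[Feature]] = []
    for feature in ordered:
        if clusters and feature.position - clusters[-1][-1].position < threshold:
            clusters[-1].append(feature)  # close to its predecessor: same cluster
        else:
            clusters.append([feature])    # gap too large: start a new cluster
    return clusters


# Invented example: three measurements, two fractions.
features = [
    Feature(0, 2.00, 10.0), Feature(1, 2.01, 11.0), Feature(2, 2.02, 9.5),
    Feature(0, 3.10, 4.0), Feature(1, 3.11, 4.2),
]
clusters = cluster_by_threshold(features, threshold=0.02)
```

With the sort dominating the cost, the sketch runs in O(n log n) time for n features, which is consistent with the low computational burden the text attributes to the approach.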

[0029] In an embodiment, the predetermined threshold value is a time interval indicative of a difference regarding a retention time of a corresponding fraction in different ones of the measurements. The retention time can be defined as a parameter in chromatography which corresponds to the elapsed time between the time of injection of a sample or solute and the time of elution of the peak maximum of a fraction of that sample or solute. Hence, the retention time is a unique characteristic of the fraction in the solute and can be used for identification purposes. The value of the predetermined threshold value may for instance be estimated using expert knowledge, i.e. empirical information regarding liquid or gaseous chromatography being indicative of the variation of the retention time (or alternatively the retention volume) in different measurements.

[0030] In an embodiment, the predetermined threshold value is a time interval within a range from about 0.001 minutes to about 0.1 minutes, particularly within a range from about 0.005 minutes to about 0.08 minutes. It has turned out that the provided values are very suitable to ensure a proper clustering, particularly when the predetermined threshold value is between 0.01 minutes and 0.03 minutes.

[0031] In an embodiment, the cluster determining unit is configured for determining the feature clusters using a non-recursive algorithm. Recursion may be denoted as a method of defining functions in which a function being defined is applied within its own definition. Thus, recursion implies an iterative approach with a relatively high computational burden. In contrast to this, exemplary embodiments of the invention rely on a simple pairwise comparison of adjacent measurement values which does not need recursions and is therefore less prone to a high consumption of processing capacity.

[0032] In an embodiment, the cluster determining unit is configured for excluding a feature from a feature group (i.e. for not including this feature in a cluster) upon determining that this feature has a value of the first measurement parameter which is larger than a value of the first measurement parameter of another feature of the same data set by less than another predetermined threshold value, i.e. a further threshold value which can be considered as a parameter which is separate from the above mentioned threshold value determining whether different features of different data sets should be considered to relate to the same cluster. In an embodiment, the cluster determining unit is configured for determining the feature clusters by clustering all features to a respective feature cluster which fulfill the clustering condition among each other under consideration of the boundary condition that at most one feature per data set may form part of the same feature cluster. Hence, according to such embodiments it shall be ruled out that a feature cluster includes multiple features from the same measurement, because different distinguishable features in the same measurement are considered as a clear technical indication for two different fractions, thereby contravening the assumption that features of a cluster relate to the same fraction. Hence, if two features relating to the same data set are closer to one another than the other predetermined threshold value, the second feature in the ordered list will not be allowed to form part of the cluster in the described embodiment. The other predefined threshold value is preferably the same threshold value as the one used for determining whether two features of different data sets relate to the same cluster or not. However, the values may also be different from one another, if desired or required.
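A minimal sketch of the boundary condition that at most one feature per data set may form part of the same feature cluster is given below. The function name, the tuple layout and the choice to collect excluded features in a separate list are assumptions of this sketch. It is likewise an assumption that, after an exclusion, the next feature is compared with the last feature actually admitted to the cluster.

```python
from typing import List, Tuple

# Assumed layout: (data_set_index, value_of_first_parameter).
Feature = Tuple[int, float]


def cluster_unique_per_data_set(
    features: List[Feature], threshold: float
) -> Tuple[List[List[Feature]], List[Feature]]:
    """Adjacent-gap clustering that admits at most one feature per data set per cluster."""
    ordered = sorted(features, key=lambda f: f[1])
    clusters: List[List[Feature]] = []
    excluded: List[Feature] = []
    for data_set, value in ordered:
        if clusters and value - clusters[-1][-1][1] < threshold:
            if any(ds == data_set for ds, _ in clusters[-1]):
                # A feature of this data set is already in the cluster:
                # the later feature is not allowed to join it.
                excluded.append((data_set, value))
            else:
                clusters[-1].append((data_set, value))
        else:
            clusters.append([(data_set, value)])
    return clusters, excluded


# Hypothetical data: data set 1 shows two closely spaced features.
features = [(0, 1.00), (1, 1.01), (1, 1.02), (2, 1.03), (0, 2.00)]
clusters, excluded = cluster_unique_per_data_set(features, threshold=0.03)
print(clusters)
print(excluded)
```

An excluded feature may subsequently be treated as a cluster of its own, which then has a spread of zero.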

[0033] In an embodiment, the cluster determining unit is configured for determining whether a first (for instance having the smallest value of the first parameter) and a last (for instance having the largest value of the first parameter) of the features in the ordered representation of a feature cluster differ regarding the value of the first measurement parameter by more than a predetermined further threshold value, and for triggering a predefined action upon determining that the predetermined further threshold value is exceeded. Under undesired circumstances, it can happen that all adjacent features of a cluster fulfill the above-mentioned threshold value condition, but nevertheless the distance between the features of a cluster as a whole is too large to reasonably assume from a technical point of view that the cluster features really relate to the same fraction. Therefore, if a further threshold value which is usually larger than the aforementioned threshold values is exceeded, it will not be assumed in the described embodiment that all the features of the determined cluster relate to the same fraction. For this reason, a corresponding action may be triggered when this criterion is met. This action may for instance be an alarm alerting a user that the clustering is probably not reliable. The action may however also be that the clustering algorithm will not be applied for clustering and no or another clustering algorithm has to be applied, for instance a recursive clustering algorithm.
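The check of the overall extent of a cluster may, purely as an example, be sketched as follows. The function name check_cluster_span, the numerical values and the use of a printed warning as the triggered action are assumptions of this sketch.

```python
from typing import List, Tuple


def check_cluster_span(values: List[float], span_threshold: float) -> Tuple[float, bool]:
    """Return (span, exceeded) for the first-parameter values of one cluster."""
    if not values:
        raise ValueError("a cluster must contain at least one feature")
    # Distance between the first and the last feature in the ordered representation.
    span = max(values) - min(values)
    return span, span > span_threshold


# Hypothetical chain: every adjacent gap is 0.02 min (below a 0.03 min threshold),
# yet the cluster as a whole extends over 0.08 min.
chained = [1.00, 1.02, 1.04, 1.06, 1.08]
span, exceeded = check_cluster_span(chained, span_threshold=0.05)
if exceeded:
    # Stand-in for the predefined action, for instance an alarm to the user.
    print(f"warning: cluster spans {span:.2f} min, clustering may be unreliable")
```

The example shows the chaining situation addressed above: each pair of neighbors satisfies the clustering condition, while the cluster as a whole exceeds the further threshold value.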

[0034] In an embodiment, the display unit is configured for displaying a bar having a width corresponding to the respective spread as the graphical indication. A bar is a clear visual indicator showing to a human user in a very intuitive manner how reliable the clustering has been. A bar structurally connects all cluster features visually and therefore gives a further visual indication for the clustering result. However, as an alternative to a bar, it is also possible to use for instance a line of a corresponding length, a color code or a numerical indication of the spread. By such an illustration of the clustering in connection with the two measurement parameters in a coordinate system, it can be possible for a user with one view to understand which clusters have been formed.

[0035] In an embodiment, each feature represents a combination of a value of a first measurement parameter with a value of a second measurement parameter. The display unit may be configured for displaying a coordinate system having a first dimension along which the value of the first measurement parameter is displayable for at least a part of the features and having a second dimension along which at least a part of the data sets is displayable for at least a part of the features. The value of the second measurement parameter for at least a part of the features is displayable encoded by a graphical property of a respective marker in the coordinate system. Hence, the display of the second measurement parameter does not necessarily require a separate coordinate axis, since its value can be encoded as a property of a marker.

[0036] In an embodiment, the coordinate system is a Cartesian coordinate system. Alternatively, other two dimensional coordinate systems are possible. Also a three- or more-dimensional coordinate system may be used. However, the use of a Cartesian coordinate system makes the visual confirmation and approval of a clustering by a user very easy, since the uncertainty connected with the clustering can be easily derived visually from a Cartesian coordinate system.

[0037] In an embodiment, the graphical property is a size of the marker, particularly an area of a circular marker. For example, the larger the value of the second measurement parameter, the larger the area. Hence, the area of such a circular marker can be used as an indication how large the feature was in the original measurement signal, for instance which area a corresponding peak of a liquid or gaseous chromatography measurement has. However, it is also possible to use additionally or alternatively other indicators than the size of the marker--for instance a color--for indicating the value of the second measurement parameter.
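As a simple illustration of encoding the second measurement parameter in the area of a circular marker, the radius to be drawn may be derived as sketched below. The proportionality between value and area, the function name marker_radius and the scale factor area_per_unit are assumptions of this sketch.

```python
import math


def marker_radius(value: float, area_per_unit: float = 1.0) -> float:
    """Radius of a circular marker whose area is proportional to `value`."""
    if value < 0:
        raise ValueError("the encoded value must not be negative")
    area = area_per_unit * value          # marker area grows linearly with the value
    return math.sqrt(area / math.pi)      # area = pi * r**2  =>  r = sqrt(area / pi)


# Doubling the value doubles the marker area, so the radius grows by sqrt(2).
r_small = marker_radius(10.0)
r_large = marker_radius(20.0)
print(r_large / r_small)
```

Scaling the area rather than the radius avoids visually exaggerating large peaks, since the perceived size of a circle follows its area.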

[0038] In an embodiment, the first parameter is indicative of a retention time (or a retention volume) of a chromatography measurement, or a mass to charge ratio of a coupled liquid chromatography and mass spectroscopy measurement. However, these parameters are only exemplary, since other parameters may be used when other kinds of measurements are carried out.

[0039] In an embodiment, the second parameter is indicative of a detection intensity of a peak of a chromatography measurement. Again, also the second parameter may be different from the detection intensity when other measurements are carried out.

[0040] In an embodiment, the display unit is configured for displaying the graphical indication in an overlaying manner with the markers of the features of the corresponding feature cluster. By visually projecting the graphical indication with the markers of the features in a coordinate system, it is easy for a user to verify which features relate to the same cluster and how large the spread of the individual features within a cluster is.

[0041] In an embodiment, the second dimension is a vertical coordinate axis on a display. The display unit may be configured for displaying the graphical indication extending along the vertical coordinate axis. By drawing a bar along a vertical coordinate axis, it is easy for a user to check the distribution of the clusters within the bar extending along such a vertical coordinate axis but relating to different measurements. Therefore, this makes the evaluation of the measurement even more intuitive.

[0042] In an embodiment, the device comprises a fraction identification unit configured for identifying individual fractions assigned to features in different data sets by determining a match with preknown technical information. The cluster determining unit may be configured for determining feature clusters by clustering exclusively features which have not been assigned to individual fractions by the fraction identification unit. Such a fraction identification unit can be configured in a conventional manner, since it is known to the skilled person for instance in the art of liquid or gaseous chromatography as to how a fraction is identified from a measurement signal. Usually, certain fractions of a fluidic sample to be separated are expected at certain retention times, so that the retention time, the intensity of the corresponding measurement peaks or other features can be used for fraction identification. However, it is also possible in a liquid or gaseous chromatography measurement or another measurement, that certain features cannot be identified or assigned unambiguously or with a sufficient reliability to a certain fraction. In this case, exclusively these non-identified features can be made subject to the clustering algorithm of embodiments of the invention, whereas identified features need not go through the clustering algorithm. Therefore, the technically clear cases need no clustering, but only the peaks which are difficult to assign are clustered to make the evaluation easier for the user. For instance, the clustering may be performed only for non-identified peaks which can relate to impurities which occur in the sample or the like.
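One conceivable way of separating identified from non-identified features before clustering is sketched below. Identification by expected retention time windows is only one example of a match with preknown technical information. The function name split_identified, the fraction names and the window limits are hypothetical.

```python
from typing import Dict, List, Tuple

# Assumed layout: (data_set_index, retention_time_in_minutes).
Feature = Tuple[int, float]


def split_identified(
    features: List[Feature], expected_windows: Dict[str, Tuple[float, float]]
) -> Tuple[List[Tuple[str, Feature]], List[Feature]]:
    """Assign features to expected fractions; return (identified, unidentified)."""
    identified: List[Tuple[str, Feature]] = []
    unidentified: List[Feature] = []
    for feature in features:
        retention_time = feature[1]
        # First expected fraction whose window contains the retention time, if any.
        name = next(
            (n for n, (low, high) in expected_windows.items() if low <= retention_time <= high),
            None,
        )
        if name is None:
            unidentified.append(feature)   # only these are passed on to the clustering
        else:
            identified.append((name, feature))
    return identified, unidentified


# Hypothetical expected fractions and measured features.
windows = {"compound_A": (0.95, 1.05), "compound_B": (2.40, 2.60)}
features = [(0, 1.00), (0, 1.70), (1, 2.50), (1, 1.72)]
identified, unidentified = split_identified(features, windows)
print(unidentified)
```

In this example the two features near 1.7 minutes match no expected fraction and would therefore be the only ones subjected to the clustering.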

[0043] In an embodiment, the device may be configured as a graphical user interface (GUI) which may be denoted as a user interface which allows people to interact with electronic devices such as computers or handheld devices. A GUI offers graphical icons and visual indicators as opposed to purely text based interfaces, typed command labels or text navigation to fully represent the information and actions available to a user. The actions may then be performed through direct manipulation of the graphical elements. Therefore, a user may input preferences to make clustering appropriate for her or his purposes. For instance, the various threshold parameters may be input by a user, thereby allowing the clustering to be adjusted to the needs of a user. Alternatively, the system can be fully automatic, or it can be a combination of an automatic and a user-defined clustering and spread estimation.

[0044] In an embodiment, the measurement data comprises liquid or gaseous chromatography data. In one embodiment, the measurement data comprises coupled liquid chromatography and mass spectroscopy data. In an embodiment, the measurement data is provided by a measurement device which comprises at least one of a sensor device, a test device for testing a device under test or a substance, a device for chemical, biological and/or pharmaceutical analysis, a fluid separation system configured for separating compounds of a fluid, a capillary electrophoresis device, a liquid chromatography device, a gas chromatography device, an electronic measurement device, and a mass spectroscopy device. However, other applications and kinds of measurements are possible as well.

[0045] The device may be adapted for processing a displayed two-dimensional set of data, particularly may be adapted for processing a measurement curve. Such a measurement curve may be provided by a measurement apparatus, for instance a life science apparatus or any other technical apparatus. Evaluating such measurement data may be conventionally a challenge and may be significantly simplified by the intuitive user interface according to an exemplary embodiment. However, in other embodiments, it is also possible to display three or more-dimensional data.

[0046] By clustering, accumulations of features relating to the same species of a sample, particularly a biochemical sample, may be identified. Hence, a user interface particularly for liquid or gaseous chromatography and mass spectroscopy technology may be provided, wherein a number of measurement diagrams or spectra are taken from various different measurements. Then, it is identified from this which peaks correspond to one another. Slightly varying experimental conditions in the various measurements, a change or variation in the sample, or a change of other parameters such as solvent and/or temperature may result in a slight shifting of various features or peaks in different data sets although these peaks relate to the same fraction, species or chemicals. Identifying and assigning peaks relating to the same cluster is then important for purposes of reproducibility, which is particularly important in pharmacy and related technologies. A measure for the spread which is then estimated can for instance be the variance or a standard deviation. It may alternatively be a distance between centers of the features on the lower limit and the upper limit of a cluster.
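The spread measures mentioned above may be computed, for example, as sketched below. The function name spread_measures and the sample values are assumptions. The use of the population variants of variance and standard deviation is likewise an assumption, since the text does not specify which variant is meant.

```python
import statistics
from typing import Dict, List


def spread_measures(values: List[float]) -> Dict[str, float]:
    """Candidate spread measures for the first-parameter values of one cluster."""
    return {
        # Population variance and standard deviation (assumed variant).
        "variance": statistics.pvariance(values),
        "std_dev": statistics.pstdev(values),
        # Distance between the features at the lower and upper limit of the cluster.
        "range": max(values) - min(values),
    }


# Hypothetical retention times (minutes) of one cluster from four measurements.
measures = spread_measures([1.00, 1.01, 1.02, 1.03])
print(measures)

# A cluster consisting of a single feature has a spread of zero in every measure.
single = spread_measures([1.50])
```

The range corresponds to the width of a bar drawn from the first to the last feature of a cluster, whereas variance and standard deviation are less sensitive to a single outlying feature.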

[0047] Hence, embodiments of the invention relate to a system of correlating any desired measurement value in a series of repeated measurements. The result of the correlation is the classification of the measured values at the individual measurements in terms of clusters. An exemplary application of an embodiment of the invention is the purity control of synthesized products, for instance in pharmacology. In this example, the repeated measurements may be chromatograms of different samples from one batch or multiple batches producing the same product. The measurement value as a basis for the clustering is the retention time of non-identified peaks. The result of the correlation is a set of clusters of peaks from the various chromatograms with nearly identical retention time, i.e. retention times differing only within a retention time window. In this example, the clusters can be considered as unknown components such as impurities which have been introduced in the sample (for instance components which should not occur at an optimum processing or only in very small amounts). The diagram then allows such peaks showing unexpected fractions to be identified. The clustering then allows for a more detailed understanding of the characteristics of the peak.

BRIEF DESCRIPTION OF DRAWINGS

[0048] Other objects and many of the attendant advantages of embodiments of the present invention will be readily appreciated and become better understood by reference to the following more detailed description of embodiments in connection with the accompanying drawings. Features that are substantially or functionally equal or similar will be referred to by the same reference signs.

[0049] FIG. 1 shows a device for analyzing measurement data having a plurality of data sets according to an exemplary embodiment of the invention.

[0050] FIG. 2 to FIG. 4 are schemes relating to the execution of a method of processing measurement data having a plurality of data sets and illustrating an algorithm of clustering, calculating a spread and illustrating both together according to an exemplary embodiment of the invention.

[0051] FIG. 5 to FIG. 22 show different images relating to a clustering procedure, spread calculation procedure and a graphic illustration of the latter according to an exemplary embodiment of the invention.

[0052] FIG. 23 shows a diagram graphically illustrating different fractions of a fluidic sample separated and being analyzed in terms of cluster formation and spread calculation and illustration.

[0053] FIG. 24 shows a liquid separation system, in accordance with embodiments of the present invention, for instance used in high performance liquid chromatography (HPLC) and ultra high performance liquid chromatography (UHPLC).

[0054] The illustration in the drawing is schematic.

[0055] Referring now in greater detail to the drawings, FIG. 24 depicts a general schematic of a liquid separation system 10. A pump 20 receives a mobile phase from a solvent supply 25, typically via a degasser 27, which degases and thus reduces the amount of dissolved gases in the mobile phase. The pump 20--as a mobile phase drive--drives the mobile phase through a separating device 30 (such as a chromatographic column) comprising a stationary phase. A sampling unit 40 can be provided between the pump 20 and the separating device 30 in order to subject or add (often referred to as sample introduction) a fluidic sample into the mobile phase. The stationary phase of the separating device 30 is adapted for separating compounds of the fluidic sample. A detector 50 is provided for detecting separated compounds of the fluidic sample. A fractionating unit 60 can be provided for outputting separated compounds of the fluidic sample.

[0056] While the mobile phase can be comprised of one solvent only, it may also be mixed from plural solvents. Such mixing might be a low pressure mixing and provided upstream of the pump 20, so that the pump 20 already receives and pumps the mixed solvents as the mobile phase. Alternatively, the pump 20 might be comprised of plural individual pumping units, with plural of the pumping units each receiving and pumping a different solvent or mixture, so that the mixing of the mobile phase (as received by the separating device 30) occurs at high pressure and downstream of the pump 20 (or as part thereof). The composition (mixture) of the mobile phase may be kept constant over time, the so called isocratic mode, or varied over time, the so called gradient mode.

[0057] A data processing unit 70, which can be a PC or workstation, might be coupled (as indicated by the dotted arrows) to one or more of the devices in the liquid separation system 10 in order to receive information and/or control operation. For example, the data processing unit 70 might control operation of the pump 20 (for instance setting control parameters) and receive therefrom information regarding the actual working conditions (such as output pressure, flow rate, etc. at an outlet of the pump). The data processing unit 70 might also control operation of the solvent supply 25 (for instance setting the solvent/s or solvent mixture to be supplied) and/or the degasser 27 (for instance setting control parameters such as vacuum level) and might receive therefrom information regarding the actual working conditions (such as solvent composition supplied over time, flow rate, vacuum level, etc.). The data processing unit 70 might further control operation of the sampling unit 40 (for instance controlling sample injection or synchronizing sample injection with operating conditions of the pump 20). The separating device 30 might also be controlled by the data processing unit 70 (for instance selecting a specific flow path or column, setting operation temperature, etc.), and send--in return--information (for instance operating conditions) to the data processing unit 70. Accordingly, the detector 50 might be controlled by the data processing unit 70 (for instance with respect to spectral or wavelength settings, setting time constants, start/stop data acquisition), and send information (for instance about the detected sample compounds) to the data processing unit 70. The data processing unit 70 might also control operation of the fractionating unit 60 (for instance in conjunction with data received from the detector 50) and receive data back from it.

[0058] Reference numeral 90 schematically illustrates a switchable valve which is controllable for selectively enabling or disabling specific fluidic paths within apparatus 10. The switchable valve 90 is not limited to the position between the pump 20 and the separating device 30 and can also be implemented at other positions, depending on the application.

[0059] The data processing unit 70 may also process and display measurement data measured by device 10 to enable a user to derive technical information from the measurement. Such procedures according to exemplary embodiments will be described in detail in the following. Particularly, methods for evaluating chromatographic results using data correlation and clustering will be explained.

[0060] FIG. 1 shows a device 100 (which corresponds to device 10 of FIG. 24) for analyzing liquid chromatography measurement data captured by a liquid chromatography measurement device 102 (which corresponds to components 20, 25, 27, 30, 40, 50, 60, 90 of FIG. 24). The measurement device 102 carries out a plurality of measurements on a fluidic sample to be separated into various fractions. With each measurement, a corresponding data set is captured by the measurement device 102. Each data set can be indicative of a chromatogram which has a plurality of peaks which will also be called signal features or only features. Each feature indicates the presence of a corresponding fraction or species in the fluidic sample.

[0061] After finishing the measurements, the measurement data can be stored in a database 104 for later evaluation.

[0062] A fraction identification unit 106 of the device 100 is configured for identifying individual fractions assigned to the features in the chromatogram in different data sets by determining a match with preknown technical information. In other words, certain fractions or components of the fluidic sample which is presently analyzed are expected so that the fraction identification unit 106 can identify peaks in the measurement signals and assign them to the various expected fractions. However, it may also happen that some of the determined features in the measurement spectra cannot be identified, i.e. cannot be assigned to an expected species. This can for instance be caused by impurities in the samples.

[0063] Such impurities, which may correspond to undesired or parasitic fractions of the fluidic sample, can then be analyzed by a cluster determining unit 108. The cluster determining unit 108 is configured for determining feature clusters by clustering only the features which could not be assigned to individual fractions by the fraction identification unit 106. For this purpose, the cluster determining unit 108 determines feature clusters by clustering features from different data sets which presumably relate to the same fraction. Examples for a corresponding clustering algorithm, i.e. an algorithm for determining which of the unidentified peaks or features relate to the same fraction or are at least considered to relate to the same fraction, will be discussed below in more detail.

[0064] The result of the cluster determination is then supplied to a spread determining unit 110. The spread determining unit 110 is configured for determining, for each of the feature clusters individually, a corresponding spread of the features within a respective feature cluster. In other words, a value can be statistically derived which is indicative of a width of the distribution of the individual features within a cluster. Thus, the spread is an indication for the reliability of the clustering (the larger the spread, the lower the reliability).

[0065] After having determined a quantitative measure for the spread for each feature cluster individually, a display unit 112 may be fed with the corresponding data and may be configured for determining display data for actually displaying the feature clusters together with the graphical indication of the corresponding spread, for instance on a monitor.

[0066] As indicated by a dashed rectangle in FIG. 1 denoted with reference numeral 114 (which corresponds to component 70 of FIG. 24), units 106, 108, 110, 112 can be realized as a common processor or computer. It is however also possible that each of the units is realized as a separate processor or computer or that only some of the units are realized as a common processor.

[0067] An input/output unit 116 is provided for bidirectional communication with the processor 114 as well as the database 104 and the measurement device 102. Via the input/output unit 116, a user may input instructions to the system, for instance may determine parameters or may define a measurement to be carried out. It is also possible that results of such a measurement or of the evaluation are displayed to the user via the input/output unit 116, for instance via a monitor.

[0068] FIG. 2 to FIG. 4 illustrate how the clustering, the spread determination and the graphical display can be performed for the system shown in FIG. 1.

[0069] FIG. 2 shows a diagram 200 having an abscissa 202 along which a retention time is plotted according to a liquid or gaseous chromatography measurement. Along an ordinate 204, different measurements performed with the liquid or gaseous chromatography apparatus 102 are illustrated. This means in the shown example that four different measurements are indicated in the diagram of FIG. 2, each illustrated as a corresponding horizontal dotted line. A number of signal features 208 are shown for each measurement in the diagram 200. Hence, each measurement shows a plurality of such features 208. All features 208 relating to one and the same measurement together form a corresponding data set 206, as shown in FIG. 2 as well. Therefore, the four data sets 206 shown in FIG. 2 correspond to the four measurements. In the example of FIG. 2, each data set 206 has three (in this case unidentified) features 208 which are arranged at remarkably different retention times. The following procedure intends to cluster corresponding features 208 which most probably relate to the same fraction of a sample to be separated in the various measurements.

[0070] The way in which the clustering is performed is shown in FIG. 3 and will be illustrated in the following. Firstly, all unidentified features 208 shown in FIG. 2 are projected on and are ordered quantitatively along an axis 330 shown in FIG. 3 which relates to the retention time axis 202. In other words, all twelve features 208 shown as circles in FIG. 2 are projected onto the retention time axis 202. Hence, the twelve features 208 illustrated as "1", "2", ..., "11", "12" in FIG. 3 are ordered according to their value of the retention time from small to large values. Feature clusters 350 are then determined by clustering all features 208 which fulfill the clustering condition that a difference regarding the value of the retention time between adjacent features 208 of a feature cluster 350 in the ordered representation is below a predetermined threshold value Δ_TH being indicated in FIG. 3 with reference numeral 354. Hence, a distance Δ_12 between features "1" and "2" is determined and compared to Δ_TH. Since Δ_12 is smaller than Δ_TH, features "1" and "2" are considered to relate to the same cluster 350. Next, features "2" and "3" are analyzed which have a mutual distance Δ_23. Since Δ_23 is smaller than Δ_TH, also features "2" and "3" are considered to relate to the same cluster 350. This procedure is continued until it is determined that the difference Δ_45 between features "4" and "5" is larger than Δ_TH. Therefore, it is concluded that features "4" and "5" do not relate to the same cluster 350. Correspondingly, features "1" to "4" are grouped to form the first cluster 350. This procedure is continued so that three clusters 350, which are denoted as C1, C2 and C3 in FIG. 3, are identified.

[0071] A further consistency check of the cluster formation may be made by comparing a respective width S1, S2 or S3 between the center of the first and the center of the last feature 208 of a respective cluster 350 with another threshold value S_TH denoted as reference numeral 356. If one of S1, S2 or S3 were larger than S_TH, then the corresponding cluster formation would not be considered as reliable and this would be indicated to a user, for instance in the form of an alarm. However, in the present case, each of the cluster formations is considered as consistent. The corresponding value S1, S2 and S3 can be denoted as a spread of a corresponding cluster C1, C2 or C3.
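The clustering condition and the consistency check of FIG. 3 may be summarized compactly as follows. The symbols t_i for the ordered retention times, and a_k and b_k for the indices of the first and last feature of cluster C_k, are introduced here only for illustration and do not appear in the figures.

```latex
% Ordered retention times of the n unidentified features
t_1 \le t_2 \le \dots \le t_n, \qquad
\Delta_{i,i+1} = t_{i+1} - t_i

% Adjacent features i and i+1 are assigned to the same cluster if and only if
\Delta_{i,i+1} < \Delta_{\mathrm{TH}}

% Spread of a cluster C_k extending from feature a_k to feature b_k
S_k = t_{b_k} - t_{a_k} = \sum_{i=a_k}^{b_k-1} \Delta_{i,i+1}
    \le (b_k - a_k)\,\Delta_{\mathrm{TH}}

% Consistency check against the further threshold value
S_k \le S_{\mathrm{TH}}
```

The bound on S_k shows why the additional check is useful: the pairwise condition alone limits the spread only in proportion to the number of features in a cluster, so a long chain of closely spaced features can extend well beyond a single threshold width.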

[0072] FIG. 4 shows a diagram 400 similar to diagram 200. In addition to what is shown in FIG. 2, a bar 406 indicative of the extension of the corresponding spread S1, S2 or S3 visually shows to the user how reliable the clustering is.

[0073] Coming back to FIG. 2, a further feature 210 is shown which relates to the second measurement and has a distance to a preceding feature 212 of less than Δ_TH. If such a situation occurs, i.e. that the same measurement shows two features 210, 212 differing less than Δ_TH from one another but relating to the same data set 206, then the later feature 210 is not considered to relate to the same cluster 350, because two separable features in the same measurement are indicative of two different fractions and can therefore not be considered to relate to the same fraction for technical considerations. Feature 210 can form a separate cluster with a width or spread of zero, since it is only a single feature.

[0074] In the following, referring to FIG. 5 to FIG. 22, a system of forming a graphical illustration of measurement results according to exemplary embodiments of the invention will be explained.

[0075] FIG. 5 shows a chromatographic signal 500 illustrating different signal features such as peaks 502 as regions of locally high intensity in a liquid chromatography experiment in dependency of a retention time plotted along abscissa 202. A baseline 504 is shown as well.

[0076] FIG. 6 shows how the chromatographic signal 500 can be transformed into an equivalent bubble diagram in which the individual peaks 502 are displayed as circular structures or features 208. In other words, the area of each feature 208 corresponds to an area under a corresponding peak 502.
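As an illustration of how the area under a peak might be obtained for use as the bubble area, a trapezoidal integration over the sampled signal is sketched below. The integration method, the constant baseline, the function name peak_area and the sample values are assumptions of this sketch; the text does not prescribe how the peak area is determined.

```python
from typing import Sequence


def peak_area(
    times: Sequence[float],
    signal: Sequence[float],
    start: int,
    end: int,
    baseline: float = 0.0,
) -> float:
    """Trapezoidal area of `signal` above `baseline` between sample indices start and end."""
    area = 0.0
    for i in range(start, end):
        height_left = signal[i] - baseline
        height_right = signal[i + 1] - baseline
        # Area of one trapezoid between two neighboring samples.
        area += 0.5 * (height_left + height_right) * (times[i + 1] - times[i])
    return area


# Hypothetical triangular peak sampled every 0.1 min.
times = [0.0, 0.1, 0.2, 0.3, 0.4]
signal = [0.0, 1.0, 2.0, 1.0, 0.0]
area = peak_area(times, signal, start=0, end=4)
print(area)
```

The resulting value can then serve as the quantity encoded in the marker area of the corresponding feature 208.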

[0077] FIG. 7 shows an illustration similar to that of FIG. 6, wherein expected retention time windows--more precisely spreads relating to expected peaks--are illustrated in the form of bars which are denoted with reference numeral 700.

[0078] FIG. 8 shows a similar diagram as FIG. 7 with the exception that apart from identified peaks, compare reference numeral 208, also some unidentified peaks are shown which are illustrated by reference numeral 800. An unidentified peak 800 is a peak which is seen in the signal 500 although no such peak would be expected theoretically. Such unidentified peaks 800 may result from impurities in a sample or the like.

[0079] FIG. 9 shows that, apart from the unidentified peaks 800, it may also happen that certain expected peaks are not found in a signal 500, as indicated by reference numeral 900. Not found means that there is no local maximum in the signal 500 although it would be expected theoretically.

[0080] In some events, compare reference numeral 1000 in FIG. 10, an alert may be triggered since an alert rule is violated. In other cases, see reference numeral 1002, a warning may be output to a user when a warning rule is violated.

[0081] FIG. 11 shows a diagram 1100 in which all peaks 208 are shown as bubbles, wherein the size can be proportional to area, height, amount, etc. Vertical bars 700 show the expected retention time window.

[0082] FIG. 12 shows a so-called sequence peak diagram 1200. In this diagram 1200, all peaks 208 of different injections or measurements are shown as bubbles, wherein the size can be proportional to area, height, amount. The vertical bars 700 show the expected retention time window. Hence, peaks 208 from various measurements are illustrated in the sequence peak diagram 1200.

[0083] FIG. 13 shows a graphical user interface 1300, in which a user can, in a user-defined manner, design the way of illustrating the various peaks 208 and bars 700 in accordance with user preferences.

[0084] In the graphical user interface 1400 shown in FIG. 14, two peaks 1402 are marked as suspicious, because certain rules have failed (relating to warning and alert status).

[0085] FIG. 15 shows a diagram 1500 in which expected but not found peaks 1502 are shown as well.

[0086] FIG. 16 shows a diagram 1600 which indicates that three injections or measurements show unidentified peaks 1602. As a result of clustering, bands 1604 indicate that these peaks 1602 could be assigned to two unknown compounds.

[0087] FIG. 17 shows a graphical user interface 1700 in which a comparison against a reference chromatogram is performed, and a proper match is found.

[0088] User interface 1800 shown in FIG. 18 shows that at a peak 1602, reference and sequence chromatograms do not match very well.

[0089] In diagram 1900 in FIG. 19, the sequence chromatogram shows one expected but not found peak 1902, one peak 1904 too many, and one peak 1906 not found.

[0090] FIG. 20 shows a diagram 2000, in which peaks of a reference and a sequence chromatogram do not match. However, there is some similarity. FIG. 21 shows a diagram 2100 in which the peaks are aligned (see alignment lines 2102).

[0091] FIG. 22 shows a user interface 2200 in which a suspicious marker 2202 is shown.

[0092] In FIG. 23, a diagram 2300 can be seen which is similar to diagram 400 and that shows that after clustering of features 208 or peaks the resulting clusters are displayed together with a measure for the spreading.

[0093] Unidentified peaks are denoted with reference numeral 2304, identified peaks are denoted with reference numeral 2302, and vertical bands (reference numeral 2306) show formed clusters.

[0094] The following description referring to FIG. 23 relates to peak correlation and clustering components. It allows a user to correlate (cluster) unidentified peaks 2304 based on retention times (see axis 202). Peaks whose retention times are very close to each other are assigned to the same cluster. The results are visualized as a graphic control (see FIG. 23) and as table entries (not shown) for further evaluation. The user can control the window size 354 which is used for clustering, manually correct a given clustering, and apply various filter operations in order to explore the clusters and peaks in detail.

[0095] Clustering of peaks can be used when multiple samples show unidentified peaks 2304 and the question arises whether these peaks 2304 are likely to be caused by the same compound or impurity. The described method will help the user to classify the peaks 2304 by aligning all those peaks 2304 which show up at closely the same retention time and handling them as a new entity, i.e. as a yet unknown compound or impurity.

[0096] This may also be useful for developing new methods where the retention times of all peaks 2302, 2304 are not known in advance. The found clusters can then be turned into expected retention times for identifying these peaks 2302, 2304.

[0097] Depending on the nature of the retention time values, clustering will not always lead to a unique solution. Therefore, the user needs an easy way to change the window size 354 used for clustering and to view in real time how these manipulations alter the clustering. This will enable the user to select the most meaningful solution.

[0098] The user interface for this feature comprises a graphical control showing the positions of all peaks 2302, 2304 and clusters as retention time bands 2306, additional entries for the column table where each column (group of columns) represents data from a specific cluster, and various interactive manipulation means for evaluating the clustered peaks 2302, 2304.

[0099] Since expected peaks 2302 are clustered implicitly by data analysis, i.e. by the peak identification step, this additional clustering will only be applied to unidentified peaks 2304, in an embodiment.

[0100] Therefore, the input for clustering is the set of retention times of all unidentified peaks from all injections. Clustering is performed for each signal separately. The only parameter is the Clustering Window Size 354, which specifies the size of the window used to cluster peaks in retention time units (min/sec). If this parameter is not specified, the algorithm will determine a default cluster window size from the minimum of non-zero differences of all unidentified peaks.
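
The determination of a default window may, for instance, be sketched as follows. This is a minimal Python sketch and not the actual implementation; it assumes that the default is taken to be exactly the smallest non-zero difference between retention times, and the function name `default_cluster_window` is hypothetical.

```python
from typing import Iterable, Optional


def default_cluster_window(retention_times: Iterable[float]) -> Optional[float]:
    """Return the smallest non-zero difference between any two retention times.

    After sorting, the smallest non-zero pairwise difference always occurs
    between two neighbouring values, so only adjacent gaps are inspected.
    Returns None when fewer than two distinct retention times are given.
    """
    ordered = sorted(retention_times)
    gaps = [b - a for a, b in zip(ordered, ordered[1:]) if b - a > 0]
    return min(gaps) if gaps else None
```

For retention times 1.0, 1.5, 1.5 and 3.0 min the sketch yields 0.5 min; the zero difference between the two identical values is ignored.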

[0101] Output is a collection of clusters (compare reference numeral 350 in FIG. 3). Each cluster lists the retention times, signal and injections which comprise the cluster, as well as the real width of the cluster, calculated as maximum minus minimum of retention times within the cluster.
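
One possible in-memory representation of such an output cluster is sketched below. The class name `PeakCluster`, its field names and the signal label used in the example are assumptions made for illustration only; only the definition of the real width (maximum minus minimum retention time) is taken from the description.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class PeakCluster:
    """One cluster of peaks taken from a single signal (hypothetical structure)."""

    signal: str
    # Each member is (injection number, retention time).
    members: List[Tuple[int, float]] = field(default_factory=list)

    @property
    def retention_times(self) -> List[float]:
        return [rt for _, rt in self.members]

    @property
    def injections(self) -> List[int]:
        return [inj for inj, _ in self.members]

    @property
    def real_width(self) -> float:
        """Maximum minus minimum retention time within the cluster."""
        times = self.retention_times
        return max(times) - min(times) if times else 0.0


# Example with a made-up signal label and three injections.
cluster = PeakCluster("signal_A", [(1, 2.0), (2, 2.25), (3, 2.5)])
```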

[0102] This clustering feature can be switched on or activated interactively when evaluating peak or compound results. If clustering is switched on, the method will hold the user-specified cluster window size 354 or the information that a default value is to be used.

[0103] When exploring the clustering interactively the software may vary the cluster window size 354 and calculate the clustering in the background. As a result the relationship of "number of clusters" versus "cluster window size" can be inspected to allow the user to find an optimal cluster window size 354 for the user data. The software will mark the largest cluster window size 354 at which for all injections not more than one peak 2302, 2304 is included in each cluster.
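
The background calculation described above may be sketched as follows. The sketch is written under stated assumptions: clusters are formed purely from gaps between neighbouring retention times, the candidate window sizes are supplied by the caller, and all function names are hypothetical.

```python
from typing import List, Optional, Sequence, Tuple

Peak = Tuple[float, int]  # (retention time, injection number)


def gap_clusters(peaks: Sequence[Peak], window: float) -> List[List[Peak]]:
    """Group sorted peaks; a gap larger than `window` starts a new cluster."""
    clusters: List[List[Peak]] = []
    for peak in sorted(peaks):
        if clusters and peak[0] - clusters[-1][-1][0] <= window:
            clusters[-1].append(peak)
        else:
            clusters.append([peak])
    return clusters


def one_peak_per_injection(clusters: List[List[Peak]]) -> bool:
    """True if no cluster contains two peaks from the same injection."""
    return all(len({inj for _, inj in c}) == len(c) for c in clusters)


def clusters_versus_window(peaks: Sequence[Peak],
                           windows: Sequence[float]) -> List[Tuple[float, int]]:
    """Relationship 'number of clusters' versus 'cluster window size'."""
    return [(w, len(gap_clusters(peaks, w))) for w in windows]


def largest_valid_window(peaks: Sequence[Peak],
                         windows: Sequence[float]) -> Optional[float]:
    """Largest candidate window at which every cluster holds at most one
    peak per injection, or None if no candidate qualifies."""
    valid = [w for w in windows if one_peak_per_injection(gap_clusters(peaks, w))]
    return max(valid) if valid else None
```

With two injections showing peaks at 1.0 and 2.0 min (injection 1) and at 1.25 and 2.5 min (injection 2), candidate windows of 0.125, 0.25, 0.5, 0.75 and 1.0 min give 4, 3, 2, 1 and 1 clusters, and 0.5 min is the largest window that keeps at most one peak per injection in each cluster.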

[0104] In the case that multiple signals are available, the software can optionally collect all identified peaks 2302 from all signals as input to the correlation algorithm. In the correlation result set, the peak that gets marked is the one that possesses the largest area among the set of peaks which are from the same injection and within the same cluster but from different signals.
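
The marking of the peak with the largest area may be illustrated by the following sketch. The dictionary keys (`cluster`, `injection`, `signal`, `area`) and the function name are assumptions chosen for this example; the described software may organize its data differently.

```python
from typing import Dict, List, Tuple


def mark_largest_area(peaks: List[dict]) -> List[int]:
    """Return the indices of the peaks to be marked.

    For every (cluster, injection) pair, the peak with the largest area
    across all signals is selected. On equal areas the first peak wins.
    """
    best: Dict[Tuple[int, int], int] = {}
    for index, peak in enumerate(peaks):
        key = (peak["cluster"], peak["injection"])
        if key not in best or peak["area"] > peaks[best[key]]["area"]:
            best[key] = index
    return sorted(best.values())


# Made-up example: injection 1 is seen on two signals within cluster 0.
example_peaks = [
    {"cluster": 0, "injection": 1, "signal": "A", "area": 10.0},
    {"cluster": 0, "injection": 1, "signal": "B", "area": 25.0},
    {"cluster": 0, "injection": 2, "signal": "A", "area": 7.0},
    {"cluster": 1, "injection": 1, "signal": "A", "area": 3.0},
]
```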

[0105] In the case multiple detectors are available the signal alignment algorithm may be applied before determining the retention times. This is especially advantageous when combining retention times from all signals as input for the correlation/clustering algorithm.

[0106] In case the cluster window size 354 is smaller than the minimum of non-zero differences of all peaks, the number of created clusters is equal to the number of different retention times. In case the cluster window size 354 is larger than the total spread, i.e. maximum minus minimum, of retention times, the number of created clusters equals one. For all other values of the cluster window size 354, the number of resulting clusters lies between the two above described values; as a function of the window size it is a monotonically falling step function. The cluster window size 354 is limited by the largest size at which for each injection not more than one peak is included in each cluster.

[0107] As mentioned above, FIG. 23 shows the principal layout of the graphical control for presenting all peaks from many injections and their clusters. The X-axis (see reference numeral 202) has the same units as the analyzed signals, i.e. time given in units of min or sec. The Y-axis (see reference numeral 204) shows just the number of the injection from which the peaks 2302, 2304 are taken. The position of each peak 2302, 2304 is represented by a circle. The size of the circle represents area, height or any other chosen numerical value of a peak 2302, 2304.

[0108] Clusters can be visualized by bands 2306 which may be colored. The presentation of FIG. 23 includes also the identified peaks 2302. The width of the bands 2306 for identified peaks 2302 is just the expected retention time plus/minus the identification window size. The width of the bands 2306 for the unidentified peaks 2304 is chosen in a way that retention times, i.e. center of the circles, of all peaks 2304 belonging to a cluster are within the band 2306. In the case a cluster contains only one peak 2304 then only one colored line is drawn as a cluster band 2306.
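
The two ways of determining the band limits may be sketched as follows; the function names are hypothetical. A band whose lower and upper limits coincide corresponds to the single line drawn for a cluster containing only one peak.

```python
from typing import Sequence, Tuple


def identified_band(expected_rt: float,
                    identification_window: float) -> Tuple[float, float]:
    """Band for an identified peak: expected retention time plus/minus
    the identification window size."""
    return (expected_rt - identification_window,
            expected_rt + identification_window)


def unidentified_band(retention_times: Sequence[float]) -> Tuple[float, float]:
    """Band for a cluster of unidentified peaks: the narrowest band that
    contains the retention times (circle centers) of all its peaks."""
    return (min(retention_times), max(retention_times))
```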

[0109] Identified peaks 2302 and their clusters may be colored differently from unidentified peaks 2304 and the corresponding clusters. For instance, identified peaks 2302 may be colored blue, unidentified peaks 2304 grey.

[0110] A selected injection or measurement is visualized by reference numeral 206; a selected peak may be emphasized by four arrows pointing to the according circle (see reference numeral 2308).

[0111] Next, an interactive evaluation of correlated unidentified peaks 2304 will be explained. Prerequisite is that multiple injections are already loaded and integrated; identification can be completed but is not needed. In the case no identification has been done, all peaks 2302, 2304 are handled as unidentified. This might be a useful starting point for developing a new method from scratch.

[0112] Assuming the user is evaluating chromatograms and peaks, depending on the user interface layout the user would either switch on the correlation/clustering control or switch to a specific sub-view. The system will immediately calculate the clusters and display the result as a graphic and as added columns to the compound table displaying values for the found clusters. The default is to start with all unidentified peaks from a signal and the cluster window size given by the method: either a specific or the system calculated default value. Using a toolbar, the user can easily switch between different available signals.

[0113] In order to determine a proper clustering, the user can display a small popup window that shows the relationship between cluster window size 354 and number of clusters. The user can adapt the cluster window size 354 if needed. There may be a slider on the toolbar which allows the user to evaluate the diagram in real time while varying the cluster window size 354.

[0114] Other options are to select which attribute will be shown by the size of the circles that represent each peak 2302, 2304 in the graphic. Possible values are: area, height, peak type, or any numeric value that is outcome of the rule calculator. The real value is proportional to the area of the circle. The sizes of the circles vary between two predefined values for the minimum and maximum circle.
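
One way to map a numeric peak attribute to circle sizes is sketched below. The sketch assumes a linear mapping of the attribute onto the circle area between the two predefined limits; the function name and the default limits are hypothetical, and the choice made when all values are equal is arbitrary.

```python
from typing import List, Sequence


def circle_areas(values: Sequence[float],
                 min_area: float = 20.0,
                 max_area: float = 400.0) -> List[float]:
    """Map peak attribute values (e.g. peak area or height) linearly to
    circle areas between min_area and max_area."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All peaks carry the same value: draw every circle at maximum size
        # (an arbitrary choice made for this sketch).
        return [max_area] * len(values)
    scale = (max_area - min_area) / (hi - lo)
    return [min_area + (v - lo) * scale for v in values]
```

Scaling the area rather than the radius keeps the visual impression of a circle in step with the underlying value.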

[0115] Further on, the user can suppress peaks 2302, 2304 or full injections (measurements) for clustering. This makes sense when outliers have been identified by the data analysis and these outliers might create values which are not representative for all samples or would distort clustering. Peaks 2302, 2304 or full injections can manually be suppressed interactively for instance by moving the cursor near to a circle. The cursor may change its shape visualizing the possible action to suppress a peak 2302, 2304 or injection or to re-activate a suppressed item.

[0116] Other filter options are to show and mark unidentified peaks 2304 that are detected only in some of the injections but not in all, and/or to show and mark ranges of the signal where expected peaks 2302 have not been identified, i.e. are for any reason not available.

[0117] A method according to an embodiment of the invention which includes an algorithm for clustering and correlating data from a series of repeated measurements will be described in detail in the following with an emphasis on the logic of such an algorithm. Integrated with a graphical presentation of the resulting clusters this method allows the user to examine specific features of the measured data in a highly efficient way. The outlined example of peak correlation of chromatographic measurements illustrates advantages of this method, especially in the area of impurity profiling or development of chromatography methods.

[0118] The described method allows correlating and clustering any measured numerical feature from a series of repeated measurements. Based on a given small Cluster Window Width (also denoted as predefined threshold value), an algorithm creates clusters of values of a measured feature that are taken from the different measurements of the series. Adjacent values within a cluster are closer to each other than the given window width. However, in an embodiment the chosen Cluster Window Width shall not exceed a size such that more than one data point from a single measurement falls into the same cluster. In general the resulting cluster size may be larger than the starting Cluster Window Width.

[0119] The method includes a graphical and tabular presentation of the correlation result. The graphical presentation is a scatter diagram of the measured values. An X-axis relates to the data range of the measured data values and a Y-axis numbers the measurements of the series. The format of the single data points such as color, shape and size can visualize additional features of the data point. A table may be used to list any selected feature of each cluster in a single table column.

[0120] In an embodiment, such a system may be applied to chromatographic measurement data. Gas chromatography (GC) and liquid chromatography (LC) are techniques to characterize the chemical composition of gaseous and liquid, i.e. fluidic, samples. During a chromatography run fractions or components (also called compounds) of a mixture are separated, and optionally, identified and quantified. The time it takes the component molecules to travel through the system is called retention time. The result of a chromatographic analysis is a signal (chromatogram) that shows peaks at different retention times corresponding to the different components. In addition, the height or area of the peak can be used to quantify the component in the sample.

[0121] One task of data analysis is to allot these peaks, based on the retention time, to components. During method development the retention times of all components of interest are determined and inserted in the method as expected retention times. When running real samples the data analysis part of the system scans the chromatograms for peaks at expected retention times and uses the peak area or height to determine the amount of the components.
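
The allotment of peaks to components may be illustrated by the following sketch, which also yields the unidentified peaks and the expected but not found components. It assumes a single symmetric identification window for all components and assigns a peak to the closest expected retention time; the function name and the component names in the example are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple


def identify_peaks(peak_rts: Sequence[float],
                   expected: Dict[str, float],
                   window: float) -> Tuple[Dict[str, List[float]], List[float], List[str]]:
    """Return (identified, unidentified, not_found).

    identified   maps a component name to the retention times allotted to it
    unidentified lists peaks that match no expected retention time
    not_found    lists expected components for which no peak was seen
    """
    identified: Dict[str, List[float]] = {}
    unidentified: List[float] = []
    for rt in peak_rts:
        candidates = [n for n, exp in expected.items() if abs(rt - exp) <= window]
        if candidates:
            # Allot the peak to the component with the closest expected time.
            name = min(candidates, key=lambda n: abs(rt - expected[n]))
            identified.setdefault(name, []).append(rt)
        else:
            unidentified.append(rt)
    not_found = [n for n in expected if n not in identified]
    return identified, unidentified, not_found


# Made-up example: component "B" is expected at 3.0 min but no peak is there.
result = identify_peaks([1.0, 2.5, 4.0], {"A": 1.0625, "B": 3.0}, 0.125)
```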

[0122] Applied to chromatography, peak clustering can be used to examine unidentified peaks. For instance, LC or GC analysis is applied to create a series of analyses of different samples taken from a batch of a newly synthesized product. In this example the repeated measurements are the recorded chromatograms; the measured feature is the retention time of any unidentified peak within the chromatograms. The described algorithm creates clusters of unidentified peaks from the different chromatograms for which the retention times are very close to each other. One interpretation is that such clusters are caused by unknown compounds which are regarded as impurities or by-products which should not exist at optimal process control. The found clusters are added as "yet unknown" compounds to the compound list.

[0123] Some of the diagrams (for instance FIG. 23) show an exemplary layout of a scatter plot for peak correlation. Not only the unidentified peaks (reference numeral 2304 in FIG. 23) may be drawn, but also the identified peaks (reference numeral 2302 in FIG. 23). Vertical bands (reference numeral 2306 in FIG. 23) show the created clusters, either given by the below described clustering algorithm for unidentified peaks or, for expected peaks, by peak identification. The width of the bands for identified peaks is just the expected retention time plus/minus the identification window size as specified in the method. The width of the bars or bands for unidentified peaks is chosen in a way that the retention times, i.e. the centers of the circles, of all peaks belonging to a cluster are within the band. The size of the circles is chosen to be proportional to the peak area.

[0124] This visualization concept may be integrated into a general data analysis software package for chromatographic data. If a user selects any chromatogram or peak for further inspection the related peak will also be highlighted in the scatter diagram.

[0125] In addition to displaying all peaks and their correlation, the graphical presentation can be used to highlight a variety of peak attributes and to help navigate to suspicious signals. Peaks can be flagged based on the results from applied data evaluation rules.

[0126] Next, an exemplary peak clustering algorithm will be described which may be used for the above-described way of illustrating clusters and their spread.

[0127] Prerequisite for peak correlation is that multiple signals are loaded and already integrated; identification could have been completed but is not required. In case no identification has been done all peaks are handled as unidentified. This might be a useful starting point for developing a new method from scratch.

[0128] The following cluster algorithm may be applied:

TABLE-US-00001
STEP 1: From each loaded Signal k collect all unidentified Peaks, result: PeaksInSignal (k)
STEP 2: Merge all PeaksInSignal (k) lists, result: PeakList
STEP 3: Sort PeakList (smallest to largest), result: SortedPeakList
STEP 4: Set ClusterInd = 1, add SortedPeakList (1) to PeakCluster (ClusterInd)
STEP 5: FOR i = 2 to NumberOfPeaks in SortedPeakList
          Set k such that SortedPeakList (i) is in PeaksInSignal (k)
          IF ((SortedPeakList (i) - SortedPeakList (i-1)) <= "Cluster Window Width")
             AND (No Peaks of PeaksInSignal (k) in PeakCluster (ClusterInd))
            Add SortedPeakList (i) to current PeakCluster (ClusterInd)
          ELSE
            Create a new cluster, increment ClusterInd by 1
            Add SortedPeakList (i) to new PeakCluster (ClusterInd)
          END
        NEXT i
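
STEP 1 to STEP 5 of the above cluster algorithm may, for instance, be expressed in Python as sketched below. This is an illustrative sketch only: each loaded signal is assumed to be given as a list of the retention times of its unidentified peaks, every peak is tagged with the index k of its signal, and the names `cluster_peaks` and `Peak` are hypothetical.

```python
from typing import List, Sequence, Tuple

Peak = Tuple[float, int]  # (retention time, index k of the originating signal)


def cluster_peaks(peaks_in_signal: Sequence[Sequence[float]],
                  window_width: float) -> List[List[Peak]]:
    """Cluster unidentified peaks from several signals by retention time."""
    # STEP 1 and STEP 2: collect the peaks of every signal k and merge them.
    peak_list = [(rt, k) for k, rts in enumerate(peaks_in_signal) for rt in rts]
    # STEP 3: sort from smallest to largest retention time.
    sorted_peaks = sorted(peak_list)
    if not sorted_peaks:
        return []
    # STEP 4: the first peak opens the first cluster.
    clusters: List[List[Peak]] = [[sorted_peaks[0]]]
    # STEP 5: walk through the remaining peaks in order.
    for previous, current in zip(sorted_peaks, sorted_peaks[1:]):
        close_enough = current[0] - previous[0] <= window_width
        signal_present = any(k == current[1] for _, k in clusters[-1])
        if close_enough and not signal_present:
            clusters[-1].append(current)
        else:
            # Gap too large, or signal k already contributes a peak here.
            clusters.append([current])
    return clusters


# Made-up example with three signals and a window width of 0.1 min.
example = cluster_peaks([[1.00, 2.50], [1.02, 2.55], [1.05, 3.90]], 0.1)
```

In the example, the peaks near 1.0 min from all three signals form one cluster, the two peaks near 2.5 min form a second cluster, and the isolated peak at 3.90 min forms a third.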

[0129] The number of found clusters depends on the size of the Cluster Window Width. A very small width will create many clusters, in the extreme case as many as there are unidentified peaks. A helpful tool to preselect an optimal starting value is to show the graph of the number of resulting clusters versus Cluster Window Width.

[0130] Embodiments of the invention are capable of assisting the chemist in reviewing many peaks from many samples at a glance. Peak clustering and the graphical presentation allow the chemist to check whether all components have been identified and whether additional compounds have been detected. From this diagram, the chemist can directly focus on checking those components that show unexpected behavior.

[0131] It should be noted that the term "comprising" does not exclude other elements or features and the "a" or "an" does not exclude a plurality. Also elements described in association with different embodiments may be combined. It should also be noted that reference signs in the claims shall not be construed as limiting the scope of the claims.

* * * * *