Process Control Techniques For Semiconductor Manufacturing Processes DAVID; Jeffrey Drue [STREAM MOSAIC, INC.]

Patent Application Summary

U.S. patent application number 14/952266 was filed with the patent office on 2015-11-25 and published on 2016-05-26 for process control techniques for semiconductor manufacturing processes. The applicant listed for this patent is STREAM MOSAIC, INC. Invention is credited to Jeffrey Drue DAVID.

Publication Number	20160148850
Application Number	14/952266
Family ID	56010944
Filed Date	2015-11-25
Publication Date	2016-05-26

United States Patent Application	20160148850
Kind Code	A1
DAVID; Jeffrey Drue	May 26, 2016

PROCESS CONTROL TECHNIQUES FOR SEMICONDUCTOR MANUFACTURING PROCESSES

Abstract

Techniques for measuring and/or compensating for process variations in a semiconductor manufacturing process. Machine learning algorithms are used on extensive sets of input data, including upstream data, to organize and pre-process the input data, and to correlate the input data to specific features of interest. The correlations can then be used to make process adjustments. The techniques may be applied to any feature or step of the semiconductor manufacturing process, such as overlay, critical dimension, and yield prediction.

Inventors:

DAVID; Jeffrey Drue; (San Jose, CA)

Applicant:

Name	City	State	Country	Type
STREAM MOSAIC, INC.	San Jose	CA	US

Family ID:

56010944

Appl. No.:

14/952266

Filed:

November 25, 2015

Related U.S. Patent Documents


Application Number	Filing Date	Patent Number
62084551	Nov 25, 2014
62091567	Dec 14, 2014
62103946	Jan 15, 2015

Current U.S. Class:	438/5 ; 355/53
Current CPC Class:	G03F 7/70625 20130101; G03F 7/70633 20130101; G06N 7/005 20130101; G03F 7/705 20130101; G06N 20/00 20190101; H01L 22/14 20130101; H01L 21/67253 20130101; H01L 22/20 20130101; H01L 22/12 20130101
International Class:	H01L 21/66 20060101 H01L021/66; H01L 21/67 20060101 H01L021/67; G03F 7/20 20060101 G03F007/20; H01L 21/306 20060101 H01L021/306

Claims

1. A method, comprising: receiving real-time inputs of a current production run of semiconductor wafers from a lithography process and at least one upstream process into an overlay measurement model stored in a data processing apparatus, wherein the overlay measurement model is configured to determine a multi-variate relationship of a plurality of input data to overlay measurement, the input data is obtained from the lithography process and the upstream process in previous production runs; generating a predicted overlay measurement from the real-time inputs using the overlay measurement model; and adjusting the lithography process or the upstream process such that the predicted overlay measurement correlates with an actual overlay measurement.

2. The method of claim 1, further comprising: the overlay measurement model obtains additional input data from processes in the previous production runs after the lithography process for use in determining the multi-variate relationship; and feeding additional real-time inputs from processes after the lithography process into the model for each production run.

3. A method, comprising: obtaining a plurality of overlay measurements from a plurality of wafers in a plurality of production runs of a lithography process, wherein each overlay measurement indicates an offset between a first set of features formed on a first layer and a second set of features formed on a second layer above the first layer; collecting a set of input data from each production run including data obtained from the lithography process and data obtained from upstream processes; analyzing the sets of input data to determine a multi-variate relationship of the input data to the overlay measurements; generating a predicted overlay measurement for each set of input data; and adjusting the lithography process or the upstream processes such that the predicted overlay measurements correlate with an actual overlay measurement.

4. The method of claim 3, further comprising: creating a model for overlay measurement based on the analysis of the input data and the corresponding overlay measurements; deploying the model for a wafer production run, wherein real-time inputs are obtained from the lithography process and the upstream processes and fed into the model; generating a predicted overlay measurement using the model; and adjusting the lithography process or the upstream processes such that the predicted overlay measurement correlates with an actual overlay measurement.

5. The method of claim 3, wherein the data obtained from the lithography process and the upstream processes includes metrology and parametric data.

6. The method of claim 5, wherein the metrology and parametric data from the lithography process includes feature critical dimensions, wafer shape, wafer geometry, film thickness, film resistivity, device channel length, device channel width, device channel depth, device operating thresholds, and device resistance.

7. The method of claim 5, wherein the metrology and parametric data from the upstream processes includes, for each upstream process, process duration, process temperature, process pressure, process frequency, and optical measurements.

8. The method of claim 3, wherein the overlay measurements are obtained using image-based overlay or diffraction-based overlay.

9. The method of claim 3, wherein the analyzing step is performed by at least one machine learning algorithm.

10. The method of claim 3, wherein the analyzing step is performed by a combination of machine learning algorithms.

11. The method of claim 3, wherein the analyzing step is performed by a multi-step algorithm.

12. The method of claim 4, further comprising: creating a virtual metrology model based on the data obtained from upstream processes; and providing an output of the virtual metrology model as an input to the overlay measurement model.

13. The method of claim 4, further comprising: obtaining in-situ metrology data; and providing the in-situ metrology data as an input to the overlay measurement model.

14. The method of claim 4, further comprising: performing a transformation of one or more sets of the input data; and providing the transformed input data as an input to the overlay measurement model.

15. The method of claim 4, further comprising: normalizing the real-time inputs when a second statistical distribution of the real-time input has changed from a first statistical distribution of the input data.

16. The method of claim 15, wherein the normalizing step is implemented by determining a z-score for the first and second statistical distributions.

17. A non-transitory machine-readable medium having stored thereon one or more sequences of instructions, which instructions, when executed by one or more processors, cause the one or more processors to carry out the steps of: obtaining a plurality of overlay measurements from a plurality of wafers in a plurality of production runs of a lithography process, wherein each overlay measurement indicates an offset between a first set of features formed on a first layer and a second set of features formed on a second layer above the first layer; collecting a set of input data from each production run including data obtained from the lithography process and data obtained from upstream processes; analyzing the sets of input data to determine a multi-variate relationship of the input data to the overlay measurements; generating a predicted overlay measurement for each set of input data; and adjusting the lithography process or the upstream processes such that the predicted overlay measurements correlate with an actual overlay measurement.

18. The non-transitory machine-readable medium of claim 17, comprising further instructions that cause the one or more processors to carry out the steps of: creating a model for overlay measurement based on the analysis of the input data and the corresponding overlay measurements; deploying the model for a wafer production run, wherein real-time inputs are obtained from the lithography process and the upstream processes and fed into the model; generating a predicted overlay measurement using the model; and adjusting the lithography process or the upstream processes such that the predicted overlay measurement correlates with an actual overlay measurement.

19. A system, comprising: at least one processor; and a memory coupled to the processor comprising instructions executable by the processor, the instructions, when executed by the processor, cause the processor to: obtain a plurality of overlay measurements from a plurality of wafers in a plurality of production runs of a lithography process, wherein each overlay measurement indicates an offset between a first set of features formed on a first layer and a second set of features formed on a second layer above the first layer; collect a set of input data from each production run including data obtained from the lithography process and data obtained from upstream processes; analyze the sets of input data to determine a multi-variate relationship of the input data to the overlay measurements; generate a predicted overlay measurement for each set of input data; and adjust the lithography process or the upstream processes such that the predicted overlay measurements correlate with an actual overlay measurement.

20. The system of claim 19, comprising further instructions that cause the processor to: create a model for overlay measurement based on the analysis of the input data and the corresponding overlay measurements; deploy the model for a wafer production run, wherein real-time inputs are obtained from the lithography process and the upstream processes and fed into the model; generate a predicted overlay measurement using the model; and adjust the lithography process or the upstream processes such that the predicted overlay measurement correlates with an actual overlay measurement.

Description

CROSS REFERENCE

[0001] This application claims priority from U.S. Patent Application No. 62/084,551 entitled System and Methods for Overlay Error Compensation, Measurements, and Lithography Apparatus Control, filed Nov. 25, 2014; U.S. Patent Application No. 62/091,567 entitled System and Methods for Yield Prediction, Test Optimization, and Burn-In Optimization, filed Dec. 14, 2014; and U.S. Patent Application No. 62/103,946 entitled System and Methods for Using Algorithms for Semiconductor Manufacturing, filed Jan. 15, 2015; each of which is incorporated herein by reference in its entirety.

TECHNICAL FIELD

[0002] This disclosure relates generally to semiconductor manufacturing processes, and more particularly, to improved process control techniques for lithography, yield prediction, and other aspects of semiconductor manufacturing processes.

BACKGROUND

[0003] The semiconductor manufacturing industry is known as a complex and demanding business, and it continues to evolve with major changes in device architectures and process technologies. Traditionally, the semiconductor industry has been characterized by sophisticated high-tech equipment, a high degree of factory automation, and ultra-clean manufacturing facilities that cost billions of dollars in capital investment and maintenance expense.

[0004] For decades, semiconductor manufacturing was driven by Moore's Law and planar transistor architecture. This provided a predictable, self-sustaining roadmap for transistor cost scaling and well-defined interfaces where each individual process/layer could follow its own technology trajectory independently. However, as the industry scales to provide sub-20 nm nodes and other popular device architectures, such as MEMS, new processes are required, and new approaches for semiconductor manufacturing are being explored and implemented.

[0005] For sub-20 nm nodes, entirely new device architectures are needed. In parallel, the rapid growth in the Internet of Things (IoT) is driving the MEMS market. These changes have presented difficult and unprecedented challenges for the industry, generally resulting in lower manufacturing yields.

[0006] In order to achieve acceptable yield and device performance levels with these new architectures, very tight process specifications must be achieved. Thus, better process control and integration schemes are needed now more than ever.

[0007] One example of a specific current challenge for the industry is lithography processes for sub-20 nm node manufacturing. EUV lithography techniques are known but have not yet been widely adopted for production. Therefore, 193 nm immersion lithography must extend its capability via multi-patterning schemes, which add masks and process steps and are therefore complicated and expensive.

[0008] Various processes also require more complex integration, and therefore can no longer be developed independently of each other. For example, the three-dimensional architecture of finFETs and 3-D NAND devices, as well as the complex relationships between corresponding process steps, has changed the way that process variabilities can affect device performance and yield. As an example, many semiconductor manufacturers are experiencing lower yield on their finFET lines, and the need to increase yield is urgent. In the memory space, 3-D NAND has become the dominant architecture, and process control is a key issue for 3-D NAND process layers. The IoT space is increasingly dominated by the "More-than-Moore" trend, where devices incorporate technologies that do not necessarily scale to Moore's Law. This growing market space is driven by diversified and specific processes, and new ways to improve yield and reduce manufacturing costs are needed when implementing manufacturing solutions.

BRIEF DESCRIPTION OF THE DRAWINGS

[0009] FIG. 1 is a flow chart illustrating a process for making a semiconductor device.

[0010] FIG. 2 is a block diagram illustrating relationships between different steps of the process of FIG. 1 and their cumulative effects on process variation and product performance.

[0011] FIG. 3A is a top plan view of features formed in two different layers of a device, with no overlay error.

[0012] FIG. 3B is a top plan view of features formed in two different layers of a device, with overlay error.

[0013] FIG. 4 is a top plan view of features formed in a single layer of a device, with a critical dimension error.

[0014] FIG. 5A is a side plan view of a substrate having features formed in two different layers of a device, with no critical dimension or overlay errors.

[0015] FIG. 5B is a side plan view of a substrate having features formed in two different layers of a device, with no critical dimension or overlay errors.

[0016] FIG. 6 is a flow chart illustrating a method for training and deploying a model.

[0017] FIG. 7 is a block diagram illustrating examples of input data and the sources for input data.

[0018] FIG. 8 is a flow chart illustrating a method for using a deployed model to make process adjustments.

[0019] FIG. 9 is a graph showing the error between a DBO measurement and a CD-SEM measurement.

[0020] FIG. 10 is a flow chart illustrating yield prediction using a classification algorithm and a confidence metric.

[0021] FIG. 11 is a flow chart illustrating a method for training and deploying a model to predict yield.

[0022] FIG. 12 is a block diagram of one embodiment of a yield prediction system.

[0023] FIG. 13 shows equations illustrating a process for determining the status of a manufactured product as a function of weighted test data, confidence metrics, and classification.

[0024] FIG. 14 shows equations illustrating a process for optimizing burn-in time.

[0025] FIG. 15 is a block diagram illustrating additional applications in a semiconductor manufacturing process for predictive analytics.

DETAILED DESCRIPTION

1. Overview

[0026] This disclosure describes new techniques for measuring and/or compensating for process variations in production runs of a semiconductor manufacturing process, for using these techniques to predict yield at any step of the process, and for optimizing testing and burn-in procedures. For example, machine learning algorithms can be used to create new approaches to data analysis by incorporating new types of input data, and the data can be more effectively correlated, organized and pre-processed, then used to make process adjustments. Data from prior production runs can be used to create a model for a target parameter, and data from a current production run can be input to the model to generate a prediction for the target parameter, and to correlate the prediction with the actual data.

2. Semiconductor Manufacturing Processes Generally

[0027] FIG. 1 is a high-level view of a typical semiconductor manufacturing process 100, in which there may actually be hundreds of steps. In general, data can be collected at every step and sub-step of the process for a production run, and yield may be calculated for each step, as well as a total yield predicted for the entire process.
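As a rough illustration of the per-step and total yield calculation mentioned above, the following sketch assumes that the total yield is the product of the individual step yields. That is a common simplification and is not specified by this disclosure. The step names and unit counts are hypothetical.

```python
from math import prod

def step_yield(units_in: int, units_out: int) -> float:
    """Fraction of units entering a step that leave it as good units."""
    if units_in <= 0:
        raise ValueError("units_in must be positive")
    return units_out / units_in

def total_yield(step_yields) -> float:
    """Overall yield, assuming each step's losses are independent (an assumption)."""
    return prod(step_yields)

# Hypothetical counts (units in, good units out) for three steps.
counts = {
    "wafer_sort": (1000, 900),
    "package_test": (900, 855),
    "final_test": (855, 846),
}
yields = {name: step_yield(i, o) for name, (i, o) in counts.items()}
overall = total_yield(yields.values())
```

With these counts, the overall yield equals the final good count divided by the starting count, because each step's input is the previous step's output.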

[0028] Wafer fabrication occurs in step 102, where a large number of integrated circuits are formed on a single slice of semiconductor substrate, such as silicon, known as a wafer. Many steps are required in various sequences to build different integrated circuits. For example, deposition is the process of growing an insulating layer on the wafer. Diffusion is the process of baking impurities into areas of the wafer to alter the electrical characteristics. Ion implantation is another process for infusing the silicon with dopants to alter the electrical characteristics. In between these steps, lithographic processing allows areas of the wafer to be patterned with an image: a mask is used to expose photoresist that has been applied across the wafer, and the exposed photoresist is developed. The pattern is then etched to remove selected portions of the developed photoresist, and these steps are repeated to create multiple layers. Finally, metallization is a specialized deposition process that forms electrical interconnections between various devices/circuits formed on the wafer. The fabrication process can take several months to complete before moving on to the post-fabrication steps.

[0029] Wafer test and sort occurs in step 104. After a wafer has been fabricated, all the individual integrated circuits that have been formed on the wafer are tested for functional defects, for example, by applying test patterns using a wafer probe. Circuits may either pass or fail the testing procedure, and failed circuits will be marked or otherwise identified, e.g., stored in a file that represents a wafer map.

[0030] Assembly and packaging takes place in step 106. The wafer is diced up into separate individual circuits or dies, and each die that passes through wafer sort and test is bonded to and electrically connected to a frame to form a package. Each die/package is then encapsulated to protect the circuit.

[0031] In step 108, the packages are subjected to random electrical testing to ensure that circuits in the package are still working as expected.

[0032] In step 110, the remaining packages go through a burn-in cycle by exposing the package to extreme but possible operating conditions. Burn-in may involve electrical testing, thermal exposure, stress screening, or a combination of these, over a period of time. Burn-in testing reveals defective components.

[0033] Finally, in step 112, a final round of electrical testing is conducted on the remaining packages.

3. Machine Learning Algorithms

[0034] Recent advances in computing technologies and data analysis techniques, such as performing parallel processing on a massive scale, have led to progress in machine learning algorithms, data mining, and predictive analytics. Machine learning is a branch of artificial intelligence that involves the construction and study of systems that can learn from data. These types of algorithms, along with parallel processing capabilities, allow for much larger datasets to be processed, without the need to physically model the data. This opens up the possibility of incorporating data analysis to make corrections on the lithographic apparatus for overlay error and critical dimension (CD) variation. For example, in addition to using the usual parameters to correct for overlay error (e.g., CD metrology, on-scanner data, wafer shape and geometry metrology, DBO measurement), process parameters and metrology data from upstream processes can also be used to train a machine learning algorithm.

[0035] Data has always played a role in semiconductor and electronics manufacturing. In the semiconductor industry, data was initially collected manually to track work-in-progress (WIP). The types of data collected included metrology data (measurements taken throughout the IC fabrication process), parametric test data, die test data, final test data, defect data, process data, and equipment data. Standard statistical and process control techniques were used to analyze and utilize the datasets to improve yields and manufacturing efficiencies. In many instances, the analysis was performed in a manual "ad-hoc" fashion by domain experts.

[0036] However, as device nodes became smaller and tolerances became tighter, factories became more automated and the ability to collect data improved. Even with this improvement in the ability to collect data, it has been estimated that no more than half of the data is ever processed. Further, of the data that is processed and stored, more than 90% of it is never again accessed.

[0037] Moving forward, data volume and velocity continue to increase rapidly. The recent norm for data collection rates on semiconductor process tools is 1 Hz. The International Technology Roadmap for Semiconductors (ITRS) predicts that the requirement for data collection rates will reach 100 Hz in three years. Most experts believe a more realistic rate will be 10 Hz. Even a 10 Hz rate represents a 10× increase in data rates. In addition to faster data rates, there are also more sensors being deployed in the semiconductor manufacturing process. For example, Applied Materials Factory Automation group has a roadmap that shows that advanced technology requirements are driving a 40% increase in sensors.

[0038] Given the massive amount of sensor data now collected, and the low retention rates of the data, advancements in data science could and should be implemented to solve the problems of the semiconductor industry. Some progress has been made to leverage data to improve efficiencies in the semiconductor and electronics industries. For example, microchip fabrication factories are combining and analyzing data to predict when a tool for a particular process needs maintenance, or to optimize throughput in the fab.

[0039] Predictive analytics and machine learning algorithms can thus be used to address the challenges facing the semiconductor industry. Drilling deeper into the details of semiconductor manufacturing, and knowing how to apply predictive analytics to detect and resolve yield issues faster and to tighten and target the specifications of individual manufacturing steps, can result in increased yield. FIG. 2 shows an example of the cumulative effects of process variation on product performance. The relationships can be complex and difficult to correlate, e.g., key performance indicators (KPIs) of the process steps, such as the critical dimensions of lithographic and etch steps 202, the dielectric film thickness 204, and film resistivity 206; parametrics, such as channel length and width 212, transistor and diode thresholds 214, and resistance 216; and product performance, such as maximum frequency 222, and maximum current 224. We can use predictive analytics to quantify those relationships, and then leverage the relationships to predict and improve product performance.

[0040] The semiconductor industry presents some unique challenges for applying predictive analytics and machine learning algorithms. Some of these challenges are: nonlinearity in most batch processes; multimodal batch trajectories due to product mix; process drift and shift; a small amount of training data (perhaps less than one lot); and process steps with variable durations (often deliberately adjusted).

[0041] A good understanding of these challenges is needed to properly employ predictive analytics. If applied properly, predictive analytics can find complex correlations that may have been difficult to uncover using other techniques. This new access to deeper understanding and insight can then be leveraged to increase yield, improve device performance, and reduce costs like never before.

[0042] In one example, machine learning algorithms can be used to predict yield. Yield prediction for a product refers to the prediction of the quality or usability of the product after any number of manufacturing steps are completed. If the yield prediction for a product is "good" at a given manufacturing step, then that product is predicted to be usable as of that manufacturing step and should continue processing. If the yield prediction is "bad", then that product is predicted to be faulty or not usable as of that manufacturing step and is not recommended for continued processing. The yield prediction is useful in determining if it is cost effective to continue processing of a product. In some embodiments, the yield prediction is a component in deciding whether or not to continue processing of the product. The yield prediction is not necessarily the only variable in making a decision about whether or not to continue processing of a product.
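The following is a minimal sketch of how a "good"/"bad" yield prediction might feed a continue/stop decision as one component among others. The confidence value, the cost and value figures, the `min_confidence` threshold, and the decision rule itself are all hypothetical assumptions for illustration and are not taken from this disclosure.

```python
from dataclasses import dataclass

@dataclass
class YieldPrediction:
    label: str         # "good" or "bad"
    confidence: float  # 0.0 to 1.0, a hypothetical confidence metric

def should_continue(pred: YieldPrediction, remaining_cost: float,
                    unit_value: float, min_confidence: float = 0.8) -> bool:
    """Return True if processing of this product should continue."""
    # Probability that the unit ends up usable, derived from label and confidence.
    p_good = pred.confidence if pred.label == "good" else 1.0 - pred.confidence
    # A low-confidence "bad" prediction alone does not stop processing.
    if pred.label == "bad" and pred.confidence < min_confidence:
        return True
    # Otherwise continue only when the expected value exceeds the remaining cost.
    return p_good * unit_value > remaining_cost

# Hypothetical usage: remaining processing cost 10, finished unit value 50.
keep_good = should_continue(YieldPrediction("good", 0.9), 10.0, 50.0)
keep_bad = should_continue(YieldPrediction("bad", 0.95), 10.0, 50.0)
keep_unsure = should_continue(YieldPrediction("bad", 0.6), 10.0, 50.0)
```

The cost comparison reflects the point that the prediction is used to judge whether continued processing is cost effective, rather than acting as the sole criterion.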

[0043] In another example, virtual metrology can use machine learning algorithms to predict metrology metrics such as film thickness and critical dimensions (CD) in real time, without having to take actual measurements. This can have a big impact on throughput and also lessen the need for expensive TEM or SEM cross-section measurements. Using sensor data from production equipment and actual metrology values of sampled wafers to train the algorithm, virtual metrology can predict metrology values for all wafers. The algorithm can be a supervised learning algorithm, where a model can be trained using a set of input data and measured targets. The targets can be the critical dimensions that are to be controlled. The input data can be upstream metrology measurements, or data from process equipment (such as temperatures and run times).
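To make the supervised-learning idea concrete, the sketch below trains a plain linear model on a few sampled wafers and then predicts a metrology value for an unmeasured wafer. Linear least squares is only a stand-in chosen for brevity; the disclosure does not name a specific algorithm here. The feature names (chamber temperature, run time), the data values, and the helper functions are hypothetical.

```python
def fit_linear(X, y):
    """Ordinary least squares via the normal equations (X^T X) w = X^T y.

    X is a list of feature rows; a constant 1.0 is appended for the intercept.
    """
    rows = [list(r) + [1.0] for r in X]
    n = len(rows[0])
    # Build the augmented normal-equation matrix [X^T X | X^T y].
    A = [[sum(r[i] * r[j] for r in rows) for j in range(n)]
         + [sum(r[i] * t for r, t in zip(rows, y))] for i in range(n)]
    # Gauss-Jordan elimination with partial pivoting.
    for c in range(n):
        p = max(range(c, n), key=lambda k: abs(A[k][c]))
        A[c], A[p] = A[p], A[c]
        for k in range(n):
            if k != c:
                f = A[k][c] / A[c][c]
                A[k] = [a - f * b for a, b in zip(A[k], A[c])]
    return [A[i][n] / A[i][i] for i in range(n)]

def predict(w, x):
    """Apply the fitted weights (last weight is the intercept) to one row."""
    return sum(wi * xi for wi, xi in zip(w, list(x) + [1.0]))

# Hypothetical sampled wafers: (chamber temperature in C, run time in s).
X = [(400.0, 60.0), (410.0, 60.0), (400.0, 65.0), (420.0, 62.0), (405.0, 58.0)]
# Hypothetical measured film thickness in nm for those wafers.
y = [220.0, 225.0, 230.0, 234.0, 218.5]

w = fit_linear(X, y)
pred = predict(w, (415.0, 61.0))  # predicted thickness for an unmeasured wafer
```

In practice the sampled wafers supply the measured targets, and the fitted model is then applied to every wafer's equipment data.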

[0044] In yet another example, the metrology measurements taken in-situ, or after a particular semiconductor process is complete, can be used as part of the input data for the virtual metrology system. For example, metrology data can be collected after a CMP step that occurred in one or more processing steps preceding the current lithography step. These metrology measurements can also be thickness data determined by each metrology system, or the refractive index and absorption coefficient.

[0045] In another example, metrology data can be collected during etch processes. Optical emissions spectra or spectral data from photoluminescence can be utilized as input data. Data transformation or feature engineering can be performed on in-situ spectral data or other sensor data that is collected during a particular process such as etch, deposition, or CMP. As an example, multiple spectra may be collected in-situ during processing. The spectral set used may be all spectra collected during processing, or a subset of spectra collected during processing. Statistics such as mean, standard deviation, min, and max may be collected at each wavelength interval of the spectral set over time and used as data inputs. As an alternative example, similar statistics can be collected for a given spectrum, and the time series of those statistics can be used as data inputs. As yet another example, peaks and valleys in the spectrum can be identified and used as data inputs (applying similar statistical transformation). The spectra may need to be normalized or filtered (e.g., lowpass filter) to reduce process or system noise. Examples of in-situ spectral data include reflectometry from the wafer, optical emissions spectra (OES), or photoluminescence.
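The transformations described above can be sketched as follows, assuming the in-situ spectra are stored as a list of equal-length intensity lists indexed as `spectra[time][wavelength]`. The function names, the moving-average choice of low-pass filter, and the sample values are illustrative assumptions.

```python
from statistics import mean, pstdev

def moving_average(spectrum, window=3):
    """Simple low-pass filter: centered moving average, truncated at the edges."""
    half = window // 2
    return [mean(spectrum[max(0, i - half): i + half + 1])
            for i in range(len(spectrum))]

def per_wavelength_stats(spectra):
    """Statistics over time at each wavelength bin (spectra[t][w])."""
    columns = list(zip(*spectra))  # transpose: one tuple per wavelength bin
    return [{"mean": mean(c), "std": pstdev(c), "min": min(c), "max": max(c)}
            for c in columns]

def find_peaks(spectrum):
    """Indices of strict local maxima (interior points only)."""
    return [i for i in range(1, len(spectrum) - 1)
            if spectrum[i - 1] < spectrum[i] > spectrum[i + 1]]

# Hypothetical data: three spectra over time, four wavelength bins each.
spectra = [
    [1.0, 5.0, 2.0, 1.0],
    [1.0, 7.0, 2.0, 3.0],
    [1.0, 6.0, 2.0, 2.0],
]
stats = per_wavelength_stats(spectra)
peaks = find_peaks(spectra[0])
smoothed = moving_average(spectra[0])
```

The alternative mentioned in the text, per-spectrum statistics used as a time series, would apply the same statistics to each row of `spectra` instead of each column.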

[0046] In yet another example, the target of a virtual metrology model can be the output of wafer probe tests, or measurements made by wafer probe tests. Additionally, the outputs from final wafer electrical testing, wafer sort tests and wafer acceptance tests can be used as a target to the virtual metrology model. Examples of final wafer electrical testing parameters include, but are not limited to, diode characteristics, drive current characteristics, gate oxide parameters, leakage current parameters, metal layer characteristics, resistor characteristics, via characteristics, etc. Examples of wafer sort parameters include, but are not limited to, clock search characteristics, diode characteristics, scan logic voltage, static IDD, IDDQ, VDD min, power supply open short characteristics, ring oscillator frequency, etc. The target of a virtual metrology model can be the output from a final test. The target can come from tests that occur multiple times under different electrical and temperature conditions, and before and after device reliability stresses, such as burn-in, or tests that occur at a burn-in step. The target can come from electrical tests that are a mix of functional, structural and system-level tests.

[0047] In yet another example, machine learning algorithms can be used to control a manufacturing process step. As noted above, virtual metrology can be used to predict a critical dimension or film thickness for a manufacturing process step. Before or during processing of this manufacturing step, the prediction can then be used to set and/or control any number of processing parameters (e.g. run time) for that processing step.
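As a simple sketch of using a prediction to set a processing parameter, the function below rescales a nominal run time so that a predicted film thickness lands on target. It assumes thickness grows linearly with run time, which is a simplification, and the function name, limits, and numbers are hypothetical.

```python
def adjust_run_time(target_nm, predicted_nm, nominal_time_s,
                    min_time_s, max_time_s):
    """Scale the nominal run time so the predicted thickness hits the target.

    Assumes thickness is proportional to run time (an illustrative assumption).
    """
    if predicted_nm <= 0:
        raise ValueError("predicted thickness must be positive")
    proposed = nominal_time_s * target_nm / predicted_nm
    # Clamp to the recipe limits so a poor prediction cannot run away.
    return max(min_time_s, min(max_time_s, proposed))

# Hypothetical usage: the model predicts 105 nm at the nominal 60 s run time.
new_time = adjust_run_time(100.0, 105.0, 60.0, 50.0, 70.0)
clamped_time = adjust_run_time(100.0, 50.0, 60.0, 50.0, 70.0)
```

Clamping to recipe limits is a design choice added here so that an out-of-range prediction results in a bounded adjustment.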

[0048] In yet another example, machine learning algorithms can be used to predict when a fault or defect will occur in the manufacturing process or on a specific tool at a process step. Identifying a machine fault or failure, and finding the root cause of faults quickly can be essential in semiconductor manufacturing. If faults in the manufacturing process can be better detected and resolved, downtime and scrap can be reduced. This is also referred to as fault detection and classification (FDC). If faults can be predicted before they occur, then downtime can be optimally scheduled and scrap can be even further reduced. As an example, decision trees can be used to determine which input features can best predict a fault in a process, and develop decision rules around detecting a fault.
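The decision-tree idea can be illustrated with a single-level tree (a "stump") that searches every input feature for the threshold that best separates faulty runs from normal ones. This is a deliberately reduced sketch using Gini impurity; the feature names and data are hypothetical, and a production system would use a full tree implementation.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)

def best_split(X, y, names):
    """Return (feature name, threshold, weighted impurity) of the best stump."""
    best = None
    for j, name in enumerate(names):
        values = sorted({row[j] for row in X})
        # Candidate thresholds are midpoints between adjacent distinct values.
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2.0
            left = [lab for row, lab in zip(X, y) if row[j] <= t]
            right = [lab for row, lab in zip(X, y) if row[j] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if best is None or score < best[2]:
                best = (name, t, score)
    return best

# Hypothetical tool data: label 1 marks a run that ended in a fault.
names = ["chamber_pressure", "rf_power"]
X = [[10.0, 300.0], [10.5, 310.0], [12.0, 305.0],
     [12.5, 295.0], [10.2, 298.0], [12.2, 312.0]]
y = [0, 0, 1, 1, 0, 1]

feature, threshold, impurity = best_split(X, y, names)
# The resulting rule reads: "fault if <feature> > <threshold>".
```

The selected feature indicates which input best predicts the fault, and the threshold gives a decision rule of the kind described above.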

4. Lithography and Overlay Errors

[0049] As noted above, lithography processes present a challenge for sub-20 nm node manufacturing. A lithographic apparatus is a machine that applies a desired pattern onto a substrate, usually onto a targeted portion of the substrate. A circuit pattern of an individual integrated circuit (IC) layer is generated by a patterning device, usually referred to as a mask or a reticle, which transfers the pattern onto a target. Typically, the pattern is transferred by imaging onto a layer of material (e.g., resist) that is sensitive to radiation, which has been formed on the substrate. A network of successively patterned adjacent target portions will reside on one substrate.

[0050] One type of lithographic apparatus is a stepper, in which the entire pattern of a target portion is exposed in a single instance. Another type of lithography apparatus is a scanner, where the target portion is irradiated via scanning the pattern with a radiation beam in a given direction, while scanning the substrate parallel or anti-parallel to this direction.

[0051] The location of patterned features in subsequent layers must be very precise in order to build the devices properly. All features should have sizes and shapes that are formed within specified tolerances. The overlay error, which refers to the offset or mismatch between features on adjacent layers, should be minimized and within tolerance in order for the manufactured devices to function properly. Overlay measurements are thus important for determining the overlay error of a given pattern exposed with a mask on the resist layer.

[0052] An overlay measurement module typically performs the overlay measurement using an optical inspection system. The position of the mask pattern in the resist layer relative to the position of the pattern on the substrate is determined by measuring an optical response from an optical marker on the substrate which is illuminated by an optical source. The signal generated by the optical marker is measured by a sensor arrangement. Using the output of the sensors, the overlay error can be derived. Typically, the patterns on which overlay error is measured are located within a scribe lane in between target portions.

[0053] Two common concepts for measuring overlay are image based overlay (IBO) and diffraction based overlay (DBO). For IBO, the image position of the substrate pattern is compared to the mask pattern position in the resist layer. Overlay error is a result of the comparison of these two image positions. Imaging approaches are conceptually straightforward, since they are based on analysis of a "picture" directly showing the alignment of the two layers. For example, box-in-box or line-in-line alignment marks are commonly used in the two layers. However, IBO error measurement may be sensitive to vibrations and also to the quality of focus during measurement, which can both result in blurring of the picture. Aberrations in the optics may further reduce the accuracy of the IBO measurement.

[0054] For DBO, a first diffraction grating pattern is located on the pattern layer, and a second diffraction grating pattern with identical pitch is located in the resist layer. The second grating should be nominally on top of the first grating, and by measuring the intensity of the diffraction patterns, an overlay measurement may be obtained. If there is an overlay error between the two gratings, it will be detectable in the diffraction pattern. DBO is less sensitive to vibration than IBO.

[0055] To make multi-patterning solutions work, especially in light of the extremely small dimensions now being implemented, the need for more precise and accurate mask overlay has become critically important. In addition to minimizing mask overlay errors, critical dimension uniformity (CDU) has also become important as the convolution of overlay error and critical dimension (CD) variation can lead to shorts, connection failures, and malfunctioning devices.

[0056] For example, FIG. 3A shows a top view of a portion of a device 300 having a feature 302 formed on a first layer and a feature 304 formed on a second layer, e.g. above the first layer, without any apparent overlay error. Another feature (not shown) is also formed on the first layer under and in direct alignment with feature 304 thereby creating no overlay error.

[0057] In contrast, FIG. 3B shows a top view of a portion of a different device 310 having features 312 and 313 formed on the first layer. Feature 314 is formed on the second layer and should line up with feature 313 on the first layer, but in this example exhibits an overlay error 311 due to the misalignment of features 313 and 314.

[0058] FIG. 4 shows a top view of a portion of a device 400 having a CD variation between features formed in a single layer. Thus, the dimension between features is designed to be "x" and that dimension is observed between features 401 and 402 and between features 403 and 404. However, between features 402 and 403 the dimension is "less than x" which is a critical dimension error.

[0059] FIG. 5A is a side view of a device 500 having a substrate 501 and a first layer 502 of features formed on top of the substrate. A second layer of features 503 is formed on top of the first layer 502 in two different lithography steps. For example, features 511-514 are formed in a first lithography step, and then features 515-517 are formed in a second lithography step. In this example, there are no apparent overlay errors between features on the different layers, as well as no CD errors since the dimension between the features formed in the different lithography steps is consistently "x."

[0060] FIG. 5B is a side view of a different device 520 having a substrate 521, a first layer 522 of features formed on top of the substrate, and a second layer of features 503 formed on top of the first layer 522 in two different lithography steps, namely features 531-534 formed in a first lithography step, and features 535-537 formed in a second lithography step. In this example, however, there is an apparent overlay error 550 in the second lithography step as features 535-537 are misaligned relative to the first layer. There is also a CD error between the features formed in the different lithography steps, where the dimension on one side of the features is "greater than x" and the dimension on the other side of the features is "less than x."

[0061] Thus, determining and applying compensation for overlay errors and CD errors has become extremely important in the lithography process. Table I below illustrates the ever-tightening budget for acceptable overlay error and CD error for smaller and smaller nodes:

TABLE I
Technology Node (nm)	28	20	14	10
Overlay budget (nm)	9.0	6.0	4.5	3.5
CD spec (nm)	4.5	3.0	2.0	1.3

[0062] There are many sources of patterning errors that lead to overlay and CD errors. For example, the reticle may cause placement errors, CD uniformity errors, and haze defects. The lithography and etch processes may have focus and/or exposure errors, overlay issues, etch profile issues (such as CD and shape), and other defects. The wafer fabrication and other processes may have issues with wafer shape and uniformity, film property uniformity, CMP uniformity, thermal processing, and backside and edge defects.

[0063] As processing technology transitions toward smaller and smaller nodes, such as 10 nm and 7 nm, there is serious concern about the capability of available metrology solutions. The uncertainty in these solutions must be minimized so the proper adjustments can be made to the scanner or stepper to correct for the overlay and CD errors. While overlay can be defined in an x-y coordinate system, or as a vector representing the overlay, there are many components on the lithography apparatus that can provide adjustments to correct for overlay.

[0064] Thus, new techniques are described for measuring and/or compensating for lithographic pattern errors such as overlay error and CD error. Machine learning algorithms can be used to create new approaches to data processing and process control. For example, more and varied types of input data can be provided to the machine learning algorithms, and the data can be more effectively organized and pre-processed to determine how to adjust one or more parameters of the lithography apparatus to correct the errors.

[0065] Referring to FIG. 6, a flow chart illustrates a method 600 for creating and deploying a model to evaluate a semiconductor manufacturing process in order to correct for errors in a lithographic process, such as overlay errors and CD errors. In step 602, a target is selected. In one embodiment, the target is an overlay measurement (e.g., IBO measurement, DBO measurement, CD-SEM, TEM, etc.) and could be a linear overlay offset in the x and y direction. The target could also be other lithography apparatus parameters that need to be controlled to minimize overlay error, such as reticle position, reticle rotation, or reticle magnification. The target could be parametric data such as on/off current of the transistor, transistor thresholds, or some other parameter that quantifies the health of the transistor. The target could also be yield information, such as the functionality of a given die or area on the wafer (sometimes measured as either pass or fail). The target could also be semiconductor device performance data.

[0066] In step 604, the parameters that are useful in evaluating the target are identified, and in step 606, input data relevant to the parameters is collected. Every set of input data is associated with a specific output or target. For example, a set of measured and observed values can be associated with an overlay offset. Those values would be an input vector to the model, and would be associated with the target, e.g., the measured offset. If there are n input variables, then the input vector size for each target would be 1.times.n. Therefore, if there are m targets, there will be an input data matrix of size m.times.n, with each row of the input data matrix associated with a target. This is a typical training set in matrix format for a machine learning algorithm. An illustration of this matrix is given in Table II below:

TABLE II
Target	Input Data
Target 1	Input feature 1, 1	Input feature 1, 2	. . .	Input feature 1, n
Target 2	Input feature 2, 1	Input feature 2, 2	. . .	Input feature 2, n
. . .	. . .	. . .	. . .	. . .
Target m	Input feature m, 1	Input feature m, 2	. . .	Input feature m, n
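The m x n layout of Table II can be sketched as follows. The record fields, feature names, and values are hypothetical and serve only to show how each target is paired with its 1 x n input vector.

```python
def build_training_set(records, feature_names, target_name):
    """Return (X, y): X is an m x n list of rows, y holds the m targets."""
    X = [[rec[name] for name in feature_names] for rec in records]
    y = [rec[target_name] for rec in records]
    return X, y

# Hypothetical records, one per measured site; names and values are illustrative.
records = [
    {"film_thickness_nm": 101.2, "etch_time_s": 42.0, "cmp_time_s": 61.0, "overlay_x_nm": 1.8},
    {"film_thickness_nm": 99.7, "etch_time_s": 41.5, "cmp_time_s": 60.2, "overlay_x_nm": -0.6},
]
feature_names = ["film_thickness_nm", "etch_time_s", "cmp_time_s"]
X, y = build_training_set(records, feature_names, "overlay_x_nm")
# Here m = 2 targets and n = 3 input variables, so X has 2 rows of 3 values.
```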

[0067] The target data could be collected after other processes have been completed, or could be collected after the semiconductor device has finished all of its processing. Post packaging data could also be used as targets.

[0068] Some of the parameters that are already regularly used in overlay error compensation and lithography apparatus control will be used as part of this input dataset. For example, these regularly used parameters can include DBO measurements from the metrology equipment, wafer shape and geometry measurements, or parameters from the lithography apparatus.

[0069] Most importantly, other parameters from upstream semiconductor processes and metrology can be used as inputs to the algorithm as well. These input parameters can include other metrology measurements from earlier process steps, including optical reflectometry or ellipsometry (normal incident, polarized or unpolarized light, oblique angles of incidence, and varying azimuth angles).

[0070] These metrology measurements can be inputs to the algorithm as an intensity at a given wavelength. For example, metrology data may be incorporated from a reflectometry measurement taken after a certain processing step (for example, etch, or deposition). If the reflectometry data is collected by illuminating the target with unpolarized broadband light and has a detectable wavelength range of 250 nm to 850 nm, then the user could choose to sample that light from 250 nm to 850 nm at 2 nm intervals, to get a total of 301 spectral intensity measurements for that wavelength range. These 301 samples would each be an input to the algorithm. An example of how the input data is associated with a target is shown in Table III.

TABLE III
	Input Data
Target	Intensity 250 nm	Intensity 252 nm	. . .	Intensity 850 nm
Target 1	1.2	1.4	. . .	1.5
Target 2	1.3	1.2	. . .	1.7
. . .	. . .	. . .	. . .	. . .
Target m	0.9	0.8	. . .	1.1
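The spectral sampling described above can be checked with a short sketch. The grid reproduces the 250 nm to 850 nm range at 2 nm intervals. The linear interpolation of a raw spectrum onto that grid is an assumption added for illustration, since the text does not say how the samples are obtained from the instrument output.

```python
def wavelength_grid(start_nm=250, stop_nm=850, step_nm=2):
    """Wavelengths (nm) at which intensity is sampled, endpoints included."""
    return list(range(start_nm, stop_nm + 1, step_nm))

def resample(raw_wavelengths, raw_intensities, grid):
    """Linearly interpolate a measured spectrum onto the grid.

    Assumes raw_wavelengths is ascending and spans the grid.
    """
    out = []
    k = 0
    for w in grid:
        # Advance to the raw segment [k, k + 1] that contains w.
        while k < len(raw_wavelengths) - 2 and raw_wavelengths[k + 1] < w:
            k += 1
        w0, w1 = raw_wavelengths[k], raw_wavelengths[k + 1]
        i0, i1 = raw_intensities[k], raw_intensities[k + 1]
        out.append(i0 + (i1 - i0) * (w - w0) / (w1 - w0))
    return out

grid = wavelength_grid()
# Hypothetical coarse spectrum with three measured points.
row = resample([250, 550, 850], [1.0, 2.0, 1.0], grid)
# len(grid) is 301, so each target receives 301 intensity inputs.
```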

[0071] The metrology measurements can be taken in-situ, or after a particular semiconductor process is complete. For example, metrology data can be collected after a CMP step that occurred in one or more processing steps preceding the current lithography step. These metrology measurements can also be thickness data determined by each metrology system, or the refractive index and absorption coefficient. In another example, metrology data can be collected during etch processes. Optical emissions spectra or spectral data from photoluminescence can be utilized as input data.

[0072] Data transformation or feature engineering can be performed on in-situ spectral data or other sensor data that is collected during a particular process such as etch, deposition, or CMP. As an example, multiple spectra may be collected in-situ during processing. The spectral set used may be all spectra collected during processing, or a subset of spectra collected during processing. Statistics such as mean, standard deviation, min, and max may be collected at each wavelength interval of the spectral set over time and used as data inputs. As an alternative example, similar statistics can be collected for a given spectrum, and the time series of those statistics can be used as data inputs. As yet another example, peaks and valleys in the spectrum can be identified and used as data inputs (applying similar statistical transformation). The spectra may need to be normalized or filtered (e.g., with a low-pass filter) to reduce process or system noise. Examples of in-situ spectral data include reflectometry from the wafer, optical emissions spectra (OES), or photoluminescence.
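The first transformation mentioned above, statistics at each wavelength interval over time, might look like the following sketch. The use of the population standard deviation and the flat ordering of the output features are assumptions chosen for illustration.

```python
from statistics import mean, pstdev

def per_wavelength_stats(spectra):
    """Summarize spectra collected over time on a shared wavelength grid.

    spectra: list of spectra, each a list of intensities.
    Returns a flat list holding mean, std, min, max for each wavelength.
    """
    features = []
    for series in zip(*spectra):  # one time series per wavelength
        features.extend([mean(series), pstdev(series), min(series), max(series)])
    return features

# Two hypothetical spectra sampled at two wavelengths.
features = per_wavelength_stats([[1.0, 2.0], [3.0, 2.0]])
```

The alternative described in the text, statistics per spectrum tracked as a time series, would iterate over `spectra` directly rather than over `zip(*spectra)`.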

[0073] The input parameters could also include non-optical measurements, such as Rs (conductivity, resistivity) measurements taken by probes, and other types of contact measurements, such as those taken by a high resolution profiler (HRP).

[0074] The input parameters can also originate from a Plasma Impedance Monitor (PIM) which can be installed between the matching network and the plasma electrodes of an etcher, and can provide data on reactance, impedance, resistance, current, voltage, power, phase and fundamental frequencies.

[0075] Process equipment measurements or metrics can also be used as inputs to the algorithm, such as gas flow sensors, power sensors, pressure sensors, temperature sensors, current sensors, voltage sensors, etc. This data can be collected in process steps that occurred before the lithography step where overlay is to be measured and controlled. Examples of these include process time, RF frequency and power from an etch chamber, electric current and impedance measurements, CMP polish times, motor current from the CMP tool, CVD deposition times and information from mass flow controllers, temperatures, pressures, etc. This data could be from any or all upstream processes from the lithography step being performed.

[0076] Parametric data and measurements such as channel width and depth, transistor thresholds, and resistance can also be used as inputs to the algorithm.

[0077] The diffraction spectra or data used in the DBO technique can be part of the input data as well. All of the above mentioned inputs could be correlated to slight variations in the DBO output, and could thus result in better control of the overlay error compensation or better lithography control given the CD measurements from etch.

[0078] CD measurements taken after etch are an important parameter to single out as an input. As discussed above, these measurements are convolved with the overlay error to determine device performance or yield.

[0079] In DBO measurement systems, diffracted light is used to measure overlay. However, changes in upstream processes can affect the spectral signature. For example, if there is a shift in the index of refraction of an upstream film property, then the spectral signature can change. Likewise, if the sidewall angle of the diffraction grating shifts due to a process shift, this may cause a change in the spectral signature. Therefore, by training the machine learning algorithm with upstream data that may have an effect on the diffraction spectra, the overlay error can be tightened or the overlay measurement can be made to be more accurate if correlations are discovered between upstream processes and the spectral signature of the diffraction grating.

[0080] Returning to FIG. 6, in step 608, filtering, normalization and/or cleansing steps can be performed on the input data.

[0081] In step 610, a dimensionality reduction or feature selection step is performed. The purpose of this step is to reduce the number of input parameters for the algorithm. Dimensionality reduction techniques are generally known, for example, principal component analysis (PCA).

[0082] In step 612, the data is then fed into the algorithm for training. The algorithm could be one of many different types of algorithms. Examples of machine learning algorithms include Decision Trees, such as CART (Classification and Regression Trees), C5.0, C4.5, and CHAID; Support Vector Regression; Artificial Neural Networks, including Perceptron, Back Propagation, and Deep Learning (BigData enabled); and Ensemble, including Boosting/Bagging, Random Forests, and GBM (Gradient Boosting Machine). The best algorithm may not be a single algorithm, but can be an ensemble of algorithms.

[0083] In particular, the GBM (Gradient Boosting Machine) and Random Forests algorithms can produce the best results. Other machine learning algorithms, including the ones mentioned above, can also work well and should be considered.

[0084] Given the training input data and training targets, the algorithm will produce a model in step 614. The model can then be deployed in step 616.

[0085] FIG. 7 illustrates one example of collecting input data for an input feature set 710, which is a matrix 712 having a number of input parameters 712a, 712b . . . 712x, which are relevant to a specified target, which may be a measurement, a calculated parameter, or a modeled parameter. The input data may be collected during wafer fabrication, at or before wafer test and sort and/or wafer probe testing. For example, input data can be collected from the process equipment 720 during steps for etch, CMP, gap fill, blanket, RTP, etc., and may include process variables such as process duration, temperature, pressure, RF frequency, etc. Input data may also include metrology data 730 such as CD, wafer shape, film thickness, film resistivity, inline or in-situ measurements, etc. Input data may also include parametric data 740 such as channel length, channel width, channel depth, transistor thresholds, resistance, etc.

[0086] FIG. 8 illustrates use of the model. In step 802, specified input data is collected, e.g., as an input vector, then fed into the model in step 804. If some of the specified data is not present in the 1.times.n vector, there are a number of techniques that can replace or estimate the missing data in the input vector.
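One common way to replace missing entries in the 1 x n input vector is to substitute the mean of that feature over the training data. The text does not name a specific technique, so mean imputation and the use of `None` to mark a missing value are assumptions in this sketch.

```python
def column_means(X):
    """Mean of each feature over the training rows, ignoring missing (None) values."""
    means = []
    for col in zip(*X):
        present = [v for v in col if v is not None]
        means.append(sum(present) / len(present))
    return means

def impute(vector, means):
    """Fill missing entries of a 1 x n input vector with the training means."""
    return [m if v is None else v for v, m in zip(vector, means)]

# Hypothetical training rows with one missing value.
training = [[1.0, 10.0], [3.0, None]]
means = column_means(training)
filled = impute([None, 5.0], means)
```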

[0087] For each input vector of size 1.times.n fed into the algorithmic model, a score will be generated in step 806. The score is a prediction of the target made by the model, given the input data. The score generated by the model will correspond to whatever metric was used as a target for training the algorithm that generated the model. For example, if a DBO measurement was used for the target to train the algorithm, then the score will be a predicted DBO measurement. If the target was a parametric test value, then the score will be a prediction of that parametric test value. In a typical situation, the score can be the overlay offset prediction, for example, an offset in the x direction or the y direction. In step 808, the score is used to determine an adjustment to be made to one or more components of the lithographic apparatus. For example, the offset data could be applied to a control system to make an adjustment to the lithography apparatus parameters or "control knobs" to adjust for the overlay error.

[0088] In addition to the score, the model can also output a confidence metric that describes how reliable the score prediction is. This can be useful in determining whether or not to employ the score, or weight the use of that prediction in conjunction with other traditional measurements. For example, if the predicted offset is 3.0 nm, the DBO measured offset is 6 nm, and there is a confidence of 0.8 (out of 1.0) in the prediction, then the final predicted offset would be:

(3.0*0.8)+(6.0*0.2)=3.6 nm
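The weighting in this example can be written as a small function. The function and parameter names are hypothetical, and the linear weighting simply mirrors the arithmetic shown above.

```python
def blend(predicted_nm, measured_nm, confidence):
    """Weight the model prediction by its confidence and the measurement by the remainder."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must lie between 0 and 1")
    return confidence * predicted_nm + (1.0 - confidence) * measured_nm

final_offset_nm = blend(3.0, 6.0, 0.8)  # approximately 3.6 nm, as in the example above
```

A confidence of 1.0 returns the model prediction unchanged, and a confidence of 0.0 falls back entirely on the DBO measurement.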

[0089] As previously discussed, the convolution of CD error and overlay error can affect device performance. In order to optimize the device performance, it may be necessary to adjust the overlay for a given CD. In one embodiment, machine learning algorithms could be used with all or some of the above mentioned input data, along with CD error measurement and overlay error measurement to create a model whose target is a lithography apparatus control parameter, such as focus, power, or x-y direction control. The goal is to optimize the lithography apparatus control parameter (given a measured CD) such that the lithography apparatus output results in the best semiconductor device performance or yield.

[0090] As new input data and corresponding target data are generated, the algorithm can be retrained so as to produce a better model that will give better scores. A set of algorithms can be trained simultaneously with the same input and target dataset. The algorithm that gives the best output can be the algorithm that is ultimately deployed. Alternatively, an ensemble of algorithms can be identified as the best algorithm to be employed. The best algorithm is whichever algorithm gives the best results in a validation test on the training dataset. For example, k-fold cross validation is a popular technique for validating algorithms.
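A validation test of this kind is often run as k-fold cross validation, in which each sample is held out exactly once. The sketch below only generates the index splits; training and scoring each candidate algorithm on the splits is left out, and the contiguous, unshuffled folds are an assumption made to keep the example short.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs; every sample is held out exactly once."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size

folds = list(k_fold_indices(10, 3))
# Each algorithm would be trained on `train` and scored on `test` for every fold,
# and the algorithm with the best average score would be selected.
```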

[0091] As noted above, the input dataset should undergo preprocessing. The preprocessing step can improve the quality of the input dataset and increase the accuracy and precision of predictions made by the model. In some embodiments, other data preparation techniques can be applied to the input data, such as normalization or parameterization of the data.

[0092] Additionally, a z-score can be generated to compensate for drift and shift in the data. For example, if a tool is calibrated, the input data may shift. If a shift occurs, this may change the overall mean and standard deviation of the input data, which would generate poor results with the model. Either a human or an algorithm can signal when a shift occurs, such as when a process tool undergoes calibration, and the data can be collected for a period of time in a "listening mode" (algorithm prediction is not applied to product) after the calibration to ensure there are no faulty predictions. After a certain period of time, a z-score is generated from that data. The z-score should be similar to the z-score of the data that occurred before the calibration. This is an example of normalizing the data before and after a calibration has taken place.

[0093] In some embodiments, virtual metrology predictions generated from upstream process equipment and metrology data can be used as inputs to the model. This essentially represents a multi-step model or algorithm, where the virtual metrology predictions are first determined by a first algorithm. The outputs of that first algorithm can then be used as inputs to another algorithm designed, for example, for overlay error compensation, overlay error measurement, or yield prediction.

[0094] A prediction by the algorithm can be made after all testing and manufacturing is complete on the product. In a typical situation, the goal is to predict whether the product will fail after shipping and/or while in use, even if the product has passed all final testing successfully.

[0095] The algorithm can be a classification or regression algorithm, which are types of machine learning algorithms, but could be one of many different types of algorithms. Examples of some of these algorithms that can be used include: Decision Trees, CART (Classification and Regression Trees), C5.0, C4.5, CHAID, Support Vector Regression, Artificial Neural Networks, Perceptron, Back Propagation, Deep Learning, Ensemble, Boosting/Bagging, Random Forests, GBM (Gradient Boosting Machine), AdaBoost.

[0096] In some embodiments, the best algorithm may not be a single algorithm, but can be an ensemble of algorithms. An ensemble of algorithms can use different techniques to determine which algorithm or combination of algorithms gives the best prediction. For example, an ensemble algorithm can take the average recommendation from all of the algorithms in the ensemble. In another example, an ensemble algorithm can use a voting scheme to make the final recommendation. The ensemble algorithm can use different weighting schemes applied to a collection of individual algorithms in order to produce the best prediction.
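The three combination schemes mentioned above (averaging, voting, and weighting) can be sketched as follows. The function names and the sample predictions are hypothetical, and how the weights are chosen is not specified in the text.

```python
from collections import Counter

def average(predictions):
    """Unweighted mean of the individual algorithms' numeric predictions."""
    return sum(predictions) / len(predictions)

def weighted_average(predictions, weights):
    """Mean of the predictions, with each algorithm scaled by its weight."""
    return sum(p * w for p, w in zip(predictions, weights)) / sum(weights)

def majority_vote(labels):
    """Most common class label among the individual algorithms."""
    return Counter(labels).most_common(1)[0][0]

# Hypothetical outputs from three models.
offset_nm = average([2.0, 4.0, 6.0])
weighted_nm = weighted_average([2.0, 4.0], [3.0, 1.0])
verdict = majority_vote(["pass", "fail", "pass"])
```

Averaging and weighting suit regression targets such as an overlay offset, while voting suits classification targets such as pass or fail.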

[0097] In particular, good predictions have been produced using the GBM (Gradient Boosting Machine) and Random Forests algorithms.

[0098] The score is a prediction made for each input vector fed into the model when the model is deployed. For example, if the goal is to predict whether or not a wafer will be identified as "good" at wafer test, the input vector can consist of all input data associated with that wafer and that input data will be fed into the model to make the prediction.

[0099] In some embodiments, the model can also output a confidence metric that can describe how reliable the score is. This can be useful in determining whether or not to employ the score, or to optimize final testing, or to calculate burn-in time, or it could be used in a final yield prediction. In the case of a multi-step algorithm, the confidence metric can be used as an input to a subsequent algorithm.

[0100] A propensity metric can also be generated when the algorithm is a classification algorithm, and in one embodiment, will have a value between 0 and 1. As an example, if the propensity value is near 0, then the likelihood is that a prediction is one classification (e.g., FALSE). If the propensity value is near 1, then the likelihood is that a prediction is the other classification (e.g., TRUE). The propensity metric can indicate how confident the algorithm is in making the given prediction, i.e., the closer the propensity metric is to either 0 or 1, the higher the confidence that the prediction is correct. In the case of a multi-step algorithm, the propensity metric can be used as an input to a subsequent algorithm.
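The relationship between the propensity value and confidence can be illustrated with a short sketch. The 0.5 decision threshold and the linear mapping from distance-to-0.5 onto a 0-to-1 confidence are assumptions, since the text only states that values nearer 0 or 1 indicate higher confidence.

```python
def classify(propensity, threshold=0.5):
    """Return (label, confidence) for a propensity value between 0 and 1.

    label is True when the propensity is at or above the threshold.
    confidence is 0.0 at a propensity of 0.5 and 1.0 at either extreme.
    """
    if not 0.0 <= propensity <= 1.0:
        raise ValueError("propensity must lie between 0 and 1")
    label = propensity >= threshold
    confidence = abs(propensity - 0.5) * 2.0
    return label, confidence

label_hi, conf_hi = classify(0.9)
label_lo, conf_lo = classify(0.1)
```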

[0101] In an embodiment, as new input data and corresponding target data is generated, the algorithm can be retrained so as to produce a better model that will give better scores.

[0102] In some embodiments, a set of algorithms can be trained simultaneously with the same input and target dataset. The algorithm that gives the best output can be selected for deployment.

[0103] In one example, algorithms can be applied to the processing and manufacturing of finFET structures. Flowable gap-fill film material properties are variable, which affects the film density and its optical properties. This can confuse optical metrologies used to measure and control film thicknesses, leading to erroneous film thickness measurements. In the fabrication of finFETs, this can lead to erroneous measurement of the gate height, and thus cause the gate heights to be variable. Variable gate height can lead to increased gate capacitance, leakage, and a need for higher drive current. Thus, inputs to the algorithm(s) can be etch process parameters, flowable CVD process parameters, CMP process parameters, oxide metrology outputs, TEMs, and yield results. The algorithms can be used to detect and fix problems with the etch process, flowable CVD process, and CMP process.

[0104] Etch depth can play a big role in the determination of gate height. Etch process can also influence gate sidewall angles, which can have an effect on gate performance and the optical metrology signature. In some embodiments, etch process parameters can either be used as input parameters to the above models to detect problems or control the CMP process, or can be the target for control. The algorithms can control the process, detect process issues, and achieve tighter gate specs. In some embodiments, the etch process parameters can be used as inputs in determining the lithographic tool control. Etch tool process parameters can be used to predict the etch rate or final etch depth, as in the case of virtual metrology. The outputs of the virtual metrology algorithm can then be used as input to the lithographic tool control, for example, as an intermediate step algorithm.

[0105] Algorithms can also be applied to the processing and manufacturing of 3D-NAND, or vertical NAND memory structures. To form vertical NAND (3-D NAND) structures, semiconductor manufacturers use alternating layers of oxide and nitride or oxide and conductor layers. These stacks can be very thick, such as 2 µm high, and are continuing to scale thicker. This results in high stress, delamination, and cracking.

[0106] To address the stress issues, algorithms can use as inputs the process parameters (e.g., gas flows, temperature, process cycle times) of the blanket deposition of these films, as well as the in-situ and inline metrologies (including broadband light metrologies) used to measure these film stacks. Without explicitly having to apply any physical modeling, correlations can be found between yield/inspection/stress tests and the inputs mentioned above to immediately identify problems with the blanket deposition.

[0107] 3-D memory characterization and failure analysis present many challenges, and there is a great need for better characterization. Currently, TEM and x-ray techniques are used, but they have low throughput and may result in material state change. Further, correlating probe failures and inline defect inspection is difficult because many defects are embedded. E-beam inspection is increasingly being used to identify structural defects, but incurs additional cost. In some embodiments, gap fill process parameters are used as inputs to the algorithm(s). E-beam 3D inspection results can also be used as targets for the algorithm.

5. Process Example for Overlay Error

[0108] An overlay process can be performed on one or more training wafers, and the training wafers are then analyzed for actual overlay errors. The most accurate way to measure overlay error is CD-SEM or TEM. The inputs to the training model include all available wafer geometry parameters, such as thickness, diameter, wafer shape variation, in-plane displacement, stress-induced local curvature, wafer thickness and flatness variation, front and back surface nanotopography (NT), wafer edge roll-off (ERO), and sliplines; scanner parameters such as translation (x,y,z), rotation (x,y,z), focus tilt, dose error, focus residual, magnification, asymmetric magnification, and asymmetric rotation; CD measurements such as film thickness, trench depth, metal gate recess, high k recess, side wall angle, resist height, hard mask height, and pitch walking; film property parameters such as refractive index and absorption coefficient (n & k optical constants); and parameters of other overlay measurements such as DBO and IBO (which can also include the intensity values of the diffraction signature along with the DBO measurement itself). The corresponding actual overlay error is the target. The location on the wafer of the actual overlay measurement is matched with the location of all of the input parameters for that site, where applicable. Some process parameters such as temperature, pressure, process duration, etc. and other tool-related parameters are collected on a per-wafer basis and cannot be mapped specifically to a site. Rather, all sites for a given wafer will contain the same values collected for the wafer when site-specific information is not applicable or available. Alternatively, if the spatial resolution of the overlay error measurement is greater than the spatial resolution of a given input parameter (e.g., a 9-site CD measurement on a wafer), then the closest input parameter will be mapped to that actual overlay error measurement. A good technique for doing this is k-means clustering. Other techniques include 3-D interpolation or cubic spline fitting to determine the value of the input parameter.
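Mapping the closest input parameter to an overlay measurement site can be sketched with a plain nearest-neighbor lookup. This is a simpler stand-in for the clustering and interpolation techniques named above, not an implementation of them, and the site coordinates and CD values are hypothetical.

```python
import math

def nearest_site_value(x_mm, y_mm, sites):
    """Return the value measured at the site closest to (x_mm, y_mm).

    sites: list of (x_mm, y_mm, value) tuples for a sparsely sampled parameter.
    """
    return min(sites, key=lambda s: math.hypot(s[0] - x_mm, s[1] - y_mm))[2]

# Hypothetical sparse CD measurements (x, y, CD in nm) on one wafer.
cd_sites = [(0.0, 0.0, 20.1), (50.0, 0.0, 20.4), (-50.0, 0.0, 19.9)]

# Attach a CD input to each overlay measurement location.
overlay_sites = [(40.0, 5.0), (-10.0, 0.0)]
cd_inputs = [nearest_site_value(x, y, cd_sites) for x, y in overlay_sites]
```

Per-wafer parameters such as temperature or process duration need no lookup, since the same value is copied to every site on the wafer.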

[0109] It is generally known that DBO and IBO are not perfect techniques for measuring overlay due to process and geometry influences. For example, FIG. 9 shows the error between DBO and a more-accurate CD-SEM representation of overlay, for 143 measurements. If DBO parameters (such as intensity at each wavelength of the diffraction spectra) are included in the input dataset, along with the DBO predicted measurement, it is possible to correlate the error shown in FIG. 9 to process parameters of the lithography tool.

[0110] One approach specifies the target as the delta between the DBO measurement and CD-SEM measurement. The error associated between DBO and CD-SEM or TEM can be attributed to an input dataset and corrected in production.

[0111] Once the training input data set is organized, it is cleansed. The training input data may have corrupted values, in which case the corrupted values are removed and replaced with blanks or null values. The dataset may also contain inconsistent values for various informational features such as lot or wafer ID. For example, a lot description may appear as "lot_A" in some cases and "lot.A" in other cases. These values will all have to be converted to the same nomenclature, for example "lot.A."
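
The cleansing step can be sketched as follows. This is a minimal illustration only, assuming records are held as Python dictionaries with hypothetical field names `lot` and `cd_nm` (neither name comes from this disclosure). Corrupted numeric values are replaced with `None`, and lot identifiers that differ only in their separator are converted to the "lot.A" form.

```python
import math
import re

def normalize_lot_id(raw):
    """Convert variants such as 'lot_A' or 'lot-A' to the 'lot.A' form."""
    return re.sub(r"[_\-\s]+", ".", raw.strip())

def cleanse_value(raw):
    """Return a float, or None when the value is corrupted or not finite."""
    try:
        value = float(raw)
    except (TypeError, ValueError):
        return None
    return value if math.isfinite(value) else None

def cleanse_record(record):
    """Return a cleansed copy of one record (field names are illustrative)."""
    return {
        "lot": normalize_lot_id(record["lot"]),
        "cd_nm": cleanse_value(record["cd_nm"]),
    }

records = [
    {"lot": "lot_A", "cd_nm": "31.5"},
    {"lot": "lot.A", "cd_nm": "#ERR"},
]
cleaned = [cleanse_record(r) for r in records]
```

After this step both records carry the same lot identifier, and the unreadable CD value is a null rather than a misleading number.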

[0112] The input data is then normalized or transformed. For example, in the case of tool calibration, the data may need to be mean shifted. A z-score can also be calculated from the input data set for different populations or distributions within a given input data set. For example, if a portion of an input is collected for a given tool calibration between time A and time B, then that data is normalized or a z-score is generated for the portion of data. If a different tool calibration is used between time B and time C, then normalization or z-score generation is performed for that portion. The result is a complete dataset that is insensitive to tool calibration. Events other than tool calibration that can generate the need for data transformation are upstream process changes and consumable changes. It is important to note that the same transformation will need to be applied once production commences. Because enough data must be gathered in real-time production to make the transformation, predictions may not be applied until a user-specified amount of data has been collected. However, it may be determined that the transformed data is not an important feature for the model.
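
A per-calibration z-score can be sketched as follows. This is an illustrative example only: the epoch labels "A-B" and "B-C" and the function name `zscore_by_epoch` are assumptions, and the population standard deviation is one reasonable choice among several.

```python
from collections import defaultdict
from statistics import mean, pstdev

def zscore_by_epoch(samples):
    """samples: list of (epoch_id, value). Returns z-scores in input order.

    Each value is standardized against the mean and standard deviation of
    its own calibration epoch, so a step change between epochs is removed.
    """
    groups = defaultdict(list)
    for epoch, value in samples:
        groups[epoch].append(value)
    stats = {e: (mean(v), pstdev(v)) for e, v in groups.items()}
    scores = []
    for epoch, value in samples:
        mu, sigma = stats[epoch]
        # A constant epoch has no spread; report 0.0 rather than divide by zero.
        scores.append(0.0 if sigma == 0 else (value - mu) / sigma)
    return scores

# Epoch "A-B" and epoch "B-C" differ by a calibration offset of 10 units.
samples = [("A-B", 1.0), ("A-B", 2.0), ("A-B", 3.0),
           ("B-C", 11.0), ("B-C", 12.0), ("B-C", 13.0)]
scores = zscore_by_epoch(samples)
```

In this toy data the two epochs produce the same z-scores, which is the sense in which the transformed dataset is insensitive to the calibration offset.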

[0113] The training dataset can be partitioned into training, testing, and validation portions to ensure a robust model is built that is not over-fit or over-biased. A typical partition can be 60% training, 30% testing, and 10% validation. For some models, such as boosted or bootstrap-aggregated models implemented in analytics platforms such as IBM SPSS Modeler, the testing and validation sets need to be separated as the testing dataset is used to further optimize the model while the validation set is completely blind to any model training or optimizing activity. For other types of models, such as standard linear regression, it is acceptable to separate the partitions into training and testing only. It is important to note that techniques such as k-fold cross validation can be employed during the model building phase to ensure the model is not over-fit to any given training set. This involves rotating the training/testing/validation portions of the dataset to ensure that all data sees a training or testing portion.
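
The partitioning described above can be sketched in plain Python. The function names, the fixed random seed, and the interleaved fold assignment are assumptions made for illustration; only the 60/30/10 split and the idea of rotating the held-out portion come from the text.

```python
import random

def partition(records, fractions=(0.6, 0.3, 0.1), seed=0):
    """Shuffle and split records into training, testing, validation lists."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_train = int(round(fractions[0] * len(shuffled)))
    n_test = int(round(fractions[1] * len(shuffled)))
    train = shuffled[:n_train]
    test = shuffled[n_train:n_train + n_test]
    validation = shuffled[n_train + n_test:]  # remainder stays blind
    return train, test, validation

def k_fold_indices(n_records, k):
    """Yield (train_idx, held_out_idx) pairs; each index is held out once."""
    indices = list(range(n_records))
    for fold in range(k):
        held_out = indices[fold::k]
        held = set(held_out)
        train_idx = [i for i in indices if i not in held]
        yield train_idx, held_out

train, test, validation = partition(range(100))
folds = list(k_fold_indices(10, 5))
```

The validation list is never touched by `k_fold_indices` in this sketch; rotation would be applied to the training and testing portions only if the validation set is to stay completely blind.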

[0114] If a given input has a large number of missing or corrupted values, then that input feature may be removed from consideration in training the model. For example, if more than 50% of the data is not present for a given input feature, then that input feature can be thrown out. Alternatively, the missing data fields may be filled in with nominal values, or the records that do not contain values may be completely removed from the training dataset. A determination of which technique to use can be decided based on a human judgment of the importance of a given input feature.
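
The two options for handling missing data can be sketched as follows, assuming rows are dictionaries that share the same keys and that missing values are stored as `None`. The feature names `temp_c` and `film_thickness_nm` and the nominal value are hypothetical.

```python
def drop_sparse_features(rows, max_missing_fraction=0.5):
    """Remove columns whose share of missing (None) values exceeds the limit.

    Returns (filtered_rows, dropped_feature_names).
    """
    if not rows:
        return [], []
    dropped = []
    for name in rows[0]:
        missing = sum(1 for row in rows if row[name] is None)
        if missing / len(rows) > max_missing_fraction:
            dropped.append(name)
    kept = [{k: v for k, v in row.items() if k not in dropped} for row in rows]
    return kept, dropped

def fill_with_nominal(rows, nominal):
    """Alternative: replace missing values with a per-feature nominal value."""
    return [{k: (nominal[k] if v is None else v) for k, v in row.items()}
            for row in rows]

rows = [
    {"temp_c": 21.0, "film_thickness_nm": None},
    {"temp_c": None, "film_thickness_nm": None},
    {"temp_c": 22.0, "film_thickness_nm": 105.0},
    {"temp_c": 21.5, "film_thickness_nm": None},
]
kept, dropped = drop_sparse_features(rows)            # 75% missing is dropped
filled = fill_with_nominal(kept, {"temp_c": 21.5})    # 25% missing is filled
```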

[0115] The dataset may also have to be merged on a given key. The key typically is an x-y coordinate on the wafer or scanner, or could be a die number. As mentioned above, datasets may need to be mapped to a given key (cubic spline, interpolation, or nearest neighbor). The location on the wafer, such as a specific die or its location, is matched with the location of all of the input parameters for that site, where applicable. Some process parameters such as temperature, pressure, process duration, etc. and other tool-related parameters are collected on a per-wafer basis and cannot be mapped specifically to a site. Rather, all sites for a given wafer will contain the same values collected for the wafer when site-specific information is not applicable or available. Alternatively, if the spatial resolution of the die location is greater than the spatial resolution of a given input parameter (e.g., a 9-site CD measurement on a wafer), then the closest input parameter will be mapped to that actual die. A good technique for doing this is k-means clustering. Other techniques include 3-D interpolation or a cubic spline to determine the value of the input parameter.
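
The mapping of sparse measurements onto die locations can be sketched with the nearest-neighbor option named above (not the clustering or spline options). Coordinates, values, and function names are illustrative assumptions.

```python
import math

def map_to_nearest_site(target_sites, measured_sites):
    """For each target (x, y), return the value of the closest measured site.

    target_sites: list of (x, y) tuples, e.g. die centers in mm.
    measured_sites: list of ((x, y), value) tuples, e.g. a 9-site CD map.
    """
    mapped = []
    for tx, ty in target_sites:
        _, value = min(
            measured_sites,
            key=lambda site: math.hypot(site[0][0] - tx, site[0][1] - ty),
        )
        mapped.append(value)
    return mapped

def broadcast_wafer_value(target_sites, wafer_value):
    """Per-wafer parameters (e.g. chamber temperature) repeat at every site."""
    return [wafer_value for _ in target_sites]

# Hypothetical 9-site CD measurement on a 3 x 3 grid (mm from wafer center).
grid = [-100, 0, 100]
positions = [(x, y) for y in grid for x in grid]
cd_sites = [(pos, 30.0 + i) for i, pos in enumerate(positions)]

dies = [(90.0, -95.0), (5.0, 10.0), (-60.0, 70.0)]
cd_per_die = map_to_nearest_site(dies, cd_sites)
temp_per_die = broadcast_wafer_value(dies, 350.0)
```

Each die row can then be merged on its x-y key with one CD value and one copy of the wafer-level value.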

[0116] A training input dataset may contain thousands of input features, and a relevant set of input features may need to be determined. A process for removing irrelevant input features that weakly correlate to overlay error may need to be implemented. As a first step in this process, input features that do not change at all can be removed.
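
The first filtering step, removing inputs that never change, can be sketched as follows. The row layout and the feature names `recipe` and `dose` are hypothetical.

```python
def drop_constant_features(rows):
    """Remove features that take a single value across every record.

    Returns (filtered_rows, constant_feature_names).
    """
    if not rows:
        return [], []
    constant = [name for name in rows[0]
                if len({row[name] for row in rows}) == 1]
    kept = [{k: v for k, v in row.items() if k not in constant} for row in rows]
    return kept, constant

rows = [
    {"recipe": "R1", "dose": 20.1},
    {"recipe": "R1", "dose": 20.4},
    {"recipe": "R1", "dose": 19.9},
]
kept, constant = drop_constant_features(rows)
```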

[0117] There are also a number of approaches to feature selection. One approach is implementing random forests which identify which input features are most relevant to predicting overlay error. Another technique is the CHAID decision tree, which will also identify features that are important. Linear regression is another technique. ANOVA is another technique.

[0118] Alternatively, dimensionality reduction can also be employed. Common dimensionality reduction techniques include partial least squares and principal component analysis, which will create a new smaller set of input parameters based on the large set of initial input parameters. For example, an input set of 5000 features can be reduced to an input set of 30 newly-generated principal components that can explain a significant portion of the variance in the data. The outcome or output of the dimensionality reduction step can be used as new inputs to the model. For example, the principal components generated by PCA can be inputs to the model. The principal components will represent a reduced set of inputs from a larger set of inputs.

[0119] From the original input data, a set of virtual metrology models may be constructed. The purpose of a virtual metrology model is to predict a key metric in the semiconductor fabrication process. For example, an etch depth may be predicted given certain upstream variables such as etch tool process parameters, previous step thickness and process variables such as deposition tool process parameters, CMP process parameters, and optical n and k values of the film. In some embodiments, the etch process parameters can be used as inputs in determining the lithography tool control. Etch tool process parameters can be used to predict the etch rate or final etch depth (as in the case of virtual metrology). The outputs of the virtual metrology algorithm can then be used as inputs to the lithography tool control as an intermediate step algorithm. The output of the intermediate step algorithm (or virtual metrology algorithm) can be used as an input variable for the determination of overlay error.

[0120] Certain parameters in the models are important in determining the best model, of which certain variations can be tried. The best combination of model parameters that gives the least error between predicted and actual overlay error is chosen. For example, the minimum number of records allowed in a decision tree leaf can be set, or the number of weak learners employed in a random forest algorithm or GBM model, or the number of input features for each weak learner in a random forest algorithm.
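
Trying combinations of model parameters and keeping the one with the least error can be sketched as an exhaustive grid search. The parameter names and the `toy_error` function are stand-ins invented for this example; in practice `evaluate` would train a model with the given parameters and return its predicted-versus-actual overlay error on held-out data.

```python
from itertools import product

def grid_search(param_grid, evaluate):
    """Try every parameter combination and keep the one with least error.

    param_grid: dict mapping parameter name to a list of candidate values.
    evaluate: callable taking a dict of parameters and returning an error.
    """
    best_params, best_error = None, float("inf")
    names = sorted(param_grid)
    for values in product(*(param_grid[n] for n in names)):
        params = dict(zip(names, values))
        error = evaluate(params)
        if error < best_error:
            best_params, best_error = params, error
    return best_params, best_error

# Hypothetical parameter names and a toy error surface, for illustration only.
param_grid = {
    "min_leaf_records": [5, 20, 50],
    "n_weak_learners": [100, 200, 400],
}

def toy_error(params):
    return (abs(params["min_leaf_records"] - 20)
            + abs(params["n_weak_learners"] - 200) / 100)

best_params, best_error = grid_search(param_grid, toy_error)
```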

[0121] The candidate model predicts the overlay errors and compares them with the actual overlay errors on the validation wafers. If the prediction accuracy satisfies certain thresholds based on the overlay budget and other considerations, the candidate model is considered to be valid and ready to be deployed to predict overlay errors on other production wafers which share similar processing conditions with the training and validation wafers.

[0122] Once a model or multi-step model and associated parameters are chosen, the model is first implemented in production in a "listening mode" where overlay error predictions are made as wafers run through production. The predicted overlay error can be compared to actual overlay error. If the predicted error is found to be within a user-defined threshold or overlay error budget, then the production is allowed to continue to run and more data is collected.

[0123] If instead the model is not predicting within the defined limits as compared to actual measured overlay error, then all data collected up to that point is used to retrain the model as outlined in the above steps. If the model now predicts a result within the user-defined thresholds after being re-trained, the model is then re-deployed in listening mode in production. If the model performs within the specified error limit (predicted minus actual overlay) for a user-specified period of time (for example, 8 weeks of production), then the model is allowed to replace some of the actual overlay measurements used in actual production. Over time, if the model continues to perform well, more and more product will rely on the predicted overlay, until the overlay prediction is used on all production.
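
The listening-mode decision can be sketched as a small function. The action labels, the argument names, and the use of an absolute-error check are assumptions made for illustration; the 8-week default mirrors the example period given above.

```python
def next_action(errors_nm, budget_nm, weeks_in_spec, required_weeks=8):
    """Decide what to do with a model running in listening mode.

    errors_nm: predicted-minus-actual overlay for recently measured wafers.
    budget_nm: user-defined threshold or overlay error budget.
    weeks_in_spec: consecutive weeks the model has stayed within budget.
    """
    if any(abs(e) > budget_nm for e in errors_nm):
        return "retrain"
    if weeks_in_spec >= required_weeks:
        return "replace_some_measurements"
    return "keep_listening"

status_early = next_action([0.4, -0.7, 0.2], budget_nm=1.0, weeks_in_spec=3)
status_ready = next_action([0.4, -0.7, 0.2], budget_nm=1.0, weeks_in_spec=8)
status_drift = next_action([0.4, -1.6, 0.2], budget_nm=1.0, weeks_in_spec=8)
```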

[0124] The model will continue to be re-trained at user-defined intervals (for example, once a week) as new data is made available. To retrain the model, the entire dataset available may be used. It may also be beneficial to use only the latest data available for a period of time to train the model, for example the last 3 months only, and to discard very old data as it becomes obsolete when the process undergoes significant shifts. It may also be beneficial to retain for model training older data that defines the extremes of the input and target variance, and discard older redundant data to maintain model training efficiency or save memory space. It may be beneficial to continue to monitor the performance of the predicted overlay, even after full production release, by continuing to compare to actual overlay measurements. If it is found that the error between predicted and actual overlay falls out of tolerance, then predictions will not be deployed for a period of time until it is determined why the predictions fell out of tolerance and the model is retrained and gradually released back into production.

[0125] Once a candidate model is determined, one or more validation wafers are selected from the production wafers, and patterned wafer geometry parameters are obtained for the validation wafers using a patterned wafer geometry metrology tool. An overlay process is performed on the one or more validation wafers and the one or more validation wafers are analyzed for actual overlay errors. The candidate model predicts the overlay errors and compares them with the actual overlay errors on the validation wafers. If the prediction accuracy satisfies certain thresholds based on the overlay budget and other considerations, the candidate model is considered to be valid and ready to be deployed to predict overlay errors on other production wafers which share similar processing conditions with the training and validation wafers.

[0126] Once the candidate model is validated, the remaining production wafers are scanned with a patterned wafer geometry metrology tool to determine wafer geometry parameters. Based on the wafer geometry parameters and the deployed predictive model, the system predicts an overlay error for the remaining production wafers and adjusts the lithography scanner to correct for the predicted overlay error. Point-to-point prediction is crucial for feeding forward the predicted overlay, applying the adjustment, and hence reducing the actual overlay error after the exposure.

6. Yield Prediction

[0127] Predicting yield is generally important in the manufacture of semiconductor devices, and even more so as the fabrication of semiconductor devices becomes increasingly expensive. A yield prediction can be made at different steps in the process.

[0128] If yield can be accurately predicted at any stage of the manufacturing process, then it becomes possible to optimize and save costs in later processes. For example, if a device can be predicted to be bad before wafer sort and test, then further testing and processing of that device can be avoided thus saving further processing costs. Typically, there are hundreds of steps in a semiconductor manufacturing process. The process for fabrication of wafers can take 2 to 3 months before moving on to the post-fabrication stages, which usually include wafer test and sort, assembly/packaging, final testing, and burn-in. At each of these steps, a predicted yield can be calculated. The fabrication yield can be measured as the ratio of good wafers that make it through the wafer fabrication process to all wafers that entered the given process. The wafer test yield can be calculated as the ratio of non-defective chips determined at wafer test to all chips that entered into wafer test. The assembly and packaging yields are calculated in a similar manner, i.e. the ratio of good chips out to the total chips into those respective processes.

[0129] Existing techniques for yield prediction have been based primarily on a univariate analysis. For example, Markov chains predict whether a chip results in positive yields given the number of defects. However, multivariate analysis has become more popular as the amount of test data has become very large. A common technique employed for multivariate analysis is discriminant analysis, but this technique assumes that the data is normally distributed and independent, which is not always the case.

[0130] Further compounding the need for multivariate analysis is the fact that the amount of data that is accessible in the semiconductor manufacturing process continues to grow. However, the use of machine learning algorithms, data mining, and predictive analytics makes the handling of large data sets manageable. Furthermore, confidence and propensity metrics associated with many machine learning algorithms can be used to optimize wafer sort/testing, final tests, and burn-in activities.

[0131] For semiconductor manufacturing, the measure of defective parts per million (DPPM) is evaluated when testing the outgoing packaged chips. In a typical situation, functional/structural test patterns are used at wafer sort and also after the parts (or products) are packaged to determine which products/die are faulty. Functional system level testing then follows. The expense of testing at each subsequent stage can be significantly higher than at the previous stage. Usually, packaged products are tested in burn-in chambers and on load boards, using either the same structural patterns used at wafer sort or with functional test patterns. The cost of such testing has increased significantly over the past several years as design complexity has increased.

[0132] A typical business model for manufacturing microchips is the foundry/fabless model, where wafers are fabricated at a foundry and then passed off to the fabless design house or packaging partner for subsequent processing and testing. The term "known good die" (KGD) refers to die at or before wafer sort/test which have been tested to the same quality and reliability levels as their packaged counterparts. If a die passes at the wafer sort/test phase but is found to be faulty at some point after wafer sort, then the design house or packaging house can incur the cost of any steps taken in manufacturing the product after wafer sort. In one business model, dies from the foundry that pass wafer sort are bought by the fabless design house. If the die are found to be faulty after packaging, then the design house pays for those die. This can get very expensive for dies that go into stacked ICs or multi-chip modules, as all dies in the packaged chip would have to be scrapped if only one of the die were found to be bad.

[0133] Thus, it has become very important to know at the earliest stage possible if a die will be functional after it is packaged. If post-package yield can be more accurately predicted at wafer sort, or at various stages of final test, or pre burn-in, it can significantly reduce the costs incurred by whichever entity owns the faulty product post-packaging. Also, prediction and confidence metrics can be determined and can be used to optimize burn-in times, which can result in significant cost savings.

[0134] In general, yield prediction for a product refers to the prediction of the quality or usability of the product. In one embodiment, yield prediction can be one of two values, namely, either "pass" or "fail" (or "good" or "bad" or "usable" or "not usable"). For example, if the yield prediction for a product is "pass" at a given manufacturing step, then that product is predicted to be usable as of that manufacturing step and should continue processing. If the yield prediction is "fail," then that product is predicted to be faulty or not usable as of that manufacturing step and is not recommended for continued processing. The yield prediction is thus useful in determining if it is cost effective to continue processing of a product. In some embodiments, the yield prediction is a component in deciding whether or not to continue processing of the product. The yield prediction is not necessarily the only variable in making a decision about whether or not to continue processing of a product.

[0135] This disclosure describes novel techniques for predicting yield before, during and after wafer sort. These yield predictions can be used to reduce costs by more accurately predicting yield at wafer sort, final test, burn-in, and other post-wafer sort testing. Yield predictions and their associated confidence metrics can also be used to make decisions about which tests to perform after wafer sort. Yield predictions can also be used to optimize and reduce burn-in time.

[0136] In one embodiment, yield prediction can be the prediction or outcome of a classification system or algorithm. The classification system or algorithm can determine if the product will be functional or non-functional after all manufacturing steps are complete, given an input dataset to the algorithm. For example, if the classification system or algorithm predicts the product will be functional, then it can be said that the yield prediction is positive, or that the product will yield. For example, a "0" may be assigned to indicate a passing/functional product, while a "1" may be assigned for a failing/nonfunctional product.

[0137] As discussed with regard to overlay error, the classification system or algorithm used to make a yield prediction can also provide a confidence or propensity metric along with a pass or fail classification, given the input data to the algorithm. The confidence or propensity metric can be a value in a defined range or an undefined range. In a typical situation, the value can be a real number between 0 and 1. In this example, if the value is close to 0, then the confidence is low. If the value is close to 1, then the confidence is high.

[0138] A threshold can be set for the confidence value to bin the confidence value as high or low. For example, if the confidence metric varies between 0 and 1, and the threshold is set at 0.5, then confidence values above 0.5 will be deemed as high confidence, while values below 0.5 will be deemed to be low confidence.

[0139] The confidence or propensity metric may be used in conjunction with the pass or fail classification to make the final yield prediction, as illustrated in FIG. 10. Data is input to the classification algorithm in step 1002. If in step 1004 the classification algorithm predicts that the product will pass, and the confidence metric is high for the classification prediction in step 1006, then the yield prediction in step 1008 is said to be positive meaning there is a high confidence that the product will pass.

[0140] However, if the classification algorithm predicts in step 1004 that the product will pass, but the confidence value is low in step 1006, then the yield prediction in step 1010 is negative so as to not produce any false positive outcomes. In some situations, a false positive of this nature is very undesirable, as products that are actually faulty but predicted to be good can be very costly for the manufacturer.
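
The combination of classification and confidence described for FIG. 10 can be sketched as follows. The function name and string labels are illustrative, the 0.5 default is the example threshold given earlier, and treating a predicted fail as a negative yield prediction is an assumption not spelled out in the description of the figure.

```python
def yield_prediction(predicted_pass, confidence, threshold=0.5):
    """Combine classification and confidence into a final yield prediction.

    Returns "positive" only when the classifier predicts a pass with a
    confidence above the threshold; otherwise "negative".
    """
    if predicted_pass and confidence > threshold:
        return "positive"
    # Low-confidence passes are treated as negative to avoid shipping
    # products that were predicted good but are actually faulty.
    return "negative"

high_conf_pass = yield_prediction(True, 0.9)
low_conf_pass = yield_prediction(True, 0.3)
predicted_fail = yield_prediction(False, 0.9)
```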

[0141] Similar to the discussion of predicting overlay error above, a yield prediction can be made by implementing machine learning, predictive analytics, and data mining algorithms (all of which will be referred to as algorithms). The types of input data identified in the overlay sections are also relevant to predicting yield and evaluation of other targets. Further, the techniques and examples described in the overlay sections above are incorporated by reference here as well since they are also relevant to predicting yield or evaluating other targets. Thus, the techniques described for identifying input data, collecting input data, transforming the input data, training and re-training the model, and deploying the model, are applicable to yield prediction and evaluation of other targets. FIG. 11 illustrates a method 1100 for creating and deploying a model to evaluate a semiconductor manufacturing process in order to predict yield. In step 1102, a target is selected. In one embodiment, the target is total yield for the entire manufacturing process. In another embodiment, the target is yield for an individual process step. The target could be yield for an individual die on a wafer, or the entire wafer. The target could also be the yield of a packaged chip or product at final test, before burn-in, or a packaged chip or product at final test, after burn-in.

[0142] In step 1104, the parameters that are useful in evaluating yield are identified, and in step 1106, input data relevant to the parameters is collected. Every set of input data is associated with a specific output or target. For example, a set of measured and observed values is associated with actual yield values, and those values are provided as an input vector to the model.

[0143] In general, the input data to the algorithm can be input data from any or all processes performed during wafer fabrication. Wafer level data from the semiconductor fabrication processes and metrology that are collected before wafer sort and test can be used as part or all of the total inputs to the algorithm. These input parameters can include metrology measurements from process steps or metrology measurements collected during the wafer fabrication process. These measurements can include optical reflectometry or ellipsometry data, and the intensity of each measurement at a given wavelength. The metrology data can be incorporated from a reflectometry measurement taken after a certain processing step (for example, CMP or Etch, or Gap Fill processes). The metrology measurements can also be produced by non-optical measurements, such as Rs (conductivity, resistivity) measurements taken by probes and other types of contact measurements, or contact measurements such as the HRP or high resolution profiler.

[0144] In some embodiments, part or all of the input data can be from the output of wafer probe tests, or measurements made by wafer probe tests. Additionally, data from final wafer electrical testing, wafer sort tests, and wafer acceptance tests can be used as input data. Examples of final wafer electrical testing parameters include, but are not limited to, diode characteristics, drive current characteristics, gate oxide parameters, leakage current parameters, metal layer characteristics, resistor characteristics, via characteristics, etc. Examples of wafer sort parameters include, but are not limited to, clock search characteristics, diode characteristics, scan logic voltage, static IDD, IDDQ, VDD min, power supply open short characteristics, ring oscillator frequency, etc.

[0145] The input data can come from a final test. The input data can come from tests that occur multiple times under different electrical and temperature conditions, and before and after device reliability stresses, such as burn-in, or tests that occur at a burn-in step. The input data can come from electrical tests that are a mix of functional, structural and system-level tests.

[0146] The test outputs which can serve as inputs to the yield prediction system can be of binary type (pass/fail) or can be analog, or a real number that can be bounded or unbounded. The analog output can be a voltage reading, or a current reading.

[0147] In step 1108, the input data undergoes filtering, normalization and/or cleansing steps. In step 1110, dimensionality reduction or feature selection is performed to reduce the number of input parameters to be processed by the algorithm.

[0148] In step 1112, the data is then fed into one or more algorithms for training. Given the training input data and training targets, the algorithm(s) will produce a model in step 1114, which can be deployed in step 1116 to act on real time data.

[0149] In one embodiment, the status of the manufactured product can be the result of a function that weights the results of final tests, the confidence metric of the yield prediction system, and the classification of the yield prediction system, as illustrated in FIG. 13. If the status prediction is above a specified threshold, then the part can be determined to be good, or usable.

[0150] In an embodiment, the algorithm utilizes calculated propensity from an upstream test which contains more failures to determine the failure rate of the final test, which may contain far fewer failures. For example, at the end of an upstream testing process, the failure rate may be higher, which would make it easier to produce a model that gives more accurate predictions (e.g., a CHAID decision tree). A model can be built to determine the failure rate of this upstream process, and produce a pass/fail prediction along with a confidence and propensity metric. The failure prediction, confidence and propensity metric can then be used as inputs to predict the failure of a test further downstream. This may be particularly useful when the test downstream has a lower number of failures, making it more difficult to build an accurate model.

[0151] In some embodiments, a data processing step for a classification model may include oversampling. For example, if there are 100 failed chips and 10,000 passed chips in the training dataset, oversampling would mean replicating the rows of failed chips 100 times so that there are now 10,000 rows of failed chips. This balanced set is then fed into the model. Alternatively, undersampling would mean (randomly) selecting 100 passed chips and feeding that into the model, along with the other 100 failed chips to create a balanced training set. This can be an important step in creating a decision tree.
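
The two balancing options can be sketched with the same counts used in the example above (100 failed, 10,000 passed). The row layout, the label convention (1 for a failed chip), and the fixed random seed are assumptions; the sketch also assumes the failed list is non-empty and no larger than the passed list.

```python
import random

def oversample(failed, passed):
    """Replicate minority (failed) rows until they match the majority count."""
    copies, remainder = divmod(len(passed), len(failed))
    return failed * copies + failed[:remainder] + passed

def undersample(failed, passed, seed=0):
    """Randomly keep as many passed rows as there are failed rows."""
    rng = random.Random(seed)
    return failed + rng.sample(passed, len(failed))

failed = [{"die": i, "label": 1} for i in range(100)]
passed = [{"die": i, "label": 0} for i in range(100, 10_100)]

balanced_over = oversample(failed, passed)    # 10,000 failed + 10,000 passed
balanced_under = undersample(failed, passed)  # 100 failed + 100 passed
```

Oversampling keeps every passed chip at the cost of duplicated failure rows, while undersampling keeps the set small but discards most of the passed chips.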

[0152] In some embodiments, limits are set on how small the leaf nodes of the decision tree can be so as not to result in an over-biased or over-fit model to the training dataset.

[0153] In some embodiments, the model is trained on a portion of the data. It is then tested on a different portion of the data that is blind to the training phase. K-fold cross validation can also be applied to determine the robustness of the model. In the case of boosted or bagged algorithms, a training, testing, and validation dataset can be partitioned, where the validation set is completely blind while the testing set is used to optimize the model.

[0154] The following is an example of a yield prediction algorithm. The input data is cleansed, transformed, and organized as previously described. The input data can be associated with each die, or mapped to a particular die by using the techniques described above. The input dataset can contain a set of die manufactured throughout the manufacturing process with associated input data for each die. Along with each die can be the associated health of the die, i.e., pass or fail. Typically, most of the die will pass but some of the die will be determined to fail after the final testing step. Throughout the final testing process, the die will undergo various tests and reliability stresses (e.g., burn-in), and some of the die will incrementally fail and be removed. The model is a type of classification model that uses the die's health (pass/fail) as a target. The issue with training a model around the die health (pass/fail) at the final stage of the process is that the number of failures is usually very low by this stage. For example, the number of failures after final testing may be only 100 out of 1,000,000. Most classification models will not be able to predict failure accurately with such a low number of failures in the dataset used to train the model. To mitigate this issue, an intermediary model is trained around an earlier upstream test that will have more failures. Balancing techniques such as oversampling are still applied to the dataset since the number of failures will be relatively low as compared to number of passes, for example 10,000 failures out of 1,000,000. From this intermediary model, a propensity metric is generated for all remaining passed die which will continue to undergo subsequent processing. By the time the die reaches final test, the propensity score from the earlier intermediary model is used as an additional input to train the final failure prediction model. 
The dataset is again balanced (e.g., oversampling) to ensure the number of failures will equal the number of passes in the model training set. The overall accuracy of the model can improve if the propensity of the upstream model is also used as input. Training, testing, validation, and cross validation techniques are applied to determine the best model. Various models are tried using the techniques described earlier. The model that gives the least number of false positives and/or false negatives (depending on which metric is of most importance to the user) will be the model that is selected. Typically, the user will be interested in minimizing false negatives (i.e., predicting a die will pass but in actuality it fails), since this will mean it may be erroneously routed for less stringent testing or burn-in, resulting in a sub-standard die being shipped to a customer, thus increasing risk of field failure.

7. Testing and Burn-in Optimization

[0155] The yield prediction system can be used to calculate and optimize burn-in time. The burn-in time calculation can be a function of the yield prediction or classification produced by the yield prediction system, the confidence or propensity metric computed by the yield prediction system, and/or actual final test results, as illustrated in FIG. 14. As an example, if the yield is predicted to be positive by the yield prediction system, and the confidence metric calculated by the yield prediction system is a relatively high value, then the burn-in time can be calculated to be lower than average, or completely eliminated. In another example, if the product is predicted to be good by the yield prediction system, and the confidence metric is calculated by the yield prediction system to be low, then the burn-in time may be calculated to be higher than average. In another example, if the product is predicted to be bad by the yield prediction system, then the burn-in time can be set to a maximum value.
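
The burn-in time calculation can be sketched as a simple rule-based function. Every numeric value here (nominal and maximum hours, the 0.5 and 0.95 confidence cut-offs, and the 0.5x and 1.5x scaling) is a hypothetical placeholder, and actual final test results are left out for brevity.

```python
def burn_in_hours(predicted_pass, confidence, nominal=24.0, maximum=48.0,
                  threshold=0.5, skip_confidence=0.95):
    """Return a burn-in duration from the yield prediction and confidence.

    All default values are illustrative placeholders, not recommended settings.
    """
    if not predicted_pass:
        return maximum            # predicted bad: longest burn-in
    if confidence >= skip_confidence:
        return 0.0                # very confident pass: burn-in eliminated
    if confidence > threshold:
        return nominal * 0.5      # confident pass: shorter than average
    return nominal * 1.5          # uncertain pass: longer than average

hours_bad = burn_in_hours(False, 0.9)
hours_sure = burn_in_hours(True, 0.99)
hours_good = burn_in_hours(True, 0.7)
hours_unsure = burn_in_hours(True, 0.2)
```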

[0156] The yield prediction can also be used to optimize final testing. For example, if the product is predicted to be good with a high confidence value, then certain expensive tests can be skipped. In another example, if the yield prediction is good but the confidence value is low, then more exhaustive testing can be implemented than in the case where the yield prediction is good and confidence is high. In yet another example, if the product is predicted to be bad, a decision can be made to do the most rigorous amount of testing, or the decision can be made to forgo further testing and processing, and scrap the product.

8. Other Applications

[0157] As discussed herein, predictive analytics can be used to discover the relationships between the various process steps, parametrics, and product performance, which can then be leveraged to predict and improve product performance. By incorporating the advantages of machine learning and parallel processing, predictive analytics can find complex correlations among the input data that have been difficult to uncover using other techniques. Thus, in addition to predicting yield and correcting for overlay errors and CD variations, as discussed above, predictive analytics can be used in many ways in the semiconductor manufacturing process to improve performance, quality, and yield, and to reduce costs. Algorithms can be used to optimize some or all of the processes in semiconductor manufacturing.

[0158] FIG. 15 illustrates several additional applications 1302 for the techniques described herein, including yield prediction/improvement; run-to-run control; wafer-to-wafer control; real-time and in-situ control; virtual metrology; fault prediction and classification; factory-wide control; and predictive maintenance, among others. With regard to yield, the techniques disclosed herein can predict yield, or identify the root cause of yield detractors, or link parametric faults to inline process data, as shown in box 1304, among others. With regard to virtual metrology, the techniques disclosed herein can predict specific process metrics using metrology equipment data, process equipment data, and upstream data, as shown in box 1306. With regard to fault prediction and classification, the techniques disclosed herein can classify or detect faults on process equipment using process equipment data and in-situ metrology, as shown in box 1308. With regard to factory-wide control, the techniques disclosed herein can discover relationships hidden in the process data, as shown in box 1310. With regard to predictive maintenance, the techniques disclosed herein can identify root causes for different types of defects, and predict future defects using inline process data, as shown in box 1312.

[0159] In some embodiments, virtual metrology can use algorithms to predict metrology metrics such as film thickness and critical dimensions (CD) without having to take actual measurements in real time. This can have a significant impact on throughput and also lessen the need for expensive TEM or SEM cross-section measurements. Using sensor data from production equipment and actual metrology values of sampled wafers to train the algorithm, virtual metrology can predict metrology values for all wafers. The algorithm can be a supervised learning algorithm, where a model can be trained using a set of input data and measured targets. The targets can be the critical dimensions that are to be controlled. The input data can be upstream metrology measurements, or data from process equipment (such as temperatures and run times).
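
As a minimal sketch of the supervised approach described above, the following fits a one-variable least-squares line to sampled wafers and then predicts the CD of an unmeasured wafer. Ordinary least squares is used here only as a simple stand-in for whatever supervised learning algorithm is chosen in practice, and the single input (a run time in seconds), the function names, and the numeric values are made-up assumptions for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need at least two paired samples")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("input values must not all be identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def predict_cd(model, run_time_s):
    """Predict a critical dimension (nm) from a run time using the fitted line."""
    slope, intercept = model
    return slope * run_time_s + intercept


# Hypothetical training data: only these sampled wafers were actually measured.
sampled_run_times_s = [58.0, 60.0, 62.0, 64.0]
sampled_cd_nm = [46.0, 45.0, 44.0, 43.0]

model = fit_line(sampled_run_times_s, sampled_cd_nm)
virtual_cd_nm = predict_cd(model, 61.0)  # wafer with no physical measurement
```

A production model would typically take many inputs (upstream metrology, temperatures, run times) rather than one, but the train-on-sampled, predict-for-all structure is the same.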

[0160] Identifying a machine fault or failure, and finding the root cause of faults quickly, can be essential in semiconductor manufacturing. If faults in the manufacturing process can be better detected and resolved, downtime and scrap can be reduced. This is also referred to as fault detection and classification (FDC). If faults can be predicted before they occur, then downtime can be optimally scheduled and scrap can be even further reduced. Thus, algorithms can be used to predict when a fault or defect will occur in the manufacturing process or on a specific tool at a process step.

[0161] In some embodiments of the invention, algorithms can be used to determine when maintenance needs to be performed on manufacturing equipment. This is referred to as predictive maintenance in the semiconductor manufacturing process.

9. Conclusion

[0162] While the foregoing written description of the invention enables one of ordinary skill to make and use what is considered presently to be the best mode thereof, those of ordinary skill will understand and appreciate the existence of variations, combinations, and equivalents of the specific embodiment, method, and examples herein. The invention should therefore not be limited by the above described embodiments, methods, and examples.

* * * * *