Microprocessor Pipeline Circuitry To Support Cryptographic Computing Kounavis; Michael E. ; et al. [Intel Corporation]

Patent Application Summary

U.S. patent application number 17/576533 was filed with the patent office on 2022-01-14 and published on 2022-05-05 for microprocessor pipeline circuitry to support cryptographic computing. This patent application is currently assigned to Intel Corporation. The applicant listed for this patent is Intel Corporation. Invention is credited to Sergej Deutsch, David M. Durham, Santosh Ghosh, Michael E. Kounavis, Michael D. LeMay, Stanislav Shwartsman.

Publication Number	20220138329
Application Number	17/576533
Filed Date	2022-01-14

United States Patent Application	20220138329
Kind Code	A1
Kounavis; Michael E. ; et al.	May 5, 2022

MICROPROCESSOR PIPELINE CIRCUITRY TO SUPPORT CRYPTOGRAPHIC COMPUTING

Abstract

In one embodiment, a processor of a cryptographic computing system includes a register to store an encryption key and address generation circuitry to obtain a pointer representing a linear address to be accessed by a read or write operation, the pointer being at least partially encrypted, obtain the key from the register and a context value, decrypt the encrypted portion of the pointer using the key and the context value as a tweak input, and generate an effective address for use in the read or write operation based on an output of the decryption.

Inventors:

Kounavis; Michael E.; (Portland, OR) ; Ghosh; Santosh; (Hillsboro, OR) ; Deutsch; Sergej; (Hillsboro, OR) ; LeMay; Michael D.; (Hillsboro, OR) ; Durham; David M.; (Beaverton, OR) ; Shwartsman; Stanislav; (Haifa, IL)

Applicant:

Name	City	State	Country	Type
Intel Corporation	Santa Clara	CA	US

Assignee:

Intel Corporation
Santa Clara
CA

Appl. No.:

17/576533

Filed:

January 14, 2022

Related U.S. Patent Documents


Application Number	Filing Date	Patent Number
16724105	Dec 20, 2019	11281782
17576533
62868884	Jun 29, 2019

International Class:

G06F 21/60 20060101 G06F021/60; G06F 12/0897 20060101 G06F012/0897; G06F 9/30 20060101 G06F009/30; G06F 9/48 20060101 G06F009/48; G06F 21/72 20060101 G06F021/72; H04L 9/06 20060101 H04L009/06; G06F 12/06 20060101 G06F012/06; G06F 12/0875 20060101 G06F012/0875; G06F 21/79 20060101 G06F021/79; G06F 9/455 20060101 G06F009/455; G06F 12/0811 20060101 G06F012/0811; G06F 21/12 20060101 G06F021/12; H04L 9/08 20060101 H04L009/08; G06F 12/14 20060101 G06F012/14; G06F 9/32 20060101 G06F009/32; G06F 9/50 20060101 G06F009/50; G06F 12/02 20060101 G06F012/02; H04L 9/14 20060101 H04L009/14; G06F 21/62 20060101 G06F021/62

Claims

1. A processor comprising: a register to store an encryption key; and address generation circuitry to: obtain a pointer representing a linear address to be accessed by a read or write operation, the pointer being at least partially encrypted; obtain the key from the register and a context value; decrypt the encrypted portion of the pointer using the key and the context value as a tweak input; and generate an effective address for use in the read or write operation based on an output of the decryption.

2. The processor of claim 1, wherein the context value is obtained from another register of the processor.

3. The processor of claim 1, wherein the context value is obtained from bits of the pointer.

4. The processor of claim 1, wherein the context value is obtained from memory.

5. The processor of claim 1, wherein the pointer comprises an encrypted base address, plaintext upper bits, and a plaintext offset, and the address generation circuitry is to generate the effective address by: decrypting the encrypted base address portion to yield a decrypted base address; and combining the decrypted base address, the upper bits, and the offset.

6. The processor of claim 5, wherein the address generation circuitry is to generate the effective address by: concatenating the decrypted base address with a set of complementary upper bits and the offset to yield an intermediate base address; and combining the upper bits with the intermediate base address.

7. The processor of claim 6, wherein the address generation circuitry is to combine the upper bits with the intermediate base address using one or more of an XOR, ADD, or logical AND function.

8. The processor of claim 1, further comprising: a data cache unit storing encrypted data; and memory access circuitry to: access the encrypted data stored in the data cache unit; and decrypt the encrypted data based on the key and the effective address.

9. The processor of claim 8, wherein the effective address is used as a tweak input to the decryption.

10. The processor of claim 8, wherein the circuitry is to decrypt the encrypted data by: generating a key stream based on the effective address and a counter value; and performing an XOR operation on the key stream and the encrypted data to yield decrypted data.

11. A method comprising: obtaining a pointer representing a linear address to be accessed by a read or write operation, the pointer being at least partially encrypted; obtaining the key from a processor register and a context value; decrypting the encrypted portion of the pointer using the key and the context value as a tweak input; and generating an effective address for use in the read or write operation based on an output of the decryption.

12. The method of claim 11, wherein the context value is obtained from another processor register.

13. The method of claim 11, wherein the context value is obtained from bits of the pointer.

14. The method of claim 11, wherein the context value is obtained from memory.

15. The method of claim 11, wherein the pointer comprises an encrypted base address, plaintext upper bits, and a plaintext offset, and generating the effective address comprises: decrypting the encrypted base address portion to yield a decrypted base address; and combining the decrypted base address, the upper bits, and the offset.

16. The method of claim 15, wherein generating the effective address comprises: concatenating the decrypted base address with a set of complementary upper bits and the offset to yield an intermediate base address; and combining the upper bits with the intermediate base address.

17. The method of claim 16, wherein combining the upper bits with the intermediate base address comprises using one or more of an XOR, ADD, or logical AND function.

18. The method of claim 11, further comprising: accessing encrypted data stored in a data cache unit; and decrypting the encrypted data based on the key and the effective address.

19. The method of claim 18, wherein the effective address is used as a tweak input to the decryption.

20. The method of claim 18, wherein decrypting the encrypted data comprises: generating a key stream based on the effective address and a counter value; and performing an XOR operation on the key stream and the encrypted data to yield decrypted data.

21. A system comprising: memory; and a processor coupled to the memory, the processor comprising: a register to store an encryption key; and address generation circuitry to: obtain a pointer representing a linear address to be accessed by a read or write instruction stored in the memory, the pointer being at least partially encrypted; obtain the key from the register and a context value; decrypt the encrypted portion of the pointer using the key and the context value as a tweak input; and generate an effective address for use in the read or write operation based on an output of the decryption.

22. The system of claim 21, wherein the pointer comprises an encrypted base address, plaintext upper bits, and a plaintext offset, and the address generation circuitry is to generate the effective address by: decrypting the encrypted base address portion to yield a decrypted base address; and combining the decrypted base address, the upper bits, and the offset.

23. The system of claim 22, wherein the address generation circuitry is to generate the effective address by: concatenating the decrypted base address with a set of complementary upper bits and the offset to yield an intermediate base address; and combining the upper bits with the intermediate base address.

24. The system of claim 23, wherein the address generation circuitry is to combine the upper bits with the intermediate base address using one or more of an XOR, ADD, or logical AND function.

25. The system of claim 21, wherein the processor further comprises memory access circuitry to: access encrypted data stored in the memory; and decrypt the encrypted data based on the key and the effective address.

Description

CROSS-REFERENCE TO RELATED APPLICATIONS

[0001] This Application is a continuation (and claims the benefit of priority under 35 U.S.C. § 120) of U.S. application Ser. No. 16/724,105 filed on Dec. 20, 2019, entitled MICROPROCESSOR PIPELINE CIRCUITRY TO SUPPORT CRYPTOGRAPHIC COMPUTING, which application claims the benefit of and priority from U.S. Provisional Patent Application Ser. No. 62/868,884 entitled "Cryptographic Computing" and filed Jun. 29, 2019. The disclosures of the prior applications are each incorporated herein by reference.

TECHNICAL FIELD

[0002] This disclosure relates in general to the field of computer systems and, more particularly, to microprocessor pipeline circuitry to support cryptographic computing.

BACKGROUND

[0003] Cryptographic computing may refer to solutions for computer system security that employ cryptographic mechanisms inside processor components. Some cryptographic computing systems may involve the encryption and decryption of pointers, keys and data in a processor core using new encrypted memory access instructions.

BRIEF DESCRIPTION OF THE DRAWINGS

[0004] To provide a more complete understanding of the present disclosure and features and advantages thereof, reference is made to the following description, taken in conjunction with the accompanying figures, where like reference numerals represent like parts, in which:

[0005] FIG. 1 is a flow diagram of an example process of scheduling microoperations.

[0006] FIG. 2 is a diagram of an example process of scheduling microoperations based on cryptographic-based instructions.

[0007] FIG. 3 is a diagram of another example process of scheduling microoperations based on cryptographic-based instructions.

[0008] FIGS. 4A-4B are diagrams of an example data decryption process in a cryptographic computing system.

[0009] FIGS. 5A-5C are diagrams of another example data decryption process in a cryptographic computing system.

[0010] FIGS. 6A-6B are diagrams of an example data encryption process in a cryptographic computing system.

[0011] FIGS. 7A-7B are diagrams of an example pointer decryption process in a cryptographic computing system.

[0012] FIGS. 8A-8B are diagrams of an example base address slice decryption process in a cryptographic computing system.

[0013] FIG. 9 is a flow diagram of an example process of executing cryptographic-based instructions in a cryptographic computing system.

[0014] FIG. 10 is a block diagram illustrating an example processor core and memory according to at least one embodiment;

[0015] FIG. 11A is a block diagram of an example in-order pipeline and an example register renaming, out-of-order issue/execution pipeline according to one or more embodiments of this disclosure;

[0016] FIG. 11B is a block diagram of an example in-order architecture core and register renaming, out-of-order issue/execution architecture core to be included in a processor according to one or more embodiments of this disclosure; and

[0017] FIG. 12 is a block diagram of an example computer architecture according to at least one embodiment.

DETAILED DESCRIPTION

[0018] The following disclosure provides various possible embodiments, or examples, for implementation of cryptographic computing. Cryptographic computing may refer to computer system security solutions that employ cryptographic mechanisms inside processor components. Some cryptographic computing systems may involve the encryption and decryption of pointers, keys, and data in a processor core using new encrypted memory access instructions. Thus, the microarchitecture pipeline of the processor core may be configured in such a way to support such encryption and decryption operations.

[0019] Some current systems may address security concerns by placing a memory encryption unit in the microcontroller. However, such systems may increase latencies due to the placement of cryptographic functionality in the microcontroller. Other systems may provide a pointer authentication solution. However, these solutions cannot support multi-tenancy and may otherwise be limited when compared to the cryptographic computing implementations described herein.

[0020] In some embodiments of the present disclosure, an execution pipeline of a processor core first maps cryptographic computing instructions into at least one block encryption-based microoperation (μop) and at least one regular, non-encryption-based load/store μop. Load operations performed by load μops may go to a load buffer (e.g., in a memory subsystem of a processor), while store operations performed by store μops may go to a store buffer (e.g., in the same memory subsystem). An in-order or out-of-order execution scheduler is aware of the timings and dependencies associated with the cryptographic computing instructions. In some embodiments, the load and store μops are considered as dependent on the block encryption μops. In embodiments where a counter mode is used, the load and store μops may execute in parallel with the encryption of the counter. In these implementations, a counter common to a plurality of load/store μops may be encrypted only once. In certain embodiments, block encryptions coming from cryptographic computing instructions are scheduled to be executed in parallel with independent μops, which may include μops not coming from cryptographic computing instructions.

[0021] Further, in some embodiments, functional units include block encryption or counter encryption operations. For example, data decryption may be performed (e.g., on data loaded from a data cache unit) by a decryption unit coupled to or implemented in a load buffer, and data encryption may be performed (e.g., on data output from an execution unit) by an encryption unit coupled to or implemented in a store buffer. As another example, pointer decryption may be performed by an address generation unit. Any suitable block cipher cryptographic algorithm may be implemented. For example, a small block cipher (e.g., a SIMON or SPECK cipher at a 32-bit block size, or other variable bit size block cipher) or their tweakable versions may be used. The Advanced Encryption Standard (AES) may be implemented in any number of ways to achieve encryption/decryption of a block of data. For example, an AES xor-encrypt-xor (XEX) based tweaked-codebook mode with ciphertext stealing (AES-XTS) may be suitable. In other embodiments, an AES counter (CTR) mode of operation could be implemented.

[0022] In certain embodiments, cryptographic computing may require the linear address for each memory access to be plumbed to the interface with the data cache to enable tweaked encryption and decryption at that interface. For load requests, that may be accomplished by adding a new read port on the load buffer. In embodiments utilizing stream ciphers, e.g., those using the counter mode, the keystream may be pre-computed as soon as the load buffer entry is created. Data may be encrypted as it is stored into the store buffer or may be encrypted after it exits the store buffer on its way to a Level-1 (L1) cache. In some instances, it may be advantageous to start encrypting the data as soon as its address becomes available (e.g., while it may still be in the store buffer) to minimize the total delay for storing the data. If the data is encrypted outside of the store buffer, then a read port may be utilized on the store buffer so that a cryptographic execution unit can read the address.

[0023] Aspects of the present disclosure may provide a good cost/performance trade-off when compared to current systems, as data and pointer encryption and decryption latencies can be hidden behind the execution of other μops. Other advantages will be apparent in light of the present disclosure.

[0024] FIG. 1 is a flow diagram of an example process 100 of scheduling microoperations. The example process 100 may be implemented by an execution scheduler, such as an out-of-order execution scheduler in certain instances. At 102, a sequence of instructions is accessed by an execution scheduler. The instructions may be inside a window of fixed size (e.g., 25 instructions or 50 instructions). At 104, the sequence of instructions is mapped to a sequence of microoperations (μops). In typical pipelines, each instruction may be mapped to one or more μops in the sequence. At 106, the scheduler detects dependencies between μops and expresses those dependencies in the form of a directed acyclic graph. This may be performed by dependencies logic of the scheduler. As an example, two independent μops, an XOR μop and a load μop, may be represented as nodes in separate parallel branches in the graph. Conversely, dependent μops such as an ADD μop and a following store μop may be represented as sequential nodes in the same branch of the graph. The acyclic graph may include speculative execution branches in certain instances.

[0025] At 108, the scheduler may annotate the graph with latency and throughput values associated with the execution of the μops, and at 110, the scheduler performs maximal scheduling of at least one subset of independent μops by the functional units of the processor core. The annotation of 108 may be performed by timing logic of the scheduler and the scheduling of 110 may be performed by scheduling logic of the scheduler. Maximal scheduling may refer to an assignment of independent μops to core functional units that is locally optimal according to some specific objective. For example, the scheduler may perform assignments such that the largest possible number of independent functional units are simultaneously occupied to execute independent μop tasks. In certain embodiments, the scheduling performed at 110 may be repeated several times.
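By way of illustration only, operations 106 through 110 of the process 100 may be sketched in software as follows. The sketch is not taken from the disclosure: the function name `schedule_uops`, the greedy cycle-by-cycle policy, and the assumption of identical, non-pipelined functional units are all hypothetical simplifications chosen to show how a latency-annotated dependency graph can drive the assignment of independent μops to functional units.

```python
def schedule_uops(uops, deps, latency, num_units):
    """Greedy cycle-by-cycle scheduling of micro-operations (illustrative only).

    uops      -- uop names in program order
    deps      -- maps a uop to the uops it depends on (edges of the acyclic graph)
    latency   -- maps a uop to its execution latency in cycles (>= 1)
    num_units -- number of identical, non-pipelined functional units (assumption)
    Returns a dict mapping each uop to (issue_cycle, unit_index).
    """
    remaining = list(uops)
    finish = {}                      # uop -> cycle at which its result is ready
    busy_until = [0] * num_units     # per-unit cycle at which the unit frees up
    issued = {}
    cycle = 0
    while remaining:
        # A uop is ready once every producer it depends on has finished.
        ready = [u for u in remaining
                 if all(finish.get(d, float("inf")) <= cycle
                        for d in deps.get(u, ()))]
        free = [i for i in range(num_units) if busy_until[i] <= cycle]
        # Occupy as many free units as there are ready uops in this cycle.
        for uop, unit in zip(ready, free):
            issued[uop] = (cycle, unit)
            finish[uop] = cycle + latency[uop]
            busy_until[unit] = finish[uop]
            remaining.remove(uop)
        cycle += 1
    return issued


# The XOR and load uops are independent; the store depends on the ADD.
plan = schedule_uops(
    uops=["xor", "load", "add", "store"],
    deps={"store": ["add"]},
    latency={"xor": 1, "load": 4, "add": 1, "store": 1},
    num_units=2,
)
```

In this example the XOR and load μops issue together in the first cycle on separate units, while the store μop waits until the ADD μop it depends on has completed.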

[0026] FIG. 2 is a diagram of an example process 200 of scheduling microoperations based on cryptographic-based instructions. The example process 200 may be implemented by an execution scheduler, such as an out-of-order execution scheduler in cryptographic computing systems. At 202, a sequence of cryptographic-based instructions is accessed. This operation may correspond to operation 102 of the process 100. Cryptographic-based instructions may refer to instructions that are to be executed in cryptographic computing systems or environments, where data is stored in memory in encrypted form and decrypted/encrypted within a processor core. An example cryptographic-based instruction includes an encrypted load and store operation. The sequence of instructions may be within a particular window of fixed size as in process 100.

[0027] At 204, at least one encryption-based μop and at least one non-encryption-based μop are generated for each instruction accessed at 202. This operation may correspond to operation 104 of the process 100. In some embodiments, the encryption-based μop is based on a block encryption scheme. The at least one encryption-based μop may include a data block encryption μop and the at least one non-encryption-based μop may include a regular, unencrypted load or store μop. As another example, the at least one encryption-based μop may include a data block decryption μop and the at least one non-encryption-based μop may include a regular, unencrypted load or store μop. As yet another example, the at least one encryption-based μop may include a data pointer encryption μop and the at least one non-encryption-based μop may include a regular, unencrypted load or store μop. As yet another example, the at least one encryption-based μop may include a data pointer decryption μop and the non-encryption-based μop may include a regular, unencrypted load or store μop.

[0028] At 206, the non-encryption-based μops are expressed as dependent upon the (block) encryption-based μops. This operation may correspond to operation 106 of the process 100, and may accordingly be performed by dependencies logic of the scheduler during generation of an acyclic graph. As an example, in some embodiments, the scheduler may compute dependencies between μops by identifying regular, unencrypted load or store μops that have resulted from the mapping of cryptographic-based instructions into μops as dependent on at least one of a data block encryption μop, a data block decryption μop, a pointer encryption μop, or a pointer decryption μop.

[0029] At 208, encryption or decryption timings are added to an acyclic graph that expresses μop dependencies. This operation may correspond to operation 108 of the process 100, whereby the acyclic graph is annotated by timing logic of a scheduler. In some embodiments, the timings are otherwise implicitly taken into account by the scheduler. At 210, the encryption-based μops are scheduled to execute in parallel with independent μops (e.g., those not originating from the cryptographic-based instructions accessed at 202). This operation may correspond to operation 110 of the process 100, whereby the maximal scheduling is performed by scheduling logic of a scheduler. For instance, the scheduling logic that assigns μops to functional units may ensure that data block and pointer encryption/decryption tasks are scheduled to be executed in parallel with other independent μops.
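As a non-limiting sketch of operations 204 and 206, the mapping of cryptographic-based instructions into μops and the recording of their dependencies might look as follows. The function name `map_to_uops`, the μop naming scheme, and the particular choice of which encryption-based μops accompany a load or a store are assumptions made for illustration; the disclosure only requires at least one encryption-based μop and at least one regular load or store μop per instruction.

```python
def map_to_uops(instructions):
    """Map cryptographic-based instructions to uops and dependency edges.

    instructions -- list of "load" / "store" strings in program order
    Returns (uops, deps), where deps maps each regular load/store uop to the
    encryption-based uops it is expressed as dependent upon.

    Assumed mapping (hypothetical): every instruction gets a pointer
    decryption uop; a store additionally gets a data block encryption uop.
    """
    uops, deps = [], {}
    for i, kind in enumerate(instructions):
        if kind not in ("load", "store"):
            raise ValueError(f"unsupported instruction kind: {kind}")
        crypto = [f"i{i}.ptr_decrypt"]
        if kind == "store":
            crypto.append(f"i{i}.data_encrypt")
        mem = f"i{i}.{kind}"          # the regular, unencrypted load/store uop
        uops.extend(crypto + [mem])
        deps[mem] = list(crypto)      # operation 206: regular uop depends on crypto uops
    return uops, deps
```

The resulting `deps` mapping has the same shape as the dependency edges of the acyclic graph, so encryption or decryption latencies can be attached to the encryption-based μops before scheduling.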

[0030] FIG. 3 is a diagram of another example process 300 of scheduling microoperations based on cryptographic-based instructions. In particular, in the example shown, a block cipher encryption scheme is utilized, and the mode used for data block and pointer encryption is the counter mode. In the counter mode, data are encrypted by being XOR-ed with an almost random value, called the key stream. The key stream may be produced by encrypting counter blocks using a secret key. Counter blocks comprising tweak bits (as well as the bits of a block-by-block increasing counter) may be encrypted with the same key and the resulting encrypted blocks are XOR-ed with the data. Using the counter mode, key stream generation microoperations can be parallelized with microoperations for the reading of the data from memory.
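The counter mode described above may be illustrated with the following minimal sketch. It is not a secure or faithful implementation: keyed BLAKE2b from the Python standard library is used purely as a stand-in for the block cipher (the standard library provides no AES), and the 16-byte block size, the 8-byte tweak and counter fields, and all function names are assumptions.

```python
import hashlib

BLOCK_SIZE = 16  # bytes per counter block (assumed)


def encrypt_counter_block(key: bytes, tweak: int, counter: int) -> bytes:
    """Stand-in for block-cipher encryption of one counter block.

    The counter block is formed from tweak bits followed by the
    block-by-block increasing counter, then "encrypted" under the key.
    """
    block = tweak.to_bytes(8, "little") + counter.to_bytes(8, "little")
    return hashlib.blake2b(block, key=key, digest_size=BLOCK_SIZE).digest()


def key_stream(key: bytes, tweak: int, length: int) -> bytes:
    """Concatenate encrypted counter blocks until `length` bytes are available."""
    out = b""
    counter = 0
    while len(out) < length:
        out += encrypt_counter_block(key, tweak, counter)
        counter += 1
    return out[:length]


def ctr_xor(key: bytes, tweak: int, data: bytes) -> bytes:
    """Encrypt or decrypt `data` by XOR-ing it with the key stream."""
    stream = key_stream(key, tweak, len(data))
    return bytes(d ^ s for d, s in zip(data, stream))
```

Because the key stream depends only on the key, the tweak, and the counter, it can be computed before the data arrives; encryption and decryption are the same XOR operation.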

[0031] At 302, a sequence of cryptographic-based instructions is accessed. Cryptographic-based instructions may refer to instructions that are to be executed in cryptographic computing systems or environments, where data is stored in memory in encrypted form and decrypted/encrypted within a processor core. An example cryptographic-based instruction includes an encrypted load and store operation. The sequence of instructions may be within a particular window of fixed size as in processes 100, 200.

[0032] At 304, at least one counter mode encryption-based μop and at least one non-encryption-based μop are generated for each instruction accessed at 302, in a similar manner as described above with respect to 204 of process 200.

[0033] At 306, non-encryption-based μops that can execute in parallel with the encryption of the counter are identified, and the counter common to the identified μops is encrypted once (instead of multiple times). This operation may correspond to operation 106 of the process 100, and may accordingly be performed by dependencies logic of the scheduler during generation of an acyclic graph. As an example, the scheduler logic that computes μop dependencies may ensure that regular unencrypted load μops coming from the cryptographic-based instructions are not expressed as dependent on their associated counter encryption μops. In the counter mode, the encryption of the counter blocks may proceed independently from the loading of the data. Hence, the corresponding μops of these two steps may be represented by nodes of two separate parallel branches in the dependencies graph. These branches would merge in a node representing the XOR operation which adds the encrypted counter to the loaded data, according to the counter mode specification. In some implementations, the dependencies logic of the scheduler may also identify a plurality of load and store μops coming from the cryptographic-based instructions, the associated data of which need to be encrypted or decrypted with the same counter value and key stream. For these μops, the dependencies logic may schedule the computation of the key stream only once and represent it as a single node in the dependencies graph.
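The shape of the dependencies graph described in this paragraph may be sketched as follows. The function name `build_ctr_graph`, the node naming, and the use of a counter identifier to decide which loads share a key stream are hypothetical; the sketch only shows that each load node and its key stream node sit in parallel branches that merge at an XOR node, and that loads sharing a counter value share a single key stream node.

```python
def build_ctr_graph(loads):
    """Build dependency edges for counter-mode loads (illustrative only).

    loads -- list of (name, counter_id) pairs; loads with the same
             counter_id are assumed to use the same counter value and key stream
    Returns a dict mapping each node to the nodes it depends on.
    """
    deps = {}
    for name, counter_id in loads:
        ks_node = f"keystream[{counter_id}]"
        deps.setdefault(ks_node, [])          # created once per counter value
        load_node = f"{name}.load"
        deps[load_node] = []                  # not dependent on the key stream
        deps[f"{name}.xor"] = [load_node, ks_node]  # the two branches merge here
    return deps
```

For two loads that share a counter value, the graph contains one key stream node, two independent load nodes, and two XOR nodes that each depend on their own load and on the shared key stream.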

[0034] At 308, encryption or decryption timings are added to an acyclic graph that expresses μop dependencies. This operation may correspond to operation 108 of the process 100, whereby the acyclic graph is annotated by timing logic of a scheduler. In some embodiments, the timings are otherwise implicitly taken into account by the scheduler. At 310, the encryption-based μops are scheduled to execute in parallel with independent μops (e.g., those not originating from the cryptographic-based instructions accessed at 302). This operation may correspond to operation 110 of the process 100, whereby the maximal scheduling is performed by scheduling logic of the scheduler. For instance, the scheduling logic that assigns μops to functional units may ensure that data block and pointer encryption/decryption tasks are scheduled to be executed in parallel with other independent μops.

[0035] The above descriptions have described how an out-of-order-execution scheduler may support the execution of cryptographic-based instructions in cryptographic computing implementations. The following examples describe certain embodiments wherein the functional units of a core support the execution of the microoperations as discussed above. In some of the example embodiments described below, the encryption and decryption of data is done in the load and store buffers, respectively, of a processor core microarchitecture.

[0036] FIGS. 4A-4B are diagrams of an example data decryption process in a cryptographic computing system. In particular, FIG. 4A shows an example system 400 for implementing the example process 450 of FIG. 4B. In certain embodiments, the system 400 is implemented entirely within a processor as part of a cryptographic computing system. The system 400 may, in certain embodiments, be executed in response to a plurality of μops issued by an out-of-order scheduler implementing the process 200 of FIG. 2.

[0037] Referring to the example system 400 of FIG. 4A, a load buffer 402 includes one or more load buffer entries 404. The load buffer 402 may be implemented in a memory subsystem of a processor, such as in a memory subsystem of a processor core. Each load buffer entry 404 includes a physical address field 406 and a pointer field 408. In the example shown, a state machine servicing load requests obtains data from a data cache unit 412 (which may, in some implementations, be a store buffer), then uses the pointer field 408 (obtained via read port 410) as a tweak in a decryption operation performed on the encrypted data via a decryption unit 414. The decrypted data are then delivered to an execution unit 416 of the processor core microarchitecture. Although shown as being implemented outside (and coupled to) the load buffer 402, the decryption unit 414 may be implemented inside the load buffer 402 in some embodiments.

[0038] Referring now to the example process 450 of FIG. 4B, a data cache unit (or store buffer) stores encrypted data (ciphertext) to be decrypted by the decryption unit 414 as described above. At 452, the decryption unit 414 accesses the ciphertext to begin fulfilling a load operation. The decryption unit 414 then decrypts the ciphertext at 454 using an active key obtained from a register along with a tweak value, which, in the example shown, is the value of the pointer field 408 (i.e., the data's linear address). At 456, the decryption unit 414 provides the decrypted plaintext to an execution unit 416 to fulfill the load operation. Finally, at 458, the decryption unit 414 sends a wake-up signal to a reservation station of the processor (which may track the status of register contents and support register renaming).
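Operations 452 through 456 may be pictured with the following toy model, offered only as a sketch. The tweakable block cipher here is a four-round Feistel construction over an 8-byte block whose round function is keyed BLAKE2b; it is a hypothetical stand-in chosen because it is invertible and fits in a few lines, not one of the ciphers named in this disclosure and not suitable for real use. The names `LoadBufferEntry`, `service_load`, and the dictionary used as a data cache are likewise assumptions.

```python
import hashlib
from dataclasses import dataclass

ROUNDS = 4  # toy parameter


def _round(key: bytes, tweak: int, rnd: int, half: bytes) -> bytes:
    """Round function: mixes the round number, the tweak, and one half-block."""
    msg = bytes([rnd]) + tweak.to_bytes(8, "little") + half
    return hashlib.blake2b(msg, key=key, digest_size=4).digest()


def _xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))


def encrypt_block(key: bytes, tweak: int, block: bytes) -> bytes:
    """Encrypt one 8-byte block under `key`, tweaked by `tweak`."""
    left, right = block[:4], block[4:]
    for rnd in range(ROUNDS):
        left, right = right, _xor(left, _round(key, tweak, rnd, right))
    return left + right


def decrypt_block(key: bytes, tweak: int, block: bytes) -> bytes:
    """Invert encrypt_block by running the rounds in reverse order."""
    left, right = block[:4], block[4:]
    for rnd in reversed(range(ROUNDS)):
        left, right = _xor(right, _round(key, tweak, rnd, left)), left
    return left + right


@dataclass
class LoadBufferEntry:
    physical_address: int
    pointer: int  # the data's linear address, used as the tweak


def service_load(entry: LoadBufferEntry, data_cache: dict, key: bytes) -> bytes:
    """Fetch ciphertext by physical address and decrypt it with the pointer as tweak."""
    ciphertext = data_cache[entry.physical_address]       # operation 452
    return decrypt_block(key, entry.pointer, ciphertext)  # operation 454
```

Because the pointer value enters every round, ciphertext fetched through a load buffer entry holding a different pointer value decrypts to unrelated bytes, which is the effect of using the linear address as a tweak.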

[0039] FIGS. 5A-5C are diagrams of another example data decryption process in a cryptographic computing system. In particular, FIG. 5A shows an example system 500 for implementing the example processes 550, 560 of FIGS. 5B, 5C. In certain embodiments, the system 500 is implemented entirely within a processor as part of a cryptographic computing system. In the examples shown in FIGS. 5A-5B, a counter mode block cipher is used for encryption/decryption of data. The system 500 may be executed, in certain embodiments, in response to a plurality of μops issued by an out-of-order scheduler implementing the process 300 of FIG. 3.

[0040] Referring to the example system 500 of FIG. 5A, a load buffer 502 includes one or more load buffer entries 504. The load buffer 502 may be implemented in a memory subsystem of a processor, such as in a memory subsystem of a processor core. Each load buffer entry 504 includes a physical address field 506, a pointer field 508, and a key stream 510. In the example shown, since the counter mode is being used, the key stream generator 512 produces the key stream 510 by encrypting a counter value loaded from the register 522. The pointer field 508 of the load buffer entry 504 tweaks the encryption operation performed by the key stream generator 512. The encryption performed by the key stream generator 512 may be tweaked by other fields, such as, for example, other cryptographic context values. An XOR operation is then performed on the key stream 510 by the XOR unit 518 (which reads the key stream 510 via the read port 514) and encrypted data coming from the data cache unit 516 (which may, in some embodiments, be a store buffer). The decrypted data are then delivered to an execution unit 520 of the processor core microarchitecture. Although shown as being implemented inside the load buffer 502, the key stream generator 512 may be implemented outside the load buffer 502 in some embodiments. Further, although shown as being implemented outside (and coupled to) the load buffer 502, the XOR unit 518 may be implemented inside the load buffer 502 in some embodiments.

[0041] Referring now to the example process 550 of FIG. 5B, at 552, a load buffer entry 504 is created. At 554, a key stream generator 512 is invoked. The key stream generator 512 uses a key obtained from a register along with a tweak value (which, in the example shown, is the pointer value 508) to generate a key stream 510, which is stored in the load buffer entry 504.

[0042] Referring now to the example process 560 of FIG. 5C (which may execute independently from the process 550 of FIG. 5B), the ciphertext associated with the load operation may become available from a data cache unit (or store buffer). At 562, the ciphertext is accessed, and at 564, the ciphertext is XOR-ed with the key stream 510. At 566, the result of the XOR operation is provided to an execution unit 520 of the processor core microarchitecture to fulfill the load operation. Finally, at 568, a wake-up signal is sent to a reservation station of the processor.
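The decoupling of processes 550 and 560 may be sketched as two separate functions, one run when the load buffer entry is created and one run when the ciphertext arrives. This is an illustrative model under stated assumptions: keyed BLAKE2b stands in for the key stream generator 512, the key stream is limited to 64 bytes by that stand-in, and the names `create_entry` and `complete_load`, the field layout, and the dictionary used as a data cache are hypothetical. The wake-up signal of operation 568 is not modeled.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class LoadBufferEntry:
    physical_address: int
    pointer: int
    key_stream: bytes = b""


def create_entry(key: bytes, counter: int, physical_address: int,
                 pointer: int, length: int) -> LoadBufferEntry:
    """Process 550: create the entry and generate its key stream immediately.

    The counter value is "encrypted" under the key with the pointer as tweak.
    `length` must be between 1 and 64 bytes for this stand-in.
    """
    seed = counter.to_bytes(8, "little") + pointer.to_bytes(8, "little")
    stream = hashlib.blake2b(seed, key=key, digest_size=length).digest()
    return LoadBufferEntry(physical_address, pointer, stream)


def complete_load(entry: LoadBufferEntry, data_cache: dict) -> bytes:
    """Process 560: when the ciphertext is available, XOR it with the key stream."""
    ciphertext = data_cache[entry.physical_address]
    return bytes(c ^ k for c, k in zip(ciphertext, entry.key_stream))
```

Since `create_entry` does not touch the data cache, the key stream can be ready before the ciphertext arrives, leaving only a single XOR on the critical path of the load.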

[0043] FIGS. 6A-6B are diagrams of an example data encryption process in a cryptographic computing system. In particular, FIG. 6A shows an example system 600 for implementing the example process 650 of FIG. 6B. In certain embodiments, the system 600 is implemented entirely within a processor as part of a cryptographic computing system. The system 600 may, in certain embodiments, be executed in response to a plurality of μops issued by an out-of-order scheduler implementing the process 200 of FIG. 2.

[0044] Referring to the example system 600 shown in FIG. 6A, a store buffer 602 includes one or more store buffer entries 604. The store buffer 602 may be implemented in a memory subsystem of a processor, such as in a memory subsystem of a processor core. Each store buffer entry 604 includes a physical address field 606, a pointer field 608, and store data 610 (which is to be stored). In the example shown, a state machine servicing store requests obtains data from a register file 620 (or execution unit), and an encryption unit 612 uses the pointer field 608 as a tweak during an encryption operation performed on the data obtained from the register file 620. The encrypted data are then passed to a data cache unit 630 (or other execution unit of the CPU core microarchitecture). Although shown as being implemented inside the store buffer 602, the encryption unit 612 may be implemented outside the store buffer 602 in some embodiments.

[0045] Referring now to the example process 650 of FIG. 6B, plaintext data to be encrypted is available from a register file 620. At 652, the store buffer entry 604 is populated with a pointer value 608. At 654, the plaintext data is accessed from the register file 620, and at 656, the plaintext data is encrypted by the encryption unit 612 using an active key obtained from a register 640 along with a tweak (which, in the example shown, is the value of the pointer field 608 (i.e., the data's linear address)) and stored in the store buffer entry 604 as store data 610. At 658, the encrypted store data 610 is provided to a data cache unit 630 (or another waiting execution unit, in some implementations).
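
As a rough software analogy for the store path of FIGS. 6A-6B, the sketch below models a store buffer entry and a pointer-tweaked encryption step. The `StoreBufferEntry` class, `tweaked_pad`, and `service_store` are hypothetical names, and the XOR pad derived with SHAKE-256 is only a toy stand-in for the encryption unit 612, whose cipher the source does not specify.

```python
import hashlib
from dataclasses import dataclass


def tweaked_pad(key: bytes, pointer: int, length: int) -> bytes:
    """Toy stand-in for a tweakable cipher: a pad bound to key and pointer."""
    return hashlib.shake_256(key + pointer.to_bytes(8, "little")).digest(length)


@dataclass
class StoreBufferEntry:
    physical_address: int    # models physical address field 606
    pointer: int             # models pointer field 608 (the tweak)
    store_data: bytes = b""  # models store data 610


def service_store(entry: StoreBufferEntry, plaintext: bytes, key: bytes) -> bytes:
    # 654: the plaintext comes from the register file (passed in here).
    # 656: encrypt under the active key, tweaked by the entry's pointer value.
    pad = tweaked_pad(key, entry.pointer, len(plaintext))
    entry.store_data = bytes(p ^ k for p, k in zip(plaintext, pad))
    # 658: the encrypted store data is what the data cache unit receives.
    return entry.store_data
```

In this model, identical plaintext stored through two different pointer values produces different ciphertext, which is the property the pointer tweak is meant to provide.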

[0046] In some implementations, the pointer values used in the encryption and decryption operations may themselves be encrypted for security purposes. The pointer values may be entirely or partially encrypted (that is, only a portion of the bits of the pointer value may be encrypted). In these instances, the encrypted pointer values may first be decrypted prior to being used in the encryption/decryption operations described above. FIGS. 7A-7B and 8A-8B describe example embodiments for decrypting pointer values prior to use in the encryption/decryption operations.

[0047] FIGS. 7A-7B are diagrams of an example pointer decryption process in a cryptographic computing system. In particular, FIG. 7A shows an example system 700 for implementing the example process 750 of FIG. 7B. In certain embodiments, the system 700 is implemented entirely within a processor as part of a cryptographic computing system. The system 700 may, in certain embodiments, be executed in response to a plurality of μops issued by an out-of-order scheduler implementing the process 200 of FIG. 2 or the process 300 of FIG. 3.

[0048] Referring to the example system 700 shown in FIG. 7A, an address generation unit 702 is configured to decrypt parts of a linear address, which are encrypted for security. A decryption unit 704 in the address generation unit 702 accepts as input an encrypted pointer 710 representing a first encoded linear address, along with a key obtained from a register and a context value tweak input (e.g., the tweak input may come from a separate register, or may consist of unencrypted bits of the same linear address). The decryption unit 704 outputs a decrypted subset of the bits of the encrypted pointer 710, which are then passed to address generation circuitry 706 within the address generation unit 702 along with other address generation inputs. The address generation circuitry 706 generates a second effective linear address to be used in a memory read or write operation based on the inputs.

[0049] Referring now to the example process 750 shown in FIG. 7B, the tweak value (which is also described in FIG. 7B as the "context value") may be available either statically or dynamically--if it is not available statically, it is loaded dynamically from memory. At 752, a request to generate an effective address from an encrypted pointer 710 is received by an address generation unit 702. The address generation unit 702 determines at 754 whether a context value is available statically. If it is available statically, then the value is used at 756; if not, the context value is loaded dynamically from a table in memory at 755. The process then proceeds to 756, where the encrypted pointer 710 is decrypted using an active decryption key obtained from a register along with the obtained context value. At 758, a decrypted address is output to the address generation circuitry 706, which then generates, at 760, an effective address for use in read/write operations based on the decrypted address (and any other address generation inputs).
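
The control flow of process 750 can be sketched as follows. This is an illustration under stated assumptions, not the disclosed hardware: the whole 64-bit pointer is treated as encrypted, the cipher is replaced by an XOR mask derived with SHA-256, the context table is modelled as a dictionary keyed by the encrypted pointer, and the "other address generation inputs" are reduced to a single displacement. All function and parameter names are hypothetical.

```python
import hashlib
from typing import Dict, Optional

MASK64 = (1 << 64) - 1


def pointer_mask(key: bytes, context: int) -> int:
    """Toy stand-in for decryption unit 704: a 64-bit mask from key and tweak."""
    digest = hashlib.sha256(key + context.to_bytes(8, "little")).digest()
    return int.from_bytes(digest[:8], "little")


def generate_effective_address(
    encrypted_pointer: int,
    key: bytes,
    static_context: Optional[int],
    context_table: Dict[int, int],
    displacement: int = 0,
) -> int:
    # 754: is the context value available statically (e.g., from a register)?
    if static_context is not None:
        context = static_context
    else:
        # 755: otherwise load it dynamically from a table in memory.
        context = context_table[encrypted_pointer]
    # 756: decrypt the pointer with the key, using the context value as tweak.
    linear = (encrypted_pointer ^ pointer_mask(key, context)) & MASK64
    # 758-760: combine with the other address generation inputs.
    return (linear + displacement) & MASK64
```

With an XOR mask, encryption and decryption are the same operation, so a test pointer can be produced by masking a known linear address with `pointer_mask(key, context)`.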

[0050] FIGS. 8A-8B are diagrams of an example base address slice decryption process in a cryptographic computing system. In particular, FIG. 8A shows an example system 800 for implementing the example process 850 of FIG. 8B. In certain embodiments, the system 800 is implemented entirely within a processor as part of a cryptographic computing system. The system 800 may, in certain embodiments, be executed in response to a plurality of μops issued by an out-of-order scheduler implementing the process 200 of FIG. 2 or the process 300 of FIG. 3.

[0051] Referring to the example system 800 shown in FIG. 8A, an address generation unit 802 is configured to decrypt parts of a linear address, as described above with respect to FIGS. 7A-7B. However, in the example shown, the bit set that is encrypted (i.e., slice 824) occupies a middle slice of an encoded linear address 820 rather than the entire address being encrypted as in the examples described above with respect to FIGS. 7A-7B. The upper bits 822 of the encoded linear address 820 may denote the data object size, type, format, or other security information associated with the encoded linear address 820. The encoded linear address 820 also includes an offset 826.

[0052] In the example shown, a decryption unit 804 in the address generation unit 802 accepts as input the encrypted base address slice 824, along with a key obtained from a register and a context value tweak input (e.g., the tweak input may come from a separate register, or may consist of unencrypted bits of the same linear address). The decryption unit 804 outputs a decrypted base address slice. The decrypted base address slice is then provided to a concatenator/adder unit 806, which concatenates the decrypted base address slice with a set of complementary upper bits from a register or context table entry and adds the offset 826 to yield an intermediate base address. In certain embodiments, the set of complementary bits is different from the upper bits 822, and the set of complementary bits does not convey metadata information (e.g., data object size, type, format, etc.) but instead includes the missing bits of the effective linear address that is constructed, denoting a location in the linear address space.

[0053] The intermediate base address is then combined with the upper bits 822 by the OR unit 808 to yield a tagged base address. In other embodiments, the upper bits 822 may be combined using an XOR unit, an ADD unit or a logical AND unit. In yet other embodiments, the upper bits 822 may act as a tweak value and tweak the decryption of the middle slice of the address. The tagged base address is then provided to address generation circuitry 810 in the address generation unit 802, along with other address generation inputs. The address generation circuitry 810 then generates an effective address to be used in a memory read or write operation based on the inputs. In one embodiment, the upper bits 822 may be used to determine a number of intermediate lower address bits (e.g., from offset 826) that would be used as a tweak to the encrypted base address 824.

[0054] For embodiments with an encrypted base address, a Translation Lookaside Buffer (TLB) may be used that maps linear addresses (which may also be referred to as virtual addresses) to physical addresses. A TLB entry is populated after a page miss where a page walk of the paging structures determines the correct linear to physical memory mapping, caching the linear to physical mapping for fast lookup. As an optimization, a TLB (for example, the data TLB or dTLB) may instead cache the encoded address 820 to physical address mapping, using a Content Addressable Memory (CAM) circuit to match the encrypted/encoded address 820 to the correct physical address translation. In this way, the TLB may determine the physical memory mapping prior to the completion of the decryption unit 804 revealing the decrypted linear address, and may immediately proceed with processing the instructions dependent on this cached memory mapping. Other embodiments may instead use one or both of the offset 826 and upper bits 822 of the address 820 as a partial linear address mapping into the TLB (that is, the TLB lookup is performed only against the plaintext subset of the address 820), and proceed to use the physical memory translation, if found, verifying the remainder of the decrypted base address (824) to determine the full linear address is a match (TLB hit) after completion of the decryption 804. Such embodiments may speculatively proceed with processing and nuke the processor pipeline if the final decrypted linear address match is found to be a false positive hit in the TLB, preventing the execution of dependent instructions, or cleaning up the execution of dependent instructions by returning processor register state and/or memory to its prior state before the TLB misprediction (incorrect memory mapping).
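
The partial-lookup variant described above can be modelled in a few lines. This is only a behavioral sketch: software cannot express the timing overlap between the TLB lookup and the decryption, so the decryption is passed in as a callable that is invoked after the speculative translation has been obtained. The `TlbEntry` class, the `translate` function, and the string outcomes are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Tuple


@dataclass
class TlbEntry:
    full_linear_page: int  # complete decrypted linear page number
    physical_page: int     # cached translation


def translate(
    plaintext_bits: int,
    decrypt_linear_page: Callable[[], int],
    tlb: Dict[int, TlbEntry],
) -> Tuple[str, Optional[int]]:
    # The lookup uses only the unencrypted pointer bits (upper bits/offset).
    entry = tlb.get(plaintext_bits)
    if entry is None:
        return ("miss", None)  # a page walk would follow
    # Dependent work may start speculatively with this translation.
    speculative_physical_page = entry.physical_page
    # Later, the decryption completes and the full address is verified.
    if decrypt_linear_page() == entry.full_linear_page:
        return ("hit", speculative_physical_page)
    # False positive: speculative work must be discarded (pipeline flush).
    return ("flush", None)
```

The "flush" outcome corresponds to the case where the plaintext bits matched a cached entry but the decrypted base address did not, so the speculatively used mapping was incorrect.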

[0055] In some embodiments, a subset of the upper bits 822 indicates address adjustment, which may involve adding an offset value (which is a power of two) to the effective linear address that is produced by the address generation unit. The offset value may include a bit string where only a single bit is equal to 1 and all other bits are equal to zero. In some other embodiments, address adjustment may involve subtracting from the effective linear address an offset value, which is a power of two. Adjustment may be included in certain implementations because some memory object allocations cross power-of-two boundaries. In some embodiments, the smallest power-of-two box that contains a memory object allocation is also a unique property of the allocation and may be used for cryptographically tweaking the encryption of the base address 824 associated with the allocation. If address adjustment is not supported, allocations that cross power-of-two boundaries may be associated with exceedingly large power-of-two boxes. Such large boxes may be polluted with data of other allocations, which, even though cryptographically isolated, may still be accessed by software (e.g., as a result of a software bug). The adjustment may proceed in parallel with the decryption of the base address bits 824. In certain embodiments, performing the adjustment involves: (i) passing the upper bits 822 through a decoder circuit; (ii) obtaining the outputs of the decoder circuit; (iii) using those decoder outputs together with a first offset value 826 to form a second offset value to add to the bits of the linear address which are unencrypted; (iv) obtaining a carry out value from this addition; and (v) adding the carry out value to the decrypted address bits 824 once they are produced. In other embodiments, a partial TLB lookup process may begin as soon as the adjustment process has produced the linear address bits which are used by the partial TLB lookup process.
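
A short worked example may help show why allocations that cross power-of-two boundaries end up in very large boxes. The helper below, `smallest_pow2_box`, is a hypothetical name and computes the smallest naturally aligned power-of-two region containing an allocation; it is an arithmetic illustration only and does not model the decoder circuit or the carry handling described above.

```python
from typing import Tuple


def smallest_pow2_box(start: int, size: int) -> Tuple[int, int]:
    """Return (box_base, box_size) of the smallest aligned power-of-two box
    that contains the byte range [start, start + size). Requires size >= 1."""
    last = start + size - 1
    # The highest bit in which the first and last byte addresses differ
    # determines how large the enclosing aligned box must be.
    box_bits = (start ^ last).bit_length()
    box_size = 1 << box_bits
    box_base = start & ~(box_size - 1)
    return box_base, box_size


# A 16-byte allocation aligned at 0x1000 fits a 16-byte box.
print(smallest_pow2_box(0x1000, 16))
# The same 16 bytes starting at 0x0FF8 straddle the 0x1000 boundary, so the
# smallest aligned box grows to 0x2000 bytes starting at address 0.
print(smallest_pow2_box(0x0FF8, 16))
# Viewed in a frame shifted by 8 bytes (a power of two), the allocation
# starts at 0x0FF0 and again fits a 16-byte box.
print(smallest_pow2_box(0x0FF8 - 8, 16))
```

The third call is one possible reading of the adjustment: the pointer is encoded relative to a shifted frame in which the allocation fits a small box, and the power-of-two offset is added back when the effective linear address is produced.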

[0056] Referring now to the example process 850 shown in FIG. 8B, as in FIG. 7B, the tweak value (also described in FIG. 8B as the "context value") may be available either statically or dynamically--if it is not available statically, it is loaded dynamically from memory. In particular, at 852, a request to generate an effective address from an encrypted base address slice 824 is received by an address generation unit 802. The address generation unit 802 determines at 854 whether a context value is available statically. If it is available statically, then the value is used at 856; if not, the context value is loaded dynamically from a table in memory at 855. At 856, the encrypted base address slice 824 is decrypted using an active decryption key obtained from a register along with the context value.

[0057] At 858, the address generation unit 802 determines whether both (1) the memory access is being performed with a static context value, and (2) the input context value has its dynamic flag bit cleared. The dynamic flag bit may be a flag bit in the pointer that indicates whether context information is available statically or dynamically. For instance, if an object represented by the pointer is not entirely within the bounds of a statically addressable memory region, then a dynamic flag bit may be set in the pointer. The dynamic flag bit may indicate that context information is to be dynamically obtained, for example, via a pointer context table. In other words, there may be a region of memory in which the upper bits for a base address can be supplied statically from a control register, and allocations outside that region may need to draw their upper bits for the base address dynamically from a table entry in memory.

[0058] If both of the conditions are true at 858, the process 850 moves to 860; if one or both are not true, then the upper base address bits are loaded dynamically from a table entry in memory at 859 before proceeding to 860. In some cases, the operations of 858 can be performed alongside those of 854, or the operations may be merged. Likewise, in some cases, the operations of 859 can be performed alongside those of 855, or the operations may be merged.

[0059] At 860, the concatenator/adder unit 806 of the address generation unit 802 concatenates the upper base address bits with the decrypted base address slice, and at 862, adds the offset 826 to the concatenation. At 864, the address generation unit 802 recombines tag information from the upper bits 822 with the result of the concatenation/addition of 860 and 862 via the OR unit 808. The result of the concatenation, addition, and ORing is provided to address generation circuitry 810 in the address generation unit 802, along with other address generation inputs. At 866, the address generation circuitry 810 generates an effective address to be used in a memory read or write operation based on the inputs.
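
The decode sequence at 856 through 864 can be sketched with explicit bit fields. The field widths below (7 tag bits, a 32-bit encrypted slice, a 25-bit offset) are assumptions chosen only to make the example concrete; the source does not fix a layout. The cipher is again replaced by an XOR mask derived with SHA-256, and `slice_mask` and `decode_address` are hypothetical names.

```python
import hashlib

# Assumed layout of encoded address 820: [tag 822 | slice 824 | offset 826]
TAG_BITS, SLICE_BITS, OFFSET_BITS = 7, 32, 25
TAG_SHIFT = SLICE_BITS + OFFSET_BITS  # tag occupies bits 63..57
SLICE_MASK = (1 << SLICE_BITS) - 1
OFFSET_MASK = (1 << OFFSET_BITS) - 1


def slice_mask(key: bytes, context: int) -> int:
    """Toy stand-in for decryption unit 804: a 32-bit mask from key and tweak."""
    digest = hashlib.sha256(key + context.to_bytes(8, "little")).digest()
    return int.from_bytes(digest[:4], "little")


def decode_address(encoded: int, key: bytes, context: int, upper_base_bits: int) -> int:
    tag = encoded >> TAG_SHIFT                         # upper bits 822
    enc_slice = (encoded >> OFFSET_BITS) & SLICE_MASK  # encrypted slice 824
    offset = encoded & OFFSET_MASK                     # offset 826
    # 856: decrypt the base address slice using the key and context value.
    dec_slice = enc_slice ^ slice_mask(key, context)
    # 860: concatenate the complementary upper base address bits.
    base = (upper_base_bits << SLICE_BITS) | dec_slice
    # 862: add the offset to the concatenation.
    intermediate = base + offset
    # 864: recombine the tag information with an OR.
    return (tag << TAG_SHIFT) | intermediate
```

The result corresponds to the tagged base address that is handed to the address generation circuitry 810; the example assumes the intermediate value stays below the tag field so the OR does not disturb address bits.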

[0060] FIG. 9 is a flow diagram of an example process 900 of executing cryptographic-based instructions in a cryptographic computing system. The example process 900 may be performed by circuitry of a microprocessor pipeline of a processor (e.g., one or more of the components described above, which may be implemented in a processor configured similar to the processor 1000 of FIG. 10) in response to accessing a set of cryptographic-based instructions. In some embodiments, the circuitry of the microprocessor pipeline performs each of the operations described, while in other embodiments, the circuitry of the microprocessor pipeline performs only a subset of the operations described.

[0061] At 902, encrypted data stored in a data cache unit of a processor (e.g., data cache unit 412 of FIG. 4A, data cache unit 516 of FIG. 5A, or data cache unit 1030 of FIG. 10) is accessed.

[0062] At 904, the encrypted data is decrypted based on a pointer value. The decryption may be performed in a manner similar to that described above with respect to FIGS. 4A-4B, FIGS. 5A-5B, or in another manner. In some instances, the pointer value or a portion thereof may itself be encrypted. In these instances, the pointer value may first be decrypted/decoded, for example, in a similar manner to that described above with respect to FIGS. 7A-7B or FIGS. 8A-8B.

[0063] At 906, a cryptographic-based instruction is executed based on data obtained from the decryption performed at 904. The instruction may be executed on an execution unit of the processor (e.g., execution unit 416 of FIG. 4A, execution unit 520 of FIG. 5A, or execution unit(s) 1016 of FIG. 10).

[0064] At 908, a result of the execution performed at 906 is encrypted based on another pointer value. The encryption may be performed in a similar manner to that described above with respect to FIGS. 6A-6B.

[0065] At 910, the encrypted result is stored in a data cache unit of the processor or another execution unit.

[0066] The example processes described above may include additional or different operations, and the operations may be performed in the order shown or in another order. In some cases, one or more of the operations shown in the flow diagrams are implemented as processes that include multiple operations, sub-processes, or other types of routines. In some cases, operations can be combined, performed in another order, performed in parallel, iterated, or otherwise repeated or performed in another manner. Further, although certain functionality is described herein as being performed by load or store buffers, address generation units, or other certain aspects of a processor, it will be understood that the teachings of the present disclosure may be implemented in other examples by other types of execution units in a processor, including but not limited to separate data block encryption units, separate key stream generation units, or separate data pointer decryption units.

[0067] FIGS. 10-12 are block diagrams of example computer architectures that may be used in accordance with embodiments disclosed herein. Generally, any computer architecture designs known in the art for processors and computing systems may be used. In an example, system designs and configurations known in the art for laptops, desktops, handheld PCs, personal digital assistants, tablets, engineering workstations, servers, network devices, appliances, network hubs, routers, switches, embedded processors, digital signal processors (DSPs), graphics devices, video game devices, set-top boxes, micro controllers, smart phones, mobile devices, wearable electronic devices, portable media players, hand held devices, and various other electronic devices, are also suitable for embodiments of computing systems described herein. Generally, suitable computer architectures for embodiments disclosed herein can include, but are not limited to, configurations illustrated in FIGS. 10-12.

[0068] FIG. 10 is an example illustration of a processor according to an embodiment. Processor 1000 is an example of a type of hardware device that can be used in connection with the implementations above. Processor 1000 may be any type of processor, such as a microprocessor, an embedded processor, a digital signal processor (DSP), a network processor, a multi-core processor, a single core processor, or other device to execute code. Although only one processor 1000 is illustrated in FIG. 10, a processing element may alternatively include more than one of processor 1000 illustrated in FIG. 10. Processor 1000 may be a single-threaded core or, for at least one embodiment, the processor 1000 may be multi-threaded in that it may include more than one hardware thread context (or "logical processor") per core.

[0069] FIG. 10 also illustrates a memory 1002 coupled to processor 1000 in accordance with an embodiment. Memory 1002 may be any of a wide variety of memories (including various layers of memory hierarchy) as are known or otherwise available to those of skill in the art. Such memory elements can include, but are not limited to, random access memory (RAM), read only memory (ROM), logic blocks of a field programmable gate array (FPGA), erasable programmable read only memory (EPROM), and electrically erasable programmable ROM (EEPROM).

[0070] Processor 1000 can execute any type of instructions associated with algorithms, processes, or operations detailed herein. Generally, processor 1000 can transform an element or an article (e.g., data) from one state or thing to another state or thing.

[0071] Code 1004, which may be one or more instructions to be executed by processor 1000, may be stored in memory 1002, or may be stored in software, hardware, firmware, or any suitable combination thereof, or in any other internal or external component, device, element, or object where appropriate and based on particular needs. In one example, processor 1000 can follow a program sequence of instructions indicated by code 1004. Each instruction enters a front-end logic 1006 and is processed by one or more decoders 1008. The decoder may generate, as its output, a microoperation such as a fixed width microoperation in a predefined format, or may generate other instructions, microinstructions, or control signals that reflect the original code instruction. Front-end logic 1006 also includes register renaming logic 1010 and scheduling logic 1012 (which includes a reservation station 1013), which generally allocate resources and queue the operation corresponding to the instruction for execution. In some embodiments, the scheduling logic 1012 includes an in-order or an out-of-order execution scheduler.

[0072] Processor 1000 can also include execution logic 1014 having a set of execution units 1016a, . . . , 1016n, an address generation unit 1017, etc. Some embodiments may include a number of execution units dedicated to specific functions or sets of functions. Other embodiments may include only one execution unit or one execution unit that can perform a particular function. Execution logic 1014 performs the operations specified by code instructions.

[0073] After completion of execution of the operations specified by the code instructions, back-end logic 1018 can retire the instructions of code 1004. In one embodiment, processor 1000 allows out of order execution but requires in order retirement of instructions. Retirement logic 1020 may take a variety of known forms (e.g., re-order buffers or the like). In this manner, processor 1000 is transformed during execution of code 1004, at least in terms of the output generated by the decoder, hardware registers and tables utilized by register renaming logic 1010, and any registers (not shown) modified by execution logic 1014.

[0074] Processor 1000 can also include a memory subsystem 1022, which includes a load buffer 1024, a decryption unit 1025, a store buffer 1026, an encryption unit 1027, a Translation Lookaside Buffer (TLB) 1028, a data cache unit (DCU) 1030, and a Level-2 (L2) cache unit 1032. The load buffer 1024 processes microoperations for memory/cache load operations, while the store buffer 1026 processes microoperations for memory/cache store operations. In cryptographic computing systems, the data stored in the data cache unit 1030, the L2 cache unit 1032, and/or the memory 1002 may be encrypted, and may be encrypted (prior to storage) and decrypted (prior to processing by one or more execution units 1016) entirely within the processor 1000 as described herein. Accordingly, the decryption unit 1025 may decrypt encrypted data stored in the DCU 1030, e.g., during load operations processed by the load buffer 1024 as described above, and the encryption unit 1027 may encrypt data to be stored in the DCU 1030, e.g., during store operations processed by the store buffer 1026 as described above. In some embodiments, the decryption unit 1025 may be implemented inside the load buffer 1024 and/or the encryption unit 1027 may be implemented inside the store buffer 1026. The Translation Lookaside Buffer (TLB) 1028 maps linear addresses to physical addresses and performs other functionality as described herein.

[0075] Although not shown in FIG. 10, a processing element may include other elements on a chip with processor 1000. For example, a processing element may include memory control logic along with processor 1000. The processing element may include I/O control logic and/or may include I/O control logic integrated with memory control logic. The processing element may also include one or more caches. In some embodiments, non-volatile memory (such as flash memory or fuses) may also be included on the chip with processor 1000.

[0076] FIG. 11A is a block diagram illustrating both an example in-order pipeline and an example register renaming, out-of-order issue/execution pipeline according to one or more embodiments of this disclosure. FIG. 11B is a block diagram illustrating both an example embodiment of an in-order architecture core and an example register renaming, out-of-order issue/execution architecture core to be included in a processor according to one or more embodiments of this disclosure. The solid lined boxes in FIGS. 11A-11B illustrate the in-order pipeline and in-order core, while the optional addition of the dashed lined boxes illustrates the register renaming, out-of-order issue/execution pipeline and core. Given that the in-order aspect is a subset of the out-of-order aspect, the out-of-order aspect will be described.

[0077] In FIG. 11A, a processor pipeline 1100 includes a fetch stage 1102, a length decode stage 1104, a decode stage 1106, an allocation stage 1108, a renaming stage 1110, a schedule (also known as a dispatch or issue) stage 1112, a register read/memory read stage 1114, an execute stage 1116, a write back/memory write stage 1118, an exception handling stage 1122, and a commit stage 1124.

[0078] FIG. 11B shows processor core 1190 including a front end unit 1130 coupled to an execution engine unit 1150, and both are coupled to a memory unit 1170. Processor core 1190 and memory unit 1170 are examples of the types of hardware that can be used in connection with the implementations shown and described herein. The core 1190 may be a reduced instruction set computing (RISC) core, a complex instruction set computing (CISC) core, a very long instruction word (VLIW) core, or a hybrid or alternative core type. As yet another option, the core 1190 may be a special-purpose core, such as, for example, a network or communication core, compression engine, coprocessor core, general purpose computing graphics processing unit (GPGPU) core, graphics core, or the like. In addition, processor core 1190 and its components represent example architecture that could be used to implement logical processors and their respective components.

[0079] The front end unit 1130 includes a branch prediction unit 1132 coupled to an instruction cache unit 1134, which is coupled to an instruction translation lookaside buffer (TLB) unit 1136, which is coupled to an instruction fetch unit 1138, which is coupled to a decode unit 1140. The decode unit 1140 (or decoder) may decode instructions, and generate as an output one or more micro-operations, micro-code entry points, microinstructions, other instructions, or other control signals, which are decoded from, or which otherwise reflect, or are derived from, the original instructions. The decode unit 1140 may be implemented using various different mechanisms. Examples of suitable mechanisms include, but are not limited to, look-up tables, hardware implementations, programmable logic arrays (PLAs), microcode read only memories (ROMs), etc. In one embodiment, the core 1190 includes a microcode ROM or other medium that stores microcode for certain macroinstructions (e.g., in decode unit 1140 or otherwise within the front end unit 1130). The decode unit 1140 is coupled to a rename/allocator unit 1152 in the execution engine unit 1150.

[0080] The execution engine unit 1150 includes the rename/allocator unit 1152 coupled to a retirement unit 1154 and a set of one or more scheduler unit(s) 1156. The scheduler unit(s) 1156 represents any number of different schedulers, including reservation stations, central instruction window, etc. The scheduler unit(s) 1156 is coupled to the physical register file(s) unit(s) 1158. Each of the physical register file(s) units 1158 represents one or more physical register files, different ones of which store one or more different data types, such as scalar integer, scalar floating point, packed integer, packed floating point, vector integer, vector floating point, status (e.g., an instruction pointer that is the address of the next instruction to be executed), etc. In one embodiment, the physical register file(s) unit 1158 comprises a vector registers unit, a write mask registers unit, and a scalar registers unit. These register units may provide architectural vector registers, vector mask registers, and general purpose registers (GPRs). In at least some embodiments described herein, register units 1158 are examples of the types of hardware that can be used in connection with the implementations shown and described herein (e.g., registers 112). The physical register file(s) unit(s) 1158 is overlapped by the retirement unit 1154 to illustrate various ways in which register renaming and out-of-order execution may be implemented (e.g., using a reorder buffer(s) and a retirement register file(s); using a future file(s), a history buffer(s), and a retirement register file(s); using register maps and a pool of registers; etc.). The retirement unit 1154 and the physical register file(s) unit(s) 1158 are coupled to the execution cluster(s) 1160. The execution cluster(s) 1160 includes a set of one or more execution units 1162 and a set of one or more memory access units 1164. 
The execution units 1162 may perform various operations (e.g., shifts, addition, subtraction, multiplication) and on various types of data (e.g., scalar floating point, packed integer, packed floating point, vector integer, vector floating point). While some embodiments may include a number of execution units dedicated to specific functions or sets of functions, other embodiments may include only one execution unit or multiple execution units that all perform all functions. Execution units 1162 may also include an address generation unit (AGU) to calculate addresses used by the core to access main memory and a page miss handler (PMH).

[0081] The scheduler unit(s) 1156, physical register file(s) unit(s) 1158, and execution cluster(s) 1160 are shown as being possibly plural because certain embodiments create separate pipelines for certain types of data/operations (e.g., a scalar integer pipeline, a scalar floating point/packed integer/packed floating point/vector integer/vector floating point pipeline, and/or a memory access pipeline that each have their own scheduler unit, physical register file(s) unit, and/or execution cluster--and in the case of a separate memory access pipeline, certain embodiments are implemented in which only the execution cluster of this pipeline has the memory access unit(s) 1164). It should also be understood that where separate pipelines are used, one or more of these pipelines may be out-of-order issue/execution and the rest in-order.

[0082] The set of memory access units 1164 is coupled to the memory unit 1170, which includes a data TLB unit 1172 coupled to a data cache unit 1174 coupled to a level 2 (L2) cache unit 1176. In one example embodiment, the memory access units 1164 may include a load unit, a store address unit, and a store data unit, each of which is coupled to the data TLB unit 1172 in the memory unit 1170. The instruction cache unit 1134 is further coupled to a level 2 (L2) cache unit 1176 in the memory unit 1170. The L2 cache unit 1176 is coupled to one or more other levels of cache and eventually to a main memory. In addition, a page miss handler may also be included in core 1190 to look up an address mapping in a page table if no match is found in the data TLB unit 1172.

[0083] By way of example, the example register renaming, out-of-order issue/execution core architecture may implement the pipeline 1100 as follows: 1) the instruction fetch unit 1138 performs the fetch and length decoding stages 1102 and 1104; 2) the decode unit 1140 performs the decode stage 1106; 3) the rename/allocator unit 1152 performs the allocation stage 1108 and renaming stage 1110; 4) the scheduler unit(s) 1156 performs the schedule stage 1112; 5) the physical register file(s) unit(s) 1158 and the memory unit 1170 perform the register read/memory read stage 1114; 6) the execution cluster 1160 performs the execute stage 1116; 7) the memory unit 1170 and the physical register file(s) unit(s) 1158 perform the write back/memory write stage 1118; 8) various units may be involved in the exception handling stage 1122; and 9) the retirement unit 1154 and the physical register file(s) unit(s) 1158 perform the commit stage 1124.

[0084] The core 1190 may support one or more instruction sets (e.g., the x86 instruction set (with some extensions that have been added with newer versions); the MIPS instruction set of MIPS Technologies of Sunnyvale, Calif.; the ARM instruction set (with optional additional extensions such as NEON) of ARM Holdings of Sunnyvale, Calif.), including the instruction(s) described herein. In one embodiment, the core 1190 includes logic to support a packed data instruction set extension (e.g., AVX1, AVX2), thereby allowing the operations used by many multimedia applications to be performed using packed data.

[0085] It should be understood that the core may support multithreading (executing two or more parallel sets of operations or threads), and may do so in a variety of ways including time sliced multithreading, simultaneous multithreading (where a single physical core provides a logical core for each of the threads that physical core is simultaneously multithreading), or a combination thereof (e.g., time sliced fetching and decoding and simultaneous multithreading thereafter such as in the Intel® Hyperthreading technology). Accordingly, in at least some embodiments, multi-threaded enclaves may be supported.

[0086] While register renaming is described in the context of out-of-order execution, it should be understood that register renaming may be used in an in-order architecture. While the illustrated embodiment of the processor also includes separate instruction and data cache units 1134/1174 and a shared L2 cache unit 1176, alternative embodiments may have a single internal cache for both instructions and data, such as, for example, a Level 1 (L1) internal cache, or multiple levels of internal cache. In some embodiments, the system may include a combination of an internal cache and an external cache that is external to the core and/or the processor. Alternatively, all of the cache may be external to the core and/or the processor.

[0087] FIG. 12 illustrates a computing system 1200 that is arranged in a point-to-point (PtP) configuration according to an embodiment. In particular, FIG. 12 shows a system where processors, memory, and input/output devices are interconnected by a number of point-to-point interfaces. Generally, one or more of the computing systems or computing devices described herein may be configured in the same or similar manner as computing system 1200.

[0088] Processors 1270 and 1280 may be implemented as single core processors 1274a and 1284a or multi-core processors 1274a-1274b and 1284a-1284b. Processors 1270 and 1280 may each include a cache 1271 and 1281 used by their respective core or cores. A shared cache (not shown) may be included in either processor or outside of both processors, yet connected with the processors via P-P interconnect, such that either or both processors' local cache information may be stored in the shared cache if a processor is placed into a low power mode.

[0089] Processors 1270 and 1280 may also each include integrated memory controller logic (MC) 1272 and 1282 to communicate with memory elements 1232 and 1234, which may be portions of main memory locally attached to the respective processors. In alternative embodiments, memory controller logic 1272 and 1282 may be discrete logic separate from processors 1270 and 1280. Memory elements 1232 and/or 1234 may store various data to be used by processors 1270 and 1280 in achieving operations and functionality outlined herein.

[0090] Processors 1270 and 1280 may be any type of processor, such as those discussed in connection with other figures. Processors 1270 and 1280 may exchange data via a point-to-point (PtP) interface 1250 using point-to-point interface circuits 1278 and 1288, respectively. Processors 1270 and 1280 may each exchange data with an input/output (I/O) subsystem 1290 via individual point-to-point interfaces 1252 and 1254 using point-to-point interface circuits 1276, 1286, 1294, and 1298. I/O subsystem 1290 may also exchange data with a high-performance graphics circuit 1238 via a high-performance graphics interface 1239, using an interface circuit 1292, which could be a PtP interface circuit. In one embodiment, the high-performance graphics circuit 1238 is a special-purpose processor, such as, for example, a high-throughput MIC processor, a network or communication processor, compression engine, graphics processor, GPGPU, embedded processor, or the like. I/O subsystem 1290 may also communicate with a display 1233 for displaying data that is viewable by a human user. In alternative embodiments, any or all of the PtP links illustrated in FIG. 12 could be implemented as a multi-drop bus rather than a PtP link.

[0091] I/O subsystem 1290 may be in communication with a bus 1220 via an interface circuit 1296. Bus 1220 may have one or more devices that communicate over it, such as a bus bridge 1218 and I/O devices 1216. Via a bus 1210, bus bridge 1218 may be in communication with other devices such as a user interface 1212 (such as a keyboard, mouse, touchscreen, or other input devices), communication devices 1226 (such as modems, network interface devices, or other types of communication devices that may communicate through a computer network 1260), audio I/O devices 1214, and/or a data storage device 1228. Data storage device 1228 may store code and data 1230, which may be executed by processors 1270 and/or 1280. In alternative embodiments, any portions of the bus architectures could be implemented with one or more PtP links.

[0092] The computer system depicted in FIG. 12 is a schematic illustration of an embodiment of a computing system that may be utilized to implement various embodiments discussed herein. It will be appreciated that various components of the system depicted in FIG. 12 may be combined in a system-on-a-chip (SoC) architecture or in any other suitable configuration capable of achieving the functionality and features of examples and implementations provided herein.

[0093] Although this disclosure has been described in terms of certain implementations and generally associated methods, alterations and permutations of these implementations and methods will be apparent to those skilled in the art. For example, the actions described herein can be performed in a different order than as described and still achieve the desirable results. As one example, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve the desired results. In certain implementations, multitasking and parallel processing may be advantageous. Other variations are within the scope of the following claims.

[0094] The architectures presented herein are provided by way of example only, and are intended to be non-exclusive and non-limiting. Furthermore, the various parts disclosed are intended to be logical divisions only, and need not necessarily represent physically separate hardware and/or software components. Certain computing systems may provide memory elements in a single physical memory device, and in other cases, memory elements may be functionally distributed across many physical devices. In the case of virtual machine managers or hypervisors, all or part of a function may be provided in the form of software or firmware running over a virtualization layer to provide the disclosed logical function.

[0095] Note that with the examples provided herein, interaction may be described in terms of a single computing system. However, this has been done for purposes of clarity and example only. In certain cases, it may be easier to describe one or more of the functionalities of a given set of flows by only referencing a single computing system. Moreover, the system for deep learning and malware detection is readily scalable and can be implemented across a large number of components (e.g., multiple computing systems), as well as more complicated/sophisticated arrangements and configurations. Accordingly, the examples provided should not limit the scope or inhibit the broad teachings of the computing system as potentially applied to a myriad of other architectures.

[0096] As used herein, unless expressly stated to the contrary, use of the phrase `at least one of` refers to any combination of the named elements, conditions, or activities. For example, `at least one of X, Y, and Z` is intended to mean any of the following: 1) at least one X, but not Y and not Z; 2) at least one Y, but not X and not Z; 3) at least one Z, but not X and not Y; 4) at least one X and Y, but not Z; 5) at least one X and Z, but not Y; 6) at least one Y and Z, but not X; or 7) at least one X, at least one Y, and at least one Z.
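The seven cases listed above are exactly the non-empty combinations of the three named elements (2^3 - 1 = 7). As a minimal sketch, the enumeration can be reproduced as follows; the helper name `at_least_one_of` is hypothetical and is used only for illustration.

```python
from itertools import combinations


def at_least_one_of(*names):
    """Return every non-empty combination of the named elements."""
    return [
        combo
        for size in range(1, len(names) + 1)
        for combo in combinations(names, size)
    ]


cases = at_least_one_of("X", "Y", "Z")
for number, combo in enumerate(cases, start=1):
    print(number, " and ".join(combo))
```

The combinations are produced in the same order as the numbered cases in the paragraph above: the three single elements, then the three pairs, then all three together.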

[0097] Additionally, unless expressly stated to the contrary, the terms `first`, `second`, `third`, etc., are intended to distinguish the particular nouns (e.g., element, condition, module, activity, operation, claim element, etc.) they modify, but are not intended to indicate any type of order, rank, importance, temporal sequence, or hierarchy of the modified noun. For example, `first X` and `second X` are intended to designate two separate X elements that are not necessarily limited by any order, rank, importance, temporal sequence, or hierarchy of the two elements.

[0098] References in the specification to "one embodiment," "an embodiment," "some embodiments," etc., indicate that the embodiment(s) described may include a particular feature, structure, or characteristic, but every embodiment may or may not necessarily include that particular feature, structure, or characteristic. Moreover, such phrases are not necessarily referring to the same embodiment.

[0099] While this specification contains many specific implementation details, these should not be construed as limitations on the scope of any embodiments or of what may be claimed, but rather as descriptions of features specific to particular embodiments. Certain features that are described in this specification in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination. Moreover, although features may be described above as acting in certain combinations and even initially claimed as such, one or more features from a claimed combination can in some cases be excised from the combination, and the claimed combination may be directed to a subcombination or variation of a subcombination.

[0100] Similarly, the separation of various system components and modules in the embodiments described above should not be understood as requiring such separation in all embodiments. It should be understood that the described program components, modules, and systems can generally be integrated together in a single software product or packaged into multiple software products.

[0101] Thus, particular embodiments of the subject matter have been described. Other embodiments are within the scope of this disclosure. Numerous other changes, substitutions, variations, alterations, and modifications may be ascertained to one skilled in the art and it is intended that the present disclosure encompass all such changes, substitutions, variations, alterations, and modifications as falling within the scope of the appended claims.

[0102] The following examples pertain to embodiments in accordance with this specification. It will be understood that one or more aspects of certain examples described below may be combined with or implemented in certain other examples, including examples not explicitly indicated.

[0103] Example 1 includes a processor comprising: data cache units storing encrypted data; and a microprocessor pipeline coupled to the data cache units. The microprocessor pipeline comprises circuitry to access and execute a sequence of cryptographic-based instructions based on the encrypted data. Execution of the sequence of cryptographic-based instructions comprises at least one of: decryption of the encrypted data based on a first pointer value; execution of a cryptographic-based instruction based on data obtained from decryption of the encrypted data; encryption of a result of execution of a cryptographic-based instruction, wherein the encryption is based on a second pointer value; and storage of encrypted data in the data cache units, wherein the encrypted data stored in the data cache units is based on an encrypted result of execution of a cryptographic-based instruction.

[0104] Example 2 includes the subject matter of Example 1, and optionally, wherein the circuitry is further to: generate, for each cryptographic-based instruction, at least one encryption-based microoperation and at least one non-encryption-based microoperation for the cryptographic-based instruction; and schedule the at least one encryption-based microoperation and the at least one non-encryption-based microoperation for execution based on timings of the encryption-based microoperation.

[0105] Example 3 includes the subject matter of Example 2, and optionally, wherein the encryption-based microoperation is based on a block cipher, and the non-encryption-based microoperation is scheduled as dependent upon the encryption-based microoperation.

[0106] Example 4 includes the subject matter of Example 2, and optionally, wherein the encryption-based microoperation is based on a counter mode block cipher, and the non-encryption-based microoperation is scheduled to execute in parallel with encryption of a counter.
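The scheduling difference between Examples 3 and 4 can be illustrated with a simple latency model. This is a rough sketch under stated assumptions: the function name `completion_time`, the mode labels, and all cycle counts are hypothetical and are not taken from this disclosure. It assumes that a dependency between the two microoperations serializes them, and that in counter mode a single XOR combines the encrypted counter with the loaded data.

```python
def completion_time(mode, cipher_latency, memory_latency, xor_latency=1):
    """Cycles until one encrypted memory access has completed.

    All latencies are illustrative parameters chosen by the caller.
    """
    if mode == "block":
        # Example 3: one microoperation depends on the other, so the
        # cipher latency and the memory latency add up.
        return cipher_latency + memory_latency
    if mode == "counter":
        # Example 4: the counter is encrypted while the memory access is
        # in flight; only the final XOR waits for both.
        return max(cipher_latency, memory_latency) + xor_latency
    raise ValueError("mode must be 'block' or 'counter'")


print(completion_time("block", cipher_latency=10, memory_latency=4))
print(completion_time("counter", cipher_latency=10, memory_latency=4))
```

With these illustrative numbers the serialized case takes 14 cycles and the overlapped case takes 11, which is the kind of timing information a scheduler could take into account.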

[0107] Example 5 includes the subject matter of Example 2, and optionally, wherein the encryption-based microoperation is one of an encryption operation and a decryption operation.

[0108] Example 6 includes the subject matter of Example 2, and optionally, wherein the non-encryption-based microoperation is one of a load operation and a store operation.

[0109] Example 7 includes the subject matter of any one of Examples 1-6, and optionally, wherein the circuitry is to decrypt the encrypted data by using the first pointer value as an input to a decryption function.

[0110] Example 8 includes the subject matter of Example 7, and optionally, wherein the circuitry to decrypt the encrypted data is in a load buffer of the processor.

[0111] Example 9 includes the subject matter of Example 7, and optionally, wherein the circuitry is to decrypt the encrypted data further by: generating a key stream based on the first pointer value and a counter value; and performing an XOR operation on the key stream and the encrypted data to yield the decrypted data.
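The key-stream approach of Example 9 can be sketched as follows. This is an illustration only: SHA-256 stands in for the cipher that produces the key stream, and the function names, the 8-byte little-endian encoding of the pointer and counter, and the key format are all assumptions of the sketch rather than details of this disclosure.

```python
import hashlib


def key_stream(key: bytes, pointer: int, counter: int, length: int) -> bytes:
    """Derive `length` key-stream bytes from a key, a pointer, and a counter.

    SHA-256 is a stand-in for the real cipher (an assumption of this sketch).
    """
    out = b""
    block = counter
    while len(out) < length:
        message = key + pointer.to_bytes(8, "little") + block.to_bytes(8, "little")
        out += hashlib.sha256(message).digest()
        block += 1
    return out[:length]


def xor_with_key_stream(data: bytes, key: bytes, pointer: int, counter: int = 0) -> bytes:
    """XOR data with the key stream; the same call encrypts and decrypts."""
    stream = key_stream(key, pointer, counter, len(data))
    return bytes(d ^ s for d, s in zip(data, stream))
```

Because XOR is its own inverse, applying `xor_with_key_stream` twice with the same key, pointer value, and counter returns the original bytes, while a different pointer value yields a different key stream and therefore does not recover the data.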

[0112] Example 10 includes the subject matter of any one of Examples 1-6, and optionally, wherein the circuitry is to encrypt the result of the execution of the cryptographic-based instruction by using the second pointer value as an input to an encryption function.

[0113] Example 11 includes the subject matter of Example 10, and optionally, wherein the circuitry to encrypt the result of the execution of the cryptographic-based instruction is in a store buffer of the processor.

[0114] Example 12 includes the subject matter of any one of Examples 1-6, and optionally, wherein at least one of the first pointer value and the second pointer value is an effective address based on an encoded linear address that is at least partially encrypted, and the circuitry is further to: access the encoded linear address; decrypt an encrypted portion of the encoded linear address based on a key obtained from a register of the processor; and generate the effective address based on a result of the decryption of the encrypted portion of the encoded linear address.
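The three steps of Example 12 (access the encoded linear address, decrypt its encrypted portion with a key, and generate the effective address) can be sketched as below. The pointer layout is purely hypothetical: the sketch assumes a 16-bit encrypted slice at bits 32 to 47 of a 64-bit pointer. A SHA-256-derived XOR pad stands in for the actual cipher, and the `context` argument follows the abstract's description of a context value used as a tweak input. None of the names or constants come from this disclosure.

```python
import hashlib

SLICE_SHIFT = 32  # assumption: encrypted slice starts at bit 32
SLICE_BITS = 16   # assumption: encrypted slice is 16 bits wide


def _pad(key: bytes, context: int) -> int:
    """Derive a pad covering only the slice bits (stand-in for a real cipher)."""
    digest = hashlib.sha256(key + context.to_bytes(8, "little")).digest()
    return int.from_bytes(digest[: SLICE_BITS // 8], "little") << SLICE_SHIFT


def encode_pointer(linear_address: int, key: bytes, context: int) -> int:
    """Encrypt the slice of a linear address, leaving other bits in the clear."""
    return linear_address ^ _pad(key, context)


def effective_address(encoded_pointer: int, key: bytes, context: int) -> int:
    """Decrypt the encrypted slice and return the effective address."""
    return encoded_pointer ^ _pad(key, context)
```

Only the slice bits change between the encoded pointer and the effective address; the remaining bits pass through untouched, so address arithmetic on the low-order bits would still be possible under this assumed layout.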

[0115] Example 13 includes the subject matter of Example 12, and optionally, wherein the entirety of the encoded linear address is encrypted.

[0116] Example 14 includes the subject matter of Example 12, and optionally, wherein the circuitry to decrypt the encoded linear address is in an address generation unit of the processor.

[0117] Example 15 includes a method comprising: accessing a sequence of cryptographic-based instructions to execute on encrypted data stored in data cache units of a processor; and executing the sequence of cryptographic-based instructions by a core of the processor, wherein execution comprises one or more of: decryption of the encrypted data based on a first pointer value; execution of a cryptographic-based instruction based on data obtained from decryption of the encrypted data; encryption of a result of execution of a cryptographic-based instruction, wherein the encryption is based on a second pointer value; and storage of encrypted data in the data cache units, wherein the encrypted data stored in the data cache units is based on an encrypted result of execution of a cryptographic-based instruction.

[0118] Example 16 includes the subject matter of Example 15, and optionally, wherein executing the sequence of cryptographic-based instructions comprises: generating, for each cryptographic-based instruction, at least one encryption-based microoperation and at least one non-encryption-based microoperation for the cryptographic-based instruction; scheduling the at least one encryption-based microoperation and the at least one non-encryption-based microoperation for execution based on timings of the encryption-based microoperation; and executing the scheduled microoperations.

[0119] Example 17 includes the subject matter of Example 16, and optionally, wherein the encryption-based microoperation is based on a block cipher, and the non-encryption-based microoperation is scheduled as dependent upon the encryption-based microoperation.

[0120] Example 18 includes the subject matter of Example 16, and optionally, wherein the encryption-based microoperation is based on a counter mode block cipher, and the non-encryption-based microoperation is scheduled to execute in parallel with encryption of a counter.

[0121] Example 19 includes the subject matter of Example 16, and optionally, wherein the encryption-based microoperation is one of an encryption operation and a decryption operation, and the non-encryption-based microoperation is one of a load operation and a store operation.

[0122] Example 20 includes the subject matter of Example 19, and optionally, wherein the encryption operation and decryption operation each utilize a pointer value as a tweak input.

[0123] Example 21 includes the subject matter of any one of Examples 16-20, and optionally, wherein the decryption is performed by circuitry coupled to, or implemented in, a load buffer of the processor.

[0124] Example 22 includes the subject matter of any one of Examples 16-20, and optionally, wherein the encryption is performed by circuitry coupled to, or implemented in, a store buffer of the processor.

[0125] Example 23 includes the subject matter of any one of Examples 16-20, and optionally, wherein decrypting the encrypted data comprises: generating a key stream based on the first pointer value and a counter value; and performing an XOR operation on the key stream and the encrypted data to yield the decrypted data.

[0126] Example 24 includes the subject matter of any one of Examples 16-20, and optionally, wherein at least one of the first pointer value and the second pointer value is an effective address based on an encoded linear address that is at least partially encrypted, and the method further comprises: accessing the encoded linear address; decrypting an encrypted portion of the encoded linear address based on a key obtained from a register of the processor; and generating the effective address based on a result of the decryption of the encrypted portion of the encoded linear address.

[0127] Example 25 includes the subject matter of Example 24, and optionally, wherein the entirety of the encoded linear address is encrypted.

[0128] Example 26 includes the subject matter of Example 24, and optionally, wherein the decryption of the encoded linear address is by an address generation unit of the processor.

[0129] Example 27 includes a system comprising: memory storing cryptographic-based instructions, and a processor coupled to the memory. The processor comprises: data cache units storing encrypted data; means for accessing the cryptographic-based instructions, the cryptographic instructions to execute based on the encrypted data; means for decrypting the encrypted data based on a first pointer value; means for executing the cryptographic-based instruction using the decrypted data; means for encrypting a result of the execution of the cryptographic-based instruction based on a second pointer value; and means for storing the encrypted result in the data cache units.

[0130] Example 28 includes the subject matter of Example 27, and optionally, wherein the means for decrypting the encrypted data comprises a load buffer of the processor.

[0131] Example 29 includes the subject matter of Example 27, and optionally, wherein the means for encrypting a result of the execution of the cryptographic-based instruction comprises a store buffer of the processor.

[0132] Example 30 includes the subject matter of any one of Examples 27-29, and optionally, wherein at least one of the first pointer value and the second pointer value is an effective address based on an encoded linear address that is at least partially encrypted, and the processor further comprises additional means for: accessing the encoded linear address; decrypting an encrypted portion of the encoded linear address based on a key obtained from a register of the processor; and generating the effective address based on a result of the decryption of the encrypted portion of the encoded linear address.

[0133] Example 31 includes the subject matter of Example 30, and optionally, wherein the additional means comprises an address generation unit of the processor.

[0134] Example 32 includes a processor core supporting the encryption and the decryption of pointers, keys, and data in the core, where such encryption and decryption operations are performed by logic and circuitry which is part of the processor microarchitecture pipeline.

[0135] Example 33 includes the subject matter of Example 32, and optionally, wherein instructions that perform encrypted memory loads and stores are mapped into at least one block encryption μop and at least one regular load/store μop.

[0136] Example 34 includes the subject matter of Example 32, and optionally, wherein an in-order or out-of-order execution scheduler schedules the execution of encryption, decryption, and load/store μops, and where load and store μops are considered as dependent on one of a block encryption μop and a block decryption μop.

[0137] Example 35 includes the subject matter of Example 34, and optionally, wherein the out-of-order execution scheduler may schedule load and store μops to execute in parallel with the encryption of a counter.

[0138] Example 36 includes the subject matter of Example 32, and optionally, wherein decryption of data is tweaked by a pointer and the decryption takes place in the load buffer.

[0139] Example 37 includes the subject matter of Example 32, and optionally, wherein encryption of data is tweaked by a pointer and the encryption takes place in the store buffer.

[0140] Example 38 includes the subject matter of Example 32, and optionally, wherein decryption of a pointer takes place in the address generation unit.

[0141] Example 39 includes the subject matter of Example 32, and optionally, wherein decryption of a slice of a base takes place in the address generation unit.

[0142] Example 40 may include a device comprising logic, modules, circuitry, or other means to perform one or more elements of a method described in or related to any of the examples above or any other method or process described herein.

* * * * *