1 Introduction

This paper extends our previous work [1], which described an SDK that includes a fully instruction- and cycle-accurate Intel 8080/Z80 emulator [2]. The emulator executes all salient parts of execution on a quantum computer using quantum circuits, and the previous work showed how the capabilities of such quantum computers can be extended with traditional logic functions. That approach focused heavily on emulating the traditional CPU as closely as possible without prioritizing optimization. It was realized by constructing quantum circuit-based logic gates and rebuilding the CPU as it would be constructed “in silicon” [3]. In this paper, we again override the emulated CPU instructions, but this time the main focus is on improving efficiency and making better use of the capabilities of a quantum computer. These improvements are delivered via additional selectable quantum implementation methods for each CPU instruction, which allows the user to contrast the initial functional methods with the newly delivered optimized methods. The quantum implementation methods now incorporate the entire functionality of the instruction in the quantum circuit, including the setting of the flags register. This allows for an interesting adaptation of the techniques used in a traditional CPU pipeline, where multiple CPU instructions can be combined into a single quantum circuit, realizing some of the benefits of the traditional CPU pipeline but with quantum circuits. This greatly increases the amount of work that can be achieved in a single quantum circuit. The paper also describes a machine code to quantum instruction mapping table, enabling an interesting proposal of a hybrid CPU architecture with quantum calls included in the CPU itself.
The motivation of this paper aligns with the goals of the previous paper, in that the SDK it describes provides a learning tool and a reusable reference framework that will help developers overcome the steep learning curve and derive useful results from quantum computers. The framework will show developers that quantum computers can do anything a traditional computer can, just in a different manner. That manner has now been optimized under continual improvement, and the improvements warrant exploration.

1.1 Rationale

The SDK will be greatly improved by offering multiple reusable methods for each instruction. These methods can be contrasted and referenced by the programmer, allowing them to initially use a method they are familiar with and then progress to methods that take better advantage of quantum computers.

2 Overview

2.1 The grouping of instructions

To save silicon die space and consequently cost, CPU designers reuse circuits wherever possible [4]. For example, a decrement circuit reuses a full adder circuit with all the secondary input bits set to 1, which has the effect of rolling the input over to itself minus one. This reusability was originally exploited in the creation of reusable methods for each of the CPU instruction groups. Quantum methods for the instruction grouping listed in Table 1 were also created with a focus on reusability, so that they can be combined to construct a full instruction set for any traditional CPU.
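
The decrement-by-adder reuse can be illustrated with a minimal classical sketch. It is not taken from the SDK; the function names `full_adder`, `ripple_add` and `decrement` are hypothetical and are only meant to show why adding an all-ones operand yields the input minus one.

```python
def full_adder(a, b, carry_in):
    """One-bit full adder built from XOR, AND and OR operations."""
    partial = a ^ b
    return partial ^ carry_in, (a & b) | (partial & carry_in)


def ripple_add(x, y, width=8):
    """Add two unsigned integers bit by bit; the final carry is discarded."""
    result, carry = 0, 0
    for i in range(width):
        bit, carry = full_adder((x >> i) & 1, (y >> i) & 1, carry)
        result |= bit << i
    return result


def decrement(x, width=8):
    """Decrement by feeding all ones into the adder's second input."""
    return ripple_add(x, (1 << width) - 1, width)
```

With an 8-bit width, `decrement(5)` gives 4 and `decrement(0)` wraps around to 255, which is the roll-over behaviour described above.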

Table 1 Instruction grouping for quantum methods

3 Instruction methods

The SDK and fully functioning emulator provide a unique platform to explore alternative approaches for overriding and implementing CPU instructions. This work presents some alternatives which have been designed to more effectively utilize the benefits of quantum computing, reduce execution time and improve the educational and accessibility aspects of the SDK.

To this end, each CPU instruction will be implemented with two to four user-selectable methods. Method 1 implements the instruction on a quantum computer as it would be executed in silicon. Method 2 presents an optimized approach where the entire instruction is executed in the quantum domain. Versions of Methods 1 and 2 were delivered in release 1 of the SDK. Methods 3 and 4 take better advantage of quantum gates and the effects of quantum computing. The novelty of these additional and alternative methods lies in their usability and reusability, allowing the developer to pick and choose the method that best fits their requirements. The improvements described in this work will be delivered in the next version release of the SDK. This approach allows for the comparison and use of methods delivered by the authors and the wide variety of highly optimized circuits delivered by external researchers. It is not the intention of the SDK/emulator to provide an exhaustive inclusion of the quantum circuits delivered by the research community, but rather to provide a reusable framework for using, selecting and contrasting quantum methods.

3.1 Method 1—copying a traditional silicon-based approach with quantum-based logic circuits

The baseline of the SDK is implemented with quantum circuits [5] that are used to represent the traditional logic gates. These quantum logic gates are joined to construct larger circuits such as adders, subtractors and latches. This process mimics how the instructions would be constructed as part of a CPU in silicon. For example, the quantum circuits in Figs. 1 and 2 can be combined to construct an increment circuit (Fig. 3) which in turn can be used to emulate the INC instruction.

Fig. 1

A quantum XOR gate implementation using two Ry gates and a C-Not

Fig. 2

A quantum AND gate implementation using two Ry gates and a Toffoli gate

Fig. 3

A traditional increment circuit using AND and XOR gates
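
As a classical illustration of how XOR and AND gates chain into an incrementer, the sketch below applies one XOR and one AND per bit. It models only the logic, not the quantum gates themselves, and the helper names are hypothetical.

```python
def to_bits(value, width):
    """Integer to a list of bits, least significant bit first."""
    return [(value >> i) & 1 for i in range(width)]


def from_bits(bits):
    """List of bits (least significant first) back to an integer."""
    return sum(b << i for i, b in enumerate(bits))


def increment(bits):
    """Increment using one XOR and one AND per bit (a half-adder chain)."""
    carry = 1  # adding one is the same as feeding a carry into bit 0
    out = []
    for b in bits:
        out.append(b ^ carry)  # XOR gate produces the sum bit
        carry = b & carry      # AND gate produces the carry into the next bit
    return out, carry
```

For example, `increment(to_bits(0b0111, 4))` returns the bits of `0b1000` with a carry of 0, while `0b1111` rolls over to 0 with a carry of 1.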

Although a useful guide and interesting proof of concept for performing bitwise, logic and arithmetic functions on a quantum computer, this technique does not take advantage of the underlying benefits of quantum computers such as being able to handle multiple states in one qubit.

3.2 Method 2—executing the circuit entirely in the quantum domain

Previous work that presented the Qx86 SDK showed that the baseline approach is logically accurate but extremely slow [1]. This shortfall is mostly due to the API-based method of accessing IBM’s quantum computers [6], as each circuit/logic gate must join a queue for execution. This is compounded by some larger circuits, such as a 16-bit subtractor, requiring over 100 logic gates; passing information back and forth for each of these gates has no practical benefit (other than for process clarification). To mitigate this shortfall, quantum circuits can be created that combine the required logic gates into a single circuit. For the practical implementation of larger 8- or 16-bit instructions, multiple 4-bit variants were created where noise or the number of qubits available would not allow for a full implementation. The scalability of larger instructions was assessed on a case-by-case basis and was primarily dependent on the number of gates/qubits required to implement the instruction. For example, a fictitious 16-bit XOR circuit requires fewer qubits and gates than a 4-bit subtractor. An example of this approach is shown in Fig. 8, where the XOR and AND gates are combined to implement an 8-bit increment instruction. This method has vastly improved execution time, as the majority of instructions (excluding repetitive instructions such as CPIR, LDIR, etc.) can be completed utilizing one or two circuits. This approach requires more qubits, but care was taken to ensure that no unreasonable overheads existed. For this reason, the qubit requirements were kept below 32 for all circuits to ensure they could be performed on a Falcon r5.11H class quantum computer [7]. The accuracy of each of the 1300+ CPU instructions was assessed and only deemed satisfactory if the circuit was executed accurately over 2000 shots.

3.3 Methods 3 & 4—new and experimental methods for implementing instructions

This section presents a number of examples that describe the methods and quantum circuits required to create the new alternative (improved) instruction methods.

  1.

    Utilizing the probabilistic nature of qubits. This method is suited to mathematical functions such as the ADD, SUB, INC and DEC instructions. The input numbers of arithmetic-based instructions can be encoded as the rotation angles of two quantum Ry gates. Both Ry gates are then applied to one qubit (Fig. 4). Each input number is scaled to a fraction of an arbitrary word size (in this example a 4-bit word size, 0–15) and this fraction is then scaled to the range 0–π, which represents the rotation angle of the Ry gate. Both Ry rotations are performed on one qubit and the resulting qubit probability is descaled to equal the sum of the original inputs.

Fig. 4

Adding the numbers A and B by scaling them into Ry rotations and decoding the sum from the resultant probability

Figure 5 shows how the input number 9 scales to a predictable qubit probability when run through a Ry gate.

Fig. 5

A cross reference of the mapping between input number, the required Ry rotation and output probability

As an example, consider adding two input numbers, 4 and 5. These input numbers can be encoded with the Ry rotations of 0.84 and 1.05, which will result in the qubit probability of 34.78%. The resultant probability can be descaled back into the sum by referencing the chart in Fig. 5 or running the code in Listing 1. It was important to fine-tune the word size versus noise to allow enough margin of error so that all results were correct. In this example, a 4-bit word size was found to be the largest word size that could be encoded into an Ry rotation before errors were detected.

Listing 1

A pseudocode demonstration of converting input numbers to Ry rotations and decoding the total from the resultant qubit probability

This approach has the advantage of reducing the required number of qubits to 1, and the size of the numbers being added (4-bit, 8-bit, etc.) is limited only by the noise inherent in the system. However, it requires the orchestrating computer to perform more tasks (such as the encoding of the input rotation, decoding of the sum from the probability and handling of overflows).
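
The scaling and descaling steps can be sketched for the noiseless case as follows. This is not the paper's Listing 1: the function names are hypothetical, and the sketch assumes the ideal relation that an Ry rotation by θ applied to |0⟩ yields |1⟩ with probability sin²(θ/2), with overflow beyond the word size left out.

```python
import math

WORD_MAX = 15  # 4-bit word size used in the example above


def encode(n, word_max=WORD_MAX):
    """Scale n in 0..word_max to an Ry rotation angle in 0..pi."""
    return n / word_max * math.pi


def ideal_p1(theta):
    """Ideal probability of measuring |1> after Ry(theta) on |0>."""
    return math.sin(theta / 2) ** 2


def decode(p1, word_max=WORD_MAX):
    """Invert the ideal mapping and round to the nearest integer."""
    clamped = min(max(p1, 0.0), 1.0)
    theta = 2 * math.asin(math.sqrt(clamped))
    return round(theta / math.pi * word_max)


theta_a, theta_b = encode(4), encode(5)
# Two Ry rotations on the same qubit compose by adding their angles.
p1 = ideal_p1(theta_a + theta_b)
total = decode(p1)
```

Here the two angles round to 0.84 and 1.05 as in the example, and the decoded total is 9. The ideal |1⟩ probability is about 0.6545, so the complementary |0⟩ probability is about 0.3455, which is in the neighbourhood of the 34.78% quoted above; whether that quoted figure refers to the |0⟩ outcome of a noisy run is an assumption.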

  2.

    Using entanglement via a Hadamard gate as a latch circuit for BIT, SET, RES instructions and their combination into LOAD instructions.

As described in our previous paper [1], qubit entanglement using a Hadamard gate (Fig. 6) offers an interesting parallel with a latch circuit.

Fig. 6

A quantum circuit which entangles qubits 0 and 1 via a Hadamard gate and a controlled-NOT gate

The SET, RES and by proxy all load instructions utilize latch circuits. If the logic table of a latch circuit (Table 2) is compared against a simple Hadamard entanglement circuit (Table 3), the similarities can be seen and exploited.

Table 2 Latch circuit truth table
Table 3 A Hadamard-based (entanglement) quantum circuit truth table

Using Fig. 6 and Table 3 as a guide, if the “Set” qubit rotation is set to 0 and the “Reset” qubit rotation to π, there is a 50% probability of the qubits collapsing to 0 and 1 and a 50% probability of the qubits collapsing to 1 and 0. Conversely, if the “Set” qubit rotation is set to π and the “Reset” qubit rotation to 0, then 00 and 11 will each be measured 50% of the time. Previously, the method achieved the same results as a latch circuit by performing a logical XOR (Listing 2) on the measured qubits.

Listing 2

Retrieving latch equivalent results from a Hadamard circuit
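
The measurement outcomes described above, and the XOR post-processing, can be checked with a small noiseless state-vector sketch in plain Python. This is not the paper's Listing 2. The gate order (an Ry on each qubit, a Hadamard on qubit 0, then a controlled-NOT from qubit 0 to qubit 1) is an assumption about the layout of Fig. 6, chosen because it reproduces the outcomes in the text; all function names are hypothetical.

```python
import math


def ry(theta):
    """2x2 matrix of an Ry rotation."""
    c, s = math.cos(theta / 2), math.sin(theta / 2)
    return [[c, -s], [s, c]]


R = 1 / math.sqrt(2)
HADAMARD = [[R, R], [R, -R]]


def apply_gate(state, gate, qubit):
    """Apply a single-qubit gate to a 2-qubit state (index = q1 * 2 + q0)."""
    new = [0.0] * 4
    for idx, amp in enumerate(state):
        bit = (idx >> qubit) & 1
        for out in (0, 1):
            target = (idx & ~(1 << qubit)) | (out << qubit)
            new[target] += gate[out][bit] * amp
    return new


def apply_cnot(state, control, target):
    """Flip the target qubit wherever the control qubit is 1."""
    new = [0.0] * 4
    for idx, amp in enumerate(state):
        flipped = idx ^ (1 << target) if (idx >> control) & 1 else idx
        new[flipped] += amp
    return new


def outcome_probabilities(set_angle, reset_angle):
    """Map each (q0, q1) outcome with non-negligible probability to that probability."""
    state = [1.0, 0.0, 0.0, 0.0]
    state = apply_gate(state, ry(set_angle), 0)
    state = apply_gate(state, ry(reset_angle), 1)
    state = apply_gate(state, HADAMARD, 0)
    state = apply_cnot(state, 0, 1)
    return {(idx & 1, idx >> 1): amp * amp
            for idx, amp in enumerate(state) if amp * amp > 1e-9}


def xor_readout(set_angle, reset_angle):
    """XOR of the two measured bits, collected over all possible outcomes."""
    return {q0 ^ q1 for q0, q1 in outcome_probabilities(set_angle, reset_angle)}
```

Under these assumptions, `xor_readout(0, math.pi)` contains only 1 and `xor_readout(math.pi, 0)` contains only 0, so the XOR is deterministic even though each individual qubit measurement is random.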

The instruction method can be improved by keeping the entire function in the quantum domain via the inclusion of the XOR calculation in the quantum circuit. This can be achieved with the extended quantum circuit shown in Fig. 7. These improvements can again be used in all BIT/SET/RES instructions and combined with multiple copies to form an 8-bit or 16-bit LOAD instruction.

Fig. 7

An extended entanglement circuit with additional XOR functionality

This approach further simplifies the quantum circuit required to implement a latch and all subsequent LOAD instructions. The Hadamard-based latch circuit also reduces the number of qubits needed (compared to mimicking the “in silicon” approach) to perform an 8-bit LOAD from 28 to 16 and the number of gates from 122 to 32.

This approach is not without its drawbacks, as consideration must be given to the physical layout of the underlying qubits; hence, it is not sufficient to rely on the circuit/intermediate representations only. Intermediate representations abstract the physical implementation and can utilize additional non-user-visible gates at build/optimization time. IBM uses either a heavy-hex or square lattice approach to qubit layout. This results in many qubits not being directly connected, and thus swap gates are required to pass quantum information across noncontiguous qubits in the chain. This is generally not an issue on smaller circuits where the qubit location can be easily optimized. However, circuits which utilize more qubits inherently compound the noise problem due to the difficulty in arranging multiple entangled qubits. The impact of these challenges is limited with this method, as each bit of the load uses adjacent/directly connected qubit pairs for the source and destination, so the drawbacks will not be realized. Larger, more complex instructions such as the SUB(tract) instruction can take advantage of recent advancements in the Qiskit SDK and IBM’s quantum offering, which allow for in-circuit measurement and thus “in circuit” quantum teleportation. This offers another useful tool for reducing errors across larger circuits where the use of contiguous qubits is not achievable. However, this was not required for the realization of the emulator in this SDK, as noise levels were never high enough to impact the results of any instruction while using 2000 shots.

  3.

    An implementation of the LOAD/EXX instruction using quantum swap gates.

    In contrast to the original LOAD implementation that uses 8/16 latch circuits in parallel, quantum swap gates can be used to move the bits from the source to the destination. Care must be taken to use this technique only where the LOAD instruction has a transitory source value, otherwise there is a risk of overwriting the source with the destination data. However, the EXX (exchange) instruction has no such constraints and is well suited to the use of quantum swap gates.

  4.

    An implementation of the ROTATE instruction using quantum swap gates.

    The rotate instructions RL, RLC, SLA, SLC, RR, RRC all perform variations of rotating the bits in a register left or right in combination with the C flag. These instructions make excellent candidates for the use of quantum swap gates as bits can be swapped from one qubit to the next. This approach is much more efficient than the silicon method of multiple in-line latch circuits.

  5.

    An implementation of the JP/JR instructions using quantum swap gates.

    This example is similar to example 3, except that instead of loading a destination with a source, the program counter is loaded with a 16-bit number. The JP/JR routines can be more complex as they often depend on the status of a flag, e.g., JP C jumps to a location if the C flag is set. This is not problematic, as the checking of the C flag can be prepended with a quantum C-not gate; the subsequent steps then follow the same process as in example 3, again being careful to use swap gates only if the source is transitory.

  6.

    The CP instruction.

    The “in silicon” approach for performing the CP (compare) instruction on the registers A and B involves subtracting B from A and checking if the result equals 0. This approach is sensible where silicon die space is restricted, as the SUB instruction can be reused. However, this is an inefficient, slow process for the quantum computer, as the subtractor circuit requires multiple XOR, NOT and AND gates to be performed on each bit in turn. The process can be optimized by performing the quantum implementation of an XOR circuit (Fig. 1) on each bit in parallel and then checking if the result equals 0.
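
Two of the ideas in the examples above, rotating a register with a chain of swaps and testing equality with per-bit XORs, can be sketched classically as follows. Python tuple swaps stand in for quantum swap gates, and the function names are hypothetical rather than taken from the SDK.

```python
def to_bits(value, width=8):
    """Integer to a list of bits, least significant bit first."""
    return [(value >> i) & 1 for i in range(width)]


def from_bits(bits):
    """List of bits (least significant first) back to an integer."""
    return sum(b << i for i, b in enumerate(bits))


def rlc_with_swaps(bits):
    """Rotate left circular using only adjacent swaps.

    Returns the rotated bits and the carry (the old most significant bit).
    """
    bits = list(bits)
    for i in range(len(bits) - 1, 0, -1):
        bits[i], bits[i - 1] = bits[i - 1], bits[i]  # stands in for one swap gate
    return bits, bits[0]  # the old MSB now sits in bit 0 and is copied to C


def cp_equal(a_bits, b_bits):
    """Equality test: XOR each bit pair, then check that every XOR output is 0."""
    diff = [a ^ b for a, b in zip(a_bits, b_bits)]
    return int(not any(diff))  # 1 means the Z flag would be set
```

An n-bit rotation needs n - 1 adjacent swaps in this arrangement. Note that the XOR comparison only covers the equality (zero) outcome of CP; the other flags that a subtraction would produce are outside the scope of this sketch.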

4 Improving the emulation

This section focuses on the logistics of improving the CPU emulation. The F register is a register that stores status “Flags” and can be considered a hot spot for read/write activities, as many instructions set the value of the bits in the F register depending on the result of their execution [2]. It is therefore prudent to optimize the writing to and reading from the F register as it is accessed so often.

We can optimize the setting of and reading from the F register (which contains the flags: sign, zero, half-carry, parity, overflow, negative and carry) utilizing a single quantum circuit with the approach detailed in the following example.

Consider the quantum implementation of an increment circuit shown in Fig. 8. This circuit implements the salient points of execution, i.e., the increment computation. In order to implement all points of execution the quantum circuit must be extended to include the relevant flags in the F register, being mindful not to alter the output qubits as this may impact the result. The first 5 of 8 flags are relatively simple to compute as the N flag is set programmatically, while C and S are directly available in the circuit and the F3/F5 flags are not documented (Fig. 9).

Fig. 8

A quantum increment circuit

Fig. 9

Reading the F register flags, N, C, S, F3 and F5 from the increment circuit

The remaining flags, Z (zero), H (half-carry) and P/V (parity/overflow), all use variants of the quantum circuit shown in Fig. 10. The circuit in Fig. 10 configures a C-not on each output qubit which is fed into a NOT gate and checked via three Toffoli gates. The Toffoli gates then switch the desired output flag. The Z (zero) flag reuses the C-not chain across all output qubits to check if the result equals 0. The quantum circuit required for the H (half-carry) flag checks if the four least significant bits are 0. The lower four bits being 0 indicates that the last increment instruction carried over and a half-carry occurred. The overflow bit repeats the circuit over the seven least significant bits. The P (parity) flag is not required for the increment instruction but is shown here for reference. The parity flag is computed with a chain of controlled-NOT gates that mimics the traditional XOR gate method of computing parity. Note that this approach is reusable but will require modification, as each instruction grouping sets the F register in slightly different ways. For example, the half-carry flag method shown here works for an increment instruction because the four least significant bits of the result being 0 directly reflects the half-carry. Conversely, the four least significant bits of the result of an ADD instruction may simply be less than the four least significant bits of the starting value, therefore the half-carry bit calculation must add the four least significant bits of both inputs and check if the result is greater than 15. The computation of the traditional CPU’s parity bit offers an interesting and reusable technique for managing fault tolerance. The in-circuit (quantum) implementation of the parity bit enables the detection and subsequent correction of errors in circuit, at the cost of using more qubits. This is not to suggest it will replace more advanced forms of fault tolerance, but rather that it is a fortunate ability to reuse what is already required to achieve accurate emulation.

Fig. 10

Calculating the parity, zero, overflow and half-carry flags from the increment circuit
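
The flag rules described for the increment instruction can be written out as a classical reference sketch, which may help when checking the output of the quantum circuit. The function names are hypothetical. Two points go beyond the text and are assumptions: the overflow test also requires bit 7 to be 1, added here to separate a result of 80h from 00h, and the parity flag follows the Z80 convention of being set for even parity.

```python
def inc_flags(value):
    """Return the 8-bit result of value + 1 and the flags discussed above."""
    result = (value + 1) & 0xFF
    bits = [(result >> i) & 1 for i in range(8)]
    flags = {
        "S": bits[7],                                   # sign: top bit of the result
        "Z": int(not any(bits)),                        # zero: every result bit is 0
        "H": int(not any(bits[:4])),                    # half-carry: low nibble is 0
        "V": int(bits[7] == 1 and not any(bits[:7])),   # overflow: result is 80h
    }
    return result, flags


def parity_even(value):
    """XOR chain over all eight bits; returns 1 when the number of set bits is even."""
    odd = 0
    for i in range(8):
        odd ^= (value >> i) & 1
    return 1 - odd


def add_half_carry(a, b):
    """Half-carry for ADD: add both low nibbles and check for a result above 15."""
    return int((a & 0x0F) + (b & 0x0F) > 15)
```

For instance, incrementing 0Fh gives 10h with H set, incrementing 7Fh gives 80h with S, H and V set, and incrementing FFh gives 00h with Z and H set.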

The next optimization we describe takes advantage of the programmatic nature of creating quantum circuits. In silicon (excluding FPGA devices), once the circuit design is finalized and etched no changes can be made, but quantum circuits can be altered as required. It is therefore beneficial to perform as many instructions as possible in one circuit. However, various constraints on the number and type of instructions that can be incorporated into a single circuit have been identified, including:

4.1 The use of internal registers only

Instructions can only be grouped into one circuit if they do not access external RAM or devices. For example, consider the instructions in Listing 3.

Listing 3

Instructions that can be combined in one quantum circuit as all computations are performed on internal registers

The instructions in Listing 3 could be grouped into one circuit as all operations are executed within the CPU.

Listing 4

Instructions that should not be combined in a quantum circuit

Conversely, the instructions of Listing 4 would be unsafe to group into one circuit as exclusive control of the RAM cannot be guaranteed, i.e., the real CPU could have been halted during a DMA while another device makes changes to the memory address referenced.

4.2 A move to complete emulation

To group instructions, the move from executing the salient points of execution to a full implementation, including the emulation of the F register must be made. Consider the program shown in Listing 5, which loads the A register with 5h, adds FFh, then adds 1h with the carry bit.

Listing 5

Loading, adding and adding with carry

If the instructions in Listing 5 were grouped without emulating the F register, the result would be incorrect. This is because execution could not be passed back to the traditional computer to perform ancillary tasks, such as the setting of the carry flag. This would result in the A register being incorrect, as the carry flag would be missed in the addition performed by the ADC instruction. Therefore, without a move to complete emulation (detailed in Sect. 4), the type of instructions that can be grouped in one circuit would be restricted to those which do not affect the F register.
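
The arithmetic behind this failure can be worked through with a short classical sketch of the sequence described above (load 5h, add FFh, then add 1h with carry). The helper `add8` is hypothetical and simply models 8-bit addition with a carry in and a carry out.

```python
def add8(a, b, carry_in=0):
    """8-bit addition returning (result, carry_out)."""
    total = a + b + carry_in
    return total & 0xFF, total >> 8


a = 0x05                                # load A with 5h
a, carry = add8(a, 0xFF)                # add FFh: A becomes 04h and the carry is set
with_flag, _ = add8(a, 0x01, carry)     # add 1h with carry, carry tracked in circuit
without_flag, _ = add8(a, 0x01, 0)      # same step if the carry flag were lost
```

With the carry tracked, A ends at 06h; if the carry is missed, A ends at 05h, which is the incorrect result the paragraph above describes.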

4.3 The number of instructions being combined

The constructed quantum circuit must not be so large that decoherence or other errors alter the results. A simple experiment was conducted on a Falcon r5.11H quantum computer where an XOR equivalent quantum instruction was repeated (Fig. 11) until the result from the XOR was no longer accurate.

Fig. 11

A quantum circuit executing multiple XOR instructions in an effort to measure the rate of error added by each instruction

The results in Table 4 show a clear degradation in accuracy as more instructions are incorporated in the circuit. The resultant probability varies from the perfect result of 1.0 to a background noise level at 0.5. The results indicate that a maximum of two instructions per circuit can be reasonably performed. This halves the number of queues required, doubles execution speed, reduces the number of shots and the amount of passing of information/execution back and forth between the traditional and quantum computer. It could be argued that using a C-not to “copy” the result of the XOR instruction to another qubit introduces additional error, and so a swap gate should have been used. This approach was considered but not chosen as it would result in swapping in a clean qubit for each instruction which would skew the outcome in favor of cleaner results.

Table 4 Qubit accuracy of executing n instructions against number of Shots
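
The constraints in this section could be combined into a simple grouping pass, sketched below. This is a rough illustration rather than the SDK's logic: the function names, the example program and the heuristic for spotting external access (a parenthesised operand or a small set of mnemonics) are all assumptions, and the heuristic will misclassify some instructions, such as JP (HL).

```python
MAX_PER_CIRCUIT = 2  # limit suggested by the XOR repetition experiment above

# Mnemonics assumed here to touch memory or I/O even without a parenthesised operand.
EXTERNAL_MNEMONICS = {"IN", "OUT", "PUSH", "POP", "CALL", "RET"}


def touches_outside_cpu(instruction):
    """Heuristic check for RAM or device access in Z80-style assembler text."""
    mnemonic = instruction.split()[0].upper()
    return "(" in instruction or mnemonic in EXTERNAL_MNEMONICS


def group_instructions(program, max_per_circuit=MAX_PER_CIRCUIT):
    """Split a program into groups that could each become one quantum circuit."""
    groups, current = [], []
    for instruction in program:
        if touches_outside_cpu(instruction):
            if current:
                groups.append(current)
                current = []
            groups.append([instruction])  # external access always runs alone
        else:
            current.append(instruction)
            if len(current) == max_per_circuit:
                groups.append(current)
                current = []
    if current:
        groups.append(current)
    return groups


# Hypothetical program, not the contents of Listing 3 or Listing 4.
program = ["LD A,5", "ADD A,B", "INC A", "LD (HL),A", "DEC B"]
groups = group_instructions(program)
```

For the hypothetical program above, the first two register-only instructions share a circuit, while the memory write is isolated in a group of its own.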

5 Creating a machine code mapping to call quantum opcodes

Cross et al. [8] propose assembler-based routines for calling quantum computers. The work presented here provides a basis on which such a prototype can be implemented. With the translation layer in place, OpenQASM [8] calls can be directly implemented in the emulator/translation layer, making the experience much closer to that of a physical CPU but with quantum capabilities. This can be achieved by implementing the “Machine Target Code Generation” component (Fig. 12) of OpenQASM [8]. This can be undertaken by modifying the VASM [9] assembler to map OpenQASM instructions to machine code; the machine code will then be interpreted by the emulator/translation layer into Qiskit [6] calls to IBM’s quantum computer.

Fig. 12

An implementation of Cross et al.’s [8] OpenQASM

5.1 Machine code mapping method

Many CPUs contain special opcodes [10] which act as a modifier for the next opcode. For example, the machine code “2C” increments the L register; however, prepending the “2C” with “DD” alters the instruction to increment the IXL register instead. This was originally implemented to allow more unique instructions to fit into a smaller data bus width, but this design feature can be advantageously utilized by selecting an unused machine code value to create an opcode modifier. This allows a mapping of machine code to OpenQASM instructions. To make the development of OpenQASM-based quantum circuits easier, the source code for VASM can be edited with new mnemonics added for the quantum circuit gates (Listing 6).

Listing 6

Pseudo code to add a quantum assembler instruction to the VASM assembler. The proposed machine code, 0xEDFF01 maps to an RY gate rotated by the value in the A register

Listing 6 allows the assembler instruction RY,A to be translated into the machine code “ED,FF,01.” This process enables a complete assembler to machine code mapping table (Table 5) to be constructed for the opcode modifier “ED,FF,**.”

Table 5 Assembler to OpenQASM machine code mapping table (abridged)

With the new mnemonics in place, the assembler code for a Hadamard gate on qubit x (H,x) can be built into the machine code “ED,FF,03,01.” For this case, the final two bytes of the machine code are an operand which represents the qubit to apply the Hadamard gate to. The emulator can now be edited to catch all machine code values which start with “ED,FF” and translate them into the corresponding calls to IBM’s quantum computer.
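
A decoder for the “ED,FF” prefix might look like the sketch below. It is a hypothetical illustration, not the emulator's code: the table holds only the two encodings mentioned in the text, and the byte layout (a gate selector in the third byte, followed by one operand byte for the Hadamard gate) is one possible reading of the “ED,FF,03,01” example and should be treated as an assumption.

```python
QUANTUM_PREFIX = (0xED, 0xFF)

# Gate selector byte -> (mnemonic, number of operand bytes).
# Only the two encodings named in the text; operand counts are assumed.
QUANTUM_OPCODES = {
    0x01: ("RY,A", 0),  # RY gate rotated by the value in the A register
    0x03: ("H", 1),     # Hadamard gate; the operand byte is read as the qubit index
}


def decode_quantum(code, pc=0):
    """Return (mnemonic, operands, next_pc), or None if this is not a quantum opcode."""
    if tuple(code[pc:pc + 2]) != QUANTUM_PREFIX:
        return None
    mnemonic, operand_count = QUANTUM_OPCODES[code[pc + 2]]
    start = pc + 3
    operands = list(code[start:start + operand_count])
    return mnemonic, operands, start + operand_count
```

Calling `decode_quantum(bytes([0xED, 0xFF, 0x03, 0x01]))` yields the Hadamard mnemonic with operand 1, while an ordinary opcode such as 2Ch returns `None` and would fall through to the normal emulator path.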

6 Results

The largest gains in performance were achieved in the move from executing individual logic gates to executing the entire circuit in the quantum domain. When dealing with the variable execution time of running quantum circuits, i.e., the queuing mechanism, it was important to benchmark any potential improvements against a fair playing field and not a favorable queue length. With this in mind, an adaptation of algorithmic complexity was used to assess performance increases. Given that the input size n is fixed at 8-bit or 16-bit, there is no need to assess infinities and all resultant complexity will be constant (ignoring recursive instructions, which will have a linear complexity). This does not reduce the value of assessing the capability but rather simplifies it, in that it is possible to substitute high-level instructions for quantum gates, or more practically quantum circuits. With this tactic in place, each of the quantum implementation methods was compared, with the following results. Note that Fig. 13 shows an abridged dataset, as the goal of quantum implementation methods 3 and 4 was to include more functionality or explore the benefits of the quantum computer, not necessarily to reduce the number of circuits.

Fig. 13

An overview of the improvements in efficiency made by the new instruction implementations

The upper field shows the number of quantum circuits required for the original gate by gate quantum implementation method, whereas the lower field shows the number of quantum circuits required for the optimized (quantum) method, i.e., method 2, 3 or 4. The optimized method shows a clear reduction in the number of quantum circuits required to perform an instruction and hence execution time is vastly reduced.

7 Limitations and evaluation of the methodology

Although orders of magnitude slower than the performance of traditional CPUs in arithmetic, logic and bitwise operations, this new work substantially improves the execution speed of an emulated CPU on a quantum computer (in some cases by more than 1,000 percent) without introducing any unrealistic requirements, i.e., all execution can be performed utilizing fewer than 32 qubits. This is a marked improvement in efficiency compared to the “in silicon” method described in the authors’ previous paper [1]. Overall performance is still heavily bound by the number of qubits available, the delays involved in the remote API-based approach and the queuing method of accessing IBM's quantum computers. The main improvements are in usability, where each instruction can be executed in circa 30 s, and in the breadth of methods available for the researcher to contrast.

Evaluating the scale of progress in the development of quantum computers versus traditional computers regarding logic operations is difficult because of their different architectures and goals. However, some interesting statistics can be used to compare the linear growth of traditional CPU performance [11] and the exponential growth in quantum computing performance [12]. One approach which provides a generalized appreciation of the trend (and arguably highlights the timeliness of the work presented in this paper) is to predict the number of transistors used in a CPU compared to the number of qubits available in a quantum computer. On the one hand using the last 34 years [11] of Intel CPUs as a baseline to forecast the number of transistors used (normalized by dividing by 6 as it takes 6 transistors to store 1 bit in a CPU [13]), and on the other hand using the limited data set of 3 years of quantum computer development and IBM’s roadmap [12], a simplified performance growth prediction of number of transistors versus qubits can be made. This suggests that the number of qubits will overtake the number of normalized transistors in the next 25 years, around the middle of the twenty-first century (Fig. 14). This raises the possibility that the work presented here may have greater practical relevance to the wider industry in the near to medium future [15].

Fig. 14

A prediction of the number of available qubits (exponential) versus the number of transistors in a CPU

8 Conclusion

Alternative methods of performing traditional CPU instructions on a quantum computer have been demonstrated. These new methods will be included in the next version of the authors’ Qx86 SDK. This work enables the programmer and/or researcher to select a method they are familiar with and contrast a traditional method with more efficient quantum methods of execution. The multiple method approach improves the accessibility/usability of quantum computers when transitioning from traditional computing [16].

Future work will include adding support for a full x86 instruction set [17]. This is not currently practical due to the requirement of handling multiple concatenated 32-bit registers in one instruction, but will become possible as more advanced quantum computers are developed and released (e.g., the IBM Osprey class of quantum computers [7]).