1 Emulator Performance Compared to Physical SPARC Hardware
Please keep in mind that this performance comparison is based on experience from previous customer projects; your specific use case may differ. Platform emulation is a complex and CPU-intensive task. To illustrate this, consider how a physical SCSI controller works.
1.1 A Physical SCSI Controller as Example
On a physical system, controller hardware (network, disk, etc.) typically has some local RAM on the card that is mapped into the system's address space. This local RAM is used for various purposes, chiefly buffering data and providing control and status registers.
Control registers are used by the OS (operating system) to signal to the controller hardware that there is work to do. Status registers are used by the controller hardware to report a result back to the OS.
For example, to write data to a disk, the OS copies the data (or part of it), together with some metadata describing the operation, into the buffer area of the controller's local RAM. It then writes a specific value to a control register to tell the controller to read the metadata and perform the indicated operation. The controller detects the change to the control register and performs (or attempts to perform) the write. When it has finished, it writes a value to a status register and, unless the controller is very old, raises a hardware interrupt to signal to the OS that the operation has completed.
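The handshake described above can be sketched as a small software model. This is a minimal illustration under several assumptions: the class, the register values and the dictionary used as "metadata" are all hypothetical and do not correspond to any real controller's register layout. The controller here also acts synchronously, whereas real hardware works in parallel with the CPU. An emulator has to implement the controller side of exactly this kind of exchange in software, for every emulated device.

```python
from dataclasses import dataclass, field
from typing import Callable

# Hypothetical register values; real controllers define their own layouts.
CTRL_IDLE, CTRL_GO = 0, 1
STATUS_BUSY, STATUS_OK, STATUS_ERROR = 0, 1, 2


@dataclass
class ToyDiskController:
    """Software model of a controller's local RAM: data buffer plus control/status registers."""
    disk: bytearray                              # backing store standing in for the disk
    raise_interrupt: Callable[[], None]          # how completion is signalled to the OS
    buffer: bytearray = field(default_factory=lambda: bytearray(512))
    metadata: dict = field(default_factory=dict) # operation descriptor written by the OS
    control: int = CTRL_IDLE
    status: int = STATUS_OK

    def write_control(self, value: int) -> None:
        # The controller notices the control-register write and acts on it.
        self.control = value
        if value == CTRL_GO:
            self._perform()

    def _perform(self) -> None:
        self.status = STATUS_BUSY
        op = self.metadata
        if op.get("op") == "write" and op["offset"] + op["length"] <= len(self.disk):
            start, n = op["offset"], op["length"]
            self.disk[start:start + n] = self.buffer[:n]  # perform the write
            self.status = STATUS_OK
        else:
            self.status = STATUS_ERROR                    # report failure in the status register
        self.control = CTRL_IDLE
        self.raise_interrupt()                            # tell the OS the operation is done


def os_write_to_disk(ctrl: ToyDiskController, offset: int, data: bytes) -> None:
    """OS side of the handshake: buffer the data, describe the op, kick the controller."""
    if len(data) > len(ctrl.buffer):
        raise ValueError("data does not fit in the controller buffer")
    ctrl.buffer[:len(data)] = data                                          # 1. copy data
    ctrl.metadata = {"op": "write", "offset": offset, "length": len(data)}  # 2. metadata
    ctrl.write_control(CTRL_GO)                                             # 3. control register


if __name__ == "__main__":
    interrupts = []
    ctrl = ToyDiskController(disk=bytearray(4096), raise_interrupt=lambda: interrupts.append(True))
    os_write_to_disk(ctrl, 100, b"hello")
    print(ctrl.status == STATUS_OK, bytes(ctrl.disk[100:105]), len(interrupts))
```

In a real emulator, the guest's store to the control register's address has to be intercepted and routed to code like `write_control`. That interception and dispatch is part of the per-I/O overhead discussed below.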
1.2 Emulated Hardware
An emulator must behave exactly as the controller hardware would: it must monitor the emulated control registers for changes, emulate interrupts, and write values to the emulated status registers.
This adds up to a lot of work. Because all of it is done in software, it is limited by the throughput of the host CPU.
2 Emulation Means a Change in Architecture
The difference between virtualization and emulation is that with virtualization the host and guest share the same ISA (Instruction-Set Architecture), whereas emulation means the guest has a different ISA from the host.
Thus, one of the biggest tasks an emulator must perform is translating guest ISA instructions into host ISA instructions. Some instructions, for example memory loads or stores, may require only a single host instruction. Others, for example floating-point instructions, may be translated into a large number of host instructions. Consider, for example, the work required to convert a floating-point value from a format that the host ISA does not support into one that it does.
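To give a feel for that work, the sketch below decodes a 128-bit IEEE 754 binary128 ("quad-precision") bit pattern into a host double. Binary128 is the layout SPARC V9 uses for quad-precision values. x86-64 has no native binary128 arithmetic, so every field has to be unpacked and rebuilt with integer operations. The function name is hypothetical, and this is only an illustration, not how Charon-SSP handles such values. Narrowing to a double also discards precision that a faithful emulator would have to preserve, for example by implementing quad arithmetic in software.

```python
import math
from fractions import Fraction

BIAS = 16383            # binary128 exponent bias
FRACTION_BITS = 112     # width of the binary128 fraction field


def binary128_to_float(bits: int) -> float:
    """Convert a binary128 bit pattern (as an int) to the nearest host double."""
    sign = -1.0 if (bits >> 127) & 1 else 1.0
    exponent = (bits >> FRACTION_BITS) & 0x7FFF       # 15-bit biased exponent
    fraction = bits & ((1 << FRACTION_BITS) - 1)      # 112-bit fraction field

    if exponent == 0x7FFF:                            # infinity or NaN
        return sign * math.inf if fraction == 0 else math.nan
    if exponent == 0:                                 # zero or subnormal: no implicit 1
        significand = Fraction(fraction, 1 << FRACTION_BITS)
        unbiased = 1 - BIAS
    else:                                             # normal number: implicit leading 1
        significand = 1 + Fraction(fraction, 1 << FRACTION_BITS)
        unbiased = exponent - BIAS

    exact = significand * Fraction(2) ** unbiased     # exact rational value
    try:
        return sign * float(exact)                    # round to nearest double
    except OverflowError:                             # too large for a double
        return sign * math.inf


if __name__ == "__main__":
    print(binary128_to_float(0x3FFF << 112))          # bit pattern of 1.0
    print(binary128_to_float((1 << 127) | (0x4000 << 112)))  # bit pattern of -2.0
```

A single guest instruction that operates on such values can therefore expand into many host operations, which is one reason the effective emulated CPU speed depends on the instruction mix.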
It is therefore not surprising that emulation comes with a significant performance penalty. The Stromasys engineering team has made an enormous effort to make emulation as fast as possible, using techniques such as caching blocks of translated code so that they do not have to be re-translated. Nevertheless, there will always be a penalty.
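The idea of caching translated blocks can be sketched with a toy guest instruction set. This is a minimal sketch under stated assumptions. The instruction format, the opcodes `add` and `addi`, and the class name are hypothetical, loosely modeled on three-operand RISC instructions. Real binary translators emit host machine code rather than Python closures. What the sketch shows is the caching pattern: a guest block is translated on its first execution and looked up by its guest address afterwards.

```python
from typing import Callable, Dict, List, Tuple

MASK32 = 0xFFFFFFFF
# Toy guest instruction: (opcode, rd, rs1, rs2_or_immediate). Hypothetical format.
GuestInstr = Tuple[str, int, int, int]
HostOp = Callable[[List[int]], None]


def translate_instr(opcode: str, rd: int, rs1: int, arg: int) -> HostOp:
    """Turn one toy guest instruction into a host-side closure."""
    if opcode == "add":                      # rd = rs1 + rs2
        def op(regs: List[int]) -> None:
            regs[rd] = (regs[rs1] + regs[arg]) & MASK32
    elif opcode == "addi":                   # rd = rs1 + immediate
        def op(regs: List[int]) -> None:
            regs[rd] = (regs[rs1] + arg) & MASK32
    else:
        raise ValueError(f"unsupported opcode: {opcode}")
    return op


class BlockTranslator:
    """Translates each guest block once and reuses the cached translation."""

    def __init__(self, program: Dict[int, List[GuestInstr]]) -> None:
        self.program = program                     # guest address -> basic block
        self.cache: Dict[int, List[HostOp]] = {}   # guest address -> translated block
        self.translations = 0                      # number of cache misses

    def run_block(self, pc: int, regs: List[int]) -> None:
        block = self.cache.get(pc)
        if block is None:                          # miss: translate once and store
            block = [translate_instr(*instr) for instr in self.program[pc]]
            self.cache[pc] = block
            self.translations += 1
        for op in block:                           # execute the translated block
            op(regs)


if __name__ == "__main__":
    bt = BlockTranslator({0x1000: [("addi", 1, 1, 5), ("add", 2, 2, 1)]})
    regs = [0] * 8
    for _ in range(3):                             # e.g. a loop body executed repeatedly
        bt.run_block(0x1000, regs)
    print(regs[1], regs[2], bt.translations)
```

Code that runs the same blocks repeatedly, such as loops, benefits most, because the translation cost is paid only once. The execution of each translated instruction still costs host work every time.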
3 Characterizing Emulator Performance
3.1 Effective Emulated CPU Frequency
To (crudely) estimate emulator performance, we use the term Effective Emulated CPU Frequency (EECF). This gives us a rough idea of how well the emulator will perform on modern Intel and AMD processors running at a particular frequency (Host CPU Frequency, HCF). The EECF lies somewhere between 1/4 and 1/3 of the HCF. Expressed as a formula:
$$\frac{\mathrm{HCF}}{4} \le \mathrm{EECF} \le \frac{\mathrm{HCF}}{3}$$
For example, if the HCF is 3.0 GHz, the EECF will lie between 750 MHz and 1.0 GHz. This means that the estimated performance of the emulator will be comparable to running the workload on an UltraSPARC CPU with a frequency between 750 and 1000 MHz. The actual performance depends heavily on the workload. If the workload uses many SPARC instructions that require more work to translate, the EECF will be lower.
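The rule of thumb can be written as a small helper for quick what-if comparisons between candidate host CPUs. The function name is hypothetical. It only encodes the 1/4 to 1/3 range stated above and says nothing about a specific workload, where the instruction mix can push the result toward, or below, the lower bound.

```python
from typing import Tuple


def eecf_range_mhz(hcf_ghz: float) -> Tuple[float, float]:
    """Rule-of-thumb EECF range in MHz: HCF/4 to HCF/3, for a host CPU frequency in GHz."""
    hcf_mhz = hcf_ghz * 1000.0
    return hcf_mhz / 4.0, hcf_mhz / 3.0


if __name__ == "__main__":
    for hcf in (3.0, 3.5, 4.0):
        low, high = eecf_range_mhz(hcf)
        print(f"HCF {hcf:.1f} GHz -> EECF approx. {low:.0f} to {high:.0f} MHz")
```

For the 3.0 GHz example in the text, this returns (750.0, 1000.0), matching the 750 MHz to 1.0 GHz range.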
3.2 Emulator Performance Scales with Host Processor Frequency
On modern Intel and AMD processors, emulator performance scales roughly in proportion to the host processor frequency. Thus, a host processor running at 3.5 GHz will deliver roughly 15% better performance than one running at 3.0 GHz.
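The arithmetic behind this comparison is a simple frequency ratio. The sketch below, with a hypothetical function name, computes the gain that strictly linear scaling would predict. For 3.0 to 3.5 GHz this gives about 17%. The text above quotes roughly 15% because the scaling is only approximately proportional, so treat the computed value as an upper-end estimate rather than a guarantee.

```python
def linear_speedup_percent(baseline_ghz: float, new_ghz: float) -> float:
    """Expected gain in percent if performance scaled exactly with host frequency."""
    if baseline_ghz <= 0:
        raise ValueError("baseline frequency must be positive")
    return (new_ghz / baseline_ghz - 1.0) * 100.0


if __name__ == "__main__":
    print(f"{linear_speedup_percent(3.0, 3.5):.1f}%")
```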
3.3 Emulator Maximum Throughput
Due to the nature of emulation (as explained above), there are upper limits on network, disk and CPU throughput. The CPU limits are described above. Because the emulator must emulate complex controller hardware, such as Ethernet and SCSI controllers, network and disk throughput are limited as well. These limits vary with the frequency of the host processor. In general, the maximum throughput of an emulated NIC is between 300 and 600 Mbit/s, depending also on parameters such as the packet size. The maximum throughput to a disk or through a SCSI controller is harder to characterize, but it also depends on factors such as the block size of the I/Os being performed.
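For rough planning, the NIC range can be turned into transfer times, for example for a nightly backup. The sketch below also shows why I/O size matters, using a simple fixed-overhead model. The model assumes that each emulated I/O pays a constant per-request cost, such as register accesses and interrupt delivery, plus the time to move its payload. The function names, the 50 µs overhead and the 500 MB/s streaming rate are hypothetical values chosen only to show the trend. They are not measured Charon-SSP figures.

```python
def transfer_seconds(data_gb: float, throughput_mbit_s: float) -> float:
    """Time to move data_gb (decimal GB) at a sustained throughput given in Mbit/s."""
    megabits = data_gb * 8 * 1000          # 1 GB = 8,000 Mbit in decimal units
    return megabits / throughput_mbit_s


def effective_disk_mb_s(block_bytes: int, per_io_overhead_s: float,
                        streaming_bytes_s: float) -> float:
    """Fixed-cost model: each I/O pays a constant overhead plus its payload transfer time."""
    seconds_per_io = per_io_overhead_s + block_bytes / streaming_bytes_s
    return block_bytes / seconds_per_io / 1e6


if __name__ == "__main__":
    for rate in (300, 600):                # emulated NIC range from the text
        print(f"10 GB at {rate} Mbit/s: {transfer_seconds(10, rate) / 60:.1f} min")
    # Hypothetical parameters: 50 us per-I/O overhead, 500 MB/s streaming rate.
    for block in (4096, 65536, 1048576):
        print(f"{block:>8} B blocks: {effective_disk_mb_s(block, 50e-6, 500e6):.0f} MB/s")
```

Under this model, small I/Os are dominated by the fixed per-request cost, while large I/Os approach the streaming rate. This is consistent with the observation that disk throughput depends on the block size.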
3.4 Sizing the Emulator For Your Workload
We can provide tools for capturing performance metrics on your physical SPARC system, which we can then analyze. Please get in touch with Stromasys for help collecting performance data on your SPARC workload and sizing the emulator configuration.
4 Charon-SSP+
The "plus" version of Charon-SSP, which can only be deployed on physical hardware ("bare metal"), uses the CPU and I/O virtualization capabilities of modern Intel and AMD processors to improve emulator performance. The benefit depends on the workload but is generally between 10% and 30%.