Charon Emulator Performance Considerations
1 Emulator Performance Compared to Physical SPARC Hardware
Please keep in mind that this performance comparison is based on experience from previous customer projects; your specific use-case may differ. Platform emulation is a complex and CPU-intensive task. To better illustrate this, let us consider how a physical SCSI controller works...
1.1 A Physical SCSI Controller as Example
On a physical system, the controller hardware (network, disk, etc.) typically has some local RAM on the card which is mapped into the address space of the system. This local RAM is used for various things, chiefly for buffering data and providing control and status registers.
Control registers are used for signaling from the OS (operating system) to the controller hardware that there is work to do. Status registers are used for signaling some result from the controller hardware to the OS.
For example, to write some data to a disk, the OS copies the data (or part of it), along with metadata describing the operation, to the buffer area of the controller's local RAM. It then writes a specific value to a control register to tell the controller to read the metadata and perform the indicated operation. The controller detects the change to the control register and performs, or attempts to perform, the write. When finished, it writes a value to a status register and, unless the controller is very old, raises a hardware interrupt to signal to the OS that the operation has completed.
1.2 Emulated Hardware
An emulator must behave exactly as the controller hardware would: it must monitor the emulated control registers for changes, emulate interrupts, and write values to the emulated status registers.
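To make the register handshake more concrete, here is a minimal Python sketch of the cycle in software. It is a toy model only, not how Charon is implemented: the class name, the register values (`CMD_WRITE`, `STATUS_OK`, and so on), and the dictionary-based "disk" are all assumptions made for illustration.

```python
from dataclasses import dataclass, field

# Hypothetical register values, chosen for illustration only.
CMD_IDLE = 0
CMD_WRITE = 1
STATUS_BUSY = 0
STATUS_OK = 1
STATUS_ERROR = 2


@dataclass
class EmulatedController:
    """Toy model of a controller's memory-mapped buffer and registers."""
    disk: dict = field(default_factory=dict)        # block number -> bytes
    buffer: dict = field(default_factory=dict)      # metadata and data copied in by the guest OS
    control: int = CMD_IDLE                         # written by the guest OS
    status: int = STATUS_OK                         # written by the emulator
    interrupts: list = field(default_factory=list)  # record of emulated interrupts

    def poll(self) -> None:
        """One pass of the emulator loop: notice a control-register change and act on it."""
        if self.control == CMD_IDLE:
            return  # no work requested
        self.status = STATUS_BUSY
        if self.control == CMD_WRITE and "block" in self.buffer and "data" in self.buffer:
            self.disk[self.buffer["block"]] = self.buffer["data"]
            self.status = STATUS_OK
        else:
            self.status = STATUS_ERROR
        self.control = CMD_IDLE                  # ready for the next command
        self.interrupts.append("io-complete")    # stands in for a hardware interrupt


# Guest OS side: fill the buffer, then write the control register.
ctrl = EmulatedController()
ctrl.buffer = {"block": 7, "data": b"hello"}
ctrl.control = CMD_WRITE

# Emulator side: every such check costs host CPU cycles.
ctrl.poll()
```

Every step that dedicated controller hardware performs on its own becomes work for the host CPU here, which is the point the next paragraph makes.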
You can imagine that the above is a lot of work. It is also important to understand that the work described above is limited by the throughput of the host CPU.
2 Emulation Means a Change in Architecture
The difference between virtualization and emulation is that with virtualization the host and guest share the same ISA (Instruction-Set Architecture), whereas emulation means the guest has a different ISA from the host.
Thus, one of the biggest tasks an emulator must perform is translating the guest ISA instructions into the host ISA. Some instructions, for example memory fetch or put, may only require a single instruction in the host ISA. Other instruction types, for example floating-point, may be translated into a large number of host ISA instructions. Imagine, for example, the work required to convert a floating-point value of a format that is not supported by the host ISA into a format supported by the host ISA.
As you can see, it is not surprising that emulation comes with a performance penalty. The Stromasys engineering team has made emulation as fast as possible, using various advanced techniques. Nevertheless, there will always be a penalty.
3 Characterizing Emulator Performance
3.1 Effective Emulated CPU Frequency
In order to be able to (crudely) estimate the emulator performance, we use the term Effective Emulated CPU Frequency (EECF). This gives us a rough idea of how well the emulator will perform on modern Intel and AMD processors running at a particular frequency (Host CPU Frequency - HCF). The EECF lies somewhere between 1/4 and 1/3 of the HCF.
For example, if the HCF is 3.0 GHz, the EECF will lie between 750 MHz and 1.0 GHz. This means that the estimated performance of the emulator will be comparable to running the workload on an UltraSPARC CPU with a frequency between 750 and 1000 MHz. The actual performance depends heavily on the workload. If the workload uses many SPARC instructions that require more work to translate, the EECF will be lower.
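The rule of thumb above can be written as a small helper. This is only a sketch of the rough 1/4 to 1/3 estimate stated in this section; the function name `eecf_range_ghz` is an assumption, and the result says nothing about how a specific workload will behave.

```python
from typing import Tuple


def eecf_range_ghz(hcf_ghz: float) -> Tuple[float, float]:
    """Return the (low, high) EECF estimate in GHz for a host CPU frequency in GHz."""
    if hcf_ghz <= 0:
        raise ValueError("host CPU frequency must be positive")
    # Rule of thumb from the text: EECF lies between 1/4 and 1/3 of the HCF.
    return hcf_ghz / 4, hcf_ghz / 3


low, high = eecf_range_ghz(3.0)
print(f"EECF between {low * 1000:.0f} MHz and {high * 1000:.0f} MHz")
# prints: EECF between 750 MHz and 1000 MHz
```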
3.2 Emulator Performance Scales with Host Processor Frequency
On modern Intel and AMD processors, the emulator performance scales more or less proportionally to the host processor frequency. Thus, a host processor running at 3.5 GHz will deliver roughly 15% better performance than a host processor running at 3.0 GHz. As the frequency of the host processor drops, we see a non-linear deterioration of emulator performance, and very low frequencies may make it impossible to run the guest OS. We strongly recommend not going below 3.0 GHz except in rare cases in which the original processors have a very low frequency.
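As a quick check of the arithmetic, the sketch below computes the gain under a strictly proportional model. The function name `proportional_gain` is an assumption. A purely linear model gives slightly more than the rough figure quoted above, which fits "more or less proportionally", and it does not capture the non-linear drop at low host frequencies.

```python
def proportional_gain(old_ghz: float, new_ghz: float) -> float:
    """Relative performance gain if performance scaled exactly with host frequency."""
    if old_ghz <= 0 or new_ghz <= 0:
        raise ValueError("frequencies must be positive")
    return new_ghz / old_ghz - 1.0


gain = proportional_gain(3.0, 3.5)
print(f"{gain:.1%}")  # prints: 16.7%
```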
The emulator performs best on the most recent Intel and AMD processor generations.
3.3 Emulator Maximum Throughput
Due to the nature of emulation (as explained above), there are various upper limits on network, disk and CPU throughput. The CPU limits are described above. Because the emulator must emulate controller hardware, such as Ethernet and SCSI controllers, which is complex, there are upper limits on the network and disk throughput. These limits vary with the frequency of the host processor. In general, the maximum throughput of an emulated NIC will be between 300 and 600 Mbits/s, also depending on parameters such as the packet-size. The maximum throughput to a disk or through a SCSI controller is harder to characterize, but also depends on such things as the block-size of the I/Os being performed.
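To get a feel for what the NIC range means in practice, the sketch below estimates how long a bulk transfer would take at each end of the 300 to 600 Mbit/s range. The function name and the 10 GB example size are assumptions; the calculation ignores protocol overhead and packet-size effects, so treat the results as best-case figures.

```python
def transfer_seconds(size_bytes: int, mbits_per_s: float) -> float:
    """Best-case time to move size_bytes at a sustained rate given in Mbit/s."""
    if mbits_per_s <= 0:
        raise ValueError("throughput must be positive")
    return size_bytes * 8 / (mbits_per_s * 1_000_000)


size = 10 * 10**9  # hypothetical 10 GB transfer
slow = transfer_seconds(size, 300)  # lower end of the emulated NIC range
fast = transfer_seconds(size, 600)  # upper end of the emulated NIC range
print(f"{fast / 60:.1f} to {slow / 60:.1f} minutes")
```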
3.4 Sizing the Emulator For Your Workload
At this point it should be clear to the reader that an emulator cannot match the performance of certain modern SPARC and PA-RISC platforms. However, we often encounter workloads running on such modern platforms that do not take full advantage of the available performance. In many cases we can successfully migrate such workloads to the emulator. To assess the feasibility of emulation, we have developed tools for capturing and analyzing performance metrics on the original SPARC and PA-RISC systems. After we have processed such performance data, we can in most cases say whether emulation can provide the required performance.
Please get in touch with Stromasys for help collecting performance data on your SPARC workload and sizing the emulator configuration.
4 Charon-SSP Emulator Performance Improvements
4.1 MMU-Pass-through
With SSP version 6 we introduced MMU-Pass-through (Memory-Management-Unit Pass-Through), which can significantly improve performance in virtual environments such as the cloud.
4.2 SSP+
The "plus" version of SSP, which can only be deployed on physical hardware ("bare-metal"), uses the CPU and I/O virtualization capabilities of modern Intel and AMD processors to improve emulator CPU performance. The improvement can be between 10 and 30%, depending on the workload. Unfortunately, this comes with a penalty for I/O operations, so SSP+ cannot always improve overall workload performance.
© Stromasys, 1999-2024 - All the information is provided on the best effort basis, and might be changed anytime without notice. Information provided does not mean Stromasys commitment to any features described.