CHARON-VAX/CHARON-AXP - OpenVMS fails with CPUSPINWAIT fatal bug check

Problem

CHARON-VAX/CHARON-AXP / OpenVMS fails with "Fatal BUG CHECK: CPUSPINWAIT, CPU spinwait timer expired" message

Solution

When running CHARON, this problem typically has one of two causes:

  1. The CHARON node is overloaded and cannot cope with the workload. Upgrading the hardware or the CHARON flavor usually resolves the issue.

  2. The timeout that triggers the bug check is governed by two internal VMS parameters and by one parameter that can be set through SYSGEN (MODPARAMS.DAT).
    The internal parameters, calculated automatically at OpenVMS startup, are CPU$L_TENUSEC and CPU$L_UBDELAY. They depend on hardware performance (for CHARON, that of the Intel host) and are outside our control.
    The third parameter, which we can manage, is SGN$GL_SMP_SPINWAIT, or SGN$GL_SMP_LNGSPINWAIT on older OpenVMS versions.
    VMS multiplies these three parameters and uses the result as the loop counter that measures the delay:

    (SP) = SGN$GL_SMP_SPINWAIT * CPU$L_TENUSEC * CPU$L_UBDELAY

    The potential issue is that all three source variables are LONG INT (32-bit) values, and the result (SP) is a LONG INT as well. If the product exceeds 2^32, it overflows and can end up as a very small number, so the timeout can expire much sooner than intended.
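The wraparound described above can be illustrated with a short Python sketch. It assumes the product is simply truncated to an unsigned 32-bit longword, which is a simplification for illustration and not taken from the OpenVMS sources. The TENUSEC and UBDELAY values are hypothetical; the real ones are calibrated at startup and differ from host to host.

```python
MASK32 = 0xFFFFFFFF  # keep only the low 32 bits, as a 32-bit longword would


def spin_loop_count(spinwait, tenusec, ubdelay):
    """Return (exact product, product truncated to 32 bits)."""
    exact = spinwait * tenusec * ubdelay
    return exact, exact & MASK32


# Hypothetical calibration values, NOT read from a real system.
TENUSEC = 1500
UBDELAY = 1

for spinwait in (3_000_000, 1_000_000, 300_000, 100_000):
    exact, wrapped = spin_loop_count(spinwait, TENUSEC, UBDELAY)
    status = "overflow" if exact > MASK32 else "ok"
    print(f"{spinwait:>9}  exact={exact:>13}  32-bit={wrapped:>10}  {status}")
```

With these assumed calibration values, the default of 3 000 000 produces a product that no longer fits in 32 bits and wraps to a much smaller counter, while 1 000 000 and below still fit. On a real system the threshold depends on the actual calibrated values.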

    Settings

    We recommend reducing the value of SMP_LNGSPINWAIT to 1 000 000 (1 million) and then testing system stability. If the system is still unstable, set it to 300 000; if that does not help either, try 100 000. If the problem persists, please contact Stromasys support.

    We strongly recommend leaving SMP_SPINWAIT at its default value of 100 000 and SMP_SANITY_CNT at its default value of 300.


  3. Example / OpenVMS 7.3-2 using SYSGEN

    $ MC SYSGEN

    SYSGEN>  SHOW /MULTI

     

    Parameters in use: Active

    Parameter Name            Current    Default     Min.     Max.     Unit        Dynamic

    --------------            -------    -------    -------  -------   ----        -------

    SMP_CPUS                        1         -1         0        -1   CPU bitmask

    MULTIPROCESSING                 3          3         0         4   Coded-value

    SMP_SANITY_CNT                300        300         1        -1   10ms.

    SMP_SPINWAIT               100000     100000         1   8388607   10 usec.

    SMP_LNGSPINWAIT           3000000    3000000         1   8388607   10 usec.


    SYSGEN>  USE CURRENT

    SYSGEN>  SET SMP_LNGSPINWAIT 1000000

    SYSGEN>  WRITE CURRENT

    SYSGEN>  SHOW SMP_LNGSPINWAIT

    Parameter Name            Current    Default     Min.     Max.     Unit        Dynamic

    --------------            -------    -------    -------  -------   ----        -------

    SMP_LNGSPINWAIT           1000000    3000000         1   8388607   10 usec.   

    SYSGEN>  EXIT

    $

     
    Note: You should also update the MODPARAMS.DAT file and run AUTOGEN so that the new value survives a reboot. Please refer to the documentation for your OpenVMS version.

  4. Definitions (from HP OpenVMS Systems Documentation)

    • SMP_LNGSPINWAIT: certain shared resources in a multiprocessing system take longer to become available than allowed by the SMP_SPINWAIT parameter. SMP_LNGSPINWAIT establishes, in 10-microsecond intervals, the length of time a processor in a multiprocessing system waits for these resources. A timeout causes a CPUSPINWAIT bugcheck.

    • SMP_SPINWAIT establishes, in 10-microsecond intervals, the amount of time a CPU in an SMP system normally waits for access to a shared resource. This process is called spinwaiting. A timeout causes a CPUSPINWAIT bugcheck.
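Because these parameters are expressed in 10-microsecond intervals (and SMP_SANITY_CNT in 10-millisecond intervals, per the Unit column of the SYSGEN output), it can help to convert them to seconds when comparing settings. The following Python sketch does only that unit conversion; the function names are illustrative and not part of any OpenVMS or CHARON tool.

```python
def spinwait_seconds(value):
    """Convert a value in 10-microsecond units to seconds."""
    return value * 10 / 1_000_000


def sanity_seconds(value):
    """Convert a value in 10-millisecond units to seconds."""
    return value * 10 / 1_000


# Values mentioned in this article for SMP_LNGSPINWAIT and SMP_SPINWAIT.
for units in (3_000_000, 1_000_000, 300_000, 100_000):
    print(f"{units:>9} x 10 usec = {spinwait_seconds(units):g} s")

# Default SMP_SANITY_CNT.
print(f"{300:>9} x 10 ms   = {sanity_seconds(300):g} s")
```

The default SMP_LNGSPINWAIT of 3 000 000 therefore corresponds to a nominal 30 seconds, the recommended 1 000 000 to 10 seconds, and the default SMP_SPINWAIT of 100 000 to 1 second.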




© Stromasys, 1999-2024  - All the information is provided on the best effort basis, and might be changed anytime without notice. Information provided does not mean Stromasys commitment to any features described.