Friday, January 27, 2006

CMT RAS

CMT cores can be dynamically disabled when a core fails, and Solaris
can continue to run with LWPs rescheduled onto the remaining cores!

So, in addition to the reliability-through-simplicity arguments, the
lower-heat arguments, the ECC/parity checking of the L1/L2 caches and
main DRAM, and the thermal management RAS features, T1 does have a
Diagnosis Engine (DE) that will blacklist individual threads or a whole
core (depending upon what FMA associates the failure with), and Solaris
will continue to run on the remaining cores.
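
To make the blacklisting idea concrete, here is a toy model in Python. It is only a sketch of the concept, not how Solaris or FMA actually implement it: the Chip class, its method names and the least-loaded placement policy are all made up for illustration. The 8 cores x 4 threads layout matches a fully populated T1.

```python
from dataclasses import dataclass, field

CORES = 8             # a fully populated UltraSPARC T1 has 8 cores
THREADS_PER_CORE = 4  # each core runs 4 hardware threads


@dataclass
class Chip:
    """Hypothetical model: online hardware threads and the LWPs placed on them."""
    # Keyed by (core, thread); the value is the list of LWP ids on that thread.
    online: dict = field(default_factory=lambda: {
        (c, t): [] for c in range(CORES) for t in range(THREADS_PER_CORE)
    })

    def schedule(self, lwp):
        """Place an LWP on the least-loaded online thread (assumed policy)."""
        if not self.online:
            raise RuntimeError("no hardware threads left online")
        target = min(self.online, key=lambda k: len(self.online[k]))
        self.online[target].append(lwp)
        return target

    def blacklist_thread(self, core, thread):
        """Take one hardware thread offline and reschedule its LWPs."""
        displaced = self.online.pop((core, thread), [])
        for lwp in displaced:
            self.schedule(lwp)

    def blacklist_core(self, core):
        """Take all threads of a core offline, then reschedule their LWPs."""
        displaced = []
        for t in range(THREADS_PER_CORE):
            displaced.extend(self.online.pop((core, t), []))
        for lwp in displaced:
            self.schedule(lwp)

    def lwp_count(self):
        """Total number of LWPs currently placed on online threads."""
        return sum(len(lwps) for lwps in self.online.values())


# Usage: 64 LWPs, then one core and one further thread are blacklisted.
chip = Chip()
for lwp in range(64):
    chip.schedule(lwp)
chip.blacklist_core(3)
chip.blacklist_thread(0, 1)
```

After the two blacklist calls, 27 of the original 32 hardware threads are still online and all 64 LWPs are still scheduled somewhere, which is the whole point: capacity shrinks, but nothing is lost.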

Of course there can be certain hard errors that force the system to
reboot with blacklisted threads/cores, but that can also be the case
with UltraSPARC III/IV.

The L2 cache could perhaps be considered a single point of failure, but
the L2 cache does have Error Correcting Code protection, so it is no
less fault tolerant than most RAID setups, and has a far longer MTBF. In
the event of a kernel access that gets an uncorrectable L2 cache error
on a dirty line (note the 3 unlikely criteria that must each be met),
the system will crash, but I'm sure there are similar edge cases for
most SMP platforms.

Overall, Kabira KTS on T1 in a T2000 should be running on an extremely
reliable and fault tolerant CPU and platform!
