Correctable DIMM Errors UCSB-B200-M3

A B-Series server had the following memory errors (image not reproduced here):

The memory errors were filling up the System Event Log as shown here:
FAB-INT-B# show sel 1/1
1 | 04/21/2014 17:44:13 | CIMC | Event Logging Disabled SEL_FULLNESS #0x8a | Log Area Reset/Cleared |  | Asserted
2 | 04/21/2014 17:44:13 | CIMC | System Event SEL_FULLNESS #0x8a | Upper critical - going high | Deasserted | Reading 0 <= Threshold 80 unspecified
3 | 04/21/2014 17:56:35 | CIMC | Memory DDR3_P2_F0_ECC #0x81 |  | read 199 correctable ECC errors on CPU2 DIMM F0  | Asserted
4 | 04/21/2014 18:13:32 | CIMC | Memory DDR3_P2_F0_ECC #0x81 |  | read 96 correctable ECC errors on CPU2 DIMM F0  | Asserted
5 | 04/21/2014 18:13:44 | CIMC | Memory DDR3_P2_F0_ECC #0x81 |  | read 102 correctable ECC errors on CPU2 DIMM F0  | Asserted
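
To get a feel for how quickly the errors accumulate, the SEL lines above can be tallied per DIMM. This is only a minimal sketch in Python: the helper name tally_correctable_errors and the regular expression are my own assumptions based on the three sample lines shown here, not a Cisco tool or a documented log format.

```python
import re
from collections import defaultdict

# Sample lines copied from the SEL output above.
SEL_LINES = [
    "3 | 04/21/2014 17:56:35 | CIMC | Memory DDR3_P2_F0_ECC #0x81 |  | read 199 correctable ECC errors on CPU2 DIMM F0  | Asserted",
    "4 | 04/21/2014 18:13:32 | CIMC | Memory DDR3_P2_F0_ECC #0x81 |  | read 96 correctable ECC errors on CPU2 DIMM F0  | Asserted",
    "5 | 04/21/2014 18:13:44 | CIMC | Memory DDR3_P2_F0_ECC #0x81 |  | read 102 correctable ECC errors on CPU2 DIMM F0  | Asserted",
]

# Assumed pattern, inferred only from the sample lines above.
ECC_RE = re.compile(r"read (\d+) correctable ECC errors on (CPU\d+) DIMM (\w+)")


def tally_correctable_errors(lines):
    """Sum the reported correctable ECC error counts per (CPU, DIMM) pair."""
    totals = defaultdict(int)
    for line in lines:
        match = ECC_RE.search(line)
        if match:  # lines that are not ECC events are skipped
            count, cpu, dimm = match.groups()
            totals[(cpu, dimm)] += int(count)
    return dict(totals)


if __name__ == "__main__":
    for (cpu, dimm), total in sorted(tally_correctable_errors(SEL_LINES).items()):
        print(f"{cpu} DIMM {dimm}: {total} correctable ECC errors")
```

For the three entries above, this adds up to 397 correctable errors on CPU2 DIMM F0 in under 20 minutes.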

I cleared the SEL on server 1/1 as follows:
FAB-INT-B# clear sel 1/1
FAB-INT-B# commit-buffer

After clearing the SEL, the memory status for server 1/1 (chassis ID/server ID) shows the DIMM in location “F0” with Operability “Degraded”:
FAB-INT-B /chassis/server # show memory detail | b "Location: F0"
Location: F0
Presence: Equipped
Overall Status: Operable
Operability: Degraded
Visibility: Yes
Product Name: 8GB DDR3-1600MHz RDIMM/PC3-12800/dual rank/1.35V
PID: UCS-MR-1X082RY-A
VID: V01
Vendor: 0xAD00
Vendor Description: Hynix.
Vendor Part Number: HMT31GR7CFR4A-PB
Vendor Serial (SN): 1F7938FE
HW Revision: 0
Form Factor: DIMM
Type: DDR3
Capacity (MB): 8192
Clock: 1600
Latency: 0.600000
Width: 64
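
If you save the "show memory detail" output to a file, a short script can pick out the degraded DIMMs for you. The following is a rough sketch in Python that assumes the output is plain "Key: Value" lines with each DIMM record starting at a "Location:" line, as in the excerpt above. The function names parse_memory_detail and degraded_locations are hypothetical helpers, not part of UCS Manager.

```python
# Excerpt of the output above; only a few fields are needed for the example.
SAMPLE = """\
Location: F0
Presence: Equipped
Overall Status: Operable
Operability: Degraded
PID: UCS-MR-1X082RY-A
Capacity (MB): 8192
"""


def parse_memory_detail(text):
    """Turn 'Key: Value' lines into one dict per DIMM.

    Assumption: every DIMM record begins with a 'Location:' line.
    """
    dimms = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if ":" not in line:
            continue  # skip blank lines and anything that is not a field
        key, _, value = line.partition(":")  # split on the first colon only
        key, value = key.strip(), value.strip()
        if key == "Location":
            current = {}
            dimms.append(current)
        if current is not None:
            current[key] = value
    return dimms


def degraded_locations(dimms):
    """Return the locations of all DIMMs whose Operability is 'Degraded'."""
    return [d["Location"] for d in dimms if d.get("Operability") == "Degraded"]


if __name__ == "__main__":
    print(degraded_locations(parse_memory_detail(SAMPLE)))
```

Run against the excerpt, this reports location F0 as the only degraded DIMM.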

That “Degraded” state took me to the following Cisco doc:
http://www.cisco.com/c/en/us/td/docs/unified_computing/ucs/ts/guide_old_FM/TS_Server.html#wp1073850

That document states the following:
Correctable DIMM Errors
DIMMs with correctable errors are not disabled and are available for the OS to use. The total memory and effective memory are the same (memory mirroring is taken into account). These correctable errors are reported in Cisco UCS Manager as degraded.
If you see a correctable error reported that matches the information above, the problem can be corrected by resetting the BMC instead of reseating or resetting the blade server. Use the following Cisco UCS Manager CLI commands:

UCS1-A# scope server x/y
UCS1-A /chassis/server # scope bmc
UCS1-A /chassis/server/bmc # reset
UCS1-A /chassis/server/bmc* # commit-buffer

Apparently the BMC scope has since been deprecated in favor of CIMC, so the correct syntax is:
UCS1-A# scope server x/y
UCS1-A /chassis/server # scope cimc
UCS1-A /chassis/server/cimc # reset
UCS1-A /chassis/server/cimc* # commit-buffer
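
If you need to do this on several blades, it can help to generate the command sequence for each chassis/server pair so you can paste it into a UCS Manager CLI session. This is just a sketch in Python: cimc_reset_commands is a hypothetical helper that prints the same four commands shown above, and it does not connect to or run anything on UCS Manager.

```python
def cimc_reset_commands(chassis_id, server_id):
    """Return the CLI commands (without prompts) to reset the CIMC of one blade.

    The command text is taken from the sequence above; only the
    chassis/server IDs are filled in.
    """
    return [
        f"scope server {chassis_id}/{server_id}",
        "scope cimc",
        "reset",
        "commit-buffer",
    ]


if __name__ == "__main__":
    # Example: server 1/1, the blade discussed in this post.
    for command in cimc_reset_commands(1, 1):
        print(command)
```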

Per the document referenced above:
“Resetting the BMC does not impact the OS running on the blade.”

So this command should reset only the out-of-band (OOB) management controller and should not be service-impacting. I’ll follow up when I’ve had a chance to perform this workaround and confirm that it worked.

About Amir Safayan

I live with my family in beautiful Colorado. We have a basset / beagle named Cisco. We board / ski, ride dirtbikes, RV and enjoy living in the Western US. I've done R/S, security, wireless, IPT and am now focused on virtualization technologies on the High Touch Team for Shoregroup.com, a large Cisco reseller.