Stabilizing ESXi on Lenovo X3850

It’s been a while since I posted my first blog. For the last couple of weeks I’ve been busy stabilizing my ESXi environment, and I want to share that journey with you, in the hope that you can also benefit from the ~~problems I encountered~~ knowledge I gained 😀

The journey

The issues started with a PSOD on one of the ESXi hosts. The first question from L1 support is always whether you’re at the latest firmware level for your UEFI & IMM. I can tell you that in a multi-tenant environment it’s not always easy to stay at the latest firmware level. You know the change management process? Well, dealing with that process is something I can add to my list of “knowledge gained” 😉
There had been other PSODs in the past, but always with different root causes. Only after the fourth or fifth crash in a 3-month period was I assigned the task of getting the environment stable. Needless to say, in a multi-tenant environment the customers were getting rather nervous about its stability.
I started by analyzing the environment and then created a stabilization plan. One part was the firmware upgrade of the UEFI / IMM & adapters (NIC / FC / ….). This was checked against the VMware HCL => https://www.vmware.com/resources/compatibility/search.php
Another part of my stabilization plan was the upgrade to vSphere 6.7 U1. I was really excited about the HTML5 client, which now supported all the features. But the upgrade was mostly driven by the higher number of supported RDM disks, which we needed because of our large number of Microsoft clusters.

During the upgrade of my environment (vROps / Log Insight / PSC / VCSA / SRM) to 6.7 U1 I had other crashes, but it got interesting when we started getting the same Critical uncorrectable processor error every time, shown below.

In less than 2 months we got 4 of these crashes. But at least they were now consistent. After the 2nd similar crash I involved our Lenovo L3 support to get this analyzed. Maybe there were known issues with this type of hardware, or other issues they could help us with.

What did I learn?

I summarized the different topics below:

1. Always make sure that the firmware on UEFI / IMM & adapters is upgraded to the latest version. In a Lenovo environment, this can be done with Lenovo XClarity Administrator. It can be simplified even more by using Lenovo XClarity Integrator, on which I’ll write a separate article once I have the Integrator implemented.

2. There is another interesting link from Lenovo where you can see the best recipe for your Lenovo hardware. It also contains the VIBs, depending on the version of ESXi you’re running: https://download.lenovo.com/server-repo/vmware/content/latest_index/

3. Configure the coredump to go to a file instead of a partition. This is explained in https://kb.vmware.com/s/article/2077516
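
As a rough sketch of the coredump-to-file setup that the KB covers, the commands could look like the block below. The datastore name `datastore1` and the file name `esx01-coredump` are placeholders made up for illustration, and the small `run` wrapper with its `DRY_RUN` switch is only a convenience of this sketch, not something from the KB. Check the KB for the exact options on your ESXi version.

```shell
#!/bin/sh
# Sketch only: configure the ESXi coredump to go to a file.
# DATASTORE and DUMPFILE are hypothetical placeholder names.
DATASTORE="${DATASTORE:-datastore1}"
DUMPFILE="${DUMPFILE:-esx01-coredump}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, 0 = execute them

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

configure_coredump_file() {
    # Create the dump file on the chosen datastore.
    run esxcli system coredump file add --datastore "$DATASTORE" --file "$DUMPFILE"
    # Let ESXi select an available dump file and enable it.
    run esxcli system coredump file set --smart --enable true
    # List the dump files so you can confirm one is active.
    run esxcli system coredump file list
}

configure_coredump_file
```

By default this only prints the three commands. To actually apply them on a host, run it there with `DRY_RUN=0` and your own datastore and file name.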

On top of that, we also disabled the coredump to partition completely, at the request of Lenovo L3 support. This can be done by executing the commands below in a PuTTY session.
/bin/esxcli system coredump partition set -e 0
/bin/esxcli system coredump partition set -u
/bin/esxcli system coredump partition set -e 0

However, to keep this persistent across reboots you will also need to modify the file /etc/rc.local.d/local.sh
Edit that file with vi and paste those 3 commands in, above the closing exit 0 line, until you have the result below.
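
If you have more than a handful of hosts, the same edit could be scripted instead of done by hand in vi. The block below is only a sketch under a few assumptions: the function name `add_coredump_lines` is made up, and it assumes local.sh ends with the usual `exit 0` line, which the commands have to come before. Try it on a copy of the file first.

```shell
#!/bin/sh
# Sketch only: add the three coredump commands to local.sh, just before
# its final "exit 0" line. Pass the file to edit as the first argument,
# e.g. /etc/rc.local.d/local.sh on the ESXi host.

add_coredump_lines() {
    target="$1"

    # Do nothing if the commands were already added on an earlier run.
    if grep -q "coredump partition set" "$target"; then
        return 0
    fi

    tmp="${target}.new"
    # Keep everything except the closing "exit 0" ...
    grep -v '^exit 0' "$target" > "$tmp"
    # ... then append the commands and put "exit 0" back at the end.
    cat >> "$tmp" <<'EOF'
/bin/esxcli system coredump partition set -e 0
/bin/esxcli system coredump partition set -u
/bin/esxcli system coredump partition set -e 0
exit 0
EOF
    # Overwrite in place so the original file permissions are kept.
    cat "$tmp" > "$target"
    rm -f "$tmp"
}

# Only act when a file name was given on the command line.
if [ -n "${1:-}" ]; then
    add_coredump_lines "$1"
fi
```

The lines inside the here-document are what ends up at the bottom of local.sh. Running the script a second time changes nothing, because it first checks whether the commands are already present.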

4. If you have enabled logging to a syslog server and/or a Log Insight server, and you have configured the coredump to file as mentioned in point 3, then you can choose to reboot your server automatically after a PSOD. This is explained in https://kb.vmware.com/s/article/2042500

In a PuTTY session to your ESXi host you can run the following command:
esxcfg-advcfg -s 180 /Misc/BlueScreenTimeout (this configures a timeout of 3 min)

You can check your setting by running:
esxcli system settings advanced list -o /Misc/BlueScreenTimeout
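
The same setting can also be changed through esxcli, so that setting and checking use one tool. The block below is a small sketch of that: the function name `set_psod_timeout`, the `TIMEOUT_MINUTES` variable and the `DRY_RUN` switch are made up for this example, and it takes the timeout in minutes and converts it to the seconds the option expects.

```shell
#!/bin/sh
# Sketch only: set the PSOD auto-reboot timeout through esxcli.
TIMEOUT_MINUTES="${TIMEOUT_MINUTES:-3}"   # hypothetical default of 3 minutes
DRY_RUN="${DRY_RUN:-1}"                   # 1 = only print the commands, 0 = execute them

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

set_psod_timeout() {
    # /Misc/BlueScreenTimeout is expressed in seconds.
    seconds=$(( $1 * 60 ))
    run esxcli system settings advanced set -o /Misc/BlueScreenTimeout -i "$seconds"
    # Read the value back to confirm it was applied.
    run esxcli system settings advanced list -o /Misc/BlueScreenTimeout
}

set_psod_timeout "$TIMEOUT_MINUTES"
```

With the defaults this only prints the two commands for a 3 minute timeout. On the host itself, run it with `DRY_RUN=0` to apply the value.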

5. UEFI settings. There is a general article from Lenovo on system tuning => https://support.lenovo.com/be/en/solutions/ht115952 However, there is 1 extra setting that came out of our L3 support case which isn’t mentioned in that article. It’s the one you can see in the screenshot below.