Echoes Corsac.net - Echoes camshot
jeudi 13 octobre 2011 (1 post)

The issue

This morning (while I was running late for an appointement) I had a very weird stuff happening on my Thinkpad T61 laptop. Since I recently offered myself a shiny Thinkpad x201s, I have to admit I don't use much my T61 anymore. But this morning I had to print a page (for this appointement) and, as I didn't yet configured my printer on the x201s, I went to the T61. But I noticed that the network was down. I've tried quickly on wireless but, bad luck, my current wifi setup selects the channel automatically and it prefers choosing channels which aren't available in the US. Guess what, my T61 comes from the US and has those channels completely disabled, so no wireless available either.

The investigation

I first tried to

modprobe -r e1000e
modprobe e1000e

to see if it fixed the problem, but it didn't. Worse, the interface disappeared and never reappeared. I tried to reboot but it didn't fix the problem, the link was still down. Running really late, I put the file on a usb key and printed it from the powerbook and postponed the fix for later.

Now, this evening, I tried to investigate a bit more. Symptoms weren't only that the nic wasn't working, but there was a high load on the system (1-2 at idle), unresponsiveness every second or so, and watching top I could see spikes of high cpu usage for the kworker kernel thread. Typing that on google you can find a lot of people running on this issue, usually starting around kernel 2.6.36 or 2.6.37. Now, I might have upgraded the kernel recently to 3.0.0-4, but that didn't look related since the problem first appeared when the laptop was up and running. And I tried to reboot under 2.6.39, 2.6.38 and even 2.6.32 and the problem was still present. Each time, unloading the module would fix the problem, but loading it again wouldn't make the interface reappear. People advised to boot with pcie_ports=compat but that didn't do anything. I tried to boot without intel_iommu=force (disable Intel Vt-d) and pcie_aspm (Active State Power Management) but nothing either.

Considering a userland issue, I've tried to boot a grml live distro (always keep a grml.iso in your /boot, extlinux-update will even put it in your menu automatically), and the problem was still present. So not a Debian kernel issue, not a userland issue, only thing left was the laptop. I didn't update the Bios recently, so I wondered exactly what could be the problem. I started to feel a little bad, since I still really like that laptop, and that I already decided to lend it to my sister since her own T61 is sitting with a dead system board in my shelf. I know she might have some negative waves, but she was not even landed when the problem first appear.

The fix

Then I had a flash. It's not mystery that I'm used to break network cards, and I had the bright idea to shutdown the laptop, disconnect AC and battery, then let it idle a bit. I even tried the secret Thinkpad power button code but I think it's unrelated. Then I re-plugged the battery, booted to grml and the issue was gone. I rebooted on the standard Debian and the link was up, network was working.

So what happenned?

The (tentative) explanation

My guess is that, somehow, the network card firmware has an issue and choked on something (a network frame or an attack exactly like the one we demonstrated on ASF firmware). In fact, no, I don't think it's the e1000e firmware. My T61 comes with Intel vPro, which includes AMT (Active Management Technology), a remote management solution a bit like ASF but more advanced. As far as I know, AMT firmware always runs, even when it's disabled, it's just completely idle. Idle, but in this case I think it choked on something, and a reboot isn't enough to restart the AMT firmware. But a real hard reset without any power seems to do the trick.

What next?

Well, a part of me is pretty scared, but another is just bored. I mean, we know about that, that's exactly the kind of issue we are warning people of. I have no idea what exactly happened, and there's no way I'll be able to reproduce that, but I'm pretty sure it's something lying at a pretty low level in the platform, and which can severely disable your workstation. Now if it happens again I won't lose too much time on this.

TL;DR: helping other people

In case you came here because you searched on google terms like “kworker cpu usage”, e1000e, interrupts, it might be a good idea to first reboot on a live CD to eliminate installation issues, then shutdown the laptop, remove the battery and let it few seconds idle. This might be enough to reset “something” inside and fix the situation.

Corsac@22:44:53 (Debian)

Images
Stats
  • 1526 posts
  • 7960 jours
  • 0.19 posts/jour
  • IRC
  • Last.fm
Stuff
Tech
Weblogs
Desktop