I'm not a programmer but a sysadmin. I ran into an issue a few months ago that nearly drove me crazy. I was custom-building a new server for a customer: a very nice build with a 10-core Xeon E5, 64GB of ECC DDR RAM, Windows Server 2012 R2, SSDs, etc.
Everything is going well: update and configure the BIOS, install Windows, install drivers and software. Then I start configuring the server. The server needs to reboot, and it reboots to a blue screen of death. Can't get it to boot up normally. OK, must be bad software, a driver, etc. Time for a clean install. Everything is going fine, and then reboot -> blue screen. I look up the bug check code: no help. Another clean install, this time with no extra software and no drivers; same problem after a reboot. I finally figure out that it only happens after I make the machine a domain controller.

After Googling with this new information, I find one thread on some forum where people are having the same problem. It turns out it was the solid state drive: if you use a Samsung 850 Evo SSD with Windows Server 2012 and make the machine a domain controller, it will blue screen. I never thought a problem like this was possible. Sure enough, I changed the installation drive and had no more problems. It nearly drove me crazy and took me two days of troubleshooting.
Haha, those multi-hundred-thousand-line SSD firmwares need to do something. In particular, they detect popular filesystems and try to optimize some operations (an optimization is considered acceptable if the filesystem is still recoverable by chkdsk after a crash; e.g. the actual commit of free-block info can be delayed). Seems like the detection glitched on the blocks that contained the domain-controller data.
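Purely for illustration, here's a toy Python sketch of the kind of boot-sector signature sniffing a drive could do to guess which filesystem it holds. The NTFS OEM ID and the ext superblock magic are real; the function and everything around it is hypothetical:

    # Toy sketch only, not real firmware code.
    NTFS_OEM  = b"NTFS    "   # bytes 3..11 of an NTFS boot sector
    EXT_MAGIC = 0xEF53        # u16 at offset 56 of the ext superblock

    def guess_filesystem(first_blocks: bytes) -> str:
        """first_blocks: at least the first 2 KiB of the partition."""
        if first_blocks[3:11] == NTFS_OEM:
            return "ntfs"
        sb = first_blocks[1024:]  # the ext2/3/4 superblock starts 1024 bytes in
        if len(sb) >= 58 and int.from_bytes(sb[56:58], "little") == EXT_MAGIC:
            return "ext"
        return "unknown"

A firmware that special-cases writes based on a guess like this could plausibly misfire when unusual data (say, the AD database a domain controller creates) happens to land on the blocks it sniffs.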
Thinking about SSD and HDD firmware can make a guy reach for tape drives as a last-ditch hope that data stays just data.
At least with those (and to some degree optical media, though the write process there is more laborious), the storage and the read/write logic are separate pieces.
The closest I've come to that is tinkering with random old hardware: a specific ISA (PnP) sound card and a specific 4GB IDE HDD I have here hate each other. If both are in the same system, when Linux sends the IDENTIFY IDE command, it times out and never gets a response. This "works" across two different motherboards.
Thankfully I was just messing around. I knew the system I was using had used up 7 or 8 of its 9 lives, so I was being careful, and I only changed one or two pieces of system state at a time so I could easily roll back (as nonsensical as "okay, let's try removing the sound card" is :P).
If anyone wants to know, the model info shouldn't be too hard to locate, and pics would be fun to grab; I'd just have to go find all the parts.
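For the curious, here's roughly what that IDENTIFY probe looks like when issued by hand from userspace on Linux: a hedged Python sketch using the old HDIO_DRIVE_CMD ioctl (the same mechanism hdparm uses). The constants come from <linux/hdreg.h>; /dev/hda is a placeholder and the whole thing needs root:

    import fcntl

    HDIO_DRIVE_CMD = 0x031F   # "issue raw ATA command" ioctl from <linux/hdreg.h>
    ATA_IDENTIFY   = 0xEC     # the IDENTIFY DEVICE opcode (WIN_IDENTIFY)

    # Request layout: [command, sector, feature, nsect], then 512 bytes of reply.
    buf = bytearray([ATA_IDENTIFY, 0, 0, 1]) + bytearray(512)
    with open("/dev/hda", "rb") as disk:          # placeholder device node
        fcntl.ioctl(disk, HDIO_DRIVE_CMD, buf)    # the call that would hang here

    # Model string lives at words 27-46 of the reply; ATA strings typically
    # come back with each byte pair swapped, so swap them back before printing.
    raw = buf[4 + 54 : 4 + 94]
    model = b"".join(bytes([hi, lo]) for lo, hi in zip(raw[::2], raw[1::2]))
    print(model.decode("ascii", "replace").strip())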
Years ago (maybe 10) I decided to update Ubuntu on my dev machine. Everything goes well; I boot into the new system and start playing around. But after some time the internet connection drops in the middle of an apt-get. I am furious and start cursing at my ISP. I reboot the system and, thankfully, the internet is back on. I start installing new packages, and after some time the connection drops again. I was like "seriously???", reboot again, and the net is on again.
It took me the whole day to figure out that the particular kernel version I had updated to made my particular brand of router drop the connection to the ISP after ~15 minutes. I had to downgrade the kernel to make it work again...
Another story about Ubuntu, and this time I'm the idiot.
When you're a sysadmin who has only dealt with one flavor of Linux for a long time, you forget how the different distros behave. There really isn't that much difference: naming conventions, subtle commands, etc.
But apt packages have the nature of "always run the latest service once upgraded", a notion that does not exist in CentOS/RHEL/Fedora/BSD, etc.
So apt-get upgrade on our production database server means the server goes away for a while, not "the packages are available to be used when you restart the service at midnight".
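For anyone bitten by this, Debian/Ubuntu do have an escape hatch: invoke-rc.d consults /usr/sbin/policy-rc.d before touching a service, and an exit code of 101 means "action forbidden by policy". The file itself is just a two-line shell script; here's a hedged sketch of dropping it in place and removing it, wrapped in Python to match the other examples (needs root):

    import os

    POLICY = "/usr/sbin/policy-rc.d"

    def block_service_restarts():
        # While this file exists and exits 101, package operations
        # won't start or restart services.
        with open(POLICY, "w") as f:
            f.write("#!/bin/sh\nexit 101\n")
        os.chmod(POLICY, 0o755)

    def allow_service_restarts():
        os.remove(POLICY)

Wrap your apt-get upgrade between the two calls and restart the service yourself at midnight.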
I was working on a dev server (which was pretty much identical to the prod servers): a VM host with KVM on Ubuntu 10.04 LTS.
One day I'm staging my updates to make sure they'll work when I go to prod. I do my apt-get and reboot into the new kernel... but nothing can be mounted.
None of the developers can work either; they depend on that machine.
I'm looking at it, and I can mount the drives from busybox after the kernel panics. No idea what's going on.
I'm asking around in #ubuntu on freenode, since googling is getting me nowhere. 'Just reinstall it' they say.
'Sometimes it's better not to know why these things happen'
:|
I found the issue: some person was smart enough to think lvm2 was called "lvm", and the upgrade procedure then regenerated the initramfs for all my previous kernels without it, so I couldn't boot into the old kernels either.
Lost a day of dev time (although it was only three developers and me).
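For anyone who hits the same thing: the way out (without reinstalling) is to boot a live CD, chroot into the broken root, and rebuild the initramfs for every installed kernel with the lvm2 hooks back in place. A hedged sketch, with package and command names as on Ubuntu of that era, wrapped in Python to match the other examples:

    # Run inside a chroot of the broken root filesystem (from a live CD).
    import subprocess

    for cmd in (
        ["apt-get", "install", "--reinstall", "lvm2"],  # restore the lvm initramfs hooks
        ["update-initramfs", "-u", "-k", "all"],        # regenerate initramfs for all kernels
        ["update-grub"],                                # refresh the boot menu
    ):
        subprocess.run(cmd, check=True)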
Many of you might have faced this issue, but it was pretty weird facing it the first time.
The problem was with the Realtek sound chip on my card. Headphones stop working on Windows (even though it detects them), but still work on Ubuntu. I try all the usual stuff (remove, update, reinstall drivers, etc.); nothing works. One fine day I find it starts working again. Then another day it stops working. I realise that sound stops working after I switch from Ubuntu to Windows. Weird. How does Ubuntu have an effect on sound in Windows? I Google the issue and find that it's some strange Ubuntu driver issue that puts the audio hardware in a weird state. The workaround is to shut down the laptop, remove the battery, and hold down the power button for 5 seconds to discharge the capacitors, then boot into Windows.
That is insane. And also one of the reasons I try to avoid being a system integrator whenever possible. It just takes up too much time - so much simpler having my friendly Supermicro distributor build my server for me!
I almost lost faith in my abilities through the ordeal. I love building, picking every single component, I love having the latest hardware.
This particular build was set up to be primarily a Hyper-V host; for the VM drive I chose an Intel 750 PCIe SSD. The drive is amazing, super fast, makes traditional SSDs look like spinning disk. I chose the EVO for the OS drive to save the customer a few bucks, as the server OS wasn't going to be doing any heavy lifting. I never imagined that a hard drive could be incompatible with an OS.