Wednesday, 4 July 2018

Another dead SSD, lost files and lost hours...

So recently, my home desktop Windows 10 PC started refusing to boot. It got to the BIOS splash screen, twiddled for a while and then errored. Apparently, it couldn't "automatically repair" the system, so it offered "advanced options" instead.

Long story short, Windows' handling of hardware errors really sucks! They have made an effort to provide an automatic Recovery Environment, which is great, but navigating and understanding it is still hard, even for a programmer like me. The information on the web for IT solutions is even worse than it is for programming questions - millions of "helpful" articles from adware sites, old sites, official sites, unofficial sites and people touting their magic tool.

For me, the failure of a boot might mean two things: hardware failure or corrupted system files. What I expected:

  1. Check Hardware option
  2. Check System Files option

What do you get instead?

  1. Reset system. Literally reinstall Windows and lose all of your apps in the process! A very inconvenient option, and it is the first option you are offered. All the others are under advanced options.
  2. Use a device. Not sure what I am supposed to do with this. I can boot into another device using the boot menu; is this the same thing?
  3. Shutdown. Thanks.
  4. Go back to a restore point. OK, this sounds promising but, again, if the problem is a simple file corruption, I should not be doing something so big. Anyway, despite the fact that restore points are supposed to be automatic, I had none listed from the 6 months that Windows has been installed and updated.
  5. A command prompt. Useful for advanced people, but what do I do with that? Drive letters are all over the place, and a virtual disk contains the contents of the recovery environment?

I finally found some help online and ran chkdsk, which reported no problems. This was a shame because, as I later discovered, it was the disk that was at fault. I also ran the system file checker, sfc /scannow, and again it reported no problems. Why aren't these just buttons in the recovery environment?

I then found these instructions about running the Windows Management Instrumentation console and checking the disks; again, both reported OK. I suspect SSDs would only report a problem if the controller was knackered, rather than the more likely cause of bad NAND memory.

I tried the Windows 10 memory test tool several times but it always hung at test 4, 27%, even after 10 hours! Again, what use is a tool that hangs?

I found another tool, whose name I can't remember, and it reported various combinations of "no errors", "access is denied" and "no windows installations found". Again, this is not particularly useful for 100% of the population; it should be wrapped in a utility that can be run graphically, with a result that says something like, "problems were found, possible causes include hardware failure or windows corruption".

Of course, unless you are someone who completely understands the BIOS, loader and OS, you are relying on unreliable advice online, when Microsoft absolutely should have the go-to guide for running the various tools in a certain order to understand what is happening. I was also surprised how many people were recommending GUI tools that only run in Windows to fix startup issues!

I decided I only had one option: copy everything from the main disk to the second disk using xcopy, then reinstall Windows on the main disk and copy stuff back. What I didn't realise until later is that if xcopy encounters an error, it stops copying, so my documents were never copied at all!
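For what it's worth, xcopy does have a /C switch that tells it to carry on after errors. The same idea can be sketched in Python for anyone rescuing files from a flaky disk: copy what can be copied and keep a list of what could not. This is only a minimal sketch; the function name copy_tree_skipping_errors is made up for illustration, and it assumes Python is available on whatever system you are rescuing from.

```python
import os
import shutil


def copy_tree_skipping_errors(src, dst):
    """Copy every file under src into dst, recording failures instead of stopping.

    Returns (copied, failed): a list of copied source paths and a list of
    (path, error message) pairs for everything that could not be copied.
    """
    copied, failed = [], []

    def note_walk_error(exc):
        # os.walk calls this when a directory cannot be listed.
        failed.append((exc.filename, str(exc)))

    for root, _dirs, files in os.walk(src, onerror=note_walk_error):
        rel = os.path.relpath(root, src)
        target_dir = os.path.normpath(os.path.join(dst, rel))
        try:
            os.makedirs(target_dir, exist_ok=True)
        except OSError as exc:
            failed.append((target_dir, str(exc)))
            continue
        for name in files:
            source = os.path.join(root, name)
            try:
                # copy2 also tries to preserve timestamps and other metadata.
                shutil.copy2(source, os.path.join(target_dir, name))
                copied.append(source)
            except OSError as exc:
                # A bad file is logged and skipped; the loop carries on.
                failed.append((source, str(exc)))
    return copied, failed
```

Calling it with the failing disk as src and the second disk as dst, then printing the failed list, would at least have shown me up front that my documents had not made it across.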

I then ran the Windows 10 installer on the main disk and got some random error code - again, why can't the installer simply say: "This shouldn't happen, you have a hardware problem"? I tried again with the USB installer in another port and this time it just hung at a certain point - timeout, anyone?

This is when I assumed it must be a bad SSD (not a bad assumption considering my experience with Kingston drives!) and had to work out how to move the data off the second drive so I could install Windows onto that instead. The second drive had mostly temp stuff and downloads, so it was fine to lose that. This is also when I realised I did not have anything useful copied from the old disk anyway, so I wiped the second drive and reinstalled Windows 10. Of course, I then had to wait about 2 hours for all the updates to run and get it back to the latest version of Windows 10.

I also need to work out how to recover anything that I have now lost. I have deployed versions of two web apps, which are mostly up to date on other servers, so thankfully I can get those back. I have lost my VirtualBox VM, which had the test databases on it, so hopefully I can get those from the live versions too.

Top Tips so you will not write a blog post like this!

  1. As soon as you get a new PC or install Windows, make sure it is all up to date and take a System Restore point.
  2. Every time you install something new of any size, create a new restore point.
  3. Consider spreading things across disks within your PC. The Windows disk is both likely to fail first and also to fail most obviously. The more you have on other disks, the less you lose if you lose the Windows disk. If you have a problem with the other disk, at least you can load Windows and deal with the problems on the other disks - in most cases, you can copy most things off at the first sign of a problem on the other disk.
  4. Back up to a separate physical device - ideally using a decent tool like Acronis TrueImage, but even Windows 10 backup is better than nothing. Make sure you include your documents. Most other stuff can be restored onto a clean Windows installation, even if it is a pain to install Pinnacle Studio from scratch again! Network-attached disks can be fairly cheap, easy to use and fast to back up to (I use a Western Digital MyCloud EX2, but there are plenty of others).
  5. Using cloud storage for backups or main files is slightly risky if they are not encrypted, because you expose your personal data in a public space. You also risk deleting something on another device and having the deletion sync to your main device.
  6. If you are writing code, make use of off-site repositories like GitHub or Bitbucket and get in the habit of frequent commits so you lose the least amount of data. If you can define your database schemas in code, these can also help you quickly restore databases if you lose your whole system.
  7. If you can, test the restoration of data from a backup at least once. A backup isn't a backup if you've never tested it. I lost an entire backup once for the simple reason that I had used a strong password, forgotten it and never tested the restore! If I had, I would have realised and simply created a new backup.
  8. Keep a simple list of any non-free software that you have installed, especially now that you might not have physical media and keys. How can you get the installer and keys again? What happens if you bought an upgrade key for an older version? Can you install the new version without having to install the old one?
  9. Don't use Kingston SSDs!
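
Tip 7 can be partly automated if your backup is a plain copy of files (it will not work against an image file from a tool like Acronis, which needs its own restore test). The sketch below compares SHA-256 hashes of every file in a source folder against the copy in a backup folder and reports anything missing or different. The names file_digest and verify_backup are made up for illustration.

```python
import hashlib
import os


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_backup(source_dir, backup_dir):
    """Return (missing, different): relative paths absent from, or changed in, the backup."""
    missing, different = [], []
    for root, _dirs, files in os.walk(source_dir):
        for name in files:
            source = os.path.join(root, name)
            rel = os.path.relpath(source, source_dir)
            copy = os.path.join(backup_dir, rel)
            if not os.path.isfile(copy):
                missing.append(rel)
            elif file_digest(source) != file_digest(copy):
                different.append(rel)
    return sorted(missing), sorted(different)
```

Two empty lists mean every source file has an identical copy in the backup. It does not prove you can restore a whole system, but it would catch a backup that silently skipped your documents.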