IT for mere mortals - Troubleshooting 101
There are two kinds of trouble: the kind you can reproduce and the kind you can’t. You can always solve a problem you can reproduce, because you can work through it systematically. Your only enemy is time, in the form of deadlines. The other kind cannot be reproduced. It happens randomly. You can never solve an intermittent problem. Most of the time it is either dismissed as a glitch, or the faulty part is replaced with something proven to work. The question now is how to tell whether the problem you are having is reproducible or intermittent. That is the shooting part of troubleshooting.
Read The Freaking Manual. If you are a user or a customer, whether the IT guy (or any other technician) you are dealing with does this is probably a great hint of how competent he is. This is not a Mars and Venus thing, and it is not a matter of style. This is what the manual is for. Failing to do it is like refusing to look at a map when you are lost.
The good ones will ask for it. The great ones will search for and download the manual when they ask and do not get it.
Restart systematically, from simply closing the application, to logging out, to rebooting, down to cold booting. If the problem reappears at any of those stages, you have reproducible trouble. You can now proceed to eliminate possibilities, depending on the problem.
If the problem goes away, do not dismiss it as a glitch right away. An intermittent problem may well be a reproducible problem whose pattern you do not know yet. Monitor it for at least a week (the longer the better). If the problem never shows up during the monitoring period, you are most likely dealing with an intermittent problem. Otherwise, you know the drill.
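For those who think better in code, here is a minimal Python sketch of the decision above. It is only one reading of the procedure: the function names, the `problem_reappears` callback (you, checking by hand after each step) and the seven-day default are assumptions made for illustration.

```python
from typing import Callable, Optional

# The restart ladder from the text, mildest step first.
RESTART_STEPS = ["close the application", "log out", "reboot", "cold boot"]


def first_failing_step(problem_reappears: Callable[[str], bool]) -> Optional[str]:
    """Return the first restart step after which the problem came back, if any."""
    for step in RESTART_STEPS:
        if problem_reappears(step):
            return step
    return None


def classify(
    problem_reappears: Callable[[str], bool],
    days_monitored: int = 0,
    seen_while_monitoring: bool = False,
    min_days: int = 7,  # "at least a week"; longer is better
) -> str:
    """Rough triage: reproducible, likely intermittent, or not enough data yet."""
    if first_failing_step(problem_reappears) is not None or seen_while_monitoring:
        return "reproducible"
    if days_monitored >= min_days:
        return "likely intermittent"
    return "keep monitoring"
```

Calling `classify(lambda step: step == "reboot")` reports a reproducible problem, while a quiet week with no reappearance comes back as likely intermittent.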
Changing something and testing it, or what is affectionately called “trial and error”, is considered a basic troubleshooting instinct because even an uneducated person would think of it. That is probably why most people look down on the approach, as if the more you know, the less “trial and error” you should do. Well, those who think like that prove that education and intelligence are two different things.
The ability to do trial and error is one of the things that keeps humans superior to machines and computers. It is the reason a human never hangs. So when you have no clear indicator of what went wrong, practice it proudly, especially when you are a pro who deals with deadlines. Business is result oriented, and this is the fastest way to yield results.
You have a display problem? Replace the monitor with a known good one. If the problem stays, then it is either in the graphics card or in the graphics card slot on the motherboard. Replace the card with a known good one and you will zero in on the culprit in no time. Similarly, if you run into a faulty program, comparing it with another installation on another machine will hint at where to go next.
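The swap-and-test routine is really just a loop. A rough Python sketch follows, where `find_culprit` and the `works_with_known_good` callback are hypothetical names: the callback stands for you putting in one known good part, testing, and reporting whether the problem went away.

```python
from typing import Callable, Optional, Sequence


def find_culprit(
    suspects: Sequence[str],
    works_with_known_good: Callable[[str], bool],
) -> Optional[str]:
    """Swap in one known good part at a time and return the first swap that fixes things.

    Returns None when no swap helps, which means the fault lies in something
    that was not on the suspect list.
    """
    for part in suspects:
        if works_with_known_good(part):
            return part
    return None
```

For the display example, the suspects would be `["monitor", "graphics card"]`. If the function returns `None`, what is left is the card slot on the motherboard.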
Always back up the current state when doing trial and error with software!
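A backup does not have to be fancy. As a minimal sketch, assuming the state you care about lives in a single file or folder, a timestamped copy made with Python's standard library is enough. The `snapshot` name and the paths you pass in are placeholders for illustration.

```python
import shutil
import time
from pathlib import Path


def snapshot(target: Path, backup_root: Path) -> Path:
    """Copy a file or folder to a timestamped location before you start changing things."""
    stamp = time.strftime("%Y%m%d-%H%M%S")
    dest = backup_root / f"{target.name}-{stamp}"
    if target.is_dir():
        # copytree creates the destination and any missing parent folders.
        shutil.copytree(target, dest)
    else:
        backup_root.mkdir(parents=True, exist_ok=True)
        shutil.copy2(target, dest)  # copy2 also keeps file timestamps
    return dest
```

Take one snapshot before every experiment, so that each failed trial can be rolled back to the state just before it.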
When the practical approach fails, read the freaking logs: Event Viewer, boot logs, application logs, any kind of relevant log. Filter them by status to see the obvious (error/fail and warning) and trace them by time. Often you don’t have them ready because they take up so much space that you switched them off. If time permits, don’t panic. You can always generate them. Put the questionable object under a monitoring period, during which you switch the logging service on to monitor the relevant parameters, and wait.
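Filtering by status and tracing by time is easy to script once the logs are plain text. The Python sketch below assumes a made-up line format of `YYYY-MM-DD HH:MM:SS LEVEL message`. Real logs differ, so the parsing would need adjusting, and `suspicious_entries` is just an illustrative name.

```python
from datetime import datetime
from typing import Iterable, List, Tuple

# The statuses worth reading first, as in the text.
SEVERITIES = {"ERROR", "FAIL", "WARNING"}


def suspicious_entries(lines: Iterable[str]) -> List[Tuple[datetime, str, str]]:
    """Keep only error/fail/warning lines and return them in time order."""
    entries = []
    for line in lines:
        parts = line.strip().split(maxsplit=3)
        if len(parts) < 4:
            continue  # not in the assumed format
        date, clock, level, message = parts
        if level.upper() not in SEVERITIES:
            continue
        try:
            when = datetime.strptime(f"{date} {clock}", "%Y-%m-%d %H:%M:%S")
        except ValueError:
            continue  # unparseable timestamp, skip it
        entries.append((when, level.upper(), message))
    entries.sort(key=lambda entry: entry[0])
    return entries
```

Feed it the lines of a log file and read the result top to bottom: the earliest warning before the first error is often the most interesting line.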
What if you don’t have them and the deadline is tight? Do these next few things and pray.
Google the error message.
If there is an error message, copy and paste it into Google and search. You now know how common your problem is and whether it is random or reproducible, and often you will have the solution ready. Even if it is a worldwide unsolved problem, you can follow up on other people’s work instead of starting from scratch.
If you can ask, don’t search.
Push the panic button by making calls. Send up the flares through Twitter. SOS on every related forum you know. If time is crucial, this step might get bumped up to the first thing you do.
It is especially useful for finding out the worst-case scenario fast. To decide what constitutes the worst case, keep in mind that a network admin’s job is not to fight hackers, or to write up a serum for a virus, or to fix the hardware, or to debug software. The admin’s job is to keep running the services that the organization needs in order to function.
Revise manuals, update procedures, or list every solved problem in your helpdesk’s knowledge base or FAQ. Make it part of the M in RTFM so you don’t panic over the same thing over and over again.