Dictionary Definition
troubleshooter n : a worker whose job is to
locate and fix sources of trouble (especially in mechanical
devices) [syn: trouble
shooter]
Extensive Definition
Troubleshooting is a form of problem
solving. It is the systematic search for the source of a
problem so that it can be solved. Troubleshooting is often a
process
of elimination - eliminating potential causes of a problem.
Troubleshooting is used in many fields such as system
administration and electronics.
In general, troubleshooting is the identification
or diagnosis of
"trouble" in a system.
The problem is initially described as symptoms of malfunction, and
troubleshooting is the process of determining the causes of these
symptoms.
A system can be described in terms of its
expected or intended behavior (usually, for artificial systems, its
purpose). Events or inputs to the system are expected to generate
specific results or outputs. (For example selecting the "print"
option from various computer applications is intended to result in
hardcopy emerging from
some specific device). Any unexpected behavior, particularly an
undesirable one, is a symptom, and troubleshooting is the process of
isolating its specific cause or causes. Frequently the symptom is a
failure to observe any results (nothing was printed, for
example).
Most discussion of troubleshooting, and
especially training in formal troubleshooting procedures, is
extremely domain specific. The bulk of the material is relevant to
a particular field of endeavor (such as automotive repair, computer
hardware services, or software systems support). However,
troubleshooting has common elements regardless of the
specifics.
Any system can be described in terms of its
components or subsystems. Each subsystem can be described in terms
of its expected behavior. So the inputs to a system can be
described as a cascade of inputs and results among the components
of the system. (For example: selecting the "print" option in a
computer application may cause the software to call on a separate
utility, such as lpr on a
UNIX system;
that in turn might open, read and parse a number of configuration
files which might direct it to perform some form of hostname
address resolution via DNS,
NIS, or
LDAP, and then initiate a TCP/IP connection to
a specific network device, and so on).
The domain-specific knowledge that drives the
troubleshooting process is the understanding of these systems in
terms of the interactions and dependencies among their subsystems
and components. In particular, the specialist can enumerate the
components and knows a set of procedures for testing many of them
in isolation from the system as a whole. (For example the systems
administrator may know which configuration files lpr is trying to
parse and may read them manually, check their permissions, or may
assume the identity of the user who is experiencing the problem and
manually run an lpr command from the system's shell prompt; this
may isolate the problem to the application's configuration, the
user's preference settings, the workstation's configuration or
network settings, the network's name services domain, or back to
the printer's configuration or hardware).
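The idea of testing components in isolation can be sketched in a few lines of Python. This is only an illustration, not part of any printing system: the function names check_name_resolution, check_tcp_connect and run_checks are hypothetical, and the host name and port in the usage note are placeholders. Each check exercises one subsystem (name resolution, then the TCP connection) without involving the application or the print utility.

```python
import socket

def check_name_resolution(host):
    """Resolve the host directly, without involving any printing software."""
    try:
        infos = socket.getaddrinfo(host, None)
    except socket.gaierror as exc:
        return False, f"name resolution failed: {exc}"
    addresses = sorted({info[4][0] for info in infos})
    return True, "resolved to " + ", ".join(addresses)

def check_tcp_connect(host, port, timeout=3.0):
    """Open and immediately close a TCP connection to the device."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True, f"connected to {host}:{port}"
    except OSError as exc:
        return False, f"connection failed: {exc}"

def run_checks(checks):
    """Run (label, callable) pairs in order and stop at the first failure."""
    results = []
    for label, check in checks:
        ok, detail = check()
        results.append((label, ok, detail))
        if not ok:
            break
    return results
```

A call such as run_checks([("dns", lambda: check_name_resolution("printer.example.com")), ("tcp", lambda: check_tcp_connect("printer.example.com", 515))]) would report the first subsystem that fails; both the host and the port here are assumed values for the sake of the example.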
Well-designed systems have designated "test
points" or monitoring instrumentation. (For example most printers
have indicator lights which change colors or blink, or LCD
panels which display messages for detectable problems: paper jams,
empty paper trays, network or other cable disconnection, etc. As
another example UNIX and Linux systems support features for system
call tracing through commands like truss, strace, and
ktrace).
Usually troubleshooting is applied to something
that has suddenly stopped working, since its previously working
state forms the expectations about its continued behavior. So the
initial focus is often on recent changes to the system or to the
environment in which it exists. (For example a printer that "was
working when it was plugged in over there"). However, there is a
well known principle that correlation does not imply
causality. (For
example the failure of a device shortly after it's been plugged
into a different outlet doesn't necessarily mean that the events
were related. The failure could have been a matter of coincidence).
It's useful to consider the common experiences we
have with light bulbs. Light bulbs "burn out" more or less at
random; eventually the repeated heating and cooling of a bulb's
filament, and fluctuations in the power supplied to it, cause the
filament to crack or vaporize. The same principle applies to most other
electronic devices, and similar principles apply to mechanical
devices. Some failures are part of the normal wear-and-tear of
components in a system.
A basic principle in troubleshooting is to start
with the simplest and most probable problems
first. This is illustrated by the old saying "When you see hoof
prints, look for horses, not zebras", or to use another maxim, use
the KISS
principle. This principle leads to the common complaint that
help
desks or manuals sometimes first ask: "Is it plugged
in and does that receptacle have power?" This should not be
taken as an affront; rather, it should serve as a reminder or
conditioning to
always check the simple things first before calling for help.
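One way to make "simplest and most probable first" concrete is to rank candidate checks by estimated likelihood per unit of effort. The sketch below is a heuristic illustration only: the Check class, the order_checks function, and every probability and time estimate are invented for the example, and a real troubleshooter would supply their own estimates.

```python
from dataclasses import dataclass

@dataclass
class Check:
    name: str
    probability: float  # rough guess that this is the cause (assumed value)
    minutes: float      # rough effort needed to test it (assumed value)

def order_checks(checks):
    """Put likely, cheap checks first: sort by probability per minute."""
    return sorted(checks, key=lambda c: c.probability / c.minutes, reverse=True)

candidates = [
    Check("failed logic board", 0.05, 60),
    Check("driver misconfigured", 0.25, 15),
    Check("power cable unplugged", 0.30, 1),
    Check("paper tray empty", 0.20, 1),
]

for check in order_checks(candidates):
    print(check.name)
```

With these made-up numbers the one-minute checks (power cable, paper tray) come before the driver and the logic board, which matches the "is it plugged in?" question asked first by help desks.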
A troubleshooter could check each component in a
system one by one,
substituting known good components for each potentially suspect
one. However, this process of "serial substitution" can be
considered degenerate when components are substituted without
regard to a hypothesis concerning how their failure could result
in the symptoms being diagnosed.
Efficient methodical troubleshooting starts with
a clear understanding of the expected behavior of the system and
the symptoms being observed. From there the troubleshooter forms
hypotheses on potential causes, and devises a set of tests (or perhaps
consults a standardized checklist of tests) to eliminate these prospective
causes. Two common strategies used by troubleshooters are to check
for frequently encountered or easily tested conditions first (for
example, checking to ensure that a printer's light is on and that
its cable is firmly seated at both ends), and to "bisect" the
system (for example in a network printing system, checking to see
if the job reached the server to determine whether a problem exists
in the subsystems "towards" the user's end or "towards" the
device).
This latter technique can be particularly efficient
in systems with long chains of serialized dependencies or
interactions among their components. It's simply the application of a
binary
search across the range of dependencies.
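The bisection strategy described above can be sketched as a binary search over an ordered chain of stages. This is a minimal illustration under two stated assumptions: there is a single break in the chain, and every stage downstream of the break also fails to see the work. The stage names and the first_failing_stage function are hypothetical.

```python
def first_failing_stage(stages, reached):
    """Return (stage, probes) for the first stage the work did not get through.

    stages  -- list of stage names in the order the work flows through them
    reached -- callable: reached(stage) is True if the work got through it
    Assumes that once one stage fails, every later stage fails too.
    """
    lo, hi = 0, len(stages)  # stages[:lo] are known good, stages[hi:] known bad
    probes = 0
    while lo < hi:
        mid = (lo + hi) // 2
        probes += 1
        if reached(stages[mid]):
            lo = mid + 1   # fault lies further downstream
        else:
            hi = mid       # fault is here or further upstream
    return (stages[lo] if lo < len(stages) else None), probes

chain = ["application", "lpr client", "config parsing", "name resolution",
         "tcp connection", "print server queue", "printer"]

# Simulated fault: nothing from "name resolution" onward sees the job.
broken_at = chain.index("name resolution")
stage, probes = first_failing_stage(chain, lambda s: chain.index(s) < broken_at)
print(stage, probes)
```

In this simulated seven-stage chain the fault is located with three probes, whereas checking each stage in order from the user's end would take four.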
Simple and intermediate systems are characterized
by lists or trees of dependencies among their components or
subsystems. More complex systems contain cyclical dependencies or
interactions (feedback
loops). Such systems are less amenable to "bisection"
troubleshooting techniques.
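Whether a system's dependencies form a simple chain or tree, or contain a cycle, can be checked mechanically if the dependencies are written down. The sketch below uses Python's standard graphlib module (Python 3.9 or later); the dependency maps and the dependency_order function are invented for illustration.

```python
from graphlib import CycleError, TopologicalSorter

def dependency_order(depends_on):
    """Return components ordered so that dependencies come first,
    or None if the dependencies contain a cycle (a feedback loop)."""
    try:
        return list(TopologicalSorter(depends_on).static_order())
    except CycleError:
        return None

# A linear chain: each component maps to the components it depends on.
printing = {"print job": ["tcp connection"],
            "tcp connection": ["name resolution"],
            "name resolution": []}

# A feedback loop: each component depends on the other.
heating = {"thermostat": ["heater"], "heater": ["thermostat"]}

print(dependency_order(printing))
print(dependency_order(heating))
```

When an order exists, it gives a natural sequence along which to bisect; when the function returns None, the system contains at least one loop and that approach applies less directly.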
It also helps to start from a known good state,
the best example being a computer reboot. A
cognitive
walkthrough can also help. Comprehensive documentation produced by
proficient technical
writers is very helpful, especially if it provides a theory
of operation for the subject device or system.
A common cause of problems is bad design, for example bad human
factors design, where a device could be inserted backward or
upside down due to the lack of an appropriate forcing function
(behavior-shaping
constraint), or a lack of error-tolerant
design. This is especially bad if accompanied by habituation, where the user
just doesn't notice the incorrect usage, for instance if two parts
have different functions but share a common case so that it isn't
apparent on a casual inspection which part is being used.
Troubleshooting can also take the form of a
systematic checklist,
troubleshooting procedure, flowchart or table that is
made before a problem occurs. Developing troubleshooting procedures
in advance allows sufficient thought about the steps to take
and about how to organize them into the most
efficient process. Troubleshooting tables can be
computerized to make them more efficient for users.
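A computerized troubleshooting table can be as simple as a decision tree stored as data and a small function that walks it. The following is a sketch only: the questions, the advice strings, and the names TREE and walk are made up for illustration and do not come from any real procedure.

```python
# Each inner node asks a yes/no question; each leaf is a piece of advice.
TREE = {
    "question": "Is the printer's power light on?",
    "no": "Check the power cable and the outlet.",
    "yes": {
        "question": "Does the print server show the job in its queue?",
        "no": "Investigate the workstation side: application, "
              "configuration files, name resolution.",
        "yes": "Investigate the printer side: cable, paper, device hardware.",
    },
}

def walk(tree, answer):
    """Follow the tree using answer(question) -> bool.

    Returns the advice reached and the list of (question, reply) steps taken.
    """
    node, path = tree, []
    while isinstance(node, dict):
        reply = "yes" if answer(node["question"]) else "no"
        path.append((node["question"], reply))
        node = node[reply]
    return node, path

observations = {
    "Is the printer's power light on?": True,
    "Does the print server show the job in its queue?": False,
}
advice, steps = walk(TREE, observations.__getitem__)
print(advice)
```

Keeping the procedure as data rather than code means it can be reviewed and extended before a problem occurs, and the recorded path documents which questions were actually asked.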
Reproducing symptoms
One of the core principles of troubleshooting is that reproducible
problems can be reliably isolated and resolved. Often considerable
effort and emphasis in troubleshooting is placed on reproducibility,
that is, on finding a procedure to reliably induce the symptom to
occur. Once this is done, systematic strategies can
be employed to isolate the cause or causes of a problem; and the
resolution generally involves repairing or replacing those
components which are at fault.
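How reliably a candidate procedure induces the symptom can be measured by simply repeating it and counting. The sketch below is illustrative: reproduction_rate is a hypothetical helper, and the two stand-in procedures simulate a symptom that always appears and one that appears on every third attempt.

```python
def reproduction_rate(procedure, attempts):
    """Run procedure() repeatedly and return the fraction of attempts
    in which it reported the symptom (procedure returns True)."""
    if attempts <= 0:
        raise ValueError("attempts must be positive")
    hits = sum(1 for _ in range(attempts) if procedure())
    return hits / attempts

def make_every_third():
    """Simulated procedure that shows the symptom on every third call."""
    calls = {"n": 0}
    def procedure():
        calls["n"] += 1
        return calls["n"] % 3 == 0
    return procedure

print(reproduction_rate(lambda: True, 10))        # symptom on every attempt
print(reproduction_rate(make_every_third(), 9))   # symptom on a third of them
```

A rate of 1.0 over many attempts is what a reliable reproduction procedure looks like; anything lower means the procedure only raises the odds of seeing the symptom.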
Intermittent symptoms
Some of the most difficult troubleshooting issues relate to symptoms
that are only intermittent. In electronics this is often the result
of components that are thermally sensitive (since the resistance of a
circuit varies with the temperature of the conductors in it).
Compressed air can be used to cool specific spots on a circuit board
and a heat gun can be used to raise their temperature; thus
troubleshooting of electronic systems frequently entails applying
these tools in order to reproduce a problem. Another, extremely
common, problem in electronic and electro-mechanical systems
In computer programming, race
conditions often lead to intermittent symptoms which are
extremely difficult to reproduce; various techniques can be used to
force the particular function or module to be called more rapidly
than it would be in normal operation (analogous to "heating up" a
component in a hardware circuit) while other techniques can be used
to introduce greater delays in, or force synchronization among,
other modules or interacting processes.
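Forcing synchronization to expose a race condition can be sketched with Python's threading module. In this illustration (all names are hypothetical) two threads each try to add one to a shared counter. A Barrier makes both threads read the old value before either writes, so the lost update, which would normally be rare, happens on every run; a Lock around the read-modify-write removes it.

```python
import threading

class Counter:
    def __init__(self):
        self.value = 0

def unsafe_increment(counter, rendezvous):
    current = counter.value       # read the shared value
    rendezvous.wait()             # force both threads to read before any write
    counter.value = current + 1   # write back a now-stale result

def safe_increment(counter, lock):
    with lock:                    # read-modify-write as one protected step
        counter.value += 1

def run_pair(target, helper):
    """Run target in two threads sharing one Counter; return the final value."""
    counter = Counter()
    threads = [threading.Thread(target=target, args=(counter, helper))
               for _ in range(2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter.value

print(run_pair(unsafe_increment, threading.Barrier(2)))  # one update is lost
print(run_pair(safe_increment, threading.Lock()))        # both updates survive
```

The barrier plays the role of the heat gun: it does not create the defect, it only arranges the timing so that the defect shows itself every time.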
Intermittent issues can be defined thus:
An intermittent fault is
one which occurs irregularly or inconsistently. (Steven Litt, http://www.troubleshooters.com/tpromag/9812.htm#DefinitionofanIntermittent)
In particular he asserts that there is a
distinction between frequency of occurrence and a "known procedure
to consistently reproduce" an issue. For example knowing that an
intermittent problem occurs "within" an hour of a particular
stimulus or event ... but that sometimes it happens in five minutes
and other times it takes almost an hour ... does not constitute a
"known procedure" even if the stimulus does increase the frequency
of observable exhibitions of the symptom.
Nevertheless, sometimes troubleshooters must
resort to statistical methods ... and can only find procedures to
increase the symptom's occurrence to a point at which serial
substitution or some other technique is feasible. In such cases,
even when the symptom seems to disappear for significantly longer
periods, there is low confidence that the root cause has
been found and that the problem is truly solved.
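The confidence problem can be put in rough numbers. Assume, purely for illustration, that before the attempted fix each trial showed the symptom independently with probability p. Then n symptom-free trials in a row would occur by chance with probability (1 - p)^n even if nothing had been fixed. The helper names below are hypothetical, and the independence assumption is often only approximately true in practice.

```python
import math

def chance_of_clean_run(p_fail, trials):
    """Probability of seeing no symptom in `trials` independent trials
    when each trial still fails with probability p_fail."""
    return (1.0 - p_fail) ** trials

def trials_needed(p_fail, confidence):
    """Smallest number of clean trials for which a still-present fault
    would have shown itself with at least the given probability."""
    if not (0.0 < p_fail < 1.0 and 0.0 < confidence < 1.0):
        raise ValueError("p_fail and confidence must be between 0 and 1")
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p_fail))

print(trials_needed(0.1, 0.95))
print(chance_of_clean_run(0.1, 10))
```

For a symptom that used to appear in one trial out of ten, 29 consecutive clean trials are needed before a still-present fault would have been caught with 95% probability; ten clean trials would still occur by chance about a third of the time.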
Multiple problems
Isolating single-component failures which cause reproducible symptoms
is relatively straightforward. However, many problems only occur as a result of
multiple failures or errors. This is particularly true of fault
tolerant systems, or those with built-in redundancy. Features
which add redundancy, fault detection and failover to a system may also
be subject to failure, and enough different component failures in
any system will "take it down."
Even in simple systems the troubleshooter must
always consider the possibility that there is more than one fault.
(Replacing each component, using serial substitution, and then
swapping each new component back out for the old one when the
symptom is found to persist, can fail to resolve such cases. More
importantly the replacement of any component with a defective one
can actually increase the number of problems rather than
eliminating them).
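The failure of swap-back serial substitution in the presence of two faults can be shown with a small simulation. This is a toy model with invented names: a system is a mapping from component name to True (good) or False (faulty), it works only when every component is good, and all replacement parts are assumed to be good.

```python
def system_works(parts):
    """Toy model: the system works only if every component is good."""
    return all(parts.values())

def substitute_with_swap_back(parts):
    """Try a known-good replacement for each component in turn, putting
    the original back whenever the symptom persists.

    Returns the name of the component whose replacement fixed the system,
    or None if no single substitution did.
    """
    parts = dict(parts)            # work on a copy
    for name in parts:
        original = parts[name]
        parts[name] = True         # install a known-good replacement
        if system_works(parts):
            return name
        parts[name] = original     # symptom persists: swap the old part back
    return None

one_fault = {"cable": True, "power supply": False, "controller": True}
two_faults = {"cable": False, "power supply": False, "controller": True}

print(substitute_with_swap_back(one_fault))
print(substitute_with_swap_back(two_faults))
```

With one fault the procedure identifies the power supply. With two faults every single substitution still leaves one bad part in place, so the procedure ends without a result even though each faulty part was, at some point, replaced.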
Note that, while we talk about "replacing
components", the resolution of many problems involves adjustments or
tuning rather than "replacement." For example, intermittent breaks
in conductors, or "dirty or loose contacts", might simply need to
be cleaned and/or tightened. All discussion of "replacement" should
be taken to mean "replacement or adjustment or other
maintenance."