This paper explains a scientific approach to problem solving. Although it is written to address Information Technology related problems, the concepts may also be applicable in other disciplines. The methods, concepts, and techniques described here are nothing new, but it is shocking how many "problem solvers" fail to use them. Along the way I will include some real-life examples.

Why do problem solvers guess instead of following a scientific approach to problem solving? Maybe because it feels quicker? Maybe because of a lack of experience in efficient problem solving? Or maybe because it feels like hard work to do it scientifically? Maybe while you keep on guessing and not really solving, you generate more income and add some job security? Or maybe because you violate the first principle of problem solving: understand the problem.

Principle #1. Understand the *real* problem.

Isn't it obvious that before you can solve a problem, you need to understand it? Maybe. But most of the time the solver will start solving without knowing the real problem. What the client or user describes as "The Problem" is normally only the symptom! "My computer does not want to switch on" is the symptom. The real problem could be that the whole building is without power. "Every time I try to add a new product, I get an error message" is the symptom. Here the real problem could be "Only the last 2 products I tried to add gave a 'Product already exists' error". Another classic example: "Nothing is working"...

You start your investigation by defining the "real problem". This entails asking questions (and sometimes verifying the answers) and doing some basic testing. Ask the user questions like "When was the last time it worked successfully?", "How long have you been using the system?", "Does it work on another PC or for another user?", "What is the exact error message?" etc. Ask for a screen-print of the error if possible. Your basic testing will be to ensure that the end-to-end equipment is up and running. Check the user's PC, the network, the Web Server, Firewalls, the File Server, the Database back-end, etc. In the best case you will pinpoint the problem already. In the worst case you can eliminate a lot of areas as the cause of the problem.

A real-life example. The symptom according to the user: "The system hangs up at random times when I place orders". The environment: the user enters the order detail on a form in a mainframe application. When all the detail is completed, the user tabs off the form. The mainframe then sends this detail via communication software to an Oracle Client/Server system at the plant. The Oracle system does capacity planning and returns either an error or an expected order date to the mainframe system. This problem is quite serious, because you can lose clients if they try to place orders and the system does not accept them! To attempt to solve this problem, people started by:

1) Investigating the load and capacity of the mainframe hardware
2) Monitoring the network load between the mainframe and the Oracle system
3) Hiring consultants to debug the communication software
4) Debugging the Oracle capacity planning system

After spending a couple of months they could not solve the problem.

The "Scientific Problem Solver" was called in. It took less than a day and the problem was solved! How? The solver spent the day with the user to see what the "real problem" was. It was found that the problem only occurred with export orders. By investigating the capture screen and the user's actions, it was found that with export orders the last field on the form was always left blank and the user did not tab off this field. The system was not hanging; it was waiting for the user to press "tab" another time. Problem solved. It can be noted that the "Scientific Problem Solver" had very limited knowledge of the mainframe, of the order capturing system, of the communication software, and of the Oracle capacity planning system. And this brings us to Principle #2.

Principle #2. Do not be afraid to start the solving process, even if you do not understand the system.

How many times have you heard "I cannot touch that code, because it was developed by someone else!", or "I cannot help because I am an HR Consultant and that is a Finance problem"? If your washing machine does not want to switch on, you do not need to be an Electrical Engineer, Washing Machine Repair Specialist, Technician, or any other kind of specialist to do some basic fault finding. Make sure the plug is working. Check the trip-switch, etc. "I have never seen this error before" should not stop you from attempting to solve it. With the error message and an Internet search engine, you can get lots of starting points.

In every complex system there are a couple of basic working principles. System A that reads data from System B can be horribly complex (maybe a Laboratory Spectrometer that reads data from a Programmable Logic Controller via an RS-232 port). But there are some basics to test for: Do both systems have power? Is there an error message in the event log on one of these systems? Can you "ping" or trace a network packet from the one system to the other? Try a different communication cable. Search the Internet for the error message.

Once you have established what the problem is, you need to start solving it. Sometimes the initial investigation will point you directly to the solution (switch the power on, replace the faulty cable, etc). But sometimes the real problem is complex in itself, so the next principle is to solve it in a simple way.

Principle #3. Conquer it simple.

Let's start this section with a real-life example. Under certain conditions, a stored procedure would hang. The stored procedure normally takes about an hour to run (when it is not hanging). So the developer tried to debug it: make some changes, then wait another hour or so to see if the problem was solved. After some days the developer gave up and the "Problem Solver" took over. The "Problem Solver" had at his disposal the knowledge of the conditions under which the stored procedure would hang. So it was a simple exercise to make a copy of the procedure and then strip all unnecessary code from this copy. All parameters were replaced with hard-coded values. Bits of code were executed one at a time, and the result-sets were then hard-coded back into the copy of the procedure. Within 3 hours the problem was solved. An infinite loop was discovered.

What the "Problem Solver" did was to replicate the problem and at the same time isolate the code that caused it. In doing so, the complex (and time consuming) stored procedure became something fast and simple.

If the problem is inside an application, create a new application and try to simulate the problem inside the new application as simply as possible. If the problem occurs when a certain method of a certain control gets called, then include only this control in the empty application and call that method with hard-coded values. If the problem is with embedded SQL inside a C# application, then try to run the SQL inside a database query tool (like SQL*Plus for Oracle or Query Analyzer for SQL Server, or run the code in MS Excel via ODBC to the database).

The moment you can replicate the problem in a simple way, you are more than 80% of the way to solving it.

If you do not know where in the program the problem is, then use DEBUG.

Principle #4. Debug.

Most application development tools come standard with a debugger. Whether it is Macromedia Flash, Microsoft .NET, Delphi, or whatever other development environment, there will be some sort of debugger. If the tool does not come standard with a debugger, then you can simulate one.

The first thing you want to do with the debugger is to determine where the problem is. You do this by adding breakpoints at key areas. Then you run the program in debug mode and you will know between which breakpoints the problem occurred. Drill down and you will find the spot. Now that you know where the problem is, you can "conquer it simple".

Another nice feature of most debuggers is the facility to watch variables, values, parameters, etc. as you step through the program. With these values known at certain steps, you can hard-code them into your "simplified version" of the program.

If a development tool does not support debugging, then you can simulate it. Put steps in the program that output variable values and "hello I am here" messages to the screen, to a log file, or to a database table. Remember to take them out when the problem is resolved... you don't want your file system to be cluttered or filled up with log files!

Principle #5. There is a wealth of information on the database back-end that will help to solve a problem.

The "Problem Solver" was called to help solve a very tricky problem. A project was migrating a system from a mainframe to client-server technology. All went well during testing, but when the systems went live, all of a sudden there were quite a few, and quite random, "General Protection Faults". (The GPF-error was the general error trap in Windows 95 and 98.) Attempts were made to simplify the code and to debug it, but the problem was impossible to replicate. In the LAB environment, the problem would not occur! Debugging trace messages written to log files indicated that the problem occurred very randomly. Some users experienced it more than others, but eventually all users would get it! Interesting problem.

The "Problem Solver" solved this after he started to analyze the database back-end. It is not certain whether this was by chance or because he systematically moved in the right direction by following a scientific approach. By tracing what was happening at the back-end level, it was found that all these applications were creating more and more connections to the database. Every time a user started a new transaction, another connection was established to the database. The connections were only released when the application was closed. As the user navigated to new windows inside the same application, more and more connections were opened, and after a specific number of connections the application would have had enough and crash. This was a programming fault in a template that was used by all the developers. The solution was to first test whether a cursor to the database was already open before opening it again.

How do you trace what is happening on the back-end database? The main database providers have GUI tools that help you to trace or analyze what queries are fired against the database. They will also show you when people connect, disconnect, or were unable to connect because of security violations. Most databases also include some system dictionary tables that can be queried to get this information. These traces can sometimes tell the whole story of why something is failing. The query code you retrieve from the trace can help to "simplify the search". You can see from the trace whether the program makes successful contact with the database. You can see how long it takes for a query to execute.

To add to Principle #2 (do not be afraid to start...): you can analyze this trace information even though you might not know anything about the detail of the application.

Remember though that these back-end traces can put a strain on the back-end resources. Do not leave them running longer than necessary.

Principle #6. Use fresh eyes.

This is the last principle. Do not spend too much time on the problem before you ask for assistance. The assistance does not have to be from someone more senior than you. The principle is that you need a pair of fresh eyes for a fresh perspective, and sometimes a bit of fresh air by taking a break. The other person will look and then ask a question or two. Sometimes it is something very obvious that was missed. Sometimes just answering the question makes you think in a new direction. Also, if you spend hours looking at the same piece of code, it is very easy to start overlooking a silly mistake. A lot of finance balancing problems get solved over a beer. It could be the change of scenery and/or the relaxed atmosphere that makes the solution pop out. Maybe it is the fresh oxygen that went to the brain while walking to the pub. Maybe it is because the problem got discussed with someone else.

Conclusion

The author hopes that, after reading this paper, you will try these principles the next time you encounter a problem to solve. Hopefully by applying these six principles you will realize the advantages they bring, rather than having to "guess" your way to a solution.
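As an illustration of Principle #4, the "simulated debugger" (outputting variable values and "I am here" messages at key steps) could look something like the Python sketch below. This is only a minimal sketch and not the code from any of the real-life examples; the function `calculate_order_total`, its parameters, and the logger name `trace_demo` are hypothetical. It uses Python's standard `logging` module at DEBUG level, on the assumption that being able to switch the trace messages off by raising the log level is a convenient alternative to removing them by hand.

```python
import io
import logging

# Hypothetical logger name, used only for this illustration.
log = logging.getLogger("trace_demo")


def calculate_order_total(quantity, unit_price, discount_rate):
    """Hypothetical function under investigation, instrumented with trace messages."""
    log.debug("entered calculate_order_total: quantity=%r unit_price=%r discount_rate=%r",
              quantity, unit_price, discount_rate)
    subtotal = quantity * unit_price
    log.debug("checkpoint 1: subtotal=%r", subtotal)
    discount = subtotal * discount_rate
    log.debug("checkpoint 2: discount=%r", discount)
    total = subtotal - discount
    log.debug("leaving calculate_order_total: total=%r", total)
    return total


def run_with_trace(level):
    """Run the function once at the given log level and return the captured trace lines."""
    buffer = io.StringIO()                      # in-memory stand-in for a log file
    handler = logging.StreamHandler(buffer)
    handler.setFormatter(logging.Formatter("%(levelname)s %(message)s"))
    log.addHandler(handler)
    log.setLevel(level)
    log.propagate = False                       # keep the trace out of the root logger
    try:
        calculate_order_total(3, 10, 0.5)
    finally:
        log.removeHandler(handler)              # always detach the temporary handler
    return buffer.getvalue().splitlines()


if __name__ == "__main__":
    # With DEBUG enabled, every checkpoint is reported in order.
    for line in run_with_trace(logging.DEBUG):
        print(line)
    # With the level raised to WARNING, the same code produces no trace output.
    print(len(run_with_trace(logging.WARNING)), "trace lines at WARNING level")
```

The checkpoint values reported by such a trace are the same values you would hard-code into a simplified copy of the program when applying Principle #3.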