Every program out there has bugs (and yes, as much as I hate to admit it, sometimes even Winstep applications). Most of them will be minor and - hopefully - you will only find them when Saturn is aligned with Mars and the sun is just over the horizon.
If you have never created software you might have a hard time understanding this - why aren't all programs perfect and bug-free?! - but, if you have, you know how complex it is to write even a simple application. If you can sometimes find bugs in code that is just 2 or 3 lines long, imagine the potential for disaster in programs that have hundreds of thousands of lines of code!
Programming is also basically an exercise in futurology: you have to predict everything the user might do and every way each piece of code is going to interact with the others. It's like a juggling act! You have to come up with different scenarios in your mind and then test them one by one to make sure everything is working as it should and that nothing breaks. Of course, programmers are only human, so there is always going to be something we missed or didn't think about.
This is where a good team of testers comes in: they are going to do things with your program you would never have dreamed of, and, in the process, probably uncover some potential problems. You then fix these problems and you pray that those fixes didn't break something else in your code that was previously working fine. And yes, we do a lot of praying.
So, getting back on topic, what does the art of bug reporting consist of?
Well, first it consists of actually reporting the bug. Yes, pardon me for stating the obvious, but you have no idea how many users run into a problem and then don't report it, either because they can't be bothered at the time or because they think we must know about it already. Well, in the latter case, if it is a common bug then eventually somebody will report it to us, but, if you still see it in the next release, then you can be pretty sure we don't know about it. Winstep takes pride in fixing bugs as soon as they are reported.
The second most important thing about bug reporting is STEPS TO REPRODUCE. Let me say that again: what do I have to do to reproduce that bug? If all you're going to tell me is that the application crashed, then all you'll get in return is a blank stare. What were you doing when the application crashed? Can you make it happen 100% of the time or is it one of those hard-to-fix 'it only happens sometimes' bugs? How can I reliably reproduce the bug here?
To fix a bug, I must be able to reproduce it here so I can at least have an idea of where to look for the source of the problem. It helps a lot if the user does a bit of detective work first, since this will eliminate many possible causes and, sometimes, even provide that 'a-ha!' moment you need to figure it out. For example: 'The application will crash if option x is on and option y is off, but not if option x is also off.'
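If you like to think in code, that kind of detective work boils down to trying each combination of the suspect settings and noting which ones fail. Here is a minimal sketch in Python; `crashes` and `failing_combinations` are names made up for illustration, and `crashes` simply simulates the example above rather than launching any real application:

```python
from itertools import product


def crashes(option_x: bool, option_y: bool) -> bool:
    """Hypothetical stand-in for 'run the application with these
    settings and see whether it crashes'. Here it just simulates
    the example: crash when x is on and y is off."""
    return option_x and not option_y


def failing_combinations(check):
    """Try every on/off combination of two options and return
    the (x, y) pairs for which the check reports a crash."""
    return [(x, y) for x, y in product([False, True], repeat=2) if check(x, y)]


print(failing_combinations(crashes))  # prints [(True, False)]
```

A report that says 'it only fails with x on and y off' is worth far more than 'it crashed'.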
When you are reporting a bug, besides the steps to reproduce, you should also be as specific as possible and provide as much detail as you can. More often than not, the cause of a bug is some obscure feature that I NEVER use here (or used once when I coded it and then forgot all about it). If you don't mention that you're using that feature, I will probably have a hard time figuring out what is causing the problem.
And as far as bugs go, you should be aware that they are not all equal. You can divide them into three distinct categories:
1) The bug that can be reproduced 100% of the time by following the user's directions.
2) The pseudo-random bug. The one that happens sometimes but not all the time.
3) The bug that only happens on your system.
Bug type 1 is the easiest and quickest to fix. 'nuff said about it.
Bug type 2 is a nightmare: it will only happen when very specific conditions are met, most of them out of your control. Since you can't reproduce it reliably, you have no idea what is causing it, so you have to approach the problem by trial and error: you suspect it might be because of y, so you change y and then wait to see if the bug rears its ugly head again. Usually you will only know what was actually causing the problem when one of your several 'blind' attempts to fix it finally works - this will give you the first clue to the real cause.
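One way to shorten the waiting is to hammer the suspect code in a loop and count how often it fails, before and after each attempted fix. What follows is only a sketch in Python: `make_flaky` and `count_failures` are names invented for this example, and the 'bug' is simulated with a random number instead of a real fault.

```python
import random


def make_flaky(probability: float):
    """Build a hypothetical operation that fails with the given
    probability each time it is called (a simulated type 2 bug)."""
    def operation(rng: random.Random) -> None:
        if rng.random() < probability:
            raise RuntimeError("simulated intermittent failure")
    return operation


def count_failures(operation, runs: int, seed: int) -> int:
    """Run the operation 'runs' times and count how many runs failed.
    A fixed seed makes the simulated run repeatable."""
    rng = random.Random(seed)
    failures = 0
    for _ in range(runs):
        try:
            operation(rng)
        except RuntimeError:
            failures += 1
    return failures


before_fix = count_failures(make_flaky(0.005), runs=10_000, seed=1)
after_fix = count_failures(make_flaky(0.0), runs=10_000, seed=1)
print(before_fix, after_fix)
```

If the count drops to zero after a change, you may have found your first clue. With a real bug, a zero only means it didn't happen this time, so the more runs the better.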
A short story to illustrate a type 2 bug (you might have a hard time understanding this if you are not a programmer):
When adding GDI+ and PNG file support to WorkShelf, I suddenly started getting random Access Violation exceptions. An Access Violation exception happens when a program is trying to access memory that doesn't belong to it any more. This exception would happen only when using GDI+ to draw a GDI+ bitmap created from an icon image in memory, but it didn't happen all the time. I could run the same code hundreds of times without a problem, and then suddenly, without any particular reason - bang, Access Violation.
I checked and double-checked the code, but, no matter how hard I looked, I couldn't find anything wrong. I was releasing objects when I should and not before, I had no GDI or memory leaks, everything seemed peachy, but... I was still getting those errors from time to time.
The solution to this problem only occurred to me after solving yet another, apparently unrelated, issue that was also getting on my nerves: any PNG bitmap file used as the background of a Desktop Module would become locked for as long as the associated Desktop Module was visible, i.e., if I made a modification to the original bitmap and then tried to save the result, Photoshop would tell me that I couldn't because the file was locked. I either had to close the desktop module that was using that bitmap or change themes.
Now, this makes no sense because, for performance reasons, when WorkShelf loads a bitmap from disk for the first time, it makes a copy of the original bitmap in memory and uses that copy instead from then on - this way it doesn't have to perform an expensive disk access every time it needs to use that bitmap.
However, to decode PNG files I had to use GDI+. And GDI+ was, for some strange reason, locking the source file, only releasing the lock when the copy of the bitmap in memory was destroyed.
After searching the net for a while, I managed to find an obscure entry in the MS Knowledge Base explaining why. You see, it seems that when you create a GDI+ bitmap from a file, a stream or a memory bitmap, that bitmap will ALWAYS hold a reference to the original source. This is because the developers of GDI+ thought it would be OK for GDI+ to release the memory used by a bitmap whenever it felt like it (i.e. it might, but it might also not, with you having no control over the process), as long as that bitmap was a copy of some other bitmap.
If it did decide to release the memory used by that bitmap, then it would later on refer to the source bitmap to reconstruct it when necessary. This explains why GDI+ locks the source file - it might need to access it again if, in the meantime, it decides to destroy the copy it has in memory.
For me, this is absolutely insane and goes against everything you would expect, especially when instead of having a file as the source of a bitmap, you have, say, an icon or another bitmap in memory!
You see, I (as everybody else would) assumed that once you converted an icon image in memory into a GDI+ bitmap, the resulting GDI+ bitmap would be a copy of the original bitmap. And it is. Except that it might also be destroyed at any time too!
Since you don't want two identical bitmaps using up valuable memory space, what you logically do after converting the source bitmap into a GDI+ bitmap is destroy the original bitmap. After all, you will only be using the GDI+ version from then on to draw on GDI+ surfaces.
With GDI+ you can't! You MUST keep the original source bitmap lying around until you dispose of the GDI+ bitmap. But where is this very important piece of information stated? In bold letters in every document related to GDI+, as it should be? No, in an obscure one-page KB resource, which you will only find when you've already run into the problem and know what you are looking for!
Anyway, this explains why I was getting those random Access Violation errors. If GDI+ decided to destroy its version of the bitmap, it would then refer to the original source whenever necessary. Since I had already destroyed the source bitmap, GDI+ would be trying to access memory that had already been discarded - instant Access Violation!
Because GDI+ only decides to release the memory used by a bitmap whenever the sun is on the horizon and Saturn is aligned with Mars, most of the time your code runs smoothly, and you get a type 2 bug!
Ah, in case you're wondering, the solution to the file locking problem is to create a blank bitmap with the same dimensions as the original, draw the file-based bitmap into it, dispose of the file-based bitmap, thus releasing the lock, and use the copy you made from then on. Nothing but a waste of CPU cycles.
The solution to the Access Violation error is to keep the source bitmap available at all times until you dispose of the GDI+ bitmap, at which time you can safely destroy the source bitmap as well. Nothing but a waste of memory.
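For the programmers in the audience, here is a toy model of the Access Violation problem in Python. To be clear, this is NOT the GDI+ API: `SourceBitmap`, `LazyBitmap` and `draw_after_discard` are invented for illustration, the 'pixels' are just a list of numbers, and a Python exception stands in for the crash you would get in native code. It only tries to show why destroying the source early works most of the time and then suddenly doesn't.

```python
class SourceBitmap:
    """Toy stand-in for an icon or bitmap held in memory."""

    def __init__(self, pixels):
        self.pixels = list(pixels)
        self.destroyed = False

    def destroy(self):
        self.pixels = None
        self.destroyed = True


class LazyBitmap:
    """Toy model of a bitmap that may drop its decoded copy at any
    time and rebuild it from the source later (an assumption that
    mirrors the behaviour described in the text)."""

    def __init__(self, source):
        self._source = source              # reference kept for the bitmap's lifetime
        self._cache = list(source.pixels)  # the in-memory copy

    def discard_cache(self):
        """Models the library deciding to free its copy."""
        self._cache = None

    def draw(self):
        if self._cache is None:
            if self._source.destroyed:
                # In native code, this is where the access violation happens.
                raise RuntimeError("source bitmap already destroyed")
            self._cache = list(self._source.pixels)  # rebuild from the source
        return self._cache


def draw_after_discard(destroy_source_early: bool):
    source = SourceBitmap([1, 2, 3])
    bitmap = LazyBitmap(source)
    if destroy_source_early:
        source.destroy()        # the 'logical' thing to do, and the bug
    bitmap.draw()               # fine either way: the copy is still there
    bitmap.discard_cache()      # happens 'whenever it feels like it'
    return bitmap.draw()        # only works if the source is still alive


print(draw_after_discard(destroy_source_early=False))  # prints [1, 2, 3]
```

Calling `draw_after_discard(destroy_source_early=True)` raises the `RuntimeError` on the second draw, but only because the model discards the copy every time. In real life the discard is the rare part, which is exactly what makes this a type 2 bug.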
And talking about waste, these two problems made me waste a lot of my time - I can only hope I never run into one of the Microsoft GDI+ developers, because it won't be pleasant for him.
Anyway, getting back on track: type 3 bugs are, well, unfixable. See, it's not really a problem with the application per se. It's something external, and it can be anything from your video driver to DLL hell (where you have mismatched versions of Dynamic Link Libraries on your system). Unfortunately it's very hard to explain to a user that it is a problem with his machine and not with the application itself - he will only understand that when you finally talk him into trying the application on a friend's machine and he realizes that it doesn't happen there. Nor on his machine at work.