Networks create lots of events, sometimes thousands per minute. Events can be SNMP traps generated when a server reboots, syslog messages, Microsoft Windows event log entries and so on. How do you know which ones actually matter? That is where event correlation tools come in handy. You feed all of the events into the tool, along with a description of the structure of your systems, and its job is to flag the important ones.
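To make that a bit more concrete, here is a minimal sketch of one common correlation idea: if a device fails, ignore the noise from everything that sits behind it. It is not taken from any particular product. The device names, the `UPSTREAM` map and the `correlate` function are all made up for illustration, and real tools use much richer rules than this.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    source: str  # device that raised the event
    kind: str    # e.g. "down" or "unreachable"


# Hypothetical description of the system structure:
# each device maps to the upstream device it depends on.
UPSTREAM = {
    "web1": "switch-a",
    "web2": "switch-a",
    "switch-a": "router-1",
    "router-1": None,
}


def ancestors(device, upstream):
    """Yield every device that `device` depends on, nearest first."""
    seen = set()
    parent = upstream.get(device)
    while parent is not None and parent not in seen:
        seen.add(parent)  # guards against a loop in the topology
        yield parent
        parent = upstream.get(parent)


def correlate(events, upstream):
    """Keep only events whose source has no failing upstream device."""
    failing = {event.source for event in events}
    return [
        event
        for event in events
        if not any(a in failing for a in ancestors(event.source, upstream))
    ]


events = [
    Event("web1", "unreachable"),
    Event("web2", "unreachable"),
    Event("switch-a", "down"),
]
important = correlate(events, UPSTREAM)
print(important)  # [Event(source='switch-a', kind='down')]
```

Three events go in and one comes out: the switch is the likely root cause, and the two web servers are only unreachable because of it.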
At the end of last week one of our sites barfed. Nothing particularly unusual in that. The database process went rogue and stopped responding to queries. Once the problem was detected, restarting the process fixed it very easily. The rather unfortunate side effect was that Elmah sent 10,844 emails to UserVoice, which then created one issue for each of them. That’s not very helpful. What is interesting is how dumb all of the tools in the chain actually are.
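For comparison, here is a rough sketch of what a slightly less dumb link in that chain could do: fingerprint each error and collapse identical ones into a single issue with a count. This is not how Elmah or UserVoice work, and it uses neither of their APIs. The `fingerprint` and `collapse` functions are hypothetical, and the error type and message are invented stand-ins for whatever the site actually reported.

```python
import hashlib


def fingerprint(error_type, message):
    """Stable identifier for 'the same' error (assumed: type + message)."""
    raw = f"{error_type}|{message}".encode("utf-8")
    return hashlib.sha256(raw).hexdigest()


def collapse(errors):
    """Group identical errors into one issue each, with an occurrence count."""
    issues = {}
    for error_type, message in errors:
        key = fingerprint(error_type, message)
        if key not in issues:
            issues[key] = {"type": error_type, "message": message, "count": 0}
        issues[key]["count"] += 1
    return list(issues.values())


# Invented example data: the same database error reported 10,844 times.
errors = [("DatabaseError", "database not responding")] * 10844
issues = collapse(errors)
print(len(issues), issues[0]["count"])  # 1 10844
```

One issue with a counter of 10,844 tells you exactly the same thing as 10,844 separate issues, and is a lot easier to close.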