Title: Adaptive Real-time Monitoring
Author: Alberto Gonzalez
e-mail: gonzalez@kth.se
Partner: KTH
Supervisor: Rolf Stadler
Committee:
Year of start: 2002
Year of end: 2008
Funding institution: EU Ambient Networks Project
We pursue a design that enables a large set of nodes in a networked environment to achieve a goal through cooperation. Given a goal, the nodes must determine and execute the actions, within their capabilities, that lead to its achievement. The design must be adaptive and controllable, and its computational overhead must be low and controllable. Adaptability means that the design reacts to changes in networking conditions so that it still achieves its goal despite these changes. Control is essential in every management system: while each node must be autonomous in its decision-making, the administrator must be able to guide its behavior and set limits on it. This control can be expressed as a set of forbidden actions and/or forbidden states. To be practically feasible, the computational cost of the design must be low and controllable. Low cost is essential for real-time management and crucial for timely adaptability. Controllability lets vendors bound the processing resources required by management tasks running on their devices.
A key aspect is coordination among nodes. If the nodes do not coordinate their actions appropriately, it may be impossible to achieve the goal. For instance, two uncoordinated nodes may perform opposing actions that cancel each other out, or an erratic node may jeopardize achievement of the goal.
Our research has centered on continuous monitoring of large-scale networks, specifically on monitoring network-wide metrics computed from device counters using aggregation functions such as SUM, AVERAGE and MAX. Examples of such metrics include the total number of VoIP flows and the maximum link utilization in a network domain.
We present A-GAP, a novel protocol for continuous monitoring of network state variables, which aims to achieve a given monitoring accuracy with minimal overhead. Network state variables are computed from device counters using aggregation functions such as SUM, AVERAGE and MAX. The accuracy objective is expressed as the average estimation error. A-GAP is decentralized and asynchronous to achieve robustness and scalability. It executes on an overlay that interconnects management processes on the devices. On this overlay, the protocol maintains a spanning tree and updates the network state variables through incremental aggregation. It dynamically configures local filters that control whether an update is sent towards the root of the tree. We evaluate A-GAP through simulation using real traces and two different types of topologies of up to 650 nodes. The results show that we can effectively control the trade-off between accuracy and protocol overhead, and that the overhead can be reduced by almost two orders of magnitude when small errors are allowed. The protocol adapts quickly to a node failure, exhibiting only short spikes in the estimation error that last a fraction of a second. Lastly, it can provide an accurate estimate of the error distribution in real time.
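The idea of incremental aggregation with local filters can be illustrated with a minimal Python sketch. It is not the A-GAP implementation: the class `Node`, its methods, and the fixed `filter_width` parameter are hypothetical names chosen for illustration, only SUM is shown, and the filters here are static, whereas A-GAP configures them dynamically to meet the accuracy objective. Message passing is modeled as a synchronous method call rather than an asynchronous exchange on an overlay.

```python
class Node:
    """One management process in the aggregation tree (illustrative only)."""

    def __init__(self, name, filter_width=0.0):
        self.name = name
        self.filter_width = filter_width  # assumed static; A-GAP adapts it
        self.local_value = 0.0            # value read from a device counter
        self.parent = None
        self.child_partials = {}          # last partial SUM reported by each child
        self.last_sent = None             # last partial SUM reported to the parent
        self.messages_sent = 0

    def add_child(self, child):
        child.parent = self

    def partial(self):
        # Partial aggregate of the subtree rooted at this node (SUM).
        return self.local_value + sum(self.child_partials.values())

    def set_local(self, value):
        # A local counter changed: update and possibly report upwards.
        self.local_value = value
        self._maybe_report()

    def _receive(self, child_name, value):
        # Incremental aggregation: replace only this child's contribution.
        self.child_partials[child_name] = value
        self._maybe_report()

    def _maybe_report(self):
        if self.parent is None:
            return  # the root holds the network-wide estimate
        current = self.partial()
        # Local filter: suppress the update if the change is small enough.
        if self.last_sent is not None and abs(current - self.last_sent) <= self.filter_width:
            return
        self.last_sent = current
        self.messages_sent += 1
        self.parent._receive(self.name, current)


if __name__ == "__main__":
    root, a = Node("root"), Node("a")
    b, c = Node("b", filter_width=2.0), Node("c", filter_width=2.0)
    root.add_child(a)
    a.add_child(b)
    a.add_child(c)

    b.set_local(10.0)
    c.set_local(20.0)
    b.set_local(11.0)  # change of 1 is within b's filter, so no message is sent
    print(root.partial(), b.messages_sent)
```

In this sketch, the root still estimates 30 after `b` changes to 11, while the true sum is 31: one update message is saved at the cost of an error of 1. With filters only at the two leaves, the error at the root cannot exceed the sum of their widths (4 here), which shows in miniature the trade-off between accuracy and overhead that the protocol controls.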