Yemanja: A Layered Fault Localization System for Multi-domain Computing Utilities
K. Appleby
IBM T.J. Watson Research Center, 30 Saw Mill River Road,
Hawthorne, NY 10532, USA
Email: applebyk_AT_us.ibm.com
G. Goldszmidt
IBM T.J. Watson Research Center, 30 Saw Mill River Road,
Hawthorne, NY 10532, USA
Email: gsg_AT_us.ibm.com
M. Steinder
Computer and Information Sciences, University of Delaware,
Newark, DE 19716, USA
Abstract
Yemanja is a model-based event correlation engine for multi-layer fault diagnosis. It targets complex propagating fault scenarios, and can smoothly correlate low-level network events with high-level application performance alerts related to quality-of-service violations. Entity-models that represent devices or abstract components encapsulate their behavior. Distantly associated entity-models are not explicitly aware of each other, and communicate through internal event chains. Yemanja's state-based engine supports generic scenario definitions, prioritization of alternate solutions, integrated problem and device testing, and simultaneous analysis of overlapping problems.
The system of correlation rules was developed based on an analysis of device and layer functions and of the dependencies among physical and abstract system components. The primary objectives of this research are the development of reusable, configuration-independent correlation scenarios; adaptability and extensibility of the engine to match the constantly changing topology of a multi-domain server farm; and the development of a concise specification language that is relatively simple yet powerful.
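To make the entity-model idea from the abstract concrete, the following is a minimal illustrative sketch, not Yemanja's actual engine or specification language (neither is given in the abstract). It assumes that each entity-model knows only its direct dependents and maps an incoming event to an outgoing one, so that a low-level event reaches a distant high-level model through a chain of internal events. All names here (`Event`, `EntityModel`, `rules`, `dependents`, `link_down`, `sla_monitor`, and so on) are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Event:
    name: str    # hypothetical event type, e.g. "link_down"
    source: str  # name of the entity-model (or probe) that emitted it


@dataclass
class EntityModel:
    """Hypothetical entity-model: encapsulates one component's behavior."""
    name: str
    # Assumed rule format: incoming event name -> internal event to emit.
    rules: Dict[str, str]
    # Only directly associated models are known; distant ones are not.
    dependents: List["EntityModel"] = field(default_factory=list)

    def handle(self, event: Event, trace: List[str]) -> None:
        # Record what this model observed, then forward a derived event.
        trace.append(f"{self.name} <- {event.name} from {event.source}")
        out_name = self.rules.get(event.name)
        if out_name is None:
            return  # no rule matches: the chain ends at this model
        derived = Event(out_name, self.name)
        for dependent in self.dependents:
            dependent.handle(derived, trace)


def build_demo() -> EntityModel:
    # Three layers: network link -> server -> service-level monitor.
    link = EntityModel("link", {"link_down": "path_unreachable"})
    server = EntityModel("server", {"path_unreachable": "service_degraded"})
    sla = EntityModel("sla_monitor", {})
    link.dependents.append(server)
    server.dependents.append(sla)
    return link


def run_demo() -> List[str]:
    trace: List[str] = []
    build_demo().handle(Event("link_down", "probe"), trace)
    return trace


if __name__ == "__main__":
    for line in run_demo():
        print(line)
```

In this sketch the `link` model never references `sla_monitor` directly; the quality-of-service layer learns of the network fault only through the intermediate `server` model, which is one plausible reading of "communicate through internal event chains". State tracking, testing, and prioritization of alternative solutions, which the abstract also mentions, are deliberately left out.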
Keywords: Problem Determination, Event Correlation, Fault and Performance Management, Service Level Agreements
JNSM: Vol. 10, No. 2, 2002
Note: only the abstract of this paper is available online; please contact your library or the authors for the full paper.