Monitoring is crucial for ensuring the normal operation of any IT service. It not only provides information about the current service status, but also helps identify problems and investigate their causes. Configuring monitoring for a single service might not be a complicated task, but building a scalable system that serves thousands of machines and hundreds of different services is challenging. Such large-scale systems must also be highly reliable in order to satisfy service managers’ expectations when they need to quickly identify and repair complicated, distributed issues.
In this lecture, Nikolay will share his experience working for the IT Monitoring Service at CERN and give an introduction to building a unified monitoring pipeline able to serve heterogeneous clients at scale. You will also hear about system reliability challenges and the monitoring techniques that help address them.
Nikolay is a Computing Engineer at CERN working for the IT Monitoring Service. His previous experience spans various projects in the CERN accelerators sector, including the Accelerators Logging and Controls Configuration services. Passionate about Big Data and distributed computing technologies, Nikolay has always made building and providing reliable systems his main goal.