OPTIMIZATION OF THE APPLICATION-LEVEL CHECKPOINTING INTERVAL IN STATEFUL MICROSERVICES

Authors

  • Bohdan Marchuk, National Technical University of Ukraine “Igor Sikorsky Kyiv Polytechnic Institute”, Ukraine
  • Viktor Selivanov, National Technical University of Ukraine “Igor Sikorsky Kyiv Polytechnic Institute”, Ukraine

Keywords:

microservice architecture, stateful services, application-level checkpoints, recovery time, interval optimization, chaos engineering, garbage collector, operational overhead, amnesia interval

Abstract

This paper addresses the problem of improving the operational readiness of stateful microservice architectures by tuning how often checkpoints are taken at the application level. The study formulates a mathematical model of the checkpointing interval and verifies it empirically, seeking the interval that balances state recovery (rehydration) time against the associated infrastructure overhead. The methodology combines systems analysis, classical rollback-recovery theory, and chaos engineering tools deployed in a Kubernetes environment. The experiments show that overly frequent checkpointing is counterproductive: a 5-second interval raises CPU load to 5.77% and causes objects to accumulate in garbage-collected memory (5.81 MB), degrading the total recovery time to 0.795 s. In contrast, an interval of 10–15 seconds substantially reduces this background overhead. This balanced mode yields the shortest application recovery time (0.503–0.563 s) and also bounds the amnesia interval, defined as the time gap between the last successful checkpoint and the moment of failure, so that event replay stays within manageable limits (34–47 events). These results provide a basis for designing resilient systems with predictable cold-start behavior that avoid operational degradation under standard load conditions.
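
The relationship between the checkpointing interval, the amnesia interval, and the volume of event replay can be illustrated with a minimal sketch. It is not the authors' implementation: the `CounterService` class, the `recover` and `simulate` helpers, and the workload of one event per second are all hypothetical and chosen only to keep the example small.

```python
from dataclasses import dataclass, field


@dataclass
class CounterService:
    """Hypothetical stateful service whose state is a running sum of event values."""
    checkpoint_interval: float            # seconds between checkpoints
    state: int = 0                        # live in-memory state
    checkpoint_state: int = 0             # state captured by the last checkpoint
    checkpoint_time: float = 0.0          # timestamp of the last checkpoint
    log: list = field(default_factory=list)  # durable event log of (timestamp, value)

    def apply(self, timestamp: float, value: int) -> None:
        """Record the event, update state, and checkpoint once the interval has elapsed."""
        self.log.append((timestamp, value))
        self.state += value
        if timestamp - self.checkpoint_time >= self.checkpoint_interval:
            self.checkpoint_state = self.state
            self.checkpoint_time = timestamp


def recover(service: CounterService, failure_time: float):
    """Rebuild state from the last checkpoint plus a replay of later logged events.

    Returns (recovered_state, amnesia_interval, replayed_event_count).
    """
    replay = [v for t, v in service.log
              if service.checkpoint_time < t <= failure_time]
    recovered_state = service.checkpoint_state + sum(replay)
    amnesia_interval = failure_time - service.checkpoint_time
    return recovered_state, amnesia_interval, len(replay)


def simulate(interval: float, n_events: int, failure_time: float):
    """Assumed workload: one event of value 1 per second, then a failure."""
    service = CounterService(checkpoint_interval=interval)
    for t in range(1, n_events + 1):
        service.apply(float(t), 1)
    return recover(service, failure_time)


if __name__ == "__main__":
    for interval in (5, 10, 15):
        print(interval, simulate(interval, n_events=27, failure_time=27.5))
```

In this toy model a shorter interval always shrinks the amnesia interval and the replay volume. It deliberately omits the cost of taking a checkpoint (CPU time and memory allocation), which is the overhead the reported measurements attribute to the 5-second setting.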

Published

2026-05-09

Section

IoT, Real Time Systems (RT)