Failure Management and Fault Tolerance Techniques in a Multi-Module Computing System Based on Creation and Replication of Checkpoints

Denis Kochurov

doi:10.20295/2413-2527-2025-242-103-111


Journal: INTELLECTUAL TECHNOLOGIES ON TRANSPORT No. 2, 2025

Rubrics: INFORMATION SECURITY AND DATA PROTECTION

Denis Kochurov ¹

Author and publication information

Authors:

1. Mozhaisky Military Aerospace Academy (Department of Information and Computing Systems and Networks)
graduate student

Russian Federation

Type:

Article

DOI:

https://doi.org/10.20295/2413-2527-2025-242-103-111

EDN:

https://elibrary.ru/rdtehm

Pages:

from 103 to 111

Status:

Published

Received:

20.05.2025

Accepted:

21.05.2025

Published:

26.06.2025

Subject area:

VAK Russia 2.3.6
VAK Russia 1.2.2
UDC 004.021

Language:

Russian

Keywords:

multi-module computing system, model of the computing process, checkpoint

Abstract and keywords

Abstract:
Introduction: to process target information more efficiently, a computing system needs approaches that detect failures and faults quickly and recover from them, so that their impact on the system as a whole is minimized. Purpose: to present a technique for failure management and fault recovery in a multi-module computing system that periodically saves intermediate results of the computation (checkpoints) and exchanges them among all computing modules. Results: the problem of planning such a computing process is stated, including determination of the optimal number of checkpoints and the time points at which they are created. These time points are determined from the distribution law of the failure times of the computing modules. Practical significance: simulation results obtained with the proposed approach demonstrate the feasibility of implementing the technique.
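
The planning problem described in the abstract (how many checkpoints to create and when) can be illustrated with a minimal sketch. It does not reproduce the model from the article. It assumes exponentially distributed failure times, equally spaced checkpoints, a fixed checkpoint cost, and a failure-free restart, and it uses the standard textbook expression for the expected time to complete one segment under these assumptions. All function and parameter names are hypothetical.

```python
import math


def expected_runtime(total_work, n_checkpoints, checkpoint_cost,
                     restart_cost, failure_rate):
    """Expected wall-clock time for a job split into n equal segments.

    Assumptions (illustrative, not taken from the article):
    - failures follow an exponential law with rate `failure_rate`;
    - each segment ends with a checkpoint costing `checkpoint_cost`;
    - after a failure the job restarts from the last checkpoint,
      paying `restart_cost`, and no failure occurs during the restart.
    Under these assumptions the expected time for one segment of
    length s (work plus checkpoint) is (1/rate + restart) * (exp(rate*s) - 1).
    """
    segment = total_work / n_checkpoints + checkpoint_cost
    per_segment = (1.0 / failure_rate + restart_cost) * math.expm1(
        failure_rate * segment)
    return n_checkpoints * per_segment


def best_checkpoint_count(total_work, checkpoint_cost, restart_cost,
                          failure_rate, max_n=1000):
    """Brute-force search for the checkpoint count with minimal expected time."""
    return min(
        range(1, max_n + 1),
        key=lambda n: expected_runtime(total_work, n, checkpoint_cost,
                                       restart_cost, failure_rate),
    )


def checkpoint_times(total_work, n_checkpoints):
    """Amounts of completed work at which the equally spaced checkpoints fall."""
    step = total_work / n_checkpoints
    return [step * (i + 1) for i in range(n_checkpoints)]


if __name__ == "__main__":
    n = best_checkpoint_count(total_work=100.0, checkpoint_cost=1.0,
                              restart_cost=1.0, failure_rate=0.01)
    print("checkpoints:", n)
    print("positions:", checkpoint_times(100.0, n))
```

Equal spacing is a natural choice only because the exponential law is memoryless. For other failure distributions the spacing would generally be non-uniform, which is the case the abstract refers to when it ties the checkpoint time points to the failure distribution law.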


