graduate student
Russian Federation
VAK Russia 1.2.2
UDC 004.021
Introduction: to process target information more efficiently, new approaches are needed for rapidly detecting failures and faults and recovering from them, so that their impact on the overall computing system is minimized. Purpose: to outline a technique for failure management and fault recovery in a multi-module computing system that periodically saves intermediate results (checkpoints) and exchanges them between all computing modules. Results: the problem of planning such a computing process is formulated, including the determination of the optimal number of checkpoints and the time points at which they are created. These time points are determined from the distribution law of computing-module failure times. Practical significance: simulation results obtained with the proposed approach demonstrate the feasibility of implementing the technique.
multi-module computing system, model of the computing process, checkpoint
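As a rough illustration of what planning the number and time points of checkpoints can involve, the sketch below uses Young's classical first-order approximation for the checkpoint interval, sqrt(2 * C * MTBF). This is not the technique proposed in the paper: it assumes exponentially distributed failure times, a fixed checkpoint cost C, and evenly spaced checkpoints. The function names and parameters (`young_interval`, `checkpoint_schedule`, `checkpoint_cost`, `mtbf`, `total_work`) are hypothetical and chosen only for this example.

```python
import math


def young_interval(checkpoint_cost: float, mtbf: float) -> float:
    """Young's first-order approximation of the checkpoint interval.

    Assumes exponentially distributed failures with mean time between
    failures `mtbf` and a constant cost `checkpoint_cost` per checkpoint
    (both in the same time unit).
    """
    if checkpoint_cost <= 0 or mtbf <= 0:
        raise ValueError("checkpoint_cost and mtbf must be positive")
    return math.sqrt(2.0 * checkpoint_cost * mtbf)


def checkpoint_schedule(total_work: float, checkpoint_cost: float,
                        mtbf: float) -> list[float]:
    """Return evenly spaced checkpoint times, measured in useful-work time.

    The job is split into a whole number of segments whose length is as
    close as possible to Young's interval. No checkpoint is placed at the
    very end of the job, so n segments yield n - 1 checkpoints.
    """
    interval = young_interval(checkpoint_cost, mtbf)
    segments = max(1, round(total_work / interval))
    step = total_work / segments
    return [step * k for k in range(1, segments)]


if __name__ == "__main__":
    # Illustrative numbers only: 100 time units of work, checkpoint cost 2,
    # mean time between failures 100.
    times = checkpoint_schedule(total_work=100.0, checkpoint_cost=2.0, mtbf=100.0)
    print(len(times), times)
```

For a non-exponential failure law, such as the general distribution law the abstract refers to, evenly spaced checkpoints are generally not optimal, so this sketch should be read only as a baseline.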