{"id":2517,"date":"2025-06-06T23:57:45","date_gmt":"2025-06-06T23:57:45","guid":{"rendered":"https:\/\/diznr.com\/?p=2517"},"modified":"2025-06-06T23:57:45","modified_gmt":"2025-06-06T23:57:45","slug":"distributed-computing-distributed-system-checkpoints-and-its-type-checkpoint-levels-tools-its","status":"publish","type":"post","link":"https:\/\/www.reilsolar.com\/pdf\/distributed-computing-distributed-system-checkpoints-and-its-type-checkpoint-levels-tools-its\/","title":{"rendered":"Distributed Computing: Distributed System Checkpoints, Their Types, Levels, and Tools"},"content":{"rendered":"<p>Distributed Computing: Distributed System Checkpoints, Their Types, Levels, and Tools.<\/p>\n<h3 data-start=\"0\" data-end=\"65\"><strong data-start=\"2\" data-end=\"63\">Distributed Computing: Checkpoints in Distributed Systems<\/strong><\/h3>\n<h3 data-start=\"67\" data-end=\"124\"><strong data-start=\"70\" data-end=\"122\">What is Checkpointing in Distributed Systems?<\/strong><\/h3>\n<p data-start=\"126\" data-end=\"350\">Checkpointing is a <strong data-start=\"145\" data-end=\"174\">fault-tolerance mechanism<\/strong> in distributed computing that periodically saves the system state. 
If a failure occurs, the system can <strong data-start=\"278\" data-end=\"314\">restart from the last checkpoint<\/strong> instead of starting from scratch.<\/p>\n<p data-start=\"352\" data-end=\"500\"><strong data-start=\"355\" data-end=\"368\">Key Ideas:<\/strong><br data-start=\"368\" data-end=\"371\" \/>Saves system state at intervals.<br data-start=\"405\" data-end=\"408\" \/>Reduces computation loss during failures.<br data-start=\"451\" data-end=\"454\" \/>Speeds up recovery in distributed systems.<\/p>\n<h3 data-start=\"507\" data-end=\"562\"><strong data-start=\"510\" data-end=\"560\">Types of Checkpoints in Distributed Systems<\/strong><\/h3>\n<h3 data-start=\"564\" data-end=\"603\"><strong data-start=\"568\" data-end=\"601\">Coordinated Checkpointing<\/strong><\/h3>\n<p data-start=\"604\" data-end=\"692\"><strong data-start=\"607\" data-end=\"622\">Definition:<\/strong> All processes coordinate so that their saved states together form a consistent global state.<\/p>\n<p data-start=\"694\" data-end=\"831\"><strong data-start=\"696\" data-end=\"719\">Ensures consistency<\/strong> (no orphan or lost messages).<br data-start=\"749\" data-end=\"752\" \/>Used in <strong data-start=\"762\" data-end=\"782\">global snapshots<\/strong>.<br data-start=\"783\" data-end=\"786\" \/><strong data-start=\"789\" data-end=\"799\">Slower<\/strong> due to coordination overhead.<\/p>\n<p data-start=\"833\" data-end=\"850\"><strong data-start=\"836\" data-end=\"848\">Examples:<\/strong><\/p>\n<ul data-start=\"851\" data-end=\"906\">\n<li data-start=\"851\" data-end=\"877\">Koo-Toueg two-phase checkpointing protocol<\/li>\n<li data-start=\"878\" data-end=\"906\">Chandy-Lamport snapshot algorithm<\/li>\n<\/ul>\n<h3 data-start=\"913\" data-end=\"954\"><strong data-start=\"917\" data-end=\"952\">Uncoordinated Checkpointing<\/strong><\/h3>\n<p data-start=\"955\" data-end=\"1049\"><strong data-start=\"958\" data-end=\"973\">Definition:<\/strong> Each process takes checkpoints <strong 
data-start=\"1005\" data-end=\"1022\">independently<\/strong> without synchronization.<\/p>\n<p data-start=\"1051\" data-end=\"1146\"><strong data-start=\"1053\" data-end=\"1063\">Faster<\/strong>, no coordination required.<br data-start=\"1090\" data-end=\"1093\" \/><strong data-start=\"1096\" data-end=\"1144\">Risk of cascading rollbacks (domino effect).<\/strong><\/p>\n<p data-start=\"1148\" data-end=\"1165\"><strong data-start=\"1151\" data-end=\"1163\">Example:<\/strong><\/p>\n<ul data-start=\"1166\" data-end=\"1197\">\n<li data-start=\"1166\" data-end=\"1197\">Each process periodically saving its own state on a local timer.<\/li>\n<\/ul>\n<h3 data-start=\"1204\" data-end=\"1253\"><strong data-start=\"1208\" data-end=\"1251\">Communication-Induced Checkpointing<\/strong><\/h3>\n<p data-start=\"1254\" data-end=\"1354\"><strong data-start=\"1257\" data-end=\"1272\">Definition:<\/strong> A hybrid approach in which processes checkpoint independently, but additional <strong data-start=\"1297\" data-end=\"1326\">checkpoints are triggered<\/strong> (forced) by control information piggybacked on application messages.<\/p>\n<p data-start=\"1356\" data-end=\"1510\"><strong data-start=\"1358\" data-end=\"1391\">Prevents inconsistent states.<\/strong><br data-start=\"1391\" data-end=\"1394\" \/><strong data-start=\"1396\" data-end=\"1424\">Avoids the domino effect<\/strong> from uncoordinated checkpointing.<br data-start=\"1458\" data-end=\"1461\" \/><strong data-start=\"1464\" data-end=\"1508\">Higher overhead from piggybacked control information and forced checkpoints.<\/strong><\/p>\n<p data-start=\"1512\" data-end=\"1529\"><strong data-start=\"1515\" data-end=\"1527\">Example:<\/strong><\/p>\n<ul data-start=\"1530\" data-end=\"1583\">\n<li data-start=\"1530\" data-end=\"1583\">Index-based protocols, in which each message carries the sender&#8217;s checkpoint sequence number and a receiver that is behind takes a forced checkpoint before delivering the message.<\/li>\n<\/ul>\n<h3 data-start=\"1590\" data-end=\"1635\"><strong data-start=\"1594\" data-end=\"1633\">Application-Level Checkpointing<\/strong><\/h3>\n<p data-start=\"1636\" data-end=\"1732\"><strong data-start=\"1639\" data-end=\"1654\">Definition:<\/strong> Checkpoints are managed at the <strong 
data-start=\"1686\" data-end=\"1704\">application level<\/strong>, by the program itself, rather than at the system level.<\/p>\n<p data-start=\"1734\" data-end=\"1890\">Allows <strong data-start=\"1743\" data-end=\"1771\">customized checkpointing<\/strong>: the application saves only the state it needs.<br data-start=\"1788\" data-end=\"1791\" \/>Efficient for <strong data-start=\"1807\" data-end=\"1843\">high-performance computing (HPC)<\/strong>.<br data-start=\"1844\" data-end=\"1847\" \/><strong data-start=\"1850\" data-end=\"1888\">Requires developer implementation.<\/strong><\/p>\n<p data-start=\"1892\" data-end=\"1909\"><strong data-start=\"1895\" data-end=\"1907\">Example:<\/strong><\/p>\n<ul data-start=\"1910\" data-end=\"1960\">\n<li data-start=\"1910\" data-end=\"1960\">An MPI (Message Passing Interface) application that writes its own restart files.<\/li>\n<\/ul>\n<h3 data-start=\"1967\" data-end=\"2019\"><strong data-start=\"1970\" data-end=\"2017\">Checkpoint Levels in Distributed Systems<\/strong><\/h3>\n<p data-start=\"2021\" data-end=\"2240\"><strong data-start=\"2023\" data-end=\"2055\">Process-Level Checkpointing:<\/strong> Saves the state of individual processes.<br data-start=\"2092\" data-end=\"2095\" \/><strong data-start=\"2097\" data-end=\"2128\">System-Level Checkpointing:<\/strong> Saves the state of the whole operating system or virtual machine.<br data-start=\"2155\" data-end=\"2158\" \/><strong data-start=\"2160\" data-end=\"2196\">Application-Level Checkpointing:<\/strong> Saves only the state the application chooses to save.<\/p>\n<h3 data-start=\"2247\" data-end=\"2305\"><strong data-start=\"2250\" data-end=\"2303\">Tools for Checkpointing in Distributed Systems<\/strong><\/h3>\n<p data-start=\"2307\" data-end=\"2366\"><strong data-start=\"2310\" data-end=\"2364\">1. 
DMTCP (Distributed MultiThreaded Checkpointing)<\/strong><\/p>\n<ul data-start=\"2367\" data-end=\"2444\">\n<li data-start=\"2367\" data-end=\"2403\">Transparent user-level checkpointing; applications need no modification.<\/li>\n<li data-start=\"2404\" data-end=\"2444\">Supports multi-threaded and distributed applications.<\/li>\n<\/ul>\n<p data-start=\"2446\" data-end=\"2496\"><strong data-start=\"2449\" data-end=\"2494\">2. CRIU (Checkpoint\/Restore in Userspace)<\/strong><\/p>\n<ul data-start=\"2497\" data-end=\"2584\">\n<li data-start=\"2497\" data-end=\"2539\">Process-level checkpointing for Linux.<\/li>\n<li data-start=\"2540\" data-end=\"2584\">Saves the state of a running process to disk and restores it later.<\/li>\n<\/ul>\n<p data-start=\"2586\" data-end=\"2636\"><strong data-start=\"2589\" data-end=\"2634\">3. BLCR (Berkeley Lab Checkpoint\/Restart)<\/strong><\/p>\n<ul data-start=\"2637\" data-end=\"2717\">\n<li data-start=\"2637\" data-end=\"2684\">Kernel-level checkpointing for HPC systems.<\/li>\n<li data-start=\"2685\" data-end=\"2717\">Works with MPI applications.<\/li>\n<\/ul>\n<p data-start=\"2719\" data-end=\"2751\"><strong data-start=\"2722\" data-end=\"2749\">4. Hadoop Checkpointing<\/strong><\/p>\n<ul data-start=\"2752\" data-end=\"2826\">\n<li data-start=\"2752\" data-end=\"2826\">Used in <strong data-start=\"2762\" data-end=\"2803\">HDFS (Hadoop Distributed File System)<\/strong>: the NameNode&#8217;s edit log is periodically merged into a new file-system image (fsimage), which shortens recovery after a restart.<\/li>\n<\/ul>\n<h3 data-start=\"2833\" data-end=\"2855\"><strong data-start=\"2836\" data-end=\"2853\">Conclusion<\/strong><\/h3>\n<p data-start=\"2857\" data-end=\"3058\">Checkpointing <strong data-start=\"2871\" data-end=\"2906\">reduces the impact of system failures<\/strong> by allowing recovery from saved states. 
Different <strong data-start=\"2957\" data-end=\"2974\">types &amp; tools<\/strong> are used based on the <strong data-start=\"2997\" data-end=\"3020\">system requirements<\/strong> (speed, reliability, and overhead).<\/p>\n<p data-start=\"0\" data-end=\"210\"><strong data-start=\"0\" data-end=\"25\">Distributed Computing<\/strong> involves multiple computer systems working together to achieve a common goal. One key challenge in such systems is ensuring <strong data-start=\"150\" data-end=\"169\">fault tolerance<\/strong>, which is where <strong data-start=\"186\" data-end=\"201\">checkpoints<\/strong> come in.<\/p>\n<hr data-start=\"212\" data-end=\"215\" \/>\n<h2 data-start=\"217\" data-end=\"271\">\ud83e\udde9 <strong data-start=\"223\" data-end=\"271\">What is a Checkpoint in Distributed Systems?<\/strong><\/h2>\n<p data-start=\"273\" data-end=\"469\">A <strong data-start=\"275\" data-end=\"289\">checkpoint<\/strong> is a saved state of a process or the entire system at a specific point in time. 
If a failure occurs, the system can <strong data-start=\"406\" data-end=\"419\">roll back<\/strong> to the last checkpoint rather than starting over.<\/p>\n<hr data-start=\"471\" data-end=\"474\" \/>\n<h2 data-start=\"476\" data-end=\"511\">\ud83d\udee0\ufe0f <strong data-start=\"483\" data-end=\"511\">Purpose of Checkpointing<\/strong><\/h2>\n<ul data-start=\"513\" data-end=\"614\">\n<li data-start=\"513\" data-end=\"534\">\n<p data-start=\"515\" data-end=\"534\"><strong data-start=\"515\" data-end=\"534\">Fault Tolerance<\/strong><\/p>\n<\/li>\n<li data-start=\"535\" data-end=\"557\">\n<p data-start=\"537\" data-end=\"557\"><strong data-start=\"537\" data-end=\"557\">Failure Recovery<\/strong><\/p>\n<\/li>\n<li data-start=\"558\" data-end=\"588\">\n<p data-start=\"560\" data-end=\"588\"><strong data-start=\"560\" data-end=\"588\">Performance Optimization<\/strong><\/p>\n<\/li>\n<li data-start=\"589\" data-end=\"614\">\n<p data-start=\"591\" data-end=\"614\"><strong data-start=\"591\" data-end=\"614\">Minimizing Downtime<\/strong><\/p>\n<\/li>\n<\/ul>\n<hr data-start=\"616\" data-end=\"619\" \/>\n<h2 data-start=\"621\" data-end=\"674\">\ud83d\udcca <strong data-start=\"627\" data-end=\"674\">Types of Checkpoints in Distributed Systems<\/strong><\/h2>\n<h3 data-start=\"676\" data-end=\"704\">1. <strong data-start=\"683\" data-end=\"704\">Local Checkpoints<\/strong><\/h3>\n<ul data-start=\"705\" data-end=\"830\">\n<li data-start=\"705\" data-end=\"750\">\n<p data-start=\"707\" data-end=\"750\">Each process independently saves its state.<\/p>\n<\/li>\n<li data-start=\"751\" data-end=\"830\">\n<p data-start=\"753\" data-end=\"830\"><strong data-start=\"753\" data-end=\"780\">Simple but inconsistent<\/strong>: may lead to <em data-start=\"794\" data-end=\"809\">domino effect<\/em> (rollback cascades).<\/p>\n<\/li>\n<\/ul>\n<h3 data-start=\"832\" data-end=\"861\">2. 
<strong data-start=\"839\" data-end=\"861\">Global Checkpoints<\/strong><\/h3>\n<ul data-start=\"862\" data-end=\"1013\">\n<li data-start=\"862\" data-end=\"976\">\n<p data-start=\"864\" data-end=\"976\">A set of local checkpoints, one per process, such that the combination represents a <strong data-start=\"948\" data-end=\"975\">consistent global state<\/strong>.<\/p>\n<\/li>\n<li data-start=\"977\" data-end=\"1013\">\n<p data-start=\"979\" data-end=\"1013\">Used in coordinated checkpointing.<\/p>\n<\/li>\n<\/ul>\n<hr data-start=\"1015\" data-end=\"1018\" \/>\n<h2 data-start=\"1020\" data-end=\"1050\">\ud83e\udded <strong data-start=\"1026\" data-end=\"1050\">Checkpointing Levels<\/strong><\/h2>\n<div class=\"_tableContainer_16hzy_1\">\n<div class=\"_tableWrapper_16hzy_14 group flex w-fit flex-col-reverse\">\n<table class=\"w-fit min-w-(--thread-content-width)\" data-start=\"1052\" data-end=\"1521\">\n<thead data-start=\"1052\" data-end=\"1086\">\n<tr data-start=\"1052\" data-end=\"1086\">\n<th data-start=\"1052\" data-end=\"1060\" data-col-size=\"sm\">Level<\/th>\n<th data-start=\"1060\" data-end=\"1074\" data-col-size=\"sm\">Description<\/th>\n<th data-start=\"1074\" data-end=\"1086\" data-col-size=\"md\">Use Case<\/th>\n<\/tr>\n<\/thead>\n<tbody data-start=\"1122\" data-end=\"1521\">\n<tr data-start=\"1122\" data-end=\"1227\">\n<td data-start=\"1122\" data-end=\"1146\" data-col-size=\"sm\"><strong data-start=\"1124\" data-end=\"1145\">Application-level<\/strong><\/td>\n<td data-col-size=\"sm\" data-start=\"1146\" data-end=\"1175\">App explicitly saves state<\/td>\n<td data-col-size=\"md\" data-start=\"1175\" data-end=\"1227\">Custom control, efficient for app-specific logic<\/td>\n<\/tr>\n<tr data-start=\"1228\" data-end=\"1322\">\n<td data-start=\"1228\" data-end=\"1252\" data-col-size=\"sm\"><strong data-start=\"1230\" data-end=\"1247\">Library-level<\/strong><\/td>\n<td data-col-size=\"sm\" data-start=\"1252\" data-end=\"1281\">Uses a library (like 
DMTCP)<\/td>\n<td data-col-size=\"md\" data-start=\"1281\" data-end=\"1322\">Transparent to app, often used in HPC<\/td>\n<\/tr>\n<tr data-start=\"1323\" data-end=\"1421\">\n<td data-start=\"1323\" data-end=\"1347\" data-col-size=\"sm\"><strong data-start=\"1325\" data-end=\"1341\">System-level<\/strong><\/td>\n<td data-col-size=\"sm\" data-start=\"1347\" data-end=\"1374\">OS or VM-level snapshots<\/td>\n<td data-col-size=\"md\" data-start=\"1374\" data-end=\"1421\">No modification to app, broader but heavier<\/td>\n<\/tr>\n<tr data-start=\"1422\" data-end=\"1521\">\n<td data-start=\"1422\" data-end=\"1446\" data-col-size=\"sm\"><strong data-start=\"1424\" data-end=\"1442\">Hardware-level<\/strong><\/td>\n<td data-col-size=\"sm\" data-start=\"1446\" data-end=\"1477\">Hardware saves memory states<\/td>\n<td data-col-size=\"md\" data-start=\"1477\" data-end=\"1521\">Fastest, but rare and hardware-dependent<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<div class=\"sticky end-(--thread-content-margin) h-0 self-end select-none\">\n<div class=\"absolute end-0 flex items-end\"><\/div>\n<\/div>\n<\/div>\n<\/div>\n<hr data-start=\"1523\" data-end=\"1526\" \/>\n<h2 data-start=\"1528\" data-end=\"1571\">\ud83d\udd04 <strong data-start=\"1534\" data-end=\"1571\">Types of Checkpointing Techniques<\/strong><\/h2>\n<h3 data-start=\"1573\" data-end=\"1611\">\u2705 <strong data-start=\"1579\" data-end=\"1611\">1. Coordinated Checkpointing<\/strong><\/h3>\n<ul data-start=\"1612\" data-end=\"1742\">\n<li data-start=\"1612\" data-end=\"1671\">\n<p data-start=\"1614\" data-end=\"1671\">All processes coordinate their checkpoints so that together they form a consistent global state.<\/p>\n<\/li>\n<li data-start=\"1672\" data-end=\"1742\">\n<p data-start=\"1674\" data-end=\"1742\">Avoids inconsistency but can delay execution due to synchronization.<\/p>\n<\/li>\n<\/ul>\n<h3 data-start=\"1744\" data-end=\"1784\">\u2705 <strong data-start=\"1750\" data-end=\"1784\">2. 
Uncoordinated Checkpointing<\/strong><\/h3>\n<ul data-start=\"1785\" data-end=\"1880\">\n<li data-start=\"1785\" data-end=\"1822\">\n<p data-start=\"1787\" data-end=\"1822\">Processes checkpoint independently.<\/p>\n<\/li>\n<li data-start=\"1823\" data-end=\"1880\">\n<p data-start=\"1825\" data-end=\"1880\">Prone to <strong data-start=\"1834\" data-end=\"1851\">domino effect<\/strong>, but simpler implementation.<\/p>\n<\/li>\n<\/ul>\n<h3 data-start=\"1882\" data-end=\"1930\">\u2705 <strong data-start=\"1888\" data-end=\"1930\">3. Communication-Induced Checkpointing<\/strong><\/h3>\n<ul data-start=\"1931\" data-end=\"2046\">\n<li data-start=\"1931\" data-end=\"1989\">\n<p data-start=\"1933\" data-end=\"1989\">Checkpoints are taken based on message passing behavior.<\/p>\n<\/li>\n<li data-start=\"1990\" data-end=\"2046\">\n<p data-start=\"1992\" data-end=\"2046\">Tries to ensure consistency without full coordination.<\/p>\n<\/li>\n<\/ul>\n<hr data-start=\"2048\" data-end=\"2051\" \/>\n<h2 data-start=\"2053\" data-end=\"2109\">\ud83e\uddf0 <strong data-start=\"2059\" data-end=\"2109\">Tools for Checkpointing in Distributed Systems<\/strong><\/h2>\n<div class=\"_tableContainer_16hzy_1\">\n<div class=\"_tableWrapper_16hzy_14 group flex w-fit flex-col-reverse\">\n<table class=\"w-fit min-w-(--thread-content-width)\" data-start=\"2111\" data-end=\"2739\">\n<thead data-start=\"2111\" data-end=\"2141\">\n<tr data-start=\"2111\" data-end=\"2141\">\n<th data-start=\"2111\" data-end=\"2126\" data-col-size=\"md\">Tool\/Library<\/th>\n<th data-start=\"2126\" data-end=\"2141\" data-col-size=\"md\">Description<\/th>\n<\/tr>\n<\/thead>\n<tbody data-start=\"2173\" data-end=\"2739\">\n<tr data-start=\"2173\" data-end=\"2258\">\n<td data-start=\"2173\" data-end=\"2218\" data-col-size=\"md\"><strong data-start=\"2175\" data-end=\"2217\">BLCR (Berkeley Lab Checkpoint\/Restart)<\/strong><\/td>\n<td data-col-size=\"md\" data-start=\"2218\" data-end=\"2258\">Kernel-level checkpointing for 
Linux<\/td>\n<\/tr>\n<tr data-start=\"2259\" data-end=\"2357\">\n<td data-start=\"2259\" data-end=\"2304\" data-col-size=\"md\"><strong data-start=\"2261\" data-end=\"2303\">CRIU (Checkpoint\/Restore In Userspace)<\/strong><\/td>\n<td data-col-size=\"md\" data-start=\"2304\" data-end=\"2357\">Linux tool to freeze running apps and store state<\/td>\n<\/tr>\n<tr data-start=\"2358\" data-end=\"2492\">\n<td data-start=\"2358\" data-end=\"2412\" data-col-size=\"md\"><strong data-start=\"2360\" data-end=\"2411\">DMTCP (Distributed MultiThreaded CheckPointing)<\/strong><\/td>\n<td data-col-size=\"md\" data-start=\"2412\" data-end=\"2492\">Transparent user-level checkpointing for distributed and multi-threaded apps<\/td>\n<\/tr>\n<tr data-start=\"2493\" data-end=\"2578\">\n<td data-start=\"2493\" data-end=\"2521\" data-col-size=\"md\"><strong data-start=\"2495\" data-end=\"2520\">OpenMPI Checkpointing<\/strong><\/td>\n<td data-col-size=\"md\" data-start=\"2521\" data-end=\"2578\">MPI-based applications using BLCR for fault tolerance<\/td>\n<\/tr>\n<tr data-start=\"2579\" data-end=\"2653\">\n<td data-start=\"2579\" data-end=\"2593\" data-col-size=\"md\"><strong data-start=\"2581\" data-end=\"2592\">LAM\/MPI<\/strong><\/td>\n<td data-col-size=\"md\" data-start=\"2593\" data-end=\"2653\">Supports checkpoint\/restart via coordination in MPI apps<\/td>\n<\/tr>\n<tr data-start=\"2654\" data-end=\"2739\">\n<td data-start=\"2654\" data-end=\"2686\" data-col-size=\"md\"><strong data-start=\"2656\" data-end=\"2685\">Docker Checkpoint\/Restore<\/strong><\/td>\n<td data-col-size=\"md\" data-start=\"2686\" data-end=\"2739\">Uses CRIU under the hood to checkpoint containers<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<div class=\"sticky end-(--thread-content-margin) h-0 self-end select-none\">\n<div class=\"absolute end-0 flex items-end\"><\/div>\n<\/div>\n<\/div>\n<\/div>\n<hr data-start=\"2741\" data-end=\"2744\" \/>\n<h2 data-start=\"2746\" data-end=\"2783\">\u26a0\ufe0f <strong 
data-start=\"2752\" data-end=\"2783\">Challenges in Checkpointing<\/strong><\/h2>\n<ul data-start=\"2785\" data-end=\"3009\">\n<li data-start=\"2785\" data-end=\"2834\">\n<p data-start=\"2787\" data-end=\"2834\"><strong data-start=\"2787\" data-end=\"2799\">Overhead<\/strong> of taking and storing checkpoints.<\/p>\n<\/li>\n<li data-start=\"2835\" data-end=\"2883\">\n<p data-start=\"2837\" data-end=\"2883\"><strong data-start=\"2837\" data-end=\"2852\">Consistency<\/strong> across distributed components.<\/p>\n<\/li>\n<li data-start=\"2884\" data-end=\"2960\">\n<p data-start=\"2886\" data-end=\"2960\">Handling <strong data-start=\"2895\" data-end=\"2914\">message replays<\/strong>, <strong data-start=\"2916\" data-end=\"2930\">open files<\/strong>, and <strong data-start=\"2936\" data-end=\"2959\">network connections<\/strong>.<\/p>\n<\/li>\n<li data-start=\"2961\" data-end=\"3009\">\n<p data-start=\"2963\" data-end=\"3009\"><strong data-start=\"2963\" data-end=\"2974\">Storage<\/strong> and management of checkpoint data.<\/p>\n<\/li>\n<\/ul>\n<hr data-start=\"3011\" data-end=\"3014\" \/>\n<h2 data-start=\"3016\" data-end=\"3029\">\ud83e\udde0 Summary<\/h2>\n<p data-start=\"3031\" data-end=\"3262\">Checkpointing is a <strong data-start=\"3050\" data-end=\"3072\">critical technique<\/strong> in distributed systems for ensuring <strong data-start=\"3109\" data-end=\"3139\">reliability and resilience<\/strong>. 
Depending on the application\u2019s complexity and requirements, different <strong data-start=\"3211\" data-end=\"3249\">checkpointing strategies and tools<\/strong> can be used.<\/p>\n<hr data-start=\"3264\" data-end=\"3267\" \/>\n<h3 data-start=\"3269\" data-end=\"3393\"><a href=\"https:\/\/archive.mu.ac.in\/myweb_test\/MCA%20study%20material\/M.C.A.(Sem%20-%20V)%20Distributed%20Computing.pdf\" target=\"_blank\" rel=\"noopener\">Distributed Computing: Distributed System Checkpoints and it&#8217;s type-checkpoint levels its tools<\/a><\/h3>\n<h3 class=\"LC20lb MBeuO DKV0Md\"><a href=\"https:\/\/eclass.uoa.gr\/modules\/document\/file.php\/D245\/2015\/DistrComp.pdf\" target=\"_blank\" rel=\"noopener\">Distributed Computing: Principles, Algorithms, and Systems<\/a><\/h3>\n<h3 class=\"LC20lb MBeuO DKV0Md\"><a href=\"https:\/\/dmice.ac.in\/wp-content\/uploads\/2023\/10\/CS3551-Distributed-Computing.pdf\" target=\"_blank\" rel=\"noopener\">CS3551 &#8211; DISTRIBUTED SYSTEMS 2 MARKS AND 16 &#8230;<\/a><\/h3>\n<h3 class=\"LC20lb MBeuO DKV0Md\"><a href=\"https:\/\/www.aalimec.ac.in\/wp-content\/uploads\/Material\/cse\/3\/CS3551%20-%20Distributed%20Computing.pdf\" target=\"_blank\" rel=\"noopener\">CS3551- DISTRIBUTED COMPUTING UNIT I &#8230;<\/a><\/h3>\n","protected":false},"excerpt":{"rendered":"<p>Distributed Computing: Distributed System Checkpoints and it&#8217;s type-checkpoint levels its 
tools.<\/p>\n","protected":false},"author":64,"featured_media":2518,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[84],"tags":[1707,1708,1709,1710],"class_list":["post-2517","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-distributed-computing","tag-checkpoint-levels-its-tools","tag-distributed-computing","tag-distributed-system-checkpoints","tag-type-of-checkpoints"],"_links":{"self":[{"href":"https:\/\/www.reilsolar.com\/pdf\/wp-json\/wp\/v2\/posts\/2517","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.reilsolar.com\/pdf\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.reilsolar.com\/pdf\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.reilsolar.com\/pdf\/wp-json\/wp\/v2\/users\/64"}],"replies":[{"embeddable":true,"href":"https:\/\/www.reilsolar.com\/pdf\/wp-json\/wp\/v2\/comments?post=2517"}],"version-history":[{"count":0,"href":"https:\/\/www.reilsolar.com\/pdf\/wp-json\/wp\/v2\/posts\/2517\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.reilsolar.com\/pdf\/wp-json\/wp\/v2\/media\/2518"}],"wp:attachment":[{"href":"https:\/\/www.reilsolar.com\/pdf\/wp-json\/wp\/v2\/media?parent=2517"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.reilsolar.com\/pdf\/wp-json\/wp\/v2\/categories?post=2517"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.reilsolar.com\/pdf\/wp-json\/wp\/v2\/tags?post=2517"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}