2.4 Phase Three – Increased backfilling during the day
Following the decision to backfill only at night, to minimize the impact of the migration on our customers and keep operations stable during the day, data redundancy degraded because of failing Filestore OSDs and nodes. Since backfilling was completely stopped during working hours, recovery was paused as well: Ceph does not automatically prioritize recovery over backfilling, so we had to enforce recovery manually. Some PGs had been left degraded for so long that they could no longer be recovered and had to be backfilled instead. The resulting slow loss of redundancy was, by itself, no cause for concern. However, the combination of a failed Filestore node and a wrong decision by the admin on duty left Ceph with only one remaining copy of the data. While this still allowed write access to the cluster, after some time the affected PGs could no longer be recovered without setting the cluster read-only. To recover from this, we performed automated maintenances in agreed time slots each night.
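The situation can be illustrated with a small model of Ceph's replica rule: a placement group only serves I/O while the number of live replicas is at least the pool's min_size, so running with a single surviving copy keeps writes flowing only if min_size is 1, at the cost of having no redundancy left. This is a hedged Python sketch of that rule, not actual Ceph code; the function name and state labels are ours.

```python
# Minimal model of Ceph's min_size rule for a replicated pool.
# Illustrative sketch only: pg_state is a hypothetical name, not
# part of Ceph; the state labels mirror Ceph's PG state names.

def pg_state(live_replicas: int, size: int = 3, min_size: int = 2) -> str:
    """Classify a PG by how many of its `size` replicas are alive."""
    if live_replicas == 0:
        return "down"            # no copy left, data unavailable
    if live_replicas < min_size:
        return "inactive"        # blocks I/O until recovered
    if live_replicas < size:
        return "degraded"        # serves I/O with reduced redundancy
    return "active+clean"

# With the default min_size=2 a single surviving copy blocks I/O;
# only min_size=1 keeps the PG writable with one replica left.
```

Under this model, a pool left at one replica is writable but one further failure away from data loss, which is why the PGs eventually could not be recovered without taking the cluster read-only.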
The metrics for Filestore and Bluestore did not change significantly and were essentially the same as before, so they are not discussed further here. We did, however, compare the reliability of the two backends.
2.5 Reliability of Filestore and Bluestore
The reliability of Filestore nodes has been an issue for us for quite some time. Under load in particular, Filestore OSDs tend to “flap”: the OSD process becomes unresponsive for a while, most likely because it is blocked waiting on the underlying disk.
If multiple OSD processes hang, the whole node can become unstable, as has happened several times. If the node does not become responsive again and has to be restarted, there is a greatly increased chance that the filesystem becomes corrupted. An XFS repair takes roughly 12 hours and does not always succeed.
Unfortunately, our long-term monitoring data was sampled at too low a rate to reliably capture OSD down events, so it only shows some of the flapping OSDs and nodes during the backfilling process:
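The sampling problem is easy to see in miniature: a flap shorter than the polling interval can fall entirely between two samples and leave no trace in the monitoring data. The following Python sketch is purely illustrative; the function and the numbers are hypothetical, not taken from our monitoring setup.

```python
# Sketch of why a low sampling rate hides short OSD flaps.
# down_intervals are (start, end) times in seconds during which an
# OSD was down; the monitor polls every sample_period seconds.

def sampled_down(down_intervals, sample_period, horizon):
    """Return the sample times at which the OSD is seen as down."""
    return [t for t in range(0, horizon, sample_period)
            if any(start <= t < end for start, end in down_intervals)]

# A 30-second flap (70s..100s) is invisible to a 60s poll
# (samples at 60s and 120s both see the OSD up), while a 15s
# poll catches it at t=75 and t=90.
```

This is why the long-term graphs show only some of the flapping OSDs: any flap shorter than the scrape interval has at most a chance of coinciding with a sample.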