Overview
This document explains step-by-step how to safely remove OSDs from a live Ceph cluster without data loss or downtime. Continuously monitoring the cluster's health during the process is critically important.
Preparations
Check the overall health of the cluster with the ceph -s command. Review the OSD layout and per-OSD utilization with the ceph osd tree or ceph osd df commands.
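Before touching anything, it can also be worth asking Ceph whether the target OSD can be taken out safely. A minimal pre-check sketch, using osd.34 (the example ID used throughout this document); the ok-to-stop and safe-to-destroy subcommands are available in recent Ceph releases:
ceph -s                       # overall cluster health
ceph osd tree                 # where osd.34 sits in the CRUSH hierarchy
ceph osd df                   # current utilization per OSD
ceph osd ok-to-stop 34        # checks whether stopping osd.34 would make any PGs unavailable
ceph osd safe-to-destroy 34   # checks whether osd.34's data is fully replicated elsewhere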
Step-by-Step Process
1. Mark the OSD as "Out"
Command:
ceph osd out <osd_id>
Example:
ceph osd out 34
Note: When the command is executed, you should receive the message "marked out osd.34".
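Marking an OSD out only tells Ceph to start migrating data away from it; the step is reversible until you continue with the later steps. A quick sketch, again with osd.34 as the example:
ceph osd tree | grep osd.34   # the REWEIGHT column should now show 0 for osd.34
ceph osd in 34                # only if you need to undo the previous command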
2. Monitor Cluster Status
Purpose: Ensure the rebalance process is complete.
Command:
ceph -s
Checkpoints: all placement groups should report active+clean, and the recovery/backfill activity shown by ceph -s should have finished.
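Rather than re-running ceph -s by hand, the following sketch keeps an eye on the rebalance; watch is assumed to be available on the admin node:
watch -n 10 ceph -s     # refresh the cluster status every 10 seconds
ceph pg stat            # one-line summary of PG states
ceph health detail      # lists any PGs that are not yet active+clean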
3. Removing the OSD from the CRUSH Map
Purpose: Remove the OSD from the CRUSH map so it no longer participates in the cluster's data distribution.
Command:
ceph osd crush remove <osd_name>
Example:
ceph osd crush remove osd.34
Checkpoints: the OSD should no longer be attached to its host bucket in the output of ceph osd tree. It may still be listed at the bottom of that output until it is fully removed in step 5.
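A quick way to confirm the CRUSH change took effect, again using osd.34:
ceph osd tree         # osd.34 should no longer hang under its host bucket
ceph osd crush tree   # shows only the CRUSH hierarchy, so a detached osd.34 will not appear here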
4. Removing OSD Authorization
Purpose: Remove the authentication information associated with the OSD.
Command:
ceph auth del osd.<osd_id>
Example:
ceph auth del osd.34
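To verify that the key is really gone (a sketch with the same example ID):
ceph auth get osd.34         # should now return an error, since the entity no longer exists
ceph auth ls | grep osd.34   # should produce no output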
5. Removing the OSD from the Cluster
Purpose: Completely delete the OSD's record from the Ceph cluster.
Command:
ceph osd down <osd_id>
ceph osd rm <osd_id>
Example:
ceph osd down 34
ceph osd rm 34
Sometimes the ceph osd down command succeeds but ceph osd rm does not, typically because the OSD daemon is still running on its node. In that case, stop the OSD service with systemctl and then retry the rm command. On a traditional (non-cephadm) deployment the unit is normally named ceph-osd@<osd_id>.service; on cephadm-managed clusters the unit name also includes the cluster FSID.
systemctl stop ceph-osd@34.service
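Once the rm command succeeds, a short sanity check confirms the OSD is gone from the cluster map (osd.34 as the running example):
ceph osd ls    # 34 should no longer appear in the list of OSD IDs
ceph osd tree  # osd.34 should be completely absent from the output
ceph -s        # the cluster should return to a healthy state once rebalancing settles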
Example: Removing an OSD Node
ceph osd crush rm <node name>
We are removing the node from the CRUSH map.
ceph orch host drain <node name>
We are removing all services from the node by draining it.
ceph orch daemon rm osd.34 --force
We are removing the remaining OSDs on the node as a daemon.
ceph orch host rm <node name>
Finally, we completely remove the node. It will no longer appear in our host list.
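To confirm the node is fully gone from the orchestrator's point of view, the following checks can be used (host01 stands in for the removed node's hypothetical name):
ceph orch host ls     # the removed node should no longer be listed
ceph orch ps host01   # should report no daemons for the removed host
ceph osd tree         # the node's bucket should be absent from the CRUSH hierarchy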
Additional Notes
Process Intervals:
After each step, be sure to check the cluster status with the ceph -s command. Only proceed to the next step once the cluster is in a healthy (active+clean) state.
Total Number of OSDs:
If there are 35 OSDs in the cluster and you want to remove, for example, 15 OSDs, perform the operation in small groups (2-3 at a time) instead of removing them all at once.
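A minimal sketch of that batched approach, assuming a hypothetical first group of OSD IDs 10, 11 and 12:
for id in 10 11 12; do
    ceph osd out "$id"    # mark each OSD in the group out
done
# Wait until ceph -s shows all PGs active+clean, then finish steps 3-5 for
# this group before starting on the next one.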
Conclusion
By carefully following these steps, you can remove OSDs from a live Ceph cluster without any downtime. Continuously monitoring the cluster's health and proceeding in small steps will help prevent data loss and performance issues.