Ceph: mon is down and/or can’t rejoin the quorum

Sometimes a Ceph mon is reported as down and cannot rejoin the mon quorum, even though the mon node itself is up and the ceph-mon process on it is running.

A quick fix is described below.

Because the mon is already down and out of quorum, it is safe to wipe its local data and rebuild it with the following steps:

Prerequisites: Connect to the Ceph mon node (or the controller node where the down mon is installed) and check whether ceph-mon is running, using “ps -ef | grep [c]eph-mon” (the brackets keep grep from matching its own process). If ceph-mon is running but unresponsive, stop or kill the process. After that, the output of this “ps” command should be empty.
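
The same check can be scripted. This is only a minimal sketch, assuming pgrep (from procps) is installed on the mon node; the helper name proc_running is made up for illustration and is not part of Ceph.

```shell
# Return 0 if a process with exactly this name is running, non-zero otherwise.
# Assumption: pgrep is available; -x matches the process name exactly.
proc_running() {
    pgrep -x "$1" >/dev/null 2>&1
}

if proc_running ceph-mon; then
    echo "ceph-mon is still running; stop it before continuing"
else
    echo "no ceph-mon process found"
fi
```

Matching on the process name rather than the full command line avoids the helper matching the shell that runs it.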

Remove the ceph mon data directory (node-x is the mon ID; -r is needed because it is a directory):
  # rm -rf /var/lib/ceph/mon/ceph-node-x
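
If deleting the store outright feels too risky, it can be moved aside instead. The following is a hedged sketch, not part of the original procedure: backup_mon_dir and the backup location are hypothetical, and the backup is deliberately placed outside /var/lib/ceph/mon on the assumption that some init scripts scan that directory for mon instances.

```shell
# Move a mon data directory into a backup location and print the new path.
# backup_mon_dir is a made-up helper name; both arguments are chosen by the caller.
backup_mon_dir() {
    src=$1
    dest_parent=$2
    if [ ! -d "$src" ]; then
        echo "no such directory: $src" >&2
        return 1
    fi
    dest="$dest_parent/$(basename "$src").$(date +%Y%m%d%H%M%S)"
    if [ -e "$dest" ]; then
        echo "backup target already exists: $dest" >&2
        return 1
    fi
    mv -- "$src" "$dest" && echo "$dest"
}

# Example call (commented out so nothing is moved by accident):
# backup_mon_dir /var/lib/ceph/mon/ceph-node-x /var/backups
```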

Fetch the existing mon. keyring from the cluster (saved as key.txt):
    # ceph auth get mon. -o key.txt

Get a copy of the current mon map (saved as monmap.bin):
   # ceph mon getmap -o monmap.bin
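
Before rebuilding the mon, it may be worth confirming that both files were actually written. A small sketch, assuming the file names key.txt and monmap.bin from the steps above; require_nonempty is a hypothetical helper name.

```shell
# Succeed only if every given file exists and is non-empty.
require_nonempty() {
    for f in "$@"; do
        if [ ! -s "$f" ]; then
            echo "missing or empty: $f" >&2
            return 1
        fi
    done
}

# Example (commented out; needs the files and the Ceph tools on this host):
# require_nonempty key.txt monmap.bin && monmaptool --print monmap.bin
```

monmaptool --print shows the monitors listed in the map, which is a quick way to confirm that node-x is still in it.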

Re-create the data store of the down mon from the mon map and the keyring (with --mkfs, the map is passed via --monmap):

   # ceph-mon -i node-x --mkfs --monmap monmap.bin --keyring key.txt

Now start the ceph mon service (the first form is for Upstart, the second for systemd):
    # start ceph-mon id=node-x
        or
    # systemctl start ceph-mon@node-x
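
Once the service is up, the mon should show up in the quorum again. The sketch below checks for a mon name in the quorum_names list printed by "ceph quorum_status --format json". It assumes the JSON arrives on a single line, and in_quorum is a made-up helper name.

```shell
# Read quorum_status JSON on stdin; succeed if the given mon name is
# listed in "quorum_names". Assumes single-line (compact) JSON.
in_quorum() {
    mon=$1
    sed -n 's/.*"quorum_names"[[:space:]]*:[[:space:]]*\[\([^]]*\)\].*/\1/p' |
        tr -d '" ' | tr ',' '\n' | grep -qxF -- "$mon"
}

# Example (commented out; needs a reachable cluster):
# ceph quorum_status --format json | in_quorum node-x && echo "node-x is in quorum"
```

It can take a short while after startup before the mon has synced and joined, so a first negative result is not necessarily a failure.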

Remove the monmap.bin and key.txt files.
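
The whole procedure can be sketched as one script. This is an illustration under stated assumptions, not a tested tool: MON_ID, WORKDIR, DRY_RUN, run and recover_mon are hypothetical names, the systemd start form is used, and the map is passed with --monmap, the option the Ceph documentation uses together with --mkfs. By default the script only prints the commands.

```shell
#!/bin/sh
# Sketch: replay the recovery steps for one monitor.
# MON_ID, WORKDIR and DRY_RUN are illustrative names, not from the original post.
MON_ID=${MON_ID:-node-x}
WORKDIR=${WORKDIR:-/tmp}
DRY_RUN=${DRY_RUN:-1}    # 1 = only print the commands, anything else = execute

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Each step runs only if the previous one succeeded.
recover_mon() {
    run rm -rf "/var/lib/ceph/mon/ceph-$MON_ID" &&
    run ceph auth get mon. -o "$WORKDIR/key.txt" &&
    run ceph mon getmap -o "$WORKDIR/monmap.bin" &&
    run ceph-mon -i "$MON_ID" --mkfs --monmap "$WORKDIR/monmap.bin" --keyring "$WORKDIR/key.txt" &&
    run systemctl start "ceph-mon@$MON_ID" &&
    run rm -f "$WORKDIR/monmap.bin" "$WORKDIR/key.txt"
}

recover_mon
```

Reviewing the printed commands first and only then setting DRY_RUN=0 keeps the destructive rm step from running unnoticed.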

 

Ref: 
http://docs.ceph.com/docs/master/rados/troubleshooting/troubleshooting-mon/#recovering-a-monitor-s-broken-monmap
