Ceph: OSD heartbeats

Each Ceph OSD Daemon checks the heartbeat of other Ceph OSD Daemons every 6 seconds by default. You can change this “heartbeat interval” by adding an “osd_heartbeat_interval” setting under the [osd] section of the Ceph configuration file, or by setting the value at runtime.

If a neighboring Ceph OSD Daemon doesn’t show a heartbeat within a grace period (20 seconds by default), the Ceph OSD Daemon may consider the neighboring Ceph OSD Daemon “down” and report it to a Ceph Monitor, which will update the Ceph Cluster Map. You can change the “grace period” by adding an “osd_heartbeat_grace” setting under the [osd] section of the Ceph configuration file, or by setting the value at runtime.

In other words, if one OSD doesn’t hear back from a peer within `osd_heartbeat_grace` (20 seconds by default), the OSD that sent the heartbeat check reports the unresponsive OSD to the MONs as down. Once an OSD has reported three times that the non-responding OSD is `down`, the Ceph MON acknowledges the report and marks that OSD as down.

The Ceph monitor then updates the cluster map and sends it to all participating nodes in the cluster.
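The grace-period check described above can be sketched in a few lines of Python. This is only an illustrative model, not Ceph's actual code: the function name `peers_to_report` and the sample timestamps are made up for the example, and only the 20-second default comes from the text.

```python
from datetime import datetime, timedelta

# Default grace period mentioned in the text (osd_heartbeat_grace = 20 seconds).
OSD_HEARTBEAT_GRACE = timedelta(seconds=20)


def peers_to_report(last_reply, now, grace=OSD_HEARTBEAT_GRACE):
    """Return the peers whose last heartbeat reply is older than the grace period.

    last_reply maps a peer name to the time of its last heartbeat reply.
    (Hypothetical helper for illustration only.)
    """
    return sorted(peer for peer, seen in last_reply.items() if now - seen > grace)


now = datetime(2016, 2, 28, 17, 30, 10)
last_reply = {
    "osd.13": now - timedelta(seconds=4),   # replied recently
    "osd.14": now - timedelta(seconds=26),  # silent for longer than the grace period
}
print(peers_to_report(last_reply, now))  # ['osd.14']
```

Only osd.14 is returned, because it is the only peer that has been silent for longer than the grace period.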

When an OSD can’t reach another OSD for a heartbeat, it reports the following in the OSD logs:

osd.15 1497 heartbeat_check: no reply from osd.14 since back 2016-02-28 17:29:44.013402
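If you want to pull such messages out of the OSD logs, a small Python sketch like the one below may help. The regular expression is modeled only on the single log line shown above, so treat it as an assumption; other Ceph releases may format this message differently. The function name `parse_heartbeat_failure` is hypothetical.

```python
import re
from datetime import datetime

# Pattern modeled on the one sample line above (an assumption, not a spec).
HEARTBEAT_RE = re.compile(
    r"(?P<reporter>osd\.\d+) \d+ heartbeat_check: "
    r"no reply from (?P<peer>osd\.\d+) since back "
    r"(?P<since>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+)"
)


def parse_heartbeat_failure(line):
    """Return (reporting OSD, silent OSD, last reply time), or None if no match."""
    match = HEARTBEAT_RE.search(line)
    if match is None:
        return None
    since = datetime.strptime(match.group("since"), "%Y-%m-%d %H:%M:%S.%f")
    return match.group("reporter"), match.group("peer"), since


LINE = "osd.15 1497 heartbeat_check: no reply from osd.14 since back 2016-02-28 17:29:44.013402"
print(parse_heartbeat_failure(LINE))
```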

Note: Starting with the Ceph Jewel release, the Ceph MONs require at least 2 OSDs, located on two nodes in different CRUSH subtrees, to report a specific OSD as down before actually marking it “down”. This is controlled by the following configuration options:

mon_osd_min_down_reporters
 Number of OSDs from different subtrees that need to report a down OSD for it to count.

mon_osd_reporter_subtree_level
 The level of parent bucket in which the reporters are counted.
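The subtree rule can be pictured with a simplified Python model. It is not the monitor's real implementation: the function `should_mark_down`, the host names, and the use of "host" as the subtree level are assumptions made for the example. The threshold of 2 is the minimum mentioned in the note above.

```python
def should_mark_down(reporter_subtrees, min_down_reporters=2):
    """Decide whether enough distinct subtrees have reported an OSD as down.

    reporter_subtrees maps each reporting OSD to the subtree (here: the host)
    it lives in. Reports from the same subtree count only once.
    (Simplified, hypothetical model.)
    """
    distinct_subtrees = set(reporter_subtrees.values())
    return len(distinct_subtrees) >= min_down_reporters


same_host = {"osd.15": "node1", "osd.16": "node1"}  # two reporters, one host
two_hosts = {"osd.15": "node1", "osd.21": "node2"}  # two reporters, two hosts

print(should_mark_down(same_host))  # False
print(should_mark_down(two_hosts))  # True
```

Two reporters on the same host are not enough in this model, while two reporters on different hosts are.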


Ceph: How to map PGs to pool

This post explains how PGs (i.e. Placement Groups) map to pools.

First, use the “ceph pg dump” command to dump the PG information:

# ceph pg dump

This command should output something like the following:

pg_stat  objects  mip  degr  unf  bytes    log   disklog  state         state_stamp …
19.6b    1234     0    0     0    3453134  3001  3001     active+clean

Note: The above output is a snippet from “ceph pg dump”.

The first field is the PG ID, which consists of two values separated by a single dot (.).

– The left-side value is the pool ID (19 is the pool ID),

– The right-side value is the actual PG number (6b is the PG number).

NOTE: A PG belongs to exactly one pool; PGs are never shared across pools. OSDs, however, are shared across multiple PGs.

Use the command below to get the pool information:

# ceph osd lspools
11 images, data,12 metadata3 ,13 rbd, .rgw 14 users, 15, 16 images,17 ec_pool,19 volumes

So, PG 19.6b belongs to pool number ‘19’, i.e. ‘volumes’, and its PG number is ‘6b’.
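The same lookup can be written as a short Python sketch. The `POOLS` dictionary contains only the two pools that are clearly readable in the lspools output above, and the helper name `split_pg_id` is made up for the example. Note that the PG number part of a PG ID is hexadecimal, so 6b is 107 in decimal.

```python
# Subset of the pool list shown above (pool ID -> pool name).
POOLS = {17: "ec_pool", 19: "volumes"}


def split_pg_id(pg_id):
    """Split a PG ID such as '19.6b' into (pool ID, PG number).

    The pool ID is decimal; the PG number is hexadecimal.
    """
    pool_str, pg_hex = pg_id.split(".")
    return int(pool_str), int(pg_hex, 16)


pool_id, pg_number = split_pg_id("19.6b")
print(pool_id, POOLS[pool_id], pg_number)  # 19 volumes 107
```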

Note: ‘ceph pg dump’ also shows other details, such as the acting OSD set, the primary OSD, the last time the PG was reported, the state of the PG, and the times at which the last normal scrub and deep-scrub were run.


Check IOPS on disks

I use the spew tool to measure IOPS. spew is available for standard Unix/Linux systems.

spew is an I/O performance measurement and load generation tool. It writes and/or reads generated data to or from a character device, block device, or regular file.

spew -i 20 -v -d --write -r -b 4096 1g ./test-spew   -> which gives IOPS

dd if=/dev/zero of=/vol101/test bs=16K count=1600 oflag=direct // the direct flag disables the cache, which slows down the transfer

swami@ubuntu:~$ spew -i 20 -v -d --write -r -b 4096 1m ./test-spew
Total iterations: 20
Total runtime: 00:00:11
Total write transfer time (WTT): 00:00:11
Total write transfer rate (WTR): 1776.76 KiB/s
Total write IOPS: 444.19 IOPS


swami@ubuntu:~$ dd if=/dev/zero of=.//test count=1600 oflag=direct
1600+0 records in
1600+0 records out
819200 bytes (819 kB) copied, 2.50733 s, 327 kB/s

NOTE: The direct flag disables the cache, which slows down the transfer.
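The numbers in the two outputs above are related by simple arithmetic, which the following Python sketch reproduces. The helper name `iops_from_rate` is made up for the example; the input values are taken from the outputs above. Since the dd run does not pass a block size, it uses dd's default of 512 bytes per record (819200 bytes / 1600 records).

```python
def iops_from_rate(rate_kib_per_s, block_size_bytes):
    """Convert a transfer rate in KiB/s into I/O operations per second."""
    return rate_kib_per_s * 1024 / block_size_bytes


# spew: 1776.76 KiB/s with 4096-byte blocks.
spew_iops = iops_from_rate(1776.76, 4096)
print(round(spew_iops, 2))  # 444.19

# dd: 1600 records of 512 bytes in 2.50733 seconds.
dd_ops_per_s = 1600 / 2.50733
dd_rate_kb_per_s = 819200 / 2.50733 / 1000
print(round(dd_ops_per_s, 1), round(dd_rate_kb_per_s))  # 638.1 327
```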



Show Ceph OSD to journal disk mappings

Use the “ceph-disk list” command on a Ceph OSD node to see the data drives and journal drives:

# ceph-disk list
/dev/sdb :
/dev/sdb1 ceph data, active, test cluster 5a7cebf2-ceef-49b1-8928-2d36e6044db4, osd.19, journal /dev/sde1
/dev/sdc :
/dev/sdc1 ceph data, active, test cluster 5a7cebf2-ceef-49b1-8928-2d36e6044db4, osd.20, journal /dev/sde2
/dev/sde1 ceph journal, for /dev/sdy1
/dev/sde2 ceph journal, for /dev/sdz1

In the output above, you can see the partitions of the Ceph data and journal drives.

To see the Ceph OSD data partitions:
# ceph-disk list | grep "ceph data"
To see the Ceph OSD journal partitions:
# ceph-disk list | grep "ceph journal"
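To turn that listing into an OSD-to-journal table, a small Python sketch like the following could be used. The regular expression is modeled only on the "ceph data" lines shown above and may need adjusting for other ceph-disk versions; the name `osd_journal_map` is hypothetical.

```python
import re

# Two "ceph data" entries copied from the listing above.
SAMPLE = """\
/dev/sdb :
/dev/sdb1 ceph data, active, test cluster 5a7cebf2-ceef-49b1-8928-2d36e6044db4, osd.19, journal /dev/sde1
/dev/sdc :
/dev/sdc1 ceph data, active, test cluster 5a7cebf2-ceef-49b1-8928-2d36e6044db4, osd.20, journal /dev/sde2
"""

# Assumed line layout: <data partition> ceph data, ..., osd.<id>, journal <journal partition>
DATA_RE = re.compile(
    r"^\s*(?P<data>/dev/\S+) ceph data,.*?(?P<osd>osd\.\d+), journal (?P<journal>/dev/\S+)"
)


def osd_journal_map(text):
    """Map each OSD name to its (data partition, journal partition)."""
    mapping = {}
    for line in text.splitlines():
        match = DATA_RE.match(line)
        if match:
            mapping[match.group("osd")] = (match.group("data"), match.group("journal"))
    return mapping


for osd, (data, journal) in sorted(osd_journal_map(SAMPLE).items()):
    print(osd, data, journal)
```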



Ceph: pool’s PG and PGP numbers

PG number calculation per pool with an example:

Ceph cluster capacity: 900 TB; the rbd pool has a pg_num of 8196 and a replica count of 3.

900 TB / 3 => 300 TB

300 TB * 1024 => 307200 GB * 1024 => 314572800 MB / 4 MB (object chunk size) => 78643200 objects

so, from the above:

– Max number of objects that can be stored in the rbd pool: 78643200

– Max number of objects per PG: 78643200/8196 (pg_num for the rbd pool) => 9595

– With filestore split/merge settings of 8/40, a PG can hold at most 8*40*16 => 5120 objects before splitting starts

This is the current status of the cluster:

690 TB used, 210 TB / 900 TB avail

690 TB / 3 => 230 TB

230 TB * 1024 => 235520 GB * 1024 => 241172480 MB / 4 MB (object chunk size) => 60293120 objects

– Objects already stored per PG: 60293120/8196 => 7356

This means the PGs have already crossed the split threshold of 5120 objects per PG. Whenever PG splitting hits an OSD, that OSD may be busy with the split, which leads to slow requests.
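The arithmetic above can be double-checked with a few lines of Python. All inputs (900 TB and 690 TB raw, replica count 3, 4 MB objects, pg_num 8196, split/merge values 8 and 40) come from the text; the function name `objects_per_pg` is made up for the example.

```python
MB_PER_TB = 1024 * 1024   # TB -> GB -> MB, as in the calculation above
OBJECT_SIZE_MB = 4        # object chunk size
PG_NUM = 8196             # pg_num of the rbd pool
REPLICAS = 3


def objects_per_pg(raw_tb):
    """Number of 4 MB objects per PG for a given raw capacity in TB."""
    usable_mb = raw_tb // REPLICAS * MB_PER_TB
    objects = usable_mb // OBJECT_SIZE_MB
    return objects // PG_NUM


split_threshold = 8 * 40 * 16  # filestore split/merge settings from the text

print(objects_per_pg(900))  # 9595 (maximum at full capacity)
print(objects_per_pg(690))  # 7356 (current usage)
print(split_threshold)      # 5120
print(objects_per_pg(690) > split_threshold)  # True
```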

PG calculations as per the Ceph PG calculator at http://ceph.com/pgcalc/

OSD count = 950; as this is the main pool, I am taking %data as 99% of the cluster capacity,
and target PGs per OSD: 100

PG count -> 32768

OSD count = 950; as this is the main pool, I am taking %data as 99% of the cluster capacity,
and target PGs per OSD: 200

PG count -> 65536
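The two results can be reproduced with a rough Python approximation of the calculation. This sketch multiplies the OSD count, the target PGs per OSD and the %data share, divides by the replica count (3, as in the example above), and rounds up to the next power of two. The rounding rule is a simplification and the calculator's own rounding may differ in other cases; the function name `suggested_pg_count` is hypothetical.

```python
def suggested_pg_count(osd_count, target_pgs_per_osd, data_fraction, replicas):
    """Approximate PG count: raw value rounded up to the next power of two."""
    raw = osd_count * target_pgs_per_osd * data_fraction / replicas
    power = 1
    while power < raw:
        power *= 2
    return power


print(suggested_pg_count(950, 100, 0.99, 3))  # 32768
print(suggested_pg_count(950, 200, 0.99, 3))  # 65536
```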




How to change default configuration options

Ceph configuration options can be changed by setting them in the /etc/ceph/ceph.conf file. However, after updating options in ceph.conf, you need to restart the daemon for the changes to take effect.

How to change configuration options without restarting the Ceph daemons?
There are 2 ways to inject a configuration option dynamically (i.e. without restarting the Ceph daemons):

1. Use "config set" to set a configuration option. For ex:
   $ ceph daemon osd.0 config set osd_recovery_threads 2

2. Use "ceph tell <type.id> injectargs" to inject an option. For ex:
   $ ceph tell osd.0 injectargs -- --osd_recovery_threads=2

   With the help of injectargs, we can set more than one option at a time:
   $ ceph tell osd.* injectargs -- --osd_recovery_threads=2 --osd_disk_threads=2

NOTE: Configuration options set at runtime are lost when the daemons restart. To make them persist across restarts, set the options at runtime and also update ceph.conf.
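One way to keep the runtime change and the persistent change in sync is to generate both from the same set of options. The Python sketch below only builds the command line and the ceph.conf lines as text; it does not run anything. The helper names `injectargs_command` and `conf_lines` are made up for the example, and the command layout simply mirrors the injectargs examples above.

```python
def injectargs_command(target, options):
    """Build the 'ceph tell ... injectargs' argument list for the given options."""
    args = ["--{}={}".format(name, value) for name, value in options.items()]
    return ["ceph", "tell", target, "injectargs", "--"] + args


def conf_lines(section, options):
    """Build the matching ceph.conf lines so the options survive a restart."""
    lines = ["[{}]".format(section)]
    lines += ["{} = {}".format(name, value) for name, value in options.items()]
    return lines


options = {"osd_recovery_threads": 2, "osd_disk_threads": 2}

print(" ".join(injectargs_command("osd.*", options)))
print("\n".join(conf_lines("osd", options)))
```

If you run the generated command from a shell, quote the `osd.*` target so that the shell does not try to expand it.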


Show diff between current and default configurations

Use the command below to dump the difference between the current and default configurations.

NOTE: The command below works with Ceph Hammer and later releases.

$ ceph daemon <type.id> config diff  // dump diff of current and default config

For ex:
$ ceph daemon osd.0 config diff | more
    "diff": {
        "current": {
            "auth_client_required": "cephx",
            "auth_supported": "cephx",
            "cluster_addr": "\/0",
            "cluster_network": "\/24",
            "fsid": "8ff78c2f-19ed-42ef-9f8a-8d046c66e3ec",
            "internal_safe_to_start_threads": "true",
            "keyring": "\/var\/lib\/ceph\/osd\/ceph-60\/keyring",
            "leveldb_log": "",
            "log_file": "\/var\/log\/ceph\/radosgw.log",
            "log_to_stderr": "false",
            "log_to_syslog": "true",
            "mon_host": "",
            "mon_initial_members": "ceph-1 ceph-2 ceph-3",
            "mon_osd_down_out_interval": "1800",
            "mon_pg_warn_max_object_skew": "20",
            "osd_backfill_scan_max": "32",
            "osd_backfill_scan_min": "16",
            "osd_deep_scrub_interval": "3600",
            "osd_journal_size": "10240",
            "osd_max_backfills": "8",
            "osd_pool_default_min_size": "1",
            "osd_pool_default_pg_num": "4096",
            "osd_pool_default_pgp_num": "4096",
            "osd_recovery_max_active": "10",
            "osd_recovery_op_priority": "5",
            "public_addr": "\/0",
            "public_network": "\/24"
        "defaults": {
            "auth_client_required": "cephx, none",
            "auth_supported": "",
            "cluster_addr": ":\/0",
            "cluster_network": "",
            "fsid": "00000000-0000-0000-0000-000000000000",
            "internal_safe_to_start_threads": "false",
            "keyring": "\/etc\/ceph\/ceph.osd.0.keyring,\/etc\/ceph\/ceph.keyring,\/etc\/ceph\/keyring,\/etc\/ceph\/keyring.bin",
            "leveldb_log": "\/dev\/null",
            "log_file": "\/var\/log\/ceph\/ceph-osd.0.log",
            "log_to_stderr": "true",
            "log_to_syslog": "false",
            "mon_host": "",
            "mon_initial_members": "",
            "mon_osd_down_out_interval": "300",
            "mon_pg_warn_max_object_skew": "10",
            "osd_backfill_scan_max": "512",
            "osd_backfill_scan_min": "64",
            "osd_deep_scrub_interval": "604800",
            "osd_journal_size": "5120",
            "osd_max_backfills": "10",
            "osd_pool_default_min_size": "0",
            "osd_pool_default_pg_num": "8",
            "osd_pool_default_pgp_num": "8",
            "osd_recovery_max_active": "15",
            "osd_recovery_op_priority": "10",
            "public_addr": ":\/0",
            "public_network": ""
    "unknown": []