Ceph – Difference between erasure-coded and replicated pool types

Here I have collected information from multiple sources (blogs and the Ceph docs) to make the replicated and erasure-coded pool types easy to understand.

A Ceph pool is associated with a type that determines how it sustains the loss of an OSD (i.e. a disk, since most of the time there is one OSD per disk). The default choice when creating a pool is replicated, meaning every object is copied onto multiple disks. The erasure-coded pool type can be used instead to save space.

We can set how many OSDs are allowed to fail without losing data. The pool type may be either replicated, which recovers from lost OSDs by keeping multiple copies of each object, or erasure, which provides a kind of generalized RAID5 capability.

Replicated pools

  1. The pool's size is the desired number of copies/replicas of an object.
  2. A typical configuration stores an object and one additional copy (i.e., size = 2).
  3. Replicated pools require more raw storage, but implement all Ceph operations.
  4. The default ruleset is specified by the osd pool default crush replicated ruleset config variable.
  5. The replicated pool CRUSH ruleset targets faster hardware to provide better response times.
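As a quick sketch with the standard ceph CLI: the pool name mypool and the PG count of 128 are hypothetical placeholders, and the commands need a running cluster, so they are guarded.

```shell
# Create a replicated pool and set its replica count. Names and PG counts
# are placeholders; adjust them for your cluster.
size=3
if command -v ceph >/dev/null 2>&1; then
  sudo ceph osd pool create mypool 128 128 replicated
  sudo ceph osd pool set mypool size "$size"      # keep 3 copies of every object
  sudo ceph osd pool set mypool min_size 2        # still serve I/O with one copy down
fi
echo "With size=$size, each object consumes ${size}x its logical size in raw storage."
```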

Calculating the usable storage of a replicated pool is easy: just divide the amount of raw space you have by the “size” (number of replicas) of the storage pool.

For example, let’s work with some rough numbers: 25 OSDs of 4TB each.

Raw size: 25*4  = 100TB
Size 2  : 100/2  = 50TB
Size 3  : 100/3  = 33.33TB
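The same arithmetic as a small shell sketch, using the numbers from the example above:

```shell
# Usable capacity of a replicated pool: raw space divided by the replica count.
osds=25; tb_per_osd=4
raw=$((osds * tb_per_osd))                                    # 100TB raw
size2=$((raw / 2))                                            # 50TB usable at size=2
size3=$(awk -v raw="$raw" 'BEGIN { printf "%.2f", raw / 3 }') # 33.33TB usable at size=3
echo "Raw: ${raw}TB, size=2: ${size2}TB, size=3: ${size3}TB"
```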

Replicated pools are expensive in terms of storage.

Erasure-coded pools

Erasure coding (EC) is a method of data protection in which data is broken into fragments, encoded, and then stored in a distributed manner. Erasure coding uses a mathematical equation to achieve data protection. The entire concept revolves around the following equation:

n = k + m, where:
k = the number of chunks the original data is divided into.
m = the number of extra coding chunks added to the original data chunks to provide data protection.
n = the total number of chunks created after the erasure coding process.
  1. m is the number of coding chunks (e.g. m=2 in the erasure-coded profile).
  2. Erasure-coded pools require less raw storage but implement only a subset of the available operations (for instance, partial writes are not supported).
  3. The default ruleset is named erasure-code if the default erasure-code profile is used, or {pool-name} otherwise.
  4. This ruleset will be created implicitly if it doesn’t already exist.
  5. Erasure coding has higher computational requirements.
  6. The erasure-coded pool CRUSH ruleset targets hardware designed for cold storage, with high latency and slow access times.
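A minimal sketch of creating such a pool with the ceph CLI. The profile name myprofile and pool name ecpool are hypothetical, and the commands require a running cluster, so they are guarded.

```shell
# Define an erasure-code profile (k data chunks, m coding chunks) and a pool using it.
# Profile/pool names and the PG count are placeholders.
k=3; m=2
if command -v ceph >/dev/null 2>&1; then
  sudo ceph osd erasure-code-profile set myprofile k=$k m=$m
  sudo ceph osd pool create ecpool 128 128 erasure myprofile
fi
echo "k=$k m=$m: each object is spread over $((k + m)) OSDs and survives $m failures."
```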
Recovery: To recover, we need any k chunks out of the n chunks, so we can tolerate the failure of any m chunks.
Reliability level: We can tolerate the failure of up to m chunks.
Encoding rate (r): r = k / n, where r < 1.
Storage required: 1 / r times the original data.
Example: n = 5, k = 3 and m = 2 (m = n - k)
So 2 coding chunks are added to 3 data chunks to form 5 total chunks, which are stored in a distributed manner. In the event of a failure, we need any 3 of these 5 chunks to reconstruct the original file. Hence, in this example we can tolerate the loss of any 2 chunks.
Encoding rate (r) = 3 / 5 = 0.6 < 1
Storage required = 1 / 0.6 ≈ 1.67 times the original file.

If the original file size is 1GB, then to store this file in a Ceph cluster on an erasure-coded (k=3, m=2) pool, you would need about 1.67GB of raw storage space.
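The storage maths for this (k=3, m=2) example can be checked with a couple of lines of shell:

```shell
# Encoding rate and raw-storage multiplier for an erasure-coded pool.
k=3; m=2
n=$((k + m))
r=$(awk -v k="$k" -v n="$n" 'BEGIN { printf "%.1f", k / n }')      # encoding rate
mult=$(awk -v k="$k" -v n="$n" 'BEGIN { printf "%.2f", n / k }')   # 1 / r
echo "r = $r; a 1GB object consumes ${mult}GB of raw storage"
```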

Use cases:
1. Cold storage

An erasure-coded pool is created to store a large number of 1GB objects (imaging, genomics, etc.), and 10% of them are read per month. New objects are added every day, and the objects are not modified after being written. On average there is one write for every 10,000 reads.

A replicated pool is created and set as a cache tier for the erasure coded pool. An agent demotes objects (i.e. moves them from the replicated pool to the erasure-coded pool) if they have not been accessed in a week.


2. Multi-datacenter storage

Ten datacenters are connected with dedicated network links. Each datacenter contains the same amount of storage, with no power-supply backup and no air-cooling system.

An erasure-coded pool is created with a CRUSH map ruleset that ensures no data loss even if three datacenters fail simultaneously. The overhead is 50%, with erasure coding configured to split data into six chunks (k=6) and create three coding chunks (m=3). With replication the overhead would be 400% (four replicas).
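The overhead figures quoted above can be reproduced directly:

```shell
# Raw-storage overhead of k=6/m=3 erasure coding versus four-way replication.
k=6; m=3
overhead_pct=$((m * 100 / k))   # coding chunks as a percentage of data chunks
mult=$(awk -v k="$k" -v m="$m" 'BEGIN { printf "%.1f", (k + m) / k }')
echo "EC k=$k m=$m: ${overhead_pct}% overhead, ${mult}TB raw per TB of data"
echo "Replication with four replicas: 4.0TB raw per TB of data (400%)"
```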


Ceph: CRUSH map for rack awareness

Here I will discuss Ceph’s default CRUSH map: how to get it, decompile it, edit it for rack awareness, and validate it before actually applying it to the Ceph cluster.

Get the current CRUSH map:

sudo ceph osd getcrushmap -o crushmap

Decompile the current CRUSH map:

sudo crushtool  -d crushmap -o crushmap.txt

The decompiled CRUSH map looks like this:

cat crushmap.txt

# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1

# devices
device 0 osd.0
device 1 osd.1
device 2 osd.2

# types
type 0 osd
type 1 host
type 2 chassis
type 3 rack
type 4 row
type 5 pdu
type 6 pod
type 7 room
type 8 datacenter
type 9 region
type 10 root

# buckets
host cephtest {
	id -2		# do not change unnecessarily
	# weight 0.180
	alg straw
	hash 0	# rjenkins1
	item osd.0 weight 0.060
	item osd.1 weight 0.060
	item osd.2 weight 0.060
}
root rack {
	id -1		# do not change unnecessarily
	# weight 0.180
	alg straw
	hash 0	# rjenkins1
	item cephtest weight 0.180
}

# rules
rule replicated_ruleset {
	ruleset 0
	type replicated
	min_size 1
	max_size 10
	step take rack
	step choose firstn 0 type osd
	step emit
}

# end crush map

How to edit the CRUSH map and validate it using crushtool

Note: Applying a new CRUSH map triggers rebalancing of data, which can have a lengthy performance impact depending on the amount of data in the cluster.
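Putting the workflow together as a sketch: compile the edited text map and simulate placements locally before injecting it. crushtool works on local files and does not need a running cluster, but the commands are guarded in case the tool or the file is absent; the final setcrushmap step is left commented out because it changes the live cluster.

```shell
# Compile the edited text map back to binary and test it locally.
status="skipped (crushtool or crushmap.txt not present)"
if command -v crushtool >/dev/null 2>&1 && [ -f crushmap.txt ]; then
  crushtool -c crushmap.txt -o crushmap.new                 # compile
  crushtool --test -i crushmap.new --rule 0 --num-rep 2 \
            --show-statistics                               # simulate placements
  status="validated"
fi
echo "crushmap.new: $status"
# Apply only after validation (this triggers rebalancing!):
#   sudo ceph osd setcrushmap -i crushmap.new
```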