A Hands-on Guide to Ceph


Introduction

This guide is designed to be used as a self-training course covering ceph. The first part is a gentle introduction to ceph and will serve as a primer before tackling more advanced concepts which are covered in the latter part of the document.

The course is aimed at engineers and administrators who want to become familiar with ceph quickly. If difficulties are encountered at any stage, the ceph documentation should be consulted, as ceph is constantly updated and the content here is not guaranteed to apply to future releases.

Pre-requisites

  • Familiarity with Unix-like Operating Systems
  • Networking basics
  • Storage basics
  • Laptop with 8GB of RAM for Virtual machines or 4 physical nodes

Objectives:

At the end of the training session the attendee should be able to:

  • Describe ceph technology and basic concepts
    • Understand the roles played by Client, Monitor, OSD and MDS nodes
  • Build and deploy a small scale ceph cluster
  • Understand customer use cases
  • Understand ceph networking concepts as they relate to public and private networks
  • Perform basic troubleshooting

Pre course activities

Activities

  • Prepare a Linux environment for ceph deployment
  • Build a basic 4/5 node ceph cluster in a Linux environment using physical or virtualized servers
  • Install ceph using the ceph-deploy utility
  • Configure admin, monitor and OSD nodes
  • Create replicated and Erasure coded pools
    • Describe how to change the default replication factor
    • Create erasure coded profiles
  • Perform basic benchmark testing
  • Configure object storage and use PUT and GET commands
  • Configure block storage, mount and copy files, create snapshots, set up an iscsi target
  • Investigate OSD to PG mapping
  • Examine CRUSH maps

About this guide

The training course covers the pre-installation steps for deployment on Ubuntu 14.04 and CentOS 7. There are some slight differences in the repository configuration between Debian and RHEL based distributions, as well as some settings in the sudoers file. Ceph can of course also be deployed on Red Hat Enterprise Linux.

Disclaimer

Other versions of the Operating System and the ceph release may require different installation steps (and commands) from those contained in this document. The intent of this guide is to provide instruction on how to deploy and gain familiarity with a basic ceph cluster. The examples shown here are for demonstration/tutorial purposes and do not necessarily constitute the best practices that would be employed in a production environment. The information contained herein is distributed with the best intent and although care has been taken, there is no guarantee that the document is error free. Official documentation should always be used instead when architecting an actual working deployment, and due diligence should be employed.

Getting to know ceph – a brief introduction

This section is mainly taken from ceph.com/docs/master which should be used as the definitive reference.

Ceph is a distributed storage system supporting block, object and file based storage. It consists of MON nodes, OSD nodes and optionally an MDS node. The MON nodes monitor the cluster, and there are normally multiple monitor nodes to prevent a single point of failure. The OSD nodes house the ceph Object Storage Daemons, which is where the user data is held. The MDS node is the Metadata Server node and is only used for file based storage; it is not necessary if only block and object storage are needed. The diagram below is taken from the ceph web site and shows that all nodes have access to a front end Public network; optionally there is a backend Cluster network which is only used by the OSD nodes. The cluster network takes replication traffic away from the front end network and may improve performance. By default a backend cluster network is not created and needs to be manually configured in ceph's configuration file (ceph.conf). The ceph clients are also shown connected to the public network.

The Client nodes know about monitors, OSDs and MDS’s but have no knowledge of object locations. Ceph clients communicate directly with the OSDs rather than going through a dedicated server.

The OSDs (Object Storage Daemons) store the data. They can be up and in the map, or can be down and out if they have failed. An OSD can be down but still in the map, which means that the PG has not yet been remapped. When OSDs come online they inform the monitor.

The Monitors store a master copy of the cluster map.

Ceph features Synchronous replication – strong consistency.

The architectural model of ceph is shown below.

RADOS stands for Reliable Autonomic Distributed Object Store and it makes up the heart of the scalable object storage service.

In addition to accessing RADOS via the defined interfaces, it is also possible to access RADOS directly via a set of library calls as shown above.

Ceph Replication

By default three copies of the data are kept, although this can be changed!

Ceph can also use Erasure Coding. With Erasure Coding, objects are stored in k+m chunks, where k = the number of data chunks and m = the number of recovery or coding chunks.

Example k=7, m= 2 would use 9 OSDs – 7 for data storage and 2 for recovery

Pools are created with an appropriate replication scheme.

CRUSH (Controlled Replication Under Scalable Hashing)

The CRUSH map knows the topology of the system and is location aware. Objects are mapped to Placement Groups and Placement Groups are mapped to OSDs. It allows dynamic rebalancing and controls which Placement Group holds the objects and which of the OSDs should hold the Placement Group. A CRUSH map holds a list of OSDs, buckets and rules that hold replication directives. CRUSH will try not to shuffle too much data during rebalancing whereas a true hash function would be likely to cause greater data movement

The CRUSH map allows for different resiliency models such as:

  • #0 for a 1-node cluster.
  • #1 for a multi node cluster in a single rack
  • #2 for a multi node, multi chassis cluster with multiple hosts in a chassis
  • #3 for a multi node cluster with hosts across racks, etc.

The level is selected with the setting:

osd crush chooseleaf type = {n}

Buckets

Buckets are a hierarchical structure of storage locations; a bucket in the CRUSH map context is a location. The Bucket Type structure contains

  • id #unique negative integer
  • weight # Relative capacity (but could also reflect other values)
  • alg #Placement algorithm
    • Uniform # Use when all devices have equal weights
    • List # Good for expanding clusters
    • Tree # Similar to list but better for larger sets
    • Straw # Default allows fair competition between devices.
  • Hash #Hash algorithm 0 = rjenkins1

An extract from a ceph CRUSH map is shown following:
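
The screenshot of the extract is not reproduced here; a representative fragment, matching the full decompiled map shown later in the Advanced Topics section, looks like this:

host osdserver0 {
    id -2        # do not change unnecessarily
    alg straw
    hash 0    # rjenkins1
    item osd.0 weight 0.010
}

rule replicated_ruleset {
    ruleset 0
    type replicated
    min_size 1
    max_size 10
    step take default
    step chooseleaf firstn 0 type host
    step emit
}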


An example of a small deployment using racks, servers and host buckets is shown below.

Placement Groups

Objects are mapped to Placement Groups by hashing the object's name and applying a bitmask derived from the number of Placement Groups in the pool.

The PG count is calculated as: Total PGs = (number of OSDs x 100) / replication factor (or the k+m sum for an erasure coded pool), rounded up to a power of two.


Enterprise or Community Editions?

Ceph is available as a community or Enterprise edition. The latest version of the Enterprise edition as of mid-2015 is ICE1.3. This is fully supported by Red Hat with professional services and it features enhanced monitoring tools such as Calamari. This guide covers the community edition.

Installation of the base Operating System

Download either the CentOS or the Ubuntu server iso images. Install 4 instances (or more, if resources allow additional OSD nodes) of Ubuntu or CentOS based Virtual Machines (these can of course be physical machines if they are available), according to the configuration below:

Hostname     Role                    NIC1   NIC2            RAM    HDD
monserver0   Monitor, Mgmt, Client   DHCP   192.168.10.10   1 GB   1 x 20GB Thin Provisioned
osdserver0   OSD                     DHCP   192.168.10.20   1 GB   2 x 20GB Thin Provisioned
osdserver1   OSD                     DHCP   192.168.10.30   1 GB   2 x 20GB Thin Provisioned
osdserver2   OSD                     DHCP   192.168.10.40   1 GB   1 x 20GB Thin Provisioned
osdserver3   OSD                     DHCP   192.168.10.50   1 GB   1 x 20GB Thin Provisioned

If more OSD server nodes can be made available, add them according to the table above.

VirtualBox Network Settings

For all nodes – set the first NIC as NAT, this will be used for external access.

Set the second NIC as a Host Only Adapter, this will be set up for cluster access and will be configured with a static IP.

VirtualBox Storage Settings

OSD Nodes

For the OSD nodes – allocate a second 20 GB Thin provisioned Virtual Disk which will be used as an OSD device for that particular node. At this point do not add any extra disks to the monitor node.

Mount the ISO image as a virtual boot device. This can be the downloaded Centos or Ubuntu iso image

Enabling shared clipboard support

  • Set General → Advanced → Shared Clipboard to Bidirectional
  • Set General → Advanced → Drag'n'Drop to Bidirectional

Close settings and start the Virtual Machine. Select the first NIC as the primary interface (since this has been configured for NAT in VirtualBox). Enter the hostname as shown.

Select a username for ceph deployment.

Select the disk

Accept the partitioning scheme

Select OpenSSH server

Respond to the remaining prompts and ensure that the login screen is reached successfully.

The installation steps for Centos are not shown but it is suggested that the server option is used at the software selection screen if CentOS is used.

Installing a GUI on an Ubuntu server

This section is purely optional but it may facilitate monitoring ceph activity later on. In this training session administration will be performed from the monitor node. In most instances the monitor node will be distinct from a dedicated administration or management node. Due to the limited resources available in most of the examples shown here, the monserver0 node will function as the MON node, an admin/management node and as a client node, as shown in the configuration table at the start of the installation section.

If you decide to deploy a GUI after an Ubuntu installation, select the Desktop Manager of your choice using one of the commands below; the third option is more lightweight than the other two larger deployments.

  • sudo apt-get install ubuntu-desktop
  • sudo apt-get install ubuntu-gnome-desktop
  • sudo apt-get install xorg gnome-core gnome-system-tools gnome-app-install

Reboot the node.

sudo reboot

Installing a GUI on CentOS 7

A GUI can also be installed on CentOS machines by issuing the command:

sudo yum groupinstall "GNOME Desktop"

The GUI can be started with the command

startx

Then to make this the default environment:

systemctl set-default graphical.target

VirtualBox Guest Additions

To increase screen resolution go to the VirtualBox main menu and select Devices → Install Guest Additions CD Image


Select <OK> and reboot.

Preparation – pre-deployment tasks

Configure NICs on Ubuntu

Edit the file /etc/network/interfaces according to the table below:

hostname NIC1 NIC2
monserver0 DHCP 192.168.10.10
osdserver0 DHCP 192.168.10.20
osdserver1 DHCP 192.168.10.30
osdserver2 DHCP 192.168.10.40
osdserver3 DHCP 192.168.10.50

The screenshot shows the network settings for the monitor node; use it as a template to configure nic1 and nic2 on the osd nodes.

Bring up eth1 and restart the network.

Verify the IP address.

Configure NICs on CentOS

Ensure NetworkManager is not running and disabled.

Or use the newer command systemctl disable NetworkManager

Then edit the appropriate interface file in /etc/sysconfig/network-scripts (e.g. vi ifcfg-enp0s3), setting the static IPs according to the table shown at the beginning of this section.

Edit /etc/hosts on the monitor node.

Setting up ssh

If this option was not selected at installation time, install openssh-server on all nodes. For Ubuntu enter:

sudo apt-get install openssh-server

For CentOS use:

sudo yum install openssh-server

Next, from the monitor node, push the hosts file out to the osd servers.
scp /etc/hosts osdserver0:/home/cephuser
scp /etc/hosts osdserver1:/home/cephuser
scp /etc/hosts osdserver2:/home/cephuser


Now copy the hosts file to /etc/hosts on each of the osd nodes

sudo cp ~/hosts /etc/hosts


Disabling the firewall

Note: Turning off the firewall is obviously not an option for production environments but is acceptable for the purposes of this tutorial. The official documentation can be consulted with regards to port configuration if the implementer does not want to disable the firewall. In general the exercises used here should not require disabling the firewall.

Disabling the firewall on Ubuntu

sudo ufw disable

Disabling the firewall on CentOS

systemctl stop firewalld

systemctl disable firewalld

Configuring sudo

Do the following on all nodes:

If the user cephuser has not already been chosen at installation time, create this user and set a password.

sudo useradd -d /home/cephuser -m cephuser

sudo passwd cephuser

Next set up the sudo permissions

echo "cephuser ALL = (root) NOPASSWD:ALL" | sudo tee /etc/sudoers.d/cephuser

sudo chmod 0440 /etc/sudoers.d/cephuser


Repeat on osdserver0, osdserver1, osdserver2

CentOS: disabling requiretty

For CentOS only, on each node disable requiretty for the user cephuser by issuing the sudo visudo command and adding the line Defaults:cephuser !requiretty to the Defaults section of the sudoers file, underneath the existing Defaults requiretty line.

Note: If an error message similar to that shown below occurs, double check the sudoers settings described above.

Setting up passwordless login

The ceph-deploy tool requires passwordless login with a non-root account; this can be achieved by performing the following steps:

On the monitor node enter the ssh-keygen command.

Now copy the key from monserver0 to each of the OSD nodes in turn.

ssh-copy-id cephuser@osdserver0


Repeat for the other two osd nodes.

Finally edit ~/.ssh/config for the user and hostnames as shown.
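
The screenshot is not reproduced here; a minimal sketch of what ~/.ssh/config might contain (hostnames and user taken from this guide's setup) is:

Host osdserver0
    Hostname osdserver0
    User cephuser
Host osdserver1
    Hostname osdserver1
    User cephuser
Host osdserver2
    Hostname osdserver2
    User cephuser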

And change the permissions

chmod 600 ~/.ssh/config

Configuring the ceph repositories on Ubuntu

On the monitor node, create a directory for ceph administration under the cephuser home directory. This will be used as the working directory for deploying and administering the cluster.
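
For example (cephcluster is the directory name used throughout the rest of this guide):

mkdir ~/cephcluster
cd ~/cephcluster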

On monserver0 node enter:

wget -q -O- 'https://download.ceph.com/keys/release.asc' | sudo apt-key add -

For the hammer release of ceph enter:

echo deb http://ceph.com/debian-hammer/ $(lsb_release -sc) main | sudo tee /etc/apt/sources.list.d/ceph.list


The operation can be verified by printing out /etc/apt/sources.list.d/ceph.list.

Configuring the ceph repositories on CentOS

As user cephuser, enter the ~/cephcluster directory and edit the file /etc/yum.repos.d/ceph.repo with the content shown below.

Note: The version of ceph and O/S used here is “hammer” and “el7”, this would change if a different distribution is used, (el6 and el7 for Centos V6 and 7, rhel6 and rhel7 for Red Hat® Enterprise Linux® 6 and 7, fc19, fc20 for Fedora® 19 and 20)

[ceph-noarch]
name=Ceph noarch packages
baseurl=http://download.ceph.com/rpm-{ceph-release}/{distro}/noarch
enabled=1
gpgcheck=1
type=rpm-md
gpgkey=https://download.ceph.com/keys/release.asc

For the jewel release:

[ceph-noarch]
name=Ceph noarch packages
baseurl=http://download.ceph.com/rpm-jewel/el7/noarch
enabled=1
gpgcheck=1
type=rpm-md
gpgkey=https://download.ceph.com/keys/release.asc

Installing and configuring ceph

Ceph will be deployed using ceph-deploy. Other tools are widespread but will not be used here. First install the deploy tool on the monitor node.

For Ubuntu

sudo apt-get update && sudo apt-get install ceph-deploy (as shown in the screen shot)

For CentOS use

sudo yum update && sudo yum install ceph-deploy


From the directory ~/cephcluster

Setup the monitor node.

The format of this command is ceph-deploy new <monitor1> <monitor2> . . . <monitorN>.

Note: production environments will typically have a minimum of three monitor nodes to prevent a single point of failure.

ceph-deploy new monserver0

Examine ceph.conf

There are a number of configuration sections within ceph.conf. These are described in the ceph documentation (ceph.com/docs/master).

Note the file ceph.conf is hugely important in ceph. This file holds the configuration details of the cluster. It will be discussed in more detail during the course of the tutorial. This is also the time to make any changes to the configuration file before it is pushed out to the other nodes. One option that could be used within a training guide such as this would be to lower the replication factor as shown following:

Changing the replication factor in ceph.conf

The following options can be used to change the replication factor:

osd pool default size = 2
osd pool default min size = 1

In this case the default replication size is 2 and the system will run as long as one of the OSDs is up.

Changing the default leaf resiliency as a global setting

By default ceph will try to replicate to OSDs on different servers. For test purposes, however, only one OSD server might be available. It is possible to configure ceph.conf to replicate to OSDs within a single server. The chooseleaf setting in ceph.conf is used for specifying these different levels of resiliency; in the example following, a single server ceph cluster can be built using a leaf setting of 0. Some of the other chooseleaf settings are shown below:

  • #0 for a 1-node cluster.
  • #1 for a multi node cluster in a single rack
  • #2 for a multi node, multi chassis cluster with multiple hosts in a chassis
  • #3 for a multi node cluster with hosts across racks, etc.

The format of the setting is:

osd crush chooseleaf type = n

Using this setting in ceph.conf will allow a cluster to reach an active+clean state with only one OSD node.
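
A minimal sketch of how these settings might appear in the [global] section of ceph.conf (values illustrative) is:

[global]
osd pool default size = 2
osd pool default min size = 1
osd crush chooseleaf type = 0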

Install ceph on all nodes

ceph-deploy install monserver0 osdserver0 osdserver1 osdserver2

Note at the time of writing a bug has been reported with CentOS7 deployments which can result in an error message stating “RuntimeError: NoSectionError No section: `ceph'”. If this is encountered use the following workaround:

sudo mv /etc/yum.repos.d/ceph.repo /etc/yum.repos.d/ceph-deploy.repo

Note always verify the version as there have been instances where the wrong version of ceph-deploy has pulled in an earlier version!

Once this step has completed, the next stage is to set up the monitor(s)

Note make sure that you are in the directory where the ceph.conf file is located (cephcluster in this example).

Enable the monitor(s)

This section assumes that you are running the monitor on the same node as the management station as described in the setup. If you are using a dedicated management node that does not house the monitor then pay particular attention to the note regarding keyrings later in this section.

ceph-deploy mon create-initial


The next stage is to change the permissions on /etc/ceph/ceph.client.admin.keyring.
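
This can be done with a command of the form:

sudo chmod +r /etc/ceph/ceph.client.admin.keyring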


Note: This step is really important as the system will issue a (reasonably) obscure error message when attempting to perform ceph operations, such as that shown in the screen below.

Now check again to see if quorum has been reached during the deployment.
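
One way to check this is with the quorum status command:

ceph quorum_status --format json-pretty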


The status of the ceph cluster can be shown with the ceph -s or ceph health commands.



Note regarding Keyrings

In this example the ceph commands are run from the monitor node; however, if a dedicated management node is deployed, the authentication keys can be gathered from the monitor node once the cluster is up and running (after a successful ceph-deploy mon create-initial has been issued). The format of the command is ceph-deploy gatherkeys <host> . . .



Note: By default when a ceph cluster is first created a single pool (rbd) is created, consisting of 64 placement groups. At this point no OSDs have been created, which is why there is a health error. It is also possible that a message may be issued stating too few PGs, but this can be ignored for now.


Creating OSDs

Ceph OSDs consist of a daemon, a data device (normally a disk drive, but it can also be a directory), and an associated journal which can live on a separate device or co-exist as a separate partition on the data device.

Important commands relating to osd creation are listed below:

ceph-deploy disk list <nodename>
ceph-deploy disk zap <nodename>:<datadiskname>
ceph-deploy osd prepare <nodename>:<datadiskname>:<journaldiskname>
ceph-deploy osd activate <nodename>:<datadiskpartition>[:<journaldiskpartition>]

The picture below shows a small (20GB disk) with co-existing journal and data partitions

The first stage is to view suitable candidates for OSD deployment

ceph-deploy disk list osdserver0


In this example three OSDs will be created. The command will only specify a single device name which will cause the journal to be located on the device as a second partition.

Prior to creating OSDS it may be useful to open a watch window which will show real time progress.

ceph -w

ceph-deploy disk zap osdserver0:sdb

Next prepare the disk.

ceph-deploy osd prepare osdserver0:sdb

. . .

The output of the watch window now shows:

The cluster at this stage is still unhealthy as by default a minimum of three OSDs are required for a healthy pool. The replication factor can be changed in ceph.conf but for now continue to create the other OSDs on nodes osdserver1 and osdserver2.

After a second OSD has been created the watch window shows:

After the third OSD has been created, the pool now has the required degree of resilience and the watch window shows that all pgs are active and clean.

Note: This is typically scripted as shown below, in this example 4 servers are used (osdserver0 osdserver1 osdserver2 osdserver3) with each having 3 disks (sdb, sdc and sdd). The script can easily be adapted to a different configuration.

for node in osdserver0 osdserver1 osdserver2 osdserver3
do
  for drive in sdb sdc sdd
  do
    ceph-deploy disk zap $node:$drive
    ceph-deploy osd prepare $node:$drive
  done
done

ceph -s shows:

Listing the OSDs

The ceph osd tree command shows the osd status

More information about the OSDs can be found with the following script:

for index in $(seq 0 $(($(ceph osd stat | awk '{print $3}') - 1))); do ceph osd find $index; echo; done


Next bring down osdserver2 and add another disk of 20 GB capacity; note the watch window output when the node is down:

Reboot osdserver2 and check the watch window again to show that ceph has recovered.

Create a fourth OSD on the disk that was recently added and again list the OSDs.

Creating a Replicated Pool

Calculating Placement Groups

The first example shows how to create a replicated pool with 200 Placement Groups. Ideally there will be around 100 Placement Groups per OSD. This can be made larger if the pool is expected to grow in the future. The Placement Group count can be calculated according to the formula:

Total PGs = (number of OSDs x 100) / replication factor (or the k+m sum for an erasure coded pool)

This number is then rounded up to the next power of two. So for a configuration with 9 OSDs using three way replication the pg count would be 512. PG counts can be increased but not decreased, so it may be better to start with slightly undersized pg counts and increase them later on. Placement Group count has an effect on data distribution within the cluster and may also have an effect on performance.

See ceph.com/pgcalc for a pg calculator

The example next shows how to create a replicated pool.

ceph osd pool create replicatedpool0 200 200 replicated

The watch window shows the progress of the pool creation and also the pg usage

Object commands

The pool can now be used for object storage. In this case an external gateway infrastructure has not been set up, so operations are somewhat limited; however it is possible to perform some simple tasks via the rados command:

Simple GET and PUT operation

PUT
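
The original screenshot is not reproduced here; a minimal sketch of a PUT (assuming a local file named testfile.txt) is:

rados -p replicatedpool0 put object.1 testfile.txt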

The watch window shows the data being written

The next command shows the object mapping.
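
The command referred to is of the form:

ceph osd map replicatedpool0 object.1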

Now store a second object and show the mapping.

In the first instance object.1 was stored on OSDs 2,1,0 and the second object was stored on OSDs 3,1,0.

The next command shows the objects in the pool.

Pool utilization can be shown with the rados df command. It is recommended that a high degree of free disk space is maintained.

GET

Objects can be retrieved by use of GET.
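
For example (output file name illustrative):

rados -p replicatedpool0 get object.1 ./object.1.retrieved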

Deleting a pool

Delete a pool using the command:

ceph osd pool delete <poolname> <poolname> --yes-i-really-really-mean-it


Note it is instructive to monitor the watch window during a pool delete operation

Benchmarking Pool performance

Ceph includes some basic benchmarking commands. These commands include read and write with the ability to vary the thread count and the block sizes.

The format is:

rados bench -p <poolname> <seconds> write|seq|rand -t <# of threads> -b <blocksize>


Note: To perform read tests it is necessary to have first written data; by default the write benchmark deletes any data it has written, so add the --no-cleanup qualifier.

Now perform a read test by specifying seq (or rand) instead of write. Note: if there is not enough data the read test may finish earlier than the time specified.
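
A sketch of a write test followed by a sequential read test (pool name, duration and thread count illustrative) is:

rados bench -p replicatedpool0 60 write -t 16 --no-cleanup
rados bench -p replicatedpool0 60 seq -t 16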

Creating an erasure coded pool

Erasure coded pools make more efficient use of raw storage capacity. Erasure codes take two parameters known as k and m. The k parameter refers to the data portion and the m parameter is for the recovery portion, so for instance a k of 6 and an m of 2 could tolerate 2 device failures and has a storage efficiency of 6/8 or 75%, in that the user gets to use 75% of the physical storage capacity. In the previous instance with a default replication of 3, the user can only access 1/3 of the total available storage. With a k and m of 20 and 2 respectively they could use roughly 90% of the physical storage.

The next example shows how to create an erasure coded pool, here the parameters used will be k=2 and m=1.
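
The profile can be created with a command of the following form (profile name illustrative):

ceph osd erasure-code-profile set ecprofile_k2m1 k=2 m=1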

This profile can now be used to create an erasure coded pool.
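
For example (pool name and PG count illustrative):

ceph osd pool create ecpool0 128 128 erasure ecprofile_k2m1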

This pool can be treated in a similar manner to the replicated pool as before

Delete the pool

Add another OSD by bringing down the monitor node, adding a 20GB virtual disk, and using it to set up a fifth OSD device.

Next create another pool with k=4 and m=1

Question – The watch window shows the output below – why?


Adding a cluster (replication) network

The network can be configured so that the OSDs communicate over a back end private network, which in this ceph.conf example is the 192.168.50.0 network, designated the Cluster network. The OSD nodes are the only nodes that will have access to this network. All other nodes will continue to communicate over the public network (172.27.50.0).

Now create a fresh ceph cluster using the previous instructions. Once the mgmt node has been created, edit the ceph.conf file in ~/testcluster and then push it out to the other nodes. The relevant portion of the edited ceph.conf file is shown following:
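
A minimal sketch of the relevant lines (netmasks assumed to be /24) is:

[global]
public network = 172.27.50.0/24
cluster network = 192.168.50.0/24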


Suggested activity – As an exercise configure VirtualBox to add extra networks to the OSD nodes and configure them as a cluster network.

Changing the debug levels

Debug levels can be increased on the fly for troubleshooting purposes; the next setting increases the debug level for osd.0 to 20:

ceph tell osd.0 injectargs --debug-osd 20

The output of ceph -w now shows this as well.

Creating a Block Device

Create a pool that will be used to hold the block devices.
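
For example (PG count illustrative; the pool name iscsipool is used in the commands that follow):

ceph osd pool create iscsipool 128 128 replicated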

Next create a block image from the pool specifying the image name, size (in MB) and pool name:

rbd -p iscsipool create myimage --size 10240

List the images in the pool.

Map the device

The next command shows the mapping.

Get information about the image.
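
The screenshots for these steps are not reproduced here; the commands involved are of the form:

rbd -p iscsipool ls
sudo rbd map iscsipool/myimage
rbd showmapped
rbd -p iscsipool info myimage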

Now create a partition on /dev/rbd0 using fdisk or parted

Now list the block devices again.

Create a file system using mkfs or mkfs.ext4

Next create a mount point

sudo mkdir /mnt/rbd0

and mount the device.

sudo mount /dev/rbd0p1 /mnt/rbd0

Test file access.

Creating a snapshot of the image

Prior to taking a snapshot it is recommended to quiesce the filesystem to ensure consistency. This can be done with the fsfreeze command. The format of the command is fsfreeze --freeze|--unfreeze <mountpoint>.

Freezing prevents write access and unfreezing resumes write activity.

Snapshots are read-only, point-in-time images of an rbd image and are fully supported by rbd.

First create a snapshot:

List the snapshots

Listing the snapshots

Next delete all the files in /mnt/rbd0

List the contents of /mnt/rbd0

Next umount /dev/rbd0p1

Now rollback the snapshot

Mount the device again.

And list the contents of /mnt/rbd0 to show that the files have been restored.

Purging snapshots

Snapshots can be deleted individually or completely.

The next example shows how to create and delete an individual snapshot.

All snaps can be removed with the purge command.
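
The screenshots for these snapshot operations are not reproduced here; a sketch of the commands involved (snapshot name illustrative) is:

rbd snap create iscsipool/myimage@snap1
rbd snap ls iscsipool/myimage
rbd snap rollback iscsipool/myimage@snap1
rbd snap rm iscsipool/myimage@snap1
rbd snap purge iscsipool/myimage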

Unmapping an rbd device
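
For example:

sudo rbd unmap /dev/rbd0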

Removing an image
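
For example:

rbd -p iscsipool rm myimage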


Benchmarking a block device

The fio benchmark can be used for testing block devices; fio can be installed with apt-get (sudo apt-get install fio).

FIO Small block testing

fio --filename=/dev/rbdXX --direct=1 --sync=1 --rw=write --bs=4k --numjobs=1 --iodepth=1 --runtime=60 --time_based --group_reporting --name=journal-test

Try increasing the --numjobs parameter to see how performance varies. For large block writes using 4M use the command line below:

fio --filename=/dev/rbdXX --direct=1 --sync=1 --rw=write --bs=4096k --numjobs=1 --iodepth=1 --runtime=60 --time_based --group_reporting --name=data-test

Sample run with 4k blocks using an iodepth of 16

See the fio documentation for more information!

Sample run with 4M blocks using an iodepth of 4

Create an iSCSI target

First install the necessary software on the system that will host the iscsi target. In this example the (overworked) monitor node will be used.
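
The iSCSI Enterprise Target packages are assumed here; on Ubuntu 14.04 the installation would be something like:

sudo apt-get install iscsitarget iscsitarget-dkms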

Edit /etc/default/iscsitarget and set the first line to read ISCSITARGET_ENABLE=true

Restart the service
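
For example (service name assumed):

sudo service iscsitarget restart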

Next create a pool called iscsipool (as before)

Next partition the device

Verify the operation

Now format the new partition

Edit the file /etc/iet/ietd.conf to add a target name to the bottom of the file.
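
A sketch of such an entry (IQN illustrative; the path should point at the rbd partition created above) is:

Target iqn.2016-01.local.ceph:cephiscsitarget
    Lun 0 Path=/dev/rbd0p1,Type=blockio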

Restart the service again

Connecting to the target

In this example a Windows iSCSI initiator will be used to connect to the target. Launch the iSCSI initiator from windows and enter the IP address. Select <Quick Connect>

At this point the target can be treated as a normal Windows disk. Under Disk Management, initialize the disk, create a volume, format it and assign a drive letter to the target.

In this case the label assigned is cephiscsitarget and has a drive letter assignment of E:

Now copy some files to verify operation:

The ceph watch window should show activity

Dissolving a Cluster

ceph-deploy purge <node1> <node2> . . . <noden>

ceph-deploy purgedata <node1> <node2> . . . <noden>

ceph-deploy forgetkeys

Advanced Topics

CRUSH map

CRUSH is used to give clients direct access to OSDs, thus avoiding the requirement for a metadata server or intermediary lookup. The map itself contains a list of the OSDs and describes how they are grouped together. The first stage is to look at a CRUSH map.

First obtain the CRUSH map. The format is ceph osd getcrushmap -o <output file>

This map is in compiled format so before it can be “read” it needs to be decompiled.

Use the crushtool -d switch to decompile.

Now the file is “readable”

# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1
tunable straw_calc_version 1

# devices
device 0 osd.0
device 1 osd.1
device 2 osd.2

# types
type 0 osd
type 1 host
type 2 chassis
type 3 rack
type 4 row
type 5 pdu
type 6 pod
type 7 room
type 8 datacenter
type 9 region
type 10 root

# buckets
host osdserver0 {
    id -2        # do not change unnecessarily
    # weight 0.010
    alg straw
    hash 0    # rjenkins1
    item osd.0 weight 0.010
}
host osdserver1 {
    id -3        # do not change unnecessarily
    # weight 0.010
    alg straw
    hash 0    # rjenkins1
    item osd.1 weight 0.010
}
host osdserver2 {
    id -4        # do not change unnecessarily
    # weight 0.010
    alg straw
    hash 0    # rjenkins1
    item osd.2 weight 0.010
}
root default {
    id -1        # do not change unnecessarily
    # weight 0.030
    alg straw
    hash 0    # rjenkins1
    item osdserver0 weight 0.010
    item osdserver1 weight 0.010
    item osdserver2 weight 0.010
}

# rules
rule replicated_ruleset {
    ruleset 0
    type replicated
    min_size 1
    max_size 10
    step take default
    step chooseleaf firstn 0 type host
    step emit
}
# end crush map

Within the CRUSH map there are different sections.

  • Devices – here the CRUSH map shows three different OSDs.
  • Types – shows the different kinds of buckets, which are aggregations of storage locations such as a rack or a chassis. In this case the buckets are the OSD server hosts.
  • Rules – These define how the buckets are actually selected.

The CRUSH map can be recompiled with

crushtool -c <decompiled crushmapfile> -o <compiled crushmapfile>

and then reinjected by

ceph osd setcrushmap -i <newcompiledcrushmapfile>


Changes can be shown with the command ceph osd crush dump

Latency stats for the OSDs can be shown with the ceph osd perf command:


Individual drive performance can be shown with


A number can be added to specify the number of bytes to be written, the command below writes out 100MB at a rate of 37 MB/s


If an individual drive is suspected of contributing to an overall degradation in performance, all drives can be tested using the wildcard symbol.
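
The screenshots are not reproduced here; the commands referred to are of the form (byte count illustrative):

ceph tell osd.0 bench
ceph tell osd.0 bench 100000000
ceph tell osd.* bench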


More about PGs, Pools and OSDs

Reconciling object, pgs and OSDs

The drawing below (repeated from the introduction) shows the relationship between a pool, objects, Placement Groups and OSDs. The pool houses the objects which are stored in Placement Groups and by default each Placement Group is replicated to three OSDs.


Suggested Activity –

Add more Virtual disks and configure them as OSDs, so that there are a minimum of 6 OSDs. Notice during this operation how the watch window will show backfilling taking place as the cluster is rebalanced.

This may take some time depending on how much data actually exists.

The following screenshot shows a portion of the output from the ceph pg dump command

Note the pg mapping to OSDs – each Placement Group uses the default mapping to three OSDs. In this case there are 6 OSDs to choose from and the system will select three of these six to hold the pg data. Here the two fields that are highlighted list the same OSDs.

Question – How many entries are there for the left hand field (numbers starting with 0.x), and why?

Next create some new pools similar to that shown below:


List the pgs again to show the new pools. Note that the number on the left hand side is of the form x.y, where x = the pool ID and y = the pg ID within the pool.

Now PUT an object into pool replicatedpool_1


It can be seen that the object is located on OSDs 2,1,0. To verify the mapping for this pg use the command:


Run ceph pg dump again and grep for this pg.


Or simply issue the command

ceph pg map 2.6c

As an exercise add in a new OSD and then look to see if any of the mappings have changed.

Other rados file commands

List the contents of a pool:

rados -p <poolname> ls


Copy the contents of a pool to another pool

rados cppool <sourcepoolname> <destinationpoolname>


Reading the CRUSH Map

First get the map which is in binary format

Decompile the CRUSH map

Make a copy

Contents of initial CRUSH map:

If changes are required then edit the decompiled CRUSH map with the new entries

Next compile the CRUSH map

And inject it
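
The full sequence of commands (file names illustrative) is of the form:

ceph osd getcrushmap -o crushmap.bin
crushtool -d crushmap.bin -o crushmap.txt
cp crushmap.txt crushmap.txt.orig
# edit crushmap.txt as required
crushtool -c crushmap.txt -o crushmapnew.bin
ceph osd setcrushmap -i crushmapnew.bin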

Listing the osd tree shows:

Cache Tiering

Cache tiering keeps a subset of the main data in a cache pool. Typically this cache pool consists of fast media and is usually more expensive than regular HDD storage. The following diagram (taken from the ceph documentation) shows the concept.

A cache tiering agent decides when to migrate data between the storage tier and the cache tier. The ceph Objecter handles object placement. The cache can function in Writeback mode where the data is written to the cache tier, which sends an acknowledgement back to the client prior to the data being flushed to the storage tier. If data is fetched from the storage tier it is migrated to the cache tier and then sent to the client.

In Read-only mode the client writes data to the storage tier and during reads the data is copied to the cache tier – here though the data in the cache tier may not be up to date.

In this example it is assumed that a ruleset for ssd devices and a ruleset for hdd devices has been set up. The ssd devices can be used as a cache tier where the ssd pool will be the cache pool and the hdd pool will be used as the storage pool.

Set the cache mode as writeback or readonly

This is logged:

Next set up traffic to go to the cached pool
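
A sketch of the commands involved (pool names ssdpool and hddpool illustrative, writeback mode shown) is:

ceph osd tier add hddpool ssdpool
ceph osd tier cache-mode ssdpool writeback
ceph osd tier set-overlay hddpool ssdpool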

Cache tiering can be used for Object, block or file. Consult the ceph documentation for further granularity on managing cache tiers.


Other useful commands




Take an OSD out of the cluster (e.g. ceph osd out osd.4); its data will be re-allocated


The OSD can be brought back in with ceph osd in osd.4

Reweighting OSDs

If an OSD is heavily utilized it can be reweighted. By default the threshold is 120% of the average OSD utilization; in the example below the system will reweight OSDs that are above 140% of the average utilization.
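
The command referred to is of the form:

ceph osd reweight-by-utilization 140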

More on CRUSH rules

The next setting is used for different levels of resiliency

The format of the setting is:

osd crush chooseleaf type = n

It is also possible to create single pools using these rulesets

In this example a pool will be created on a single server (osdserver2). The command to create this rule is shown below; the format is ceph osd crush rule create-simple <rulename> <bucket> osd.
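
For example (the rule name singleserverrule is the one referred to below):

ceph osd crush rule create-simple singleserverrule osdserver2 osd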

The watch window shows:

The rules can be listed with ceph osd crush rule ls:

Next create a pool with this rule (the rule name is given as the final argument to ceph osd pool create):

More information about the rule can be shown with ceph osd crush rule dump:

A comparison of the default replicated ruleset shows:

Note the difference in type – "osd" versus "host". Here a pool using the replicated ruleset would follow normal rules, but any pool specified using the singleserverrule would not require a total of three servers to achieve a clean state.

Cephfs

As of the jewel community release (planned for mid 2016) cephfs will be considered stable. In the example that follows a cephfs server will be set up on a node named mds.

Installing the Meta Data Server

Install ceph as before, however use the string:

ceph-deploy install --release jewel <node1> <node2> .. <nodeN>

After ceph has been installed with OSDs configured, the steps to install cephfs are as follows:

Creating a Meta Data Server

First create a cephfs server

The format is ceph-deploy mds create <nodename>

ceph-deploy --overwrite-conf mds create mds


Creating the metadata and data pools

Next create two pools for cephfs: a metadata pool and a regular data pool.

ceph osd pool create cephfsdatapool 128 128

ceph osd pool create cephfsmetadatapool 128 128

Creating the cephfs file system

Now create the file system:

ceph fs new <file system name> <metadatapool> <datapool>

ceph fs new mycephfs cephfsmetadatapool cephfsdatapool

Verify operation

ceph mds stat

ceph fs ls

Mounting the cephfs file system

Make a mount point on the mgmt (172.168.10.10) host which will be used as a client

sudo mkdir /mnt/cephfs

sudo mount -t ceph 172.168.10.10:6789:/ /mnt/cephfs -o name=admin,secret=`ceph-authtool -p ceph.client.admin.keyring`

Next show the mounted device with the mount command

Now test with dd

sudo dd if=/dev/zero of=/mnt/cephfs/cephfsfile bs=4M count=1024

Accessing cephfs from Windows

Installing samba

Samba can be used to access the files. First install it.
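
On a CentOS based node (matching the systemctl commands below) the installation would be something like:

sudo yum install samba samba-client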


Customization can be applied to the file /etc/samba/smb.conf. The heading “Myfiles” shows up as a folder on the Windows machine.
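
A sketch of such a share definition (path assumed to be the cephfs mount point used earlier) is:

[Myfiles]
    path = /mnt/cephfs
    browseable = yes
    writable = yes
    valid users = cephuser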


Enable and start the smb service

# systemctl enable smb

# systemctl start smb

Setup access
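
A typical way to set up access (assuming the cephuser account) is:

sudo smbpasswd -a cephuser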


Next on the windows client access the share by specifying the server’s IP address.



Setting up a ceph object gateway

The mgmt node will be used in this case to host the gateway. First install it:


After installing the gateway software, set up the mgmt node as the gateway.
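
With ceph-deploy the commands involved are of the form:

ceph-deploy install --rgw mgmt
ceph-deploy rgw create mgmt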


From a browser enter http://mgmt:7480 at this point a screen similar to that shown below should appear.


Troubleshooting

ceph states

State          Status                                    Possible cause
Normal         Active + Clean
Degraded       Not able to satisfy replication rules     Should be automatically recoverable, unless not enough OSDs exist or the rulesets cannot be satisfied
Degraded       Recovering                                Recovering from a degraded state
Backfilling    Rebalancing the cluster                   New empty OSD has been added
Incomplete     Unable to satisfy pool min-size rules     May need more OSDs
Inconsistent   Detected error                            Detected during scrub; may need to perform a pg query to find the issue
Down           Data missing, pg unavailable              Need to investigate – pg query, osd status

OSD States

OSDs can be in the cluster or out of the cluster, and can either be up, which is a running state, or down, which means not running. A client request will be serviced using the OSD up set. If an OSD has a problem, or perhaps rebalancing is occurring, then the request is serviced from the OSD acting set. In most cases the up set and the acting set are identical. An OSD can transition from an In to an Out state and also from an Up to a Down state. The ceph osd stat command will list the number of OSDs along with how many are up and in.

Peering

For a Placement Group to reach an Active and Clean state the first OSD in the set (which is the primary) must peer to the secondary and tertiary OSDs to reach a consistent state.

Placement Group states

Placement Groups can be stuck in various states according to the table below:

Stuck state Possible Cause
Inactive Cannot process requests as they are waiting for an OSD with the most up to date data to come in
Unclean Placement Groups hold objects that are not replicated the specified number of times. This is typically seen during pool creation periods
Stale Placement Groups are in an unknown state, usually because their associated OSDs have not reported to the monitor within the mon_osd_report_timeout period.

Placement Groups related commands

If a PG is suspected of having issues, the query command provides a wealth of information. The format is ceph pg <pg id> query.

The OSDs that this particular PG maps to are OSD.5, OSD.0 and OSD.8. To show only the mapping then issue the command ceph pg map <pg id>

To check integrity of a Placement Group issue the command ceph pg scrub <pg id>

Progress can be shown in the (w)atch window

To list all pgs that use a particular OSD as their primary OSD issue the command ceph pg ls-by-primary <osd id>

Unfound Objects

If objects are shown as unfound and it is deemed that they cannot be retrieved then they must be marked as lost. Lost objects can either be deleted or rolled back to a previous version with the revert command. The format is ceph pg <pg id> mark_unfound_lost revert|delete.

To list pgs that are in a particular state use ceph pg dump_stuck inactive|unclean|stale|undersized|degraded --format json

In this example stuck pgs that are in a stale state are listed:

Troubleshooting examples

Issue – OSDs not joining cluster.

The output of ceph osd tree showed only 6 of the available OSDs in the cluster.

The OSDs that were down had been originally created on node osdserver0.

Looking at the devices (sda1 and sdb1) on node osdserver0 showed that they were correctly mounted.

The next stage was to see if the node osdserver0 itself was part of the cluster. Since the OSDs seemed to be mounted OK and had originally been working, it was decided to check the network connections between the OSDs. This configuration used the 192.168.10.0 network for cluster communication so connectivity was tested on this network and the ping failed as shown below.

The next step was to physically log on to node osdserver0 and check the various network interfaces. Issuing an ip addr command showed that the interface which was configured for 192.168.10.20 (osdserver0's ceph cluster IP address) was down.

Prior to restarting the network the NetworkManager service was disabled as this can cause issues.

The service was stopped and disabled and then the network was restarted. The system was now ‘pingable’ and the two OSDs now joined the cluster as shown below.


The Monitor Map

Obtain the monitor map by issuing the command ceph mon getmap -o monmap.bin

This will extract the monitor map into the current directory, naming it monmap.bin. It can be inspected with monmaptool (e.g. monmaptool --print monmap.bin).

See the ceph documentation for further information relating to adding or removing monitor nodes on a running ceph cluster.

Changing the default location on a monitor node

If a device other than the default is used on the monitor node(s), then this location can be specified by following the ceph documentation as shown below:

Generally, we do not recommend changing the default data location. If you modify the default location, we recommend that you make it uniform across ceph Monitors by setting it in the [mon] section of the configuration file.

mon data

Description: The monitor’s data location.
Type: String
Default: /var/lib/ceph/mon/$cluster-$id

Real World Best Practices

The information contained in this section is based on observations and user feedback within a ceph environment. As a product ceph is dynamic and is rapidly evolving with frequent updates and releases. This may mean that some of the issues discussed here may not be applicable to newer releases.

SSD Journaling considerations

The selection of SSD devices is of prime importance when used as journals in ceph. A good discussion is referenced at http://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/. Take care to follow the steps outlined in the procedure including disabling caches where applicable. Sebastien Han’s blog in general provides a wealth of ceph related information.

A suitable fio test script used is listed below:

for pass in {1..20}
do
  echo Pass $pass starting
  fio --filename=/dev/nvme0n1 --direct=1 --sync=1 --rw=write --bs=4k --numjobs=$pass --iodepth=1 --runtime=60 --time_based --group_reporting --name=nvme0n1journaltest
done

The script runs 20 passes incrementing the numjobs setting on each pass. The only other change necessary is to specify the device name. All tests were run on raw devices.

Poor response during recovery and OSD reconfiguration

During recovery periods Ceph has been observed to consume higher amounts of memory than normal and also to ramp up the CPU usage. This problem is more acute when using high capacity storage systems. If this situation is encountered then we recommend adding single OSDs sequentially. In addition the weight can be set to 0 and then gradually increased to give finer granularity during the recovery period.

Backfilling and recovery can also negatively affect client I/O

Related commands are:

ceph tell osd.* injectargs '--osd-max-backfills 1'

ceph tell osd.* injectargs '--osd-max-recovery-threads 1'

ceph tell osd.* injectargs '--osd-recovery-max-active 1'

ceph tell osd.* injectargs '--osd-recovery-op-priority 1'

Node count guidelines for Ceph deployment

The key to Ceph is parallelism. A good rule of thumb is to distribute data across multiple servers. Consider a small system with 4 nodes using 3 X replication; should a complete server fail, the system is now only 75% as capable as it was before the failure. In addition the cluster is doing a lot more work since it has to deal with the recovery process as well as client I/O. Also, if the cluster were 70% full across each of the nodes, then each remaining server would be close to full after the recovery had completed, and in Ceph a near full cluster is NOT a good situation.

For this reason it is strongly discouraged to use small node count deployments in a production environment. If the above situation used high density systems then the large OSD count will exacerbate the situation even more. With any deployment less than 1 PB it is recommended to use small bay count servers such as 12/18 bay storage systems.


Comments and suggestions for future articles welcome!
