Ceph Quick reference Guide – Mimic and Luminous




Ceph Luminous/Mimic Update


 

Ceph Luminous/Mimic Quick Start Guide

Summary

This document outlines a quick start guide using the Ceph Luminous release with CentOS 7.5 (1804). There is also a brief section covering the Mimic release. For Luminous, three physical servers are deployed, with one server (mon160) doubling up as a MON and OSD server. Each system has a single network which also has Internet access. Again – no prizes for performance, as the intent of this guide is to provide a quick recipe for deploying Ceph on smaller systems prior to migrating to full scale production deployments. The Mimic description uses 5 Proxmox-based VMs and focuses mainly on the dashboard, which has changed significantly from the Luminous version.

Note:

One word of warning: during the publication of this document ceph-deploy changed to version 2. There are significant syntactical changes between the two versions, so if you encounter syntax errors ensure that you are using the correct version/syntax. HOWTOs in this series for later Ceph releases (such as Nautilus) will no longer cover versions of ceph-deploy prior to version 2. The inclusion of both versions is somewhat confusing, but the transition occurred during the time of writing and either version could be encountered. In general the Luminous portion uses version 1.5.x of ceph-deploy and the Mimic portion uses ceph-deploy V2.

New Features in Luminous

Some of the new features that are available in Luminous are listed below:

  • BlueStore is now the default storage backend for OSDs (replacing FileStore)
  • New Dashboard introduced for basic cluster monitoring
  • RBD devices can use erasure coded pools
  • Data and Metadata checksumming
  • Compression

IP addresses

Table 1 Typical IPs used

Nodename    IP               Gateway
mon160      192.168.0.160    192.168.0.1
osd170      192.168.0.170    192.168.0.1
osd180      192.168.0.180    192.168.0.1

Configure the IPs according to your system, however static addresses should be used.

Software deployed

  • Ceph Luminous release
  • Ceph Mimic release
  • CentOS 7.5(1804) Operating System

Installation Steps

Install CentOS 7.5. For convenience the installation option of “Server with GUI” was used for mon160 and the other nodes used the minimal installation. During the installation create the password for root and add a user called cephuser without Administrator privileges.

Firewall configuration

For mon160 –

For all OSD nodes –

For the Gateway node(s) (if used)
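As a sketch, assuming firewalld is in use and the default Ceph ports (6789 for the monitor, 6800-7300 for OSDs, 7480 for the RADOS Gateway), the rules for each node type would be along these lines:

On mon160: sudo firewall-cmd --zone=public --add-port=6789/tcp --permanent

On the OSD nodes: sudo firewall-cmd --zone=public --add-port=6800-7300/tcp --permanent

On a gateway node: sudo firewall-cmd --zone=public --add-port=7480/tcp --permanent

On all nodes: sudo firewall-cmd --reload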

Selinux configuration

On all nodes set the mode to “permissive”

vi /etc/sysconfig/selinux
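A minimal sketch – set SELINUX=permissive in /etc/sysconfig/selinux and apply it immediately without a reboot:

sudo setenforce 0

sudo sed -i 's/^SELINUX=enforcing/SELINUX=permissive/' /etc/sysconfig/selinux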

Installing and enabling ntp

. . .
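A typical sequence (assuming the base CentOS repositories) is:

sudo yum install -y ntp ntpdate

sudo systemctl enable ntpd

sudo systemctl start ntpd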

Grant user cephuser sudo privileges

# echo "cephuser ALL = (root) NOPASSWD:ALL" | sudo tee /etc/sudoers.d/cephuser

Now set permissions

# chmod 0440 /etc/sudoers.d/cephuser

Configuring ssh

Change to user cephuser and change to cephuser’s home directory:

Configure passwordless login

As user cephuser generate a key with ssh-keygen

Copy the key to itself (mon160) and the other nodes

ssh-copy-id <nodename>


Modify ~/.ssh/config to allow short hostname access
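A minimal sketch of ~/.ssh/config, assuming the hostnames used in this guide and the cephuser account:

Host mon160
   Hostname mon160
   User cephuser
Host osd170
   Hostname osd170
   User cephuser
Host osd180
   Hostname osd180
   User cephuser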

 

Change permissions

chmod 600 ~/.ssh/config

Ceph Repository

Add the following lines to /etc/yum.repos.d/ceph.repo

[ceph-noarch]

name=Ceph noarch packages

baseurl=https://download.ceph.com/rpm-luminous/el7/noarch

enabled=1

gpgcheck=1

type=rpm-md

gpgkey=https://download.ceph.com/keys/release.asc

 

and then copy the repo to the other nodes
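One way to do this (a sketch, assuming passwordless ssh and sudo are already in place for cephuser) is:

for node in osd170 osd180; do cat /etc/yum.repos.d/ceph.repo | ssh $node "sudo tee /etc/yum.repos.d/ceph.repo"; done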


 

Installing ceph-deploy

Next perform an update and install the ceph-deploy package. Verify the version deployed

. . .
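For example:

sudo yum update -y

sudo yum install -y ceph-deploy

ceph-deploy --version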

Installing ceph

The first stage is to configure the monitor node. If using a simple single network then the format is simply:

ceph-deploy new <hostname>

or if using separate backend and frontend networks then the format is:

ceph-deploy new <hostname> --public-network <xxx.yyy.zzz.0/24> --cluster-network <aaa.bbb.ccc.0/24>

With this guide only one Ceph network is used, so the command is just:

ceph-deploy new mon160

Next install the ceph package on all nodes. Note that running ceph-deploy install --release=luminous <node> can also be used to correct things if an incorrect version was installed.

. . .

Now enable the monitor function

Configure node mon160 as an admin node

Change the ceph.client.admin.keyring permissions and watch the cluster
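These steps correspond to the following commands (shown here as a sketch; the same commands appear in the scripted Mimic installation later in this guide):

ceph-deploy mon create-initial

ceph-deploy admin mon160

sudo chmod +r /etc/ceph/ceph.client.admin.keyring

ceph -w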

Note the format of the watch window has changed significantly from earlier releases.

Deploying a mgr daemon

Note in the output of the watch window it shows that no mgr daemons are active. This is a new feature with Luminous. The node mon160 will be used to host a manager daemon

. . .
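For example:

ceph-deploy mgr create mon160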

Note that the output of ceph -w now shows the daemon active.

Creating OSDs

At this point no OSDs have been created; look at the output of lsblk to show available devices. Pre-existing data can be cleared using the parted utility and devices also can be configured with a GPT label.

Note that large capacity storage servers with an excess of 30 OSDs may not have enough default resources to run; in this case edit /etc/sysctl.conf so that there is an entry fs.aio-max-nr = 1048576. Then use the steps below to verify.
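A sketch of setting and verifying the value:

echo "fs.aio-max-nr = 1048576" | sudo tee -a /etc/sysctl.conf

sudo sysctl -p

cat /proc/sys/fs/aio-max-nr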

Increase fs.aio-max-nr further if need be!

In the example below device sdb will be used as the first OSD device. Next create the OSD device on node mon160

. . .
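With the ceph-deploy 1.5.x syntax used in this part of the guide, creating an OSD on mon160's /dev/sdb would look something like:

ceph-deploy osd create mon160:sdb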

The output of ceph -w now shows:

Instead of using parted ceph can also be used to clear device data.

Create 1 more osd on each of the other nodes, the watch window now shows that three OSDs are in:

Ansible OSD Deployment

(As an aside) Ansible has bluestore support – in the example below there are three bluestore ceph data devices (/dev/sda, /dev/sdb, /dev/sdc) all sharing /dev/sdd for block.db and /dev/sde for block.wal. If bluestore_wal_devices does not appear in the yml file then block.wal will coexist with block.db.

 

osd_scenario: non-collocated

osd_objectstore: bluestore

devices:
  - /dev/sda
  - /dev/sdb
  - /dev/sdc

dedicated_devices:
  - /dev/sdd
  - /dev/sdd
  - /dev/sdd

bluestore_wal_devices:
  - /dev/sde
  - /dev/sde
  - /dev/sde

 

Pool Creation

Verify that the three OSDs are up and create a pool called bluepool.

Perform a quick benchmark to ensure that the pool is working correctly.
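As a sketch (the placement group count of 64 here is an assumption for a small cluster):

ceph osd stat

ceph osd pool create bluepool 64 64

rados bench -p bluepool 10 write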

Pool Association

Looking at the output of ceph -w there is a warning message

This is a new feature with Luminous and is used to associate applications with pools. An example follows:

Create a pool for use by RADOS Block Devices and then associate it with the rbd application
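For example:

ceph osd pool create rbd 64 64

ceph osd pool application enable rbd rbd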

Deleting a pool with Luminous

Set the option "mon_allow_pool_delete = true" in ceph.conf and push it out to the nodes with:

ceph-deploy --overwrite-conf config push mon160 osd170 osd180

After this; pools can be deleted with a command such as:

ceph osd pool delete <poolname> <poolname> --yes-i-really-really-mean-it

Enabling the Dashboard

Luminous supports a basic dashboard plugin module (as well as other manager modules). Enable it by issuing the command below:

ceph mgr module enable dashboard

By default port 7000 is used and since the mgr was deployed on node mon160 (which can be seen from the output of ceph -s) the browser url is http://mon160:7000.

The opening screen shows:

At a glance the summary screen shows that there is one monitor node, three OSDs and two pools configured. Selecting the second icon down on the left and then <servers> shows the Server view

Note that the OSDs on each node are shown as services along with the Monitor service running on node mon160.

Selecting the OSD Tab shows a basic screen along with OSD capacity and performance information.

Next create a block image from the rbd pool

Now under the block icon (third icon down) select <Pools> – <rbd> to show the rbd image properties.

Write some data to each of the pools and then return to the cluster health screen to show the usage by pool information.

Fault conditions show up graphically and textually.

 

It is expected that more features will be added to the dashboard with later releases.

More about BlueStore

BlueStore is now the default backend for OSD devices; the previous default backend was called FileStore.

Background

Currently filesystems do not provide atomic writes and Ceph used the concept of ceph journals to deal with this situation. The journaling method can compromise performance especially when the journal and ceph data are co-located on the same device. POSIX also causes some significant overhead. The figure below shows how a device is partitioned using the co-located journal mechanism.

The journal in this case consumes 5GB of space.

Looking at the BlueStore device (as it was prepared earlier) shows:

The parted utility shows that sda1 uses the xfs filesystem.

Here partition sda1 is a small metadata partition with partition sda2 actually holding the ceph data. This partition (sda2) is actually a raw partition and data is written directly to it.

The Metadata associated with an OSD is stored on a RocksDB database. In addition there is a Write ahead log known as WAL. The WAL can be used as BlueStore’s internal journal. It is possible to break out the database (block.db) and the WAL (block.wal) on different devices similar to the way that the journal was broken out from the actual ceph data. This should only be done if the WAL and DB are provisioned on faster devices than the ceph primary device. Small devices such as NV-DIMM could be used as a WAL device, larger flash devices can be used as DB devices.

There are a number of tuning parameters such as bluestore_cache_size, which are detailed in the ceph documentation.

BlueStore Checksums and Compression

Data checksumming uses a default algorithm of crc32c but others are available. There is an overhead to this and larger blocks can be checksummed, however this may compromise integrity. The checksum algorithm can be set globally or on a per pool basis.

BlueStore supports inline compression using algorithms such as snappy
or
zlib.
There are different compression modes such as:

Table 2 BlueStore Compression types

Compression type   Description
none               Never compress
passive            Do not compress data unless the write operation has a compressible hint set
aggressive         Compress data unless the write operation has an incompressible hint set
force              Try to compress data no matter what

 

There are thresholds to determine if the data should be left uncompressed if it is unable to reach a particular compression threshold ratio. For more information about the compressible and incompressible IO hints, see rados_set_alloc_hint() in the ceph documentation.

The compression settings can be set either via a per-pool property or a global config option. Pool properties can be set with:

ceph osd pool set <pool-name> compression_algorithm <algorithm>

ceph osd pool set <pool-name> compression_mode <mode>

ceph osd pool set <pool-name> compression_required_ratio <ratio>

ceph osd pool set <pool-name> compression_min_blob_size <size>

ceph osd pool set <pool-name> compression_max_blob_size <size>

Configuring OSDs with BlueStore

The single OSD device configuration from a remote node has already been shown earlier. These commands can be performed directly on the actual node housing the devices.

ceph-disk prepare --bluestore <device>

The full format of the command, which can break out each of the components, is:

ceph-disk prepare --bluestore <device> --block.wal <wal device> --block.db <db device>

For example on node osd170 the command below uses device /dev/sda as the main ceph data device and associates the other two components (block.wal and block.db) on /dev/sdc.
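A sketch of that command as run on osd170, using the device names described above:

sudo ceph-disk prepare --bluestore /dev/sda --block.wal /dev/sdc --block.db /dev/sdc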

 


The dashboard now shows the new OSD being brought into the active pools while the re-balancing occurs.

Looking at the OSD screen shows:

The output of ceph osd tree shows:

 

 

 

 

Note after rebooting the GUI screen showed OSD3 as a component of osd170 unlike the GUI screenshot above.

BlueStore WAL and DB space usage

Using parted to look at /dev/sdc on node osd170 which was used for the wal and db components shows:

Looking at /dev/sda
shows:

Deploying BlueStore device from an Admin node

The device can also be deployed from mon160; the format of the command is:

ceph-deploy osd prepare --bluestore --block-db <block.db device> --block-wal <block.wal device> <OSD server hostname>:<ceph device>

For example the command below uses separate devices for the ceph objects but shares partitions on /dev/nvme0n1 for the block.db and block.wal devices.

ceph-deploy osd prepare --bluestore --block-db /dev/nvme0n1 --block-wal /dev/nvme0n1 osd170:/dev/sda

ceph-deploy osd prepare --bluestore --block-db /dev/nvme0n1 --block-wal /dev/nvme0n1 osd170:/dev/sdb

ceph-deploy osd prepare --bluestore --block-db /dev/nvme0n1 --block-wal /dev/nvme0n1 osd170:/dev/sdc

 

Benefits of BlueStore

BlueStore no longer suffers from the double write penalty as the data is written directly to the data partition. It also features data checksumming and compression (disabled by default). There is no filesystem overhead and lastly there is the flexibility of using separate devices for the data, block.wal and block.db. It is important to note though that in a HDD/Flash system the most expensive part of the write is the HDD portion. This does not change in BlueStore as the HDD will still require a full copy of the data.

Ceph-volume

With later releases of Luminous, ceph-deploy has been bumped up to Version 2. In this version ceph-disk has been removed as a backend to create OSDs in favor of ceph-volume.

Using LVM2 with ceph

Ceph-volume can be used to create logical volume based OSD devices. In the following example the devices that are available for OSD deployment for node mon160 are shown below:

sdb 8:16 0 20G 0 disk

sdc 8:32 0 20G 0 disk

nvme0n1 259:0 0 8G 0 disk

 

The first two devices (sdb and sdc) will be used as a logical volume (LV) and /dev/nvme0n1 will be used for journal purposes. Use parted to create a partition on /dev/sdb and /dev/sdc.

sdb 8:16 0 20G 0 disk

└─sdb1 8:17 0 20G 0 part

sdc 8:32 0 20G 0 disk

└─sdc1 8:33 0 20G 0 part

 

Create a volume group

First use pvcreate to create the physical volumes.

Now create a volume group.
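For example, using the volume group name referenced later in this section:

sudo pvcreate /dev/sdb1 /dev/sdc1

sudo vgcreate mon160vg1 /dev/sdb1 /dev/sdc1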

 

Verify

$ sudo vgdisplay

 


 


Create the Logical Volume

The section below specifies 9000 extents (each extent is 4 MiB giving ~ 36 GiB)

Now create the OSD (using Bluestore).

. . .
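A sketch of those two steps, using assumed names consistent with the rest of this section:

sudo lvcreate -l 9000 -n mon160vol1 mon160vg1

sudo ceph-volume lvm create --bluestore --data mon160vg1/mon160vol1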

Verify

Using ceph-volume directly

After deploying ceph run the command

ceph-deploy gatherkeys mon160

Creating Volume Groups and Logical Volumes

Create a logical volume on /dev/nvme0n1 which will be used as the journal. Prepare the OSD with the command below:

Note the UUID of the OSD from the printout and pass it to the activate command.
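As a sketch (the nvmevg volume group and nvmevol1 logical volume names are taken from the commands later in this section; the exact data volume used here is an assumption):

sudo vgcreate nvmevg /dev/nvme0n1

sudo lvcreate -l 1000 -n nvmevol1 nvmevg

sudo ceph-volume lvm prepare --filestore --data mon160vg1/mon160vol1 --journal nvmevg/nvmevol1

sudo ceph-volume lvm activate --filestore <osd id> <osd fsid>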

Verify

Volume Groups can be extended using the vgextend command.

Create another NVME logical volume for a second journal

Create a second Volume group using /dev/sdd and /dev/sde.

Now create a new Logical Volume for the data

Create a second OSD

# ceph-volume lvm prepare --filestore --data mon160vg1/mon160vol2 --journal nvmevg/nvmevol2
Running command: ceph-authtool --gen-print-key
Running command: ceph --cluster ceph --name client.bootstrap-osd --keyring /var/lib/ceph/bootstrap-osd/ceph.keyring -i - osd new 11b9ed7e-c63f-4e0e-ad2e-047820889887
Running command: ceph-authtool --gen-print-key
Running command: mkfs -t xfs -f -i size=2048 /dev/mon160vg1/mon160vol2
stdout: meta-data=/dev/mon160vg1/mon160vol2 isize=2048 agcount=4, agsize=2304000 blks
        =                      sectsz=512 attr=2, projid32bit=1
        =                      crc=1 finobt=0, sparse=0
data    =                      bsize=4096 blocks=9216000, imaxpct=25
        =                      sunit=0 swidth=0 blks
naming  =version 2             bsize=4096 ascii-ci=0 ftype=1
log     =internal log          bsize=4096 blocks=4500, version=2
        =                      sectsz=512 sunit=0 blks, lazy-count=1
realtime =none                 extsz=4096 blocks=0, rtextents=0
Running command: mount -t xfs -o rw,noatime,inode64 /dev/mon160vg1/mon160vol2 /var/lib/ceph/osd/ceph-2
Running command: chown -R ceph:ceph /dev/dm-5
Running command: ln -s /dev/nvmevg/nvmevol2 /var/lib/ceph/osd/ceph-2/journal
Running command: ceph --cluster ceph --name client.bootstrap-osd --keyring /var/lib/ceph/bootstrap-osd/ceph.keyring mon getmap -o /var/lib/ceph/osd/ceph-2/activate.monmap
stderr: got monmap epoch 1
Running command: chown -R ceph:ceph /dev/dm-5
Running command: chown -R ceph:ceph /var/lib/ceph/osd/ceph-2/
Running command: ceph-osd --cluster ceph --osd-objectstore filestore --mkfs -i 2 --monmap /var/lib/ceph/osd/ceph-2/activate.monmap --osd-data /var/lib/ceph/osd/ceph-2/ --osd-journal /var/lib/ceph/osd/ceph-2/journal --osd-uuid 11b9ed7e-c63f-4e0e-ad2e-047820889887 --setuser ceph --setgroup ceph
. . .

Now activate, noting the OSD’s UUID as before

# ceph-volume lvm activate --filestore 2 11b9ed7e-c63f-4e0e-ad2e-047820889887
Running command: ln -snf /dev/nvmevg/nvmevol2 /var/lib/ceph/osd/ceph-2/journal
Running command: chown -R ceph:ceph /dev/dm-5
Running command: systemctl enable ceph-volume@lvm-2-11b9ed7e-c63f-4e0e-ad2e-047820889887
stderr: Created symlink from /etc/systemd/system/multi-user.target.wants/ceph-volume@lvm-2-11b9ed7e-c63f-4e0e-ad2e-047820889887.service to /usr/lib/systemd/system/ceph-volume@.service.
Running command: systemctl start ceph-osd@2
--> ceph-volume lvm activate successful for osd ID: 2

 

[root@mon160 ceph]#

 

 

[cephuser@mon160 ~]$ ceph osd tree
ID CLASS WEIGHT  TYPE NAME       STATUS    REWEIGHT PRI-AFF
-1       0.06857 root default
-3       0.06857     host mon160
 1   hdd 0.03429         osd.1        up   1.00000 1.00000
 2   hdd 0.03429         osd.2        up   1.00000 1.00000
 0       0               osd.0 destroyed         0 1.00000
[cephuser@mon160 ~]$

 

 

Create an OSD on the other nodes – osd170 and osd180.

First push out ceph.conf and admin.

sudo ceph-deploy admin osd170

Next push out the keys

# scp /var/lib/ceph/bootstrap-osd/*keyring osd170:/var/lib/ceph/bootstrap-osd/

root@osd170’s password:

ceph.bootstrap-mds.keyring 100% 113 86.0KB/s 00:00

ceph.bootstrap-mgr.keyring 100% 113 91.2KB/s 00:00

ceph.bootstrap-osd.keyring 100% 113 100.1KB/s 00:00

ceph.bootstrap-rgw.keyring 100% 113 101.3KB/s 00:00

ceph.client.admin.keyring 100% 151 127.9KB/s 00:00

ceph.keyring 100% 71 128.3KB/s 00:00

ceph.mon.keyring 100% 77 70.7KB/s 00:00

 

Create the Logical Volumes as described earlier

# ceph-volume lvm prepare --filestore --data osd170vg/osd170vol1 --journal nvmevg/nvmevol1

 

Activate


 

Repeat for node OSD180

Remove the previously destroyed OSD (OSD.0)

$ ceph osd rm 0

removed osd.0

 

Now show the configuration


 

Verify that the cluster is healthy.

Refer to the later examples in this guide for OSD creation using bluestore.

Create a pool and run a benchmark

Here is the syntax for separating the WAL and DB from the data OSD. Note this is done from the monitor/admin node

ceph-deploy osd create --data /dev/osd0d0vg/osd0vol1 --block-db /dev/osd0j0vg/osd0j0vol1 --block-wal /dev/osd0j0vg/osd0j0vol1 osd0

NOTE The following command can be used to clear remnants of previous file systems:

ceph-volume lvm zap /dev/sdc

 

Notes: It has been observed that on occasion, with previously used Bluestore devices, the zap command did not clear them correctly. This was overcome by using commands such as:

for i in {0..3}; do dd if=/dev/zero of=/dev/nvme${i}n1 bs=4096K count=100; done

for i in {a..z}; do dd if=/dev/zero of=/dev/sd$i bs=4096K count=100; done

 

and then using

"for i in {a..l}; do ceph-deploy disk zap osd3:sd$i; done" again

Alternative method for wiping old ceph disks

# wipefs -a /dev/sdx

Using lvm for the data and partitions for the db and wal

In this example 12 HDDs will use one NVMe device to house their associated DB and WAL components.

First create 24 partitions on the NVMe device

sudo parted -a optimal /dev/nvme0n1 mkpart primary 0% 3% mkpart primary 4% 7% mkpart primary 8% 11% mkpart primary 12% 15% mkpart primary 18% 21% mkpart primary 22% 25% mkpart primary 26% 29% mkpart primary 30% 33% mkpart primary 36% 39% mkpart primary 40% 43% mkpart primary 44% 47% mkpart primary 48% 51% mkpart primary 52% 55% mkpart primary 56% 59% mkpart primary 62% 65% mkpart primary 66% 69% mkpart primary 70% 73% mkpart primary 74% 77% mkpart primary 78% 81% mkpart primary 82% 85% mkpart primary 86% 89% mkpart primary 90% 93% mkpart primary 94% 97% mkpart primary 98% 100%

Then use ceph-deploy (note this was version 2 of ceph-deploy) to create OSDs according to the table below.

Table 3 Non co-located DATA, DB and WAL mapping

data        block.db           block.wal          host
/dev/sda    /dev/nvme0n1p1     /dev/nvme0n1p2     1u12bay
/dev/sdb    /dev/nvme0n1p3     /dev/nvme0n1p4     1u12bay
/dev/sdc    /dev/nvme0n1p5     /dev/nvme0n1p6     1u12bay
/dev/sdd    /dev/nvme0n1p7     /dev/nvme0n1p8     1u12bay
/dev/sde    /dev/nvme0n1p9     /dev/nvme0n1p10    1u12bay
/dev/sdf    /dev/nvme0n1p11    /dev/nvme0n1p12    1u12bay
/dev/sdg    /dev/nvme0n1p13    /dev/nvme0n1p14    1u12bay
/dev/sdh    /dev/nvme0n1p15    /dev/nvme0n1p16    1u12bay
/dev/sdi    /dev/nvme0n1p17    /dev/nvme0n1p18    1u12bay
/dev/sdj    /dev/nvme0n1p19    /dev/nvme0n1p20    1u12bay
/dev/sdk    /dev/nvme0n1p21    /dev/nvme0n1p22    1u12bay
/dev/sdl    /dev/nvme0n1p23    /dev/nvme0n1p24    1u12bay

An example using /dev/sdf as the data device follows:

$ ceph-deploy osd create --data /dev/sdf --block-db /dev/nvme0n1p10 --block-wal /dev/nvme0n1p11 1u12bay

[ceph_deploy.conf][DEBUG ] found configuration file at: /home/cephuser/.cephdeploy.conf

[ceph_deploy.cli][INFO ] Invoked (2.0.1): /usr/bin/ceph-deploy osd create --data /dev/sdf --block-db /dev/nvme0n1p10 --block-wal /dev/nvme0n1p11 1u12bay

[ceph_deploy.cli][INFO ] ceph-deploy options:

[ceph_deploy.cli][INFO ] verbose : False

[ceph_deploy.cli][INFO ] bluestore : None

[ceph_deploy.cli][INFO ] cd_conf : <ceph_deploy.conf.cephdeploy.Conf instance at 0x7f55d34f3368>

[ceph_deploy.cli][INFO ] cluster : ceph

[ceph_deploy.cli][INFO ] fs_type : xfs

[ceph_deploy.cli][INFO ] block_wal : /dev/nvme0n1p11

[ceph_deploy.cli][INFO ] default_release : False

[ceph_deploy.cli][INFO ] username : None

[ceph_deploy.cli][INFO ] journal : None

[ceph_deploy.cli][INFO ] subcommand : create

[ceph_deploy.cli][INFO ] host : 1u12bay

[ceph_deploy.cli][INFO ] filestore : None

[ceph_deploy.cli][INFO ] func : <function osd at 0x7f55d3942c80>

[ceph_deploy.cli][INFO ] ceph_conf : None

[ceph_deploy.cli][INFO ] zap_disk : False

[ceph_deploy.cli][INFO ] data : /dev/sdf

[ceph_deploy.cli][INFO ] block_db : /dev/nvme0n1p10

[ceph_deploy.cli][INFO ] dmcrypt : False

[ceph_deploy.cli][INFO ] overwrite_conf : False

[ceph_deploy.cli][INFO ] dmcrypt_key_dir : /etc/ceph/dmcrypt-keys

[ceph_deploy.cli][INFO ] quiet : False

[ceph_deploy.cli][INFO ] debug : False

[ceph_deploy.osd][DEBUG ] Creating OSD on cluster ceph with data device /dev/sdf

[1u12bay][DEBUG ] connection detected need for sudo

. . .

[1u12bay][DEBUG ] stderr: Created symlink from /etc/systemd/system/multi-user.target.wants/ceph-volume@lvm-5-7751f8da-36ec-4a5d-8747-1da08e4b95ab.service to /usr/lib/systemd/system/ceph-volume@.service.

[1u12bay][DEBUG ] Running command: /bin/systemctl start ceph-osd@5

[1u12bay][DEBUG ] --> ceph-volume lvm activate successful for osd ID: 5

[1u12bay][DEBUG ] --> ceph-volume lvm create successful for: /dev/sdf

[1u12bay][INFO ] checking OSD status…

[1u12bay][DEBUG ] find the location of an executable

[1u12bay][INFO ] Running command: sudo /bin/ceph --cluster=ceph osd stat --format=json

[ceph_deploy.osd][DEBUG ] Host 1u12bay is now ready for osd use.

[cephuser@1u12bay cephcluster]$

 

Here ceph-deploy created the logical volumes for the data device; to get finer control over the volume creation process the volumes can be created manually as described earlier.

 

Beyond Luminous – Mimic

In this section 5 nodes are available: mimic80, mimic81, mimic82, mimic83 and mimic84. Node mimic80 will be used as a monitor node, nodes mimic81, mimic82 and mimic83 will be used as OSD nodes and node mimic84 will be used as a cephfs node. The systems use two NIC ports – one DHCP and one on the 10.10.10.0/24 network for the ceph public address.

Installation of Mimic differs very little from Luminous. Use the same steps as described in the Luminous installation, except substituting "mimic" for "luminous" in the ceph-deploy installation command (ceph-deploy install --release=mimic) and configuring the ceph repo to call out mimic. There are some significant enhancements to the dashboard which will now be described.

Enable the dashboard, set up a username and password, create a self-signed certificate and show the services.

Next login using the URL shown above using the credentials that were specified.

After logging on the initial screen should be similar to that shown below:

The installation steps can be scripted with the commands below:

ceph-deploy new mimic80 --public-network 10.10.10.0/24

ceph-deploy install --release=mimic mimic80 mimic81 mimic82 mimic83 mimic84

echo "mon_allow_pool_delete = true" >> ceph.conf

ceph-deploy mon create-initial

ceph-deploy admin mimic80 mimic81 mimic82 mimic83 mimic84

sudo chmod +r /etc/ceph/ceph.client.admin.keyring

ceph-deploy mgr create mimic80

sleep 5

ceph mgr module enable dashboard

ceph dashboard create-self-signed-cert

ceph dashboard set-login-credentials cephuser <password>

ceph mgr services

 

At this point the cluster is healthy but no OSDs or pools have been created. The OSD nodes (mimic81, mimic82 and mimic83) have been configured with a 100GB SCSI disk which will be used as an OSD device. Note that ceph-deploy V2 uses a different syntax from earlier versions.

The disk structures prior to OSD creation can be cleared with

$ ceph-deploy disk zap mimic81 /dev/sdb

and then the OSD can be created with

$ ceph-deploy osd create --data /dev/sdb mimic81

After the OSD has been deployed it shows up as a logical volume – since ceph-deploy V2.X uses ceph-volume rather than ceph-disk (which was used with ceph-deploy V1.X)

Repeat for nodes mimic82 and mimic83, ceph osd tree shows –

Create a pool
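For example (the pool name and PG count here are arbitrary):

ceph osd pool create mimicpool 64 64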

Looking at the dashboard shows the newly created OSDs and Pool –

Selecting <Pools> from the top of the GUI shows –

Selecting <Cluster> gives a further sub menu –

Looking at <Cluster>/<Hosts> shows the cluster members and the services that are running –

Selecting <Cluster>/<Monitors> shows –

The next option <Cluster>/OSDS shows –

Finally <Cluster/Configuration/Documentation> shows –

Note this can be filtered

Cephfs

Node mimic84 will be used as the cephfs server. First create the metadata server.

. . .
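For example:

ceph-deploy mds create mimic84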

Now create two pools – 1 for regular data and the other for metadata.

Now create the cephfs filesystem
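A sketch of those steps (the pool names and PG counts here are assumptions):

ceph osd pool create cephfs_data 64 64

ceph osd pool create cephfs_metadata 64 64

ceph fs new cephfs cephfs_metadata cephfs_data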

Check for basic functionality
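For example:

ceph mds stat

ceph fs ls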

Node mimic80 will be used as the client. Create a mountpoint directory on mimic80 (/mnt/cephfs) and mount the filesystem, specifying the mon node (mimic80) in the mount string.

The /etc/fstab entry might look like:
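A sketch of the kernel client mount and a matching /etc/fstab entry, assuming the admin secret has been copied to /etc/ceph/admin.secret on the client:

sudo mount -t ceph mimic80:6789:/ /mnt/cephfs -o name=admin,secretfile=/etc/ceph/admin.secret

mimic80:6789:/    /mnt/cephfs    ceph    name=admin,secretfile=/etc/ceph/admin.secret,noatime,_netdev    0 0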

The GUI shows:

Create I/O.

The OSDs are showing write activity. Hold the mouse tip over a data point in the Writes bytes window to see the actual value –

Use dd to test performance
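A typical write test might look like this (the file path and sizes are arbitrary); the first line exercises small blocks and the second larger blocks:

dd if=/dev/zero of=/mnt/cephfs/ddtest bs=4k count=100000 oflag=direct

dd if=/dev/zero of=/mnt/cephfs/ddtest bs=4M count=1000 oflag=direct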

Using oflag=direct gives a dramatic effect with small blocks –

With larger block sizes –

Using read testing with dd

Note that caching can come into play here; before performing a read test, use dd again to write out a temporary file larger than available memory. Also, using the following commands will most likely give a more accurate result.

Note the value “1” clears the PageCache only, the value “2” clears Dentries and inodes and “3” clears PageCache, Dentries and inodes.
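For example:

sync

echo 3 | sudo tee /proc/sys/vm/drop_caches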

Use bonnie++ to test performance

Install bonnie++ using

$ sudo yum install -y bonnie++

The command string below specifies the file location followed by the memory size (4GB). By default the dataset size is 2X memory – which is shown in the output of bonnie++ (below).
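A sketch of such a command line (directory and memory size as described above):

bonnie++ -d /mnt/cephfs -r 4096 -u cephuser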

Note the utility bon_csv2html can be used to tabulate the bottom csv output of bonnie++; an example is shown below –

# echo "1.97,1.97,1u12bay,1,1535489406,256G,,376,99,724816,90,214714,41,670,99,347356,38,929.6,32,16,,,,,259,1,+++++,+++,1733,4,1589,5,3587,7,1759,4,57716us,5888ms,7009ms,38125us,95929us,96859us,48344ms,19670us,161ms,531ms,5759us,109ms" | bon_csv2html > results.html

Using strace to monitor bonnie++

The strace utility shows activity during the bonnie++ run.

Using ceph-volume directly on mimic OSD nodes

Assuming that the cluster has been set up according to the previous steps, this example will use ceph-volume directly on the OSD nodes without the use of ceph-deploy.

The examples following use hypothetical nodes mon100, osd101,osd102 and osd103 with two physical devices (sda and sdb) available for OSD deployment as well as an NVMe device which will be used to offload the block.wal and block.db components from the HDDs.

Initially create the keys on mon100 and push the bootstrap-osd key out to the OSD nodes

/usr/sbin/ceph-create-keys -i mon100

for i in {1..3}; do scp /var/lib/ceph/bootstrap-osd/ceph.keyring osd10$i:/var/lib/ceph/bootstrap-osd/; done

 

The next sequence of instructions first removes previous volume groups (if they exist). It then creates two new volume groups and two new logical volumes with 5000 4 MiB extents which corresponds to 20 GiB. It then removes any existing partitions from the NVMe devices and creates 4 new ones. The final step is to create the new OSD devices.

 

sudo vgremove sdavg sdbvg -y

sudo vgcreate sdavg /dev/sda

sudo vgcreate sdbvg /dev/sdb

sudo vgdisplay | grep -i sd

sudo lvcreate -l 5000 -n sdalv sdavg

sudo lvcreate -l 5000 -n sdblv sdbvg

sudo parted /dev/nvme0n1 rm 1 rm 2 rm 3 rm 4

sudo parted -a optimal /dev/nvme0n1 mkpart primary 0% 24% mkpart primary 25% 49% mkpart primary 50% 74% mkpart primary 75% 100%

sudo ceph-volume lvm create --bluestore --data sdavg/sdalv --block.db /dev/nvme0n1p1 --block.wal /dev/nvme0n1p2

sudo ceph-volume lvm create --bluestore --data sdbvg/sdblv --block.db /dev/nvme0n1p3 --block.wal /dev/nvme0n1p4

Using Proxmox to build a working Ceph Cluster


Proxmox Version Used – 5.0

Hardware – 4 x Intel NUC with 16 GB RAM each, with an SSD for the Proxmox O/S and 3TB USB disks for use as OSDs

Note This is not a tutorial on Ceph or Proxmox, it assumes familiarity with both. The intent is to show how to rapidly deploy Ceph using the capabilities of Proxmox.

Steps

  1. Create a basic Proxmox Cluster
  2. Install Ceph
  3. Create a three node Ceph Cluster
  4. Configure OSDs
  5. Create RBD Pools
  6. Use the Ceph RBD Storage as VM space for proxmox

Creating the Proxmox Cluster

Initially a four node Proxmox cluster will be created. Within this configuration three of the Proxmox cluster nodes will be used to form a ceph cluster. This ceph cluster will, in turn, provide storage for various VMs used by Proxmox. The nodes in question are proxmox127, proxmox128 and proxmox129. The last three digits of the hostname correspond to the last octet of the node’s IP address. The network used is 192.168.1.0/24.

The first task is to create a normal Proxmox Cluster – as well as the three ceph nodes mentioned the Proxmox cluster will also involve a non ceph node proxmox126.

The assumption is that the Proxmox nodes have already been created. Create a /etc/hosts file and copy it to each of the other nodes so that the nodes are “known” to each other. Open a browser and point it to https://192.168.1.126:8006 as shown below.

Open a shell and create the Proxmox cluster.

Next add the remaining nodes to this cluster by logging on to each node and specifying a node where the cluster is running.

Check the status of the cluster
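A sketch of those steps using the pvecm tool (the cluster name is arbitrary):

On the first node: pvecm create pxcluster

On each of the other nodes: pvecm add 192.168.1.126

Check: pvecm status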

The browser should now show all the nodes.

Creating the ceph cluster

This cluster is an example of a hyper-converged cluster in that the Monitor nodes and OSD nodes exist on the same server. The ceph cluster will be built on nodes proxmox127, proxmox128 and proxmox129.

Install the ceph packages on each of the three nodes
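For example, on each of proxmox127, proxmox128 and proxmox129:

pveceph install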

Next specify the cluster network. Note this need only be specified on the first node.

# pveceph init --network 192.168.1.0/24

After this an initial ceph.conf file is created in /etc/ceph/. Edit the file to change the default pool replication size from 3 to 2 and the default minimum pool size from 2 to 1. Also, this is a good time to make any other changes to the ceph configuration as the cluster has not been started yet. Typically enterprise users require a replication size of three, but since this is a home system and usable capacity might be more of a consideration, a replication size of two is used; of course the final choice is in the domain of the System Administrator.

# vi /etc/ceph/ceph.conf
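The relevant lines in the [global] section would look something like:

osd pool default size = 2
osd pool default min size = 1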

Create the monitor on all three nodes. Note it is possible to use just one node but for resiliency purposes three are better. This will start the ceph cluster.
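For example, on each of the three nodes:

pveceph createmon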

The ceph.conf file on the initial ceph node will be pushed out to the other nodes as they create their own monitors.

The crush map and ceph.conf can be shown from the GUI by selecting <Ceph> – <Configuration>

Selecting <Ceph> – <Monitor> shows the Monitor configuration.

At this point the GUI can be used to create the Ceph OSDs and pools. Note that the ceph cluster is showing an error status because no OSDs have been created yet.

===============================================================

Note: if timeouts are observed at the OSD screen perform the following tasks –

# nano /etc/apt/sources.list.d/pve-enterprise.list

In the file, comment out the enterprise repository and add the no-subscription repository:

#deb https://enterprise.proxmox.com/debian/pve stretch pve-enterprise
deb http://download.proxmox.com/debian/pve stretch pve-no-subscription

Then update:

# apt update && apt dist-upgrade

Next create a Ceph Manager (mgr) on each Monitor host:

# pveceph createmgr

Credit for this workaround should be given to the author at :

https://forum.proxmox.com/threads/ceph-osd-on-pve5-got-timeout-500-in-gui.36235/

=================================================================

 

Create an OSD by selecting <Ceph> – <OSD> – <Create OSD>

Eligible disks are shown in the drop-down box; in addition, there is a choice to use a journal disk or co-locate the journal on the OSD data disk. Note the OSD disks should be cleared before this stage using a tool such as parted. This is because ceph may be conservative in its approach to creating OSDs if it finds that there is existing data present on the candidate device.

Select <Create> and the system will begin to create the OSD.

The OSD screen now shows the newly formed OSD. Note the weight corresponds to the capacity and ceph uses this value to balance capacity across clusters. Since this is the first OSD it has been given an index of 0.

At this point the ceph cluster is still degraded. Continue adding OSDs until there is at least one OSD configured on each server node. After adding one OSD to each server the <Ceph> – <OSD> screen looks like:

The main Ceph screen shows a healthy cluster now that the replication requirements have been met:

At a console prompt, issue the command ceph osd tree to see a similar view.

root@proxmox127:/etc/ceph# ceph osd tree
ID WEIGHT TYPE NAME UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 8.18669 root default
-2 2.72890   host proxmox127
0 2.72890      osd.0 up 1.00000 1.00000
-3 2.72890   host proxmox128
1 2.72890      osd.1 up 1.00000 1.00000
-4 2.72890   host proxmox129
2 2.72890      osd.2 up 1.00000 1.00000

The next task is to create an Object Storage Pool. Select <Ceph> – <Pools> – <Create> and enter the appropriate parameters. Here the replication size is left at 3 since this is a temporary pool and it will be deleted shortly

After creation, the new pool will be displayed:

Note these commands can also be performed from the command line.
Next perform a short benchmark test to ensure basic functionality:
# rados bench -p objectpool 20 write
hints = 1
Maintaining 16 concurrent writes of 4194304 bytes to objects of size 4194304 for up to 20 seconds or 0 objects
Object prefix: benchmark_data_proxmox127_12968
sec Cur ops started finished avg MB/s cur MB/s last lat(s) avg lat(s)
0 0 0 0 0 0 – 0
1 16 16 0 0 0 – 0
2 16 34 18 35.9956 36 0.519035 1.22814
3 16 53 37 49.3273 76 0.93994 1.10036
4 16 69 53 52.9938 64 0.91114 1.04655
5 16 86 70 55.9936 68 0.902308 1.02032
6 16 106 90 59.9931 80 0.622469 0.997476
7 16 121 105 59.9931 60 1.14761 0.9906
8 16 140 124 61.9929 76 0.725148 0.980396
9 16 157 141 62.6595 68 1.01548 0.975507
10 16 173 157 62.7926 64 0.781161 0.970356
11 16 193 177 64.3561 80 0.897278 0.959163
12 16 211 195 64.9923 72 0.993592 0.949374
13 16 230 214 65.8383 76 0.963193 0.944466
14 16 246 230 65.7065 64 0.813863 0.9372
15 16 264 248 66.1255 72 0.851179 0.934594
16 16 284 268 66.9921 80 0.868938 0.931225
17 16 299 283 66.5804 60 0.986844 0.932238
18 16 317 301 66.881 72 0.885554 0.931278
19 16 335 319 67.1501 72 0.873851 0.932255
2017-07-26 16:28:35.531937 min lat: 0.519035 max lat: 1.72213 avg lat: 0.926949
sec Cur ops started finished avg MB/s cur MB/s last lat(s) avg lat(s)
20 16 353 337 67.3921 72 0.839821 0.926949
Total time run: 20.454429
Total writes made: 354
Write size: 4194304
Object size: 4194304
Bandwidth (MB/sec): 69.2271
Stddev Bandwidth: 18.3716
Max bandwidth (MB/sec): 80
Min bandwidth (MB/sec): 0
Average IOPS: 17
Stddev IOPS: 4
Max IOPS: 20
Min IOPS: 0
Average Latency(s): 0.923226
Stddev Latency(s): 0.177937
Max latency(s): 1.72213
Min latency(s): 0.441373
Cleaning up (deleting benchmark objects)
Removed 354 objects
Clean up completed and total clean up time :1.034961

Delete the pool by highlighting it and selecting <Remove> and then follow the prompts.

Using Ceph Storage as VM space

In this example two pools will be used – one for storing images and the other for containers. Create the pools with a replication size of 2 and set the pg count at 128. This will allow the addition of further OSDs later on. The pools will be named rbd-vms and rbd-containers.

Copy the keyring to the locations shown below: Note that the filename is /etc/pve/priv/ceph/<poolname>.keyring
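A sketch of those steps from a shell on one of the ceph nodes (using the admin keyring here is an assumption):

ceph osd pool create rbd-vms 128 128

ceph osd pool create rbd-containers 128 128

ceph osd pool set rbd-vms size 2

ceph osd pool set rbd-containers size 2

mkdir -p /etc/pve/priv/ceph

cp /etc/ceph/ceph.client.admin.keyring /etc/pve/priv/ceph/rbd-vms.keyring

cp /etc/ceph/ceph.client.admin.keyring /etc/pve/priv/ceph/rbd-containers.keyring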

Next from the GUI select <Datacenter> – <Storage> – <Add> and select <RBD> as the storage type.

At the dialogue enter the parameters as shown adding all three of the monitors.

For the container pool select <KRBD> and use <Container> for the content deselecting image.

Now the Server View pane shows the storage available to all nodes.

Select one of the nodes, then <rbd-vms> – <Summary>

Using the ceph pools for VM storage

First upload an image to one of the servers – in this example the VM will be created on node proxmox127. Select the local storage and upload an iso image to it.

In this example Ubuntu 17 server has been selected –

Now select <Create VM> from the top right hand side of the screen and respond to the prompts until the <Hard Disk> screen is reached. From the drop-down menu select <rbd-vms> as the storage for the Virtual Ubuntu machine.

Complete the prompts – and then select the VM – <Hardware> and add a second 40 GB virtual disk to the VM again using rbd-vms.

Start the VM and install Ubuntu Server.

Note during the installation that both of the ceph based virtual disks are presented.

The ceph performance screen shows the I/O activity

Summary

Proxmox is highly recommended not just for home use as it is also an excellent Hypervisor that is used in many Production environments. There are a number of support options available and in the author’s opinion it is the easiest and fastest way to get started with exploring the world of Ceph.


A Hands on guide to ceph


Introduction

This guide is designed to be used as a self-training course covering ceph. The first part is a gentle introduction to ceph and will serve as a primer before tackling more advanced concepts which are covered in the latter part of the document.

The course is aimed at engineers and administrators that want to gain familiarization with ceph quickly. If difficulties are encountered at any stage the ceph documentation should be consulted as ceph is being constantly updated and the content here is not guaranteed to apply to future releases.

Pre-requisites

  • Familiarization with Unix like Operating Systems
  • Networking basics
  • Storage basics
  • Laptop with 8GB of RAM for Virtual machines or 4 physical nodes

Objectives:

At the end of the training session the attendee should be able to:

  • To describe ceph technology and basic concepts
    • Understand the roles played by Client, Monitor, OSD and MDS nodes
  • To build and deploy a small scale ceph cluster
  • Understand customer use cases
  • Understand ceph networking concepts as it relates to public and private networks
  • Perform basic troubleshooting

Pre course activities

Activities

  • Prepare a Linux environment for ceph deployment
  • Build a basic 4/5 node ceph cluster in a Linux environment using physical or virtualized servers
  • Install ceph using the ceph-deploy utility
  • Configure admin, monitor and OSD nodes
  • Create replicated and Erasure coded pools
    • Describe how to change the default replication factor
    • Create erasure coded profiles
  • Perform basic benchmark testing
  • Configure object storage and use PUT and GET commands
  • Configure block storage, mount and copy files, create snapshots, set up an iscsi target
  • Investigate OSD to PG mapping
  • Examine CRUSH maps

About this guide

The training course covers the pre-installation steps for deployment on Ubuntu 14.04 and CentOS 7. There are some slight differences in the repository configuration between Debian and RHEL based distributions, as well as some settings in the sudoers file. Ceph can of course also be deployed using Red Hat Enterprise Linux.

Disclaimer

Other versions of the Operating System and the ceph release may require different installation steps (and commands) from those contained in this document. The intent of this guide is to provide instruction on how to deploy and gain familiarization with a basic ceph cluster. The examples shown here are mainly for demonstration/tutorial purposes only and they do not necessarily constitute the best practices that would be employed in a production environment. The information contained herein is distributed with the best intent and although care has been taken, there is no guarantee that the document is error free. Official documentation should always be used instead when architecting an actual working deployment and due diligence should be employed.

Getting to know ceph – a brief introduction

This section is mainly taken from ceph.com/docs/master which should be used as the definitive reference.

Ceph is a distributed storage system supporting block, object and file based storage. It consists of MON nodes, OSD nodes and optionally an MDS node. The MON nodes monitor the cluster and there are normally multiple monitor nodes to prevent a single point of failure. The OSD nodes house the ceph Object Storage Daemons, which is where the user data is held. The MDS node is the Metadata Server node and is only used for file based storage; it is not necessary if only block and object storage are needed. The diagram below is taken from the ceph web site and shows that all nodes have access to a front end Public network; optionally there is a backend Cluster Network which is only used by the OSD nodes. The cluster network takes replication traffic away from the front end network and may improve performance. By default a backend cluster network is not created and needs to be manually configured in ceph’s configuration file (ceph.conf). The ceph clients are part of the cluster.

The Client nodes know about monitors, OSDs and MDS’s but have no knowledge of object locations. Ceph clients communicate directly with the OSDs rather than going through a dedicated server.

The OSDs (Object Storage Daemons) store the data. They can be up and in the map or can be down and out if they have failed. An OSD can be down but still in the map which means that the PG has not yet been remapped. When OSDs come on line they inform the monitor.

The Monitors store a master copy of the cluster map.

Ceph features Synchronous replication – strong consistency.

The architectural model of ceph is shown below.

RADOS stands for Reliable Autonomic Distributed Object Store and it makes up the heart of the scalable object storage service.

In addition to accessing RADOS via the defined interfaces, it is also possible to access RADOS directly via a set of library calls as shown above.

Ceph Replication

By default three copies of the data are kept, although this can be changed!

Ceph can also use Erasure Coding, with Erasure Coding objects are stored in k+m chunks where k = # of data chunks and m = # of recovery or coding chunks

Example k=7, m= 2 would use 9 OSDs – 7 for data storage and 2 for recovery

Pools are created with an appropriate replication scheme.

CRUSH (Controlled Replication Under Scalable Hashing)

The CRUSH map knows the topology of the system and is location aware. Objects are mapped to Placement Groups and Placement Groups are mapped to OSDs. It allows dynamic rebalancing and controls which Placement Group holds the objects and which of the OSDs should hold the Placement Group. A CRUSH map holds a list of OSDs, buckets and rules that hold replication directives. CRUSH will try not to shuffle too much data during rebalancing whereas a true hash function would be likely to cause greater data movement

The CRUSH map allows for different resiliency models such as:

#0 for a 1-node cluster.

#1 for a multi node cluster in a single rack

#2 for a multi node, multi chassis cluster with multiple hosts in a chassis

#3 for a multi node cluster with hosts across racks, etc.

osd crush chooseleaf type = {n}

Buckets

Buckets are a hierarchical structure of storage locations; a bucket in the CRUSH map context is a location. The Bucket Type structure contains

  • id #unique negative integer
  • weight # Relative capacity (but could also reflect other values)
  • alg #Placement algorithm
    • Uniform # Use when all devices have equal weights
    • List # Good for expanding clusters
    • Tree # Similar to list but better for larger sets
    • Straw # Default allows fair competition between devices.
  • Hash #Hash algorithm 0 = rjenkins1

An extract from a ceph CRUSH map is shown following:


An example of a small deployment using racks, servers and host buckets is shown below.

Placement Groups

Objects are mapped to Placement Groups by hashing the object’s name along with the replication factor and a bitmask

The PG count for a pool is calculated as Total PGs = (number of OSDs * 100) / (number of replicas, or the k+m sum for erasure coded pools), rounded to a power of two. For example, 9 OSDs with a replication size of 3 gives 300, so a power of two such as 256 or 512 would be chosen.


Enterprise or Community Editions?

Ceph is available as a community or Enterprise edition. The latest version of the Enterprise edition as of mid-2015 is ICE1.3. This is fully supported by Red Hat with professional services and it features enhanced monitoring tools such as Calamari. This guide covers the community edition.

Installation of the base Operating System

Download either the Centos or the Ubuntu server iso images. Install 4 (or more OSD nodes if resources are available) instances of Ubuntu or CentOS based Virtual Machines (these can of course be physical machines if they are available), according to the configuration below:

Hostname    Role                   NIC1  NIC2           RAM   HDD
monserver0  Monitor, Mgmt, Client  DHCP  192.168.10.10  1 GB  1 x 20GB Thin Provisioned
osdserver0  OSD                    DHCP  192.168.10.20  1 GB  2 x 20GB Thin Provisioned
osdserver1  OSD                    DHCP  192.168.10.30  1 GB  2 x 20GB Thin Provisioned
osdserver2  OSD                    DHCP  192.168.10.40  1 GB  1 x 20GB Thin Provisioned
osdserver3  OSD                    DHCP  192.168.10.50  1 GB  1 x 20GB Thin Provisioned

If more OSD server nodes can be made available, then add them according to the table above.

VirtualBox Network Settings

For all nodes – set the first NIC as NAT, this will be used for external access.

Set the second NIC as a Host Only Adapter, this will be set up for cluster access and will be configured with a static IP.

VirtualBox Storage Settings

OSD Nodes

For the OSD nodes – allocate a second 20 GB Thin provisioned Virtual Disk which will be used as an OSD device for that particular node. At this point do not add any extra disks to the monitor node.

Mount the ISO image as a virtual boot device. This can be the downloaded Centos or Ubuntu iso image

Enabling shared clipboard support

  • Set General → Advanced → Shared Clipboard to Bidirectional
  • Set General → Advanced → Drag'n'Drop to Bidirectional

Close settings and start the Virtual Machine. Select the first NIC as the primary interface (since this has been configured for NAT in VirtualBox). Enter the hostname as shown.

Select a username for ceph deployment.

Select the disk

Accept the partitioning scheme

Select OpenSSH server

Respond to the remaining prompts and ensure that the login screen is reached successfully.

The installation steps for Centos are not shown but it is suggested that the server option is used at the software selection screen if CentOS is used.

Installing a GUI on an Ubuntu server

This section is purely optional but it may facilitate monitoring ceph activity later on. In this training session administration will be performed from the monitor node. In most instances the monitor node will be distinct from a dedicated administration or management node. Due to the limited resources (in most examples shown here) the monserver0 node will function as the MON node, an admin/management node and as a client node as shown in the table on page 8.

If you decide to deploy a GUI after an Ubuntu installation then select the Desktop Manager of your choice using the instruction strings below, the third option is more lightweight than the other two larger deployments.

  • sudo apt-get install ubuntu-desktop
  • sudo apt-get install ubuntu-gnome-desktop
  • sudo apt-get install xorg gnome-core gnome-system-tools gnome-app-install

Reboot the node.

sudo reboot

Installing a GUI on CentOS 7

A GUI can also be installed on CentOS machines by issuing the command:

sudo yum groupinstall "Gnome Desktop"

The GUI can be started with the command

startx

Then to make this the default environment:

systemctl set-default graphical.target

VirtualBox Guest Additions

To increase screen resolution go to the VirtualBox main menu and select Devices → Install Guest Additions CD Image


Select <OK> and reboot.

Preparation – pre-deployment tasks

Configure NICs on Ubuntu

Edit the file /etc/network/interfaces according to the table below:

hostname NIC1 NIC2
monserver0 DHCP 192.168.10.10
osdserver0 DHCP 192.168.10.20
osdserver1 DHCP 192.168.10.30
osdserver2 DHCP 192.168.10.40
osdserver3 DHCP 192.168.10.50

Configure the monitor node first and use its settings as a template for nic1 and nic2 on the osd nodes.
Bring up eth1 and restart the network.
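A minimal sketch of the monitor node's stanza in /etc/network/interfaces is shown below; it assumes the classic ifupdown configuration, that the cluster-facing (second) NIC appears as eth1, and a /24 netmask - adjust these to suit your system.

auto eth1
iface eth1 inet static
    address 192.168.10.10
    netmask 255.255.255.0

The interface can then be brought up and checked with:

sudo ifup eth1
ip addr show eth1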

Verify the IP address.

Configure NICs on CentOS

Ensure NetworkManager is not running and is disabled (for example with systemctl stop NetworkManager followed by systemctl disable NetworkManager).

Then edit the appropriate interface file in /etc/sysconfig/network-scripts (e.g. vi ifcfg-enp0s3), setting the static IPs according to the table shown at the beginning of this section.
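A hedged sketch of the static configuration for an OSD node follows; it assumes the cluster-facing interface is named enp0s8 and uses the osdserver0 address from the table - adjust the interface name, IPADDR and netmask to suit your system.

TYPE=Ethernet
BOOTPROTO=static
NAME=enp0s8
DEVICE=enp0s8
ONBOOT=yes
IPADDR=192.168.10.20
NETMASK=255.255.255.0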

Edit /etc/hosts on the monitor node.
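The entries added should reflect the static IPs from the table above; a sketch of the additional lines (appended to the existing content) is shown below.

192.168.10.10 monserver0
192.168.10.20 osdserver0
192.168.10.30 osdserver1
192.168.10.40 osdserver2
192.168.10.50 osdserver3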

Setting up ssh

If this option was not selected at installation time – Install openssh-server on all nodes. For Ubuntu enter:
sudo apt-get install openssh-server
For CentOS use sudo yum install openssh-server
Next from the monitor node push the hosts file out to the osd servers.
scp /etc/hosts osdserver0:/home/cephuser
scp /etc/hosts osdserver1:/home/cephuser
scp /etc/hosts osdserver2:/home/cephuser


Now copy the hosts file to /etc/hosts on each of the osd nodes

sudo cp ~/hosts /etc/hosts


Disabling the firewall

Note: Turning off the firewall is obviously not an option for production environments but is acceptable for the purposes of this tutorial. The official documentation can be consulted with regards to port configuration if the implementer does not want to disable the firewall. In general the exercises used here should not require disabling the firewall.

Disabling the firewall on Ubuntu

sudo ufw disable

Disabling the firewall on CentOS

systemctl stop firewalld

systemctl disable firewalld

Configuring sudo

Do the following on all nodes:

If the user cephuser has not already been chosen at installation time, create this user and set a password.

sudo useradd -d /home/cephuser -m cephuser

sudo passwd cephuser

Next set up the sudo permissions

echo "cephuser ALL = (root) NOPASSWD:ALL" | sudo tee /etc/sudoers.d/cephuser

sudo chmod 0440 /etc/sudoers.d/cephuser


Repeat on osdserver0, osdserver1, osdserver2

Centos disabling requiretty

For CentOS only, on each node disable requiretty for the user cephuser: run the sudo visudo command and add the line Defaults:cephuser !requiretty underneath the existing Defaults requiretty line in the Defaults section of the sudoers file.

Note: If an error message relating to a tty is issued when running ceph-deploy, double check the sudoers setting described above.

Setting up passwordless login

The ceph-deploy tool requires passwordless login with a non-root account, this can be achieved by performing the following steps:

On the monitor node enter the ssh-keygen command.

Now copy the key from monserver0 to each of the OSD nodes in turn.

ssh-copy-id cephuser@osdserver0


Repeat for the other two osd nodes.

Finally edit ~/.ssh/config for the user and hostnames as shown.
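A minimal sketch of the file, assuming the hostnames used in this guide and the cephuser account, is shown below.

Host monserver0
    Hostname monserver0
    User cephuser
Host osdserver0
    Hostname osdserver0
    User cephuser
Host osdserver1
    Hostname osdserver1
    User cephuser
Host osdserver2
    Hostname osdserver2
    User cephuser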

And change the permissions

chmod 600 ~/.ssh/config

Configuring the ceph repositories on Ubuntu

On the monitor node, create a directory for ceph administration under the cephuser home directory; this will be used for all subsequent cluster administration.
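For example (the directory name cephcluster matches the one referenced later in this guide):

mkdir ~/cephcluster
cd ~/cephcluster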

On monserver0 node enter:

wget -q -O- 'https://download.ceph.com/keys/release.asc' | sudo apt-key add -

For the hammer release of ceph enter:

echo deb http://ceph.com/debian-hammer/ $(lsb_release -sc) main | sudo tee /etc/apt/sources.list.d/ceph.list


The operation can be verified by printing out /etc/apt/sources.list.d/ceph.list.

Configuring the ceph repositories on CentOS

As user cephuser, enter the ~/cephcluster directory and edit the file /etc/yum.repos.d/ceph.repo with the content shown below.

Note: The version of ceph and O/S used here is “hammer” and “el7”, this would change if a different distribution is used, (el6 and el7 for Centos V6 and 7, rhel6 and rhel7 for Red Hat® Enterprise Linux® 6 and 7, fc19, fc20 for Fedora® 19 and 20)

[ceph-noarch]
name=Ceph noarch packages
baseurl=http://download.ceph.com/rpm-{ceph-release}/{distro}/noarch
enabled=1
gpgcheck=1
type=rpm-md
gpgkey=https://download.ceph.com/keys/release.asc

For the jewel release:

[ceph-noarch]
name=Ceph noarch packages
baseurl=http://download.ceph.com/rpm-jewel/el7/noarch
enabled=1
gpgcheck=1
type=rpm-md
gpgkey=https://download.ceph.com/keys/release.asc

Installing and configuring ceph

Ceph will be deployed using ceph-deploy. Other tools are widespread but will not be used here. First install the deploy tool on the monitor node.

For Ubuntu

sudo apt-get update && sudo apt-get install ceph-deploy (as shown in the screen shot)

For CentOS use

sudo yum update && sudo yum install ceph-deploy


From the directory ~/cephcluster

Setup the monitor node.

The format of this command is ceph-deploy new <monitor1> <monitor2> ... <monitor n>.

Note: production environments will typically have a minimum of three monitor nodes to prevent a single point of failure

ceph-deploy new monserver0

Examine ceph.conf

There are a number of configuration sections within ceph.conf. These are described in the ceph documentation (ceph.com/docs/master).

Note: the file ceph.conf is hugely important in ceph; it holds the configuration details of the cluster and will be discussed in more detail during the course of the tutorial. This is also the time to make any changes to the configuration file before it is pushed out to the other nodes. One option that could be used within a training guide such as this is to lower the replication factor as shown following:

Changing the replication factor in ceph.conf

The following options can be used to change the replication factor:

osd pool default size = 2
osd pool default min size = 1

In this case the default replication size is 2 and the system will run as long as one of the OSDs is up.

Changing the default leaf resiliency as a global setting

By default ceph will try and replicate to OSDS on different servers. For test purposes, however only one OSD server might be available. It is possible to configure ceph.conf to replicate to OSDs within a single server. The chooseleaf setting in ceph.conf is used for specifying these different levels of resiliency – in the example following a single server ceph cluster can be built using a leaf setting of 0. Some of the other chooseleaf settings are shown below:

  • #0 for a 1-node cluster.
  • #1 for a multi node cluster in a single rack
  • #2 for a multi node, multi chassis cluster with multiple hosts in a chassis
  • #3 for a multi node cluster with hosts across racks, etc.

The format of the setting is:

osd crush chooseleaf type = n

Using this setting in ceph.conf will allow a cluster to reach an active+clean state with only one OSD node.
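As a hedged sketch, the settings discussed above would be placed in the [global] section of ceph.conf before it is pushed out; the example below assumes a small test cluster using two-way replication on a single OSD node.

[global]
osd pool default size = 2
osd pool default min size = 1
osd crush chooseleaf type = 0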

Install ceph on all nodes

ceph-deploy install monserver0 osdserver0 osdserver1 osdserver2

Note at the time of writing a bug has been reported with CentOS7 deployments which can result in an error message stating “RuntimeError: NoSectionError No section: `ceph'”. If this is encountered use the following workaround:

sudo mv /etc/yum.repos.d/ceph.repo /etc/yum.repos.d/ceph-deploy.repo

Note always verify the version as there have been instances where the wrong version of ceph-deploy has pulled in an earlier version!

Once this step has completed, the next stage is to set up the monitor(s)

Note make sure that you are in the directory where the ceph.conf file is located (cephcluster in this example).

Enable the monitor(s)

This section assumes that you are running the monitor on the same node as the management station as described in the setup. If you are using a dedicated management node that does not house the monitor then pay particular attention to the Note regarding Keyrings later in this section.

ceph-deploy mon create-initial


The next stage is to change the permissions on /etc/ceph/ceph.client.admin.keyring.
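One way to do this (a sketch; the keyring path shown is the default one) is:

sudo chmod +r /etc/ceph/ceph.client.admin.keyring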


Note: This step is really important as the system will issue a (reasonably) obscure message when attempting to perform ceph operations such as the screen below.

Now check again to see if quorum has been reached during the deployment.


The status of the ceph cluster can be shown with the ceph -s or ceph health commands.



Note regarding Keyrings

In this example the ceph commands are run from the monitor node, however if a dedicated management node is deployed, the authentication keys can be gathered from the monitor node once the cluster is up and running (after a successful ceph-deploy mon create-initial has been issued). The format of the command is ceph-deploy gatherkeys <Host> . . .



Note: By default when a ceph cluster is first created a single pool (rbd) is created consisting of 64 placement groups. At this point no OSDs have been created, which is why there is a health error. It is also possible that a message may be issued stating too few PGs, but this can be ignored for now.


Creating OSDs

Ceph OSDs consist of a daemon, a data device (normally a disk drive, but it can also be a directory), and an associated journal which can live on a separate device or co-exist as a separate partition on the data device.

Important commands relating to osd creation are listed below:

ceph-deploy disk list <nodename>
ceph-deploy disk zap <nodename>:<datadiskname>
ceph-deploy osd prepare <nodename>:<datadiskname>:<journaldiskname>
ceph-deploy osd activate <nodename>:<datadiskpartition>[:<journaldiskpartition>]

The picture below shows a small (20GB disk) with co-existing journal and data partitions

The first stage is to view suitable candidates for OSD deployment

ceph-deploy disk list osdserver0


In this example three OSDs will be created. The command will only specify a single device name which will cause the journal to be located on the device as a second partition.

Prior to creating OSDs it may be useful to open a watch window which will show real time progress.

ceph -w

ceph-deploy disk zap osdserver0:sdb

Next prepare the disk.

ceph-deploy osd prepare osdserver0:sdb

. . .

The output of the watch window now shows:

The cluster at this stage is still unhealthy as by default a minimum of three OSDs are required for a healthy pool. The replication factor can be changed in ceph.conf but for now continue to create the other OSDs on nodes osdserver1 and osdserver2.

After a second OSD has been created the watch window shows:

After the third OSD has been created, the pool now has the required degree of resilience and the watch window shows that all pgs are active and clean.

Note: This is typically scripted as shown below, in this example 4 servers are used (osdserver0 osdserver1 osdserver2 osdserver3) with each having 3 disks (sdb, sdc and sdd). The script can easily be adapted to a different configuration.

for node in osdserver0 osdserver1 osdserver2 osdserver3
do
    for drive in sdb sdc sdd
    do
        ceph-deploy disk zap $node:$drive
        ceph-deploy osd prepare $node:$drive
    done
done

ceph -s shows:

Listing the OSDs

The ceph osd tree command shows the osd status

More information about the OSDs can be found with the following script:

for index in $(seq 0 $(ceph osd stat | awk '{print $3 - 1}')); do ceph osd find $index; echo; done


Next bring down osdserver2 and add another disk of 20 GB capacity; note the watch window output when the node is down:

Reboot osdserver2 and check the watch window again to show that ceph has recovered.

Create a fourth OSD on the disk that was recently added and again list the OSDs.

Creating a Replicated Pool

Calculating Placement Groups

The first example shows how to create a replicated pool with 200 Placement Groups. Ideally there will be around 100 Placement Groups per OSD. This can be made larger if the pool is expected to grow in the future. The Placement Group count can be calculated according to the formula:

Total PGs = (number of OSDs * 100) / replica count

This number is then rounded up to the next power of two. So for a configuration with 9 OSDs using three way replication the pg count would be 512. In the case of an erasure coded pool the replication factor is the sum of the k and m values. PG counts can be increased but not decreased, so it may be better to start with a slightly undersized pg count and increase it later on. The Placement Group count affects data distribution within the cluster and may also have an effect on performance.

See ceph.com/pgcalc for a pg calculator

The example next shows how to create a replicated pool.

ceph osd pool create replicatedpool0 200 200 replicated

The watch window shows the progress of the pool creation and also the pg usage

Object commands

The pool can now be used for object storage. Since no external access infrastructure has been set up operations are somewhat limited; however it is possible to perform some simple tasks via the rados command:

Simple GET and PUT operation

PUT
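A hedged example follows; the object name object.1 matches the one referenced later, while the source file (/etc/hosts) is arbitrary:

rados -p replicatedpool0 put object.1 /etc/hosts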

The watch window shows the data being written

The next command shows the object mapping.
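This is likely the ceph osd map command, for example:

ceph osd map replicatedpool0 object.1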

Now store a second object and show the mapping.

In the first instance object.1 was stored on OSDs 2,1,0 and the second object was stored on OSDs 3,1,0.

The next command shows the objects in the pool.

Pools can be inspected with the df command as well. It is recommended that a good margin of free disk space is maintained.
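For example:

rados df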

GET

Objects can be retrieved by use of GET.
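A hedged example, retrieving the object stored earlier into an arbitrary output file:

rados -p replicatedpool0 get object.1 /tmp/object.1.out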

Deleting a pool

Delete a pool using the command:

ceph osd pool delete <poolname> <poolname> --yes-i-really-really-mean-it


Note it is instructive to monitor the watch window during a pool delete operation

Benchmarking Pool performance

Ceph includes some basic benchmarking commands. These commands include read and write with the ability to vary the thread count and the block sizes.

The format is:

rados bench -p <poolname> <seconds> write|seq|rand -t <# of threads> -b <blocksize>

Example:
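A hedged example is shown below; the pool name, 60 second duration, 16 threads and 4MB block size are illustrative, and --no-cleanup retains the data for the read test that follows:

rados bench -p replicatedpool0 60 write -t 16 -b 4194304 --no-cleanup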

Note: To perform read tests it is necessary to have first written data; by default the write benchmark deletes any written data, so add the --no-cleanup qualifier.

Now perform a read test (leave out the write parameter). Note if there is not enough data the read test may finish earlier than the time specified.
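For example (same illustrative pool and thread count as above):

rados bench -p replicatedpool0 60 seq -t 16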

Creating an erasure coded pool

Erasure coded pools are more efficient in terms of storage efficiency. Erasure codes take two parameters known as k and m. The k parameter refers to the data portion and the m parameter is for the recovery portion, so for instance a k of 6 and an m of 2 could tolerate 2 device failures and has a storage efficiency of 6/8 or 75% in that the user gets to use 75% of the physical storage capacity. In the previous instance with a default replication of 3, the user can only access 1/3 of the total available storage. With a k and m of 20, 2 respectively they could use 90% of the physical storage.

The next example shows how to create an erasure coded pool; here the parameters used will be k=2 and m=1.

This profile can now be used to create an erasure coded pool.
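A hedged sketch follows; the profile and pool names are arbitrary:

ceph osd erasure-code-profile set ecprofile21 k=2 m=1
ceph osd pool create ecpool0 128 128 erasure ecprofile21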

This pool can be treated in a similar manner to the replicated pool as before

Delete the pool

Add another OSD by bringing down the monitor node, adding a 20GB virtual disk to it, and using this disk to set up a fifth OSD device.

Next create another pool with k=4 and m=1

Question – The watch window shows the output below – why?


Adding a cluster (replication) network

The network can be configured so that the OSDs communicate over a back end private network, which in this ceph.conf example is the 192.168.50.0 network, designated the cluster network. The OSD nodes are the only nodes that will have access to this network. All other nodes will continue to communicate over the public network (172.27.50.0).

Now create a fresh ceph cluster using the previous instructions. Once the mgmt. node has been created, edit the ceph.conf file in ~/testcluster and then push it out to the other nodes. The edited ceph.conf file is shown following:
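A minimal sketch of the relevant [global] entries, assuming /24 subnets for the addresses mentioned above, is:

[global]
public network = 172.27.50.0/24
cluster network = 192.168.50.0/24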


Suggested activity – As an exercise configure VirtualBox to add extra networks to the OSD nodes and configure them as a cluster network.

Changing the debug levels

Debug levels can be increased on the fly for troubleshooting purposes; the next setting increases the debug level for osd.0 to 20:

ceph tell osd.0 injectargs '--debug-osd 20'

The output of ceph -w now shows this as well.

Creating a Block Device

Create a pool that will be used to hold the block devices.
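For example (the pool name iscsipool is used in the commands that follow; the placement group count of 128 is illustrative):

ceph osd pool create iscsipool 128 128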

Next create a block image from the pool specifying the image name, size (in MB) and pool name:

rbd -p iscsipool create myimage --size 10240

List the images in the pool.

Map the device
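For example:

sudo rbd map iscsipool/myimage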

The next command shows the mapping.
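This is likely rbd showmapped:

rbd showmapped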

Get information about the image.
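For example:

rbd info iscsipool/myimage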

Now create a partition on /dev/rbd0 using fdisk or parted

Now list the block devices again.

Create a file system using mkfs or mkfs.ext4
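For example, assuming the partition created in the previous step is /dev/rbd0p1:

sudo mkfs.ext4 /dev/rbd0p1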

Next create a mount point

sudo mkdir /mnt/rbd0

and mount the device.

sudo mount /dev/rbd0p1 /mnt/rbd0

Test file access.

Creating a snapshot of the image

Prior to taking a snapshot it is recommended to quiesce the filesystem to ensure consistency. This can be done with the fsfreeze command. The format of the command is fsfreeze --freeze|--unfreeze <filesystem>.

Freezing prevents write access and unfreezing resumes write activity.

Snapshots are read-only point-in-time images which are fully supported by rbd.

First create a snapshot:
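A hedged example, using the image created earlier and an arbitrary snapshot name:

rbd snap create iscsipool/myimage@snap1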

List the snapshots
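For example:

rbd snap ls iscsipool/myimage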

Next delete all the files in /mnt/rbd0

List the contents of /mnt/rbd0

Next umount /dev/rbd0p1

Now rollback the snapshot
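For example:

rbd snap rollback iscsipool/myimage@snap1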

Mount the device again.

And list the contents of /mnt/rbd0 to show that the files have been restored.

Purging snapshots

Snapshots can be deleted individually or completely.

The next example shows how to create and delete an individual snapshot.

All snaps can be removed with the purge command.
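Hedged examples, using the snapshot name from above:

rbd snap rm iscsipool/myimage@snap1
rbd snap purge iscsipool/myimage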

Unmapping an rbd device
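For example, assuming the image was mapped to /dev/rbd0:

sudo rbd unmap /dev/rbd0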

Removing an image
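For example:

rbd rm iscsipool/myimage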


Benchmarking a block device

The fio benchmark can be used for testing block devices; fio can be installed with apt-get.

FIO Small block testing

fio --filename=/dev/rbdXX --direct=1 --sync=1 --rw=write --bs=4k --numjobs=1 --iodepth=1 --runtime=60 --time_based --group_reporting --name=journal-test

Try increasing the --numjobs parameter to see how performance varies. For large block writes using 4M use the command line below:

fio --filename=/dev/rbdXX --direct=1 --sync=1 --rw=write --bs=4096k --numjobs=1 --iodepth=1 --runtime=60 --time_based --group_reporting --name=data-test

Sample run with 4k blocks using an iodepth of 16

See the fio documentation for more information!

Sample run with 4M blocks using an iodepth of 4

Create an iSCSI target

First install the necessary software on the system that will host the iscsi target. In this example the (overworked) monitor node will be used.
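On Ubuntu this is likely the iSCSI Enterprise Target packages (a hedged sketch):

sudo apt-get install iscsitarget iscsitarget-dkms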

Edit /etc/default/iscsitarget and set the first line to read ISCSITARGET_ENABLE=true

Restart the service

Next create a pool called iscsipool (as before)

Next partition the device

Verify the operation

Now format the new partition

Edit the file /etc/iet/ietd.conf to add a target name to the bottom of the file.
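A hedged sketch of the entry; the IQN is arbitrary and the path assumes the rbd partition created earlier:

Target iqn.2016-01.com.example:cephiscsitarget
        Lun 0 Path=/dev/rbd0p1,Type=blockio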

Restart the service again

Connecting to the target

In this example a Windows iSCSI initiator will be used to connect to the target. Launch the iSCSI initiator from windows and enter the IP address. Select <Quick Connect>

At this point the target can be treated as a normal windows disk. Under Disk Management Initialize, create a volume, format and assign a drive letter to the target

In this case the label assigned is cephiscsitarget and has a drive letter assignment of E:

Now copy some files to verify operation:

The ceph watch window should show activity

Dissolving a Cluster

ceph-deploy purge <node1> <node2> . . . <noden>

ceph-deploy purgedata <node1> <node2> . . . <noden>

ceph-deploy forgetkeys

Advanced Topics

CRUSH map

CRUSH is used to give clients direct access to OSDs thus avoiding the requirement for a Metadata server or intermediary lookup. The map itself contains a list of the OSDs and can decide how they should be grouped together. The first stage is to look at a CRUSH map.

First obtain the CRUSH map. The format is ceph osd getcrushmap -o <output file>

This map is in compiled format so before it can be “read” it needs to be decompiled.

Use the -d switch to decompile.
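For example (the file names are arbitrary):

ceph osd getcrushmap -o crushmap.bin
crushtool -d crushmap.bin -o crushmap.txt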

Now the file is “readable”

# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1
tunable straw_calc_version 1

# devices
device 0 osd.0
device 1 osd.1
device 2 osd.2

# types
type 0 osd
type 1 host
type 2 chassis
type 3 rack
type 4 row
type 5 pdu
type 6 pod
type 7 room
type 8 datacenter
type 9 region
type 10 root

# buckets
host osdserver0 {
    id -2        # do not change unnecessarily
    # weight 0.010
    alg straw
    hash 0    # rjenkins1
    item osd.0 weight 0.010
}
host osdserver1 {
    id -3        # do not change unnecessarily
    # weight 0.010
    alg straw
    hash 0    # rjenkins1
    item osd.1 weight 0.010
}
host osdserver2 {
    id -4        # do not change unnecessarily
    # weight 0.010
    alg straw
    hash 0    # rjenkins1
    item osd.2 weight 0.010
}
root default {
    id -1        # do not change unnecessarily
    # weight 0.030
    alg straw
    hash 0    # rjenkins1
    item osdserver0 weight 0.010
    item osdserver1 weight 0.010
    item osdserver2 weight 0.010
}

# rules
rule replicated_ruleset {
    ruleset 0
    type replicated
    min_size 1
    max_size 10
    step take default
    step chooseleaf firstn 0 type host
    step emit
}

# end crush map

Within the CRUSH map there are different sections.

  • Devices – here the CRUSH map shows three different OSDs.
  • Types – shows the different kinds of buckets which is an aggregation of locations for the storage such as a rack or a chassis. In this case the aggregation of the buckets are the OSD server hosts.
  • Rules – These define how the buckets are actually selected.

The CRUSH map can be recompiled with

crushtool -c <decompiled crushmapfile> -o <compiled crushmapfile>

and then reinjected by

ceph osd setcrushmap -i <newcompiledcrushmapfile>


Changes can be shown with the command ceph osd crush dump

Latency stats for the osds can be shown with:
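This is likely the ceph osd perf command, which reports per-OSD commit and apply latency:

ceph osd perf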


Individual drive performance can be shown with
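This is likely the osd bench command, for example for osd.0:

ceph tell osd.0 bench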


A number can be added to specify the number of bytes to be written, the command below writes out 100MB at a rate of 37 MB/s
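For example, 100MB expressed in bytes:

ceph tell osd.0 bench 104857600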


If an individual drive is suspected of contributing to an overall degradation in performance, all drives can be tested using the wildcard symbol.
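For example:

ceph tell osd.* bench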


More about PGs, Pools and OSDs

Reconciling object, pgs and OSDs

The drawing below (repeated from the introduction) shows the relationship between a pool, objects, Placement Groups and OSDs. The pool houses the objects which are stored in Placement Groups and by default each Placement Group is replicated to three OSDs.


Suggested Activity –

Add more Virtual disks and configure them as OSDs, so that there are a minimum of 6 OSDs. Notice during this operation how the watch window will show backfilling taking place as the cluster is rebalanced.

This may take some time depending on how much data actually exists.

The following screenshot shows a portion of the output from the ceph pg dump command

Note the pg mapping to OSDs – Each pg uses the default mapping of each Placement Group to three OSDS. In this case there are 6 OSDs to choose from and the system will select three of these six to hold the pg data. In this case the two fields that are highlighted list the same OSDs.

Question – How many entries are there where the left hand field starts with 0.x, and why?

Next create some new pools similar to that shown below:
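A hedged example, creating the pool referenced below with an illustrative placement group count:

ceph osd pool create replicatedpool_1 128 128 replicated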


List the pgs again to show the new pools. Note that the number on the left hand side is of the form x.y where x = the pool ID and y = the pg ID within the pool.

Now PUT an object into pool replicatedpool_1


It can be seen that the object is located on OSDs 2,1,0. To verify the mapping for this pg use the command:


Run ceph pg dump again and grep for this pg.
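For example, using the pg id referenced below:

ceph pg dump | grep 2.6c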


Or simply issue the command

ceph pg map 2.6c

As an exercise add in a new OSD and then look to see if any of the mappings have changed.

Other rados file commands

List the contents of a pool:

rados -p <poolname> ls


Copy the contents of a pool to another pool

rados cppool <sourcepoolname> <destinationpoolname>


Reading the CRUSH Map

First get the map which is in binary format

Decompile the CRUSH map

Make a copy

Contents of initial CRUSH map:

If changes are required then edit the decompiled CRUSH map with the new entries

Next compile the CRUSH map

And inject it
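A hedged sketch of the full round trip (the file names are arbitrary):

ceph osd getcrushmap -o crushmap.bin
crushtool -d crushmap.bin -o crushmap.txt
cp crushmap.txt crushmap-new.txt
# edit crushmap-new.txt with the required changes
crushtool -c crushmap-new.txt -o crushmap-new.bin
ceph osd setcrushmap -i crushmap-new.bin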

Listing the osd tree shows:

Cache Tiering

Cache tiering keeps a subset of the main data in a cache pool. Typically this cache pool consists of fast media and is usually more expensive than regular HDD storage. The following diagram (taken from the ceph documentation) shows the concept.

A cache tiering agent decides when to migrate data between the storage tier and the cache tier. The ceph Objecter handles object placement. The cache can function in Writeback mode where the data is written to the cache tier, which sends an acknowledgement back to the client prior to the data being flushed to the storage tier. If data is fetched from the storage tier it is migrated to the cache tier and then sent to the client.

In Read-only mode the client writes data to the storage tier and during reads the data is copied to the cache tier – here though the data in the cache tier may not be up to date.

In this example it is assumed that a ruleset for ssd devices and a ruleset for hdd devices has been set up. The ssd devices can be used as a cache tier where the ssd pool will be the cache pool and the hdd pool will be used as the storage pool.

Set the cache mode as writeback or readonly

This is logged:

Next set up traffic to go to the cached pool
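A hedged sketch follows, assuming the cache pool is called ssdpool, the backing pool is called hddpool, and writeback mode is chosen:

ceph osd tier add hddpool ssdpool
ceph osd tier cache-mode ssdpool writeback
ceph osd tier set-overlay hddpool ssdpool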

Cache tiering can be used for Object, block or file. Consult the ceph documentation for further granularity on managing cache tiers.


Other useful commands




Take an OSD out of the cluster; its data will be re-allocated.
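For example:

ceph osd out osd.4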


The OSD can be brought back in with ceph osd in osd.4

Reweighting OSDs

If an OSD is heavily utilized it can be reweighted. By default reweighting considers OSDs whose utilization is above 120% of the average OSD utilization; in the example below the system will reweight OSDs that are above 140% of the average utilization.
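A hedged example of the command:

ceph osd reweight-by-utilization 140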

More on CRUSH rules

The next setting is used for different levels of resiliency

The format of the setting is:

osd crush chooseleaf type = n

It is also possible to create individual pools using these rulesets.

In this example a pool will be created on a single server (osdserver2). The command to create this rule is shown below and the format is ceph osd crush rule create-simple <rulename> <node> osd.
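For example (the rule name matches the one referenced below):

ceph osd crush rule create-simple singleserverrule osdserver2 osd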

The watch window shows:

The rules can be listed with:

Next create a pool with this rule:
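A hedged example, with an arbitrary pool name and placement group count:

ceph osd pool create singleserverpool 128 128 replicated singleserverrule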

More information about the rule can be shown with:

A comparison of the default replicated ruleset shows:

Note the difference in type “osd” versus “host”. Here a pool using the replicated ruleset would follow normal rules, but any pools created with the singleserverrule would not require a total of three servers to achieve a clean state.

Cephfs

As of the jewel community release (planned for mid 2016) cephfs will be considered stable. In the example that follows a cephfs server will be set up on a node named mds.

Installing the Meta Data Server

Install ceph as before however use the string

ceph-deploy install --release jewel <node1> <node2> .. <noden>. After ceph has been installed with OSDs configured, the steps to install cephfs are as follows:

Creating a Meta Data Server

First create a cephfs server

The format is ceph-deploy mds create <nodename>

ceph-deploy --overwrite-conf mds create mds


Creating the metadata and data pools

Next create two pools for cephfs: a metadata pool and a regular data pool.

ceph osd pool create cephfsdatapool 128 128

ceph osd pool create cephfsmetadatapool 128 128

Creating the cephfs file system

Now create the file system:

ceph fs new <file system name> <metadatapool> <datapool>

ceph fs new mycephfs cephfsmetadatapool cephfsdatapool

Verify operation

ceph mds stat

ceph fs ls

Mounting the cephfs file system

Make a mount point on the mgmt (172.168.10.10) host which will be used as a client

sudo mkdir /mnt/cephfs

sudo mount -t ceph 172.168.10.10:6789:/ /mnt/cephfs -o name=admin,secret=`ceph-authtool -p ceph.client.admin.keyring`

Next show the mounted device with the mount command

Now test with dd

sudo dd if=/dev/zero of=/mnt/cephfs/cephfsfile bs=4M count=1024

Accessing cephfs from Windows

Installing samba

Samba can be used to access the files. First install it.
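On CentOS (which matches the smb service used below) this is likely:

sudo yum install samba samba-client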


Customization can be applied to the file /etc/samba/smb.conf. The heading “Myfiles” shows up as a folder on the Windows machine.
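A minimal sketch of such a share, assuming cephfs is mounted at /mnt/cephfs:

[Myfiles]
    path = /mnt/cephfs
    browsable = yes
    writable = yes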


Enable and start the smb service

# systemctl enable smb

# systemctl start smb

Setup access
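One way to do this (a hedged sketch, re-using the cephuser account) is to add a samba password for the user:

sudo smbpasswd -a cephuser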


Next on the windows client access the share by specifying the server’s IP address.



Setting up a ceph object gateway

The mgmt node will be used in this case to host the gateway. First install it:
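A hedged sketch using ceph-deploy (this assumes a ceph-deploy version that supports the rgw subcommand):

ceph-deploy install --rgw mgmt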


After installing the gateway software, set up the mgmt node as the gateway.
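For example:

ceph-deploy rgw create mgmt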


From a browser enter http://mgmt:7480 at this point a screen similar to that shown below should appear.


Troubleshooting

ceph states

State          Status                                   Possible cause
Normal         Active + Clean
Degraded       Not able to satisfy replication rules    Should recover automatically unless there are not enough OSDs or the rulesets cannot be satisfied
Degraded       Recovering                               Recovering from a degraded state
Backfilling    Rebalancing the cluster                  A new empty OSD has been added
Incomplete     Unable to satisfy pool min-size rules    May need more OSDs
Inconsistent   Error detected                           Detected during scrub; may need to perform a pg query to find the issue
Down           Data missing, pg unavailable             Needs investigation - pg query, osd status

OSD States

OSDs can be in the cluster or out of the cluster, and can either be up (running) or down (not running). A client request will be serviced using the OSD up set. If an OSD has a problem, or rebalancing is occurring, then the request is serviced from the OSD acting set. In most cases the up set and the acting set are identical. An OSD can transition from an In to an Out state and also from an up to a down state. The ceph osd stat command will list the number of OSDs along with how many are up and in.

Peering

For a Placement Group to reach an Active and Clean state the first OSD in the set (which is the primary) must peer to the secondary and tertiary OSDs to reach a consistent state.

Placement Group states

Placement Groups can be stuck in various states according to the table below:

Stuck state   Possible Cause
Inactive      Cannot process requests as they are waiting for an OSD with the most up to date data to come in
Unclean       Placement Groups hold objects that are not replicated the specified number of times. This is typically seen during pool creation periods
Stale         Placement Groups are in an unknown state, usually because their associated OSDs have not reported to the monitor within the mon_osd_report_timeout period

Placement Groups related commands

If a PG is suspected of having issues, the query command provides a wealth of information. The format is ceph pg <pg id> query.

The OSDs that this particular PG maps to are OSD.5, OSD.0 and OSD.8. To show only the mapping then issue the command ceph pg map <pg id>

To check integrity of a Placement Group issue the command ceph pg scrub <pg id>

Progress can be shown in the (w)atch window

To list all pgs that use a particular OSD as their primary OSD issue the command ceph pg ls-by-primary <osd id>

Unfound Objects

If objects are shown as unfound and it is deemed that they cannot be retrieved then they must be marked as lost. Lost objects can either be deleted or rolled back to a previous version with the revert command. The format is ceph pg <pg id> mark_unfound_lost revert|delete.

To list pgs that are in a particular state use ceph pg dump_stuck inactive|unclean|stale|undersized|degraded --format json

In this example stuck pgs that are in a stale state are listed:

Troubleshooting examples

Issue – OSDs not joining cluster.

The output of ceph osd tree showed only 6 of the available OSDs in the cluster.

The OSDs that were down had been originally created on node osdserver0.

Looking at the devices (sda1 and sdb1) on node osdserver0 showed that they were correctly mounted.

The next stage was to see if the node osdserver0 itself was part of the cluster. Since the OSDs seemed to be mounted OK and had originally been working, it was decided to check the network connections between the OSDs. This configuration used the 192.168.10.0 network for cluster communication, so connectivity was tested on this network and the ping failed.

The next step was to physically log on to node osdserver0 and check the various network interfaces. Issuing an ip addr command showed that the interface which was configured for 192.168.10.20 (osdserver0's ceph cluster IP address) was down.

Prior to restarting the network the NetworkManager service was disabled as this can cause issues.

The service was stopped and disabled and then the network was restarted. The system was now ‘pingable’ and the two OSDs then joined the cluster.


The Monitor Map

Obtain the monitor map by issuing the command below
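This is likely the following (the output file name matches the description that follows):

ceph mon getmap -o monmap.bin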

This will extract the monitor map into the current directory naming it monmap.bin. It can be inspected with the monmaptool.
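For example:

monmaptool --print monmap.bin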

See the ceph documentation for further information relating to adding or removing monitor nodes on a running ceph cluster.

Changing the default location on a monitor node

If a different location from the default is used on the monitor node(s), then this location can be specified by following the ceph documentation as shown below:

Generally, we do not recommend changing the default data location. If you modify the default location, we recommend that you make it uniform across ceph Monitors by setting it in the [mon] section of the configuration file.

mon data

Description: The monitor’s data location.
Type: String
Default: /var/lib/ceph/mon/$cluster-$id

Real World Best Practices

The information contained in this section is based on observations and user feedback within a ceph environment. As a product ceph is dynamic and is rapidly evolving with frequent updates and releases. This may mean that some of the issues discussed here may not be applicable to newer releases.

SSD Journaling considerations

The selection of SSD devices is of prime importance when used as journals in ceph. A good discussion is referenced at http://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/. Take care to follow the steps outlined in the procedure including disabling caches where applicable. Sebastien Han’s blog in general provides a wealth of ceph related information.

A suitable fio test script used is listed below:

for pass in {1..20}
do
    echo Pass $pass starting
    fio --filename=/dev/nvme0n1 --direct=1 --sync=1 --rw=write --bs=4k --numjobs=$pass --iodepth=1 --runtime=60 --time_based --group_reporting --name=nvme0n1journaltest
done

The script runs 20 passes incrementing the numjobs setting on each pass. The only other change necessary is to specify the device name. All tests were run on raw devices.

Poor response during recovery and OSD reconfiguration

During recovery periods Ceph has been observed to consume higher amounts of memory than normal and also to ramp up the CPU usage. This problem is more acute when using high capacity storage systems. If this situation is encountered then we recommend adding single OSDs sequentially. In addition the weight can be set to 0 and then gradually increased to give finer granularity during the recovery period.

Backfilling and recovery can also negatively affect client I/O

Related commands are:

ceph tell osd.* injectargs '--osd-max-backfills 1'

ceph tell osd.* injectargs '--osd-max-recovery-threads 1'

ceph tell osd.* injectargs '--osd-recovery-max-active 1'

ceph tell osd.* injectargs '--osd-recovery-op-priority 1'

Node count guidelines for Ceph deployment

The key to Ceph is parallelism. A good rule of thumb is to distribute data across multiple servers. Consider a small system with 4 nodes using 3 X replication; should a complete server fail then the system is now only 75% as capable as before the failure. In addition the cluster is doing a lot more work since it has to deal with the recovery process as well as client I/O. Also if the cluster were 70% full across each of the nodes then each server would be close to being full after the recovery had completed, and in Ceph a near full cluster is NOT a good situation.

For this reason it is strongly discouraged to use small node count deployments in a production environment. If the above situation used high density systems then the large OSD count will exacerbate the situation even more. With any deployment less than 1 PB it is recommended to use small bay count servers such as 12/18 bay storage systems.

Open-E HOWTO

  1. Introduction and Disclaimer

    Open-E develops Enterprise class IP based storage software. As of the time of writing (April 2014) the Data Storage Software V7 (Open-E DSS V7) is active and this is the version used here. The software is mature and widely used with a good track record; it also features ease of use (which will be shown in this guide) and any experienced SysAdmin will have no issues getting to grips with device configuration.

    As always my HOWTO guides focus on the HOW rather than the WHY and are designed to get the user up to speed rapidly with a working configuration and as such very little explanation is given along the way. Basic functions such as iSCSI and NAS configuration are shown. Since the intent is to gain familiarity with the functionality of the product the configurations shown here are not necessarily best practices, indeed since the focus is more on education, virtual machines were used rather than physical ones to illustrate the functionality. Each configuration step shows the associated screen shots which reduces the likelihood of errors, however there is no guarantee that this document is free of errors. The information portrayed here is for educational purposes and should not be used in a live production environment.

    The user is encouraged to implement other features not covered in this guide and investigate the Enterprise level features such as High Availability, Remote Mirroring etc. Open-E has comprehensive documentation available that  goes well beyond the limited scope of this document.

  2. Obtaining and installing the software

    The software can be downloaded at http://www.open-e.com/download/open-e-data-storage-software-v7/ and burnt onto a CD/DVD. The installation process is fairly simple and after a successful installation a screen similar to the one below will appear.

  3. Accessing the configuration screens

    From a suitable configured system point a browser to the IP address shown in the screen above.

    Select <Try> to get a 60 day trial version. This will bring up the next screen prompting for an email address.

    Enter a valid email address and follow the link that is emailed. After this a key will be sent which can be pasted into the dialog box.

    Select <apply> to activate the product key. This will bring up the license agreement screen as shown below. Review the agreement and select <agree> to proceed to the wizard screens.

     

    Use the password “admin” for the initial login.

    In this case the default language of English will be used – select <Next> to select the language.

     

    Change the password from the default of “admin”.

     

    Next set the IP address, in this case no changes are needed and the static IP of 192.168.0.200 will continue to be used.

    Next select the correct time zone.

     

    Set the time manually or use an NTP server.

    Name the Server.

    Verify the settings are correct and select <Finish>.

    At this point the wizard is complete and the system is ready for volume configuration.

    In addition scrolling down will show documentation links. The system allows for a 60 day trial period.

  4. Configuring a Volume

    The first step is to configure a volume. A volume is part of a volume group so the first part is to create the volume group. This is done by selecting Configuration → Volume Manager → Volume Groups.

     

    Name the volume group and select apply.

    Select <OK> to create and format the volume group

    Now that the volume group exists a logical volume can be created. This can be done by selecting the highlighted link below.

     

  5. Setting up an iSCSI Target

    The next screen will allow the creation of NAS or iSCSI devices. In addition snapshots can be created. In the following example an iSCSI volume will be used with a capacity of 25 GB. Also a SWAP space of 4 GB will be set up. The initialization rate can also be adjusted from the drop down box. It is recommended to perform an initialization on the new volume. In the option box select <Create new target automatically>.

     

    If the process is successful a screen similar to the one shown will appear.

    The capacity of 25 GB is shown and the link can be selected to configure the iSCSI target. This screen shows that the access mode is set up for write through which is best for reliability, CHAP authentication is disabled (which is acceptable for the purposes of the tutorial) and the default iqn is used. It is also possible to map only certain hosts to the target from the <target IP access> box.

  6. Connecting to the iSCSI target from a Linux iSCSI initiator

    This example will setup an iSCSI initiator on CentOS. If the iSCSI initiator is not included with the installation then it can be installed with the following command:

    sudo yum install iscsi-initiator-utils

    The default iSCSI configuration file is located at /etc/iscsi/iscsid.conf. At this point the default settings should be OK. Next start the iSCSI service:

    sudo service iscsid start

    Connect to the target by issuing:

    iscsiadm -m discovery -t st -p 192.168.0.220

    The system should respond with the target’s iqn as shown below.

    This can be compared with the Name on the Open-E configuration screen.

    Configure the iSCSI service to start at boot time –

    sudo chkconfig iscsid on
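    If a session is not established automatically, it can be logged in manually (a hedged sketch):

    sudo iscsiadm -m node -p 192.168.0.220 --login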

    Use the fdisk command to show the target:

    fdisk -l


     

    Here the iSCSI target is /dev/sdb. Next use fdisk to partition the device and mkfs to format it.

     

    Test the device by creating a mount point, mounting it, copy some files and then list the copied files.


  7. Configuring a NAS Device

    A new physical device of 50GB has been added to the open-e host. This device will be used to configure a NAS share. From the browser select the <Configuration Tab> as before and select <Volume Manager> → <Volume Groups>. Select the second disk (Unit S002) and select <New Volume Group> from the Action dialogue.


    Note that instead we have the option to add to the existing volume group (vg00) but here a new volume group (vg01) will be created.


     

    Select <Apply> to format the new volume group. After the new volume group has been created it will show up as being “in use” and now select the link to create a logical volume.

     

    Select <new NAS volume> for the action and use 20GB of the capacity. Select <apply>.

    After the volume has been created the link to create a network share can be selected.

    Name the share and select <apply>. After the network share has been created the next step is to configure services for the share. Do this by selecting the link.

  8. Configuring a Windows Share

    This time Windows will be used as the Operating System to access the open-e resource. From the SMB setting box ensure that <Use SMB> is checked and then set the Users access permissions accordingly. Other share settings will not be used in this example.

  9. Accessing the Open-E share from Windows

    Select Run from the menu (varies according to the windows version) and enter \\192.168.0.220. The share should now pop up in a file window as shown below. This share can be mapped to a drive letter and used as a file resource.

     

     

    The share now shows up as a regular device under explorer.

  10. Other features

  11. Statistics

    Along with the ease of use of open-e it is also very strong in showing a wide range of statistics which can be used to analyze performance bottlenecks and to glean other important information. Select <Status> → <Statistics> and then select the link <more details>.

  12. Load Statistics

    The first section is <Load>, which shows how the load varies over the course of the day; this is of great importance to administrators as it gives them a good feel for what is happening “Behind the Scenes”.

  13. FileSystems Statistics

    The filesystems page shows device read and write accesses over varying time periods.

  14. Misc Statistics

    This screen captures System Uptime Stats.

     

  15. Network Statistics

    This screen shows the number of bits and packets that are sent and received over a given time period.

  16. Memory statistics

    This screen shows memory usage as well as access patterns.

  17. Using the Open-E console

    Many functions are also available directly from the console. Press the <F1> key to bring up a help screen.

    Only a few of the options will be discussed here but some of the screens that are covered are:

  18. Console Tools Console command

    Press Ctrl-Alt-T to access the console tools menu.

    Press <2> to bring up the Hardware Info menu. Use the arrow keys to navigate through the listing.

  19. Extended Tools console command

    Press Ctrl-Alt-X to access the extended tools menu.

    This is a potentially data destructive tool so a password prompt is required to proceed.

    Warning: it is not recommended to use these options without a thorough understanding of the underlying actions involved.

     


  20. Configure Network console command

    Press Ctrl-Alt-N to access the network settings menu. This screen can be used to change IP addresses and set up static or DHCP configuration.

  21. Hardware Configuration console command

    Press Ctrl-Alt-W to access this screen. This is also a potentially data destructive operation so the options should be used with care.

    Enter the administrator password to proceed.

    Option 7 can be used to run a basic benchmark such as a read performance test.

    Select the <Read test> only.

    Use the arrow keys to navigate and the space bar to select the devices, select <OK>.

  22. Shutdown console command

    Finally use Ctrl-Alt-K to shutdown or restart the system.

    As mentioned at the beginning this guide only covers the bare minimum of DSS V7’s capabilities, further investigation of the many enterprise level features is greatly encouraged.