Ceph Luminous/Mimic Update


 

Ceph Luminous/Mimic Quick Start Guide

Summary

This document outlines a quick start guide using the Ceph Luminous release with CentOS 7.5 (1804). There is also a brief section outlining the Mimic release. For Luminous, three physical servers are deployed, with one server (mon160) doubling up as a MON and OSD server. Each system has a single network which also has Internet access. Again – no prizes for performance, as the intent of this guide is to provide a quick recipe for deploying Ceph on smaller systems prior to migrating to full scale production deployments. The Mimic description uses five Proxmox-based VMs and focuses mainly on the dashboard, which has changed significantly from the Luminous version.

Note:

One word of warning: during the publication of this document ceph-deploy changed to version 2. There are significant syntactical changes between the two versions, so if you encounter syntax errors ensure that you are using the correct version/syntax. HOWTOs in this series for later Ceph releases (such as Nautilus) will no longer cover versions of ceph-deploy prior to version 2. The inclusion of both versions is somewhat confusing, but the transition occurred during the time of writing and both versions could be encountered. In general the Luminous portion uses version 1.5.x of ceph-deploy and the Mimic portion uses ceph-deploy V2.

New Features in Luminous

Some of the new features that are available in Luminous are listed below:

  • BlueStore is now the default storage backend for OSDs
  • New Dashboard introduced for basic cluster monitoring
  • RBD devices can use erasure coded pools
  • Data and Metadata checksumming
  • Compression

IP addresses

Table 1 Typical IPs used

Nodename    IP               Gateway
mon160      192.168.0.160    192.168.0.1
osd170      192.168.0.170    192.168.0.1
osd180      192.168.0.180    192.168.0.1

Configure the IPs according to your system, however static addresses should be used.

Software deployed

  • Ceph Luminous release
  • Ceph Mimic release
  • CentOS 7.5(1804) Operating System

Installation Steps

Install CentOS 7.5. For convenience the installation option of “Server with GUI” was used for mon160 and the other nodes used the minimal installation. During the installation create the password for root and add a user called cephuser without Administrator privileges.

Firewall configuration

For mon160 –

For all OSD nodes –

For the Gateway node(s) (if used)
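The original screenshots of the firewall rules are not reproduced here. A hedged sketch using firewalld and the standard Ceph ports (6789 for the monitor, 6800-7300 for the OSDs, 7000 for the Luminous dashboard and 7480 for a RADOS gateway) follows. For mon160:

sudo firewall-cmd --zone=public --add-port=6789/tcp --permanent
sudo firewall-cmd --zone=public --add-port=7000/tcp --permanent

For the OSD nodes:

sudo firewall-cmd --zone=public --add-port=6800-7300/tcp --permanent

For the gateway node(s), if used:

sudo firewall-cmd --zone=public --add-port=7480/tcp --permanent

Then reload the firewall on each node:

sudo firewall-cmd --reload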

Selinux configuration

On all nodes set the mode to “permissive”

vi /etc/sysconfig/selinux
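If preferred, the edit can be made non-interactively; a sketch assuming the stock CentOS file layout:

sudo sed -i 's/^SELINUX=.*/SELINUX=permissive/' /etc/sysconfig/selinux
sudo setenforce 0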

Installing and enabling ntp

. . .
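The elided steps amount to the standard CentOS 7 package install and service enablement, for example:

sudo yum install -y ntp ntpdate
sudo systemctl enable ntpd
sudo systemctl start ntpd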

Grant user cephuser sudo privileges

# echo "cephuser ALL = (root) NOPASSWD:ALL" | sudo tee /etc/sudoers.d/cephuser

Now set permissions

# chmod 0440 /etc/sudoers.d/cephuser

Configuring ssh

Change to user cephuser and change to cephuser’s home directory:

Configure passwordless login

As user cephuser generate a key with ssh-keygen

Copy the key to itself (mon160) and the other nodes

ssh-copy-id <nodename>


Modify ~/.ssh/config to allow short hostname access
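A minimal example, assuming the three hostnames used in this guide:

Host mon160
   Hostname mon160
   User cephuser
Host osd170
   Hostname osd170
   User cephuser
Host osd180
   Hostname osd180
   User cephuser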

 

Change permissions

chmod 600 ~/.ssh/config

Ceph Repository

Add the following commands to /etc/yum.repos.d/ceph.repo

[ceph-noarch]

name=Ceph noarch packages

baseurl=https://download.ceph.com/rpm-luminous/el7/noarch

enabled=1

gpgcheck=1

type=rpm-md

gpgkey=https://download.ceph.com/keys/release.asc

 

and then copy the repo to the other nodes
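One hedged way of doing this, relying on the passwordless ssh and sudo configured earlier:

for node in osd170 osd180; do cat /etc/yum.repos.d/ceph.repo | ssh $node "sudo tee /etc/yum.repos.d/ceph.repo"; done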


 

Installing ceph-deploy

Next perform an update and install the ceph-deploy package. Verify the version deployed

. . .
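The elided commands are along the lines of:

sudo yum update -y
sudo yum install -y ceph-deploy
ceph-deploy --version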

Installing ceph

The first stage is to configure the monitor node. If using a simple single network then the format is simply:

ceph-deploy new <hostname>

or if using separate backend and frontend networks then the format is:

ceph-deploy new <hostname> --public-network <xxx.yyy.zzz.0/24> --cluster-network <aaa.bbb.ccc.0/24>

With this guide only one Ceph network is used, so the command is just:

ceph-deploy new mon160

Next install the ceph package on all nodes. If the incorrect version is installed, explicitly specifying the release should also work:

ceph-deploy install --release=luminous . . .

. . .

Now enable the monitor function

Configure node mon160 as an admin node

Change the ceph.client.admin.keyring permissions and watch the cluster (see the sketch below).
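The elided screenshots for these steps correspond to commands along the lines of those used in the scripted Mimic sequence later in this document:

ceph-deploy mon create-initial
ceph-deploy admin mon160
sudo chmod +r /etc/ceph/ceph.client.admin.keyring
ceph -w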

Note the format of the watch window has changed significantly from earlier releases.

Deploying a mgr daemon

Note in the output of the watch window it shows that no mgr daemons are active. This is a new feature with Luminous. The node mon160 will be used to host a manager daemon
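A sketch of the command, mirroring the scripted Mimic sequence later in this guide:

ceph-deploy mgr create mon160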

. . .

Note that the output of ceph -w now shows the daemon active.

Creating OSDs

At this point no OSDs have been created; look at the output of lsblk to show available devices. Pre-existing data can be cleared using the parted utility and devices also can be configured with a GPT label.
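For example (assuming device /dev/sdb, which is used below):

lsblk
sudo parted -s /dev/sdb mklabel gpt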

Note: large capacity storage servers with an excess of 30 OSDs may not have enough default resources to run; in this case edit /etc/sysctl.conf so that there is an entry fs.aio-max-nr = 1048576. Then use the steps below to verify.
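A sketch of those verification steps:

echo "fs.aio-max-nr = 1048576" | sudo tee -a /etc/sysctl.conf
sudo sysctl -p
cat /proc/sys/fs/aio-max-nr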

Increase fs.aio-max-nr further if need be!

In the example below device sdb will be used as the first OSD device. Next create the OSD device on node mon160
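With the ceph-deploy 1.5.x host:device syntax used in this part of the guide, the command is of the form:

ceph-deploy osd create mon160:sdb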

. . .

The output of ceph -w now shows:

Instead of using parted ceph can also be used to clear device data.
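For example, with the ceph-deploy 1.5.x syntax:

ceph-deploy disk zap mon160:sdb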

Create 1 more osd on each of the other nodes, the watch window now shows that three OSDs are in:

Ansible OSD Deployment

(As an aside) Ansible has bluestore support – in the example below there are three bluestore ceph data devices (/dev/sda, /dev/sdb, /dev/sdc) all sharing /dev/sdd for block.db, and they all share /dev/sde for block.wal. If bluestore_wal_devices does not appear in the yml file then block.wal will coexist with block.db.

 

osd_scenario: non-collocated

osd_objectstore: bluestore

devices:
  - /dev/sda
  - /dev/sdb
  - /dev/sdc

dedicated_devices:
  - /dev/sdd
  - /dev/sdd
  - /dev/sdd

bluestore_wal_devices:
  - /dev/sde
  - /dev/sde
  - /dev/sde

 

Pool Creation

Verify that the three OSDs are up and create a pool called bluepool.

Perform a quick benchmark to ensure that the pool is working correctly.
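A hedged sketch of these two steps (the placement group count of 64 is an arbitrary choice for a small test cluster):

ceph osd pool create bluepool 64 64
rados bench -p bluepool 30 write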

Pool Association

Looking at the output of ceph -w there is a warning message

This is a new feature with Luminous and is used to associate applications with pools. An example follows:

Create a pool for use by RADOS Block Devices and then associate it with the rbd application
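For example (the pool name rbd matches the pool referred to in the dashboard section below; the PG count is an arbitrary small value):

ceph osd pool create rbd 64 64
ceph osd pool application enable rbd rbd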

Deleting a pool with Luminous

Set the option "mon_allow_pool_delete = true" in ceph.conf and push it out to the nodes with:

ceph-deploy --overwrite-conf config push mon160 osd170 osd180

After this; pools can be deleted with a command such as:

ceph osd pool delete <poolname> <poolname> --yes-i-really-really-mean-it

Enabling the Dashboard

Luminous supports a basic dashboard plugin module (one of several mgr modules). Enable it by issuing the command below:

ceph mgr module enable dashboard

By default port 7000 is used, and since the mgr was deployed on node mon160 (which can be seen from the output of ceph -s) the browser URL is http://mon160:7000.

The opening screen shows:

At a glance the summary screen shows that there is one monitor node, three OSDs and two pools configured. Selecting the second icon down on the left and then <servers> shows the Server view

Note that the OSDs on each node are shown as services along with the Monitor service running on node mon160.

Selecting the OSD Tab shows a basic screen along with OSD capacity and performance information.

Next create a block image from the rbd pool
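For example (the image name and size are illustrative):

rbd create rbdtest --size 4096 --pool rbd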

Now under the block icon (third icon down) select <Pools> → <rbd> to show the rbd image properties.

Write some data to each of the pools and then return to the cluster health screen to show the usage by pool information.

Fault conditions show up graphically and textually.

 

It is expected that more features will be added to the dashboard with later releases.

More about BlueStore

BlueStore is now the default backend for OSD devices; prior to this the default was FileStore.

Background

Currently filesystems do not provide atomic writes, and Ceph used the concept of a Ceph journal to deal with this situation. The journaling method can compromise performance, especially when the journal and Ceph data are co-located on the same device. POSIX also adds significant overhead. The figure below shows how a device is partitioned using the co-located journal mechanism.

The journal in this case consumes 5GB of space.

Looking at the BlueStore device (as it was prepared earlier) shows:

The parted utility shows that sda1 uses the xfs filesystem.

Here partition sda1 is a small metadata partition with partition sda2 actually holding the ceph data. This partition (sda2) is actually a raw partition and data is written directly to it.

The metadata associated with an OSD is stored in a RocksDB database. In addition there is a write-ahead log known as the WAL. The WAL can be used as BlueStore's internal journal. It is possible to break out the database (block.db) and the WAL (block.wal) onto different devices, similar to the way that the journal was broken out from the actual ceph data. This should only be done if the WAL and DB are provisioned on faster devices than the primary ceph device. Small devices such as NV-DIMMs could be used as a WAL device; larger flash devices can be used as DB devices.

There are a number of tuning parameters such as bluestore_cache_size, which are detailed in the ceph documentation.

BlueStore Checksums and Compression

Data checksumming uses a default algorithm of crc32c but others are available. There is an overhead to this and larger blocks can be checksummed, however this may compromise integrity. The checksum algorithm can be set globally or on a per pool basis.
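For example, the per-pool property can be set with a command of the form (algorithm name as documented for BlueStore):

ceph osd pool set <pool-name> csum_type <algorithm>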

BlueStore supports inline compression using algorithms such as snappy
or
zlib.
There are different compression modes such as:

Table 2 BlueStore Compression types

Compression type    Description
none                Never compress
passive             Do not compress data unless the write operation has a compressible hint set
aggressive          Compress data unless the write operation has an incompressible hint set
force               Try to compress data no matter what
 

There are thresholds to determine if the data should be left uncompressed if it is unable to reach a particular compression threshold ratio. For more information about the compressible and incompressible IO hints, see rados_set_alloc_hint() in the ceph documentation.

The compression settings can be set either via a per-pool property or a global config option. Pool properties can be set with:

ceph osd pool set <pool-name> compression_algorithm <algorithm>

ceph osd pool set <pool-name> compression_mode <mode>

ceph osd pool set <pool-name> compression_required_ratio <ratio>

ceph osd pool set <pool-name> compression_min_blob_size <size>

ceph osd pool set <pool-name> compression_max_blob_size <size>

Configuring OSDs with BlueStore

The single OSD device configuration from a remote node has already been shown earlier. These commands can also be performed directly on the node housing the devices.

ceph-disk prepare --bluestore <device>

The full format of the command which can break out each of the components is

ceph-disk prepare --bluestore <device> --block.wal <wal device> --block.db <db device>

For example on node osd170 the command below uses device /dev/sda as the main ceph data device and associates the other two components (block.wal and block.db) on /dev/sdc.
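A hedged reconstruction of that command, using the ceph-disk syntax shown above:

sudo ceph-disk prepare --bluestore /dev/sda --block.wal /dev/sdc --block.db /dev/sdc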

 


The dashboard now shows the new OSD being brought into the active pools while the re-balancing occurs.

Looking at the OSD screen shows:

The output of ceph osd tree shows:

 

 

 

 

Note after rebooting the GUI screen showed OSD3 as a component of osd170 unlike the GUI screenshot above.

BlueStore WAL and DB space usage

Using parted to look at /dev/sdc on node osd170 which was used for the wal and db components shows:

Looking at /dev/sda
shows:

Deploying BlueStore device from an Admin node

The device can also be deployed from mon160; the format of the command is:

ceph-deploy osd prepare --bluestore --block-db <block.db device> --block-wal <block.wal device> <OSD server hostname>:<ceph device>

For example the command below uses separate devices for the ceph objects but shares partitions on /dev/nvme0n1 for the block.db and block.wal devices.

ceph-deploy osd prepare --bluestore --block-db /dev/nvme0n1 --block-wal /dev/nvme0n1 osd170:/dev/sda

ceph-deploy osd prepare --bluestore --block-db /dev/nvme0n1 --block-wal /dev/nvme0n1 osd170:/dev/sdb

ceph-deploy osd prepare --bluestore --block-db /dev/nvme0n1 --block-wal /dev/nvme0n1 osd170:/dev/sdc

 

Benefits of BlueStore

BlueStore no longer suffers from the double write penalty as the data is written directly to the data partition. It also features data checksumming and compression (disabled by default). There is no filesystem overhead and lastly there is the flexibility of using separate devices for the data, block.wal and block.db. It is important to note though that in a HDD/Flash system the most expensive part of the write is the HDD portion. This does not change in BlueStore as the HDD will still require a full copy of the data.

Ceph-volume

With later releases of Luminous ceph-deploy has been bumped up to Version 2. In this version ceph-disk has been removed as a backend to create OSDs in favor of ceph-volume.

Using LVM2 with ceph

Ceph-volume can be used to create logical volume based OSD devices. In the following example the devices that are available for OSD deployment for node mon160 are shown below:

sdb 8:16 0 20G 0 disk

sdc 8:32 0 20G 0 disk

nvme0n1 259:0 0 8G 0 disk

 

The first two devices (sdb and sdc) will be used as a logical volume (LV) and /dev/nvme0n1 will be used for journal purposes. Use parted to create a partition on /dev/sdb and /dev/sdc.
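For example:

sudo parted -s /dev/sdb mklabel gpt mkpart primary 0% 100%
sudo parted -s /dev/sdc mklabel gpt mkpart primary 0% 100%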

sdb 8:16 0 20G 0 disk

└─sdb1 8:17 0 20G 0 part

sdc 8:32 0 20G 0 disk

└─sdc1 8:33 0 20G 0 part

 

Create a volume group

First use pvcreate to create the physical volumes.

Now create a volume group.
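A sketch of these two steps (the volume group name cephvg is an assumption):

sudo pvcreate /dev/sdb1 /dev/sdc1
sudo vgcreate cephvg /dev/sdb1 /dev/sdc1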

 

Verify

$ sudo vgdisplay

 


 


Create the Logical Volume

The section below specifies 9000 extents (each extent is 4 MiB, giving approximately 35 GiB)

Now create the OSD (using Bluestore).
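A hedged sketch of these two steps, assuming the volume group name from above and the ceph-deploy V2 --data syntax (the logical volume name is illustrative):

sudo lvcreate -l 9000 -n cephvol1 cephvg
ceph-deploy osd create --data cephvg/cephvol1 mon160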

. . .

Verify

Using ceph-volume directly

After deploying ceph run the command

ceph-deploy gatherkeys mon160

Creating Volume Groups and Logical Volumes

Create a logical volume on /dev/nvme0n1 which will be used as the journal. Prepare the OSD with the command below:

Note the UUID of the OSD from the printout and pass it to the activate command.
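A hedged sketch of this sequence – the nvmevg/nvmevol1 names match those referenced later in this section, the data volume mon160vg1/mon160vol1 is inferred from the second-OSD command shown further down, and the journal extent count is illustrative:

sudo vgcreate nvmevg /dev/nvme0n1
sudo lvcreate -l 500 -n nvmevol1 nvmevg
sudo ceph-volume lvm prepare --filestore --data mon160vg1/mon160vol1 --journal nvmevg/nvmevol1
sudo ceph-volume lvm activate --filestore <osd id> <osd uuid>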

Verify

Volume Groups can be extended using the vgextend command.

Create another NVME logical volume for a second journal

Create a second Volume group using /dev/sdd and /dev/sde.

Now create a new Logical Volume for the data

Create a second OSD

# ceph-volume lvm prepare --filestore --data mon160vg1/mon160vol2 --journal nvmevg/nvmevol2

Running command: ceph-authtool --gen-print-key

Running command: ceph --cluster ceph --name client.bootstrap-osd --keyring /var/lib/ceph/bootstrap-osd/ceph.keyring -i - osd new 11b9ed7e-c63f-4e0e-ad2e-047820889887

Running command: ceph-authtool --gen-print-key

Running command: mkfs -t xfs -f -i size=2048 /dev/mon160vg1/mon160vol2

stdout: meta-data=/dev/mon160vg1/mon160vol2 isize=2048 agcount=4, agsize=2304000 blks

= sectsz=512 attr=2, projid32bit=1

= crc=1 finobt=0, sparse=0

data = bsize=4096 blocks=9216000, imaxpct=25

= sunit=0 swidth=0 blks

naming =version 2 bsize=4096 ascii-ci=0 ftype=1

log =internal log bsize=4096 blocks=4500, version=2

= sectsz=512 sunit=0 blks, lazy-count=1

realtime =none extsz=4096 blocks=0, rtextents=0

Running command: mount -t xfs -o rw,noatime,inode64 /dev/mon160vg1/mon160vol2 /var/lib/ceph/osd/ceph-2

Running command: chown -R ceph:ceph /dev/dm-5

Running command: ln -s /dev/nvmevg/nvmevol2 /var/lib/ceph/osd/ceph-2/journal

Running command: ceph --cluster ceph --name client.bootstrap-osd --keyring /var/lib/ceph/bootstrap-osd/ceph.keyring mon getmap -o /var/lib/ceph/osd/ceph-2/activate.monmap

stderr: got monmap epoch 1

Running command: chown -R ceph:ceph /dev/dm-5

Running command: chown -R ceph:ceph /var/lib/ceph/osd/ceph-2/

Running command: ceph-osd --cluster ceph --osd-objectstore filestore --mkfs -i 2 --monmap /var/lib/ceph/osd/ceph-2/activate.monmap --osd-data /var/lib/ceph/osd/ceph-2/ --osd-journal /var/lib/ceph/osd/ceph-2/journal --osd-uuid 11b9ed7e-c63f-4e0e-ad2e-047820889887 --setuser ceph --setgroup ceph. . .

Now activate, noting the OSD’s UUID as before

# ceph-volume lvm activate --filestore 2 11b9ed7e-c63f-4e0e-ad2e-047820889887

Running command: ln -snf /dev/nvmevg/nvmevol2 /var/lib/ceph/osd/ceph-2/journal

Running command: chown -R ceph:ceph /dev/dm-5

Running command: systemctl enable ceph-volume@lvm-2-11b9ed7e-c63f-4e0e-ad2e-047820889887

stderr: Created symlink from /etc/systemd/system/multi-user.target.wants/ceph-volume@lvm-2-11b9ed7e-c63f-4e0e-ad2e-047820889887.service to /usr/lib/systemd/system/ceph-volume@.service.

Running command: systemctl start ceph-osd@2

--> ceph-volume lvm activate successful for osd ID: 2


 

[root@mon160 ceph]#

 

 

[cephuser@mon160 ~]$ ceph osd tree

ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF

-1 0.06857 root default

-3 0.06857 host mon160

1 hdd 0.03429 osd.1 up 1.00000 1.00000

2 hdd 0.03429 osd.2 up 1.00000 1.00000

0 0 osd.0 destroyed 0 1.00000

[cephuser@mon160 ~]$

 

 

Create an OSD on the other nodes – osd170 and osd180.

First push out ceph.conf and the admin keyring.

sudo ceph-deploy admin osd170

Next push out the keys

# scp /var/lib/ceph/bootstrap-osd/*keyring osd170:/var/lib/ceph/bootstrap-osd/

root@osd170's password:

ceph.bootstrap-mds.keyring 100% 113 86.0KB/s 00:00

ceph.bootstrap-mgr.keyring 100% 113 91.2KB/s 00:00

ceph.bootstrap-osd.keyring 100% 113 100.1KB/s 00:00

ceph.bootstrap-rgw.keyring 100% 113 101.3KB/s 00:00

ceph.client.admin.keyring 100% 151 127.9KB/s 00:00

ceph.keyring 100% 71 128.3KB/s 00:00

ceph.mon.keyring 100% 77 70.7KB/s 00:00

 

Create the Logical Volumes as described earlier

# ceph-volume lvm prepare --filestore --data osd170vg/osd170vol1 --journal nvmevg/nvmevol1

 

Activate


 

Repeat for node OSD180

Remove the previously destroyed OSD (OSD.0)

$ ceph osd rm 0

removed osd.0

 

Now show the configuration


 

Verify that the cluster is healthy.

Refer to the sections below for examples of OSD creation using bluestore.

Create a pool and run a benchmark

Here is the syntax for separating the WAL and DB from the data OSD. Note this is done from the monitor/admin node

ceph-deploy osd create --data /dev/osd0d0vg/osd0vol1 --block-db /dev/osd0j0vg/osd0j0vol1 --block-wal /dev/osd0j0vg/osd0j0vol1 osd0

NOTE The following command can be used to clear remnants of previous file systems:

ceph-volume lvm zap /dev/sdc

 

Notes: It has been observed that on occasion, with previously used Bluestore devices, the zap command did not clear them correctly. This was overcome by using a command such as:

for i in {0..3}; do dd if=/dev/zero of=/dev/nvme${i}n1 bs=4096K count=100; done

for i in {a..z}; do dd if=/dev/zero of=/dev/sd$i bs=4096K count=100; done

 

and then using

"for i in {a..l}; do ceph-deploy disk zap osd3:sd$i; done" again

Alternative method for wiping old ceph disks

# wipefs -a /dev/sdx

Using lvm for the data and partitions for the db and wal

In this example 12 HDDs will use one NVMe device to house their associated DB and WAL components.

First create 24 partitions on the NVMe device

sudo parted -a optimal /dev/nvme0n1 mkpart primary 0% 3% mkpart primary 4% 7% mkpart primary 8% 11% mkpart primary 12% 15% mkpart primary 18% 21% mkpart primary 22% 25% mkpart primary 26% 29% mkpart primary 30% 33% mkpart primary 36% 39% mkpart primary 40% 43% mkpart primary 44% 47% mkpart primary 48% 51% mkpart primary 52% 55% mkpart primary 56% 59% mkpart primary 62% 65% mkpart primary 66% 69% mkpart primary 70% 73% mkpart primary 74% 77% mkpart primary 78% 81% mkpart primary 82% 85% mkpart primary 86% 89% mkpart primary 90% 93% mkpart primary 94% 97% mkpart primary 98% 100%

Then use ceph-deploy (note this was version 2 of ceph-deploy) to create OSDs according to the table below.

Table 3 Non co-located DATA, DB and WAL mapping

data        block.db            block.wal           host
/dev/sda    /dev/nvme0n1p1      /dev/nvme0n1p2      1u12bay
/dev/sdb    /dev/nvme0n1p3      /dev/nvme0n1p4      1u12bay
/dev/sdc    /dev/nvme0n1p5      /dev/nvme0n1p6      1u12bay
/dev/sdd    /dev/nvme0n1p7      /dev/nvme0n1p8      1u12bay
/dev/sde    /dev/nvme0n1p9      /dev/nvme0n1p10     1u12bay
/dev/sdf    /dev/nvme0n1p11     /dev/nvme0n1p12     1u12bay
/dev/sdg    /dev/nvme0n1p13     /dev/nvme0n1p14     1u12bay
/dev/sdh    /dev/nvme0n1p15     /dev/nvme0n1p16     1u12bay
/dev/sdi    /dev/nvme0n1p17     /dev/nvme0n1p18     1u12bay
/dev/sdj    /dev/nvme0n1p19     /dev/nvme0n1p20     1u12bay
/dev/sdk    /dev/nvme0n1p21     /dev/nvme0n1p22     1u12bay
/dev/sdl    /dev/nvme0n1p23     /dev/nvme0n1p24     1u12bay

An example using /dev/sdf as the data device follows:

$ ceph-deploy osd create --data /dev/sdf --block-db /dev/nvme0n1p10 --block-wal /dev/nvme0n1p11 1u12bay

[ceph_deploy.conf][DEBUG ] found configuration file at: /home/cephuser/.cephdeploy.conf

[ceph_deploy.cli][INFO ] Invoked (2.0.1): /usr/bin/ceph-deploy osd create --data /dev/sdf --block-db /dev/nvme0n1p10 --block-wal /dev/nvme0n1p11 1u12bay

[ceph_deploy.cli][INFO ] ceph-deploy options:

[ceph_deploy.cli][INFO ] verbose : False

[ceph_deploy.cli][INFO ] bluestore : None

[ceph_deploy.cli][INFO ] cd_conf : <ceph_deploy.conf.cephdeploy.Conf instance at 0x7f55d34f3368>

[ceph_deploy.cli][INFO ] cluster : ceph

[ceph_deploy.cli][INFO ] fs_type : xfs

[ceph_deploy.cli][INFO ] block_wal : /dev/nvme0n1p11

[ceph_deploy.cli][INFO ] default_release : False

[ceph_deploy.cli][INFO ] username : None

[ceph_deploy.cli][INFO ] journal : None

[ceph_deploy.cli][INFO ] subcommand : create

[ceph_deploy.cli][INFO ] host : 1u12bay

[ceph_deploy.cli][INFO ] filestore : None

[ceph_deploy.cli][INFO ] func : <function osd at 0x7f55d3942c80>

[ceph_deploy.cli][INFO ] ceph_conf : None

[ceph_deploy.cli][INFO ] zap_disk : False

[ceph_deploy.cli][INFO ] data : /dev/sdf

[ceph_deploy.cli][INFO ] block_db : /dev/nvme0n1p10

[ceph_deploy.cli][INFO ] dmcrypt : False

[ceph_deploy.cli][INFO ] overwrite_conf : False

[ceph_deploy.cli][INFO ] dmcrypt_key_dir : /etc/ceph/dmcrypt-keys

[ceph_deploy.cli][INFO ] quiet : False

[ceph_deploy.cli][INFO ] debug : False

[ceph_deploy.osd][DEBUG ] Creating OSD on cluster ceph with data device /dev/sdf

[1u12bay][DEBUG ] connection detected need for sudo

. . .

[1u12bay][DEBUG ] stderr: Created symlink from /etc/systemd/system/multi-user.target.wants/ceph-volume@lvm-5-7751f8da-36ec-4a5d-8747-1da08e4b95ab.service to /usr/lib/systemd/system/ceph-volume@.service.

[1u12bay][DEBUG ] Running command: /bin/systemctl start ceph-osd@5

[1u12bay][DEBUG ] --> ceph-volume lvm activate successful for osd ID: 5

[1u12bay][DEBUG ] --> ceph-volume lvm create successful for: /dev/sdf

[1u12bay][INFO ] checking OSD status…

[1u12bay][DEBUG ] find the location of an executable

[1u12bay][INFO ] Running command: sudo /bin/ceph --cluster=ceph osd stat --format=json

[ceph_deploy.osd][DEBUG ] Host 1u12bay is now ready for osd use.

[cephuser@1u12bay cephcluster]$

 

Here ceph-deploy created the logical volumes for the data device; to get finer control over the volume creation process the volumes can be created manually, as described earlier.

 

Beyond Luminous – Mimic

In this section five nodes are available: mimic80, mimic81, mimic82, mimic83 and mimic84. Node mimic80 will be used as a monitor node, nodes mimic81, mimic82 and mimic83 will be used as OSD nodes and node mimic84 will be used as a cephfs node. The systems use two NIC ports – one DHCP-assigned and one on 10.10.10.0/24 for the ceph public address.

Installation of Mimic differs very little from Luminous. Use the same steps as described in the Luminous installation, except substituting "mimic" for "luminous" in the ceph-deploy installation command (i.e. ceph-deploy install --release=mimic) and configuring the ceph repo to call out mimic. There are some significant enhancements to the dashboard which will now be described.

Enable the dashboard, set up a username and password, create a self-signed certificate and show the services.

Next login using the URL shown above using the credentials that were specified.

After logging on the initial screen should be similar to that shown below:

The installation steps can be scripted with the commands below:

ceph-deploy new mimic80 --public-network 10.10.10.0/24

ceph-deploy install --release=mimic mimic80 mimic81 mimic82 mimic83 mimic84

echo "mon_allow_pool_delete = true" >>ceph.conf

ceph-deploy mon create-initial

ceph-deploy admin mimic80 mimic81 mimic82 mimic83 mimic84

sudo chmod +r /etc/ceph/ceph.client.admin.keyring

ceph-deploy mgr create mimic80

sleep 5

ceph mgr module enable dashboard

ceph dashboard create-self-signed-cert

ceph dashboard set-login-credentials cephuser <password>

ceph mgr services

 

At this point the cluster is healthy but no OSDs or pools have been created. The OSD nodes (mimic81, mimic82 and mimic83) have been configured with a 100GB SCSI disk which will be used as an OSD device. Note that ceph-deploy V2 uses a different syntax from earlier versions.

The disk structures prior to OSD creation can be cleared with

$ ceph-deploy disk zap mimic81 /dev/sdb

and then the OSD can be created with

$ ceph-deploy osd create --data /dev/sdb mimic81

After the OSD has been deployed it shows up as a logical volume – since ceph-deploy V2.X uses ceph-volume rather than ceph-disk (which was used with ceph-deploy V1.X)

Repeat for nodes mimic82 and mimic83, ceph osd tree shows –

Create a pool

Looking at the dashboard shows the newly created OSDs and Pool –

Selecting <Pools> from the top of the GUI shows –

Selecting <Cluster> gives a further sub menu –

Looking at <Cluster>/<Hosts> shows the cluster members and the services that are running –

Selecting <Cluster>/<Monitors> shows –

The next option <Cluster>/OSDS shows –

Finally <Cluster/Configuration/Documentation> shows –

Note this can be filtered

Cephfs

Node mimic84 will be used as the cephfs server. First create the metadata server.
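The elided step is the usual ceph-deploy metadata server creation, along the lines of:

ceph-deploy mds create mimic84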

. . .

Now create two pools – 1 for regular data and the other for metadata.

Now create the cephfs filesystem

Check for basic functionality
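A hedged sketch of the three steps above (pool names and PG counts are illustrative):

ceph osd pool create cephfs_data 64
ceph osd pool create cephfs_metadata 64
ceph fs new cephfs cephfs_metadata cephfs_data
ceph mds stat
ceph fs status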

Node mimic80 will be used as the client – create a mountpoint directory on mimic80 (/mnt/cephfs/). Now mount the filesystem, specifying the mon node (mimic80) in the mount string.
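A hedged example of the mount command, assuming the admin keyring secret has been placed in /etc/ceph/admin.secret on the client:

sudo mkdir -p /mnt/cephfs
sudo mount -t ceph mimic80:6789:/ /mnt/cephfs -o name=admin,secretfile=/etc/ceph/admin.secret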

The /etc/fstab entry might look like:
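For example (again assuming a secret file at /etc/ceph/admin.secret):

mimic80:6789:/    /mnt/cephfs    ceph    name=admin,secretfile=/etc/ceph/admin.secret,noatime,_netdev    0 0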

The GUI shows:

Create I/O.

The OSDs are showing write activity –

Hold the mouse tip over a data point in the Writes bytes window to see the actual value –

Use dd to test performance

Using oflag=direct gives a dramatic effect with small blocks –

With larger block sizes –

Using read testing with dd
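The original screenshots are elided; hedged examples of the kind of dd invocations described above (file name, block sizes and counts are illustrative):

dd if=/dev/zero of=/mnt/cephfs/ddfile bs=4k count=100000 oflag=direct
dd if=/dev/zero of=/mnt/cephfs/ddfile bs=4M count=1000 oflag=direct
dd if=/mnt/cephfs/ddfile of=/dev/null bs=4M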

Note that caching can come into play here; before performing a read test, use dd again to write out a temporary file larger than available memory. Using the commands following will most likely give a more accurate result.

Note the value "1" clears the PageCache only, the value "2" clears Dentries and inodes and "3" clears PageCache, Dentries and inodes.
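For example, to clear PageCache, Dentries and inodes before a read test:

sync
echo 3 | sudo tee /proc/sys/vm/drop_caches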

Use bonnie++ to test performance

Install bonnie++ using

$ sudo yum install -y bonnie++

The command string below specifies the file location followed by the memory size (4GB). By default the dataset size is 2X memory, which is shown in the output of bonnie++ (below).
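A hedged example of such a command string (add -u <user> if running as root):

bonnie++ -d /mnt/cephfs -r 4096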

Note the utility bon_csv2html can be used to tabulate the bottom csv output of bonnie++; an example is shown below –

# echo "1.97,1.97,1u12bay,1,1535489406,256G,,376,99,724816,90,214714,41,670,99,347356,38,929.6,32,16,,,,,259,1,+++++,+++,1733,4,1589,5,3587,7,1759,4,57716us,5888ms,7009ms,38125us,95929us,96859us,48344ms,19670us,161ms,531ms,5759us,109ms" | bon_csv2html > results.html

Using strace to monitor bonnie++

The strace utility shows activity during the bonnie++ run.
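A hedged example, summarising system call counts with -c and following child processes with -f:

strace -c -f bonnie++ -d /mnt/cephfs -r 4096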

Using ceph-volume directly on mimic OSD nodes

Assuming that the cluster has been set up according to the previous steps, this example will use ceph-volume directly on the OSD nodes without the use of ceph-deploy.

The examples following use hypothetical nodes mon100, osd101,osd102 and osd103 with two physical devices (sda and sdb) available for OSD deployment as well as an NVMe device which will be used to offload the block.wal and block.db components from the HDDs.

Initially create the keys on mon100 and push the bootstrap-osd key out to the OSD nodes

/usr/sbin/ceph-create-keys -i mon100

for i in {1..3}; do scp /var/lib/ceph/bootstrap-osd/ceph.keyring osd10$i:/var/lib/ceph/bootstrap-osd/; done

 

The next sequence of instructions first removes previous volume groups (if they exist). It then creates two new volume groups and two new logical volumes with 5000 4 MiB extents which corresponds to 20 GiB. It then removes any existing partitions from the NVMe devices and creates 4 new ones. The final step is to create the new OSD devices.

 

sudo vgremove sdavg sdbvg -y

sudo vgcreate sdavg /dev/sda

sudo vgcreate sdbvg /dev/sdb

sudo vgdisplay | grep -i sd

sudo lvcreate -l 5000 -n sdalv sdavg

sudo lvcreate -l 5000 -n sdblv sdbvg

sudo parted /dev/nvme0n1 rm 1 rm 2 rm 3 rm 4

sudo parted -a optimal /dev/nvme0n1 mkpart primary 0% 24% mkpart primary 25% 49% mkpart primary 50% 74% mkpart primary 75% 100%

sudo ceph-volume lvm create --bluestore --data sdavg/sdalv --block.db /dev/nvme0n1p1 --block.wal /dev/nvme0n1p2

sudo ceph-volume lvm create --bluestore --data sdbvg/sdblv --block.db /dev/nvme0n1p3 --block.wal /dev/nvme0n1p4

Comments and suggestions for future articles welcome!
