
Ceph/Guide


Ceph Distributed Filesystem

Ceph is a distributed object store and filesystem designed to provide excellent performance, reliability and scalability. According to the Ceph Wikipedia entry, the first stable release (Argonaut) was in 2012. It arose from a doctoral dissertation by Sage Weil at the University of California, Santa Cruz. Significant funding came from the US DOE, as the software found early adoption in clusters in use at Lawrence Livermore, Los Alamos and Sandia National Labs. The main commercial backing for Ceph comes from a company founded by Weil (Inktank), which was acquired by Red Hat in April 2014.

The Floss Weekly podcast interviewed Sage Weil in 2013 for their 250th show, around the time of the "Cuttlefish" release. One of the points of discussion was the need for datacenters to handle disaster recovery, and Sage pointed out that starting with Dumpling, Ceph would provide for replication between datacenters. Another bit of trivia came out in the podcast: Sage Weil was one of the inventors of the WebRing concept in the early days of the World Wide Web.

Overview

Ceph consists of four major components:

  • Object Store Device
  • Monitor Server
  • RADOS Block Device (RBD) and RADOS Gateway
  • Metadata Server providing a POSIX compliant Filesystem


Object Store Device

Two object stores mark the beginning of a Ceph cluster, and they may be joined by potentially thousands more. They sit on top of an existing filesystem such as ext4, xfs, zfs or btrfs and are created and maintained by an Object Store Device daemon (OSD). While the underlying filesystem may provide redundancy, error detection and repair on its own, Ceph implements its own layer of error detection, recovery and n-way replication. There is a tradeoff between using a RAID 1, 5, 6 or 10 scheme on the underlying filesystem with a single OSD server, versus having individual drives and multiple OSD servers. The former provides a defense-in-depth strategy against data loss, but the latter has less of an impact on the cluster when a drive fails and requires replacement. The latter also potentially performs better than a software RAID or a filesystem built on top of a number of JBOD devices.

An OSD will take advantage of advanced features of the underlying filesystem such as extents, copy on write (COW), and snapshotting. It can make extended use of the xattr feature to store metadata about an object, but this will often exceed the 4 KB xattr limitation of ext4 filesystems, in which case an alternative metadata store becomes necessary. The ceph.com site documentation recommends either ext4 or xfs in production for OSDs, but zfs or btrfs are a natural fit because of their ability to self-repair, snapshot and handle COW. Ultimately btrfs is likely to become the preferred underlying filesystem for a Linux-based OSD once the majority is satisfied that it is stable enough. If you are still unsure about btrfs, look at the performance tuning research that has already been done, such as an Inktank employee's study on Bobtail.

The task of the OSD is to handle the distribution of objects by Ceph across the cluster. The user can specify the number of copies of an object to be created and distributed amongst the OSDs. The default is 2 copies with a minimum of 1, but those values can be increased up to the number of OSDs implemented. Since this redundancy sits on top of whatever may be provided by the underlying RAID arrays, the cluster enjoys an added layer of protection that guards against catastrophic failure of a disk array. When a drive array fails, only the OSD or OSDs that make use of it are brought down.

Objects are broken down into extents, or shards, when distributed, instead of being treated as a single entity. In a 2-way replication scheme with more than 2 OSD servers, an object's shards can end up distributed across potentially all of the OSD servers.


Note
An OSD server also implements a journal (typically 1-10GB) which can be a file or a raw device. The default journal goes into the same filesystem as the rest of an object store, but this is not optimal for either performance or fault tolerance. When implementing OSDs on a host, consider dedicating a drive to handle just journals; an SSD would be a huge performance boost for this purpose. If your system drive is an SSD, consider using it for journals if you can't dedicate a drive to them. Otherwise, partition off a 1-10GB section of each drive that will be used for OSD filesystems, and put the journal of one OSD server alongside the object store of another on each drive.
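These placement choices are expressed with the osd journal path in ceph.conf. Below is a sketch of the dedicated-partition case; the partition label is hypothetical:

```text
[osd.0]
        host = kroll1
        # journal on a dedicated SSD partition instead of the default
        # journal file inside the object store's own filesystem
        osd journal = /dev/disk/by-partlabel/ceph-journal-0
```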


Note
Although Ceph will work with prebuilt RAID arrays, the Inktank developers suggest that "you are doing it wrong" if you take this approach. They would rather that you have one disk device per OSD and thus prefer JBODs to hardware or software based RAIDs. In our example cluster we implemented Ceph on top of pre-existing BTRFS based software RAIDs. We will eventually explore taking down the old arrays one at a time to re-do them as individual OSD servers to look at the performance impact.


Monitor Server

OSDs are watched over by Monitor Servers (MONs) which act as the coordinators for object traffic. The initial Ceph Cluster would consist of a MON and two OSD servers, and this is the example used in their documentation for a quick install. They also talk about an admin server, but this is only a system which is able to painlessly remote into the cluster members using ssh authorized_keys. The admin server would be the system that the user has set up to run Chef, Puppet or other control systems that oversee the operation of the cluster.

A single MON would be a single point of failure for Ceph, so it is recommended that the Ceph cluster be run with an odd number of MONs, with a minimum of 3 running to establish a quorum. For performance reasons, MONs should be put on a separate filesystem or device from OSDs because they tend to do a lot of fsyncs. Although they are typically shown running on dedicated hosts, they can share a host with an OSD, and often do in order to have enough MON servers for a decent quorum. MONs don't need a lot of storage space, so it is perfectly fine to have them run on the system drive while the OSD takes over whatever large disk or array is in the server. If you dedicate an SSD to handle OSD journals, the MON storage will only require another 2 GB or so.

MONs coordinate object traffic by implementing the Controlled Replication Under Scalable Hashing (CRUSH) map. This is an algorithm that computes the locations for storing objects in the OSD pools. MONs also keep track of the map of daemons running the various flavors of Ceph server in the cluster. An "initial members" setting allows the user to specify the minimum number of MON servers that must be running in order to form a quorum. When there are not enough MONs to form a quorum, the Ceph cluster will stop processing until a quorum is re-established, in order to avoid a "split-brain" situation.

The CRUSH map defaults to an algorithm that automatically computes where in the OSDs an object's shards should be placed, but it can be influenced by additional human specified policies. This way, a site administrator can sway CRUSH when making choices such as:

  • use the site's faster OSDs by default
  • divide OSDs into "hot" (SSD based), "normal" and "archival" (slow or tape backed) storage
  • localize replication to OSDs sitting on the same switch or subnet
  • keep an object's replicas off a single rack, so that a rack-wide power failure cannot take down all copies

It is this spreading out of the load with the CRUSH map that allows Ceph to scale up to thousands of OSDs so easily while increasing performance as new stores are added. Because of the spreading, the bottleneck shifts from raw disk performance (about 100 MB/s for a SATA drive, for example) to the bandwidth capacity of your network and switches.
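The real CRUSH algorithm is considerably more elaborate, but its core idea, that any client can compute an object's location from a hash rather than asking a central table, can be sketched in a few lines of shell. This is a toy illustration only; the object names and OSD count are made up:

```shell
# Toy hash-based placement: the object name deterministically picks an
# OSD, so every client computes the same location with no lookup table.
place() {
    # $1 = object name, $2 = number of OSDs
    hash=$(printf '%s' "$1" | cksum | cut -d' ' -f1)
    echo "osd.$((hash % $2))"
}

for obj in vmimage-1 backup.tar photo.jpg; do
    echo "$obj -> $(place "$obj" 3)"
done
```

Because the mapping is pure computation, growing the cluster only requires recomputing placements rather than migrating a central index, which is the property that lets Ceph scale out.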


There are a number of ways to work with the MON pool and its Paxos-managed datastore to monitor and administer the cluster, but the most common is the /usr/bin/ceph command. This is a Python script that uses a number of Ceph-supplied Python modules which communicate with the MON pool using JSON.

RADOS Block Device and RADOS Gateway

Ceph provides a kernel module for the RADOS Block Device (RBD) and a librados library which libvirt and KVM can be linked against. This is essentially a virtual disk device that distributes its "blocks" across the OSDs in the Ceph cluster. An RBD provides the following capabilities:

  • thin provisioning
  • i/o striping and redundancy across the Cluster
  • resizeable
  • snapshot with revert capability
  • directly usable as a KVM guest's disk device
  • a variant of COW where a VM starts with a "golden image" which the VM diverges from as it operates
  • Data Replication between datacenters starting with the Dumpling Release

A major selling point for the RBD is that it can be used as a virtual machine's drive store in KVM. Because it spans the OSD server pool, a guest can be migrated between cluster nodes by literally shutting the guest down on one node and booting it on another. Libvirt and Virt-Manager have provided this support for some time now, and it is probably one of the main reasons why Red Hat (a major sponsor of QEMU/KVM, Libvirt and Virt-Manager) acquired Inktank.

The RBD and the RADOS Gateway provide the same sort of functionality for cloud services as Amazon S3 and OpenStack Swift. The early adopters of Ceph were interested primarily in cloud-service object stores. Cloud services also drove the initial work on replication between datacenters.


Metadata Server

Ceph provides a Metadata Server (MDS) which offers a more traditional, POSIX-compliant filesystem that translates into objects stored in the OSD pool. This is typically where a non-Linux platform can implement client support for Ceph. It can be shared via CIFS and NFS to non-Ceph and non-Linux based systems, including Windows. This is also the way to use Ceph as a drop-in replacement for Hadoop. The filesystem component started to mature around the Dumpling release.

Ceph requires all of its servers to be able to see each other directly in the cluster. So this filesystem would also be the point where external systems would be able to see the content without having direct access to the Ceph Cluster. For performance reasons, the user may have all of the Ceph cluster participants using a dedicated network on faster hardware with isolated switches. The MDS server would then have multiple NICs to straddle the Ceph network and the outside world.

As of the Firefly release, there is only one active MDS server at a time. Other MDS servers run in a standby mode to quickly perform a failover when the active server goes down. The cluster will take about 30 seconds to determine whether the active MDS server has failed. This may appear to be a bottleneck for the cluster, but the MDS only does the mapping of POSIX file names to object ids. With an object id, a client then directly contacts the OSD servers to perform the necessary i/o of extents/shards.

Eventually Ceph will allow multiple active MDS servers, dividing the POSIX filesystem namespace with a mapping scheme that distributes the load.
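Once the MDS is up, a Linux client can mount the filesystem directly. This is a sketch, not taken from our rollout: the MON address matches our example hosts, but the mount point and secret file path are hypothetical, and the secret must first be extracted from the admin keyring.

```shell
# Kernel client: any MON address will do; the secret file holds just
# the key extracted from ceph.client.admin.keyring (hypothetical path)
mount -t ceph 192.168.2.1:6789:/ /mnt/ceph \
      -o name=admin,secretfile=/etc/ceph/admin.secret

# FUSE client alternative, shipped with the ceph package:
ceph-fuse -m 192.168.2.1:6789 /mnt/ceph
```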

Installation

As of this writing, the stable version of Ceph in portage is ceph-0.56.3, which corresponds to a midway rev of the second major release of Ceph, code named "Bobtail". In gentoo unstable are versions of the follow-on major Ceph updates up to the current major version "Firefly":

  • ceph-0.56.x "Bobtail"
  • ceph-0.61.x "Cuttlefish"
  • ceph-0.67.x "Dumpling"
  • ceph-0.72.x "Emperor"
  • ceph-0.80.x "Firefly"
  • ceph-0.87.x "Giant"

The ceph site online archive only shows release downloads back to Cuttlefish. Also the MDS server doesn't begin to stabilize until around Dumpling. We decided to unmask unstable ceph in our package.keywords file along with its dependencies and ended up building with Firefly before doing our installation.
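For reference, the keywording amounts to entries like the following in /etc/portage/package.keywords; the exact dependency list will differ by system, so add whatever emerge reports as masked:

```text
# unmask unstable ceph and the dependencies emerge complained about
sys-cluster/ceph ~amd64
dev-libs/leveldb ~amd64
dev-util/google-perftools ~amd64
```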

Outside of Portage, there is a development-only release on the ceph site (0.84.x) and an upcoming release, "Giant", expected to drop around Q4 2014. In the weeks between the creation of this wiki entry and the current edit, there have already been a number of ebuild updates to Firefly, which we have installed. The example cluster was initially installed with 0.80.1 and is now running 0.80.5.

root #ACCEPT_KEYWORDS="~amd64" emerge -pv ceph

These are the packages that would be merged, in order:

Calculating dependencies... done!
[ebuild  N     ] sys-apps/hdparm-9.43  127 kB
[ebuild  N     ] sys-libs/libunwind-1.1:7  USE="libatomic lzma static-libs -debug -debug-frame" 1,073 kB
[ebuild  N     ] dev-util/google-perftools-2.0-r2:0/4  USE="debug static-libs -largepages -minimal {-test}" 1,232 kB
[ebuild  N     ] dev-libs/leveldb-1.15.0  USE="snappy static-libs tcmalloc" 201 kB
[ebuild  N     ] dev-python/chardet-2.2.1  PYTHON_TARGETS="python2_7 python3_3 -pypy (-python2_6) -python3_2 -python3_4" 177 kB
[ebuild  N     ] dev-python/itsdangerous-0.24  PYTHON_TARGETS="python2_7 python3_3 -pypy (-python2_6) -python3_2 -python3_4" 46 kB
[ebuild  N     ] dev-python/blinker-1.3  USE="doc {-test}" PYTHON_TARGETS="python2_7 python3_3 -pypy (-python2_6) -python3_2 -python3_4" 90 kB
[ebuild  N     ] dev-python/requests-2.3.0  PYTHON_TARGETS="python2_7 python3_3 -pypy -python3_4" 420 kB
[ebuild  N     ] dev-python/werkzeug-0.9.4  PYTHON_TARGETS="python2_7 python3_3 -pypy -python3_4" 1,100 kB
[ebuild  N     ] dev-python/flask-0.10.1-r1  USE="-examples {-test}" PYTHON_TARGETS="python2_7 python3_3 -pypy -python3_4" 532 kB
[ebuild  N     ] sys-kernel/spl-0.6.2-r5  USE="-custom-cflags -debug -debug-log" 219 kB
[ebuild  N     ] sys-fs/zfs-kmod-0.6.2-r5  USE="rootfs -custom-cflags -debug" 1,585 kB
[ebuild  N     ] sys-fs/zfs-0.6.2-r5  USE="bash-completion rootfs static-libs -custom-cflags -debug (-kernel-builtin) (-selinux) -test-suite" PYTHON_TARGETS="python2_7 python3_3 (-python2_6) -python3_2" 0 kB
[ebuild     U  ] sys-cluster/ceph-0.80.1 [0.56.3] USE="fuse gtk libaio%* libatomic nss%* radosgw static-libs xfs%* zfs%* -cryptopp% -debug -tcmalloc" 3,996 kB

Total: 14 packages (1 upgrade, 13 new), Size of downloads: 10,793 kB
Note
The USE flags libaio and radosgw are local to the ceph ebuild. Also even if you do build Ceph with libaio support, an OSD will not use asynchronous i/o by default.
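If you do want the journal to use asynchronous i/o, the relevant switch is the journal aio setting in the [osd] section; our understanding is that it defaults to off in these releases, so verify against the documentation for your version:

```text
[osd]
        # opt in to libaio-based asynchronous journal writes
        journal aio = true
```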


Kernel Config

If you want to use the RADOS block device, you will need to enable it in your kernel .config, either as a module or baked in. Ceph itself will want FUSE support enabled if you want to work with the POSIX filesystem component, and you will also want to include the Ceph driver under Network File Systems. For your backend object stores, you will want xfs support because of the xattr limitations in ext4, and btrfs support because it really is becoming stable now.

KERNEL
Device Drivers
  Block devices
    Rados block device (RBD)
File systems
  XFS filesystem support
    XFS POSIX ACL support
  Btrfs filesystem support
    Btrfs POSIX Access Control Lists
  FUSE (Filesystem in Userspace) support
Network File Systems
  Ceph distributed file system


Network Config

Ceph is sensitive to IP address changes, so you should make sure that all of your Ceph servers are assigned static IP addresses. You also may want to proactively treat the Ceph cluster members as an independent subnet from your existing network by multi-homing your existing network adapters as necessary. That way if an ISP change or other topology changes are needed, you can keep your cluster setup intact. It also gives you the luxury of migrating the ceph subnet later on to dedicated nics, switches and faster hardware such as 10Gbit ethernet or Infiniband. If the cluster subnet is small enough, consider keeping the hostnames in your /etc/hosts files, at least until things grow to the point where a pair of DNS servers among the cluster members becomes a compelling solution.
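ceph.conf supports this kind of split directly: the public network setting carries client traffic, and an optional cluster network can carry replication and backfill between OSDs once dedicated NICs exist. A sketch using our example subnet, with a hypothetical second subnet for the dedicated hardware:

```text
[global]
        # client-facing traffic
        public network = 192.168.2.0/24
        # replication/backfill traffic, once dedicated NICs are in place
        #cluster network = 192.168.3.0/24
```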


Note
Ceph MON servers require accurate (or at least synchronized) system clocks and will mark the cluster with HEALTH_WARN if the pool detects that the servers are not within a second or two of each other. If you haven't already installed a time synchronization mechanism such as NTP, you really want to get that installed and configured before building the cluster.


Note
The /etc/init.d/ceph script for openrc only has a depend "after net" in place at the moment. It is quite possible that the ceph services on a host will be started before ntp-client and ntpd (or your preferred time service). We have opened a bug report with the gentoo ceph team to add "after ntp-client" to the depend function, in order to avoid having to restart ceph services after a boot if the clock was off.


We will be using four hosts in our Example implementation. All four will be MON servers with an initial quorum of 3 so that we can safely avoid a "split-brain" situation and still be able to run the cluster when a single server is rebooted.


FILE /etc/conf.d/net: Example with a multi-homed eth0 for host kroll1
#
#  An example multi-homed eth0 where 192.168.1 subnet is the entire LAN and access to the outside world
#  The 192.168.2 subnet is dedicated to the ceph cluster
#
config_eth0="192.168.1.10/24 192.168.2.1/24"
routes_eth0="default via 192.168.1.1"
dns_domain_eth0="example.com"
dns_servers_eth0="192.168.1.2 192.168.1.3"
dns_search_eth0="example.com"


FILE /etc/hosts: Example of a 4 node ceph cluster with hosts kroll1-4
# /etc/hosts: Local Host Database
#
# This file describes a number of aliases-to-address mappings for the
# local hosts that share this file.
#
# In the presence of the domain name service or NIS, this file may not be 
# consulted at all; see /etc/host.conf for the resolution order.
#

# IPv4 and IPv6 localhost aliases
127.0.0.1       localhost
::1             localhost
192.168.2.1     kroll1
192.168.2.2     kroll2
192.168.2.3     kroll3
192.168.2.4     kroll4

Our Example Ceph Cluster

We have chosen to roll out Ceph on a portion of our home network. The four kroll hosts are as follows:

  • kroll1 (aka Thufir) - An AMD FX9590 8 core CPU with 32GB of memory, 256GB SSD root drive and a 4x4TB SATA array formatted as a RAID5 btrfs with the default volume mounted on /thufirraid. kroll1 will act as our admin server since the ssh keys for its root user have been pushed out to the other nodes in their /root/.ssh/authorized_keys files. Kroll1 will act as a MON and OSD server since only slightly less than half of the btrfs array has been used.
  • kroll2 (aka Figo) - An AMD FX8350 8 core CPU with 16GB of memory, 256GB SSD root drive and a 4x3TB SATA array formatted as btrfs RAID1. Kroll2 will act as a MON and MDS server. We will not do an OSD server here since the array is already over 90% capacity. Also the Ceph developers have suggested that it is not a wise idea to run an MDS and an OSD on the same node.
  • kroll3 (aka Mater) - An AMD FX8350 8 core CPU with 16GB of memory, 256GB SSD and 4x4TB SATA array formatted as a RAID5 btrfs and default volume mounted on /materraid. /materraid was being kept as a mirror of /thufirraid using rsync -av --delete on a periodic basis. kroll3 will become a MON and an OSD server.
  • kroll4 (aka Tube) - An AMD A10-7850K APU with 16GB of memory, 256GB SSD and a 2x4TB SATA array formatted as a btrfs RAID1 mirror set with its default volume mounted on /tuberaid. As its real name suggests, kroll4 was originally set up as a MythTV box, but its filesystem is only averaging about 10% of its capacity. We will thus use kroll4 as a MON and OSD server.

Thufir, Mater and Tube run a gentoo stable desktop profile and are currently on kernel 3.12.21-gentoo-r1. Thufir and Mater are btrfs installs with new drives using the latest btrfs code, btrfs-progs-3.12-r1. Tube has been up and running a bit longer and had its btrfs array built under kernel 3.10.25-gentoo. Figo runs a gentoo unstable desktop profile and is currently on kernel 3.13.7-gentoo. It is also being used as an rsync mirror for thufir, but its array has been running for about 2 years. The reason for /figoraid running at 90% capacity is both older, smaller drives (3TB versus 4TB) and the fact that the version of btrfs available when it was built did not yet include RAID5 support.


Editing the ceph config file

We will be following the manual guide for ceph installation on their site. There is also a python-based script called ceph-deploy which is packaged for a number of distros but not directly available for gentoo. If you can manage to get it working, it will automate a good bit of the process of rolling out a server from your admin node.

FILE /etc/ceph/ceph.conf: Initial ceph config file
[global]
        # For version 0.54 and earlier, you may enable
        # authentication with the following setting.
        # Specifying `cephx` enables authentication;
        # and specifying `none` disables authentication.

        #auth supported = cephx

        # For version 0.55 and beyond, you must explicitly enable 
        # or disable authentication with "auth" entries in [global].
        cluster = ceph
        auth cluster required = cephx
        auth service required = cephx
        auth client required = cephx


[osd]
        osd journal size = 1000
        # uncomment the following line if you are mounting with ext4
        # filestore xattr use omap = true


        # For Bobtail (v 0.56) and subsequent versions, you may 
        # add settings for mkcephfs so that it will create and mount
        # the file system for you. Remove the comment `#` character for 
        # the following settings and replace the values in parenthesis 
        # with appropriate values, or leave the following settings commented 
        # out to accept the default values. You must specify the --mkfs 
        # option with mkcephfs in order for the deployment script to 
        # utilize the following settings, and you must define the 'devs'
        # option for each osd instance; see below.

        #osd mkfs type = {fs-type}
        #osd mkfs options {fs-type} = {mkfs options}   # default for xfs is "-f"
        #osd mount options {fs-type} = {mount options} # default mount option is "rw, noatime"


We use uuidgen to generate a new random uuid for the entire cluster. We rename the cluster from the default ceph to kroll to match our host naming scheme, and specify the 192.168.2 network as the "public" network for the cluster. Other default settings come from the manual install url mentioned earlier, including a default to replicate two copies of each object, with a minimum of 1 copy allowed when the cluster is in a "degraded" state.

We set a journal size in both the global and osd sections but leave the filestore stanza commented out, since we will be using btrfs for object stores instead of ext4. We also added a comment to clarify exactly what that stanza does.

We add a mon section where we specify the list of hostnames that will act as MONs along with their corresponding ip addresses. Port 6789 is the IANA-registered well-known port assigned to Ceph. The initial members stanza specifies the three hosts which will be necessary to form an initial quorum. The list of numbers corresponds to the mon sections that follow (mon.0, mon.1, etc.). These will also be used by the /etc/init.d/ceph startup script when figuring out which services are to be started on a host.

Note
The numeric mon naming convention is new in the Firefly release. Earlier releases used letters instead (e.g. mon.a) for the mon sections.


The rest of the ceph.conf file consists of subsections for the various mon, osd and mds servers that we will be implementing.

root #uuidgen
1798897a-f0c9-422d-86b3-d4933a12c7ac


FILE /etc/ceph/ceph.conf: Customized ceph config file
[global]
        # For version 0.54 and earlier, you may enable
        # authentication with the following setting.
        # Specifying `cephx` enables authentication;
        # and specifying `none` disables authentication.

        #auth supported = cephx

        # For version 0.55 and beyond, you must explicitly enable 
        # or disable authentication with "auth" entries in [global].
        fsid = 1798897a-f0c9-422d-86b3-d4933a12c7ac
        cluster = kroll
        public network = 192.168.2.0/24
        auth cluster required = cephx
        auth service required = cephx
        auth client required = cephx
        osd journal size = 1024

        #
        # default to replicating two copies of each object.  Allow a minimum
        # of one copy to be made when number of object stores is degraded
        #
        osd pool default size = 2
        osd pool default min size = 1
        #
        # defaults from  http://ceph.com/docs/master/install/manual-deployment/
        #
        osd pool default pg num = 333
        osd pool default pgp num = 333
        osd crush chooseleaf type = 1



[osd]
        osd journal size = 1024

        #
        #  This is a crutch for ext4 filesystems with limit of 4kb xattr data
        #  btrfs has no bounds for xattr store
        #
        # uncomment the following line if you are mounting with ext4
        #filestore xattr use omap = true


        # For Bobtail (v 0.56) and subsequent versions, you may 
        # add settings for mkcephfs so that it will create and mount
        # the file system for you. Remove the comment `#` character for 
        # the following settings and replace the values in parenthesis 
        # with appropriate values, or leave the following settings commented 
        # out to accept the default values. You must specify the --mkfs 
        # option with mkcephfs in order for the deployment script to 
        # utilize the following settings, and you must define the 'devs'
        # option for each osd instance; see below.

        #osd mkfs type = {fs-type}
        #osd mkfs options {fs-type} = {mkfs options}   # default for xfs is "-f"
        #osd mount options {fs-type} = {mount options} # default mount option is "rw, noatime"

[mon]
        mon host = kroll1, kroll2, kroll3, kroll4
        mon addr = 192.168.2.1:6789, 192.168.2.2:6789, 192.168.2.3:6789, 192.168.2.4:6789
        mon initial members = 0,1,2

[mon.0]
        host = kroll1
        mon addr = 192.168.2.1:6789

[mon.1]
        host = kroll2
        mon addr = 192.168.2.2:6789

[mon.2]
        host = kroll3
        mon addr = 192.168.2.3:6789

[mon.3]
        host = kroll4
        mon addr = 192.168.2.4:6789

[osd.0]
        host = kroll1

[osd.1]
        host = kroll3

[osd.2]
        host = kroll4

[mds.0]
        host = kroll2


After editing the file, we copy it to the other cluster members from our admin node kroll1 using scp:

root #scp ceph.conf root@kroll2:/etc/ceph
ceph.conf                                                         100% 2408     2.4KB/s   2.4KB/s   00:00    
root #scp ceph.conf root@kroll3:/etc/ceph
ceph.conf                                                         100% 2408     2.4KB/s   2.4KB/s   00:00    
root #scp ceph.conf root@kroll4:/etc/ceph
ceph.conf                                                         100% 2408     2.4KB/s   2.4KB/s   00:00    


/etc/conf.d/ceph file

There is also a conf file for the ceph service, but as of the Firefly release it only specifies the location of the ceph config file. In previous releases, a single /etc/init.d/ceph script started or stopped all of the services enabled for the site at once. The ebuild maintainer changed this for Firefly: individual Ceph services are now run through renamed symlinks to /etc/init.d/ceph, much as gentoo handles network devices with symlinks to net.lo.

FILE /etc/conf.d/ceph: Barebones ceph file for Firefly release
# Example

# default ceph conf file
ceph_conf="/etc/ceph/ceph.conf"


/etc/init.d/ceph script

As noted earlier when discussing MON servers, Ceph is dependent on clocks that are synchronized across the cluster. Until the ebuild maintainers update the script, consider editing your /etc/init.d/ceph script's depend function to include whatever service you may have running that has synched your clock. In this example edit to the firefly version of the script, we have added "after ntp-client" since we use the standard ntp ebuild and have ntp-client and ntp services in our default runlevel.

FILE /etc/init.d/ceph: openrc ceph script for firefly edited to depend on ntp-client
#!/sbin/openrc-run

ceph_conf="${ceph_conf:-/etc/ceph/ceph.conf}"
extra_commands="reload"
type=${RC_SVCNAME:5:3}
id=${RC_SVCNAME:9}
id=${id:-"0"}
command="/usr/bin/ceph-${type}"
pidfile="/run/ceph/${type}.${id}.pid"
command_args="-i ${id} --pid-file ${pidfile} -c ${ceph_conf}"

depend() {
        after net
        after ntp-client
        before netmount
}

is_type_valid() {
        case ${type} in
                mon|mds|osd) return 0;;
                *) return 1;;
        esac
}

start_pre() {
        checkpath -d -q $(dirname ${pidfile})
}

start() {
        ebegin "Starting Ceph ${type}.${id}"
        if ! is_type_valid ;then
                eerror "Please give valid Ceph Server Type: mds, mon, osd"
                return 1
        fi
        ${command} ${command_args}
        eend $?
}

stop() {
        ebegin "Stopping Ceph ${type}.${id}"
        start-stop-daemon --stop --pidfile ${pidfile}
        eend $?
}

reload() {
        ebegin "Reloading Ceph ${type}.${id}"
        start-stop-daemon --signal 1 --pidfile ${pidfile}
        eend $?
}
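The script derives the server type and instance id entirely from the name of the symlink it was invoked through, so the symlink names must follow the type.id sections in ceph.conf. The slicing can be verified outside of openrc:

```shell
# Emulate the parsing done by /etc/init.d/ceph for a symlink named
# ceph-mon.0: offset 5, length 3 is the type; offset 9 onward is the id.
RC_SVCNAME="ceph-mon.0"
type=${RC_SVCNAME:5:3}
id=${RC_SVCNAME:9}
echo "type=$type id=$id"   # type=mon id=0

# Enabling mon.0 and osd.1 on a host would then look like:
#   ln -s ceph /etc/init.d/ceph-mon.0
#   ln -s ceph /etc/init.d/ceph-osd.1
#   rc-update add ceph-mon.0 default
#   rc-update add ceph-osd.1 default
```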

Creating Keyrings For MON rollout

Ceph uses its own shared-secret scheme for communications among cluster members. We must generate keyring files that will then be distributed to the servers that make up the cluster. The keyrings are generated by the ceph-authtool command. The first keyring is for the MON servers; the manual install url has it going to a file on /tmp, but we are more inclined to keep it around by parking it in /etc/ceph.

root #ceph-authtool --create-keyring /etc/ceph/ceph.mon.keyring --gen-key -n mon. --cap mon 'allow *'
creating /etc/ceph/ceph.mon.keyring


The result is a readable text file:

FILE /etc/ceph/ceph.mon.keyring: ceph-authtool generated file
[mon.]
        key = (redacted key text)
        caps mon = "allow *"


Next we create an admin keyring file which goes into /etc/ceph/ceph.client.admin.keyring.

root #ceph-authtool --create-keyring /etc/ceph/ceph.client.admin.keyring --gen-key -n client.admin --set-uid=0 --cap mon 'allow *' --cap osd 'allow *' --cap mds 'allow'
creating /etc/ceph/ceph.client.admin.keyring


The resulting text file may actually be shorter than the complicated command line used to create it. Note that each --gen-key invocation produces a new random secret; the key is not derived from the uuid we parked in the /etc/ceph/ceph.conf config file.

FILE /etc/ceph/ceph.client.admin.keyring: ceph-authtool generated admin file
[client.admin]
        key = (redacted key text)
        auid = 0
        caps mds = "allow"
        caps mon = "allow *"
        caps osd = "allow *"


Note
The default ownership for the client.admin.keyring file is root:root and mode 600. You might consider changing the mode to either 660 or 640 and then changing the group to something like disk. This will allow non-root users whom you trust to do disk maintenance (i.e. mount/unmount) to use ceph admin commands.
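The permission change can be rehearsed on a scratch file before touching the real keyring; the disk group below is just one reasonable choice of trusted group:

```shell
# Rehearse the mode change on a throwaway file
f=$(mktemp)
chmod 640 "$f"
stat -c '%a' "$f"   # 640
rm -f "$f"

# On the real keyring (as root) it would be:
#   chmod 640 /etc/ceph/ceph.client.admin.keyring
#   chgrp disk /etc/ceph/ceph.client.admin.keyring
```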


This next command is just as annoying, because it wasn't until after running it that we discovered that the auth tool simply appends the client admin keyring file contents to the mon keyring file.

root #ceph-authtool /etc/ceph/ceph.mon.keyring --import-keyring /etc/ceph/ceph.client.admin.keyring
importing contents of /etc/ceph/ceph.client.admin.keyring into /etc/ceph/ceph.mon.keyring


FILE /etc/ceph/ceph.mon.keyring (mon keyring with client admin contents)
[mon.]
        key = (redacted)
        caps mon = "allow *"
[client.admin]
        key = (redacted)
        auid = 0
        caps mds = "allow"
        caps mon = "allow *"
        caps osd = "allow *"


We push the mon and client keyrings out to /etc/ceph on the other kroll hosts.

root #scp ceph.mon.keyring ceph.client.admin.keyring kroll2:/etc/ceph
ceph.mon.keyring                                                  100%  214     0.2KB/s   0.2KB/s   00:00    
ceph.client.admin.keyring                                         100%  137     0.1KB/s   0.2KB/s   00:00    
root #scp ceph.mon.keyring ceph.client.admin.keyring kroll3:/etc/ceph
ceph.mon.keyring                                                  100%  214     0.2KB/s   0.2KB/s   00:00    
ceph.client.admin.keyring                                         100%  137     0.1KB/s   0.2KB/s   00:00    
root #scp ceph.mon.keyring ceph.client.admin.keyring kroll4:/etc/ceph
ceph.mon.keyring                                                  100%  214     0.2KB/s   0.2KB/s   00:00    
ceph.client.admin.keyring                                         100%  137     0.1KB/s   0.2KB/s   00:00    
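With more member hosts this copying is easy to loop. A sketch that echoes each command for review, using the host names from above; drop the echo to actually perform the copies:

```shell
# dry-run: print the scp command for each target host
hosts="kroll2 kroll3 kroll4"
for host in $hosts; do
    echo scp /etc/ceph/ceph.mon.keyring /etc/ceph/ceph.client.admin.keyring "$host":/etc/ceph
done
```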

Creating the initial monmap file

The OSD and MDS servers use /etc/ceph/ceph.conf for discovering MON servers, but the MON servers themselves have a much stricter consistency scheme in order to form and maintain their quorum. Once up and running, the quorum uses a majority-rule voting algorithm called Paxos, but the MONs do work from an initial binary file called a monmap when you first set up the Ceph cluster.
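The majority rule is simple arithmetic: more than half of the monitors must agree. A quick sketch of the quorum size for a few monitor counts (plain majority math, not a Ceph command; note that with the four mons we end up running, three must be up):

```shell
# majority quorum: floor(n/2) + 1 monitors must agree
for mons in 3 4 5; do
    echo "$mons mons -> quorum of $(( mons / 2 + 1 ))"
done
```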

The manual deployment page covers the example where only a single MON is used to form the quorum. We referred to the Monitor Config reference page and the Monitor Bootstrap page it refers to when creating our scheme with an initial quorum of three MONs.

The monmaptool command is used to create the initial monmap binary file. We essentially give it the addresses corresponding to our mon.0, mon.1 and mon.2 initial members plus the cluster fsid from the /etc/ceph/ceph.conf file. We will park this file in /etc/ceph and then pass it around to the right place when we configure our MON hosts.

root #monmaptool --create --add 0 192.168.2.1 --add 1 192.168.2.2 --add 2 192.168.2.3 --fsid 1798897a-f0c9-422d-86b3-d4933a12c7ac initial-monmap
monmaptool: monmap file initial-monmap
monmaptool: set fsid to 1798897a-f0c9-422d-86b3-d4933a12c7ac
monmaptool: writing epoch 0 to initial-monmap (3 monitors)
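Rather than retyping the fsid on the monmaptool command line, it can be pulled straight out of ceph.conf. A sketch using a throwaway copy of the config so it is self-contained; the awk pattern assumes an 'fsid = <uuid>' line as in our [global] section, and for real use you would point it at /etc/ceph/ceph.conf:

```shell
# stand-in for /etc/ceph/ceph.conf so the sketch is self-contained
conf=$(mktemp)
cat > "$conf" <<'EOF'
[global]
fsid = 1798897a-f0c9-422d-86b3-d4933a12c7ac
EOF
# split on ' = ' and take the value of the fsid key
fsid=$(awk -F' *= *' '$1 == "fsid" {print $2; exit}' "$conf")
echo "$fsid"    # prints 1798897a-f0c9-422d-86b3-d4933a12c7ac
```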


We push the initial monmap file out to the other /etc/ceph directories on kroll2, 3 and 4.

root #scp initial-monmap kroll2:/etc/ceph
initial-monmap                                                    100%  481     0.5KB/s   0.5KB/s   00:00    
root #scp initial-monmap kroll3:/etc/ceph
initial-monmap                                                    100%  481     0.5KB/s   0.5KB/s   00:00    
root #scp initial-monmap kroll4:/etc/ceph
initial-monmap                                                    100%  481     0.5KB/s   0.5KB/s   00:00    

Creating kroll1 server mon.0

Ceph servers look for their file trees in /var/lib/ceph. Mon servers look for their server id subtree under /var/lib/ceph/mon/ceph-N, where N is the id that we designated for the server in /etc/ceph/ceph.conf. Kroll1 will host mon.0, so we create /var/lib/ceph/mon/ceph-0 for it. This implies that we will be using our 256GB SSD root system device for mon.0's i/o. Later on, when we create the OSD for kroll1, we will create and mount a btrfs subvolume for it to use. Otherwise the object store would default to eating us out of house and home on our system drive!

root #mkdir -p /var/lib/ceph/mon/ceph-0

Before continuing, you may want to look in /var/log/ceph and clear out anything that may be there. The next command will create an empty /var/log/ceph/ceph-mon.0.log file if it doesn't already exist.

root #cd /var/log/ceph
root #rm -rf *
root #cd /etc/ceph

The ceph-mon command will populate the ceph-0 directory with a copy of our ceph.mon.keyring file renamed to keyring, plus a store.db directory tree: a small key/value database holding the monitor's Paxos state, seeded from the contents of the initial monmap file.

root #ceph-mon --mkfs -i 0 --monmap initial-monmap --keyring ceph.mon.keyring
ceph-mon: set fsid to 1798897a-f0c9-422d-86b3-d4933a12c7ac
ceph-mon: created monfs at /var/lib/ceph/mon/ceph-0 for mon.0
root #cd /var/lib/ceph/mon/ceph-0
root #ls -lR
.:
total 8
-rw-r--r-- 1 root root   77 Jul 14 13:13 keyring
drwxr-xr-x 2 root root 4096 Jul 14 13:13 store.db

./store.db:
total 12
-rw-r--r-- 1 root root 1069 Jul 14 13:13 000003.log
-rw-r--r-- 1 root root   16 Jul 14 13:13 CURRENT
-rw-r--r-- 1 root root    0 Jul 14 13:13 LOCK
-rw-r--r-- 1 root root   50 Jul 14 13:13 MANIFEST-000002
root #cat keyring
[mon.]
        key = (redacted)
        caps mon = "allow *"


We set up the mon.0 server startup in /etc/init.d by softlinking. Examination of the ceph script reveals that the Gentoo ebuild developer was merely picking the server type and id from string positions in the script file name. Thus it isn't crucial to use the "." and the "-" as we did here, but it does make things readable:

root #cd /etc/init.d
root #ln -s ceph ceph.mon-0
root #rc-update add ceph.mon-0 default
 * service ceph.mon-0 added to runlevel default
Note
We won't start the server until the other mon hosts have been set up. Otherwise it would just sit stalled looking for its friends and periodically complaining into its log file in /var/log/ceph.
Note
This author is not contemplating a move to systemd any time soon after prior ugly experiences with it when it first started to appear on Fedora. It is left as an exercise to the reader to figure out what sort of scheme will be necessary when not using OpenRC.
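The name parsing that the init script does can be sketched in plain shell. This is a simplified illustration of the convention, not the actual script:

```shell
# split a symlink name like 'ceph.mon-0' into daemon type and id
name=ceph.mon-0            # the symlink name created above
svc=${name#ceph.}          # strip the 'ceph.' prefix -> mon-0
svc_type=${svc%%-*}        # text before the '-'      -> mon
svc_id=${svc##*-}          # text after the '-'       -> 0
echo "daemon type: $svc_type, id: $svc_id"
```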

We repeated the same process to create mon.1, mon.2 and mon.3 on the other kroll member hosts.

kroll2
root #mkdir -p /var/lib/ceph/mon/ceph-1
root #ceph-mon --mkfs -i 1 --monmap initial-monmap --keyring ceph.mon.keyring
ceph-mon: set fsid to 1798897a-f0c9-422d-86b3-d4933a12c7ac
ceph-mon: created monfs at /var/lib/ceph/mon/ceph-1 for mon.1
root #cd /var/lib/ceph/mon
root #ls -l ceph-1
total 8
-rw-r--r-- 1 root root   77 Jul 14 13:44 keyring
drwxr-xr-x 2 root root 4096 Jul 14 13:44 store.db
root #cd /etc/init.d
root #ln -s ceph ceph.mon-1
root #rc-update add ceph.mon-1 default
* service ceph.mon-1 added to runlevel default


kroll3
root #mkdir -p /var/lib/ceph/mon/ceph-2
root #ceph-mon --mkfs -i 2 --monmap initial-monmap --keyring ceph.mon.keyring
ceph-mon: set fsid to 1798897a-f0c9-422d-86b3-d4933a12c7ac
ceph-mon: created monfs at /var/lib/ceph/mon/ceph-2 for mon.2
root #cd /etc/init.d
root #ln -s ceph ceph.mon-2
root #rc-update add ceph.mon-2 default
* service ceph.mon-2 added to runlevel default


kroll4
root #mkdir -p /var/lib/ceph/mon/ceph-3
root #ceph-mon --mkfs -i 3 --monmap initial-monmap --keyring ceph.mon.keyring
ceph-mon: set fsid to 1798897a-f0c9-422d-86b3-d4933a12c7ac
ceph-mon: created monfs at /var/lib/ceph/mon/ceph-3 for mon.3
root #cd /etc/init.d
root #ln -s ceph ceph.mon-3
root #rc-update add ceph.mon-3 default
* service ceph.mon-3 added to runlevel default


Starting the Mon Servers

With all four kroll hosts configured with mons, we now go back and start the services beginning with mon.0 on kroll1.

root #cd /var/log/ceph
root #/etc/init.d/ceph.mon-0 start
* Caching service dependencies ...                                                                     [ ok ]
* Starting Ceph mon.0 ...                                                                              [ ok ]
root #ls -l
total 4
-rw-r--r-- 1 root root 1411 Jul 14 14:55 ceph-mon.0.log
root #cat ceph-mon.0.log
2014-07-14 14:55:49.806317 7f7928c81780  0 ceph version 0.80.1 (a38fe1169b6d2ac98b427334c12d7cf81f809b74), process ceph-mon, pid 25102
2014-07-14 14:55:49.815592 7f7928c81780  0 mon.0 does not exist in monmap, will attempt to join an existing cluster
2014-07-14 14:55:49.815719 7f7928c81780  0 using public_addr 192.168.2.1:0/0 -> 192.168.2.1:6789/0
2014-07-14 14:55:49.815751 7f7928c81780  0 starting mon.0 rank -1 at 192.168.2.1:6789/0 mon_data /var/lib/ceph/mon/ceph-0 fsid 1798897a-f0c9-422d-86b3-d4933a12c7ac
2014-07-14 14:55:49.815892 7f7928c81780  1 mon.0@-1(probing) e0 preinit fsid 1798897a-f0c9-422d-86b3-d4933a12c7ac
2014-07-14 14:55:49.815927 7f7928c81780  1 mon.0@-1(probing) e0  initial_members 0,1,2, filtering seed monmap
2014-07-14 14:55:49.817013 7f7928c81780  0 mon.0@-1(probing) e0  my rank is now 0 (was -1)
2014-07-14 14:55:49.817558 7f7928c79700  0 -- 192.168.2.1:6789/0 >> 0.0.0.0:0/1 pipe(0x2182280 sd=18 :0 s=1 pgs=0 cs=0 l=0 c=0x2138dc0).fault
2014-07-14 14:55:49.817569 7f791fc21700  0 -- 192.168.2.1:6789/0 >> 0.0.0.0:0/2 pipe(0x2182a00 sd=17 :0 s=1 pgs=0 cs=0 l=0 c=0x2139a20).fault
2014-07-14 14:55:49.817700 7f791fa1f700  0 -- 192.168.2.1:6789/0 >> 192.168.2.3:6789/0 pipe(0x2183180 sd=13 :0 s=1 pgs=0 cs=0 l=0 c=0x21394a0).fault
2014-07-14 14:55:49.817674 7f791fb20700  0 -- 192.168.2.1:6789/0 >> 192.168.2.2:6789/0 pipe(0x2183680 sd=16 :0 s=1 pgs=0 cs=0 l=0 c=0x2139760).fault

Notice the ".fault" entries at the bottom of the log. We will continue to see them until the quorum is established. We will also see them as output when attempting to run monitor commands such as ceph -s until there is a quorum.

Starting mon.1 on kroll2

We start up the mon.1 server on kroll2. The mon log for this server will show it discovering the kroll1 mon.0 peer server, and the two start using Paxos to hold a quorum election. Until mon.2 gets spun up, we still won't have a quorum.

root #cd /var/log/ceph
root #/etc/init.d/ceph.mon-1 start
 * Caching service dependencies ...                                                                     [ ok ]
 * Starting Ceph mon.1 ...                                                                              [ ok ]
root #cat ceph-mon.1.log
2014-07-14 15:01:03.087006 7f680d91b7c0  0 ceph version 0.80.1 (a38fe1169b6d2ac98b427334c12d7cf81f809b74), process ceph-mon, pid 18579
2014-07-14 15:01:03.110960 7f680d91b7c0  0 mon.1 does not exist in monmap, will attempt to join an existing cluster
2014-07-14 15:01:03.111291 7f680d91b7c0  0 using public_addr 192.168.2.2:0/0 -> 192.168.2.2:6789/0
2014-07-14 15:01:03.111377 7f680d91b7c0  0 starting mon.1 rank -1 at 192.168.2.2:6789/0 mon_data /var/lib/ceph/mon/ceph-1 fsid 1798897a-f0c9-422d-86b3-d4933a12c7ac
2014-07-14 15:01:03.111717 7f680d91b7c0  1 mon.1@-1(probing) e0 preinit fsid 1798897a-f0c9-422d-86b3-d4933a12c7ac
2014-07-14 15:01:03.111825 7f680d91b7c0  1 mon.1@-1(probing) e0  initial_members 0,1,2, filtering seed monmap
2014-07-14 15:01:03.114379 7f680d91b7c0  0 mon.1@-1(probing) e0  my rank is now 0 (was -1)
2014-07-14 15:01:03.115191 7f6804881700  0 -- 192.168.2.2:6789/0 >> 0.0.0.0:0/2 pipe(0x2d30780 sd=20 :0 s=1 pgs=0 cs=0 l=0 c=0x2d6ba20).fault
2014-07-14 15:01:03.115208 7f680d8b9700  0 -- 192.168.2.2:6789/0 >> 0.0.0.0:0/1 pipe(0x2d30c80 sd=21 :0 s=1 pgs=0 cs=0 l=0 c=0x2d6adc0).fault
2014-07-14 15:01:03.115672 7f680467f700  0 -- 192.168.2.2:6789/0 >> 192.168.2.3:6789/0 pipe(0x2d30f00 sd=24 :0 s=1 pgs=0 cs=0 l=0 c=0x2d6b4a0).fault
2014-07-14 15:01:03.118087 7f6803d7d700  0 -- 192.168.2.2:6789/0 >> 192.168.2.1:6789/0 pipe(0x2d31900 sd=25 :6789 s=0 pgs=0 cs=0 l=0 c=0x2ddc000).accept connect_seq 0 vs existing 0 state wait
2014-07-14 15:01:03.120547 7f6805883700  1 mon.1@0(probing) e0  adding peer 192.168.2.1:6789/0 to list of hints
2014-07-14 15:01:03.120680 7f6805883700  1 mon.1@0(probing) e0  adding peer 192.168.2.1:6789/0 to list of hints
                      .
                      .
                      .
2014-07-14 15:01:03.133469 7f6805883700  1 mon.1@0(probing) e0  adding peer 192.168.2.1:6789/0 to list of hints
2014-07-14 15:01:03.133516 7f6805883700  1 mon.1@0(probing) e0  adding peer 192.168.2.1:6789/0 to list of hints
2014-07-14 15:01:03.133575 7f6805883700  1 mon.1@0(probing) e0  learned initial mon 0 addr 192.168.2.1:6789/0
2014-07-14 15:01:03.133584 7f6805883700  0 mon.1@0(probing) e0  my rank is now 1 (was 0)
2014-07-14 15:01:03.134096 7f6805883700  0 -- 192.168.2.2:6789/0 submit_message mon_probe(reply 1798897a-f0c9-422d-86b3-d4933a12c7ac name 1 paxos( fc 0 lc 0 ) new) v6 remote, 192.168.2.1:6789/0, failed lossy con, dropping message 0x2e4db80
2014-07-14 15:01:03.134139 7f6805883700  0 -- 192.168.2.2:6789/0 submit_message mon_probe(reply 1798897a-f0c9-422d-86b3-d4933a12c7ac name 1 paxos( fc 0 lc 0 ) new) v6 remote, 192.168.2.1:6789/0, failed lossy con, dropping message 0x2f19400
2014-07-14 15:01:03.134155 7f6803d7d700  0 -- 192.168.2.2:6789/0 >> 0.0.0.0:0/2 pipe(0x2e4c500 sd=21 :0 s=1 pgs=0 cs=0 l=0 c=0x2ddc580).fault
2014-07-14 15:01:03.134250 7f680467f700  0 -- 192.168.2.2:6789/0 >> 192.168.2.3:6789/0 pipe(0x2e4cf00 sd=25 :0 s=1 pgs=0 cs=0 l=0 c=0x2ddc420).fault
2014-07-14 15:01:03.135037 7f6805883700  0 log [INF] : mon.1 calling new monitor election
2014-07-14 15:01:03.135083 7f6805883700  1 mon.1@1(electing).elector(1) init, last seen epoch 1
2014-07-14 15:01:03.135161 7f6805883700  1 mon.1@1(electing).paxos(paxos recovering c 0..0) is_readable now=2014-07-14 15:01:03.135162 lease_expire=0.000000 has v0 lc 0


Starting mon.2 on kroll3
root #cd /var/log/ceph
root #/etc/init.d/ceph.mon-2 start
 * Caching service dependencies ...                                                                     [ ok ]
 * Starting Ceph mon.2 ...                                                                              [ ok ]
root #cat ceph-mon.2.log
2014-07-14 15:06:31.837248 7f0b3846c780  0 ceph version 0.80.1 (a38fe1169b6d2ac98b427334c12d7cf81f809b74), process ceph-mon, pid 25431
2014-07-14 15:06:31.875796 7f0b3846c780  0 mon.2 does not exist in monmap, will attempt to join an existing cluster
2014-07-14 15:06:31.876179 7f0b3846c780  0 using public_addr 192.168.2.3:0/0 -> 192.168.2.3:6789/0
2014-07-14 15:06:31.876298 7f0b3846c780  0 starting mon.2 rank -1 at 192.168.2.3:6789/0 mon_data /var/lib/ceph/mon/ceph-2 fsid 1798897a-f0c9-422d-86b3-d4933a12c7ac
2014-07-14 15:06:31.876708 7f0b3846c780  1 mon.2@-1(probing) e0 preinit fsid 1798897a-f0c9-422d-86b3-d4933a12c7ac
2014-07-14 15:06:31.876834 7f0b3846c780  1 mon.2@-1(probing) e0  initial_members 0,1,2, filtering seed monmap
2014-07-14 15:06:31.878883 7f0b3846c780  0 mon.2@-1(probing) e0  my rank is now 0 (was -1)
2014-07-14 15:06:31.879515 7f0b2f409700  0 -- 192.168.2.3:6789/0 >> 0.0.0.0:0/2 pipe(0x2b02a00 sd=13 :0 s=1 pgs=0 cs=0 l=0 c=0x2ab7a20).fault
2014-07-14 15:06:31.879519 7f0b38464700  0 -- 192.168.2.3:6789/0 >> 0.0.0.0:0/1 pipe(0x2b02280 sd=14 :0 s=1 pgs=0 cs=0 l=0 c=0x2ab6dc0).fault
2014-07-14 15:06:31.882536 7f0b2e804700  0 -- 192.168.2.3:6789/0 >> 192.168.2.1:6789/0 pipe(0x2b03b80 sd=26 :6789 s=0 pgs=0 cs=0 l=0 c=0x2b32160).accept connect_seq 0 vs existing 0 state wait
2014-07-14 15:06:31.882697 7f0b2e905700  0 -- 192.168.2.3:6789/0 >> 192.168.2.2:6789/0 pipe(0x2b03900 sd=25 :6789 s=0 pgs=0 cs=0 l=0 c=0x2b32000).accept connect_seq 0 vs existing 0 state wait
2014-07-14 15:06:31.884022 7f0b3040b700  1 mon.2@0(probing) e0  adding peer 192.168.2.1:6789/0 to list of hints
2014-07-14 15:06:31.884124 7f0b3040b700  1 mon.2@0(probing) e0  adding peer 192.168.2.2:6789/0 to list of hints
2014-07-14 15:06:31.884281 7f0b3040b700  1 mon.2@0(probing) e0  adding peer 192.168.2.1:6789/0 to list of hints
2014-07-14 15:06:31.884361 7f0b3040b700  1 mon.2@0(probing) e0  adding peer 192.168.2.2:6789/0 to list of hints
2014-07-14 15:06:31.884434 7f0b3040b700  1 mon.2@0(probing) e0  adding peer 192.168.2.1:6789/0 to list of hints
2014-07-14 15:06:31.884516 7f0b3040b700  1 mon.2@0(probing) e0  adding peer 192.168.2.1:6789/0 to list of hints
2014-07-14 15:06:31.884665 7f0b3040b700  0 mon.2@0(probing) e1  my rank is now -1 (was 0)
2014-07-14 15:06:31.885501 7f0b3040b700  0 -- 192.168.2.3:6789/0 submit_message mon_probe(reply 1798897a-f0c9-422d-86b3-d4933a12c7ac name 2 paxos( fc 0 lc 0 ) new) v6 remote, 192.168.2.1:6789/0, failed lossy con, dropping message 0x2b03400
2014-07-14 15:06:31.885631 7f0b2f409700  0 -- 192.168.2.3:6789/0 >> 0.0.0.0:0/2 pipe(0x2b02a00 sd=15 :0 s=1 pgs=0 cs=0 l=0 c=0x2b32160).fault
2014-07-14 15:06:31.886047 7f0b2e804700  0 -- 192.168.2.3:6789/0 >> 192.168.2.1:6789/0 pipe(0x2b2a000 sd=16 :6789 s=0 pgs=0 cs=0 l=0 c=0x2b32580).accept connect_seq 2 vs existing 0 state connecting
2014-07-14 15:06:31.886061 7f0b2e804700  0 -- 192.168.2.3:6789/0 >> 192.168.2.1:6789/0 pipe(0x2b2a000 sd=16 :6789 s=0 pgs=0 cs=0 l=0 c=0x2b32580).accept we reset (peer sent cseq 2, 0x2b03180.cseq = 0), sending RESETSESSION
2014-07-14 15:06:31.886664 7f0b2e804700  0 -- 192.168.2.3:6789/0 >> 192.168.2.1:6789/0 pipe(0x2b2a000 sd=16 :6789 s=0 pgs=0 cs=0 l=0 c=0x2b32580).accept connect_seq 0 vs existing 0 state connecting
2014-07-14 15:06:31.886722 7f0b2e905700  0 -- 192.168.2.3:6789/0 >> 192.168.2.1:6789/0 pipe(0x2b03180 sd=13 :33205 s=4 pgs=0 cs=0 l=0 c=0x2b32000).connect got RESETSESSION but no longer connecting
2014-07-14 15:06:33.886683 7f0b3040b700  0 mon.2@-1(probing) e2  my rank is now 2 (was -1)
2014-07-14 15:06:33.888244 7f0b2f409700  0 -- 192.168.2.3:6789/0 >> 192.168.2.2:6789/0 pipe(0x2b2a500 sd=15 :6789 s=0 pgs=0 cs=0 l=0 c=0x2b32420).accept connect_seq 2 vs existing 0 state connecting
2014-07-14 15:06:33.888268 7f0b2f409700  0 -- 192.168.2.3:6789/0 >> 192.168.2.2:6789/0 pipe(0x2b2a500 sd=15 :6789 s=0 pgs=0 cs=0 l=0 c=0x2b32420).accept we reset (peer sent cseq 2, 0x2b03680.cseq = 0), sending RESETSESSION
2014-07-14 15:06:33.888297 7f0b2e804700  0 -- 192.168.2.3:6789/0 >> 192.168.2.1:6789/0 pipe(0x2b2a280 sd=16 :6789 s=0 pgs=0 cs=0 l=0 c=0x2b32dc0).accept connect_seq 2 vs existing 0 state connecting
2014-07-14 15:06:33.888318 7f0b2e804700  0 -- 192.168.2.3:6789/0 >> 192.168.2.1:6789/0 pipe(0x2b2a280 sd=16 :6789 s=0 pgs=0 cs=0 l=0 c=0x2b32dc0).accept we reset (peer sent cseq 2, 0x2b03400.cseq = 0), sending RESETSESSION
2014-07-14 15:06:33.888859 7f0b2f409700  0 -- 192.168.2.3:6789/0 >> 192.168.2.2:6789/0 pipe(0x2b2a500 sd=15 :6789 s=0 pgs=0 cs=0 l=0 c=0x2b32420).accept connect_seq 0 vs existing 0 state connecting
2014-07-14 15:06:33.888956 7f0b2f308700  0 -- 192.168.2.3:6789/0 >> 192.168.2.2:6789/0 pipe(0x2b03680 sd=14 :60500 s=4 pgs=0 cs=0 l=0 c=0x2ab6dc0).connect got RESETSESSION but no longer connecting
2014-07-14 15:06:33.914131 7f0b3040b700  1 mon.2@2(synchronizing).paxosservice(pgmap 1..2) refresh upgraded, format 0 -> 1
2014-07-14 15:06:33.914186 7f0b3040b700  1 mon.2@2(synchronizing).pg v0 on_upgrade discarding in-core PGMap
2014-07-14 15:06:33.917087 7f0b3040b700  0 mon.2@2(synchronizing).mds e1 print_map
epoch   1
flags   0
created 2014-07-14 15:01:18.167385
modified        2014-07-14 15:01:18.167410
tableserver     0
root    0
session_timeout 60
session_autoclose       300
max_file_size   1099511627776
last_failure    0
last_failure_osd_epoch  0
compat  compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap}
max_mds 1
in
up      {}
failed
stopped
data_pools      0
metadata_pool   1
inline_data     disabled
 
2014-07-14 15:06:33.917425 7f0b3040b700  1 mon.2@2(synchronizing).osd e1 e1: 0 osds: 0 up, 0 in
2014-07-14 15:06:33.922385 7f0b3040b700  0 mon.2@2(synchronizing).osd e1 crush map has features 1107558400, adjusting msgr requires
2014-07-14 15:06:33.922412 7f0b3040b700  0 mon.2@2(synchronizing).osd e1 crush map has features 1107558400, adjusting msgr requires
2014-07-14 15:06:33.922417 7f0b3040b700  0 mon.2@2(synchronizing).osd e1 crush map has features 1107558400, adjusting msgr requires
2014-07-14 15:06:33.922421 7f0b3040b700  0 mon.2@2(synchronizing).osd e1 crush map has features 1107558400, adjusting msgr requires
2014-07-14 15:06:33.922868 7f0b3040b700  1 mon.2@2(synchronizing).paxosservice(auth 1..1) refresh upgraded, format 0 -> 1
2014-07-14 15:06:33.929984 7f0b3040b700  0 log [INF] : mon.2 calling new monitor election
2014-07-14 15:06:33.930106 7f0b3040b700  1 mon.2@2(electing).elector(1) init, last seen epoch 1
2014-07-14 15:06:33.930366 7f0b3040b700  1 mon.2@2(electing).paxos(paxos recovering c 1..11) is_readable now=2014-07-14 15:06:33.930368 lease_expire=0.000000 has v0 lc 11
2014-07-14 15:06:33.930384 7f0b3040b700  1 mon.2@2(electing).paxos(paxos recovering c 1..11) is_readable now=2014-07-14 15:06:33.930386 lease_expire=0.000000 has v0 lc 11
2014-07-14 15:06:36.931021 7f0b3040b700  1 mon.2@2(peon).paxos(paxos active c 1..11) is_readable now=2014-07-14 15:06:36.931027 lease_expire=2014-07-14 15:06:41.926571 has v0 lc 11
2014-07-14 15:06:36.931058 7f0b3040b700  1 mon.2@2(peon).paxos(paxos active c 1..11) is_readable now=2014-07-14 15:06:36.931060 lease_expire=2014-07-14 15:06:41.926571 has v0 lc 11
2014-07-14 15:06:36.999210 7f0b3040b700  1 mon.2@2(peon).paxos(paxos active c 1..12) is_readable now=2014-07-14 15:06:36.999212 lease_expire=2014-07-14 15:06:41.988409 has v0 lc 12
2014-07-14 15:06:39.988946 7f0b3040b700  1 mon.2@2(peon).paxos(paxos active c 1..12) is_readable now=2014-07-14 15:06:39.988948 lease_expire=2014-07-14 15:06:44.988542 has v0 lc 12
2014-07-14 15:06:42.989047 7f0b3040b700  1 mon.2@2(peon).paxos(paxos active c 1..12) is_readable now=2014-07-14 15:06:42.989049 lease_expire=2014-07-14 15:06:47.988632 has v0 lc 12

There is now also a ceph.log for the cluster with a briefer summary. There is a warning that the system drive is over 70% full on kroll1 since we have also been using it for a /home directory there. We may migrate that to the btrfs array later if it becomes necessary.

root #cat ceph.log
2014-07-14 15:01:18.238250 mon.0 192.168.2.1:6789/0 11 : [INF] pgmap v2: 192 pgs: 192 creating; 0 bytes data, 0 kB used, 0 kB / 0 kB avail
2014-07-14 15:01:49.818434 mon.0 192.168.2.1:6789/0 12 : [WRN] reached concerning levels of available space on local monitor storage (28% free)
2014-07-14 15:01:03.135041 mon.1 192.168.2.2:6789/0 1 : [INF] mon.1 calling new monitor election
2014-07-14 15:01:13.152781 mon.1 192.168.2.2:6789/0 2 : [INF] mon.1 calling new monitor election
2014-07-14 15:06:31.908086 mon.0 192.168.2.1:6789/0 13 : [INF] mon.0 calling new monitor election
2014-07-14 15:06:33.929996 mon.2 192.168.2.3:6789/0 1 : [INF] mon.2 calling new monitor election
2014-07-14 15:06:36.916720 mon.0 192.168.2.1:6789/0 14 : [INF] mon.0@0 won leader election with quorum 0,1,2
2014-07-14 15:06:36.926950 mon.0 192.168.2.1:6789/0 15 : [INF] monmap e2: 3 mons at {0=192.168.2.1:6789/0,1=192.168.2.2:6789/0,2=192.168.2.3:6789/0}
2014-07-14 15:06:36.927070 mon.0 192.168.2.1:6789/0 16 : [INF] pgmap v2: 192 pgs: 192 creating; 0 bytes data, 0 kB used, 0 kB / 0 kB avail
2014-07-14 15:06:36.927174 mon.0 192.168.2.1:6789/0 17 : [INF] mdsmap e1: 0/0/1 up
2014-07-14 15:06:36.927293 mon.0 192.168.2.1:6789/0 18 : [INF] osdmap e1: 0 osds: 0 up, 0 in
2014-07-14 15:06:49.820264 mon.0 192.168.2.1:6789/0 19 : [WRN] reached concerning levels of available space on local monitor storage (28% free)

The commands in the manual install page to check cluster sanity, ceph osd lspools and ceph -s, will now work, but of course we don't have any OSDs spun up yet.

root #ceph osd lspools
0 data,1 metadata,2 rbd,
root #ceph -s
    cluster 1798897a-f0c9-422d-86b3-d4933a12c7ac
     health HEALTH_ERR 192 pgs stuck inactive; 192 pgs stuck unclean; no osds
     monmap e2: 3 mons at {0=192.168.2.1:6789/0,1=192.168.2.2:6789/0,2=192.168.2.3:6789/0}, election epoch 6, quorum 0,1,2 0,1,2
     osdmap e1: 0 osds: 0 up, 0 in
      pgmap v2: 192 pgs, 3 pools, 0 bytes data, 0 objects
            0 kB used, 0 kB / 0 kB avail
                 192 creating
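The 192 figure in the pgmap line is not arbitrary: the three default pools (data, metadata, rbd) were each created with 64 placement groups, the per-pool default in this release (inferred from this output; newer releases use different defaults). A quick check of the arithmetic:

```shell
# 3 default pools (data, metadata, rbd), 64 placement groups each
pools=3
pg_per_pool=64
echo $(( pools * pg_per_pool ))   # prints 192, matching 'ceph -s'
```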
Starting mon.3 on kroll4
root #cd /var/log/ceph
root #/etc/init.d/ceph.mon-3 start
 * Caching service dependencies ...                                                                     [ ok ]
 * Starting Ceph mon.3 ...                                                                              [ ok ]
root #cat ceph-mon.3.log
2014-07-14 15:22:30.350302 7fdf4e653780  0 ceph version 0.80.1 (a38fe1169b6d2ac98b427334c12d7cf81f809b74), process ceph-mon, pid 7963
2014-07-14 15:22:30.374728 7fdf4e653780  0 mon.3 does not exist in monmap, will attempt to join an existing cluster
2014-07-14 15:22:30.375095 7fdf4e653780  0 using public_addr 192.168.2.4:0/0 -> 192.168.2.4:6789/0
2014-07-14 15:22:30.375191 7fdf4e653780  0 starting mon.3 rank -1 at 192.168.2.4:6789/0 mon_data /var/lib/ceph/mon/ceph-3 fsid 1798897a-f0c9-422d-86b3-d4933a12c7ac
2014-07-14 15:22:30.375544 7fdf4e653780  1 mon.3@-1(probing) e0 preinit fsid 1798897a-f0c9-422d-86b3-d4933a12c7ac
2014-07-14 15:22:30.375679 7fdf4e653780  1 mon.3@-1(probing) e0  initial_members 0,1,2, filtering seed monmap
2014-07-14 15:22:30.377442 7fdf4e64b700  0 -- 192.168.2.4:6789/0 >> 0.0.0.0:0/1 pipe(0x1fe4c80 sd=13 :0 s=1 pgs=0 cs=0 l=0 c=0x201cdc0).fault
2014-07-14 15:22:30.377505 7fdf455d9700  0 -- 192.168.2.4:6789/0 >> 0.0.0.0:0/2 pipe(0x1fe4780 sd=14 :0 s=1 pgs=0 cs=0 l=0 c=0x201da20).fault
2014-07-14 15:22:30.377866 7fdf454d8700  0 -- 192.168.2.4:6789/0 >> 0.0.0.0:0/3 pipe(0x1fe5400 sd=15 :0 s=1 pgs=0 cs=0 l=0 c=0x201d760).fault
2014-07-14 15:22:30.410274 7fdf465db700  1 mon.3@-1(synchronizing).paxosservice(pgmap 1..2) refresh upgraded, format 0 -> 1
2014-07-14 15:22:30.410318 7fdf465db700  1 mon.3@-1(synchronizing).pg v0 on_upgrade discarding in-core PGMap
2014-07-14 15:22:30.412460 7fdf465db700  0 mon.3@-1(synchronizing).mds e1 print_map
epoch   1
flags   0
created 2014-07-14 15:01:18.167385
modified        2014-07-14 15:01:18.167410
tableserver     0
root    0
session_timeout 60
session_autoclose       300
max_file_size   1099511627776
last_failure    0
last_failure_osd_epoch  0
compat  compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap}
max_mds 1
in
up      {}
failed
stopped
data_pools      0
metadata_pool   1
inline_data     disabled

2014-07-14 15:22:30.412779 7fdf465db700  1 mon.3@-1(synchronizing).osd e1 e1: 0 osds: 0 up, 0 in
2014-07-14 15:22:30.417334 7fdf465db700  0 mon.3@-1(synchronizing).osd e1 crush map has features 1107558400, adjusting msgr requires
2014-07-14 15:22:30.417364 7fdf465db700  0 mon.3@-1(synchronizing).osd e1 crush map has features 1107558400, adjusting msgr requires
2014-07-14 15:22:30.417369 7fdf465db700  0 mon.3@-1(synchronizing).osd e1 crush map has features 1107558400, adjusting msgr requires
2014-07-14 15:22:30.417373 7fdf465db700  0 mon.3@-1(synchronizing).osd e1 crush map has features 1107558400, adjusting msgr requires
2014-07-14 15:22:30.418155 7fdf465db700  1 mon.3@-1(synchronizing).paxosservice(auth 1..2) refresh upgraded, format 0 -> 1
2014-07-14 15:22:32.420179 7fdf465db700  0 mon.3@-1(probing) e3  my rank is now 3 (was -1)
2014-07-14 15:22:32.421409 7fdf465db700  0 log [INF] : mon.3 calling new monitor election
2014-07-14 15:22:32.421511 7fdf465db700  1 mon.3@3(electing).elector(1) init, last seen epoch 1
2014-07-14 15:22:32.421625 7fdf465db700  1 mon.3@3(electing).paxos(paxos recovering c 1..14) is_readable now=2014-07-14 15:22:32.421626 lease_expire=0.000000 has v0 lc 14
2014-07-14 15:22:32.421642 7fdf465db700  1 mon.3@3(electing).paxos(paxos recovering c 1..14) is_readable now=2014-07-14 15:22:32.421643 lease_expire=0.000000 has v0 lc 14
2014-07-14 15:22:32.421820 7fdf448d3700  0 -- 192.168.2.4:6789/0 >> 192.168.2.2:6789/0 pipe(0x1fe4c80 sd=14 :6789 s=0 pgs=0 cs=0 l=0 c=0x2079340).accept connect_seq 2 vs existing 0 state connecting
2014-07-14 15:22:32.421854 7fdf448d3700  0 -- 192.168.2.4:6789/0 >> 192.168.2.2:6789/0 pipe(0x1fe4c80 sd=14 :6789 s=0 pgs=0 cs=0 l=0 c=0x2079340).accept we reset (peer sent cseq 2, 0x1fe5680.cseq = 0), sending RESETSESSION
2014-07-14 15:22:32.422235 7fdf446d1700  0 -- 192.168.2.4:6789/0 >> 192.168.2.1:6789/0 pipe(0x1fe5900 sd=13 :6789 s=0 pgs=0 cs=0 l=0 c=0x2079080).accept connect_seq 2 vs existing 0 state connecting
2014-07-14 15:22:32.422263 7fdf446d1700  0 -- 192.168.2.4:6789/0 >> 192.168.2.1:6789/0 pipe(0x1fe5900 sd=13 :6789 s=0 pgs=0 cs=0 l=0 c=0x2079080).accept we reset (peer sent cseq 2, 0x207b680.cseq = 0), sending RESETSESSION
2014-07-14 15:22:32.422283 7fdf453d7700  0 -- 192.168.2.4:6789/0 >> 192.168.2.3:6789/0 pipe(0x1fe4f00 sd=16 :6789 s=0 pgs=0 cs=0 l=0 c=0x20791e0).accept connect_seq 2 vs existing 0 state connecting
2014-07-14 15:22:32.422303 7fdf453d7700  0 -- 192.168.2.4:6789/0 >> 192.168.2.3:6789/0 pipe(0x1fe4f00 sd=16 :6789 s=0 pgs=0 cs=0 l=0 c=0x20791e0).accept we reset (peer sent cseq 2, 0x1fe4a00.cseq = 0), sending RESETSESSION
2014-07-14 15:22:32.422464 7fdf448d3700  0 -- 192.168.2.4:6789/0 >> 192.168.2.2:6789/0 pipe(0x1fe4c80 sd=14 :6789 s=0 pgs=0 cs=0 l=0 c=0x2079340).accept connect_seq 0 vs existing 0 state connecting
2014-07-14 15:22:32.423192 7fdf4e64b700  0 -- 192.168.2.4:6789/0 >> 192.168.2.2:6789/0 pipe(0x1fe5680 sd=17 :51313 s=4 pgs=0 cs=0 l=0 c=0x2079600).connect got RESETSESSION but no longer connecting
2014-07-14 15:22:35.478165 7fdf465db700  1 mon.3@3(peon).paxos(paxos active c 1..15) is_readable now=2014-07-14 15:22:35.478176 lease_expire=2014-07-14 15:22:40.466348 has v0 lc 15
2014-07-14 15:22:35.478209 7fdf465db700  1 mon.3@3(peon).paxos(paxos active c 1..15) is_readable now=2014-07-14 15:22:35.478210 lease_expire=2014-07-14 15:22:40.466348 has v0 lc 15
2014-07-14 15:22:35.540828 7fdf465db700  1 mon.3@3(peon).paxos(paxos active c 1..16) is_readable now=2014-07-14 15:22:35.540829 lease_expire=2014-07-14 15:22:40.533829 has v0 lc 16
2014-07-14 15:22:38.536309 7fdf465db700  1 mon.3@3(peon).paxos(paxos active c 1..16) is_readable now=2014-07-14 15:22:38.536311 lease_expire=2014-07-14 15:22:43.533964 has v0 lc 16


The appearance of mon.3 causes a new monitor election, as noted in the ceph.log on kroll4:

root #cat ceph.log
2014-07-14 15:06:36.927293 mon.0 192.168.2.1:6789/0 18 : [INF] osdmap e1: 0 osds: 0 up, 0 in
2014-07-14 15:06:49.820264 mon.0 192.168.2.1:6789/0 19 : [WRN] reached concerning levels of available space on local monitor storage (28% free)
2014-07-14 15:06:31.911339 mon.1 192.168.2.2:6789/0 3 : [INF] mon.1 calling new monitor election
2014-07-14 15:22:30.433866 mon.0 192.168.2.1:6789/0 20 : [INF] mon.0 calling new monitor election
2014-07-14 15:22:30.434038 mon.1 192.168.2.2:6789/0 4 : [INF] mon.1 calling new monitor election
2014-07-14 15:22:30.438257 mon.2 192.168.2.3:6789/0 2 : [INF] mon.2 calling new monitor election
2014-07-14 15:22:32.421411 mon.3 192.168.2.4:6789/0 1 : [INF] mon.3 calling new monitor election
2014-07-14 15:22:35.444315 mon.0 192.168.2.1:6789/0 21 : [INF] mon.0@0 won leader election with quorum 0,1,2,3
2014-07-14 15:22:35.466723 mon.0 192.168.2.1:6789/0 22 : [INF] monmap e3: 4 mons at {0=192.168.2.1:6789/0,1=192.168.2.2:6789/0,2=192.168.2.3:6789/0,3=192.168.2.4:6789/0}
2014-07-14 15:22:35.466825 mon.0 192.168.2.1:6789/0 23 : [INF] pgmap v2: 192 pgs: 192 creating; 0 bytes data, 0 kB used, 0 kB / 0 kB avail
2014-07-14 15:22:35.466931 mon.0 192.168.2.1:6789/0 24 : [INF] mdsmap e1: 0/0/1 up
2014-07-14 15:22:35.467028 mon.0 192.168.2.1:6789/0 25 : [INF] osdmap e1: 0 osds: 0 up, 0 in
2014-07-14 15:22:49.826214 mon.0 192.168.2.1:6789/0 26 : [WRN] reached concerning levels of available space on local monitor storage (28% free)


And now we are four. The health will stay at HEALTH_ERR or degraded until we get at least two OSDs spun up. We need two because that's the default replication count for objects, as set in /etc/ceph/ceph.conf.

root #ceph -s
    cluster 1798897a-f0c9-422d-86b3-d4933a12c7ac
     health HEALTH_ERR 192 pgs stuck inactive; 192 pgs stuck unclean; no osds
     monmap e3: 4 mons at {0=192.168.2.1:6789/0,1=192.168.2.2:6789/0,2=192.168.2.3:6789/0,3=192.168.2.4:6789/0}, election epoch 8, quorum 0,1,2,3 0,1,2,3
     osdmap e1: 0 osds: 0 up, 0 in
      pgmap v2: 192 pgs, 3 pools, 0 bytes data, 0 objects
            0 kB used, 0 kB / 0 kB avail
                 192 creating
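A side note on monitor counts: a quorum requires a strict majority of monitors, so four mons tolerate no more failures than three, which is why odd counts are usually recommended. The arithmetic, as a small sketch in plain Python (not a Ceph API):

```python
def quorum_size(mons: int) -> int:
    """Smallest strict majority of `mons` monitors."""
    return mons // 2 + 1

def failures_tolerated(mons: int) -> int:
    """Monitors that can fail while a quorum survives."""
    return mons - quorum_size(mons)

for n in (3, 4, 5):
    # prints: 3 2 1 / 4 3 1 / 5 3 2
    print(n, quorum_size(n), failures_tolerated(n))
```

Note that with four mons we can lose only one, exactly as with three; the fifth mon is what buys a second tolerated failure.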


Creating osd.0 on kroll1

We use uuidgen to create a unique id for our first osd. With the mon servers up and maintaining the cluster map, we need to keep track of the id numbers returned by ceph osd create and retrofit /etc/ceph/ceph.conf if necessary, should we lose an osd later or do things out of order.

root #uuidgen
fb84a114-e961-4a8d-98a3-0b263c003129
root #ceph osd create fb84a114-e961-4a8d-98a3-0b263c003129
0

Since this will be the first osd for the mon servers, we get assigned id 0 to create osd.0 on kroll1. As we have noted earlier, kroll1 (Thufir) already has a btrfs raid5 array up and running with the default volume mounted on /thufirraid. The "normal" content is in the subvolume raid which is mounted on /raid. We will add a new subvolume called ceph_osd_0 which will be mounted to /var/lib/ceph/osd/ceph-0 for use by the new osd server.

root #df
Filesystem                        1K-blocks       Used  Available Use% Mounted on
/dev/sda2                         230374564  152033716   66615436  70% /
devtmpfs                           16387752          0   16387752   0% /dev
tmpfs                               3284440        952    3283488   1% /run
shm                                16422196        140   16422056   1% /dev/shm
cgroup_root                           10240          0      10240   0% /sys/fs/cgroup
/dev/sdb                        15628074336 6429555452 7031525188  48% /thufirraid
/dev/sdb                        15628074336 6429555452 7031525188  48% /raid
root #cd /thufirraid
root #btrfs subvol create ceph_osd_0
Create subvolume './ceph_osd_0'
root #ls
ceph_osd_0  raid


We add the new subvolume to our /etc/fstab and take the liberty of turning on automatic defragmentation and on-the-fly lzo compression for the subvolume. The osd will actually manage some subvolumes of its own underneath this mountpoint (current and snap.N rolling snapshots).

FILE /etc/fstabAdding ceph_osd_0 to the fstab
LABEL=boot	/boot		ext3		noauto,noatime	1 2
LABEL=root	/		ext4		defaults,noatime,discard		0 1
/dev/sdb	/thufirraid	btrfs		defaults,noatime,compress=lzo,autodefrag		0 0
/dev/sdb	/raid		btrfs		defaults,noatime,compress=lzo,autodefrag,subvol=raid	0 0
/dev/sdb	/var/lib/ceph/osd/ceph-0  btrfs defaults,noatime,compress=lzo,autodefrag,subvol=ceph_osd_0	0 0
root #mkdir -p /var/lib/ceph/osd/ceph-0
root #mount /var/lib/ceph/osd/ceph-0
root #df
Filesystem                        1K-blocks       Used  Available Use% Mounted on
/dev/sda2                         230374564  152034736   66614416  70% /
devtmpfs                           16387752          0   16387752   0% /dev
tmpfs                               3284440        952    3283488   1% /run
shm                                16422196        140   16422056   1% /dev/shm
cgroup_root                           10240          0      10240   0% /sys/fs/cgroup
/dev/sdb                        15628074336 6429555484 7031525188  48% /thufirraid
/dev/sdb                        15628074336 6429555484 7031525188  48% /raid
/dev/sdb                        15628074336 6429555484 7031525188  48% /var/lib/ceph/osd/ceph-0


Now we let ceph-osd have its way with our new btrfs subvolume.

root #ceph-osd -i 0 --mkfs --mkkey --osd-uuid fb84a114-e961-4a8d-98a3-0b263c003129
2014-07-14 16:15:48.188183 7f667580dbc0 -1 journal FileJournal::_open: disabling aio for non-block journal.  Use journal_force_aio to force use of aio anyway
2014-07-14 16:15:51.294994 7f667580dbc0 -1 journal FileJournal::_open: disabling aio for non-block journal.  Use journal_force_aio to force use of aio anyway
2014-07-14 16:15:51.314809 7f667580dbc0 -1 filestore(/var/lib/ceph/osd/ceph-0) could not find 23c2fcde/osd_superblock/0//-1 in index: (2) No such file or directory
2014-07-14 16:15:52.440379 7f667580dbc0 -1 created object store /var/lib/ceph/osd/ceph-0 journal /var/lib/ceph/osd/ceph-0/journal for osd.0 fsid 1798897a-f0c9-422d-86b3-d4933a12c7ac
2014-07-14 16:15:52.440477 7f667580dbc0 -1 auth: error reading file: /var/lib/ceph/osd/ceph-0/keyring: can't open /var/lib/ceph/osd/ceph-0/keyring: (2) No such file or directory
2014-07-14 16:15:52.440658 7f667580dbc0 -1 created new key in keyring /var/lib/ceph/osd/ceph-0/keyring

We now have a /var/log/ceph/ceph-osd.0.log which shows us all of the activity and the btrfs features that the osd decided to take advantage of:

root #cat /var/log/ceph/ceph-osd.0.log
2014-07-14 16:15:45.644892 7f667580dbc0  0 ceph version 0.80.1 (a38fe1169b6d2ac98b427334c12d7cf81f809b74), process ceph-osd, pid 25509
2014-07-14 16:15:45.650521 7f667580dbc0  1 filestore(/var/lib/ceph/osd/ceph-0) mkfs in /var/lib/ceph/osd/ceph-0
2014-07-14 16:15:45.650596 7f667580dbc0  1 filestore(/var/lib/ceph/osd/ceph-0) mkfs using provided fsid fb84a114-e961-4a8d-98a3-0b263c003129
2014-07-14 16:15:48.185602 7f667580dbc0  1 filestore(/var/lib/ceph/osd/ceph-0) leveldb db exists/created
2014-07-14 16:15:48.188183 7f667580dbc0 -1 journal FileJournal::_open: disabling aio for non-block journal.  Use journal_force_aio to force use of aio anyway
2014-07-14 16:15:48.190839 7f667580dbc0  1 journal _open /var/lib/ceph/osd/ceph-0/journal fd 9: 1073741824 bytes, block size 4096 bytes, directio = 1, aio = 0
2014-07-14 16:15:48.325943 7f667580dbc0  0 filestore(/var/lib/ceph/osd/ceph-0) mkjournal created journal on /var/lib/ceph/osd/ceph-0/journal
2014-07-14 16:15:48.325974 7f667580dbc0  1 filestore(/var/lib/ceph/osd/ceph-0) mkfs done in /var/lib/ceph/osd/ceph-0
2014-07-14 16:15:48.326479 7f667580dbc0  0 filestore(/var/lib/ceph/osd/ceph-0) mount detected btrfs
2014-07-14 16:15:48.481458 7f667580dbc0  0 genericfilestorebackend(/var/lib/ceph/osd/ceph-0) detect_features: FIEMAP ioctl is supported and appears to work
2014-07-14 16:15:48.481508 7f667580dbc0  0 genericfilestorebackend(/var/lib/ceph/osd/ceph-0) detect_features: FIEMAP ioctl is disabled via 'filestore fiemap' config option
2014-07-14 16:15:49.418326 7f667580dbc0  0 genericfilestorebackend(/var/lib/ceph/osd/ceph-0) detect_features: syncfs(2) syscall fully supported (by glibc and kernel)
2014-07-14 16:15:49.418549 7f667580dbc0  0 btrfsfilestorebackend(/var/lib/ceph/osd/ceph-0) detect_feature: CLONE_RANGE ioctl is supported
2014-07-14 16:15:50.018307 7f667580dbc0  0 btrfsfilestorebackend(/var/lib/ceph/osd/ceph-0) detect_feature: SNAP_CREATE is supported
2014-07-14 16:15:50.027624 7f667580dbc0  0 btrfsfilestorebackend(/var/lib/ceph/osd/ceph-0) detect_feature: SNAP_DESTROY is supported
2014-07-14 16:15:50.029177 7f667580dbc0  0 btrfsfilestorebackend(/var/lib/ceph/osd/ceph-0) detect_feature: START_SYNC is supported (transid 4539)
2014-07-14 16:15:50.318284 7f667580dbc0  0 btrfsfilestorebackend(/var/lib/ceph/osd/ceph-0) detect_feature: WAIT_SYNC is supported
2014-07-14 16:15:50.778085 7f667580dbc0  0 btrfsfilestorebackend(/var/lib/ceph/osd/ceph-0) detect_feature: SNAP_CREATE_V2 is supported
2014-07-14 16:15:51.293060 7f667580dbc0  0 filestore(/var/lib/ceph/osd/ceph-0) mount: enabling PARALLEL journal mode: fs, checkpoint is enabled
2014-07-14 16:15:51.294994 7f667580dbc0 -1 journal FileJournal::_open: disabling aio for non-block journal.  Use journal_force_aio to force use of aio anyway
2014-07-14 16:15:51.295050 7f667580dbc0  1 journal _open /var/lib/ceph/osd/ceph-0/journal fd 15: 1073741824 bytes, block size 4096 bytes, directio = 1, aio = 0
2014-07-14 16:15:51.314226 7f667580dbc0  1 journal _open /var/lib/ceph/osd/ceph-0/journal fd 15: 1073741824 bytes, block size 4096 bytes, directio = 1, aio = 0
2014-07-14 16:15:51.314809 7f667580dbc0 -1 filestore(/var/lib/ceph/osd/ceph-0) could not find 23c2fcde/osd_superblock/0//-1 in index: (2) No such file or directory
2014-07-14 16:15:52.437200 7f667580dbc0  1 journal close /var/lib/ceph/osd/ceph-0/journal
2014-07-14 16:15:52.440379 7f667580dbc0 -1 created object store /var/lib/ceph/osd/ceph-0 journal /var/lib/ceph/osd/ceph-0/journal for osd.0 fsid 1798897a-f0c9-422d-86b3-d4933a12c7ac
2014-07-14 16:15:52.440477 7f667580dbc0 -1 auth: error reading file: /var/lib/ceph/osd/ceph-0/keyring: can't open /var/lib/ceph/osd/ceph-0/keyring: (2) No such file or directory
2014-07-14 16:15:52.440658 7f667580dbc0 -1 created new key in keyring /var/lib/ceph/osd/ceph-0/keyring


The resulting filesystem looks like this:

root #cd /var/lib/ceph/osd/ceph-0
root #ls -l
total 1048608
-rw-r--r-- 1 root root         37 Jul 14 16:15 ceph_fsid
drwxr-xr-x 1 root root         42 Jul 14 16:15 current
-rw-r--r-- 1 root root         37 Jul 14 16:15 fsid
-rw-r--r-- 1 root root 1073741824 Jul 14 16:15 journal
-rw------- 1 root root         56 Jul 14 16:15 keyring
-rw-r--r-- 1 root root         21 Jul 14 16:15 magic
-rw-r--r-- 1 root root          6 Jul 14 16:15 ready
drwxr-xr-x 1 root root         26 Jul 14 16:15 snap_1
drwxr-xr-x 1 root root         42 Jul 14 16:15 snap_2
-rw-r--r-- 1 root root          4 Jul 14 16:15 store_version
-rw-r--r-- 1 root root         42 Jul 14 16:15 superblock
-rw-r--r-- 1 root root          2 Jul 14 16:15 whoami


The current, snap_1 and snap_2 directories are btrfs snapshots. The keyring file is a new unique key that was generated for osd-0.

root #cat keyring
[osd.0]
        key = (redacted, unique to osd.0)


We use ceph auth to add that key to the authentication database maintained by the mon servers.

root #ceph auth add osd.0 osd 'allow *' mon 'allow profile osd' -i /var/lib/ceph/osd/ceph-0/keyring
added key for osd.0


The ceph_fsid actually contains the fsid for the cluster from /etc/ceph/ceph.conf and not the one we passed when creating the osd itself:

root #cat ceph_fsid
1798897a-f0c9-422d-86b3-d4933a12c7ac
root #grep fsid /etc/ceph/ceph.conf
        fsid = 1798897a-f0c9-422d-86b3-d4933a12c7ac


We save the uuid we used for osd.0 in a text file in /etc/ceph/osd_uuids.txt just in case. It would probably be a bad idea to try to recycle an osd id number if we ever have a filesystem go bad, but you never know...
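One way to keep that record machine-checkable is a trivial id-to-uuid ledger. A minimal sketch (the "osd_id uuid" line format and the helper names here are our own invention, not anything Ceph reads):

```python
def record_osd_uuid(lines: list[str], osd_id: int, uuid: str) -> list[str]:
    """Return the ledger with a new 'osd_id uuid' line appended,
    refusing to reuse an osd id."""
    known = {int(l.split()[0]) for l in lines if l.strip()}
    if osd_id in known:
        raise ValueError(f"osd.{osd_id} already recorded")
    return lines + [f"{osd_id} {uuid}"]

def lookup_osd_uuid(lines: list[str], osd_id: int) -> str:
    """Find the uuid recorded for a given osd id."""
    for line in lines:
        ident, uuid = line.split()
        if int(ident) == osd_id:
            return uuid
    raise KeyError(osd_id)

# seed the ledger with osd.0; persisting the lines to
# /etc/ceph/osd_uuids.txt is left to taste
records = record_osd_uuid([], 0, "fb84a114-e961-4a8d-98a3-0b263c003129")
```

Refusing duplicate ids up front is the point of the exercise, since recycling an osd id after a failed filesystem is exactly the mistake we want to avoid.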


We set up kroll1 as a host in the CRUSH map and make it part of the default osd tree.

root #ceph osd crush add-bucket kroll1 host
added bucket kroll1 type host to crush map
root #ceph osd crush move kroll1 root=default
moved item id -2 name 'kroll1' to location {root=default} in crush map


We can now see kroll1 as a host in the default tree using ceph osd tree. osd.0 also appears but hasn't been assigned to a host.

root #ceph osd tree
# id    weight  type name       up/down reweight
-1      0       root default
-2      0               host kroll1
0       0       osd.0   down    0


We now put osd.0 in the crush map under kroll1 with a default weighting value.

root #ceph osd crush add osd.0 1.0 host=kroll1
add item id 0 name 'osd.0' weight 1 at location {host=kroll1} to crush map
root #ceph osd tree
# id    weight  type name       up/down reweight
-1      1       root default
-2      1               host kroll1
0       1                       osd.0   down    0
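We used a flat weight of 1.0 because our arrays are of similar size, but the usual convention is to weight each OSD by its capacity in TiB so CRUSH distributes data proportionally. A sketch of that arithmetic (this is the common convention, not a Ceph call):

```python
def crush_weight(capacity_bytes: int) -> float:
    """Conventional CRUSH weight: OSD capacity expressed in TiB,
    rounded to two decimals."""
    tib = 1024 ** 4
    return round(capacity_bytes / tib, 2)

# e.g. a marketing-4TB (4 * 10**12 byte) drive
print(crush_weight(4 * 10**12))  # 3.64
```

With mixed drive sizes this keeps a 4 TB OSD from filling up twice as fast as an 8 TB neighbor.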


All that is left is to enable and start the osd.0 service.

root #cd /etc/init.d
root #ln -s ceph ceph.osd-0
root #rc-update add ceph.osd-0 default
 * service ceph.osd-0 added to runlevel default
root #./ceph.osd-0 start
 * Caching service dependencies ...                                                                     [ ok ]
 * Starting Ceph osd.0 ...
starting osd.0 at :/0 osd_data /var/lib/ceph/osd/ceph-0 /var/lib/ceph/osd/ceph-0/journal                [ ok ]


With one osd spun up, our cluster is now operating in a degraded state.

root #ceph -s
    cluster 1798897a-f0c9-422d-86b3-d4933a12c7ac
     health HEALTH_WARN 192 pgs degraded; 192 pgs stuck unclean
     monmap e3: 4 mons at {0=192.168.2.1:6789/0,1=192.168.2.2:6789/0,2=192.168.2.3:6789/0,3=192.168.2.4:6789/0}, election epoch 8, quorum 0,1,2,3 0,1,2,3
     osdmap e7: 1 osds: 1 up, 1 in
      pgmap v9: 192 pgs, 3 pools, 0 bytes data, 0 objects
            6132 GB used, 6704 GB / 14904 GB avail
                 192 active+degraded


osd.1 on kroll3

osd.1 will use our other big btrfs array on host kroll3 (mater).

root #uuidgen
0b4c1e77-cc5c-49aa-b322-42d331621113
root #ceph osd create 0b4c1e77-cc5c-49aa-b322-42d331621113
1


Since we are visiting hosts in the right order, we get to set up osd.1 on kroll3 just as we had anticipated in our /etc/ceph/ceph.conf file. Mater's btrfs array has its default volume mounted on /materraid.

root #cd /materraid
root #btrfs subvol create ceph_osd_1
Create subvolume './ceph_osd_1'


FILE /etc/fstabEdited to add ceph_osd_1 subvolume
LABEL=SAMBOOT	/boot		ext4		defaults,noatime,discard	1 2
LABEL=SAMROOT	/		ext4		defaults,noatime,discard		0 1
/dev/sdb	/materraid	btrfs		defaults,noatime,compress=lzo,autodefrag		0 0
/dev/sdb	/raid		btrfs		defaults,noatime,compress=lzo,autodefrag,subvol=raid	0 0
/dev/sdb	/var/lib/ceph/osd/ceph-1 btrfs	defaults,noatime,compress=lzo,autodefrag,subvol=ceph_osd_1	0 0


root #mkdir -p /var/lib/ceph/osd/ceph-1
root #mount /var/lib/ceph/osd/ceph-1
root #df
Filesystem                        1K-blocks       Used  Available Use% Mounted on
/dev/sda3                         239716036  150083632   77432412  66% /
devtmpfs                            8160292          0    8160292   0% /dev
tmpfs                               1632264        928    1631336   1% /run
shm                                 8161308         76    8161232   1% /dev/shm
cgroup_root                           10240          0      10240   0% /sys/fs/cgroup
/dev/sda2                            499656      20764     442196   5% /boot
/dev/sdb                        15628074336 6398440760 7076273736  48% /materraid
/dev/sdb                        15628074336 6398440760 7076273736  48% /raid
/dev/sdb                        15628074336 6398440760 7076273736  48% /var/lib/ceph/osd/ceph-1


root #ceph-osd -i 1 --mkfs --mkkey --osd-uuid 0b4c1e77-cc5c-49aa-b322-42d331621113
2014-07-14 17:41:03.717559 7fcc47c22bc0 -1 journal FileJournal::_open: disabling aio for non-block journal.  Use journal_force_aio to force use of aio anyway
2014-07-14 17:41:12.369793 7fcc47c22bc0 -1 journal FileJournal::_open: disabling aio for non-block journal.  Use journal_force_aio to force use of aio anyway
2014-07-14 17:41:12.371069 7fcc47c22bc0 -1 filestore(/var/lib/ceph/osd/ceph-1) could not find 23c2fcde/osd_superblock/0//-1 in index: (2) No such file or directory
2014-07-14 17:41:14.455595 7fcc47c22bc0 -1 created object store /var/lib/ceph/osd/ceph-1 journal /var/lib/ceph/osd/ceph-1/journal for osd.1 fsid 1798897a-f0c9-422d-86b3-d4933a12c7ac
2014-07-14 17:41:14.455697 7fcc47c22bc0 -1 auth: error reading file: /var/lib/ceph/osd/ceph-1/keyring: can't open /var/lib/ceph/osd/ceph-1/keyring: (2) No such file or directory
2014-07-14 17:41:14.455883 7fcc47c22bc0 -1 created new key in keyring /var/lib/ceph/osd/ceph-1/keyring
root #ceph auth add osd.1 osd 'allow *' mon 'allow profile osd' -i /var/lib/ceph/osd/ceph-1/keyring
added key for osd.1
root #ceph osd crush add-bucket kroll3 host
added bucket kroll3 type host to crush map
root #ceph osd crush move kroll3 root=default
moved item id -3 name 'kroll3' to location {root=default} in crush map
root #ceph osd crush add osd.1 1.0 host=kroll3
add item id 1 name 'osd.1' weight 1 at location {host=kroll3} to crush map
root #ceph osd tree
# id    weight  type name       up/down reweight
-1      2       root default
-2      1               host kroll1
0       1                       osd.0   up      1
-3      1               host kroll3
1       1                       osd.1   down    0


root #cd /etc/init.d
root #ln -s ceph ceph.osd-1
root #rc-update add ceph.osd-1 default
 * service ceph.osd-1 added to runlevel default
root #./ceph.osd-1 start
 * Caching service dependencies ...                                                                     [ ok ]
 * Starting Ceph osd.1 ...
starting osd.1 at :/0 osd_data /var/lib/ceph/osd/ceph-1 /var/lib/ceph/osd/ceph-1/journal                [ ok ]


With two osds running, the placement groups go active+clean, but the cluster still reports HEALTH_WARN. This is likely left over from the earlier warning about available monitor storage space on kroll1.

root #ceph osd tree

# id    weight  type name       up/down reweight
-1      2       root default
-2      1               host kroll1
0       1                       osd.0   up      1
-3      1               host kroll3
1       1                       osd.1   up      1
root #ceph -s
    cluster 1798897a-f0c9-422d-86b3-d4933a12c7ac
     health HEALTH_WARN
     monmap e3: 4 mons at {0=192.168.2.1:6789/0,1=192.168.2.2:6789/0,2=192.168.2.3:6789/0,3=192.168.2.4:6789/0}, election epoch 8, quorum 0,1,2,3 0,1,2,3
     osdmap e13: 2 osds: 2 up, 2 in
      pgmap v24: 192 pgs, 3 pools, 0 bytes data, 0 objects
            12235 GB used, 13452 GB / 29808 GB avail
                 192 active+clean


osd.2 on kroll4

After shoving a couple of data archives from /home to the /raid filesystem on thufir, we got the root drive down to 51% of capacity. The cluster now shows HEALTH_OK.

root #ceph -s
    cluster 1798897a-f0c9-422d-86b3-d4933a12c7ac
     health HEALTH_OK
     monmap e3: 4 mons at {0=192.168.2.1:6789/0,1=192.168.2.2:6789/0,2=192.168.2.3:6789/0,3=192.168.2.4:6789/0}, election epoch 8, quorum 0,1,2,3 0,1,2,3
     osdmap e14: 2 osds: 2 up, 2 in
      pgmap v39: 192 pgs, 3 pools, 0 bytes data, 0 objects
            12269 GB used, 13407 GB / 29808 GB avail
                 192 active+clean


We will now add our final (for the moment) object store as osd.2 on kroll4. Its btrfs mirror set has its default volume mounted on /tuberaid. It also has a different subvolume layout from thufir and mater: instead of mounting and exporting a single /raid subvolume, it has subvolumes for /home, /mythtv and a dedicated volume for virtual machines as /vm. Another wrinkle is that this system uses btrfs on its SSD for the /boot and / filesystems; in fact, the root filesystem is mounted as a subvolume of /systemraid.

root #uuidgen
9b5550a2-7c92-4875-bea8-e61ee7a1ea15
root #ceph osd create 9b5550a2-7c92-4875-bea8-e61ee7a1ea15
2
root #cd /tuberaid
root #btrfs subvol create ceph_osd_2
Create subvolume './ceph_osd_2'
FILE /etc/fstabupdated with ceph_osd_2 mount
/dev/sda3     	/systemraid	    btrfs   defaults,noatime,ssd	0 0
/dev/sda2	/boot  	  	    btrfs   defaults,noatime,ssd	0 0
/dev/sda3	/		    btrfs   defaults,noatime,ssd,compress=lzo,subvol=activeroot	0 0
#
#  2x4tb raid1/raid1 btrfs.  /dev/sdb1,c1 has alternative grub2bios and /dev/sdb2,c2 has alternative /boot
#
/dev/sdb3      /tuberaid	btrfs   defaults,noatime,compress=lzo,autodefrag		0 0
/dev/sdb3      /home		btrfs	subvol=home,defaults,noatime,compress=lzo,autodefrag	0 0
/dev/sdb3      /mythtv		btrfs   subvol=mythtv,autodefrag,defaults,noatime		0 0
/dev/sdb3      /vm		btrfs	subvol=vm,compress=lzo,autodefrag,defaults,noatime	0 0
/dev/sdb3      /var/lib/ceph/osd/ceph-2	 btrfs	subvol=ceph_osd_2,compress=lzo,autodefrag,defaults,noatime 0 0


root #mkdir -p /var/lib/ceph/osd/ceph-2
root #mount /var/lib/ceph/osd/ceph-2
root #df
Filesystem                        1K-blocks       Used  Available Use% Mounted on
/dev/sda3                         249542980   26435412  221184188  11% /
devtmpfs                            7634844          0    7634844   0% /dev
tmpfs                               1527156        956    1526200   1% /run
shm                                 7635760        912    7634848   1% /dev/shm
cgroup_root                           10240          0      10240   0% /sys/fs/cgroup
/dev/sda3                         249542980   26435412  221184188  11% /systemraid
/dev/sda2                            512000      47048     460860  10% /boot
/dev/sdb3                        7813006984  443978232 7367515784   6% /tuberaid
/dev/sdb3                        7813006984  443978232 7367515784   6% /home
/dev/sdb3                        7813006984  443978232 7367515784   6% /mythtv
/dev/sdb3                        7813006984  443978232 7367515784   6% /vm
/dev/sdb3                        7813006984  443978232 7367515784   6% /var/lib/ceph/osd/ceph-2


root #ceph-osd -i 2 --mkfs --mkkey --osd-uuid 9b5550a2-7c92-4875-bea8-e61ee7a1ea15
2014-07-14 18:52:38.014662 7f88eed18780 -1 journal FileJournal::_open: disabling aio for non-block journal.  Use journal_force_aio to force use of aio anyway
2014-07-14 18:52:38.789413 7f88eed18780 -1 journal FileJournal::_open: disabling aio for non-block journal.  Use journal_force_aio to force use of aio anyway
2014-07-14 18:52:38.803037 7f88eed18780 -1 filestore(/var/lib/ceph/osd/ceph-2) could not find 23c2fcde/osd_superblock/0//-1 in index: (2) No such file or directory
2014-07-14 18:52:39.057377 7f88eed18780 -1 created object store /var/lib/ceph/osd/ceph-2 journal /var/lib/ceph/osd/ceph-2/journal for osd.2 fsid 1798897a-f0c9-422d-86b3-d4933a12c7ac
2014-07-14 18:52:39.057479 7f88eed18780 -1 auth: error reading file: /var/lib/ceph/osd/ceph-2/keyring: can't open /var/lib/ceph/osd/ceph-2/keyring: (2) No such file or directory
2014-07-14 18:52:39.057665 7f88eed18780 -1 created new key in keyring /var/lib/ceph/osd/ceph-2/keyring
root #ceph auth add osd.2 osd 'allow *' mon 'allow profile osd' -i /var/lib/ceph/osd/ceph-2/keyring
added key for osd.2
root #ceph osd crush add-bucket kroll4 host
added bucket kroll4 type host to crush map
root #ceph osd crush move kroll4 root=default
moved item id -4 name 'kroll4' to location {root=default} in crush map
root #ceph osd crush add osd.2 1.0 host=kroll4
add item id 2 name 'osd.2' weight 1 at location {host=kroll4} to crush map
root #cd /etc/init.d
root #ln -s ceph ceph.osd-2
root #rc-update add ceph.osd-2 default
 * service ceph.osd-2 added to runlevel default
root #./ceph.osd-2 start
root #ceph osd tree
# id    weight  type name       up/down reweight
-1      3       root default
-2      1               host kroll1
0       1                       osd.0   up      1
-3      1               host kroll3
1       1                       osd.1   up      1
-4      1               host kroll4
2       1                       osd.2   up      1
root #ceph -s
    cluster 1798897a-f0c9-422d-86b3-d4933a12c7ac
     health HEALTH_OK
     monmap e3: 4 mons at {0=192.168.2.1:6789/0,1=192.168.2.2:6789/0,2=192.168.2.3:6789/0,3=192.168.2.4:6789/0}, election epoch 8, quorum 0,1,2,3 0,1,2,3
     osdmap e20: 3 osds: 3 up, 3 in
      pgmap v57: 192 pgs, 3 pools, 0 bytes data, 0 objects
            12697 GB used, 20429 GB / 37259 GB avail
                 192 active+clean

It suddenly strikes us that we are almost assuredly the first kid in our neighborhood to set up a cluster with almost 40 TB of distributed object store in the comfort of our own home. In fact, we are probably outclassing all but one or two nearby NIH-funded research labs. The hardware cost was probably in the range of $5-7K since it was built from scratch, and it may even have been cheaper than a single desktop video editing system sold by a certain fruity computer company.


Setting up mds.0 on kroll2

The Ceph site wants us to use ceph-deploy to set up the mds servers, but that script has yet to be ported and packaged for Gentoo. We found enough information on a blogger's site to do a manual install on Gentoo. /etc/ceph/ceph.conf needs an edit to introduce a global [mds] section giving the default location of the mds server's data and the keyring it needs.

FILE /etc/ceph/ceph.confedited for new [mds] global section
[global]
        # For version 0.54 and earlier, you may enable
        # authentication with the following setting.
        # Specifying `cephx` enables authentication;
        # and specifying `none` disables authentication.

        #auth supported = cephx

        # For version 0.55 and beyond, you must explicitly enable 
        # or disable authentication with "auth" entries in [global].
        fsid = 1798897a-f0c9-422d-86b3-d4933a12c7ac
        cluster = kroll
        public network = 192.168.2.0/24
        auth cluster required = cephx
        auth service required = cephx
        auth client required = cephx
        osd journal size = 1024

        #
        # default to replicating two copies of each object.  Allow a minimum
        # of one copy to be made when number of object stores is degraded
        #
        osd pool default size = 2
        osd pool default min size = 1
        #
        # defaults from  http://ceph.com/docs/master/install/manual-deployment/
        #
        osd pool default pg num = 333
        osd pool default pgp num = 333
        osd crush chooseleaf type = 1



[osd]
        osd journal size = 1024

        #
        #  This is a crutch for ext4 filesystems with limit of 4kb xattr data
        #  btrfs has no bounds for xattr store
        #
        # uncomment the following line if you are mounting with ext4
        #filestore xattr use omap = true


        # For Bobtail (v 0.56) and subsequent versions, you may 
        # add settings for mkcephfs so that it will create and mount
        # the file system for you. Remove the comment `#` character for 
        # the following settings and replace the values in parenthesis 
        # with appropriate values, or leave the following settings commented 
        # out to accept the default values. You must specify the --mkfs 
        # option with mkcephfs in order for the deployment script to 
        # utilize the following settings, and you must define the 'devs'
        # option for each osd instance; see below.

        #osd mkfs type = {fs-type}
        #osd mkfs options {fs-type} = {mkfs options}   # default for xfs is "-f"
        #osd mount options {fs-type} = {mount options} # default mount option is "rw, noatime"

[mon]
        mon host = kroll1, kroll2, kroll3, kroll4
        mon addr = 192.168.2.1:6789, 192.168.2.2:6789, 192.168.2.3:6789, 192.168.2.4:6789
        mon initial members = 0,1,2

[mds]
        mds data = /var/lib/ceph/mds/mds-$id
        keyring = /var/lib/ceph/mds/mds-$id/keyring

[mon.0]
        host = kroll1
        mon addr = 192.168.2.1:6789

[mon.1]
        host = kroll2
        mon addr = 192.168.2.2:6789

[mon.2]
        host = kroll3
        mon addr = 192.168.2.3:6789

[mon.3]
        host = kroll4
        mon addr = 192.168.2.4:6789

[osd.0]
        host = kroll1

[osd.1]
        host = kroll3

[osd.2]
        host = kroll4

[mds.0]
        host = kroll2
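The osd pool default pg num value of 333 is taken verbatim from the upstream manual-deployment example. The commonly cited rule of thumb behind it is (OSDs × 100) / replica count, rounded up to a power of two; a quick sketch of that calculation (our reading of the guideline, not a Ceph API):

```python
def suggested_pg_count(osds: int, replicas: int, per_osd: int = 100) -> int:
    """Rule-of-thumb placement group count: (osds * per_osd) / replicas,
    rounded up to the next power of two."""
    raw = osds * per_osd // replicas
    power = 1
    while power < raw:
        power *= 2
    return power

# our eventual 3 OSDs with 2-way replication -> raw 150 -> 256
print(suggested_pg_count(3, 2))  # 256
```

So for this three-OSD cluster a pg num of 256 would be in the right ballpark; the defaults only matter for pools created without an explicit count.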


The rest of the file stays untouched, including the [mds.0] host section that we put in much earlier. We pass the updated conf file around to the other hosts.

root #cd /etc/ceph
root #scp ceph.conf kroll2:/etc/ceph
ceph.conf                                                         100% 2498     2.4KB/s   2.4KB/s   00:00    
root #scp ceph.conf kroll3:/etc/ceph
ceph.conf                                                         100% 2498     2.4KB/s   2.4KB/s   00:00    
root #scp ceph.conf kroll4:/etc/ceph
ceph.conf                                                         100% 2498     2.4KB/s   2.4KB/s   00:00    


On kroll2 we create the mds-0 directory and then use ceph auth to create its keyring. The result is only a single file with a simple key value stanza.

root #mkdir -p /var/lib/ceph/mds/mds-0
root #ceph auth get-or-create mds.0 mds 'allow ' osd 'allow *' mon 'allow rwx' > /var/lib/ceph/mds/mds-0/keyring
root #cd /var/lib/ceph/mds/mds-0
root #ls -l
total 4
-rw-r--r-- 1 root root 56 Jul 14 23:10 keyring
root #cat keyring
[mds.0]
        key = (redacted key text)


We then create a softlink of the ceph openrc script to enable and start mds-0.

root #cd /etc/init.d
root #ln -s ceph ceph.mds-0
root #rc-update add ceph.mds-0 default
 * service ceph.mds-0 added to runlevel default
root #./ceph.mds-0 start
 * Caching service dependencies ...                                                                     [ ok ]
 * Starting Ceph mds.0 ...
starting mds.0 at :/0                                                                                   [ ok ]


Looking at ceph-mds.0.log and running ceph -s shows that everything is fine and we now have an mds server running. We should now be able to use mount.ceph and export the ceph namespace over NFS and CIFS.

root #cd /var/log/ceph
root #cat ceph-mds.0.log
2014-07-14 23:11:25.537494 7f90bdf7a7c0  0 ceph version 0.80.1 (a38fe1169b6d2ac98b427334c12d7cf81f809b74), process ceph-mds, pid 23081
2014-07-14 23:11:25.574128 7f90b6b44700  1 mds.-1.0 handle_mds_map standby
2014-07-14 23:11:25.598605 7f90b6b44700  1 mds.0.1 handle_mds_map i am now mds.0.1
2014-07-14 23:11:25.598649 7f90b6b44700  1 mds.0.1 handle_mds_map state change up:standby --> up:creating
2014-07-14 23:11:25.598924 7f90b6b44700  0 mds.0.cache creating system inode with ino:1
2014-07-14 23:11:25.601827 7f90b6b44700  0 mds.0.cache creating system inode with ino:100
2014-07-14 23:11:25.601852 7f90b6b44700  0 mds.0.cache creating system inode with ino:600
2014-07-14 23:11:25.601991 7f90b6b44700  0 mds.0.cache creating system inode with ino:601
2014-07-14 23:11:25.602110 7f90b6b44700  0 mds.0.cache creating system inode with ino:602
2014-07-14 23:11:25.602188 7f90b6b44700  0 mds.0.cache creating system inode with ino:603
2014-07-14 23:11:25.602261 7f90b6b44700  0 mds.0.cache creating system inode with ino:604
2014-07-14 23:11:25.602356 7f90b6b44700  0 mds.0.cache creating system inode with ino:605
2014-07-14 23:11:25.602427 7f90b6b44700  0 mds.0.cache creating system inode with ino:606
2014-07-14 23:11:25.602536 7f90b6b44700  0 mds.0.cache creating system inode with ino:607
2014-07-14 23:11:25.602626 7f90b6b44700  0 mds.0.cache creating system inode with ino:608
2014-07-14 23:11:25.602762 7f90b6b44700  0 mds.0.cache creating system inode with ino:609
2014-07-14 23:11:25.602840 7f90b6b44700  0 mds.0.cache creating system inode with ino:200
2014-07-14 23:11:27.320245 7f90b6b44700  1 mds.0.1 creating_done
2014-07-14 23:11:27.597400 7f90b6b44700  1 mds.0.1 handle_mds_map i am now mds.0.1
2014-07-14 23:11:27.597410 7f90b6b44700  1 mds.0.1 handle_mds_map state change up:creating --> up:active
2014-07-14 23:11:27.597413 7f90b6b44700  1 mds.0.1 recovery_done -- successful recovery!
2014-07-14 23:11:27.597466 7f90b6b44700  1 mds.0.1 active_start
root #ceph -s
    cluster 1798897a-f0c9-422d-86b3-d4933a12c7ac
     health HEALTH_OK
     monmap e3: 4 mons at {0=192.168.2.1:6789/0,1=192.168.2.2:6789/0,2=192.168.2.3:6789/0,3=192.168.2.4:6789/0}, election epoch 8, quorum 0,1,2,3 0,1,2,3
     mdsmap e4: 1/1/1 up {0=0=up:active}
     osdmap e20: 3 osds: 3 up, 3 in
      pgmap v86: 192 pgs, 3 pools, 1884 bytes data, 20 objects
            12703 GB used, 20423 GB / 37259 GB avail
                 192 active+clean


Creating and exporting the POSIX Filesystem

Once again, the Ceph web site offered scant detail on mounting the object store as a POSIX filesystem with cephx authentication. After a lot of googling prompted by mount error 5 and similar failures, we hit upon the necessary incantation. If you compiled the ceph network filesystem and libceph as modules, you do not need to modprobe them manually; the mount.ceph command takes care of that. You can confirm that the following two modules are loaded after your first mount:

root #lsmod | grep ceph
ceph                  125992  1 
libceph               114568  1 ceph

If you remember, the ceph.client.admin.keyring created at the beginning of the install includes an allow capability for mds operations. However, the format of that keyring file will not work with mount.ceph; we need to copy just the key value itself into an /etc/ceph/admin.secret file.

FILE /etc/ceph/ceph.client.admin.keyringexisting generated file
[client.admin]
        key = (redacted key text)
        auid = 0
        caps mds = "allow"
        caps mon = "allow *"
        caps osd = "allow *"
FILE /etc/ceph/admin.secrethacked up copy of ceph.client.admin.keyring with just the key
(redacted key text)
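Hacking the key out by hand is error-prone, and the stanza format above is simple enough to script. A minimal sketch (the helper name and the placeholder key are ours):

```python
def extract_secret(keyring_text: str, entity: str = "client.admin") -> str:
    """Pull the bare key value out of a Ceph keyring stanza,
    suitable for writing to a mount.ceph secretfile."""
    in_section = False
    for line in keyring_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("[") and stripped.endswith("]"):
            in_section = stripped[1:-1] == entity
        elif in_section and stripped.startswith("key"):
            name, _, value = stripped.partition("=")
            if name.strip() == "key":  # skip e.g. 'keyring = ...' lines
                return value.strip()
    raise KeyError(entity)

# placeholder key, not a real secret
sample = '[client.admin]\n\tkey = EXAMPLEKEYNOTREAL==\n\tcaps mds = "allow"\n'
print(extract_secret(sample))  # EXAMPLEKEYNOTREAL==
```

Redirecting that output to /etc/ceph/admin.secret reproduces the hand-edited file above.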

Because we are using nfs4, we will create /export/kroll and then use mount.ceph to mount the cluster there. The user specified is just "admin", not "client.admin". Since we have four mons up, we go ahead and specify all of them to mount the root of the ceph object namespace.

root #mkdir /export/kroll
root #mount -t ceph kroll1,kroll2,kroll3,kroll4:/ /export/kroll -o name=admin,secretfile=/etc/ceph/admin.secret
root #df
/dev/sda2                                           245506816    36207828   196804556  16% /
devtmpfs                                              8195144           0     8195144   0% /dev
tmpfs                                                 1639140        3212     1635928   1% /run
shm                                                   8195696          80     8195616   1% /dev/shm
cgroup_root                                             10240           0       10240   0% /sys/fs/cgroup
/dev/sda1                                              466248       70287      371464  16% /boot
192.168.2.1,192.168.2.2,192.168.2.3,192.168.2.4:/ 39069155328 17653448704 21415706624  46% /export/kroll

And now we have a ginormous, mostly empty filesystem hanging off of /export/kroll. The 46% shown as used comes from the non-ceph btrfs subvolumes that share the three arrays backing our object stores. We modify /etc/exports and then update the sharing to export /export/kroll.

FILE /etc/exportsmodified nfs exports map
# /etc/exports: NFS file systems being exported.  See exports(5).
/export         *(rw,sync,fsid=0,no_subtree_check,no_root_squash,crossmnt)
/export/home    *(rw,sync,no_subtree_check,no_root_squash)
/export/raid    *(rw,sync,no_subtree_check,no_root_squash)
/export/kroll   *(rw,sync,no_subtree_check,no_root_squash)
root #exportfs -a -v
exporting *:/export/kroll
exporting *:/export/raid
exporting *:/export/home
exporting *:/export