Ceph/Guide
Ceph Distributed Filesystem
Ceph is a distributed object store and filesystem designed to provide excellent performance, reliability and scalability. According to the Ceph wikipedia entry, the first stable release (Argonaut) was in 2012. It arose from a doctoral dissertation by Sage Weil at the University of California, Santa Cruz. Significant funding came from the US DOE, as the software found early adoption in clusters at Lawrence Livermore, Los Alamos and Sandia National Labs. The main commercial backing for Ceph comes from a company founded by Weil (Inktank), which was acquired by RedHat in April 2014.
The Floss Weekly podcast interviewed Sage Weil in 2013 for their 250th show. The interview was done around the time of the "Cuttlefish" release. One of the points of discussion was the need for datacenters to handle disaster recovery, and Sage pointed out that starting with Dumpling, Ceph would provide for replication between datacenters. Another bit of trivia came out in the podcast: Sage Weil was one of the inventors of the WebRing concept in the early days of the World Wide Web.
Overview
Ceph consists of four major components:
- Object Store Device
- Monitor Server
- RADOS Block Device (RBD) and RADOS Gateway
- Metadata Server providing a POSIX compliant Filesystem
Object Store Device
Two object stores mark the beginning of a Ceph cluster and they may be joined by potentially thousands more. They sit on top of an existing filesystem such as ext4, xfs, zfs or btrfs and are created and maintained by an Object Store Device daemon (OSD). While the underlying filesystem may provide for redundancy, error detection and repair on its own, Ceph implements its own layer of error detection, recovery and n-way replication. There is a tradeoff between using a RAID 1, 5, 6 or 10 scheme with the underlying filesystem and then having a single OSD server versus having individual drives and multiple OSD servers. The former provides a defense-in-depth strategy against data loss, but the latter has less of an impact on the cluster when a drive fails and requires replacement. The latter also potentially provides better performance than a software RAID or a filesystem built on top of a number of JBOD devices.
An OSD will take advantage of advanced features of the underlying filesystem such as extents, Copy On Write (COW), and snapshotting. It can make extended use of the xattr feature to store metadata about an object, but this will often exceed the 4KB xattr limitation of ext4 filesystems, in which case an alternative metadata store will be necessary. The ceph.com site documentation recommends either ext4 or xfs in production for OSDs, but zfs or btrfs are arguably better suited because of their ability to self-repair, snapshot and handle COW. Ultimately btrfs will become the preferred underlying filesystem for a Linux based OSD once the majority is satisfied that it is stable enough. If you are still unsure about btrfs, look at the performance tuning research that has already been done, such as the Bobtail testing by an Inktank employee at this page.
The task of the OSD is to handle the distribution of objects by Ceph across the cluster. The user can specify the number of copies of an object to be created and distributed amongst the OSDs. The default is 2 copies with a minimum of 1, but those values can be increased up to the number of OSDs that are implemented. Since this redundancy is on top of whatever may be provided by the underlying RAID arrays, the cluster enjoys an added layer of protection that guards against catastrophic failure of a disk array. When a drive array fails, only the OSD or OSDs that make use of it are brought down.
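As a concrete illustration, these per-pool replica counts can also be changed at runtime with the ceph admin command described later; the pool name rbd used here is just the default pool a fresh cluster creates:
root #
ceph osd pool set rbd size 3
root #
ceph osd pool set rbd min_size 2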
Objects are broken down into extents, or shards, when distributed instead of being treated as a single entity. In a 2-way replication scheme where there are more than 2 OSD servers, an object's shards will actually end up distributed across potentially all of the OSD servers.
An OSD server also implements a journal (typically 1-10GB) which can be a file or a raw device. The default journal goes into the same filesystem as the rest of an object store, but this is not optimal for either performance or fault tolerance. When implementing OSDs on a host, consider dedicating a drive to handle just journals; an SSD would be a huge performance boost for this purpose. If your system drive is an SSD, consider using that for journals if you can't dedicate a drive to them. Otherwise, partition off a 1-10GB section of each drive that will hold an OSD filesystem, and place the journal of one OSD on the small partition of a drive whose remaining space backs a different OSD.
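As a rough sketch only (the host name and device path here are hypothetical), the journal location can be pointed at such a dedicated partition with a per-OSD stanza in /etc/ceph/ceph.conf:
[osd.0]
    host = examplehost
    # hypothetical SSD partition reserved for this OSD's journal
    osd journal = /dev/sdc1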
Although Ceph will work with prebuilt RAID arrays, the Inktank developers suggest that "you are doing it wrong" if you take this approach. They would rather that you have one disk device per OSD and thus prefer JBODs to hardware or software based RAIDs. In our example cluster we implemented Ceph on top of pre-existing BTRFS based software RAIDs. We will eventually explore taking down the old arrays one at a time to re-do them as individual OSD servers to look at the performance impact.
Monitor Server
OSDs are watched over by Monitor Servers (MONs) which act as the coordinators for object traffic. The initial Ceph Cluster would consist of a MON and two OSD servers, and this is the example used in their documentation for a quick install. They also talk about an admin server, but this is only a system which is able to painlessly remote into the cluster members using ssh authorized_keys. The admin server would be the system that the user has set up to run Chef, Puppet or other control systems that oversee the operation of the cluster.
A single MON would be a single point of failure for Ceph, so it is recommended that the Ceph Cluster be run with an odd number of MONs, with a minimum of 3 running to establish a quorum. For performance reasons, MONs should be put on a separate filesystem or device from OSDs because they tend to do a lot of fsyncs. Although they are typically shown as running on dedicated hosts, they can share a host with an OSD and often do in order to have enough MON servers for a decent quorum. MONs don't need a lot of storage space, so it is perfectly fine to have them run on the system drive, while the OSD takes over whatever large disk or array is in the server. If you dedicate an SSD to handle OSD journals, the MON storage will only require another 2GB or so.
MONs coordinate object traffic by implementing the Controlled Replication Under Scalable Hashing (CRUSH) map. This is an algorithm that computes the locations for storing objects in the OSD pools. MONs also keep track of the map of daemons running the various flavors of Ceph server in the cluster. An "Initial Members" setting allows the user to specify the minimum number of MON servers that must be running in order to form a quorum. When there are not enough MONs to form a quorum, the Ceph cluster will stop processing until a quorum is re-established in order to avoid a "split-brain" situation.
The CRUSH map defaults to an algorithm that automatically computes where in the OSDs an object's shards should be placed, but it can be influenced by additional human-specified policies (see the example rule sketch after this list). This way, a site administrator can sway CRUSH when making choices such as:
- use the site's faster OSDs by default
- divide OSDs into "hot" (SSD based), "normal" and "archival" (slow or tape backed) storage
- localize replication to OSDs sitting on the same switch or subnet
- prevent replication to OSDs on the same rack to avoid downtime when an entire rack has a power failure
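As a rough illustration of what such a policy looks like (this rule is a made-up sketch, not taken from the example cluster below, and assumes a CRUSH root named ssd has been defined), a rule in a decompiled CRUSH map that keeps replicas on SSD-backed OSDs and spreads them across hosts might read:
rule ssd-only {
    ruleset 1
    type replicated
    min_size 1
    max_size 10
    # start placement from the hypothetical "ssd" root of the hierarchy
    step take ssd
    # pick one OSD per host so two replicas never share a host
    step chooseleaf firstn 0 type host
    step emit
}
Such a rule is then assigned to a pool, and CRUSH applies it when computing placements.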
It is this spreading out of the load with the CRUSH map that allows Ceph to scale up to thousands of OSDs so easily while increasing performance as new stores are added. Because of the spreading, the bottleneck shifts from raw disk performance (roughly 100 MB/s for a SATA drive, for example) to the bandwidth capacity of your network and switches.
There are a number of ways to work with the MON pool and its Paxos-backed database to monitor and administer the cluster, but the most common is the /usr/bin/ceph
command. This is a Python script that uses a number of Ceph-supplied Python modules and JSON to communicate with the MON pool.
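For example, most status queries can emit JSON for scripting as well as the human-readable form; a couple of typical invocations (output omitted) are:
root #
ceph health
root #
ceph --format json-pretty status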
RADOS Block Device and RADOS Gateway
Ceph provides a kernel module for the RADOS Block Device (RBD) and a librados library which libvirt and KVM can be linked against. This is essentially a virtual disk device that distributes its "blocks" across the OSDs in the Ceph cluster. An RBD provides the following capabilities:
- thin provisioning
- i/o striping and redundancy across the Cluster
- resizeable
- snapshot with revert capability
- directly useable as a KVM guest's disk device
- a variant of COW where a VM starts with a "golden image" which the VM diverges from as it operates
- Data Replication between datacenters starting with the Dumpling Release
A major selling point for the RBD is the fact that it can be used as a virtual machine's drive store in KVM. Because it spans the OSD server pool, the guest can be hot migrated between cluster CPUs by literally shutting the guest down on one CPU and booting it on another. Libvirt and Virt-Manager have provided this support for some time now, and it is probably one of the main reasons why RedHat (a major sponsor of QEMU/KVM, Libvirt and Virt-Manager) has acquired Inktank.
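As a brief sketch (the pool and image names here are made up for the example, and qemu-img must have been built with rbd support), an image can be created and inspected like this:
root #
rbd create vmdisk --size 10240 --pool rbd
root #
qemu-img info rbd:rbd/vmdisk
A libvirt guest can then reference the same image as a network disk of type rbd in its domain XML.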
The RBD and the RADOS Gateway provide the same sort of functionality for Cloud Services as Amazon S3 and OpenStack Swift. The early adopters of Ceph were interested primarily in Cloud Service object stores. Cloud Services also drove the initial work on replication between datacenters.
Metadata Server
Ceph provides a MetaData Server (MDS) which presents a more traditional, POSIX-compliant filesystem whose contents translate into objects stored in the OSD pool. This is typically where a non-Linux platform can implement client support for Ceph. The filesystem can be shared via CIFS and NFS to non-Ceph and non-Linux based systems, including Windows. This is also the way to use Ceph as a drop-in replacement for Hadoop's HDFS. The filesystem component started to mature around the Dumpling release.
Ceph requires all of its servers to be able to see each other directly in the cluster. So this filesystem would also be the point where external systems would be able to see the content without having direct access to the Ceph Cluster. For performance reasons, the user may have all of the Ceph cluster participants using a dedicated network on faster hardware with isolated switches. The MDS server would then have multiple NICs to straddle the Ceph network and the outside world.
As of the Firefly release, there is only one active MDS server at a time. Other MDS servers run in a standby mode to quickly perform a failover when the active server goes down. The cluster will take about 30 seconds to determine whether the active MDS server has failed. This may appear to be a bottleneck for the cluster, but the MDS only does the mapping of POSIX file names to object ids. With an object id, a client then directly contacts the OSD servers to perform the necessary i/o of extents/shards.
Eventually Ceph will allow multiple active MDS servers, dividing the POSIX filesystem namespace with a mapping scheme that distributes the load.
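As a sketch of how a client would attach once an MDS is up (the monitor address matches the example cluster described later; the secretfile path is hypothetical and would hold the admin key), the POSIX filesystem can be mounted with either the kernel client or ceph-fuse:
root #
mount -t ceph 192.168.2.2:6789:/ /mnt/cephfs -o name=admin,secretfile=/etc/ceph/admin.secret
root #
ceph-fuse -m 192.168.2.2:6789 /mnt/cephfs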
Installation
As of this writing the stable version of Ceph in portage is ceph-0.56.3
which corresponds to a midway rev of the second major release of Ceph code named "Bobtail". In gentoo unstable are versions of the follow-on major Ceph updates up to the current major version "Firefly":
- ceph-0.56.x "Bobtail"
- ceph-0.67.x "Cuttlefish"
- ceph-0.72.x "Dumpling"
- ceph-0.80.x "Firefly"
- ceph-0.87.x "Giant"
The ceph site online archive only shows release downloads back to Cuttlefish. Also the MDS server doesn't begin to stabilize until around Dumpling. We decided to unmask unstable ceph
in our package.keywords
file along with its dependencies and ended up building with Firefly before doing our installation.
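A minimal sketch of such a keyword entry (the exact set of dependencies needing keywords will vary with your profile and the state of the portage tree at the time):
/etc/portage/package.keywords
Hypothetical entry to pull in unstable ceph
sys-cluster/ceph ~amd64
# a pretend run such as 'emerge -pv ceph' will list any dependencies
# that still need their own ~amd64 entries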
Outside of Portage, there is a development only release on the ceph site (0.84.x) and an upcoming release "Giant" due to drop, probably around Q4 2014. In the weeks since the creation of this wiki entry and the current edit, there have already been a number of ebuild updates to Firefly which we have installed. The example cluster was initially installed with 0.80.1 and is now running 0.80.5.
root #
ACCEPT_KEYWORDS="~amd64" emerge -pv ceph
These are the packages that would be merged, in order: Calculating dependencies... done! [ebuild N ] sys-apps/hdparm-9.43 127 kB [ebuild N ] sys-libs/libunwind-1.1:7 USE="libatomic lzma static-libs -debug -debug-frame" 1,073 kB [ebuild N ] dev-util/google-perftools-2.0-r2:0/4 USE="debug static-libs -largepages -minimal {-test}" 1,232 kB [ebuild N ] dev-libs/leveldb-1.15.0 USE="snappy static-libs tcmalloc" 201 kB [ebuild N ] dev-python/chardet-2.2.1 PYTHON_TARGETS="python2_7 python3_3 -pypy (-python2_6) -python3_2 -python3_4" 177 kB [ebuild N ] dev-python/itsdangerous-0.24 PYTHON_TARGETS="python2_7 python3_3 -pypy (-python2_6) -python3_2 -python3_4" 46 kB [ebuild N ] dev-python/blinker-1.3 USE="doc {-test}" PYTHON_TARGETS="python2_7 python3_3 -pypy (-python2_6) -python3_2 -python3_4" 90 kB [ebuild N ] dev-python/requests-2.3.0 PYTHON_TARGETS="python2_7 python3_3 -pypy -python3_4" 420 kB [ebuild N ] dev-python/werkzeug-0.9.4 PYTHON_TARGETS="python2_7 python3_3 -pypy -python3_4" 1,100 kB [ebuild N ] dev-python/flask-0.10.1-r1 USE="-examples {-test}" PYTHON_TARGETS="python2_7 python3_3 -pypy -python3_4" 532 kB [ebuild N ] sys-kernel/spl-0.6.2-r5 USE="-custom-cflags -debug -debug-log" 219 kB [ebuild N ] sys-fs/zfs-kmod-0.6.2-r5 USE="rootfs -custom-cflags -debug" 1,585 kB [ebuild N ] sys-fs/zfs-0.6.2-r5 USE="bash-completion rootfs static-libs -custom-cflags -debug (-kernel-builtin) (-selinux) -test-suite" PYTHON_TARGETS="python2_7 python3_3 (-python2_6) -python3_2" 0 kB [ebuild U ] sys-cluster/ceph-0.80.1 [0.56.3] USE="fuse gtk libaio%* libatomic nss%* radosgw static-libs xfs%* zfs%* -cryptopp% -debug -tcmalloc" 3,996 kB Total: 14 packages (1 upgrade, 13 new), Size of downloads: 10,793 kB
The USE flags
libaio
and radosgw
are local to the ceph ebuild. Also even if you do build Ceph with libaio support, an OSD will not use asynchronous i/o by default.
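A hypothetical package.use entry matching the flags shown in the emerge output above might look like this:
/etc/portage/package.use
Hypothetical USE flag entry for ceph
sys-cluster/ceph fuse libaio radosgw xfs zfs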
Kernel Config
If you want to use the RADOS block device, you will need to put that into your kernel .config as either a module or baked in. Ceph itself will want to have FUSE support enabled if you want to work with the POSIX filesystem component and you will also want to include the driver for that in Network File Systems. For your backend object stores, you will want to have xfs support because of the xattr limitations in Ext4 and btrfs because it really is becoming stable now.
Device Drivers
    Block devices
        Rados block device (RBD)
File systems
    XFS filesystem support
        XFS POSIX ACL support
    Btrfs filesystem support
        Btrfs POSIX Access Control Lists
    FUSE (Filesystem in Userspace) support
    Network File Systems
        Ceph distributed file system
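One way to sanity-check a running kernel for these options, assuming the kernel was built with /proc/config.gz support:
root #
zgrep -E 'CONFIG_BLK_DEV_RBD|CONFIG_CEPH_FS|CONFIG_FUSE_FS|CONFIG_XFS_FS|CONFIG_BTRFS_FS' /proc/config.gz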
Network Config
Ceph is sensitive to IP address changes, so you should make sure that all of your Ceph servers are assigned static IP addresses. You also may want to proactively treat the Ceph cluster members as an independent subnet from your existing network by multi-homing your existing network adapters as necessary. That way, if an ISP change or other topology changes are needed, you can keep your cluster setup intact. It also gives you the luxury of migrating the ceph subnet later on to dedicated NICs, switches and faster hardware such as 10Gbit Ethernet or Infiniband. If the cluster subnet is small enough, consider keeping the hostnames in your /etc/hosts files, at least until things grow to the point where a pair of DNS servers among the cluster members becomes a compelling solution.
Ceph MON servers require accurate (or at least synchronized) system clocks and will mark the cluster with HEALTH_WARN if the pool detects that the servers are not within a second or two of each other. If you haven't already installed a time synchronization mechanism such as NTP, you really want to get that installed and configured before building the cluster.
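If a time service is not yet in place, something along these lines on each node will do, assuming the standard net-misc/ntp package:
root #
emerge --ask net-misc/ntp
root #
rc-update add ntp-client default
root #
rc-update add ntpd default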
The
/etc/init.d/ceph
script for openrc only has a depend "after net" in place at the moment. It is quite possible that the ceph servers on a host will be started before ntp-client and ntpd (or your preferred time service). We have opened a bug report with the Gentoo ceph team to add "after ntp-client" to the depend function in order to avoid having to restart ceph services after a boot if the clock was off.
We will be using four hosts in our Example implementation. All four will be MON servers with an initial quorum of 3 so that we can safely avoid a "split-brain" situation and still be able to run the cluster when a single server is rebooted.
/etc/conf.d/net
Example with a multi-homed eth0 for host kroll1
#
# An example multi-homed eth0 where 192.168.1 subnet is the entire LAN and access to the outside world
# The 192.168.2 subnet is dedicated to the ceph cluster
#
config_eth0="192.168.1.10/24 192.168.2.1/24"
routes_eth0="default via 192.168.1.1"
dns_domain_eth0="example.com"
dns_servers_eth0="192.168.1.2 192.168.1.3"
dns_search_eth0="example.com"
/etc/hosts
Example of a 4 node ceph cluster with hosts kroll1-4
# /etc/hosts: Local Host Database
#
# This file describes a number of aliases-to-address mappings for the
# local hosts that share this file.
#
# In the presence of the domain name service or NIS, this file may not be
# consulted at all; see /etc/host.conf for the resolution order.
#
# IPv4 and IPv6 localhost aliases
127.0.0.1     localhost
::1           localhost

192.168.2.1   kroll1
192.168.2.2   kroll2
192.168.2.3   kroll3
192.168.2.4   kroll4
Our Example Ceph Cluster
We have chosen to roll out Ceph on a portion of our home network. The four kroll hosts are as follows:
- kroll1 (aka Thufir) - An AMD FX9590 8 core CPU with 32GB of memory, 256GB SSD root drive and a 4x4TB SATA array formatted as a RAID5 btrfs with the default volume mounted on
/thufirraid
. kroll1 will act as our admin server since the ssh keys for its root user have been pushed out to the other nodes in their/root/.ssh/authorized_keys
files. Kroll1 will act as a MON and OSD server since only slightly less than half of the btrfs array has been used.
- kroll2 (aka Figo) - An AMD FX8350 8 core CPU with 16GB of memory, 256GB SSD root drive and a 4x3TB SATA array formatted as btrfs RAID1. Kroll2 will act as a MON and MDS server. We will not do an OSD server here since the array is already over 90% capacity. Also the Ceph developers have suggested that it is not a wise idea to run an MDS and an OSD on the same node.
- kroll3 (aka Mater) - An AMD FX8350 8 core CPU with 16GB of memory, 256GB SSD and 4x4TB SATA array formatted as a RAID5 btrfs and default volume mounted on
/materraid
. /materraid was being kept as a mirror of /thufirraid using rsync -av --delete
on a periodic basis. kroll3 will become a MON and an OSD server.
- kroll4 (aka Tube) - An AMD A10-7850K APU with 16GB of memory, 256GB SSD and a 2x4TB SATA array formatted as a btrfs RAID1 mirror set with its default volume mounted on
/tuberaid
. As its real name suggests, kroll4 was originally set up as a MythTV box, but its filesystem is only averaging about 10% of its capacity. We will thus use kroll4 as a MON and OSD server.
Thufir, Mater and Tube run a gentoo stable desktop profile and are currently on kernel 3.12.21-gentoo-r1
. Thufir and Mater are btrfs installs with new drives using the latest btrfs code btrfs-progs-3.12-r1
. Tube has been up and running a bit longer and had its btrfs array built under kernel 3.10.25-gentoo
. Figo runs a gentoo unstable desktop profile and is currently on kernel 3.13.7-gentoo
. It is also being used as an rsync mirror for thufir, but its array has been running for about 2 years. The reason for /figoraid
running at 90% capacity is both due to older smaller drives (3TB versus 4TB) and also because the version of btrfs available when it was built did not yet include RAID5 support.
Editing the ceph config file
We will be following the manual guide for ceph installation on their site. There is also a Python-based script called ceph-deploy which is packaged for a number of distros but not directly available for gentoo. If you can manage to get it working, it would automate a good bit of the process of rolling out a server from your admin node.
/etc/ceph/ceph.conf
Initial ceph config file
[global]
    # For version 0.54 and earlier, you may enable
    # authentication with the following setting.
    # Specifying `cephx` enables authentication;
    # and specifying `none` disables authentication.
    #auth supported = cephx

    # For version 0.55 and beyond, you must explicitly enable
    # or disable authentication with "auth" entries in [global].
    cluster = ceph
    auth cluster required = cephx
    auth service required = cephx
    auth client required = cephx

[osd]
    osd journal size = 1000

    # uncomment the following line if you are mounting with ext4
    # filestore xattr use omap = true

    # For Bobtail (v 0.56) and subsequent versions, you may
    # add settings for mkcephfs so that it will create and mount
    # the file system for you. Remove the comment `#` character for
    # the following settings and replace the values in parenthesis
    # with appropriate values, or leave the following settings commented
    # out to accept the default values. You must specify the --mkfs
    # option with mkcephfs in order for the deployment script to
    # utilize the following settings, and you must define the 'devs'
    # option for each osd instance; see below.

    #osd mkfs type = {fs-type}
    #osd mkfs options {fs-type} = {mkfs options}   # default for xfs is "-f"
    #osd mount options {fs-type} = {mount options} # default mount option is "rw, noatime"
We use uuidgen
to generate a new random uuid for the entire cluster. We will rename the cluster name from the default ceph
to kroll
to match our host naming scheme. We specify the 192.168.2 network to be the "public" network for the cluster. Other default settings come from the manual install url mentioned earlier, including a default to replicate two copies of each object with a minimum of 1 copy allowed when the cluster is in "degraded" state.
We set a journal size in the osd
global section but leave the filestore stanza commented out since we will be using btrfs for object stores instead of ext4. We also added a little extra language to this to clarify exactly what it means.
We add a mon
global section where we specify the list of hostnames that will act as mons along with their corresponding ip addresses. The port number 6789
is the standard well-known port used by Ceph monitors. The initial members stanza specifies the three hosts which will be necessary to form an initial quorum. The list of numbers corresponds to the list of mon sections that will follow for mon.0
, mon.1
etc. These will also be used by the /etc/init.d/ceph startup script when figuring out which services are to be started for a host.
The numeric mon convention is new in the Firefly release. Earlier releases used alphabetical letters instead (e.g.
mon.a
, etc) for the mon
sections.
The rest of the ceph.conf
file consists of subsections for the various mon, osd and mds servers that we will be implementing.
root #
uuidgen
1798897a-f0c9-422d-86b3-d4933a12c7ac
/etc/ceph/ceph.conf
Customized ceph config file
[global]
    # For version 0.54 and earlier, you may enable
    # authentication with the following setting.
    # Specifying `cephx` enables authentication;
    # and specifying `none` disables authentication.
    #auth supported = cephx

    # For version 0.55 and beyond, you must explicitly enable
    # or disable authentication with "auth" entries in [global].
    fsid = 1798897a-f0c9-422d-86b3-d4933a12c7ac
    cluster = kroll
    public network = 192.168.2.0/24
    auth cluster required = cephx
    auth service required = cephx
    auth client required = cephx
    osd journal size = 1024
    #
    # default to replicating two copies of each object. Allow a minimum
    # of one copy to be made when number of object stores is degraded
    #
    osd pool default size = 2
    osd pool default min size = 1
    #
    # defaults from http://ceph.com/docs/master/install/manual-deployment/
    #
    osd pool default pg num = 333
    osd pool default pgp num = 333
    osd crush chooseleaf type = 1

[osd]
    osd journal size = 1024
    #
    # This is a crutch for ext4 filesystems with limit of 4kb xattr data
    # btrfs has no bounds for xattr store
    #
    # uncomment the following line if you are mounting with ext4
    #filestore xattr use omap = true

    # For Bobtail (v 0.56) and subsequent versions, you may
    # add settings for mkcephfs so that it will create and mount
    # the file system for you. Remove the comment `#` character for
    # the following settings and replace the values in parenthesis
    # with appropriate values, or leave the following settings commented
    # out to accept the default values. You must specify the --mkfs
    # option with mkcephfs in order for the deployment script to
    # utilize the following settings, and you must define the 'devs'
    # option for each osd instance; see below.

    #osd mkfs type = {fs-type}
    #osd mkfs options {fs-type} = {mkfs options}   # default for xfs is "-f"
    #osd mount options {fs-type} = {mount options} # default mount option is "rw, noatime"

[mon]
    mon host = kroll1, kroll2, kroll3, kroll4
    mon addr = 192.168.2.1:6789, 192.168.2.2:6789, 192.168.2.3:6789, 192.168.2.4:6789
    mon initial members = 0,1,2

[mon.0]
    host = kroll1
    mon addr = 192.168.2.1:6789

[mon.1]
    host = kroll2
    mon addr = 192.168.2.2:6789

[mon.2]
    host = kroll3
    mon addr = 192.168.2.3:6789

[mon.3]
    host = kroll4
    mon addr = 192.168.2.4:6789

[osd.0]
    host = kroll1

[osd.1]
    host = kroll3

[osd.2]
    host = kroll4

[mds.0]
    host = kroll2
After editing the file, we copy it around to the other cluster members from our admin node kroll1 using scp
root #
scp ceph.conf root@kroll2:/etc/ceph
ceph.conf 100% 2408 2.4KB/s 2.4KB/s 00:00
root #
scp ceph.conf root@kroll3:/etc/ceph
ceph.conf 100% 2408 2.4KB/s 2.4KB/s 00:00
root #
scp ceph.conf root@kroll4:/etc/ceph
ceph.conf 100% 2408 2.4KB/s 2.4KB/s 00:00
/etc/conf.d/ceph file
There is also a conf file for the ceph service, but as of the Firefly release, there is only the location of the conf file to specify. This is because in previous releases, there was only an /etc/init.d/ceph script that needed to be worked with. The single script would start up or shut down all of the services enabled for the site at once. The ebuild maintainer changed this for Firefly to use renamed softlinks of /etc/init.d/ceph
in order to specify the running of individual Ceph services. It is somewhat similar to what gentoo does with net.lo
when enabling network devices.
/etc/conf.d/ceph
Barebones ceph file for Firefly release
# Example
# default ceph conf file
ceph_conf="/etc/ceph/ceph.conf"
/etc/init.d/ceph script
As noted earlier when discussing MON servers, Ceph is dependent on clocks that are synchronized across the cluster. Until the ebuild maintainers update the script, consider editing your /etc/init.d/ceph
script's depend function to include whatever service you may have running that has synched your clock. In this example edit to the firefly version of the script, we have added "after ntp-client" since we use the standard ntp ebuild and have ntp-client and ntp services in our default runlevel.
/etc/init.d/ceph
openrc ceph script for firefly edited to depend on ntp-client
#!/sbin/openrc-run

ceph_conf="${ceph_conf:-/etc/ceph/ceph.conf}"
extra_commands="reload"

type=${RC_SVCNAME:5:3}
id=${RC_SVCNAME:9}
id=${id:-"0"}

command="/usr/bin/ceph-${type}"
pidfile="/run/ceph/${type}.${id}.pid"
command_args="-i ${id} --pid-file ${pidfile} -c ${ceph_conf}"

depend() {
    after net
    after ntp-client
    before netmount
}

is_type_valid() {
    case ${type} in
        mon|mds|osd) return 0;;
        *) return 1;;
    esac
}

start_pre() {
    checkpath -d -q $(dirname ${pidfile})
}

start() {
    ebegin "Starting Ceph ${type}.${id}"
    if ! is_type_valid; then
        eerror "Please give valid Ceph Server Type: mds, mon, osd"
        return 1
    fi
    ${command} ${command_args}
    eend $?
}

stop() {
    ebegin "Stopping Ceph ${type}.${id}"
    start-stop-daemon --stop --pidfile ${pidfile}
    eend $?
}

reload() {
    ebegin "Reloading Ceph ${type}.${id}"
    start-stop-daemon --signal 1 --pidfile ${pidfile}
    eend $?
}
Creating Keyrings For MON rollout
Ceph uses its own shared secret concept when handling communications among cluster members. We must generate keyring files that will then be distributed out to the servers that will be set up among the cluster members. The keyrings are generated by the ceph-authtool
command. The first keyring is for the mon servers. The manual install url has it going to a file on /tmp
, but we are more inclined to keep it around by parking it in /etc/ceph
root #
ceph-authtool --create-keyring /etc/ceph/ceph.mon.keyring --gen-key -n mon. --cap mon 'allow *'
creating /etc/ceph/ceph.mon.keyring
The result is a readable text file:
/etc/ceph/ceph.mon.keyring
ceph-authtool generated file
[mon.]
    key = (redacted key text)
    caps mon = "allow *"
Next we create an admin keyring file which goes into /etc/ceph/ceph.client.admin.keyring
.
root #
ceph-authtool --create-keyring /etc/ceph/ceph.client.admin.keyring --gen-key -n client.admin --set-uid=0 --cap mon 'allow *' --cap osd 'allow *' --cap mds 'allow'
creating /etc/ceph/ceph.client.admin.keyring
The resulting text file may actually be shorter than the complicated command line used to create it. The redacted key here is the same as the one that appears in our mon keyring, so it must be based on the uuid we parked in the /etc/ceph/ceph.conf
config file.
/etc/ceph/ceph.client.admin.keyring
ceph-authtool generated admin file
[client.admin]
    key = (redacted key text)
    auid = 0
    caps mds = "allow"
    caps mon = "allow *"
    caps osd = "allow *"
The default ownership for the client.admin.keyring file is
root:root
and mode 600
. You might consider changing the mode to either 660
or 640
and then changing the group to something like disk
. This will allow non-root users whom you trust to do disk maintenance (i.e. mount/unmount) to use ceph admin commands.
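For example:
root #
chgrp disk /etc/ceph/ceph.client.admin.keyring
root #
chmod 640 /etc/ceph/ceph.client.admin.keyring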
This next command is just as annoying, because it wasn't until after running it that we discovered that the auth tool basically just appended the client admin keyring file contents to the mon keyring file.
root #
ceph-authtool /etc/ceph/ceph.mon.keyring --import-keyring /etc/ceph/ceph.client.admin.keyring
importing contents of /etc/ceph/ceph.client.admin.keyring into /etc/ceph/ceph.mon.keyring
/etc/ceph/ceph.mon.keyring
mon keyring with client admin contents
[mon.]
    key = (redacted)
    caps mon = "allow *"
[client.admin]
    key = (redacted)
    auid = 0
    caps mds = "allow"
    caps mon = "allow *"
    caps osd = "allow *"
We push the mon and client keyrings out to /etc/ceph
on the other kroll hosts.
root #
scp ceph.mon.keyring ceph.client.admin.keyring kroll2:/etc/ceph
ceph.mon.keyring 100% 214 0.2KB/s 0.2KB/s 00:00 ceph.client.admin.keyring 100% 137 0.1KB/s 0.2KB/s 00:00
root #
scp ceph.mon.keyring ceph.client.admin.keyring kroll3:/etc/ceph
ceph.mon.keyring 100% 214 0.2KB/s 0.2KB/s 00:00 ceph.client.admin.keyring 100% 137 0.1KB/s 0.2KB/s 00:00
root #
scp ceph.mon.keyring ceph.client.admin.keyring kroll4:/etc/ceph
ceph.mon.keyring 100% 214 0.2KB/s 0.2KB/s 00:00 ceph.client.admin.keyring 100% 137 0.1KB/s 0.2KB/s 00:00
Creating the initial monmap file
The OSD and MDS servers use the /etc/ceph/ceph.conf
for discovering MON servers, but the MON servers themselves have a much stricter consistency scheme in order to form and maintain their quorum. When up and running, the quorum uses a majority-rule consensus algorithm called Paxos, but MONs do work with an initial binary file called a monmap when you first set up the Ceph cluster.
The manual deployment page covers the example where only a single MON is used to form the quorum. We referred to the Monitor Config reference page and the Monitor Bootstrap page it refers to when creating our scheme with an initial quorum of three MONs.
The monmaptool
command is used to create the initial monmap binary file. We essentially give it the addresses corresponding to our mon.0, mon.1, mon.2
initial members
and the cluster fsid from /etc/ceph/ceph.conf
file. We will park this file in /etc/ceph
and then pass it around to the right place when we configure our MON hosts.
root #
monmaptool --create --add 0 192.168.2.1 --add 1 192.168.2.2 --add 2 192.168.2.3 --fsid 1798897a-f0c9-422d-86b3-d4933a12c7ac initial-monmap
monmaptool: monmap file initial-monmap monmaptool: set fsid to 1798897a-f0c9-422d-86b3-d4933a12c7ac monmaptool: writing epoch 0 to initial-monmap (3 monitors)
We push the initial monmap file out to the other /etc/ceph
directories on kroll2, 3 and 4.
root #
scp initial-monmap kroll2:/etc/ceph
initial-monmap 100% 481 0.5KB/s 0.5KB/s 00:00
root #
scp initial-monmap kroll3:/etc/ceph
initial-monmap 100% 481 0.5KB/s 0.5KB/s 00:00
root #
scp initial-monmap kroll4:/etc/ceph
initial-monmap 100% 481 0.5KB/s 0.5KB/s 00:00
Creating kroll1 server mon.0
Ceph servers look for their file trees in /var/lib/ceph
. Mon servers look for their server id number subtree under /var/lib/ceph/mon/ceph-N
where N is the id # that we designated for the server in /etc/ceph/ceph.conf
. Kroll1 will host mon.0, so we create /var/lib/ceph/mon/ceph-0
for it. This implies that we will be using our 256GB SSD root system device for mon.0's i/o. Later on when we create the OSD for kroll1, we will be creating and mounting a btrfs subvolume for it to use. Otherwise the object store would default to eating us out of house and home on our system drive!
root #
mkdir -p /var/lib/ceph/mon/ceph-0
Before continuing on, you may want to look at /var/log/ceph
to clear out anything that may be in there. The next command will create an empty /var/log/ceph/ceph-mon.0.log
file if it doesn't already exist.
root #
cd /var/log/ceph
root #
rm -rf *
root #
cd /etc/ceph
The ceph-mon
command will populate the ceph-0 directory with a copy of our ceph.mon.keyring file renamed to keyring
and a store.db
directory tree, a leveldb-backed monitor database seeded from the contents of the initial monmap file.
root #
ceph-mon --mkfs -i 0 --monmap initial-monmap --keyring ceph.mon.keyring
ceph-mon: set fsid to 1798897a-f0c9-422d-86b3-d4933a12c7ac ceph-mon: created monfs at /var/lib/ceph/mon/ceph-0 for mon.0
root #
cd /var/lib/ceph/mon/ceph-0
root #
ls -lR
.: total 8 -rw-r--r-- 1 root root 77 Jul 14 13:13 keyring drwxr-xr-x 2 root root 4096 Jul 14 13:13 store.db ./store.db: total 12 -rw-r--r-- 1 root root 1069 Jul 14 13:13 000003.log -rw-r--r-- 1 root root 16 Jul 14 13:13 CURRENT -rw-r--r-- 1 root root 0 Jul 14 13:13 LOCK -rw-r--r-- 1 root root 50 Jul 14 13:13 MANIFEST-000002
root #
cat keyring
[mon.] key = (redacted) caps mon = "allow *"
We set up the mon.0 server startup in /etc/init.d by softlinking. Examination of the ceph script reveals that the gentoo ebuild developer was merely picking the server type and id # from string positions in the script file name. Thus it isn't crucial to use the "." and the "-" as we did here, but it does make things readable:
root #
cd /etc/init.d
root #
ln -s ceph ceph.mon-0
root #
rc-update add ceph.mon-0 default
* service ceph.mon-0 added to runlevel default
We won't start the server yet until after the other mon hosts have been set up. Otherwise it would just sit stalled looking for its friends and periodically complaining into its log file in
/var/log/ceph
This author is not contemplating a move to
systemd
any time soon after prior ugly experiences with it when it first started to appear on Fedora. It is left as an exercise to the reader to figure out what sort of scheme will be necessary when they are not using openrc
We repeated the same process to create mon.1, mon.2
and mon.3
on the other kroll member hosts.
kroll2
root #
mkdir -p /var/lib/ceph/mon/ceph-1
root #
ceph-mon --mkfs -i 1 --monmap initial-monmap --keyring ceph.mon.keyring
ceph-mon: set fsid to 1798897a-f0c9-422d-86b3-d4933a12c7ac ceph-mon: created monfs at /var/lib/ceph/mon/ceph-1 for mon.1
root #
cd /var/lib/ceph/mon
root #
ls -l ceph-1
total 8 -rw-r--r-- 1 root root 77 Jul 14 13:44 keyring drwxr-xr-x 2 root root 4096 Jul 14 13:44 store.db
root #
cd /etc/init.d
root #
ln -s ceph ceph.mon-1
root #
rc-update add ceph.mon-1 default
* service ceph.mon-1 added to runlevel default
kroll3
root #
mkdir -p /var/lib/ceph/mon/ceph-2
root #
ceph-mon --mkfs -i 2 --monmap initial-monmap --keyring ceph.mon.keyring
ceph-mon: set fsid to 1798897a-f0c9-422d-86b3-d4933a12c7ac ceph-mon: created monfs at /var/lib/ceph/mon/ceph-2 for mon.2
root #
cd /etc/init.d
root #
ln -s ceph ceph.mon-2
root #
rc-update add ceph.mon-2 default
* service ceph.mon-2 added to runlevel default
kroll4
root #
mkdir -p /var/lib/ceph/mon/ceph-3
root #
ceph-mon --mkfs -i 3 --monmap initial-monmap --keyring ceph.mon.keyring
ceph-mon: set fsid to 1798897a-f0c9-422d-86b3-d4933a12c7ac ceph-mon: created monfs at /var/lib/ceph/mon/ceph-3 for mon.3
root #
cd /etc/init.d
root #
ln -s ceph ceph.mon-3
root #
rc-update add ceph.mon-3 default
* service ceph.mon-3 added to runlevel default
Starting the Mon Servers
With all four kroll hosts configured with mons, we now go back and start the services beginning with mon.0
on kroll1.
root #
cd /var/log/ceph
root #
/etc/init.d/ceph.mon-0 start
* Caching service dependencies ... [ ok ] * Starting Ceph mon.0 ... [ ok ]
root #
ls -l
total 4 -rw-r--r-- 1 root root 1411 Jul 14 14:55 ceph-mon.0.log
root #
cat ceph-mon.0.log
2014-07-14 14:55:49.806317 7f7928c81780 0 ceph version 0.80.1 (a38fe1169b6d2ac98b427334c12d7cf81f809b74), process ceph-mon, pid 25102 2014-07-14 14:55:49.815592 7f7928c81780 0 mon.0 does not exist in monmap, will attempt to join an existing cluster 2014-07-14 14:55:49.815719 7f7928c81780 0 using public_addr 192.168.2.1:0/0 -> 192.168.2.1:6789/0 2014-07-14 14:55:49.815751 7f7928c81780 0 starting mon.0 rank -1 at 192.168.2.1:6789/0 mon_data /var/lib/ceph/mon/ceph-0 fsid 1798897a-f0c9-422d-86b3-d4933a12c7ac 2014-07-14 14:55:49.815892 7f7928c81780 1 mon.0@-1(probing) e0 preinit fsid 1798897a-f0c9-422d-86b3-d4933a12c7ac 2014-07-14 14:55:49.815927 7f7928c81780 1 mon.0@-1(probing) e0 initial_members 0,1,2, filtering seed monmap 2014-07-14 14:55:49.817013 7f7928c81780 0 mon.0@-1(probing) e0 my rank is now 0 (was -1) 2014-07-14 14:55:49.817558 7f7928c79700 0 -- 192.168.2.1:6789/0 >> 0.0.0.0:0/1 pipe(0x2182280 sd=18 :0 s=1 pgs=0 cs=0 l=0 c=0x2138dc0).fault 2014-07-14 14:55:49.817569 7f791fc21700 0 -- 192.168.2.1:6789/0 >> 0.0.0.0:0/2 pipe(0x2182a00 sd=17 :0 s=1 pgs=0 cs=0 l=0 c=0x2139a20).fault 2014-07-14 14:55:49.817700 7f791fa1f700 0 -- 192.168.2.1:6789/0 >> 192.168.2.3:6789/0 pipe(0x2183180 sd=13 :0 s=1 pgs=0 cs=0 l=0 c=0x21394a0).fault 2014-07-14 14:55:49.817674 7f791fb20700 0 -- 192.168.2.1:6789/0 >> 192.168.2.2:6789/0 pipe(0x2183680 sd=16 :0 s=1 pgs=0 cs=0 l=0 c=0x2139760).fault
Notice the ".fault" entries at the bottom of the log. We will continue to see them until the quorum is established. We will also see them as outputwhen attempting to do monitor style commands such as ceph -s
until there is a quorum.
Starting mon.1 on kroll2
We start up the mon.1 server on kroll2. The mon log for this server will show it discovering the kroll1 mon.0
peer server, and the two start using Paxos to hold a quorum election. Until mon.2
gets spun up, we still won't have a quorum.
root #
cd /var/log/ceph
root #
/etc/init.d/ceph.mon-1 start
* Caching service dependencies ... [ ok ] * Starting Ceph mon.1 ... [ ok ]
root #
cat ceph-mon.1.log
2014-07-14 15:01:03.087006 7f680d91b7c0 0 ceph version 0.80.1 (a38fe1169b6d2ac98b427334c12d7cf81f809b74), process ceph-mon, pid 18579 2014-07-14 15:01:03.110960 7f680d91b7c0 0 mon.1 does not exist in monmap, will attempt to join an existing cluster 2014-07-14 15:01:03.111291 7f680d91b7c0 0 using public_addr 192.168.2.2:0/0 -> 192.168.2.2:6789/0 2014-07-14 15:01:03.111377 7f680d91b7c0 0 starting mon.1 rank -1 at 192.168.2.2:6789/0 mon_data /var/lib/ceph/mon/ceph-1 fsid 1798897a-f0c9-422d-86b3-d4933a12c7ac 2014-07-14 15:01:03.111717 7f680d91b7c0 1 mon.1@-1(probing) e0 preinit fsid 1798897a-f0c9-422d-86b3-d4933a12c7ac 2014-07-14 15:01:03.111825 7f680d91b7c0 1 mon.1@-1(probing) e0 initial_members 0,1,2, filtering seed monmap 2014-07-14 15:01:03.114379 7f680d91b7c0 0 mon.1@-1(probing) e0 my rank is now 0 (was -1) 2014-07-14 15:01:03.115191 7f6804881700 0 -- 192.168.2.2:6789/0 >> 0.0.0.0:0/2 pipe(0x2d30780 sd=20 :0 s=1 pgs=0 cs=0 l=0 c=0x2d6ba20).fault 2014-07-14 15:01:03.115208 7f680d8b9700 0 -- 192.168.2.2:6789/0 >> 0.0.0.0:0/1 pipe(0x2d30c80 sd=21 :0 s=1 pgs=0 cs=0 l=0 c=0x2d6adc0).fault 2014-07-14 15:01:03.115672 7f680467f700 0 -- 192.168.2.2:6789/0 >> 192.168.2.3:6789/0 pipe(0x2d30f00 sd=24 :0 s=1 pgs=0 cs=0 l=0 c=0x2d6b4a0).fault 2014-07-14 15:01:03.118087 7f6803d7d700 0 -- 192.168.2.2:6789/0 >> 192.168.2.1:6789/0 pipe(0x2d31900 sd=25 :6789 s=0 pgs=0 cs=0 l=0 c=0x2ddc000).accept connect_seq 0 vs existing 0 state wait 2014-07-14 15:01:03.120547 7f6805883700 1 mon.1@0(probing) e0 adding peer 192.168.2.1:6789/0 to list of hints 2014-07-14 15:01:03.120680 7f6805883700 1 mon.1@0(probing) e0 adding peer 192.168.2.1:6789/0 to list of hints . . . 2014-07-14 15:01:03.133469 7f6805883700 1 mon.1@0(probing) e0 adding peer 192.168.2.1:6789/0 to list of hints 2014-07-14 15:01:03.133516 7f6805883700 1 mon.1@0(probing) e0 adding peer 192.168.2.1:6789/0 to list of hints 2014-07-14 15:01:03.133575 7f6805883700 1 mon.1@0(probing) e0 learned initial mon 0 addr 192.168.2.1:6789/0 2014-07-14 15:01:03.133584 7f6805883700 0 mon.1@0(probing) e0 my rank is now 1 (was 0) 2014-07-14 15:01:03.134096 7f6805883700 0 -- 192.168.2.2:6789/0 submit_message mon_probe(reply 1798897a-f0c9-422d-86b3-d4933a12c7ac name 1 paxos( fc 0 lc 0 ) new) v6 remote, 192.168.2.1:6789/0, failed lossy con, dropping message 0x2e4db80 2014-07-14 15:01:03.134139 7f6805883700 0 -- 192.168.2.2:6789/0 submit_message mon_probe(reply 1798897a-f0c9-422d-86b3-d4933a12c7ac name 1 paxos( fc 0 lc 0 ) new) v6 remote, 192.168.2.1:6789/0, failed lossy con, dropping message 0x2f19400 2014-07-14 15:01:03.134155 7f6803d7d700 0 -- 192.168.2.2:6789/0 >> 0.0.0.0:0/2 pipe(0x2e4c500 sd=21 :0 s=1 pgs=0 cs=0 l=0 c=0x2ddc580).fault 2014-07-14 15:01:03.134250 7f680467f700 0 -- 192.168.2.2:6789/0 >> 192.168.2.3:6789/0 pipe(0x2e4cf00 sd=25 :0 s=1 pgs=0 cs=0 l=0 c=0x2ddc420).fault 2014-07-14 15:01:03.135037 7f6805883700 0 log [INF] : mon.1 calling new monitor election 2014-07-14 15:01:03.135083 7f6805883700 1 mon.1@1(electing).elector(1) init, last seen epoch 1 2014-07-14 15:01:03.135161 7f6805883700 1 mon.1@1(electing).paxos(paxos recovering c 0..0) is_readable now=2014-07-14 15:01:03.135162 lease_expire=0.000000 has v0 lc 0
Starting mon.2 on kroll3
root #
cd /var/log/ceph
root #
/etc/init.d/ceph.mon-2 start
* Caching service dependencies ... [ ok ] * Starting Ceph mon.2 ... [ ok ]
root #
cat ceph-mon.2.log
2014-07-14 15:06:31.837248 7f0b3846c780 0 ceph version 0.80.1 (a38fe1169b6d2ac98b427334c12d7cf81f809b74), process ceph-mon, pid 25431 2014-07-14 15:06:31.875796 7f0b3846c780 0 mon.2 does not exist in monmap, will attempt to join an existing cluster 2014-07-14 15:06:31.876179 7f0b3846c780 0 using public_addr 192.168.2.3:0/0 -> 192.168.2.3:6789/0 2014-07-14 15:06:31.876298 7f0b3846c780 0 starting mon.2 rank -1 at 192.168.2.3:6789/0 mon_data /var/lib/ceph/mon/ceph-2 fsid 1798897a-f0c9-422d-86b3-d4933a12c7ac 2014-07-14 15:06:31.876708 7f0b3846c780 1 mon.2@-1(probing) e0 preinit fsid 1798897a-f0c9-422d-86b3-d4933a12c7ac 2014-07-14 15:06:31.876834 7f0b3846c780 1 mon.2@-1(probing) e0 initial_members 0,1,2, filtering seed monmap 2014-07-14 15:06:31.878883 7f0b3846c780 0 mon.2@-1(probing) e0 my rank is now 0 (was -1) 2014-07-14 15:06:31.879515 7f0b2f409700 0 -- 192.168.2.3:6789/0 >> 0.0.0.0:0/2 pipe(0x2b02a00 sd=13 :0 s=1 pgs=0 cs=0 l=0 c=0x2ab7a20).fault 2014-07-14 15:06:31.879519 7f0b38464700 0 -- 192.168.2.3:6789/0 >> 0.0.0.0:0/1 pipe(0x2b02280 sd=14 :0 s=1 pgs=0 cs=0 l=0 c=0x2ab6dc0).fault 2014-07-14 15:06:31.882536 7f0b2e804700 0 -- 192.168.2.3:6789/0 >> 192.168.2.1:6789/0 pipe(0x2b03b80 sd=26 :6789 s=0 pgs=0 cs=0 l=0 c=0x2b32160).accept connect_seq 0 vs existing 0 state wait 2014-07-14 15:06:31.882697 7f0b2e905700 0 -- 192.168.2.3:6789/0 >> 192.168.2.2:6789/0 pipe(0x2b03900 sd=25 :6789 s=0 pgs=0 cs=0 l=0 c=0x2b32000).accept connect_seq 0 vs existing 0 state wait 2014-07-14 15:06:31.884022 7f0b3040b700 1 mon.2@0(probing) e0 adding peer 192.168.2.1:6789/0 to list of hints 2014-07-14 15:06:31.884124 7f0b3040b700 1 mon.2@0(probing) e0 adding peer 192.168.2.2:6789/0 to list of hints 2014-07-14 15:06:31.884281 7f0b3040b700 1 mon.2@0(probing) e0 adding peer 192.168.2.1:6789/0 to list of hints 2014-07-14 15:06:31.884361 7f0b3040b700 1 mon.2@0(probing) e0 adding peer 192.168.2.2:6789/0 to list of hints 2014-07-14 15:06:31.884434 7f0b3040b700 1 mon.2@0(probing) e0 adding peer 192.168.2.1:6789/0 to list of hints 2014-07-14 15:06:31.884516 7f0b3040b700 1 mon.2@0(probing) e0 adding peer 192.168.2.1:6789/0 to list of hints 2014-07-14 15:06:31.884665 7f0b3040b700 0 mon.2@0(probing) e1 my rank is now -1 (was 0) 2014-07-14 15:06:31.885501 7f0b3040b700 0 -- 192.168.2.3:6789/0 submit_message mon_probe(reply 1798897a-f0c9-422d-86b3-d4933a12c7ac name 2 paxos( fc 0 lc 0 ) new) v6 remote, 192.168.2.1:6789/0, failed lossy con, dropping message 0x2b03400 2014-07-14 15:06:31.885631 7f0b2f409700 0 -- 192.168.2.3:6789/0 >> 0.0.0.0:0/2 pipe(0x2b02a00 sd=15 :0 s=1 pgs=0 cs=0 l=0 c=0x2b32160).fault 2014-07-14 15:06:31.886047 7f0b2e804700 0 -- 192.168.2.3:6789/0 >> 192.168.2.1:6789/0 pipe(0x2b2a000 sd=16 :6789 s=0 pgs=0 cs=0 l=0 c=0x2b32580).accept connect_seq 2 vs existing 0 state connecting 2014-07-14 15:06:31.886061 7f0b2e804700 0 -- 192.168.2.3:6789/0 >> 192.168.2.1:6789/0 pipe(0x2b2a000 sd=16 :6789 s=0 pgs=0 cs=0 l=0 c=0x2b32580).accept we reset (peer sent cseq 2, 0x2b03180.cseq = 0), sending RESETSESSION 2014-07-14 15:06:31.886664 7f0b2e804700 0 -- 192.168.2.3:6789/0 >> 192.168.2.1:6789/0 pipe(0x2b2a000 sd=16 :6789 s=0 pgs=0 cs=0 l=0 c=0x2b32580).accept connect_seq 0 vs existing 0 state connecting 2014-07-14 15:06:31.886722 7f0b2e905700 0 -- 192.168.2.3:6789/0 >> 192.168.2.1:6789/0 pipe(0x2b03180 sd=13 :33205 s=4 pgs=0 cs=0 l=0 c=0x2b32000).connect got RESETSESSION but no longer connecting 2014-07-14 15:06:33.886683 7f0b3040b700 0 mon.2@-1(probing) e2 my rank is now 2 (was -1) 2014-07-14 15:06:33.888244 7f0b2f409700 0 
-- 192.168.2.3:6789/0 >> 192.168.2.2:6789/0 pipe(0x2b2a500 sd=15 :6789 s=0 pgs=0 cs=0 l=0 c=0x2b32420).accept connect_seq 2 vs existing 0 state connecting 2014-07-14 15:06:33.888268 7f0b2f409700 0 -- 192.168.2.3:6789/0 >> 192.168.2.2:6789/0 pipe(0x2b2a500 sd=15 :6789 s=0 pgs=0 cs=0 l=0 c=0x2b32420).accept we reset (peer sent cseq 2, 0x2b03680.cseq = 0), sending RESETSESSION 2014-07-14 15:06:33.888297 7f0b2e804700 0 -- 192.168.2.3:6789/0 >> 192.168.2.1:6789/0 pipe(0x2b2a280 sd=16 :6789 s=0 pgs=0 cs=0 l=0 c=0x2b32dc0).accept connect_seq 2 vs existing 0 state connecting 2014-07-14 15:06:33.888318 7f0b2e804700 0 -- 192.168.2.3:6789/0 >> 192.168.2.1:6789/0 pipe(0x2b2a280 sd=16 :6789 s=0 pgs=0 cs=0 l=0 c=0x2b32dc0).accept we reset (peer sent cseq 2, 0x2b03400.cseq = 0), sending RESETSESSION 2014-07-14 15:06:33.888859 7f0b2f409700 0 -- 192.168.2.3:6789/0 >> 192.168.2.2:6789/0 pipe(0x2b2a500 sd=15 :6789 s=0 pgs=0 cs=0 l=0 c=0x2b32420).accept connect_seq 0 vs existing 0 state connecting 2014-07-14 15:06:33.888956 7f0b2f308700 0 -- 192.168.2.3:6789/0 >> 192.168.2.2:6789/0 pipe(0x2b03680 sd=14 :60500 s=4 pgs=0 cs=0 l=0 c=0x2ab6dc0).connect got RESETSESSION but no longer connecting 2014-07-14 15:06:33.914131 7f0b3040b700 1 mon.2@2(synchronizing).paxosservice(pgmap 1..2) refresh upgraded, format 0 -> 1 2014-07-14 15:06:33.914186 7f0b3040b700 1 mon.2@2(synchronizing).pg v0 on_upgrade discarding in-core PGMap 2014-07-14 15:06:33.917087 7f0b3040b700 0 mon.2@2(synchronizing).mds e1 print_map epoch 1 flags 0 created 2014-07-14 15:01:18.167385 modified 2014-07-14 15:01:18.167410 tableserver 0 root 0 session_timeout 60 session_autoclose 300 max_file_size 1099511627776 last_failure 0 last_failure_osd_epoch 0 compat compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap} max_mds 1 in up {} failed stopped data_pools 0 metadata_pool 1 inline_data disabled 2014-07-14 15:06:33.917425 7f0b3040b700 1 mon.2@2(synchronizing).osd e1 e1: 0 osds: 0 up, 0 in 2014-07-14 15:06:33.922385 7f0b3040b700 0 mon.2@2(synchronizing).osd e1 crush map has features 1107558400, adjusting msgr requires 2014-07-14 15:06:33.922412 7f0b3040b700 0 mon.2@2(synchronizing).osd e1 crush map has features 1107558400, adjusting msgr requires 2014-07-14 15:06:33.922417 7f0b3040b700 0 mon.2@2(synchronizing).osd e1 crush map has features 1107558400, adjusting msgr requires 2014-07-14 15:06:33.922421 7f0b3040b700 0 mon.2@2(synchronizing).osd e1 crush map has features 1107558400, adjusting msgr requires 2014-07-14 15:06:33.922868 7f0b3040b700 1 mon.2@2(synchronizing).paxosservice(auth 1..1) refresh upgraded, format 0 -> 1 2014-07-14 15:06:33.929984 7f0b3040b700 0 log [INF] : mon.2 calling new monitor election 2014-07-14 15:06:33.930106 7f0b3040b700 1 mon.2@2(electing).elector(1) init, last seen epoch 1 2014-07-14 15:06:33.930366 7f0b3040b700 1 mon.2@2(electing).paxos(paxos recovering c 1..11) is_readable now=2014-07-14 15:06:33.930368 lease_expire=0.000000 has v0 lc 11 2014-07-14 15:06:33.930384 7f0b3040b700 1 mon.2@2(electing).paxos(paxos recovering c 1..11) is_readable now=2014-07-14 15:06:33.930386 lease_expire=0.000000 has v0 lc 11 2014-07-14 15:06:36.931021 7f0b3040b700 1 mon.2@2(peon).paxos(paxos active c 1..11) is_readable now=2014-07-14 15:06:36.931027 lease_expire=2014-07-14 15:06:41.926571 has v0 lc 11 2014-07-14 15:06:36.931058 7f0b3040b700 1 mon.2@2(peon).paxos(paxos active c 1..11) is_readable now=2014-07-14 
15:06:36.931060 lease_expire=2014-07-14 15:06:41.926571 has v0 lc 11 2014-07-14 15:06:36.999210 7f0b3040b700 1 mon.2@2(peon).paxos(paxos active c 1..12) is_readable now=2014-07-14 15:06:36.999212 lease_expire=2014-07-14 15:06:41.988409 has v0 lc 12 2014-07-14 15:06:39.988946 7f0b3040b700 1 mon.2@2(peon).paxos(paxos active c 1..12) is_readable now=2014-07-14 15:06:39.988948 lease_expire=2014-07-14 15:06:44.988542 has v0 lc 12 2014-07-14 15:06:42.989047 7f0b3040b700 1 mon.2@2(peon).paxos(paxos active c 1..12) is_readable now=2014-07-14 15:06:42.989049 lease_expire=2014-07-14 15:06:47.988632 has v0 lc 12
There is now also a ceph.log
for the cluster with a briefer summary. There is a warning that the system drive is over 70% full on kroll1 since we have also been using it for a /home directory there. We may migrate that to the btrfs array later if it becomes necessary.
root #
cat ceph.log
2014-07-14 15:01:18.238250 mon.0 192.168.2.1:6789/0 11 : [INF] pgmap v2: 192 pgs: 192 creating; 0 bytes data, 0 kB used, 0 kB / 0 kB avail 2014-07-14 15:01:49.818434 mon.0 192.168.2.1:6789/0 12 : [WRN] reached concerning levels of available space on local monitor storage (28% free) 2014-07-14 15:01:03.135041 mon.1 192.168.2.2:6789/0 1 : [INF] mon.1 calling new monitor election 2014-07-14 15:01:13.152781 mon.1 192.168.2.2:6789/0 2 : [INF] mon.1 calling new monitor election 2014-07-14 15:06:31.908086 mon.0 192.168.2.1:6789/0 13 : [INF] mon.0 calling new monitor election 2014-07-14 15:06:33.929996 mon.2 192.168.2.3:6789/0 1 : [INF] mon.2 calling new monitor election 2014-07-14 15:06:36.916720 mon.0 192.168.2.1:6789/0 14 : [INF] mon.0@0 won leader election with quorum 0,1,2 2014-07-14 15:06:36.926950 mon.0 192.168.2.1:6789/0 15 : [INF] monmap e2: 3 mons at {0=192.168.2.1:6789/0,1=192.168.2.2:6789/0,2=192.168.2.3:6789/0} 2014-07-14 15:06:36.927070 mon.0 192.168.2.1:6789/0 16 : [INF] pgmap v2: 192 pgs: 192 creating; 0 bytes data, 0 kB used, 0 kB / 0 kB avail 2014-07-14 15:06:36.927174 mon.0 192.168.2.1:6789/0 17 : [INF] mdsmap e1: 0/0/1 up 2014-07-14 15:06:36.927293 mon.0 192.168.2.1:6789/0 18 : [INF] osdmap e1: 0 osds: 0 up, 0 in 2014-07-14 15:06:49.820264 mon.0 192.168.2.1:6789/0 19 : [WRN] reached concerning levels of available space on local monitor storage (28% free)
The commands in the manual install page to check cluster sanity: ceph osd lspools
and ceph -s
will now work, but of course we don't have any OSDs spun up yet.
root #
ceph osd lspools
0 data,1 metadata,2 rbd,
root #
ceph -s
cluster 1798897a-f0c9-422d-86b3-d4933a12c7ac health HEALTH_ERR 192 pgs stuck inactive; 192 pgs stuck unclean; no osds monmap e2: 3 mons at {0=192.168.2.1:6789/0,1=192.168.2.2:6789/0,2=192.168.2.3:6789/0}, election epoch 6, quorum 0,1,2 0,1,2 osdmap e1: 0 osds: 0 up, 0 in pgmap v2: 192 pgs, 3 pools, 0 bytes data, 0 objects 0 kB used, 0 kB / 0 kB avail 192 creating
Starting mon.3 on kroll4
root #
cd /var/log/ceph
root #
/etc/init.d/ceph.mon-3 start
* Caching service dependencies ... [ ok ] * Starting Ceph mon.3 ... [ ok ]
root #
cat ceph-mon.3.log
2014-07-14 15:22:30.350302 7fdf4e653780 0 ceph version 0.80.1 (a38fe1169b6d2ac98b427334c12d7cf81f809b74), process ceph-mon, pid 7963 2014-07-14 15:22:30.374728 7fdf4e653780 0 mon.3 does not exist in monmap, will attempt to join an existing cluster 2014-07-14 15:22:30.375095 7fdf4e653780 0 using public_addr 192.168.2.4:0/0 -> 192.168.2.4:6789/0 2014-07-14 15:22:30.375191 7fdf4e653780 0 starting mon.3 rank -1 at 192.168.2.4:6789/0 mon_data /var/lib/ceph/mon/ceph-3 fsid 1798897a-f0c9-422d-86b3-d4933a12c7ac 2014-07-14 15:22:30.375544 7fdf4e653780 1 mon.3@-1(probing) e0 preinit fsid 1798897a-f0c9-422d-86b3-d4933a12c7ac 2014-07-14 15:22:30.375679 7fdf4e653780 1 mon.3@-1(probing) e0 initial_members 0,1,2, filtering seed monmap 2014-07-14 15:22:30.377442 7fdf4e64b700 0 -- 192.168.2.4:6789/0 >> 0.0.0.0:0/1 pipe(0x1fe4c80 sd=13 :0 s=1 pgs=0 cs=0 l=0 c=0x201cdc0).fault 2014-07-14 15:22:30.377505 7fdf455d9700 0 -- 192.168.2.4:6789/0 >> 0.0.0.0:0/2 pipe(0x1fe4780 sd=14 :0 s=1 pgs=0 cs=0 l=0 c=0x201da20).fault 2014-07-14 15:22:30.377866 7fdf454d8700 0 -- 192.168.2.4:6789/0 >> 0.0.0.0:0/3 pipe(0x1fe5400 sd=15 :0 s=1 pgs=0 cs=0 l=0 c=0x201d760).fault 2014-07-14 15:22:30.410274 7fdf465db700 1 mon.3@-1(synchronizing).paxosservice(pgmap 1..2) refresh upgraded, format 0 -> 1 2014-07-14 15:22:30.410318 7fdf465db700 1 mon.3@-1(synchronizing).pg v0 on_upgrade discarding in-core PGMap 2014-07-14 15:22:30.412460 7fdf465db700 0 mon.3@-1(synchronizing).mds e1 print_map epoch 1 flags 0 created 2014-07-14 15:01:18.167385 modified 2014-07-14 15:01:18.167410 tableserver 0 root 0 session_timeout 60 session_autoclose 300 max_file_size 1099511627776 last_failure 0 last_failure_osd_epoch 0 compat compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap} max_mds 1 in up {} failed stopped data_pools 0 metadata_pool 1 inline_data disabled 2014-07-14 15:22:30.412779 7fdf465db700 1 mon.3@-1(synchronizing).osd e1 e1: 0 osds: 0 up, 0 in 2014-07-14 15:22:30.417334 7fdf465db700 0 mon.3@-1(synchronizing).osd e1 crush map has features 1107558400, adjusting msgr requires 2014-07-14 15:22:30.417364 7fdf465db700 0 mon.3@-1(synchronizing).osd e1 crush map has features 1107558400, adjusting msgr requires 2014-07-14 15:22:30.417369 7fdf465db700 0 mon.3@-1(synchronizing).osd e1 crush map has features 1107558400, adjusting msgr requires 2014-07-14 15:22:30.417373 7fdf465db700 0 mon.3@-1(synchronizing).osd e1 crush map has features 1107558400, adjusting msgr requires 2014-07-14 15:22:30.418155 7fdf465db700 1 mon.3@-1(synchronizing).paxosservice(auth 1..2) refresh upgraded, format 0 -> 1 2014-07-14 15:22:32.420179 7fdf465db700 0 mon.3@-1(probing) e3 my rank is now 3 (was -1) 2014-07-14 15:22:32.421409 7fdf465db700 0 log [INF] : mon.3 calling new monitor election 2014-07-14 15:22:32.421511 7fdf465db700 1 mon.3@3(electing).elector(1) init, last seen epoch 1 2014-07-14 15:22:32.421625 7fdf465db700 1 mon.3@3(electing).paxos(paxos recovering c 1..14) is_readable now=2014-07-14 15:22:32.421626 lease_expire=0.000000 has v0 lc 14 2014-07-14 15:22:32.421642 7fdf465db700 1 mon.3@3(electing).paxos(paxos recovering c 1..14) is_readable now=2014-07-14 15:22:32.421643 lease_expire=0.000000 has v0 lc 14 2014-07-14 15:22:32.421820 7fdf448d3700 0 -- 192.168.2.4:6789/0 >> 192.168.2.2:6789/0 pipe(0x1fe4c80 sd=14 :6789 s=0 pgs=0 cs=0 l=0 c=0x2079340).accept connect_seq 2 vs existing 0 state connecting 2014-07-14 15:22:32.421854 
7fdf448d3700 0 -- 192.168.2.4:6789/0 >> 192.168.2.2:6789/0 pipe(0x1fe4c80 sd=14 :6789 s=0 pgs=0 cs=0 l=0 c=0x2079340).accept we reset (peer sent cseq 2, 0x1fe5680.cseq = 0), sending RESETSESSION 2014-07-14 15:22:32.422235 7fdf446d1700 0 -- 192.168.2.4:6789/0 >> 192.168.2.1:6789/0 pipe(0x1fe5900 sd=13 :6789 s=0 pgs=0 cs=0 l=0 c=0x2079080).accept connect_seq 2 vs existing 0 state connecting 2014-07-14 15:22:32.422263 7fdf446d1700 0 -- 192.168.2.4:6789/0 >> 192.168.2.1:6789/0 pipe(0x1fe5900 sd=13 :6789 s=0 pgs=0 cs=0 l=0 c=0x2079080).accept we reset (peer sent cseq 2, 0x207b680.cseq = 0), sending RESETSESSION 2014-07-14 15:22:32.422283 7fdf453d7700 0 -- 192.168.2.4:6789/0 >> 192.168.2.3:6789/0 pipe(0x1fe4f00 sd=16 :6789 s=0 pgs=0 cs=0 l=0 c=0x20791e0).accept connect_seq 2 vs existing 0 state connecting 2014-07-14 15:22:32.422303 7fdf453d7700 0 -- 192.168.2.4:6789/0 >> 192.168.2.3:6789/0 pipe(0x1fe4f00 sd=16 :6789 s=0 pgs=0 cs=0 l=0 c=0x20791e0).accept we reset (peer sent cseq 2, 0x1fe4a00.cseq = 0), sending RESETSESSION 2014-07-14 15:22:32.422464 7fdf448d3700 0 -- 192.168.2.4:6789/0 >> 192.168.2.2:6789/0 pipe(0x1fe4c80 sd=14 :6789 s=0 pgs=0 cs=0 l=0 c=0x2079340).accept connect_seq 0 vs existing 0 state connecting 2014-07-14 15:22:32.423192 7fdf4e64b700 0 -- 192.168.2.4:6789/0 >> 192.168.2.2:6789/0 pipe(0x1fe5680 sd=17 :51313 s=4 pgs=0 cs=0 l=0 c=0x2079600).connect got RESETSESSION but no longer connecting 2014-07-14 15:22:35.478165 7fdf465db700 1 mon.3@3(peon).paxos(paxos active c 1..15) is_readable now=2014-07-14 15:22:35.478176 lease_expire=2014-07-14 15:22:40.466348 has v0 lc 15 2014-07-14 15:22:35.478209 7fdf465db700 1 mon.3@3(peon).paxos(paxos active c 1..15) is_readable now=2014-07-14 15:22:35.478210 lease_expire=2014-07-14 15:22:40.466348 has v0 lc 15 2014-07-14 15:22:35.540828 7fdf465db700 1 mon.3@3(peon).paxos(paxos active c 1..16) is_readable now=2014-07-14 15:22:35.540829 lease_expire=2014-07-14 15:22:40.533829 has v0 lc 16 2014-07-14 15:22:38.536309 7fdf465db700 1 mon.3@3(peon).paxos(paxos active c 1..16) is_readable now=2014-07-14 15:22:38.536311 lease_expire=2014-07-14 15:22:43.533964 has v0 lc 16
The appearance of mon.3
causes a new monitor election as noted in the ceph.log on kroll4
root #
cat ceph.log
2014-07-14 15:06:36.927293 mon.0 192.168.2.1:6789/0 18 : [INF] osdmap e1: 0 osds: 0 up, 0 in
2014-07-14 15:06:49.820264 mon.0 192.168.2.1:6789/0 19 : [WRN] reached concerning levels of available space on local monitor storage (28% free)
2014-07-14 15:06:31.911339 mon.1 192.168.2.2:6789/0 3 : [INF] mon.1 calling new monitor election
2014-07-14 15:22:30.433866 mon.0 192.168.2.1:6789/0 20 : [INF] mon.0 calling new monitor election
2014-07-14 15:22:30.434038 mon.1 192.168.2.2:6789/0 4 : [INF] mon.1 calling new monitor election
2014-07-14 15:22:30.438257 mon.2 192.168.2.3:6789/0 2 : [INF] mon.2 calling new monitor election
2014-07-14 15:22:32.421411 mon.3 192.168.2.4:6789/0 1 : [INF] mon.3 calling new monitor election
2014-07-14 15:22:35.444315 mon.0 192.168.2.1:6789/0 21 : [INF] mon.0@0 won leader election with quorum 0,1,2,3
2014-07-14 15:22:35.466723 mon.0 192.168.2.1:6789/0 22 : [INF] monmap e3: 4 mons at {0=192.168.2.1:6789/0,1=192.168.2.2:6789/0,2=192.168.2.3:6789/0,3=192.168.2.4:6789/0}
2014-07-14 15:22:35.466825 mon.0 192.168.2.1:6789/0 23 : [INF] pgmap v2: 192 pgs: 192 creating; 0 bytes data, 0 kB used, 0 kB / 0 kB avail
2014-07-14 15:22:35.466931 mon.0 192.168.2.1:6789/0 24 : [INF] mdsmap e1: 0/0/1 up
2014-07-14 15:22:35.467028 mon.0 192.168.2.1:6789/0 25 : [INF] osdmap e1: 0 osds: 0 up, 0 in
2014-07-14 15:22:49.826214 mon.0 192.168.2.1:6789/0 26 : [WRN] reached concerning levels of available space on local monitor storage (28% free)
And now we are four. The health will stay at HEALTH_ERR or degraded until we get at least two OSDs spun up. We need two since that's the default replication count for objects as set in /etc/ceph/ceph.conf.
root #
ceph -s
    cluster 1798897a-f0c9-422d-86b3-d4933a12c7ac
     health HEALTH_ERR 192 pgs stuck inactive; 192 pgs stuck unclean; no osds
     monmap e3: 4 mons at {0=192.168.2.1:6789/0,1=192.168.2.2:6789/0,2=192.168.2.3:6789/0,3=192.168.2.4:6789/0}, election epoch 8, quorum 0,1,2,3 0,1,2,3
     osdmap e1: 0 osds: 0 up, 0 in
      pgmap v2: 192 pgs, 3 pools, 0 bytes data, 0 objects
            0 kB used, 0 kB / 0 kB avail
                 192 creating
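For reference, the replica counts mentioned above come from the two settings we placed in the [global] section of the configuration file:
/etc/ceph/ceph.conf
replication defaults in the [global] section
osd pool default size = 2
osd pool default min size = 1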
Creating osd.0 on kroll1
We use uuidgen
to create a unique id which will be used for our first osd. With the mon servers up and maintaining the cluster maps via Paxos, we will need to follow the id numbers returned to us from ceph osd create and then retrofit /etc/ceph/ceph.conf if necessary, should we lose an osd later or do things out of order.
root #
uuidgen
fb84a114-e961-4a8d-98a3-0b263c003129
root #
ceph osd create fb84a114-e961-4a8d-98a3-0b263c003129
0
Since this is the first osd registered with the mon servers, we are assigned id 0 and will create osd.0 on kroll1. As we noted earlier, kroll1 (Thufir) already has a btrfs raid5 array up and running with the default volume mounted on /thufirraid. The "normal" content is in the subvolume raid, which is mounted on /raid. We will add a new subvolume called ceph_osd_0, which will be mounted on /var/lib/ceph/osd/ceph-0 for use by the new osd server.
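As an aside, if you end up creating several osds, the uuid generation and registration can be rolled into a single scripted step. This is only a sketch and the variable names are our own:
root #
OSD_UUID=$(uuidgen); OSD_ID=$(ceph osd create ${OSD_UUID}); echo "osd.${OSD_ID} uses ${OSD_UUID}"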
root #
df
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/sda2 230374564 152033716 66615436 70% /
devtmpfs 16387752 0 16387752 0% /dev
tmpfs 3284440 952 3283488 1% /run
shm 16422196 140 16422056 1% /dev/shm
cgroup_root 10240 0 10240 0% /sys/fs/cgroup
/dev/sdb 15628074336 6429555452 7031525188 48% /thufirraid
/dev/sdb 15628074336 6429555452 7031525188 48% /raid
root #
cd /thufirraid
root #
btrfs subvol create ceph_osd_0
Create subvolume './ceph_osd_0'
root #
ls
ceph_osd_0 raid
We add the new subvolume to our /etc/fstab
and take the liberty of turning on automatic defragmentation and on-the-fly lzo compression for the subvolume. The osd will actually manage some subvolumes of its own underneath this mountpoint (current
and snap.N
rolling snapshots).
/etc/fstab
Adding ceph_osd_0 to the fstab
LABEL=boot /boot ext3 noauto,noatime 1 2
LABEL=root / ext4 defaults,noatime,discard 0 1
/dev/sdb /thufirraid btrfs defaults,noatime,compress=lzo,autodefrag 0 0
/dev/sdb /raid btrfs defaults,noatime,compress=lzo,autodefrag,subvol=raid 0 0
/dev/sdb /var/lib/ceph/osd/ceph-0 btrfs defaults,noatime,compress=lzo,autodefrag,subvol=ceph_osd_0 0 0
root #
mkdir -p /var/lib/ceph/osd/ceph-0
root #
mount /var/lib/ceph/osd/ceph-0
root #
df
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/sda2 230374564 152034736 66614416 70% /
devtmpfs 16387752 0 16387752 0% /dev
tmpfs 3284440 952 3283488 1% /run
shm 16422196 140 16422056 1% /dev/shm
cgroup_root 10240 0 10240 0% /sys/fs/cgroup
/dev/sdb 15628074336 6429555484 7031525188 48% /thufirraid
/dev/sdb 15628074336 6429555484 7031525188 48% /raid
/dev/sdb 15628074336 6429555484 7031525188 48% /var/lib/ceph/osd/ceph-0
Now we let ceph-osd have its way with our new btrfs subvolume.
root #
ceph-osd -i 0 --mkfs --mkkey --osd-uuid fb84a114-e961-4a8d-98a3-0b263c003129
2014-07-14 16:15:48.188183 7f667580dbc0 -1 journal FileJournal::_open: disabling aio for non-block journal. Use journal_force_aio to force use of aio anyway
2014-07-14 16:15:51.294994 7f667580dbc0 -1 journal FileJournal::_open: disabling aio for non-block journal. Use journal_force_aio to force use of aio anyway
2014-07-14 16:15:51.314809 7f667580dbc0 -1 filestore(/var/lib/ceph/osd/ceph-0) could not find 23c2fcde/osd_superblock/0//-1 in index: (2) No such file or directory
2014-07-14 16:15:52.440379 7f667580dbc0 -1 created object store /var/lib/ceph/osd/ceph-0 journal /var/lib/ceph/osd/ceph-0/journal for osd.0 fsid 1798897a-f0c9-422d-86b3-d4933a12c7ac
2014-07-14 16:15:52.440477 7f667580dbc0 -1 auth: error reading file: /var/lib/ceph/osd/ceph-0/keyring: can't open /var/lib/ceph/osd/ceph-0/keyring: (2) No such file or directory
2014-07-14 16:15:52.440658 7f667580dbc0 -1 created new key in keyring /var/lib/ceph/osd/ceph-0/keyring
We now have a /var/log/ceph/ceph-osd.0.log
which shows us all of the activity and the btrfs features that the osd decided to take advantage of:
root #
cat /var/log/ceph/ceph-osd.0.log
2014-07-14 16:15:45.644892 7f667580dbc0 0 ceph version 0.80.1 (a38fe1169b6d2ac98b427334c12d7cf81f809b74), process ceph-osd, pid 25509 2014-07-14 16:15:45.650521 7f667580dbc0 1 filestore(/var/lib/ceph/osd/ceph-0) mkfs in /var/lib/ceph/osd/ceph-0 2014-07-14 16:15:45.650596 7f667580dbc0 1 filestore(/var/lib/ceph/osd/ceph-0) mkfs using provided fsid fb84a114-e961-4a8d-98a3-0b263c003129 2014-07-14 16:15:48.185602 7f667580dbc0 1 filestore(/var/lib/ceph/osd/ceph-0) leveldb db exists/created 2014-07-14 16:15:48.188183 7f667580dbc0 -1 journal FileJournal::_open: disabling aio for non-block journal. Use journal_force_aio to force use of aio anyway 2014-07-14 16:15:48.190839 7f667580dbc0 1 journal _open /var/lib/ceph/osd/ceph-0/journal fd 9: 1073741824 bytes, block size 4096 bytes, directio = 1, aio = 0 2014-07-14 16:15:48.325943 7f667580dbc0 0 filestore(/var/lib/ceph/osd/ceph-0) mkjournal created journal on /var/lib/ceph/osd/ceph-0/journal 2014-07-14 16:15:48.325974 7f667580dbc0 1 filestore(/var/lib/ceph/osd/ceph-0) mkfs done in /var/lib/ceph/osd/ceph-0 2014-07-14 16:15:48.326479 7f667580dbc0 0 filestore(/var/lib/ceph/osd/ceph-0) mount detected btrfs 2014-07-14 16:15:48.481458 7f667580dbc0 0 genericfilestorebackend(/var/lib/ceph/osd/ceph-0) detect_features: FIEMAP ioctl is supported and appears to work 2014-07-14 16:15:48.481508 7f667580dbc0 0 genericfilestorebackend(/var/lib/ceph/osd/ceph-0) detect_features: FIEMAP ioctl is disabled via 'filestore fiemap' config option 2014-07-14 16:15:49.418326 7f667580dbc0 0 genericfilestorebackend(/var/lib/ceph/osd/ceph-0) detect_features: syncfs(2) syscall fully supported (by glibc and kernel) 2014-07-14 16:15:49.418549 7f667580dbc0 0 btrfsfilestorebackend(/var/lib/ceph/osd/ceph-0) detect_feature: CLONE_RANGE ioctl is supported 2014-07-14 16:15:50.018307 7f667580dbc0 0 btrfsfilestorebackend(/var/lib/ceph/osd/ceph-0) detect_feature: SNAP_CREATE is supported 2014-07-14 16:15:50.027624 7f667580dbc0 0 btrfsfilestorebackend(/var/lib/ceph/osd/ceph-0) detect_feature: SNAP_DESTROY is supported 2014-07-14 16:15:50.029177 7f667580dbc0 0 btrfsfilestorebackend(/var/lib/ceph/osd/ceph-0) detect_feature: START_SYNC is supported (transid 4539) 2014-07-14 16:15:50.318284 7f667580dbc0 0 btrfsfilestorebackend(/var/lib/ceph/osd/ceph-0) detect_feature: WAIT_SYNC is supported 2014-07-14 16:15:50.778085 7f667580dbc0 0 btrfsfilestorebackend(/var/lib/ceph/osd/ceph-0) detect_feature: SNAP_CREATE_V2 is supported 2014-07-14 16:15:51.293060 7f667580dbc0 0 filestore(/var/lib/ceph/osd/ceph-0) mount: enabling PARALLEL journal mode: fs, checkpoint is enabled 2014-07-14 16:15:51.294994 7f667580dbc0 -1 journal FileJournal::_open: disabling aio for non-block journal. 
Use journal_force_aio to force use of aio anyway 2014-07-14 16:15:51.295050 7f667580dbc0 1 journal _open /var/lib/ceph/osd/ceph-0/journal fd 15: 1073741824 bytes, block size 4096 bytes, directio = 1, aio = 0 2014-07-14 16:15:51.314226 7f667580dbc0 1 journal _open /var/lib/ceph/osd/ceph-0/journal fd 15: 1073741824 bytes, block size 4096 bytes, directio = 1, aio = 0 2014-07-14 16:15:51.314809 7f667580dbc0 -1 filestore(/var/lib/ceph/osd/ceph-0) could not find 23c2fcde/osd_superblock/0//-1 in index: (2) No such file or directory 2014-07-14 16:15:52.437200 7f667580dbc0 1 journal close /var/lib/ceph/osd/ceph-0/journal 2014-07-14 16:15:52.440379 7f667580dbc0 -1 created object store /var/lib/ceph/osd/ceph-0 journal /var/lib/ceph/osd/ceph-0/journal for osd.0 fsid 1798897a-f0c9-422d-86b3-d4933a12c7ac 2014-07-14 16:15:52.440477 7f667580dbc0 -1 auth: error reading file: /var/lib/ceph/osd/ceph-0/keyring: can't open /var/lib/ceph/osd/ceph-0/keyring: (2) No such file or directory 2014-07-14 16:15:52.440658 7f667580dbc0 -1 created new key in keyring /var/lib/ceph/osd/ceph-0/keyring
The resulting filesystem looks like this:
root #
cd /var/lib/ceph/osd/ceph-0
root #
ls -l
total 1048608
-rw-r--r-- 1 root root         37 Jul 14 16:15 ceph_fsid
drwxr-xr-x 1 root root         42 Jul 14 16:15 current
-rw-r--r-- 1 root root         37 Jul 14 16:15 fsid
-rw-r--r-- 1 root root 1073741824 Jul 14 16:15 journal
-rw------- 1 root root         56 Jul 14 16:15 keyring
-rw-r--r-- 1 root root         21 Jul 14 16:15 magic
-rw-r--r-- 1 root root          6 Jul 14 16:15 ready
drwxr-xr-x 1 root root         26 Jul 14 16:15 snap_1
drwxr-xr-x 1 root root         42 Jul 14 16:15 snap_2
-rw-r--r-- 1 root root          4 Jul 14 16:15 store_version
-rw-r--r-- 1 root root         42 Jul 14 16:15 superblock
-rw-r--r-- 1 root root          2 Jul 14 16:15 whoami
The current, snap_1 and snap_2 directories are btrfs snapshots. The keyring file holds a new unique key that was generated for osd.0.
root #
cat keyring
[osd.0] key = (redacted, unique to osd.0)
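Incidentally, the current and snap_N directories created by ceph-osd can be confirmed to be btrfs subvolumes rather than ordinary directories. A plain subvolume listing will show them, although it lists every subvolume on the filesystem, so the rest of the array's subvolumes appear too (our illustration; output will vary):
root #
btrfs subvolume list /var/lib/ceph/osd/ceph-0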
We use ceph auth
to add this osd.0 key to the authentication database maintained by the mon servers.
root #
ceph auth add osd.0 osd 'allow *' mon 'allow profile osd' -i /var/lib/ceph/osd/ceph-0/keyring
added key for osd.0
The ceph_fsid
actually contains the fsid for the cluster from /etc/ceph/ceph.conf
and not the one we passed when creating the osd itself:
root #
cat ceph_fsid
1798897a-f0c9-422d-86b3-d4933a12c7ac
root #
grep fsid /etc/ceph/ceph.conf
fsid = 1798897a-f0c9-422d-86b3-d4933a12c7ac
We save the uuid we used for osd.0 in a text file in /etc/ceph/osd_uuids.txt
just in case. It would probably be a bad idea to try to recycle an osd id number if we ever have a filesystem go bad, but you never know...
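One simple way to record it (the file name and format here are just our own convention):
root #
echo "osd.0 fb84a114-e961-4a8d-98a3-0b263c003129" >> /etc/ceph/osd_uuids.txt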
We set up kroll1 as a host in the CRUSH map and make it part of the default
osd tree.
root #
ceph osd crush add-bucket kroll1 host
added bucket kroll1 type host to crush map
root #
ceph osd crush move kroll1 root=default
moved item id -2 name 'kroll1' to location {root=default} in crush map
We can now see kroll1 as a host in the default tree using ceph osd tree
. osd.0
also appears but hasn't been assigned to a host.
root #
ceph osd tree
# id    weight  type name       up/down reweight
-1      0       root default
-2      0               host kroll1
0       0                       osd.0   down    0
We now put osd.0
in the crush map under kroll1 with a default weighting value.
root #
ceph osd crush add osd.0 1.0 host=kroll1
add item id 0 name 'osd.0' weight 1 at location {host=kroll1} to crush map
root #
ceph osd tree
# id    weight  type name       up/down reweight
-1      1       root default
-2      1               host kroll1
0       1                       osd.0   down    0
All that is left is to enable and start the osd.0 service.
root #
cd /etc/init.d
root #
ln -s ceph ceph.osd-0
root #
rc-update add ceph.osd-0 default
* service ceph.osd-0 added to runlevel default
root #
./ceph.osd-0 start
* Caching service dependencies ... [ ok ] * Starting Ceph osd.0 ... starting osd.0 at :/0 osd_data /var/lib/ceph/osd/ceph-0 /var/lib/ceph/osd/ceph-0/journal [ ok ]
With one osd spun up, our cluster is now operating in a degraded state.
root #
ceph -s
    cluster 1798897a-f0c9-422d-86b3-d4933a12c7ac
     health HEALTH_WARN 192 pgs degraded; 192 pgs stuck unclean
     monmap e3: 4 mons at {0=192.168.2.1:6789/0,1=192.168.2.2:6789/0,2=192.168.2.3:6789/0,3=192.168.2.4:6789/0}, election epoch 8, quorum 0,1,2,3 0,1,2,3
     osdmap e7: 1 osds: 1 up, 1 in
      pgmap v9: 192 pgs, 3 pools, 0 bytes data, 0 objects
            6132 GB used, 6704 GB / 14904 GB avail
                 192 active+degraded
osd.1 on kroll3
osd.1
will use our other big btrfs array on host kroll3 (mater).
root #
uuidgen
0b4c1e77-cc5c-49aa-b322-42d331621113
root #
ceph osd create 0b4c1e77-cc5c-49aa-b322-42d331621113
1
Since we are visiting hosts in the right order, we get to set up osd.1 on kroll3 just as we had anticipated in our /etc/ceph/ceph.conf
file. Mater's btrfs array has its default volume mounted on /materraid.
root #
cd /materraid
root #
btrfs subvol create ceph_osd_1
Create subvolume './ceph_osd_1'
/etc/fstab
Edited to add ceph_osd_1 subvolume
LABEL=SAMBOOT /boot ext4 defaults,noatime,discard 1 2
LABEL=SAMROOT / ext4 defaults,noatime,discard 0 1
/dev/sdb /materraid btrfs defaults,noatime,compress=lzo,autodefrag 0 0
/dev/sdb /raid btrfs defaults,noatime,compress=lzo,autodefrag,subvol=raid 0 0
/dev/sdb /var/lib/ceph/osd/ceph-1 btrfs defaults,noatime,compress=lzo,autodefrag,subvol=ceph_osd_1 0 0
root #
mkdir -p /var/lib/ceph/osd/ceph-1
root #
mount /var/lib/ceph/osd/ceph-1
root #
df
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/sda3 239716036 150083632 77432412 66% /
devtmpfs 8160292 0 8160292 0% /dev
tmpfs 1632264 928 1631336 1% /run
shm 8161308 76 8161232 1% /dev/shm
cgroup_root 10240 0 10240 0% /sys/fs/cgroup
/dev/sda2 499656 20764 442196 5% /boot
/dev/sdb 15628074336 6398440760 7076273736 48% /materraid
/dev/sdb 15628074336 6398440760 7076273736 48% /raid
/dev/sdb 15628074336 6398440760 7076273736 48% /var/lib/ceph/osd/ceph-1
root #
ceph-osd -i 1 --mkfs --mkkey --osd-uuid 0b4c1e77-cc5c-49aa-b322-42d331621113
2014-07-14 17:41:03.717559 7fcc47c22bc0 -1 journal FileJournal::_open: disabling aio for non-block journal. Use journal_force_aio to force use of aio anyway
2014-07-14 17:41:12.369793 7fcc47c22bc0 -1 journal FileJournal::_open: disabling aio for non-block journal. Use journal_force_aio to force use of aio anyway
2014-07-14 17:41:12.371069 7fcc47c22bc0 -1 filestore(/var/lib/ceph/osd/ceph-1) could not find 23c2fcde/osd_superblock/0//-1 in index: (2) No such file or directory
2014-07-14 17:41:14.455595 7fcc47c22bc0 -1 created object store /var/lib/ceph/osd/ceph-1 journal /var/lib/ceph/osd/ceph-1/journal for osd.1 fsid 1798897a-f0c9-422d-86b3-d4933a12c7ac
2014-07-14 17:41:14.455697 7fcc47c22bc0 -1 auth: error reading file: /var/lib/ceph/osd/ceph-1/keyring: can't open /var/lib/ceph/osd/ceph-1/keyring: (2) No such file or directory
2014-07-14 17:41:14.455883 7fcc47c22bc0 -1 created new key in keyring /var/lib/ceph/osd/ceph-1/keyring
root #
ceph auth add osd.1 osd 'allow *' mon 'allow profile osd' -i /var/lib/ceph/osd/ceph-1/keyring
added key for osd.1
root #
ceph osd crush add-bucket kroll3 host
added bucket kroll3 type host to crush map
root #
ceph osd crush move kroll3 root=default
moved item id -3 name 'kroll3' to location {root=default} in crush map
root #
ceph osd crush add osd.1 1.0 host=kroll3
add item id 1 name 'osd.1' weight 1 at location {host=kroll3} to crush map
root #
ceph osd tree
# id    weight  type name       up/down reweight
-1      2       root default
-2      1               host kroll1
0       1                       osd.0   up      1
-3      1               host kroll3
1       1                       osd.1   down    0
root #
cd /etc/init.d
root #
ln -s ceph ceph.osd-1
root #
rc-update add ceph.osd-1 default
* service ceph.osd-1 added to runlevel default
root #
./ceph.osd-1 start
* Caching service dependencies ... [ ok ] * Starting Ceph osd.1 ... starting osd.1 at :/0 osd_data /var/lib/ceph/osd/ceph-1 /var/lib/ceph/osd/ceph-1/journal [ ok ]
With two osds running now, the cluster moves from degraded to a "warn" state. The remaining HEALTH_WARN is most likely the earlier warning about low free space on kroll1's root filesystem, where mon.0 keeps its monitor store.
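If you want to see exactly which warnings are still active, ceph health detail will spell them out (output omitted here):
root #
ceph health detail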
root #
ceph osd tree
# id    weight  type name       up/down reweight
-1      2       root default
-2      1               host kroll1
0       1                       osd.0   up      1
-3      1               host kroll3
1       1                       osd.1   up      1
root #
ceph -s
    cluster 1798897a-f0c9-422d-86b3-d4933a12c7ac
     health HEALTH_WARN
     monmap e3: 4 mons at {0=192.168.2.1:6789/0,1=192.168.2.2:6789/0,2=192.168.2.3:6789/0,3=192.168.2.4:6789/0}, election epoch 8, quorum 0,1,2,3 0,1,2,3
     osdmap e13: 2 osds: 2 up, 2 in
      pgmap v24: 192 pgs, 3 pools, 0 bytes data, 0 objects
            12235 GB used, 13452 GB / 29808 GB avail
                 192 active+clean
osd.2 on kroll4
After shoving a couple of data archives from /home to the /raid filesystem on thufir, we got the root drive down to 51% of capacity. The cluster now shows HEALTH_OK.
root #
ceph -s
    cluster 1798897a-f0c9-422d-86b3-d4933a12c7ac
     health HEALTH_OK
     monmap e3: 4 mons at {0=192.168.2.1:6789/0,1=192.168.2.2:6789/0,2=192.168.2.3:6789/0,3=192.168.2.4:6789/0}, election epoch 8, quorum 0,1,2,3 0,1,2,3
     osdmap e14: 2 osds: 2 up, 2 in
      pgmap v39: 192 pgs, 3 pools, 0 bytes data, 0 objects
            12269 GB used, 13407 GB / 29808 GB avail
                 192 active+clean
We will now add our final (for the moment) object store as osd.2 on kroll4. Its btrfs mirror set has its default volume mounted on /tuberaid
. It also has a different subvolume setup from thufir and mater. Instead of mounting and exporting a single /raid
subvolume, it has subvolumes for /home, /mythtv
and a dedicated volume for virtual machines as /vm
. Another wrinkle is that this system has been set up to use btrfs on the SSD for its /boot and / filesystems. In fact, the root filesystem is mounted as a subvolume of /systemraid.
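Before carving out yet another subvolume it does not hurt to double-check what already lives on the array. The command below is just an illustration; its output will differ per system:
root #
btrfs subvolume list /tuberaid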
root #
uuidgen
9b5550a2-7c92-4875-bea8-e61ee7a1ea15
root #
ceph osd create 9b5550a2-7c92-4875-bea8-e61ee7a1ea15
2
root #
cd /tuberaid
root #
btrfs subvol create ceph_osd_2
Create subvolume './ceph_osd_2'
/etc/fstab
updated with ceph_osd_2 mount
/dev/sda3 /systemraid btrfs defaults,noatime,ssd 0 0
/dev/sda2 /boot btrfs defaults,noatime,ssd 0 0
/dev/sda3 / btrfs defaults,noatime,ssd,compress=lzo,subvol=activeroot 0 0
#
# 2x4tb raid1/raid1 btrfs. /dev/sdb1,c1 has alternative grub2bios and /dev/sdb2,c2 has alternative /boot
#
/dev/sdb3 /tuberaid btrfs defaults,noatime,compress=lzo,autodefrag 0 0
/dev/sdb3 /home btrfs subvol=home,defaults,noatime,compress=lzo,autodefrag 0 0
/dev/sdb3 /mythtv btrfs subvol=mythtv,autodefrag,defaults,noatime 0 0
/dev/sdb3 /vm btrfs subvol=vm,compress=lzo,autodefrag,defaults,noatime 0 0
/dev/sdb3 /var/lib/ceph/osd/ceph-2 btrfs subvol=ceph_osd_2,compress=lzo,autodefrag,defaults,noatime 0 0
root #
mkdir -p /var/lib/ceph/osd/ceph-2
root #
mount /var/lib/ceph/osd/ceph-2
root #
df
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/sda3 249542980 26435412 221184188 11% /
devtmpfs 7634844 0 7634844 0% /dev
tmpfs 1527156 956 1526200 1% /run
shm 7635760 912 7634848 1% /dev/shm
cgroup_root 10240 0 10240 0% /sys/fs/cgroup
/dev/sda3 249542980 26435412 221184188 11% /systemraid
/dev/sda2 512000 47048 460860 10% /boot
/dev/sdb3 7813006984 443978232 7367515784 6% /tuberaid
/dev/sdb3 7813006984 443978232 7367515784 6% /home
/dev/sdb3 7813006984 443978232 7367515784 6% /mythtv
/dev/sdb3 7813006984 443978232 7367515784 6% /vm
/dev/sdb3 7813006984 443978232 7367515784 6% /var/lib/ceph/osd/ceph-2
root #
ceph-osd -i 2 --mkfs --mkkey --osd-uuid 9b5550a2-7c92-4875-bea8-e61ee7a1ea15
2014-07-14 18:52:38.014662 7f88eed18780 -1 journal FileJournal::_open: disabling aio for non-block journal. Use journal_force_aio to force use of aio anyway
2014-07-14 18:52:38.789413 7f88eed18780 -1 journal FileJournal::_open: disabling aio for non-block journal. Use journal_force_aio to force use of aio anyway
2014-07-14 18:52:38.803037 7f88eed18780 -1 filestore(/var/lib/ceph/osd/ceph-2) could not find 23c2fcde/osd_superblock/0//-1 in index: (2) No such file or directory
2014-07-14 18:52:39.057377 7f88eed18780 -1 created object store /var/lib/ceph/osd/ceph-2 journal /var/lib/ceph/osd/ceph-2/journal for osd.2 fsid 1798897a-f0c9-422d-86b3-d4933a12c7ac
2014-07-14 18:52:39.057479 7f88eed18780 -1 auth: error reading file: /var/lib/ceph/osd/ceph-2/keyring: can't open /var/lib/ceph/osd/ceph-2/keyring: (2) No such file or directory
2014-07-14 18:52:39.057665 7f88eed18780 -1 created new key in keyring /var/lib/ceph/osd/ceph-2/keyring
root #
ceph auth add osd.2 osd 'allow *' mon 'allow profile osd' -i /var/lib/ceph/osd/ceph-2/keyring
added key for osd.2
root #
ceph osd crush add-bucket kroll4 host
added bucket kroll4 type host to crush map
root #
ceph osd crush move kroll4 root=default
moved item id -4 name 'kroll4' to location {root=default} in crush map
root #
ceph osd crush add osd.2 1.0 host=kroll4
add item id 2 name 'osd.2' weight 1 at location {host=kroll4} to crush map
root #
cd /etc/init.d
root #
ln -s ceph ceph.osd-2
root #
rc-update add ceph.osd-2 default
* service ceph.osd-2 added to runlevel default
root #
./ceph.osd-2 start
root #
ceph osd tree
# id    weight  type name       up/down reweight
-1      3       root default
-2      1               host kroll1
0       1                       osd.0   up      1
-3      1               host kroll3
1       1                       osd.1   up      1
-4      1               host kroll4
2       1                       osd.2   up      1
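If you are curious what all of those crush commands have actually built, the compiled CRUSH map can be pulled from the cluster and decompiled for inspection. This is just a sketch, and the temporary file names are our own choice:
root #
ceph osd getcrushmap -o /tmp/crushmap.bin
root #
crushtool -d /tmp/crushmap.bin -o /tmp/crushmap.txt
root #
less /tmp/crushmap.txt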
root #
ceph -s
    cluster 1798897a-f0c9-422d-86b3-d4933a12c7ac
     health HEALTH_OK
     monmap e3: 4 mons at {0=192.168.2.1:6789/0,1=192.168.2.2:6789/0,2=192.168.2.3:6789/0,3=192.168.2.4:6789/0}, election epoch 8, quorum 0,1,2,3 0,1,2,3
     osdmap e20: 3 osds: 3 up, 3 in
      pgmap v57: 192 pgs, 3 pools, 0 bytes data, 0 objects
            12697 GB used, 20429 GB / 37259 GB avail
                 192 active+clean
It suddenly strikes us that we are almost assuredly the first kid in our neighborhood to set up a cluster with almost 40tb of distributed object store in the comfort of our own home. In fact, we are probably outclassing all but one or two NIH funded research labs that are nearby. The cost of the hardware was probably in the range of $5-7K since it was built from scratch, and it may even have been cheaper than a single desktop video editing system sold by a certain fruity computer company.
Setting up mds.0 on kroll2
The Ceph site wants us to use ceph-deploy to set up the mds servers, but that script has yet to be ported and packaged for Gentoo. We found enough information on a blogger's site to do a manual install for Gentoo. The /etc/ceph/ceph.conf
needs an edit to introduce a global [mds]
section for the default location of the mds server and the keyring that it needs.
/etc/ceph/ceph.conf
edited for new [mds] global section
[global]
# For version 0.54 and earlier, you may enable
# authentication with the following setting.
# Specifying `cephx` enables authentication;
# and specifying `none` disables authentication.
#auth supported = cephx
# For version 0.55 and beyond, you must explicitly enable
# or disable authentication with "auth" entries in [global].
fsid = 1798897a-f0c9-422d-86b3-d4933a12c7ac
cluster = kroll
public network = 192.168.2.0/24
auth cluster required = cephx
auth service required = cephx
auth client required = cephx
osd journal size = 1024
#
# default to replicating two copies of each object. Allow a minimum
# of one copy to be made when number of object stores is degraded
#
osd pool default size = 2
osd pool default min size = 1
#
# defaults from http://ceph.com/docs/master/install/manual-deployment/
#
osd pool default pg num = 333
osd pool default pgp num = 333
osd crush chooseleaf type = 1

[osd]
osd journal size = 1024
#
# This is a crutch for ext4 filesystems with limit of 4kb xattr data
# btrfs has no bounds for xattr store
#
# uncomment the following line if you are mounting with ext4
#filestore xattr use omap = true
# For Bobtail (v 0.56) and subsequent versions, you may
# add settings for mkcephfs so that it will create and mount
# the file system for you. Remove the comment `#` character for
# the following settings and replace the values in parenthesis
# with appropriate values, or leave the following settings commented
# out to accept the default values. You must specify the --mkfs
# option with mkcephfs in order for the deployment script to
# utilize the following settings, and you must define the 'devs'
# option for each osd instance; see below.
#osd mkfs type = {fs-type}
#osd mkfs options {fs-type} = {mkfs options} # default for xfs is "-f"
#osd mount options {fs-type} = {mount options} # default mount option is "rw, noatime"

[mon]
mon host = kroll1, kroll2, kroll3, kroll4
mon addr = 192.168.2.1:6789, 192.168.2.2:6789, 192.168.2.3:6789, 192.168.2.4:6789
mon initial members = 0,1,2

[mds]
mds data = /var/lib/ceph/mds/mds-$id
keyring = /var/lib/ceph/mds/mds-$id/keyring

[mon.0]
host = kroll1
mon addr = 192.168.2.1:6789

[mon.1]
host = kroll2
mon addr = 192.168.2.2:6789

[mon.2]
host = kroll3
mon addr = 192.168.2.3:6789

[mon.3]
host = kroll4
mon addr = 192.168.2.4:6789

[osd.0]
host = kroll1

[osd.1]
host = kroll3

[osd.2]
host = kroll4

[mds.0]
host = kroll2
The rest of the file stays untouched, including the [mds.0] section that we had put in much earlier. We then copy the updated conf file to the other hosts.
root #
cd /etc/ceph
root #
scp ceph.conf kroll2:/etc/ceph
ceph.conf 100% 2498 2.4KB/s 2.4KB/s 00:00
root #
scp ceph.conf kroll3:/etc/ceph
ceph.conf 100% 2498 2.4KB/s 2.4KB/s 00:00
root #
scp ceph.conf kroll4:/etc/ceph
ceph.conf 100% 2498 2.4KB/s 2.4KB/s 00:00
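With only three other hosts the copy is quick enough by hand, but the same distribution can also be done in one small loop (a sketch; adjust the host list to taste):
root #
for h in kroll2 kroll3 kroll4; do scp /etc/ceph/ceph.conf ${h}:/etc/ceph/; done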
On kroll2 we create the mds-0 directory and then use ceph auth
to create its keyring. The result is only a single file with a simple key value stanza.
root #
mkdir -p /var/lib/ceph/mds/mds-0
root #
ceph auth get-or-create mds.0 mds 'allow ' osd 'allow *' mon 'allow rwx' > /var/lib/ceph/mds/mds-0/keyring
root #
cd /var/lib/ceph/mds/mds-0
root #
ls -l
total 4
-rw-r--r-- 1 root root 56 Jul 14 23:10 keyring
root #
cat keyring
[mds.0] key = (redacted key text)
We then create a softlink of the ceph openrc script to enable and start mds-0.
root #
cd /etc/init.d
root #
ln -s ceph ceph.mds-0
root #
rc-update add ceph.mds-0 default
* service ceph.mds-0 added to runlevel default
root #
./ceph.mds-0 start
* Caching service dependencies ... [ ok ] * Starting Ceph mds.0 ... starting mds.0 at :/0 [ ok ]
Looking at ceph-mds.0.log and the ceph -s
command shows us that everything is fine and we now have an mds server running. We should now be able to use mount.ceph and export out the ceph namespace over nfs and cifs.
root #
cd /var/log/ceph
root #
cat ceph-mds.0.log
2014-07-14 23:11:25.537494 7f90bdf7a7c0 0 ceph version 0.80.1 (a38fe1169b6d2ac98b427334c12d7cf81f809b74), process ceph-mds, pid 23081 2014-07-14 23:11:25.574128 7f90b6b44700 1 mds.-1.0 handle_mds_map standby 2014-07-14 23:11:25.598605 7f90b6b44700 1 mds.0.1 handle_mds_map i am now mds.0.1 2014-07-14 23:11:25.598649 7f90b6b44700 1 mds.0.1 handle_mds_map state change up:standby --> up:creating 2014-07-14 23:11:25.598924 7f90b6b44700 0 mds.0.cache creating system inode with ino:1 2014-07-14 23:11:25.601827 7f90b6b44700 0 mds.0.cache creating system inode with ino:100 2014-07-14 23:11:25.601852 7f90b6b44700 0 mds.0.cache creating system inode with ino:600 2014-07-14 23:11:25.601991 7f90b6b44700 0 mds.0.cache creating system inode with ino:601 2014-07-14 23:11:25.602110 7f90b6b44700 0 mds.0.cache creating system inode with ino:602 2014-07-14 23:11:25.602188 7f90b6b44700 0 mds.0.cache creating system inode with ino:603 2014-07-14 23:11:25.602261 7f90b6b44700 0 mds.0.cache creating system inode with ino:604 2014-07-14 23:11:25.602356 7f90b6b44700 0 mds.0.cache creating system inode with ino:605 2014-07-14 23:11:25.602427 7f90b6b44700 0 mds.0.cache creating system inode with ino:606 2014-07-14 23:11:25.602536 7f90b6b44700 0 mds.0.cache creating system inode with ino:607 2014-07-14 23:11:25.602626 7f90b6b44700 0 mds.0.cache creating system inode with ino:608 2014-07-14 23:11:25.602762 7f90b6b44700 0 mds.0.cache creating system inode with ino:609 2014-07-14 23:11:25.602840 7f90b6b44700 0 mds.0.cache creating system inode with ino:200 2014-07-14 23:11:27.320245 7f90b6b44700 1 mds.0.1 creating_done 2014-07-14 23:11:27.597400 7f90b6b44700 1 mds.0.1 handle_mds_map i am now mds.0.1 2014-07-14 23:11:27.597410 7f90b6b44700 1 mds.0.1 handle_mds_map state change up:creating --> up:active 2014-07-14 23:11:27.597413 7f90b6b44700 1 mds.0.1 recovery_done -- successful recovery! 2014-07-14 23:11:27.597466 7f90b6b44700 1 mds.0.1 active_start
root #
ceph -s
    cluster 1798897a-f0c9-422d-86b3-d4933a12c7ac
     health HEALTH_OK
     monmap e3: 4 mons at {0=192.168.2.1:6789/0,1=192.168.2.2:6789/0,2=192.168.2.3:6789/0,3=192.168.2.4:6789/0}, election epoch 8, quorum 0,1,2,3 0,1,2,3
     mdsmap e4: 1/1/1 up {0=0=up:active}
     osdmap e20: 3 osds: 3 up, 3 in
      pgmap v86: 192 pgs, 3 pools, 1884 bytes data, 20 objects
            12703 GB used, 20423 GB / 37259 GB avail
                 192 active+clean
Creating and exporting the POSIX Filesystem
Once again, the Ceph web site offered scant details about how to go about mounting the object store as a POSIX filesystem with cephx authentication. After a lot of googling prompted by mount error 5 and the like, we hit upon the magic sauce that is necessary. If you have compiled the ceph network filesystem and libceph as kernel modules, you do not need to worry about a manual modprobe to have them loaded. The mount.ceph
command will take care of that for you. You can confirm that the following two modules are loaded after trying your first mount:
root #
lsmod | grep ceph
ceph 125992 1 libceph 114568 1 ceph
If you remember, the ceph.client.admin.keyring created back at the beginning of the install included an allow capability for mds operations. However, the format of that keyring file will not work with mount.ceph. We need to copy only the key value itself into an /etc/ceph/admin.secret
file.
/etc/ceph/ceph.client.admin.keyring
existing generated file
[client.admin]
        key = (redacted key text)
        auid = 0
        caps mds = "allow"
        caps mon = "allow *"
        caps osd = "allow *"
/etc/ceph/admin.secret
hacked up copy of ceph.client.admin with just the key
(redacted key text)
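Rather than hand-editing, the key can also be pulled out of the keyring with a one-liner. This is only a sketch and assumes the keyring lives at /etc/ceph/ceph.client.admin.keyring:
root #
grep 'key = ' /etc/ceph/ceph.client.admin.keyring | awk '{ print $3 }' > /etc/ceph/admin.secret
root #
chmod 600 /etc/ceph/admin.secret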
Because we are using nfs4, we will create an /export/kroll
and then use mount.ceph to mount to that. The user specified is just "admin", not "client.admin". Since we have four mons up, we will go ahead and specify all of them to mount the root ceph object namespace.
root #
mkdir /export/kroll
root #
mount -t ceph kroll1,kroll2,kroll3,kroll4:/ /export/kroll -o name=admin,secretfile=/etc/ceph/admin.secret
root #
df
/dev/sda2 245506816 36207828 196804556 16% /
devtmpfs 8195144 0 8195144 0% /dev
tmpfs 1639140 3212 1635928 1% /run
shm 8195696 80 8195616 1% /dev/shm
cgroup_root 10240 0 10240 0% /sys/fs/cgroup
/dev/sda1 466248 70287 371464 16% /boot
192.168.2.1,192.168.2.2,192.168.2.3,192.168.2.4:/ 39069155328 17653448704 21415706624 46% /export/kroll
And now we have a ginormous, mostly empty filesystem hanging off of /export/kroll. The 46% usage shown there comes from the non-ceph btrfs subvolumes that currently live on the three arrays providing our object stores. We will modify /etc/exports and then update the sharing to publish /export/kroll.
/etc/exports
modified nfs exports map
# /etc/exports: NFS file systems being exported. See exports(5).
/export *(rw,sync,fsid=0,no_subtree_check,no_root_squash,crossmnt)
/export/home *(rw,sync,no_subtree_check,no_root_squash)
/export/raid *(rw,sync,no_subtree_check,no_root_squash)
/export/kroll *(rw,sync,no_subtree_check,no_root_squash)
root #
exportfs -a -v
exporting *:/export/kroll
exporting *:/export/raid
exporting *:/export/home
exporting *:/export
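If the ceph mount on this host should also come back automatically after a reboot, an /etc/fstab entry along the following lines may work. This is a sketch we have not battle-tested, so treat the options as a starting point:
/etc/fstab
possible entry for mounting the ceph namespace at boot
kroll1,kroll2,kroll3,kroll4:/ /export/kroll ceph name=admin,secretfile=/etc/ceph/admin.secret,noatime,_netdev 0 0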