LXC


LXC (Linux Containers) is a virtualization system making use of the kernel's "cgroups" feature. It is conceptually similar to Solaris's Zones and FreeBSD's Jails, providing stronger segregation than a simple chroot without incurring the penalties of a full virtualization solution, and is comparable to other OS-level virtualization technologies on Linux such as OpenVZ and Linux-VServer. LXC is production ready, while LXD, under heavy development, uses LXC through liblxc and its Go binding to create and manage containers.

Concepts

Virtualization concepts

This section is a basic overview of how LXC fits into the virtualization world, the type of approach it uses, and the benefits and limitations thereof. If you are trying to figure out whether LXC is for you, or this is your first time setting up virtualization under Linux, then you should at least skim this section. Roughly speaking, there are two types of virtualization in use today: container-based virtualization and full virtualization.

Container-based virtualization (LXC)

Container based virtualization is very fast and efficient. It's based on the premise that an OS kernel provides different views of the system to different running processes. This sort of segregation or compartmentalisation (sometimes called "thick sandboxing") can be useful for ensuring guaranteed access to hardware resources such as CPU and IO bandwidth, whilst maintaining security and efficiency.

On the Unix family of operating systems, it is said[1] that container-based virtualization has its roots in the 1982 release of the chroot tool, a container-based virtualization tool limited to the filesystem, written by Sun Microsystems founder Bill Joy and published as part of 4.2BSD. Since this early tool, which has become a mainstay of the Unix world, a large number of Unix developers have worked to mature more powerful container-based virtualization solutions. Some examples:

  • Solaris Zones
  • FreeBSD Jails
  • Linux VServer
  • OpenVZ

On Linux, historically the major two techniques have been Linux-VServer (open source / community driven) and OpenVZ (a free spinoff of a commercial product).

However, neither of these was accepted into the Linux kernel. Instead, Linus opted for a more flexible, longer-term approach to achieving similar goals, using various new kernel features. LXC is the next-generation container-based virtualization solution that uses these new features.

Conceptually, LXC can be seen as a further development of the existing 'chroot' technique with extra dimensions added. Where 'chroot'-ing only offers isolation at the filesystem level, LXC offers complete logical isolation between a container, the host and all other containers. In fact, installing a new Gentoo container from scratch is pretty much the same as any normal Gentoo installation.

Some of the most notable differences include:

  • Each container shares the kernel with the host (and other containers). No kernel needs to be present and/or mounted in the container's /boot directory;
  • Devices and filesystems will be (more or less) 'inherited' from the host, and need not be configured as they would be for a normal installation;
  • If the host is using the OpenRC system for bootstrapping, such configuration items will "automagically" be omitted (i.e. filesystem mounts from fstab).

The last point is important to keep an LXC-based installation as simple as possible and as close as possible to a normal installation (no exceptions).

Full virtualization (not LXC)

Full virtualization and paravirtualization solutions aim to simulate the underlying hardware. This type of solution, unlike LXC and other container-based solutions, usually allows you to run any operating system. Whilst this may be useful for the purposes of security and server consolidation, it is hugely inefficient compared to container-based solutions. The most popular solutions in this area right now are probably VMware, KVM, Xen and VirtualBox.

Limitations of LXC

With LXC, you can efficiently manage resource allocation in real time. In addition, you should be able to run different Linux distributions on the same host kernel in different containers. Nevertheless, there may be teething issues with startup and shutdown 'run control' (rc) scripts, and these may need to be modified slightly to make some guests work. That said, maintainers of tools such as OpenRC are increasingly implementing LXC detection to ensure correct behaviour when their code runs within containers.

Unlike full virtualization solutions, LXC will not let you run other operating systems (such as proprietary operating systems, or other types of Unix). Theoretically, there is no reason why you can't install a full or paravirtualization solution on the same kernel as your LXC host system and run full/paravirtualised guests alongside LXC guests at the same time. Should you elect to do this, there are powerful abstracted virtualization management APIs under development, such as libvirt and ganeti, that you may wish to check out.

In short:

  • One kernel
  • One operating system
  • Many instances

... but can co-exist with other virtualization solutions if required.

Security Concerns

  • root in a container has all capabilities
    • Do not treat root privileges in the container any more lightly than on the host itself.
    • A better solution is to use unprivileged containers (see below).
  • legacy UID/GID comparisons in many parts of the kernel code are dumb and will not respect containers
    • Do not mount parts of external filesystems within a container, except ro (read only).
    • Do not re-use UIDs/GIDs between the container and the host
  • shutdown and halt act on the host system.
    • Restrict or replace them in the container.
  • Don't do both (1) mount /proc in a guest that you don't trust, and (2) have CONFIG_MAGIC_SYSRQ ('Magic SysRq key') enabled in your kernel (which creates /proc/sysrq-trigger), as this can be abused for denial of service.
    • Turn off the MAGIC_SYSRQ option in the kernel config.

LXC components

LXC uses two new and lesser-known kernel features known as 'control groups' and 'POSIX file capabilities'. It also includes 'template scripts' to set up different guest environments.

Control groups

Control Groups are a multi-hierarchy, multi-subsystem resource management / control framework for the Linux kernel.

In simpler language, what this means is that unlike the old chroot tool which was limited to the file subsystem, control groups let you define a 'group' encompassing one or more processes (eg: sshd, Apache) and then specify a variety of resource control and accounting options for that control group against multiple subsystems, such as:

  • Filesystem access
  • General device access
  • Memory resources
  • Network device resources
  • CPU bandwidth
  • Block device IO bandwidth
  • Various other aspects of a control group's view of the system

User-space access to these kernel features is provided by a kernel filesystem known as 'cgroup'. It is typically mounted under /sys/fs/cgroup and provides files, similar to /proc and /sys, representing the running environment and various kernel configuration options.
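As a quick illustration of this interface (a sketch, assuming a cgroup v1 memory hierarchy mounted under /sys/fs/cgroup/memory as shown later in this article), a control group can be created, limited and populated with nothing but filesystem operations:

root #mkdir /sys/fs/cgroup/memory/demo
root #echo $((256*1024*1024)) > /sys/fs/cgroup/memory/demo/memory.limit_in_bytes
root #echo $$ > /sys/fs/cgroup/memory/demo/cgroup.procs

Every process subsequently spawned from that shell inherits the 256 MiB memory limit; the group can be removed again with rmdir once it no longer contains tasks.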

POSIX file capabilities

POSIX file capabilities are a way to allocate privileges to a process that allow for more specific security controls than the traditional 'root' vs. 'user' privilege separation on unix family operating systems.

Host setup

To get an LXC-capable host system working you will need to complete the following steps.

Kernel with the appropriate LXC options enabled

If you are unfamiliar with recompiling kernels, see the copious documentation available on that subject in addition to the notes below.

Kernel options required

The app-emulation/lxc ebuild checks for the most important kernel options required to set up an LXC host. This is not a fatal check, though, so you have to make sure yourself that the options are correctly enabled. The package also comes with an upstream-provided lxc-checkconfig script that reports on the proper options.

root #cd /usr/src/linux
root #lxc-checkconfig

Write down the missing kernel options from the output, or leave it open and switch to a new terminal. If lxc-checkconfig complains about missing "File capabilities", ignore it; the feature is now enabled by default and the setting has been removed.

root #make menuconfig

Search for each kernel CONFIG feature listed in the output of lxc-checkconfig script using the / search hot-key, enable them one by one, save the new configuration, and quit. For more information on kernel configuration visit the kernel configuration article.

Once finished build the kernel:

root #make && make modules_install && make install
General options

(The KERNEL configuration snippets for the general, namespace and control group options did not survive the wiki conversion; see the reconstructed listing below.)
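Based on the checks performed by lxc-checkconfig, the options in question are most likely the following (a reconstruction, not the original listing; menu names vary slightly between kernel versions, so verify them against the script's output for your own kernel):

CONFIG_NAMESPACES / "Namespaces support" ('General setup -> Namespaces support')
CONFIG_UTS_NS / "UTS namespace"
CONFIG_IPC_NS / "IPC namespace"
CONFIG_PID_NS / "PID Namespaces"
CONFIG_NET_NS / "Network namespace"
CONFIG_USER_NS / "User namespace" (needed for unprivileged containers)
CONFIG_CGROUPS / "Control Group support" ('General setup -> Control Group support')
CONFIG_CGROUP_DEVICE / "Device controller for cgroups"
CONFIG_CPUSETS / "Cpuset support"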
Freezer support

Freezer support allows you to 'freeze' and 'thaw' a running guest, something like 'suspend' under VMware products. It appears to be under heavy development as of October 2010 (LXC list) but is apparently mostly functional. Please add additional notes on this page if you explore further.

CONFIG_CGROUP_FREEZER / "Freeze/thaw support" ('General Setup -> Control Group support -> Freezer cgroup subsystem')
Scheduling options

Scheduling allows you to specify how much hardware access (CPU bandwidth, block device bandwidth, etc.) control groups have.

CONFIG_CGROUP_SCHED / "Cgroup sched" ('General Setup -> Control Group support -> Group CPU scheduler')
FAIR_GROUP_SCHED / "Group scheduling for SCHED_OTHER" ('General Setup -> Control Group support -> Group CPU scheduler -> Group scheduling for SCHED_OTHER')
CONFIG_BLK_CGROUP / "Block IO controller" ('General Setup -> Control Group support -> Block IO controller')
CONFIG_CFQ_GROUP_IOSCHED / "CFQ Group Scheduling support" ('Enable the block layer -> IO Schedulers -> CFQ I/O scheduler -> CFQ Group Scheduling support')
Memory/swap accounting

To measure resource utilization in your guest...

CONFIG_CGROUP_MEM_RES_CTLR / "Cgroup memory controller" ('General Setup -> Control Group support -> Resource counters -> Memory Resource Controller for Control Groups')

If you want to also count swap utilization, also select...

CONFIG_CGROUP_MEM_RES_CTLR_SWAP / "Memory Resource Controller Swap Extension(EXPERIMENTAL)" ('General Setup -> Control Group support -> Resource counters -> Memory Resource Controller for Control Groups -> Memory Resource Controller Swap Extension')

Resource counters were recently removed from the kernel and replaced with page counters which are now automatically selected when the above is selected. Ignore any userland warnings about missing resource counter config.

CPU accounting

This allows you to measure the CPU utilization of your control groups.

CONFIG_CGROUP_CPUACCT / "Cgroup cpu account" ('General Setup -> Control Group support -> Simple CPU accounting cgroup subsystem')
Networking options

Ethernet bridging, veth, macvlan and vlan (802.1q) support are optional, but you probably want at least one of these:

CONFIG_BRIDGE / "802.1d Ethernet Bridging" ('Networking support -> Networking options -> 802.1d Ethernet Bridging')
CONFIG_VETH / "Veth pair device"
CONFIG_MACVLAN / "Macvlan"
CONFIG_VLAN_8021Q / "Vlan"

Further details about LXC networking options are available on Flameeyes's Weblog

LXC userspace utilities

Due to LXC's still unstable nature, Gentoo provides ebuilds only for the most recent version available. Therefore, make sure to update the Portage tree before proceeding:

root #emerge --sync
root #emerge --ask app-emulation/lxc

Mounted cgroup filesystem

The 'cgroup' filesystem provides user-space access to the required kernel control group features, and is required by the LXC userspace utilities. Up to kernel 3.1, the filesystem's mountpoint wasn't well defined; nowadays it is defined to be mounted (split) within /sys/fs/cgroup. Recent OpenRC versions already mount it during boot, and the app-emulation/lxc ebuild already depends on a new enough version.

Check it using:

root #mount | grep cgroup
cgroup_root on /sys/fs/cgroup type tmpfs (rw,nosuid,nodev,noexec,relatime,size=10240k,mode=755)
openrc on /sys/fs/cgroup/openrc type cgroup (rw,nosuid,nodev,noexec,relatime,release_agent=/lib64/rc/sh/cgroup-release-agent.sh,name=openrc)
cpuset on /sys/fs/cgroup/cpuset type cgroup (rw,nosuid,nodev,noexec,relatime,cpuset)
cpu on /sys/fs/cgroup/cpu type cgroup (rw,nosuid,nodev,noexec,relatime,cpu)
cpuacct on /sys/fs/cgroup/cpuacct type cgroup (rw,nosuid,nodev,noexec,relatime,cpuacct)
memory on /sys/fs/cgroup/memory type cgroup (rw,nosuid,nodev,noexec,relatime,memory)
devices on /sys/fs/cgroup/devices type cgroup (rw,nosuid,nodev,noexec,relatime,devices)
freezer on /sys/fs/cgroup/freezer type cgroup (rw,nosuid,nodev,noexec,relatime,freezer)

To use systemd in the container, a recent enough (>=4.6) kernel version with support for cgroup namespaces is needed. Additionally the host needs to have a name=systemd cgroup hierarchy mounted. Doing so does not require running systemd on the host:

root #mkdir -p /sys/fs/cgroup/systemd
root #mount -t cgroup -o none,name=systemd systemd /sys/fs/cgroup/systemd
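If the host is not running systemd, this mount is not recreated automatically at boot. One way to persist it (a sketch using OpenRC's local service; the file name is arbitrary) is a small /etc/local.d script:

FILE /etc/local.d/lxc-systemd-cgroup.start
#!/bin/sh
# Provide the name=systemd cgroup hierarchy expected by systemd guests.
if ! mountpoint -q /sys/fs/cgroup/systemd; then
    mkdir -p /sys/fs/cgroup/systemd
    mount -t cgroup -o none,name=systemd systemd /sys/fs/cgroup/systemd
fi

Remember to make the script executable (chmod +x) so that the local service picks it up at boot.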

Network configuration

The network section defines how the network is virtualized in the container. The network virtualization acts at layer two. In order to use the network virtualization, parameters must be specified to define the network interfaces of the container. Several virtual interfaces can be assigned and used in a container even if the system has only one physical network interface.

According to the lxc.network.type option in lxc.conf, there are six types of network virtualization that can be used for the container:

  • none: the container shares the host's network namespace. This means the host network devices are usable in the container. It also means that if both the container and the host use upstart as init, 'halt' in a container (for instance) will shut down the host, so it is not a safe option at all.
  • empty: only the loopback interface is created. The container has no network connection to the outside world.
  • phys: an already existing physical device interface specified by lxc.network.link is assigned to the container. This means you need a spare network device for the container to use.
  • veth: a virtual Ethernet pair device is created, with one side assigned to the container and the other side attached to a bridge (see Network bridge) specified by the lxc.network.link option. If the bridge is not specified, the veth pair device will be created but not attached to any bridge. Otherwise, the bridge has to be created on the system before starting the container; LXC won't handle any configuration outside of the container. This is the most common option for an isolated network inside the container with a connection to the outside world, and it is well suited for home use.
  • vlan: a vlan interface is linked with the interface specified by lxc.network.link and assigned to the container. The vlan identifier is specified with the option lxc.network.vlan.id. VLANs are typically used to split big networks into parts (subnetworks) that are isolated from each other.
  • macvlan: a macvlan interface is linked with the interface specified by lxc.network.link and assigned to the container. lxc.network.macvlan.mode specifies the mode the macvlan will use to communicate between different macvlan interfaces on the same upper device. The accepted modes are:
    • private: the device never communicates with any other device on the same upper_dev (default).
    • vepa: the Virtual Ethernet Port Aggregator (VEPA) mode. It assumes that the adjacent bridge returns all frames where both source and destination are local to the macvlan port, i.e. the bridge is set up as a reflective relay. Broadcast frames coming in from the upper_dev get flooded to all macvlan interfaces in VEPA mode; local frames are not delivered locally.
    • bridge: provides the behavior of a simple bridge between different macvlan interfaces on the same port. Frames from one interface to another are delivered directly and are not sent out externally. Broadcast frames get flooded to all other bridge ports and to the external interface, but when they come back from a reflective relay they are not delivered again. Since all the MAC addresses are known, the macvlan bridge mode does not require learning or STP like the bridge module does.
    For more information about macvlan modes, with clear pictures, see the Virtual switching technologies and Linux bridge presentation. Also note that the macvlan option usually needs an external gateway or switch and won't communicate with the host's internally configured gateway. The container will be seen outside the host as a separate network interface with its own MAC address. So, if you attach the container to the external WAN interface, your Internet provider will see it as a device with a different MAC address; if you paid for only one MAC address, you won't get Internet access inside the container in that case. This option is therefore mostly useful for larger Internet servers with spare external WAN addresses and a separate gateway.

Simple network configuration

As suggested in globalcitizen/lxc-gentoo, basic connectivity can be achieved by setting the IPs of the guest and virtual interfaces and establishing the appropriate iptables/nftables masquerade routes.

In this example, we have enp5s0 as the host interface with IP address 192.168.1.100. The container's configuration file assigns 192.168.10.101 to the interface and 192.168.10.100 to the gateway.

FILE ~/.config/lxc/guest.conf
lxc.network.type = veth
lxc.network.flags = up
lxc.network.name = eth0
lxc.network.ipv4 = 192.168.10.101/24
lxc.network.ipv4.gateway = 192.168.10.100

A virtual interface will appear after starting the container:

root #ip link
6: vethCOB3OK@if5: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000
    link/ether fe:b7:4a:99:aa:3e brd ff:ff:ff:ff:ff:ff link-netnsid 0

Setting the IP of the virtual interface (vethCOB3OK) to the address of the container's gateway (192.168.10.100) and adding the masquerade routes to the firewall will give internet access to the guest:

root #ip addr add 192.168.10.100/24 dev vethCOB3OK
root #ip addr
|6: vethCOB3OK@if5: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether fe:b7:4a:99:aa:3e brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 192.168.10.100/24 scope global vethCOB3OK
       valid_lft forever preferred_lft forever
    inet6 fe80::fcb7:4aff:fe99:aa3e/64 scope link 
       valid_lft forever preferred_lft forever
root #nft add rule nat postrouting oif enp5s0 masquerade

Packet forwarding may need to be enabled in the firewall configuration. See the detailed information in the next sections.

Host configuration for VLANs inside a bridge connected to the container's virtual Ethernet pair device

Let's assume that we have a host with the enp2s0 device connected to the provider's LAN, through which it connects to the Internet (WAN) using the ppp0 interface. We also have our private LAN on the enp3s6 interface. Since we don't have many spare network interfaces and we also want some network isolation for the container, let's create another VLAN interface (enp3s6.1) on the host, assigned to our private LAN interface enp3s6. Then we put it inside the bridge br0.1 as a port.

FILE /etc/conf.d/net
# VLAN (802.1q)
vlans_enp3s6="1"
# bridge ports defined empty to avoid DHCP being run for their configuration (bridge will have 1 IP)
config_enp3s6_1="null"

# Bridge (802.1d)
# To add port to bridge dynamically when the interface comes up
bridge_add_enp3s6_1="br0.1"
config_br0_1="192.168.10.1/24"

# Note that it is important to include 'bridge_forward_delay_br0_1=0' and 'bridge_hello_time_br0_1=1000' in order
# to bring the interface up quickly. Other values will cause network packets
# to be dropped for the first 30 seconds after the bridge has become active.
# This in turn could prevent DHCP from working.
bridge_forward_delay_br0_1=0
bridge_hello_time_br0_1=1000
bridge_stp_state_br0_1=0

Then create the bridge interface, restart the enp3s6 interface to bring up enp3s6.1, and add the bridge interface to the default runlevel:

root #cd /etc/init.d/
root #ln -s net.lo net.br0.1
root #cd ~
root #rc-service net.enp3s6 restart
root #rc-service net.br0.1 start
root #rc-update add net.br0.1

You will have something like the following configuration:

root #ip addr
3: enp3s6: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
    link/ether xx:xx:xx:xx:xx:xx brd ff:ff:ff:ff:ff:ff
    inet 10.0.0.1/24 brd 10.0.0.255 scope global enp3s6
       valid_lft forever preferred_lft forever
    inet6 xxxx::xxxx:xxxx:xxxx:xxxx/64 scope link 
       valid_lft forever preferred_lft forever
4: enp2s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
    link/ether xx:xx:xx:xx:xx:xx brd ff:ff:ff:ff:ff:ff
    inet 10.55.1.101/24 brd 10.52.1.255 scope global enp2s0
       valid_lft forever preferred_lft forever
    inet6 xxxx::xxxx:xxxx:xxxx:xxxx/64 scope link 
       valid_lft forever preferred_lft forever
5: enp3s6.1@enp3s6: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master br0.1 state UP group default 
    link/ether xx:xx:xx:xx:xx:xx brd ff:ff:ff:ff:ff:ff
    inet6 xxxx::xxxx:xxxx:xxxx:xxxx/64 scope link 
       valid_lft forever preferred_lft forever
6: ppp0: <POINTOPOINT,MULTICAST,NOARP,UP,LOWER_UP> mtu 1400 qdisc pfifo_fast state UNKNOWN group default qlen 3
    link/ppp 
    inet 76.54.32.101 peer 76.54.20.10/32 scope global ppp0
       valid_lft forever preferred_lft forever
8: br0.1: <BROADCAST,MULTICAST,PROMISC,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default 
    link/ether xx:xx:xx:xx:xx:xx brd ff:ff:ff:ff:ff:ff
    inet 192.168.10.1/24 brd 192.168.10.255 scope global br0.1
       valid_lft forever preferred_lft forever
    inet6 xxxx::xxxx:xxxx:xxxx:xxxx/64 scope link 
       valid_lft forever preferred_lft forever

Let's now start our container with veth assigned to our bridge br0.1. You'll get another network interface on the host's side which looks like this:

root #ip addr
...
10: vethB004H3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master br0.1 state UP group default qlen 1000
    link/ether xx:xx:xx:xx:xx:xx brd ff:ff:ff:ff:ff:ff
    inet6 xxxx::xxxx:xxxx:xxxx:xxxx/64 scope link 
       valid_lft forever preferred_lft forever

Both our host's enp3s6.1 VLAN and container's virtual Ethernet pair device vethB004H3 are ports of our bridge br0.1:

root #brctl show
bridge name     bridge id               STP enabled     interfaces
br0.1           8000.blablablabla       no              enp3s6.1
                                                        vethB004H3

Host configuration with NAT networking (nftables)

Let's now give the container Internet access. We'll use nftables for that (for iptables see the next section). Since we don't want the container to access our private LAN or our provider's LAN, we'll only give it access to the ppp0 WAN device. Let's assume you already have a configuration on your host similar to Nftables/Examples#Simple_stateful_router_example. Then you'll have to add several rules to it in the appropriate places.

FILE /home/rt/scripts/nft.sh
#!/bin/bash
 
nft="/sbin/nft";
...
LAN_PRIVATE_LXC=192.168.10.1/24
export WAN=ppp0
...
#4LXC
${nft} add rule nat postrouting oifname ${WAN} ip saddr ${LAN_PRIVATE_LXC} masquerade;
...
#4LXC
${nft} add rule filter forward ip saddr ${LAN_PRIVATE_LXC} ct state new accept;
echo ACCEPT ${LAN_PRIVATE_LXC} ALL;
...
/etc/init.d/nftables save;

This will give you Internet access inside the container. You can later create more isolated containers, each inside a separate bridge br0.X, or connect several containers' interfaces inside one br0.Y.
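To verify that the new rules are in place, list the ruleset; the packet counters on the masquerade rule should increase once the container generates outbound traffic:

root #nft list ruleset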

Host configuration with NAT networking (iptables)

For simple network access from the container to the outside world via NAT using iptables, we can masquerade all connections from our container network to the outside world via the device that is connected to the Internet.

Before this, you should already have created the device net.br0.1 (see the section above)

FILE /etc/lxc/lxc-usernet
lxc veth br0.1 2

and your guest container should have something like this (for a full listing of this file and its configuration, please read the next section or the unprivileged container section):

FILE ~/.config/lxc/guest.conf
# Other configuration here
lxc.network.type = veth
lxc.network.flags = up
lxc.network.link = br0.1
lxc.network.name = eth0
lxc.network.ipv4 = 192.168.10.101/24
lxc.network.ipv4.gateway = 192.168.10.1
lxc.network.hwaddr = b6:65:81:93:cb:a0
# Optionally, UID/GID mapping here

First, enable IPv4 packet forwarding:

root #echo 1 > /proc/sys/net/ipv4/ip_forward

If you want to enable it at every boot time:

FILE /etc/sysctl.conf
...
net.ipv4.ip_forward = 1
...

Configure NAT to accept and masquerade all connections from the container to the outside. The command uses the output device enp5s0; set the correct device name of your network/Wi-Fi card (you can find it via the ifconfig output).

root #iptables -P FORWARD ACCEPT
root #iptables -t nat -A POSTROUTING -s 192.168.10.1/24 -o enp5s0 -j MASQUERADE

To save this rule for future boot-ups:

root #rc-service iptables save

Now you can start your container and check networking:

user $ping 8.8.8.8
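If the ping does not get through, check that forwarding is really enabled and that the MASQUERADE rule is matching traffic (its packet counters should increase):

root #cat /proc/sys/net/ipv4/ip_forward
root #iptables -t nat -L POSTROUTING -n -v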

Guest configuration for a virtual Ethernet pair device connected by bridge

Your guest network configuration resides in the guest's /etc/lxc/<lxcname>/config file. To auto-generate it we will use distribution-specific template scripts, but we need a base network configuration for that generation. We will use /etc/lxc/guest.conf as that base config file. Documentation for both of these files is available via man lxc.conf.

Your guest configuration should include the following network-related lines:

FILE /etc/lxc/guest.conf
lxc.network.type = veth
lxc.network.flags = up
lxc.network.link = br0.1
lxc.network.name = eth0
#lxc.network.ipv4 = 192.168.10.101/24
#lxc.network.hwaddr = b6:65:81:93:cb:a0
Note
If you are not using DHCP inside the container to get an IP address, then just delete the 'lxc.network.hwaddr' line, and manually specify the IP you want to use next to lxc.network.ipv4.

If you are using DHCP inside the container to get an IP address, then run it once as shown. LXC will generate a random MAC address for the interface. To keep your DHCP server from getting confused, you will want to use that MAC address all the time. So find out what it is, and then uncomment the 'lxc.network.hwaddr' line and specify it there.
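One way to find the generated MAC address is to query the interface from the host once the guest is running (a sketch; lxc-attach is described later in this article, and this assumes the iproute2 tools are present in the guest):

root #lxc-attach -n guestname -- ip link show eth0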

Note
If you have compiled bridge netfilter into your kernel, the LXC guest will only be able to ping the host and not other computers on your LAN or the internet, since all network traffic from the bridge is filtered by the kernel for routing. (See [1])

The solution is to disable all bridge-nf-* filters in /proc/sys/net/bridge, e.g.:

root #for f in /proc/sys/net/bridge/bridge-nf-*; do echo 0 > $f; done

You can permanently disable the bridge-nf-* filters by setting each to '0' in /etc/sysctl.conf:

FILE /etc/sysctl.conf
net.bridge.bridge-nf-call-ip6tables = 0
net.bridge.bridge-nf-call-iptables = 0
net.bridge.bridge-nf-call-arptables = 0
net.bridge.bridge-nf-filter-vlan-tagged = 0
net.bridge.bridge-nf-filter-pppoe-tagged = 0
net.bridge.bridge-nf-pass-vlan-input-dev = 0

or by creating a file with those same settings in /etc/sysctl.d/99-bridge-nf-dont-pass.conf:
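FILE /etc/sysctl.d/99-bridge-nf-dont-pass.conf
net.bridge.bridge-nf-call-ip6tables = 0
net.bridge.bridge-nf-call-iptables = 0
net.bridge.bridge-nf-call-arptables = 0
net.bridge.bridge-nf-filter-vlan-tagged = 0
net.bridge.bridge-nf-filter-pppoe-tagged = 0
net.bridge.bridge-nf-pass-vlan-input-dev = 0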

Alternatively, you can avoid the above trouble with bridge-netfilter by configuring the in-kernel bridge settings correctly or by turning some of them off. For example, with the following bridge-related kernel config options there is nothing inside /proc/sys/net/bridge/ at all, and LXC works fine inside the bridge br0.1:

root #grep BRIDGE /usr/src/linux/.config
CONFIG_BRIDGE_NETFILTER=m
# CONFIG_NF_TABLES_BRIDGE is not set
# CONFIG_BRIDGE_NF_EBTABLES is not set
CONFIG_BRIDGE=m
CONFIG_BRIDGE_IGMP_SNOOPING=y
CONFIG_BRIDGE_VLAN_FILTERING=y

Adjusting guest config of the container after using template script

If the network does not work inside the container after using a template script, you can always adjust the guest configuration on the host using the /etc/lxc/<lxcname>/config file. For example:

FILE /etc/lxc/alpha/config
lxc.network.type = veth
lxc.network.flags = up
lxc.network.link = br0.1
lxc.network.name = eth0
lxc.network.ipv4 = 192.168.10.101/24
lxc.network.ipv4.gateway = 192.168.10.1

You can also always change the network configuration inside the container by adjusting its configuration files (after logging into the container), for example:

FILE /etc/resolv.conf
nameserver 8.8.8.8

Guest setup

Template scripts

A number of 'template scripts' are distributed with the LXC package. These scripts assist with generating various guest environments.

Template scripts live in /usr/share/lxc/templates but should be executed via the lxc-create tool as follows:

root #lxc-create -n guestname -t template-name -f configuration-file-path

The rootfs of the Linux container is stored in /etc/lxc/guestname/.

Configuration files (the -f configuration-file option) are usually used to specify the network settings for the initial guest configuration as we described above inside /etc/lxc/guest.conf.

Using download as the template-name displays a list of available guest environments to download and saves the guest image in /var/lib/lxc. See below under LXC#LXC_pre-built_containers.

root #lxc-create -t download -n guestname -f configuration-file-path

Gentoo

Automatic setup: LXC standard Gentoo template script

This is probably the recommended way now, since the latest internal app-emulation/lxc Gentoo template script is based on the lxc-gentoo script and adds some fixes to it, such as:

  • Out-of-the-box lxc-create compatibility
  • Vanilla Gentoo config
  • Ready to use cache (shared Portage, distfiles, eix cache)

See /usr/share/lxc/templates/lxc-gentoo for additional info and also consult /etc/lxc/<lxcname>/config file after using this template script.

Let's use LXC's template script to create a Gentoo guest:

root #lxc-create -t gentoo -n <guestname> -f /etc/lxc/guest.conf

After creating the guest you can manage it as usual.

Automatic setup: lxc-gentoo

The lxc-gentoo tool can download, extract and configure a Gentoo guest for you, including cryptographic validation of sources and support for arbitrary architectures/variants via Qemu.

root #lxc-gentoo -q && lxc-start -n gentoo -f gentoo.conf

You can grab it from Github.

Additional developers, bug fixes, comments, etc. are welcome.

Other distributions

Alt Linux

Fixme: this template script cannot be executed on Gentoo Linux directly, because it uses the apt-get command to download the Alt Linux guest.

Arch Linux

lxc-archlinux template assists with setting up Archlinux guests (see Archlinux Chroot in Gentoo). Note that in order to use lxc-archlinux, you must:

root #emerge sys-apps/pacman

Fixme: It seems that pacman-4.0.1 does not work correctly on Gentoo Linux:

root #pacman
error: failed to initialize alpm library (Could not find or read directory)

You need to edit pacman configuration:

FILE /etc/pacman.conf
# The default is /var/chroot/archlinux. If you already have an archlinux chroot, you can use it for the RootDir
RootDir     = /
# Uncomment one of the XferCommand options
XferCommand = /usr/bin/wget --passive-ftp -c -O %o %u
# Add the core repository
# You can generate the mirrorlist at https://www.archlinux.org/mirrorlist/
[core]
Include = /etc/pacman.d/mirrorlist

You will also need to install these tools: https://projects.archlinux.org/arch-install-scripts.git

Fixme: The archlinux template does not create a working container (app-emulation/lxc-0.8.0_rc2-r1), giving an error about not being able to find /sbin/init (the file /usr/lib/systemd/systemd does not exist). Chrooting into the Linux container (the rootfs directory) and issuing:

root #pacman -S systemd systemd-sysvcompat initscripts

solves this issue. Also, you need CONFIG_DEVTMPFS activated in the kernel configuration if you configure the container as described in the Arch Linux wiki.

Note
Further steps are needed to set up a working Arch Linux container on Gentoo.

Edit: This is working perfectly with app-emulation/lxc-1.1.5

Busybox

LXC contains a minimal template script for BusyBox. BusyBox is basically a base system oriented towards embedded use, where many base utilities exist in an optimized form within one stripped binary to save memory. BusyBox is installed as part of the base Gentoo system, so the script works right away. Example:

root #lxc-create -t busybox -n guest-name -f config-file

Debian

You will need to install the dev-util/debootstrap package:

root #emerge --ask dev-util/debootstrap

You can then use the LXC supplied Debian template script to download all required files, generate a configuration file and a root filesystem for your guest.

root #lxc-create -t debian -n guest-name -f configuration-file

Fedora

lxc-fedora template assists with setting up Fedora guests. Note that in order to use lxc-fedora, you must:

root #emerge --ask sys-apps/yum

You will also need to install febootstrap tool from http://people.redhat.com/~rjones/febootstrap/. An ebuild has been created but is not yet in the Portage tree (See bug #309805).

In addition, in order for the script /usr/share/lxc/templates/lxc-fedora to mount the squashfs on the loop device, you need to have CONFIG_SQUASHFS=m and CONFIG_SQUASHFS_XZ=y selected in the kernel config.

OpenSUSE

The lxc-opensuse template assists with setting up OpenSUSE guests. Fixme: the zypper command-line package manager tool is missing from Gentoo Portage.

sshd

LXC contains a minimal template script for sshd guests. You can create an sshd guest with:

root #lxc-create -t sshd -n guest-name -f configuration-file

Ubuntu

LXC contains a minimal template script for Ubuntu guests (see ubuntu.com). Note that in order to use lxc-ubuntu, you must:

root #emerge dev-util/debootstrap

Usage is as follows...

root #lxc-create -t ubuntu -n ubuntu-guest -f network-configuration-file

Or, in versions < app-emulation/lxc-0.8.0_rc2-r1

root #lxc-create -t ubuntu -n ubuntu-guest -f network-configuration-file -- -r existing system user

This will create the folder ubuntu-guest. Inside the folder there will be a file called config. It takes a very long time to create an Ubuntu guest, so please be patient.

Another example alternative using the download template:

root #lxc-create -t download -n ubuntu-guest -- -d ubuntu -r trusty -a amd64

Using the guest container

Manual use

To start and stop the guest container, simply run:

root #lxc-start -n guestname
root #lxc-stop -n guestname

Please be aware that when you have daemonized the booting process (-d), you will not get any output on screen. This might happen when you conveniently use an alias which daemonizes by default and have forgotten about it. You may be puzzled later if there is a problem while booting a new container that has not been configured properly (e.g. the network).

You should use the username and password of the existing system user used when creating the container.

To set the root password, enter the directory /etc/lxc/guestname and you will see the directory rootfs. Issue:

root #chroot rootfs /bin/bash

Set the password with the command:

root #passwd

Use from Gentoo init system

Gentoo's ebuild (without the vanilla USE flag enabled) provides an init script to manage containers and start them at boot time. To make use of the init script you just have to create a symlink in the /etc/init.d/ directory:

root #ln -s lxc /etc/init.d/lxc.guestname
root #/etc/init.d/lxc.guestname stop
root #/etc/init.d/lxc.guestname start

Of course, the use of such scripts is primarily intended for booting and stopping the system. To add a guest to the rc chain, run:

root #rc-update add lxc.guestname default

To enter a (already started) guest directly from the host machine, see the lxc-console section below.

Use from Gentoo systemd

To start the system in the container, call:

root #systemctl start lxc@guestname.service

To stop it again, issue:

root #systemctl stop lxc@guestname.service

To start it automatically at (host) system boot up, use:

root #systemctl enable lxc@guestname.service

Accessing the guest

lxc-console

Using lxc-console provides console access to the guest. To use it, type:

root #lxc-console -n guestname

If you get a message saying lxc-console: console denied by guestname, then you need to add to your container config:

lxc.tty = 1

To exit the console, press Ctrl+a followed by q.

Note that unless you log out inside the guest, you will remain logged on, so the next time you run lxc-console, you will return to the same session.

Usage of lxc-console should be restricted to root. It is primarily a tool for system administrators (root) to enter a newly created container, e.g. when the network connection is not properly configured yet. Using multiple instances of lxc-console on distinct guests works fine, but starting a second instance for a guest that is already governed by another lxc-console session leads to redirected keyboard input and terminal output. It is best to avoid the use of lxc-console where possible. (Perhaps the LXC developers should enhance the tool so that only a single instance per guest is possible. ;-)

lxc-attach

lxc-attach starts a process inside a running container. If no command is specified, it starts the default shell. Therefore, you can get into the container with:

root #lxc-attach -n guestname

Accessing the container with sshd

A common technique to allow users direct access into a system container is to run a separate sshd inside the container. Users then connect to that sshd directly. In this way, you can treat the container just like you treat a full virtual machine where you grant external access. If you give the container a routable address, then users can reach it without using ssh tunneling.

If you set up the container with a virtual ethernet interface connected to a bridge on the host, then it can have its own Ethernet address on the LAN, and you should be able to connect directly to it without logically involving the host (the host will transparently relay all traffic destined for the container, without the need for any special considerations). You should be able to simply 'ssh <container_ip>'.

Note
The above comments of Hu and BoneKracker have been taken from the Gentoo Forums.
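For a Gentoo guest, enabling the sshd approach described above can be as simple as switching the service on inside the container (a sketch; adjust the service name and init system to your guest distribution):

root #lxc-attach -n guestname -- rc-update add sshd default
root #lxc-attach -n guestname -- rc-service sshd start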

Filesystem layout

Some of the lxc tools apparently assume that /etc/lxc/<guestname>/ exists. However, you should keep the guests' root filesystems out of /etc since it's not a path that's supposed to store large volumes of binary data.

The templates of LXC will use the following locations:

  • /etc/lxc/<guestname>/config = guest configuration file
  • /etc/lxc/<guestname>/fstab = optional guest fstab file
  • /var/lxc/<guestname>/rootfs = root filesystem image
  • /var/log/lxc/<guestname> = lxc-start logfile

Unprivileged containers

Unprivileged containers are the safest containers. Ordinary privileged LXC should be considered unsafe because, while running in a separate namespace, UID 0 in the container is still equal to UID 0 (root) outside of the container; if you somehow get access to any host resource through proc, sys or some random syscall, you can potentially escape the container and then you'll be root on the host. That's what user namespaces were designed for. Each user that's allowed to use them on the system gets assigned a range of unused UIDs and GIDs. An unprivileged LXC setup maps, for instance, user and group IDs 0 through 65,000 in the container to the IDs 100,000 through 165,000 on the host. That means that UID 0 (root) in the container maps to UID 100,000 outside the container. So, in case something goes wrong and an attacker manages to escape the container, they find themselves with no more rights than the nobody user.

The standard paths also have their unprivileged equivalents:

  • /etc/lxc/lxc.conf => ~/.config/lxc/lxc.conf
  • /etc/lxc/default.conf => ~/.config/lxc/default.conf
  • /var/lib/lxc => ~/.local/share/lxc
  • /var/lib/lxcsnaps => ~/.local/share/lxcsnaps
  • /var/cache/lxc => ~/.cache/lxc

Your user, while it can create new user namespaces in which it will be UID 0 and will have some of root's privileges against resources tied to that namespace, will obviously not be granted any extra privilege on the host. Unfortunately this also means that the following common operations aren't allowed:

  • Mounting most of filesystems.
  • Creating device nodes.
  • Any operation against a UID/GID outside of the mapped set.

This also means that your user is limited in creating new network devices on the host or changing the bridge configuration. To work around that, the LXC team wrote a tool called "lxc-user-nic", which is the only setuid binary shipped with LXC 1.0 and which performs one simple task: it parses a configuration file and, based on its content, creates network devices for the user and bridges them. To prevent abuse, you can restrict the number of devices a user can request and to what bridge they may be added by editing the /etc/lxc/lxc-usernet file.

Prerequisites

Prerequisites for well working unprivileged containers include:

  • Kernel: 3.13 + a couple of staging patches or later version
  • User namespaces enabled in the kernel (CONFIG_USER_NS=y)
  • A very recent version of shadow that supports subuid/subgid (sys-apps/shadow-4.2.1 or later)
  • Per-user cgroups on all controllers
  • LXC 1.0 or higher
  • A version of PAM with a loginuid patch (it's a dependency of recent version of shadow mentioned above, so it installs automatically with recent shadow-4.2.1)
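A quick way to confirm the kernel side of these prerequisites (assuming CONFIG_IKCONFIG_PROC is enabled so that /proc/config.gz exists) is:

root #zgrep CONFIG_USER_NS /proc/config.gz
CONFIG_USER_NS=y

The lxc-checkconfig script mentioned earlier performs the same check, along with the cgroup-related ones.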

LXC pre-built containers

Because of the limitations mentioned above, you won't be allowed to use mknod to create a block or character device in a user namespace, as being allowed to do so would let you access anything on the host. The same goes for some filesystems: you won't, for example, be allowed to do loop mounts or mount an ext partition, even if you can access the block device. Those limitations are a big problem during the initial bootstrap of a container, as tools like debootstrap, yum, … usually try to perform some of those restricted actions and will fail pretty badly.

Some templates may be tweaked to work, and workarounds such as a modified fakeroot could be used to bypass some of those limitations, but the current state is that most distribution templates (including Gentoo's) simply won't work with those. Instead you should use the "download" template, which will provide you with pre-built images of the distributions that are known to work in such an environment. This template contacts a server which hosts daily pre-built rootfs and configuration for the most common templates, instead of assembling the rootfs and configuration locally.

Those images are built on the LXC project's Jenkins server. The actual build process is pretty straightforward: a basic chroot is assembled, then the current git master is downloaded and built, and the standard templates are run with the right release and architecture; the resulting rootfs is compressed, a basic config and metadata (expiry, files to template, …) are saved, and the result is pulled by the LXC project's main server, signed with a dedicated GPG key and published on the public web server.

The client side is a simple template which contacts the server over HTTPS (the domain is also DNSSEC-enabled and available over IPv6), grabs signed indexes of all the available images, checks whether the requested combination of distribution, release and architecture is supported and, if it is, grabs the rootfs and metadata tarballs, validates their signatures and stores them in a local cache. Any container creation after that point is done using that cache, until the cache entries expire, at which point a new copy is grabbed from the server. You can also use the "--flush-cache" parameter to flush the local copy (if present).

The template has been carefully written to work on any system that has a POSIX-compliant shell with wget. gpg is recommended but can be disabled if your host doesn't have it (at your own risk). The current list of images can be requested by passing the --list parameter (click "Expand" to see the full output):

root #lxc-create -t download -n alpha -- --list
Setting up the GPG keyring
Downloading the image index

---
DIST    RELEASE ARCH    VARIANT BUILD
---
alpine  3.3     amd64   default 20171116_17:50
alpine  3.3     armhf   default 20170103_17:50
alpine  3.3     i386    default 20171116_17:50
alpine  3.4     amd64   default 20171116_17:50
alpine  3.4     armhf   default 20170111_20:27
alpine  3.4     i386    default 20171116_17:50
alpine  3.5     amd64   default 20171116_17:50
alpine  3.5     i386    default 20171116_17:50
alpine  3.6     amd64   default 20171116_18:00
alpine  3.6     i386    default 20171116_17:50
alpine  edge    amd64   default 20171116_17:50
alpine  edge    armhf   default 20170111_20:27
alpine  edge    i386    default 20171116_17:50
archlinux       current amd64   default 20171117_01:27
archlinux       current i386    default 20171116_01:27
centos  6       amd64   default 20171117_02:16
centos  6       i386    default 20171117_02:16
centos  7       amd64   default 20171117_02:16
debian  buster  amd64   default 20171116_22:42
debian  buster  arm64   default 20171116_22:42
debian  buster  armel   default 20171116_22:42
debian  buster  armhf   default 20171117_04:09
debian  buster  i386    default 20171116_22:42
debian  buster  ppc64el default 20171116_22:42
debian  buster  s390x   default 20171116_22:42
debian  jessie  amd64   default 20171116_22:42
debian  jessie  arm64   default 20171116_22:42
debian  jessie  armel   default 20171116_22:42
debian  jessie  armhf   default 20171116_22:42
debian  jessie  i386    default 20171116_22:42
debian  jessie  powerpc default 20171116_22:42
debian  jessie  ppc64el default 20171116_22:42
debian  jessie  s390x   default 20171116_22:42
debian  sid     amd64   default 20171116_22:42
debian  sid     arm64   default 20171116_22:42
debian  sid     armel   default 20171116_22:42
debian  sid     armhf   default 20171117_04:09
debian  sid     i386    default 20171116_22:42
debian  sid     powerpc default 20171116_22:42
debian  sid     ppc64el default 20171116_22:42
debian  sid     s390x   default 20171116_22:42
debian  stretch amd64   default 20171116_22:42
debian  stretch arm64   default 20171116_22:42
debian  stretch armel   default 20171116_22:42
debian  stretch armhf   default 20171116_22:42
debian  stretch i386    default 20171116_22:42
debian  stretch powerpc default 20161104_22:42
debian  stretch ppc64el default 20171116_22:42
debian  stretch s390x   default 20171116_22:42
debian  wheezy  amd64   default 20171116_22:42
debian  wheezy  armel   default 20171116_22:42
debian  wheezy  armhf   default 20171116_22:42
debian  wheezy  i386    default 20171116_22:42
debian  wheezy  powerpc default 20171116_22:42
debian  wheezy  s390x   default 20171116_22:42
fedora  24      amd64   default 20171117_01:27
fedora  24      i386    default 20171117_01:27
fedora  25      amd64   default 20171117_02:20
fedora  25      i386    default 20171117_01:27
fedora  26      amd64   default 20171117_01:27
fedora  26      i386    default 20171117_01:27
gentoo  current amd64   default 20171116_14:12
gentoo  current i386    default 20171116_14:12
opensuse        42.2    amd64   default 20171117_00:53
opensuse        42.3    amd64   default 20171117_00:53
oracle  6       amd64   default 20171117_11:40
oracle  6       i386    default 20171117_11:40
oracle  7       amd64   default 20171117_11:40
plamo   5.x     amd64   default 20171116_21:36
plamo   5.x     i386    default 20171116_21:36
plamo   6.x     amd64   default 20171116_21:36
plamo   6.x     i386    default 20171116_21:36
ubuntu  artful  amd64   default 20171117_03:49
ubuntu  artful  arm64   default 20171117_03:49
ubuntu  artful  armhf   default 20171117_03:49
ubuntu  artful  i386    default 20171117_03:49
ubuntu  artful  ppc64el default 20171117_03:49
ubuntu  artful  s390x   default 20171117_03:49
ubuntu  bionic  amd64   default 20171117_03:49
ubuntu  bionic  arm64   default 20171117_03:49
ubuntu  bionic  armhf   default 20171117_03:49
ubuntu  bionic  i386    default 20171117_03:49
ubuntu  bionic  ppc64el default 20171117_03:49
ubuntu  bionic  s390x   default 20171117_03:49
ubuntu  precise amd64   default 20171025_03:49
ubuntu  precise armel   default 20171024_03:49
ubuntu  precise armhf   default 20171024_08:01
ubuntu  precise i386    default 20171025_03:49
ubuntu  precise powerpc default 20171025_03:49
ubuntu  trusty  amd64   default 20171117_03:49
ubuntu  trusty  arm64   default 20171117_03:49
ubuntu  trusty  armhf   default 20171117_03:49
ubuntu  trusty  i386    default 20171117_03:49
ubuntu  trusty  powerpc default 20171117_03:49
ubuntu  trusty  ppc64el default 20171117_03:49
ubuntu  xenial  amd64   default 20171117_03:49
ubuntu  xenial  arm64   default 20171117_03:49
ubuntu  xenial  armhf   default 20171117_03:49
ubuntu  xenial  i386    default 20171117_03:49
ubuntu  xenial  powerpc default 20171117_03:49
ubuntu  xenial  ppc64el default 20171117_03:49
ubuntu  xenial  s390x   default 20171117_03:49
ubuntu  zesty   amd64   default 20171117_03:49
ubuntu  zesty   arm64   default 20171117_03:49
ubuntu  zesty   armhf   default 20171117_03:49
ubuntu  zesty   i386    default 20171117_03:49
ubuntu  zesty   powerpc default 20170317_03:49
ubuntu  zesty   ppc64el default 20171117_03:49
ubuntu  zesty   s390x   default 20171117_03:49
---

While the template was designed to work around the limitations of unprivileged containers, it works just as well with system containers, so even on a system that doesn't support unprivileged containers you can do:

root #lxc-create -t download -n alpha -f /etc/lxc/guest.conf -- -d ubuntu -r trusty -a amd64

And you'll get a new container running the latest build of Ubuntu 14.04 Trusty Tahr amd64.

Configuring unprivileged LXC

Install the required packages:

root #emerge shadow pambase

Create files necessary for assigning subuids and subgids:

root #touch /etc/subuid /etc/subgid

Create a new user, set its password and log in:

root #useradd -m -G users lxc
root #passwd lxc
root #su - lxc

Make sure your user has a UID and GID map defined in /etc/subuid and /etc/subgid:

root #grep lxc /etc/sub* 2>/dev/null
/etc/subgid:lxc:165537:65536
/etc/subuid:lxc:165537:65536

On Gentoo, a default allocation of 65536 UIDs and GIDs is given to every new user on the system, so you should already have one. If not, you'll have to assign a set of subuids and subgids for a user manually:

root #usermod --add-subuids 100000-165536 lxc
root #usermod --add-subgids 100000-165536 lxc
root #chmod +x /home/lxc

That last one is required because LXC needs it to access ~/.local/share/lxc/ after it switched to the mapped UIDs. If you’re using ACLs, you may instead use “u:100000:x” as a more specific ACL.

Now create /home/lxc/.config/lxc/guest.conf with the following content:

FILE /home/lxc/.config/lxc/guest.conf
lxc.net.0.type = veth
lxc.net.0.flags = up
lxc.net.0.link = br0.1
lxc.net.0.name = eth0
lxc.net.0.ipv4.address = 192.168.10.101/24
lxc.net.0.ipv4.gateway = 192.168.10.1
lxc.idmap = u 0 100000 65536
lxc.idmap = g 0 100000 65536

The last two lines mean that you have one UID map and one GID map defined for the container, which will map UIDs and GIDs 0 through 65,536 in the container to UIDs and GIDs 100,000 through 165,536 on the host. Those values should match those found in /etc/subuid and /etc/subgid; the values above are just illustrative.

And /etc/lxc/lxc-usernet with:

FILE /etc/lxc/lxc-usernet
lxc veth br0.1 2

This declares that the user "lxc" is allowed to create up to 2 veth-type devices connected to the bridge called br0.1.

Don't forget to add /usr/sbin to the PATH environment variable, either inside /etc/env.d/90lxc to take effect for all users, or inside ~/.bashrc for the current user. Otherwise lxc-* commands will not work under your user environment (this is not the case for lxc-1.1.0-r5, lxc-1.1.1 and later versions, because they install their command files under the standard /usr/bin/ path). Example:

FILE ~/.bashrc
# /etc/skel/.bashrc
...
# Put your fun stuff here.
PATH="/usr/sbin:${PATH}"

Now let’s create our first unprivileged container with:

lxc@localhost $ lxc-create -t download -n alpha -f ~/.config/lxc/guest.conf -- -d ubuntu -r trusty -a amd64

Don't forget to change the root password of the unprivileged container by running the following commands under your user:

lxc@localhost $lxc-start -n alpha -- /bin/bash
lxc@localhost $passwd
lxc@localhost $exit

Then you can login easily with your new password as usual under your user:

lxc@localhost $lxc-start -n alpha

If you get the error "Permission denied, can't create directory /sys/fs/cgroup/alpha", please see the section LXC#Create_user_namespace_manually_.28no_systemd.29.
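To confirm that the UID mapping is actually in effect, look at the container's processes from the host; with the illustrative values above, the container's root processes show up under UID 100000 (a quick sanity check, not a required step):

user $ps -o user,pid,cmd -u 100000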

P.S. To be completed: a "Creating cgroups" section has to be added, with or without cgmanager, for OpenRC and systemd accordingly (see the "Creating cgroups" paragraph there as an example for the moment).

Create user namespace manually (no systemd)

OpenRC configuration pre-check

For systems booted by OpenRC, check that OpenRC mounts cgroups v1.

Open /etc/rc.conf and check this line:

FILE /etc/rc.conf
...
rc_cgroup_mode="hybrid"
...

rc_cgroup_mode must be hybrid or legacy, as the lxc scripts were developed to work with cgroups v1. By default (with the rc_cgroup_mode line commented out) it is "hybrid".

Namespace create script

On systems without systemd, an external script should create the user's cgroups manually. In our case, we should create all required directories for the lxc user, give it permission on them, and move the currently active bash shell into those cgroups.

Add the following to the lxc user's /home/lxc/.bashrc:

FILE ~/.bashrc
(The shell snippet originally shown here was lost during the wiki conversion; a sketch of what it did follows below.)
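The following is a minimal sketch of what that snippet did, reconstructed to match the sudo command aliases defined in the next section (the exact loop and file names are assumptions; adapt them to the controllers present on your system):

# Create an 'lxc' cgroup in every mounted v1 hierarchy, hand it over to the
# lxc user and move the current shell into it.
for d in /sys/fs/cgroup/*/; do
    if [ ! -d "${d}lxc" ]; then
        sudo mkdir -p "${d}lxc"
        sudo chown -R lxc "${d}lxc"
    fi
    # The cpuset hierarchy needs cpuset.cpus and cpuset.mems populated
    # (here copied from the parent) before any task may join the group.
    if [ -f "${d}lxc/cpuset.cpus" ] && [ ! -s "${d}lxc/cpuset.cpus" ]; then
        cat "${d}cpuset.cpus" > "${d}lxc/cpuset.cpus"
        cat "${d}cpuset.mems" > "${d}lxc/cpuset.mems"
    fi
    echo $$ | sudo tee -a "${d}lxc/cgroup.procs" > /dev/null
done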

Now, on every login of the lxc user, bash will create all required cgroups, grant the permissions and move the current user's bash into those cgroups.

Grant permissions for sudo

The last step is to grant permission to run mkdir, tee and chown via sudo so that all required subdirectories can be created.

Add the following permissions for the lxc user to sudoers:

FILE /etc/sudoers
# ... Other permissions here
Cmnd_Alias CGRP_MKDIR = /bin/mkdir -p /sys/fs/cgroup/*/lxc
Cmnd_Alias CG_TEE = /usr/bin/tee -a /sys/fs/cgroup/*
Cmnd_Alias CG_CHOWN = /bin/chown -R lxc /sys/fs/cgroup/*/lxc

lxc ALL=(ALL) NOPASSWD: CGRP_MKDIR, CG_TEE, CG_CHOWN

Validate configuration

After re-logging in as the lxc user, the shell should be running inside the lxc cgroups. Let's recheck it:

lxc@localhost $cat /proc/self/cgroup

Should be:

FILE /proc/self/cgroup
12:pids:/lxc
11:hugetlb:/lxc
10:net_prio:/lxc
9:net_cls:/lxc
8:freezer:/lxc
7:devices:/lxc
6:memory:/lxc
5:blkio:/lxc
4:cpuacct:/lxc
3:cpu:/lxc
2:cpuset:/lxc
1:name=openrc:/lxc

Create container example

Now we can execute any lxc-* command as the lxc user without permission problems. For example:

lxc@localhost $lxc-create -t download -n my_ubuntu_container -f ~/.config/lxc/guest.conf -- -d ubuntu -r xenial -a amd64
lxc@localhost $lxc-start --name my_ubuntu_container

Troubleshooting

newuidmap error

The packages sys-auth/pambase-20150213 and sys-apps/shadow-4.4-r2 provide the newuidmap and newgidmap commands without the permissions required for lxc. As a result, all lxc-* commands return an error like:

newuidmap: write to uid_map failed: Operation not permitted

For example:

lxc@localhost $lxc-create -t download -n alpha -f ~/.config/lxc/guest.conf -- -d ubuntu -r trusty -a amd64
newuidmap: write to uid_map failed: Operation not permitted
error mapping child
setgid: Invalid argument

To fix this issue, set the setuid bit on these commands:

root #chmod 4755 /usr/bin/newuidmap
root #chmod 4755 /usr/bin/newgidmap

For more details regarding bug, see this issue: [2]

Could not set clone_children to 1 for cpuset hierarchy in parent cgroup

As of December 2017, the unified cgroup hierarchy recently introduced in systemd and OpenRC is messing things up in the realm of unprivileged containers. It's tricky to see how other distros solve the problem: Arch doesn't support unprivileged containers, Ubuntu Xenial is on systemd 229 (unified cgroups became the default with 233), and Debian Stretch is on 232.

I began looking at non-LTS Ubuntu releases but I haven't figured out what they're doing yet. It involves using the new cgfsng driver which, to my knowledge, has never been made to work on Gentoo. Unprivileged containers have always worked with cgmanager, which is now deprecated.

To work around that on systemd, you'll have to manually set user namespaces by following "Create user namespace manually (no systemd)" instructions above, ignoring the "no systemd" part. After that, you should be good to go.

On the OpenRC side, you also have to disable unified cgroups. You do that by editing /etc/rc.conf and setting rc_cgroup_mode="legacy"

That will bring you pretty far, because your unprivileged container will boot, but if you boot a systemd guest it won't be happy about not being in its cosy systemd world. You will have to manually create a systemd cgroup for it; this write-up about LXC under Alpine helps a lot.
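A minimal sketch of what that amounts to, reusing the name=systemd mount shown earlier in this article and the unprivileged lxc user from the previous sections (an assumption; adapt the names and ownership to your setup):

root #mkdir -p /sys/fs/cgroup/systemd
root #mount -t cgroup -o none,name=systemd systemd /sys/fs/cgroup/systemd
root #mkdir -p /sys/fs/cgroup/systemd/lxc
root #chown -R lxc /sys/fs/cgroup/systemd/lxc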

See also

External resources

References