Introduction to Docker

This article was first published on my GitHub page. All updates and feedback will be on the blog. Please credit the source when reprinting.

Docker is an open-source engine that automates the deployment of any application as a lightweight, portable, self-sufficient container that will run virtually anywhere.

Docker is an advanced container engine based on LXC, open-sourced by the PaaS provider dotCloud. The source code is hosted on GitHub, is written in Go, and is released under the Apache 2.0 license.
Docker has been very popular recently: the project is highly active on GitHub, Red Hat has integrated Docker support into RHEL 6.5, and Google Compute Engine also supports running Docker. Recently Baidu has started using Docker as the basis of its PaaS as well (at what scale, I do not know).

Whether an open-source project succeeds commercially depends largely on three things: successful user cases, an active community, and a good story. dotCloud's own PaaS product is built on Docker, has been maintained over the long term, and has a large number of users, and the community is very active. So let us look at Docker's story.

container vs VM

Faced with these problems, Docker's idea is to deliver a runtime environment the way cargo is shipped: the OS is the freighter, and each piece of software on top of the OS is a container. Users can freely assemble a runtime environment by standardized means, and the contents of each container can be defined by the user or prepared by specialists. Delivering software then becomes delivering a series of standardized components. As with Lego blocks, users only need to choose the right combination of blocks and put their own name on the topmost one (the last standard component is the user's app). This is the prototype of Docker-based PaaS products.

What Docker Can Do

The Docker website lists typical scenarios for Docker:

Because it is lightweight virtualization based on LXC, Docker's most obvious advantages over KVM are fast startup and a small resource footprint. It is therefore well suited to building isolated, standardized runtime environments, lightweight PaaS (such as dokku), automated testing and continuous integration environments, and any application that scales horizontally (especially web applications that must start and stop quickly to follow peaks and valleys in load).

  1. Building a standardized runtime environment. Existing approaches either run a set of puppet/chef scripts on a base OS or ship an image file. The drawback of the former is that it places many preconditions on the base OS. The latter can hardly be modified (with a copy-on-write file format, the rootfs is read-only at runtime), and it also suffers from large file sizes, while environment management and version control are problems in themselves.

  2. The PaaS environment is self-evident: Docker was designed for it from the beginning, and the dotCloud case is a PaaS product built on this environment.

  3. Thanks to its standardized build method (buildfile) and its REST API, automated testing and continuous integration/deployment can be integrated well.

  4. Because LXC is lightweight and starts quickly, and because Docker only needs to load the part of each container that has changed, it uses few resources. In a single-machine environment it is faster and less resource intensive than virtualization solutions such as KVM.

What Docker Can NOT Do

Docker is not omnipotent, and it was not designed as a replacement for virtualization methods such as KVM. I briefly summarize a few limitations:

  1. Docker is based on 64-bit Linux and cannot be used on Windows/Unix or 32-bit Linux (although 64-bit is now very common)
  2. LXC is based on Linux kernel features such as cgroups, so the guest system in a container can only be Linux-based
  3. Isolation is weaker than in virtualization solutions such as KVM, because all containers share part of the runtime
  4. Network management is relatively simple, mainly based on namespace isolation
  5. The CPU functions offered by the cgroup cpu and cpuset subsystems are hard to meter compared with virtualization solutions such as KVM (which is why dotCloud charges mainly by memory)
  6. Docker's disk management is limited
  7. A container is destroyed when the user process stops, so collecting logs and other user data from the container is inconvenient

Given 1-2, applications that need a Windows base can basically be ruled out. Items 3-5 depend on the user's needs, that is, on whether a container or a VM is what is required, and they also mean Docker is not feasible as IaaS.
Items 6 and 7 are functions Docker itself does not support, but they can be solved by other means (disk quota, mount --bind). In short, choosing between containers and VMs is a trade-off between isolation and resource reuse.

In addition, even though Docker 0.7 supports non-AUFS file systems, that support is not yet stable and may cause problems in commercial use, and a stable AUFS needs kernel 3.8. So if you want to replicate dotCloud's success, you may need to consider upgrading the kernel or switching to the server edition of Ubuntu (the latter provides deb updates). I think this is why the open-source community tends to favor Ubuntu (because of the kernel version).

Docker Usage

Because of limited space, this part is not translated here; see the links.

Docker Build File

Because of limited space, this part is not translated here; see the links.

Docker's Trick

What Docker Needs

The core problem Docker solves is using LXC to provide VM-like functionality, so that less hardware is spent on overhead and users get more computing resources. Unlike a VM, LXC is not a hardware virtualization method: it is neither full virtualization nor paravirtualization, but an operating-system-level virtualization method, and it may not be as intuitive to understand as a VM. So we start from the problems that virtualization has to solve and look at how Docker meets users' virtualization requirements.

Users who need virtualization, especially hardware virtualization, mainly want it to solve the following 4 problems:

Linux Namespace (ns)

LXC's isolation comes mainly from kernel namespaces: the pid, net, ipc, mnt, and uts namespaces isolate the container's processes, network, messages (IPC), file system, and hostname respectively.
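To see which namespaces a process belongs to, the kernel exposes one entry per namespace type under /proc/<pid>/ns. The following is a minimal sketch, assuming a reasonably recent Linux kernel on which those entries are symlinks of the form "pid:[4026531836]". The function names (parse_ns_link, list_namespaces, same_namespace) are my own and are not part of Docker or LXC. Two processes share a namespace when the numbers match.

```python
import os
import re

# Assumed format of a /proc/<pid>/ns/* symlink target, e.g. "pid:[4026531836]".
_NS_LINK = re.compile(r"^(?P<kind>[a-z_]+):\[(?P<inode>\d+)\]$")


def parse_ns_link(target):
    """Split a namespace link target into (kind, inode number)."""
    match = _NS_LINK.match(target)
    if match is None:
        raise ValueError("not a namespace link: %r" % target)
    return match.group("kind"), int(match.group("inode"))


def list_namespaces(pid="self", proc_root="/proc"):
    """Return {entry name: inode}; empty if /proc/<pid>/ns is not available."""
    ns_dir = os.path.join(proc_root, str(pid), "ns")
    if not os.path.isdir(ns_dir):
        return {}
    result = {}
    for name in sorted(os.listdir(ns_dir)):
        try:
            _kind, inode = parse_ns_link(os.readlink(os.path.join(ns_dir, name)))
        except (OSError, ValueError):
            continue  # unreadable or unexpected entry: skip it
        result[name] = inode
    return result


def same_namespace(pid_a, pid_b, name):
    """True if both processes have the entry and it points to the same namespace."""
    a = list_namespaces(pid_a).get(name)
    b = list_namespaces(pid_b).get(name)
    return a is not None and a == b
```

On a Linux host, list_namespaces() for the current process typically contains entries such as pid, net, ipc, mnt, and uts; reading the entries of another user's process usually needs root.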

pid namespace

As mentioned before, user processes are child processes of the lxc-start process. Processes of different users are separated by pid namespaces, and different namespaces can contain the same PID. Pid namespaces have the following characteristics:

  1. Each pid namespace has its own pid=1 process (similar to the /sbin/init process)
  2. A process in a namespace can only affect processes in its own namespace or in its child namespaces
  3. Because /proc lists the running processes, the /proc pseudo-filesystem inside a container shows only the processes of the container's own namespace
  4. Because namespaces can be nested, a parent namespace can affect processes in a child namespace, so the processes of a child namespace are visible in the parent namespace, but with different PIDs

Because of these characteristics, the parent of all LXC processes in Docker is the docker process, and each LXC process has its own namespace. And because nesting is allowed, LXC in LXC is easy to implement.

net namespace

With pid namespaces, the PIDs in each namespace are isolated from each other, but network ports are still shared with the host. Network isolation is implemented by net namespaces: each net namespace has its own network devices, IP addresses, IP routing tables, and /proc/net directory, so the network of each container is isolated.
LXC offers 5 network types on top of this. By default, Docker uses veth to connect the virtual NIC in the container to a docker bridge on the host.

ipc namespace

Processes in a container still interact using the usual Linux inter-process communication (IPC) methods, including semaphores, message queues, and shared memory. Unlike in a VM, however, IPC between container processes is really IPC between processes on the host that are in the same pid namespace, so namespace information has to be added when IPC resources are requested: each IPC resource has a unique 32-bit ID.

mnt namespace

Similar to chroot, the mnt namespace confines a process to a specific directory. It lets processes in different namespaces see different file structures, so the file directories seen by the processes of each namespace are isolated from each other. Unlike chroot, the /proc/mounts information of a container in each namespace contains only the mount points of that namespace.

uts namespace

The UTS ("UNIX Time-sharing System") namespace allows each container to have its own hostname and domain name, so that on the network it can be seen as an independent node rather than as a process on the host.

user namespace

Each container can have its own user and group IDs. In other words, programs inside the container run as users internal to the container rather than as users on the host.

With the 6 namespaces above isolating processes, network, IPC, file system, UTS, and users, a container presents itself as a separate computer, and different containers are isolated at the OS level.
However, different namespaces still compete for resources, so something like ulimit is needed to manage the resources each container may use. LXC uses cgroups for this.




Control Groups (cgroups)

Cgroups implement resource quotas and metering. They are very easy to use and expose a file-like interface: creating a folder under the /cgroup directory creates a new group, and writing a PID into the tasks file in that folder puts that process under the group's resource control. Specific resource options are configured through files in the folder named {subsystem prefix}.{resource item}. For example, memory.limit_in_bytes sets the memory limit of a group in the memory subsystem, while memory.usage_in_bytes reports its current usage.
In addition, cgroup subsystems can be combined freely: a subsystem can be attached to different groups, and a group can contain several subsystems.
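The file-based workflow above can be sketched in a few lines. This is only an illustration under assumptions: the helper names are mine, cgroup_root stands for wherever a cgroup v1 hierarchy is mounted (the text uses /cgroup), and on a real hierarchy the writes need root. So that it can be tried safely, demo() runs the same steps against an ordinary temporary directory, where the files are plain files and nothing is actually limited.

```python
import os
import tempfile


def create_group(cgroup_root, name):
    """Create a group: in the cgroup filesystem this is just a mkdir."""
    group_dir = os.path.join(cgroup_root, name)
    os.makedirs(group_dir, exist_ok=True)
    return group_dir


def set_option(group_dir, option, value):
    """Write one '{subsystem prefix}.{resource item}' file, e.g. cpu.shares."""
    with open(os.path.join(group_dir, option), "w") as handle:
        handle.write("%s\n" % value)


def add_task(group_dir, pid):
    """Put a process under the group's control by writing its PID to 'tasks'."""
    with open(os.path.join(group_dir, "tasks"), "a") as handle:
        handle.write("%d\n" % pid)


def read_option(group_dir, option):
    """Read a control file back as a stripped string."""
    with open(os.path.join(group_dir, option)) as handle:
        return handle.read().strip()


def demo():
    """Run the steps against a temporary directory instead of a real hierarchy."""
    with tempfile.TemporaryDirectory() as fake_root:
        group = create_group(fake_root, "demo")
        set_option(group, "cpu.shares", 512)
        add_task(group, 1234)   # hypothetical PIDs
        add_task(group, 5678)
        return read_option(group, "cpu.shares"), read_option(group, "tasks").split()
```

On a real hierarchy the kernel creates the control files itself when the directory is made, and reading tasks back lists the PIDs that are currently members of the group.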

Definitions of terms:

A *cgroup* associates a set of tasks with a set of parameters for one
or more subsystems.

A *subsystem* is a module that makes use of the task grouping
facilities provided by cgroups to treat groups of tasks in
particular ways. A subsystem is typically a "resource controller" that
schedules a resource or applies per-cgroup limits, but it may be
anything that wants to act on a group of processes, e.g. a
virtualization subsystem.

What we mainly care about is which resources cgroups can restrict, that is, which subsystems exist.

cpu : With cgroups you cannot define CPU capacity the way hardware virtualization solutions can, but you can define CPU scheduling priority, so a process with a higher CPU priority is more likely to get CPU time.
Writing the cpu.shares parameter sets the CPU priority of a cgroup. The value is a relative weight, not an absolute value. The cpu subsystem has other configuration items as well, which are described in detail in the manual.
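Because cpu.shares is a relative weight, what a group receives depends on the other groups competing with it. A small worked example, assuming every group listed is busy at the same time (the group names and values are made up for illustration):

```python
from fractions import Fraction


def cpu_fractions(shares):
    """Map each group to its fraction of CPU time when all groups are busy."""
    total = sum(shares.values())
    if total <= 0:
        raise ValueError("at least one group needs a positive share")
    return {name: Fraction(value, total) for name, value in shares.items()}


# Two competing groups: 1024 vs 512 gives a 2:1 split.
two = cpu_fractions({"web": 1024, "batch": 512})

# Adding a third group changes everyone's fraction although no weight changed.
three = cpu_fractions({"web": 1024, "batch": 512, "cron": 512})
```

Here two["web"] is 2/3, while three["web"] drops to 1/2. The weights only matter under contention: a group that is idle does not reserve its fraction.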

cpusets : Cpusets define which CPUs a group may use, or which CPUs are available to which group. In some scenarios, binding to a single CPU prevents cache switching between cores and thus improves efficiency.
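The CPUs of a group are written as a list of numbers and ranges, such as "0-2,4" (in cgroup v1 this goes into the cpuset.cpus file). As a small sketch, the helper below, whose name is my own, expands such a string into explicit CPU numbers:

```python
def parse_cpu_list(text):
    """Expand a CPU list such as "0-2,4" into [0, 1, 2, 4]."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue  # tolerate an empty string or a stray comma
        if "-" in part:
            first, last = (int(x) for x in part.split("-", 1))
            if first > last:
                raise ValueError("bad range: %r" % part)
            cpus.update(range(first, last + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)
```

Binding a group to a single CPU, as mentioned above, corresponds to a list with one entry, for example "3".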

memory : Memory-related limits.

blkio : Block IO statistics and limits: byte/operation statistics and limits (IOPS and so on), read/write rate limits, and so on. The statistics here mainly cover synchronous IO.

net_cls, cpuacct, devices, freezer : Other related management.


LinuX Containers(LXC)

Building on the isolation provided by namespaces and the limits provided by cgroups, LXC offers a unified set of APIs and tools for creating and managing containers. LXC uses the following kernel features:

LXC hides the details of the kernel interfaces from users and provides the following components, which greatly simplify development and use:

LXC aims to provide a virtual environment that shares the kernel with the host OS: the kernel does not have to be loaded again at startup, and containers share the kernel with the host, so container startup is much faster and memory consumption is significantly lower. In actual tests, the IO and CPU performance of LXC-based virtualization is close to bare metal (see reference 3), and most figures show an advantage over Xen. For KVM, which isolates through the kernel, the performance advantage may be less obvious, and the main differences are memory consumption and startup time. Reference 4 reports that in an iozone disk IO throughput test KVM was actually faster than LXC, and I reached the same conclusion when reproducing the experiment with the device mapper driver. Reference 5 compares KVM and LXC in a virtual-router scenario (as I understand it, from the network IO and CPU angle) and concludes that KVM balances performance and isolation better than LXC: KVM's throughput is slightly worse than LXC's, but its CPU isolation is more manageable.

Comparisons of CPU, disk IO, network IO, and memory between KVM and LXC still need more experiments before convincing conclusions can be drawn.




3 (test)

4 (compared with the KVM IO)



Docker's handling of containers builds on LXC, but LXC's problem is that it is hard to move: it is difficult to describe a container through a standard template, or to rebuild, copy, and move one.
In VM-based virtualization, images and snapshots provide replication, rebuilding, and moving of VMs. These functions are indispensable for fast, large-scale deployment and updates through containers.
Docker uses AUFS to update containers quickly. Docker 0.7 introduced storage drivers, supporting AUFS, VFS, and device mapper, and opening the possibility of BTRFS and ZFS. But since nothing other than AUFS has been proven in dotCloud's production use, we still look at this from the AUFS angle.

AUFS (AnotherUnionFS) is a union file system. Put simply, it supports mounting different directories into the same virtual file system (unite several directories into a single virtual filesystem). Further, AUFS supports setting 'readonly', 'readwrite', and 'whiteout-able' permissions for each member directory (AKA branch). AUFS also has a layer-like concept: a branch with readonly permission can be modified logically (incrementally, without affecting the readonly part). A union FS usually has two uses. One is to mount multiple disks onto a single directory without LVM or RAID. The other, more common, is to combine a readonly branch with a writeable branch; Live CDs are built this way, letting users make changes on top of an unchanged OS image. Docker builds its container images on AUFS in the same way. Next we use starting Linux in a container as an example of how Docker applies AUFS.

A typical Linux system needs two file systems to boot and run: bootfs + rootfs (seen from the functional angle rather than the file system angle).


bootfs (boot file system) mainly contains the bootloader and the kernel. The bootloader loads the kernel; once boot succeeds, the kernel is in memory and bootfs is unmounted.
rootfs (root file system) contains the standard directories and files of a typical Linux system, such as /dev, /proc, /bin, and /etc.

For different Linux distributions, bootfs is basically the same while rootfs differs, so different distributions can share the same bootfs.


After a typical Linux system starts, rootfs is first mounted readonly; after a series of checks it is switched to readwrite for the user. In Docker, rootfs is likewise first mounted readonly and checked, but then a union mount places a readwrite file system on top of the readonly rootfs. The lower file system can in turn be set readonly and another layer stacked on top of it, so that a set of readonly layers plus one writeable layer forms the runtime directory of a container. Each of these is called a layer.


Thanks to AUFS, every change to a file or directory in a readonly layer exists only in the writeable layer above it. Since there is no contention, multiple containers can share the readonly layers.
Docker therefore calls the readonly layers an "image". To the container the whole rootfs appears read-write, but in fact all modifications are written to the topmost writeable layer.
An image does not store user state, so it can be used for templating, rebuilding, and replication.
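The layering behaviour described above can be modelled with plain dictionaries. This is a toy model, not AUFS itself: the class UnionView and the WHITEOUT marker are names I made up, layers are dicts from path to content, and a whiteout is a marker in the writeable layer that hides a file from the layers below.

```python
WHITEOUT = object()  # marker meaning "deleted in this layer"


class UnionView:
    """One writeable layer stacked on shared readonly layers (lowest first)."""

    def __init__(self, readonly_layers):
        self.readonly = list(readonly_layers)
        self.writable = {}

    def read(self, path):
        # Search from the top layer down; the first layer that knows the path wins.
        for layer in [self.writable] + self.readonly[::-1]:
            if path in layer:
                if layer[path] is WHITEOUT:
                    raise FileNotFoundError(path)
                return layer[path]
        raise FileNotFoundError(path)

    def write(self, path, data):
        # All modifications land in the top writeable layer only.
        self.writable[path] = data

    def delete(self, path):
        self.read(path)                  # raises if the file is not visible
        self.writable[path] = WHITEOUT   # hide it without touching lower layers

    def paths(self):
        visible = {}
        for layer in self.readonly + [self.writable]:
            visible.update(layer)
        return sorted(p for p, v in visible.items() if v is not WHITEOUT)


# Two "containers" sharing one readonly "image" (hypothetical contents).
image = {"/etc/hostname": "base", "/bin/sh": "shell"}
first, second = UnionView([image]), UnionView([image])
first.write("/etc/hostname", "first")
first.delete("/bin/sh")
```

After these calls, first sees its own hostname and no /bin/sh, while second and the image dict itself are unchanged, which is why the readonly layers can be shared.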


An upper image depends on the image below it, so Docker calls the lower image the parent image. An image without a parent is called a base image.


So to start a container from an image, Docker first loads its parent images all the way down to the base image, and the user's process runs in the writeable layer. All the data in the parent images, together with the container-specific configuration such as the ID, the network, and the resource limits managed by LXC, make up a container in Docker's sense.
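Walking from an image down to its base image is a simple parent-link traversal. A minimal sketch, assuming the parent relation is available as a dict in which a base image maps to None (the function name and the image names are hypothetical):

```python
def layer_chain(image_id, parents):
    """Return image IDs ordered from the base image up to image_id."""
    chain = []
    seen = set()
    current = image_id
    while current is not None:
        if current in seen:
            raise ValueError("cycle in parent links at %r" % current)
        if current not in parents:
            raise KeyError("unknown image: %r" % current)
        seen.add(current)
        chain.append(current)
        current = parents[current]   # None marks a base image
    chain.reverse()                  # lowest layer first, ready to be stacked
    return chain


parents = {
    "ubuntu": None,              # base image: no parent
    "ubuntu+nginx": "ubuntu",
    "myapp": "ubuntu+nginx",
}
stack = layer_chain("myapp", parents)
```

The result lists the readonly layers in mount order, with the container's writeable layer going on top of the last one.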


Therefore, using AUFS as the container file system gives Docker the following advantages:

  1. Saves storage space: multiple containers can share the storage of the base image

  2. Rapid deployment: when deploying multiple containers, the base image does not have to be copied multiple times

  3. Saves memory: because multiple containers share the base image and the OS has a disk cache, the chance that processes in different containers hit cached content rises significantly

  4. Easier upgrades: compared with copy-on-write file systems, the base image can also be mounted writeable, so containers can be upgraded by updating the base image

  5. Files in a directory can be modified without changing the base image: all writes happen in the topmost writeable layer, which significantly increases the amount of content the base image can share

Of the 5 points above, 1-3 can also be achieved with a copy-on-write FS and 4 with other union mount methods; only AUFS does 5 well. This is why Docker was built on top of AUFS from the start.

Because AUFS will not be merged into the Linux mainline (according to Christoph Hellwig, linux rejects all union-type filesystems but UnionMount) and also requires kernel version 3.0 or above (Docker recommends 3.8 or above), Red Hat engineers implemented the driver mechanism in Docker 0.7. AUFS is now just one of the drivers. On RHEL, the container file system is implemented with device mapper, which is introduced below.













Grsec is a set of Linux kernel security patches used to protect the host against illegal intrusion. Since it is not part of Docker, we only introduce it briefly.
Grsec protects processes against illegal intrusion mainly in 4 ways:

Security is always relative. These methods merely tell us the angles from which container-type security issues can be considered.




What Docker Does Beyond LXC

It may seem that Docker's main OS-level virtualization work is done by LXC and that AUFS is just icing on the cake. One may then wonder what Docker does beyond LXC. I happened to find that someone had asked exactly this question on Stack Overflow, and the answer came from the founder of dotCloud. It is quoted verbatim here for reference.

On top of this low-level foundation of kernel features, Docker offers a high-level tool with several powerful functionalities:

What we can do with Docker

With such a powerful tool as Docker, more people want to know what can be done with it.


Acting as a sandbox is probably the most basic idea of a container: a lightweight isolation mechanism, fast rebuilding and destruction, and a small resource footprint. Using Docker to simulate a distributed deployment and debug it in the developer's single-machine environment is fast and convenient.
Docker's version control and image mechanism, together with remote image management, also make it possible to build a Git-like distributed development environment. Packer, which builds images for multiple platforms, and vagrant, by the same author, are already attempts in this direction. I will introduce these two compact tools from the same geek in a later blog post.


dotCloud, Heroku, and Cloud Foundry all try to use containers to isolate the runtime and services provided to users: dotCloud uses Docker, Heroku uses LXC, and Cloud Foundry uses warden, which it developed itself on top of cgroups. Providing PaaS to users on a lightweight isolation mechanism is a common practice. What PaaS offers the user is not an OS but a runtime plus services, so OS-level isolation is enough to hide the details from users. Many Docker articles mention "a PaaS cloud that can run any application". This is from the image point of view: the user's app is packaged by building an image, and standard services are reused as service images, rather than going through buildpacks.

Given my understanding of Cloud Foundry and Docker, let me share my view of PaaS. A PaaS platform has long been seen as a set of multi-language runtimes plus a set of commonly used middleware, and providing these two things is considered enough to meet the need. However, PaaS places high requirements on the applications deployed on it:

Applications already deployed on an older platform can hardly be migrated to a PaaS, and in-depth work such as parameter tuning is hard to carry into new applications. In my understanding, PaaS is still best suited to rapid prototyping and short-lived application experiments.

Docker, however, achieves control and management of the user's runtime environment from another angle (similar to IaaS + orchestration tools), and does so on a lightweight LXC-based system. It is indeed a great attempt.
I also believe that IaaS + flexible orchestration tools (going down to app-level management, such as bosh) is the best way to deliver a user's environment.

Open Solution

Two problems were mentioned above: disk/network limits are inconvenient, and AUFS is not supported on older kernel versions (RHEL's 2.6.32). This section tries to give answers.

disk/network quota

Although cgroups provide limiting mechanisms such as IOPS, the ability to limit the disk size and network bandwidth a user can use is very limited.

There are currently two ideas for disk/network quota:




RHEL 6.5

Here is a brief introduction to the idea behind the device mapper driver; the discussion in reference 2 is very valuable.
The Docker driver uses the snapshot mechanism. The initial FS is an empty ext4 file system, and each layer is then written on top of it. Each time an image is created, a snapshot of its parent image/base image is actually taken, and subsequent operations on the snapshot are recorded in the FS metadata and the AUFS layer (I have not read the code, so I do not fully understand this). docker commit then applies the diff information to the parent image once more, so that the created image can be preserved independently of the current container's runtime environment.

I have only read the material here and my understanding is not thorough; the details still require digging into the code. A fragment of the mailing list post follows. If you understand it better, please do not hesitate to correct me.

The way it works is that we set up a device-mapper thin provisioning pool with a single base device containing an empty ext4 filesystem. Then each time we create an image we take a snapshot of the parent image (or the base image) and manually apply the AUFS layer to this. Similarly we create snapshots of images when we create containers and mount these as the container filesystem.

"docker diff" is implemented by just scanning the container filesystem and the parent image filesystem, looking at the metadata for changes. Theoretically this can be fooled if you do in-place editing of a file (not changing the size) and reset the mtime/ctime, but in practice I think this will be good enough. 

"docker commit" uses the above diff command to get a list of changed files which are used to construct a tarball with files and AUFS whiteouts (for deletes). This means you can commit containers to images, run new containers based on the image, etc. You should be able to push them to the index too (although I've not tested this yet).

Docker looks for a "docker-pool" device-mapper device (i.e. /dev/mapper/docker-pool) when it starts up, but if none exists it automatically creates two sparse files (100GB for the data and 2GB for the metadata) and loopback mount these and sets these up as the block devices for docker-pool, with a 10GB ext4 fs as the base image. 

This means that there is no need for manual setup of block devices, and that generally there should be no need to pre-allocate large amounts of space (the sparse files are small, and we things up so that discards are passed through all the way back to the sparse loopbacks, so deletes in a container should fully reclaim space.
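The metadata-based "docker diff" described in the quoted message can be illustrated with a small directory comparison. This is my own sketch of the idea, not the driver's code: snapshot, metadata_diff, and the A/C/D labels are names chosen here, and only file size and mtime are compared. As the message notes, an in-place edit that keeps the size and resets the mtime would go unnoticed.

```python
import os
import tempfile


def snapshot(root):
    """Map each relative file path under root to its (size, mtime_ns)."""
    meta = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            full = os.path.join(dirpath, filename)
            info = os.lstat(full)
            meta[os.path.relpath(full, root)] = (info.st_size, info.st_mtime_ns)
    return meta


def metadata_diff(parent_root, container_root):
    """Return (kind, path) pairs sorted by path: A added, C changed, D deleted."""
    old, new = snapshot(parent_root), snapshot(container_root)
    changes = []
    for path in sorted(set(old) | set(new)):
        if path not in old:
            changes.append(("A", path))
        elif path not in new:
            changes.append(("D", path))
        elif old[path] != new[path]:
            changes.append(("C", path))
    return changes


def _put(root, name, data, mtime=1000000000):
    """Write a file and pin its timestamps so unchanged files compare equal."""
    path = os.path.join(root, name)
    with open(path, "w") as handle:
        handle.write(data)
    os.utime(path, (mtime, mtime))


def demo():
    """Compare two temporary trees standing in for parent image and container."""
    with tempfile.TemporaryDirectory() as parent, \
            tempfile.TemporaryDirectory() as container:
        _put(parent, "keep.txt", "same")
        _put(parent, "edit.txt", "old")
        _put(parent, "gone.txt", "x")
        _put(container, "keep.txt", "same")
        _put(container, "edit.txt", "newer")   # size differs from the parent
        _put(container, "new.txt", "y")
        return metadata_diff(parent, container)
```

Calling demo() reports edit.txt as changed, gone.txt as deleted, and new.txt as added, while keep.txt is left out. A commit step would then pack the added and changed files, plus whiteouts for the deleted ones, into a tarball.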

A known problem at present is that the block file is not deleted when an image is removed; see HTTPS://,
When I ran into this problem, the author had already given the cause 4 hours earlier: it appears to be a kernel issue, and the discussion contains a workaround.





This article covers the following:

  1. An introduction to Docker, including its origin and application scenarios

  2. The technologies behind Docker: namespaces, cgroups, LXC, AUFS, etc.

  3. What innovations Docker provides on top of LXC

  4. The author's understanding of Docker, containers, and PaaS

  5. Docker's existing problems and their solutions

I hope this helps those who want to learn about Docker. For a more detailed understanding, you still have to dig into the code to see why things are the way they are.

docker@github -

docker_maillist -!forum/docker-dev

Posted by Miranda at April 06, 2014 - 4:28 AM