How Kubernetes works

Reference: Revisiting Kubernetes Pod internals from Container basics

What is a Container?

  • A Container is a Process that runs in an isolated environment

Key principles of how containers implement an isolated environment

1. Root directory isolation (chroot)

  • chroot

    $ chroot <NEWROOT> <COMMAND>
    • Runs <COMMAND> with <NEWROOT> as its root directory, so every path lookup made by that process starts from <NEWROOT>
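The effect of chroot can be sketched with a throwaway root directory. This is only a minimal sketch: it assumes a statically linked busybox is installed and that the function is called as root. The function name chroot_demo and the temporary directory are illustrative and do not come from the referenced article.

```shell
# Sketch: list "/" from inside a new root that contains nothing but busybox.
# Assumptions: static busybox available, caller is root (chroot needs CAP_SYS_CHROOT).
chroot_demo() {
    newroot=$(mktemp -d) || return 1
    mkdir -p "$newroot/bin"
    cp "$(command -v busybox)" "$newroot/bin/busybox" || return 1
    # Inside the new root, "/" is $newroot, so only /bin/busybox is visible.
    chroot "$newroot" /bin/busybox ls -R /
    rm -rf "$newroot"
}
# chroot_demo   # uncomment to run (as root)
```

Run from a root shell, the listing should show only the bin directory that was copied in, because the host's file tree is not reachable by path from inside the new root.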

2. Linux namespaces

  • A Linux kernel feature for isolating system resources between processes

    # List the namespaces that the process <pid> belongs to
    $ lsns -p <pid>
  • unshare

    • A command that can run a process with specific namespaces isolated

    • ex)

      # Run /bin/bash process with mount namespace isolated (-m)
      $ unshare -m /bin/bash
      
      # Run /bin/bash process with mount namespace(-m) and ipc namespace(-i) isolated
      $ unshare -m -i /bin/bash
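Where lsns is not installed, the same information can be read from /proc. The helper below is a small sketch (the name ns_ids is made up for illustration) that prints the namespace identifiers of a process. Two processes share a namespace when the identifiers match.

```shell
# Print the namespace identifiers of a process (default: the current shell).
ns_ids() {
    pid=${1:-$$}
    for ns in /proc/"$pid"/ns/*; do
        # Each entry is a symlink whose target looks like "mnt:[4026531841]".
        printf '%s -> %s\n' "${ns##*/}" "$(readlink "$ns")"
    done
}

ns_ids   # namespaces of this shell
```

Running ns_ids inside a shell started with unshare -m and comparing it with the output on the host should show a different identifier for mnt only. Reading the entries of another user's process may require root.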

3. Mount (mnt) namespace

  • mount

    • On Unix-like systems, mounting attaches a file system to a directory (the mount point) in the single file tree rooted at /, which makes the contents of that file system accessible

  • mount namespace

    • Lets processes in different mount namespaces see different sets of mount points
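The isolation of mount points can be observed with a short experiment. This is a sketch under a few assumptions: the function runs as root, /mnt exists and is an empty directory, and unshare comes from a util-linux version that makes mounts in the new namespace private (its default behavior in current releases). The name mount_ns_demo is illustrative.

```shell
mount_ns_demo() {
    # The tmpfs is mounted inside the new mount namespace only;
    # grep -c counts the /mnt entries visible to that shell.
    unshare -m sh -c 'mount -t tmpfs tmpfs /mnt && grep -c " /mnt " /proc/self/mounts'
    # Back in the original namespace the same count does not include the tmpfs.
    grep -c " /mnt " /proc/self/mounts
}
# mount_ns_demo   # uncomment to run (as root)
```

The tmpfs disappears on its own when the shell inside the new namespace exits, since no process is left in that namespace.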

4. Process ID (pid) namespace

  • A Container is a Process that runs in an isolated environment

    • Inside the container, it looks like a standalone VM, but from outside the container (host), it is just a single process

    • For the same process,

      • The PID seen from the host and the PID seen inside the container are different

  • The first process executed (entrypoint) in a container with an isolated PID namespace always has PID=1
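The PID=1 behavior can be reproduced with unshare. The sketch below assumes root privileges; the function name pid_ns_demo is illustrative.

```shell
pid_ns_demo() {
    # --fork:       run the command as a child of unshare, so it becomes the
    #               first process in the new PID namespace
    # --mount-proc: mount a fresh /proc so that ps reads the new namespace
    unshare --pid --fork --mount-proc sh -c 'echo "PID inside: $$"; ps -o pid,comm'
}
# pid_ns_demo   # uncomment to run (as root)
```

The shell reports itself as PID 1 inside the namespace, while the same process has an ordinary, larger PID when looked up from the host with ps.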

5. Inter-Process Communication (ipc) namespace

  • Isolates System V IPC objects and POSIX message queues

    • System V IPC

      • shared memory (shm)

      • semaphore

        • A method for controlling access to shared resources by multiple processes or threads

          • A process synchronization technique for concurrent processing

      • message queue

    • POSIX message queue

      • Kernel limits are exposed under /proc/sys/fs/mqueue

      • A newer alternative to System V message queues

        • Although the function names and types differ, they perform similar tasks

        • More intuitive and easier to use than System V based message queue functions

    • IPC objects are only visible to processes within the same IPC namespace
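The visibility rule above can be checked with the util-linux IPC tools. This is a sketch that assumes ipcmk, ipcs, ipcrm and unshare are installed and that the function runs as root; the name ipc_ns_demo is illustrative.

```shell
ipc_ns_demo() {
    # ipcmk prints "Shared memory id: <id>"; keep the id for cleanup.
    id=$(ipcmk -M 1024 | awk '{print $NF}') || return 1
    echo "current IPC namespace:"; ipcs -m      # lists the new 1 KiB segment
    echo "new IPC namespace:";     unshare -i ipcs -m   # starts with no segments
    ipcrm -m "$id"                              # remove the segment again
}
# ipc_ns_demo   # uncomment to run (as root)
```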

6. Network (net) namespace

  • Isolates network interfaces, routing, and firewall rules

  • Docker's container network architecture

    • Containers have their network namespace isolated from the host

      • A veth pair is created to connect the host and the container

      • The host-side end of the veth pair is attached to the Docker bridge (docker0 by default), so traffic leaving the container passes through the bridge
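The veth wiring can be rebuilt by hand with iproute2. The sketch below is not Docker's actual implementation: it skips the bridge and connects the host directly to a namespace. It assumes root privileges, and the names demo, veth-host and veth-ctr as well as the 10.200.0.0/24 addresses are made up for illustration.

```shell
netns_demo() {
    ip netns add demo                                    # new, empty network namespace
    ip link add veth-host type veth peer name veth-ctr   # create the veth pair
    ip link set veth-ctr netns demo                      # move one end into the namespace
    ip addr add 10.200.0.1/24 dev veth-host
    ip link set veth-host up
    ip netns exec demo ip addr add 10.200.0.2/24 dev veth-ctr
    ip netns exec demo ip link set veth-ctr up
    ip netns exec demo ping -c 1 10.200.0.1              # namespace side reaches the host side
    ip netns del demo                                    # removes the namespace and the veth pair
}
# netns_demo   # uncomment to run (as root)
```

In the Docker setup described above, the host-side end would additionally be attached to the bridge instead of carrying an address itself.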

7. Unix Time-Sharing (uts) namespace

  • What is Unix Time-Sharing?

    • Originated from the concept of sharing computing resources with other users

    • When multiple users share one machine but each should appear to be using a separate machine, the UTS namespace isolates the hostname (and NIS domain name) that each set of processes sees
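Hostname isolation can be seen with two commands. This is a sketch that assumes root privileges and an installed hostname command; the function name uts_ns_demo and the hostname container-demo are illustrative.

```shell
uts_ns_demo() {
    # Changing the hostname inside a new UTS namespace...
    unshare -u sh -c 'hostname container-demo; hostname'
    # ...leaves the hostname of the original namespace untouched.
    hostname
}
# uts_ns_demo   # uncomment to run (as root)
```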

8. User ID (user) namespace

  • Maps uids (and gids) inside the container to different uids (and gids) on the host

  • Docker containers do not isolate the user namespace by default

    • This means a user in the container has nearly the same privileges as the same uid on the host!

    • Why Docker does not isolate the user namespace

      • Compatibility issues with PID and Network namespace sharing features

      • Compatibility issues with external volumes or drivers that do not support user mapping

      • Complexity of ensuring that the host uid a container user is mapped to can access files bind-mounted from the host

      • Although the container root in a non-isolated user namespace has nearly equivalent privileges to the host root, it does not mean full root privileges

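    • Kubernetes did not support user namespace isolation for a long time, and pods still use the host user namespace by default; newer releases allow opting in per pod with spec.hostUsers: false (alpha since v1.25)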

    • When not using user namespace isolation

      • Restrict so that only trusted users can run the container runtime (ex. Docker)

      • Ensure that container processes do not run as the root user

        • Specify them to run with a particular UID and GID

      • Do not mount host directories for direct access by the container

    • Kubernetes provides security settings based on the same principles
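The uid mapping described at the start of this section can be observed without a container runtime. The sketch below assumes a kernel that allows unprivileged user namespaces, in which case root is not required. The function name user_ns_demo is illustrative.

```shell
user_ns_demo() {
    echo "uid on the host: $(id -u)"
    # --map-root-user creates a user namespace and maps the caller's uid to 0.
    unshare --map-root-user sh -c 'echo "uid inside:      $(id -u)"; cat /proc/self/uid_map'
}
# user_ns_demo   # uncomment to run
```

The uid_map line has three columns: the first uid inside the namespace, the uid it maps to outside, and the length of the mapped range.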

9. Control group (cgroup)

  • A Linux kernel feature that limits, accounts for, and isolates the resource usage of groups of processes

    • CPU

      • ex) Limit CPU usage

    • Memory

      • ex) Limit memory usage

    • Network

      • ex) Set network traffic priority

    • Disk

      • ex) Provide statistics on disk usage
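The CPU and memory limits listed above can be set by writing to the cgroup file system. This is a sketch that assumes cgroup v2 mounted at /sys/fs/cgroup, root privileges, and the cpu and memory controllers enabled for child groups (common on systemd-based distributions, but not guaranteed). The group name demo is illustrative.

```shell
cgroup_demo() {
    cg=/sys/fs/cgroup/demo
    mkdir "$cg" || return 1
    echo 100M > "$cg/memory.max"          # memory limit for the group
    echo "50000 100000" > "$cg/cpu.max"   # 50 ms of CPU time per 100 ms period
    sleep 30 &
    echo $! > "$cg/cgroup.procs"          # move the background process into the group
    cat "$cg/cgroup.procs"                # shows the PID that is now limited
    kill $!; wait $! 2>/dev/null
    rmdir "$cg"                           # a cgroup can be removed once it is empty
}
# cgroup_demo   # uncomment to run (as root)
```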

Wrap-up: Key principles of how containers implement an isolated environment

  • A Container is a process that runs in an isolated environment

  • Isolated environments for processes are implemented through namespaces

  • Process resource usage is limited through cgroups

What is a Kubernetes Pod?

  • The smallest deployable object unit in Kubernetes

  • A group consisting of one or more containers

A Pod is the "smallest deployable object unit"

Kubernetes applications are deployed in units of pods, and pods are created and managed by various types of workload resources

  • Job

    • Manages pods that run once and terminate when the task is complete

  • ReplicaSet

    • Guarantees that a specified number of pods are running

  • DaemonSet

    • Ensures that one copy of a pod runs on every node (or on every selected node)

  • StatefulSet

    • Manages pods that run stateful applications

  • Deployment

    • Provides declarative updates (e.g. rolling updates) for Pods and ReplicaSets

A Pod is the most fundamental unit that is created and managed in Kubernetes!

A Pod is a group of one or more containers

  • A Pod can contain one or more containers

    • Pods running a single container

    • Pods running multiple containers

Cases where a Pod consists of multiple containers

  • One Primary Container that plays the main role

  • One or more Sidecar Containers

    • Containers that run to complement the Primary Container

      • ex) monitoring, logging, etc.

  • Why?

    • As mentioned earlier, a Container is a process that runs in an isolated environment

    • The first process executed in an isolated PID namespace has pid=1

    • In other words, the state of the first process executed in a Container == the lifetime of the Container!

  • If multiple processes are running inside a container?

    • Even if the container is running, the execution state of processes other than the main process cannot be guaranteed

    • In other words, the state of processes running in a Container != the state of the Container

  • If a specific container in a Kubernetes pod terminates?

    • Kubernetes restarts the container according to the declared restartPolicy

      • restartPolicy options

        • Always

        • OnFailure

        • Never
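The restart behavior can be tried with a container that fails on purpose. The snippet below only writes a manifest to a temporary file; applying it is left as a commented step because it needs a cluster. The pod name restart-demo and the busybox image are illustrative choices, not taken from the referenced article.

```shell
# Write a Pod manifest whose only container exits with an error after 5 seconds.
manifest=$(mktemp)
cat > "$manifest" <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: restart-demo          # hypothetical name
spec:
  restartPolicy: OnFailure    # Always (default) | OnFailure | Never
  containers:
  - name: main
    image: busybox
    command: ["sh", "-c", "sleep 5; exit 1"]
EOF
echo "manifest written to $manifest"
# With a cluster available:
#   kubectl apply -f "$manifest"
#   kubectl get pod restart-demo --watch   # the RESTARTS column increases
```

With restartPolicy: Never the same pod would end in a failed state after the first exit instead of being restarted.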

Criteria for composing a Pod

  • Must the containers run on the same node?

    • Containers in the same pod always reside on the same node!

  • Do the containers need to be horizontally scaled by the same count?

    • Pod unit == unit of scaling!

  • Must the containers be deployed together as a single group?

Isolation between containers in a Kubernetes Pod

  • When examining containers on the node where a Pod is running,

    • cgroup namespace and user namespace are not separately isolated

    • mnt, uts, pid namespaces are isolated per container

      • They are not shared even within the same pod!

    • ipc, net namespaces are shared between containers in the pod

      • shared memory and other IPC between container processes is possible

      • Containers share the same IP address and port space (beware of port conflicts)
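The shared network namespace can be checked with a two-container pod. As a sketch, the snippet below writes such a manifest to a temporary file; the pod name shared-ns-demo, the container names and the images are illustrative assumptions.

```shell
# Write a Pod manifest with a web server and a sidecar in the same pod.
manifest=$(mktemp)
cat > "$manifest" <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: shared-ns-demo        # hypothetical name
spec:
  containers:
  - name: web
    image: nginx              # listens on port 80
  - name: sidecar
    image: busybox
    command: ["sh", "-c", "sleep 3600"]
EOF
echo "manifest written to $manifest"
# With a cluster available:
#   kubectl apply -f "$manifest"
#   kubectl exec shared-ns-demo -c sidecar -- wget -qO- http://localhost:80
```

The wget call runs in the sidecar container but reaches nginx through localhost, because both containers use the same network namespace. A second container that also tried to listen on port 80 would fail to bind for the same reason.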

What is a Pause Container?

  • The Pause container creates and maintains the isolated IPC and Network namespaces

    • The remaining containers share and use those namespaces

      • This prevents issues where a user-launched container terminates abnormally and causes problems in namespaces shared across all containers!

  • It simply runs an infinite loop and terminates when receiving SIGINT or SIGTERM

    • SIGINT

      • An interrupt signal from the keyboard

        • The signal sent when pressing [CTRL] + [C]

      • Stops execution

    • SIGTERM

      • Short for Terminate, a signal that requests graceful termination

      • The default signal of the kill command

  • It also reaps zombie processes

    • Only when PID namespace sharing is enabled!

PID namespace sharing in Kubernetes

  • When there is a risk of zombie processes occurring in individual containers, you can enable the Kubernetes PID namespace sharing option to delegate the zombie process reaping role to the Pause container

  • How to enable

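    • Set shareProcessNamespace: true in the pod spec (spec.shareProcessNamespace in a Pod manifest, spec.template.spec.shareProcessNamespace in a Deployment or other workload template)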
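A minimal manifest with the option enabled could look like the sketch below. In a bare Pod manifest the field sits directly under spec. The pod name share-pid-demo, the container names and the images are illustrative assumptions.

```shell
# Write a Pod manifest with PID namespace sharing enabled.
manifest=$(mktemp)
cat > "$manifest" <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: share-pid-demo        # hypothetical name
spec:
  shareProcessNamespace: true
  containers:
  - name: app
    image: nginx
  - name: shell
    image: busybox
    command: ["sh", "-c", "sleep 3600"]
EOF
echo "manifest written to $manifest"
# With a cluster available:
#   kubectl apply -f "$manifest"
#   kubectl exec share-pid-demo -c shell -- ps
```

With sharing enabled, ps in the shell container should list the pause process as PID 1 together with the nginx processes of the other container. Without the option, each container would see only its own processes.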

Wrap-up

Concepts of Kubernetes Pod

What is a Pod?

  • The smallest deployable object unit in Kubernetes

    • Pods are deployed by various types of resources (Job, ReplicaSet, etc.)

  • A group of one or more containers

    • Pods running a single container

    • Pods running multiple containers

      • Primary Container

      • Sidecar containers

  • Processes are split into separate containers because, even while a container is running, the state of processes other than its main process cannot be guaranteed!

When a specific container in a Kubernetes Pod terminates, Kubelet restarts the container according to the restartPolicy

Criteria for deciding how to compose a Pod

  • Must the containers run on the same node?

  • Do the containers need to be horizontally scaled by the same count?

  • Must the containers be deployed together as a single group?

Isolation between containers in a Pod

  • Namespaces shared with the host

    • cgroup

    • user

  • Namespaces shared between containers in the same pod

    • ipc

    • net

  • Namespaces isolated per container

    • mount

    • uts

    • pid

      • pid namespace sharing is optional!

What is a Pause Container?

  • Creates and maintains IPC and Network namespaces to be shared among containers

  • When PID namespace is shared, it also performs the zombie process reaping role
