Deploying and Scaling Microservices
with Kubernetes
Go easy on the WiFi!
Don't use your hotspot.
Don't stream videos or download big files during the training.
djalal, Nanterre, September 13
Hello, I am:
This workshop will run from 9am to 5pm.
The lunch break will be from 12pm to 1:30pm.
(With two coffee breaks, at 10:30am and 3pm!)
Feel free to interrupt me with your questions at any time.
Especially when you see full-screen pictures of shipping containers!
Live reactions, questions, requests for help:
on https://tinyurl.com/docker-w-djalal
This was initially written by Jérôme Petazzoni to support in-person, instructor-led workshops and tutorials
Credit is also due to multiple contributors — thank you!
You can also follow along on your own, at your own pace
We included as much information as possible in these slides
We recommend having a mentor to help you ...
... Or be comfortable spending some time reading the Kubernetes documentation ...
... And looking for answers on StackOverflow and other outlets
All the content is available in a public GitHub repository:
You can get an up-to-date version of these slides here:
https://container.training/ (English) or https://docker.djal.al/ (French)
👇 Try it! The source code will be displayed, and you can open it on GitHub to read it and fix it.
This slide has a little magnifying glass in its top left corner.
That magnifying glass means the slide provides extra details.
You can skip those slides if:
you are in a hurry;
you are new to all this and worried about cognitive overload;
you only want the essential information.
You can always come back to them later; they will be waiting for you here ☺
(auto-generated TOC)
Prerequisites
(automatically generated title slide)
Be comfortable with the UNIX command line:
navigating directories
editing files
a little bit of bash-fu (environment variables, loops)
Some Docker knowledge:
docker run, docker ps, docker build
ideally, you know how to write a Dockerfile and build it
(even if it's just a FROM line and a couple of RUN commands)
It's totally OK if you are not a Docker expert!
Tell me and I forget.
Teach me and I remember.
Involve me and I learn.
Misattributed to Benjamin Franklin
(Most likely inspired by the Chinese Confucian philosopher Xunzi)
This workshop is entirely hands-on
We are going to build, ship, and run containers!
You are invited to reproduce all the demos
All hands-on sections are clearly identified by the grey rectangle below
This is the stuff you're supposed to do!
Go to http://container.training/ to view these slides
Join the chat room: In person!
Each person gets a private cluster of cloud VMs (not shared with anybody else)
The VMs will stay up for the whole duration of the training
You should have a little card with a login, a password, and IP addresses
You can automatically SSH from one VM to another
The nodes have aliases: node1, node2, etc.
Installing that tooling can be hard on some machines
(32-bit CPU or OS... laptops without admin access, etc.)
"The whole team downloaded all these container images over the WiFi!"
... and it went great (said literally nobody)
All you need is a computer (or even a tablet) with:
an internet connection
a web browser
an SSH client
On Linux, OS X, FreeBSD... you are probably all set
On Windows, get one of these programs:
On Android, JuiceSSH (available from the Play Store) works pretty well
Nice to have: Mosh instead of SSH, if your internet connection tends to lose packets
You don't have to use Mosh or even know about it to follow along.
We're just telling you about it because some of us think it's cool!
Mosh is "the mobile shell"
It is essentially SSH over UDP, with roaming features
It retransmits packets quickly, so it works great even on lossy connections
(Like hotel or conference WiFi)
It has intelligent local echo, so it works great even in high-latency connections
(Like hotel or conference WiFi)
It supports transparent roaming when your client IP address changes
(Like when you hop from hotel to conference WiFi)
To install it: (apt|yum|brew) install mosh
It has been pre-installed on the VMs that we are using
To connect to a remote machine: mosh user@host
(It is going to establish an SSH connection, then hand off to UDP)
It requires UDP ports to be open
(By default, it uses a UDP port between 60000 and 61000)
Log into the first VM (node1) with your SSH client
Check that you can SSH (without a password) to node2:
ssh node2
Type exit or ^D to come back to node1
If anything goes wrong - ask for help!
Use something like Play-With-Docker or Play-With-Kubernetes
Zero setup effort; but environments are short-lived and might have limited resources
Create your own cluster (local or cloud VMs)
Small setup effort; small cost; flexible environments
Create a bunch of clusters for you and your friends (instructions)
Bigger setup effort; ideal for group training
These remarks only apply when we have multiple nodes, of course.
Unless instructed otherwise, all commands are run from the first VM, node1
All code will be retrieved on node1 only.
During normal operations, we do not need access to the other nodes.
If we had to troubleshoot issues, we would use a combination of:
SSH (to access system logs, daemon status, etc.)
the Docker API (to check running containers and the state of the container engine)
Once in a while, the instructions will say:
"Open a new terminal."
There are multiple ways to do this:
create a new window or tab on your machine, and SSH into the VM;
use screen or tmux on the VM and open a new window from there.
You are welcome to use the method that you feel the most comfortable with.
Tmux is a terminal multiplexer like screen.
You don't have to use it or even know about it to follow along.
But some of us like to use it to switch between terminals.
It has been preinstalled on your workshop nodes.
kubectl version
docker version
docker-compose -v
Kubernetes 1.13.x is only validated with Docker Engine versions up to 18.06
Kubernetes 1.14 is validated with Docker Engine versions up to 18.09
(the latest stable version at the time Kubernetes 1.14 was released)
Are we living dangerously by installing a Docker Engine that is "too recent"?
Not at all!
"Validated" = passes the very extensive (and expensive) continuous integration tests
The Docker API is versioned, and offers very strong backward compatibility
(If a client "speaks" API v1.25, the Docker Engine will keep behaving the same way)
Our sample application
(automatically generated title slide)
We will clone the GitHub repository onto our node1
The repository also contains scripts and tools that we will use throughout the training
Clone the repository on node1:
git clone https://github.com/jpetazzo/container.training
(You can also fork the repository on GitHub and clone your fork if you prefer.)
Let's start the application before diving into it, since the download can take a little while...
Go to the dockercoins directory in the cloned repository:
cd ~/container.training/dockercoins
Use Compose to build and run all the containers:
docker-compose up
Compose tells Docker to build all the container images (pulling the corresponding base images), then starts all the containers and displays the aggregated logs.
It's a DockerCoin miner! 💰🐳📦🚢
No, we will not be paying for coffee with DockerCoins
How DockerCoins works:
generate a few random bytes
compute a hash of these bytes
increment a counter (to keep track of the speed)
repeat forever!
DockerCoins is not a cryptocurrency
(the only common points are "randomness", "hashing", and "coins" in the name)
DockerCoins is made of 5 services:
rng = web service generating random bytes
hasher = web service computing a hash of POSTed data
worker = background process using rng and hasher
webui = web interface to watch progress
redis = data store (holds a counter, updated by worker)
These 5 services are visible in the application's Compose file, docker-compose.yml
worker invokes the rng web service to generate a few random bytes
worker invokes the hasher web service to compute a hash of these bytes
worker loops over these two tasks forever
every second, worker updates redis to indicate how many loops were done
webui queries redis, then computes and exposes the "hashing speed" in our browser
(See the diagram on the next slide!)
How does each service find the address of the others?
We do not hard-code IP addresses in the code.
We do not hard-code FQDNs in the code, either.
We just connect to a service name, and container magic does the rest
(And by container magic, we mean "a crafty, dynamic, embedded DNS server")
worker/worker.py
redis = Redis("redis")

def get_random_bytes():
    r = requests.get("http://rng/32")
    return r.content

def hash_bytes(data):
    r = requests.post("http://hasher/",
                      data=data,
                      headers={"Content-Type": "application/octet-stream"})
    return r.content
(Full source code available here)
Containers can have network aliases (resolved by DNS)
Compose file version 2+ makes each container reachable through its service name
Compose file version 1 required a "links" section instead
Network aliases are automatically namespaced:
you can have multiple apps declaring and using a service called database
containers in the blue app will resolve database to the IP of the blue database
containers in the green app will resolve database to the IP of the green database
(A quick sketch of that namespacing follows below.)
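As a hedged illustration of that namespacing (the project names blue and green are made up, and this is not part of the original exercises), running the same Compose file under two different project names gives each copy its own network:
# Two copies of the same app, under two different Compose project names
docker-compose -p blue up -d
docker-compose -p green up -d
# Each project gets its own dedicated network; within it, the name "database"
# resolves to that project's own database container.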
You can open the GitHub repository with all the materials of this workshop:
https://github.com/jpetazzo/container.training
The application is in the dockercoins subdirectory
The Compose file (docker-compose.yml) lists the 5 services
redis uses an official image from the Docker Hub
hasher, rng, worker, webui are built from a Dockerfile
Each service's Dockerfile and source code are stored in its own directory
(hasher is in the hasher directory, rng is in the rng directory, etc.)
This is only relevant if you used Compose before 2016...
Compose 1.6 introduced support for a new Compose file format (aka "v2")
Services are no longer at the top level, but in a services section
There has to be a version key at the top of the file, with the value "2" (the string, not the number)
Containers are placed on a dedicated network, making links unnecessary
There are other minor differences, but the upgrade is easy and straightforward.
(A minimal sketch of the v2 layout follows below.)
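As a hedged sketch of that layout (the real dockercoins file lists more services and build options), a minimal v2 Compose file could look like this:
version: "2"        # a string, not a number

services:
  redis:
    image: redis    # official image from the Docker Hub
  worker:
    build: worker   # built from the Dockerfile in the worker directory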
On the left-hand side, the "rainbow" strip shows the container names
On the right-hand side, we see the output of our containers
We can see the worker service making requests to rng and hasher
For rng and hasher, we can see their HTTP access logs
"Logs are exciting and fun!" (said nobody, ever, really)
The webui container exposes a web dashboard; let's go take a look at it
With a web browser, connect to node1 on port 8000
Reminder: the nodeX aliases are valid only on the nodes themselves
In your browser, you need to enter the IP address of your node
A diagram should show up, and after a few seconds, a blue graph will appear.
It looks like the speed is approximately 4 hashes/second
Or more precisely: 4 hashes/second, with regular dips down to zero
Why?
The app actually has a constant, steady speed of 3.33 hashes/second
(which corresponds to 1 hash every 0.3 seconds, for reasons)
Yes, and?
The worker doesn't update the counter after every loop, but at most once per second.
The speed is computed by the browser, which checks the counter about once per second.
Between two consecutive updates, the counter will increase either by 4, or by 0 (zero).
The perceived speed will therefore be 4 - 4 - 0 - 4 - 4 - 0, etc.
What can we conclude from all this?
If we stop Compose (with ^C), it will politely ask the Docker Engine to stop the app
The Docker Engine will send a TERM signal to the containers
If the containers do not exit in a timely manner, the Engine sends them a KILL signal
Stop the application by hitting ^C
Some containers exit immediately, others take longer.
The containers that do not handle SIGTERM end up being killed after about 10 seconds. If we are very impatient, we can hit ^C a second time!
To clean up, we can then run:
docker-compose down
Kubernetes concepts
(automatically generated title slide)
Kubernetes is a container management system
It runs and manages containerized applications on a cluster
What does that really mean?
Start 5 containers using image atseashop/api:v1.3
Place an internal load balancer in front of these containers
Start 10 containers using image atseashop/webfront:v1.3
Place a public load balancer in front of these containers
It's Black Friday (or Christmas!), traffic spikes, grow our cluster and add containers
New release! Replace my containers with the new image atseashop/webfront:v1.4
Keep processing requests during the upgrade; update my containers one at a time
Basic autoscaling
Blue/green deployment, canary deployment
Long-running services, but also batch (one-off) jobs
Overcommit our cluster and evict low-priority jobs
Run services with stateful data (databases, etc.)
Fine-grained access control defining what can be done, by whom, on which resources
Integrating third-party services (service catalog)
Automating complex tasks (operators)
Ha ha ha ha
OK, I was just trying to scare you, it's much simpler than that ❤️
The first schema is a Kubernetes cluster with storage backed by multi-path iSCSI
(Courtesy of Yongbok Kim)
The second one is a simplified representation of a Kubernetes cluster
(Courtesy of Imesh Gunaratne)
The nodes running our containers also run a collection of services:
a container engine (typically Docker)
kubelet (the "node agent")
kube-proxy (a necessary but not sufficient network component)
Nodes were formerly called "minions"
(You might still come across that word in older articles or documentation)
The Kubernetes logic (its "brains") is a collection of services:
the API server (our entry point to everything!)
core services like the scheduler and the controller manager
etcd (a highly available key/value store; the "database" of Kubernetes)
Together, these services form the control plane of our cluster
The control plane is also called the "master"
It is common to reserve a dedicated node for the control plane
(Except for single-node development clusters, like when using minikube)
This node is then called a "master"
(Yes, this is ambiguous: is the "master" a node, or the whole control plane?)
Normal applications are not allowed to run on this node
(Using a mechanism called "taints")
For high availability, each service of the control plane must be resilient
The control plane is then replicated on multiple nodes
(This is sometimes called a "multi-master" setup)
The control plane services can run in or out of containers
For example: since etcd is a critical service, some people deploy it directly on a dedicated cluster (without containers)
(This is illustrated on the first "super complicated" schema)
In some hosted Kubernetes offerings (e.g. AKS, GKE, EKS), the control plane is invisible
(We only "see" a Kubernetes API endpoint)
In that case, there is no "master" node
For this reason, it is more accurate to say "control plane" rather than "master".
No!
By default, Kubernetes uses the Docker Engine to run containers
We could also use rkt ("Rocket") from CoreOS
Or leverage other engines through the Container Runtime Interface
(like CRI-O, or containerd)
Yes!
In this workshop, we will run our app on a single node first
We will need to build images and ship them around
We could do these things without Docker
(and get diagnosed with NIH¹ syndrome)
Docker is still the most stable container engine today
(but the alternatives are maturing very quickly)
On our development environments, CI pipelines ... :
Yes, almost certainly
On our production servers:
Yes (for today)
Probably not (in the future)
More information about CRI on the Kubernetes blog
We interact with Kubernetes mostly through a RESTful API
The Kubernetes API defines a lot of objects called resources
These resources are organized by type, or Kind (in the API)
The API lets us create, read, update, and delete resources
A few common resource types are:
And much more!
We can see the full list by running kubectl api-resources
The first diagram is courtesy of Lucas Käldström, in this presentation
The second diagram is courtesy of Weave Works
a pod can have multiple containers working together (a minimal sketch follows below)
IP addresses are associated with pods, not with individual containers
Both diagrams are used with permission of their authors.
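To make the "multiple containers per pod" idea concrete, here is a minimal, hypothetical pod manifest; the names and images are arbitrary examples, not something used later in this workshop:
apiVersion: v1
kind: Pod
metadata:
  name: web-with-sidecar      # hypothetical name
spec:
  containers:
  - name: web
    image: nginx              # main container
  - name: sidecar
    image: busybox            # helper container sharing the pod's network and volumes
    command: ["sleep", "3600"]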
Declarative vs imperative
(automatically generated title slide)
Our container orchestrator puts a very strong emphasis on being declarative
Declarative:
I would like a cup of tea.
Imperative:
Boil some water. Pour it in a teapot. Add tea leaves. Steep for a while. Serve in a cup.
Declarative seems simpler at first...
... as long as you know how to brew tea
What declarative mode really should be:
I want a cup of tea, obtained by pouring an infusion¹ of tea leaves in a cup.
¹An infusion is obtained by letting the object steep a few minutes in hot water².
²Hot liquid is obtained by pouring it in an appropriate container³ and setting it on a stove.
³Ah, finally, containers! Something we know about. Let's get to work, shall we?
Did you know there is an ISO standard specifying how to brew tea?
Imperative systems:
simpler
if a task is interrupted, we have to restart from scratch
Declarative systems:
if a task is interrupted (or if we show up to the party half-way through), we can figure out what's missing and do only what's needed
we need to be able to observe the system
... and to compute a "diff" between what is currently running and what we want
Virtually everything that we run on Kubernetes is declared with a spec
All we can do is write a spec and push it to the API server
(by declaring resources such as Pod or Deployment)
The API server will validate that spec (and reject it if it's invalid)
Then it will store it in etcd
A controller will "notice" that spec and act upon it
Watch for the spec fields in the YAML files later!
The spec describes how we want the thing to run
Kubernetes will reconcile the current state with the spec
(technically, this is done by a number of controllers)
When we want to change a resource, we update the spec
Kubernetes will then converge that resource
(A minimal example of such a spec follows below.)
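As a hedged illustration, here is roughly what such a spec could look like for the pingpong pod used later in this workshop; the spec: section is the desired state that the controllers keep reconciling:
apiVersion: v1
kind: Pod
metadata:
  name: pingpong
spec:                          # desired state: what we want to run
  containers:
  - name: pingpong
    image: alpine
    command: ["ping", "1.1.1.1"]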
The Kubernetes network model
(automatically generated title slide)
In a nutshell:
Our cluster (nodes and pods) is one big flat IP network.
In detail:
all nodes must be able to reach each other, without NAT
all pods must be able to reach each other, without NAT
pods and nodes must be able to reach each other, without NAT
each pod is aware of its own IP address (no NAT)
IP addresses are assigned by the network implementation (the plugin)
Kubernetes doesn't mandate any particular implementation
Everything can reach everything
No address translation
No port translation
No new protocol
The network implementation can decide how to allocate addresses
IP addresses don't have to be "portable" from a node to another
(For example, we can use a subnet per node and a simple routed topology)
The specification is simple enough to allow many, varied implementations
Everything can reach everything:
if you want security, you need to add network policies
the network implementation that you use needs to support them
There are literally dozens of implementations out there
(No fewer than 15 are listed in the Kubernetes documentation)
Pods have layer 3 (IP) connectivity, but services are layer 4 (TCP or UDP)
(Services map to a single TCP or UDP port; no port ranges or arbitrary IP packets)
kube-proxy is on the data path when connecting to a pod or container,
and it's not particularly fast (it relies on userland proxying or iptables)
The nodes that we have at our disposal use Weave
We don't particularly endorse Weave; it just Works For Us
Don't worry about the warning about kube-proxy performance
Unless you:
If necessary, there are alternatives to kube-proxy, such as:
kube-router
CNI is a complete specification for network plugins.
When a new pod is created, Kubernetes delegates the network setup to CNI plugins.
(It can be a single plugin, or a combination of plugins, each doing one task)
Typically, a CNI plugin will:
allocate an IP address (by calling an IPAM plugin)
add a network interface into the pod's network namespace
configure that interface as well as the required routes, etc.
Not all CNI plugins are created equal
(e.g. they don't all implement network policies, which are required to isolate pods)
The "pod-to-pod network" or "pod network":
provides communication between pods and nodes
is generally implemented with CNI plugins
The "pod-to-service network":
provides internal communication and load balancing
is generally implemented with kube-proxy (or e.g. kube-router)
Network policies:
provide firewalling and isolation (a minimal example follows after this list)
can be bundled with the "pod network" or provided by another component
Inbound traffic can be handled by multiple components:
something like kube-proxy or kube-router (for NodePort services)
load balancers (ideally, connected to the pod network)
In theory, it is possible to use multiple pod networks in parallel
(with "meta-plugins" like CNI-Genie or Multus)
Some solutions can fill multiple roles
(e.g. kube-router can be set up to provide the pod network and/or network policies and/or replace kube-proxy)
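For instance, here is a minimal, hypothetical network policy that denies all inbound traffic to every pod in its namespace (an empty podSelector matches all pods); the name is arbitrary:
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: deny-all-ingress       # arbitrary example name
spec:
  podSelector: {}              # selects every pod in the namespace
  policyTypes:
  - Ingress                    # no ingress rules are listed, so all inbound traffic is denied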
First contact with kubectl
(automatically generated title slide)
kubectl
kubectl is (almost) the only tool we will need to talk to Kubernetes
It is a very rich CLI tool built around the Kubernetes API
(Everything you can do with kubectl, you can do directly through the API)
On our machines, there is a ~/.kube/config file with:
the Kubernetes API address
the path to our TLS certificates used to identify us
You can also use the --kubeconfig flag to pass a specific config file
Or pass --server, --user, etc. directly
kubectl can be pronounced "Cube C T L", "Cube cuttle", "Cube cuddle"...
kubectl get
Let's look at our Node resources with kubectl get!
Look at the composition of our cluster:
kubectl get node
These commands are equivalent:
kubectl get no
kubectl get node
kubectl get nodes
kubectl get can output JSON, YAML, or custom formats
Give us more information about the nodes:
kubectl get nodes -o wide
Let's get some YAML:
kubectl get no -o yaml
See that kind: List at the very end? It's the type of our result!
kubectl and jq
kubectl get nodes -o json | jq ".items[] | {name:.metadata.name} + .status.capacity"
kubectl has pretty solid introspection capabilities
We can list all available resource types by running kubectl api-resources
(On Kubernetes 1.10 and earlier, you had to run kubectl get)
To get the details of a resource type, use:
kubectl explain type
The definition of a field within a resource type can be shown with:
kubectl explain node.spec
kubectl explain node --recursive
We can access the same information by reading the API documentation
The documentation is usually easier to read, but:
kubectl api-resources and kubectl explain perform introspection
(they rely on the API server to obtain exact type definitions)
Most common resource types have up to three forms of their name:
singular (e.g. node, service, deployment)
plural (e.g. nodes, services, deployments)
short (e.g. no, svc, deploy)
Some resources do not have a short name
Endpoints only have a plural form
(because even a single Endpoints resource is actually a list of endpoints)
We can use kubectl get -o yaml to see all the details of a resource
However, the YAML output is often both too verbose and incomplete
For example, kubectl get node node1 -o yaml is:
too verbose (e.g. it includes the list of images available on that node)
incomplete (because it doesn't show the pods running on that node)
hard to read for a human operator
For a comprehensive overview, we can use kubectl describe instead.
kubectl describe
kubectl describe needs a resource type and (optionally) a resource name
It is possible to provide a resource name prefix
(all matching objects will be displayed)
kubectl describe will retrieve some extra information about the resource
Look at the information available for node1 with one of these commands:
kubectl describe node/node1
kubectl describe node node1
(We should see a bunch of control plane pods.)
A service is a stable endpoint to connect to "something"
(In the initial proposal, they were called "portals")
List the services on our cluster with one of these commands:
kubectl get services
kubectl get svc
There is already one service on our cluster: the Kubernetes API itself.
A ClusterIP service is internal, available from the cluster only
This is useful for introspection from within containers.
Try to connect to the API:
curl -k https://10.96.0.1
-k is used to skip certificate verification
Make sure to replace 10.96.0.1 with the CLUSTER-IP shown by kubectl get svc
Note: on Docker Desktop, the API is only reachable at https://localhost:6443/
The error that we see is expected: the Kubernetes API requires authentication.
Containers are manipulated through pods.
A pod is a group of containers:
running together (on the same node)
sharing resources (RAM, CPU; but also network and volumes)
List the pods on our cluster:
kubectl get pods
These are not the pods we're looking for. But where are they, then?!?
List the namespaces on our cluster with one of these commands:
kubectl get namespaces
kubectl get namespace
kubectl get ns
You know what... That kube-system thing looks suspicious.
In fact, I'm pretty sure it showed up earlier, when we did:
kubectl describe node node1
By default, kubectl uses the... default namespace
We can show resources in all namespaces with --all-namespaces
List the pods in all namespaces:
kubectl get pods --all-namespaces
Since Kubernetes 1.14, we can also use -A as a shorter version:
kubectl get pods -A
Here are our system pods!
etcd is our etcd server
kube-apiserver is the API server
kube-controller-manager and kube-scheduler are other control plane components
coredns provides DNS-based service discovery (it replaces kube-dns as of 1.11)
kube-proxy runs on each node and manages port mappings and such
weave is the component managing the overlay network on each node
the READY column indicates the number of containers in each pod
pods with a name ending in -node1 are the control plane components
(they have been specifically "pinned" to the master node)
Why are all these pods in kube-system (and not in default)?
List only the pods in the kube-system namespace:
kubectl get pods --namespace=kube-system
kubectl get pods -n kube-system
Namespaces and kubectl
We can use -n/--namespace with almost every kubectl command
Example:
kubectl create --namespace=X to create something in namespace X
We can use -A/--all-namespaces with most commands that manipulate multiple objects at once
Examples:
kubectl delete can delete resources across multiple namespaces
kubectl label can add/remove labels across multiple namespaces
What about kube-public?
List the pods in the kube-public namespace:
kubectl -n kube-public get pods
Nothing!
kube-public is created by kubeadm and is used to establish basic security bootstrapping
What's in kube-public?
kube-public contains a ConfigMap named cluster-info
List the ConfigMaps in the kube-public namespace:
kubectl -n kube-public get configmaps
Inspect cluster-info:
kubectl -n kube-public get configmap cluster-info -o yaml
Note the selfLink URI: /api/v1/namespaces/kube-public/configmaps/cluster-info
We might need that!
Accessing cluster-info
Earlier, when we queried the API server, we got a Forbidden response
But cluster-info is readable by everyone (even without authentication)
Retrieve cluster-info:
curl -k https://10.96.0.1/api/v1/namespaces/kube-public/configmaps/cluster-info
We were able to access cluster-info (without authentication)
It contains a kubeconfig file
We can easily extract the kubeconfig file from this ConfigMap
Display the kubeconfig file:
curl -sk https://10.96.0.1/api/v1/namespaces/kube-public/configmaps/cluster-info \
  | jq -r .data.kubeconfig
This file holds the canonical address of the API server, and the public key of the CA.
This file does not hold client keys or tokens.
This is not sensitive information, but it is essential to establish a secure connection.
What about kube-node-lease?
Starting with Kubernetes 1.14, there is a kube-node-lease namespace
(or starting with 1.13 if the NodeLease feature was enabled)
That namespace contains one Lease object per node
Node leases are a new way to implement node heartbeats
(i.e. a node regularly pinging the control plane to say "I'm alive!")
For more details, see KEP-0009 or the node controller documentation
(a quick way to look at these leases is sketched below)
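As a quick, hedged aside (assuming a cluster recent enough to have this namespace), you can peek at these objects with:
kubectl -n kube-node-lease get leases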
Installing Kubernetes
(automatically generated title slide)
We used kubeadm on freshly installed VMs running Ubuntu LTS
Install Docker
Install the Kubernetes packages
Run kubeadm init on the first node (this deploys the control plane on that node)
Set up Weave (the overlay network)
(that step is just one kubectl apply command; discussed later)
Run kubeadm join on the other nodes (with the token produced by kubeadm init)
Copy the configuration file generated by kubeadm init
Check the VM installation README for more details.
(A condensed sketch of those commands follows below.)
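Roughly, and glossing over package repositories and exact version pins, the sequence looks like the sketch below; the Weave manifest URL and the join parameters shown here are illustrative and should be taken from your own distribution and from the output of kubeadm init:
# On the first node: install a container engine and the Kubernetes packages,
# then bootstrap the control plane
apt-get install -y docker.io kubelet kubeadm kubectl
kubeadm init

# Still on the first node: install the overlay network (Weave, in our case)
kubectl apply -f "https://cloud.weave.works/k8s/net?k8s-version=$(kubectl version | base64 | tr -d '\n')"

# On each other node: join the cluster with the token printed by kubeadm init
kubeadm join <api-server-address>:6443 --token <token> --discovery-token-ca-cert-hash <hash>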
kubeadm
Does not install Docker or any other container engine
Does not install the overlay network
Does not install multi-master setups (no high availability)
(At least... not yet! Though this is available as an experimental feature in 1.12.)
"It's still twice as much work as setting up a Swarm cluster 😕" -- Jérôme
If you are on Azure: AKS
If you are on Google Cloud: GKE
Cloud-agnostic (AWS/DO/GCE (beta)/vSphere (alpha)): kops
On your local machine: minikube, kubespawn, Docker Desktop
If you have specific deployment needs: kubicorn
Probably the closest thing to a multi-cloud/hybrid solution today, but still under development.
If you like Ansible: kubespray
If you like Terraform: typhoon
If you like Terraform and Puppet: tarmak
You can also learn how to install every component manually, with the excellent tutorial Kubernetes The Hard Way
Kubernetes The Hard Way is optimized for learning, which means taking the long route to ensure you understand each task required to bootstrap a Kubernetes cluster.
There are also plenty of commercial options available!
For a more complete list, please consult the Kubernetes documentation:
it has a great guide to pick the right solution.
Running our first containers on Kubernetes
(automatically generated title slide)
First things first: we cannot run "a" container
We are going to run a pod, and in that pod, we will run a single container
In that container, in the pod, we are going to run a simple ping command
Then we are going to start additional copies of the pod
kubectl run
Let's ping 1.1.1.1, Cloudflare's public DNS resolver:
kubectl run pingpong --image alpine ping 1.1.1.1
(Starting with Kubernetes 1.12, we get a message telling us that kubectl run is deprecated. Let's ignore it for now.)
What did kubectl run create?
List the resources that were created:
kubectl get all
We should see something like:
deployment.apps/pingpong (the deployment that we just created)
replicaset.apps/pingpong-xxxxxxxxxx (a replica set created by that deployment)
pod/pingpong-xxxxxxxxxx-yyyyy (a pod created by the replica set)
Note: as of 1.10.1, resource types are displayed in more detail.
A deployment is a high-level construct:
allows scaling, rolling updates, rollbacks
multiple deployments can be used together to implement a canary deployment
delegates pod management to replica sets
A replica set is a low-level construct:
makes sure that a given number of identical pods are running
allows scaling
is rarely used directly
Our pingpong deployment
kubectl run created a deployment, deployment.apps/pingpong
NAME                       DESIRED CURRENT UP-TO-DATE AVAILABLE AGE
deployment.apps/pingpong   1       1       1          1         10m
That deployment created a replica set, replicaset.apps/pingpong-xxxxxxxxxx
NAME                                  DESIRED CURRENT READY AGE
replicaset.apps/pingpong-7c8bbcd9bc   1       1       1     10m
That replica set created a pod, pod/pingpong-xxxxxxxxxx-yyyyy
NAME                            READY STATUS  RESTARTS AGE
pod/pingpong-7c8bbcd9bc-6c9qz   1/1   Running 0        10m
We will see later how these folks play together for:
Let's use the kubectl logs command
We will pass it either a pod name, or a type/name
(E.g. if we specify a deployment or replica set, it will show the first pod in it)
Unless specified otherwise, it will only show logs of the first container in the pod
(Good thing there's only one in ours!)
View the result of our ping command:
kubectl logs deploy/pingpong
Just like docker logs, kubectl logs supports convenient options:
-f/--follow to stream logs in real time (à la tail -f)
--tail to indicate how many lines you want to see (from the end)
--since to show logs only after a given timestamp
View the latest logs of our ping command:
kubectl logs deploy/pingpong --tail 1 --follow
Scaling with kubectl scale
Scale our pingpong deployment:
kubectl scale deploy/pingpong --replicas 3
Note that this other command does exactly the same thing:
kubectl scale deployment pingpong --replicas 3
Note: what if we tried to scale replicaset.apps/pingpong-xxxxxxxxxx?
We could! But the deployment would notice it right away, and scale it back to the initial level.
The pingpong deployment watches its replica set
The replica set ensures that the right number of pods are running
What happens if pods disappear unexpectedly?
In a separate window, watch the pods:
kubectl get pods -w
Destroy a pod:
kubectl delete pod pingpong-xxxxxxxxxx-yyyyy
What if we wanted to start a "one-shot" container that doesn't get restarted?
We could use kubectl run --restart=OnFailure or kubectl run --restart=Never
These commands would create jobs or pods instead of deployments
Under the hood, kubectl run invokes "generators" to create resource descriptions
We could also write these resource descriptions ourselves (typically in YAML),
and create them on the cluster with kubectl apply -f (as we will see later)
With kubectl run --schedule=..., we can also create cronjobs
As we saw on the previous slides, kubectl run can do many things
The exact type of resource created is not obvious
To make things more explicit, it is better to use kubectl create:
kubectl create deployment to create a deployment
kubectl create job to create a job
kubectl create cronjob to run a job periodically (since Kubernetes 1.14)
Eventually, kubectl run will only be used to start one-shot pods
To create other resources, use kubectl create <resource>,
or kubectl create -f foo.yaml / kubectl apply -f foo.yaml
(A few hedged examples of these commands follow below.)
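As a hedged illustration of those more explicit commands (the names, image, and schedule below are arbitrary examples, not part of the original exercises):
# Create a deployment
kubectl create deployment web --image=nginx

# Create a one-shot job (everything after "--" is the command to run)
kubectl create job hello --image=alpine -- echo hello

# Create a cronjob running every minute (Kubernetes 1.14 and later)
kubectl create cronjob hello --image=alpine --schedule="*/1 * * * *" -- echo hello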
When we specify a deployment name, only one single pod's logs are shown
We can view the logs of multiple pods by adding a selector
A selector is a logic expression using labels
Conveniently, when we run kubectl run somename, the associated objects get a run=somename label
View the last log line of all pods with the run=pingpong label:
kubectl logs -l run=pingpong --tail 1
What if we want to stream the logs of all our pingpong pods?
Combine the -l and -f flags:
kubectl logs -l run=pingpong --tail 1 -f
Note: combining -l and -f is only possible since Kubernetes 1.14!
Let's try to understand why ...
Scale up our deployment:
kubectl scale deployment pingpong --replicas=8
Stream the logs:
kubectl logs -l run=pingpong --tail 1 -f
We should see a message like this one:
error: you are attempting to follow 8 log streams,
but maximum allowed concurrency is 5,
use --max-log-requests to increase the limit
kubectl opens one connection to the API server per pod
For each pod, the API server opens one extra connection to the corresponding kubelet
If there are 1000 pods in our deployment, that's 1000 inbound connections + 1000 connections to the API server
This could easily overload the API server
Prior to Kubernetes 1.14, it was decided not to allow multiple connections
Starting with 1.14, multiple connections are allowed, but capped to 5
(this can be changed with --max-log-requests)
For more details about the rationale, see PR #67573
Shortcomings of kubectl logs
We don't see which pod sent which log line
If pods are restarted / replaced, the log stream stops
If new pods are added, we don't see their logs
To stream the logs of multiple pods, we need to write a selector
There are external tools that address these shortcomings
(e.g.: Stern)
kubectl logs -l ... --tail N
If we run this command with Kubernetes 1.12, it prints multiple lines
This is a regression when --tail is used together with -l/--selector
It always shows the last 10 lines of output for each container
(instead of the number of lines specified on the command line)
The problem was fixed in Kubernetes 1.13
See #70554 for details.
If you think about it, it's a good question!
However, don't worry:
APNIC's research group held the IP addresses 1.1.1.1 and 1.0.0.1. While the addresses were valid, so many people had entered them into various random systems that they were continuously overwhelmed by a flood of garbage traffic. APNIC wanted to study this garbage traffic, but every time they tried to announce the IPs, the flood would overwhelm any conventional network.
It is very unlikely that our combined pings manage to produce even a modest blip at Cloudflare's NOC!
They say, "a picture is worth one thousand words."
The following 19 slides show what really happens when we run:
kubectl run web --image=nginx --replicas=3
Exposing containers
(automatically generated title slide)
kubectl expose creates a service for existing pods
A service is a stable address for a pod (or a bunch of pods)
If we want to connect to our pods, we need to create a service
Once a service is created, CoreDNS will allow us to resolve it by name
(i.e. after creating the service hello, the name hello will resolve to something)
There are different types of services, detailed on the following slides:
ClusterIP, NodePort, LoadBalancer, ExternalName
ClusterIP (default type): a virtual IP address is allocated for the service, reachable only from inside the cluster
NodePort: a port is allocated for the service and exposed on all nodes
These service types are always available.
Under the hood: kube-proxy uses a userland proxy and a bunch of iptables rules.
LoadBalancer: an external load balancer is provisioned for the service (a NodePort service is created, and the load balancer sends traffic to that port)
ExternalName: the DNS entry managed by CoreDNS will just be a CNAME to a provided record
Since ping doesn't have anywhere to connect to, we'll run something else instead
We could use the official nginx image, but...
... we wouldn't be able to tell one backend from another!
We are going to use jpetazzo/httpenv, a tiny HTTP server written in Go
jpetazzo/httpenv listens on port 8888
It serves its environment variables in JSON format
The environment variables will include HOSTNAME, which will be the pod name
(and therefore will be different on each backend)
We could run kubectl run httpenv --image=jpetazzo/httpenv ...
But since kubectl run is being deprecated, let's see how to use kubectl create instead
In another window, watch the pods (to see when they are created):
kubectl get pods -w
Create a deployment for this very lightweight HTTP server:
kubectl create deployment httpenv --image=jpetazzo/httpenv
Scale the deployment to 10 replicas:
kubectl scale deployment httpenv --replicas=10
We will create a default ClusterIP service
Expose the HTTP port of our server:
kubectl expose deployment httpenv --port 8888
Look up which IP address was allocated:
kubectl get service
We can assign IP addresses to services, but they are still layer 4
(i.e. a service is not just an IP address; it's an IP address + protocol + port)
This is caused by the current implementation of kube-proxy
(it relies on mechanisms that don't support layer 3)
As a result: you have to indicate the port number for your service
Running services with arbitrary ports (or port ranges) requires hacks
(such as host networking mode)
Get the IP address that was allocated for our service:
IP=$(kubectl get svc httpenv -o go-template --template '{{ .spec.clusterIP }}')
Send a few requests:
curl http://$IP:8888/
Too much output? Filter it with jq:
curl -s http://$IP:8888/ | jq .HOSTNAME
Try it a few times! Our requests are load balanced across multiple pods.
Sometimes, we want to access our services directly:
if we want to save a tiny little bit of latency (typically less than 1ms)
if we need to connect over arbitrary ports (instead of a few fixed ones)
if we need to communicate over another protocol than UDP or TCP
if we want to decide how to balance the requests client-side
...
In that case, we can use a "headless service"
A headless service is obtained by setting the clusterIP field to None
(Either with --cluster-ip=None, or by providing a custom YAML)
Since there is no virtual IP address, there is no load balancer either
CoreDNS will return the pods' IP addresses as multiple A records
This gives us an easy way to discover all the replicas of a deployment.
(A minimal headless service sketch follows below.)
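Here is a hedged sketch of what such a YAML could look like for the httpenv deployment used in these exercises; it assumes the app=httpenv label that kubectl create deployment sets, and the service name is arbitrary:
apiVersion: v1
kind: Service
metadata:
  name: httpenv-headless    # arbitrary example name
spec:
  clusterIP: None           # this is what makes the service "headless"
  selector:
    app: httpenv            # matches the pods created by the httpenv deployment
  ports:
  - port: 8888              # httpenv listens on port 8888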
A service has a number of "endpoints"
Each endpoint is a host + port combination where the service is available
The endpoints are maintained and updated automatically by Kubernetes
Check the endpoints associated with our httpenv service:
kubectl describe service httpenv
In the output, there will be a line starting with Endpoints:
That line lists a bunch of addresses in host:port format.
When there are many endpoints, the display commands truncate the list
kubectl get endpoints
To see the full list, we can use one of the following commands:
kubectl describe endpoints httpenv
kubectl get endpoints httpenv -o yaml
These commands will show us a list of IP addresses
These IP addresses should match the addresses of the corresponding pods:
kubectl get pods -l app=httpenv -o wide
endpoints, not endpoint
endpoints is the only resource that cannot be written in singular form
$ kubectl get endpoint
error: the server doesn't have a resource type "endpoint"
This is because the type itself is plural (unlike every other resource)
There is no endpoint object: type Endpoints struct
The type doesn't represent a single endpoint, but a list of endpoints
The default type (ClusterIP) only works for internal traffic
If we want to accept external traffic, we can use one of these:
NodePort (expose a service on a TCP port between 30000 and 32768)
LoadBalancer (if our cloud provider supports it)
ExternalIP (use the external IP address of a node)
Ingress (a special mechanism for HTTP services)
We will cover NodePorts and Ingresses in more detail later.
Shipping images with a registry
(automatically generated title slide)
Initially, our app was running on a single node
We could build and run in the same place
Therefore, we did not need to ship anything
Now that we want to run on a cluster, things are different
The easiest way to ship container images is to use a registry
What happens when we execute docker run alpine?
If the Engine needs to pull the alpine image, it expands it into library/alpine
library/alpine is expanded into index.docker.io/library/alpine
The Engine communicates with index.docker.io to retrieve library/alpine:latest
To use something else than index.docker.io, we specify it in the image name
Examples:
docker pull gcr.io/google-containers/alpine-with-bash:1.0
docker build -t registry.mycompany.io:5000/myimage:awesome .
docker push registry.mycompany.io:5000/myimage:awesome
Create one deployment for each component
(hasher, redis, rng, webui, worker)
Expose deployments that need to accept connections
(hasher, redis, rng, webui)
For redis, we can use the official redis image
For the 4 others, we need to build images and push them to some registry
There are many options!
Manually:
build locally (with docker build
or otherwise)
push to the registry
Automatically:
build and test locally
when ready, commit and push a code repository
the code repository notifies an automated build system
that system gets the code, builds it, pushes the image to the registry
There are SAAS products like Docker Hub, Quay ...
Each major cloud provider has an option as well
(ACR on Azure, ECR on AWS, GCR on Google Cloud...)
There are also commercial products to run our own registry
(Docker EE, Quay...)
And open source options, too!
When picking a registry, pay attention to its build system
(when it has one)
For everyone's convenience, we took care of building DockerCoins images
We pushed these images to the DockerHub, under the dockercoins user
These images are tagged with a version number, v0.1
The full image names are therefore:
dockercoins/hasher:v0.1
dockercoins/rng:v0.1
dockercoins/webui:v0.1
dockercoins/worker:v0.1
$REGISTRY and $TAG
In the upcoming exercises and labs, we use a couple of environment variables:
$REGISTRY as a prefix to all image names
$TAG as the image version tag
For example, the worker image is $REGISTRY/worker:$TAG
If you copy-paste the commands in these exercises:
make sure that you set $REGISTRY and $TAG first!
For example:
export REGISTRY=dockercoins TAG=v0.1
(this will expand $REGISTRY/worker:$TAG to dockercoins/worker:v0.1)
Running our application on Kubernetes
(automatically generated title slide)
Deploy redis:
kubectl create deployment redis --image=redis
Deploy everything else:
set -u
for SERVICE in hasher rng webui worker; do
  kubectl create deployment $SERVICE --image=$REGISTRY/$SERVICE:$TAG
done
After waiting for the deployment to complete, let's look at the logs!
(Hint: use kubectl get deploy -w to watch deployment events)
kubectl logs deploy/rng
kubectl logs deploy/worker
🤔 rng is fine ... But not worker.
💡 Oh right! We forgot to expose.
Three deployments need to be reachable by others: hasher, redis, rng
worker doesn't need to be exposed
webui will be dealt with later
kubectl expose deployment redis --port 6379
kubectl expose deployment rng --port 80
kubectl expose deployment hasher --port 80
worker has an infinite loop, that retries 10 seconds after an error
Stream the worker's logs:
kubectl logs deploy/worker --follow
(Give it about 10 seconds to recover)
We should now see the worker, well, working happily.
Now we would like to access the Web UI
We will expose it with a NodePort
(just like we did for the registry)
Create a NodePort service for the Web UI:
kubectl expose deploy/webui --type=NodePort --port=80
Check the port that was allocated:
kubectl get svc
Yes, this may take a little while to update. (Narrator: it was DNS.)
Alright, we're back to where we started, when we were running on a single node!
Accessing the API with kubectl proxy
(automatically generated title slide)
kubectl proxy
The API requires us to authenticate¹
There are many authentication methods available, including:
TLS client certificates
(that's what we've used so far)
HTTP basic password authentication
(from a static file; not recommended)
various token mechanisms
(detailed in the documentation)
¹OK, we lied. If you don't authenticate, you are considered to be user system:anonymous, which doesn't have any access rights by default.
curl
Retrieve the ClusterIP allocated to the kubernetes service:
kubectl get svc kubernetes
Replace the IP below and try to connect with curl:
curl -k https://10.96.0.1/
The API will tell us that user system:anonymous cannot access this path.
If we wanted to talk to the API, we would need to:
extract our TLS key and certificate information from ~/.kube/config
(the information is in PEM format, encoded in base64)
use that information to present our certificate when connecting
(for instance, with openssl s_client -key ... -cert ... -connect ...
)
figure out exactly which credentials to use
(once we start juggling multiple clusters)
change that whole process if we're using another authentication method
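For the curious, here is a rough, untested sketch of that process (it assumes the first user entry in ~/.kube/config holds our client certificate and key, as generated by kubeadm):
kubectl config view --raw -o json \
  | jq -r '.users[0].user["client-certificate-data"]' | openssl base64 -d -A > /tmp/client.crt
kubectl config view --raw -o json \
  | jq -r '.users[0].user["client-key-data"]' | openssl base64 -d -A > /tmp/client.key
curl -k --cert /tmp/client.crt --key /tmp/client.key https://10.96.0.1/api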
🤔 There has to be a better way!
kubectl proxy for authentication
kubectl proxy runs a proxy in the foreground
This proxy lets us access the Kubernetes API without authentication
(kubectl proxy
adds our credentials on the fly to the requests)
This proxy lets us access the Kubernetes API over plain HTTP
This is a great tool to learn and experiment with the Kubernetes API
... And for serious uses as well (suitable for one-shot scripts)
For unattended use, it's better to create a service account
Trying kubectl proxy
Let's start kubectl proxy and then do a simple request with curl!
Start kubectl proxy in the background:
kubectl proxy &
Access the API's default route:
curl localhost:8001
Terminate the proxy:
kill %1
The output is a list of available API routes.
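While the proxy is running, any API route works the same way; for instance, this sketch lists the pods of the default namespace:
kubectl proxy &
curl localhost:8001/api/v1/namespaces/default/pods
kill %1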
kubectl proxy is intended for local use
By default, the proxy listens on port 8001
(But this can be changed, or we can tell kubectl proxy
to pick a port)
By default, the proxy binds to 127.0.0.1
(Making it unreachable from other machines, for security reasons)
By default, the proxy only accepts connections from:
^localhost$,^127\.0\.0\.1$,^\[::1\]$
This is great when running kubectl proxy
locally
Not-so-great when you want to connect to the proxy from a remote machine
kubectl proxy on a remote machine
If we wanted to connect to the proxy from another machine, we would need to:
bind to INADDR_ANY
instead of 127.0.0.1
accept connections from any address
This is achieved with:
kubectl proxy --port=8888 --address=0.0.0.0 --accept-hosts=.*
Do not do this on a real cluster: it opens full unauthenticated access!
Running kubectl proxy
openly is a huge security risk
It is slightly better to run the proxy where you need it
(and copy credentials, e.g. ~/.kube/config
, to that place)
It is even better to use a limited account with reduced permissions
kubectl proxy
also gives access to all internal services
Specifically, services are exposed as such:
/api/v1/namespaces/<namespace>/services/<service>/proxy
We can use kubectl proxy
to access an internal service in a pinch
(or, for non HTTP services, kubectl port-forward
)
This is not very useful when running kubectl
directly on the cluster
(since we could connect to the services directly anyway)
But it is very powerful as soon as you run kubectl
from a remote machine
Controlling the cluster remotely
(automatically generated title slide)
All the operations that we do with kubectl
can be done remotely
In this section, we are going to use kubectl
from our local machine
The exercises in this chapter should be done on your local machine.
kubectl
is officially available on Linux, macOS, Windows
(and unofficially anywhere we can build and run Go binaries)
You may skip these exercises if you are following along from:
a tablet or phone
a web-based terminal
an environment where you can't install and run new binaries
Installing kubectl
If you already have kubectl on your local machine, you can skip this.
Note: if you are following along with a different platform (e.g. Linux on an architecture different from amd64, or with a phone or tablet), installing kubectl might be more complicated (or even impossible), so feel free to skip this section.
Testing kubectl
Check that kubectl works correctly
(before even trying to connect to a remote cluster!)
Ask kubectl to show its version number:
kubectl version --client
The output should look like this:
Client Version: version.Info{Major:"1", Minor:"14", GitVersion:"v1.14.0",GitCommit:"641856db18352033a0d96dbc99153fa3b27298e5", GitTreeState:"clean",BuildDate:"2019-03-25T15:53:57Z", GoVersion:"go1.12.1", Compiler:"gc",Platform:"linux/amd64"}
~/.kube/config
If you already have a ~/.kube/config
file, rename it
(we are going to overwrite it in the following slides!)
If you never used kubectl
on your machine before: nothing to do!
Make a copy of ~/.kube/config
; if you are using macOS or Linux, you can do:
cp ~/.kube/config ~/.kube/config.before.training
If you are using Windows, you will need to adapt this command
node1
The ~/.kube/config
file that is on node1
contains all the credentials we need
Let's copy it over!
Copy the file from node1
; if you are using macOS or Linux, you can do:
scp USER@X.X.X.X:.kube/config ~/.kube/config
# Make sure to replace X.X.X.X with the IP address of node1,
# and USER with the user name used to log into node1!
If you are using Windows, adapt these instructions to your SSH client
There is a good chance that we need to update the server address
To know if it is necessary, run kubectl config view
Look for the server:
address:
if it matches the public IP address of node1
, you're good!
if it is anything else (especially a private IP address), update it!
To update the server address, run:
kubectl config set-cluster kubernetes --server=https://X.X.X.X:6443
# Make sure to replace X.X.X.X with the IP address of node1!
Generally, the Kubernetes API uses a certificate that is valid for:
kubernetes
kubernetes.default
kubernetes.default.svc
kubernetes.default.svc.cluster.local
the ClusterIP address of the kubernetes service
the hostname and IP address of the node hosting the control plane (e.g. node1)
On most clouds, the IP address of the node is an internal IP address
... And we are going to connect over the external IP address
... And that external IP address was not used when creating the certificate!
We need to tell kubectl
to skip TLS verification
(only do this with testing clusters, never in production!)
The following command will do the trick:
kubectl config set-cluster kubernetes --insecure-skip-tls-verify
Check the versions of the local client and remote server:
kubectl version
View the nodes of the cluster:
kubectl get nodes
We can now utilize the cluster exactly as we did before, except that it's remote.
Accessing internal services
(automatically generated title slide)
When we are logged in on a cluster node, we can access internal services
(by virtue of the Kubernetes network model: all nodes can reach all pods and services)
When we are accessing a remote cluster, things are different
(generally, our local machine won't have access to the cluster's internal subnet)
How can we temporarily access a service without exposing it to everyone?
kubectl proxy
: gives us access to the API, which includes a proxy for HTTP resources
kubectl port-forward
: allows forwarding of TCP ports to arbitrary pods, services, ...
The exercises in this section assume that we have set up kubectl
on our
local machine in order to access a remote cluster.
We will therefore show how to access services and pods of the remote cluster, from our local machine.
You can also run these exercises directly on the cluster (if you haven't
installed and set up kubectl
locally).
Running commands locally will be less useful
(since you could access services and pods directly),
but keep in mind that these commands will work anywhere as long as you have
installed and set up kubectl
to communicate with your cluster.
kubectl proxy in theory
Running kubectl proxy gives us access to the entire Kubernetes API
The API includes routes to proxy HTTP traffic
These routes look like the following:
/api/v1/namespaces/<namespace>/services/<service>/proxy
We just add the URI to the end of the request, for instance:
/api/v1/namespaces/<namespace>/services/<service>/proxy/index.html
We can access services
and pods
this way
kubectl proxy in practice
Let's access the webui service through kubectl proxy
Run an API proxy in the background:
kubectl proxy &
Access the webui
service:
curl localhost:8001/api/v1/namespaces/default/services/webui/proxy/index.html
Terminate the proxy:
kill %1
kubectl port-forward in theory
What if we want to access a TCP service?
We can use kubectl port-forward
instead
It will create a TCP relay to forward connections to a specific port
(of a pod, service, deployment...)
The syntax is:
kubectl port-forward service/name_of_service local_port:remote_port
If only one port number is specified, it is used for both local and remote ports
kubectl port-forward in practice
Forward connections from local port 10000 to remote port 6379:
kubectl port-forward svc/redis 10000:6379 &
Connect to the Redis server:
telnet localhost 10000
Issue a few commands, e.g. INFO server
then QUIT
kill %1
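If you happen to have redis-cli installed locally, you can use it instead of telnet (a sketch):
kubectl port-forward svc/redis 10000:6379 &
redis-cli -p 10000 INFO server
kill %1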
The Kubernetes dashboard
(automatically generated title slide)
Kubernetes resources can also be viewed with a web dashboard
That dashboard is usually exposed over HTTPS
(this requires obtaining a proper TLS certificate)
Dashboard users need to authenticate
We are going to take a dangerous shortcut
We could (and should) use Let's Encrypt ...
... but we don't want to deal with TLS certificates
We could (and should) learn how authentication and authorization work ...
... but we will use a guest account with admin access instead
Yes, this will open our cluster to all kinds of shenanigans. Don't do this at home.
We are going to deploy that dashboard with one single command
This command will create all the necessary resources
(the dashboard itself, the HTTP wrapper, the admin/guest account)
All these resources are defined in a YAML file
All we have to do is load that YAML file with kubectl apply -f
kubectl apply -f ~/container.training/k8s/insecure-dashboard.yaml
kubectl get svc dashboard
You'll want the 3xxxx
port.
The dashboard will then ask you which authentication you want to use.
We have three authentication options at this point:
token (associated with a role that has appropriate permissions)
kubeconfig (e.g. using the ~/.kube/config
file from node1
)
"skip" (use the dashboard "service account")
Let's use "skip": we're logged in!
By the way, we just added a backdoor to our Kubernetes cluster!
The steps that we just showed you are for educational purposes only!
If you do that on your production cluster, people can and will abuse it
For an in-depth discussion about securing the dashboard,
check this excellent post on Heptio's blog
Security implications of kubectl apply
(automatically generated title slide)
kubectl apply
When we do kubectl apply -f <URL>
, we create arbitrary resources
Resources can be evil; imagine a deployment
that ...
starts bitcoin miners on the whole cluster
hides in a non-default namespace
bind-mounts our nodes' filesystem
inserts SSH keys in the root account (on the node)
encrypts our data and ransoms it
☠️☠️☠️
kubectl apply
is the new curl | sh
curl | sh
is convenient
It's safe if you use HTTPS URLs from trusted sources
kubectl apply -f
is convenient
It's safe if you use HTTPS URLs from trusted sources
Example: the official setup instructions for most pod networks
It introduces new failure modes
(for instance, if you try to apply YAML from a link that's no longer valid)
Scaling our demo app
(automatically generated title slide)
Our ultimate goal is to get more DockerCoins
(i.e. increase the number of loops per second shown on the web UI)
Let's look at the architecture again:
The loop is done in the worker; perhaps we could try adding more workers?
Scaling the worker Deployment
In separate terminals, watch pods and deployments:
kubectl get pods -w
kubectl get deployments -w
Now, scale up the number of worker replicas:
kubectl scale deployment worker --replicas=2
After a few seconds, the graph in the web UI should show up.
Scale the worker Deployment further:
kubectl scale deployment worker --replicas=3
The graph in the web UI should go up again.
(This is looking great! We're gonna be RICH!)
Scale the worker Deployment to a bigger number:
kubectl scale deployment worker --replicas=10
The graph will peak at 10 hashes/second.
(We can add as many workers as we want: we will never go past 10 hashes/second.)
It may look like it, because the web UI shows instant speed
The instant speed can briefly exceed 10 hashes/second
The average speed cannot
The instant speed can be biased because of how it's computed
The instant speed is computed client-side by the web UI
The web UI checks the hash counter once per second
(and does a classic (h2-h1)/(t2-t1) speed computation)
The counter is updated once per second by the workers
These timings are not exact
(e.g. the web UI check interval is client-side JavaScript)
Sometimes, between two web UI counter measurements,
the workers are able to update the counter twice
During that cycle, the instant speed will appear to be much bigger
(but it will be compensated by lower instant speed before and after)
If this was high-quality, production code, we would have instrumentation
(Datadog, Honeycomb, New Relic, statsd, Sumologic, ...)
It's not!
Perhaps we could benchmark our web services?
(with tools like ab
, or even simpler, httping
)
We want to check hasher
and rng
We are going to use httping
It's just like ping
, but using HTTP GET
requests
(it measures how long it takes to perform one GET
request)
It's used like this:
httping [-c count] http://host:port/path
Or even simpler:
httping ip.ad.dr.ess
We will use httping
on the ClusterIP addresses of our services
We can simply check the output of kubectl get services
Or do it programmatically, as in the example below
HASHER=$(kubectl get svc hasher -o go-template={{.spec.clusterIP}})
RNG=$(kubectl get svc rng -o go-template={{.spec.clusterIP}})
Now we can access the IP addresses of our services through $HASHER
and $RNG
.
Check the response times of hasher and rng:
httping -c 3 $HASHER
httping -c 3 $RNG
hasher
is fine (it should take a few milliseconds to reply)
rng
is not (it should take about 700 milliseconds if there are 10 workers)
Something is wrong with rng
, but ... what?
The bottleneck seems to be rng.
What if, by any chance, we didn't have enough entropy and couldn't generate enough random numbers?
We need to scale the rng service across multiple machines!
Note: this is a fiction! We have plenty of entropy. But we need a pretext to scale out.
(In reality, the rng code uses /dev/urandom, which never runs out of entropy...
...and which is just as good as /dev/random.)
Daemon sets
(automatically generated title slide)
We want to scale rng
in a way that is different from how we scaled worker
We want one (and exactly one) instance of rng
per node
What if we just scale up deploy/rng
to the number of nodes?
nothing guarantees that the rng
containers will be distributed evenly
if we add nodes later, they will not automatically run a copy of rng
if we remove (or reboot) a node, one rng
container will restart elsewhere
Instead of a deployment
, we will use a daemonset
Daemon sets are great for cluster-wide, per-node processes:
kube-proxy
weave
(our overlay network)
monitoring agents
hardware management tools (e.g. SCSI/FC HBA agents)
etc.
They can also be restricted to run only on some nodes
Unfortunately, as of Kubernetes 1.14, the CLI cannot create daemon sets
More precisely: it doesn't have a subcommand to create a daemon set
But any kind of resource can always be created by providing a YAML description:
kubectl apply -f foo.yaml
How do we create the YAML file for our daemon set?
option 1: read the docs
option 2: vi
our way out of it
Dump the rng resource in YAML:
kubectl get deploy/rng -o yaml > rng.yml
Edit rng.yml
What if we just changed the kind
field?
(It can't be that easy, right?)
Change kind: Deployment to kind: DaemonSet
Save, quit
Try to create our new resource:
kubectl apply -f rng.yml
We all knew this couldn't be that easy, right!
error validating data:
[ValidationError(DaemonSet.spec): unknown field "replicas" in io.k8s.api.extensions.v1beta1.DaemonSetSpec, ...]
Obviously, it doesn't make sense to specify a number of replicas for a daemon set
Workaround: fix the YAML
remove the replicas field
remove the strategy field (which defines the rollout mechanism for a deployment)
remove the progressDeadlineSeconds field (also used by the rollout mechanism)
remove the status: {} line at the end
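After those edits, the remaining YAML should look roughly like this (a sketch; your dump will contain more fields, and possibly a different apiVersion):
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: rng
  labels:
    app: rng
spec:
  selector:
    matchLabels:
      app: rng
  template:
    metadata:
      labels:
        app: rng
    spec:
      containers:
      - name: rng
        image: dockercoins/rng:v0.1
        ports:
        - containerPort: 80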
Or, we could also ...
Use the --force, Luke
We could also tell Kubernetes to ignore these errors and try anyway
The --force
flag's actual name is --validate=false
kubectl apply -f rng.yml --validate=false
🎩✨🐇
Wait ... Now, can it be that easy?
Did we transform our deployment into a daemonset?
kubectl get all
We have two resources called rng
:
the deployment that was existing before
the daemon set that we just created
We also have one too many pods.
(The pod corresponding to the deployment still exists.)
deploy/rng
and ds/rng
You can have different resource types with the same name
(i.e. a deployment and a daemon set both named rng
)
We still have the old rng
deployment
NAME                  DESIRED   CURRENT   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/rng   1         1         1            1           18m
But now we have the new rng
daemon set as well
NAME                 DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
daemonset.apps/rng   2         2         2       2            2           <none>          9s
If we check with kubectl get pods
, we see:
one pod for the deployment (named rng-xxxxxxxxxx-yyyyy
)
one pod per node for the daemon set (named rng-zzzzz
)
NAME                   READY   STATUS    RESTARTS   AGE
rng-54f57d4d49-7pt82   1/1     Running   0          11m
rng-b85tm              1/1     Running   0          25s
rng-hfbrr              1/1     Running   0          25s
[...]
The daemon set created one pod per node, except on the master node.
The master node has taints preventing pods from running there.
(To schedule a pod on this node anyway, the pod will require appropriate tolerations.)
(Off by one? We don't run these pods on the node hosting the control plane.)
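To see these taints for yourself, you can run something like this (node names may differ on your cluster):
kubectl describe node node1 | grep -i taint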
Look at the web UI
The graph should now go above 10 hashes per second!
It looks like the newly created pods are serving traffic correctly
How and why did this happen?
(We didn't do anything special to add them to the rng
service load balancer!)
Labels and selectors
(automatically generated title slide)
The rng
service is load balancing requests to a set of pods
That set of pods is defined by the selector of the rng
service
Check the rng service definition:
kubectl describe service rng
The selector is app=rng
It means "all the pods having the label app=rng
"
(They can have additional labels as well, that's OK!)
We can use selectors with many kubectl
commands
For instance, with kubectl get
, kubectl logs
, kubectl delete
... and more
List all the pods with the label app=rng:
kubectl get pods -l app=rng
kubectl get pods --selector app=rng
But ... why do these pods (in particular, the new ones) have this app=rng
label?
When we create a deployment with kubectl create deployment rng
,
this deployment gets the label app=rng
The replica sets created by this deployment also get the label app=rng
The pods created by these replica sets also get the label app=rng
When we created the daemon set from the deployment, we re-used the same spec
Therefore, the pods created by the daemon set get the same labels
Note: when we use kubectl run stuff
, the label is run=stuff
instead.
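You can verify these labels directly, for instance:
kubectl get pods --show-labels -l app=rng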
We would like to remove a pod from the load balancer
What would happen if we removed that pod, with kubectl delete pod ...
?
It would be re-created immediately (by the replica set or the daemon set)
What would happen if we removed the app=rng
label from that pod?
It would also be re-created immediately
Why?!?
The "mission" of a replica set is:
"Make sure that there is the right number of pods matching this spec!"
The "mission" of a daemon set is:
"Make sure that there is a pod matching this spec on each node!"
The "mission" of a replica set is:
"Make sure that there is the right number of pods matching this spec!"
The "mission" of a daemon set is:
"Make sure that there is a pod matching this spec on each node!"
In fact, replica sets and daemon sets do not check pod specifications
They merely have a selector, and they look for pods matching that selector
Yes, we can fool them by manually creating pods with the "right" labels
Bottom line: if we remove our app=rng
label ...
... The pod "disappears" for its parent, which re-creates another pod to replace it
Since both the rng
daemon set and the rng
replica set use app=rng
...
... Why don't they "find" each other's pods?
Replica sets have a more specific selector, visible with kubectl describe
(It looks like app=rng,pod-template-hash=abcd1234
)
Daemon sets also have a more specific selector, but it's invisible
(It looks like app=rng,controller-revision-hash=abcd1234
)
As a result, each controller only "sees" the pods it manages
Currently, the rng
service is defined by the app=rng
selector
The only way to remove a pod is to remove or change the app
label
... But that will cause another pod to be created instead!
What's the solution?
We need to change the selector of the rng
service!
Let's add another label to that selector (e.g. enabled=yes
)
If a selector specifies multiple labels, they are understood as a logical AND
(In other words: the pods must match all the labels)
Kubernetes has support for advanced, set-based selectors
(But these cannot be used with services, at least not yet!)
Add the label enabled=yes
to all our rng
pods
Update the selector for the rng
service to also include enabled=yes
Toggle traffic to a pod by manually adding/removing the enabled
label
Profit!
Note: if we swap steps 1 and 2, it will cause a short service disruption, because there will be a period of time during which the service selector won't match any pod. During that time, requests to the service will time out. By doing things in the order above, we guarantee that there won't be any interruption.
We want to add the label enabled=yes
to all pods that have app=rng
We could edit each pod one by one with kubectl edit
...
... Or we could use kubectl label
to label them all
kubectl label
can use selectors itself
Add enabled=yes to all pods that have app=rng:
kubectl label pods -l app=rng enabled=yes
We need to edit the service specification
Reminder: in the service definition, we will see app: rng
in two places
the label of the service itself (we don't need to touch that one)
the selector of the service (that's the one we want to change)
Add enabled: yes to its selector:
kubectl edit service rng
... And then we get the weirdest error ever. Why?
YAML parsers try to help us:
xyz
is the string "xyz"
42
is the integer 42
yes
is the boolean value true
If we want the string "42"
or the string "yes"
, we have to quote them
So we have to use enabled: "yes"
For a good laugh: if we had used "ja", "oui", "si" ... as the value, it would have worked!
enabled: "yes"
to its selector:kubectl edit service rng
This time it should work!
If we did everything correctly, the web UI shouldn't show any change.
We want to disable the pod that was created by the deployment
All we have to do, is remove the enabled
label from that pod
To identify that pod, we can use its name
... Or rely on the fact that it's the only one with a pod-template-hash
label
Good to know:
kubectl label ... foo=
doesn't remove a label (it sets it to an empty string)
to remove label foo
, use kubectl label ... foo-
to change an existing label, we would need to add --overwrite
In one window, check the logs of that pod:
POD=$(kubectl get pod -l app=rng,pod-template-hash -o name)
kubectl logs --tail 1 --follow $POD
(We should see a steady stream of HTTP logs)
In another window, remove the label from the pod:
kubectl label pod -l app=rng,pod-template-hash enabled-
(The stream of HTTP logs should stop immediately)
There might be a slight change in the web UI (since we removed a bit
of capacity from the rng
service). If we remove more pods,
the effect should be more visible.
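To put that pod back into rotation later, we could simply re-add the label (a sketch):
kubectl label pod -l app=rng,pod-template-hash enabled=yes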
If we scale up our cluster by adding new nodes, the daemon set will create more pods
These pods won't have the enabled=yes
label
If we want these pods to have that label, we need to edit the daemon set spec
We can do that with e.g. kubectl edit daemonset rng
Reminder: a daemon set is a resource that creates more resources!
There is a difference between:
the label(s) of a resource (in the metadata
block in the beginning)
the selector of a resource (in the spec
block)
the label(s) of the resource(s) created by the first resource (in the template
block)
We would need to update the selector and the template
(metadata labels are not mandatory)
The template must match the selector
(i.e. the resource will refuse to create resources that it will not select)
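Schematically, the three places look like this in a daemon set manifest (a simplified sketch):
metadata:
  labels:          # labels of the daemon set itself
    app: rng
spec:
  selector:        # which pods this daemon set manages
    matchLabels:
      app: rng
  template:
    metadata:
      labels:      # labels applied to the pods it creates (must match the selector)
        app: rng
        enabled: "yes"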
When a pod is misbehaving, we can delete it: another one will be recreated
But we can also change its labels
It will be removed from the load balancer (it won't receive traffic anymore)
Another pod will be recreated immediately
But the problematic pod is still here, and we can inspect and debug it
We can even re-add it to the rotation if necessary
(Very useful to troubleshoot intermittent and elusive bugs)
Conversely, we can add pods matching a service's selector
These pods will then receive requests and serve traffic
Examples:
one-shot pod with all debug flags enabled, to collect logs
pods created automatically, but added to rotation in a second step
(by setting their label accordingly)
This gives us building blocks for canary and blue/green deployments
Rolling updates
(automatically generated title slide)
By default (without rolling updates), when a scaled resource is updated:
new pods are created
old pods are terminated
... all at the same time
if something goes wrong, ¯\_(ツ)_/¯
With rolling updates, when a resource is updated, it happens progressively
Two parameters determine the pace of the rollout: maxUnavailable
and maxSurge
They can be specified in absolute number of pods, or percentage of the replicas
count
At any given time ...
there will always be at least replicas
-maxUnavailable
pods available
there will never be more than replicas
+maxSurge
pods in total
there will therefore be up to maxUnavailable
+maxSurge
pods being updated
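For example, with 10 replicas and both parameters at 25%: maxUnavailable rounds down to 2 (so at least 8 pods stay available), while maxSurge rounds up to 3 (so there can be up to 13 pods in total during the rollout).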
We have the possibility of rolling back to the previous version
(if the update fails or is unsatisfactory in any way)
Check the rollout parameters of our deployments with kubectl and jq:
kubectl get deploy -o json | jq ".items[] | {name:.metadata.name} + .spec.strategy.rollingUpdate"
As of Kubernetes 1.8, we can do rolling updates with:
deployments
, daemonsets
, statefulsets
Editing one of these resources will automatically result in a rolling update
Rolling updates can be monitored with the kubectl rollout
subcommand
Updating the worker service
Only run these commands if you have built and pushed DockerCoins to a local registry.
If you are using images from the Docker Hub (dockercoins/worker:v0.1
), skip this.
Go to the stacks
directory (~/container.training/stacks
)
Edit dockercoins/worker/worker.py
; update the first sleep
line to sleep 1 second
Build a new tag and push it to the registry:
#export REGISTRY=localhost:3xxxx
export TAG=v0.2
docker-compose -f dockercoins.yml build
docker-compose -f dockercoins.yml push
Rolling out the new worker service
In separate terminals, watch:
kubectl get pods -w
kubectl get replicasets -w
kubectl get deployments -w
Update worker either with kubectl edit, or by running:
kubectl set image deploy worker worker=$REGISTRY/worker:$TAG
That rollout should be pretty quick. What shows in the web UI?
At first, it looks like nothing is happening (the graph remains at the same level)
According to kubectl get deploy -w
, the deployment
was updated really quickly
But kubectl get pods -w
tells a different story
The old pods
are still here, and they stay in Terminating
state for a while
Eventually, they are terminated; and then the graph decreases significantly
This delay is due to the fact that our worker doesn't handle signals
Kubernetes sends a "polite" shutdown request to the worker, which ignores it
After a grace period, Kubernetes gets impatient and kills the container
(The grace period is 30 seconds, but can be changed if needed)
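That grace period can be tuned per pod; in a deployment, it would look roughly like this (a sketch):
spec:
  template:
    spec:
      terminationGracePeriodSeconds: 5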
Update worker
by specifying a non-existent image:
export TAG=v0.3
kubectl set image deploy worker worker=$REGISTRY/worker:$TAG
Check what's going on:
kubectl rollout status deploy worker
Our rollout is stuck. However, the app is not dead.
(After a minute, it will stabilize to be 20-25% slower.)
Why is our app a bit slower?
Because MaxUnavailable=25%
... So the rollout terminated 2 replicas out of 10 available
Okay, but why do we see 5 new replicas being rolled out?
Because MaxSurge=25%
... So in addition to replacing 2 replicas, the rollout is also starting 3 more
It rounded down the number of MaxUnavailable pods conservatively,
but the total number of pods being rolled out is allowed to be 25+25=50%
We start with 10 pods running for the worker
deployment
Current settings: MaxUnavailable=25% and MaxSurge=25%
When we start the rollout, 2 old replicas are terminated (maxUnavailable=25% of 10, rounded down) and 5 new pods are created (2 replacements plus a surge of 3, since maxSurge=25% rounds up)
Now we have 8 old replicas up and running, and 5 new pods being deployed
Our rollout is stuck at this point!
If you didn't deploy the Kubernetes dashboard earlier, just skip this slide.
kubectl -n kube-system get svc socat
Note the 3xxxx
port.
We could push some v0.3
image
(the pod retry logic will eventually catch it and the rollout will proceed)
Or we could invoke a manual rollback
kubectl rollout undo deploy worker
kubectl rollout status deploy worker
We want to:
revert to v0.1
be conservative on availability (always have the desired number of workers available)
go slow on rollout speed (only start one pod at a time)
give some time to our workers to "warm up" before starting more
The corresponding changes can be expressed in the following YAML snippet:
spec:
  template:
    spec:
      containers:
      - name: worker
        image: $REGISTRY/worker:v0.1
  strategy:
    rollingUpdate:
      maxUnavailable: 0
      maxSurge: 1
  minReadySeconds: 10
We could use kubectl edit deployment worker
But we could also use kubectl patch
with the exact YAML shown before
kubectl patch deployment worker -p "
spec:
  template:
    spec:
      containers:
      - name: worker
        image: $REGISTRY/worker:v0.1
  strategy:
    rollingUpdate:
      maxUnavailable: 0
      maxSurge: 1
  minReadySeconds: 10
"
kubectl rollout status deployment worker
kubectl get deploy -o json worker | jq "{name:.metadata.name} + .spec.strategy.rollingUpdate"
Namespaces
(automatically generated title slide)
We would like to deploy another copy of DockerCoins on our cluster
We could rename all our deployments and services:
hasher → hasher2, redis → redis2, rng → rng2, etc.
That would require updating the code
There has to be a better way!
As hinted by the title of this section, we will use namespaces
We cannot have two resources with the same name
(or can we...?)
We cannot have two resources of the same kind with the same name
(but it's OK to have an rng
service, an rng
deployment, and an rng
daemon set)
We cannot have two resources of the same kind with the same name in the same namespace
(but it's OK to have e.g. two rng
services in different namespaces)
Except for resources that exist at the cluster scope
(these do not belong to a namespace)
For namespaced resources:
the tuple (kind, name, namespace) needs to be unique
For resources at the cluster scope:
the tuple (kind, name) needs to be unique
To see which resource types are namespaced (and which are not):
kubectl api-resources
If we deploy a cluster with kubeadm
, we have three or four namespaces:
default
(for our applications)
kube-system
(for the control plane)
kube-public
(contains one ConfigMap for cluster discovery)
kube-node-lease
(in Kubernetes 1.14 and later; contains Lease objects)
If we deploy differently, we may have different namespaces
We can use kubectl create namespace
:
kubectl create namespace blue
Or we can construct a very minimal YAML snippet:
kubectl apply -f- <<EOF
apiVersion: v1
kind: Namespace
metadata:
  name: blue
EOF
We can pass a -n
or --namespace
flag to most kubectl
commands:
kubectl -n blue get svc
We can also change our current context
A context is a (user, cluster, namespace) tuple
We can manipulate contexts with the kubectl config
command
kubectl config get-contexts
The current context (the only one!) is tagged with a *
What are NAME, CLUSTER, AUTHINFO, and NAMESPACE?
NAME is an arbitrary string to identify the context
CLUSTER is a reference to a cluster
(i.e. API endpoint URL, and optional certificate)
AUTHINFO is a reference to the authentication information to use
(i.e. a TLS client certificate, token, or otherwise)
NAMESPACE is the namespace
(empty string = default
)
We want to use a different namespace
Solution 1: update the current context
This is appropriate if we need to change just one thing (e.g. namespace or authentication).
Solution 2: create a new context and switch to it
This is appropriate if we need to change multiple things and switch back and forth.
Let's go with solution 1!
This is done through kubectl config set-context
We can update a context by passing its name, or the current context with --current
Update the current context to use the blue
namespace:
kubectl config set-context --current --namespace=blue
Check the result:
kubectl config get-contexts
Verify that we are in the new namespace (it should be empty so far):
kubectl get all
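If we preferred solution 2 (a dedicated context), it could look like this (a sketch; the cluster and user names below match a typical kubeadm setup and may differ on yours):
kubectl config set-context blue --cluster=kubernetes --user=kubernetes-admin --namespace=blue
kubectl config use-context blue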
The jpetazzo/kubercoins repository contains everything we need!
Clone the kubercoins repository:
cd ~
git clone https://github.com/jpetazzo/kubercoins
Create all the DockerCoins resources:
kubectl create -f kubercoins
If the argument behind -f
is a directory, all the files in that directory are processed.
The subdirectories are not processed, unless we also add the -R
flag.
Retrieve the port number allocated to the webui
service:
kubectl get svc webui
Point our browser to http://X.X.X.X:3xxxx
If the graph shows up but stays at zero, give it a minute or two!
Namespaces do not provide isolation
A pod in the green
namespace can communicate with a pod in the blue
namespace
A pod in the default
namespace can communicate with a pod in the kube-system
namespace
CoreDNS uses a different subdomain for each namespace
Example: from any pod in the cluster, you can connect to the Kubernetes API with:
https://kubernetes.default.svc.cluster.local:443/
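More generally, a service named redis in the blue namespace can be reached from any pod as redis.blue, or fully qualified as redis.blue.svc.cluster.local.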
Actual isolation is implemented with network policies
Network policies are resources (like deployments, services, namespaces...)
Network policies specify which flows are allowed:
between pods
from pods to the outside world
and vice-versa
Switch back out of the blue namespace:
kubectl config set-context --current --namespace=
Note: we could have used --namespace=default
for the same result.
We can also use a little helper tool called kubens
:
# Switch to namespace foo
kubens foo
# Switch back to the previous namespace
kubens -
On our clusters, kubens
is called kns
instead
(so that it's even fewer keystrokes to switch namespaces)
kubens
and kubectx
With kubens
, we can switch quickly between namespaces
With kubectx
, we can switch quickly between contexts
Both tools are simple shell scripts available from https://github.com/ahmetb/kubectx
On our clusters, they are installed as kns
and kctx
(for brevity and to avoid completion clashes between kubectx
and kubectl
)
kube-ps1
It's easy to lose track of our current cluster / context / namespace
kube-ps1
makes it easy to track these, by showing them in our shell prompt
It's a simple shell script available from https://github.com/jonmosco/kube-ps1
On our clusters, kube-ps1
is installed and included in PS1
:
[123.45.67.89] (kubernetes-admin@kubernetes:default) docker@node1 ~
(The highlighted part is context:namespace
, managed by kube-ps1
)
Highly recommended if you work across multiple contexts or namespaces!
Kustomize
(automatically generated title slide)
Kustomize lets us transform YAML files representing Kubernetes resources
The original YAML files are valid resource files
(e.g. they can be loaded with kubectl apply -f
)
They are left untouched by Kustomize
Kustomize lets us define overlays that extend or change the resource files
Helm charts use placeholders {{ like.this }}
Kustomize "bases" are standard Kubernetes YAML
It is possible to use an existing set of YAML as a Kustomize base
As a result, writing a Helm chart is more work ...
... But Helm charts are also more powerful; e.g. they can:
use flags to conditionally include resources or blocks
check if a given Kubernetes API group is supported
Kustomize needs a kustomization.yaml
file
That file can be a base or a variant
If it's a base: it lists the YAML resource files to use
If it's a variant (or overlay):
it refers to (at least) one base
and some patches
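For illustration, a variant's kustomization.yaml could look like this (a sketch; the file names and paths are made up):
# kustomization.yaml (overlay)
bases:
- ../../base
patchesStrategicMerge:
- worker-replicas.yaml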
We are going to use Replicated Ship to experiment with Kustomize
The Replicated Ship CLI has been installed on our clusters
Replicated Ship has multiple workflows; here is what we will do:
initialize a Kustomize overlay from a remote GitHub repository
customize some values using the web UI provided by Ship
look at the resulting files and apply them to the cluster
We need to run ship init
in a new directory
ship init
requires a URL to a remote repository containing Kubernetes YAML
It will clone that repository and start a web UI
Later, it can watch that repository and/or update from it
We will use the jpetazzo/kubercoins repository
(it contains all the DockerCoins resources as YAML files)
ship init
Change to a new directory:
mkdir ~/kustomcoins
cd ~/kustomcoins
Run ship init
with the kustomcoins repository:
ship init https://github.com/jpetazzo/kubercoins
ship init
tells us to connect on localhost:8800
We need to replace localhost
with the address of our node
(since we run on a remote machine)
Follow the steps in the web UI, and change one parameter
(e.g. set the number of replicas in the worker Deployment)
Complete the web workflow, and go back to the CLI
Look at the content of our directory
base
contains the kubercoins repository + a kustomization.yaml
file
overlays/ship
contains the Kustomize overlay referencing the base + our patch(es)
rendered.yaml
is a YAML bundle containing the patched application
.ship
contains a state file used by Ship
We can kubectl apply -f rendered.yaml
(on any version of Kubernetes)
Starting with Kubernetes 1.14, we can apply the overlay directly with:
kubectl apply -k overlays/ship
But let's not do that for now!
We will create a new copy of DockerCoins in another namespace
Create a new namespace:
kubectl create namespace kustomcoins
Deploy DockerCoins:
kubectl apply -f rendered.yaml --namespace=kustomcoins
Or, with Kubernetes 1.14, you can also do this:
kubectl apply -k overlays/ship --namespace=kustomcoins
Retrieve the NodePort number of the web UI:
kubectl get service webui --namespace=kustomcoins
Open it in a web browser
Look at the worker logs:
kubectl logs deploy/worker --tail=10 --follow --namespace=kustomcoins
Note: it might take a minute or two for the worker to start.
Healthchecks
(automatically generated title slide)
Kubernetes provides two kinds of healthchecks: liveness and readiness
Healthchecks are probes that apply to containers (not to pods)
Each container can have two (optional) probes:
liveness = is this container dead or alive?
readiness = is this container ready to serve traffic?
Different probes are available (HTTP, TCP, program execution)
Let's see the difference and how to use them!
Indicates if the container is dead or alive
A dead container cannot come back to life
If the liveness probe fails, the container is killed
(to make really sure that it's really dead; no zombies or undeads!)
What happens next depends on the pod's restartPolicy
:
Never
: the container is not restarted
OnFailure
or Always
: the container is restarted
To indicate failures that can't be recovered
deadlocks (causing all requests to time out)
internal corruption (causing all requests to error)
If the liveness probe fails N consecutive times, the container is killed
N is the failureThreshold
(3 by default)
Indicates if the container is ready to serve traffic
If a container becomes "unready" (let's say busy!) it might be ready again soon
If the readiness probe fails:
the container is not killed
if the pod is a member of a service, it is temporarily removed
it is re-added as soon as the readiness probe passes again
To indicate temporary failures
the application can only service N parallel connections
the runtime is busy doing garbage collection or initial data load
The container is marked as "not ready" after failureThreshold
failed attempts
(3 by default)
It is marked again as "ready" after successThreshold
successful attempts
(1 by default)
HTTP request
specify URL of the request (and optional headers)
any status code between 200 and 399 indicates success
TCP connection
arbitrary exec
a command is executed in the container
exit status of zero indicates success
Rolling updates proceed when containers are actually ready
(as opposed to merely started)
Containers in a broken state get killed and restarted
(instead of serving errors or timeouts)
Overloaded backends get removed from load balancer rotation
(thus improving response times across the board)
Here is a pod template for the rng
web service of the DockerCoins app:
apiVersion: v1
kind: Pod
metadata:
  name: rng-with-liveness
spec:
  containers:
  - name: rng
    image: dockercoins/rng:v0.1
    livenessProbe:
      httpGet:
        path: /
        port: 80
      initialDelaySeconds: 10
      periodSeconds: 1
If the backend serves an error, or takes longer than 1s, 3 times in a row, it gets killed.
Here is a pod template for a Redis server:
apiVersion: v1
kind: Pod
metadata:
  name: redis-with-liveness
spec:
  containers:
  - name: redis
    image: redis
    livenessProbe:
      exec:
        command: ["redis-cli", "ping"]
If the Redis process becomes unresponsive, it will be killed.
Probes are executed at intervals of periodSeconds
(default: 10)
The timeout for a probe is set with timeoutSeconds
(default: 1)
A probe is considered successful after successThreshold
successes (default: 1)
A probe is considered failing after failureThreshold
failures (default: 3)
If a probe is not defined, it's as if there was an "always successful" probe
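Putting it together, a readiness probe with explicit timing parameters might look like this (a sketch):
readinessProbe:
  httpGet:
    path: /
    port: 80
  periodSeconds: 5
  timeoutSeconds: 2
  successThreshold: 1
  failureThreshold: 3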
Accessing logs from the CLI
(automatically generated title slide)
The kubectl logs
command has limitations:
it cannot stream logs from multiple pods at a time
when showing logs from multiple pods, it mixes them all together
We are going to see how to do it better
We could (if we were so inclined) write a program or script that would:
take a selector as an argument
enumerate all pods matching that selector (with kubectl get -l ...
)
fork one kubectl logs --follow ...
command per container
annotate the logs (the output of each kubectl logs ...
process) with their origin
preserve ordering by using kubectl logs --timestamps ...
and merge the output
We could do it, but thankfully, others did it for us already!
Stern is an open source project by Wercker.
From the README:
Stern allows you to tail multiple pods on Kubernetes and multiple containers within the pod. Each result is color coded for quicker debugging.
The query is a regular expression so the pod name can easily be filtered and you don't need to specify the exact id (for instance omitting the deployment id). If a pod is deleted it gets removed from tail and if a new pod is added it automatically gets tailed.
Exactly what we need!
Run stern
(without arguments) to check if it's installed:
$ stern
Tail multiple pods and containers from Kubernetes
Usage:
  stern pod-query [flags]
If it is not installed, the easiest method is to download a binary release
The following commands will install Stern on a Linux Intel 64 bit machine:
sudo curl -L -o /usr/local/bin/stern \
  https://github.com/wercker/stern/releases/download/1.10.0/stern_linux_amd64
sudo chmod +x /usr/local/bin/stern
There are two ways to specify the pods whose logs we want to see:
-l
followed by a selector expression (like with many kubectl
commands)
with a "pod query," i.e. a regex used to match pod names
These two ways can be combined if necessary
View the logs of all the pods whose name contains rng:
stern rng
The --tail N
flag shows the last N
lines for each container
(Instead of showing the logs since the creation of the container)
The -t
/ --timestamps
flag shows timestamps
The --all-namespaces
flag is self-explanatory
View the logs of the weave system containers:
stern --tail 1 --timestamps --all-namespaces weave
When specifying a selector, we can omit the value for a label
This will match all objects having that label (regardless of the value)
Everything created with kubectl run
has a label run
We can use that property to view the logs of all the pods created with kubectl run
Similarly, everything created with kubectl create deployment
has a label app
View the logs for all the pods created with kubectl create deployment:
stern -l app
Centralized logging
(automatically generated title slide)
Using kubectl
or stern
is simple; but it has drawbacks:
when a node goes down, its logs are not available anymore
we can only dump or stream logs; we want to search/index/count...
We want to send all our logs to a single place
We want to parse them (e.g. for HTTP logs) and index them
We want a nice web dashboard
We are going to deploy an EFK stack
EFK is three components:
ElasticSearch (to store and index log entries)
Fluentd (to get container logs, process them, and put them in ElasticSearch)
Kibana (to view/search log entries with a nice UI)
The only component that we need to access from outside the cluster will be Kibana
Deploy the EFK stack:
kubectl apply -f ~/container.training/k8s/efk.yaml
If we look at the YAML file, we see that it creates a daemon set, two deployments, two services, and a few roles and role bindings (to give fluentd the required permissions).
A container writes a line on stdout or stderr
Both are typically piped to the container engine (Docker or otherwise)
The container engine reads the line, and sends it to a logging driver
The timestamp and stream (stdout or stderr) are added to the log line
With the default configuration for Kubernetes, the line is written to a JSON file
(/var/log/containers/pod-name_namespace_container-id.log
)
That file is read when we invoke kubectl logs
; we can access it directly too
Fluentd runs on each node (thanks to a daemon set)
It bind-mounts /var/log/containers
from the host (to access these files)
It continuously scans this directory for new files; reads them; parses them
Each log line becomes a JSON object, fully annotated with extra information:
container id, pod name, Kubernetes labels...
These JSON objects are stored in ElasticSearch
ElasticSearch indexes the JSON objects
We can access the logs through Kibana (and perform searches, counts, etc.)
Kibana offers a web interface that is relatively straightforward
Let's check it out!
Check which NodePort
was allocated to Kibana:
kubectl get svc kibana
With our web browser, connect to Kibana
Note: this is not a Kibana workshop! So this section is deliberately very terse.
The first time you connect to Kibana, you must "configure an index pattern"
Just use the one that is suggested, @timestamp*
Then click "Discover" (in the top-left corner)
You should see container logs
Advice: in the left column, select a few fields to display, e.g.:
kubernetes.host
, kubernetes.pod_name
, stream
, log
*If you don't see @timestamp
, it's probably because no logs exist yet.
Wait a bit, and double-check the logging pipeline!
We are using EFK because it is relatively straightforward to deploy on Kubernetes, without having to redeploy or reconfigure our cluster. But it doesn't mean that it will always be the best option for your use-case. If you are running Kubernetes in the cloud, you might consider using the cloud provider's logging infrastructure (if it can be integrated with Kubernetes).
The deployment method that we will use here has been simplified: there is only one ElasticSearch node. In a real deployment, you might use a cluster, both for performance and reliability reasons. But this is outside of the scope of this chapter.
The YAML file that we used creates all the resources in the
default
namespace, for simplicity. In a real scenario, you will
create the resources in the kube-system
namespace or in a dedicated namespace.
Authentication and authorization
(automatically generated title slide)
And first, a little refresher!
Authentication = verifying the identity of a person
On a UNIX system, we can authenticate with login+password, SSH keys ...
Authorization = listing what they are allowed to do
On a UNIX system, this can include file permissions, sudoer entries ...
Sometimes abbreviated as "authn" and "authz"
In good modular systems, these things are decoupled
(so we can e.g. change a password or SSH key without having to reset access rights)
When the API server receives a request, it tries to authenticate it
(it examines headers, certificates... anything available)
Many authentication methods are available and can be used simultaneously
(we will see them on the next slide)
It's the job of the authentication method to produce the user name and the list of groups
The API server doesn't interpret these; that'll be the job of authorizers
TLS client certificates
(that's what we've been doing with kubectl
so far)
Bearer tokens
(a secret token in the HTTP headers of the request)
HTTP basic authentication
(carrying user and password in an HTTP header)
Authentication proxy
(sitting in front of the API and setting trusted headers)
If any authentication method rejects a request, it's denied
(401 Unauthorized
HTTP code)
If a request is neither rejected nor accepted by anyone, it's anonymous
the user name is system:anonymous
the list of groups is [system:unauthenticated]
By default, the anonymous user can't do anything
(that's what you get if you just curl
the Kubernetes API)
This is enabled in most Kubernetes deployments
The user name is derived from the CN
in the client certificates
The groups are derived from the O
fields in the client certificate
From the point of view of the Kubernetes API, users do not exist
(i.e. they are not stored in etcd or anywhere else)
Users can be created (and added to groups) independently of the API
The Kubernetes API can be set up to use your custom CA to validate client certs
Let's inspect the CN and O fields of our certificate:
kubectl config view \
        --raw \
        -o json \
        | jq -r .users[0].user[\"client-certificate-data\"] \
        | openssl base64 -d -A \
        | openssl x509 -text \
        | grep Subject:
Let's break down that command together! 😅
kubectl config view shows the Kubernetes user configuration
--raw includes certificate information (which shows as REDACTED otherwise)
-o json outputs the information in JSON format
| jq ... extracts the field with the user certificate (in base64)
| openssl base64 -d -A decodes the base64 format (now we have a PEM file)
| openssl x509 -text parses the certificate and outputs it as plain text
| grep Subject: shows us the line that interests us
→ We are user kubernetes-admin, in group system:masters.
(We will see later how and why this gives us the permissions that we have.)
The Kubernetes API server does not support certificate revocation
(see issue #18982)
As a result, we don't have an easy way to terminate someone's access
(if their key is compromised, or they leave the organization)
Option 1: re-create a new CA and re-issue everyone's certificates
→ Maybe OK if we only have a few users; no way otherwise
Option 2: don't use groups; grant permissions to individual users
→ Inconvenient if we have many users and teams; error-prone
Option 3: issue short-lived certificates (e.g. 24 hours) and renew them often
→ This can be facilitated by e.g. Vault or by the Kubernetes CSR API
Tokens are passed as HTTP headers:
Authorization: Bearer and-then-here-comes-the-token
Tokens can be validated through a number of different methods:
static tokens hard-coded in a file on the API server
bootstrap tokens (special case to create a cluster or join nodes)
OpenID Connect tokens (to delegate authentication to compatible OAuth2 providers)
service accounts (these deserve more details, coming right up!)
A service account is a user that exists in the Kubernetes API
(it is visible with e.g. kubectl get serviceaccounts
)
Service accounts can therefore be created / updated dynamically
(they don't require hand-editing a file and restarting the API server)
A service account is associated with a set of secrets
(the kind that you can view with kubectl get secrets
)
Service accounts are generally used to grant permissions to applications, services...
(as opposed to humans)
We are going to list existing service accounts
Then we will extract the token for a given service account
And we will use that token to authenticate with the API
The resource name is serviceaccount (or sa for short):
kubectl get sa
There should be just one service account in the default namespace: default
.
List the secrets of the default service account:
kubectl get sa default -o yaml
SECRET=$(kubectl get sa default -o json | jq -r .secrets[0].name)
It should be named default-token-XXXXX
.
View the secret:
kubectl get secret $SECRET -o yaml
Extract the token and decode it:
TOKEN=$(kubectl get secret $SECRET -o json \
        | jq -r .data.token | openssl base64 -d -A)
Find the ClusterIP for the kubernetes
service:
kubectl get svc kubernetes
API=$(kubectl get svc kubernetes -o json | jq -r .spec.clusterIP)
Connect without the token:
curl -k https://$API
Connect with the token:
curl -k -H "Authorization: Bearer $TOKEN" https://$API
In both cases, we will get a "Forbidden" error
Without authentication, the user is system:anonymous
With authentication, it is shown as system:serviceaccount:default:default
The API "sees" us as a different user
But neither user has any rights, so we can't do nothin'
Let's change that!
There are multiple ways to grant permissions in Kubernetes, called authorizers:
Node Authorization (used internally by kubelet; we can ignore it)
Attribute-based access control (powerful but complex and static; ignore it too)
Webhook (each API request is submitted to an external service for approval)
Role-based access control (associates permissions to users dynamically)
The one we want is the last one, generally abbreviated as RBAC
RBAC allows to specify fine-grained permissions
Permissions are expressed as rules
A rule is a combination of:
verbs like create, get, list, update, delete...
resources (as in "API resource," like pods, nodes, services...)
resource names (to specify e.g. one specific pod instead of all pods)
in some cases, subresources (e.g. logs are subresources of pods)
A role is an API object containing a list of rules
Example: role "external-load-balancer-configurator" can:
A rolebinding associates a role with a user
Example: rolebinding "external-load-balancer-configurator":
Yes, there can be users, roles, and rolebindings with the same name
It's a good idea for 1-1-1 bindings; not so much for 1-N ones
API resources Role and RoleBinding are for objects within a namespace
We can also define API resources ClusterRole and ClusterRoleBinding
These are a superset, allowing us to:
specify actions on cluster-wide objects (like nodes)
operate across all namespaces
We can create Role and RoleBinding resources within a namespace
ClusterRole and ClusterRoleBinding resources are global
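To make this concrete, here is a minimal sketch of a namespaced Role and its RoleBinding (the names pod-reader and alice, and the namespace default, are made up for this example):
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: default
  name: pod-reader
rules:
- apiGroups: [""]             # "" is the core API group (pods, services...)
  resources: ["pods"]
  verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  namespace: default
  name: pod-reader
subjects:
- kind: User
  name: alice                 # the user name produced by the authentication step
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: pod-reader
  apiGroup: rbac.authorization.k8s.io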
A pod can be associated with a service account
by default, it is associated with the default
service account
as we saw earlier, this service account has no permissions anyway
The associated token is exposed to the pod's filesystem
(in /var/run/secrets/kubernetes.io/serviceaccount/token
)
Standard Kubernetes tooling (like kubectl
) will look for it there
So Kubernetes tools running in a pod will automatically use the service account
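For reference, here is a minimal sketch of a pod spec selecting a specific service account (the pod name is arbitrary; viewer is the service account we create in the next steps):
apiVersion: v1
kind: Pod
metadata:
  name: my-app
spec:
  serviceAccountName: viewer    # defaults to "default" if omitted
  containers:
  - name: main
    image: alpine
    command: ["sleep", "3600"]
    # the token is mounted at /var/run/secrets/kubernetes.io/serviceaccount/token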
We are going to create a service account
We will use a default cluster role (view
)
We will bind together this role and this service account
Then we will run a pod using that service account
In this pod, we will install kubectl
and check our permissions
We will call the new service account viewer
(note that nothing prevents us from calling it view
, like the role)
Create the new service account:
kubectl create serviceaccount viewer
List service accounts now:
kubectl get serviceaccounts
Binding a role = creating a rolebinding object
We will call that object viewercanview
(but again, we could call it view
)
kubectl create rolebinding viewercanview \
        --clusterrole=view \
        --serviceaccount=default:viewer
It's important to note a couple of details in these flags...
We used --clusterrole=view
What would have happened if we had used --role=view
?
we would have bound the role view
from the local namespace
(instead of the cluster role view
)
the command would have worked fine (no error)
but later, our API requests would have been denied
This is a deliberate design decision
(we can reference roles that don't exist, and create/update them later)
We used --serviceaccount=default:viewer
What would have happened if we had used --user=default:viewer
?
we would have bound the role to a user instead of a service account
again, the command would have worked fine (no error)
...but our API requests would have been denied later
What about the default:
prefix?
that's the namespace of the service account
yes, it could be inferred from context, but... kubectl
requires it
We will run an alpine pod and install kubectl there.
Run a one-time pod:
kubectl run eyepod --rm -ti --restart=Never \
        --serviceaccount=viewer \
        --image alpine
Install curl
, then use it to install kubectl
:
apk add --no-cache curl
URLBASE=https://storage.googleapis.com/kubernetes-release/release
KUBEVER=$(curl -s $URLBASE/stable.txt)
curl -LO $URLBASE/$KUBEVER/bin/linux/amd64/kubectl
chmod +x kubectl
Using kubectl in the pod
We will try to use our view permissions, then to create an object.
Check that we can, indeed, view things:
./kubectl get all
But that we can't create things:
./kubectl create deployment testrbac --image=nginx
Exit the container with exit
or ^D
Testing directly with kubectl
We can also check for permissions with kubectl auth can-i:
kubectl auth can-i list nodes
kubectl auth can-i create pods
kubectl auth can-i get pod/name-of-pod
kubectl auth can-i get /url-fragment-of-api-request/
kubectl auth can-i '*' services
And we can check permissions on behalf of other users:
kubectl auth can-i list nodes \
        --as some-user
kubectl auth can-i list nodes \
        --as system:serviceaccount:<namespace>:<name-of-service-account>
Where does the view role come from?
Kubernetes defines a number of ClusterRoles intended to be bound to users
cluster-admin
can do everything (think root
on UNIX)
admin
can do almost everything (except e.g. changing resource quotas and limits)
edit
is similar to admin
, but cannot view or edit permissions
view
has read-only access to most resources, except permissions and secrets
In many situations, these roles will be all you need.
You can also customize them!
If you need to add permissions to these default roles (or others),
you can do it through the ClusterRole Aggregation mechanism
This happens by creating a ClusterRole with the following labels:
metadata:
  labels:
    rbac.authorization.k8s.io/aggregate-to-admin: "true"
    rbac.authorization.k8s.io/aggregate-to-edit: "true"
    rbac.authorization.k8s.io/aggregate-to-view: "true"
This ClusterRole's permissions will be added to admin/edit/view respectively
This is particularly useful when using CustomResourceDefinitions
(since Kubernetes cannot guess which resources are sensitive and which ones aren't)
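For instance, here is a sketch of a ClusterRole that would add read access to a hypothetical custom resource (the API group example.com and resource things are made up) to the built-in view role:
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: view-things
  labels:
    rbac.authorization.k8s.io/aggregate-to-view: "true"
rules:
- apiGroups: ["example.com"]
  resources: ["things"]
  verbs: ["get", "list", "watch"]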
When interacting with the Kubernetes API, we are using a client certificate
We saw previously that this client certificate contained:
CN=kubernetes-admin
and O=system:masters
Let's look for these in existing ClusterRoleBindings:
kubectl get clusterrolebindings -o yaml | grep -e kubernetes-admin -e system:masters
(system:masters
should show up, but not kubernetes-admin
.)
Where does this match come from?
The system:masters group
If we eyeball the output of kubectl get clusterrolebindings -o yaml, we'll find out!
It is in the cluster-admin
binding:
kubectl describe clusterrolebinding cluster-admin
This binding associates system:masters
with the cluster role cluster-admin
And the cluster-admin
is, basically, root
:
kubectl describe clusterrole cluster-admin
For auditing purposes, sometimes we want to know who can perform an action
There is a proof-of-concept tool by Aqua Security which does exactly that:
This is one way to install it:
docker run --rm -v /usr/local/bin:/go/bin golang \
       go get -v github.com/aquasecurity/kubectl-who-can
This is one way to use it:
kubectl-who-can create pods
The CSR API
(automatically generated title slide)
The Kubernetes API exposes CSR resources
We can use these resources to issue TLS certificates
First, we will go through a quick reminder about TLS certificates
Then, we will see how to obtain a certificate for a user
We will use that certificate to authenticate with the cluster
Finally, we will grant some privileges to that user
TLS (Transport Layer Security) is a protocol providing:
encryption (to prevent eavesdropping)
authentication (using public key cryptography)
When we access an https:// URL, the server authenticates itself
(it proves its identity to us; as if it were "showing its ID")
But we can also have mutual TLS authentication (mTLS)
(client proves its identity to server; server proves its identity to client)
To authenticate, someone (client or server) needs:
a private key (that remains known only to them)
a public key (that they can distribute)
a certificate (associating the public key with an identity)
A message encrypted with the private key can only be decrypted with the public key
(and vice versa)
If I use someone's public key to encrypt/decrypt their messages,
I can be certain that I am talking to them / they are talking to me
The certificate proves that I have the correct public key for them
This is what I do if I want to obtain a certificate.
Create public and private keys.
Create a Certificate Signing Request (CSR).
(The CSR contains the identity that I claim and a public key.)
Send that CSR to the Certificate Authority (CA).
The CA verifies that I can claim the identity in the CSR.
The CA generates my certificate and gives it to me.
The CA (or anyone else) never needs to know my private key.
The Kubernetes API has a CertificateSigningRequest resource type
(we can list them with e.g. kubectl get csr
)
We can create a CSR object
(= upload a CSR to the Kubernetes API)
Then, using the Kubernetes API, we can approve/deny the request
If we approve the request, the Kubernetes API generates a certificate
The certificate gets attached to the CSR object and can be retrieved
We will show how to use the CSR API to obtain user certificates
This will be a rather complex demo
... And yet, we will take a few shortcuts to simplify it
(but it will illustrate the general idea)
The demo also won't be automated
(we would have to write extra code to make it fully functional)
We will create a Namespace named "users"
Each user will get a ServiceAccount in that Namespace
That ServiceAccount will give read/write access to one CSR object
Users will use that ServiceAccount's token to submit a CSR
We will approve the CSR (or not)
Users can then retrieve their certificate from their CSR object
...And use that certificate for subsequent interactions
For a user named jean.doe
, we will have:
ServiceAccount jean.doe
in Namespace users
CertificateSigningRequest users:jean.doe
ClusterRole users:jean.doe
giving read/write access to that CSR
ClusterRoleBinding users:jean.doe
binding ClusterRole and ServiceAccount
If you want to use another name than jean.doe
, update the YAML file!
Create the global namespace for all users:
kubectl create namespace users
Create the ServiceAccount, ClusterRole, ClusterRoleBinding for jean.doe
:
kubectl apply -f ~/container.training/k8s/users:jean.doe.yaml
Let's obtain the user's token and give it to them
(the token will be their password)
List the user's secrets:
kubectl --namespace=users describe serviceaccount jean.doe
Show the user's token:
kubectl --namespace=users describe secret jean.doe-token-xxxxx
Configuring kubectl to use the token
Add a new identity to our kubeconfig file:
kubectl config set-credentials token:jean.doe --token=...
Add a new context using that identity:
kubectl config set-context jean.doe --user=token:jean.doe --cluster=kubernetes
Try to access any resource:
kubectl get pods
(This should tell us "Forbidden")
Try to access "our" CertificateSigningRequest:
kubectl get csr users:jean.doe
(This should tell us "NotFound")
There are many tools to generate TLS keys and CSRs
Let's use OpenSSL; it's not the best one, but it's installed everywhere
(many people prefer cfssl, easyrsa, or other tools; that's fine too!)
openssl req -newkey rsa:2048 -nodes -keyout key.pem \
        -new -subj /CN=jean.doe/O=devs/ -out csr.pem
The command above generates:
a private key (key.pem) and a certificate signing request (csr.pem)
the CSR claims the identity jean.doe, in group devs
The Kubernetes CSR object is a thin wrapper around the CSR PEM file
The PEM file needs to be encoded to base64 on a single line
(we will use base64 -w0
for that purpose)
The Kubernetes CSR object also needs to list the right "usages"
(these are flags indicating how the certificate can be used)
kubectl apply -f - <<EOF
apiVersion: certificates.k8s.io/v1beta1
kind: CertificateSigningRequest
metadata:
  name: users:jean.doe
spec:
  request: $(base64 -w0 < csr.pem)
  usages:
  - digital signature
  - key encipherment
  - client auth
EOF
By default, the CSR API generates certificates valid for 1 year
We want to generate short-lived certificates, so we will lower that to 1 hour
For now, this is configured through an experimental controller manager flag
Edit the static pod definition for the controller manager:
sudo vim /etc/kubernetes/manifests/kube-controller-manager.yaml
In the list of flags, add the following line:
- --experimental-cluster-signing-duration=1h
Switch back to cluster-admin
:
kctx -
Inspect the CSR:
kubectl describe csr users:jean.doe
Approve it:
kubectl certificate approve users:jean.doe
Switch back to the user's identity:
kctx -
Retrieve the updated CSR object and extract the certificate:
kubectl get csr users:jean.doe \
        -o jsonpath={.status.certificate} \
        | base64 -d > cert.pem
Inspect the certificate:
openssl x509 -in cert.pem -text -noout
Add the key and certificate to kubeconfig:
kubectl config set-credentials cert:jean.doe --embed-certs \
        --client-certificate=cert.pem --client-key=key.pem
Update the user's context to use the key and cert to authenticate:
kubectl config set-context jean.doe --user cert:jean.doe
Confirm that we are seen as jean.doe
(but don't have permissions):
kubectl get pods
We have just shown, step by step, a method to issue short-lived certificates for users.
To be usable in real environments, we would need to add:
a kubectl helper to automatically generate the CSR and obtain the cert
(and transparently renew the cert when needed)
a Kubernetes controller to automatically validate and approve CSRs
(checking that the subject and groups are valid)
a way for the users to know the groups to add to their CSR
(e.g.: annotations on their ServiceAccount + read access to the ServiceAccount)
Larger organizations typically integrate with their own directory
The general principle, however, is the same:
users have long-term credentials (password, token, ...)
they use these credentials to obtain other, short-lived credentials
This provides enhanced security:
the long-term credentials can use long passphrases, 2FA, HSM...
the short-term credentials are more convenient to use
we get strong security and convenience
Systems like Vault also have certificate issuance mechanisms
Pod Security Policies
(automatically generated title slide)
By default, our pods and containers can do everything
(including taking over the entire cluster)
We are going to show an example of a malicious pod
Then we will explain how to avoid this with PodSecurityPolicies
We will enable PodSecurityPolicies on our cluster
We will create a couple of policies (restricted and permissive)
Finally we will see how to use them to improve security on our cluster
For simplicity, let's work in a separate namespace
Let's create a new namespace called "green"
Create the "green" namespace:
kubectl create namespace green
Change to that namespace:
kns green
Create a Deployment using the official NGINX image:
kubectl create deployment web --image=nginx
Confirm that the Deployment, ReplicaSet, and Pod exist, and that the Pod is running:
kubectl get all
We will now show an escalation technique in action
We will deploy a DaemonSet that adds our SSH key to the root account
(on each node of the cluster)
The Pods of the DaemonSet will do so by mounting /root
from the host
Check the file k8s/hacktheplanet.yaml
with a text editor:
vim ~/container.training/k8s/hacktheplanet.yaml
If you would like, change the SSH key (by changing the GitHub user name)
Create the DaemonSet:
kubectl create -f ~/container.training/k8s/hacktheplanet.yaml
Check that the pods are running:
kubectl get pods
Confirm that the SSH key was added to the node's root account:
sudo cat /root/.ssh/authorized_keys
Remove the DaemonSet:
kubectl delete daemonset hacktheplanet
Remove the Deployment:
kubectl delete deployment web
To use PSPs, we need to activate their specific admission controller
That admission controller will intercept each pod creation attempt
It will look at:
who/what is creating the pod
which PodSecurityPolicies they can use
which PodSecurityPolicies can be used by the Pod's ServiceAccount
Then it will compare the Pod with each PodSecurityPolicy one by one
If a PodSecurityPolicy accepts all the parameters of the Pod, it is created
Otherwise, the Pod creation is denied and it won't even show up in kubectl get pods
With RBAC, using a PSP corresponds to the verb use
on the PSP
(that makes sense, right?)
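As a sketch, a ClusterRole granting that use permission on a policy named restricted could look like this (an approximation of what the k8s/psp-restricted.yaml file provides, not its exact content):
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: psp:restricted
rules:
- apiGroups: ["policy"]
  resources: ["podsecuritypolicies"]
  resourceNames: ["restricted"]
  verbs: ["use"]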
If no PSP is defined, no Pod can be created
(even by cluster admins)
Pods that are already running are not affected
If we create a Pod directly, it can use a PSP to which we have access
If the Pod is created by e.g. a ReplicaSet or DaemonSet, it's different:
the ReplicaSet / DaemonSet controllers don't have access to our policies
therefore, we need to give access to the PSP to the Pod's ServiceAccount
We are going to enable the PodSecurityPolicy admission controller
At that point, we won't be able to create any more pods (!)
Then we will create a couple of PodSecurityPolicies
...And associated ClusterRoles (giving use
access to the policies)
Then we will create RoleBindings to grant these roles to ServiceAccounts
We will verify that we can't run our "exploit" anymore
To enable Pod Security Policies, we need to enable their admission plugin
This is done by adding a flag to the API server
On clusters deployed with kubeadm
, the control plane runs in static pods
These pods are defined in YAML files located in /etc/kubernetes/manifests
Kubelet watches this directory
Each time a file is added/removed there, kubelet creates/deletes the corresponding pod
Updating a file causes the pod to be deleted and recreated
Have a look at the static pods:
ls -l /etc/kubernetes/manifests
Edit the one corresponding to the API server:
sudo vim /etc/kubernetes/manifests/kube-apiserver.yaml
There should already be a line with --enable-admission-plugins=...
Let's add PodSecurityPolicy
on that line
Locate the line with --enable-admission-plugins=
Add PodSecurityPolicy
It should read: --enable-admission-plugins=NodeRestriction,PodSecurityPolicy
Save, quit
The kubelet detects that the file was modified
It kills the API server pod, and starts a new one
During that time, the API server is unavailable
Try to create a Pod directly:
kubectl run testpsp1 --image=nginx --restart=Never
Try to create a Deployment:
kubectl run testpsp2 --image=nginx
Look at existing resources:
kubectl get all
We can get hints at what's happening by looking at the ReplicaSet and Events.
We will create two policies:
privileged (allows everything)
restricted (blocks some unsafe mechanisms)
For each policy, we also need an associated ClusterRole granting use
We have a couple of files, each defining a PSP and associated ClusterRole:
k8s/psp-privileged.yaml: policy privileged, role psp:privileged
k8s/psp-restricted.yaml: policy restricted, role psp:restricted
kubectl create -f ~/container.training/k8s/psp-restricted.yaml
kubectl create -f ~/container.training/k8s/psp-privileged.yaml
The privileged policy comes from the Kubernetes documentation
The restricted policy is inspired by that same documentation page
We haven't bound the policy to any user yet
But cluster-admin
can implicitly use
all policies
Check that we can now create a Pod directly:
kubectl run testpsp3 --image=nginx --restart=Never
Create a Deployment as well:
kubectl run testpsp4 --image=nginx
Confirm that the Deployment is not creating any Pods:
kubectl get all
We can create Pods directly (thanks to our root-like permissions)
The Pods corresponding to a Deployment are created by the ReplicaSet controller
The ReplicaSet controller does not have root-like permissions
We need to either:
grant permissions to the ReplicaSet controller
or
grant permissions to the Pods' ServiceAccount
The first option would allow anyone to create pods
The second option will allow us to scope the permissions better
Let's bind the role psp:restricted
to ServiceAccount green:default
(aka the default ServiceAccount in the green Namespace)
This will allow Pod creation in the green Namespace
(because these Pods will be using that ServiceAccount automatically)
kubectl create rolebinding psp:restricted \
        --clusterrole=psp:restricted \
        --serviceaccount=green:default
The Deployments that we created earlier will eventually recover
(the ReplicaSet controller periodically retries creating the Pods)
If we create a new Deployment now, it should work immediately
Create a simple Deployment:
kubectl create deployment testpsp5 --image=nginx
Look at the Pods that have been created:
kubectl get all
Create a hostile DaemonSet:
kubectl create -f ~/container.training/k8s/hacktheplanet.yaml
Look at the state of the namespace:
kubectl get all
The restricted PSP is similar to the one provided in the docs, but:
it allows containers to run as root
it doesn't drop capabilities
Many containers run as root by default, and would require additional tweaks
Many containers use e.g. chown
, which requires a specific capability
(that's the case for the NGINX official image, for instance)
We still block: hostPath, privileged containers, and much more!
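For illustration, a restricted policy along those lines might look like this sketch (field values are assumptions, not the exact content of k8s/psp-restricted.yaml):
apiVersion: policy/v1beta1
kind: PodSecurityPolicy
metadata:
  name: restricted
spec:
  privileged: false            # no privileged containers
  hostNetwork: false
  hostPID: false
  hostIPC: false
  runAsUser:
    rule: RunAsAny             # root is still allowed (see above)
  seLinux:
    rule: RunAsAny
  supplementalGroups:
    rule: RunAsAny
  fsGroup:
    rule: RunAsAny
  volumes:                     # note that hostPath is not in this list
  - configMap
  - secret
  - emptyDir
  - downwardAPI
  - projected
  - persistentVolumeClaim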
If we list the pods in the kube-system
namespace, kube-apiserver
is missing
However, the API server is obviously running
(otherwise, kubectl get pods --namespace=kube-system
wouldn't work)
The API server Pod is created directly by kubelet
(without going through the PSP admission plugin)
Then, kubelet creates a "mirror pod" representing that Pod in etcd
That "mirror pod" creation goes through the PSP admission plugin
And it gets blocked!
This can be fixed by binding psp:privileged
to group system:nodes
Our cluster is currently broken
(we can't create pods in namespaces kube-system, default, ...)
We need to either:
disable the PSP admission plugin
allow use of PSP to relevant users and groups
For instance, we could:
bind psp:restricted
to the group system:authenticated
bind psp:privileged
to the ServiceAccount kube-system:default
Exposing HTTP services with Ingress resources
(automatically generated title slide)
Services give us a way to access a pod or a set of pods
Services can be exposed to the outside world:
with type NodePort
(on a port >30000)
with type LoadBalancer
(allocating an external load balancer)
What about HTTP services?
how can we expose webui
, rng
, hasher
?
the Kubernetes dashboard?
a new version of webui
?
If we use NodePort
services, clients have to specify port numbers
(i.e. http://xxxxx:31234 instead of just http://xxxxx)
LoadBalancer
services are nice, but:
they are not available in all environments
they often carry an additional cost (e.g. they provision an ELB)
they require one extra step for DNS integration
(waiting for the LoadBalancer
to be provisioned; then adding it to DNS)
We could build our own reverse proxy
There are many options available:
Apache, HAProxy, Hipache, NGINX, Traefik, ...
(look at jpetazzo/aiguillage for a minimal reverse proxy configuration using NGINX)
Most of these options require us to update/edit configuration files after each change
Some of them can pick up virtual hosts and backends from a configuration store
Wouldn't it be nice if this configuration could be managed with the Kubernetes API?
Enter¹ Ingress resources!
¹ Pun maybe intended.
Kubernetes API resource (kubectl get ingress
/ingresses
/ing
)
Designed to expose HTTP services
Basic features:
load balancing
SSL termination
name-based virtual hosting
Can also route to different services depending on:
the request path (e.g. /api → api-service, /static → assets-service)
Step 1: deploy an ingress controller
ingress controller = load balancer + control loop
the control loop watches over ingress resources, and configures the LB accordingly
Step 2: set up DNS
Step 3: create ingress resources
Step 4: profit!
We will deploy the Traefik ingress controller
this is an arbitrary choice
maybe motivated by the fact that Traefik releases are named after cheeses
For DNS, we will use nip.io
*.1.2.3.4.nip.io
resolves to 1.2.3.4
We will create ingress resources for various HTTP services
We want our ingress load balancer to be available on port 80
We could do that with a LoadBalancer
service
... but it requires support from the underlying infrastructure
We could use pods specifying hostPort: 80
... but with most CNI plugins, this doesn't work or requires additional setup
We could use a NodePort
service
... but that requires changing the --service-node-port-range
flag in the API server
Last resort: the hostNetwork
mode
Without hostNetwork
Normally, each pod gets its own network namespace
(sometimes called sandbox or network sandbox)
An IP address is assigned to the pod
This IP address is routed/connected to the cluster network
All containers of that pod are sharing that network namespace
(and therefore using the same IP address)
With hostNetwork: true
No network namespace gets created
The pod is using the network namespace of the host
It "sees" (and can use) the interfaces (and IP addresses) of the host
The pod can receive outside traffic directly, on any port
Downside: with most network plugins, network policies won't work for that pod
most network policies work at the IP address level
filtering that pod = filtering traffic from the node
The Traefik documentation tells us to pick between Deployment and Daemon Set
We are going to use a Daemon Set so that each node can accept connections
We will do two minor changes to the YAML provided by Traefik:
enable hostNetwork
add a toleration so that Traefik also runs on node1
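Concretely, the patched pod template ends up with something like this (a sketch of the two changes; see k8s/traefik.yaml for the full manifest, and note that the image reference is an assumption):
spec:
  template:
    spec:
      hostNetwork: true                       # use the node's network namespace
      tolerations:
      - key: node-role.kubernetes.io/master
        effect: NoSchedule                    # tolerate the control plane taint
      containers:
      - name: traefik
        image: traefik
        ports:
        - containerPort: 80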
A taint is an attribute added to a node
It prevents pods from running on the node
... Unless they have a matching toleration
When deploying with kubeadm
:
a taint is placed on the node dedicated to the control plane
the pods running the control plane have a matching toleration
kubectl get node node1 -o json | jq .spec
kubectl get node node2 -o json | jq .spec
We should see a result only for node1
(the one with the control plane):
"taints": [ { "effect": "NoSchedule", "key": "node-role.kubernetes.io/master" } ]
The key
can be interpreted as:
a reservation for a special set of pods
(here, this means "this node is reserved for the control plane")
an error condition on the node
(for instance: "disk full," do not start new pods here!)
The effect
can be:
NoSchedule
(don't run new pods here)
PreferNoSchedule
(try not to run new pods here)
NoExecute
(don't run new pods and evict running pods)
kubectl -n kube-system get deployments coredns -o json | jq .spec.template.spec.tolerations
The result should include:
{ "effect": "NoSchedule", "key": "node-role.kubernetes.io/master" }
It means: "bypass the exact taint that we saw earlier on node1
."
Check the tolerations of kube-proxy:
kubectl -n kube-system get ds kube-proxy -o json | jq .spec.template.spec.tolerations
The result should include:
{ "operator": "Exists" }
This one is a special case that means "ignore all taints and run anyway."
We provide a YAML file (k8s/traefik.yaml
) which is essentially the sum of:
Traefik's Daemon Set resources (patched with hostNetwork
and tolerations)
Traefik's RBAC rules allowing it to watch necessary API objects
kubectl apply -f ~/container.training/k8s/traefik.yaml
curl localhost
We should get a 404 page not found
error.
This is normal: we haven't provided any ingress rule yet.
To make our lives easier, we will use nip.io
Check out http://cheddar.A.B.C.D.nip.io
(replacing A.B.C.D with the IP address of node1
)
We should get the same 404 page not found
error
(meaning that our DNS is "set up properly", so to speak!)
Traefik provides a web dashboard
With the current install method, it's listening on port 8080
Connect to http://node1:8080 (replacing node1 with its IP address)
We are going to use the errm/cheese images
(there are 3 tags available: wensleydale, cheddar, stilton)
These images contain a simple static HTTP server sending a picture of cheese
We will run 3 deployments (one for each cheese)
We will create 3 services (one for each deployment)
Then we will create 3 ingress rules (one for each service)
We will route <name-of-cheese>.A.B.C.D.nip.io
to the corresponding deployment
Run all three deployments:
kubectl create deployment cheddar --image=errm/cheese:cheddar
kubectl create deployment stilton --image=errm/cheese:stilton
kubectl create deployment wensleydale --image=errm/cheese:wensleydale
Create a service for each of them:
kubectl expose deployment cheddar --port=80
kubectl expose deployment stilton --port=80
kubectl expose deployment wensleydale --port=80
Here is a minimal host-based ingress resource:
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: cheddar
spec:
  rules:
  - host: cheddar.A.B.C.D.nip.io
    http:
      paths:
      - path: /
        backend:
          serviceName: cheddar
          servicePort: 80
(It is in k8s/ingress.yaml
.)
Edit the file ~/container.training/k8s/ingress.yaml
Replace A.B.C.D with the IP address of node1
Apply the file
(An image of a piece of cheese should show up.)
Edit the file ~/container.training/k8s/ingress.yaml
Replace cheddar
with stilton
(in name
, host
, serviceName
)
Apply the file
Check that stilton.A.B.C.D.nip.io
works correctly
Repeat for wensleydale
You can have multiple ingress controllers active simultaneously
(e.g. Traefik and NGINX)
You can even have multiple instances of the same controller
(e.g. one for internal, another for external traffic)
The kubernetes.io/ingress.class
annotation can be used to tell which one to use
It's OK if multiple ingress controllers configure the same resource
(it just means that the service will be accessible through multiple paths)
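For example, here is a sketch of pinning an ingress resource to one controller (the class name traefik is an example; each controller documents the class it watches for):
metadata:
  name: cheddar
  annotations:
    kubernetes.io/ingress.class: "traefik"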
The traffic flows directly from the ingress load balancer to the backends
it doesn't need to go through the ClusterIP
in fact, we don't even need a ClusterIP
(we can use a headless service)
The load balancer can be outside of Kubernetes
(as long as it has access to the cluster subnet)
This allows the use of external (hardware, physical machines...) load balancers
Annotations can encode special features
(rate-limiting, A/B testing, session stickiness, etc.)
Aforementioned "special features" are not standardized yet
Some controllers will support them; some won't
Even relatively common features (stripping a path prefix) can differ:
This should eventually stabilize
(remember that ingresses are currently apiVersion: extensions/v1beta1
)
Collecting metrics with Prometheus
(automatically generated title slide)
Prometheus is an open-source monitoring system including:
multiple service discovery backends to figure out which metrics to collect
a scraper to collect these metrics
an efficient time series database to store these metrics
a specific query language (PromQL) to query these time series
an alert manager to notify us according to metrics values or trends
We are going to use it to collect and query some metrics on our Kubernetes cluster
We don't endorse Prometheus more or less than any other system
It's relatively well integrated within the cloud-native ecosystem
It can be self-hosted (this is useful for tutorials like this)
It can be used for deployments of varying complexity:
one binary and 10 lines of configuration to get started
all the way to thousands of nodes and millions of metrics
Prometheus obtains metrics and their values by querying exporters
An exporter serves metrics over HTTP, in plain text
This is what the node exporter looks like:
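Here is an illustrative excerpt of that plain text format (the metric names are real node exporter metrics, but the values are made up):
# HELP node_cpu_seconds_total Seconds the cpus spent in each mode.
# TYPE node_cpu_seconds_total counter
node_cpu_seconds_total{cpu="0",mode="idle"} 14783.54
node_cpu_seconds_total{cpu="0",mode="user"} 382.31
# HELP node_load1 1m load average.
# TYPE node_load1 gauge
node_load1 0.21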
Prometheus itself exposes its own internal metrics, too:
If you want to expose custom metrics to Prometheus:
serve a text page like these, and you're good to go
libraries are available in various languages to help with quantiles etc.
The Prometheus server will scrape URLs like these at regular intervals
(by default: every minute; can be more/less frequent)
If you're worried about parsing overhead: exporters can also use protobuf
The list of URLs to scrape (the scrape targets) is defined in configuration
This is maybe the simplest configuration file for Prometheus:
scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
In this configuration, Prometheus collects its own internal metrics
A typical configuration file will have multiple scrape_configs
In this configuration, the list of targets is fixed
A typical configuration file will use dynamic service discovery
This configuration file will leverage existing DNS A
records:
scrape_configs:
  - ...
  - job_name: 'node'
    dns_sd_configs:
      - names: ['api-backends.dc-paris-2.enix.io']
        type: 'A'
        port: 9100
In this configuration, Prometheus resolves the provided name(s)
(here, api-backends.dc-paris-2.enix.io
)
Each resulting IP address is added as a target on port 9100
In the DNS example, the names are re-resolved at regular intervals
As DNS records are created/updated/removed, scrape targets change as well
Existing data (previously collected metrics) is not deleted
Other service discovery backends work in a similar fashion
Prometheus can connect to e.g. a cloud API to list instances
Or to the Kubernetes API to list nodes, pods, services ...
Or a service like Consul, Zookeeper, etcd, to list applications
The resulting configurations files are way more complex
(but don't worry, we won't need to write them ourselves)
We could wonder, "why do we need a specialized database?"
One metrics data point = metrics ID + timestamp + value
With a classic SQL or noSQL data store, that's at least 160 bits of data + indexes
Prometheus is way more efficient, without sacrificing performance
(it will even be gentler on the I/O subsystem since it needs to write less)
Would you like to know more? Check this video:
Storage in Prometheus 2.0 by Goutham V at DC17EU
Look for services labeled app=prometheus across all namespaces:
kubectl get services --selector=app=prometheus --all-namespaces
If we see a NodePort
service called prometheus-server
, we're good!
(We can then skip to "Connecting to the Prometheus web UI".)
We need to:
Run the Prometheus server in a pod
(using e.g. a Deployment to ensure that it keeps running)
Expose the Prometheus server web UI (e.g. with a NodePort)
Run the node exporter on each node (with a Daemon Set)
Set up a Service Account so that Prometheus can query the Kubernetes API
Configure the Prometheus server
(storing the configuration in a Config Map for easy updates)
To make our lives easier, we are going to use a Helm chart
The Helm chart will take care of all the steps explained above
(including some extra features that we don't need, but won't hurt)
Install Tiller (Helm's server-side component) on our cluster:
helm init
Give Tiller permission to deploy things on our cluster:
kubectl create clusterrolebinding add-on-cluster-admin \
        --clusterrole=cluster-admin --serviceaccount=kube-system:default
Skip this if we already installed Prometheus earlier
(in doubt, check with helm list
)
helm upgrade prometheus stable/prometheus \
     --install \
     --namespace kube-system \
     --set server.service.type=NodePort \
     --set server.service.nodePort=30090 \
     --set server.persistentVolume.enabled=false \
     --set alertmanager.enabled=false
Curious about all these flags? They're explained in the next slide.
helm upgrade prometheus
→ upgrade release "prometheus" to the latest version...
(a "release" is a unique name given to an app deployed with Helm)
stable/prometheus
→ ... of the chart prometheus
in repo stable
--install
→ if the app doesn't exist, create it
--namespace kube-system
→ put it in that specific namespace
And set the following values when rendering the chart's templates:
server.service.type=NodePort → expose the Prometheus server with a NodePort
server.service.nodePort=30090 → set the specific NodePort number to use
server.persistentVolume.enabled=false → do not use a PersistentVolumeClaim
alertmanager.enabled=false → disable the alert manager entirely
Figure out the NodePort that was allocated to the Prometheus server:
kubectl get svc --all-namespaces | grep prometheus-server
With your browser, connect to that port
sum by (instance) (
  irate(
    container_cpu_usage_seconds_total{pod_name=~"worker.*"}[5m]
  )
)
Click on the blue "Execute" button and on the "Graph" tab just below
We see the cumulated CPU usage of worker pods for each node
(if we just deployed Prometheus, there won't be much data to see, though)
We can't learn PromQL in just 5 minutes
But we can cover the basics to get an idea of what is possible
(and have some keywords and pointers)
We are going to break down the query above
(building it one step at a time)
This query will show us CPU usage across all containers:
container_cpu_usage_seconds_total
The suffix of the metrics name tells us:
the unit (seconds of CPU)
that it's the total used since the container creation
Since it's a "total," it is an increasing quantity
(we need to compute the derivative if we want e.g. CPU % over time)
We see that the metrics retrieved have tags attached to them
This query will show us only metrics for worker containers:
container_cpu_usage_seconds_total{pod_name=~"worker.*"}
The =~
operator allows regex matching
We select all the pods with a name starting with worker
(it would be better to use labels to select pods; more on that later)
The result is a smaller set of containers
This query will show us CPU usage % instead of total seconds used:
100*irate(container_cpu_usage_seconds_total{pod_name=~"worker.*"}[5m])
The irate
operator computes the "per-second instant rate of increase"
rate
is similar but allows decreasing counters and negative values
with irate
, if a counter goes back to zero, we don't get a negative spike
The [5m]
tells how far to look back if there is a gap in the data
And we multiply with 100*
to get CPU % usage
This query sums the CPU usage per node:
sum by (instance) (
  irate(container_cpu_usage_seconds_total{pod_name=~"worker.*"}[5m])
)
instance
corresponds to the node on which the container is running
sum by (instance) (...)
computes the sum for each instance
Note: all the other tags are collapsed
(in other words, the resulting graph only shows the instance
tag)
PromQL supports many more aggregation operators
Node metrics (related to physical or virtual machines)
Container metrics (resource usage per container)
Databases, message queues, load balancers, ...
(check out this list of exporters!)
Instrumentation (=deluxe printf
for our code)
Business metrics (customers served, revenue, ...)
CPU, RAM, disk usage on the whole node
Total number of processes running, and their states
Number of open files, sockets, and their states
I/O activity (disk, network), per operation or volume
Physical/hardware (when applicable): temperature, fan speed...
...and much more!
Similar to node metrics, but not totally identical
RAM breakdown will be different
I/O activity is also harder to track
For details about container metrics, see:
http://jpetazzo.github.io/2013/10/08/docker-containers-metrics/
Arbitrary metrics related to your application and business
System performance: request latency, error rate...
Volume information: number of rows in database, message queue size...
Business data: inventory, items sold, revenue...
Prometheus can leverage Kubernetes service discovery
(with proper configuration)
Services or pods can be annotated with:
prometheus.io/scrape: true to enable scraping
prometheus.io/port: 9090 to indicate the port number
prometheus.io/path: /metrics to indicate the URI (/metrics by default)
Prometheus will detect and scrape these (without needing a restart or reload)
What if we want to get metrics for containers belonging to a pod tagged worker
?
The cAdvisor exporter does not give us Kubernetes labels
Kubernetes labels are exposed through another exporter
We can see Kubernetes labels through metrics kube_pod_labels
(each container appears as a time series with constant value of 1
)
Prometheus kind of supports "joins" between time series
But only if the names of the tags match exactly
The cAdvisor exporter uses tag pod_name
for the name of a pod
The Kubernetes service endpoints exporter uses tag pod
instead
See this blog post or this other one to see how to perform "joins"
Alas, Prometheus cannot "join" time series with different labels
(see Prometheus issue #2204 for the rationale)
There is a workaround involving relabeling, but it's "not cheap"
see this comment for an overview
or this blog post for a complete description of the process
Grafana is a beautiful (and useful) frontend to display all kinds of graphs
Not everyone needs to know Prometheus, PromQL, Grafana, etc.
But in a team, it is valuable to have at least one person who knows them
That person can set up queries and dashboards for the rest of the team
It's a little bit like knowing how to optimize SQL queries, Dockerfiles...
Don't panic if you don't know these tools!
...But make sure at least one person in your team is on it 💯
Volumes
(automatically generated title slide)
Volumes are special directories that are mounted in containers
Volumes can have many different purposes:
share files and directories between containers running on the same machine
share files and directories between containers and their host
centralize configuration information in Kubernetes and expose it to containers
manage credentials and secrets and expose them securely to containers
store persistent data for stateful services
access storage systems (like Ceph, EBS, NFS, Portworx, and many others)
Kubernetes and Docker volumes are very similar
(the Kubernetes documentation says otherwise ...
but it refers to Docker 1.7, which was released in 2015!)
Docker volumes allow us to share data between containers running on the same host
Kubernetes volumes allow us to share data between containers in the same pod
Both Docker and Kubernetes volumes enable access to storage systems
Kubernetes volumes are also used to expose configuration and secrets
Docker has specific concepts for configuration and secrets
(but under the hood, the technical implementation is similar)
If you're not familiar with Docker volumes, you can safely ignore this slide!
Volumes and Persistent Volumes are related, but very different!
Volumes:
appear in Pod specifications (see next slide)
do not exist as API resources (cannot do kubectl get volumes
)
Persistent Volumes:
are API resources (can do kubectl get persistentvolumes
)
correspond to concrete volumes (e.g. on a SAN, EBS, etc.)
cannot be associated with a Pod directly; but through a Persistent Volume Claim
won't be discussed further in this section
apiVersion: v1
kind: Pod
metadata:
  name: nginx-with-volume
spec:
  volumes:
  - name: www
  containers:
  - name: nginx
    image: nginx
    volumeMounts:
    - name: www
      mountPath: /usr/share/nginx/html/
We define a standalone Pod
named nginx-with-volume
In that pod, there is a volume named www
No type is specified, so it will default to emptyDir
(as the name implies, it will be initialized as an empty directory at pod creation)
In that pod, there is also a container named nginx
That container mounts the volume www
to path /usr/share/nginx/html/
apiVersion: v1
kind: Pod
metadata:
  name: nginx-with-volume
spec:
  volumes:
  - name: www
  containers:
  - name: nginx
    image: nginx
    volumeMounts:
    - name: www
      mountPath: /usr/share/nginx/html/
  - name: git
    image: alpine
    command: [ "sh", "-c", "apk add --no-cache git && git clone https://github.com/octocat/Spoon-Knife /www" ]
    volumeMounts:
    - name: www
      mountPath: /www/
  restartPolicy: OnFailure
We added another container to the pod
That container mounts the www
volume on a different path (/www
)
It uses the alpine
image
When started, it installs git
and clones the octocat/Spoon-Knife
repository
(that repository contains a tiny HTML website)
As a result, NGINX now serves this website
Create the pod by applying the YAML file:
kubectl apply -f ~/container.training/k8s/nginx-with-volume.yaml
Check the IP address that was allocated to our pod:
kubectl get pod nginx-with-volume -o wide
IP=$(kubectl get pod nginx-with-volume -o json | jq -r .status.podIP)
Access the web server:
curl $IP
The default restartPolicy
is Always
This would cause our git
container to run again ... and again ... and again
(with an exponential back-off delay, as explained in the documentation)
That's why we specified restartPolicy: OnFailure
There is a short period of time during which the website is not available
(because the git
container hasn't done its job yet)
This could be avoided by using Init Containers
(we will see a live example in a few sections)
The lifecycle of a volume is linked to the pod's lifecycle
This means that a volume is created when the pod is created
This is mostly relevant for emptyDir
volumes
(other volumes, like remote storage, are not "created" but rather "attached" )
A volume survives across container restarts
A volume is destroyed (or, for remote storage, detached) when the pod is destroyed
Managing configuration
(automatically generated title slide)
Some applications need to be configured (obviously!)
There are many ways for our code to pick up configuration:
command-line arguments
environment variables
configuration files
configuration servers (getting configuration from a database, an API...)
... and more (because programmers can be very creative!)
How can we do these things with containers and Kubernetes?
There are many ways to pass configuration to code running in a container:
baking it into a custom image
command-line arguments
environment variables
injecting configuration files
exposing it over the Kubernetes API
configuration servers
Let's review these different strategies!
Put the configuration in the image
(it can be in a configuration file, but also ENV
or CMD
actions)
It's easy! It's simple!
Unfortunately, it also has downsides:
multiplication of images
different images for dev, staging, prod ...
minor reconfigurations require a whole build/push/pull cycle
Avoid doing it unless you don't have the time to figure out other options
Pass options to args
array in the container specification
Example (source):
args:
- "--data-dir=/var/lib/etcd"
- "--advertise-client-urls=http://127.0.0.1:2379"
- "--listen-client-urls=http://127.0.0.1:2379"
- "--listen-peer-urls=http://127.0.0.1:2380"
- "--name=etcd"
The options can be passed directly to the program that we run ...
... or to a wrapper script that will use them to e.g. generate a config file
Works great when options are passed directly to the running program
(otherwise, a wrapper script can work around the issue)
Works great when there aren't too many parameters
(to avoid a 20-lines args
array)
Requires documentation and/or understanding of the underlying program
("which parameters and flags do I need, again?")
Well-suited for mandatory parameters (without default values)
Not ideal when we need to pass a real configuration file anyway
Pass options through the env
map in the container specification
Example:
env:
- name: ADMIN_PORT
  value: "8080"
- name: ADMIN_AUTH
  value: Basic
- name: ADMIN_CRED
  value: "admin:0pensesame!"
value
must be a string! Make sure that numbers and fancy strings are quoted.
🤔 Why this weird {name: xxx, value: yyy}
scheme? It will be revealed soon!
In the previous example, environment variables have fixed values
We can also use a mechanism called the downward API
The downward API allows exposing pod or container information
either through special files (we won't show that for now)
or through environment variables
The value of these environment variables is computed when the container is started
Remember: environment variables won't (can't) change after container start
Let's see a few concrete examples!
- name: MY_POD_NAMESPACE
  valueFrom:
    fieldRef:
      fieldPath: metadata.namespace
Useful to generate FQDN of services
(in some contexts, a short name is not enough)
For instance, the two commands should be equivalent:
curl api-backend
curl api-backend.$MY_POD_NAMESPACE.svc.cluster.local
- name: MY_POD_IP
  valueFrom:
    fieldRef:
      fieldPath: status.podIP
Useful if we need to know our IP address
(we could also read it from eth0
, but this is more solid)
- name: MY_MEM_LIMIT
  valueFrom:
    resourceFieldRef:
      containerName: test-container
      resource: limits.memory
Useful for runtimes where memory is garbage collected
Example: the JVM
(the memory available to the JVM should be set with the -Xmx
flag)
Best practice: set a memory limit, and pass it to the runtime
(see this blog post for a detailed example)
This documentation page tells more about these environment variables
And this one explains the other way to use the downward API
(through files that get created in the container filesystem)
Works great when the running program expects these variables
Works great for optional parameters with reasonable defaults
(since the container image can provide these defaults)
Sort of auto-documented
(we can see which environment variables are defined in the image, and their values)
Can be (ab)used with longer values ...
... You can put an entire Tomcat configuration file in an environment ...
... But should you?
(Do it if you really need to, we're not judging! But we'll see better ways.)
Sometimes, there is no way around it: we need to inject a full config file
Kubernetes provides a mechanism for that purpose: configmaps
A configmap is a Kubernetes resource that exists in a namespace
Conceptually, it's a key/value map
(values are arbitrary strings)
We can think about them in (at least) two different ways:
as holding entire configuration file(s)
as holding individual configuration parameters
Note: to hold sensitive information, we can use "Secrets", which are another type of resource behaving very much like configmaps. We'll cover them just after!
In this case, each key/value pair corresponds to a configuration file
Key = name of the file
Value = content of the file
There can be one key/value pair, or as many as necessary
(for complex apps with multiple configuration files)
Examples:
# Create a configmap with a single key, "app.conf"
kubectl create configmap my-app-config --from-file=app.conf

# Create a configmap with a single key, "app.conf", but another file
kubectl create configmap my-app-config --from-file=app.conf=app-prod.conf

# Create a configmap with multiple keys (one per file in the config.d directory)
kubectl create configmap my-app-config --from-file=config.d/
In this case, each key/value pair corresponds to a parameter
Key = name of the parameter
Value = value of the parameter
Examples:
# Create a configmap with two keys
kubectl create cm my-app-config \
        --from-literal=foreground=red \
        --from-literal=background=blue

# Create a configmap from a file containing key=val pairs
kubectl create cm my-app-config \
        --from-env-file=app.conf
Configmaps can be exposed as plain files in the filesystem of a container
this is achieved by declaring a volume and mounting it in the container
this is particularly effective for configmaps containing whole files
Configmaps can be exposed as environment variables in the container
this is achieved with the downward API
this is particularly effective for configmaps containing individual parameters
Let's see how to do both!
We will start a load balancer powered by HAProxy
We will use the official haproxy
image
It expects to find its configuration in /usr/local/etc/haproxy/haproxy.cfg
We will provide a simple HAProxy configuration, k8s/haproxy.cfg
It listens on port 80, and load balances connections between IBM and Google
Go to the k8s
directory in the repository:
cd ~/container.training/k8s
Create a configmap named haproxy
and holding the configuration file:
kubectl create configmap haproxy --from-file=haproxy.cfg
Check what our configmap looks like:
kubectl get configmap haproxy -o yaml
We are going to use the following pod definition:
apiVersion: v1
kind: Pod
metadata:
  name: haproxy
spec:
  volumes:
  - name: config
    configMap:
      name: haproxy
  containers:
  - name: haproxy
    image: haproxy
    volumeMounts:
    - name: config
      mountPath: /usr/local/etc/haproxy/
Apply the definition in k8s/haproxy.yaml, then check the pod's IP address:
kubectl apply -f ~/container.training/k8s/haproxy.yaml
kubectl get pod haproxy -o wide
IP=$(kubectl get pod haproxy -o json | jq -r .status.podIP)
The load balancer will send:
half of the connections to Google
the other half to IBM
curl $IP
curl $IP
curl $IP
We should see connections served by Google, and others served by IBM.
(Each server sends us a redirect page. Look at the URL that they send us to!)
We are going to run a Docker registry on a custom port
By default, the registry listens on port 5000
This can be changed by setting environment variable REGISTRY_HTTP_ADDR
We are going to store the port number in a configmap
Then we will expose that configmap as a container environment variable
Our configmap will have a single key, http.addr
:
kubectl create configmap registry --from-literal=http.addr=0.0.0.0:80
Check our configmap:
kubectl get configmap registry -o yaml
We are going to use the following pod definition:
apiVersion: v1
kind: Pod
metadata:
  name: registry
spec:
  containers:
  - name: registry
    image: registry
    env:
    - name: REGISTRY_HTTP_ADDR
      valueFrom:
        configMapKeyRef:
          name: registry
          key: http.addr
Apply the definition in k8s/registry.yaml:
kubectl apply -f ~/container.training/k8s/registry.yaml
Check the IP address allocated to the pod:
kubectl get pod registry -o wide
IP=$(kubectl get pod registry -o json | jq -r .status.podIP)
Confirm that the registry is available on port 80:
curl $IP/v2/_catalog
For sensitive information, there is another special resource: Secrets
Secrets and Configmaps work almost the same way
(we'll expose the differences on the next slide)
The intent is different, though:
"You should use secrets for things which are actually secret like API keys, credentials, etc., and use config map for not-secret configuration data."
"In the future there will likely be some differentiators for secrets like rotation or support for backing the secret API w/ HSMs, etc."
(Source: the author of both features)
Secrets are base64-encoded when shown with kubectl get secrets -o yaml
keep in mind that this is just encoding, not encryption
it is very easy to automatically extract and decode secrets
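For instance, here is a minimal sketch (the secret name and key below are made up for illustration):

# Create a throwaway secret, then read the value back and decode it
kubectl create secret generic my-secret --from-literal=password=letmein
kubectl get secret my-secret -o jsonpath='{.data.password}' | base64 -d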
With RBAC, we can authorize a user to access configmaps, but not secrets
(since they are two different kinds of resources)
Stateful sets
(automatically generated title slide)
Stateful sets are a type of resource in the Kubernetes API
(like pods, deployments, services...)
They offer mechanisms to deploy scaled stateful applications
At a first glance, they look like deployments:
a stateful set defines a pod spec and a number of replicas R
it will make sure that R copies of the pod are running
that number can be changed while the stateful set is running
updating the pod spec will cause a rolling update to happen
But they also have some significant differences
Pods in a stateful set are numbered (from 0 to R-1) and ordered
They are started and updated in order (from 0 to R-1)
A pod is started (or updated) only when the previous one is ready
They are stopped in reverse order (from R-1 to 0)
Each pod know its identity (i.e. which number it is in the set)
Each pod can discover the IP address of the others easily
The pods can persist data on attached volumes
🤔 Wait a minute ... Can't we already attach volumes to pods and deployments?
Volumes are used for many purposes:
sharing data between containers in a pod
exposing configuration information and secrets to containers
accessing storage systems
Let's see examples of the latter usage
There are many types of volumes available:
public cloud storage (GCEPersistentDisk, AWSElasticBlockStore, AzureDisk...)
private cloud storage (Cinder, VsphereVolume...)
traditional storage systems (NFS, iSCSI, FC...)
distributed storage (Ceph, Glusterfs, Portworx...)
Using a persistent volume requires:
creating the volume out-of-band (outside of the Kubernetes API)
referencing the volume in the pod description, with all its parameters
Here is a pod definition using an AWS EBS volume (that has to be created first):
apiVersion: v1
kind: Pod
metadata:
  name: pod-using-my-ebs-volume
spec:
  containers:
  - image: ...
    name: container-using-my-ebs-volume
    volumeMounts:
    - mountPath: /my-ebs
      name: my-ebs-volume
  volumes:
  - name: my-ebs-volume
    awsElasticBlockStore:
      volumeID: vol-049df61146c4d7901
      fsType: ext4
Here is another example using a volume on an NFS server:
apiVersion: v1
kind: Pod
metadata:
  name: pod-using-my-nfs-volume
spec:
  containers:
  - image: ...
    name: container-using-my-nfs-volume
    volumeMounts:
    - mountPath: /my-nfs
      name: my-nfs-volume
  volumes:
  - name: my-nfs-volume
    nfs:
      server: 192.168.0.55
      path: "/exports/assets"
Their lifecycle (creation, deletion...) is managed outside of the Kubernetes API
(we can't just use kubectl apply/create/delete/... to manage them)
If a Deployment uses a volume, all replicas end up using the same volume
That volume must then support concurrent access
some volumes do (e.g. NFS servers support multiple read/write access)
some volumes support concurrent reads
some volumes support concurrent access for colocated pods
What we really need is a way for each replica to have its own volume
To abstract the different types of storage, a pod can use a special volume type
This type is a Persistent Volume Claim
A Persistent Volume Claim (PVC) is a resource type
(visible with kubectl get persistentvolumeclaims or kubectl get pvc)
A PVC is not a volume; it is a request for a volume
Using a Persistent Volume Claim is a two-step process:
creating the claim
using the claim in a pod (as if it were any other kind of volume)
A PVC starts by being Unbound (without an associated volume)
Once it is associated with a Persistent Volume, it becomes Bound
A Pod referring to an unbound PVC will not start
(but as soon as the PVC is bound, the Pod can start)
A Kubernetes controller continuously watches PV and PVC objects
When it notices an unbound PVC, it tries to find a satisfactory PV
("satisfactory" in terms of size and other characteristics; see next slide)
If no PV fits the PVC, a PV can be created dynamically
(this requires configuring a dynamic provisioner; more on that later)
Otherwise, the PVC remains unbound indefinitely
(until we manually create a PV or set up dynamic provisioning)
At the very least, the claim should indicate:
the size of the volume (e.g. "5 GiB")
the access mode (e.g. "read-write by a single pod")
Optionally, it can also specify a Storage Class
The Storage Class indicates:
which storage system to use (e.g. Portworx, EBS...)
extra parameters for that storage system
e.g.: "replicate the data 3 times, and use SSD media"
A Storage Class is yet another Kubernetes API resource
(visible with e.g. kubectl get storageclass
or kubectl get sc
)
It indicates which provisioner to use
(which controller will create the actual volume)
And arbitrary parameters for that provisioner
(replication levels, type of disk ... anything relevant!)
Storage Classes are required if we want to use dynamic provisioning
(but we can also create volumes manually, and ignore Storage Classes)
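As a rough sketch, a storage class could look like this (the provisioner and parameters below are illustrative, for EBS gp2 volumes; your storage system will use different values):

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast
provisioner: kubernetes.io/aws-ebs
parameters:
  type: gp2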
Here is a minimal PVC:
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: my-claim
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
Here is a Pod definition like the ones shown earlier, but using a PVC:
apiVersion: v1
kind: Pod
metadata:
  name: pod-using-a-claim
spec:
  containers:
  - image: ...
    name: container-using-a-claim
    volumeMounts:
    - mountPath: /my-vol
      name: my-volume
  volumes:
  - name: my-volume
    persistentVolumeClaim:
      claimName: my-claim
The pods in a stateful set can define a volumeClaimTemplate
A volumeClaimTemplate will dynamically create one Persistent Volume Claim per pod (see the sketch after this list)
Each pod will therefore have its own volume
These volumes are numbered (like the pods)
When updating the stateful set (e.g. image upgrade), each pod keeps its volume
When pods get rescheduled (e.g. node failure), they keep their volume
(this requires a storage system that is not node-local)
These volumes are not automatically deleted
(when the stateful set is scaled down or deleted)
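Here is a minimal sketch of a stateful set using a volumeClaimTemplate (the names, image, and size are illustrative, not taken from the repository):

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: db
spec:
  serviceName: db        # headless service used for pod discovery
  replicas: 3
  selector:
    matchLabels:
      app: db
  template:
    metadata:
      labels:
        app: db
    spec:
      containers:
      - name: db
        image: ...
        volumeMounts:
        - name: data
          mountPath: /var/lib/db
  volumeClaimTemplates:
  - metadata:
      name: data
    spec:
      accessModes: [ "ReadWriteOnce" ]
      resources:
        requests:
          storage: 1Gi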
A stateful set manages a number of identical pods
(like a Deployment)
These pods are numbered, and started/upgraded/stopped in a specific order
These pods are aware of their number
(e.g., #0 can decide to be the primary, and #1 can be secondary)
These pods can find the IP addresses of the other pods in the set
(through a headless service; see the sketch after this list)
These pods can each have their own persistent storage
(Deployments cannot do that)
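For reference, such a headless service could look like this (a sketch; the name and label are illustrative and would match the stateful set above):

apiVersion: v1
kind: Service
metadata:
  name: db
spec:
  clusterIP: None      # headless: no virtual IP; DNS returns the pod IPs
  selector:
    app: db

With a headless service, each pod of the set gets a stable DNS name of the form db-0.db.<namespace>.svc.cluster.local.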
Running a Consul cluster
(automatically generated title slide)
Here is a good use-case for Stateful sets!
We are going to deploy a Consul cluster with 3 nodes
Consul is a highly-available key/value store
(like etcd or Zookeeper)
One easy way to bootstrap a cluster is to tell each node:
the addresses of other nodes
how many nodes are expected (to know when quorum is reached)
After reading the Consul documentation carefully (and/or asking around), we figure out the minimal command-line to run our Consul cluster.
consul agent -data-dir=/consul/data -client=0.0.0.0 -server -ui \
    -bootstrap-expect=3 \
    -retry-join=X.X.X.X \
    -retry-join=Y.Y.Y.Y
Replace X.X.X.X and Y.Y.Y.Y with the addresses of other nodes
The same command-line can be used on all nodes (convenient!)
Since version 1.4.0, Consul can use the Kubernetes API to find its peers
This is called Cloud Auto-join
Instead of passing an IP address, we need to pass a parameter like this:
consul agent -retry-join "provider=k8s label_selector=\"app=consul\""
Consul needs to be able to talk to the Kubernetes API
We can provide a kubeconfig file
If Consul runs in a pod, it will use the service account of the pod
We need to create a service account for Consul
We need to create a role that can list and get pods
We need to bind that role to the service account
And of course, we need to make sure that Consul pods use that service account
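As a rough sketch, these resources could look like this (the names and namespace are illustrative):

apiVersion: v1
kind: ServiceAccount
metadata:
  name: consul
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: consul
rules:
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["get", "list"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: consul
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: consul
subjects:
- kind: ServiceAccount
  name: consul
  namespace: default    # the namespace where the Consul pods run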
The file k8s/consul.yaml defines the required resources
(service account, cluster role, cluster role binding, service, stateful set)
It has a few extra touches:
a podAntiAffinity prevents two pods from running on the same node (see the sketch after this list)
a preStop hook makes the pod leave the cluster when it is shut down gracefully
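For instance, the anti-affinity constraint mentioned above could look like this in the pod template of the stateful set (a sketch; the label value is illustrative):

affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchLabels:
          app: consul
      topologyKey: kubernetes.io/hostname   # "same node" = same hostname label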
This was inspired by this excellent tutorial by Kelsey Hightower. Some features from the original tutorial (TLS authentication between nodes and encryption of gossip traffic) were removed for simplicity.
Create the stateful set and associated service:
kubectl apply -f ~/container.training/k8s/consul.yaml
Check the logs as the pods come up one after another:
stern consul
kubectl exec consul-0 consul members
We haven't used a volumeClaimTemplate here
That's because we don't have a storage provider yet
(except if you're running this on your own and your cluster has one)
What happens if we lose a pod?
a new pod gets rescheduled (with an empty state)
the new pod tries to connect to the two others
it will be accepted (after 1-2 minutes of instability)
and it will retrieve the data from the other pods
What happens if we lose two pods?
manual repair will be required
we will need to instruct the remaining one to act solo
then rejoin new pods
What happens if we lose three pods? (aka all of them)
If we run Consul without persistent storage, backups are a good idea!
Local Persistent Volumes
(automatically generated title slide)
We want to run that Consul cluster and actually persist data
But we don't have a distributed storage system
We are going to use local volumes instead
(similar conceptually to hostPath
volumes)
We can use local volumes without installing extra plugins
However, they are tied to a node
If that node goes down, the volume becomes unavailable
We will deploy a Consul cluster with persistence
That cluster's StatefulSet will create PVCs
These PVCs will remain unbound¹ until we create local volumes manually
(we will basically do the job of the dynamic provisioner)
Then, we will see how to automate that with a dynamic provisioner
¹Unbound = without an associated Persistent Volume.
The labs in this section assume that we do not have a dynamic provisioner
If we do have one, we need to disable it
Check if we have a dynamic provisioner:
kubectl get storageclass
If the output contains a line with (default), run this command:
kubectl annotate sc storageclass.kubernetes.io/is-default-class- --all
Check again that it is no longer marked as (default)
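For instance, listing the storage classes again should now show no (default) marker:

kubectl get storageclass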
Create a new namespace:
kubectl create namespace orange
Switch to that namespace:
kns orange
Make sure to call that namespace orange: it is hardcoded in the YAML files.
We will use a slightly different YAML file
The only differences between that file and the previous one are:
a volumeClaimTemplate defined in the Stateful Set spec
the corresponding volumeMounts in the Pod spec
the namespace orange used for discovery of Pods
kubectl apply -f ~/container.training/k8s/persistent-consul.yaml
Check that we now have an unbound Persistent Volume Claim:
kubectl get pvc
We don't have any Persistent Volume:
kubectl get pv
The Pod consul-0 is not scheduled yet:
kubectl get pods -o wide
Hint: leave these commands running with -w in different windows.
In a Stateful Set, the Pods are started one by one
consul-1 won't be created until consul-0 is running
consul-0 has a dependency on an unbound Persistent Volume Claim
The scheduler won't schedule the Pod until the PVC is bound
(because the PVC might be bound to a volume that is only available on a subset of nodes; for instance, EBS volumes are tied to an availability zone)
Let's create 3 local directories (/mnt/consul) on node2, node3, node4
Then create 3 Persistent Volumes corresponding to these directories
Create the local directories:
for NODE in node2 node3 node4; do
  ssh $NODE sudo mkdir -p /mnt/consul
done
Create the PV objects:
kubectl apply -f ~/container.training/k8s/volumes-for-consul.yaml
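One of these PV objects might look roughly like this (a sketch; the actual definitions are in k8s/volumes-for-consul.yaml, and the size is illustrative):

apiVersion: v1
kind: PersistentVolume
metadata:
  name: consul-node2
spec:
  capacity:
    storage: 1Gi
  accessModes:
  - ReadWriteOnce
  local:
    path: /mnt/consul
  nodeAffinity:            # local volumes must declare which node they live on
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: kubernetes.io/hostname
          operator: In
          values:
          - node2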
The PVs that we created will be automatically matched with the PVCs
Once a PVC is bound, its pod can start normally
Once the pod consul-0 has started, consul-1 can be created, etc.
Eventually, our Consul cluster is up, and backed by "persistent" volumes
kubectl exec consul-0 consul members
The size of the Persistent Volumes is bogus
(it is used when matching PVs and PVCs together, but there is no actual quota or limit)
This specific example worked because we had exactly 1 free PV per node:
if we had created multiple PVs per node ...
we could have ended with two PVCs bound to PVs on the same node ...
which would have required two pods to be on the same node ...
which is forbidden by the anti-affinity constraints in the StatefulSet
To avoid that, we need to associate the PVs with a Storage Class that has:
volumeBindingMode: WaitForFirstConsumer
(this means that a PVC will be bound to a PV only after being used by a Pod)
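A sketch of such a storage class (this is the usual pattern for local volumes, with no dynamic provisioner behind it; the name is illustrative):

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: local-storage
provisioner: kubernetes.io/no-provisioner   # no dynamic provisioning; PVs are created manually
volumeBindingMode: WaitForFirstConsumer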
See this blog post for more details
It's not practical to manually create directories and PVs for each app
We could pre-provision a number of PVs across our fleet
We could even automate that with a Daemon Set:
creating a number of directories on each node
creating the corresponding PV objects
We also need to recycle volumes
... This can quickly get out of hand
We could also write our own provisioner, which would:
watch the PVCs across all namespaces
when a PVC is created, create a corresponding PV on a node
Or we could use one of the dynamic provisioners for local persistent volumes
(for instance the Rancher local path provisioner)
Remember, when a node goes down, the volumes on that node become unavailable
High availability will require another layer of replication
(like what we've just seen with Consul; or primary/secondary; etc)
Pre-provisioning PVs makes sense for machines with local storage
(e.g. cloud instance storage; or storage directly attached to a physical machine)
Dynamic provisioning makes sense for large numbers of applications
(when we can't or won't dedicate a whole disk to a volume)
It's possible to mix both (using distinct Storage Classes)
Static pods
(automatically generated title slide)
Hosting the Kubernetes control plane on Kubernetes has advantages:
we can use Kubernetes' replication and scaling features for the control plane
we can leverage rolling updates to upgrade the control plane
However, there is a catch:
deploying on Kubernetes requires the API to be available
the API won't be available until the control plane is deployed
How can we get out of that chicken-and-egg problem?
Since each component of the control plane can be replicated...
We could set up the control plane outside of the cluster
Then, once the cluster is fully operational, create replicas running on the cluster
Finally, remove the replicas that are running outside of the cluster
What could possibly go wrong?
What if anything goes wrong?
(During the setup or at a later point)
Worst case scenario, we might need to:
set up a new control plane (outside of the cluster)
restore a backup from the old control plane
move the new control plane to the cluster (again)
This doesn't sound like a great experience
Pods are started by kubelet (an agent running on every node)
To know which pods it should run, the kubelet queries the API server
The kubelet can also get a list of static pods from:
a directory containing one (or multiple) manifests, and/or
a URL (serving a manifest)
These "manifests" are basically YAML definitions
(As produced by kubectl get pod my-little-pod -o yaml)
Kubelet will periodically reload the manifests
It will start/stop pods accordingly
(i.e. it is not necessary to restart the kubelet after updating the manifests)
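For instance, when the kubelet is configured through a KubeletConfiguration file, the static pod directory is set like this (a sketch; the path shown is the conventional one used by kubeadm):

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
staticPodPath: /etc/kubernetes/manifests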
When connected to the Kubernetes API, the kubelet will create mirror pods
Mirror pods are copies of the static pods
(so they can be seen with e.g. kubectl get pods)
We can run control plane components with these static pods
They can start without requiring access to the API server
Once they are up and running, the API becomes available
These pods are then visible through the API
(We cannot upgrade them from the API, though)
This is how kubeadm has initialized our clusters.
The API only gives us read-only access to static pods
We can kubectl delete a static pod...
...But the kubelet will re-mirror it immediately
Static pods can be selected just like other pods
(So they can receive service traffic)
A service can select a mixture of static and other pods
Once the control plane is up and running, it can be used to create normal pods
We can then set up a copy of the control plane in normal pods
Then the static pods can be removed
The scheduler and the controller manager use leader election
(Only one is active at a time; removing an instance is seamless)
Each instance of the API server adds itself to the kubernetes service
Etcd will typically require more work!
Alright, but what if the control plane is down and we need to fix it?
We restart it using static pods!
This can be done automatically with the Pod Checkpointer
The Pod Checkpointer automatically generates manifests of running pods
The manifests are used to restart these pods if API contact is lost
(More details in the Pod Checkpointer documentation page)
This technique is used by bootkube
Is it better to run the control plane in static pods, or normal pods?
If I'm a user of the cluster: I don't care, it makes no difference to me
What if I'm an admin, i.e. the person who installs, upgrades, repairs... the cluster?
If I'm using a managed Kubernetes cluster (AKS, EKS, GKE...) it's not my problem
(I'm not the one setting up and managing the control plane)
If I already picked a tool (kubeadm, kops...) to set up my cluster, the tool decides for me
What if I haven't picked a tool yet, or if I'm installing from scratch?
static pods = easier to set up, easier to troubleshoot, less risk of outage
normal pods = easier to upgrade, easier to move (if nodes need to be shut down)
The staticPodPath is /etc/kubernetes/manifests
ls -l /etc/kubernetes/manifests
We should see YAML files corresponding to the pods of the control plane.
Copy a manifest to the directory:
sudo cp ~/container.training/k8s/just-a-pod.yaml /etc/kubernetes/manifests
Check that it's running:
kubectl get pods
The output should include a pod named hello-node1.
In the manifest, the pod was named hello.
apiVersion: v1
kind: Pod
metadata:
  name: hello
  namespace: default
spec:
  containers:
  - name: hello
    image: nginx
The -node1 suffix was added automatically by kubelet.
If we delete the pod (with kubectl delete), it will be recreated immediately.
To delete the pod, we need to delete (or move) the manifest file.
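For example (assuming the manifest we copied earlier):

# Moving the manifest out of staticPodPath makes the kubelet stop the pod
sudo mv /etc/kubernetes/manifests/just-a-pod.yaml /tmp/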
Next steps
(automatically generated title slide)
Alright, how do I get started and containerize my apps?
Suggested containerization checklist:
And then it is time to look at orchestration!
Get a managed cluster from a major cloud provider (AKS, EKS, GKE...)
(price: $, difficulty: medium)
Hire someone to deploy it for us
(price: $$, difficulty: easy)
Do it ourselves
(price: $-$$$, difficulty: hard)
Yes, it is possible to have prod+dev in a single cluster
(and implement good isolation and security with RBAC, network policies...)
But it is not a good idea to do that for our first deployment
Start with a production cluster + at least a test cluster
Implement and check RBAC and isolation on the test cluster
(e.g. deploy multiple test versions side-by-side)
Make sure that all our devs have usable dev clusters
(whether it's a local minikube or a full-blown multi-node cluster)
Namespaces let you run multiple identical stacks side by side
Two namespaces (e.g. blue and green) can each have their own redis service
Each of the two redis services has its own ClusterIP
CoreDNS creates two entries, mapping to these two ClusterIP addresses: redis.blue.svc.cluster.local and redis.green.svc.cluster.local
Pods in the blue namespace get a search suffix of blue.svc.cluster.local
As a result, resolving redis from a pod in the blue namespace yields the "local" redis
This does not provide isolation! That would be the job of network policies.
(covers permissions model, user and service accounts management ...)
As a first step, it is wiser to keep stateful services outside of the cluster
Exposing them to pods can be done with multiple solutions (see the sketches after this list):
ExternalName services (redis.blue.svc.cluster.local will be a CNAME record)
ClusterIP services with explicit Endpoints
(instead of letting Kubernetes generate the endpoints from a selector)
Ambassador services
(application-level proxies that can provide credentials injection and more)
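Here are sketches of the first two options (names, namespaces, ports, and addresses are illustrative):

# Option 1: ExternalName service (DNS alias to a server outside the cluster)
apiVersion: v1
kind: Service
metadata:
  name: redis
  namespace: blue
spec:
  type: ExternalName
  externalName: redis.acme.internal    # DNS name of the external Redis server
---
# Option 2: ClusterIP service with explicit Endpoints
apiVersion: v1
kind: Service
metadata:
  name: redis
  namespace: green
spec:
  ports:
  - port: 6379
---
apiVersion: v1
kind: Endpoints
metadata:
  name: redis           # must have the same name as the service
  namespace: green
subsets:
- addresses:
  - ip: 192.168.0.55    # address of the external Redis server
  ports:
  - port: 6379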
If we want to host stateful services on Kubernetes, we can use:
a storage provider
persistent volumes, persistent volume claims
stateful sets
Good questions to ask:
what's the operational cost of running this service ourselves?
what do we gain by deploying this stateful service on Kubernetes?
Relevant sections: Volumes | Stateful Sets | Persistent Volumes
Excellent blog post tackling the question: “Should I run Postgres on Kubernetes?”
Services are layer 4 constructs
HTTP is a layer 7 protocol
It is handled by ingresses (a different resource kind)
Ingresses allow, among other things, virtual host routing, path-based routing, and TLS termination
This section shows how to expose multiple HTTP apps using Træfik
Logging is delegated to the container engine
Logs are exposed through the API
Logs are also accessible through local files (/var/log/containers
)
Log shipping to a central platform is usually done through these files
(e.g. with an agent bind-mounting the log directory)
This section shows how to do that with Fluentd and the EFK stack
The kubelet embeds cAdvisor, which exposes container metrics
(cAdvisor might be separated in the future for more flexibility)
It is a good idea to start with Prometheus
(even if you end up using something else)
Starting from Kubernetes 1.8, we can use the Metrics API
Heapster was a popular add-on
(but is being deprecated starting with Kubernetes 1.11)
Two constructs are particularly useful: secrets and config maps
They let us expose arbitrary information to our containers
Avoid storing configuration in container images
(There are some exceptions to that rule, but it's generally a Bad Idea)
Never store sensitive information in container images
(It's the container equivalent of the password on a post-it note on your screen)
This section shows how to manage app config with config maps (among others)
The best deployment tool will vary, depending on:
A few examples:
Sorry Star Trek fans, this is not the federation you're looking for!
(If I add "Your cluster is in another federation" I might get a 3rd fandom wincing!)
Kubernetes master operation relies on etcd
etcd uses the Raft protocol
Raft recommends low latency between nodes
What if our cluster spreads to multiple regions?
Break it down in local clusters
Regroup them in a cluster federation
Synchronize resources across clusters
Discover resources across clusters
We've put this last, but it's pretty important!
How do you on-board a new developer?
What do they need to install to get a dev stack?
How does a code change make it from dev to prod?
How does someone add a component to a stack?
Links and resources
(automatically generated title slide)
All things Kubernetes:
All things Docker:
Everything else:
These slides (and future updates) are on → http://container.training/