
GitOps

Kubernetes with no manual steps. How I self-host this portfolio for free.

Key Technologies

Oracle Cloud

Ansible

Kubernetes

Flux

Figure 1: Bootstrapping the VM with Ansible

Summary

I set up a CI/CD pipeline using GitHub Actions to automate site updates. On each merge to `main`, it follows Docker’s “build → ship → run” methodology, sketched below:

  1. Build a Docker image of the site
  2. Ship the image to GitHub’s Container Registry
  3. Run the image on my always-free Oracle Cloud VM
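A minimal sketch of such a workflow is shown below. The file name, action versions, and image path are illustrative assumptions rather than the exact workflow; the run stage is handled by Flux and covered in the Spotlight section.

```yaml
# .github/workflows/deploy.yml (illustrative; names and versions are assumptions)
name: build-ship-run
on:
  push:
    branches: [main]

jobs:
  build-and-ship:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      packages: write           # required to push to GitHub's Container Registry
    steps:
      - uses: actions/checkout@v4

      # Log in to GHCR using the workflow's built-in token
      - uses: docker/login-action@v3
        with:
          registry: ghcr.io
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}

      # Build the site image and ship it, tagged with the commit hash it was built from
      - uses: docker/build-push-action@v6
        with:
          context: .
          push: true
          tags: ghcr.io/${{ github.repository }}:${{ github.sha }}
```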

I bootstrap the VM with Ansible, setting up Kubernetes and Flux as seen in Figure 1. Flux watches the portfolio's Git repository and syncs the cluster with its state.
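A stripped-down Ansible playbook for this bootstrap might look like the sketch below; the host group, repository details, and the choice of `flux bootstrap github` are assumptions rather than my exact setup.

```yaml
# bootstrap.yml (illustrative; the real playbook has more roles and variables)
- hosts: portfolio_vm
  become: true
  tasks:
    # Install K3s, a lightweight Kubernetes distribution, via its official script
    - name: Install K3s
      ansible.builtin.shell: curl -sfL https://get.k3s.io | sh -
      args:
        creates: /usr/local/bin/k3s

    # Bootstrap Flux so the cluster starts syncing from the portfolio repository
    - name: Bootstrap Flux
      ansible.builtin.command: >
        flux bootstrap github
        --owner kaichevannes
        --repository portfolio
        --path clusters/portfolio
        --personal
      environment:
        GITHUB_TOKEN: "{{ github_token }}"
        KUBECONFIG: /etc/rancher/k3s/k3s.yaml
```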

Purpose & Goal

I became interested in DevOps after reading The Phoenix Project during my 2023 summer internship at Thought Quarter. I saw this project as an opportunity to improve my mental model of CI/CD technologies and Docker. My acceptance criteria were:

  1. One-click cluster bootstrap
  2. One-click deploy
  3. Zero-downtime rolling updates

To build a foundation, I completed Docker Mastery and Kubernetes Mastery by Bret Fisher on Udemy.

Spotlight

GitOps changes the run stage of build → ship → run. Instead of pushing changes to a Kubernetes cluster using a secret key, GitOps pulls the changes from source control. This eliminates the need to store high-privilege access keys, improving security. Figure 2 shows the Kubernetes manifest files that Flux will sync the cluster’s state with.

A screenshot of the portfolio GitHub repo showing a list of YAML manifest files.
Figure 2: The Kubernetes manifest files
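As a rough sketch (names, paths, and intervals are assumptions, not the exact manifests), the sync can be described to Flux with a GitRepository and a Kustomization resource:

```yaml
# flux-sync.yaml (illustrative)
apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
  name: portfolio
  namespace: flux-system
spec:
  interval: 1m                            # how often Flux polls the repository
  url: https://github.com/kaichevannes/portfolio
  ref:
    branch: main
---
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: portfolio
  namespace: flux-system
spec:
  interval: 10m
  sourceRef:
    kind: GitRepository
    name: portfolio
  path: ./kubernetes                      # directory containing the manifest files
  prune: true                             # remove cluster objects deleted from Git
```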

One of the key benefits of GitOps is that, because the entire cluster state is stored in code with uniquely tagged container images, rolling back a release is identical to rolling back the code. In the run stage of my CI/CD pipeline, I tag the portfolio container image with the commit hash it was built from, shown in Figure 3. Flux detects this change, applies it to the cluster, and Kubernetes performs a zero-downtime rolling update, spinning up pods from the new image before phasing out the old ones.

A screenshot of the portfolio GitHub repo showing the portfolio-deployment.yaml file. A highlighted line shows the image hash of the current version.
Figure 3: The image tag for the current release
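An excerpt of what such a Deployment could look like is below; the replica count, rolling-update settings, and image name are placeholders rather than the real manifest.

```yaml
# portfolio-deployment.yaml (illustrative excerpt)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: portfolio
spec:
  replicas: 2
  selector:
    matchLabels:
      app: portfolio
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0       # keep old pods serving until the new ones are ready
      maxSurge: 1
  template:
    metadata:
      labels:
        app: portfolio
    spec:
      containers:
        - name: portfolio
          # The CI pipeline rewrites this tag with the commit hash of each release
          image: ghcr.io/kaichevannes/portfolio:<commit-hash>
          ports:
            - containerPort: 3000
```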

Challenges

The biggest challenge in self-hosting my portfolio was the networking. To serve a request from a user who types ‘kaichevannes.com’ into their browser, I need to:

  1. Allow incoming traffic on HTTP (port 80) and HTTPS (port 443) to my Oracle Cloud VM by configuring its ingress rules
  2. Route incoming requests to Traefik, running on my Kubernetes cluster
  3. Provide Traefik with the credentials to complete an ACME DNS challenge with my domain registrar
  4. Persist the TLS certificate to avoid Let's Encrypt rate limits
  5. Redirect HTTP (port 80) to HTTPS (port 443) using a Kubernetes Ingress resource
  6. Forward HTTPS (port 443) requests to Next.js on its default port of 3000 (steps 5 and 6 are sketched below)
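A sketch of an Ingress covering steps 5 and 6 is below. The annotation values, certificate resolver name, and Service name are assumptions; the HTTP-to-HTTPS redirect itself is usually configured on Traefik's entrypoints or with a redirect middleware, which is omitted here.

```yaml
# portfolio-ingress.yaml (illustrative)
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: portfolio
  annotations:
    # Terminate TLS on Traefik's HTTPS entrypoint using its ACME certificate resolver
    traefik.ingress.kubernetes.io/router.entrypoints: websecure
    traefik.ingress.kubernetes.io/router.tls.certresolver: letsencrypt
spec:
  rules:
    - host: kaichevannes.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: portfolio          # Service in front of the Next.js pods
                port:
                  number: 3000
```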

In step 3, I need to provide Traefik with a secret API token for authentication. This seems simple on the surface but poses a unique challenge for GitOps because the full cluster state, including secrets, is kept in source control.

In an enterprise setting, your Kubernetes cluster would typically be hosted with a cloud provider that has a built-in secret manager. Since I’m running a K3s cluster directly on a VM, I need to handle secrets myself.

I use Bitnami Sealed Secrets, which generates a public/private key pair inside the cluster. I can encrypt the secret API token with the cluster's public key and safely store it in source control. The private key needed to decrypt it only exists within the cluster.
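What ends up in source control is a SealedSecret like the sketch below (names, namespace, and ciphertext are placeholders); it is typically produced with the `kubeseal` CLI, which encrypts a regular Secret using the cluster's public key.

```yaml
# dns-api-token-sealedsecret.yaml (illustrative)
apiVersion: bitnami.com/v1alpha1
kind: SealedSecret
metadata:
  name: dns-api-token
  namespace: traefik
spec:
  encryptedData:
    # Ciphertext encrypted with the cluster's public key; only the in-cluster
    # controller holds the private key that can decrypt it
    token: AgB4kq...<long base64 ciphertext>...
  template:
    metadata:
      name: dns-api-token
      namespace: traefik
```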

Lessons Learned

The lesson of slowing down and making fewer assumptions was repeated to me throughout the project.

As an example, when configuring Traefik locally I used the built-in Kubernetes Ingress resource, but the documentation for certificate verification used a custom IngressRoute resource. I didn't think this would change anything, so I skipped testing it locally, which led me down a rabbit hole of trying to fix what I thought was a network issue.

I tried changing my cluster configuration only to break my bootstrap script: linking Traefik directly to my VM's network using Kubernetes host ports, forwarding requests with custom iptables rules, and swapping out my cluster's networking backend. When I finally read the docs more thoroughly, I realised the issue had been the IngressRoute resource the whole time!

I gained a better appreciation for documentation, because the issues I was running into weren't common or easy to Google. I also improved my understanding of working iteratively in small batch sizes: trying to put so many pieces together in one go created a huge surface area for problems, and eventually I had to take it step by step anyway.

Next time, I'll lead with that.