Preface
I've wanted to play with K8s clusters for a long time, but always felt that without sufficient knowledge, it would be too difficult. Recently, I spent some time studying DN42 and routing protocols like BGP and OSPF, and found that I could now understand them without much difficulty. So I decisively started with K3s (
The main reason for choosing K3s over K8s is its lightweight nature: low resource requirements, no need to pull a bunch of images during deployment, availability of domestic mirrors... In short, K3s fits my needs better.
I'm a beginner just starting to explore K3s. Please go easy on me if there are any mistakes~
Network premise: This article assumes that readers have a basic understanding of BGP concepts (AS, iBGP, Route Reflector) and have a configurable BGP router environment. If you don't have these foundations, you can read my DN42 series articles first...?
Analysis
Choice of CNI Component
My current network architecture looks like this:
graph TD
subgraph ZeroTier Domestic
subgraph WDS
Gateway <--> VM1
Gateway <--> VM2
end
NGB <--> Gateway
HFE-NAS <--> Gateway
NGB <--> HFE-NAS
end
subgraph IEPL
Global-NIC <==OSPF==> CN-NIC
end
subgraph ZeroTier Global
HKG02 <--> HKG04
TYO <--> HKG04
TYO <--> HKG02
end
CN-NIC <--> NGB
CN-NIC <--> HFE-NAS
CN-NIC <--OSPF--> Gateway
Global-NIC <--OSPF--> TYO
Global-NIC <--OSPF--> HKG02
Global-NIC <--OSPF--> HKG04
%% Styling: orange background, bold border for routers
classDef router fill:#f96,stroke:#333,stroke-width:2px,font-weight:bold;
class Global-NIC,CN-NIC,Gateway router;
Among these, the WDS node is a Proxmox VE hosting multiple VMs. It broadcasts the IPv4 prefixes of its VMs via OSPF. When a Hong Kong node needs to access a VM under the WDS node, it can do so by joining the OSPF internal network and achieving multi-hop reachability. This results in only one layer of encapsulation, so there's no worry about MTU shrinking.
I plan to create two new VMs under WDS as the master and one node (tentatively named KubeMaster and KubeNode-WDS1), and then also use HKG04 (tentatively named KubeNode-HKG04) as a node to join the K3s cluster.
The simplest way would be to use K3s's default Flannel as the CNI. However, Flannel is based on VXLAN, and adding it on top of my existing internal network would cause MTU shrinkage like this:
Data packet -> Flannel VXLAN encapsulation -> ZeroTier encapsulation -> physical link
The actual usable MTU for inter-container communication would be compressed to around 1350 or even lower. Therefore, I tried to find a CNI solution that can work directly on this internal network, and then I found Calico. From my understanding, Calico uses BGP as its underlying routing protocol and supports starting in No-Encapsulation mode, where data packets are directly routed by the upper-layer routers. So I chose Calico as the CNI component.
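The shrinkage is easy to estimate with some back-of-the-envelope arithmetic. The overhead figures below are approximations (actual ZeroTier and VXLAN overhead varies with configuration), and in practice extra headers push the usable value down toward 1350:

```shell
# Rough MTU budget for VXLAN-over-ZeroTier (overhead values are approximate)
PHY_MTU=1500            # physical link MTU
ZT_OVERHEAD=80          # ZeroTier encapsulation (1500 - 80 = the 1420 virtual MTU)
VXLAN_OVERHEAD=50       # outer IP + UDP + VXLAN headers added by Flannel
echo "Pod MTU with Flannel VXLAN: ~$((PHY_MTU - ZT_OVERHEAD - VXLAN_OVERHEAD))"
```

Running directly on the ZeroTier underlay without a second encapsulation layer avoids losing those 50 bytes entirely.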
Routing Design
KubeMaster and KubeNode-WDS1 sit under the Proxmox VE host, so their BGP sessions with KubeNode-HKG04 have to cross the entire internal network. For the routers at intermediate hops to know how to forward Pod IPs, each of them must learn the full set of BGP routes, enabling a routing path like this:
graph LR
subgraph WDS
KubeMaster
KubeNode-WDS1
Gateway
end
subgraph IEPL
CN-Namespace
Global-Namespace
end
KubeNode-WDS1 <--> Gateway
KubeMaster <--> Gateway <--> CN-Namespace <--> Global-Namespace <--> HKG04
%% Styling: highlight nodes with routing capability
classDef router fill:#f96,stroke:#333,stroke-width:2px,font-weight:bold;
class Gateway,CN-Namespace,Global-Namespace router;
Otherwise, any hop in the middle would drop the packets because it has no route for the source or destination IP. Additionally, because routes learned from one iBGP neighbor are not re-advertised to other iBGP neighbors, the routers along the path (Gateway, CN-Namespace, Global-Namespace) must act as Route Reflectors toward the nodes on their BGP sessions; otherwise, the nodes cannot correctly learn each other's routes.
That said, this architecture is actually better suited for BGP Confederation. However, my existing network is already quite complex, and adding BGP Confederation would complicate maintenance later. Moreover, my number of nodes is small, so the overhead of iBGP Full Mesh is acceptable. Definitely not because I'm lazy (
So the final network routing structure looks like this:
graph TD
subgraph WDS
KubeMaster
KubeNode-WDS1
Gateway
end
subgraph IEPL
CN-Namespace
Global-Namespace
end
%% BGP logical connections
KubeMaster <-. Calico iBGP Full Mesh .-> KubeNode-WDS1
KubeMaster <-- iBGP Route Reflector --> Gateway
KubeNode-WDS1 <-- iBGP Route Reflector --> Gateway
Gateway <-- iBGP --> CN-Namespace
CN-Namespace <-- iBGP --> Global-Namespace
Global-Namespace <-- iBGP Route Reflector --> HKG04
%% Redundant and cross-domain connections
Gateway <-- iBGP --> Global-Namespace
HKG04 <-. Calico iBGP Full Mesh .-> KubeMaster
KubeNode-WDS1 <-. Calico iBGP Full Mesh .-> HKG04
%% Styling
classDef router fill:#f96,stroke:#333,stroke-width:2px,font-weight:bold;
%% Mark nodes with routing or RR functions as Router
class Gateway,CN-Namespace,Global-Namespace router;
The dashed BGP sessions are automatically created by Calico; the solid ones need to be manually configured.
Keeping Calico's own iBGP Full Mesh is for future scalability, allowing nodes to establish direct P2P connections via ZeroTier whenever possible, rather than routing through the Route Reflector aggregation router.
Deployment
After clarifying the structure, deployment becomes simple.
Enable Kernel Forwarding and Disable rp_filter
Standard practice.
echo "net.ipv4.ip_forward=1" >> /etc/sysctl.conf
echo "net.ipv6.conf.default.forwarding=1" >> /etc/sysctl.conf
echo "net.ipv6.conf.all.forwarding=1" >> /etc/sysctl.conf
echo "net.ipv4.conf.default.rp_filter=0" >> /etc/sysctl.conf
echo "net.ipv4.conf.all.rp_filter=0" >> /etc/sysctl.conf
sysctl -p
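To confirm the settings actually took effect (appending to /etc/sysctl.conf silently loses to later overrides from /etc/sysctl.d/ on some distributions), a quick read-back from /proc is a sketch worth running on each node:

```shell
# Read the live kernel values back; expect ip_forward = 1 and rp_filter = 0
for key in net/ipv4/ip_forward net/ipv4/conf/all/rp_filter; do
    printf '%s = %s\n' "$key" "$(cat /proc/sys/$key)"
done
```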
Install K3s
Master
Since the KubeMaster control plane node is in China, it's best to configure image acceleration:
mkdir -p /etc/rancher/k3s
cat <<EOF > /etc/rancher/k3s/registries.yaml
mirrors:
docker.io:
endpoint:
- "https://docker.m.daocloud.io"
quay.io:
endpoint:
- "https://quay.m.daocloud.io"
EOF
Install using the mirror:
curl -sfL https://rancher-mirror.rancher.cn/k3s/k3s-install.sh | \
INSTALL_K3S_MIRROR=cn INSTALL_K3S_EXEC=" \
--flannel-backend=none \
--disable-network-policy \
--cluster-cidr=10.42.0.0/16" sh -
Note that you need to specify --flannel-backend=none and --disable-network-policy to disable the default CNI components.
View the token with cat /var/lib/rancher/k3s/server/node-token and record it.
Worker Nodes
Configure image acceleration for nodes in China:
mkdir -p /etc/rancher/k3s
cat <<EOF > /etc/rancher/k3s/registries.yaml
mirrors:
docker.io:
endpoint:
- "https://docker.m.daocloud.io"
quay.io:
endpoint:
- "https://quay.m.daocloud.io"
EOF
Then install K3s using the mirror and join the cluster:
export INSTALL_K3S_MIRROR=cn
export K3S_URL=https://<master node IP>:6443 # Replace with your master node's actual IP
export K3S_TOKEN=K10...your token...::server:xxx # Replace with the full token from the first step
curl -sfL https://rancher-mirror.rancher.cn/k3s/k3s-install.sh | sh -
At this point, the nodes should be in NotReady state because the CNI component is missing.
Install Calico and Configure No-Encapsulation Mode
Download https://raw.githubusercontent.com/projectcalico/calico/v3.26.1/manifests/tigera-operator.yaml on the master, then install the Calico operator:
kubectl create -f tigera-operator.yaml
Create a custom resource file custom-resource.yaml:
apiVersion: operator.tigera.io/v1
kind: Installation
metadata:
name: default
spec:
# Add image registry configuration
registry: quay.m.daocloud.io
calicoNetwork:
ipPools:
- blockSize: 26
cidr: 10.42.0.0/16
encapsulation: None
natOutgoing: Enabled
nodeSelector: all()
Here, encapsulation: None enables No-Encapsulation mode. You can also change the IPv4 CIDR here, but it must match the --cluster-cidr passed to K3s earlier. Then:
kubectl apply -f custom-resource.yaml
This executes the installation. Check pod status with:
kubectl get pods -A -o wide
Wait for each node to finish pulling images.
Configure BGP Topology
Label Nodes
Label nodes to specify that nodes under WDS should connect to the WDS Gateway's BGP, and overseas nodes should connect to the Global Namespace's BGP:
kubectl label nodes kubemaster region=WDS
kubectl label nodes kubenode-wds-1 region=WDS
kubectl label nodes kubenode-hkg04 region=Global
Calico Configuration
Write a YAML configuration file:
apiVersion: crd.projectcalico.org/v1
kind: BGPPeer
metadata:
name: route-reflector-domestic
spec:
nodeSelector: region == 'Domestic' # This part isn't actually used; I originally designed a domestic aggregation router
peerIP: 100.64.0.108
asNumber: 64512
---
apiVersion: crd.projectcalico.org/v1
kind: BGPPeer
metadata:
name: route-reflector-wds
spec:
nodeSelector: region == 'WDS'
peerIP: 192.168.100.1
asNumber: 64512
---
apiVersion: crd.projectcalico.org/v1
kind: BGPPeer
metadata:
name: route-reflector-global
spec:
nodeSelector: region == 'Global'
peerIP: 100.64.1.106
asNumber: 64512
This means:
- All nodes with the label region=Domestic get a BGP session to 100.64.0.108 (domestic aggregation router) using AS 64512.
- All nodes with the label region=WDS get a BGP session to 192.168.100.1 (the gateway for all VMs under the WDS node) using AS 64512.
- All nodes with the label region=Global get a BGP session to 100.64.1.106 (overseas aggregation router) using AS 64512.
This achieves the diagram above: all VMs under WDS, including the master and KubeNode-WDS1, connect to the WDS Gateway aggregation router, and all nodes in the overseas region connect to the overseas aggregation router.
Configure Aggregation Router iBGP
This part is straightforward: just write Bird configuration files. Here are some examples:
k3s/ibgp.conf:
function is_insider_as(){
if bgp_path.len > 0 && !(bgp_path ~ [= 64512 =]) then {
return false;
}
if net ~ [ 10.42.0.0/16{16,32} ] then {
return true;
}
return false;
}
template bgp k3sbackbone{
local as K3S_AS;
router id INTRA_ROUTER_ID;
neighbor as K3S_AS;
ipv4{
table intra_table_v4;
import filter{
if is_insider_as() then accept;
reject;
};
export filter{
if is_insider_as() then accept;
reject;
};
next hop self;
extended next hop;
};
ipv6{
table intra_table_v6;
import filter{
if is_insider_as() then accept;
reject;
};
export filter{
if is_insider_as() then accept;
reject;
};
next hop self;
};
};
template bgp k3speers{
local as K3S_AS;
neighbor as K3S_AS;
router id INTRA_ROUTER_ID;
rr client;
rr cluster id INTRA_ROUTER_ID;
ipv4{
table intra_table_v4;
import filter{
if is_insider_as() then accept;
reject;
};
export filter{
if is_insider_as() then accept;
reject;
};
next hop self;
};
ipv6{
table intra_table_v6;
import filter{
if is_insider_as() then accept;
reject;
};
export filter{
if is_insider_as() then accept;
reject;
};
next hop self;
};
};
include "ibgpeers/*";
ibgpeers/backbone-cn.conf:
protocol bgp 'k3s_backbone_cn_v4' from k3sbackbone{
neighbor fd18:3e15:61d0:cafe:f001::1;
};
ibgpeers/master.conf:
protocol bgp 'k3s_master_v4' from k3speers{
neighbor 192.168.100.251;
};
The key points: aggregation routers should not enable Route Reflector among themselves, and remember to enable next hop self.
After everything is done, kubectl get nodes should show all nodes as Ready:
NAME STATUS ROLES AGE VERSION
kubemaster Ready control-plane 2d23h v1.34.5+k3s1
kubenode-hkg04 Ready <none> 11h v1.34.6+k3s1
kubenode-wds-1 Ready <none> 2d7h v1.34.5+k3s1
Check pods with kubectl get pods -A -o wide:
NAMESPACE NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
calico-system calico-kube-controllers-64fc874957-6bdlz 1/1 Running 0 5h38m 10.42.253.136 kubenode-hkg04 <none> <none>
calico-system calico-node-2qz82 1/1 Running 0 4h24m 10.2.5.7 kubenode-hkg04 <none> <none>
calico-system calico-node-dhl2c 1/1 Running 0 4h24m 192.168.100.251 kubemaster <none> <none>
calico-system calico-node-nbpkj 1/1 Running 0 4h23m 192.168.100.252 kubenode-wds-1 <none> <none>
calico-system calico-typha-7bb5db4bdc-rfpwg 1/1 Running 0 5h38m 10.2.5.7 kubenode-hkg04 <none> <none>
calico-system calico-typha-7bb5db4bdc-rwwr5 1/1 Running 0 5h38m 192.168.100.251 kubemaster <none> <none>
calico-system csi-node-driver-jglwp 2/2 Running 0 5h38m 10.42.64.68 kubenode-wds-1 <none> <none>
calico-system csi-node-driver-jqjsc 2/2 Running 0 5h38m 10.42.253.137 kubenode-hkg04 <none> <none>
calico-system csi-node-driver-vk26s 2/2 Running 0 5h38m 10.42.141.16 kubemaster <none> <none>
kube-system coredns-695cbbfcb9-8fx4p 1/1 Running 1 (7h27m ago) 2d23h 10.42.141.14 kubemaster <none> <none>
kube-system helm-install-traefik-crd-5bkwx 0/1 Completed 0 2d23h <none> kubemaster <none> <none>
kube-system helm-install-traefik-m9fgj 0/1 Completed 1 2d23h <none> kubemaster <none> <none>
kube-system local-path-provisioner-546dfc6456-dmn4g 1/1 Running 1 (7h27m ago) 2d23h 10.42.141.15 kubemaster <none> <none>
kube-system metrics-server-c8774f4f4-2wkwh 1/1 Running 1 (7h27m ago) 2d23h 10.42.141.12 kubemaster <none> <none>
kube-system svclb-traefik-999cddce-hpmcm 2/2 Running 6 (7h26m ago) 11h 10.42.253.134 kubenode-hkg04 <none> <none>
kube-system svclb-traefik-999cddce-q4225 2/2 Running 2 (7h27m ago) 2d22h 10.42.141.9 kubemaster <none> <none>
kube-system svclb-traefik-999cddce-xmd64 2/2 Running 2 (7h26m ago) 2d6h 10.42.64.66 kubenode-wds-1 <none> <none>
kube-system traefik-788bc4688c-vbbhj 1/1 Running 1 (7h27m ago) 2d22h 10.42.141.13 kubemaster <none> <none>
tigera-operator tigera-operator-6b95bbf4db-vl46l 1/1 Running 1 (7h27m ago) 2d23h 192.168.100.251 kubemaster <none> <none>
Use kubectl exec -it -n calico-system <calico-node-xxxx> -- birdcl s p to check Bird status:
root@KubeMaster:~/kube/calico# kubectl exec -it -n calico-system calico-node-2qz82 -- birdcl s p
Defaulted container "calico-node" out of: calico-node, flexvol-driver (init), install-cni (init)
BIRD v0.3.3+birdv1.6.8 ready.
name proto table state since info
static1 Static master up 08:58:17
kernel1 Kernel master up 08:58:17
device1 Device master up 08:58:17
direct1 Direct master up 08:58:17
Mesh_192_168_100_251 BGP master up 08:58:33 Established
Mesh_192_168_100_252 BGP master up 08:59:00 Established
Node_100_64_1_106 BGP master up 12:57:44 Established
ip r shows the system routing table:
root@KubeMaster:~/kube/calico# ip r
default via 192.168.100.1 dev eth0 proto static
10.42.64.64/26 proto bird
nexthop via 192.168.100.1 dev eth0 weight 1
nexthop via 192.168.100.252 dev eth0 weight 1
blackhole 10.42.141.0/26 proto bird
10.42.141.9 dev caliac6501d3794 scope link
10.42.141.12 dev calib07c23291bb scope link
10.42.141.13 dev caliab16e60bd19 scope link
10.42.141.14 dev calid5959219080 scope link
10.42.141.15 dev cali026d8f1ddb7 scope link
10.42.141.16 dev califa657ba417a scope link
10.42.253.128/26 via 192.168.100.1 dev eth0 proto bird
192.168.100.0/24 dev eth0 proto kernel scope link src 192.168.100.251
Ping a Pod IP; if everything is fine, it should work directly:
root@KubeMaster:~/kube/calico# ping 10.42.253.137
PING 10.42.253.137 (10.42.253.137) 56(84) bytes of data.
64 bytes from 10.42.253.137: icmp_seq=1 ttl=60 time=33.7 ms
64 bytes from 10.42.253.137: icmp_seq=2 ttl=60 time=33.5 ms
^C
--- 10.42.253.137 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1002ms
rtt min/avg/max/mdev = 33.546/33.632/33.718/0.086 ms
Tune MTU
This step is actually for stability...?
Testing shows that although my ZeroTier MTU is 1420, fragmentation starts once packets exceed roughly 1380 bytes (probe with ping -M do -s <size> <Pod_IP>; -M do sets the Don't Fragment bit, so oversized probes fail outright instead of being silently fragmented). Therefore, force the Pod MTU down to 1370:
root@KubeMaster:~/kube/calico# cat patch-mtu.yaml
apiVersion: operator.tigera.io/v1
kind: Installation
metadata:
name: default
spec:
calicoNetwork:
mtu: 1370
nodeAddressAutodetectionV4:
firstFound: true
root@KubeMaster:~/kube/calico# kubectl apply -f patch-mtu.yaml
installation.operator.tigera.io/default configured
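The 1370 figure follows from the probe arithmetic. A minimal sketch (the 1352-byte payload is inferred from where fragmentation began in my tests; treat it as an assumption and substitute your own probe result):

```shell
# Largest ping -M do -s payload that went through unfragmented (assumption)
LARGEST_PAYLOAD=1352
# The on-wire packet adds a 20-byte IPv4 header and an 8-byte ICMP header
PATH_MTU=$((LARGEST_PAYLOAD + 28))
echo "path MTU ~= $PATH_MTU; Pod MTU set to $((PATH_MTU - 10)) for headroom"
```

Leaving a few bytes of headroom below the measured path MTU guards against occasional extra overhead on the ZeroTier path.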