Building a Cross-Region K3s Cluster from Scratch - Ep.1 Calico No-Encapsulation CNI
# Preface

I've actually wanted to play with a K8s cluster for a long time, but always felt that without sufficient knowledge it would be too difficult to attempt. Recently I spent some time studying DN42 and routing protocols like BGP and OSPF, and it no longer feels so daunting. So I decisively started with K3s (the main reason for choosing K3s over K8s is its lightweight nature: low resource requirements, no need to pull a pile of images for deployment, availability of domestic mirrors... in short, K3s suits my needs better).

I'm a beginner just starting to explore K3s, so please go easy on me if I make any mistakes~

# Analysis

## Choosing the CNI Component

My current network architecture looks like this:

```mermaid
graph TD
    subgraph ZeroTier Domestic
        subgraph WDS
            Gateway <--> VM1
            Gateway <--> VM2
        end
        NGB <--> Gateway
        HFE-NAS <--> Gateway
        NGB <--> HFE-NAS
    end
    subgraph IEPL
        Global-NIC <==OSPF==> CN-NIC
    end
    subgraph ZeroTier Global
        HKG02 <--> HKG04
        TYO <--> HKG04
        TYO <--> HKG02
    end
    CN-NIC <--> NGB
    CN-NIC <--> HFE-NAS
    CN-NIC <--OSPF--> Gateway
    Global-NIC <--OSPF--> TYO
    Global-NIC <--OSPF--> HKG02
    Global-NIC <--OSPF--> HKG04
    %% Style definition: orange background, bold border to represent routers
    classDef router fill:#f96,stroke:#333,stroke-width:2px,font-weight:bold;
    class Global-NIC,CN-NIC,Gateway router;
```

Here, the WDS node is a Proxmox VE host with multiple VMs underneath; it advertises its VMs' IPv4 prefixes via OSPF. When the Hong Kong nodes need to reach a VM under WDS, they can do so through the OSPF internal network, achieving multi-hop reachability. This keeps the encapsulation to a single layer, so there is no MTU "disappearing act" to worry about.

I plan to create two new VMs under WDS to serve as the master and a worker node (tentatively named KubeMaster and KubeNode-WDS1). HKG04 (tentatively KubeNode-HKG04) will also join the K3s cluster as a worker node.

The simplest approach would be to use K3s's default Flannel as the CNI. However, Flannel is based on VXLAN, and stacking it on top of my existing internal network would lead to the following MTU "disappearing act":

Data packet -> Flannel VXLAN encapsulation -> ZeroTier encapsulation -> Physical link

The actual usable MTU for inter-container communication would likely be squeezed down to 1350 or even lower. I therefore looked for a CNI that can work directly on top of this internal network, and found Calico. As I understand it, Calico uses BGP as its underlying routing protocol, supports a no-encapsulation (No-Encap) mode, and hands packets directly to the upstream routers for routing. Thus, I chose Calico as the CNI component.

## Routing Design

To ensure that intermediate routers know how to route Pod IPs: KubeMaster and KubeNode-WDS1 sit under the Proxmox VE host and need to establish BGP with HKG04 across the entire internal network. This means every router at each intermediate hop must learn the full BGP routes, so that the following routing path can be established:

```mermaid
graph LR
    subgraph WDS
        KubeMaster
        KubeNode-WDS1
        Gateway
    end
    subgraph IEPL
        CN-Namespace
        Global-Namespace
    end
    KubeNode-WDS1 <--> Gateway
    KubeMaster <--> Gateway <--> CN-Namespace <--> Global-Namespace <--> HKG04
    %% Style definition: highlight nodes with routing capability
    classDef router fill:#f96,stroke:#333,stroke-width:2px,font-weight:bold;
    class Gateway,CN-Namespace,Global-Namespace router;
```

Otherwise, any intermediate hop would drop packets because it doesn't recognize the source/destination IP.
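As a sanity check during bring-up, it helps to ask an intermediate router whether it actually holds a route for the Pod network. A minimal sketch, assuming the cluster CIDR 10.42.0.0/16 that is configured later in this article:

```bash
# Run on each intermediate hop (Gateway, CN-Namespace, Global-Namespace):
ip route get 10.42.0.1
# The reply should point at the expected next hop towards the cluster.
# If it falls through to a default route, or returns
# "RTNETLINK answers: Network is unreachable", that hop has not learned
# the Pod routes yet and would drop the traffic.
```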
Also, because of the iBGP rule that routes learned from one iBGP neighbor are not propagated to other iBGP neighbors, all the BGP sessions between Gateway, CN-Namespace, Global-Namespace and the nodes need Route Reflector enabled; otherwise the nodes cannot correctly learn each other's routes. Admittedly, this architecture would be a better fit for BGP Confederation, but my existing network is already quite complex, and adding confederations would make later maintenance more troublesome. Besides, my node count is small, so the overhead of an iBGP full mesh is acceptable. It's definitely not because I'm lazy.

Thus, the final routing structure is as follows:

```mermaid
graph TD
    subgraph WDS
        VM1
        VM2
        Gateway
    end
    subgraph IEPL
        CN-Namespace
        Global-Namespace
    end
    VM1 <-.Calico iBGP Full Mesh.-> VM2
    VM1 <--iBGP Route Reflector--> Gateway
    VM2 <--iBGP Route Reflector--> Gateway <--iBGP--> CN-Namespace <--iBGP--> Global-Namespace <--iBGP Route Reflector--> HKG04
    Gateway <--iBGP--> Global-Namespace
    HKG04 <-.Calico iBGP Full Mesh.-> VM1
    VM2 <-.Calico iBGP Full Mesh.-> HKG04
    %% Style definition
    classDef router fill:#f96,stroke:#333,stroke-width:2px,font-weight:bold;
    %% Mark nodes with routing/forwarding or RR functions as Router
    class Gateway,CN-Namespace,Global-Namespace router;
```

The dashed BGP sessions are created automatically by Calico; the solid ones need to be created manually. Keeping Calico's own iBGP full mesh is for future scalability, so that nodes can preferentially establish direct P2P connections via ZeroTier instead of detouring through the Route Reflector aggregation router.

# Deployment

With the structure clear, deployment is simple.

## Enable Kernel Forwarding and Disable rp_filter

Standard practice:

```bash
echo "net.ipv4.ip_forward=1" >> /etc/sysctl.conf
echo "net.ipv6.conf.default.forwarding=1" >> /etc/sysctl.conf
echo "net.ipv6.conf.all.forwarding=1" >> /etc/sysctl.conf
echo "net.ipv4.conf.default.rp_filter=0" >> /etc/sysctl.conf
echo "net.ipv4.conf.all.rp_filter=0" >> /etc/sysctl.conf
sysctl -p
```

## Install the K3s Master

Because the KubeMaster control-plane node is located inside China, it's best to configure image acceleration:

```bash
mkdir -p /etc/rancher/k3s
cat <<EOF > /etc/rancher/k3s/registries.yaml
mirrors:
  docker.io:
    endpoint:
      - "https://docker.m.daocloud.io"
  quay.io:
    endpoint:
      - "https://quay.m.daocloud.io"
EOF
```

Install using the mirror:

```bash
curl -sfL https://rancher-mirror.rancher.cn/k3s/k3s-install.sh | \
  INSTALL_K3S_MIRROR=cn INSTALL_K3S_EXEC=" \
  --flannel-backend=none \
  --disable-network-policy \
  --cluster-cidr=10.42.0.0/16" sh -
```

Note that --flannel-backend=none and --disable-network-policy must be specified to disable the default CNI component.

Use cat /var/lib/rancher/k3s/server/node-token to view the token and record it.

## Worker Nodes

For nodes inside China, configure image acceleration the same way:

```bash
mkdir -p /etc/rancher/k3s
cat <<EOF > /etc/rancher/k3s/registries.yaml
mirrors:
  docker.io:
    endpoint:
      - "https://docker.m.daocloud.io"
  quay.io:
    endpoint:
      - "https://quay.m.daocloud.io"
EOF
```

Then install K3s via the mirror and join the cluster:

```bash
export INSTALL_K3S_MIRROR=cn
export K3S_URL=https://<master node IP>:6443   # Replace with your master node's actual IP
export K3S_TOKEN=K10...your token...::server:xxx   # Replace with the full token obtained above
curl -sfL https://rancher-mirror.rancher.cn/k3s/k3s-install.sh | sh -
```

At this point the status of each node should be NotReady, because the CNI component is missing.
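Before installing the CNI you can confirm this expected half-ready state from the master; a quick check (node names and versions below are just my own setup):

```bash
kubectl get nodes
# Expected at this stage: every node reports NotReady, e.g.
# NAME         STATUS     ROLES           AGE   VERSION
# kubemaster   NotReady   control-plane   2m    v1.34.5+k3s1
```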
## Install Calico and Configure No-Encap Mode

On the master, manually download https://raw.githubusercontent.com/projectcalico/calico/v3.26.1/manifests/tigera-operator.yaml and install the Calico operator:

```bash
kubectl create -f tigera-operator.yaml
```

Configure a custom resource by creating a custom-resource.yaml file:

```yaml
apiVersion: operator.tigera.io/v1
kind: Installation
metadata:
  name: default
spec:
  # Add image registry configuration
  registry: quay.m.daocloud.io
  calicoNetwork:
    ipPools:
      - blockSize: 26
        cidr: 10.42.0.0/16
        encapsulation: None
        natOutgoing: Enabled
        nodeSelector: all()
```

Here, encapsulation: None enables No-Encap mode. You can also modify the IPv4 CIDR if needed. Then run:

```bash
kubectl apply -f custom-resource.yaml
```

to perform the installation. Use:

```bash
kubectl get pods -A -o wide
```

to check Pod status, waiting for each node to finish pulling images.

## Configure the BGP Topology

### Label Nodes

Label the nodes so that nodes under WDS peer with the Gateway's BGP inside the WDS node, and nodes outside China peer with the BGP of the Global namespace:

```bash
kubectl label nodes kubemaster region=WDS
kubectl label nodes kubenode-wds-1 region=WDS
kubectl label nodes kubenode-hkg04 region=Global
```

### Calico Configuration

Create a YAML configuration file:

```yaml
apiVersion: crd.projectcalico.org/v1
kind: BGPPeer
metadata:
  name: route-reflector-domestic
spec:
  nodeSelector: region == 'Domestic'   # Not actually used; I originally planned a general aggregation router in the Domestic area
  peerIP: 100.64.0.108
  asNumber: 64512
---
apiVersion: crd.projectcalico.org/v1
kind: BGPPeer
metadata:
  name: route-reflector-wds
spec:
  nodeSelector: region == 'WDS'
  peerIP: 192.168.100.1
  asNumber: 64512
---
apiVersion: crd.projectcalico.org/v1
kind: BGPPeer
metadata:
  name: route-reflector-global
spec:
  nodeSelector: region == 'Global'
  peerIP: 100.64.1.106
  asNumber: 64512
```

This means:

- All nodes labeled region=Domestic establish a BGP session to 100.64.0.108 (the domestic aggregation router) using AS 64512.
- All nodes labeled region=WDS establish a BGP session to 192.168.100.1 (the Gateway for all VMs under the WDS node) using AS 64512.
- All nodes labeled region=Global establish a BGP session to 100.64.1.106 (the overseas aggregation router) using AS 64512.

This achieves what the diagram shows: all VMs under the WDS node, including KubeMaster and KubeNode-WDS1, connect to the WDS node's Gateway aggregation router, and all overseas nodes connect to the overseas aggregation router.

## Configure Aggregation Router iBGP

This part is simply a matter of writing Bird configuration files (easy).
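Before moving to the router side, you can already check whether calico-node is attempting those sessions. A hedged check, assuming the calicoctl binary is installed on a cluster node:

```bash
# Must run as root directly on a cluster node.
sudo calicoctl node status
# The "IPv4 BGP status" table should list the peers configured above
# (192.168.100.1 / 100.64.1.106); they will stay in Connect/Active
# until the Bird configuration described below is in place.
```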
Here are a few example Bird configuration files.

k3s/ibgp.conf:

```
function is_insider_as(){
    if bgp_path.len > 0 && !(bgp_path ~ [= 64512 =]) then {
        return false;
    }
    if net ~ [ 10.42.0.0/16{16,32} ] then {
        return true;
    }
    return false;
}

template bgp k3sbackbone {
    local as K3S_AS;
    router id INTRA_ROUTER_ID;
    neighbor as K3S_AS;
    ipv4 {
        table intra_table_v4;
        import filter {
            if is_insider_as() then accept;
            reject;
        };
        export filter {
            if is_insider_as() then accept;
            reject;
        };
        next hop self;
        extended next hop;
    };
    ipv6 {
        table intra_table_v6;
        import filter {
            if is_insider_as() then accept;
            reject;
        };
        export filter {
            if is_insider_as() then accept;
            reject;
        };
        next hop self;
    };
};

template bgp k3speers {
    local as K3S_AS;
    neighbor as K3S_AS;
    router id INTRA_ROUTER_ID;
    rr client;
    rr cluster id INTRA_ROUTER_ID;
    ipv4 {
        table intra_table_v4;
        import filter {
            if is_insider_as() then accept;
            reject;
        };
        export filter {
            if is_insider_as() then accept;
            reject;
        };
        next hop self;
    };
    ipv6 {
        table intra_table_v6;
        import filter {
            if is_insider_as() then accept;
            reject;
        };
        export filter {
            if is_insider_as() then accept;
            reject;
        };
        next hop self;
    };
};

include "ibgpeers/*";
```

ibgpeers/backbone-cn.conf:

```
protocol bgp 'k3s_backbone_cn_v4' from k3sbackbone {
    neighbor fd18:3e15:61d0:cafe:f001::1;
};
```

ibgpeers/master.conf:

```
protocol bgp 'k3s_master_v4' from k3speers {
    neighbor 192.168.100.251;
};
```

Main points: it's best not to enable Route Reflector between the aggregation routers themselves, and remember to enable next hop self.

Once everything is done, kubectl get nodes should show all nodes as Ready:

```
NAME             STATUS   ROLES           AGE     VERSION
kubemaster       Ready    control-plane   2d23h   v1.34.5+k3s1
kubenode-hkg04   Ready    <none>          11h     v1.34.6+k3s1
kubenode-wds-1   Ready    <none>          2d7h    v1.34.5+k3s1
```

Use kubectl get pods -A -o wide to view the Pods:

```
NAMESPACE         NAME                                       READY   STATUS      RESTARTS        AGE     IP                NODE             NOMINATED NODE   READINESS GATES
calico-system     calico-kube-controllers-64fc874957-6bdlz   1/1     Running     0               5h38m   10.42.253.136     kubenode-hkg04   <none>           <none>
calico-system     calico-node-2qz82                          1/1     Running     0               4h24m   10.2.5.7          kubenode-hkg04   <none>           <none>
calico-system     calico-node-dhl2c                          1/1     Running     0               4h24m   192.168.100.251   kubemaster       <none>           <none>
calico-system     calico-node-nbpkj                          1/1     Running     0               4h23m   192.168.100.252   kubenode-wds-1   <none>           <none>
calico-system     calico-typha-7bb5db4bdc-rfpwg              1/1     Running     0               5h38m   10.2.5.7          kubenode-hkg04   <none>           <none>
calico-system     calico-typha-7bb5db4bdc-rwwr5              1/1     Running     0               5h38m   192.168.100.251   kubemaster       <none>           <none>
calico-system     csi-node-driver-jglwp                      2/2     Running     0               5h38m   10.42.64.68       kubenode-wds-1   <none>           <none>
calico-system     csi-node-driver-jqjsc                      2/2     Running     0               5h38m   10.42.253.137     kubenode-hkg04   <none>           <none>
calico-system     csi-node-driver-vk26s                      2/2     Running     0               5h38m   10.42.141.16      kubemaster       <none>           <none>
kube-system       coredns-695cbbfcb9-8fx4p                   1/1     Running     1 (7h27m ago)   2d23h   10.42.141.14      kubemaster       <none>           <none>
kube-system       helm-install-traefik-crd-5bkwx             0/1     Completed   0               2d23h   <none>            kubemaster       <none>           <none>
kube-system       helm-install-traefik-m9fgj                 0/1     Completed   1               2d23h   <none>            kubemaster       <none>           <none>
kube-system       local-path-provisioner-546dfc6456-dmn4g    1/1     Running     1 (7h27m ago)   2d23h   10.42.141.15      kubemaster       <none>           <none>
kube-system       metrics-server-c8774f4f4-2wkwh             1/1     Running     1 (7h27m ago)   2d23h   10.42.141.12      kubemaster       <none>           <none>
kube-system       svclb-traefik-999cddce-hpmcm               2/2     Running     6 (7h26m ago)   11h     10.42.253.134     kubenode-hkg04   <none>           <none>
kube-system       svclb-traefik-999cddce-q4225               2/2     Running     2 (7h27m ago)   2d22h   10.42.141.9       kubemaster       <none>           <none>
kube-system       svclb-traefik-999cddce-xmd64               2/2     Running     2 (7h26m ago)   2d6h    10.42.64.66       kubenode-wds-1   <none>           <none>
kube-system       traefik-788bc4688c-vbbhj                   1/1     Running     1 (7h27m ago)   2d22h   10.42.141.13      kubemaster       <none>           <none>
tigera-operator   tigera-operator-6b95bbf4db-vl46l           1/1     Running     1 (7h27m ago)   2d23h   192.168.100.251   kubemaster       <none>           <none>
```

Use kubectl exec -it -n calico-system <calico-node-xxxx> -- birdcl s p to check the status of Bird:

```
root@KubeMaster:~/kube/calico# kubectl exec -it -n calico-system calico-node-2qz82 -- birdcl s p
Defaulted container "calico-node" out of: calico-node, flexvol-driver (init), install-cni (init)
BIRD v0.3.3+birdv1.6.8 ready.
name                  proto    table    state  since     info
static1               Static   master   up     08:58:17
kernel1               Kernel   master   up     08:58:17
device1               Device   master   up     08:58:17
direct1               Direct   master   up     08:58:17
Mesh_192_168_100_251  BGP      master   up     08:58:33  Established
Mesh_192_168_100_252  BGP      master   up     08:59:00  Established
Node_100_64_1_106     BGP      master   up     12:57:44  Established
```

ip r shows the system routing table:

```
root@KubeMaster:~/kube/calico# ip r
default via 192.168.100.1 dev eth0 proto static
10.42.64.64/26 proto bird
        nexthop via 192.168.100.1 dev eth0 weight 1
        nexthop via 192.168.100.252 dev eth0 weight 1
blackhole 10.42.141.0/26 proto bird
10.42.141.9 dev caliac6501d3794 scope link
10.42.141.12 dev calib07c23291bb scope link
10.42.141.13 dev caliab16e60bd19 scope link
10.42.141.14 dev calid5959219080 scope link
10.42.141.15 dev cali026d8f1ddb7 scope link
10.42.141.16 dev califa657ba417a scope link
10.42.253.128/26 via 192.168.100.1 dev eth0 proto bird
192.168.100.0/24 dev eth0 proto kernel scope link src 192.168.100.251
```

Ping a Pod's IP; if everything is fine, it should work directly:

```
root@KubeMaster:~/kube/calico# ping 10.42.253.137
PING 10.42.253.137 (10.42.253.137) 56(84) bytes of data.
64 bytes from 10.42.253.137: icmp_seq=1 ttl=60 time=33.7 ms
64 bytes from 10.42.253.137: icmp_seq=2 ttl=60 time=33.5 ms
^C
--- 10.42.253.137 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1002ms
rtt min/avg/max/mdev = 33.546/33.632/33.718/0.086 ms
```

## Tune the MTU

This step is really just for stability…? Testing showed that although my ZeroTier MTU is 1420, packets start to fragment at around 1392 bytes (test with ping -M do -s <packet size> <Pod_IP>). Therefore, force the Pod MTU down to 1370:

```yaml
# patch-mtu.yaml
apiVersion: operator.tigera.io/v1
kind: Installation
metadata:
  name: default
spec:
  calicoNetwork:
    mtu: 1370
    nodeAddressAutodetectionV4:
      firstFound: true
```

```
root@KubeMaster:~/kube/calico# kubectl apply -f patch-mtu.yaml
installation.operator.tigera.io/default configured
```
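To confirm the new ceiling holds end to end, a quick check I would run against the same Pod IP as above (payload sizes follow from ICMP payload = MTU - 20-byte IP header - 8-byte ICMP header):

```bash
# Pod MTU 1370 => a 1342-byte payload fills it exactly and must pass:
ping -M do -s 1342 10.42.253.137
# Beyond the ~1392-byte path ceiling measured earlier, DF-marked packets
# should fail (1370 + 28 = 1398 bytes on the wire):
ping -M do -s 1370 10.42.253.137
```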
DN42 - Ep.4 Configuring BGP Communities
# Foreword

I am a novice in BGP. This article may contain imprecise content, naive understandings, or elementary mistakes; I kindly ask the experts to be lenient. If you find any issues, you are welcome to contact me via email, and I will correct them as soon as possible. If you find this unacceptable, it is recommended to close this article now.

# What are BGP Communities?

TL;DR: BGP Communities "tag" routes, allowing others to use these tags for route optimization.

This concept might be unfamiliar to newcomers (like me), who might not initially understand its purpose. Simply put, BGP Communities are a mechanism for tagging routes, similar to adding labels. They allow network administrators to attach one or more "tags" (i.e., community values) to routes propagated via BGP. These tags do not themselves alter the route's path attributes (such as AS_PATH, LOCAL_PREF, or MED), but they provide a signaling mechanism that tells other routers, within the same AS or in downstream peer ASes, what policy or processing should be applied to a route.

BGP Communities can be used to:

- Simplify policy configuration: routers inside the network or in downstream ASes only need to configure policies based on community values (such as setting LOCAL_PREF, adding NO_EXPORT, or applying route-maps), without needing to know the specific prefix details. This makes policies more centralized, easier to manage, and less error-prone.
- Convey policy intent to downstream ASes: an AS can attach community values to the routes it advertises to its downstream customer or peer ASes. These communities convey requirements or suggestions on how the routes should be handled, such as route optimization based on geographic location, latency, or bandwidth.
- Coordinate policies within an AS: inside a large AS using iBGP full mesh or route reflectors, edge routers (receiving eBGP routes or redistributing routes) can tag routes with community values. Core routers or route reflectors can then recognize these communities and apply the corresponding internal policies (setting LOCAL_PREF or MED, deciding whether to advertise to certain iBGP peers, adding other communities, and so on), without needing complex prefix-based policies on every internal router.

DN42 has its own set of Communities specifications. For details, please refer to: BGP-communities - DN42 Wiki

# Configuration

The general idea and approach in this article are largely based on Xe_iu's method, focusing on adding BGP Communities for geographic information and performing route optimization based on them. Generally, this is sufficient. (Another reason is that I haven't fully grasped the others yet.)

Note: we should ONLY add geographic-information Communities to routes originating from our own AS, never to routes received from neighbors. Adding our own regional Communities to a neighbor's routes constitutes forging the route origin, potentially leading to route hijacking: downstream networks might misjudge the traffic path and send traffic that should go direct through your network instead, increasing latency and consuming your bandwidth. (Large Communities are an exception, as they have a verification mechanism to prevent this, but that's beyond the scope of this article.)

The idea is clear: when exporting routes, we first verify whether a route is our own, and if so, we add the community tags to it. The sample configuration provided by DN42 already includes two functions, is_self_net() and is_self_net_v6(), to check whether a route is our own.
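For reference, in the DN42 sample bird.conf these helpers are simple set-membership checks against your own assigned prefixes; roughly (OWNNETSET and OWNNETSETv6 are the prefix sets defined at the top of that sample configuration):

```
function is_self_net() {
  return net ~ OWNNETSET;
}

function is_self_net_v6() {
  return net ~ OWNNETSETv6;
}
```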
Therefore, writing the configuration is straightforward.

## Adding Communities to Routes

First, define the geographic region information for the current node at the beginning of the node configuration file. Please check BGP-communities - DN42 Wiki for the values:

```
define DN42_REGION = 52;    # 52 represents East Asia
define DN42_COUNTRY = 1344; # 1344 represents Hong Kong
```

Please modify these values according to your node's actual geographic location.

Then, modify the export filter in the dnpeers template:

```diff
 };
 export filter {
-    if is_valid_network() && source ~ [RTS_STATIC, RTS_BGP] then accept;
+    if is_valid_network() && source ~ [RTS_STATIC, RTS_BGP] then {
+        if (is_self_net()) then {                    # Check if it's our own route
+            bgp_community.add((64511, DN42_REGION)); # Add continent-level region info
+            bgp_community.add((64511, DN42_COUNTRY)); # Add country/region info
+        }
+        accept;
+    }
     reject;
 };
 import limit 1000 action block;
```

During export, this checks whether the route is our own; if so, it sets bgp_community according to the defined DN42_REGION and DN42_COUNTRY. Here, 64511 is the reserved public AS number used specifically for geographic tags (Region/Country); you can copy it as-is. Apply the same change to the IPv6 export rules, but replace is_self_net() with is_self_net_v6().

## Route Optimization Based on Communities

Here we need to introduce another concept: local_pref (Local Preference). It is used within an AS to indicate the priority of a route. Its default value is 100, and a higher value means higher priority. In BGP route selection, local_pref has the highest priority, higher even than AS_PATH length. This means that by setting local_pref, we can adjust route priorities and thereby optimize routing. Combined with the BGP Communities above, we can set local_pref according to the community values. Also, since BGP.local_pref is propagated within the AS, we need to take care with the import logic for both eBGP and iBGP routes.

My logic for handling BGP.local_pref here references (basically copies) Xe_iu's approach:

- Routes from the same region: priority +10
- Routes from the same country: additionally +5
- Routes received via direct peering with us: additionally +20

Create a function to calculate the priority:

```
function ebgp_calculate_priority() {
    int priority = 100;   # Base priority
    # Same region detection (+10)
    if bgp_community ~ [(64511, DN42_REGION)] then priority = priority + 10;
    # Same country detection (+5)
    if bgp_community ~ [(64511, DN42_COUNTRY)] then priority = priority + 5;
    # Direct eBGP neighbor detection (+20)
    if bgp_path.len = 1 then priority = priority + 20;
    return priority;
}
```

Then, in the import filter of the dnpeers template, set bgp_local_pref to the value calculated by the function:

```diff
 template bgp dnpeers {
     local as OWNAS;
     path metric 1;
     ipv4 {
         import filter {
             if is_valid_network() && !is_self_net() then {
                 if (roa_check(dn42_roa, net, bgp_path.last) != ROA_VALID) then {
                     print "[dn42] ROA check failed for ", net, " ASN ", bgp_path.last;
                     reject;
                 }
+                bgp_local_pref = ebgp_calculate_priority();
                 accept;
             }
             reject;
         };
```

Apply the same change for IPv6. After running birdc configure, we should be able to see that our routes have been tagged with the Communities at our neighbors.

(Screenshot source: https://lg.milu.moe/route_all/hk/172.20.234.224)

Special thanks to Nuro Trace and Xe_iu. They helped deepen my understanding of BGP Communities and provided much assistance.
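You can also preview the tags locally before asking a neighbor. If I recall correctly, BIRD's export mode runs the route through the protocol's export filter before display, so the added communities become visible; the protocol name and prefix below are illustrative, substitute your own:

```bash
birdc 'show route all 172.20.234.224/27 export dn42_hkg_v4'
# Among the attributes, expect a line such as:
#   BGP.community: (64511,52) (64511,1344)
# i.e. the DN42_REGION / DN42_COUNTRY values defined earlier.
```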
Reference Articles:

- [DN42] bird2的配置文件 – Xe_iu's Blog | Xe_iu的杂物间
- [DN42] 谈一谈如何配置 BGP community – Xe_iu's Blog | Xe_iu的杂物间
- BGP-communities - DN42 Wiki
DN42 - Ep.2 Building Internal Network with OSPF and Enabling iBGP
# Foreword

I am a novice in BGP. This article may contain imprecise content, naive understandings, or elementary mistakes; I kindly ask the experts to be lenient. If you find any issues, you are welcome to contact me via email, and I will correct them as soon as possible. If you find this unacceptable, it is recommended to close this article now.

# Article Update Log

{timeline}
{timeline-item color="#50BFFF"}
July 22, 2025: First edition published, using VXLAN over WireGuard tunnel.
{/timeline-item}
{timeline-item color="#50BFFF"}
July 25, 2025: Updated tunneling solution, using type ptp; to support OSPF traffic via WireGuard (special thanks to Nuro Trance for the guidance!).
{/timeline-item}
{timeline-item color="#50BFFF"}
August 8, 2025: Added explanation and configuration for iBGP.
{/timeline-item}
{timeline-item color="#4F9E28"}
August 27, 2025: Updated node topology diagram.
{/timeline-item}
{/timeline}

# Why Do We Need Internal Routing?

As the number of nodes increases, we need a proper way to handle routing inside our AS (Autonomous System). BGP only handles routing between different ASes, which causes a problem: if nodes A and B both peer with external networks, a request leaving node A may have its response routed back to node B, even though both belong to the same AS. Without internal routing, node A will never receive the reply.

To solve this, we need to ensure that all devices within our AS can communicate with each other. The common solutions are:

1. Using overlay network tools like ZeroTier: simple to set up, just install the client on each node for P2P connectivity.
2. Using P2P tools like WireGuard to manually create $\frac{n(n-1)}{2}$ tunnels, which works like the first solution but becomes cumbersome as nodes grow.
3. Using WireGuard to establish $\frac{n(n-1)}{2}$ tunnels, then running an internal routing protocol like OSPF or Babel on top to manage the routing. This is more flexible and easier to scale, but misconfigurations are risky and could even break the DN42 network.

Thus, I decided to take the risk.

# Node Topology

```mermaid
graph LR
    A[HKG<br>172.20.234.225<br>fd18:3e15:61d0::1]
    B[NKG<br>172.20.234.226<br>fd18:3e15:61d0::2]
    C[TYO<br>172.20.234.227<br>fd18:3e15:61d0::3]
    D[FRA<br>172.20.234.228<br>fd18:3e15:61d0::4]
    E[LAX<br>172.20.234.229<br>fd18:3e15:61d0::5]
    B <--> A
    C <--> A
    A <--> E
    A <--> D
    C <--> D
    C <--> E
    D <--> E
```

# Update Bird2 to v2.16 or Above

To use IPv6 link-local addresses to carry IPv4 OSPF traffic, Bird v2.16 or later is required. Here are the steps to update:

```bash
sudo apt update && sudo apt -y install apt-transport-https ca-certificates wget lsb-release
sudo wget -O /usr/share/keyrings/cznic-labs-pkg.gpg https://pkg.labs.nic.cz/gpg
echo "deb [signed-by=/usr/share/keyrings/cznic-labs-pkg.gpg] https://pkg.labs.nic.cz/bird2 $(lsb_release -sc) main" | sudo tee /etc/apt/sources.list.d/cznic-labs-bird2.list
sudo apt update && sudo apt install bird2 -y
```

# Tunnel Configuration

```ini
[Interface]
PrivateKey = <Local WireGuard Private Key>
ListenPort = <Listen Port>
Table = off
Address = <IPv6 LLA>/64
PostUp = sysctl -w net.ipv6.conf.%i.autoconf=0

[Peer]
PublicKey = <Peer Public Key>
Endpoint = <Peer Public Endpoint>
AllowedIPs = 10.0.0.0/8, 172.20.0.0/14, 172.31.0.0/16, fd00::/8, fe00::/8, ff02::5
```

ff02::5 is the OSPFv3 AllSPFRouters link-local multicast address and must be included in AllowedIPs. If you're using a Bird version earlier than v2.16, you'll need to add an IPv4 address to the tunnel as well.
See the example below:

{collapse}
{collapse-item label="WireGuard Configuration Example with IPv4"}

```ini
[Interface]
PrivateKey = <Local WireGuard Private Key>
ListenPort = <Listen Port>
Table = off
Address = <IPv6 LLA>/64
PostUp = ip addr add 100.64.0.225/32 peer 100.64.0.226/32 dev %i
PostUp = sysctl -w net.ipv6.conf.%i.autoconf=0

[Peer]
PublicKey = <Peer Public Key>
Endpoint = <Peer Public Endpoint>
AllowedIPs = 10.0.0.0/8, 172.20.0.0/14, 100.64.0.0/16, 172.31.0.0/16, fd00::/8, fe00::/8, ff02::5
```

Please replace 100.64.0.225 and 100.64.0.226 with your local and peer IPv4 addresses, and remember to extend AllowedIPs accordingly.

{/collapse-item}
{/collapse}

# Enable OSPF

You should already have the basic Bird settings from the previous article in place. Create a new file called ospf.conf under /etc/bird and add the following:

```
protocol ospf v3 <name> {
    ipv4 {
        import where is_self_net() && source != RTS_BGP;
        export where is_self_net() && source != RTS_BGP;
    };
    include "/etc/bird/ospf/*";
};

protocol ospf v3 <name> {
    ipv6 {
        import where is_self_net_v6() && source != RTS_BGP;
        export where is_self_net_v6() && source != RTS_BGP;
    };
    include "/etc/bird/ospf/*";
};
```

Theoretically, OSPFv2 should be used for IPv4, but since we carry IPv4 over IPv6 link-local addresses, we use OSPFv3 for IPv4 here as well.

The filter rules ensure that only routes within our own network may propagate through OSPF, and that routes from external BGP protocols are filtered out. Never use import all; export all; indiscriminately, as this could lead to route hijacking and affect the entire DN42 network. OSPF should only handle internal network routes.

{collapse}
{collapse-item label="Example"}

/etc/bird/ospf.conf

```
protocol ospf v3 dn42_iyoroynet_ospf {
    ipv4 {
        import where is_self_net() && source != RTS_BGP;
        export where is_self_net() && source != RTS_BGP;
    };
    include "/etc/bird/ospf/*";
};

protocol ospf v3 dn42_iyoroynet_ospf6 {
    ipv6 {
        import where is_self_net_v6() && source != RTS_BGP;
        export where is_self_net_v6() && source != RTS_BGP;
    };
    include "/etc/bird/ospf/*";
};
```

{/collapse-item}
{/collapse}

Next, create the /etc/bird/ospf folder and then create an area configuration file (e.g., /etc/bird/ospf/backbone.conf) with the following content:

```
area 0.0.0.0 {
    interface "<DN42 dummy interface>" {
        stub;
    };
    interface "<wg0 interface>" {
        cost 80;    # Adjust according to your network situation
        type ptp;
    };
    interface "<wg1 interface>" {
        cost 100;   # Adjust according to your network situation
        type ptp;
    };
    # Continue for other interfaces
};
```

Area 0.0.0.0 is the backbone area. The dummy interface here refers to the DN42 virtual interface mentioned in the previous article. The cost value is normally derived from bandwidth, but in DN42, where bandwidth matters less and latency matters more, you can simply assign the measured latency as the cost. OSPF will automatically choose the path with the lowest total cost (the sum of the cost values along the path).

{collapse}
{collapse-item label="Example"}

/etc/bird/ospf/backbone.conf

```
area 0.0.0.0 {
    interface "dn42" {
        stub;
    };
    interface "dn42_hkg" {
        cost 80;
        type ptp;
    };
    interface "dn42_hfe" {
        cost 150;
        type ptp;
    };
    interface "dn42_lax" {
        cost 100;
        type ptp;
    };
};
```

{/collapse-item}
{/collapse}

Finally, open /etc/bird/bird.conf and add the following line at the end to include the OSPF configuration file:

```
include "ospf.conf";
```

Run birdc configure, and birdc show protocols should then show the OSPF protocols as Running. If not, check the configuration steps for errors.
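Beyond the protocol state, it is worth confirming that the adjacencies actually form. A hedged check, using the protocol name from my example above:

```bash
birdc show ospf neighbors dn42_iyoroynet_ospf
# Each directly connected peer should show state "Full/PtP".
# Neighbors stuck in Init or ExStart usually point at an MTU mismatch
# or a missing ff02::5 entry in AllowedIPs.
```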
At this point, you should be able to ping between two machines that are not directly connected.

# Enable iBGP

Before establishing multiple peer connections, each of your nodes must first have complete knowledge of the AS's internal topology. This involves configuring another key component: internal BGP (iBGP).

## Necessity of iBGP

iBGP ensures that all routers within the AS have complete knowledge of routes to external destinations:

- Internal routers can select the best exit path.
- Traffic is correctly routed to the border routers responsible for specific external networks.
- Even with multiple border routers connected to the same external network, internal routers can choose the best exit based on policy.

Compared to a default route pointing at the border router, iBGP provides precise external route information, allowing internal routers to make smarter forwarding decisions.

## Disadvantages and Solutions

To prevent uncontrolled propagation of routing information within the AS, which could cause loops, an iBGP router will not re-advertise routes learned from one iBGP neighbor to other iBGP neighbors. Consequently, traditional iBGP requires a full mesh of sessions between all iBGP-speaking routers in the AS. (You still need to establish $\frac{n(n-1)}{2}$ connections; there's no way around it. But configuring iBGP is still easier than configuring tunnels once OSPF is set up.)

Alternatives include:

- Using a Route Reflector (RR): an RR router manages all routing information within the entire AS. The disadvantage is that if the RR fails, the whole network can be paralyzed (which is not very decentralized).
- Using BGP Confederation: this virtually divides the routers in the AS into sub-ASes, treats the sessions between them as eBGP, and strips the internal AS path information when advertising routes externally.

I haven't tried these two alternatives; here are some potentially useful reference articles, while this article focuses on the configuration of plain iBGP:

- DN42 Experimental Network: Intro and Registration (Updated 2022-12) - Lan Tian @ Blog
- Configure BGP Confederation & Fake Confederation in Bird (Updated 2020-06-07) - Lan Tian @ Blog

## Writing the iBGP Configuration File

Create a new file ibgp.conf in /etc/bird with the following content:

```
template bgp ibgpeers {
    local as OWNAS;
    ipv4 {
        import where source = RTS_BGP && is_valid_network() && !is_self_net();
        export where source = RTS_BGP && is_valid_network() && !is_self_net();
        next hop self;
        extended next hop;
    };
    ipv6 {
        import where source = RTS_BGP && is_valid_network_v6() && !is_self_net_v6();
        export where source = RTS_BGP && is_valid_network_v6() && !is_self_net_v6();
        next hop self;
    };
};

include "ibgp/*";
```

The import and export filters ensure that iBGP only processes routes learned via BGP and filters out IGP routes, preventing loops.

next hop self is required. It instructs BIRD to rewrite the next hop to the border router's own IP address (instead of the original external next hop) when exporting routes to iBGP neighbors. This is because internal routers cannot reach the external neighbor's address directly; without the rewrite, the next hop would be considered unreachable. After the rewrite, internal routers only need to send traffic to the border router via IGP routing, and the border router handles the final external forwarding.
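The effect of next hop self is easy to observe from a non-border node; a sketch, using the dn42 anycast DNS prefix as an arbitrary externally learned route (any prefix received from a peer works):

```bash
birdc show route all 172.23.0.53/32
# BGP.next_hop should show the border router's own address
# (e.g. fd18:3e15:61d0::1 for my HKG node) rather than the external
# peer's tunnel address.
```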
Because I want to use IPv6 addresses to establish MP-BGP sessions and route IPv4 over IPv6, extended next hop is enabled for IPv4.

Next, create the /etc/bird/ibgp directory. Inside, create an iBGP peer configuration file for each node:

```
protocol bgp 'dn42_ibgp_<Node Name>' from ibgpeers {
    neighbor <Corresponding Node's IPv6 ULA Address> as OWNAS;
};
```

{collapse}
{collapse-item label="Example"}

/etc/bird/ibgp/hkg.conf:

```
protocol bgp 'dn42_ibgp_HKG' from ibgpeers {
    neighbor fd18:3e15:61d0::1 as OWNAS;
};
```

{/collapse-item}
{/collapse}

Note: each node needs to establish (n-1) iBGP connections, ensuring connectivity with every other machine in the AS. This is also why ULA addresses are used: even if the WireGuard tunnel between two nodes goes down, iBGP can still connect via the internal routes established by OSPF. Otherwise, a single tunnel failure could lead to the collapse of the entire internal network.

Finally, include ibgp.conf in /etc/bird/bird.conf:

```
include "ibgp.conf";
```

And run birdc configure to apply the configuration.

References:

- BIRD 与 BGP 的新手开场 - 海上的宫殿
- 萌新入坑 DN42 之 —— 基于 tailscale + vxlan + OSPF 的组网 – 米露小窝
- 使用 Bird2 配置 WireGuard + OSPF 实现网络的高可用 | bs' realm
- DN42 实验网络介绍及注册教程(2022-12 更新) - Lan Tian @ Blog
- 如何引爆 DN42 网络(2023-05-12 更新) - Lan Tian @ Blog
- Bird 配置 BGP Confederation,及模拟 Confederation(2020-06-07 更新) - Lan Tian @ Blog
- 深入解析OSPF路径开销、优先级和计时器 - 51CTO
- New release 2.16 | BIRD Internet Routing Daemon
- 第一章·第二节 如何在 Linux 上安装最新版本的 BIRD? | BIRD 中文文档
- [DN42] 使用 OSPF ptp 搭建内网与IBGP配置 – Xe_iu's Blog | Xe_iu的杂物间
- [译] dn42 多服务器环境中的 iBGP 与 IGP 配置 | liuzhen932 的小窝