1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
|
Next Hop Tracking
==================
Next hop tracking is an optimization feature that reduces the processing time
involved in the BGP bestpath algorithm by monitoring changes to the routing
table.
Background
-----------
Recursive routes are of the form:
::
p/m --> n
[Ex: 1.1.0.0/16 --> 2.2.2.2]
where 'n' itself is resolved through another route as follows:
::
p2/m --> h, interface
[Ex: 2.2.2.0/24 --> 3.3.3.3, eth0]
Usually, BGP routes are recursive in nature and BGP nexthops get resolved
through an IGP route. IGP usually adds its routes pointing to an interface
(these are called non-recursive routes).
When BGP receives a recursive route from a peer, it needs to validate the
nexthop. The path is marked valid or invalid based on the reachability status
of the nexthop. Nexthop validation is also important for BGP decision process
as the metric to reach the nexthop is a parameter to best path selection
process.
As it goes with routing, this is a dynamic process. Route to the nexthop can
change. The nexthop can become unreachable or reachable. In the current BGP
implementation, the nexthop validation is done periodically in the scanner run.
The default scanner run interval is one minute. Every minute, the scanner task
walks the entire BGP table. It checks the validity of each nexthop with Zebra
(the routing table manager) through a request and response message exchange
between BGP and Zebra process. BGP process is blocked for that duration. The
mechanism has two major drawbacks:
- The scanner task runs to completion. That can potentially starve the other
tasks for long periods of time, based on the BGP table size and number of
nexthops.
- Convergence around routing changes that affect the nexthops can be long
(around a minute with the default intervals). The interval can be shortened
to achieve faster reaction time, but it makes the first problem worse, with
the scanner task consuming most of the CPU resources.
The next-hop tracking feature makes this process event-driven. It eliminates
periodic nexthop validation and introduces an asynchronous communication path
between BGP and Zebra for route change notifications that can then be acted
upon.
Goal
----
Stating the obvious, the main goal is to remove the two limitations we
discussed in the previous section. The goals, in a constructive tone,
are the following:
- **Fairness**: the scanner run should not consume an unjustly high amount of
CPU time. This should give an overall good performance and response time to
other events (route changes, session events, IO/user interface).
- **Convergence**: BGP must react to nexthop changes instantly and provide
sub-second convergence. This may involve diverting the routes from one
nexthop to another.
Overview of changes
------------------------
The changes are in both BGP and Zebra modules. The short summary is
the following:
- Zebra implements a registration mechanism by which clients can
register for next hop notification. Consequently, it maintains a
separate table, per (VRF, AF) pair, of next hops and interested
client-list per next hop.
- When the main routing table changes in Zebra, it evaluates the next
hop table: for each next hop, it checks if the route table
modifications have changed its state. If so, it notifies the
interested clients.
- BGP is one such client. It registers the next hops corresponding to
all of its received routes/paths. It also threads the paths against
each nexthop structure.
- When BGP receives a next hop notification from Zebra, it walks the
corresponding path list. It makes them valid or invalid depending
on the next hop notification. It then re-computes best path for the
corresponding destination. This may result in re-announcing those
destinations to peers.
Design
------
Modules
~~~~~~~
The core design introduces an "nht" (next hop tracking) module in BGP
and "rnh" (recursive nexthop) module in Zebra. The "nht" module
provides the following APIs:
+----------------------------+--------------------------------------------------+
| Function | Action |
+============================+==================================================+
| bgp_find_or_add_nexthop() | find or add a nexthop in BGP nexthop table |
+----------------------------+--------------------------------------------------+
| bgp_find_nexthop() | find a nexthop in BGP nexthop table |
+----------------------------+--------------------------------------------------+
| bgp_parse_nexthop_update() | parse a nexthop update message coming from zebra |
+----------------------------+--------------------------------------------------+
The "rnh" module provides the following APIs:
+----------------------------+----------------------------------------------------------------------------------------------------------+
| Function | Action |
+============================+==========================================================================================================+
| zebra_add_rnh() | add a recursive nexthop |
+----------------------------+----------------------------------------------------------------------------------------------------------+
| zebra_delete_rnh() | delete a recursive nexthop |
+----------------------------+----------------------------------------------------------------------------------------------------------+
| zebra_lookup_rnh() | lookup a recursive nexthop |
+----------------------------+----------------------------------------------------------------------------------------------------------+
| zebra_add_rnh_client() | register a client for nexthop notifications against a recursive nexthop |
+----------------------------+----------------------------------------------------------------------------------------------------------+
| zebra_remove_rnh_client() | remove the client registration for a recursive nexthop |
+----------------------------+----------------------------------------------------------------------------------------------------------+
| zebra_evaluate_rnh_table() | (re)evaluate the recursive nexthop table (most probably because the main routing table has changed). |
+----------------------------+----------------------------------------------------------------------------------------------------------+
| zebra_cleanup_rnh_client() | Cleanup a client from the "rnh" module data structures (most probably because the client is going away). |
+----------------------------+----------------------------------------------------------------------------------------------------------+
4.2. Control flow
The next hop registration control flow is the following:
::
<==== BGP Process ====>|<==== Zebra Process ====>
|
receive module nht module | zserv module rnh module
----------------------------------------------------------------------
| | |
bgp_update_ | | |
main() | bgp_find_or_add_ | |
| nexthop() | |
| | |
| | zserv_nexthop_ |
| | register() |
| | | zebra_add_rnh()
| | |
The next hop notification control flow is the following:
::
<==== Zebra Process ====>|<==== BGP Process ====>
|
rib module rnh module | zebra module nht module
----------------------------------------------------------------------
| | |
meta_queue_ | | |
process() | zebra_evaluate_ | |
| rnh_table() | |
| | |
| | bgp_read_nexthop_ |
| | update() |
| | | bgp_parse_
| | | nexthop_update()
| | |
zclient message format
~~~~~~~~~~~~~~~~~~~~~~
ZEBRA_NEXTHOP_REGISTER and ZEBRA_NEXTHOP_UNREGISTER messages are
encoded in the following way:
::
. 0 1 2 3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| AF | prefix len |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
. Nexthop prefix .
. .
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
. .
. .
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| AF | prefix len |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
. Nexthop prefix .
. .
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
``ZEBRA_NEXTHOP_UPDATE`` message is encoded as follows:
::
. 0 1 2 3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| AF | prefix len |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
. Nexthop prefix getting resolved .
. .
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| metric |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| #nexthops |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| nexthop type |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
. resolving Nexthop details .
. .
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
. .
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| nexthop type |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
. resolving Nexthop details .
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
BGP data structure
~~~~~~~~~~~~~~~~~~
Legend:
::
/\ struct bgp_node: a BGP destination/route/prefix
\/
[ ] struct bgp_info: a BGP path (e.g. route received from a peer)
_
(_) struct bgp_nexthop_cache: a BGP nexthop
/\ NULL
\/--+ ^
| :
+--[ ]--[ ]--[ ]--> NULL
/\ :
\/--+ :
| :
+--[ ]--[ ]--> NULL
:
_ :
(_)...........
Zebra data structure
~~~~~~~~~~~~~~~~~~~~
RNH table::
. O
/ \
O O
/ \
O O
struct rnh
{
uint8_t flags;
struct route_entry *state;
struct list *client_list;
struct route_node *node;
};
User interface changes
~~~~~~~~~~~~~~~~~~~~~~
::
frr# show ip nht
3.3.3.3
resolved via kernel
via 11.0.0.6, swp1
Client list: bgp(fd 12)
11.0.0.10
resolved via connected
is directly connected, swp2
Client list: bgp(fd 12)
11.0.0.18
resolved via connected
is directly connected, swp4
Client list: bgp(fd 12)
11.11.11.11
resolved via kernel
via 10.0.1.2, eth0
Client list: bgp(fd 12)
frr# show ip bgp nexthop
Current BGP nexthop cache:
3.3.3.3 valid [IGP metric 0], #paths 3
Last update: Wed Oct 16 04:43:49 2013
11.0.0.10 valid [IGP metric 1], #paths 1
Last update: Wed Oct 16 04:43:51 2013
11.0.0.18 valid [IGP metric 1], #paths 2
Last update: Wed Oct 16 04:43:47 2013
11.11.11.11 valid [IGP metric 0], #paths 1
Last update: Wed Oct 16 04:43:47 2013
frr# show ipv6 nht
frr# show ip bgp nexthop detail
frr# debug bgp nht
frr# debug zebra nht
6. Sample test cases
r2----r3
/ \ /
r1----r4
- Verify that a change in IGP cost triggers NHT
+ shutdown the r1-r4 and r2-r4 links
+ no shut the r1-r4 and r2-r4 links and wait for OSPF to come back
up
+ We should be back to the original nexthop via r4 now
- Verify that a NH becoming unreachable triggers NHT
+ Shutdown all links to r4
- Verify that a NH becoming reachable triggers NHT
+ no shut all links to r4
Future work
~~~~~~~~~~~
- route-policy for next hop validation (e.g. ignore default route)
- damping for rapid next hop changes
- prioritized handling of nexthop changes ((un)reachability vs. metric
changes)
- handling recursion loop, e.g::
11.11.11.11/32 -> 12.12.12.12
12.12.12.12/32 -> 11.11.11.11
11.0.0.0/8 -> <interface>
- better statistics
|