diff options
-rw-r--r-- | doc/sphinx/arm/hooks-ha.rst | 100 |
1 files changed, 93 insertions, 7 deletions
diff --git a/doc/sphinx/arm/hooks-ha.rst b/doc/sphinx/arm/hooks-ha.rst index 19173eb271..e7746289d3 100644 --- a/doc/sphinx/arm/hooks-ha.rst +++ b/doc/sphinx/arm/hooks-ha.rst @@ -70,14 +70,14 @@ the primary server to complete the lease database synchronization before it starts the synchronization. In the hot-standby configuration, one of the servers is also designated -as "primary" and the second as "secondary." However, during normal +as "primary" and the second as "standby." However, during normal operation, the primary server is the only one that responds to DHCP -requests. The secondary or standby server receives lease updates from -the primary over the control channel; however, it does not respond to -any DHCP queries as long as the primary is running or, more accurately, -until the secondary considers the primary to be offline. If the -secondary server detects the failure of the primary, it starts -responding to all DHCP queries. +requests. The standby server receives lease updates from the primary +over the control channel; however, it does not respond to any DHCP +queries as long as the primary is running or, more accurately, +until the standby considers the primary to be offline. If the standby +server detects the failure of the primary, it starts responding to all +DHCP queries. In the configurations described above, the primary and secondary/standby are referred to as ``"active"`` servers, because they receive lease @@ -205,6 +205,20 @@ The following is the list of all possible server states: - ``backup`` - normal operation of the backup server. In this state it receives lease updates from the active servers. +- ``comunication-recovery`` - an active server running in load-balancing + mode may transition to this state when it experiences communication + issues with a partner server over the control channel. This is an + intermediate state between the ``load-balancing`` and ``partner-down`` + states. In this state the server continues to respond to DHCP queries + but does not send lease updates to the partner. The lease updates are + queued and will be sent when the communication is resumed. If the + communication is not resumed the server may transition to the + ``partner-down`` state. The ``communication-recovery`` state was + introduced to ensure reliable DHCP service when both active servers + remain operational but the communication between them is interrupted + for a prolonged period of time. The server can be configured to never + enter this state by setting the ``delayed-updates-limit`` to 0. + - ``hot-standby`` - normal operation of the active server running in the hot-standby mode; both the primary and the standby server are in this state during their normal operation. The primary server responds @@ -339,6 +353,11 @@ the scopes can be found below. +========================+=================+=================+=================+ | backup | backup server | disabled | none | +------------------------+-----------------+-----------------+-----------------+ + | communication-recovery | primary or | enabled | ``HA_server1`` | + | | secondary | | or | + | | (load-balancing | | ``HA_server2`` | + | | mode only) | | | + +------------------------+-----------------+-----------------+-----------------+ | hot-standby | primary or | enabled | ``HA_server1`` | | | standby | | if primary, | | | (hot-standby | | none otherwise | @@ -456,6 +475,7 @@ with the only difference that ``this-server-name`` should be set to "max-response-delay": 10000, "max-ack-delay": 5000, "max-unacked-clients": 5, + "delayed-updates-limit": 100, "peers": [{ "name": "server1", "url": "http://192.168.56.33:8000/", @@ -555,6 +575,14 @@ behavior with respect to HA: the partner server is down and transitions to the ``partner-down`` state immediately. The default value of this parameter is 10. +- ``delayed-updates-limit`` - specifies a maximum number of lease + updates which can be queued while the server is in the + ``communication-recovery`` state. This parameter was introduced in + Kea 1.9.4 release. The special value of 0 configures the server to + never transition to the ``communication-recovery`` state and the + server behaves as in earlier Kea versions. The default value of this + parameter is 100. + The values of ``max-ack-delay`` and ``max-unacked-clients`` must be selected carefully, taking into account the specifics of the network in which the DHCP servers are operating. Note that the server in question @@ -575,6 +603,64 @@ other and network partitioning is unlikely, i.e. failure to respond to heartbeats is only possible when the partner is offline. In such cases, set the ``max-unacked-clients`` to 0. +The ``delayed-updates-limit`` parameter was introduced in Kea 1.9.4. It +is used to enable or disable the use of the communication recovery +procedure and controls server's behavior in the ``communication-recovery`` +state which were introduced in the same release. This parameter may +only be used in the load balancing mode. + +If the server is in the ``load-balancing`` state and it experiences +communication issues with a partner (heartbeat or lease update fail), +the server transitions to the ``communication-recovery`` state. In this +state the server keeps responding to DHCP queries but it does not send +lease updates to the partner. The lease updates are queued until the +communication is re-established. This ensures that the DHCP service +remains available even in the event of the communication loss between +the partners. Note that the communication loss may appear both when +one of the servers terminated or when both servers remain available +but can't communicate. In the former case, the surviving server will +follow the normal failover procedure and should eventually transition to +the ``partner-down`` state. In the latter case both servers should +transition to the ``communication-recovery`` state and should never +transition to the ``partner-down`` state (if ``max-unacked-clients`` +is set to non zero value), because all DHCP queries are responded and +servers would not see any unacked DHCP queries. + +Introduction of the communication recovery procedure was mostly +motivated by issues which may appear when two servers remain online +but the communication between them remains interrupted for a long +period of time. In earlier Kea versions, the servers having communication +issues used to drop DHCP packets before transitioning to the +``partner-down`` state. In some cases they both transitioned to the +``partner-down`` state which could potentially result in allocations +of the same IP addresses or delegated prefixes to different clients +by respective servers. By entering the intermediate ``communication-recovery`` +state these problems are avoided. + +If the server in the ``communication-recovery`` state re-establishes +communication with the partner, it will try to send all outstanding +lease updates to it. This is done synchronously and may take considerable +amount of time before the server transitions to the ``load-balancing`` +state and resumes normal operation. The maximum number of lease updates +which can be queued in the ``communication-recovery`` state is controlled +by the ``delayed-updates-limit``. If the limit is exceeded, the server +stops queuing lease updates and will perform full database synchronization +after re-establishing the connection with the partner instead of +sending outstanding lease updates before transitioning to +``load-balancing`` state. Even if the limit is exceeded, the server +in the ``communication-recovery`` state remains responsive to the DHCP +clients. + +It is preferred to set higher values of ``delayed-updates-limit`` when +there is a risk of prolonged communication interruption between the +servers and the lease database is large. This would avoid costly +lease database synchronization. On the other hand, if the lease +database is small the time required to send outstanding lease updates +may be longer than lease database synchronization. In such cases it +may be better to use lower value, e.g. 10. The default value is 100 +which seems to be a reasonable compromise and should work well in +most deployments with moderate traffic. + The ``peers`` parameter contains a list of servers within this HA setup. This configuration must contain at least one primary and one secondary server. It may also contain an unlimited number of backup servers. In |