ha: High Availability
This section describes the High Availability hooks library, which can be
loaded on a pair of DHCPv4 or DHCPv6 servers to increase the reliability of
the DHCP service in the event of an outage of one of the servers. This library
was previously only available to ISC's paid subscribers, but is now part of
the open source Kea, available to all users.
This library may only be loaded by the kea-dhcp4
or the kea-dhcp6 process.
High Availability (HA) of the DHCP service is provided by running multiple
cooperating server instances. If any of these instances becomes
unavailable for any reason (DHCP software crash, Control Agent
software crash, power outage, hardware failure), a surviving
server instance can continue providing reliable service to clients. Many
DHCP server implementations include the "DHCP Failover" protocol, whose most
significant features are communication between the servers, partner
failure detection, and lease synchronization between the servers.
However, the DHCPv4 failover standardization process was never completed
by the IETF. The DHCPv6 failover standard (RFC 8156) was published, but it
is complex, difficult to use, has significant operational constraints,
and differs from its v4 counterpart.
Although it may be useful for some users to use a "standard" failover
protocol, it seems that most Kea users are simply interested in
a working solution which guarantees high availability of the DHCP
service. Therefore, the Kea HA hook library derives major concepts from the
DHCP Failover protocol but uses its own solutions for communication and
configuration. It offers its own state machine, which greatly simplifies its
implementation and generally fits better into Kea, and it provides the
same features in both DHCPv4 and DHCPv6. This document intentionally
uses the term "High Availability" rather than "Failover" to emphasize that
it is not the Failover protocol implementation.
The following sections describe the configuration and operation of the Kea
HA hook library.
Supported Configurations
The Kea HA hook library supports two configurations, also known as HA
modes: load balancing and hot standby. In the load-balancing mode,
two servers respond to DHCP requests. The load-balancing function
is implemented as described in RFC 3074, with each server responding to
half the received DHCP queries. When one of the servers allocates a lease
for a client, it notifies the partner server over the control channel
(RESTful API), so the partner can save the lease information in its
own database. If the communication with the partner is unsuccessful,
the DHCP query is dropped and the response is not returned to the DHCP
client. If the lease update is successful, the response is returned to
the DHCP client by the server which has allocated the lease. By
exchanging lease updates, both servers get a copy of all leases
allocated by the entire HA setup, and either server can be switched
to handle the entire DHCP traffic if its partner becomes unavailable.
In the load-balancing configuration, one of the servers must be
designated as "primary" and the other as "secondary."
Functionally, there is no difference between the two during normal
operation. This distinction is required when the two servers are
started at (nearly) the same time and have to synchronize their
lease databases. The primary server synchronizes the database first.
The secondary server waits for the primary server to complete the
lease database synchronization before it starts the synchronization.
In the hot-standby configuration, one of the servers is also designated as
"primary" and the other as "standby". However, during
normal operation, the primary server is the only one that responds to
DHCP requests. The standby server receives lease updates from the
primary over the control channel; however, it does not respond to any
DHCP queries as long as the primary is running or, more accurately,
until the standby considers the primary to be offline. If the
standby server detects the failure of the primary, it starts
responding to all DHCP queries.
In the configurations described above, the primary, secondary, and
standby are referred to as "active" servers, because they receive
lease updates and can automatically react to the partner's failures by
responding to the DHCP queries which would normally be handled by the
partner. The HA hook library supports another server role: the
backup server. Backup servers are optional and can be used
in both load-balancing and hot-standby setups, in addition to the active
servers. There is no limit on the number of backup servers in the HA
setup; however, the presence of backup servers increases the latency
of DHCP responses, because active servers send lease
updates not only to each other but also to the backup servers.
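The half-and-half split of queries in the load-balancing mode, described earlier in this section, can be illustrated with a minimal sketch. It is not the RFC 3074 algorithm itself: that RFC defines its own hash function and bucket assignment, whereas the stand-in below uses SHA-256 and a fixed split purely for brevity. The function name and the choice of a hardware address as the hashed identifier are assumptions made for this example.

```python
import hashlib


def select_server(client_id: bytes) -> str:
    """Map a client identifier to the active server that should answer.

    Stand-in sketch only: SHA-256 replaces the hash defined in RFC 3074,
    and the 256 buckets are split into two fixed halves.
    """
    bucket = hashlib.sha256(client_id).digest()[0]  # value in 0..255
    return "primary" if bucket < 128 else "secondary"


# Hypothetical hardware address used as the client identifier.
mac = bytes.fromhex("0a0b0c0d0e0f")
print(select_server(mac))
```

Because the mapping depends only on the client identifier, both servers reach the same decision independently, so each query is answered by exactly one of them.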
Clocks on Active Servers
Synchronized clocks are essential for the HA setup to operate
reliably. The servers share lease information via lease updates and
during synchronization of the databases. The lease information includes
the time when the lease has been allocated and when it expires. Some
clock skew between the servers participating in the HA setup usually
exists; this is acceptable as long as the clock skew is relatively low,
compared to the lease lifetimes. However, if the clock skew becomes too
high, the different lease expiration times on different
servers may cause the HA system to malfunction. For example, one server
may consider a lease to be expired when it is actually still valid. The lease
reclamation process may remove a name associated with this lease from
the DNS, causing problems when the client later attempts to renew the lease.
Each active server monitors the clock skew by comparing its current
time with the time returned by its partner in response to the heartbeat
command. This gives a good approximation of the clock skew, although it
doesn't take into account the time between sending the response by the
partner and receiving this response by the server which sent the
heartbeat command. If the clock skew exceeds 30 seconds, a warning log
message is issued. The administrator may correct this problem by
synchronizing the clocks (e.g. using NTP); the servers should notice
the clock skew correction and stop issuing the warning.
If the clock skew is not corrected and exceeds 60 seconds, the
HA service on each of the servers is terminated, i.e. the state
machine enters the terminated state. The servers
will continue to respond to DHCP clients (as in the load-balancing
or hot-standby mode), but will exchange neither lease updates nor
heartbeats and their lease databases will diverge. In this case, the
administrator should synchronize the clocks and restart the servers.
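The two thresholds described in this section can be summarized in a short sketch. The helper below is hypothetical and not part of Kea; it only encodes the 30-second warning and 60-second termination limits stated above and, like the approximation in the text, ignores network delay.

```python
from datetime import datetime, timedelta, timezone

# Thresholds taken from the text: warn above 30 s, terminate above 60 s.
WARN_SKEW = timedelta(seconds=30)
TERMINATE_SKEW = timedelta(seconds=60)


def classify_clock_skew(local_time: datetime, partner_time: datetime) -> str:
    """Return "ok", "warn", or "terminate" for the observed clock skew.

    Hypothetical helper: the skew is the absolute difference between the
    local clock and the time reported by the partner.
    """
    skew = abs(local_time - partner_time)
    if skew > TERMINATE_SKEW:
        return "terminate"
    if skew > WARN_SKEW:
        return "warn"
    return "ok"


now = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
print(classify_clock_skew(now, now + timedelta(seconds=5)))   # ok
print(classify_clock_skew(now, now - timedelta(seconds=45)))  # warn
print(classify_clock_skew(now, now + timedelta(seconds=61)))  # terminate
```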
Server States
A DHCP server operating
within an HA setup runs a state machine,
and the state of the server can be retrieved by its peers using the
ha-heartbeat command sent over the RESTful API. If
the partner server doesn't respond to the ha-heartbeat
command within the specified amount of time, the communication is
considered interrupted and the server may (depending on the configuration)
use additional measures (described later in this document) to verify that
the partner is still operating. If it finds that the partner is not
operating, the server transitions to the partner-down
state to handle the entire DHCP traffic directed to the system.
In this case, the surviving server continues to send the
ha-heartbeat command to detect when the partner wakes
up. At that time, the partner synchronizes the lease database and when it is again
ready to operate, the surviving server returns to normal operation,
i.e. the load-balancing or hot-standby
state.
The following is the list of all possible server states:
backup - normal operation of the
backup server. In this state it receives lease updates from the active
servers.
hot-standby - normal operation of
the active server running in the hot-standby mode; both the primary and
the standby server are in this state during their normal operation.
The primary server responds to DHCP queries and sends lease updates
to the standby server and to any backup servers that
are present.
load-balancing - normal operation
of the active server running in the load-balancing mode; both the primary
and the secondary server are in this state during their normal operation.
Both servers respond to DHCP queries and send lease updates
to each other and to any backup servers that are
present.
partner-down - an active server
transitions to this state after detecting that its partner (another
active server) is offline. The server does not transition to this state
if only a backup server is unavailable. In the
partner-down state the active server responds to all DHCP queries,
including those queries which are normally handled by the server
that is now unavailable.
ready - an active server transitions
to this state after synchronizing its lease database with an active
partner. This state indicates to the partner - which may be in the
partner-down state - that it should return to
normal operation. If and when it does, the server in the
ready state will also start normal operation.
syncing - an active server
transitions to this state to fetch leases from the active partner
and update the local lease database. When in this state, the server
issues the dhcp-disable command to disable the DHCP
service of the partner from which the leases are fetched. The DHCP
service is disabled for the maximum time of 60 seconds, after which
it is automatically re-enabled, in case the syncing partner was unable
to re-enable the service. If the synchronization is
completed, the syncing server issues the dhcp-enable
command to re-enable the DHCP service of its partner. The
syncing operation is synchronous; the server waits for an
answer from the partner and does nothing else while the
lease synchronization takes place. A server that is configured
not to synchronize the lease database with its partner, i.e. when the
sync-leases configuration parameter is set to
false, will never transition to this state.
Instead, it will transition directly from the
waiting state to the ready state.
terminated - an active server
transitions to this state when the High Availability hooks library
is unable to further provide reliable service and a manual
intervention of the administrator is required to correct the problem.
Various issues with the HA setup may cause the
server to transition to this state.
While in this state, the server continues responding to
DHCP clients based on the HA mode selected (load-balancing or
hot-standby), but the lease updates are not exchanged and the
heartbeats are not sent. Once a server has entered the
"terminated" state, it will remain in this state until it is
restarted. The administrator must correct the issue which caused
this situation prior to restarting the server (e.g. synchronize clocks).
Otherwise, the server will return to the "terminated" state once
it finds that the issue persists.
waiting - each started server
instance enters this state. The backup server transitions
directly from this state to the backup state.
An active server sends a heartbeat to its partner to check its
state; if the partner appears to be unavailable, the server
transitions to the partner-down state. If the partner is
available, the server transitions to the syncing or
ready state, depending on the setting of the
sync-leases configuration parameter. If
both servers appear to be in the waiting
state (concurrent startup), the primary server transitions to
the next state first. The secondary or standby server remains
in the waiting state until the primary
transitions to the ready state.
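The rules given for the waiting state can be sketched as a small decision function. This is a hypothetical illustration of the description above, not Kea code; the function and argument names are assumptions, and partner states other than waiting are not modeled separately.

```python
from typing import Optional


def next_state_from_waiting(role: str, partner_state: Optional[str],
                            sync_leases: bool = True) -> str:
    """Pick the next state for a server that is currently in "waiting".

    role is "primary", "secondary", "standby", or "backup".
    partner_state is None when the partner did not answer the heartbeat.
    """
    if role == "backup":
        return "backup"
    if partner_state is None:
        return "partner-down"
    if partner_state == "waiting" and role != "primary":
        # Concurrent startup: the primary moves to the next state first.
        return "waiting"
    return "syncing" if sync_leases else "ready"


print(next_state_from_waiting("primary", "waiting"))    # syncing
print(next_state_from_waiting("secondary", "waiting"))  # waiting
print(next_state_from_waiting("standby", None))         # partner-down
```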
Currently, restarting the HA service from the
terminated state requires restarting the
DHCP server or reloading its configuration.
Whether the server responds to the DHCP queries and which
queries it responds to is a matter of the server's state, if no
administrative action is performed to configure the server
otherwise. The following table provides the default behavior for
various states.
The DHCP Service Scopes column denotes which group
of received DHCP queries the server responds to in the given state.
An in-depth explanation of the scopes can be found below.
Default Behavior of the Server in Various HA States
State           Server Type                                 DHCP Service  DHCP Service Scopes
--------------  ------------------------------------------  ------------  --------------------------------------------------
backup          backup server                               disabled      none
hot-standby     primary or standby (hot-standby mode)       enabled       HA_server1 if primary, none otherwise
load-balancing  primary or secondary (load-balancing mode)  enabled       HA_server1 or HA_server2
partner-down    active server                               enabled       all scopes
ready           active server                               disabled      none
syncing         active server                               disabled      none
terminated      active server                               enabled       same as in the load-balancing or hot-standby state
waiting         any server                                  disabled      none
The DHCP service scopes require some explanation. The HA
configuration must specify a unique name for each server within
the HA setup. This document uses the following convention within
provided examples: server1 for a primary server,
server2 for the secondary or standby server, and
server3 for the backup server. In real life
any names can be used as long as they remain unique.
In the load-balancing mode there are two scopes named after
the active servers: HA_server1 and
HA_server2. The DHCP queries load-balanced to
server1 belong to the HA_server1
scope and the queries load-balanced to server2
belong to the HA_server2 scope. If either of the
servers is in the partner-down state, the active partner is
responsible for serving both scopes.
In the hot-standby mode, there is only one scope -
HA_server1 - because only server1
is responding to DHCP queries. If that server becomes unavailable,
server2 becomes responsible for this scope.
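The scope rules described so far, together with the table of default behavior, can be condensed into a lookup. The sketch below is hypothetical and follows this document's naming convention (server1 is the primary, server2 the secondary or standby); it is not taken from the Kea sources.

```python
# Scopes that exist in each HA mode, using this document's server names.
ALL_SCOPES = {
    "load-balancing": {"HA_server1", "HA_server2"},
    "hot-standby": {"HA_server1"},
}


def served_scopes(mode: str, role: str, state: str) -> set:
    """Return the scopes a server handles for a given mode, role, and state."""
    if role == "primary":
        own = {"HA_server1"}
    elif mode == "load-balancing" and role == "secondary":
        own = {"HA_server2"}
    else:
        own = set()  # standby and backup servers have no scope of their own

    if state == "partner-down":
        return set(ALL_SCOPES[mode])
    if state in (mode, "terminated"):
        # Normal operation, or terminated (same scopes as normal operation).
        return own
    return set()  # waiting, syncing, ready, backup


print(sorted(served_scopes("load-balancing", "secondary", "partner-down")))
# ['HA_server1', 'HA_server2']
print(sorted(served_scopes("hot-standby", "standby", "hot-standby")))
# []
```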
The backup servers do not have their own scopes. In some
cases they can be used to respond to queries belonging to
the scopes of the active servers. Also, a server which is neither
in the partner-down state nor in normal operation serves
no scopes.
The scope names can be used to associate pools, subnets,
and networks with certain servers, so only these servers
can allocate addresses or prefixes from those pools, subnets,
or networks. This is done via the client classification mechanism
(see below).
Scope Transition in a Partner-Down Case
When one of the servers finds that its partner is unavailable,
it starts serving clients from both its own scope and the scope of the
unavailable partner. This is straightforward
for new clients, i.e. those sending DHCPDISCOVER (DHCPv4) or Solicit
(DHCPv6), because those requests are not sent to any particular server.
The available server will respond to all such queries when it is
in the partner-down state.
When a client renews a lease, it sends its
DHCPREQUEST (DHCPv4) or Renew (DHCPv6) message directly to the
server which has allocated the lease being renewed. If this
server is no longer available, the client will get no response. In
that case, the client continues to use its lease and attempts to
renew until the rebind timer (T2) elapses. The client then enters
the rebinding phase, in which it sends a DHCPREQUEST (DHCPv4) or
Rebind (DHCPv6) message to any available server. The surviving
server will receive the rebinding request and will typically
extend the lifetime of the lease. The client then continues to
contact that new server to renew its lease as appropriate.
If and when the other server once again becomes available, both active servers
will eventually transition to the load-balancing
or hot-standby state, in which they will again be
responsible for their own scopes. Some clients belonging to the
scope of the restarted server will try to renew their leases
via the surviving server, but this server will not respond to them
anymore; the client will eventually transition back to the
correct server via the rebinding mechanism.
Load-Balancing Configuration
The following is the configuration snippet to enable
high availability on the primary server within the load-balancing
configuration. The same configuration should be applied on the
secondary and backup servers, with the only difference that
this-server-name should be set to
server2 and server3
on those servers, respectively.
{
    "Dhcp4": {
        ...
        "hooks-libraries": [
            {
                "library": "/usr/lib/kea/hooks/libdhcp_lease_cmds.so",
                "parameters": { }
            },
            {
                "library": "/usr/lib/kea/hooks/libdhcp_ha.so",
                "parameters": {
                    "high-availability": [ {
                        "this-server-name": "server1",
                        "mode": "load-balancing",
                        "heartbeat-delay": 10000,
                        "max-response-delay": 10000,
                        "max-ack-delay": 5000,
                        "max-unacked-clients": 5,
                        "peers": [
                            {
                                "name": "server1",
                                "url": "http://192.168.56.33:8080/",
                                "role": "primary",
                                "auto-failover": true
                            },
                            {
                                "name": "server2",
                                "url": "http://192.168.56.66:8080/",
                                "role": "secondary",
                                "auto-failover": true
                            },
                            {
                                "name": "server3",
                                "url": "http://192.168.56.99:8080/",
                                "role": "backup",
                                "auto-failover": false
                            }
                        ]
                    } ]
                }
            }
        ],
        "subnet4": [
            {
                "subnet": "192.0.3.0/24",
                "pools": [
                    {
                        "pool": "192.0.3.100 - 192.0.3.150",
                        "client-class": "HA_server1"
                    },
                    {
                        "pool": "192.0.3.200 - 192.0.3.250",
                        "client-class": "HA_server2"
                    }
                ],
                "option-data": [
                    {
                        "name": "routers",
                        "data": "192.0.3.1"
                    }
                ],
                "relay": { "ip-address": "10.1.2.3" }
            }
        ],
        ...
    }
}
Two hook libraries must be loaded to enable HA:
libdhcp_lease_cmds.so and
libdhcp_ha.so. The latter implements the
HA feature, while the former enables control
commands required by HA to fetch and manipulate leases on the
remote servers. In the example provided above, it is assumed that
Kea libraries are installed in the /usr/lib
directory. If Kea is not installed in the /usr directory, the
hook library locations must be updated accordingly.
The HA configuration is specified within the scope of
libdhcp_ha.so. Note that the top-level
parameter high-availability is a list, even
though it currently contains only one entry.
The following are the global parameters which control the server's
behavior with respect to HA:
this-server-name - a unique
identifier of this server within the HA setup. It must match the name of one
of the servers specified within the peers list.
mode - specifies an HA mode
of operation. Currently supported modes are load-balancing
and hot-standby.
heartbeat-delay - specifies
a duration in milliseconds between sending the last heartbeat (or other command sent
to the partner) and the next heartbeat. The heartbeats are sent
periodically to gather the status of the partner and to verify whether
the partner is still operating. The default value of this parameter is
10000 ms.
max-response-delay - specifies a
duration in milliseconds since the last successful communication with the
partner, after which the server assumes that communication with
the partner is interrupted. This duration should be greater than
the heartbeat-delay. Usually it is greater than
the duration of multiple heartbeat-delay values.
When the server detects that communication is interrupted, it
may transition to the partner-down state (when
max-unacked-clients is 0) or trigger the failure-detection
procedure using the values of the two parameters below.
The default value of this parameter is 60000 ms.
max-ack-delay - is one of
the parameters controlling partner failure-detection. When
communication with the partner is interrupted, the server examines the values
of the secs field (DHCPv4) or Elapsed Time
option (DHCPv6), which denote how long the DHCP client has been
trying to communicate with the DHCP server. This parameter specifies the
maximum time in milliseconds for the client to try to communicate with the
DHCP server, after which this server assumes that the client failed to
communicate with the DHCP server (is "unacked"). The default value of
this parameter is 10000 ms.
max-unacked-clients - specifies
how many "unacked" clients are allowed (see max-ack-delay)
before this server assumes that the partner is offline and transitions
to the partner-down state. The special value of 0
is allowed for this parameter, which disables the failure-detection
mechanism. In this case, a server that can't communicate with its
partner over the control channel assumes that the partner server is
down and transitions to the partner-down state
immediately. The default value of this parameter is 10.
The values of max-ack-delay and
max-unacked-clients must be selected carefully, taking
into account the specifics of the network in which the DHCP servers are
operating. Note that the server in question may not respond to some
DHCP clients because these clients are not to be serviced
by this server according to administrative policy. The server may also
drop malformed queries from clients. Therefore, selecting too
low a value for the max-unacked-clients parameter may
result in a transition to the partner-down
state even though the partner is still operating. On the other
hand, selecting too high a value may result in never transitioning
to the partner-down state if the DHCP
traffic in the network is very low (e.g. nighttime), because the
number of distinct clients trying to communicate with the server
could be lower than the max-unacked-clients setting.
In some cases it may be useful to disable the failure-detection
mechanism altogether, if the servers are located very close to each
other and network partitioning is unlikely, i.e. failure to
respond to heartbeats is only possible when the partner is offline.
In such cases, set the max-unacked-clients to 0.
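The interplay of max-ack-delay and max-unacked-clients can be sketched as follows. This is a hypothetical illustration, not Kea code: it assumes that communication with the partner has already been interrupted for longer than max-response-delay, and that the partner is considered offline once the number of "unacked" clients strictly exceeds max-unacked-clients. The helper names are invented for this example.

```python
def v4_secs_to_ms(secs: int) -> int:
    """The DHCPv4 secs field is expressed in seconds."""
    return secs * 1000


def v6_elapsed_to_ms(elapsed: int) -> int:
    """The DHCPv6 Elapsed Time option is expressed in hundredths of a second."""
    return elapsed * 10


def partner_considered_down(client_wait_ms: dict,
                            max_ack_delay: int = 10000,
                            max_unacked_clients: int = 10) -> bool:
    """Decide whether to transition to partner-down.

    client_wait_ms maps a client identifier to the longest time (in ms)
    that client has reported trying to reach a DHCP server.
    """
    if max_unacked_clients == 0:
        # Failure detection disabled: lost communication is enough.
        return True
    unacked = sum(1 for waited in client_wait_ms.values()
                  if waited > max_ack_delay)
    # Assumption: the threshold must be strictly exceeded.
    return unacked > max_unacked_clients


# Settings from the example configuration: max-ack-delay 5000, max-unacked-clients 5.
waits = {f"client{i}": v4_secs_to_ms(6) for i in range(6)}
print(partner_considered_down(waits, max_ack_delay=5000, max_unacked_clients=5))  # True
```

Counting distinct clients rather than packets matters here: a single client retransmitting many times should not, on its own, push the server into the partner-down state.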
The peers parameter contains a list of servers
within this HA setup. In the load-balancing mode, this configuration must
contain exactly one primary and one secondary server. It may also contain an unlimited
number of backup servers. In this example, there is one backup server
which receives lease updates from the active servers.
These are the parameters specified for each of the
peers within this list:
name - specifies a unique name for
the server.
url - specifies the URL to be used to
contact this server over the control channel. Other servers use this
URL to send control commands to that server.
role - denotes the role of the
server in the HA setup. The following roles are supported in the
load-balancing configuration: primary,
secondary, and backup.
There must be exactly one primary and one secondary server in the
load-balancing setup.
auto-failover - a boolean value
which denotes whether a server detecting a partner's failure should
automatically start serving the partner's clients. The default value of
this parameter is true.
In our example configuration, both active servers can allocate
leases from the subnet "192.0.3.0/24". This subnet contains two
address pools: "192.0.3.100 - 192.0.3.150" and "192.0.3.200 - 192.0.3.250",
which are associated with HA server scopes using client classification.
When server1 processes a DHCP query, it uses
the first pool for lease allocation. Conversely, when
server2 processes a DHCP query it uses the
second pool. When either of the servers is in the partner-down
state, it can serve leases from both pools and it
selects the pool which is appropriate for the received query. In
other words, if the query would normally be processed by
server2 but this server is not available,
server1 will allocate the lease from the pool of
"192.0.3.200 - 192.0.3.250".
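The constraints stated in this section (unique peer names, a this-server-name that matches one of the peers, and exactly one primary and one secondary peer in the load-balancing mode) can be checked mechanically before deploying a configuration. The checker below is a hypothetical sketch written for this document; it is not a Kea tool and covers only these three rules.

```python
def check_ha_config(ha: dict) -> list:
    """Return a list of problems found in one "high-availability" entry."""
    problems = []
    peers = ha.get("peers", [])
    names = [peer.get("name") for peer in peers]
    roles = [peer.get("role") for peer in peers]

    if len(set(names)) != len(names):
        problems.append("peer names must be unique")
    if ha.get("this-server-name") not in names:
        problems.append("this-server-name must match one of the peers")
    if ha.get("mode") == "load-balancing":
        for role in ("primary", "secondary"):
            if roles.count(role) != 1:
                problems.append(f"exactly one {role} peer is required")
    return problems


example = {
    "this-server-name": "server1",
    "mode": "load-balancing",
    "peers": [
        {"name": "server1", "url": "http://192.168.56.33:8080/", "role": "primary"},
        {"name": "server2", "url": "http://192.168.56.66:8080/", "role": "secondary"},
        {"name": "server3", "url": "http://192.168.56.99:8080/", "role": "backup"},
    ],
}
print(check_ha_config(example))  # []
```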
Load Balancing with Advanced Classification
In the previous section, we provided an example of
a load-balancing configuration with client classification limited
to the HA_server1 and HA_server2
classes, which are dynamically assigned to the received DHCP queries.
In many cases, HA will be needed in deployments which already
use some other client classification.
Suppose there is a system which classifies devices into two groups:
phones and laptops, based on some classification criteria specified in
the Kea configuration file. Both types of devices are allocated leases
from different address pools. Introducing HA in the load-balancing mode
results in a further split of each of those pools, as
each server allocates leases for some phones and
some laptops. This requires each of the existing pools
to be split between HA_server1 and
HA_server2, so we end up with the following classes:
phones_server1
laptops_server1
phones_server2
laptops_server2
The corresponding server configuration using advanced classification
(and the member expression) is provided below. For brevity's sake,
the HA hook library configuration has been removed from this example.
{
    "Dhcp4": {
        "client-classes": [
            {
                "name": "phones",
                "test": "substring(option[60].hex,0,6) == 'Aastra'"
            },
            {
                "name": "laptops",
                "test": "not member('phones')"
            },
            {
                "name": "phones_server1",
                "test": "member('phones') and member('HA_server1')"
            },
            {
                "name": "phones_server2",
                "test": "member('phones') and member('HA_server2')"
            },
            {
                "name": "laptops_server1",
                "test": "member('laptops') and member('HA_server1')"
            },
            {
                "name": "laptops_server2",
                "test": "member('laptops') and member('HA_server2')"
            }
        ],
        "hooks-libraries": [
            {
                "library": "/usr/lib/kea/hooks/libdhcp_lease_cmds.so",
                "parameters": { }
            },
            {
                "library": "/usr/lib/kea/hooks/libdhcp_ha.so",
                "parameters": {
                    "high-availability": [ {
                        ...
                    } ]
                }
            }
        ],
        "subnet4": [
            {
                "subnet": "192.0.3.0/24",
                "pools": [
                    {
                        "pool": "192.0.3.100 - 192.0.3.125",
                        "client-class": "phones_server1"
                    },
                    {
                        "pool": "192.0.3.126 - 192.0.3.150",
                        "client-class": "laptops_server1"
                    },
                    {
                        "pool": "192.0.3.200 - 192.0.3.225",
                        "client-class": "phones_server2"
                    },
                    {
                        "pool": "192.0.3.226 - 192.0.3.250",
                        "client-class": "laptops_server2"
                    }
                ],
                "option-data": [
                    {
                        "name": "routers",
                        "data": "192.0.3.1"
                    }
                ],
                "relay": { "ip-address": "10.1.2.3" }
            }
        ],
        ...
    }
}
The configuration provided above splits the address range into
four pools: two pools dedicated to server1 and two to
server2. Each server can assign leases to both phones and laptops.
Both groups of devices are assigned addresses from different pools.
The HA_server1 and HA_server2 classes
are built-in
and do not need to be declared. They are assigned dynamically by
the HA hook library as a result of the load-balancing algorithm.
The phones_* and laptops_* classes evaluate to
"true" when the query belongs to a given combination of other classes,
e.g. HA_server1 and phones.
The pool is selected accordingly as a result of such an evaluation.
Consult the client classification documentation for details on how to use the
member expression and class dependencies.
Hot-Standby Configuration
The following is an example configuration of the primary server
in the hot-standby configuration:
{
    "Dhcp4": {
        ...
        "hooks-libraries": [
            {
                "library": "/usr/lib/kea/hooks/libdhcp_lease_cmds.so",
                "parameters": { }
            },
            {
                "library": "/usr/lib/kea/hooks/libdhcp_ha.so",
                "parameters": {
                    "high-availability": [ {
                        "this-server-name": "server1",
                        "mode": "hot-standby",
                        "heartbeat-delay": 10000,
                        "max-response-delay": 10000,
                        "max-ack-delay": 5000,
                        "max-unacked-clients": 5,
                        "peers": [
                            {
                                "name": "server1",
                                "url": "http://192.168.56.33:8080/",
                                "role": "primary",
                                "auto-failover": true
                            },
                            {
                                "name": "server2",
                                "url": "http://192.168.56.66:8080/",
                                "role": "standby",
                                "auto-failover": true
                            },
                            {
                                "name": "server3",
                                "url": "http://192.168.56.99:8080/",
                                "role": "backup",
                                "auto-failover": false
                            }
                        ]
                    } ]
                }
            }
        ],
        "subnet4": [
            {
                "subnet": "192.0.3.0/24",
                "pools": [
                    {
                        "pool": "192.0.3.100 - 192.0.3.250",
                        "client-class": "HA_server1"
                    }
                ],
                "option-data": [
                    {
                        "name": "routers",
                        "data": "192.0.3.1"
                    }
                ],
                "relay": { "ip-address": "10.1.2.3" }
            }
        ],
        ...
    }
}
This configuration is very similar to the load-balancing
configuration described in the Load-Balancing Configuration section,
with a few notable differences.
The mode is now set to hot-standby,
in which only one server responds to DHCP clients.
If the primary server is online, it responds to
all DHCP queries. The standby server takes over all
DHCP traffic if it discovers that the primary is unavailable.
In this mode, the non-primary active server is called
standby and that is its role.
Finally, because only one server responds to
DHCP queries at any given time, there is only one scope - HA_server1 -
in use within the pool definitions. In fact, the client-class
parameter could be removed from this configuration without harm,
because there can be no conflicts in lease allocations by different
servers as they do not allocate leases concurrently. The
client-class remains in this example mostly for
demonstration purposes, to highlight the differences between the
hot-standby and load-balancing modes of operation.
Lease Information Sharing
An HA-enabled server informs its active partner about allocated
or renewed leases by sending appropriate control commands, and the partner
updates the lease information in its own database. When the server starts
up for the first time or recovers after a failure, it synchronizes its
lease database with its partner. These two mechanisms guarantee
consistency of the lease information between the servers and allow the
designation of one of the servers to handle the entire DHCP traffic load if
the other server becomes unavailable.
In some cases, though, it is desirable to disable lease updates
and/or database synchronization between the active servers, if the
exchange of information about the allocated leases is performed
using some other mechanism. Kea supports various database types
that can be used to store leases, including MySQL, PostgreSQL, and Cassandra.
Those databases include built-in solutions for data replication which
are often used by Kea administrators to provide redundancy.
The HA hook library supports such scenarios by
disabling lease updates over the control channel and/or lease database
synchronization, leaving the server to rely on the database replication
mechanism. This is controlled by the two boolean parameters
send-lease-updates and sync-leases,
whose values default to true:
{
    "Dhcp4": {
        ...
        "hooks-libraries": [
            {
                "library": "/usr/lib/kea/hooks/libdhcp_lease_cmds.so",
                "parameters": { }
            },
            {
                "library": "/usr/lib/kea/hooks/libdhcp_ha.so",
                "parameters": {
                    "high-availability": [ {
                        "this-server-name": "server1",
                        "mode": "load-balancing",
                        "send-lease-updates": false,
                        "sync-leases": false,
                        "peers": [
                            {
                                "name": "server1",
                                "url": "http://192.168.56.33:8080/",
                                "role": "primary"
                            },
                            {
                                "name": "server2",
                                "url": "http://192.168.56.66:8080/",
                                "role": "secondary"
                            }
                        ]
                    } ]
                }
            }
        ],
        ...
    }
}
In the most typical use case, both parameters are set to the same
value, i.e. both are false if database
replication is in use, or both are true otherwise.
Introducing two separate parameters to control lease updates and
lease-database synchronization is aimed at possible special use
cases; for example, when synchronization is performed by copying a lease file
(therefore sync-leases is set to
false), but lease updates should be conducted
as usual (send-lease-updates is set to
true). It should be noted that Kea does not
natively support such use cases, but users may develop their own
scripts and tools around Kea to provide such mechanisms. The HA
hooks library configuration is designed to maximize flexibility of administration.
Controlling Lease-Page Size Limit
An HA-enabled server initiates synchronization of the lease
database after downtime or upon receiving the ha-sync
command. The server uses the lease commands provided by
libdhcp_lease_cmds.so to fetch leases from its
partner server (lease queries). The size of the results page
(the maximum number of leases to be returned in a single response to one
of these commands) can be controlled via the HA hooks library configuration.
Increasing the page size decreases the number of lease queries sent to
the partner server, but it causes the partner server to generate
larger responses, which lengthens transmission time as well as
increases memory and CPU utilization on both servers. Decreasing the
page size helps to decrease resource utilization, but requires
more lease queries to be issued to fetch the entire lease
database.
The default value of the sync-page-limit parameter
controlling the page size is 10000. This means that the entire
lease database can be fetched with a single command if the
database contains 10000 leases or fewer.
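The trade-off between page size and the number of lease queries is simple arithmetic, sketched below. The function name is hypothetical, and the result is a rough estimate that follows the description above; it does not account for any additional query the server might issue to confirm that no leases remain.

```python
import math


def lease_queries_needed(total_leases: int, sync_page_limit: int = 10000) -> int:
    """Estimate how many lease queries are needed to fetch the whole database."""
    if sync_page_limit <= 0:
        raise ValueError("sync_page_limit must be positive")
    # At least one query is always sent, even for an empty database.
    return max(1, math.ceil(total_leases / sync_page_limit))


print(lease_queries_needed(10000))         # 1
print(lease_queries_needed(10001))         # 2
print(lease_queries_needed(250000, 5000))  # 50
```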
Discussion About TimeoutsIn deployments with a large number of clients connected to the
network, lease-database synchronization after a server failure
may be a time-consuming operation. The synchronizing server must
gather all leases from its partner, which yields a large response
over the RESTful interface. The server receives leases using the
paging mechanism described in Controlling Lease-Page Size Limit.
Before the page of leases is fetched, the synchronizing server
sends a dhcp-disable command to disable the DHCP
service on the partner server. If the service is already disabled, this
command will reset the timeout for the DHCP service being disabled.
This timeout value is by default set to 60 seconds. If fetching a
single page of leases takes longer than the specified time, the partner server will assume that
the synchronizing server died and will resume its DHCP service.
The connection of the synchronizing server with its partner is also
protected by the timeout. If the synchronization of a single page
of leases takes longer than the specified time, the synchronizing server
terminates the connection and the synchronization fails.
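The behavior of the partner's disable timeout can be illustrated with a toy model. This is only a sketch of the rules stated above (each dhcp-disable restarts the countdown, and the service resumes once the countdown elapses); the class and method names are hypothetical and do not correspond to Kea code.

```python
class DisabledServiceTimer:
    """Toy model of a DHCP service that re-enables itself after a timeout."""

    def __init__(self):
        self._enable_at = None  # None means the service is currently enabled

    def disable(self, now, max_period=60.0):
        # A repeated disable request simply restarts the countdown.
        self._enable_at = now + max_period

    def enable(self):
        self._enable_at = None

    def is_enabled(self, now):
        if self._enable_at is not None and now >= self._enable_at:
            self._enable_at = None  # timeout elapsed: the service resumes
        return self._enable_at is None

timer = DisabledServiceTimer()
timer.disable(now=0)     # sent before fetching the first page
timer.disable(now=45)    # sent before the second page; countdown restarts
print(timer.is_enabled(now=100))  # still disabled, since 100 < 45 + 60
print(timer.is_enabled(now=106))  # the second page took too long; service resumed
```

In this model the partner stays disabled as long as every page arrives within the timeout, which is why the time needed for a single page, rather than for the whole database, is the value to compare against the timeout.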
Both timeout values are controlled by a single configuration
parameter: sync-timeout. The following
configuration snippet demonstrates how to modify the timeout for
automatic re-enabling of the DHCP service on the partner server
and how to increase the timeout for fetching a single page of leases from 60 seconds
to 90 seconds:
{
"Dhcp4": {
...
"hooks-libraries": [
{
"library": "/usr/lib/kea/hooks/libdhcp_lease_cmds.so",
"parameters": { }
},
{
"library": "/usr/lib/kea/hooks/libdhcp_ha.so",
"parameters": {
"high-availability": [ {
"this-server-name": "server1",
"mode": "load-balancing",
"sync-timeout": 90000,
"peers": [
{
"name": "server1",
"url": "http://192.168.56.33:8080/",
"role": "primary"
},
{
"name": "server2",
"url": "http://192.168.56.66:8080/",
"role": "secondary"
}
]
} ]
}
}
],
...
}
}
It is important to note that extending this sync-timeout value may sometimes
be insufficient to prevent issues with timeouts during
lease-database synchronization. The control commands travel via the
Control Agent, which also monitors incoming (with a synchronizing
server) and outgoing (with a DHCP server) connections for timeouts.
The DHCP server also monitors the connection from the Control
Agent for timeouts. Those timeouts cannot currently be modified
via configuration; extending these timeouts is only possible by
modifying them in the Kea code and recompiling the server. The
relevant constants are located in the Kea source at:
src/lib/config/timeouts.h.
Pausing HA State Machine
The high-availability state machine includes many different
states, described in detail in the section on server states.
The server enters each state when certain conditions are met, most
often taking into account the partner server's state. In some states
the server performs specific actions, e.g. synchronization of the
lease database in the syncing state or responding
to DHCP queries according to the configured mode of operation in the
load-balancing and hot-standby
states.
By default, transitions between the states are performed
automatically and the server administrator has no direct control
when the transitions take place; in most cases, the
administrator doesn't need such control. In some situations,
however, the administrator may want to "pause" the HA state
machine in a selected state to perform some additional administrative
actions before the server transitions to the next state.
Consider a server failure which results in the loss of the entire
lease database. Typically, the server will rebuild its lease database
when it enters the syncing state by querying
the partner server for leases, but it is possible that the
partner was also experiencing a failure and lacks lease information.
In this case, it may be required to reconstruct lease databases on
both servers from some external source, e.g. a backup server. If the
lease database is to be reconstructed via RESTful API, the
servers should be started in the initial, i.e. waiting,
state and remain in this state while leases are being added. In
particular, the servers should not attempt to synchronize their lease
databases nor start serving DHCP clients.
The HA hooks library provides configuration parameters and a
command to control when the HA state machine should be paused and
resumed. The following configuration causes the HA state machine
to pause in the waiting state after server startup.
"Dhcp4": {
...
"hooks-libraries": [
{
"library": "/usr/lib/kea/hooks/libdhcp_lease_cmds.so",
"parameters": { }
},
{
"library": "/usr/lib/kea/hooks/libdhcp_ha.so",
"parameters": {
"high-availability": [ {
"this-server-name": "server1",
"mode": "load-balancing",
"peers": [
{
"name": "server1",
"url": "http://192.168.56.33:8080/",
"role": "primary"
},
{
"name": "server2",
"url": "http://192.168.56.66:8080/",
"role": "secondary"
}
],
"state-machine": {
"states": [
{
"state": "waiting",
"pause": "once"
}
]
}
} ]
}
}
],
...
}
The pause parameter value once
denotes that the state machine should be paused upon the first transition
to the waiting state; later transitions to this state
will not cause the state machine to pause. Two other supported values of the
pause parameter are: always and
never. The latter is the default value for each state,
which instructs the server never to pause the state machine.
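The meaning of the three pause values can be sketched with a small model. It is an illustration of the semantics described above, not the library's implementation; the class name and its interface are hypothetical.

```python
class PauseTracker:
    """Decide whether the state machine pauses when entering a state."""

    VALID = ("once", "always", "never")

    def __init__(self, config):
        # config maps a state name to "once", "always" or "never".
        for state, mode in config.items():
            if mode not in self.VALID:
                raise ValueError("unsupported pause value for %s: %s" % (state, mode))
        self._config = dict(config)
        self._paused_before = set()

    def should_pause(self, state):
        mode = self._config.get(state, "never")  # "never" is the default
        if mode == "always":
            return True
        if mode == "once" and state not in self._paused_before:
            self._paused_before.add(state)
            return True
        return False

tracker = PauseTracker({"waiting": "once"})
print(tracker.should_pause("waiting"))  # first transition to waiting
print(tracker.should_pause("waiting"))  # any later transition to waiting
print(tracker.should_pause("syncing"))  # state without a pause setting
```

Only the first call reports a pause: the second transition to waiting is ignored because of once, and syncing falls back to the default of never.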
In order to "unpause" the state machine, the ha-continue
command must be sent to the paused server. This command does not take
any arguments. See Control Commands for High Availability for details
about commands specific to the HA hooks library.
It is possible to configure the state machine to pause in more than
one state. Consider the following configuration:
"Dhcp4": {
...
"hooks-libraries": [
{
"library": "/usr/lib/kea/hooks/libdhcp_lease_cmds.so",
"parameters": { }
},
{
"library": "/usr/lib/kea/hooks/libdhcp_ha.so",
"parameters": {
"high-availability": [ {
"this-server-name": "server1",
"mode": "load-balancing",
"peers": [
{
"name": "server1",
"url": "http://192.168.56.33:8080/",
"role": "primary"
},
{
"name": "server2",
"url": "http://192.168.56.66:8080/",
"role": "secondary"
}
],
"state-machine": {
"states": [
{
"state": "ready",
"pause": "always"
},
{
"state": "partner-down",
"pause": "once"
}
]
}
} ]
}
}
],
...
}
This configuration instructs the server to pause the state
machine every time it transitions to the ready state
and upon the first transition to the partner-down
state.
Refer to the section on server states for a complete
list of server states. The state machine can be paused in any of the
supported states; however, it is not practical for the
backup and terminated states because
the server never transitions out of these states anyway.
In the syncing state the server is paused
before it makes an attempt to synchronize the lease database with a partner.
To pause the state machine after lease-database synchronization,
use the ready state instead.
The state of the HA state machine depends on the state of the
cooperating server. Therefore, it must be taken into account that
pausing the state machine of one server may affect the operation of the
partner server. For example: if the primary server is paused in the
waiting state, the partner server will also remain in
the waiting state until the state machine of the
primary server is resumed and that server transitions to the
ready state.
Control Agent Configuration
The Control Agent is the Kea daemon which provides a RESTful interface
to control Kea servers; it is described in detail in its own section
of the Kea documentation.
The same functionality is used by the High Availability hook library to
establish communication between the HA peers. Therefore, the HA
library requires that the Control Agent (CA) be started for each DHCP
instance within the HA setup. If the Control Agent is not started,
the peers will not be able to communicate with the particular DHCP
server (even if the DHCP server itself is online) and may eventually
consider this server to be offline.
The following is an example configuration for the CA running
on the same machine as the primary server. This configuration is
valid for both the load-balancing and the hot-standby cases presented in
previous sections.
{
"Control-agent": {
"http-host": "192.168.56.33",
"http-port": 8080,
"control-sockets": {
"dhcp4": {
"socket-type": "unix",
"socket-name": "/tmp/kea-dhcp4-ctrl.sock"
},
"dhcp6": {
"socket-type": "unix",
"socket-name": "/tmp/kea-dhcp6-ctrl.sock"
}
}
}
}
Control Commands for High Availability
Even though the HA hook library is designed to automatically
resolve issues with DHCP service interruptions by redirecting the
DHCP traffic to a surviving server and synchronizing the lease
database when required, it may be useful for the administrator to
have more control over the server behavior. In particular, it may be
useful to be able to trigger lease-database synchronization on demand.
It may also be useful to manually set the HA scopes that are being
served.
Note that the backup server can sometimes be used to handle
DHCP traffic if both active servers are down. Backup
servers do not perform the failover function automatically. Thus, in
order to use the backup server to respond to DHCP queries,
the server administrator must enable this function manually.
The following sections describe commands supported by the
HA hook library which are available for the administrator.
ha-sync Command
The ha-sync command instructs the
server to synchronize its local lease database with the
selected peer. The server fetches all leases from the peer and
updates those locally stored leases which are older than
those fetched. It also creates new leases when any of those
fetched do not exist in the local database. All leases that
are not returned by the peer but are in the local database are
preserved. The database synchronization is unidirectional;
only the database on the server to which the command has been
sent is updated. To synchronize the peer's database, a
separate ha-sync command has to be issued to that
peer.
Database synchronization may be triggered for
both active and backup server types. The ha-sync command
has the following structure (DHCPv4 server case):
{
"command": "ha-sync",
"service": [ "dhcp4" ],
"arguments": {
"server-name": "server2",
"max-period": 60
}
}
When the server receives this command it first disables the
DHCP service of the server from which it will be fetching leases, by
sending the dhcp-disable command to that server.
The max-period parameter specifies the maximum
duration (in seconds) for which the DHCP service should be disabled.
If the DHCP service is successfully disabled, the synchronizing
server will fetch leases from the remote server by issuing one or
more lease4-get-page commands. When the lease-database
synchronization is complete, the synchronizing server sends
the dhcp-enable command to the peer to re-enable its
DHCP service.
The max-period value should be sufficiently
long to guarantee that it doesn't elapse before the synchronization
is completed. Otherwise, the DHCP server will automatically enable
its DHCP function while the synchronization is still in progress.
If the DHCP server subsequently allocates any leases during the
synchronization, those new (or updated) leases will not be fetched
by the synchronizing server, leading to database inconsistencies.
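The one-way merge rules described for ha-sync (update local leases that are older than the fetched ones, create missing leases, preserve local leases the peer did not return) can be sketched as follows. This is an illustration only: the function name is hypothetical, and the "cltt" timestamp field used to decide which copy is newer is an assumption made for this example.

```python
def merge_leases(local, fetched):
    """Return the local lease store after a one-way sync from the peer.

    Both arguments map an address to a lease record. Each record carries
    an assumed "cltt" timestamp used to compare the two copies.
    """
    merged = dict(local)  # leases not returned by the peer are preserved
    for address, lease in fetched.items():
        current = merged.get(address)
        if current is None or current["cltt"] < lease["cltt"]:
            merged[address] = lease  # create a lease or replace an older one
    return merged

local = {
    "192.0.2.1": {"cltt": 100},
    "192.0.2.2": {"cltt": 300},
    "192.0.2.9": {"cltt": 50},
}
peer = {
    "192.0.2.1": {"cltt": 200},
    "192.0.2.2": {"cltt": 250},
    "192.0.2.3": {"cltt": 120},
}
for address, lease in sorted(merge_leases(local, peer).items()):
    print(address, lease["cltt"])
```

Only the returned store changes; the peer's data is never modified, which mirrors the unidirectional nature of the command.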
ha-scopes Command
This command allows modification of the HA scopes that the
server is serving. Consult the sections on load-balancing
and hot-standby configuration to learn what scopes
are available for different HA modes of operation. The
ha-scopes command has the following structure
(DHCPv4 server case):
{
"command": "ha-scopes",
"service": [ "dhcp4" ],
"arguments": {
"scopes": [ "HA_server1", "HA_server2" ]
}
}
This command configures the server to handle traffic from
both HA_server1 and HA_server2
scopes. To disable all scopes, specify an empty list:
{
"command": "ha-scopes",
"service": [ "dhcp4" ],
"arguments": {
"scopes": [ ]
}
}
ha-continue Command
This command is used to resume the operation of the paused HA
state machine, as described in Pausing HA State Machine.
It takes no arguments, so the command structure is as simple as:
{
"command": "ha-continue"
}
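The commands in this section are JSON documents, so they can be assembled and posted to the Control Agent with a few lines of code. The following is a minimal sketch using only the Python standard library. The helper names are hypothetical, the URL is taken from the example configurations, the reply is assumed to be JSON, and the "service" entry is added on the assumption that the command is forwarded by the Control Agent to the DHCPv4 server.

```python
import json
import urllib.request

def build_command(name, service=None, arguments=None):
    """Assemble a control command as a plain dictionary."""
    command = {"command": name}
    if service is not None:
        command["service"] = list(service)
    if arguments is not None:
        command["arguments"] = arguments
    return command

def send_command(url, command, timeout=10.0):
    """POST the command as JSON and return the decoded reply."""
    request = urllib.request.Request(
        url,
        data=json.dumps(command).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))

resume = build_command("ha-continue", service=["dhcp4"])
print(json.dumps(resume))
# Sending requires a running Control Agent, so the call is left commented out:
# reply = send_command("http://192.168.56.33:8080/", resume)
```

The same helper can build the other commands, for example build_command("ha-scopes", ["dhcp4"], {"scopes": []}) to disable all scopes.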