# Subsystem Metrics

The subsystem metrics module offers a flexible way to collect and aggregate metrics across the application and display them at the `api/v2/metrics` endpoint.

```python
m = Metrics() # initialize the metrics object
m.inc("foo", 1) # increment a value
m.pipe_execute() # save the values to Redis
```

## Endpoint reflects Redis

The endpoint reflects whatever values are in Redis. The metrics are stored in a Redis hash called `awx_metrics`, and each metric is a field in this hash. When a POST or GET is made to the endpoint, the view loads the data stored in Redis, formats it to be Prometheus-compatible, and returns it as a response. You can view the metrics in Redis by connecting to an instance with a client.

```
redis /run/redis/redis.sock> hget awx_metrics callback_receiver_events_insert_db
"100"
```

Each control node tracks its own metrics. Thus, there will be a `callback_receiver_events_insert_db` metric for each instance in the cluster.

## Metrics class

`subsystem_metrics.Metrics` is a class that can access, track, and update the Redis metrics. It is the interface between the Python code and the Redis instance that ultimately stores the metrics. Different parts of the application are expected to initialize their own `Metrics` object and use it to interface with Redis.

## Performance consideration

Saving to Redis can be expensive if done too frequently, e.g. on every iteration of a tight while loop.

As such, the subsystem metrics module is designed to track and aggregate data in memory, and only write to Redis periodically. Because the endpoint only reflects what is in Redis, there can be a delay between the moment a measurement is taken and the moment it is displayed at the endpoint.

The `inc` and `set` methods only affect the in-memory values; calling `pipe_execute` updates the corresponding metrics in Redis. The trade-off with this design is the real-time accuracy of the measurements reported at the `api/v2/metrics` endpoint. Keep in mind that this endpoint is likely to be scraped by Prometheus at an interval of 5-15 seconds. Therefore, a system that reports metrics at a resolution finer than 1 second is not necessary.
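To make the trade-off concrete, here is a minimal sketch of the buffering idea. `BufferedCounter` is a hypothetical toy class, not the real `Metrics` implementation, and a plain dict stands in for the Redis hash.

```python
class BufferedCounter:
    """Toy stand-in for in-memory tracking (hypothetical, not the AWX class)."""

    def __init__(self, store):
        self.store = store   # dict standing in for the Redis hash
        self.pending = {}    # increments not yet written out
        self.writes = 0      # number of simulated writes to the store

    def inc(self, field, value):
        # Cheap: only touches local memory.
        self.pending[field] = self.pending.get(field, 0) + value

    def pipe_execute(self):
        # In the real system this is the expensive step, so it should run
        # once per flush rather than once per increment.
        for field, value in self.pending.items():
            self.store[field] = self.store.get(field, 0) + value
        self.pending.clear()
        self.writes += 1


fake_redis = {}
counter = BufferedCounter(fake_redis)
for _ in range(1000):
    counter.inc("foo", 1)
counter.pipe_execute()
print(fake_redis["foo"], counter.writes)  # 1000 1
```

A thousand increments result in a single write to the store, which is the behavior the design aims for.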

## Thread safety

The `Metrics` object is not thread-safe. Each thread should initialize and use its own `Metrics` object. `set` and `inc` update a basic Python int object without using any sort of mutex or lock. As such, multiple threads accessing and updating this value could lead to inaccuracies and race conditions.

However, from the perspective of Redis, `pipe_execute` *is* thread safe. So multiple `Metrics` objects can track and increment the same metric across threads and processes.

```
                                                     In memory
          Thread 1  +-------------+                 +---------+
                    |             |  inc("foo", 1)  |         |
        +---------->|  Metrics A  | --------------->|  IntM() +-----------------+
        |           |             |                 |         |   pipe_execute  |
        |           +-------------+                 +---------+                 |
        |                                                                   +---v---+
Process                                                                     |       |
                                                                            | Redis |
        |                                                                   |       |
        |                                            In memory              +---^---+
        |           +-------------+                 +---------+                 |
        | Thread 2  |             |  inc("foo", 1)  |         |                 |
        +---------->|  Metrics B  | --------------->|  IntM() +-----------------+
                    |             |                 |         |   pipe_execute
                    +-------------+                 +---------+
```
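The pattern in the diagram can be sketched with plain Python threads. This is an illustration under assumptions: a dict plays the role of Redis, and a lock models the atomic update that Redis provides. None of these names come from the AWX code.

```python
import threading

store = {}                     # stands in for Redis
store_lock = threading.Lock()  # models the atomicity Redis provides on update


def worker(iterations):
    local = 0                  # each thread owns its own in-memory counter
    for _ in range(iterations):
        local += 1             # no lock needed: nothing else touches `local`
    with store_lock:           # the flush is the only shared step
        store["foo"] = store.get("foo", 0) + local


threads = [threading.Thread(target=worker, args=(10_000,)) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(store["foo"])  # 20000
```

Because each thread increments only its own counter, the only point of contention is the final flush.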

## When to call pipe_execute

As mentioned, it is best practice to ensure `pipe_execute` is not called too frequently. For convenience, the `should_pipe_execute` method can be used to determine whether enough time has elapsed since the last `pipe_execute` to warrant a new call. This interval is determined by the `SUBSYSTEM_METRICS_INTERVAL_SAVE_TO_REDIS` setting.

```python
m = Metrics()
while True:
    m.inc("foo", 1)
    if m.should_pipe_execute():
        m.pipe_execute()
    if some_condition:
        break
m.pipe_execute()
```

Although the metric `foo` is incremented very frequently in memory, the metrics are not saved to Redis on each iteration of the while loop. Instead, they are only saved when `should_pipe_execute` returns `True`.

In the example above, once `some_condition` becomes true, there may still be values accrued in the `Metrics` object that have not yet been saved to Redis. Therefore, a final `pipe_execute` is needed to ensure all values are updated in Redis.
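The elapsed-time check behind a method like `should_pipe_execute` can be approximated as follows. `FlushTimer` and its method names are hypothetical, and the interval is passed in directly rather than read from `SUBSYSTEM_METRICS_INTERVAL_SAVE_TO_REDIS`. The actual implementation may differ.

```python
import time


class FlushTimer:
    """Hypothetical helper mirroring the idea behind should_pipe_execute."""

    def __init__(self, interval_seconds, clock=time.monotonic):
        self.interval = interval_seconds
        self.clock = clock
        self.last_flush = clock()

    def should_flush(self):
        # True once at least `interval` seconds have passed since the last flush.
        return self.clock() - self.last_flush >= self.interval

    def mark_flushed(self):
        self.last_flush = self.clock()


# A fake clock makes the behavior visible without sleeping.
now = [0.0]
timer = FlushTimer(interval_seconds=2, clock=lambda: now[0])
before = timer.should_flush()  # no time has passed yet
now[0] = 2.5
after = timer.should_flush()   # the interval has elapsed
timer.mark_flushed()
reset = timer.should_flush()   # the timer has restarted
print(before, after, reset)    # False True False
```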

## Metric types

#### Metrics intended to increment over time

* `IntM` - data that can be represented by an integer (whole number), e.g. number of events inserted into database
* `FloatM` - data that can be represented by a float, e.g. time it took to insert events into database

The above metrics are designed to increment (increase) the values in Redis. That is, calling `pipe_execute` will *add* the value currently stored in-memory to the value stored in Redis.

Note: to decrease a value, increment by a negative number, e.g. `inc("foo", -1)`.

#### Metrics intended to override the previous value

* `SetIntM` - e.g. number of events in the Redis queue right now
* `SetFloatM` - e.g. time it took to execute the last task manager

The above metrics are designed to override whatever values are in Redis. Calling `pipe_execute` will *set* the value stored in Redis to the value currently stored in memory, overriding the previous value.
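The difference between the incrementing and overriding families can be summarized in a short sketch. The `flush` function, the `kind` argument, and the field names are made up for illustration, and a dict stands in for Redis.

```python
def flush(store, pending, kind):
    """Apply pending in-memory values to `store` (a dict standing in for Redis)."""
    for field, value in pending.items():
        if kind == "increment":   # IntM / FloatM style: add to the stored value
            store[field] = store.get(field, 0) + value
        elif kind == "set":       # SetIntM / SetFloatM style: replace the stored value
            store[field] = value
        else:
            raise ValueError(f"unknown kind: {kind}")


store = {"events_inserted": 100, "queue_size": 40}
flush(store, {"events_inserted": 5}, kind="increment")
flush(store, {"queue_size": 7}, kind="set")
print(store)  # {'events_inserted': 105, 'queue_size': 7}
```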

#### Observing data that falls into buckets

* `HistogramM` - observations of a measurement across time that falls into pre-defined buckets

For example, the following metric captures how many events are batch-inserted into the database.

```
callback_receiver_batch_events_insert_db_bucket{le="10",node="awx_1"} 1
callback_receiver_batch_events_insert_db_bucket{le="50",node="awx_1"} 5
callback_receiver_batch_events_insert_db_bucket{le="150",node="awx_1"} 5
callback_receiver_batch_events_insert_db_bucket{le="350",node="awx_1"} 5
callback_receiver_batch_events_insert_db_bucket{le="650",node="awx_1"} 5
callback_receiver_batch_events_insert_db_bucket{le="2000",node="awx_1"} 5
callback_receiver_batch_events_insert_db_bucket{le="+Inf",node="awx_1"} 5
callback_receiver_batch_events_insert_db_count{node="awx_1"} 5
callback_receiver_batch_events_insert_db_sum{node="awx_1"} 5
```

The histogram is cumulative, meaning each successive bucket includes the counts of the *preceding* buckets. In the example above, one occurrence of the insertion process resulted in 10 or fewer events being inserted into the database. Four (5 - 1) occurrences resulted in more than 10 and at most 50 events being inserted.
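The cumulative bucket counts can be reproduced with a few lines of Python. This is a sketch only: the bucket bounds are taken from the example above, the batch sizes are invented to match it, and `cumulative_counts` is a hypothetical helper rather than part of `HistogramM`.

```python
from bisect import bisect_left

BUCKETS = [10, 50, 150, 350, 650, 2000]  # upper bounds ("le") from the example


def cumulative_counts(observations, buckets=BUCKETS):
    per_bucket = [0] * (len(buckets) + 1)  # final slot is the +Inf bucket
    for value in observations:
        # bisect_left finds the first bucket whose upper bound is >= value
        per_bucket[bisect_left(buckets, value)] += 1
    totals, running = [], 0
    for count in per_bucket:
        running += count                   # each bucket includes all earlier ones
        totals.append(running)
    return totals


# Hypothetical batch sizes: one small batch and four medium ones.
batches = [8, 20, 25, 30, 45]
print(cumulative_counts(batches))  # [1, 5, 5, 5, 5, 5, 5]
```

The resulting list lines up with the `le="10"` through `le="+Inf"` rows shown above.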

## Metrics broadcast

Periodically, the `Metrics` object will broadcast the full metrics dataset to other control nodes in the cluster. This ensures that the metrics endpoint has data from all instances, not just the instance that the browser happens to be connected to at that moment.

The data received from other nodes is stored in Redis as a JSON string. For example, in a cluster with three control nodes, each Redis instance will contain the following keys.

```
awx_metrics_instance_awx_1
awx_metrics_instance_awx_2
awx_metrics_instance_awx_3
```

The `api/v2/metrics` endpoint will load the data from each of these keys, format it for Prometheus, and return it as a response.
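The aggregation step might look roughly like the sketch below. The JSON payload shape (a flat mapping of metric name to value), the `render_all` helper, and the sample values are assumptions made for illustration. A dict stands in for the Redis keys.

```python
import json

PREFIX = "awx_metrics_instance_"

# Stand-in for the per-instance keys; the payload shape is assumed.
redis_keys = {
    "awx_metrics_instance_awx_1": json.dumps({"callback_receiver_events_insert_db": 100}),
    "awx_metrics_instance_awx_2": json.dumps({"callback_receiver_events_insert_db": 250}),
}


def render_all(keys):
    """Turn every per-instance JSON blob into Prometheus-style lines."""
    lines = []
    for key in sorted(keys):
        if not key.startswith(PREFIX):
            continue
        node = key[len(PREFIX):]  # e.g. "awx_1"
        for name, value in sorted(json.loads(keys[key]).items()):
            lines.append(f'{name}{{node="{node}"}} {value}')
    return lines


for line in render_all(redis_keys):
    print(line)
```

Each instance contributes its own labeled series, so the same metric name appears once per node in the response.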