Cloud Native Model Driven Telemetry Stack on OpenShift

cerberus
Jan 8, 2021


How do you measure a network?

Modern network monitoring is all about metrics. Ops needs to react rapidly to network changes and incidents, so telcos need something really fast with fine granularity (down to milliseconds). If you have been in telecom or networking for quite some time, your obvious choice to measure network health and performance would be SNMP. SNMP (Simple Network Management Protocol, RFC 1157) was ratified in 1990 and, thanks to its simplicity, it is still on duty and helping to monitor networks of any scale. The main advantage of SNMP is spelled out by the first letter of the acronym, (S)imple: it is based on UDP and the message format is pretty simple. (I do not consider SNMPv3 with encryption; from my point of view it is a bit of an overkill, as in my experience the encryption offloading was killing devices.) But considering current speeds and traffic volumes, simple may not be sufficient. The SNMP engine on routers/switches exposes hardware (ASIC) counters addressed by OID and returns the value to the SNMP client. Here is a simple example of SNMP polling in action:
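To make that concrete, here is a minimal polling example using the Net-SNMP command-line tools (the hostname, community string, interface index and counter values are placeholders, not real measurements):

# poll the 64-bit input-octets counter of interface index 1
$ snmpget -v2c -c public router1 IF-MIB::ifHCInOctets.1
IF-MIB::ifHCInOctets.1 = Counter64: 5421494

# poll again later: all you learn is the delta between two polls,
# so everything that happened in between is averaged away
$ snmpget -v2c -c public router1 IF-MIB::ifHCInOctets.1
IF-MIB::ifHCInOctets.1 = Counter64: 5433210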

Unfortunately, the more loaded a system is, the less responsive it becomes. Sometimes you may get updates only once every 30 seconds or even once a minute. In other words, even if you request data every 10 ms, you will keep getting the same value and possibly overwhelm your device. Therefore your judgment about the real throughput might be far from reality; it will be an average. Considering the nature of TCP traffic and micro-bursts, you may not be aware of the real throughput at all. If you are interested in a carrier-grade monitoring system based on SNMP, have a look at NOC Project. It is an open-source network monitoring and fault management system.

What is telemetry

Here is a good explanation of telemetry from Cisco. Long story short, telemetry is metrics streaming (pushing) from network elements to a telemetry receiver/server. Receivers may subscribe to specific data based on YANG models (in dial-in or dial-out mode). Therefore you receive near real-time data, or data with minimal delay. As a result, an operator can see what is happening on an interface without overloading the device with constant polling. Another advantage of telemetry is reliability, as it works over TCP, where message delivery is guaranteed. It also supports gRPC, which works over HTTP/2 and provides more efficient streaming and built-in flow control. Messages are encoded in GPB (Google Protobuf) format; have a look here if you would like to know more about this format. Here is an example of metrics with a 500 ms sampling interval:

10 data points per 5 seconds

Classic Telemetry Stack

There are a couple of libraries to work with telemetry messages:
Python Library cisco_mdt
Go library pipeline (unfortunately, it is no longer maintained and outdated)
Go library based on gNMI: pipeline-gnmi

Below is the deployment stack recommended by Cisco:

© https://github.com/cisco/bigmuddy-network-telemetry-pipeline

To implement this stack you might need roughly 11 VMs (sure, it depends on project sizing, but let's assume we are building a minimal HA solution):

+------------+-------------------------------+-----+
| role | description | qty |
+------------+-------------------------------+-----+
| Pipeline | Pipeline application consumer | 2 |
| Pipeline | Pipeline application producer | 2 |
| Kafka | Kafka Brokers | 3 |
| tsdb | Time Series Database | 2 |
| monitoring | Grafana and Prometheus | 2 |
+------------+-------------------------------+-----+

If your company has an infra department that looks after your VMs, you should not have any issues: order the VMs, deploy the applications, create an alerting plan, write operations handbooks, and you are a happy telemetry user. But who will look after all these applications? Who will restart services when they get stuck, and who will look after OS-level alarms? These are reasonable questions if you want to build a production-ready solution, not just a proof of technology.

Cloud Native

We are lucky to live in fun times and to witness a new era in application development and operations (g'day DevOps and DevNetOps). So presumably the same stack can be built on top of a Kubernetes/OpenShift platform. Sizing a Kubernetes/OpenShift cluster depends on the number of projects and your resources, so I will leave that out of scope here. My OpenShift test cluster has 3 workers (8 vCPU, 32 GB RAM, 100 GB disk each). There are many tweaks to configure all OpenShift resources correctly and efficiently, but this article is just a high-level overview.

High level architecture of Telemetry cluster on OpenShift

High Level Diagram

There are several key components in this solution:

Network Elements
In my test environment it is a Cisco IOS XR ASR9k instance running on EVE-NG with telemetry configured.

IOS XR router config
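The original screenshot is not reproduced here, but a minimal dial-out configuration along these lines would stream the generic interface counters shown later in the article (the receiver address, port and sample interval are assumptions; the sensor path and subscription name match the Kafka messages below):

telemetry model-driven
 destination-group DGroup1
  address-family ipv4 192.0.2.10 port 57500
   encoding self-describing-gpb
   protocol grpc no-tls
  !
 !
 sensor-group SGroup1
  sensor-path Cisco-IOS-XR-infra-statsd-oper:infra-statistics/interfaces/interface/latest/generic-counters
 !
 subscription Sub1
  sensor-group-id SGroup1 sample-interval 10000
  destination-id DGroup1
 !
!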

Telemetry Service
This is the entry point from our network into the OpenShift cluster. We can use the MetalLB or Keepalived operator to expose a dedicated floating IP. This service is connected to the Telegraf service. Have a look at MetalLB/Keepalived here.
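A minimal sketch of such an entry point, assuming MetalLB announces the same placeholder floating IP as in the router configuration sketch above and the routers dial out to gRPC port 57500 (names, namespace, IP and port are illustrative):

apiVersion: v1
kind: Service
metadata:
  name: telemetry-entry
  namespace: telemetry
spec:
  type: LoadBalancer
  loadBalancerIP: 192.0.2.10      # floating IP announced by MetalLB (placeholder)
  selector:
    app: telegraf-receiver        # the Telegraf receiver pods described below
  ports:
    - name: mdt-grpc
      protocol: TCP
      port: 57500
      targetPort: 57500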

Telegraf telemetry receiver/Kafka producer
As we highlighted above, we have plenty of options to receive and decode GPB messages. I would prefer to write my own telemetry server, but to keep it simple I use Telegraf with the cisco_telemetry_mdt plugin. Telegraf is the native collector for InfluxDB and supports multiple input and output modes. It is well optimised for high performance and has a low memory footprint. The Telegraf instance has 3 pods for load balancing and high availability. The cisco_telemetry_mdt plugin receives GPB messages from the network elements, Telegraf serialises the messages to Influx format and sends them to a Kafka topic. In Kafka the messages look like the text below (for sure, you may also send bare GPB into Kafka); a configuration sketch follows the sample. This is the Influx line protocol format:

ifstats,interface_name=GigabitEthernet0/0/0/0,path=Cisco-IOS-XR-infra-statsd-oper:infra-statistics/interfaces/interface/latest/generic-counters,source=ios,subscription=Sub1 packets_received=93818i,bytes_received=5421494i,packets_sent=105098i,bytes_sent=140044185i,multicast_packets_received=0i,broadcast_packets_received=6323i,multicast_packets_sent=0i,broadcast_packets_sent=1i,output_drops=0i,output_queue_drops=0i,input_drops=18i,input_queue_drops=0i,runt_packets_received=0i,giant_packets_received=0i,throttled_packets_received=0i,parity_packets_received=0i,unknown_protocol_packets_received=0i,input_errors=0i,crc_errors=0i,input_overruns=0i,framing_errors_received=0i,input_ignored_packets=0i,input_aborts=0i,output_errors=0i,output_underruns=0i,output_buffer_failures=0i,output_buffers_swapped_out=0i,applique=0i,resets=0i,carrier_transitions=0i,availability_flag=0i,last_data_time=1609257618i,seconds_since_last_clear_counters=0i,last_discontinuity_time=1609250418i,seconds_since_packet_received=0i,seconds_since_packet_sent=0i 1609257638844000000
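Here is a minimal sketch of the receiver-side Telegraf configuration, wrapped in a ConfigMap the way it would be mounted into the pods (the broker address, topic name and listening port are assumptions, not the exact values from my cluster):

apiVersion: v1
kind: ConfigMap
metadata:
  name: telegraf-receiver-config
  namespace: telemetry
data:
  telegraf.conf: |
    # receive GPB over gRPC dial-out from the routers
    [[inputs.cisco_telemetry_mdt]]
      transport = "grpc"
      service_address = ":57500"

    # serialise to Influx line protocol and publish to a Kafka topic
    [[outputs.kafka]]
      brokers = ["telemetry-kafka-bootstrap:9092"]
      topic = "telemetry"
      data_format = "influx"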

Messaging cluster (Kafka)
The Kafka cluster is managed by the Strimzi OpenShift/Kubernetes operator. Operators in OpenShift/K8s are super cool! They look after the application lifecycle, fix issues, perform day-2 operations and scale pods to handle increasing load.
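With Strimzi, the whole broker cluster is declared in a single custom resource. A minimal sketch (replica counts, storage sizes and the listener layout are illustrative, not the exact manifest from my cluster):

apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
metadata:
  name: telemetry-kafka
  namespace: telemetry
spec:
  kafka:
    replicas: 3                 # the 3 brokers from the sizing table above
    listeners:
      - name: plain
        port: 9092
        type: internal
        tls: false
    storage:
      type: persistent-claim
      size: 20Gi
  zookeeper:
    replicas: 3
    storage:
      type: persistent-claim
      size: 5Gi
  entityOperator:
    topicOperator: {}
    userOperator: {}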

Telegraf telemetry consumer
Another Telegraf instance, with a Kafka consumer input and an Influx URL output. It subscribes to the Kafka topic, reads messages and sends them to the configured URL. You may configure Victoria Metrics as the Influx endpoint, as it understands and recognises the Influx line format, versions 1 and 2.
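The consumer side is symmetrical. A minimal sketch, assuming the data is written to the InfluxDB-compatible endpoint exposed by vminsert in the clustered VictoriaMetrics setup described below (the service names, topic and consumer group are assumptions):

apiVersion: v1
kind: ConfigMap
metadata:
  name: telegraf-consumer-config
  namespace: telemetry
data:
  telegraf.conf: |
    # read Influx line protocol messages back from the Kafka topic
    [[inputs.kafka_consumer]]
      brokers = ["telemetry-kafka-bootstrap:9092"]
      topics = ["telemetry"]
      consumer_group = "telegraf-vm"
      data_format = "influx"

    # write to VictoriaMetrics through its InfluxDB-compatible endpoint
    [[outputs.influxdb]]
      urls = ["http://vminsert-telemetry-vm:8480/insert/0/influx"]
      skip_database_creation = true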

Victoria Metrics
This is the heart of our solution: a fast, cost-effective and scalable monitoring solution and time-series database. Victoria Metrics also has a Kubernetes operator for deployment and lifecycle management. The clustered version of Victoria Metrics consists of 3 key components (there are more than 3 in the clustered version, check the documentation if you are interested); a minimal operator manifest sketch follows the component list.

Victoria Metrics Clustered Version architecture ©

> VMInsert
Receives data in different formats (Influx (v1|v2) line protocol, OpenTSDB HTTP and telnet, Graphite, Prometheus) and sends the data to VMStorage.
> VMStorage
The storage engine that saves, compresses and selects data points. VM has a pretty high compression ratio, which allows it to store data points more efficiently and to run selections with low processing times. The cluster version supports replication for high availability: if one node goes down we still have the same data on another node, thanks to VMInsert and the replicas.
> VMSelect
VMSelect receives MetricsQL requests (same format as PromQL), sends the query to all connected VMStorages and de-duplicates data points. As you noticed, the same data is written to multiple nodes, so to avoid duplicates VM drops repeated data points. Downsampling is also coming ;-)
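As promised, a minimal VMCluster sketch for the VictoriaMetrics operator, assuming the operator is already installed (replica counts, retention and storage sizes are illustrative):

apiVersion: operator.victoriametrics.com/v1beta1
kind: VMCluster
metadata:
  name: telemetry-vm
  namespace: telemetry
spec:
  retentionPeriod: "1"           # keep data for 1 month
  replicationFactor: 2           # every data point is written to 2 vmstorage nodes
  vmstorage:
    replicaCount: 2
    storage:
      volumeClaimTemplate:
        spec:
          resources:
            requests:
              storage: 20Gi
  vmselect:
    replicaCount: 2
  vminsert:
    replicaCount: 2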

Grafana
Everyone knows Grafana, and OpenShift/Kubernetes has a Grafana Operator, so you do not need to bother with managing it.

Fire in the hole!

As I mentioned, my sandpit is not very powerful, but let's put it under a bit of pressure. To test the stack I have 2 message sources: a synthetic message producer (a) and the EVE-NG environment (b). By the way, to run the synthetic test I use Red Hat CodeReady Workspaces on OpenShift. It allows you to develop on your cluster with a web-based VS Code-style IDE. Unfortunately, emulated devices cannot go down to milliseconds, and the maximum I was able to get is a 10-second update interval (this depends on the virtual instance or release). For the test, I ingest messages at the maximum possible speed of the message generator, which runs on the same cluster. To monitor cluster performance we use 3 dashboards:

Let's run the test for 10 minutes and analyze the data.

OK, let's have a look at the results.

Kubernetes cluster health

Cluster utilisation for 10 mins

As you may see, the overall utilisation across the 3 projects didn't exceed 3 extra cores. The hungriest one was Kafka, because of lag and replication.

Kafka Dashboards

Strimzi dashboard

Ingestion rate: 120k messages per sec (peak)
Consume rate: ~90k messages per sec (peak). As you may see, consumption ramps up steeply; that is the magic of "autoscaling". Our consumers scale up (create one more copy) if one pod takes more than 60% of its allocated resources. Therefore we started with 1 pod, proceeded to 4 pods and ended with 1 pod after all messages had been consumed (a minimal autoscaler sketch follows this list).
Lag: This is the beauty of queues in action. All messages are saved in the queue before they are consumed and processed, therefore we can afford to lose consumers for a while. The lag was growing because the ingestion speed was higher than the reading speed.
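The "autoscaling" above is nothing more than a HorizontalPodAutoscaler on the consumer deployment. A minimal sketch matching the 60% / 1-to-4 pod behaviour described (the deployment name and namespace are assumptions):

apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: telegraf-consumer
  namespace: telemetry
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: telegraf-consumer
  minReplicas: 1
  maxReplicas: 4
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60   # add a pod when average CPU exceeds 60% of the request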

Victoria Metrics

VM Utilisation

Ingestion rate: As you may see, the maximum ingestion rate was ~100k cps (please note that this is not the maximum VM can handle).
CPU needed: Writing is easy; reading is much more challenging. As a result, you may see that we did not need more cores.

PS: Please ignore the negative projection; I used a 10-minute interval view and that is not enough to make a prediction.

Test metrics

Datapoints from 1 second of traffic

Grafana uses custom intervals, so we cannot go under 10 ms intervals in this example, but if you make an API call to VM directly, it can give you more precise data.

Conclusion

As a big fan of networking and cloud-native development, I see lots of use cases for networking applications on top of Kubernetes. This was just one example to illustrate some very cool features that Kubernetes can provide for networking. As I mentioned before, there are heaps of things happening in the CNF space. For example, a modern 5G core is containerised and designed for Kubernetes. Big companies like Red Hat are working with the community to deliver telco functions on OpenShift, such as SR-IOV, DPDK, InfiniBand, GPU support, HTTP/2 gRPC and many others.

Please leave a comment if you need the source code for the projects. I need to create a kustomize file or Helm chart to share it.
