Introducing Telemetry: actionable time series data from counters

Posted on October 04, 2017 by Matthew Knight

Virtually all network devices and operating systems, and some applications, provide insight into their inner workings by exposing the values of variables and counters (web hit counters are a familiar example). These counters offer visibility into the internal state of hardware or software, and they usually update as the state of the component producing them changes.

Counters may be read in a variety of ways, such as via a shared memory API, a CLI or a file. However, across platforms, operating systems and applications there is almost no standardisation of counter types or of the ways to read them. Consolidating counters from multiple sources and storing them as time series has become significantly simpler with the relatively recent availability of time-series databases, associated real-time collection agents, streaming data processing engines and visualisation applications. As long as a collection agent is available for a particular source of counters, or one can easily be written, counter values from disparate sources can be streamed into a database as time series. From there they can be plotted or used to trigger alerts, making the counters actionable in real time.

Over the last few major versions of the Metamako Operating System (MOS), we have been including InfluxData's open source InfluxDB (time-series database) and Telegraf (collection agent), with the latter configured to pull counters from the on-board management Linux instance as well as from the Layer 1+ switch and Metamako applications. A number of tools can receive real-time streams from InfluxDB, display them graphically as they update, and trigger pre-configured alerts from the streams.
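To make the idea of streaming a counter into InfluxDB more concrete, here is a minimal sketch of formatting one sample in InfluxDB's line protocol, the text format that InfluxDB accepts for writes. The measurement, tag and field names used here (port, device, rx_packets and so on) are invented for illustration and are not the names MOS or Telegraf actually emit.

```python
def _escape(value, chars):
    """Backslash-escape each character in `chars` within str(value)."""
    text = str(value)
    for ch in chars:
        text = text.replace(ch, "\\" + ch)
    return text


def _format_field(value):
    """Render a field value: integers get an 'i' suffix, strings are quoted."""
    if isinstance(value, bool):  # bool is a subclass of int, so test it first
        return "true" if value else "false"
    if isinstance(value, int):
        return f"{value}i"
    if isinstance(value, float):
        return repr(value)
    escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def to_line_protocol(measurement, tags, fields, timestamp_ns):
    """Format one sample as: measurement,tag=value field=value timestamp."""
    if not fields:
        raise ValueError("line protocol requires at least one field")
    name = _escape(measurement, ", ")
    tag_part = "".join(
        f",{_escape(k, ',= ')}={_escape(v, ',= ')}" for k, v in sorted(tags.items())
    )
    field_part = ",".join(
        f"{_escape(k, ',= ')}={_format_field(v)}" for k, v in fields.items()
    )
    return f"{name}{tag_part} {field_part} {timestamp_ns}"


# Hypothetical sample: names and values are illustrative only.
line = to_line_protocol(
    "port",
    {"device": "sw1", "port": "et1"},
    {"rx_packets": 1200, "up": True},
    1507075200000000000,
)
print(line)
```

In practice a collection agent such as Telegraf does this formatting itself; the sketch only shows what a single time-series point looks like on the wire.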

InfluxData also offers the Chronograf product (a Web UI, offering monitoring and alerting), and Metamako has successfully used the open source Grafana product (also a Web UI, offering monitoring and alerting) with InfluxDB to build real-time streaming dashboards and alerts from the time-series data streaming into InfluxDB.

Counters from network devices

Counters are usually made visible in network devices via commands issued in a Command-line Interface (CLI). For example, the following is basic port information from a 10GbE switch:

Screen Shot 2017-10-03 at 12.40.05.png

These and other counters may also be polled by specialised management products using protocols such as the Simple Network Management Protocol (SNMP). On Metamako devices, these counters are fed in real time to Telegraf.
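Packet and byte counters of this kind are cumulative, so a poller usually turns two successive readings into a rate. The following is a minimal sketch of that calculation, assuming a fixed-width counter that wraps back to zero (as SNMP's Counter32 and Counter64 types do). The function name and the sample values are hypothetical.

```python
def counter_rate(prev_value, curr_value, interval_s, counter_bits=64):
    """Return the per-second rate between two readings of a wrapping counter.

    Assumes at most one wrap between the two readings. A counter reset
    (for example after a reboot) cannot be told apart from a wrap here.
    """
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    modulus = 1 << counter_bits
    # Modular subtraction gives the right delta even if the counter wrapped.
    delta = (curr_value - prev_value) % modulus
    return delta / interval_s


# Two readings taken 10 seconds apart (illustrative values).
print(counter_rate(1000, 4000, 10))
# A 32-bit counter that wrapped between readings.
print(counter_rate(2**32 - 100, 400, 10, counter_bits=32))
```

Storing the raw counter and computing the rate at query time is an equally valid design; the sketch only illustrates the arithmetic.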

 

Counters from the operating system

Linux provides two virtual file systems, procfs and sysfs, which contain a myriad of counters that can be read as files. A number of system tools, such as top and sar, read and present them.
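As an illustration of reading counters as files, here is a minimal sketch that parses the text of /proc/stat into a dictionary. The sample text is made up, and the parser only handles the per-CPU lines and a few single-value lines; it is not how top, sar or Telegraf are implemented.

```python
# Column order of the "cpu" lines in /proc/stat; older kernels expose fewer.
CPU_FIELDS = (
    "user", "nice", "system", "idle", "iowait",
    "irq", "softirq", "steal", "guest", "guest_nice",
)
SINGLE_VALUE_KEYS = {"ctxt", "btime", "processes", "procs_running", "procs_blocked"}


def parse_proc_stat(text):
    """Parse the contents of /proc/stat into nested dictionaries of integers."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        if not parts:
            continue
        name, values = parts[0], parts[1:]
        if name.startswith("cpu"):
            # CPU times are cumulative ticks since boot (units of USER_HZ).
            counters[name] = {f: int(v) for f, v in zip(CPU_FIELDS, values)}
        elif name in SINGLE_VALUE_KEYS and values:
            counters[name] = int(values[0])
    return counters


# Illustrative sample; on a Linux host the text would come from
# open("/proc/stat").read() instead.
SAMPLE = """\
cpu  2255 34 2290 22625563 6290 127 456 0 0 0
cpu0 1132 34 1441 11311718 3675 127 438 0 0 0
ctxt 1990473
btime 1062191376
processes 2915
procs_running 1
procs_blocked 0
"""

stats = parse_proc_stat(SAMPLE)
print(stats["cpu"]["idle"], stats["ctxt"])
```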

Here is an example of top output:

top.png

 

Microsoft Windows exposes counters pertaining to its operation/applications via its Resource Monitor application or Performance Counters API:

Resource Monitor.png

 

Counters from applications

Applications offer up counters via a number of methods, including writing them periodically to log files, offering a shared memory interface that allows them to be read externally, and writing them to the console and/or a GUI. Increasingly, applications offer an HTTP interface that can return information and counters in a variety of formats, the most common being HTML and JSON.

One example is the Nginx HTTP and reverse proxy server, which offers its operational statistics in both HTML and JSON formats. The HTML output of the web GUI looks like this: 
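When an application returns its statistics as JSON, a collector typically flattens the nested document into named numeric fields before storing them. The sketch below shows one way to do that. The JSON payload is a made-up example and does not reproduce the actual schema returned by Nginx.

```python
import json


def flatten_counters(obj, prefix=""):
    """Flatten nested JSON into {"a.b.c": number}, keeping numeric leaves only."""
    flat = {}
    if isinstance(obj, dict):
        for key, value in obj.items():
            flat.update(flatten_counters(value, f"{prefix}{key}."))
    elif isinstance(obj, list):
        for index, value in enumerate(obj):
            flat.update(flatten_counters(value, f"{prefix}{index}."))
    elif isinstance(obj, bool):
        pass  # booleans are ints in Python, but they are not counters
    elif isinstance(obj, (int, float)):
        flat[prefix.rstrip(".")] = obj
    return flat


# Hypothetical status document, for illustration only.
SAMPLE_JSON = """
{
  "version": "example",
  "connections": {"accepted": 120, "active": 3},
  "requests": {"total": 450}
}
"""

fields = flatten_counters(json.loads(SAMPLE_JSON))
print(fields)
```

Non-numeric values such as the version string are dropped here; a real collector might instead keep them as tags.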

nginx-1.png

 

Telemetry from Metamako devices

Building upon the raw counter data from the above examples, by leveraging InfluxDB, Telegraf and Grafana, counters from multiple disparate sources, devices, operating systems and applications can be used to generate actionable time-series. All Metamako devices have these telemetry components installed and running by default.

They are fed via Telegraf from the following sources:

i. Linux
  • cpu, mem, net, netstat, disk, diskio, swap, processes, kernel (/proc/stat), kernel (/proc/vmstat), linux_sysctl_fs (/proc/sys/fs)
  • The following Grafana display leverages some of the cpu, mem, disk, swap and kernel counters:

Linux Health Example.png

 
ii. Metamako Devices
  • Device environmentals e.g. temperatures
  • Metamako Layer 1+ statistics e.g. per-interface Ethernet counters, transceiver light levels
  • The following Grafana display presents the packet and error counters for all 96 10GbE ports on a Metamako C96 device:

C96 TRAFFIC-1.png

 
iii. Metamako Applications
  • Metamako application statistics e.g. MetaWatch providing data such as buffer levels, drop counters, time-synchronisation offset
  • This Grafana display illustrates plotting the difference between the MetaWatch clock and the external PPS reference it is synchronising to:

WATCH PPS SYNC-1.png
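A time series such as the clock offset above becomes actionable once a rule is evaluated against it. As a rough sketch of the kind of logic an alerting tool applies, the function below flags the moment the absolute offset has stayed above a threshold for several consecutive samples. The threshold, the sample values and the function name are all hypothetical; this is not how Grafana or MetaWatch implement alerting.

```python
def offset_alerts(samples, threshold_ns, min_consecutive=3):
    """Return timestamps at which |offset| has exceeded the threshold for
    `min_consecutive` samples in a row.

    `samples` is an iterable of (timestamp, offset_ns) pairs in time order.
    Each sustained excursion produces one alert, not one per sample.
    """
    alerts = []
    run = 0
    for timestamp, offset_ns in samples:
        if abs(offset_ns) > threshold_ns:
            run += 1
            if run == min_consecutive:
                alerts.append(timestamp)
        else:
            run = 0  # excursion ended; start counting afresh
    return alerts


# Illustrative offsets in nanoseconds, one sample per second.
samples = [(0, 10), (1, 120), (2, -150), (3, 130), (4, 140), (5, 20), (6, 200)]
alerts = offset_alerts(samples, threshold_ns=100)
print(alerts)
```

Requiring several consecutive samples is a simple way to avoid alerting on a single noisy reading.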

 

Summing Up - Advantages

  • A rich set of connectors that stream data into open source time-series databases makes it possible to store and monitor counters that used to require manual polling
  • Metamako ships InfluxData's open source InfluxDB and Telegraf with all its devices and makes a comprehensive set of telemetry information available remotely
  • This telemetry information comprises a rich set of counters allowing the simple creation of graphical representations of the data all the way up to complex dashboards via visualisation tools such as the open source Grafana

 
