Grafana Integration #401

dwffls · 2024-10-25T09:44:39Z

During @ct2034's talk at ROSCon 2024, the idea came up to visualize /diagnostics messages in a Grafana dashboard.
I wanted to restart this conversation here in the form of a feature request.
As I'd like to contribute to this feature, I guess the first thing to tackle is the structure of this integration.

My own suggestion is to implement this in the diagnostics_aggregator and piggyback on the publishing of the /diagnostics_agg topic. This data would then be sent to Telegraf (taking inspiration from another talk at ROSCon) to be used later in Grafana.

Happy to hear feedback!

ct2034 · 2024-10-25T10:35:26Z

Hi
I think it is an interesting idea. Especially if you fine-tune your diagnostics information to contain all necessary status information, this could be a powerful tool for fleets.
And I also think that an aggregator is the right place to implement it. Then people can use the aggregator matching to choose the info to be piped to Grafana. I have not worked with Telegraf before, but it seems to be designed for these kinds of use cases.

nnarain · 2024-10-26T01:32:30Z

Hey guys. I'm also interested in seeing what can be done here to improve diagnostics.

My company has used an approach like this for many years (though not with Telegraf/Grafana but a similar stack), and we can visualize diagnostics metrics.

It might be worth discussing the future of rosdiagnostics and figuring out what the scope is. I'd personally think this could be implemented generically to handle any stack.

dwffls · 2024-10-26T10:46:20Z

@nnarain Could you please explain more about how you would set it up to handle any stack? I guess any implementation (be that Prometheus, Telegraf or straight to InfluxDB) would need its own configuration.

nnarain · 2024-10-26T11:49:51Z

So my take on it would be a new composable node that consumes the aggregated diagnostics topic and forwards it to the desired endpoint (Telegraf, Elastic, a network socket, etc).

I personally wouldn't do this in the aggregator node, so as not to add a new dependency for those who don't want to use a particular metrics stack.

Maybe something like "diagnostics_telegraf".

dwffls · 2024-10-30T15:37:13Z

Sending data to either InfluxDB itself or Telegraf works by sending a small HTTP request, with the data formatted as line-based text (the InfluxDB line protocol), like this:

home,room=Living\ Room temp=21.1,hum=35.9,co=0i 1641024000
home,room=Kitchen temp=21.0,hum=35.9,co=0i 1641024000
home,room=Living\ Room temp=21.4,hum=35.9,co=0i 1641027600
home,room=Kitchen temp=23.0,hum=36.2,co=0i 1641027600
home,room=Living\ Room temp=21.8,hum=36.0,co=0i 1641031200

The only extra dependency we have to add to the aggregator node is curl. Personally I do not see this as a problem to include. @ct2034 What do you think?
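For reference, here is a minimal sketch of how one such line could be assembled. It is written in Python for brevity, not the C++ a ROS node would use, and the helper names (`escape`, `format_field`, `to_line`) are hypothetical. The escaping rules reflect my understanding of the line protocol: commas and spaces are escaped in the measurement name, commas, equals signs and spaces in tag and field keys, integers carry an `i` suffix, and string field values are double-quoted.

```python
def escape(text, chars):
    """Backslash-escape every character of `chars` that occurs in `text`."""
    for ch in chars:
        text = text.replace(ch, "\\" + ch)
    return text


def format_field(value):
    """Render one field value in line-protocol syntax."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return "true" if value else "false"
    if isinstance(value, int):
        return f"{value}i"  # integers carry an 'i' suffix
    if isinstance(value, float):
        return repr(value)
    # Strings are double-quoted; escape backslashes first, then quotes.
    return '"' + escape(str(value), '\\"') + '"'


def to_line(measurement, tags, fields, timestamp):
    """Build a single line: measurement,tags fields timestamp."""
    head = escape(measurement, ", ")
    for key, value in sorted(tags.items()):  # sorted tag keys, a common convention
        head += "," + escape(key, ",= ") + "=" + escape(value, ",= ")
    body = ",".join(
        escape(key, ",= ") + "=" + format_field(value)
        for key, value in fields.items()
    )
    return f"{head} {body} {timestamp}"


print(to_line("home", {"room": "Living Room"},
              {"temp": 21.1, "hum": 35.9, "co": 0}, 1641024000))
```

For diagnostics, one possible mapping (an assumption, not something decided in this thread) would be to use the status name and hardware_id as tags and the level, message and key/value pairs as fields.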

nnarain · 2024-10-30T19:43:28Z

Ya I'd imagine a lot of these tools just use JSON.

So along the lines of what I mentioned earlier it might be a composable node that converts the DiagnosticArray into a JSON payload and sends it to an endpoint.

It sounds like a good use of composition to me. But it depends on what is and is not in scope for the aggregator.
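To make the JSON idea concrete, here is a rough sketch of such a conversion. It is in Python with plain dataclasses standing in for the diagnostic_msgs types so that it runs without ROS; the field names mirror DiagnosticStatus and KeyValue, but the JSON layout and the `array_to_json` function are assumptions for illustration, not an agreed format.

```python
import json
from dataclasses import dataclass, field
from typing import List


@dataclass
class KeyValue:
    key: str
    value: str


@dataclass
class DiagnosticStatus:
    # Stand-in for diagnostic_msgs/DiagnosticStatus (same field names).
    name: str
    level: int = 0
    message: str = ""
    hardware_id: str = ""
    values: List[KeyValue] = field(default_factory=list)


def array_to_json(stamp_sec: float, statuses: List[DiagnosticStatus]) -> str:
    """Serialize a DiagnosticArray-like list of statuses to a JSON string."""
    payload = {
        "stamp": stamp_sec,
        "status": [
            {
                "name": s.name,
                "level": s.level,
                "message": s.message,
                "hardware_id": s.hardware_id,
                # Flatten the KeyValue list into a JSON object.
                "values": {kv.key: kv.value for kv in s.values},
            }
            for s in statuses
        ],
    }
    return json.dumps(payload)


example = DiagnosticStatus(
    name="/motors/left", level=1, message="warm",
    hardware_id="m1", values=[KeyValue("temp", "71.2")],
)
print(array_to_json(1.5, [example]))
```

A node built this way would only need an HTTP or socket client on top; the payload shape would still have to be adapted to whatever the chosen endpoint expects.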

ct2034 · 2024-12-03T16:03:14Z

I have thought about this again. Yes, it is only a dependency on curl. But I think it should be a separate package, just to separate the concerns more clearly. Then we would also be able to support other backends down the line. And it is functionality that I think is not in the default feature set one expects from diagnostics, so it should be in its own package.

dwffls · 2024-12-04T08:52:25Z

All right, that settles it. I have some time on my hands to start work on this and will post the fork here when I have something up and running.

To start with, I'll name the package "diagnostics_remote" and the node "telegraf". Any input on this naming is appreciated.

ct2034 · 2024-12-04T09:03:44Z

Sounds good. :) Looking forward to seeing what you come up with.

For the package naming, I am thinking about something like:

diagnostics_remote_bridge
diagnostics_remote_logging
diagnostics_remote_export

I wanted to find something a little more descriptive.

The node naming sounds good. Then we could have other node names for other backends. I think that makes sense.

dwffls · 2024-12-04T09:16:41Z

I'll start with diagnostics_remote_logging. If anything better comes up in this thread, I'll change it.

dwffls · 2024-12-04T15:35:04Z

I've prepared a working version of the diagnostics code, available at https://github.com/dwffls/diagnostics.

The conversion logic for diagnostics messages to the InfluxDB line protocol is in a separate header file for reusability, such as in nodes sending data directly to InfluxDB.

Testing

Set up InfluxDB (e.g., InfluxDB Cloud) and a local Telegraf instance. I followed this guide.

Finally, add this to /etc/telegraf/telegraf.conf:

[[inputs.http_listener_v2]]
  service_address = "tcp://:8186"
  paths = ["/telegraf"]
  data_format = "influx"

Once set up, data should appear in the InfluxDB UI.
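To check the listener independently of the node, a request against it could be sketched as below. This is Python standard library only and is not how the node itself sends data; `build_request` and `send` are hypothetical names, and the host `localhost` is an assumption (the port and path come from the config above).

```python
import urllib.request


def build_request(url, lines):
    """Build a POST request whose body is newline-separated line protocol."""
    body = "\n".join(lines).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={"Content-Type": "text/plain; charset=utf-8"},
    )


def send(url, lines, timeout=2.0):
    """Send the lines and return the HTTP status code.

    urlopen raises urllib.error.HTTPError for 4xx/5xx responses, so a
    rejected payload surfaces as an exception rather than a return value.
    """
    with urllib.request.urlopen(build_request(url, lines), timeout=timeout) as resp:
        return resp.status


if __name__ == "__main__":
    # Assumes Telegraf is listening locally with the config shown above.
    url = "http://localhost:8186/telegraf"
    print(send(url, ["diagnostics,name=test level=0i"]))
```

The measurement and field names in the example line are made up for the test; any valid line protocol payload should do.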

Feedback on the code and/or its structure is welcome!

avanmalleghem · 2024-12-12T17:18:56Z

I'm really interested in this topic.

Here is the ROSCon talk @dwffls mentioned: https://vimeo.com/1024971769
There is also a GitHub repository related to it: https://github.com/bonsairobotics/ros_health_components

You can see the telegraf_bridge package for example.

@dwffls, I will definitely have a look at your repo 👍

avanmalleghem · 2024-12-17T16:44:03Z

@dwffls I tried your repo on Humble and ran into the following issue:

I started Telegraf with Docker: docker run -p 8186:8186 -v $PWD/telegraf.conf:/etc/telegraf/telegraf.conf:ro telegraf, with the following config file:

[[inputs.http_listener_v2]]
  service_address = "tcp://:8186"
  paths = ["/telegraf"]
  data_format = "influx"
[[outputs.file]]

I started your node: ros2 run diagnostic_remote_logging telegraf

And I receive {"error":"http: bad request"} whenever your node tries to send data to Telegraf. I tried a dummy command like curl -i -XPOST 'http://localhost:8186/telegraf' --data-binary 'cpu_load_short,host=server01,region=us-west value=0.64 1434055562000000000' and it works, so I guess there is something missing in your node?

In addition, the documentation of http_listener_v2 recommends using the [influxdb_v2_listener](https://github.com/influxdata/telegraf/blob/release-1.32/plugins/inputs/influxdb_v2_listener/README.md) instead of http_listener_v2 (but I guess that won't solve the issue).

dwffls · 2024-12-18T08:50:27Z

Could you pull the repository again? I've added some more error handling for when it posts to Telegraf.
It should now output the whole influx line when a bad request happens. It will probably still fail with the new code, but now it shows what it tries to post so that I can debug it. There is probably a problem in the conversion to the influx line protocol, so when it fails, could you send me the new output?

As for http_listener_v2 vs influxdb_v2_listener, I think you are right: we should be using the new influxdb_v2_listener. I've changed the default URL to reflect this. telegraf.conf should now look like this:

[[inputs.influxdb_v2_listener]]
  service_address = ":8086"
[[outputs.file]]
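As far as I understand the influxdb_v2_listener, it accepts writes on the InfluxDB v2 path /api/v2/write with an optional precision query parameter; please check the plugin README before relying on this. Under that assumption, the target URL could be put together like this (Python sketch, `write_url` is a hypothetical helper, and the default host is an assumption):

```python
from urllib.parse import urlencode, urlunsplit


def write_url(host="localhost", port=8086, bucket=None, precision="ns"):
    """Build the write URL for an InfluxDB v2 style endpoint.

    The /api/v2/write path and the precision/bucket parameters are assumed
    from the InfluxDB v2 write API, not taken from this thread.
    """
    query = {"precision": precision}
    if bucket is not None:
        query["bucket"] = bucket
    return urlunsplit(
        ("http", f"{host}:{port}", "/api/v2/write", urlencode(query), "")
    )


print(write_url())
print(write_url(bucket="diagnostics", precision="s"))
```

Writing straight to InfluxDB instead of going through Telegraf would, to my knowledge, also require an organization and an authorization token, which this sketch leaves out.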

As we are now using the full InfluxDB input, we could change the node to be a full "influxdb" node, with an example in the README that uses Telegraf as a proxy. I'm kind of on the fence about this one...

Edit: I've started the "rewrite" on a separate branch to send data directly to InfluxDB as an option. A README will follow with instructions for both Telegraf and InfluxDB itself.

Let me know if anything else doesn't work.

dwffls · 2024-12-18T14:51:04Z

I've switched to the influx_db branch for development. Please check it out, and see the README for examples of how to run it.

ct2034 self-assigned this Dec 3, 2024

ct2034 added the labels enhancement, ros2, needs more work, and PR welcome 💞 on Dec 3, 2024