Grafana Integration #401

Open
dwffls opened this issue Oct 25, 2024 · 15 comments
Assignees
Labels
enhancement This tackles a new feature of the code (and not a bug) needs more work Someone has worked on this but more work is needed PR welcome 💞 This issue has no PR that tries to implement it. Please create one! ros2 PR tackling a ROS2 branch

Comments

@dwffls

dwffls commented Oct 25, 2024

During @ct2034's talk at ROSCon 2024, the idea came up to visualize /diagnostics messages in a Grafana dashboard.
I wanted to restart this conversation here in the form of a feature request.
As I'd like to contribute to this feature, I guess the first thing to tackle is the structure of this integration.

My own suggestion is to implement this in the diagnostics_aggregator and piggyback on the publishing of the /diagnostics_agg topic. This data would then be sent to Telegraf (taking inspiration from another talk at ROSCon) for later use in Grafana.

Happy to hear feedback!

@ct2034
Collaborator

ct2034 commented Oct 25, 2024

Hi
I think it is an interesting idea. Especially if you fine-tune your diagnostics information to contain all necessary status information, this could be a powerful tool for fleets.
And I also think that an aggregator is the right place to implement it. Then people can use the aggregator matching to choose the info to be piped to Grafana. I have not worked with Telegraf before, but it seems to be designed for these kinds of use cases.

@nnarain

nnarain commented Oct 26, 2024

Hey guys. I'm also interested in seeing what can be done here to improve diagnostics.

My company has used an approach like this for many years (not with Telegraf/Grafana, but with a similar stack), and it lets us visualize diagnostics metrics.

It might be worth discussing the future of rosdiagnostics and figuring out what the scope is. Personally, I think this could be implemented generically to handle any stack.

@dwffls
Author

dwffls commented Oct 26, 2024

@nnarain Could you please explain more about how you would set it up to handle any stack? I guess any implementation (be that Prometheus, Telegraf or straight to InfluxDB) would need its own configuration.

@nnarain

nnarain commented Oct 26, 2024

So my take on it would be a new composable node that consumes the aggregated diagnostics topic and forwards it to the desired endpoint (Telegraf, Elastic, a network socket, etc.).

I personally wouldn't do this in the aggregator node, so as not to add a new dependency for those who don't want to use a particular metrics stack.

Maybe something like "diagnostics_telegraf".

@dwffls
Author

dwffls commented Oct 30, 2024

Sending data to either InfluxDB itself or Telegraf works by sending a small HTTP request, with the data formatted as InfluxDB line protocol text, like this:

home,room=Living\ Room temp=21.1,hum=35.9,co=0i 1641024000
home,room=Kitchen temp=21.0,hum=35.9,co=0i 1641024000
home,room=Living\ Room temp=21.4,hum=35.9,co=0i 1641027600
home,room=Kitchen temp=23.0,hum=36.2,co=0i 1641027600
home,room=Living\ Room temp=21.8,hum=36.0,co=0i 1641031200

The only extra dependency we have to add to the aggregator node is curl. Personally I do not see this as a problem to include. @ct2034 What do you think?
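To make the line format above concrete, here is a minimal sketch of a formatter that reproduces the first sample line. The function names (`escape_key`, `format_field`, `to_line`) are hypothetical and not taken from any package in this thread; the escaping covers only the common cases (commas, spaces, and equals signs in keys and tag values, quotes in string fields).

```python
def escape_measurement(name: str) -> str:
    """Escape commas and spaces in a measurement name."""
    return name.replace(",", "\\,").replace(" ", "\\ ")


def escape_key(text: str) -> str:
    """Escape commas, equals signs, and spaces in tag keys, tag values, and field keys."""
    return text.replace(",", "\\,").replace("=", "\\=").replace(" ", "\\ ")


def format_field(value) -> str:
    """Render a field value: booleans bare, integers with an 'i' suffix, strings quoted."""
    if isinstance(value, bool):  # check bool first, since bool is a subclass of int
        return "true" if value else "false"
    if isinstance(value, int):
        return f"{value}i"
    if isinstance(value, float):
        return repr(value)
    escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def to_line(measurement: str, tags: dict, fields: dict, timestamp: int) -> str:
    """Build one record: measurement,tags fields timestamp."""
    tag_part = "".join(f",{escape_key(k)}={escape_key(v)}" for k, v in tags.items())
    field_part = ",".join(f"{escape_key(k)}={format_field(v)}" for k, v in fields.items())
    return f"{escape_measurement(measurement)}{tag_part} {field_part} {timestamp}"


line = to_line(
    "home",
    {"room": "Living Room"},
    {"temp": 21.1, "hum": 35.9, "co": 0},
    1641024000,
)
print(line)
```

The timestamps in the sample are in seconds, so the receiving side has to be told to use second precision; InfluxDB assumes nanoseconds by default.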

@nnarain

nnarain commented Oct 30, 2024

Ya I'd imagine a lot of these tools just use JSON.

So along the lines of what I mentioned earlier it might be a composable node that converts the DiagnosticArray into a JSON payload and sends it to an endpoint.

It sounds like a good use of composition to me. But it depends on what is and is not in scope of the aggregator.
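As a rough illustration of the DiagnosticArray-to-JSON idea, the sketch below uses plain dataclasses as stand-ins for the `diagnostic_msgs` message types so that it runs without ROS installed. The JSON layout (`stamp`, `status`, a flat `values` map) is an assumption for illustration, not a format agreed on in this thread; in a real node the `level` field arrives as a byte rather than an int.

```python
import json
from dataclasses import dataclass, field
from typing import List

# Level constants as defined in diagnostic_msgs/DiagnosticStatus.
LEVEL_NAMES = {0: "OK", 1: "WARN", 2: "ERROR", 3: "STALE"}


@dataclass
class KeyValue:
    key: str
    value: str


@dataclass
class DiagnosticStatus:
    """Stand-in mirroring the fields of diagnostic_msgs/DiagnosticStatus."""
    level: int
    name: str
    message: str
    hardware_id: str
    values: List[KeyValue] = field(default_factory=list)


def status_to_dict(status: DiagnosticStatus) -> dict:
    """Flatten one status into JSON-friendly types."""
    return {
        "name": status.name,
        "level": LEVEL_NAMES.get(status.level, str(status.level)),
        "message": status.message,
        "hardware_id": status.hardware_id,
        "values": {kv.key: kv.value for kv in status.values},
    }


def array_to_json(stamp_sec: float, statuses: List[DiagnosticStatus]) -> str:
    """Serialize a list of statuses plus the header stamp (hypothetical layout)."""
    payload = {"stamp": stamp_sec, "status": [status_to_dict(s) for s in statuses]}
    return json.dumps(payload, sort_keys=True)


example = array_to_json(
    1641024000.0,
    [DiagnosticStatus(1, "/robot/battery", "Low charge", "bat0", [KeyValue("voltage", "11.4")])],
)
print(example)
```

Keeping the conversion a pure function means the same code could sit behind any transport (HTTP, a socket, a file) without changes.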

@ct2034
Collaborator

ct2034 commented Dec 3, 2024

I have thought about this again. Yes, it is only a dependency on curl. But I think it should be a separate package, just to separate the concerns more clearly. Then we would also be able to support other backends down the line. It is also functionality that I don't think belongs to the default feature set one expects from diagnostics, so it should be in its own package.

@ct2034 ct2034 self-assigned this Dec 3, 2024
@ct2034 ct2034 added enhancement This tackles a new feature of the code (and not a bug) ros2 PR tackling a ROS2 branch needs more work Someone has worked on this but more work is needed PR welcome 💞 This issue has no PR that tries to implement it. Please create one! labels Dec 3, 2024
@dwffls
Author

dwffls commented Dec 4, 2024

All right, that settles it. I have some time on my hands to start work on this, and I will post the fork here when I have something up and running.

To start with, I'll name the package "diagnostics_remote" and the node "telegraf". Any input on this naming is appreciated.

@ct2034
Collaborator

ct2034 commented Dec 4, 2024

Sounds good. :) Looking forward to seeing what you come up with.

For the package naming, I am thinking about something like:

  • diagnostics_remote_bridge
  • diagnostics_remote_logging
  • diagnostics_remote_export

I wanted to find something a little more descriptive.

The node naming sounds good. Then we could have other node names for other backends. I think that makes sense.

@dwffls
Author

dwffls commented Dec 4, 2024

I'll start with diagnostics_remote_logging. If anything better comes up in this thread, I'll change it.

@dwffls
Author

dwffls commented Dec 4, 2024

I've prepared a working version of the diagnostics code, available at https://github.com/dwffls/diagnostics.

The conversion logic for diagnostics messages to the InfluxDB line protocol is in a separate header file for reusability, such as in nodes sending data directly to InfluxDB.
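One way such a reusable conversion step could be organized is sketched below: a pure function maps a diagnostic status to a backend-neutral point (measurement, tags, fields, timestamp), which any line protocol serializer can then render. This is not the code from the linked repository. The measurement name `diagnostics`, the choice of `name` and `hardware_id` as tags, and the numeric parsing of key/value strings are all assumptions made for illustration.

```python
import math
from dataclasses import dataclass
from typing import Dict, List, Tuple, Union

FieldValue = Union[int, float, str]


@dataclass
class Point:
    """Backend-neutral record, ready for a line protocol serializer."""
    measurement: str
    tags: Dict[str, str]
    fields: Dict[str, FieldValue]
    timestamp_ns: int


def parse_value(text: str) -> FieldValue:
    """Diagnostic values are strings; store numeric ones as numbers so they can be graphed."""
    try:
        return int(text)
    except ValueError:
        pass
    try:
        number = float(text)
    except ValueError:
        return text
    # NaN and infinity have no numeric line protocol form, so keep them as text.
    return number if math.isfinite(number) else text


def status_to_point(
    name: str,
    hardware_id: str,
    level: int,
    message: str,
    values: List[Tuple[str, str]],
    stamp_ns: int,
) -> Point:
    """Map one status to a point (hypothetical mapping, see the assumptions above)."""
    fields: Dict[str, FieldValue] = {key: parse_value(raw) for key, raw in values}
    # The status's own level and message are written last, so they win on a key clash.
    fields["level"] = int(level)
    fields["message"] = message
    tags = {"name": name}
    if hardware_id:
        tags["hardware_id"] = hardware_id
    return Point("diagnostics", tags, fields, stamp_ns)


point = status_to_point(
    "/robot/battery",
    "bat0",
    1,
    "Low charge",
    [("voltage", "11.4"), ("cycles", "312"), ("state", "discharging")],
    1_700_000_000_000_000_000,
)
print(point)
```

Storing `level` as an integer field rather than a tag keeps the number of distinct series small while still allowing threshold alerts on it in a dashboard.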

Testing

Set up InfluxDB (e.g., InfluxDB Cloud) and a local Telegraf instance. I followed this guide.

Finally, add this to /etc/telegraf/telegraf.conf:

[[inputs.http_listener_v2]]
  service_address = "tcp://:8186"
  paths = ["/telegraf"]
  data_format = "influx"

Once set up, data should appear in the InfluxDB UI.

Feedback on the code and/or its structure is welcome!

@avanmalleghem

I'm really interested in this topic.

Here is the ROSCon talk @dwffls mentioned: https://vimeo.com/1024971769
There is also a GitHub repository related to it: https://github.com/bonsairobotics/ros_health_components

You can see the telegraf_bridge package for example.

@dwffls, I will definitely have a look at your repo 👍

@avanmalleghem

@dwffls I tried your repo on Humble and ran into the following issue:

  • I started Telegraf with Docker: docker run -p 8186:8186 -v $PWD/telegraf.conf:/etc/telegraf/telegraf.conf:ro telegraf, with the following config file:
[[inputs.http_listener_v2]]
  service_address = "tcp://:8186"
  paths = ["/telegraf"]
  data_format = "influx"
[[outputs.file]]
  • I started your node: ros2 run diagnostic_remote_logging telegraf

And I receive {"error":"http: bad request"} whenever your node tries to send data to Telegraf. I tried a dummy command like curl -i -XPOST 'http://localhost:8186/telegraf' --data-binary 'cpu_load_short,host=server01,region=us-west value=0.64 1434055562000000000' and it works, so I guess something is missing in your node?

In addition, the documentation of http_listener_v2 recommends using the [influxdb_v2_listener](https://github.com/influxdata/telegraf/blob/release-1.32/plugins/inputs/influxdb_v2_listener/README.md) instead of http_listener_v2 (but I guess that won't solve the issue).
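For debugging a "bad request" like the one above, it can help to repeat the manual curl check from a script that reports exactly which line was rejected. The following is a minimal sketch using only the Python standard library. The URL reuses the port and path from the listener config in this comment; `post_line` and its `opener` parameter are hypothetical names, the latter added only so the function can be exercised without a running Telegraf.

```python
from urllib.error import HTTPError
from urllib.request import Request, urlopen

# Port and path taken from the http_listener_v2 config shown in this thread.
TELEGRAF_URL = "http://localhost:8186/telegraf"


def post_line(line: str, url: str = TELEGRAF_URL, opener=urlopen):
    """POST one line protocol record and return (status_code, detail).

    On an HTTP error the detail contains the server's response body and the
    rejected line, which is usually what is needed to spot a formatting bug.
    """
    request = Request(url, data=line.encode("utf-8"), method="POST")
    try:
        with opener(request, timeout=5) as response:
            return response.status, ""
    except HTTPError as err:
        body = err.read().decode("utf-8", errors="replace").strip()
        return err.code, f"{body} | rejected line: {line!r}"
```

With Telegraf running, calling `post_line("cpu_load_short,host=server01,region=us-west value=0.64 1434055562000000000")` sends the same payload as the curl command. Connection failures (for example, Telegraf not listening) are not caught here and surface as `URLError`.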

@dwffls
Author

dwffls commented Dec 18, 2024

Could you pull the repository again? I've added some more error handling for when it posts to Telegraf.
It should now output the whole influx line when a bad request happens. It will probably still error out with the new code, but now it shows what it tries to post so that I can debug it. There is probably a problem in the conversion to the influx line protocol. So when it errors out, could you send me the new output?

As for http_listener_v2 vs influxdb_v2_listener, I think you are right: we should be using the new influxdb_v2_listener. I've changed the default URL to reflect the change. telegraf.conf should now look like this:

[[inputs.influxdb_v2_listener]]
  service_address = ":8086"
[[outputs.file]]

As we are now using the full influxdb input we could change the node to be a full "influxdb" node with an example in the readme to use telegraf as a proxy. Kind of on the fence about this one...

Edit: I've started the "rewrite" on a separate branch to send it directly to InfluxDB as an option. A README will follow with instructions for both Telegraf and InfluxDB itself.

Let me know if anything else doesn't work.
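Since the listener above speaks the InfluxDB v2 HTTP API, the same request shape should work against Telegraf and against InfluxDB directly. Below is a minimal sketch of building such a write request. The endpoint `/api/v2/write`, its `org`, `bucket`, and `precision` query parameters, and the `Token` authorization scheme come from the InfluxDB v2 API. The function name and the example org, bucket, and token values are placeholders, and nothing here reflects how the node on the branch actually does it.

```python
from typing import Iterable, Optional
from urllib.parse import urlencode
from urllib.request import Request


def build_v2_write_request(
    base_url: str,
    org: str,
    bucket: str,
    lines: Iterable[str],
    token: Optional[str] = None,
    precision: str = "ns",
) -> Request:
    """Build (but do not send) a POST to the InfluxDB v2 write endpoint."""
    query = urlencode({"org": org, "bucket": bucket, "precision": precision})
    url = f"{base_url.rstrip('/')}/api/v2/write?{query}"
    headers = {"Content-Type": "text/plain; charset=utf-8"}
    if token:
        # Needed for InfluxDB itself; the Telegraf listener only checks it if configured to.
        headers["Authorization"] = f"Token {token}"
    body = "\n".join(lines).encode("utf-8")  # one record per line
    return Request(url, data=body, headers=headers, method="POST")


# Placeholder values; port 8086 matches the listener config above.
request = build_v2_write_request(
    "http://localhost:8086",
    org="my-org",
    bucket="diagnostics",
    lines=["diagnostics,name=battery level=1i 1700000000000000000"],
    token="example-token",
)
print(request.full_url)
```

Passing the resulting object to `urllib.request.urlopen` sends it. Switching between Telegraf and InfluxDB is then only a matter of changing `base_url` and supplying a token.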

@dwffls
Author

dwffls commented Dec 18, 2024

I've switched to the influx_db branch for development. Please check it out, and see the README for examples of how to run it.

4 participants