Edge instance not syncing device state correctly #127

Open
aistisdev opened this issue Nov 19, 2024 · 5 comments

@aistisdev

aistisdev commented Nov 19, 2024

Describe the bug
One particular device on the parent ThingsBoard instance will not sync correctly with the edge, no matter what we do. We tried restarting all services and manually syncing after deleting and recreating the device. The device could not be deleted on the edge instance, so we deleted it from the edge database manually. This helped, but now, whenever we create the same device, a suffix is appended automatically:

  1. Create device 07332076 on parent thingsboard
  2. It turns into 07332076_utQtmcKllgxPXoz on both parent and edge
  3. The original 07332076 is not present

This happens no matter how many times we recreated it. For all other device names this does not seem to happen.
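For anyone triaging a similar situation, here is a minimal sketch for spotting suffixed duplicates in a list of device names. It assumes the suffix is an underscore followed by 15 alphanumeric characters, which is inferred only from the single example above (07332076_utQtmcKllgxPXoz), not from ThingsBoard documentation. The function names are hypothetical, and how the name list is exported is left open.

```python
import re
from collections import defaultdict

# Assumption: suffix = "_" + 15 alphanumeric characters, inferred from the
# one example in this report. Adjust the pattern if your suffixes differ.
SUFFIX_RE = re.compile(r"^(?P<base>.+)_(?P<suffix>[A-Za-z0-9]{15})$")


def group_suffixed(names):
    """Map each base name to the suffixed variants found in `names`."""
    groups = defaultdict(list)
    for name in names:
        match = SUFFIX_RE.match(name)
        if match:
            groups[match.group("base")].append(name)
    return dict(groups)


def orphaned_bases(names):
    """Base names that exist only in suffixed form (the original is missing)."""
    present = set(names)
    return sorted(base for base in group_suffixed(names) if base not in present)


if __name__ == "__main__":
    device_names = ["07332076_utQtmcKllgxPXoz", "12345678", "sensor_a"]
    print(group_suffixed(device_names))
    print(orphaned_bases(device_names))
```

Legitimate names that happen to end in an underscore plus 15 alphanumeric characters would be flagged too, so treat the result as a list of candidates to review, not as a delete list.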

Your Server Environment
Deployment: monolith
Deployment type: k8s
ThingsBoard Version: thingsboard/tb-edge-pe:3.6.4EDGEPE
Community or Professional Edition: Professional Edition
OS Name and Version:
NAME="AlmaLinux"
VERSION="9.3 (Shamrock Pampas Cat)"

Expected behavior
Should create device 07332076 on both thingsboard and edge without any suffixes.

To Reproduce
We don't know how to reproduce it, but it seems to have started when one of the users imported the device without attaching it to the edge group, and the edge started posting.

Screenshots
image

Additional context
At the same time we also had problems syncing a new user: the edge instance kept throwing Caused by: org.postgresql.util.PSQLException: ERROR: duplicate key value violates unique constraint "tb_user_email_key". After we deleted the user from the edge database manually, only the described problem with device 07332076 persisted.

  1. Is there some way to make edge sync this correctly?
  2. Is it possible to prevent the edge from syncing new devices to the parent if those devices already exist in the parent but are not in the edge group? This has caused us many, many problems: someone forgets to add the devices to the edge group, and then we have a bunch of "new" devices with these random suffixes.
@aistisdev aistisdev added the bug Something isn't working label Nov 19, 2024
@AndriiLandiak
Member

Hello, @aistisdev.

Thanks for bringing the problem to our attention. The suffix is added when two devices with the same name are created separately on ThingsBoard and on Edge (either during a disconnection, or when a device with that name exists on ThingsBoard but is not assigned to Edge and a device with the same name is then created on Edge).

If I understood correctly, there was a device with the name 07332076 on TB, but it wasn't assigned to the Edge group, so when you create the same device on Edge - it creates the device in the correct group, but with suffix? I was able to reproduce this, but this logic is by design, we cannot have 2 devices with the same name, but in different groups.

In my case, I was able to delete that device (07332076) from TB and recreate it (or just delete the device with the suffix and assign the original one, 07332076, to the Edge group), and no suffix was added. So could you provide some additional screenshots, etc., if the problem still exists?

@aistisdev
Author

Hi, @AndriiLandiak

If I understood correctly, there was a device with the name 07332076 on TB, but it wasn't assigned to the Edge group, so when you create the same device on Edge - it creates the device in the correct group, but with suffix?

It's a bit convoluted, but I will try to sketch out the situation:

  1. User imported device 07332076 into thingsboard, did not assign any edge group
  2. User configured device 07332076 to publish to thingsboard edge and device started publishing
  3. In the morning we had multiple 07332076_... with random suffixes
  4. I deleted the devices with suffixes

Now is the part that is convoluted...

  1. After deleting the devices with suffixes the original 07332076 also disappeared (I am not 100% sure that I did not accidentally delete it, but I don't think I did)
  2. I tried recreating the device 07332076 and what ended up happening is:

Thingsboard:
image

Thingsboard edge:
image

The correct original device with 07332076 name was on edge only, and the 07332076_... device with the suffix was both on edge and thingsboard.

  1. I could only delete the one with the suffix, I could not delete the original name without suffix from the edge via UI.
  2. I then tried restarting both services, manually syncing the edge, etc. Nothing helped: when creating 07332076, it automatically appeared as 07332076_... in thingsboard and edge.
  3. I manually deleted 07332076 from the edge database
  4. After creating 07332076 again, it automatically created 07332076_... in edge and thingsboard, without 07332076 being present anywhere.

This was the case until today. When I started to write this post and take screenshots, the situation fixed itself... I can now recreate that device correctly without any suffixes. It seems to me that something was cached somewhere and this kept happening until today, when it magically stopped.

I was able to reproduce this, but this logic is by design, we cannot have 2 devices with the same name, but in different groups.

I understand. Is it possible to configure edge to not publish anything if the device in thingsboard is not included in edge group? Because if we start deploying thousands of devices to edge and those devices are not added to the edge group, there would be an insane amount of trash devices with suffixes, which would then need to be debugged and deleted.

@AndriiLandiak
Member

User configured device 07332076 to publish to thingsboard edge and device started publishing

I am a bit confused, what does it mean - publish to thingsboard edge? Added to device group, that is assigned to Edge or Edge All group, yeah?

This was the case until today...

There could be different reasons for that. As one example, a lot of events with the name 07332076 may have been present in the DB, triggering the renaming, but had not been processed yet. Or it could be something else. If you can reproduce it, contact us again!

I understand. Is it possible to configure edge to not publish anything if the device in thingsboard is not included in edge group? Because if we start deploying thousands of devices to edge and those devices are not added to the edge group, there would be an insane amount of trash devices with suffixes, which would then need to be debugged and deleted.

As for now, there are no such options. We could consider improving this in the next release. For example, add some logic for the user to choose - either he wants to create a device with a suffix or replace an existing one. Or some other approach.

@aistisdev
Author

aistisdev commented Nov 21, 2024

I am a bit confused, what does it mean - publish to thingsboard edge? Added to device group, that is assigned to Edge or Edge All group, yeah?

It sends telemetry to our parser, which then sends the telemetry as a standard mqtt gateway api message to thingsboard edge. So it just means that the device started sending messages to our endpoint (it was idle before that).

There could be different reasons for that. As one example, a lot of events with the name 07332076 may have been present in the DB, triggering the renaming, but had not been processed yet. Or it could be something else. If you can reproduce it, contact us again!

Yeah, we also had some new user email syncing issues at the same time, which was resolved by manually removing the user from the edge database and recreating the user in thingsboard after that. Maybe that had something to do with this issue. I will keep track of these issues if they appear again.

As for now, there are no such options.

In our case it makes more sense to just leave the device inactive in thingsboard, or in both thingsboard and edge, if the device is not included in the edge group. We can have alarms based on inactivity, and the device administrator would then resolve the issue as needed. It seems this suffixed-device feature is aimed at situations where data loss is critical, but in our field a few lost messages are usually not critical. Also, we have no way to deal with the data from suffixed devices, and on top of that a new suffixed device is created for each new message, which is very confusing for administrators, especially when there can be a large number of such devices.
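Until something like this exists in Edge itself, one workaround on our side could be to filter in the parser before anything is sent to the edge. The following is only a minimal sketch under several assumptions: `allowed_devices` is a hypothetical set of names known to be assigned to the edge group (how it is populated is left open), the reading format is made up for illustration, and the topic and payload shape follow the ThingsBoard MQTT Gateway API telemetry format as I understand it, so please verify against the official docs. No MQTT client is used here; the function only builds the message.

```python
import json
import time

# Topic used by the MQTT Gateway API for telemetry (verify against the docs).
GATEWAY_TELEMETRY_TOPIC = "v1/gateway/telemetry"


def build_gateway_message(readings, allowed_devices, now_ms=None):
    """Build one gateway telemetry message, dropping unknown devices.

    readings: list of dicts like {"device": str, "values": dict, "ts": int or None}
              (hypothetical parser output format).
    allowed_devices: set of device names assigned to the edge group (assumption).
    Returns (topic, json_payload, skipped_device_names).
    """
    default_ts = int(time.time() * 1000) if now_ms is None else now_ms
    payload = {}
    skipped = []
    for reading in readings:
        name = reading["device"]
        if name not in allowed_devices:
            # Not in the edge group: drop it instead of letting the edge
            # create a new (possibly suffixed) device for it.
            skipped.append(name)
            continue
        ts = reading.get("ts")
        entry = {"ts": default_ts if ts is None else ts, "values": reading["values"]}
        payload.setdefault(name, []).append(entry)
    return GATEWAY_TELEMETRY_TOPIC, json.dumps(payload), skipped


if __name__ == "__main__":
    topic, body, skipped = build_gateway_message(
        [
            {"device": "07332076", "values": {"temperature": 21.5}, "ts": 1732000000000},
            {"device": "99999999", "values": {"temperature": 3.0}, "ts": 1732000000000},
        ],
        allowed_devices={"07332076"},
    )
    print(topic, body, skipped)
```

The skipped names could feed the inactivity alarms mentioned above, so an administrator still notices that a device is sending data but is not in the edge group.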

We could consider improving this in the next release. For example, add some logic for the user to choose - either he wants to create a device with a suffix or replace an existing one. Or some other approach.

That would be great!

@aistisdev
Author

aistisdev commented Nov 27, 2024

Hello @AndriiLandiak, we have a recurrence of this issue, but this time we might have a bit more info:

  1. We had a device that was posting telemetry normally for a month up until yesterday:
    image

  2. Yesterday an administrator used the import functionality to update the device attributes in the device panel:
    image

  3. This triggers a rule chain that assigns customers and groups:
    image
    rule chain file:
    create_customers_by_import(1).json
    import file example:
    1device.csv

  4. This did not change the edge group, but on the next time the device posted, a suffixed version appeared in thingsboard and edge, and the original id disappeared from edge:

edge:
image

thingsboard:
image

Additional notes

  • After these suffixed versions appeared, we tried disconnecting the rule chain and creating/updating the device by hand, but the same thing happened. However, we rarely create devices manually, so we did not catch how manual creation behaved initially.
  • Most of the time import works correctly and we did not have any problems. This only started recently after the deployment was active for half a year.

My first thought is that the rule chain may have some bugs, or that its asynchronous behavior somehow causes this mismatch. However, as this is not a regular occurrence, it's hard to understand what is happening. Do you have any ideas how we could debug this?
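One thing we could do on our side to catch this earlier is to periodically compare the device inventories of the parent and the edge. Below is a minimal sketch, assuming we can export a name-to-id mapping from each instance; the export step itself is not shown, and the function name and the ids in the example are hypothetical.

```python
def diff_inventories(parent, edge):
    """Compare two {device_name: device_id} mappings.

    Returns names present on only one side, plus names present on both
    sides whose ids differ (which would suggest the device was recreated
    on one side).
    """
    parent_names, edge_names = set(parent), set(edge)
    shared = parent_names & edge_names
    return {
        "only_on_parent": sorted(parent_names - edge_names),
        "only_on_edge": sorted(edge_names - parent_names),
        "id_mismatch": sorted(n for n in shared if parent[n] != edge[n]),
    }


if __name__ == "__main__":
    # Placeholder ids for illustration only.
    parent = {"07332076": "id-a", "07332076_utQtmcKllgxPXoz": "id-b"}
    edge = {"07332076_utQtmcKllgxPXoz": "id-b"}
    print(diff_inventories(parent, edge))
```

Running this before and after an import could show whether the original device disappears from the edge at import time or only when the next telemetry message arrives.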
