Manager crashes due to dependancy issue on M1-Chip machines #63

Open
the-bay-kay opened this issue Jul 9, 2024 · 28 comments

@the-bay-kay

The Issue

When attempting to run any of the demos on an M1-chip machine, they fail due to a missing manifest within the Docker dependencies. Running the MaEVe-based demos results in the following error:

Full MaEVe Failure
~/Documents/everest-demo user$ curl https://raw.githubusercontent.com/everest/everest-demo/main/demo-ac.sh | bash
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  1111  100  1111    0     0   2053      0 --:--:-- --:--:-- --:--:--  2057
[+] Running 1/3
 ✘ manager Error                                                           2.3s 
 ⠸ node-red Pulling                                                        2.3s 
 ⠸ mqtt-server Pulling                                                     2.3s 
no matching manifest for linux/arm64/v8 in the manifest list entries

And likewise, running CitrineOS results in the following...

[Image: CitrineOS failures]

This issue was originally found within this issue. For those without access to the internal repository, below is a copy of the findings:

Findings

Issues with Apple Silicon

I've run into some issues running on Apple's M1 Chips. Below are the specs:

  • VM: UTM (Running on MacOS 14.4.1), No VM (tested on MacOS hardware)
  • Operating Systems Tested: Ubuntu 23.10, MacOS 14.4.1
  • CPU: Apple M1 Pro
  • 4 GB Ram
  • Docker Version: Docker 26.1.2 (not Docker Desktop)

And, the subsequent error when attempting to spin-up any of the demos:

no matching manifest for linux/arm64/v8 in the manifest entries

From what I understand, this is an issue with one of our docker dependencies. When attempting a hack-y fix described here and composing locally, we get a bit further -- the script fails with the following:

Attaching to manager-1, mqtt-server-1, node-red-1
mqtt-server-1    | exec /docker-entrypoint.sh: exec format error
mqtt-server-1 exited with code 1
manager-1         | exec /bin/sh: exec format error
node-red-1        | exec /usr/local/bin/npm: exec format error
manager-1 exited with code 1
node-red-1 exited with code 1

This is what makes me believe the issue is with our dependencies, not just the platform declaration (as described in the linked thread). Many of the posts I've read have suggested this is an issue with MySQL (link), which doesn't seem relevant. These comments do have a common thread, however, suggesting that one of these dependencies is missing the linux/arm64/v8 manifest.

Looking at EVerest's packages (link), I do see linux/amd64 listed within the OS / Arch tab... perhaps the packages in the .yaml need to be linked differently? I'll keep reading up on it, will add updates on this / the PyTest issue as I find more details!
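To make the "missing manifest" hypothesis checkable, here is a minimal Python sketch that looks for a platform inside a manifest list. It assumes the JSON shape printed by `docker manifest inspect <image>` (a top-level `manifests` array whose entries carry a `platform` object). The sample data and the function name `supports_platform` are hypothetical, and the variant handling is a simplification of Docker's own matching.

```python
import json

# Hypothetical, trimmed manifest list in the shape printed by
# `docker manifest inspect <image>`; digests and sizes are left out.
SAMPLE_MANIFEST_LIST = json.loads("""
{
  "manifests": [
    {"platform": {"architecture": "amd64", "os": "linux"}}
  ]
}
""")


def supports_platform(manifest_list, os_name, architecture, variant=None):
    """Return True if any entry matches the requested os/architecture.

    A requested variant (e.g. "v8") is only compared when the entry
    declares one, which simplifies Docker's real matching rules.
    """
    for entry in manifest_list.get("manifests", []):
        platform = entry.get("platform", {})
        if platform.get("os") != os_name:
            continue
        if platform.get("architecture") != architecture:
            continue
        entry_variant = platform.get("variant")
        if variant and entry_variant and entry_variant != variant:
            continue
        return True
    return False


if __name__ == "__main__":
    print(supports_platform(SAMPLE_MANIFEST_LIST, "linux", "amd64"))
    print(supports_platform(SAMPLE_MANIFEST_LIST, "linux", "arm64", "v8"))
```

Feeding it the real output of `docker manifest inspect` for each image in the compose file would show which of them lacks an arm64 entry.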

Likewise, running a Virtual Machine hosted on an M1-chip machine results in a similar failure (tested on UTM). Complete emulation of Linux machines results in a successful launch of the demos, but performance is hindered to the point that this is not a viable workaround.

Potential Solutions

As suggested within the original thread, I believe that this issue stems from one of the packages missing an ARM64 dependency. I recall during a discussion with the team that this could be traced back to the internal ghcr for everest, though I cannot find a paper trail for those thoughts. Ideally, this fix should be as simple as finding and updating the correct package within either the ghcr or docker manifests. I will continue to investigate further, and report back what I find!

@shankari
Collaborator

shankari commented Jul 9, 2024

You may want to look at: https://docs.docker.com/build/building/multi-platform/
Note also that we use the github image store, and I am not sure whether it supports multi-platform or not.

@fabiolnm

Same error on M2

curl https://raw.githubusercontent.com/everest/everest-demo/main/demo-ac.sh | bash
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  1111  100  1111    0     0   4454      0 --:--:-- --:--:-- --:--:--  4461
WARN[0000] /var/folders/wn/b_tbrcd5003ck5x_lp2rj3kc0000gn/T/tmp.KMTGl53bJs/docker-compose.yml: `version` is obsolete 
[+] Running 2/3
 ⠋ mqtt-server Pulling                                                                                           1.1s 
 ✘ manager Error       context canceled                                                                          1.1s 
 ✘ node-red Error      context canceled                                                                          1.1s 
no matching manifest for linux/arm64/v8 in the manifest list entries

Any plans to support https://docs.docker.com/build/building/multi-platform/ ?

@shankari
Collaborator

Picking this up again to work in parallel, since it will involve a lot of slow builds.
It should be fairly trivial to enable multi-platform builds in general, since we are set up with docker buildx across the board.

HOWEVER, for the case of EVerest, we are building the manager on top of buildkit, which is only available for linux-amd64.

But the buildkit is built automatically from a CI workflow that supports multiple platforms

.github/workflows/deploy-single-docker-image.yml

      platforms:
        description: 'Platforms to build for'
        default: |
          linux/amd64
          linux/arm64
          linux/arm/v7
        type: string

But it is set to only support amd64 later

.github/workflows/deploy-docker-images.yml

  REGISTRY: ghcr.io
  DOCKER_DIRECTORY: docker/images
  PLATFORMS: |
    linux/amd64
  PATH_TO_DEPLOY_SINGLE_DOCKER_IMAGE_WORKFLOW: .github/workflows/deploy-single-docker-image.yml
  PATH_TO_DEPLOY_DOCKER_IMAGES_WORKFLOW: .github/workflows/deploy-docker-images.yml

At this point, probably the easiest option is to clone this repo to US-JOET and expand the set of platforms.
We can then change the manager to build from the US-JOET image. We can also submit a PR to the main repo to add it so that we don't need to maintain a copy any longer.
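To make the gap between the two workflow excerpts concrete, here is a small Python sketch that compares the two platform lists. The values are copied from the excerpts above. The helper names `parse_platforms` and `missing_platforms` are hypothetical, and the real workflows are YAML, which this sketch does not parse.

```python
def parse_platforms(block):
    """Split a newline-separated platform list, dropping blank lines."""
    return [line.strip() for line in block.splitlines() if line.strip()]


# Default of the `platforms` input in deploy-single-docker-image.yml.
SINGLE_IMAGE_DEFAULT = """
linux/amd64
linux/arm64
linux/arm/v7
"""

# PLATFORMS value set in deploy-docker-images.yml.
DEPLOY_IMAGES_PLATFORMS = """
linux/amd64
"""


def missing_platforms(available, configured):
    """Platforms the single-image workflow offers but the caller omits."""
    configured_set = set(configured)
    return [p for p in available if p not in configured_set]


if __name__ == "__main__":
    gap = missing_platforms(
        parse_platforms(SINGLE_IMAGE_DEFAULT),
        parse_platforms(DEPLOY_IMAGES_PLATFORMS),
    )
    print(gap)
```

Expanding the set of platforms in the cloned repo would amount to adding the reported entries to `PLATFORMS`.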

But first, let's try to build the CSMS as multi-platform since that is much simpler and have somebody verify that the resulting images work.

@shankari
Collaborator

  • deprecated baseline images are built (although maybe we should comment them out to save resources)

But the non-deprecated build-kit images are not working on x86

Looks like this is because the package versions that work for amd64 don't work for arm64

#14 [linux/arm64 2/3] RUN apt update     && apt install --no-install-recommends -y     openjdk-17-jre=17.0.13+11-2~deb12u1     nodejs=18.19.0+dfsg-6~deb12u2     npm=9.2.0~ds1-1     python3-pip=23.0.1+dfsg-1     sqlite3=3.40.1-2     libboost-program-options1.74.0=1.74.0+ds1-21     libboost-log1.74.0=1.74.0+ds1-21     libboost-chrono1.74.0=1.74.0+ds1-21     libboost-system1.74.0=1.74.0+ds1-21     libssl3=3.0.14-1~deb12u2     libcurl4=7.88.1-10+deb12u7     libcap2=1:2.66-4     less=590-2.1~deb12u2     python3-pydantic=1.10.4-1     python3-cryptography=38.0.4-3~deb12u1     python3-netifaces=0.11.0-2+b1     python3-psutil=5.9.4-1+b1     python3-dateutil=2.8.2-2     && apt clean     && rm -rf /var/lib/apt/lists/*
#14 0.165 
#14 0.166 WARNING: apt does not have a stable CLI interface. Use with caution in scripts.
#14 0.167 
#14 0.417 Get:1 http://deb.debian.org/debian bookworm InRelease [151 kB]
#14 0.514 Get:2 http://deb.debian.org/debian bookworm-updates InRelease [55.4 kB]
#14 0.515 Get:3 http://deb.debian.org/debian-security bookworm-security InRelease [48.0 kB]
#14 1.541 Get:4 http://deb.debian.org/debian bookworm/main arm64 Packages [8688 kB]
#14 CANCELED
------
 > [linux/amd64 2/3] RUN apt update     && apt install --no-install-recommends -y     openjdk-17-jre=17.0.13+11-2~deb12u1     nodejs=18.19.0+dfsg-6~deb12u2     npm=9.2.0~ds1-1     python3-pip=23.0.1+dfsg-1     sqlite3=3.40.1-2     libboost-program-options1.74.0=1.74.0+ds1-21     libboost-log1.74.0=1.74.0+ds1-21     libboost-chrono1.74.0=1.74.0+ds1-21     libboost-system1.74.0=1.74.0+ds1-21     libssl3=3.0.14-1~deb12u2     libcurl4=7.88.1-10+deb12u7     libcap2=1:2.66-4     less=590-2.1~deb12u2     python3-pydantic=1.10.4-1     python3-cryptography=38.0.4-3~deb12u1     python3-netifaces=0.11.0-2+b1     python3-psutil=5.9.4-1+b1     python3-dateutil=2.8.2-2     && apt clean     && rm -rf /var/lib/apt/lists/*:
2.180 is only available from another source
2.180 However the following packages replace it:
2.180   sqlite3-tools
2.180 
2.181 Package libcurl4 is not available, but is referred to by another package.
2.181 This may mean that the package is missing, has been obsoleted, or
2.181 is only available from another source
2.181 
2.182 E: Version '3.40.1-2' for 'sqlite3' was not found
2.182 E: Version '7.88.1-10+deb12u7' for 'libcurl4' was not found
------

This is actually kind of weird because it looks like the errors are with amd64, which is what the upstream CI is building.
How is it working there?
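When comparing against the upstream CI logs, it may help to pull the failing pins out of the build output mechanically. A minimal Python sketch, assuming the `E: Version '...' for '...' was not found` format shown in the log above; the function name `stale_pins` is hypothetical:

```python
import re

# Matches apt's "version not found" errors as they appear in the log above.
VERSION_NOT_FOUND = re.compile(
    r"E: Version '([^']+)' for '([^']+)' was not found"
)

# Two lines copied from the build output above.
LOG_EXCERPT = """\
2.182 E: Version '3.40.1-2' for 'sqlite3' was not found
2.182 E: Version '7.88.1-10+deb12u7' for 'libcurl4' was not found
"""


def stale_pins(log_text):
    """Map package name -> pinned version that apt could not find."""
    return {pkg: ver for ver, pkg in VERSION_NOT_FOUND.findall(log_text)}


if __name__ == "__main__":
    for package, version in stale_pins(LOG_EXCERPT).items():
        print(f"{package}: pinned {version} not available")
```

Running the same extraction over a working 1.4.2 build log and a failing one would show whether the set of unavailable pins differs between the two.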

@catarial can you take this over for a bit now that I have the github actions piece all worked out and you have an M1 laptop?

  • Check the existing multi-platform images (e.g. if you comment out everything after "waiting to start CSMS" in the demo script, does the CSMS start up?)
  • Figure out the debian versioning on the buildkit and get it to work, including comparing against the upstream CI if needed. I will give you write permission to the branch I have been using to test, although I would suggest that you use docker buildx locally so you don't have to keep pushing to the cloud.

@shankari
Collaborator

As you can see, at the time I pulled from upstream, we are at:

commit 17cc37fdad6e7fc92390294f9153e58e80e767fc (main)
Author: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>
Date:   Wed Oct 30 21:29:12 2024 +0100

    Update ghcr.io/everest/everest-ci/build-env-base Docker tag to v1.4.2 (#61)    
    Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>

And 1.4.2 seems like it built successfully in case you want to compare working versus non-working logs


Although I will note that the PR number there was 56 instead of 61.

@catarial I will let you take it from here

@catarial
Contributor

Check the existing multi-platform images (e.g. if you comment out everything after "waiting to start CSMS" in the demo script, does the CSMS start up?)

CSMS starts up

docker ps
610474aa27c0   ghcr.io/everest/everest-demo/maeve-gateway:0.0.19   "/app serve --ws-add…"   2 minutes ago   Up About a minute (healthy)   0.0.0.0:9312->9312/tcp, :::9312->9312/tcp, 0.0.0.0:80->9310/tcp, [::]:80->9310/tcp, 0.0.0.0:443->9311/tcp, [::]:443->9311/tcp   maeve-csms-gateway-1
c8b965840814   ghcr.io/everest/everest-demo/maeve-manager:0.0.19   "/app serve -c /conf…"   2 minutes ago   Up 2 minutes (healthy)        0.0.0.0:9410-9411->9410-9411/tcp, :::9410-9411->9410-9411/tcp                                                                   maeve-csms-manager-1
1d7f7bf347eb   eclipse-mosquitto:2                                 "/docker-entrypoint.…"   2 minutes ago   Up 2 minutes (healthy)        0.0.0.0:1883->1883/tcp, :::1883->1883/tcp, 0.0.0.0:9000->9000/tcp, :::9000->9000/tcp                                            maeve-csms-mqtt-1
6e1922af957d   google/cloud-sdk                                    "gcloud emulators fi…"   2 minutes ago   Up 2 minutes                  0.0.0.0:8080->8080/tcp, :::8080->8080/tcp                                                                                       maeve-csms-firestore-1

I get a warning about the architecture for firestore

! firestore The requested image's platform (linux/amd64) does not match the detected host platform (linux/arm64/v8) and no specific platform was requested                                0.0s

@shankari
Collaborator

shankari commented Nov 15, 2024

But is the configure with station and token still successful? If you try to send a charging profile to the CSMS (per @the-bay-kay's instructions), does that still work?

! firestore The requested image's platform (linux/amd64) does not match the detected host platform (linux/arm64/v8) and no specific platform was requested 0.0s

Why is it requesting amd64 when you are running on an arm?

Note that I don't rebuild the firestore image since it is coming directly from an image. I only rebuild the "manager" and "gateway" images.
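The warning text suggests that the firestore image is published for linux/amd64 and that no platform was pinned, so Docker fell back to the image's platform, although this thread does not confirm that. As a rough Python sketch of the comparison the warning describes, assuming the usual mapping between machine names and Docker architecture names (the helper names are hypothetical):

```python
import platform

# Common machine names mapped to the architecture names Docker uses.
_DOCKER_ARCH = {
    "x86_64": "amd64",
    "amd64": "amd64",
    "aarch64": "arm64",
    "arm64": "arm64",
}


def docker_arch(machine=None):
    """Translate a machine name (default: this host) to Docker's naming."""
    machine = (machine or platform.machine()).lower()
    return _DOCKER_ARCH.get(machine, machine)


def platform_mismatch(image_platform, host_machine=None):
    """True when the image's architecture differs from the host's.

    `image_platform` is a string such as "linux/amd64" or "linux/arm64/v8";
    the variant suffix is ignored in this simplified check.
    """
    parts = image_platform.split("/")
    image_arch = parts[1] if len(parts) > 1 else parts[0]
    return image_arch != docker_arch(host_machine)


if __name__ == "__main__":
    # An amd64-only image on an Apple Silicon host.
    print(platform_mismatch("linux/amd64", "arm64"))
```

A mismatch like this runs under emulation rather than failing outright, which matches the firestore container being up in the `docker ps` output above.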

@catarial
Contributor

But is the configure with station and token still successful? If you try to send a charging profile to the CSMS (per @the-bay-kay's instructions), does that still work?

This is what I see in the manager logs after sending the profile

time=2024-11-15T17:58:22.702Z level=INFO msg="sending message" action=SetChargingProfile chargeStationId=MaxProfile10A4hr.json
time=2024-11-15T17:58:22.703Z level=INFO msg="POST /api/v0/cs/MaxProfile10A4hr.json/setchargingprofile" remote_addr=192.168.65.1:19963 status=201 bytes=0 duration=2.009083ms
time=2024-11-15T17:58:22.702Z level=INFO msg="cs/out/ocpp2.0.1/# publish" duration=1.140375ms messaging.system=mqtt messaging.message.payload_size_bytes=766 messaging.operation=publish messaging.message.conversation_id=a3e7bd6b-c3d2-4b2d-a121-686293912d9b csId=MaxProfile10A4hr.json call.action=SetChargingProfile
time=2024-11-15T17:58:44.473Z level=INFO msg="checking for pending charge station certificates changes"
time=2024-11-15T17:58:44.473Z level=INFO msg="checking for pending charge station settings changes"
time=2024-11-15T17:58:44.472Z level=INFO msg="sync triggers" duration=44.187209ms sync.trigger.previous="" sync.trigger.count=0

@catarial
Contributor

catarial commented Nov 15, 2024

I tried the same thing on my amd64 laptop but saw different logs

time=2024-11-15T20:09:49.721Z level=ERROR msg="unable to route message" chargeStationId=cp001 action=SetChargingProfile err="routing request: NotImplemented: SetChargingProfile result not implemented"
time=2024-11-15T20:09:49.721Z level=ERROR msg="cs/in/ocpp2.0.1/# receive" duration=57.551µs messaging.system=mqtt messaging.consumer.id=manager-mjeOc messaging.message.payload_size_bytes=477 messaging.operation=receive csId=cp001 ocpp.version=2.0.1 call_result.action=SetChargingProfile messaging.message.conversation_id=bbcd22b5-2383-4f80-ba6e-0826043e015a
time=2024-11-15T20:09:49.721Z level=ERROR msg=exception exception.type=*fmt.wrapError exception.message="routing request: NotImplemented: SetChargingProfile result not implemented"
time=2024-11-15T20:09:51.342Z level=INFO msg="checking for pending charge station settings changes"
time=2024-11-15T20:09:51.342Z level=INFO msg="checking for pending charge station certificates changes"
time=2024-11-15T20:09:49.643Z level=INFO msg="cs/out/ocpp2.0.1/# publish" duration=523.186µs messaging.system=mqtt messaging.message.payload_size_bytes=444 messaging.operation=publish messaging.message.conversation_id=bbcd22b5-2383-4f80-ba6e-0826043e015a csId=cp001 call.action=SetChargingProfile
time=2024-11-15T20:09:51.342Z level=INFO msg="sync triggers" duration=3.780956ms sync.trigger.previous="" sync.trigger.count=0
docker ps
CONTAINER ID   IMAGE                                               COMMAND                  CREATED          STATUS                    PORTS                                                                                                                           NAMES
c20f3d7e8508   ghcr.io/everest/everest-demo/nodered:0.0.19         "./entrypoint.sh"        8 minutes ago    Up 7 minutes (healthy)    0.0.0.0:1880->1880/tcp, :::1880->1880/tcp                                                                                       everest-ac-demo-nodered-1
6deeff923892   ghcr.io/everest/everest-demo/manager:0.0.19         "tail -f /dev/null"      8 minutes ago    Up 8 minutes                                                                                                                                              everest-ac-demo-manager-1
5322ebd79dec   ghcr.io/everest/everest-demo/mqtt-server:0.0.19     "/docker-entrypoint.…"   8 minutes ago    Up 8 minutes              1883/tcp                                                                                                                        everest-ac-demo-mqtt-server-1
7b0a613b6ded   ghcr.io/everest/everest-demo/maeve-gateway:0.0.19   "/app serve --ws-add…"   15 minutes ago   Up 15 minutes (healthy)   0.0.0.0:9312->9312/tcp, :::9312->9312/tcp, 0.0.0.0:80->9310/tcp, [::]:80->9310/tcp, 0.0.0.0:443->9311/tcp, [::]:443->9311/tcp   maeve-csms-gateway-1
d8b3f6ff667b   ghcr.io/everest/everest-demo/maeve-manager:0.0.19   "/app serve -c /conf…"   15 minutes ago   Up 15 minutes (healthy)   0.0.0.0:9410-9411->9410-9411/tcp, :::9410-9411->9410-9411/tcp                                                                   maeve-csms-manager-1
3e3883f55bfc   eclipse-mosquitto:2                                 "/docker-entrypoint.…"   15 minutes ago   Up 15 minutes (healthy)   0.0.0.0:1883->1883/tcp, :::1883->1883/tcp, 0.0.0.0:9000->9000/tcp, :::9000->9000/tcp                                            maeve-csms-mqtt-1
08b927facbee   google/cloud-sdk                                    "gcloud emulators fi…"   5 days ago       Up 15 minutes             0.0.0.0:8080->8080/tcp, :::8080->8080/tcp                                        

I'm not sure if that has to do with build process or something else

@catarial
Contributor

I tried the same thing on my amd64 laptop but saw different logs

Never mind, I just forgot to comment out the part after "waiting for CSMS"

time=2024-11-15T20:20:17.783Z level=INFO msg="checking for pending charge station settings changes"
time=2024-11-15T20:20:17.783Z level=INFO msg="checking for pending charge station certificates changes"
time=2024-11-15T20:20:19.225Z level=INFO msg="sending message" action=SetChargingProfile chargeStationId=cp001
time=2024-11-15T20:20:19.226Z level=INFO msg="POST /api/v0/cs/cp001/setchargingprofile" remote_addr=[2001:db8:a::1]:45122 status=201 bytes=0 duration=1.617263ms
time=2024-11-15T20:20:17.783Z level=INFO msg="sync triggers" duration=346.757422ms sync.trigger.previous="" sync.trigger.count=0
time=2024-11-15T20:20:19.225Z level=INFO msg="cs/out/ocpp2.0.1/# publish" duration=1.211989ms messaging.system=mqtt messaging.message.payload_size_bytes=444 messaging.operation=publish messaging.message.conversation_id=643c3fe9-247a-4e86-8910-e31d59236d64 csId=cp001 call.action=SetChargingProfile

@catarial
Contributor

@shankari

Figure out the debian versioning on the buildkit and get it to work, including comparing against the upstream CI if needed. I will give you write permission to the branch I have been using to test, although I would suggest that you use docker buildx locally so you don't have to keep pushing to the cloud.

Do you still want me to do this? I'm not seeing any crashes.

@shankari
Collaborator

shankari commented Nov 15, 2024

@catarial are you saying that the demo works (including the everest manager) on arm64?
amd64 always worked; arm was the one that did not
are you able to launch a charge session?

@catarial
Contributor

@catarial are you saying that the demo works (including the everest manager) on arm64?
amd64 always worked; arm was the one that did not
are you able to launch a charge session?

@shankari

I'm not sure if it works. Everything starts up, but nothing happens when I click the "Car Plugin" button in the UI. I don't actually know how to use the demo

@catarial
Contributor

It seems to get stuck at PrepareCharging

@catarial
Contributor


It looks like it's working. This is with demo-ac, I had to re-deploy node-red to get it to work

@shankari
Collaborator

@the-bay-kay can you verify this on your arm64? I checked the earlier messages, and it looks like the manager wouldn't even start up so I am kind of paranoid here.

 ✘ manager Error       context canceled                                                                          1.1s 
 ✘ node-red Error      context canceled                                                                          1.1s 
no matching manifest for linux/arm64/v8 in the manifest list entries

And if it does work, you can start using it instead of having to work on your personal laptop!

@shankari
Collaborator

@Abby-Wheelis maybe search for something specific to sockets on amd64 containers running on arm64 machines? Otherwise @catarial can try to reproduce, and then work on the buildkit build errors on Monday

@Abby-Wheelis
Contributor

Investigation into this issue on my MacBook Air 2020 w/ M1 chip:

  • able to start everything up (loads much faster than on my work laptop)
  • plugin, swipe RFID, eventually hits the error here and then goes into a pause state
  • Commented out the line that was the source of the error -- eventually got to charging! Went through pause first, but got there.
  • Ok, so we know that is the source of the error on this machine ...
  • Put the line back in just to make sure that it would never get to charging and I wasn't being impatient earlier ... Yeah no it is stuck after the error.

@shankari
Collaborator

what is the line that is the source of the error? if it is not important, maybe we should create a new patch that comments it out throughout?

@Abby-Wheelis
Contributor

This is the line (from the stacktrace) that causes the error:

  File "/ext/dist/libexec/everest/modules/PyEvJosev/../../3rd_party/josev/iso15118/evcc/transport/udp_client.py", line 68, in _create_socket
    sock.setsockopt(socket.IPPROTO_IPV6, socket.IPV6_MULTICAST_IF, interface_index)

When I commented that line out, I was able to start charging, but it did go through "car paused" for a while. There are other calls of setsockopt, so it is not the function itself. Maybe one of the options isn't compatible?

socket.IPPROTO_IPV6 is also used in a previous call, so that's not the issue. Searching next for socket.IPV6_MULTICAST_IF and adding a print statement to shed light on interface_index

@Abby-Wheelis
Contributor

The interface_index is numerical. So we're using this implementation - socket.setsockopt(level, optname, value: int) which seems fine ...

Checking into socket.IPV6_MULTICAST_IF - found that it requires an integer interface index, which we've got. Not immediately seeing other people having issues with it ...
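As an alternative to removing the call entirely, one option would be to make the failure non-fatal. This is only a sketch in Python, not the josev implementation: the wrapper name `try_set_multicast_interface` is hypothetical, and it assumes the failure surfaces as an `OSError`, which is what `setsockopt` raises when the OS rejects an option.

```python
import logging
import socket

logger = logging.getLogger(__name__)


def try_set_multicast_interface(sock, interface_index):
    """Attempt to bind outgoing IPv6 multicast to an interface.

    Returns True on success and False if the OS rejects the option,
    instead of letting the OSError propagate to the caller.
    """
    try:
        sock.setsockopt(
            socket.IPPROTO_IPV6, socket.IPV6_MULTICAST_IF, interface_index
        )
        return True
    except OSError as exc:
        logger.warning(
            "IPV6_MULTICAST_IF failed for interface index %s: %s",
            interface_index,
            exc,
        )
        return False
```

The logged errno would also help narrow down why the option is rejected inside the container on an arm64 host.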

shankari removed this from the CharIN demo prep milestone Nov 18, 2024
@shankari
Collaborator

Moving this off the milestone since we added it at the last minute. Would still be great to get it to work, but running out of time here.

@catarial
Contributor

It does look like Apple Silicon and Intel Macs have different implementations for virtualization.

https://developer.apple.com/documentation/hypervisor

@catarial
Contributor

I'm going to experiment with trying some alternatives to docker

@shankari
Collaborator

shankari commented Nov 18, 2024

@catarial it looks like this worked for Abby after commenting out a single line - is that true for you too? If so, could we just patch that line out (or remove MULTICAST or something) for now?

Docker on Mac runs inside a Linux virtual machine anyway.

Longer-term, I think we should use multiplatform images (see above).

@catarial
Contributor

Yes, it works after I comment the line out.

@shankari
Collaborator

shankari commented Nov 18, 2024

@catarial given that this is for a Python file, can you add this to the list of runtime patches (https://github.com/EVerest/everest-demo/blob/main/manager/demo-patch-scripts/apply-runtime-patches.sh)? We can later check in with the community and see if/why it is needed
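For illustration only, this is roughly what such a patch does to the file, sketched in Python. The actual runtime patches live in a shell script whose contents are not shown in this thread, so the function `comment_out` and the `SAMPLE` text are hypothetical stand-ins. Note that commenting out the only statement in a block would leave invalid Python, which this sketch does not guard against.

```python
def comment_out(source, needle):
    """Comment out every line containing `needle`, keeping indentation.

    Returns the patched text and the number of lines changed. Lines that
    are already comments are left alone so the patch can be re-applied.
    """
    patched = []
    changed = 0
    for line in source.splitlines(keepends=True):
        stripped = line.lstrip()
        if needle in stripped and not stripped.startswith("#"):
            indent = line[: len(line) - len(stripped)]
            patched.append(f"{indent}# {stripped}")
            changed += 1
        else:
            patched.append(line)
    return "".join(patched), changed


# Hypothetical stand-in for the relevant part of udp_client.py.
SAMPLE = (
    "def _create_socket(sock, interface_index):\n"
    "    sock.setsockopt(socket.IPPROTO_IPV6, socket.IPV6_MULTICAST_IF, interface_index)\n"
    "    return sock\n"
)

if __name__ == "__main__":
    text, count = comment_out(SAMPLE, "socket.IPV6_MULTICAST_IF")
    print(count)
    print(text)
```

Because already-commented lines are skipped, applying the patch twice leaves the file unchanged the second time.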

Please also try out a few of the scenarios in #84 and verify that all of them work

@the-bay-kay can you also try with the line commented out on your work M1? I anticipate that many at CharIN will have recent Macs

@catarial
Contributor

Already working on it
