FAIL TO RUN MNIST-PYT EXAMPLE #244

Open
guojuntang opened this issue Jun 9, 2024 · 2 comments

@guojuntang

Issue description

  • issue description: the mnist-pyt example fails to run
  • occurrence - consistent or rare: always
  • error messages:
    When I ran SWCI and assigned a task to build the user Docker image, the task failed with the error "Failed to copy build context".
SWCI:7 > ASSIGN TASK build_test1 TO defaulttaskbb.taskdb.sml.hpe WITH 1 PEERS
    Task assigned to TaskRunner
SWCI:8 > WAIT FOR TASKRUNNER defaulttaskbb.taskdb.sml.hpe
    WAITING FOR TASKRUNNER TO COMPLETE - Maximum wait time is : 120 mins
###     
    TASKRUNNER FINISHED
      STATE : ERROR
      TIME  : 2024-06-09 06:47:53
SWCI:8 > ERROR : Task has failed, check TASKRUNNER PEER STATUS for Error description
SWCI:9 > GET TASKRUNNER PEER STATUS defaulttaskbb.taskdb.sml.hpe 0
    NAME                      : demo
    SWOP_UID                  : e57f35f3-a1ad-4aec-a704-066363f55808
    OPERATION_ID              : 9575480544290973496
    PEER_COUNT                : 1
    UPDATE_TS                 : 2024-06-09 06:47:53
    SWOP_PEER_INDEX           : 0
    SWOP_PEER_STATUS          : ERROR
    SWOP_PEER_STATUS_DESC     : Failed to copy build context
SWCI:10 > EXIT

The task definition (taskdef) is:

######################################################################
# (C)Copyright 2021-2023 Hewlett Packard Enterprise Development LP
######################################################################
Name: build_test1
TaskType: MAKE_USER_CONTAINER
Author: HPESwarm
Prereq: ROOTTASK
Outcome: user-image-pyt1.5
Body:
    BuildContext: sl-cli-lib
    BuildType: INLINE
    BuildSteps:
    - FROM pytorch/pytorch
    - ' '
    - RUN apt-get update && apt-get install           \
    - '   build-essential python3-dev python3-pip     \'
    - '   python3-setuptools --no-install-recommends -y'
    - ' '
    - RUN conda install pip
    - ' '
    - RUN pip3 install --upgrade pip protobuf==3.15.6 && pip3 install \
    - '   torchvision matplotlib opencv-python pandas torchmetrics'
    - ' '
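
To rule out a problem in the build steps themselves, they can be tried outside SWOP. The sketch below assumes that an INLINE build treats each BuildSteps entry as one Dockerfile line; `render_dockerfile` and the hard-coded list are illustrative and not part of Swarm Learning.

```python
# Hypothetical helper: render the taskdef's BuildSteps as Dockerfile text.
# Assumption: each BuildSteps entry corresponds to one Dockerfile line.

BUILD_STEPS = [
    "FROM pytorch/pytorch",
    " ",
    "RUN apt-get update && apt-get install           \\",
    "   build-essential python3-dev python3-pip     \\",
    "   python3-setuptools --no-install-recommends -y",
    " ",
    "RUN conda install pip",
    " ",
    "RUN pip3 install --upgrade pip protobuf==3.15.6 && pip3 install \\",
    "   torchvision matplotlib opencv-python pandas torchmetrics",
    " ",
]


def render_dockerfile(steps):
    """Join build steps into Dockerfile text, trimming trailing whitespace."""
    return "\n".join(step.rstrip() for step in steps) + "\n"


if __name__ == "__main__":
    print(render_dockerfile(BUILD_STEPS), end="")
```

Saving the output as a Dockerfile and building it by hand with `docker build` would show whether the steps build on this host independently of SWOP.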

  • commands used for starting containers:
  • docker logs [APLS, SPIRE, SN, SL, SWCI]:

Logs from SWOP:

2024-06-09 06:21:48,907 : swarm.swop : INFO : SWOPBuildTask: Validating profile
2024-06-09 06:21:55,512 : swarm.swop : INFO : Extracted container id and image info from /tmp/container_info_file file
2024-06-09 06:21:55,516 : swarm.swop : INFO : SWOPBuildTask: Temp Container Creation Failed
2024-06-09 06:21:55,516 : swarm.swop : INFO : 400 Client Error for http+docker://localhost/v1.40/images/create?tag=c61eb2ae2da3e3745e64da0f2799aeee134e87a9084acf5a8884451802051972&fromImage=sha256: Bad Request ("failed to resolve image name: short-name "sha256:c61eb2ae2da3e3745e64da0f2799aeee134e87a9084acf5a8884451802051972" did not resolve to an alias and no unqualified-search registries are defined in "/etc/containers/registries.conf"")
2024-06-09 06:21:55,516 : swarm.swop : INFO : SWOPBuildTask: Temp Container Creation Failed
2024-06-09 06:21:55,516 : swarm.swop : WARNING : SWOPBuildTask: Failed to copy build context
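
The request URL in the 400 error is worth decoding. The sketch below uses only the Python standard library to split the query string from that log line. It suggests the image ID was sent as a repository named `sha256` with the digest as its tag, which may be why it was treated as an unqualified short name. This is a reading of the log, not something confirmed from the SWOP source, and `describe_request` is an illustrative helper.

```python
from urllib.parse import urlsplit, parse_qs

# Request URL copied from the SWOP log line above.
URL = (
    "http+docker://localhost/v1.40/images/create"
    "?tag=c61eb2ae2da3e3745e64da0f2799aeee134e87a9084acf5a8884451802051972"
    "&fromImage=sha256"
)


def describe_request(url):
    """Return the API path and query parameters of a Docker API request URL."""
    parts = urlsplit(url)
    # parse_qs returns a list per key; each key appears once here.
    params = {key: values[0] for key, values in parse_qs(parts.query).items()}
    return parts.path, params


if __name__ == "__main__":
    path, params = describe_request(URL)
    print(path)
    print(params)
```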

I tried to fix this myself by editing "/etc/containers/registries.conf" as follows:

unqualified-search-registries = ['registry.fedoraproject.org', 'registry.access.redhat.com', 'registry.centos.org', 'docker.io']
[[registry]]

location = "docker.io"

The user Docker image still failed to build after this change, now with a different error:

2024-06-09 06:47:44,631 : swarm.swop : INFO : SWOPBuildTask: Validating profile
2024-06-09 06:47:51,232 : swarm.swop : INFO : Extracted container id and image info from /tmp/container_info_file file
2024-06-09 06:47:51,236 : swarm.swop : INFO : SWOPBuildTask: Temp Container Creation Failed
2024-06-09 06:47:51,236 : swarm.swop : INFO : 404 Client Error for http+docker://localhost/v1.40/images/sha256:c61eb2ae2da3e3745e64da0f2799aeee134e87a9084acf5a8884451802051972/json: Not Found ("failed to find image sha256:c61eb2ae2da3e3745e64da0f2799aeee134e87a9084acf5a8884451802051972: sha256:c61eb2ae2da3e3745e64da0f2799aeee134e87a9084acf5a8884451802051972: No such image")
2024-06-09 06:47:51,236 : swarm.swop : INFO : SWOPBuildTask: Temp Container Creation Failed
2024-06-09 06:47:51,236 : swarm.swop : WARNING : SWOPBuildTask: Failed to copy build context
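
The 404 above says that the engine behind the socket does not know the image ID SWOP asked for. One way to check this from the host is to ask the CLI directly. This is a minimal sketch: `image_exists` is a hypothetical helper, and its `run` parameter exists only so the check can be exercised without a container engine installed.

```python
import shutil
import subprocess

# Image ID copied from the SWOP log above.
IMAGE_ID = "sha256:c61eb2ae2da3e3745e64da0f2799aeee134e87a9084acf5a8884451802051972"


def image_exists(image_id, run=subprocess.run):
    """Return True if `docker image inspect` exits successfully for image_id."""
    result = run(
        ["docker", "image", "inspect", image_id],
        capture_output=True,
        text=True,
    )
    return result.returncode == 0


if __name__ == "__main__":
    if shutil.which("docker"):
        print(image_exists(IMAGE_ID))
    else:
        print("docker CLI not found on PATH")
```

If this reports that the image is missing on the host as well, the ID read from /tmp/container_info_file does not match anything in the local image store.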

Swarm Learning Version:

  • Find the docker tag of the Swarm images ( $ docker images | grep hub.myenterpriselicense.hpe.com/hpe_eval/swarm-learning )
    2.2.0

OS and ML Platform

  • details of host OS: Ubuntu 22.04
  • details of ML platform used: PyTorch
  • details of Swarm learning Cluster (Number of machines, SL nodes, SN nodes): just the example from mnist-pyt, 1 machine, 1 SL node, 1 SN node

Quick Checklist: Respond [Yes/No]

  • APLS server web GUI shows available Licenses? YES
  • If Multiple systems are used, can each system access every other system? Not applicable, only a single system is used
  • Is Password-less SSH configuration setup for all the systems? No
  • If GPU or other protected resources are used, does the account have sufficient privileges to access and use them? No
  • Is the user id a member of the docker group? Yes

Additional notes

  • Are you running documented example without any modification?
    I use ngrok to map the local APLS to a different domain and port, but I don't think that matters.
  • Add any additional information about use case or any notes which supports for issue investigation:

NOTE: Create an archive with supporting artifacts and attach to issue, whenever applicable.

Maybe this is an issue with podman-docker?
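
The logs do point that way: "/etc/containers/registries.conf" and short-name resolution belong to the containers tooling used by Podman rather than to the Docker daemon. A rough sketch for scanning a log for such hints follows. The marker strings are taken from the SWOP log in this issue, and `podman_hints` is an illustrative helper.

```python
# Strings that appear in the SWOP log in this issue and that come from
# Podman-side image handling rather than from the Docker daemon.
PODMAN_MARKERS = (
    "/etc/containers/registries.conf",
    "unqualified-search registries",
    "short-name",
)


def podman_hints(log_text):
    """Return the Podman-related markers found in log_text, in marker order."""
    return [marker for marker in PODMAN_MARKERS if marker in log_text]


if __name__ == "__main__":
    sample = (
        'failed to resolve image name: short-name '
        '"sha256:c61eb2ae2da3e3745e64da0f2799aeee134e87a9084acf5a8884451802051972" '
        "did not resolve to an alias and no unqualified-search registries are "
        'defined in "/etc/containers/registries.conf"'
    )
    print(podman_hints(sample))
```

Running `docker --version` on the host is another quick check, since the podman-docker wrapper is expected to report a Podman version there.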

@pkbindhani
Collaborator

Can you collect the "SwarmLogCollector" output for review?

@guojuntang
Author

Can you collect the "SwarmLogCollector" output for review?

Hi, the SwarmLogCollector output is here:
swarm_logs_Kael_20240627100305.tar.gz
