Failed to download artifact : getter subprocess failed: exit status 1 #18189

Closed · spanner4715 opened this issue Aug 11, 2023 · 12 comments

@spanner4715

Nomad version

1.6.1

Operating system and Environment details

ubuntu-22.04

Issue

Couldn't download the artifact "Hadoop" from "archive.apache.org". I also noticed that the template files I wrote (core-site.xml, hdfs-site.xml, ...) couldn't be written into the allocation.

Expected Result

The artifact is downloaded completely.

Actual Result

(Screenshots attached showing the client error output: 2023-08-11 16-54-12, 17-57-10, and 18-12-19.)

Job file (if appropriate)

variables {
  hadoop_version="3.2.1"
  node1="nomad37"
  node2="nomad78"
  node3="nomad79"

  //core-site.xml
  hdfs_cluster_name = "dmpcluster"
  dfs_permission= "false"
  hadoop_tmp_dir= "local/data/hadoop/tmp/hadoop"
  journal_edit_dir= "local/data/journal/tmp/journal"

  //yarn-site.xml
  ha_status= "true"
  yarn_cluster_name= "dmpcluster"
  //yarn-site spec1
  yarn_scheduler_mem= "47104"
  yarn_scheduler_cpu= "24"
  yarn_node_mem= "47104"
  yarn_node_cpu= "24"
  pmem_check= "false"
  vmem_check= "false"
}

job "hadoop-test" {
    datacenters = ["dc1"]
    type = "service"

   
    group "hadoop-test" {
        count = 1

        restart {
            attempts = 3
            interval = "3m"
            delay = "10s"
            mode = "fail"
        }

        affinity {
            attribute  = "${node.unique.name}"
            value     = "nomad37"
            weight    = 70
        }

        task "hd1" {
            driver = "exec"

            artifact {
        source = "https://archive.apache.org/dist/hadoop/common/hadoop-${var.hadoop_version}/hadoop-${var.hadoop_version}.tar.gz"
        destination = "local/hadoop"
      }

      template {
                destination = "local/hadoop/hadoop-${var.hadoop_version}/etc/hadoop/core-site.xml"
                change_mode     = "noop"
                data = <<EOF
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>

  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://${var.node1}:9000</value>
  </property>

  <property>
        <name>dfs.permissions</name>
        <value>${var.dfs_permission}</value>
    </property>

  <property>
       <name>hadoop.tmp.dir</name>
       <value>${var.hadoop_tmp_dir}</value>
    </property>
    <property>
       <name>dfs.journalnode.edits.dir</name>
       <value>${var.journal_edit_dir}</value>
    </property>

  

</configuration>


EOF
      }

      template {
                destination = "local/hadoop/hadoop-${var.hadoop_version}/etc/hadoop/yarn-site.xml"
                change_mode     = "noop"
                data = <<EOF

<configuration>

  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.resourcemanager.ha.enabled</name>
    <value>${var.ha_status}</value>
  </property>
  <property>
    <name>yarn.resourcemanager.cluster-id</name>
    <value>${var.yarn_cluster_name}</value>
  </property>
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>${var.node1}</value>
  </property>
  <property>
    <name>yarn.resourcemanager.webapp.address</name>
    <value>${var.node1}:8088</value>
  </property>
  <property>
      <name>yarn.scheduler.maximum-allocation-mb</name>
      <value>${var.yarn_scheduler_mem}</value>
  </property>
  <property>
      <name>yarn.scheduler.maximum-allocation-vcores</name>
      <value>${var.yarn_scheduler_cpu}</value>
  </property>
  <property>
      <name>yarn.nodemanager.resource.memory-mb</name>
      <value>${var.yarn_node_mem}</value>
  </property>
  <property>
      <name>yarn.nodemanager.resource.cpu-vcores</name>
      <value>${var.yarn_node_cpu}</value>
  </property>
  <property>
      <name>yarn.nodemanager.pmem-check-enabled</name>
      <value>${var.pmem_check}</value>
  </property>
  <property>
      <name>yarn.nodemanager.vmem-check-enabled</name>
      <value>${var.vmem_check}</value>
  </property>

</configuration>


EOF
      }

      template {
                destination = "local/hadoop/hadoop-${var.hadoop_version}/etc/hadoop/hdfs-site.xml"
                change_mode     = "noop"
                data = <<EOF

<configuration>

  <property>
    <name>dfs.namenode.http-address</name>
    <value>${var.node1}:9000</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>alloc/usr/local/hadoop/dfs/name</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>alloc/usr/local/hadoop/dfs/data</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>

</configuration>


EOF
      }
            template {
                destination = "local/hadoop/hadoop-${var.hadoop_version}/etc/hadoop/mapred-site.xml"
                change_mode     = "noop"
                data = <<EOF

<configuration>

  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>


</configuration>


EOF
      }

            config {
                command = "local/hadoop/hadoop-${var.hadoop_version}/bin/hdfs"
                args=[
                   "namenode",
                  "-format"
                ]          
            }

            resources {
                cpu = 600
                memory = 4096

        }
    }
}

}
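
The failed-download task events behind the screenshots above can also be viewed from the CLI (a sketch; the job file name and the alloc ID are placeholders):

nomad job run hadoop-test.nomad.hcl
nomad job status hadoop-test
nomad alloc status <alloc-id>   # "Recent Events" includes the Failed Artifact Download message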

Nomad Server logs (if appropriate)

Nomad Client logs (if appropriate)

@lgfa29 (Contributor) commented Aug 16, 2023

Thanks for the report @spanner4715.

Could you check if the Nomad client agent is running as root?

And do you see any relevant log lines in the client agent logs?
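
A quick way to check the first question (a minimal sketch; it assumes the agent binary is named nomad and procps ps is available):

# Show the user that owns the running Nomad agent process
ps -o user=,pid=,cmd= -C nomad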

@spanner4715 (Author)

Hello @lgfa29,
Many thanks for your reply. Following your suggestion solved this issue.
I had been following this explanation and running the command as sudo nomad agent.
(Screenshot attached: 2023-08-17 10-55-47.)

However, that still produced the same error as in the title. Only after switching to a root shell with sudo su and then running the nomad agent command did the task finally run successfully. I'm not sure whether the documentation needs to be corrected, but running the client agent as root really does solve this issue. Thank you so much.
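
In other words, the sequence that worked here was roughly the following (a sketch; the -config path is only an illustration and will differ per install):

# Switch to a root shell first, then start the client agent
sudo su -
nomad agent -config /etc/nomad.d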

@shoenig (Member) commented Aug 18, 2023

@spanner4715 please post client logs, they should describe what went wrong.

@spanner4715 (Author)

Hello @shoenig, thanks for your reply.
The client logs show the same output as what I posted in "Actual Result".

@shoenig (Member) commented Aug 21, 2023

There should be an additional line in the client logs, one that contains the string OUTPUT, which carries the actual output from the artifact downloader sub-process.

Here's an example from a while back, from a similar bug report. Notice the second line contains the actual error.

Mar 14 17:49:55 nomad-client-05 nomad[30661]:     2023-03-14T17:49:55.032Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=c4d60ac3-5820-6b88-9a37-d47aa81baade task=init type="Downloading Artifacts" msg="Client is downloading artifacts" failed=false
Mar 14 17:49:55 nomad-client-05 nomad[30661]:     2023-03-14T17:49:55.052Z [ERROR] client.artifact: sub-process: OUTPUT="failed to download artifact: error downloading 'ssh://[email protected]/org/repo?sshkey=redacted': open /var/nomad/alloc/c4d60ac3-5820-6b88-9a37-d47aa81baade/init/tmp/go-getter1659107024: permission denied"
Mar 14 17:49:55 nomad-client-05 nomad[30661]:     2023-03-14T17:49:55.052Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=c4d60ac3-5820-6b88-9a37-d47aa81baade task=init type="Failed Artifact Download" msg="failed to download artifact \"git::[email protected]:org/repo\": getter subprocess failed: exit status 1" failed=false
Mar 14 17:49:55 nomad-client-05 nomad[30661]:     2023-03-14T17:49:55.054Z [ERROR] client.alloc_runner.task_runner: prestart failed: alloc_id=c4d60ac3-5820-6b88-9a37-d47aa81baade task=init error="prestart hook \"artifacts\" failed: failed to download artifact \"git::[email protected]:org/repo\": getter subprocess failed: exit status 1"
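
If the agent runs under systemd, those OUTPUT lines can be pulled out of the journal with something like this (a sketch; the nomad.service unit name is an assumption and may differ per install):

# Filter the client agent's journal for artifact sub-process output
journalctl -u nomad.service --since "1 hour ago" | grep -E 'client.artifact|OUTPUT'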

@kmott commented Dec 5, 2023

Hi @shoenig, I am able to consistently reproduce this on a recent Debian 12 + Nomad v1.6.3 + Docker v24.0.7 setup.

In my case, I have an artifact download in my job, and it errors out with this:

failed to download artifact "s3::https://s3-us-west-2.amazonaws.com/my-custom-bucket/nomad/kitchen/fabio":
getter subprocess failed: exit status 1: failed to download artifact: RequestError: send request failed caused by:
Get "https://s3-us-west-2.amazonaws.com/my-custom-bucket?prefix=nomad%2Fkitchen%2Ffabio": dial tcp: lookup s3-us-west-2.amazonaws.com: device or resource busy

Followed by this in the client logs:

Dec 05 14:19:55 nomad-n1-debian12 hab[1712]: nomad.default(O):     2023-12-05T14:19:55.528-0800 [DEBUG] client.alloc_runner.task_runner: lifecycle start condition has been met, proceeding: alloc_id=812401cb-ee61-6af0-addc-470e316c0eae task=fabio-loadbalancer
Dec 05 14:19:55 nomad-n1-debian12 hab[1712]: nomad.default(O):     2023-12-05T14:19:55.529-0800 [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=812401cb-ee61-6af0-addc-470e316c0eae task=fabio-loadbalancer type="Downloading Artifacts" msg="Client is downloading artifacts" failed=false
Dec 05 14:19:55 nomad-n1-debian12 hab[1712]: nomad.default(O):     2023-12-05T14:19:55.558-0800 [DEBUG] client.alloc_runner.task_runner.task_hook.artifacts: downloading artifact: alloc_id=812401cb-ee61-6af0-addc-470e316c0eae task=fabio-loadbalancer artifact=s3::https://s3-us-west-2.amazonaws.com/my-custom-bucket/nomad/kitchen/fabio aid=ZF1AIyMkPAlNX2Ba0lb8RJORjILqChozPsaftaXYIFM
Dec 05 14:19:55 nomad-n1-debian12 hab[1712]: nomad.default(O):     2023-12-05T14:19:55.558-0800 [DEBUG] client.artifact: get: source=s3::https://s3-us-west-2.amazonaws.com/my-custom-bucket/nomad/kitchen/fabio destination=local/etc/fabio
Dec 05 14:19:55 nomad-n1-debian12 hab[1712]: nomad.default(O):     2023-12-05T14:19:55.978-0800 [ERROR] client.artifact: sub-process: OUTPUT="failed to download artifact: RequestError: send request failed"
Dec 05 14:19:55 nomad-n1-debian12 hab[1712]: nomad.default(O):     2023-12-05T14:19:55.978-0800 [ERROR] client.artifact: sub-process: OUTPUT="caused by: Get \"https://s3-us-west-2.amazonaws.com/my-custom-bucket?prefix=nomad%2Fkitchen%2Ffabio\": dial tcp: lookup s3-us-west-2.amazonaws.com: device or resource busy"
Dec 05 14:19:55 nomad-n1-debian12 hab[1712]: nomad.default(O):     2023-12-05T14:19:55.978-0800 [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=812401cb-ee61-6af0-addc-470e316c0eae task=fabio-loadbalancer type="Failed Artifact Download"
Dec 05 14:19:55 nomad-n1-debian12 hab[1712]: nomad.default(O):   msg=
Dec 05 14:19:55 nomad-n1-debian12 hab[1712]: nomad.default(O):   | failed to download artifact "s3::https://s3-us-west-2.amazonaws.com/my-custom-bucket/nomad/kitchen/fabio": getter subprocess failed: exit status 1: failed to download artifact: RequestError: send request failed
Dec 05 14:19:55 nomad-n1-debian12 hab[1712]: nomad.default(O):   | caused by: Get "https://s3-us-west-2.amazonaws.com/my-custom-bucket?prefix=nomad%2Fkitchen%2Ffabio": dial tcp: lookup s3-us-west-2.amazonaws.com: device or resource busy

Note that my nomad clients are running as root:

root@nomad-n1-debian12:~# ps aux | grep 'USER\|nomad agent' | grep -v grep
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root        3540  3.9  2.1 2118628 133132 ?      Sl   14:08   0:33 nomad agent -bind {{ GetInterfaceIP "enp0s8" }} -config /etc/nomad/config

Happy to provide more logs or anything else if it helps. Thank you.
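
To rule out a host-level DNS problem for the lookup error above, the same name can be resolved outside Nomad (a sketch; resolvectl applies only if systemd-resolved is in use):

# Resolve the bucket endpoint through the host's normal resolver path
getent hosts s3-us-west-2.amazonaws.com
resolvectl query s3-us-west-2.amazonaws.com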

@shoenig (Member) commented Dec 6, 2023

@kmott can you show more of your job file and describe the system versions? The docker driver shouldn't have anything to do with it; artifact downloading happens before the task is started.

I just tried a simple repro job in a Debian VM and it worked fine:

root@localhost:~# uname -a
Linux localhost 6.1.0-13-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.55-1 (2023-09-29) x86_64 GNU/Linux
root@localhost:~# ./nomad version 
Nomad v1.6.3
BuildDate 2023-10-30T12:58:10Z
Revision e0497bff14378d68cad76a801cc0eba93ce05039
root@localhost:~# ./nomad job run bug.hcl 
...
root@localhost:~# ./nomad alloc logs 13
1.21.5
job "bug" {
  type = "batch"

  group "group" {
    task "task" {
      artifact {
        source = "https://raw.githubusercontent.com/hashicorp/nomad/main/.go-version"
      }
      driver = "raw_exec"
      config {
        command = "cat"
        args    = ["local/.go-version"]
      }
      resources {
        cpu    = 16
        memory = 32
      }
    }
  }
}

@kmott commented Dec 6, 2023

Sure @shoenig, here's my Nomad job (edited to include your example artifact pulling from raw.githubusercontent.com):

job "fabio" {
  datacenters = ["kitchen"]
  type = "system"

  update {
    max_parallel = 1
    min_healthy_time = "10s"
    healthy_deadline = "3m"
    progress_deadline = "10m"
    auto_revert = false
    canary = 0
  }

  group "loadbalancer" {
    count = 1

    restart {
      attempts = 20
      interval = "3m"
      delay = "5s"
      mode = "delay"
    }

    ephemeral_disk {
      size = 128
    }

    network {
      mode = "bridge"
      port "admin" {
        static = 9168
        to = 9168
      }

      port "frontend_http" {
        static = 80
        to = 8080
      }

      port "frontend_https" {
        static = 443
        to = 8443
      }
    }

    task "fabio-loadbalancer" {
      driver = "docker"

      config {
        image = "fabiolb/fabio:1.5.15-go1.15.5"

        args = [
          "-cfg", "/local/etc/fabio/fabio.properties",
          "-registry.consul.addr", "< ... >",
          "-insecure"
        ]

        ports = ["admin","frontend_http","frontend_https"]
      }

      artifact {
        source = "https://raw.githubusercontent.com/hashicorp/nomad/main/.go-version"
      }

#      artifact {
#        source = "s3::https://s3-us-west-2.amazonaws.com/my-custom-bucket/nomad/kitchen/fabio"
#        destination = "local/etc/fabio"
#
#        options {
#          aws_access_key_id     = "<...>"
#          aws_access_key_secret = "<...>"
#        }
#      }

      resources {
        cpu    = 500 # 500 MHz
        memory = 256 # 256MB
      }

      service {
        name = "fabio"
        tags = [
          "loadbalancer",
          "admin",
        ]

        port = "admin"

        check {
          type     = "tcp"
          port     = "admin"
          interval = "10s"
          timeout  = "2s"
        }
      }

      service {
        name = "fabio-frontend-http"
        tags = ["loadbalancer", "frontend", "http"]

        port = "frontend_http"

        check {
          type     = "tcp"
          port     = "frontend_http"
          interval = "10s"
          timeout  = "2s"
        }
      }

      service {
        name = "fabio-frontend-https"
        tags = ["loadbalancer", "frontend", "https"]

        port = "frontend_https"

        check {
          type     = "tcp"
          port     = "frontend_https"
          interval = "10s"
          timeout  = "2s"
        }
      }
    }
  }
}

Here are some logs (attached for reference) from Nomad on one node: nomad.log

And here is the system and Nomad version info:

root@nomad-n1-debian12:~# nomad version
Nomad v1.6.3
BuildDate 2023-10-30T12:58:10Z
Revision e0497bff14378d68cad76a801cc0eba93ce05039

root@nomad-n1-debian12:~# uname -a
Linux nomad-n1-debian12 6.1.0-13-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.55-1 (2023-09-29) x86_64 GNU/Linux

root@nomad-n1-debian12:~# cat /etc/debian_version 
12.2

FWIW, I also tried running nomad agent ... directly from a root shell, and it still exhibited the same problem.

Also, I am using CNI plugins v1.3.0; not sure if that matters.

@shoenig (Member) commented Dec 7, 2023

Thanks @kmott. Just so I can keep trying to reproduce your environment exactly, how did you install Docker? And can you show the output of docker version?

@kmott commented Dec 7, 2023

Thank you for your patience @shoenig, I will work on getting a reliable repro using Vagrant. If I am not able to come up with something by early next week, I'll let you know and we can probably close this out. More info soon (hopefully!).

@kmott commented Dec 7, 2023

After much digging, this turned out to be caused by an older version (2.29) of glibc linked in with the nomad binary (vs. 2.34, which works fine). This can be closed; thank you for your time @shoenig!
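
For anyone comparing versions, the glibc the host provides and the glibc symbol versions the binary references can be checked like this (a sketch; it assumes a dynamically linked nomad binary installed at /usr/local/bin/nomad):

# Host glibc version
ldd --version

# Highest GLIBC symbol versions referenced by the nomad binary
objdump -T /usr/local/bin/nomad | grep -oE 'GLIBC_[0-9.]+' | sort -Vu | tail -n 5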

@tgross closed this as not planned (won't fix, can't repro, duplicate, stale) on Jun 24, 2024

@github-actions (bot)

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions bot locked as resolved and limited conversation to collaborators on Dec 27, 2024