
Reduce task runtime error: Too many fetch-failures

 

# bin/hadoop jar hadoop-*-examples.jar wordcount /test1 /test2
11/11/22 20:42:33 INFO input.FileInputFormat: Total input paths to process : 14
11/11/22 20:42:33 INFO mapred.JobClient: Running job: job_201111222034_0001
11/11/22 20:42:34 INFO mapred.JobClient: map 0% reduce 0%
11/11/22 20:45:07 INFO mapred.JobClient: map 14% reduce 0%
11/11/22 20:45:43 INFO mapred.JobClient: map 14% reduce 4%
11/11/22 20:45:54 INFO mapred.JobClient: map 28% reduce 4%
11/11/22 20:46:43 INFO mapred.JobClient: map 57% reduce 4%
11/11/22 20:46:52 INFO mapred.JobClient: map 85% reduce 4%
11/11/22 20:46:55 INFO mapred.JobClient: map 92% reduce 4%
11/11/22 20:46:58 INFO mapred.JobClient: map 100% reduce 4%
11/11/22 20:56:19 INFO mapred.JobClient: Task Id : attempt_201111222034_0001_m_000002_0, Status : FAILED
Too many fetch-failures
11/11/22 20:56:19 WARN mapred.JobClient: Error reading task outputConnection refused
11/11/22 20:56:19 WARN mapred.JobClient: Error reading task outputConnection refused

1. Error Analysis

The first phase after a reduce task starts is the shuffle, in which it fetches output data from the maps. Each fetch can fail because of a connect timeout, a read timeout, a checksum error, and so on. The reduce task keeps a counter per map that records how many times fetching that map's output has failed. When the count reaches a certain threshold, the reduce task reports to the JobTracker that fetching the map output has failed too many times, and prints a log line like:

Failed to fetch map-output from attempt_201105261254_102769_m_001802_0 even after MAX_FETCH_RETRIES_PER_MAP retries... reporting to the JobTracker

The threshold is computed as:

max(MIN_FETCH_RETRIES_PER_MAP,
    getClosestPowerOf2((this.maxBackoff * 1000 / BACKOFF_INIT) + 1));

By default, MIN_FETCH_RETRIES_PER_MAP=2, maxBackoff=300 and BACKOFF_INIT=4000, so the default threshold is 6. It can be tuned by changing the mapred.reduce.copy.backoff parameter.
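As a quick check of those defaults, (300 * 1000 / 4000) + 1 = 76, and 76 is closest to 2^6, which lines up with the default threshold of 6. A minimal sketch of raising the back-off in mapred-site.xml (classic MRv1 property; the value 600 below is only illustrative):

<property>
  <name>mapred.reduce.copy.backoff</name>
  <value>600</value>
</property>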

Once the threshold is reached, the reduce task notifies its TaskTracker through the umbilical protocol, and the TaskTracker passes the report on to the JobTracker at the next heartbeat. When the JobTracker sees that more than 50% of the reduces have reported repeated fetch failures for one map's output, it fails that map attempt, reschedules it, and prints a log line like:

"Too many fetch-failures for output of task: attempt_201105261254_102769_m_001802_0 ... killing it"

2. Causes and Fixes

The most likely cause is that network connectivity between the nodes is incomplete.

1) Check /etc/hosts
   The local IP must be mapped to the local server's hostname.
   The file must list the IP and hostname of every server in the cluster.

In my setup the cluster ran in virtual machines on Ubuntu 11.04, and the error appeared after a reboot. It turned out that Ubuntu prepends the following entries to /etc/hosts on every boot:

127.0.0.1   localhost    your_hostname
::1         localhost6   your_hostname

Commenting out these two lines (or simply deleting your_hostname from them) resolves the error; a sample corrected /etc/hosts is sketched below.
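A minimal sketch of a working /etc/hosts, using the three-node example cluster described in 2) below (IPs and hostnames are illustrative; the point is that every node lists every server, and the real hostname is not bound to 127.0.0.1):

127.0.0.1        localhost
192.168.128.131  master
192.168.128.132  slave1
192.168.128.133  slave2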
2) Check .ssh/authorized_keys
   On every node it must contain the public keys of all servers, including the node itself.

Even though passwordless SSH between the nodes was configured before installing Hadoop, suppose there are three nodes with IPs 192.168.128.131, 192.168.128.132 and 192.168.128.133, whose hostnames are master, slave1 and slave2. The first time you run $ ssh <hostname> (master, slave1 or slave2) from each node, SSH prints a yes/no prompt asking you to confirm the host key; after confirming once, later connections work normally. If this step was never done by hand, and the IPs in hadoop/conf/core-site.xml and mapred-site.xml happen to be given as hostnames, this exception is very likely to appear.
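A minimal sketch of accepting all host keys up front on each node, assuming the example hostnames above (ssh-keyscan into ~/.ssh/known_hosts is a non-interactive alternative):

# run once on every node: master, slave1 and slave2
for host in master slave1 slave2; do
    ssh -o StrictHostKeyChecking=no "$host" hostname
done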
