Splits and How MapTask Parallelism Is Determined

1. Why is there a split mechanism?

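Big-data processing runs on distributed clusters, and the original design idea was that such clusters are deployed on inexpensive commodity machines. To get the highest efficiency and speed, the data is divided into multiple pieces that are sent to different machines in the cluster, and each machine performs the same operation on its piece. This is what makes the processing fast and efficient. How the input is divided is therefore a very important step in the job submission flow, and it is the main topic of the notes that follow.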

2. Mechanism diagram

[Figure: diagram of the split mechanism; image not included]

3. Key concepts

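Data block: a Block is the unit into which HDFS physically divides data for storage.
Data split: a split divides the input only logically; the data is not cut into pieces and stored that way on disk.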
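The difference between a physical block and a logical split can be sketched in a few lines. The example below is a minimal illustration, not Hadoop's actual code: it assumes the split-size rule used by Hadoop's `FileInputFormat` (`max(minSize, min(maxSize, blockSize))`) and its 10% tolerance for the last split. The function names and the 128 MB block size are chosen here for illustration only. Note that the result is just a list of (offset, length) ranges; no data is read or moved.

```python
from typing import List, Tuple

# Assumption: mirrors the 1.1 "slop" factor of Hadoop's FileInputFormat,
# which lets the last split be up to 10% larger than the split size.
SPLIT_SLOP = 1.1


def compute_split_size(block_size: int, min_size: int, max_size: int) -> int:
    """Split size rule: max(min_size, min(max_size, block_size))."""
    return max(min_size, min(max_size, block_size))


def logical_splits(file_len: int, split_size: int) -> List[Tuple[int, int]]:
    """Return (offset, length) ranges that describe the file logically.

    Nothing is written to disk; each range only tells one MapTask
    which byte range of the file it should read.
    """
    splits = []
    offset = 0
    remaining = file_len
    # Keep cutting full-size splits while the remainder is clearly larger
    # than one split (more than 1.1x the split size).
    while remaining / split_size > SPLIT_SLOP:
        splits.append((offset, split_size))
        offset += split_size
        remaining -= split_size
    # Whatever is left (possibly slightly more than one split) is the last split.
    if remaining > 0:
        splits.append((offset, remaining))
    return splits


MB = 1024 * 1024
size = compute_split_size(block_size=128 * MB, min_size=1, max_size=2**63 - 1)
for off, length in logical_splits(300 * MB, size):
    print(f"split at offset {off // MB} MB, length {length // MB} MB")
```

With these assumed defaults the split size equals the block size, so a 300 MB file yields three splits (128 MB, 128 MB, 44 MB) and therefore three MapTasks, while a 129 MB file stays in a single split because the remainder is within the 10% tolerance.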

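Note: the parallelism of MapTasks determines how concurrently tasks are processed in the Map phase, which in turn affects the processing speed of the whole Job. More MapTasks is not always better, however: launching several MapTasks for 1 KB of data, for example, is counterproductive, because the cost of starting each task outweighs the work it does.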
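The trade-off in the note can be made concrete with a toy cost model. This is only a sketch under stated assumptions: the fixed startup cost per task (1 second), the read throughput (100 MB/s), the number of task slots (4), and the function name `estimate` are all hypothetical values picked for illustration, not measurements from a real cluster.

```python
import math
from typing import Tuple


def estimate(data_bytes: int, num_tasks: int, slots: int,
             startup_s: float, bytes_per_s: float) -> Tuple[float, float]:
    """Return (wall_time_s, total_task_seconds) for a toy Map phase.

    Assumptions (hypothetical): data is divided evenly among tasks, every
    task pays the same fixed startup cost, and at most `slots` tasks run
    at once, so tasks execute in waves.
    """
    per_task = startup_s + (data_bytes / num_tasks) / bytes_per_s
    waves = math.ceil(num_tasks / slots)
    return waves * per_task, num_tasks * per_task


KB = 1024
GB = 1024 ** 3
for label, size in (("1 KB", 1 * KB), ("8 GB", 8 * GB)):
    for tasks in (1, 8):
        wall, total = estimate(size, tasks, slots=4,
                               startup_s=1.0, bytes_per_s=100e6)
        print(f"{label}, {tasks} task(s): wall {wall:.2f}s, total {total:.2f}s")
```

Under these assumptions, splitting 1 KB across 8 tasks roughly doubles the wall time and multiplies the total task time by about 8, while splitting 8 GB across 8 tasks cuts the wall time substantially. The startup cost dominates when the data is tiny and is amortized when the data is large.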

Copyright notice: this blog records the author's own self-study notes. Please credit the source when reposting.
https://me.csdn.net/qq_39657909
