MapReduce练习之WordCount案例实操


上一篇简单的介绍了MapReduce的思想,优缺点以及编程规范等等。。。所以趁热打铁做一个MapReduce的实操,再深刻体会下!!!

1.需求

在给定的文件中统计输出每个单词出现的总次数
图解:
在这里插入图片描述

2.数据准备

hello world
hadoop 
spark
hello world
lsl
spark
hadoop 
spark
hadoop 
spark
lsl

3.分别编写Mapper、Reducer、Driver

3.1编写Mapper

package com.lsl.sumword;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.IOException;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private Text keyOut = new Text();
    //千万不要忘记传参!!!
    private IntWritable valueOut = new IntWritable(1);

    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {

        System.out.println("key====>" + key + ";value====>" + value);
        //1.获取一行
        String line = value.toString();
        //2.切割
        String[] words = line.split(" ");
        //3.输出
        for (String word : words) {
            keyOut.set(word);
            context.write(keyOut, valueOut);
        }
    }
}

3.2 编写Reducer

package com.lsl.sumword;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;


import java.io.IOException;

public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    private int sum;
    private IntWritable intWritable = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
		// 1 累加求和
        sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        // 2 输出
        intWritable.set(sum);
        context.write(key,intWritable);
    }
}

3.3 编写Driver

package com.lsl.sumword;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;

public class WordCountDriver {

    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {

        //1.获取job
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf);
        FileSystem fs = FileSystem.get(conf);
        Path input = new Path("d:/input/dea.txt");
        Path output = new Path("d:/output");
        //如果输出目录存在则删除
        if (fs.exists(output)) {
            fs.delete(output, true);
        }

        //2.设置jar加载路径
        job.setJarByClass(WordCountDriver.class);

        //3.设置map和reduce类
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);

        //4.设置map输出类型
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);

        //5.设置Reduce输出
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        //6.设置输入和输出路径
        FileInputFormat.setInputPaths(job, input);
        FileOutputFormat.setOutputPath(job, output);

        //7.提交
        boolean result = job.waitForCompletion(true);

        System.exit(result?0:1);
    }
}

4.测试

4.1 在本地情况下

1.先在自己的Windows机器上编译一下hadoop源码,然后解压到一个非中文路径,并在windows环境上配置HADOOP_HOME环境变量(在博主的交流群中有!)
2.在idea上运行程序
3.一些问题:打印不出日志
解决:需要在项目的src目录下,新建一个文件,命名为“log4j.properties”,并添加

log4j.rootLogger=INFO, stdout  
log4j.appender.stdout=org.apache.log4j.ConsoleAppender  
log4j.appender.stdout.layout=org.apache.log4j.PatternLayout  
log4j.appender.stdout.layout.ConversionPattern=%d %p [%c] - %m%n  
log4j.appender.logfile=org.apache.log4j.FileAppender  
log4j.appender.logfile.File=target/spring.log  
log4j.appender.logfile.layout=org.apache.log4j.PatternLayout  
log4j.appender.logfile.layout.ConversionPattern=%d %p [%c] - %m%n  

4.运行结果
在这里插入图片描述

4.2 在集群情况下

1.在依赖中添加

	<build>
		<plugins>
			<plugin>
				<artifactId>maven-compiler-plugin</artifactId>
				<version>2.3.2</version>
				<configuration>
					<source>1.8</source>
					<target>1.8</target>
				</configuration>
			</plugin>
			<plugin>
				<artifactId>maven-assembly-plugin </artifactId>
				<configuration>
					<descriptorRefs>
						<descriptorRef>jar-with-dependencies</descriptorRef>
					</descriptorRefs>
					<archive>
						<manifest>
							<mainClass>com.atguigu.mr.WordCountReduce</mainClass>
						</manifest>
					</archive>
				</configuration>
				<executions>
					<execution>
						<id>make-assembly</id>
						<phase>package</phase>
						<goals>
							<goal>single</goal>
						</goals>
					</execution>
				</executions>
			</plugin>
		</plugins>
	</build>

然后idea点击:
在这里插入图片描述
或者eclipse:右击项目->maven->update project即可
2.修改WordCountDriver 中的代码:参数写为args[0]和args[1]
在这里插入图片描述
3.将程序打成jar包,然后拷贝到hadoop集群中
idea
在这里插入图片描述
在这里插入图片描述
eclipse
右键->Run as->maven install。等待编译完成就会在项目的target文件夹中生成jar包。如果看不到。在项目上右键-》Refresh,即可看到。修改不带依赖的jar包名称为wc.jar,并拷贝该jar包到Hadoop集群。
4.启动hadoop集群
5.在集群上执行wordcount程序(别忘了把数据上传到hdfs上!!!wordcount是我的数据文件)

[lsl@hadoop102 software]$ hadoop jar wordcount-1.0-SNAPSHOT.jar com.lsl.sumword.WordCountDriver  /wordcount /output

6.结果
在这里插入图片描述
在这里插入图片描述

版权声明:本博客为记录本人自学感悟,转载需注明出处!
https://me.csdn.net/qq_39657909

已标记关键词 清除标记
©️2020 CSDN 皮肤主题: 技术黑板 设计师:CSDN官方博客 返回首页