Hadoop Basics ---- Hadoop in Practice (5) ----- Developing MapReduce in MyEclipse --- The WordCount Example --- Dissecting How MapReduce Is Written

Source: 程序员人生  Published: 2016-12-01 15:52:43

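In the previous chapter we learned how to develop and run MapReduce in MyEclipse:

Hadoop Basics ---- Hadoop in Practice (4) ----- Developing MapReduce in MyEclipse --- Setting Up a Hadoop Development Environment in MyEclipse and Running WordCount

In a much earlier chapter we also covered the principles of MapReduce:

Hadoop Basics ---- Hadoop Theory (4) ----- MapReduce, Hadoop's Distributed Parallel Computing Model, Explained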

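Goal

The main flow of MapReduce is map ----> reduce.

In this chapter we look in detail at how MapReduce is configured and implemented in Java code, using the WordCount example.

The aim is to become familiar with how a MapReduce program is written, so that we can implement more kinds of business processing and solve other problems.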

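The structure of a MapReduce program

Writing a MapReduce program involves three main parts:

a Mapper implementation, a Reducer implementation, and the Job configuration.

Implementing Mapper and Reducer means writing two classes (called, say, Map and Reduce). In the org.apache.hadoop.mapreduce API used in this article, Mapper and Reducer are base classes that you extend rather than interfaces that you implement.

The Map class specifies how each input <key, value> pair is turned into intermediate <key, value> pairs; the framework then groups these by key into <key, list of values>.

The Reduce class specifies how the intermediate results from Map are processed further and turned into the final output <key, value> pairs.

The Job is configured in the main function by creating the relevant objects and calling their methods.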
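To make the data flow concrete before looking at the real code, here is a minimal plain-Java sketch of the map, group, and reduce steps. It uses no Hadoop classes; the class name MapReduceFlowSketch and its methods are hypothetical and exist only for illustration. In Hadoop the grouping step is done by the framework, not by your code.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Hypothetical, Hadoop-free illustration of the map -> group -> reduce flow.
class MapReduceFlowSketch {

    // "Map": turn one line of text into a list of <word, 1> pairs.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        StringTokenizer itr = new StringTokenizer(line);
        while (itr.hasMoreTokens()) {
            pairs.add(new AbstractMap.SimpleEntry<>(itr.nextToken(), 1));
        }
        return pairs;
    }

    // "Group": collect the values of equal keys into one list, sorted by key.
    // Hadoop performs this step itself between map and reduce.
    static Map<String, List<Integer>> group(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        }
        return grouped;
    }

    // "Reduce": collapse the list of values for one key into a single sum.
    static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    // "Job": wire the three steps together for a list of input lines.
    static Map<String, Integer> run(List<String> lines) {
        List<Map.Entry<String, Integer>> intermediate = new ArrayList<>();
        for (String line : lines) {
            intermediate.addAll(map(line));
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : group(intermediate).entrySet()) {
            result.put(e.getKey(), reduce(e.getKey(), e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        // prints {hadoop=1, hello=2, world=1}
        System.out.println(run(Arrays.asList("hello world", "hello hadoop")));
    }
}
```

The real program has the same shape: a map method, a reduce method, and a main method that wires them into a job.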

Complete code

package org.apache.hadoop.examples;

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class WordCount {

    // Static nested class that performs the Map task; it is named TokenizerMapper and extends Mapper
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {

        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    // Static nested class that performs the Reduce task; it is named IntSumReducer and extends Reducer
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

        private IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    // All that main has to do is configure and submit the Job
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        if (otherArgs.length != 2) {
            System.err.println("Usage: wordcount <in> <out>");
            System.exit(2);
        }
        Job job = new Job(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Contents of the input files

We use two files for testing: file1.txt and file2.txt.

file1.txt contains

hello world

file2.txt contains

hello hadoop

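Analysis of the Mapper implementation

public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, one);
        }
    }
}

This code implements the Map step. We declare a class TokenizerMapper that extends Mapper (the class name is arbitrary; WordCountMap would do just as well, but the class must extend Mapper). The four type parameters are, in order, the input key type, the input value type, the output key type, and the output value type. A MapReduce program written for a different purpose extends the same class and either keeps these types or adjusts them as needed.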
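As an illustration of what "adjusting the type parameters" means, the sketch below defines a hypothetical, much simplified SimpleMapper interface with the same four type parameters and a mapper whose output key is an Integer (the line length) instead of a word. None of these names come from Hadoop; they only mirror the shape of Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT>.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical mini version of the Mapper idea, only to show the four type parameters.
class TypeParamSketch {

    // Stand-in for Context: receives the pairs that the mapper emits.
    interface Collector<K, V> {
        void write(K key, V value);
    }

    // KIN/VIN are the input pair types, KOUT/VOUT the emitted pair types.
    interface SimpleMapper<KIN, VIN, KOUT, VOUT> {
        void map(KIN key, VIN value, Collector<KOUT, VOUT> out);
    }

    // Same structure as a word-count mapper, but the output key is the line length.
    static class LineLengthMapper implements SimpleMapper<Long, String, Integer, Integer> {
        @Override
        public void map(Long offset, String line, Collector<Integer, Integer> out) {
            out.write(line.length(), 1);
        }
    }

    // Feeds each line to the mapper and records what it emits as text.
    static List<String> run(String... lines) {
        List<String> emitted = new ArrayList<>();
        LineLengthMapper mapper = new LineLengthMapper();
        long offset = 0;
        for (String line : lines) {
            mapper.map(offset, line, (k, v) -> emitted.add("<" + k + "," + v + ">"));
            offset += line.length() + 1; // +1 for the line break
        }
        return emitted;
    }

    public static void main(String[] args) {
        // prints [<11,1>, <12,1>]
        System.out.println(run("hello world", "hello hadoop"));
    }
}
```

Only the type arguments and the body of map changed; the surrounding structure stayed the same, which is the point the paragraph above makes.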

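Readers familiar with Java will notice some new data types here, such as Text, IntWritable, and Context.

LongWritable, IntWritable, and Text are classes that Hadoop provides to wrap Java data types. They implement the WritableComparable interface, so they can be serialized for data exchange in a distributed environment; you can think of them as replacements for long, int, and String respectively.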
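To show what "can be serialized" means in practice, here is a hedged sketch of a hypothetical IntBoxSketch class that imitates the idea behind IntWritable: a mutable int wrapper that writes itself to a binary stream and reads itself back. The method names write and readFields follow Hadoop's Writable interface, but this class is a stand-in, not Hadoop's implementation.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical stand-in for IntWritable; not Hadoop's actual class.
class IntBoxSketch implements Comparable<IntBoxSketch> {
    private int value;

    IntBoxSketch() { }

    IntBoxSketch(int value) { this.value = value; }

    int get() { return value; }

    void set(int value) { this.value = value; }

    // Serialize the wrapped int as 4 bytes.
    void write(DataOutput out) throws IOException {
        out.writeInt(value);
    }

    // Deserialize into this object, overwriting the current value.
    void readFields(DataInput in) throws IOException {
        value = in.readInt();
    }

    // Being comparable is what lets such a type be used as a sortable key.
    @Override
    public int compareTo(IntBoxSketch other) {
        return Integer.compare(value, other.value);
    }

    // Writes a value to bytes and reads it back into a fresh object.
    static int roundTrip(int v) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            new IntBoxSketch(v).write(new DataOutputStream(bytes));
            IntBoxSketch copy = new IntBoxSketch();
            copy.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
            return copy.get();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip(42)); // prints 42
    }
}
```

The mutable set method is also why the WordCount mapper can keep a single Text field and call word.set(...) for every token instead of allocating a new object each time.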

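Context collects the key-value pairs of the intermediate or final results. The older org.apache.hadoop.mapred API uses OutputCollector<Text, IntWritable> output instead, but the purpose is the same: collecting results.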

private final static IntWritable one = new IntWritable(1);

Defines an IntWritable constant with the value 1, used as the count emitted for each word.

private Text word = new Text();

Defines a variable that holds the output key, that is, the current word. The framework later groups the map output by this key.

public void map(Object key, Text value, Context context ) throws IOException, InterruptedException {

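A Mapper must provide a map method that does the work; its parameters are generally fixed as well.

The context parameter collects the intermediate key-value pairs and passes them on to reduce.

Our file1.txt and file2.txt are read through TextInputFormat, which extends FileInputFormat; each file (or part of a file) is fed separately to a map task as its input. Every line then produces one record in <key, value> form: the key is the byte offset of the line within the file, of type LongWritable, and the value is the content of the line, of type Text.

In other words, the text we wrote in the files arrives in the Text value parameter, one line per call to map.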
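The record format can be sketched in plain Java as follows. This is a simplified illustration that assumes UTF-8 text with "\n" line endings; the class LineRecordSketch and the two-line sample content are hypothetical, and the real TextInputFormat also deals with splits and other line endings.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration of <byte offset, line> records; not Hadoop code.
class LineRecordSketch {

    // Turns file content into one <offset, line> record per line.
    static Map<Long, String> toRecords(String fileContent) {
        Map<Long, String> records = new LinkedHashMap<>();
        if (fileContent.isEmpty()) {
            return records;
        }
        long offset = 0;
        for (String line : fileContent.split("\n")) {
            records.put(offset, line);
            // Advance by the encoded length of the line plus one byte for '\n'.
            offset += line.getBytes(StandardCharsets.UTF_8).length + 1;
        }
        return records;
    }

    public static void main(String[] args) {
        // prints {0=hello world, 12=hello hadoop}
        System.out.println(toRecords("hello world\nhello hadoop\n"));
    }
}
```

WordCount ignores the offset key entirely, which is why its mapper can declare the input key type as plain Object.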

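StringTokenizer itr = new StringTokenizer(value.toString());
while (itr.hasMoreTokens()) {
    word.set(itr.nextToken());
    context.write(word, one);
}

StringTokenizer is a tokenizer that splits a sentence into individual words. value holds one line of our text, and here it is split on whitespace into words.

The while loop then iterates over the tokens, putting each word into the word variable.

It then writes the word variable together with the count 1 to context as one result.

After map has run, the results collected in context are therefore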

<hello,1>

<world,1>

<hello,1>

<hadoop,1>

These results are passed to reduce automatically.

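Analysis of the Reducer implementation

public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        result.set(sum);
        context.write(key, result);
    }
}

This code implements the Reduce step. We declare a class IntSumReducer that extends Reducer (the class name is arbitrary; WordCountReduce would do just as well, but the class must extend Reducer). The four type parameters are again the input key, input value, output key, and output value types, and the input types must match the Mapper's output types. A MapReduce program written for a different purpose extends the same class and either keeps these types or adjusts them as needed.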

private IntWritable result = new IntWritable();

Defines a variable that holds the count for each group.

public void reduce(Text key, Iterable<IntWritable> values, Context context ) throws IOException, InterruptedException {

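A Reducer must provide a reduce method that does the work; its parameters are generally fixed as well.

The context parameter collects the final key-value pairs.

key is the key passed over from map, and values holds the values passed over from map for that key.

Why is it values, in the plural?

Because the framework groups the data before calling reduce: all values that share the same key are delivered together in a single reduce call.

The results passed over from map are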

<hello,1>

<world,1>

<hello,1>

<hadoop,1>

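So in this example reduce is called three times. Because the framework sorts the keys before handing them to reduce, with a single reducer the calls arrive in key order:

First call: key is hadoop, values is [1]

Second call: key is hello, values is [1,1]

Third call: key is world, values is [1]

int sum = 0;
for (IntWritable val : values) {
    sum += val.get();
}
result.set(sum);
context.write(key, result);

Loop over the values list, add up the counts, and write the sum to context, the container for the final results.

So the final result is

<hadoop,1>

<hello,2>

<world,1>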

Job configuration -- the main method

public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
    if (otherArgs.length != 2) {
        System.err.println("Usage: wordcount <in> <out>");
        System.exit(2);
    }
    Job job = new Job(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
    FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
}

A MapReduce program needs a main method that configures the parameters, describes to the Hadoop framework the work the map-reduce job should perform, and submits it for execution.

Configuration conf = new Configuration();

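Creates a configuration instance.

String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
if (otherArgs.length != 2) {
    System.err.println("Usage: wordcount <in> <out>");
    System.exit(2);
}

Gets the arguments and checks that they are valid. GenericOptionsParser applies the generic Hadoop options (such as -D property=value) to conf and returns the remaining arguments, which must be exactly the input path and the output path.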
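The idea of "remaining arguments" can be sketched with a deliberately simplified stand-in. The class RemainingArgsSketch below is hypothetical and handles only the form "-D key=value"; the real GenericOptionsParser supports more options and writes into a Configuration rather than a Map. The paths /input and /output are placeholders.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified stand-in for the generic-options idea; not Hadoop code.
class RemainingArgsSketch {

    // Moves every "-D key=value" pair into conf and returns the other arguments.
    static String[] remainingArgs(String[] args, Map<String, String> conf) {
        List<String> rest = new ArrayList<>();
        for (int i = 0; i < args.length; i++) {
            boolean isProperty = "-D".equals(args[i])
                    && i + 1 < args.length
                    && args[i + 1].contains("=");
            if (isProperty) {
                String[] kv = args[++i].split("=", 2);
                conf.put(kv[0], kv[1]);
            } else {
                rest.add(args[i]);
            }
        }
        return rest.toArray(new String[0]);
    }

    public static void main(String[] args) {
        Map<String, String> conf = new LinkedHashMap<>();
        String[] sample = {"-D", "mapreduce.job.reduces=1", "/input", "/output"};
        String[] otherArgs = remainingArgs(sample, conf);
        // Same validity check as in WordCount: exactly <in> and <out> must remain.
        if (otherArgs.length != 2) {
            System.err.println("Usage: wordcount <in> <out>");
            System.exit(2);
        }
        // prints {mapreduce.job.reduces=1} /input /output
        System.out.println(conf + " " + String.join(" ", otherArgs));
    }
}
```

This is why the check is made on otherArgs rather than on args: generic options may precede the two paths without making the call invalid.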

Job job = new Job(conf, "word count");

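Creates a new job. In Hadoop 2.x and later this constructor is deprecated; Job.getInstance(conf, "word count") is the preferred equivalent.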

job.setJarByClass(WordCount.class);

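Sets the jar to run by class: Hadoop locates the jar that contains the given class, here the main class of the MapReduce program.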

job.setMapperClass(TokenizerMapper.class);

Sets the map class, that is, the name of the class that extends Mapper.

job.setCombinerClass(IntSumReducer.class);

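Sets the Combiner class. Between map and reduce there is an optional extra step, the Combiner, which pre-aggregates each map task's output locally before it is sent to reduce. If you have special requirements you can write a separate class; otherwise you can reuse the class that extends Reducer, which works here because addition is commutative and associative.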
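The effect of a combiner can be illustrated with a small plain-Java simulation. It assumes two map tasks, one per input split, and uses made-up split contents with a repeated word so that the saving is visible; CombinerSketch and its methods are hypothetical names, and no Hadoop classes are involved.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Hypothetical simulation of local pre-aggregation; not Hadoop code.
class CombinerSketch {

    // One map task: emit <word, 1> for every token in its split.
    static List<Map.Entry<String, Integer>> mapTask(String split) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        StringTokenizer itr = new StringTokenizer(split);
        while (itr.hasMoreTokens()) {
            out.add(new AbstractMap.SimpleEntry<>(itr.nextToken(), 1));
        }
        return out;
    }

    // Summing by key serves both as the combiner (per task) and as the reducer (global).
    static Map<String, Integer> sumByKey(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> sums = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            sums.merge(p.getKey(), p.getValue(), Integer::sum);
        }
        return sums;
    }

    // Returns the records that would travel from the map side to the reduce side.
    static List<Map.Entry<String, Integer>> shuffle(List<String> splits, boolean useCombiner) {
        List<Map.Entry<String, Integer>> shuffled = new ArrayList<>();
        for (String split : splits) {
            List<Map.Entry<String, Integer>> mapOutput = mapTask(split);
            if (useCombiner) {
                shuffled.addAll(sumByKey(mapOutput).entrySet()); // pre-aggregate locally
            } else {
                shuffled.addAll(mapOutput);
            }
        }
        return shuffled;
    }

    public static void main(String[] args) {
        List<String> splits = Arrays.asList("hello hello world", "hello hadoop");
        List<Map.Entry<String, Integer>> plain = shuffle(splits, false);
        List<Map.Entry<String, Integer>> combined = shuffle(splits, true);
        // prints 5 records -> {hadoop=1, hello=3, world=1}
        System.out.println(plain.size() + " records -> " + sumByKey(plain));
        // prints 4 records -> {hadoop=1, hello=3, world=1}
        System.out.println(combined.size() + " records -> " + sumByKey(combined));
    }
}
```

Fewer records cross the network while the final counts stay the same. An operation such as averaging would not give the same guarantee if the reducer were reused unchanged as the combiner.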

job.setReducerClass(IntSumReducer.class);

Sets the reduce class, that is, the name of the class that extends Reducer.

job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);