Hadoop 3: Reading and Writing SequenceFile Files


Author: 二分尘土, 2023-10-06 20:37:15

Key Points

1. Write a SequenceFile.
2. Read a SequenceFile.
3. Sort and merge SequenceFiles.

Learning Objective

Learn how to read and write SequenceFile files.

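Environment

Operating system: Ubuntu 16.04
Software: JDK 1.8, Hadoop 3.3
Dataset path: /data/dataset
Tar package path: /data/software
Tar package compression path: /data/bigdata
Software installation path: /opt
Path for files created during the exercise: /data/resource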

Steps

Start the HDFS Cluster Services

Note: the Hadoop 3 environment variables must be enabled in /etc/profile.
1. In the /opt/hadoop-3.3.0/etc/hadoop directory, edit mapred-site.xml:

<configuration>
  <property>
    <name>mapreduce.job.tracker</name>
    <value>hdfs://localhost:8001</value>
    <final>true</final>
  </property>
</configuration>

Then edit yarn-site.xml:

<configuration>
  <property>
    <name>yarn.nodemanager.resource.memory-mb</name>
    <value>3072</value>
  </property>
  <property>
    <name>yarn.nodemanager.resource.cpu-vcores</name>
    <value>2</value>
  </property>
  <property>
    <name>yarn.scheduler.minimum-allocation-mb</name>
    <value>256</value>
  </property>
</configuration>
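To get a feel for what these memory values imply, here is a small arithmetic sketch in plain Java. It assumes the scheduler rounds every memory request up to a multiple of yarn.scheduler.minimum-allocation-mb and that memory is the only limiting resource; both are assumptions that depend on the scheduler in use. The class and method names are made up for illustration and are not part of Hadoop.

```java
// Hypothetical helper, not part of Hadoop: rough container math
// for the yarn-site.xml values shown above.
final class YarnMemorySketch {
    static final int NODE_MEMORY_MB = 3072;    // yarn.nodemanager.resource.memory-mb
    static final int MIN_ALLOCATION_MB = 256;  // yarn.scheduler.minimum-allocation-mb

    // Assumption: a request is rounded up to the next multiple of the minimum allocation.
    static int roundedRequestMb(int requestedMb) {
        int units = Math.max(1, (requestedMb + MIN_ALLOCATION_MB - 1) / MIN_ALLOCATION_MB);
        return units * MIN_ALLOCATION_MB;
    }

    // Upper bound on containers of the given size that fit on the node, by memory only.
    static int maxContainers(int requestedMb) {
        return NODE_MEMORY_MB / roundedRequestMb(requestedMb);
    }

    public static void main(String[] args) {
        System.out.println(maxContainers(256));   // 12
        System.out.println(maxContainers(1000));  // 3, because 1000 MB is rounded up to 1024 MB
    }
}
```

Under these assumptions the node can host at most twelve minimum-size containers at once, which is plenty for the single-reducer sort job used later.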

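In a terminal window, run the following commands to start the cluster (start-all.sh starts both the HDFS and YARN daemons):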

$ cd /opt/hadoop-3.3.0/sbin/
$ ./start-all.sh

2. In a terminal window, run the following command to check that the services have started:

$ jps
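The jps listing is easy to check by eye, but a small helper makes the check explicit. The sketch below assumes a single-node setup in which start-all.sh launches NameNode, DataNode, SecondaryNameNode, ResourceManager, and NodeManager. The class name, the method name, and the sample output (including the process IDs) are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper: reports which expected Hadoop daemons are absent from jps output.
final class JpsCheck {
    // Assumption: these five daemons are expected on a single-node cluster.
    static final List<String> EXPECTED = Arrays.asList(
            "NameNode", "DataNode", "SecondaryNameNode", "ResourceManager", "NodeManager");

    static List<String> missingDaemons(String jpsOutput) {
        Set<String> running = new HashSet<>();
        for (String line : jpsOutput.split("\\R")) {
            // Each jps line has the form "<pid> <main class name>".
            String[] parts = line.trim().split("\\s+");
            if (parts.length >= 2) {
                running.add(parts[1]);
            }
        }
        List<String> missing = new ArrayList<>();
        for (String name : EXPECTED) {
            if (!running.contains(name)) {
                missing.add(name);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Made-up sample output in which only the HDFS daemons are running.
        String sample = "2001 NameNode\n2002 DataNode\n2003 Jps\n";
        System.out.println(missingDaemons(sample));
        // prints [SecondaryNameNode, ResourceManager, NodeManager]
    }
}
```

An empty result means every expected daemon is running.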

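Create the Hadoop Project

1. Start the Eclipse IDE.
2. In Eclipse, choose [File] -> [New] -> [Project] -> [Java Project] from the menu bar, name the Java project [Hadoop3Demo], and click [Finish], as shown in the figure below:

3. Import the Hadoop JARs. Right-click the project, choose [New] -> [Folder] to create a folder named [lib], and copy the JARs from /data/software/hadoop3_lib/ into it, as shown in the figure below:

4. Add every JAR under lib to the build path. Select all JAR files in the [lib] folder, right-click, and choose [Build Path] -> [Add to Build Path]. A new entry named [Referenced Libraries] then appears under the project, as shown in the figure below: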

Write Data to a SequenceFile

1. In the project's [src] directory, right-click and create a Java class named "com.simple.SequenceFileWriteDemo", then edit the source code as follows:

package com.simple;

import java.io.IOException;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SequenceFileWriteDemo {
    private static final String[] DATA = {
            "One, two, buckle my shoe",
            "Three, four, shut the door",
            "Five, six, pick up sticks",
            "Seven, eight, lay them straight",
            "Nine, ten, a big fat hen"
        };

    private static final String hdfsurl = "hdfs://localhost:9000";

    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        // Only the deprecated createWriter overload shown below needs a FileSystem.
        FileSystem fs = FileSystem.get(URI.create(hdfsurl), conf);

        IntWritable key = new IntWritable();    // key
        Text value = new Text();                // value
        Path path = new Path(hdfsurl + "/data.txt");

        SequenceFile.Writer writer = null;
        try {
            // Old (deprecated) approach:
            // writer = SequenceFile.createWriter(fs, conf, path, key.getClass(), value.getClass());

            // New approach: pass the target file and the key/value classes as options.
            SequenceFile.Writer.Option optionfile = SequenceFile.Writer.file(path);
            SequenceFile.Writer.Option optionkey = SequenceFile.Writer.keyClass(key.getClass());
            SequenceFile.Writer.Option optionvalue = SequenceFile.Writer.valueClass(value.getClass());
            writer = SequenceFile.createWriter(conf, optionfile, optionkey, optionvalue);

            for (int i = 0; i < 100; i++) {
                key.set(100 - i);                   // keys count down from 100 to 1
                value.set(DATA[i % DATA.length]);   // values cycle through DATA
                // getLength() before append() is the offset at which this record starts.
                System.out.printf("[%s]\t%s\t%s\n", writer.getLength(), key, value);
                writer.append(key, value);
            }
        } finally {
            IOUtils.closeStream(writer);
        }
    }
}

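2. Right-click any blank area in the code and choose [Run As] -> [Java Application] from the context menu to run the program, as shown in the figure below:

3. If everything works, the Eclipse console shows output like the following: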

......
 INFO [main] - Got brand-new compressor [.deflate]
[128]   100 One, two, buckle my shoe
[173]   99  Three, four, shut the door
[220]   98  Five, six, pick up sticks
[264]   97  Seven, eight, lay them straight
[314]   96  Nine, ten, a big fat hen
[359]   95  One, two, buckle my shoe
[404]   94  Three, four, shut the door
[451]   93  Five, six, pick up sticks
[495]   92  Seven, eight, lay them straight
[545]   91  Nine, ten, a big fat hen
[590]   90  One, two, buckle my shoe
[635]   89  Three, four, shut the door
[682]   88  Five, six, pick up sticks
[726]   87  Seven, eight, lay them straight
[776]   86  Nine, ten, a big fat hen
......
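The bracketed number on each line is writer.getLength() read just before the append, so it is the byte offset at which that record starts. The sketch below reproduces this bookkeeping with plain java.io classes. It is not the SequenceFile format: the record layout (a 4-byte int key followed by a writeUTF string) and the class name are assumptions chosen for illustration, so its offsets differ from the ones above.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: track the start offset of each appended record.
final class OffsetSketch {
    static final String[] DATA = {
        "One, two, buckle my shoe",
        "Three, four, shut the door"
    };

    // Appends `count` records to an in-memory stream and returns each record's start offset.
    static List<Integer> writeRecords(int count) {
        List<Integer> offsets = new ArrayList<>();
        DataOutputStream out = new DataOutputStream(new ByteArrayOutputStream());
        try {
            for (int i = 0; i < count; i++) {
                offsets.add(out.size());             // bytes written so far = start of this record
                out.writeInt(100 - i);               // 4-byte key, counting down as in the demo
                out.writeUTF(DATA[i % DATA.length]); // 2-byte length prefix + modified UTF-8 bytes
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return offsets;
    }

    public static void main(String[] args) {
        System.out.println(writeRecords(3)); // prints [0, 30, 62]
    }
}
```

As in the real output, the gap between consecutive offsets varies with the length of each value.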

4. In a terminal window, run the following command to list the file serialized to HDFS:

$ hdfs dfs -ls /

The generated sequence file should appear in the listing:

-rw-r--r--   3 hduser supergroup       4788 2020-08-26 14:02 /data.txt

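Read Data from a SequenceFile

The following example shows how to read a sequence file whose keys and values are Writable types.
1. In the project's [src] directory, right-click and create a Java class named "com.simple.SequenceFileReadDemo", then edit the source code as follows: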

package com.simple;

import java.io.IOException;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.util.ReflectionUtils;

public class SequenceFileReadDemo {

    private static final String hdfsurl = "hdfs://localhost:9000";

    public static void main(String[] args) throws IOException {

        Configuration conf = new Configuration();
        // Only the deprecated Reader constructor shown below needs a FileSystem.
        FileSystem fs = FileSystem.get(URI.create(hdfsurl), conf);

        Path path = new Path(hdfsurl + "/data.txt");
        SequenceFile.Reader reader = null;
        try {
            // Old (deprecated) approach:
            // reader = new SequenceFile.Reader(fs, path, conf);

            // New approach: pass the file to read as an option.
            SequenceFile.Reader.Option optionfile = SequenceFile.Reader.file(path);
            reader = new SequenceFile.Reader(conf, optionfile);

            // Instantiate the key and value types recorded in the file header.
            Writable key = (Writable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
            Writable value = (Writable) ReflectionUtils.newInstance(reader.getValueClass(), conf);

            long position = reader.getPosition();
            while (reader.next(key, value)) {
                // Mark records that were preceded by a sync point with an asterisk.
                String syncSeen = reader.syncSeen() ? "*" : "";
                System.out.printf("[%s%s]\t%s\t%s\n", position, syncSeen, key, value);
                position = reader.getPosition();    // start of the next record
            }
        } finally {
            IOUtils.closeStream(reader);
        }
    }

}

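2. Right-click any blank area in the code and choose [Run As] -> [Java Application] from the context menu to run the program, as shown in the figure below:

3. If everything works, the Eclipse console shows output like the following: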

......
INFO [main] - Got brand-new decompressor [.deflate]
[128]   100 One, two, buckle my shoe
[173]   99  Three, four, shut the door
[220]   98  Five, six, pick up sticks
[264]   97  Seven, eight, lay them straight
[314]   96  Nine, ten, a big fat hen
[359]   95  One, two, buckle my shoe
[404]   94  Three, four, shut the door
[451]   93  Five, six, pick up sticks
[495]   92  Seven, eight, lay them straight
[545]   91  Nine, ten, a big fat hen
......
[4512]  6   Nine, ten, a big fat hen
[4557]  5   One, two, buckle my shoe
[4602]  4   Three, four, shut the door
[4649]  3   Five, six, pick up sticks
[4693]  2   Seven, eight, lay them straight
[4743]  1   Nine, ten, a big fat hen
......
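The reader never names IntWritable or Text. It asks the file for its key and value classes and instantiates them reflectively through ReflectionUtils.newInstance. The plain-Java sketch below shows the same idea using only the standard reflection API. The newInstance helper and the class name are hypothetical stand-ins for illustration, not Hadoop's implementation.

```java
// Hypothetical sketch: create an object when its class is only known at run time.
final class ReflectiveInstanceSketch {
    // Creates an object from a Class token via its no-argument constructor.
    static <T> T newInstance(Class<T> type) {
        try {
            return type.getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Cannot instantiate " + type.getName(), e);
        }
    }

    public static void main(String[] args) throws ClassNotFoundException {
        // The class name arrives as a string, much like the class names stored in a file header.
        Class<?> type = Class.forName("java.lang.StringBuilder");
        Object holder = newInstance(type);
        System.out.println(holder.getClass().getName()); // prints java.lang.StringBuilder
    }
}
```

Because the types come from the file itself, the same reading loop works for any sequence file whose key and value classes are on the classpath.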

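View a SequenceFile from the Command Line

The hdfs dfs command has a -text option that displays sequence files as text.
1. In a terminal window, run the following command to view the contents of the sequence file: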

$ hdfs dfs -text /data.txt | head

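The command prints the first ten records, each as a key and a value separated by a tab, starting with key 100.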

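Sort and Merge SequenceFiles

MapReduce is the most powerful way to sort (and merge) one or more sequence files.
1. We can use the sort example that ships with Hadoop by specifying that the input and output are sequence files and by setting the key and value types. In a terminal window, run the following commands: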

$ cd /opt/
$ rm -rf hadoop
$ rm -rf hadoop-2.7.3
# Also remove the Hadoop 2 environment variables from /etc/profile, enable the
# Hadoop 3 ones, and run "source /etc/profile" to apply the change.
$ cd /opt/hadoop-3.3.0
$ hadoop jar ./share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar \
sort -r 1 \
-inFormat org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat \
-outFormat org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat \
-outKey org.apache.hadoop.io.IntWritable \
-outValue org.apache.hadoop.io.Text \
/data.txt /sorted

2. In a terminal window, run the following command to view the contents of the sorted sequence file:

$ hdfs dfs -text /sorted/part-r-00000 | head

The records appear sorted by key in ascending order, starting with key 1.
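As a rough local illustration of the ordering the sort job should produce, the sketch below regenerates the same 100 key-value pairs as the write demo and sorts them by key in memory. It does not use MapReduce or read any file; the Record class and the method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch: in-memory equivalent of sorting the demo records by key.
final class SortByKeySketch {
    static final String[] DATA = {
        "One, two, buckle my shoe",
        "Three, four, shut the door",
        "Five, six, pick up sticks",
        "Seven, eight, lay them straight",
        "Nine, ten, a big fat hen"
    };

    static final class Record {
        final int key;
        final String value;

        Record(int key, String value) {
            this.key = key;
            this.value = value;
        }
    }

    static List<Record> sortedRecords() {
        List<Record> records = new ArrayList<>();
        for (int i = 0; i < 100; i++) {
            // Same generation rule as SequenceFileWriteDemo: keys 100 down to 1.
            records.add(new Record(100 - i, DATA[i % DATA.length]));
        }
        records.sort(Comparator.comparingInt(r -> r.key)); // ascending by key
        return records;
    }

    public static void main(String[] args) {
        for (Record r : sortedRecords().subList(0, 3)) {
            System.out.println(r.key + "\t" + r.value);
        }
        // prints:
        // 1	Nine, ten, a big fat hen
        // 2	Seven, eight, lay them straight
        // 3	Five, six, pick up sticks
    }
}
```

The first lines match the last records printed by the read demo, in reverse order.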

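Review

Some applications need a specialized data structure to hold their data. Hadoop's SequenceFile class provides a persistent data structure for binary key-value pairs.
SequenceFile is a binary file format provided by the Hadoop API that serializes data to a file as a byte stream. Internally, it uses Hadoop's standard Writable interface for serialization and deserialization.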
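The Writable contract boils down to two methods, write(DataOutput) and readFields(DataInput). The sketch below mirrors that shape with plain java.io types so it runs without Hadoop on the classpath. IntTextPair is a hypothetical class for illustration and does not implement the real Writable interface or its byte layout.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical sketch of a Writable-style type: an int key plus a text value.
final class IntTextPair {
    int key;
    String value = "";

    // Same shape as Writable.write: emit the fields in a fixed order.
    void write(DataOutput out) throws IOException {
        out.writeInt(key);
        out.writeUTF(value);
    }

    // Same shape as Writable.readFields: refill this object in the same order.
    void readFields(DataInput in) throws IOException {
        key = in.readInt();
        value = in.readUTF();
    }

    // Serializes a pair to bytes and deserializes it into a fresh object.
    static IntTextPair roundTrip(int key, String value) {
        try {
            IntTextPair original = new IntTextPair();
            original.key = key;
            original.value = value;
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            original.write(new DataOutputStream(bytes));

            IntTextPair copy = new IntTextPair();
            copy.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        IntTextPair copy = roundTrip(100, "One, two, buckle my shoe");
        System.out.println(copy.key + "\t" + copy.value);
    }
}
```

Reusing one object and refilling it with readFields is the same pattern the read demo follows when it passes the same key and value instances to reader.next on every iteration.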
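A SequenceFile can also serve as a container for smaller files. HDFS and MapReduce are optimized for large files, so packing small files into a SequenceFile makes storing and processing them more efficient.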
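A common way to use a key-value container for small files is to store the file name as the key and the file contents as the value. The sketch below shows that idea with an in-memory byte array instead of a real SequenceFile. The record layout (a writeUTF name, a 4-byte length, then the raw bytes) and all names are assumptions made for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: pack many small "files" into one container of key-value records.
final class SmallFilePacker {
    // Writes each entry as: name (writeUTF), content length (int), content bytes.
    static byte[] pack(Map<String, byte[]> files) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            for (Map.Entry<String, byte[]> file : files.entrySet()) {
                out.writeUTF(file.getKey());
                out.writeInt(file.getValue().length);
                out.write(file.getValue());
            }
            out.flush();
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Reads the records back in the order they were written.
    static Map<String, byte[]> unpack(byte[] container) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(container));
            Map<String, byte[]> files = new LinkedHashMap<>();
            while (in.available() > 0) {
                String name = in.readUTF();
                byte[] content = new byte[in.readInt()];
                in.readFully(content);
                files.put(name, content);
            }
            return files;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Map<String, byte[]> files = new LinkedHashMap<>();
        files.put("a.txt", new byte[] {1, 2, 3});
        files.put("b.txt", new byte[0]);
        System.out.println(unpack(pack(files)).keySet()); // prints [a.txt, b.txt]
    }
}
```

One container holds many logical files, so the file system tracks a single large object instead of many tiny ones.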

2 Comments

IDxiaotong 2023-10-11 17:40

Caught you. Where are the images?

二分尘土 2023-10-06 21:10

log4j:WARN No appenders could be found for logger (org.apache.hadoop.util.Shell).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
