# Flume Installation and Usage
## Flume Overview
- Flume is a tool for collecting and aggregating log data.
- Collected log data is aggregated into HDFS for storage.
- Version used here: Flume 1.9.0.
## Flume Components
- source: the data source (the data to be collected)
- channel: a temporary buffer for in-flight events, usually held in memory
- sink: the destination store, here HDFS
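To see how these three components fit together before building the HDFS pipeline, here is a minimal sketch based on the standard Flume quickstart example: a netcat source feeding a memory channel drained by a logger sink. The agent name `a1` and the config path are illustrative; run it only after completing the installation below.

```sh
# Write a throwaway agent config: netcat source -> memory channel -> logger sink
cat > /tmp/netcat-demo.properties <<'EOF'
a1.sources = r1
a1.channels = c1
a1.sinks = k1
# Source: listen for text lines on a local TCP port
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
# Channel: buffer events in memory
a1.channels.c1.type = memory
# Sink: print each event to the agent's log
a1.sinks.k1.type = logger
# Bind source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
EOF
# Start the agent; then `telnet localhost 44444` in another shell and type a line
flume-ng agent --conf conf -f /tmp/netcat-demo.properties -n a1 -Dflume.root.logger=INFO,console
```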
## Installation
Upload the installation archive to the server, then extract it and move it into place:
```sh
tar zxvf apache-flume-1.9.0-bin.tar.gz
sudo mv apache-flume-1.9.0-bin /usr/
```
Configure the environment variables in `~/.bash_profile`:
```sh
JAVA_HOME=/usr/jdk1.8.0_231
HADOOP_HOME=/usr/hadoop-3.2.1
FLUME_HOME=/usr/apache-flume-1.9.0-bin
PATH=$FLUME_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$JAVA_HOME/bin:$PATH
export JAVA_HOME
export HADOOP_HOME
export FLUME_HOME
export PATH
```
Note: reload the profile so the changes take effect in the current shell:
```sh
source ~/.bash_profile
```
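At this point the `flume-ng` launcher should be on the PATH; a quick way to confirm is to print the version:

```sh
# Expected to report Flume 1.9.0
flume-ng version
```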
### Basic Flume configuration
Create `$FLUME_HOME/conf/flume-env.sh` from its template and point it at the JDK:
```sh
$ cp flume-env.sh.template flume-env.sh
$ vi flume-env.sh
```
In `flume-env.sh`, set `JAVA_HOME` (around line 22 of the file):
```sh
export JAVA_HOME=/usr/jdk1.8.0_231
```
### Resolving the guava jar conflict
```sh
cd /usr/apache-flume-1.9.0-bin/lib/
ll guava-11.0.2.jar
-rw-rw-r-- 1 hadoop hadoop 1648200 Sep 13 2018 guava-11.0.2.jar
cd /usr/hadoop-3.2.1/share/hadoop/common/lib/
ll guava-27.0-jre.jar
-rw-r--r-- 1 hadoop hadoop 2747878 Sep 10 2019 guava-27.0-jre.jar
# Remove Flume's bundled guava and replace it with Hadoop's newer one
rm -rf /usr/apache-flume-1.9.0-bin/lib/guava-11.0.2.jar
cp /usr/hadoop-3.2.1/share/hadoop/common/lib/guava-27.0-jre.jar /usr/apache-flume-1.9.0-bin/lib/
```
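The reason for the swap: Flume 1.9.0 bundles guava 11 while Hadoop 3.2.1 depends on guava 27, and with the older jar on the classpath the HDFS sink commonly fails at startup with a guava `NoSuchMethodError`. A quick check that only the newer jar remains:

```sh
# Only guava-27.0-jre.jar should be listed now
ls /usr/apache-flume-1.9.0-bin/lib/ | grep guava
```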
## Implementing Data Synchronization
### Requirements
- Collect log data from the crawler server.
### Implementation steps
- Start the data collection service (if no crawler is available yet, see the test-data sketch after this list).
- Start the HDFS service and make sure HDFS is readable and writable.
- Configure the agent (source, channel, sink).
- Use the TAILDIR source, introduced in Flume 1.7+, which tracks read positions reliably so that no data is lost.
- Write a shell launch script and run it.
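If the crawler service is not available, a hypothetical generator like the one below can stand in for it; the directory and file name are assumptions chosen to match the `filegroups` pattern in the configuration that follows.

```sh
# Hypothetical test-data generator: append one record per second
# to a .log file in the directory the TAILDIR source will watch
mkdir -p /home/hadoop/spider/data/collect
while true; do
  echo "$(date '+%Y-%m-%d %H:%M:%S') sample crawler record" >> /home/hadoop/spider/data/collect/spider.log
  sleep 1
done
```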
### Agent configuration file
```sh
cd /usr/apache-flume-1.9.0-bin/
mkdir myconf
cd myconf/
vi flume-taildir-memory-hdfs.properties
```
Create the agent configuration file `flume-taildir-memory-hdfs.properties` with the following content:
```sh
# Name the components on this agent
hdfs_agent.sources = r1
hdfs_agent.sinks = k1
hdfs_agent.channels = c1
# Describe/configure the source
hdfs_agent.sources.r1.type = TAILDIR
hdfs_agent.sources.r1.filegroups = f1
hdfs_agent.sources.r1.filegroups.f1 = /home/hadoop/spider/data/collect/.*\.log
hdfs_agent.sources.r1.positionFile = /home/hadoop/spider/data/.flume/taildir_position.json
# Describe the sink
hdfs_agent.sinks.k1.type = hdfs
hdfs_agent.sinks.k1.hdfs.path = hdfs://hadoop:9000/flume/hdfs_filegroups_source/%Y-%m-%d/
# Roll a new file every hour (3600 s) or once it reaches 1 MiB (1048576 bytes);
# rollCount = 0 disables event-count-based rolling
hdfs_agent.sinks.k1.hdfs.rollInterval = 3600
hdfs_agent.sinks.k1.hdfs.rollSize = 1048576
hdfs_agent.sinks.k1.hdfs.rollCount = 0
hdfs_agent.sinks.k1.hdfs.filePrefix = log_file_%H
hdfs_agent.sinks.k1.hdfs.fileSuffix = .log
# DataStream writes plain text instead of a SequenceFile
hdfs_agent.sinks.k1.hdfs.fileType = DataStream
# Resolve the %Y-%m-%d/%H escapes with the agent's local clock,
# since TAILDIR events carry no timestamp header
hdfs_agent.sinks.k1.hdfs.useLocalTimeStamp = true
# Use a channel which buffers events in memory
hdfs_agent.channels.c1.type = memory
hdfs_agent.channels.c1.capacity = 1000
hdfs_agent.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
hdfs_agent.sources.r1.channels = c1
hdfs_agent.sinks.k1.channel = c1
```
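The TAILDIR source records each tailed file's inode and byte offset in the position file, which is how it resumes exactly where it left off after an agent restart; once the agent has run, you can inspect it:

```sh
# One JSON entry per tailed file: inode, position (byte offset), and path
cat /home/hadoop/spider/data/.flume/taildir_position.json
```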
Create the target directory on HDFS:
```sh
hdfs dfs -mkdir -p /flume/hdfs_filegroups_source/
```
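A quick check that HDFS is up and the directory exists:

```sh
hdfs dfs -ls /flume/
```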
Write the launch script: under the installation directory, create a `mysbin` directory containing `start_taildir_memory_hdfs.sh`:
```sh
$ cd /usr/apache-flume-1.9.0-bin/
$ mkdir mysbin
$ cd mysbin
$ vi start_taildir_memory_hdfs.sh
```
Script content:
```sh
#!/bin/bash
# Resolve the Flume installation root from the script's own location,
# so the script works no matter where it is invoked from
ROOT_PATH=$(dirname $(dirname $(readlink -f $0)))
cd $ROOT_PATH
bin/flume-ng agent --conf ./conf/ -f myconf/flume-taildir-memory-hdfs.properties -Dflume.root.logger=INFO,console -n hdfs_agent
```
Make the script executable:
```sh
chmod 755 start_taildir_memory_hdfs.sh
```
Run the `start_taildir_memory_hdfs.sh` script in the background:
```sh
nohup ./start_taildir_memory_hdfs.sh &
```
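Once the agent is running, an end-to-end check is to append a record to a watched file and, after a roll completes, list the day's output directory on HDFS. The test file name below is an assumption; any `.log` file matching the `filegroups` pattern works.

```sh
# Feed one test record into the watched directory
echo "test $(date)" >> /home/hadoop/spider/data/collect/test.log
# The date-partitioned layout comes from hdfs.path (%Y-%m-%d)
hdfs dfs -ls /flume/hdfs_filegroups_source/$(date +%Y-%m-%d)/
```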