macOS 10.14 Big Data Environment Setup



Package version list (2019-04-17)

Name        Version
zookeeper 3.4.13
hadoop 3.1.2
flume 1.9.0
hbase 1.2.9
hive 3.1.1
kafka 2.1.1
sqoop 1.4.6_1
storm 1.2.2
mysql 8.0.15

JDK installation (1.8) (omitted)

Install ZooKeeper (3.4.13)


Set the environment variables.
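
A minimal sketch of the exports, assuming the Homebrew layout used elsewhere in this post (added to ~/.bash_profile or ~/.profile):

# assumed Homebrew install path; adjust to your layout
export ZOOKEEPER_HOME=/usr/local/Cellar/zookeeper/3.4.13/libexec
export PATH=$PATH:$ZOOKEEPER_HOME/bin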

In the conf directory:

cp zoo_sample.cfg zoo.cfg


Edit zoo.cfg:

dataDir=/usr/local/Cellar/zookeeper/3.4.13/tmp
dataLogDir=/usr/local/Cellar/zookeeper/3.4.13/logs

Create the tmp and logs directories:
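
For example, matching the paths set in zoo.cfg above:

mkdir -p /usr/local/Cellar/zookeeper/3.4.13/tmp
mkdir -p /usr/local/Cellar/zookeeper/3.4.13/logs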

In the bin directory, run zkServer.sh start to start ZooKeeper and zkServer.sh status to check its status.

Hadoop pseudo-distributed installation (3.1.2)

Configure the environment variables: $HADOOP_HOME=… and add ${HADOOP_HOME}/bin to PATH.
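
A minimal sketch of the ~/.profile entries, assuming the Homebrew path used later in hive-env.sh:

# assumed install path (see hive-env.sh below); sbin is added so start-all.sh is on PATH
export HADOOP_HOME=/usr/local/Cellar/hadoop/3.1.2/libexec
export PATH=$PATH:${HADOOP_HOME}/bin:${HADOOP_HOME}/sbin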

source ~/.profile


Configure passwordless SSH login

ssh-keygen
ssh-copy-id localhost

Turn off the firewall.
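
On macOS this can be done from System Preferences → Security & Privacy → Firewall, or (an assumption, not part of the original post) from the command line:

# disable the macOS application firewall
sudo /usr/libexec/ApplicationFirewall/socketfilterfw --setglobalstate off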

Configure hadoop-env.sh:

export JAVA_HOME="/Library/Java/JavaVirtualMachines/jdk1.8.0_201.jdk/Contents/Home"

Configure yarn-env.sh:

export JAVA_HOME="/Library/Java/JavaVirtualMachines/jdk1.8.0_201.jdk/Contents/Home"


Configure core-site.xml:

<configuration>
  <!-- NameNode (HDFS master) RPC address -->
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
  <!-- Directory for files Hadoop generates at runtime -->
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/usr/local/Cellar/hadoop/hdfs/tmp/hadoop-${user.name}</value>
  </property>

</configuration>

Configure hdfs-site.xml:

<!-- Number of HDFS replicas -->
<property>
        <name>dfs.replication</name>
        <value>1</value>
</property>

Configure yarn-site.xml:

<property> 
    <name>yarn.nodemanager.aux-services</name> 
    <value>mapreduce_shuffle</value> 
</property> 
<property> 
    <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name> 
    <value>org.apache.hadoop.mapred.ShuffleHandler</value> 
</property> 
<property> 
    <name>yarn.resourcemanager.address</name> 
    <value>127.0.0.1:8032</value> 
</property> 
<property> 
    <name>yarn.resourcemanager.scheduler.address</name> 
    <value>127.0.0.1:8030</value> 
</property> 
<property> 
    <name>yarn.resourcemanager.resource-tracker.address</name> 
    <value>127.0.0.1:8031</value> 
</property>

Configure mapred-site.xml:

<property> 
    <!-- Run MapReduce on YARN -->
    <name>mapreduce.framework.name</name> 
    <value>yarn</value> 
</property>


Run hadoop version to verify the installation.

Run the following in the bin directory; if the output contains "… has been successfully formatted", the format succeeded:

hdfs namenode -format


Start the daemons from the sbin directory:

start-all.sh

Run jps and check for five processes: SecondaryNameNode, DataNode, NodeManager, ResourceManager, and NameNode. If all are present, the startup succeeded.

Visit http://localhost:9870 (older Hadoop 2.x uses port 50070).
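
An optional command-line smoke test of HDFS (the path here is an arbitrary example):

hdfs dfs -mkdir -p /tmp/smoketest
hdfs dfs -ls /tmp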

Install MySQL (8.0.15)

sudo apt-get install mysql-server

sudo apt-get install mysql-client

sudo apt-get install libmysqlclient-dev
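
The apt-get commands above are for Linux. On macOS 10.14 the rest of this post assumes a Homebrew install, roughly:

brew install mysql            # installs MySQL 8.0.x
brew services start mysql     # or: mysql.server start
mysql_secure_installation     # optional: set the root password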

Check that the installation succeeded:

Linux:

sudo netstat -tap | grep mysql

macOS:

lsof -i :<port>

# view everything mysql has open / all listening ports
sudo lsof | grep mysql


Log in to verify:

mysql -u root -p


Start / stop / restart MySQL

service mysql start

service mysql stop

service mysql restart

These commands simply have the service utility find the corresponding mysql script under /etc/init.d and run its start/stop actions.
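
On macOS with a Homebrew install there is no /etc/init.d script; the equivalents (also used in the troubleshooting section at the end of this post) are:

mysql.server start
mysql.server stop
mysql.server restart
# or manage it as a Homebrew service
brew services restart mysql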

Install Hive (3.1.1)


Set the environment variables.

In the conf directory:

cp hive-env.sh.template hive-env.sh
cp hive-default.xml.template hive-site.xml
cp hive-log4j2.properties.template hive-log4j2.properties
cp hive-exec-log4j2.properties.template hive-exec-log4j2.properties

Create three directories in HDFS to hold Hive data, and grant them 777 permissions.

Note: the HDFS service must be running, otherwise these commands will fail.

The directories created below can all be found at http://localhost:9870 under Utilities → Browse the file system.

hdfs dfs -mkdir -p /user/hive/warehouse
hdfs dfs -mkdir -p /user/hive/tmp
hdfs dfs -mkdir -p /user/hive/log
hdfs dfs -chmod -R 777 /user/hive/warehouse
hdfs dfs -chmod -R 777 /user/hive/tmp
hdfs dfs -chmod -R 777 /user/hive/log

Edit hive-env.sh:

export JAVA_HOME=/Library/Java/JavaVirtualMachines/jdk1.8.0_201.jdk/Contents/Home
export HADOOP_HOME=/usr/local/Cellar/hadoop/3.1.2/libexec
export HIVE_HOME=/usr/local/Cellar/hive/3.1.1/libexec
export HIVE_CONF_DIR=/usr/local/Cellar/hive/3.1.1/libexec/conf

Edit hive-site.xml (the paths below match the HDFS directories created above):

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration> 
    <property>
        <name>hive.exec.scratchdir</name>
        <value>/user/hive/tmp</value>
        <description>HDFS root scratch dir for Hive jobs which gets created with write all (733) permission. For each connecting user, an HDFS scratch dir: ${hive.exec.scratchdir}/&lt;username&gt; is created, with ${hive.scratch.dir.permission}.</description>
    </property>

    <property>
        <name>hive.metastore.warehouse.dir</name>
        <value>/user/hive/warehouse</value>
        <description>location of default database for the warehouse</description>                  
    </property>

    <property>
        <name>hive.querylog.location</name>
        <value>/user/hive/log</value>
        <description>Location of Hive run time structured log file</description>
    </property>

    <property>
        <name>javax.jdo.option.ConnectionURL</name>
        <value>jdbc:mysql://localhost:3306/hive?createDatabaseIfNotExist=true&amp;characterEncoding=UTF-8&amp;useSSL=false</value>
        <description>
      JDBC connect string for a JDBC metastore.
      To use SSL to encrypt/authenticate the connection, provide database-specific SSL flag in the connection URL.
      For example, jdbc:postgresql://myhost/db?ssl=true for postgres database.</description>
    </property>

  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
  </property>

  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>root</value>
  </property>

  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>root</value>
  </property>
</configuration>

Create a tmp directory under the Hive directory.

Put mysql-connector-java-5.1.46-bin.jar into the lib directory.
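
For example, with the Homebrew layout (the download location of the connector jar is a placeholder):

mkdir -p /usr/local/Cellar/hive/3.1.1/libexec/tmp
cp ~/Downloads/mysql-connector-java-5.1.46-bin.jar /usr/local/Cellar/hive/3.1.1/libexec/lib/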

Initialize Hive; for Hive 2.0 and later the initialization command is:

schematool -dbType mysql -initSchema


Note: if MySQL was already installed some other way, it is strongly recommended to uninstall it and reinstall with brew; otherwise you may see errors such as:

failed to get schema version or access denied for user

If you hit these errors, the permission-tweaking fixes suggested online not only tend not to help, they can break MySQL outright, leaving a reinstall as the only way out.


Alternatively,

run the following in the mysql client:

# create the database
mysql> create database hive;
# grant access (MySQL 8.0 no longer accepts GRANT ... IDENTIFIED BY; the root@localhost account already exists)
mysql> grant all privileges on hive.* to 'root'@'localhost' with grant option;
mysql> flush privileges;

Run hive from the bin directory to start it.
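
A quick sanity check once the Hive CLI starts (the parkdb database created here is reused in the Sqoop examples below):

hive> show databases;
hive> create database parkdb;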

Install Flume (1.9.0)

Configure the environment variables.

A simple example: in the conf directory, run cp flume-conf.properties.template flume-conf.properties

Edit flume-conf.properties, delete all of its contents, and add the following:

a1.sources = r1
a1.sinks = k1
a1.channels = c1

a1.sources.r1.type = avro
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 44444

a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://localhost:9000/test
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.rollInterval = 0
a1.sinks.k1.hdfs.rollSize = 0
a1.sinks.k1.hdfs.rollCount = 1000

a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

Start Flume:

flume-ng agent --conf ../conf --conf-file /usr/local/Cellar/flume/1.9.0/libexec/conf/flume-conf.properties --name a1 -Dflume.root.logger=INFO,console

Create a test.txt file with some content to use for testing.
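
For example (place the file so that the ../../test.txt path used below resolves to it):

echo "hello flume" > test.txt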

Test:

./flume-ng avro-client --conf /usr/local/Cellar/flume/1.9.0/libexec/conf --host 0.0.0.0 --port 44444 --filename ../../test.txt
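
If the agent is working, the events land under the HDFS sink path configured above:

hdfs dfs -ls /test
hdfs dfs -cat '/test/*'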

Check the Flume process:

Linux:

ps -aux | grep flume

macOS:

sudo lsof | grep flume


Configuration for tailing a file and writing the data to both HDFS and Kafka

a1.sources=r1
a1.sinks=fs kfk
a1.channels=c1 c2

a1.sources.r1.type=exec
a1.sources.r1.command=tail -F /home/tellhow-iot2/doc/test.log
a1.sources.r1.interceptors=i1
a1.sources.r1.interceptors.i1.type=org.apache.flume.interceptor.TimestampInterceptor$Builder
a1.sources.r1.selector.type=replicating 

a1.sinks.kfk.type=org.apache.flume.sink.kafka.KafkaSink
a1.sinks.kfk.topic=test
a1.sinks.kfk.brokerList=localhost:9092
# acknowledgement level for delivered data: 0, 1, or -1
a1.sinks.kfk.requiredAcks=1
#a1.sinks.kfk.batchSize = 2
a1.sinks.kfk.serializer.class=kafka.serializer.StringEncoder

a1.sinks.fs.type=hdfs
#%y-%m-%d/%H%M/
a1.sinks.fs.hdfs.path=hdfs://localhost:9000/source/%y-%m-%d/
# file name prefix
#a1.sinks.fs.hdfs.filePrefix = events-
# file name suffix
a1.sinks.fs.hdfs.fileSuffix=.log
# temporary file name prefix: inUsePrefix; temporary file name suffix: inUseSuffix
# switch to a new directory every 10 minutes, e.g. 2018-11-20/1010  2018-11-20/1020  2018-11-20/1030
#a1.sinks.fs.hdfs.round = true
#a1.sinks.fs.hdfs.roundValue = 10
#a1.sinks.fs.hdfs.roundUnit = minute
# compression codecs: gzip, bzip2, lzo, lzop, snappy
#a1.sinks.fs.hdfs.codeC = gzip
# time: roll a new file every N seconds; 0 disables time-based rolling

a1.sinks.fs.hdfs.rollInterval=0
# size: roll when the file reaches this many bytes (default 128M); 0 disables size-based rolling
a1.sinks.fs.hdfs.rollSize=0
# roll after this many events have been written; 0 disables count-based rolling
a1.sinks.fs.hdfs.rollCount=0

# flush to HDFS every 5 events
#a1.sinks.fs.hdfs.batchSize = 5
# use local time when formatting the path
a1.sinks.fs.hdfs.useLocalTimeStamp=false
# output file type: the default is SequenceFile; DataStream writes plain text
a1.sinks.fs.hdfs.fileType=DataStream
# maximum number of open HDFS files; when exceeded, the oldest open file is closed
a1.sinks.fs.hdfs.maxOpenFiles=5000
# minimum HDFS replicas per block; this affects rolling, and is usually set to 1 so files roll as configured
#a1.sinks.fs.hdfs.minBlockReplicas = 1
# callTimeout: default 10000, timeout for HDFS operations (ms)
# threadsPoolSize: default 10, number of threads the HDFS sink uses for HDFS operations
# rollTimerPoolSize: default 1, number of threads the HDFS sink uses for time-based rolling

a1.channels.c1.type=memory
a1.channels.c1.capacity=1000
a1.channels.c1.transactionCapacity=100

a1.channels.c2.type=memory
a1.channels.c2.capacity=1000
a1.channels.c2.transactionCapacity=100

a1.sources.r1.channels=c1 c2
a1.sinks.fs.channel=c1
a1.sinks.kfk.channel=c2
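
Assuming this configuration is saved as bridge03.properties (the file name used in the start-up list at the end of this post), and that HDFS is running and the Kafka topic test exists, the agent is started the same way:

flume-ng agent --conf /usr/local/Cellar/flume/1.9.0/libexec/conf --conf-file /usr/local/Cellar/flume/1.9.0/libexec/conf/bridge03.properties --name a1 -Dflume.root.logger=INFO,console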


Install Kafka (2.11-2.1.1)

Create a log directory under the Kafka directory:

mkdir logs


Edit config/server.properties with vim, changing lines 21, 31, 36, and 60 of the file:

broker.id=1
listeners=PLAINTEXT://localhost:9092
advertised.listeners=PLAINTEXT://localhost:9092
log.dirs=/usr/local/Cellar/kafka/2.1.1/libexec/logs

With ZooKeeper running, start Kafka to verify:

bin/kafka-server-start.sh config/server.properties

Create a topic:

bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic test

List topics:

bin/kafka-topics.sh --list --zookeeper localhost:2181

Producer:

./kafka-console-producer.sh --broker-list localhost:9092 --topic test


Consumer:

./kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic test --from-beginning


Inspect the topic:

bin/kafka-topics.sh --describe --zookeeper localhost:2181 --topic test

The first line gives a summary of all partitions; each subsequent line gives information about one partition. Since there is only one partition here, there is only one such line.

"Leader": the node responsible for all reads and writes for the given partition. Each node becomes the leader for a randomly selected share of the partitions.

"Replicas": the list of nodes that replicate this partition's log, whether or not they are the leader and whether or not they are currently alive.

"Isr": the set of "in-sync" replicas, i.e. the subset of the replica list that is currently alive and caught up with the leader.

Start Kafka:

bin/kafka-server-start.sh config/server.properties &

Install Storm (1.2.2) (requires JDK and ZooKeeper)

Edit storm.yaml:

storm.zookeeper.servers:
    - "127.0.0.1"
storm.zookeeper.port: 2181
supervisor.slots.ports:
    - 6700
    - 6701
    - 6702
    - 6703
    - 6704
storm.local.dir: "/usr/local/Cellar/storm/1.2.2/data"
nimbus.seeds: ["127.0.0.1"]


Start Storm:

Start nimbus:

./storm nimbus >> /usr/local/Cellar/storm/1.2.2/logs/nimbus.out 2>&1 &
tail -f /usr/local/Cellar/storm/1.2.2/logs/nimbus.log

Start the UI:

./storm ui>> /usr/local/Cellar/storm/1.2.2/logs/ui.out 2>&1 &
tail -f /usr/local/Cellar/storm/1.2.2/logs/ui.log

Start the supervisor:

./storm supervisor >> /usr/local/Cellar/storm/1.2.2/logs/supervisor.out 2>&1 &
tail -f /usr/local/Cellar/storm/1.2.2/logs/supervisor.log

Start the logviewer:

./storm logviewer>> /usr/local/Cellar/storm/1.2.2/logs/logviewer.out 2>&1 &
tail -f /usr/local/Cellar/storm/1.2.2/logs/logviewer.log

Verify: open the web UI in a browser at http://localhost:8080.


* Once the project jar has been built for a real run, submit the topology:

./bin/storm jar /usr/local/Cellar/storm/1.2.2/libexec/examples/storm-starter/storm-starter-topologies-0.9.5.jar storm.starter.WordCountTopology wordcount


Install HBase (1.2.9)

Create the following directories under /usr/local/Cellar/hbase/1.2.9:

/hadoop/pids

/hbasetmp

/zookeepertmp


Edit hbase-env.sh under conf:

export JAVA_HOME="/Library/Java/JavaVirtualMachines/jdk1.8.0_201.jdk/Contents/Home"

export HBASE_PID_DIR="/usr/local/Cellar/hbase/1.2.9/hadoop/pids"

Edit hbase-site.xml under conf:

<configuration>
    <property>
            <name>hbase.rootdir</name>
            <value>/usr/local/Cellar/hbase/1.2.9/hbasetmp</value>
    </property>
    <property>
            <name>hbase.zookeeper.property.dataDir</name>
            <value>/usr/local/Cellar/hbase/1.2.9/zookeepertmp</value>
    </property>
    <property>
            <name>hbase.cluster.distributed</name>
            <value>true</value>
      </property>
</configuration>


Set the environment variables.

Start:

start-hbase.sh

hbase shell
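
A minimal sanity check inside the shell (table and column family names here are arbitrary):

create 'test1', 'cf'
put 'test1', 'row1', 'cf:a', 'value1'
scan 'test1'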

Install Sqoop (1.4.6_1)

Set the environment variables.

In the conf directory: cp sqoop-env-template.sh sqoop-env.sh

Edit the following:

export HADOOP_COMMON_HOME=/usr/local/Cellar/hadoop/3.1.2
export HADOOP_MAPRED_HOME=/usr/local/Cellar/hadoop/3.1.2
export HIVE_HOME=/usr/local/Cellar/hive/3.1.1
export HBASE_HOME=/usr/local/Cellar/hbase/1.2.9

Put mysql-connector-java-5.1.41.jar into the lib directory.

Copy sqoop-1.4.4.jar into Hadoop's share/hadoop/mapreduce/lib directory.

Sqoop commands:

Create the target table in MySQL first (note the engine/charset suffix):

create table stu2 (`id` varchar(20),`name` varchar(20)) ENGINE=InnoDB DEFAULT CHARSET=utf8;


# Export Hive data to MySQL with Sqoop

sqoop export --connect jdbc:mysql://localhost:3306/test --username root --password root --table stu1 --export-dir '/user/hive/warehouse/parkdb.db/stu1' --fields-terminated-by '\t';

# Import from MySQL into Hive

sqoop import --connect jdbc:mysql://localhost:3306/test --username root --password root --table kafka --hive-import --create-hive-table --hive-table parkdb.stu2 -m 1
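
To verify the import, query the new table from the Hive CLI:

hive> select * from parkdb.stu2 limit 10;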

Individual start-up commands:

zookeeper/3.4.13/libexec/bin/zkServer.sh start

hadoop/3.1.2/libexec/sbin/start-all.sh

kafka/2.1.1/libexec/bin/kafka-server-start.sh kafka/2.1.1/libexec/config/server.properties &

kafka/2.1.1/libexec/bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic test --from-beginning

flume-ng agent --conf /usr/local/Cellar/flume/1.9.0/libexec/conf --conf-file /usr/local/Cellar/flume/1.9.0/libexec/conf/bridge03.properties --name a1 -Dflume.root.logger=INFO,console

Delete a Kafka topic:

kafka/2.1.1/bin/kafka-topics.sh --delete --zookeeper localhost:2181 --topic test


A shell script that appends a line to the file every second:

#!/bin/bash
# generate a pseudo-random number between $1 and $2
function rand(){
    min=$1
    max=$(($2-$min+1))
    num=$(date +%s%N)
    echo $(($num%$max+$min))
}
# append one line per second to the log file tailed by the Flume exec source
for i in {1..100};
do
  random=$(rand 1 100)
  echo "$random $random $random $random $random 2018-11-$i" >> /home/tellhow-iot2/doc/test.log;
  sleep 1;
done;
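
Hypothetical usage, assuming the script is saved as gen_log.sh on the machine whose log is being tailed:

chmod +x gen_log.sh
./gen_log.sh &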

Troubleshooting installation errors:

1. ERROR 2002 (HY000): Can't connect to local MySQL server through socket '/tmp/mysql.sock' (2)

First check /usr/etc/my.cnf; you can optionally add:

[client]
port = 3306
socket = /tmp/mysql.sock
default-character-set = utf8

[mysqld]
collation-server = utf8_unicode_ci
character-set-server = utf8
init-connect ='SET NAMES utf8'
max_allowed_packet = 64M
bind-address = 127.0.0.1
port = 3306
socket = /tmp/mysql.sock
innodb_file_per_table=1

[mysqld_safe]
timezone = '+0:00'

Then edit .bash_profile:

export PATH=$PATH:/usr/local/Cellar/mysql/8.0.15/bin


Next, go to /usr/local/var and set the permissions as follows (the key step):

sudo chmod -R 777 mysql

Finally, start MySQL from /usr/local/Cellar/mysql/8.0.15/bin:

mysql.server start