Spark SQL Deployment Guide

      I have been looking into Spark and Hive lately. Spark SQL uses the Hive SQL parser to translate Hive statements into RDD operations, which makes it easy for existing Hive users to migrate smoothly.

      First, how to build the package:

       1. Download the source from GitHub at https://github.com/apache/spark. I chose 1.0.2 (1.1.0 ran into a Permission denied problem when submitting jobs, so I gave up on it).

      2. Install Maven, Scala, and so on, set up the Hadoop cluster, and set the corresponding home directories.
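For reference, the environment setup I mean is roughly the following; all install paths are illustrative and should be adjusted to your own machines:

# Illustrative paths; adjust to your installation
export JAVA_HOME=/usr/local/jdk1.7.0
export SCALA_HOME=/usr/local/scala-2.10.4
export MAVEN_HOME=/usr/local/apache-maven-3.2.1
export HADOOP_HOME=/usr/local/hadoop-2.4.0
export PATH=$JAVA_HOME/bin:$SCALA_HOME/bin:$MAVEN_HOME/bin:$HADOOP_HOME/bin:$PATH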

      3. Before building with Maven, raise Maven's JVM limits: export MAVEN_OPTS="-Xmx2g -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m"

      4. Build with Maven according to your Hadoop version. For Hadoop 2.4, for example: bash make-distribution.sh --name spark-1.0.2 --tgz -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0. This produces a tgz package, which you then upload to the machine where Spark will be deployed.
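make-distribution.sh leaves the tgz in the top level of the source tree; the exact file name depends on the --name you passed, so the name and target host below are only illustrative:

$ scp spark-1.0.2-bin-spark-1.0.2.tgz user@deploy-host:/usr/local/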

      Next, deployment:

      5. Unpack the tgz, copy the mysql-jdbc jar that Hive needs into lib, and copy a properly configured hive-site.xml into conf.
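A sketch of this step, assuming the distribution was unpacked to /usr/local/spark-1.0.2; the connector version and source paths are illustrative:

$ cd /usr/local/spark-1.0.2
$ cp /path/to/mysql-connector-java-5.1.26.jar lib/
$ cp /usr/local/hive/conf/hive-site.xml conf/

The MySQL driver is needed because hive-site.xml normally points the Hive metastore at MySQL through javax.jdo.option.ConnectionURL and javax.jdo.option.ConnectionDriverName.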

      6. Edit the configuration files, starting with spark-defaults.conf:

# Spark history server address
spark.yarn.historyServer.address 10.39.5.23:18080
spark.eventLog.enabled true
spark.eventLog.dir   hdfs://ns1/user/jiangyu2/logs
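Note that spark.eventLog.dir must already exist on HDFS, otherwise applications fail at startup complaining that the event log directory does not exist; create it up front (the path matches the config above):

$ hdfs dfs -mkdir -p hdfs://ns1/user/jiangyu2/logs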

Then edit spark-env.sh:

export SPARK_JAR=hdfs://ns1/user/jiangyu2/spark-assembly-1.1.0-SNAPSHOT-hadoop2.4.0.jar
export HADOOP_CONF_DIR=/usr/local/hadoop-2.4.0/etc/hadoop
export YARN_CONF_DIR=/usr/local/hadoop-2.4.0/etc/hadoop
export SPARK_YARN_USER_ENV="CLASSPATH=/usr/local/hadoop-2.4.0/etc/hadoop/"
export SPARK_SUBMIT_LIBRARY_PATH=/usr/local/hadoop-2.4.0/lib/native/
export SPARK_CLASSPATH=$SPARK_CLASSPATH
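SPARK_JAR points at the Spark assembly jar on HDFS, so the assembly shipped in the distribution's lib directory has to be uploaded there first; this saves YARN jobs from re-uploading the assembly on every submission. The jar name should match whatever your build actually produced:

$ hdfs dfs -put lib/spark-assembly-*.jar hdfs://ns1/user/jiangyu2/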

7. Start the Spark shell:

$ ./bin/spark-shell --master yarn-client
 
scala> val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
14/09/17 17:57:07 INFO Configuration.deprecation: mapred.input.dir.recursive is deprecated. Instead, use mapreduce.input.fileinputformat.input.dir.recursive
14/09/17 17:57:07 INFO Configuration.deprecation: mapred.max.split.size is deprecated. Instead, use mapreduce.input.fileinputformat.split.maxsize
14/09/17 17:57:07 INFO Configuration.deprecation: mapred.min.split.size is deprecated. Instead, use mapreduce.input.fileinputformat.split.minsize
14/09/17 17:57:07 INFO Configuration.deprecation: mapred.min.split.size.per.rack is deprecated. Instead, use mapreduce.input.fileinputformat.split.minsize.per.rack
14/09/17 17:57:07 INFO Configuration.deprecation: mapred.min.split.size.per.node is deprecated. Instead, use mapreduce.input.fileinputformat.split.minsize.per.node
14/09/17 17:57:07 INFO Configuration.deprecation: mapred.reduce.tasks is deprecated. Instead, use mapreduce.job.reduces
14/09/17 17:57:07 INFO Configuration.deprecation: mapred.reduce.tasks.speculative.execution is deprecated. Instead, use mapreduce.reduce.speculative
hiveContext: org.apache.spark.sql.hive.HiveContext = org.apache.spark.sql.hive.HiveContext@1f3b0168
 
scala> import hiveContext._
import hiveContext._
 
scala> hql("SELECT t1.fans_uid,t3.user_type FROM   (SELECT fans_uid,atten_uid,time FROM   ods_user_fanslist WHERE  dt = '20140909') t1  JOIN (SELECT uid,user_type_id AS user_type,user_status,reg_time FROM   mds_user_info WHERE  dt = '20140909') t3 ON t1.atten_uid = t3.uid limit 10”).collect().foreach(println)

8. Submit the same HQL from a Java program:

package nimei;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.hive.api.java.JavaHiveContext;

public class NiDaYe {
  public static void main(String[] args) {
    SparkConf sparkConf = new SparkConf().setAppName("caocaocao");
    JavaSparkContext ctx = new JavaSparkContext(sparkConf);

    // HiveContext for the Java API; it picks up hive-site.xml from the classpath.
    JavaHiveContext hiveCtx = new JavaHiveContext(ctx);

    // Queries are expressed in HiveQL.
    hiveCtx.hql("SELECT count(fans_uid), user_status FROM " +
        "(SELECT fans_uid, atten_uid, time FROM ods_user_fanslist WHERE dt = '20140909') t1 " +
        "JOIN (SELECT uid, user_type_id AS user_type, user_status, reg_time FROM mds_user_info WHERE dt = '20140909') t3 " +
        "ON t1.atten_uid = t3.uid GROUP BY user_status").collect();

    ctx.stop();
  }
}
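To run it on YARN, package the class into a jar and hand it to spark-submit; the jar name here is just an example:

$ ./bin/spark-submit --class nimei.NiDaYe --master yarn-cluster nidaye.jar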

  Finally, the ApplicationMaster page for the Spark application.