Tuesday, April 14, 2015

Hive on Spark at CDH 5.3

However, since Hive on Spark is not (yet) officially supported by Cloudera, some manual steps are required to get it working within CDH 5.3. Please note that there are four important requirements in addition to the hands-on work:
  1. Spark Gateway nodes need to be Hive Gateway nodes as well
  2. Whenever the client configurations are redeployed, you need to copy hive-site.xml again
  3. Whenever CDH is upgraded (even for minor patches, which are often applied without notice), you need to adjust the class paths
  4. Hive libraries need to be present on all executors (CM should take care of this automatically)
Log in to your Spark server(s) and copy the deployed hive-site.xml to Spark's configuration directory:

cp /etc/hive/conf/hive-site.xml /etc/spark/conf/
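Since redeploying client configurations overwrites this copy (requirement 2 above), it can help to script the step so it is cheap to re-run. A minimal sketch, assuming the default CDH paths (the `HIVE_SITE` and `SPARK_CONF_DIR` variables are illustrative overrides, not part of CDH):

```shell
# Sketch: re-copy hive-site.xml into Spark's conf dir only when it differs.
# Paths default to a typical CDH layout; override via environment if needed.
sync_hive_site() {
  local hive_site="${HIVE_SITE:-/etc/hive/conf/hive-site.xml}"
  local spark_conf="${SPARK_CONF_DIR:-/etc/spark/conf}"

  if [ ! -f "$hive_site" ]; then
    echo "no hive-site.xml found at $hive_site" >&2
    return 1
  fi
  if cmp -s "$hive_site" "$spark_conf/hive-site.xml" 2>/dev/null; then
    echo "hive-site.xml already up to date"
  else
    # cmp failed (missing or different target), so refresh the copy
    cp "$hive_site" "$spark_conf/hive-site.xml"
    echo "hive-site.xml refreshed in $spark_conf"
  fi
}
```

Running this after every client-configuration redeploy (e.g. from cron or your deployment tooling) keeps the Spark copy current without copying the file unnecessarily.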

Start spark-shell as follows (replace <CDH_VERSION> with your parcel version, e.g. 5.3.2-1.cdh5.3.2.p0.10) and instantiate a Hive context inside the shell:

spark-shell --master yarn-client --driver-class-path "/opt/cloudera/parcels/CDH-<CDH_VERSION>/lib/hive/lib/*" --conf spark.executor.extraClassPath="/opt/cloudera/parcels/CDH-<CDH_VERSION>/lib/hive/lib/*"
..
scala> val hive = new org.apache.spark.sql.hive.HiveContext(sc)
hive: org.apache.spark.sql.hive.HiveContext = org.apache.spark.sql.hive.HiveContext@1c966488

scala> val s1 = hive.sql("SELECT COUNT(*) FROM sample_07").collect()
s1: Array[org.apache.spark.sql.Row] = Array([823])
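The same HiveContext accepts arbitrary HiveQL, so you can keep working in the session. A sketch of a possible continuation (assuming the Hue sample tables such as sample_07 are installed, as in the query above):

```scala
// List the Hive tables visible to the context
scala> hive.sql("SHOW TABLES").collect().foreach(println)

// In Spark 1.2 (CDH 5.3), hive.sql returns a SchemaRDD backed by the
// Hive metastore; collect() materializes it as Array[Row]
scala> val top = hive.sql(
     |   "SELECT description, salary FROM sample_07 ORDER BY salary DESC LIMIT 5")
scala> top.collect().foreach(println)
```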