Tuesday, December 13, 2016

Trying the SPARK-14927 workaround with Scala/PySpark

Using the HDP 2.5.0 Sandbox.

First, check whether it works in Scala.

1) Create a test table (from spark-shell / Scala)
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
//sqlContext.setConf("hive.exec.dynamic.partition.mode","nonstrict")
sqlContext.sql("set hive.exec.dynamic.partition.mode=nonstrict")
import sqlContext.implicits._
sqlContext.sql("create external table if not exists default.partitiontest1(val string) partitioned by (year int)")
Without "nonstrict" you get: SparkException: Dynamic partition strict mode requires at least one static partition column.

Verify from Hive:
hive> show create table partitiontest1;
OK
CREATE EXTERNAL TABLE `partitiontest1`(
  `val` string)
PARTITIONED BY (
  `year` int)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
STORED AS INPUTFORMAT
  'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  'hdfs://sandbox.hortonworks.com:8020/apps/hive/warehouse/partitiontest1'
TBLPROPERTIES (
  'transient_lastDdlTime'='1481536557')
Time taken: 1.071 seconds, Fetched: 14 row(s)
hive>

2) Insert data
import org.apache.spark.sql.SaveMode
Seq(2012 -> "a", 2013 -> "b", 2014 -> "c").toDF("year", "val").write.partitionBy("year") .mode(SaveMode.Append).saveAsTable("default.partitiontest1")

Verify from (Spark and) Hive:
//sqlContext.sql("show partitions default.partitiontest1").show
hive> select * from partitiontest1;
OK
a 2012
b 2013
c 2014
Time taken: 6.325 seconds, Fetched: 3 row(s)
hive> show partitions partitiontest1;
OK
year=2012
year=2013
year=2014
Time taken: 0.706 seconds, Fetched: 3 row(s)
hive>
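
As an extra check, the partition directories should also exist on HDFS (a quick sketch, using the warehouse path from the LOCATION shown above):
hdfs dfs -ls /apps/hive/warehouse/partitiontest1
# should list the year=2012, year=2013 and year=2014 subdirectories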

PySpark version:

from pyspark.sql import HiveContext
from pyspark.sql import Row
sqlContext = HiveContext(sc)
sqlContext.sql("set hive.exec.dynamic.partition.mode=nonstrict")
...
sqlContext.sql("create external table if not exists default.partitiontest2(val string) partitioned by (year int)")
...
# Using RDD of dict to inferSchema is deprecated. Use pyspark.sql.Row instead
# TypeError: schema should be StructType or list or None
#sc.parallelize([{"2012":"a", "2013":"b","2014":"c"}]).toDF("year", "val").write.partitionBy("year") .mode("append").saveAsTable("default.partitiontest2")
#sqlContext.createDataFrame([{"2012":"a", "2013":"b","2014":"c"}])

http://nadbordrozd.github.io/blog/2016/05/22/one-weird-trick-that-will-fix-your-pyspark-schemas/

from pyspark.sql.types import StringType, StructField, StructType, BooleanType, ArrayType, IntegerType
schema = StructType([StructField("year", IntegerType(), True), StructField("val", StringType(), True)])
# value order matters (e.g. year needs to come first) and no field labels are used
record = [Row(2012, 'a'), Row(2013, 'b'), Row(2014, 'c')]
#sc.parallelize(record).toDF(schema).collect()
sc.parallelize(record).toDF(schema).write.partitionBy("year") .mode("append").saveAsTable("default.partitiontest2")

sqlContext.sql("show partitions default.partitiontest2").show()


Try updating the statistics:
sqlContext.sql("analyze table default.partitiontest1 compute statistics noscan")

# Table is partitioned and partition specification is needed
#sqlContext.sql("analyze table default.partitiontest1 compute statistics")

# ERROR ExecDriver: yarn
# java.lang.LinkageError: ClassCastException: attempting to castjar:file:/usr/hdp/2.5.0.0-1245/spark/lib/spark-assembly-1.6.2.2.5.0.0-1245-hadoop2.7.3.2.5.0.0-1245.jar!/javax/ws/rs/ext/RuntimeDelegate.classtojar:file:/usr/hdp/2.5.0.0-1245/spark/lib/spark-assembly-1.6.2.2.5.0.0-1245-hadoop2.7.3.2.5.0.0-1245.jar!/javax/ws/rs/ext/RuntimeDelegate.class
#sqlContext.sql("analyze table default.partitiontest1 partition(year) compute statistics")

pyspark.sql.utils.AnalysisException: u"missing KW_STATISTICS at 'for' near '<EOF>'; line 1 pos 61"
#sqlContext.sql("analyze table default.partitiontest1 partition(year) compute for columns")

# Don't see any stats change
# sqlContext.sql("MSCK REPAIR TABLE default.partitiontest2")


Misc.:
sqlContext.sql("select * from default.partitiontest1").collect()
...
[Row(val=u'a', year=2012), Row(val=u'b', year=2013), Row(val=u'c', year=2014)]

>>> sqlContext.sql("select year, val from default.partitiontest1").printSchema()
16/12/13 05:53:41 INFO ParseDriver: Parsing command: select year, val from default.partitiontest1
16/12/13 05:53:41 INFO ParseDriver: Parse Completed
root
 |-- year: integer (nullable = true)
 |-- val: string (nullable = true)

sqlContext.sql("set hive.stats.autogather").show()
...
+--------------------+-----+
|                 key|value|
+--------------------+-----+
|hive.stats.autoga...| true|
+--------------------+-----+

Checking for differences between HDFS JournalNodes

Occasionally an edits file gets corrupted and a JournalNode fails to start; this is a memo for checking quickly.

I often forget the trailing / (slash).

[root@node1 ~]# rsync -vncr --delete /hadoop/hdfs/journal/nnha/current/ root@node2.localdomain:/hadoop/hdfs/journal/nnha/current/
sending incremental file list
deleting edits_0000000000000847636-0000000000000847636
deleting edits_0000000000000847631-0000000000000847635
deleting edits_0000000000000847626-0000000000000847630
edits_inprogress_0000000000000847626

sent 525821 bytes  received 18 bytes  116853.11 bytes/sec
total size is 17011624  speedup is 32.35 (DRY RUN)
[root@node1 ~]#

What if rsync is not available?
http://stackoverflow.com/questions/20969124/how-to-diff-directories-over-ssh
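
For example, a minimal sketch that compares per-file checksums over ssh instead (assuming md5sum exists on both nodes; nothing is copied or deleted):
find /hadoop/hdfs/journal/nnha/current/ -type f -exec md5sum {} + | sort -k2 > /tmp/jn_node1.md5
ssh root@node2.localdomain "find /hadoop/hdfs/journal/nnha/current/ -type f -exec md5sum {} + | sort -k2" > /tmp/jn_node2.md5
diff /tmp/jn_node1.md5 /tmp/jn_node2.md5
# note: the edits_inprogress_* files will usually differ, as in the rsync dry-run above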

Monday, December 12, 2016

Quickly finding which HDP HUE files have changed from the version originally installed via yum/rpm

[root@sandbox hue]# for h in `rpm -qa hue*`; do echo "# Checking $h";rpm -V $h | grep -P '^..5|^missing'; done
# Checking hue-pig-2.6.1.2.5.0.0-1245.el6.x86_64
# Checking hue-common-2.6.1.2.5.0.0-1245.el6.x86_64
S.5....T.  c /etc/hue/conf.empty/hue.ini
S.5....T.  c /etc/hue/conf.empty/log.conf
S.5....T.    /usr/lib/hue/app.reg
S.5....T.    /usr/lib/hue/build/env/lib/python2.6/site-packages/hue.pth
S.5....T.    /var/lib/hue/desktop.db
# Checking hue-hcatalog-2.6.1.2.5.0.0-1245.el6.x86_64
# Checking hue-oozie-2.6.1.2.5.0.0-1245.el6.x86_64
# Checking hue-2.6.1.2.5.0.0-1245.el6.x86_64
# Checking hue-beeswax-2.6.1.2.5.0.0-1245.el6.x86_64
# Checking hue-server-2.6.1.2.5.0.0-1245.el6.x86_64
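
To see exactly what changed in one of those files, one option is to pull the packaged copy back out of the original RPM and diff it: a sketch, assuming yum-utils is installed and the same hue-common RPM is still downloadable:
cd /tmp
yumdownloader hue-common
rpm2cpio hue-common-*.rpm | cpio -idmv ./etc/hue/conf.empty/hue.ini
diff ./etc/hue/conf.empty/hue.ini /etc/hue/conf.empty/hue.ini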

Tuesday, November 29, 2016

Debugging Sqoop with JDB

In the example below, I am trying to check with JDB what isOraOopEnabled returns.

1) vim /usr/hdp/current/hadoop-client/bin/hadoop.distro

2) Find the following line:
    exec "$JAVA" $JAVA_HEAP_MAX $HADOOP_OPTS $CLASS "$@"

3) Change it like this:
    if [ -n "$HADOOP_JDB" ]; then
      echo "export CLASSPATH=$CLASSPATH"
      echo "${JAVA_HOME}/bin/jdb" $JAVA_HEAP_MAX $HADOOP_OPTS $CLASS "$@"
    else
      exec "$JAVA" $JAVA_HEAP_MAX $HADOOP_OPTS $CLASS "$@"
    fi


I use echo instead of exec because exec did not work well for me...

4) Run it
[sqoop@node4 ~]$ HADOOP_JDB="Y" sqoop import > jdb_sqoop_import.sh

5) Open "jdb_sqoop_import.sh" and delete the unneeded lines ("Warning: /usr/hdp/...", "Please set $ACCUMULO_HOME .." and so on).
Also, append "$@" to the end of the last line.
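
Something like the following may save a bit of manual editing (a sketch only; the exact warning text differs between versions, so review the result):
sed -i -e '/^Warning:/d' -e '/ACCUMULO_HOME/d' -e '$ s/$/ "$@"/' ./jdb_sqoop_import.sh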

6) Run it.

bash ./jdb_sqoop_import.sh --direct --verbose --connect jdbc:oracle:thin:@192.168.8.22:1521/XE --username ambari --password bigdata --query 'SELECT * FROM ambari.hosts WHERE $CONDITIONS' --num-mappers 2 --split-by 'ORA_HASH(ROWID)' --target-dir ambari.hosts

7) JDB should start, so run "help" and so on

8) Set a breakpoint on isOraOopEnabled, then run

> stop in org.apache.sqoop.manager.oracle.OraOopManagerFactory.isOraOopEnabled
> run


> stop at org.apache.sqoop.manager.oracle.OraOopManagerFactory:101
> run # or cont
> step
> eval OraOopUtilities.getMinNumberOfImportMappersAcceptedByOraOop(sqoopOptions.getConf())


9) It should stop at isOraOopEnabled.
After that, make good use of step, next, locals, where, print and eval (see JDB usage).

Accessing WebHDFS with curl while Kerberos is ON, and looking at the DEBUG log


[hdfs@node3 hdfs]$ export HADOOP_OPTS="$HADOOP_OPTS -Dsun.security.krb5.debug=true -Djava.security.debug=gssloginconfig,configfile,configparser,logincontext"
[hdfs@node3 hdfs]$ kill `cat /var/run/hadoop/hdfs/hadoop-hdfs-namenode.pid`; sleep 3; /usr/hdp/current/hadoop-client/sbin/hadoop-daemon.sh --config /usr/hdp/current/hadoop-client/conf start namenode
[hdfs@node3 hdfs]$ tail -f hadoop-hdfs-namenode-node3.localdomain.out

From a different node:
[hajime@node1 ~]$ curl -sS -L -v -w '%{http_code}' -X GET --negotiate -u : 'http://node3.localdomain:50070/webhdfs/v1/tmp?op=GETFILESTATUS&user.name=incorrect_user'

Back on node3 (the NameNode):
Found KeyTab /etc/security/keytabs/spnego.service.keytab for HTTP/node3.localdomain@HO-UBU02
Found KeyTab /etc/security/keytabs/spnego.service.keytab for HTTP/node3.localdomain@HO-UBU02
Entered Krb5Context.acceptSecContext with state=STATE_NEW
>>> KeyTabInputStream, readName(): HO-UBU02
>>> KeyTabInputStream, readName(): HTTP
>>> KeyTabInputStream, readName(): node3.localdomain
>>> KeyTab: load() entry length: 66; type: 17
>>> KeyTabInputStream, readName(): HO-UBU02
>>> KeyTabInputStream, readName(): HTTP
>>> KeyTabInputStream, readName(): node3.localdomain
>>> KeyTab: load() entry length: 66; type: 23
>>> KeyTabInputStream, readName(): HO-UBU02
>>> KeyTabInputStream, readName(): HTTP
>>> KeyTabInputStream, readName(): node3.localdomain
>>> KeyTab: load() entry length: 58; type: 3
>>> KeyTabInputStream, readName(): HO-UBU02
>>> KeyTabInputStream, readName(): HTTP
>>> KeyTabInputStream, readName(): node3.localdomain
>>> KeyTab: load() entry length: 82; type: 18
>>> KeyTabInputStream, readName(): HO-UBU02
>>> KeyTabInputStream, readName(): HTTP
>>> KeyTabInputStream, readName(): node3.localdomain
>>> KeyTab: load() entry length: 74; type: 16
Looking for keys for: HTTP/node3.localdomain@HO-UBU02
Added key: 16version: 1
Added key: 18version: 1
Found unsupported keytype (3) for HTTP/node3.localdomain@HO-UBU02
Added key: 23version: 1
Added key: 17version: 1
>>> EType: sun.security.krb5.internal.crypto.Aes256CtsHmacSha1EType
Using builtin default etypes for permitted_enctypes
default etypes for permitted_enctypes: 18 17 16 23.
>>> EType: sun.security.krb5.internal.crypto.Aes256CtsHmacSha1EType
MemoryCache: add 1480393332/553854/1D11869D8DDC6C3FDAE645FD45DEA27B/hajime@HO-UBU02 to hajime@HO-UBU02|HTTP/node3.localdomain@HO-UBU02
>>> KrbApReq: authenticate succeed.
Krb5Context setting peerSeqNumber to: 1055634594
Krb5Context setting mySeqNumber to: 1055634594
Nov 29, 2016 4:22:11 AM com.sun.jersey.api.core.PackagesResourceConfig init
INFO: Scanning for root resource and provider classes in the packages:
  org.apache.hadoop.hdfs.server.namenode.web.resources
  org.apache.hadoop.hdfs.web.resources
Found ticket for nn/node3.localdomain@HO-UBU02 to go to krbtgt/HO-UBU02@HO-UBU02 expiring on Tue Nov 29 14:18:04 UTC 2016
Entered Krb5Context.initSecContext with state=STATE_NEW
Found ticket for nn/node3.localdomain@HO-UBU02 to go to krbtgt/HO-UBU02@HO-UBU02 expiring on Tue Nov 29 14:18:04 UTC 2016
Found ticket for nn/node3.localdomain@HO-UBU02 to go to jn/node2.localdomain@HO-UBU02 expiring on Tue Nov 29 14:18:04 UTC 2016
Found ticket for nn/node3.localdomain@HO-UBU02 to go to jn/node3.localdomain@HO-UBU02 expiring on Tue Nov 29 14:18:04 UTC 2016
Found ticket for nn/node3.localdomain@HO-UBU02 to go to jn/node1.localdomain@HO-UBU02 expiring on Tue Nov 29 14:18:04 UTC 2016
Found ticket for nn/node3.localdomain@HO-UBU02 to go to nn/node2.localdomain@HO-UBU02 expiring on Tue Nov 29 14:18:04 UTC 2016
Found service ticket in the subjectTicket (hex) =
0000: 61 82 01 5E 30 82 01 5A   A0 03 02 01 05 A1 0A 1B  a..^0..Z........
...
0160: C7 7D                                              ..

Client Principal = nn/node3.localdomain@HO-UBU02
Server Principal = nn/node2.localdomain@HO-UBU02
Session Key = EncryptionKey: keyType=18 keyBytes (hex dump)=
0000: B3 F2 F3 5D 03 A2 01 B6   E7 D8 B2 87 82 FC 2B 6A  ...]..........+j
0010: A8 FD 37 68 E7 EC 74 68   22 D6 AD 63 C3 F5 06 E0  ..7h..th"..c....


Forwardable Ticket true
Forwarded Ticket false
Proxiable Ticket false
Proxy Ticket false
Postdated Ticket false
Renewable Ticket false
Initial Ticket false
Auth Time = Tue Nov 29 04:18:04 UTC 2016
Start Time = Tue Nov 29 04:20:11 UTC 2016
End Time = Tue Nov 29 14:18:04 UTC 2016
Renew Till = null
Client Addresses  Null
...

Running Ambari HostCleanup.py

[root@node5 ~]# python /usr/lib/python2.6/site-packages/ambari_agent/HostCleanup.py --silent --skip=users --verbose
INFO:HostCleanup:
Killing pid's: ['']
INFO:HostCleanup:Deleting packages: ['']
INFO:HostCleanup:
Deleting directories: ['/etc/hadoop', '/etc/ambari-metrics-monitor', '/var/run/hadoop', '/var/run/ambari-metrics-monitor', '/var/log/hadoop', '/var/log/ambari-metrics-monitor', '/usr/lib/flume', '/usr/lib/storm', '/tmp/hadoop-hdfs']
INFO:HostCleanup:
Deleting additional directories: ['/etc/hadoop', '/etc/ambari-metrics-monitor', '/var/run/hadoop', '/var/run/ambari-metrics-monitor', '/var/log/hadoop', '/var/log/ambari-metrics-monitor', '/usr/lib/flume', '/usr/lib/storm', '/tmp/hadoop-hdfs']
INFO:HostCleanup:Path doesn't exists: /tmp/hadoop-hdfs
INFO:HostCleanup:
Deleting repo files: ['/etc/yum.repos.d/ambari.repo']
INFO:HostCleanup:
Erasing alternatives:{'symlink_list': [''], 'target_list': ['']}
INFO:HostCleanup:Path doesn't exists:
INFO:HostCleanup:Clean-up completed. The output is at /var/lib/ambari-agent/data/hostcleanup.result

At this point, services and components do not appear to be removed from Ambari, but the config files do get deleted.
After reinstalling the clients from Ambari and restarting the components, everything seems to be fine(?)
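
To confirm what Ambari still thinks is on that host, a quick check via the API (a sketch; replace the cluster and host names with your own):
curl -s -u admin:admin "http://localhost:8080/api/v1/clusters/<cluster_name>/hosts/<host_name>/host_components?fields=HostRoles/state"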

Trying jcmd ManagementAgent

Note: Java 7u4 or later is required

[hdfs@node2 ~]$ /usr/jdk64/jdk1.8.0_60/bin/jcmd 31219 help
31219:
The following commands are available:
JFR.stop
JFR.start
JFR.dump
JFR.check
VM.native_memory
VM.check_commercial_features
VM.unlock_commercial_features
ManagementAgent.stop
ManagementAgent.start_local
ManagementAgent.start
GC.rotate_log
Thread.print
GC.class_stats
GC.class_histogram
GC.heap_dump
GC.run_finalization
GC.run
VM.uptime
VM.flags
VM.system_properties
VM.command_line
VM.version
help
For more information about a specific command use 'help <command>'.
[hdfs@node2 ~]$

[hdfs@node2 ~]$ /usr/jdk64/jdk1.8.0_60/bin/jcmd 31219 ManagementAgent.start
31219:
java.lang.RuntimeException: Invalid option specified


[hdfs@node2 ~]$ /usr/jdk64/jdk1.8.0_60/bin/jcmd 31219 ManagementAgent.start_local
31219:
Command executed successfully

[hdfs@node2 ~]$ /usr/jdk64/jdk1.8.0_60/bin/jstat -J-Djstat.showUnsupported=true -snap 31219 | grep sun.management.JMXConnectorServer.address
sun.management.JMXConnectorServer.address="service:jmx:rmi://127.0.0.1/stub/rO0ABXNyAC5qYXZheC5tYW5hZ2VtZW50LnJlbW90ZS5ybWkuUk1JU2VydmVySW1wbF9TdHViAAAAAAAAAAICAAB4cgAaamF2YS5ybWkuc2VydmVyLlJlbW90ZVN0dWLp/tzJi+FlGgIAAHhyABxqYXZhLnJtaS5zZXJ2ZXIuUmVtb3RlT2JqZWN002G0kQxhMx4DAAB4cHc3AAtVbmljYXN0UmVmMgAADDE3Mi4xNy4xMDAuMgAAtag+Sx7jYZTOeW5ym7MAAAFX1UxwhIABAHg="

[hdfs@node2 ~]$ hdfs jmxget -localVM "service:jmx:rmi://127.0.0.1/stub/rO0ABXNyAC5qYXZheC5tYW5hZ2VtZW50LnJlbW90ZS5ybWkuUk1JU2VydmVySW1wbF9TdHViAAAAAAAAAAICAAB4cgAaamF2YS5ybWkuc2VydmVyLlJlbW90ZVN0dWLp/tzJi+FlGgIAAHhyABxqYXZhLnJtaS5zZXJ2ZXIuUmVtb3RlT2JqZWN002G0kQxhMx4DAAB4cHc3AAtVbmljYXN0UmVmMgAADDE3Mi4xNy4xMDAuMgAAtag+Sx7jYZTOeW5ym7MAAAFX1UxwhIABAHg=" 2>&1 | head
init: server=localhost;port=;service=NameNode;localVMUrl=service:jmx:rmi://127.0.0.1/stub/rO0ABXNyAC5qYXZheC5tYW5hZ2VtZW50LnJlbW90ZS5ybWkuUk1JU2VydmVySW1wbF9TdHViAAAAAAAAAAICAAB4cgAaamF2YS5ybWkuc2VydmVyLlJlbW90ZVN0dWLp/tzJi+FlGgIAAHhyABxqYXZhLnJtaS5zZXJ2ZXIuUmVtb3RlT2JqZWN002G0kQxhMx4DAAB4cHc3AAtVbmljYXN0UmVmMgAADDE3Mi4xNy4xMDAuMgAAtag+Sx7jYZTOeW5ym7MAAAFX1UxwhIABAHg=
url string for local pid = service:jmx:rmi://127.0.0.1/stub/rO0ABXNyAC5qYXZheC5tYW5hZ2VtZW50LnJlbW90ZS5ybWkuUk1JU2VydmVySW1wbF9TdHViAAAAAAAAAAICAAB4cgAaamF2YS5ybWkuc2VydmVyLlJlbW90ZVN0dWLp/tzJi+FlGgIAAHhyABxqYXZhLnJtaS5zZXJ2ZXIuUmVtb3RlT2JqZWN002G0kQxhMx4DAAB4cHc3AAtVbmljYXN0UmVmMgAADDE3Mi4xNy4xMDAuMgAAtag+Sx7jYZTOeW5ym7MAAAFX1UxwhIABAHg= = service:jmx:rmi://127.0.0.1/stub/rO0ABXNyAC5qYXZheC5tYW5hZ2VtZW50LnJlbW90ZS5ybWkuUk1JU2VydmVySW1wbF9TdHViAAAAAAAAAAICAAB4cgAaamF2YS5ybWkuc2VydmVyLlJlbW90ZVN0dWLp/tzJi+FlGgIAAHhyABxqYXZhLnJtaS5zZXJ2ZXIuUmVtb3RlT2JqZWN002G0kQxhMx4DAAB4cHc3AAtVbmljYXN0UmVmMgAADDE3Mi4xNy4xMDAuMgAAtag+Sx7jYZTOeW5ym7MAAAFX1UxwhIABAHg=
Create RMI connector and connect to the RMI connector serverservice:jmx:rmi://127.0.0.1/stub/rO0ABXNyAC5qYXZheC5tYW5hZ2VtZW50LnJlbW90ZS5ybWkuUk1JU2VydmVySW1wbF9TdHViAAAAAAAAAAICAAB4cgAaamF2YS5ybWkuc2VydmVyLlJlbW90ZVN0dWLp/tzJi+FlGgIAAHhyABxqYXZhLnJtaS5zZXJ2ZXIuUmVtb3RlT2JqZWN002G0kQxhMx4DAAB4cHc3AAtVbmljYXN0UmVmMgAADDE3Mi4xNy4xMDAuMgAAtag+Sx7jYZTOeW5ym7MAAAFX1UxwhIABAHg=
Get an MBeanServerConnection
Domains:
        Domain = Hadoop
        Domain = JMImplementation
        Domain = com.sun.management

[root@sandbox-hdp ~]# jcmd `cat /var/run/ambari-server/ambari-server.pid` ManagementAgent.start jmxremote.port=5005 jmxremote.authenticate=false jmxremote.ssl=false
51141:
Command executed successfully
[root@sandbox-hdp ~]# jstat -J-Djstat.showUnsupported=true -snap `cat /var/run/ambari-server/ambari-server.pid` | grep -i jmx
sun.management.JMXConnectorServer.0.authenticate="false"
sun.management.JMXConnectorServer.0.remoteAddress="service:jmx:rmi:///jndi/rmi://sandbox-hdp.hortonworks.com:5005/jmxrmi"
sun.management.JMXConnectorServer.0.ssl="false"
sun.management.JMXConnectorServer.0.sslNeedClientAuth="false"
sun.management.JMXConnectorServer.0.sslRegistry="false"

But the above does not let jconsole connect...
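
One thing that may be worth trying (a sketch, not verified here) is pinning the RMI port to the same value, so that jconsole only needs a single reachable port:
jcmd `cat /var/run/ambari-server/ambari-server.pid` ManagementAgent.stop
jcmd `cat /var/run/ambari-server/ambari-server.pid` ManagementAgent.start jmxremote.port=5005 jmxremote.rmi.port=5005 jmxremote.authenticate=false jmxremote.ssl=false
# then try: jconsole sandbox-hdp.hortonworks.com:5005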

Monday, November 28, 2016

Hello world spark sbt on sandbox (HDP 2.5.0)

Preparation

1) Log in to the Docker version of the HDP Sandbox
ssh -p 2222 root@sandbox.hortonworks.com

2) Install sbt and Vim
http://www.scala-sbt.org/release/docs/Installing-sbt-on-Linux.html
curl https://bintray.com/sbt/rpm/rpm | tee /etc/yum.repos.d/bintray-sbt-rpm.repo
yum install -y sbt vim

2.1) Vim is hard to read as-is, so tweak it a little
http://bsnyderblog.blogspot.com.au/2012/12/vim-syntax-highlighting-for-scala-bash.html
mkdir -p ~/.vim/{ftdetect,indent,syntax} && for d in ftdetect indent syntax ; do curl -o ~/.vim/$d/scala.vim https://raw.githubusercontent.com/derekwyatt/vim-scala/master/syntax/scala.vim; done

Main steps

1) Create a working folder and edit the required files
http://spark.apache.org/docs/1.6.2/quick-start.html#self-contained-applications
mkdir scala && cd ./scala
mkdir -p ./src/main/scala
vim simple.sbt
name := "Simple Project"

version := "1.0"

scalaVersion := "2.10.5"

libraryDependencies += "org.apache.spark" %% "spark-core" % "1.6.2"

vim ./src/main/scala/SimpleApp.scala
/* SimpleApp.scala */
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf

object SimpleApp {
  def main(args: Array[String]) {
    val logFile = "YOUR_SPARK_HOME/README.md" // Should be some file on your system
    val conf = new SparkConf().setAppName("Simple Application")
    val sc = new SparkContext(conf)
    val logData = sc.textFile(logFile, 2).cache()
    val numAs = logData.filter(line => line.contains("a")).count()
    val numBs = logData.filter(line => line.contains("b")).count()
    println("Lines with a: %s, Lines with b: %s".format(numAs, numBs))
  }
}

2) Package it
sbt package
...
[info] Packaging /root/scala/target/scala-2.10/simple-project_2.10-1.0.jar ...
[info] Done packaging.
[success] Total time: 98 s, completed Nov 24, 2016 11:35:26 PM

2.1) Prepare the HDFS side (the folder name is odd because I didn't want to change the program)
hdfs dfs -mkdir YOUR_SPARK_HOME
locate README.md
hdfs dfs -put /usr/lib/hue/ext/thirdparty/js/test-runner/mootools-runner/README.md YOUR_SPARK_HOME

3) Submit the job!
[root@sandbox hdfs]# spark-submit --class "SimpleApp" --master local[1] --driver-memory 512m --executor-memory 512m --executor-cores 1 /root/scala/target/scala-2.10/simple-project_2.10-1.0.jar 2>/dev/null
Lines with a: 23, Lines with b: 10


3.1) Try it on Windows as well
http://www.ics.uci.edu/~shantas/Install_Spark_on_Windows10.pdf
https://wiki.apache.org/hadoop/WindowsProblems
Set the environment variable %HADOOP_HOME% to point to the directory above the BIN dir containing WINUTILS.EXE.

C:\Apps\spark-1.6.2-bin-hadoop2.6\bin>spark-submit --class "HdfsDeleteApp" c:\Users\Hajime\Desktop\hdfsdeleteapp-project_2.10-1.0.jar 2>nul 

Monday, November 21, 2016

Quickly checking whether a patch has been applied

Similar to an earlier topic: using a similar approach to check whether a patch has been applied, without using an IDE.

Example: https://issues.apache.org/jira/secure/attachment/12790079/AMBARI-15100-trunk_4.patch
From the patch above you can see that putMetric was added.

Log in to the node where the Ambari Metrics System is installed and find the AMS PIDs with something like ps auxwww | grep metrics.
Two likely candidates were found, so look for the jar files via /proc.

ls -l /proc/{3195,3241}/fd | grep .jar$

Or, judging from the path in the patch, the file name probably contains ambari-metrics-common, so:

[root@node1 ~]# ls -l /proc/{3195,3241}/fd | grep -E ambari-metrics-common.*\.jar$
lr-x------ 1 ams hadoop 64 Nov 21 02:42 86 -> /usr/lib/ambari-metrics-collector/ambari-metrics-common-2.2.2.0.460.jar
[root@node1 ~]# less /usr/lib/ambari-metrics-collector/ambari-metrics-common-2.2.2.0.460.jar | grep TimelineMetricsCache
-rw-r--r--  2.0 unx     5208 b- defN 16-May-05 18:35 org/apache/hadoop/metrics2/sink/timeline/cache/TimelineMetricsCache.class
-rw-r--r--  2.0 unx     3120 b- defN 16-May-05 18:35 org/apache/hadoop/metrics2/sink/timeline/cache/TimelineMetricsCache$TimelineMetricHolder.class
-rw-r--r--  2.0 unx     3251 b- defN 16-May-05 18:35 org/apache/hadoop/metrics2/sink/timeline/cache/TimelineMetricsCache$TimelineMetricWrapper.class 
[root@node1 ~]# /usr/jdk64/jdk1.8.0_60/bin/javap -classpath /usr/lib/ambari-metrics-collector/ambari-metrics-common-2.2.2.0.460.jar org.apache.hadoop.metrics2.sink.timeline.cache.TimelineMetricsCache
Compiled from "TimelineMetricsCache.java"
public class org.apache.hadoop.metrics2.sink.timeline.cache.TimelineMetricsCache {
  public static final int MAX_RECS_PER_NAME_DEFAULT;
  public static final int MAX_EVICTION_TIME_MILLIS;
  public org.apache.hadoop.metrics2.sink.timeline.cache.TimelineMetricsCache(int, int);
  public org.apache.hadoop.metrics2.sink.timeline.cache.TimelineMetricsCache(int, int, boolean);
  public org.apache.hadoop.metrics2.sink.timeline.TimelineMetric getTimelineMetric(java.lang.String);
  public int getMaxEvictionTimeInMillis();
  public void putTimelineMetric(org.apache.hadoop.metrics2.sink.timeline.TimelineMetric);
  public void putTimelineMetric(org.apache.hadoop.metrics2.sink.timeline.TimelineMetric, boolean);
  static int access$000(org.apache.hadoop.metrics2.sink.timeline.cache.TimelineMetricsCache);
  static int access$100(org.apache.hadoop.metrics2.sink.timeline.cache.TimelineMetricsCache);
  static org.apache.commons.logging.Log access$200();
  static {};
}
Hmm, there is no putMetric.

[root@node1 ~]# zipgrep putMetric /usr/lib/ambari-metrics-collector/ambari-metrics-common-2.2.2.0.460.jar
org/apache/hadoop/metrics2/sink/timeline/cache/TimelineMetricsCache$TimelineMetricHolder.class:Binary file (standard input) matches
org/apache/hadoop/metrics2/sink/timeline/cache/TimelineMetricsCache$TimelineMetricWrapper.class:Binary file (standard input) matches
[root@node1 ~]# /usr/jdk64/jdk1.8.0_60/bin/javap -classpath /usr/lib/ambari-metrics-collector/ambari-metrics-common-2.2.2.0.460.jar org.apache.hadoop.metrics2.sink.timeline.cache.TimelineMetricsCache\$TimelineMetricHolder
Compiled from "TimelineMetricsCache.java"
class org.apache.hadoop.metrics2.sink.timeline.cache.TimelineMetricsCache$TimelineMetricHolder extends java.util.concurrent.ConcurrentSkipListMap<java.lang.String, org.apache.hadoop.metrics2.sink.timeline.cache.TimelineMetricsCache$TimelineMetricWrapper> {
  final org.apache.hadoop.metrics2.sink.timeline.cache.TimelineMetricsCache this$0;
  org.apache.hadoop.metrics2.sink.timeline.cache.TimelineMetricsCache$TimelineMetricHolder(org.apache.hadoop.metrics2.sink.timeline.cache.TimelineMetricsCache);
  public org.apache.hadoop.metrics2.sink.timeline.TimelineMetric evict(java.lang.String);
  public void put(java.lang.String, org.apache.hadoop.metrics2.sink.timeline.TimelineMetric);
}
[root@node1 ~]# /usr/jdk64/jdk1.8.0_60/bin/javap -classpath /usr/lib/ambari-metrics-collector/ambari-metrics-common-2.2.2.0.460.jar org.apache.hadoop.metrics2.sink.timeline.cache.TimelineMetricsCache\$TimelineMetricWrapper
Compiled from "TimelineMetricsCache.java"
class org.apache.hadoop.metrics2.sink.timeline.cache.TimelineMetricsCache$TimelineMetricWrapper {
  final org.apache.hadoop.metrics2.sink.timeline.cache.TimelineMetricsCache this$0;
  org.apache.hadoop.metrics2.sink.timeline.cache.TimelineMetricsCache$TimelineMetricWrapper(org.apache.hadoop.metrics2.sink.timeline.cache.TimelineMetricsCache, org.apache.hadoop.metrics2.sink.timeline.TimelineMetric);
  public synchronized void putMetric(org.apache.hadoop.metrics2.sink.timeline.TimelineMetric);
  public synchronized long getTimeDiff();
  public synchronized org.apache.hadoop.metrics2.sink.timeline.TimelineMetric getTimelineMetric();
}


Wednesday, November 9, 2016

Installing a service component on a specific host with the Ambari API

Using Grafana as an example:

curl -u admin:admin -H "X-Requested-By:ambari" -i -X POST http://localhost:8080/api/v1/clusters/${_CLS}/services/AMBARI_METRICS/components/METRICS_GRAFANA
curl -u admin:admin -H "X-Requested-By:ambari" -i -X POST -d '{"host_components":[{"HostRoles":{"component_name":"METRICS_GRAFANA"}}]}' \
http://localhost:8080/api/v1/clusters/${_CLS}/hosts?Hosts/host_name=${_HOST}

PostgreSQL log:
LOG:  execute <unnamed>: INSERT INTO servicecomponentdesiredstate (component_name, desired_state, service_name, cluster_id, desired_stack_id) VALUES ($1, $2, $3, $4, $5)
DETAIL:  parameters: $1 = 'METRICS_GRAFANA', $2 = 'INSTALLED', $3 = 'AMBARI_METRICS', $4 = '2', $5 = '4'
LOG:  execute <unnamed>: INSERT INTO hostcomponentdesiredstate (admin_state, desired_state, maintenance_state, restart_required, security_state, host_id, desired_stack_id, service_name, cluster_id, component_name) VALUES ($1, $2, $3, $4, $5, $6, $7, $8, $9, $10)
DETAIL:  parameters: $1 = NULL, $2 = 'INIT', $3 = 'OFF', $4 = '0', $5 = 'UNSECURED', $6 = '4', $7 = '4', $8 = 'AMBARI_METRICS', $9 = '2', $10 = 'METRICS_GRAFANA'
LOG:  execute <unnamed>: INSERT INTO hostcomponentstate (id, current_state, security_state, upgrade_state, version, host_id, service_name, cluster_id, component_name, current_stack_id) VALUES ($1, $2, $3, $4, $5, $6, $7, $8, $9, $10)
DETAIL:  parameters: $1 = '453', $2 = 'INIT', $3 = 'UNSECURED', $4 = 'NONE', $5 = 'UNKNOWN', $6 = '4', $7 = 'AMBARI_METRICS', $8 = '2', $9 = 'METRICS_GRAFANA', $10 = '4'

Install it:
curl -u admin:admin -H "X-Requested-By:ambari" -X PUT -d '{"RequestInfo":{"context":"Install Grafana","operation_level":{"level":"HOST_COMPONENT","cluster_name":"'${_CLS}'","host_name":"'${_HOST}'","service_name":"AMBARI_METRICS"}},"Body":{"HostRoles":{"state":"INSTALLED"}}}' http://localhost:8080/api/v1/clusters/${_CLS}/hosts/${_HOST}/host_components/METRICS_GRAFANA
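
The state can then be checked with a plain GET (a sketch, reusing the same ${_CLS} and ${_HOST} variables):
curl -s -u admin:admin "http://localhost:8080/api/v1/clusters/${_CLS}/hosts/${_HOST}/host_components/METRICS_GRAFANA?fields=HostRoles/state"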


Now try deleting it:
curl -u admin:admin -H "X-Requested-By: ambari" -X DELETE  http://localhost:8080/api/v1/clusters/${_CLS}/services/AMBARI_METRICS/components/METRICS_GRAFANA

LOG:  execute <unnamed>: DELETE FROM hostcomponentdesiredstate WHERE ((((host_id = $1) AND (cluster_id = $2)) AND (component_name = $3)) AND (service_name = $4))
DETAIL:  parameters: $1 = '4', $2 = '2', $3 = 'METRICS_GRAFANA', $4 = 'AMBARI_METRICS'
LOG:  execute <unnamed>: DELETE FROM hostcomponentstate WHERE (id = $1)
DETAIL:  parameters: $1 = '453'
LOG:  execute <unnamed>: DELETE FROM servicecomponentdesiredstate WHERE (((cluster_id = $1) AND (component_name = $2)) AND (service_name = $3))
DETAIL:  parameters: $1 = '2', $2 = 'METRICS_GRAFANA', $3 = 'AMBARI_METRICS'

Tuesday, August 30, 2016

Making a small change to Hive (Java) code or classes

Example: https://issues.apache.org/jira/secure/attachment/12749206/HIVE-11498.003.patch

1) If you are not sure which jar contains the class, search all of them
find -L /usr/hdp/current/hive-client/ -type f -name '*.jar' -print0 | xargs -0 -n1 -I {} bash -c "less {} | grep -w Driver && echo {}"
# NOTE: don't forget the trailing "/" because of the symlink (or use -L)

-rw----     2.0 fat    64666 bl defN 15-Sep-30 19:09 org/apache/hadoop/hive/ql/Driver.class
/usr/hdp/current/hive-client/lib/hive-exec-1.2.1.2.3.2.0-2950.jar

Or:
grep SymbolicInputFormat -l /usr/hdp/current/hive-client/lib/*
[root@node2 ~]# less /usr/hdp/current/hive-client/lib/hive-exec-1.2.1000.2.4.2.0-258.jar | grep SymbolicInputFormat

-rw----     2.0 fat     4917 bl defN 16-Apr-25 06:49 org/apache/hadoop/hive/ql/io/SymbolicInputFormat.class

2) Make a backup
cp -p /usr/hdp/current/hive-client/lib/hive-exec-1.2.1.2.3.2.0-2950.jar /tmp/hive-exec-1.2.1.2.3.2.0-2950.jar
# NOTE: do not put the backup inside the classpath

3) Download the source code
mkdir ~/workspace; cd ~/workspace
mkdir -p org/apache/hadoop/hive/ql
wget https://raw.githubusercontent.com/apache/hive/branch-1.2/ql/src/java/org/apache/hadoop/hive/ql/Driver.java -O org/apache/hadoop/hive/ql/Driver.java

4) Review or edit (with vi or the patch command)
vi Driver.java

5) Get the classpath (jcmd PID VM.system_properties | grep ^java.class.path also works):
eval "export `cat /proc/$(cat /var/run/hive/hive-server.pid)/environ | tr '\0' '\n' | grep ^CLASSPATH`"

eval "export `strings /proc/$(cat /var/run/hive/hive-server.pid)/environ | grep ^CLASSPATH`"

sudo -u oozie /usr/jdk64/jdk1.8.0_112/bin/jcmd `cat /var/run/oozie/oozie.pid` VM.system_properties | grep java.class.path

export CLASSPATH=$(lsof -p `cat /var/run/oozie/oozie.pid` | grep -oE '/.+\.jar$' | tr '\n' ':')

6) Compile
source /etc/hadoop/conf/hadoop-env.sh
$JAVA_HOME/bin/javac org/apache/hadoop/hive/ql/Driver.java

7) Update the jar
$JAVA_HOME/bin/jar uf /usr/hdp/current/hive-client/lib/hive-exec-1.2.1.2.3.2.0-2950.jar org/apache/hadoop/hive/ql/Driver*class

8) Verify
less /usr/hdp/current/hive-client/lib/hive-exec-1.2.1.2.3.2.0-2950.jar | grep '/Driver'


Incidentally, to create a .jar file (dummy.jar) for ADD JAR:
eval "export `cat /proc/$(cat /var/run/hive/hive.pid)/environ | tr '\0' '\n' | grep ^CLASSPATH`"
$JAVA_HOME/bin/javac dummy/*.java
$JAVA_HOME/bin/jar cvf dummy.jar dummy/*.class
# after creating the required tables
hive -e 'ADD JAR /root/dummy.jar;insert into dummy select * from dummy2'


For Ranger:
[root@node3 classes]# pwd
/usr/hdp/2.5.5.0-157/ranger-admin/ews/webapp/WEB-INF/classes
[root@node3 classes]# export CLASSPATH="`find /usr/hdp/current/ranger-admin/ -name '*.jar' | tr '\n' ':'`"
[root@node3 classes]# /usr/jdk64/jdk1.8.0_112/bin/javac org/apache/ranger/biz/UserMgr.java
[root@node3 classes]# sudo -u ranger /usr/bin/ranger-admin restart

Adding or updating the JDBC driver for Hive via Ambari

On Ambari Server
1) Download the latest mysql-connector-java-xxxx.jar and copy into this server's /usr/share/java/ 
2) ln -sf /usr/share/java/mysql-connector-java-xxxx.jar /usr/share/java/mysql-connector-java.jar
3) ambari-server setup --jdbc-db=mysql --jdbc-driver=/usr/share/java/mysql-connector-java.jar 
4) find /var/lib/ambari-server/resources -name 'mysql-*.jar' -ls  (to make sure it has been updated)

On HiveServer2/Metastore Server
1) find / -name 'mysql-*.jar' -ls 
2) remove old mysql-connector-xxx.jar from Agent's tmp directory and /usr/hdp/<version>/hive/lib
3) remove old mysql-jdbc-driver.jar from Agent's cache directory 
4) Replace /usr/hdp/<version>/hadoop/lib/mysql-connector-java.jar with newer version if exists.
5) Restart ambari-agent 
6) Restart Hive (hiveserver2/metastore) from Ambari 
7) Run the find command again to make sure the version is correct by checking the file size (see the sketch below)
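
A rough sketch of the HiveServer2/Metastore side, assuming an HDP 2.5.0 path and that the new jar is already in /usr/share/java on this host (adjust every path to whatever the find commands actually show):
# on the HiveServer2 / Metastore host
find / -name 'mysql-*.jar' -ls 2>/dev/null
rm -f /var/lib/ambari-agent/tmp/mysql-connector-java*.jar /usr/hdp/2.5.0.0-1245/hive/lib/mysql-connector-java*.jar
find /var/lib/ambari-agent/cache -name 'mysql-jdbc-driver.jar' -delete
cp -p /usr/share/java/mysql-connector-java.jar /usr/hdp/2.5.0.0-1245/hadoop/lib/mysql-connector-java.jar
ambari-agent restart
# then restart HiveServer2/Metastore from Ambari and re-run the find to check the file sizes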

Monday, August 29, 2016

Quickly checking JDBC driver versions

A) MySQL:
$ zipgrep 'Bundle-Version' /usr/hdp/current/hive-client/lib/mysql-connector*.jar
$ zipgrep 'Bundle-Version' /usr/share/java/mysql-connector-java.jar

Example output:
META-INF/MANIFEST.MF:Bundle-Version: 5.1.39

Server side:
Simply use the mysql command:
SHOW VARIABLES LIKE "%version%";

B) PostgreSQL
$ zipgrep 'Bundle-Version' /usr/hdp/current/hive-client/lib/postgresql*.jar
$ zipgrep 'Bundle-Version' /usr/share/java/postgresql-jdbc.jar

Example output:
META-INF/MANIFEST.MF:Bundle-Version: 9.4.1208.jre7

Older driver versions seem to print nothing; in that case:
$ unzip -c /usr/share/java/postgresql-jdbc.jar 'org/postgresql/Driver.class' | grep -oP 'PostgreSQL.+?\)'
PostgreSQL 9.0 JDBC4 (build 801)

Server side (psql):
SELECT version();

C) Oracle
$ zipgrep 'Implementation-Version' /usr/share/java/ojdbc6.jar
META-INF/MANIFEST.MF:Implementation-Version: 11.2.0.4.0

Server side (sqlplus):
SELECT * FROM v$version;

NOTE: Oracle JDBC versions vs. server versions:
http://www.oracle.com/technetwork/database/enterprise-edition/jdbc-faq-090281.html#01_02

Wednesday, July 27, 2016

Writing and applying your own patch to Ambari

1) Create a .patch file from an IDE etc. For example, make the Metastore kinit at start time.

[root@node1 ~]# cat ~/hive_metastore_kinit.patch
Index: ambari-server/src/main/resources/common-services/HIVE/0.12.0.2.0/package/scripts/hive_metastore.py
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
--- ambari-server/src/main/resources/common-services/HIVE/0.12.0.2.0/package/scripts/hive_metastore.py (revision 6a8abfa65789b87da764549c27ca0f1440b91297)
+++ ambari-server/src/main/resources/common-services/HIVE/0.12.0.2.0/package/scripts/hive_metastore.py (revision )
@@ -55,6 +55,14 @@
     env.set_params(params)

     # writing configurations on start required for securtity
+    if params.security_enabled:
+        import status_params
+        cached_kinit_executor(status_params.kinit_path_local,
+                              status_params.hive_user,
+                              params.hive_metastore_keytab_path,
+                              params.hive_server_principal, # FIXME: Should use 'hive.metastore.kerberos.principal'
+                              status_params.hostname,
+                              status_params.tmp_dir)
     self.configure(env)

     hive_service('metastore', action='start', upgrade_type=upgrade_type)


2) Apply it
[root@node1 ~]# cd /var/lib/ambari-server/
[root@node1 ambari-server]# patch -p3 -b -i ~/hive_metastore_kinit.patch [--verbose]

3) Verify
[root@node1 ambari-server]# ls -l resources/common-services/HIVE/0.12.0.2.0/package/scripts/hive_metastore*
-rwxr-xr-x 1 root root 10655 Jul 27 09:03 resources/common-services/HIVE/0.12.0.2.0/package/scripts/hive_metastore.py
-rwxr-xr-x 1 root root 10404 May  5 19:11 resources/common-services/HIVE/0.12.0.2.0/package/scripts/hive_metastore.py.orig

[root@node1 ambari-server]# diff resources/common-services/HIVE/0.12.0.2.0/package/scripts/hive_metastore.py.orig resources/common-services/HIVE/0.12.0.2.0/package/scripts/hive_metastore.py
57a58,60
>     if params.security_enabled:
>         kinit_command=format("{kinit_path_local} -kt {hive_metastore_keytab_path} {hive_server_principal}; ") # FIXME: Should use 'hive.metastore.kerberos.principal'
>         Execute(kinit_command,user=params.smokeuser)

4) Restart Ambari Server
[root@node1 ambari-server]# ambari-server restart

5) Confirm it was copied to the agent side as well
[root@node2 ~]# ls -l /var/lib/ambari-agent/cache/common-services/HIVE/0.12.0.2.0/package/scripts/hive_metastore.py*
-rw-r--r-- 1 root root 10655 Jul 27 09:07 /var/lib/ambari-agent/cache/common-services/HIVE/0.12.0.2.0/package/scripts/hive_metastore.py
-rw-r--r-- 1 root root 10777 Jul 27 09:07 /var/lib/ambari-agent/cache/common-services/HIVE/0.12.0.2.0/package/scripts/hive_metastore.pyc
-rw-r--r-- 1 root root 10404 Jul 27 09:07 /var/lib/ambari-agent/cache/common-services/HIVE/0.12.0.2.0/package/scripts/hive_metastore.py.orig


6) Revert
[root@node1 ~]# cd /var/lib/ambari-server/
[root@node1 ambari-server]# mv resources/common-services/HIVE/0.12.0.2.0/package/scripts/hive_metastore.py.orig resources/common-services/HIVE/0.12.0.2.0/package/scripts/hive_metastore.py
mv: overwrite `resources/common-services/HIVE/0.12.0.2.0/package/scripts/hive_metastore.py'? y
[root@node1 ambari-server]# ambari-server restart

Notes on using JDB to debug the HDP Sqoop client

1) vim /usr/hdp/current/hadoop-client/bin/hadoop.distro

    if [ -n "$HADOOP_JDB" ]; then
      echo "export CLASSPATH=$CLASSPATH"
      echo "${JAVA_HOME}/bin/jdb" $JAVA_HEAP_MAX $HADOOP_OPTS $CLASS "$@"
    else
      exec "$JAVA" $JAVA_HEAP_MAX $HADOOP_OPTS $CLASS "$@"
    fi

2) HADOOP_JDB="Y" sqoop import --username SYSTEM --password oracle --direct --connect 'jdbc:oracle:thin:@//192.168.8.22:1521/XE'  --query "select * from TEST.TMP_SQOOP_DF_TEST67 WHERE \$CONDITIONS" --split-by COLUMN_NUMBER --target-dir /tmp/test > jdb_sqoop_import.sh

3) Delete the unneeded lines from jdb_sqoop_import.sh

4) bash ./jdb_sqoop_import.sh

5) Set a breakpoint inside JDB
> stop in org.apache.sqoop.manager.oracle.OraOopManagerFactory.isOraOopEnabled
> run

6) Repeat next and locals


Sunday, February 21, 2016

Creating a test/development environment almost automatically

Purpose:

When troubleshooting Hadoop/HDP, you sometimes want VMs running different versions of Ambari or HDP.
Building them manually every time is tedious, so I put the steps I use frequently into a Bash script.

What this script does:

  • Installs Docker and creates the specified number of containers
  • Installs Ambari Server (agents are not installed)
  • Optional: sets up a local repository for HDP

What this script does not do:

  • It does not install HDP itself (the -a option does)
    If you want to automate the HDP installation itself, see Ambari Blueprints


Steps:

  1. Install Ubuntu as a VM guest
    On Azure or AWS (and also OpenStack or vCenter) you can easily deploy Ubuntu 14.04 or 16.04.
  2. On VirtualBox, VMware Workstation, etc., I recommend backing up the VM right after installing Ubuntu so it can be cloned.
  3. Log in to Ubuntu as "root"
  4. Download the script
    wget https://raw.githubusercontent.com/hajimeo/samples/master/bash/start_hdp.sh -O ./start_hdp.sh
  5. Make it executable for the current user
    chmod u+x ./start_hdp.sh
  6. Start the script in automatic mode
    ./start_hdp.sh -a

    The above completes everything, up to installing HDP, with the default settings.
    If the installation (as opposed to the configuration) fails, check whether http://public-repo-1.hortonworks.com is reachable.
Alternatively:
  1. Start the script in install mode
    ./start_hdp.sh -i
  2. Answer a few questions; the default values are usually fine.
    • Run apt-get upgrade before setting up? [N]:
    • NTP Server [ntp.ubuntu.com]:
    • IP address for docker0 interface [172.17.0.1]:
    • Network Address (xxx.xxx.xxx.) for docker containers [172.17.100.]:
    • Domain Suffix for docker containers [.localdomain]:
    • Container OS type (small letters) [centos]:
    • Container OS version [6]:
    • How many nodes? [4]:
    • Hostname for docker host in docker private network? [dockerhost1]:
    • Ambari server hostname [node1.localdomain]:
    • Ambari version (used to build repo URL) [2.2.0.0]:
    • If you have set up a Local Repo, please change below
    • Ambari repo [http://public-repo-1.hortonworks.com/ambari/centos6/2.x/updates/2.2.0.0/ambari.repo]:
    • Would you like to set up local repo for HDP? (may take long time to downlaod) [N]: Y
    • Local repository directory (Apache root) [/var/www/html]:
    • HDP (repo) version [2.3.4.0]:
    • URL for HDP repo tar.gz file [http://public-repo-1.hortonworks.com/HDP/centos6/2.x/updates/2.3.4.0/HDP-2.3.4.0-centos6-rpm.tar.gz]:
    • URL for UTIL repo tar.gz file [http://public-repo-1.hortonworks.com/HDP-UTILS-1.1.0.20/repos/centos6/HDP-UTILS-1.1.0.20-centos6.tar.gz]:
    • INFO : Interview completed.
    • Would you like to save your response? [Y]:
    • INFO : Saved ./start_hdp.resp
    • Would you like to start setup this host? [Y]:
  3. Answering Yes to the last question starts the installation.
  4. Downloading the packages takes the longest; on Azure, after about 10-20 minutes the script either completes or stops if an error occurred.
  5. If ambari-server started successfully, you can reach it from a browser.
    *Note*: As-is, your local PC cannot reach Ambari's port 8080 on the container (node1) directly.
    Easy options are to create a proxy with something like "ssh -D 18080 username@ubuntu-hostname", or to set up port forwarding with "ssh -L 8080:node1.localdomain:8080 username@ubuntu-hostname".
    Personally I recommend the proxy approach together with a Chrome add-on such as "SwitchySharp".

    If ambari-server is not running, log in to node1 and try "ambari-server start".
  6. Once you can access Ambari from the browser, install HDP as usual (or use a Blueprint, etc.).
  7. If you created a local HDP repository, change the hostname to dockerhost1.localdomain (this can be changed during the initial interview).
  8. The private key is in .ssh/id_rsa on every node

After the installation

After rebooting the Ubuntu VM, you can start the HDP services with "./start_hdp.sh -s".