I wanted to write this blog because many users might be desperately looking for how to launch H2O on the Hadoop sandbox. Let’s start with a quick introduction to H2O, and then we will see how to set it up in your Hortonworks Sandbox 2.3.
What is H2O:
H2O is a statistical, machine learning, and math runtime for big data analysis. Developed by the predictive analytics company H2O.ai, H2O has established itself as a leader in the ML scene alongside R and Databricks’ Spark. According to the team, H2O is the world’s fastest in-memory platform for machine learning and predictive analytics on big data. It is designed to help users scale machine learning, math, and statistics over large datasets. With H2O, you can make better predictions by harnessing sophisticated, ready-to-use algorithms and the processing power you need to analyze bigger data sets, more models, and more variables.
H2O is for data scientists and business analysts who need scalable, fast machine learning. H2O is an open-source predictive analytics platform. Unlike traditional analytics tools, H2O provides a combination of extraordinary math and high-performance parallel processing with unrivaled ease of use. H2O speaks the language of data science, with support for R, Python, Scala, Java, and a robust REST API. Smart business applications are powered by H2O’s NanoFast™ Scoring Engine.
In addition to H2O’s point-and-click web UI, its REST API allows easy integration into various clients. This means exploratory analysis of data can be done in a typical fashion in R, Python, and Scala, and entire workflows can be written up as automated scripts.
Now we will see how to install H2O and launch its web user interface.
VirtualBox Network Settings:
Please choose the Bridged Adapter setting in your VirtualBox network configuration.
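If you prefer the command line, the same change can be sketched with VBoxManage. The VM name and host adapter name below are placeholders, not taken from this setup; list yours first and substitute:

```shell
# Find the exact VM name and the bridgeable host adapters on your machine
VBoxManage list vms
VBoxManage list bridgedifs

# Switch NIC 1 of the sandbox VM to bridged mode
# ("Hortonworks Sandbox with HDP 2.3" and "en0" are assumed names -- use yours)
VBoxManage modifyvm "Hortonworks Sandbox with HDP 2.3" --nic1 bridged --bridgeadapter1 "en0"
```

Run this while the VM is powered off; the setting takes effect on the next boot.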
Download the latest H2O package as below.
I tried the H2O-for-HDP-2.2 package on the HDP 2.3 image, and it worked.
Unzip the downloaded package and cd into it.
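From the sandbox shell, the download-and-unpack sequence looks roughly like this. The URL and version number are illustrative placeholders; copy the real link for the hdp2.2 build from the h2o.ai download page:

```shell
# Placeholder release URL/version -- grab the current hdp2.2 package link from h2o.ai
wget http://h2o-release.s3.amazonaws.com/h2o/rel-shannon/26/h2o-3.0.0.26-hdp2.2.zip

# Unpack and move into the extracted directory (name matches the zip)
unzip h2o-3.0.0.26-hdp2.2.zip
cd h2o-3.0.0.26-hdp2.2
```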
Launch the H2O cluster as below:
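A minimal launch sketch is shown below; h2odriver.jar is the Hadoop driver shipped inside the unzipped package (the exact jar file name varies by build, and /user/root/h2oOutput is an assumed output path, not a required one):

```shell
# -nodes 1      : one H2O node (one mapper) is enough on the sandbox
# -mapperXmx 1g : Java heap per node -- size it larger than your data file
# -output       : HDFS directory for temporary files; it must not already exist
hadoop jar h2odriver.jar -nodes 1 -mapperXmx 1g -output /user/root/h2oOutput
```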
Wait until you see the message “Blocking until the H2O cluster shuts down”.
Memory Requirements Analysis:
mapred.child.java.opts: -Xms1g -Xmx1g
mapred.map.child.java.opts: -Xms1g -Xmx1g
Extra memory percent: 10
YARN container size (mapreduce.map.memory.mb) = -mapperXmx value + (-mapperXmx * -extramempercent [default is 10%])
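As a quick sanity check of that formula, with -mapperXmx 1g (1024 MB) and the default extra memory percent of 10:

```shell
MAPPER_XMX_MB=1024
EXTRA_PCT=10
# container request = Xmx + Xmx * extramempercent / 100
CONTAINER_MB=$(( MAPPER_XMX_MB + MAPPER_XMX_MB * EXTRA_PCT / 100 ))
echo "$CONTAINER_MB"   # 1126
```

So a 1 GB heap turns into a roughly 1.1 GB YARN container request (mapreduce.map.memory.mb).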
Please set this in Ambari and restart YARN:
This is especially relevant when your data file exceeds 4 GB; in that case we need the YARN memory settings to be at least 32 GB.
yarn.scheduler.maximum-allocation-mb = 32684
The mapreduce.map.memory.mb value must be less than the YARN memory configuration values for the launch to succeed.
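In Ambari these settings land in yarn-site.xml. The scheduler maximum below mirrors the value above; yarn.nodemanager.resource.memory-mb is the companion per-node limit that typically needs raising alongside it (included here as an assumption, not something this walkthrough verified):

```xml
<property>
  <name>yarn.scheduler.maximum-allocation-mb</name>
  <value>32684</value>
</property>
<!-- Assumed companion setting: total memory YARN may allocate on each node -->
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>32684</value>
</property>
```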
Launch Command Details:
- Each H2O node runs as a mapper.
- We need to run only one mapper per host.
- No combiners or reducers are used in the H2O launch.
- -nodes: the number of H2O nodes to start in the H2O cluster. The recommended number here is 1.
- -mapperXmx: the amount of Java heap memory required for each node. This should be larger than your data file size.
- -output: an HDFS location where temporary H2O files are stored. This directory is part of the Hadoop ToolRunner framework that H2O builds on. Before launching the command, make sure this directory does not exist.
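Because the -output directory must not pre-exist, it is worth clearing any leftover copy before relaunching. The path here is an assumed HDFS location; substitute whatever you pass to -output:

```shell
# Remove a stale output directory from a previous run, if present
# (-skipTrash deletes it immediately instead of moving it to .Trash)
hdfs dfs -rm -r -skipTrash /user/root/h2oOutput
```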
Note down the H2O Flow web URL; here it is http://192.168.0.19:54321. Open it in IE, Chrome, or Firefox.
Track this Job through Hue or Ambari:
Many launch parameters are documented on the h2o.ai website for your reference. They are listed below.
- -h | -help: Display help
- -jobname <JobName>: Specify a job name for the Jobtracker to use; the default is H2O_nnnnn (where n is chosen randomly)
- -driverif <IP address of mapper -> driver callback interface>: Specify the IP address for callback messages from the mapper to the driver.
- -driverport <port of mapper -> callback interface>: Specify the port number for callback messages from the mapper to the driver.
- -network <IPv4Network1>[,<IPv4Network2>]: Specify the IPv4 network(s) to bind to the H2O nodes; multiple networks can be specified to force H2O to use the specified host in the Hadoop cluster. For example, 10.1.2.0/24 allows 256 possibilities.
- -timeout <seconds>: Specify the timeout duration (in seconds) to wait for the cluster to form before failing. Note: The default value is 120 seconds; if your cluster is very busy, this may not provide enough time for the nodes to launch. If H2O does not launch, try increasing this value (for example, -timeout 600).
- -disown: Exit the driver after the cluster forms.
- -notify <notification file name>: Specify a file to write when the cluster is up. The file contains the IP and port of the embedded web server for one of the nodes in the cluster. All mappers must start before the H2O cloud is considered “up”.
- -mapperXmx <per mapper Java Xmx heap size>: Specify the amount of memory to allocate to H2O (at least 6g).
- -extramempercent <0-20>: Specify the extra memory for internal JVM use outside of the Java heap. This is a percentage of mapperXmx.
- -n | -nodes <number of H2O nodes>: Specify the number of nodes.
- -nthreads <maximum number of CPUs>: Specify the number of CPUs to use. Enter -1 to use all CPUs on the host, or enter a positive integer.
- -baseport <initialization port for H2O nodes>: Specify the initialization port for the H2O nodes. The default is 54321.
- -ea: Enable assertions to verify boolean expressions for error detection.
- -verbose:gc: Include heap and garbage collection information in the logs.
- -XX:+PrintGCDetails: Include a short message after each garbage collection.
- -license <license file name>: Specify the local filesystem directory and the license file name.
- -o | -output <HDFS output directory>: Specify the HDFS directory for the output.
- -flow_dir <Saved Flows directory>: Specify the directory for saved flows. By default, H2O will try to find the HDFS home directory to use as the directory for flows. If the HDFS home directory is not found, flows cannot be saved unless a directory is specified using -flow_dir.
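Putting several of these flags together, a more explicit sandbox launch might look like this. The jar name, job name, and output path are illustrative assumptions; only the flags themselves come from the list above:

```shell
hadoop jar h2odriver.jar \
  -jobname H2O_sandbox \
  -timeout 600 \
  -nodes 1 \
  -mapperXmx 1g \
  -extramempercent 10 \
  -output /user/root/h2oOutput
```

The longer -timeout 600 gives a busy sandbox extra time for the mapper to launch before the driver gives up.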
In the next blog, we will see how to import an HDFS file into the H2O Flow UI, create training sets, build a GLM (Generalized Linear Model) and a 50-tree GBM (Gradient Boosting Machine) model, and create some visualizations.