Configuring SHELL Type Task Node


Batch Data Processing supports multiple computing frameworks such as Hive, Spark, and MapReduce. When creating a workflow, you can add SHELL task nodes to develop data processing tasks.


This section shows how to configure SHELL task nodes.

Executing HiveSQL Tasks

You can implement batch computing by using SHELL task nodes to execute HiveSQL tasks through the command line.

Command Format

canaanhive [arguments]

Parameter Description

| Parameter | Example | Description |
| --- | --- | --- |
| -f <arg> | -f demo.sql | The HQL file name. |
| -d <arg> | -d 2018-01-01 | The time parameter. In this example, ${env.FORMAT} in the HQL file will automatically be replaced according to the formats listed in the Time Parameter Format table below. |
| -E <paraN>=<valN> | -E para=abc | A user-defined parameter assigned in the HQL file. In this example, ${env.para} in the HQL file will automatically be replaced with abc. |
| -str <sql> | -str "show tables;" | The SQL statement to execute. |
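For instance, you can run an inline statement without an HQL file by passing it with -str, or combine a file, a time parameter, and a user-defined parameter in one command. The following calls are illustrative sketches built from the parameters above:

canaanhive -str "show tables;"
canaanhive -f demo.sql -d 2018-01-01 -E DB=demo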


Note

The letter P in the FORMAT parameter in the following table stands for Previous, so PnD/PnM/PnY stands for the previous n day(s)/month(s)/year(s).

The letter N stands for Next, so NnD/NnM/NnY stands for the next n day(s)/month(s)/year(s).

Time Parameter Format

| FORMAT | Range of n | Value (with -d 2018-01-01) |
| --- | --- | --- |
| YYYYMMDD | - | 2018-01-01 |
| YYYYMMDD_PnD | 1 <= n <= 30 | 2017-12-31 ~ 2017-12-02 |
| YYYYMMDD_PnM | 1 <= n <= 12 | 2017-12-01 ~ 2017-01-01 |
| YYYYMMDD_PnY | 1 <= n <= 2 | 2017-01-01 ~ 2016-01-01 |
| YYYYMMDD_NnD | 1 <= n <= 2 | 2018-01-02 ~ 2018-01-03 |
| YYYYMMDD_NnM | 1 <= n <= 2 | 2018-02-01 ~ 2018-03-01 |
| YYYYMMDD_NnY | 1 <= n <= 2 | 2019-01-01 ~ 2020-01-01 |
| YYYYMM | - | 2018-01 |
| YYYYMM_PnD | 1 <= n <= 2 | 2017-12 ~ 2017-12 |
| YYYYMM_PnM | 1 <= n <= 2 | 2017-12 ~ 2017-11 |
| YYYYMM_PnY | 1 <= n <= 2 | 2017-01 ~ 2016-01 |
| YYYYMM_NnD | 1 <= n <= 2 | 2018-01 ~ 2018-01 |
| YYYYMM_NnM | 1 <= n <= 2 | 2018-02 ~ 2018-03 |
| YYYYMM_NnY | 1 <= n <= 2 | 2019-01 ~ 2020-01 |
| YYYY | - | 2018 |
| YYYY_PnD | 1 <= n <= 2 | 2017 ~ 2017 |
| YYYY_PnM | 1 <= n <= 2 | 2017 ~ 2017 |
| YYYY_PnY | 1 <= n <= 2 | 2017 ~ 2016 |
| YYYY_NnD | 1 <= n <= 2 | 2018 ~ 2018 |
| YYYY_NnM | 1 <= n <= 2 | 2018 ~ 2018 |
| YYYY_NnY | 1 <= n <= 2 | 2019 ~ 2020 |
| MM | - | 01 |
| DD | - | 01 |

Examples

Assume the command line in the SHELL node is as follows:

canaanhive -f demo.sql -d 2018-01-01 -E DB=demo


And the HQL file demo.sql contains the following sample code:

use ${env.DB};
create table if not exists demo(id string);
insert into demo values('${env.YYYYMMDD}');


The executed content will be:

use demo;
create table if not exists demo(id string);
insert into demo values('2018-01-01');
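Relative time parameters are substituted in the same way. As a sketch, if demo.sql were extended with the hypothetical line shown in the comment below, running the same command would replace the placeholder with the day before the -d date:

# Hypothetical extra line in demo.sql:
#   insert into demo values('${env.YYYYMMDD_P1D}');
canaanhive -f demo.sql -d 2018-01-01 -E DB=demo
# With -d 2018-01-01, ${env.YYYYMMDD_P1D} is replaced with 2017-12-31,
# per the Time Parameter Format table above.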

Executing Spark Tasks

You can execute PySpark and Spark tasks through the command line by using SHELL task nodes.

Command Format

Using a PySpark Job as an example, create a SHELL type node and use the SHELL command to run the main function of the Job.

sh predict.sh

Submitting a PySpark Job

submit-pyspark-application [options] <python file> [app arguments]

Parameter Description

| Parameter | Description |
| --- | --- |
| --python 2.7/3.5 | The Python version; 2.7 and 3.5 are supported. The default is 2.7. |
| --pythonEnvPath | The VirtualEnv path in HDFS. If not set, the default Python environment is used. |
| --name NAME | The name of your application. |
| --queue QUEUE_NAME | The YARN queue to submit to (default: "default"). |
| --num-executors NUM | The number of executors to launch (default: 2). If dynamic allocation is enabled, the initial number of executors will be at least NUM. |
| --executor-cores NUM | The number of cores per executor (default: 1 in YARN mode, or all available cores on the worker in standalone mode). |
| --driver-cores NUM | The number of cores used by the driver, only in cluster mode (default: 1). |
| --conf PROP=VALUE | An arbitrary Spark configuration property. |
| --py-files PY_FILES | A comma-separated list of .zip, .egg, or .py files to place on the PYTHONPATH for Python apps. |
| --files FILES | A comma-separated list of files to be placed in the working directory of each executor. |
| --archives ARCHIVES | A comma-separated list of archives to be extracted into the working directory of each executor. |
| --driver-memory MEM | The memory for the driver (e.g. 1000M, 2G) (default: 2G). |
| --driver-java-options | Extra Java options to pass to the driver. |
| --driver-library-path | Extra library path entries to pass to the driver. |
| --driver-class-path | Extra class path entries to pass to the driver. Note that JARs added with --jars are automatically included in the classpath. |
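As a minimal sketch, a PySpark submission built from the options above might look as follows; the application name, file name, and argument are placeholders rather than part of the product:

submit-pyspark-application \
    --python 3.5 \
    --name my_pyspark_job \
    --queue default \
    --num-executors 4 \
    --executor-cores 2 \
    --driver-memory 2G \
    my_job.py 2018-01-01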

Submitting a Spark Job

submit-spark-application [options] <app-jar> [app arguments]

Parameter Description

| Parameter | Description |
| --- | --- |
| --class CLASS_NAME | Your application's main class (for Java/Scala apps). |
| --name NAME | The name of your application. |
| --packages | A comma-separated list of Maven coordinates of JARs to include on the driver and executor classpaths. The local Maven repository is searched first, then Maven Central and any additional remote repositories given by the --repositories option. The coordinate format is groupId:artifactId:version. |
| --jars JARS | A comma-separated list of local JARs to include on the driver and executor classpaths. |
| --conf PROP=VALUE | An arbitrary Spark configuration property. |
| --files FILES | A comma-separated list of files to be placed in the working directory of each executor. |
| --archives ARCHIVES | A comma-separated list of archives to be extracted into the working directory of each executor. |
| --driver-memory MEM | The memory for the driver (e.g. 1000M, 2G) (default: 2G). |
| --driver-java-options | Extra Java options to pass to the driver. |
| --driver-library-path | Extra library path entries to pass to the driver. |
| --driver-class-path | Extra class path entries to pass to the driver. Note that JARs added with --jars are automatically included in the classpath. |
| --executor-cores NUM | The number of cores per executor (default: 1 in YARN mode, or all available cores on the worker in standalone mode). |
| --driver-cores NUM | The number of cores used by the driver, only in cluster mode (default: 1). |
| --queue QUEUE_NAME | The YARN queue to submit to (default: "default"). |
| --num-executors NUM | The number of executors to launch (default: 2). If dynamic allocation is enabled, the initial number of executors will be at least NUM. |
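As a minimal sketch, a Spark JAR submission built from the options above might look as follows; the class name, JAR name, and argument are placeholders:

submit-spark-application \
    --class com.example.demo.Main \
    --name my_spark_job \
    --queue default \
    --num-executors 4 \
    --executor-cores 2 \
    --driver-memory 2G \
    my_app.jar 2018-01-01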

Examples

Use the following command in the SHELL node to run the script predict.sh:

sh predict.sh


The example code in predict.sh is:

submit_pyspark_application_func(){
    # Positional parameters (inferred from their use below):
    #   $1 = YARN queue, $2 = HDFS user, $3 = start date, $4 = end date, $5 = site IDs
    submit-pyspark-application \
    --deploy-mode cluster \
    --queue ${1} \
    --name pyspark_predict_test \
    --num-executors 10 \
    --driver-memory 16g \
    --executor-memory 12g \
    --driver-cores 2 \
    --executor-cores 3 \
    --conf spark.eventLog.enabled=true \
    --conf spark.network.timeout=240000 \
    --conf spark.executor.heartbeatInterval=24000 \
    --conf spark.yarn.executor.memoryOverhead=8192 \
    --archives hdfs://user/db_test/userPythonLib.zip#ANACONDA  \
    --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./ANACONDA/MINICONDA/bin/python \
    --conf spark.yarn.maxAppAttempts=1 \
    --conf spark.logger_table=wens_status_algo_running \
    --conf spark.hdfs_user=${2} \
    --conf spark.hdfs_path=hdfs://titan/user/${2} \
    --conf spark.start_date=${3} \
    --conf spark.end_date=${4} \
    --conf spark.site_ids=${5} \
    --conf spark.metric_save_path=/user/${2}/operaphm_temperature/metrics \
    --py-files anomaly.py,hadoop_common_functions.py,layout.py,utm.zip,rle.py,common_tools.py,steadystatefilter.py,math_utils.py \
    --conf spark.eventLog.enabled=true  predict.py
}

echo "test"


In the script above, predict.py is the Python file submitted to Spark; it needs to be in the same directory as predict.sh.
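Also note that the sample script only defines submit_pyspark_application_func; to actually submit the job, the function has to be called with its five positional arguments. A hypothetical invocation (the queue, HDFS user, dates, and site IDs below are placeholders) could be appended to predict.sh:

# $1 = YARN queue, $2 = HDFS user, $3 = start date, $4 = end date, $5 = site IDs
submit_pyspark_application_func default db_test 2018-01-01 2018-01-31 site_001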