Unit 2. Designing a Pipeline


The EAP MI Pipelines provides the logic, file, model and other types of operators to meet the business needs of a variety of model training processes and scenarios. This unit describes how to use operators to organize pipelines and develop wind farm power generation prediction models.

Prerequisites

Before orchestrating a pipeline, you can create a new experiment in the MI Pipelines by following these steps:

  1. Log in to the EnOS Management Console, and select Enterprise Analytics Platform > Machine Intelligence Studio > MI Pipelines from the left navigation bar to open the Experiment List homepage.

  2. Click New Experiment and enter the name (kmmkdsdemo) and description of the experiment.

  3. Click OK to create the experiment, open the Pipeline Designer page of the experiment for designing and developing the pipeline.

    ../_images/creating_experiment.png

Designing Pipelines

The following operators are needed to orchestrate the pipeline in this tutorial:

  1. Hive operator: query the Hive to get the list of sites to be trained (partition or field query), and get the keytab and kerberos profiles required by the Hive operator.

  2. Git Directory operator: get the file transform1.py from the Git directory and use it as input of Python operator

  3. Python operator: format the input file and take it as the input of ParallelFor operator

  4. ParallelFor operator: implement the loop processing for each site


Drag the operators to the editing canvas, and the pipeline after orchestration is shown in the figure below:

../_images/pipeline_overview.png


The configuration instructions for each operator orchestrated in the pipeline are given as follows:

Hive Operator

Name: Hive

Description: query the Hive to get the list of sites to be trained, and get the keytab and krb5 profiles

Input parameters

Parameter Name

Data type

Operation Type

Value

data_source_name

String

Declaration

Name of the registered Hive data source

sqls

List

Declaration

[“set tez.am.resource.memory.mb=1024”,”select distinct lower(masterid) as masterid from kmmlds1”]

queue

String

Declaration

root.eaptest01 (Name of the big data queue applied for through resource management)

Output parameters

Parameter Name

Value

resultset

File

An sample of operator configuration is given as follows:

../_images/hive_config.png

Git Directory Operator

Name: Git Directory for Transform1

Description: pull the Python code file transform1.py from the Git directory

Input parameters

Parameter Name

Data type

Operation Type

Value

data_source_name

String

Declaration

Name of the registered Git data source

branch

String

Declaration

master

project

String

Declaration

workspace1

paths

List

Declaration

[“workspace1/kmmlds/transform1.py”]

Output parameters

Parameter Name

Value

workspace

directory

paths

list

An sample of operator configuration is given as follows:

../_images/git_directory_1.png

Python Operator

Name: Transform1

Description: format the input file and take it as the input of ParallelFor operator. The query output format of the Hive operator is [[], [["abcde0001"], ["cgnwf0046"]]]. It cannot be used directly by the ParallelFor operator. Before that, it needs to be converted through the Python operator, and the converted format is ["abcde0001", "cgnwf0046"].

Input parameters

Parameter Name

Data type

Operation Type

Value

workspace

Directory

Reference

Git Directory for Transform1.workspace

entrypoint

String

Declaration

workspace1/kmmlds/transform1.py

requirements_file_path

String

Declaration

list_data

File

Reference

Hive.resultset

Output parameters

Parameter Name

Value

output_list

list

An sample of operator configuration is given as follows:

../_images/python_transform_1.png

ParallelFor Operator

Name: Loop for masterid

Description: orchestrate the pipeline in the sub-canvas to process each site

Input parameters

Parameter Name

Operation Type

Value

Transform1.output_list

Reference

item

An sample of operator configuration is given as follows:

../_images/parallel_for.png

Next Unit

Training, Registering and Deploying Model