Create a Dataset¶
This topic describes how to create datasets.
Create Datasets from External Data Sources¶
The dataset types supported by each data source connection are as follows:
Data Source Connection | Supported Dataset Type |
---|---|
MySQL | Tabular Dataset |
Hive | Tabular Dataset |
Blob | Tabular Datasets (Delimited Text or ORC file format), File Dataset |
S3 | Tabular Datasets (Delimited Text or ORC file format), File Dataset |
HDFS | Tabular Datasets (Delimited Text or ORC file format), File Dataset |
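The table above can also be expressed as a small lookup, which may be handy when scripting checks before you open the console. This is only an illustrative sketch: the names `SUPPORTED_DATASET_TYPES`, `TABULAR_FILE_FORMATS`, and `is_supported` are hypothetical and are not part of any EnOS API.

```python
# Supported dataset types per data source connection, transcribed from the table.
# All names in this block are hypothetical helpers, not an EnOS API.
SUPPORTED_DATASET_TYPES = {
    "MySQL": {"tabular"},
    "Hive": {"tabular"},
    "Blob": {"tabular", "file"},
    "S3": {"tabular", "file"},
    "HDFS": {"tabular", "file"},
}

# File formats listed for tabular datasets on Blob, S3, and HDFS connections.
TABULAR_FILE_FORMATS = {"Delimited Text", "ORC"}


def is_supported(source: str, dataset_type: str) -> bool:
    """Return True if the table lists dataset_type for the given source."""
    return dataset_type in SUPPORTED_DATASET_TYPES.get(source, set())


print(is_supported("HDFS", "file"))   # listed in the table
print(is_supported("MySQL", "file"))  # not listed in the table
```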
Create Datasets from MySQL or Hive Data Sources¶
To create a dataset from a MySQL or Hive data source connection (using a MySQL data source as an example):
Log in to the EnOS Management Console and select Data Analytics > AI Studio > Dataset Management.
Select New Dataset > Create from Data Sources and complete the following basic information of the dataset:
Dataset Name: Enter the name of the dataset.
Dataset Alias: Enter an alias for the dataset, which is displayed as the name of the dataset in the dataset list.
Data Source: Select the name of a data source connection registered on the Resource Configuration > Connection Configuration page. The system checks the connection automatically.
Dataset Type: Select tabular.
Tags: Enter one or more tags for the dataset. Tags can be in Chinese or English and can be used to search for datasets in the dataset list.
Description: Enter a description of the dataset.
On the Data Source page, enter the SQL query statement and the query timeout value (1 to 600 seconds). To query dynamic data, enter a statement with variables in SQL Query, writing each variable as %s, and specify the default values of the variables in order in Advanced Configuration. For example, select * from table where key_a = %s.
Note
You can only use a single SQL query statement when creating a tabular dataset from data source connections.
On the Data Preview page, select Preview Data to view the query results (displaying the first 50 data records of the query result only).
On the SCHEMA Settings page, specify alias, attributes, data type, and description for data fields as needed. If you need to reset the Schema information, select the Reset button to restore the default settings.
On the Confirmation page, check the completeness of the dataset configuration. Select Finish to create the dataset. The created dataset will be displayed in the dataset list.
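The SQL query step in the procedure above relies on %s variables that are filled in order from the default values. The source does not describe how the platform performs this substitution, so the following is only a rough sketch of the idea under stated assumptions: `render_query` and `check_timeout` are hypothetical helpers, default values are inserted as literal text, and the single-statement check is a naive semicolon test that ignores semicolons inside string literals.

```python
from typing import Sequence


def check_timeout(seconds: int) -> int:
    """Accept only timeouts in the documented range of 1 to 600 seconds."""
    if not 1 <= seconds <= 600:
        raise ValueError("query timeout must be between 1 and 600 seconds")
    return seconds


def render_query(template: str, defaults: Sequence[str]) -> str:
    """Fill each %s in the template with the default values, in order.

    Hypothetical helper: it only illustrates positional substitution and
    does not reproduce how the platform handles variables.
    """
    statement = template.strip().rstrip(";")
    # Naive check for the "single SQL statement" rule (assumption: no
    # semicolons appear inside string literals).
    if ";" in statement:
        raise ValueError("only a single SQL statement is allowed")
    parts = statement.split("%s")
    if len(parts) - 1 != len(defaults):
        raise ValueError(
            f"expected {len(parts) - 1} default value(s), got {len(defaults)}"
        )
    # Interleave the fixed text with the default values, left to right.
    rendered = parts[0]
    for value, tail in zip(defaults, parts[1:]):
        rendered += value + tail
    return rendered


print(render_query("select * from table where key_a = %s", ["'abc'"]))
```

Splitting on %s rather than using Python's `%` operator keeps other percent signs in the statement, such as those in a LIKE pattern, untouched.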
Create Datasets from Blob, S3, or HDFS Data Sources¶
To create a dataset from a Blob, S3, or HDFS data source connection (using an HDFS data source to create a file dataset as an example):
Log in to the EnOS Management Console and select Data Analytics > AI Studio > Dataset Management.
Select New Dataset > Create from Data Sources and complete the following basic information of the dataset:
Dataset Name: Enter the name of the dataset.
Dataset Alias: Enter an alias for the dataset, which is displayed as the name of the dataset in the dataset list.
Data Source: Select the name of a data source connection registered on the Resource Configuration > Connection Configuration page. The system checks the connection automatically.
Dataset Type: Select file.
Tags: Enter one or more tags for the dataset. Tags can be in Chinese or English and can be used to search for datasets in the dataset list.
Description: Enter a description of the dataset.
On the Data Source page, enter the path of the file to be used.
On the Confirmation page, check the completeness of the dataset configuration. Select Finish to create the dataset. The created dataset will be displayed in the dataset list.
Create Datasets from Local Files¶
Log in to the EnOS Management Console and select Data Analytics > AI Studio > Dataset Management.
Select New Dataset > Create from Local Files and complete the following basic information of the dataset:
Dataset Name: Enter the name of the dataset.
Dataset Alias: Enter an alias for the dataset, which is displayed as the name of the dataset in the dataset list.
Dataset Type: Select tabular or file:
tabular type: Select the corresponding file type (Delimited Text or ORC) and make sure that the uploaded files can be parsed correctly.
file type: No file type needs to be selected.
Tags: Enter one or more tags for the dataset. Tags can be in Chinese or English and can be used to search for datasets in the dataset list.
Description: Enter a description of the dataset.
On the Upload File page, select one or more files to upload for creating the dataset. The selected files are displayed in the file list with their file names and file sizes.
After uploading the needed files, complete the file configuration, data preview, Schema settings, and confirmation steps that apply to the selected dataset type. For detailed steps, see Create Datasets from External Data Sources.
Note
Uploading new files overwrites the files that were uploaded previously. To add extra files after some files have been uploaded successfully, upload all the files again.
A single batch of uploaded files must not exceed 1 GB, and the total uploaded file size must not exceed 10 GB. For large files, consider uploading them to HDFS, Blob, or S3 storage and creating the dataset from there.
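The size limits in the note can be checked locally before starting an upload. The sketch below is one way to do that, not a platform feature: `check_batch` and the constants are hypothetical, and it assumes the limits are counted in binary gigabytes (1024^3 bytes), which the source does not specify.

```python
import os
from typing import Iterable

# Assumption: 1 GB is treated as 1024**3 bytes; the source does not say
# whether the limits are decimal or binary.
GB = 1024 ** 3
MAX_BATCH_BYTES = 1 * GB    # limit for a single batch of uploaded files
MAX_TOTAL_BYTES = 10 * GB   # limit for the total uploaded file size


def check_batch(batch_sizes: Iterable[int], already_uploaded: int = 0) -> int:
    """Return the batch size in bytes, or raise ValueError if a limit is exceeded.

    Hypothetical helper: batch_sizes are file sizes in bytes and
    already_uploaded is the size of earlier batches counted toward the total.
    """
    batch_total = sum(batch_sizes)
    if batch_total > MAX_BATCH_BYTES:
        raise ValueError("a single batch must not exceed 1 GB")
    if already_uploaded + batch_total > MAX_TOTAL_BYTES:
        raise ValueError("the total uploaded size must not exceed 10 GB")
    return batch_total


def sizes_of(paths: Iterable[str]) -> list:
    """Read the on-disk size of each local file in bytes."""
    return [os.path.getsize(path) for path in paths]


print(check_batch([200 * 1024 ** 2, 300 * 1024 ** 2]))  # two files, 500 MiB in total
```

For real files, pass `sizes_of(paths)` as the first argument.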
Create Datasets from Operator Output/Input¶
To create a dataset from the input or output files of an operator:
Log in to the EnOS Management Console and select Data Analytics > AI Studio > Dataset Management.
Select New Dataset > Create from input/output files of operators and complete the following basic information of the dataset:
Dataset Name: Enter the name of the dataset.
Dataset Alias: Enter an alias for the dataset, which is displayed as the name of the dataset in the dataset list.
Dataset Type: Select tabular or file:
tabular type: Select the corresponding file type (Delimited Text or ORC) and make sure that the files can be parsed correctly.
file type: No file type needs to be selected.
Tags: Enter one or more tags for the dataset. Tags can be in Chinese or English and can be used to search for datasets in the dataset list.
Description: Enter a description of the dataset.
On the File Selection page, enter the MinIO path of the file. See View the Basic Information and Details of Running Instances for information about where to get the MinIO path.
On the File Configuration page, set column delimiter, character set, escape character, quote character, and so on.
On the Data Preview page, select Preview Data to view the query results (displaying the first 50 data records of the query result only).
On the SCHEMA Settings page, specify alias, attributes, data type, and description for data fields as needed. If you need to reset the Schema information, select the Reset button to restore the default settings.
On the Confirmation page, check the completeness of the dataset configuration. Select Finish to create the dataset. The created dataset will be displayed in the dataset list.
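The File Configuration settings in the procedure above (column delimiter, character set, escape character, quote character) determine how a Delimited Text file is split into fields. To see their effect before configuring a dataset, you can try the same settings locally with Python's standard `csv` module. This is only an approximation: the platform's parser may behave differently, and `parse_delimited` and the sample data are hypothetical.

```python
import csv
import io


def parse_delimited(data: bytes, *, delimiter: str = ",", charset: str = "utf-8",
                    quotechar: str = '"', escapechar: str = "\\") -> list:
    """Decode raw bytes and split them into rows with the given settings.

    Hypothetical helper built on the standard csv module; it does not
    reproduce the platform's own parser.
    """
    text = data.decode(charset)
    reader = csv.reader(
        io.StringIO(text),
        delimiter=delimiter,
        quotechar=quotechar,
        escapechar=escapechar,
        doublequote=False,  # quotes inside fields are escaped, not doubled
    )
    return list(reader)


# Sample file content: pipe-delimited, with a delimiter inside a quoted
# field and escaped quotes inside another.
raw = 'id|name|note\n1|"Pump|A"|ok\n2|"Pump B"|"said \\"hi\\""\n'.encode("utf-8")

for row in parse_delimited(raw, delimiter="|"):
    print(row)
```

If the delimiter or quote character does not match the file, rows come back with the wrong number of fields, which is the kind of problem the Data Preview page helps you catch.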