Create a Dataset


This topic describes how to create datasets.

Create Datasets from External Data Sources


Dataset types supported by each data source connection are as follows:


Data Source Connection    Supported Dataset Type
MySQL                     Tabular Dataset
Hive                      Tabular Dataset
Blob                      Tabular Datasets (Delimited Text or ORC file format), File Dataset
S3                        Tabular Datasets (Delimited Text or ORC file format), File Dataset
HDFS                      Tabular Datasets (Delimited Text or ORC file format), File Dataset
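
The table above can also be written down as a small lookup, for example to check a planned dataset before opening the console. This is only a sketch: the names SUPPORTED_TYPES, TABULAR_FILE_FORMATS, and is_supported are hypothetical and are not part of any EnOS API.

```python
# Hypothetical lookup that mirrors the table above; not an EnOS API.
SUPPORTED_TYPES = {
    "MySQL": {"tabular"},
    "Hive": {"tabular"},
    "Blob": {"tabular", "file"},
    "S3": {"tabular", "file"},
    "HDFS": {"tabular", "file"},
}

# File formats accepted for tabular datasets on Blob, S3, and HDFS.
TABULAR_FILE_FORMATS = {"Delimited Text", "ORC"}


def is_supported(source: str, dataset_type: str) -> bool:
    """Return True if the data source connection supports the dataset type."""
    return dataset_type in SUPPORTED_TYPES.get(source, set())


print(is_supported("Hive", "file"))
print(is_supported("S3", "file"))
```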

Create Datasets from MySQL or Hive Data Sources


To create a dataset from a MySQL or Hive data source connection (using a MySQL data source as an example):

  1. Log in to the EnOS Management Console and select Data Analytics > AI Studio > Dataset Management.

  2. Select New Dataset > Create from Data Sources and complete the following basic information of the dataset:

    • Dataset Name: Enter the name of the dataset.
    • Dataset Alias: Enter an alias for the dataset. The alias is displayed as the dataset name in the dataset list.
    • Data Source: Select the name of a data source connection registered on the Resource Configuration > Connection Configuration page. The system checks the connection automatically.
    • Dataset Type: Select tabular.
    • Tags: Enter one or more tags for the dataset. Tags can be in Chinese or English and can be used to search for datasets in the dataset list.
    • Description: Enter a description of the dataset.
  3. On the Data Source page, enter the SQL query statement and the query timeout value (1 to 600 seconds). To query dynamic data, enter a statement with variables in SQL Query: write each variable as %s, and specify the default values of the variables, in order, in Advanced Configuration. For example, select * from table where key_a = %s.

    Note

    You can use only a single SQL query statement when creating a tabular dataset from a data source connection.

  4. On the Data Preview page, select Preview Data to view the query results. Only the first 50 records of the query result are displayed.

  5. On the SCHEMA Settings page, specify the alias, attributes, data type, and description of each data field as needed. To restore the default schema settings, select Reset.

  6. On the Confirmation page, check the completeness of the dataset configuration. Select Finish to create the dataset. The created dataset will be displayed in the dataset list.
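
The %s convention in step 3 pairs each placeholder with a default value by position. The Python sketch below only illustrates that ordering. It does not reproduce how EnOS performs the substitution, which this topic does not describe. The function name fill_placeholders and the second variable key_b are hypothetical.

```python
from typing import Sequence


def fill_placeholders(sql_template: str, defaults: Sequence[str]) -> str:
    """Replace each %s, left to right, with the default value at the same position."""
    expected = sql_template.count("%s")
    if expected != len(defaults):
        raise ValueError(
            f"template has {expected} placeholders but {len(defaults)} defaults were given"
        )
    # Split on the placeholder and interleave the pieces with the defaults.
    parts = sql_template.split("%s")
    pieces = [parts[0]]
    for value, tail in zip(defaults, parts[1:]):
        pieces.append(value)
        pieces.append(tail)
    return "".join(pieces)


# key_b is a hypothetical second variable added for illustration.
template = "select * from table where key_a = %s and key_b = %s"
print(fill_placeholders(template, ["1", "'abc'"]))
# prints: select * from table where key_a = 1 and key_b = 'abc'
```

The first default fills the first %s, the second default fills the second, and so on, so the order of the defaults must match the order of the placeholders in the statement.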

Create Datasets from Blob, S3, or HDFS Data Sources


To create a dataset from a Blob, S3, or HDFS data source connection (using an HDFS data source to create a file dataset as an example):

  1. Log in to the EnOS Management Console and select Data Analytics > AI Studio > Dataset Management.
  2. Select New Dataset > Create from Data Sources and complete the following basic information of the dataset:
    • Dataset Name: Enter the name of the dataset.
    • Dataset Alias: Enter an alias for the dataset. The alias is displayed as the dataset name in the dataset list.
    • Data Source: Select the name of a data source connection registered on the Resource Configuration > Connection Configuration page. The system checks the connection automatically.
    • Dataset Type: Select file.
    • Tags: Enter one or more tags for the dataset. Tags can be in Chinese or English and can be used to search for datasets in the dataset list.
    • Description: Enter a description of the dataset.
  3. On the Data Source page, enter the path of the file to be used.
  4. On the Confirmation page, check the completeness of the dataset configuration. Select Finish to create the dataset. The created dataset will be displayed in the dataset list.

Create Datasets from Local Files


  1. Log in to the EnOS Management Console and select Data Analytics > AI Studio > Dataset Management.
  2. Select New Dataset > Create from Local Files and complete the following basic information of the dataset:
    • Dataset Name: Enter the name of the dataset.
    • Dataset Alias: Enter an alias for the dataset. The alias is displayed as the dataset name in the dataset list.
    • Dataset Type: Select tabular or file:
      • tabular type: Select the corresponding file type (Delimited Text or ORC) and ensure that the uploaded files can be parsed correctly.
      • file type: No file type selection is required.
    • Tags: Enter one or more tags for the dataset. Tags can be in Chinese or English and can be used to search for datasets in the dataset list.
    • Description: Enter a description of the dataset.
  3. On the Upload File page, select one or more files to upload for creating the dataset. The selected files are displayed in the file list with their file names and sizes.
  4. After uploading the files, complete the file configuration, data preview, schema settings, and confirmation steps that apply to the selected dataset type. For detailed steps, see Create Datasets from External Data Sources.

Note

  • Uploading new files overwrites the files that were uploaded previously. To add more files after some files have been uploaded successfully, upload all the files again.
  • A single batch of uploaded files must not exceed 1 GB, and the total size of uploaded files must not exceed 10 GB. For larger files, consider uploading them to HDFS, Blob, or S3 storage and creating the dataset from that data source.
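
The two size limits in the note can be checked before uploading. The sketch below rests on assumptions this topic does not confirm: that 1 GB means 1024^3 bytes, and that the total is the sum of all batches. The function name check_upload_plan is hypothetical.

```python
GB = 1024 ** 3  # assumption: binary gigabyte; the console may count differently
MAX_BATCH_BYTES = 1 * GB
MAX_TOTAL_BYTES = 10 * GB


def check_upload_plan(batches):
    """Return a list of problems; an empty list means both limits are met.

    batches is a list of batches, each a list of file sizes in bytes.
    """
    problems = []
    total_bytes = 0
    for index, batch in enumerate(batches, start=1):
        batch_bytes = sum(batch)
        total_bytes += batch_bytes
        if batch_bytes > MAX_BATCH_BYTES:
            problems.append(f"batch {index} is {batch_bytes} bytes, over the 1 GB batch limit")
    if total_bytes > MAX_TOTAL_BYTES:
        problems.append(f"total is {total_bytes} bytes, over the 10 GB total limit")
    return problems


# Two batches: the first fits, the second is one byte over the batch limit.
print(check_upload_plan([[GB // 2, GB // 4], [GB, 1]]))
```

If the plan reports a problem, the files are candidates for HDFS, Blob, or S3 storage rather than a local upload.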

Create Datasets from Operator Output/Input


To create a dataset from an operator input or output file:

  1. Log in to the EnOS Management Console and select Data Analytics > AI Studio > Dataset Management.
  2. Select New Dataset > Create from input/output files of operators and complete the following basic information of the dataset:
    • Dataset Name: Enter the name of the dataset.
    • Dataset Alias: Enter an alias for the dataset. The alias is displayed as the dataset name in the dataset list.
    • Dataset Type: Select tabular or file:
      • tabular type: Select the corresponding file type (Delimited Text or ORC) and ensure that the files can be parsed correctly.
      • file type: No file type selection is required.
    • Tags: Enter one or more tags for the dataset. Tags can be in Chinese or English and can be used to search for datasets in the dataset list.
    • Description: Enter a description of the dataset.
  3. On the File Selection page, enter the MinIO path of the file. To find out where to get the MinIO path, see View the Basic Information and Details of Running Instances.
  4. On the File Configuration page, set the column delimiter, character set, escape character, quote character, and other parsing options.
  5. On the Data Preview page, select Preview Data to view the query results. Only the first 50 records of the query result are displayed.
  6. On the SCHEMA Settings page, specify the alias, attributes, data type, and description of each data field as needed. To restore the default schema settings, select Reset.
  7. On the Confirmation page, check the completeness of the dataset configuration. Select Finish to create the dataset. The created dataset will be displayed in the dataset list.
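
The File Configuration options in step 4 decide how a delimited text file is split into fields. The sketch below uses Python's standard csv module to show the effect of the column delimiter, character set, quote character, and escape character on a small sample. It illustrates the general technique only and is not the parser that AI Studio uses. The sample data and the function name parse_delimited are hypothetical.

```python
import csv
import io


def parse_delimited(raw_bytes, delimiter, charset, quote_char, escape_char):
    """Decode the bytes with the character set, then split rows into fields."""
    text = raw_bytes.decode(charset)
    reader = csv.reader(
        io.StringIO(text, newline=""),
        delimiter=delimiter,
        quotechar=quote_char,
        escapechar=escape_char,
    )
    return list(reader)


# Hypothetical sample: one quoted field and one escaped field both contain the delimiter.
sample = 'id|name|note\n1|"Smith| John"|plain\n2|Lee|a\\|b\n'.encode("utf-8")

rows = parse_delimited(sample, delimiter="|", charset="utf-8", quote_char='"', escape_char="\\")
for row in rows:
    print(row)
```

With the wrong delimiter or quote character, the same file yields a different number of columns per row. Previewing the data in step 5 is a quick way to confirm the settings.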