Dataset Management Overview


A dataset is a reference to the storage location of data in a data source, together with a copy of its metadata. Because a dataset does not store the data itself, it consumes no additional storage and does not affect the integrity of the data source. A dataset is one type of data asset and can serve as the starting point for data processing in AI Studio.
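The idea of a dataset as a reference plus a metadata copy can be sketched as a small record type. This is only an illustration: the class name `DatasetRef`, its fields, and the example bucket path are assumptions and are not part of the AI Studio API.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class DatasetRef:
    """Hypothetical dataset record: a pointer to the data plus a metadata copy."""

    name: str
    source_type: str  # e.g. "S3", "HDFS", "MySQL"
    location: str     # where the data lives inside the data source
    metadata: dict = field(default_factory=dict)  # copied metadata, e.g. schema


# The record holds only the location and metadata, never the data itself.
ref = DatasetRef(
    name="sales_2023",
    source_type="S3",
    location="s3://example-bucket/sales/2023.csv",  # made-up path
    metadata={"format": "CSV", "columns": ["date", "amount"]},
)
```

Because the record carries no payload, creating or deleting it leaves the underlying files untouched, which is why a dataset adds no storage overhead.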


Dataset Management in AI Studio helps data scientists access data quickly when training models. It improves their efficiency by taking care of how to connect to data sources and by hiding the differences in data access paths among those sources.


[Figure: ../_images/arch1.png]

Main Functions

Multiple Data Formats and File Types


Dataset Management supports File and Tabular dataset types:

  • File type datasets reference one or more files in data stores or at public URLs. They can be applied to files of any format or type, which supports a wider range of machine learning solutions, including deep learning.
  • Tabular type datasets parse the provided files or file lists and present the data in tabular form. The data can then be converted into a pandas or Spark DataFrame for model training. Tabular type datasets can be created from CSV, TSV, Parquet, or JSON files, or from SQL query results.
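As a rough illustration of what "presenting data in tabular form" involves, the sketch below parses CSV, TSV, or JSON text into a list of row dictionaries using only the Python standard library. The function name `load_tabular` is hypothetical, and the JSON branch assumes one object per line; the actual parsing done by Dataset Management is not described in this document.

```python
import csv
import io
import json


def load_tabular(text: str, fmt: str) -> list:
    """Parse CSV, TSV, or JSON Lines text into a list of row dicts."""
    fmt = fmt.upper()
    if fmt in ("CSV", "TSV"):
        delimiter = "," if fmt == "CSV" else "\t"
        return list(csv.DictReader(io.StringIO(text), delimiter=delimiter))
    if fmt == "JSON":
        # Assumption: one JSON object per non-empty line.
        return [json.loads(line) for line in text.splitlines() if line.strip()]
    raise ValueError(f"unsupported format: {fmt}")


csv_rows = load_tabular("id,name\n1,pump\n2,valve\n", "CSV")
tsv_rows = load_tabular("id\tname\n1\tpump\n", "TSV")
json_rows = load_tabular('{"id": 1, "name": "pump"}\n', "JSON")
```

Rows in this shape can be handed to `pandas.DataFrame(rows)` when a DataFrame is needed for training. Note that the CSV and TSV branches return every value as a string.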

Multiple Creation Methods


Dataset Management supports multiple methods for creating datasets, including:

  • Creating from external data sources
  • Creating from local files
  • Creating from input/output of operators

Multiple Data Sources


Dataset Management supports creating datasets from multiple data sources, including MySQL, Blob, S3, Hive, and HDFS.
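To show why differing access paths are a burden worth hiding, here is a minimal sketch that turns per-source connection settings into one uniform location string. The builder table, the function `resolve_location`, and the configuration keys are all hypothetical; they are not how AI Studio represents data sources, and Blob and Hive are omitted for brevity.

```python
# Hypothetical path builders for some of the source types named above.
PATH_BUILDERS = {
    "S3": lambda c: f"s3://{c['bucket']}/{c['key']}",
    "HDFS": lambda c: f"hdfs://{c['host']}:{c['port']}{c['path']}",
    "MySQL": lambda c: f"mysql://{c['host']}:{c['port']}/{c['database']}",
}


def resolve_location(source_type: str, config: dict) -> str:
    """Build a location string for a source type from its settings."""
    try:
        builder = PATH_BUILDERS[source_type]
    except KeyError:
        raise ValueError(f"unsupported data source: {source_type}") from None
    return builder(config)


s3_location = resolve_location("S3", {"bucket": "example-bucket", "key": "a.csv"})
hdfs_location = resolve_location(
    "HDFS", {"host": "namenode", "port": 8020, "path": "/data/a.csv"}
)
```

With a lookup of this kind, code that consumes a dataset only ever sees a single location string, regardless of which source the data came from.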

Data Sharing and Collaboration


Users in the same organization can share datasets and collaborate on data, so they do not need to repeatedly create data source connections to obtain data.

Version Management


Dataset Management supports version management of datasets. Dataset versioning tags the state of the data so that users can experiment with a specific version of a dataset or recreate it in the future.
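One way to picture "tagging the data state" is a version log that records, for each version, where the data was and a fingerprint of its content. The sketch below is an assumption for illustration only: the class `VersionedDataset`, its methods, and the use of SHA-256 fingerprints are not taken from AI Studio.

```python
import hashlib


class VersionedDataset:
    """Hypothetical in-memory version log for a single dataset."""

    def __init__(self, name: str):
        self.name = name
        self._versions = []  # list of (location, fingerprint) tuples

    def register(self, location: str, content: bytes) -> int:
        """Record a new version and return its number (starting at 1)."""
        fingerprint = hashlib.sha256(content).hexdigest()
        self._versions.append((location, fingerprint))
        return len(self._versions)

    def get(self, version: int):
        """Return the (location, fingerprint) pair for a version number."""
        if not 1 <= version <= len(self._versions):
            raise KeyError(f"no such version: {version}")
        return self._versions[version - 1]


ds = VersionedDataset("sales")
v1 = ds.register("s3://example-bucket/sales/v1.csv", b"date,amount\n")
v2 = ds.register("s3://example-bucket/sales/v2.csv", b"date,amount,region\n")
```

Calling `ds.get(1)` later returns the location and fingerprint recorded for the first version, which is enough to rerun an experiment against the same data state and to detect whether the underlying content has changed since.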

Sample Datasets


Dataset Management provides several common sample datasets that users can use when training algorithm models.