TensorFlow predefines various datasets, ranging from image classification to language modelling. We can view the collection of datasets with tfds.list_builders() (API):
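A minimal sketch of the call that produces the listing below (assuming the tensorflow_datasets package is installed and imported as tfds):
import tensorflow_datasets as tfds

tfds.list_builders()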
['abstract_reasoning',
'aflw2k3d',
'amazon_us_reviews',
'bair_robot_pushing_small',
'bigearthnet',
...
'wmt18_translate',
'wmt19_translate',
'wmt_t2t_translate',
'wmt_translate',
'xnli']
tfds.load(name, split=split, shuffle_files=True, with_info=True) API
Loads one of the provided datasets; the way to specify the split can be found here. The first return value is a TensorFlow Dataset object, while the second (only returned when with_info=True) is a TensorFlow DatasetInfo object.
dataset, info = tfds.load('iris', split='train', shuffle_files=True, with_info=True)
print(info)
tfds.core.DatasetInfo(
name='iris',
version=1.0.0,
description='This is perhaps the best known database to be found in the pattern recognition
literature. Fisher's paper is a classic in the field and is referenced
frequently to this day. (See Duda & Hart, for example.) The data set contains
3 classes of 50 instances each, where each class refers to a type of iris
plant. One class is linearly separable from the other 2; the latter are NOT
linearly separable from each other.
',
urls=['https://archive.ics.uci.edu/ml/datasets/iris'],
features=FeaturesDict({
'features': Tensor(shape=(4,), dtype=tf.float32),
'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=3),
}),
total_num_examples=150,
splits={
'train': 150,
},
supervised_keys=('features', 'label'),
citation="""@misc{Dua:2019 ,
author = "Dua, Dheeru and Graff, Casey",
year = "2017",
title = "{UCI} Machine Learning Repository",
url = "<http://archive.ics.uci.edu/ml>",
institution = "University of California, Irvine, School of Information and Computer Sciences"
}""",
redistribution_info=,
)
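As a quick sketch of consuming the loaded dataset: the field names 'features' and 'label' come from the FeaturesDict shown above, and take(5) is just an illustrative choice.
for example in dataset.take(5):
    print(example['features'], example['label'])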
TensorFlow provides a useful interface for processing datasets. Through the Dataset object, we gain access to a number of handy functions that facilitate the manipulation of input data.
tf.data.Dataset.from_tensor_slices API, tf.data.Dataset.from_generator API
Creates a Dataset object from tensors or from a generator (it can also be created from TFRecord files; we will cover TFRecord later). We can pass a dictionary of tensors as the parameter, which is helpful for keeping track of the names of the tensors in each instance.
import numpy as np
import tensorflow as tf

X = tf.convert_to_tensor(np.random.normal(0, 1, 500).reshape(100, 5))
y = tf.convert_to_tensor(np.random.normal(0, 1, 100))
dataset = tf.data.Dataset.from_tensor_slices({'X': X, 'y': y})
for instance in dataset:
    pass  # do something with the instance using instance['X'] and instance['y']
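from_generator works analogously; here is a minimal sketch (the generator gen below is illustrative, and output_types tells TensorFlow the dtype of each field yielded by the generator):
def gen():
    # yield 100 instances with the same structure as the example above
    for _ in range(100):
        yield {'X': np.random.normal(0, 1, 5), 'y': np.random.normal(0, 1)}

dataset = tf.data.Dataset.from_generator(
    gen, output_types={'X': tf.float64, 'y': tf.float64})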
dataset.shuffle(buffer_size) API
Shuffles the order of the instances with a given buffer size: the dataset fills a buffer of buffer_size elements and draws randomly from it, so the shuffle is only complete when the buffer is at least as large as the dataset. Returns a new Dataset.
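A minimal sketch using the 100-instance dataset built above (a buffer of 100 covers the whole dataset, so this gives a full shuffle; take(3) is just for illustration):
shuffled = dataset.shuffle(buffer_size=100)
for instance in shuffled.take(3):
    print(instance['X'], instance['y'])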