Load The Data

Overview

To use the data in our TensorFlow records, we need to load it into a TensorFlow Dataset. The initial data loading is shared by all the examples and is described here.

A full description of loading and parsing data from TensorFlow records is out of scope for this tutorial, but I will try to go into as much detail as I can. For further details, see the official TensorFlow documentation. To load a dataset from a TensorFlow records file, we need several data structures and functions.

  1. We need to describe the data types for each of our features.

  2. We need a method that will parse a single example (i.e., a query) given the feature descriptions in (1).

  3. Optionally, we need a scaler to apply to the parsed features.

Describe The Features

Although ranking data of this sort is not sequential in the common usage of the term, it is useful to think of each query as an unordered sequence of documents. The features of sequential data are often divided into two broad categories.

The first category is context features, which encode information relevant to every item in the sequence (i.e., every document in the query). There are no (real) context features in the MSLR dataset (or in any other publicly available LETOR dataset I am aware of). In general, however, these might be things like the user id of the person making the query, the time of the query, the query terms, etc. For debugging and evaluation purposes, I encode the query id as a context feature so that we can track instances by id in the TensorFlow Dataset pipeline, for example to identify which queries are most difficult to rank.

The second category is sequence features, which encode the information specific to each document and are the type included in the MSLR dataset.

Each category of feature is encoded as a dictionary. The keys of the dictionary are the feature names. Each value is an instance of a TensorFlow class that describes the shape and data type of the feature.

The method examples.pipeline.make_feature_description() will create and return the appropriate dictionaries.

The body of this function is very simple:

context_features = {'qid': tf.io.FixedLenFeature([], dtype=tf.int64)}

# The input features X
sequence_features = {
    f'{k+1}': tf.io.FixedLenSequenceFeature([], dtype=tf.float32)
    for k in range(n_features)
}

# The target feature Y
sequence_features['target'] = tf.io.FixedLenSequenceFeature([], dtype=tf.int64)
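
To make these descriptions concrete, here is a minimal sketch of a round trip: it builds one toy query as a tf.train.SequenceExample, serializes it, and parses it back with the descriptions above. The toy values, the N_FEATURES = 3 setting, and the helper names float_list and int_list are assumptions made for illustration. The layout (one scalar per document in each feature list) is what the FixedLenSequenceFeature([], ...) descriptions imply, not necessarily how the conversion code writes the records.

```python
import tensorflow as tf

N_FEATURES = 3  # toy value for illustration only


def float_list(values):
    """Hypothetical helper: one float Feature per document."""
    return tf.train.FeatureList(feature=[
        tf.train.Feature(float_list=tf.train.FloatList(value=[v])) for v in values
    ])


def int_list(values):
    """Hypothetical helper: one int64 Feature per document."""
    return tf.train.FeatureList(feature=[
        tf.train.Feature(int64_list=tf.train.Int64List(value=[v])) for v in values
    ])


# A toy query (qid 7) with two documents, three features each, plus relevance labels.
proto = tf.train.SequenceExample(
    context=tf.train.Features(feature={
        'qid': tf.train.Feature(int64_list=tf.train.Int64List(value=[7])),
    }),
    feature_lists=tf.train.FeatureLists(feature_list={
        '1': float_list([0.1, 0.2]),
        '2': float_list([1.0, 2.0]),
        '3': float_list([10.0, 20.0]),
        'target': int_list([2, 0]),
    }),
).SerializeToString()

# The same feature descriptions as in the text.
context_features = {'qid': tf.io.FixedLenFeature([], dtype=tf.int64)}
sequence_features = {
    f'{k+1}': tf.io.FixedLenSequenceFeature([], dtype=tf.float32)
    for k in range(N_FEATURES)
}
sequence_features['target'] = tf.io.FixedLenSequenceFeature([], dtype=tf.int64)

# Parsing returns one dictionary per category of feature.
contextual, sequential = tf.io.parse_single_sequence_example(
    proto, context_features, sequence_features
)

qid = int(contextual['qid'])                        # scalar context feature
first_feature_shape = tuple(sequential['1'].shape)  # one value per document
targets = sequential['target'].numpy().tolist()
```

Each sequence feature comes back as a one-dimensional tensor with one entry per document, while the context feature comes back as a scalar.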

Parse A Single Query Instance

Parsing an instance is a two-step process. First, we obtain a raw example from the serialized record with tensorflow.io.parse_single_sequence_example(), passing the feature descriptions obtained above. Then we convert the features into tensors of the appropriate shape. The result is returned as a tuple whose first element is the X data and whose second element is the target y value. The X and y data do not have to be single tensors; they can be nested dictionaries or lists to provide more flexibility. Note, however, that Keras’ built-in training and evaluation methods will choke on nested dictionaries, so it’s best to avoid nesting them if you ever plan on using Keras. In these examples I organize the X data as a flat dictionary with seven key/value pairs. Although the current models only use one of these, I leave the others as placeholders for datasets that might have a richer feature representation.

The method examples.pipeline.parse_example() will create and return the (X, y) tuple. The body for this method is presented below:

# Parse the serialized example using the provided TensorFlow function.
example = tf.io.parse_single_sequence_example(proto, context_desc, sequence_desc)

# The result is actually a tuple of the contextual and sequential features
contextual, sequential = example

# Extract the target from the sequential features and remove it from the dictionary.
# It has to be removed before sorting below, because its key is not an integer.
target = sequential.pop('target')

# In the raw example returned, the features are actually encoded as a dictionary that maps
# from a feature name to a value. This line extracts the values into a list sorted by the
# feature name (as an integer) and then converts the list into a tensor of shape
# [n_documents, n_features] using tf.stack.
dense_features = tf.stack(
    [a for _, a in sorted(sequential.items(), key=lambda x: int(x[0]))],
    axis=1
)

# Now return the X and y values.
return (
    {
        # This is not a real feature and will not be used by the models, but it could be
        # helpful to pass it along in the pipeline so we can associate examples back to their
        # identifier.
        'context_meta_qid': contextual['qid'],

        # In other datasets there might be context features that are categorical, which could
        # be encoded as 1-hot vectors or some other representation (e.g., suitable for an
        # embedding layer). The 0 signifies that there are none of these features.
        'context_one_hot': 0.,

        # Similarly there might be context features that allow multiple categories of a
        # categorical variable to be active simultaneously.
        # For example, a bag of words (although this particular example is probably better
        # handled using a different technique, such as the hashing trick).
        # I typically call these multi-hot, although there does not seem to be a consensus
        # on the meaning of the term.
        'context_multi_hot': 0.,

        # These are the remaining "dense" or standard contextual features of the data.
        'context_dense': 0.,

        # The remaining features are analogous to the contextual features but are for the
        # sequential ones.
        'sequence_one_hot': 0.,
        'sequence_multi_hot': 0.,
        'sequence_dense': dense_features
    },
    tf.cast(target, dtype=tf.float32)
)
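
The sort by integer key in parse_example matters because the feature names are strings, and a plain string sort would put '10' before '2' and scramble the column order. The following sketch shows the difference using only plain Python. The toy dictionary is made up for illustration, and the zip-based transpose is just a stand-in for what tf.stack(..., axis=1) does with real tensors.

```python
# Feature names are strings such as '1', '2', ..., '10'.
names = ['1', '2', '10', '3']

lexicographic = sorted(names)      # string order: '10' sorts before '2'
numeric = sorted(names, key=int)   # the order used in parse_example

# Toy per-feature columns: each value holds one entry per document.
sequential = {'10': [0.5, 0.6], '2': [3.0, 4.0], '1': [1.0, 2.0]}
columns = [col for _, col in sorted(sequential.items(), key=lambda kv: int(kv[0]))]

# Transposing mimics tf.stack(..., axis=1): one row per document,
# one column per feature, with columns in numeric feature order.
rows = [list(row) for row in zip(*columns)]
```

With the numeric sort, column j of the dense tensor always corresponds to the same feature, no matter what order the dictionary happens to be in.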

Apply A Scaler

Optionally, we can apply one of the scaling methods to the data. This is a bit tricky: there do not seem to be any (simple) built-in methods for preprocessing data in TensorFlow. TensorFlow Transform does seem to provide a rich suite of tools for preprocessing data, but it comes at a significant cost. It is quite heavyweight, relying on external dependencies such as Apache Beam, and the learning curve is quite steep.

In these examples I’ve taken a different approach. When converting the dataset, we can optionally fit a number of scalers implemented in sklearn. These scalers cannot be used directly in the dataset pipeline (at least not efficiently). However, the deletor.preprocessing module contains corresponding versions of these scalers that implement the transform and inverse_transform methods using pure TensorFlow operations, which makes them suitable for use with tf.data.Dataset.
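
As a rough illustration of the idea, here is a minimal sketch of a standardizing scaler written with TensorFlow operations only. The class name StandardScalerTF and the toy statistics are hypothetical; this is not the deletor.preprocessing implementation. The sketch assumes the mean and scale would be taken from a fitted sklearn StandardScaler (its mean_ and scale_ attributes), but plain lists are passed here so that it runs without sklearn.

```python
import numpy as np
import tensorflow as tf


class StandardScalerTF:
    """Hypothetical stand-in: standardization using only TensorFlow ops."""

    def __init__(self, mean, scale):
        # `mean` and `scale` play the role of the `mean_` and `scale_`
        # attributes of a fitted sklearn StandardScaler.
        self.mean = tf.constant(mean, dtype=tf.float32)
        self.scale = tf.constant(scale, dtype=tf.float32)

    def transform(self, x):
        # Broadcasts over the document axis: x has shape [n_documents, n_features].
        return (x - self.mean) / self.scale

    def inverse_transform(self, x):
        return x * self.scale + self.mean


# Toy statistics; in practice these would come from the scaler fitted at conversion time.
scaler = StandardScalerTF(mean=[1.0, 10.0], scale=[2.0, 5.0])

# Two toy queries with different numbers of documents and two features each.
queries = [
    np.array([[1.0, 10.0], [3.0, 20.0]], dtype=np.float32),
    np.array([[5.0, 0.0]], dtype=np.float32),
]
dataset = tf.data.Dataset.from_generator(
    lambda: iter(queries),
    output_signature=tf.TensorSpec(shape=(None, 2), dtype=tf.float32),
)

# Because transform only uses TensorFlow ops, it can be passed straight to Dataset.map.
scaled = [x.numpy() for x in dataset.map(scaler.transform)]
restored = scaler.inverse_transform(tf.constant(scaled[0])).numpy()
```

Since the statistics are stored as constants, the transform runs inside the tf.data graph and never has to call back into Python.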

Putting It All Together

Here is the code for loading a TensorFlow records file into a tf.data.Dataset. It is functionally equivalent to the full version, but slightly simplified for readability.

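import shelve
from functools import partial
from typing import Optional, Tuple

import tensorflow as tf

# N_FEATURES, make_feature_description, parse_example and make_scaler are defined
# elsewhere in the project.

def load_dataset(
        dataset_filename: str,
        # An optional (scaler_filename, scaler_name) pair.
        scalers: Optional[Tuple[str, str]] = None,
        n_features: int = N_FEATURES
    ):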
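    # Try to infer the compression type from the filename suffix. Note that 'ZLIB' refers
    # to a zlib-compressed records file, which this code expects to be named '.zip'; it
    # is not a zip archive.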
    compression_type = (
        'GZIP' if dataset_filename.endswith('.gz') else
        'ZLIB' if dataset_filename.endswith('.zip') else
        None
    )

    # Get the feature description templates as described above.
    context_desc, sequence_desc = make_feature_description(n_features)

    # Create the initial Dataset object.
    dataset = tf.data.TFRecordDataset(dataset_filename, compression_type)

    # Parse the examples
    # I use a partial function for convenience here.
    # That function is then passed to Dataset.map, which will apply the parsing function to
    # each example read from the tensorflow records file. When we iterate over the resulting
    # Dataset we will get (X, y) pairs where the X value has the dictionary structure
    # described above.
    parse_fn = partial(parse_example, context_desc=context_desc, sequence_desc=sequence_desc)
    dataset = dataset.map(parse_fn, num_parallel_calls=tf.data.experimental.AUTOTUNE)

    # Optionally apply a scaler
    if scalers:
        scaler_filename, scaler_name = scalers

        with shelve.open(scaler_filename, 'r') as db:
            # This is a function that simply constructs one of the scaler classes
            # in the deletor.preprocessing module that corresponds to the scaler_name.
            scaler = make_scaler(scaler_name, db[scaler_name])

            # A function we can pass to Dataset.map that will apply our scaler to the
            # appropriate features in our X dictionary.
            def normalize(x, y):
                x['sequence_dense'] = scaler.transform(x['sequence_dense'])
                return x, y

            # Now actually apply the scaler to the data in the dataset.
            dataset = dataset.map(normalize)

    return dataset
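
Two small pieces of load_dataset can be tried out without TensorFlow: the suffix-based compression inference and the use of partial to turn the three-argument parser into the one-argument function that Dataset.map expects. In the sketch below, infer_compression_type is a hypothetical helper name, parse_example is a stand-in that only reports its arguments, and Python's built-in map plays the role of Dataset.map.

```python
from functools import partial
from typing import Optional


def infer_compression_type(filename: str) -> Optional[str]:
    """Hypothetical helper mirroring the suffix check in load_dataset."""
    if filename.endswith('.gz'):
        return 'GZIP'
    if filename.endswith('.zip'):
        # Same convention as load_dataset: '.zip' is taken to mean ZLIB.
        return 'ZLIB'
    return None


def parse_example(proto, context_desc, sequence_desc):
    """Stand-in for the real parser: just reports what it was called with."""
    return proto, sorted(context_desc), sorted(sequence_desc)


# Bind the descriptions once; the result takes only the serialized record.
parse_fn = partial(
    parse_example,
    context_desc={'qid': None},
    sequence_desc={'1': None, 'target': None},
)

# The built-in map stands in for Dataset.map, which calls parse_fn once per record.
parsed = list(map(parse_fn, ['record-a', 'record-b']))
```

Binding the descriptions with partial keeps parse_example free of global state, so the same function can be reused with a different number of features.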