Data Preparation

Overview

The first step in working with the toolkit is to prepare the data so that it is compatible with the scoring models. In all of the examples we use the MSLR-WEB30K dataset. This data is in SVM^light format, which has the following structure:

<line> .=. <target> <feature>:<value> <feature>:<value> ... <feature>:<value> # <info>
<target> .=. +1 | -1 | 0 | <float>
<feature> .=. <integer> | "qid"
<value> .=. <float>
<info> .=. <string>

And here is a small example of what some actual data might look like:

qid:1 1:3 2:3 3:0 4:0 5:3 6:1
qid:1 1:3 2:0 3:3 4:0 5:3 6:1
qid:6 1:3 2:0 3:3 4:1 5:3 6:1
qid:6 1:3 2:0 3:3 4:1 5:3 6:1
qid:6 1:3 2:0 3:3 4:0 5:3 6:1

Each row describes a single document. In the full format the first column is the relevance label (0 to 4 in the MSLR datasets), which the sample rows above leave out. The next column, qid, indicates the query the document belongs to. In this example there are 2 queries: query 1 has 2 documents and query 6 has 3 documents. The remaining columns describe the features as key/value pairs separated by a colon: the first token is the feature index (starting at 1 in the MSLR dataset) and the second token is the value. This sample has only 6 features, although the real MSLR dataset contains 136. In the MSLR dataset every feature is present for every document, but the SVM^light format does not require this.
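To make the format concrete, here is a minimal sketch of a parser for a single SVM^light row. It is not the toolkit's own reader; the function name parse_svmlight_line and the sample row (including its relevance label of 2 and its trailing comment) are made up for illustration.

```python
from typing import Dict, Optional, Tuple


def parse_svmlight_line(line: str) -> Tuple[float, Optional[int], Dict[int, float]]:
    """Parse one SVM^light row into (target, qid, {feature_index: value})."""
    # Drop the optional trailing "# <info>" comment.
    body = line.split("#", 1)[0].strip()
    tokens = body.split()

    # The first token is the target (the relevance label in MSLR).
    target = float(tokens[0])

    qid: Optional[int] = None
    features: Dict[int, float] = {}
    for token in tokens[1:]:
        key, value = token.split(":", 1)
        if key == "qid":
            qid = int(value)
        else:
            features[int(key)] = float(value)
    return target, qid, features


# Hypothetical row: relevance label 2, query 1, six features.
row = "2 qid:1 1:3 2:3 3:0 4:0 5:3 6:1 # illustrative"
target, qid, features = parse_svmlight_line(row)
print(target, qid, features)
```

Because features are stored in a dictionary keyed by index, a row that omits some features parses without any special handling, which matches the sparse nature of the format.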

Data Conversion

This data needs to be converted to tensors that can be fed into the scoring models. There are several ways to accomplish this. For the examples in this tutorial we will convert the data into TFRecords, which can then be used with the standard TensorFlow input pipeline built on Datasets.

After downloading the MSLR dataset you can run the build_tfrecords.py script on an MSLR dataset file to convert it to TFRecords.

Assuming the path to the root directory of the MSLR dataset is $MSLR and the directory where you would like to save the TFRecords is $TFR, run the following command to convert the training data for Fold1.

python -m examples.build_tfrecords      \
    --input-file $MSLR/Fold1/train.txt  \
    --output-file $TFR/train.tfr        \
    --scaler-file $TFR/scalers.db       \
    --compression-type GZIP

Now repeat the process for the validation and test data.
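As a rough sketch of that repetition, the snippet below builds the equivalent command lines for the other two splits. It assumes the fold directory contains vali.txt and test.txt alongside train.txt, as the MSLR-WEB30K folds do. The helper build_command, the placeholder directories /data/mslr and /data/tfr, and the output names vali.tfr and test.tfr are illustrative choices, not part of the toolkit. The optional --scaler-file flag is left out by default on the assumption that scalers are fitted on the training data only.

```python
import shlex
from typing import List, Optional


def build_command(mslr: str, tfr: str, split: str,
                  scaler_file: Optional[str] = None) -> List[str]:
    """Build the build_tfrecords invocation for one split of Fold1."""
    cmd = [
        "python", "-m", "examples.build_tfrecords",
        "--input-file", f"{mslr}/Fold1/{split}.txt",
        "--output-file", f"{tfr}/{split}.tfr",
        "--compression-type", "GZIP",
    ]
    # --scaler-file is optional; only add it when a path is given.
    if scaler_file is not None:
        cmd += ["--scaler-file", scaler_file]
    return cmd


# Placeholder directories standing in for $MSLR and $TFR.
for split in ("vali", "test"):
    print(shlex.join(build_command("/data/mslr", "/data/tfr", split)))
```

Each returned list can also be passed directly to subprocess.run(cmd, check=True) to run the conversion from Python.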

The output for the training data of Fold1 should look similar to this:

2020-07-15 12:47:21,860 [buildrec]:147 INFO Writing sequence:     0
2020-07-15 12:50:01,619 [buildrec]:147 INFO Writing sequence:  1000
2020-07-15 12:52:58,341 [buildrec]:147 INFO Writing sequence:  2000
2020-07-15 12:55:56,027 [buildrec]:147 INFO Writing sequence:  3000
2020-07-15 12:59:09,241 [buildrec]:147 INFO Writing sequence:  4000
2020-07-15 13:02:05,667 [buildrec]:147 INFO Writing sequence:  5000
2020-07-15 13:04:23,887 [buildrec]:147 INFO Writing sequence:  6000
2020-07-15 13:07:02,358 [buildrec]:147 INFO Writing sequence:  7000
2020-07-15 13:09:51,110 [buildrec]:147 INFO Writing sequence:  8000
2020-07-15 13:12:50,987 [buildrec]:147 INFO Writing sequence:  9000
2020-07-15 13:15:51,253 [buildrec]:147 INFO Writing sequence: 10000
2020-07-15 13:19:04,765 [buildrec]:147 INFO Writing sequence: 11000
2020-07-15 13:21:26,715 [buildrec]:147 INFO Writing sequence: 12000
2020-07-15 13:23:57,307 [buildrec]:147 INFO Writing sequence: 13000
2020-07-15 13:26:40,790 [buildrec]:147 INFO Writing sequence: 14000
2020-07-15 13:29:40,815 [buildrec]:147 INFO Writing sequence: 15000
2020-07-15 13:32:35,545 [buildrec]:147 INFO Writing sequence: 16000
2020-07-15 13:35:57,254 [buildrec]:147 INFO Writing sequence: 17000
2020-07-15 13:38:25,885 [buildrec]:147 INFO Writing sequence: 18000
2020-07-15 13:40:31,136 [buildrec]:151 INFO Finished writing 18918 sequences
2020-07-15 13:40:31,173 [buildrec]:158 INFO Writing scalers
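The log counts sequences rather than documents. One plausible reading, which is an assumption here rather than something the log states, is that each sequence holds the documents of a single query. Under that assumption, the grouping step could be sketched as follows, using the sample rows from the overview; qid_of and group_by_query are hypothetical helper names.

```python
from itertools import groupby
from typing import Iterable, Iterator, List, Tuple


def qid_of(row: str) -> int:
    """Return the query id from the qid:<n> token of a row."""
    for token in row.split():
        if token.startswith("qid:"):
            return int(token[4:])
    raise ValueError(f"no qid token in row: {row!r}")


def group_by_query(rows: Iterable[str]) -> Iterator[Tuple[int, List[str]]]:
    """Yield (qid, rows) for each run of consecutive rows sharing a qid."""
    # groupby only merges adjacent rows, so this relies on the file
    # listing all documents of a query together.
    for qid, group in groupby(rows, key=qid_of):
        yield qid, list(group)


rows = [
    "qid:1 1:3 2:3 3:0 4:0 5:3 6:1",
    "qid:1 1:3 2:0 3:3 4:0 5:3 6:1",
    "qid:6 1:3 2:0 3:3 4:1 5:3 6:1",
    "qid:6 1:3 2:0 3:3 4:1 5:3 6:1",
    "qid:6 1:3 2:0 3:3 4:0 5:3 6:1",
]
sequences = list(group_by_query(rows))
for qid, docs in sequences:
    print(qid, len(docs))
```

For the five sample rows this yields two groups: query 1 with 2 documents and query 6 with 3 documents.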

Command Line Options

usage: examples.build_tfrecords.py [-h] --input-file INPUT_FILE --output-file
                                   OUTPUT_FILE [--scaler-file SCALER_FILE]
                                   [--limit LIMIT]
                                   [--compression-type {GZIP,ZLIB}]
                                   [--compression-level COMPRESSION_LEVEL]

Named Arguments

--input-file

The path where the Microsoft Learning To Rank file is saved.

--output-file

The path where the TensorFlow records file will be saved.

--scaler-file

Fit several sklearn scalers to the data and save them to this file. (minimax: MinMaxScaler, standard: StandardScaler, robust: RobustScaler, power: PowerTransformer)
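To illustrate what a fitted scaler stores and applies, here is a small pure-Python sketch of min-max scaling, the transform that scikit-learn's MinMaxScaler performs with its default (0, 1) range. It is a stand-in written for this page, not the script's code and not the scikit-learn implementation; the class name MinMaxSketch and the toy data are hypothetical.

```python
from typing import List, Sequence


class MinMaxSketch:
    """Hypothetical stand-in that learns a per-feature min and max."""

    def fit(self, rows: Sequence[Sequence[float]]) -> "MinMaxSketch":
        # Transpose the rows so each tuple holds one feature's values.
        columns = list(zip(*rows))
        self.mins = [min(column) for column in columns]
        self.maxs = [max(column) for column in columns]
        return self

    def transform(self, rows: Sequence[Sequence[float]]) -> List[List[float]]:
        scaled_rows = []
        for row in rows:
            scaled = []
            for x, lo, hi in zip(row, self.mins, self.maxs):
                span = hi - lo
                # In this sketch a constant feature maps to 0.0.
                scaled.append((x - lo) / span if span else 0.0)
            scaled_rows.append(scaled)
        return scaled_rows


# Toy training data: two features, the second one constant.
train = [[0.0, 10.0], [5.0, 10.0], [10.0, 10.0]]
scaler = MinMaxSketch().fit(train)
scaled = scaler.transform(train)
print(scaled)
```

The point of saving fitted scalers is that the statistics learned from the training data (here, the per-feature minimum and maximum) can be reapplied unchanged to the validation and test data.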

--limit

This option takes an integer argument that limits the number of documents read from the file. Once this many documents have been read the script will terminate. This can be useful for creating smaller datasets for debugging.
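The effect of a document limit can be sketched with itertools.islice. This is an illustration of the idea only: read_documents is a hypothetical helper, and how the script itself treats a limit that falls in the middle of a query is not described here.

```python
from itertools import islice
from typing import Iterable, Iterator, Optional


def read_documents(lines: Iterable[str],
                   limit: Optional[int] = None) -> Iterator[str]:
    """Yield non-empty rows, stopping after `limit` documents if given."""
    documents = (line.rstrip("\n") for line in lines if line.strip())
    # islice with a stop of None passes every document through.
    return islice(documents, limit)


lines = [
    "qid:1 1:3 2:3 3:0 4:0 5:3 6:1\n",
    "qid:1 1:3 2:0 3:3 4:0 5:3 6:1\n",
    "qid:6 1:3 2:0 3:3 4:1 5:3 6:1\n",
]
first_two = list(read_documents(lines, limit=2))
print(len(first_two))
```

Because the rows are consumed lazily, a small limit avoids reading the rest of a large file, which is what makes this kind of option handy for quick debugging runs.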

--compression-type

Possible choices: GZIP, ZLIB

The compression type to use for storing the tensorflow records.

--compression-level

The level of compression to use when one of the compression types is specified.

Default: 6