deletor.random package

Submodules

deletor.random.sample module

Documentation

class deletor.random.sample.DocumentSampler(sample_size: int, n_samples: Optional[int] = None, multiple: Optional[int] = None, sample_pre_batch: bool = False, pad_value: float = -3.4028235e+38)[source]

Bases: abc.ABC

A base class for various methods of sampling a set of documents from a batch.

abstract count_sampled_documents(**kwargs)[source]
abstract make_gather_indices(**kwargs)[source]
abstract make_scatter_indices(**kwargs)[source]
sample(x: Dict[str, tensorflow.python.framework.ops.Tensor], y: tensorflow.python.framework.ops.Tensor, w=None, **kwargs)[source]
abstract sample_after_batching(x: Dict[str, tensorflow.python.framework.ops.Tensor], y: tensorflow.python.framework.ops.Tensor, w=None, **kwargs)[source]
abstract sample_before_batching(x: Dict[str, tensorflow.python.framework.ops.Tensor], y: tensorflow.python.framework.ops.Tensor, w=None, **kwargs)[source]
class deletor.random.sample.IndependentMultiOutputSampler(sample_size: int, n_samples: Optional[int] = None, multiple: Optional[int] = None, sample_pre_batch: bool = False, pad_value: float = -3.4028235e+38)[source]

Bases: deletor.random.sample.DocumentSampler

A sampling method that generates each sample independently of the others. No document appears more than once within a sample, provided that sample_size (i.e., the group size) is less than the number of documents. However, documents are not guaranteed to be sampled equally often, and some documents may be excluded entirely.
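
The behaviour described above can be illustrated with a small framework-free sketch. It is not the library's implementation: the function name independent_samples and the use of Python's random module are assumptions made only for illustration.

```python
import random

def independent_samples(n_documents, sample_size, n_samples, seed=0):
    """Draw n_samples groups of document indices (hypothetical helper)."""
    rng = random.Random(seed)
    # random.sample draws without replacement, so a group never repeats a
    # document as long as sample_size does not exceed n_documents.
    return [rng.sample(range(n_documents), sample_size) for _ in range(n_samples)]

samples = independent_samples(n_documents=6, sample_size=3, n_samples=4)

# How often each document was drawn across all groups. Because the groups
# are independent, these counts can differ, and some may be zero.
counts = [sum(group.count(doc) for group in samples) for doc in range(6)]
```

Each group is free of duplicates, but nothing ties the groups together, which is why per-document frequencies are uneven and why a document can be missed altogether.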

After applying this sampler, the input data (\(X\)) dictionary has three new entries and the \(y\) value is a two-element tuple.

The new \(X\) entries are:

  • sample_dense

    Contains the sampled documents (the original feature tensor(s) are preserved in the sequential_dense entry).

  • scatter_idx

    Contains a set of indices for use with tensorflow.scatter_nd() (or tensorflow.gather_nd()) to map the sampled documents back to their original order in the query. This is needed when aggregating the scores of each document in a query.

  • document_counts

    Contains a tensor that tracks how many times each document has been sampled. This is useful for averaging the scores over documents in a query instead of summing them.
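
As a rough, framework-free illustration of how scatter indices and document counts work together, the sketch below sums or averages per-sample scores back into the original document order. The index layout, the function name aggregate_scores, and the example values are all assumptions; the library's actual tensors may be shaped differently.

```python
def aggregate_scores(sample_scores, scatter_idx, n_documents, average=False):
    """Map per-sample scores back to original document positions (hypothetical)."""
    totals = [0.0] * n_documents
    counts = [0] * n_documents
    for scores, positions in zip(sample_scores, scatter_idx):
        for score, doc in zip(scores, positions):
            # Scatter-add: a document drawn in several samples accumulates
            # all of its scores, as tensorflow.scatter_nd() does with
            # duplicate indices.
            totals[doc] += score
            counts[doc] += 1
    if not average:
        return totals
    # Divide by the per-document count; documents never sampled stay at 0.
    return [t / c if c else 0.0 for t, c in zip(totals, counts)]

# Two samples of two documents each, drawn from a query with four documents.
scatter_idx = [[2, 0], [0, 1]]
sample_scores = [[1.0, 3.0], [5.0, 2.0]]

summed = aggregate_scores(sample_scores, scatter_idx, n_documents=4)
averaged = aggregate_scores(sample_scores, scatter_idx, n_documents=4, average=True)
```

Document 0 is sampled twice here, so its summed score is inflated relative to documents sampled once; dividing by the count removes that bias.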

Parameters
  • sample_size – The number of documents in each sample (i.e., the group size).

  • n_samples – The number of samples to generate.

  • multiple – By default, the class generates \(n H_{n}\) samples (see the coupon collector’s problem for the meaning of \(n H_{n}\)). Use this parameter to scale that default number of samples by the given multiple.

  • sample_pre_batch – Whether the sampler assumes the input has been padded and batched already or not. You almost certainly want this to be False.

  • pad_value – The value used to pad entries in the tensor (from tensorflow.data.Dataset.padded_batch()).
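
For reference, the \(n H_{n}\) quantity mentioned under multiple is the expected number of draws in the coupon collector's problem, where \(H_{n}\) is the n-th harmonic number. The sketch below computes it under the assumption that \(n\) is the number of documents; the source does not define \(n\), and the real get_number_of_samples() also takes sample_size, so the library may compute its default differently. The helper names are hypothetical.

```python
from math import ceil

def harmonic(n):
    """n-th harmonic number: H_n = 1 + 1/2 + ... + 1/n."""
    return sum(1.0 / k for k in range(1, n + 1))

def coupon_collector_draws(n):
    """Expected number of single draws needed to see all n items: n * H_n."""
    return n * harmonic(n)

def default_n_samples(n_documents, multiple=1):
    # Assumption: n is the number of documents and `multiple` scales the
    # default linearly. This mirrors the parameter description only.
    return ceil(multiple * coupon_collector_draws(n_documents))
```

The value grows slightly faster than linearly in the number of documents, which is the price of making it likely that every document is drawn at least once.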

count_sampled_documents(indices, batch_size, n_samples, n_documents, sample_size)[source]
get_number_of_samples(multiple, n_documents, sample_size)[source]
make_gather_indices(batch_size, n_samples, shuffle_idx, sample_size, **kwargs)[source]
make_scatter_indices(gather_idx, **kwargs)[source]
sample_after_batching(x: Dict[str, tensorflow.python.framework.ops.Tensor], y: tensorflow.python.framework.ops.Tensor, w=None, **kwargs)[source]

Each sample is generated independently of the other samples. Each sample is guaranteed to contain unique documents; however, not every document is guaranteed to be included in the output, and some documents may appear more (or less) often than others.

Parameters
  • x

  • y

  • w

Returns

sample_before_batching(x: Dict[str, tensorflow.python.framework.ops.Tensor], y: tensorflow.python.framework.ops.Tensor, w=None, **kwargs)[source]

Each sample is generated independently of the other samples. Each sample is guaranteed to contain unique documents; however, not every document is guaranteed to be included in the output, and some documents may appear more (or less) often than others.

Parameters
  • x

  • y

  • w

Returns

class deletor.random.sample.IndependentSingleOutputSampler(sample_size: int, n_samples: Optional[int] = None, multiple: Optional[int] = None, sample_pre_batch: bool = False, pad_value: float = -3.4028235e+38)[source]

Bases: deletor.random.sample.IndependentMultiOutputSampler

A sampling method that generates each sample independently of the others. No document appears more than once within a sample, provided that sample_size (i.e., the group size) is less than the number of documents. However, documents are not guaranteed to be sampled equally often, and some documents may be excluded entirely.

After applying this sampler, the input data (\(X\)) dictionary has three new entries and the \(y\) value is a two-element tuple.

The new \(X\) entries are:

  • sample_dense

    Contains the sampled documents (the original feature tensor(s) are preserved in the sequential_dense entry).

  • scatter_idx

    Contains a set of indices for use with tensorflow.scatter_nd() (or tensorflow.gather_nd()) to map the sampled documents back to their original order in the query. This is needed when aggregating the scores of each document in a query.

  • document_counts

    Contains a tensor that tracks how many times each document has been sampled. This is useful for averaging the scores over documents in a query instead of summing them.

Parameters
  • sample_size – The number of documents in each sample (i.e., the group size).

  • n_samples – The number of samples to generate.

  • multiple – By default, the class generates \(n H_{n}\) samples (see the coupon collector’s problem for the meaning of \(n H_{n}\)). Use this parameter to scale that default number of samples by the given multiple.

  • sample_pre_batch – Whether the sampler assumes the input has been padded and batched already or not. You almost certainly want this to be False.

  • pad_value – The value used to pad entries in the tensor (from tensorflow.data.Dataset.padded_batch()).
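
One plausible use of pad_value is to tell real documents apart from the slots added by padding. The sketch below shows that idea in plain Python; how the sampler actually consumes pad_value is not stated in the source, so the function name real_document_mask and the "fully padded row" rule are assumptions.

```python
PAD_VALUE = -3.4028235e+38  # default shown in the class signature

def real_document_mask(documents, pad_value=PAD_VALUE):
    """Return True for each document that has at least one non-pad feature.

    Assumption: a padded slot is one whose features all equal pad_value.
    """
    return [any(value != pad_value for value in features) for features in documents]

# One query with two real documents and one padded slot.
query = [
    [0.1, 0.2],
    [PAD_VALUE, PAD_VALUE],
    [0.0, PAD_VALUE],
]
mask = real_document_mask(query)
n_real_documents = sum(mask)
```

A very large negative sentinel is convenient because it is unlikely to collide with a genuine feature value.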

count_sampled_documents(indices, batch_size, n_samples, n_documents, sample_size)[source]
get_number_of_samples(multiple, n_documents, sample_size)[source]
make_scatter_indices(gather_idx, **kwargs)[source]

Module contents