API Reference

The following constitutes the public API of PACE.

pace.evaluate(algorithm_class, dataset=<pace.data.BuiltinDataset object>, folds=5, selected_alleles=None, selected_lengths=None, nbr_train=1, test_alleles=None, test_lengths=None, nbr_test=10, scorers={'accuracy': <pace.evaluation.AccuracyScorer object>, 'ppv': <pace.evaluation.PpvScorer object>}, random_seed=127)

Evaluate an algorithm.

Given a dataset and an algorithm, this evaluates the algorithm by repeatedly splitting the dataset into training and testing subsets, training a new algorithm instance, asking it to make predictions about the testing subset, and scoring those predictions.

Parameters
  • algorithm_class (pace.PredictionAlgorithm) – a function taking no arguments that returns a new instance of the algorithm to test - If the algorithm class has a default constructor, you can simply pass in the class itself. Otherwise, pass in a lambda that fills in the constructor arguments appropriately. The algorithm must implement the interface specified by pace.PredictionAlgorithm.

  • dataset (pace.Dataset) – the dataset to use for testing - If omitted, the builtin dataset is used. The dataset must implement the interface specified by pace.Dataset.

  • folds (int) – the number of folds (i.e., iterations) to perform (default is 5)

  • selected_alleles (List[str], optional) – a list of alleles to use for training - If a value is given here, the dataset is filtered so that only samples for those alleles are used for training. (By default, no filtering is done.) Note that this will also determine the filtering of the test data unless a different filter is explicitly specified.

  • selected_lengths (List[int], optional) – a list of peptide lengths to use for training - If a value is given here, the dataset is filtered so that only samples for those lengths are used for training. (By default, no filtering is done.) Note that this will also determine the filtering of the test data unless a different filter is explicitly specified.

  • nbr_train (float, optional) – the nonbinder ratio for training - This determines the ratio of nonbinders to binders in the set of samples used for training the algorithm. It defaults to 1.

  • test_alleles (List[str], optional) – a list of alleles to use for testing - This is equivalent to selected_alleles but determines the filtering for the testing phase. By default, the same set that was used for training is also used for testing.

  • test_lengths (List[int], optional) – a list of peptide lengths to use for testing - This is equivalent to selected_lengths but determines the filtering for the testing phase. By default, the same set that was used for training is also used for testing.

  • nbr_test (float, optional) – the nonbinder ratio for testing - This determines the ratio of nonbinders to binders in the set of samples used for testing the algorithm. It defaults to 10. Using a value much higher than 10 with the default dataset, without subselecting, will exhaust the pool of nonbinders.

  • scorers (Dict[str,pace.Scorer]) – a mapping from labels to scorers - If omitted, pace.evaluation.default_scorers is used.

  • random_seed (int, optional) – the random seed used to initialize the random state to ensure reproducible splits are obtained between different runs

Returns

a mapping from scorer labels to the results returned by that scorer (one per fold)

Return type

Dict[str,List[Any]]
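
Because algorithm_class is called with no arguments, an algorithm whose constructor takes arguments has to be wrapped. The following is a minimal sketch of that pattern. ThresholdAlgorithm and its cutoff argument are hypothetical stand-ins (a real algorithm would implement pace.PredictionAlgorithm), and the pace.evaluate call appears only as a comment so that the snippet runs without PACE installed.

```python
class ThresholdAlgorithm:
    """Hypothetical algorithm whose constructor requires an argument."""

    def __init__(self, cutoff):
        self.cutoff = cutoff

    def train(self, binders, nonbinders):
        pass  # a real algorithm would fit a model here

    def predict(self, samples):
        # Placeholder: return the same score for every sample.
        return [self.cutoff for _ in samples]


# Wrap the constructor so that it can be called with no arguments.
make_algorithm = lambda: ThresholdAlgorithm(cutoff=0.5)

# Each call yields a fresh, untrained instance.
first = make_algorithm()
second = make_algorithm()

# With PACE installed, the factory would be passed like this:
# scores = pace.evaluate(make_algorithm, folds=5, nbr_train=1, nbr_test=10)
```

A class with a default constructor needs no wrapper and can be passed directly.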

pace.encode(sequences, aafeatmat='onehot')

Create a numerical encoding for the input peptide sequences. Assumes that all input sequences have the same length. (TODO: how should we integrate error handling?)

Parameters
  • sequences – List of peptide sequences. A list of strings is accepted as well as a list of lists where the inner lists are single amino acids. All sequences need to be the same length.

  • aafeatmat – Either the name of one of the builtin peptide encodings or a pandas DataFrame with one amino acid per row, and columns with features. (Rows: 20 amino acids; columns: the encoding of each amino acid.)

Returns

encoded sequences

Return type

numpy.ndarray
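
To illustrate what a one-hot peptide encoding looks like, here is a small standalone sketch in plain Python. It is not the implementation behind pace.encode: the function name one_hot_encode, the alphabetical residue order, and the flattened row layout are all assumptions made for this example, and it returns nested lists rather than a numpy.ndarray.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def one_hot_encode(sequences):
    """Encode equal-length peptides as flat one-hot rows (illustrative only)."""
    # Accept strings as well as lists of single-letter residues.
    seqs = [list(s) for s in sequences]
    if len({len(s) for s in seqs}) > 1:
        raise ValueError("all sequences must have the same length")

    index = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
    encoded = []
    for seq in seqs:
        row = []
        for aa in seq:
            vec = [0] * len(AMINO_ACIDS)
            vec[index[aa]] = 1  # mark the position of this residue
            row.extend(vec)
        encoded.append(row)
    return encoded


encoded = one_hot_encode(["ACD", ["W", "Y", "A"]])
```

Each peptide of length n becomes a row of 20 × n values with exactly n ones.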

pace.get_allele_similarity_mat(allele_similarity_name)

Get a matrix of pre-computed allele similarities

Parameters

allele_similarity_name (str) – Pre-computed allele similarity matrices are available based on observed peptide binding motifs (‘motifs’) or HLA protein binding pocket residues (‘pockets’).

Returns

allele similarity matrix

Return type

pandas.core.frame.DataFrame

pace.get_similar_alleles(allele_similarity_name, allele, similarity_threshold)

Get the most similar alleles to a given allele, based on a specified allele similarity matrix and similarity threshold.

Parameters
  • allele_similarity_name (str) – Pre-computed allele similarity matrices are available based on observed peptide binding motifs (‘motifs’) or HLA protein binding pocket residues (‘pockets’).

  • allele (str) – The allele for which to determine similar alleles

  • similarity_threshold – Numerical threshold value that determines the cutoff for considering an allele similar to the given allele.

Returns

The similar alleles satisfying the specified threshold along with the numerical similarity values. Note that the given allele is also returned.

Return type

pandas.core.frame.DataFrame
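
The thresholding idea can be sketched without PACE or pandas. The example below uses a nested dictionary in place of the similarity matrix. The allele labels and similarity values are made up for illustration, and the sketch assumes that higher values mean more similar and that the cutoff is inclusive; neither assumption is confirmed by this reference.

```python
# Toy similarity values, made up for illustration only.
similarity = {
    "allele_1": {"allele_1": 1.0, "allele_2": 0.8, "allele_3": 0.3},
    "allele_2": {"allele_1": 0.8, "allele_2": 1.0, "allele_3": 0.6},
    "allele_3": {"allele_1": 0.3, "allele_2": 0.6, "allele_3": 1.0},
}


def similar_alleles(matrix, allele, threshold):
    """Return alleles whose similarity to `allele` is at least `threshold`."""
    row = matrix[allele]
    hits = {other: value for other, value in row.items() if value >= threshold}
    # Sort from most to least similar.
    return dict(sorted(hits.items(), key=lambda kv: kv[1], reverse=True))


result = similar_alleles(similarity, "allele_1", 0.5)
```

As with the documented function, the query allele itself appears in the result because its self-similarity passes the cutoff.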

class pace.Sample

a sample to predict

property allele

the allele code for the MHC molecule

property peptide

the amino acid sequence for the peptide (as a string)

class pace.Dataset

an abstract base class defining the interface required of a dataset

abstract get_binders(length)

Get all binders with the specified length.

Parameters

length (int) – the peptide length the caller is interested in

Returns

all binders with that length - Note that this is allowed to return a single-use iterable.

Return type

Iterable[pace.Sample]

abstract get_nonbinders(length)

Get all nonbinders with the specified length.

Parameters

length (int) – the peptide length the caller is interested in

Returns

all nonbinder peptides with that length

Return type

List[str]
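
A custom dataset only needs the two methods above. The sketch below shows one possible in-memory shape. InMemoryDataset is a hypothetical name, the namedtuple stands in for pace.Sample, and a real implementation would subclass pace.Dataset; the stand-ins keep the snippet runnable without PACE installed.

```python
from collections import namedtuple

# Stand-in for pace.Sample, which exposes `allele` and `peptide`.
Sample = namedtuple("Sample", ["allele", "peptide"])


class InMemoryDataset:
    """Hypothetical dataset backed by plain Python lists."""

    def __init__(self, binders, nonbinder_peptides):
        self._binders = list(binders)
        self._nonbinders = list(nonbinder_peptides)

    def get_binders(self, length):
        # A generator is a single-use iterable, which the interface allows.
        return (s for s in self._binders if len(s.peptide) == length)

    def get_nonbinders(self, length):
        # Nonbinders are returned as peptide strings, per the List[str] type.
        return [p for p in self._nonbinders if len(p) == length]


dataset = InMemoryDataset(
    binders=[Sample("allele_1", "A" * 9), Sample("allele_2", "C" * 8)],
    nonbinder_peptides=["D" * 9, "E" * 8, "F" * 9],
)
nine_mer_binders = list(dataset.get_binders(9))
nine_mer_nonbinders = dataset.get_nonbinders(9)
```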

class pace.PredictionAlgorithm

an abstract base class defining the interface required of prediction algorithms that are to be evaluated by PACE

abstract predict(samples)

Predict whether or not a list of samples will bind.

Parameters

samples (List[pace.Sample]) – the samples to predict

Returns

predictions for each sample - Each prediction is a number between 0 and 1 indicating how likely the sample is to bind.

Return type

NumPy array-like object (e.g., list of floats)

abstract train(binders, nonbinders)

Train this instance using the supplied training data.

Parameters
  • binders – samples that are known to bind

  • nonbinders – samples that are known to not bind
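
As a rough illustration of the train/predict contract, the sketch below predicts the fraction of binders seen during training for every sample. BaseRateAlgorithm is a hypothetical name, and the class does not subclass pace.PredictionAlgorithm so that it runs without PACE installed. Because the element types passed to train are not specified here, the sketch only counts the items, and the usage lines pass placeholder strings.

```python
class BaseRateAlgorithm:
    """Hypothetical baseline: predicts the training binder fraction."""

    def __init__(self):
        self.rate = 0.5  # fallback if predict is called before train

    def train(self, binders, nonbinders):
        # Only the counts are used, so any iterable is acceptable here.
        n_binders = len(list(binders))
        n_nonbinders = len(list(nonbinders))
        total = n_binders + n_nonbinders
        if total:
            self.rate = n_binders / total

    def predict(self, samples):
        # One number between 0 and 1 per sample.
        return [self.rate for _ in samples]


algorithm = BaseRateAlgorithm()
algorithm.train(binders=["b1"], nonbinders=["n1", "n2", "n3"])
predictions = algorithm.predict(["s1", "s2"])
```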

class pace.PredictionResult

the result of predicting a single sample

property prediction

the algorithm’s prediction (between 0 and 1)

property sample

the sample that was predicted

property truth

the true answer (either 0 or 1)

class pace.Scorer

an abstract base class defining the interface required of scorers - A scorer quantifies (or summarizes) the quality of the prediction results.

abstract score(results)

Generate the score for a set of prediction results.

Parameters

results (Iterable[PredictionResult]) – the prediction results to score

Returns

whatever summary info the scorer would like to generate for the results

Return type

Any
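
A scorer can be sketched as a single pass over the results. The example below computes a thresholded accuracy. ThresholdAccuracyScorer and its 0.5 cutoff are assumptions for illustration and are not claimed to match pace.evaluation.AccuracyScorer. The namedtuple stands in for pace.PredictionResult, and the result values are toy data.

```python
from collections import namedtuple

# Stand-in for pace.PredictionResult (sample, truth, prediction).
PredictionResult = namedtuple("PredictionResult", ["sample", "truth", "prediction"])


class ThresholdAccuracyScorer:
    """Hypothetical scorer: fraction of correct calls at a fixed cutoff."""

    def __init__(self, cutoff=0.5):
        self.cutoff = cutoff

    def score(self, results):
        correct = 0
        total = 0
        for r in results:  # single pass, so any iterable works
            predicted_label = 1 if r.prediction >= self.cutoff else 0
            correct += int(predicted_label == r.truth)
            total += 1
        return correct / total if total else float("nan")


# Toy results, made up for illustration only.
results = [
    PredictionResult("s1", 1, 0.9),
    PredictionResult("s2", 0, 0.2),
    PredictionResult("s3", 1, 0.4),
    PredictionResult("s4", 0, 0.7),
]
accuracy = ThresholdAccuracyScorer().score(iter(results))
```

Since score may return anything, a scorer could equally return a dictionary or a tuple of several summary values.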