MPRAlib package

mpralib.cli

mpralib.exception

exception mpralib.exception.IOException(message)[source]

Bases: MPRAlibException

Exception raised for IO-related errors.

Parameters:: message (str) – A description of the IO error.

exception mpralib.exception.MPRAlibException(message)[source]

Bases: Exception

MPRAlib error class for specific exceptions.

Parameters:: message (str) – A description of the error.

exception mpralib.exception.SequenceDesignException(column, file_path)[source]

Bases: IOException

Exception raised for errors related to sequence design file.

Parameters:: message (str) – A description of the sequence design error.
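Because of the inheritance chain documented above, a single handler for MPRAlibException also catches IOException and SequenceDesignException. The following is a minimal sketch using local stand-in classes rather than imports from mpralib; the constructor bodies, the message format, and the classify helper are assumptions made for illustration only.

```python
# Stand-in classes mirroring the documented inheritance chain.
# They are NOT the library's implementations; the bodies are assumed.
class MPRAlibException(Exception):
    def __init__(self, message):
        super().__init__(message)
        self.message = message


class IOException(MPRAlibException):
    pass


class SequenceDesignException(IOException):
    def __init__(self, column, file_path):
        # Hypothetical message format, for illustration only.
        super().__init__(f"Column '{column}' is invalid in {file_path}")
        self.column = column
        self.file_path = file_path


def classify(exc):
    """Report whether the package-level base class catches the exception."""
    try:
        raise exc
    except MPRAlibException:
        return "MPRAlibException"
    except Exception:
        return "other"


caught = classify(SequenceDesignException("oligo", "design.tsv"))
```

Catching the base class is convenient at a program's top level, while catching SequenceDesignException specifically lets a caller react to design-file problems alone.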

mpralib.mpradata

class mpralib.mpradata.BarcodeFilter(*values)[source]

Bases: Enum

Enumeration of available barcode filtering methods.

GLOBAL = 'global'

Filter barcodes based on RNA z-score.

Type:: str

LARGE_EXPRESSION = 'large_expression'

Filter barcodes based on Median Absolute Deviation (MAD).

Type:: str

MAX_COUNT = 'max_count'

Filter barcodes with counts above a specified maximum.

Type:: str

MIN_BCS_PER_OLIGO = 'min_bcs_per_oligo'

Filter barcodes based on a minimum number of barcodes per oligo.

Type:: str

MIN_COUNT = 'min_count'

Filter barcodes with counts below a specified minimum.

Type:: str

OLIGO_SPECIFIC = 'oligo_specific'

Filter barcodes based on standard deviation per oligo.

Type:: str

RANDOM = 'random'

Randomly filter barcodes.

Type:: str

classmethod from_string(value)[source]

Creates a BarcodeFilter enum member from a string value.

Return type:: BarcodeFilter
Parameters:: value (str) – The string representation of the enum member.
Returns:: The corresponding BarcodeFilter enum member.
Raises:: ValueError – If the provided string does not match any BarcodeFilter member.
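The lookup-by-value behaviour described for from_string can be sketched with a plain Enum. BarcodeFilterSketch below is a hypothetical stand-in that reproduces only a subset of the documented members; the real method may normalise its input differently.

```python
from enum import Enum


class BarcodeFilterSketch(Enum):
    # Subset of the documented members, reproduced for illustration.
    MIN_COUNT = "min_count"
    MAX_COUNT = "max_count"
    RANDOM = "random"

    @classmethod
    def from_string(cls, value):
        # Calling the Enum class with a value returns the matching member
        # and raises ValueError when no member has that value.
        return cls(value)


member = BarcodeFilterSketch.from_string("min_count")
```

This pattern is useful when the filter name arrives as text, for example from a command-line option.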

class mpralib.mpradata.CountSampling(*values)[source]

Bases: Enum

Enumeration representing the types of count sampling available for MPRA data.

DNA = 'DNA'

Represents DNA count sampling.

Type:: str

RNA = 'RNA'

Represents RNA count sampling.

Type:: str

RNA_AND_DNA = 'RNA_AND_DNA'

Represents both RNA and DNA count sampling.

Type:: str

class mpralib.mpradata.MPRABarcodeData(data, barcode_threshold=0)[source]

Bases: MPRAData

A class for handling barcode-level MPRA (Massively Parallel Reporter Assay) data, providing methods for data import, normalization, filtering, and aggregation to oligo-level data.

This class extends MPRAData and is designed to work with barcode-resolved MPRA datasets, supporting a variety of barcode filtering strategies, normalization routines, and data transformations. It leverages AnnData for data storage and manipulation.

Note

Filtering and normalization methods are barcode-aware and can be customized via method parameters.
Aggregation to oligo-level data is supported for downstream analysis.

apply_barcode_filter(barcode_filter, params={})[source]

Applies a specified barcode filter to the dataset using the provided parameters.

This method selects the appropriate barcode filtering function based on the barcode_filter argument and applies it to update the var_filter attribute. Supported filters include RNA z-score, MAD, random, minimum count, and maximum count. After applying the filter, metadata is updated to record the applied filter.

Return type:

None

Parameters:

barcode_filter (BarcodeFilter) – The type of barcode filter to apply.
params (dict, optional) – Additional parameters to pass to the filter function. Defaults to an empty dictionary.

Raises:

ValueError – If an unsupported barcode filter is provided.
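A barcode filter ultimately yields a boolean mask over barcodes. As an illustration of the MAD idea mentioned above, the sketch below flags values that lie far from the median. The function name, the times_mad threshold, and the exact rule are assumptions; this page does not specify how the library computes its MAD-based filter.

```python
from statistics import median


def mad_outlier_mask(values, times_mad=3.0):
    """Flag values farther than times_mad * MAD from the median."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # No spread around the median: nothing is called an outlier.
        return [False] * len(values)
    return [abs(v - med) > times_mad * mad for v in values]


counts = [10, 12, 11, 9, 10, 250]
mask = mad_outlier_mask(counts)  # only the last value is flagged
```

The median and MAD are used instead of the mean and standard deviation because a single extreme barcode barely shifts them.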

apply_count_sampling(count_type, proportion=None, total=None, max_value=None, aggregate_over_replicates=False)[source]

Applies count sampling to RNA and/or DNA count data according to the specified parameters.

Return type:

None

Parameters:

count_type (CountSampling) – Specifies which counts to sample. Options are RNA, DNA, or RNA_AND_DNA.
proportion (Optional[float]) – Proportion of counts to sample (between 0 and 1). If None, this parameter is ignored.
total (Optional[int]) – Total number of counts to sample. If None, this parameter is ignored.
max_value (Optional[int]) – Maximum value for sampled counts. If None, this parameter is ignored.
aggregate_over_replicates (bool) – Whether to aggregate counts over replicates before sampling.

Side Effects:

Adds sampling metadata to the object.
Drops any normalized data associated with the object.
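To make the proportion and max_value parameters concrete, here is a minimal sketch of two common downsampling operations on a list of counts. Binomial thinning and hard clipping are assumptions about what such parameters typically mean; the helper names are hypothetical and the library's own sampling scheme may differ.

```python
import random


def downsample_counts(counts, proportion, seed=0):
    """Binomially thin each count: keep every read with probability `proportion`."""
    if not 0.0 <= proportion <= 1.0:
        raise ValueError("proportion must be between 0 and 1")
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    return [sum(rng.random() < proportion for _ in range(c)) for c in counts]


def cap_counts(counts, max_value):
    """Clip every count at max_value."""
    return [min(c, max_value) for c in counts]


raw = [100, 0, 40, 7]
sampled = downsample_counts(raw, proportion=0.5)
capped = cap_counts(raw, max_value=50)
```

Because sampled counts change the totals, any previously normalized values become stale, which is consistent with the documented side effect of dropping normalized data.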

property barcode_counts: NDArray[int32]

Returns the barcode counts matrix, which is the number of observed and not filtered barcodes for each oligo.

Type:: NDArray[np.int32]

complexity(method='lincoln')[source]

Calculates and returns the complexity of barcodes using the Lincoln-Petersen or Chapman estimator.

Return type:: GenericAlias[int64]
Parameters:: method (str) – Either “lincoln” or “chapman”.
Returns:: The Lincoln-Petersen or Chapman estimate.
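Both estimators come from capture-recapture statistics: with n1 items seen in a first sample, n2 in a second, and m seen in both, the Lincoln-Petersen estimate is n1 * n2 / m and the Chapman estimate is (n1 + 1)(n2 + 1)/(m + 1) - 1. The sketch below applies them to two sets of barcodes; treating a pair of replicates as the two samples is an assumption about how the library uses the formulas, and the function names are hypothetical.

```python
def lincoln_petersen(n1, n2, m):
    """N = n1 * n2 / m; undefined when nothing is shared."""
    if m == 0:
        raise ZeroDivisionError("no shared barcodes, estimate undefined")
    return n1 * n2 / m


def chapman(n1, n2, m):
    """Chapman's bias-corrected variant; defined even when m == 0."""
    return (n1 + 1) * (n2 + 1) / (m + 1) - 1


rep1 = {"AAC", "AAG", "ACT", "AGT", "CCA", "CGT"}
rep2 = {"AAG", "ACT", "CCA", "GGA", "GTT"}
shared = len(rep1 & rep2)  # barcodes observed in both replicates

lp = lincoln_petersen(len(rep1), len(rep2), shared)  # 6 * 5 / 3 = 10.0
ch = chapman(len(rep1), len(rep2), shared)           # 7 * 6 / 4 - 1 = 9.5
```

The Chapman form is generally preferred when the overlap is small, since it avoids division by zero and reduces bias.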

drop_barcode_counts()[source]

Removes or clears the barcode counts data from the current object.

Return type:: None
Raises:: NotImplementedError – If the method is not yet implemented.

drop_count_sampling()[source]

Removes count sampling data from the dataset.

This method performs the following actions:

Calls drop_normalized() to remove any normalized data.
Logs the action of dropping count sampling.
Deletes the “count_sampling” entry from the .uns attribute of the data.
Removes the “rna_sampling” and “dna_sampling” layers from the data, if they exist.

Return type:: None

classmethod from_file(file_path)[source]

Create an instance of the class from a file.

This method reads data from a specified file (reporter experiment barcode or reporter experiment file format), processes it, and returns an instance of the class containing the data in an AnnData object.

Return type:

MPRABarcodeData

Parameters:

file_path (str) – Path to the input file containing reporter experiment barcode or reporter experiment data.

Returns:

An instance of MPRAData containing the processed data in an AnnData object.

Raises:

IOError – If the file cannot be read or parsed.
ValueError – If the file format is invalid.

property oligo_data: MPRAOligoData

Returns an instance of MPRAOligoData containing aggregated oligo-level data.

Type:: MPRAOligoData

class mpralib.mpradata.MPRAData(data, barcode_threshold=0)[source]

Bases: ABC

Abstract base class for handling MPRA (Massively Parallel Reporter Assay) data using AnnData objects.

This class provides a standardized interface and core functionality for managing, normalizing, filtering, and analyzing MPRA data, including DNA/RNA counts, barcode handling, activity computation, and correlation analysis. It is designed to be subclassed for specific MPRA data formats.

Parameters:

data (anndata.AnnData) – The AnnData object containing MPRA data.
barcode_threshold (int, optional) – Minimum barcode count threshold for filtering. Defaults to 0.

_SCALING

Default scaling factor for normalization.

Type:: float

_PSEUDOCOUNT

Default pseudocount for normalization.

Type:: int

_data

The AnnData object containing MPRA data.

Type:: anndata.AnnData

Raises:: ValueError – If required metadata (e.g., sequence design file) is not loaded.

LOGGER = <Logger mpralib.mpradata (INFO)>

Logger for the class.

Type:: logging.Logger

property activity: NDArray[float32]

Returns the activity values calculated from normalized RNA and DNA counts, applying the variable filter if present.

Type:: NDArray[np.float32]
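Activity is described elsewhere on this page as the log2 ratio of normalized RNA to normalized DNA counts. A minimal sketch of that calculation for one replicate follows. The normalization formula (pseudocount added, then scaled to a fixed total) and the default values for scaling and pseudocount are assumptions, not the library's documented behaviour.

```python
import math


def normalize(counts, scaling=1e6, pseudocount=1):
    """Add a pseudocount, then rescale so the adjusted counts sum to `scaling`."""
    adjusted = [c + pseudocount for c in counts]
    total = sum(adjusted)
    return [a / total * scaling for a in adjusted]


def activity(rna_counts, dna_counts, **kwargs):
    """log2 of normalized RNA over normalized DNA, element by element."""
    rna = normalize(rna_counts, **kwargs)
    dna = normalize(dna_counts, **kwargs)
    return [math.log2(r / d) for r, d in zip(rna, dna)]


dna = [9, 19, 29]
rna = [19, 19, 19]
act = activity(rna, dna)  # positive where RNA is enriched relative to DNA
```

The pseudocount keeps the ratio finite when a DNA count is zero, which matches the stated purpose of the pseudo_count property.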

add_sequence_design(df_sequence_design, sequence_design_file_path)[source]

Add sequence design metadata to the object’s data.

Return type:

None

Parameters:

df_sequence_design (pd.DataFrame) – DataFrame containing sequence design information, indexed by oligo identifiers.
sequence_design_file_path (str) – Path to the file from which the sequence design data was loaded to store it into the metadata.

abstract property barcode_counts: NDArray[int32]

Returns the barcode counts matrix, which is the number of observed and not filtered barcodes for each oligo.

Type:: NDArray[np.int32]

property barcode_threshold: int

Returns the threshold for barcode filtering.

Type:: int

correlation(method='pearson', count_type=Modality.ACTIVITY)[source]

Calculates and returns the correlation for activity or normalized counts.

Return type:: GenericAlias[float32]
Returns:: The Pearson or Spearman correlation matrix.

property data: AnnData

The underlying AnnData object containing MPRA data.

Type:: ad.AnnData

property dna_counts: NDArray[int32]

Returns the raw DNA or, if present, sampled DNA counts, applying the variable filter if present.

Type:: NDArray[np.int32]

abstractmethod drop_barcode_counts()[source]

Removes or clears the barcode counts data from the current object.

Return type:: None
Raises:: NotImplementedError – If the method is not yet implemented.

drop_normalized()[source]

Removes normalized RNA and DNA data layers as well as activity from the dataset.

This method deletes the “rna_normalized”, “dna_normalized” and “activity” layers from the self.data.layers attribute, logs the operation, updates the metadata to indicate that normalization is no longer present, and drops any associated correlation data.

Return type:: None

drop_total_counts()[source]

Removes total RNA and DNA counts from the dataset.

Return type:: None

abstractmethod classmethod from_file(file_path)[source]

Create an instance of the class from a file.

This method reads data from a specified file (reporter experiment barcode or reporter experiment file format), processes it, and returns an instance of the class containing the data in an AnnData object.

Return type:

MPRAData

Parameters:

file_path (str) – Path to the input file containing reporter experiment barcode or reporter experiment data.

Returns:

An instance of MPRAData containing the processed data in an AnnData object.

Raises:

IOError – If the file cannot be read or parsed.
ValueError – If the file format is invalid.

property n_obs: int

Returns the number of observations (barcodes) in the dataset.

Type:: int

property n_vars: int

Returns the number of variables (samples) in the dataset.

Type:: int

property normalized_dna_counts: NDArray[float32]

Returns the normalized DNA counts from the dataset, applying the variable filter if present.

Type:: NDArray[np.float32]

property normalized_rna_counts: NDArray[float32]

Returns the normalized RNA counts from the dataset, applying the variable filter if present.

Type:: NDArray[np.float32]

property obs_names: Index

Returns the observation names (barcodes) of the dataset.

Type:: pd.Index

property observed: NDArray[bool]

Returns a boolean NumPy array indicating which barcodes (observations) have non-zero counts in either DNA or RNA. Uses sampled counts when available; otherwise uses raw counts.

Type:: NDArray[np.bool_]

property oligos: Series

Returns the oligo names for each variable in the dataset.

Type:: pd.Series

property pseudo_count: int

Pseudocount added during normalization to avoid division by zero.

Type:: int

property raw_dna_counts: NDArray[int32]

Returns the raw DNA counts from the dataset.

Type:: NDArray[np.int32]

property raw_rna_counts: NDArray[int32]

Returns the raw RNA counts from the dataset.

Type:: NDArray[np.int32]

classmethod read(file_data_path)[source]

Reads an AnnData object from a file.

Return type:: MPRAData
Parameters:: file_data_path (str) – The path from which the AnnData object will be read.
Returns:: An instance of the class containing the data read from the file.

property rna_counts: NDArray[int32]

Returns the raw RNA or, if present, sampled RNA counts, applying the variable filter if present.

Type:: NDArray[np.int32]

property scaling: float

Scaling factor for normalization.

Type:: float

property total_dna_counts: NDArray[int32]

Returns the total DNA counts for each replicate. These are normally the total raw counts per replicate; when sampled data is available, the sampled totals are returned instead.

Type:: NDArray[np.int32]

property total_rna_counts: NDArray[int32]

Returns the total RNA counts for each replicate. These are normally the total raw counts per replicate; when sampled data is available, the sampled totals are returned instead.

Type:: NDArray[np.int32]

property var_filter: NDArray[bool]

Returns a boolean NumPy array indicating which variables (samples) are filtered out.

Type:: NDArray[np.bool_]

property var_names: Index

Returns the variable names (samples) of the dataset.

Type:: pd.Index

property variant_map: DataFrame

Returns a DataFrame mapping SPDI IDs to alleles and oligos.

Raises:: ValueError – If the sequence design file is not loaded in the metadata.
Type:: pd.DataFrame

write(file_data_path)[source]

Writes the AnnData object to a file.

Return type:: None
Parameters:: file_data_path (os.PathLike) – The path where the AnnData object will be saved.

class mpralib.mpradata.MPRAOligoData(data, barcode_threshold=0)[source]

Bases: MPRAData

MPRAOligoData is a subclass of MPRAData designed to handle MPRA (Massively Parallel Reporter Assay) oligo-level data.

This class provides methods for loading, normalizing, and managing barcode counts and associated data layers for MPRA experiments. Barcode counts must be pre-set before accessing, as they cannot be computed within this class. The normalization process includes pseudocount handling to avoid division by zero and supports per-barcode normalization.

Raises:: MPRAlibException – If barcode counts are not set when accessed.

property barcode_counts: NDArray[int32]

Returns the barcode counts matrix, which is the number of observed and not filtered barcodes for each oligo.

Type:: NDArray[np.int32]

drop_barcode_counts()[source]

Removes or clears the barcode counts data from the current object.

Raises:: NotImplementedError – If the method is not yet implemented.

classmethod from_file(file_path)[source]

Create an instance of the class from a file.

This method reads data from a specified file (reporter experiment barcode or reporter experiment file format), processes it, and returns an instance of the class containing the data in an AnnData object.

Return type:

MPRAOligoData

Parameters:

file_path (str) – Path to the input file containing reporter experiment barcode or reporter experiment data.

Returns:

An instance of MPRAData containing the processed data in an AnnData object.

Raises:

IOError – If the file cannot be read or parsed.
ValueError – If the file format is invalid.

class mpralib.mpradata.Modality(*values)[source]

Bases: Enum

An enumeration representing different data modalities in MPRA (Massively Parallel Reporter Assay) experiments.

ACTIVITY = 'activity'

Represents activity data modality, typically calculated as the log2 ratio of normalized RNA to DNA counts.

Type:: str

DNA = 'dna'

Represents DNA data modality.

Type:: str

DNA_NORMALIZED = 'dna_normalized'

Represents normalized DNA data modality.

Type:: str

RNA = 'rna'

Represents RNA data modality.

Type:: str

RNA_NORMALIZED = 'rna_normalized'

Represents normalized RNA data modality.

Type:: str

classmethod from_string(value)[source]

Creates a Modality enum member from a string value.

Return type:: Modality
Parameters:: value (str) – The string representation of the enum member.
Returns:: The corresponding Modality enum member.
Raises:: ValueError – If the provided string does not match any Modality member.

mpralib.utils.file_validation

class mpralib.utils.file_validation.SchemaToFileNameMap[source]

Bases: object

as_dict()[source]

get(key)[source]

set(key, file_name)[source]

class mpralib.utils.file_validation.ValidationSchema(*values)[source]

Bases: Enum

REPORTER_BARCODE_TO_ELEMENT_MAPPING = 'reporter_barcode_to_element_mapping'

REPORTER_ELEMENT = 'reporter_element'

REPORTER_EXPERIMENT = 'reporter_experiment'

REPORTER_EXPERIMENT_BARCODE = 'reporter_experiment_barcode'

REPORTER_GENOMIC_ELEMENT = 'reporter_genomic_element'

REPORTER_GENOMIC_VARIANT = 'reporter_genomic_variant'

REPORTER_SEQUENCE_DESIGN = 'reporter_sequence_design'

REPORTER_VARIANT = 'reporter_variant'

mpralib.utils.file_validation.validate_tsv_with_schema(tsv_file_path, schema_type)[source]

Validates a TSV file against a specified JSON schema.

This function reads a TSV file (optionally gzipped), converts each row to a dictionary, and validates each row against the provided JSON schema. If any row fails validation, a warning is logged. If an unexpected error occurs during validation, it is logged and raised.

Return type:

bool

Parameters:

tsv_file_path (str) – Path to the TSV file to validate. The file may be gzipped.
schema_type (ValidationSchema) – The type of schema to validate against.

Returns:

True if all rows are valid according to the schema, False otherwise.

Raises:

Exception – If an unexpected error occurs during validation.

Logs:

Warnings for each row that fails schema validation.
Errors for unexpected exceptions during validation.
Info if the file is valid according to the schema.
Warning if the file is not valid according to the schema.
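The row-by-row flow described above (open a possibly gzipped TSV, turn each row into a dictionary, check it, log a warning on failure) can be sketched with the standard library. In this sketch a simple required-column check stands in for JSON-schema validation; validate_tsv, open_maybe_gzip, and the logger name are hypothetical and not part of mpralib.

```python
import csv
import gzip
import logging

LOGGER = logging.getLogger("validation_sketch")


def open_maybe_gzip(path):
    """Open a text file, detecting gzip compression by its magic number."""
    with open(path, "rb") as handle:
        is_gzip = handle.read(2) == b"\x1f\x8b"
    if is_gzip:
        return gzip.open(path, "rt", newline="")
    return open(path, "r", newline="")


def validate_tsv(path, required_columns):
    """Return True if every row has a non-empty value for each required column."""
    valid = True
    with open_maybe_gzip(path) as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        # Data rows start on line 2 because line 1 is the header.
        for line_no, row in enumerate(reader, start=2):
            missing = [c for c in required_columns if not row.get(c)]
            if missing:
                LOGGER.warning("Row %d fails validation: missing %s", line_no, missing)
                valid = False
    return valid
```

Continuing after a failed row, rather than stopping at the first error, lets a single pass report every problem in the file.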

mpralib.utils.io

mpralib.utils.io.chromosome_map()[source]

Return type:: DataFrame

mpralib.utils.io.export_activity_file(mpradata, output_file_path)[source]

Export activity data from an MPRAdata object to a tab-separated values (TSV) file.

The function processes the grouped data from the MPRAdata object, extracts relevant information for each replicate, and writes the data to a TSV file. The output file contains columns for replicate, oligo name, DNA counts, RNA counts, normalized DNA counts, normalized RNA counts, log2 fold change, and the number of barcodes. Barcode filters, count sampling and barcode thresholds are applied.

Return type:

None

Parameters:

mpradata (MPRAdata) – An object containing MPRA (Massively Parallel Reporter Assay) data.
output_file_path (str) – The file path where the output TSV file will be saved.

mpralib.utils.io.export_barcode_file(mpradata, output_file_path)[source]

Export barcode count data to a file.

This function takes an MPRAdata object and exports its barcode count data to a specified file path in tab-separated values (TSV) format. The output file will contain columns for barcodes, oligo names, and DNA/RNA counts for each replicate. If a barcode filter or count sampling has been applied, the modified counts are written.

Return type:

None

Parameters:

mpradata (MPRAdata) – An object containing MPRA data, including barcodes, oligos, DNA counts, RNA counts, and replicates.
output_file_path (str) – The file path where the output TSV file will be saved.

mpralib.utils.io.export_counts_file(mpradata, output_file_path, normalized=False, filter=None)[source]

Return type:: None

mpralib.utils.io.is_bgzf(filepath)[source]

Check if a file is in BGZF (Blocked GNU Zip Format).

BGZF is a variant of the standard gzip format with extra fields that allow for random access. This function reads the file header and checks for the BGZF-specific magic numbers and flags.

Return type:: bool
Parameters:: filepath (str) – Path to the file to be checked.
Returns:: True if the file is in BGZF format, False otherwise.
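A BGZF block is a gzip member whose header has the FEXTRA flag set and carries an extra subfield with identifier bytes 66 and 67 ("BC") and a payload length of 2. The sketch below checks the first block of a file for that signature. The function name looks_like_bgzf is hypothetical, and the library's own check may inspect the header differently.

```python
import struct


def looks_like_bgzf(path):
    """Return True if the first gzip member carries the BGZF 'BC' extra subfield."""
    with open(path, "rb") as handle:
        header = handle.read(12)
        if len(header) < 12:
            return False
        # Bytes 0-1: gzip magic, byte 2: compression method, byte 3: flags.
        if header[0:2] != b"\x1f\x8b" or header[2] != 8 or not header[3] & 0x04:
            return False  # not gzip/deflate, or FEXTRA flag not set
        (xlen,) = struct.unpack("<H", header[10:12])
        extra = handle.read(xlen)
    # Walk the extra subfields: SI1, SI2, 2-byte little-endian length, payload.
    pos = 0
    while pos + 4 <= len(extra):
        si1, si2 = extra[pos], extra[pos + 1]
        (slen,) = struct.unpack("<H", extra[pos + 2:pos + 4])
        if (si1, si2) == (66, 67) and slen == 2:
            return True
        pos += 4 + slen
    return False
```

A plain gzip file normally fails this check because its header has no extra field, which is what distinguishes the two formats even though both begin with the same magic number.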

mpralib.utils.io.is_compressed_file(filepath)[source]

Check if a file is compressed (gzip or BGZF).

Return type:: bool
Parameters:: filepath (str) – Path to the file to check.
Returns:: True if the file is compressed, False otherwise.

mpralib.utils.io.is_gzip_file(filepath)[source]

Check if a file is a gzip-compressed file based on its magic number.

Return type:: bool
Parameters:: filepath (str or Path) – Path to the file to check.
Returns:: True if the file is gzip-compressed, False otherwise.

mpralib.utils.io.read_sequence_design_file(file_path)[source]

Read sequence design from a tab-separated values (TSV) file.

This function reads metadata from a TSV file and returns it as a pandas DataFrame. The metadata file should contain columns for sample ID, replicate, and any additional metadata. The sample ID should correspond to the oligo name in the MPRA data object.

Return type:: DataFrame
Parameters:: file_path (str) – The file path of the metadata TSV file.
Returns:: A DataFrame containing the metadata.

mpralib.utils.plot

mpralib.utils.plot.barcodes_outlier(data)[source]

Return type:: Figure

mpralib.utils.plot.barcodes_per_oligo(data, replicates=None)[source]

Return type:: FacetGrid

mpralib.utils.plot.correlation(data, layer, replicates=None)[source]

Return type:: PairGrid

mpralib.utils.plot.dna_vs_rna(data, replicates=None)[source]

Return type:: JointGrid