MPRAlib package
mpralib.cli
mpralib.exception
- exception mpralib.exception.IOException(message)[source]
Bases:
MPRAlibException
Exception raised for IO-related errors.
- Parameters:
message (str) – A description of the IO error.
- exception mpralib.exception.MPRAlibException(message)[source]
Bases:
Exception
MPRAlib error class for specific exceptions.
- Parameters:
message (str) – A description of the error.
- exception mpralib.exception.SequenceDesignException(column, file_path)[source]
Bases:
IOException
Exception raised for errors related to the sequence design file.
- Parameters:
message (str) – A description of the sequence design error.
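The inheritance chain documented above (SequenceDesignException → IOException → MPRAlibException → Exception) means a caller can catch errors at whichever level of detail it needs. The following is a minimal sketch using local stand-in classes rather than the library's source; the message wording in SequenceDesignException and the describe helper are assumptions made for illustration.

```python
# Local stand-ins that mirror the documented hierarchy; not the library's source.
class MPRAlibException(Exception):
    """Base class for MPRAlib-specific errors."""

    def __init__(self, message):
        super().__init__(message)
        self.message = message


class IOException(MPRAlibException):
    """IO-related error."""


class SequenceDesignException(IOException):
    """Error tied to a column of a sequence design file."""

    def __init__(self, column, file_path):
        # The message wording is an assumption made for illustration.
        super().__init__(f"Invalid column '{column}' in sequence design file {file_path}")
        self.column = column
        self.file_path = file_path


def describe(error):
    """Hypothetical helper: report the most specific category an error belongs to."""
    try:
        raise error
    except SequenceDesignException:
        return "sequence design"
    except IOException:
        return "io"
    except MPRAlibException:
        return "mpralib"
```

Ordering the except clauses from most to least specific matters here: a handler for MPRAlibException placed first would also catch the two subclasses.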
mpralib.mpradata
- class mpralib.mpradata.BarcodeFilter(*values)[source]
Bases:
Enum
Enumeration of available barcode filtering methods.
- GLOBAL = 'global'
Filter barcodes based on RNA z-score.
- Type:
str
- LARGE_EXPRESSION = 'large_expression'
Filter barcodes based on Median Absolute Deviation (MAD).
- Type:
str
- MAX_COUNT = 'max_count'
Filter barcodes with counts above a specified maximum.
- Type:
str
- MIN_BCS_PER_OLIGO = 'min_bcs_per_oligo'
Filter barcodes based on a minimum number of barcodes per oligo.
- Type:
str
- MIN_COUNT = 'min_count'
Filter barcodes with counts below a specified minimum.
- Type:
str
- OLIGO_SPECIFIC = 'oligo_specific'
Filter barcodes based on standard deviation per oligo.
- Type:
str
- RANDOM = 'random'
Randomly filter barcodes.
- Type:
str
- classmethod from_string(value)[source]
Creates a BarcodeFilter enum member from a string value.
- Return type:
BarcodeFilter
- Parameters:
value (str) – The string representation of the enum member.
- Returns:
The corresponding BarcodeFilter enum member.
- Raises:
ValueError – If the provided string does not match any BarcodeFilter member.
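A string-to-enum conversion like from_string can be sketched with Python's standard Enum lookup by value. The class below is a stand-in with only a subset of the documented members; whether the library matches on the member value, and whether matching is case-sensitive, are assumptions here.

```python
from enum import Enum


class BarcodeFilterSketch(Enum):
    """Stand-in with a subset of the documented members."""

    MIN_COUNT = "min_count"
    MAX_COUNT = "max_count"
    RANDOM = "random"

    @classmethod
    def from_string(cls, value):
        # Enum lookup by value raises ValueError for unknown strings,
        # which matches the documented failure mode.
        try:
            return cls(value)
        except ValueError:
            valid = ", ".join(member.value for member in cls)
            raise ValueError(f"Unknown filter '{value}'. Expected one of: {valid}") from None
```

Listing the valid values in the error message is a design choice of this sketch; it makes a mistyped command-line argument easier to correct.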
- class mpralib.mpradata.CountSampling(*values)[source]
Bases:
Enum
Enumeration representing the types of count sampling available for MPRA data.
- DNA = 'DNA'
Represents DNA count sampling.
- Type:
str
- RNA = 'RNA'
Represents RNA count sampling.
- Type:
str
- RNA_AND_DNA = 'RNA_AND_DNA'
Represents both RNA and DNA count sampling.
- Type:
str
- class mpralib.mpradata.MPRABarcodeData(data, barcode_threshold=0)[source]
Bases:
MPRAData
A class for handling barcode-level MPRA (Massively Parallel Reporter Assay) data, providing methods for data import, normalization, filtering, and aggregation to oligo-level data.
This class extends MPRAData and is designed to work with barcode-resolved MPRA datasets, supporting a variety of barcode filtering strategies, normalization routines, and data transformations. It leverages AnnData for data storage and manipulation.
Note
Filtering and normalization methods are barcode-aware and can be customized via method parameters.
Aggregation to oligo-level data is supported for downstream analysis.
- apply_barcode_filter(barcode_filter, params={})[source]
Applies a specified barcode filter to the dataset using the provided parameters.
This method selects the appropriate barcode filtering function based on the barcode_filter argument and applies it to update the var_filter attribute. Supported filters include RNA z-score, MAD, random, minimum count, and maximum count. After applying the filter, metadata is updated to record the applied filter.
- Return type:
None
- Parameters:
barcode_filter (BarcodeFilter) – The type of barcode filter to apply.
params (dict, optional) – Additional parameters to pass to the filter function. Defaults to an empty dictionary.
- Raises:
ValueError – If an unsupported barcode filter is provided.
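Selecting a filter function from an enum argument, as described above, is commonly done with a dispatch table. The sketch below assumes that approach; the FilterKind enum, the two mask functions, their threshold parameter, and apply_filter are hypothetical names, and a mask value of True marks a barcode as filtered out.

```python
from enum import Enum


class FilterKind(Enum):
    """Hypothetical stand-in for two of the documented filter types."""

    MIN_COUNT = "min_count"
    MAX_COUNT = "max_count"


def min_count_mask(counts, threshold=1):
    """True marks a barcode to be filtered out (count below threshold)."""
    return [c < threshold for c in counts]


def max_count_mask(counts, threshold=1000):
    """True marks a barcode to be filtered out (count above threshold)."""
    return [c > threshold for c in counts]


# Maps each filter kind to the function that computes its mask.
_DISPATCH = {
    FilterKind.MIN_COUNT: min_count_mask,
    FilterKind.MAX_COUNT: max_count_mask,
}


def apply_filter(counts, kind, params=None):
    """Look up the filter for `kind` and call it with the extra parameters."""
    if kind not in _DISPATCH:
        raise ValueError(f"Unsupported barcode filter: {kind}")
    return _DISPATCH[kind](counts, **(params or {}))
```

For example, apply_filter([0, 5, 2000], FilterKind.MAX_COUNT, {"threshold": 1000}) marks only the last barcode.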
- apply_count_sampling(count_type, proportion=None, total=None, max_value=None, aggregate_over_replicates=False)[source]
Applies count sampling to RNA and/or DNA count data according to the specified parameters.
- Return type:
None
- Parameters:
count_type (CountSampling) – Specifies which counts to sample. Options are RNA, DNA, or RNA_AND_DNA.
proportion (Optional[float]) – Proportion of counts to sample (between 0 and 1). If None, this parameter is ignored.
total (Optional[int]) – Total number of counts to sample. If None, this parameter is ignored.
max_value (Optional[int]) – Maximum value for sampled counts. If None, this parameter is ignored.
aggregate_over_replicates (bool) – Whether to aggregate counts over replicates before sampling.
- Side Effects:
Adds sampling metadata to the object.
Drops any normalized data associated with the object.
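To illustrate what sampling counts by a proportion can mean, the sketch below thins each count binomially, keeping every individual read with probability proportion. This is one plausible scheme and an assumption, not a description of how the library samples; sample_counts and its seed parameter are hypothetical.

```python
import random


def sample_counts(counts, proportion, seed=None):
    """Binomially thin each count: every read is kept with probability `proportion`."""
    if not 0.0 <= proportion <= 1.0:
        raise ValueError("proportion must be between 0 and 1")
    rng = random.Random(seed)
    # rng.random() lies in [0, 1), so proportion=1.0 keeps every read
    # and proportion=0.0 keeps none.
    return [sum(rng.random() < proportion for _ in range(c)) for c in counts]
```

A sampled count can never exceed the original count, which is why downstream normalized layers would need to be recomputed after sampling.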
- property barcode_counts: ndarray[tuple[Any, ...], dtype[int32]]
Returns the barcode counts matrix: the number of observed, unfiltered barcodes for each oligo.
- Type:
NDArray[np.int32]
- complexity(method='lincoln')[source]
Calculates and returns the complexity of barcodes using the Lincoln-Petersen or Chapman estimation.
- Return type:
ndarray[tuple[Any,...],dtype[int64]]
- Parameters:
method (str) – Either “lincoln” or “chapman”.
- Returns:
The Lincoln-Petersen or Chapman estimate.
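Both estimators come from capture-recapture statistics: with n1 items seen in a first sample, n2 in a second, and m seen in both, the Lincoln-Petersen estimate of the total is n1·n2/m, and the Chapman estimate is (n1+1)(n2+1)/(m+1) − 1. The sketch below applies these formulas to two sets of barcodes; treating two replicates as the two samples is an assumption about how the method uses them, and the function returns a float rather than the int64 array documented above.

```python
def complexity_estimate(first, second, method="lincoln"):
    """Estimate the total barcode pool from two replicate barcode sets."""
    n1, n2 = len(first), len(second)
    shared = len(first & second)
    if method == "lincoln":
        if shared == 0:
            raise ZeroDivisionError("Lincoln-Petersen is undefined without shared barcodes")
        return n1 * n2 / shared
    if method == "chapman":
        # Chapman's correction stays defined even when no barcodes are shared.
        return (n1 + 1) * (n2 + 1) / (shared + 1) - 1
    raise ValueError(f"Unknown method: {method}")
```

With 100 barcodes in one replicate, 80 in the other, and 40 shared, the Lincoln-Petersen estimate is 200 and the Chapman estimate is slightly lower.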
- drop_barcode_counts()[source]
Removes or clears the barcode counts data from the current object.
- Return type:
None
- Raises:
NotImplementedError – If the method is not yet implemented.
- drop_count_sampling()[source]
Removes count sampling data from the dataset.
This method performs the following actions:
- Calls drop_normalized() to remove any normalized data.
- Logs the action of dropping count sampling.
- Deletes the “count_sampling” entry from the .uns attribute of the data.
- Removes “rna_sampling” and “dna_sampling” layers from the data, if they exist.
- Return type:
None
- classmethod from_file(file_path)[source]
Create an instance of the class from a file.
This method reads data from a specified file (reporter experiment barcode or reporter experiment file format), processes it, and returns an instance of the class containing the data in an AnnData object.
- Return type:
- Parameters:
file_path (str) – Path to the input file containing reporter experiment barcode or reporter experiment data.
- Returns:
An instance of MPRAData containing the processed data in an AnnData object.
- Raises:
IOError – If the file cannot be read or parsed.
ValueError – If the file format is invalid.
- property oligo_data: MPRAOligoData
Returns an instance of MPRAOligoData containing aggregated oligo-level data.
- Type:
MPRAOligoData
- class mpralib.mpradata.MPRAData(data, barcode_threshold=0)[source]
Bases:
ABC
Abstract base class for handling MPRA (Massively Parallel Reporter Assay) data using AnnData objects.
This class provides a standardized interface and core functionality for managing, normalizing, filtering, and analyzing MPRA data, including DNA/RNA counts, barcode handling, activity computation, and correlation analysis. It is designed to be subclassed for specific MPRA data formats.
- Parameters:
data (anndata.AnnData) – The AnnData object containing MPRA data.
barcode_threshold (int, optional) – Minimum barcode count threshold for filtering. Defaults to 0.
- _SCALING
Default scaling factor for normalization.
- Type:
float
- _PSEUDOCOUNT
Default pseudocount for normalization.
- Type:
int
- _data
The AnnData object containing MPRA data.
- Type:
anndata.AnnData
- Raises:
ValueError – If required metadata (e.g., sequence design file) is not loaded.
- LOGGER = <Logger mpralib.mpradata (INFO)>
Logger for the class.
- Type:
logging.Logger
- property activity: ndarray[tuple[Any, ...], dtype[float32]]
Returns the activity values calculated from normalized RNA and DNA counts, applying the variable filter if present.
- Type:
NDArray[np.float32]
- add_sequence_design(df_sequence_design, sequence_design_file_path)[source]
Add sequence design metadata to the object’s data.
- Return type:
None
- Parameters:
df_sequence_design (pd.DataFrame) – DataFrame containing sequence design information, indexed by oligo identifiers.
sequence_design_file_path (str) – Path to the file from which the sequence design data was loaded. The path is stored in the metadata.
- abstract property barcode_counts: ndarray[tuple[Any, ...], dtype[int32]]
Returns the barcode counts matrix: the number of observed, unfiltered barcodes for each oligo.
- Type:
NDArray[np.int32]
- property barcode_threshold: int
Returns the threshold for barcode filtering.
- Type:
int
- correlation(method='pearson', count_type=Modality.ACTIVITY)[source]
Calculates and returns the correlation for activity or normalized counts.
- Return type:
ndarray[tuple[Any,...],dtype[float32]]
- Returns:
The Pearson or Spearman correlation matrix.
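The two correlation methods differ in what they measure: Pearson captures linear association, while Spearman is the Pearson correlation of the ranks and so captures any monotonic association. The following is a plain-Python sketch of both for a single pair of vectors; it is not the library's implementation, and the helper names are illustrative.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def ranks(values):
    """Average ranks (1-based), so tied values share the same rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))
```

For replicate activities that agree in ordering but not in scale, Spearman stays at 1 while Pearson drops below it, which is one reason to compare both.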
- property data: AnnData
The underlying AnnData object containing MPRA data.
- Type:
ad.AnnData
- property dna_counts: ndarray[tuple[Any, ...], dtype[int32]]
Returns the raw DNA or, if present, sampled DNA counts, applying the variable filter if present.
- Type:
NDArray[np.int32]
- abstractmethod drop_barcode_counts()[source]
Removes or clears the barcode counts data from the current object.
- Return type:
None
- Raises:
NotImplementedError – If the method is not yet implemented.
- drop_normalized()[source]
Removes normalized RNA and DNA data layers as well as activity from the dataset.
This method deletes the “rna_normalized”, “dna_normalized” and “activity” layers from the self.data.layers attribute, logs the operation, updates the metadata to indicate that normalization is no longer present, and drops any associated correlation data.
- Return type:
None
- abstractmethod classmethod from_file(file_path)[source]
Create an instance of the class from a file.
This method reads data from a specified file (reporter experiment barcode or reporter experiment file format), processes it, and returns an instance of the class containing the data in an AnnData object.
- Return type:
- Parameters:
file_path (str) – Path to the input file containing reporter experiment barcode or reporter experiment data.
- Returns:
An instance of MPRAData containing the processed data in an AnnData object.
- Raises:
IOError – If the file cannot be read or parsed.
ValueError – If the file format is invalid.
- property n_obs: int
Returns the number of observations (barcodes) in the dataset.
- Type:
int
- property n_vars: int
Returns the number of variables (samples) in the dataset.
- Type:
int
- property normalized_dna_counts: ndarray[tuple[Any, ...], dtype[float32]]
Returns the normalized DNA counts from the dataset, applying the variable filter if present.
- Type:
NDArray[np.float32]
- property normalized_rna_counts: ndarray[tuple[Any, ...], dtype[float32]]
Returns the normalized RNA counts from the dataset, applying the variable filter if present.
- Type:
NDArray[np.float32]
- property obs_names: Index
Returns the observation names (barcodes) of the dataset.
- Type:
pd.Index
- property observed: ndarray[tuple[Any, ...], dtype[bool]]
Returns a boolean NumPy array indicating which barcodes (observations) have non-zero counts in either DNA or RNA. Uses sampled counts when available, otherwise raw counts.
- Type:
NDArray[np.bool_]
- property oligos: Series
Returns the oligo names for each variable in the dataset.
- Type:
pd.Series
- property pseudo_count: int
Pseudocount added during normalization to avoid division by zero.
- Type:
int
- property raw_dna_counts: ndarray[tuple[Any, ...], dtype[int32]]
Returns the raw DNA counts from the dataset.
- Type:
NDArray[np.int32]
- property raw_rna_counts: ndarray[tuple[Any, ...], dtype[int32]]
Returns the raw RNA counts from the dataset.
- Type:
NDArray[np.int32]
- property rna_counts: ndarray[tuple[Any, ...], dtype[int32]]
Returns the raw RNA or, if present, sampled RNA counts, applying the variable filter if present.
- Type:
NDArray[np.int32]
- property scaling: float
Scaling factor for normalization.
- Type:
float
- property total_dna_counts: ndarray[tuple[Any, ...], dtype[int32]]
Returns the total DNA counts for each replicate. These are usually the total raw counts per replicate; when sampled data is available, the sampled counts are returned instead.
- Type:
NDArray[np.int32]
- property total_rna_counts: ndarray[tuple[Any, ...], dtype[int32]]
Returns the total RNA counts for each replicate. These are usually the total raw counts per replicate; when sampled data is available, the sampled counts are returned instead.
- Type:
NDArray[np.int32]
- property var_filter: ndarray[tuple[Any, ...], dtype[bool]]
Returns a boolean NumPy array indicating which variables (samples) are filtered out.
- Type:
NDArray[np.bool_]
- property var_names: Index
Returns the variable names (samples) of the dataset.
- Type:
pd.Index
- property variant_map: DataFrame
Returns a DataFrame mapping SPDI IDs to alleles and oligos.
- Raises:
ValueError – If the sequence design file is not loaded in the metadata.
- Type:
pd.DataFrame
- class mpralib.mpradata.MPRAOligoData(data, barcode_threshold=0)[source]
Bases:
MPRAData
MPRAOligoData is a subclass of MPRAData designed to handle MPRA (Massively Parallel Reporter Assay) oligo-level data.
This class provides methods for loading, normalizing, and managing barcode counts and associated data layers for MPRA experiments. Barcode counts must be pre-set before accessing, as they cannot be computed within this class. The normalization process includes pseudocount handling to avoid division by zero and supports per-barcode normalization.
- Raises:
MPRAlibException – If barcode counts are not set when accessed.
- property barcode_counts: ndarray[tuple[Any, ...], dtype[int32]]
Returns the barcode counts matrix: the number of observed, unfiltered barcodes for each oligo.
- Type:
NDArray[np.int32]
- drop_barcode_counts()[source]
Removes or clears the barcode counts data from the current object.
- Raises:
NotImplementedError – If the method is not yet implemented.
- classmethod from_file(file_path)[source]
Create an instance of the class from a file.
This method reads data from a specified file (reporter experiment barcode or reporter experiment file format), processes it, and returns an instance of the class containing the data in an AnnData object.
- Return type:
- Parameters:
file_path (str) – Path to the input file containing reporter experiment barcode or reporter experiment data.
- Returns:
An instance of MPRAData containing the processed data in an AnnData object.
- Raises:
IOError – If the file cannot be read or parsed.
ValueError – If the file format is invalid.
- class mpralib.mpradata.Modality(*values)[source]
Bases:
Enum
An enumeration representing different data modalities in MPRA (Massively Parallel Reporter Assay) experiments.
- ACTIVITY = 'activity'
Represents activity data modality, typically calculated as the log2 ratio of normalized RNA to DNA counts.
- Type:
str
- DNA = 'dna'
Represents DNA data modality.
- Type:
str
- DNA_NORMALIZED = 'dna_normalized'
Represents normalized DNA data modality.
- Type:
str
- RNA = 'rna'
Represents RNA data modality.
- Type:
str
- RNA_NORMALIZED = 'rna_normalized'
Represents normalized RNA data modality.
- Type:
str
- classmethod from_string(value)[source]
Creates a Modality enum member from a string value.
- Return type:
Modality
- Parameters:
value (str) – The string representation of the enum member.
- Returns:
The corresponding Modality enum member.
- Raises:
ValueError – If the provided string does not match any Modality member.
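The ACTIVITY modality is described above as the log2 ratio of normalized RNA to DNA counts, and MPRAData exposes a scaling factor and a pseudocount for normalization. The sketch below shows one way these pieces can fit together for a single replicate. The exact normalization formula and the default values scaling=1e6 and pseudocount=1 are assumptions; the documentation above does not state them.

```python
import math


def normalize(counts, scaling=1e6, pseudocount=1):
    """Scale pseudocount-adjusted counts to a common library size (assumed formula)."""
    adjusted = [c + pseudocount for c in counts]
    total = sum(adjusted)
    return [value / total * scaling for value in adjusted]


def activity(rna_counts, dna_counts, scaling=1e6, pseudocount=1):
    """log2 ratio of normalized RNA to normalized DNA, element by element."""
    rna = normalize(rna_counts, scaling, pseudocount)
    dna = normalize(dna_counts, scaling, pseudocount)
    return [math.log2(r / d) for r, d in zip(rna, dna)]
```

Because the pseudocount is added before scaling, an element with zero DNA or zero RNA reads still yields a finite activity value instead of a division by zero or a logarithm of zero.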
mpralib.utils.file_validation
- class mpralib.utils.file_validation.ValidationSchema(*values)[source]
Bases:
Enum
- REPORTER_BARCODE_TO_ELEMENT_MAPPING = 'reporter_barcode_to_element_mapping'
- REPORTER_ELEMENT = 'reporter_element'
- REPORTER_EXPERIMENT = 'reporter_experiment'
- REPORTER_EXPERIMENT_BARCODE = 'reporter_experiment_barcode'
- REPORTER_GENOMIC_ELEMENT = 'reporter_genomic_element'
- REPORTER_GENOMIC_VARIANT = 'reporter_genomic_variant'
- REPORTER_SEQUENCE_DESIGN = 'reporter_sequence_design'
- REPORTER_VARIANT = 'reporter_variant'
- mpralib.utils.file_validation.validate_tsv_with_schema(tsv_file_path, schema_type)[source]
Validates a TSV file against a specified JSON schema.
This function reads a TSV file (optionally gzipped), converts each row to a dictionary, and validates each row against the provided JSON schema. If any row fails validation, a warning is logged. If an unexpected error occurs during validation, it is logged and raised.
- Return type:
bool
- Parameters:
tsv_file_path (str) – Path to the TSV file to validate. The file may be gzipped.
schema_type (ValidationSchema) – The type of schema to validate against.
- Returns:
True if all rows are valid according to the schema, False otherwise.
- Raises:
Exception – If an unexpected error occurs during validation.
- Logs:
Warnings for each row that fails schema validation.
Errors for unexpected exceptions during validation.
Info if the file is valid according to the schema.
Warning if the file is not valid according to the schema.
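The row-by-row flow described above (open a plain or gzipped TSV, turn each row into a dictionary, validate it, log failures, and return an overall verdict) can be sketched with the standard library alone. In this sketch the JSON schema check is replaced by a caller-supplied predicate, row_is_valid, which is a hypothetical stand-in; open_text, validate_tsv, and count_is_integer are likewise illustrative names.

```python
import csv
import gzip
import logging


def open_text(path):
    """Open a plain or gzip-compressed file as text, detected by magic number."""
    with open(path, "rb") as handle:
        is_gzip = handle.read(2) == b"\x1f\x8b"
    if is_gzip:
        return gzip.open(path, "rt", newline="")
    return open(path, "r", newline="")


def validate_tsv(path, row_is_valid):
    """Return True only if every row passes `row_is_valid`; log each failure."""
    all_valid = True
    with open_text(path) as handle:
        # Data rows start at line 2 because line 1 is the header.
        for line_number, row in enumerate(csv.DictReader(handle, delimiter="\t"), start=2):
            if not row_is_valid(row):
                logging.warning("Row at line %d failed validation", line_number)
                all_valid = False
    return all_valid


def count_is_integer(row):
    """Hypothetical rule standing in for a schema: 'count' must be a non-negative integer."""
    return row["count"].isdigit()
```

Continuing past the first invalid row, rather than stopping, lets one pass report every problem in the file.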
mpralib.utils.io
- mpralib.utils.io.export_activity_file(mpradata, output_file_path)[source]
Export activity data from an MPRAData object to a tab-separated values (TSV) file.
The function processes the grouped data from the MPRAData object, extracts relevant information for each replicate, and writes the data to a TSV file. The output file contains columns for replicate, oligo name, DNA counts, RNA counts, normalized DNA counts, normalized RNA counts, log2 fold change, and the number of barcodes. Barcode filters, count sampling and barcode thresholds are applied.
- Return type:
None
- Parameters:
mpradata (MPRAData) – An object containing MPRA (Massively Parallel Reporter Assay) data.
output_file_path (str) – The file path where the output TSV file will be saved.
- mpralib.utils.io.export_barcode_file(mpradata, output_file_path)[source]
Export barcode count data to a file.
This function takes an MPRAData object and exports its barcode count data to a specified file path in tab-separated values (TSV) format. The output file will contain columns for barcodes, oligo names, and DNA/RNA counts for each replicate. If a barcode filter or count sampling has been applied, the modified counts are written.
- Return type:
None
- Parameters:
mpradata (MPRAData) – An object containing MPRA data, including barcodes, oligos, DNA counts, RNA counts, and replicates.
output_file_path (str) – The file path where the output TSV file will be saved.
- mpralib.utils.io.export_counts_file(mpradata, output_file_path, normalized=False, filter=None)[source]
- Return type:
None
- mpralib.utils.io.is_bgzf(filepath)[source]
Check if a file is in BGZF (Blocked GNU Zip Format) format.
BGZF is a variant of the standard gzip format with extra fields that allow for random access. This function reads the file header and checks for the BGZF-specific magic numbers and flags.
- Return type:
bool
- Parameters:
filepath (str) – Path to the file to be checked.
- Returns:
True if the file is in BGZF format, False otherwise.
- mpralib.utils.io.is_compressed_file(filepath)[source]
Check if a file is compressed (gzip or bgz).
- Return type:
bool
- Parameters:
filepath (str) – Path to the file to check.
- Returns:
True if the file is compressed, False otherwise.
- mpralib.utils.io.is_gzip_file(filepath)[source]
Check if a file is a gzip-compressed file based on its magic number.
- Return type:
bool
- Parameters:
filepath (str or Path) – Path to the file to check.
- Returns:
True if the file is gzip-compressed, False otherwise.
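The header checks behind the three functions above can be sketched as follows. A gzip stream begins with the bytes 0x1f 0x8b. A BGZF block is a gzip member that additionally uses the deflate method (0x08), sets the FEXTRA flag (0x04), and carries an extra subfield with identifier "BC" and length 2. The function names below are illustrative and the code is not the library's implementation; it assumes the "BC" subfield comes first in the extra field, which is the usual layout.

```python
import struct


def read_header(filepath, size=18):
    """Read the first `size` bytes of a file."""
    with open(filepath, "rb") as handle:
        return handle.read(size)


def looks_like_gzip(filepath):
    """gzip streams start with the magic bytes 0x1f 0x8b."""
    return read_header(filepath, 2) == b"\x1f\x8b"


def looks_like_bgzf(filepath):
    """BGZF: gzip magic, deflate method, FEXTRA flag, and a 'BC' extra subfield."""
    header = read_header(filepath)
    if len(header) < 18 or header[:4] != b"\x1f\x8b\x08\x04":
        return False
    subfield_id = header[12:14]
    # Subfield length is a little-endian 16-bit integer at offset 14.
    subfield_length = struct.unpack("<H", header[14:16])[0]
    return subfield_id == b"BC" and subfield_length == 2
```

Every BGZF file also passes the gzip check, so a combined "is compressed" test only needs to fall back to the BGZF check when distinguishing the two formats matters, for example before attempting random access.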
- mpralib.utils.io.read_sequence_design_file(file_path)[source]
Read sequence design from a tab-separated values (TSV) file.
This function reads metadata from a TSV file and returns it as a pandas DataFrame. The metadata file should contain columns for sample ID, replicate, and any additional metadata. The sample ID should correspond to the oligo name in the MPRA data object.
- Return type:
DataFrame
- Parameters:
file_path (str) – The file path of the metadata TSV file.
- Returns:
A DataFrame containing the metadata.