3D Plot Animations: Binding Site, Alternative Splicing, 3D Protein

Byte Array Encoders, Clustering and Preprocessing Steps for Compression, and Similarity Search Tools for FASTA/FASTQ Files

Check It Out (GitHub Repository)

Encode, cluster, and analyze DNA sequences, using several encoding schemes and clustering algorithms to process and compress DNA data.

The code has also been wrapped into a PyPI package:

GeneVecTools (PyPI Package)

Encoder Class

Purpose: To encode and decode DNA sequences and clusters using a custom one-hot encoding scheme.

Input Fields:

  • Cluster of Sequences: A collection of DNA sequences to be encoded.

    • List of lists where each sub-list contains sequences and their corresponding indices.

    • Example: [["ACGT", 1], ["TCGA", 1], ["ACGTGTCGAGTGT", 2]]

Methods:

  1. encode_sequence(sequence)

    • Encodes a DNA sequence into an integer using a custom one-hot encoding scheme.

    • Input: sequence (str) - DNA sequence to encode.

    • Output: int - Encoded integer.

  2. encode_cluster(cluster)

    • Encodes a cluster of DNA sequences into a byte array.

    • Input: cluster (list) - List of sequences and their indices.

    • Output: bytearray - Encoded byte array.

  3. encode_clusters(clusters)

    • Encodes multiple clusters of DNA sequences.

    • Input: clusters (list) - List of clusters.

    • Output: list - List of encoded byte arrays.

  4. decode_sequence(val)

    • Decodes an integer back into a DNA sequence.

    • Input: val (int) - Encoded integer.

    • Output: str - Decoded DNA sequence.

  5. decode_cluster(byte_array)

    • Decodes a byte array back into a cluster of DNA sequences.

    • Input: byte_array (bytearray) - Encoded byte array.

    • Output: list - Decoded cluster of sequences and their indices.

  6. decode_clusters(clusters)

    • Decodes multiple encoded clusters.

    • Input: clusters (list) - List of encoded byte arrays.

    • Output: list - List of decoded clusters.
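
As a rough illustration of the encode_sequence / decode_sequence round trip, the sketch below packs each base into two bits behind a leading sentinel bit. It is a stand-in chosen for brevity, not the package's custom one-hot scheme, and the base-to-bits table is an assumption.

```python
# Hypothetical 2-bit packing; NOT the package's actual encoding scheme.
BASE_TO_BITS = {"A": 0, "C": 1, "G": 2, "T": 3}
BITS_TO_BASE = {bits: base for base, bits in BASE_TO_BITS.items()}


def encode_sequence(sequence: str) -> int:
    """Pack a DNA string into an int, two bits per base."""
    value = 1  # leading sentinel bit preserves the length (and leading 'A's)
    for base in sequence:
        value = (value << 2) | BASE_TO_BITS[base]
    return value


def decode_sequence(val: int) -> str:
    """Invert encode_sequence by peeling off two bits at a time."""
    bases = []
    while val > 1:  # stop once only the sentinel bit is left
        bases.append(BITS_TO_BASE[val & 0b11])
        val >>= 2
    return "".join(reversed(bases))


encoded = encode_sequence("ACGT")   # 0b1_00_01_10_11 == 283
decoded = decode_sequence(encoded)  # "ACGT"
```

Without the sentinel bit, "AACG" and "ACG" would map to the same integer, so some length marker is needed for the decoding to be lossless.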

Expected Output:

  • Encoded Cluster: The encoded representation of the input cluster of sequences.

    • Byte array.

  • Decoded Cluster: The original cluster obtained after decoding the encoded cluster.

    • List of lists containing sequences and their indices.
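
To make the byte-array round trip concrete, here is a minimal sketch of encode_cluster and decode_cluster under an assumed record layout: a 2-byte index, a 2-byte sequence length, then the bases packed four per byte. The layout and the helpers pack_bases and unpack_bases are hypothetical and are not taken from the package.

```python
BASES = "ACGT"  # assumed 2-bit alphabet; ambiguous bases such as N are not handled


def pack_bases(sequence: str) -> bytes:
    """Pack four bases per byte, padding the final byte with zero bits."""
    out = bytearray()
    for start in range(0, len(sequence), 4):
        chunk = sequence[start:start + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | BASES.index(base)
        byte <<= 2 * (4 - len(chunk))  # left-align a short final chunk
        out.append(byte)
    return bytes(out)


def unpack_bases(data, length: int) -> str:
    """Unpack bytes into bases and drop the padding beyond `length`."""
    bases = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            bases.append(BASES[(byte >> shift) & 0b11])
    return "".join(bases[:length])


def encode_cluster(cluster: list) -> bytearray:
    """Serialize [sequence, index] records; index and length must fit in 2 bytes."""
    out = bytearray()
    for sequence, index in cluster:
        out += index.to_bytes(2, "big")
        out += len(sequence).to_bytes(2, "big")
        out += pack_bases(sequence)
    return out


def decode_cluster(byte_array: bytearray) -> list:
    """Walk the byte array record by record and rebuild the cluster."""
    cluster, pos = [], 0
    while pos < len(byte_array):
        index = int.from_bytes(byte_array[pos:pos + 2], "big")
        length = int.from_bytes(byte_array[pos + 2:pos + 4], "big")
        n_bytes = (length + 3) // 4
        payload = byte_array[pos + 4:pos + 4 + n_bytes]
        cluster.append([unpack_bases(payload, length), index])
        pos += 4 + n_bytes
    return cluster


cluster = [["ACGT", 1], ["TCGA", 1], ["ACGTGTCGAGTGT", 2]]
encoded = encode_cluster(cluster)
decoded = decode_cluster(encoded)  # equal to the original cluster
```

Storing the length explicitly is what lets the decoder discard the zero padding in the last byte of each record.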

Mapper Class

Purpose: To create feature sets from DNA sequences, perform clustering, and compress the clustered sequences.

Input Fields:

  • Sequence List: List of DNA sequences to create feature sets and perform clustering.

    • Example: ["ACGT", "TCGA", "ACGTGTCGAGTGT", "ACGATGCGCGCTAGGT", "ACGTGTCGCGCAATCGCTAGAC"]

  • Feature Set Parameters:

    • k: The length of the k-mers to be considered.

      • Positive integer.

    • m: Number of features with the highest variance to be selected.

      • Positive integer.

Methods:

  1. encode(s)

    • Encodes a sequence or list of sequences into an integer.

    • Input: s (list or str) - Sequence or list of sequences to encode.

    • Output: int - Encoded integer.

  2. decode(c, k)

    • Decodes an integer back into a sequence.

    • Input: c (int) - Encoded integer, k (int) - Length of k-mers.

    • Output: str - Decoded sequence.

  3. vec(i, j, k, R)

    • Generates a feature vector for a given sequence.

    • Input: i (int), j (int), k (int), R (list) - Parameters for feature vector generation.

    • Output: float - Feature vector.

  4. feature_set(R, k)

    • Creates a feature set from a list of sequences.

    • Input: R (list), k (int) - List of sequences and length of k-mers.

    • Output: list - Feature set.

  5. select_high_variance(feature_set, m)

    • Selects features with the highest variance.

    • Input: feature_set (list), m (int) - Feature set and number of features to select.

    • Output: list - Selected features.

  6. groupings(S, sequences)

    • Groups similar sequences based on clustering.

    • Input: S (list), sequences (list) - Clustering result and list of sequences.

    • Output: list - Grouped sequences.
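
One plausible reading of feature_set and select_high_variance is sketched below, using only the standard library. The feature definition (raw k-mer counts per sequence) is an assumption; the package may weight or normalize features differently.

```python
from itertools import product
from statistics import pvariance


def feature_set(R: list, k: int) -> list:
    """Return one k-mer count vector per sequence (assumed feature definition)."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    column = {kmer: j for j, kmer in enumerate(kmers)}
    features = []
    for seq in R:
        row = [0] * len(kmers)
        for start in range(len(seq) - k + 1):
            # Characters outside ACGT would raise KeyError in this sketch.
            row[column[seq[start:start + k]]] += 1
        features.append(row)
    return features


def select_high_variance(features: list, m: int) -> list:
    """Keep the m columns whose counts vary most across sequences."""
    n_cols = len(features[0])
    variances = [pvariance([row[j] for row in features]) for j in range(n_cols)]
    ranked = sorted(range(n_cols), key=lambda j: variances[j], reverse=True)
    keep = sorted(ranked[:m])  # restore the original column order
    return [[row[j] for j in keep] for row in features]


R = ["ACGT", "TCGA", "ACGTGTCGAGTGT", "ACGATGCGCGCTAGGT", "ACGTGTCGCGCAATCGCTAGAC"]
full = feature_set(R, k=2)                  # 5 rows, 4**2 = 16 columns
reduced = select_high_variance(full, m=4)   # 5 rows, 4 columns
```

Dropping low-variance columns removes k-mers that occur about equally often in every sequence and so contribute little to telling clusters apart.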

Expected Output:

  • Feature Set: The extracted features based on the provided sequence list and k-mer length.

    • List of vectors representing the features.

  • Clustered Data: The result of clustering the feature set.

    • List of clustered sequences.

  • Grouped Sequences: The grouped similar sequences based on the clustering.

    • List of lists containing grouped sequences.
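
The grouping step can be sketched as follows, assuming the clustering result S is a flat list holding one cluster label per sequence. The format of S is not specified above, so that shape is a guess, and the labels in the example are made up.

```python
from collections import defaultdict


def groupings(S: list, sequences: list) -> list:
    """Group sequences by label; S[i] is assumed to be the label of sequences[i]."""
    if len(S) != len(sequences):
        raise ValueError("expected one label per sequence")
    groups = defaultdict(list)
    for label, seq in zip(S, sequences):
        groups[label].append(seq)
    # Dicts preserve insertion order, so groups appear in order of first occurrence.
    return list(groups.values())


sequences = ["ACGT", "TCGA", "ACGTGTCGAGTGT", "ACGATGCGCGCTAGGT"]
labels = [0, 1, 0, 1]  # made-up labels, not the output of a real clustering run
grouped = groupings(labels, sequences)
```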

VecMapper Class (from vectorize.py)

Purpose: To create k-mers from DNA sequences, generate feature vectors, and select features with the highest variance.

Input Fields:

  • Sequence List: List of DNA sequences to create k-mers and feature vectors.

    • Example: ["ACGT", "TCGA", "ACGTGTCGAGTGT", "ACGATGCGCGCTAGGT", "ACGTGTCGCGCAATCGCTAGAC"]

Methods:

  1. phi(base)

    • Maps a nucleotide to an integer.

    • Input: base (str) - Nucleotide (A, C, G, T).

    • Output: int - Mapped integer.

  2. encoding_function(s)

    • Encodes a k-mer into an integer.

    • Input: s (str) - k-mer.

    • Output: int - Encoded integer.

  3. minimizer(s)

    • Calculates the minimizer of a sequence.

    • Input: s (str) - Sequence.

    • Output: int - Minimizer.

  4. get_kmer_frequency(kmer, read_list)

    • Calculates the frequency of a k-mer in a list of sequences.

    • Input: kmer (str), read_list (list) - k-mer and list of sequences.

    • Output: float - Frequency.

  5. feature_vector(reads_matrix, minimizer_list_of_lists)

    • Generates a feature vector for a list of sequences.

    • Input: reads_matrix (list), minimizer_list_of_lists (list) - List of sequences and minimizers.

    • Output: list - Feature vectors.

  6. select_m_highest_variance(feature_matrix, m)

    • Selects features with the highest variance.

    • Input: feature_matrix (list), m (int) - Feature matrix and number of features to select.

    • Output: list - Selected features.

  7. make_kmers(sequences, k)

    • Creates k-mers from a list of sequences.

    • Input: sequences (list), k (int) - List of sequences and length of k-mers.

    • Output: list - List of k-mers.
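
The first few methods can be illustrated with a short sketch. It assumes the mapping A=0, C=1, G=2, T=3 and reads a k-mer as a base-4 number; the package may use a different mapping. The k argument on minimizer is an addition for this example, since the signature above takes only s.

```python
PHI = {"A": 0, "C": 1, "G": 2, "T": 3}  # assumed mapping, not confirmed by the package


def phi(base: str) -> int:
    """Map a nucleotide to an integer."""
    return PHI[base]


def encoding_function(s: str) -> int:
    """Read a k-mer as a base-4 number, most significant base first."""
    value = 0
    for base in s:
        value = 4 * value + phi(base)
    return value


def make_kmers(sequences: list, k: int) -> list:
    """Return, for each sequence, its overlapping k-mers in order."""
    return [[seq[i:i + k] for i in range(len(seq) - k + 1)] for seq in sequences]


def minimizer(s: str, k: int = 3) -> int:
    """Smallest encoded k-mer in s (k is a hypothetical extra parameter)."""
    if len(s) < k:
        raise ValueError("sequence shorter than k")
    return min(encoding_function(kmer) for kmer in make_kmers([s], k)[0])


kmers = make_kmers(["ACGT", "TCGA"], 3)  # [["ACG", "CGT"], ["TCG", "CGA"]]
smallest = minimizer("TCGA")             # encoding of "CGA" == 24
```

Because similar reads tend to share their smallest k-mer, a minimizer gives a cheap key for bucketing reads before any more expensive comparison.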

Expected Output:

  • K-mers: The generated k-mers from the sequences.

    • List of k-mers.

  • Feature Vectors: The generated feature vectors from the sequences.

    • List of vectors.

  • Selected Features with Highest Variance: The features with the highest variance.

    • List of selected features.
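
For the frequency-based features, a minimal sketch of get_kmer_frequency is given here. It assumes that "frequency" means the fraction of all windows of the k-mer's length, across every read, that match the k-mer; the package may instead count reads or normalize differently.

```python
def get_kmer_frequency(kmer: str, read_list: list) -> float:
    """Fraction of all len(kmer)-wide windows in read_list that equal kmer."""
    k = len(kmer)
    hits = total = 0
    for read in read_list:
        for i in range(len(read) - k + 1):
            total += 1
            hits += read[i:i + k] == kmer  # True counts as 1
    return hits / total if total else 0.0


reads = ["ACGT", "TCGA", "ACGTGTCGAGTGT"]
freq_cg = get_kmer_frequency("CG", reads)  # 4 matches out of 18 windows
```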

Together, the Encoder, Mapper, and VecMapper classes cover encoding, clustering, and analysis of DNA sequences. Each method takes the inputs and returns the outputs listed above, so the classes can be combined to process and compress genetic data.

