Citation: Lee, S. (2024, May 25). Automated Profile HMM Creation With HMMER & MAFFT Using Python. Sepo.
March 25th, 2024
Introduction
Profile HMMs built by HMMER and other similar software is a position-specific scoring model that describes what residues are most likely to occur and the frequency of insertions or deletions that occur at the position within a multiple sequence alignment [1]. This is suggested to be more accurate in determining a homologous sequence because in certain sequences, such deletions, insertions, or substitutions may not affect the function of the sequence [1].
Limitations
While profile HMMs produced by HMMER are really useful, they do come with their fair share of limitations. For example, each residue is considered to be independent of the other residues in the amino acid sequence which neglects the possibility of a greater correlation [1]. Therefore, it has been suggested profile HMMs are not ideal for modeling structural RNA because of correlated base pairs [1]. Furthermore, phylogenetic relationships are not considered and are instead assumed to be independent [1].
Method
Reference Sequence Collection
If you already have your protein sequences, you can skip this section.
However, this may not always be the case. In that scenario, it is best to find the protein sequences of the gene on the NCBI Protein database (https://www.ncbi.nlm.nih.gov/protein). If you have a particular organism you want to investigate, it is best to use a reference protein sequence from the organism or relative of the organism of interest.
You will need to download the FASTA file of the gene which should have a .faa
extension. Then, you will need to run a Protein BLAST in the NCBI browser with the gene as the query sequence. In most instances, the default settings should be fine when running BLAST.
Once the BLAST job is finished, you can download the sequences that were found to be similar to the query sequence. Note that the smaller the e-value, the more similar the sequences are. Therefore, you can download all the sequences from the results or just 20. It is up to your discretion on what would make an ideal reference dataset for the profile HMM.
Prerequisites
Install Python, MAFFT, and HMMER on your system [1, 2].
Ensure you have a FASTA file with protein sequences ready. In this article, it will be called
gene.fasta
.
Python Automation Script Creation
Open your preferred code editor: This could be anything from Notepad or TextEdit to more sophisticated IDEs like Visual Studio Code or PyCharm.
Create a new Python file: Name it something descriptive, like align_and_build_hmm.py
.
Copy and paste the following Python script into your new file:
import subprocess
# Function to align sequences with MAFFT
def align_sequences(input_file, output_file):
mafft_command = f"mafft --auto {input_file} > {output_file}"
subprocess.call(mafft_command, shell=True)
# Function to build a profile HMM from the alignment
def build_hmm(input_file, output_file):
hmmbuild_command = f"hmmbuild {output_file} {input_file}"
subprocess.call(hmmbuild_command, shell=True)
# Main function that orchestrates the alignment and HMM building
def main():
# Align sequences with MAFFT
align_sequences('gene.fasta', 'aligned_gene.fasta')
# Build a profile HMM from the alignment
build_hmm('aligned_gene.fasta', 'gene.hmm')
if __name__ == "__main__":
main()
Save the file. It will be helpful if it is saved in a directory that’s easy to navigate from your terminal or command prompt.
Prepare Your Input Data
Place the
gene.fasta
file (which contains the protein sequences you wish to align and analyze) in the same directory as your Python script.
Run the Script
You can also run the script through the terminal:
Open your terminal or command prompt and navigate to the directory containing your Python script and the
gene.fasta
file. You can navigate using thecd
command followed by the path to your directory.Execute the script by typing the following command and pressing Enter:
python align_and_build_hmm.py
Review the Output
After running the script, check the directory for two new files:
aligned_gene.fasta
: This file contains your aligned sequences, output by MAFFT.gene.hmm
: This is the profile HMM generated from your aligned sequences by HMMER.
Conclusion
Congratulations! You’ve successfully automated the process of aligning sequences and generating a profile HMM. This script can be a valuable tool in bioinformatics workflows, allowing you to efficiently process sequences with minimal manual intervention.
References
HMMER: biosequence analysis using profile hidden Markov models [http://hmmer.org/]
Katoh, K., & Standley, D. M. (2013). MAFFT multiple sequence alignment software version 7: improvements in performance and usability. Molecular biology and evolution, 30(4), 772–780. https://doi.org/10.1093/molbev/mst010