Which BLAST to use

A score of zero indicates that a given pair of amino acids was found aligned in the database as often as expected by chance; a positive score indicates that the pair was aligned more often than expected by chance, and a negative score indicates that it was aligned less often.
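The sign convention described above is what a log-odds score produces. The following is a minimal sketch, assuming the score is a scaled, rounded base-2 logarithm of the observed pair frequency over the frequency expected by chance; the function name, the scale factor, and the frequencies are hypothetical and chosen only for illustration.

```python
import math

def log_odds_score(observed_pair_freq, freq_a, freq_b, scale=2.0):
    """Scaled log-odds score for aligning residue a with residue b.

    observed_pair_freq: how often the pair is seen aligned (hypothetical value).
    freq_a, freq_b: background frequencies of the two residues.
    scale: hypothetical scaling factor applied before rounding.
    """
    expected_by_chance = freq_a * freq_b
    return round(scale * math.log2(observed_pair_freq / expected_by_chance))

# Hypothetical background frequencies: expected pair frequency is 0.03125.
print(log_odds_score(0.03125, 0.25, 0.125))    # as often as chance: zero
print(log_odds_score(0.125, 0.25, 0.125))      # more often than chance: positive
print(log_odds_score(0.0078125, 0.25, 0.125))  # less often than chance: negative
```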

The BLOSUM matrices (Figure 2b) were constructed in a similar manner, but from sequences that were selected to avoid frequently occurring, highly related sequences. The underlying data were derived from the BLOCKS database [ 19 , 20 ], which is a set of ungapped alignments of sequences from families of related proteins. Using blocks of aligned sequence segments that characterize groups of related proteins, the sequences in each block were sorted into closely related clusters, and the frequencies of substitutions between these clusters within a family were used to calculate the probability of a meaningful substitution.

Lower cutoff values allow more diverse sequences into the groups, and the corresponding matrices are therefore appropriate for examining more distant relationships. Mutational events include not only substitutions but also insertions and deletions.

The consequence with respect to sequence alignment and comparison is the need to introduce gaps into one or both sequences in order to produce a proper alignment. The penalty for the creation of a gap should be large enough that gaps are introduced only where needed, and the penalty for extending a gap should take into account the likelihood that insertions and deletions occur over several residues at a time.

For example, some protein structural elements tend to evolve as a unit, but entire elements may move relative to one another. Affine gap penalties, which impose an 'opening' penalty for a gap and an 'extension' penalty that decreases the relative penalty for each additional position in an already opened gap, address both of these issues. NCBI's BLAST page [ 2 ] allows one to choose from several different sets of parameters for scoring gaps: existence penalties of 7, 8, and 9 with an extension penalty of 2, and existence penalties of 10, 11, and 12 with an extension penalty of 1.
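A short sketch may make the affine scheme concrete. It assumes the convention that a gap of length k costs the existence penalty plus k times the extension penalty; the function name is hypothetical, and the parameter values are taken from the sets listed above.

```python
def affine_gap_cost(length, existence=11, extension=1):
    """Cost of one gap of the given length under an affine penalty.

    Assumed convention: cost = existence + extension * length.
    """
    if length <= 0:
        return 0
    return existence + extension * length

# One gap of five positions is much cheaper than five separate
# single-position gaps, which is what favors contiguous indels.
one_long_gap = affine_gap_cost(5)
five_short_gaps = 5 * affine_gap_cost(1)
print(one_long_gap, five_short_gaps)
```

With an existence penalty of 7 and an extension penalty of 2, the same function gives a cost of 13 for a gap of three positions.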

The need for an automated way of finding the optimal alignment out of the numerous alternatives is clear, but the method must be consistent and biologically meaningful. Choosing a good alignment by eye is possible, but life is too short to do it more than once or twice. For two long sequences, doing this directly would take a considerable amount of time, even on the fastest computers. Examining the calculations in detail, however, one might notice that the vast majority of the time would be spent evaluating the same portions of the candidate alignments many times over.

This redundant aspect of sequence comparison makes it amenable to a time-saving shortcut called dynamic programming. Dynamic programming methods were first described outside the context of bioinformatics, and were first applied in this context by Needleman and Wunsch [ 22 ]. These methods find an optimal solution to a given problem by breaking the original problem into smaller and smaller subproblems until the subproblems have a trivial solution, and then using those solutions to construct solutions for larger and larger portions of the original problem.

In sequence comparison, the overall problem is determining the optimal alignment of two sequences. This is broken down into smaller and smaller alignments of parts of one sequence with parts of another sequence to the smallest case, which is the alignment of a single residue from one sequence with a single residue from the other sequence.

The solution to this smallest subproblem is known, and is taken from the scoring matrix. A generalization of the recursive dynamic programming approach, the Smith-Waterman algorithm [ 23 ], is an exhaustive, mathematically optimal method, which handles sequence comparisons in a single computation and is guaranteed to find the highest-scoring alignment.

The algorithm incorporates the concepts of mismatches and gaps, and identifies optimal local alignments. Local alignments, where parts of one sequence are aligned to parts of another, are more biologically relevant than global alignments, where entire sequences are aligned to each other, because long regions of high similarity are the exception, rather than the rule, for most biological applications.
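The local-alignment recurrence can be sketched in a few lines. This is a simplified illustration, not the published algorithm in full: it assumes a flat match/mismatch score instead of a PAM or BLOSUM matrix and a linear rather than affine gap penalty, and all parameter values are hypothetical.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (best score, aligned part of a, aligned part of b)."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]  # H[i][j]: best local score ending at a[i-1], b[j-1]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch
            # A cell never drops below zero: that is what makes the alignment local.
            H[i][j] = max(0, H[i - 1][j - 1] + s, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)
    # Trace back from the best cell until a zero cell is reached.
    i, j = best_pos
    top, bottom = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        s = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + s:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))

print(smith_waterman("TTTACGTTT", "GGGACGTGGG"))
```

The two example sequences share only the segment ACGT, so the flanking residues are left out of the reported alignment instead of being forced into it as a global alignment would require.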

As fast as computers are, and as efficient as the dynamic programming algorithms are, they are still far too slow to enable exhaustive searches of huge sequence repositories such as GenBank [ 24 , 25 ] or SWISS-PROT [ 26 , 27 ]. An exhaustive search of GenBank is still beyond the reach of most researchers' computer power - and with the growth of sequence databases outstripping increases in computation speed, this situation is not going to get better any time soon.

Neither heuristic is guaranteed to find the best local alignment, but in practice they almost always do. These high-scoring 'hits' are used as 'seeds' for the slower, more sophisticated dynamic programming algorithm. BLAST also performs some pre-processing of the query sequence, to filter out low-complexity regions such as CA repeats and to discard words not likely to form high-scoring pairs.

From a practical standpoint, BLAST is generally the way to go, not only because of its better accuracy, but also because of its availability and its wide acceptance as the standard.

If we define a segment as a contiguous subsequence of a nucleotide or amino-acid sequence, and a segment pair as a pair of segments of the same length, one from each of the two sequences being compared, then the task that BLAST performs is the identification of all pairs of similar segments whose score exceeds a given threshold.

The resulting pairs of similar segments are called high-scoring segment pairs (HSPs). The segment pair with the highest score is the maximal-scoring segment pair (MSP); its alignment cannot be improved by extending it or shortening it.

Details of each step are as follows. In the first step, a list of short words is generated from the query sequence. This word list is then expanded to include all high-scoring matching words, keeping only those that score more than the neighborhood word score threshold T when scored using a scoring matrix such as PAM or BLOSUM. For typical parameter values, this results in about 50 words per residue of the query sequence.

Low compositional complexity or short-periodicity repeats can yield extremely large numbers of statistically significant but biologically uninteresting results. The filtering and removal of these can be controlled with the -F flag of the stand-alone version of BLAST and with check boxes in the web version.

The default word lengths are 3 and 11, for amino-acid sequences and nucleotide sequences, respectively, and are adjustable using the -W flag in the stand-alone version. No gaps are allowed. The list of matches is reduced by taking only those that will score above a given threshold, called the neighborhood word-score threshold.

There is a trade-off at this stage between speed and sensitivity: a higher threshold gives greater speed but increases the chance of missing relevant pairs.
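The word list and its neighborhood expansion can be sketched as follows. This is a toy illustration under stated assumptions: a four-letter alphabet and a flat identity/mismatch score stand in for the 20 amino acids and a PAM or BLOSUM matrix, and the function names and threshold values are hypothetical.

```python
from itertools import product

ALPHABET = "ACDE"  # toy alphabet; a real protein search uses 20 residues

def toy_score(x, y):
    """Stand-in for a scoring-matrix lookup (hypothetical values)."""
    return 4 if x == y else -1

def query_words(query, w=3):
    """All overlapping words of length w in the query."""
    return [query[i:i + w] for i in range(len(query) - w + 1)]

def neighborhood(word, threshold, alphabet=ALPHABET, score=toy_score):
    """Words whose ungapped score against `word` exceeds the threshold T."""
    hits = []
    for candidate in product(alphabet, repeat=len(word)):
        total = sum(score(a, b) for a, b in zip(word, candidate))
        if total > threshold:
            hits.append("".join(candidate))
    return hits

print(query_words("ACDEA"))
print(len(neighborhood("ACD", threshold=6)))   # looser threshold, more words
print(len(neighborhood("ACD", threshold=11)))  # stricter threshold, fewer words
```

Raising the threshold shrinks the list that has to be looked up in the database, which is the speed side of the trade-off; the words that are dropped are the ones that might have seeded a more distant match.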

Approximately 50 of these matches are usually kept for each of the words generated from the original query. In the second step, BLAST searches through the target sequence database for exact matches to the word list generated (Figure 3b).

Because BLAST has already pre-processed and indexed the databases for the occurrence of all words in each sequence in the database, this search is extremely fast. If a match is found, it is used to seed a possible alignment between the query and the database sequences. In the third step, the original BLAST method tried to extend the alignment from the matching words in both directions as long as the score continued to increase (Figure 3c).
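A sketch of ungapped seed extension is given below. It is an illustration under assumptions, not the BLAST implementation: it uses a flat match/mismatch score, and it stops extending once the running score has fallen more than a drop-off value below the best score seen, which is one common way to decide that the score has stopped increasing. The function names and the drop-off parameter are hypothetical.

```python
def _extend(q, s, match, mismatch, x_drop):
    """Extend along paired residues; return (best gain, length at best gain)."""
    score = best = best_len = 0
    for k, (a, b) in enumerate(zip(q, s), start=1):
        score += match if a == b else mismatch
        if score > best:
            best, best_len = score, k
        elif best - score > x_drop:
            break  # score has dropped too far below the best seen
    return best, best_len

def extend_seed(query, subject, q_pos, s_pos, w, match=1, mismatch=-1, x_drop=2):
    """Extend a word hit of length w in both directions without gaps.

    Returns (score, query start, query end) of the extended segment pair.
    """
    seed = sum(match if a == b else mismatch
               for a, b in zip(query[q_pos:q_pos + w], subject[s_pos:s_pos + w]))
    right_gain, right_len = _extend(query[q_pos + w:], subject[s_pos + w:],
                                    match, mismatch, x_drop)
    # Extending left is extending right on the reversed prefixes.
    left_gain, left_len = _extend(query[:q_pos][::-1], subject[:s_pos][::-1],
                                  match, mismatch, x_drop)
    return seed + left_gain + right_gain, q_pos - left_len, q_pos + w + right_len

query = "GGACGTACGTCC"
subject = "TTACGTACGTAA"
print(extend_seed(query, subject, q_pos=3, s_pos=3, w=3))
```

In this example the three-residue seed CGT grows into the eight-residue segment ACGTACGT, and the mismatching flanks are left out.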

The resulting alignment was called a high-scoring pair, or HSP. Gapped BLAST [ 28 ] uses a lower threshold for generating the list of high-scoring matching words; the algorithm uses short matched regions with no insertions or deletions between them and within a certain distance of each other as the starting points for longer ungapped alignments.

Next, BLAST determines whether each score found by one of the above methods is greater in value than a given cutoff score S, determined empirically by examining the range of scores given by comparing random sequences and then choosing a value that is significantly greater.

The maximal scoring pairs, or MSPs, from the entire database are identified and listed. Finally, BLAST determines the statistical significance of each score, initially by calculating the probability that two random sequences, one the length of the query sequence and the other the length of the database (the sum of the lengths of all of the database sequences), with the same composition (nucleotide or amino acid), could produce the calculated score.

Sometimes, two or more segment pairs can be made into a longer alignment; in such cases, a combined assessment of the significance is made by one of two methods [ 29 ]: the Poisson method is based on the assumption that the probability of the multiple scores is higher when the lower score of each set is higher; the sum-of-scores method calculates the probability of the sum of the scores.

When the expectation value for a given database sequence satisfies the user-selectable threshold parameter (set by the -e flag with the stand-alone version; see Table 3), the match is reported.
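The text does not state the formula, so the following sketch assumes the standard Karlin-Altschul form, in which the expectation value is E = K·m·n·exp(−λS) for a raw score S, query length m, and database length n. The function names are hypothetical, and the values of λ and K are illustrative only; in practice they depend on the scoring matrix and gap penalties.

```python
import math

def bit_score(raw_score, lam, k):
    """Normalized bit score: (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)

def expect_value(raw_score, query_len, db_len, lam, k):
    """Expected number of chance hits scoring at least raw_score."""
    return k * query_len * db_len * math.exp(-lam * raw_score)

def is_reported(raw_score, query_len, db_len, lam, k, threshold=10.0):
    """A match is reported when its E value is within the chosen threshold."""
    return expect_value(raw_score, query_len, db_len, lam, k) <= threshold

LAMBDA, K = 0.267, 0.041  # illustrative parameter values (assumption)
for raw in (40, 60, 80):
    e = expect_value(raw, query_len=300, db_len=1_000_000, lam=LAMBDA, k=K)
    print(raw, round(bit_score(raw, LAMBDA, K), 1), e)
```

Because the database length enters the product directly, the same raw score becomes less significant as the database grows.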

The first part of the output is the header and gives the BLAST program and version used, the reference, and the names and lengths of the query sequence and the target database.

The second part is a summary of the sequences producing significant alignments along with normalized bit scores and E values. The third part displays the alignments and includes more detailed information about the scores, including raw score, bit score, E value and identity. If you frame your question carefully, meaning a careful choice of parameters and databases against which to search, BLAST and other sequence comparison tools can provide a vast resource of useful information.

But in using sequence similarity to infer homology, one should take care to follow a few simple rules. The first part is the header (a), which includes the BLAST program and version used, and the name and length of both the query sequence and of the target database. In this case, the program used was BLASTX, so the query sequence was a nucleotide sequence and was translated in all six frames and compared to a protein database, nr, which is the non-redundant protein database maintained by NCBI.

The second part of the output (b) is a summary of sequences producing significant alignments, along with both normalized scores and E values (see text for further details; only the four highest-scoring hits are shown). Given that nucleotide and protein databases are not uniformly populated, nucleotide and amino-acid sequence comparisons should be used to complement each other.

Only part of the subject sequences, when appropriate, is now retrieved, and performance results are presented under "Partial subject sequence retrieval" below. First, we introduce a set of BLAST command-line applications built with the software library discussed above. Then, we present an example use of database masking as well as two performance analyses that demonstrate improvements in search time: searches with very long queries and searches of chromosome-sized database sequences.

For each performance analysis, we prepared a baseline application that disables the new feature being tested. Finally, we discuss an example of retrieving subject sequences from an arbitrary source. Extensive documentation about the different command-line options is available [ 17 ], so only general comments about the interface are presented here.

For example, there is a "blastx" application that translates a nucleotide query and compares it to a protein database, and a "blastn" application that compares a nucleotide query to a nucleotide database.

The command-line options and help messages are specific to each application. Users also need to optimize for different tasks within a single command-line application. BLASTN, on the other hand, is the traditional nucleotide-nucleotide search program and uses a smaller word size and affine gapping by default. The concept of a "task" allows a user to optimize the search for different scenarios within one application.

Setting the task for the blastn application changes the default value of a number of command-line arguments, such as the word size, but also the default scoring parameters for insertions, deletions, and mismatches. These values are changed to typical values that would be used with the selected task. Power users of BLAST often have a specially crafted set of command-line options that they find useful for their particular task. However, lacking a method to save these, they must write scripts or simply re-type them for each search.
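As an illustration of how a task is selected, the sketch below only assembles an argument list for the blastn application and does not run it. The `-task`, `-query`, `-db`, and `-evalue` options are taken from the BLAST+ command-line documentation; the helper name, the file name, and the database name are hypothetical.

```python
def build_blastn_command(query_path, database, task="megablast", evalue=None):
    """Assemble (but do not execute) a blastn command line.

    query_path and database are placeholders supplied by the caller.
    """
    command = ["blastn", "-task", task, "-query", query_path, "-db", database]
    if evalue is not None:
        # Only override the task's default expect threshold when asked to.
        command += ["-evalue", str(evalue)]
    return command

# Same application, two tasks: each task brings its own default
# word size and scoring parameters.
fast = build_blastn_command("query.fa", "my_nucleotide_db")
sensitive = build_blastn_command("query.fa", "my_nucleotide_db", task="blastn", evalue=1e-5)
print(" ".join(fast))
print(" ".join(sensitive))
```

A list built this way can be handed to `subprocess.run` on a machine where BLAST+ is installed; options not given explicitly keep the defaults of the selected task.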

A user may then rerun a set of commands by specifying the strategy file, though a new query and database can be specified with the command-line. This file is currently written as ASN. Tables listing the command-line options, as well as their types and defaults, were provided as additional file 1 for this article. A specialized tool, such as WindowMasker [ 19 ] or RepeatMasker [ 10 ], can provide masking information for a single-species database when it is created, and it becomes unnecessary to mask every query.

Currently, database masking is only available in soft-masking mode. To test the performance of database masking, human ESTs from UniGene cluster were searched against the build Two sets of searches were run. One used the lower-case query masking to filter out interspersed repeats; the other used the database masking to do the same. Alignments with a score of or more were retained. Table 1 presents the results, which indicate that differences in query masking with RepeatMasker caused extra matches.

For example GI is only bases long and is not masked by RepeatMasker at all, but the portion of the genome it matches is masked. For GI the last 78 bases are not masked, but the portion of the genome it matches is masked by RepeatMasker. Currently, database masking is not supported for searches of translated database sequences i.

Database masking is not a new concept. Kent [ 13 ] mentions cases where BLAT users might find repeat masking of the database useful. Morgulis et al. In both of these cases, it is not simple to turn the masking on or off or to switch the type of masking e.

The implementation presented here allows this flexibility. Breaking longer queries into smaller pieces for processing can lead to significantly shorter search times. At the same time, splitting the query into pieces makes it possible to guarantee that the query length is always bounded, allowing the use of smaller data types in the lookup table.
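The idea of bounding the query length can be sketched as follows. The text does not describe how the pieces are chosen, so the chunk size, the overlap between neighboring pieces (included here so that a match spanning a boundary still falls entirely inside one piece), and the function name are all assumptions.

```python
def split_query(query, chunk_size, overlap):
    """Split a query into overlapping pieces of at most chunk_size residues.

    Returns a list of (offset, piece) pairs; the offset lets hits found in
    a piece be mapped back to coordinates on the full query.
    """
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    pieces = []
    start = 0
    while True:
        end = min(start + chunk_size, len(query))
        pieces.append((start, query[start:end]))
        if end == len(query):
            break
        start += step
    return pieces

pieces = split_query("ABCDEFGHIJ", chunk_size=4, overlap=1)
print(pieces)
# Every piece is bounded by chunk_size, so offsets within a piece fit
# in a smaller integer type than offsets into the whole query would.
```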

Use of a smaller data type never makes performance worse, so it is used in the tests described in this section. A baseline blastx application that does not split the query was prepared. Query splitting decreases the search time for queries longer than 20 kbases, and the improvement continues with increasing query length. The Cachegrind memory profiling tool [ 24 ] confirmed a smaller number of cache misses with query splitting. Figure 3 presents those results. Figures 2 and 3 reflect an expect value cutoff of 1.

The query length in kbases is on the x-axis, with a log scale. Three searches were performed with both the baseline and the blastx applications for each data point, and the lowest time for each application was used. Cache misses were measured by Cachegrind [ 24 ] and only misses reading from the cache are shown. On the x-axis are different query lengths in kbases. The number of L2 cache misses is shown on the y-axis.

The top line is for the baseline application without query splitting, the bottom line is for the blastx application. Cameron et al. This work emphasized improving the worst-case behavior typically seen with very long nucleotide queries.

The query splitting approach does not preclude the use of a DFA or some other optimization instead of a lookup table. Partial retrieval of subject sequences is most effective when a small fraction of the subject sequence is required in the trace-back phase, such as in a search of ESTs against chromosomes.

A baseline blastn application that retrieves the entire subject sequence in the trace-back phase was prepared. Figure 4 presents search times with the standard blastn application and a baseline application. A word size of 24 and database masking with RepeatMasker were used. The ESTs with matches to the largest number of subject sequences showed the best improvement.

The three rightmost data points on Figure 4 are for GIs , , and left to right. These three ESTs match four, six, and eight database sequences respectively. Overall, sequences matched only one subject sequence, two matched two sequences and there was one match each for four, six, and eight sequences.

On the x-axis are times for the baseline application; on the y-axis are times for the new blastn application. Sequences with the best improvement are those furthest to the right, and they also matched the largest number of subject sequences.

A word size of 24 was used for the runs as well as database masking with RepeatMasker. Three searches were done with both the baseline and blastn application for each data point, and the lowest time for each application was used.

Future developments include adding hard-masking support for databases, and making database masking available for programs with translated database sequences (tblastn and tblastx). At this point, only the scanning phase of the BLAST search is multi-threaded; we also plan to make the trace-back phase multi-threaded.

The design allows the addition of features that greatly benefit performance, such as query splitting and partial retrieval of subject sequences. It also allows the replacement of the lookup table with another design, so that new implementations can easily be added.

The new library also supports a framework for retrieving subject sequences from arbitrary data sources. The applications have a new, more logical organization that groups together similar types of searches in one application. The concept of a task allows a user to specify an optimal parameter set for a given task.

Strategy files were also introduced, allowing a user to record parameters of a search in order to later rerun it in stand-alone mode or at the NCBI web site. There are no restrictions on use by non-academics.

Welcome to course two, Week 2. In this section, for organism, you can either include or exclude a particular taxa. And you can either choose to include, leaving this box blank, or tick it to exclude.

We then use the non-redundant database and a bacteria taxa ID. This is a result from the search. Next is the graphical summary. Next is the turquoise box, which represents the length of your nucleotide query, which is represented by 1, nucleotides, roughly.

Below, you can see the red bars, and these indicate your alignment scores. Here, in our results, most of our sequences, if not all, are over And in this instance, the higher the score, the better the alignment. If we then minimise that, next is a Descriptions tab. The percentage identity is the percentage of identical positions between the query and the subject. The E value, or expected value, tells you how many matches this good you would expect to find by chance, given the length of the sequence and the size of the database.

The lower the E value, the better. If you minimise this, you want to go to the Alignment section.


