preg

 

Function

Regular expression search of a protein sequence

Description

This searches for matches of a regular expression to a protein sequence.

A regular expression specifies a pattern that can match more than one string of characters. Regular expressions are common in computer programming languages, so some users will already be familiar with them.

The following is a short guide to regular expressions in EMBOSS:

^
use this at the start of a pattern to insist that the pattern can only match at the start of a sequence. (eg. '^M' matches a methionine at the start of the sequence)
$
use this at the end of a pattern to insist that the pattern can only match at the end of a sequence (eg. 'R$' matches an arginine at the end of the sequence)
()
groups a pattern. This is commonly used with '|' (eg '(ACD)|(VWY)' matches either the first pattern 'ACD' or the second pattern 'VWY')
|
This is the OR operator to enable a match to be made to either one pattern OR another. There is no AND operator in this version of regular expressions.
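
The anchors, grouping and OR operator described above can be tried out in a few lines of Python. This is only a sketch: Python's re module is used as a stand-in, on the assumption that these basic constructs behave the same way in preg, and the sequence is invented for the example.

```python
import re

# Hypothetical protein sequence, invented for illustration only.
seq = "MKVACDLLR"

# '^M' matches a methionine only at the start of the sequence.
starts_with_met = re.search(r"^M", seq) is not None

# 'R$' matches an arginine only at the end of the sequence.
ends_with_arg = re.search(r"R$", seq) is not None

# '(ACD)|(VWY)' matches either 'ACD' or 'VWY'.
alt = re.search(r"(ACD)|(VWY)", seq)

# Report the match start counting from 1 (a convention chosen for this sketch).
print(starts_with_met, ends_with_arg, alt.group(0), alt.start() + 1)
```

Because there is no AND operator, requiring two patterns to be present would need two separate searches.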

The following quantifier characters specify the number of times that the preceding character (in this case 'x') matches:

x?
matches 0 or 1 times (ie, '' or 'x')
x*
matches 0 or more times (ie, '' or 'x' or 'xx' or 'xxx', etc)
x+
matches 1 or more times (ie, 'x' or 'xx' or 'xxx', etc)
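
As a rough illustration of the three quantifiers, the sketch below checks which of a few short strings each pattern accepts in full. It uses Python's re module as a stand-in for the regular expression library in preg (an assumption), and the helper name accepted is hypothetical.

```python
import re

# Candidate strings that differ only in the number of 'A' residues.
candidates = ["GG", "GAG", "GAAG"]

def accepted(pattern):
    """Return the candidates that the pattern matches from start to end."""
    return [s for s in candidates if re.fullmatch(pattern, s)]

zero_or_one = accepted(r"GA?G")   # 'A' absent or present once
zero_or_more = accepted(r"GA*G")  # any number of 'A', including none
one_or_more = accepted(r"GA+G")   # at least one 'A'

print(zero_or_one, zero_or_more, one_or_more)
```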

Quantifiers can follow any of the following types of character specification:

x
any literal character (eg 'A')
\x
the character after the backslash is used instead of its normal regular expression meaning. This is commonly used to turn off the special meaning of the characters '^$()|?*+[]-.'. It may be especially useful when searching for gap characters in a sequence (eg '\.' matches only a dot character '.')
[xy]
match one of the characters 'x' or 'y'. You may have one or more characters in this set.
[x-z]
match any one of the set of characters starting with 'x' and ending in 'z' in ASCII order (eg '[A-G]' matches any one of: 'A', 'B', 'C', 'D', 'E', 'F', 'G')
[^x-z]
matches any one character that is not in the range 'x' to 'z' in ASCII order (eg '[^A-G]' matches anything EXCEPT any one of: 'A', 'B', 'C', 'D', 'E', 'F', 'G')
.
the dot character matches any single character (eg 'A.G' matches 'AAG', 'AaG', 'AZG', 'A-G', 'A G', etc.)
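
The character specifications above can be compared side by side on one short sequence. This is a minimal sketch using Python's re module as a stand-in (an assumption about how closely it mirrors preg). The sequence, including its gap characters, and the helper name hits are invented for the example.

```python
import re

def hits(pattern, text):
    """Return the text of every non-overlapping match, left to right."""
    return [m.group(0) for m in re.finditer(pattern, text)]

# Hypothetical sequence containing the gap characters '.' and '-'.
seq = "AAG.AZG-ADG"

in_set = hits(r"A[AC]G", seq)        # middle residue is 'A' or 'C'
in_range = hits(r"A[A-G]G", seq)     # middle residue is anywhere in 'A'..'G'
not_in_range = hits(r"A[^A-G]G", seq)  # middle residue is outside 'A'..'G'
any_char = hits(r"A.G", seq)         # middle character can be anything
literal_dot = hits(r"\.", seq)       # escaped dot matches only a '.' gap

print(in_set, in_range, not_in_range, any_char, literal_dot)
```

Note that an unescaped '.' would match every character in the sequence, which is why the backslash matters when searching for gaps.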

Combining some of these features gives these examples from the PROSITE patterns database:

'[STAGCN][RKH][LIVMAFY]$'

which is the 'Microbodies C-terminal targeting signal'.

'LP.TG[STGAVDE]'

which is the 'Gram-positive cocci surface proteins anchoring hexapeptide'.
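
To see how the two PROSITE patterns behave, the sketch below applies them to short sequences invented for this example (they are not real database entries). Python's re module is assumed to treat these constructs the same way as preg.

```python
import re

# Patterns quoted from the text above.
MICROBODY_SIGNAL = r"[STAGCN][RKH][LIVMAFY]$"
ANCHOR_HEXAPEPTIDE = r"LP.TG[STGAVDE]"

# Hypothetical sequences.
c_terminal_hit = "MAAGTLSKL"   # ends in S-K-L
internal_only = "MSKLAAGTW"    # S-K-L occurs, but not at the end
anchor_seq = "AKLPETGEEN"

hit = re.search(MICROBODY_SIGNAL, c_terminal_hit)
miss = re.search(MICROBODY_SIGNAL, internal_only)
anchor = re.search(ANCHOR_HEXAPEPTIDE, anchor_seq)

print(hit.group(0))                         # tripeptide found at the end
print(miss)                                 # None: '$' rejects internal matches
print(anchor.group(0), anchor.start() + 1)  # match text and start, counting from 1
```

The trailing '$' is what restricts the first pattern to the C-terminus. Without it, the internal S-K-L in the second sequence would also match.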

Regular expressions are case-sensitive. The pattern 'AAAA' will not match the sequence 'aaaa'.
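
The effect of case sensitivity can be sketched as follows, again using Python's re module as a stand-in and an invented lower-case sequence. Converting the sequence to upper case before searching is one way around the problem; the -supper1 qualifier listed under Command line arguments appears to serve this purpose in preg, although that reading is an assumption.

```python
import re

pattern = "AAAA"
lower_seq = "ccaaaacc"  # hypothetical lower-case sequence

# The upper-case pattern does not match lower-case residues.
direct = re.search(pattern, lower_seq)

# After converting the sequence to upper case the pattern matches.
converted = re.search(pattern, lower_seq.upper())

print(direct is None, converted.group(0), converted.start() + 1)
```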

Usage

Here is a sample session with preg


% preg 
Regular expression search of a protein sequence
Input sequence(s): tsw:*_rat
Regular expression pattern: IA[QWF]A
Output file [100k_rat.preg]: 


Command line arguments

   Mandatory qualifiers:
  [-sequence]          seqall     Sequence database USA
  [-pattern]           regexp     Regular expression pattern
  [-outfile]           outfile    Output file name

   Optional qualifiers: (none)
   Advanced qualifiers: (none)
   Associated qualifiers:
  "-sequence" related qualifiers
   -sbegin1             integer    First base used
   -send1               integer    Last base used, def=seq length
   -sreverse1           boolean    Reverse (if DNA)
   -sask1               boolean    Ask for begin/end/reverse
   -snucleotide1        boolean    Sequence is nucleotide
   -sprotein1           boolean    Sequence is protein
   -slower1             boolean    Make lower case
   -supper1             boolean    Make upper case
   -sformat1            string     Input sequence format
   -sopenfile1          string     Input filename
   -sdbname1            string     Database name
   -sid1                string     Entryname
   -ufo1                string     UFO features
   -fformat1            string     Features format
   -fopenfile1          string     Features file name
  "-outfile" related qualifiers
   -odirectory3         string     Output directory

   General qualifiers:
   -auto                boolean    Turn off prompts
   -stdout              boolean    Write standard output
   -filter              boolean    Read standard input, write standard output
   -options             boolean    Prompt for required and optional values
   -debug               boolean    Write debug output to program.dbg
   -acdlog              boolean    Write ACD processing log to program.acdlog
   -acdpretty           boolean    Rewrite ACD file as program.acdpretty
   -acdtable            boolean    Write HTML table of options
   -verbose             boolean    Report some/full command line options
   -help                boolean    Report command line options. More
                                  information on associated and general
                                  qualifiers can be found with -help -verbose
   -warning             boolean    Report warnings
   -error               boolean    Report errors
   -fatal               boolean    Report fatal errors
   -die                 boolean    Report deaths


Mandatory qualifiers                                   Allowed values                              Default
[-sequence] (Parameter 1)  Sequence database USA       Readable sequence(s)                        Required
[-pattern] (Parameter 2)   Regular expression pattern  Any regular expression pattern is accepted  Required
[-outfile] (Parameter 3)   Output file name            Output file                                 <sequence>.preg

Optional qualifiers                                    Allowed values                              Default
(none)

Advanced qualifiers                                    Allowed values                              Default
(none)
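
When preg is called from a script rather than interactively, the three mandatory qualifiers can be given on the command line together with -auto to turn off prompts. The sketch below only builds and prints such a command; it does not run preg. The helper name build_preg_command is hypothetical, and the values are taken from the sample session.

```python
import shlex

def build_preg_command(sequence, pattern, outfile):
    """Return an argument list for a non-interactive preg run (hypothetical helper)."""
    return [
        "preg",
        "-sequence", sequence,
        "-pattern", pattern,
        "-outfile", outfile,
        "-auto",  # turn off prompts
    ]

cmd = build_preg_command("tsw:*_rat", "IA[QWF]A", "100k_rat.preg")

# shlex.join quotes arguments such as '*' and '[...]' that a shell would
# otherwise try to expand as file name wildcards.
print(shlex.join(cmd))
```

Passing the arguments as a list, for example to subprocess.run, avoids shell expansion altogether. The quoted form is only needed when the command is typed or pasted into a shell.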

Input file format

preg reads any protein sequence USA.

Input files for usage example

'tsw:*_rat' uses the wildcard '*' to select every entry in the example protein database 'tsw' whose name ends in '_rat'

Output file format

Output files for usage example

File: 100k_rat.preg

preg search of tsw:*_rat with pattern IA[QWF]A
Matches in 100K_RAT
       100K_RAT   390 IAQA
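
A report like the one above is simple to read back into a script. The sketch below assumes that every hit line has exactly three whitespace-separated fields (entry name, position, matched text), as in the sample; the real layout may differ in other cases. The function name parse_preg_report is hypothetical.

```python
def parse_preg_report(text):
    """Return (entry name, position, matched text) for each hit line."""
    hits = []
    for line in text.splitlines():
        fields = line.split()
        # Hit lines are assumed to have three fields with an integer
        # position in the middle; header lines do not fit that shape.
        if len(fields) == 3 and fields[1].isdigit():
            hits.append((fields[0], int(fields[1]), fields[2]))
    return hits

# The sample output file from above.
report = """preg search of tsw:*_rat with pattern IA[QWF]A
Matches in 100K_RAT
       100K_RAT   390 IAQA
"""

parsed = parse_preg_report(report)
print(parsed)
```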

Data files

None.

Notes

None.

References

None.

Warnings

Regular expressions are case-sensitive. The pattern 'AAAA' will not match the sequence 'aaaa'.

Diagnostic Error Messages

None.

Exit status

It always exits with a status of 0.

Known bugs

None.

See also

Program name    Description
antigenic       Finds antigenic sites in proteins
digest          Protein proteolytic enzyme or reagent cleavage digest
fuzzpro         Protein pattern search
fuzztran        Protein pattern search after translation
helixturnhelix  Report nucleic acid binding motifs
oddcomp         Finds protein sequence regions with a biased composition
patmatdb        Search a protein sequence with a motif
patmatmotifs    Search a PROSITE motif database with a protein sequence
pepcoil         Predicts coiled coil regions
pestfind        Finds PEST motifs as potential proteolytic cleavage sites
pscan           Scans proteins using PRINTS
sigcleave       Reports protein signal cleavage sites

Other EMBOSS programs, such as fuzzpro and fuzztran listed above, allow you to search for simple patterns and may be easier for users who have never used regular expressions before.

Author(s)

Peter Rice (pmr © ebi.ac.uk)
Informatics Division, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK

History

Written (1999) - Peter Rice

Target users

This program is intended to be used by everyone and everything, from naive users to embedded scripts.

Comments