lf374, SoftwareDevelopment: LINUX & PERL, computer tools for the study and analysis of biological information

This document is available in: English Castellano ChineseGB Deutsch Francais

by Carlos Andrés Pérez
<caperez /at/ usc.edu.co>

About the author:

Carlos Andrés Pérez is a specialist in molecular simulation and a doctoral candidate in biotechnology. He is technical advisor to the Grupo de Investigación en Educación Virtual (GIEV), the Research Group in Virtual Learning. Address: Universidad Santiago de Cali, Calle 5ª carrera 62 Campus Pampalinda, Cali – Colombia.

Contents:

Bioinformatics
Perl
Handling files with Perl
Searching for amino acid patterns
Calculating amino acid frequencies
Bibliographic References

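LINUX & PERL, computer tools for the study and analysis of biological information

Abstract:

This article shows some of the advantages of programming in Perl under Unix for extracting biological information from DNA, RNA and protein sequence databases. Such Perl programs can be used for sequence comparison and analysis. The Human Genome Project and DNA cloning techniques have accelerated the progress of this science. The information generated every day in this field far outgrows our capacity to process it.

The rapid growth of biological information on different genomes (the complete set of genes of an organism) is making bioinformatics a fundamental discipline for handling and analyzing these data.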

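_________________ _________________ _________________

Bioinformatics

Bioinformatics began when scientists started to store biological sequences in digital form and the first programs to compare them appeared. For a long time bioinformatics was limited to sequence analysis. Then the importance of structural models of molecules became apparent, and computers became an essential tool in theoretical biochemistry as well. Every day more data on the 3D structure of molecules is collected, and the study of genes has moved from examining individual genes to studying them as a whole or in large groups. Bioinformatics makes it easier to understand how genes and proteins interact with each other and how they are organized into metabolic pathways. We are also increasingly aware of how important it is to organize these data.

Bioinformatics is interesting for two reasons. On one hand, its research goal is to find the relationships among biological molecules. On the other hand, reaching that goal is an interesting programming problem, because the information obtained has to be combined and integrated to give a global and effective view of the underlying biological processes. Knowledge from different areas of computer science is essential here: database management and data integration, efficient and reliable algorithms, and powerful hardware such as grids and multiprocessor machines.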

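Perl

Larry Wall began developing Perl in 1986. Perl is an interpreted language and a powerful tool for handling text, files and processes. It lets us develop small programs quickly. Perl can be described as an effective combination of a high-level language (such as C) and a scripting language (such as bash).

Perl runs on many operating systems, but it was born and grew up rapidly on Unix. Driven largely by its wide use in web development, it grew faster than anyone expected. Before Perl, awk, sed and grep were the usual tools for extracting information from files.

Perl unified the capabilities of these widely used UNIX tools in a single program, extending and modernizing each of them.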
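As a rough illustration of that unification, the sketch below does a grep-like filter, a sed-like substitution and an awk-like field selection in one short Perl function. The sample lines and the function name summarize are made up for this example; they do not come from the article's data files.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# summarize() is a hypothetical helper: it keeps the human entries,
# abbreviates the organism name and returns "name<TAB>length" strings.
sub summarize {
    my @lines = @_;
    my @report;
    for my $line (@lines) {
        # grep-like: skip every line that does not match the pattern
        next unless $line =~ /_HUMAN/;
        # sed-like: substitute text in a copy of the line
        (my $copy = $line) =~ s/Homo sapiens/H. sapiens/;
        # awk-like: split into fields separated by two or more spaces
        my @f = split /\s{2,}/, $copy;
        push @report, "$f[0]\t$f[-1]";
    }
    return @report;
}

# Made-up sample data: entry name, organism, sequence length
my @sample = (
    "KPCA_HUMAN  Homo sapiens  672",
    "KPCA_MOUSE  Mus musculus  672",
    "CDK2_HUMAN  Homo sapiens  298",
);

print "$_\n" for summarize(@sample);
```

Only the two _HUMAN lines survive the filter, and each is reduced to its first and last field.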

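Perl is a free programming language and runs on the operating systems commonly found in biology research laboratories. On UNIX and MacOSX it comes pre-installed; on other systems it has to be installed first. The site http://www.cpan.org has plenty of practical information on installing and using Perl.

Under Linux, a Perl program is run by passing the name of the file that contains it as an argument to perl, which then interprets and executes it.

Another common way avoids calling perl explicitly. It requires two steps: (a) put a special comment on the first line of the program file: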

#!/usr/bin/env perl

print "Hi\n";

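(b) Save the file and make it executable:

% chmod +x greetings.pl

After this the program can be run directly by its file name:

% ./greetings.pl

Handling files with Perl:

When we have molecular sequences in text format, we can write a search tool in Perl. The following example looks up a protein sequence by its ID code in a database in SWISS-PROT format (db_human_swissprot).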

#!/usr/bin/perl
# Look up an amino acid sequence in a SWISS-PROT formatted
# database, given the ID code of the entry
use strict;
use warnings;

# Ask for the code in the ID field and read it from standard input
print "Enter the ID to search: ";
my $id_query = <STDIN>;
chomp $id_query;

# Open the database file; if that is not possible the program ends
open(my $db, '<', "human_kinases_swissprot.txt")
    or die "problem opening the file human_kinases_swissprot.txt\n";

# Mark: 0 = entry not found yet, 1 = ID matched, 2 = inside its sequence
my $signal_good = 0;

# Read the database line by line
while (<$db>) {
    chomp;
    if (/^ID/) {
        # On an ID line, split on whitespace to get the entry ID
        my (undef, $id_db) = split /\s+/;
        # If the ID does not match, continue with the next line
        next if $id_db ne $id_query;
        # When they match, set the mark
        $signal_good = 1;
    } elsif (/^SQ/ && $signal_good == 1) {
        # SQ header of the chosen entry: change the mark to 2
        # so that the following lines are collected
        $signal_good = 2;
    } elsif ($signal_good == 2) {
        # Print each line of the sequence until the line that
        # begins with //; at that point leave the loop
        last if m{^//};
        print "$_\n";
    }
}

# After the loop, check the mark: if it is still 0,
# the chosen sequence was not found
if (!$signal_good) {
    print "ERROR: Sequence not found\n";
}

# Finally, close the file, which is still open
close($db);
exit;
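The lookup above reads the database one line at a time and tracks its state with a mark. A different technique, sketched here as an alternative rather than as the article's method, is to read one whole entry at a time by setting Perl's input record separator $/ to the // terminator line. The function name find_sequence and the two shortened sample entries are invented for illustration; real SWISS-PROT entries contain many more line types.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return the sequence of the entry whose ID matches, or undef.
# $fh is any filehandle opened on SWISS-PROT formatted text.
sub find_sequence {
    my ($fh, $id_query) = @_;
    local $/ = "//\n";                 # read one whole entry per <$fh>
    while (my $entry = <$fh>) {
        # The entry ID is the second field of the ID line
        next unless $entry =~ /^ID\s+(\S+)/m && $1 eq $id_query;
        # Capture everything between the SQ header line and the closing //
        my ($seq) = $entry =~ m{^SQ[^\n]*\n(.*?)^//}ms;
        return unless defined $seq;
        $seq =~ s/\s+//g;              # drop spaces and newlines
        return $seq;
    }
    return;                            # no entry with that ID
}

# Invented, heavily shortened sample entries
my $sample = <<'END';
ID   TEST1_HUMAN   STANDARD;   PRT;   10 AA.
SQ   SEQUENCE   10 AA;
     MKVLA AGIVG
//
ID   TEST2_HUMAN   STANDARD;   PRT;   8 AA.
SQ   SEQUENCE   8 AA;
     MSTNP KPQ
//
END

# Open a filehandle on the in-memory string instead of a real file
open(my $fh, '<', \$sample) or die "cannot open in-memory sample\n";
my $seq = find_sequence($fh, 'TEST2_HUMAN');
print defined $seq ? "$seq\n" : "ERROR: Sequence not found\n";
close($fh);
```

Because each read returns a complete entry, no mark variable is needed; the cost is that a whole entry is held in memory at once.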

Searching for amino acid patterns

#!/usr/bin/perl
# Searches a sequence for an amino acid pattern
use strict;
use warnings;

# Ask the user for the pattern to search for
print "Please, introduce the pattern to search in query_seq.txt: ";
my $patron = <STDIN>;
chomp $patron;

# Open the sequence file; if that is not possible the program ends
open(my $query, '<', "query_seq.txt")
    or die "problem opening the file query_seq.txt\n";

my $signal_seq      = 0;
my $secuencia_total = '';

# Read the SWISS-PROT entry line by line
while (<$query>) {
    chomp;
    if (/^SQ/) {
        # On reaching the SQ field, set the mark to 1
        $signal_seq = 1;
    } elsif (m{^//}) {
        # End of the sequence: leave the loop. This test must come
        # before the mark == 1 test, because the // line does not
        # belong to the amino acid sequence
        last;
    } elsif ($signal_seq == 1) {
        # Inside the sequence: remove the blank spaces and append
        # the line to the collected sequence. The concatenation can
        # also be written: $secuencia_total = $secuencia_total . $_;
        s/ //g;
        $secuencia_total .= $_;
    }
}

# Now check the sequence, collected in its entirety,
# for the given pattern
if ($secuencia_total =~ /$patron/) {
    print "The sequence query_seq.txt contains the pattern $patron\n";
} else {
    print "The sequence query_seq.txt does not contain the pattern $patron\n";
}

# Finally close the file and leave the program
close($query);
exit;
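Note that the pattern typed by the user is interpolated into the match as a regular expression, so characters such as . or {2} keep their special meaning. That is usually what is wanted for motif searches. The sketch below contrasts it with a literal search using \Q...\E; the function names and the sample sequence are assumptions made for this example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# The pattern is treated as a regular expression
sub contains_regex {
    my ($sequence, $pattern) = @_;
    return $sequence =~ /$pattern/ ? 1 : 0;
}

# \Q...\E quotes every special character, so the pattern
# is searched for as literal text
sub contains_literal {
    my ($sequence, $pattern) = @_;
    return $sequence =~ /\Q$pattern\E/ ? 1 : 0;
}

my $sequence = 'MKRKAAGIV';    # made-up sample sequence
for my $pattern ('KRK', 'K.K', 'A{2}') {
    printf "%-5s regex=%d literal=%d\n", $pattern,
        contains_regex($sequence, $pattern),
        contains_literal($sequence, $pattern);
}
```

Here KRK is found either way, while K.K and A{2} are found only when the pattern is read as a regular expression.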

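To find out the exact position of the pattern in the sequence, we must use the special variable $&, which holds the matched text after a regular expression has been evaluated (it has to be read after the line if ($secuencia_total =~ /$patron/) {). It can be combined with the variables $` and $', which hold everything to the left and to the right of the match. With these variables the previous program can be modified to report the exact position of the pattern. Note: the length function, which returns the length of a string, is also very useful here.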

# Only the if that tests for the pattern needs to change.
# Now check the sequence, collected in its entirety,
# for the given pattern
# and report its position in the sequence
if ($secuencia_total =~ /$patron/) {
    # $` holds everything to the left of the match
    my $posicion = length($`) + 1;
    print "The sequence query_seq.txt contains the pattern $patron in the following position $posicion\n";
} else {
    print "The sequence query_seq.txt does not contain the pattern $patron\n";
}
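The fragment above reports only the first occurrence. As a possible extension, the sketch below collects every non-overlapping occurrence by matching with the /g modifier in a loop and reading the start offset from the built-in array @-, which avoids $` altogether (in Perl versions before 5.20, using $`, $& or $' slowed down every match in the program). The function name pattern_positions and the sample sequence are invented for this example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return the 1-based start positions of all non-overlapping
# matches of $pattern in $sequence
sub pattern_positions {
    my ($sequence, $pattern) = @_;
    my @positions;
    while ($sequence =~ /$pattern/g) {
        # $-[0] is the 0-based offset where the last match started
        push @positions, $-[0] + 1;
    }
    return @positions;
}

my @found = pattern_positions('MKRKAAKRK', 'KRK');
print "Pattern found at positions: @found\n";   # prints: 2 7
```

Overlapping occurrences are not reported, because each search resumes where the previous match ended.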

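Calculating amino acid frequencies:

The frequency of a given amino acid differs from protein to protein, because proteins work in different environments and have different functions. The following example shows how to calculate the amino acid frequencies of a given sequence.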

#!/usr/bin/perl
# Calculates the frequency of each amino acid in a protein sequence
use strict;
use warnings;

# Get the file name (SWISS-PROT formatted) from the command line.
# It could also be asked for with print and read from <STDIN>.
if (!$ARGV[0]) {
    die "The execution line shall be: program.pl file_swissprot\n";
}
my $fichero = $ARGV[0];

# Initialize the error counter and the frequency array (one slot
# per amino acid, in alphabetical order of the one-letter codes)
my $errores = 0;
my @aa = (0) x 20;

# Open the file for reading
open(my $ficha, '<', $fichero) or die "problem opening the file $fichero\n";

# First collect the sequence, as in example 2
my $signal_good = 0;
my $secuencia   = '';
while (<$ficha>) {
    chomp;
    if (/^SQ/) {
        $signal_good = 1;
    } elsif ($signal_good == 1) {
        last if m{^//};
        s/\s//g;
        $secuencia .= $_;
    }
}
close($ficha);

# Now count every amino acid of the sequence (in a function of
# its own, so that it can be reused in other programs)
comprueba_aa($secuencia);

# Print the results to the screen:
# first the 20 amino acids and then the array with their frequencies.
# 'sort' must not be used in this foreach, because the array
# holds the frequencies (numbers) in a fixed order
print "A\tC\tD\tE\tF\tG\tH\tI\tK\tL\tM\tN\tP\tQ\tR\tS\tT\tV\tW\tY\n";
foreach my $each_aa (@aa) {
    print "$each_aa\t";
}

# Then report the number of errors and end the program
print "\nerrores = $errores\n";
exit;

# Functions

# Calculates the frequency of each amino acid in a protein sequence
sub comprueba_aa {
    # Get the sequence
    my ($secuencia) = @_;
    # Walk through it amino acid by amino acid,
    # from 0 up to the sequence length
    for (my $posicion = 0; $posicion < length $secuencia; $posicion++) {
        # Get the amino acid
        my $aa = substr($secuencia, $posicion, 1);
        # Find out which one it is and add 1 to the corresponding
        # slot of the array, indexed in alphabetical order
        if    ($aa eq 'A') { $aa[0]++; }
        elsif ($aa eq 'C') { $aa[1]++; }
        elsif ($aa eq 'D') { $aa[2]++; }
        elsif ($aa eq 'E') { $aa[3]++; }
        elsif ($aa eq 'F') { $aa[4]++; }
        elsif ($aa eq 'G') { $aa[5]++; }
        elsif ($aa eq 'H') { $aa[6]++; }
        elsif ($aa eq 'I') { $aa[7]++; }
        elsif ($aa eq 'K') { $aa[8]++; }
        elsif ($aa eq 'L') { $aa[9]++; }
        elsif ($aa eq 'M') { $aa[10]++; }
        elsif ($aa eq 'N') { $aa[11]++; }
        elsif ($aa eq 'P') { $aa[12]++; }
        elsif ($aa eq 'Q') { $aa[13]++; }
        elsif ($aa eq 'R') { $aa[14]++; }
        elsif ($aa eq 'S') { $aa[15]++; }
        elsif ($aa eq 'T') { $aa[16]++; }
        elsif ($aa eq 'V') { $aa[17]++; }
        elsif ($aa eq 'W') { $aa[18]++; }
        elsif ($aa eq 'Y') { $aa[19]++; }
        else {
            # If the amino acid is not recognized, add 1 to the errors
            print "ERROR: Aminoacid not found: $aa\n";
            $errores++;
        }
    }
    # Finally return the frequency array
    return @aa;
}
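The long if/elsif chain can be shortened by looking up each letter's position in a string of the 20 one-letter codes with the built-in index function. The sketch below is one possible variant, not the article's code: the names $ALPHABET and aa_counts and the sample sequence are assumptions, and it also prints each count as a percentage of the recognized residues.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# The 20 one-letter codes in the same alphabetical order as the listing above
my $ALPHABET = 'ACDEFGHIKLMNPQRSTVWY';

# Return a reference to the array of counts and the number of
# characters that are not one of the 20 codes
sub aa_counts {
    my ($sequence) = @_;
    my @counts = (0) x length $ALPHABET;
    my $errors = 0;
    for my $aa (split //, $sequence) {
        my $i = index($ALPHABET, $aa);   # -1 when the letter is unknown
        if ($i >= 0) { $counts[$i]++ } else { $errors++ }
    }
    return (\@counts, $errors);
}

# Made-up sample sequence; X is not a standard amino acid code
my ($counts, $errors) = aa_counts('MKVLAAGIVX');
my $total = 0;
$total += $_ for @$counts;

for my $i (0 .. length($ALPHABET) - 1) {
    next unless $counts->[$i];
    printf "%s\t%d\t%.1f%%\n", substr($ALPHABET, $i, 1),
        $counts->[$i], 100 * $counts->[$i] / $total;
}
print "errors = $errors\n";
```

Keeping the alphabet in a single string means the column header and the counting logic cannot drift out of step.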

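We now take the next step in the flow of genetic information in the cell. After transcription has copied the information of a DNA gene into RNA, translation turns it into a protein, that is, an amino acid sequence. For this we need the genetic code, which maps each RNA/DNA triplet (codon) to one amino acid. We will extract the sequence of an Escherichia coli gene from an entry in EMBL (European Molecular Biology Laboratory) format, and afterwards check our translation against the one already given in the entry. This example requires associative arrays, also called hash tables.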

#!/usr/bin/perl
# Translates the DNA sequence of an EMBL entry
# into the corresponding amino acid sequence
use strict;
use warnings;

# Get the file name (EMBL formatted) from the command line.
# It could also be asked for with print and read from <STDIN>.
if (!$ARGV[0]) {
    die "The program line shall be: program.pl ficha_embl\n";
}
my $fichero = $ARGV[0];

# Open the file for reading
open(my $ficha, '<', $fichero) or die "problem opening the file $fichero\n";

# First collect the sequence, as in example 2
my ($a3, $a4);          # start and end of the coding region
my $signal_good = 0;
my $secuencia   = '';
while (<$ficha>) {
    chomp;
    if (/^FT\s+CDS/) {
        # Coding region given as start..end: turn the dots into
        # spaces so that split yields the two positions
        tr/./ /;
        (undef, undef, $a3, $a4) = split " ", $_;
    } elsif (/^SQ/) {
        $signal_good = 1;
    } elsif ($signal_good == 1) {
        last if m{^//};
        # Eliminate numbers and spaces
        tr/0-9/ /;
        s/\s//g;
        $secuencia .= $_;
    }
}
close($ficha);
die "no CDS feature found in $fichero\n" unless defined $a3 && defined $a4;

# Associative array with the correspondence between every codon
# and its amino acid (used through a function of its own, in case
# the same genetic code is needed in another program)
my %codigo_genetico = (
    'TCA' => 'S',    # Serine
    'TCC' => 'S',    # Serine
    'TCG' => 'S',    # Serine
    'TCT' => 'S',    # Serine
    'TTC' => 'F',    # Phenylalanine
    'TTT' => 'F',    # Phenylalanine
    'TTA' => 'L',    # Leucine
    'TTG' => 'L',    # Leucine
    'TAC' => 'Y',    # Tyrosine
    'TAT' => 'Y',    # Tyrosine
    'TAA' => '*',    # Stop
    'TAG' => '*',    # Stop
    'TGC' => 'C',    # Cysteine
    'TGT' => 'C',    # Cysteine
    'TGA' => '*',    # Stop
    'TGG' => 'W',    # Tryptophan
    'CTA' => 'L',    # Leucine
    'CTC' => 'L',    # Leucine
    'CTG' => 'L',    # Leucine
    'CTT' => 'L',    # Leucine
    'CCA' => 'P',    # Proline
    'CCC' => 'P',    # Proline
    'CCG' => 'P',    # Proline
    'CCT' => 'P',    # Proline
    'CAC' => 'H',    # Histidine
    'CAT' => 'H',    # Histidine
    'CAA' => 'Q',    # Glutamine
    'CAG' => 'Q',    # Glutamine
    'CGA' => 'R',    # Arginine
    'CGC' => 'R',    # Arginine
    'CGG' => 'R',    # Arginine
    'CGT' => 'R',    # Arginine
    'ATA' => 'I',    # Isoleucine
    'ATC' => 'I',    # Isoleucine
    'ATT' => 'I',    # Isoleucine
    'ATG' => 'M',    # Methionine
    'ACA' => 'T',    # Threonine
    'ACC' => 'T',    # Threonine
    'ACG' => 'T',    # Threonine
    'ACT' => 'T',    # Threonine
    'AAC' => 'N',    # Asparagine
    'AAT' => 'N',    # Asparagine
    'AAA' => 'K',    # Lysine
    'AAG' => 'K',    # Lysine
    'AGC' => 'S',    # Serine
    'AGT' => 'S',    # Serine
    'AGA' => 'R',    # Arginine
    'AGG' => 'R',    # Arginine
    'GTA' => 'V',    # Valine
    'GTC' => 'V',    # Valine
    'GTG' => 'V',    # Valine
    'GTT' => 'V',    # Valine
    'GCA' => 'A',    # Alanine
    'GCC' => 'A',    # Alanine
    'GCG' => 'A',    # Alanine
    'GCT' => 'A',    # Alanine
    'GAC' => 'D',    # Aspartic acid
    'GAT' => 'D',    # Aspartic acid
    'GAA' => 'E',    # Glutamic acid
    'GAG' => 'E',    # Glutamic acid
    'GGA' => 'G',    # Glycine
    'GGC' => 'G',    # Glycine
    'GGG' => 'G',    # Glycine
    'GGT' => 'G',    # Glycine
);

# Translate every codon of the coding region into its amino acid
# and append it to the protein sequence. The loop stops before the
# last codon of the region, which is the stop codon.
my $protein = '';
for (my $i = $a3 - 1; $i < $a4 - 3; $i += 3) {
    my $codon = substr($secuencia, $i, 3);
    # Convert the codon from lowercase (EMBL format) to uppercase
    $codon =~ tr/a-z/A-Z/;
    $protein .= codon2aa($codon);
}
print "The protein sequence of the gene:\n$secuencia\nis the following:\n$protein\n\n";
exit;

# Functions

# Returns the amino acid for one codon; the program ends
# if the codon is not in the genetic code table
sub codon2aa {
    my ($codon) = @_;
    if (exists $codigo_genetico{$codon}) {
        return $codigo_genetico{$codon};
    }
    die "Bad codon \"$codon\"\n";
}
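For comparison, the same standard genetic code can be built programmatically instead of being typed out as 64 hash entries. The following sketch assumes the usual T, C, A, G ordering of the codon table and a hypothetical translate function. Unlike the listing above, it translates a plain DNA string from its first base, includes the stop codon as *, and marks unknown codons with X; all three are choices made only for this illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Standard genetic code, one letter per codon, with the three
# codon positions each running through T, C, A, G in that order
my @BASES = qw(T C A G);
my $AMINO = 'FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG';

# Fill the hash: TTT => F, TTC => F, TTA => L, ... GGG => G
my %CODON_TABLE;
my $n = 0;
for my $b1 (@BASES) {
    for my $b2 (@BASES) {
        for my $b3 (@BASES) {
            $CODON_TABLE{"$b1$b2$b3"} = substr($AMINO, $n++, 1);
        }
    }
}

# Translate a DNA string codon by codon, starting at its first base
sub translate {
    my ($dna) = @_;
    my $protein = '';
    # unpack '(A3)*' cuts the string into pieces of three characters
    for my $codon (unpack '(A3)*', uc $dna) {
        last if length $codon < 3;                  # ignore a trailing partial codon
        $protein .= $CODON_TABLE{$codon} // 'X';    # X marks an unknown codon
    }
    return $protein;
}

print translate('atggcctaa'), "\n";   # prints MA*
```

The hash lookup itself is the same idea as codon2aa in the program above; only the way the table is filled differs.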

Bibliographic References

http://bioperl.org/

http://changjiang.whlib.ac.cn/pylorus/download/book/Beginning%20Perl%20for%20Bioinformatics/contents.html

http://www.unix.org.ua/orelly/perl/prog3/

Example files :
- human_kinases_swissprot.txt
- query_seq.txt
- ecoli_embl.txt

© Carlos Andrés Pérez
"some rights reserved" see linuxfocus.org/license/
http://www.LinuxFocus.org Translation information:

es --> -- : Carlos Andrés Pérez <caperez /at/ usc.edu.co>

en --> CN: <daxiawj(Q)gmail.com>

2005-05-06, generated by lfparser version 2.52
