by Carlos Andrés Pérez
<caperez /at/ usc.edu.co>

About the author:

Carlos Andrés Pérez is a specialist in molecular simulation, a doctoral candidate in biotechnology, and technical adviser to the Grupo de Investigación en Educación Virtual (GIEV), the Research Group in Virtual Learning. Address: Universidad Santiago de Cali, Calle 5ª carrera 62 Campus Pampalinda, Cali – Colombia.



LINUX & PERL, computer tools for the study and analysis of biological information


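Abstract:

This article presents some of the advantages of Perl programming under Unix for extracting information from DNA, RNA and protein sequence databases. Such Perl programs can be used for comparative processing and analysis. The Human Genome Project and DNA cloning techniques have accelerated progress in this field, and the large volume of information generated every day forces us to keep improving the way we process it.

The rapid accumulation of biological information on different genomes (a genome being the complete set of genes of an organism) is making bioinformatics a fundamental discipline for handling and analysing these data.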


 

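Bioinformatics

Bioinformatics began when scientists started to store biological sequences in digital form and the first programs to compare them appeared. For a long time bioinformatics was limited to sequence analysis. As the importance of structural models of molecules became apparent, however, computers also became an important tool in theoretical biochemistry. Every day more data on the 3D structure of molecules are collected. The study of genes has likewise moved from individual genes to whole genomes or large parts of them. Progress in bioinformatics makes it easier to understand how genes and proteins interact with one another and how they are organised into metabolic pathways. At the same time we are increasingly aware of how important it is to organise these data.

Bioinformatics has at least two aspects that make it interesting. On the one hand, its research goal is to discover the relationships between biological molecules. On the other hand, that goal is an interesting software design problem, because biological information has to be combined and integrated in order to obtain a global and effective view of the underlying biological processes. Solving it also requires combining knowledge from different areas of computer science: data management (databases), data integration, efficient and reliable algorithms, and powerful hardware such as grid computing and multiprocessor systems.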

Perl

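Larry Wall began developing Perl in 1986. Perl is an interpreted language and a powerful tool for handling text, files and processes. It lets you develop small programs very quickly. Perl can be described as an effective blend of a high-level language (such as C) and a scripting language (such as bash).

Perl programs run on many operating systems and platforms, but Perl was born and grew fastest on Unix. Driven by its wide use in web programming, it spread far beyond what was originally expected. Before Perl, people used awk, sed and grep to extract information from files.

Perl unified these widely used UNIX tools in a single programming environment and extended and modernised each of them with further functionality.

Perl is a free programming language that runs on the various operating systems commonly found in biology laboratories. On UNIX and MacOSX it comes preinstalled; on other systems Perl has to be installed first. The site http://www.cpan.org offers a great deal of practical information on installing and using Perl.

Under Linux, a Perl program is run by passing the name of the program file as an argument to the perl command; perl then interprets and executes the instructions in that file.

Another common method does not require calling perl explicitly. For this we need to do the following: (a) add a special comment as the first line of the program file: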

#!/usr/bin/env perl
print "Hi\n";

(b) save the file and give it the executable attribute:

% chmod +x greetings.pl

After this we can run the program directly by its file name:

% ./greetings.pl

 

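File handling with Perl:

When we have a database of molecular sequences in text format, we can write a sequence search tool in Perl. The following example shows how to find a protein sequence in a SWISS-PROT formatted database (db_human_swissprot) using its ID code.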

#!/usr/bin/perl
# Look for an amino acid sequence in a SWISS-PROT formatted
# database, given its ID code

# Ask for the code in the ID field
# and assign it from the input (STDIN) to a variable
print "Enter the ID to search: ";
$id_query=<STDIN>;
chomp $id_query;

# Open the database file;
# if that is not possible, the program ends
open (db, "human_kinases_swissprot.txt") ||
  die "problem opening the file human_kinases_swissprot.txt\n";

# Read the database line by line
while (<db>) {
  chomp $_;
  # Check whether we are in the ID field
  if ($_ =~ /^ID/) {
    # If so, gather the information
    # by splitting the line on whitespace
    ($a1,$id_db) = split (/\s+/,$_);
    # If the ID does not match, continue with the next line
    next if ($id_db ne $id_query);
    # When the IDs match, set a mark
    $signal_good=1;
  # Then look for the sequence field;
  # if the mark is 1 (chosen entry),
  # change the mark to 2 to collect the sequence
  } elsif (($_ =~ /^SQ/) && ($signal_good==1)) {
    $signal_good=2;
  # Finally, if the mark is 2, print each line
  # of the sequence until a line begins with //;
  # in that case leave the while loop
  } elsif ($signal_good == 2) {
    last if ($_ =~ /^\/\//);
    print "$_\n";
  }
}

# After leaving the while loop, check the mark:
# if it is unset, the chosen sequence was not found
# and we report an error
if (!$signal_good) {
  print "ERROR: "."Sequence not found\n";
}

# Finally, close the file, which is still open
close (db);
exit;
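The same state-machine idea can be packaged as a reusable function. The following is only a sketch, not the author's code: it assumes `use strict`, a lexical filehandle, and a hypothetical helper name `extract_sequence`, and it reads from a made-up two-entry sample held in memory rather than from a real SWISS-PROT file.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return the sequence of the entry whose ID
# matches $id_query as one string, or undef if no entry matches.
sub extract_sequence {
    my ($fh, $id_query) = @_;
    my $state = 0;       # 0 = searching, 1 = ID matched, 2 = inside sequence
    my $sequence = '';
    while (my $line = <$fh>) {
        chomp $line;
        if ($line =~ /^ID\s+(\S+)/) {
            last if $state;                 # next entry started, stop
            $state = 1 if $1 eq $id_query;
        } elsif ($state == 1 && $line =~ /^SQ/) {
            $state = 2;
        } elsif ($state == 2) {
            last if $line =~ m{^//};        # end-of-entry marker
            $line =~ s/\s+//g;              # drop the column spacing
            $sequence .= $line;
        }
    }
    return $state == 2 ? $sequence : undef;
}

# Made-up sample in SWISS-PROT-like layout (not real database content).
my $sample = <<'END';
ID   TEST1_HUMAN   Reviewed;   10 AA.
SQ   SEQUENCE   10 AA;
     MKTAYIAKQR
//
ID   TEST2_HUMAN   Reviewed;   12 AA.
SQ   SEQUENCE   12 AA;
     GSHMLEDPVD KL
//
END

# Open the string as an in-memory file and look up one entry.
open(my $fh, '<', \$sample) or die "cannot open in-memory sample: $!";
my $seq = extract_sequence($fh, 'TEST2_HUMAN');
close($fh);
print defined $seq ? "$seq\n" : "ERROR: Sequence not found\n";
```

Returning the sequence instead of printing it makes the function usable from other programs, for example as input to a pattern search.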

 

Search for amino acid patterns

#!/usr/bin/perl
# Search for amino acid patterns

# Ask the user for the pattern to search for
print "Please, introduce the pattern to search in query_seq.txt: ";
$patron = <STDIN>;
chomp $patron;

# Open the sequence file;
# if that is not possible, the program ends
open (query, "query_seq.txt") || die "problem opening the file query_seq.txt\n";

# Read the SWISS-PROT entry line by line
while (<query>) {
  chomp $_;
  # On reaching the SQ field, set the mark to 1
  if ($_ =~ /^SQ/) {
    $signal_seq = 1;
  # On reaching the end of the sequence, leave the loop.
  # This test must come before the test for mark == 1,
  # because this line does not belong to the amino acid sequence
  } elsif ($_ =~ /^\/\//) {
    last;
  # If the mark is 1, remove the blank spaces
  # from the sequence line and append it to a new variable.
  # The concatenation can also be written as:
  # $secuencia_total.=$_;
  } elsif ($signal_seq == 1) {
    $_ =~ s/ //g;
    $secuencia_total=$secuencia_total.$_;
  }
}

# Now check the sequence, collected in its entirety,
# for the given pattern
if ($secuencia_total =~ /$patron/) {
  print "The sequence query_seq.txt contains the pattern $patron\n";
} else {
  print "The sequence query_seq.txt doesn't contain the pattern $patron\n";
}

# Finally, close the file
# and leave the program
close (query);
exit;
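Because the program interpolates the user's input directly into a regular expression, characters such as `.` or `[` are treated as regex syntax. The sketch below illustrates the difference between a regex match and a literal match using Perl's `\Q...\E` quoting; the helper name `contains_pattern` and the sequence are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: report whether $pattern occurs in $sequence.
# When $literal is true, regex metacharacters in $pattern are escaped
# with \Q...\E so the pattern is matched character for character.
sub contains_pattern {
    my ($sequence, $pattern, $literal) = @_;
    my $re = $literal ? qr/\Q$pattern\E/ : qr/$pattern/;
    return $sequence =~ $re ? 1 : 0;
}

my $sequence = 'MKTAYIAKQRGSHMLEDPVDKL';   # made-up sequence

# As a regex, '.' stands for any residue, so A.I matches "AYI".
print contains_pattern($sequence, 'A.I', 0), "\n";
# As a literal, the sequence would need to contain an actual dot.
print contains_pattern($sequence, 'A.I', 1), "\n";
# A character class: K followed by Q or R, then R.
print contains_pattern($sequence, 'K[QR]R', 0), "\n";
```

For motif searches the regex behaviour is usually what is wanted; note that an invalid expression typed by the user makes the match die at run time.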

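To find the exact position of the pattern in the sequence, we must use the special variable `$&', which holds the text matched by the most recent regular expression (so it should be read right after the line `if ($secuencia_total =~ /$patron/)'). It can be combined with the variables `$`' and `$'', which store everything to the left and to the right of the match. By modifying the previous program to use these variables, we can report the exact position of the pattern. Note: the function length, which returns the length of a string, is also very useful here.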

 

# We only need to change the if where the pattern is tested.
# Now check the sequence, collected in its entirety,
# for the given pattern
# and work out its position in the sequence
if ($secuencia_total =~ /$patron/) {
  # $` holds the text before the match, so its length + 1
  # is the 1-based position where the match starts
  $posicion=length($`)+1;
  print "The sequence query_seq.txt contains the pattern $patron in the following position $posicion\n";
} else {
  print "The sequence query_seq.txt doesn't contain the pattern $patron\n";
}
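The snippet above reports only the first occurrence. As a possible extension, the sketch below collects the start position of every non-overlapping match by repeating the match with the `/g` modifier and reading the built-in array `@-`, whose first element holds the 0-based start of the last match. The helper name `match_positions` and the sequence are assumptions for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: return the 1-based start positions of every
# non-overlapping match of $pattern in $sequence.
sub match_positions {
    my ($sequence, $pattern) = @_;
    my @positions;
    while ($sequence =~ /$pattern/g) {
        push @positions, $-[0] + 1;   # $-[0] is the 0-based match start
    }
    return @positions;
}

my $sequence = 'MKTAYIAKQRAKL';        # made-up sequence
my @hits = match_positions($sequence, 'AK');
print "Pattern AK found at: @hits\n";
```

Using `@-` instead of `$`' avoids copying the text before each match, which can matter for long sequences.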
 

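Calculating amino acid frequencies:

The frequency with which particular amino acids occur differs between proteins, because proteins exist in different environments and perform different functions. The following example shows how to calculate the frequency of each amino acid in a given amino acid sequence.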


#!/usr/bin/perl
# Calculates the frequency of each amino acid in a protein sequence

# Get the file name (SWISS-PROT formatted) from the command line.
# It could also be requested with print and read from <STDIN>.
if (!$ARGV[0]) {
  print "The execution line shall be: program.pl file_swissprot\n";
  exit;
}
$fichero = $ARGV[0];

# Initialize the error counter and the 20 frequency counters
my $errores=0;
@aa = (0) x 20;

# Open the file for reading
open (FICHA, "$fichero") || die "problem opening the file $fichero\n";

# First collect the sequence, as in example 2
while (<FICHA>) {
  chomp $_;
  if ($_ =~ /^SQ/) {
    $signal_good = 1;
  } elsif ($signal_good == 1) {
    last if ($_ =~ /^\/\//);
    $_ =~ s/\s//g;
    $secuencia.=$_;
  }
}
close (FICHA);

# Now use a loop that checks every position of the sequence
# (in a function of its own, which can be reused in other programs)
comprueba_aa ($secuencia);

# Print the results to the screen:
# first the 20 amino acids, then the array with their frequencies.
# 'sort' can't be used in the foreach here,
# because the array contains the frequencies (numbers)
print"A\tC\tD\tE\tF\tG\tH\tI\tK\tL\tM\tN\tP\tQ\tR\tS\tT\tV\tW\tY\n";
foreach $each_aa (@aa) {
  print "$each_aa\t";
}

# Then report the number of errors
# and end the program
print "\nerrores = $errores\n";
exit;

# Functions

# This one calculates the frequency of each amino acid
# in a protein sequence
sub comprueba_aa {
  # Get the sequence
  my ($secuencia)=@_;
  # and walk through it amino acid by amino acid, with a for loop
  # running from 0 to the sequence length
  for ($posicion=0 ; $posicion<length($secuencia) ; $posicion++ ) {
    # Get the amino acid
    $aa = substr($secuencia, $posicion, 1);
    # and check which one it is with if/elsif;
    # once identified, add 1 to the corresponding frequency
    # in an array with one index per amino acid,
    # ordered alphabetically by one-letter code
    if    ( $aa eq 'A' ) { $aa[0]++; }
    elsif ( $aa eq 'C' ) { $aa[1]++; }
    elsif ( $aa eq 'D' ) { $aa[2]++; }
    elsif ( $aa eq 'E' ) { $aa[3]++; }
    elsif ( $aa eq 'F' ) { $aa[4]++; }
    elsif ( $aa eq 'G' ) { $aa[5]++; }
    elsif ( $aa eq 'H' ) { $aa[6]++; }
    elsif ( $aa eq 'I' ) { $aa[7]++; }
    elsif ( $aa eq 'K' ) { $aa[8]++; }
    elsif ( $aa eq 'L' ) { $aa[9]++; }
    elsif ( $aa eq 'M' ) { $aa[10]++; }
    elsif ( $aa eq 'N' ) { $aa[11]++; }
    elsif ( $aa eq 'P' ) { $aa[12]++; }
    elsif ( $aa eq 'Q' ) { $aa[13]++; }
    elsif ( $aa eq 'R' ) { $aa[14]++; }
    elsif ( $aa eq 'S' ) { $aa[15]++; }
    elsif ( $aa eq 'T' ) { $aa[16]++; }
    elsif ( $aa eq 'V' ) { $aa[17]++; }
    elsif ( $aa eq 'W' ) { $aa[18]++; }
    elsif ( $aa eq 'Y' ) { $aa[19]++; }
    # If the amino acid is not recognised,
    # add 1 to the errors
    else {
      print "ERROR: Aminoacid not found: $aa\n";
      $errores++;
    }
  }
  # Finally return the frequency array
  return @aa;
}
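The long if/elsif chain can also be replaced by an associative array (hash) keyed on the residue letter. This is an alternative sketch rather than the program's own method; the helper name `residue_counts` and the sequence are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: count how often each character occurs in
# $sequence. Returns a hash reference: residue letter => count.
sub residue_counts {
    my ($sequence) = @_;
    my %count;
    $count{$_}++ for split //, uc $sequence;
    return \%count;
}

my @standard = split //, 'ACDEFGHIKLMNPQRSTVWY';   # the 20 one-letter codes
my $counts   = residue_counts('MKTAYIAKQRAKL');    # made-up sequence

# Print the 20 standard residues in alphabetical order, 0 when absent.
print join("\t", @standard), "\n";
print join("\t", map { $counts->{$_} || 0 } @standard), "\n";

# Every occurrence of a non-standard letter counts as an error.
my %is_standard = map { $_ => 1 } @standard;
my $errors = 0;
$errors += $counts->{$_} for grep { !$is_standard{$_} } keys %$counts;
print "errors = $errors\n";
```

With a hash there is no need to keep array indices and letters in step by hand, and unexpected letters are counted without extra branches.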

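The next step that nature takes to pass information from a nucleotide sequence to a protein is transcription, in which RNA copies the genetic information from the DNA, followed by translation of that information into a protein, that is, an amino acid sequence. For this we must use the genetic code, which maps each triplet (codon) of RNA/DNA to an amino acid. We will extract the sequence of a gene of Escherichia coli from an entry in EMBL (European Molecular Biology Laboratory) format, translate it into the corresponding amino acid sequence, and then check our translation against the one given in the entry. This example requires associative arrays, also known as hash tables.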
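As a brief illustration of how a hash supports this kind of lookup, the sketch below translates a short DNA string with a four-codon excerpt of the standard genetic code. The helper name `translate`, the placeholder 'X' for unknown codons, and the input string are assumptions made for this example only.

```perl
use strict;
use warnings;

# A small excerpt of the standard genetic code:
# codon => one-letter amino acid code.
my %code = (
    'ATG' => 'M',   # Methionine
    'AAA' => 'K',   # Lysine
    'TGG' => 'W',   # Tryptophan
    'TAA' => '*',   # Stop
);

# Hypothetical helper: translate a DNA string codon by codon,
# writing 'X' for codons that are missing from the table.
# Trailing bases that do not fill a whole codon are ignored.
sub translate {
    my ($dna, $table) = @_;
    my $protein = '';
    for (my $i = 0; $i + 3 <= length($dna); $i += 3) {
        my $codon = uc(substr($dna, $i, 3));
        $protein .= exists $table->{$codon} ? $table->{$codon} : 'X';
    }
    return $protein;
}

print translate('atgaaatggtaa', \%code), "\n";
```

The `exists` test distinguishes a codon that is absent from the table from one that maps to a value, which is a convenient way to catch malformed input.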


#!/usr/bin/perl
# Translates a DNA sequence from an EMBL entry
# into the corresponding amino acid sequence

# Get the file name (EMBL formatted) from the command line.
# It could also be requested with print and read from <STDIN>.
if (!$ARGV[0]) {
  print "The program line shall be: program.pl ficha_embl\n";
  exit;
}
$fichero = $ARGV[0];

# Open the file for reading
open (FICHA, "$fichero") || die "problem opening the file $fichero\n";

# First collect the sequence, as in example 2
while (<FICHA>) {
  chomp $_;
  if ($_ =~ /^FT\s+CDS/) {
    # Replace the dots of the location (start..end) with spaces
    # and split the line to get the start ($a3) and end ($a4)
    $_ =~ tr/../ /;
    ($a1,$a2,$a3,$a4) = split (" ",$_);
  } elsif ($_ =~ /^SQ/) {
    $signal_good = 1;
  } elsif ($signal_good == 1) {
    last if ($_ =~ /^\/\//);
    # Eliminate numbers and spaces
    $_ =~ tr/0-9/ /;
    $_ =~ s/\s//g;
    $secuencia.=$_;
  }
}
close (FICHA);

# Now define an associative array with the correspondence
# between every codon and its amino acid
my(%codigo_genetico) = (
  'TCA' => 'S',# Serine
  'TCC' => 'S',# Serine
  'TCG' => 'S',# Serine
  'TCT' => 'S',# Serine
  'TTC' => 'F',# Phenylalanine
  'TTT' => 'F',# Phenylalanine
  'TTA' => 'L',# Leucine
  'TTG' => 'L',# Leucine
  'TAC' => 'Y',# Tyrosine
  'TAT' => 'Y',# Tyrosine
  'TAA' => '*',# Stop
  'TAG' => '*',# Stop
  'TGC' => 'C',# Cysteine
  'TGT' => 'C',# Cysteine
  'TGA' => '*',# Stop
  'TGG' => 'W',# Tryptophan
  'CTA' => 'L',# Leucine
  'CTC' => 'L',# Leucine
  'CTG' => 'L',# Leucine
  'CTT' => 'L',# Leucine
  'CCA' => 'P',# Proline
  'CCC' => 'P',# Proline
  'CCG' => 'P',# Proline
  'CCT' => 'P',# Proline
  'CAC' => 'H',# Histidine
  'CAT' => 'H',# Histidine
  'CAA' => 'Q',# Glutamine
  'CAG' => 'Q',# Glutamine
  'CGA' => 'R',# Arginine
  'CGC' => 'R',# Arginine
  'CGG' => 'R',# Arginine
  'CGT' => 'R',# Arginine
  'ATA' => 'I',# Isoleucine
  'ATC' => 'I',# Isoleucine
  'ATT' => 'I',# Isoleucine
  'ATG' => 'M',# Methionine
  'ACA' => 'T',# Threonine
  'ACC' => 'T',# Threonine
  'ACG' => 'T',# Threonine
  'ACT' => 'T',# Threonine
  'AAC' => 'N',# Asparagine
  'AAT' => 'N',# Asparagine
  'AAA' => 'K',# Lysine
  'AAG' => 'K',# Lysine
  'AGC' => 'S',# Serine
  'AGT' => 'S',# Serine
  'AGA' => 'R',# Arginine
  'AGG' => 'R',# Arginine
  'GTA' => 'V',# Valine
  'GTC' => 'V',# Valine
  'GTG' => 'V',# Valine
  'GTT' => 'V',# Valine
  'GCA' => 'A',# Alanine
  'GCC' => 'A',# Alanine
  'GCG' => 'A',# Alanine
  'GCT' => 'A',# Alanine
  'GAC' => 'D',# Aspartic Acid
  'GAT' => 'D',# Aspartic Acid
  'GAA' => 'E',# Glutamic Acid
  'GAG' => 'E',# Glutamic Acid
  'GGA' => 'G',# Glycine
  'GGC' => 'G',# Glycine
  'GGG' => 'G',# Glycine
  'GGT' => 'G',# Glycine
);

# Translate every codon into its corresponding amino acid
# and append it to the protein sequence.
# The loop stops before the last codon of the CDS (the stop codon).
for($i=$a3 - 1; $i < $a4 - 3 ; $i += 3) {
  $codon = substr($secuencia,$i,3);
  # Convert the codon from lowercase (EMBL format) to uppercase
  $codon =~ tr/a-z/A-Z/;
  # Look the codon up in the genetic code table
  $protein.= $codigo_genetico{$codon};
}
print "The nucleotide sequence of the gene:\n$secuencia\ntranslates to the following protein sequence:\n$protein\n\n";
exit;
 


Webpages maintained by the LinuxFocus Editor team
© Carlos Andrés Pérez
"some rights reserved" see linuxfocus.org/license/
http://www.LinuxFocus.org
Translation information:
es --> -- : Carlos Andrés Pérez <caperez /at/ usc.edu.co>
en --> CN: <daxiawj(Q)gmail.com>

2005-05-06, generated by lfparser version 2.52