
2009-02-09

Perl Regular Expressions
Source: Newbie's blog (http://medialab.egloos.com)

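Most biology software is built for Unix, and Unix itself uses text files as its basic means of data exchange, so working with text files is unavoidable. In fact, for data exchange, text files do have advantages over binary data files: writing a parser for a text file whose format is undocumented is far easier than reading an unknown binary file. After all, even an XML file is essentially a text file.

Regular expressions are the most efficient tool for this kind of text parsing, and Perl without regular expressions would be like a steamed bun without its red bean filling. Python, Java, .NET and others support regular expressions too, but the flavor they support descends from Perl's. The big difference is that those languages provide regular expressions through a separate module or library, whereas in Perl they are integrated into the language itself.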

If you want to learn about regular expressions in depth, I recommend this book.

http://kangcom.com/common/bookinfo/bookinfo.asp?sku=200303040023

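It may be a little hard going for someone just starting out with regular expressions, but once you can write simple ones it is a great help. It covers everything from basic usage to advanced techniques in Perl, Java, .NET, and various other languages, and can fairly be called the bible of regular expressions.

For Python, see this link.

1. What is a regular expression?

So what is a regular expression? Wikipedia defines it as follows.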

A regular expression is a string that describes or matches a set of strings, according to certain syntax rules.

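In other words, a regular expression is a string that, following certain syntax rules, describes or matches a set of strings.
So much for the definition; let us explain it through a practical example.

What if you want to know the number of BamHI sites in a particular DNA sequence?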

#!/usr/bin/perl -w
# modified from perl scripts in http://stein.cshl.org/genome_informatics/regex/regex1.html
#

use strict;

# Take the file name from the command line; if none is given, print the usage and exit
my $filename = shift or die "usage: BamHI.perl <filename>\n";

# Open the file named in $filename with a file handle called FASTA
open (FASTA, "<", $filename) or die "cannot open $filename: $!\n";

# $sites stores the number of enzyme sites found
my $sites = 0;

# Read the whole file one line at a time through the FASTA handle, storing each line in $line
while (my $line = <FASTA>) {

# Remove the line terminator
chomp ($line);

# If $line contains the string GGATCC, increase $sites by 1
if ($line =~ m/GGATCC/){

$sites++;
}
}

close (FASTA);

# If $sites is 1 or more (a BamHI site was found), print a message
if ($sites){
print "$sites BamHI sites total\n";
}
else {
print "there is no BamHI site!\n";
}

if ($line =~ m/GGATCC/){

$sites++;
}
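This is the most important part. The expression $line =~ m/regex/ is true when the pattern matches somewhere in $line; here, that means whenever $line contains the string GGATCC.

What if you want to find the site of a restriction enzyme such as HincII, GTYRAC, which contains the ambiguity codes Y (C or T) and R (A or G)?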
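Note that the script above counts lines containing at least one site, so two sites on the same line count once, and a site split across a line break is missed. Here is a minimal sketch of one way around this, assuming the whole sequence fits in memory. The sub name count_sites and the sample data are hypothetical, not part of the original script.

```perl
use strict;
use warnings;

# Hypothetical helper: count every non-overlapping occurrence of a
# recognition site in a FASTA-formatted string.
sub count_sites {
    my ($fasta_text, $site) = @_;
    my $sequence = '';
    for my $line (split /\n/, $fasta_text) {
        next if $line =~ /^>/;      # skip header lines
        $line =~ s/\s+//g;          # drop stray whitespace
        $sequence .= uc $line;      # join the sequence lines into one string
    }
    # A global match in list context returns all matches; assigning that
    # list to () and then to a scalar yields the number of matches.
    my $count = () = $sequence =~ /\Q$site\E/g;
    return $count;
}

# Made-up data: one site inside a line, one split across a line break.
my $example = ">test\nAAGGATCCTTGGA\nTCCAA\n";
my $bamhi_count = count_sites($example, 'GGATCC');
print "$bamhi_count BamHI sites total\n";
```

Joining the lines first trades memory for correctness; for very large files a sliding window over adjacent lines would be an alternative.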

if ($line =~ m/GT[CT][AG]AC/){

$sites++;
}

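This is how you write it: [CT] means C or T, and [AG] means A or G.

In short, a regular expression is a rule that describes, as a pattern, the strings that follow some regularity, so that they can be searched for.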
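The translation from ambiguity codes to character classes can also be automated. The following is a rough sketch, assuming only a small subset of the IUPAC codes is needed. The hash %iupac and the sub site_to_regex are hypothetical names introduced for illustration.

```perl
use strict;
use warnings;

# A subset of the IUPAC nucleotide codes mapped to character classes.
my %iupac = (
    A => 'A', C => 'C', G => 'G', T => 'T',
    R => '[AG]', Y => '[CT]', N => '[ACGT]',
);

# Hypothetical helper: turn a recognition site such as GTYRAC into a regex.
sub site_to_regex {
    my ($site) = @_;
    my $pattern = '';
    for my $code (split //, uc $site) {
        die "unsupported code: $code\n" unless exists $iupac{$code};
        $pattern .= $iupac{$code};
    }
    return qr/$pattern/;
}

my $hincii = site_to_regex('GTYRAC');

# Keep only the made-up test sequences that contain a HincII site.
my @hits = grep { $_ =~ $hincii } qw(GTCAAC GTTGAC GTAAAC);
print "@hits\n";
```

GTAAAC is rejected because its third base is A, which is not covered by Y (C or T).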

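2. A brief overview of Perl regular expression syntax

- A regular expression is written between two slashes (/ /).

- Characters that can be used in a pattern

* Ordinary characters (a-z, A-Z, 0-9, and some punctuation marks)
* The "." character matches any single character (except a newline, by default)
* Square brackets match any one of the characters listed inside them.

Examples: a nucleotide [AaGgCcTtNn]
an uppercase letter or digit [A-Z0-9]
any character except A-Z [^A-Z]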

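* Frequently used character classes have their own shorthand metacharacters. These are written with a backslash (\), which is displayed as the won sign (₩) on Korean keyboards.

\d : a digit [0-9]

Example: to match something like 123-3456 or 345-2345,
\d\d\d-\d\d\d\d

\w : a letter, digit, or underscore [A-Za-z_0-9]

\W : any character other than a letter, digit, or underscore

\s : a whitespace character

\S : a non-whitespace character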

- Anchors restrict where in the string a pattern may match.

^ : matches at the beginning of the line

$ : matches at the end of the line

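- Quantifiers specify how many times the preceding pattern may repeat.

? : matches zero or one time
* : matches zero or more times
+ : matches one or more times

Example: \S+ matches one or more non-whitespace characters

{3} : matches exactly 3 times

Example: the pattern \d\d\d-\d\d\d\d used above for a phone number such as 123-3456 can be shortened to \d{3}-\d{4}.

{2,4} : matches 2 to 4 times

Example: when the exchange code has 3 or 4 digits, write \d{3,4}-\d{4}. This matches both 123-1234 and 1234-1234.

{4,} : matches at least 4 times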

Finally, how do you search for a character such as '^' or '|' itself? Simply put a backslash (\) in front of it.

That is enough basic syntax for now.
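To see several of these pieces working together, here is a small sketch that combines shorthand classes, quantifiers, anchors, and an escaped metacharacter. The test strings are made up for illustration.

```perl
use strict;
use warnings;

# Made-up test strings for the phone-number pattern from the text.
my @candidates = ('123-3456', '1234-1234', '12-3456', '12345-1234', 'abc-defg');

# ^ and $ anchor the pattern so that the whole string has to match.
my @valid = grep { /^\d{3,4}-\d{4}$/ } @candidates;
print "valid: @valid\n";

# A backslash makes a metacharacter literal: \| matches an actual '|'.
my @fields = split /\|/, 'gi|84489096|ref|YP_447328.1|';
print "second field: $fields[1]\n";
```

Without the anchors, 12345-1234 would also be accepted, because the pattern could match the substring 2345-1234.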

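3. Pattern extraction

As mentioned earlier, the main practical uses of regular expressions are extracting just the content you want from text and performing substitutions based on a pattern.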

>gi|84489096|ref|YP_447328.1| GatB [Methanosphaera stadtmanae DSM 3091]
MMCGLEIHVQLNTNSKLFCSCPTNYQSAPNNTNICPVCLNQPGAKPYPPNQAALDNAIKVALMLGCEISN
EVIYFMRKHYDYPDLSSGYQRTSVPVGIKGELNGVRIHEIHVEEDPGQYKPDRGTVDFNRSGIPLIEIVT
EPDMKSPEEARNFLNELIRVLNYSGSARGEGTMRADVNISIEGGKRAEVKNVNSIRGAYKVLKFELIRQK
NILRRGGEVQQETRAYLESQMITVPMRLKEDADDYRYIPDPDLPPLKIDPAHVEEIRQNMPEPAHLKTER
FVEEYGIDKKDAKVLTSELELADAFEEVCKEVDANVAARLMRDELKRVLHYNKIQYAESKITPSDIVELI
NLIESKQVTPEAAHKLIEQMPGNDKTPTEIGNEMDIIGVVEDDAIVNAINQAIEENPNAVEDYKNGKDNA
VNFLVGQVMRLTRGKANAGETNKMIKEKLDQL

Suppose that from the header line of a FASTA sequence like this,

>gi|84489096|ref|YP_447328.1| GatB [Methanosphaera stadtmanae DSM 3091]

you want to extract just two parts: the gene name (GatB) and the organism name inside the square brackets.

$line = ">gi|84489096|ref|YP_447328.1| GatB [Methanosphaera stadtmanae DSM 3091]";

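Assuming the scalar variable $line holds the string above, use the regular expression

/\|\s(\S+)\s\[(.*)\]$/

The '|' character has a special meaning inside a regular expression, so it has to be written with a backslash (\|) to match a literal '|'. (The square brackets '[' and ']' are likewise written as \[ and \].)

\[(.*)\]$

= captures whatever characters (.*) lie between [ and ] at the end of the line. The region to capture is marked with ( and ).

The \|\s(\S+) in front of it captures, in the "| GatB" part, the run of non-whitespace characters that follows the | and the whitespace character.

if ($line =~ m/\|\s(\S+)\s\[(.*)\]$/) {

$gene = $1;

$organism = $2;

}

The variables $1 and $2 hold the text matched by the first group (\S+) and the second group (.*), so reading them gives you the text matched inside the parentheses.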
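When a pattern has several groups, numbered variables can become hard to keep track of. As an optional variation, and assuming Perl 5.10 or later, the groups can be given names. The group names gene and organism below are chosen for illustration only.

```perl
use strict;
use warnings;

my $line = '>gi|84489096|ref|YP_447328.1| GatB [Methanosphaera stadtmanae DSM 3091]';

my ($gene, $organism);

# (?<name>...) labels a group; after a successful match the captured
# text is available as $+{name}.
if ($line =~ /\|\s(?<gene>\S+)\s\[(?<organism>.*)\]$/) {
    ($gene, $organism) = ($+{gene}, $+{organism});
    print "gene: $gene\n";
    print "organism: $organism\n";
}
```

The numbered variables $1 and $2 are still set, so named and numbered access can be mixed.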

if ($line =~ m/(.{50})TATTAT(.{25})/) {

$upstream = $1;

$downstream = $2;

}

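The example above captures the 50 characters upstream of the 'TATTAT' sequence into $upstream and the 25 characters downstream of it into $downstream.

The more examples the better, so here are a few more.

How to extract the library name, plate name, and well coordinate from a file name of the form Library_Plate-Well-F.ab1

Suppose the library is named SEQUENCING, its 96-well plates are numbered 01 through 99, the well coordinates run from A01 to H12 (96-well plate), and the primer is F or R. To extract each of these with a single regular expression: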

$filename = "SEQUENCING_01-A01-F.ab1";

if ($filename =~ m/(\w+)_(\d{1,3})-([A-H][0-1][0-9])-([FR])\.ab1/) {

$library = $1;

$plate = $2;

$wellposition = $3;

$primer = $4;

}

print $library," ", $plate," ",$wellposition," ", $primer, "\n";

As expected, the output contains the extracted pieces SEQUENCING, 01, A01, and F.

(Each captured region was shown in its own color.)
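The pattern above accepts a few names that are not valid wells, such as A00 or A19, and it is not anchored. A slightly stricter sketch, assuming wells run exactly from A01 to H12, is shown below. It also uses a match in list context to assign all captures at once. The extra file names are made up.

```perl
use strict;
use warnings;

my @filenames = ('SEQUENCING_01-A01-F.ab1', 'SEQUENCING_12-H12-R.ab1', 'notes.txt');

my @parsed;
for my $filename (@filenames) {
    # In list context a match returns the captured groups;
    # an empty list means the pattern did not match.
    my ($library, $plate, $well, $primer) =
        $filename =~ /^(\w+)_(\d{1,3})-([A-H](?:0[1-9]|1[0-2]))-([FR])\.ab1$/;
    next unless defined $library;
    push @parsed, "$library $plate $well $primer";
}
print "$_\n" for @parsed;
```

The (?:...) group only groups the alternatives for the column number and does not create a capture of its own, so the numbering of the four captures stays the same.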

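4. Substitution

Besides capturing parts of a match, another useful application of regular expressions is string substitution: a predefined replacement can be applied to whatever the pattern matches.

The substitution syntax is

$variable =~ s/regex/replacement string/

By default only the first match is replaced. If you want to replace every match in the string,

$variable =~ s/regex/replacement string/g

use the 'g' switch.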

Example:

$base = "ACGTGCGTGATTTTTTTAGG";

$base =~ s/TTTTTTT/AAAAAAA/;

Searches for 'TTTTTTT' and replaces it with AAAAAAA.

$base = "ACGTGCGTGATTTTTTTAGG";

$base =~ s/GT/CA/g;

Searches for GT and replaces it with CA. GT occurs more than once in the sequence above, and the g switch replaces every occurrence of the pattern 'GT' with CA.

# Without the g switch: only the first GT pattern is replaced with CA.
$base = "ACCAGCGTGATTTTTTTAGG";

# With the g switch: every GT pattern is replaced with CA.
$base = "ACCAGCCAGATTTTTTTAGG";
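It is sometimes useful to know how many replacements were made. As a small sketch using the same sequence, the return value of the substitution can be stored. The variable name $replaced is chosen for illustration.

```perl
use strict;
use warnings;

my $base = 'ACGTGCGTGATTTTTTTAGG';

# s///g returns the number of replacements it made
# (and a false value if nothing matched).
my $replaced = ($base =~ s/GT/CA/g);
print "$replaced replacements: $base\n";
```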

Substitution also makes it very easy to delete specific parts of a string.

$base = "ACCCCC*CGTGATT*TTTTTAGG";

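Suppose you want to delete only the * characters from a sequence like this, in which gaps are marked with '*'. Without substitution you would have to find each * position, take the substrings around it, and join them back together, which is tedious. With substitution,

$base =~ s/\*//g;

this single line does the job. It means 'find every * in the string and replace it with the empty string (= delete it)'. (The * character has a special meaning in a regular expression, so a backslash must be put in front of it to search for a literal *, and the g switch is turned on because every * has to be found and replaced.)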
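For deleting single characters, Perl's transliteration operator is a common alternative to a substitution. The sketch below is a different technique from the s///g line above, shown only for comparison. The variable name $gaps is chosen for illustration.

```perl
use strict;
use warnings;

my $base = 'ACCCCC*CGTGATT*TTTTTAGG';

# tr/*//d deletes every '*' and returns how many characters it matched.
# Inside tr, '*' is an ordinary character, so no backslash is needed.
my $gaps = ($base =~ tr/*//d);
print "removed $gaps gaps: $base\n";
```

Because tr works on a fixed list of characters and not on a pattern, it cannot replace the regular expression in the general case.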

What if you only want the SEQUENCING_01-A01 part from the earlier example? You could of course use the capture feature described above, but this works too.

$filename = "SEQUENCING_01-A01-F.ab1";

$filename =~ s/-[FR]\.ab1$//;
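The substitution above overwrites $filename. If the original name is still needed, one option, assuming Perl 5.14 or later, is the /r modifier, which returns the result without changing the variable. The variable name $stem is chosen for illustration.

```perl
use strict;
use warnings;

my $filename = 'SEQUENCING_01-A01-F.ab1';

# The /r modifier returns the modified copy
# and leaves $filename itself unchanged.
my $stem = $filename =~ s/-[FR]\.ab1$//r;
print "$stem\n";
print "$filename\n";
```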

2008-11-05

TIME's Best Inventions of 2008

Invention of the Year

1. The Retail DNA Test

By Anita Hamilton


Before meeting with Anne Wojcicki, co-founder of a consumer gene-testing service called 23andMe, I know just three things about her: she's pregnant, she's married to Google's Sergey Brin, and she went to Yale. But after an hour chatting with her in the small office she shares with co-founder Linda Avey at 23andMe's headquarters in Mountain View, Calif., I know some things no Internet search could reveal: coffee makes her giddy, she has a fondness for sequined shoes and fresh-baked bread, and her unborn son has a 50% chance of inheriting a high risk for Parkinson's disease.

Learning and sharing your genetic secrets are at the heart of 23andMe's controversial new service — a $399 saliva test that estimates your predisposition for more than 90 traits and conditions ranging from baldness to blindness. Although 23andMe isn't the only company selling DNA tests to the public, it does the best job of making them accessible and affordable. The 600,000 genetic markers that 23andMe identifies and interprets for each customer are "the digital manifestation of you," says Wojcicki (pronounced Wo-jis-key), 35, who majored in biology and was previously a health-care investor. "It's all this information beyond what you can see in the mirror."

We are at the beginning of a personal-genomics revolution that will transform not only how we take care of ourselves but also what we mean by personal information. In the past, only élite researchers had access to their genetic fingerprints, but now personal genotyping is available to anyone who orders the service online and mails in a spit sample. Not everything about how this information will be used is clear yet — 23andMe has stirred up debate about issues ranging from how meaningful the results are to how to prevent genetic discrimination — but the curtain has been pulled back, and it can never be closed again. And so for pioneering retail genomics, 23andMe's DNA-testing service is Time's 2008 Invention of the Year.

The 1997 film Gattaca depicted it as a futuristic nightmare, but human-genotyping has emerged instead as both a real business and a status symbol. Movie mogul Harvey Weinstein says he is backing 23andMe not for its cinematic possibilities but because "I think it is a good investment. This is strictly medical and business-like." Google has chipped in almost half the $8.9 million in funding raised by the firm, which counts Warren Buffett, Rupert Murdoch and Ivanka Trump among its clients.

Weinstein isn't saying what his test told him, but Wojcicki and her famous husband are perfectly willing to discuss their own genetic flaws. Most worrisome is a rare mutation that gives Brin an estimated 20% to 80% chance of getting Parkinson's disease. There's a 50% chance that the couple's child, due later this year, will inherit that same gene. "I don't find this embarrassing in any way," says Brin, who blogged about it in September. "I felt it was a lot of work and impractical to keep it secret, and I think in 10 years it will be commonplace to learn about your genome."

And yet while Wojcicki and Brin aren't worried about genetic privacy, others are. In May, President George W. Bush signed a bill that makes it illegal for employers and insurers to discriminate on the basis of genetic information. California and New York tried to block the tests on the grounds that they were not properly licensed, but have so far been unsuccessful. Others worry about how sharing one's genetic data might affect close relatives who would prefer not to let a family history of schizophrenia or Lou Gehrig's disease become public. And what if a potential mate demands to see your genome before getting serious? Such hypotheticals are endless. And some researchers argue that the tests are flawed. "The uncertainty is too great," says Dr. Muin Khoury, director of the National Office of Public Health Genomics at the Centers for Disease Control and Prevention, who argues that it is wrong to charge people for access to such preliminary and incomplete data. Many diseases stem from several different genes and are triggered by environmental factors. Since less than a tenth of our 20,000 genes have been correlated with any condition, it's impossible to nail down exactly what component is genetic. "A little knowledge is a dangerous thing," says Dr. Alan Guttmacher of the National Institutes of Health.

23andMe is unfazed by its detractors. "It's somewhat paternalistic to say people shouldn't get these tests because 'we don't want people to misunderstand or get upset,'" says board member Esther Dyson. There can be a psychological upside too: some people decide to lead healthier lifestyles. Brin is currently funding Parkinson's research. And not all customers' results are as troubling as his. Nate Guy, 19, of Warrenton, Va., was relieved that though his uncle had died of prostate cancer, his own risk for the disease was about average. He even posted a video about it on YouTube. And unflattering findings can have a silver lining. "Now I have an excuse for not remembering things, because my memory is probably genetically flawed," Guy says.

Wojcicki and Avey see themselves not just as businesswomen but also as social entrepreneurs. With their customers' consent, they plan to amass everyone's genetic footprint in a giant database that can be mined for clues to which mutations make us susceptible to specific diseases and which drugs people are more likely to respond to. "You're donating your genetic information," says Wojcicki. "We could make great discoveries if we just had more information. We all carry this information, and if we bring it together and democratize it, we could really change health care."

2008-09-04

Riken genome browser

http://omicspace.riken.jp/db/genome.html.en

A browser whose UI is more impressive than its performance ;;
