【ROSALIND】【练Python，学生信】10 寻找DNA的consensus string

2019-02-06 17:11 作者:未琢 0人读过 | 我要投稿

如果第一次阅读本系列文档请先移步阅读【ROSALIND】【练Python，学生信】00 写在前面谢谢配合~

题目：

寻找最可能的祖先序列

Given: A collection of at most 10 DNA strings of equal length (at most 1 kbp) in FASTA format.

所给：不超过十条等长DNA序列，长度均不超过1kb，以FASTA格式给出。

Return: A consensus string and profile matrix for the collection. (If several possible consensus strings exist, then you may return any one of them.)

需得：这些DNA的一致序列和profile矩阵（如果存在多条一致序列，返回任意一条即可）。

测试数据

>Rosalind_1

ATCCAGCT

>Rosalind_2

GGGCAACT

>Rosalind_3

ATGGATCT

>Rosalind_4

AAGCAACC

>Rosalind_5

TTGGAACT

>Rosalind_6

ATGCCATT

>Rosalind_7

ATGGCACT

测试输出

ATGCAACT

A: 5 1 0 0 5 5 0 0

C: 0 0 1 4 2 0 6 1

G: 1 1 6 3 0 1 0 0

T: 1 5 0 0 0 1 1 6

背景

在理想状态下，当我们有多条长度相等的同源序列时，可以通过序列比对推到出最有可能的祖先序列，方法如下：

将m条长度为n的序列组成一个m×n的矩阵

G G G C A A C T

A T G G A T C T

A A G C A A C C

T T G G A A C T

A T G C C A T T

A T G G C A C T

比对之后可以得到一个4×n的矩阵，分别统计A、C、G、T在各位点出现的次数：

A 5 1 0 0 5 5 0 0

C 0 0 1 4 2 0 6 1

G 1 1 6 3 0 1 0 0

T 1 5 0 0 0 1 1 6

取每个位点出现次数最多的符号即可（出现次数最多的符号可能有多个，因此一致序列可能有多个可能）。

思路

需要统计各符号在每条序列每个位点上出现的频率，因此考虑用两层循环，第一层用于扫描位点，第二层用于扫描各序列，即可统计出各符号出现的次数。

代码

def readfasta(lines): # 用封装好的函数读取FASTA文件

seq = []

index = []

seqplast = ""

numlines = 0

for i in lines:

if '>' in i:

index.append(i.replace("\n", "").replace(">", ""))

seq.append(seqplast.replace("\n", ""))

seqplast = ""

numlines += 1

else:

seqplast = seqplast + i.replace("\n", "")

numlines += 1

if numlines == len(lines):

seq.append(seqplast.replace("\n", ""))

seq = seq[1:]

return index, seq

f = open('rosalind_cons.txt', 'r')

lines = f.readlines()

f.close()

[index, seq] = readfasta(lines)

A = []

G = []

T = []

C = []

i = 0

while i < len(seq[0]): # 第一层循环，扫描位点

j = 0

a = 0

g = 0

c = 0

t = 0

while j < len(seq): # 第二层循环，扫描各序列

if seq[j][i] == 'A': # 判断该序列该位点是否为”A”

a += 1 # 是则相应计数加一

elif seq[j][i] == 'G':

g += 1

elif seq[j][i] == 'C':

c += 1

elif seq[j][i] == 'T':

t += 1

j += 1

i +=1

A.append(a) # 记录该位点出现”A”的次数

G.append(g)

C.append(c)

T.append(t)

# 推出一致序列

common = ''

base = 0

while base < len(A):

if max(A[base],G[base],C[base],T[base]) == A[base]:

common += 'A'

elif max(A[base],G[base],C[base],T[base]) == G[base]:

common += 'G'

elif max(A[base],G[base],C[base],T[base]) == T[base]:

common += 'T'

elif max(A[base],G[base],C[base],T[base]) == C[base]:

common += 'C'

base += 1

# 把计数各项从整数改为字符串型，方便写入输出文件

i = 0

while i < len(A):

A[i] = str(A[i])

G[i] = str(G[i])

C[i] = str(C[i])

T[i] = str(T[i])

i += 1

f = open('output.txt','a')

f.write(common + '\n')

f.write("A: ")

f.write(' '.join(A) + '\n')

f.write("C: ")

f.write(' '.join(C) + '\n')

f.write("G: ")

f.write(' '.join(G) + '\n')

f.write("T: ")

f.write(' '.join(T))

f.close()

标签：

【ROSALIND】【练Python，学生信】10 寻找DNA的consensus string

【ROSALIND】【练Python，学生信】10 寻找DNA的consensus string的评论 (共条)

你可能也喜欢这些文章

最新发布的文章

【ROSALIND】【练Python，学生信】10 寻找DNA的consensus string

本文作者的其他文章

【ROSALIND】【练Python，学生信】10 寻找DNA的consensus string的评论 (共 条)

你可能也喜欢这些文章

最新发布的文章

【ROSALIND】【练Python，学生信】10 寻找DNA的consensus string的评论 (共条)