【ROSALIND】【练Python,学生信】25 序列拼接

如果第一次阅读本系列文档请先移步阅读【ROSALIND】【练Python,学生信】00 写在前面 谢谢配合~

题目:
将基因组片段组装为最短“超串”
Given: At most 50 DNA strings of approximately equal length, not exceeding 1 kbp, in FASTA format (which represent reads deriving from the same strand of a single linear chromosome). The dataset is guaranteed to satisfy the following condition: there exists a unique way to reconstruct the entire chromosome from these reads by gluing together pairs of reads that overlap by more than half their length..
所给:不超过50个DNA序列,长度相近且不会超过1kb,,以fasta格式给出,代表来自一个线性染色体上同一条链的片段。同时数据集满足如下要求:肯定有唯一解法,使这些片段两两重合长度大于其一般长度,从而组装成一个完整序列。
Return: A shortest superstring containing all the given strings (thus corresponding to a reconstructed chromosome).
需得:由所给的所有序列组成的最短完整序列。
特别提醒
所给数据集是极其理想化的,在实际中reads不可能只来自同一条链。而且在实际中直接拼接出完整基因组的可能性极小,通常只能得到临近的部分序列,称为contigs(重叠群)。
测试数据
>Rosalind_56
ATTAGACCTG
>Rosalind_57
CCTGCCGGAA
>Rosalind_58
AGACCTGCCG
>Rosalind_59
GCCGGAATAC
测试输出
ATTAGACCTGCCGGAATAC
背景
测序使我们进行生物学研究的重要方法,现代高通量测序中,我们只能得到长度有限的部分序列片段,称为reads,如何把大量reads拼接成完整的基因组成为亟待解决的重要问题。序列拼接的示意图如下:

思路
我的想法是将所有序列两两比较,每次都将重合度最大的两个序列合并,重复这个过程直至所有序列合并为一个序列。在这个思路的指导下,我用了大量的循环,最多处达到三层循环,但只要不被循环绕晕,整体思路还是较为简单直接的。
代码
def readfasta(lines):
"""读入fasta格式文件的函数"""
seq = []
index = []
seqplast = ""
numlines = 0
for i in lines:
if '>' in i:
index.append(i.replace("\n", "").replace(">", ""))
seq.append(seqplast.replace("\n", ""))
seqplast = ""
numlines += 1
else:
seqplast = seqplast + i.replace("\n", "")
numlines += 1
if numlines == len(lines):
seq.append(seqplast.replace("\n", ""))
seq = seq[1:]
return index, seq
def findoverlap(s1,s2):
"""返回两个序列重叠碱基数量的函数"""
n1 = len(s1)
n2 = len(s2)
if n1 >= n2:
i = n2
flag = 0
while i > 0:
if s1[-i:] == s2[:i]: # 把前一个序列的后部与第二个序列的前部依次比较
flag = i # 用flag变量记录重叠的碱基数
break
i -= 1
else:
i = n1 -1
flag = 0
while i > 1:
if s1[-i:] == s2[:i]:
flag = i
break
i -= 1
return flag
def isoverlap(s1, s2):
"""判断两个序列前后顺序的函数"""
flag1 = findoverlap(s1, s2)
flag2 = findoverlap(s2, s1)
if flag1 >= flag2: # 比较两个序列的先后位置对重叠碱基数的影响
flag = flag1
orders = 1 # 用orders变量记录先后顺序
elif flag2 > flag1:
flag = flag2
orders = 2
return flag, orders
def addseq(n, seq1, seq2):
"""将两个序列合并成一个的函数"""
n1 = len(seq1)
aseq = seq1[:n1 - n] + seq2
return aseq
f = open('input.txt', 'r')
lines = f.readlines()
f.close()
[index, seq] = readfasta(lines)
seq1 = ''
seq2 = ''
while len(seq) > 1: # 用一个大循环控制,直至所有片段合成一个
for i in range(len(seq)): # 下面的两个for循环实现seq中所有序列两两比对
miniseq = [] # miniseq用来防止序列自身比对
k =0
while k < len(seq):
if k != i:
miniseq.append(seq[k])
k += 1
maxoverlap = 0 # 用maxoverlap记录当前轮比较最多的重合碱基数
for j in range(len(miniseq)):
[overlap, orders] = isoverlap(seq[i], miniseq[j])
if overlap > maxoverlap: # 本次比较与之前的最大重合数比,若更大,则代替后者称为新的最大重合碱基数
maxoverlap = overlap
if orders == 1: # seq1和seq2记录带来最大重合碱基数的两个序列,orders指示顺序
seq1 = seq[i]
seq2 = miniseq[j]
else:
seq1 = miniseq[j]
seq2 = seq[i]
totalseq = addseq(maxoverlap, seq1, seq2) # 将本轮具有最大重合度的两个序列合并
seq.remove(seq1) # 删去已被合并的两个序列
seq.remove(seq2)
seq.append(totalseq) # 将合并的新串加入seq序列,进入下一轮比较
f = open('output.txt', 'w')
f.write(totalseq)
f.close()