【ROSALIND】【练Python,学生信】34 测序片段错误修正

如果第一次阅读本系列文档请先移步阅读【ROSALIND】【练Python,学生信】00 写在前面 谢谢配合~

题目:
修正测序片段中的错误(Error Correction in Reads)
Given: A collection of up to 1000 reads of equal length (at most 50 bp) in FASTA format. Some of these reads were generated with a single-nucleotide error. For each read s in the dataset, one of the following applies:
s was correctly sequenced and appears in the dataset at least twice (possibly as a reverse complement);
s is incorrect, it appears in the dataset exactly once, and its Hamming distance is 1 with respect to exactly one correct read in the dataset (or its reverse complement).
所给:约1000条等长的测序片段,最多不长过50bp,以FASTA格式给出。一些片段包含单个的碱基替换。单个片段s符合以下两种情况之一:
s是正确的序列,在数据集中至少出现两次(可能以反向互补序列的形式给出);
s包含错误碱基,在数据集中只出现一次,与正确序列仅相差一个碱基。
Return: A list of all corrections in the form "[old read]->[new read]". (Each correction must be a single symbol substitution, and you may return the corrections in any order.)
需得:以错误序列->修正序列形式给出的列表,可以用任何顺序排列。
测试数据
>Rosalind_52
TCATC
>Rosalind_44
TTCAT
>Rosalind_68
TCATC
>Rosalind_28
TGAAA
>Rosalind_95
GAGGA
>Rosalind_66
TTTCA
>Rosalind_33
ATCAA
>Rosalind_21
TTGAT
>Rosalind_18
TTTCC
测试输出
TTCAT->TTGAT
GAGGA->GATGA
TTTCC->TTTCA
生物学背景
二代测序读长很短,错误率较高且无法预测。当发现有碱基替换时,需要考虑是真正的突变,还是由测序错误导致。
思路
我的思路是用两重循环扫描数据集,通过挨个比较,将数据分为正确序列和错误序列两部分。输出的时候,再用两重循环,遍历正确和错误的部分,得到结果。
代码
def readfasta(lines):
"""读入fasta格式文件的函数"""
seq = []
index = []
seqplast = ""
numlines = 0
for i in lines:
if '>' in i:
index.append(i.replace("\n", "").replace(">", ""))
seq.append(seqplast.replace("\n", ""))
seqplast = ""
numlines += 1
else:
seqplast = seqplast + i.replace("\n", "")
numlines += 1
if numlines == len(lines):
seq.append(seqplast.replace("\n", ""))
seq = seq[1:]
return index, seq
def recom(s):
"""输出原序列反向互补序列的函数"""
re = s[::-1]
c = ""
for i in re:
if i == 'A':
c = c + 'T'
elif i == 'G':
c = c + 'C'
elif i == 'T':
c = c + 'A'
elif i == 'C':
c = c + 'G'
return c
def hamm(s1, s2):
"""计算汉明距离的函数"""
hs = 0
for i in range(len(s1)):
if s1[i] != s2[i]:
hs += 1
return hs
f = open('input.txt', 'r')
lines = f.readlines()
f.close()
[index, seq] = readfasta(lines)
recomseq = [] # 存储反向互补序列
correct = [] # 存储正确的序列
wrong = [] # 存储错误的序列
length = len(seq)
for i in range(length): # 依次取出每个测序片段
flag = 0
recomseq = recom(seq[i]) # 将其转为反向互补序列
if seq[i] not in correct and recomseq not in correct: # 如果正反序列都不在已有的正确序列中
for j in range(i+1,len(seq)):
if hamm(seq[i], seq[j]) == 0: # 如果在还未扫描的读长中发现了相同的序列
flag = 1 # flag标记是否有相同序列
elif hamm(recomseq, seq[j]) == 0: # 如果在还未扫描的读长中发现了反向互补序列的相同序列
flag = 1
if flag == 1: # 如果有相同序列,这个读长是正确的
correct.append(seq[i])
else: # 反之,是错误序列
wrong.append(seq[i])
f = open('output.txt', 'a')
for i in wrong:
for j in correct: # 依次取出错误序列,与正确序列挨个比较
recomseq = recom(j)
if hamm(i, j) == 1: # 如果存在一个碱基替换
f.write(i + '->' + j)
f.write('\n')
# print(i, end='->')
# print(j)
elif hamm(i, recomseq) == 1:
f.write(i, + '->' + recomseq)
f.write('\n')
# print(i, end='->')
# print(recomseq)
f.close()