欢迎光临散文网 会员登陆 & 注册

【ROSALIND】【练Python,学生信】34 测序片段错误修正

2020-01-24 16:25 作者:未琢  | 我要投稿

如果第一次阅读本系列文档请先移步阅读【ROSALIND】【练Python,学生信】00 写在前面  谢谢配合~

题目:

修正测序片段中的错误(Error Correction in Reads)

Given: A collection of up to 1000 reads of equal length (at most 50 bp) in FASTA format. Some of these reads were generated with a single-nucleotide error. For each read s in the dataset, one of the following applies:

s was correctly sequenced and appears in the dataset at least twice (possibly as a reverse complement);

s is incorrect, it appears in the dataset exactly once, and its Hamming distance is 1 with respect to exactly one correct read in the dataset (or its reverse complement).

所给:约1000条等长的测序片段,最多不长过50bp,以FASTA格式给出。一些片段包含单个的碱基替换。单个片段s符合以下两种情况之一:

        s是正确的序列,在数据集中至少出现两次(可能以反向互补序列的形式给出);

        s包含错误碱基,在数据集中只出现一次,与正确序列仅相差一个碱基。

Return: A list of all corrections in the form "[old read]->[new read]". (Each correction must be a single symbol substitution, and you may return the corrections in any order.)

需得:以错误序列->修正序列形式给出的列表,可以用任何顺序排列。

 

测试数据

>Rosalind_52

TCATC

>Rosalind_44

TTCAT

>Rosalind_68

TCATC

>Rosalind_28

TGAAA

>Rosalind_95

GAGGA

>Rosalind_66

TTTCA

>Rosalind_33

ATCAA

>Rosalind_21

TTGAT

>Rosalind_18

TTTCC

测试输出

TTCAT->TTGAT

GAGGA->GATGA

TTTCC->TTTCA

 

生物学背景

        二代测序读长很短,错误率较高且无法预测。当发现有碱基替换时,需要考虑是真正的突变,还是由测序错误导致。

 

思路

        我的思路是用两重循环扫描数据集,通过挨个比较,将数据分为正确序列和错误序列两部分。输出的时候,再用两重循环,遍历正确和错误的部分,得到结果。

 

代码

def readfasta(lines):
   
"""读入fasta格式文件的函数"""
   
seq = []
    index = []
    seqplast =
""
   
numlines = 0
   
for i in lines:
       
if '>' in i:
            index.append(i.replace(
"\n", "").replace(">", ""))
            seq.append(seqplast.replace(
"\n", ""))
            seqplast =
""
           
numlines += 1
       
else:
            seqplast = seqplast + i.replace(
"\n", "")
            numlines +=
1
       
if numlines == len(lines):
            seq.append(seqplast.replace(
"\n", ""))
    seq = seq[
1:]
   
return index, seq


def recom(s):
   
"""输出原序列反向互补序列的函数"""
   
re = s[::-1]
    c =
""
   
for i in re:
       
if i == 'A':
            c = c +
'T'
       
elif i == 'G':
            c = c +
'C'
       
elif i == 'T':
            c = c +
'A'
       
elif i == 'C':
            c = c +
'G'

   
return c


def hamm(s1, s2):
   
"""计算汉明距离的函数"""

   
hs = 0
   
for i in range(len(s1)):
       
if s1[i] != s2[i]:
            hs +=
1

   
return hs

 

 

f = open('input.txt', 'r')
lines = f.readlines()
f.close()

[index, seq] = readfasta(lines)
recomseq = []
# 存储反向互补序列
correct = [] # 存储正确的序列
wrong = [] # 存储错误的序列
length = len(seq)
for i in range(length): # 依次取出每个测序片段
   
flag = 0
   
recomseq = recom(seq[i]) # 将其转为反向互补序列
   
if seq[i] not in correct and recomseq not in correct: # 如果正反序列都不在已有的正确序列中
       
for j in range(i+1,len(seq)):
           
if hamm(seq[i], seq[j]) == 0: # 如果在还未扫描的读长中发现了相同的序列
               
flag = 1 # flag标记是否有相同序列
           
elif hamm(recomseq, seq[j]) == 0: # 如果在还未扫描的读长中发现了反向互补序列的相同序列
               
flag = 1
       
if flag == 1: # 如果有相同序列,这个读长是正确的
           
correct.append(seq[i])
       
else: # 反之,是错误序列
           
wrong.append(seq[i])

f =
open('output.txt', 'a')

for i in wrong:
   
for j in correct: # 依次取出错误序列,与正确序列挨个比较
       
recomseq = recom(j)
       
if hamm(i, j) == 1: # 如果存在一个碱基替换
           
f.write(i + '->' + j)
            f.write(
'\n')
           
# print(i, end='->')
            # print(j)
       
elif hamm(i, recomseq) == 1:
            f.write(i, +
'->' + recomseq)
            f.write(
'\n')
           
# print(i, end='->')
            # print(recomseq)

f.close()


【ROSALIND】【练Python,学生信】34 测序片段错误修正的评论 (共 条)

分享到微博请遵守国家法律