【ROSALIND】【练Python，学生信】34 测序片段错误修正

2020-01-24 16:25 作者:未琢 0人读过 | 我要投稿

如果第一次阅读本系列文档请先移步阅读【ROSALIND】【练Python，学生信】00 写在前面谢谢配合~

题目：

修正测序片段中的错误（Error Correction in Reads）

Given: A collection of up to 1000 reads of equal length (at most 50 bp) in FASTA format. Some of these reads were generated with a single-nucleotide error. For each read s in the dataset, one of the following applies:

s was correctly sequenced and appears in the dataset at least twice (possibly as a reverse complement);

s is incorrect, it appears in the dataset exactly once, and its Hamming distance is 1 with respect to exactly one correct read in the dataset (or its reverse complement).

所给：约1000条等长的测序片段，最多不长过50bp，以FASTA格式给出。一些片段包含单个的碱基替换。单个片段s符合以下两种情况之一：

s是正确的序列，在数据集中至少出现两次（可能以反向互补序列的形式给出）；

s包含错误碱基，在数据集中只出现一次，与正确序列仅相差一个碱基。

Return: A list of all corrections in the form "[old read]->[new read]". (Each correction must be a single symbol substitution, and you may return the corrections in any order.)

需得：以错误序列->修正序列形式给出的列表，可以用任何顺序排列。

测试数据

>Rosalind_52

TCATC

>Rosalind_44

TTCAT

>Rosalind_68

TCATC

>Rosalind_28

TGAAA

>Rosalind_95

GAGGA

>Rosalind_66

TTTCA

>Rosalind_33

ATCAA

>Rosalind_21

TTGAT

>Rosalind_18

TTTCC

测试输出

TTCAT->TTGAT

GAGGA->GATGA

TTTCC->TTTCA

生物学背景

二代测序读长很短，错误率较高且无法预测。当发现有碱基替换时，需要考虑是真正的突变，还是由测序错误导致。

思路

我的思路是用两重循环扫描数据集，通过挨个比较，将数据分为正确序列和错误序列两部分。输出的时候，再用两重循环，遍历正确和错误的部分，得到结果。

代码

def readfasta(lines):
    """读入fasta格式文件的函数"""
    seq = []
    index = []
    seqplast = ""
    numlines = 0
    for i in lines:
        if '>' in i:
            index.append(i.replace("\n", "").replace(">", ""))
            seq.append(seqplast.replace("\n", ""))
            seqplast = ""
            numlines += 1
        else:
            seqplast = seqplast + i.replace("\n", "")
            numlines += 1
        if numlines == len(lines):
            seq.append(seqplast.replace("\n", ""))
    seq = seq[1:]
    return index, seq

def recom(s):
    """输出原序列反向互补序列的函数"""
    re = s[::-1]
    c = ""
    for i in re:
        if i == 'A':
            c = c + 'T'
        elif i == 'G':
            c = c + 'C'
        elif i == 'T':
            c = c + 'A'
        elif i == 'C':
            c = c + 'G'

    return c

def hamm(s1, s2):
    """计算汉明距离的函数"""

    hs = 0
    for i in range(len(s1)):
        if s1[i] != s2[i]:
            hs += 1

    return hs

f = open('input.txt', 'r')
lines = f.readlines()
f.close()

[index, seq] = readfasta(lines)
recomseq = [] # 存储反向互补序列
correct = [] # 存储正确的序列
wrong = [] # 存储错误的序列
length = len(seq)
for i in range(length): # 依次取出每个测序片段
    flag = 0
    recomseq = recom(seq[i]) # 将其转为反向互补序列
    if seq[i] not in correct and recomseq not in correct: # 如果正反序列都不在已有的正确序列中
        for j in range(i+1,len(seq)):
            if hamm(seq[i], seq[j]) == 0: # 如果在还未扫描的读长中发现了相同的序列
                flag = 1 # flag标记是否有相同序列
            elif hamm(recomseq, seq[j]) == 0: # 如果在还未扫描的读长中发现了反向互补序列的相同序列
                flag = 1
        if flag == 1: # 如果有相同序列，这个读长是正确的
            correct.append(seq[i])
        else: # 反之，是错误序列
            wrong.append(seq[i])

f = open('output.txt', 'a')

for i in wrong:
    for j in correct: # 依次取出错误序列，与正确序列挨个比较
        recomseq = recom(j)
        if hamm(i, j) == 1: # 如果存在一个碱基替换
            f.write(i + '->' + j)
            f.write('\n')
            # print(i, end='->')
            # print(j)
        elif hamm(i, recomseq) == 1:
            f.write(i, + '->' + recomseq)
            f.write('\n')
            # print(i, end='->')
            # print(recomseq)

f.close()

标签：

【ROSALIND】【练Python，学生信】34 测序片段错误修正

【ROSALIND】【练Python，学生信】34 测序片段错误修正的评论 (共条)

你可能也喜欢这些文章

最新发布的文章

【ROSALIND】【练Python，学生信】34 测序片段错误修正

本文作者的其他文章

【ROSALIND】【练Python，学生信】34 测序片段错误修正的评论 (共 条)

你可能也喜欢这些文章

最新发布的文章

【ROSALIND】【练Python，学生信】34 测序片段错误修正的评论 (共条)