A fast way to do one-hot encoding with Python

xoefb8l8  posted on 2023-11-16 in Python

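In my project I need to one-hot encode millions of DNA sequences about 100 times each (billions of similar sequences in total), so an efficient method is very important to me.
Below is my code; it takes 4.5 s for 10K sequences.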

import numpy as np
import os,sys,time

def dna2onehot(dnaSeq):
    seqLen = len(dnaSeq)
    dnaSeq = dnaSeq.upper()

    # initialize a seqLen x 4 matrix of zeros
    seqMatrix = np.zeros((seqLen, 4))
    # set the column for each base (A, C, G, T); other characters stay all-zero
    for i in range(0, seqLen):
        if dnaSeq[i] == 'A':
            seqMatrix[i, 0] = 1
        if dnaSeq[i] == 'C':
            seqMatrix[i, 1] = 1
        if dnaSeq[i] == 'G':
            seqMatrix[i, 2] = 1
        if dnaSeq[i] == 'T':
            seqMatrix[i, 3] = 1
    # flatten to a 1-D vector of length seqLen * 4
    ret = np.array(seqMatrix.flat)
    return ret
#

sequence = "TCTGAGTCCCAATACACAAGAGGTTCCCTCACCTGTTCTGGTGTCAGACCCTCCCAGATGATCACCTCTCCTATGGCGGGGAAGGTGCCTGGATGTCTAAAGCCTGAAATGGGGATCTATCCCAGAAGCTGTGTAGCTTCTGCCTGTCCCAGAAGCTGTGTTGTTTCTGTATTCAGCTTGCTCACCCTCCGCAGTCCATTGATCTGCACAGACTGTTCTCAGATGGACTCGTGAGACAAGATGGCTCCTTCACCTGCTCTGGGGATCAGAACCCTCCCAGGTGGCCACCTCTCCTGTGGTGGGGAAGGTACCTGGAAGTCTTCAGCCCAAAACAGGGCCTGTCCCAGAAGCTGTGTCTCTTCTGCCTATCCCAGAAGCTGTATTGCTTCTGCTGTCCACTTGCTCACCCTCTGCAGTCTGCATGCTGATCTGCGCAGACTGTTCTCAGAGGGATCTGGCAGACAAGTTGGCTCCCTCACCTGCTCTGGGGCGGGGGGGGGGGGTTCAGAGCCCTCCTGGGCAGCCACCTCTCCTCTAGCAGAGAAGGTGCTGGGATGTCTTGAGCAGGAAACGGGGTATGTCCCAGAAGCTGTCTTGCTTCTGCAATCCACATGCTCAGCCTCTGCAGTCTGTGAGCTAATCTGGGCAGTCTGGTCTCAGGGGACTCTGGAGACAAGATGGCTCCCTCACCTGCTCTGGGGGTCAAAGCCCTCCTTGGCAGCCACCTTTTTCAGGCGGAGAAGGTGCCCGGATGTCTGGAGCCTGAAACAGGGGTATGTCCCAGACACTGTGTAGCTTCTGCCTGCCCCAGAAGATGTGTCACTTCCTCAGTCTGCTTGTTCACCCTCCACAGTCTGCAAGCTGATCTGCACAGACTGGTCTCAGAGGGACCTAGAAGACAAGATCAAGAAAAGTCTTATAGGTATAATGAATCAAGCAGAAAATGAAACATCAGAAGCTTAAGATAAAATACAGGATCTAGTCCAAATTAGCAAGAAGTA"

count = 10000
datalist = []
t1 = time.time()
for k in range(count):
    datalist.append(dna2onehot(sequence))
#
t2 = time.time()
print("time cost:",t2-t1)

Do you have any suggestions for reducing the run time while staying in Python (my whole project is Python-based)?

fhg3lkii  #1

You can use OneHotEncoder from scikit-learn:

import numpy as np
from sklearn.preprocessing import OneHotEncoder

# create the encoder object
encoder = OneHotEncoder()

sequence = 'TCTGAGTCCCAATACACAAGAGGTTCCCTCACCTGTTCTGGTGTCAGACCCTCCCAGATGATCACCTCTCCTATGGCG'
sequence += 'GGGAAGGTGCCTGGATGTCTAAAGCCTGAAATGGGGATCTATCCCAGAAGCTGTGTAGCTTCTGCCTGTCCCAGAAGC'
sequence += 'TGTGTTGTTTCTGTATTCAGCTTGCTCACCCTCCGCAGTCCATTGATCTGCACAGACTGTTCTCAGATGGACTCGTGA'
sequence += 'GACAAGATGGCTCCTTCACCTGCTCTGGGGATCAGAACCCTCCCAGGTGGCCACCTCTCCTGTGGTGGGGAAGGTACC'
sequence += 'TGGAAGTCTTCAGCCCAAAACAGGGCCTGTCCCAGAAGCTGTGTCTCTTCTGCCTATCCCAGAAGCTGTATTGCTTCT'
sequence += 'GCTGTCCACTTGCTCACCCTCTGCAGTCTGCATGCTGATCTGCGCAGACTGTTCTCAGAGGGATCTGGCAGACAAGTT'
sequence += 'GGCTCCCTCACCTGCTCTGGGGCGGGGGGGGGGGGTTCAGAGCCCTCCTGGGCAGCCACCTCTCCTCTAGCAGAGAAG'
sequence += 'GTGCTGGGATGTCTTGAGCAGGAAACGGGGTATGTCCCAGAAGCTGTCTTGCTTCTGCAATCCACATGCTCAGCCTCT'
sequence += 'GCAGTCTGTGAGCTAATCTGGGCAGTCTGGTCTCAGGGGACTCTGGAGACAAGATGGCTCCCTCACCTGCTCTGGGGG'
sequence += 'TCAAAGCCCTCCTTGGCAGCCACCTTTTTCAGGCGGAGAAGGTGCCCGGATGTCTGGAGCCTGAAACAGGGGTATGTC'
sequence += 'CCAGACACTGTGTAGCTTCTGCCTGCCCCAGAAGATGTGTCACTTCCTCAGTCTGCTTGTTCACCCTCCACAGTCTGC'
sequence += 'AAGCTGATCTGCACAGACTGGTCTCAGAGGGACCTAGAAGACAAGATCAAGAAAAGTCTTATAGGTATAATGAATCAA'
sequence += 'GCAGAAAATGAAACATCAGAAGCTTAAGATAAAATACAGGATCTAGTCCAAATTAGCAAGAAGTA'

# transform sequence to a Nx1 array, pass through fit/transform operation
seq_arr = np.array(list(sequence)).reshape(-1, 1)
seq_1hot = encoder.fit_transform(seq_arr).toarray()

seq_1hot
# returns:
array([[0., 0., 0., 1.],
       [0., 1., 0., 0.],
       [0., 0., 0., 1.],
       ...,
       [0., 0., 1., 0.],
       [0., 0., 0., 1.],
       [1., 0., 0., 0.]])

You can check which letter corresponds to which column by inspecting:

encoder.categories_
# returns:
[array(['A', 'C', 'G', 'T'], dtype='<U1')]


So in this case the columns are in alphabetical order.
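To illustrate where an alphabetical column order comes from, here is a minimal NumPy-only sketch. It is not taken from the answer and makes no claim about how scikit-learn works internally; it only reproduces the same column layout by sorting the distinct characters with `np.unique`. The short `sequence` is a made-up prefix used for illustration.

```python
import numpy as np

# Short illustrative sequence (hypothetical; any DNA string works)
sequence = "TCTGAGTCCCAATACACAAG"

# One character per array element
bases = np.array(list(sequence))

# np.unique returns the distinct characters in sorted order, plus the
# index of each input character within that sorted list
categories, codes = np.unique(bases, return_inverse=True)

# Selecting rows of an identity matrix turns each index into a one-hot row
seq_1hot = np.eye(len(categories))[codes]

print(categories)    # sorted distinct bases, one per column
print(seq_1hot[:3])  # one-hot rows for the first three bases
```

Note that the number of columns here depends on which characters actually occur: a sequence that lacks one of the four bases would yield fewer columns, so this sketch is only safe when every sequence contains A, C, G and T.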

zf9nrax1  #2

Here is a faster (and simpler) approach:

sequence = "TCTGAGTCCCAATACACAAGAGGTTCCCTCACCTGTTCTGGTGTCAGACCCTCCCAGATGATCACCTCTCCTATGGCGGGGAAGGTGCCTGGATGTCTAAAGCCTGAAATGGGGATCTATCCCAGAAGCTGTGTAGCTTCTGCCTGTCCCAGAAGCTGTGTTGTTTCTGTATTCAGCTTGCTCACCCTCCGCAGTCCATTGATCTGCACAGACTGTTCTCAGATGGACTCGTGAGACAAGATGGCTCCTTCACCTGCTCTGGGGATCAGAACCCTCCCAGGTGGCCACCTCTCCTGTGGTGGGGAAGGTACCTGGAAGTCTTCAGCCCAAAACAGGGCCTGTCCCAGAAGCTGTGTCTCTTCTGCCTATCCCAGAAGCTGTATTGCTTCTGCTGTCCACTTGCTCACCCTCTGCAGTCTGCATGCTGATCTGCGCAGACTGTTCTCAGAGGGATCTGGCAGACAAGTTGGCTCCCTCACCTGCTCTGGGGCGGGGGGGGGGGGTTCAGAGCCCTCCTGGGCAGCCACCTCTCCTCTAGCAGAGAAGGTGCTGGGATGTCTTGAGCAGGAAACGGGGTATGTCCCAGAAGCTGTCTTGCTTCTGCAATCCACATGCTCAGCCTCTGCAGTCTGTGAGCTAATCTGGGCAGTCTGGTCTCAGGGGACTCTGGAGACAAGATGGCTCCCTCACCTGCTCTGGGGGTCAAAGCCCTCCTTGGCAGCCACCTTTTTCAGGCGGAGAAGGTGCCCGGATGTCTGGAGCCTGAAACAGGGGTATGTCCCAGACACTGTGTAGCTTCTGCCTGCCCCAGAAGATGTGTCACTTCCTCAGTCTGCTTGTTCACCCTCCACAGTCTGCAAGCTGATCTGCACAGACTGGTCTCAGAGGGACCTAGAAGACAAGATCAAGAAAAGTCTTATAGGTATAATGAATCAAGCAGAAAATGAAACATCAGAAGCTTAAGATAAAATACAGGATCTAGTCCAAATTAGCAAGAAGTA"

sequence_array = [sequence] * 10000

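# note: the columns here are ordered A, G, C, T
# (the question's code uses A, C, G, T)
ntdict = {'A': [1, 0, 0, 0],
          'G': [0, 1, 0, 0],
          'C': [0, 0, 1, 0],
          'T': [0, 0, 0, 1]}

# one nested list per sequence: len(seq) rows of 4 values each
onehot = [[ntdict[s] for s in seq] for seq in sequence_array]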
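The same lookup idea can also be pushed into NumPy so that no Python-level loop runs over the characters. The sketch below is not from either answer: the names `LOOKUP` and `dna2onehot_vectorized` are hypothetical, and it assumes ASCII-only input. It keeps the question's column order (A, C, G, T) and returns the same flat vector as the question's `dna2onehot`; whether it is faster on your data is something to measure rather than assume.

```python
import numpy as np

# Lookup table indexed by ASCII code. Rows default to all zeros, so any
# character other than A/C/G/T encodes as [0, 0, 0, 0], matching the
# behaviour of the loop in the question.
LOOKUP = np.zeros((256, 4), dtype=np.float64)
for column, base in enumerate("ACGT"):
    LOOKUP[ord(base), column] = 1.0


def dna2onehot_vectorized(dna_seq):
    """Return a flat one-hot vector of length len(dna_seq) * 4."""
    # View the upper-cased sequence as an array of byte values
    # (assumption: the input is pure ASCII; other input raises on encode)
    codes = np.frombuffer(dna_seq.upper().encode("ascii"), dtype=np.uint8)
    # Fancy indexing picks one table row per base; ravel flattens to 1-D
    return LOOKUP[codes].ravel()


example = dna2onehot_vectorized("ACGTN")
print(example.reshape(-1, 4))
```

If all sequences have the same length, the per-sequence results can be stacked with `np.stack` into a single 2-D array for downstream use.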
