How to combine lines from two files with a condition in Python?

Posted by qyuhtwio on 2022-12-21 in Python


I need to combine lines from two files based on a condition: a line in one file is part of a line in the second file.
Part of the first file:

12319000    -64,7357668067227   -0,1111052148685535  
12319000    -79,68527661064425  -0,13231739777754026  
12319000    -94,69642857142858  -0,15117839559513543    
12319000    -109,59301470588237 -0,18277783185642743  
12319001    99,70264355742297   0,48329515727315125  
12319001    84,61113445378152   0,4060446341409862  
12319001    69,7032037815126    0,29803063228455073  
12319001    54,93886554621849   0,20958105041136763  
12319001    39,937394957983194  0,13623056582981297  
12319001    25,05574229691877   0,07748669438398018  
12319001    9,99716386554622    0,028110643107892755

Part of the second file:

12319000.abf    mutant  1  
12319001.abf    mutant  2  
12319002.abf    mutant  3

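I need to create a file whose lines consist of the whole line from the first file plus everything from the matching line of the second file, except the filename in the first column.
As you can see, more than one line in the first file corresponds to a single line in the second file. I need to do this for every line, so the output should look like this: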

12319000    -94,69642857142858  -0,15117839559513543  mutant    1  
12319000    -109,59301470588237 -0,18277783185642743  mutant    1  
12319001    99,70264355742297   0,48329515727315125  mutant 2  
12319001    84,61113445378152   0,4060446341409862  mutant  2

I wrote this code:

oocytes = open(file_with_oocytes, 'r')  
results = open(os.path.join(path, 'results.csv'), 'r')  
results_new = open(os.path.join(path, 'results_with_oocytes.csv'), 'w')  
for line in results:  
    for lines in oocytes:  
        if lines[0:7] in line:  
            print line + lines[12:]

But it prints only this and nothing else, even though the first file has 45 lines:

12319000    99,4952380952381    0,3011778623990699
    mutant  1  

12319000    99,4952380952381    0,3011778623990699
    mutant  2  

12319000    99,4952380952381    0,3011778623990699
    mutant  3

Is something wrong with the code, or should it be done in a completely different way?

python

Source: https://stackoverflow.com/questions/9950600/how-to-combine-lines-in-two-files-with-condition-in-python

4 Answers


u0njafvf1#

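File handles in Python have state; that is, they do not behave like lists. You can iterate over a list repeatedly and get all of its values each time. A file, on the other hand, has a position from which the next read() will happen. When you iterate over a file, you read() each line. Once you reach the last line, the file pointer is at the end of the file. A read() from the end of the file returns the empty string ''!
What you need to do is read in the oocytes file once at the start and store the values, perhaps like this: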

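# Read the oocytes file once and index it by the 8-character identifier
oodict = {}
for line in oocytes:
    oodict[line[0:8]] = line[12:]

# Append the matching oocyte fields to each results line
for line in results:
    results_key = line[0:8]
    if results_key in oodict:
        print(line.rstrip() + oodict[results_key].rstrip())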
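
The stateful behaviour described above can be seen in a small sketch. Here io.StringIO stands in for a real file object (an assumption made so the example is self-contained); a file returned by open() should behave the same way when iterated.

```python
import io

# io.StringIO stands in for a file opened with open(); both are
# iterators that remember their current position.
oocytes = io.StringIO("12319000.abf    mutant  1\n12319001.abf    mutant  2\n")

first_pass = [line for line in oocytes]   # consumes the whole "file"
second_pass = [line for line in oocytes]  # nothing is left to read

print(len(first_pass))       # 2
print(len(second_pass))      # 0
print(repr(oocytes.read()))  # ''

# Rewinding is one way to iterate again; reading into a list or dict
# once, as in the answer above, is another.
oocytes.seek(0)
third_pass = list(oocytes)
print(len(third_pass))       # 2
```

This is why the nested loop in the question produces output only for the first line of results: the inner file is already exhausted on the second pass of the outer loop.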


3npbholx2#

Note that this solution does not rely on the length of any field, except for the length of the file extension in the second file.

# make a dict keyed on the filename before the extension,
# with the other two fields as its value
file2dict = dict((row[0][:-4], row[1:])
                     for row in (line.split() for line in file2))

# then append to each row the values
# that match its first column
output = [row + file2dict[row[0]] for row in (line.split() for line in file1)]

For testing purposes only, I used:

# I just use this to emulate a file object, as iterating over it yields lines
# just use file1 = open(whatever_the_filename_is_for_this_data)
# and the rest of the program is the same
file1 = """12319000    -64,7357668067227   -0,1111052148685535
12319000    -79,68527661064425  -0,13231739777754026
12319000    -94,69642857142858  -0,15117839559513543
12319000    -109,59301470588237 -0,18277783185642743
12319001    99,70264355742297   0,48329515727315125
12319001    84,61113445378152   0,4060446341409862
12319001    69,7032037815126    0,29803063228455073
12319001    54,93886554621849   0,20958105041136763
12319001    39,937394957983194  0,13623056582981297
12319001    25,05574229691877   0,07748669438398018
12319001    9,99716386554622    0,028110643107892755""".splitlines()

# again, use file2 = open(whatever_the_filename_is_for_this_data)
# and the rest of the program will work the same
file2 = """12319000.abf    mutant  1
12319001.abf    mutant  2
12319002.abf    mutant  3""".splitlines()

Here you should just use normal file objects. The output for the test data is:

[['12319000', '-64,7357668067227', '-0,1111052148685535', 'mutant', '1'],
    ['12319000', '-79,68527661064425', '-0,13231739777754026', 'mutant', '1'],
    ['12319000', '-94,69642857142858', '-0,15117839559513543', 'mutant', '1'],
    ['12319000', '-109,59301470588237', '-0,18277783185642743', 'mutant', '1'],
    ['12319001', '99,70264355742297', '0,48329515727315125', 'mutant', '2'],
    ['12319001', '84,61113445378152', '0,4060446341409862', 'mutant', '2'],
    ['12319001', '69,7032037815126', '0,29803063228455073', 'mutant', '2'],
    ['12319001', '54,93886554621849', '0,20958105041136763', 'mutant', '2'],
    ['12319001', '39,937394957983194', '0,13623056582981297', 'mutant', '2'],
    ['12319001', '25,05574229691877', '0,07748669438398018', 'mutant', '2'],
    ['12319001', '9,99716386554622', '0,028110643107892755', 'mutant', '2']]
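
As a possible variation, the dependence on the extension length could be dropped by using os.path.splitext. The sketch below is not the answer's own code: the helper name merge_rows, the tab-joined output, and the choice to skip rows with no match in the second file are all assumptions made for illustration.

```python
import os

def merge_rows(file1_lines, file2_lines):
    """Append the file2 fields to every file1 row with a matching identifier."""
    # Key on the file name without its extension, whatever its length.
    file2dict = {}
    for line in file2_lines:
        fields = line.split()
        if fields:  # skip blank lines
            key = os.path.splitext(fields[0])[0]
            file2dict[key] = fields[1:]

    merged = []
    for line in file1_lines:
        fields = line.split()
        # Rows without a partner in file2 are skipped (an assumption).
        if fields and fields[0] in file2dict:
            merged.append("\t".join(fields + file2dict[fields[0]]))
    return merged

file1 = ["12319000    -64,7357668067227   -0,1111052148685535",
         "12319001    99,70264355742297   0,48329515727315125",
         "12319009    1,0   2,0"]
file2 = ["12319000.abf    mutant  1", "12319001.abf    mutant  2"]
for row in merge_rows(file1, file2):
    print(row)
```

Using a membership check instead of indexing the dict directly avoids a KeyError when the first file contains an identifier that the second file lacks, such as the made-up 12319009 row here.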


nqwrtyyt3#

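Well, simple things first: you are printing the newline at the end of each line; you probably want to strip it with line[0:-1].
Next, "lines[0:7]" only tests the first 7 characters of the line. You want to test 8 characters, which is why the same "line" value was printed with 3 different mutant values.
Finally, you would need to close and reopen oocytes for every line in results. If you don't, your output ends after the first line of results.
Actually, the other answer is better: don't open and close oocytes for every line of results; open it and read it in (into a list) once, then iterate over that list for every line of results.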
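
The 7-versus-8 character point can be checked directly with the sample values from the question. This is only an illustrative sketch; the variable names are made up for the example.

```python
result_line = "12319000    99,4952380952381    0,3011778623990699"
oocyte_lines = ["12319000.abf    mutant  1",
                "12319001.abf    mutant  2",
                "12319002.abf    mutant  3"]

# Seven characters ("1231900") are shared by all three identifiers.
matches_7 = [l for l in oocyte_lines if l[0:7] in result_line]
# Eight characters cover the whole identifier.
matches_8 = [l for l in oocyte_lines if l[0:8] in result_line]

print(len(matches_7))  # 3
print(len(matches_8))  # 1
```

With the 7-character slice every oocyte line matches the same results line, which reproduces the three "mutant" outputs shown in the question.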


72qzrwbm4#

Your code should work with a few tweaks:

import os

# Read both files fully so the inner list can be iterated repeatedly
oocytes = open(file_with_oocytes, 'r').readlines()
results = open(os.path.join(path, 'results.csv'), 'r').readlines()
results_new = open(os.path.join(path, 'results_with_oocytes.csv'), 'w')
for line in results:
    for lines in oocytes:
        # Compare the full 8-character identifier
        if lines[0:8] in line:
            results_new.write(line.strip() + lines[12:])
results_new.close()

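Note that readlines() was added to get lists that can be iterated repeatedly. The other important fix is the 0:8 range, because you need the whole identifier.
I know this answer comes 10 years late, but I think it is a good exercise for solving a fairly common task.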
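
The same approach could also be written with context managers so that every file is closed automatically. This is a sketch under assumptions, not the answer's code: the function name merge_files, the oocytes.txt file name, and the temporary-file self-check are all hypothetical.

```python
import os
import tempfile

def merge_files(results_path, oocytes_path, output_path):
    """Write each results line followed by the fields of its oocyte row."""
    with open(oocytes_path) as oocytes:
        # Read once into a list so it can be scanned for every results line.
        oocyte_lines = [l for l in oocytes if l.strip()]
    with open(results_path) as results, open(output_path, "w") as out:
        for line in results:
            for oocyte in oocyte_lines:
                # Compare the full 8-character identifier at the line start.
                if line.startswith(oocyte[0:8]):
                    out.write(line.rstrip() + oocyte[12:].rstrip() + "\n")

# Small self-check with temporary files (sample values from the question).
tmp = tempfile.mkdtemp()
results_path = os.path.join(tmp, "results.csv")
oocytes_path = os.path.join(tmp, "oocytes.txt")
output_path = os.path.join(tmp, "results_with_oocytes.csv")
with open(results_path, "w") as f:
    f.write("12319000    -64,7357668067227   -0,1111052148685535\n"
            "12319001    99,70264355742297   0,48329515727315125\n")
with open(oocytes_path, "w") as f:
    f.write("12319000.abf    mutant  1\n12319001.abf    mutant  2\n")
merge_files(results_path, oocytes_path, output_path)
with open(output_path) as f:
    merged_text = f.read()
print(merged_text)
```

Matching with startswith rather than a substring test means an identifier can only match in the first column, not inside one of the numeric fields.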

