如何使csv比较结果适用于3个单独的列而不是1个列

ivqmmu1c  于 2023-03-10  发布在  其他
关注(0)|答案(2)|浏览(95)

我写的程序取2个csv,每个来自我们的基因结果报告器,并将它们合并到“样品Id”列。
| 试验ID_x|基因符号|NCBI SNP参考|样品编号|调用_x|试验ID_y|基因符号|调用_y|唯一ID|
| - ------|- ------|- ------|- ------|- ------|- ------|- ------|- ------|- ------|
| C_###|细胞色素P2 D 6|编号|编号12878|C/A|C_##|细胞色素P2 D 6|C/A|“一致”|
| C_###|细胞色素P2 D 6|编号|编号12878_1|C/C|C_##|细胞色素P2 D 6|C/C|“一致”|
比较报告者的呼叫以获得类似的结果。本质上,如果呼叫x!=呼叫y,则标记为不一致。
问题是:有时2行是同一个样品,用下划线隔开名称(例如NA1234NA1234_1)。
考虑到代码只检查一行上的调用,我无法找到比较同一样本的其他行的方法。
请注意,对于第一行,NA12878具有C/AC/A的匹配调用。下一个样本NA12878_1(同一个样本)具有C/CC/C的匹配调用。
这些调用实际上不一致。C/AC/C不匹配,必须标记以供审查。
我试过几种不同的策略,但都没有奏效。

import pandas as pd
import tkinter as tk
from tkinter import filedialog
import tkinter.messagebox
import jinja2
import openpyxl

# The purpose of this program is to compare calls from duplicates, and the calls are aligned from 2 reporters
# There will be 2 Genotyper output passes
# We will flag the calls/comparisons if !=

# Defining the upload prompt
# File Upload window.
def ask_path():
    root = tk.Tk()
    root.withdraw()
    path = filedialog.askopenfile(defaultextension=".csv",
                                  filetypes=[("csv files", '*.csv'),
                                             ('all files', '*.*')])  # shows dialog box and return the path
    return path

# Pop-up window branding
def onClick(msg):
    tkinter.messagebox.showinfo("Precision Genetics", msg)

# Prompt & Load First Pass
genotyper1_col = ['Gene Symbol', 'Sample ID', 'NCBI SNP Reference', 'Assay ID', 'Call']
onClick('Upload First Pass')
geno1 = pd.read_csv(ask_path(), skiprows=17, usecols=genotyper1_col)

# Prompt & Load the Second Pass
genotyper2_col = ['Gene Symbol', 'Assay ID', 'Sample ID', 'NCBI SNP Reference', 'Call']
onClick('Upload Second Pass')
geno2 = pd.read_csv(ask_path(), skiprows=17, usecols=genotyper2_col)

# Merge on specified column
merged_data = pd.merge(geno1, geno2, on=['NCBI SNP Reference', 'Sample ID'])

# Filter out rows that don't have an NCBI SNP Reference in both passes
merged_data = merged_data[~merged_data['NCBI SNP Reference'].isna()]

# Here is where we will attempt to flag the Call Columns if they are not equal to one another
merged_data.loc[merged_data['Call_x'] != merged_data['Call_y'], 'Flag'] = 'Not in Agreement'
merged_data.loc[merged_data['Call_x'] == merged_data['Call_y'], 'Flag'] = '-'

# Highlight Cell if it isnt in agreement
def highlight_cells(value):
    if value == merged_data['Not in Agreement']:
        color = 'red'
    else:
        color = ''
    return 'background-color: {}'.format(color)

merged_data.style.applymap(highlight_cells)

# Save the data
onClick('Please Choose Where You Would Like To Store This Data')

def save_loc(dataframe):
    root = tk.Tk()
    root.withdraw()
    file_path = filedialog.asksaveasfilename(defaultextension=".csv",
                                             filetypes=[("CSV Files", "*.csv"),
                                                        ("All Files", "*.*")])
    if file_path:
        dataframe.to_csv(file_path, index=False)
    root.destroy()

save_loc(merged_data)
  • 新的更新 * 这是新编辑的代码,在惊人的帮助后,你们都给了我。无论什么原因,我仍然得到“不一致”的一切在标记栏。此外,我认为这将是一个好主意,以更新到底是什么样的合并csv看起来像它的标题,以及为什么我改变了位置值,他们是什么。再次感谢一切。
def get_root_id(s):
    return s.split("_", 1)[0]

sample_ids = defaultdict(list)
headers = []

with open("merged_data.csv", newline="") as f:
    reader = csv.reader(f)
    header = next(reader)
    for row in reader:
        root_id = get_root_id(row[3])
        sample_ids[root_id].append(row)

for rows_by_id in sample_ids.values():
    call_vals = set()
    for row in rows_by_id:
        call_vals.update(row[4:7])

    if len(call_vals) > 1:
        for row in rows_by_id:
            row[8] = 'Not In Agreement'

output_file = 'output.csv'
with open("output.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(header)
    for rows_by_id in sample_ids.values():
        for row in rows_by_id:
            writer.writerow(row)
output_data = pd.read_csv(output_file)
print(output_data)
6qqygrtg

6qqygrtg1#

我将您的问题解释为:“如何比较具有相似ID的多个行?”如果这代表您的问题,请继续阅读。
我从原始图像中重建了数据(包括实际的CSV,这样我们可以更容易地处理数据)。我删除了冗余列,并添加了一个ID列来显示原始行顺序。我还剪切了一些行。我认为这准确地突出了数据的特征:

ID,Sample_ID,Call_X,Call_Y,Flag
1,NA12878,C/A,C/A,
2,NA12878_1,C/C,C/C,
3,NTC,,,Not In Agreement
4,PGX000133,C/C,C/C,
5,PGX000133_1,NOAMP,NOAMP,
6,PGX000135,C/C,C/C,
7,PGX000135_D,C/C,C/C,
8,PGX000135_1,C/C,C/C,

我设想阅读CSV并创建一个结构,其中各行按其样本ID的 * 根 * 分组在一起,类似于:

{
    "NA12878": [
        ["1", "NA12878", "C/A", "C/A", ""],
        ["2", "NA12878_1", "C/C", "C/C", ""],
    ],
    "NTC": [
        ["3", "NTC", "", "", "Not In Agreement"],
    ],
    "PGX000133": [
        ["4", "PGX000133", "C/C", "C/C", ""],
        ["5", "PGX000133_1", "NOAMP", "NOAMP", ""],
    ],
    "PGX000135": [
        ["6", "PGX000135", "C/C", "C/C", ""],
        ["7", "PGX000135_D", "C/C", "C/C", ""],
        ["8", "PGX000135_1", "C/C", "C/C", ""],
    ],
}

使用该结构,我可以循环遍历多组行,并比较调用x/y值。

  • 我相信如果只有一个值(例如,C/C),那么行的集合就是好的。
  • 如果存在多个值,则该组行“不一致”。

我使用Python set,并使用每组行的所有调用值更新它:
一个二个一个一个
如果我发现一个长度大于1的集合,我会循环返回该组中的所有行,并设置flag字段:

...
    if len(call_vals) > 1:
        for row in rows_by_id:
            row[4] = "Not In Agreement"

下面是我的完整程序,其中我使用defaultdict(list)轻松地将行附加到样本ID,并创建我所设想的初始结构(前面已经展示过)。

import csv
from collections import defaultdict

def get_root_id(s):
    return s.split("_", 1)[0]  # NA12878_1 → NA12878; PGX000133 → PGX000133

sample_ids = defaultdict(list)
header = []

# My header will look like (w/row indices):
#   ID,Sample_ID,Call_X,Call_Y,Flag
#   0  1         2      3      4

with open("input.csv", newline="") as f:
    reader = csv.reader(f)
    header = next(reader)
    for row in reader:
        root_id = get_root_id(row[1])
        sample_ids[root_id].append(row)

for rows_by_id in sample_ids.values():
    call_vals = set()
    for row in rows_by_id:
        call_vals.update(row[2:4])

    if len(call_vals) > 1:
        for row in rows_by_id:
            row[4] = "Not In Agreement"

with open("output.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(header)
    for rows_by_id in sample_ids.values():
        for row in rows_by_id:
            writer.writerow(row)

这种结构的一个特点是,如果输入CSV是按Sample_ID排序的,那么输出CSV也将具有相同的排序......(不知道这是否重要,但这就是它的工作方式。)
我的输出如下所示:
| 识别号|样本ID|调用_X|调用_Y|旗帜|
| - ------|- ------|- ------|- ------|- ------|
| 1个|编号12878|C/A|C/A|不一致|
| 第二章|编号12878_1|C/C|C/C|不一致|
| 三个|负温度系数|||不一致|
| 四个|前列腺素X 000133|C/C|C/C|不一致|
| 五个|PGX000133_1产品|NOAMP|NOAMP|不一致|
| 六个|前列腺素X 000135|C/C|C/C||
| 七|PGX000135_D产品|C/C|C/C||
| 八个|PGX000135_1产品|C/C|C/C||

2skhul33

2skhul332#

你可以用to_numpy()flattenCall_xCall_y的所有值放到一个数组中,然后用all检查这个数组中的所有值是否相等。

def compare_unique_ids(df):
    for unique_id in df['Unique ID'].unique():
        values = df[df['Unique ID'] == unique_id].loc[:, ["Call_x", "Call_y"]].to_numpy().flatten()
        if not (values[0] == values).all():
            df.loc[df['Unique ID'] == unique_id, "Flag"] = "Not in Agreement"
        else:
            df.loc[df['Unique ID'] == unique_id, "Flag"] = "-"

相关问题