如何使我的CSV比较结果适用于三个单独的列,而不是一个列?

huangapple go评论71阅读模式
英文:

How can I get my csv comparison results to work for 3 separate columns instead of one

问题

程序我已经编写了,它接受两个CSV文件,分别来自我们的基因结果报告员,并在“Sample Id”列上合并它们。

来自报告员的呼叫将用于比较相似的结果。基本上,如果呼叫x不等于呼叫y,它将被标记为不一致。

问题在于,有时两行属于同一样本,由下划线分隔(例如NA1234NA1234_1)。

考虑到代码只检查一行上的呼叫,我无法找到一种比较同一样本的其他行的方法。

请注意,对于第一行,NA12878具有匹配的呼叫C/AC/A。下一个样本NA12878_1(它是相同的样本)具有匹配的呼叫C/CC/C

这些呼叫实际上并不一致。C/AC/C不是匹配的呼叫,必须标记为审查。

我已尝试了一些不同的策略,但没有使其按需要工作。

请注意,以下是您提供的Python代码,用于处理这些CSV文件的任务:

import pandas as pd
import tkinter as tk
from tkinter import filedialog
import tkinter.messagebox

# ...(其他代码)

# Here is where we will attempt to flag the Call Columns if they are not equal to one another
merged_data.loc[merged_data['Call_x'] != merged_data['Call_y'], 'Flag'] = 'Not in Agreement'
merged_data.loc[merged_data['Call_x'] == merged_data['Call_y'], 'Flag'] = '-'

# ...(其他代码)

# Save the data
save_loc(merged_data)

另外,您还提供了更新后的代码,但遇到“Not In Agreement”标记的问题。您还列出了更新后的CSV文件的示例以及更改位置值的原因。

如果您需要进一步的帮助或有其他问题,请随时提问。

英文:

The program I have written takes 2 csv's, one from each of our gene results reporters, and merges them on the column "Sample Id".

Assay ID_x Gene Symbol NCBI SNP Reference Sample ID Call_x Assay ID_y Gene Symbol Call_y Unique ID
C_### CYP2D6 rs## NA12878 C/A C_## CYP2D6 C/A 'In Agreement'
C_### CYP2D6 rs## NA12878_1 C/C C_## CYP2D6 C/C 'In Agreement'

The calls from the reporters are compared for similar results. Essentially if call x != call y, it is flagged for being not in agreement.

The problem: sometimes 2 rows are of the same sample, separated by name in the presence of an underscore (ex. NA1234 and NA1234_1).

Considering the code only checks the calls on one line, I can't find a way to compare the other rows of the same sample.

Note that for the first row NA12878 has matching calls of C/A and C/A. The next sample NA12878_1 (which is the same sample) has an matching call for C/C and C/C.

These calls are actually not in agreement. C/A and C/C are not matching calls and must be flagged for review.

I've tried a few different tactics, but none have made it work as needed.

import pandas as pd
import tkinter as tk
from tkinter import filedialog
import tkinter.messagebox
import jinja2
import openpyxl


# The purpose of this program is to compare calls from duplicates, and the calls are aligned from 2 reporters
# There will be 2 Genotyper output passes
# We will flag the calls/comparisons if !=

# Defining the upload prompt
# File Upload window.
def ask_path():
    root = tk.Tk()
    root.withdraw()
    path = filedialog.askopenfile(defaultextension=".csv",
                                  filetypes=[("csv files", '*.csv'),
                                             ('all files', '*.*')])  # shows dialog box and return the path
    return path


# Pop-up window branding
def onClick(msg):
    tkinter.messagebox.showinfo("Precision Genetics", msg)


# Prompt & Load First Pass
genotyper1_col = ['Gene Symbol', 'Sample ID', 'NCBI SNP Reference', 'Assay ID', 'Call']
onClick('Upload First Pass')
geno1 = pd.read_csv(ask_path(), skiprows=17, usecols=genotyper1_col)

# Prompt & Load the Second Pass
genotyper2_col = ['Gene Symbol', 'Assay ID', 'Sample ID', 'NCBI SNP Reference', 'Call']
onClick('Upload Second Pass')
geno2 = pd.read_csv(ask_path(), skiprows=17, usecols=genotyper2_col)

# Merge on specified column
merged_data = pd.merge(geno1, geno2, on=['NCBI SNP Reference', 'Sample ID'])

# Filter out rows that don't have an NCBI SNP Reference in both passes
merged_data = merged_data[~merged_data['NCBI SNP Reference'].isna()]

# Here is where we will attempt to flag the Call Columns if they are not equal to one another
merged_data.loc[merged_data['Call_x'] != merged_data['Call_y'], 'Flag'] = 'Not in Agreement'
merged_data.loc[merged_data['Call_x'] == merged_data['Call_y'], 'Flag'] = '-'


# Highlight Cell if it isnt in agreement
def highlight_cells(value):
    if value == merged_data['Not in Agreement']:
        color = 'red'
    else:
        color = ''
    return 'background-color: {}'.format(color)


merged_data.style.applymap(highlight_cells)

# Save the data
onClick('Please Choose Where You Would Like To Store This Data')


def save_loc(dataframe):
    root = tk.Tk()
    root.withdraw()
    file_path = filedialog.asksaveasfilename(defaultextension=".csv",
                                             filetypes=[("CSV Files", "*.csv"),
                                                        ("All Files", "*.*")])
    if file_path:
        dataframe.to_csv(file_path, index=False)
    root.destroy()


save_loc(merged_data)

NEW UPDATE
Here is the new edited code, after the amazing help you all have given me. For whatever reason, I'm still getting "Not In Agreement" for everything in the Flagging column. Also, I thought it would be a good idea to update exactly what the merged csv looks like with its headers, and why I changed to location values to what they are. Thanks again for everything all.

def get_root_id(s):
    return s.split("_", 1)[0]


sample_ids = defaultdict(list)
headers = []


with open("merged_data.csv", newline="") as f:
    reader = csv.reader(f)
    header = next(reader)
    for row in reader:
        root_id = get_root_id(row[3])
        sample_ids[root_id].append(row)

for rows_by_id in sample_ids.values():
    call_vals = set()
    for row in rows_by_id:
        call_vals.update(row[4:7])

    if len(call_vals) > 1:
        for row in rows_by_id:
            row[8] = 'Not In Agreement'

output_file = 'output.csv'
with open("output.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(header)
    for rows_by_id in sample_ids.values():
        for row in rows_by_id:
            writer.writerow(row)
output_data = pd.read_csv(output_file)
print(output_data)



</details>


# 答案1
**得分**: 1

我理解你的问题是:“如何比较具有相似ID的多行数据?”如果这代表你的问题请继续阅读

我从原始图像重建了你的数据包括实际的CSV使我们更容易处理数据)。我去掉了冗余的列并添加了一个ID列以显示原始行的顺序我还删除了一些行我认为这准确地突出了你数据的特点

```none
ID,Sample_ID,Call_X,Call_Y,Flag
1,NA12878,C/A,C/A,
2,NA12878_1,C/C,C/C,
3,NTC,,,Not In Agreement
4,PGX000133,C/C,C/C,
5,PGX000133_1,NOAMP,NOAMP,
6,PGX000135,C/C,C/C,
7,PGX000135_D,C/C,C/C,
8,PGX000135_1,C/C,C/C,

我想象着读取CSV并创建一个结构,其中行根据其样本ID的“根”分组在一起,类似于:

{
    "NA12878": [
        ["1", "NA12878", "C/A", "C/A", ""],
        ["2", "NA12878_1", "C/C", "C/C", ""],
    ],
    "NTC": [
        ["3", "NTC", "", "", "Not In Agreement"],
    ],
    "PGX000133": [
        ["4", "PGX000133", "C/C", "C/C", ""],
        ["5", "PGX000133_1", "NOAMP", "NOAMP", ""],
    ],
    "PGX000135": [
        ["6", "PGX000135", "C/C", "C/C", ""],
        ["7", "PGX000135_D", "C/C", "C/C", ""],
        ["8", "PGX000135_1", "C/C", "C/C", ""],
    ],
}

有了这个结构,我可以循环遍历行组,并比较调用x/y的值。对于任何ID:

  • 我认为如果只有一个值(例如C/C),那么行集合是好的。
  • 如果不止一个值,则行集合“不一致”。

我使用Python的set,只需更新所有组的行的调用值:

for sample_id, rows_by_id in sample_ids.items():
    call_vals = set()
    for row in rows_by_id:
        call_vals.update(row[2:4])
    print(f"{sample_id}: {call_vals}")
NA12878: {'C/C', 'C/A'}
NTC: {''}
PGX000133: {'C/C', 'NOAMP'}
PGX000135: {'C/C'}

如果我发现集合的长度大于1,我会再次循环遍历组中的所有行,并设置标志字段:

    ...
    if len(call_vals) > 1:
        for row in rows_by_id:
            row[4] = "Not In Agreement"

这是我的完整程序,我使用defaultdict(list)来轻松将行附加到样本ID,并创建我想象的初始结构(如上所示)。

import csv
from collections import defaultdict


def get_root_id(s):
    return s.split("_", 1)[0]  # NA12878_1 → NA12878; PGX000133 → PGX000133


sample_ids = defaultdict(list)
header = []

# My header will look like (w/row indices):
#   ID,Sample_ID,Call_X,Call_Y,Flag
#   0  1         2      3      4

with open("input.csv", newline="") as f:
    reader = csv.reader(f)
    header = next(reader)
    for row in reader:
        root_id = get_root_id(row[1])
        sample_ids[root_id].append(row)


for rows_by_id in sample_ids.values():
    call_vals = set()
    for row in rows_by_id:
        call_vals.update(row[2:4])

    if len(call_vals) > 1:
        for row in rows_by_id:
            row[4] = "Not In Agreement";


with open("output.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(header)
    for rows_by_id in sample_ids.values():
        for row in rows_by_id:
            writer.writerow(row)

这个结构的一个特点是,如果你的输入CSV按Sample_ID排序,输出CSV也将具有相同的排序...(不知道这是否重要,但这是它的工作方式。)

我的输出如下:

ID Sample_ID Call_X Call_Y Flag
1 NA12878 C/A C/A Not In Agreement
2 NA12878_1 C/C C/C Not In Agreement
3 NTC Not In Agreement
4 PGX000133 C/C C/C Not In Agreement
5 PGX000133_1 NOAMP NOAMP Not In Agreement
6 PGX000135 C/C C/C
7 PGX000135_D C/C C/C
8 PGX000135_1 C/C C/C
英文:

I interpret your problem to be, "How do I compare multiple rows with a similar ID?" If that represents your problem, read on.

I reconstructed your data from the original image (including actual CSV makes it easier for us to play with the data). I removed the redundant columns, and added an ID column to show the original row order. I also cut out some rows. I think this accurately highlights the features of your data:

ID,Sample_ID,Call_X,Call_Y,Flag
1,NA12878,C/A,C/A,
2,NA12878_1,C/C,C/C,
3,NTC,,,Not In Agreement
4,PGX000133,C/C,C/C,
5,PGX000133_1,NOAMP,NOAMP,
6,PGX000135,C/C,C/C,
7,PGX000135_D,C/C,C/C,
8,PGX000135_1,C/C,C/C,

I picture reading the CSV and creating a structure where rows are grouped together by the root of their sample ID, something like:

{
    &quot;NA12878&quot;: [
        [&quot;1&quot;, &quot;NA12878&quot;, &quot;C/A&quot;, &quot;C/A&quot;, &quot;&quot;],
        [&quot;2&quot;, &quot;NA12878_1&quot;, &quot;C/C&quot;, &quot;C/C&quot;, &quot;&quot;],
    ],
    &quot;NTC&quot;: [
        [&quot;3&quot;, &quot;NTC&quot;, &quot;&quot;, &quot;&quot;, &quot;Not In Agreement&quot;],
    ],
    &quot;PGX000133&quot;: [
        [&quot;4&quot;, &quot;PGX000133&quot;, &quot;C/C&quot;, &quot;C/C&quot;, &quot;&quot;],
        [&quot;5&quot;, &quot;PGX000133_1&quot;, &quot;NOAMP&quot;, &quot;NOAMP&quot;, &quot;&quot;],
    ],
    &quot;PGX000135&quot;: [
        [&quot;6&quot;, &quot;PGX000135&quot;, &quot;C/C&quot;, &quot;C/C&quot;, &quot;&quot;],
        [&quot;7&quot;, &quot;PGX000135_D&quot;, &quot;C/C&quot;, &quot;C/C&quot;, &quot;&quot;],
        [&quot;8&quot;, &quot;PGX000135_1&quot;, &quot;C/C&quot;, &quot;C/C&quot;, &quot;&quot;],
    ],
}

With that structure I can loop over groups of rows, and compare the call x/y values. For any ID:

  • I believe if there is a single value (e.g., C/C) then the set of rows is good.
  • If there is more than a single value then the set of rows is "not in agreement".

I use a Python set and just update it with all call values for every group of rows:

for sample_id, rows_by_id in sample_ids.items():
    call_vals = set()
    for row in rows_by_id:
        call_vals.update(row[2:4])
    print(f&quot;{sample_id}: {call_vals}&quot;)
NA12878: {&#39;C/C&#39;, &#39;C/A&#39;}
NTC: {&#39;&#39;}
PGX000133: {&#39;C/C&#39;, &#39;NOAMP&#39;}
PGX000135: {&#39;C/C&#39;}

If I find a set with a length greater than 1, I loop back over all rows in the group and set the flag field:

    ...
    if len(call_vals) &gt; 1:
        for row in rows_by_id:
            row[4] = &quot;Not In Agreement&quot;

Here's my complete program, where I use a defaultdict(list) to easily append rows to a sample ID and create that initial structure I imagined (and showed earlier).

import csv
from collections import defaultdict


def get_root_id(s):
    return s.split(&quot;_&quot;, 1)[0]  # NA12878_1 → NA12878; PGX000133 → PGX000133


sample_ids = defaultdict(list)
header = []

# My header will look like (w/row indices):
#   ID,Sample_ID,Call_X,Call_Y,Flag
#   0  1         2      3      4

with open(&quot;input.csv&quot;, newline=&quot;&quot;) as f:
    reader = csv.reader(f)
    header = next(reader)
    for row in reader:
        root_id = get_root_id(row[1])
        sample_ids[root_id].append(row)


for rows_by_id in sample_ids.values():
    call_vals = set()
    for row in rows_by_id:
        call_vals.update(row[2:4])

    if len(call_vals) &gt; 1:
        for row in rows_by_id:
            row[4] = &quot;Not In Agreement&quot;


with open(&quot;output.csv&quot;, &quot;w&quot;, newline=&quot;&quot;) as f:
    writer = csv.writer(f)
    writer.writerow(header)
    for rows_by_id in sample_ids.values():
        for row in rows_by_id:
            writer.writerow(row)

One feature of this structure is that if your input CSV is sorted by Sample_ID, the output CSV will have the same sort... (dunno if that's important, but that's how it will work out.)

My output looks like:

ID Sample_ID Call_X Call_Y Flag
1 NA12878 C/A C/A Not In Agreement
2 NA12878_1 C/C C/C Not In Agreement
3 NTC Not In Agreement
4 PGX000133 C/C C/C Not In Agreement
5 PGX000133_1 NOAMP NOAMP Not In Agreement
6 PGX000135 C/C C/C
7 PGX000135_D C/C C/C
8 PGX000135_1 C/C C/C

答案2

得分: 0

你可以使用 to_numpy()flattenCall_xCall_y 的所有值合并到一个数组中。然后,你可以使用 all 来检查该数组的所有值是否相等。这是我写的代码:

def compare_unique_ids(df):
    for unique_id in df['Unique ID'].unique():
        values = df[df['Unique ID'] == unique_id].loc[:, ["Call_x", "Call_y"]].to_numpy().flatten()
        if not (values[0] == values).all():
            df.loc[df['Unique ID'] == unique_id, "Flag"] = "Not in Agreement"
        else:
            df.loc[df['Unique ID'] == unique_id, "Flag"] = "-"

如果你有任何其他问题,请随时提出。

英文:

You can use to_numpy() and flatten to get all the values for Call_x and Call_y into a single array. You can then check if all the values of that array are equal using all. This is what I wrote:

def compare_unique_ids(df):
for unique_id in df[&#39;Unique ID&#39;].unique():
values = df[df[&#39;Unique ID&#39;] == unique_id].loc[:, [&quot;Call_x&quot;, &quot;Call_y&quot;]].to_numpy().flatten()
if not (values[0] == values).all():
df.loc[df[&#39;Unique ID&#39;] == unique_id, &quot;Flag&quot;] = &quot;Not in Agreement&quot;
else:
df.loc[df[&#39;Unique ID&#39;] == unique_id, &quot;Flag&quot;] = &quot;-&quot;

huangapple
  • 本文由 发表于 2023年3月7日 01:18:11
  • 转载请务必保留本文链接:https://go.coder-hub.com/75653874.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定