英文:
How can I get my csv comparison results to work for 3 separate columns instead of one
问题
程序我已经编写了,它接受两个CSV文件,分别来自我们的基因结果报告员,并在“Sample Id”列上合并它们。
来自报告员的呼叫将用于比较相似的结果。基本上,如果呼叫x不等于呼叫y,它将被标记为不一致。
问题在于,有时两行属于同一样本,由下划线分隔(例如NA1234
和NA1234_1
)。
考虑到代码只检查一行上的呼叫,我无法找到一种比较同一样本的其他行的方法。
请注意,对于第一行,NA12878
具有匹配的呼叫C/A
和C/A
。下一个样本NA12878_1
(它是相同的样本)具有匹配的呼叫C/C
和C/C
。
这些呼叫实际上并不一致。C/A
和C/C
不是匹配的呼叫,必须标记为审查。
我已尝试了一些不同的策略,但没有使其按需要工作。
请注意,以下是您提供的Python代码,用于处理这些CSV文件的任务:
import pandas as pd
import tkinter as tk
from tkinter import filedialog
import tkinter.messagebox
# ...(其他代码)
# Here is where we will attempt to flag the Call Columns if they are not equal to one another
merged_data.loc[merged_data['Call_x'] != merged_data['Call_y'], 'Flag'] = 'Not in Agreement'
merged_data.loc[merged_data['Call_x'] == merged_data['Call_y'], 'Flag'] = '-'
# ...(其他代码)
# Save the data
save_loc(merged_data)
另外,您还提供了更新后的代码,但遇到“Not In Agreement”标记的问题。您还列出了更新后的CSV文件的示例以及更改位置值的原因。
如果您需要进一步的帮助或有其他问题,请随时提问。
英文:
The program I have written takes 2 csv's, one from each of our gene results reporters, and merges them on the column "Sample Id".
Assay ID_x | Gene Symbol | NCBI SNP Reference | Sample ID | Call_x | Assay ID_y | Gene Symbol | Call_y | Unique ID |
---|---|---|---|---|---|---|---|---|
C_### | CYP2D6 | rs## | NA12878 | C/A | C_## | CYP2D6 | C/A | 'In Agreement' |
C_### | CYP2D6 | rs## | NA12878_1 | C/C | C_## | CYP2D6 | C/C | 'In Agreement' |
The calls from the reporters are compared for similar results. Essentially if call x != call y, it is flagged for being not in agreement.
The problem: sometimes 2 rows are of the same sample, separated by name in the presence of an underscore (ex. NA1234
and NA1234_1
).
Considering the code only checks the calls on one line, I can't find a way to compare the other rows of the same sample.
Note that for the first row NA12878
has matching calls of C/A
and C/A
. The next sample NA12878_1
(which is the same sample) has an matching call for C/C
and C/C
.
These calls are actually not in agreement. C/A
and C/C
are not matching calls and must be flagged for review.
I've tried a few different tactics, but none have made it work as needed.
import pandas as pd
import tkinter as tk
from tkinter import filedialog
import tkinter.messagebox
import jinja2
import openpyxl
# The purpose of this program is to compare calls from duplicates, and the calls are aligned from 2 reporters
# There will be 2 Genotyper output passes
# We will flag the calls/comparisons if !=
# Defining the upload prompt
# File Upload window.
def ask_path():
root = tk.Tk()
root.withdraw()
path = filedialog.askopenfile(defaultextension=".csv",
filetypes=[("csv files", '*.csv'),
('all files', '*.*')]) # shows dialog box and return the path
return path
# Pop-up window branding
def onClick(msg):
tkinter.messagebox.showinfo("Precision Genetics", msg)
# Prompt & Load First Pass
genotyper1_col = ['Gene Symbol', 'Sample ID', 'NCBI SNP Reference', 'Assay ID', 'Call']
onClick('Upload First Pass')
geno1 = pd.read_csv(ask_path(), skiprows=17, usecols=genotyper1_col)
# Prompt & Load the Second Pass
genotyper2_col = ['Gene Symbol', 'Assay ID', 'Sample ID', 'NCBI SNP Reference', 'Call']
onClick('Upload Second Pass')
geno2 = pd.read_csv(ask_path(), skiprows=17, usecols=genotyper2_col)
# Merge on specified column
merged_data = pd.merge(geno1, geno2, on=['NCBI SNP Reference', 'Sample ID'])
# Filter out rows that don't have an NCBI SNP Reference in both passes
merged_data = merged_data[~merged_data['NCBI SNP Reference'].isna()]
# Here is where we will attempt to flag the Call Columns if they are not equal to one another
merged_data.loc[merged_data['Call_x'] != merged_data['Call_y'], 'Flag'] = 'Not in Agreement'
merged_data.loc[merged_data['Call_x'] == merged_data['Call_y'], 'Flag'] = '-'
# Highlight Cell if it isnt in agreement
def highlight_cells(value):
if value == merged_data['Not in Agreement']:
color = 'red'
else:
color = ''
return 'background-color: {}'.format(color)
merged_data.style.applymap(highlight_cells)
# Save the data
onClick('Please Choose Where You Would Like To Store This Data')
def save_loc(dataframe):
root = tk.Tk()
root.withdraw()
file_path = filedialog.asksaveasfilename(defaultextension=".csv",
filetypes=[("CSV Files", "*.csv"),
("All Files", "*.*")])
if file_path:
dataframe.to_csv(file_path, index=False)
root.destroy()
save_loc(merged_data)
NEW UPDATE
Here is the new edited code, after the amazing help you all have given me. For whatever reason, I'm still getting "Not In Agreement" for everything in the Flagging column. Also, I thought it would be a good idea to update exactly what the merged csv looks like with its headers, and why I changed to location values to what they are. Thanks again for everything all.
def get_root_id(s):
return s.split("_", 1)[0]
sample_ids = defaultdict(list)
headers = []
with open("merged_data.csv", newline="") as f:
reader = csv.reader(f)
header = next(reader)
for row in reader:
root_id = get_root_id(row[3])
sample_ids[root_id].append(row)
for rows_by_id in sample_ids.values():
call_vals = set()
for row in rows_by_id:
call_vals.update(row[4:7])
if len(call_vals) > 1:
for row in rows_by_id:
row[8] = 'Not In Agreement'
output_file = 'output.csv'
with open("output.csv", "w", newline="") as f:
writer = csv.writer(f)
writer.writerow(header)
for rows_by_id in sample_ids.values():
for row in rows_by_id:
writer.writerow(row)
output_data = pd.read_csv(output_file)
print(output_data)
</details>
# 答案1
**得分**: 1
我理解你的问题是:“如何比较具有相似ID的多行数据?”如果这代表你的问题,请继续阅读。
我从原始图像重建了你的数据(包括实际的CSV使我们更容易处理数据)。我去掉了冗余的列,并添加了一个ID列以显示原始行的顺序。我还删除了一些行。我认为这准确地突出了你数据的特点:
```none
ID,Sample_ID,Call_X,Call_Y,Flag
1,NA12878,C/A,C/A,
2,NA12878_1,C/C,C/C,
3,NTC,,,Not In Agreement
4,PGX000133,C/C,C/C,
5,PGX000133_1,NOAMP,NOAMP,
6,PGX000135,C/C,C/C,
7,PGX000135_D,C/C,C/C,
8,PGX000135_1,C/C,C/C,
我想象着读取CSV并创建一个结构,其中行根据其样本ID的“根”分组在一起,类似于:
{
"NA12878": [
["1", "NA12878", "C/A", "C/A", ""],
["2", "NA12878_1", "C/C", "C/C", ""],
],
"NTC": [
["3", "NTC", "", "", "Not In Agreement"],
],
"PGX000133": [
["4", "PGX000133", "C/C", "C/C", ""],
["5", "PGX000133_1", "NOAMP", "NOAMP", ""],
],
"PGX000135": [
["6", "PGX000135", "C/C", "C/C", ""],
["7", "PGX000135_D", "C/C", "C/C", ""],
["8", "PGX000135_1", "C/C", "C/C", ""],
],
}
有了这个结构,我可以循环遍历行组,并比较调用x/y的值。对于任何ID:
- 我认为如果只有一个值(例如C/C),那么行集合是好的。
- 如果不止一个值,则行集合“不一致”。
我使用Python的set
,只需更新所有组的行的调用值:
for sample_id, rows_by_id in sample_ids.items():
call_vals = set()
for row in rows_by_id:
call_vals.update(row[2:4])
print(f"{sample_id}: {call_vals}")
NA12878: {'C/C', 'C/A'}
NTC: {''}
PGX000133: {'C/C', 'NOAMP'}
PGX000135: {'C/C'}
如果我发现集合的长度大于1,我会再次循环遍历组中的所有行,并设置标志字段:
...
if len(call_vals) > 1:
for row in rows_by_id:
row[4] = "Not In Agreement"
这是我的完整程序,我使用defaultdict(list)
来轻松将行附加到样本ID,并创建我想象的初始结构(如上所示)。
import csv
from collections import defaultdict
def get_root_id(s):
return s.split("_", 1)[0] # NA12878_1 → NA12878; PGX000133 → PGX000133
sample_ids = defaultdict(list)
header = []
# My header will look like (w/row indices):
# ID,Sample_ID,Call_X,Call_Y,Flag
# 0 1 2 3 4
with open("input.csv", newline="") as f:
reader = csv.reader(f)
header = next(reader)
for row in reader:
root_id = get_root_id(row[1])
sample_ids[root_id].append(row)
for rows_by_id in sample_ids.values():
call_vals = set()
for row in rows_by_id:
call_vals.update(row[2:4])
if len(call_vals) > 1:
for row in rows_by_id:
row[4] = "Not In Agreement";
with open("output.csv", "w", newline="") as f:
writer = csv.writer(f)
writer.writerow(header)
for rows_by_id in sample_ids.values():
for row in rows_by_id:
writer.writerow(row)
这个结构的一个特点是,如果你的输入CSV按Sample_ID排序,输出CSV也将具有相同的排序...(不知道这是否重要,但这是它的工作方式。)
我的输出如下:
ID | Sample_ID | Call_X | Call_Y | Flag |
---|---|---|---|---|
1 | NA12878 | C/A | C/A | Not In Agreement |
2 | NA12878_1 | C/C | C/C | Not In Agreement |
3 | NTC | Not In Agreement | ||
4 | PGX000133 | C/C | C/C | Not In Agreement |
5 | PGX000133_1 | NOAMP | NOAMP | Not In Agreement |
6 | PGX000135 | C/C | C/C | |
7 | PGX000135_D | C/C | C/C | |
8 | PGX000135_1 | C/C | C/C |
英文:
I interpret your problem to be, "How do I compare multiple rows with a similar ID?" If that represents your problem, read on.
I reconstructed your data from the original image (including actual CSV makes it easier for us to play with the data). I removed the redundant columns, and added an ID column to show the original row order. I also cut out some rows. I think this accurately highlights the features of your data:
ID,Sample_ID,Call_X,Call_Y,Flag
1,NA12878,C/A,C/A,
2,NA12878_1,C/C,C/C,
3,NTC,,,Not In Agreement
4,PGX000133,C/C,C/C,
5,PGX000133_1,NOAMP,NOAMP,
6,PGX000135,C/C,C/C,
7,PGX000135_D,C/C,C/C,
8,PGX000135_1,C/C,C/C,
I picture reading the CSV and creating a structure where rows are grouped together by the root of their sample ID, something like:
{
"NA12878": [
["1", "NA12878", "C/A", "C/A", ""],
["2", "NA12878_1", "C/C", "C/C", ""],
],
"NTC": [
["3", "NTC", "", "", "Not In Agreement"],
],
"PGX000133": [
["4", "PGX000133", "C/C", "C/C", ""],
["5", "PGX000133_1", "NOAMP", "NOAMP", ""],
],
"PGX000135": [
["6", "PGX000135", "C/C", "C/C", ""],
["7", "PGX000135_D", "C/C", "C/C", ""],
["8", "PGX000135_1", "C/C", "C/C", ""],
],
}
With that structure I can loop over groups of rows, and compare the call x/y values. For any ID:
- I believe if there is a single value (e.g., C/C) then the set of rows is good.
- If there is more than a single value then the set of rows is "not in agreement".
I use a Python set
and just update it with all call values for every group of rows:
for sample_id, rows_by_id in sample_ids.items():
call_vals = set()
for row in rows_by_id:
call_vals.update(row[2:4])
print(f"{sample_id}: {call_vals}")
NA12878: {'C/C', 'C/A'}
NTC: {''}
PGX000133: {'C/C', 'NOAMP'}
PGX000135: {'C/C'}
If I find a set with a length greater than 1, I loop back over all rows in the group and set the flag field:
...
if len(call_vals) > 1:
for row in rows_by_id:
row[4] = "Not In Agreement"
Here's my complete program, where I use a defaultdict(list)
to easily append rows to a sample ID and create that initial structure I imagined (and showed earlier).
import csv
from collections import defaultdict
def get_root_id(s):
return s.split("_", 1)[0] # NA12878_1 → NA12878; PGX000133 → PGX000133
sample_ids = defaultdict(list)
header = []
# My header will look like (w/row indices):
# ID,Sample_ID,Call_X,Call_Y,Flag
# 0 1 2 3 4
with open("input.csv", newline="") as f:
reader = csv.reader(f)
header = next(reader)
for row in reader:
root_id = get_root_id(row[1])
sample_ids[root_id].append(row)
for rows_by_id in sample_ids.values():
call_vals = set()
for row in rows_by_id:
call_vals.update(row[2:4])
if len(call_vals) > 1:
for row in rows_by_id:
row[4] = "Not In Agreement"
with open("output.csv", "w", newline="") as f:
writer = csv.writer(f)
writer.writerow(header)
for rows_by_id in sample_ids.values():
for row in rows_by_id:
writer.writerow(row)
One feature of this structure is that if your input CSV is sorted by Sample_ID, the output CSV will have the same sort... (dunno if that's important, but that's how it will work out.)
My output looks like:
ID | Sample_ID | Call_X | Call_Y | Flag |
---|---|---|---|---|
1 | NA12878 | C/A | C/A | Not In Agreement |
2 | NA12878_1 | C/C | C/C | Not In Agreement |
3 | NTC | Not In Agreement | ||
4 | PGX000133 | C/C | C/C | Not In Agreement |
5 | PGX000133_1 | NOAMP | NOAMP | Not In Agreement |
6 | PGX000135 | C/C | C/C | |
7 | PGX000135_D | C/C | C/C | |
8 | PGX000135_1 | C/C | C/C |
答案2
得分: 0
你可以使用 to_numpy()
和 flatten
将 Call_x
和 Call_y
的所有值合并到一个数组中。然后,你可以使用 all
来检查该数组的所有值是否相等。这是我写的代码:
def compare_unique_ids(df):
for unique_id in df['Unique ID'].unique():
values = df[df['Unique ID'] == unique_id].loc[:, ["Call_x", "Call_y"]].to_numpy().flatten()
if not (values[0] == values).all():
df.loc[df['Unique ID'] == unique_id, "Flag"] = "Not in Agreement"
else:
df.loc[df['Unique ID'] == unique_id, "Flag"] = "-"
如果你有任何其他问题,请随时提出。
英文:
You can use to_numpy()
and flatten
to get all the values for Call_x
and Call_y
into a single array. You can then check if all the values of that array are equal using all
. This is what I wrote:
def compare_unique_ids(df):
for unique_id in df['Unique ID'].unique():
values = df[df['Unique ID'] == unique_id].loc[:, ["Call_x", "Call_y"]].to_numpy().flatten()
if not (values[0] == values).all():
df.loc[df['Unique ID'] == unique_id, "Flag"] = "Not in Agreement"
else:
df.loc[df['Unique ID'] == unique_id, "Flag"] = "-"
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论