Dataframe column with quoted CSV to named dataframe columns
Question
I am pulling some JSON-formatted log data out of my SIEM into a pandas dataframe. I can easily convert the JSON into multiple columns within the dataframe, but there is a "message" field in the JSON that contains a quoted CSV, like this:
import json
import pandas as pd

# dummy data
dfMyData = pd.DataFrame({
    "_raw": [
        """{"timestamp":1691096387000,"message":"20230803 20:59:47,ip-123-123-123-123,mickey,321.321.321.321,111111,10673010,type,,'I am a, quoted, string, with commas,',0,,","logstream":"Blah1","loggroup":"group 1"}""",
        """{"timestamp":1691096386000,"message":"20230803 21:00:47,ip-456-456-456-456,mouse,654.654.654.654,222222,10673010,type,,'I am another quoted string',0,,","logstream":"Blah2","loggroup":"group 2"}"""
    ]
})
# Column names for the _raw.message field that is generated.
MessageColumnNames = ["Timestamp","dest_host","username","src_ip","port","number","type","who_knows","message_string","another_number","who_knows2","who_knows3"]
# Convert column to json object/dict
dfMyData['_raw'] = dfMyData['_raw'].map(json.loads)
# convert JSON into columns within the dataframe
dfMyData = pd.json_normalize(dfMyData.to_dict(orient='records'))
I've seen this done before with str.split() to split into columns and then concat the result back to the original dataframe; however, str.split doesn't handle quoted values within the CSV. pd.read_csv handles the quoted CSV correctly, but I can't figure out how to apply it across the dataframe and expand its output into new dataframe columns.

Additionally, when I split dfMyData['_raw.message'] out into new columns, I'd also like to supply a list of column names for the data and have the new columns created with those names.

Does anyone know of an easy way to split a quoted CSV string in a dataframe column into new, named columns within the dataframe?
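To illustrate the quoting problem (a quick sketch on the first dummy "message" value): a plain comma split shreds the quoted field, while a CSV-aware parse keeps it intact.

import csv

# first dummy "message" value from the data above
msg = "20230803 20:59:47,ip-123-123-123-123,mickey,321.321.321.321,111111,10673010,type,,'I am a, quoted, string, with commas,',0,,"

print(len(msg.split(",")))                          # 16 pieces - the quoted field is broken on its internal commas
print(len(next(csv.reader([msg], quotechar="'"))))  # 12 fields - commas inside the quoted field are respected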
Answer 1
Score: 1
Try:
from io import StringIO

out = pd.concat(
    [
        dfMyData,
        dfMyData.pop("_raw.message")
        .apply(lambda x: pd.read_csv(StringIO(x), header=None, quotechar="'").iloc[0])
        .add_prefix("csv_col_"),
    ],
    axis=1,
)
print(out)
Prints:
_raw.timestamp _raw.logstream _raw.loggroup csv_col_0 csv_col_1 csv_col_2 csv_col_3 csv_col_4 csv_col_5 csv_col_6 csv_col_7 csv_col_8 csv_col_9 csv_col_10 csv_col_11
0 1691096387000 Blah1 group 1 20230803 20:59:47 ip-123-123-123-123 mickey 321.321.321.321 111111 10673010 type NaN I am a, quoted, string, with commas, 0 NaN NaN
1 1691096386000 Blah2 group 2 20230803 21:00:47 ip-456-456-456-456 mouse 654.654.654.654 222222 10673010 type NaN I am another quoted string 0 NaN NaN
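If you want the new columns to carry the MessageColumnNames from the question instead of the csv_col_ prefix, one possible tweak (a sketch, not part of the original answer, assuming MessageColumnNames has one name per CSV field) is to pass names= to read_csv and drop the prefix; run it in place of the snippet above:

out = pd.concat(
    [
        dfMyData,
        dfMyData.pop("_raw.message").apply(
            lambda x: pd.read_csv(
                StringIO(x), header=None, names=MessageColumnNames, quotechar="'"
            ).iloc[0]
        ),
    ],
    axis=1,
)
print(out)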
Answer 2
Score: 1
Update
Actually, all you need is to pass your data to a csv.reader, which in turn is an acceptable data source for pandas.DataFrame:
pd.DataFrame(csv.reader(dfMyData['_raw.message'], quotechar="'"), columns=columns)
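Put together with the question's setup, a self-contained sketch of this update could look like the following (the use of MessageColumnNames and the final concat are assumptions, not part of the original answer):

import csv

# parse each "_raw.message" string with a CSV-aware reader and attach the result as named columns
# (values stay strings here, since there is no read_csv type inference)
parsed = pd.DataFrame(
    csv.reader(dfMyData.pop("_raw.message"), quotechar="'"),
    columns=MessageColumnNames,   # assumed: one name per CSV field
    index=dfMyData.index,
)
out = pd.concat([dfMyData, parsed], axis=1)
print(out)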
Previous answer
We can try to convert the data into CSV and read it back with appropriate parameters:
import csv
from tempfile import TemporaryFile

seq = dfMyData.iloc[:, 1]      # column of interest in the original data
columns = [*'ABCDEFGHIJKL']    # custom names of future data columns

with TemporaryFile() as file:
    seq.to_csv(
        file,
        sep='\N{unit separator}',
        header=False,
        index=False,
        quoting=csv.QUOTE_NONE
    )
    file.seek(0)    # read data from the start
    df = pd.read_csv(
        file,
        header=None,
        names=columns,
        quotechar="'"
    )

print(df)
Notes:
- quoting=csv.QUOTE_NONE to avoid \" at the ends of each line
- sep='\N{unit separator}' to avoid confusion with commas
- quotechar="'" when reading back because of the specific quoting inside the lines
- since we are dumping a sequence without indexes, the '\N{unit separator}' delimiter will never make it into the final data
Transformed data: