
Dataframe column with quoted CSV to named dataframe columns

Question

I am pulling some JSON-formatted log data out of my SIEM into a pandas dataframe. I am able to easily convert the JSON into multiple columns within the dataframe, but there is a "message" field in the JSON that contains a quoted CSV string, like this:

import json

import pandas as pd

# dummy data
dfMyData = pd.DataFrame({
    "_raw": [
        """{"timestamp":1691096387000,"message":"20230803 20:59:47,ip-123-123-123-123,mickey,321.321.321.321,111111,10673010,type,,'I am a, quoted, string, with commas,',0,,","logstream":"Blah1","loggroup":"group 1"}""",
        """{"timestamp":1691096386000,"message":"20230803 21:00:47,ip-456-456-456-456,mouse,654.654.654.654,222222,10673010,type,,'I am another quoted string',0,,","logstream":"Blah2","loggroup":"group 2"}"""
    ]
})

# Column names for the _raw.message field once it is split out.
MessageColumnNames = ["Timestamp", "dest_host", "username", "src_ip", "port", "number", "type", "who_knows", "message_string", "another_number", "who_knows2", "who_knows3"]

# Convert the column from a JSON string to a dict
dfMyData['_raw'] = dfMyData['_raw'].map(json.loads)

# Expand the nested dicts into columns: _raw.timestamp, _raw.message,
# _raw.logstream and _raw.loggroup
dfMyData = pd.json_normalize(dfMyData.to_dict(orient='records'))

I've seen this done before with str.split() to split into columns and then concat the result back onto the original dataframe; however, str.split() doesn't handle quoted values within the CSV. pd.read_csv() handles the quoting correctly, but I can't figure out how to apply it across the dataframe and expand its output into new dataframe columns.

Additionally, when I split dfMyData['_raw.message'] out into new columns, I'd also like to supply a list of column names for the data and have the new columns be created with those names.

Anyone know of an easy way to split a quoted CSV string in a dataframe column into new named columns within the dataframe?


Answer 1

Score: 1

Try:

from io import StringIO


out = pd.concat(
    [
        dfMyData,
        # Parse each message with read_csv (quotechar="'" handles the embedded
        # commas), take the single resulting row, and prefix the new columns.
        dfMyData.pop("_raw.message")
        .apply(lambda x: pd.read_csv(StringIO(x), header=None, quotechar="'").iloc[0])
        .add_prefix("csv_col_"),
    ],
    axis=1,
)
print(out)

Prints:

   _raw.timestamp _raw.logstream _raw.loggroup          csv_col_0           csv_col_1 csv_col_2        csv_col_3  csv_col_4  csv_col_5 csv_col_6  csv_col_7                             csv_col_8  csv_col_9  csv_col_10  csv_col_11
0   1691096387000          Blah1       group 1  20230803 20:59:47  ip-123-123-123-123    mickey  321.321.321.321     111111   10673010      type        NaN  I am a, quoted, string, with commas,          0         NaN         NaN
1   1691096386000          Blah2       group 2  20230803 21:00:47  ip-456-456-456-456     mouse  654.654.654.654     222222   10673010      type        NaN            I am another quoted string          0         NaN         NaN
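
As a small variation (not part of the original answer), the asker's MessageColumnNames list can be passed as names= so the split-out columns are created with those names instead of the csv_col_ prefix; a minimal sketch, assuming the list has one entry per CSV field:

# Sketch, not from the original answer: run this instead of the snippet above,
# naming the new columns up front via names=MessageColumnNames.
named = pd.concat(
    [
        dfMyData,
        dfMyData.pop("_raw.message").apply(
            lambda x: pd.read_csv(
                StringIO(x), header=None, names=MessageColumnNames, quotechar="'"
            ).iloc[0]
        ),
    ],
    axis=1,
)
print(named)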

Answer 2

Score: 1

Update

Actually, all you need is to pass your data through csv.reader, whose output is in turn an acceptable input for pandas.DataFrame:

pd.DataFrame(csv.reader(dfMyData['_raw.message'], quotechar="'"), columns=columns)
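
Note that columns here is the list defined in the previous answer below. A self-contained sketch of the same idea using the asker's MessageColumnNames (assuming every message yields exactly that many fields):

import csv

import pandas as pd

# Sketch: feed the message strings straight into csv.reader (quotechar="'"
# matches the single-quoted fields) and build a named dataframe from the rows.
msg_df = pd.DataFrame(
    csv.reader(dfMyData['_raw.message'], quotechar="'"),
    columns=MessageColumnNames,  # assumed to have one name per CSV field
)

# Optionally join the parsed columns back onto the rest of the frame.
result = pd.concat([dfMyData.drop(columns='_raw.message'), msg_df], axis=1)
print(result)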

Previous answer

We can try converting the data to CSV and reading it back with appropriate parameters:

import csv
from tempfile import TemporaryFile

seq = dfMyData.iloc[:,1]      # column of interest in the original data
columns = [*'ABCDEFGHIJKL']   # custom names of future data columns

with TemporaryFile() as file:

    seq.to_csv(
        file, 
        sep='\N{unit separator}',
        header=False,
        index=False,
        quoting=csv.QUOTE_NONE
    )

    file.seek(0)    # read data from the start

    df = pd.read_csv(
        file, 
        header=None,
        names=columns,
        quotechar="\'"
    )

print(df)

Notes:

  • quoting=csv.QUOTE_NONE to avoid extra " quoting being added around each line
  • sep='\N{unit separator}' to avoid confusion with the commas inside the messages (see the quick check after these notes)
  • quotechar="'" when reading back, because of the specific single-quote quoting inside the lines
  • since we are dumping a Series without its index, the '\N{unit separator}' delimiter will never make it into the final data
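
A quick check (not in the original answer) of why the unit separator is a safe delimiter here:

# '\N{unit separator}' is the ASCII control character U+001F: a single,
# non-printing character that ordinary log text never contains.
assert '\N{unit separator}' == '\x1f'
assert len('\N{unit separator}') == 1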

Transformed data:

(the resulting dataframe was shown as a screenshot in the original answer)
