Dataframe column with quoted CSV to named dataframe columns

Question

I am pulling some JSON-formatted log data out of my SIEM into a pandas dataframe. I can easily convert the JSON into multiple columns within the dataframe, but there is a "message" field in the JSON that contains a quoted CSV string, like this.

import json
import pandas as pd

# dummy data
dfMyData = pd.DataFrame({
    "_raw": [
        """{"timestamp":1691096387000,"message":"20230803 20:59:47,ip-123-123-123-123,mickey,321.321.321.321,111111,10673010,type,,'I am a, quoted, string, with commas,',0,,","logstream":"Blah1","loggroup":"group 1"}""",
        """{"timestamp":1691096386000,"message":"20230803 21:00:47,ip-456-456-456-456,mouse,654.654.654.654,222222,10673010,type,,'I am another quoted string',0,,","logstream":"Blah2","loggroup":"group 2"}"""
    ]
})
# Column names for the _raw.message field that is generated.
MessageColumnNames = ["Timestamp","dest_host","username","src_ip","port","number","type","who_knows","message_string","another_number","who_knows2","who_knows3"]
# Convert the column to JSON objects/dicts
dfMyData['_raw'] = dfMyData['_raw'].map(json.loads)
# Convert the JSON into columns within the dataframe
dfMyData = pd.json_normalize(dfMyData.to_dict(orient='records'))

I've seen this done before with str.split() to split a column and then concat the result back to the original dataframe; however, str.split() doesn't handle quoted values within the CSV. pd.read_csv can handle the quoted CSV correctly, but I can't figure out how to apply it across the dataframe and expand its output into new dataframe columns.
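
For illustration (an editorial sketch, not part of the original question), here is why a naive comma split breaks on this data, using the CSV portion of the first dummy record above:

import csv

# The quoted CSV carried in the first record's "message" field
msg = ("20230803 20:59:47,ip-123-123-123-123,mickey,321.321.321.321,"
       "111111,10673010,type,,'I am a, quoted, string, with commas,',0,,")

# A plain comma split also splits inside the single-quoted field:
print(len(msg.split(",")))                           # 16 pieces
# A CSV-aware parse that honours the single-quote quoting yields the expected fields:
print(len(next(csv.reader([msg], quotechar="'"))))   # 12 fields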

Additionally, when I split dfMyData['_raw.message'] out into new columns, I'd also like to supply a list of column names for the data and have the new columns be created with those names.

Anyone know of an easy way to split a quoted CSV string in a dataframe column into new named columns within the dataframe?


Answer 1

Score: 1


Try:

from io import StringIO


out = pd.concat(
    [
        dfMyData,
        # parse each quoted-CSV message with read_csv (a one-row CSV per value)
        # and take the single resulting row as a Series
        dfMyData.pop("_raw.message")
        .apply(lambda x: pd.read_csv(StringIO(x), header=None, quotechar="'").iloc[0])
        .add_prefix("csv_col_"),
    ],
    axis=1,
)
print(out)

Prints:

   _raw.timestamp _raw.logstream _raw.loggroup          csv_col_0           csv_col_1 csv_col_2        csv_col_3  csv_col_4  csv_col_5 csv_col_6  csv_col_7                             csv_col_8  csv_col_9  csv_col_10  csv_col_11
0   1691096387000          Blah1       group 1  20230803 20:59:47  ip-123-123-123-123    mickey  321.321.321.321     111111   10673010      type        NaN  I am a, quoted, string, with commas,          0         NaN         NaN
1   1691096386000          Blah2       group 2  20230803 21:00:47  ip-456-456-456-456     mouse  654.654.654.654     222222   10673010      type        NaN            I am another quoted string          0         NaN         NaN
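
As a follow-up (an editorial sketch, not part of the original answer): since the question also asks for named columns, the MessageColumnNames list from the question can be passed to read_csv via names= instead of using the csv_col_ prefix. Run this in place of the snippet above (which pop()s the column):

from io import StringIO

named = dfMyData["_raw.message"].apply(
    lambda x: pd.read_csv(StringIO(x), header=None,
                          names=MessageColumnNames, quotechar="'").iloc[0]
)
out = pd.concat([dfMyData.drop(columns="_raw.message"), named], axis=1)
print(out)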

Answer 2

Score: 1


Update

Actually, all you need is to pass your data through a csv.reader, whose output is a suitable input for the pandas.DataFrame constructor:

pd.DataFrame(csv.reader(dfMyData['_raw.message'], quotechar="'"), columns=columns)
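
A self-contained version of that one-liner (an editorial sketch, assuming the MessageColumnNames list from the question is used for the columns argument and that dfMyData still contains the '_raw.message' column):

import csv

# csv.reader iterates over the message strings and yields one list of fields per row
parsed = pd.DataFrame(
    csv.reader(dfMyData["_raw.message"], quotechar="'"),
    columns=MessageColumnNames,
)
out = pd.concat([dfMyData.drop(columns="_raw.message"), parsed], axis=1)
print(out)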

Previous answer

We can try converting the data to CSV and reading it back with appropriate parameters:

import csv
from tempfile import TemporaryFile

seq = dfMyData.iloc[:,1]      # column of interest in the original data
columns = [*'ABCDEFGHIJKL']   # custom names of future data columns

with TemporaryFile() as file:

    seq.to_csv(
        file, 
        sep='\N{unit separator}',
        header=False,
        index=False,
        quoting=csv.QUOTE_NONE
    )

    file.seek(0)    # read data from the start

    df = pd.read_csv(
        file, 
        header=None,
        names=columns,
        quotechar="\'"
    )

print(df)

Notes:

  • quoting=csv.QUOTE_NONE so that to_csv does not wrap each line in extra " quotes
  • sep='\N{unit separator}' to avoid confusion with the commas in the data
  • quotechar="'" when reading back, because of the specific single-quote quoting inside the lines
  • since we are dumping a series without an index, the '\N{unit separator}' delimiter never makes it into the final data

Transformed data:

(image of the resulting dataframe, not reproduced here)
