How can I split this conversation into one row per speaker using Python?

Question

Let's say the string below is the content of one row in a column named "body". I want to create one row per speaker from this string.

Helper:
Hi, I'm Helper, Virtual Assistant, how can I help you today?
Are you inquiring about:
eBooks
Audiobooks
Purchasing
Subscriptions
Movies
etc

Cx said:
Movies

The expected output should be like:

Speaker Transcript
Helper Hi, I'm Helper, Virtual Assistant, how can I help you today? Are you inquiring about:eBooksAudiobooks Purchasing Subscriptions Movies etc
Cx said Movies

I have tried this but the

Testresult = tempchatdf.body.str.split(":\*\*",expand = True)

Answer 1

Score: 1

You can just take the str and split(":").

split = string.split(":")
result = [split[0], ":".join(split[1:])]

This takes the first piece of the split as the speaker (index 0) and joins the remaining pieces back together with ":", so any additional ":" in the text is preserved.
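The split-and-rejoin step can also be written with the maxsplit argument of str.split, which stops after the first ":". The following is a minimal sketch, assuming a single-line input of the form "speaker: text"; the sample line is made up for illustration and is not taken from the question's data.

```python
# Hypothetical sample line: the text after the first ":" contains another ":".
line = "Helper: Are you inquiring about: eBooks"

# Approach from the answer: split on every ":" and re-join the tail.
parts = line.split(":")
speaker, text = parts[0], ":".join(parts[1:])

# Equivalent result: split only once, at the first ":".
speaker_once, text_once = line.split(":", 1)

print([speaker, text.strip()])
print((speaker, text) == (speaker_once, text_once))
```

Note that unpacking line.split(":", 1) into two names raises a ValueError when the line has no ":" at all, whereas the split-and-rejoin form simply yields an empty text.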

If the string contains several lines like this, you can wrap it in a loop.

table = []
for line in string.splitlines():  # iterate over lines, not characters
    split = line.split(":")  # split the current line, not the whole string
    table.append([split[0], ":".join(split[1:])])
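When a speaker's text spans several lines, as in the sample from the question, a per-line split is not enough because the speaker label sits on its own line. The sketch below groups lines into turns instead. It rests on two assumptions that are not stated in the question: turns are separated by a blank line, and the first line of each turn is the speaker label followed by ":". The function name split_turns is hypothetical.

```python
import re


def split_turns(body):
    """Return [speaker, transcript] pairs from a multi-line chat string.

    Assumption: turns are separated by a blank line, and the first line
    of each turn is the speaker label followed by ":".
    """
    table = []
    # Split on one or more blank lines to get one block per turn.
    for block in re.split(r"\n\s*\n", body.strip()):
        lines = [ln.strip() for ln in block.splitlines() if ln.strip()]
        if not lines:
            continue
        speaker = lines[0].rstrip(":").strip()
        # Join the remaining lines of the turn into a single transcript.
        transcript = " ".join(lines[1:])
        table.append([speaker, transcript])
    return table


body = """Helper:
Hi, I'm Helper, Virtual Assistant, how can I help you today?
Are you inquiring about:
eBooks
Audiobooks
Purchasing
Subscriptions
Movies
etc

Cx said:
Movies"""

table = split_turns(body)
for speaker, transcript in table:
    print(speaker, "|", transcript)
```

Grouping by blank lines keeps "Are you inquiring about:" inside the Helper turn even though that line also ends with ":".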

Answer 2

Score: 0

Here is an approach that uses re.findall() to match the body string and build a new DataFrame:

import re
import pandas as pd

row_str = df["body"].values[0]  # body text of the first row
data = re.findall(r'(.+?):\s*(.+)', row_str)  # list of (speaker, text) tuples

new_df = pd.DataFrame(data, columns=["Speaker", "Transcript"])
print(new_df)

Or you can use re together with pandas.Series.explode in a list comprehension:

pattern = r'^([a-zA-Z\s]+):'
rows = [{"Speaker": re.match(pattern, line).group(1).strip(),
         "Transcript": line.split(":", 1)[1].strip()}
         for line in df["body"].str.split("\n").explode().tolist()
         if re.match(pattern, line)]

new_df = pd.DataFrame(rows)
print(new_df)

Output:

Speaker Transcript
Helper Hi, I'm Helper, Virtual Assistant, how can I help you today? Are you inquiring about:eBooksAudiobooks Purchasing Subscriptions Movies etc
Cx said Movies
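If the speaker labels sit on their own lines, as in the sample shown in the question, another option is to split the body on those label lines. This is only a sketch under a stated assumption: the full set of speaker labels ("Helper" and "Cx said" here) is known in advance. The names SPEAKERS, HEADER and body_to_rows are hypothetical, and the code uses only the standard library.

```python
import re

# Assumption: the complete list of speaker labels is known ahead of time.
SPEAKERS = ["Helper", "Cx said"]

# Match a label only when it is alone on its line, e.g. "Helper:".
HEADER = re.compile(
    r"^(" + "|".join(re.escape(s) for s in SPEAKERS) + r"):[ \t]*$",
    flags=re.MULTILINE,
)


def body_to_rows(body):
    """Turn one body string into a list of Speaker/Transcript dicts."""
    # With one capture group, split returns
    # [text_before, label1, text1, label2, text2, ...].
    parts = HEADER.split(body)
    rows = []
    for speaker, text in zip(parts[1::2], parts[2::2]):
        # Collapse newlines and repeated spaces into single spaces.
        rows.append({"Speaker": speaker, "Transcript": " ".join(text.split())})
    return rows


body = """Helper:
Hi, I'm Helper, Virtual Assistant, how can I help you today?
Are you inquiring about:
eBooks
Audiobooks
Purchasing
Subscriptions
Movies
etc

Cx said:
Movies"""

# One body string per DataFrame row; a single row is used here.
bodies = [body]
rows = [row for b in bodies for row in body_to_rows(b)]
for row in rows:
    print(row)
```

Passing rows to pd.DataFrame(rows) would then give a frame with Speaker and Transcript columns. Restricting the pattern to known labels keeps "Are you inquiring about:" from being treated as a speaker.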

huangapple
  • Published on March 7, 2023 at 04:06:02
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75655359.html