How can I separate this conversation into one row per speaker using Python?
Question
Let's say the string below is the content of a row, with "body" as the column name. Now I want to create one row per speaker from this string.
Helper:
Hi, I'm Helper, Virtual Assistant, how can I help you today?
Are you inquiring about:
eBooks
Audiobooks
Purchasing
Subscriptions
Movies
etc
Cx said:
Movies
The expected output should be like:
Speaker | Transcript |
---|---|
Helper | Hi, I'm Helper, Virtual Assistant, how can I help you today? Are you inquiring about:eBooksAudiobooks Purchasing Subscriptions Movies etc |
Cx said | Movies |
I have tried this but the
Testresult = tempchatdf.body.str.split(":\*\*",expand = True)
Answer 1
Score: 1
You can just take the string and call split(":") on it:
split = string.split(":")
result = [split[0], ":".join(split[1:])]
This takes the first piece of the split as the speaker (index 0), then re-joins the remaining pieces with ":". The re-join ensures that any additional ":" in the text is preserved.
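To make the effect of the re-join concrete, here is a minimal sketch on a made-up one-line sample (the `line` value is an assumption for illustration, not the asker's real data). It also shows that `str.split` with `maxsplit=1` gives the same result in one step.

```python
# Hypothetical one-line sample; the real "body" text may be laid out differently.
line = "Helper: Are you inquiring about: eBooks"

# Approach from the answer: split on every colon, then re-join the tail.
parts = line.split(":")
result = [parts[0], ":".join(parts[1:])]

# Equivalent shortcut: allow at most one split, so later colons stay in the text.
speaker, text = line.split(":", 1)

print(result)
print([speaker, text])
```

Both forms keep the second colon inside the transcript text; the `maxsplit` form simply avoids the extra join.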
If the string contains several such lines, you can wrap the split in a loop:
table = []
for line in string.splitlines():  # iterate over lines, not characters
    split = line.split(":")  # split the current line, not the whole string
    table.append([split[0], ":".join(split[1:])])
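In the sample from the question, the speaker label sits on its own line and the reply spans several lines, one of which ("Are you inquiring about:") also ends in a colon. A per-line split would not group those lines, so the following is a rough sketch of one way to collect them per speaker. It assumes the set of speaker labels is known in advance; `SPEAKERS` and `split_turns` are hypothetical names, not part of the original answer.

```python
# Assumed speaker labels, taken from the sample in the question.
SPEAKERS = {"Helper", "Cx said"}

def split_turns(body):
    """Return [speaker, text] pairs, joining each speaker's lines with spaces."""
    table = []
    for line in body.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # A line such as "Helper:" is a label only if it names a known speaker.
        label = line[:-1] if line.endswith(":") else None
        if label in SPEAKERS:
            table.append([label, ""])  # start a new turn
        elif table:
            # Any other line belongs to the most recent speaker.
            table[-1][1] = (table[-1][1] + " " + line).strip()
    return table

body = (
    "Helper:\n"
    "Hi, I'm Helper, Virtual Assistant, how can I help you today?\n"
    "Are you inquiring about:\n"
    "eBooks\n"
    "Movies\n"
    "etc\n"
    "Cx said:\n"
    "Movies"
)
print(split_turns(body))
```

Checking against a known label set is what keeps "Are you inquiring about:" from being treated as a speaker; if the labels are not known up front, a different rule would be needed.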
Answer 2
Score: 0
Here is an approach using re.findall() to match the body string and create a new DataFrame:
import re
import pandas as pd

row_str = df["body"].values[0]  # text of the first row
data = re.findall(r'(.+?):\s*(.+)', row_str)  # list of (speaker, text) tuples
new_df = pd.DataFrame(data, columns=["Speaker", "Transcript"])
print(new_df)
Or you can use re with pandas.Series.explode in a list comprehension:
pattern = r'^([a-zA-Z\s]+):'
rows = [{"Speaker": re.match(pattern, line).group(1).strip(),
         "Transcript": line.split(":", 1)[1].strip()}
        for line in df["body"].str.split("\n").explode().tolist()
        if re.match(pattern, line)]
new_df = pd.DataFrame(rows)
print(new_df)
Speaker | Transcript |
---|---|
Helper | Hi, I'm Helper, Virtual Assistant, how can I help you today? Are you inquiring about:eBooksAudiobooks Purchasing Subscriptions Movies etc |
Cx said | Movies |
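If the body is laid out as in the question, with each speaker label on its own line, a regex-only variation may be easier to reason about. The sketch below uses `re.split` with a capturing group so the labels are kept in the result. The `SPEAKERS` list and the `turns` helper are assumptions for illustration; the labels would need to match the real data.

```python
import re

# Assumed speaker labels, taken from the sample in the question.
SPEAKERS = ["Helper", "Cx said"]

# Match a known label at the start of a line, followed by a colon.
labels = "|".join(re.escape(name) for name in SPEAKERS)
pattern = re.compile(rf"^({labels}):[ \t]*", flags=re.MULTILINE)

def turns(body):
    """Split body into one dict per speaker turn."""
    # With one capturing group, re.split returns
    # [text before first label, label1, text1, label2, text2, ...].
    parts = pattern.split(body)
    return [
        {"Speaker": speaker, "Transcript": " ".join(text.split())}
        for speaker, text in zip(parts[1::2], parts[2::2])
    ]

body = (
    "Helper:\n"
    "Hi, I'm Helper, how can I help you today?\n"
    "Are you inquiring about:\n"
    "eBooks\n"
    "Movies\n"
    "etc\n"
    "Cx said:\n"
    "Movies"
)
rows = turns(body)
print(rows)
```

The resulting list of dicts can be passed to pd.DataFrame(rows), as in the second snippet of this answer, to get the Speaker and Transcript columns.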