Iterate over rows in a data frame to search by regular expressions

Question

I am trying to extract SQL table names from queries using regular expressions. For a single query, this can be done with re.findall:

import re
Query = ["SELECT * FROM   WS_DE_Staging.stage_dual_h_20"]
for xx in Query:
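    # Group the keywords so that \s+ and the capture group apply to all of them; 0-9 allows digits in names
    r1 = re.findall(r"(?:FROM|JOIN|from|join)\s+([A-Za-z0-9_.\[\]]+)", xx)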
    print(r1)

In phase 2, I need to apply this to a table that holds all the report names and their SQL queries, so I am using pandas to read the CSV into a data frame.

I don't know how to take the next step: how can I run the re.findall expression above over every row?

import re
Query = ["SELECT * FROM   WS_JE_Staging.stage_dual_h_20", "Select * from DummyEmployee"]
for xx in Query:
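    # Group the keywords so that \s+ and the capture group apply to all of them; 0-9 allows digits in names
    r1 = re.findall(r"(?:FROM|JOIN|from|join)\s+([A-Za-z0-9_.\[\]]+)", xx)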
    print(r1)

import pandas as pd
df = pd.read_csv("SqlQ.csv", delimiter=',')

print(df.index)

Answer 1

Score: 0

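The regex pattern can be simplified with re.IGNORECASE. Extracting the table names directly from the data frame with the .str methods is better than iterating over the rows. Note that str.extract returns only the first match in each query.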

import pandas as pd
import re
pattern = r"(?<=from|join)\s+([\da-z_.\[\]]+)"
df = pd.read_csv(r"d:\temp\SqlQ.csv", delimiter=',')
df['table'] = df['sql'].str.extract(pattern, flags=re.IGNORECASE)
print(df)
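
To try this without a CSV file, here is a minimal sketch on an in-memory data frame. The sample rows and the column names "report" and "sql" are assumptions standing in for SqlQ.csv. The pattern is a variation, not the one from the answer: it uses a non-capturing group instead of a lookbehind, and \w instead of [\da-z_], which also accepts non-ASCII letters.

```python
import re
import pandas as pd

# Hypothetical stand-in for SqlQ.csv; the column names "report" and "sql" are assumptions.
df = pd.DataFrame({
    "report": ["r1", "r2", "r3"],
    "sql": [
        "SELECT * FROM   WS_DE_Staging.stage_dual_h_20",
        "Select * from DummyEmployee",
        "SELECT 1",
    ],
})

# (?:...) groups the keywords so that \s+ and the capture group apply to both of them.
pattern = r"\b(?:from|join)\s+([\w.\[\]]+)"

# expand=False returns a Series holding the first match per row (NaN when nothing matches).
df["table"] = df["sql"].str.extract(pattern, flags=re.IGNORECASE, expand=False)
print(df)
```

The third row has no FROM or JOIN clause, so its table value is NaN rather than an empty string.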

To get every table in each query without duplicates, use str.findall and then set():

df['table'] = df['sql'].str.findall(pattern, flags=re.IGNORECASE)
df['table'] = df['table'].apply(lambda x: list(set(x)))
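
One possible follow-up, sketched under assumptions: set() does not preserve the order of the tables, and it raises an error on rows where the query is missing (NaN). The variation below uses a hypothetical helper, tables_in, that keeps first-appearance order with dict.fromkeys and returns an empty list for non-string cells. It then uses DataFrame.explode to produce one row per report and table. The sample data and the column names "report" and "sql" are made up for illustration.

```python
import re
import pandas as pd

# Hypothetical data; in practice this would come from pd.read_csv("SqlQ.csv").
df = pd.DataFrame({
    "report": ["sales", "staff"],
    "sql": [
        "SELECT * FROM orders o JOIN customers c ON o.cid = c.id "
        "JOIN orders o2 ON o2.id = o.id",
        "Select * from DummyEmployee",
    ],
})

# Compile once; the non-capturing group keeps the alternation away from \s+.
pattern = re.compile(r"\b(?:from|join)\s+([\w.\[\]]+)", flags=re.IGNORECASE)


def tables_in(query):
    """Return the distinct table names in one query, in order of first appearance."""
    if not isinstance(query, str):  # e.g. NaN from an empty CSV cell
        return []
    # dict.fromkeys drops duplicates while keeping insertion order.
    return list(dict.fromkeys(pattern.findall(query)))


df["tables"] = df["sql"].apply(tables_in)

# One row per (report, table) pair.
per_table = (
    df[["report", "tables"]]
    .explode("tables")
    .rename(columns={"tables": "table"})
)
print(per_table)
```

The exploded frame is convenient for grouping or counting, for example to see which reports depend on a given table.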

huangapple
  • Posted on August 9, 2023, 17:41:56
  • Please keep this link when reposting: https://go.coder-hub.com/76866464.html