英文:
Parsing HTML data in SQL Server 2016 for large table
问题
我正在尝试解析存在于我的一个表中的HTML数据,但问题是表的大小很大,我无法使用此处提供的函数。我只是想知道是否有一种有效的方法来解析来自大表的数据。
我尝试了这里提供的解决方案,它运行得很好,但无法用于大表。
https://blog.sqlauthority.com/2007/06/16/sql-server-udf-user-defined-function-to-strip-html-parse-html-no-regular-expression/
示例代码在这里:http://sqlfiddle.com/#!18/27baba/1
数据看起来像这样:
ID Valuen
1 <t><some>this is the value</t></some>another value . <none> 3rd value </none>
2 <t><bold>Correct value </t></bold>new value . <color> salary </color>
示例输出:
ID Valuen
1 this is the value another value 3rd value
2 Correct value new value . salary
请注意,我尝试过使用XML解析,但由于无法将所有数据转换为XML,所以它无法正常工作。任何帮助将不胜感激,谢谢。
英文:
I am trying to parse the HTML data which exists in one of my table but issue is the table size is huge and I can't use the function provided here. I was just wondering if there is any efficient way to parse the data from a huge table.
I tried the solution here and it works great but can't use for huge table.
https://blog.sqlauthority.com/2007/06/16/sql-server-udf-user-defined-function-to-strip-html-parse-html-no-regular-expression/
Sample code here: http://sqlfiddle.com/#!18/27baba/1
Data looks like this:
ID Valuen
1 <t><some>this is the value</t></some>another value . <none> 3rd value </none>
2 <t><bold>Correct value </t></bold>new value . <color> salary </color>
Sample Output:
ID Valuen
1 this is the value another value 3rd value
2 Correct value new value . salary
Please note that I tried using XML parsing as well but it does not work as its unable to convert all the data to XML.
Any help will be appreciated, thank you.
答案1
得分: 3
以下是翻译的内容:
如果要使用辅助函数
标准示例或dbFiddle
选择 A.ID
,B.*
从 测试 A
交叉应用 [dbo].[tvf-Str-Extract-JSON](ValueN,'>','<') B
返回
ID RetSeq RetVal
1 1 这是值
1 2 另一个值。
1 3 第三个值
2 1 正确的值
2 2 新的值。
2 3 工资
与上述相同,只是添加了 string_agg()
选择 A.ID
,B.*
从 测试 A
交叉应用 ( 选择 NewVal = string_agg(RetVal,' ') WITHIN GROUP (ORDER BY RetSeq)
来自 [dbo].[tvf-Str-Extract-JSON](ValueN,'>','<')
) B
结果
ID NewVal
1 这是值 另一个值。 第三个值
2 正确的值 新的值。 工资
如果感兴趣的话,这是函数
创建函数 [dbo].[tvf-Str-Extract-JSON] (@String nvarchar(max),@Delim1 nvarchar(100),@Delim2 nvarchar(100))
返回表格
如下
返回 (
选择 RetSeq = 行编号() over (按 RetSeq 排序)
,RetVal = left(RetVal,charindex(@Delim2,RetVal)-1)
从 (
选择 RetSeq = [Key]+1
,RetVal = trim(Value)
来自 OpenJSON( N'["'+replace(string_escape(@String,'json'),@Delim1,'","')+N'"]' )
) C1
在哪里 charindex(@Delim2,RetVal)>1
)
英文:
If open to a helper function
Normalized Example or dbFiddle
Select A.ID
,B.*
From test A
Cross Apply [dbo].[tvf-Str-Extract-JSON](ValueN,'>','<') B
Returns
ID RetSeq RetVal
1 1 this is the value
1 2 another value .
1 3 3rd value
2 1 Correct value
2 2 new value .
2 3 salary
Same as Above, Just Added a string_agg()
Select A.ID
,B.*
From test A
Cross Apply ( Select NewVal = string_agg(RetVal,' ') WITHIN GROUP (ORDER BY RetSeq)
From [dbo].[tvf-Str-Extract-JSON](ValueN,'>','<')
)B
Results
ID NewVal
1 this is the value another value . 3rd value
2 Correct value new value . salary
The Function if Interested
CREATE FUNCTION [dbo].[tvf-Str-Extract-JSON] (@String nvarchar(max),@Delim1 nvarchar(100),@Delim2 nvarchar(100))
Returns Table
As
Return (
Select RetSeq = row_number() over (order by RetSeq)
,RetVal = left(RetVal,charindex(@Delim2,RetVal)-1)
From (
Select RetSeq = [Key]+1
,RetVal = trim(Value)
From OpenJSON( N'["'+replace(string_escape(@String,'json'),@Delim1,'","')+N'"]' )
) C1
Where charindex(@Delim2,RetVal)>1
)
答案2
得分: 1
请尝试在您的数据集上使用以下解决方案。
SQL
-- DDL and sample data population, start
DECLARE @tbl TABLE (ID int, tokens VARCHAR (500));
INSERT INTO @tbl(id, tokens) VALUES
(1, '<t><some>this is the value</t></some>another value . <none> 3rd value </none>'),
(2, '<t><bold>Correct value </t></bold>new value . <color> salary </color>');
-- DDL and sample data population, end
SELECT t.ID
, Result = STRING_AGG(LEFT(token, CHARINDEX('<', token + '<') - 1), SPACE(1))
FROM @tbl AS t
CROSS APPLY (SELECT TRY_CAST('<root><r><![CDATA[' +
REPLACE(tokens, '>', ']]></r><r><![CDATA[') +
']]></r></root>' AS XML)) AS t1(x)
CROSS APPLY x.nodes('/root/r[text()][not(substring(text()[1],1,1)="<")]/text()') AS t2(c)
CROSS APPLY (SELECT c.value('.','VARCHAR(500)')) AS t3(token)
GROUP BY ID;
Output
ID | Result |
---|---|
1 | this is the value another value . 3rd value |
2 | Correct value new value . salary |
英文:
Please try the following solution on your data set.
SQL
-- DDL and sample data population, start
DECLARE @tbl TABLE (ID int, tokens VARCHAR (500));
INSERT INTO @tbl(id, tokens) VALUES
(1, '<t><some>this is the value</t></some>another value . <none> 3rd value </none>'),
(2, '<t><bold>Correct value </t></bold>new value . <color> salary </color>');
-- DDL and sample data population, end
SELECT t.ID
, Result = STRING_AGG(LEFT(token, CHARINDEX('<', token + '<') - 1), SPACE(1))
FROM @tbl AS t
CROSS APPLY (SELECT TRY_CAST('<root><r><![CDATA[' +
REPLACE(tokens, '>', ']]></r><r><![CDATA[') +
']]></r></root>' AS XML)) AS t1(x)
CROSS APPLY x.nodes('/root/r[text()][not(substring(text()[1],1,1)="<")]/text()') AS t2(c)
CROSS APPLY (SELECT c.value('.','VARCHAR(500)')) AS t3(token)
GROUP BY ID;
Output
ID | Result |
---|---|
1 | this is the value another value . 3rd value |
2 | Correct value new value . salary |
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论