在SQL Server 2016中解析大型表格中的HTML数据

huangapple go评论68阅读模式
英文:

Parsing HTML data in SQL Server 2016 for large table

问题

我正在尝试解析存在于我的一个表中的HTML数据,但问题是表的大小很大,我无法使用此处提供的函数。我只是想知道是否有一种有效的方法来解析来自大表的数据。

我尝试了这里提供的解决方案,它运行得很好,但无法用于大表。
https://blog.sqlauthority.com/2007/06/16/sql-server-udf-user-defined-function-to-strip-html-parse-html-no-regular-expression/

示例代码在这里:http://sqlfiddle.com/#!18/27baba/1

数据看起来像这样:

ID	Valuen
1	<t><some>this is the value</t></some>another value . <none> 3rd value </none>
2	<t><bold>Correct value </t></bold>new value . <color> salary </color>

示例输出:

ID 	Valuen
1	 this is the value another value 3rd value
2	 Correct value  new value . salary

请注意,我尝试过使用XML解析,但由于无法将所有数据转换为XML,所以它无法正常工作。任何帮助将不胜感激,谢谢。

英文:

I am trying to parse the HTML data which exists in one of my table but issue is the table size is huge and I can't use the function provided here. I was just wondering if there is any efficient way to parse the data from a huge table.

I tried the solution here and it works great but can't use for huge table.
https://blog.sqlauthority.com/2007/06/16/sql-server-udf-user-defined-function-to-strip-html-parse-html-no-regular-expression/

Sample code here: http://sqlfiddle.com/#!18/27baba/1

Data looks like this:

ID	Valuen
  1	&lt;t&gt;&lt;some&gt;this is the value&lt;/t&gt;&lt;/some&gt;another value . &lt;none&gt; 3rd value &lt;/none&gt;
  2	&lt;t&gt;&lt;bold&gt;Correct value &lt;/t&gt;&lt;/bold&gt;new value . &lt;color&gt; salary &lt;/color&gt;

Sample Output:

 ID 	Valuen
   1	 this is the value another value 3rd value
   2	 Correct value  new value . salary 

Please note that I tried using XML parsing as well but it does not work as its unable to convert all the data to XML.

Any help will be appreciated, thank you.

答案1

得分: 3

以下是翻译的内容:

如果要使用辅助函数

标准示例或dbFiddle

选择 A.ID
      ,B.*
 测试 A
交叉应用 [dbo].[tvf-Str-Extract-JSON](ValueN,'&gt;','&lt;') B

返回

ID	RetSeq	RetVal
1	1   	这是值
1	2   	另一个值。
1	3   	第三个值
2	1   	正确的值
2	2   	新的值。
2	3   	工资

与上述相同,只是添加了 string_agg()

选择 A.ID
      ,B.*
 测试 A
交叉应用 ( 选择 NewVal = string_agg(RetVal,' ') WITHIN GROUP (ORDER BY RetSeq) 
                 来自 [dbo].[tvf-Str-Extract-JSON](ValueN,'&gt;','&lt;')  
      ) B

结果

ID	NewVal
1	这是值 另一个值。  第三个值
2	正确的值  新的值。  工资

如果感兴趣的话,这是函数

创建函数 [dbo].[tvf-Str-Extract-JSON] (@String nvarchar(max),@Delim1 nvarchar(100),@Delim2 nvarchar(100))
返回表格
如下
返回 (  
    
    选择 RetSeq = 行编号() over ( RetSeq 排序)
          ,RetVal = left(RetVal,charindex(@Delim2,RetVal)-1)
     (
            选择 RetSeq = [Key]+1
                  ,RetVal = trim(Value)
         来自 OpenJSON( N'[&quot;'+replace(string_escape(@String,'json'),@Delim1,'&quot;,&quot;')+N'&quot;]' )

          ) C1
      在哪里 charindex(@Delim2,RetVal)>1

)
英文:

If open to a helper function

Normalized Example or dbFiddle

Select A.ID
      ,B.*
 From  test A
 Cross Apply   [dbo].[tvf-Str-Extract-JSON](ValueN,&#39;&gt;&#39;,&#39;&lt;&#39;) B

Returns

ID	RetSeq	RetVal
1	1   	this is the value
1	2   	another value .
1	3   	3rd value
2	1   	Correct value
2	2   	new value .
2	3   	salary

Same as Above, Just Added a string_agg()

Select A.ID
      ,B.*
 From  test A
 Cross Apply  ( Select NewVal = string_agg(RetVal,&#39; &#39;) WITHIN GROUP (ORDER BY RetSeq) 
                  From [dbo].[tvf-Str-Extract-JSON](ValueN,&#39;&gt;&#39;,&#39;&lt;&#39;)  
              )B

Results

ID	NewVal
1	this is the value another value .   3rd value
2	Correct value  new value .   salary

The Function if Interested

CREATE FUNCTION [dbo].[tvf-Str-Extract-JSON] (@String nvarchar(max),@Delim1 nvarchar(100),@Delim2 nvarchar(100))
Returns Table 
As
Return (  

	Select RetSeq = row_number() over (order by RetSeq)
		  ,RetVal = left(RetVal,charindex(@Delim2,RetVal)-1)
	From  (
			Select RetSeq = [Key]+1
				  ,RetVal = trim(Value)
			 From  OpenJSON( N&#39;[&quot;&#39;+replace(string_escape(@String,&#39;json&#39;),@Delim1,&#39;&quot;,&quot;&#39;)+N&#39;&quot;]&#39; )

		  ) C1
	Where charindex(@Delim2,RetVal)&gt;1

)

答案2

得分: 1

请尝试在您的数据集上使用以下解决方案。

SQL

-- DDL and sample data population, start
DECLARE @tbl TABLE (ID int, tokens VARCHAR (500)); 
INSERT INTO @tbl(id, tokens) VALUES 
(1, '<t><some>this is the value</t></some>another value .  <none> 3rd value </none>'),
(2, '<t><bold>Correct value </t></bold>new value .  <color> salary </color>');
-- DDL and sample data population, end

SELECT t.ID
	, Result = STRING_AGG(LEFT(token, CHARINDEX('<', token + '<') - 1), SPACE(1))
FROM @tbl AS t
    CROSS APPLY (SELECT TRY_CAST('<root><r><![CDATA[' + 
        REPLACE(tokens, '>', ']]></r><r><![CDATA[') + 
        ']]></r></root>' AS XML)) AS t1(x)
	CROSS APPLY x.nodes('/root/r[text()][not(substring(text()[1],1,1)="<")]/text()') AS t2(c)
	CROSS APPLY (SELECT c.value('.','VARCHAR(500)')) AS t3(token)
GROUP BY ID;

Output

ID Result
1 this is the value another value . 3rd value
2 Correct value new value . salary
英文:

Please try the following solution on your data set.

SQL

-- DDL and sample data population, start
DECLARE @tbl TABLE (ID int, tokens VARCHAR (500)); 
INSERT INTO @tbl(id, tokens) VALUES 
(1, &#39;&lt;t&gt;&lt;some&gt;this is the value&lt;/t&gt;&lt;/some&gt;another value .  &lt;none&gt; 3rd value &lt;/none&gt;&#39;),
(2, &#39;&lt;t&gt;&lt;bold&gt;Correct value &lt;/t&gt;&lt;/bold&gt;new value .  &lt;color&gt; salary &lt;/color&gt;&#39;);
-- DDL and sample data population, end

SELECT t.ID
	, Result = STRING_AGG(LEFT(token, CHARINDEX(&#39;&lt;&#39;, token + &#39;&lt;&#39;) - 1), SPACE(1))
FROM @tbl AS t
    CROSS APPLY (SELECT TRY_CAST(&#39;&lt;root&gt;&lt;r&gt;&lt;![CDATA[&#39; + 
        REPLACE(tokens, &#39;&gt;&#39;, &#39;]]&gt;&lt;/r&gt;&lt;r&gt;&lt;![CDATA[&#39;) + 
        &#39;]]&gt;&lt;/r&gt;&lt;/root&gt;&#39; AS XML)) AS t1(x)
	CROSS APPLY x.nodes(&#39;/root/r[text()][not(substring(text()[1],1,1)=&quot;&lt;&quot;)]/text()&#39;) AS t2(c)
	CROSS APPLY (SELECT c.value(&#39;.&#39;,&#39;VARCHAR(500)&#39;)) AS t3(token)
GROUP BY ID;

Output

ID Result
1 this is the value another value . 3rd value
2 Correct value new value . salary

huangapple
  • 本文由 发表于 2023年4月7日 03:14:51
  • 转载请务必保留本文链接:https://go.coder-hub.com/75953002.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定