March 12, 2023 08:55:05

JavaScript - Scans Text Corresponding to a Certain Label

Question


I have the text below:

This is a code update
* Official Name:  Noner
* Pub: https://content.upcodes.co/viewer/washington/wa-mechanical-code-2021
* Agency:  Agency Ni
* Reference: 
https://web.archive.org/web/20230226234118/https://lawfilesext.leg.wa.gov/law/wsr/agency/BuildingCodeCouncil.htm
https://web.archive.org/web/20230303022030/https://lawfilesext.leg.wa.gov/law/wsr/2023/02/23-02-055.htm (#1)
* Citation: WAC 51-52 / WSR 23-02-055
* Draft Doc Title: 
 WSR 23-02-055 (#1)
* Draft Source Doc: https://web.archive.org/web/20230303022030/https://lawfilesext.leg.wa.gov/law/wsr/2023/02/23-02-055.htm (#1)
* Draft Drive: https://drive.google.com/file/d/1pYmwQS3t-ZX-Vyg9yBabtIpXZ7By2G6f/view?usp=share_link ( #1)
* Final Doc Title: 
   IECC Com Update(#1)
   IECC Res Update (#2)
   IECC Res Update (#3)
* Final Source Doc: 
  https://web.archive.org/web/20230303022130/https://apps.leg.wa.gov/wac/default.aspx?cite=51-52&full=true&pdf=true (#1)
 https://web.archive.org/web/20230303022030/https://lawfilesext.leg.wa.gov/law/wsr/2023/02/23-02-055.htm (#2)
* Final Drive: https://web.archive.org/web/20230303022130/https://apps.leg.wa.gov/wac/default.aspx?cite=51-52&full=true&pdf=true (#1)
https://web.archive.org/web/2023030302fdfdfg2130/https://apps.legfdg.gov/wac/default.aspx?cite=51-52&fdsfullfdsf=true&pfdsfdf=true  (#2)
* Effective Date: January 4, 2023

I want to extract the information corresponding to the tag 'Reference:' but the code below only gives me one line. I want to scan all text until it encounters the asterisk symbol.

//Extract Reference
var reference = description.search("Reference:");
if(reference != -1){
  reference = description.match(/(?<=^\* Reference\s*:)[\s]*[\n]*[^\n\r]*/m);
  reference  = reference?.[0].trim();
}else{
  reference = '';
}
console.log('Reference: ' + reference);

Expected Output:

https://web.archive.org/web/20230226234118/https://lawfilesext.leg.wa.gov/law/wsr/agency/BuildingCodeCouncil.htm
https://web.archive.org/web/20230303022030/https://lawfilesext.leg.wa.gov/law/wsr/2023/02/23-02-055.htm (#1)

Answer 1

Score: 3


I decided to follow @Nick's idea of not making any assumptions about the subject string.

I produced two patterns that are lenient in the sense that they still work:

when there's no Reference item (returning an empty string),
when the Reference item has empty content,
and when the Reference item content is at the end of the string (and so is not followed by another item).

The first works in all cases, whatever the content:

let ref_pat = /^\* Reference:\s*(.*\S(?:\s+.*\S)*?)??\s*(?:^\*|(?![\s\S]))/m;
let reference = description.match(ref_pat)?.[1] ?? '';

A second, more efficient pattern is possible if you assume the content doesn't contain asterisk characters:

let ref_pat = /^\* Reference:\s*([^*]*[^*\s])/m;
let reference = description.match(ref_pat)?.[1] ?? '';

That assumption is its only limitation, but this pattern is far simpler.

Whichever one you choose, the result is already trimmed.
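
As a quick check of the three cases listed in this answer, here is a minimal sketch that wraps both patterns in small helpers and runs them on short inputs. The helper names (`extractAny`, `extractNoStar`) and the `a.example` sample strings are made up for illustration and are not part of the original answer.

```javascript
// Pattern 1: makes no assumption about the content.
const REF_ANY = /^\* Reference:\s*(.*\S(?:\s+.*\S)*?)??\s*(?:^\*|(?![\s\S]))/m;
// Pattern 2: assumes the content never contains an asterisk.
const REF_NO_STAR = /^\* Reference:\s*([^*]*[^*\s])/m;

// Hypothetical helpers: return the captured content, or '' when nothing matches.
const extractAny = (s) => s.match(REF_ANY)?.[1] ?? '';
const extractNoStar = (s) => s.match(REF_NO_STAR)?.[1] ?? '';

// Made-up inputs covering the normal case plus the three lenient cases.
const samples = {
  normal: '* Agency: A\n* Reference: \nhttps://a.example/1\nhttps://a.example/2 (#1)\n* Citation: C\n',
  missing: '* Agency: A\n* Citation: C\n',
  empty: '* Agency: A\n* Reference: \n* Citation: C\n',
  last: '* Agency: A\n* Reference: \nhttps://a.example/1\n',
};

for (const [name, s] of Object.entries(samples)) {
  console.log(name, JSON.stringify(extractAny(s)), JSON.stringify(extractNoStar(s)));
}
```

On these inputs both helpers should agree; they would only differ if the Reference content itself contained an asterisk.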

Answer 2

Score: 2


You could use this regex:

(?:^|\n)\*\s*Reference:\s*([\s\S]*?)(?=\s*\n\*|$)

Which matches:

(?:^|\n)\*\s*Reference:\s* : * Reference: at the beginning of a line
([\s\S]*?) : a minimal number of characters, captured in group 1
(?=\s*\n\*|$) : a positive lookahead for optional whitespace, a newline and a *, or the end of the string (there is no m flag, so $ does not match at line ends)

Regex demo on regex101

const text = `* Agency:  Agency Ni
* Reference: 
https://web.archive.org/web/20230226234118/https://lawfilesext.leg.wa.gov/law/wsr/agency/BuildingCodeCouncil.htm
https://web.archive.org/web/20230303022030/https://lawfilesext.leg.wa.gov/law/wsr/2023/02/23-02-055.htm (#1)
* Citation: WAC 51-52 / WSR 23-02-055
`
const reference = text.match(/(?:^|\n)\*\s*Reference:\s*([\s\S]*?)(?=\s*\n\*|$)/)?.[1] ?? ''
console.log(reference)
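
One detail that may be worth checking with this pattern: because it has no `m` flag, `$` only matches at the very end of the input. If the Reference item is the last one and the text ends with a newline, that newline ends up in group 1. The sketch below shows this on a made-up input and trims the capture; the helper name `extractReference` is hypothetical.

```javascript
const REF = /(?:^|\n)\*\s*Reference:\s*([\s\S]*?)(?=\s*\n\*|$)/;

// Hypothetical helper: trim the capture so a trailing newline is dropped.
function extractReference(text) {
  return (text.match(REF)?.[1] ?? '').trim();
}

// Made-up input where Reference is the last item and the text ends with "\n".
const lastItem = '* Agency:  Agency Ni\n* Reference: \nhttps://a.example/1\n';

const rawCapture = lastItem.match(REF)[1];
console.log(JSON.stringify(rawCapture));                 // "https://a.example/1\n"
console.log(JSON.stringify(extractReference(lastItem))); // "https://a.example/1"
```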

Answer 3

Score: 2


In JavaScript, with lookarounds:

(?<=^\* Reference\s*:\s*)\S[^]*?(?=^\s*\*)

Explanation

(?<= Positive lookbehind, assert that from the current position to the left is:
- ^\* Reference\s*:\s* Match * Reference followed by : with optional whitespace chars on either side of the colon, at the start of a line (the m flag makes ^ match at line starts)
) Close the lookbehind
\S Match a non-whitespace char
[^]*? Match any character including newlines, as few as possible
(?= Positive lookahead, assert that to the right is:
- ^\s*\* Match optional whitespace chars followed by * at the start of a line
) Close the lookahead

See a regex demo.

const regex = /(?<=^\* Reference\s*:\s*)\S[^]*?(?=^\s*\*)/gm;
const s = `This is a code update
* Official Name:  Noner
* Pub: https://content.upcodes.co/viewer/washington/wa-mechanical-code-2021
* Agency:  Agency Ni
* Reference: 
https://web.archive.org/web/20230226234118/https://lawfilesext.leg.wa.gov/law/wsr/agency/BuildingCodeCouncil.htm
https://web.archive.org/web/20230303022030/https://lawfilesext.leg.wa.gov/law/wsr/2023/02/23-02-055.htm (#1)
* Citation: WAC 51-52 / WSR 23-02-055
* Draft Doc Title: 
 WSR 23-02-055 (#1)
* Draft Source Doc: https://web.archive.org/web/20230303022030/https://lawfilesext.leg.wa.gov/law/wsr/2023/02/23-02-055.htm (#1)
* Draft Drive: https://drive.google.com/file/d/1pYmwQS3t-ZX-Vyg9yBabtIpXZ7By2G6f/view?usp=share_link ( #1)
* Final Doc Title: 
   IECC Com Update(#1)
   IECC Res Update (#2)
   IECC Res Update (#3)
* Final Source Doc: 
  https://web.archive.org/web/20230303022130/https://apps.leg.wa.gov/wac/default.aspx?cite=51-52&full=true&pdf=true (#1)
 https://web.archive.org/web/20230303022030/https://lawfilesext.leg.wa.gov/law/wsr/2023/02/23-02-055.htm (#2)
* Final Drive: https://web.archive.org/web/20230303022130/https://apps.leg.wa.gov/wac/default.aspx?cite=51-52&full=true&pdf=true (#1)
https://web.archive.org/web/2023030302fdfdfg2130/https://apps.legfdg.gov/wac/default.aspx?cite=51-52&fdsfullfdsf=true&pfdsfdf=true  (#2)
* Effective Date: January 4, 2023
`;
const m = s.match(regex);
if (m) console.log(m[0]);
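
Two properties of this pattern seem worth noting. The match runs up to the start of the next `*` line, so it keeps the trailing newline, and the lookahead requires a following `*` line, so nothing matches when Reference is the last item. A minimal sketch, assuming made-up `a.example` inputs and a hypothetical `extractReference` wrapper that trims and falls back to an empty string:

```javascript
const REF = /(?<=^\* Reference\s*:\s*)\S[^]*?(?=^\s*\*)/m;

// Hypothetical wrapper: '' when there is no match, trimmed content otherwise.
function extractReference(text) {
  return (text.match(REF)?.[0] ?? '').trim();
}

// Made-up input with a following item.
const s = 'This is a code update\n* Agency:  Agency Ni\n* Reference: \nhttps://a.example/1\nhttps://a.example/2 (#1)\n* Citation: C\n';
// Made-up input where Reference is the last item (no later "*" line).
const lastItem = '* Reference: \nhttps://a.example/1\n';

const raw = s.match(REF)?.[0] ?? '';
console.log(JSON.stringify(raw));                        // raw match, newline included
console.log(JSON.stringify(extractReference(s)));        // trimmed
console.log(JSON.stringify(extractReference(lastItem))); // no following "*" line
```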

Answer 4

Score: 1


You can simply use lookarounds:

(?<=Reference: )(.|\n)*?(?=\*)

Then trim the output.

Code example or playground:

const text = `This is a code update
* Official Name:  Noner
* Pub: https://content.upcodes.co/viewer/washington/wa-mechanical-code-2021
* Agency:  Agency Ni
* Reference: 
https://web.archive.org/web/20230226234118/https://lawfilesext.leg.wa.gov/law/wsr/agency/BuildingCodeCouncil.htm
https://web.archive.org/web/20230303022030/https://lawfilesext.leg.wa.gov/law/wsr/2023/02/23-02-055.htm (#1)
* Citation: WAC 51-52 / WSR 23-02-055
* Draft Doc Title: 
 WSR 23-02-055 (#1)
* Draft Source Doc: https://web.archive.org/web/20230303022030/https://lawfilesext.leg.wa.gov/law/wsr/2023/02/23-02-055.htm (#1)
* Draft Drive: https://drive.google.com/file/d/1pYmwQS3t-ZX-Vyg9yBabtIpXZ7By2G6f/view?usp=share_link ( #1)
* Final Doc Title: 
   IECC Com Update(#1)
   IECC Res Update (#2)
   IECC Res Update (#3)
* Final Source Doc: 
  https://web.archive.org/web/20230303022130/https://apps.leg.wa.gov/wac/default.aspx?cite=51-52&full=true&pdf=true (#1)
 https://web.archive.org/web/20230303022030/https://lawfilesext.leg.wa.gov/law/wsr/2023/02/23-02-055.htm (#2)
* Final Drive: https://web.archive.org/web/20230303022130/https://apps.leg.wa.gov/wac/default.aspx?cite=51-52&full=true&pdf=true (#1)
https://web.archive.org/web/2023030302fdfdfg2130/https://apps.legfdg.gov/wac/default.aspx?cite=51-52&fdsfullfdsf=true&pfdsfdf=true  (#2)
* Effective Date: January 4, 2023`
const patt = /(?<=Reference: )(.|\n)*?(?=\*)/gm
console.log(text.match(patt)?.[0].trim())
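
Note that this lookahead stops at the first asterisk anywhere, including one inside the content. The sketch below illustrates that on a made-up input and compares it with a variation that only stops at an asterisk starting a line. The `anchored` pattern and the sample string are assumptions added here for illustration, not part of the original answer.

```javascript
// Pattern from the answer (without the g flag, which is not needed for one match).
const patt = /(?<=Reference: )(.|\n)*?(?=\*)/;
// Assumed variation: with the m flag, ^\* only matches an asterisk at a line start.
const anchored = /(?<=Reference: )[\s\S]*?(?=^\*)/m;

// Made-up input whose content contains an asterisk in the middle of a line.
const sample = '* Reference: \nhttps://a.example/1 (see note *2)\n* Citation: C';

const cut = sample.match(patt)?.[0].trim();
const whole = sample.match(anchored)?.[0].trim();
console.log(JSON.stringify(cut));   // stops before "*2"
console.log(JSON.stringify(whole)); // keeps the full line
```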

Answer 5

Score: 1


Avoid regex here; it reduces readability and performance.

In your case, the input can be very long, but the target pattern is quite simple.

The idea is to find '* Reference' and extract everything up to the next '*'.

So we can use string.indexOf and string.substring.


const text = `
* Official Name:  Noner
* Pub: https://content.upcodes.co/viewer/washington/wa-mechanical-code-2021
* Agency:  Agency Ni
* Reference: 
https://web.archive.org/web/20230226234118/https://lawfilesext.leg.wa.gov/law/wsr/agency/BuildingCodeCouncil.htm
https://web.archive.org/web/20230303022030/https://lawfilesext.leg.wa.gov/law/wsr/2023/02/23-02-055.htm (#1)
* Citation: WAC 51-52 / WSR 23-02-055
* Draft Doc Title: 
 WSR 23-02-055 (#1)
* Draft Source Doc: https://web.archive.org/web/20230303022030/https://lawfilesext.leg.wa.gov/law/wsr/2023/02/23-02-055.htm (#1)
* Draft Drive: https://drive.google.com/file/d/1pYmwQS3t-ZX-Vyg9yBabtIpXZ7By2G6f/view?usp=share_link ( #1)
* Final Doc Title: 
   IECC Com Update(#1)
   IECC Res Update (#2)
   IECC Res Update (#3)
* Final Source Doc: 
  https://web.archive.org/web/20230303022130/https://apps.leg.wa.gov/wac/default.aspx?cite=51-52&full=true&pdf=true (#1)
 https://web.archive.org/web/20230303022030/https://lawfilesext.leg.wa.gov/law/wsr/2023/02/23-02-055.htm (#2)
* Final Drive: https://web.archive.org/web/20230303022130/https://apps.leg.wa.gov/wac/default.aspx?cite=51-52&full=true&pdf=true (#1)
https://web.archive.org/web/2023030302fdfdfg2130/https://apps.legfdg.gov/wac/default.aspx?cite=51-52&fdsfullfdsf=true&pfdsfdf=true  (#2)
* Effective Date: January 4, 2023
`;
 
const pre = text.indexOf('* Reference');
const start = text.indexOf('h', pre);
const end = text.indexOf('*', start);
const slice = text.substring(start, end);
console.log({ slice });
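
The snippet above relies on the content starting with the letter 'h' and on the Reference item being present. A slightly more defensive sketch of the same indexOf/substring idea follows. The helper name `extractItem`, the default label, and the choice of "\n*" as the terminator are assumptions based on the layout shown in the question.

```javascript
// Hypothetical helper: returns the trimmed content of one "* Label:" item.
// Assumes each item starts on its own line with '*'.
function extractItem(text, label = '* Reference:') {
  const pre = text.indexOf(label);
  if (pre === -1) return '';               // label not present
  const start = pre + label.length;        // content begins right after the label
  const next = text.indexOf('\n*', start); // next item: a newline followed by '*'
  const end = next === -1 ? text.length : next; // last item runs to the end
  return text.substring(start, end).trim();
}

// Made-up, shortened input in the same layout as the question.
const sample = '\n* Agency:  Agency Ni\n* Reference: \nhttps://a.example/1\nhttps://a.example/2 (#1)\n* Citation: C\n';
console.log(extractItem(sample));
```

Because the label is searched with a plain indexOf, it is not anchored to the start of a line; text such as "see * Reference:" in the middle of another item would also be picked up.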

Answer 6

Score: 0

Try using this:

const match = description.match(/(?<=\* Reference\s*:)[\s\S]*?\*/)[0].slice(0, -1);
console.log(match)

This regex uses the lookbehind you used in your question, but then matches any character (including newlines) up to the next * character. I used .slice() to drop the last character because the regex also matches that final * character.

I benchmarked this code against the other answers and found that it was the fastest by about 50% (see the benchmark at https://jsbench.me/udlf4pcck5/1).
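
As written, the one-liner throws a TypeError when there is no Reference item, because `match` returns null. A minimal sketch of a guarded variation, which swaps the lookbehind and `.slice()` for a capture group and also accepts the end of the string as a terminator; this is an assumed alternative with a hypothetical helper name, not the benchmarked code:

```javascript
// Hypothetical helper: capture everything between "* Reference:" and the
// next "*" (or the end of the string), then trim it.
function extractReference(description) {
  const m = description.match(/\* Reference\s*:([\s\S]*?)(?:\*|$)/);
  return m ? m[1].trim() : '';
}

// Made-up inputs.
const withItem = '* Agency: A\n* Reference: \nhttps://a.example/1\n* Citation: C';
const withoutItem = '* Agency: A\n* Citation: C';

console.log(JSON.stringify(extractReference(withItem)));
console.log(JSON.stringify(extractReference(withoutItem)));
```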
