Javascript - Scans Text Corresponding to a Certain Label

Question


I have the text below:

This is a code update

* Official Name:  Noner


* Pub: https://content.upcodes.co/viewer/washington/wa-mechanical-code-2021

* Agency:  Agency Ni

* Reference: 

https://web.archive.org/web/20230226234118/https://lawfilesext.leg.wa.gov/law/wsr/agency/BuildingCodeCouncil.htm

https://web.archive.org/web/20230303022030/https://lawfilesext.leg.wa.gov/law/wsr/2023/02/23-02-055.htm (#1)

* Citation: WAC 51-52 / WSR 23-02-055

* Draft Doc Title: 

 WSR 23-02-055 (#1)

* Draft Source Doc: https://web.archive.org/web/20230303022030/https://lawfilesext.leg.wa.gov/law/wsr/2023/02/23-02-055.htm (#1)

* Draft Drive: https://drive.google.com/file/d/1pYmwQS3t-ZX-Vyg9yBabtIpXZ7By2G6f/view?usp=share_link ( #1)

* Final Doc Title: 

   IECC Com Update(#1)

   IECC Res Update (#2)

   IECC Res Update (#3)

* Final Source Doc: 
  https://web.archive.org/web/20230303022130/https://apps.leg.wa.gov/wac/default.aspx?cite=51-52&full=true&pdf=true (#1)
 https://web.archive.org/web/20230303022030/https://lawfilesext.leg.wa.gov/law/wsr/2023/02/23-02-055.htm (#2)

* Final Drive: https://web.archive.org/web/20230303022130/https://apps.leg.wa.gov/wac/default.aspx?cite=51-52&full=true&pdf=true (#1)

https://web.archive.org/web/2023030302fdfdfg2130/https://apps.legfdg.gov/wac/default.aspx?cite=51-52&fdsfullfdsf=true&pfdsfdf=true  (#2)

* Effective Date: January 4, 2023

I want to extract the information corresponding to the tag 'Reference:' but the code below only gives me one line. I want to scan all text until it encounters the asterisk symbol.

// Extract Reference
var reference = description.search("Reference:");
if (reference != -1) {
  reference = description.match(/(?<=^\* Reference\s*:)[\s]*[\n]*[^\n\r]*/m);
  reference = reference?.[0].trim();
} else {
  reference = '';
}
console.log('Reference: ' + reference);

Expected Output:

https://web.archive.org/web/20230226234118/https://lawfilesext.leg.wa.gov/law/wsr/agency/BuildingCodeCouncil.htm

https://web.archive.org/web/20230303022030/https://lawfilesext.leg.wa.gov/law/wsr/2023/02/23-02-055.htm (#1)

Answer 1

Score: 3

I decided to follow @Nick's idea of not making any assumptions about the 'subject' string.

I produced two approaches that are lenient in the sense that they work:

  • when there's no Reference item (returning an empty string),
  • when the Reference item has empty content,
  • and when the Reference item content is at the end of the string (and thus not followed by another item).

The first works in all cases, whatever the content:

let ref_pat = /^\* Reference:\s*(.*\S(?:\s+.*\S)*?)??\s*(?:^\*|(?![\s\S]))/m;
let reference = description.match(ref_pat)?.[1] ?? '';

A second, more efficient pattern is possible if you assume the content doesn't contain asterisk characters:

let ref_pat = /^\* Reference:\s*([^*]*[^*\s])/m;
let reference = description.match(ref_pat)?.[1] ?? '';

That assumption is its only drawback, and the pattern is far simpler.

Whichever one you choose, the result is already trimmed.
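
As a quick check of the three cases listed in this answer, here is a minimal sketch that runs the second pattern against small inputs. The helper name getReference and the sample strings (including the example.com URLs) are made up for illustration and are not part of the original answer.

```javascript
// Hypothetical helper wrapping the second (simpler) pattern from this answer.
const refPat = /^\* Reference:\s*([^*]*[^*\s])/m;
const getReference = (description) => description.match(refPat)?.[1] ?? '';

// Case 1: no Reference item at all.
const missing = getReference('* Agency: X\n\n* Citation: Y');

// Case 2: Reference item present but with empty content.
const empty = getReference('* Reference: \n\n* Citation: Y');

// Case 3: Reference content at the very end of the string.
const atEnd = getReference(
  '* Agency: X\n\n* Reference: \n\nhttps://example.com/a\n\nhttps://example.com/b (#1)\n'
);

console.log([missing, empty, atEnd]);
```

The first two calls should fall back to the empty string, and the third should return both lines without surrounding whitespace.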

Answer 2

Score: 2

You could use this regex:

(?:^|\n)\*\s*Reference:\s*([\s\S]*?)(?=\s*\n\*|$)

Which matches:

  • (?:^|\n)\*\s*Reference:\s* : * Reference: at the beginning of a line
  • ([\s\S]*?) : a minimal number of characters, captured in group 1
  • (?=\s*\n\*|$) : a positive lookahead for optional whitespace, a newline and a *, or for the end of the string (no m flag is used, so $ only matches there)

Regex demo on regex101

const text = `* Agency:  Agency Ni

* Reference: 

https://web.archive.org/web/20230226234118/https://lawfilesext.leg.wa.gov/law/wsr/agency/BuildingCodeCouncil.htm

https://web.archive.org/web/20230303022030/https://lawfilesext.leg.wa.gov/law/wsr/2023/02/23-02-055.htm (#1)

* Citation: WAC 51-52 / WSR 23-02-055
`

const reference = text.match(/(?:^|\n)\*\s*Reference:\s*([\s\S]*?)(?=\s*\n\*|$)/)?.[1] ?? ''

console.log(reference)
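
The same pattern can be built for any label, which may help if other items (Agency, Citation, and so on) need extracting too. The sketch below is only an illustration: extractItem and escapeRegExp are hypothetical helper names, and the shortened sample text uses made-up example.com URLs.

```javascript
// Hypothetical helper: escape regex metacharacters in a label.
const escapeRegExp = (s) => s.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');

// Build this answer's pattern for an arbitrary label and return group 1,
// or an empty string when the label is not found.
function extractItem(text, label) {
  const pattern = new RegExp(
    '(?:^|\\n)\\*\\s*' + escapeRegExp(label) + ':\\s*([\\s\\S]*?)(?=\\s*\\n\\*|$)'
  );
  return text.match(pattern)?.[1] ?? '';
}

const sample = `* Agency:  Agency Ni

* Reference: 

https://example.com/a

https://example.com/b (#1)

* Citation: WAC 51-52 / WSR 23-02-055
`;

const refItem = extractItem(sample, 'Reference');
const agencyItem = extractItem(sample, 'Agency');
const missingItem = extractItem(sample, 'Missing');
console.log({ refItem, agencyItem, missingItem });
```

Note that when an item is empty, the \s* after the colon consumes the newline that the lookahead needs, so this pattern appears to capture the following item instead; the patterns in Answer 1 handle that case.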

Answer 3

Score: 2

In JavaScript, with lookarounds:

(?<=^\* Reference\s*:\s*)\S[^]*?(?=^\s*\*)

Explanation

  • (?<= Positive lookbehind, assert that from the current position to the left is:
    • ^\* Reference\s*:\s* Match * Reference at the start of a line (the m flag is set), followed by : between optional whitespace chars
  • ) Close the lookbehind
  • \S Match a non-whitespace char
  • [^]*? Match any character including newlines, as few as possible
  • (?= Positive lookahead, assert that to the right is:
    • ^\s*\* Match the start of a line, then optional whitespace chars followed by *
  • ) Close the lookahead

See a regex demo.

const regex = /(?<=^\* Reference\s*:\s*)\S[^]*?(?=^\s*\*)/gm;
const s = `This is a code update

* Official Name:  Noner


* Pub: https://content.upcodes.co/viewer/washington/wa-mechanical-code-2021

* Agency:  Agency Ni

* Reference: 

https://web.archive.org/web/20230226234118/https://lawfilesext.leg.wa.gov/law/wsr/agency/BuildingCodeCouncil.htm

https://web.archive.org/web/20230303022030/https://lawfilesext.leg.wa.gov/law/wsr/2023/02/23-02-055.htm (#1)

* Citation: WAC 51-52 / WSR 23-02-055

* Draft Doc Title: 

 WSR 23-02-055 (#1)

* Draft Source Doc: https://web.archive.org/web/20230303022030/https://lawfilesext.leg.wa.gov/law/wsr/2023/02/23-02-055.htm (#1)

* Draft Drive: https://drive.google.com/file/d/1pYmwQS3t-ZX-Vyg9yBabtIpXZ7By2G6f/view?usp=share_link ( #1)

* Final Doc Title: 

   IECC Com Update(#1)

   IECC Res Update (#2)

   IECC Res Update (#3)

* Final Source Doc: 
  https://web.archive.org/web/20230303022130/https://apps.leg.wa.gov/wac/default.aspx?cite=51-52&full=true&pdf=true (#1)
 https://web.archive.org/web/20230303022030/https://lawfilesext.leg.wa.gov/law/wsr/2023/02/23-02-055.htm (#2)

* Final Drive: https://web.archive.org/web/20230303022130/https://apps.leg.wa.gov/wac/default.aspx?cite=51-52&full=true&pdf=true (#1)

https://web.archive.org/web/2023030302fdfdfg2130/https://apps.legfdg.gov/wac/default.aspx?cite=51-52&fdsfullfdsf=true&pfdsfdf=true  (#2)

* Effective Date: January 4, 2023
`;

const m = s.match(regex);
if (m) console.log(m[0]);
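
For engines that lack lookbehind support, the same idea can be sketched with a capture group instead. This is an adapted variant and not the answer's own pattern; the names groupRegex, input and referenceText, and the shortened sample text with example.com URLs, are assumptions made for illustration.

```javascript
// Variant of the pattern above: the lookbehind becomes a plain prefix and
// the content is read from capture group 1 (an adaptation, not the original).
const groupRegex = /^\* Reference\s*:\s*(\S[^]*?)(?=^\s*\*)/m;

const input = `* Agency:  Agency Ni

* Reference: 

https://example.com/a

https://example.com/b (#1)

* Citation: WAC 51-52 / WSR 23-02-055
`;

const found = input.match(groupRegex);
// The lazy match runs up to the start of the line holding the next "*",
// so it can end with a newline; trim() removes it.
const referenceText = found ? found[1].trim() : '';
console.log(referenceText);
```

Like the lookaround version, this variant needs a following item that starts with *, so a Reference item at the very end of the text would not match.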

Answer 4

Score: 1

You can simply use lookarounds:

(?<=Reference: )(.|\n)*?(?=\*)

Then trim the output.

Code example or playground:

const text = `This is a code update

* Official Name:  Noner


* Pub: https://content.upcodes.co/viewer/washington/wa-mechanical-code-2021

* Agency:  Agency Ni

* Reference: 

https://web.archive.org/web/20230226234118/https://lawfilesext.leg.wa.gov/law/wsr/agency/BuildingCodeCouncil.htm

https://web.archive.org/web/20230303022030/https://lawfilesext.leg.wa.gov/law/wsr/2023/02/23-02-055.htm (#1)

* Citation: WAC 51-52 / WSR 23-02-055

* Draft Doc Title: 

 WSR 23-02-055 (#1)

* Draft Source Doc: https://web.archive.org/web/20230303022030/https://lawfilesext.leg.wa.gov/law/wsr/2023/02/23-02-055.htm (#1)

* Draft Drive: https://drive.google.com/file/d/1pYmwQS3t-ZX-Vyg9yBabtIpXZ7By2G6f/view?usp=share_link ( #1)

* Final Doc Title: 

   IECC Com Update(#1)

   IECC Res Update (#2)

   IECC Res Update (#3)

* Final Source Doc: 
  https://web.archive.org/web/20230303022130/https://apps.leg.wa.gov/wac/default.aspx?cite=51-52&full=true&pdf=true (#1)
 https://web.archive.org/web/20230303022030/https://lawfilesext.leg.wa.gov/law/wsr/2023/02/23-02-055.htm (#2)

* Final Drive: https://web.archive.org/web/20230303022130/https://apps.leg.wa.gov/wac/default.aspx?cite=51-52&full=true&pdf=true (#1)

https://web.archive.org/web/2023030302fdfdfg2130/https://apps.legfdg.gov/wac/default.aspx?cite=51-52&fdsfullfdsf=true&pfdsfdf=true  (#2)

* Effective Date: January 4, 2023`

const patt = /(?<=Reference: )(.|\n)*?(?=\*)/gm
console.log(text.match(patt)?.[0].trim())
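
The (.|\n) alternation can also be written with the s (dotAll) flag, which lets . match newlines. The sketch below is an assumed variant rather than the answer's own code: it also drops the space after the colon from the lookbehind, since the original pattern requires that literal space and would miss a label followed directly by a newline. The names and the example.com sample text are illustrative.

```javascript
// Same lookarounds as above, but "." matches newlines thanks to the s flag.
// The lookbehind no longer requires a space after the colon (an assumption
// about how the input may vary).
const dotAllPatt = /(?<=Reference:).*?(?=\*)/s;

const notes = `* Agency:  Agency Ni

* Reference:

https://example.com/a

https://example.com/b (#1)

* Citation: WAC 51-52 / WSR 23-02-055`;

// Fall back to an empty string when there is no match.
const refBlock = notes.match(dotAllPatt)?.[0].trim() ?? '';
console.log(refBlock);
```

As with the original pattern, an asterisk inside the content would cut the match short.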

Answer 5

Score: 1

Avoid regex here; it tends to hurt both readability and performance.

In your case, the input can be very long, but the target pattern is quite simple.

The idea is to find '* Reference' and extract everything up to the next '*'.

So we can use string.indexOf and string.substring.


const text = `
* Official Name:  Noner


* Pub: https://content.upcodes.co/viewer/washington/wa-mechanical-code-2021

* Agency:  Agency Ni

* Reference: 

https://web.archive.org/web/20230226234118/https://lawfilesext.leg.wa.gov/law/wsr/agency/BuildingCodeCouncil.htm

https://web.archive.org/web/20230303022030/https://lawfilesext.leg.wa.gov/law/wsr/2023/02/23-02-055.htm (#1)

* Citation: WAC 51-52 / WSR 23-02-055

* Draft Doc Title: 

 WSR 23-02-055 (#1)

* Draft Source Doc: https://web.archive.org/web/20230303022030/https://lawfilesext.leg.wa.gov/law/wsr/2023/02/23-02-055.htm (#1)

* Draft Drive: https://drive.google.com/file/d/1pYmwQS3t-ZX-Vyg9yBabtIpXZ7By2G6f/view?usp=share_link ( #1)

* Final Doc Title: 

   IECC Com Update(#1)

   IECC Res Update (#2)

   IECC Res Update (#3)

* Final Source Doc: 
  https://web.archive.org/web/20230303022130/https://apps.leg.wa.gov/wac/default.aspx?cite=51-52&full=true&pdf=true (#1)
 https://web.archive.org/web/20230303022030/https://lawfilesext.leg.wa.gov/law/wsr/2023/02/23-02-055.htm (#2)

* Final Drive: https://web.archive.org/web/20230303022130/https://apps.leg.wa.gov/wac/default.aspx?cite=51-52&full=true&pdf=true (#1)

https://web.archive.org/web/2023030302fdfdfg2130/https://apps.legfdg.gov/wac/default.aspx?cite=51-52&fdsfullfdsf=true&pfdsfdf=true  (#2)

* Effective Date: January 4, 2023
`;
 
const pre = text.indexOf('* Reference');
const start = text.indexOf('h', pre);
const end = text.indexOf('*', start);
const slice = text.substring(start, end);
console.log({ slice });
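
The snippet above assumes the content starts with the letter 'h' and that another '*' follows it; when indexOf returns -1, substring treats that as 0 and returns the wrong part of the text. A slightly more defensive sketch of the same indexOf/substring idea is shown below. The helper name sliceItem and the shortened sample text are hypothetical.

```javascript
// Hypothetical helper: extract the text between "* <label>:" and the next "*".
function sliceItem(text, label) {
  const marker = `* ${label}:`;
  const pre = text.indexOf(marker);
  if (pre === -1) return '';             // label not present

  const start = pre + marker.length;     // content begins right after the colon
  const next = text.indexOf('*', start); // position of the next item, if any
  const end = next === -1 ? text.length : next;
  return text.substring(start, end).trim();
}

const doc = `
* Agency:  Agency Ni

* Reference: 

https://example.com/a

https://example.com/b (#1)

* Effective Date: January 4, 2023
`;

const refSlice = sliceItem(doc, 'Reference');
const dateSlice = sliceItem(doc, 'Effective Date');
const noSlice = sliceItem(doc, 'Citation');
console.log({ refSlice, dateSlice, noSlice });
```

This still stops at the first asterisk, so content that itself contains a '*' would be truncated.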

Answer 6

Score: 0

Try using this:

const match = description.match(/(?<=\* Reference\s*:)[\s\S]*?\*/)[0].slice(0, -1);
console.log(match)

This regex uses the lookbehind you used in your question, but then matches any character (including newlines) up to the next * character. I used .slice() to get rid of the last character because the regex matches that final * character too.

I benchmarked this code against the other answers and found it to be the fastest by about 50% (see the benchmark at https://jsbench.me/udlf4pcck5/1).
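
One thing to watch with the one-liner above: match() returns null when nothing matches, so indexing [0] would throw. A minimal guarded sketch follows; the function name referenceOrEmpty and the added trim() are assumptions layered on top of the answer's regex, not part of it.

```javascript
// Guarded wrapper around the regex from this answer (hypothetical helper).
function referenceOrEmpty(description) {
  const hit = description.match(/(?<=\* Reference\s*:)[\s\S]*?\*/);
  // Drop the trailing "*" as in the answer, then trim surrounding whitespace.
  return hit ? hit[0].slice(0, -1).trim() : '';
}

const withItem = referenceOrEmpty('* Reference: \n\nhttps://example.com/a\n\n* Citation: Y');
const withoutItem = referenceOrEmpty('* Agency: X');
console.log({ withItem, withoutItem });
```

Because the pattern requires a closing *, a Reference item at the very end of the text falls into the empty-string branch as well.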

huangapple
  • Published on 2023-03-12 08:55:05
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75710487.html