JavaScript - Scan the Text Corresponding to a Certain Label
Question
I have the text below:
This is a code update
* Official Name: Noner
* Pub: https://content.upcodes.co/viewer/washington/wa-mechanical-code-2021
* Agency: Agency Ni
* Reference:
https://web.archive.org/web/20230226234118/https://lawfilesext.leg.wa.gov/law/wsr/agency/BuildingCodeCouncil.htm
https://web.archive.org/web/20230303022030/https://lawfilesext.leg.wa.gov/law/wsr/2023/02/23-02-055.htm (#1)
* Citation: WAC 51-52 / WSR 23-02-055
* Draft Doc Title:
WSR 23-02-055 (#1)
* Draft Source Doc: https://web.archive.org/web/20230303022030/https://lawfilesext.leg.wa.gov/law/wsr/2023/02/23-02-055.htm (#1)
* Draft Drive: https://drive.google.com/file/d/1pYmwQS3t-ZX-Vyg9yBabtIpXZ7By2G6f/view?usp=share_link ( #1)
* Final Doc Title:
IECC Com Update(#1)
IECC Res Update (#2)
IECC Res Update (#3)
* Final Source Doc:
https://web.archive.org/web/20230303022130/https://apps.leg.wa.gov/wac/default.aspx?cite=51-52&full=true&pdf=true (#1)
https://web.archive.org/web/20230303022030/https://lawfilesext.leg.wa.gov/law/wsr/2023/02/23-02-055.htm (#2)
* Final Drive: https://web.archive.org/web/20230303022130/https://apps.leg.wa.gov/wac/default.aspx?cite=51-52&full=true&pdf=true (#1)
https://web.archive.org/web/2023030302fdfdfg2130/https://apps.legfdg.gov/wac/default.aspx?cite=51-52&fdsfullfdsf=true&pfdsfdf=true (#2)
* Effective Date: January 4, 2023
I want to extract the information corresponding to the tag 'Reference:', but the code below only gives me one line. I want to scan all of the text up to the next asterisk symbol.
//Extract Reference
var reference = description.search("Reference:");
if(reference != -1){
reference = description.match(/(?<=^\* Reference\s*:)[\s]*[\n]*[^\n\r]*/m);
reference = reference?.[0].trim();
}else{
reference = '';
}
console.log('Reference: ' + reference);
Expected Output:
https://web.archive.org/web/20230226234118/https://lawfilesext.leg.wa.gov/law/wsr/agency/BuildingCodeCouncil.htm
https://web.archive.org/web/20230303022030/https://lawfilesext.leg.wa.gov/law/wsr/2023/02/23-02-055.htm (#1)
Answer 1
Score: 3
I decided to follow @Nick's idea of not making any assumptions about the 'subject' string.
I wrote two patterns that are lenient in the sense that they work:
- when there's no Reference item (returning an empty string),
- when the Reference item has an empty content,
- and when the Reference item content is at the end of the string (and thus not followed by another item).
The first works in all cases, whatever the content:
let ref_pat = /^\* Reference:\s*(.*\S(?:\s+.*\S)*?)??\s*(?:^\*|(?![\s\S]))/m;
let reference = description.match(ref_pat)?.[1] ?? '';
A second more efficient pattern is possible if you assume the content doesn't contain asterisk characters:
let ref_pat = /^\* Reference:\s*([^*]*[^*\s])/m;
let reference = description.match(ref_pat)?.[1] ?? '';
That assumption is its only limitation, but this pattern is far simpler.
Whichever one you choose, the result is already trimmed.
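To make the listed edge cases concrete, here is a minimal sketch that runs both patterns over a few small inputs. The helper name `extractReference`, the input strings, and the example.com URLs are placeholders invented for illustration; only the two patterns come from the answer.

```javascript
// Pattern 1 (from the answer): tolerates asterisks inside the content.
const strictPattern = /^\* Reference:\s*(.*\S(?:\s+.*\S)*?)??\s*(?:^\*|(?![\s\S]))/m;
// Pattern 2 (from the answer): assumes the content contains no asterisks.
const simplePattern = /^\* Reference:\s*([^*]*[^*\s])/m;

// Hypothetical helper wrapping the answer's one-liner.
function extractReference(description, pattern) {
  return description.match(pattern)?.[1] ?? '';
}

// Placeholder inputs, one per edge case.
const cases = {
  missing: '* Agency: X\n* Citation: Y',
  empty: '* Reference:\n* Citation: Y',
  last: '* Reference:\nhttps://example.com/a\nhttps://example.com/b',
  followed: '* Reference:\nhttps://example.com/a\nhttps://example.com/b\n* Citation: Y',
  asterisk: '* Reference:\nsee note* here\n* Citation: Y',
};

for (const [name, input] of Object.entries(cases)) {
  console.log(
    name,
    JSON.stringify(extractReference(input, strictPattern)),
    JSON.stringify(extractReference(input, simplePattern))
  );
}
```

The two patterns agree on the first four inputs. On the `asterisk` input the first pattern returns the whole line, while the second stops just before the asterisk, which is the assumption mentioned above.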
Answer 2
Score: 2
You could use this regex:
(?:^|\n)\*\s*Reference:\s*([\s\S]*?)(?=\s*\n\*|$)
Which matches:
- (?:^|\n)\*\s*Reference:\s* : "* Reference:" at the beginning of a line
- ([\s\S]*?) : a minimal number of characters, captured in group 1
- (?=\s*\n\*|$) : a positive lookahead for spaces, a newline and a "*", or for the end of the string (the m flag is not used, so $ only matches at the very end)
Regex demo on regex101
<!-- begin snippet: js hide: false console: true babel: false -->
<!-- language: lang-js -->
const text = `* Agency: Agency Ni
* Reference:
https://web.archive.org/web/20230226234118/https://lawfilesext.leg.wa.gov/law/wsr/agency/BuildingCodeCouncil.htm
https://web.archive.org/web/20230303022030/https://lawfilesext.leg.wa.gov/law/wsr/2023/02/23-02-055.htm (#1)
* Citation: WAC 51-52 / WSR 23-02-055
`
const reference = text.match(/(?:^|\n)\*\s*Reference:\s*([\s\S]*?)(?=\s*\n\*|$)/)?.[1] ?? ''
console.log(reference)
<!-- end snippet -->
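The same regex can be built for any label rather than hard-coding "Reference". The sketch below is one way to do that; `escapeRegExp` and `extractField` are hypothetical helper names, and the sample text uses placeholder example.com URLs. It also trims the result, because when the item is the last one in the string the lookahead falls back to `$` and the capture keeps any trailing whitespace.

```javascript
// Hypothetical helper: escape characters that are special in a regular expression.
function escapeRegExp(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Hypothetical helper: content of the "* <label>:" item, or '' when the label is absent.
function extractField(text, label) {
  const pattern = new RegExp(
    '(?:^|\\n)\\*\\s*' + escapeRegExp(label) + ':\\s*([\\s\\S]*?)(?=\\s*\\n\\*|$)'
  );
  // Trim because the "$" branch of the lookahead can leave trailing whitespace.
  return (text.match(pattern)?.[1] ?? '').trim();
}

// Placeholder sample text (not the data from the question).
const sample = [
  '* Agency: Agency Ni',
  '* Reference:',
  'https://example.com/a',
  'https://example.com/b (#1)',
  '* Citation: WAC 51-52',
  '',
].join('\n');

console.log(extractField(sample, 'Reference'));
console.log(extractField(sample, 'Citation'));
console.log(JSON.stringify(extractField(sample, 'Missing')));
```

Escaping the label matters for labels that contain regex metacharacters such as parentheses or dots; plain labels like the ones in the question pass through unchanged.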
Answer 3
Score: 2
In JavaScript with lookarounds:
(?<=^\* Reference\s*:\s*)\S[^]*?(?=^\s*\*)
Explanation
- (?<= Positive lookbehind, assert that from the current position to the left is:
  - ^\* Reference\s*:\s* Match "* Reference" at the start of a line, followed by ":" between optional whitespace chars
- ) Close the lookbehind
- \S Match a non-whitespace char
- [^]*? Match any character including newlines, as few as possible
- (?= Positive lookahead, assert that to the right is:
  - ^\s*\* Match optional whitespace chars followed by "*" at the start of a line
- ) Close the lookahead
See a regex demo.
<!-- begin snippet: js hide: false console: true babel: false -->
<!-- language: lang-js -->
const regex = /(?<=^\* Reference\s*:\s*)\S[^]*?(?=^\s*\*)/gm;
const s = `This is a code update
* Official Name: Noner
* Pub: https://content.upcodes.co/viewer/washington/wa-mechanical-code-2021
* Agency: Agency Ni
* Reference:
https://web.archive.org/web/20230226234118/https://lawfilesext.leg.wa.gov/law/wsr/agency/BuildingCodeCouncil.htm
https://web.archive.org/web/20230303022030/https://lawfilesext.leg.wa.gov/law/wsr/2023/02/23-02-055.htm (#1)
* Citation: WAC 51-52 / WSR 23-02-055
* Draft Doc Title:
WSR 23-02-055 (#1)
* Draft Source Doc: https://web.archive.org/web/20230303022030/https://lawfilesext.leg.wa.gov/law/wsr/2023/02/23-02-055.htm (#1)
* Draft Drive: https://drive.google.com/file/d/1pYmwQS3t-ZX-Vyg9yBabtIpXZ7By2G6f/view?usp=share_link ( #1)
* Final Doc Title:
IECC Com Update(#1)
IECC Res Update (#2)
IECC Res Update (#3)
* Final Source Doc:
https://web.archive.org/web/20230303022130/https://apps.leg.wa.gov/wac/default.aspx?cite=51-52&full=true&pdf=true (#1)
https://web.archive.org/web/20230303022030/https://lawfilesext.leg.wa.gov/law/wsr/2023/02/23-02-055.htm (#2)
* Final Drive: https://web.archive.org/web/20230303022130/https://apps.leg.wa.gov/wac/default.aspx?cite=51-52&full=true&pdf=true (#1)
https://web.archive.org/web/2023030302fdfdfg2130/https://apps.legfdg.gov/wac/default.aspx?cite=51-52&fdsfullfdsf=true&pfdsfdf=true (#2)
* Effective Date: January 4, 2023
`;
const m = s.match(regex);
if (m) console.log(m[0]);
<!-- end snippet -->
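A few edge cases are worth checking before relying on this pattern. The sketch below keeps the regex unchanged and only adds a hypothetical wrapper, `referenceViaLookarounds`, that trims the match and falls back to an empty string; the inputs and example.com URLs are placeholders.

```javascript
// The pattern from the answer above, unchanged.
const regex = /(?<=^\* Reference\s*:\s*)\S[^]*?(?=^\s*\*)/gm;

// Hypothetical wrapper: trims the match and returns '' when nothing matches.
function referenceViaLookarounds(s) {
  const m = s.match(regex);
  return m ? m[0].trim() : '';
}

// Placeholder inputs (not the data from the question).
const followed = '* Reference:\nhttps://example.com/a\nhttps://example.com/b\n* Citation: Y\n';
const lastItem = '* Reference:\nhttps://example.com/a\n';
const emptyItem = '* Reference:\n* Citation: Y\n* Effective Date: Z\n';

// The raw match ends at the start of the next "*" line, so it keeps the trailing newline.
console.log(JSON.stringify(followed.match(regex)[0]));
console.log(JSON.stringify(referenceViaLookarounds(followed)));
// No "*" line follows, so the lookahead never succeeds and there is no match.
console.log(JSON.stringify(referenceViaLookarounds(lastItem)));
// With empty content, \S lands on the next item's "*" and that item is returned instead.
console.log(JSON.stringify(referenceViaLookarounds(emptyItem)));
```

If the Reference item can be empty or can be the last item in the text, one of the more lenient patterns from the other answers may be a better fit.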
Answer 4
Score: 1
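You can simply use lookarounds:
(?<=Reference: )(.|\n)*?(?=\*)
Then trim the output. Note that the lookbehind expects a literal space after the colon, and the lookahead stops at the first asterisk anywhere in the text, not only at one that starts a line.
Code example: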
<!-- begin snippet: js hide: true console: true babel: false -->
<!-- language: lang-js -->
const text = `This is a code update
* Official Name: Noner
* Pub: https://content.upcodes.co/viewer/washington/wa-mechanical-code-2021
* Agency: Agency Ni
* Reference:
https://web.archive.org/web/20230226234118/https://lawfilesext.leg.wa.gov/law/wsr/agency/BuildingCodeCouncil.htm
https://web.archive.org/web/20230303022030/https://lawfilesext.leg.wa.gov/law/wsr/2023/02/23-02-055.htm (#1)
* Citation: WAC 51-52 / WSR 23-02-055
* Draft Doc Title:
WSR 23-02-055 (#1)
* Draft Source Doc: https://web.archive.org/web/20230303022030/https://lawfilesext.leg.wa.gov/law/wsr/2023/02/23-02-055.htm (#1)
* Draft Drive: https://drive.google.com/file/d/1pYmwQS3t-ZX-Vyg9yBabtIpXZ7By2G6f/view?usp=share_link ( #1)
* Final Doc Title:
IECC Com Update(#1)
IECC Res Update (#2)
IECC Res Update (#3)
* Final Source Doc:
https://web.archive.org/web/20230303022130/https://apps.leg.wa.gov/wac/default.aspx?cite=51-52&full=true&pdf=true (#1)
https://web.archive.org/web/20230303022030/https://lawfilesext.leg.wa.gov/law/wsr/2023/02/23-02-055.htm (#2)
* Final Drive: https://web.archive.org/web/20230303022130/https://apps.leg.wa.gov/wac/default.aspx?cite=51-52&full=true&pdf=true (#1)
https://web.archive.org/web/2023030302fdfdfg2130/https://apps.legfdg.gov/wac/default.aspx?cite=51-52&fdsfullfdsf=true&pfdsfdf=true (#2)
* Effective Date: January 4, 2023`
const patt = /(?<=Reference: )(.|\n)*?(?=\*)/gm
console.log(text.match(patt)?.[0].trim())
<!-- end snippet -->
Answer 5
Score: 1
Avoid regex here; it reduces readability and performance.
In your case, the input can be very long, but the target pattern is quite simple.
The idea is to find '* Reference' and extract everything up to the next '*'.
So we can use string.indexOf and string.substring.
const text = `
* Official Name: Noner
* Pub: https://content.upcodes.co/viewer/washington/wa-mechanical-code-2021
* Agency: Agency Ni
* Reference:
https://web.archive.org/web/20230226234118/https://lawfilesext.leg.wa.gov/law/wsr/agency/BuildingCodeCouncil.htm
https://web.archive.org/web/20230303022030/https://lawfilesext.leg.wa.gov/law/wsr/2023/02/23-02-055.htm (#1)
* Citation: WAC 51-52 / WSR 23-02-055
* Draft Doc Title:
WSR 23-02-055 (#1)
* Draft Source Doc: https://web.archive.org/web/20230303022030/https://lawfilesext.leg.wa.gov/law/wsr/2023/02/23-02-055.htm (#1)
* Draft Drive: https://drive.google.com/file/d/1pYmwQS3t-ZX-Vyg9yBabtIpXZ7By2G6f/view?usp=share_link ( #1)
* Final Doc Title:
IECC Com Update(#1)
IECC Res Update (#2)
IECC Res Update (#3)
* Final Source Doc:
https://web.archive.org/web/20230303022130/https://apps.leg.wa.gov/wac/default.aspx?cite=51-52&full=true&pdf=true (#1)
https://web.archive.org/web/20230303022030/https://lawfilesext.leg.wa.gov/law/wsr/2023/02/23-02-055.htm (#2)
* Final Drive: https://web.archive.org/web/20230303022130/https://apps.leg.wa.gov/wac/default.aspx?cite=51-52&full=true&pdf=true (#1)
https://web.archive.org/web/2023030302fdfdfg2130/https://apps.legfdg.gov/wac/default.aspx?cite=51-52&fdsfullfdsf=true&pfdsfdf=true (#2)
* Effective Date: January 4, 2023
`;
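// Locate the label; indexOf returns -1 when it is absent.
const pre = text.indexOf('* Reference');
// Assumes the content starts with a URL, i.e. the first 'h' after the label.
const start = text.indexOf('h', pre);
// The content ends at the next asterisk.
const end = text.indexOf('*', start);
// Trim the newline left before that asterisk.
const slice = text.substring(start, end).trim();
console.log({ slice });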
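The same regex-free idea can be written as a line-by-line scan, which does not depend on the content starting with 'h' and also covers a missing label or a label that is the last item. This is only a sketch: `scanLabel` is a hypothetical helper name, the sample uses placeholder example.com URLs, and it assumes that no content line itself begins with an asterisk.

```javascript
// Hypothetical helper: collect the lines that follow "* <label>:" up to the next "*" line.
function scanLabel(text, label) {
  const lines = text.split('\n');
  const marker = `* ${label}:`;
  const startLine = lines.findIndex((line) => line.trim().startsWith(marker));
  if (startLine === -1) return ''; // label not present

  // Text on the label's own line, after the colon, also counts as content.
  const first = lines[startLine].trim().slice(marker.length).trim();
  const collected = first ? [first] : [];

  for (let i = startLine + 1; i < lines.length; i++) {
    const line = lines[i].trim(); // trim() also drops a trailing "\r"
    if (line.startsWith('*')) break; // next item reached
    if (line) collected.push(line); // skip blank lines
  }
  return collected.join('\n');
}

// Placeholder sample text (not the data from the question).
const sample = [
  'This is a code update',
  '* Agency: Agency Ni',
  '* Reference:',
  'https://example.com/a',
  'https://example.com/b (#1)',
  '* Citation: WAC 51-52',
].join('\n');

console.log(scanLabel(sample, 'Reference'));
console.log(scanLabel(sample, 'Agency'));
```

Returning an empty string for a missing label keeps the caller free of null checks, at the cost of not distinguishing "absent" from "present but empty".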
Answer 6
Score: 0
Try using this:
const match = description.match(/(?<=\* Reference\s*:)[\s\S]*?\*/)[0].slice(0, -1);
console.log(match)
This regex uses the lookbehind from your question, then matches any character (including newlines) up to the next * character. I used .slice() to drop the last character, because the regex matches that final * as well.
I benchmarked this code against the other answers and found it to be the fastest by about 50% (see the benchmark at https://jsbench.me/udlf4pcck5/1).