How can I get all the strings that have the same pattern?
Question
I have an XML file that has this structure:
<expression>[Customer ].[Sales ].[L_MOIS]</expression><expression>cast_varchar([Customer ].[Sales ].[L_MOIS_ANNEE])
+ ' ' +
cast_varchar([Customer ].[Sales ].[C_ANNEE])</expression></dataItem></selection><detailFilters><detailFilter><filterExpression>[Customer ].[Sales ].[DT_JOUR] <= getdate()</filterExpression></detailFilter></detailFilters></query><query name="RSmag"><source><model /></source><selection><dataItem aggregate="none" name="Code magasin"><expression>[Customer statistics].[Stores].[C_MAGASIN]</expression></dataItem><dataItem aggregate="none" name="Libellé magasin" sort="ascending"><expression>[Customer statistics].[Stores].[L_MAGASIN]</expression></dataItem></selection><detailFilters><detailFilter><filterExpression>[Customer statistics].[Stores].[C_DEPOT] <>'500'</filterExpression></detailFilter><detailFilter><filterExpression>[Customer statistics].[Stores].[C_MAGASIN] not in ('005120';'005130';'005140')</filterExpression></detailFilter></detailFilters>
</query><query name="CAdept_avec_metier_cumul"><source><model /></source><selection><dataItem aggregate="none" name="Cod Metier" rollupAggregate="none"><expression>[Customer ].[Articles].[COD_DPTG]</expression></dataItem><dataItem name="Nombre de tickets" rollupAggregate="total">
<expression>count(distinct [Customer ].[Sales ].[ID_TICKET])</expression></dataItem><dataItem name="Nombre de tickets non affecté" rollupAggregate="total"><expression>count(distinct
(case
when [Customer ].[Sales ].[C_AFFECTATION] <> 1
then [Customer ].[Sales ].[ID_TICKET]
else null
end)
)</expression>
I want to extract all the full bracketed names. The result should contain entries such as:
[Customer ].[Sales ].[C_ANNEE]
[Customer ].[Sales ].[DT_JOUR]
But what I'm currently getting is:
Customer
Sales
C_ANNEE
This is the code I'm using:
File f = new File("");
BufferedReader in = new BufferedReader(
        new InputStreamReader(new FileInputStream(f), "UTF-8"));
String str;
while ((str = in.readLine()) != null) {
    Matcher m = Pattern.compile("\\[(.*?)\\]").matcher(str);
    while (m.find()) {
        listres.add(m.group(1));
    }
}
Answer 1
Score: 1
Break the problem into two separate parts:
1) Parse the XML data using a suitable XML parser, to extract the text we want.
2) For the extracted text fields, use a regular expression to extract the required sub-strings.
The following example uses a SAX parser (I am using Java 13, by the way).
Assume we have a file containing the following XML:
<root>
<query name="RSmag">
<source>
<model />
</source>
<selection>
<dataItem aggregate="none" name="Code magasin">
<expression>
[Customer ].[Sales ].[L_MOIS]
</expression>
<expression>
cast_varchar([Customer ].[Sales ].[L_MOIS_ANNEE]) + ' ' + cast_varchar([Customer ].[Sales ].[C_ANNEE])
</expression>
</dataItem>
</selection>
<detailFilters>
<detailFilter>
<filterExpression>
[Customer ].[Sales ].[DT_JOUR] &lt;= getdate()
</filterExpression>
</detailFilter>
</detailFilters>
</query>
<query name="RSmag">
<source>
<model />
</source>
<selection>
<dataItem aggregate="none" name="Code magasin">
<expression>
[Customer statistics].[Stores].[C_MAGASIN]
</expression>
</dataItem>
<dataItem aggregate="none" name="Libellé magasin" sort="ascending">
<expression>
[Customer statistics].[Stores].[L_MAGASIN]
</expression>
</dataItem>
</selection>
<detailFilters>
<detailFilter>
<filterExpression>
[Customer statistics].[Stores].[C_DEPOT] &lt;&gt; '500'
</filterExpression>
</detailFilter>
<detailFilter>
<filterExpression>
[Customer statistics].[Stores].[C_MAGASIN]
not in ('005120';'005130';'005140')
</filterExpression>
</detailFilter>
</detailFilters>
</query>
<query name="CAdept_avec_metier_cumul">
<source>
<model />
</source>
<selection>
<dataItem aggregate="none" name="Cod Metier" rollupAggregate="none">
<expression>
[Customer ].[Articles].[COD_DPTG]
</expression>
</dataItem>
<dataItem name="Nombre de tickets" rollupAggregate="total">
<expression>
count(distinct [Customer ].[Sales ].[ID_TICKET])
</expression>
</dataItem>
<dataItem name="Nombre de tickets non affecté" rollupAggregate="total">
<expression>count(distinct
(case
when [Customer ].[Sales ].[C_AFFECTATION] &lt;&gt; 1
then [Customer ].[Sales ].[ID_TICKET]
else null
end)
)
</expression>
</dataItem>
</selection>
</query>
</root>
Note the following:
a) I made an educated guess to create a valid XML document, based on the question's sample data.
b) I escaped the < and > symbols in the text by using &lt; and &gt;.
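To see that the escaping in (b) does not change the extracted text, the following minimal sketch parses a one-element fragment and prints the character data the parser hands back. The fragment is made up for illustration, and the class name EscapeDemo and the helper textOf are hypothetical, not part of the solution further down.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class EscapeDemo {

    // Parses the XML string and returns all character data it contains.
    static String textOf(String xml) throws Exception {
        StringBuilder sb = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] buffer, int start, int length) {
                // The parser delivers text with entities already resolved.
                sb.append(buffer, start, length);
            }
        };
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<filterExpression>"
                + "[Customer ].[Sales ].[DT_JOUR] &lt;= getdate()"
                + "</filterExpression>";
        System.out.println(textOf(xml));
    }
}
```

The text printed contains a plain <= again, so the regex step never sees the entity form.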
Step 1 - Parsing the Data
This solution uses SAX for parsing - there are plenty of alternatives.
The following code processes the input file as a stream of SAX events, ignoring every element that is not an <expression> or <filterExpression> tag. The set of watched tags (watchedElements) can be adjusted as needed.
The code collects the text inside each of these tags and cleans it up by replacing newlines with spaces and collapsing extra whitespace.
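The clean-up step can be tried on its own. This is a minimal sketch, assuming it is acceptable to collapse every run of whitespace (spaces, tabs, newlines) into a single space, which is slightly broader than the formatString method in the full solution. The class name CleanupDemo and the method normalize are hypothetical.

```java
class CleanupDemo {

    // Collapses each run of whitespace into one space and trims both ends.
    static String normalize(String text) {
        return text.replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        String raw = "count(distinct\n"
                + "    (case\n"
                + "  when [Customer ].[Sales ].[C_AFFECTATION] <> 1\n"
                + " end)\n"
                + " )";
        // Prints the expression on a single line.
        System.out.println(normalize(raw));
    }
}
```

Putting each expression on one line matters because the "." in the regex used later does not match line terminators by default.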
This gives us a set of 10 text strings, like this:
[Customer ].[Sales ].[L_MOIS]
cast_varchar([Customer ].[Sales ].[L_MOIS_ANNEE]) + ' ' + cast_varchar([Customer ].[Sales ].[C_ANNEE])
[Customer ].[Sales ].[DT_JOUR] <= getdate()
[Customer statistics].[Stores].[C_MAGASIN]
[Customer statistics].[Stores].[L_MAGASIN]
[Customer statistics].[Stores].[C_DEPOT] <> '500'
[Customer statistics].[Stores].[C_MAGASIN] not in ('005120';'005130';'005140')
[Customer ].[Articles].[COD_DPTG]
count(distinct [Customer ].[Sales ].[ID_TICKET])
count(distinct (case when [Customer ].[Sales ].[C_AFFECTATION] <> 1 then [Customer ].[Sales ].[ID_TICKET] else null end) )
Step 2 - Applying the Regex
For each of these strings we use a regular expression to find the data we want:
\[.*?\](\.\[.*?\])*
This matches an opening "[" through to the next "]", followed by zero or more further "[...]" items, each preceded by a period.
The parentheses form a capturing group that only holds the last repeated item, so we ignore it and keep group zero, which is the entire match:
Matcher m = pattern.matcher(text);
while (m.find()) {
    System.out.println("*** Matches found : " + m.group(0));
}
This gives us the following 12 results:
[Customer ].[Sales ].[L_MOIS]
[Customer ].[Sales ].[L_MOIS_ANNEE]
[Customer ].[Sales ].[C_ANNEE]
[Customer ].[Sales ].[DT_JOUR]
[Customer statistics].[Stores].[C_MAGASIN]
[Customer statistics].[Stores].[L_MAGASIN]
[Customer statistics].[Stores].[C_DEPOT]
[Customer statistics].[Stores].[C_MAGASIN]
[Customer ].[Articles].[COD_DPTG]
[Customer ].[Sales ].[ID_TICKET]
[Customer ].[Sales ].[C_AFFECTATION]
[Customer ].[Sales ].[ID_TICKET]
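The regex step can also be tried in isolation. The sketch below collects matches into a list instead of printing them and, for comparison, runs the pattern from the question, which captures each bracketed piece separately. The class name RegexDemo, the constants FULL and PIECE, and the method extract are hypothetical names chosen for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexDemo {

    // Pattern from the answer: a whole chain of "[...]" items joined by periods.
    static final Pattern FULL = Pattern.compile("\\[.*?\\](\\.\\[.*?\\])*");

    // Pattern from the question: a single "[...]" item, contents in group 1.
    static final Pattern PIECE = Pattern.compile("\\[(.*?)\\]");

    // Returns the requested group of every match found in the text.
    static List<String> extract(Pattern pattern, int group, String text) {
        List<String> result = new ArrayList<>();
        Matcher m = pattern.matcher(text);
        while (m.find()) {
            result.add(m.group(group));
        }
        return result;
    }

    public static void main(String[] args) {
        String text = "cast_varchar([Customer ].[Sales ].[C_ANNEE])";
        System.out.println(extract(FULL, 0, text));   // one full reference
        System.out.println(extract(PIECE, 1, text));  // three separate pieces
    }
}
```

With FULL and group zero, the sample yields a single entry, [Customer ].[Sales ].[C_ANNEE]. With PIECE and group one, it yields the three fragments described in the question.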
Full Solution
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;
import java.util.Set;
import java.util.HashSet;
import java.util.regex.Pattern;
import java.util.regex.Matcher;

public class ParseFromFileUsingSax {

    // Looks for an opening "[" followed by a closing "]" with an
    // optional "." to string items together into one group.
    private final Pattern pattern = Pattern.compile("\\[.*?\\](\\.\\[.*?\\])*");

    // Convenience entry point so the class can be run directly.
    public static void main(String[] args) {
        new ParseFromFileUsingSax().parseUsingSax();
    }

    public void parseUsingSax() {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            SAXParser saxParser = factory.newSAXParser();

            // the tags we will inspect (all others will be skipped):
            Set<String> watchedElements = new HashSet<>();
            watchedElements.add("expression");
            watchedElements.add("filterExpression");

            DefaultHandler handler = new DefaultHandler() {

                private boolean inElement = false;
                private StringBuilder stringBuilder;

                @Override
                public void startElement(String uri, String localName, String name,
                        Attributes attributes) throws SAXException {
                    // start collecting text when a watched tag opens
                    if (watchedElements.contains(name)) {
                        inElement = true;
                        stringBuilder = new StringBuilder();
                    }
                }

                @Override
                public void characters(char[] buffer, int start, int length) throws SAXException {
                    // may be called several times for one element, so append
                    if (inElement) {
                        stringBuilder.append(buffer, start, length);
                    }
                }

                @Override
                public void endElement(String uri, String localName,
                        String name) throws SAXException {
                    // a watched tag closed: clean up its text and run the regex
                    if (watchedElements.contains(name)) {
                        inElement = false;
                        String extractedText = formatString(stringBuilder.toString());
                        System.out.println();
                        System.out.println("Extracted XML text : " + extractedText);
                        printMatches(extractedText);
                    }
                }
            };

            saxParser.parse("C:/tmp/query_data.xml", handler);
        } catch (Exception e) {
            System.err.println(e);
        }
    }

    private String formatString(String text) {
        text = text.replaceAll("\\r\\n|\\r|\\n", " "); // replace newlines with spaces
        text = text.replaceAll(" +", " "); // collapse multiple spaces
        return text.trim(); // remove leading/trailing whitespace
    }

    private void printMatches(String text) {
        Matcher m = pattern.matcher(text);
        while (m.find()) {
            System.out.println("*** Matches found : " + m.group(0));
        }
    }
}
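Since SAX is only one of several parsing options, here is a hedged sketch of the same two steps using the JDK's DOM parser instead. It assumes the document is small enough to load into memory, and it takes the XML as a string purely for illustration. The class name DomAlternative and the method extractReferences are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class DomAlternative {

    // Same pattern as in the SAX solution.
    static final Pattern PATTERN = Pattern.compile("\\[.*?\\](\\.\\[.*?\\])*");

    // Loads the XML into a DOM tree and returns every bracketed reference
    // found inside <expression> and <filterExpression> elements.
    static List<String> extractReferences(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        List<String> result = new ArrayList<>();
        for (String tag : new String[] {"expression", "filterExpression"}) {
            NodeList nodes = doc.getElementsByTagName(tag);
            for (int i = 0; i < nodes.getLength(); i++) {
                Matcher m = PATTERN.matcher(nodes.item(i).getTextContent());
                while (m.find()) {
                    result.add(m.group(0));
                }
            }
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<root>"
                + "<expression>[Customer ].[Sales ].[L_MOIS]</expression>"
                + "<filterExpression>[Customer ].[Sales ].[DT_JOUR] &lt;= getdate()"
                + "</filterExpression>"
                + "</root>";
        extractReferences(xml).forEach(System.out::println);
    }
}
```

Unlike the SAX version, this sketch visits all <expression> elements before any <filterExpression> element, so the results are grouped by tag rather than listed in document order.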