Finding volume and item count using regular expressions
Question
I am currently building a JavaScript web scraper for a grocery store. It processes the title of a product and returns the item count, volume, and price per litre of that product. Most of the product titles look something like this:
Coca cola (vanilla flavour) 12 x 330 mL
To obtain metadata about this product, I have written a regular expression. It looks for a word boundary followed by a 1- or 2-digit number, a whitespace character, the string 'x', another whitespace character, and finally a 1-, 2-, or 3-digit number:
const filter = new RegExp(/\b\d{1,2}\sx\s\d{1,3}/);
I then test each result for a match with the regular expression and calculate the item count, item volume, volume in litres, and price per litre.
if (result.title.match(filter)) {
    result.itemCount = parseInt(result.title.match(/\d{1}\s/));
    result.itemVolume = parseInt(result.title.match(/\d{2,3}\s/));
    result.litreVolume = (result.itemCount * result.itemVolume) / 1000;
    result.pricePerLitre = +(result.price / result.litreVolume).toFixed(2);
} else {
    result.itemCount = 1;
    result.itemVolume = parseInt(result.title.match(/\d{2,3}\s/));
    result.litreVolume = result.itemVolume / 1000;
    result.pricePerLitre = +(result.price / result.litreVolume).toFixed(2);
}
90% of the results look good, but sometimes I get unexpected results. For example:
- an item count of NaN, which may be because some titles contain additional numbers (e.g. Coca Cola (4-Way) 12 x 330 mL)
- a volume of Infinity
- a price per litre that is way too high
Clearly I am doing something wrong in my approach to calculating the desired metadata. What would be a better way of doing these calculations with regular expressions? Am I missing something that would make my calculations less error-prone?
Answer 1
Score: 1
If I understand correctly, the filter \b\d{1,2}\sx\s\d{1,3} works, but your sub-filters (such as \d{1}\s) do not.
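To see why the sub-filters misbehave, here is a minimal sketch that runs the question's two sub-patterns against the full title. The second title ("Coca Cola 6 x 1 L") is a made-up example, not one from the question. Because the sub-patterns search the whole title rather than the part the filter matched, they pick up the first digits that fit, wherever those are.

```javascript
// The same sub-patterns as in the question, applied to the whole title.
const title = "Coca Cola (4-Way) 12 x 330 mL";

// First single digit followed by whitespace: this is the "2" of "12", not the count.
const itemCount = parseInt(title.match(/\d{1}\s/));

// First run of 2-3 digits followed by whitespace: this is "12", not the volume "330".
const itemVolume = parseInt(title.match(/\d{2,3}\s/));

// Hypothetical title with no 2-3 digit run: match() returns null, and parseInt(null) is NaN.
const missing = parseInt("Coca Cola 6 x 1 L".match(/\d{2,3}\s/));

console.log(itemCount, itemVolume, Number.isNaN(missing));
```

With a count of 2 and a volume of 12 mL, the computed volume is 0.024 litres, which would explain a price per litre that is far too high.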
I have mostly used regex in C#, but I saw that you can use groups in Java as well.
Change your pattern to (\b\d{1,2})\sx\s(\d{1,3}). When you put parentheses in your regex, that part becomes a group that you can access afterwards.
As I said, I haven't used Java in a few years, but I picked this code snippet from the web. It shows how to use groups in Java. As the pattern, you should use (\b\d{1,2})\sx\s(\d{1,3}). If it works the same as in C#, group(0) is the whole match, group(1) is your first actual group, and group(2) is the second.
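import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Example input taken from the question; backslashes are doubled inside a Java string.
String line = "Coca cola (vanilla flavour) 12 x 330 mL";
String pattern = "(\\b\\d{1,2})\\sx\\s(\\d{1,3})";
// Create a Pattern object
Pattern r = Pattern.compile(pattern);
// Now create a Matcher object.
Matcher m = r.matcher(line);
if (m.find()) {
    System.out.println("Whole match: " + m.group(0));
    System.out.println("Item count: " + m.group(1));
    System.out.println("Item volume: " + m.group(2));
}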
I think you can write it with less code than shown above, but you get the picture.
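Since the question is about JavaScript, the same group idea can be sketched there too. This is only an illustration under some assumptions: the function name parseProduct, its return shape, and the example price are hypothetical, the volume is assumed to always be in mL, and the fallback for single items mirrors the question's else branch rather than fixing it.

```javascript
// Hypothetical helper, not from the question: reads count and volume from capture groups.
const MULTIPACK = /(\b\d{1,2})\sx\s(\d{1,3})/; // e.g. "12 x 330"
const SINGLE = /\b(\d{2,3})\s/;                // fallback for titles without "N x M"

function parseProduct(title, price) {
  const multi = title.match(MULTIPACK);
  const single = multi ? null : title.match(SINGLE);

  // match() returns [wholeMatch, group1, group2, ...] or null.
  const itemCount = multi ? parseInt(multi[1], 10) : 1;
  const itemVolume = multi
    ? parseInt(multi[2], 10)
    : single ? parseInt(single[1], 10) : NaN;

  const litreVolume = (itemCount * itemVolume) / 1000;

  // Only divide when the volume is a positive number; this avoids NaN and Infinity.
  const pricePerLitre = litreVolume > 0
    ? +(price / litreVolume).toFixed(2)
    : null;

  return { itemCount, itemVolume, litreVolume, pricePerLitre };
}

console.log(parseProduct("Coca Cola (4-Way) 12 x 330 mL", 7.92));
```

Because the count and the volume come from the same match, the extra "4" in "(4-Way)" no longer interferes, and a title with no usable volume yields a pricePerLitre of null instead of Infinity.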