2023-02-17 23:30:02

Regex that matches variable gram sizes

Question

I have the following sample series

import re
import pandas as pd

s = pd.Series({0: 'Açúcar Refinado UNIÃO Pacote 1kg',
               1: 'Açúcar Refinado QUALITÁ Pacote 1Kg',
               2: 'Açúcar Refinado DA BARRA Pacote 1kg',
               3: 'Açúcar Refinado CARAVELAS Pacote 1kg',
               4: 'Açúcar Refinado GUARANI Pacote 1Kg',
               5: 'Açúcar Refinado Granulado Doçúcar UNIÃO Pacote 1kg',
               6: 'Açúcar Refinado Light UNIÃO Fit Pacote 500g',
               7: 'Açúcar Refinado Granulado Premium UNIÃO Pacote 1kg',
               8: 'Açúcar Refinado UNIÃO 1kg - Pacote com 10 Unidades',
               9: 'Açúcar Refinado Granulado em Cubos UNIÃO Pote 250g',
               10: 'Açúcar Refinado Granulado Premium Caravelas Pacote 1kg',
               11: 'Acucar Refinado Uniao 1kg'})

What I want to do is capture the part of each string that represents the weight of the given product, specifically the "1kg" or the "500g" substring.
I need to capture one or the other, so that I can easily work with the result in the pandas.Series object.

What I tried

s.str.extract(r"(.kg)|(.g)", flags=re.IGNORECASE)

Since the number of digits before the unit can vary, I would like a different approach.

Answer 1

Score: 1

Use the following regex (assuming that the numeric part can also be a float):

s.str.extract(r"(\d+\.?\d*?k?g)", flags=re.IGNORECASE)
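
As a quick check, the pattern above can be tried on a small series. This is only a minimal sketch: the three sample strings and the name weights are chosen here for illustration, and expand=False is passed just to get a Series back instead of a one-column DataFrame.

```python
import re
import pandas as pd

# Illustrative sample: an integer kg value, an integer g value, and a float.
s = pd.Series([
    "Açúcar Refinado UNIÃO Pacote 1kg",
    "Açúcar Refinado Light UNIÃO Fit Pacote 500g",
    "something something 1.25kg",
])

# One capture group: digits, an optional dot and further digits, then "g" or "kg".
# With a single group and expand=False, str.extract returns a Series.
weights = s.str.extract(r"(\d+\.?\d*?k?g)", flags=re.IGNORECASE, expand=False)

print(weights.tolist())
```

Rows without a match would come back as NaN, so the result stays aligned with the original index.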

Answer 2

Score: 1

With this extended data:

>>> s = pd.Series({
...     0: 'Açúcar Refinado UNIÃO Pacote 1kg',
...     1: 'Açúcar Refinado QUALITÁ Pacote 1Kg',
...     2: 'Açúcar Refinado DA BARRA Pacote 1kg',
...     3: 'Açúcar Refinado CARAVELAS Pacote 1kg',
...     4: 'Açúcar Refinado GUARANI Pacote 1Kg',
...     5: 'Açúcar Refinado Granulado Doçúcar UNIÃO Pacote 1kg',
...     6: 'Açúcar Refinado Light UNIÃO Fit Pacote 500g',
...     7: 'Açúcar Refinado Granulado Premium UNIÃO Pacote 1kg',
...     8: 'Açúcar Refinado UNIÃO 1kg - Pacote com 10 Unidades',
...     9: 'Açúcar Refinado Granulado em Cubos UNIÃO Pote 250g',
...     10: 'Açúcar Refinado Granulado Premium Caravelas Pacote 1kg',
...     11: 'Acucar Refinado Uniao 1kg',
...     12: 'something something 1.25kg',
...     13: 'something something 1,25kg'})

Parsing out the numbers and the units:

>>> s.str.extract(r'(\d+(?:[\.,]\d*)?)( ?k?g)', flags=re.IGNORECASE) \
...     .assign(k=lambda d: d[0].str
...             .replace(r'(?<=\d),(?=\d)', '.', regex=True)
...             .pipe(pd.to_numeric))
       0   1       k
0      1  kg    1.00
1      1  Kg    1.00
2      1  kg    1.00
3      1  kg    1.00
4      1  Kg    1.00
5      1  kg    1.00
6    500   g  500.00
7      1  kg    1.00
8      1  kg    1.00
9    250   g  250.00
10     1  kg    1.00
11     1  kg    1.00
12  1.25  kg    1.25
13  1,25  kg    1.25

I also allow for an optional space between the number and the unit. The pattern is extended to deal with non-integer numbers as well, accounting for different decimal markers: in continental Europe, for example, decimals are written as 1,25 rather than 1.25 as in the Anglosphere.

I use a non-capturing group for the decimal portion; Roman's version also works. For parsing the number, I would normalise the decimal format if the formats are mixed. If they are not mixed, you can instead re-parse with import io; pd.read_csv(io.StringIO(your_df.to_csv()), decimal=',').
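
The CSV round-trip mentioned above might look like the following sketch, assuming every number uses a comma as its decimal marker. The frame df and its column names are made up for illustration and stand in for your_df.

```python
import io
import pandas as pd

# Hypothetical extracted data: the numbers are still strings with comma decimals.
df = pd.DataFrame({"number": ["1,25", "0,5", "2"], "unit": ["kg", "kg", "kg"]})

# Write the frame to CSV text and read it back, telling the parser that ','
# is the decimal marker. to_csv quotes fields that contain a comma, so a value
# such as "1,25" survives the round-trip as a single field.
reparsed = pd.read_csv(io.StringIO(df.to_csv()), decimal=",", index_col=0)

print(reparsed.dtypes)
```

This avoids a manual string replacement, at the cost of re-inferring the dtype of every column in the frame.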

If a string contains more than one weight, like 250g ... 1kg, str.extract returns only the first match (str.extractall would return one row per match). You may want to filter or otherwise clean such strings before throwing them into this function. Also consider appending a \b to the pattern to ensure that you don't match something like 50grandmas.
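
A small sketch of both points, using three made-up strings: the trailing \b rejects 50grandmas, and str.extractall lists every weight in a string that contains more than one. The names first and all_matches are illustrative only.

```python
import re
import pandas as pd

s = pd.Series(["Pote 250g", "50grandmas", "Kit 250g + 1kg"])

# Same pattern as above, with a trailing word boundary after the unit.
pattern = r"(\d+(?:[.,]\d*)?)( ?k?g)\b"

# extract keeps only the first match per row; rows without a match become NaN.
first = s.str.extract(pattern, flags=re.IGNORECASE)

# extractall returns one row per match, indexed by (original index, match number).
all_matches = s.str.extractall(pattern, flags=re.IGNORECASE)

print(first)
print(all_matches)
```

Here the second row has no match because the g in 50grandmas is followed by another word character, and the third row contributes two rows to all_matches.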

Thanks also for providing the data frame constructor ab initio.
