How do I split a field in tab separated values?

Question

I'm trying to automate a series of regular expressions that I run regularly, using Sublime Text 3 and RegReplace. After running a few rules, I get data that looks like this (hundreds of lines, tab-separated):

M	543	E0385-C-R	BSC	SDA	SV	After	Quarterly	N/A	N/A
M	543	A-2	Room	SDA	AV	N/A	Quarterly	N/A	N/A
M	543	H9	BSC	SDA	SV	After	Quarterly	N/A	N/A

My next step is to split the third column at the first dash '-', if one exists, by replacing that '-' with a \t. If there is no dash, I would like to add a \t at the end of the column (this keeps my columns consistent across all the rows). Here is the output I'm looking for:

M	543	E0385	C-R	BSC	SDA	SV	After	Quarterly	N/A	N/A
M	543	A	2	Room	SDA	AV	N/A	Quarterly	N/A	N/A
M	543	H9		BSC	SDA	SV	After	Quarterly	N/A	N/A

The first two lines split at the first dash. The third line does not have a dash, so I insert a tab at the end of the third field.

So far I have figured out how to match the content of the columns (it still seems a little clumsy) with:

(?<=\t|^)[a-zA-Z0-9-\/]*(?=\t|$)

I would then find the third match and process it further to replace the first dash. I'm not really sure how to do this (I would be happy to have someone show me how to do this kind of 'nested' evaluation).

A different approach brought me here:

(?<=\t\d{3}\t)([a-zA-Z0-9-]*)\K-

This assumes that there are always three digits in the second column... an assumption I don't trust.

So a generalization of the pattern got me:

(?:\S+\t){2}\S*?\K(-)

which is pretty good, I think.

  • How do I get this expression to match the end of the 'word' if there is no dash?
  • Can I make it more robust (so a dash in a subsequent column does not match)?
  • Is there a better approach to this problem (like the two-level match above, maybe) and how do I implement it?
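For reference, the two-level idea (isolate the third field first, then deal with its first dash) is easy to sketch outside of Sublime Text. The snippet below is a minimal Python illustration, not a RegReplace rule; the function name `split_third_column` is made up for this example, and it assumes each record is a single tab-separated line.

```python
def split_third_column(line: str) -> str:
    """Split the third tab-separated field at its first dash.

    If the field has no dash, an empty field is inserted after it,
    so every row ends up with the same number of columns.
    """
    fields = line.split("\t")
    if len(fields) < 3:
        # Assumption for this sketch: rows with fewer than three
        # columns are left untouched.
        return line
    # partition() splits at the FIRST dash only; with no dash it
    # returns (field, "", ""), which yields the empty extra column.
    head, _dash, tail = fields[2].partition("-")
    fields[2:3] = [head, tail]
    return "\t".join(fields)


if __name__ == "__main__":
    rows = [
        "M\t543\tE0385-C-R\tBSC\tSDA\tSV\tAfter\tQuarterly\tN/A\tN/A",
        "M\t543\tA-2\tRoom\tSDA\tAV\tN/A\tQuarterly\tN/A\tN/A",
        "M\t543\tH9\tBSC\tSDA\tSV\tAfter\tQuarterly\tN/A\tN/A",
    ]
    for row in rows:
        print(split_third_column(row))
```

Because only the third field is ever inspected, dashes in later columns cannot be affected.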

Answer 1

Score: 1

It is indeed not a good idea to rely on the presence of 3 digits as an indication for the second column.

Instead, make use of the start-of-line assertion (^) and match the first two columns plus the third column up to the first hyphen (if there is one). Then reset the start of the match at that point (\K) and match the hyphen if it is there; otherwise the match is just an empty string. Then replace that match with a TAB:

Find: ^((?:[^\t]*\t){2}[^\t-]*)\K-?

Replace: \t
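The same find/replace can be tried outside Sublime Text. Python's built-in `re` module does not support `\K`, so the sketch below swaps in an equivalent technique: it keeps the prefix in capture group 1 and writes it back in the replacement. The names `THIRD_COLUMN_DASH` and `split_third_column_regex` are invented for this illustration, and the function is assumed to receive one line at a time (since `[^\t]*` could otherwise run across line breaks).

```python
import re

# Same idea as the Find pattern above, minus \K: group 1 holds the
# first two columns plus the third column up to the first hyphen;
# the optional hyphen after it is what gets swapped for a tab.
THIRD_COLUMN_DASH = re.compile(r"^((?:[^\t]*\t){2}[^\t-]*)-?")


def split_third_column_regex(line: str) -> str:
    """Replace the first hyphen of the third column with a tab,
    or insert a tab after the third column if it has no hyphen."""
    # \1 restores the captured prefix; \t in the template is a tab.
    return THIRD_COLUMN_DASH.sub(r"\1\t", line, count=1)


if __name__ == "__main__":
    text = (
        "M\t543\tE0385-C-R\tBSC\tSDA\tSV\tAfter\tQuarterly\tN/A\tN/A\n"
        "M\t543\tA-2\tRoom\tSDA\tAV\tN/A\tQuarterly\tN/A\tN/A\n"
        "M\t543\tH9\tBSC\tSDA\tSV\tAfter\tQuarterly\tN/A\tN/A"
    )
    for line in text.splitlines():
        print(split_third_column_regex(line))
```

Anchoring at the start of the line is what keeps hyphens in later columns from matching: the pattern can only ever reach the third column.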

Posted by huangapple on August 10, 2023 at 23:06:08
When reposting, please retain the link to this article: https://go.coder-hub.com/76877025.html