读取具有某些字段中的空值和格式不完全相同的csv文件?

huangapple go评论56阅读模式
英文:

Read csv file having null values in some fields and also doesn't have the exact the same format values?

问题

我有一个CSV文件(ABC.CSV),其数据格式如下:

COLUMN1 COLUMN2 COLUMN3      COLUMN4     COLUMN5
12345   ABC     RR,MM K      NAO,KUM      DEV
34567   CDEF    NN                        INT
89567   KGH     PP, BHIM     PRKC         PROD
9876    PIM                               DEV          
6543    KCDEF   NICE,MAN K                INT  
5432    GHK                  SIN,NICE C   PROD

以下是我正在使用的PowerShell代码来读取这个CSV文件:

$filePath = "C:\path\to\your\abc.csv"
$searchString = "9876"

# 读取文件的内容
$content = Get-Content -Path $filePath

# 处理文件的每一行
foreach ($line in $content) {
    # 将行拆分为单个值
    $values = $line -split ','

    # 提取值
    $column1 = $values[0].Trim()
    $column2 = $values[1].Trim()
    $column3 = ($values[2].Trim(), $values[3].Trim()) -join ','
    $column4 = $values[4].Trim()
    $column5 = $values[5].Trim()
    
    if ($column1 -like "9876" -and $column5 -like "PROD") {
     Write-Host $column1
     Write-Host $column2
     Write-Host $column3
     Write-Host $column4
     Write-Host $column5
    }
}

这段代码只有在所有字段都正确时才能正常工作,但如果任何字段为空或COLUMN 4和5的值只是CC而不是CC,KK,它会报错。

对于这个值是可以正常工作的:

12345   ABC     RR,MM K      NAO,KUM      DEV

但对于这种情况没有正确的结果:

34567   CDEF    NN                        INT
89567   KGH     PP, BHIM     PRKC         PROD
9876    PIM                               DEV          
6543    KCDEF   NICE,MAN K                INT  
5432    GHK                  SIN,NICE C   PROD
英文:

I have csv file (ABC.CSV) which is having data in below format

COLUMN1 COLUMN2 COLUMN3      COLUMN4     COLUMN5
12345   ABC     RR,MM K      NAO,KUM      DEV
34567   CDEF    NN                        INT
89567   KGH     PP, BHIM     PRKC         PROD
9876    PIM                               DEV          
6543    KCDEF   NICE,MAN K                INT  
5432    GHK                  SIN,NICE C   PROD

Below is the powershell code which I am using to read this csv file

$filePath = "C:\path\to\your\abc.csv"
$searchString = "9876"

# Read the content of the file
$content = Get-Content -Path $filePath

# Process each line of the file
foreach ($line in $content) {
    # Split the line into individual values
    $values = $line -split ','

    # Extract the values
    $column1 = $values[0].Trim()
    $column2 = $values[1].Trim()
    $column3 = ($values[2].Trim('"'), $values[3].Trim('"')) -join ','
    $column4 = $values[4].Trim()
    $column5 = $values[5].Trim()
    
    if ($column1 -like "9876" -and $column5 -like "PROD" {
     Write-Host $column1
     Write-Host $column2
     Write-Host $column3
     Write-Host $column4
     Write-Host $column5

    }

This code is working only when all fields are in correct shape but any filed is null or COLUMN 4 & 5 is having value just CC not CC, KK then it is throwing the error.

FINE FOR THIS VALUE

12345   ABC     RR,MM K      NAO,KUM      DEV

NOT SHOWING CORRECT RESULTS FOR SUCH VALUES

    34567   CDEF    NN                        INT
    89567   KGH     PP, BHIM     PRKC         PROD
    9876    PIM                               DEV          
    6543    KCDEF   NICE,MAN K                INT  
    5432    GHK                  SIN,NICE C   PROD

答案1

得分: 1

如已指出,您的数据不是以CSV格式呈现,而似乎是以固定宽度的列形式呈现,其边界由列名开始的字符位置隐含确定。

以下将您的文件转换为CSV格式,并使用 ConvertTo-Csv 解析结果 - 请注意,此解决方案基于上述假设而通用工作;它既不依赖于特定列数,也不依赖于特定长度

# 读取文件的标题行和所有数据行。
$headerLine, $dataLines = Get-Content $filePath

# 获取结束字段的字符索引。
# + -1 添加一个额外的数组元素,用作行的末尾的占位符。
$fieldEndIndices = [regex]::Matches($headerLine, ' \S').Index + -1

# 遍历所有数据行。
$objects =
  $dataLines |  
  ForEach-Object {
    # 将当前行拆分为字段,修剪每个字段并将其括在 "..." 中。
    $pos = 0
    $fields = 
      foreach ($fieldEndIndex in $fieldEndIndices) {
        if ($fieldEndIndex -eq -1) { $fieldEndIndex = $_.Length - 1 } 
        '"' + $_.Substring($pos, $fieldEndIndex - $pos + 1).Trim() + '"'
        $pos += $fieldEndIndex - $pos + 1
      }
    # 输出字段作为CSV行。
    $fields -join ','
  } |
  ConvertFrom-Csv -Header (-split $headerLine) # 解析CSV数据为对象。

运行上述代码后,$objects 包含一个[pscustomobject]实例的数组,其属性以输入数据中的列命名,其值为字段值。

要可视化结果,您可以运行 $objects | Format-Table,将得到以下结果,显示数据已按预期解析:

COLUMN1 COLUMN2 COLUMN3    COLUMN4    COLUMN5
------- ------- -------    -------    -------
12345   ABC     RR,MM K    NAO,KUM    DEV
34567   CDEF    NN                    INT
89567   KGH     PP, BHIM   PRKC       PROD
9876    PIM                           DEV
6543    KCDEF   NICE,MAN K            INT
5432    GHK                SIN,NICE C PROD
英文:

<!-- language-all: sh -->

As has been noted, your data isn't in CSV format, but seemingly in the form of fixed-width columns, whose boundaries are implied by the character positions where the column names start.

The following transforms your file into CSV format and parses the result into objects using ConvertTo-Csv - note that the solution works generically based on the assumptions above; it neither relies on a specific number of columns nor on their specific lengths:

# Read the file into the header line and all data lines.
$headerLine, $dataLines = Get-Content $filePath

# Get the indices of the characters that end the fields.
# + -1 adds an extra array element that is a placeholder for the end of the line.
$fieldEndIndices = [regex]::Matches($headerLine, &#39; \S&#39;).Index + -1

# Iterate over all data lines.
$objects =
  $dataLines |  
  ForEach-Object {
    # Split the line at hand into fields, trim each field and enclose it in &quot;...&quot;
    $pos = 0
    $fields = 
      foreach ($fieldEndIndex in $fieldEndIndices) {
        if ($fieldEndIndex -eq -1) { $fieldEndIndex = $_.Length - 1 } 
        &#39;&quot;&#39; + $_.Substring($pos, $fieldEndIndex - $pos + 1).Trim() + &#39;&quot;&#39;
        $pos += $fieldEndIndex - $pos + 1
      }
    # Output the fields as a CSV line
    $fields -join &#39;,&#39;
  } |
  ConvertFrom-Csv -Header (-split $headerLine) # Parse the CSV data into objects.

After running the above, $objects contains an array of [pscustomobject] instances whose properties are named for the columns in the input data and whose values are the field values.

To visualize the results, you can run $objects | Format-Table, which yields the following, showing that the data was parsed as intended:

COLUMN1 COLUMN2 COLUMN3    COLUMN4    COLUMN5
------- ------- -------    -------    -------
12345   ABC     RR,MM K    NAO,KUM    DEV
34567   CDEF    NN                    INT
89567   KGH     PP, BHIM   PRKC       PROD
9876    PIM                           DEV
6543    KCDEF   NICE,MAN K            INT
5432    GHK                SIN,NICE C PROD

答案2

得分: 0

你已经使用数据进行了修复,而不是CSV。请尝试下面的正则表达式,它使用了硬编码的列宽:

$filename = 'c:\temp\test.csv'
$pattern = '(?&lt;column1&gt;.{8})(?&lt;column2&gt;.{8})(?&lt;column3&gt;.{13})(?&lt;column4&gt;.{11})(?&lt;column5&gt;.*?)'
$data = Get-Content -Path $filename | Select-Object -Skip 1 | Select-String -Pattern $pattern
$table = $data | foreach {[PSCustomObject]@{
   column1 = $_.Matches.Groups[1].Value.Trim() 
   column2 = $_.Matches.Groups[2].Value.Trim()
   column3 = $_.Matches.Groups[3].Value.Trim()
   column4 = $_.Matches.Groups[4].Value.Trim()
   column5 = $_.Matches.Groups[5].Value.Trim()
} }
$table | Format-Table

结果如下:

column1 column2 column3    column4    column5
------- ------- -------    -------    -------
12345   ABC     RR,MM K    NAO,KUM    DEV
34567   CDEF    NN                    INT
89567   KGH     PP, BHIM   PRKC       PROD
9876    PIM                           DEV
6543    KCDEF   NICE,MAN K            INT
5432    GHK                SIN,NICE C PROD
英文:

You have fixed with data, not CSV. Try regex below which uses hardcoded column widths :

<!-- begin snippet: js hide: false console: true babel: false -->

$filename = &#39;c:\temp\test.csv&#39;
$pattern = &#39;(?&lt;column1&gt;.{8})(?&lt;column2&gt;.{8})(?&lt;column3&gt;.{13})(?&lt;column4&gt;.{11})(?&lt;column5&gt;.*)&#39;
$data = Get-Content -Path $filename | Select-Object -Skip 1 | Select-String -Pattern $pattern
$table = $data | foreach {[PSCustomObject]@{
   column1 = $_.Matches.Groups[1].Value.Trim() 
   column2 = $_.Matches.Groups[2].Value.Trim()
   column3 = $_.Matches.Groups[3].Value.Trim()
   column4 = $_.Matches.Groups[4].Value.Trim()
   column5 = $_.Matches.Groups[5].Value.Trim()
} }
$table | Format-Table

<!-- end snippet -->

Results

<!-- begin snippet: js hide: false console: true babel: false -->

column1 column2 column3    column4    column5
------- ------- -------    -------    -------
12345   ABC     RR,MM K    NAO,KUM    DEV
34567   CDEF    NN                    INT
89567   KGH     PP, BHIM   PRKC       PROD
9876    PIM                           DEV
6543    KCDEF   NICE,MAN K            INT
5432    GHK                SIN,NICE C PROD

<!-- end snippet -->

huangapple
  • 本文由 发表于 2023年5月28日 01:27:30
  • 转载请务必保留本文链接:https://go.coder-hub.com/76348156.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定