How can I get this ASCII text file into a usable data format?

Question

I want to use NIBRS' "master file download" for arrests in 2021. However, this data comes in an ASCII text file that I do not know how to convert into a usable dataset. It seems like, from the help file, certain positions of the long number string correspond to different columns, but I don't know how to tell R or Stata to import the file while cutting at those positions. Any help here would be much appreciated!

The first few rows of the data look like this:

350AK001046 921  00 20000  0000000     N0000008198                                                            KETCHIKAN                Alaska                                                                                                                                                                                                                                                                                                                                                                                                                                       
350AK001046 921  01100210               000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000002000000000000000001000000004000000003000000003000000001000000000000000000000000001000000001000000002000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000001000000000000000000000000001000000001000000000000000000000000000000000000000000000000000000000000000000000000000000010000000000000000003000000000000000000000000008                    
350AK001046 921  01100240               000000000000000000000000000000000000000000000000000000000000000000000000000000001000000000000000000000000000000000000000000000000000002000000003000000004000000000000000000000000002000000001000000000000000000000000000000000000000000000000000000000000000000000001000000000000000000000000000000000000000000000000000000000000000000000001000000001000000002000000000000000003000000000000000000000000000000000000000000000000000000000000000000000000000000000000000014000000000000000004000000000000000000000000005                    
350AK001046 921  01100260               000000000000000000000000000000000000000000000000000000000000000000000000000000000000000001000000000000000000000000000000000002000000006000000005000000009000000005000000005000000006000000002000000001000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000001000000001000000003000000002000000003000000002000000002000000001000000000000000000000000000000000000000000000000000000000000000000000000000000035000000000000000015000000000000000000000000025 

Answer 1

Score: 1

Assume that we have the file lee.dat generated in the Note at the end. It is made up of lines that have either 8 or 4 fields, and we assume that each 8-field row is to be prepended to the 4-field rows that follow it. Now use count.fields to find the number of fields in each row, and then use cumsum to create a grouping variable g that groups each 8-field row with the 4-field rows that follow it. Then paste within each group and read everything in.

File <- "lee.dat"
g <- cumsum(count.fields(File) == 8)
File |>
  readLines() |>
  tapply(g, function(x) paste(x[1], x[-1])) |>
  unlist() |>
  read.table(text = _, colClasses = "character")

giving

           V1  V2 V3    V4      V5          V6        V7     V8          V9 V10      V11 V12
1 350AK001046 921 00 20000 0000000 N0000008198 KETCHIKAN Alaska 350AK001046 921 01100210 000
2 350AK001046 921 00 20000 0000000 N0000008198 KETCHIKAN Alaska 350AK001046 921 01100240 000
3 350AK001046 921 00 20000 0000000 N0000008198 KETCHIKAN Alaska 350AK001046 921 01100260 000

Note

We have reduced the long field in the 4-field rows to 3 characters for ease of exposition.

Lines <- "350AK001046 921 00 20000 0000000 N0000008198 KETCHIKAN Alaska
350AK001046 921 01100210 000
350AK001046 921 01100240 000
350AK001046 921 01100260 000
"
cat(Lines, file = "lee.dat")

Answer 2

Score: 0
In Stata this would be a job for the infix command.

If you put your example data in a file called test.raw, this code would extract the first four columns:

# delim ;
infix  
	ID 1-1 
	numeric_state 2-3 
	str ori_code 4-10
using test.raw; 

Here's what the output from a list command looks like after that:

     +--------------------------+
     | ID   numeri~e   ori_code |
     |--------------------------|
  1. |  3         50    AK00104 |
  2. |  3         50    AK00104 |
  3. |  3         50    AK00104 |
  4. |  3         50    AK00104 |
     +--------------------------+

I got the initial column headers and positions by clicking "Download the Help file for the Arrest File" under "Master File Downloads" at the link you provided. Here's a screenshot of the PDF: (screenshot not reproduced)
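For the R side of the question, the same three fixed positions used in the infix call above (ID in column 1, numeric_state in columns 2-3, ori_code in columns 4-10) can be fed to base R's read.fwf(). This is a minimal sketch, not a full layout: the remaining columns would be added the same way from the help-file PDF, and the sample line below is just the start of the first row from the question.

```r
# Sample line from the question, written to test.raw as in the answer above
writeLines(
  "350AK001046 921  00 20000  0000000     N0000008198",
  "test.raw"
)

# Read the first three fixed-width fields:
#   ID at column 1, numeric_state at columns 2-3, ori_code at columns 4-10.
# Characters past the total width are ignored by read.fwf.
arrests <- read.fwf(
  "test.raw",
  widths     = c(1, 2, 7),
  col.names  = c("ID", "numeric_state", "ori_code"),
  colClasses = "character"   # keep leading zeros intact
)
arrests
```

Reading everything as character first (colClasses = "character") avoids losing leading zeros in code fields; columns can be converted to numeric afterwards where appropriate.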

Answer 3

Score: 0

The best tool for such a task is usually Excel, but with big data it fails at the row limit (65,536 rows in the older .xls format, as classically highlighted by the recent loss of Covid patient test records).

Hence I think the best approach would be to split the file into parts of 65,000 lines, with head or tail or a file splitter, then use the spreadsheet's file import to place the data into columns.
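The splitting step doesn't need a separate tool; here is a minimal sketch in R (the function name, chunk size, and output naming scheme are illustrative):

```r
# Split a large text file into chunks of at most `chunk_size` lines,
# written out as <path>.part001, <path>.part002, ...
split_file <- function(path, chunk_size = 65000L) {
  lines  <- readLines(path)
  starts <- seq(1L, length(lines), by = chunk_size)
  for (i in seq_along(starts)) {
    part <- lines[starts[i]:min(starts[i] + chunk_size - 1L, length(lines))]
    writeLines(part, sprintf("%s.part%03d", path, i))
  }
}
```

Each part then stays under the 65,536-row limit of the older spreadsheet formats and can be imported one file at a time.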


huangapple
  • Posted on 2023-02-23 22:53:12
  • When reposting, please keep this link: https://go.coder-hub.com/75546475.html