How can I get this ASCII text file into a usable data format?
Question
I want to use NIBRS' "master file download" for arrests in 2021. However, this data comes in an ASCII text file that I do not know how to convert into a usable dataset. It seems like, from the help file, certain positions of the long number string correspond to different columns, but I don't know how to tell R or Stata to import the file while cutting at those positions. Any help here would be much appreciated!
The first few rows of the data look like this:
350AK001046 921 00 20000 0000000 N0000008198 KETCHIKAN Alaska
350AK001046 921 01100210 000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000002000000000000000001000000004000000003000000003000000001000000000000000000000000001000000001000000002000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000001000000000000000000000000001000000001000000000000000000000000000000000000000000000000000000000000000000000000000000010000000000000000003000000000000000000000000008
350AK001046 921 01100240 000000000000000000000000000000000000000000000000000000000000000000000000000000001000000000000000000000000000000000000000000000000000002000000003000000004000000000000000000000000002000000001000000000000000000000000000000000000000000000000000000000000000000000001000000000000000000000000000000000000000000000000000000000000000000000001000000001000000002000000000000000003000000000000000000000000000000000000000000000000000000000000000000000000000000000000000014000000000000000004000000000000000000000000005
350AK001046 921 01100260 000000000000000000000000000000000000000000000000000000000000000000000000000000000000000001000000000000000000000000000000000002000000006000000005000000009000000005000000005000000006000000002000000001000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000001000000001000000003000000002000000003000000002000000002000000001000000000000000000000000000000000000000000000000000000000000000000000000000000035000000000000000015000000000000000000000000025
Answer 1
Score: 1
Assume that we have the file lee.dat generated in the Note at the end. It is made up of lines that have either 8 or 4 fields, and we assume that each 8-field row is to be prepended to the 4-field rows that follow it. Now use count.fields to find the number of fields in each row, and then use cumsum to create a grouping variable g that groups each 8-field row with the 4-field rows that follow it. Then paste each group together and read everything in.
File <- "lee.dat"
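# lines with 8 fields start a new group; cumsum gives each line its group number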
g <- cumsum(count.fields(File) == 8)
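# within each group, prepend the 8-field line to every following 4-field line, then parse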
File |>
  readLines() |>
  tapply(g, function(x) paste(x[1], x[-1])) |>
  unlist() |>
  read.table(text = _, colClasses = "character")
giving
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12
1 350AK001046 921 00 20000 0000000 N0000008198 KETCHIKAN Alaska 350AK001046 921 01100210 000
2 350AK001046 921 00 20000 0000000 N0000008198 KETCHIKAN Alaska 350AK001046 921 01100240 000
3 350AK001046 921 00 20000 0000000 N0000008198 KETCHIKAN Alaska 350AK001046 921 01100260 000
Note
We have reduced the long field in the 4-field rows to 3 characters for ease of exposition.
Lines <- "350AK001046 921 00 20000 0000000 N0000008198 KETCHIKAN Alaska
350AK001046 921 01100210 000
350AK001046 921 01100240 000
350AK001046 921 01100260 000
"
cat(Lines, file = "lee.dat")
Answer 2
Score: 0
In Stata this would be a job for the infix command.
If you put your example data in a file called test.raw, this code would extract the first three columns:
# delim ;
infix
    ID 1-1
    numeric_state 2-3
    str ori_code 4-10
    using test.raw;
Here's what the output from a list command looks like after that:
+--------------------------+
| ID numeri~e ori_code |
|--------------------------|
1. | 3 50 AK00104 |
2. | 3 50 AK00104 |
3. | 3 50 AK00104 |
4. | 3 50 AK00104 |
+--------------------------+
I got the initial column headers and positions from clicking "Download the Help file for the Arrest File" under "Master File Downloads" at the link you provided. Here's a screenshot of the PDF:
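For anyone who wants to do the same thing in R rather than Stata, here is a rough equivalent sketch using base R's read.fwf, reusing only the three positions from the infix call above; the remaining columns would need widths taken from the same help file.
# sketch only: widths 1, 2 and 7 correspond to positions 1-1, 2-3 and 4-10
arrests <- read.fwf(
  "test.raw",
  widths     = c(1, 2, 7),
  col.names  = c("ID", "numeric_state", "ori_code"),
  colClasses = "character"   # read as text so codes keep their leading zeros
)
head(arrests)
Reading everything as character avoids dropping the leading zeros in the code fields.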
Answer 3
Score: 0
The best tool for such a task is usually "Excel", but with "Big Data" it fails at line 653??+ (classically highlighted most recently by the loss of Covid patient tests).
Hence I think the best approach would be to split the file into parts of 65,000 lines, with head or tail or a file splitter, THEN use file import into spreadsheets with columns.
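If you would rather do the splitting without a separate utility, a rough base-R sketch of the same idea could look like this (the input file name is only a placeholder for the downloaded master file):
con  <- file("2021_arrest_master.txt", open = "r")   # placeholder name
part <- 1
repeat {
  chunk <- readLines(con, n = 65000)                  # one 65,000-line piece
  if (length(chunk) == 0) break
  writeLines(chunk, sprintf("arrest_part_%03d.txt", part))
  part <- part + 1
}
close(con)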