How can I get this ASCII text file into a usable data format?
Question
I want to use NIBRS' "master file download" for arrests in 2021. However, this data comes in an ASCII text file that I do not know how to convert into a usable dataset. It seems like, from the help file, certain positions of the long number string correspond to different columns, but I don't know how to tell R or Stata to import the file while cutting at those positions. Any help here would be much appreciated!
The first few rows of the data look like this:
350AK001046 921 00 20000 0000000 N0000008198 KETCHIKAN Alaska
350AK001046 921 01100210 000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000002000000000000000001000000004000000003000000003000000001000000000000000000000000001000000001000000002000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000001000000000000000000000000001000000001000000000000000000000000000000000000000000000000000000000000000000000000000000010000000000000000003000000000000000000000000008
350AK001046 921 01100240 000000000000000000000000000000000000000000000000000000000000000000000000000000001000000000000000000000000000000000000000000000000000002000000003000000004000000000000000000000000002000000001000000000000000000000000000000000000000000000000000000000000000000000001000000000000000000000000000000000000000000000000000000000000000000000001000000001000000002000000000000000003000000000000000000000000000000000000000000000000000000000000000000000000000000000000000014000000000000000004000000000000000000000000005
350AK001046 921 01100260 000000000000000000000000000000000000000000000000000000000000000000000000000000000000000001000000000000000000000000000000000002000000006000000005000000009000000005000000005000000006000000002000000001000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000001000000001000000003000000002000000003000000002000000002000000001000000000000000000000000000000000000000000000000000000000000000000000000000000035000000000000000015000000000000000000000000025
Answer 1
Score: 1
Assume that we have the file lee.dat generated in the Note at the end. It is made up of lines that have either 8 or 4 fields, and we assume that each 8-field row is to be prepended to the 4-field rows that follow it. Now use count.fields to find the number of fields in each row, and then use cumsum to create a grouping variable g that groups each 8-field row with the 4-field rows that follow it. Then paste each group together and read everything in.
File <- "lee.dat"
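# lines with 8 fields start a new group; cumsum gives each line its group number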
g <- cumsum(count.fields(File) == 8)
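# within each group, prepend the 8-field line to every following 4-field line, then parse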
File |>
  readLines() |>
  tapply(g, function(x) paste(x[1], x[-1])) |>
  unlist() |>
  read.table(text = _, colClasses = "character")
giving
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12
1 350AK001046 921 00 20000 0000000 N0000008198 KETCHIKAN Alaska 350AK001046 921 01100210 000
2 350AK001046 921 00 20000 0000000 N0000008198 KETCHIKAN Alaska 350AK001046 921 01100240 000
3 350AK001046 921 00 20000 0000000 N0000008198 KETCHIKAN Alaska 350AK001046 921 01100260 000
Note
We have reduced the long field in the 4-field rows to 3 characters for ease of exposition.
Lines <- "350AK001046 921 00 20000 0000000 N0000008198 KETCHIKAN Alaska
350AK001046 921 01100210 000
350AK001046 921 01100240 000
350AK001046 921 01100260 000
"
cat(Lines, file = "lee.dat")
Answer 2
Score: 0
In Stata this would be a job for the infix command.
If you put your example data in a file called test.raw, this code would extract the first three columns:
# delim ;
infix
    ID 1-1
    numeric_state 2-3
    str ori_code 4-10
    using test.raw;
Here's what the output from a list command looks like after that:
+--------------------------+
| ID numeri~e ori_code |
|--------------------------|
1. | 3 50 AK00104 |
2. | 3 50 AK00104 |
3. | 3 50 AK00104 |
4. | 3 50 AK00104 |
+--------------------------+
I got the initial column headers and positions from clicking "Download the Help file for the Arrest File" under "Master File Downloads" at the link you provided. Here's a screenshot of the PDF:
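For anyone who wants to do the same thing in R rather than Stata, here is a rough equivalent sketch using base R's read.fwf, reusing only the three positions from the infix call above; the remaining columns would need widths taken from the same help file.
# sketch only: widths 1, 2 and 7 correspond to positions 1-1, 2-3 and 4-10
arrests <- read.fwf(
  "test.raw",
  widths     = c(1, 2, 7),
  col.names  = c("ID", "numeric_state", "ori_code"),
  colClasses = "character"   # read as text so codes keep their leading zeros
)
head(arrests)
Reading everything as character avoids dropping the leading zeros in the code fields.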
Answer 3
Score: 0
The best tool for such a task is usually "Excel", but with "Big Data" it fails at line 653??+ (classically highlighted most recently by the loss of Covid patient tests).
Hence I think the best approach would be to split the file into parts of 65,000 lines, with head or tail or a file splitter, THEN use file import into spreadsheets with columns.
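If you would rather do the splitting without a separate utility, a rough base-R sketch of the same idea could look like this (the input file name is only a placeholder for the downloaded master file):
con  <- file("2021_arrest_master.txt", open = "r")   # placeholder name
part <- 1
repeat {
  chunk <- readLines(con, n = 65000)                  # one 65,000-line piece
  if (length(chunk) == 0) break
  writeLines(chunk, sprintf("arrest_part_%03d.txt", part))
  part <- part + 1
}
close(con)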