2023年6月9日 04:31:03go评论93阅读模式

英文:

Extracting Text and Tables in Semi-Structured .txt

问题

I have translated the content you provided. Please let me know if you need any further assistance with your R code or any other questions you may have.

英文:

I have a .txt file that serves as the codebook for a large dataset that looks similar to this

==============================                                                
VAR V960922                                                                    
              NUMERIC                                                         
                                                                              
         Admin.48                                                             
                                                                              
         SUMMARY - POST MODE ASSIGNMENT AND ADMINISTRATION                    
         -----------------------------------------------------------          
                                                                              
              Post mode in this variable refers to beginning mode             
              (question Admin.47).                                            
                                                                              
        749      1.   Assigned to personal, administered as                   
                      personal IW                                             
          7      2.   Assigned to personal, administered as                   
                      telephone IW                                            
         28      3.   Assigned to telephone, administered as                  
                      personal IW                                             
        750      4.   Assigned to telephone, administered as                  
                      telephone IW                                            
                                                                              
                 0.   Inap, no Post IW                                        
                                                                              
==============================

I would like to be able to convert this structure into a data frame to help with cleaning and labeling the dataset for use later. My ideal end result would be a table like this


| Var Name | Freqeuncies | Value Labels
| -------- | --------    | ---------------------------------------------------
| V960922  |        749  | 1. Assigned to personal, administered as personal IW
| V960922  |          7  | 2. Assigned to personal, administered as telephone IW
| V960922  |         28  | 3. Assigned to telephone, administered as personal IW
| V960922  |        750  | 4. Assigned to telephone, administered as telephone IW
| V960922  |         NA  | 0. Inap, no Post IW

Repeating for each of the variables included in the txt file. Each variable in the file follows a similar structure but has variations in the number of values or length of the summary for instance.

My main strategy so far has been to read in the txt file with readLines and then use str_subset to break off lines of the text that meet the criteria I need with the goal of then appending these together to create a data frame.

nes &lt;- readLines(&quot;nes1996var.txt&quot;)
 
vars &lt;- str_subset(nes, &quot;^VAR&quot;, )
vars


numbers &lt;- str_subset(nes,&quot;\\d?\\.&quot;)
numbers

The first instance of just grabbing variable names worked okay since I ended up with a vector of all the variables like I wanted.

However, trying to pull the tables has been trickier. I've seen other threads on StackOverflow suggest to filter off of the rows that start with numbers, but in the text file there's a lot of deadspace before the numbers so I can't pull just the rows that begin with numbers because technically there aren't any.

So instead I've pulled all the rows that have any numbers at all that are then followed by a period, hoping to catch on the value labels formatting. This was better but not perfect, both because it captured a lot of rows from summaries that included years or other numbers and the fact that some of the rows in the tables actually go over and fill in the second row, meaning sometimes the necessary text got cut off.

Even after that I couldn't find a way to separate the frequency number from the value label strings since they were placed on the same row.

Is there a more efficient/effective method of achieving what I want? I'm somewhat experienced with R but I am also still learning a lot if that helps also.

Edit: The solution provided by Dave did what I needed once I made a few tweaks. Here is the code that worked for me in case anyone happens to be in a similar situation.

rl &lt;- readLines(.txt file path)


## trim the white space from the front and back of each string 
## this will put the frequencies as the first characters in their lines. 
rl &lt;- trimws(rl)

## find the variable delimiters
delims &lt;- grep(&quot;==============================&quot;, rl)

## initialize the output as a list
out &lt;- vector(mode=&quot;list&quot;, length=length(delims)-1)
    ## loop over the delimiters
for (i in 1:(length(delims) - 1)) {
  ## find the text between adjacent delimiters and call that vbl
  vbl &lt;- rl[(delims[i] + 1):(delims[(i + 1)] - 1)]
  ## capture the varname as the stuff after &quot;VAR &quot; in the first row of vbl
  varname &lt;- gsub(&quot;VAR (.*)&quot;, &quot;\&quot;, vbl[1])
  ## identify the lines that start with a number
  resps &lt;- grep(&quot;^\\d&quot;, vbl)
  
  if (length(resps) &gt; 0) {
    ## identify the closest blank line to the last last response value and treat 
    ## that as the delimiter for the end of the last response category
    blanks &lt;- which(vbl == &quot;&quot;)
    resps &lt;- c(resps, blanks[min(which(blanks &gt; max(resps)))])
    ## grab the frequencies and remove the last one because the last one should be blank
    freqs &lt;- gsub(&quot;^(\\d+).*&quot;, &quot;\&quot;, vbl[resps])
    ## thanks to use padding out resps with the blank line after the last response category
    freqs &lt;- freqs[-length(freqs)]
    ## for each identified response, paste together the text between the identified response row 
    ## and everything that comes before the next identifies response row.
    vlabs &lt;- sapply(1:(length(resps) - 1), function(j) {
      paste(vbl[resps[j]:(resps[(j + 1)] - 1)], collapse = &quot; &quot;)
    })
    ## remove the frequencies and white space from the start of the variable labels
    ## trim the white space around variable labels as well
    vlabs &lt;- trimws(gsub(&quot;^\\d+\\s+(.*)&quot;, &quot;\&quot;, vlabs))
    ## collect all the information in one place
    out[[i]] &lt;- data.frame(`Var Name` = varname, Frequencies = freqs, `Value Labels` = vlabs)
  } else {
    out[[i]] &lt;- data.frame(`Var Name` = character(0), Frequencies = character(0), `Value Labels` = character(0))
  }
}

答案1

得分: 1

这是一个示例。通过评论识别每段代码的功能。我假设等号分隔每个变量的行。

rl &lt;- readLines(textConnection(&quot;==============================                                                 
VAR V960922                                                                    
              NUMERIC                                                         
                                                                              
         Admin.48                                                             
                                                                              
         SUMMARY - POST MODE ASSIGNMENT AND ADMINISTRATION                    
         -----------------------------------------------------------          
                                                                              
              Post mode in this variable refers to beginning mode             
              (question Admin.47).                                            
                                                                              
        749      1.   Assigned to personal, administered as                   
                      personal IW                                             
          7      2.   Assigned to personal, administered as                   
                      telephone IW                                            
         28      3.   Assigned to telephone, administered as                  
                      personal IW                                             
        750      4.   Assigned to telephone, administered as                  
                      telephone IW                                            
                                                                              
                 0.   Inap, no Post IW                                        
                                                                              
============================== &quot;))

## trim the white space from the front and back of each string 
## this will put the frequencies as the first characters in their lines. 
rl &lt;- trimws(rl)

## find the variable delimiters
delims &lt;- grep(&quot;==============================&quot;, rl)

## initialize the output as a list
out &lt;- vector(mode=&quot;list&quot;, length=length(delims)-1)

## loop over the delimiters
for(i in 1:(length(delims)-1)){
  ## find the text between adjacent delimiters and call that vbl
  vbl &lt;- rl[(delims[i]+1):(delims[(i+1)]-1)]
  ## capture the varname as the stuff after &quot;VAR &quot; in the first row of vbl
  varname &lt;- gsub(&quot;VAR (.*)&quot;, &quot;\\1&quot;, vbl[1])
  ## identify the lines that start with a number 
  resps &lt;- grep(&quot;^\\d&quot;, vbl)
  ## identify the closest blank line to the last last response value and treat 
  ## that as the delimiter for the end of the last response category
  blanks &lt;- which(vbl == &quot;&quot;)
  resps &lt;- c(resps, blanks[min(which(blanks &gt; max(resps)))])
  ## grab the frequencies and remove the last one because the last one should be blank 
  freqs &lt;- gsub(&quot;^(\\d+).*&quot;, &quot;\\1&quot;, vbl[resps])
  ## thanks to use padding out resps with the blank line after the last response category
  freqs &lt;- freqs[-length(freqs)]
  ## for each identified response, paste together the text between the identified response row 
  ## and everything that comes before the next identifies response row. 
  vlabs &lt;- sapply(1:(length(resps)-1), function(i){
    paste(vbl[resps[i]:(resps[(i+1)]-1)], collapse=&quot; &quot;)
  })
  ## remove the frequencies and white space from the start of the variable labels
  ## trim the white space around variable labels as well
  vlabs &lt;- trimws(gsub(&quot;^\\d+\\s+(.*)&quot;, &quot;\\1&quot;, vlabs))
  ## collect all the information in one place
  out[[i]] &lt;- data.frame(`Var Name` = varname, 
                    Frequencies = freqs, 
                    `Value Labels` = vlabs)  
  
}
## make all the variables into a markdown table
lapply(out, knitr::kable)
#&gt; [[1]]
#&gt; 
#&gt; 
#&gt; |Var.Name |Frequencies |Value.Labels                                             |
#&gt; |:--------|:-----------|:--------------------------------------------------------|
#&gt; |V960922  |749         |1.   Assigned to personal, administered as personal IW   |
#&gt; |V960922  |7           |2.   Assigned to personal, administered as telephone IW  |
#&gt; |V960922  |28          |3.   Assigned to telephone, administered as personal IW  |
#&gt; |V960922  |750         |4.   Assigned to telephone, administered as telephone IW |
#&gt; |V960922  |0           |0.   Inap, no Post IW                                    |

<sup>Created on 2023-06-08 with reprex v2.0.2</sup>

英文:

Here's an example. Comments through identify what each piece of code does. My assumption is that the delisting rows of equals signs separate each variable.

rl &lt;- readLines(textConnection(&quot;==============================                                                
VAR V960922                                                                    
              NUMERIC                                                         
                                                                              
         Admin.48                                                             
                                                                              
         SUMMARY - POST MODE ASSIGNMENT AND ADMINISTRATION                    
         -----------------------------------------------------------          
                                                                              
              Post mode in this variable refers to beginning mode             
              (question Admin.47).                                            
                                                                              
        749      1.   Assigned to personal, administered as                   
                      personal IW                                             
          7      2.   Assigned to personal, administered as                   
                      telephone IW                                            
         28      3.   Assigned to telephone, administered as                  
                      personal IW                                             
        750      4.   Assigned to telephone, administered as                  
                      telephone IW                                            
                                                                              
                 0.   Inap, no Post IW                                        
                                                                              
============================== &quot;))

## trim the white space from the front and back of each string 
## this will put the frequencies as the first characters in their lines. 
rl &lt;- trimws(rl)

## find the variable delimiters
delims &lt;- grep(&quot;==============================&quot;, rl)

## initialize the output as a list
out &lt;- vector(mode=&quot;list&quot;, length=length(delims)-1)

## loop over the delimiters
for(i in 1:(length(delims)-1)){
  ## find the text between adjacent delimiters and call that vbl
  vbl &lt;- rl[(delims[i]+1):(delims[(i+1)]-1)]
  ## capture the varname as the stuff after &quot;VAR &quot; in the first row of vbl
  varname &lt;- gsub(&quot;VAR (.*)&quot;, &quot;\\1&quot;, vbl[1])
  ## identify the lines that start with a number 
  resps &lt;- grep(&quot;^\\d&quot;, vbl)
  ## identify the closest blank line to the last last response value and treat 
  ## that as the delimiter for the end of the last response category
  blanks &lt;- which(vbl == &quot;&quot;)
  resps &lt;- c(resps, blanks[min(which(blanks &gt; max(resps)))])
  ## grab the frequencies and remove the last one because the last one should be blank 
  freqs &lt;- gsub(&quot;^(\\d+).*&quot;, &quot;\\1&quot;, vbl[resps])
  ## thanks to use padding out resps with the blank line after the last response category
  freqs &lt;- freqs[-length(freqs)]
  ## for each identified response, paste together the text between the identified response row 
  ## and everything that comes before the next identifies response row. 
  vlabs &lt;- sapply(1:(length(resps)-1), function(i){
    paste(vbl[resps[i]:(resps[(i+1)]-1)], collapse=&quot; &quot;)
  })
  ## remove the frequencies and white space from the start of the variable labels
  ## trim the white space around variable labels as well
  vlabs &lt;- trimws(gsub(&quot;^\\d+\\s+(.*)&quot;, &quot;\\1&quot;, vlabs))
  ## collect all the information in one place
  out[[i]] &lt;- data.frame(`Var Name` = varname, 
                    Frequencies = freqs, 
                    `Value Labels` = vlabs)  
  
}
## make all the variables into a markdown table
lapply(out, knitr::kable)
#&gt; [[1]]
#&gt; 
#&gt; 
#&gt; |Var.Name |Frequencies |Value.Labels                                             |
#&gt; |:--------|:-----------|:--------------------------------------------------------|
#&gt; |V960922  |749         |1.   Assigned to personal, administered as personal IW   |
#&gt; |V960922  |7           |2.   Assigned to personal, administered as telephone IW  |
#&gt; |V960922  |28          |3.   Assigned to telephone, administered as personal IW  |
#&gt; |V960922  |750         |4.   Assigned to telephone, administered as telephone IW |
#&gt; |V960922  |0           |0.   Inap, no Post IW                                    |

<sup>Created on 2023-06-08 with reprex v2.0.2</sup>

答案2

得分: 0

如果您使用*nix系统，在处理数据时使用awk来“整理”数据，然后使用read.fwf（"f"ixed "w"idth "f"ormat）来读取数据。如果看到截断的行，也许可以将最后一个宽度（100）设置为更高的值。

bash/shell

% awk '/^VAR .*[[:digit:]]/{x = $2}
     $1 ~ /^[[:digit:]]+$/{printf("%s %s", x,$0); set = 1} 
     $1 ~ /^[[:alpha:]]+$/ && set == 1{print $0; set = 0}  
     $1 == "0."{print x,$0}' file > file_new

R

library(stringr) # 用于str_squish()

dat <- read.fwf("file_new", widths=c(7, 13, 100), 
  col.names=c("Var Name", "Freqeuncies", "Value Labels"), check.names=F)

dat$'Value Labels' <- str_squish(dat$'Value Labels')

dat
   Var Name Freqeuncies                                           Value Labels
1   V960922         749   1. Assigned to personal, administered as personal IW
2   V960922           7  2. Assigned to personal, administered as telephone IW
3   V960922          28  3. Assigned to telephone, administered as personal IW
4   V960922         750 4. Assigned to telephone, administered as telephone IW
5   V960922          NA                                    0. Inap, no Post IW
6   V360925         754   1. Assigned to personal, administered as personal IW
7   V360925          11  2. Assigned to personal, administered as telephone IW
8   V360925           6  3. Assigned to telephone, administered as personal IW
9   V360925          NA                                    0. Inap, no Post IW
10  V360925         699 4. Assigned to telephone, administered as telephone IW

数据

% cat file
==============================
VAR V960922
              NUMERIC

         Admin.48

         SUMMARY - POST MODE ASSIGNMENT AND ADMINISTRATION
         -----------------------------------------------------------

              Post mode in this variable refers to beginning mode
              (question Admin.47).

        749      1.   Assigned to personal, administered as
                      personal IW
          7      2.   Assigned to personal, administered as
                      telephone IW
         28      3.   Assigned to telephone, administered as
                      personal IW
        750      4.   Assigned to telephone, administered as
                      telephone IW

                 0.   Inap, no Post IW

==============================
VAR V360925
              NUMERIC

         Admin.48

         SUMMARY - POST MODE ASSIGNMENT AND ADMINISTRATION
         -----------------------------------------------------------

              Post mode in this variable refers to beginning mode
              (question Admin.47).

        754      1.   Assigned to personal, administered as
                      personal IW
         11      2.   Assigned to personal, administered as
                      telephone IW
          6      3.   Assigned to telephone, administered as
                      personal IW
                 0.   Inap, no Post IW
        699      4.   Assigned to telephone, administered as
                      telephone IW

==============================

英文:

If you are on a *nix system, doing the heavy lifting with awk to "tidy" the data, then use read.fwf ("f"ixed "w"idth "f"ormat). Maybe set the last width (100) to a higher value if you see truncated lines.

bash/shell

% awk &#39;/^VAR .*[[:digit:]]/{x = $2}
$1 ~ /^[[:digit:]]+$/{printf(&quot;%s %s&quot;, x,$0); set = 1} 
$1 ~ /^[[:alpha:]]+$/ &amp;&amp; set == 1{print $0; set = 0}  
$1 == &quot;0.&quot;{print x,$0}&#39; file &gt; file_new

R

library(stringr) # for str_squish()
dat &lt;- read.fwf(&quot;file_new&quot;, widths=c(7, 13, 100), 
col.names=c(&quot;Var Name&quot;, &quot;Freqeuncies&quot;, &quot;Value Labels&quot;), check.names=F)
dat$&#39;Value Labels&#39; &lt;- str_squish(dat$&#39;Value Labels&#39;)
dat
Var Name Freqeuncies                                           Value Labels
1   V960922         749   1. Assigned to personal, administered as personal IW
2   V960922           7  2. Assigned to personal, administered as telephone IW
3   V960922          28  3. Assigned to telephone, administered as personal IW
4   V960922         750 4. Assigned to telephone, administered as telephone IW
5   V960922          NA                                    0. Inap, no Post IW
6   V360925         754   1. Assigned to personal, administered as personal IW
7   V360925          11  2. Assigned to personal, administered as telephone IW
8   V360925           6  3. Assigned to telephone, administered as personal IW
9   V360925          NA                                    0. Inap, no Post IW
10  V360925         699 4. Assigned to telephone, administered as telephone IW

Data

% cat file
==============================
VAR V960922
NUMERIC
Admin.48
SUMMARY - POST MODE ASSIGNMENT AND ADMINISTRATION
-----------------------------------------------------------
Post mode in this variable refers to beginning mode
(question Admin.47).
749      1.   Assigned to personal, administered as
personal IW
7      2.   Assigned to personal, administered as
telephone IW
28      3.   Assigned to telephone, administered as
personal IW
750      4.   Assigned to telephone, administered as
telephone IW
0.   Inap, no Post IW
==============================
VAR V360925
NUMERIC
Admin.48
SUMMARY - POST MODE ASSIGNMENT AND ADMINISTRATION
-----------------------------------------------------------
Post mode in this variable refers to beginning mode
(question Admin.47).
754      1.   Assigned to personal, administered as
personal IW
11      2.   Assigned to personal, administered as
telephone IW
6      3.   Assigned to telephone, administered as
personal IW
0.   Inap, no Post IW
699      4.   Assigned to telephone, administered as
telephone IW
==============================

通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库，让每个人都能够通过互相帮助和分享经验来进步。

提取半结构化 .txt 中的文本和表格。

问题

答案1

答案2

bash/shell

R

数据

bash/shell

R

Data

在 ggplotly 中，图例中的元素分组失败。

如何检测数据框是否被dplyr从子函数分组？

重命名rpart绘图节点，数学符号在名称中

在R中如何对每个组的系数进行乘法运算，然后计算原始值的百分比。

What's the correct way to type hint an empty list as a literal in python?

如何在Highcharts Gantt中更改本地化的星期名称

如何在同一个流中使用多个过滤器和映射函数？

如何使用Map/Set来将代码优化到O(n)？

.NET MAUI Android在GitHub Actions上构建失败，错误代码为1。

如何在Playwright视觉比较中屏蔽多个定位器？

在C++中，可以使用可变模板参数来检索类型的内部类型。

selenium.common.exceptions.StaleElementReferenceException: Message: stale element reference: stale element not found

Creating and opening a URL to log in to Website via Basic Auth with Robot Framework/Selenium (Python)

AG Grid 在上下文菜单中以大文本形式打开

发表评论