将制表符分隔的字符串拆分成不同的列。

huangapple go评论137阅读模式
英文:

split tabs string into different columns

问题

在我的铸造环境中,我有一个名为"data"的pyspark数据集。

每一行都包含一个看起来像TSV的字符串。每一行都有一个类似这样的值:

  1. ott-akamai-logs-processor srv 2023-07-29 17:46:50.134 2023-07-29 17:46:49.358 unstruct 103b9271-777 ott node-3.13.1 ssc-2.8.2-kinesis snowplow-enrich-kinesis-3.7.0 3.65.234.x 12345679 DE HE Karachi 60313 50.1188 8.6843 Malta {"schema":"iglu:com.xxx/1-0-0","data":{"schema":"xxx/hls_manifest_requested/jsonschema/1-0-1","data":{"channel":"bildtv-broadcast","session_id":"xxx","request_id":"xxx","total_bytes":351,"referrer":"^","geo_country":"DE","geo_state":"Berlin","geo_city":"-","variant_name":"6.m3u8"}}} snowplow-nodejs-tracker/3.13.1 Europe/Berlin 2023-07-29 17:46:49.281 {"schema":"xxx/contexts/jsonschema/1-0-1","data":[{"schema":"iglu:nl.basjes/yauaa_context/jsonschema/1-0-4","data":{"deviceBrand":"Unknown","deviceName":"Unknown","operatingSystemVersionMajor":"??","layoutEngineNameVersion":"Unknown ??","operatingSystemNameVersion":"Unknown ??","agentInformationEmail":"Unknown","networkType":"Unknown","webviewAppNameVersionMajor":"Unknown ??","layoutEngineNameVersionMajor":"Unknown ??","operatingSystemName":"Unknown","agentVersionMajor":"3","layoutEngineVersionMajor":"??","webviewAppName":"Unknown","deviceClass":"Unknown","agentNameVersionMajor":"Snowplow-Nodejs-Tracker 3","operatingSystemNameVersionMajor":"Unknown ??","webviewAppVersionMajor":"??","operatingSystemClass":"Unknown","webviewAppVersion":"??","layoutEngineName":"Unknown","agentName":"Snowplow-Nodejs-Tracker","agentVersion":"3.13.1","layoutEngineClass":"Unknown","agentNameVersion":"Snowplow-Nodejs-Tracker 3.13.1","operatingSystemVersion":"??","agentClass":"Special","layoutEngineVersion":"??","agentInformationUrl":"Unknown"}},{"schema":"iglu:com.snowplowanalytics.snowplow/ua_parser_context/jsonschema/1-0-0","data":{"useragentFamily":"Other","useragentMajor":null,"useragentMinor":null,"useragentPatch":null,"useragentVersion":"Other","osFamily":"Other","osMajor":null,"osMinor":null,"osPatch":null,"osPatchMinor":null,"osVersion":"Other","deviceFamily":"Other"}}]} 2023-07-29 17:46:09.938 com.axelspringer.ott hls_manifest_requested jsonschema 1-0-1 2023-07-29 17:46:09.938

在这里,各个部分由制表符分隔。对于每个制表符分隔,我想将值放入不同的列中。我该如何做?

  1. def unnamed_1(my_df):
  2. df = my_df
  3. return df
英文:

In my foundry environment, I have a pyspark dataset with only one column called "data".

Each row has a string that looks like a TSV. Each row has a value like this:

  1. ott-akamai-logs-processor srv 2023-07-29 17:46:50.134 2023-07-29 17:46:49.358 unstruct 103b9271-777 ott node-3.13.1 ssc-2.8.2-kinesis snowplow-enrich-kinesis-3.7.0 3.65.234.x 12345679 DE HE Karachi 60313 50.1188 8.6843 Malta {"schema":"iglu:com.xxx/1-0-0","data":{"schema":"xxx/hls_manifest_requested/jsonschema/1-0-1","data":{"channel":"bildtv-broadcast","session_id":"xxx","request_id":"xxx","total_bytes":351,"referrer":"^","geo_country":"DE","geo_state":"Berlin","geo_city":"-","variant_name":"6.m3u8"}}} snowplow-nodejs-tracker/3.13.1 Europe/Berlin 2023-07-29 17:46:49.281 {"schema":"xxx/contexts/jsonschema/1-0-1","data":[{"schema":"iglu:nl.basjes/yauaa_context/jsonschema/1-0-4","data":{"deviceBrand":"Unknown","deviceName":"Unknown","operatingSystemVersionMajor":"??","layoutEngineNameVersion":"Unknown ??","operatingSystemNameVersion":"Unknown ??","agentInformationEmail":"Unknown","networkType":"Unknown","webviewAppNameVersionMajor":"Unknown ??","layoutEngineNameVersionMajor":"Unknown ??","operatingSystemName":"Unknown","agentVersionMajor":"3","layoutEngineVersionMajor":"??","webviewAppName":"Unknown","deviceClass":"Unknown","agentNameVersionMajor":"Snowplow-Nodejs-Tracker 3","operatingSystemNameVersionMajor":"Unknown ??","webviewAppVersionMajor":"??","operatingSystemClass":"Unknown","webviewAppVersion":"??","layoutEngineName":"Unknown","agentName":"Snowplow-Nodejs-Tracker","agentVersion":"3.13.1","layoutEngineClass":"Unknown","agentNameVersion":"Snowplow-Nodejs-Tracker 3.13.1","operatingSystemVersion":"??","agentClass":"Special","layoutEngineVersion":"??","agentInformationUrl":"Unknown"}},{"schema":"iglu:com.snowplowanalytics.snowplow/ua_parser_context/jsonschema/1-0-0","data":{"useragentFamily":"Other","useragentMajor":null,"useragentMinor":null,"useragentPatch":null,"useragentVersion":"Other","osFamily":"Other","osMajor":null,"osMinor":null,"osPatch":null,"osPatchMinor":null,"osVersion":"Other","deviceFamily":"Other"}}]} 2023-07-29 17:46:09.938 com.axelspringer.ott hls_manifest_requested jsonschema 1-0-1 2023-07-29 17:46:09.938

Here, things are separated by tabs. For each tab separation, I want to put the values into different columns. How can I do so?

  1. def unnamed_1(my_df):
  2. df = my_df
  3. return df

答案1

得分: 2

你可以使用Spark中的split函数来完成这个任务:
https://spark.apache.org/docs/3.1.2/api/python/reference/api/pyspark.sql.functions.split.html

  1. import pyspark.sql.functions as f
  2. split_col = f.split(df['data'], '\t')
  3. formatted_df = (
  4. df
  5. .withColumn('column_a', split_col.getItem(0))
  6. .withColumn('column_b', split_col.getItem(1))
  7. .withColumn('column_c', split_col.getItem(2))
  8. # ...
  9. .drop('data')
  10. )

如果你有一个按正确顺序排列的所需列的列表,你可以使用迭代来自动化这个过程:

  1. columns = [
  2. 'column_a',
  3. 'column_b',
  4. 'column_c',
  5. # ...
  6. ]
  7. formatted_df = df
  8. for i in range(len(columns)):
  9. column_name = columns[i]
  10. column_values = split_col.getItem(i)
  11. formatted_df = formatted_df.withColumn(column_name, column_values)
  12. formatted_df = formatted_df.drop('data')

如果你的数据以正确格式的TSV形式接收,你可以在读取数据时使用delimiter选项来解析它:

  1. df = (
  2. spark.read.format("csv")
  3. .option("delimiter", "\t")
  4. .load('data.tsv')
  5. )
英文:

You can use the split function in spark to accomplish this:
https://spark.apache.org/docs/3.1.2/api/python/reference/api/pyspark.sql.functions.split.html

  1. import pyspark.sql.functions as f
  2. split_col = f.split(df['data'], '\t')
  3. formatted_df = (
  4. df
  5. .withColumn('column_a', split_col.getItem(0))
  6. .withColumn('column_b', split_col.getItem(1))
  7. .withColumn('column_c', split_col.getItem(2))
  8. # ...
  9. .drop('data')
  10. )

If you have a list of the desired columns in the correct order, you can automate it with iteration:

  1. columns = [
  2. 'column_a',
  3. 'column_b',
  4. 'column_c',
  5. # ...
  6. ]
  7. formatted_df = df
  8. for i in range(len(columns)):
  9. column_name = columns[i]
  10. column_values = split_col.getItem(i)
  11. formatted_df = formatted_df.withColumn(column_name, column_values)
  12. formatted_df = formatted_df.drop('data')

If you're receiving this data as a properly formatted TSV you could instead use the delimiter option while reading in the data to parse this.

  1. df = (
  2. spark.read.format("csv")
  3. .option("delimiter", "\t")
  4. .load('data.tsv')
  5. )

答案2

得分: 1

你可以使用 pandas.read_csv 来实现:

  1. import pandas as pd
  2. from io import StringIO
  3. string = 'ott-akamai-logs-processor srv 2023-07-29 17:46:50.134 2023-07-29 17:46:49.358 unstruct 103b9271-777 ott node-3.13.1 ssc-2.8.2-kinesis snowplow-enrich-kinesis-3.7.0 3.65.234.x 12345679 DE HE Karachi 60313 50.1188 8.6843 Malta {"schema":"iglu:com.xxx/1-0-0","data":{"schema":"xxx/hls_manifest_requested/jsonschema/1-0-1","data":{"channel":"bildtv-broadcast","session_id":"xxx","request_id":"xxx","total_bytes":351,"referrer":"^","geo_country":"DE","geo_state":"Berlin","geo_city":"-","variant_name":"6.m3u8"}}} snowplow-nodejs-tracker/3.13.1 Europe/Berlin 2023-07-29 17:46:49.281 {"schema":"xxx/contexts/jsonschema/1-0-1","data":[{"schema":"iglu:nl.basjes/yauaa_context/jsonschema/1-0-4","data":{"deviceBrand":"Unknown","deviceName":"Unknown","operatingSystemVersionMajor":"??","layoutEngineNameVersion":"Unknown ??","operatingSystemNameVersion":"Unknown ??","agentInformationEmail":"Unknown","networkType":"Unknown","webviewAppNameVersionMajor":"Unknown ??","layoutEngineNameVersionMajor":"Unknown ??","operatingSystemName":"Unknown","agentVersionMajor":"3","layoutEngineVersionMajor":"??","webviewAppName":"Unknown","deviceClass":"Unknown","agentNameVersionMajor":"Snowplow-Nodejs-Tracker 3","operatingSystemNameVersionMajor":"Unknown ??","webviewAppVersionMajor":"??","operatingSystemClass":"Unknown","webviewAppVersion":"??","layoutEngineName":"Unknown","agentName":"Snowplow-Nodejs-Tracker","agentVersion":"3.13.1","layoutEngineClass":"Unknown","agentNameVersion":"Snowplow-Nodejs-Tracker 3.13.1","operatingSystemVersion":"??","agentClass":"Special","layoutEngineVersion":"??","agentInformationUrl":"Unknown"}},{"schema":"iglu:com.snowplowanalytics.snowplow/ua_parser_context/jsonschema/1-0-0","data":{"useragentFamily":"Other","useragentMajor":null,"useragentMinor":null,"useragentPatch":null,"useragentVersion":"Other","osFamily":"Other","osMajor":null,"osMinor":null,"osPatch":null,"osPatchMinor":null,"osVersion":"Other","deviceFamily":"Other"}}]} 2023-07-29 17:46:09.938 com.axelspringer.ott hls_manifest_requested jsonschema 1-0-1 2023-07-29 17:46:09.938'
  4. tsvString = StringIO(string)
  5. df = pd.read_csv(tsvString, sep='\t')
英文:

You can use pandas.read_csv for that:

  1. import pandas as pd
  2. from io import StringIO
  3. string = 'ott-akamai-logs-processor srv 2023-07-29 17:46:50.134 2023-07-29 17:46:49.358 unstruct 103b9271-777 ott node-3.13.1 ssc-2.8.2-kinesis snowplow-enrich-kinesis-3.7.0 3.65.234.x 12345679 DE HE Karachi 60313 50.1188 8.6843 Malta {"schema":"iglu:com.xxx/1-0-0","data":{"schema":"xxx/hls_manifest_requested/jsonschema/1-0-1","data":{"channel":"bildtv-broadcast","session_id":"xxx","request_id":"xxx","total_bytes":351,"referrer":"^","geo_country":"DE","geo_state":"Berlin","geo_city":"-","variant_name":"6.m3u8"}}} snowplow-nodejs-tracker/3.13.1 Europe/Berlin 2023-07-29 17:46:49.281 {"schema":"xxx/contexts/jsonschema/1-0-1","data":[{"schema":"iglu:nl.basjes/yauaa_context/jsonschema/1-0-4","data":{"deviceBrand":"Unknown","deviceName":"Unknown","operatingSystemVersionMajor":"??","layoutEngineNameVersion":"Unknown ??","operatingSystemNameVersion":"Unknown ??","agentInformationEmail":"Unknown","networkType":"Unknown","webviewAppNameVersionMajor":"Unknown ??","layoutEngineNameVersionMajor":"Unknown ??","operatingSystemName":"Unknown","agentVersionMajor":"3","layoutEngineVersionMajor":"??","webviewAppName":"Unknown","deviceClass":"Unknown","agentNameVersionMajor":"Snowplow-Nodejs-Tracker 3","operatingSystemNameVersionMajor":"Unknown ??","webviewAppVersionMajor":"??","operatingSystemClass":"Unknown","webviewAppVersion":"??","layoutEngineName":"Unknown","agentName":"Snowplow-Nodejs-Tracker","agentVersion":"3.13.1","layoutEngineClass":"Unknown","agentNameVersion":"Snowplow-Nodejs-Tracker 3.13.1","operatingSystemVersion":"??","agentClass":"Special","layoutEngineVersion":"??","agentInformationUrl":"Unknown"}},{"schema":"iglu:com.snowplowanalytics.snowplow/ua_parser_context/jsonschema/1-0-0","data":{"useragentFamily":"Other","useragentMajor":null,"useragentMinor":null,"useragentPatch":null,"useragentVersion":"Other","osFamily":"Other","osMajor":null,"osMinor":null,"osPatch":null,"osPatchMinor":null,"osVersion":"Other","deviceFamily":"Other"}}]} 2023-07-29 17:46:09.938 com.axelspringer.ott hls_manifest_requested jsonschema 1-0-1 2023-07-29 17:46:09.938'
  4. tsvString = StringIO(string)
  5. df = pd.read_csv(tsvString,sep='\t')

huangapple
  • 本文由 发表于 2023年8月10日 21:39:18
  • 转载请务必保留本文链接:https://go.coder-hub.com/76876280.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定