Parsing JSON Strings with Spark DataFrames
2018-10-22
张子阳
Category: Big Data Processing
I've recently been testing Spark Streaming, pulling data from Kafka in real time. The data arrives as a string containing a JSON array, which needs to be parsed into DataFrame rows so it can feed real-time computations. The code below demonstrates how to do this; since it is fairly straightforward, I won't explain it at length.
Suppose the initial string is [{"a":1,"b":2},{"a":3,"b":4},{"a":5,"b":6},{"a":7,"b":8}]; the goal is to parse it into the following DataFrame:
| a | b |
|---|---|
| 1 | 2 |
| 3 | 4 |
| 5 | 6 |
| 7 | 8 |
# coding=utf-8
from pyspark.sql import SparkSession
from pyspark.sql import types as T
from pyspark.sql import functions as F
spark = SparkSession.builder \
    .master("local[*]") \
    .appName("test.dataframe") \
    .getOrCreate()
# Step 1: load the data. By default this yields a single string column named "value".
data = ['[{"a":1,"b":2},{"a":3,"b":4},{"a":5,"b":6},{"a":7,"b":8}]']
df = spark.createDataFrame(data, T.StringType())
df.printSchema()
df.show()
schema = T.ArrayType(T.StructType([
    T.StructField("a", T.IntegerType()),
    T.StructField("b", T.IntegerType())
]))
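# Note: on newer Spark versions (2.4+, an assumption worth verifying for your
# version), the schema could alternatively be written as a DDL string such as
# "array<struct<a:int,b:int>>", or inferred from a sample record with
# F.schema_of_json; the explicit StructType above works everywhere.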
# Step 2: parse the string column into an array column
df = df.select(F.from_json(df["value"], schema).alias("json"))
df.printSchema()
df.show()
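# Note that from_json returns null for rows whose content is not valid JSON
# or does not match the schema, rather than raising an error.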
# Step 3: explode the array into one struct-typed row per element
df = df.select(F.explode(df["json"]).alias("col"))
df.printSchema()
df.show()
# Step 4: split the struct into separate columns
col = df["col"]
df = df.withColumn("a", col["a"]) \
    .withColumn("b", col["b"]) \
    .drop("col")
df.printSchema()
df.show()
print "success"
Note that df.show() displays only the struct's field values, not the field names, and it uses [] rather than {} to represent a struct, so the output looks like an array.
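Incidentally, steps 3 and 4 can be collapsed into a single select by expanding the struct with "col.*". A minimal equivalent sketch, using the same schema and df as above:
df = df.select(F.explode(df["json"]).alias("col")).select("col.*")
This produces the same two-column a/b DataFrame without the intermediate withColumn/drop calls.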
The output of the run is as follows:
# $SPARK_HOME/bin/spark-submit /data/pyjobs/sample/dataframe-json.py
root
|-- value: string (nullable = true)
+--------------------+
| value|
+--------------------+
|[{"a":1,"b":2},{"...|
+--------------------+
root
|-- json: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- a: integer (nullable = true)
| | |-- b: integer (nullable = true)
+--------------------+
| json|
+--------------------+
|[[1, 2], [3, 4], ...|
+--------------------+
root
|-- col: struct (nullable = true)
| |-- a: integer (nullable = true)
| |-- b: integer (nullable = true)
+------+
| col|
+------+
|[1, 2]|
|[3, 4]|
|[5, 6]|
|[7, 8]|
+------+
root
|-- a: integer (nullable = true)
|-- b: integer (nullable = true)
+---+---+
| a| b|
+---+---+
| 1| 2|
| 3| 4|
| 5| 6|
| 7| 8|
+---+---+
success
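Going back to the Kafka scenario mentioned at the start: the same from_json/explode pattern applies to a streaming DataFrame. Below is a minimal sketch using Structured Streaming's Kafka source; the broker address and topic name are placeholders, and it assumes the spark-sql-kafka-0-10 package is on the classpath:
# Read from Kafka; the "value" column arrives as binary and must be cast to string
stream_df = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "localhost:9092") \
    .option("subscribe", "test-topic") \
    .load()
# Apply the same parse-then-explode pattern from the batch example above
parsed = stream_df.select(F.from_json(stream_df["value"].cast("string"), schema).alias("json")) \
    .select(F.explode("json").alias("col")) \
    .select("col.*")
# Print each micro-batch to the console for inspection
query = parsed.writeStream.format("console").start()
query.awaitTermination()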
Thanks for reading; I hope this article was helpful!