Since Spark 2.2.0 (https://issues.apache.org/jira/browse/SPARK-18936) you can set the SQL session time zone directly:

spark.conf.set("spark.sql.session.timeZone", "UTC")
EDIT:
Additionally, I set the JVM default time zone to UTC to avoid implicit conversions:

import java.util.TimeZone

TimeZone.setDefault(TimeZone.getTimeZone("UTC"))

Otherwise you will get implicit conversions from your default time zone to UTC whenever no time zone information is present in the timestamp you're converting.
Example:
import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types.{DataTypes, StructField, StructType}
import sparkJob.spark.implicits._

val rawJson = """ {"some_date_field": "2018-09-14 16:05:37"} """
val dsRaw = sparkJob.spark.createDataset(Seq(rawJson))

val output = dsRaw
  .select(
    from_json(
      col("value"),
      new StructType(
        Array(
          StructField("some_date_field", DataTypes.TimestampType)
        )
      )
    ).as("parsed")
  )
  .select("parsed.*")
If my default time zone is Europe/Dublin, which is GMT+1 in summer, and the Spark SQL session time zone is set to UTC, Spark will assume that "2018-09-14 16:05:37" is in Europe/Dublin time and convert it (the result will be "2018-09-14 15:05:37").
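That conversion can be reproduced outside Spark with plain Python's zoneinfo (a minimal sketch, assuming Python 3.9+): the naive string is interpreted in the JVM's default zone (Europe/Dublin here) and then rendered in the session time zone (UTC).

```python
from datetime import datetime
from zoneinfo import ZoneInfo

# Parse the naive timestamp exactly as it appears in the JSON.
naive = datetime.strptime("2018-09-14 16:05:37", "%Y-%m-%d %H:%M:%S")

# Attach the default time zone (what Spark assumes when none is given).
local = naive.replace(tzinfo=ZoneInfo("Europe/Dublin"))

# Convert to the session time zone.
utc = local.astimezone(ZoneInfo("UTC"))
print(utc.strftime("%Y-%m-%d %H:%M:%S"))  # 2018-09-14 15:05:37
```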
PySpark
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName('test') \
    .master('local') \
    .config('spark.driver.extraJavaOptions', '-Duser.timezone=GMT') \
    .config('spark.executor.extraJavaOptions', '-Duser.timezone=GMT') \
    .config('spark.sql.session.timeZone', 'UTC') \
    .getOrCreate()