Microsoft Fabric Updates Blog

Legacy Timestamp Support in Native Execution Engine for Fabric Runtime 1.3

The recent update to the Native Execution Engine on Fabric Runtime 1.3 adds support for legacy timestamp handling, so that timestamp data written by different Spark versions can be processed consistently. The feature addresses compatibility issues introduced when Spark 3.0 switched to the Java 8 date/time API, which uses the Proleptic Gregorian calendar (the SQL ISO standard), whereas earlier versions relied on a hybrid Julian-Gregorian calendar.

Why Legacy Timestamp Support Matters 

Spark writes dates and timestamps to Parquet files as integers: dates as the number of days since the UNIX epoch (January 1, 1970), and INT64 timestamps as the number of milliseconds or microseconds since the epoch. Because of the calendar switch, the same date or timestamp written by Spark 2.x and Spark 3.x can be stored as different values. The two calendars agree on recent dates, so a date of birth such as January 1, 1955, is stored as -5,479 (5,479 days before the epoch) by both versions. They diverge for dates before the Gregorian cutover on October 15, 1582. For example, January 1, 1000, is written as -354,280 by Spark 2.x under the hybrid calendar, but as -354,285 by Spark 3.x under the Proleptic Gregorian calendar. Reading the same Parquet file across Spark versions without rebasing would therefore shift that date by five days.
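As a rough illustration of how the calendar choice changes the stored integer, the sketch below computes the day count for one date label, 1000-01-01, under both calendars. It uses only the Python standard library (whose `date` type is Proleptic Gregorian) and the standard Julian Day Number formula. The function names are made up for this example and are not part of Spark or Fabric.

```python
from datetime import date

EPOCH = date(1970, 1, 1)
EPOCH_JDN = 2440588  # Julian Day Number of 1970-01-01 (Gregorian)


def proleptic_gregorian_days(year, month, day):
    """Days since the epoch when the label is read in the Proleptic Gregorian calendar."""
    return (date(year, month, day) - EPOCH).days


def julian_calendar_days(year, month, day):
    """Days since the epoch when the same label is read in the Julian calendar.

    The hybrid calendar used by Spark 2.x follows the Julian calendar
    for dates before 1582-10-15.
    """
    a = (14 - month) // 12
    y = year + 4800 - a
    m = month + 12 * a - 3
    jdn = day + (153 * m + 2) // 5 + 365 * y + y // 4 - 32083
    return jdn - EPOCH_JDN


print(proleptic_gregorian_days(1000, 1, 1))  # -354285
print(julian_calendar_days(1000, 1, 1))      # -354280
```

The two results differ by five days, which is the kind of shift that rebasing is meant to correct.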

Legacy timestamp support addresses this by rebasing date and timestamp values between the two calendar systems during read and write operations, so data remains accurate across Spark versions.

Configuration Parameters for Legacy Timestamp Handling 
Several Spark configurations control rebasing behavior for dates and timestamps stored in Parquet files. In Fabric Runtime 1.3, these settings let you specify how Spark reads and writes date and timestamp data:

For INT96 Timestamps

  • spark.sql.parquet.int96RebaseModeInWrite 
  • spark.sql.parquet.int96RebaseModeInRead 

For Date and INT64 Timestamps (in milliseconds or microseconds)

  • spark.sql.parquet.datetimeRebaseModeInWrite 
  • spark.sql.parquet.datetimeRebaseModeInRead 

Each of these can be set to: 

  • LEGACY: Rebase between the two calendars. On write, dates and timestamps are rebased from the Proleptic Gregorian calendar to the hybrid Julian-Gregorian calendar. On read, they are rebased from the hybrid Julian-Gregorian calendar to the Proleptic Gregorian calendar. This applies to INT96 timestamps as well as to DATE and INT64 timestamps.
  • CORRECTED: No rebasing; use the Proleptic Gregorian calendar consistently for all dates and timestamps.
  • EXCEPTION (default): Fail the read or write if Spark encounters ancient dates or timestamps that are ambiguous between the two calendars.

For example: 

SET spark.sql.parquet.datetimeRebaseModeInRead = LEGACY;
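The same settings can also be applied from a PySpark notebook through `spark.conf.set`. The sketch below is one possible way to set all four parameters together. The helper names `rebase_settings` and `apply_settings` are hypothetical, and `spark` is assumed to be an existing SparkSession, as provided in a notebook.

```python
VALID_MODES = ("LEGACY", "CORRECTED", "EXCEPTION")


def rebase_settings(read_mode, write_mode):
    """Build the four Parquet rebase settings from a read mode and a write mode."""
    for mode in (read_mode, write_mode):
        if mode not in VALID_MODES:
            raise ValueError(f"unknown rebase mode: {mode!r}")
    return {
        "spark.sql.parquet.int96RebaseModeInRead": read_mode,
        "spark.sql.parquet.int96RebaseModeInWrite": write_mode,
        "spark.sql.parquet.datetimeRebaseModeInRead": read_mode,
        "spark.sql.parquet.datetimeRebaseModeInWrite": write_mode,
    }


def apply_settings(spark, settings):
    """Apply each key/value pair to the session configuration."""
    for key, value in settings.items():
        spark.conf.set(key, value)


# In a notebook where `spark` already exists:
# apply_settings(spark, rebase_settings("LEGACY", "CORRECTED"))
```

Reading with LEGACY and writing with CORRECTED is only one combination; which modes are appropriate depends on which Spark version produced the files and which will consume them.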

Native Execution Engine’s Behavior and Solution 

The Native Execution Engine on Fabric Runtime 1.3 now includes legacy timestamp support to handle date and timestamp values written by earlier versions of Spark.

Solution 

The legacy timestamp support feature allows the native execution engine to handle legacy timestamp values within Parquet files and Delta tables. It is controlled by a single configuration parameter, spark.gluten.legacy.timestamp.rebase.enabled. Once that parameter is turned on, the engine automatically adjusts for calendar differences, and users do not need to configure additional settings. For dates that might be affected by calendar discrepancies, the native execution engine uses predefined offsets to ensure consistent handling. Dates after the UNIX epoch (1970-01-01) are natively compatible and are processed as-is, without any rebase adjustment.
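To make the idea of predefined offsets concrete, here is a minimal, date-only sketch of an offset-table rebase from hybrid-calendar day counts to Proleptic Gregorian day counts. It is an illustration under stated assumptions, not the native execution engine's implementation: the table layout and all names are hypothetical, time zone effects on timestamps are ignored, and the pass-through threshold is the 1582-10-15 calendar cutover.

```python
from bisect import bisect_right
from datetime import date

EPOCH_ORDINAL = date(1970, 1, 1).toordinal()
EPOCH_JDN = 2440588  # Julian Day Number of 1970-01-01 (Gregorian)
# First day on which the hybrid calendar uses Gregorian rules.
CUTOVER_DAY = date(1582, 10, 15).toordinal() - EPOCH_ORDINAL


def julian_days(year, month, day):
    """Days since the epoch for a date label in the Julian calendar."""
    a = (14 - month) // 12
    y = year + 4800 - a
    m = month + 12 * a - 3
    jdn = day + (153 * m + 2) // 5 + 365 * y + y // 4 - 32083
    return jdn - EPOCH_JDN


def proleptic_days(year, month, day):
    """Days since the epoch for a date label in the Proleptic Gregorian calendar."""
    return date(year, month, day).toordinal() - EPOCH_ORDINAL


def build_offset_table():
    """Return (switch_days, offsets); offsets[i] applies from switch_days[i] onward.

    The offset changes on March 1 of each century year that is a leap
    year in the Julian calendar but not in the Gregorian calendar.
    """
    starts = [(1, 1, 1)] + [(c, 3, 1) for c in range(100, 1600, 100) if c % 400 != 0]
    switch_days = [julian_days(*s) for s in starts]
    offsets = [proleptic_days(*s) - julian_days(*s) for s in starts]
    switch_days.append(CUTOVER_DAY)
    offsets.append(0)
    return switch_days, offsets


SWITCH_DAYS, OFFSETS = build_offset_table()


def rebase_hybrid_to_proleptic(days):
    """Shift a hybrid-calendar day count so it denotes the same date label."""
    if days >= CUTOVER_DAY:
        return days  # both calendars agree from the cutover onward
    if days < SWITCH_DAYS[0]:
        raise ValueError("dates before 0001-01-01 are outside this sketch")
    index = bisect_right(SWITCH_DAYS, days) - 1
    return days + OFFSETS[index]


print(rebase_hybrid_to_proleptic(-354280))  # -354285
```

Because the table is small and sorted, each lookup is a single binary search followed by one addition, which is why a precomputed table is a natural fit for this kind of conversion.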

How to Enable and Use Legacy Timestamp Support in Native Execution Engine on Fabric Runtime 1.3 

To enable legacy timestamp support, simply set the following configuration in your Spark session or environment: 

SET spark.gluten.legacy.timestamp.rebase.enabled = true; 

This configuration enables Fabric’s native execution engine to rebase dates and timestamps as needed to ensure compatibility with legacy Spark-written data, including data from both Spark 2.x and Spark 3.x. 
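From a PySpark notebook, the same switch might be set as shown in the sketch below before reading older files. The helper name `read_legacy_parquet` and the path in the usage comment are hypothetical, and `spark` is assumed to be an existing SparkSession. Whether the flag takes effect when changed mid-session, rather than at the environment level, is not covered here, so treat this as a starting point.

```python
LEGACY_REBASE_FLAG = "spark.gluten.legacy.timestamp.rebase.enabled"


def read_legacy_parquet(spark, path):
    """Turn on legacy rebase for the session, then read a Parquet path."""
    spark.conf.set(LEGACY_REBASE_FLAG, "true")
    return spark.read.parquet(path)


# In a notebook where `spark` already exists (the path is a placeholder):
# df = read_legacy_parquet(spark, "Files/legacy_data")
```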

This feature will be available in production across all regions by November 14, 2024, though it will not be enabled by default. Users can activate it through configuration settings. Default enablement is planned for a future update. 
