Microsoft Fabric Updates Blog

Announcing improvements to CSV data ingestion in Synapse Data Warehouse in Microsoft Fabric

CSV files are widely used for data exchange and for ingesting data into data warehouses, but they often pose performance challenges. According to a study from Microsoft Research, up to 90% of the total time spent in data ingestion goes to parsing non-binary data such as JSON when using conventional file parsers. This is illustrated in Figure 1.

A bar chart comparing percentage of time spent on parsing versus query processing, for four different queries. The chart shows that more than 90% of the cost comes from parsing for queries one, two, and three, and over 80% of the cost comes from parsing for query number four.
Figure 1: Parsing vs. Query processing Cost
Twitter Dataset, Queries from [30], Spark+Jackson
Source: “Mison: A Fast JSON Parser for Data Analytics”

Today, we’re excited to announce a new, faster way to ingest data from CSV files into Data Warehouse in Microsoft Fabric: introducing CSV file parser version 2.0 for COPY INTO. The new CSV file parser builds on innovation from Microsoft Research’s Data Platform and Analytics group to make CSV file ingestion blazing fast on Data Warehouse.

Benefits

The performance benefits of the new CSV file parser vary depending on the number of files in the source, the size of these files, and the data layout. Our testing showed an overall improvement of 38% in ingestion times across a diverse set of scenarios, and in some cases ingestion was more than 4 times faster than with the legacy CSV parser.

How it works

To let you choose the CSV file parser, we have added a new option to the COPY INTO statement: PARSER_VERSION. When this option is set to '2.0', the new CSV file parser is used. For example:

COPY INTO mytable
FROM 'https://myaccount.blob.core.windows.net/myblobcontainer/folder1/'
WITH (
    FILE_TYPE = 'CSV',
    PARSER_VERSION = '2.0' -- optional: '2.0' is now the default value
)

The new CSV file parser performs so well that we have made it the default for COPY INTO, so you don't have to specify the option at all to benefit from it. In some rare cases, however, the new CSV parser is not supported, and you may need to fall back to the legacy CSV file parser by specifying PARSER_VERSION = '1.0' with COPY INTO. For more information on unsupported scenarios and the full syntax, refer to our documentation.
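As a minimal sketch of that fallback, the statement below reuses the placeholder table name and storage URL from the earlier example and pins the load to the legacy parser. It only illustrates where the option goes. Whether your scenario actually needs the legacy parser depends on the unsupported cases listed in the documentation.

```sql
-- Sketch only: mytable and the storage URL are the same placeholders
-- used in the example above, not real objects.
COPY INTO mytable
FROM 'https://myaccount.blob.core.windows.net/myblobcontainer/folder1/'
WITH (
    FILE_TYPE = 'CSV',
    PARSER_VERSION = '1.0' -- explicitly request the legacy CSV file parser
)
```

Removing the PARSER_VERSION line again returns the statement to the default, which is the new parser.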

Next steps

The new CSV file parser is now globally available. As mentioned, it is the new default file parser for CSV files during ingestion, so you do not need to do anything to enjoy its benefits.

To learn more about the Mison parser, visit Mison: A Fast JSON Parser for Data Analytics. Although this research originally targeted JSON, the work has since been extended to other file formats, such as CSV.
