Microsoft Fabric Updates Blog

Announcing improvements to CSV data ingestion in Synapse Data Warehouse in Microsoft Fabric

CSV files are widely used for data exchange and data ingestion into data warehouses, but they often pose challenges on performance. In accordance to a study from Microsoft Research, up to 90% of the total time spent in data ingestion occurs in parsing non-binary data such as JSON when using conventional file parsers. This is illustrated in Figure 1.

A bar chart comparing percentage of time spent on parsing versus query processing, for four different queries. The chart shows that more than 90% of the cost comes from parsing for queries one, two, and three, and over 80% of the cost comes from parsing for query number four.
Figure 1: Parsing vs. Query processing Cost
Twitter Dataset, Queries from [30], Spark+Jackson
Source: “Mison: A Fast JSON Parser for Data Analytics”

Today, we’re excited to announce a new, faster way to ingest data from CSV files into Data Warehouse in Microsoft Fabric: introducing CSV file parser version 2.0 for COPY INTO. The new CSV file parser builds on innovation from Microsoft Research’s Data Platform and Analytics group to make CSV file ingestion blazing fast on Data Warehouse.

Benefits

The performance benefits you will enjoy with the new CSV file parser vary depending on the number of files you have in the source, the size of these files, and the data layout. Our testing revealed an overall improvement of 38% in ingestion times on a diverse set of scenarios, and in some cases, more than 4 times faster when compared to the legacy CSV parser.  

How it works

To use the new CSV file parser, we have introduced a new option to the COPY INTO statement: PARSER_VERSION. When this option is used with the value ‘2.0’, the new CSV file parser is used. For example:

COPY INTO mytable
FROM 'https://myaccount.blob.core.windows.net/myblobcontainer/folder1/'
WITH (
    FILE_TYPE = 'CSV',
    PARSER_VERSION = '2.0' --this parameter is optional, and is the new default
)

The performance of the new CSV file parser is so great that we have decided to make it the default option for COPY INTO, so you don’t even have to specify that option to enjoy the benefits of the new file parser. In some rare cases, however, the new CSV parser is not supported, so you may need to use the legacy CSV file parser by specifying the option PARSER_VERSION = ‘1.0’ with COPY INTO. For more information on unsupported scenarios and full syntax, refer to our documentation.

Next steps

The new CSV file parser is now globally available. As mentioned, it is the new default file parser for CSV files during ingestion, so you do not need to do anything to enjoy its benefits.

To learn more about the Mison parser, visit Mison: A Fast JSON Parser for Data Analytics. Even though this research has been published with JSON formats at its origin, this work has since expanded to other file formats, such as CSV.

Related blog posts

Announcing improvements to CSV data ingestion in Synapse Data Warehouse in Microsoft Fabric

May 16, 2024 by Dan Liu

Leverage the power of task flows to design and build your data solutions and manage workspace items in Microsoft Fabric. We’re thrilled to announce that the task flows feature is now in public preview and is enabled for all existing Microsoft Fabric users. Fabric is unifying everything needed to deliver end-to-end data and analytics solutions … Continue reading “Announcing the public preview of task flows in Microsoft Fabric”

May 14, 2024 by Nimrod Shalit

We are happy to share that new items are available to use in CI/CD. You can also build a full E2E automated CI/CD with Fabric APIs. Click to read more, and learn about additional announcements.