U-SQL is a Microsoft query language that combines SQL-like declarative syntax with C# programming, allowing it to be used to process both structured and unstructured data in big data environments. Introduced in 2015, U-SQL is part of Microsoft’s Azure Data Lake Analytics cloud service, but it allows users to run queries against multiple data stores in the Azure cloud.
SQL is the standard language for querying relational databases, while C# (pronounced “C-sharp”) is a programming language developed by Microsoft. The company integrated them to create U-SQL in response to the emergence of big data systems, which often store unstructured data that can pose different challenges for SQL and procedural programming languages. U-SQL was designed to unify the two approaches by creating native extensibility for user-written C# code in an SQL implementation.
U-SQL has its roots in a declarative and extensible scripting language called SCOPE that Microsoft uses internally. In a 2008 research paper, the company said it developed SCOPE – short for Structured Computations Optimized for Parallel Execution – to allow SQL-savvy data analysts to query search logs, Internet clickstreams, and other large datasets that were increasingly stored in distributed environments. platforms like Hadoop instead of SQL-based relational databases.
The integration between SQL and C# in U-SQL is based on SCOPE, as is the U-SQL query execution and optimization framework. The U-SQL metadata system, SQL syntax, and language semantics are modeled after ANSI SQL and Transact-SQL (T-SQL), Microsoft’s implementation of the query language for its SQL Server database. However, U-SQL does not fully adhere to ANSI SQL; for example, commands such as SELECT must be written in uppercase, and C# syntax is used for expressions inside commands.
According to Microsoft, U-SQL allows users to process any type of data and at any scale. The language automatically adapts queries to use available system resources, allowing users to focus on querying data instead of servicing infrastructure needs or writing what Microsoft calls “plumbing code.”
As part of Microsoft’s Azure Data Lake platform, U-SQL is the built-in language for analyzing datasets in Azure Data Lake Store, which works together with the Analysis Service to provide a lake environment of data in the cloud. U-SQL can also be used to run queries against relational data stores that include Azure SQL Database, the cloud-based cousin of SQL Server, as well as Azure SQL Data Warehouse, Azure Blob Storage, and SQL Server instances that are configured in Azure virtual machines. .
How to use U-SQL
U-SQL code is written as scripts that include a sequence of statements to initiate processing actions. Currently, the Azure Data Lake Analytics service only supports batch processing jobs through U-SQL. Therefore, queries cannot directly return results; according to Microsoft, U-SQL scripts are typically structured to retrieve data from source systems in a rowset format, transform it as needed, and then output the transformed data to a U-SQL file or table for analysis.
To process unstructured datasets in a data lake, U-SQL users can apply a read schema, a common approach in big data systems that do not require conformance to a rigid schema, as do relational databases. Custom processing logic and a variety of functions, types, aggregates and user-defined objects – the latter including extractors, processors, outputs and more – can also be inserted into scripts.
The image below shows an example of a simple U-SQL script from Microsoft. In this case, the script does not include any data transformation: it extracts data from a search log source file, applies a schema, and writes the resulting rowset to a CSV file:
U-SQL scripts can be written and submitted to the Azure Data Lake Analytics service for execution in Visual Studio 2017, using a plugin called Azure Data Lake and Stream Analytics Tools. A U-SQL extension is also available for Visual Studio Code, Microsoft’s lightweight code editor; it is simply known as Azure Data Lake Tools. Additionally, users can run U-SQL jobs on Azure Data Lake Analytics through the Azure portal, Azure CLI, or Azure PowerShell.
Other data stores that can be queried with U-SQL are considered external data sources. Users must run federated queries against them, a process that involves creating data source objects in U-SQL to summarize connection details to a particular data store and pass queries to its own query engine for execution.
Microsoft says the ability to embed custom C# code in U-SQL scripts allows users to express complex business algorithms as part of queries. Using C# types as the default coding style also makes it easier for users to visualize how data will be processed when writing queries, according to the company.