What is Azure HDInsight?

Azure HDInsight is a cloud distribution of Hadoop components that makes it easy, fast, and cost-effective to process massive amounts of data. You can use popular open-source frameworks such as Hadoop, Spark, Hive, LLAP, Kafka, Storm, and R. With these frameworks, you can enable a broad range of scenarios such as extract, transform, and load (ETL), data warehousing, machine learning, and IoT.

Scenarios for using HDInsight

Azure HDInsight can be used for a variety of big data processing scenarios. The data can be historical (already collected and stored) or real-time (streamed directly from the source).

Batch processing (ETL)

Extract, transform, and load (ETL) is a process where unstructured or structured data is extracted from heterogeneous data sources. It’s then transformed into a structured format and loaded into a data store. You can use the transformed data for data science or data warehousing.
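
The three ETL stages can be illustrated with a minimal, single-machine sketch. It does not use HDInsight or any Azure API; the two inline sources, the `readings` table, and all function names are hypothetical, and an in-memory SQLite database stands in for the data store.

```python
import csv
import io
import json
import sqlite3

# Hypothetical raw inputs standing in for two heterogeneous sources.
CSV_SOURCE = "device,temp_c\nsensor-1,21.5\nsensor-2,19.0\n"
JSON_SOURCE = '[{"id": "sensor-3", "temperature": "22.25"}]'


def extract():
    """Read raw records from both sources without changing them."""
    csv_rows = list(csv.DictReader(io.StringIO(CSV_SOURCE)))
    json_rows = json.loads(JSON_SOURCE)
    return csv_rows, json_rows


def transform(csv_rows, json_rows):
    """Map both record shapes onto one (device, temp_c) schema."""
    rows = [(r["device"], float(r["temp_c"])) for r in csv_rows]
    rows += [(r["id"], float(r["temperature"])) for r in json_rows]
    return rows


def load(rows, conn):
    """Write the structured rows into a table in the data store."""
    conn.execute("CREATE TABLE IF NOT EXISTS readings (device TEXT, temp_c REAL)")
    conn.executemany("INSERT INTO readings VALUES (?, ?)", rows)
    conn.commit()


def run_etl():
    """Run extract, transform, and load against an in-memory database."""
    conn = sqlite3.connect(":memory:")
    load(transform(*extract()), conn)
    return conn


if __name__ == "__main__":
    conn = run_etl()
    for row in conn.execute("SELECT device, temp_c FROM readings ORDER BY device"):
        print(row)
```

On a real cluster, the same stages would typically run as distributed jobs (for example in Spark or Hive) over data in cloud storage rather than in one Python process.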

Data warehousing

You can use HDInsight to perform interactive queries at petabyte scale over structured or unstructured data in any format. You can also build models and connect them to BI tools.

Internet of Things (IoT)


You can use HDInsight to process streaming data that’s received in real time from different kinds of devices.
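
As a rough illustration of what processing device data as it arrives can look like, the sketch below keeps a rolling average per device. It is plain Python rather than Storm, Kafka, or Spark code; the device names, the readings, and the `rolling_averages` function are hypothetical.

```python
from collections import defaultdict, deque


def rolling_averages(events, window=3):
    """Yield (device, average) after each event.

    The average covers at most the last `window` readings of that device.
    """
    recent = defaultdict(lambda: deque(maxlen=window))
    for device, value in events:
        recent[device].append(value)
        yield device, sum(recent[device]) / len(recent[device])


# Hypothetical readings; in practice these would arrive from a live stream.
events = [
    ("thermostat", 20.0),
    ("thermostat", 22.0),
    ("pump", 1.0),
    ("thermostat", 24.0),
    ("thermostat", 26.0),
]

if __name__ == "__main__":
    for device, average in rolling_averages(events):
        print(device, average)
```

Because `rolling_averages` is a generator, each result is produced as soon as its event is read, which mirrors how a streaming job emits output without waiting for the full data set.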

Data science

You can use HDInsight to build applications that extract critical insights from data. You can also use Azure Machine Learning on top of that to predict future trends for your business.

Hybrid

You can use HDInsight to extend your existing on-premises big data infrastructure to Azure to leverage the advanced analytics capabilities of the cloud.

Cluster types in HDInsight

| Cluster Type | Description |
| --- | --- |
| Apache Hadoop | A framework that uses HDFS, YARN resource management, and a simple MapReduce programming model to process and analyze batch data in parallel. |
| Apache Spark | An open-source, parallel-processing framework that supports in-memory processing to boost the performance of big-data analysis applications. |
| Apache HBase | A NoSQL database built on Hadoop that provides random access and strong consistency for large amounts of unstructured and semi-structured data, potentially billions of rows times millions of columns. |
| ML Services | A server for hosting and managing parallel, distributed R processes. It provides data scientists, statisticians, and R programmers with on-demand access to scalable, distributed methods of analytics on HDInsight. |
| Apache Storm | A distributed, real-time computation system for processing large streams of data fast. Storm is offered as a managed cluster in HDInsight. |
| Apache Interactive Query | In-memory caching for interactive and faster Hive queries. |
| Apache Kafka | An open-source platform that’s used for building streaming data pipelines and applications. Kafka also provides message-queue functionality that allows you to publish and subscribe to data streams. |
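
The MapReduce programming model mentioned for Apache Hadoop can be sketched with a word count. This is an in-process illustration only, not Hadoop code: the function names are hypothetical, and on a cluster the map and reduce steps would run in parallel across nodes, with the framework handling the grouping step in between.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce over an iterable of text lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data", "Big clusters"]))
```

Each call to `map_phase` depends only on its own line, and each call to `reduce_phase` only on its own key, which is what allows a framework to distribute the work.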

Development tools for HDInsight

You can use HDInsight development tools, including IntelliJ, Eclipse, Visual Studio Code, and Visual Studio, to author and submit HDInsight data queries and jobs with seamless integration with Azure.

  • Azure Toolkit for IntelliJ
  • Azure Toolkit for Eclipse
  • Azure HDInsight Tools for VS Code
  • Azure Data Lake Tools for Visual Studio

Business intelligence on HDInsight

Familiar business intelligence (BI) tools retrieve, analyze, and report data that is integrated with HDInsight by using either the Power Query add-in or the Microsoft Hive ODBC Driver:

  • Apache Spark BI using data visualization tools with Azure HDInsight
  • Visualize Apache Hive data with Microsoft Power BI in Azure HDInsight
  • Visualize Interactive Query Hive data with Power BI in Azure HDInsight
  • Connect Excel to Apache Hadoop with Power Query (requires Windows)
  • Connect Excel to Apache Hadoop with the Microsoft Hive ODBC Driver (requires Windows)
