Big Data - DataShark Academy

MongoDB with Python: Everything You Need to Know

This post may contain affiliate links. Please read our disclosure for more info.

NoSQL is a type of database used to store and retrieve data that does not fit the structure of a traditional relational database. MongoDB...

PySpark Window Functions – Combining Windows and Calling Different Columns

PySpark window functions calculate results such as the rank and row number over a range of input rows. PySpark window functions operate...

Mastering PySpark Window Ranking Functions: A Comprehensive Guide with Code Examples and Performance Profiling

In this article, we will discuss PySpark Window Ranking Functions, which are used to sort and rank data within groups. We will cover various...

PySpark Partitioning by Multiple Columns – A Complete Guide with Examples

In this article, we'll explore PySpark's partitioning feature, which allows us to partition our data by one or more columns. Partitioning can help optimize...

Mastering PySpark Window Functions: Cumulative Calculations (Running Totals and Averages)

PySpark window functions are an essential tool for processing and analyzing large datasets. In this blog post, we'll dive into one of the most...

Apache Kafka: A Step-by-Step Guide to Handling Producer and Consumer Failures

Comprehensive guide on how to handle Apache Kafka producer and consumer failures. This post offers step-by-step code examples and practical advice on configuring fault...

Mastering Apache Kafka Architecture: A Comprehensive Tutorial for Data Engineers and Developers

An in-depth overview of the architecture of Apache Kafka, a popular distributed streaming platform used for real-time data processing. It explores the key components...

Anatomy of Kafka Architecture

Apache Kafka is used to build real-time streaming data pipelines. This means that, using Apache Kafka, you can move data from one system to another...

PySpark Window Functions – Simple Aggregation: A Real-World Guide

Learn how to use PySpark window functions for simple aggregations in this step-by-step tutorial. Follow real-world use cases with code examples and understand when...

Apache Kafka Guru – Zero to Hero in Minutes

In this course, you will learn about Apache Kafka. In just a few minutes, you will be on the way to becoming an Apache Kafka...

Master Apache SQOOP with Big Data Hadoop

In this Apache Sqoop tutorial, you will learn everything you need to know about Apache Sqoop and how to integrate it within Big...

Everything you need to know about Hadoop Shell

Hadoop Shell is a Linux-like terminal utility that can be used to interact with Hadoop’s distributed file system. For Linux users it will...

How to setup Apache Hadoop Cluster on a Mac or Linux Computer

If you have read our post How to Quickly Setup Apache Hadoop on Windows PC, you will find in this post that its...

How to Quickly Setup Apache Hadoop on Windows PC

Hadoop is an open-source distributed storage and processing software framework sponsored by the Apache Software Foundation. Its core technology is based on Java as...

6 Reasons Why Hadoop is THE Best Choice for Big Data Applications

People often ask us: What is big data? What is Hadoop? Where did it come from? And why is it such a hot topic...

How to find bad partitions in a huge HIVE table

Recently we found an issue with the use of ANALYZE TABLE queries in Hive, where the ANALYZE command was changing the ‘LOCATION’ property of random partitions in...