How to find bad partitions in a huge Hive table


Recently we found an issue with the use of ANALYZE TABLE queries in Hive: the analyze command was changing the ‘LOCATION’ property of random partitions in a table to point to another database/table. The situation gets worse when multiple tables are involved, each with tens of thousands of partitions. The next question for us was: how do we find the bad partitions in those Hive tables? Read on to find out how we solved this problem efficiently.

In order to fix such partitions, we had to drop the tables and re-create them from scratch.

But the challenge was to identify the impacted tables. With hundreds or thousands of partitions per table, it’s practically impossible to look up the ‘LOCATION’ property of each partition through the Hive CLI.

The most common way to identify such partitions is:

  1. Connect to HIVE CLI
  2. Run a describe formatted query manually on each suspicious table/partition:
hive> describe formatted <table> partition (<partition spec>);


e.g.

hive> describe formatted orders partition (dt=20170902);

Then look for the ‘Location:’ property in the output.

Note: As is evident, this approach is time-consuming, since it means running describe formatted queries on every individual partition of a table, and there is a good chance of missing bad partitions in the process.
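If you do want to script the manual approach, the Location line can be scraped out of the describe formatted output. Below is a minimal sketch of that idea; the `extract_location` helper is hypothetical (not part of Hive), and it assumes the usual Hive output layout where the value appears on a line starting with `Location:`.

```python
import re

def extract_location(describe_output):
    """Pull the Location value out of `describe formatted` output.

    Assumes the usual Hive layout, i.e. a line like
    'Location:           hdfs://...'. Returns None if no such line exists.
    """
    for line in describe_output.splitlines():
        match = re.match(r"\s*Location:\s*(\S+)", line)
        if match:
            return match.group(1)
    return None

# Sample snippet of `describe formatted orders partition (dt=20170902)` output
sample = """
# Detailed Partition Information
Database:           sales
Table:              orders
Location:           hdfs://not-sales-db/abc/dt=20160304
"""

print(extract_location(sample))
```

In a real run you would feed it the output of something like `hive -e "describe formatted orders partition (dt=20170902)"`, once per partition, which is exactly why this approach does not scale to tens of thousands of partitions.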

 


Better Approach (reading from Hive’s metastore)

  1. Run the query below against Hive’s metastore, which in our case is on a MySQL DB:

Note: often only Hadoop admins are allowed to run such queries, so check with them before trying.

select DBS.NAME, TBLS.TBL_NAME, PARTITIONS.PART_NAME, SDS.LOCATION
from DBS
join TBLS on DBS.DB_ID = TBLS.DB_ID
join PARTITIONS on TBLS.TBL_ID = PARTITIONS.TBL_ID
join SDS on PARTITIONS.SD_ID = SDS.SD_ID
where TBLS.TBL_NAME = '<table_name_to_check>'
and DBS.NAME = '<database to check in>'
and SDS.LOCATION not like '%<database name - same as DBS.NAME>%'
order by 1, 2;


e.g.
select DBS.NAME, TBLS.TBL_NAME, PARTITIONS.PART_NAME, SDS.LOCATION
from DBS
join TBLS on DBS.DB_ID = TBLS.DB_ID
join PARTITIONS on TBLS.TBL_ID = PARTITIONS.TBL_ID
join SDS on PARTITIONS.SD_ID = SDS.SD_ID
where TBLS.TBL_NAME = 'orders'
and DBS.NAME = 'sales'
and SDS.LOCATION not like '%sales%'
order by 1, 2;

The result will look like this:

+-------+----------+-------------+-------------------------------------+
| NAME  | TBL_NAME | PART_NAME   | LOCATION                            |
+-------+----------+-------------+-------------------------------------+
| sales | orders   | dt=20160917 | hdfs://not-sales-db/abc/dt=20160304 |
+-------+----------+-------------+-------------------------------------+

P.S. Here not-sales-db and abc are the other database and table that the partition’s LOCATION is (incorrectly) pointing to.


Nice, we finally resolved the issue with bad partitions in huge Hive tables with minimal work.



