How to find bad partitions in a huge Hive table


Recently we found an issue with the use of ANALYZE TABLE queries in Hive: the analyze command was changing the ‘LOCATION’ property of random partitions in a table to point to another database/table. The situation gets worse when multiple tables are involved, each with tens of thousands of partitions. The next question for us was: how do we find the bad partitions in those Hive tables? Read on to find out how we solved this problem efficiently.

In order to fix such partitions, we had to drop the tables and re-create them from scratch.

But the challenge was to identify the impacted tables. With hundreds or thousands of partitions per table, it’s practically impossible to look up the ‘LOCATION’ property of each partition through the Hive CLI.

The most common way to identify such partitions is:

  1. Connect to HIVE CLI
  2. Run a describe formatted query manually on each suspicious table/partition:
hive> describe formatted <table> partition (<partition spec>);


e.g.

hive> describe formatted orders partition (dt=20170902);

Then look for the ‘Location:’ property in the output.

Note: As is evident, this approach is time-consuming, since it means running describe formatted queries on every individual partition of a table, and there is a good chance of missing bad partitions in the process.
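If you do want to script the manual approach, the Location line can be scraped out of the describe formatted output. Below is a minimal sketch of that idea; the `extract_location` helper is hypothetical (not part of Hive), and it assumes the usual Hive output layout where the value appears on a line starting with `Location:`.

```python
import re

def extract_location(describe_output):
    """Pull the Location value out of `describe formatted` output.

    Assumes the usual Hive layout, i.e. a line like
    'Location:           hdfs://...'. Returns None if no such line exists.
    """
    for line in describe_output.splitlines():
        match = re.match(r"\s*Location:\s*(\S+)", line)
        if match:
            return match.group(1)
    return None

# Sample snippet of `describe formatted orders partition (dt=20170902)` output
sample = """
# Detailed Partition Information
Database:           sales
Table:              orders
Location:           hdfs://not-sales-db/abc/dt=20160304
"""

print(extract_location(sample))
```

In a real run you would feed it the output of something like `hive -e "describe formatted orders partition (dt=20170902)"`, once per partition, which is exactly why this approach does not scale to tens of thousands of partitions.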

 


Better Approach (reading from Hive’s metastore)

  1. Run the query below against Hive’s metastore, which in our case is on a MySQL DB:

Note: often only Hadoop admins are allowed to run such queries, so check with them before trying.

select DBS.NAME, TBLS.TBL_NAME, PARTITIONS.PART_NAME, SDS.LOCATION
from DBS
join TBLS on DBS.DB_ID = TBLS.DB_ID
join PARTITIONS on TBLS.TBL_ID = PARTITIONS.TBL_ID
join SDS on PARTITIONS.SD_ID = SDS.SD_ID
where TBLS.TBL_NAME = '<table_name_to_check>'
and DBS.NAME = '<database to check in>'
and SDS.LOCATION not like '%<database name - same as DBS.NAME>%'
order by 1, 2;


e.g.
select DBS.NAME, TBLS.TBL_NAME, PARTITIONS.PART_NAME, SDS.LOCATION
from DBS
join TBLS on DBS.DB_ID = TBLS.DB_ID
join PARTITIONS on TBLS.TBL_ID = PARTITIONS.TBL_ID
join SDS on PARTITIONS.SD_ID = SDS.SD_ID
where TBLS.TBL_NAME = 'orders'
and DBS.NAME = 'sales'
and SDS.LOCATION not like '%sales%'
order by 1, 2;

The result will look like this:

+-------+----------+-------------+-------------------------------------+
| NAME  | TBL_NAME | PART_NAME   | LOCATION                            |
+-------+----------+-------------+-------------------------------------+
| sales | orders   | dt=20160917 | hdfs://not-sales-db/abc/dt=20160304 |
+-------+----------+-------------+-------------------------------------+

P.S. Here not-sales-db and abc are the other database and table that the partition’s LOCATION is (incorrectly) pointing to.


Nice, we finally resolved the issue with bad partitions in huge Hive tables with minimal work.



