How to compile unmanaged libraries within a SCALA application



In this post, I will share how to compile and package old or unmanaged libraries with your Scala application. Recently I was building a new Scala application using an older thrift library, and the SBT compiler gave me a very hard time getting it to work. There are various hacks to work around this problem, but I wanted to stick with the SBT compiler's base flows as much as possible and leverage its core functionality to the fullest. Keep reading to learn how to compile external libraries within a Scala application and about my experience…

Let's start with a quick recap of what SBT is. Basically, SBT is a build and dependency management tool, much like Maven, that is quite easy to use and is gaining a lot of attention in the Scala and Spark worlds.

If you haven't installed SBT on your PC yet, then check out my other post with a step-by-step guide on installing Spark, Scala and SBT.

Now let's talk a little bit about what happens behind the scenes when the SBT compiler is invoked. When you type sbt compile at the command prompt, SBT looks for a file called build.sbt (you can name it anything.sbt as well) in the project's root directory. In this file, which is the equivalent of Maven's pom.xml, you specify the Scala version, the name of your application (used to name the final packaged jar), and the dependencies needed by your Scala application. Here's a sample build.sbt file:

name := "My Scala Project"
version := "1.0"
scalaVersion := "2.10.5"
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.6.0"
libraryDependencies += "org.apache.spark" %% "spark-hive" % "1.6.0"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.1.0"
 
libraryDependencies ++= Seq(
    "org.apache.hadoop" % "hadoop-core" % "0.20.2",
    "org.apache.hbase" % "hbase" % "0.90.4"
   
)
 
scalacOptions += "-target:jvm-1.7"

Once the build.sbt file is located, SBT recursively downloads all dependencies, including the transitive ones needed by the libraries themselves.
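
As a quick illustration, if you only want to trigger this resolution step without compiling anything yet, running SBT's update task on its own should do it:

sbt update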

On a quick side note, you might be interested in learning why building a high-performance Spark application is the way to go.

For instance, in this sample build.sbt file, I want SBT to compile my Scala application using spark-core 1.6.0, spark-hive 1.6.0, and spark-sql 2.1.0, much like what we would do in any Maven application. In addition, I also want two more dependencies, hadoop-core 0.20.2 and hbase 0.90.4, which I declared together within Seq(). There is no difference between dependencies declared outside Seq() and within it; they are just two different ways to declare the same thing.


Alright, let's now focus on the HBase dependency:

"org.apache.hbase" % "hbase" % "0.90.4"

That's where I faced most of the problems. hbase 0.90.4 internally uses the thrift library, and because hadoop-core is set to version 0.20.2, by default SBT will try to download org.apache.thrift#thrift;0.2.0. Unfortunately, thrift-0.2.0 wasn't easily available on org.apache or mvnrepository.com for SBT to download. I had to search around a bit and finally found it in the Red Hat GA repository at

https://maven.repository.redhat.com/ga/org/apache/thrift/thrift/0.2.0/
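
Depending on your network setup, one option that stays within SBT's normal flow is to add this repository as an extra resolver in build.sbt so SBT can fetch thrift 0.2.0 itself. A minimal sketch (the resolver name is just a label I chose; adjust or drop it if you prefer the local-repository approach described below):

// Hypothetical extra resolver pointing at the Red Hat GA repository root
resolvers += "redhat-ga" at "https://maven.repository.redhat.com/ga/"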

Anyway, the downloaded jars and libraries are stored by default at

/home/user/.ivy2/cache/

That's where all downloaded dependencies and external libraries are cached and made available for subsequent builds. This avoids downloading the same dependencies every time the user compiles the application.
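
For example, once resolution has succeeded, the thrift artifact should show up in the cache at a path like this (the exact layout may vary slightly between Ivy/SBT versions):

ls ~/.ivy2/cache/org.apache.thrift/thrift/jars/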

If you want to force SBT to pick up local jars or libraries, here's how you can do it:

  1. Create a directory:

/home/user/.ivy2/local

  2. Under this directory, create subdirectories for the library (or libraries) you want to provide locally. I will take the case of thrift-0.2.0, so I added the following directory structure (you can create the whole chain with the single command shown right after it):

/home/user/.ivy2/local/org.apache.thrift/thrift/0.2.0/ivys
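
On Linux or Mac, for example, the whole directory chain can be created in one go:

mkdir -p /home/user/.ivy2/local/org.apache.thrift/thrift/0.2.0/ivys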

 

Now there are two ways to get your desired thrift.jar resolved and compiled in by SBT.

First method

Create an ivy.xml file and let SBT download all dependencies in the usual way.

For this, you can create a file called ‘ivy.xml’ under

/home/user/.ivy2/local/org.apache.thrift/thrift/0.2.0/ivys

Keep the name of this file exactly as mentioned; SBT specifically looks for an ivy.xml file at this path.

Then, add the following content to the ivy.xml file (you might need to change it according to the libraries you need):

<?xml version="1.0" encoding="UTF-8"?>
<ivy-module version="2.0" xmlns:m="http://ant.apache.org/ivy/maven" xmlns:e="http://ant.apache.org/ivy/extra">
        <info organisation="org.apache.thrift"
                module="thrift"
                revision="0.2.0"
                status="release"
                publication="20141118121911"
        >
                <license name="The Apache Software License, Version 2.0" url="http://www.apache.org/licenses/LICENSE-2.0.txt" />
                <description homepage="http://thrift.apache.org">
                Thrift is a software framework for scalable cross-language services development.
                </description>
                <e:sbtTransformHash>e0c5ee03acc03200c1cebed43f387c2bf613e676</e:sbtTransformHash>
        </info>
        <configurations>
                <conf name="default" visibility="public" description="runtime dependencies and master artifact can be used with this conf" extends="runtime,master"/>
                <conf name="master" visibility="public" description="contains only the artifact published by this module itself, with no transitive dependencies"/>
                <conf name="compile" visibility="public" description="this is the default scope, used if none is specified. Compile dependencies are available in all classpaths."/>
                <conf name="provided" visibility="public" description="this is much like compile, but indicates you expect the JDK or a container to provide it. It is only available on the compilation classpath, and is not transitive."/>
                <conf name="runtime" visibility="public" description="this scope indicates that the dependency is not required for compilation, but is for execution. It is in the runtime and test classpaths, but not the compile classpath." extends="compile"/>
                <conf name="test" visibility="private" description="this scope indicates that the dependency is not required for normal use of the application, and is only available for the test compilation and execution phases." extends="runtime"/>
                <conf name="system" visibility="public" description="this scope is similar to provided except that you have to provide the JAR which contains it explicitly. The artifact is always available and is not looked up in a repository."/>
                <conf name="sources" visibility="public" description="this configuration contains the source artifact of this module, if any."/>
                <conf name="javadoc" visibility="public" description="this configuration contains the javadoc artifact of this module, if any."/>
                <conf name="optional" visibility="public" description="contains all optional dependencies"/>
        </configurations>
 
        <publications>
                <artifact name="thrift" type="jar" ext="jar" conf="master" />
        </publications>
        <dependencies>
                <!-- https://mvnrepository.com/artifact/org.apache.thrift/thrift -->
                <!--<dependency org="org.apache.thrift" name="thrift" rev="0.2.0"/>
                -->
          </dependencies>
</ivy-module>

This should get you thrift.jar as well as any other dependent libraries internally used by thrift or specified under the <dependencies> tag above.
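
With the ivy.xml in place under ~/.ivy2/local, SBT should find the artifact during resolution, since on the SBT versions I used the local Ivy repository at ~/.ivy2/local is part of the default resolver chain. If you also want to depend on thrift directly (rather than only transitively through hbase), a minimal sketch of the extra build.sbt line would be:

// Resolved from the local Ivy repository (~/.ivy2/local) prepared above
libraryDependencies += "org.apache.thrift" % "thrift" % "0.2.0"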



But hold on, there's an easier way too, which I learned the hard way!

Second Method

Here you go: after step 2 above, create another directory called 'jars', so your directory structure looks like this:

/home/user/.ivy2/local/org.apache.thrift/thrift/0.2.0/jars
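
Again, on Linux or Mac, for example:

mkdir -p /home/user/.ivy2/local/org.apache.thrift/thrift/0.2.0/jars
cd /home/user/.ivy2/local/org.apache.thrift/thrift/0.2.0/jars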

Then download the required version of the jar file into this directory. If you are on Linux or Mac, you can use wget; on Windows, you can download it manually from a browser:

wget https://maven.repository.redhat.com/ga/org/apache/thrift/thrift/0.2.0/thrift-0.2.0.jar

Provide appropriate permissions

chmod 754 thrift-0.2.0.jar

Working behind a firewall?

If your server is behind a firewall and cannot connect directly to external Maven repositories, you can download the jar file manually on your PC/Mac and upload it to the above path on your server, or create a bash shell script with proxy settings as shown here:

#!/bin/bash
#name: sbtcompiler.sh

export http_proxy=<proxy url>
export https_proxy=${http_proxy}
export ftp_proxy=${http_proxy}
export rsync_proxy=${http_proxy}
 
echo "compiling..."
sbt clean compile
echo "completed sbt compile"
 
echo "building a package..."
sbt package
echo "completed sbt package"

Finally, compile and package your project as:

sbt clean compile package
OR
./sbtcompiler.sh

The above commands consist of three individual SBT tasks:

  1. clean – to delete any previous .class or intermediate files
  2. compile – obviously to compile .scala into .class files
  3. package – to create a jar file

Finally, you should get a jar package created as target/scala-xx.x/<project-name>.jar


If you are interested in reading more about Scala and SBT and want to set up the trio on a local PC, then I think you will like my other article on setting up Spark, Scala and SBT on a Windows PC.

Please share how this article helped you in comments below.

