Analytics with no RDBMS brakes on!
A good friend brought up a problem the other day that piqued my curiosity, and I decided to tackle it and see if a simple solution was available: how to run queries across dumps from separate databases and collate the results into a single response, all on an object storage layer.
The idea is to remove the restrictions of the RDBMS and perform OLAP, ad-hoc queries, and potentially machine learning on the data via a fast object storage layer, freeing up resources by running in-depth analytics/ML jobs outside of the RDBMS environment.
Our data flow is shown in the diagram below. Using Sqoop, I import tables from two separate Oracle databases into Avro file format. I then use Spark to read the Avro files and perform the analytics that produce the collated result.
Skip to the video if you just want to see it in action. Note that for this implementation I used the following versions:
- Sqoop : v1.4.7
- Hadoop : v2.7.3
- Spark: v2.4.4
- Oracle: v12
- Jupyter notebook core: v4.6.0
The following Sqoop job imports the database table in Avro format to my S3 storage:
sqoop import -Dmapreduce.job.user.classpath.first=true \
  --connect jdbc:oracle:thin:@oracle1:1521:orcl1 \
  --username myuser --password mypassword \
  --table USERSTABLE \
  --target-dir s3a://oracleavro/todaysdump/USERS/ \
  --num-mappers 10 --split-by user_id \
  --as-avrodatafile --compression-codec snappy
I perform the same import for my second Oracle database table so that both are available on my shared fast object storage. We can list the contents of each output folder; here is the ‘MAINT’ table file output:
hadoop fs -ls s3a://oracleavro/todaysdump/MAINT
s3a://oracleavro/todaysdump/MAINT/_SUCCESS
s3a://oracleavro/todaysdump/MAINT/part-m-00000.avro
s3a://oracleavro/todaysdump/MAINT/part-m-00001.avro
s3a://oracleavro/todaysdump/MAINT/part-m-00002.avro
s3a://oracleavro/todaysdump/MAINT/part-m-00003.avro
s3a://oracleavro/todaysdump/MAINT/part-m-00004.avro
s3a://oracleavro/todaysdump/MAINT/part-m-00005.avro
s3a://oracleavro/todaysdump/MAINT/part-m-00006.avro
s3a://oracleavro/todaysdump/MAINT/part-m-00007.avro
s3a://oracleavro/todaysdump/MAINT/part-m-00008.avro
s3a://oracleavro/todaysdump/MAINT/part-m-00009.avro
From within my Jupyter notebook I use the following code to load each set of Avro files as a Spark DataFrame:
from pyspark.sql import SparkSession
import pyspark

## CONSTANTS
APP_NAME = "AvroSpark Application"
AVROUSERS = 's3a://oracleavro/todaysdump/USERS/*.avro'
AVROMAINT = 's3a://oracleavro/todaysdump/MAINT/*.avro'

## Main functionality
spark = SparkSession.builder.appName(APP_NAME).getOrCreate()
usersDF = spark.read.format("avro").load(AVROUSERS)
maintDF = spark.read.format("avro").load(AVROMAINT)
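Note that in Spark 2.4 the Avro data source ships as the external spark-avro module, so it needs to be on the classpath for the reads above to work. Here is a minimal sketch of one way to pull it in when building the session, assuming the default Scala 2.11 build of Spark 2.4.4 and that the artifact can be resolved from a repository or local cache:

# Alternative session builder that resolves the spark-avro package at startup
spark = (SparkSession.builder
         .appName(APP_NAME)
         .config("spark.jars.packages", "org.apache.spark:spark-avro_2.11:2.4.4")
         .getOrCreate())

The same coordinate can instead be passed on the command line via --packages when launching pyspark or spark-submit.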
I can run a simple query on each DataFrame:
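As a rough sketch of the kind of sanity check shown here, printing the schema and a few rows of each DataFrame is enough (the column names are simply whatever Sqoop carried over from the Oracle tables):

# Quick sanity checks on the imported data
usersDF.printSchema()
usersDF.show(5)
maintDF.show(5)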
My randomly generated data from the source Oracle databases is showing correctly. I can now join the data to get a ‘consolidated’ view of our tables:
filterusers = usersDF.filter("NAME = 'jbthomas'")
filtermaint = maintDF.filter("NAME = 'jbthomas'")
results = filtermaint.crossJoin(filterusers)
results.show()
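The crossJoin is fine here because each filter leaves only the rows for a single user; on larger result sets an explicit join on the shared key is the safer pattern. A minimal sketch, assuming both tables expose the same NAME column (as the filters above suggest):

# Equi-join on the shared NAME column instead of filtering then cross-joining
joined = maintDF.join(usersDF, on="NAME", how="inner")
joined.filter("NAME = 'jbthomas'").show()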
Check out the following video to see all this in action:
https://youtube.com/FX32oMpyLOMM
That was easy!
In this example I am dumping small 1000 row tables and performing a single query. If I had a larger dataset, Sqoop provides parameters to increase the parallelism of the import. As per the documentation: “You can specify the number of map tasks (parallel processes) to use to perform the import by using the -m or --num-mappers argument. Each of these arguments takes an integer value which corresponds to the degree of parallelism to employ.”
To test Sqoop parallelism I ran an import of a 35GB table on a small 5-node Cloudera cluster:
- --num-mappers 1 => import job took 4hrs 19min
- --num-mappers 3 => import job took 1hr 15min
- --num-mappers 9 => import job took 27min
Once again, using an S3 layer built for concurrency and parallelism impacts the time to result of our Sqoop and Spark jobs. Please read my post ‘data pipelines does storage matter’ to see the impact storage has on such workloads.
Note: this is an import of a LinkedIn article I wrote.