Issues and Solutions for Loading Mainframe Data into Hadoop – A Case Study

The Story

Over the past few months, Alebra has been actively involved with a multi-billion-dollar company that provides software and data services to the financial and healthcare industries. Historically, mainframes have performed most of the company's data processing. Driven by a desire to increase the competitive advantage and value of its products, the company aims to increase the quality, quantity and timeliness of data provided to its customers, all while lowering overall costs. To this end, it decided to implement a Hadoop platform on commodity Intel architecture. The Hadoop platform would perform the new application processing as well as a portion of the existing mainframe processing.

The Requirements

The new application requires replicating many terabytes of data, contained in hundreds of z/OS DB2 tables, to the new Hadoop system. These transfers would need to be performed daily during a short processing window. Since this window often falls during peak processing periods, the processing overhead of the data transfer must be kept to a minimum. Due to the large number of transfers, operational efficiency in scheduling and monitoring transfers and in resolving problems was also required. In summary, the requirements were:

High bandwidth transfer capability
Lowest processing overhead possible
Operational efficiency

The company was already an Alebra customer and user of a Parallel Data Mover (PDM) and z/OpenGate solution for other high availability file transfer needs, and was fully aware that an Alebra solution would give them the best opportunity to meet these demanding requirements. However, there were some additions and tweaks that were necessary to fully realize all the benefits of the Alebra solution.

Issue #1 – Data Piping to Hadoop Loader

The Problem

Problem determination took too long. To increase speed and save interim storage on the Hadoop system, the hope was to use the existing PDM capability for piping data directly to the Hadoop loader. While this technique works functionally, problem determination and resolution took too much time when an error occurred. When the Hadoop loader encounters an error, the file transfer operation cannot detect the error condition through the standard Linux pipe capability. The transfer simply hangs until it ultimately times out. With the large number of transfer operations and the short timeframes, reducing problem resolution time was crucial to operational efficiency.
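The general difficulty with pipes can be illustrated with a minimal Python sketch. It is not PDM code and does not invoke the Hadoop loader; the function name stream_to_consumer and the stand-in FAILING_CONSUMER command are hypothetical. It shows only one aspect of the problem: a pipe carries data and nothing else, so the writer learns of a consumer failure indirectly, through a broken pipe or the exit status collected afterwards. The hang described above is not reproduced here.

```python
import subprocess
import sys

# Hypothetical stand-in for a loader that fails before reading its input.
FAILING_CONSUMER = [sys.executable, "-c", "import sys; sys.exit(3)"]


def stream_to_consumer(chunks, consumer_cmd):
    """Pipe byte chunks to a consumer process and report how it ended."""
    proc = subprocess.Popen(consumer_cmd, stdin=subprocess.PIPE)
    bytes_sent = 0
    pipe_broken = False
    try:
        for chunk in chunks:
            # write() may only place data in a buffer; success here does
            # not mean the consumer has processed (or even read) the data.
            proc.stdin.write(chunk)
            bytes_sent += len(chunk)
        proc.stdin.close()  # flushes the buffer and signals end of data
    except BrokenPipeError:
        # The consumer closed its end of the pipe; no reason is given.
        pipe_broken = True
    # The exit status is the only dependable failure signal in this sketch.
    returncode = proc.wait()
    return {
        "bytes_sent": bytes_sent,
        "pipe_broken": pipe_broken,
        "returncode": returncode,
    }
```

Calling stream_to_consumer([b"row"], FAILING_CONSUMER) returns a non-zero returncode, while whether the broken pipe is noticed depends on timing. That is why the writer has to check explicitly rather than rely on the pipe itself.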

The Solution

In close collaboration with our customer, Alebra developed native file transfer support for the Hadoop Distributed File System (HDFS). Using standard Hadoop program calls, PDM can now store files in, or retrieve files from, HDFS. Any error condition from the source system, the target system or the transfer itself is detected immediately, and the details are logged to assist in rapid problem resolution.

Issue #2 – Interim Storage on the Mainframe

The Problem

Typically, when z/OS DB2 data is moved to another system, the DB2 unload program is used. The DB2 unload program provisions disk storage to hold a sequential z/OS file, which is then transferred in a separate operation. For previous requirements, this was not an issue because the total volume of data was not excessively large. However, the new application would have required an expensive storage upgrade to accommodate the interim files.

The Solution

Eliminate the need for interim files. PDM supports retrieving and sending data with z/OS programs through the standard IBM Subsystem Interface (SSI) facility. In this case, a simple JCL change to the DB2 High-Speed Unload Program directs the data flow to PDM instead of the usual interim file on disk or tape. Output of the unload program is sent immediately to the destination system instead of being written to a local sequential file. This not only saves disk space, it also reduces the total transfer time, measured from reading the source to landing the data on the target system.

Issue #3 – CPU Requirements of z/OS DB2 High-Speed Unload Program

The Problem

The DB2 unload program consumes too much CPU time. The easiest method to transform z/OS data to a format readable by other system types is to perform the data transformation on z/OS. Using this type of unload option, however, increases CPU time significantly compared to unloading the z/OS data “as is”. The additional CPU time was unacceptable given the large amounts of data sent daily.

The Solution

Transform the data on the Hadoop system. The DB2 unload operation was changed to unload data in native z/OS format. CPU time for the unload operation dropped dramatically, saving significant operating cost. Data transformations are now performed on the Hadoop system at a lower total cost.
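As a rough illustration of what transforming native z/OS data on the receiving side can involve, the Python sketch below decodes two common mainframe field types: EBCDIC character data and packed decimal numbers. It is not the customer's transformation code. The function names, the choice of code page cp037 and the sample field values are assumptions made for the example; the actual code page and record layout depend on the DB2 table definitions.

```python
from decimal import Decimal


def decode_ebcdic(raw: bytes, codepage: str = "cp037") -> str:
    """Decode an EBCDIC character field and drop trailing pad spaces.

    cp037 (US/Canada EBCDIC) is an assumed code page for this sketch.
    """
    return raw.decode(codepage).rstrip()


def decode_packed_decimal(raw: bytes, scale: int = 0) -> Decimal:
    """Decode a packed decimal field: two digits per byte, sign in the
    low nibble of the last byte (0xD or 0xB means negative)."""
    digits = []
    for byte in raw[:-1]:
        digits.append(byte >> 4)    # high nibble holds one digit
        digits.append(byte & 0x0F)  # low nibble holds the next digit
    digits.append(raw[-1] >> 4)     # last byte: one digit plus the sign
    sign_nibble = raw[-1] & 0x0F
    value = int("".join(str(d) for d in digits))
    if sign_nibble in (0x0B, 0x0D):
        value = -value
    # Shift the decimal point left by the column's declared scale.
    return Decimal(value).scaleb(-scale)


# Hypothetical sample fields: a 7-byte name and a 3-byte amount, scale 2.
name = decode_ebcdic(b"\xC8\xC5\xD3\xD3\xD6\x40\x40")
amount = decode_packed_decimal(b"\x12\x34\x5C", scale=2)
```

Conversions like these are pure byte manipulation, which is why they can be moved off the mainframe and run on lower-cost processors.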

The Complete Solution

The diagram below illustrates the data flow of the final solution deployed by this customer.

Key points of this solution are as follows:

The shortest end-to-end time is now possible for file transfers.
All phases of the data movement are invoked by a single, simple JCL change to the DB2 Unload Program; all phases run concurrently. A fraction of a second after the last piece of data is read from z/OS disk, the Hadoop file is complete and available for use.
No interim data files are required, eliminating the need to provision storage.
Massive transfer bandwidth is available to run multiple transfer operations – all at the maximum speed the disk storage can sustain.
Quick and easy problem resolution; all error conditions in any phase of the data movement are logged.

Summary

The volume of data that moves across platforms continues to increase dramatically. When your requirements call for a lower-cost, higher-speed and operationally efficient alternative, contact Alebra Technologies. There is a good chance we can help you.

Bill Yeager | Chief Technology Officer | bill.yeager@alebra.com