What's the Sqoop on Mainframe?

Someone asked "How do we pull data from a mainframe for analysis?". At the time, I got a lot of puzzled looks.

Pulling data from a mainframe can sound daunting. There are a couple of ways to approach it:-
  • DB2 (IBM Database 2) connection
    • Can be more computationally taxing on the mainframe, as it has to process the DB2 SQL every time the results are retrieved.
  • File Transfer Protocol (FTP)
    • Simple file transfer
    • Less computationally expensive if the data set can be extracted predictably.
Apache Sqoop is capable of pulling data out of a mainframe using both mechanisms. The FTP method is what I've used before when pulling mainframe datasets into Hadoop.

The mainframe FTP server doesn't behave identically to most standard FTP servers. Here are some of the differences:-
  • the folder hierarchy is separated by periods/dots (.).
  • the syntax to reference folders/files usually uses quotes, e.g. 'folder1.folder2'
  • the logical type of the last item in the hierarchy changes depending on data set type:-
    • Sequential Data Set - the last item type is a file.
    • Partitioned Data Set - the last item is a folder.
    • Generation Data Set - the last item is a folder.

[Figure: dataset types, example dataset names and their corresponding filesystem mapping on the FTP server]
Apache Sqoop has this logic built in (SQOOP-2938) to map the folders transparently, allowing the user to simply specify the --datasettype and --dataset parameters.
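
To make the mapping a bit more concrete, here is a minimal shell sketch of the rules listed above. It is not Sqoop's actual code; the function name dataset_path and the dataset names HLQ.SALES.DAILY and HLQ.SALES.PDS are made up for illustration. It only shows how a dataset name could be split into the "folder" to change into and the "file" to fetch.

```shell
# Hypothetical helper (not part of Sqoop): split a mainframe dataset name
# into the FTP "folder" and "file" parts, following the rules listed above.
dataset_path() {
  name="$1"   # dataset name, e.g. HLQ.SALES.DAILY
  type="$2"   # s, p or g
  case "$type" in
    s)
      # Sequential: the last qualifier is the file, the rest is the folder.
      folder="${name%.*}"
      file="${name##*.}"
      ;;
    p|g)
      # Partitioned / generation: the whole name is a folder.
      folder="$name"
      file=""
      ;;
    *)
      echo "unknown dataset type: $type" >&2
      return 1
      ;;
  esac
  # The folder is quoted the way the mainframe FTP server expects it.
  printf "folder='%s' file=%s\n" "$folder" "$file"
}

dataset_path HLQ.SALES.DAILY s
dataset_path HLQ.SALES.PDS p
```

For the sequential case the helper treats everything before the last dot as the folder, while for the other two types the full name is the folder and the files are whatever the listing returns.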

The behaviours of each --datasettype setting are as follows:-
  • 'p' - partitioned data set. This retrieves ALL the files in a folder, so the resulting output is multiple files.
  • 'g' - generation data group. This retrieves the 'latest' file in the data group, determined by lexical order (the last GDG file in the FTP folder listing), so the resulting output is a single file.
  • 's' - sequential dataset. This retrieves a single file from the FTP server.
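
The "latest by lexical order" rule for 'g' can be sketched in a few lines of shell. This is only an illustration of the selection rule, not how Sqoop implements it; the function name latest_generation and the generation names in the example are assumptions.

```shell
# Hypothetical helper: read one generation name per line on stdin and
# print the lexically last one, i.e. the 'latest' generation.
latest_generation() {
  # LC_ALL=C forces plain byte-order sorting, independent of the locale.
  LC_ALL=C sort | tail -n 1
}

# Example listing, deliberately out of order.
printf '%s\n' G0001V00 G0003V00 G0002V00 | latest_generation
```

With the example listing this picks G0003V00. The rule works because generation names share a fixed-width pattern, so lexical order and generation order agree.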
As an example to play with, here is a Docker container that emulates some of the mainframe FTP server functionality.

From the Sqoop documentation pages, an example of importing data from a mainframe:-
sqoop import-mainframe --dataset SomeGdg --connect <host> --username myuser --password-alias \
    mypasswordalias --datasettype g --tape true --outdir /tmp/imported/sqoop \
    --target-dir /data/imported/mainframe/SomeGdg


This command will do the following:-
  1. Initiate the FTP connection to <host> with login myuser
  2. Change working directory into the SomeGdg folder to retrieve the latest generation data file
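  3. Place the imported data in /data/imported/mainframe/SomeGdg (the --target-dir). The /tmp/imported/sqoop path (the --outdir) only receives the Java code that Sqoop generates for the import job.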
I hope that demystifies the mainframe capability of Sqoop for some out there.
