Hey everyone,
Hope you’re doing good!
Recently, a client approached me with an urgent issue: they needed to load 456 files into an Oracle Database table to resolve an integration problem. These files were compressed in ZIP format and contained data in fixed-position records with 201 columns. The task was to dynamically unzip each file and load the data into a single table, appending the records and including the name of each file in the table for traceability.
In this blog post, I’ll show how I accomplished this using a combination of Shell Script and SQL*Loader.
The client’s requirement was clear:
- Unzip each file;
- Load the data from each file into the same database table;
- Append the records when loading;
- Include an additional column in the table to store the name of each file being processed;
Given the number of files and the volume of data, we needed an efficient and automated solution.
I created a shell script to handle the unzipping and loading process. The script dynamically updates the SQL*Loader control file to include the name of each file during the load process.
Here’s a breakdown of the solution:
Shell Script for Automation
- Unzips each file;
- Loops through the unzipped data files;
- Dynamically updates the control file;
- Loads the data using SQL*Loader;
- Moves processed files to an archive directory.
SQL*Loader Control File
- Maps all 201 columns;
- Includes an additional column for the file name.
Let’s create our folder structure on the Oracle DB server. Let’s imagine we have an NFS mount point, /oracle_nfs.
As the operating system user that owns the database:
cd /oracle_nfs
mkdir APP_Loader
cd APP_Loader
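The script we will build later moves processed files to an APP_Loaded subdirectory, so let’s create that archive directory now as well:
mkdir APP_Loaded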
Now, let’s create our SQL*Loader control file and save it as APP.ctl (the name the script will reference later):
OPTIONS (DIRECT=true,SKIP=1)
LOAD DATA
INFILE 'dummy.dat'
BADFILE 'dummy.bad'
DISCARDFILE 'dummy.dsc'
APPEND INTO TABLE OWNER.TABLE_TO_BE_LOADED
WHEN (COL01 != ' ')
TRAILING NULLCOLS
(
COL01 POSITION(1:20) CHAR "RTRIM(:COL01)",
COL02 POSITION(21:28) CHAR,
COL03 POSITION(29:32) CHAR,
COL04 POSITION(33:33) CHAR,
COL05 POSITION(34:58) CHAR,
COL06 POSITION(59:73) CHAR,
COL07 POSITION(74:76) CHAR,
COL08 POSITION(77:80) CHAR,
COL09 POSITION(81:90) CHAR,
COL10 POSITION(91:92) CHAR,
COL11 POSITION(93:94) CHAR,
COL12 POSITION(95:95) CHAR,
COL13 POSITION(96:96) CHAR,
COL14 POSITION(97:97) CHAR,
COL15 POSITION(98:99) CHAR,
COL16 POSITION(100:109) CHAR,
COL17 POSITION(110:139) CHAR,
COL18 POSITION(140:142) CHAR,
COL19 POSITION(143:144) CHAR,
COL20 POSITION(145:147) CHAR,
COL21 POSITION(148:148) CHAR,
COL22 POSITION(149:173) CHAR,
COL23 POSITION(174:188) CHAR,
COL24 POSITION(189:191) CHAR,
COL25 POSITION(192:192) CHAR,
COL26 POSITION(193:201) CHAR,
COL27 CONSTANT 'DUMMY.DAT'
)
Control File Parameters Explanation:
- DIRECT=true: This parameter instructs SQL*Loader to use direct path loading, which can significantly increase the speed of the load by bypassing some of the regular processing (such as logging and rollback);
- SKIP=1: This parameter tells SQL*Loader to skip the first record in the input data file. This is often used when the first line of the file contains headers or metadata that should not be loaded into the table. This is our case here.
- INFILE ‘dummy.dat’: Specifies the name of the input data file to be loaded. This will be replaced dynamically in our script;
- BADFILE ‘dummy.bad’: Specifies the name of the file where records that cause errors during loading will be written. This will be replaced dynamically in our script;
- DISCARDFILE ‘dummy.dsc’: Specifies the name of the file where records that do not meet any WHEN clause conditions are written. This will be replaced dynamically in our script;
- APPEND INTO TABLE OWNER.TABLE_TO_BE_LOADED: Specifies the table into which data will be loaded. The APPEND option tells SQL*Loader to add the new data to the existing data in the table (as opposed to INSERT which would require the table to be empty);
- WHEN (COL01 != ' '): This condition specifies that only records where COL01 is not just blanks (i.e. an empty 20-character field) will be loaded. This can be used to filter out unwanted records;
- TRAILING NULLCOLS: This directive allows SQL*Loader to treat any missing trailing columns in the input data file as nulls. This can help avoid errors if some records do not contain values for all columns.
- After those definitions come the column definitions, which I will not explain one by one because they are self-explanatory :-), but I will talk about the last column:
- CONSTANT ‘DUMMY.DAT’: Instead of reading from the input file, this specifies that the column will always be filled with the constant value ‘DUMMY.DAT’. In our script, this is dynamically replaced with the actual file name being processed.
So, in summary, our SQL*Loader control file structure allows us to:
- Load data directly into the table, skipping the first record;
- Filter records based on a condition;
- Treat trailing null columns properly;
- Apply specific transformations to each field;
- Append data to an existing table without removing the old data;
- Capture the file name in a dedicated column for traceability.
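For reference, the target table would look something like the DDL below. This is only a minimal sketch: the actual column names, data types and lengths are assumptions derived from the positions in the control file, and only a few of the data columns are shown:
-- Illustrative DDL only; adjust names, types and lengths to your real table
CREATE TABLE OWNER.TABLE_TO_BE_LOADED (
  COL01 VARCHAR2(20),
  COL02 VARCHAR2(8),
  COL03 VARCHAR2(4),
  COL04 VARCHAR2(1),
  /* ... COL05 through COL25 follow the same pattern ... */
  COL26 VARCHAR2(9),
  COL27 VARCHAR2(100)  -- name of the source file, for traceability
);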
Now, let’s create our Shell Script to load the files. Our script will be called APP_Loader.sh:
#!/bin/bash
####################################################################################
# Script Name : APP_Loader.sh
# Author : Marcus Vinicius Miguel Pedro
# Date : 2024-06-07
# Version : 1.0
# Description : This script processes ZIP files containing integration data files.
# It unzips each ZIP file, loads the data from each .APP file
# into the Oracle database using SQL*Loader, and then moves the
# processed files to an archive directory.
#
# Usage : ./APP_Loader.sh
#
# Parameters :
# None
#
# Prerequisites:
# - Update the script with the credentials of a user who has
# permissions to insert data into the target table.
# - Adjust directory paths if necessary.
#
# Notes :
# - The script assumes the .APP files are named appropriately.
# - Ensure there is sufficient disk space in the directories.
####################################################################################
ZIP_DIR="/oracle_nfs/APP_Loader"
UNZIP_DIR="/oracle_nfs/APP_Loader"
LOADED_DIR="/oracle_nfs/APP_Loader/APP_Loaded"
for zip_file in "$ZIP_DIR"/*.ZIP; do
    # Extract the data file(s) from the current ZIP archive
    unzip -o "$zip_file" -d "$UNZIP_DIR"

    for data_file in "$UNZIP_DIR"/*.APP; do
        # Base name without the .APP extension (used for the log/bad file names)
        filename=$(basename "$data_file" .APP)

        # Replace the placeholder constant with the actual file name
        sed "s/CONSTANT 'DUMMY.DAT'/CONSTANT '$filename.APP'/" APP.ctl > temp_APP.ctl

        # Load the data file and remove it once it has been processed
        sqlldr userid=myuser/mypassword control=temp_APP.ctl bad="$filename.bad" log="$filename.log" data="$data_file"
        rm -f "$data_file"
    done

    # Archive the processed ZIP file and the corresponding log/bad files
    mv "$zip_file" "$LOADED_DIR"
    mv *.log "$LOADED_DIR"
    mv *.bad "$LOADED_DIR"
done
Let’s break down our script:
- ZIP_DIR: This variable holds the directory path where the ZIP files are stored;
- UNZIP_DIR: This variable holds the directory path where the ZIP files will be unzipped. In this case, it is the same as the ZIP_DIR;
- LOADED_DIR: This variable holds the directory path where processed files (both ZIP files and the corresponding log and bad files) will be moved after processing;
- Then we have two loops:
- The outer loop iterates over the ZIP files: it unzips each one, calls the inner loop, and then moves the ZIP file plus the log and bad files to LOADED_DIR;
- The inner loop iterates over the .APP (data) files to be loaded: it extracts the base name of each data file by stripping off the .APP extension and stores it in the filename variable, replaces 'DUMMY.DAT' in the control file with the name of the data file being processed, runs SQL*Loader, and removes the data file once it has been loaded.
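For example, for the file 0015_0006_20240406060029.APP (one of the files shown in the sample output below), the sed command rewrites the last line of the temporary control file like this:
COL27 CONSTANT '0015_0006_20240406060029.APP'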
Benefits of using this approach:
- Automation: The entire process is automated, reducing manual intervention and the risk of errors;
- Efficiency: Direct path loading with SQL*Loader significantly speeds up the data loading process;
- Scalability: The script can handle any number of files, making it easy to scale for large datasets;
- Traceability: Including the file name in the table ensures that each record can be traced back to its source file;
- Performance: Direct path loading optimizes performance by bypassing certain processing steps like logging and rollback.
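A quick side note on the credentials mentioned in the prerequisites: if you prefer not to pass the password on the sqlldr command line (where it can show up in the process list), one option is to keep the connection string in a SQL*Loader parameter file with restricted permissions. This is an optional sketch, not part of the original solution, and APP_Loader.par is a hypothetical file name:
# Hypothetical parameter file holding the credentials
cat > /oracle_nfs/APP_Loader/APP_Loader.par <<'EOF'
userid=myuser/mypassword
EOF
chmod 600 /oracle_nfs/APP_Loader/APP_Loader.par
The sqlldr call inside the loop would then become:
sqlldr parfile=/oracle_nfs/APP_Loader/APP_Loader.par control=temp_APP.ctl bad="$filename.bad" log="$filename.log" data="$data_file"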
Let’s now show an example of the script working. As the operating system user that owns the database:
./APP_Loader.sh
A small part of output is exhibited below:
Archive: /oracle_nfs/APP_Loader/0015_0006_20240406060029_APP20240406-063847-568.ZIP
inflating: /oracle_nfs/APP_Loader/0015_0006_20240406060029.APP
SQL*Loader: Release 19.0.0.0.0 - Production on Fri Jun 7 21:19:32 2024
Version 19.21.0.0.0
Copyright (c) 1982, 2019, Oracle and/or its affiliates. All rights reserved.
Path used: Direct
Load completed - logical record count 7133.
Table OWNER.TABLE_TO_BE_LOADED:
7133 Rows successfully loaded.
Check the log file:
0015_0006_20240406060029.log
for more information about the load.
Archive: /oracle_nfs/APP_Loader/0015_0006_20240408060025_APP20240408-061329-963.ZIP
inflating: /oracle_nfs/APP_Loader/0015_0006_20240408060025.APP
SQL*Loader: Release 19.0.0.0.0 - Production on Fri Jun 7 21:19:34 2024
Version 19.21.0.0.0
Copyright (c) 1982, 2019, Oracle and/or its affiliates. All rights reserved.
Path used: Direct
Load completed - logical record count 12198.
Table OWNER.TABLE_TO_BE_LOADED:
12198 Rows successfully loaded.
Check the log file:
0015_0006_20240408060025.log
Perfect! The script ran really fast. We can now do some checks. Let’s go to the directory where the files were moved after being processed:
cd /oracle_nfs/APP_Loader/APP_Loaded
Now, let’s check if we have any errors:
cat *.log | grep 'loaded' | grep -Ev 'Total|Table|success' | sort | uniq
0 Rows not loaded because all fields were null.
0 Rows not loaded because all WHEN clauses were failed.
0 Rows not loaded due to data errors.
As we can see above, we did not face any errors. Which is great!
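As an extra sanity check (not part of the original run), we could also look for any non-empty .bad files in the archive directory, which would point to rejected records:
find /oracle_nfs/APP_Loader/APP_Loaded -name '*.bad' -size +0c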
Now, let’s check the loaded records:
cat *.log | grep 'loaded' | grep -Ev 'Total|Table' | grep success | awk '{sum += $1} END {print sum}'
2625502
According to the output above, ~2.6M records were loaded into the table. Let’s now check the table:
select count(*) from OWNER.TABLE_TO_BE_LOADED;
COUNT(*)
----------
2625502
Perfect, the count on the table matches the info in the logs!
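Since each record also carries its source file name in the last column, we can break the count down per file. Here I am assuming the file-name column is called COL27, as in the control file above:
select COL27 as source_file, count(*) as loaded_rows
from OWNER.TABLE_TO_BE_LOADED
group by COL27
order by COL27;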
By combining shell scripting and SQL*Loader, we can create a simple and robust solution to efficiently load hundreds of files (or even thousands, if you need it) into an Oracle Database. This approach not only meets the client’s requirements but also provides a scalable and traceable method for handling large data loads. Whether you’re dealing with a few files or thousands, this script can streamline your data integration process and improve overall performance.
Hope it helps!
Peace,
Vinicius
Disclaimer
My postings reflect my own views and do not necessarily represent the views of my employer, Accenture.