Friday, March 28, 2014

ODI Series - Pulling file names and timestamp from directory

Recently, I was asked a question by my colleague, Linga (he always comes up with something or the other) - Can we pull file names from a directory and process the latest one out of them, while keeping the history of processed files ?

I could recall that one other colleague Parasuram PJP did something similar. When I tried out his code, that did not work for me due to some reason. Anyways, he was kind enough to share some pearls of wisdom and I sat down searching details about Jython to fetch file names and additional information. To find out the latest file out of all, we need to capture the timestamp of its last modification.

Since we need to keep a log of all files which gets processed, we will be maintaining a RDBMS table (SQL Server in my case) -  FILENAMES with following structure

Along with this comes a twist where initial part file names will be same having different suffix. Just to simulate the scenario, I could think of some files which were on my system used while exporting Essbase data. The directory looks something like this. We have multiple files with their names starting with 'DtDemo'. The task is to load these files into SQL table and then find out the latest file to process it.

This can be achieved by writing few lines of code in Jython, call it as a procedure and then put it in a package. 
We start with creating a new procedure to use Jython technology and add a step 'PullFile' to it

Define an option for procedure to store the source directory name - SOURCE_DIR

'Command on Source' uses the RDBMS schema where the FILENAMES table exists and 'Command on Target' uses Jython with code to pull filenames, timestamp for last modification. This information is pushed to SQL table.

The SQL table have all files with names starting with 'DtDemo'. I played around with 'DtDemoBasic.txt' to save it at different times to get different timestamps for this file. Every time the procedure is executed, it loads the names of the files with their last modified timestamp. The filename for the latest one can be pulled by doing a MAX on timestamp.

It was not less that Linga added another requirement to optimize his integration - Process the file only if the file got updated, else ignore it

All I could think of  - compare the timestamp of latest two files and then decide whether to process it or not. Lets create procedure which executes an update command to update the PRCSD_FLAG to 'Y' for the latest file. This procedure should get executed only when the timestamp of latest file is different from second latest file. We would do this check in an ODI  variable and then  call the procedure based on variable's value

The variable would be executed on the RDBMS schema with table FILENAMES and have the code to compare the timestamps of top first file with the top second file when files are arranged in descending order of timestamps

The complete flow can be arranged in a package with following steps:

  • Pull all files from directory in a table
  • Refresh variable with SQL to compare timestamps of top two files
  • Evaluate the variable to check if value is '0' (zero)
  • Execute procedure to update flag if variable value is zero

Jython code and SQL queries can be downloaded from here.

Hope this helps !!!


  1. Nice post, illustrates the flexibility of ODI in satisfying arbitrary and diverse customer requirements.