difftr — Diff Pentaho KTR files

I’ve been using Pentaho Data Integration (PDI) in various jobs over the past few years. PDI is an ETL tool that is often used to migrate data from one database to another. PDI runs scripts in the KTR format, which are directed graphs of steps, each of which manipulates the data rows that pass through it. For instance, an Add constant step adds new columns with constant values to every row passing through it. PDI provides a visual development environment where these steps can be added and connected together to make a program that takes row and column data, manipulates it, and then outputs it.

Since KTR files are really just programs, they evolve, and it is a good idea to track their evolution with a standard version control system such as git. However, the underlying XML format of a KTR is not stable in the ordering of its various elements, among other things. Therefore, line-based utilities such as diff are of little use on KTRs. That’s why I decided to make a simple tool that would visually diff any two KTRs.
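To make the ordering problem concrete, here is a minimal Python sketch. It is not how difftr is implemented. It assumes that each step is a `<step>` element directly under the root element and that each step has a `<name>` child. The function name `steps_by_name` and the two sample documents are made up for illustration. Indexing steps by name, and sorting each step's direct children by tag, makes the comparison independent of the order in which elements were written to the file.

```python
import xml.etree.ElementTree as ET


def steps_by_name(ktr_xml):
    """Map each step name to an order-independent summary of its fields.

    Assumption (for illustration): steps are <step> children of the root
    element, and each step has a <name> child. Only the text of a step's
    direct children is kept; deeper nesting is ignored in this sketch.
    """
    root = ET.fromstring(ktr_xml)
    steps = {}
    for step in root.findall("step"):
        name = step.findtext("name", default="")
        # Sort the direct children by tag so their order in the file is irrelevant.
        children = sorted(step, key=lambda child: child.tag)
        steps[name] = "\n".join(
            f"{child.tag}={(child.text or '').strip()}" for child in children
        )
    return steps


# Two hypothetical files holding the same steps, saved in a different order.
A = """<transformation>
  <step><name>input</name><type>TableInput</type></step>
  <step><name>const</name><type>Constant</type></step>
</transformation>"""
B = """<transformation>
  <step><type>Constant</type><name>const</name></step>
  <step><name>input</name><type>TableInput</type></step>
</transformation>"""

print(A == B)                                # the raw text differs
print(steps_by_name(A) == steps_by_name(B))  # the steps themselves do not
```

A text diff of `A` and `B` would report every line as changed, while the name-indexed comparison reports no difference.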

difftr is a web tool that visually presents the diff between any two KTR files.

An example of two KTRs and their corresponding diff as output by difftr

The white boxes above represent steps that are present and identical in both KTRs. Yellow boxes represent steps that are present in both but have changed. Red boxes indicate deleted steps and green boxes, added ones. Clicking on a yellow box opens another window with a line-based diff of that step (see below), showing how a field in the SQL of the input step has changed.

...
<connection />
<sql>select
  col_a
  col_b
from tab</sql>
<limit>0</limit>
...
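
The color coding and the per-step line diff described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not difftr's actual code. It assumes steps are `<step>` elements under the root, identified by their `<name>` child. The helper names `load_steps` and `classify` and the sample documents `OLD` and `NEW` are hypothetical.

```python
import difflib
import xml.etree.ElementTree as ET


def load_steps(ktr_xml):
    """Return {step name: serialized XML lines} for one KTR document.

    Assumption (for illustration): steps are <step> children of the root
    element and are identified by their <name> child.
    """
    root = ET.fromstring(ktr_xml)
    steps = {}
    for step in root.findall("step"):
        name = step.findtext("name", default="")
        steps[name] = ET.tostring(step, encoding="unicode").strip().splitlines()
    return steps


def classify(old, new):
    """Label every step name as same, changed, deleted, or added."""
    status = {}
    for name in old.keys() | new.keys():
        if name not in new:
            status[name] = "deleted"  # red box
        elif name not in old:
            status[name] = "added"    # green box
        elif old[name] == new[name]:
            status[name] = "same"     # white box
        else:
            status[name] = "changed"  # yellow box
    return status


# Hypothetical sample documents; the step order differs between the two.
OLD = """<transformation>
<step>
<name>input</name>
<sql>select
  col_a
from tab</sql>
</step>
<step>
<name>output</name>
</step>
<step>
<name>old_filter</name>
</step>
</transformation>"""

NEW = """<transformation>
<step>
<name>output</name>
</step>
<step>
<name>input</name>
<sql>select
  col_a,
  col_b
from tab</sql>
</step>
<step>
<name>const</name>
</step>
</transformation>"""

old_steps, new_steps = load_steps(OLD), load_steps(NEW)
status = classify(old_steps, new_steps)
for name in sorted(status):
    print(name, status[name])

# Line-based diff of one changed step, like the window opened from a yellow box.
diff_lines = list(difflib.unified_diff(
    old_steps["input"], new_steps["input"],
    fromfile="old/input", tofile="new/input", lineterm="",
))
print("\n".join(diff_lines))
```

Matching steps by name means a renamed step shows up as one deleted step plus one added step. That is a simplification of this sketch, and a real tool may handle renames differently.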