Simple ETL - solving problems the easy way

What Info Rhino Limited does with ETL

Info Rhino Limited is my limited company. The vast majority of our business is onsite for clients, and it is always to do with data. It could be; building a new database, migrating systems to a new platform, moving data between data silos, and often building data warehouses with intelligent front end reporting.

One of the biggest challenges, is ETL. These days, developers can use sophisticated ETL solutions such as Talend, Informatica and SSIS. Our preferred tool is SSIS - it has fantastic .Net integration, looks nice, and is highly scalable. 

The key challenge with ETL solutions is they are very manual - SSIS has this concept of BIML - which is a kind of markup language allowing packages to be created via code. However, by going down this route, you are tying yourself to a specific ETL platform when all you want to do is to get some data into a database.

ETL means Extract, Transform and Load

The challenges I face when loading data

For my property platform, I am sourcing A LOT of data. The other challenge I face is, I don't know which data I will end up using. For example, I want to include health statistics data, but may or may not end up with a good dataset. As a data analyst, I want data in a database, and fairly easily viewable so I can analyse it further. SQL still remains an awesome query language for set based data. I don't want to spend long periods of times creating ETL packages, just to load data into a database. 

Without question, if I needed regular feeds of large data volumes, I would be using an ETL solution for loading data.

The golden rules of ETL

  • Always use an ETL solution for loading data. Never reinvent the wheel.
  • Never write custom applications to load data. Never reinvent the wheel.

Rules are to broken aren't they?

For example, I have so far downloaded, NAPTAN data, Police and crime data, Train station data, some property data. I am faced with hundreds of files, which may or may not fit specific formats. 

Most data is in CSV or XML Format. So, I wrote two applications which can dynamically load data into database schemas. The applications will never compete with hand crafted ETL solutions providers but I can happily point my applications at a folder and let it run through files loading them into a database.

The two main applications and how they work

IRXMLLoader

Well formed XML, capable of being loaded into a .Net Dataset will contain one or more data tables. With a little bit of .Net code and data type mapping, this app does the following;

  • Creates the database tables if they don't exist.
  • Detects column data types.
  • loads files into those tables.

One issue, is performance of detecting and creating tables and so we allow for additional options on how many files to try creating a schema for.

Practical application use of the IRXMLLoader

  1. We would download hundreds of XML files.
  2. Configure the loader to; the target database, the target schema, the source folder, the log file output.
  3. Run the application. If all is good, it will start importing data matching data to tables.

IRCSVLoader

Very similar to the XML Loader but with a few key differences. Whereas, with XML and the .Net libraries used to read XML, we can't necessarily know the data types. A typical use case, is we may download a number of CSV files but not necessarily know their structure other than the delimiter type.

The application detects the likely column data type and loads in the data to a common schema in a key value pair format.

Once in the database, a little jiggery pokery with SQL allows us to create tables for matching common column structures and to then import that data into those tables.

Where can we get these applications?

Right now, I am rebranding my website and working on my property platform. The eventual goal is to sell these applications on my website www.inforhino.co.uk .

Who would benefit from using these applications?

SME's and enterprises

Any SME who requires data inside a SQL Server Database without wanting to spend significant periods of time writing ETL packages. Most of my clients too. When data has millions or potentially billions of lines of data - more traditional ETL would be appropriate.

Data professionals, Data Scientists and Business Intelligence

You want to focus on the data rather than the ETL, these tools will help with that.

 

Add comment