I'm currently working with a lot of Extract, Transform, and Load (ETL) type problems at my job. Someone has data they need to get from point A to point B and maybe have some processing done in between. That being said, the input data in this equation can take many forms. Sometimes those forms can take on a state that crashes an application and possibly halts the ETL processing altogether. In this series, I'd like to talk about some defensive programming best practices on how to prepare your application for non-ideal situations that very often take place in the real world of ETL processing.
String types are pretty straightforward on ingestion. Most files and sources that we read as input or write as output can be cast into Strings directly. Most issues I've seen come in processing Strings, which I'll talk more about in an upcoming section.
Recently I was trying to debug a process written in C#. It takes data from an Excel file and transforms it into a fixed-width text file with transformed data. The data coming in can be of a couple different types, but must be output as a string to be written to the text file for a third party vendor to digest. The application was running into a "gotcha" that comes up in .NET around transforming data into a String.
The Object is the ultimate base class used by classes in .NET. This means we could be returning ANY class from the
getValueFromField() function and try to cast it into a string. If a type other than System.String is retrieved, you might get an error such as this one:
System.InvalidCastException: Unable to cast object of type 'System.Double' to type 'System.String'.
.NET has a
.ToString() method that is much safer to use than a type cast. The method is implemented on many classes, ensuring that you're much more likely to guarantee that you can produce a String type. Below was what we introduced to allow for processing to continue. Note the null-conditional operator. You can read more on that here: https://docs.microsoft.com/en-us/dotnet/csharp/language-reference/operators/member-access-operators#null-conditional-operators--and-
It looks like the data type for that incoming field had changed to being interpreted as a Double at some point. However, we still want to write that field to the output text file. Making this change allowed the file to be processed without error. While the program runs, ideally we should leverage the strong typing in C# and use a new method like
getDoubleFromField() to make sure we aren't passing our logic a generic Object class, which was the root cause of the issue. The reason we did not, is because there is no processing that really requires the System.Double type for this field. Simply going from a string field on the Excel sheet to a String in the output file was the most defensive option.
There's so much more to talk about on this topic, so I'll be updating it as I write more. Stay tuned for my upcoming sections on parsing String data. If you liked what you read, let me know! Feel free to reach out to me at firstname.lastname@example.org. Happy coding!