Load Big Excel Files


Just as an introduction, when loading excel files we can use one of two methods that, in simple terms, can be explained in the following way:


    - DOM - All the structure and content of the file is loaded at once into memory;


    - SAX - Use a stream to sequentially load elements as they are being used.


The problem with the DOM approach (used by the Excel To RecordList widget) is that, by loading every piece of data and structure into memory we may end up with a big RAM consumption when handling big Excel files.


In the cases where we are handling big files, the SAX approach may be a good option in order to avoid that excessive RAM consumption.


Microsoft provides an API called Open XML SDK that can be used to read excel files written in OpenXML format (.xlsx). Take in consideration that OpenXML is the standard format for excel files since Office 2007, so this API can only be used in files saved in these versions.


The Open XML SDK allows both DOM and SAX reading implementations and can be used as standalone dependency, not requiring any component installation in the frontends, allowing it's use simple use in cloud environments.


Also keep in mind that since we are working with a Windows DLL you will only be able to use it in installations using the .NET stack.


By using the SAX method with the Open XML SDK we are able to reduce the RAM consumption, increasing server’s availability and therefore allow the scalability of the solution;


Depending on the Platform version you are working with you may want to use a different version of the OpenXML SDK:


    - Outsystems v8 - OpenXML SDK 2.0 (http://www.microsoft.com/en-us/download/details.aspx?id=5124)

    - Platform 9 - OpenXML SDK 2.5 (http://www.microsoft.com/en-us/download/details.aspx?id=30425)


A complete set of documentation and examples on the usage of the SDK can be find at https://msdn.microsoft.com/en-us/library/office/bb448854.aspx.


As additional information regarding the usage of the SAX and the Open XML SDK API you can refer to the following links:







Regarding the methods to obtain the real value of each cell, there are several specifics regarding the way the Excel files are saved so my recommendation is for you to check documentation regarding SharedStringPart, NumberingFormats, CellFormats, etc.


Here are some resources you can use to put you in the correct direction:


    - NumberingFormats/CellFormats: http://stackoverflow.com/questions/19034805/how-to-distinguish-inline-numbers-from-ole-automation-date-numbers-in-openxml-sp/19582685#19582685;

    - SharedStringPart: http://www.sqlscientist.com/2014/04/get-column-header-list-from-excel-using.html.


A possible way of applying this method is to use the OpenXML SDK somehow in the way explained before and insert the excel data into a database entity, which allows a simpler use inside an Outsystems eSpace.


Hope this is useful.




Carlos Sousa.

Hi Carlos!  Can this be used to eliminate errors caused by large Excel loads?  For instance, I have an issue if my excel file is over 4000 rows OS breaks.  So I have to break up Excel files into 3999 row blocks...not fun.
Taking in consideration the amount of rows you're talking about I don't think the problem is related to the RAM consumption, unless of course, you've a huge number of columns or a massive amount of data in some fields.

You probably are hitting someother problem and my sugestion would be for you to perfectly understand what's going on, since by using the built-in methods will be able to develop the application much faster.

This method and the information in the post targets more extreme scenarios

In some previous usages I've done of the Open XML SDK I was working with the max amount of lines allowed in a xlsx file (1,048,576) so 4000 will not be a problem for sure.