Mathias Brandewinder on .NET, F#, VSTO and Excel development, and quantitative analysis / machine learning.
by Mathias 13. March 2011 17:52

I had heard good things about OpenXML, but until now I didn’t have time to give it a try. After attending a rather intimate session on the topic at the MVP Summit, I realized I should look into it. For those of you like me who haven’t kept up with the news, the general idea is that, since the release of Office 2007, Office files are no longer saved as obscure proprietary files: they are essentially zipped xml files. If you rename an Excel file from MyFile.xlsx to MyFile.zip and open it, you will see that it is simply a collection of xml files, describing the various parts of your Workbook and their relationships. This has a few interesting implications, one of them being that you can create or edit an Excel file without using Excel, or even having Excel installed on your machine.

The OpenXML SDK is a free library which provides strongly typed .NET classes to manipulate these files without having to deal with raw XML, and are LINQ-friendly, which is awesome.

One scenario where this comes very handy is if you have some form of a .NET application which needs to read input data from an Excel file; another interesting case is a .NET application which needs to produce some Office outputs for the user. Rather than launch an instance of the Office application and use the COM Interop, you can perform all these tasks safely in .NET, without having to worry about cleanly closing the application.

In line with the first scenario, my initial goal was to see if I could read the contents of an Excel Workbook with a console app. Rather than going into lengthy explanations, here is the code I ended up with, which borrows heavily from the samples provided with the SDK:

namespace OpenXmlApp
{
   using System;
   using System.Collections.Generic;
   using System.Linq;
   using DocumentFormat.OpenXml;
   using DocumentFormat.OpenXml.Packaging;
   using DocumentFormat.OpenXml.Spreadsheet;

   public static class Program
   {
      private static void Main(string[] args)
      {
         var filePath = @"C:/Tests/protectedFile.xlsx";
         using (var document = SpreadsheetDocument.Open(filePath, false))
         {
            var workbookPart = document.WorkbookPart;
            var workbook = workbookPart.Workbook;

            var sheets = workbook.Descendants<Sheet>();
            foreach (var sheet in sheets)
            {
               var worksheetPart = (WorksheetPart)workbookPart.GetPartById(sheet.Id);
               var sharedStringPart = workbookPart.SharedStringTablePart;
               var values = sharedStringPart.SharedStringTable.Elements<SharedStringItem>().ToArray();
               
               var cells = worksheetPart.Worksheet.Descendants<Cell>();
               foreach (var cell in cells)
               {
                  Console.WriteLine(cell.CellReference);
                  // The cells contains a string input that is not a formula
                  if (cell.DataType != null && cell.DataType.Value == CellValues.SharedString)
                  {
                     var index = int.Parse(cell.CellValue.Text);
                     var value = values[index].InnerText;
                     Console.WriteLine(value);
                  }
                  else
                  {
                     Console.WriteLine(cell.CellValue.Text);
                  }

                  if (cell.CellFormula != null)
                  {
                     Console.WriteLine(cell.CellFormula.Text);                    
                  }
               }
            }
         }

         Console.ReadLine();
      }
   }
}

A few comments:

  • I am opening the document as read-only, setting the second argument to false.
  • workbook.Descendants<Sheet>() returns an IEnumerable<Sheet>, which means that you can now query it using Linq if you please.
  • I am still wrapping my head around the organization of elements. Coming from “classic” Excel, I expect to be able to navigate down directly from a Workbook into its Worksheets; here, the Sheet contained in the Workbook is merely a key which indicates what sheets exist, and what Id to use when requesting them. Navigating between the parts of the file will take a bit of getting used to.
  • I love the fact that you can directly iterate over the Cells of a Worksheet. The cells variable above retrieves only cells that have some content, and nothing more. No need to read cells into 2-d arrays and iterating over all of them.
  • On the other hand, I found the organization of the cells content a bit disorienting at first. Interestingly, cells that contain strings that are not formulas do not store the value in the cell element itself. They are stored in a SharedStringTable, and the cell contains an index, in Cell.CellValue.Text, which indicates which element of that table it contains. This seems to be true only for strings that are not formulas, however: if the cell contains a formula, or some non-string type, then the content is stored in CellValue.Text, and there is no record in SharedStringTable. I am sure this will make sense to me some day.
  • I am interested to see how easy or painful it is to work with Cells addressed by their index (as in, Cells[3,2] ). This is fairly straightforward using the Interop, but from what I have seen so far, I expect it will be a bit more involved here, because that’s just not how the data is organized.

In short, I found the SDK pleasant to install and use so far (and well documented), and I can definitely see scenarios where I will be using it in the future. On the other hand, I suspect I will end up writing quite a few helper methods to make it more usable – probably trying to make it look closer to the classic Interop. I suspect also that it will turn out to be better suited for applications like Word and PowerPoint, because of the more hierarchical nature of their content.

Comments

3/13/2011 6:07:07 PM #

trackback

Codebix.com - Your post is on Codebix.com

This post has been featured on Codebix.com. The place to find latest articles on programming. Click on the url to reach your post's page.

Codebix.com | Reply

2/3/2012 1:42:49 AM #

Joe

This is the first post I found that showed code that I could understand and use to get cell values.
Do you have any code that shows how to update cell values/
Thanks,
Joe

Joe United States | Reply

2/3/2012 1:45:18 AM #

Joe

Also can you point me the documentatio you used.
Thanks,
Joe

Joe United States | Reply

2/6/2012 8:52:17 PM #

Mathias

Joe,
glad this helped! Regarding documentation, when you download the OpenXML SDK, there are a bunch of samples and examples which I found very useful, that's what got me started.
Mathias

Mathias United States | Reply

3/12/2012 10:57:30 PM #

Ranjan

Really very very helpful article. I found many articles but could not get success to get cell value. But your code saved my life.

Thanks a lot dear.

Ranjan

Ranjan India | Reply

10/3/2012 6:51:45 AM #

Henry

Nice contribution.  Thanks.  Now I need to figure out how to update these cells.  

Henry United States | Reply

11/18/2012 3:23:33 AM #

sudipta ghosh

Very useful post!!! Thanks.

Looking for more similar type of post  having the following:

1.Write to excel fastest way.
2.Write excel having macros-enable(having buttons)
3.Read and write on the same excel either data value or style index
4.example on style format.

sudipta ghosh India | Reply

8/2/2013 3:16:20 PM #

pingback

Pingback from hirendhara.biz

What do you need to do to get Excel to read the styles in files created by the OpenXML SDK library? | Q Sites

hirendhara.biz | Reply

Add comment




  Country flag

biuquote
  • Comment
  • Preview
Loading



Comments

Comment RSS