Mathias Brandewinder on .NET, F#, VSTO and Excel development, and quantitative analysis / machine learning.
by Mathias 13. March 2011 17:52

I had heard good things about OpenXML, but until now I didn’t have time to give it a try. After attending a rather intimate session on the topic at the MVP Summit, I realized I should look into it. For those of you like me who haven’t kept up with the news, the general idea is that, since the release of Office 2007, Office files are no longer saved as obscure proprietary files: they are essentially zipped xml files. If you rename an Excel file from MyFile.xlsx to MyFile.zip and open it, you will see that it is simply a collection of xml files, describing the various parts of your Workbook and their relationships. This has a few interesting implications, one of them being that you can create or edit an Excel file without using Excel, or even having Excel installed on your machine.

The OpenXML SDK is a free library which provides strongly typed .NET classes to manipulate these files without having to deal with raw XML, and are LINQ-friendly, which is awesome.

One scenario where this comes very handy is if you have some form of a .NET application which needs to read input data from an Excel file; another interesting case is a .NET application which needs to produce some Office outputs for the user. Rather than launch an instance of the Office application and use the COM Interop, you can perform all these tasks safely in .NET, without having to worry about cleanly closing the application.

In line with the first scenario, my initial goal was to see if I could read the contents of an Excel Workbook with a console app. Rather than going into lengthy explanations, here is the code I ended up with, which borrows heavily from the samples provided with the SDK:

namespace OpenXmlApp
{
   using System;
   using System.Collections.Generic;
   using System.Linq;
   using DocumentFormat.OpenXml;
   using DocumentFormat.OpenXml.Packaging;
   using DocumentFormat.OpenXml.Spreadsheet;

   public static class Program
   {
      private static void Main(string[] args)
      {
         var filePath = @"C:/Tests/protectedFile.xlsx";
         using (var document = SpreadsheetDocument.Open(filePath, false))
         {
            var workbookPart = document.WorkbookPart;
            var workbook = workbookPart.Workbook;

            var sheets = workbook.Descendants<Sheet>();
            foreach (var sheet in sheets)
            {
               var worksheetPart = (WorksheetPart)workbookPart.GetPartById(sheet.Id);
               var sharedStringPart = workbookPart.SharedStringTablePart;
               var values = sharedStringPart.SharedStringTable.Elements<SharedStringItem>().ToArray();
               
               var cells = worksheetPart.Worksheet.Descendants<Cell>();
               foreach (var cell in cells)
               {
                  Console.WriteLine(cell.CellReference);
                  // The cells contains a string input that is not a formula
                  if (cell.DataType != null && cell.DataType.Value == CellValues.SharedString)
                  {
                     var index = int.Parse(cell.CellValue.Text);
                     var value = values[index].InnerText;
                     Console.WriteLine(value);
                  }
                  else
                  {
                     Console.WriteLine(cell.CellValue.Text);
                  }

                  if (cell.CellFormula != null)
                  {
                     Console.WriteLine(cell.CellFormula.Text);                    
                  }
               }
            }
         }

         Console.ReadLine();
      }
   }
}

A few comments:

  • I am opening the document as read-only, setting the second argument to false.
  • workbook.Descendants<Sheet>() returns an IEnumerable<Sheet>, which means that you can now query it using Linq if you please.
  • I am still wrapping my head around the organization of elements. Coming from “classic” Excel, I expect to be able to navigate down directly from a Workbook into its Worksheets; here, the Sheet contained in the Workbook is merely a key which indicates what sheets exist, and what Id to use when requesting them. Navigating between the parts of the file will take a bit of getting used to.
  • I love the fact that you can directly iterate over the Cells of a Worksheet. The cells variable above retrieves only cells that have some content, and nothing more. No need to read cells into 2-d arrays and iterating over all of them.
  • On the other hand, I found the organization of the cells content a bit disorienting at first. Interestingly, cells that contain strings that are not formulas do not store the value in the cell element itself. They are stored in a SharedStringTable, and the cell contains an index, in Cell.CellValue.Text, which indicates which element of that table it contains. This seems to be true only for strings that are not formulas, however: if the cell contains a formula, or some non-string type, then the content is stored in CellValue.Text, and there is no record in SharedStringTable. I am sure this will make sense to me some day.
  • I am interested to see how easy or painful it is to work with Cells addressed by their index (as in, Cells[3,2] ). This is fairly straightforward using the Interop, but from what I have seen so far, I expect it will be a bit more involved here, because that’s just not how the data is organized.

In short, I found the SDK pleasant to install and use so far (and well documented), and I can definitely see scenarios where I will be using it in the future. On the other hand, I suspect I will end up writing quite a few helper methods to make it more usable – probably trying to make it look closer to the classic Interop. I suspect also that it will turn out to be better suited for applications like Word and PowerPoint, because of the more hierarchical nature of their content.

by Mathias 7. September 2010 16:56

In my previous posts, I explored how to identify text in a PowerPoint slide and use the Google .NET API to translate it; let’s put it all together in a simple VSTO add-in for PowerPoint 2007.

The translation functionality is displayed in a Custom Task Pane, where you can pick the language of origin, and the language to translate to. I used the same general design I presented in my Excel add-in tutorial, using WPF controls in the task pane with the MVVM pattern, leveraging the small yet very useful MVVM foundation framework. When running, this is how the add-in looks like:

TranslatorStart

TranslatorEnd

I added only 3 languages in there, but it is fairly easy to modify the code and get it to work with any language pair supported by Google translate.

Download code sample

by Mathias 2. September 2010 17:04

The beauty of working with a framework like .NET is that when you have a problem, chances are, someone else did before you, and might have even resolved it for you. In our last post, I explained how to find text in a PowerPoint slide using C#. My goal was to translate it using the Google Translate service, and I intended to write my own code to call the web service and retrieve the translation. Turns out, there is a .NET API for Google Translate, ready to use, which does it for you already (of course, I found that out already after rolling out my own code, which wasn’t nearly as good).

Building up where we left off, I quickly wrote this class, which will translate the slide currently in view; I think the code is self-explanatory: get the slide, create a Google translator, pass it the language of origin and of destination, and translate every chunk of text you find! The only thing I needed to do was to download GoogleTranslateAPI_0.3.1, add the dll to the project as a reference.

namespace ClearLines.PowerPointTranslator
{
   using Google.API.Translate;
   using PowerPoint = Microsoft.Office.Interop.PowerPoint;

   public class SlideTranslator
   {
      public static void TranslateSlide(Language from, Language to)
      {
         var googleTranslator = new TranslateClient("http://www.clear-lines.com");

         var powerpoint = Globals.ThisAddIn.Application;
         var slide = (PowerPoint.Slide)powerpoint.ActiveWindow.View.Slide;

         foreach (PowerPoint.Shape shape in slide.Shapes)
         {
            if (shape.HasTextFrame == Microsoft.Office.Core.MsoTriState.msoTrue)
            {
               var textFrame = shape.TextFrame;
               var textRange = textFrame.TextRange;
               TranslateTextRange(textRange, googleTranslator, from, to);
            }
         }
      }

      public static void TranslateTextRange(
         PowerPoint.TextRange textRange, 
         TranslateClient translator, 
         Language from,
         Language to)
      {
         var paragraphs = textRange.Paragraphs(-1, -1);
         foreach (PowerPoint.TextRange paragraph in paragraphs)
         {
            var text = paragraph.Text;
            text = text.Replace("\r", "");
            paragraph.Text = translator.Translate(text, from, to);
         }
      }
   }
}

Later this week, after I do some code scrubbing, I’ll post the entire VSTO solution. Until then, have fun – it’s almost the week-end!

by Mathias 21. August 2010 16:38

In a previous post, we saw how to programmatically search for text in a PowerPoint slide, by iterating over the Shapes contained in a slide, finding the ones that have a TextFrame, and accessing their TextRange property. TextRange exposes a Text property, which “represents the text contained in the specified object”.

Our goal is to translate a slide from a language to another, which means translating every chunk of text we find. However, the Text property contains a bit more than just text. Suppose you were working with a slide like the one below, which contains multiple bullet points, with various indentations:

DaftPunkSlide 

If you inspect the Text for the content area, you’ll see that it looks like this:

Work It\rMake It\rDo It\rMakes Us\rHarder\rBetter\rFaster\rStronger

At the end of each bullet point, we have a \r, which indicates a line break. If we want to maintain the formatting of our slide when we translate it, we’ll have to deal with it.

We’ll worry about the actual  translation later – for the moment we will use a fake method, which will show us what chunk of text has been translated:

public static string Translate(string text)
{
   return "Translated [" + text + "]";
}

A crude approach

A first approach would be to simply take the entire Text we find in the TextRange, manually separate it into chunks by splitting it around the carriage return character, translating the chunk, and re-composing the text, re-inserting the carriage returns.

Starting where we left off last time, let’s loop over the Shapes in the slide:

private void TranslateSlide()
{
   var powerpoint = Globals.ThisAddIn.Application;
   var presentation = powerpoint.ActivePresentation;
   var slide = (PowerPoint.Slide)powerpoint.ActiveWindow.View.Slide;
   foreach (PowerPoint.Shape shape in slide.Shapes)
   {
      if (shape.HasTextFrame == Microsoft.Office.Core.MsoTriState.msoTrue)
      {
         var textFrame = shape.TextFrame;
         var textRange = textFrame.TextRange;
         var text = textRange.Text;
         textRange.Text = CrudeApproach(text);
      }
   }
}

More...

by Mathias 8. August 2010 17:30

I am currently on a project which involves creating a PowerPoint VSTO add-in. I have very limited experience with PowerPoint automation, so before committing to the project, I thought it would be a good idea to explore a bit the object model, to gauge how difficult things could get, and I set to write a small PowerPoint add-in which would automatically translate slides. Sounds like a simple enough project, how difficult could it be?

Turns out, not too difficult, but not completely trivial either. I discovered quickly that the PowerPoint object model, unlike most Office applications, doesn’t have much (any?) documentation for the .Net developer; the best I found is the VBA PowerPoint 2007 developer reference, which gives a decent starting point to figure out what the objects are about. So I thought I would share my exploration of the PowerPoint jungle, and hopefully spare some trouble to other .Net developers.

The plan

The objective is simple: write an add-in which allows the user to

  • select a language to translate from, and a language to translate to,
  • create a duplicate of the current slide, translating all the text and keeping the layout

The plan will be to use Google Translate to perform the translation. In order to do that, we will nedd to extract out all pieces of text that require translating.

Finding all the text in a slide

Lets’ start by identifying where we have text in the current slide. Let’s first create a PowerPoint 2007 Add-in project in Visual Studio. To keep things simple for now, we will add a Ribbon control with a button, and when that button is clicked, we’ll start working on the current slide:

RibbonWithButton

Double-click on the Button (I renamed my button translateButton) to generate an event handler for the Click event, and get the current Slide:

private void translateButton_Click(object sender, RibbonControlEventArgs e)
{
   var powerpoint = Globals.ThisAddIn.Application;
   if (powerpoint.ActivePresentation.Slides.Count > 0)
   {
      var slide = (PowerPoint.Slide)powerpoint.ActiveWindow.View.Slide as PowerPoint.Slide;
   }
}

More...

Comments

Comment RSS