"Perhaps you've noticed the new Microsoft Office formats" is something you'd say back in 2007. Nevertheless, I'd bet that a good portion of people haven't looked into what that new "x" is at the end of all default files (.docx, .xlsx, .pptx) and an equal number of people probably don't care. However, I'd like to try and convince these doubters just how great this is! This "x" stands for XML (extensible markup language) and makes the file formats seem hipper than their x-less counterparts. Rather than documents and spreadsheets being stored as binary files, which are difficult to read, all information is stored in easily read zipped-xml files. That's right - all .abcx files are just zip files.
Let's look at an example Excel file (Ex-Book.xlsx):
It has a little bit of everything: values, text, color formats, comments, and formulas. If we wish to look at how the data are stored, we can easily change the extension of the file from .xlsx to .zip.
Naturally, we already have a way to take care of zip files in Stata with the command unzipfile. So from this point on, we're going to navigate this file strictly using Stata. My Excel workbook is located on my desktop, so let's unzip it.
cap mkdir unzipbook cd unzipbook unzipfile ../Ex-Book.xlsx, replace dir xl/
Note that we didn't have to change the extension, and we now have a list of files and directories under the "xl" folder. Files and properties of interest:
These are good reference files, but our main interest is located within the "worksheets" directory under our "xl" folder. Here, we can find each sheet named sheet1, sheet2, sheet3, etc. The sheets are indexed according to the workbook.xml files (ex: "rId1" corresponds to sheet1). I've always had trouble getting formulas out of Excel. Instead of resorting to VBA, we can now use Stata to catalog this information (I've found this very useful for auditing purpose). Here's a snippet of our worksheets/sheet1.xml file:
Interpreting this section, we can see that it contains rows 2-4 which corresponds to our strings (A, B, C), our styles (Red, Orange, Yellow), and our formulas. Going from top to bottom, by deduction we can see that our first <c> or cell class contains a value of 3 and references cell D2 with the style of 7. Here, since the type is a string ("s"), we need to reference our sharedStrings.xml file to see what the fourth string is. We look for the fourth string, since the count initializes at zero rather than looking at the third string. Our fourth observation is a "D" and our fifth observation is an "E" in the shareStrings.xml file.
The third row contains our colored cells without any data. Therefore, they have no values, but contain styles corresponding to the fill colors. A quick note: the order here goes (3, 4, 2, 5, 6) because I accidentally chose yellow first. Finally, the most valuable part of this exercise (for me at least) are the formulas. Cell A4 contains the formula "=A1&A2" and displays the evaluated value of the formula. When linking to other workbooks, the workbook path will likely be another reference such as [1], [2], [3], etc. A quick use of some nice Stata commands below yields a simple code for creating an index of your workbook's information. tempfile f1 f2 filefilter xl/worksheets/sheet1.xml `f1', from("<c") to("\r\n<c") filefilter `f1' `f2', from("</c>") to("</c>\r\n") import delimited `f2', delimit("@@@", asstring) clear keep if regexm(v1, "([^<>]+)") gen value = regexs(1) if regexm(v1, "v>([^<>]+)") replace formula = subinstr(formula, "&", "&", .) keep cell formula value
Thanks to that little x, we are able to easily obtain meta-information about our Excel files, presentations, or even Word documents. Think about having to look through thousands of documents to find a specific comment or reference - seems like a simple loop would suffice. Maybe that x is pretty cool after all.
2 Comments
With the release of Stata 14.1, Stata made a good amount of changes to one of my favorite commands: putexcel – I have mixed feelings about this.
First, I want to take the opportunity to stress version control. Always begin your do-files with the version command. If you don’t, you’ll likely get the error r(101) “using not allowed” when trying to implement your old putexcel commands in a do-file. A quick note about version: you must run this command with the putexcel statement. Just having it at the top of your do-file and calling it doesn’t work, unless you run the do-file all the way through. Think of using version control like using a local macro. What I like:
Putexcel has been simplified…
…but it requires changing the way you think about putexcel. Think of the command in a more modular structure.
Example: putexcel set MyBook.xlsx, sheet("MyData") putexcel A1=("Title of Data") A2=("Subtitle") A3=formula("SUM(A4:A20)"), bold putexcel A3:A20, nformat(number_sep) hcenter
.
The updated options simplify your code a bit: No more bold(on), rather just bold or nobold and there’s no need for that modify command. Update: as of the December 21st update, modify is an option again... Advanced number formats may be used: No more restrictions, the advanced number format can now be used and can be found under putexcel advanced – which makes sense since it was able to be used already in the Mata xl() object. putexcel set MyBook.xlsx, sheet("MyData") putexcel A3, nformat(#.000) nobold border(all) What I don't like:
Using not allowed! I liked the using statement within putexcel, especially when I needed to add something as simple as a title on an excel spreadsheet. Now, I have to use two lines of code instead of one. This is a minor complaint compared to the next section.
The new syntax – it’s very difficult to work with large files, which is why I’ll be sticking with my “version 14.0” command when writing putexcel statements to large excel files. Why? I cannot find a simple way to write different format-types in one statement (say bold font in cell A1 and a border around cell B2). I have to do this in two separate commands. This is a big problem when you’re working with large excel files because Stata keeps opening the file, saving it, and closing the file for EVERY putexcel statement. We could waste minutes writing borders and adding different colors. Granted, Excel files shouldn’t be 40MB but we have them; and management wants to keep them that way, putting the Stata programmer between a rock and a hard place. Check out my previous post on “block” putexcel for working with very large excel files. What I hope happens:
We get the best of both worlds – the putexcel set keeps the Excel file open, so that it doesn’t continuously save file after file but the new format stays. If we have a putexcel set, why not putexcel close? Oh, and using is added back as an optional/hidden feature of putexcel. That, my friends, would be my ideal putexcel command: not too hot, not too cold; simple, yet powerful.
Coercion. It’s the topic of the day, and a short topic at that. Coercion is the leading driver for Excel usage for most Stata users. Workbook linkage is usually a pain – the practice of linking one cell from a workbook to a different cell in another workbook – and can turn a relatively simple task into a nightmare usually ending the job with a headache and a stiff drink.
However, with a few simple Mata commands, the crisscross of equal signs, dollar signs, and formulas can turn a repetitive task into a few simple copy-and-paste commands. There is also less room for error. For fun, we'll do this entire thing in Mata. First, we need to decide how many observations/variables we need. Let’s look at our current workbook and target workbook.
So we need 5 columns and 7 rows, with columns from D to H and rows from 3 to 9 in our BookB set. Write it:
mata: st_addobs(7) st_addvar("str255", "v" :+ strofreal(1..5))
Starting with a blank dataset, add seven observations. From there, we create 5 variables (v1 v2 v3 v4 v5) that are all of type str255. This is a placeholder value for long equations. The second part of this command “v” :+ strofreal(1..5) is my favorite part about Mata. The colon operator (:+) is an element-wise operator that is super helpful. No more Matrix conformability errors (well maybe a few). It creates the matrix: ("v1", "v2", "v3", "v4", "v5") without much work and allows for any number of variables! Sidenote – changing (1..5) to (1..500000) took about 3 seconds to create 500,000 v#’s. Next, let’s set up our workbook name: bookname = "'C:\Users\wmmatsuo\Desktop\[BookB.xlsx]Linked'!" Specify the location of the workbook with the file name in square brackets and enclose the sheet name in apostrophes. End that expression with an exclamation point! Since we know we want D through H, and cells 3-9, we can write the following: cols = numtobase26(4..8) rows = strofreal(3::9) numtobase26() turns 4 through 8 into D through H. For more information see my previous post on its inverse. The rows variable is a column vector containing values 3 through 9. Now it’s time to build our expression: expr = "=" :+ bookname :+ "$" :+ J(7, 1, cols) :+ "$" :+ rows We put an equals sign in front of everything so that Excel will know we’re calling an equation. Follow that with the workbook name. If you didn’t know, dollar signs in excel will lock your formulas so they don’t move when you copy formulas from one cell to another. We’ll put those in as well. The next part, J(7, 1, cols), takes our column letters and essentially repeats it seven rows down. Thanks to the colon operator, we just have to add another dollar sign to all elements of the matrix and our row numbers. Since the new cols matrix contains seven rows, and our row vector contains seven rows, it knows to repeat the row values for all columns. Let’s just put that matrix back into Stata and compress our variables. st_sstore(., ("v" :+ strofreal(1..5)), expr) stata("compress") end And voilà, we have a set of equations ready to copy into Excel. One quick tip: set your copy preferences so that when you copy the formulas, you won’t also copy variable names by going to edit->preferences->data editor-> uncheck “Include variable names on copy to clipboard”. Now just copy the equations directly from your Stata browse window into Excel, and enjoy those sweet results! In other news, this post built upon the putexcel linking discoveries found over at belenchavez.com - who recently had her Fitbit stolen! Follow her journey to get it back on her website or on Twitter - hopefully she'll be able to coerce the thief into returning the device.
Check it: Excel Art.
Stata 14 – a cause for fanfare – expanded Stata’s abilities to format excel files. Sounds exciting right? You’re damn right. Excel files no longer have to be pre-templated before the dreaded export excel messes up your number formatting. Putexcel used to alleviate this problem in Stata 13, but only for numeric matrices. Luckily for us those days are long lived but long gone, and putexcel evolved into a very useful command.
So what’s the first step in creating this captivating art? Finding an easily parsable file format of course. The EPS (Encapsulated PostScript) format does just that – take a look for yourself: EBEBEBEBEBEBEBEBEBEBEBEBEBEBEBEBEBEAEAEAEBEBEBEBEBEBEBEBEBEBEBEB EAEAEBEBEBEBEAEAEBEBEBEBEBEBEBEBEBEBEAEAEBEBEBEBEBEBEBEBEBEBEBEB EAE4E7E5E3DECFB292948D97A29D9F9D9FA9A6A49477726A5D4F484A4742484B 453B36302E2C302E2F3735332C2B32322C2D393C423E3738404950575E5F605B 545D655D5E5C5E6C6972767D888D9DA9B4B9BAC1B7ADA19995907A66646E6C7C 9099B7A99F9E9CADAB97838FB1CDE1E9ECEBEBEBEBEBEBEBEBEBEBEBEBEBEBEB EBEBEAEAEBEBEBEBEBEBEBEBEBEBEBEBEBEBEBEBEBEBEBEBEBEBEBEBEBEBEBEB EAEAEBEBEAEAEAEAEAEAEAEAEAEAEAEAEAEAEAEAEAEAEAEAEAEAEBEBEBEBEBEB There’s no binary, just text. And while it may look like the transcript of a sugared-out toddler, it contains a lot of good color information. The first thing to know is that all colors we deal with are related to light. The three additive primary colors are red, green, and blue. If you ever looked closely at an old TV screen (like I did as a young’un), you’d see these three distinct colors. In this case, each color uses 8 bits: 2*2*2*2*2*2*2*2 = 256 possible combinations per color. Each color then gets a number from 0-255. Zero means the color is off while 255 corresponds with a full-on color! So RED in RGB mode would look like this: “255 0 0”. Converting this number to base 16 yields “FF 00 00” or “FF0000” without spaces. Going backwards: “EAE4E7” is the same as “EA E4 E7” which converts to “234 228 231” in base 10. This is made easy with Mata’s frombase command. After a few tricks - such as finding the right order of the data stored - we’re able to convert the previous gibberish to pixel information/excel cell information. This is where putexcel gets good. The matrix of color information is passed to a Stata dataset so that it utilizes the fpattern cell expression of putexcel. An example Stata variable would look like this: V1 A1=fpattern(“solid”, “234 243 243”) A2=fpattern(“solid”, “234 243 243”) A3=fpattern(“solid”, “234 243 243”) Because we have a column of strings we can use the levelsof command to create a list containing each of these expressions and write them directly using the putexcel command. Now we’re ready to shake shake shake out that final command: foreach v of varlist * { levelsof(`v'), local(`v') clean local cellexp = "`v' `cellexp'" } local cellexp = subinstr("`cellexp'", " ", "' ", .) local cellexp = subinstr("`cellexp'", "v", "`v", .) putexcel `cellexp' using "Art.xlsx", sheet("LoveStory") replace The trick here is using the option "clean" in our levelsof command to strip all the double quotes so that we can use it directly in our putexcel statement. While we could have included the putexcel statement within the loop, the advantage here is that, because we’re only calling the command once, we don’t have to continually open and close the excel file for each putexcel statement - making it run in seconds. Now we’re ready to find our Starbucks lovers, get in fights at 2:30 am, and never ever get back together because we just saved ourselves so much time! Can you spot all the Taylor Swift references? I count seven. |
AuthorWill Matsuoka is the creator of W=M/Stata - he likes creativity and simplicity, taking pictures of food, competition, and anything that can be analyzed. Archives
July 2016
Categories
All
|