DevCentral > Weblogs > Lori MacVittie - Two Different Socks
 Honey? Does this format make my data look fat?
posted on Wednesday, July 09, 2008 4:31 AM

CNet is reporting that Google is ditching XML for a faster, more compact alternative known as ProtocolBuffers. I'm going to type this post really fast before Don finds out and starts laughing at me because he's always had this thing against XML, claiming it was too bloated and slow.

Apparently Google, the 800-pound gorilla, is on Don's side of this argument, as it just blogged about its newest creation, ProtocolBuffers.

From CNet's blog post: "Google thought of using XML as a lingua franca to send messages between its different servers. But XML can be complicated to work with and, more significantly, creates large files that can slow application performance."

I disagree with the statement that it is XML that creates large files. No, no it's not. It's people that create large files in a data format, and that can happen regardless of whether it's binary or not. If you've ever worked in digital cartography or drafting, then you know what I'm talking about. AutoCAD files are huge, and they're binary. It's the application and the people designing the application combined with the amount of data that's being stored or transferred that determines whether a file will end up large or small. While binary is almost always more compact and more efficient than XML, it isn't always the case, nor is it inevitable that XML files will end up large and bloated.

Bad code can be just as inefficient and slow and bloated as inefficient use of a data format. I'm not saying Google's engineers have written bad code or that they are going to write bad code. In fact they probably won't, given their track record. But blaming poor performance on a data format is like blaming poor car performance on the car's frame. There are just too many other factors that go into application performance to single out a data format. Network conditions, server load, server platform, coding techniques, etc. can all impact the performance of an application positively and negatively.

While it's certainly likely that Google will see an improvement in performance by moving to its new data exchange format, it's going to be losing at the same time. It's losing the simple integration and interoperability that comes from a standards-based technology like XML. We've been moving away from EAI-like technology that requires coding and development to integrate applications since the advent of SOA, so it's surprising to see such a services-oriented organization like Google move back into the dark ages of integration with this decision. XML became the lingua-franca of integration because it's much easier to integrate into a meta-data driven architecture, which is really one of the foundational pillars of Web 2.0 and SOA.

I will admit that ProtocolBuffers are intriguing. Given the performance needs of an organization like Google, it may well be necessary for it to move away from XML, due at least in part to the performance of modern parsers, to something more processor-efficient, which ProtocolBuffers certainly sounds like. But it's the rare organization that needs that kind of speed and, for the most part, XML will continue to suit the majority of folks just fine.


Feedback


7/9/2008 7:20 AM
XML is a horrendous protocol for optimization.

Look at any protocol used at Layer 4 and below. They're all binary. When moving the amount of data that Google moves around, I'm sure ASCII looks awfully wasteful: they throw away one bit for every 8 they transmit.

Worse, XML can't be easily parsed by low-level languages. If you need to read a binary value, you read in a few bytes from your "Protocol Buffer", toss it into a variable, done. XML requires a parser - in many ways, it's the argument of, say, PHP vs. C: PHP needs another step to become bytecode.

This is very much like TCP in many ways: There's a prototype detailing the message format, much as a TCP header is laid out, and once you've got that information, and know what bytes represent what data, you don't have to parse each message anymore, you just read bytes x-y, and know that's your value. With XML, every request has to be put through another step, the XML parser, before it can be read. You need to parse the message format on initial open, not with each packet. This is almost certainly a huge win for them.

With the number of requests Google processes, and the amount of traffic they have to move, it doesn't surprise me at all that they'd like to avoid this additional processing step. It's a wise choice, and one that I'd expect from a company trying to optimize their internal infrastructure.

Ken Snider
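
To make the comparison in the comment above concrete, here is a minimal Python sketch of reading the same three values from a fixed binary layout and from an XML document. The record layout (`user_id`, `score`, `flags`) and the element names are hypothetical, chosen only for illustration. This shows the fixed-offset picture the comment describes; it is not the actual ProtocolBuffers wire format.

```python
import struct
import xml.etree.ElementTree as ET

# Hypothetical fixed layout: user_id (uint32), score (uint16), flags (uint16),
# little-endian, 8 bytes in total.
RECORD = struct.Struct("<IHH")


def read_binary(buf):
    """Read the three fields straight from known byte offsets."""
    return RECORD.unpack_from(buf, 0)


def read_xml(doc):
    """Parse the document first, then convert each text node to an int."""
    root = ET.fromstring(doc)
    return (
        int(root.findtext("user_id")),
        int(root.findtext("score")),
        int(root.findtext("flags")),
    )


binary_msg = RECORD.pack(42, 900, 3)
xml_msg = (
    "<record><user_id>42</user_id><score>900</score>"
    "<flags>3</flags></record>"
)

print(read_binary(binary_msg), len(binary_msg), "bytes")
print(read_xml(xml_msg), len(xml_msg), "bytes")
```

Both readers return the same values; the difference is that the XML path has to tokenize and build a tree for every message, while the binary path only needs the layout to be agreed once up front.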

7/9/2008 11:44 AM
Google isn't "moving back", since they never used XML in the first place. And they're not losing interop with any industry-standard tools since they don't want to use external code anyway - they want to write everything themselves. But the rest of the industry should pay attention to standards and interop, not what Google is doing.
Wes Felter

7/9/2008 12:10 PM
True, the industry should look to standards and interop, and as long as what Google is doing is internal only then all is good.

But when large vendors create proprietary formats/frameworks and then offer them up to the industry at large, they tend to be adopted.



Lori MacVittie

7/9/2008 1:14 PM
It's gonna be another rousing night in the MacVittie household!

You're wrong, and you know it. XML is fat by definition - tags are overhead, numbers-as-text are always at least as large as the source number (and then only if the number is single digit), XML is the ultimate personification of bloatware.

People can make inefficient communications protocols in binary, but you can't make XML efficient. If Google needs the performance boost and they've found a way to do it, all the more power to 'em. The fact that most companies don't need the performance boost has no bearing on Google's stance - they DO need it.

/Ducks

Don.
Don MacVittie
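
The size argument in the comment above can be checked with a small Python sketch. It compares one integer stored as a fixed 4-byte binary value, as bare digits, and wrapped in an XML element. The tag name `value` and the sample numbers are assumptions made for the example.

```python
import struct


def as_int32(n):
    """Encode n as a fixed 4-byte little-endian signed integer."""
    return struct.pack("<i", n)


def as_xml(tag, n):
    """Encode n as the text content of a single XML element."""
    return f"<{tag}>{n}</{tag}>".encode("ascii")


for n in (7, 1234567, 2147483647):
    print(
        f"{n:>10}  int32={len(as_int32(n))}B  "
        f"digits={len(str(n))}B  xml={len(as_xml('value', n))}B"
    )
```

In this sketch the bare digits of a small number are shorter than a fixed 4-byte integer, so most of the gap comes from the open and close tags, which add a fixed per-field cost that grows with the tag name rather than with the value.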

7/9/2008 3:39 PM
They didn't just start using it; they've been using it for a while. They just open-sourced the format and a few APIs for common languages so that other people can use it.
Alex

7/9/2008 4:57 PM
Looks like they borrowed the idea from Thrift, however that is a Facebook creation which of course Google wouldn't just adopt outright...
TheBull

7/10/2008 4:31 AM
@Alex @Wes Thanks for clarifying. The post is not clear on that; it implies that Google tried XML and dismissed it for being bloated and slow. They may have tested it out, that isn't clear, and it isn't often that any organization publicly discusses what they've tried and ditched.

@Don OH RLY? Perhaps you'd enjoy sleeping on the couch tonight? :-) XML isn't any fatter than JSON, or a ton of other text-based formats. What you're really arguing against is text-based data formats, not necessarily just XML.

It's a trade-off, as always. You trade interoperability and standards (XML) for speed and compactness (binary). Sometimes the former is more important than the latter, else we wouldn't be seeing the mainstream adoption of SOA and SOAP and XML in general.
Lori MacVittie

7/11/2008 2:22 PM
Interoperability and standards are completely independent of text and binary. The PACS standard is binary and anybody who uses it is guaranteed interoperability with anyone else who uses it. The only real advantage of XML is human readability for ease of debugging.
Pat

7/11/2008 2:28 PM
@Pat

That's not the only advantage of XML. Being meta-data driven means that it's a lot easier to integrate. Sure, you can get interoperability with PACS, but it requires coding which necessarily increases the development life cycle.

You also gain agility through XML-based standards like SOAP and WSDL, because the interface is separated from the implementation and modifications to the underlying code don't necessarily require client recompilation. That's not true when you're changing around interfaces with a standard that's coded.

Lori
Lori MacVittie

7/11/2008 4:55 PM
Decreased development lifecycle only requires good open source libraries that implement the protocol in question. In addition, there is nothing mutually exclusive about meta-data and compact binary protocols.

1st 2 bytes contain parameter name, next 2 bytes contain parameter value, etc.
Pat
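
The scheme sketched in the comment above (2 bytes for the parameter name, 2 bytes for its value) might look like this in Python. The `NAMES` table and the parameter names in it are hypothetical; the table stands in for metadata that both sides would share out of band.

```python
import struct

# Each pair: 2-byte parameter ID followed by a 2-byte unsigned value,
# big-endian (network byte order).
PAIR = struct.Struct(">HH")

# Hypothetical shared metadata mapping parameter IDs to names.
NAMES = {1: "width", 2: "height", 3: "depth"}
IDS = {name: ident for ident, name in NAMES.items()}


def encode(params):
    """Pack a dict of {name: value} into consecutive 4-byte pairs."""
    return b"".join(PAIR.pack(IDS[name], value) for name, value in params.items())


def decode(buf):
    """Unpack consecutive 4-byte pairs back into {name: value}."""
    return {NAMES[ident]: value for ident, value in PAIR.iter_unpack(buf)}


msg = encode({"width": 640, "height": 480})
print(len(msg), "bytes:", msg.hex())
print(decode(msg))
```

Because every field carries its ID, a reader can look up names in the shared table and skip or reorder fields, which is the sense in which a compact binary message can still be metadata-driven. Values here are limited to 0-65535 by the 2-byte width.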