Twitter makes me mental or how not to write an API
Posted : Friday October 8th, 2010
Sometimes in the land of development you'll find yourself working on painful tasks, usually involving some half-baked library with shoddy documentation. Lord knows I've had a few of these. And I have to say the Twitter API ranks right up there with the worst of them.
The overly complicated setup, the lack of a step-by-step guide that isn't missing the all-important steps 3 and 17, the fact that you can do everything right and Twitter will still throw random 401, 403 and 500 errors at you, and the handy-dandy undocumented "features" (like locking out an entire netblock at a time if you attempt to use single use authentication) all contribute to making it a real "winner". But the documentation in general is what makes this crap stand out above the rest.
As an example I would like to present a page that I nominate for the 2010 worst page of the year and which is featured on the dev.twitter.com site. This page isn't just bad because it's poorly structured (which it is) and needlessly complicated (which it also is) but more than anything because I think the most direct result of it will be to actually "uneducate" the masses. No. Really. Think about it. Most documentation/guides are written to help people understand something better. In this case most readers will actually be dumber for reading it.
Before I delve into the many different problems with this page I think I need to give a little background on the issue. The question relates to how Twitter counts "characters", aka the 140 limit.
Character encoding is a thing I know a goodly amount about (which saved me from being infected by the stupidity on the page in question) and the basics are as follows.
- Bytes are not the same as characters. Characters you can think of as the letters you see on the screen, like A, B, C etc.
- Bytes are the computer representation of characters. They are how computers actually store, compare etc. characters.
- Some alphabets (or character sets), like the one used for English (called Latin), have a one to one character to byte ratio. So one character equals one byte.
- Some character sets have multiple bytes per character (for example Chinese). So one character might take two, three or four bytes.
The reason all this is important is that when you talk about data limits of any sort (sending, buffering, receiving etc.) you are almost always referring to bytes. Because, sensibly, bytes are what is actually going on at a low level, and no matter what alphabet (character set) you use, 4 bytes (for example) is always going to be 4 bytes regardless of whether that represents 4 characters or 2 or 1.
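To make the bytes-versus-characters point concrete, here is a minimal Python sketch. The helper name `describe` and the sample strings are my own picks for illustration, and the byte counts depend entirely on which encoding you choose.

```python
def describe(text, encoding="utf-8"):
    """Return (number of characters, number of bytes) for text in an encoding."""
    return len(text), len(text.encode(encoding))

# Sample strings chosen for illustration only.
print(describe("ABC"))                  # (3, 3): plain Latin letters, one byte each
print(describe("caf\u00e9", "latin-1")) # (4, 4): Latin-1 is one byte per character
print(describe("caf\u00e9"))            # (4, 5): in UTF-8 the accented e takes two bytes
print(describe("\u4e2d\u6587"))         # (2, 6): two Chinese characters, three bytes each in UTF-8
```

Same characters, different byte counts, which is exactly why "140 of what?" is a fair question.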
So keeping that in mind let's look at all the things wrong with http://dev.twitter.com/...characters. And again the question is a simple one: does Twitter count to 140 bytes (which would make sense) or 140 characters? A fairly simple question to answer, or one would think anyway...
- To give a bit of credit here the answer is actually in paragraph 2. And since I understood it I should, according to Twitter, be working for Twitter. Sadly though the explanation is a poor one because a short non-geeky version of what it says could (and should) easily be given. Here is the short non-geeky version, "Twitter counts characters not bytes".
- The page then veers off into a random attack on Wikipedia. Which is, to say the least, odd. I mean, why link to Wikipedia only to then say how bad it is? Not coincidentally this is where my real upset started. Character encodings are one of the most chronically misunderstood computer topics and the nonsense that appears in this section is just contributing to it. To begin with, the Wikipedia explanation is better than what actually follows, and statements like "For many Tweets all characters are a single byte and this page is of no use" are just so much unneeded confusion. What they mean by that, by the way, is text using a Latin (English) character set. And it's not "of no use", it's of plenty of use; you just have to understand that for most Latin characters it's a one to one byte to character ratio.
- I have to jump back slightly here to give an award to what is one of the stupidest sentences I have ever read, which follows the Wikipedia reference. In all its glory it is:
The definition we're interested in here is not the general definition of a character in computing but rather the definition of what "character" means when we say "140 characters".
I have no idea what the author of the page was trying to say at that point but they badly need to have some sense knocked back into them. The whole "definition" of what "140 characters" is *IS* a computing definition. Further, the understanding of the nature of bytes vs characters is key to that definition no matter how much you try to make out like it's different.
- Now we get to the "meat" of the page, which starts with the counting characters section. Here you'll find a (rather poor) explanation of the relationship of bytes vs characters (which the page just went through some convolutions to try and say didn't matter) followed by a dive right into codepoints. Now, excepting what comes before and after, this would be passable, but in context... it just finished saying that none of this was important *and* it turns out later it is unimportant but for different reasons than originally stated. So essentially it's so much cruft that annoys those with actual knowledge and will confuse the hell out of those without (and make it all appear very complicated).
- The next section, on Combining Diacritical Marks, is just more useless cruft. It won't do anything to help those who don't know and is just bumf if you do. At least it avoids outright harmful stupidity.
- Unicode Normalization is the section that follows and is really the key to understanding the whole page. There are two sentences that actually give the answer, which are "Twitter also counts the number of codepoints in the text rather than UTF-8 bytes." and "Twitter counts the length of a Tweet using the Normalization Form C (NFC) version of the text.". To translate: "Twitter counts characters. The character count is derived using Normalization Form C of the text". Sadly though, in the real version the two sentences do not appear in that order, nor are they consecutive. And in fact the most important sentence (the first one, saying what is counted) is the one that's buried the deepest. Considering that it is the answer to what the whole page is supposedly about, that is remarkable.
In short. Terrible.
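For the record, those two quoted sentences boil down to very little code. This is a minimal Python sketch, assuming the rule really is "normalize to NFC, then count codepoints" as the page states. The names `tweet_length` and `fits_in_tweet` and the sample strings are mine, and this is not Twitter's actual implementation.

```python
import unicodedata

LIMIT = 140  # the limit discussed in this post

def tweet_length(text):
    """Count codepoints after NFC normalization (a sketch of the quoted rule)."""
    return len(unicodedata.normalize("NFC", text))

def fits_in_tweet(text):
    """True if the normalized text is within the 140 limit."""
    return tweet_length(text) <= LIMIT

composed = "caf\u00e9"     # accented e as a single codepoint
decomposed = "cafe\u0301"  # plain e followed by a combining acute accent

print(len(decomposed))                  # 5 codepoints before normalization
print(len(decomposed.encode("utf-8")))  # 6 bytes in UTF-8
print(tweet_length(composed))           # 4
print(tweet_length(decomposed))         # 4, because NFC folds the accent into the e
```

Both spellings of the same word count as 4, whatever the byte count says, and that is the whole answer.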
Now look, you can make the argument that a complex description is required, and no doubt Twitter suffers from having to support APIs for languages used by "programmers" who wouldn't know a byte if it rose up and took one out of their ass (I'm looking at you PHP) but honest to god I could have written a much better version in two paragraphs and with a bunch of links for "further reading". It seems to me that at least part of the problem is that whoever wrote that page doesn't know who the audience will be. The technically competent and savvy, some kid who thinks programming means "I installed WordPress", people who just enjoy reading poorly structured documents on any subject and/or my grandmother. This page seems to try and reach all these groups and with a predictable result.
And back to my original point, and pain. This document/page is pretty typical of the "documentation" that Twitter provides. Just cluttered with cruft and statements that are at the least disingenuous, important items are buried and all in all you have to read things 10 times before you can decide if you are crazy or the author was. (Hint: it's the author). Really, really dreadful stuff. My hat goes off to all those who came before and have written the actual low-level Twitter APIs that all these apps use.