MaxStocker.com
   

I really think
Posted : Saturday March 27th, 2010

that there is no quicker path to crappy code than having to screen-scrape data. I really do.

One of the things I have to do in my current job is maintain software that connects to other sites (owned or used by our clients) and scrapes data from those sites to pull into our database. It's a somewhat thankless task: it's ugly and awful, and nobody really notices when it works, but it can cause problems when it fails.

And the code. Well. I tried. But the fact is it's just ugly. There is some abstraction at least, so that the parsing/scraping mess sits behind an interface used by the sorting and loading code. But still...
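To make that abstraction concrete, here is a minimal sketch of the idea in Python. It is not the code from my job: `Record`, `Scraper`, `ExampleClientScraper`, and `load_all` are hypothetical names, and the "site" is just a string of key=value lines. The point is only that the sorting and loading side sees one interface while each site's mess stays in its own class.

```python
from __future__ import annotations

from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    """One scraped row in the shape the loader expects (hypothetical)."""
    client: str
    key: str
    value: str


class Scraper(ABC):
    """The only thing the sorting/loading code needs to know about."""

    @abstractmethod
    def fetch_records(self) -> list[Record]:
        """Return every record this site currently exposes."""


class ExampleClientScraper(Scraper):
    """Stand-in for one site; the ugly site-specific parsing lives here."""

    def __init__(self, raw_page: str) -> None:
        self._raw_page = raw_page

    def fetch_records(self) -> list[Record]:
        records = []
        for line in self._raw_page.splitlines():
            if "=" not in line:
                continue  # skip anything that is not a key=value pair
            key, value = line.split("=", 1)
            records.append(Record("example-client", key.strip(), value.strip()))
        return records


def load_all(scrapers: list[Scraper]) -> list[Record]:
    """Sort-and-load side: depends only on the Scraper interface."""
    records = [r for s in scrapers for r in s.fetch_records()]
    return sorted(records, key=lambda r: (r.client, r.key))
```

Adding a new site then means writing one more `Scraper` subclass; `load_all` does not change. It doesn't make the parsing any prettier, it just keeps the ugliness in one place.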

And I'm not sure what I could do differently or better, to be honest. Every scrape is different, and some of them involve a bunch of JavaScript parsing, so it's not even as simple as just loading an HTML DOM.
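For a flavor of what "JavaScript parsing" can mean, here is a small sketch of one common case: the data never appears in the HTML at all, only as a literal assigned to a variable inside a script block. Everything here is made up for illustration (the page, the `gridData` variable, the `extract_js_array` helper), and it assumes the literal happens to be valid JSON and contains no `];` inside it, which real sites won't always grant you.

```python
import json
import re

# Hypothetical page: the rows live in a JS variable, not in the markup.
PAGE = """
<html><body><div id="grid"></div>
<script>
var gridData = [{"id": 1, "name": "Widget"}, {"id": 2, "name": "Gadget"}];
renderGrid(gridData);
</script></body></html>
"""


def extract_js_array(html: str, var_name: str) -> list:
    """Pull a JSON-compatible array literal assigned to `var_name`."""
    # Non-greedy match up to the first "];" after the assignment.
    pattern = re.compile(
        r"var\s+" + re.escape(var_name) + r"\s*=\s*(\[.*?\]);",
        re.DOTALL,
    )
    match = pattern.search(html)
    if match is None:
        raise ValueError(f"no assignment to {var_name!r} found")
    return json.loads(match.group(1))


rows = extract_js_array(PAGE, "gridData")
print([row["name"] for row in rows])
```

A DOM parser would hand you an empty `div` here, which is exactly the problem. And this is the easy case; once the script builds its data with function calls or string concatenation, a regex like this stops being enough.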

And I'm not aided by the fact that some of these systems are truly dreadful from a user-interaction perspective: the kind of systems that rely on sessions for navigation, resulting in unrecoverable errors should you hit back in your browser or attempt to see some data in any order but the one officially prescribed. I don't want to go off on a total rant here, but it's 2010! Who writes web applications like that anymore? This isn't a new paradigm, guys, and frankly, if your app can't handle a user hitting the back button on any page without complete collapse, your app just plain sucks.

I'll tell you, if it wasn't for Fiddler I'd be totally, well, you know. (As an aside, Fiddler now supports HTTPS. I think it has for a little while, although it didn't use to, and I hadn't used it before now. Your browser will (rightly) complain about what is essentially a man-in-the-middle attack, but you can use it to fully trace HTTPS interactions.)
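The same trick works for the scraper itself, not just the browser: point the client at the debugging proxy and its traffic shows up in the trace too. Below is a rough sketch using only Python's standard library. The address is an assumption (8888 on localhost is the usual Fiddler default, but check your own setup), and `build_traced_opener` and the PEM file path are hypothetical names of mine.

```python
import ssl
import urllib.request

# Assumption: the debugging proxy listens here; adjust to your own setup.
DEBUG_PROXY = "http://127.0.0.1:8888"


def build_traced_opener(proxy_url=DEBUG_PROXY, proxy_ca_pem=None):
    """Return an opener whose HTTP and HTTPS requests go through the proxy.

    proxy_ca_pem: optional path to the proxy's root certificate in PEM form
    (a file you would export yourself). When given, only that CA is trusted,
    which is what lets the client accept the proxy's re-signed HTTPS
    responses instead of rejecting them as a man-in-the-middle attack.
    """
    proxy_handler = urllib.request.ProxyHandler(
        {"http": proxy_url, "https": proxy_url}
    )
    context = ssl.create_default_context(cafile=proxy_ca_pem)
    https_handler = urllib.request.HTTPSHandler(context=context)
    return urllib.request.build_opener(proxy_handler, https_handler)
```

Calling `build_traced_opener().open(url)` would then send the request via the proxy. Without the proxy's certificate the HTTPS requests should fail verification, which is the programmatic version of the browser complaining, and it is complaining for the right reason.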

On a funny note, though, one of our clients (whose data we have been scraping successfully for months) called to ask if I could help them do the same, because they needed to scrape the data and couldn't figure out the magic I had used to do it. In fairness, it was one of the JavaScript-related scrapes, and it took some painful reverse engineering on my part to figure out what the heck was going on.

Tags

code  fiddler  scraping  ugly 

Categories

Technical  Personal 

   
  © 2008 Max Stocker