I really think that there is no quicker path to crappy code than having to screen scrape data. I really do. One of the things I have to do in my current job is to have software that connects to other sites (that are owned/used by our clients) and scrapes data from those sites to pull into our database. It's a somewhat thankless task because it's ugly and awful, and nobody really notices when it works, but it can cause problems when it fails.

And the code. Well. I tried. But the fact is it's just ugly. There is some abstraction at least, so that the parsing/scraping mess sits behind an interface used by the sorting and loading code. But still... And I'm not sure what I could do differently or better, to be honest. Every scrape is different, and some of them involve a bunch of JavaScript parsing, so it's not even as simple as just loading an HTML DOM.

And I'm not aided by the fact that some of these systems are truly dreadful from a user-interaction perspective. The kind of systems that rely on sessions for navigation, resulting in unrecoverable errors should you hit back in your browser or attempt to see some data in any order but the one officially prescribed. I don't want to go off on a total rant here, but it's 2010! Who writes web applications like that anymore? This isn't a new paradigm anymore, guys, and frankly if your app can't handle a user hitting the back button on any page ever without complete collapse, your app just plain out sucks. I'll tell you, if it wasn't for Fiddler I'd be totally, well, you know.
(As an aside, Fiddler now supports HTTPS. I think it has for a little while now, although it didn't used to and I hadn't used it before now. Your browser will (rightly) complain about what is essentially a man-in-the-middle attack, but you can use it to fully trace HTTPS interactions.)

On a funny note, though, I had one of our clients (whose data we have been scraping successfully for months) call and ask me if I could explain/help them to do the same, because they need to scrape the data and couldn't figure out the magic I did to do it. In fairness, it was one of the JavaScript-related scrapes and took some painful reverse engineering on my part to figure out what the heck was going on.
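For what it's worth, the abstraction I mentioned (the scraping mess hidden behind an interface that the sorting and loading code uses) looks roughly like the sketch below. This is Python purely for illustration, not my actual code, and every name in it (`Scraper`, `TableScraper`, `load`, the two-column table layout) is made up for the example.

```python
from abc import ABC, abstractmethod
from html.parser import HTMLParser


class Scraper(ABC):
    """Hypothetical interface: each client site gets its own implementation."""

    @abstractmethod
    def records(self, page: str) -> list[dict[str, str]]:
        """Turn one fetched page into plain records."""


class _CellCollector(HTMLParser):
    """Collects the text of every <td>, grouped by <tr>."""

    def __init__(self) -> None:
        super().__init__()
        self.rows: list[list[str]] = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag == "td" and self.rows:
            self._in_cell = True
            self.rows[-1].append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.rows[-1][-1] += data


class TableScraper(Scraper):
    """Site-specific ugliness lives here (assumed layout: id, name columns)."""

    def records(self, page):
        collector = _CellCollector()
        collector.feed(page)
        return [
            {"id": row[0].strip(), "name": row[1].strip()}
            for row in collector.rows
            if len(row) == 2
        ]


def load(scraper: Scraper, page: str, database: list) -> None:
    """Sorting and loading code that only knows about the interface."""
    database.extend(sorted(scraper.records(page), key=lambda rec: rec["id"]))
```

The point is only that `load` never sees HTML. When a site changes its markup (or needs JavaScript parsing instead), the damage stays inside one `Scraper` implementation.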
© 2008 Max Stocker