Thursday, March 10, 2011

AppleScript Adventures: How to visit 3,000 webpages in 10 minutes

My first experience with AppleScript has left me with some mixed feelings about the language. I was able to complete a menial task that I would not have wanted to do manually. However, I found the natural language structure to be an obstacle to performing progressively more advanced tasks. That said, I would recommend learning AppleScript; it's pretty awesome.

Task
The impetus for learning AppleScript was a simple but mind-numbing assignment. I had to pull down images and data for almost 1,000 items from a system while maintaining the relationship between the data and the image files. The system was behind password protection, so wget alone wouldn't suffice. Also, to download the images, the click event had to be triggered on the page (thank you, ASP.NET). And the company that built and maintains the system was not being helpful.

Hello AppleScript!
I was left with no alternative but to visit every page and click all the links, LAME! Thankfully AppleScript is pretty powerful. Every Apple user has to try this out at least once. AppleScript can open applications and perform different operations, then pass the results to another application. Why do something when I can tell my computer to do it for me?

The script itself ended up being pretty simple:
  • grab jQuery
  • open browser
  • grab list of links
  • visit each link
    • inject jQuery
    • get data from this link
    • download image
  • go to next link

Opening up applications is trivial with AppleScript. The tricky parts are pulling down the list of links and then visiting them in order.

Gimme jQuery
Both Chrome and Safari allow JavaScript to be executed through AppleScript. To make everything easier, jQuery was injected into each page after it loaded. First, the AppleScript has to read the jQuery source from disk.

-- read the jQuery source from disk into a string (the path is a placeholder)
set jqueryFile to POSIX file "/Path/to/jQuery/file/jquery.js"
open for access jqueryFile
set jqueryContents to (read jqueryFile)
close access jqueryFile

Grabbing links
The next step is to open up Safari, tell it to open a new window, go to an address, and inject jQuery. Unfortunately, Safari doesn't provide an easy way to determine whether a page has finished loading. Setting a delay is a quick, if unreliable, solution.
tell application "Safari"
 activate
 -- make a new document and give the window a moment to open
 make new document
 delay 1
 tell front document to set URL to "http://sickawesome.com"
 -- crude fixed wait for the page to finish loading
 delay 15
 
 set doc to front document
 tell doc
  -- inject jQuery into the loaded page
  do JavaScript jqueryContents
  -- the do JavaScript command returns JavaScript arrays as a list
  -- ('#list of links' is a placeholder selector)
  set image_hrefs to (do JavaScript "var hrefs = []; $('#list of links').each(function(){hrefs.push($(this).attr('href'))}); hrefs;")
 end tell
end tell

The downside of scripting Safari is already apparent: the loading status of the current document isn't directly available to AppleScript, so some other method has to be found to delay until the document is ready to run JavaScript. Still, Safari was used for this step because of how well it handles JavaScript. Safari, unlike Chrome, returns the value of a JavaScript statement so that AppleScript can use it later. When Safari returns a JavaScript array, AppleScript receives it as a list, with no conversion needed. Awesome.
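
For readers who would rather not inject jQuery just to gather links, the same array can be built with plain DOM calls. The following is a minimal sketch, not the script used above: collectHrefs is a hypothetical helper, and the selector passed to it is a placeholder you would replace with one that matches your page.

```javascript
// Hypothetical helper: collect the href attribute of every element
// matching `selector` under `root` (normally `document`).
function collectHrefs(root, selector) {
  var hrefs = [];
  var nodes = root.querySelectorAll(selector);
  for (var i = 0; i < nodes.length; i++) {
    var href = nodes[i].getAttribute('href');
    // skip elements that have no href attribute at all
    if (href !== null) {
      hrefs.push(href);
    }
  }
  return hrefs;
}
```

If the string handed to do JavaScript ends with a call such as collectHrefs(document, 'a'); then that final expression should be the value Safari passes back, just as the trailing hrefs; is in the jQuery version.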

Quick Visits Only
The next step is the longest of the whole process: visiting each page in succession, pulling down the info, and then going to the next. Chrome was chosen for two reasons: it is the fastest browser out there, and Chrome tabs expose the loading status of the page. This is important because after a couple dozen pages the connection speed fell dramatically, which made fixed delays unreliable.

However, as I mentioned before, Chrome does not return values from executed JavaScript. So a little more creativity is required. Fortunately both JavaScript and AppleScript have access to the title of a tab. So as long as the value can be cast as a string (unsure about arrays), Chrome can still pull out the data.
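
On the open question of arrays: one way to sidestep it is to turn any value into a string before it goes into the title. This is a minimal sketch, assuming the page's JavaScript engine provides JSON.stringify (modern browsers do). packForTitle is a made-up name, and the AppleScript side would still need to parse the resulting text.

```javascript
// Hypothetical helper: serialize a value so it can travel through document.title.
function packForTitle(value) {
  // Strings pass through untouched; everything else becomes JSON text.
  if (typeof value === 'string') {
    return value;
  }
  return JSON.stringify(value);
}

// In the browser this would be used as:
//   document.title = packForTitle(['a.jpg', 'b.jpg']);
```

Browsers may normalize whitespace in titles, so long or whitespace-sensitive values are worth checking by hand before relying on this.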

tell application "Google Chrome"
 activate
 tell (make new window) to tell tab 1
  -- repeat with ... in works much like a Python for-in loop
  repeat with href in image_hrefs
   execute JavaScript "window.location ='https://baseURL" & href & "'"
 
   -- block until the tab and the document have finished loading
   my waitForReady()
   execute JavaScript jqueryContents
   -- get something: copy the data into the tab title, then read the title back
   execute JavaScript "document.title = $('#block').html()"
   delay 0.2
   set value to get title
  end repeat
 end tell
end tell

A short delay was added to make sure the JavaScript has time to execute before the AppleScript reads the title. The waitForReady subroutine is the key to Chrome's suitability for this task.

on waitForReady()
 delay 1
 tell application "Google Chrome"
  tell window 1 to tell tab 1
   repeat
    execute JavaScript "document.title = document.readyState"
    set status to get title
    if status is "complete" and loading is not true then
     return true
    else
     delay 0.1
    end if
    
   end repeat
  end tell
 end tell
end waitForReady

The above subroutine polls about ten times a second until both the browser and the document report that they have finished loading. JavaScript can be run as soon as document.readyState is "interactive", but at that point the content of the page sometimes isn't ready to pull.
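
Because "complete" does not always mean the data is on the page, the status written to the title could also check for the element the script is about to read. This is a sketch under that assumption, not part of the original script: readySignal is a hypothetical helper, '#block' is the selector from the example above, and 'waiting' is an arbitrary marker string.

```javascript
// Hypothetical helper: report "complete" only when the document has finished
// loading AND the element we want to scrape exists.
function readySignal(doc, selector) {
  if (doc.readyState !== 'complete') {
    return doc.readyState; // "loading" or "interactive"
  }
  return doc.querySelector(selector) ? 'complete' : 'waiting';
}

// In the browser: document.title = readySignal(document, '#block');
```

With this in place of the bare document.readyState assignment, waitForReady would keep looping until the title reads "complete", exactly as it does now.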

Conclusion
Writing this script was fun. Once I discovered the application dictionaries, it was much easier to start experimenting. The uses for this language are innumerable. However, it has to be used in the right situation: writing and debugging these scripts can be a little frustrating, and a script can easily take more time to write than it ends up saving.

As a programmer, I didn't appreciate the natural language syntax for AppleScript. I found it a little verbose and somewhat confusing. That is, I found it difficult to look at AppleScript samples and figure out how I could manipulate the code for another situation.

Google Book Stuffs
Go here for free info:
AppleScript: Definitive Guide
AppleScript: The Comprehensive Guide to Scripting and Automation on Mac OS X
