Automatic Downloader in Python
Tuesday, 17 May 2011 22:36

This article walks you through creating a Python script that downloads all files linked from a web page as fast as possible. The same script can then be used to download all sorts of content on the web, especially when combined with a web crawler.

Fig 1. Automatic Downloader output on Windows 7

Disclaimer

Before we go any further you must understand that it is solely your responsibility to use the information provided here in a responsible and legal way. We will take no responsibility for the use or misuse of the information provided to you in this article; it is provided here for educational purposes only.

Background

First, let me explain how I got this idea. Google, as you may or may not know, is a great place to find a wide range of hidden content. Consider, for example, the following query: ‘intitle:"index.of" mp3 usher omg apache’. Let’s take a moment to see what it does. We are looking for an Apache page that indexes the contents of a directory containing an mp3 file of Usher’s song (see Fig. 2). This little trick was part of a larger article I found about Google hacks and is designed to help you find free music online. It works because Apache, in many configurations, lists all files in a directory that does not contain an index file such as index.html or index.php. Many website owners forget this simple fact, or are simply unaware of it. This leaves the door open for the savvy googler to find and download the contents of these directories.

Fig 2. Apache index page

After successfully trying this trick I found that it was difficult to download multiple files manually in my browser; it is too labour intensive. Wouldn’t it be great if there was a way to quickly download all files from a directory listing?

Thus, the automatic downloader was born.

Python is a simple yet powerful scripting language, which made it the perfect candidate for this task. Furthermore, it can be used for many similar automation tasks. Another option would be to write this in Ruby, but since I know Python a little better and it is more established, it was the obvious choice.

Coding

I started by creating a regular expression that finds all the links on a given page:

<a\s.*?href\s*?=\s*?"(.*?)" 
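
As a quick check of what this pattern captures, here is a minimal sketch that runs it over a small made-up HTML fragment. It is written in Python 3, unlike the Python 2 listings in this article. The `SAMPLE_HTML` string and the `extract_links` helper are illustrative assumptions, not part of the original script.

```python
import re

# Same pattern as in the article: capture the value of a double-quoted
# href attribute inside an <a> tag.
LINK_PATTERN = re.compile(r'<a\s.*?href\s*?=\s*?"(.*?)"', re.DOTALL)

# Hypothetical directory-listing fragment, for illustration only.
SAMPLE_HTML = '''
<a href="song1.mp3">song1.mp3</a>
<a class="x" href = "http://www.example.com/music/song2.wav">song2</a>
<a href='ignored.mp3'>single quotes are not matched</a>
'''

def extract_links(html):
    """Return every double-quoted href value found in html, in document order."""
    return [match.group(1) for match in LINK_PATTERN.finditer(html)]

if __name__ == "__main__":
    for link in extract_links(SAMPLE_HTML):
        print(link)
```

Note that the pattern only recognises href values in double quotes, so pages that use single quotes or unquoted attributes would need a more tolerant pattern or a real HTML parser.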

Next I wrote a short script to iterate over these links and download the content of each item and write it to a file:


    # requires: import re, urllib (Python 2)
    patternLinks = re.compile(r'<a\s.*?href\s*?=\s*?"(.*?)"', re.DOTALL)
    iterator = patternLinks.finditer(html)

    # process individual items
    for match in iterator:
        fileUrl = match.group(1)
        if fileUrl.endswith(".mp3") or fileUrl.endswith(".wav") or fileUrl.endswith(".wma"):
            print fileUrl

            # absolute URL
            if fileUrl.startswith("http://"):
                filename = fileUrl[7:] # TODO: clean this up more!
                downloadUrl = fileUrl
            else: # relative url
                filename = fileUrl
                downloadUrl = url + '/' + fileUrl

            downloadConnection = urllib.urlopen(downloadUrl)
            newFile = open(filename, "wb")
            try:
                newFile.write(downloadConnection.read())
            finally:
                # always release the file handle and the connection
                newFile.close()
                downloadConnection.close()
            print "finished writing " + filename
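
The TODO in the listing hints at a weakness: stripping `http://` from an absolute URL leaves a name that still contains the host and slashes. One possible cleanup is sketched here in Python 3. It assumes that saving each file under the last component of its URL path is acceptable, and it uses the standard `urllib.parse` helpers. `resolve_download` is a hypothetical helper name, not part of the original script.

```python
import posixpath
from urllib.parse import unquote, urljoin, urlsplit

def resolve_download(page_url, href):
    """Return (download_url, filename) for a link found on page_url.

    Hypothetical helper: urljoin handles absolute and relative links alike,
    and the file name is the last component of the decoded URL path.
    """
    # Treat the page URL as a directory so relative links resolve beneath it.
    base = page_url if page_url.endswith("/") else page_url + "/"
    download_url = urljoin(base, href)
    # Decode percent-escapes first, then keep only the final path component,
    # so an encoded slash cannot smuggle a directory into the file name.
    path = unquote(urlsplit(download_url).path)
    filename = posixpath.basename(path)
    return download_url, filename

if __name__ == "__main__":
    print(resolve_download("http://www.example.com/music", "usher%20-%20omg.mp3"))
    print(resolve_download("http://www.example.com/music",
                           "http://cdn.example.com/a/b/song.wav"))
```

Two links that share a final path component would still overwrite each other under this scheme, so a real script may want to add a uniqueness check.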

Finally, I added support for command line arguments so that it can easily be used to download content from any web page just by specifying the URL as an argument.

    # requires: import getopt, sys, urllib (Python 2)
    # Reading in options from the command line; the ':' in 'u:' and the '='
    # in 'url=' tell getopt that the option takes a value
    optlist, args = getopt.getopt(sys.argv[1:], 'u:', ['url='])

    if not optlist:
        print """Wrong format!
            Options are:
            -u (--url): the URL where to download the content from
            """
        sys.exit(1)
    else:
        # Parsing options specified in command line
        for opt, val in optlist:
            if opt == "-u" or opt == "--url":
                url = val
                if val.endswith('/'):
                    url = val[:-1]

                print "`%s`" % url

    # start reading the URL
    connection = urllib.urlopen(url)
    html = connection.read()
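
The option handling can be tried out without touching the network. The sketch below, in Python 3, wraps it in a hypothetical `parse_url` function that takes the argument list as a parameter. In this sketch the long option is declared as `'url='`; the trailing `=` is how `getopt` is told that `--url` expects a value.

```python
import getopt

def parse_url(argv):
    """Return the URL passed via -u/--url without trailing slashes, or None."""
    # 'u:' and 'url=' both declare that the option requires a value.
    optlist, _args = getopt.getopt(argv, "u:", ["url="])
    url = None
    for opt, val in optlist:
        if opt in ("-u", "--url"):
            url = val.rstrip("/")
    return url

if __name__ == "__main__":
    print(parse_url(["-u", "http://www.example.com/music/"]))
    print(parse_url(["--url", "http://www.example.com/music"]))
```

Passing the argument list in, rather than reading `sys.argv` inside the function, keeps the parsing easy to test on its own.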

That is basically all there is to it. Next we will look at how we can improve performance by using multi-threading.

Adding Threads

Instead of downloading one file at a time, we can download several files concurrently for improved performance. Threads can help us do just that. However, we need a way to prevent different threads from trying to download the same file. We could do this with an array of booleans, one per file, and let a thread flip the value from false to true when it starts processing that file. Instead, let us use a dictionary to keep all the information about a file: we now have one record per file, containing a boolean value called inProgress, the file name and a URL. We will also use a lock object (mutex) so that only one thread at a time can inspect or update the file list.
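
Before looking at the full worker thread, the claiming step on its own can be sketched as follows. This is a Python 3 illustration under assumptions: `claim_next` and `demo` are hypothetical names, the records are made up, and no downloading takes place. The point is only that checking and setting inProgress happen together while the lock is held, so each record is handed to exactly one thread.

```python
import threading

download_list_lock = threading.Lock()

def claim_next(download_list):
    """Mark the first unclaimed record as in progress and return it, or None."""
    with download_list_lock:
        for record in download_list:
            if not record["inProgress"]:
                record["inProgress"] = True
                return record
    return None

def demo(num_files=20, num_threads=4):
    """Let several threads claim made-up records; return the claimed file names."""
    records = [{"filename": "file%d.mp3" % i,
                "url": "http://www.example.com/music/file%d.mp3" % i,
                "inProgress": False} for i in range(num_files)]
    claimed = []
    claimed_lock = threading.Lock()

    def worker():
        while True:
            record = claim_next(records)
            if record is None:
                return  # every record has been claimed
            with claimed_lock:
                claimed.append(record["filename"])

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return claimed

if __name__ == "__main__":
    names = demo()
    print("claimed %d files, %d unique" % (len(names), len(set(names))))
```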

Worker Thread:

# requires: import threading, urllib (Python 2)
downloadListLock = threading.Lock()

class WorkerThread(threading.Thread):
    def __init__(self, downloadList):
        threading.Thread.__init__(self)
        self._downloadList = downloadList

    def run(self):
        # download the files
        i = 0
        print "started thread #%s" % self.getName()

        for download in self._downloadList:
            i += 1

            # claim the file while holding the lock so no other thread picks it up
            downloadListLock.acquire(True)
            try:
                alreadyClaimed = download['inProgress']
                download['inProgress'] = True
            finally:
                downloadListLock.release()

            if alreadyClaimed:
                continue

            print "downloading file %s of %s (%s, thread: %s)" % (i, len(self._downloadList), download['filename'], self.getName())

            downloadConnection = urllib.urlopen(download['url'])
            newFile = open(download['filename'], "wb")
            try:
                newFile.write(downloadConnection.read())
            finally:
                newFile.close()

            print "finished writing %s (thread: %s)" % (download['filename'], self.getName())
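
An alternative to the shared list and explicit lock is to hand out work through the standard library's thread-safe `queue.Queue`. This is a different technique from the one used in this article, shown here only as a Python 3 sketch. The `run_workers` function and its `fetch` parameter are hypothetical; `fetch` stands in for whatever performs the actual download.

```python
import queue
import threading

def run_workers(download_list, fetch, num_threads=4):
    """Call fetch(record) once per record, using num_threads worker threads."""
    work = queue.Queue()
    for record in download_list:
        work.put(record)

    def worker():
        while True:
            try:
                # The queue is fully populated before any worker starts,
                # so an empty queue means there is nothing left to do.
                record = work.get_nowait()
            except queue.Empty:
                return
            fetch(record)

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

if __name__ == "__main__":
    files = [{"filename": "file%d.mp3" % i} for i in range(5)]
    run_workers(files, lambda record: print("would download", record["filename"]))
```

Because each record is removed from the queue when a worker takes it, no inProgress flag is needed in this variant.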

Calling the worker thread:

    downloadList = []

    # process individual items
    for match in iterator:
        fileUrl = match.group(1)

        if fileUrl.endswith(".mp3") or fileUrl.endswith(".wav") or fileUrl.endswith(".wma"):
            # absolute URL
            if fileUrl.startswith("http://"):
                filename = fileUrl[7:] # TODO: clean this up more!
                downloadUrl = fileUrl
            else: # relative url
                filename = fileUrl
                downloadUrl = url + '/' + fileUrl

            downloadList.append({'filename' : filename, 'url' : downloadUrl, 'inProgress': False})

    threads = []

    # create threads (numThreads is set elsewhere in the script; not shown here)
    for i in range(0, numThreads):
        threads.append(WorkerThread(downloadList))
        threads[i].setName(i)
        threads[i].start()

    # wait for all threads to finish
    for t in threads:
        t.join()

    # print summary (timeAtStart is recorded elsewhere in the script; not shown here)
    totalTime = datetime.datetime.now() - timeAtStart
    print "-" * 50
    print "Statistics"
    print "-" * 50
    print "Total Download Time: %s" % totalTime
    if downloadList: # avoid dividing by zero when no matching files were found
        print "Average per File Time: %s" % (totalTime / len(downloadList))

Usage

To run this script, simply navigate to the script’s directory in your console and type:

> downloader -u http://www.example.com/music


Replace the URL above with the URL of the Apache directory you are trying to download files from.

References

  1. Google Hacking - http://www.i-hacked.com/content/view/23/42/





Last Updated on Tuesday, 16 August 2011 22:49
 
Comments

#4 2011-12-05 16:24
This script works pretty well. I just used it on a download page protected by HTTP basic authentication. It worked right out of the box, thanks to some magic in urllib.

#3 2011-10-24 10:28
I am a beginner in Python. I think this blog will be helpful to me in the future. Thank you for sharing.

#2 2011-09-12 05:02
Thanks to this article, I can now download all the files on a web page faster.

#1 2011-08-16 06:50
This is a really great guide!