This article walks you through creating a Python script that downloads all the files linked from a web page as fast as possible. The same script can then be used to download all sorts of content on the web, especially when the code is combined with a web crawler.
Fig 1. Automatic Downloader output on Windows 7
Disclaimer
Before we go any further you must understand that it is solely your responsibility to use the information provided here in a responsible and legal way. We will take no responsibility for the use or misuse of the information provided to you in this article; it is provided here for educational purposes only.
Background
First, let me explain how I got this idea to begin with. Google, as you may or may not know, is a great place to find a wide range of hidden content. Consider, for example, the following query: ‘intitle:"index.of" mp3 usher omg apache’. Let’s take a moment to see what it does. We are looking for an Apache page that indexes the content of a directory which contains an mp3 file of Usher’s song (see Fig. 2). This little trick was part of a larger article I found about Google hacks and is designed to help you find free music online. It works because Apache, when directory listings are enabled, lists all the files in a directory that does not contain an index file such as index.html or index.php. Many website owners forget this simple fact, or are simply unaware of it. This leaves the door open for the savvy googler to find and download the content of these directories.
Fig 2. Apache index page
After successfully trying this trick I found that downloading multiple files manually in my browser was too labour intensive. Wouldn’t it be great if there were a way to quickly download all files from a directory listing?
Thus, the automatic downloader was born.
Python is a simple yet powerful scripting language, which made it the perfect contender for this task. Furthermore, it can be used for many similar automation tasks. Another option would be to use Ruby to write this, but since I know Python a little better and it is more established, it was the obvious choice.
Coding
I started by creating a regular expression that finds all the links on a given page. It captures the value of every double-quoted href attribute:
<a\s.*?href\s*?=\s*?"(.*?)"
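To see what the expression captures, here is a minimal sketch that runs it against a small piece of markup. The sample HTML and the variable names are made up for illustration; a real directory listing will look different. Note that an href written with single quotes is not captured, since the pattern only looks for double quotes.

```python
import re

# The pattern from the article; DOTALL lets ".*?" span line breaks inside a tag.
pattern_links = re.compile(r'<a\s.*?href\s*?=\s*?"(.*?)"', re.DOTALL)

# Hypothetical sample markup, loosely modelled on a directory listing.
sample_html = '''
<a href="track01.mp3">track01.mp3</a>
<a class="ext" href="http://www.example.com/music/track02.wav">track02.wav</a>
<a href='single-quoted.mp3'>single quotes are not captured</a>
<a href="notes.txt">notes.txt</a>
'''

# findall returns the content of the single capture group for each match.
links = pattern_links.findall(sample_html)

# Keep only the audio files, as the script in this article does.
audio_links = [link for link in links if link.endswith((".mp3", ".wav", ".wma"))]
print(audio_links)
```

A regular expression is enough for simple listings like this one, but it is not a general HTML parser, so treat the result as a best effort.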
Next I wrote a short script to iterate over these links and download the content of each item and write it to a file:
import re
import urllib

# `html` (the page source) and `url` (the base URL) come from the
# command-line handling described next
patternLinks = re.compile(r'<a\s.*?href\s*?=\s*?"(.*?)"', re.DOTALL)
iterator = patternLinks.finditer(html)
# process individual items
for match in iterator:
    fileUrl = match.group(1)
    # only download audio files
    if fileUrl.endswith((".mp3", ".wav", ".wma")):
        print fileUrl
        if fileUrl.startswith("http://"):  # absolute URL
            # keep only the last path segment so no directories are needed
            filename = fileUrl.split('/')[-1]
            downloadUrl = fileUrl
        else:  # relative URL
            filename = fileUrl
            downloadUrl = url + '/' + fileUrl
        # fetch the file and write it to disk in binary mode
        downloadConnection = urllib.urlopen(downloadUrl)
        newFile = open(filename, "wb")
        newFile.write(downloadConnection.read())
        newFile.close()
        print "finished writing " + filename
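The string handling that turns a link into a download URL and a local file name can also be written with the standard library's URL helpers. The sketch below is one possible way to do it, not the code used in the downloader itself; the function name resolve_download is hypothetical, and the example URLs are placeholders.

```python
import posixpath

try:  # Python 3
    from urllib.parse import unquote, urljoin, urlsplit
except ImportError:  # Python 2
    from urllib import unquote
    from urlparse import urljoin, urlsplit


def resolve_download(base_url, href):
    """Return (download_url, filename) for a link found on the page at base_url.

    Hypothetical helper: the name and signature are assumptions for this sketch.
    """
    # A trailing slash makes urljoin treat base_url as a directory.
    download_url = urljoin(base_url.rstrip('/') + '/', href)
    # Keep only the last path segment, percent-decoded, as the local file name.
    filename = posixpath.basename(unquote(urlsplit(download_url).path))
    return download_url, filename


relative = resolve_download("http://www.example.com/music", "song%20one.mp3")
absolute = resolve_download("http://www.example.com/music",
                            "http://other.example.org/a/b/track.wav")
print(relative)
print(absolute)
```

Using only the last path segment means every file lands in the current directory, so two links that end in the same name would overwrite each other.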
Finally, I added support for command line arguments so that it can easily be used to download content from any web page just by specifying the URL as an argument.
import getopt
import sys
import urllib

# Read the options from the command line; the ':' in 'u:' and the '=' in
# 'url=' mark options that take a value
optlist, args = getopt.getopt(sys.argv[1:], 'u:', ['url='])
if not optlist:
    print """Wrong format!
Options are:
-u (--url): the URL where to download the content from
"""
    sys.exit(1)
else:
    # Parse the options specified on the command line
    for opt, val in optlist:
        if opt == "-u" or opt == "--url":
            url = val
            # drop a trailing slash so file names can be appended later
            if val.endswith('/'):
                url = val[:-1]
            print "`%s`" % url

# start reading the URL
connection = urllib.urlopen(url)
html = connection.read()
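The option handling is easier to try out when it is wrapped in a function that takes the argument list as a parameter. The following is a minimal sketch along those lines. The function name parse_options is hypothetical, and so are the -t/--threads option and its default of 4: the article does not show how the number of threads is chosen, so this is only one way it might be passed in.

```python
import getopt


def parse_options(argv):
    """Return (url, num_threads) parsed from argv, for example sys.argv[1:].

    Hypothetical helper; -t/--threads and its default are assumptions.
    """
    # The ':' after a short option and the '=' after a long option mark
    # options that take a value.
    optlist, _ = getopt.getopt(argv, 'u:t:', ['url=', 'threads='])
    url = None
    num_threads = 4  # assumed default, not taken from the article
    for opt, val in optlist:
        if opt in ('-u', '--url'):
            url = val.rstrip('/')  # drop trailing slashes
        elif opt in ('-t', '--threads'):
            num_threads = int(val)
    if url is None:
        raise ValueError("missing -u/--url")
    return url, num_threads


print(parse_options(['-u', 'http://www.example.com/music/']))
print(parse_options(['--url=http://www.example.com/music', '-t', '8']))
```

Raising an exception instead of calling exit keeps the function usable from other code; the caller can print the usage text and exit if it wants to.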
That is basically all there is to it. Next we will look at how we can improve performance by using multi-threading.
Adding Threads
Instead of downloading one file at a time, we can download files concurrently for improved performance. Threads can help us do just that. However, we need a way to prevent different threads from trying to download the same file. One option is a list of booleans, one per file, which a thread flips from False to True when it starts processing that file. Instead, let us use a dictionary to keep all the information about a file together: each file gets one record containing a boolean called inProgress, the file name and the URL. We will also use a lock object (mutex) so that only one thread at a time can inspect and update the file list.
Worker Thread:
import threading
import urllib

# guards every read and write of the shared download list
downloadListLock = threading.Lock()

class WorkerThread(threading.Thread):
    def __init__(self, downloadList):
        threading.Thread.__init__(self)
        self._downloadList = downloadList

    def run(self):
        # download the files
        i = 0
        print "started thread #%s" % self.getName()
        for download in self._downloadList:
            i += 1
            # claim the file while holding the lock so no other thread takes it
            downloadListLock.acquire(True)
            if not download['inProgress']:
                print "downloading file %s of %s (%s, thread: %s)" % (i, len(self._downloadList), download['filename'], self.getName())
                download['inProgress'] = True
                downloadListLock.release()
                # the download itself runs outside the lock
                downloadConnection = urllib.urlopen(download['url'])
                newFile = open(download['filename'], "wb")
                newFile.write(downloadConnection.read())
                newFile.close()
                print "finished writing %s (thread: %s)" % (download['filename'], self.getName())
            else:
                downloadListLock.release()
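The claim-under-lock idea can be tried without any network access by replacing the download with a stand-in function. The sketch below is an illustration under that assumption, not the downloader's actual code: claim_next, worker, run_workers and the fetch callable are hypothetical names, and the records use the same filename, url and inProgress keys as above.

```python
import threading

download_list_lock = threading.Lock()


def claim_next(download_list):
    """Mark the first unclaimed record as in progress and return it, or None."""
    with download_list_lock:  # only one thread may inspect and flip the flags
        for record in download_list:
            if not record['inProgress']:
                record['inProgress'] = True
                return record
    return None


def worker(download_list, fetch, done):
    """Claim records until none are left, calling fetch outside the lock."""
    while True:
        record = claim_next(download_list)
        if record is None:
            break
        fetch(record)  # stand-in for the real download
        with download_list_lock:
            done.append(record['filename'])


def run_workers(download_list, fetch, num_threads=4):
    """Run num_threads workers and return the file names they processed."""
    done = []
    threads = [threading.Thread(target=worker, args=(download_list, fetch, done))
               for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return done


records = [{'filename': 'file%02d.mp3' % n,
            'url': 'http://www.example.com/music/file%02d.mp3' % n,
            'inProgress': False} for n in range(20)]
processed = run_workers(records, fetch=lambda record: None)
print(len(processed))
```

Because the flag is checked and set while the lock is held, each record is handed to exactly one thread, while the slow part (the fetch) runs without blocking the others.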
Calling the worker thread:
import datetime

# timeAtStart (a datetime taken when the script starts) and numThreads
# (the number of worker threads) are assumed to be defined earlier in the script
downloadList = []
# process individual items
for match in iterator:
    fileUrl = match.group(1)
    if fileUrl.endswith((".mp3", ".wav", ".wma")):
        if fileUrl.startswith("http://"):  # absolute URL
            # keep only the last path segment so no directories are needed
            filename = fileUrl.split('/')[-1]
            downloadUrl = fileUrl
        else:  # relative URL
            filename = fileUrl
            downloadUrl = url + '/' + fileUrl
        downloadList.append({'filename': filename, 'url': downloadUrl, 'inProgress': False})

threads = []
# create threads
for i in range(0, numThreads):
    threads.append(WorkerThread(downloadList))
    threads[i].setName(i)
    threads[i].start()
# wait for all threads to finish
for t in threads:
    t.join()
# print summary
totalTime = datetime.datetime.now() - timeAtStart
print "-" * 50
print "Statistics"
print "-" * 50
print "Total Download Time: %s" % totalTime
if downloadList:  # avoid dividing by zero when no files matched
    print "Average per File Time: %s" % (totalTime / len(downloadList))
Usage
To run this script, simply navigate to the script’s directory in your console and type:
> downloader -u http://www.example.com/music
Replace the URL above with the URL of the Apache directory you are trying to download files from.
References
- Google Hacking - http://www.i-hacked.com/content/view/23/42/