I recently had to upload a large number (~1 million) of files to Amazon S3.
My first attempts revolved around s3cmd (and subsequently s4cmd) but both projects seem to based around analysing all the files first, rather than blindly uploading them. This not only requires a large amount of memory, non-trivial experimentation, fiddling and patching is also needed to avoid unnecessary stat(2) calls. I even tried a simple find | xargs -P 5 s3cmd put [..] but I just didn't trust the error handling correctly.
I finally alighted on s3-parallel-put, which worked out well. Here's a brief rundown on how to use it:
I got a belated Christmas present today. Thanks Jo + Simon!
All I want for 2015 is a Free/Open Source Software social network which is:
Is this too much to ask for? Does it exist already?
Every few days, someone publishes a new guide, tutorial, library or framework about web scraping, the practice of extracting information from websites where an API is either not provided or is otherwise incomplete.
However, I find these resources fundamentally deceptive — the arduous parts of "real world" scraping simply aren't in the parsing and extraction of data from the target page, the typical focus of these articles.
The difficulties are invariably in "post-processing"; working around incomplete data on the page, handling errors gracefully and retrying in some (but not all) situations, keeping on top of layout/URL/data changes to the target site, not hitting your target site too often, logging into the target site if necessary and rotating credentials and IP addresses, respecting robots.txt, target site being utterly braindead, keeping users meaningfully informed of scraping progress if they are waiting of it, target site adding and removing data resulting in a null-leaning database schema, sane parallelisation in the presence of prioritisation of important requests, difficulties in monitoring a scraping system due to its implicitly non-deterministic nature, and general problems associated with long-running background processes in web stacks.
In other words, extracting the right text on the page is the easiest and trivial part by far, with little practical difference between an admittedly cute jQuery-esque parsing library or even just using a blunt regular expression.
It would be quixotic to simply retort that sites should provide "proper" APIs but I would love to see more attempts at solutions that go beyond the superficial.
I've finally found a real, practical use for my Raspberry Pi: it's an always-on machine that I can ssh to and use to wake up my home media server :)
It doubles as yet another syncthing client to add to my set.
And of course, it's running Arch ;)