I recently had to upload a large number (~1 million) of files to Amazon S3.
My first attempts revolved around s3cmd (and subsequently s4cmd), but both projects seem to be based around analysing all the files first rather than blindly uploading them. Not only does this require a large amount of memory; non-trivial experimentation, fiddling and patching are also needed to avoid unnecessary stat(2) calls. I even tried a simple `find | xargs -P 5 s3cmd put [..]`, but I just didn't trust its error handling.
I finally alighted on s3-parallel-put, which worked out well. Here's a brief rundown on how to use it:
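A minimal sketch of the sort of invocation that works; the bucket name and prefix are placeholders, and credentials are picked up from the usual environment variables:

```shell
# Credentials for the underlying boto library (values are placeholders).
export AWS_ACCESS_KEY_ID=...
export AWS_SECRET_ACCESS_KEY=...

# --put=stupid skips the remote existence check entirely (no per-file
# round-trips), and --processes controls the number of parallel uploaders.
# Run with --dry-run first to sanity-check what would be uploaded.
s3-parallel-put --bucket=my-bucket --prefix=files/ \
    --put=stupid --processes=30 --dry-run .
```

Once the dry run looks sensible, drop `--dry-run` and let it churn.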
I got a belated Christmas present today. Thanks Jo + Simon!
All I want for 2015 is a Free/Open Source Software social network which is:
Is this too much to ask for? Does it exist already?
Every few days, someone publishes a new guide, tutorial, library or framework about web scraping, the practice of extracting information from websites where an API is either not provided or is otherwise incomplete.
However, I find these resources fundamentally deceptive — the arduous parts of "real world" scraping simply aren't in the parsing and extraction of data from the target page, the typical focus of these articles.
The difficulties are invariably in the "post-processing":

* Working around incomplete data on the page.
* Handling errors gracefully and retrying in some (but not all) situations.
* Keeping on top of layout, URL and data changes to the target site.
* Not hitting your target site too often.
* Logging into the target site if necessary, and rotating credentials and IP addresses.
* Respecting robots.txt.
* Coping with a target site that is utterly braindead.
* Keeping users meaningfully informed of scraping progress if they are waiting on it.
* Handling the target site adding and removing data, which pushes you towards a null-leaning database schema.
* Sane parallelisation in the presence of prioritisation of important requests.
* Monitoring the scraping system, which is difficult due to its inherently non-deterministic nature.
* The general problems associated with long-running background processes in web stacks.
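To make the "retrying in some (but not all) situations" point concrete, here is a minimal sketch; the URL, the retry policy and the backoff values are illustrative assumptions rather than anything from a particular site:

```shell
# Decide whether an HTTP status is worth retrying: transient server-side
# errors (5xx) and curl's "000" (connection failure or timeout) are;
# client errors such as 403 or 404 are not -- retrying won't change the
# answer, and just hammers the target site.
should_retry() {
    case "$1" in
        5??|000) return 0 ;;
        *)       return 1 ;;
    esac
}

# Fetch a page with bounded retries and linear backoff. The URL is a
# placeholder; a real scraper would also rate-limit and log each attempt.
fetch_with_retry() {
    url=$1
    for attempt in 1 2 3; do
        status=$(curl -s -o page.html -w '%{http_code}' "$url")
        [ "$status" = 200 ] && return 0
        should_retry "$status" || return 1
        sleep $((attempt * 5))
    done
    return 1
}
```

Even this toy version has to encode a policy decision (which failures are transient?) that no parsing library can make for you.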
In other words, extracting the right text from the page is by far the easiest and most trivial part, with little practical difference between using an admittedly cute jQuery-esque parsing library and just using a blunt regular expression.
It would be quixotic to simply retort that sites should provide "proper" APIs but I would love to see more attempts at solutions that go beyond the superficial.
I've finally found a real, practical use for my Raspberry Pi: it's an always-on machine that I can ssh to and use to wake up my home media server :)
It doubles as yet another syncthing client to add to my set.
And of course, it's running Arch ;)
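For the record, the "wake up" step is plain wake-on-LAN; a sketch of the sort of thing I run, where the hostname and MAC address are obviously placeholders:

```shell
# Send a wake-on-LAN magic packet from the always-on Pi to the media
# server's network card (MAC address is a placeholder). Requires the
# wakeonlan package on the Pi; etherwake works just as well.
ssh pi@raspberrypi wakeonlan aa:bb:cc:dd:ee:ff
```

The media server's BIOS and NIC both need wake-on-LAN enabled for this to do anything.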
A while ago, I started using keychain to manage my ssh and gpg agents. I did this with the following in my .bashrc:

```shell
# Start ssh-agent
eval $(keychain --quiet --eval id_rsa)
```
Unfortunately, newer versions of gpg-agent no longer set the GPG_AGENT_INFO environment variable, but tools like keychain don't know about that and still expect it to be set, leading to some annoying breakage.
My fix is a quick and dirty one; I appended the following to my .bashrc:

```shell
export GPG_AGENT_INFO=~/.gnupg/S.gpg-agent:$(pidof gpg-agent):1
```
Here's this year's TODO diff as compared with last year.

New Year's Resolutions
Read even more (fiction and non-fiction)
Write at least one short story
Write some more games
Go horse riding
Learn some more Turkish
Play a lot more guitar
I did this but want to do more!
Lose at least a stone (in weight, from myself)
I almost did this and then put it all back on again
Try to be less of a pedant (except when it's funny)
Try to be more funny ;)
I'm sure I've achieved these ;)
Receive a lot less email
In particular, unsubscribe from things I don't read
Particularly about technical subjects
Write more software
Release more software
Be a better husband and father
I think I'm doing alright but I'm sure I can do better
Improve or replace the engine I use for my blog
We used to think that happiness is based on succeeding at our goals, but it turns out not so much.
Most marathon runners, for example — not professionals but the amateur runners — their high for completing the marathon usually disappears even before their nipples stop bleeding.
My point is that the high lasts for a very short amount of time. If you track where people's happiness and satisfaction lies, it is in the "getting", in making progress towards their goals. That is a more satisfying, life-affirming, motivating and happy thing than actually reaching them.
So it's a great thing to keep in mind... Health, for example. When you define health as something you want to do, living your life and looking back on the week and saying "That was a healthy week for me: I worked out this number of times, I did this amount of push-ups, I ate reasonably most of the time..." that's very satisfying and that's where you'll feel happy about yourself. And if you're too focused on a scale for some reason, then that's an external thing that you'll hit... and then what?
So it is about creating goals that are longer lasting and really focusing on the journey because that's really where we get our happiness and our satisfaction. Once it's achieved... now what?
Was lent a 15-course baroque lute.
A lot of triathlon training but also got back into cooking.
Returned to the Cambridge Duathlon.
Raced 50 and 100 mile cycling time trials & visited the Stratford Olympic pool (pictured).
Paced my sister at the Downtow-Upflow Half-marathon. Also released the first version of the Strava Enhancement Suite.
Visited Cornwall for my cousin's wedding (pictured). Another month for sport including my first ultramarathon and my first sub-20 minute 5k.
Entered a London—Oxford—London cycling brevet, my longest single ride to date (269 km). Also visited the Tour of Britain and the Sri Chinmoy 24-hour endurance race.
London—Paris—London cycling tour (588 km).
Performed Handel's Messiah in Kettering.
Rather late but I guess that just confirms it’s really me, right? The signed text and IDs should be at http://mjr.towers.org.uk/transition-statement.txt
Thank you if you help me out here. I'll re-sign keys in a while.