Planet ALUG

Planet ALUG - http://planet.alug.org.uk/

Chris Lamb: Uploading a large number of files to S3

Fri, 09/01/2015 - 14:31

I recently had to upload a large number (~1 million) of files to Amazon S3.

My first attempts revolved around s3cmd (and subsequently s4cmd) but both projects seem to be based around analysing all the files first, rather than blindly uploading them. Not only does this require a large amount of memory, but non-trivial experimentation, fiddling and patching is also needed to avoid unnecessary stat(2) calls. I even tried a simple find | xargs -P 5 s3cmd put [..] but I just didn't trust it to handle errors correctly.
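A minimal sketch of how a find | xargs pipeline could at least record which uploads failed. This is an illustration only, not the approach that was finally used: upload_all and the failure log are hypothetical names, and the uploader argument stands for any command that takes a single file path and exits non-zero on failure (for example a small wrapper script around s3cmd put). The -print0, -0 and -P options are GNU/BSD extensions.

```shell
# Hypothetical helper: run "$1" (an uploader taking one file path) on every
# file below the current directory, five at a time, and append the path of
# each file whose upload failed to the log file "$2".
upload_all() {
    find . -type f -print0 |
        xargs -0 -n 1 -P 5 sh -c '"$1" "$3" || printf "%s\n" "$3" >> "$2"' sh "$1" "$2"
}
```

Something like upload_all /path/to/uploader /tmp/failed.log would then leave one failed path per line in the log, ready to be retried. The log should live outside the directory being uploaded.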

I finally alighted on s3-parallel-put, which worked out well. Here's a brief rundown on how to use it:

  1. First, change to your source directory. This ensures that the filenames created in your S3 bucket are not prefixed with the directory structure of your local filesystem: whilst s3-parallel-put has a --prefix option, it is ignored if you pass a fully-qualified source, i.e. one starting with a /.
  2. Run with --dry-run --limit=1 and check that the resulting filenames will be correct after all:
     $ export AWS_ACCESS_KEY_ID=FIXME
     $ export AWS_SECRET_ACCESS_KEY=FIXME
     $ /path/to/bin/s3-parallel-put \
         --bucket=my-bucket \
         --host=s3.amazonaws.com \
         --put=stupid \
         --insecure \
         --dry-run --limit=1 \
         .
     [..]
     INFO:s3-parallel-put[putter-21714]:./yadt/profile.Profile/image/circle/807.jpeg -> yadt/profile.Profile/image/circle/807.jpeg
     [..]
  3. Remove --dry-run --limit=1, and let it roll.
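
To see why running from inside the source directory matters, the mapping from local paths to bucket filenames can be previewed with plain shell before involving S3 at all. This is only a sketch: preview_keys is a hypothetical helper that imitates the "local -> remote" shape of the log line above by stripping the leading "./", and it makes no claim about how s3-parallel-put computes names internally.

```shell
# Hypothetical helper: print "local path -> key" for every file below the
# current directory, with the leading "./" removed from the key.
preview_keys() {
    find . -type f | sort | while IFS= read -r path; do
        printf '%s -> %s\n' "$path" "${path#./}"
    done
}
```

Run from the source directory, each key is relative to it. Run against an absolute path instead, the names would carry the local directory structure, which is what the first step is meant to avoid.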
Categories: LUG Community Blogs

Jonathan McDowell: Cup!

Thu, 08/01/2015 - 14:58

I got a belated Christmas present today. Thanks Jo + Simon!

Categories: LUG Community Blogs

MJ Ray: Social Network Wishlist

Thu, 08/01/2015 - 05:10

All I want for 2015 is a Free/Open Source Software social network which is:

  • easy to register on (no reCaptcha disability-discriminator or similar, a simple openID, activation emails that actually arrive);
  • has an email help address or online support or phone number or something other than the website which can be used if the registration system causes a problem;
  • can email when things happen that I might be interested in;
  • can email me summaries of what’s happened last week/month in case they don’t know what they’re interested in;
  • doesn’t email me too much (but this is rare);
  • interacts well with other websites (allows long-term members to post links, sends trackbacks or pingbacks to let the remote site know we’re talking about them, makes it easy for us to dent/tweet/link to the forum nicely, and so on);
  • isn’t full of spam (has limits on link-posting, moderators are contactable/accountable and so on, and the software gives them decent anti-spam tools);
  • lets me back up my data;
  • is friendly and welcoming and trolls are kept in check.

Is this too much to ask for? Does it exist already?

Categories: LUG Community Blogs

Chris Lamb: Web scraping: Let's move on

Wed, 07/01/2015 - 19:53

Every few days, someone publishes a new guide, tutorial, library or framework about web scraping, the practice of extracting information from websites where an API is either not provided or is otherwise incomplete.

However, I find these resources fundamentally deceptive — the arduous parts of "real world" scraping simply aren't in the parsing and extraction of data from the target page, the typical focus of these articles.

The difficulties are invariably in "post-processing": working around incomplete data on the page, handling errors gracefully and retrying in some (but not all) situations, keeping on top of layout/URL/data changes to the target site, not hitting your target site too often, logging into the target site if necessary and rotating credentials and IP addresses, respecting robots.txt, target site being utterly braindead, keeping users meaningfully informed of scraping progress if they are waiting on it, target site adding and removing data resulting in a null-leaning database schema, sane parallelisation in the presence of prioritisation of important requests, difficulties in monitoring a scraping system due to its implicitly non-deterministic nature, and general problems associated with long-running background processes in web stacks.

Et cetera.

In other words, extracting the right text on the page is by far the easiest and most trivial part, with little practical difference between an admittedly cute jQuery-esque parsing library and a blunt regular expression.

It would be quixotic to simply retort that sites should provide "proper" APIs but I would love to see more attempts at solutions that go beyond the superficial.
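
To make one of the difficulties listed above concrete, "retrying in some (but not all) situations" while "not hitting your target site too often" might start from something like the sketch below. Both function names and the specific policy (which status codes count as retryable, the 60-second cap) are assumptions chosen for illustration. A real scraper would have to tune them per target site.

```shell
# Hypothetical policy: succeed (exit 0) if a request that ended with HTTP
# status "$1" is worth retrying. "000" is what curl's %{http_code} reports
# when no HTTP response was received at all.
should_retry() {
    case "$1" in
        000|408|429|5??) return 0 ;;   # no response, timeout, rate limit, server error
        *)               return 1 ;;   # success, redirects and other client errors
    esac
}

# Hypothetical backoff: print the delay in seconds to wait before retry
# number "$1" (1-based), doubling each time and capped at 60 seconds.
backoff_delay() {
    delay=$(( 1 << ($1 - 1) ))
    if [ "$delay" -gt 60 ]; then
        delay=60
    fi
    printf '%s\n' "$delay"
}
```

A fetch loop would call should_retry on the status of each failed request and sleep for backoff_delay seconds before trying again. Even this covers only a sliver of the list, which is rather the point.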

Categories: LUG Community Blogs

Steve Engledow (stilvoid): Pinally

Wed, 07/01/2015 - 01:03

I've finally found a real, practical use for my Raspberry Pi: it's an always-on machine that I can ssh to and use to wake up my home media server :)

It doubles as yet another syncthing client to add to my set.

And of course, it's running Arch ;)

Categories: LUG Community Blogs

Steve Engledow (stilvoid): Keychain and GnuPG >= 2.1

Fri, 02/01/2015 - 16:48

A while ago, I started using keychain to manage my ssh and gpg agents. I did this with the following in my .bashrc

# Start ssh-agent
eval $(keychain --quiet --eval id_rsa)

Recently, Arch updated gpg to version 2.1.1 which, as per the announcement, no longer requires the GPG_AGENT_INFO environment variable.

Unfortunately, tools like keychain don't know about that and still expect it to be set, leading to some annoying breakage.

My fix is a quick and dirty one; I appended the following to .bashrc

export GPG_AGENT_INFO=~/.gnupg/S.gpg-agent:$(pidof gpg-agent):1

:)
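
A slightly more defensive variant of the same workaround, as a sketch: it builds the value in the same socket:pid:1 shape as the line above, but skips the export when pidof finds no gpg-agent and uses only the first PID if several are reported. gpg_agent_info is a hypothetical helper name, and the socket path is the same assumption as in the original one-liner.

```shell
# Hypothetical helper: print a GPG_AGENT_INFO value ("socket:pid:1") for the
# first PID in "$1", or fail without output if "$1" is empty.
gpg_agent_info() {
    [ -n "$1" ] || return 1
    printf '%s:%s:1\n' "$HOME/.gnupg/S.gpg-agent" "${1%% *}"
}

# In .bashrc: only export the variable when gpg-agent is actually running.
if info=$(gpg_agent_info "$(pidof gpg-agent 2>/dev/null)"); then
    export GPG_AGENT_INFO="$info"
fi
```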

Categories: LUG Community Blogs

Steve Engledow (stilvoid): TODO

Fri, 02/01/2015 - 13:58

Here's this year's TODO diff as compared with last year.

New Year's Resolutions
  • Read even more (fiction and non-fiction)

  • Write at least one short story

  • Write some more games

  • Go horse riding

  • Learn some more Turkish

  • Play a lot more guitar

    I did this but want to do more!

  • Lose at least a stone (in weight, from myself)

    I almost did this and then put it all back on again

  • Try to be less of a pedant (except when it's funny)

  • Try to be more funny ;)

    I'm sure I've achieved these ;)

  • Receive a lot less email

    In particular, unsubscribe from things I don't read

  • Blog more

    Particularly about technical subjects

  • Write more software

  • Release more software

  • Be a better husband and father

    I think I'm doing alright but I'm sure I can do better

  • Improve or replace the engine I use for my blog

Categories: LUG Community Blogs

Chris Lamb: Goals

Thu, 01/01/2015 - 22:36

Dr. Guy Winch:

We used to think that happiness is based on succeeding at our goals, but it turns out not so much.

Most marathon runners, for example — not professionals but the amateur runners — their high for completing the marathon usually disappears even before their nipples stop bleeding.

My point is that the high lasts for a very short amount of time. If you track where people's happiness and satisfaction is, it is in "getting" or in making progress towards our goals. It's a more satisfying, life-affirming, motivating and happy thing than actually reaching them.

So it's a great thing to keep in mind... Health, for example. When you define health as something you want to do, living your life and looking back on the week and saying "That was a healthy week for me: I worked out this number of times, I did this amount of push-ups, I ate reasonably most of the time..." that's very satisfying and that's where you'll feel happy about yourself. And if you're too focused on a scale for some reason, then that's an external thing that you'll hit... and then what?

So it is about creating goals that are longer lasting and really focusing on the journey because that's really where we get our happiness and our satisfaction. Once it's achieved... now what?

Categories: LUG Community Blogs

Chris Lamb: 2014: Selected highlights

Wed, 31/12/2014 - 18:23

Previously: 2012 & 2013.


January

Was lent a 15-course baroque lute.

February

Grandpa's funeral. In December he was posthumously awarded the Ushakov Medal (pictured) for his service in the Royal Navy's Arctic Convoys during the Second World War.

March

A lot of triathlon training but also got back into cooking.

April

Returned to the Cambridge Duathlon.

May

Raced 50 and 100 mile cycling time trials & visited the Stratford Olympic pool (pictured).

June

Ironman Austria.

July

Paced my sister at the Downtow-Upflow Half-marathon. Also released the first version of the Strava Enhancement Suite.

August

Visited Cornwall for my cousin's wedding (pictured). Another month for sport including my first ultramarathon and my first sub-20 minute 5k.

September

Entered a London–Oxford–London cycling brevet, my longest single ride to date (269 km). Also visited the Tour of Britain and the Sri Chinmoy 24-hour endurance race.

October

London—Paris—London cycling tour (588 km).

November

Performed Handel's Messiah in Kettering.

December

Left Thread.com.

Categories: LUG Community Blogs