Planet ALUG

Planet ALUG - http://planet.alug.org.uk/

Chris Lamb: Uploading a large number of files to S3

Fri, 09/01/2015 - 14:31

I recently had to upload a large number (~1 million) of files to Amazon S3.

My first attempts revolved around s3cmd (and subsequently s4cmd), but both projects seem to be based around analysing all the files first rather than blindly uploading them. Not only does this require a large amount of memory; non-trivial experimentation, fiddling and patching are also needed to avoid unnecessary stat(2) calls. I even tried a simple find | xargs -P 5 s3cmd put [..], but I just didn't trust it to handle errors correctly.

I finally alighted on s3-parallel-put, which worked out well. Here's a brief rundown on how to use it:

  1. First, change to your source directory. This is to ensure that the filenames created in your S3 bucket are not prefixed with the directory structure of your local filesystem — whilst s3-parallel-put has a --prefix option, it is ignored if you pass a fully-qualified source, i.e. one starting with a /.
  2. Run with --dry-run --limit=1 and check that the resulting filenames will be correct after all:
  $ export AWS_ACCESS_KEY_ID=FIXME
  $ export AWS_SECRET_ACCESS_KEY=FIXME
  $ /path/to/bin/s3-parallel-put \
        --bucket=my-bucket \
        --host=s3.amazonaws.com \
        --put=stupid \
        --insecure \
        --dry-run --limit=1 \
        .
  [..]
  INFO:s3-parallel-put[putter-21714]:./yadt/profile.Profile/image/circle/807.jpeg -> yadt/profile.Profile/image/circle/807.jpeg
  [..]
  3. Remove --dry-run --limit=1 and let it roll; the full invocation is shown below.
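
That is, the same command as the dry run, just minus the test flags (the bucket, host and path are still the placeholders from above):

  $ /path/to/bin/s3-parallel-put \
        --bucket=my-bucket \
        --host=s3.amazonaws.com \
        --put=stupid \
        --insecure \
        .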
Categories: LUG Community Blogs

Jonathan McDowell: Cup!

Thu, 08/01/2015 - 14:58

I got a belated Christmas present today. Thanks Jo + Simon!

Categories: LUG Community Blogs

MJ Ray: Social Network Wishlist

Thu, 08/01/2015 - 05:10

All I want for 2015 is a Free/Open Source Software social network which:

  • is easy to register on (no reCAPTCHA disability-discriminator or similar, a simple OpenID, activation emails that actually arrive);
  • has an email help address or online support or phone number or something other than the website which can be used if the registration system causes a problem;
  • can email when things happen that I might be interested in;
  • can email me summaries of what’s happened last week/month, in case it doesn’t know what I’m interested in;
  • doesn’t email me too much (but this is rare);
  • interacts well with other websites (allows long-term members to post links, sends trackbacks or pingbacks to let the remote site know we’re talking about them, makes it easy for us to dent/tweet/link to the forum nicely, and so on);
  • isn’t full of spam (has limits on link-posting, moderators are contactable/accountable and so on, and the software gives them decent anti-spam tools);
  • lets me back up my data;
  • is friendly and welcoming, and keeps trolls in check.

Is this too much to ask for? Does it exist already?

Categories: LUG Community Blogs

Chris Lamb: Web scraping: Let's move on

Wed, 07/01/2015 - 19:53

Every few days, someone publishes a new guide, tutorial, library or framework about web scraping, the practice of extracting information from websites where an API is either not provided or is otherwise incomplete.

However, I find these resources fundamentally deceptive — the arduous parts of "real world" scraping simply aren't in the parsing and extraction of data from the target page, the typical focus of these articles.

The difficulties are invariably in the "post-processing":

  • working around incomplete data on the page;
  • handling errors gracefully and retrying in some (but not all) situations;
  • keeping on top of layout/URL/data changes to the target site;
  • not hitting your target site too often;
  • logging into the target site if necessary, and rotating credentials and IP addresses;
  • respecting robots.txt;
  • coping with a target site that is utterly braindead;
  • keeping users meaningfully informed of scraping progress if they are waiting on it;
  • handling the target site adding and removing data, resulting in a null-leaning database schema;
  • sane parallelisation in the presence of prioritisation of important requests;
  • difficulties in monitoring a scraping system due to its implicitly non-deterministic nature;
  • and the general problems associated with long-running background processes in web stacks.

Et cetera.
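
To make just two of these concrete, here is a minimal bash sketch of polite rate-limiting plus retrying with exponential back-off, using curl against a made-up URL (the timings, retry count and output file are equally arbitrary):

  #!/bin/bash
  url="https://example.com/some/page"   # hypothetical target

  for attempt in 1 2 3 4 5; do
      # --fail makes curl exit non-zero on HTTP errors, so failures fall through to a retry
      if curl --fail --silent --show-error --max-time 30 -o page.html "$url"; then
          break
      fi
      sleep $((2 ** attempt))           # back off: 2s, 4s, 8s, 16s, 32s
  done

  sleep 2                               # rate-limit before touching the same site again

Even this toy version is already longer than the "extraction" it exists to support, and it still ignores robots.txt, logging, credential rotation and everything else above.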

In other words, extracting the right text on the page is by far the easiest and most trivial part, with little practical difference between an admittedly cute jQuery-esque parsing library and a blunt regular expression.

It would be quixotic to simply retort that sites should provide "proper" APIs but I would love to see more attempts at solutions that go beyond the superficial.

Categories: LUG Community Blogs

Steve Engledow (stilvoid): Pinally

Wed, 07/01/2015 - 01:03

I've finally found a real, practical use for my Raspberry Pi: it's an always-on machine that I can ssh to and use to wake up my home media server :)
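
For the curious, the wake-up part is typically just Wake-on-LAN; a minimal sketch, assuming WoL is enabled on the media server's NIC, the wakeonlan package is installed on the Pi, and using a made-up MAC address:

  $ ssh pi                         # "pi" is just an example host alias for the Raspberry Pi
  $ wakeonlan 00:11:22:33:44:55    # placeholder MAC of the media server's network card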

It doubles as yet another syncthing client to add to my set.

And of course, it's running Arch ;)

Categories: LUG Community Blogs