
06/10/06

The Master Plan: follow up (& Netflix)

I’m posting an edited transcript of an email correspondence I had with my friend Alex after he read my post yesterday about Abe and LibraryThing. It may be a little bit heady and technical, but I think it’s extremely interesting if you like the way things are recommended to people. And I think Tim from LibraryThing, who also commented on the post, will have something to say about it.

As a disclaimer, we are kind of talking at cross purposes: Alex is not a LT user, and his reference is to the (heavily-blogged) Netflix prize to improve their recommendation engine. So he’s coming at it all from a very conceptual level. Anyway.

Peter,
just read your blog about Librarything. You might be interested in this story that was on slashdot recently:-

Will be interesting to see what comes out of it and whether or not it is
applicable to different domains (ie books rather than movies)

A

Thanks for this – you should comment on the blog as well as emailing.

I saw this post earlier this week – it’s been quite heavily blogged and although I haven’t read all the /. stuff (I don’t eat that feed), the summary that I’ve read seems, quite fairly, to be “if they’d opened it up as an API before, then they could have got this for free and saved a million” and then some. But at the same time it’s great publicity for them. And I didn’t really feel I had much to add to that – it’s true.

thanks for your interest though – keep em coming!

P

I disagree – a public API is a relatively expensive thing in terms of network overhead in invoking calls. ie if I want to invoke foo() via an XML / RPC /
AJAX whatever call then the amount of time that it takes to run foo() on the
central server is microscopic compared with the time that it takes to deliver
the request for foo() to the server and send the results of foo() back to the
client.

Therefore you would tend to use an API for something which has a relatively
low number of calls for a given task thereby making it run in a feasible
time.

Now, designing a recommendation schema is the kind of thing which will involve
millions of calls during the development process while you decide what sort
of questions are useful ones to ask. It is this development process that
netflix are inviting people to do, and providing the dataset to them (the
public developers) enables them to run these test queries against a local
dataset (and in fact to choose how the local dataset is going to be stored /
cross-referenced). Once you have a set of queries and a backend
implementation to run them against that really works, then you can export
that as a public API, but to make an API beforehand makes no sense at all.
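
To put rough numbers on the overhead argument above, here is a back-of-envelope sketch in Python. The three figures (a 50 ms round trip per remote call, 10 microseconds to run foo() against local data, five million calls during development) are illustrative assumptions, not measurements of Netflix or of any real API.

```python
# Back-of-envelope comparison of remote API calls vs. local queries.
# All three figures below are assumptions chosen for illustration only.
ROUND_TRIP_SECONDS = 0.050      # assumed network round trip per remote call
LOCAL_CALL_SECONDS = 0.000010   # assumed cost of running foo() on local data
CALLS = 5_000_000               # assumed number of calls during development


def total_hours(calls: int, seconds_per_call: float) -> float:
    """Total wall-clock time in hours for calls made one after another."""
    return calls * seconds_per_call / 3600


# A remote call pays the round trip on top of the work itself.
remote_hours = total_hours(CALLS, ROUND_TRIP_SECONDS + LOCAL_CALL_SECONDS)
local_hours = total_hours(CALLS, LOCAL_CALL_SECONDS)

print(f"remote: {remote_hours:.1f} hours")
print(f"local:  {local_hours * 60:.1f} minutes")
```

Under these assumed figures the remote version comes to roughly 69 hours against under a minute locally, which is the gap that makes a downloadable dataset more practical than an API for this kind of exploratory work.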

A

i might just copy and paste that on a few of the blogs so i sound very very smart. you’re ok with that, right? ;)

Had you checked out LT?

P

Post away – of course providing you don’t complain if someone shoots you down in
flames… ;-)

I’ve had a quick scout around LT but not enough to get a concrete idea. Does
sound interesting though. The recommendations are coming from 3rd party
APIs, not from his own code though?

A

it’s social. users add books to their collection (via an API to amazon, etc). LT compares collections and draws comparisons. genius.

P

P

it’s not that simple – if it was then netflix wouldn’t be offering 10^6 for
people to solve it.

you have the ability to rate books so you can use this to skew the
distributions between how much people like things and perform better matching
than just who shares the same collection.

however, people don’t rate things in similar ways. Someone who has 100 books
all of which are rated at 5 stars doesn’t make the recommendation as strong
for a particular book as someone who has an average weighting of 3 stars for
another 100 books and has labelled one book at 5 stars.

So you have to perform various normalisation things on people’s data-sets to
try and compensate for the different ways that people interpret their
collections.
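
One simple form of the normalisation described above is to subtract each user's own average rating, so that a score is read relative to that user's habits. This is only a minimal sketch with made-up users and ratings. It is one common technique, not anything LibraryThing or Netflix is known to use.

```python
from statistics import mean


def centre_ratings(ratings: dict[str, float]) -> dict[str, float]:
    """Subtract the user's own average so scores are relative to their habits."""
    if not ratings:
        return {}
    average = mean(ratings.values())
    return {book: score - average for book, score in ratings.items()}


# Hypothetical users: one rates everything 5, the other averages 3
# and has given a single book 5 stars.
generous = {"book_a": 5, "book_b": 5, "book_c": 5}
picky = {"book_a": 5, "book_b": 3, "book_c": 2, "book_d": 2}

print(centre_ratings(generous))  # every book ends up at 0
print(centre_ratings(picky))     # book_a stands out at +2
```

After centring, the five stars from the user who gives everything five stars carry no signal, while the single five-star book from the pickier user stands two points above their norm.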

The trouble is that this only really works across large data-sets, and the set
logic on this kind of thing gets exponentially more complex as you start to
perform more bespoke queries.

So, while it may be possible to perform relatively complex analysis given
enough time, being able to do it “on-demand” across a large data-set (library
thing has only 87,221 members at present and some of the queries can take up
to a minute to process) is non-trivial.

It probably comes down to finding a nice search that works well and then
refactoring it into something that will slot into google’s map/reduce
algorithm (ie will scale well across a parallel architecture rather than
demanding a single high capacity processor for long periods of time).
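
As a rough illustration of what "slotting into map/reduce" could look like, the sketch below counts how often pairs of books appear in the same collection. The map step handles one user at a time and the reduce step sums the results. The users, titles and function names are all hypothetical, and this runs in a single process. It only shows the shape of the computation, not a real distributed job.

```python
from collections import Counter
from itertools import combinations

# Hypothetical libraries: user name -> set of book titles.
libraries = {
    "alice": {"Dune", "Emma", "Ulysses"},
    "bob": {"Dune", "Emma"},
    "carol": {"Emma", "Ulysses"},
}


def map_user(books: set[str]) -> Counter:
    """Map step: emit a count of 1 for every pair of books one user owns."""
    return Counter(combinations(sorted(books), 2))


def reduce_counts(partials: list[Counter]) -> Counter:
    """Reduce step: sum the per-user pair counts into one global table."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


# Each map_user call is independent, so in principle each could run
# on a different machine before the results are merged.
pair_counts = reduce_counts([map_user(books) for books in libraries.values()])
print(pair_counts.most_common(3))
```

Because no map call needs to see another user's data, the work splits across as many machines as there are users, which is the property the paragraph above is after.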

It’s not that librarything isn’t very interesting and very good, but they
definitely haven’t solved what netflix want to solve. What they are good at
is persuading people to provide content (which is solved by the netflix
rental history).

I’ve been doing loads of work / research on the high-end scalability issues
and it’s a real fucker, especially as the data-set grows.

One thing that just occurred to me is that trust metrics are quite interesting
for this kind of thing – ie some people’s recommendations might be more
trusted just because of who they are and their past behaviour – kind of guru
status. The first people I saw looking at this were
advogato, although slashdot have done very
well in terms of their moderation and meta-moderation (although this is a
slightly different problem).

An ideal system would let people rank the recommendations that it gives them
and try and mutate the recommendation algorithm until it suits what one
person wants it to be: ie the recommendations that I think are best based on
books A,B & C might be completely different from what you think are best
based on exactly the same books. Neither set is wrong in itself, only
wrong given the situation. Urg. All very non-trivial (but quite a lot of
fun)

A

i’m so posting this thread. seriously.

OK, LT and NF are separate issues. I don’t think an LT type approach would solve NF at all, I just like it as a book web site.

But yes, it uses recommendations – programmatic ones at that. The recommendations are based on your collection – ratings don’t seem to come into it (afaik). You should check it out with 10 of your favourite books and see what it comes back with.

The thing is, it’s not supposed to do this – it’s supposed to be a bibliographic service for collecting books; recommendation is a by-product. Although we’ll see how it develops.

My concern when Abe bought into LT was that data from booksellers would ‘pollute’ the integrity of users’ collections (user collections being based on taste, bookshops being based on the market).

p

It is the connections that LT draws that raise it from being just a lot
of information to being a lot of relevant information (to you). Infinitely more valuable and interesting.

I’ll populate it a bit on my return to UK – juggling a bit at the moment.

It does do a little bit of semi-clever analysis like so:-

- go to user profile: eg

- RHS has “users with sylphette’s books”. At the base level this is people
who just share common books with sylphette and is relatively easy to build
and hence fast. However, if I have 2000 books and you have 100 books and it
happens that I have all 100 of your books then it doesn’t mean that you will
like all of my 2000 books.
meburste (494),
debweiss (474),
ellenandjim (468),
eromsted (442),
ginaruiz (374),
chanale (371),
carminowe (367),
vernonlee (357),
obsessedbybooks (357),
lennonj (357),

- RHS has option for “weighted” listing for Users listing. You click this and
it says “This information is loading. Loading may take as much as a minute
if it hasn’t been updated recently” and then it returns a new list. During
this time period presumably their server is churning and the load is on
maximum so the more people that are doing it at the same time, the longer it
takes. It then gives you:-
(Weighted by book obscurity and library size)
famousgoodbyeking (120/401),
chanale (421/4448),
paulvm (57/323),
cinaedus (102/1082),
wellred2 (275/2699),
popa (345/5727),
apeejam (50/99),
vernonlee (366/3171),
gregsanchez (159/1224),
dustinfr (192/503),
meburste (539/6000)
Interestingly meburste has now dropped from 1 to 10. So it turns out that
meburste just has a large library but actually they don’t overlap all that
much. paulvm wasn’t even in the top 10 on the basic listing (or in fact in
the listing at all) because he had a small library and therefore couldn’t
overlap that much.

So what does this tell us? It tells us that people with big libraries are
considerably less useful than people who only buy (or upload) books that they
really love. ie if everyone puts up their top 100 books then the
overlaps would be much more important because it would automatically discard
background noise (ie books that you just have but which aren’t super relevant
to you).
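
LT doesn’t say how its “weighted by book obscurity and library size” listing is calculated, so the following is only a guess at the general idea, not their formula. It scores each shared book by how rare it is (log of total users over owners, an assumption borrowed from text retrieval) and divides by the square root of the other library’s size (also an assumption). All the data is made up.

```python
import math


def weighted_overlap(mine: set[str], theirs: set[str],
                     owner_counts: dict[str, int], total_users: int) -> float:
    """Score shared books by rarity, then damp by the other library's size."""
    shared = mine & theirs
    if not shared:
        return 0.0
    # Rare books (few owners) count for more than books everybody has.
    rarity = sum(math.log(total_users / owner_counts[book]) for book in shared)
    # Dividing by sqrt(size) stops huge libraries winning on volume alone.
    return rarity / math.sqrt(len(theirs))


# Hypothetical data: 1000 users; books "a" and "b" are obscure, "c" is common.
total_users = 1000
owner_counts = {"a": 10, "b": 10, "c": 500}
mine = {"a", "b", "c"}

# A 400-book library that happens to contain all three of my books...
big = {"a", "b", "c"} | {f"filler_{i}" for i in range(397)}
# ...and a 3-book library that shares only the two obscure ones.
small = {"a", "b", "x"}

print("raw overlap:", len(mine & big), "vs", len(mine & small))
print("weighted:   ", weighted_overlap(mine, big, owner_counts, total_users),
      "vs", weighted_overlap(mine, small, owner_counts, total_users))
```

On raw overlap the big library wins, but once rarity and size are taken into account the small library comes out ahead. That is the same kind of reshuffle as meburste dropping down the list and paulvm appearing in it.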

This would be solved by the rating system as it would then give you a
weighting that would automatically look at how much you *care* about the
books rather than just the fact that you have uploaded it. Cross-referencing
this with tags makes it even more interesting. But *very* *very*
computationally expensive.

definitely – most interesting.

This is a field that I’ve always been interested in – I spent some time
working on automatic classification of text documents based around word
distributions and statistical analysis with learning systems so I’ve thought
about this quite a lot. I also have a very nice book of papers on Automatic
Text Summarisation Systems that has a lot of overlap with some of these
problems (ie if you can work out what something is about automatically and
consistently then you can get a consistent classification thereby making it
easier to compare – tagging is always going to be personal and therefore
means a different thing to each user, much like the rating systems mentioned
above).

WRT the ‘pollution’ – that can then be introduced as a weighting into the algorithm as well – ie the source of the data becomes important as well as everything else.

Remember that for some tasks commercial data is more important than user data and vice versa. Again, another dimension of complexity to the argument…

Posted by Peter Collingridge in Future of the book, Librarything, Publishing.
