
Support screenshots with Monosnap and S3

I remember the first day that I discovered Skitch. Skitch let you take screenshots of an application or region on your screen and quickly post those online with a handy link that you could show to others. You could draw arrows or other simple annotations before flipping the switch to send the image to the interwebs.

Since Directed Edge moved from being an API- and developer-only company into also supporting e-commerce platforms, taking screenshots to help out our customers is a regular part of our customer support process. Sometimes a screenshot is worth a thousand (well, or at least several dozen) words.

Skitch was eventually bought by Evernote, and as an Evernote user I appreciate the Skitch integration. But the interface has gradually gotten clunkier for the good old taking of support screenshots, and since we’ve come to lean heavily on them, I’d been on the lookout for a more elegant solution.

Hello, Monosnap.

I then stumbled across Monosnap, which has free integration with Amazon’s S3. I thought I’d give a quick rundown of how we’ve moved from Skitch to using Monosnap + S3 to serve up screenshots at show.directededge.com.

Monosnap itself is pretty straightforward. You take screenshots, annotate them if need be, and it uploads them somewhere. It hearkens back to the Skitch of old. However, instead of only uploading images to its own server, it can also upload them to various cloud services.

Monosnap Config - Account

We were primarily interested in S3, since we’re using it for image storage in another recommendations product that we have coming down the pipeline.

So, first we created a new S3 bucket.  We typically just name our buckets after the site that they’ll be backing:

Monosnap s3 Bucket

The rest of the default bucket settings are fine.

The “show”.

Now, with S3 you can typically access your buckets at a URL like:

https://s3.amazonaws.com/show.directededge.com

However, you can also rearrange that URL to a DNS CNAME friendly format:

http://show.directededge.com.s3.amazonaws.com/

So now we need to head over to our DNS provider (Rackspace in our case) and enter an appropriate CNAME record to get those lovely, Directed Edge-looking URLs.

s3 CNAME

As is usual with CNAME records, we simply had to enter a record pointing from:

show.directededge.com

To:

show.directededge.com.s3.amazonaws.com.

(Note the period on the end, standard with CNAME records.)
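
Once the record has propagated, you can sanity-check it from Ruby’s standard library (hostname as above):

require 'resolv'

cname = Resolv::DNS.open do |dns|
  dns.getresource('show.directededge.com', Resolv::DNS::Resource::IN::CNAME).name.to_s
end

puts cname   # expected: show.directededge.com.s3.amazonaws.com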

With that the DNS and storage bits are done.

Configuring Monosnap

From there we just needed to add a couple of values to Monosnap.  As seen in the configuration interface above, we had to enter our AWS access key and secret.  Unfortunately, a dedicated user created with Amazon’s Identity and Access Management (IAM) didn’t seem to get along well with Monosnap, so we were forced to use our master credentials.  Were we using S3 for more than image storage, this would probably be somewhat troubling.

However, once we entered our master credentials, we were able to select a bucket and begin uploading support screenshots to our new, pretty URL.  For example:

Directed Edge for Shopify
http://show.directededge.com/Directed_Edge_for_Shopify_20130807_012624.png
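
And if you ever want to script the same sort of upload without Monosnap, the S3 side of it is tiny. Here’s a minimal sketch with the aws-sdk-s3 gem (the region and file name are placeholders, and credentials are picked up from the environment):

require 'aws-sdk-s3'

s3  = Aws::S3::Resource.new(:region => 'us-east-1')   # region assumed
key = 'Example_Screenshot.png'                         # hypothetical file name
obj = s3.bucket('show.directededge.com').object(key)

obj.upload_file(key, :acl => 'public-read', :content_type => 'image/png')
puts "http://show.directededge.com/#{key}"             # served via the CNAME above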

Terror in recommendations-ville

UPDATE: Now, 10 months later, we’ve not had a single additional instance of data corruption, which is a spot we’re glad to be back in.


Every startup seems to have one of those technical meltdown stories. Up until today, our worst had been back in 2009 when we were still in beta, and it was more embarrassing than critical. (Lesson learned then: never deploy just before sleeping.)

Since then we’ve done pretty well. On average we have about 4-8 hours of downtime per year. So today was especially nasty.

The first sign of trouble

Around 11:01 p.m. GMT I get a mail about an unexpected reboot. One of our other guys had caught an issue and was looking into it, but I hadn’t noticed the SMS he’d sent me saying such was coming up.

We hop on Skype and start going through what’s going on. Our web services process was alternating between crashing and hanging every few minutes. Usually that means some corrupted indexes in one of our databases. We get that once or twice a year. It’s an annoyance (and let me tell you about the fun of trying to track down a bug that happens twice a year), but we have a script that can rebuild the indexes, depending on the size of the database, in somewhere between 10 seconds and 10 minutes.

A Heisenbug

The trick is figuring out which database is causing the problem. Aside from our enterprise customers, which are usually on their own hardware, we run a bunch of our small and medium sized databases off of the same hardware. Our web services setup is heavily multithreaded, so there are usually some 40-ish threads that are writing to the logs in parallel; figuring out which request was the one triggering the problem can sometimes be a bit tricky.

But we’ve been there before too. We have a switch we can use to flip on function tracing, which usually makes it clear pretty quickly — or at least has in every instance up until now.

This time it didn’t. We kept watching our logs, trying to catch the offending item and guessing the wrong one. We have a script to pull down databases from the production to our development machines, where we check the indexes for corruption. For about an hour we kept missing.

We hadn’t been offline for a full hour in more than a year, so we were getting antsy at that point. We’d thought about this situation before too though — what if we couldn’t find the offending database?

A not-so-super super-tool

The last time we had a similar issue (which was sorted out in under 15 minutes), I wrote a small program to go through and check every single database on a system. It was, as we’ll soon see, not so aptly named “IntegrityChecker”. We’d never run it in production, but it had been tested on dev machines. Basically it would disable an account for a few seconds, try to get a file lock on its files, and then run the integrity checker that we usually use offline to test such things.

Except that for some reason it was given a file lock when the OS should have rejected it. So now we have mmap’ed files being written to from two processes. Not good. In fact, very, very bad.

Unfortunately, we didn’t notice this immediately. In fact, things looked great — it identified the database with a corrupt index, we rebuilt it, waited 15 minutes to make sure everything was kosher, and headed off to snooze. It was about 4:00 a.m. local time.

This isn’t supposed to happen

We have two redundant monitoring systems that live in a separate network and check all of our servers once per minute. Both of them send out a barrage of SMSes if the servers stop responding to requests. But most of the API requests that were coming in were actually being handled — probably 80% or so — and our monitoring systems retry a second time if there’s a failure just so that we don’t get notified every time there’s a minor network glitch somewhere. So we weren’t notified that our servers were still sickly.

The score

Now, if you’re keeping score, here are the points so far of this perfect storm:

  • Relatively minor issue of database index corruption
  • IntegrityChecker that’s actually corrupting databases (and not just their indexes)
  • Monitoring not catching the problems

Digging deeper

9:30 a.m. rolls around and we’re up and trying to see if there’s anything we can piece together about the issues from the night before. Pretty soon one of the other guys that’s working with me on the problem notices that the server process actually kept crashing all night, albeit at a somewhat reduced frequency (every 10 minutes or so).

Since the IntegrityChecker had worked so well the night before (or so we still thought), I gave it another go. This time it turned up six corrupt databases. We’d never seen that. I rebuilt them and, in perfect-storm fashion, ran it again. Two more. Something was obviously very wrong.

A little while later the database process is back to restarting every minute. Panic ensues.

Our monitoring turns on itself

See, we not only have out-of-band monitoring, we also have monit running to restart things on our machines if something goes wrong.

We didn’t notice for about an hour that the errors in the monit log weren’t the usual ones. They were 404 — “not found” — errors. That was weird. It turned out the account that we use for status checking was corrupt. We fixed that and were back down to crashes every 10 minutes or so, but still had no idea what could possibly be causing the problems.

A couple of heavily caffeinated hours of shooting in the dark passed. Eventually we started to suspect the IntegrityChecker. And we noticed that a few accounts (about a dozen — the ones that had the misfortune of writing while the IntegrityChecker was checking them) had been effectively nulled out. We started getting a sense of how widespread the damage was.

The low-down

There was a lot that came together to make this as bad as it was. To recap, we started off with a relatively minor issue, which was made into a major issue by a tool that was built specifically for dealing with the exact situation we were facing, and those large issues were not caught by our generally quite good monitoring, which eventually even became part of the problem itself.

The last resort: the backups

Fortunately we have backups, so we started going through accounts manually, offline this time, checking almost every database to try to find corruption. We found a few more.

At that point we started restoring the broken databases to their last known good state — about 17 hours before all of the mess started. We’ve done that now and things look pretty stable. It’s now been 90 minutes since we’ve had a crash or hang, and all accounts are back online. In some cases we’re trying to merge in data that came in later in the day where we can get to it, but that’s a bit spotty.

An apology

This is the first time in three and a half years of being in production that we lost data and we feel terrible about it. We’ll be doing some work on our internal systems to help us monitor, diagnose and fix these issues in the future. Thanks so much for your patience.

Edit: One additional note: if you’re one of our customers that uses our Shopify app and your data seems to be out of date, it’ll automatically be updated the next time that we pull your shop’s data in with our importer. There was no data loss for Shopify customers.

Bindings update that removes extra server round-trips

So, we noticed recently that our Java and Python bindings were doing an extra round-trip to our servers for every request. We’ve just updated each of them to remove this.

In the usual HTTP Basic authentication flow, which our API uses, the client first sends a request without credentials; when it receives a 401 error, it retries with the credentials supplied, and many clients repeat that dance for each and every request.

While that makes sense for interactive use from a browser, for a web services API it makes much more sense to avoid the extra round-trip and send the credentials with the first request, significantly speeding up request times.
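
For the curious, the change amounts to setting the Authorization header preemptively rather than waiting for the 401. A minimal sketch of the idea in plain Ruby with Net::HTTP (the URL and credentials here are placeholders):

require 'net/http'
require 'uri'

uri = URI('https://example.com/api/v1/items/item_1')    # placeholder URL

request = Net::HTTP::Get.new(uri)
request.basic_auth('database_name', 'password')          # credentials go out with the first request

response = Net::HTTP.start(uri.host, uri.port, :use_ssl => uri.scheme == 'https') do |http|
  http.request(request)
end

puts response.code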

You can get the updated versions of our bindings on Github.

Shopify app updates, new Ruby bindings beta, new web services features coming down the pipe

Shopify app updates

So, it’s been a gazillion years since we posted updates here, but there have been a number of things shaking out of the woodwork.

First, we just did the biggest update to our Shopify app since we first launched it a couple years back.  It features a new Bootstrap-ified configuration interface and a whole bunch of new recommendations types.

New Ruby Bindings in Beta

Whilst working on our Shopify app, which itself is a Rails app, we got frustrated with the current state of our Ruby bindings.

You see, our Ruby bindings were written back in the dark days before we actually used Ruby at all inside of the company.  Truth be told, the original version was written by yours truly a couple hours after I started learning Ruby.

These days, along with Java and C++ (which our lower-level, performance-critical bits are written in), we write quite a bit of Ruby and, as such, our tastes have become more refined.

Here are a couple of quick examples:

Old:

require 'rubygems'
require 'directed_edge'
 
DE_USER = ENV['DIRECTEDEDGE_TEST_DB']
DE_PASS = ENV['DIRECTEDEDGE_TEST_PASS']
 
# Import some data
 
DE_FILE = 'temp.xml'
 
exporter = DirectedEdge::Exporter.new(DE_FILE)
 
(1..10).each do |i|
  item = DirectedEdge::Item.new(exporter.database, "item_#{i}")
  item.add_tag('item')
  item.link_to("item_#{rand(10) + 1}")
  exporter.export(item)
end
 
exporter.finish
 
database = DirectedEdge::Database.new(DE_USER, DE_PASS)
database.import(DE_FILE)
 
# Get recommendations
 
item = DirectedEdge::Item.new(database, 'item_1')
item.related([ 'item' ]).each { |i| puts i }
 
# Update item
 
item.add_tag('foo')
item.save

New:

require 'rubygems'
require 'directededge'
 
DE_USER = ENV['DIRECTEDEDGE_TEST_DB']
DE_PASS = ENV['DIRECTEDEDGE_TEST_PASS']
 
database = DirectedEdge::Database.new(DE_USER, DE_PASS)
 
DirectedEdge::UpdateJob.run(database, :replace) do |update|
  (1..10).each do |i|
    update.item("item_#{i}") do |item|
      item.tags.add('item')
      item.links.add("item_#{rand(10) + 1}")
    end
  end
end
 
# Get recommendations
 
item = DirectedEdge::Item.new(database, 'item_1')
item.related(:tags => :item).each { |i| puts i }
 
# Update item
 
item.tags.add('foo')
item.save

The changes are most obvious in the importer above, but there are a lot of little subtle improvements in the API’s semantics.

If you’re up for beta testing the new bindings, drop us a line.  The documentation is sorely lacking at present (ahem, meaning non-existent), but that’s the main thing that we know of that’s missing at the moment:  the new bindings are already in use by our Shopify app and the code is up on Github.

Web services updates on the way

In a similar vein, we’ve got some stuff that we added to the web services for the Shopify app that still has some rough corners we want to smooth out before pushing it out to the unwashed masses, but we’ll soon have explicit API support for tracking product view histories and showing recommendations based on them.  It’s been possible to do that in a hacky way in the past, but the new stuff adds some special math-fu for time-clustered recommendations and for storing high-traffic queues of updates.  More on that soon.

The easiest way to add recommendations to your Rails app, announcing acts_as_edgy

So, there are two things I want to talk about. The first is how our new beta hotness, acts_as_edgy, just made it super easy to add Directed Edge recommendations to Rails 2 apps. How easy? One line easy.

Recommendations the Easy Way

Let’s suppose for example that you’ve got a model called User and another called Product. And let’s also suppose that you support “likes” in your app, so you’ve got a model called Like too that just maps from a user ID to a product ID. Now, you want to show people personalized recommendations when they log in and related products when they’re viewing product pages. Nothing too exotic there.

So, here’s what you add to your app in that case:

class User < ActiveRecord::Base
  acts_as_edgy(:like, Like, Product)
  ...
end

The acts_as_edgy line is the whole shebang. That’s it. After that’s in there you’ve got rake tasks that let you do rake edgy:export and it’ll push your data over to our servers. You can also have it automatically update your data with our web service on saves to your model. Did I mention it’s easy?

Okay, okay. So, yeah, I skipped the part about where you actually do something with the recommendations. But that’s not rocket surgery either. Now that you’ve got that in your model you can do:

User.first.edgy_recommended # returns a list of recommended products based on likes
Product.first.edgy_related  # returns a list of related products based on likes

You can do much more complicated stuff too. Our black magic figures out the route between the models in the list you provide. One of the apps that we’ve been testing with is Spree. Let’s say that we want to do product recommendations based on sales. In Spree that means that we have to map a User to a Product, but there are a bunch of intermediaries: Orders belong to a User, LineItems belong to an Order, LineItems also belong to a Variant and Variants belong to a Product. Whew. What’s that look like in acts_as_edgy nomenclature?

class User < ActiveRecord::Base
  acts_as_edgy(:purchase, Order, LineItem, Variant, Product)
  # ...
end

You just list the models that it has to pass through on its way to its target and it figures out how to get there (or, well, tries to; this assumes you’ve used nice, consistent column names, which you of course did, didn’t you?).

And then, once again, we can just query for related and recommended stuff like it’s nobody’s business.

You also have access to the regular stuff that our API supports. So if you had both likes and purchases in the same application, that’s where that handle sitting up there right at the front of the acts_as_edgy line comes in handy. You can choose how you want to weight those, e.g.:

Product.first.edgy_related(:like_weight => 0.1, :purchase_weight => 0.9)

And there you’ve got recommendations based on some mix and match weights that seem appropriate to you.

I wanted to get the cool stuff out first, but there are naturally a couple of set up steps that have to happen beforehand:

THIS IS THE PART YOU ACTUALLY HAVE TO READ

  1. Sign up for a Directed Edge account
  2. Install the directed-edge and will_paginate gems
  3. Install the plugin
  4. Run rake edgy:configure to enter your login info

There’s more on the nuts and bolts of that over on the Github page, and just let us know if you get stuck. But here’s a real world example. Wait. I’ve just realized I lied to you. There are three things I want to tell you about.

We have a Spree plugin.

Yes, yes, we do. Hell, we even have a full tutorial on working with Directed Edge and Spree! It only works with Spree 0.11 at present since we haven’t ported this baby to Rails 3 yet. Mostly we needed something real to test this thing on, and Spree seemed like a nice way to test with data models that weren’t tied to our own assumptions.

You can get at the full plugin on Github.

And here are the guts — the example I mentioned above:

  1. An initializer to add the line above to the User model
  2. A helper to figure out if we’re doing personalized recommendations, related products or basket recommendations
  3. A partial template to handle showing the results

That’s pretty lean for all of the glue and display code for adding recommendations (and instant updating) to a full-blown e-commerce thingereedoo.

The nerdy bits.

There are a few neat technical things that are happening behind the scenes to make all of this stuff easy from a user’s perspective.

SQL Generator

One of them is a fancy custom SQL generator that builds optimal queries for all of this stuff. Entire Rails models get exported with one query. The generated SQL can get hella ugly, but it offloads most of the dirty work to the database rather than having to do it all in Ruby code.

The above Spree example (the User to Product mapping) generates this SQL monstrosity:

SELECT users.id AS from_id, 'user' AS from_type, variants.product_id AS to_id,
CASE WHEN variants.product_id IS NOT NULL THEN 'product' END AS to_type,
CASE WHEN variants.product_id IS NOT NULL THEN 'purchase' END AS link_type FROM users
LEFT OUTER JOIN orders ON users.id = orders.user_id
LEFT OUTER JOIN line_items ON orders.id = line_items.order_id
LEFT OUTER JOIN variants ON line_items.variant_id = variants.id
WHERE users.id IS NOT NULL ORDER BY from_id;

We build up a thing we call a “connection”, which is a series of “bridges”. A bridge is just a route between two models, including the foreign keys, so a connection is a full path from one model to another via a chain of foreign keys. In the simple case this is detected based on the model’s name and the built-in ActiveRecord methods for reporting foreign keys, but you can also specify a bridge manually for foreign keys that don’t match the typical nomenclature. That isn’t documented at the moment, but shout if it’s something that you need.
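
For the curious, ActiveRecord exposes this sort of information through its association reflections. A minimal sketch (not the plugin’s actual code) of that kind of lookup, using the Spree models from the example above:

# Rails 2/3.0 exposes the foreign key on a reflection as primary_key_name
# (newer Rails renamed it to foreign_key).
LineItem.reflect_on_all_associations(:belongs_to).each do |assoc|
  puts "#{assoc.name} -> #{assoc.primary_key_name}"   # e.g. order -> order_id, variant -> variant_id
end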

Triggers

Another neat thing is the automatic installation of model triggers. When it’s building the connection mentioned above, our system works out which model changes trigger updates that need to be sent over to Directed Edge to keep your data in sync.

So if acts_as_edgy is set up to automagically send updates (a config parameter that can be set in the config block that gets written when you call rake edgy:configure), then as soon as a model changes anywhere along that path, we get the goods. That triggers a different code path that leads to our SQL generator pulling out just the stuff that needs to be updated in a single query.
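
Conceptually this is the sort of thing ActiveRecord callbacks are for. Here’s a minimal sketch of the general pattern, with a hypothetical queue class standing in for the plugin’s internals:

class Order < ActiveRecord::Base
  after_save    :edgy_record_change
  after_destroy :edgy_record_change

  private

  # EdgyPendingUpdates is hypothetical -- the real plugin wires this up for you.
  # The point is just to remember which record changed so that a single query
  # can push that slice of the connection over to Directed Edge.
  def edgy_record_change
    EdgyPendingUpdates.push(self.class.name, id)
  end
end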

Future: Asynchronous Web-service Calls

And since those updates are hitting a remote web service, it’s ideal if they’re not blocking execution. We make liberal use of a class we call Future (conceptually borrowed from QFuture from Qt) which executes a block of code in a background thread and only forces synchronization when its data is accessed. Here’s what it looks like:

class Future
  def initialize(postprocessor = nil, &finalize)
    if ENV['EDGY_SYNCHRONOUS']
      @data = postprocessor ? postprocessor.call(finalize.call) : finalize.call
      return
    end
    @postprocessor = postprocessor
    @future = Thread.new do
      begin
        finalize.call
      rescue => ex
        warn "Exception in background thread: #{ex}"
      end
    end
  end
 
  def method_missing(method, *args, &block)
    data.send(method, *args, &block)
  end
 
  def to_s
    data.to_s
  end
 
  private
 
  def data
    @data ||= @postprocessor ? @postprocessor.call(@future.value) : @future.value
  end
end

Since we implement method_missing on the Future class to first block and then forward the call on, you can use it just like a normal value, e.g.:

foo = Future.new { sleep 1; 42 }
puts foo # prints 42

Altogether the plugin weighs in at a slim 382 lines of code, but it’s pretty dense code with quite a bit of interesting stuff going on.

It’s still quite beta, so we expect there will be a handful of kinks to iron out and there are features we’d still like to add, but hopefully we’ll be doing that iteratively as you wonderful folks start building things with it and telling us what you need.

Google Spam Heresy: The AdSense Paradox

There’s been much ado about the problem of spam and Google of late. Being something of a search weenie, as my eyelids were feeling heavy today I found myself mulling over the problem, “How would one detect Google spamming?”

The answer turns out to be surprisingly easy. Who has an incentive to spam Google? People living from advertising. Who owns the largest online display ad network? Google.

Unlike with email, where there are heuristics at work to guess the intentions of the sender based on the content, Google has that data right in front of them.

So, here’s the heresy: the spamminess of a web site is inversely proportional to its ad click-through.

Think about it — in a typical internet search, a navigation path that terminates at the result page is the best outcome. If the user clicks on an ad instead, it probably means you missed serving up the right page in the first place. As a corollary, the pages best optimized to pull you in via a search term and send you back out via a related ad are among the worst results.

So if you’re Google, which value do you optimize for? More ad clicks or better search results? I’m a big enough Google fan that I believe they’d mostly want to optimize for good search results, since that’s what made them the company that they are. But what do you do if there’s an inverse correlation between the two? Bite the hand that feeds you?

Thinking about things this way, in my opinion, makes the issue even more interesting, because it seems to hint at something systemic — i.e. that there might be something more deeply problematic about financing search through display advertising.

Obviously this is a massive oversimplification of the problem of spam, but the paradox intrigued me.

What’s happening?

So, the blog has been a bit silent of late, mostly because there’s been a lot of incremental stuff happening. We shout that stuff out over at Twitter and Facebook, but we’ll pull some of the last couple-o-months together here. But while I still have your attention:

We need Rails beta testers. If you’ve been thinking about adding recommendations to your Rails app, drop us a line now.

So, with that out of the way, a few of the things that have gone down in Directed Edge-land of late:

  • We’ve been ramping up the features in our Shopify app based on things that we’ve been hearing from customers. The Shopify folks covered our new automagic bundling feature here.
  • A new version of our Ruby gem, 0.2.1, is out, and we’ve added a couple of features to our Java bindings so that they approach parity with the other bindings.
  • Werner Vogels, CTO of Amazon, has been spreading the good news about Directed Edge, and we caught up with him when he was presenting about AWS in Berlin.
  • Yours truly will be doing a big tour out through New York, Austin and San Francisco / Silicon Valley in December and January. Holler if you want to meet and talk about recommendations.

Werner Vogels presenting Directed Edge as "one of the building blocks of Cloud Computing"

While a lot of the stuff in the last few months has been focused on internal tools and run-of-the-mill growing pains stuff, we’re getting pretty close to a series of announcements about new stuff coming down the pipe. (For some reason those always seem to come in batches.) We’re hoping the first of those will be ready in the next week or so.

Directed Edge delivering 33% of revenue for The Market Quarter

Jonathan Briggs, who runs The Market Quarter, wrote a few months back on getting up and going with our Shopify app for recommendations. He reported back today after analyzing the first few months of data:

What astounded me was how many of my customers (not just visitors) clicked on recommendations. Indeed the ExpressRex referrer is responsible for a full third (33%) of my revenue in the last quarter and has a 36% conversion rate.

You can read the full post in his blog.

Shopify: Help Your Customers Discover Products and Grow Your Average Order Size

From over at the Shopify blog:

Eric Houtkooper of www.PupLife.com helps customers find and discover other products they may want. He has increased his customers’ average order size by recommending products as they browse the store and go through checkout. He suggests you do the same…

More there.

Using taps without running a taps server

So, the Ruby world has a nifty thinger for syncing up databases over the interwebs. It’s called Taps, from the superheroes over at Heroku. It’s great — you just run a little Sinatra-based server and then give the database URLs and it handles all of the plumbing.

But, you see, I was none-too-keen on having another long-running Ruby process, not to mention an open port with production database data lumbering around on it, so I thought I’d let you guys in on a little hack we produced internally to let you get all of the fun of taps, but without the taps server.

Basically it starts up the taps server on the remote machine, tunnels the transfer over SSH, then sends a Ctrl-C to the server to kill it when it’s done. It’s pull-only, very intentionally — I want to push from a development database to a production database about like I want a hole in my head.

#!/usr/bin/env ruby
 
require 'rubygems'
require 'active_support/secure_random'
require 'net/ssh'
 
SSH_USER = '[sshuser]'
SSH_HOST = '[dbhost]'
 
LOCAL_DB = 'mysql://[dbuser]:[dbpass]@localhost/[dbname]'
REMOTE_DB = 'mysql://[dbuser]:[dbpass]@localhost/[dbname]'
 
TAPS_USER = ActiveSupport::SecureRandom.hex(16)
TAPS_PASS = ActiveSupport::SecureRandom.hex(16)
 
URL = "http://#{TAPS_USER}:#{TAPS_PASS}@localhost:5000"
 
Net::SSH.start(SSH_HOST, SSH_USER, :compression => true) do |ssh|
  ssh.forward.local(5000, 'localhost', 5000)
 
  ready = false
 
  channel = ssh.open_channel do |c|
    c.request_pty
    c.on_data { |c, data| ready = true if data =~ /port=5000/ }
    c.exec("taps server #{REMOTE_DB} #{TAPS_USER} #{TAPS_PASS}")
  end
 
  finished = false
 
  Thread.new do
    sleep 0.1 until ready
    system "taps pull #{LOCAL_DB} #{URL}"
    finished = true
  end
 
  ssh.loop(0.1) do
    channel.send_data(Net::SSH::Connection::Term::VINTR) if finished
    !finished
  end
end

Substitute in the right values in the constants up at the top and you’ve got a nifty way to securely use taps without leaving a server running.

What programming languages do our customers use?

No huge shockers, but a few places where it’s interesting to compare perceptions to reality:

  • PHP, for better or worse, still dominates web development.
  • Python’s much closer to Ruby in the usage we see, despite making less noise.
  • Scala, despite being the new hotness, has very little actual uptake.
  • Perl is all but out of the game (though still more than Scala).

Making ActiveResource 34x faster: QActiveResource

One of the things that really hurts us when we’re pulling data via ActiveResource (as we do with Shopify and a couple of projects internally) is how slow Rails’ ActiveResource is. Using the Nokogiri backend helps a lot, but it’s still painfully slow. It’s so slow that the bottleneck is actually parsing the data rather than transferring it.

So I set off yesterday to rewrite our importer in C++ using Qt (and libcurl to grab the data). The result is a nice Qt-ified ActiveResource consumer that throws things into a collection of QVariant / hashes / lists.

Once I had that done I wondered, “What would the performance look like if I wrote a Ruby wrapper for the C++ implementation?” The answer was, fortunately, “Great!” meaning fortunately that the application logic can stay in Ruby with QActiveResource doing the heavy lifting.

It’s still relatively light on features — it just supports the find method, but the numbers speak for themselves:

The first result is the default pure-Ruby ActiveResource implementation; the second is the same, but using the implemented-in-C Nokogiri backend. The third is using my C++ backend directly, and the fourth is that backend bound to Ruby objects.

Methodology:

The data set is the set of products that we have in our test shop on Shopify. There are 17 of them, for a total of a 36k XML file. Each test iterates over reading that set 100 times. To remove other bottlenecks, I copied the file to a web server on localhost and queried that directly.
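
Each timing run was roughly the following shape, sketched here with Benchmark (the resource name and URL are illustrative, not the actual test code):

require 'rubygems'
require 'active_resource'
require 'benchmark'

class Product < ActiveResource::Base
  self.site = 'http://localhost/'   # local copy of the 36k products.xml
end

seconds = Benchmark.realtime do
  100.times { Product.find(:all).each { |product| product.title } }
end

puts "ActiveResource: %.2f seconds" % seconds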

So that’s reading 1,700 records for each backend over 100 requests; the average times were:

  • Ruby / ActiveResource / REXML: 34.60 seconds
  • Ruby / ActiveResource / Nokogiri: 12.87 seconds
  • C++ / QActiveResource: 0.79 seconds
  • Ruby / QActiveResource: 1.01 seconds

All of the code is up on GitHub here, including the test data and the raw results.

API in Ruby and C++:

The Ruby API is very similar to a limited version of ActiveResource and supports things like this, for example:

resource = QAR::Resource.new(ENV['AR_BASE'], ENV['AR_RESOURCE'])
resource.find(:all, :params => { :page => 1 }).each { |record| puts record[:title] }

The C++ implementation also follows the same basic conventions, e.g.

#include <QActiveResource.h>   /* header names assumed */
#include <QtDebug>
 
int main()
{
    QActiveResource::Resource resource(QUrl(getenv("AR_BASE")), getenv("AR_RESOURCE"));
 
    foreach(QActiveResource::Record record, resource.find())
    {
        qDebug() << record["title"].toString();
    }
 
    return 0;
}

At present this naturally requires Qt and libcurl to build. I may consider building a version which doesn’t require Qt if there’s general interest in such (we use Qt in the C++ bits of our code anyway, so there was no extra cost to schlurping it in).

There are more examples in the Ruby directory on GitHub. Once it matures a wee bit we’ll get it packaged up as a Gem.

Edit:

The API’s already been going through some changes, now it can be easily used as a mixin, a la:

require 'rubygems'
require 'active_resource'
require 'QAR'
 
class Product < ActiveResource::Base
  extend QAR
  self.site = 'http://localhost/'
end

The curious case of equidistant boxes, a CSS fail

The above is not an exercise in geometric abstraction. It’s an attempt to do something seemingly simple, but to the best of my knowledge, impossible in CSS. The challenge is this:

Given a container, fill the container with n items of variable width, with the leftmost item flush with the left border, and the rightmost item flush with the right border, and an equal amount of space between each.

Unpossible! Try it.

This isn’t a contrived example. Being a pixel pedant, when we started in on our app for Shopify, where images may be any size within a bounding square, distributing them equidistantly seemed The Thing To Do. Here’s a shot from our test shop with the items properly distributed and the text properly centered under them:

So, let’s step through the options:

.container {
    border: 1px solid #333;
    width: 520px;
    margin-bottom: 1em;
 }
 
.container ul {
    padding: 0;
    margin: 0;
}
 
.container ul li {
    margin: 0;
    padding: 0;
    display: inline;
}
<div class="container">
  <ul>
    <li><img src="80.png" /></li>
    <li><img src="100.png" /></li>
    <li><img src="80.png" /></li>
    <li><img src="100.png" /></li>
    <li><img src="60.png" /></li>
  </ul>
</div>

That gives us the first box: things are messily clumped along the left side of the container. I quickly threw in the towel and went for a table. With a table it was easy to get things distributed evenly, but with ragged extreme borders:

table.container {
    width: 522px;
    border-spacing: 0;
}
 
.container td {
    text-align: center;
    padding: 0;
    margin: 0;
}
<table class="container">
  <tr>
    <td><img src="80.png" /></td>
    <td><img src="100.png" /></td>
    <td><img src="80.png" /></td>
    <td><img src="100.png" /></td>
    <td><img src="60.png" /></td>
  </tr>
</table>

But if we want to make the extremities flush, we’re stuck with a larger-than-mean gap between the first two and last two boxes:

.flush td.first {
    text-align: left;
}
 
.flush td.last {
    text-align: right;
}
<table class="container flush">
  <tr>
    <td class="first"><img src="80.png" /></td>
    <td><img src="100.png" /></td>
    <td><img src="80.png" /></td>
    <td><img src="100.png" /></td>
    <td class="last"><img src="60.png" /></td>
  </tr>
</table>

It turns out that the only way to do this seemingly simple task is to compute the margins ourselves and set them via Javascript. Here we take a table just like our second one, but with an “adjust” ID so that we can find it easily:

function adjust()
{
    var table = document.getElementById("adjust");
    var row = table.firstElementChild.firstElementChild;
    var cells = row.children;
    var totalImagesWidth = 0;
 
    for(var i = 0; i < cells.length; i++)
    {
        var image = cells[i].firstElementChild;
        totalImagesWidth += image.offsetWidth;
    }
 
    var extra = row.offsetWidth - totalImagesWidth;
 
    for(var i = 0; i < cells.length; i++)
    {
        var image = cells[i].firstElementChild;
        var padding = extra / (cells.length - 1);
        var buffer = Math.floor(padding * i) - Math.floor(padding * (i - 1));
 
        if(i == 0 || i == cells.length - 1 && cells.length >= 2)
        {
            cells[i].style.textAlign = ((i == 0) ? "left" : "right");
            cells[i].style.width = (image.offsetWidth + Math.floor(buffer / 2)) + "px";
        }
        else
        {
            cells[i].style.textAlign = "center";
            cells[i].style.width = (image.offsetWidth + buffer) + "px";
        }
    }
}
 
if(window.addEventListener)
{
    window.addEventListener("load", adjust, false);
}
else if(window.attachEvent)
{
    window.attachEvent("onload", adjust);
}

Now things get adjusted just as we’d want them. Let’s walk through the steps:

    var table = document.getElementById("adjust");
    var row = table.firstElementChild.firstElementChild;
    var cells = row.children;

Here we just get a handle to our building blocks, the table, row and cells.

    var totalImagesWidth = 0;
 
    for(var i = 0; i < cells.length; i++)
    {
        var image = cells[i].firstElementChild;
        totalImagesWidth += image.offsetWidth;
    }

Now we compute the total width of all of the images used in the table. We’ll use that to figure out how much left-over space we have to distribute:

    var extra = row.offsetWidth - totalImagesWidth;

From there we go in to assign this to each cell:

    for(var i = 0; i < cells.length; i++)
    {
        var image = cells[i].firstElementChild;
        var padding = extra / (cells.length - 1);
        var buffer = Math.floor(padding * i) - Math.floor(padding * (i - 1));

The padding is simply the extra space divided by the number of gaps that we have, but we compute the buffer on each iteration so that the leftover fractional pixels don’t all accumulate at the end — i.e. if we had 7 items and the space between each should in theory be 10.333… pixels, naively flooring each gap would leave the spare pixels clumped into the last gap. By taking the floor at each iteration and subtracting the previous value, we end up with the spare pixels appropriately distributed.
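
Here’s a quick sanity check of that arithmetic in Ruby, using exact Rationals in place of the JavaScript floats:

extra = 62                          # leftover pixels to distribute
gaps  = 6                           # 7 items means 6 gaps
padding = Rational(extra, gaps)     # 31/3, i.e. 10.333... px per gap

buffers = (1..gaps).map { |i| (padding * i).floor - (padding * (i - 1)).floor }
p buffers               # => [10, 10, 11, 10, 10, 11]
p buffers.inject(:+)    # => 62, so the spare pixels are spread out rather than clumped at the end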

        if(i == 0 || i == cells.length - 1 && cells.length >= 2)
        {
            cells[i].style.textAlign = ((i == 0) ? "left" : "right");
            cells[i].style.width = (image.offsetWidth + Math.floor(buffer / 2)) + "px";
        }
        else
        {
            cells[i].style.textAlign = "center";
            cells[i].style.width = (image.offsetWidth + buffer) + "px";
        }
    }

The second half of the loop just sets the cells at the extremities to be left / right aligned and gives each cell half of the buffer space for each of their internal gaps. The ones on the edges only need half of the buffer allocation since they don’t have a buffer on their outside edges.

The sad thing is this is actually a much simplified version of the real code. The real code also accounts for:

  • Centering text under these images (Sounds easy right? No. You have to compute the margins manually also.)
  • Dealing with stylesheets where images have margins, padding and borders and still making the right thing happen.
  • Only one item being shown.
  • Minimum sizes for images, centering them if they’re smaller (you don’t want to try to center text below a 5 pixel wide image that’s right aligned)

I can get into the mechanics of those in another post if necessary and have been considering stripping the real implementation out of our widget and throwing it up on Github where it can take a nice, clean JSON structure and turn it into a table of images, text, prices and links. Give a holler in the comments if you’d find such useful.

And with that, I’ll leave you with a gallery of what-this-looks-like-in-real-life shops, from the annals of our Shopify customers.

For the love of POSIX: A more useful way to not segfault from null pointers

There’s a post making the rounds on Hacker News about actually mapping the address 0 to make it possible to not crash when dereferencing a null pointer.

The thing is, POSIX provides a more useful mechanism for recovering from null pointer dereferences using signal handlers.

See, signals in UNIX can be caught and you can do things with them. Let’s say that we’ve got an event-driven application and we want to move on to the next event if the handler for one event caused a segfault. We can do that by catching the SIGSEGV and SIGBUS signals and returning to our event loop to process the next event.

Let’s look at some code:

#include <stdio.h>
#include <signal.h>
#include <setjmp.h>
#include <string.h>
 
static jmp_buf event_loop_jump;
 
static void process_event(int event_id)
{
    if(event_id % 10 == 0)
    {
        /* Boom. */
 
        int *i = 0;
        ++*i;
    }
    else
    {
        printf("Processed Event: %i\n", event_id);
    }
}
 
static void handler(int signum)
{
    printf("Caught %s\n", strsignal(signum));
    longjmp(event_loop_jump, 1);
}
 
static void run_event_loop(void)
{
    int event_id = 0;
 
    for(event_id = 1; event_id <= 20; event_id++)
    {
        if(setjmp(event_loop_jump))
        {
            printf("Couldn't process event: %i\n", event_id);
        }
        else
        {
            process_event(event_id);
        }
    }
}
 
int main(void)
{
    signal(SIGSEGV, handler);
    signal(SIGBUS, handler);
 
    run_event_loop();
 
    return 0;
}

And that gives us:

Processed Event: 1
Processed Event: 2
Processed Event: 3
Processed Event: 4
Processed Event: 5
Processed Event: 6
Processed Event: 7
Processed Event: 8
Processed Event: 9
Caught Bus error
Couldn't process event: 10
Processed Event: 11
Processed Event: 12
Processed Event: 13
Processed Event: 14
Processed Event: 15
Processed Event: 16
Processed Event: 17
Processed Event: 18
Processed Event: 19
Caught Bus error
Couldn't process event: 20

What’s happening, Peter

So, let’s break that down a bit — we’ve created a function called handler that matches the signature required for signal handlers. man signal knows more, or check out the GNU documentation. In main we specify that we want to use that function to handle SIGSEGV and SIGBUS, the typical things you’ll be looking at when dereferencing null pointers. From there we call run_event_loop, whose function should be obvious.

Jump Around

run_event_loop just loops through 20 fake events, but the interesting bit is setjmp. That saves our execution context (the stack position and registers, basically). During normal execution it just returns 0, so we fall through to our process_event call below, which pretends to handle an event and prints out a message.

We’ve wired our event handler to blow up once every 10 times. When it does, execution is suspended and we cut over to our signal handler. Our signal handler, in turn, prints a little message and then breaks out of the signal handler and back into the normal event-handling flow with a longjmp call that corresponds to our setjmp from before.

So once we call that, we’re back at our line with the setjmp call, only this time it returns the value that we passed to longjmp (1), and since that evaluates to true, we end up printing a message saying that we weren’t able to process the event.

In Context

On the whole this seems far more useful than mapping address zero. However, There Be Dragons in these waters, and it no doubt opens up a can of worms in terms of what’s done with static variables in the functions you’re breaking out of and if they’re left in a secure state — in other words, this does open up another potential security vector.

I’ll also note that we don’t actually do this in any of our stuff at Directed Edge, going for the less exciting, but more predictable process watchers to handle restarts when our C++ engine falls foul of memory protection (which, blessedly, is a rare occurrence).

Update:

For posterity, and based on some of the comments here and on Hacker News, it seems apt to note that this method isn’t being seriously suggested for production code; the idea behind the post was more as a novelty and a brief exploration in signal handling.

Our new digs

The Directed Edge crew just moved into a new big office that we’re sharing with the other Berlin startups 12like and SMS Guru in our lovely home turf of Kreuzberg, one of the hot-spots for startups in Berlin and right on the Spree.

Directed Edge driving recommendations in Microsoft's MediaRoom CES demo

When it rains, it pours. We’ve got yet another announcement today. Microsoft, and the fine fellows of their TV division, announced MediaRoom 2.0 at CES yesterday. From their press release:

“Our strategy with Mediaroom is to combine the power of client software and cloud-based services to greatly enhance the way consumers experience digital entertainment,” says Enrique Rodriguez, corporate vice president for the TV, Video and Music Business at Microsoft. “We want to make it easier for consumers to find and discover great content, to watch, listen and engage in new ways, and to do so anywhere and on any screen. Mediaroom 2.0 is a key milestone in our strategy, providing the software platform to power operators’ service clouds to reach more screens, and more people, with more content than ever before.”

We naturally are also big believers in the power of discovery, and we’re excited to say that the recommendations in the demo at CES are being driven by none other than Directed Edge’s recommendations engine. We’re looking forward to further collaboration in the coming months.

Microsoft’s MediaRoom makes it easy for TV and video providers to get up and going with distribution over the internet. If you’re at CES this week, swing by their booth and check it out!

Nerd power, you know we need it.

So, VentureBeat yesterday picked up on us “opening up to developers”:

Product-recommendation startup Directed Edge is launching a free API for its recommendations platform so that developers can create new applications for it. Offered as a web service, the platform plugs into your website, collects data and then delivers real-time recommendations, based either on what similar users have done, or on a user’s own past behavior.

We wanted to drop in a wee bit more info for the loyal devotees of our humble blog. Of course we’ve had developers working with our API for aeons now, but what’s new here is that while we were recently firming up our pricing tiers (which are now also up on the site) we decided to serve up a plan for non-commercial mashups, skunkworks and similar goodness. This one’s on the house, fellas.

Getting buy-in from our nerdy brethren has been key to us getting the word out about what we’re doing, and we wanted to make good on that. This gives you all of the features of our “Social Starter” plan, for free. So, without further ado, I give you our:

Free Developer Account

Directed Edge + Shopify = Easy

One of the things we’ve had brewing for a bit is our new, shiny, Shopify app. It’s our first time to dip our toes in the world of Stupidly Easy (TM) widget-based stuff. Like, seriously, you click a couple times, agree to give us money, paste two lines of Javascript in your Shopify template, et voilà. Recommendations.

Plus, Tobias, who called our recommendations the “killer app” for Shopify, and his crew of Rubyists at Shopify have been great to work with. Shopify’s been on the way up for a while, now hosts thousands of shops and has seen tens of millions of dollars in sales in the shops they host.

This got picked up in an article in Venture Beat today; we’ll have another post up on that in a bit.

Directed Edge Ruby Bindings Gemified

Finally got the Directed Edge Ruby bindings all gemified and up on Gemcutter. Now you can install our bindings for Ruby with:

sudo gem install directed-edge

The Arc Challenge in Ruby

There’s been a bit of buzz again in the last couple days about the Arc Challenge in our corner of the interwebs, since a Haskell entry was thrown into the ring. Since we’re big Sinatra users (TweetRex, another internal demo and our just-about-to-come-out new account manager are all Sinatra), I thought I’d give it a whirl to see how it stacks up.

I did this first in the comments on news.yc, but then kept playing with it a bit, so I thought I’d give an expanded version here.

First let’s start with a version that works, without any extra massaging (line broken to avoid scrollbar):

require 'rubygems'
require 'sinatra'
enable :sessions
 
get '/' do
  session[:said] || '<form method="post"><input name="said" type="text" /><input type="submit" />
</form>'
end
 
post '/' do
  session[:said] = params[:said]
  '<a href="/">click here</a>'
end

The ugliness, relative to the Arc version is in the long HTML strings, but it’s certainly fairly concise, and also fairly readable.

The thing is, when doing web coding, most modern frameworks separate the presentation code from the application logic — and rightly so. This allows for the two to be refactored relatively independently. So, for the most part, I don’t see this missing logic in the Ruby core as much of an issue. However, in things like Rails’ ActionView, there are helpers that add in some of the Arc-like constructs. I thought I’d write a few of my own, with concision in mind, to get an idea of what this sort of application could look like, and to give a better idea of how Ruby as a language stacks up to Lispy concision.

Here I define a few convenience functions that we’ll pretend is our library.

require 'rubygems'
require 'sinatra'
enable :sessions
 
# "Library"
 
def element(name, params = {})
  children = block_given? ? (Array.new << yield).flatten : nil
  "<#{name} " + params.keys.inject('') { |v, k| v += "#{k}=\"#{params[k]}\" " } +
    (children ? ('>' + children.join('') + "</#{name}>") : '/>')
end
 
def input(type, name = '', params = {})
  element(:input, params.merge(:type => type, :name => name))
end
 
def form(params = {})
  params[:method] = :post if params[:method].nil?
  element(:form, params) { yield }
end
 
def link(target, params = {})
  element(:a, params.merge(:href => target)) { yield }
end
 
# Implementation
 
get '/' do
  session[:said] || form {[ input(:text, :said), input(:submit) ]}
end
 
post '/' do
  session[:said] = params[:said]
  link('/') { 'click me' }
end

(Before the Rubyists come out of the wood-work, I’ll note that I’m only so-so with Ruby, really we’re a C++ shop, and I wouldn’t really write an “element” function like that if I weren’t going OCD on concision.)

I tried to skimp on symbols by using blocks rather than a hash parameter to element, and could have saved another one by folding input(:submit) into its own function, but meh. So then the question comes, how does this stack up parse tree-wise? To do a comparison, I translated the “implementation” block there into pseudo-lisp:

(get '/'
 (or (hash 'said' session) (form (input 'text' 'said') (input 'submit'))))
(post '/'
 (setq (hash 'said' session) (params 'said'))
 (link '/' 'click me'))

I count 34 items, per Paul’s metric. The irony is that the version with the long HTML strings naturally has fewer, weighing in at 23 entries, on par with the original Arc implementation.

Why I don’t buy into the comparison

At the end of the day though, while fun, I still find the comparison rather contrived. As stated, I don’t want presentation code in my application logic. If I turn to a real Sinatra application that we’ve done, notably our soon-to-go-live account manager, here’s the line count:

Erik: /Users/scott/Projects/DirectedEdge/AccountManager> wc -l account.rb directed_edge_admin.rb
 account_manager.rb views/index.haml views/created.haml
     121 account.rb
      52 directed_edge_admin.rb
     136 account_manager.rb
       4 views/index.haml
      16 views/created.haml
     329 total

The problem is that the Arc Challenge is set up to be what Paul calls the “Hello, world.” of web applications, but effectively it becomes a standard library comparison more than a language comparison. In practice, even for our trivial little account manager, we connect to three different web services APIs (our billing provider, our in-house Rails CRM and the administrative API of our recommendations web service), parse and modify our HTML templates using Nokogiri’s CSS selectors and then merge that with some Haml code. Were we to consider things like XML / JSON parsing / generation “core” features of a modern web programming framework, the scales would be tipped in the favor of more established web languages.

I think this is why I personally tend to drift from language to language based on the problem and constraints currently under consideration. We literally have C++, Java, Ruby, Perl, Javascript, PHP, Python, C# and a wee bit of Scala in our repository, roughly in that order (the last four only for our language bindings), all being thrown at problems they’re well suited to. I believe an unintended moral of Paul’s example is that a language is only as good as how simple it makes the tasks at hand, and in practice, things like available libraries weigh heavily into that.

Revamped Developer Site

So, based on some feedback we’ve been getting we’ve just revamped our developer site. We’ve broken things out into a (we hope) pretty simple “Get started with [your programming language of choice]”.

We’d love to have some feedback or to know if there are still things that seem a bit unclear. One of the new content additions is an example of a PHP integration.

We’ve also moved all of the example code into our bindings repository on GitHub, so you can grab the full classes there.

The "Interview" with Y Combinator That's Not

So, in our last installment of the Y Combinator saga, we talked about applying. Since a handful of folks will be gearing up for their interviews shortly, it seems a fine time to move on to the next step in our journey.

We’d managed to stay mostly poker-faced about getting an interview this go-around up until decision day, when a little optimism crept in, evidenced by the fact that I thought it opportune to pick up a bottle of startup-priced champagne during the day’s shopping.

I’d been up most of the night working when the mail came at 7:00 a.m. Berlin time. We were naturally pretty excited.

If you’re going to San Francisco, be sure to wear some flowers in your hair:

We made some quick decisions: it was to be our first trip as a company to the valley. Accepted or otherwise, we wanted to get as plugged in as possible, so we booked a week-long trip and aimed to get our schedule as full as possible.

Connected to that we also took one of the first interview slots — 10:00 a.m., first day. YC interviews span three days (in our case Friday to Sunday) and you get an answer at the end of the day of interviews. Picking one on the first day meant we’d know 24 hours after landing if our primary objective was building our network or finding a place to live for the summer. We also wanted to get our pitch in before pitch-fatigue set in. If you’ve ever been to an event where you’ve heard startup after startup pitch, you know what I’m talking about. Strangely, to us, anyway, most of the initial sign-ups were clustered at the end of the third day. I can only presume that this was so that teams would have an extra bit of time to prepare.

For our week there, Valentin started making a list of companies to cold-call and try to set up meetings with and we got a handful — some with former YC companies and some not. They weren’t actively sales meetings — we just wanted to try to get a better handle on how recommendations could fit into the startup landscape. We figured meeting some YC companies would also give us an in to the YC mafia even if we weren’t accepted and we’d arranged to crash on the couches of various startups over the week.

Preparing for the interview, or “So you think you’re pretty smart, eh?”

We’d talked to investors in the past, so we thought we knew what was going to be important. I just found the stack of printouts that we prepared for the interview:

  • Mails from prominent pilot customers (with praiseful sections highlighted)
  • Feedback that we’d gotten on social media (Twitter, mostly)
  • Profiling information showing the speed of our engine
  • Full-color printouts of our demo-applications
  • Dashboard showing total number of qualified leads we’d generated

We brainstormed a big list of questions that I can’t find anymore that we thought might come up and talked through answers to all of them. We came up with a list of points that we wanted to be sure to mention and even practiced transitions from other topics to those.

All of that, however, turned out to be useless.

About 14 hours after landing we were at the YC office waiting to get started. On the whole we were pretty relaxed, since we felt like we had our pitch pretty together.

The interview that’s not an interview:

The setup is the team on one side of a table, the YC partners on the other side. We walked in, I opened up my laptop and set it on the table and put the folder of printouts next to it. I figured we’d start off introducing ourselves and saying something about our backgrounds, you know, the usual.

Paul started, “So, you guys are making a recommendations engine. Do you have a demo?”

30 seconds in, Paul, Trevor and Robert were standing behind us while I cycled through the various demos. It played out well: they’d ask if we could do X, and I’d pull out an example of us doing X. They asked a little about our algorithms; I remember Robert’s (only?) question being, “So where’s the rocket science here?” Questions were being asked more quickly than we could answer them, and I remember at one point Trevor and Paul standing behind us asking each other questions and Jessica having to reel them in to point out that I was trying to show them something.

“Interview” is really the wrong word for what happens. You might expect something that’s in the genre of a job interview or an investor pitch, but it’s nothing like that. It’s more like a brainstorming session with the volume turned up to 11.

I don’t know if we hit any of the points we wanted to be sure to talk about. Certainly not most of them, and I remember that there were some notable ones missing — like that we already had pilot customers.

There were no questions about the business model, or size of the market, or distribution strategies or any of that jazz. That said, sometimes it does go that way. I’ve talked to a few teams about their interviews, and heard mixed things.

The hard thing about this rapid-fire sort of setup is that it makes it difficult to give advice on how to prepare for a YC interview, but here are a few general things:

  • Have a demo and know your way around it. From what I’ve heard, demo-centric interviews are the ones that go the best.
  • Talk to as many people as possible about your startup and try to get them to ask as many questions as possible. Show them the demo and see what they think.
  • Make your demo look nice. Paul especially digs that and the first impressions help.

So don’t expect to go in and present. That doesn’t happen, as far as I can tell. I think all of the preparation we did, though, including talking both online and in person with YC alums, helped us be mentally prepared.

Post interview:

One quirky note: if you put your home phone number in the YC application, that’s the number that you’ll get a call on if you’re accepted. We’d given Jessica cards with our cell-phone numbers on them during the interview, but that didn’t seem to have registered. We waited, and waited, and waited. Finally around 11:00 p.m. I sent a mail to Paul and Jessica asking if decisions had come down. No answer.

The next morning we had a mail saying yes, they’d tried to call us. Valentin fumbled around with figuring out how to get a message off of our answering machine in Germany, and once we’d sorted it out, we basically heard, “Yes, please call us back.” So we did. Of course, they were interviewing people then, so it was a while until the next break when they could call us back. Once they did, we accepted the offer and that was that.

We found an apartment in the next couple of days via an apartment-wanted post on Craigslist and spent the rest of the week doing meetings, which turned out to be a really good idea — and that’s the thought I’ll leave you with for this installment — optimize so that, whatever you hear from YC, you’re on a positive trajectory. We’ll be back with the next chapter soon at the same bat-time and same bat-place.

Search vs. Recommendations, or Authoritative and Related Sources in a Graph

twitter

“They’re making a search engine.”

A bunch of my friends think that. It happens every week or so that I’ll get introduced as “a search engine guy”. And maybe there exists a definition of “search engine” which includes recommendations. But when people think about Google Search vs. Amazon’s recommendations, the difference is between finding and discovering.

Search is about finding. You start with a topic you know exists and you want to find information about it. Recommendations are about discovering things you didn’t know about.

miles-davis-google

See, there we’ve got a bunch of info about Miles Davis. But that’s the thing — it’s all about Miles Davis. If you already know Miles Davis exists, search is a great way to find out more about him and his music, but it’s awkward for discovering things you don’t know about.  For comparison:

miles-davis-directededge

When looking for things related to Miles Davis, we get a list of the giants of jazz — most of whom played with Miles Davis at some point in their career. No prior knowledge of Charlie Parker is required in this context. If you know about Miles Davis and want to discover things which are like Miles Davis, you need a recommendations engine.

The genesis of graph-based web search was in Jon Kleinberg’s seminal paper Authoritative Sources in a Hyperlinked Environment.  Kleinberg’s paper predated Brin and Page’s by a few months and was cited in the original PageRank paper.  From Kleinberg’s abstract:

The central issue we address within our framework is the distillation of broad search topics, through the discovery of “authoritative” information sources on such topics. We propose and test an algorithmic formulation of the notion of authority, based on the relationship between a set of relevant authoritative pages and the set of “hub pages” that join them together in the link structure.

Note the recurring keyword: authoritative.

It turns out that the difference between search and discovery is not just presentational — it is also algorithmic. When finding related rather than authoritative sources in a graph, we massage the data in fairly different ways. In fact, it turns out that authoritative sources are often simply noise in a search for related items. Let’s examine this visually again.

graph-1

Here we have a mocked up subgraph of Twitter — a few people that are following Barack Obama. When you start a search with Kleinberg’s algorithm (HITS), it begins by extracting a starting set of nodes based on a text search. Let’s imagine here that we’d searched for people mentioning Barack Obama and this was the set of nodes that were returned. Kleinberg’s algorithm attempts to determine the authoritative source in the set, and it’s pretty clear on visual inspection from this set that it’s the node called “Barack Obama”. The algorithm in the paper is naturally a bit more involved — it also incorporates the notion of “hubs”, but we’ll ignore those for now for simplicity. (Incidentally, Kleinberg’s paper is a rare combination of disruptive and accessible and well worth the time to read.)

Now if we were looking through that same subgraph and trying to find related users we’d need to use different logic. That someone is following Barack Obama says very little about them; certainly it doesn’t go far in determining what they’re likely to be interested in. If we were recommending a friend for Matt to follow, visually it’s clear that Jim would be a better recommendation than Bob.

As it turns out, Barack Obama, the “authoritative” node in this graph, is in fact just noise when trying to deliver a set of recommendations, and it’s best if we ignore it altogether.

graph-2

Again, as visually confirmed, removing the “authoritative source” from the subgraph makes finding related users for e.g. Matt or Dave much easier.

This problem surfaces all of the time in recommender systems. If we were finding artists related to Miles Davis, the problem would be that the terms “jazz” or “music” are linked far too often to Miles Davis and his ilk. On Twitter’s graph it’s people with so many followers that following them says little about a person. In a book store it’s that having bought Harry Potter says little about one’s more specific tastes.
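
To make the “ignore the authorities” idea concrete, here’s a minimal C++ sketch — not our engine — of scoring related users in a follow graph while skipping accounts whose follower counts cross a cutoff. The graph representation, the cutoff and the overlap scoring are all assumptions chosen for illustration:

#include <cstddef>
#include <map>
#include <set>
#include <string>

typedef std::set<std::string> Accounts;
typedef std::map<std::string, Accounts> FollowGraph; // user -> accounts they follow

// How many users in the graph follow the given account?
static std::size_t followerCount(const FollowGraph &graph, const std::string &account)
{
    std::size_t count = 0;
    for(FollowGraph::const_iterator it = graph.begin(); it != graph.end(); ++it)
    {
        if(it->second.count(account)) ++count;
    }
    return count;
}

// Score how related two users are by the followees they share, ignoring
// "authoritative" accounts whose follower count exceeds the cutoff.
std::size_t relatedness(const FollowGraph &graph, const std::string &a,
                        const std::string &b, std::size_t cutoff)
{
    // Assumes both users are present -- this is a sketch, not production code.
    const Accounts &first = graph.find(a)->second;
    const Accounts &second = graph.find(b)->second;
    std::size_t shared = 0;

    for(Accounts::const_iterator it = first.begin(); it != first.end(); ++it)
    {
        if(followerCount(graph, *it) > cutoff) continue; // skip the Barack Obamas of the graph
        if(second.count(*it)) ++shared;
    }
    return shared;
}

In the Matt and Jim example above, dropping the Barack Obama node before comparing is exactly what lets the overlap that actually carries signal show up.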

In the early days of Directed Edge, we called this the “tell me something I don’t know” problem. That is, after all, what recommender systems are for. If you recognize all of the results in a set of personalized recommendations, they’re not doing their job of helping you discover things. If something in a set of search results seems unrecognizable, it’s probably just a bad result.

Facebook's news feed: The beginning of a recommendations dominated web

Today Facebook moved from displaying live streams of friend updates to “news” by default. There’s more to this than meets the eye at first. What’s actually happened is Facebook has placed recommendations front and center on the third most popular site on the web.

facebook-news

This is big stuff.

Now, instead of showing every update by default, Facebook is picking the things you’re likely to be interested in based on feedback from your friends: recommendations, essentially.

One of the things that first got us excited about the recommendations space was seeing the merging of the social graph and traditional recommendations applications on the horizon. Friend finders and the like are the most obvious (and boring) applications of recommendations applied to the social graph, but the potential goes far beyond that.

Information overload has always been the driver of major innovation on the web. Social applications and information overload have been on a crash course for a while.  The intersection where they collide says much about the shape of things to come. In these days where followers / friends / fans are quickly outpacing the ability for one to consume the “real time stream” as such, it’s interesting to think of how you get out of that conundrum.

So, let’s step back a little bit and talk about recommendations.  I wrote an introduction to recommendations a while back that talked some about traditional recommendations based on things like purchase histories, ratings and whatnot and compared that to graph-based recommendations.  In the graph-based example I mentioned the possibility of using friends within a social network to drive the recommendations.

The thing is that these data sets need not be separate.  The social graph isn’t just friends; it’s a broad model for interactions between people and content on the web — there’s no reason not to consider products and news articles elements of the social graph, and once they’re in that grand unified model of interaction, no reason not to harvest that data to figure out what people are likely to be interested in.

But that’s not just possible — it’s absolutely necessary.  Recommendations present pretty much the only way forward for handling the explosion of real-time data on the web.

On Applying to Y Combinator

I’ve decided to do a mini-series of sorts on our Y Combinator experience.  Originally I was going to fold everything into one ginormous post, but I think it better to break it up and focus on different aspects.

So, let’s start at the beginning:  Applying to YC.

This story starts a year earlier than one would assume.  Summer of 2009 wasn’t the first time that we’d applied to Y Combinator.

Valentin and I decided to start a startup together and gave notice at our jobs in March of 2008.  I’d been tinkering in the area of graph-based information organization for about 4 years already and had built a quick proof-of-concept of a recommender system based on some of those ideas.

Since I’d been reading Paul Graham’s essays since before the dawn of Y Combinator, applying seemed a natural step.  We labored over the application — finishing it a full month before the deadline; we had friends proof-read it; we went through several iterations.

I mean, we had to get accepted.  I’d previously been a fairly well-known open source developer, had been an invited speaker all over the world, had spent several years in R&D at Europe’s largest software company, I knew the problem space well and we were convinced that we’d just seen the tip of the iceberg in the shape of things to come in content discovery.  Valentin had studied business and communications and worked in the web world for over 10 years and been the webmaster for some huge German sites.  What could possibly go wrong?

We didn’t even get invited for an interview.

The really interesting part is what happened next.  I started coding furiously.  Paul Graham could go to hell.  I’d show him.

That was our first taste of rejection.  It’s a taste you get used to when starting a startup.  You have to learn to thrive on it.  We were starting a startup, dammit, and nobody telling us “no” was going to change that.

We’d already been going to some of the local founders meetups which were a great chance to get a taste for the scene.  About a month after getting rejected from YC we got invited to the mini-Seedcamp in Berlin and met some of the people that continue to advise us to this day.  We found our first pilot customer the next month after that — who subsequently died before they could go live — but it was the validation that “somebody wants our stuff” that was really important.  By three months after YC we’d had an investor approach us for the first time.  We pulled in more pilot customers, kept building out our demos, got our first press.

Even though I was certain that we were going to get into YC, I’d set aside enough money to live on for a year to get Directed Edge up and going.  That was essential.

There are thousands of details that could be dropped in there — I mean, it was a year in the life of an early-phase startup.  It was grueling.

We learned enormous amounts.  The money that I’d set aside was the best spent money on education in my life.  About midway through the year, Oliver Beste, one of the local startup luminaries, said something to us that’s stuck with me:

“It’s unclear in your first startup what the worth of your company will be, but the worth of your person is guaranteed to increase.”

The year flew by.  We’d thought we wouldn’t apply to Y Combinator again, but at that point we knew a number of Y Combinator alumni that encouraged us to give it another swing.  Like every point in that year, we were crazy busy.

Unlike the first time around, we did our application in a couple of hours, scarcely proof-read it, did a quick video and sent it off.  I mean, we already knew they didn’t like our idea, so it was just something we knocked out and moved on.

And of course, that time around, we got an interview, and were eventually accepted to Y Combinator.

What was the difference?  Team was the same.  Idea was basically the same.  I’m sure the application was better — I mean, we’d pitched 1000 times at that point, so it was a lot easier to just ad lib in talking about our company.  Plus we had a better demo and pilot customers.

Paul later told me that the big difference was that he recognized me from Hacker News and that I said smart things there.  So if that was the critical difference, the question is, “Why was I saying smart things?”

When we started with Directed Edge we were as ignorant about what starting a startup was about as the next set of fresh-off-the-employed-world-boat founders.  Maybe worse.

It’d be flattering to think that maybe it was sheer intelligence that got us over the hump, but, uhh, no.  It was that we kept going.  You’d have to try really hard to run your first startup full-time for a year and not learn a huge amount.

So, now I’ll finally get around to the point of this entry:  People now often ask me what they should do on their YC application to increase their chances of getting in. There are the usual things like concision, clarity of thought, quality of idea — those all help certainly.  But I’m convinced that the single most important thing that you can do to increase your chances of getting into Y-Combinator is to do what you should be doing anyway:  going full-speed ahead on your startup. That’s what teaches you about your company, your market, the real composition of your team.

It’s easy to look at some of the teams coming out of Y Combinator and assume that Y Combinator has been some great leveler making quality startups out of the raw founders, but the truth is that the teams that were moving the fastest at the end were the ones that were moving the fastest at the beginning (with a couple exceptions).

The term “accelerator” has been applied to the swath of Y Combinator-esque programs out there.  I never really liked that term, but now I realize that it is in fact descriptive:  Y Combinator operates on the first derivative of your speed — it doesn’t carry everyone a uniform distance; it speeds teams up.

In that sense, we’re lucky that we didn’t get into YC when we first applied.  We’d not have gotten nearly as much out of it.

And contrary to a bit of the mythology that surrounds YC, a pretty big chunk of the teams had working products developed on day one.  They weren’t teams that had decided to apply to YC — they had decided to start startups, and YC was a step along that path.

And that’s the second big point:  Decide you want to start a startup, not that you want to be in Y Combinator.

The single biggest fear that Y-Combinator has for the teams it accepts is that they’ll give up.  So if there’s something you can put on your application that shows that you won’t, I think that’s probably a lot more important than how you frame your idea.

So go build your company.

One final note:  Most of the people reading this and planning on applying to YC will get rejected.  Even if you’ve got a great company, they’re evaluating several hundred teams in a tiny span of time, so just bad luck might be enough for you to not get an interview.  I’d advise you to take on the “To hell with Paul Graham” attitude I mentioned.  He won’t hold it against you.  I told him about it and he said he’d probably have reacted the same way.  Rejection is part of starting a startup.  Just in the last two weeks I’ve seen companies on TechCrunch that interviewed with our batch and were rejected.

Good luck, and stay tuned for the next installment.

The Bindings Parade Continues: Java

We recently added a “language” button to our sign-up form so that we could see which languages people signing up to evaluate our API were using and respond appropriately on our developer site and with more language bindings.  The latest addition to the family, joining Ruby, PHP and Python, is our Java bindings, up at the regular spot on Github.

Soon we’ll get around to giving some more love to the developer site based on feedback we’ve been collecting in the last few weeks and round out the section of tutorials to cover all of the languages that we currently have bindings for.

Segmentation of the Top 100 Sites in the US

We’ve been looking at different angles for planning out the next bit of Ye Olde Directed Edge Roadmap, and as we started running some numbers on how online stores and media sites break down in the US, I wondered, “What’s the breakdown of the top 100 sites in the US?”  It’s the kind of thing that’s good to know when you’re trying to figure out how many sites on the interwebs might be able to make use of a recommendations engine.

Here are the total counts — the results were a little surprising:

  • Media – 15
  • Service – 12
  • Search – 10
  • Adult – 10
  • Content – 7
  • Blogging – 6
  • Social – 6
  • Images – 5
  • Gaming – 4
  • E-Commerce – 3
  • Brick & Mortar Store – 3
  • Advertising – 3
  • File-sharing – 3
  • Video – 2
  • Auctions – 2
  • Hardware – 2
  • Maps – 2
  • Software – 2
  • Jobs – 1
  • Music – 1
  • Weather – 1

The top-100 list came from Alexa, and some of the categories are a wee  bit ad-hoc, but it does give an interesting idea of what the break-downs are.  I wouldn’t have guessed, for example, that adult sites beat e-commerce sites by a factor of more than 3-to-1 in the US top 100.

The full data is here in a PDF.  Holler if you want the spreadsheet version to bang on.

Demo Day Coverage

Y-Combinator’s Demo “Day” spans two days at this point — almost all of the journalists were there today and there were a couple of stories to emerge from their presence.

Today we were last in rotation, which means showing up last in the live blogging as well.  On the plus side, tomorrow, with more investor-like folks around, we’ve got the first slot of the day.

The Logo Story

So, we have a new site and a new logo. You might have noticed. We can fake web design in house, but doing a new logo was out of reach. We needed help.

On Finding Reinforcements

I asked around. There were two suggestions that came my way multiple times: 99designs and David Pache.  David’s portfolio is stunningly good, so the decision wasn’t hard, and personally I’ve always been a bit squeamish about 99designs:  lots of people competing for a chance to get one sub-market price.  It just doesn’t seem like a system that would be biased towards high-quality work.  Certainly the best designers would avoid it like the plague.

But there’s another more subtle reason that I’m glad we went with a great logo designer:  they’re better than me at picking quality logos.  If I were deciding from 20 logo designs on 99designs, I’d have been hard pressed to pick the best one.  In fact, David sent us 4 mockups, and we didn’t know which one to pick.  I’m glad this one was his favorite as we’ve since warmed to it.  We were using the current one as a stand-in when a bit of serendipity made things more interesting.

The Other Whiteboard Artifact

Yesterday, Paul Graham posted “A Whiteboard Artifact” about mapping out some of what our company is about — but that wasn’t the whole whiteboard.  When he was drawing out how to explain what we do, here’s what he drew on the right half of the board:

Notice anything?  Yeah, we did too.  Here’s our new logo in full-ginormous-glory:

Previously, amusingly, we weren’t sure if the new logo quite captured what we do.  Having an advisor duplicate it on the whiteboard, never having seen it — or even knowing that we were evaluating logos — once again made the decision easy.

And that’s the story of our new logo.  For a little more visual goodness, check out the corporate identity page that David put together, including some insight on the iterations in the design process.  We’re glad to have worked with him.

TechCrunch: "YC-Funded Directed Edge Sees A Post-Search Web Where Recommendations Rule"

So, our first coverage on TechCrunch is up — and along with it we’re coming out of hiding with our Y-Combinator backing.

Amazon, of course, does product comparisons, but there’s no reason recommendations shouldn’t be a part of news consumption, music consumption, social networking, basically everything we do on the web. And while there are no shortage of companies out there that focus on some of these different fields specifically, Directed Edge has developed a system that can be plugged into all kinds of different sites.

There will be some more technical news that we’ll drop in the next couple of days — there are a couple of features that we’ve been chomping at the bit to get out to users, plus we’ll be putting together even more content for our developer site.

Python Bindings, Updated Wikipedia Data

So, joining our Ruby and recently added PHP bindings, we now have Python bindings for our web services API.  They should be considered beta-quality for the next couple-o-weeks, but we’d love feedback and to hear about any bugs that you hit.  You can get a look at them here.

We’ve also, for the first time in a few months, updated the dump of wikipedia that our related pages are served from, which are also served up through our wikipedia related-pages API.

Think Vitamin on beyond relational databases

There’s a new article up on Think Vitamin titled, “Should You Go Beyond Relational Databases”, discussing the emergence of a crop of non-relational databases.  It’s one of the better introductions that I’ve run across to the landscape of document-oriented databases, graph databases and key-value stores.  There’s a brief mention of Directed Edge in the graph database section:

There is less choice in graph databases than there is in document databases: Neo4j, AllegroGraph and Sesame (which typically uses MySQL or PostgreSQL as storage back-end) are ones to look at.  FreeBase and DirectedEdge have developed graph databases for their internal use.

Graph databases are often associated with the semantic web and RDF datastores, which is one of the applications they are used for. I actually believe that many other applications’ data would also be well represented in graphs. However, as before, don’t try to force data into a graph if it fits better in tables or documents.

Our own database is a quirky hybrid of a graph database, key-value store and column-oriented database — the graph is a collection of interconnected items, which can have arbitrary meta-data associated with them, with column-oriented physical organization.  In fact, rolled into the next batch of updates that we’re testing at the moment are some more groovy features for enhanced item meta-data like images and stuff.  We hope to go live real soon with that and The Other Cool New Features.

PHP bindings are starting to shape up.

By popular demand we’ve been banging out some PHP bindings that mirror the Ruby bindings that appeared along with our last web services update and the new developer site.

The API is basically the same adjusted for PHP conventions. They’re still a work in progress and subject to change, but we’d love feedback from folks more steeped in the world of PHP than us.

So, up on ye olde Github, our PHP bindings!

C and C++ are not the same language.

In the web world, a world I’ve made my way back to in the last year, pointers are an exotic relic of the past.  Seriously.

I never got the memo; there was no going away party, I was just sitting there, minding my own structs and boom, next thing I know, I was that guy that hadn’t realized that those mallocs are so 90s.

But there is a certain reverence for the old ways.  You’re supposed to learn C or C++ (often just written as “C / C++”) to be a well rounded dork, much like you’re supposed to have read Shakespeare.  This is all fine and well, and I agree.  But there’s one sticking point.

C and C++ aren’t the same language.

Let’s first take a tour of the major reasons that are given for learning C or C++:

  • You learn how to do manual memory management
  • You get “closer to the hardware”
  • They offer performance benefits over most languages
  • There’s a wealth of code written in both that you may want to understand

All of these are true in both languages.

However, learning to think in C versus C++ is a completely different story.

Modern C++ is usually written in an object-oriented style.  Even modern C is usually written in a pseudo-object-oriented style using opaque types.  But the mechanisms for working with the two vary widely.  Notably, in C you have to build up the higher level abstractions on your own that are built into the language in C++.

It’s beyond the scope of this entry to get into the many differences between the two, but I decided to implement a common pattern in programming, the observer pattern, in C, C++ and Java to illustrate the differences.

Observer Pattern in C:

#include <stdlib.h>
#include <stdio.h>
 
typedef void(* FoomaticListener)(void);
 
struct Foomatic
{
    FoomaticListener *listeners;
    int listener_count;
};
 
struct Foomatic *foomatic_create(void)
{
    return (struct Foomatic *) calloc(1, sizeof(struct Foomatic));
}
 
void foomatic_destroy(struct Foomatic *foomatic)
{
    free(foomatic->listeners);
    free(foomatic);
}
 
void foomatic_add_listener(struct Foomatic *foomatic, FoomaticListener listener)
{
    int count = ++foomatic->listener_count;
 
    foomatic->listeners =
        (FoomaticListener *) realloc(foomatic->listeners,
                                     sizeof(FoomaticListener) * count);
 
    foomatic->listeners[count - 1] = listener;
}
 
void foomatic_activate(const struct Foomatic *foomatic)
{
    int i = 0;
 
    for(i = 0; i < foomatic->listener_count; i++)
    {
        (*foomatic->listeners[i])();
    }
}
 
static void first_listener(void)
{
    printf("Whoopee.\n");
}
 
static void second_listener(void)
{
    printf("Whoopee.\n");
}
 
int main(void)
{
    struct Foomatic *foomatic = foomatic_create();
 
    foomatic_add_listener(foomatic, first_listener);
    foomatic_add_listener(foomatic, second_listener);
 
    foomatic_activate(foomatic);
 
    foomatic_destroy(foomatic);
 
    return 0;
}

Observer Pattern in C++:

#include <set>
#include <iostream>
 
class Foomatic
{
public:
    class Listener
    {
    public:
        virtual void activate() = 0;
    };
 
    void addListener(Listener *listener)
    {
        m_listeners.insert(listener);
    }
 
    void activate()
    {
        for(ListenerSet::const_iterator it = m_listeners.begin();
            it != m_listeners.end(); ++it)
        {
            (*it)->activate();
        }
    }
 
private:
    typedef std::set<Listener *> ListenerSet;
    ListenerSet m_listeners;
};
 
class FooListener : public Foomatic::Listener
{
public:
    virtual void activate()
    {
        std::cout << "Whoopee." << std::endl;
    }
};
 
int main()
{
    Foomatic foomatic;
    FooListener first;
    FooListener second;
 
    foomatic.addListener(&first);
    foomatic.addListener(&second);
 
    foomatic.activate();
 
    return 0;
}

Observer Pattern in Java:

Foomatic.java

import java.util.Set;
import java.util.HashSet;
import java.util.Iterator;
 
public class Foomatic
{
    private Set<Listener> listeners;
 
    public interface Listener
    {
        public void activate();
    }
 
    public Foomatic()
    {
        listeners = new HashSet<Listener>();
    }
 
    public void addListener(Listener listener)
    {
        listeners.add(listener);
    }
 
    public void activate()
    {
        Iterator<Listener> it = listeners.iterator();
 
        while(it.hasNext())
        {
            it.next().activate();
        }
    }
}

Callback.java

class FooListener implements Foomatic.Listener
{
    public void activate()
    {
        System.out.println("Whoopee.");
    }
}
 
public class Callback
{
    public static void main(String [] args)
    {
        Foomatic foomatic = new Foomatic();
        FooListener first = new FooListener();
        FooListener second = new FooListener();
 
        foomatic.addListener(first);
        foomatic.addListener(second);
 
        foomatic.activate();
    }
}

All three of these do the same thing (with the one proviso that the C version uses a simple list rather than a set for concision).

The first thing that should jump out at you is that two of these look very similar.  And it’s not the C and C++ versions.  Modern C++ is much more similar to Java than it is to C.  Learning to think in C++ is much closer to learning to think in Java.

In the C++ and Java examples a callback is achieved by defining an interface with an abstract method that’s implemented in a concrete subclass.  In C a function pointer is used.

Now, let’s get back to the root of the confusion.

C++ is backwards compatible with C.  Good C++ developers, especially those doing systems programming, tend to be familiar enough with the important parts of C to exploit its lower-level primitives and pack them into an object-oriented structure.

But someone working on, say, GUI development in C++ might never come into contact with the C underworld.  You can be a goodly C++ developer and never use function pointers, rarely use macros (and then rarely in a particularly interesting way) and most of all be completely clueless on how to define a reasonable C API that does proper encapsulation.

Really knowing a programming language is much more about knowing how to wield it to solve problems rather than being able to write code that the compiler doesn’t barf on.

Let’s look back at the goals we had for learning one of these languages:

You learn to do manual memory management

In C++, memory management is often less visible, handled by abstraction layers.  We didn’t have to think about memory being allocated to insert an element into a set or freed when we were done.  That was handled by the language and its standard library.  This holds somewhat generally for C++, so I believe that if your goal is educational — simply to learn about memory management — C is probably closer to fitting the bill.

You get closer to the hardware

Again, C is probably a win.  Not because you can’t do systems programming in C++ (in fact, that’s most of what I do) but because when doing systems programming in C++ it tends to come out looking like blocks of C neatly organized into classes.  The bulk of the code that you can also use as a learning reference (glibc and the Linux kernel are both good here) is written in C.

They offer performance benefits over other languages

This is true for both, but C forces most of the time-consuming stuff to pass in front of your eyes.  There’s less magic happening behind the scenes.  When writing performance critical C++ understanding that it’s built on the same runtime as C is useful for understanding what’s actually happening when you call a virtual function.  (Answer:  Classes which have virtual functions have a “vtable” that’s created by the compiler which is simply an array of function pointers.)
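
As a rough mental model, and only that, since the details are up to the compiler, a virtual call behaves as if each object carried a pointer to a per-class table of function pointers:

#include <cstdio>

// Hand-rolled stand-in for what a compiler generates for a class with
// virtual functions: each object points at its class's table.
struct VTable
{
    void (*activate)(void *self);
};

struct Object
{
    const VTable *vtable; // the "hidden" member added by the compiler
};

static void fooActivate(void *self)
{
    std::printf("Whoopee.\n");
}

static const VTable fooVTable = { &fooActivate };

int main()
{
    Object object = { &fooVTable };

    // A virtual call boils down to roughly this double indirection:
    object.vtable->activate(&object);

    return 0;
}

Which is also why a virtual call costs an extra indirection and can’t be inlined as readily as a plain function call.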

There’s a wealth of code written in both that you may want to understand

This naturally has less of a clear winner.  C tends to be more dominant at the lower levels, C++ creeps in closer to the middle of the technology stack.  Systems libraries and kernels are usually written in C, things like web browsers and office suites are more often C++.

But wait … so I said all of that nice stuff about C, why do I still prefer C++?

If your goals are purely educational, C is probably a better fit.  At an atomic level it’s harder to understand, but there’s much less of it to understand.  C++’s syntax is everything from C plus a myriad of advanced compiler-fu that takes a good long while to get your head around.  Many C++ programmers who have been working with the language for half a decade still couldn’t tell you how to use partial template specialization.

But if you’re writing real code — something you’re going to be curling up with night after night until unemployment do you part, I’ll take C++ any day of the week.

If you want to learn about the innards of programming, filleting a program in C teaches you how.  If, 95% of the time, you’d rather have that handled by the abstractions you’re working with, then classes, templates, exceptions and the other modern encapsulation mechanisms supported by C++ make working on a large code-base more palatable.  I’ve been writing C for 15 years and in the first version of the C example above, I forgot to free the list of function pointers.  C++ is also more concise and expressive.

Now, anticipating the reaction of the high-level crew, aren’t most of the arguments that I just made for C++ even more true for, say, Python, or Ruby?

Of course.  But C++ often hits a sweet-spot between languages where high-level abstractions are available with the raw power of C when you need it.

At Directed Edge we use a mix of languages, and even a mix of C++ styles, trying to hit their comparative sweet-spots.  The rough break down is:

  • Database:  C-ish C++.  Looks like C packaged into classes and mostly is.  Binary operators for the win.
  • Engine:  More “pure” C++.  Less low-level gobblety-gook.  Works mostly with abstractions built up in the database level.  Optimized like it’s nobody’s business.
  • Web Services:  Java.  It’s hardly the cool-kid on the block these days, but does well with throughput and fault-tolerance (which C and C++ are bad at).
  • Web Front-ends:  Ruby.  Horrible performance, but rapid development time.

“We think you’d also like…” and the Math of Suggestion – Part 2

Part two of my article on “The Math of Suggestion” went up today on Gründerszene, and here’s a link back to Part 1 to refresh your memory.

On the Road: Silicon Valley and Boston

The two of us that founded Directed Edge will be making our first trip together to Silicon Valley later this month, and I’ll be in Boston in June and might do a quick trip out to New York.  If any of the fine folks watching this space are in one of those places and would like to meet or send some event tips in our general direction, drop us a line!

“We think you’d also like…” and the Math of Suggestion – Part 1

Part I of my article on “The Math of Suggestion”, covering the basics of user-based collaborative filtering, just went up on Gründerszene.  Part II will cover item-based collaborative filtering and some of the basics of graph-based approaches.  We’ll post it here when it goes live.

Greg Linden on "What is a Good Recommendation Algorithm?"

Greg Linden is one of The Really Smart People in the world of recommendations.  He’s got an excellent post on the newly launched ACM blog.

I’ve had a draft of an essay sitting around for a while titled, “What’s wrong with recommendations?”  Greg hits on one of the key points that I’d intended to cover:

In the end, what we want is happy, satisfied users.  Will a recommendation engine that minimizes RMSE make people happy?

Check out the full post here.

Announcing TweetRex: Twitter friend recommender built on the Directed Edge recommendations engine

So a couple weeks back we decided to have a little fun with the Twitter API and built a friend recommender using the Directed Edge engine. What was originally going to be a two-day project that our friend Björn Günzel and I wanted to bang out turned, as two-day projects are wont to do, into a 10-day project.

How hard could it be?

It turns out that spidering the Twitter graph in real time presented some interesting challenges for our database. This was the first extremely write-intensive application we’d tested: a single recommendations run can trigger adding thousands or even hundreds of thousands of items to our database as we do the progressive spidering.

To make things more fun, with all of those requests out to the Twitter API, the application needed to be heavily multithreaded — often using up to 500 concurrent threads split between spidering Twitter, computing recommendations and handling incoming requests.

Sinatra and new database features:

We wanted to build this application using only the Directed Edge database — no messy ORM layer and all that jazz.  Sinatra seemed like a nice, low-fat alternative to Rails.

The new stuff in our database lets it act as a full-blown key-value store with support for storing arbitrary data with arbitrary MIME types in the DB.  Right now we’re just using this for Twitter profile caching.  In the next update of our database / web-services it’ll be possible to use the database much like WebDAV, assigning images or text to graph nodes with a simple HTTP PUT and immediately retrieving them.  This opens the door to being able to do something like, “Show me pictures of all related items.”  More on that once those features make their way into the released version of the web-services.

Go easy on her, gents.

We’ll see if the app survives its relatively gentle launch.  Right now we hit the Twitter API quite hard — so hard, in fact, that we have to limit the number of concurrent TweetRex users or else our Twitter queries start timing out and recommendations take ages to generate and come back incomplete.  You’ll get a little message if we’re overloaded and it’ll automatically move on to the log-in sequence once things are live.

So, what does it do?

TweetRex looks through the Twitter graph for people that are in your “neighborhood” that you’re not following.  The results get better as we spider more of the Twitter graph, so check back in from time to time.  There’s a little message box that pops up asking if we can post a tweet in your stream.  We’d love it if you’d oblige to help us get the word out about the app.

This is just the first beta.  We’ll be adding some more features over time, and are looking forward to your feedback!

And now, without further ado, I give you TweetRex:

http://tweetrex.directededge.com/

Bindings project on GitHub, Starting with Ruby Bindings


We just uploaded the first version of a simple Ruby binding to our API to Github. The code is BSD licensed, so you can do pretty much anything you want with it. We’ll be adding a few more features to the Ruby version in the following days and eventually follow with similar bindings for other languages.

Looking forward to your feedback!

On Building a Stupidly Fast Graph Database

It’s pretty clear to computer science geeks that Directed Edge is supposed to be doing groovy things with graphs. In fact our recommendation engine, and some of the things that are unique about our approach to recommendations, are built on our super-fast graph database. When we went live yesterday with the latest version of our recommendations web services, another, much bigger thing happened behind the scenes for us: we cut over to the new version of our graph database.

Every time that Directed Edge gets mentioned in nerdier circles we get a lot of questions about this fabled graph-engine, so we thought we’d indulge our techie friends with some background info.

When we first decided to build the Directed Edge engine, we’d built some in-memory and RDF-store based prototypes to start hashing out the recommendations algorithms, but the RDF stores couldn’t keep up performance-wise and staying memory-based obviously wasn’t an option for persistent data. So we tried porting things over to SQL-based databases … and got a working version, but it was so painfully slow that we knew we needed to keep looking. The initial version of our web services was built on simply serializing data objects to disk and reading records from index files that we loaded up into memory. This worked and was both fast and memory efficient, but ignored most of the harder problems of a true database.

So what kind of performance are we talking about, exactly? Well, it’s fuzzy where the cut-off is, but we’re now able to import a graph with 2.5 million nodes and 60 million edges in 12 minutes on a MacBook. Query times for finding related items on a similarly large graph with warm disk-buffers average under 5 ms.

With the latest version of our web services and database, we’ve gone from having a graph-store to having a proper graph database. We now use straight memory-mapped I/O, disk-based linear-hashing, have mechanisms to address data fragmentation, thread-safe read / write locking and logging / log replaying.

Before we jump into the computer-science-y core of what we’re doing, I’ll throw down some nerd-cred and give some perspective to the design decisions made. The design was done by me, Directed Edge’s CTO, and I’ve been looking at algorithms for graph-based similarity for around 5 years now. During the earlier half of the decade I did a series of talks on “contextual webs” to our crew in the Linux world. While I have no academic background in database design, I’ve been doing low-level bit-wrangling for most of the last decade, and Directed Edge’s data backend is the third key-value store that I’ve written from the ground up. So, what I’m trying to say is: some of these ideas may be original, many are not, and some fly in the face of traditional database design, which is likely some mix of novel and naive. Be that as it may, the closest that we’ve gotten to our performance using off-the-shelf databases was with BDB, at 5% of the throughput of our database.

So, on to geekery. Let’s start at the top level of abstraction and work our way down. The impatient may feel free to turn the page at any time since the juicier bits are in the lower levels.  We’ve tried to bold the highlights.

The Store: Items, Links, References, Tags and Whathaveyou.

Sitting up on top of the architecture is a C++ class called “Store”. Like the rest of the things that sit below the web services (which are in Java), it’s written in C++. This is the API that we use to deal with a database. To our recommendations code, this is the database.

Things are pretty simple there, and the API is flexible enough that it’s let us swap out the implementation below it to compare other backends. Basically it’s this:

  • The store is a big collection of items, which can be anything — a user, a product, a web page.
  • Items are identified by a string unique identifier which maps to an integer index.
  • We can look up an item’s index by name really quickly.
  • Items have links and references, which are basically a list of things they’re connected to and things that connect to them. These are symmetrical — so creating a link from item A to item B, creates a reference from item B to item A.
  • Items can also have tags, e.g. “user”, “product”, “musician”, “film” and so on. Tags are non-exclusive, so an item can have as many tags as appropriate.
  • Links and references can have weights of 1 to 10 or be unweighted. We use that for ratings. So a rating is just a link from user A to product B with weight of 1 to 10.
  • We can also store arbitrary other fields along with these like “text”. Really, the only limitations on what can be associated with an item are what we choose to expose in the API here. The lower levels are flexible.

So that’s the top level. That’s what makes things look like a nice, clean graph architecture without mucking with all of the lower-level storage details. At this level everything’s thread-safe too, so the map-reduce pattern that we use when generating recommendations doesn’t leave a trail of destruction and SIGBADTHINGSs in its wake.
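
For a feel for the shape of that API, here’s a toy, in-memory mockup of the ideas above. The names are invented for this post rather than lifted from our actual code, and none of the on-disk machinery described below is present:

#include <map>
#include <set>
#include <string>
#include <vector>

typedef unsigned int Index;

class Store
{
public:
    Index add(const std::string &identifier)
    {
        Index index = (Index) m_identifiers.size();
        m_identifiers.push_back(identifier);
        m_indexes[identifier] = index;
        m_links.resize(index + 1);
        m_references.resize(index + 1);
        m_tags.resize(index + 1);
        return index;
    }

    Index indexOf(const std::string &identifier) const
    {
        return m_indexes.find(identifier)->second; // assumes the item exists
    }

    // A link from A to B implies a reference from B to A.
    void addLink(Index from, Index to, int weight = 0)
    {
        m_links[from][to] = weight;
        m_references[to].insert(from);
    }

    void addTag(Index item, const std::string &tag) { m_tags[item].insert(tag); }

    const std::map<Index, int> &links(Index item) const { return m_links[item]; }
    const std::set<Index> &references(Index item) const { return m_references[item]; }
    const std::set<std::string> &tags(Index item) const { return m_tags[item]; }

private:
    std::vector<std::string> m_identifiers;
    std::map<std::string, Index> m_indexes;
    std::vector<std::map<Index, int> > m_links;      // target -> weight (0 = unweighted)
    std::vector<std::set<Index> > m_references;
    std::vector<std::set<std::string> > m_tags;
};

Swap the in-memory containers for the column, hashing and mmap machinery described below and you’re in the neighborhood of what the real thing has to do.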

Things Start Looking Databasey: Rows and Columns

Like traditional databases, our database is organized into rows and columns. An item is just a row. Unlike most databases, there’s only one set of rows (only a small lie, which we’ll ignore for convenience), and the primary things contained in rows are lists.

There are really no huge surprises here:

  • A database has a list of columns, a stack of deleted rows (to recycle indexes) and a row count.
  • Adding a row to the database adds a row to all columns.
  • There’s a flag for whether a column is hashed or not; hashing enables fast lookups and requires unique data values.
  • There’s a basic column type that just stores byte arrays and a bunch built on top of it for storing things like link and reference vectors, string lists, strings and so on.
  • Any field in an item can be empty and all of our column data types have sane values for empty contents, which are returned if the data pointer for a row is set to null.

Column Locking

We’ve got two types of locks that we use, one surprisingly boring, one a little more creative.

The most important fields in the Directed Edge database are the links and references. They convey things like “which pages are linked on this website” or “which products has this customer bought” or “which news items has this user clicked on”. Those lists are of the type IndexVector within our database and have their own special kind of lock.

The rest of the columns use normal read-write locks. We use an array of 8 read-write locks per column. When a lock request comes in, the row’s index is hashed by shifting it by three bits, and the result picks which of the column’s locks to take. We found this in practice to be faster than keeping a table or tree of locks, since maintaining that tree would require a single read-write lock for the tree maintenance and contention for that lock was rather high. Non-IndexVector columns (strings, string lists, and binary blobs, mostly) copy the data while they hold the read lock and release it immediately thereafter, so the locks are held only momentarily.

So, to repeat that — using a simple array of locks per column significantly outperformed keeping a list of locked items, even if it means that items share their locks with many other items.
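
A stripped-down illustration of the idea, in modern C++ for brevity since std::shared_mutex makes the read-write part trivial; the exact shift-and-modulo hash here is an assumption, not our actual arithmetic:

#include <shared_mutex>
#include <string>
#include <vector>

// Lock striping for a single column: 8 read-write locks shared by all rows.
class Column
{
public:
    std::string read(unsigned int row) const
    {
        std::shared_lock<std::shared_mutex> lock(lockFor(row));
        return row < m_data.size() ? m_data[row] : std::string(); // copy while holding the lock
    }

    void write(unsigned int row, const std::string &value)
    {
        std::unique_lock<std::shared_mutex> lock(lockFor(row));
        if(row >= m_data.size()) m_data.resize(row + 1);
        m_data[row] = value;
    }

private:
    std::shared_mutex &lockFor(unsigned int row) const
    {
        return m_locks[(row >> 3) % 8]; // neighboring rows share a lock
    }

    mutable std::shared_mutex m_locks[8];
    std::vector<std::string> m_data;
};

The trade-off is that unrelated rows occasionally contend for the same lock, which, as noted above, still turned out cheaper in practice than maintaining a proper lock table.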

We’ll revisit IndexVector locks again in a little while.

Memory-Mapped I/O … and a Disk Based Malloc?

We’re up to my second lie now. See, I promised that I’d walk through the layers of abstraction from the top down and here we’ve gone ahead and skipped to the end. To understand the data structures that are built on top of the files, it helps to know where that data lives and how we manage it.

We experimented with a few ways of doing I/O, but the one that seemed to consistently win, as well as make life a better place, was doing straight memory-mapped I/O. For the non-system programmers among you, UNIX systems allow for files to be mapped to memory addresses. So if we’ve got an open file called “foo” and we call mmap on its descriptor, then we’re able to access the contents of “foo” directly based on the address returned by mmap — we can treat the file just like it’s an array of characters in memory.

This has a few advantages. One is that it lets the operating system effectively handle paging in data from the disk. UNIX also provides some handy functions like madvise() and msync() so that you can give the operating system hints on how to do that paging and when to write things back out to disk.

There’s also no overhead for accessing disk data since it’s mapped directly from the file into the process’s address space. Usually when you use things like fread() or fwrite() all of the data has to be copied into the kernel’s space, then over to the process’s space, and only then can you do stuff with it. By using mmap we avoid all of that copying mess and the overhead of switching between kernel and user space within syscalls, which is often significant.

There’s a good section on advanced I/O in O’Reilly’s Linux System Programming, which I stumbled across last week and gives a good introduction to mmap and other I/O strategies.
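
If you’ve never touched mmap, here’s roughly what that looks like: a bare-bones read-only mapping with an madvise() hint thrown in, minus the robustness you’d want in real code (the file name is just a placeholder):

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <cstdio>

// Map a file and read it as if it were an in-memory array of characters.
int main()
{
    const char *path = "foo"; // placeholder file name

    int fd = open(path, O_RDONLY);
    if(fd < 0) { std::perror("open"); return 1; }

    struct stat info;
    if(fstat(fd, &info) < 0) { std::perror("fstat"); close(fd); return 1; }

    char *data = (char *) mmap(0, info.st_size, PROT_READ, MAP_SHARED, fd, 0);
    if(data == MAP_FAILED) { std::perror("mmap"); close(fd); return 1; }

    // Hint to the kernel that we'll walk the file front to back.
    madvise(data, info.st_size, MADV_SEQUENTIAL);

    // No read() calls and no buffer copies -- pages fault in as they're touched.
    if(info.st_size > 0) std::printf("first byte: %c\n", data[0]);

    munmap(data, info.st_size);
    close(fd);
    return 0;
}

From the program’s point of view, data[0] through data[info.st_size - 1] is just memory; the kernel worries about when the underlying pages actually come off the disk.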

So, that’s what mmap does. How do we use it?

Well, one of the typical problems in deciding how to order disk records is how to manage disk fragmentation. As it turns out, that’s one of the problems that memory managers face when trying to decide how to allocate chunks of the system heap. So, we thought, since we’re treating the files we use as memory anyway, why not put two and two together?

What we ended up with is an implementation in our memory-mapped file class of malloc, realloc, calloc and free — the classic functions for memory management in C. We patterned our allocation on the algorithms used in glibc’s malloc, with some modifications for our applications. There’s a good, if slightly stale, introduction to them here.

Having those in place lets us allocate blocks of our files just like we’re allocating memory and handles file fragmentation rather elegantly. It also has the added bonus of letting us translate normal in-memory algorithms directly to disk based algorithms.
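
Here’s a deliberately dumbed-down stand-in to show the interface idea. The real allocator mirrors glibc’s malloc, with free lists and coalescing over the mmap’d region; this toy version just bumps a pointer in an in-memory buffer. The point is only that callers get offsets back, not raw pointers:

#include <cstddef>
#include <cstring>
#include <vector>

class MappedAllocator
{
public:
    typedef std::size_t Offset;

    explicit MappedAllocator(std::size_t size) : m_file(size), m_next(0) {}

    // Bump allocation only: no free list, no reuse, no bounds checks.
    Offset allocate(std::size_t size)
    {
        Offset offset = m_next;
        m_next += size;
        return offset;
    }

    void write(Offset offset, const void *data, std::size_t size)
    {
        std::memcpy(&m_file[offset], data, size);
    }

    const char *address(Offset offset) const { return &m_file[offset]; }

private:
    std::vector<char> m_file; // stands in for the mmap'd region
    std::size_t m_next;
};

Because everything hands back offsets rather than pointers, the records stay valid even if the underlying region gets remapped, and the same allocation logic works over a file as easily as over RAM.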

This structure lets us put arbitrary data into a file and gives us back an offset to point to where it’s at, which is exactly what we do for things like arrays of indexes and strings and so on, but we still need some way of referencing that data to make use of it in our column types.

File-based Data Structures: Stack, Vector, Linear Hash

Most of our data structures are pretty similar to their in-memory based counterparts. The similarity was in fact one of the reasons that we chose the mmap approach. We have three basic data structures that we use on disk to reference the blobs of data that are inserted and removed with malloc, realloc and friends.

When we’re writing these records to disk, we choose whether or not they get their own file by whether or not they can make use of sequential access. Notably, vectors almost always get their own file because they’re accessed sequentially and only expand at the end. It’d be a pain in the derriere if we had a 50 MB vector and there was no room immediately after it in the file, so we had to free that whole 50 MB chunk and move it to the end of the file leaving a hole where it had been before. So vectors get to live a solitary life off in their own files. Hashes and stacks on the other hand don’t have that problem, their elements tend to be of smaller sizes and more scattered, so they can be mixed in with data without worrying about them fragmenting the file too much as they insert and remove elements.

Stacks are simple and don’t merit much mention. We use it to keep track of deleted records so that we can recycle their indexes the next time we create a new row. Yawn.

Vectors, on the other hand, are some of the meat-and-potatoes of the Directed Edge workhorse: the things we use to keep track of offsets in data files. These are different from the similarly named IndexVector. In Directed Edge lingo, an Index corresponds to a unique item with fields and stuff, and an Offset just points to a place in a data file. These vectors are used for Offsets. Actually, these vectors are a template, so they can be used for any data type of any fixed size, but they’re most important for offsets.

This is another point that we break from typical database design theory. In a typical database you’d look up a record by checking for it in a B-tree and then go off to find the data pointer for its record, which might have a continuation record at the end that you have to look more stuff up in … and so on. Our lookups are constant time. We hash the string identifier to get an Index and then use that Index to find the appropriate Offsets for its data. These vectors are sequential on disk, rather than using continuation tables or something similar, which makes constant time lookups possible.

Note: Like all things in the real world, “constant time” really means, “constant time in the parts we see, under cooperative conditions, if the operating system doesn’t have to do anything like figure out where that disk record actually is, so probably not really constant time, but still about as fast as you can get.”
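
In code, the lookup path is about this boring — a hedged sketch with invented names, with std::map standing in for the on-disk linear hash described next:

#include <cstddef>
#include <map>
#include <string>
#include <vector>

typedef unsigned int Index;
typedef std::size_t Offset;

// identifier -> Index via one hash probe, Index -> Offset via one array read
// per column; the offset vectors are flat and sequential, like the on-disk ones.
class Lookup
{
public:
    Index indexOf(const std::string &identifier) const
    {
        return m_hash.find(identifier)->second; // assumes the item exists
    }

    Offset offsetOf(Index index, std::size_t column) const
    {
        return m_offsets[column][index];
    }

private:
    std::map<std::string, Index> m_hash;
    std::vector<std::vector<Offset> > m_offsets; // column -> per-row offsets
};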

Linear hashing is another important component of our lookup scheme. Linear hashing is a special sort of hash table that basically does dynamic expansion rather than having to reallocate the hash-slot array every time the data gets too big and futz around with reorganizing things. We also have a mechanism for letting our hash-table cooperate with our data files so that it handles inserting and removing data from them in addition to keeping track of where they are.
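
The defining trick of linear hashing is that buckets split one at a time rather than all at once, so addressing a key has to account for whether its bucket has already been split this round. That rule, on its own, is tiny:

#include <cstddef>

// level: how many full doublings have happened; next: the next bucket to split.
// A key lands in the "not yet split" region (mod 2^level) or, if its bucket has
// already been split this round, in the doubled table (mod 2^(level + 1)).
std::size_t bucketFor(std::size_t hash, std::size_t level, std::size_t next)
{
    std::size_t bucket = hash % (std::size_t(1) << level);
    if(bucket < next)
        bucket = hash % (std::size_t(1) << (level + 1));
    return bucket;
}

Most of the rest is bookkeeping: splitting the bucket that next points at, moving its records, and advancing next (and eventually level).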

IndexVector The Great and Its Team of Sneaky Locks

So, now we finally return to the IndexVector. Since you’ve probably dozed off a couple times in this pile of wordiness, I’ll remind you of its function: the IndexVector is what keeps track of links between items. “Bob purchased Bee Gees’ Greatest Hits” is a link between Bob and said album. Bob has an index and the album has an index. In Bob’s list of “links”, which is an IndexVector, we’ll put the index of Bee Gees’ Greatest Hits.

If you’re doing traversal-heavy graph applications, like, say, finding related content, how you deal with those links is where the rubber hits the road. Most of your performance lies in accessing these links.

So, since we’re using mmaped I/O we have no syscall overhead for grabbing this stuff from the disk. We also have no copy that must take place to get the data from kernel space to user space. Two points for us. But what we want to do is to Try Real Hard to make sure that we don’t copy it at all in the common case.

So our IndexVector class is really just a pointer to a place on the disk where it can read the data from and then it casts that data into indexes when we do things like someIndexVector[4]. Simple, right? Yeah, mostly. But there’s a catch.
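In sketch form, that read path is about this simple; the class and member names here are illustrative, not the real thing:

#include <cstddef>
#include <cstdint>

using Index = std::uint32_t;

class IndexVector
{
public:
    IndexVector(const void *mappedData, std::size_t size)
        : m_data(static_cast<const Index *>(mappedData)), m_size(size) {}

    // Reads straight out of the mmap'ed page cache -- no copy, no syscall.
    Index operator[](std::size_t i) const { return m_data[i]; }
    std::size_t size() const { return m_size; }

private:
    const Index *m_data; // points into the mmap'ed region (or, as described below, an in-memory copy)
    std::size_t m_size;
};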

You see, IndexVectors are sometimes changed. Like, if Bob buys another album, we’ve got to update that vector. To make things worse, sometimes they get bigger and get moved to a whole other part of the file where there’s room for them. Unlike our other data types (strings, binary blobs and all that), we access IndexVectors a lot, so we don’t want to just copy the data every time there’s an access. (We tried; it slowed the whole engine down by 40%.) So, we get sneaky.

You see, most of the time we can go about just willy-nilly accessing the IndexVector from its disk position and assume that nothing’s changing behind its back. We have a similar array of mutexes (8 of them, to be precise) for our IndexVectors. When a read request comes in, an IndexVector adds itself to a list of active readers and goes about reading stuff from the disk.

Now, the magic happens when a write lock comes in. We don’t want our writes or our reads to block on these Very Important Pieces Of Data, so what the write lock does is first copy the old data from that record into memory and then go through the list of active readers and do a swaperoo on the data pointer. The readers are now pointing to an in-memory copy of the data, rather than the disk-based version. It can then go on writing and moving and generally doing whatever it wants with the disk based copy. The lock also keeps a copy of the pre-write version of the data in memory so that all readers that come in and try to get a lock on that record get the in-memory version until the write is done.
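Here’s a sketch of that pointer swap, under the assumption of a hypothetical Reader bookkeeping structure (the real locking code is more involved, but the idea is the same): the writer snapshots the old bytes and repoints every active reader of that record at the snapshot before touching the file.

#include <cstddef>
#include <list>
#include <vector>

struct Reader
{
    const void *data;   // where this reader is currently reading from
    std::size_t record; // which record it is reading
};

void beginWrite(std::size_t record,
                const void *mappedData, std::size_t size,
                std::list<Reader *> &activeReaders,
                std::vector<char> &snapshot) // kept alive by the caller until the write finishes
{
    // Copy the pre-write bytes out of the file; readers arriving during the
    // write are handed this copy as well.
    const char *bytes = static_cast<const char *>(mappedData);
    snapshot.assign(bytes, bytes + size);

    // Repoint every active reader of this record at the in-memory snapshot so
    // the writer is free to modify or even move the on-disk version.
    for (Reader *reader : activeReaders)
        if (reader->record == record)
            reader->data = snapshot.data();
}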

This leaves us with somewhat more lax consistency than one would expect from an ACID-compliant database, but for our applications that’s fine. Like, if our recommendations-fu has missed the item that you rated a couple milliseconds ago when computing product recommendations for the next page, no planes will crash and no babies will be harmed. On the plus side this means that the only case where a read or write on IndexVectors will have to wait is in the case of two threads trying to write to the same item.

Them’s the real guts of our database, but I’ll leave you with a couple provisos for extra credit:

Every Bit Is Sacred

Conventional wisdom in modern programming is not to get hung up on using a few bytes here or there and just get on with writing your code. And it’s right.

But when you’re working on low-level, high-performance I/O, every bit is sacred. The reason is that every extra bit is extra data, extra data means extra page misses, and extra page misses mean more waiting on the big spinning magnet that is the bottleneck in almost everything that touches it. These things also come out of the woodwork when you’re multiplying every extraneous bit by several million. If we were to increase the amount of space used for one of our indexes by just one byte, storing the Wikipedia link structure would take 114 more megabytes.

So we get a little trigger happy with our friendly neighborhood binary operators. We combine weights and indexes into the same field, using the upper 4 bits of the item index to store the weight. Our malloc implementation uses 8-byte alignment, so we use the extra 3 bits at the bottom end of the address for flags. By thinking up clever ways of putting the squeeze on data we have databases that are 40% smaller, which means a lot fewer page misses and a much faster database.
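For instance, here’s a sketch of that kind of packing; the exact masks and widths are illustrative rather than our real layout:

#include <cstdint>

// Weight in the top 4 bits, item index in the lower 28.
constexpr std::uint32_t indexMask = 0x0fffffff;

constexpr std::uint32_t pack(std::uint32_t index, std::uint32_t weight)
{
    return (weight << 28) | (index & indexMask);
}

constexpr std::uint32_t indexOf(std::uint32_t packed)  { return packed & indexMask; }
constexpr std::uint32_t weightOf(std::uint32_t packed) { return packed >> 28; }

// 8-byte-aligned offsets always have their low 3 bits clear, so those bits
// are free to carry flags.
constexpr std::uint64_t flagMask = 0x7;

constexpr std::uint64_t setFlags(std::uint64_t offset, std::uint64_t flags)
{
    return (offset & ~flagMask) | (flags & flagMask);
}

constexpr std::uint64_t offsetOf(std::uint64_t tagged) { return tagged & ~flagMask; }
constexpr std::uint64_t flagsOf(std::uint64_t tagged)  { return tagged & flagMask; }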

What About Matrices? I Thought Recommender Systems Were Based On Linear Algebra.

The difference between a weighted graph and a sparse matrix is mostly one of visualization rather than of data. We conceptualize a matrix as something different from a graph: a matrix is a huge field of rows and columns, a graph a sprawling series of interconnections. But in fact, viewed from a data perspective, a sparse matrix and a weighted graph are almost the same thing and are easily converted from one to the other.

So, if we’re thinking of a user’s product ratings as a sparse vector that maps to each product, we can easily model that as a series of weighted edges from the user to each product. In fact, it’d be difficult to produce a tighter format for storing a sparse matrix than a list of connections since we’re able to avoid things like pointers to the next used vector dimension and munge the weights into the same integer value that we use to represent a graph edge.
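Viewed as code, the two representations are nearly interchangeable. Here’s a sketch with made-up type names, showing the same ratings both as a sparse matrix row and as a list of weighted edges:

#include <cstdint>
#include <unordered_map>
#include <vector>

struct WeightedEdge
{
    std::uint32_t target; // column index, i.e. the neighboring item
    std::uint8_t weight;  // the rating
};

// "Matrix" view: one sparse row per user, mapping item -> rating.
using SparseRow = std::unordered_map<std::uint32_t, std::uint8_t>;

// "Graph" view: the user's list of weighted edges -- exactly the same pairs.
using EdgeList = std::vector<WeightedEdge>;

EdgeList toEdges(const SparseRow &row)
{
    EdgeList edges;
    edges.reserve(row.size());
    for (const auto &entry : row)
        edges.push_back({entry.first, entry.second});
    return edges;
}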

Beyond that, a graph model allows us to do some wilder clustering than you’d typically get if you’re imagining the ratings set as a matrix. We can look at things that are not just one, but a couple of levels out in the connection structure to figure out how things are related. This is one of the mechanisms we use to mitigate the “cold start” problem that is endemic to typical collaborative filtering algorithms.

Al fine

I’m tired of writing, you’re tired of reading. If you’ve got questions or have The Killer Application in mind for such a beast, feel free to drop us a line via comments or mail or ping us on Twitter.

Updated API and Database Live

We’re excited that the new API is finally live.  Most of the important new stuff is going on behind the scenes with the latest version of our graph database, but here’s a rundown of the new goodies:

  • Full read-write support for all items, rather than having to upload full database dumps. This includes methods for adding and removing tags or links as well as rewriting items. Items can also be added or removed from the database.
  • Better results for obscure or tag-specific searches.  You can see an example of this in our last blog post.
  • Faster results.  Our new database is several times faster than our already speedy previous version.
  • Max results parameter in the search results, configurable from 1 to 100 results.
  • “Ignore linked” option to avoid returning recommendations for items a user has already interacted with — i.e. items they’ve already purchased or clicked on.

The bulk of the effort in the last months has been getting our graph database up to snuff as a proper database.  We’ve had a lot of questions about our graph database, so the next blog entry will be full of technical information on it.  We’ll also now be creating new accounts again, which we’d been holding off on for the last few weeks as they’d need to be moved over to the new database format.  All in all the upgrade went as planned and services were down for only around 5 minutes.

You can view the updated API documentation here or as a PDF.

We’re already at work merging in the next batch of features, which will add special elements for common use cases (e.g. ratings, purchases and click-tracking) and will also bring a more advanced recommendations model that’s been brewing in one of our experimental development trees.

Interview with Intruders.tv

Thanks to Jof and Vincent from Intruders.tv for doing the interview!

Twitter Recommendations Cloud

Gleaned from the RSS feed for a few weeks plus Wordle. Here’s one with a hand-created list of stop-words to try to get out some of the irrelevant stuff. Gives an interesting look at what people are asking for recommendations for on Twitter:

Twitter Killed the Radio Star, Logo Search

Putting the Micro- in Blogging:

 

We’ve been slack on blogging of late. The Twitter bug finally bit the two of us that founded Directed Edge. I finally got around to setting up an “official” (meaning, “spared of our random personal musings”) Directed Edge Twitter account today:

http://twitter.com/directededge

You can find the two founders’ streams linked there.  But have no fear, there will be more good old-fashioned blogging action on the way.  I’ve got posts in the pipe on some of our experiences in picking a CRM system and some of the optimization work on our graph storage engine that should be coming along in the next week or so.

The time for a logo is nigh:

 

Yes, that’s right, we need a logo.  We’ve all run out of business cards, so before putting in an order for a new batch, I’ve decided to revisit the logo situation.  We’re looking for something simple and business-y that can also be reduced to line art for print.  Particular colors aren’t terribly important since the website needs a revamp too.  If you or somebody you dig feels moved to send us a bid (with a link to a portfolio), please do.

Collective intelligence, BarCamp and the Berlin Web Week

I’ll be doing a session tomorrow or Sunday at BarCamp Berlin (currently the most upvoted presentation topic).  I’ll cover some of the basics of modern recommendation systems, including basic categories of algorithms and why recommendations are important for modern web applications.  Dave Sharrock and Garik Petrosyan from Be2 (dating site) will be co-presenting, talking about some of the things they’ve had to overcome in building a scalable match-making system.

Valentin will be doing a session on multi-lingual blogging.

We’ll also be at the following events in the Berlin web week:

Drop us a line if you’d like to meet and talk about recommendations and what they can do for your site!

Fuzzy search, related articles AJAX widget.

This one was requested a few times.  We’ve thrown together a simple AJAX widget for showing related articles on other sites.  We’ll add a little more bling later on, especially once we get some more data sets out there.

The idea is this:  You have a site that has information about music, or movies or airplanes or whatever, and you’d like to add links to related encyclopedia articles to your page with just a couple lines of customization.

To make that work as one would expect we also added fuzzy search.  That’s important because, for instance, the article for “Madonna” in Wikipedia is titled, “Madonna (entertainer)”.  So, now with our fuzzy searching you can search for the closest hit to a term with a specific tag, in this case “Madonna” tagged as “musician”.

The list of available tags is here.

The new AJAX widget is trivial to use.  Here’s what it looks like.  If you click it’ll take you to a page that you can use to copy-paste from.

You can get that on your page by adding two lines to your page / template:

<script src="http://pedia.directededge.com/RelatedArticlesWidget.js" type="text/javascript"></script>

<div class="RelatedArticles" title="Music Related to James Brown" fuzzy="James Brown" fuzzytag="musician" tags="musician"></div>

All items that are returned use the “RelatedArticles” class, so you can style them in your stylesheet as you see fit.  These are the supported attributes (only topic or fuzzy is required):

  • topic: A specific (exact) Wikipedia article name.
  • fuzzy: Find the nearest article name.
  • fuzzytag: Modifies the fuzzy search so that it looks for an item with that tag.  Note: this does not affect the related articles; for that, use the following attribute.
  • tags: The tags for the returned articles.  For example, using a fuzzytag for James Brown will specify that the main article should be a musician named James Brown, but specifying musician for tags will ensure that all of the returned results are also musicians.
  • title: The title to be used for the list.  This defaults to Related Encyclopedia Articles.
  • prefix: The prefix to be used for the links.  This defaults to http://pedia.directededge.com/article/, but you could also specify http://en.wikipedia.org/wiki/ to link to the English Wikipedia.

If you decide to link to Wikipedia or somewhere else using the last attribute, we’d request that you at least give us a mention in your company / project’s blog somewhere.

Enjoy and feel free to drop us questions or comments!

Directed Edge interviewed for The Next Web

Thanks to Ernst from The Next Web for publishing an interview with us.

Web Services Live, Documented, Secured.

So in the space since the last blog post we’ve been working on getting everything squared away for our commercial web services API.  It’s now running live at webservices.directededge.com.  There’s some documentation up on how the REST API works there.  I also went through the hoops of moving over from our self-signed certificate to a proper certificate this week; I’d forgotten how much of a pain those can be to deal with.

If you’ve got any questions about the API or things that you’d like to do with it that don’t seem to be supported at the moment, we’d like to hear from you!

BarCamp Berlin Registration is Open

 

Directed Edge will be there.  Will you?

Greasemonkey script.

We’re very grateful that one of our users knocked out one of the items on our to-do list and created a Greasemonkey script for showing related articles on Wikipedia.  If you have Greasemonkey installed in Firefox you can just click on “install script” on this page to get related articles without being logged in to Wikipedia.

Per the comments on that page, we will start rolling out our Wikipedia demo in other languages probably about a week from now.

Related Pages on Wikipedia via our web services API.

So, we think this is pretty cool beans.  When we did our demo with a mashup of Wikipedia’s content we knew that we wanted something that potential customers could quickly look at and get a feel for what our recommendation engine is capable of, and we got a lot of good feedback about that in our recent technology preview.  On the other hand, we knew that we weren’t going to get the masses to switch over to using our Wikipedia interface.

One of the open questions for us as we pushed out the first bits of our web-services API  last week was, “Can we get this content to show up in Wikipedia proper?”

Last night after an extended hacking session where I tried a number of strategies for doing DOM scripting to pull in external content (and some misadventures in trying to do cross-site XMLHttpRequests) I managed to come up with a simple way of pulling in content from our web service via JSONP, and added support for JSON output to our web service along the way.  For Wikipedians that are logged in, it only requires adding one line to your monobook.js file and I’ve created a short how-to here.  The source code, for interested hackers is here.

Here’s what it looks like:

When we launched our demo a few people didn’t seem to quite get what our engine is doing: we’re not just analyzing the current page and pulling in a few important links; we’re jumping out a few levels in the link structure and analyzing and ranking usually several thousand links in the neighborhood of the target page.  Often those pages are linked from the target page, but that’s hardly a surprise.  I come from a background of doing research in web-like search, so it’s no coincidence that our approach to finding related pages takes some hints from PageRank and other link-based approaches to sorting out the web.

We’d invite people to try this out and of course to keep playing with our mashup; we’ve gotten so used to having related pages that it’s hard to go back to the vanilla Wikipedia — having the related pages there makes it really easy to sort out things like, “What are the important related topics?” or “Well, I know about X, what are the main alternatives?”  And so on.  We’ve got some other exciting stuff up our collective sleeves that we’ll be rolling out in the next couple of weeks, so stay tuned!

API, Part II: Tags

Work on the web services API for the encyclopedia continues, now with tags.  Here’s a quick rundown:

You can get a list of supported tags here:

http://pedia.directededge.com/api/v1/tags/

That currently returns:

<?xml version="1.0" encoding="UTF-8"?>

<directededge version="0.1">
  <tag>actor</tag>
  <tag>author</tag>
  <tag>book</tag>
  <tag>company</tag>
  <tag>film</tag>
  <tag>musician</tag>
</directededge>

You can then get results from article queries based on a tag, using something like this:

http://pedia.directededge.com/api/v1/article/KDE/tags/company/

Which returns:

<?xml version="1.0" encoding="UTF-8"?>

<directededge version="0.1">
  <item id="KDE">
    <link>Trolltech</link>
    <link>Novell</link>
    <link>Hewlett-Packard Company</link>
    <link>Nokia</link>
    <link>World Wide Web Consortium</link>
    <link>Mandriva</link>
    <link>Canonical Ltd.</link>
    <link>Sirius Satellite Radio</link>
  </item>
</directededge>

You can query any article for any tag (unlike in the web interface).  Right now the results for “off topic” tags tend to be hit-or-miss.  One of the other big items on our to-do list is improving tagged results in our engine.

I’m posting incremental updates like this in the hopes that if you’re planning on using our API in a mashup that you’ll let us know what you like and don’t like before we freeze v1.

We’ve also decided on a couple of limitations for the open API that aren’t true for our commercial API (running either on customer data sets or open data sets):

  • You’re limited to 10 results.
  • You can only filter on one tag at a time, meaning, you can’t get ranked results for movies and music simultaneously.

We think those are pretty reasonable and still give users a fair bit of room to play for free.  If you’re interested in using our commercial API, drop us a line!  We’ve also just created an announcement list where we’ll notify folks that are signed up of important details.  You can sign up for that here.

First Encyclopedia API bits up.

This will still definitely be in flux, but I started getting parts of the REST API up if folks want to play with it.  Warning:  the format may change.

You can now hit something like:

http://pedia.directededge.com/api/v1/article/KDE

And get back:

<?xml version="1.0" encoding="UTF-8"?>

<directededge version="0.1">
  <item id="KDE">
    <link>GNOME</link>
    <link>Unix-like</link>
    <link>Desktop environment</link>
    <link>Konqueror</link>
    <link>Qt (toolkit)</link>
    <link>KDE 4</link>
    <link>GNU Lesser General Public License</link>
    <link>X Window System</link>
    <link>KPart</link>
    <link>Widget toolkit</link>
  </item>
</directededge>

I’ll be adding support for JSON output and filtering based on tags in the next few days.  Once I’ve got a set of features there that I consider feature complete I’ll freeze the “v1” so that people can create mashups based on that and be sure that the API will remain stable.

This does do capitalization correction, but does not do redirect detection.  I’m debating whether I want to do that by default or use another REST path, since it requires another couple of DB queries and is as such a little slower.

Toy of the Day: FeedMySearch

Like any new startup co-founder, I’m obsessive about seeing how what we’re doing trickles out over the web.  Being an RSS warrior, I went looking today for a Google-search-to-RSS converter and found FeedMySearch, which, a few hours into using it, seems to do quite well at pulling in new information about Directed Edge as it hits Google’s indexes.

FeedMySearch for directededge.com in Thunderbird

Nifty tool.  Anything that stops me from compulsive reloading is a win.  Now back to implementing new features.  🙂

Directed Edge Launches Recommender Engine Public Beta!

It’s an exciting day for us at Directed Edge.  Today we’re finally putting our Wikipedia-based technology preview out there for people to play with.  Before you click over to it, here’s a little about what you’re looking at.

As our name implies, we’re graph theory nerds.  We look at the roughly 60 million links between the 2.5 million English Wikipedia pages, and with a few extra cues from the content, figure out  pages related to the current one and put that in a little box in the upper left (as evident from the image on our home page).  In some cases, if we’re able to pick out what sort of page it is, we also drop in a second box with just other pages of the same type.

Finding related pages in Wikipedia isn’t fundamentally what Directed Edge is about.  We’ve got a super-fast in-house graph storage system that makes it possible to do interesting stuff with graphs quickly, notably figure out which pages are related.  We’ve already got a couple of pilot customers lined up and will be working with more in the next weeks to analyze their sites and figure out how things are related there.  We’ve got a prototype of our web-services API that they’ll be using to send us break-downs of how stuff’s connected on their site and we’ll send back what we hope are some pretty groovy recommendations.

There are dozens of things in the pipe for us:  ways to make recommendations better, ways to make the Wikipedia demo cooler, things customers want to see in our web services that we’d previously not thought of, and we could ramble on like that for a while, but there are a few things on the very near horizon that didn’t quite make it into this round:

  • An open web-services API for accessing the recommendations from our Wikipedia demo.  This will be a stripped down, read-only version of our commercial API usable in web mash-ups.
  • Better tagged (i.e. music, movies, authors, companies) recommendations.  Support for tagged articles was one of the last features that made it into the current demo, and we’ve got some plans for improving the results for those.
  • Pulling in non-English Wikipedia variants.  We’ll probably start with German and French.
  • More info about our commercial web-services API.  We’re still nailing down some of the details, but as soon as we freeze the API for the first customers, we’ll add more docs to the site.

If you subscribe to our news feed you’ll see immediately when those services go live.  Even though we’re still in the beta-phase and are only accepting a limited number of customers, if you think you’d be interested in using our engine for your site down the line, we’d encourage you to register now since we’ll be offering a discount for our commercial services to everyone who fills out their info in the contact form during the beta phase.

More soon.  Enjoy!

Launch date, August 13th.

We’ve now committed to going into public beta / technology preview next Wednesday, August 13th.  We’ll be launching our new site with more information about our products and services at that time.

Press / bloggers may request invites by sending us a mail.  It’ll be an exciting next few days as we iron out the last kinks and get ready for the onslaught.

The website will be a bit in flux, but our bio info is still available here.

Edit: If you’ve been testing the demo previously the location has changed.  Drop us a line for the new URL.

Seedcamp Berlin

 

Valentin and Scott at Seedcamp Berlin

I stumbled across this photo of Valentin and me getting our questions answered about incorporation style at Seedcamp Berlin.  I must say that was one of the more useful events that we’ve been to so far: it was a great chance to get in touch with an impressive collection of mentors and to network with other (mostly) German startups.

Berlin: OpenCoffee / Business & Beer this week

There will be another update on Directed Edge stuff soon-ish, but I just wanted to get out a list of the upcoming Berlin events where we and other Berlin startups will be talking shop and doing demos:

Next round of events.

Just finished up attending the last of the events in the recent post and thought I’d mention the next batch on the Directed Edge calendar:

I’ve got two weeks left until I’m full-time on Directed Edge and it’s looking to be an exciting time.  Like most startups stumbling towards launch, there are a thousand threads being followed up simultaneously, the last couple of weeks more business than technical.  We’re going to make our demo progressively more open during the next weeks, but the feedback we’ve gotten from the first testers is encouraging.

Not-exactly-TechCrunched.

But I did make it into the video feed from the Prague TechCrunch event. There’s about a half-minute pseudo pitch in here. Lessons learned from watching it back? Loosen up.  Get to the point faster.  Like in 10 seconds.  (It isn’t showing up in the syndicated version; click through to the full post.)