Think Vitamin on beyond relational databases

There’s a new article up on Think Vitamin titled, “Should You Go Beyond Relational Databases” discussing the emergence of a crop of non-relational databases.  It’s one of the better introductions that I’ve seen to the landscape of document-oriented databases, graph databases and key-value stores that I’ve run across.  There’s a brief mention of Directed Edge in the graph database section:

There is less choice in graph databases than there is in document databases: Neo4j, AllegroGraphand Sesame (which typically uses MySQL or PostgreSQL as storage back-end) are ones to look at.  FreeBase and DirectedEdge have developed graph databases for their internal use.

Graph databases are often associated with the semantic web and RDF datastores, which is one of the applications they are used for. I actually believe that many other applications’ data would also be well represented in graphs. However, as before, don’t try to force data into a graph if it fits better in tables or documents.

Our own database is quirky hybrid of a graph-database, key-value store and column oriented database — the graph is a collection of interconnected items, which can have arbitrary meta-data associated with them, with column-oriented physical organization.  In fact, rolled into the next batch of updates that we’re testing at the moment are some more groovy features for enhanced item meta data like images and stuff.  We hope to go live real soon with that and The Other Cool New Features.

PHP bindings are starting to shape up.

By popular demand we’ve been banging out some PHP bindings that mirror the Ruby bindings that appeared along with our last web services update and the new developer site.

The API is basically the same adjusted for PHP conventions. They’re still a work in progress and subject to change, but we’d love feedback from folks more steeped in the world of PHP than us.

So, up on ye olde Github, our PHP bindings!

Webservices Update, Developer Site Launch

There’s been a lot cooking behind the scenes of late. Secretly the newest version of the webservices have been up and running for about a month. First we wanted to update the docs. Then they began growing. And there was a tutorial. And more examples. And the new Ruby bindings were shaping up.

Before we knew it, we had a full-blown developer site in the works. So, folks, that’s what’s on the menu today:

developer.directededge.com

http://developer.directededge.com/

There’s a bunch of stuff in there we’re excited about.  There’s a nice intro to some of the concepts that folks sometimes get hung up with when starting to work with the API. The API Docs have been improved and expanded. We’ve got examples out the wazoo. The docs for the Ruby bindings live there now and there’s a step by step tutorial on using them and a sample implementation for using them and pulling the data out of an existing MySQL database using ActiveRecord.

Whew, but that’s just the stuff about the webservices, what’s actually new in the webservices?

  • Separate methods for related and recommended queries giving much better personalized recommendations. This is the biggie (despite its humble packaging).
  • Weights (aka explicit ratings) are finally supported.  DB’s had them for a while.  Finally popped them up to the webservices level.  There’s more in the pipe there, but we’ve got to leave some excitement for the next update.
  • “excludeLinked” basically goes hand in hand with personalized recommendations; doesn’t recommend stuff you’ve already seen / bought / rated.
  • add / remove for incremental updates.  Means that you can add a tag, link, or whatever to an item without having to pull down the item or worry about consistency.
  • Lots of bug-fixes and performance enchancements. Not that our code wasn’t perfect before.  Ahem.

So, dig in, let us know what content is missing on the developer site and enjoy.  Hopefully the next batch of features will get pushed out just a wee bit faster.  We’ve also been slowly expanding the beta program and setting up accounts for more and more sites, so if you’ve got something interesting you’d like to bang out with our API, we look forward to hearing from you!

C and C++ are not the same language.

In the web world, a world I’ve made my way back to in the last year, pointers are an exotic relic of the past.  Seriously.

I never got the memo; there was no going away party, I was just sitting there, minding my own structs and boom, next thing I know, I was that guy that hadn’t realized that those mallocs are so 90s.

But there is a certain reverence for the old ways.  You’re supposed to learn C or C++ (often just written as “C / C++”) to be a well rounded dork, much like you’re supposed to have read Shakespeare.  This is all fine and well, and I agree.  But there’s one sticking point.

C and C++ aren’t the same language.

Let’s first take a tour of the major reasons that are given for learning C or C++:

  • You learn how to do manual memory management
  • You get “closer to the hardware”
  • They offer performance benefits over most languages
  • There’s a wealth of code written in both that you may want to understand

All of these are true in both languages.

However, learning to think in C versus C++ is a completely different story.

Modern C++ is usually written in an object-oriented style.  Even modern C is usually written in a pseudo-object-oriented style using opaque types.  But the mechanisms for working with the two vary widely.  Notably, in C you have to build up the higher level abstractions on your own that are built into the language in C++.

It’s beyond the scope of this entry to get into the many differences between the two, but I decided to implement a common pattern in programming, the observer pattern in C, C++ and Java to illustrate the differences.

Observer Pattern in C:

#include <stdlib.h>
#include <stdio.h>
 
typedef void(* FoomaticListener)(void);
 
struct Foomatic
{
    FoomaticListener *listeners;
    int listener_count;
};
 
struct Foomatic *foomatic_create(void)
{
    return (struct Foomatic *) calloc(1, sizeof(struct Foomatic));
}
 
void foomatic_destroy(struct Foomatic *foomatic)
{
    free(foomatic->listeners);
    free(foomatic);
}
 
void foomatic_add_listener(struct Foomatic *foomatic, FoomaticListener listener)
{
    int count = ++foomatic->listener_count;
 
    foomatic->listeners =
        (FoomaticListener *) realloc(foomatic->listeners,
                                     sizeof(FoomaticListener) * count);
 
    foomatic->listeners[count - 1] = listener;
}
 
void foomatic_activate(const struct Foomatic *foomatic)
{
    int i = 0;
 
    for(i = 0; i < foomatic->listener_count; i++)
    {
        (*foomatic->listeners[i])();
    }
}
 
static void first_listener(void)
{
    printf("Whoopee.\n");
}
 
static void second_listener(void)
{
    printf("Whoopee.\n");
}
 
int main(void)
{
    struct Foomatic *foomatic = foomatic_create();
 
    foomatic_add_listener(foomatic, first_listener);
    foomatic_add_listener(foomatic, second_listener);
 
    foomatic_activate(foomatic);
 
    foomatic_destroy(foomatic);
 
    return 0;
}

Observer Pattern in C++:

#include <set>
#include <iostream>
 
class Foomatic
{
public:
    class Listener
    {
    public:
        virtual void activate() = 0;
    };
 
    void addListener(Listener *listener)
    {
        m_listeners.insert(listener);
    }
 
    void activate()
    {
        for(ListenerSet::const_iterator it = m_listeners.begin();
            it != m_listeners.end(); ++it)
        {
            (*it)->activate();
        }
    }
 
private:
    typedef std::set<Listener *> ListenerSet;
    ListenerSet m_listeners;
};
 
class FooListener : public Foomatic::Listener
{
public:
    virtual void activate()
    {
        std::cout << "Whoopee." << std::endl;
    }
};
 
int main()
{
    Foomatic foomatic;
    FooListener first;
    FooListener second;
 
    foomatic.addListener(&first);
    foomatic.addListener(&second);
 
    foomatic.activate();
 
    return 0;
}

Observer Pattern in Java:

Foomatic.java

import java.util.Set;
import java.util.HashSet;
import java.util.Iterator;
 
public class Foomatic
{
    private Set<Listener> listeners;
 
    public interface Listener
    {
        public void activate();
    }
 
    public Foomatic()
    {
        listeners = new HashSet<Listener>();
    }
 
    public void addListener(Listener listener)
    {
        listeners.add(listener);
    }
 
    public void activate()
    {
        Iterator<Listener> it = listeners.iterator();
 
        while(it.hasNext())
        {
            it.next().activate();
        }
    }
}

Callback.java

class FooListener implements Foomatic.Listener
{
    public void activate()
    {
        System.out.println("Whoopee.");
    }
}
 
public class Callback
{
    public static void main(String [] args)
    {
        Foomatic foomatic = new Foomatic();
        FooListener first = new FooListener();
        FooListener second = new FooListener();
 
        foomatic.addListener(first);
        foomatic.addListener(second);
 
        foomatic.activate();
    }
}

All three of these do the same thing (with the one proviso that the C version uses a simple list rather than a set for concision).

The first thing that should jump out at you is that two of these look very similar.  And it’s not the C and C++ versions.  Modern C++ is much more similar to Java than it is to C.  Learning to think in C++ is much closer to learning to think in Java.

In the C++ and Java examples a callback is achieved by defining an interface with an abstract method that’s implemented in a concrete subclass.  In C a function pointer is used.

Now, let’s get back to the root of the confusion.

C++ is backwards compatible to C.  Good C++ developers, especially those doing systems programming, tend to be familiar enough with the important parts of C to exploit its lower-level primitives and pack them into an object oriented structure.

But someone working on, say, GUI development in C++ might never come into contact with the C underworld.  You can be a goodly C++ developer and never use function pointers, rarely use macros (and then rarely in a particularly interesting way) and most of all be completely clueless on how to define a reasonable C API that does proper encapsulation.

Really knowing a programming language is much more about knowing how to wield it to solve problems rather than being able to write code that the compiler doesn’t barf on.

Let’s look back at the goals we had for learning one of these languages:

You learn to do manual memory management

In C++ memory management is often less opaque and handled by abstraction layers.  We didn’t have to think about memory being allocated to insert an element to a set or free it when we were done.  That was handled by the language and its standard library.  This holds somewhat generally for C++, so I believe if your goal is educational — simply to learn about memory management, C is probably closer to fitting the bill.

You get closer to the hardware

Again, C is probably a win.  Not because you can’t do systems programming in C++ (in fact, that’s most of what I do) but because when doing systems programming in C++ it tends to come out looking like blocks of C neatly organized into classes.  The bulk of the code that you can also use as a learning reference (glibc and the Linux kernel are both good here) is written in C.

They offer performance benefits over other languages

This is true for both, but C forces most of the time-consuming stuff to pass in front of your eyes.  There’s less magic happening behind the scenes.  When writing performance critical C++ understanding that it’s built on the same runtime as C is useful for understanding what’s actually happening when you call a virtual function.  (Answer:  Classes which have virtual functions have a “vtable” that’s created by the compiler which is simply an array of function pointers.)

There’s a wealth of code written in both that you may want to understand

This naturally has less of a clear winner.  C tends to be more dominant at the lower levels, C++ creeps in closer to the middle of the technology stack.  Systems libraries and kernels are usually written in C, things like web browsers and office suites are more often C++.

But wait … so I said all of that nice stuff about C, why do I still prefer C++?

If your goals are purely educational, C is probably a better fit.  At an atomic level it’s harder to understand, but there’s much less of it to understand.  C++’s syntax is everything from C plus a myriad of advanced compiler-fu that takes a good long while to get your head around.  Many C++ programmers who have been working with the language for half a decade still couldn’t tell you how to use partial template specialization.

But if you’re writing real code — something you’re going to be curling up with night after night until unemployment do you part, I’ll take C++ any day of the week.

If you want to learn about the innards of programming, filleting a program in C teaches you how.  If 95% of the time, you’d like to have that handled with the abstractions you’re working with, classes, templates, exceptions and other modern encapsulation mechanisms supported by C++ make working on a large code-base more palatable.  I’ve been writing C for 15 years and in the first version of the C example above, I forgot to free the list of function pointers.  C++ is also more concise and expressive.

Now, anticipating the reaction of the high-level crew, aren’t most of the arguments that I just made for C++ even more true for, say, Python, or Ruby?

Of course.  But C++ often hits a sweet-spot between languages where high-level abstractions are available with the raw power of C when you need it.

At Directed Edge we use a mix of languages, and even a mix of C++ styles, trying to hit their comparative sweet-spots.  The rough break down is:

  • Database:  C-ish C++.  Looks like C packaged into classes and mostly is.  Binary operators for the win.
  • Engine:  More “pure” C++.  Less low-level gobblety-gook.  Works mostly with abstractions built up in the database level.  Optimized like it’s nobody’s business.
  • Web Services:  Java.  It’s hardly the cool-kid on the block these days, but does well with throughput and fault-tollerance (which C and C++ are bad at).
  • Web Front-ends:  Ruby.  Horrible performance, but rapid development time.

“We think you’d also like…” and the Math of Suggestion - Part 2

Part two of my article on “The Math of Suggestion” went up today on Gründerszene, and here’s a link back to Part 1 to to refresh your memory.

On the Road: Silicon Valley and Boston

The two of us that founded Directed Edge will be making our first trip together to Silicon Valley later this month and I’ll be in Boston in June and might do a quick trip out to New York.  If any of the fine folks watching this space is in one of those places and would like to meet or send some event tips in our general direction, drop us a line!

“We think you’d also like…” and the Math of Suggestion – Part 1

Part I of my article on “The Math of Suggestion” just went up on Gründerszene covering the basics of user-based collaborative filtering just went up.  Part II will cover item-based collaborative filtering and some of the basics of graph based approaches.  We’ll post it here when it goes live.

Greg Linden on “What is a Good Recommendation Algorithm?”

Greg Linden is one of The Really Smart People in the world of recommendations.  He’s got an excellent post on the newly launched ACM blog.

I’ve had a draft of an essay sitting around for a while titled, “What’s wrong with recommendations?”  Greg hits on one of the key points that I’d intended to cover:

In the end, what we want is happy, satisfied users.  Will a recommendation engine that minimizes RMSE make people happy?

Check out the full post here.

Announcing TweetRex: Twitter friend recommender built on the Directed Edge recommendations engine

So a couple weeks back we decided to have a little fun with the Twitter API and built a friend recommender using the Directed Edge engine. What was originally going to be a two day project that our friend Björn Günzel and I wanted to bang out turned, as two day projects are wont to do, into a 10 day project.

How hard could it be?

It turns out that spidering the Twitter graph in real time presented some interesting challenges for our database. This was the first extremely write intensive application we’d tested: one single recommendations run can often trigger adding thousands or hundreds of thousands of items to our database as we do the progressive spidering.

To make things more fun, with all of those requests out to the Twitter API, the application needed to be heavily multithreaded — often using up to 500 concurrent threads split between spidering Twitter, computing recommendations and handling incoming requests.

Sinatra and new database features:

We wanted to build this application using only the Directed Edge database — no messy ORM layer and all that jazz.  Sinatra seemed like a nice, low-fat alternative to Rails.

The new stuff in our database lets it act as a full-blown key-value store with support for storing arbitrary data with arbitrary mime-types in the DB.  Right now we’re just using this for Twitter profile caching.  In the next update of our database / web-services it’ll be possible to use the database just like webdav assigning images or text to graph-nodes using a simple HTTP put and immediately retrieving them.  This opens up some doors of possibilities to being able to do something like, “Show me pictures of all related items.”  More on that once those features make their way into the released version of the web-services.

Go easy on her, gents.

We’ll see if the app survives its relatively gentle launch.  Right now we hit the Twitter API quite hard — such that in fact we have to limit the number of concurrent users using TweetRex or else our Twitter queries start timing out and recommendations take ages to generate and are incomplete.  You’ll get a little message if we’re overloaded and it’ll automatically move on to the log-in sequence once things are live.

So, what does it do?

TweetRex looks through the Twitter graph for people that are in your “neighborhood” that you’re not following.  The results get better as we spider more of the Twitter graph, so check back in from time to time.  There’s a little message box that pops up asking if we can post a tweet in your stream.  We’d love it if you’d oblige to help us get the word out about the app.

This is just the first beta.  We’ll be adding some more features over time, and are looking forward to your feedback!

And now, without further ado, I give you TweetRex:

http://tweetrex.directededge.com/

Bindings project on GitHub, Starting with Ruby Bindings


We just uploaded the first version of a simple Ruby binding to our API to Github. The code is BSD licensed, so you can do pretty much anything you want with it. We’ll be adding a few more features to the Ruby version in the following days and eventually follow with similar bindings for other languages.

Looking forward to your feedback!

On Building a Stupidly Fast Graph Database

It’s pretty clear to computer science geeks that Directed Edge is supposed to be doing groovy things with graphs. In fact our recommendation engine, and some of the things that are unique about our approach to recommendations, are built on our super-fast graph database. When we went live yesterday with the latest version of our recommendations web services, another, much bigger thing happened behind the scenes for us: we cut over to the new version of our graph database.

Every time that Directed Edge gets mentioned in nerdier circles we get a lot of questions about this fabled graph-engine, so we thought we’d indulge our techie friends with some background info.

When we first decided to build the Directed Edge engine, we’d built some in-memory and RDF-store based prototypes to start hashing out the recommendations algorithms, but the RDF stores couldn’t keep up performance-wise and staying memory-based obviously wasn’t an option for persistent data. So we tried porting things over to SQL-based databases … and got a working version, but it was so painfully slow that we knew we needed to keep looking. The initial version of our web services were built on simply serializing data objects to disk and reading records from index files that we loaded up into memory. This worked and was both fast and memory efficient, but ignored most of the harder problems of a true database.

What are those performance characteristics, exactly? Well, it’s fuzzy where the cut-off is, but we’re now able to import a graph with 2.5 million nodes and 60 million edges in 12 minutes on a MacBook. Query times for finding related items on a similarly large graph with warm disk-buffers average under 5 ms.

With the latest version of our web services and database, we’ve gone from having a graph-store to having a proper graph database. We now use straight memory-mapped I/O, disk-based linear-hashing, have mechanisms to address data fragmentation, thread-safe read / write locking and logging / log replaying.

Before we jump into the computer-science-y core of what we’re doing, I’ll throw down some nerd-cred, and give some perspective to the design decisions made. The design was done by me, Directed Edge’s CTO, and I’ve been looking at algorithms for graph-based similarity for around 5 years now. During the earlier half of the decade I did a series of talks on “contextual webs” to our crew in the Linux world. While I have no academic background in database design, I’ve been doing low level bit-wrangling for most of the last decade and Directed Edge’s data backend is the third key-value store that I’ve written from the ground up. So, what I’m trying to say is: some of these ideas may be original, many are not, some fly in the face of traditional database design which is likely some mix of novel and naive. Be that as it may, the closest that we’ve gotten to our performance using off the shelf databases was with BDB with 5% of the throughput of our database.

So, on to geekery. Let’s start at the top level of abstraction and work our way down. The impatient may feel free to turn the page at any time since the juicer bits are in the lower levels.  We’ve tried to bold the highlights.

The Store: Items, Links, References, Tags and Whathaveyou.

Sitting up on top of the architecture is a C++ class called “Store”. Like the rest of the things that sit below the web services (which are in Java), it’s written in C++. This is the API that we use to deal with a database. To our recommendations code, this is the database.

Things are pretty simple there, and the API is flexible enough that it’s let us swap out the implementation below it to compare other backends. Basically it’s this:

  • The store is a big collection of items, which can be anything — a user, a product, a web page.
  • Items are identified by a string unique identifier which maps to an integer index.
  • We can look up an item’s items index by name really quickly.
  • Items have links and references, which are basically a list of things they’re connected to and things that connect to them. These are symmetrical — so creating a link from item A to item B, creates a reference from item B to item A.
  • Items can also have tags, e.g. “user”, “product”, “musician”, “film” and so on. Tags are non-exclusive, so an item can have as many tags as appropriate.
  • Links and references can have weights of 1 to 10 or be unweighted. We use that for ratings. So a rating is just a link from user A to product B with weight of 1 to 10.
  • We can also store arbitrary other fields along with these like “text”. Really, the only limitations on what can be associated with an item are what we choose to expose in the API here. The lower levels are flexible.

So that’s the top level. That’s what makes things look like a nice, clean graph architecture without mucking with all of the lower-level storage details. At this level everything’s thread-safe too, so the map-reduce pattern that we use when generating recommendations doesn’t leave a trail of destruction and SIGBADTHINGSs in its wake.

Things Start Looking Databasey: Rows and Columns

Like traditional databases, our database is organized into rows and columns. An item is just a row. Unlike most databases, there’s only one set of rows (only a small lie, which we’ll ignore for convenience), and the primary things contained in rows are lists.

There are really no huge surprises here:

  • A database has a list of columns, a stack of deleted rows (to recycle indexes) and a row count.
  • Adding a row to the database adds a row to all columns.
  • There’s a flag for if a column is hashed or not, which enables fast lookups and requires unique data values.
  • There’s a basic column type that just stores byte arrays and a bunch built on top of it for storing things like link and reference vectors, string lists, strings and so on.
  • Any field in an item can be empty and all of our column data types have sane values for empty contents, which are returned if the data pointer for a row is set to null.

Column Locking

We’ve got two type of locks that we use, one surprisingly boring, one a little more creative.

The most important fields in the Directed Edge database are the links and references. They convey things like “which pages are linked on this website” or “which products has this customer bought” or “which news items has this user clicked on”. Those lists are of the type IndexVector within our database and have their own special kind of lock.

The rest of the columns use normal read-write locks. We use an array of 8 read-write locks per column. When lock request comes in it is hashed by shifting the index by three bits and then issues a lock on the column. We found this in practice to be faster than keeping a table or tree of locks since maintaining that tree would require a single read-write lock for the tree maintenance and contention for that lock was rather high. Non-IndexVector columns (strings, string lists, and binary blobs, mostly) copy the data when they hold the read lock and release it immediately thereafter, so the locks are held only momentarily.

So, to repeat that — using a simple array of locks per column significantly outperformed keeping a list of locked items, even if it means that items share their locks with many other items.

We’ll revisit IndexVector locks again in a little while.

Memory-Mapped I/O … and a Disk Based Malloc?

We’re up to my second lie now. See, I promised that I’d walk through the layers of abstraction from the top down and here we’ve gone ahead and skipped to the end. To understand the data structures that are built on top of the files, it helps to know where that data lives and how we manage it.

We experimented with a few ways of doing I/O, but the one that seemed to consistently win, as well as make life a better place, was doing straight memory mapped I/O. For the non-system programmers among you, UNIX systems allows for files to be mapped to memory addresses. So if we’ve got an open file called “foo” and we call mmap on its descriptor, then we’re able to access the contents of “foo” directly based on the address returned by mmap — we can treat the file just like it’s an array of characters in memory.

This has a few advantages. One is that it lets the operating system effectively handle paging in data from the disk. UNIX also provides some handy functions like madvise() and msync() so that you can give the operating system hints on how to do that paging and when to write things back out to disk.

There’s also no overhead for accessing disk data since it’s mapped directly from the file into the system’s address space. Usually when you use things like fread() or fwrite() all of the data has to be copied into the kernel’s space, then over to the processes space, then you can do stuff with it. By using mmap we avoid all of that copying mess and the overhead of switching between kernel and user space with in syscalls, which is often significant.

There’s a good section on advanced I/O in O’Reilly’s Linux System Programming, which I stumbled across last week and gives a good introduction to mmap and other I/O strategies.

So, that’s what mmap does. How do we use it?

Well, one of the typical problems in deciding how to order disk records is how to manage disk fragmentation. As it turns out, that’s one of the problems that memory managers face when trying to decide how to allocate chunks of the system heap. So, we thought, since we’re treating the files we use as memory anyway, why not put two and two together?

What we ended up with is an implementation in our memory-mapped file class of malloc, realloc, calloc and free — the classic functions for memory management in C. We patterned our allocation with some modifications for our applications on the algorithms used in glibc’s malloc. There’s a good, if slightly stale, introduction to them here.

Having those in place lets us allocate blocks of our files just like we’re allocating memory and handles file fragmentation rather elegantly. It also has the added bonus of letting us translate normal in-memory algorithms directly to disk based algorithms.

This structure lets us put arbitrary data into a file and gives us back an offset to point to where it’s at, which is exactly what we do for things like arrays of indexes and strings and so on, but we still need some way of referencing that data to make use of it in our column types.

File-based Data Structures: Stack, Vector, Linear Hash

Most of our data structures are pretty similar to their in-memory based counterparts. That similarity was in fact one of the reasons that we chose that approach. We have three basic data structures that we use on disk to reference the blobs of data that are inserted and removed with malloc, realloc and friends.

When we’re writing these records to disk, we choose whether or not they get their own file by whether or not they can make use of sequential access. Notably, vectors almost always get their own file because they’re accessed sequentially and only expand at the end. It’d be a pain in the derriere if we had a 50 MB vector and there was no room immediately after it in the file, so we had to free that whole 50 MB chunk and move it to the end of the file leaving a hole where it had been before. So vectors get to live a solitary life off in their own files. Hashes and stacks on the other hand don’t have that problem, their elements tend to be of smaller sizes and more scattered, so they can be mixed in with data without worrying about them fragmenting the file too much as they insert and remove elements.

Stacks are simple and don’t merit much mention. We use it to keep track of deleted records so that we can recycle their indexes the next time we create a new row. Yawn.

Vectors, on the other hand, are some of the meat-and-potatoes of the Directed Edge workhorse: the things we use to keep track of offsets in data files. These are different from the similarly named IndexVector. In Directed Edge lingo, an Index corresponds to a unique item with fields and stuff, and an Offset just points to a place in a data file. These vectors are used for Offsets. Actually, these vectors are a template, so they can be used for any data type of any fixed side, but they’re most important for offsets.

This is another point that we break from typical database design theory. In a typical database you’d look up a record by checking for it in a B-tree and then go off to find the data pointer for its record, which might have a continuation record at the end that you have to look more stuff up in … and so on. Our lookups are constant time. We hash the string identifier to get an Index and then use that Index to find the appropriate Offsets for its data. These vectors are sequential on disk, rather than using continuation tables or something similar, which makes constant time lookups possible.

Note: Like all things in the real world, “constant time” really means, “constant time in the parts we see, under cooperative conditions, if the operating system doesn’t have to do anything like figure out where that disk record actually is, so probably not really constant time, but still about as fast as you can get.”

Linear hashing is another important component of our lookup scheme. Linear hashing is a special sort of hash table that basically does dynamic expansion rather than having to reallocate the hash-slot array every time the data gets too big and futz around with reorganizing things. We also also having a mechanism for letting our hash-table cooperate with our data files so that it handles inserting and removing data from them in addition to keeping track of where they are.

IndexVector The Great and Its Team of Sneaky Locks

So, now we finally return to the IndexVector. Since you’ve probably dozed off a couple times in this pile of wordiness, I’ll remind you of its function — the IndexVector is what keeps track of links between items. “Bob purchased The Beegees Greatest Hits” is a link between Bob and said album. Bob has an index and the album has an index. In Bob’s list of “links”, which is an IndexVector, we’ll put the value of the Beegee’s Greatest Hits.

If you’re doing traversal heavy graph-applications, like, say, finding related content, how you deal with those links is your where the rubber hits the road. Most of your performance lies in accessing these links.

So, since we’re using mmaped I/O we have no syscall overhead for grabbing this stuff from the disk. We also have no copy that must take place to get the data from kernel space to user space. Two points for us. But what we want to do is to Try Real Hard to make sure that we don’t copy it at all in the common case.

So our IndexVector class is really just a pointer to a place on the disk where it can read the data from and then it casts that data into indexes when we do things like someIndexVector[4]. Simple, right? Yeah, mostly. But there’s a catch.

You see, IndexVectors are sometimes changed. Like, if Bob buys another album, we’ve got to update that vector. To make things worse, sometimes they get bigger and get moved to a whole other part of the file where there’s room for it now. Unlike our other data types (strings, binary blobs and all that) we access IndexVectors a lot, so we don’t want to just copy the data every time there’s an access. (We tried, slowed down the whole engine by 40%.) So, we get sneaky.

You see, most of the time we can go about just willy-nilly accessing the IndexVector from its disk position and assume that nothing’s changing behind its back. We have a similar array of mutexes (8 of them, to be precise) for our IndexVectors. When a read request comes in, an IndexVector adds itself to a list of active readers and goes about reading stuff from the disk.

Now, the magic happens when a write lock comes in. We don’t want our writes or our reads to block on these Very Important Pieces Of Data, so what the right lock does is first copy the old data from that record into memory and then go through the list of active readers and do a swaperoo on the data pointer. The readers are now pointing to an in-memory copy of the data, rather than the disk-based version. It can then go on writing and moving and generally doing whatever it wants with the disk based copy. The lock also keeps a copy of the pre-write version of the data in memory so that all readers that come in and try to get a lock on that record get the in-memory version until the write is done.

This leaves us with somewhat more lax consistency than one would expect from an ACID-compliant database, but for our applications that’s fine. Like, if our recommendations-fu has missed the item that you rated a couple milliseconds ago when computing product recommendations for the next page, no planes will crash and no babies will be harmed. On the plus side this means that the only case where a read or write on IndexVectors will have to wait is in the case of two threads trying to write to the same item.

Them’s the real guts of our database, but I’ll leave you with a couple provisos for extra credit:

Every Bit Is Sacred

Conventional wisdom in modern programming is not to get hung up on using a few bytes here or there and just get on with writing your code. And it’s right.

But when you’re working on low-level, high-performance I/O, every bit is sacred. The reason is that every bit is extra data is extra page misses and extra page misses means more waiting on the big spinning magnet that is the bottleneck in almost everything that touches it. These things also come out of the woodwork when you’re multiplying every extraneous bit by several million. If we were to increase the amount of space used for one of our indexes by one byte storing the wikipedia link structure would take 114 more megabytes.

So we get a little trigger happy with our friendly neighborhood binary operators. We combine weights and indexes into the same field using the upper 4 bits of the item index to store the weight. Our malloc implementation uses 8-byte alignment, so we use the extra 3 bits at the bottom end of of the address for flags. By thinking up clever ways of putting the squeeze on data we have databases that are 40% smaller, which means a lot less page misses and a much faster database.

What About Matrices? I Though Recommender Systems Were Based On Linear Algebra.

The difference between a weighted graph and a sparse matrix is mostly one of visualization rather than of data. We conceptualize a matrix to be something different than a graph — a matrix is a huge field of rows and columns and a graph a sprawling series of interconnections, but in fact, a sparse matrix and a weighted graph are almost the same thing and easily convertible when viewed from a data-perspective.

So, if we’re thinking of a user’s product ratings as a sparse vector that maps to each product, we can easily model that as a series of weighted edges from the user to each product. In fact, it’d be difficult to produce a tighter format for storing a sparse matrix than a list of connections since we’re able to avoid things like pointers to the next used vector dimension and munge the weights into the same integer value that we use to represent a graph edge.

Beyond that, a graph model allows for us to do some more wild clustering than you’d typically get if you’re imagining the ratings set as a matrix. We can look at things that are not just one, but a couple levels out in the connection structure to figure out how things are related. This is one of the mechanisms that we use to mitigate the “cold start” problem that is endemic to typical collaborative filtering algorithms.

Al fine

I’m tired of writing, you’re tired of reading. If you’ve got questions or have The Killer Application in mind for such a beast, feel free to drop us a line via comments or mail or ping us on Twitter.

Updated API and Database Live

We’re excited that the new API is finally live.  Most of the important new stuff is going on behind the scenes with the latest version of our graph database, but here’s a rundown of the new goodies:

  • Full read-write support for all items, rather than having to upload full database dumps. This includes methods for adding and removing tags or links as well as rewriting items. Items can also be added or removed from the database.
  • Better results for obscure or tag-specific searches.  You can see an example of this in our last blog post.
  • Faster results.  Our new database is several times faster than our already speedy previous version.
  • Max results parameter in the search results, configurable from 1 to 100 results.
  • “Ignore linked” option to avoid returning recommendations for items a user has already interacted with — i.e. items they’ve already purchased or clicked on.

The bulk of the effort in the last months has been getting our graph database up to snuff as a proper database.  We’ve had a lot of questions about our graph database, so the next blog entry will be full of technical information on it.  We’ll also now be creating new accounts again, which we’d been holding off on for the last few weeks as they’d need to be moved over to the new database format.  All in all the upgrade went as planned and services were down for only around 5 minutes.

You can view the updated API documentation here or as a PDF.

We’re already at work merging in the next batch of features which will add special elements for common use cases — e.g. ratings, purchases and click-tracking and will also bring more a more advanced recommendations model that’s been brewing in one of our experimental development trees. 

Updated database / clustering in use for Wikipedia mashup

In preparation for going live with our new database, updated API and new clustering algorithms we’ve switched the Wikipedia mashup and API over to the new database.  The most obvious part of the transition is that obscure topics have higher quality results as do the results for tagged items.

Here’s an example for Canonical Ltd., note the difference in the results for related companies:

 

We’ll be following up in the next few days the details of the API updates and, by popular request, some technical information on our graph database.

Interview with Intruders.tv

Thanks to Jof and Vincent from Intruders.tv for doing the interview!

Twitter Recommendations Cloud

Gleaned from the RSS feed for a few weeks plus Wordle. Here’s one with a hand-created list of stop-words to try to get out some of the irrelevant stuff. Gives an interesting look at what people are asking for recommendations for on Twitter:

Twitter Killed the Radio Star, Logo Search

Putting the Micro- in Bloging:

 

We’ve been slack on blogging of late. The Twitter bug finally bit the two of us that founded Directed Edge. I finally got around today to setting up an “official” (meaning, “spared of our random personal musings”) Directed Edge Twitter account today:

http://twitter.com/directededge

You can find the two founders streams linked there.  But have no fear, there will be more good old fashion blogging action on the way.  I’ve got posts in the pipe on some of our experiences in picking a CRM system and some of the optimization work on our graph storage engine that should be coming along in the next week or so.

The time for a logo is nigh:

 

Yes, that’s right, we need a logo.  We’ve all run out of business cards, so before putting in an order for a new batch, I’ve decided to revisit the logo situation.  We’re looking for something simple and business-y that also can be reduced to line art for print.  Particular colors aren’t terribly important since the website needs a revamp too.  If you or somebody you dig feels like moved to send us a bid (with a link to a portfolio), please do.

Collective intelligence, BarCamp and the Berlin Web Week

I’ll be doing a session tomorrow or Sunday at BarCamp Berlin (currently the most upvoted presentation topic).  I’ll cover some of the basics of modern recommendation systems, including basic categories of algorithms and why recommendations are important for modern web applications.  Dave Sharrock and Garik Petrosyan from Be2 (dating site) will be co-presenting, talking about some of the things they’ve had to overcome in building a scalable match-making system.

Valentin will be doing a session on multi-lingual blogging.

We’ll also be at the following events in the Berlin web week:

Drop us a line if you’d like to meet and talk about recommendations and what they can do for your site!

Fuzzy search, related articles AJAX widget.

This one was requested a few times.  We’ve thrown together a simple AJAX widget for showing related articles on other sites.  We’ll add a little more bling later on, especially once we get some more data sets out there.

The idea is this:  You have a site that has information about music, or movies or airplanes or whatever, and you’d like to add links to related encyclopedia to your page with just a couple lines of customization.

To make that work as one would expect we also added fuzzy search.  That’s important because, for instance, the article for “Madonna” in Wikipedia is titled, “Madonna (entertainer)”.  So, now with our fuzzy searching you can search for the closest hit to a term with a specific tag, in this case “Madonna” tagged as “musician”.

The list of available tags is here.

The new AJAX widget is trivial to use.  Here’s what it looks like.  If you click it’ll take you to a page that you can use to copy-paste from.

You can get that on your page by adding two lines to you page / template:

<script src=”http://pedia.directededge.com/RelatedArticlesWidget.js” type=”text/JAVASCRIPT”></script>

<div class=”RelatedArticles” title=”Music Related to James Brown” fuzzy=”James Brown” fuzzytag=”musician” tags=”musician”></div>

All items that are returned use the “RelatedArticles” class, so that you can style them in your stylesheet as you see fit.  These are the supported attributes (only title or fuzzy are required):

  • topic:  A specific (exact) wikipedia article name.
  • fuzzy:  Find the nearest article name.
  • fuzzytag: modifies the fuzzy search so that it looks for an item with that tag.  Note:  this does not affect the related articles, for that use the following attribute.
  • tags: The tags for the returned articles. i.e. Using a fuzzytag for James Brown will specify that the main article should be a musician named James Brown, but specifying musician for tags will ensure that all of the returned results are also musicians.
  • title: The title to be used for the list.  This defaults to Related Encyclopedia Articles.
  • prefix: The prefix to be used for the links.  This defaults to http://pedia.directededge.com/article/ Here you could also specify http://en.wikipedia.org/wiki/ to link to the English wikipedia.

If you decide to link to Wikipedia or somewhere else using the last tag, we’d request that you at least give us a mention in your company / project’s blog somewhere.

Enjoy and feel free to drop us questions or comments!

Directed Edge interviewed for The Next Web

Thanks to Ernst from The Next Web for publishing an interview with us.

Web Services Live, Documented, Secured.

So in the space since the last blog post we’ve been working on getting everything squared away for our commercial web services API.  It’s now running live at webservices.directededge.com.  There’s some documentation up on how the REST API works there.  I also went through the hoops of moving over from our self-signed certificate to a proper certificate this week; I’d forgotten how much of a pain those can be to deal with.

If you’ve got any questions about the API or things that you’d like to do with it that don’t seem to be supported at the moment, we’d like to hear from you!

BarCamp Berlin Registration is Open

 

Directed Edge will be there.  Will you?

 

Directed Edge will be there.  Will you?

Directed Edge on the Road

In the next few days, we’ll be at the following events in Berlin:

Drop us a line if you’ll be there as well and we can arrange a meeting!

Greasemonkey script.

We’re very grateful that one of our users knocked out one of the items on our to-do list and created a Greasemonkey script for showing related articles on Wikipedia.  If you have Greasemonkey installed in Firefox you can just click on “install script” on this page.  To get related articles without being logged in to Wikipedia.

Per the comments on that page, we will start rolling out our Wikipedia demo in other languages probably about a week from now.

Related Pages on Wikipedia via our web services API.

So, we think this is pretty cool beans.  When we did our demo with a mashup of Wikipedia’s content we knew that we wanted something that potential customers could quickly look at and get a feel for what our recommendation engine is capable of, and we got a lot of good feedback about that in our recent technology preview.  On the other hand, we knew that we weren’t going to get the masses to switch over to user our Wikipedia interface.

One of the open questions for us as we pushed out the first bits of our web-services API  last week was, “Can we get this content to show up in Wikipedia proper?”

Last night after an extended hacking session where I tried a number of strategies for doing DOM scripting to pull in external content (and some misadventures in trying to do cross-site XMLHttpRequests) I managed to come up with a simple way of pulling in content from our web service via JSONP, and added support for JSON output to our web service along the way.  For Wikipedians that are logged in, it only requires adding one line to your monobook.js file and I’ve created a short how-to here.  The source code, for interested hackers is here.

Here’s what it looks like:

When we launched our demo a few people didn’t seem to get quite what it does that our engine is doing — we’re not just analyzing the current page and pulling in a few important links; we’re jumping out a few levels in the link structure and analyzing and ranking usually several thousand links in the neighborhood of the target page.  Often those pages are linked from the target page, but that’s hardly a surprise.  I come from a background of doing research in web-like search, so it’s no coincidence that our approach to finding related pages takes some hints from PageRank and other link-based approaches to sorting out the web.

We’d invite people to try this out and of course to keep playing with our mashup; we’ve gotten so used to having related pages that it’s hard to go back to the vanilla Wikipedia — having the related pages there makes it really easy to sort out things like, “What are the important related topics?” or “Well, I know about X, what are the main alternatives?”  And so on.  We’ve got some other exciting stuff up our collective sleeves that we’ll be rolling out in the next couple of weeks, so stay tuned!

API, Part II: Tags

Work on the web services API for the encyclopedia continues, now with tags.  Here’s a quick rundown:

You can get a list of supported tags here:

http://pedia.directededge.com/api/v1/tags/

That currently returns:

<?xml version="1.0" encoding="UTF-8"?>

<directededge version="0.1">
  <tag>actor</tag>
  <tag>author</tag>
  <tag>book</tag>
  <tag>company</tag>
  <tag>film</tag>
  <tag>musician</tag>
</directededge>

You can then get results from article queries based on a tag, using something like this:

http://pedia.directededge.com/api/v1/article/KDE/tags/company/

Which returns:

<?xml version="1.0" encoding="UTF-8"?>

<directededge version="0.1">
  <item id="KDE">
    <link>Trolltech</link>
    <link>Novell</link>
    <link>Hewlett-Packard Company</link>
    <link>Nokia</link>
    <link>World Wide Web Consortium</link>
    <link>Mandriva</link>
    <link>Canonical Ltd.</link>
    <link>Sirius Satellite Radio</link>
  </item>
</directededge>

You can query any article for any tag (unlike in the web interface).  Right now the results for “off topic” tags tend to be hit-or-miss.  One of the other big items on our to-do list is improving tagged results in our engine.

I’m posting incremental updates like this in the hopes that if you’re planning on using our API in a mashup that you’ll let us know what you like and don’t like before we freeze v1.

We’ve also decided on a couple of limitations for the open API that aren’t true for our commercial API (running either on customer data sets or open data sets):

  • You’re limited to 10 results.
  • You can only filter on one tag at a time, meaning, you can’t get ranked results for movies and music simultaneously.

We think those are pretty reasonable and still give users a fair bit of room to play for free.  If you’re interested in using our commercial API, drop us a line!  We’ve also just created an announcement list where we’ll notify folks that are signed up of important details.  You can sign up for that here.

First Encyclopedia API bits up.

This will still definitely be in flux, but I started getting parts of the REST API up if folks want to play with it.  Warning:  the format may change.

You can now hit something like:

http://pedia.directededge.com/api/v1/article/KDE

And get back:

<?xml version="1.0" encoding="UTF-8"?>

<directededge version="0.1">
  <item id="KDE">
    <link>GNOME</link>
    <link>Unix-like</link>
    <link>Desktop environment</link>
    <link>Konqueror</link>
    <link>Qt (toolkit)</link>
    <link>KDE 4</link>
    <link>GNU Lesser General Public License</link>
    <link>X Window System</link>
    <link>KPart</link>
    <link>Widget toolkit</link>
  </item>
</directededge>

I’ll be adding support for JSON output and filtering based on tags in the next few days.  Once I’ve got a set of features there that I consider feature complete I’ll freeze the “v1″ so that people can create mashups based on that and be sure that the API will remain stable.

This does do capitalization correction, but does not do redirect detection.  I’m debating if I want to do that by default or use another REST path since it requires another couple DB queries and is as such a little slower.

Toy of the Day: FeedMySearch

Like any new startup co-founder, I’m obsessive about seeing how what we’re doing trickles out over the web.  Being an RSS-warrior today I went looking for a Google search to RSS converter and found FeedMySearch, which now, a few hours into using it seems to do quite well in pulling in new information about Directed Edge as it hits Google’s indexes.

FeedMySearch for Directed Edge in Thunderbird

FeedMySearch for directededge.com in Thunderbird

 

Nifty tool.  Anything that stops me from compulsive reloading is a win.  Now back to implementing new features.  :-)

Directed Edge Launches Recommender Engine Public Beta!

It’s an exciting day for us at Directed Edge.  Today we’re finally putting our Wikipedia-based technology preview out there for people to play with.  Before you click over to it, here’s a little about what you’re looking at.

As our name implies, we’re graph theory nerds.  We look at the roughly 60 million links between the 2.5 million English Wikipedia pages, and with a few extra cues from the content, figure out  pages related to the current one and put that in a little box in the upper left (as evident from the image on our home page).  In some cases, if we’re able to pick out what sort of page it is, we also drop in a second box with just other pages of the same type.

Finding related pages in Wikipedia isn’t fundamentally what Directed Edge is about.  We’ve got a super-fast in-house graph storage system that makes it possible to do interesting stuff with graphs quickly, notably figure out which pages are related.  We’ve already got a couple of pilot customers lined up and will be working with a more in the next weeks to analyze their sites and figure out how things are related there.  We’ve got a prototype of our web-services API that they’ll be using to send us break-downs of how stuff’s connected on their site and we’ll send back what we hope are some pretty groovy recommendations.

There are dozens of things in the pipe for us:  ways to make recommendations better, ways to make the Wikipedia demo cooler, things customers want to see in our web services that we’d previously not thought of, and we could ramble on that for a while, but there are a few things that are on the very near horizon that didn’t quite make it into this round:

  • An open web-services API for accessing the recommendations from our Wikipedia demo.  This will be a stripped down, read-only version of our commercial API usable in web mash-ups.
  • Better tagged (i.e. music, movies, authors, companies) recommendations.  Support for tagged articles was one of the last features that made it into the current demo, and we’ve got some plans for improving the results for those.
  • Pulling in non-English Wikipedia variants.  We’ll probably start with German and French.
  • More info about our commercial web-services API.  We’re still nailing down some of the details, but as soon as we freeze the API for the first customers, we’ll add more docs to the site.

If you subscribe to our news feed you’ll see immediately when those services go live.  Even though we’re still in the beta-phase and are only accepting a limited number of customers, if you think you’d be interested in using our engine for your site down the line, we’d encourage you to register now since we’ll be offering a discount for our commercial services to everyone who fills out their info in the contact form during the beta phase.

More soon.  Enjoy!

Launch date, August 13th.

We’ve now committed to going into public beta / technology preview next Wednesday, August 13th.  We’ll be launching our new site with more information about our products and services at that time.

Press / bloggers may request invites by sending us a mail.  It’ll be an exciting next few days as we iron out the last kinks and get ready for the onslaught.

The website will be a bit in flux, but our bio info is still available here.

Edit: If you’ve been testing the demo previously the location has changed.  Drop us a line for the new URL.

In defense of Perl.

Kickin’ it old school.

 

I started writing web applications around 1997.  On Solaris.  Using Netscape’s web server.  In Perl.

My LinkedIn profile starts thusly:

I began working with LAMP back in the days where the men were men and we ate Perl for breakfast. Installing Linux with 40 floppy disks puts hair on your teeth. One thing led to another, and before I knew it I was living with an E-Commerce system. What can I say? I was young and needed the money.

Around 2001 I took a departure from the web world to work on enterprise and desktop software. Sure, I slung a little web code here and there, but I didn’t track the technology landscape like I did back in the good old days.

When looking at founding Directed Edge, it was time to re-approach the web and get back on friendly terms.  Like an old-timer desperately trying to identify all that is hip, I set out to figure out what the cool kids were doing.

Python with Django and Ruby on Rails.

I learned bits of Python and Ruby, things I’d been meaning to learn for ages.  Ruby’s syntax and I didn’t hit it off at first, so I spent a couple days reading Learning Python, which had been sitting on my shelves for a few months.

But friends, home is where the heart is, and even after getting a reasonable grasp on Python I kept going back to Perl.  There are two reasons for this:

CPAN.  Oh, sweet, sweet CPAN.

 

The CPAN has grown so large and comprehensive over the years that many people learning Perl seem to elevate it to a sort of mythical status, and express surprise when they begin to encounter topics for which a CPAN module doesn’t exist already.

-Wikipedia article on CPAN

CPAN is huge, easy to search, well documented, and trivial to deploy.  Need some code to do TLS authenticated SMTP tranfers?  Trivial.  Need a WSDL compiler to work with an old SOAP API?  Up and going in a few minutes.  Need to test a REST API with a really mature HTTP implementation?  It’s there.  Need code for quickly generating mail routing code for feedback processing?  Bingo.  But wait — that’s all pretty common stuff, right?  There are even CPAN modules for stuff like tracking quantum superpositions in quantum computing algorithms or quickly building genetic algorithm implementations (my two research areas in college).  And all under the same roof.

And let’s back up one bit; for all of the perceived culture of sloppiness, I earnestly believe that Perl has the strongest documentation culture of any major programming language.  By and large CPAN’s 12,000-something modules are rather well documented with examples and gotchas in addition to the basic API docs.  As a special bonus as soon as you’ve installed them with the command line cpan tool (which automatically resolves dependancies, downloads, tests and installs) they’re available in your system’s man pages.  The standard man pages for core language features are great, and well written to boot.  The Camel Book will forever have a place on the gilded streets of O’Reilly’s hall of fame as possibly the most enjoyable to read 1000+ page technical book ever written.

Fast hacks, fast, quickly.

 

Combined with the power of CPAN, Perl just has something about it that makes gruesome, and gruesomely fast hacks possible.  Much of this is owing to CPAN solving 75% of the universe’s problems for you from the get-go.  But Perl is something of the Sicilian mobster of the programming world — it gets stuff done.

Add to that that it’s one of the speediest scripting languages performance-wise, and it’s great for quick-and-dirty hacks that programmers invariably have to come up with on a regular basis.  Perl seems to be optimized for writing as little code as possible to get the job done.

What I’m not saying.

 

Most people talking about Perl are quick to groan about its ugliness.  I’ll first note, most of them don’t know Perl, so it’s my earnest belief that much of that is faddiness.  Perl can be well written, but its syntactic moral flexibility means that there’s a lot of ugly Perl out there.  I’m not going to try to pass that off as a good thing.  But a real Perl mensch can write Perl that’s as easy to read as code in most other popular programming languages.

I’m also not advocating doing large projects in Perl.  In a decade of Perl slinging, it’s only happened a time or three that I written tools that were more than a couple thousand lines of code.  (But again, the beauty lies in that I’ve rarely needed to.)  Nothing particularly central to Directed Edge is written in Perl, but it’s been my Swiss Army Chainsaw on the fringes — converting data formats, processing simple forms, interfacing with databases — glue code, basically.

And there I have to say, despite wanting oh-so-desperately to be one of the Python cool kids, I think Perl is there to stay.

Incorporated.

Check one thing off the list of stuff that’s been taking time for us:  as of August 1st, Directed Edge is incorporated; we’re pretty excited.  We’ll transfer the IP into the new company next week.

Even though updates here have been slow, it’s been for the best of problems — despite not having our open beta launched we’ve already got a couple of pilot customers and I’ve been trying to get the integration process nailed down for our web services API.

The prototype is already in the state that we’ll use for our launch and we’ve got a few dozen people testing it in our closed beta.

So what’s next?  Well, we’re going to try to get our public presence ready for launch, mostly meaning finishing up our in progress web site, get press releases ready in German and English, get a few more bloggers on board and do the deed.  If you’re a blogger or have press connections and want to get the scoop before we launch, drop us a line!

Getting the rest of the legal stuff taken care of will take some time in the next week as well, but after thinking too much on the incorporation issue, it’s nice to be moving along.

Seedcamp Berlin

 

Valentin and Scott at Seedcamp Berlin

Valentin and Scott at Seedcamp Berlin

 

 

I stumbled across this photo of Valentin and I getting our questions answered about incorporation style at Seedcamp Berlin.  I must say that was one of the more useful events that we’ve been to so far — it was a great chance to get in touch with an impressive collection of mentors and to network with other (mostly) German startups.

Berlin: OpenCoffee / Business & Beer this week

There will be another update on Directed Edge stuff soon-ish, but I just wanted to get out a list of the upcoming Berlin events where we and other Berlin startups will be talking shop and doing demos:

Next round of events.

Just finished up attending the last of the events in the recent post and thought I’d mention the next batch on the Directed Edge calendar:

I’ve got two weeks left until I’m full-time on Directed Edge and it’s looking to be an exciting time.  Like most startups stumbling towards launch, there are a thousand threads being followed up simultaneously, the last couple of weeks more business than technical.  We’re going to make our demo progressively more open during the next weeks, but the feedback we’ve gotten from the first testers is encouraging.

Not-exactly-TechCrunched.

But I did make it into the video feed from the Prague TechCrunch event. There’s about a half-minute pseudo pitch in here. Lessoned learned from watching it back? Loosen up.  Get to the point faster.  Like in 10 seconds.  (Isn’t showing up in syndicated version, click through to the full post.)

Launch Count Down, Private Beta

We’re coming very close to having a beta to launch.  The interesting parts are good to go, but we still need to:

  • Clean up the (visual) design
  • Fix a couple of rendering errors on complex pages
  • Get the non-demo parts of the website in place — specifically info on what we’re doing and how pilot customers can get up and running
  • Tune the caching code so that it can handle a usage spike on launch

There’s still intentionally little information on the home page, and that will probably remain that way until we’re ready to go for a public beta.

However, I’ve been talking to more and more people face to face and showing off the current prototype (most recently at the two events that I just returned from in Prague), so in the next couple of days we’re going to start a private beta for folks that want to start exploring while we knock out the stuff above.  We’ll accept a limited number of requests.  Send us a mail or wait for the registration form to show up on the home page later this week if you’re interested in getting to the goodies a couple weeks ahead of the crowd.

Places I’ll Be

Want to hear more about Directed Edge? I’ll be at the following upcoming events:

Valentin will probably be at the Berlin-based events as well.  I’d originally planned to be at the UK Hackers’ meetup in London the same day as the Prague TechCrunch event, but the combination of the TechCrunch and Ubuntu events the same weekend was too much to pass up.  I’ll try to catch the London folk the next go around.  Going to be at any of those events?  Drop me a line and we’ll be sure to meet up.

Trapped in an elevator.

Today, for the first time I went to a social event here in Berlin for entrepreneurs. After the lamentations that I’d been exposed to by the locals, I can say that I was pleasantly impressed. The group seemed decent — a mix of coders and business folk and some in between.

One thing was painfully obvious. I’m a whole lot worse at explaining what we’re doing than I thought I would be. Most folks could layout the basics of what they were doing in a few seconds. I stumbled over it even when rambling at length. I’m too used to giving presentations, where I’ve got an audience and an hour.

The first time that we presented our ideas we over simplified. We’re working on some fairly hard problems and we didn’t manage to convince them that we both had something compelling and the skills to pull it off.

This time I went too far in the other direction. I blabbed too much about my background (more than probably anyone cared, and likely to the point of seeming arrogant) and in spoken form, still struggled with outlining what it is exactly that we’re doing.

A good “elevator pitch” is harder than it seems. For us we’ve got to:

  • Show how what we’re doing is interesting.
  • For the non-computer scientists, point out that it’s non-trivial (i.e. hard to duplicate).
  • For computer scientists, convince them that we’re skilled enough to pull it off.
  • Briefly explain how we plan to monetize it.
  • Boil that down to about a minute.

 
The crux of the difficulty, perhaps, lies between points two and three.  For non-technical folk, what we’re doing seems easy.  For technical folk, it seems very hard.

I’ve got another shot at this at the Open Coffee meeting on Friday. Hopefully by then I’ll have managed to get a little closer to something compelling. I’d like to have a clear message that we’re able to present by the time that we go for a public beta in the near future.

Why don’t computer scientists track sub-fields other than their own?

As I’ve worked through some of the ideas that we’ll be using at Directed Edge over the last few years I’ve stumbled across several subfields of computer science: information retrieval, semantic webs, graph algorithms, recommender systems.

As we cross boundaries there is one question that always strikes me: Why don’t computer scientists track related sub-fields?

This problem seems to afflict academia in general, but computer scientists seem to have even worse tunnel vision than par. When I stumble across papers that are potentially useful to the work that we’re doing I tend to track down the papers they’ve cited in search of other useful pieces of the puzzle and invariably there is almost no overlap in co-citation across sub-fields.

Why is this?

On the one hand, this provides a measure of excitement to me; there are these pools of knowledge that I manage to stumble across every few months that help bring our technology closer to realization. On the other hand, there’s this nagging wonder that so many brilliant minds aren’t talking to their colleagues to put together cool solutions to interesting problems.  Theories?

Shameless plug: We’ve started a blog aggregator for startup related blogs. Got one? Drop me a mail and we’ll add you to the syndication.

Networking.

After posting this on Hacker News I’ve gotten pretty serious about setting up some things to get the Hacker News startup community to pull together to our collective advantage.  Here are the main points:

  • Link to me at LinkedIn and join the Hacker News group (assuming you’re a news.YC regular)
  • Join our mailing list for entrepreneurs.  Let’s hone each others projects into something great.
  • Join the Planet.  Mail me.  If you’ve got a startup, we’ll syndicate you.  What’s a Planet?  It’s something like this.  They’re real community hubs in the Open Source world.
  • Food.  We’re working on this.  If you happen to be in or near Berlin, we’ll do dinner regularly starting Saturday the 19th.  If you’re not in Berlin and want to host, let’s get a calendar set up.