Collective intelligence, BarCamp and the Berlin Web Week

I’ll be doing a session tomorrow or Sunday at BarCamp Berlin (currently the most upvoted presentation topic).  I’ll cover some of the basics of modern recommendation systems, including the main categories of algorithms and why recommendations are important for modern web applications.  Dave Sharrock and Garik Petrosyan from Be2 (a dating site) will be co-presenting, talking about some of the challenges they’ve had to overcome in building a scalable match-making system.

Valentin will be doing a session on multi-lingual blogging.

We’ll also be at the following events in the Berlin web week:

Drop us a line if you’d like to meet and talk about recommendations and what they can do for your site!

Fuzzy search, related articles AJAX widget.

This one was requested a few times.  We’ve thrown together a simple AJAX widget for showing related articles on other sites.  We’ll add a little more bling later on, especially once we get some more data sets out there.

The idea is this:  You have a site that has information about music, or movies or airplanes or whatever, and you’d like to add links to related encyclopedia articles to your page with just a couple of lines of customization.

To make that work as one would expect we also added fuzzy search.  That’s important because, for instance, the article for “Madonna” in Wikipedia is titled, “Madonna (entertainer)”.  So, now with our fuzzy searching you can search for the closest hit to a term with a specific tag, in this case “Madonna” tagged as “musician”.
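To make the idea concrete, here is a minimal sketch of a tag-restricted fuzzy lookup.  It is not how our engine does it; it uses Python's standard difflib for the string matching, and the ARTICLES table and the fuzzy_lookup name are made up for illustration.

```python
import difflib

# Hypothetical sample of article titles and their tags (illustrative data only).
ARTICLES = {
    "Madonna (entertainer)": {"musician", "actor"},
    "Madonna (art)": set(),
    "Madonna, Maryland": set(),
    "James Brown": {"musician"},
}


def fuzzy_lookup(term, tag=None, articles=ARTICLES):
    """Return the title closest to `term`, optionally restricted to one tag."""
    # Keep only titles that carry the requested tag (or all titles if no tag).
    candidates = [t for t, tags in articles.items() if tag is None or tag in tags]
    # cutoff=0.0 means "always return the best candidate, however distant".
    matches = difflib.get_close_matches(term, candidates, n=1, cutoff=0.0)
    return matches[0] if matches else None


print(fuzzy_lookup("Madonna", tag="musician"))
```

Restricting the candidates by tag before matching is what keeps a plain “Madonna” from landing on an unrelated article that merely has a shorter title.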

The list of available tags is here.

The new AJAX widget is trivial to use.  Here’s what it looks like.  If you click it, it’ll take you to a page that you can copy and paste from.

You can get that on your page by adding two lines to your page / template:

<script src="http://pedia.directededge.com/RelatedArticlesWidget.js" type="text/javascript"></script>

<div class="RelatedArticles" title="Music Related to James Brown" fuzzy="James Brown" fuzzytag="musician" tags="musician"></div>

All items that are returned use the “RelatedArticles” class, so that you can style them in your stylesheet as you see fit.  These are the supported attributes (only topic or fuzzy is required; title has a default):

  • topic:  A specific (exact) Wikipedia article name.
  • fuzzy:  Find the nearest article name.
  • fuzzytag: Modifies the fuzzy search so that it looks for an item with that tag.  Note:  this does not affect the related articles; for that, use the tags attribute.
  • tags: The tags for the returned articles.  For example, using a fuzzytag for James Brown will specify that the main article should be a musician named James Brown, but specifying musician for tags will ensure that all of the returned results are also musicians.
  • title: The title to be used for the list.  This defaults to Related Encyclopedia Articles.
  • prefix: The prefix to be used for the links.  This defaults to http://pedia.directededge.com/article/ but you could also specify http://en.wikipedia.org/wiki/ to link to the English Wikipedia.
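If you generate your pages from templates, a small helper can render the widget's div for you.  The sketch below is only an illustration in Python: the related_articles_div function is hypothetical, and it simply emits the attributes listed above with HTML escaping applied.

```python
from html import escape

# Attribute names taken from the list above, in the order they are emitted.
ATTRIBUTE_ORDER = ("title", "topic", "fuzzy", "fuzzytag", "tags", "prefix")


def related_articles_div(**attributes):
    """Render the widget's <div> for the given attributes (hypothetical helper)."""
    unknown = set(attributes) - set(ATTRIBUTE_ORDER)
    if unknown:
        raise ValueError(f"unsupported attributes: {sorted(unknown)}")
    parts = ['class="RelatedArticles"']
    for name in ATTRIBUTE_ORDER:
        if name in attributes:
            # Escape quotes and angle brackets so values cannot break the markup.
            parts.append(f'{name}="{escape(attributes[name], quote=True)}"')
    return f"<div {' '.join(parts)}></div>"


print(related_articles_div(
    title="Music Related to James Brown",
    fuzzy="James Brown",
    fuzzytag="musician",
    tags="musician",
))
```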

If you decide to link to Wikipedia or somewhere else using the prefix attribute, we’d request that you at least give us a mention in your company / project’s blog somewhere.

Enjoy and feel free to drop us questions or comments!

Directed Edge interviewed for The Next Web

Thanks to Ernst from The Next Web for publishing an interview with us.

Web Services Live, Documented, Secured.

Since the last blog post we’ve been working on getting everything squared away for our commercial web services API.  It’s now running live at webservices.directededge.com.  There’s some documentation up there on how the REST API works.  I also jumped through the hoops of moving over from our self-signed certificate to a proper certificate this week; I’d forgotten how much of a pain those can be to deal with.

If you’ve got any questions about the API or things that you’d like to do with it that don’t seem to be supported at the moment, we’d like to hear from you!

BarCamp Berlin Registration is Open

 

Directed Edge will be there.  Will you?

Directed Edge on the Road

In the next few days, we’ll be at the following events in Berlin:

Drop us a line if you’ll be there as well and we can arrange a meeting!

Greasemonkey script.

We’re very grateful that one of our users knocked out one of the items on our to-do list and created a Greasemonkey script for showing related articles on Wikipedia.  If you have Greasemonkey installed in Firefox you can just click on “install script” on this page to get related articles without being logged in to Wikipedia.

Per the comments on that page, we will start rolling out our Wikipedia demo in other languages probably about a week from now.

Related Pages on Wikipedia via our web services API.

So, we think this is pretty cool beans.  When we did our demo with a mashup of Wikipedia’s content we knew that we wanted something that potential customers could quickly look at and get a feel for what our recommendation engine is capable of, and we got a lot of good feedback about that in our recent technology preview.  On the other hand, we knew that we weren’t going to get the masses to switch over to use our Wikipedia interface.

One of the open questions for us as we pushed out the first bits of our web-services API  last week was, “Can we get this content to show up in Wikipedia proper?”

Last night, after an extended hacking session where I tried a number of strategies for doing DOM scripting to pull in external content (and some misadventures in trying to do cross-site XMLHttpRequests), I managed to come up with a simple way of pulling in content from our web service via JSONP, and added support for JSON output to our web service along the way.  For Wikipedians who are logged in, it only requires adding one line to your monobook.js file, and I’ve created a short how-to here.  The source code, for interested hackers, is here.
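For anyone who hasn't run into JSONP before:  the server wraps its JSON in a call to a function named by the client, so the response can be loaded cross-site with a plain script tag.  Here is a minimal server-side sketch in Python; the jsonp_response function and the field names in the payload are assumptions for illustration, not the actual output format of our web service.

```python
import json
import re

# Only plain JavaScript identifiers are accepted as callback names, so the
# caller cannot smuggle arbitrary script into the response.
CALLBACK_PATTERN = re.compile(r"[A-Za-z_$][\w$]*")


def jsonp_response(callback, payload):
    """Wrap a JSON-serializable payload in a call to `callback`."""
    if not CALLBACK_PATTERN.fullmatch(callback):
        raise ValueError(f"invalid callback name: {callback!r}")
    return f"{callback}({json.dumps(payload)});"


# Hypothetical payload shape; the real service's JSON fields may differ.
print(jsonp_response("showRelated", {"item": "KDE", "links": ["GNOME"]}))
```

The page then only has to define the callback function and add a script tag pointing at the service.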

Here’s what it looks like:

When we launched our demo, a few people didn’t seem to quite get what our engine is doing.  We’re not just analyzing the current page and pulling in a few important links; we’re jumping out a few levels in the link structure and analyzing and ranking what is usually several thousand links in the neighborhood of the target page.  Often those pages are linked from the target page, but that’s hardly a surprise.  I come from a background of doing research in web-like search, so it’s no coincidence that our approach to finding related pages takes some hints from PageRank and other link-based approaches to sorting out the web.
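To give a flavor of what link-based ranking around a target page can look like, here is a toy sketch of a random walk that keeps restarting at the target (often called personalized PageRank).  This is a textbook technique swapped in for illustration, not our engine's actual algorithm; the related_pages function, the GRAPH data and the parameter values are all made up.

```python
def related_pages(graph, start, damping=0.85, iterations=50, top=5):
    """Rank pages near `start` using a random walk that restarts at `start`."""
    nodes = set(graph) | {t for targets in graph.values() for t in targets}
    score = {node: 0.0 for node in nodes}
    score[start] = 1.0
    for _ in range(iterations):
        following = {node: 0.0 for node in nodes}
        for node, value in score.items():
            targets = graph.get(node, [])
            if not targets:
                # Pages without outgoing links send their weight back to start.
                following[start] += damping * value
                continue
            share = damping * value / len(targets)
            for target in targets:
                following[target] += share
        # With probability (1 - damping) the walk jumps back to the start page.
        following[start] += 1.0 - damping
        score = following
    others = [node for node in nodes if node != start]
    return sorted(others, key=lambda node: (-score[node], node))[:top]


# Tiny, made-up link graph: page -> pages it links to.
GRAPH = {
    "KDE": ["Qt", "GNOME", "Konqueror"],
    "Qt": ["KDE", "Trolltech"],
    "GNOME": ["KDE", "GTK"],
    "Konqueror": ["KDE", "Qt"],
    "Trolltech": ["Qt"],
    "GTK": ["GNOME"],
}

print(related_pages(GRAPH, "KDE"))
```

Pages that are reachable from the target along many short paths collect the most weight, which is why a page two hops away can outrank one that is linked directly.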

We’d invite people to try this out and of course to keep playing with our mashup; we’ve gotten so used to having related pages that it’s hard to go back to the vanilla Wikipedia — having the related pages there makes it really easy to sort out things like, “What are the important related topics?” or “Well, I know about X, what are the main alternatives?”  And so on.  We’ve got some other exciting stuff up our collective sleeves that we’ll be rolling out in the next couple of weeks, so stay tuned!

API, Part II: Tags

Work on the web services API for the encyclopedia continues, now with tags.  Here’s a quick rundown:

You can get a list of supported tags here:

http://pedia.directededge.com/api/v1/tags/

That currently returns:

<?xml version="1.0" encoding="UTF-8"?>

<directededge version="0.1">
  <tag>actor</tag>
  <tag>author</tag>
  <tag>book</tag>
  <tag>company</tag>
  <tag>film</tag>
  <tag>musician</tag>
</directededge>
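A response like that is easy to pick apart with any XML parser.  As a minimal sketch, here it is with Python's standard xml.etree.ElementTree; the parse_tags name is just for illustration, and the sample is the response shown above.

```python
import xml.etree.ElementTree as ET

# The tags response from above, as the bytes an HTTP client would receive.
TAGS_XML = b"""<?xml version="1.0" encoding="UTF-8"?>
<directededge version="0.1">
  <tag>actor</tag>
  <tag>author</tag>
  <tag>book</tag>
  <tag>company</tag>
  <tag>film</tag>
  <tag>musician</tag>
</directededge>"""


def parse_tags(xml_bytes):
    """Return the tag names from a tags response, in document order."""
    root = ET.fromstring(xml_bytes)
    return [tag.text for tag in root.findall("tag")]


print(parse_tags(TAGS_XML))
```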

You can then get results from article queries based on a tag, using something like this:

http://pedia.directededge.com/api/v1/article/KDE/tags/company/

Which returns:

<?xml version="1.0" encoding="UTF-8"?>

<directededge version="0.1">
  <item id="KDE">
    <link>Trolltech</link>
    <link>Novell</link>
    <link>Hewlett-Packard Company</link>
    <link>Nokia</link>
    <link>World Wide Web Consortium</link>
    <link>Mandriva</link>
    <link>Canonical Ltd.</link>
    <link>Sirius Satellite Radio</link>
  </item>
</directededge>
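If you're building these URLs in code, something along these lines should do.  It's a sketch: the article_url helper is hypothetical, and percent-encoding titles that contain spaces or parentheses is an assumption about what the service accepts, not something documented here.

```python
from urllib.parse import quote

BASE = "http://pedia.directededge.com/api/v1"


def article_url(title, tag=None):
    """Build the query URL for `title`, optionally filtered by a single tag."""
    # Assumption: titles are percent-encoded as one path segment.
    url = f"{BASE}/article/{quote(title, safe='')}"
    if tag is not None:
        url += f"/tags/{quote(tag, safe='')}/"
    return url


print(article_url("KDE", tag="company"))
```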

You can query any article for any tag (unlike in the web interface).  Right now the results for “off topic” tags tend to be hit-or-miss.  One of the other big items on our to-do list is improving tagged results in our engine.

I’m posting incremental updates like this in the hope that, if you’re planning on using our API in a mashup, you’ll let us know what you like and don’t like before we freeze v1.

We’ve also decided on a couple of limitations for the open API that aren’t true for our commercial API (running either on customer data sets or open data sets):

  • You’re limited to 10 results.
  • You can only filter on one tag at a time, meaning, you can’t get ranked results for movies and music simultaneously.

We think those are pretty reasonable and still give users a fair bit of room to play for free.  If you’re interested in using our commercial API, drop us a line!  We’ve also just created an announcement list where we’ll notify folks that are signed up of important details.  You can sign up for that here.

First Encyclopedia API bits up.

This will still definitely be in flux, but I started getting parts of the REST API up if folks want to play with it.  Warning:  the format may change.

You can now hit something like:

http://pedia.directededge.com/api/v1/article/KDE

And get back:

<?xml version="1.0" encoding="UTF-8"?>

<directededge version="0.1">
  <item id="KDE">
    <link>GNOME</link>
    <link>Unix-like</link>
    <link>Desktop environment</link>
    <link>Konqueror</link>
    <link>Qt (toolkit)</link>
    <link>KDE 4</link>
    <link>GNU Lesser General Public License</link>
    <link>X Window System</link>
    <link>KPart</link>
    <link>Widget toolkit</link>
  </item>
</directededge>
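Here's a minimal sketch of reading that response in Python with the standard xml.etree.ElementTree.  The parse_related name is made up, the sample is a shortened copy of the response above, and since the format may still change, treat the element names as a snapshot rather than a contract.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the article response shown above.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<directededge version="0.1">
  <item id="KDE">
    <link>GNOME</link>
    <link>Unix-like</link>
    <link>Desktop environment</link>
  </item>
</directededge>"""


def parse_related(xml_bytes):
    """Return (article id, list of related article names) from a response."""
    root = ET.fromstring(xml_bytes)
    item = root.find("item")
    if item is None:
        # No <item> element: treat it as "nothing found".
        return None, []
    return item.get("id"), [link.text for link in item.findall("link")]


article, related = parse_related(SAMPLE)
print(article, related)
```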

I’ll be adding support for JSON output and filtering based on tags in the next few days.  Once I’ve got a set of features there that I consider complete I’ll freeze “v1” so that people can create mashups based on it and be sure that the API will remain stable.

This does do capitalization correction, but does not do redirect detection.  I’m debating whether to do that by default or to use another REST path, since it requires a couple more DB queries and is therefore a little slower.

Toy of the Day: FeedMySearch

Like any new startup co-founder, I’m obsessive about seeing how what we’re doing trickles out over the web.  Being an RSS-warrior, today I went looking for a Google search to RSS converter and found FeedMySearch, which now, a few hours into using it, seems to do quite well in pulling in new information about Directed Edge as it hits Google’s indexes.

FeedMySearch for Directed Edge in Thunderbird

FeedMySearch for directededge.com in Thunderbird

 

Nifty tool.  Anything that stops me from compulsive reloading is a win.  Now back to implementing new features.  :-)

Directed Edge Launches Recommender Engine Public Beta!

It’s an exciting day for us at Directed Edge.  Today we’re finally putting our Wikipedia-based technology preview out there for people to play with.  Before you click over to it, here’s a little about what you’re looking at.

As our name implies, we’re graph theory nerds.  We look at the roughly 60 million links between the 2.5 million English Wikipedia pages, and with a few extra cues from the content, figure out  pages related to the current one and put that in a little box in the upper left (as evident from the image on our home page).  In some cases, if we’re able to pick out what sort of page it is, we also drop in a second box with just other pages of the same type.

Finding related pages in Wikipedia isn’t fundamentally what Directed Edge is about.  We’ve got a super-fast in-house graph storage system that makes it possible to do interesting stuff with graphs quickly, notably figure out which pages are related.  We’ve already got a couple of pilot customers lined up and will be working with a few more in the next weeks to analyze their sites and figure out how things are related there.  We’ve got a prototype of our web-services API that they’ll be using to send us break-downs of how stuff’s connected on their site and we’ll send back what we hope are some pretty groovy recommendations.

There are dozens of things in the pipe for us:  ways to make recommendations better, ways to make the Wikipedia demo cooler, things customers want to see in our web services that we’d previously not thought of, and we could ramble on that for a while, but there are a few things that are on the very near horizon that didn’t quite make it into this round:

  • An open web-services API for accessing the recommendations from our Wikipedia demo.  This will be a stripped down, read-only version of our commercial API usable in web mash-ups.
  • Better tagged (e.g. music, movies, authors, companies) recommendations.  Support for tagged articles was one of the last features that made it into the current demo, and we’ve got some plans for improving the results for those.
  • Pulling in non-English Wikipedia variants.  We’ll probably start with German and French.
  • More info about our commercial web-services API.  We’re still nailing down some of the details, but as soon as we freeze the API for the first customers, we’ll add more docs to the site.

If you subscribe to our news feed you’ll see immediately when those services go live.  Even though we’re still in the beta-phase and are only accepting a limited number of customers, if you think you’d be interested in using our engine for your site down the line, we’d encourage you to register now since we’ll be offering a discount for our commercial services to everyone who fills out their info in the contact form during the beta phase.

More soon.  Enjoy!

Launch date, August 13th.

We’ve now committed to going into public beta / technology preview next Wednesday, August 13th.  We’ll be launching our new site with more information about our products and services at that time.

Press / bloggers may request invites by sending us a mail.  It’ll be an exciting next few days as we iron out the last kinks and get ready for the onslaught.

The website will be a bit in flux, but our bio info is still available here.

Edit: If you’ve been testing the demo previously the location has changed.  Drop us a line for the new URL.

In defense of Perl.

Kickin’ it old school.

 

I started writing web applications around 1997.  On Solaris.  Using Netscape’s web server.  In Perl.

My LinkedIn profile starts thusly:

I began working with LAMP back in the days where the men were men and we ate Perl for breakfast. Installing Linux with 40 floppy disks puts hair on your teeth. One thing led to another, and before I knew it I was living with an E-Commerce system. What can I say? I was young and needed the money.

Around 2001 I took a departure from the web world to work on enterprise and desktop software. Sure, I slung a little web code here and there, but I didn’t track the technology landscape like I did back in the good old days.

When looking at founding Directed Edge, it was time to re-approach the web and get back on friendly terms.  Like an old-timer desperately trying to identify all that is hip, I set out to figure out what the cool kids were doing.

Python with Django and Ruby on Rails.

I learned bits of Python and Ruby, things I’d been meaning to learn for ages.  Ruby’s syntax and I didn’t hit it off at first, so I spent a couple days reading Learning Python, which had been sitting on my shelves for a few months.

But friends, home is where the heart is, and even after getting a reasonable grasp on Python I kept going back to Perl.  There are two reasons for this:

CPAN.  Oh, sweet, sweet CPAN.

 

The CPAN has grown so large and comprehensive over the years that many people learning Perl seem to elevate it to a sort of mythical status, and express surprise when they begin to encounter topics for which a CPAN module doesn’t exist already.

-Wikipedia article on CPAN

CPAN is huge, easy to search, well documented, and trivial to deploy.  Need some code to do TLS authenticated SMTP transfers?  Trivial.  Need a WSDL compiler to work with an old SOAP API?  Up and going in a few minutes.  Need to test a REST API with a really mature HTTP implementation?  It’s there.  Need code for quickly generating mail routing code for feedback processing?  Bingo.  But wait, that’s all pretty common stuff, right?  There are even CPAN modules for stuff like tracking quantum superpositions in quantum computing algorithms or quickly building genetic algorithm implementations (my two research areas in college).  And all under the same roof.

And let’s back up one bit; for all of the perceived culture of sloppiness, I earnestly believe that Perl has the strongest documentation culture of any major programming language.  By and large CPAN’s 12,000-something modules are rather well documented with examples and gotchas in addition to the basic API docs.  As a special bonus, as soon as you’ve installed them with the command line cpan tool (which automatically resolves dependencies, downloads, tests and installs) they’re available in your system’s man pages.  The standard man pages for core language features are great, and well written to boot.  The Camel Book will forever have a place on the gilded streets of O’Reilly’s hall of fame as possibly the most enjoyable to read 1000+ page technical book ever written.

Fast hacks, fast, quickly.

 

Combined with the power of CPAN, Perl just has something about it that makes gruesome, and gruesomely fast hacks possible.  Much of this is owing to CPAN solving 75% of the universe’s problems for you from the get-go.  But Perl is something of the Sicilian mobster of the programming world — it gets stuff done.

Add to that that it’s one of the speediest scripting languages performance-wise, and it’s great for quick-and-dirty hacks that programmers invariably have to come up with on a regular basis.  Perl seems to be optimized for writing as little code as possible to get the job done.

What I’m not saying.

 

Most people talking about Perl are quick to groan about its ugliness.  I’ll first note that most of them don’t know Perl, so it’s my earnest belief that much of that is faddishness.  Perl can be well written, but its syntactic moral flexibility means that there’s a lot of ugly Perl out there.  I’m not going to try to pass that off as a good thing.  But a real Perl mensch can write Perl that’s as easy to read as code in most other popular programming languages.

I’m also not advocating doing large projects in Perl.  In a decade of Perl slinging, it’s only happened a time or three that I’ve written tools that were more than a couple thousand lines of code.  (But again, the beauty lies in that I’ve rarely needed to.)  Nothing particularly central to Directed Edge is written in Perl, but it’s been my Swiss Army Chainsaw on the fringes: converting data formats, processing simple forms, interfacing with databases.  Glue code, basically.

And there I have to say, despite wanting oh-so-desperately to be one of the Python cool kids, I think Perl is here to stay.

Incorporated.

Check one thing off the list of stuff that’s been taking time for us:  as of August 1st, Directed Edge is incorporated; we’re pretty excited.  We’ll transfer the IP into the new company next week.

Even though updates here have been slow, it’s been for the best of problems — despite not having our open beta launched we’ve already got a couple of pilot customers and I’ve been trying to get the integration process nailed down for our web services API.

The prototype is already in the state that we’ll use for our launch and we’ve got a few dozen people testing it in our closed beta.

So what’s next?  Well, we’re going to try to get our public presence ready for launch, mostly meaning finishing up our in-progress web site, getting press releases ready in German and English, getting a few more bloggers on board and doing the deed.  If you’re a blogger or have press connections and want to get the scoop before we launch, drop us a line!

Getting the rest of the legal stuff taken care of will take some time in the next week as well, but after thinking too much on the incorporation issue, it’s nice to be moving along.

Seedcamp Berlin

 

Valentin and Scott at Seedcamp Berlin

 

 

I stumbled across this photo of Valentin and me getting our questions about incorporation style answered at Seedcamp Berlin.  I must say that was one of the more useful events that we’ve been to so far.  It was a great chance to get in touch with an impressive collection of mentors and to network with other (mostly) German startups.

Berlin: OpenCoffee / Business & Beer this week

There will be another update on Directed Edge stuff soon-ish, but I just wanted to get out a list of the upcoming Berlin events where we and other Berlin startups will be talking shop and doing demos:

Next round of events.

Just finished up attending the last of the events in the recent post and thought I’d mention the next batch on the Directed Edge calendar:

I’ve got two weeks left until I’m full-time on Directed Edge and it’s looking to be an exciting time.  Like most startups stumbling towards launch, there are a thousand threads being followed up simultaneously, the last couple of weeks more business than technical.  We’re going to make our demo progressively more open during the next weeks, but the feedback we’ve gotten from the first testers is encouraging.

Not-exactly-TechCrunched.

But I did make it into the video feed from the Prague TechCrunch event. There’s about a half-minute pseudo pitch in here. Lesson learned from watching it back? Loosen up.  Get to the point faster.  Like in 10 seconds.  (Isn’t showing up in syndicated version, click through to the full post.)

Launch Count Down, Private Beta

We’re coming very close to having a beta to launch.  The interesting parts are good to go, but we still need to:

  • Clean up the (visual) design
  • Fix a couple of rendering errors on complex pages
  • Get the non-demo parts of the website in place — specifically info on what we’re doing and how pilot customers can get up and running
  • Tune the caching code so that it can handle a usage spike on launch

There’s still intentionally little information on the home page, and that will probably remain that way until we’re ready to go for a public beta.

However, I’ve been talking to more and more people face to face and showing off the current prototype (most recently at the two events that I just returned from in Prague), so in the next couple of days we’re going to start a private beta for folks that want to start exploring while we knock out the stuff above.  We’ll accept a limited number of requests.  Send us a mail or wait for the registration form to show up on the home page later this week if you’re interested in getting to the goodies a couple weeks ahead of the crowd.

Places I’ll Be

Want to hear more about Directed Edge? I’ll be at the following upcoming events:

Valentin will probably be at the Berlin-based events as well.  I’d originally planned to be at the UK Hackers’ meetup in London the same day as the Prague TechCrunch event, but the combination of the TechCrunch and Ubuntu events the same weekend was too much to pass up.  I’ll try to catch the London folk the next go around.  Going to be at any of those events?  Drop me a line and we’ll be sure to meet up.

Trapped in an elevator.

Today, for the first time I went to a social event here in Berlin for entrepreneurs. After the lamentations that I’d been exposed to by the locals, I can say that I was pleasantly impressed. The group seemed decent — a mix of coders and business folk and some in between.

One thing was painfully obvious. I’m a whole lot worse at explaining what we’re doing than I thought I would be. Most folks could lay out the basics of what they were doing in a few seconds. I stumbled over it even when rambling at length. I’m too used to giving presentations, where I’ve got an audience and an hour.

The first time that we presented our ideas we oversimplified. We’re working on some fairly hard problems and we didn’t manage to convince the audience that we both had something compelling and the skills to pull it off.

This time I went too far in the other direction. I blabbed too much about my background (more than probably anyone cared, and likely to the point of seeming arrogant) and in spoken form, still struggled with outlining what it is exactly that we’re doing.

A good “elevator pitch” is harder than it seems. For us we’ve got to:

  • Show how what we’re doing is interesting.
  • For the non-computer scientists, point out that it’s non-trivial (i.e. hard to duplicate).
  • For computer scientists, convince them that we’re skilled enough to pull it off.
  • Briefly explain how we plan to monetize it.
  • Boil that down to about a minute.

 
The crux of the difficulty, perhaps, lies between points two and three.  For non-technical folk, what we’re doing seems easy.  For technical folk, it seems very hard.

I’ve got another shot at this at the Open Coffee meeting on Friday. Hopefully by then I’ll have managed to get a little closer to something compelling. I’d like to have a clear message that we’re able to present by the time that we go for a public beta in the near future.

Why don’t computer scientists track sub-fields other than their own?

As I’ve worked through some of the ideas that we’ll be using at Directed Edge over the last few years I’ve stumbled across several subfields of computer science: information retrieval, semantic webs, graph algorithms, recommender systems.

As we cross boundaries there is one question that always strikes me: Why don’t computer scientists track related sub-fields?

This problem seems to afflict academia in general, but computer scientists seem to have even worse tunnel vision than par. When I stumble across papers that are potentially useful to the work that we’re doing I tend to track down the papers they’ve cited in search of other useful pieces of the puzzle and invariably there is almost no overlap in co-citation across sub-fields.

Why is this?

On the one hand, this provides a measure of excitement to me; there are these pools of knowledge that I manage to stumble across every few months that help bring our technology closer to realization. On the other hand, there’s this nagging wonder that so many brilliant minds aren’t talking to their colleagues to put together cool solutions to interesting problems.  Theories?

Shameless plug: We’ve started a blog aggregator for startup related blogs. Got one? Drop me a mail and we’ll add you to the syndication.

Networking.

After posting this on Hacker News I’ve gotten pretty serious about setting up some things to get the Hacker News startup community to pull together to our collective advantage.  Here are the main points:

  • Link to me at LinkedIn and join the Hacker News group (assuming you’re a news.YC regular)
  • Join our mailing list for entrepreneurs.  Let’s hone each other’s projects into something great.
  • Join the Planet.  Mail me.  If you’ve got a startup, we’ll syndicate you.  What’s a Planet?  It’s something like this.  They’re real community hubs in the Open Source world.
  • Food.  We’re working on this.  If you happen to be in or near Berlin, we’ll do dinner regularly starting Saturday the 19th.  If you’re not in Berlin and want to host, let’s get a calendar set up.