Nov 12 2012
 

Currently, as I've mentioned in previous posts, beaches are a strangely under-served segment of the local search space. Searches on Google and Bing for beaches return entities such as resorts and restaurants that happen to match certain beach-related terms. If you search for 'beaches in kauai' you will get hits for beach resorts and the like, not for the beaches themselves.

There is plenty of content about beaches, from the many dedicated locale sites to general travel-related community sites (like Trip Advisor) and editorial sites (like Fodor's). In addition, there are a number of resources that aggregate structured data about beaches. These include open data resources like GeoNames and GNIS but also proprietary resources like Foursquare.

Unfortunately, there is nothing that brings all these things together. There is no product which provides an aggregate view of the set of beaches or of the collection of things said or otherwise reported about them.

With an upcoming trip to Hawai'i at the end of the year, I wanted to make sure I was getting the best value for my travel dollars. I've built a prototype beach search engine which provides the following:

  • a partly curated set of beach data covering approximately 12,000 international beaches
  • aggregation of beach-related content
  • search functionality (so you can search for kid-friendly beaches that offer good snorkeling)
  • summarization of Flickr images so that an impression of what it's like to be at the beach can be formed

I believe there is plenty of potential for such a system. I've already found some hidden beaches that I wasn't aware of at our destination that I'm excited to check out when we get there. My goal is to make the system public in the next few weeks (my trip will be a forcing function for this!).

For now, here is a screen shot of part of the experience.

Beachgeek

Sep 30 2012
 

[I work at Microsoft where I work on projects that drive data quality in our local search experiences on Bing and other clients.]

Most of the civilized world, by this time, has heard about Apple's fumble with their new mapping and local search capabilities in iOS. Apple replaced Google's application - which is possibly the largest investment in cartography, imagery and local data ever made - with a home-grown solution reportedly assembled from maps from a number of providers including TomTom and local data from providers including Yelp.

As Apple has realised, there is a lot to learn for an entrant in this space. The hardest lesson they are learning now is actually not about data sources but about metrics and how to assess the quality of the product - something in which they don't appear to have invested in a manner befitting their global user base.

Apple will soon learn another lesson. Once the fog has lifted over the state of their entity data set (e.g. fixing the location of cities and ensuring coverage for local businesses), Apple will have to start worrying about ranking search results. When a user asks for {kid friendly sushi in seattle}, which of the many sushi places ought they to return? Apple will face a choice between relying on specialized providers - with whom they will actually be in competition - and creating the resources required for relevance ranking themselves.

A key aspect of providing appropriate indexing and ranking features is the association of content with the entities. Where does this content come from? The web. How is it acquired? Through large scale crawling, understanding and indexing.

Apple will likely find that as they pull on the thread of local search, their scope will have to open up to quite a different world, one in which - as with local - they don't yet have the expertise.

Aug 26 2012
 

We will soon be embarking on a short trip to Hawai'i. Naturally, I'm turning to search engines to find out about the best beaches to go to. However, it turns out that this simple problem - where to go on vacation - is terribly under-supported by today's search engines.

Firstly, there is the problem with the Web Proposition. The web proposition - the reason for traditional web search engines to exist at all - states that there is a page containing the information you seek somewhere online. While there are many pages that list the 'best beaches in Hawai'i', as the analysis below demonstrates, these are just sets of opinions - often very different in nature. An additional problem with the Web Proposition is that information and monetization don't always align. Many of the 'best' beaches pages are really channels through which hotel and real estate commerce is done. Thus a balance is needed between objective information and commercial interests.

Secondly, beaches are not considered local entities by search engines. While the query {beaches in kauai} is very similar in form to the query {restaurants in kauai}, the latter generates results of entities of type <restaurant> while the former generates results of entities of type <businesses that have beach or kauai in their name or associated content>. While local search sounds like search over entities which have location, it is largely limited to local entities with commercial intent.

Finally, there is general confusion due to the fact that the state of Hawai'i contains a sub-region (an island) called Hawai'i.

To get to the answer to my original search query, I collected 8 sites returned by a search on Bing or Google for the query {best beaches hawaii}. I then reviewed each of these and created a spreadsheet tabulating all the beaches and whether each site voted for them.

Of the 57 beaches that were mentioned on at least one site, the average number of mentions was 1.89. This indicates a general lack of consensus regarding which are the best beaches. In fact, most beaches (38 out of 57) have only a single vote. Consequently, while there might be a set of pages returned by search engines for queries looking for such information, a user will be reading - in isolation - very different opinions with no aggregate view summarizing them.
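The tally behind figures like these is simple to reproduce. Here is a minimal Python sketch of the idea, assuming each site's list is reduced to a set of beach names. The site names and picks below are hypothetical placeholders, not the spreadsheet from this post, and `tally_votes` and `summarize` are names invented for illustration.

```python
from collections import Counter

# Hypothetical data: each site maps to the set of beaches it recommends.
# These are placeholders, NOT the actual spreadsheet described in the post.
site_picks = {
    "site_a": {"Hanalei Bay", "Poipu Beach", "Lanikai Beach"},
    "site_b": {"Hanalei Bay", "Waikiki Beach"},
    "site_c": {"Lanikai Beach", "Hanalei Bay", "Kaanapali Beach"},
}


def tally_votes(picks):
    """Count how many sites mention each beach (one vote per site)."""
    votes = Counter()
    for beaches in picks.values():
        votes.update(beaches)
    return votes


def summarize(votes):
    """Return (distinct beaches, mean mentions per beach, single-vote beaches)."""
    n = len(votes)
    mean = sum(votes.values()) / n if n else 0.0
    singles = sum(1 for v in votes.values() if v == 1)
    return n, mean, singles


votes = tally_votes(site_picks)
n_beaches, mean_mentions, single_vote = summarize(votes)

# Beaches ranked by number of sites that voted for them.
print(votes.most_common())
print(n_beaches, round(mean_mentions, 2), single_vote)
```

A mean close to 1 and a large share of single-vote beaches are the signals of low consensus described above.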

The top beaches are summarized in the following table showing the beach and the total votes.

Beachtable

Search engines could do a far better job by:

  1. Generalizing local search to include any entity which has location, not just commercial entities.
  2. Leveraging editorial content (like that reviewed in this post) so that variance may be exposed to the user but aggregates can also be synthesized.

In addition, there is a very large opportunity here in analyzing the content associated with these local entities to determine which beaches are best for different activities, their accessibility, and so on.

Aug 02 2012
 

I'm late to this, but it is certainly worth posting. A team of researchers at CMU have been working on mining Foursquare check-in data to determine behaviourally defined neighborhoods ('livehoods'). They have put together a site - livehoods.org - which showcases their work.

The site has maps for Pittsburgh and Seattle - the work is even more awesome in that it was published at ICWSM this year.

Interacting with the map, as with this example below, shows the boundary of the derived neighborhood. What is interesting in the case below is that the area crosses a natural boundary (the river).

Livehood

Update: here is the video of their presentation at ICWSM 2012.

Apr 19 2012
 

A colleague brought to my attention a post on the influential search blog Search Engine Land which makes claims about the quality of local data found on search engines and local verticals: Yellow Pages Sites Beat Google In Local Data Accuracy Test. The author describes surprise at the outcome reported - that Yellow Pages sites are better at local search than Google. Rather, we should express surprise at how poorly this article is written and at the intentionally misleading nature of the title.

The article describes an analysis done by Implied Intelligence. The analysis looks at 1,000 local businesses in the US. Here is the first problem - these businesses exclude chains and franchises. In addition, if a website wasn't known for the business, it too was excluded. With some general assumptions about the definition of local business, it is safe to assert firstly that there are many instances of chains and franchises out there and secondly that many (if not most) businesses don't have a website (the distribution varies by category, of course). Quite where the original sample of 1,000 came from is not reported.

This biases the analysis - Google, like Bing, is interested in all local entities.

The initial part of the analysis is reasonable - looking at coverage (% in the sample found on the site) and quality (duplicates, phone number errors and address errors). Note, however, that this is a measure of the local data, not of local search. A search product includes a relevance component and it is quite possible that a well-tuned relevance algorithm might suppress duplicates.

The last table in the analysis sees us swinging back to bad reporting. It describes the percentage of records that have a certain attribute: URL, Hours of Operation and 'additional info'. Did you see what they did there? This is what we call the coverage of an attribute, and it tells us nothing as to the quality of the value. I can quite easily populate a local database with 100% coverage for all attributes. They might all be wrong, but the coverage could be 100%. Consequently, this table is reasonably close to meaningless. If they had included the precision of these values, then coverage could have been used to compute recall, but that wasn't done.
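To make the distinction concrete, here is a minimal Python sketch of coverage versus precision for a single attribute. The record layout, the `truth` reference set and the function name `attribute_metrics` are all assumptions for illustration; none of this comes from the Implied Intelligence analysis. Recall is taken here to be the fraction of all records that carry a correct value, which equals coverage times precision when every record has a true value for the attribute.

```python
def attribute_metrics(records, truth, attribute):
    """Return (coverage, precision, recall) for one attribute.

    records: {record_id: {attribute: value}} as found on the site (hypothetical layout)
    truth:   {record_id: {attribute: value}} verified reference values (hypothetical)
    """
    total = len(records)
    populated = 0  # records that have any value for the attribute
    correct = 0    # records whose value matches the reference
    for rec_id, rec in records.items():
        value = rec.get(attribute)
        if value is None:
            continue
        populated += 1
        if value == truth.get(rec_id, {}).get(attribute):
            correct += 1
    coverage = populated / total if total else 0.0
    precision = correct / populated if populated else 0.0
    recall = correct / total if total else 0.0
    return coverage, precision, recall


# Made-up example: every record has a URL, but only one of them is right.
records = {
    1: {"url": "http://example.com/a"},
    2: {"url": "http://example.com/wrong"},
    3: {"url": "http://example.com/wrong"},
    4: {"url": "http://example.com/wrong"},
}
truth = {
    1: {"url": "http://example.com/a"},
    2: {"url": "http://example.com/b"},
    3: {"url": "http://example.com/c"},
    4: {"url": "http://example.com/d"},
}

coverage, precision, recall = attribute_metrics(records, truth, "url")
print(coverage, precision, recall)  # prints: 1.0 0.25 0.25
```

In this toy case coverage is a perfect 100% while three of the four values are wrong, which is exactly why a coverage-only table says so little.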

In summary, an important search publication has either written an intentionally misleading article, or has demonstrated that it doesn't really get data.

Dec 03 2011
 

There appears to be an interesting discussion going on around Siri (Apple's iOS assistant which, among other things, is a mediator to web search and local search functions) and its inability to locate 'abortion clinics'. Danny Sullivan has a long piece on the topic which is worth a read, but it concludes, in part, that the system can't find the clinics because the clinics themselves don't self-describe as 'abortion clinics'. This highlights one of the challenges of search and local search in particular. There are at least 2 descriptions of the world: the source description (how the business or entity describes itself) and the user or customer description (how users conceive of and organize the world). These are not always the same, and a good search engine will figure that out and mediate between the two partially aligned ontologies.

All up, Apple is going to learn that being in the search business is not as simple as hooking up to a single data provider or even a single services search API.