Search Engine
Beyond Simple Text Match
Srikanth Venugopalan
ThoughtWorks, Chennai
Need for Search?
- Address disconnect between definition of entities from a user and provider viewpoint
- A very prominent user flow
- Users are so used to Google that they end up searching for almost everything
Why not use something like Google?
Google/Yahoo/Bing/Baidu etc
- are web search engines that crawl webpages
- are very mature at retrieving them
But!!
Webpages denormalize data!
Web Search Engines
Pros
- Indexing pages in Google etc. provides a wider reach
- coherent, independent source of data
- Optionally, one could get statistics out of the box.
Cons
- not customizable to various data structures
- Multiple views become difficult (Faceting / Sorting)
Tools available
Lucene based
- Solr
- ElasticSearch
Endeca
Support in RDBMS
Postgres, MS SQL Server, Oracle etc have full text indexing support
Going beyond Text Match
Search on structured data!
Or rather, semi-structured
De-normalize highly nested structures.
Extremes!
Foo has Bar and Baz. Bar comes with Fizz1 and Fizz2, whereas Baz consists of Blah with Buzz1 and Buzz2
Foo
-Bar
--Fizz1
--Fizz2
-Baz
--Blah
---Buzz1
---Buzz2
[ nth normal form ] <----------------------------> [flat free text]
^
Ideal for search engines?
Foo
-Bar
-Bar_Fizz1
-Bar_Fizz2
-Baz
-Baz_Blah
-Baz_Blah_Buzz1
-Baz_Blah_Buzz2
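The flattening above can be sketched in a few lines of Python. This is only an illustration: the function name `flatten` and the nested-dict representation of Foo's children are assumptions, not part of any search engine's API.

```python
def flatten(node, prefix=""):
    """Flatten a nested dict into a list of underscore-joined field names."""
    fields = []
    for name, children in node.items():
        # Prefix each child with the path of its ancestors, e.g. Baz_Blah_Buzz1.
        path = f"{prefix}_{name}" if prefix else name
        fields.append(path)
        fields.extend(flatten(children, path))
    return fields


# The children of Foo from the example (hypothetical representation).
nested = {
    "Bar": {"Fizz1": {}, "Fizz2": {}},
    "Baz": {"Blah": {"Buzz1": {}, "Buzz2": {}}},
}

flat_fields = flatten(nested)
for field in flat_fields:
    print(field)
```

Each flattened name can then be declared as a field in the search engine schema.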
Exploiting Data-structures
Look-up by a specific field
Combine fields to get multiple combinations
(Nothing new, we've been doing this since SQL)
But...
..SQL fails when
You get collision of matches across fields.
Name = Kawasaki
Brand = Kawasaki
??
Weights / Boosts
Another Example:
Surface ~ A reflective surface
Surface ~ http://www.microsoft.com/surface/
So when I search for Surface, what should I get?
contd..
Weights / Boosts
"Surface" by itself
- Rarely used by a user looking for a reflective surface.
- High chance that this is a search for "Microsoft Surface"
Analyze most such cases to arrive at the most-searched fields (priorities).
Most search engine tools support defining boosts at query time and at index time.
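A toy sketch of field boosts, in plain Python rather than any particular engine's syntax. The field names, the boost values and the sample documents are all made up for illustration; real engines use far richer scoring than this term count.

```python
# Hypothetical boosts: a match in "name" counts twice as much as one in "description".
FIELD_BOOSTS = {"name": 2.0, "description": 1.0}


def score(doc, term, boosts=FIELD_BOOSTS):
    """Sum the boosted count of exact token matches across the fields."""
    term = term.lower()
    total = 0.0
    for field, boost in boosts.items():
        tokens = str(doc.get(field, "")).lower().split()
        total += boost * tokens.count(term)
    return total


def search(docs, term, boosts=FIELD_BOOSTS):
    """Return the matching documents, highest score first."""
    scored = [(score(d, term, boosts), d) for d in docs]
    hits = [(s, d) for s, d in scored if s > 0]
    hits.sort(key=lambda pair: pair[0], reverse=True)
    return [d for _, d in hits]


docs = [
    {"id": 1, "name": "Wall mirror", "description": "A reflective surface for the hallway"},
    {"id": 2, "name": "Microsoft Surface", "description": "A tablet computer"},
]

results = search(docs, "Surface")
print([d["id"] for d in results])
```

With these weights the product named "Microsoft Surface" ranks above the mirror, whose only match is in its description.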
Defining Weights / Boosts
Identifying Priority of relevance
Eliminate the ambiguity
Work around the fuzziness of natural language
Identify the right field type
Makes it possible to use the operations meant for each datatype
Some More Examples
RANGE QUERY
These queries fetch results whose value for a given field falls within a range.
An Example
price:[100 TO 200]
A user could fetch all documents whose price is between 100 and 200 (inclusive), for example.
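What the range query does can be mimicked in plain Python. This is a sketch only: the function `range_query` and the sample documents are hypothetical, and both bounds are treated as inclusive, as the square brackets indicate in Lucene query syntax.

```python
def range_query(docs, field, low, high):
    """Return documents whose `field` value lies in [low, high], bounds inclusive."""
    return [d for d in docs if field in d and low <= d[field] <= high]


# Made-up sample documents.
docs = [
    {"id": 1, "price": 50},
    {"id": 2, "price": 100},
    {"id": 3, "price": 150},
    {"id": 4, "price": 200},
    {"id": 5, "price": 250},
    {"id": 6},  # no price field, so it never matches
]

matches = range_query(docs, "price", 100, 200)
print([d["id"] for d in matches])
```

Declaring price as a numeric field is what makes this comparison meaningful; on a text field "1000" would sort before "200".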
Some More Examples
- Aggregate query
- Conditional query
- Function query
- Dynamic fields
- Multilingual
- Date/time
- Geospatial
Iterative development and testing
Why?
- Number of cases to be tested gets distributed across the span of development
- Regression testing is continuous - changing/adding/removing one rule can be immediately tested for impact.
So why not do it in all cases?
...it comes at a cost.
Analyzing the data and choosing the right training set could be a daunting task, particularly when the dataset is large and complex.
Choosing a training set
A training set is a subset of the actual data that can be used to run the rules and verify behaviour.
Guidelines -
- good representation of actual data
- Right size: small enough for a tester to handle, yet big enough that ripple effects can be simulated
Implementing using a training set
Steps
- Model the search engine schema using a training set
- Make sure all rules are satisfied independently
- Test combinations of the rules as well, and refine the schema until all conditions are satisfied
Identify the following
- fields to be indexed
- analyzers
- tokenizers
The iteration
Pass n : run rules -> anomalies -> tweak rules & schema -> regression -> Pass (n+1)
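One pass of this loop could look roughly like the Python below. Everything here is illustrative: `run_pass`, the two search functions standing in for "rules", the training set and the expected results are all assumptions made for the sketch.

```python
def run_pass(search_fn, training_set, cases):
    """Run every (query, expected ids) case and collect the anomalies."""
    anomalies = []
    for query, expected_ids in cases:
        actual_ids = [d["id"] for d in search_fn(training_set, query)]
        if actual_ids != expected_ids:
            anomalies.append((query, expected_ids, actual_ids))
    return anomalies


def search_v1(docs, query):
    # Pass n rule: case-sensitive substring match on the name field.
    return [d for d in docs if query in d["name"]]


def search_v2(docs, query):
    # Pass n+1 rule, after the tweak: case-insensitive match.
    q = query.lower()
    return [d for d in docs if q in d["name"].lower()]


training_set = [
    {"id": 1, "name": "Kawasaki Ninja"},
    {"id": 2, "name": "Honda Civic"},
]
cases = [("Kawasaki", [1]), ("honda", [2])]

anomalies_pass_n = run_pass(search_v1, training_set, cases)
anomalies_pass_n1 = run_pass(search_v2, training_set, cases)
print(anomalies_pass_n)
print(anomalies_pass_n1)
```

Because the same cases are rerun on every pass, a tweak that fixes one query and breaks another shows up immediately as a new anomaly.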
Some Tools
..that help in implementing free text search
Apache Nutch
A web crawler that can index web pages. Has integration support with Solr.
Apache UIMA
Unstructured Information Management Applications
Analyze large volumes of unstructured information in order to discover knowledge that is relevant to an end user.
Example: ingest plain text and identify entities, such as persons, places, organizations; or relations, such as works-for or located-at.
Thank You
Questions?