
Search Engine

Easy

Given a search query, design a search engine that returns 10 URLs that contain the keyword

Functional Requirements

  • 10 relevant results should be returned for each query if they exist
  • When a user clicks on a result they should be sent to the page
  • A "good match" is when a url contains a specific keyword
  • Searches can only consist of a single keyword
  • All grammatical tenses should map back to the same base words
  • All "stop words" (a, an, you, or, etc.) should not be included in the index
  • Urls should be checked daily for updates
  • URLs do not need to be ranked by relevance

Non-Functional Requirements

  • 99.99% Availability
  • Search latency less than 200ms

Usage Expectations

  • 500,000 queries per day
  • The website list to be stored is 0.5 megabytes in size

Assumptions

  • Pagination of results isn't required
  • You are given a dataset containing all of the web pages that could be considered in the search
  • Urls take up 20 Bytes on average
  • Each webpage contains 20 Megabytes of text content per page
  • Searches are uniformly distributed across all available search terms
  • Each letter of the alphabet has an equal number of keywords associated with it
Companies
Google
Airbnb
Amazon
Github

Grading Criteria

Before looking at the solution, do your best to document the following:

  • Diagrams explaining the system architecture, the read flow, and the write flow
  • Database tables and their fields
  • API endpoints, including inputs
  • Analytics to collect
  • Data partitioning & Replication Strategy
  • Capacity Estimations
  • Caching strategy
  • Load balancing strategy
  • Security considerations

After looking at the solution, compare your answer with the solution and mark what you missed


Flow Explanations

Read Flow

  • The user submits a search query
  • The receiver server sends the query to be "pre-processed": stop words are removed, casing is normalized, and grammatical tenses are mapped back to their base words
  • The receiver then uses the cleaned keyword to query the inverted index
  • The receiver then returns the list of relevant URLs to the user

Index Preparation Flow

  • A daily cron job kicks off a crawler
  • The crawler grabs websites from the provided website list
  • The crawler then kicks off workers to scrape content from the websites
  • The content is then put through a "pre-processing" step to remove stop words, casing, and flatten grammatical tenses.
  • The document id is then stored in the inverted index under any keyword it matches
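
The index-build step can be sketched as follows, assuming the crawler has already scraped documents into memory; the stop-word list and the crude suffix-stripping stemmer are simplified placeholders for the real pre-processing step:

```python
from collections import defaultdict

STOP_WORDS = {"a", "an", "you", "or", "the"}

def normalize(word: str) -> str:
    # Toy normalization: lowercase, then strip a few common suffixes.
    word = word.lower()
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def build_inverted_index(documents: dict[int, str]) -> dict[str, set[int]]:
    """Map each normalized keyword to the set of document ids containing it."""
    index: dict[str, set[int]] = defaultdict(set)
    for doc_id, content in documents.items():
        for raw in content.split():
            word = raw.strip(".,!?").lower()
            if word in STOP_WORDS:
                continue  # stop words are dropped before indexing
            index[normalize(word)].add(doc_id)
    return index
```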

Cost Estimations

  • URLs in the list: 0.5 MB ÷ 20 bytes per URL = 25,000
  • Amount of text to process: 25,000 URLs × 20 MB per page = 500,000 MB, or 500 GB
  • Total size of the index: 1
  • Queries per second: 500,000 ÷ 86,400 seconds ≈ 5.79
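
The arithmetic behind these figures, assuming 86,400 seconds per day:

```python
# Back-of-the-envelope numbers derived from the stated assumptions.
list_size_bytes = 0.5 * 1_000_000        # 0.5 MB website list
bytes_per_url = 20
urls = int(list_size_bytes / bytes_per_url)   # 25,000 URLs

mb_per_page = 20
text_to_process_mb = urls * mb_per_page       # 500,000 MB = 500 GB

queries_per_day = 500_000
qps = queries_per_day / 86_400                # ~5.79 queries per second
```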

Database

We have a database where each document is stored as an entry containing an id, its stripped content, and its URL.

Type of database: NoSQL

Table 1: Content Database

Primary Key: id (int)

  • content: varchar
  • url: varchar
  • creation_date: datetime
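
As an illustration, a record in the content database might look like the following (all field values are hypothetical):

```python
# Example record in the content database; values are illustrative only.
document = {
    "id": 42,
    "content": "design a search engine",  # stripped, pre-processed text
    "url": "https://example.com/page",
    "creation_date": "2024-01-01T00:00:00Z",
}
```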

Inverted Index

We use an inverted index for fast lookups and content grouping. The key to the inverted index should be the keyword that was found in the document.

Key: keyword

  • documents: a list of content ids

Cache

A cache wouldn't add much value here, since all searches are uniformly distributed across the index.

Data Partitioning and Replication

For the content database, range-based partitioning can be used, since each letter has the same number of keywords associated with it, implying a uniform distribution.

For the inverted index, range-based partitioning can also be used, since search frequency is uniform across all terms.
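
One way to sketch the range-based routing, assuming one shard per letter of the alphabet (the shard count is an assumption, not part of the problem statement):

```python
import string

NUM_SHARDS = 26  # one shard per letter; keywords are uniform per letter

def shard_for_keyword(keyword: str) -> int:
    """Range-based partition: route a keyword to a shard by its first letter."""
    first = keyword[0].lower()
    if first not in string.ascii_lowercase:
        return 0  # fallback shard for non-alphabetic keys
    return string.ascii_lowercase.index(first)
```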

Analytics

  • query_received - log when a search happens and what term was searched

APIs

search

Parameters:

  • query (string)
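
A possible shape for the endpoint, sketched as a plain handler function; the JSON response format is an assumption, and the index lookup is a stub:

```python
import json

# Hypothetical handler for GET /search?query=<keyword>: returns up to
# 10 matching URLs as JSON. The inverted index is passed in for clarity.
def handle_search(query: str, inverted_index: dict[str, list[str]]) -> str:
    urls = inverted_index.get(query.lower(), [])[:10]
    return json.dumps({"query": query, "results": urls})
```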

Security

  • We can add rate limiting based on IP address and user agent to make sure we aren't being spammed or scraped by a single user