Search System Basics

Understanding Search Systems
Reference: [Book] Elasticsearch Practical Guide

1. Understanding Search Systems

1-1. What is a Search System?

: Search functionality goes by several names, such as search engine, search system, and search service, and the terms differ in scope

Search Engine
- A program that collects information from the Web and provides search results
- The implementation varies depending on the characteristics of the data provided as search results
Search System
- A general term for a system built on search engines to provide reliable search results over large-scale data
- Collects vast amounts of data with a crawler, indexes it with one or more search engines, and serves search results through a UI
- Internal policies can place highly relevant documents at the top of the search results, and assigning weights to specific fields or documents can improve search accuracy
Search Service
- Provides search results as a service using a search system built on search engines

Search Service > Search System > Search Engine
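
The idea of weighting specific fields, mentioned under Search System above, can be illustrated with a minimal sketch. The field names, weights, and documents here are hypothetical, and the scoring is a plain weighted term count rather than the ranking function of any particular engine.

```python
# Hypothetical per-field weights: a hit in the title counts three times
# as much as a hit in the body.
FIELD_WEIGHTS = {"title": 3.0, "body": 1.0}


def score(doc: dict, term: str) -> float:
    """Sum the weighted number of occurrences of `term` across fields."""
    term = term.lower()
    total = 0.0
    for field, weight in FIELD_WEIGHTS.items():
        words = doc.get(field, "").lower().split()
        total += weight * words.count(term)
    return total


docs = [
    {"id": 1, "title": "Cooking pasta", "body": "search for recipes and search again"},
    {"id": 2, "title": "Search basics", "body": "an introduction"},
]

# Document 2 ranks first: one title hit (3.0) outweighs two body hits (2.0).
ranked = sorted(docs, key=lambda d: score(d, "search"), reverse=True)
```

Changing the weights changes the order, which is how an internal policy can push certain documents toward the top.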

1-2. Components of a Search System

: A search system consists of a crawler that collects information, storage that stores collected data, an indexer that transforms collected data into a form suitable for searching, and a searcher that finds matching documents from indexed data.

Crawler

A program that collects data from websites, blogs, cafes (online community sites), etc.
Also called a spider, worm, or web robot
Most information on the web, including files, databases, and web pages, is a collection target
For files, the crawler collects and stores information such as the file name, content, and path; the search engine then searches the stored information to answer user queries

Storage

The physical repository in which the database keeps its data
The search engine stores indexed data in storage

Indexer

To find information that matches user queries, the collected data must be processed into a searchable structure and stored; this is the indexer's role
The indexer uses morphological analyzers to extract meaningful terms and stores the data in an inverted index, a structure well suited to searching

Searcher

Receives user queries and finds matching documents from the inverted index structure stored by the indexer, returning them as results
How well a document matches a query is determined by similarity-based ranking algorithms
Like the indexer, the searcher also uses morphological analyzers to extract meaningful terms from user queries for searching
- Therefore, search quality varies depending on the morphological analyzer used!
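
To illustrate why the analyzer matters on both sides, here is a minimal sketch in which the same analysis function is applied to documents and to the query. The `analyze` function is a crude stand-in for a real morphological analyzer (it only lowercases and strips a trailing plural "s"), and the overlap count is a simplified substitute for a real similarity-based ranking algorithm.

```python
import re


def analyze(text: str) -> list[str]:
    """Crude stand-in for a morphological analyzer: lowercase, split on
    non-alphanumeric characters, and strip a trailing plural 's'."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t[:-1] if t.endswith("s") and len(t) > 3 else t for t in tokens]


def overlap_score(query: str, document: str) -> int:
    """Number of distinct analyzed query terms that appear in the document."""
    return len(set(analyze(query)) & set(analyze(document)))


documents = {
    "d1": "Search engines index documents",
    "d2": "Relational databases store rows",
}
query = "document search engine"

# Without analysis, the raw token "engine" is not found in d1 ("engines").
raw_match = "engine" in documents["d1"].lower().split()

# With the same analyzer on both sides, all three query terms match d1.
ranking = sorted(
    documents, key=lambda d: overlap_score(query, documents[d]), reverse=True
)
```

Swapping in a different `analyze` function changes which documents match, which is the sense in which search quality depends on the analyzer.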

1-3. Differences from Relational Databases

Search engines and relational databases (RDBMS) share many similarities in that they both find and provide data matching queries to users
However, there are many problems with providing search functionality using relational databases

Relational Database

A database is an integrated, managed collection of data
- Broadly classified by storage model, for example relational or hierarchical
All data is deduplicated and stored as structured data in tables consisting of rows and columns
Desired information can be retrieved with SQL statements, but only simple search through text matching is possible
- That is, it cannot break text into individual words or search using synonyms or similar words
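
The limitation of plain text matching can be seen in a small sketch using Python's built-in `sqlite3` module. The table name and sample rows are hypothetical; the point is only that a `LIKE` pattern matches literal substrings and knows nothing about word forms or synonyms.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE articles (id INTEGER PRIMARY KEY, body TEXT)")
conn.executemany(
    "INSERT INTO articles (body) VALUES (?)",
    [("She was running late",), ("He ran a marathon",), ("A quick jog",)],
)


def like_search(term: str) -> list[int]:
    """Return ids of rows whose body contains `term` as a literal substring."""
    rows = conn.execute(
        "SELECT id FROM articles WHERE body LIKE ? ORDER BY id", (f"%{term}%",)
    )
    return [row[0] for row in rows]


exact = like_search("running")   # only the row containing that exact substring
inflected = like_search("run")   # hits "running" but misses the form "ran"
synonym = like_search("sprint")  # no synonym expansion, so nothing matches
```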

Search Engine

Search engines can index and search unstructured data that databases cannot handle
Natural language processing is possible through morphological analysis, and fast search speed is guaranteed based on inverted index structures

Comparison of Key Concepts: Elasticsearch vs RDBMS

| Elasticsearch | Relational Database |
| --- | --- |
| Index | Database |
| Shard | Partition |
| Type | Table |
| Document | Row |
| Field | Column |
| Mapping | Schema |
| Query DSL | SQL |

Note: mapping types (Type) were deprecated in Elasticsearch 7.x and removed in 8.x, so the Type-to-Table analogy applies only to older versions.

2. Search Systems and Elasticsearch

Nowadays, NoSQL (commonly expanded as "Not Only SQL") is widely used for fast searching of large-scale data
Elasticsearch can also be classified as a type of NoSQL, enabling near real-time fast searching through distributed processing
It can search large-scale unstructured data that is difficult to handle with traditional databases, and supports both full text search and structured search
Although it is fundamentally a search engine, Elasticsearch can also be used as a large-scale data store like MongoDB or HBase

2-1. Why Elasticsearch is Powerful

1. Open Source Search Engine

Elasticsearch is an open source search engine built on Lucene, a project of the Apache Software Foundation
- Therefore, it is used by countless people worldwide, and bugs are mostly resolved quickly when they occur

2. Full Text Search

Most databases like PostgreSQL and MongoDB only provide basic text search functionality due to limitations in basic query and indexing structures
- However, Elasticsearch enables more advanced full text search
Full text search means indexing the entire content of documents so that documents containing specific terms can be found
- A traditional RDBMS is not well suited to full text search, but Elasticsearch can search quickly by combining various feature-specific and language-specific plugins

3. Statistical Analysis

Unstructured log data can be collected and aggregated in one place for statistical analysis
By integrating Elasticsearch with Kibana, you can visualize and analyze logs accumulating in real-time
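
As a rough illustration of this kind of statistical analysis, the sketch below counts log records per field value, which is conceptually similar to what a terms aggregation computes. The log records are hypothetical, and this is plain Python rather than an Elasticsearch query.

```python
from collections import Counter

# Hypothetical log records collected in one place.
logs = [
    {"level": "ERROR", "service": "api"},
    {"level": "INFO", "service": "api"},
    {"level": "ERROR", "service": "worker"},
    {"level": "INFO", "service": "api"},
    {"level": "INFO", "service": "worker"},
]


def count_by(records: list[dict], field: str) -> Counter:
    """Bucket records by the value of `field` and count each bucket."""
    return Counter(record[field] for record in records)


by_level = count_by(logs, "level")
by_service = count_by(logs, "service")
```

A visualization tool such as Kibana would chart bucket counts like these.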

4. Schemaless

Databases store and manage data by fitting it to a predefined schema
- In contrast, Elasticsearch can automatically index and search various forms of non-standardized documents
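
One way to picture automatic indexing of unstructured documents is type inference from the JSON values themselves. The sketch below is a simplified model: the type names are loosely based on Elasticsearch's dynamic mapping defaults, and it deliberately ignores details such as date detection, arrays, and keyword sub-fields.

```python
def infer_type(value) -> str:
    """Guess a field type from a JSON value (simplified, assumed rules)."""
    # bool must be tested before int because bool is a subclass of int.
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "float"
    if isinstance(value, str):
        return "text"
    if isinstance(value, dict):
        return "object"
    return "unknown"


def infer_mapping(doc: dict) -> dict:
    """Infer a type for every top-level field of a document."""
    return {field: infer_type(value) for field, value in doc.items()}


doc = {
    "title": "Search basics",
    "views": 42,
    "rating": 4.5,
    "published": True,
    "author": {"name": "Kim"},
}
mapping = infer_mapping(doc)
```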

5. RESTful API

Elasticsearch supports an HTTP-based RESTful API and uses JSON for both requests and responses, so it can be used from any platform regardless of development language, OS, or system
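
Because the interface is just HTTP and JSON, a search request can be built with nothing but a standard library. The sketch below only constructs the request and does not send it. The host `http://localhost:9200`, the index name `articles`, and the field `title` are assumptions for illustration; the `_search` endpoint and the `match` query come from the Query DSL.

```python
import json
import urllib.request


def build_search_request(
    base_url: str, index: str, field: str, text: str
) -> urllib.request.Request:
    """Build (but do not send) a JSON search request for one index."""
    body = {"query": {"match": {field: text}}}
    return urllib.request.Request(
        url=f"{base_url}/{index}/_search",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical host, index, and field names.
request = build_search_request(
    "http://localhost:9200", "articles", "title", "search engine"
)
```

Against a running cluster, the request could be sent with `urllib.request.urlopen(request)` and the JSON response parsed with `json.load`.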

6. Multi-tenancy

Even if indexes are different, as long as the field names to search are the same, multiple indexes can be queried at once
- This can be used to provide multi-tenancy functionality

7. Document-Oriented

Multiple layers of data can be stored in an index as structured documents in JSON format
Hierarchical documents can also be easily queried with a single query
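
Nested fields in a JSON document are typically addressed with dotted paths such as `customer.address.city`. The following sketch flattens a hypothetical nested document into such paths to show how a single query can reach any level of the hierarchy; it is an illustration of the idea, not Elasticsearch's internal storage format.

```python
def flatten(doc: dict, prefix: str = "") -> dict:
    """Flatten nested dictionaries into a single level of dotted paths."""
    flat = {}
    for key, value in doc.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            # Recurse into inner objects, extending the dotted path.
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat


# Hypothetical hierarchical document.
order = {"id": 7, "customer": {"name": "Kim", "address": {"city": "Seoul"}}}
flat = flatten(order)
city = flat["customer.address.city"]
```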

8. Inverted Index

As a Lucene-based search engine, it supports inverted indexing
- Here, an inverted index is a special data structure similar to the index page at the back of a paper book
If you need to find all documents containing the term "search engine," normally you would have to read every document from beginning to end to get the desired results
- However, with an inverted index structure, finding the term reveals the location of documents containing that term, enabling fast retrieval
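
The contrast between scanning every document and consulting an inverted index can be sketched as follows. The documents are hypothetical, tokenization is a plain whitespace split, and the lookup finds documents containing all the given terms without checking that they are adjacent, so it is a simplification of real phrase search.

```python
from collections import defaultdict

documents = {
    1: "the search engine indexes documents",
    2: "a relational database stores rows",
    3: "an inverted index powers the search engine",
}


def build_inverted_index(docs: dict[int, str]) -> dict[str, set[int]]:
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.split():
            index[term].add(doc_id)
    return index


def scan(docs: dict[int, str], terms: list[str]) -> list[int]:
    """Baseline: read every document from beginning to end."""
    return sorted(
        doc_id
        for doc_id, text in docs.items()
        if all(term in text.split() for term in terms)
    )


def lookup(index: dict[str, set[int]], terms: list[str]) -> list[int]:
    """Intersect the document-id sets of the terms; no document is re-read."""
    postings = [index.get(term, set()) for term in terms]
    return sorted(set.intersection(*postings)) if postings else []


index = build_inverted_index(documents)
hits = lookup(index, ["search", "engine"])
```

Both approaches return the same documents, but the lookup touches only the entries for the query terms instead of the full text of every document.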

9. Scalability and Availability

When Elasticsearch is scaled out in a distributed configuration, it can process large volumes of documents more efficiently
In a distributed environment, data is divided into small units called shards, and the number of shards can be set each time an index is created
- This allows distributing data based on its type and characteristics for fast processing
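
A common way to spread documents across shards is to hash a routing key and take the remainder by the number of primary shards. The sketch below uses `zlib.crc32` purely as a stand-in hash and hypothetical document ids; Elasticsearch's actual hash function and routing details differ.

```python
import zlib


def shard_for(routing_key: str, num_primary_shards: int) -> int:
    """Pick a shard as hash(routing_key) modulo the number of primary shards.

    crc32 is an assumed stand-in for the engine's real hash function.
    """
    return zlib.crc32(routing_key.encode("utf-8")) % num_primary_shards


# Hypothetical document ids distributed over 5 primary shards.
doc_ids = [f"doc-{i}" for i in range(1000)]
placement = {doc_id: shard_for(doc_id, 5) for doc_id in doc_ids}
```

Because the shard number depends on the shard count, changing the count would send the same key to a different shard, which is one reason the number of primary shards is chosen when the index is created.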

2-2. Weaknesses of Elasticsearch

It is not "real-time"
- Generally, indexed data becomes searchable only after about 1 second
- Indexed data goes through complex internal processes like commit and flush, so it is not real-time
  - Strictly speaking, it can be called Near Realtime
Does not provide transactions or rollback
- Elasticsearch is fundamentally configured as a distributed system
- Transactions and rollback are costly, so Elasticsearch omits them in favor of overall cluster performance; as a result, there is a risk of data loss in the worst case
Does not update documents in place
- Strictly speaking, when an update is requested, Elasticsearch deletes the existing document and creates a new document with the changed content
  - For this reason, an update costs relatively more than a simple in-place update
  - However, this is not a major disadvantage, because it gains the benefits of immutability
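
Two of the weaknesses above, near real-time visibility and updates that replace whole documents, can be modeled with a toy class. This is a conceptual sketch only: the class and method names are hypothetical, and the real engine works with segments, refresh intervals, and delete markers rather than Python dictionaries.

```python
class ToyIndex:
    """Toy model: writes land in a buffer and become searchable on refresh()."""

    def __init__(self):
        self._buffer = {}      # doc_id -> (version, text), not yet searchable
        self._searchable = {}  # doc_id -> (version, text), visible to search
        self._versions = {}    # doc_id -> latest version number

    def index(self, doc_id: str, text: str) -> None:
        """Index or update a document; an update is a whole new copy."""
        self._versions[doc_id] = self._versions.get(doc_id, 0) + 1
        self._buffer[doc_id] = (self._versions[doc_id], text)

    def refresh(self) -> None:
        """Make buffered writes searchable, replacing any older copy."""
        for doc_id, versioned in self._buffer.items():
            # The old (version, text) tuple is dropped, never edited in place.
            self._searchable[doc_id] = versioned
        self._buffer.clear()

    def search(self, term: str) -> list[str]:
        return sorted(
            doc_id
            for doc_id, (_, text) in self._searchable.items()
            if term in text.split()
        )

    def version(self, doc_id: str) -> int:
        return self._searchable[doc_id][0]


idx = ToyIndex()
idx.index("a", "hello world")
before_refresh = idx.search("hello")  # write is not visible yet
idx.refresh()
after_refresh = idx.search("hello")   # visible only after the refresh

idx.index("a", "goodbye world")       # "update": a new copy replaces the old
idx.refresh()
old_term_hits = idx.search("hello")
new_term_hits = idx.search("goodbye")
```

The gap between `index` and `refresh` corresponds to the roughly one-second delay described above, and the version counter shows that each update produces a new document rather than modifying the old one.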


