Search System Basics
Understanding Search Systems
Reference: [Book] Elasticsearch Practical Guide
1. Understanding Search Systems
1-1. What is a Search System?
: Search services are referred to by names such as search engine, search system, and search service
Search Engine
A program that collects information from the Web and provides search results
The implementation varies depending on the characteristics of the data provided as search results
Search System
A general term for systems built on search engines to provide reliable search results based on large-scale data
Collects vast amounts of data using a crawler, indexes it using multiple search engines, and provides search results through a UI
According to internal policies, documents with high relevance can be placed at the top of search results, and search accuracy can be improved by assigning weights to specific fields or documents
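The idea of weighting specific fields can be sketched as follows. This is a toy scorer for illustration only, not how any particular engine ranks results; the field names (`title`, `body`), the weights, and the sample documents are all assumptions.

```python
# Toy field-weighted scoring: a match in a heavily weighted field counts more.
# FIELD_WEIGHTS and the documents below are hypothetical examples.
FIELD_WEIGHTS = {"title": 3.0, "body": 1.0}


def score(doc: dict, term: str) -> float:
    """Sum weight * occurrences of `term` over the weighted fields."""
    total = 0.0
    for field, weight in FIELD_WEIGHTS.items():
        tokens = doc.get(field, "").lower().split()
        total += weight * tokens.count(term.lower())
    return total


docs = [
    {"id": 1, "title": "Cooking basics", "body": "search for recipes and search tips"},
    {"id": 2, "title": "Search systems", "body": "an overview of crawlers"},
]

# Document 2 mentions the term once, but in the title, so it outranks document 1.
ranked = sorted(docs, key=lambda d: score(d, "search"), reverse=True)
ranked_ids = [d["id"] for d in ranked]
print(ranked_ids)
```

Raising the weight of `title` relative to `body` is what moves a title match above a document that repeats the term in its body.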
Search Service
Provides search results as a service using a search system built on search engines
1-2. Components of a Search System
: A search system consists of a crawler that collects information, storage that stores collected data, an indexer that transforms collected data into a form suitable for searching, and a searcher that finds matching documents from indexed data.
Crawler
A program that collects information from websites, blogs, cafes, etc.
Also called a crawler, spider, worm, or web robot
Most information on the web, including files, databases, and web pages, is a collection target
For files, the crawler collects and stores information such as file name, file content, and file path; the search engine then searches the stored information and answers user queries
Storage
The physical storage where data is stored in the database
The search engine stores indexed data in storage
Indexer
To find information matching user queries in the information collected by the search engine, the collected data must be processed into a searchable structure and stored, and the indexer performs this role
The indexer uses various morphological analyzers to extract meaningful terms from information and stores data in an inverted index structure favorable for searching
Searcher
Receives user queries and finds matching documents from the inverted index structure stored by the indexer, returning them as results
Whether a query and document match is determined by similarity-based search ranking algorithms
Like the indexer, the searcher also uses morphological analyzers to extract meaningful terms from user queries for searching
Therefore, search quality varies depending on the morphological analyzer used!
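A minimal sketch of why the analyzer matters, assuming two made-up analyzers: one that only splits on whitespace and one that lowercases and strips punctuation. Neither is a real morphological analyzer; they stand in for one so the effect on search results is visible. The same analyzer is applied at index time and at query time.

```python
import re
from collections import defaultdict


def whitespace_analyzer(text):
    """Split on whitespace only; case and punctuation are kept."""
    return text.split()


def normalizing_analyzer(text):
    """Lowercase the text and keep only alphanumeric runs."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs, analyzer):
    """Indexer: map each extracted term to the set of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in analyzer(text):
            index[term].add(doc_id)
    return index


def search(index, query, analyzer):
    """Searcher: return the documents that contain every analyzed query term."""
    terms = analyzer(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Hypothetical sample documents.
DOCS = {
    1: "Elasticsearch is a Search Engine.",
    2: "A crawler collects web pages.",
}

naive_hits = search(build_index(DOCS, whitespace_analyzer), "search engine", whitespace_analyzer)
better_hits = search(build_index(DOCS, normalizing_analyzer), "search engine", normalizing_analyzer)
print(naive_hits, better_hits)
```

With the whitespace analyzer the indexed terms are `Search` and `Engine.`, so the lowercase query finds nothing; the normalizing analyzer finds document 1.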
1-3. Differences from Relational Databases
Search engines and relational databases (RDBMS) share many similarities in that they both find and provide data matching queries to users
However, there are many problems with providing search functionality using relational databases
Relational Database
A database is an integrated, managed collection of data
Broadly divided into relational and hierarchical databases according to how data is stored
All data is deduplicated and stored as structured data in tables consisting of rows and columns
Desired information can be retrieved with SQL statements, but only simple search through text matching is possible
That is, splitting text into individual words, or searching with synonyms or similar words, is not possible
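The limitation of plain text matching can be seen in a small sketch using Python's built-in `sqlite3` module. The table name, column names, and rows are hypothetical; the point is only that a `LIKE` pattern matches substrings, not words or meanings.

```python
import sqlite3

# In-memory database with a hypothetical `articles` table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE articles (id INTEGER PRIMARY KEY, body TEXT)")
conn.executemany(
    "INSERT INTO articles (id, body) VALUES (?, ?)",
    [
        (1, "How to buy a used car"),
        (2, "Automobile maintenance checklist"),
        (3, "A warm scarf for winter"),
    ],
)


def like_search(term):
    """Return ids of rows whose body contains `term` as a substring."""
    rows = conn.execute(
        "SELECT id FROM articles WHERE body LIKE ? ORDER BY id",
        (f"%{term}%",),
    )
    return [row[0] for row in rows]


car_hits = like_search("car")
print(car_hits)
```

Searching for `car` returns row 3 because "scarf" contains the substring, and misses row 2 because the synonym "Automobile" shares no characters with the pattern.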
Search Engine
Search engines can index and search unstructured data that databases cannot handle
Natural language processing is possible through morphological analysis, and fast search speed is guaranteed based on inverted index structures
Comparison of Key Concepts: Elasticsearch vs RDBMS
Elasticsearch | RDBMS
Index         | Database
Shard         | Partition
Type          | Table
Document      | Row
Field         | Column
Mapping       | Schema
Query DSL     | SQL
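To make the Mapping/Schema and Query DSL/SQL pairs above more concrete, here is a rough side-by-side sketch. The `books` table, its columns, and the search term are hypothetical, and the JSON bodies use the typeless mapping format of recent Elasticsearch versions; treat the pairing as an analogy rather than an exact equivalence.

```python
import json

# Hypothetical relational side: table name and columns are made up for illustration.
SCHEMA_SQL = "CREATE TABLE books (title TEXT, price INTEGER)"
SEARCH_SQL = "SELECT * FROM books WHERE title LIKE '%search%'"

# Mapping ~ schema: declares field names and their types.
create_index_body = {
    "mappings": {
        "properties": {
            "title": {"type": "text"},
            "price": {"type": "integer"},
        }
    }
}

# Query DSL ~ SQL: a `match` query analyzes the text and looks up the resulting terms.
search_body = {"query": {"match": {"title": "search"}}}

print(json.dumps(create_index_body, indent=2))
print(json.dumps(search_body, indent=2))
```

The two searches are not identical in behavior: `LIKE` matches substrings, while `match` works on analyzed terms.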
2. Search Systems and Elasticsearch
Nowadays, NoSQL (commonly read as "Not Only SQL") is widely used for fast searching of large-scale data
Elasticsearch can also be classified as a type of NoSQL, enabling near real-time fast searching through distributed processing
It can search large-scale unstructured data that is difficult to handle with traditional databases, and supports both full text search and structured search
Although it is fundamentally a search engine, Elasticsearch can also be utilized as large-scale storage like MongoDB or HBase
2-1. Why Elasticsearch is Powerful
1. Open Source Search Engine
Elasticsearch is an open source search engine built on the Apache Software Foundation's Lucene
Therefore, it is used by countless people worldwide, and bugs are mostly resolved quickly when they occur
2. Full Text Search
Most databases like PostgreSQL and MongoDB only provide basic text search functionality due to limitations in basic query and indexing structures
However, Elasticsearch enables more advanced full text search
Full text search means indexing entire contents to search for documents containing specific terms
A traditional RDBMS is not well suited to full text search, but Elasticsearch can search quickly by combining various feature-specific and language-specific plugins
3. Statistical Analysis
Unstructured log data can be collected and aggregated in one place for statistical analysis
By integrating Elasticsearch with Kibana, you can visualize and analyze logs accumulating in real-time
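As a rough sketch of what such an aggregation looks like, the block below shows a `terms` aggregation request body next to a local Python equivalent that counts values the same way. The `status` field, the aggregation name `by_status`, and the sample log records are assumptions; the local function only imitates the bucket shape of a response.

```python
from collections import Counter

# Hypothetical log records already collected in one place.
logs = [
    {"status": 200},
    {"status": 404},
    {"status": 200},
    {"status": 500},
    {"status": 200},
]

# Request body for a terms aggregation; "size": 0 asks for buckets only, no hits.
terms_agg_body = {
    "size": 0,
    "aggs": {"by_status": {"terms": {"field": "status"}}},
}


def local_terms_agg(records, field):
    """Count records per distinct value, most frequent first."""
    counts = Counter(record[field] for record in records)
    return [{"key": key, "doc_count": count} for key, count in counts.most_common()]


buckets = local_terms_agg(logs, "status")
print(buckets)
```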
4. Schemaless
Databases store and manage data by transforming it into a conforming format according to a schema structure
In contrast, Elasticsearch can automatically index and search various forms of non-standardized documents
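A simplified sketch of how field types might be inferred from a document that arrives without a predefined schema. The type names follow Elasticsearch's field type vocabulary, but the rules here are a deliberately reduced assumption: real dynamic mapping has more cases (dates, arrays, keyword sub-fields) that this sketch ignores.

```python
def infer_field_type(value):
    """Guess a field type from a JSON value (simplified rules)."""
    # bool must be checked before int because bool is a subclass of int in Python.
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "float"
    if isinstance(value, str):
        return "text"
    if isinstance(value, dict):
        return "object"
    return "unknown"


def infer_mapping(doc):
    """Build a field -> type table for the top-level fields of one document."""
    return {field: infer_field_type(value) for field, value in doc.items()}


# Hypothetical document with no schema declared in advance.
doc = {
    "title": "Search Basics",
    "price": 10,
    "rating": 4.5,
    "in_stock": True,
    "author": {"name": "Kim"},
}
mapping = infer_mapping(doc)
print(mapping)
```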
5. RESTful API
Elasticsearch supports an HTTP-based RESTful API and uses JSON format for both requests and responses, making it available on different platforms regardless of development language, OS, or system
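Because the interface is plain HTTP and JSON, a request can be built with nothing but the Python standard library. The sketch below only constructs the request and does not send it; the base URL `http://localhost:9200`, the index name `books`, and the query are assumptions for illustration.

```python
import json
import urllib.request


def build_search_request(base_url, index, body):
    """Build (but do not send) a JSON search request for one index."""
    data = json.dumps(body).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/{index}/_search",
        data=data,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical node address, index name, and query.
req = build_search_request(
    "http://localhost:9200",
    "books",
    {"query": {"match": {"title": "search"}}},
)
print(req.get_method(), req.full_url)
```

Sending it would be a matter of passing `req` to `urllib.request.urlopen` against a running node and parsing the JSON response.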
6. Multi-tenancy
Even if indexes are different, as long as the field names to search are the same, multiple indexes can be queried at once
This can be used to provide multi-tenancy functionality
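A small sketch of querying several indexes at once: index names are joined with commas in the search path, and one query body is applied to all of them. The per-tenant index names and the shared `message` field are hypothetical.

```python
def multi_index_search_path(indexes):
    """Build a search path that targets several indexes in one request."""
    if not indexes:
        raise ValueError("at least one index name is required")
    return "/" + ",".join(indexes) + "/_search"


# Hypothetical per-tenant indexes that share the field name `message`.
path = multi_index_search_path(["tenant-a-logs", "tenant-b-logs"])
body = {"query": {"match": {"message": "timeout"}}}
print(path)
```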
7. Document-Oriented
Multiple layers of data can be stored in an index as structured documents in JSON format
Hierarchical documents can also be easily queried with a single query
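One way to picture how a hierarchical document can be reached with a single query is to flatten nested objects into dotted field paths such as `author.name`. The sketch below is an illustration of that idea with a hypothetical document, not a description of the engine's internals.

```python
def flatten(doc, prefix=""):
    """Flatten nested dicts into a single level keyed by dotted paths."""
    flat = {}
    for key, value in doc.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat


# Hypothetical multi-layer document.
doc = {
    "title": "Search Basics",
    "author": {"name": "Kim", "address": {"city": "Seoul"}},
}
flat = flatten(doc)
print(flat)

# A query can then address an inner field by its dotted path.
query = {"query": {"match": {"author.address.city": "Seoul"}}}
```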
8. Inverted Index
As a Lucene-based search engine, it supports inverted indexing
Here, an inverted index is a special data structure similar to the index page at the back of a paper book
If you need to find all documents containing the term "search engine," normally you would have to read every document from beginning to end to get the desired results
However, with an inverted index structure, finding the term reveals the location of documents containing that term, enabling fast retrieval
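The contrast between reading every document and looking a term up in an inverted index can be sketched in a few lines. The sample documents are hypothetical and tokenization is reduced to whitespace splitting.

```python
from collections import defaultdict

# Hypothetical documents keyed by document id.
DOCS = {
    1: "a search engine collects and indexes documents",
    2: "the crawler collects web pages",
    3: "an inverted index makes search fast",
}


def full_scan(term):
    """Without an index: read every document from beginning to end."""
    return [doc_id for doc_id, text in DOCS.items() if term in text.split()]


def build_inverted_index(docs):
    """Map each term to the sorted list of document ids that contain it."""
    index = defaultdict(list)
    for doc_id in sorted(docs):
        for term in set(docs[doc_id].split()):
            index[term].append(doc_id)
    return dict(index)


INDEX = build_inverted_index(DOCS)

# With the index: one dictionary lookup yields the matching documents.
print(INDEX["search"])
print(INDEX.get("missing", []))
```

Both approaches return the same documents; the difference is that the lookup does not grow with the total amount of text once the index has been built.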
9. Scalability and Availability
If Elasticsearch is distributed and scaled, it can process large volumes of documents more efficiently
In a distributed environment, data is divided into small units called shards, and the number of shards can be adjusted each time an index is created
This allows distributing data based on its type and characteristics for fast processing
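A sketch of how documents might be spread across shards: hash a routing key (here the document id) and take it modulo the shard count chosen at index creation. CRC32 is used as a stand-in hash and the modulo rule is a simplification, so this shows the principle rather than the engine's exact routing; the document ids are hypothetical.

```python
import zlib
from collections import Counter

# Index creation body: the shard count is chosen when the index is created.
create_index_body = {"settings": {"number_of_shards": 3, "number_of_replicas": 1}}


def pick_shard(routing_key: str, number_of_shards: int) -> int:
    """Map a routing key to a shard number (CRC32 is a stand-in hash)."""
    return zlib.crc32(routing_key.encode("utf-8")) % number_of_shards


shard_count = create_index_body["settings"]["number_of_shards"]
doc_ids = [f"doc-{i}" for i in range(100)]  # hypothetical document ids

# Count how many documents land on each shard.
placement = Counter(pick_shard(doc_id, shard_count) for doc_id in doc_ids)
print(dict(placement))
```

Because the shard is derived from the key and the shard count, the same document always maps to the same shard as long as the count does not change.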
2-2. Weaknesses of Elasticsearch
It is not "real-time"
Generally, indexed data becomes searchable only after about 1 second (the default refresh interval)
Indexed data goes through complex internal processes like commit and flush, so it is not real-time
Strictly speaking, it can be called Near Realtime
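The near-real-time behavior can be pictured with a toy index in which newly indexed documents sit in a buffer and only become visible to search after a refresh step. The class and its method names are hypothetical, and the sketch leaves out commit, flush, and durability entirely.

```python
class ToyIndex:
    """Toy model: documents are searchable only after refresh() runs."""

    def __init__(self):
        self._buffer = []      # indexed but not yet searchable
        self._searchable = []  # visible to search

    def index(self, doc):
        """Accept a document; it is not visible to search yet."""
        self._buffer.append(doc)

    def refresh(self):
        """Make buffered documents searchable (run periodically in a real system)."""
        self._searchable.extend(self._buffer)
        self._buffer.clear()

    def search(self, term):
        """Return searchable documents containing the term."""
        return [doc for doc in self._searchable if term in doc.split()]


idx = ToyIndex()
idx.index("hello search world")
before = idx.search("search")  # nothing yet: the document is still buffered
idx.refresh()
after = idx.search("search")   # visible after the refresh
print(before, after)
```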
Does not provide transaction and rollback functionality
Elasticsearch is fundamentally configured as a distributed system
To improve overall cluster performance, it does not support costly features such as rollback and transactions, so there is a risk of data loss in the worst case
Does not provide in-place data updates
Strictly speaking, when an update command is requested, Elasticsearch deletes the existing document and creates a new document with the changed content
For this reason, an update costs relatively more than a simple in-place update
However, this is not a major disadvantage because it can leverage the benefit of immutability!
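The delete-then-create behavior described above can be sketched with an append-only store: an update marks the old entry as deleted and appends a new entry with the merged content, leaving the old content untouched. The class, its fields, and the version counter are hypothetical and only model the idea.

```python
class AppendOnlyStore:
    """Toy store where existing entries are never rewritten, only marked deleted."""

    def __init__(self):
        self.entries = []  # each entry: {"id", "version", "source", "deleted"}

    def _live(self, doc_id):
        """Return the current non-deleted entry for doc_id, or None."""
        for entry in self.entries:
            if entry["id"] == doc_id and not entry["deleted"]:
                return entry
        return None

    def index(self, doc_id, source):
        """Store a document; any previous version is marked deleted."""
        old = self._live(doc_id)
        version = 1
        if old is not None:
            old["deleted"] = True
            version = old["version"] + 1
        self.entries.append(
            {"id": doc_id, "version": version, "source": dict(source), "deleted": False}
        )
        return version

    def update(self, doc_id, changes):
        """Update = read old content, merge changes, then index a new document."""
        old = self._live(doc_id)
        if old is None:
            raise KeyError(doc_id)
        return self.index(doc_id, {**old["source"], **changes})

    def get(self, doc_id):
        entry = self._live(doc_id)
        return None if entry is None else entry["source"]


store = AppendOnlyStore()
v1 = store.index("1", {"title": "Search Basics", "views": 1})
v2 = store.update("1", {"views": 2})
print(v1, v2, store.get("1"), len(store.entries))
```

The store ends up holding two entries for one logical document, which is where the extra cost of an update comes from.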