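Elasticsearch for Ruby on Rails: A Tutorial on the Chewy Gem

Published: 2022-03-11

Elasticsearch provides a powerful RESTful HTTP interface for indexing and querying data, built on top of the Apache Lucene library. Out of the box, it offers scalable, efficient, and robust search with UTF-8 support. It's a powerful tool for indexing and querying massive amounts of structured data and, here at Toptal, it powers our platform search and will soon be used for autocompletion as well. We're huge fans.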

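Since our platform is built with Ruby on Rails, our Elasticsearch integration takes advantage of the elasticsearch-ruby project (a Ruby integration framework for Elasticsearch that provides a client for connecting to an Elasticsearch cluster, a Ruby API for Elasticsearch's REST API, and various extensions and utilities). Building on this foundation, we have developed and released our own improvement (and simplification) of the Elasticsearch application search architecture, packaged as a Ruby gem that we named Chewy (an example app is also available).

Chewy extends the Elasticsearch-Ruby client, making it more powerful and providing tighter integration with Rails. In this Elasticsearch guide, we discuss (through usage examples) how we accomplished this, including the technical obstacles that emerged during implementation.

The relationship between Elasticsearch and Ruby on Rails is depicted in this visual guide.

A couple of quick notes before proceeding with the guide:

Both Chewy and a Chewy demo application are available on GitHub.
For those interested in more "under the hood" information about Elasticsearch, we've included a brief write-up as an appendix to this post.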

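Why Chewy?

Despite Elasticsearch's scalability and efficiency, integrating it with Rails didn't turn out to be quite as simple as anticipated. At Toptal, we found ourselves needing to significantly augment the basic Elasticsearch-Ruby client to make it more performant and to support additional operations.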

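And thus, the Chewy gem was born.

A few particularly noteworthy features of Chewy include:

Every index is observable by all the related models.
Most indexed models are related to each other. And sometimes, it's necessary to denormalize this related data and bind it to the same object (e.g., if you want to index an array of tags together with their associated article). Chewy allows you to specify an updatable index for every model, so the corresponding articles are reindexed whenever a relevant tag is updated.
Index classes are independent from ORM/ODM models.
With this enhancement, implementing cross-model autocompletion, for example, is much easier. You can just define an index and work with it in an object-oriented fashion. Unlike other clients, the Chewy gem removes the need to manually implement index classes, data import callbacks, and other components.
Bulk import is everywhere.
Chewy utilizes the bulk Elasticsearch API for full reindexing and index updates. It also utilizes the concept of atomic updates, collecting changed objects within an atomic block and updating them all at once.
Chewy provides an AR-style query DSL.
By being chainable, mergeable, and lazy, this improvement allows queries to be produced in a more efficient manner.

OK, so let's see how this all plays out in the gem...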

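The Basic Elasticsearch Guide

Elasticsearch has several document-related concepts. The first is that of an index (the analogue of a database in an RDBMS), which consists of a set of documents, which can be of several types (where a type is a kind of RDBMS table).

Every document has a set of fields. Each field is analyzed independently, and its analysis options are stored in the mapping for its type. Chewy utilizes this structure "as is" in its object model: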

 class EntertainmentIndex < Chewy::Index
   settings analysis: {
     analyzer: {
       title: {
         tokenizer: 'standard',
         filter: ['lowercase', 'asciifolding']
       }
     }
   }

   define_type Book.includes(:author, :tags) do
     field :title, analyzer: 'title'
     field :year, type: 'integer'
     field :author, value: ->{ author.name }
     field :author_id, type: 'integer'
     field :description
     field :tags, index: 'not_analyzed', value: ->{ tags.map(&:name) }
   end

   {movie: Video.movies, cartoon: Video.cartoons}.each do |type_name, scope|
     define_type scope.includes(:director, :tags), name: type_name do
       field :title, analyzer: 'title'
       field :year, type: 'integer'
       field :author, value: ->{ director.name }
       field :author_id, type: 'integer', value: ->{ director_id }
       field :description
       field :tags, index: 'not_analyzed', value: ->{ tags.map(&:name) }
     end
   end
 end

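Above, we defined an Elasticsearch index called entertainment with three types: book, movie, and cartoon. For each type, we defined some field mappings, along with a hash of settings for the whole index.

So, we've defined the EntertainmentIndex and we want to execute some queries. As a first step, we need to create the index and import our data: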

 EntertainmentIndex.create!
 EntertainmentIndex.import
 # EntertainmentIndex.reset! (which includes deletion,
 # creation, and import) could be used instead

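The .import method knows which data to import because we passed in scopes when we defined our types; thus, it imports all the books, movies, and cartoons stored in persistent storage.

With that done, we can perform some queries: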

 EntertainmentIndex.query(match: {author: 'Tarantino'}).filter{ year > 1990 }
 EntertainmentIndex.query(match: {title: 'Shawshank'}).types(:movie)
 EntertainmentIndex.query(match: {author: 'Tarantino'}).only(:id).limit(10).load
 # the last one loads ActiveRecord objects for documents found

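Now our index is almost ready to be used in our search implementation.

Rails Integration

For integration with Rails, the first thing we need is to be able to react to RDBMS object changes. Chewy supports this behavior via callbacks defined with the update_index class method. update_index takes two arguments:

A type identifier supplied in the "index_name#type_name" format
A method name or block to execute, which represents a back-reference to the updated object or object collection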
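
To make the second argument more concrete, here is a minimal plain-Ruby sketch of how a back-reference could be resolved into the list of objects to reindex. It is not Chewy's implementation: `resolve_backreference`, `DemoVideo`, and `DemoAuthor` are hypothetical names invented for this illustration, and we assume that `:self` means the changed record itself and that a `nil` result means there is nothing to reindex.

```ruby
# Hypothetical helper (not part of Chewy): turns an update_index-style
# back-reference, given as a method name or a block, into objects to reindex.
def resolve_backreference(record, method_name = nil, &block)
  result =
    if block
      record.instance_exec(&block)     # the block runs with the record as self
    elsif method_name.to_sym == :self
      record                           # :self is assumed to mean "this record"
    else
      record.public_send(method_name)  # e.g. :books called on an author
    end
  [result].flatten.compact             # a nil result means nothing to reindex
end

# Plain Ruby stand-ins for the models used in this guide.
DemoVideo = Struct.new(:title, :category) do
  def movie?
    category == :movie
  end
end
DemoAuthor = Struct.new(:name, :books)

movie   = DemoVideo.new('Pulp Fiction', :movie)
cartoon = DemoVideo.new('Shrek', :cartoon)
author  = DemoAuthor.new('Stephen King', ['It', 'Carrie'])

p resolve_backreference(movie, :self)               # one element: the movie itself
p resolve_backreference(cartoon) { self if movie? } # empty: a cartoon is not a movie
p resolve_backreference(author, :books)             # the author's books
```

A block that returns nil, such as `self if movie?` evaluated on a cartoon, yields an empty list, which is the idea that lets a conditional callback skip the types that do not apply.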

We need to define these callbacks for each dependent model:

 class Book < ActiveRecord::Base
   acts_as_taggable

   belongs_to :author, class_name: 'Dude'

   # We update the book itself on-change
   update_index 'entertainment#book', :self
 end

 class Video < ActiveRecord::Base
   acts_as_taggable

   belongs_to :director, class_name: 'Dude'

   # Update video types when changed, depending on the category
   update_index('entertainment#movie') { self if movie? }
   update_index('entertainment#cartoon') { self if cartoon? }
 end

 class Dude < ActiveRecord::Base
   acts_as_taggable

   has_many :books
   has_many :videos

   # If author or director was changed, all the corresponding
   # books, movies and cartoons are updated
   update_index 'entertainment#book', :books
   update_index('entertainment#movie') { videos.movies }
   update_index('entertainment#cartoon') { videos.cartoons }
 end

Since tags are also indexed, we next need to monkey-patch some external models so that they react to changes:

 ActsAsTaggableOn::Tag.class_eval do
   has_many :books, through: :taggings, source: :taggable, source_type: 'Book'
   has_many :videos, through: :taggings, source: :taggable, source_type: 'Video'

   # Updating all tag-related objects
   update_index 'entertainment#book', :books
   update_index('entertainment#movie') { videos.movies }
   update_index('entertainment#cartoon') { videos.cartoons }
 end

 ActsAsTaggableOn::Tagging.class_eval do
   # Same goes for the intermediate model
   update_index('entertainment#book') { taggable if taggable_type == 'Book' }
   update_index('entertainment#movie') { taggable if taggable_type == 'Video' && taggable.movie? }
   update_index('entertainment#cartoon') { taggable if taggable_type == 'Video' && taggable.cartoon? }
 end

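At this point, every object save or destroy will update the corresponding Elasticsearch index type.

Atomicity

We still have one lingering problem. If we do something like books.map(&:save) to save multiple books, we request an update of the entertainment index every time an individual book is saved. Thus, if we save five books, we update the Chewy index five times. This behavior is acceptable in a REPL, but certainly not in controller actions, where performance is critical.

We address this issue with the Chewy.atomic block: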

 class ApplicationController < ActionController::Base
   around_action { |&block| Chewy.atomic(&block) }
 end

In short, Chewy.atomic batches these updates as follows:

1. Disables the after_save callback.
2. Collects the IDs of saved books.
3. On completion of the Chewy.atomic block, uses the collected IDs to make a single Elasticsearch index update request.
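
The batching idea can be sketched in a few lines of plain Ruby. This is only an illustration of the collect-then-flush pattern described above, not Chewy's actual code: the `BatchingIndexer` class and its `record_saved` and `atomic` methods are hypothetical, and each entry in `requests` stands in for one index update request.

```ruby
# Hypothetical stand-in for the batching idea behind Chewy.atomic.
# It is NOT Chewy's implementation; all names are illustrative only.
class BatchingIndexer
  attr_reader :requests

  def initialize
    @requests = []   # each entry represents one index update request
    @pending  = nil  # non-nil only while inside an atomic block
  end

  # Called whenever a record is saved.
  def record_saved(id)
    if @pending
      @pending << id      # inside atomic: only remember the id
    else
      @requests << [id]   # outside atomic: one request per save
    end
  end

  # Collects ids saved inside the block and sends them as one request.
  def atomic
    return yield if @pending  # nested blocks join the outer batch
    @pending = []
    begin
      result = yield
      @requests << @pending.uniq unless @pending.empty?
      result
    ensure
      @pending = nil          # always leave atomic mode
    end
  end
end

unbatched = BatchingIndexer.new
[1, 2, 3, 4, 5].each { |id| unbatched.record_saved(id) }
puts unbatched.requests.size  # prints 5: one request per saved book

batched = BatchingIndexer.new
batched.atomic { [1, 2, 3, 4, 5].each { |id| batched.record_saved(id) } }
puts batched.requests.size    # prints 1: a single request with all five ids
```

In this sketch nothing is flushed if the block raises; whether a real implementation should flush in that case is a design decision we leave open here.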

Searching

Now we're ready to implement a search interface. Since our user interface is a form, the best way to build it is, of course, with FormBuilder and ActiveModel. (At Toptal, we use ActiveData to implement ActiveModel interfaces, but feel free to use your favorite gem.)

 class EntertainmentSearch
   include ActiveData::Model

   attribute :query, type: String
   attribute :author_id, type: Integer
   attribute :min_year, type: Integer
   attribute :max_year, type: Integer
   attribute :tags, mode: :arrayed, type: String,
                    normalize: ->(value) { value.reject(&:blank?) }

   # This accessor is for the form. It will have a single text field
   # for comma-separated tag inputs.
   def tag_list= value
     self.tags = value.split(',').map(&:strip)
   end

   def tag_list
     self.tags.join(', ')
   end
 end

Query and Filters Tutorial

Now that we have an ActiveModel-like object that can accept and typecast attributes, let's implement search:

 class EntertainmentSearch
   ...

   def index
     EntertainmentIndex
   end

   def search
     # We can merge multiple scopes
     [query_string, author_id_filter,
      year_filter, tags_filter].compact.reduce(:merge)
   end

   # Using query_string advanced query for the main query input
   def query_string
     index.query(query_string: {fields: [:title, :author, :description],
                                query: query, default_operator: 'and'}) if query?
   end

   # Simple term filter for author id. `:author_id` is already
   # typecasted to integer and ignored if empty.
   def author_id_filter
     index.filter(term: {author_id: author_id}) if author_id?
   end

   # For filtering on years, we will use range filter.
   # Returns nil if both min_year and max_year are not passed to the model.
   def year_filter
     body = {}.tap do |body|
       body.merge!(gte: min_year) if min_year?
       body.merge!(lte: max_year) if max_year?
     end
     index.filter(range: {year: body}) if body.present?
   end

   # Same goes for `author_id_filter`, but `terms` filter used.
   # Returns nil if no tags passed in.
   def tags_filter
     index.filter(terms: {tags: tags}) if tags?
   end
 end
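
The `compact.reduce(:merge)` idiom in `search` is worth a closer look, since it is what lets each filter method simply return nil when its input is missing. Below is a small self-contained sketch of the pattern. `FakeScope` and `build_search` are hypothetical stand-ins written for this example; a real Chewy scope is a much richer object, and we only assume here that it responds to `merge`.

```ruby
# Hypothetical, minimal scope object used only to illustrate the pattern.
class FakeScope
  attr_reader :criteria

  def initialize(criteria = [])
    @criteria = criteria
  end

  # Merging returns a new scope that carries both sets of criteria.
  def merge(other)
    FakeScope.new(criteria + other.criteria)
  end
end

def build_search(query: nil, author_id: nil, tags: [])
  parts = [
    (FakeScope.new([{ query_string: query }]) if query),
    (FakeScope.new([{ term: { author_id: author_id } }]) if author_id),
    (FakeScope.new([{ terms: { tags: tags } }]) unless tags.empty?)
  ]
  # compact drops the nils left by unused inputs; reduce(:merge) folds the rest.
  parts.compact.reduce(:merge)
end

scope = build_search(query: 'Tarantino', tags: ['crime'])
p scope.criteria.size  # prints 2: the author filter was skipped
p build_search         # prints nil: reduce on an empty array returns nil
```

Note the last line: when no part applies at all, `reduce` returns nil, so a caller that chains further methods onto the result may want a fallback scope for the empty case.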

Controllers and Views

At this point, our model can perform search requests with the attributes passed to it. Usage will look something like:

 EntertainmentSearch.new(query: 'Tarantino', min_year: 1990).search

In the controller, we want to load the actual ActiveRecord objects instead of Chewy document wrappers:

 class EntertainmentController < ApplicationController
   def index
     @search = EntertainmentSearch.new(params[:search])
     # In case we want to load real objects, we don't need any other
     # fields except for `:id` retrieved from Elasticsearch index.
     # Chewy query DSL supports Kaminari gem and corresponding API.
     # Also, we pass scopes for every requested type to the `load` method.
     @entertainments = @search.search.only(:id).page(params[:page]).load(
       book: {scope: Book.includes(:author)},
       movie: {scope: Video.includes(:director)},
       cartoon: {scope: Video.includes(:director)}
     )
   end
 end

Now, it's time to write some HAML in entertainment/index.html.haml:

 = form_for @search, as: :search, url: entertainment_index_path, method: :get do |f|
   = f.text_field :query
   = f.select :author_id, Dude.all.map { |d| [d.name, d.id] }, include_blank: true
   = f.text_field :min_year
   = f.text_field :max_year
   = f.text_field :tag_list
   = f.submit

 - if @entertainments.any?
   %dl
     - @entertainments.each do |entertainment|
       %dt
         %h1= entertainment.title
         %strong= entertainment.class
       %dd
         %p= entertainment.year
         %p= entertainment.description
         %p= entertainment.tag_list
   = paginate @entertainments
 - else
   Nothing to see here

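Sorting

As a bonus, we'll also add sorting to our search functionality.

Assume that we need to sort on the title and year fields, as well as by relevance. Unfortunately, a title such as One Flew Over the Cuckoo's Nest is split into individual terms, so sorting by these disparate terms would be too random; instead, we'd like to sort by the entire title.

The solution is to use a special title field and apply its own analyzer: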

 class EntertainmentIndex < Chewy::Index
   settings analysis: {
     analyzer: {
       ...
       sorted: {
         # `keyword` tokenizer will not split our titles and
         # will produce the whole phrase as the term, which
         # can be sorted easily
         tokenizer: 'keyword',
         filter: ['lowercase', 'asciifolding']
       }
     }
   }

   define_type Book.includes(:author, :tags) do
     # We use the `multi_field` type to add `title.sorted` field
     # to the type mapping. Also, will still use just the `title`
     # field for search.
     field :title, type: 'multi_field' do
       field :title, index: 'analyzed', analyzer: 'title'
       field :sorted, index: 'analyzed', analyzer: 'sorted'
     end
     ...
   end

   {movie: Video.movies, cartoon: Video.cartoons}.each do |type_name, scope|
     define_type scope.includes(:director, :tags), name: type_name do
       # For videos as well
       field :title, type: 'multi_field' do
         field :title, index: 'analyzed', analyzer: 'title'
         field :sorted, index: 'analyzed', analyzer: 'sorted'
       end
       ...
     end
   end
 end
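
To see why the whole-title term matters, here is a rough plain-Ruby approximation of the two analyzers. The helper names `standard_like_terms` and `keyword_like_term` are ours, the tokenization is deliberately simplified (no asciifolding), and picking the smallest term as the sort key for the split field is an assumption made only to show how term-based ordering can diverge from title ordering.

```ruby
# Rough approximations of the two analyzers (assumptions, not Elasticsearch code):
# a standard-like tokenizer splits the title into words, while a keyword-like
# tokenizer keeps the whole string as a single term. Both lowercase their output.
def standard_like_terms(title)
  title.downcase.scan(/[[:alnum:]']+/)
end

def keyword_like_term(title)
  title.downcase
end

titles = [
  "One Flew Over the Cuckoo's Nest",
  'Kill Bill',
  'Jackie Brown'
]

# Sorting on split terms depends on which term gets picked (here: the smallest).
by_smallest_term = titles.sort_by { |title| standard_like_terms(title).min }

# Sorting on the single keyword term orders by the full title.
by_whole_title = titles.sort_by { |title| keyword_like_term(title) }

puts by_smallest_term.first  # prints "Kill Bill", because "bill" sorts before "brown"
puts by_whole_title.first    # prints "Jackie Brown", because "jackie..." sorts before "kill..."
```

With the split field, Kill Bill lands first only because of the word "bill", which is the kind of surprising ordering the dedicated `title.sorted` field avoids.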

In addition, we're going to add both these new attributes and the sort processing step to our search model:

 class EntertainmentSearch
   # we are going to use `title.sorted` field for sort
   SORT = {title: {'title.sorted' => :asc}, year: {year: :desc}, relevance: :_score}
   ...
   attribute :sort, type: String, enum: %w(title year relevance),
                    default_blank: 'relevance'
   ...
   def search
     # we have added `sorting` scope to merge list
     [query_string, author_id_filter, year_filter,
      tags_filter, sorting].compact.reduce(:merge)
   end

   def sorting
     # We have one of the 3 possible values in `sort` attribute
     # and `SORT` mapping returns actual sorting expression
     index.order(SORT[sort.to_sym])
   end
 end

Finally, we'll modify our form by adding a select box for the sort options:

 = form_for @search, as: :search, url: entertainment_index_path, method: :get do |f|
   ...
   / `EntertainmentSearch.sort_values` will just return
   / enum option content from the sort attribute definition.
   = f.select :sort, EntertainmentSearch.sort_values
   ...

Error Handling

If a user submits a malformed query such as ( or AND, the Elasticsearch client raises an error. To handle that, let's make some changes to our controller:

 class EntertainmentController < ApplicationController
   def index
     @search = EntertainmentSearch.new(params[:search])
     @entertainments = @search.search.only(:id).page(params[:page]).load(
       book: {scope: Book.includes(:author)},
       movie: {scope: Video.includes(:director)},
       cartoon: {scope: Video.includes(:director)}
     )
   rescue Elasticsearch::Transport::Transport::Errors::BadRequest => e
     @entertainments = []
     @error = e.message.match(/QueryParsingException\[([^;]+)\]/).try(:[], 1)
   end
 end

Further, we need to render the error in the view:

 ...
 - if @entertainments.any?
   ...
 - else
   - if @error
     = @error
   - else
     Nothing to see here

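Testing Elasticsearch Queries

The basic testing setup is as follows:

1. Start the Elasticsearch server.
2. Clean up and create our indices.
3. Import our data.
4. Perform our query.
5. Cross-reference the result against our expectations.

For step 1, it's convenient to use the test cluster defined in the elasticsearch-extensions gem. After installing the gem, just add the following line to your project's Rakefile: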

 require 'elasticsearch/extensions/test/cluster/tasks'

Then, you'll get the following Rake tasks:

 $ rake -T elasticsearch
 rake elasticsearch:start  # Start Elasticsearch cluster for tests
 rake elasticsearch:stop   # Stop Elasticsearch cluster for tests

Elasticsearch and RSpec

First, we need to make sure that our index is updated so that it stays in sync with our data changes. Luckily, the Chewy gem comes with the helpful update_index RSpec matcher:

 describe EntertainmentIndex do
   # No need to cleanup Elasticsearch as requests are
   # stubbed in case of `update_index` matcher usage.
   describe 'Tag' do
     # We create several books with the same tag
     let(:books) { create_list :book, 2, tag_list: 'tag1' }

     specify do
       # We expect that after modifying the tag name...
       expect do
         # `update_attributes` is an instance method, so we fetch the tag record first
         ActsAsTaggableOn::Tag.where(name: 'tag1').first.update_attributes(name: 'tag2')
       # ... the corresponding type will be updated with previously-created books.
       end.to update_index('entertainment#book').and_reindex(books, with: {tags: ['tag2']})
     end
   end
 end

Next, we need to test that the actual search queries are performed properly and that they return the expected results:

 describe EntertainmentSearch do
   # Just defining helpers for simplifying testing
   def search attributes = {}
     EntertainmentSearch.new(attributes).search
   end

   # Import helper as well
   def import *args
     # We are using `import!` here to be sure all the objects are imported
     # correctly before examples run.
     EntertainmentIndex.import! *args
   end

   # Deletes and recreates index before every example
   before { EntertainmentIndex.purge! }

   describe '#min_year, #max_year' do
     let(:book) { create(:book, year: 1925) }
     let(:movie) { create(:movie, year: 1970) }
     let(:cartoon) { create(:cartoon, year: 1995) }
     before { import book: book, movie: movie, cartoon: cartoon }

     # NOTE: The sample code below provides a clear usage example but is not
     # optimized code. Something along the following lines would perform better:
     # `specify { search(min_year: 1970).map(&:id).map(&:to_i)
     #                                  .should =~ [movie, cartoon].map(&:id) }`
     specify { search(min_year: 1970).load.should =~ [movie, cartoon] }
     specify { search(max_year: 1980).load.should =~ [book, movie] }
     specify { search(min_year: 1970, max_year: 1980).load.should == [movie] }
     specify { search(min_year: 1980, max_year: 1970).should == [] }
   end
 end

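Test Cluster Troubleshooting

Finally, here is a guide for troubleshooting your test cluster:

To start, use an in-memory, one-node cluster. It will be much faster for specs. In our case: TEST_CLUSTER_NODES=1 rake elasticsearch:start
There are some existing issues with the elasticsearch-extensions test cluster implementation itself related to one-node cluster status checking. The issue has been fixed in a fork, and we hope it will be fixed in the main repository soon.
For each dataset, group your requests in specs (i.e., import your data once and then perform several requests). Elasticsearch takes a long time to warm up and uses a lot of heap memory while importing data, so don't overdo it, especially if you have a lot of specs.
Make sure your machine has sufficient memory, or Elasticsearch will freeze (we required around 5GB for each testing virtual machine and around 1GB for Elasticsearch itself).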

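Wrap-up

Elasticsearch describes itself as "a flexible and powerful open source, distributed, real-time search and analytics engine." It is widely regarded as a standard in search technology.

With Chewy, our Rails developers have packaged these benefits as a simple, easy-to-use, production-quality, open source Ruby gem that provides tight integration with Rails. Elasticsearch and Rails: what an awesome combination!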

Appendix: Elasticsearch Internals

Here's a very brief introduction to Elasticsearch "under the hood"...

Elasticsearch is built on Lucene, which itself uses inverted indices as its primary data structure. For example, if we have the strings "the dogs jump high", "jump over the fence", and "the fence was too high", we get the following structure, in which each entry is a [document, position] pair:

 "the"   [0, 0], [1, 2], [2, 0]
 "dogs"  [0, 1]
 "jump"  [0, 2], [1, 0]
 "high"  [0, 3], [2, 4]
 "over"  [1, 1]
 "fence" [1, 3], [2, 1]
 "was"   [2, 2]
 "too"   [2, 3]

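Thus, every term contains both references to, and positions in, the text. Furthermore, we choose to modify our terms (e.g., by removing stop-words like "the") and apply phonetic hashing to every term (can you guess the algorithm?):

 "DAG"  [0, 1]
 "JANP" [0, 2], [1, 0]
 "HAG"  [0, 3], [2, 4]
 "OVAR" [1, 1]
 "FANC" [1, 3], [2, 1]
 "W"    [2, 2]
 "T"    [2, 3]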

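If we then query for "dog jumps", it's analyzed in the same way as the source text, becoming "DAG JANP" after hashing ("dog" has the same hash as "dogs", as do "jumps" and "jump").

We also add some logic between the individual words in the string (based on configuration settings), choosing between ("DAG" AND "JANP") or ("DAG" OR "JANP"). The former returns the intersection [0] & [0, 1] (i.e., document 0), and the latter the union [0] | [0, 1] (i.e., documents 0 and 1). The in-text positions can be used for scoring results and for position-dependent queries.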
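
For readers who like to see the data structure in code, here is a toy positional inverted index in plain Ruby. It is a sketch for illustration only and has nothing to do with Lucene's real implementation: `build_inverted_index` and `doc_ids` are names we made up, the input strings are "the dogs jump high", "jump over the fence", and "the fence was too high", and stop-word removal and phonetic hashing are left out, so the terms stay as plain lowercase words.

```ruby
# A toy positional inverted index: term => [[document_id, position], ...].
def build_inverted_index(documents)
  index = Hash.new { |hash, term| hash[term] = [] }
  documents.each_with_index do |text, doc_id|
    text.downcase.split.each_with_index do |term, position|
      index[term] << [doc_id, position]
    end
  end
  index
end

# Ids of the documents that contain the given term.
def doc_ids(index, term)
  index.fetch(term, []).map(&:first).uniq
end

documents = ['the dogs jump high', 'jump over the fence', 'the fence was too high']
index = build_inverted_index(documents)

p index['jump']                                     # prints [[0, 2], [1, 0]]
p doc_ids(index, 'dogs') & doc_ids(index, 'jump')   # AND query, prints [0]
p doc_ids(index, 'dogs') | doc_ids(index, 'jump')   # OR query, prints [0, 1]
```

The AND query is an array intersection and the OR query an array union, which mirrors the two document sets described above.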