Lossless ElasticSearch Data Migration

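This article takes an in-depth look at advanced Elasticsearch features, including mappings, aliases, the Reindex API, and data migration strategies. It explains how to optimize data indexing, manage schema changes, and migrate data seamlessly. It also covers the advantages of Elasticsearch in big-data environments.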

Academic data warehouse design recommends keeping everything in normalized form, with links between entities. Rolling changes forward under relational algebra then yields a reliable repository with transaction support: Atomicity, Consistency, Isolation, Durability, and that's all. In other words, the storage is built explicitly to update data safely. But it is not optimal for searching, especially for queries that sweep broadly across tables and fields. We need indices, a lot of indices! Volumes grow, and writes slow down. A SQL LIKE with a leading wildcard cannot use an ordinary index, and JOIN with GROUP BY sends us off to meditate over the query planner.

The increasing load on one machine forces it to scale, either vertically until it hits the ceiling or horizontally by purchasing more nodes. Resiliency requirements cause data to be spread across multiple nodes. And the requirement for immediate recovery after a failure, without a denial of service, forces us to set up a cluster in which any machine can handle both writes and reads at any time. That is, each node must already be a master or become one automatically and immediately.

The problem of quick search was solved by installing a second store optimized for indexing: full-text search, faceting, stemming ~~and blackjack~~. The second store takes records from the first as input, analyzes them, and builds an index. Thus the data storage cluster was supplemented with another cluster solely for search, with a similar master configuration to match the overall SLA. Everything is good, business is happy, admins sleep at night… until the master-master cluster grows beyond three machines.

Elastic

The NoSQL movement has significantly expanded the scaling horizon for both small and big data. NoSQL cluster nodes can distribute data among themselves so that the failure of one or more of them does not lead to a denial of service for the entire cluster. The price of high availability for distributed data is that complete consistency of writes cannot be guaranteed at every point in time. Instead, NoSQL promotes eventual consistency: once all the data has propagated across the cluster nodes, the nodes will eventually become consistent.

Thus the relational model was supplemented with a non-relational one, which gave rise to many database engines that solve the problems of the CAP triangle with varying degrees of success. Developers got modern tools to build their own perfect persistence layer, for every taste, budget, and load profile.

ElasticSearch is an open-source NoSQL cluster written in Java, with a RESTful JSON API on top of the Lucene engine. It can not only build a search index but also store the original document. This trick lets you rethink the role of a separate database management system for storing the originals, or even abandon it completely. End of the intro.

Mapping

A mapping in ElasticSearch is something like a schema (a table structure, in SQL terms): it specifies exactly how to index incoming documents (records, in SQL terms). A mapping can be static, dynamic, or absent. A static mapping does not allow the schema to change. A dynamic mapping allows new fields to be added. If no mapping is specified, ElasticSearch creates one automatically when it receives the first document for writing: it analyzes the structure of the fields, makes some assumptions about the types of data in them, applies the default settings, and writes the result down. At first glance, this schema-less behavior seems very convenient. In fact, it is better suited to experiments than to production, where it leads to surprises.
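
As a minimal sketch of declaring a mapping up front, the helper below builds a create-index request body. It assumes the typeless mapping layout used by recent ElasticSearch versions and reads the "static" mode described above as `"dynamic": "strict"`; the function name and the sample fields are hypothetical.

```python
import json


def build_index_body(properties, dynamic="strict"):
    """Build a create-index body with an explicit mapping.

    Assumption: "strict" stands in for the static mode described in the
    text, and True for the dynamic mode that accepts new fields.
    """
    if dynamic not in (True, False, "strict"):
        raise ValueError("dynamic must be True, False or 'strict'")
    return {"mappings": {"dynamic": dynamic, "properties": dict(properties)}}


# Hypothetical fields, for illustration only.
body = build_index_body({"title": {"type": "text"}, "price": {"type": "long"}})
print(json.dumps(body, indent=2))
```

The resulting body would be sent when the index is created, before the first document is written, so that no field type is guessed.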

So the data is indexed, and this is a one-directional process. Once created, a mapping cannot be changed dynamically the way ALTER TABLE changes a table in SQL. A SQL table stores the original document, to which you can attach a search index. ElasticSearch is the reverse: it is a search index to which you can attach the original document. That is why the index schema is static. In theory, you could either add a field to the mapping or delete one. In practice, ElasticSearch only allows you to add fields; an attempt to delete a field has no effect.

Alias

An alias is an optional extra name for an ElasticSearch index. A single index can have many aliases, and one alias can point to many indices, in which case the indices are logically combined and look like one index from the outside. An alias is very convenient for services that communicate with the index throughout its lifetime. For example, the alias products can hide products_v2 or products_v25 behind it, with no need to change names in the service. An alias is also handy for data migration: once the data has been transferred from the old schema to the new one, you need to switch the application to the new index. Switching an alias from one index to another is an atomic operation. It is performed in one step, without data loss.
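
The atomic switch can be sketched as a single request body for the `_aliases` endpoint, in which the `remove` and `add` actions are submitted together. The helper name is hypothetical; the index and alias names are the ones from the example above.

```python
import json


def build_alias_switch(alias, old_index, new_index):
    """Build the body for POST /_aliases that moves an alias in one step."""
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }


switch = build_alias_switch("products", "products_v2", "products_v25")
print(json.dumps(switch, indent=2))
```

Because both actions travel in one request, a service reading through `products` should never see a moment when the alias points nowhere.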

Reindex API

The data schema, that is, the mapping, tends to change from time to time. New fields are added, and unnecessary fields are deleted. If ElasticSearch plays the role of the single repository, you need a tool to change the mapping on the fly. For this there is a special command that transfers data from one index to another, the so-called _reindex API. It runs on the server side, works whether the recipient index has a prepared mapping or an empty one, and indexes quickly in batches of 1000 documents at a time.
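
A minimal sketch of preparing such a call with Python's standard library follows. The base URL and the helper name are assumptions; `source.size` is the reindex setting for the batch size, and the value 1000 repeats the figure quoted above. The request is only built here, not sent, so the sketch runs without a cluster.

```python
import json
import urllib.request


def build_reindex_request(base_url, source_index, dest_index, batch_size=1000):
    """Prepare (but do not send) a POST to the _reindex endpoint."""
    body = {
        "source": {"index": source_index, "size": batch_size},
        "dest": {"index": dest_index},
    }
    return urllib.request.Request(
        url=f"{base_url}/_reindex",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Assumed local node address, for illustration only.
request = build_reindex_request("http://localhost:9200", "test", "test_clone")
print(request.get_method(), request.full_url)
```

Passing `request` to `urllib.request.urlopen` would actually start the reindexing on a reachable node.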

Reindexing can do a simple type conversion of a field: for example, long to text and back to long, or boolean to text and back to boolean. But it cannot convert -9.99 to boolean, ~~this is not PHP~~. On the other hand, type conversion is an unsafe thing. A service written in a dynamically typed language may forgive such a sin, but if the reindex cannot convert the type, the whole document will not be saved. In general, data migration should take place in three stages: add a new field, release a service that uses it, then remove the old field.
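
Since one unconvertible value costs the whole document, it can help to scan values on the client before reindexing. The sketch below is a rough pre-check under simplified, assumed rules; it does not reproduce ElasticSearch's actual coercion logic, and the server has the final say.

```python
def can_convert(value, target_type):
    """Rough client-side pre-check; the rules here are assumptions."""
    if target_type == "text":
        return True  # assumed: any scalar can be rendered as a string
    if target_type == "boolean":
        return isinstance(value, bool) or value in ("true", "false")
    if target_type == "long":
        if isinstance(value, bool):
            return False  # bool is an int subclass in Python; exclude it
        if isinstance(value, int):
            return True
        if isinstance(value, str):
            try:
                int(value)
            except ValueError:
                return False
            return True
        return False
    raise ValueError(f"unsupported target type: {target_type}")


for sample, target in [("42", "long"), ("true", "boolean"), (-9.99, "boolean")]:
    print(repr(sample), "->", target, can_convert(sample, target))
```

Documents that fail the check can be fixed or set aside before the reindex starts, rather than discovered missing afterwards.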

A field is added like this. Take the mapping of the source index, insert the new property, and create an empty index from it. Then start the reindexing:

{
  "source": {
    "index": "test"
  },
  "dest": {
    "index": "test_clone"
  }
}
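
The mapping step that precedes this request might look like the sketch below. The mapping is a plain dictionary of the kind returned for the source index; the helper name, field names, and field types are hypothetical.

```python
import copy


def add_field(mapping, name, definition):
    """Return a copy of the mapping with one extra property."""
    new_mapping = copy.deepcopy(mapping)  # leave the source mapping untouched
    properties = new_mapping.setdefault("properties", {})
    if name in properties:
        raise ValueError(f"field already exists: {name}")
    properties[name] = definition
    return new_mapping


# Hypothetical source mapping and new field.
source_mapping = {"properties": {"field1": {"type": "text"}}}
dest_mapping = add_field(source_mapping, "field2", {"type": "long"})
create_body = {"mappings": dest_mapping}  # body for creating the empty index
print(create_body)
```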

A field is removed like this. Take the mapping of the source index, remove the field, and create an empty index from it. Then start the reindexing with the list of fields to be copied:

{
  "source": {
    "index": "test",
    "_source": ["field1", "field3"]
  },
  "dest": {
    "index": "test_clone"
  }
}
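
The `_source` list in this body can be derived from the trimmed mapping instead of being typed by hand. A minimal sketch, with hypothetical helper names and field types:

```python
import copy


def remove_field(mapping, name):
    """Return a copy of the mapping without the given property."""
    new_mapping = copy.deepcopy(mapping)
    new_mapping.get("properties", {}).pop(name, None)
    return new_mapping


def build_filtered_reindex(source_index, dest_index, dest_mapping):
    """Build a reindex body that copies only fields kept in the mapping."""
    fields = sorted(dest_mapping.get("properties", {}))
    return {
        "source": {"index": source_index, "_source": fields},
        "dest": {"index": dest_index},
    }


source_mapping = {
    "properties": {
        "field1": {"type": "text"},
        "field2": {"type": "long"},
        "field3": {"type": "boolean"},
    }
}
dest_mapping = remove_field(source_mapping, "field2")
reindex_body = build_filtered_reindex("test", "test_clone", dest_mapping)
print(reindex_body)
```

Dropping `field2` here yields the same field list as the request above, so the mapping and the copied fields cannot drift apart.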

For convenience, both cases were combined into the cloning function in Kaizen, a desktop client for ElasticSearch. Cloning can recognize the mapping of the recipient index. For example, a partial clone can be made from an index with three collections (types, in ElasticSearch terms): act, line, and scene. The clone contains only line, with two fields; static mapping is enabled, and the speech_number field is converted from text to long.

Migration

The reindex API has one unpleasant feature: it does not track changes in the source index. If something changes after reindexing starts, the change is not reflected in the recipient index. To solve this problem, the ElasticSearch FollowUp Plugin was developed, which adds logging commands. The plugin can follow an index and return the actions performed on its documents in chronological order, in JSON format. It logs the index, the type, the document ID, and the operation on it, INDEX or DELETE. The FollowUp Plugin is published on GitHub and compiled for almost all versions of ElasticSearch.

So, for lossless data migration, you need FollowUp installed on the node where the reindexing will be launched. It is assumed that an alias for the index already exists and that all applications work through it. The plugin must be turned on before reindexing starts. When reindexing is complete, the plugin is turned off and the alias is moved to the new index. Then the recorded actions are replayed on the recipient index so that it catches up with the source. Despite the high speed of reindexing, two types of collisions may occur during replay:

The new index no longer contains a document with that _id. This means that the document was deleted after the alias was switched to the new index.
The new index contains a document with the same _id but a higher version number than in the source index. This means that the document was updated after the alias was switched to the new index.

In these cases, the action should not be replayed on the recipient index. All remaining changes are replayed.
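
The two collision rules can be sketched as a filter over the recorded actions. The shape of a log entry here (`op`, `id`, `version`) is hypothetical and not the plugin's actual output format; `version` stands for the document's version in the source index, and `new_index_versions` maps each `_id` present in the new index to its current version.

```python
def should_replay(action, new_index_versions):
    """Apply the two collision rules to one logged action."""
    doc_id = action["id"]
    if doc_id not in new_index_versions:
        return False  # rule 1: document is gone from the new index
    if new_index_versions[doc_id] > action["version"]:
        return False  # rule 2: new index already holds a newer version
    return True


# Hypothetical log and state of the new index, for illustration only.
log = [
    {"op": "INDEX", "id": "1", "version": 2},
    {"op": "DELETE", "id": "2", "version": 3},
    {"op": "INDEX", "id": "3", "version": 1},
]
new_index_versions = {"1": 1, "3": 5}

to_replay = [a for a in log if should_replay(a, new_index_versions)]
print([a["id"] for a in to_replay])
```

The sketch applies the rules literally. A document created in the source after reindexing started would also be absent from the new index, so a real implementation would need some way to tell that case apart from a deletion.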

Happy coding!
