Elasticsearch data import and export methods

First published 2025-07-18 10:06:07; last modified 2025-07-18 16:20:38. License: CC 4.0 BY-SA. Tags: #elasticsearch #jenkins #bigdata

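ES import and export

Snapshot and restore

Snapshot and restore require the same major version on both clusters, for example ES 8.x to ES 8.x or ES 7.x to ES 7.x.
Plugins must not be missing: if the cluster that took the snapshot has analysis-ik installed, the cluster that restores it must have it too; otherwise nodes become unavailable after the restore.

Snapshot

In Elasticsearch (ES for short), snapshot and restore are the main features for backing up and recovering data. The steps below use the command line, mainly curl calls to the REST API.

Create a directory not owned by root

Elasticsearch must never run as root; it normally runs as the elasticsearch user. If that user cannot write to the repository directory, errors will be reported.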

mkdir -p /data/es_backup
chown -R elasticsearch:elasticsearch /data/es_backup
chmod 750 /data/es_backup
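
Before registering the repository it can help to confirm that the directory really is usable. The following is a minimal sketch, not part of Elasticsearch itself; check_repo_dir is a hypothetical helper name, and it only tests the user that runs it.

```shell
#!/bin/sh
# check_repo_dir DIR
# Succeeds only if DIR exists, is a directory, and is writable by the current user.
# (Hypothetical helper for illustration.)
check_repo_dir() {
  dir="$1"
  if [ ! -d "$dir" ]; then
    echo "missing directory: $dir" >&2
    return 1
  fi
  if [ ! -w "$dir" ]; then
    echo "not writable by $(id -un): $dir" >&2
    return 1
  fi
  echo "ok: $dir"
}

# Example, run as the service user (for instance with sudo -u elasticsearch):
# check_repo_dir /data/es_backup
```

Run it as the elasticsearch user; running it as root would pass even when the service user has no access.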

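Edit the configuration file

For security reasons, Elasticsearch requires the configuration file to list explicitly which paths may be used as snapshot repositories. The setting is path.repo. If it is not set, or the configured path does not match the path you actually use, registering the repository fails with an error like this: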

"reason":"[my_backup] location [/root/elasticsearch/my_backup] doesn't match any of the locations specified by path.repo because this setting is empty"

Open your Elasticsearch configuration file, elasticsearch.yml, and add the following setting:

path.repo: ["/data/es_backup"]

Or specify several paths:

path.repo: ["/data/es_backup", "/data/es_backup01"]
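
The matching rule behind that error can be sketched in a few lines: a repository location is accepted only if it is one of the path.repo entries or lies below one. This is an illustrative approximation, assuming absolute paths; is_allowed_location is a hypothetical name, not an Elasticsearch command.

```shell
#!/bin/sh
# is_allowed_location LOCATION ALLOWED_ROOT...
# Returns 0 if LOCATION equals one of the allowed roots or is inside one.
# (Hypothetical helper that approximates the path.repo check.)
is_allowed_location() {
  loc="$1"
  shift
  for root in "$@"; do
    # Appending "/" makes /data/es_backup01 NOT match the root /data/es_backup.
    case "$loc/" in
      "$root"/*) return 0 ;;
    esac
  done
  return 1
}

# Example:
# is_allowed_location /data/es_backup /data/es_backup /data/es_backup01 && echo allowed
```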

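Restart the service

After changing the configuration, you must restart the Elasticsearch service for it to take effect.

# Example for a systemd-managed service
systemctl restart elasticsearch

or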

service elasticsearch restart

Register the repository

curl -X PUT "localhost:9200/_snapshot/my_backup" -H 'Content-Type: application/json' -d'
{
  "type": "fs",
  "settings": {
    "location": "/data/es_backup",
    "compress": true
  }
}'

my_backup is the repository name
location is the actual path on your server (ES needs write permission)

Create a snapshot

curl -X PUT "localhost:9200/_snapshot/my_backup/snapshot_1?wait_for_completion=true"

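wait_for_completion=true makes the command wait until the snapshot has finished before returning
snapshot_1 is the snapshot name; use the same name when restoring

To back up only specific indices: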

curl -X PUT "localhost:9200/_snapshot/my_backup/snapshot_1?wait_for_completion=true" -H 'Content-Type: application/json' -d'
{
  "indices": "index1,index2"
}'
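
When the index list comes from a script, building the request body in a function avoids hand-editing JSON. A minimal sketch follows; snapshot_body is a hypothetical helper, while indices, ignore_unavailable and include_global_state are options of the create snapshot API. The values chosen for the last two are only an assumption about what a data-only backup might want.

```shell
#!/bin/sh
# snapshot_body INDEX...
# Prints a JSON body for the create snapshot API that covers the given indices.
# (Hypothetical helper; index names must not contain quotes or commas.)
snapshot_body() {
  # Join all arguments with commas inside a subshell so IFS is not changed globally.
  list=$(IFS=,; printf '%s' "$*")
  printf '{"indices":"%s","ignore_unavailable":true,"include_global_state":false}\n' "$list"
}

# Example:
# curl -X PUT "localhost:9200/_snapshot/my_backup/snapshot_1?wait_for_completion=true" \
#   -H 'Content-Type: application/json' -d "$(snapshot_body index1 index2)"
```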

Check snapshot status

curl -X GET "localhost:9200/_snapshot/my_backup/snapshot_1/_status"

The snapshot status REST API shows whether a snapshot has finished, how far it has progressed, and the state of each shard.

Example response:

{
  "snapshots" : [
    {
      "snapshot" : "snapshot_1",
      "repository" : "my_backup",
      "state" : "SUCCESS",
      "shards_stats" : {
        "initializing" : 0,
        "started" : 0,
        "finalizing" : 0,
        "done" : 10,
        "failed" : 0,
        "total" : 10
      },
      "stats" : {
        "incremental" : {
          "file_count" : 20,
          "size_in_bytes" : 12345678
        },
        "processed" : {
          "file_count" : 20,
          "size_in_bytes" : 12345678
        },
        "start_time_in_millis" : 1710000000000,
        "time_in_millis" : 12345
      }
    }
  ]
}

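A state of SUCCESS means the snapshot has completed.
In shards_stats, done and total are the number of completed shards and the total number of shards.
While the snapshot is still running, the started and finalizing fields are non-zero.

View the status of all snapshots

curl -X GET "localhost:9200/_snapshot/my_backup/_all?pretty"

This shows the overall state of every snapshot in the repository, without detailed progress.

Snapshot progress percentage

There is no percentage field, but you can compute progress from done/total. For example:

done: 5, total: 10 → progress 50%
done: 10, total: 10 → progress 100%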
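
The same arithmetic as a small shell function, assuming the done and total values have already been read from shards_stats. snapshot_progress is a hypothetical name; the result is rounded down to a whole percent.

```shell
#!/bin/sh
# snapshot_progress DONE TOTAL
# Prints the integer percentage of completed shards (rounded down).
# (Hypothetical helper for illustration.)
snapshot_progress() {
  done_shards="$1"
  total_shards="$2"
  # Guard against division by zero when no shards are reported yet.
  if [ "$total_shards" -le 0 ]; then
    echo 0
    return 0
  fi
  echo $(( done_shards * 100 / total_shards ))
}

# Example:
# echo "progress: $(snapshot_progress 5 10)%"
```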

Other related commands

List the snapshots in a repository with their state (useful for checking on a snapshot started with wait_for_completion=false):
```
curl -X GET "localhost:9200/_cat/snapshots/my_backup?v"
```
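
A snapshot started with wait_for_completion=false usually needs a polling loop in scripts. The sketch below is one possible shape, not an official tool: wait_for_snapshot and fetch_state are hypothetical names, the state lookup is passed in as a function so the loop does not depend on any JSON tool, and the commented fetch example assumes jq is installed.

```shell
#!/bin/sh
# wait_for_snapshot FETCH_CMD [INTERVAL_SECONDS] [MAX_TRIES]
# Calls FETCH_CMD (a command that prints the snapshot state) until the state is final.
# Returns 0 on SUCCESS, 1 on a failed state or when MAX_TRIES is exhausted.
wait_for_snapshot() {
  fetch_cmd="$1"
  interval="${2:-5}"
  max_tries="${3:-120}"
  tries=0
  while [ "$tries" -lt "$max_tries" ]; do
    state=$("$fetch_cmd")
    case "$state" in
      SUCCESS)
        echo "snapshot finished"
        return 0
        ;;
      FAILED|PARTIAL|INCOMPATIBLE)
        echo "snapshot ended with state $state" >&2
        return 1
        ;;
    esac
    tries=$((tries + 1))
    sleep "$interval"
  done
  echo "gave up after $max_tries checks" >&2
  return 1
}

# One possible fetch command (assumes jq is available):
# fetch_state() {
#   curl -s "localhost:9200/_snapshot/my_backup/snapshot_1" | jq -r '.snapshots[0].state'
# }
# wait_for_snapshot fetch_state 10
```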

Summary

Use /_snapshot/<repository>/<snapshot>/_status to see detailed progress
Watch done and total in shards_stats
Progress = done/total × 100%

Restore

Prepare the repository on the target cluster

The cluster that restores the snapshot needs the same preparation as the cluster that created it. Repeat the steps from the snapshot section on the target cluster:

Create /data/es_backup and give ownership to the elasticsearch user.
Add the directory to path.repo in elasticsearch.yml.
Restart the Elasticsearch service.
Register the my_backup repository with the same location, /data/es_backup.

The snapshot files must be present under that location on the target cluster before you restore.

Restore the snapshot

curl -X POST "localhost:9200/_snapshot/my_backup/snapshot_1/_restore"

my_backup is the repository name
snapshot_1 is the snapshot name set when the snapshot was created; it must match exactly
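
A restore fails if an open index with the same name already exists on the target cluster. One way around that is to restore under a different name with the rename_pattern and rename_replacement options of the restore API. The sketch below only builds the request body; restore_body and the restored_ prefix are hypothetical choices for illustration.

```shell
#!/bin/sh
# restore_body INDICES PREFIX
# Prints a JSON body for the restore API that restores INDICES
# (comma-separated) under new names starting with PREFIX.
# (Hypothetical helper for illustration.)
restore_body() {
  indices="$1"
  prefix="$2"
  # "$1" inside the single-quoted format is a literal capture-group reference for
  # Elasticsearch, not a shell variable.
  printf '{"indices":"%s","rename_pattern":"(.+)","rename_replacement":"%s$1"}\n' "$indices" "$prefix"
}

# Example:
# curl -X POST "localhost:9200/_snapshot/my_backup/snapshot_1/_restore" \
#   -H 'Content-Type: application/json' -d "$(restore_body index1 restored_)"
```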

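Import and export with elasticdump

Elasticdump is a tool for migrating and backing up Elasticsearch data. It imports and exports indices, including their settings, mappings, and data.

Order: import settings first, then mapping, and data last.
Version compatibility: make sure the source and target Elasticsearch versions are compatible.
Performance: a larger --limit improves throughput but uses more memory; when network latency is high, a smaller batch size may be more stable.
Security: do not expose passwords directly on the command line; consider environment variables or a configuration file.
Large indices: for very large indices, consider splitting the work or using Elasticsearch snapshots instead.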

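Install Elasticdump

Prerequisites

Node.js installed (v10+ recommended)
npm or yarn package manager

Installation

Global install:

npm install elasticdump -g

Local install (run it as npx elasticdump from the project directory):

npm install elasticdump

Install with yarn:

yarn global add elasticdump

Verify the installation

elasticdump --help

If the help text is printed, the installation succeeded.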
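
In a backup script the same check can be done up front so the script stops early when a tool is missing. A minimal sketch; require_cmd is a hypothetical helper name.

```shell
#!/bin/sh
# require_cmd NAME
# Returns 0 if NAME is found on PATH, otherwise prints a message and returns 1.
# (Hypothetical helper for illustration.)
require_cmd() {
  if command -v "$1" >/dev/null 2>&1; then
    return 0
  fi
  echo "required command not found: $1" >&2
  return 1
}

# Example:
# require_cmd node && require_cmd elasticdump || exit 1
```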

Basic command structure

Export command format:

elasticdump \
  --input=http://source.es.com:9200/my_index \
  --output=/path/to/my_index_mapping.json \
  --type=mapping

Import command format:

elasticdump \
  --input=/path/to/my_index_mapping.json \
  --output=http://target.es.com:9200/my_index \
  --type=mapping

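Common parameters

Parameter	Description
`--input`	Source (ES URL or file path)
`--output`	Destination (ES URL or file path)
`--type`	What to move: data, mapping, settings, analyzer, alias, template
`--limit`	Number of documents moved per batch (default: 100)
`--size`	Total number of documents to retrieve (default: -1, no limit)
`--transform`	JavaScript snippet that modifies each document through the global doc variable
`--overwrite`	Overwrite the output file if it already exists
`--delete`	Delete documents one by one from the input as they are moved; the source index itself is not deleted
`--headers`	Additional HTTP headers (for example an authentication header)
`--searchBody`	Custom search body (JSON)
`--fileSize`	Split the output into files of this size (for example 10mb)
`--retryAttempts`	Number of automatic retries for a failed request (default: 0)
`--retryDelay`	Delay between retries in milliseconds (default: 5000)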

Export examples

Export index settings

elasticdump \
  --input=http://localhost:9200/my_index \
  --output=my_index_settings.json \
  --type=settings

Export index mapping

elasticdump \
  --input=http://localhost:9200/my_index \
  --output=my_index_mapping.json \
  --type=mapping

Export index data

elasticdump \
  --input=http://localhost:9200/my_index \
  --output=my_index_data.json \
  --type=data \
  --limit=5000

Export documents matching a query

elasticdump \
  --input=http://localhost:9200/my_index \
  --output=query_results.json \
  --searchBody='{"query":{"range":{"timestamp":{"gte":"now-1d/d"}}}}'

Export to multiple files

elasticdump \
  --input=http://localhost:9200/my_index \
  --output=my_index_data_part1.json \
  --fileSize=10mb \
  --type=data

Import examples

Import index settings

elasticdump \
  --input=my_index_settings.json \
  --output=http://new_host:9200/my_index \
  --type=settings

Import index mapping

elasticdump \
  --input=my_index_mapping.json \
  --output=http://new_host:9200/my_index \
  --type=mapping

Import index data

elasticdump \
  --input=my_index_data.json \
  --output=http://new_host:9200/my_index \
  --type=data \
  --limit=5000

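Delete the existing index before importing

The --delete option of elasticdump removes documents from the input as they are moved, so it does not clear the target. Delete the target index through the REST API first, then import. Because deleting the index also removes its settings and mapping, re-import those before the data.

curl -X DELETE "http://new_host:9200/my_index"
elasticdump \
  --input=my_index_data.json \
  --output=http://new_host:9200/my_index \
  --type=data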

Advanced usage

Authentication

elasticdump \
  --input=http://user:pass@localhost:9200/my_index \
  --output=my_index_data.json \
  --type=data

Or use the headers parameter:

elasticdump \
  --input=http://localhost:9200/my_index \
  --output=my_index_data.json \
  --headers='{"Authorization": "Basic dXNlcjpwYXNz"}'
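
To keep credentials out of scripts and shell history, the header value can be built from environment variables at run time. This is a minimal sketch: basic_auth_header is a hypothetical helper, and ES_USER and ES_PASS are assumed variable names, not something elasticdump reads by itself. Base64 is an encoding, not encryption, so the header must still be treated as a secret.

```shell
#!/bin/sh
# basic_auth_header
# Prints a JSON headers object with HTTP Basic credentials taken from the
# ES_USER and ES_PASS environment variables (assumed names).
basic_auth_header() {
  # tr removes the trailing newline (and any line wrapping) added by base64.
  token=$(printf '%s:%s' "$ES_USER" "$ES_PASS" | base64 | tr -d '\n')
  printf '{"Authorization": "Basic %s"}\n' "$token"
}

# Example:
# elasticdump \
#   --input=http://localhost:9200/my_index \
#   --output=my_index_data.json \
#   --headers="$(basic_auth_header)"
```

With ES_USER=user and ES_PASS=pass this produces the same header as the literal example above.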

Data transformation

elasticdump \
  --input=http://localhost:9200/my_index \
  --output=transformed_data.json \
  --transform='doc._source.new_field = "value"'

Parallel import/export

# Export once, split into files of about 10 MB each
elasticdump --input=http://localhost:9200/my_index --output=my_index_data.json --fileSize=10mb

# Then import every part in parallel, one background process per file
# (the glob is deliberately loose because part file names depend on the elasticdump version)
for f in my_index_data*; do
  elasticdump --input="$f" --output=http://new_host:9200/my_index &
done
wait

Import script

#!/bin/bash
ES=http://127.0.0.1:9220
ED=/data/es_data
echo "=============================================================="
index="index_name"
echo "=============================================================="

# Types imported, in this order: settings, analyzer, alias, template, mapping, data
echo "elasticdump --input=$ED/$index --output=$ES/$index"

elasticdump --input=${ED}/${index}_setting.json --output=${ES}/${index}  --type=settings  --limit=10000
elasticdump --input=${ED}/${index}_analyzer.json --output=${ES}/${index} --type=analyzer  --limit=10000
elasticdump --input=${ED}/${index}_alias.json --output=${ES}/${index}  --type=alias  --limit=10000
elasticdump --input=${ED}/${index}_template.json --output=${ES}/${index}  --type=template  --limit=10000
elasticdump --input=${ED}/${index}_mapping.json --output=${ES}/${index}  --type=mapping   --limit=10000
elasticdump --input=${ED}/${index}_data.json --output=${ES}/${index} --type=data  --limit=10000
echo "success"

ES: address and port of the target Elasticsearch cluster
ED: local directory that contains the exported JSON files
index: name of the index to import

Export script

#!/bin/bash
ES=http://127.0.0.1:9220
ED=/data/es_data
datename=$(date +%Y-%m-%d)
index=index_qa_yingshi

echo "elasticdump --input=$ES/$index --output=$ED/$index.json"
elasticdump --input=$ES/$index --output=${ED}/${index}_setting.json  --type=settings  --limit=10000
elasticdump --input=$ES/$index --output=${ED}/${index}_analyzer.json --type=analyzer  --limit=10000
elasticdump --input=$ES/$index --output=${ED}/${index}_alias.json  --type=alias  --limit=10000
elasticdump --input=$ES/$index --output=${ED}/${index}_template.json  --type=template  --limit=10000
elasticdump --input=$ES/$index --output=${ED}/${index}_mapping.json  --type=mapping   --limit=10000
elasticdump --input=$ES/$index --output=${ED}/${index}_data.json --type=data  --limit=10000

cd "$ED" || exit 1
# Optional: archive today's export and remove files older than 10 days
#tar -zcvf ${index}_${datename}.tar.gz ${index}_*.json
#find $ED/* -type f -mtime +10 -exec rm {} \;

echo "success"
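
The export script prints success even if one of the elasticdump calls wrote nothing. A quick sanity check before copying the files elsewhere could look like the sketch below. check_dump_files is a hypothetical helper that relies on the file naming used by the scripts in this article; treating an empty file as an error is an assumption and may be too strict for types such as alias when the index has none.

```shell
#!/bin/sh
# check_dump_files DIR INDEX
# Verifies that DIR contains a non-empty INDEX_<type>.json file for each exported type.
# Returns 0 if all six files are present, 1 otherwise.
check_dump_files() {
  dir="$1"
  index="$2"
  missing=0
  for kind in setting analyzer alias template mapping data; do
    file="$dir/${index}_${kind}.json"
    # -s is true only for files that exist and have a size greater than zero.
    if [ ! -s "$file" ]; then
      echo "missing or empty: $file" >&2
      missing=$((missing + 1))
    fi
  done
  [ "$missing" -eq 0 ]
}

# Example:
# check_dump_files /data/es_data index_qa_yingshi && echo "export looks complete"
```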