Velero系列文章(四):使用Velero进行生产迁移实战
2022/12/14 4:24:03
本文主要是介绍Velero系列文章(四):使用Velero进行生产迁移实战,对大家解决编程问题具有一定的参考价值,需要的程序猿们随着小编来一起学习吧!
概述
目的
通过 velero 工具, 实现以下整体目标:
- 特定 namespace 在B A两个集群间做迁移;
具体目标为:
- 在B A集群上创建 velero (包括 restic )
- 备份 B集群 特定 namespace :
caseycui2020
:- 备份resources - 如deployments, configmaps等;
- 备份前, 排除特定
secrets
的yaml.
- 备份前, 排除特定
- 备份volume数据; (通过restic实现)
- 通过"选择性启用" 的方式, 只备份特定的pod volume
- 备份resources - 如deployments, configmaps等;
- 迁移特定 namespace 到 A集群 :
caseycui2020
:- 迁移resources - 通过
include
的方式, 仅迁移特定resources; - 迁移volume数据. (通过restic 实现)
- 迁移resources - 通过
安装
-
在您的本地目录中创建特定于Velero的凭证文件(
credentials-velero
):使用的是xsky的对象存储: (公司的netapp的对象存储不兼容)
[default] aws_access_key_id = xxxxxxxxxxxxxxxxxxxxxxxx aws_secret_access_key = xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
-
(openshift) 需要先创建 namespace :
velero
:oc new-project velero
-
默认情况下,用户维度的openshift namespace 不会在集群中的所有节点上调度Pod。
要在所有节点上计划namespace,需要一个注释:
oc annotate namespace velero openshift.io/node-selector=""
这应该在安装velero之前完成。
-
启动服务器和存储服务。 在Velero目录中,运行:
velero install \ --provider aws \ --plugins velero/velero-plugin-for-aws:v1.0.0 \ --bucket velero \ --secret-file ./credentials-velero \ --use-restic \ --use-volume-snapshots=true \ --backup-location-config region="default",s3ForcePathStyle="true",s3Url="http://glacier.ewhisper.cn",insecureSkipTLSVerify="true",signatureVersion="4" \ --snapshot-location-config region="default"
创建的内容包括:
CustomResourceDefinition/backups.velero.io: attempting to create resource CustomResourceDefinition/backups.velero.io: created CustomResourceDefinition/backupstoragelocations.velero.io: attempting to create resource CustomResourceDefinition/backupstoragelocations.velero.io: created CustomResourceDefinition/deletebackuprequests.velero.io: attempting to create resource CustomResourceDefinition/deletebackuprequests.velero.io: created CustomResourceDefinition/downloadrequests.velero.io: attempting to create resource CustomResourceDefinition/downloadrequests.velero.io: created CustomResourceDefinition/podvolumebackups.velero.io: attempting to create resource CustomResourceDefinition/podvolumebackups.velero.io: created CustomResourceDefinition/podvolumerestores.velero.io: attempting to create resource CustomResourceDefinition/podvolumerestores.velero.io: created CustomResourceDefinition/resticrepositories.velero.io: attempting to create resource CustomResourceDefinition/resticrepositories.velero.io: created CustomResourceDefinition/restores.velero.io: attempting to create resource CustomResourceDefinition/restores.velero.io: created CustomResourceDefinition/schedules.velero.io: attempting to create resource CustomResourceDefinition/schedules.velero.io: created CustomResourceDefinition/serverstatusrequests.velero.io: attempting to create resource CustomResourceDefinition/serverstatusrequests.velero.io: created CustomResourceDefinition/volumesnapshotlocations.velero.io: attempting to create resource CustomResourceDefinition/volumesnapshotlocations.velero.io: created Waiting for resources to be ready in cluster... Namespace/velero: attempting to create resource Namespace/velero: created ClusterRoleBinding/velero: attempting to create resource ClusterRoleBinding/velero: created ServiceAccount/velero: attempting to create resource ServiceAccount/velero: created Secret/cloud-credentials: attempting to create resource Secret/cloud-credentials: created BackupStorageLocation/default: attempting to create resource BackupStorageLocation/default: created VolumeSnapshotLocation/default: attempting to create resource VolumeSnapshotLocation/default: created Deployment/velero: attempting to create resource Deployment/velero: created DaemonSet/restic: attempting to create resource DaemonSet/restic: created Velero is installed! ⛵ Use 'kubectl logs deployment/velero -n velero' to view the status.
-
(openshift) 将
velero
ServiceAccount添加到privileged
SCC:$ oc adm policy add-scc-to-user privileged -z velero -n velero
-
(openshift) 对于OpenShift版本> = 4.1,修改DaemonSet yaml以请求
privileged
模式:@@ -67,3 +67,5 @@ spec: value: /credentials/cloud - name: VELERO_SCRATCH_DIR value: /scratch + securityContext: + privileged: true
或:
oc patch ds/restic \ --namespace velero \ --type json \ -p '[{"op":"add","path":"/spec/template/spec/containers/0/securityContext","value": { "privileged": true}}]'
备份 - B集群
备份集群级别的特定资源
velero backup create <backup-name> --include-cluster-resources=true --include-resources deployments,configmaps
查看备份
velero backup describe YOUR_BACKUP_NAME
备份特定 namespace caseycui2020
排除特定资源
标签为velero.io/exclude-from-backup=true
的资源不包括在备份中,即使它包含匹配的选择器标签也是如此。
通过这种方式, 不需要备份的secret
等资源通过velero.io/exclude-from-backup=true
标签(label)进行排除.
通过这种方式排除的secret
部分示例如下:
builder-dockercfg-jbnzr default-token-lshh8 pipeline-token-xt645
使用restic 备份Pod Volume
🐾 注意:
该 namespace 下, 以下2个pod volume也需要备份, 但是目前还没正式使用:
mycoreapphttptask-callback
mycoreapphttptaskservice-callback
通过 “选择性启用” 的方式进行有选择地备份.
-
对包含要备份的卷的每个Pod运行以下命令:
oc -n caseycui2020 annotate pod/<mybackendapp-pod-name> backup.velero.io/backup-volumes=jmx-exporter-agent,pinpoint-agent,my-mybackendapp-claim oc -n caseycui2020 annotate pod/<elitegetrecservice-pod-name> backup.velero.io/backup-volumes=uploadfile
其中,卷名是容器 spec中卷的名称。
例如,对于以下pod:
apiVersion: v1 kind: Pod metadata: name: sample namespace: foo spec: containers: - image: k8s.gcr.io/test-webserver name: test-webserver volumeMounts: - name: pvc-volume mountPath: /volume-1 - name: emptydir-volume mountPath: /volume-2 volumes: - name: pvc-volume persistentVolumeClaim: claimName: test-volume-claim - name: emptydir-volume emptyDir: {}
你应该运行:
kubectl -n foo annotate pod/sample backup.velero.io/backup-volumes=pvc-volume,emptydir-volume
如果您使用控制器来管理您的pods,则也可以在pod template spec中提供此批注。
备份及验证
备份namespace及其对象, 以及具有相关annotation的pod volume:
# 生产 namespace velero backup create caseycui2020 --include-namespaces caseycui2020
查看备份
velero backup describe YOUR_BACKUP_NAME velero backup logs caseycui2020 oc -n velero get podvolumebackups -l velero.io/backup-name=caseycui2020 -o yaml
describe查看的结果如下:
Name: caseycui2020 Namespace: velero Labels: velero.io/storage-location=default Annotations: velero.io/source-cluster-k8s-gitversion=v1.18.3+2cf11e2 velero.io/source-cluster-k8s-major-version=1 velero.io/source-cluster-k8s-minor-version=18+ Phase: Completed Errors: 0 Warnimybackendapp: 0 Namespaces: Included: caseycui2020 Excluded: <none> Resources: Included: * Excluded: <none> Cluster-scoped: auto Label selector: <none> Storage Location: default Velero-Native Snapshot PVs: auto TTL: 720h0m0s Hooks: <none> Backup Format Version: 1.1.0 Started: 2020-10-21 09:28:16 +0800 CST Completed: 2020-10-21 09:29:17 +0800 CST Expiration: 2020-11-20 09:28:16 +0800 CST Total items to be backed up: 591 Items backed up: 591 Velero-Native Snapshots: <none included> Restic Backups (specify --details for more information): Completed: 3
定期备份
使用基于cron表达式创建定期计划的备份:
velero schedule create caseycui2020-b-daily --schedule="0 3 * * *" --include-namespaces caseycui2020
另外,您可以使用一些非标准的速记cron表达式:
velero schedule create test-daily --schedule="@every 24h" --include-namespaces caseycui2020
有关更多用法示例,请参见cron软件包的文档。
集群迁移 - 到A集群
使用 Backups 和 Restores
只要您将每个Velero实例指向相同的云对象存储位置,Velero就能帮助您将资源从一个群集移植到另一个群集。 此方案假定您的群集由同一云提供商托管。 请注意,Velero本身不支持跨云提供程序迁移持久卷快照。 如果要在云平台之间迁移卷数据,请启用restic,它将在文件系统级别备份卷内容。
-
(集群 B)假设您尚未使用Velero
schedule
操作对数据进行检查点检查,则需要首先备份整个群集(根据需要替换<BACKUP-NAME>
):velero backup create <BACKUP-NAME>
默认备份保留期限以TTL(有效期)表示,为30天(720小时); 您可以使用
--ttl <DURATION>
标志根据需要进行更改。 有关备份到期的更多信息,请参见velero的工作原理。 -
(集群 A)配置
BackupStorageLocations
和VolumeSnapshotLocations
, 指向 集群1 使用的位置, 使用velero backup-location create
和velero snapshot-location create
. 要确保配置BackupStorageLocations
为 read-only, 通过在velero backup-location create
时使用--access-mode=ReadOnly
flag (因为我就只有一个bucket, 所以就没有配置只读这一项). 如下为在A集群安装, 安装时已配置了BackupStorageLocations
和VolumeSnapshotLocations
.velero install \ --provider aws \ --plugins velero/velero-plugin-for-aws:v1.0.0 \ --bucket velero \ --secret-file ./credentials-velero \ --use-restic \ --use-volume-snapshots=true \ --backup-location-config region="default",s3ForcePathStyle="true",s3Url="http://glacier.ewhisper.cn",insecureSkipTLSVerify="true",signatureVersion="4"\ --snapshot-location-config region="default"
-
(集群A)确保已创建Velero Backup对象。 Velero资源与云存储中的备份文件同步。
velero backup describe <BACKUP-NAME>
注意:默认同步间隔为1分钟,因此请确保在检查之前等待。 您可以使用Velero服务器的
--backup-sync-period
标志配置此间隔。 -
(集群A)一旦确认现在存在正确的备份(
<BACKUP-NAME>
),就可以使用以下方法还原所有内容: (因为backup
中只有caseycui2020
一个 namespace , 所以还原是就不需要--include-namespaces caseycui2020
进行过滤)velero restore create --from-backup caseycui2020 --include-resources buildconfigs.build.openshift.io,configmaps,deploymentconfigs.apps.openshift.io,imagestreams.image.openshift.io,imagestreamtags.image.openshift.io,imagetags.image.openshift.io,limitranges,namespaces,networkpolicies.networking.k8s.io,persistentvolumeclaims,prometheusrules.monitoring.coreos.com,resourcequotas,rolebindimybackendapp.authorization.openshift.io,rolebindimybackendapp.rbac.authorization.k8s.io,routes.route.openshift.io,secrets,servicemonitors.monitoring.coreos.com,services,templateinstances.template.openshift.io
因为后面验证
persistentvolumeclaims
的restore
有问题, 所以后续使用的时候先拿掉这个pvc, 后面再想办法解决:velero restore create --from-backup caseycui2020 --include-resources buildconfigs.build.openshift.io,configmaps,deploymentconfigs.apps.openshift.io,imagestreams.image.openshift.io,imagestreamtags.image.openshift.io,imagetags.image.openshift.io,limitranges,namespaces,networkpolicies.networking.k8s.io,persistentvolumeclaims,prometheusrules.monitoring.coreos.com,resourcequotas,rolebindimybackendapp.authorization.openshift.io,rolebindimybackendapp.rbac.authorization.k8s.io,routes.route.openshift.io,secrets,servicemonitors.monitoring.coreos.com,services,templateinstances.template.openshift.io
验证2个集群
检查第二个群集是否按预期运行:
-
(集群A)运行:
velero restore get
结果如下:
NAME BACKUP STATUS STARTED COMPLETED ERRORS WARNImybackendapp CREATED SELECTOR caseycui2020-20201021102342 caseycui2020 Failed <nil> <nil> 0 0 2020-10-21 10:24:14 +0800 CST <none> caseycui2020-20201021103040 caseycui2020 PartiallyFailed <nil> <nil> 46 34 2020-10-21 10:31:12 +0800 CST <none> caseycui2020-20201021105848 caseycui2020 InProgress <nil> <nil> 0 0 2020-10-21 10:59:20 +0800 CST <none>
-
然后运行:
velero restore describe <RESTORE-NAME-FROM-GET-COMMAND> oc -n velero get podvolumerestores -l velero.io/restore-name=YOUR_RESTORE_NAME -o yaml
结果如下:
Name: caseycui2020-20201021102342 Namespace: velero Labels: <none> Annotations: <none> Phase: InProgress Started: <n/a> Completed: <n/a> Backup: caseycui2020 Namespaces: Included: all namespaces found in the backup Excluded: <none> Resources: Included: buildconfigs.build.openshift.io, configmaps, deploymentconfigs.apps.openshift.io, imagestreams.image.openshift.io, imagestreamtags.image.openshift.io, imagetags.image.openshift.io, limitranges, namespaces, networkpolicies.networking.k8s.io, persistentvolumeclaims, prometheusrules.monitoring.coreos.com, resourcequotas, rolebindimybackendapp.authorization.openshift.io, rolebindimybackendapp.rbac.authorization.k8s.io, routes.route.openshift.io, secrets, servicemonitors.monitoring.coreos.com, services, templateinstances.template.openshift.io Excluded: nodes, events, events.events.k8s.io, backups.velero.io, restores.velero.io, resticrepositories.velero.io Cluster-scoped: auto Namespace mappimybackendapp: <none> Label selector: <none> Restore PVs: auto
如果遇到问题,请确保Velero在两个群集中的相同namespace中运行。
我这边碰到问题, 就是openshift里边, 使用了imagestream和imagetag, 然后对应的镜像拉不过来, 容器没有启动.
容器没有启动, podvolume也没有恢复成功.
Name: caseycui2020-20201021110424 Namespace: velero Labels: <none> Annotations: <none> Phase: PartiallyFailed (run 'velero restore logs caseycui2020-20201021110424' for more information) Started: <n/a> Completed: <n/a> Warnimybackendapp: Velero: <none> Cluster: <none> Namespaces: caseycui2020: could not restore, imagetags.image.openshift.io "mybackendapp:1.0.0" already exists. Warning: the in-cluster version is different than the backed-up version. could not restore, imagetags.image.openshift.io "mybackendappno:1.0.0" already exists. Warning: the in-cluster version is different than the backed-up version. ... Errors: Velero: <none> Cluster: <none> Namespaces: caseycui2020: error restoring imagestreams.image.openshift.io/caseycui2020/mybackendapp: ImageStream.image.openshift.io "mybackendapp" is invalid: []: Internal error: imagestreams "mybackendapp" is invalid: spec.tags[latest].from.name: Invalid value: "mybackendapp@sha256:6c5ab553a97c74ad602d2427a326124621c163676df91f7040b035fa64b533c7": error generating tag event: imagestreamimage.image.openshift.io ...... Backup: caseycui2020 Namespaces: Included: all namespaces found in the backup Excluded: <none> Resources: Included: buildconfigs.build.openshift.io, configmaps, deploymentconfigs.apps.openshift.io, imagestreams.image.openshift.io, imagestreamtags.image.openshift.io, imagetags.image.openshift.io, limitranges, namespaces, networkpolicies.networking.k8s.io, persistentvolumeclaims, prometheusrules.monitoring.coreos.com, resourcequotas, rolebindimybackendapp.authorization.openshift.io, rolebindimybackendapp.rbac.authorization.k8s.io, routes.route.openshift.io, secrets, servicemonitors.monitoring.coreos.com, services, templateinstances.template.openshift.io Excluded: nodes, events, events.events.k8s.io, backups.velero.io, restores.velero.io, resticrepositories.velero.io Cluster-scoped: auto Namespace mappimybackendapp: <none> Label selector: <none> Restore PVs: auto
迁移问题总结
目前总结问题如下:
-
imagestreams.image.openshift.io, imagestreamtags.image.openshift.io, imagetags.image.openshift.io
里的镜像没有成功导入; 确切地说是latest
这个tag没有导入成功.imagestreamtags.image.openshift.io
生效也需要时间. -
persistentvolumeclaims
迁移过来后报错, 报错如下:phase: Lost
原因是: A B集群的StorageClass的配置是不同的, 所以B集群的PVC, 在A集群想要直接binding是不可能的. 而且创建后无法直接修改, 需要删掉重新创建.
-
Routes
域名, 有部分域名是特定于A B集群的域名, 如:jenkins-caseycui2020.b.caas.ewhisper.cn
迁移到A集群调整为:jenkins-caseycui2020.a.caas.ewhisper.cn
-
podVolume
数据没有迁移.
latest
这个tag没有导入成功
手动导入, 命令如下: (1.0.1
为ImageStream的最新的版本)
oc tag xxl-job-admin:1.0.1 xxl-job-admin:latest
PVC phase Lost问题
如果手动创建, 需要对PVC yaml进行调整. 调整前后PVC如下:
B集群原YAML:
kind: PersistentVolumeClaim apiVersion: v1 metadata: annotations: pv.kubernetes.io/bind-completed: 'yes' pv.kubernetes.io/bound-by-controller: 'yes' volume.beta.kubernetes.io/storage-provisioner: csi.trident.netapp.io selfLink: /api/v1/namespaces/caseycui2020/persistentvolumeclaims/jenkins resourceVersion: '77304786' name: jenkins uid: ffcabc42-845d-4cdf-8c7c-56e97cb5ea82 creationTimestamp: '2020-10-21T03:05:46Z' managedFields: - manager: kube-controller-manager operation: Update apiVersion: v1 time: '2020-10-21T03:05:46Z' fieldsType: FieldsV1 fieldsV1: 'f:status': 'f:phase': {} - manager: velero-server operation: Update apiVersion: v1 time: '2020-10-21T03:05:46Z' fieldsType: FieldsV1 fieldsV1: 'f:metadata': 'f:annotations': .: {} 'f:pv.kubernetes.io/bind-completed': {} 'f:pv.kubernetes.io/bound-by-controller': {} 'f:volume.beta.kubernetes.io/storage-provisioner': {} 'f:labels': .: {} 'f:app': {} 'f:template': {} 'f:template.openshift.io/template-instance-owner': {} 'f:velero.io/backup-name': {} 'f:velero.io/restore-name': {} 'f:spec': 'f:accessModes': {} 'f:resources': 'f:requests': .: {} 'f:storage': {} 'f:storageClassName': {} 'f:volumeMode': {} 'f:volumeName': {} namespace: caseycui2020 finalizers: - kubernetes.io/pvc-protection labels: app: jenkins-persistent template: jenkins-persistent-monitored template.openshift.io/template-instance-owner: 5a0b28c3-c760-451b-b92f-a781406d9e91 velero.io/backup-name: caseycui2020 velero.io/restore-name: caseycui2020-20201021110424 spec: accessModes: - ReadWriteOnce resources: requests: storage: 5Gi volumeName: pvc-414efafd-8b22-48da-8c20-6025a8e671ca storageClassName: nas-data volumeMode: Filesystem status: phase: Lost
调整后:
kind: PersistentVolumeClaim apiVersion: v1 metadata: name: jenkins namespace: caseycui2020 labels: app: jenkins-persistent template: jenkins-persistent-monitored template.openshift.io/template-instance-owner: 5a0b28c3-c760-451b-b92f-a781406d9e91 spec: accessModes: - ReadWriteOnce resources: requests: storage: 5Gi storageClassName: nas-data volumeMode: Filesystem
podVolume
数据没有迁移
可以手动迁移, 命令如下:
# 登录B集群 # 先把B 集群/opt/prometheus数据拿出来到当前文件夹 oc rsync xxl-job-admin-5-9sgf7:/opt/prometheus . # 上边rsync命令会创建个prometheus的目录 cd prometheus # 登录A集群 # 再把数据拷贝进去(拷贝之前得先确保这个pod启动起来) (可以先把`JAVA_OPTS`删掉) oc rsync ./ xxl-job-admin-2-6k8df:/opt/prometheus/
总结
本文写的比较早, 后面 OpenShift 出了基于 Velero 封装的 OpenShift 专有的迁移工具, 可以直接通过它提供的工具进行迁移.
另外, OpenShift 集群上限制很多, 另外也有很多专属于 OpenShift 的资源, 所以在实际使用上和标准 K8S 的差别还是比较大的, 需要仔细注意.
本次虽然尝试失败, 但是其中的思路还是可供借鉴的.
系列文章
- Velero 系列文章
📚️参考文档
- Velero Docs - Resource filtering
三人行, 必有我师; 知识共享, 天下为公. 本文由东风微鸣技术博客 EWhisper.cn 编写.
这篇关于Velero系列文章(四):使用Velero进行生产迁移实战的文章就介绍到这儿,希望我们推荐的文章对大家有所帮助,也希望大家多多支持为之网!
- 2024-05-31全网首发第二弹!软考2024年5月《软件设计师》真题+解析+答案!(11-20题)
- 2024-05-31全网首发!软考2024年5月《软件设计师》真题+解析+答案!(21-30题)
- 2024-05-30【Java】百万数据excel导出功能如何实现
- 2024-05-30我们小公司,哪像华为一样,用得上IPD(集成产品开发)?
- 2024-05-30java excel上传--poi
- 2024-05-30安装笔记本应用商店的pycharm,再安排pandas等模块,说是没有打包工具?
- 2024-05-29java11新特性
- 2024-05-29哪些无用敏捷指标正在破坏敏捷转型?
- 2024-05-29鸿蒙原生应用再新丁!新华社 入局鸿蒙
- 2024-05-29设计模式 之 迭代器模式(Iterator)