MindRec


Overview

MindRec is MindSpore's high-performance acceleration library for the recommendation domain, providing an efficient, unified training-and-inference solution for recommendation AI models together with workflow guidance. Building on MindSpore's core capabilities such as automatic parallelism and graph-kernel fusion, MindRec adds support for scenarios specific to recommendation: a distributed multi-level feature cache for training and serving TB-scale recommendation models, hash-based dynamic feature representation for runtime feature admission and eviction, and online learning for minute-level real-time model updates. It also ships ready-to-use dataset download and conversion tools, model training examples, and benchmarks of typical models, giving users a one-stop solution.

Directory Structure

└── mindrec
    ├── benchmarks            // Training performance benchmarks for recommendation networks
    ├── datasets              // Dataset download and conversion tools
    ├── docs                  // Tutorials for advanced features
    ├── examples              // Sample code for advanced features
    ├── mindspore_rec         // APIs for recommendation model training
    │   └── train
    ├── models                // Model zoo of typical recommendation models
    │   ├── deep_and_cross
    │   └── wide_deep
    ├── README.md
    ├── build.sh              // Entry script for building and packaging
    └── setup.py

Model Zoo

The continuously growing model zoo provides end-to-end training workflows and usage guides for classic recommendation models. The models can be run directly from the MindRec source tree without building or installing the package. Each model needs a small number of extra Python dependencies; see the requirements.txt in the corresponding model directory.

| Model                    | MindRec Version | CPU | GPU | Ascend | Dataset |
|--------------------------|-----------------|-----|-----|--------|---------|
| Wide&Deep                | >= 0.2          | /   | ✔️  | ✔️     | Criteo  |
| Deep&Cross Network (DCN) | >= 0.2          | /   | ✔️  | /      | Criteo  |

Model scripts and training accuracy are listed in the following table:

| Model                    | Dataset | AUC | MindRec Recipe | Vanilla MindSpore |
|--------------------------|---------|-----|----------------|-------------------|
| Wide&Deep                | Criteo  | 0.8 | link           | link              |
| Deep&Cross Network (DCN) | Criteo  | 0.8 | link           | link              |
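
For example, the per-model Python dependencies can be installed straight from the source tree before launching a training script; the path below follows the directory layout above, and the exact package list lives in each model's requirements.txt:

pip install -r models/wide_deep/requirements.txt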

Build and Install

Before installing MindRec, install MindSpore first; see the MindSpore installation guide for details.

1) Download the source code

git clone https://github.com/mindspore-lab/mindrec.git
cd mindrec

2) Build and install

bash build.sh
pip install output/mindspore_rec-{recommender_version}-py3-none-any.whl
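
As an optional sanity check (assuming the wheel was installed into the current Python environment), the package can be imported directly; RecModel is the training API used in the online learning example below:

python -c "from mindspore_rec import RecModel; print(RecModel)"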

Features

In engineering practice, the recommendation domain faces three main challenges: continuously growing model size, dynamically changing features, and the need for real-time model updates. MindRec provides a solution for each of these scenarios.

I. Large Recommendation Models

As the Wide&Deep model released in 2016 and its many successors show, the size of a recommendation model is dominated by the size of its feature embeddings. With the continued growth of recommendation workloads across the industry, model sizes have quickly exceeded hundreds of GB and even reached the TB level, so a high-performance distributed architecture is required to store, train, and serve large-scale feature embeddings. Depending on model size, MindRec offers three training schemes: single-card training, multi-card hybrid parallelism, and distributed feature cache.

1) Single-card training

Single-card training works the same way as an ordinary neural network: a single GPU or NPU card holds the complete model and runs training or inference. This mode fits models whose size (especially the feature tables) is smaller than the card's memory (e.g. the 32 GB of an NVIDIA V100 GPU), and delivers the best training and inference performance.

2) Multi-card hybrid parallelism

Hybrid parallelism is the distributed version of single-card training, supporting multi-node, multi-card training to further scale model size and training throughput. The feature tables of the model are partitioned across the memory of multiple cards via model parallelism, while the rest of the model is computed and reduced via data parallelism. Hybrid parallelism fits models that exceed the memory of a single card.

3) Distributed feature cache

The distributed feature cache targets extremely large recommendation models (e.g. TB-scale feature embeddings). Built on top of hybrid parallelism, it spreads the feature embeddings onto larger-scale distributed storage through a multi-level cache hierarchy (Device <-> Local Host <-> Remote Host <-> SSD), so the model can grow without increasing the compute scale, allowing a single accelerator card to train TB-scale models.
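
As a rough illustration of how the hybrid-parallel mode is expressed (a minimal sketch built on MindSpore's public auto-parallel and EmbeddingLookup APIs; the table sizes and the row-slice choice below are assumptions for this sketch rather than configuration from this README; the full, model-specific setup lives in docs and examples):

import mindspore.nn as nn
from mindspore import context
from mindspore.context import ParallelMode
from mindspore.communication import init

# Initialize the distributed backend and enable auto parallelism.
init()
context.set_context(mode=context.GRAPH_MODE, device_target="GPU")
context.set_auto_parallel_context(parallel_mode=ParallelMode.AUTO_PARALLEL, full_batch=True)

# Hypothetical table sizes, for illustration only.
vocab_size = 200_000_000
embedding_size = 80

# The embedding table is sliced by rows across the cards (model parallel),
# while the dense part of the network runs data parallel. A device-side
# embedding cache can additionally be enabled via the vocab_cache_size argument.
embedding = nn.EmbeddingLookup(vocab_size=vocab_size,
                               embedding_size=embedding_size,
                               target="DEVICE",
                               slice_mode="table_row_slice",
                               sparse=True)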

II. Dynamic Hash Features

For scenarios where features change over time during training (new features appear and stale ones disappear), feature embeddings are better stored and computed with a hash structure. In MindRec, such a hash table is expressed with the MapParameter data type. The logical data structure and sample code are shown below:

import mindspore as ms
import mindspore.nn as nn
from mindspore import context, Tensor
from mindspore.common.initializer import One
from mindspore.experimental import MapParameter

context.set_context(mode=context.GRAPH_MODE, device_target="GPU")

# Define the network.
class DemoNet(nn.Cell):

    def __init__(self):
        super().__init__()

        self.map = MapParameter(
            name="HashEmbeddingTable",  # The name of this hash table.
            key_dtype=ms.int32,         # The data type of the key.
            value_dtype=ms.float32,     # The data type of the value.
            value_shape=(128,),         # The shape of the value.
            default_value="normal",     # The initializer for default values.
            permit_filter_value=1,      # Threshold (in training steps) for admitting new features.
            evict_filter_value=1000     # Threshold (in training steps) for evicting stale features.
        )

    def construct(self, key, val):

        # Insert a key-val pair.
        self.map[key] = val

        # Lookup a value.
        val2 = self.map[key]

        # Delete a key-val pair.
        self.map.erase(key)

        return val2

# Execute the network.
net = DemoNet()
key = Tensor([1, 2], dtype=ms.int32)
val = Tensor(shape=(2, 128), dtype=ms.float32, init=One())
out = net(key, val)
print(out)
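
Since the key-value pair is inserted and looked up before being erased inside construct, the returned tensor val2 holds the inserted values, i.e. a (2, 128) tensor of ones.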

III. Online Learning

Another key concern of recommendation systems is how to incrementally train and update the model online from real-time user behavior data. The online learning pipeline supported by MindRec is shown below and consists of four stages:

1) Real-time data ingestion: incremental behavior data is written to a data pipeline (e.g. Kafka) in real time.
2) Real-time feature engineering: MindPandas' real-time data processing performs feature engineering and writes the training data to distributed storage.
3) Online incremental training: MindData feeds the incremental training data from distributed storage into MindSpore's online training module, which trains and exports an incremental model.
4) Incremental model update: the incremental model is loaded into the MindSpore inference module to refresh the serving model in real time.

All four stages can be implemented with MindSpore and MindRec ecosystem components in plain Python, without relying on third-party systems. Sample code is shown below (a Kafka service needs to be set up and started in advance); see the online learning tutorial for the detailed steps:

import mindspore.dataset as ds
from mindspore.train.callback import CheckpointConfig, ModelCheckpoint, TimeMonitor
from mindpandas.channel import DataReceiver
from mindspore_rec import RecModel as Model

# `config`, `StreamingDataset`, `GetWideDeepNet` and `callback` are defined in the
# full online learning example; see the tutorial referenced above for their definitions.

# Prepare the realtime dataset.
receiver = DataReceiver(address=config.address,
                        namespace=config.namespace,
                        dataset_name=config.dataset_name, shard_id=0)
stream_dataset = StreamingDataset(receiver)

dataset = ds.GeneratorDataset(stream_dataset, column_names=["id", "weight", "label"])
dataset = dataset.batch(config.batch_size)

# Create the RecModel.
train_net, _ = GetWideDeepNet(config)
train_net.set_train()
model = Model(train_net)

# Configure the policy for model export.
ckpt_config = CheckpointConfig(save_checkpoint_steps=100, keep_checkpoint_max=5)
ckpt_cb = ModelCheckpoint(prefix="train", directory="./ckpt", config=ckpt_config)

# Start the online training process.
model.online_train(dataset,
                   callbacks=[TimeMonitor(1), callback, ckpt_cb],
                   dataset_sink_mode=True)
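
model.online_train keeps consuming batches from the streaming dataset as they arrive; with the checkpoint policy above, an incremental checkpoint is exported every 100 steps (keeping at most the 5 most recent), which is what stage 4 of the pipeline consumes to refresh the serving model.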

Community

I. Governance

See how MindSpore conducts open governance.

Contributing

Contributions are welcome. For more details, please refer to our contributor wiki.

License

Apache License 2.0
